  1. collect_list by preserving order based on another variable

    Oct 5, 2017 · But collect_list doesn't guarantee order even if I sort the input data frame by date before aggregation. Could someone help with how to do the aggregation while preserving the order based on a …
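
    A minimal plain-Python sketch of the fix this thread converges on: collect (date, value) pairs per group, then sort each collected list, so the order no longer depends on row order. In PySpark itself the analogous trick is sorting an array of structs, e.g. `F.sort_array(F.collect_list(F.struct("date", "value")))`; the keys, dates, and values below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (group_key, date, value). Collecting the pairs first and
# sorting afterwards makes the per-group order deterministic, regardless of
# the order the rows arrive in.
rows = [
    ("a", "2017-10-05", 3),
    ("a", "2017-10-01", 1),
    ("b", "2017-10-03", 2),
    ("a", "2017-10-03", 2),
]

def collect_list_ordered(rows):
    groups = defaultdict(list)
    for key, date, value in rows:
        groups[key].append((date, value))
    # Sort each collected list by date, then drop the date component.
    return {key: [v for _, v in sorted(pairs)] for key, pairs in groups.items()}

print(collect_list_ordered(rows))  # {'a': [1, 2, 3], 'b': [2]}
```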

  2. pyspark collect_set or collect_list with groupby - Stack Overflow

    Jun 2, 2016 · Asked 9 years, 9 months ago; modified 6 years, 4 months ago.
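
    The difference this question turns on can be sketched in plain Python: after grouping, collect_list keeps duplicates while collect_set removes them (and, like its Spark counterpart, gives no ordering guarantee). The keys and values below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (user, item).
rows = [("u1", "apple"), ("u1", "pear"), ("u1", "apple"), ("u2", "pear")]

def group_collect(rows, dedupe=False):
    out = defaultdict(list)
    for key, value in rows:
        out[key].append(value)
    # dedupe=True mimics collect_set; sorted() only to make the output stable.
    return {k: (sorted(set(v)) if dedupe else v) for k, v in out.items()}

print(group_collect(rows))               # collect_list-like: duplicates kept
print(group_collect(rows, dedupe=True))  # collect_set-like: duplicates removed
```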

  3. dataframe - Pyspark collect list - Stack Overflow

    Jun 29, 2020 · I am doing a group by over a column in a PySpark dataframe and a collect_list on another column to get all the available values for column_1, as below: Column_1 Column_2 A …

  4. Convert spark DataFrame column to python list - Stack Overflow

    Jul 29, 2016 · A possible solution is using the collect_list() function from pyspark.sql.functions. This will aggregate all column values into a PySpark array that is converted into a Python list when collected:
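
    A plain-Python sketch of what that answer does: pull one column out of a "dataframe" (here just a list of row dicts, made up for illustration) as a Python list. In PySpark the equivalent is roughly `df.agg(F.collect_list("value")).first()[0]`.

```python
# Hypothetical rows, standing in for a small dataframe.
rows = [{"id": 1, "value": 10}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]

def column_to_list(rows, column):
    # Extract a single column's values in row order.
    return [row[column] for row in rows]

print(column_to_list(rows, "value"))  # [10, 20, 30]
```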

  5. python - How to retrieve all columns using pyspark collect_list ...

    Oct 18, 2017 · Asked 8 years, 4 months ago; modified 4 years, 3 months ago; viewed 22k times.

  6. apache spark sql - How to maintain sort order in PySpark collect_list ...

    Nov 8, 2018 · I want to maintain the date sort order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can use them to create a time series model in...

  7. Pyspark - Preserve order of collect list and collect set over multiple ...

    Jul 7, 2020 · All the collect functions (collect_set, collect_list) within Spark are non-deterministic, since the order of the collected results depends on the order of rows in the underlying dataframe, which is again …

  8. Pyspark: Using collect_list over window() with condition

    Asked 5 years, 10 months ago; modified 5 years, 10 months ago.
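
    A plain-Python sketch of a conditional collect: only values satisfying a predicate are gathered per partition key. In PySpark this is commonly written as `F.collect_list(F.when(cond, F.col("value")))` over a window, since collect_list silently drops the NULLs that `when()` produces for non-matching rows. The keys, values, and predicate below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (partition_key, value).
rows = [("a", 5), ("a", -1), ("b", 7), ("a", 3), ("b", -2)]

def collect_where(rows, predicate):
    out = defaultdict(list)
    for key, value in rows:
        # Rows failing the predicate are skipped, mirroring how collect_list
        # ignores the NULLs produced by when() without otherwise().
        if predicate(value):
            out[key].append(value)
    return dict(out)

print(collect_where(rows, lambda v: v > 0))  # {'a': [5, 3], 'b': [7]}
```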

  9. python - How to create nested lists in Pyspark using collect_list over ...

    Apr 5, 2021 · I have a Spark DF I aggregated using collect_list and partitionBy to pull lists of values associated with a grouped set of columns. As a result, for the grouped columns, I now have a new …
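
    The nesting that question describes can be sketched in plain Python: collect values per fine-grained key first, then collect those lists per coarser key, mirroring two successive groupBy/collect_list passes (or collect_list over a window partitioned by the outer columns). The keys and values below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (outer_key, inner_key, value).
rows = [("g1", "x", 1), ("g1", "x", 2), ("g1", "y", 3), ("g2", "x", 4)]

def nested_collect(rows):
    # Pass 1: collect values per (outer, inner) pair.
    inner = defaultdict(list)
    for outer_key, inner_key, value in rows:
        inner[(outer_key, inner_key)].append(value)
    # Pass 2: collect those lists per outer key, producing nested lists.
    outer = defaultdict(list)
    for (outer_key, _), values in inner.items():
        outer[outer_key].append(values)
    return dict(outer)

print(nested_collect(rows))  # {'g1': [[1, 2], [3]], 'g2': [[4]]}
```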

  10. Optimize Pyspark's Collect_List Function - Stack Overflow

    This leads me to believe that the collect_list function is what's causing this problem. Instead of running the tasks in parallel, they run on a single node and run out of memory. What's the optimal way …