I have the following simple example that I can't get to work correctly.

In [1]:

from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import asc, desc, sum, count

sqlContext = SQLContext(sc)

error_schema = StructType([
    StructField('id', IntegerType(), nullable=False),
    StructField('error_code', IntegerType(), nullable=False),
    StructField('error_desc', StringType(), nullable=False)
])

error_data = sc.parallelize([
    Row(1, 1, 'type 1 error'),
    Row(1, 2, 'type 2 error'),
    Row(2, 4, 'type 4 error'),
    Row(2, 3, 'type 3 error'),
    Row(2, 3, 'type 3 error'),
    Row(2, 2, 'type 2 error'),
    Row(2, 1, 'type 1 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 3, 'type 3 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 4, 'type 4 error'),
    Row(3, 5, 'type 5 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 4, 'type 4 error'),
    Row(3, 1, 'type 1 error'),
])

error_df = sqlContext.createDataFrame(error_data, error_schema)
error_df.show()

id_count = error_df.groupBy(error_df["id"]).count().orderBy(desc("count"))
id_count.show()

error_df.groupBy(error_df["id"], error_df["error_code"], error_df["error_desc"]) \
    .count() \
    .orderBy(id_count["id"], desc("count")) \
    .show(20)

+---+----------+------------+
| id|error_code|  error_desc|
+---+----------+------------+
|  1|         1|type 1 error|
|  1|         2|type 2 error|
|  2|         4|type 4 error|
|  2|         3|type 3 error|
|  2|         3|type 3 error|
|  2|         2|type 2 error|
|  2|         1|type 1 error|
|  3|         2|type 2 error|
|  3|         2|type 2 error|
|  3|         2|type 2 error|
|  3|         1|type 1 error|
|  3|         3|type 3 error|
|  3|         1|type 1 error|
|  3|         1|type 1 error|
|  3|         4|type 4 error|
|  3|         5|type 5 error|
|  3|         1|type 1 error|
|  3|         1|type 1 error|
|  3|         2|type 2 error|
|  3|         4|type 4 error|
+---+----------+------------+
only showing top 20 rows

+---+-----+
| id|count|
+---+-----+
|  3|   14|
|  2|    5|
|  1|    2|
+---+-----+

+---+----------+------------+-----+
| id|error_code|  error_desc|count|
+---+----------+------------+-----+
|  1|         1|type 1 error|    1|
|  1|         2|type 2 error|    1|
|  2|         3|type 3 error|    2|
|  2|         2|type 2 error|    1|
|  2|         1|type 1 error|    1|
|  2|         4|type 4 error|    1|
|  3|         1|type 1 error|    6|
|  3|         2|type 2 error|    4|
|  3|         4|type 4 error|    2|
|  3|         3|type 3 error|    1|
|  3|         5|type 5 error|    1|
+---+----------+------------+-----+

What I would like is for that last table to be ordered by the ids with the largest total error count and, within each id, descending by count, so the end result looks like this:

+---+----------+------------+-----+
| id|error_code|  error_desc|count|
+---+----------+------------+-----+
|  3|         1|type 1 error|    6|
|  3|         2|type 2 error|    4|
|  3|         4|type 4 error|    2|
|  3|         3|type 3 error|    1|
|  3|         5|type 5 error|    1|
|  2|         3|type 3 error|    2|
|  2|         2|type 2 error|    1|
|  2|         1|type 1 error|    1|
|  2|         4|type 4 error|    1|
|  1|         1|type 1 error|    1|
|  1|         2|type 2 error|    1|
+---+----------+------------+-----+

Id 3 comes first because it has the highest total error count, id 2 the next highest, and id 1 the least. What is the best way to do this?
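One approach that should produce this ordering (a sketch, not necessarily the best way; the id_total column name is just illustrative, and it assumes Spark 1.4+ where joining on a column name is supported) is to compute the per-id totals separately, join them back onto the grouped counts, and sort on the joined total first:

from pyspark.sql.functions import desc

# counts per (id, error_code, error_desc)
detail_counts = error_df.groupBy("id", "error_code", "error_desc").count()

# per-id totals, renamed so the column does not collide with the per-group count
id_totals = error_df.groupBy("id").count().withColumnRenamed("count", "id_total")

# join the totals back, then sort by total errors per id (desc),
# break ties on id, and finally by count within each id (desc)
result = (detail_counts
          .join(id_totals, "id")
          .orderBy(desc("id_total"), "id", desc("count"))
          .drop("id_total"))
result.show(20)

The extra "id" key in the orderBy only matters if two ids happen to have the same total; dropping id_total at the end leaves the same four columns as the original grouped result.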