I have the following simple example that I can't get to work correctly.
In [1]:
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType,
StringType
from pyspark.sql.functions import asc, desc, sum, count
sqlContext = SQLContext(sc)
error_schema = StructType([
StructField('id', IntegerType(), nullable=False),
StructField('error_code', IntegerType(),
nullable=False),
StructField('error_desc', StringType(),
nullable=False)
])
error_data = sc.parallelize([
Row(1, 1, 'type 1 error'),
Row(1, 2, 'type 2 error'),
Row(2, 4, 'type 4 error'),
Row(2, 3, 'type 3 error'),
Row(2, 3, 'type 3 error'),
Row(2, 2, 'type 2 error'),
Row(2, 1, 'type 1 error'),
Row(3, 2, 'type 2 error'),
Row(3, 2, 'type 2 error'),
Row(3, 2, 'type 2 error'),
Row(3, 1, 'type 1 error'),
Row(3, 3, 'type 3 error'),
Row(3, 1, 'type 1 error'),
Row(3, 1, 'type 1 error'),
Row(3, 4, 'type 4 error'),
Row(3, 5, 'type 5 error'),
Row(3, 1, 'type 1 error'),
Row(3, 1, 'type 1 error'),
Row(3, 2, 'type 2 error'),
Row(3, 4, 'type 4 error'),
Row(3, 1, 'type 1 error'),
])
error_df = sqlContext.createDataFrame(error_data, error_schema)
error_df.show()
id_count =
error_df.groupBy(error_df["id"]).count().orderBy(desc("count"))
id_count.show()
error_df.groupBy(error_df["id"], error_df["error_code"],
error_df["error_desc"]).count().orderBy(id_count["id"],
desc("count")).show(20)
+---+----------+------------+
| id|error_code| error_desc|
+---+----------+------------+
| 1| 1|type 1 error|
| 1| 2|type 2 error|
| 2| 4|type 4 error|
| 2| 3|type 3 error|
| 2| 3|type 3 error|
| 2| 2|type 2 error|
| 2| 1|type 1 error|
| 3| 2|type 2 error|
| 3| 2|type 2 error|
| 3| 2|type 2 error|
| 3| 1|type 1 error|
| 3| 3|type 3 error|
| 3| 1|type 1 error|
| 3| 1|type 1 error|
| 3| 4|type 4 error|
| 3| 5|type 5 error|
| 3| 1|type 1 error|
| 3| 1|type 1 error|
| 3| 2|type 2 error|
| 3| 4|type 4 error|
+---+----------+------------+
only showing top 20 rows
+---+-----+
| id|count|
+---+-----+
| 3| 14|
| 2| 5|
| 1| 2|
+---+-----+
+---+----------+------------+-----+
| id|error_code| error_desc|count|
+---+----------+------------+-----+
| 1| 1|type 1 error| 1|
| 1| 2|type 2 error| 1|
| 2| 3|type 3 error| 2|
| 2| 2|type 2 error| 1|
| 2| 1|type 1 error| 1|
| 2| 4|type 4 error| 1|
| 3| 1|type 1 error| 6|
| 3| 2|type 2 error| 4|
| 3| 4|type 4 error| 2|
| 3| 3|type 3 error| 1|
| 3| 5|type 5 error| 1|
+---+----------+------------+-----+
In []:
What I would like is to end up with that last table ordered by the ids
that have the largest error count and within each id descending by
count. I would like the end result to be like this.
+---+----------+------------+-----+
| id|error_code| error_desc|count|
+---+----------+------------+-----+
| 3| 1|type 1 error| 6|
| 3| 2|type 2 error| 4|
| 3| 4|type 4 error| 2|
| 3| 3|type 3 error| 1|
| 3| 5|type 5 error| 1|
| 2| 3|type 3 error| 2|
| 2| 2|type 2 error| 1|
| 2| 1|type 1 error| 1|
| 2| 4|type 4 error| 1|
| 1| 1|type 1 error| 1|
| 1| 2|type 2 error| 1|
+---+----------+------------+-----+
Because id 3 has the highest error count, id 2 the next highest, 1 the
least error count.
What is the best way to do this?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-orderBy-previous-groupBy-count-orderBy-in-pyspark-tp26864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]