how to orderBy previous groupBy.count.orderBy in pyspark

webe3vt Mon, 02 May 2016 12:10:16 -0700

I have the following simple example that I can't get to work correctly.

In [1]:


from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType,
StringType
from pyspark.sql.functions import asc, desc, sum, count
sqlContext = SQLContext(sc)

error_schema = StructType([
                    StructField('id', IntegerType(), nullable=False),
                    StructField('error_code', IntegerType(),
nullable=False),
                    StructField('error_desc', StringType(),
nullable=False)
                ])
error_data = sc.parallelize([
                                Row(1, 1, 'type 1 error'),
                                Row(1, 2, 'type 2 error'),
                                Row(2, 4, 'type 4 error'),
                                Row(2, 3, 'type 3 error'),
                                Row(2, 3, 'type 3 error'),
                                Row(2, 2, 'type 2 error'),
                                Row(2, 1, 'type 1 error'),
                                Row(3, 2, 'type 2 error'),
                                Row(3, 2, 'type 2 error'),
                                Row(3, 2, 'type 2 error'),
                                Row(3, 1, 'type 1 error'),
                                Row(3, 3, 'type 3 error'),
                                Row(3, 1, 'type 1 error'),
                                Row(3, 1, 'type 1 error'),
                                Row(3, 4, 'type 4 error'),
                                Row(3, 5, 'type 5 error'),
                                Row(3, 1, 'type 1 error'),
                                Row(3, 1, 'type 1 error'),
                                Row(3, 2, 'type 2 error'),
                                Row(3, 4, 'type 4 error'),
                                Row(3, 1, 'type 1 error'),

                            ])
error_df = sqlContext.createDataFrame(error_data, error_schema)
error_df.show()
id_count =
error_df.groupBy(error_df["id"]).count().orderBy(desc("count"))
id_count.show()
error_df.groupBy(error_df["id"], error_df["error_code"],
error_df["error_desc"]).count().orderBy(id_count["id"],
desc("count")).show(20)

+---+----------+------------+
| id|error_code|  error_desc|
+---+----------+------------+
|  1|         1|type 1 error|
|  1|         2|type 2 error|
|  2|         4|type 4 error|
|  2|         3|type 3 error|
|  2|         3|type 3 error|
|  2|         2|type 2 error|
|  2|         1|type 1 error|
|  3|         2|type 2 error|
|  3|         2|type 2 error|
|  3|         2|type 2 error|
|  3|         1|type 1 error|
|  3|         3|type 3 error|
|  3|         1|type 1 error|
|  3|         1|type 1 error|
|  3|         4|type 4 error|
|  3|         5|type 5 error|
|  3|         1|type 1 error|
|  3|         1|type 1 error|
|  3|         2|type 2 error|
|  3|         4|type 4 error|
+---+----------+------------+
only showing top 20 rows

+---+-----+
| id|count|
+---+-----+
|  3|   14|
|  2|    5|
|  1|    2|
+---+-----+

+---+----------+------------+-----+
| id|error_code|  error_desc|count|
+---+----------+------------+-----+
|  1|         1|type 1 error|    1|
|  1|         2|type 2 error|    1|
|  2|         3|type 3 error|    2|
|  2|         2|type 2 error|    1|
|  2|         1|type 1 error|    1|
|  2|         4|type 4 error|    1|
|  3|         1|type 1 error|    6|
|  3|         2|type 2 error|    4|
|  3|         4|type 4 error|    2|
|  3|         3|type 3 error|    1|
|  3|         5|type 5 error|    1|
+---+----------+------------+-----+


In []:

What I would like is to end up with that last table ordered by the ids
that have the largest error count and within each id descending by
count.  I would like the end result to be like this.

+---+----------+------------+-----+
| id|error_code|  error_desc|count|
+---+----------+------------+-----+
|  3|         1|type 1 error|    6|
|  3|         2|type 2 error|    4|
|  3|         4|type 4 error|    2|
|  3|         3|type 3 error|    1|
|  3|         5|type 5 error|    1|
|  2|         3|type 3 error|    2|
|  2|         2|type 2 error|    1|
|  2|         1|type 1 error|    1|
|  2|         4|type 4 error|    1|
|  1|         1|type 1 error|    1|
|  1|         2|type 2 error|    1|
+---+----------+------------+-----+

Because id 3 has the highest error count, id 2 the next highest, 1 the
least error count.

What is the best way to do this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-orderBy-previous-groupBy-count-orderBy-in-pyspark-tp26864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

how to orderBy previous groupBy.count.orderBy in pyspark

Reply via email to