Babak Alipour created SPARK-17788:
-------------------------------------

             Summary: RangePartitioner results in few very large tasks and many 
small to empty tasks 
                 Key: SPARK-17788
                 URL: https://issues.apache.org/jira/browse/SPARK-17788
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.0.0
         Environment: Ubuntu 14.04 64bit
Java 1.8.0_101
            Reporter: Babak Alipour


Greetings everyone,

I was trying to read a single field of a Hive table stored as Parquet in Spark 
(~140GB for the entire table; the field itself is a Double, ~1.4B records) and 
look at the sorted output using the following:
sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
​But this simple line of code gives:
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more 
than 17179869176 bytes

The same error occurs for:
sql("SELECT " + field + " FROM MY_TABLE").sort(field)
and for:
sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)

After some searching, it seems the issue lies in the RangePartitioner trying 
to create equal ranges. [1]

[1] 
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
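
For context, here is a minimal sketch of how a sort keys the data and hands it 
to a RangePartitioner. This is my own illustration, not Spark's internal sort 
code; the partition count of 200 is arbitrary, and I'm assuming spark-shell 
where sql is in scope and field holds the column name:

import org.apache.spark.RangePartitioner

// Read just the field, then key each record by its Double value;
// RangePartitioner samples these keys to choose the partition bounds.
val df = sql("SELECT " + field + " FROM MY_TABLE")
val keyed = df.rdd.map(row => (row.getDouble(0), ()))
val partitioner = new RangePartitioner(200, keyed)
// Records whose keys fall between the same bounds all land in one
// partition, so a dense key range like [0,1] can swamp a single task.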
 

The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
the data, which roughly equates to 1 billion records); other values in the 
dataset run as high as 2000. With the RangePartitioner trying to create equal 
ranges, some tasks end up almost empty while others become extremely large, 
due to the heavily skewed distribution.
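
To make the skew concrete, the following sketch (again my own, under the same 
spark-shell assumptions as above) buckets the column into unit-wide ranges and 
counts the records in each; with the distribution above, the [0,1) bucket 
should hold roughly a billion records while most others stay small:

// Bucket the Double field into unit-wide ranges and count each bucket,
// approximating the key distribution that the RangePartitioner samples.
val hist = sql("SELECT " + field + " FROM MY_TABLE")
  .rdd
  .map(_.getDouble(0))
  .map(v => (math.floor(v).toInt, 1L))
  .reduceByKey(_ + _)
  .collect()
  .sortBy(_._1)
hist.foreach(println)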

This is either a bug in Apache Spark or a major limitation of the framework. I 
hope one of the devs can help solve this issue.

P.S. Email thread on Spark user mailing list:
http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



