[ 
https://issues.apache.org/jira/browse/SPARK-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16191.
-------------------------------
    Resolution: Duplicate

> Code-Generated SpecificColumnarIterator fails for wide pivot with caching
> -------------------------------------------------------------------------
>
>                 Key: SPARK-16191
>                 URL: https://issues.apache.org/jira/browse/SPARK-16191
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Matthew Livesey
>
> When caching a pivot of more than 2260 columns, the SpecificColumnarIterator 
> instance produced by code generation fails to compile with:
> bq. failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator" grows beyond 64 KB
> This can be reproduced in PySpark with the following (it took some trial and 
> error to find that 2261 is the magic number at which the generated class 
> breaks the 64 KB limit):
> {code}
> import pyspark.sql.functions as func
>
> def build_pivot(width):
>     categories = ["cat_%s" % i for i in range(width)]
>     customers = ["cust_%s" % i for i in range(10)]
>     rows = []
>     for cust in customers:
>         for cat in categories:
>             for i in range(4):
>                 rows.append((cust, cat, i, 7.0))
>     rdd = sc.parallelize(rows)
>     df = sqlContext.createDataFrame(rdd, ["customer", "category", "instance", "value"])
>     pivot_value_rows = df.select("category").distinct().orderBy("category").collect()
>     pivot_values = [r.category for r in pivot_value_rows]
>     pivot = df.groupBy("customer").pivot("category", pivot_values).agg(func.sum(df.value)).cache()
>     pivot.write.save('my_pivot', mode='overwrite')
>
> for i in [2260, 2261]:
>     try:
>         build_pivot(i)
>         print "Succeeded for %s" % i
>     except:
>         print "Failed for %s" % i
> {code}
> Removing the `cache()` call avoids the problem and allows wider pivots; since 
> the generated ColumnarIterator is specific to caching, it is not generated 
> when caching is not used.
> This could be symptomatic of a general problem: any generated method can 
> exceed the JVM's 64 KB bytecode limit, so the same failure may occur in other 
> cases as well.
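[Editorial note: the usual mitigation for this class of failure is to split generated code across many small methods so no single method approaches the JVM's 64 KB per-method limit; later Spark releases take this approach in their Janino codegen (e.g. a splitExpressions-style helper). Below is a minimal, hypothetical Python sketch of the idea only; the function names, the chunk size, and the toy "one statement per column" codegen are illustrative assumptions, not Spark's actual generated code.]

```python
# Illustrative sketch: emit one statement per column, either as one huge
# method (which is what hits the 64 KB limit on the JVM) or split into
# small fixed-size helper methods plus a driver that calls them in order.

def gen_flat(num_columns):
    """Generate a single flat function body (one statement per column)."""
    lines = ["def consume_row(row, out):"]
    for i in range(num_columns):
        lines.append("    out[%d] = row[%d] * 2" % (i, i))
    return "\n".join(lines)

def gen_split(num_columns, chunk=100):
    """Generate many small helpers plus a driver, bounding each method's size."""
    lines = []
    starts = list(range(0, num_columns, chunk))
    for s in starts:
        lines.append("def consume_%d(row, out):" % s)
        for i in range(s, min(s + chunk, num_columns)):
            lines.append("    out[%d] = row[%d] * 2" % (i, i))
    # Driver delegates to the helpers; it stays small no matter how wide
    # the pivot is, and each helper handles at most `chunk` columns.
    lines.append("def consume_row(row, out):")
    for s in starts:
        lines.append("    consume_%d(row, out)" % s)
    return "\n".join(lines)

# Both variants compute identical results; only the method layout differs.
ns = {}
exec(gen_split(2261), ns)
row = list(range(2261))
out = [0] * 2261
ns["consume_row"](row, out)
```

CPython has no comparable per-function size limit, so both variants run here; the point is only the code-shape transformation that keeps each generated method small.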



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
