Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-25 Thread Jianneng Li
…can't do all joins this way. Best, Jianneng. From: yeikel valdes; Sent: Tuesday, February 25, 2020 5:48 AM; To: Jianneng Li; Cc: user@spark.apache.org, genie_...@outlook.com; Subject: Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-24 Thread Jianneng Li
…many joins into the same WholeStageCodegen: I have encountered the too-many-joins problem before. Since the joined dataframe was small enough, I converted the join to a UDF operation, which is much faster and didn't cause out-of-memory problems. On Feb 25, 2020, at 10:15, Jianneng Li <jianneng...@workday.com> wrote: …
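The workaround suggested above, replacing a join against a small dataframe with a per-row UDF lookup, can be sketched in plain Python. In Spark this would typically be a broadcast variable consulted inside a UDF; the function name `build_lookup_udf` and the sample tables here are hypothetical, not from the thread:

```python
# Sketch: replace `facts JOIN dims ON facts.dim_id = dims.id`
# with a per-row lookup into an in-memory dict (broadcast-style).
def build_lookup_udf(small_table):
    # small_table: iterable of (key, value) rows; assumed small enough
    # to fit in memory, mirroring a Spark broadcast variable.
    lookup = dict(small_table)

    def udf(key):
        # Returns the joined value, or None for a left-outer-style miss.
        return lookup.get(key)

    return udf

dims = [(1, "US"), (2, "DE")]           # hypothetical small dimension table
facts = [(100, 1), (101, 2), (102, 3)]  # (fact_id, dim_id)

country_of = build_lookup_udf(dims)
joined = [(fid, country_of(d)) for fid, d in facts]
```

Because the lookup runs inside a single projection rather than as a join operator, it avoids adding another join to the generated WholeStageCodegen stage.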

[Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-24 Thread Jianneng Li
Hello everyone, WholeStageCodegen generates code that appends results into a BufferedRowIterator, which keeps the results in an in-memory linked list…
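The memory problem described above stems from generated code appending every produced row into BufferedRowIterator's heap-resident buffer before the consumer drains it. A rough Python sketch of that append-then-drain contract; the class mimics the shape of Spark's Java class but is illustrative, not the actual generated code:

```python
from collections import deque

class BufferedRowIterator:
    # Mimics the append/hasNext/next contract of Spark's
    # BufferedRowIterator: generated code calls append() for each
    # produced row, and the buffer lives entirely on the JVM heap
    # (Spark backs it with a LinkedList of rows).
    def __init__(self):
        self._buffer = deque()

    def append(self, row):
        self._buffer.append(row)

    def has_next(self):
        return bool(self._buffer)

    def next(self):
        return self._buffer.popleft()

it = BufferedRowIterator()
for row in range(5):   # stand-in for the codegen produce loop; packing
    it.append(row)     # many joins into one stage multiplies buffered rows
drained = []
while it.has_next():
    drained.append(it.next())
```

When many joins are fused into one WholeStageCodegen stage, a single produce loop can buffer a large intermediate result here before anything is consumed, which is the memory pressure the thread describes.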

Re: Where does the Driver run?

2019-03-29 Thread Jianneng Li
…requests and call SparkContext accordingly. Best, Jianneng. From: Pat Ferrel; Sent: Thursday, March 28, 2019 10:10 AM; To: Jianneng Li; Cc: user@spark.apache.org, ak...@hacked.work, andrew.m...@gmail.com, and...@actionml.com; Subject: Re: Where does the Driver run?

Re: Where does the Driver run?

2019-03-28 Thread Jianneng Li
Hi Pat, The driver runs in the same JVM as SparkContext. You didn't go into detail about how you "launch" the job (i.e. how the SparkContext is created), so it's hard for me to guess where the driver is. For reference, we've had success launching Spark programmatically to YARN in cluster mode
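Launching Spark programmatically to YARN in cluster mode, as mentioned above, usually amounts to assembling a `spark-submit` invocation (or using `org.apache.spark.launcher.SparkLauncher`, which does the same under the hood). A minimal sketch that only builds the argument list; the jar path, class name, and helper function are hypothetical:

```python
def build_spark_submit(app_jar, main_class, master="yarn",
                       deploy_mode="cluster", conf=None):
    # In cluster mode the driver runs inside a YARN container,
    # not in the JVM/process that invokes spark-submit.
    cmd = ["spark-submit",
           "--master", master,
           "--deploy-mode", deploy_mode,
           "--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)  # application jar goes last
    return cmd

cmd = build_spark_submit("app.jar", "com.example.Main",
                         conf={"spark.executor.memory": "4g"})
```

With `--deploy-mode client` instead, the driver (and its SparkContext) would run in the launching process itself, which is the distinction the thread turns on.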

Questions about Spark Shuffle and Heap

2015-12-04 Thread Jianneng Li
Hi, On the Spark Configuration page (http://spark.apache.org/docs/1.5.2/configuration.html), the documentation for spark.shuffle.memoryFraction mentions that the fraction is taken from the Java heap. However, the documentation for spark.shuffle.io.preferDirectBufs implies that off-heap memory might…
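For the heap side of the question above: in Spark 1.x's legacy (pre-unified) memory model, the shuffle budget is a fraction of the Java heap, further scaled by a safety fraction, while `spark.shuffle.io.preferDirectBufs` concerns off-heap direct buffers used by network I/O and is accounted separately. A small sketch of the heap-side arithmetic; the default values (0.2 and 0.8) are assumptions based on the 1.x configuration docs:

```python
def shuffle_heap_budget(heap_bytes, memory_fraction=0.2,
                        safety_fraction=0.8):
    # Legacy model: shuffle memory =
    #   heap * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction
    # Both fractions refer to the on-heap executor memory only.
    return int(heap_bytes * memory_fraction * safety_fraction)

budget = shuffle_heap_budget(4 * 1024**3)  # e.g. a 4 GiB executor heap
```

Under these assumed defaults, roughly 16% of the heap is available to shuffle; off-heap direct buffers are on top of, not inside, this figure.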

Record-at-a-time model for Spark Streaming

2014-10-07 Thread Jianneng Li
Hello, I understand that Spark Streaming uses micro-batches to implement streaming, while traditional streaming systems use the record-at-a-time processing model. The performance benefit of the former is throughput, and that of the latter is latency. I'm wondering what it would take to implement record-at-a-time…
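The throughput/latency trade-off described above can be made concrete with two toy processing loops; the function names and data are illustrative only, not Spark APIs:

```python
def micro_batch(stream, batch_size, process_batch):
    # Micro-batch model: buffer records, then process a batch at a time.
    # Latency is bounded below by the batch interval, but per-record
    # overhead is amortized across the batch (higher throughput).
    out, batch = [], []
    for rec in stream:
        batch.append(rec)
        if len(batch) == batch_size:
            out.extend(process_batch(batch))
            batch = []
    if batch:                       # flush the final partial batch
        out.extend(process_batch(batch))
    return out

def record_at_a_time(stream, process_record):
    # Record-at-a-time model: each record flows through immediately,
    # minimizing latency at the cost of per-record scheduling overhead.
    return [process_record(rec) for rec in stream]

stream = [1, 2, 3, 4, 5]
mb = micro_batch(stream, 2, lambda batch: [x * 10 for x in batch])
rt = record_at_a_time(stream, lambda x: x * 10)
```

Both models compute the same results; the difference is when each result becomes available, which is exactly the latency question the message raises.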