Do you need it on disk, or can you just push it to memory? Can you try 
increasing memory or the number of cores? (I know it sounds basic.)
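
For reference, a minimal sketch of what that looks like on the submit side 
(the memory, core, and executor values here are illustrative, not a 
recommendation, and the jar name is hypothetical):

  spark-submit \
    --master yarn \
    --deploy-mode client \
    --executor-memory 15g \
    --executor-cores 4 \
    --num-executors 4 \
    my-streaming-app.jar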

> On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi 
> <diwakar.dhanusk...@gmail.com> wrote:
> 
> Hello, 
> 
> I have 400K JSON messages pulled from Kafka into Spark Streaming using the 
> DirectStream approach. The total size of the 400K messages is around 5 GB, 
> and the Kafka topic has a single partition. I am using 
> sqlContext.read.json(rdd.map(_._2)) inside foreachRDD to convert the RDD 
> into a DataFrame, which takes almost 2.3 minutes. 
> 
> I am running in YARN client mode with executor memory set to 15 GB and 
> executor cores set to 2.
> 
> Caching the RDD before converting it into a DataFrame doesn't change the 
> processing time. Would introducing hash partitions inside foreachRDD help? 
> Or would partitioning the topic and having more than one DirectStream help? 
> How can I approach this situation to reduce the time taken to convert to a 
> DataFrame? 
> 
> Regards, 
> Diwakar. 
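
On the partitioning question: with a single-partition topic, the direct 
stream gives you a one-partition RDD, so read.json parses all 5 GB on a 
single core no matter how large the executor is. Repartitioning the RDD 
before parsing (rather than repartitioning the topic) is usually the cheaper 
fix. A rough Scala sketch of that, assuming Spark 1.6-era APIs; the broker 
address, topic name, batch interval, and partition count are all 
placeholders:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("json-direct-stream")
  val ssc = new StreamingContext(conf, Seconds(30))   // placeholder batch interval
  val sqlContext = new SQLContext(ssc.sparkContext)

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("myTopic"))                 // placeholder topic

  stream.foreachRDD { rdd =>
    // A single-partition topic yields a single-partition RDD, so parsing
    // would otherwise run on one core. Spreading the records first lets
    // every executor core share the JSON parsing work.
    val json = rdd.map(_._2).repartition(8)           // partition count is illustrative
    val df = sqlContext.read.json(json)
    df.count()                                        // stand-in for real processing
  }

  ssc.start()
  ssc.awaitTermination()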


