Do you need it on disk, or can you just keep it in memory? Can you try increasing memory or the # of cores? (I know it sounds basic.)
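One more thought: since the topic has a single partition, the RDD produced by the direct stream also has a single partition, so the JSON parse runs on one core no matter how many you allocate. Repartitioning before the parse should spread the work out. A rough sketch, assuming Spark 1.6 with SQLContext and a DStream[(String, String)] named `stream` from createDirectStream (the partition count 8 is just an example value to tune):

    import org.apache.spark.sql.SQLContext

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        // Spread the messages across partitions so the JSON parse uses
        // every available core instead of the topic's single partition.
        val json = rdd.map(_._2).repartition(8)
        val df = sqlContext.read.json(json)
        df.count() // replace with your real action / write
      }
    }

The repartition adds a shuffle, so it only pays off when the parse itself dominates, which a 2.3-minute parse of 5G on 2 cores suggests it does here.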
> On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi
> <diwakar.dhanusk...@gmail.com> wrote:
>
> Hello,
>
> I have 400K JSON messages pulled from Kafka into Spark Streaming using the
> DirectStream approach. The total size of the 400K messages is around 5G, and
> the Kafka topic has a single partition. I am using rdd.read.json(_._2) inside
> foreachRDD to convert the RDD into a dataframe, and it takes almost 2.3
> minutes.
>
> I am running in YARN client mode with 15G of executor memory and 2 executor
> cores.
>
> Caching the RDD before converting it into a dataframe doesn't change the
> processing time. Will introducing hash partitioning inside foreachRDD help?
> Or will partitioning the topic and having more than one DirectStream help?
> How can I approach this situation to reduce the time spent converting to a
> dataframe?
>
> Regards,
> Diwakar.
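On the last question quoted above: with the direct approach each Kafka partition maps 1:1 to an RDD partition, so partitioning the topic itself would give parallel parsing from a single DirectStream, without the shuffle that repartition() adds. A sketch under the same Spark 1.6 / Kafka 0.8 direct-API assumptions (the broker address and topic name are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30)) // sc: your existing SparkContext
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    // One direct stream is enough: each partition of the (now multi-partition)
    // topic becomes its own RDD partition, so the parse parallelizes for free.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic")) // placeholder topic name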