Okay, got it about the parallelism you are describing. Yes, we use a DataFrame to infer the schema and process the data. The JSON schema has XML data as one of the key-value pairs; the XML data needs to be processed in foreachRDD. The JSON schema doesn't change often.
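Since the JSON schema is stable and one field carries an XML payload, the per-record work inside foreachRDD mostly reduces to a JSON decode plus an XML parse. A minimal standalone sketch of that per-message step, assuming a hypothetical field name `payload` and XML layout (the actual keys and elements are not shown in this thread):

```python
import json
import xml.etree.ElementTree as ET

def parse_message(raw):
    """Decode one Kafka message: a JSON envelope with an XML payload field.

    The field name 'payload' and the XML layout are assumptions --
    substitute the real key and element names from your schema.
    """
    record = json.loads(raw)
    root = ET.fromstring(record["payload"])
    # Pull the XML fields of interest into a flat dict alongside
    # the plain JSON keys.
    return {
        "id": record["id"],
        "items": [item.get("name") for item in root.findall("item")],
    }

# Example message of the assumed shape.
msg = '{"id": 1, "payload": "<order><item name=\\"a\\"/><item name=\\"b\\"/></order>"}'
print(parse_message(msg))  # -> {'id': 1, 'items': ['a', 'b']}
```

Because this function is pure per-record logic, it can be mapped over each RDD inside foreachRDD without any shuffle.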
Regards,
Diwakar.

Sent from Samsung Mobile.

-------- Original message --------
From: Cody Koeninger <c...@koeninger.org>
Date: 19/07/2016 20:49 (GMT+05:30)
To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Cc: Martin Eden <martineden...@gmail.com>, user <user@spark.apache.org>
Subject: Re: Spark streaming takes longer time to read json into dataframes

Yes, if you need more parallelism, you need to either add more Kafka
partitions or shuffle in Spark.

Do you actually need the DataFrame API, or are you just using it as a way
to infer the JSON schema? Inferring the schema is going to require reading
through the RDD once before doing any other work. You may be better off
defining your schema in advance.

On Sun, Jul 17, 2016 at 9:33 PM, Diwakar Dhanuskodi
<diwakar.dhanusk...@gmail.com> wrote:
> Hi,
>
> Repartition would create a shuffle over the network, which I should avoid
> to reduce processing time, because the messages in a batch can be up to
> 5G in size. Partitioning the topic and parallelizing receiving in the
> Direct Stream might do the trick.
>
> Sent from Samsung Mobile.
>
> -------- Original message --------
> From: Martin Eden <martineden...@gmail.com>
> Date: 16/07/2016 14:01 (GMT+05:30)
> To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: Spark streaming takes longer time to read json into dataframes
>
> Hi,
>
> I would just do a repartition on the initial direct DStream, since otherwise
> each RDD in the stream has exactly as many partitions as you have partitions
> in the Kafka topic (in your case 1). That way receiving is still done in
> only one thread, but at least the processing further down is done in parallel.
>
> If you want to parallelize your receiving as well, I would partition the
> Kafka topic; the RDDs in the initial DStream will then have as many
> partitions as you set in Kafka.
>
> Have you seen this?
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>
> M
>
> On Sat, Jul 16, 2016 at 5:26 AM, Diwakar Dhanuskodi
> <diwakar.dhanusk...@gmail.com> wrote:
>>
>> ---------- Forwarded message ----------
>> From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
>> Date: Sat, Jul 16, 2016 at 9:30 AM
>> Subject: Re: Spark streaming takes longer time to read json into dataframes
>> To: Jean Georges Perrin <j...@jgp.net>
>>
>> Hello,
>>
>> I need it in memory. I increased executor memory to 25G and executor cores
>> to 3, and got the same result. There is always only one task running under
>> the executor for rdd.read.json() because the RDD partition count is 1. Is
>> doing hash partitioning inside foreachRDD a good approach?
>>
>> Regards,
>> Diwakar.
>>
>> On Sat, Jul 16, 2016 at 9:20 AM, Jean Georges Perrin <j...@jgp.net> wrote:
>>>
>>> Do you need it on disk, or just pushed to memory? Can you try to increase
>>> memory or the number of cores? (I know it sounds basic.)
>>>
>>> > On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi
>>> > <diwakar.dhanusk...@gmail.com> wrote:
>>> >
>>> > Hello,
>>> >
>>> > I have 400K JSON messages pulled from Kafka into Spark Streaming using
>>> > the DirectStream approach. The total size of the 400K messages is
>>> > around 5G, and the Kafka topic has a single partition. I am using
>>> > rdd.read.json(_._2) inside foreachRDD to convert the RDD into a
>>> > dataframe. It takes almost 2.3 minutes to convert it into a dataframe.
>>> >
>>> > I am running in YARN client mode with executor memory set to 15G and
>>> > executor cores set to 2.
>>> >
>>> > Caching the RDD before converting it into a dataframe doesn't change
>>> > the processing time. Would introducing hash partitions inside
>>> > foreachRDD help? Or would partitioning the topic and having more than
>>> > one DirectStream help? How can I approach this situation to reduce the
>>> > time spent converting to a dataframe?
>>> >
>>> > Regards,
>>> > Diwakar.
>>>
>>