Re: How to read JSON data from Kafka and store to HDFS with Spark structured streaming?

2018-07-27 Thread dddaaa
No, I just made sure I'm not doing it. I changed the path in .start() to another path and the same still occurs.

Re: How to read JSON data from Kafka and store to HDFS with Spark structured streaming?

2018-07-27 Thread Arbab Khalil
Why are you reading from Kafka as a batch and writing it as a stream?

Re: How to read JSON data from Kafka and store to HDFS with Spark structured streaming?

2018-07-27 Thread dddaaa
This is a mistake in the code snippet I posted. The right code that is actually running and producing the error is:

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka_broker") \
        .option("subscribe", "test_hdfs3") \
        .load()
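For anyone following along, a minimal sketch of the full pipeline this thread is after: read JSON records from Kafka and write them to HDFS as Parquet. Only the broker placeholder and topic name come from the thread; the JSON schema fields, output path, and checkpoint path are illustrative assumptions.

```python
# Hedged sketch of the whole pipeline: Kafka JSON in, Parquet on HDFS out.
# Schema fields and the hdfs:// paths below are assumptions for illustration.
def build_hdfs_sink_query(spark, brokers="kafka_broker",
                          topic="test_hdfs3",
                          out_path="hdfs:///data/test_hdfs3",
                          checkpoint="hdfs:///checkpoints/test_hdfs3"):
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType

    # Assumed payload shape; replace with the real record schema.
    schema = StructType().add("id", StringType()).add("body", StringType())

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", brokers)
          .option("subscribe", topic)
          .load())

    # Kafka delivers the payload as binary in the `value` column.
    parsed = (df.selectExpr("CAST(value AS STRING) AS json")
                .select(from_json(col("json"), schema).alias("rec"))
                .select("rec.*"))

    # A checkpointLocation is mandatory for a file sink.
    return (parsed.writeStream
            .format("parquet")
            .option("path", out_path)
            .option("checkpointLocation", checkpoint)
            .start())
```

Call it with an active SparkSession; the returned query object blocks only if you call awaitTermination() on it.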

Question of spark streaming

2018-07-27 Thread utkarsh rathor
I am following the book *Spark: The Definitive Guide*. The following code is executed locally using spark-shell. Procedure: started the spark-shell without any other options.

    val static = spark.read.json("/part-00079-tid-730451297822678341-1dda7027-2071-4d73-a0e2-7fb6a91e1d1f-0-c000.json")
    val da

Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread dineshdharme
I am trying to do a few (union + reduceByKey) operations on a hierarchical dataset in an iterative fashion with RDDs. The first few loops run fine, but on the subsequent loops the operations end up using the whole scratch space provided. I have set the spark scratch directory, i.e. SPARK_LOCAL_DI
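A common mitigation for this shape of job (a sketch, not a guaranteed fix) is to periodically checkpoint the accumulated RDD: truncating the lineage lets the old intermediate RDDs go out of scope so Spark's context cleaner can reclaim their shuffle files. The `step` function and iteration counts below are hypothetical placeholders.

```python
# Hedged sketch: iterative union + reduceByKey with periodic checkpointing
# to truncate RDD lineage. Assumes sc.setCheckpointDir(...) was called once
# on the SparkContext beforehand. `step(rdd)` is a hypothetical function
# that produces the next delta RDD of (key, value) pairs.
def iterate_with_checkpoint(base_rdd, step, n_iters=10, every=3):
    acc = base_rdd
    for i in range(n_iters):
        acc = acc.union(step(acc)).reduceByKey(lambda a, b: a + b)
        if (i + 1) % every == 0:
            acc = acc.cache()
            acc.checkpoint()   # cut the lineage at this point
            acc.count()        # force materialization so the cut takes effect
    return acc
```

`rdd.localCheckpoint()` is a cheaper alternative when fault tolerance of the truncated lineage is acceptable to lose.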

Re: Question of spark streaming

2018-07-27 Thread Arun Mahadevan
“activityQuery.awaitTermination()” is a blocking call. You can just skip this line and run other commands in the same shell to query the stream. Running the query from a different shell won’t help, since the memory sink where the results are stored is not shared between the two shells.
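The book's examples are in Scala; a hedged PySpark equivalent of the same idea looks like this. The query name "activity_counts" and the polling loop are illustrative assumptions.

```python
# Hedged sketch: start a memory-sink query and poll its table from the same
# session instead of blocking on awaitTermination(). The query/table name
# is illustrative.
def start_and_poll(spark, streaming_df, n_polls=5):
    import time
    query = (streaming_df.writeStream
             .queryName("activity_counts")   # also the in-memory table name
             .format("memory")
             .outputMode("complete")
             .start())
    for _ in range(n_polls):
        # The memory sink lives inside this SparkSession only, which is
        # why a second spark-shell cannot see the table.
        spark.sql("SELECT * FROM activity_counts").show()
        time.sleep(1)
    return query
```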

Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Vadim Semenov
`spark.worker.cleanup.enabled=true` doesn't work for YARN.

Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Dinesh Dharme
Yeah, you are right. I ran the experiments locally, not on YARN.

Re: How to read JSON data from Kafka and store to HDFS with Spark structured streaming?

2018-07-27 Thread Arbab Khalil
Please try adding another option to set the starting offsets. I have done the same thing many times with different versions of Spark that support structured streaming. The other possibility I see is that the problem is at write time. Can you please confirm the data by calling the printSchema function after
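The two suggestions above can be combined into a small debugging sketch: set `startingOffsets` explicitly, print the schema, and write a few batches to the console before pointing the sink at HDFS. The broker and topic values mirror the thread's placeholders.

```python
# Hedged debugging sketch: force reading from the earliest offsets and
# verify the schema and data flow on the console before debugging the
# HDFS sink itself. Broker/topic values are the thread's placeholders.
def debug_kafka_stream(spark, brokers="kafka_broker", topic="test_hdfs3"):
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", brokers)
          .option("subscribe", topic)
          .option("startingOffsets", "earliest")  # don't wait for new data
          .load())
    df.printSchema()  # expect key, value, topic, partition, offset, ...
    # Console sink: confirms data is flowing at all, independent of HDFS.
    return (df.writeStream
            .format("console")
            .option("truncate", "false")
            .start())
```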

How to Create one DB connection per executor and close it after the job is done?

2018-07-27 Thread kant kodali
Hi All, I understand creating a connection per partition with foreachPartition, but I am wondering: can I create one DB connection per executor and close it after the job is done? Any sample code would help. You can imagine I am running a simple batch-processing application. Thanks!
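Spark has no public per-executor lifecycle hook, but a common pattern is a module-level lazy singleton: each executor worker process initializes the connection once on first use and reuses it across partitions, closing it when the process exits. A minimal sketch, assuming a hypothetical `make_connection` factory standing in for a real driver call such as `psycopg2.connect(...)`:

```python
import atexit

# Hedged sketch: one connection per executor process via a module-level
# lazy singleton. `make_connection` is a hypothetical factory; swap in
# your real DB driver's connect call.
_conn = None

def get_connection(make_connection):
    global _conn
    if _conn is None:
        _conn = make_connection()
        # Close the connection when this worker process shuts down.
        atexit.register(lambda: getattr(_conn, "close", lambda: None)())
    return _conn
```

Usage: inside `rdd.foreachPartition(lambda rows: ...)`, call `get_connection(factory)`; every partition handled by the same executor process then shares one connection instead of opening a new one per partition.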