Text processing in Spark (Spark job gets stuck for several minutes)

2017-10-25 Thread Donni Khan
Hi, I'm applying preprocessing methods to a large text dataset using Spark with Java. I created my own NLP pipeline as normal Java code and call it in the map function like this: MyRDD.map(call nlp pipeline for each row). I run my job on a cluster of 14 machines (32 cores and about 140 GB each). The job
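A common cause of long stalls with this pattern is constructing the NLP pipeline once per record inside `map`. A minimal sketch of the usual fix, using `mapPartitions` so the pipeline is built once per partition — note that `NlpPipeline` and `annotate` are hypothetical stand-ins for the poster's own code, not a real library API:

```scala
// Sketch only: NlpPipeline and annotate() are placeholders for the
// poster's own NLP code.
val processed = myRDD.mapPartitions { rows =>
  val pipeline = new NlpPipeline()         // built once per partition,
  rows.map(row => pipeline.annotate(row))  // not once per row
}
```

If the pipeline is expensive to construct (models loaded from disk, etc.), this alone can remove minutes of apparent "stuck" time.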

JIRA Ticket 21641

2017-10-25 Thread kant kodali
Hi All, I am just wondering if anyone has had a chance to look at this ticket? https://issues.apache.org/jira/browse/SPARK-21641 I am not expecting it to be resolved quickly, but I would like to know whether this is something that will be implemented or not (since I see no comments in the ticket) Th

Does Spark SQL have timezone support?

2017-10-25 Thread kant kodali
Hi All, Does Spark SQL have timezone support? Thanks, kant
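For reference, Spark SQL does expose timezone handling through built-in functions, and (from Spark 2.2 onward) a session-level timezone setting. A hedged sketch, assuming a DataFrame `df` with a UTC timestamp column `ts_utc` (both names are placeholders):

```scala
import org.apache.spark.sql.functions.from_utc_timestamp

// Convert a UTC timestamp column to a named zone.
val withLocal = df.withColumn(
  "ts_paris", from_utc_timestamp($"ts_utc", "Europe/Paris"))

// Since Spark 2.2, the session timezone is also configurable:
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
```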

Re: spark session jdbc performance

2017-10-25 Thread Gourav Sengupta
Hi Naveen, Can you please copy and paste the lines from your original email again? Perhaps then Lucas can go through it completely and kindly stop thinking that others are responding by assuming things. On the other hand, please let me know how things are going; there is another post on thi

Re: Spark streaming for CEP

2017-10-25 Thread anna stax
Thanks very much Mich, Thomas and Stephan. I will look into it. On Tue, Oct 24, 2017 at 8:02 PM, lucas.g...@gmail.com wrote: > This looks really interesting, thanks for linking! > > Gary Lucas > > On 24 October 2017 at 15:06, Mich Talebzadeh > wrote: > >> Great thanks Steve >> >> Dr Mich Tale

Re: Structured Stream in Spark

2017-10-25 Thread Tathagata Das
Please do not confuse old Spark Streaming (DStreams) with Structured Streaming. Structured Streaming's offset and checkpoint management is far more robust than DStreams. Take a look at my talk - https://spark-summit.org/2017/speakers/tathagata-das/ On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Moha

Re: Dataset API Question

2017-10-25 Thread Wenchen Fan
It's because of a difference in API design. *RDD.checkpoint* returns void, which means it mutates the RDD's state, so you need an *RDD.isCheckpointed* method to check whether the RDD is checkpointed. *Dataset.checkpoint* returns a new Dataset, which means there is no isCheckpointed state in Dataset, and thus
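The contrast described above can be sketched in two lines each (variable names are illustrative, not from the thread):

```scala
// RDD style: checkpoint() mutates the RDD's state in place.
rdd.checkpoint()
rdd.isCheckpointed // true once the RDD has been materialized and saved

// Dataset style: checkpoint() returns a new, already-checkpointed Dataset.
val checkpointed = ds.checkpoint() // eager by default
// `checkpointed` reads from the checkpoint files; `ds` is unchanged.
```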

Re: Structured Stream in Spark

2017-10-25 Thread KhajaAsmath Mohammed
Thanks Subhash. Have you ever used the zero-data-loss approach with streaming? I am a bit worried about data loss when using streaming. https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/ Does structured streaming handle it internally? On Wed,

Re: spark session jdbc performance

2017-10-25 Thread lucas.g...@gmail.com
Are we seeing in the UI that only one partition runs the query? The original poster hasn't replied yet. My assumption is that there's only one executor configured/deployed, but we only know what the OP stated, which wasn't enough to be sure of anything. Why are you suggesting that partiti

Re: spark session jdbc performance

2017-10-25 Thread Gourav Sengupta
Hi Lucas, so if I am assuming things, can you please explain why the UI is showing only one partition to run the query? Regards, Gourav Sengupta On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com wrote: > Gourav, I'm assuming you misread the code. It's 30 partitions, which > isn't a ridic

Re: Structured Stream in Spark

2017-10-25 Thread Subhash Sriram
No problem! Take a look at this: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing Thanks, Subhash On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi Sriram, > > Thanks. This is w
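Following the guide linked above, the checkpoint directory in Structured Streaming is set per query via the `checkpointLocation` option on the stream writer. A minimal sketch (the paths and format are placeholders, not from the thread):

```scala
val query = df.writeStream
  .format("parquet")
  .option("path", "/data/output")                        // placeholder path
  .option("checkpointLocation", "/data/checkpoints/q1")  // placeholder path
  .start()
```

Spark stores its offset and state information under that directory, so a restarted query with the same checkpoint location resumes where it left off.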

Re: Structured Stream in Spark

2017-10-25 Thread KhajaAsmath Mohammed
Hi Sriram, Thanks. This is what I was looking for. One question: where do we need to specify the checkpoint directory in the case of structured streaming? Thanks, Asmath On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram wrote: > Hi Asmath, > > Here is an example of using structured streaming to rea

Re: Structured Stream in Spark

2017-10-25 Thread Subhash Sriram
Hi Asmath, Here is an example of using structured streaming to read from Kafka: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredKafkaWordCount.scala In terms of parsing the JSON, there is a from_json function that you can use.
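The `from_json` approach mentioned above can be sketched as follows. The schema, broker address, and topic name are hypothetical placeholders; the Kafka source and `from_json` itself are standard Spark 2.x APIs:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Hypothetical schema for the incoming JSON messages.
val schema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder broker
  .option("subscribe", "events")                  // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")
```

From there, `parsed.writeStream` can target a Hive-compatible location (e.g. Parquet files under a Hive external table's path).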

Structured Stream in Spark

2017-10-25 Thread KhajaAsmath Mohammed
Hi, Could anyone provide suggestions on how to parse JSON data from Kafka and load it back into Hive? I have read about structured streaming but didn't find any examples. Is there any best practice for reading and parsing it with structured streaming for this use case? Thanks, Asmath

Re: Dataset API Question

2017-10-25 Thread Bernard Jesop
Actually, I realized that keeping the info would not be enough, as I need to locate the checkpoint files in order to delete them :/ 2017-10-25 19:07 GMT+02:00 Bernard Jesop : > As far as I understand, Dataset.rdd is not the same as InternalRDD. > It is just another RDD representation of the same Dataset and

Re: Dataset API Question

2017-10-25 Thread Bernard Jesop
As far as I understand, Dataset.rdd is not the same as the InternalRDD. It is just another RDD representation of the same Dataset, created on demand (lazy val) when Dataset.rdd is called. This totally explains the observed behavior. But how would it be possible to know that a Dataset have

Re: Dataset API Question

2017-10-25 Thread Reynold Xin
It is a bit more than syntactic sugar, but not much more: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533 BTW this basically writes all the data out and then creates a new Dataset to load it back in. On Wed, Oct 25, 2017 at 6:51 AM, Be

Re: spark session jdbc performance

2017-10-25 Thread lucas.g...@gmail.com
Gourav, I'm assuming you misread the code. It's 30 partitions, which isn't a ridiculous value. Maybe you misread the upperBound as the number of partitions? (That would be ridiculous.) Why not use the PK as the partition column? Obviously it depends on the downstream queries. If you're going to be perfo
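For context, the options being debated here belong to Spark's JDBC source, which splits the read into `numPartitions` range queries over `partitionColumn` between `lowerBound` and `upperBound`. A sketch with placeholder connection details and bounds (none of these values are from the original poster's code):

```scala
// Partitioned JDBC read; all values below are illustrative placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "my_table")
  .option("partitionColumn", "id")  // ideally a roughly uniform numeric PK
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "30")
  .load()
```

If any of the four partitioning options is missing, Spark falls back to a single partition — which would match a UI showing one task running the whole query.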

Structured Stream equivalent of reduceByKey

2017-10-25 Thread Piyush Mukati
Hi, we are migrating some jobs from DStream to Structured Streaming. Currently, to handle aggregations, we call map and reduceByKey on each RDD, like rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b)). The final output of each RDD is merged to the sink with support for aggregation at
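In Structured Streaming the closest counterpart to the DStream pattern above is a streaming `groupBy`/aggregation. A hedged sketch, assuming a streaming Dataset `events` with `key` and `value` columns (the `sum` here is a stand-in for the poster's custom `merge`; an arbitrary merge function would instead use `groupByKey` with `reduceGroups` or `mapGroupsWithState`):

```scala
import org.apache.spark.sql.functions._

// DStream era:
//   rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
// Structured Streaming equivalent with a built-in aggregate:
val aggregated = events
  .groupBy($"key")
  .agg(sum($"value").as("total")) // placeholder for the custom merge
```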

Dataset API Question

2017-10-25 Thread Bernard Jesop
Hello everyone, I have a question about checkpointing on Datasets. It seems that in 2.1.0 there is a Dataset.checkpoint(); however, unlike RDD, there is no Dataset.isCheckpointed(). I wonder if Dataset.checkpoint is syntactic sugar for Dataset.rdd.checkpoint. When I do : Dataset.checkpoint; Data

[ANNOUNCE] Apache Spark 2.1.2

2017-10-25 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.2! Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend that all 2.1.x users upgrade to this stable release. To download Apache Spark 2.1.2, visit http://spark.apache.org/downloa