Hi,
I'm applying preprocessing methods to a large text dataset using Spark with Java.
I created my own NLP pipeline as regular Java code and call it in the map
function like this:
MyRDD.map(call the NLP pipeline for each row)
I run my job on a cluster of 14 machines (32 cores and about 140 GB of memory each).
The job
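A minimal sketch of that pattern, assuming a hypothetical NlpPipeline class with an annotate(String): String method standing in for the poster's own Java pipeline (written in Scala here to match the other snippets in this thread); building the pipeline once per partition with mapPartitions avoids re-creating it for every record:

import org.apache.spark.sql.SparkSession

object NlpPreprocessJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nlp-preprocess").getOrCreate()
    val lines = spark.sparkContext.textFile("hdfs:///input/corpus") // assumed input path

    // Build the (expensive) pipeline once per partition instead of once per record.
    val processed = lines.mapPartitions { iter =>
      val pipeline = new NlpPipeline() // hypothetical: your own NLP pipeline class
      iter.map(row => pipeline.annotate(row))
    }

    processed.saveAsTextFile("hdfs:///output/corpus-processed") // assumed output path
    spark.stop()
  }
}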
Hi All,
I am just wondering if anyone had a chance to look at this ticket?
https://issues.apache.org/jira/browse/SPARK-21641
I am not expecting it to be resolved quickly; however, I would like to know
whether this is something that will be implemented or not (since I see no
comments on the ticket).
Th
Hi All,
Does Spark SQL have timezone support?
Thanks,
kant
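For reference, a small sketch of the timezone-related pieces that do exist: the spark.sql.session.timeZone config (Spark 2.2+) and the from_utc_timestamp / to_utc_timestamp functions; the zone IDs below are just examples:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_utc_timestamp, to_utc_timestamp}

val spark = SparkSession.builder().appName("tz-demo").master("local[*]").getOrCreate()

// Session time zone used when rendering/parsing timestamps (available in Spark 2.2+).
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

val df = spark.sql("SELECT CAST('2017-10-25 12:00:00' AS TIMESTAMP) AS ts")
df.select(
  col("ts"),
  to_utc_timestamp(col("ts"), "America/Los_Angeles").as("ts_utc"), // treat ts as Pacific time, convert to UTC
  from_utc_timestamp(col("ts"), "Asia/Kolkata").as("ts_ist")       // treat ts as UTC, render in IST
).show(false)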
Hi Naveen,
Can you please copy and paste the lines from your original email again, and
perhaps then Lucas can go through it completely and kindly stop thinking that
others are responding by assuming things?
On the other hand, please let me know how things are going; there is
another post on thi
Thanks very much Mich, Thomas and Stephan. I will look into it.
On Tue, Oct 24, 2017 at 8:02 PM, lucas.g...@gmail.com
wrote:
> This looks really interesting, thanks for linking!
>
> Gary Lucas
>
> On 24 October 2017 at 15:06, Mich Talebzadeh
> wrote:
>
>> Great thanks Steve
>>
>> Dr Mich Tale
Please do not confuse old Spark Streaming (DStreams) with Structured
Streaming. Structured Streaming's offset and checkpoint management is far
more robust than DStreams.
Take a look at my talk -
https://spark-summit.org/2017/speakers/tathagata-das/
On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Moha
It's because of the different API design.
*RDD.checkpoint* returns void, which means it mutates the RDD's state, so you
need an *RDD.isCheckpointed* method to check whether the RDD is checkpointed.
*Dataset.checkpoint* returns a new Dataset, which means there is no
isCheckpointed state in Dataset, and thus
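A short sketch of that difference, assuming a throwaway checkpoint directory:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").master("local[*]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // assumed scratch path

// RDD API: checkpoint() mutates the RDD's own state, hence the matching isCheckpointed flag.
val rdd = spark.sparkContext.parallelize(1 to 1000)
rdd.checkpoint()
rdd.count()                    // an action materializes the checkpoint
println(rdd.isCheckpointed)    // true

// Dataset API: checkpoint() returns a NEW Dataset backed by the checkpointed data;
// the original Dataset is untouched, which is why there is no Dataset.isCheckpointed.
val ds = spark.range(1000)
val cp = ds.checkpoint()       // eager by default
println(ds.rdd.isCheckpointed) // false: ds.rdd is a lazily built view of the original plan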
Thanks Subhash.
Have you ever used the zero-data-loss concept with streaming? I am a bit
worried about using streaming when it comes to data loss.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Does structured streaming handle it internally?
On Wed,
Are we seeing that the UI is showing only one partition running the query? The
original poster hasn't replied yet.
My assumption is that there's only one executor configured / deployed. But
we only know what the OP stated, which wasn't enough to be sure of anything.
Why are you suggesting that partiti
Hi Lucas,
so if I am assuming things, can you please explain why the UI is showing
only one partition running the query?
Regards,
Gourav Sengupta
On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com
wrote:
> Gourav, I'm assuming you misread the code. It's 30 partitions, which
> isn't a ridic
No problem! Take a look at this:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
Thanks,
Subhash
On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> Hi Sriram,
>
> Thanks. This is w
Hi Sriram,
Thanks. This is what I was looking for.
One question: where do we need to specify the checkpoint directory in the case
of structured streaming?
Thanks,
Asmath
On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram
wrote:
> Hi Asmath,
>
> Here is an example of using structured streaming to rea
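To answer the checkpoint-directory question directly: in Structured Streaming it is set per query, on the writer, via the checkpointLocation option. A minimal sketch (the rate test source needs Spark 2.2+, and the paths are just examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-location-demo").getOrCreate()

// Built-in test source that emits (timestamp, value) rows; stands in for a real source.
val counts = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/my-query") // assumed path; offsets and state go here
  .start()

query.awaitTermination()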
Hi Asmath,
Here is an example of using structured streaming to read from Kafka:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredKafkaWordCount.scala
In terms of parsing the JSON, there is a from_json function that you can
use.
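A hedged sketch of the whole path (Kafka -> from_json -> Parquet files under a Hive table location); the broker, topic, schema fields, and paths are all placeholders, and the Kafka source needs the spark-sql-kafka-0-10 package:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("kafka-json-to-hive").getOrCreate()

// Schema of the JSON payload; the fields here are hypothetical.
val schema = new StructType()
  .add("id", StringType)
  .add("event_time", TimestampType)
  .add("payload", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // assumed broker
  .option("subscribe", "events")                     // assumed topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("json"))
  .select("json.*")

// Writing Parquet files under a table's location is one common way to land the
// data in Hive; the file sink also needs its own checkpointLocation.
val query = parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///warehouse/events")         // assumed table location
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .start()

query.awaitTermination()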
Hi,
Could anyone provide suggestions on how to parse JSON data from Kafka and
load it back into Hive?
I have read about structured streaming but didn't find any examples. Is
there any best practice for reading and parsing it with structured
streaming for this use case?
Thanks,
Asmath
Actually, I realized keeping the info would not be enough, as I need to find
the checkpoint files again in order to delete them :/
2017-10-25 19:07 GMT+02:00 Bernard Jesop :
> As far as I understand, Dataset.rdd is not the same as InternalRDD.
> It is just another RDD representation of the same Dataset and
As far as I understand, Dataset.rdd is not the same as InternalRDD.
It is just another RDD representation of the same Dataset and is created on
demand (lazy val) when Dataset.rdd is called.
This totally explains the observed behavior.
But how would it be possible to know that a Dataset has
It is a bit more than syntactic sugar, but not much more:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533
BTW, this basically writes all the data out and then creates a new
Dataset to load it back in.
On Wed, Oct 25, 2017 at 6:51 AM, Be
Gourav, I'm assuming you misread the code. It's 30 partitions, which isn't
a ridiculous value. Maybe you misread the upperBound for the partitions?
(That would be ridiculous)
Why not use the PK as the partition column? Obviously it depends on the
downstream queries. If you're going to be perfo
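For reference, a sketch of the JDBC read options being discussed (the URL, table, and bounds are placeholders): numPartitions sets the number of parallel reads, while lowerBound/upperBound only describe partitionColumn's value range used to compute each partition's stride; they do not filter rows.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // assumed connection URL
  .option("dbtable", "public.events")                   // assumed table
  .option("user", "spark")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .option("partitionColumn", "id")                      // e.g. a numeric primary key
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "30")
  .load()

println(df.rdd.getNumPartitions) // 30, unless the value range collapses to fewer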
Hi,
We are migrating some jobs from DStreams to Structured Streaming.
Currently, to handle aggregations, we call map and reduceByKey on each RDD,
like:
rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
The final output of each RDD is merged to the sink with support for
aggregation at
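A hedged sketch of one way this maps onto Structured Streaming when the merge is a simple fold such as a sum (the Event case class, Kafka settings, payload format, and paths are placeholders; arbitrary merge logic usually becomes a typed Aggregator or mapGroupsWithState instead):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical event shape standing in for the poster's own types.
case class Event(key: String, count: Long)

val spark = SparkSession.builder().appName("dstream-to-structured").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // assumed broker
  .option("subscribe", "events")                     // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS raw")
  .as[String]
  .map(line => Event(line.split(",")(0), 1L))        // assumed payload format

// Rough counterpart of rdd.map(...).reduceByKey(merge) for an additive merge.
val merged = events.groupBy($"key").agg(sum($"count").as("count"))

merged.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/agg") // assumed path
  .start()
  .awaitTermination()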
Hello everyone,
I have a question about checkpointing on Dataset.
It seems that in 2.1.0 there is a Dataset.checkpoint(); however, unlike RDD,
there is no Dataset.isCheckpointed().
I wonder if Dataset.checkpoint is syntactic sugar for
Dataset.rdd.checkpoint.
When I do:
Dataset.checkpoint; Data
We are happy to announce the availability of Spark 2.1.2!
Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1
maintenance
branch of Spark. We strongly recommend that all 2.1.x users upgrade to this
stable release.
To download Apache Spark 2.1.2 visit http://spark.apache.org/downloa