Having the driver write the data instead of a worker probably won't speed
it up; you still need to copy all of the data to a single node. Is there
something which forces you to write from only a single node?
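
If you do want to try the driver-side write, the rough shape is the
untested sketch below. It reuses your 'unified' stream and the <HDFS path>
placeholder; the other names are made up. Note that collect() still copies
every element back to the driver, so the same single-node bottleneck
applies:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
    public Void call(JavaPairRDD<Integer, String> rdd) throws Exception {
        // collect() moves the entire RDD into the driver's memory
        List<Tuple2<Integer, String>> elements = rdd.collect();
        // write a single file from the driver through the HDFS client API
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(<HDFS path>));
        try {
            for (Tuple2<Integer, String> t : elements) {
                out.writeBytes(t._1() + "\t" + t._2() + "\n");
            }
        } finally {
            out.close();
        }
        return null;
    }
});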

On Friday, September 11, 2015, Luca <luke1...@gmail.com> wrote:

> Hi,
> thanks for answering.
>
> With the *coalesce()* transformation a single worker is in charge of
> writing to HDFS, but I noticed that this single write operation usually
> takes too much time and slows down the whole computation (this is
> particularly true when 'unified' is made of several partitions). Besides,
> 'coalesce' forces me to add a further shuffle (the 'true' flag) in order
> not to lose upstream parallelism (by the way, did I get this part right?).
> Am I wrong in thinking that having the driver do the writing will speed
> things up, without the need to repartition the data?
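>
> Just to be explicit about what I mean, here is a sketch with the same
> names and placeholders as in my code below:
>
> // coalesce(1): no shuffle, so the upstream work in the same stage also
> // runs as a single task
> arg0.coalesce(1).saveAsTextFile(<HDFS path>);
> // coalesce(1, true): inserts a shuffle, so the upstream work keeps its
> // parallelism, but a single task still writes the whole output
> arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);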
>
> Hope I have been clear, I am pretty new to Spark. :)
>
> 2015-09-11 18:19 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:
>
>> A common practice to do this is to use foreachRDD with a local var to
>> accumulate the data (you can see it in the Spark Streaming test code).
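>>
>> Roughly something like the untested sketch below, using your 'unified'
>> stream (the variable names here are made up):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.spark.api.java.JavaPairRDD;
>> import org.apache.spark.api.java.function.Function;
>> import scala.Tuple2;
>>
>> final List<Tuple2<Integer, String>> accumulated =
>>     new ArrayList<Tuple2<Integer, String>>();
>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>     public Void call(JavaPairRDD<Integer, String> rdd) throws Exception {
>>         // this function runs on the driver; collect() pulls each batch
>>         // of data back to it before appending to the local list
>>         accumulated.addAll(rdd.collect());
>>         return null;
>>     }
>> });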
>>
>> That being said, I am a little curious why you want the driver to create
>> the file specifically.
>>
>> On Friday, September 11, 2015, allonsy <luke1...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a JavaPairDStream<Integer, String> object and I'd like the Driver
>>> to create a txt file (on HDFS) containing all of its elements.
>>>
>>> At the moment, I use the coalesce(1, true) method:
>>>
>>>
>>> JavaPairDStream<Integer, String> unified = [partitioned stuff]
>>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>>     public Void call(JavaPairRDD<Integer, String> arg0) throws Exception {
>>>         arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);
>>>         return null;
>>>     }
>>> });
>>>
>>>
>>> but this implies that a single worker is taking all the data and writing
>>> to HDFS, and that could be a major bottleneck.
>>>
>>> How could I replace the worker with the Driver? I read that collect()
>>> might do this, but I haven't the slightest idea of how to implement it.
>>>
>>> Can anybody help me?
>>>
>>> Thanks in advance.
>>>
>>>
>>>
>>>
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau
