Thanks, Vadim. Yes, this is a good option for us.
From: Vadim Semenov
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
So if you just save an RDD to HDFS via `saveAsSequenceFile`, you would have
to create a new RDD that reads that data; this way you avoid recomputing
the RDD but may lose time on saving/loading.
Exactly the same thing happens with `checkpoint`: `checkpoint` is just a
convenient method that gives you the same thing.
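(A minimal sketch of the manual save-and-reload pattern described above; the path, the pair type, and the stand-in pipeline are hypothetical, not from the thread.)
```
import org.apache.spark.SparkContext

val sc: SparkContext
// Stand-in for an expensive pipeline producing (key, value) pairs.
val myrdd = sc.parallelize(1 to 100).map(i => (i.toLong, "v" + i))
// Write once; this runs the pipeline.
myrdd.saveAsSequenceFile("hdfs:///tmp/myrdd-snapshot")
// Read back: this RDD's lineage starts at the HDFS files,
// so later actions never recompute the original pipeline.
val reloaded = sc.sequenceFile[Long, String]("hdfs:///tmp/myrdd-snapshot")
reloaded.count()                       // reads from HDFS
reloaded.filter(_._1 % 2 == 0).count() // reads from HDFS again
```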
On 3 August 2017 at 03:00, Vadim Semenov wrote:
> `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it
> just saves data to some destination.
Yes, that's what I thought, so the statement "...otherwise saving it on
a file will require recomputation." from the book is not entirely correct.
`saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so it
just saves data to some destination.
`cache`/`persist` let you cache data while keeping the DAG, so if an
executor that holds some of the data goes down, Spark is still able to
recompute the missing partitions.
`localCheckpoint` truncates the DAG but stores the data only with Spark's
caching layer on the executors, so you lose fault tolerance in exchange for
not writing to HDFS.
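(A minimal sketch of `localCheckpoint` under the assumptions above; `sc` and the stand-in pipeline are placeholders in the style of the earlier examples.)
```
val rdd = sc.parallelize(1 to 1000000).map(_ * 2) // stand-in for costly work
rdd.localCheckpoint() // mark for lineage truncation via the caching layer
rdd.count()  // materializes the local checkpoint
rdd.sum()    // served from executor storage; the DAG above is truncated,
             // so a lost executor means these partitions cannot be rebuilt
```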
>> "It is strongly recommended that a checkpointed RDD is persisted in
>> memory, otherwise saving it on a file will require recomputation."
>>
>> To me that means checkpoint will not prevent the recomputation that I was
>> hoping for
> ----------
> From: Vadim Semenov
> Sent: Tuesday, August 1, 2017 12:05:17 PM
> To: jeff saremi
> Cc: user@spark.apache.org
> Subject: Re: How can i remove the need for calling cache
>
> You can use `.checkpoint()`:
> ```
> val sc: SparkContext
> sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
> myrdd.checkpoint()
> val result1 = myrdd.map(op1(_))
> …
> ```
On 3 August 2017 at 01:05, jeff saremi wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."
Is this really true? I had the impression that the DAG will not…
To me that means checkpoint will not prevent the recomputation that I was
hoping for
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
…
```
Thanks Mark. I'll examine the status more carefully to observe this.
From: Mark Hamstra
Sent: Tuesday, August 1, 2017 11:25:46 AM
To: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. […]
Thanks Vadim. I'll try that
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
…
```
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint()
val result1 = myrdd.map(op1(_))
result1.count() // Will save `myrdd` to HDFS and do map(op1(_))
val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from HDFS
```
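(Context for the book quote discussed above: a hedged sketch of why persisting before checkpointing matters. Without `persist`, the RDD is computed once for the action and once more when Spark writes the checkpoint files; `myrdd` is as in Vadim's example.)
```
import org.apache.spark.storage.StorageLevel

myrdd.persist(StorageLevel.MEMORY_AND_DISK) // keep a copy after the first job
myrdd.checkpoint()
myrdd.count() // the job runs once; the checkpoint write then reads the
              // cached copy instead of recomputing the whole lineage
```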
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. When running the above code without
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at
least one of them you will likely see that many Stages are marked as
"skipped", which means their results (typically shuffle output) were
already available and did not need to be recomputed.
Hi Jeff, that looks sane to me. Do you have additional details?
On 1 August 2017 at 11:05, jeff saremi wrote:
> Calling cache/persist fails all our jobs (I have posted 2 threads on
> this).
>
> And we're giving up hope in finding a solution.
> So I'd like to find a workaround for that:
>
> If I save an RDD to hdfs and read it back, can I use it in more than one
> operation?
Calling cache/persist fails all our jobs (I have posted 2 threads on this).
And we're giving up hope in finding a solution.
So I'd like to find a workaround for that:
If I save an RDD to hdfs and read it back, can I use it in more than one
operation?
Example: (using cache)
// do a whole bunch …
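(A hedged sketch of the workaround being asked about, since the original example is cut off; the path and the stand-in pipeline are hypothetical.)
```
// Save once, read back, then reuse the reloaded RDD in several operations.
val expensive = sc.parallelize(1 to 1000000).map(_ * 2) // stand-in for costly work
expensive.saveAsObjectFile("hdfs:///tmp/expensive-snapshot")

val reloaded = sc.objectFile[Int]("hdfs:///tmp/expensive-snapshot")
reloaded.count()                // first use: reads the snapshot from HDFS
reloaded.filter(_ > 10).count() // second use: reads from HDFS again,
                                // never recomputing `expensive`
```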