Thanks, Vadim. Yes, this is a good option for us.
From: Vadim Semenov
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
So if you just save an RDD to HDFS via `saveAsSequenceFile`, you would have
to create a new RDD that reads that data; this way you avoid recomputing
the RDD but may lose time on saving/loading.
Exactly the same thing happens with `checkpoint`: `checkpoint` is just a
convenient method that gives you the same thing.
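(A minimal sketch of the manual save-and-reload pattern described above; the path, the pair type, and the stand-in pipeline are hypothetical, not from the thread.)
```
import org.apache.spark.SparkContext

val sc: SparkContext
// Stand-in for an expensive pipeline producing (key, value) pairs.
val myrdd = sc.parallelize(1 to 100).map(i => (i.toLong, "v" + i))
// Write once; this runs the pipeline.
myrdd.saveAsSequenceFile("hdfs:///tmp/myrdd-snapshot")
// Read back: this RDD's lineage starts at the HDFS files,
// so later actions never recompute the original pipeline.
val reloaded = sc.sequenceFile[Long, String]("hdfs:///tmp/myrdd-snapshot")
reloaded.count()                       // reads from HDFS
reloaded.filter(_._1 % 2 == 0).count() // reads from HDFS again
```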
On 3 August 2017 at 03:00, Vadim Semenov wrote:
> `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it
> just saves data to some destination.
Yes, that's what I thought, so the statement "...otherwise saving it on
a file will require recomputation." from the book is not entirely correct.
`saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so it
just saves data to some destination.
`cache`/`persist` let you cache data while keeping the DAG, so if an
executor that holds some of the data goes down, Spark is still able to
recompute the missing partitions.
`localCheckpoint` truncates the DAG but stores the data only with Spark's
caching layer on the executors, so you lose fault tolerance in exchange for
not writing to HDFS.
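(A minimal sketch of `localCheckpoint` under the assumptions above; `sc` and the stand-in pipeline are placeholders in the style of the earlier examples.)
```
val rdd = sc.parallelize(1 to 1000000).map(_ * 2) // stand-in for costly work
rdd.localCheckpoint() // mark for lineage truncation via the caching layer
rdd.count()  // materializes the local checkpoint
rdd.sum()    // served from executor storage; the DAG above is truncated,
             // so a lost executor means these partitions cannot be rebuilt
```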
>> "It is strongly recommended that a checkpointed RDD is persisted in
>> memory, otherwise saving it on a file will require recomputation."
>>
>> To me that means checkpoint will not prevent the recomputation that I was
>> hoping for
> ----------
> From: Vadim Semenov
> Sent: Tuesday, August 1, 2017 12:05:17 PM
> To: jeff saremi
> Cc: user@spark.apache.org
> Subject: Re: How can i remove the need for calling cache
>
> You can use `.checkpoint()`:
> ```
> val sc: SparkContext
> sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
> myrdd.checkpoint()
> val result1 = myrdd.map(op1(_))
> …
> ```
On 3 August 2017 at 01:05, jeff saremi wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."
Is this really true? I had the impression that the DAG will not…
To me that means checkpoint will not prevent the recomputation that I was
hoping for
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
…
```
Thanks Mark. I'll examine the status more carefully to observe this.
From: Mark Hamstra
Sent: Tuesday, August 1, 2017 11:25:46 AM
To: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. […]
Thanks Vadim. I'll try that
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
…
```
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint()
val result1 = myrdd.map(op1(_))
result1.count() // Will save `myrdd` to HDFS and do map(op1(_))
val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from HDFS
```
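(Context for the book quote discussed above: a hedged sketch of why persisting before checkpointing matters. Without `persist`, the RDD is computed once for the action and once more when Spark writes the checkpoint files; `myrdd` is as in Vadim's example.)
```
import org.apache.spark.storage.StorageLevel

myrdd.persist(StorageLevel.MEMORY_AND_DISK) // keep a copy after the first job
myrdd.checkpoint()
myrdd.count() // the job runs once; the checkpoint write then reads the
              // cached copy instead of recomputing the whole lineage
```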
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. When running the above code without
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at
least one of them you will likely see that many Stages are marked as
"skipped", which means their results (typically shuffle output) were
already available and did not need to be recomputed.
Hi Jeff, that looks sane to me. Do you have additional details?
On 1 August 2017 at 11:05, jeff saremi wrote:
> Calling cache/persist fails all our jobs (I have posted 2 threads on
> this).
>
> And we're giving up hope in finding a solution.
> So I'd like to find a workaround for that:
>
> If I save an RDD to hdfs and read it back, can I use it in more than one
> operation?
Calling cache/persist fails all our jobs (I have posted 2 threads on this).
And we're giving up hope in finding a solution.
So I'd like to find a workaround for that:
If I save an RDD to hdfs and read it back, can I use it in more than one
operation?
Example: (using cache)
// do a whole bunch …
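(A hedged sketch of the workaround being asked about, since the original example is cut off; the path and the stand-in pipeline are hypothetical.)
```
// Save once, read back, then reuse the reloaded RDD in several operations.
val expensive = sc.parallelize(1 to 1000000).map(_ * 2) // stand-in for costly work
expensive.saveAsObjectFile("hdfs:///tmp/expensive-snapshot")

val reloaded = sc.objectFile[Int]("hdfs:///tmp/expensive-snapshot")
reloaded.count()                // first use: reads the snapshot from HDFS
reloaded.filter(_ > 10).count() // second use: reads from HDFS again,
                                // never recomputing `expensive`
```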