date:20140602

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas

OK, thanks for confirming. Is there something we can do about that leftover
part- files problem in Spark, or is that for the Hadoop team?


2014년 6월 2일 월요일, Aaron Davidson님이 작성한 메시지:

> Yes.
>
>
> On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> So in summary:
>
>- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by
>default.
>- There is an open JIRA issue to add an option to allow clobbering.
>- Even when clobbering, part- files may be left over from previous
>saves, which is dangerous.
>
> Is this correct?
>
>
> On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson  wrote:
>
> +1 please re-add this feature
>
>
> On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell 
> wrote:
>
> Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
> I accidentally assigned myself way back when I created it). This
> should be an easy fix.
>
> On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu  wrote:
> > Hi, Patrick,
> >
> > I think https://issues.apache.org/jira/browse/SPARK-1677 is talking
> about
> > the same thing?
> >
> > How about assigning it to me?
> >
> > I think I missed the configuration part in my previous commit, though I
> > declared that in the PR description
> >
> > Best,
> >
> > --
> > Nan Zhu
> >
> > On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
> >
> > Hey There,
> >
> > The issue was that the old behavior could cause users to silently
> > overwrite data, which is pretty bad, so to be conservative we decided
> > to enforce the same checks that Hadoop does.
> >
> > This was documented by this JIRA:
> > https://issues.apache.org/jira/browse/SPARK-1100
> >
> https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
> >
> > However, it would be very easy to add an option that allows preserving
> > the old behavior. Is anyone here interested in contributing that? I
> > created a JIRA for it:
> >
> > https://issues.apache.org/jira/browse/SPARK-1993
> >
> > - Patrick
> >
> > On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
> >  wrote:
> >
> > Indeed, the behavior has changed for good or for bad. I mean, I agree
> with
> > the danger you mention but I'm not sure it's happening like that. Isn't
> > there a mechanism for overwrite in Hadoop that automatically removes part
> > files, then writes a _temporary folder and then only the part files along
> > with the _success folder.
> >
> > In any case this change of behavior should be documented IMO.
> >
> > Cheers
> > Pierre
> >
> > Message sent from a mobile device - excuse typos and abbreviations
> >
> > Le 2 juin 2014 à 17:42, Nicholas Chammas  a
> > écrit :
> >
> > What I've found using saveAsTextFile() against S3 (prior to Spark
> 1.0.0.) is
> > that files get overwritten automatically. This is one danger to this
> though.
> > If I save to a directory that already has 20 part- files, but this time
> > around I'm only saving 15 part- files, then there will be 5 leftover
> part-
> > files from the previous set mixed in with the 15 newer files. This is
> > potentially dangerous.
> >
> > I haven't checked to see if this behavior has changed in 1.0.0. Are you
>
>

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans

I'm a bit confused because the PR mentioned by Patrick seems to adress all 
these issues:
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

Was it not accepted? Or is the description of this PR not completely 
implemented?

Message sent from a mobile device - excuse typos and abbreviations

> Le 2 juin 2014 à 23:08, Nicholas Chammas  a écrit 
> :
> 
> OK, thanks for confirming. Is there something we can do about that leftover 
> part- files problem in Spark, or is that for the Hadoop team?
> 
> 
> 2014년 6월 2일 월요일, Aaron Davidson님이 작성한 메시지:
>> Yes.
>> 
>> 
>> On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas 
>>  wrote:
>> So in summary:
>> As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
>> There is an open JIRA issue to add an option to allow clobbering.
>> Even when clobbering, part- files may be left over from previous saves, 
>> which is dangerous.
>> Is this correct?
>> 
>> 
>> On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson  wrote:
>> +1 please re-add this feature
>> 
>> 
>> On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell  wrote:
>> Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
>> I accidentally assigned myself way back when I created it). This
>> should be an easy fix.
>> 
>> On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu  wrote:
>> > Hi, Patrick,
>> >
>> > I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about
>> > the same thing?
>> >
>> > How about assigning it to me?
>> >
>> > I think I missed the configuration part in my previous commit, though I
>> > declared that in the PR description
>> >
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> >
>> > On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
>> >
>> > Hey There,
>> >
>> > The issue was that the old behavior could cause users to silently
>> > overwrite data, which is pretty bad, so to be conservative we decided
>> > to enforce the same checks that Hadoop does.
>> >
>> > This was documented by this JIRA:
>> > https://issues.apache.org/jira/browse/SPARK-1100
>> > https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
>> >
>> > However, it would be very easy to add an option that allows preserving
>> > the old behavior. Is anyone here interested in contributing that? I
>> > created a JIRA for it:
>> >
>> > https://issues.apache.org/jira/browse/SPARK-1993
>> >
>> > - Patrick
>> >
>> > On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
>> >  wrote:
>> >
>> > Indeed, the behavior has changed for good or for bad. I mean, I agree with
>> > the danger you mention but I'm not sure it's happening like that. Isn't
>> > there a mechanism for overwrite in Hadoop that automatically removes part
>> > files, then writes a _temporary folder and then only the part files along
>> > with the _success folder.
>> >
>> > In any case this change of behavior should be documented IMO.
>> >
>> > Cheers
>> > Pierre
>> >
>> > Message sent from a mobile device - excuse typos and abbreviations
>> >
>> > Le 2 juin 2014 à 17:42, Nicholas Chammas  a
>> > écrit :
>> >
>> > What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0.) 
>> > is
>> > that files get overwritten automatically. This is one danger to this 
>> > though.
>> > If I save to a directory that already has 20 part- files, but this time
>> > around I'm only saving 15 part- files, then there will be 5 leftover part-
>> > files from the previous set mixed in with the 15 newer files. This is
>> > potentially dangerous.
>> >
>> > I haven't checked to see if this behavior has changed in 1.0.0. Are you

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Sean Owen

I assume the idea is for Spark to "rm -r dir/", which would clean out
everything that was there before. It's just doing this instead of the
caller. Hadoop still won't let you write into a location that already
exists regardless, and part of that is for this reason that you might
end up with files mixed-up from different jobs.

This doesn't need a change to Hadoop and probably shouldn't; it's a
change to semantics provided by Spark to do the delete for you if you
set a flag. Viewed that way, meh, seems like the caller could just do
that themselves rather than expand the Spark API (via a utility method
if you like), but I can see it both ways. Caller beware.

On Mon, Jun 2, 2014 at 10:08 PM, Nicholas Chammas
 wrote:
> OK, thanks for confirming. Is there something we can do about that leftover
> part- files problem in Spark, or is that for the Hadoop team?
>
>
> 2014년 6월 2일 월요일, Aaron Davidson님이 작성한 메시지:
>
>> Yes.
>>
>>
>> On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
>>  wrote:
>>
>> So in summary:
>>
>> As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
>> There is an open JIRA issue to add an option to allow clobbering.
>> Even when clobbering, part- files may be left over from previous saves,
>> which is dangerous.
>>
>> Is this correct?

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas

Fair enough. That rationale makes sense.

I would prefer that a Spark clobber option also delete the destination
files, but as long as it's a non-default option I can see the "caller
beware" side of that argument as well.

Nick


2014년 6월 2일 월요일, Sean Owen님이 작성한 메시지:

> I assume the idea is for Spark to "rm -r dir/", which would clean out
> everything that was there before. It's just doing this instead of the
> caller. Hadoop still won't let you write into a location that already
> exists regardless, and part of that is for this reason that you might
> end up with files mixed-up from different jobs.
>
> This doesn't need a change to Hadoop and probably shouldn't; it's a
> change to semantics provided by Spark to do the delete for you if you
> set a flag. Viewed that way, meh, seems like the caller could just do
> that themselves rather than expand the Spark API (via a utility method
> if you like), but I can see it both ways. Caller beware.
>
> On Mon, Jun 2, 2014 at 10:08 PM, Nicholas Chammas
> > wrote:
> > OK, thanks for confirming. Is there something we can do about that
> leftover
> > part- files problem in Spark, or is that for the Hadoop team?
> >
> >
> > 2014년 6월 2일 월요일, Aaron Davidson>님이
> 작성한 메시지:
> >
> >> Yes.
> >>
> >>
> >> On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
> >> > wrote:
> >>
> >> So in summary:
> >>
> >> As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
> >> There is an open JIRA issue to add an option to allow clobbering.
> >> Even when clobbering, part- files may be left over from previous saves,
> >> which is dangerous.
> >>
> >> Is this correct?
>

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu

I made the PR, the problem is …after many rounds of review, that configuration 
part is missed….sorry about that  

I will fix it  

Best,  

--  
Nan Zhu


On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote:

> I'm a bit confused because the PR mentioned by Patrick seems to adress all 
> these issues:
> https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
>  
> Was it not accepted? Or is the description of this PR not completely 
> implemented?
>  
> Message sent from a mobile device - excuse typos and abbreviations
>  
> Le 2 juin 2014 à 23:08, Nicholas Chammas  (mailto:nicholas.cham...@gmail.com)> a écrit :
>  
> > OK, thanks for confirming. Is there something we can do about that leftover 
> > part- files problem in Spark, or is that for the Hadoop team?
> >  
> >  
> > 2014년 6월 2일 월요일, Aaron Davidson > (mailto:ilike...@gmail.com)>님이 작성한 메시지:
> > > Yes.
> > >  
> > >  
> > > On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas 
> > >  wrote:
> > > > So in summary:
> > > > As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
> > > > There is an open JIRA issue to add an option to allow clobbering.
> > > > Even when clobbering, part- files may be left over from previous saves, 
> > > > which is dangerous.
> > > >  
> > > > Is this correct?
> > > >  
> > > >  
> > > >  
> > > >  
> > > > On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson  
> > > > wrote:
> > > > > +1 please re-add this feature
> > > > >  
> > > > >  
> > > > > On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell  
> > > > > wrote:
> > > > > > Thanks for pointing that out. I've assigned you to SPARK-1677 (I 
> > > > > > think
> > > > > > I accidentally assigned myself way back when I created it). This
> > > > > > should be an easy fix.
> > > > > >  
> > > > > > On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu  
> > > > > > wrote:
> > > > > > > Hi, Patrick,
> > > > > > >
> > > > > > > I think https://issues.apache.org/jira/browse/SPARK-1677 is 
> > > > > > > talking about
> > > > > > > the same thing?
> > > > > > >
> > > > > > > How about assigning it to me?
> > > > > > >
> > > > > > > I think I missed the configuration part in my previous commit, 
> > > > > > > though I
> > > > > > > declared that in the PR description
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > --
> > > > > > > Nan Zhu
> > > > > > >
> > > > > > > On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
> > > > > > >
> > > > > > > Hey There,
> > > > > > >
> > > > > > > The issue was that the old behavior could cause users to silently
> > > > > > > overwrite data, which is pretty bad, so to be conservative we 
> > > > > > > decided
> > > > > > > to enforce the same checks that Hadoop does.
> > > > > > >
> > > > > > > This was documented by this JIRA:
> > > > > > > https://issues.apache.org/jira/browse/SPARK-1100
> > > > > > > https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
> > > > > > >
> > > > > > > However, it would be very easy to add an option that allows 
> > > > > > > preserving
> > > > > > > the old behavior. Is anyone here interested in contributing that? 
> > > > > > > I
> > > > > > > created a JIRA for it:
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/SPARK-1993
> > > > > > >
> > > > > > > - Patrick
> > > > > > >
> > > > > > > On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
> > > > > > >  wrote:
> > > > > > >
> > > > > > > Indeed, the behavior has changed for good or for bad. I mean, I 
> > > > > > > agree with
> > > > > > > the danger you mention but I'm not sure it's happening like that. 
> > > > > > > Isn't
> > > > > > > there a mechanism for overwrite in Hadoop that automatically 
> > > > > > > removes part
> > > > > > > files, then writes a _temporary folder and then only the part 
> > > > > > > files along
> > > > > > > with the _success folder.
> > > > > > >
> > > > > > > In any case this change of behavior should be documented IMO.
> > > > > > >
> > > > > > > Cheers
> > > > > > > Pierre
> > > > > > >
> > > > > > > Message sent from a mobile device - excuse typos and abbreviations
> > > > > > >
> > > > > > > Le 2 juin 2014 à 17:42, Nicholas Chammas 
> > > > > > >  a
> > > > > > > écrit :
> > > > > > >
> > > > > > > What I've found using saveAsTextFile() against S3 (prior to Spark 
> > > > > > > 1.0.0.) is
> > > > > > > that files get overwritten automatically. This is one danger to 
> > > > > > > this though.
> > > > > > > If I save to a directory that already has 20 part- files, but 
> > > > > > > this time
> > > > > > > around I'm only saving 15 part- files, then there will be 5 
> > > > > > > leftover part-
> > > > > > > files from the previous set mixed in with the 15 newer files. 
> > > > > > > This is
> > > > > > > potentially dangerous.
> > > > > > >
> > > > > > > I haven't checked to see if this behavior has changed in 1.0.0. 
> > > > > > > Are you

Re: How to create RDDs from another RDD?

2014-06-02 Thread Andrew Ash

Hi Gerard,

Usually when I want to split one RDD into several, I'm better off
re-thinking the algorithm to do all the computation at once.  Example:

Suppose you had a dataset that was the tuple (URL, webserver,
pageSizeBytes), and you wanted to find out the average page size that each
webserver (e.g. Apache, nginx, IIS, etc) served.  Rather than splitting
your allPagesRDD into an RDD for each webserver, like nginxRDD, apacheRDD,
IISRDD, it's probably better to do the average computation over all at
once, like this:

// allPagesRDD is (URL, webserver, pageSizeBytes)
allPagesRDD.keyBy(getWebserver)
  .map(k => (k.pageSizeBytes, 1))
  .reduceByKey( (a,b) => (a._1 + b._1, a._2 + b._2)
  .mapValues( v => (v._1 / v._2) )

For this example you could use something like Summingbird to keep from
doing the average tracking yourself.

Can you go into more detail about why you want to split one RDD into
several?

On Mon, Jun 2, 2014 at 1:13 PM, Gerard Maas  wrote:

> The RDD API has  functions to join multiple RDDs, such as PariRDD.join
> or PariRDD.cogroup that take another RDD as input. e.g.
>  firstRDD.join(secondRDD)
>
> I'm looking for ways to do the opposite: split an existing RDD. What is
> the right way to create derivate RDDs from an existing RDD?
>
> e.g. imagine I've an  collection or pairs as input: colRDD =
>  (k1->v1)...(kx->vy)...
> I could do:
> val byKey = colRDD.groupByKey() = (k1->(k1->v1...
> k1->vn)),...(kn->(kn->vy, ...))
>
> Now, I'd like to create an RDD from the values to have something like:
>
> val groupedRDDs = (k1->RDD(k1->v1,...k1->vn), kn -> RDD(kn->vy, ...))
>
> in this example, there's an f(byKey) = groupedRDDs.  What's that f(x) ?
>
> Would:  byKey.map{case (k,v) => k->sc.makeRDD(v.toSeq)}  the
> right/recommended way to do this?  Any other options?
>
> Thanks,
>
> Gerard.
>

Interactive modification of DStreams

2014-06-02 Thread lbustelo

This is a general question about whether Spark Streaming can be interactive
like batch Spark jobs. I've read plenty of threads and done my fair bit of
experimentation and I'm thinking the answer is NO, but it does not hurt to
ask.

More specifically, I would like to be able to do:
1. Add/Remove steps to the Streaming Job
2. Modify Window durations
3. Stop and Restart context.

I've tried the following:

1. Modify the DStream after it has been started… BOOM! Exceptions
everywhere.

2. Stop the DStream, Make modification, Start… NOT GOOD :( In 0.9.0 I was
getting deadlocks. I also tried 1.0.0 and it did not work.

3. Based on information provided here

, I was been able to prototype modifying the RDD computation within a
forEachRDD. That is nice, but you are then bounded to the specified batch
size. That got me to wanting to modify Window durations. Is changing the
Window duration possible?

4. Tried running multiple streaming context from within a single Driver
application and got several exceptions. The first one was bind exception on
the web port. Then once the app started getting run (cores were taken but
1st job) it did not run correctly. A lot of
"akka.pattern.AskTimeoutException: Timed out"
.

I've tried my experiments in 0.9.0, 0.9.1 and 1.0.0 running on Standalone
Cluster setup.
Thanks in advanced

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Interactive-modification-of-DStreams-tp6740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

85 matches

Mail list logo