Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-29 Thread Erik Erlandson
- Original Message - > Sure, drop() would be useful, but breaking the "transformations are lazy; > only actions launch jobs" model is abhorrent -- which is not to say that we > haven't already broken that model for useful operations (cf. > RangePartitioner, which is used for sorted RDDs),

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Erik Erlandson
- Original Message - > It could make sense to add a skipHeader argument to SparkContext.textFile? I also looked into this. I don't think it's feasible given the limits of the InputFormat and RecordReader interfaces. RecordReader can't (I think) *ever* know which split it's attached

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
Yeah, the input format doesn't support this behavior. But it does tell you the byte position of each record in the file. On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin wrote: > Yes, that could work. But it is not as simple as just a binary flag. > > We might want to skip the first row for every
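A rough sketch of what Sandy describes (not from the thread itself): reading through the Hadoop text input format directly gives each record its byte offset within its file as the key, and offset zero marks the first line of a file, so filtering those records out skips one header per file. The path and variable names are illustrative, and sc is assumed to be an existing SparkContext.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // Keys are byte offsets within each file; values are the line contents.
    val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("data/*.csv")

    // Offset 0 marks the first line of a file, so this drops one header per file
    // (assuming every file actually begins with a header row).
    val rows = lines.filter { case (offset, _) => offset.get() != 0L }
                    .map { case (_, text) => text.toString }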

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
Yes, that could work. But it is not as simple as just a binary flag. We might want to skip the first row for every file, or the header only for the first file. The former is not really supported out of the box by the input format, I think. On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza wrote: > I
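For the second case Reynold mentions (a single header at the start of the first file only), one hedged sketch is to drop the first line of partition 0, on the assumption that partition 0 holds the start of that file; rdd here stands for any RDD[String] produced by textFile.

    // Drops only the very first line of the RDD, i.e. a single leading header.
    // Assumes partition 0 contains the start of the (first) file.
    val noHeader = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }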

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin wrote: > If the purpose is for dropping csv headers, perhaps we don't really need a > common drop and only one that drops the first line in a file? I'd really > try hard to a

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring header

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
You can find some of the prior, related discussion here: https://issues.apache.org/jira/browse/SPARK-1021 On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson wrote: > > > - Original Message - > > Rather than embrace non-lazy transformations and add more of them, I'd > > rather we 1) try to

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - > Rather than embrace non-lazy transformations and add more of them, I'd > rather we 1) try to fully characterize the needs that are driving their > creation/usage; and 2) design and implement new Spark abstractions that > will allow us to meet those needs and elimina

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformation. T

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - > Sure, drop() would be useful, but breaking the "transformations are lazy; > only actions launch jobs" model is abhorrent -- which is not to say that we > haven't already broken that model for useful operations (cf. > RangePartitioner, which is used for sorted RDDs),

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Aniket
> I wrote up a discussion of these trade-offs here: > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Andrew Ash
Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop, so I filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson wrote: > A few week
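The filter idiom Andrew mentions usually looks something like the sketch below (file name is illustrative); note that first() is itself an action, and the filter predicate is evaluated on every partition.

    val lines  = sc.textFile("data.csv")
    val header = lines.first()             // an action: launches a small job
    val rows   = lines.filter(_ != header) // drops any line equal to the header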

RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315 Supporting the drop method would make some operations convenient; however, it forces computation of >= 1 partition of the parent RDD, and so it would behave like a "part
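To make the "partial action" point concrete, here is a hedged sketch (not the PR's implementation) of how drop(n) can be emulated today. The per-partition counting step is itself a job over the parent RDD, which is exactly the eager behavior under discussion; names are illustrative.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def dropSketch[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
      // Counting elements per partition launches a job; a real implementation
      // would only need the leading partitions, but it is still eager work.
      val counts = rdd.mapPartitions(it => Iterator(it.size)).collect()
      val before = counts.scanLeft(0L)(_ + _) // elements preceding each partition
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        if (before(idx) >= n) iter
        else iter.drop((n - before(idx)).toInt)
      }
    }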