- Original Message -
> It could make sense to add a skipHeader argument to SparkContext.textFile?
I also looked into this. I don't think it's feasible given the limits of the
InputFormat and RecordReader interfaces. RecordReader can't (I think) *ever*
know which split it's attached to.
Yeah, the input format doesn't support this behavior. But it does tell you
the byte position of each record in the file.
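Since TextInputFormat keys each record by exactly that byte position, the first line of every file can be filtered out today, lazily, without any new API. A sketch, assuming `sc` is a live SparkContext and the input path is hypothetical:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// TextInputFormat keys each record by its byte offset within its file,
// so offset 0 marks the first line of every input file.
val keyed = sc.hadoopFile[LongWritable, Text, TextInputFormat]("data/*.csv")
val noHeaders = keyed
  .filter { case (offset, _) => offset.get() != 0L } // drop each file's first line
  .map { case (_, line) => line.toString }
```

This stays lazy -- no job runs until an action is invoked -- so it sidesteps the non-lazy-transformation debate for the header case.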
On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin wrote:
Yes, that could work. But it is not as simple as just a binary flag.
We might want to skip the first row for every file, or the header only for
the first file. The former is not really supported out of the box by the
input format I think?
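The latter behavior -- skipping only the very first row of the dataset -- can at least be approximated with `mapPartitionsWithIndex`. A sketch, hypothetical path, and correct only when the header lands in partition 0 (which holds for a single non-empty text file):

```scala
// Drop only the very first line of the RDD; only partition 0 pays any cost.
val lines = sc.textFile("data.csv") // hypothetical path
val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
```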
On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza wrote:
It could make sense to add a skipHeader argument to SparkContext.textFile?
On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin wrote:
If the purpose is for dropping csv headers, perhaps we don't really need a
common drop and only one that drops the first line in a file? I'd really
try hard to avoid a common drop/dropWhile because they can be expensive to
do.
Note that I think we will be adding this functionality (ignoring header lines).
You can find some of the prior, related discussion here:
https://issues.apache.org/jira/browse/SPARK-1021
On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson wrote:
Rather than embrace non-lazy transformations and add more of them, I'd
rather we 1) try to fully characterize the needs that are driving their
creation/usage; and 2) design and implement new Spark abstractions that
will allow us to meet those needs and eliminate existing non-lazy
transformations.
Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but rather that each such
exception to the model needs to be carefully justified.
> I wrote up a discussion of these trade-offs here:
> http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
Personally I'd find the method useful -- I've often had a .csv file with a
header row that I want to drop, so I filter it out, which touches all
partitions anyway. I don't have any comments on the implementation quite
yet though.
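The filter idiom in question is roughly the following sketch (hypothetical path; it assumes the header text never also occurs as a data row):

```scala
val lines  = sc.textFile("data.csv")    // hypothetical path
val header = lines.first()              // small job over the first partition
val data   = lines.filter(_ != header)  // touches all partitions anyway
```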
On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson wrote:
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315:
https://issues.apache.org/jira/browse/SPARK-2315
Supporting the drop method would make some operations convenient, however it
forces computation of >= 1 partition of the parent RDD, and so it would behave
like a "partial action".
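The "partial action" behavior can be illustrated with a hypothetical sketch (not the actual SPARK-2315 code): deciding how many elements to skip in each partition requires counting the leading partitions at definition time, via `runJob`.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch of rdd.drop(n); not the SPARK-2315 implementation.
def drop[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
  val sc = rdd.sparkContext
  val skips = scala.collection.mutable.Map.empty[Int, Int]
  var remaining = n
  var p = 0
  while (remaining > 0 && p < rdd.partitions.length) {
    // Eagerly count one parent partition -- this launches a job before
    // any action is called on the result, hence "partial action".
    val count = sc.runJob(rdd, (it: Iterator[T]) => it.size, Seq(p)).head
    val skip = math.min(count, remaining)
    skips(p) = skip
    remaining -= skip
    p += 1
  }
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    iter.drop(skips.getOrElse(idx, 0))
  }
}
```

Note the asymmetry with ordinary transformations: only the partitions holding the first n elements are computed eagerly, but that is still computation outside any action.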