Re: Adding RDD function to segment an RDD (like substring)

2014-12-09 Thread Mark Hamstra
`zipWithIndex` is compute-intensive and breaks Spark's "transformations are lazy" model, so it is probably not appropriate to add this to the public RDD API. If `zipWithIndex` weren't already what I consider to be broken, I'd be much friendlier to building something more on top of it, but I r

Adding RDD function to segment an RDD (like substring)

2014-12-09 Thread Ganelin, Ilya
Hi all – a utility that I’ve found useful several times now when working with RDDs is the ability to reason about segments of the RDD. For example, if I have two large RDDs and I want to combine them in a way that would be intractable in terms of memory or disk storage (e.g. a cartesian) but a p