Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin

Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal.

On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling wrote:
> I'm considering a few approaches -- one of which is to provide new
> functions like mapLeft, mapRight, filterLeft, etc.
>
> But this all falls short with DataFrame

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling

I'm considering a few approaches -- one of which is to provide new functions like mapLeft, mapRight, filterLeft, etc.

But this all falls short with DataFrames. RDDs can easily be extended from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add special columns?

On Wed, Jul 15, 2015
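
For readers following the thread, here is a minimal sketch of what helpers like mapLeft, mapRight, and filterLeft might look like. It is plain Python over a list rather than Spark code, and the semantics are assumptions, since the thread does not define them: `Left` holds a bad record, `Right` holds a good one (the usual Scala `Either` convention), and `filter_left` lets every `Right` through while keeping only the `Left` values that satisfy the predicate. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass(frozen=True)
class Left:
    """Assumed to carry a bad record or an error description."""
    value: Any


@dataclass(frozen=True)
class Right:
    """Assumed to carry a successfully processed record."""
    value: Any


Either = Union[Left, Right]


def map_right(records: List[Either], f: Callable[[Any], Any]) -> List[Either]:
    # Transform good records only; bad records pass through untouched.
    return [Right(f(r.value)) if isinstance(r, Right) else r for r in records]


def map_left(records: List[Either], f: Callable[[Any], Any]) -> List[Either]:
    # Transform bad records only, e.g. to attach more context to an error.
    return [Left(f(r.value)) if isinstance(r, Left) else r for r in records]


def filter_left(records: List[Either], pred: Callable[[Any], bool]) -> List[Either]:
    # Keep every Right; keep a Left only when the predicate holds for it.
    return [r for r in records if isinstance(r, Right) or pred(r.value)]


records = [Right(2), Left("parse error"), Right(5)]
doubled = map_right(records, lambda v: v * 2)
annotated = map_left(doubled, lambda e: "people.csv: " + e)
without_errors = filter_left(annotated, lambda e: False)
```

With an RDD, the per-record part of each helper could be passed to `rdd.map` or `rdd.filter` instead of a list comprehension. As the message notes, this wrapper style does not carry over to DataFrames, where the metadata would have to live in ordinary columns.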

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin

How about just using two fields, one boolean field to mark good/bad, and another to get the source file?

On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling wrote:
> Hi all,
>
> I'm working on an ETL task with Spark. As part of this work, I'd like to
> mark records with some info such as:
>
> 1. Whet
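
The two-field suggestion can be illustrated with a small sketch. This is plain Python using dicts as stand-ins for rows, not Spark code; the column names `is_good` and `source_file`, the helper `tag_row`, and the rule that a row is good when all required fields are present are assumptions made for the example.

```python
from typing import Any, Dict, List


def tag_row(row: Dict[str, Any], source_file: str, required: List[str]) -> Dict[str, Any]:
    """Return a copy of the row with two extra metadata fields."""
    tagged = dict(row)
    # Boolean field marking good/bad: here, good means no required field is missing.
    tagged["is_good"] = all(row.get(key) is not None for key in required)
    # Second field recording where the row came from.
    tagged["source_file"] = source_file
    return tagged


rows = [{"name": "alice", "age": 30}, {"name": "bob", "age": None}]
tagged = [tag_row(r, "people.csv", ["name", "age"]) for r in rows]

# Downstream steps can filter on the flag instead of failing on bad rows.
good_rows = [r for r in tagged if r["is_good"]]
bad_rows = [r for r in tagged if not r["is_good"]]
```

Because the metadata is just two more fields, the same layout works whether the rows sit in an RDD or in a DataFrame, which is the appeal of this approach.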

Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling

Hi all,

I'm working on an ETL task with Spark. As part of this work, I'd like to mark records with some info such as:

1. Whether the record is good or bad (e.g., Either)
2. Originating file and lines

Part of my motivation is to prevent errors with individual records from stopping the entire pipe
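
A rough sketch of the idea in this message: each input line is parsed inside a try/except, and the result is wrapped in a record that notes whether parsing succeeded and where the line came from. This is plain Python, not Spark code. The `Record` class, the `parse_line` helper, the `name,age` input format, and the file name are all hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical wrapper: a parsed value or an error, plus its origin."""
    source_file: str
    line_no: int
    value: Optional[Any] = None
    error: Optional[str] = None

    @property
    def is_good(self) -> bool:
        return self.error is None


def parse_line(source_file: str, line_no: int, line: str) -> Record:
    """Parse one 'name,age' line, capturing failures instead of raising."""
    try:
        name, age = line.split(",")
        return Record(source_file, line_no, value=(name.strip(), int(age)))
    except ValueError as exc:
        # A malformed line becomes a bad record rather than aborting the job.
        return Record(source_file, line_no, error=str(exc))


lines = ["alice,30", "bob,not-a-number", "carol,41"]
records = [parse_line("people.csv", i, text) for i, text in enumerate(lines, start=1)]

good = [r for r in records if r.is_good]
bad = [r for r in records if not r.is_good]
```

The bad records keep their file name and line number, so they can be written out for inspection while the good records continue through the rest of the pipeline.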