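You could create a pair RDD keyed by the record ID and then call reduceByKey to merge the lines that belong to the same record. This involves a shuffle. Note that because the RecordID is reset to 1 from time to time, the ID alone is not a unique key, so the key would need something extra to tell the resets apart.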
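A minimal PySpark sketch of the pair RDD idea, under some assumptions: `sc` is an existing SparkContext, `path` is a hypothetical location of the CSV, every data line has exactly the five columns from the sample, and the RecordID is unique across the whole file (which does not hold once the ID resets, so a real job would need a richer key). The helper names `to_pair`, `merge` and `count_multi_line_records` are made up for illustration.

```python
def to_pair(line):
    """Turn one CSV line into (RecordID, (stdID, stdVal, [(refID, refVal)]))."""
    rec_id, std_id, std_val, ref_id, ref_val = line.strip().split(",")
    return int(rec_id), (std_id, int(std_val), [(ref_id, int(ref_val))])


def merge(a, b):
    """Combine two partial records with the same key by joining their ref lists."""
    return a[0], a[1], a[2] + b[2]


def count_multi_line_records(sc, path):
    """Count records made of >= 2 lines.

    Assumption: RecordID alone identifies a record (no resets in the file).
    """
    lines = sc.textFile(path)
    header = lines.first()  # "RecordID,stdID,stdVal,refID,refVal"
    return (lines.filter(lambda l: l != header)
                 .map(to_pair)
                 .reduceByKey(merge)  # this is where the shuffle happens
                 .filter(lambda kv: len(kv[1][2]) >= 2)
                 .count())
```

reduceByKey does not promise any particular order when merging, so the ref list of a merged record may not follow the order of the lines in the file.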
Another way would be to write your own custom input format, which is slightly more intricate but should do the job. It would have to parse the several lines that make up one record into a single custom record.

On Jun 11, 2015 12:39 PM, "wushuzh" <[email protected]> wrote:
> Hello,
>
> I have a large CSV file in which the continued records(with same RecordID)
> may have the context meaning. I should see these continued records as ONE
> complete record. Also the recordID will be reset to 1 at some time when the
> csv dumper system think it's necessary.
>
> I'd like to get some suggestion about how to do analyze with this kind of
> file in Spark ? for example,
>
> I need to get the number of the complete record which should consists >=2
> continued records. Obviously, "2, s2, 9, r1, 7, r2, 8, r3, 3" is one of my
> target.
>
> A example sample of csv
>
> RecordID,stdID,stdVal,refID,refVal
> 1,s1,10,r1,7
> 2,s2,9,r1,7
> 2,s2,9,r2,8
> 2,s2,9,r3,3
> 3,s1,12,r2,10
> ...
> 42,s3,8,r7,5
> 1,s2,11,r3,5
>
> Best regards,
> Jiaqiang
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-deal-with-continued-records-tp23269.html
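For the custom input format idea, the core of the work is turning a run of consecutive lines with the same RecordID into one record. The sketch below is not a Hadoop InputFormat; it only shows that grouping step in plain Python on the sample rows from the question (header and the "..." line left out). The name `assemble_records` is hypothetical.

```python
from itertools import groupby


def assemble_records(lines):
    """Yield one merged record per run of consecutive lines sharing a RecordID."""
    rows = (line.strip().split(",") for line in lines if line.strip())
    # groupby only groups adjacent rows, so a RecordID that comes back
    # later after a reset starts a new record instead of being merged.
    for rec_id, run in groupby(rows, key=lambda r: r[0]):
        run = list(run)
        std_id, std_val = run[0][1], run[0][2]
        refs = [(r[3], r[4]) for r in run]
        yield rec_id, std_id, std_val, refs


sample = [
    "1,s1,10,r1,7",
    "2,s2,9,r1,7",
    "2,s2,9,r2,8",
    "2,s2,9,r3,3",
    "3,s1,12,r2,10",
    "42,s3,8,r7,5",
    "1,s2,11,r3,5",
]

records = list(assemble_records(sample))
# Complete records built from two or more continued lines.
multi = [r for r in records if len(r[3]) >= 2]
print(len(records), len(multi))  # prints "5 1"
```

The same function could in principle be applied per partition with mapPartitions, but a record whose lines straddle a partition boundary would then be split in two. Handling that boundary case is what makes a real custom input format more intricate. Two different records that both happen to have ID 1 on adjacent lines right at a reset would also be merged by this sketch.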
