Re: how to deal with continued records

Sonal Goyal Thu, 11 Jun 2015 01:34:14 -0700

You could create a pair RDD keyed by the record ID and then call
reduceByKey. This will involve some shuffling.
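
A rough sketch of the key-and-merge logic, using the sample rows from the
question (the elided rows are left out). It runs locally with a small dict-based
stand-in for reduceByKey, so `reduce_by_key`, `to_pair` and `merge` are
hypothetical helper names, not Spark API. Because the RecordID is reset to 1
from time to time, this sketch assumes that (RecordID, stdID, stdVal) together
identify one complete record; that happens to hold for the sample but is not
guaranteed in general.

```python
# Sample rows copied from the question; the header and the "..." rows are omitted.
SAMPLE = [
    "1,s1,10,r1,7",
    "2,s2,9,r1,7",
    "2,s2,9,r2,8",
    "2,s2,9,r3,3",
    "3,s1,12,r2,10",
    "42,s3,8,r7,5",
    "1,s2,11,r3,5",
]


def to_pair(line):
    """Turn one CSV line into (key, [(refID, refVal)])."""
    record_id, std_id, std_val, ref_id, ref_val = line.split(",")
    # Assumption: RecordID alone is ambiguous after a reset, so the
    # stdID and stdVal columns are added to the key.
    return (record_id, std_id, std_val), [(ref_id, ref_val)]


def merge(a, b):
    """Combine the ref lists of two partial records with the same key."""
    return a + b


def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey (hypothetical helper)."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


merged = reduce_by_key(map(to_pair, SAMPLE), merge)
# Keep only complete records built from at least two continued lines.
multi = {k: v for k, v in merged.items() if len(v) >= 2}
print(len(multi))

# With PySpark the same functions could be plugged in roughly like this
# (the header line would need to be filtered out first):
#   sc.textFile(path).map(to_pair).reduceByKey(merge) \
#     .filter(lambda kv: len(kv[1]) >= 2).count()
```

Note that in Spark the order of the merged refs within a key is not guaranteed
after the shuffle.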


Another way would be to create your own custom input format, which is
slightly more intricate but should do the job. It would have to parse several
consecutive lines into a single custom record.
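
The parsing part of that idea might look roughly like the sketch below. It is
plain Python, not an actual Hadoop InputFormat: `read_records` is a
hypothetical name, and a real input format would also have to deal with
records that straddle split boundaries. It assumes the lines arrive in file
order and that a new complete record starts whenever the RecordID changes.

```python
from itertools import groupby

# Sample rows copied from the question; the header and the "..." rows are omitted.
SAMPLE = [
    "1,s1,10,r1,7",
    "2,s2,9,r1,7",
    "2,s2,9,r2,8",
    "2,s2,9,r3,3",
    "3,s1,12,r2,10",
    "42,s3,8,r7,5",
    "1,s2,11,r3,5",
]


def read_records(lines):
    """Yield one dict per run of consecutive lines sharing a RecordID."""
    rows = (line.strip().split(",") for line in lines if line.strip())
    # groupby only merges adjacent rows, so a RecordID that reappears
    # later (after a reset to 1) starts a new record.
    for record_id, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        yield {
            "id": record_id,
            "std": (group[0][1], group[0][2]),
            "refs": [(row[3], row[4]) for row in group],
        }


records = list(read_records(SAMPLE))
# Count complete records built from at least two continued lines.
print(sum(1 for r in records if len(r["refs"]) >= 2))
```

One limitation: if a reset happens right after a record whose ID is also 1,
the two would be merged, so some extra check (for example on stdID) may be
needed.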
On Jun 11, 2015 12:39 PM, "wushuzh" <[email protected]> wrote:

> Hello,
>
> I have a large CSV file in which continued records (rows with the same
> RecordID) belong together in context. I should treat these continued records
> as ONE complete record. Also, the RecordID is reset to 1 whenever the
> CSV dumper system thinks it's necessary.
>
> I'd like some suggestions on how to analyze this kind of
> file in Spark. For example:
>
> I need to count the complete records that consist of >= 2
> continued records. Obviously, "2, s2, 9, r1, 7, r2, 8, r3, 3" is one of my
> targets.
>
> A sample of the CSV:
>
> RecordID,stdID,stdVal,refID,refVal
> 1,s1,10,r1,7
> 2,s2,9,r1,7
> 2,s2,9,r2,8
> 2,s2,9,r3,3
> 3,s1,12,r2,10
> ...
> 42,s3,8,r7,5
> 1,s2,11,r3,5
>
> Best regards,
> Jiaqiang
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-deal-with-continued-records-tp23269.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
