Hi Experts,
I'm trying to transform a couple of thousand delimited files stored on
HDFS using Pig. Each file is between 20 and 200 MB in size. The files have a very
simple column layout, like an event history:
TimeStamp, Location, Source, Target, EventType, Description
The logic is as follows:
- Each file is already in natural order by the timestamp column
- Event type can be either Start or Complete
What I'm trying to do is match the first Complete event that occurred after a
Start event for a given Location, Source, Target combination, so that I can
calculate durations. So the transformation will convert all the files
FROM:
TimeStamp,Location,Source,Target,EventType,Description
14:00:43,A,S1,D1,Start,Description1
14:01:02,A,S1,D2,Start,Description2
14:01:43,A,S1,D1,Complete,Description3
14:03:02,A,S1,D2,Complete,Description4
14:03:43,A,S2,D1,Start,Description5
14:03:43,A,S1,D1,Start,Description6
14:04:53,A,S2,D1,Complete,Description7
TO:
TimeStamp,Location,Source,Target,Duration
14:00:43,A,S1,D1,01:00
14:01:02,A,S1,D2,02:00
14:03:43,A,S2,D1,01:10
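To make the matching rule concrete, here is a rough Python sketch of the logic I'm after (illustrative only, not the Pig implementation; the function name `pair_durations` and the per-key FIFO queue of unmatched Starts are just my own way of expressing it):

```python
from collections import defaultdict, deque
from datetime import datetime

def pair_durations(rows):
    """Pair each Start with the first later Complete for the same
    (Location, Source, Target) key; rows must be sorted by timestamp."""
    pending = defaultdict(deque)  # key -> FIFO of unmatched Start timestamps
    results = []
    for ts, loc, src, tgt, etype, _desc in rows:
        key = (loc, src, tgt)
        if etype == "Start":
            pending[key].append(ts)
        elif etype == "Complete" and pending[key]:
            start = pending[key].popleft()
            delta = (datetime.strptime(ts, "%H:%M:%S")
                     - datetime.strptime(start, "%H:%M:%S"))
            mm, ss = divmod(int(delta.total_seconds()), 60)
            results.append((start, loc, src, tgt, f"{mm:02d}:{ss:02d}"))
    # Starts with no later Complete (e.g. the second S1,D1 Start) are dropped
    return sorted(results)

rows = [
    ("14:00:43", "A", "S1", "D1", "Start", "Description1"),
    ("14:01:02", "A", "S1", "D2", "Start", "Description2"),
    ("14:01:43", "A", "S1", "D1", "Complete", "Description3"),
    ("14:03:02", "A", "S1", "D2", "Complete", "Description4"),
    ("14:03:43", "A", "S2", "D1", "Start", "Description5"),
    ("14:03:43", "A", "S1", "D1", "Start", "Description6"),
    ("14:04:53", "A", "S2", "D1", "Complete", "Description7"),
]

for rec in pair_durations(rows):
    print(",".join(rec))
```

Running this over the sample above reproduces the three output rows I showed.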
I thought I could leverage the fact that the individual files are already
sorted and that the filenames reveal which file comes first, or I could import
them and sort them all together at once. However, I'm not sure how to process
the files in that order and apply the grouping / sequence-based duration
extraction within each file.
Can I ask for your opinion or guidance / hints? Which approach better leverages
the parallelism of the Hadoop cluster?
Kind Regards,