Hi Experts,
I'm trying to transform a couple of thousand delimited files stored on
HDFS using Pig. Each file is between 20 and 200 MB in size. The files have a very
simple column layout, like an event history:
TimeStamp, Location, Source, Target, EventType, Description
The logic is as follows:
- Each file is already in natural order by the timestamp column
- Event type can be either Start or Complete
What I'm trying to do is match the first Complete event that occurred after a
Start event for a given Location, Source, Target combination, so that I can
calculate durations. So the transformation will convert all the files
FROM:
TimeStamp,Location,Source,Target,EventType,Description
14:00:43,A,S1,D1,Start,Description1
14:01:02,A,S1,D2,Start,Description2
14:01:43,A,S1,D1,Complete,Description3
14:03:02,A,S1,D2,Complete,Description4
14:03:43,A,S2,D1,Start,Description5
14:03:43,A,S1,D1,Start,Description6
14:04:53,A,S2,D1,Complete,Description7
TO:
TimeStamp,Location,Source,Target,Duration
14:00:43,A,S1,D1,01:00
14:01:02,A,S1,D2,02:00
14:03:43,A,S2,D1,01:10
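To make the matching rule concrete, here is a rough Python sketch of the logic I'm after (illustrative only, not the Pig implementation; the function name `pair_durations` and the per-key FIFO queue of unmatched Starts are just my own way of expressing it):

```python
from collections import defaultdict, deque
from datetime import datetime

def pair_durations(rows):
    """Pair each Start with the first later Complete for the same
    (Location, Source, Target) key; rows must be sorted by timestamp."""
    pending = defaultdict(deque)  # key -> FIFO of unmatched Start timestamps
    results = []
    for ts, loc, src, tgt, etype, _desc in rows:
        key = (loc, src, tgt)
        if etype == "Start":
            pending[key].append(ts)
        elif etype == "Complete" and pending[key]:
            start = pending[key].popleft()
            delta = (datetime.strptime(ts, "%H:%M:%S")
                     - datetime.strptime(start, "%H:%M:%S"))
            mm, ss = divmod(int(delta.total_seconds()), 60)
            results.append((start, loc, src, tgt, f"{mm:02d}:{ss:02d}"))
    # Starts with no later Complete (e.g. the second S1,D1 Start) are dropped
    return sorted(results)

rows = [
    ("14:00:43", "A", "S1", "D1", "Start", "Description1"),
    ("14:01:02", "A", "S1", "D2", "Start", "Description2"),
    ("14:01:43", "A", "S1", "D1", "Complete", "Description3"),
    ("14:03:02", "A", "S1", "D2", "Complete", "Description4"),
    ("14:03:43", "A", "S2", "D1", "Start", "Description5"),
    ("14:03:43", "A", "S1", "D1", "Start", "Description6"),
    ("14:04:53", "A", "S2", "D1", "Complete", "Description7"),
]

for rec in pair_durations(rows):
    print(",".join(rec))
```

Running this over the sample above reproduces the three output rows I showed.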
I thought I could leverage the fact that the individual files are already
sorted and that the filenames reveal which file comes first, or I could import
them and sort them all together at once. However, I'm not sure how to process
the files in that order and apply the grouping / sequence-based duration
extraction within each file.
Can I ask for your opinion or guidance / hints? Which approach better leverages
the parallelism of the Hadoop cluster?
Kind Regards,