OK, thanks. I'll definitely keep that in mind. I'm still using Pig 0.6.0 because the old code base I'm working on uses it.
Ah, so if the behavior you mentioned also applies to 0.6.0, then I would have no issue here :)
I was just thinking that, since different chunks get processed in different tasks that are shipped to different machines, reading past the record would not be possible, because the next record might not be available on that machine. (So is there actually an extra DFS access performed in order to read past the record?)

Will

----- Reply message -----
From: "Dmitriy Ryaboy" <[email protected]>
Date: Tue, Mar 1, 2011 22:05
Subject: Custom Slicer
To: "[email protected]" <[email protected]>
Cc: "Lai Will" <[email protected]>

Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can read up on what those entail in the Hadoop documentation and books.

As far as dealing with partial records at the beginning and end of the slice, the normal pattern is to always read a full record even if it takes you past the configured range, and to ignore any partial records at the beginning of a slice (because the previous slice will pick them up as part of its read).

So if I were to represent records as letters, and slice boundaries as dots, something like this:

aaabbb.bbccccdd.ddeee.eeee

would be read in as follows:

Slice 1: aaabbbbb
Slice 2: (skips bb) ccccdddd
Slice 3: (skips dd) eeeeeee
Slice 4: (skips eeee) -- nothing --

-D

On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <[email protected]> wrote:

Hello,

The data I want to process is XML. It boils down to

<element> ... </element>
<element> ... </element>

According to what I read in the documentation, when loading the file using the default Slicer, I end up with block-sized chunks that will very likely contain partial <element>s at the beginning and at the end. I don't want to ignore those. I want to slice at the element boundaries and have reasonably sized chunks (e.g. the largest chunk that is smaller than the block size and that contains only whole <element>s).

Unfortunately the user documentation is not very helpful to me, so can anyone help me with that? I found an XMLLoader in the Piggybank, but that does not solve my issue with slicing.

Best,
Will
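
For illustration, here is a minimal Python sketch of the slice-reading pattern described in this thread: skip the partial record at the start of a slice, and always finish the last record even if that means reading past the slice end. It makes several assumptions that are not from the thread: records are newline-delimited rather than XML elements, the whole file is an in-memory byte string, and `read_slice` with its parameters is a hypothetical name, not part of Pig or Hadoop.

```python
def read_slice(data: bytes, start: int, end: int, delim: bytes = b"\n") -> list:
    """Return the whole records owned by the byte range [start, end).

    Hypothetical helper: `data` stands in for the file, `delim` for the
    record terminator.
    """
    pos = start
    if start != 0:
        # Every slice except the first discards everything up to and
        # including the first delimiter; the previous slice reads that record.
        first = data.find(delim, start)
        if first == -1:
            return []
        pos = first + len(delim)

    records = []
    # A record belongs to this slice if it starts at or before `end`.
    # Reading it to completion may go past `end`.
    while pos <= end and pos < len(data):
        nxt = data.find(delim, pos)
        if nxt == -1:
            nxt = len(data)  # last record without a trailing delimiter
        records.append(data[pos:nxt])
        pos = nxt + len(delim)
    return records


if __name__ == "__main__":
    data = b"aaa\nbbbbb\ncccc\n"  # 15 bytes, cut into 6-byte slices
    for s, e in [(0, 6), (6, 12), (12, 15)]:
        print((s, e), read_slice(data, s, e))
    # (0, 6)   [b'aaa', b'bbbbb']   reads past byte 6 to finish 'bbbbb'
    # (6, 12)  [b'cccc']            skips the tail of 'bbbbb'
    # (12, 15) []                   only the tail of 'cccc', nothing to emit
```

In this sketch the reader simply indexes past `end` in the same byte string. On a distributed file system, the equivalent step is continuing to read from the same file stream past the slice boundary, which may fetch a small amount of data from a block stored on another machine.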
