This is a bad way to do it, but I wonder what would happen in that case.
Assume we have the following file:
<element>
...
</element>
<element>
...
</element>
and we use TextInputFormat to get single lines as records, then check whether
we find an <element> tag and buffer all the lines until we find the matching
</element> tag. Then we use an XML parser to parse the buffered string in order
to get our XML object.
From my understanding now, since Pig/Hadoop only makes sure that we read
entire records (here, single lines), our file could get sliced like this:
---slice1---
<element>
...
---slice2---
</element>
<element>
...
</element>
Then, since we read entire line records, each slice on its own is fine, but we
lose the first element because its lines end up in two different slices, right?
It's kind of a record granularity mismatch.
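To make the scenario concrete, here is a minimal sketch in Python rather than
a real Pig loader. The function name parse_elements and the <id> child tag are
made up for illustration, and each slice is modelled as a plain list of lines
that is buffered independently of the other slices.

```python
import xml.etree.ElementTree as ET

def parse_elements(lines, tag="element"):
    """Yield one parsed XML element per <tag>...</tag> block of lines."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    buffer = None  # None means we are currently outside an element
    for line in lines:
        stripped = line.strip()
        if buffer is None:
            if stripped == open_tag:
                buffer = [line]
            # Any other line outside an element (e.g. a stray closing tag)
            # is silently dropped.
        else:
            buffer.append(line)
            if stripped == close_tag:
                yield ET.fromstring("\n".join(buffer))
                buffer = None
    # An element that was opened but never closed is discarded here.

# Hypothetical file content: two elements with an illustrative <id> child.
whole_file = ["<element>", "<id>1</id>", "</element>",
              "<element>", "<id>2</id>", "</element>"]

# The slicing from the example above: the boundary falls inside element 1.
slice1, slice2 = whole_file[:2], whole_file[2:]

ids_whole = [e.findtext("id") for e in parse_elements(whole_file)]
ids_sliced = [e.findtext("id")
              for s in (slice1, slice2)
              for e in parse_elements(s)]

print(ids_whole)   # ['1', '2']
print(ids_sliced)  # ['2']
```

Under these assumptions the first element is lost: slice 1 never sees its
closing tag, and slice 2 starts with a closing tag it cannot match.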
Would be great if someone could confirm that.
Thanks,
Will
From: Dmitriy Ryaboy [mailto:[email protected]]
Sent: Tuesday, March 1, 2011 22:05
To: [email protected]
Cc: Lai Will
Subject: Re: Custom Slicer
Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can
read up on what those entail in the Hadoop documentation and books.
As far as dealing with partial records at the beginning and end of the slice,
the normal pattern is to always read a full record even if it takes you past
the configured range, and to ignore any partial records in the beginning of a
slice (because the previous slice will pick them up as part of its read). So if
I were to represent records as letters and slice boundaries as dots, a file
laid out like this:
aaabbb.bbccccdd.ddeee.eeee
Would be read in as follows:
Slice 1: aaabbbbb
Slice 2: (skips bb) ccccdddd
Slice 3: (skips dd) eeeeeee
Slice 4: (skips eeee) -- nothing --
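The read pattern above can be simulated with a short Python sketch. This is
not how a Hadoop RecordReader is implemented; record_spans and read_slice are
hypothetical helpers, a record is assumed to be a run of one repeated letter,
and a record that starts exactly on a boundary is assigned to the later slice,
which is just one possible convention.

```python
from itertools import groupby

def record_spans(data):
    """Return (start, end) offsets per record; a record is a run of one letter."""
    spans, pos = [], 0
    for _, run in groupby(data):
        length = len(list(run))
        spans.append((pos, pos + length))
        pos += length
    return spans

def read_slice(data, start, end):
    """Read every record that begins inside [start, end).

    A record is read in full even if it ends past `end`; a partial record
    at the beginning of the range is skipped because it began earlier.
    """
    return "".join(data[s:e] for s, e in record_spans(data) if start <= s < end)

layout = "aaabbb.bbccccdd.ddeee.eeee"   # dots mark the slice boundaries
data = layout.replace(".", "")

# Turn the dot positions into byte offsets within the undotted data.
bounds = [0]
for part in layout.split("."):
    bounds.append(bounds[-1] + len(part))

slices = [read_slice(data, a, b) for a, b in zip(bounds, bounds[1:])]
for number, content in enumerate(slices, start=1):
    print(f"Slice {number}: {content or '-- nothing --'}")
# Slice 1: aaabbbbb
# Slice 2: ccccdddd
# Slice 3: eeeeeee
# Slice 4: -- nothing --
```

Every record is read exactly once, by the slice in which it starts, so nothing
is lost or duplicated even though the boundaries cut through records.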
-D
On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <[email protected]> wrote:
Hello,
The data I want to process is XML. It boils down to
<element>
...
</element>
<element>
...
</element>
According to what I read in the documentation, when loading the file using the
default Slicer I end up with block-sized chunks that will very likely contain
partial <element>s at the beginning and at the end. I don't want to ignore
those.
I want to slice at the element boundaries and have reasonably sized chunks
(e.g. the largest chunk that is smaller than the block size and that contains
only whole <element>s).
Unfortunately the user documentation is not very helpful to me, so can anyone
help me with that?
I found an XMLLoader in the Piggybank, but that does not solve my issue with
slicing.
Best,
Will