Hi

I see. Thanks.

I changed the script to use XMLLoader, as follows:

rawData = load '$INPUT' using
org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode') as
(doc:chararray);
raw = FOREACH rawData GENERATE doc;

However I am getting this exception:

java.lang.RuntimeException: XML tag identifier 'en:ManagementNode' does not
match the regular expression /[a-zA-Z\_][0-9a-zA-Z\-_]+/

It must be caused by the namespace prefixes in my XML file:

<cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
xmlns:en="CLL-NB">
<cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName" vendorName="vendorName"/>
<cn:configData>
<en:ManagementNode xmlns:en="CLL-NB">
<en:neGroup>Group_1</en:neGroup>
<en:neVersion>2.1.0</en:neVersion>
<en:neId>100</en:neId>
<en:neName>TK0005</en:neName>
<en:neIp>192.168.0.2</en:neIp>
</en:ManagementNode>
<en:ManagementNode xmlns:en="CLL-NB">
<en:neGroup>Group_1</en:neGroup>
<en:neVersion>2.1.0</en:neVersion>
<en:neId>101</en:neId>
<en:neName>TK0002</en:neName>
<en:neIp>192.168.0.3</en:neIp>
</en:ManagementNode>
</cn:configData>
<cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
</cn:bulkCmConfigDataFile>

I looked at XMLLoader.java and found the pattern that the tag identifier has to match:

private static final String XMLTagNameRegExp = "[a-zA-Z\\_][0-9a-zA-Z\\-_]+";

So I was thinking of changing that string to
"[a-zA-Z\\_:][0-9a-zA-Z\\-_:]+" and redeploying. Would that work?

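For what it's worth, the rejection can be reproduced outside Pig. The sketch below is plain Java, not code from piggybank: the first pattern is copied from XMLLoader.java as quoted above, while the class name, the method name, and the relaxed pattern are hypothetical, and it assumes the loader requires the whole identifier to match. Note that ':' needs no backslash inside a character class, and a single backslash before ':' is not a valid escape in a Java string literal.

```java
import java.util.regex.Pattern;

public class TagNameRegexCheck {
    // Pattern copied from XMLLoader.java as quoted in this message.
    static final String ORIGINAL = "[a-zA-Z\\_][0-9a-zA-Z\\-_]+";

    // Hypothetical relaxed pattern that also accepts ':' (namespace prefixes).
    static final String WITH_COLON = "[a-zA-Z\\_:][0-9a-zA-Z\\-_:]+";

    // Returns true only if the whole tag identifier matches the pattern.
    static boolean isValidTag(String regex, String tag) {
        return Pattern.matches(regex, tag);
    }

    public static void main(String[] args) {
        System.out.println(isValidTag(ORIGINAL, "en:ManagementNode"));   // false
        System.out.println(isValidTag(ORIGINAL, "ManagementNode"));      // true
        System.out.println(isValidTag(WITH_COLON, "en:ManagementNode")); // true
    }
}
```

Whether relaxing the pattern is enough depends on the rest of XMLLoader, which may make other assumptions about tag names, so this only checks the regular expression itself.
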
Also, how could I use XPath?

raw = FOREACH rawData GENERATE
XPath(doc,'en:ManagementNode/en:neGroup'),
XPath(doc,'en:ManagementNode/en:neVersion'),
XPath(doc,'en:ManagementNode/en:neId'),
XPath(doc,'en:ManagementNode/en:neName');
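
One way to see which expressions select these fields is to try them with the JDK's javax.xml.xpath API before wiring them into Pig. The sketch below is not the piggybank XPath UDF, whose handling of namespace prefixes may differ; the class name, method names, and SAMPLE constant are hypothetical. It sidesteps the 'en:' prefix by matching on local-name(), and assumes doc holds a single <en:ManagementNode> element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ManagementNodeXPath {
    // One record taken from the XML in this message.
    static final String SAMPLE =
        "<en:ManagementNode xmlns:en=\"CLL-NB\">"
        + "<en:neGroup>Group_1</en:neGroup>"
        + "<en:neVersion>2.1.0</en:neVersion>"
        + "<en:neId>100</en:neId>"
        + "<en:neName>TK0005</en:neName>"
        + "<en:neIp>192.168.0.2</en:neIp>"
        + "</en:ManagementNode>";

    // Builds a prefix-independent path, e.g. /*[local-name()='a']/*[local-name()='b'].
    static String localPath(String... names) {
        StringBuilder sb = new StringBuilder();
        for (String name : names) {
            sb.append("/*[local-name()='").append(name).append("']");
        }
        return sb.toString();
    }

    // Parses the XML with namespaces enabled and returns the text of the first match.
    static String extract(String xml, String... names) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(localPath(names), doc);
        } catch (Exception e) {
            throw new RuntimeException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, "ManagementNode", "neGroup")); // Group_1
        System.out.println(extract(SAMPLE, "ManagementNode", "neId"));    // 100
    }
}
```

Matching on local-name() ignores which namespace an element belongs to; if that matters, a NamespaceContext mapping 'en' to 'CLL-NB' would be the stricter alternative.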

My command looks like

pig -x tez -m /home/hduser/test/param.txt -f /home/hduser/test/script.pig


Thanks.




J. Reyes.



On 15 November 2015 at 22:55, Rajesh Balamohan <[email protected]>
wrote:

> TFileLoader cannot parse XML files. The script posted here tries to parse an XML
> file via TFileLoader, which could be causing the issue.
>
>
> https://pig.apache.org/docs/r0.15.0/api/org/apache/pig/piggybank/storage/XMLLoader.html
> in piggybank.jar might be useful for parsing XML contents. You can refer to
> https://github.com/apache/pig/blob/a44b85a0ab941cdd1d2d7f6e457303aef1e57501/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestXMLLoader.java
> for an example.
>
>
> If you are interested in using Pig with Tez, you need to run "pig -x tez" to
> tell Pig to use the Tez execution engine instead of MR.
>
> ~Rajesh.B
>
> On Sun, Nov 15, 2015 at 1:11 AM, Julian Reyes <[email protected]>
> wrote:
>
> > Hi,
> >
> > I am just getting started with Pig and trying to get familiar with it, but I
> > am having problems reading an XML file.
> >
> > My XML looks like the following (of course, it's much bigger; I just included
> > the first entries):
> >
> > <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
> > xmlns:en="CLL-NB">
> > <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName" vendorName="vendorName"/>
> > <cn:configData>
> > <en:ManagementNode xmlns:en="CLL-NB">
> > <en:neGroup>Group_1</en:neGroup>
> > <en:neVersion>2.1.0</en:neVersion>
> > <en:neId>100</en:neId>
> > <en:neName>TK0005</en:neName>
> > <en:neIp>192.168.0.2</en:neIp>
> > </en:ManagementNode>
> > <en:ManagementNode xmlns:en="CLL-NB">
> > <en:neGroup>Group_1</en:neGroup>
> > <en:neVersion>2.1.0</en:neVersion>
> > <en:neId>101</en:neId>
> > <en:neName>TK0002</en:neName>
> > <en:neIp>192.168.0.3</en:neIp>
> > </en:ManagementNode>
> > </cn:configData>
> > <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
> > </cn:bulkCmConfigDataFile>
> >
> > And the Pig script I am trying to use is the following:
> >
> >
> > set pig.splitCombination false;
> > set tez.grouping.min-size 5242880;
> > set tez.grouping.max-size 5242880;
> >
> > register '/usr/lib/tez/tez-0.7.0/tez-tfile-parser-0.7.0.jar';
> >
> > DEFINE getDetails(raw) RETURNS void {
> >         details = FOREACH raw GENERATE configData;
> >         distinctDetails = DISTINCT details;
> >         STORE distinctDetails INTO '$DETAILS' USING PigStorage(',');
> > }
> >
> >
> > rmf $NODE_DETAILS
> > rawLogs = load '/user/hduser/test/test01/ManagementNode.xml' using
> > org.apache.tez.tools.TFileLoader() as (configData:chararray, key:chararray,
> > line:chararray);
> > raw = FOREACH rawLogs GENERATE ManagementNode,key,line;
> >
> > getDetails(raw);
> > exec;
> >
> > However, I am getting the following error:
> >
> > ERROR 2998: Unhandled internal error. null
> >
> > java.lang.StackOverflowError
> >         at org.apache.tez.tools.TFileLoader.hashCode(TFileLoader.java:148)
> >         at java.util.Arrays.hashCode(Arrays.java:3140)
> > ...
> >
> > Could it be because of the XML file?
> >
> > Thanks.
> >
> >
> > J. Reyes.
> >
>
>
>
> --
> ~Rajesh.B
>
