Hi,
Well, I was able to get the desired result by using the following script:

REGISTER '/usr/lib/pig/svn_pig/contrib/piggybank/java/piggybank.jar';
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPathAll();

-- Read through the node details to find out enbId
DEFINE getNodeDetails(nodeDetails) RETURNS void {
    nodes = FOREACH nodeDetails GENERATE
        FLATTEN(($0)), FLATTEN(($1)), FLATTEN(($2)), FLATTEN(($3));
    STORE nodes INTO '$NODE_DETAILS' USING PigStorage(',');
}

rmf $NODE_DETAILS
rawNodeDetails = load '$INPUT_DETAILS' using
    org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode') as (doc:chararray);
nodeDetails = FOREACH rawNodeDetails GENERATE
    XPath(doc,'ManagementNode/neGroup'),
    XPath(doc,'ManagementNode/neVersion'),
    XPath(doc,'ManagementNode/neId'),
    XPath(doc,'ManagementNode/neName');

getNodeDetails(nodeDetails);
exec;

Now I just need to open other XML files based on column $2 and generate
more outputs.
J. Reyes.
On 21 November 2015 at 17:44, Julian Reyes <[email protected]>
wrote:
> Hello,
>
> I was able to parse the XML by modifying XMLLoader.java.
>
> I set XMLTagNameRegExp as follows: "[a-zA-Z:\\_][0-9a-zA-Z:\\-_]+" and
> now it seems to be working.
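
For reference, a minimal standalone Java sketch of why the original pattern rejects the tag while the modified one accepts it. It assumes XMLLoader checks the whole tag name against the pattern (a full match, as the exception message suggests); the class name TagNameCheck and the helper isValidTag are made up for illustration and are not part of piggybank.

```java
import java.util.regex.Pattern;

public class TagNameCheck {
    // Pattern quoted from XMLLoader.java earlier in this thread.
    static final String ORIGINAL = "[a-zA-Z\\_][0-9a-zA-Z\\-_]+";
    // Modified pattern from this thread: ':' added to both character classes.
    static final String MODIFIED = "[a-zA-Z:\\_][0-9a-zA-Z:\\-_]+";

    // Hypothetical helper: true if the entire tag name matches the pattern.
    // Assumption: XMLLoader performs a full match rather than a partial find.
    static boolean isValidTag(String regex, String tag) {
        return Pattern.matches(regex, tag);
    }

    public static void main(String[] args) {
        String tag = "en:ManagementNode";
        // The original classes have no ':' so the prefixed name is rejected.
        System.out.println("original: " + isValidTag(ORIGINAL, tag));
        // With ':' allowed, the namespace-prefixed name is accepted.
        System.out.println("modified: " + isValidTag(MODIFIED, tag));
    }
}
```

An unprefixed name such as ManagementNode passes both patterns, so the change only widens what is accepted.
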
>
> However the output looks like:
>
> ((Group_1),(2.1.0),(100),(TK0005))
> ((Group_1),(2.1.0),(101),(TK0002))
>
> But I would like to store it into a csv file that looks like
>
> Group_1,2.1.0,100,TK0005
> Group_1,2.1.0,101,TK0002
>
> Also, I need to keep opening more XML files, but the names of those files
> depend on the third column, so 100.xml, 101.xml, etc.
>
> How could I open those files in the same pig script and generate different
> outputs?
>
> My pig script:
>
> rmf $NODE_DETAILS
> rawData = load '$INPUT_LOGS' using
> org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode') as
> (doc:chararray);
> raw = FOREACH rawData GENERATE
> XPath(doc,'ManagementNode/neGroup'),XPath(doc,'ManagementNode/neVersion'),XPath(doc,'ManagementNode/neId'),XPath(doc,'ManagementNode/neName');
>
> --getNodeDetails(raw);
> --exec;
>
> I also tried the following macro to get rid of the parentheses, but I am
> getting exceptions:
>
> -- Read through the node details to find out enbId
> DEFINE getNodeDetails(raw) RETURNS void {
> details = FOREACH raw GENERATE
> FLATTEN(neGroup,neVersion,neId,neName);
> distinctDetails = DISTINCT details PARALLEL 1;
> STORE distinctDetails INTO '$NODE_DETAILS' USING PigStorage('\t');
> }
>
>
> Regards,
> Thanks.
>
>
> J. Reyes.
>
>
>
> On 16 November 2015 at 17:50, Julian Reyes <[email protected]>
> wrote:
>
>> Hi
>>
>> I see. Thanks.
>>
>> I just changed it and used XMLLoader as follows:
>>
>> rawData = load '$INPUT' using
>> org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode') as
>> (doc:chararray);
>> raw = FOREACH rawData GENERATE doc;
>>
>> However I am getting this exception:
>>
>> java.lang.RuntimeException: XML tag identifier 'en:ManagementNode' does
>> not match the regular expression /[a-zA-Z\_][0-9a-zA-Z\-_]+/
>>
>> It has to be because of my XML file:
>>
>> <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
>> xmlns:en="CLL-NB">
>> <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName"
>> vendorName="vendorName"/>
>> <cn:configData>
>> <en:ManagementNode xmlns:en="CLL-NB">
>> <en:neGroup>Group_1</en:neGroup>
>> <en:neVersion>2.1.0</en:neVersion>
>> <en:neId>100</en:neId>
>> <en:neName>TK0005</en:neName>
>> <en:neIp>192.168.0.2</en:neIp>
>> </en:ManagementNode>
>> <en:ManagementNode xmlns:en="CLL-NB">
>> <en:neGroup>Group_1</en:neGroup>
>> <en:neVersion>2.1.0</en:neVersion>
>> <en:neId>101</en:neId>
>> <en:neName>TK0002</en:neName>
>> <en:neIp>192.168.0.3</en:neIp>
>> </en:ManagementNode>
>> </cn:configData>
>> <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
>> </cn:bulkCmConfigDataFile>
>>
>> I was looking at XMLLoader.java and I see the string that should match
>>
>> private static final String XMLTagNameRegExp = "[a-zA-Z\\_][0-9a-zA-Z\\-_]+";
>>
>> So I was thinking of maybe changing that String to
>> "[a-zA-Z\\_:][0-9a-zA-Z\\-_:]+" and redeploying?
>>
>> Also, how could I use XPath?
>>
>> raw = FOREACH rawData GENERATE
>> XPath(doc,'en:ManagementNode/en:neGroup'),XPath(doc,'en:ManagementNode/en:neVersion'),XPath(doc,'en:ManagementNode/en:neId'),XPath(doc,'en:ManagementNode/en:neName');
>>
>> My command looks like
>>
>> pig -x tez -m /home/hduser/test/param.txt -f /home/hduser/test/script.pig
>>
>>
>> Thanks.
>>
>>
>>
>>
>> J. Reyes.
>>
>>
>>
>> On 15 November 2015 at 22:55, Rajesh Balamohan <
>> [email protected]> wrote:
>>
>>> TFileLoader cannot parse XML files. The script posted here tries to parse
>>> an XML file via TFileLoader, which could be causing the issue.
>>>
>>>
>>> https://pig.apache.org/docs/r0.15.0/api/org/apache/pig/piggybank/storage/XMLLoader.html
>>> in piggybank.jar might be useful for parsing XML contents. You can refer to
>>>
>>> https://github.com/apache/pig/blob/a44b85a0ab941cdd1d2d7f6e457303aef1e57501/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestXMLLoader.java
>>> for an example.
>>>
>>>
>>> If you are interested in using Pig on Tez, you need to run "pig -x tez" to
>>> tell Pig to use the Tez execution engine instead of MR.
>>>
>>> ~Rajesh.B
>>>
>>> On Sun, Nov 15, 2015 at 1:11 AM, Julian Reyes <[email protected]>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I was just trying to get started with Pig and get familiar with it, but
>>> > I am running into problems while reading the XML.
>>> >
>>> > My XML looks like the following (of course, it's much bigger; I just
>>> > added the first entries):
>>> >
>>> > <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
>>> > xmlns:en="CLL-NB">
>>> > <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName"
>>> > vendorName="vendorName"/>
>>> > <cn:configData>
>>> > <en:ManagementNode xmlns:en="CLL-NB">
>>> > <en:neGroup>Group_1</en:neGroup>
>>> > <en:neVersion>2.1.0</en:neVersion>
>>> > <en:neId>100</en:neId>
>>> > <en:neName>TK0005</en:neName>
>>> > <en:neIp>192.168.0.2</en:neIp>
>>> > </en:ManagementNode>
>>> > <en:ManagementNode xmlns:en="CLL-NB">
>>> > <en:neGroup>Group_1</en:neGroup>
>>> > <en:neVersion>2.1.0</en:neVersion>
>>> > <en:neId>101</en:neId>
>>> > <en:neName>TK0002</en:neName>
>>> > <en:neIp>192.168.0.3</en:neIp>
>>> > </en:ManagementNode>
>>> > </cn:configData>
>>> > <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
>>> > </cn:bulkCmConfigDataFile>
>>> >
>>> > And the Pig script I am trying to use is the following:
>>> >
>>> >
>>> > set pig.splitCombination false;
>>> > set tez.grouping.min-size 5242880;
>>> > set tez.grouping.max-size 5242880;
>>> >
>>> > register '/usr/lib/tez/tez-0.7.0/tez-tfile-parser-0.7.0.jar';
>>> >
>>> > DEFINE getDetails(raw) RETURNS void {
>>> > details = FOREACH raw GENERATE configData;
>>> > distinctDetails = DISTINCT details;
>>> > STORE distinctDetails INTO '$DETAILS' USING PigStorage(',');
>>> > }
>>> >
>>> >
>>> > rmf $NODE_DETAILS
>>> > rawLogs = load '/user/hduser/test/test01/ManagementNode.xml' using
>>> > org.apache.tez.tools.TFileLoader() as (configData:chararray,
>>> > key:chararray, line:chararray);
>>> > raw = FOREACH rawLogs GENERATE ManagementNode,key,line;
>>> >
>>> > getDetails(raw);
>>> > exec;
>>> >
>>> > However, I am getting the following error:
>>> >
>>> > ERROR 2998: Unhandled internal error. null
>>> >
>>> > java.lang.StackOverflowError
>>> > at org.apache.tez.tools.TFileLoader.hashCode(TFileLoader.java:148)
>>> > at java.util.Arrays.hashCode(Arrays.java:3140)
>>> > ...
>>> >
>>> > Could it be because of the XML file?
>>> >
>>> > Thanks.
>>> >
>>> >
>>> > J. Reyes.