Hey Prashant,

Yup, I can take a stab at it. This is the first time I am looking at Pig
code, so I might take some time to get started. Will get back to you if I
have questions in the meantime. And yes, I will write it so that it reads a
Pig property.

-
Bhooshan.


On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi
<[email protected]> wrote:

> Hi Bhooshan,
>
> This makes more sense now. I think overriding the fs implementation should
> go in core-site.xml, but it would be useful to be able to add resources if
> you have a bunch of other properties.
>
> Would you like to submit a patch? It should be based on a Pig property
> that specifies the additional resource names (myfs-site.xml, in your case).
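>
> For illustration only (the property name below is hypothetical, not taken
> from any existing patch), such a property could live in pig.properties as a
> comma-separated list of extra resources:
>
>     pig.additional.conf.resources=myfs-site.xml
>
> which Pig could then feed to JobConf.addResource() for each listed file
> while setting up the job.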
>
> -Prashant
>
>
> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <[email protected]
> > wrote:
>
>> Hi Prashant,
>>
>>
>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>> scenario that I am trying to test -
>>
>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
>> filesystem I am trying to implement; let's call it MyFileSystem.class.
>> This filesystem uses the scheme myfs:// for its URIs.
>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>> made the class available through a jar file that is part of
>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>> 3. In MyFileSystem.class, I have a static block as -
>> static {
>>     Configuration.addDefaultResource("myfs-default.xml");
>>     Configuration.addDefaultResource("myfs-site.xml");
>> }
>> Both these files are in the classpath. To be safe, I have also added
>> myfs-site.xml in the constructor of MyFileSystem as
>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>> resources and the non-default resources in the Configuration object.
>> 4. I am trying to access the filesystem in my pig script as -
>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>> (name:chararray, age:int); -- loading data
>> B = FOREACH A GENERATE name;
>> STORE B INTO 'myfs://myhost.com:8999/testoutput';
>> 5. The execution starts correctly, and MyFileSystem.class is invoked. In
>> MyFileSystem.class, I can also see that myfs-site.xml is loaded and the
>> properties defined in it are available.
>> 6. However, when Pig tries to submit the job, it cannot find these
>> properties and the job fails to submit.
>> 7. If I move all the properties defined in myfs-site.xml to
>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>> However, this is not ideal, as I do not want to clutter core-site.xml
>> with all of the properties for a separate filesystem.
>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>> that while creating the JobConf object for a job, Pig adds very specific
>> resources to the job object and ignores the resources that may have
>> already been added (e.g. myfs-site.xml) in the Configuration object.
>> 9. I have tested this with native map-reduce code as well as hive, and
>> this approach of having a separate config file for MyFileSystem works fine
>> in both those cases.
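>>
>> For reference, the fs.myfs.impl mapping from step 2 would look something
>> like this in core-site.xml (the package name here is just a placeholder):
>>
>>     <property>
>>         <name>fs.myfs.impl</name>
>>         <value>com.example.fs.MyFileSystem</value>
>>     </property>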
>>
>> So, to summarize, I am looking for a way to ask Pig to load parameters
>> from my own config file before submitting a job.
>>
>> Thanks,
>> -
>> Bhooshan.
>>
>>
>>
>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]
>> > wrote:
>>
>>> +User group
>>>
>>> Hi Bhooshan,
>>>
>>> By default you should be running in MapReduce mode unless specified
>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>> provide your code here?
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]>
>>> wrote:
>>>
>>>  Apologies for the premature send. I may have some more information.
>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>> I saw an NPE (stacktrace below) and a message saying pig was running in
>>> exectype local -
>>>
>>> 2013-04-13 07:37:13,758 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>>> to hadoop file system at: local
>>> 2013-04-13 07:37:13,760 [main] WARN
>>> org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is
>>> deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1200: Pig script failed to parse:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>
>>>
>>> Here is the stack trace:
>>>
>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>> during parsing. Pig script failed to parse:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>         at
>>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>         at
>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>         at
>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>         at
>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>         at
>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>         at org.apache.pig.Main.run(Main.java:555)
>>>         at org.apache.pig.Main.main(Main.java:111)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>> Caused by: Failed to parse: Pig script failed to parse:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>         ... 14 more
>>> Caused by:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>         at
>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>         ... 15 more
>>>
>>>
>>>
>>>
>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>> [email protected]> wrote:
>>>
>>>> Yes; however, I did not add core-site.xml, hdfs-site.xml, or
>>>> yarn-site.xml, only my-filesystem-site.xml, using both
>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>
>>>> I see what you are saying, though. The patch might require users to take
>>>> care of adding the default config resources as well, apart from their own
>>>> resources?
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>> [email protected]> wrote:
>>>>
>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>>>>> configuration resources?
>>>>>
>>>>>
>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Prashant,
>>>>>>
>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>> reply. I was not subscribed to the dev mailing list and hence did not
>>>>>> get a notification about your reply. I have copied our thread below so
>>>>>> you can get some context.
>>>>>>
>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>> looks like Pig is unable to find core-site.xml. It indicates that it
>>>>>> is running the script in local mode in spite of fs.default.name being
>>>>>> defined as the location of the HDFS namenode.
>>>>>>
>>>>>> Here is what I am trying to do - I have developed my own
>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>>>> my pig script. This implementation requires its own *-default and
>>>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH
>>>>>> as well as HADOOP_CLASSPATH and can confirm that hadoop can find these
>>>>>> files, as I am able to read these configurations in my code. However,
>>>>>> the pig code cannot find these configuration parameters. Upon doing some
>>>>>> debugging in the pig code, it seems to me that pig does not use all the
>>>>>> resources added in the Configuration object, but only seems to use
>>>>>> certain specific ones like hadoop-site.xml, core-site.xml,
>>>>>> pig-cluster-hadoop-site.xml, yarn-site.xml, and hdfs-site.xml (I am
>>>>>> looking at HExecutionEngine.java). Is it possible to have Pig load
>>>>>> user-defined resources, say foo-default.xml and foo-site.xml, while
>>>>>> creating the JobConf object? I am narrowing in on this as the problem,
>>>>>> because Pig can find my config parameters if I define them in
>>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>>
>>>>>> Let me know if you need more details about the issue.
>>>>>>
>>>>>>
>>>>>> Here is our previous conversation -
>>>>>>
>>>>>> Hi Bhooshan,
>>>>>>
>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>> (unreleased). Take a look and see if you can apply the patch to the
>>>>>> version you are using:
>>>>>> https://issues.apache.org/jira/browse/PIG-3135
>>>>>>
>>>>>> With this patch, the following property will allow you to override the
>>>>>> default and pass in your own configuration.
>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal 
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> > Hi Folks,
>>>>>> >
>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage
>>>>>> > system at work. This implementation uses some config files that are
>>>>>> > similar in structure to Hadoop config files. They have a *-default.xml
>>>>>> > and a *-site.xml for users to override default properties. In the class
>>>>>> > that implemented the Hadoop FileSystem, I had added these configuration
>>>>>> > files as default resources in a static block using
>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine,
>>>>>> > and we were able to run the Hadoop FileSystem CLI and map-reduce jobs
>>>>>> > just fine for our storage system. However, when we tried using this
>>>>>> > storage system in pig scripts, we saw errors indicating that our
>>>>>> > configuration parameters were not available. Upon further debugging, we
>>>>>> > saw that the config files were added to the Configuration object as
>>>>>> > resources, but were part of defaultResources. However, in Main.java in
>>>>>> > the pig source, we saw that the Configuration object was created as
>>>>>> > Configuration conf = new Configuration(false);, thereby setting
>>>>>> > loadDefaults to false in the conf object. As a result, properties from
>>>>>> > the default resources (including my config files) were not loaded and
>>>>>> > hence unavailable.
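>>>>>> >
>>>>>> > To illustrate the loadDefaults behavior described above (a sketch of
>>>>>> > Hadoop's Configuration semantics, not Pig's actual code):
>>>>>> >
>>>>>> >     // loads core-default.xml, core-site.xml and anything registered
>>>>>> >     // via Configuration.addDefaultResource(...)
>>>>>> >     Configuration withDefaults = new Configuration();
>>>>>> >
>>>>>> >     // loadDefaults == false: all default resources are skipped
>>>>>> >     Configuration noDefaults = new Configuration(false);
>>>>>> >
>>>>>> >     // resources added explicitly are still read in both cases
>>>>>> >     noDefaults.addResource("my-site.xml");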
>>>>>> >
>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>> > Configuration.addDefaultResource, but we still could not figure out
>>>>>> > why Pig does not use default resources.
>>>>>> >
>>>>>> > Could someone on the list explain why this is the case?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > --
>>>>>> > Bhooshan
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bhooshan
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>
>>>
>>>
>>> --
>>> Bhooshan
>>>
>>>
>>
>>
>> --
>> Bhooshan
>>
>
>


-- 
Bhooshan
