Pig actually does not add core-site.xml or hadoop-site.xml explicitly; it merely looks for these resources to be present on the classpath.
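That classpath lookup is just an ordinary Java resource search. A minimal plain-Java sketch of the idea (illustrative only, no Hadoop dependency; the real lookup happens inside Hadoop's Configuration class):

```java
import java.net.URL;

public class ClasspathLookup {
    // Illustrative sketch: Hadoop's Configuration ultimately resolves a named
    // resource such as "core-site.xml" through a plain classpath lookup.
    public static URL find(String name) {
        return Thread.currentThread().getContextClassLoader().getResource(name);
    }

    public static void main(String[] args) {
        // Returns null when core-site.xml is not on the classpath; Pig then
        // proceeds without it rather than failing outright.
        System.out.println(find("core-site.xml"));
    }
}
```

This matches the behavior reported later in the thread: when Pig cannot find core-site.xml on the classpath, it silently falls back to local mode instead of erroring out.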
JobConf is the interface that describes MapReduce specifics to Hadoop, and Pig uses it to define jobs for execution. It loads up mapred*.xml. It does extend Configuration and uses the properties loaded by it.

On Mon, Apr 15, 2013 at 5:34 PM, Bhooshan Mogal <[email protected]> wrote:

> Thanks! Quick question before starting this though. Since resources are
> added to the Configuration object in various classes in hadoop
> (Configuration.java adds core-*.xml, HDFSConfiguration.java adds
> hdfs-*.xml), why does Pig create a new JobConf object with selected
> resources before submitting a job and not reuse the Configuration object
> that may have been created earlier? Trying to understand why Pig adds
> core-site.xml, hdfs-site.xml, yarn-site.xml again.
>
> On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi
> <[email protected]> wrote:
>
>> Sounds good. Here is a doc on contributing a patch (for some pointers):
>> https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>>
>> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal
>> <[email protected]> wrote:
>>
>>> Hey Prashant,
>>>
>>> Yup, I can take a stab at it. This is the first time I am looking at Pig
>>> code, so I might take some time to get started. Will get back to you if
>>> I have questions in the meantime. And yes, I will write it so it reads a
>>> pig property.
>>>
>>> -
>>> Bhooshan.
>>>
>>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi
>>> <[email protected]> wrote:
>>>
>>>> Hi Bhooshan,
>>>>
>>>> This makes more sense now. I think overriding the fs implementation
>>>> should go into core-site.xml, but it would be useful to be able to add
>>>> resources if you have a bunch of other properties.
>>>>
>>>> Would you like to submit a patch? It should be based on a pig property
>>>> that suggests the additional resource names (myfs-site.xml in your
>>>> case).
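The property-based approach proposed above could be sketched as follows. This is not actual Pig code: the property name pig.additional.conf.resources and the helper class are hypothetical, shown only to illustrate reading a comma-separated resource list from a Pig property and feeding each entry to the job configuration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class ExtraResources {
    // Hypothetical helper: parse a comma-separated Pig property naming
    // additional *-site.xml files to add to the job configuration.
    public static List<String> parse(Properties props) {
        // "pig.additional.conf.resources" is a made-up property name.
        String extra = props.getProperty("pig.additional.conf.resources");
        if (extra == null) {
            return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        for (String r : extra.split(",")) {
            String name = r.trim();
            if (!name.isEmpty()) {
                // Each entry would then be passed to jobConf.addResource(name).
                out.add(name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("pig.additional.conf.resources", "myfs-default.xml, myfs-site.xml");
        System.out.println(parse(p)); // [myfs-default.xml, myfs-site.xml]
    }
}
```

The actual patch would hook such a parse step into the spot where Pig builds the JobConf, so user-named resources survive job submission alongside core-site.xml and friends.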
>>>> -Prashant
>>>>
>>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>>>> scenario that I am trying to test -
>>>>>
>>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for
>>>>> a filesystem I am trying to implement - let's call it
>>>>> MyFileSystem.class. This filesystem uses the scheme myfs:// for its
>>>>> URIs.
>>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>>>> made the class available through a jar file that is part of
>>>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>>> 3. In MyFileSystem.class, I have a static block as -
>>>>> static {
>>>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>>>     Configuration.addDefaultResource("myfs-site.xml");
>>>>> }
>>>>> Both these files are on the classpath. To be safe, I have also added
>>>>> myfs-site.xml in the constructor of MyFileSystem as
>>>>> conf.addResource("myfs-site.xml"), so that it is part of both the
>>>>> default resources as well as the non-default resources in the
>>>>> Configuration object.
>>>>> 4. I am trying to access the filesystem in my pig script as -
>>>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>>>>     (name:chararray, age:int); -- loading data
>>>>> B = FOREACH A GENERATE name;
>>>>> STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>>>> invoked correctly. In MyFileSystem.class, I can also see that
>>>>> myfs-site.xml is loaded and the properties defined in it are
>>>>> available.
>>>>> 6. However, when Pig tries to submit the job, it cannot find these
>>>>> properties and the job fails to submit successfully.
>>>>> 7. If I move all the properties defined in myfs-site.xml to
>>>>> core-site.xml, the job gets submitted successfully, and it even
>>>>> succeeds. However, this is not ideal, as I do not want to proliferate
>>>>> core-site.xml with all of the properties for a separate filesystem.
>>>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>>>> that while creating the JobConf object for a job, pig adds very
>>>>> specific resources to the job object, and ignores the resources that
>>>>> may have been added already (e.g. myfs-site.xml) in the Configuration
>>>>> object.
>>>>> 9. I have tested this with native map-reduce code as well as hive, and
>>>>> this approach of having a separate config file for MyFileSystem works
>>>>> fine in both those cases.
>>>>>
>>>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>>>> from my own config file before submitting a job.
>>>>>
>>>>> Thanks,
>>>>> -
>>>>> Bhooshan.
>>>>>
>>>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> +User group
>>>>>>
>>>>>> Hi Bhooshan,
>>>>>>
>>>>>> By default you should be running in MapReduce mode unless specified
>>>>>> otherwise. Are you creating a PigServer object to run your jobs? Can
>>>>>> you provide your code here?
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Apologies for the premature send. I may have some more information.
>>>>>> After I applied the patch and set
>>>>>> "pig.use.overriden.hadoop.configs=true", I saw an NPE (stacktrace
>>>>>> below) and a message saying pig was running in exectype local -
>>>>>>
>>>>>> 2013-04-13 07:37:13,758 [main] INFO
>>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>>>>> Connecting to hadoop file system at: local
>>>>>> 2013-04-13 07:37:13,760 [main] WARN
>>>>>> org.apache.hadoop.conf.Configuration -
>>>>>> mapred.used.genericoptionsparser is deprecated. Instead, use
>>>>>> mapreduce.client.genericoptionsparser.used
>>>>>> 2013-04-13 07:37:14,162 [main] ERROR
>>>>>> org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to
>>>>>> parse:
>>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>>> java.lang.NullPointerException
>>>>>>
>>>>>> Here is the stacktrace -
>>>>>>
>>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>>>>> during parsing. Pig script failed to parse:
>>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>>> java.lang.NullPointerException
>>>>>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>>>     at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>>>     at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>>     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>>     at org.apache.pig.Main.run(Main.java:555)
>>>>>>     at org.apache.pig.Main.main(Main.java:111)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>>> java.lang.NullPointerException
>>>>>>     at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>>     ... 14 more
>>>>>> Caused by:
>>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>>> java.lang.NullPointerException
>>>>>>     at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>>     at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>>     at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>>     at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>>     at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>>     at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>>     at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>>     ... 15 more
>>>>>>
>>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml,
>>>>>>> yarn-site.xml. Only my-filesystem-site.xml, using both
>>>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>>>
>>>>>>> I see what you are saying though. The patch might require users to
>>>>>>> take care of adding the default config resources as well, apart from
>>>>>>> their own resources?
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add
>>>>>>>> your configuration resources?
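The override being asked about here is a plain Pig property. A hedged sketch of the relevant pig.properties fragment (exact behavior depends on the PIG-3135 patch, and, as the exchange above suggests, enabling it may oblige the user to supply the standard Hadoop resources too, not just their own):

```properties
# Sketch per PIG-3135 (the "overriden" spelling is as it appears in the patch):
# tell Pig to use caller-supplied Hadoop configuration resources instead of
# discovering core-site.xml etc. on its own
pig.use.overriden.hadoop.configs=true
```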
>>>>>>>>
>>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Prashant,
>>>>>>>>>
>>>>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>>>>> reply. I was not subscribed to the dev mailing list and hence did
>>>>>>>>> not get a notification about your reply. I have copied our thread
>>>>>>>>> below so you can get some context.
>>>>>>>>>
>>>>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>>>>> looks like pig is unable to find core-site.xml. It indicates that
>>>>>>>>> it is running the script in local mode in spite of having
>>>>>>>>> fs.default.name defined as the location of the HDFS namenode.
>>>>>>>>>
>>>>>>>>> Here is what I am trying to do - I have developed my own
>>>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to
>>>>>>>>> use it in my pig script. This implementation requires its own
>>>>>>>>> *-default.xml and *-site.xml files. I have added the path to these
>>>>>>>>> files in PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm
>>>>>>>>> that hadoop can find these files, as I am able to read these
>>>>>>>>> configurations in my code. However, pig code cannot find these
>>>>>>>>> configuration parameters. Upon doing some debugging in the pig
>>>>>>>>> code, it seems to me that pig does not use all the resources added
>>>>>>>>> in the Configuration object, but only seems to use certain
>>>>>>>>> specific ones like hadoop-site.xml, core-site.xml,
>>>>>>>>> pig-cluster-hadoop-site.xml, yarn-site.xml and hdfs-site.xml (I am
>>>>>>>>> looking at HExecutionEngine.java). Is it possible to have pig load
>>>>>>>>> user-defined resources like, say, foo-default.xml and foo-site.xml
>>>>>>>>> while creating the JobConf object? I am narrowing in on this as
>>>>>>>>> the problem, because pig can find my config parameters if I define
>>>>>>>>> them in core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>>
>>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>>
>>>>>>>>> Here is our previous conversation -
>>>>>>>>>
>>>>>>>>> Hi Bhooshan,
>>>>>>>>>
>>>>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>>>>> (unreleased). Take a look and see if you can apply the patch to
>>>>>>>>> the version you are using:
>>>>>>>>> https://issues.apache.org/jira/browse/PIG-3135
>>>>>>>>>
>>>>>>>>> With this patch, the following property will allow you to override
>>>>>>>>> the default and pass in your own configuration:
>>>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>>>
>>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> > Hi Folks,
>>>>>>>>> >
>>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a
>>>>>>>>> > storage system at work. This implementation uses some config
>>>>>>>>> > files that are similar in structure to hadoop config files. They
>>>>>>>>> > have a *-default.xml and a *-site.xml for users to override
>>>>>>>>> > default properties. In the class that implemented the Hadoop
>>>>>>>>> > FileSystem, I had added these configuration files as default
>>>>>>>>> > resources in a static block using
>>>>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was
>>>>>>>>> > working fine, and we were able to run the Hadoop Filesystem CLI
>>>>>>>>> > and map-reduce jobs just fine for our storage system. However,
>>>>>>>>> > when we tried using this storage system in pig scripts, we saw
>>>>>>>>> > errors indicating that our configuration parameters were not
>>>>>>>>> > available. Upon further debugging, we saw that the config files
>>>>>>>>> > were added to the Configuration object as resources, but were
>>>>>>>>> > part of defaultResources. However, in Main.java in the pig
>>>>>>>>> > source, we saw that the Configuration object was created as
>>>>>>>>> > Configuration conf = new Configuration(false);, thereby setting
>>>>>>>>> > loadDefaults to false in the conf object. As a result,
>>>>>>>>> > properties from the default resources (including my config
>>>>>>>>> > files) were not loaded and hence were unavailable.
>>>>>>>>> >
>>>>>>>>> > We solved the problem by using Configuration.addResource instead
>>>>>>>>> > of Configuration.addDefaultResource, but still could not figure
>>>>>>>>> > out why Pig does not use default resources.
>>>>>>>>> >
>>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > --
>>>>>>>>> > Bhooshan
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Bhooshan
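The root cause the thread converges on, in miniature: MyFileSystem registers its files with Configuration.addDefaultResource, but Pig's Main.java constructs its Configuration with loadDefaults set to false, so everything registered as a default resource is skipped. The following self-contained sketch is a simplified stand-in for Hadoop's Configuration (not the real class) that models just that behavior:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of Hadoop Configuration resource handling (illustrative
// stand-in only; the real class parses the XML files for properties).
public class ConfModel {
    private static final List<String> DEFAULT_RESOURCES =
            new ArrayList<>(Arrays.asList("core-default.xml", "core-site.xml"));

    private final List<String> resources = new ArrayList<>();
    private final boolean loadDefaults;

    public ConfModel(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    // What MyFileSystem's static block calls.
    public static void addDefaultResource(String name) {
        DEFAULT_RESOURCES.add(name);
    }

    // What the workaround in the thread calls instead.
    public void addResource(String name) {
        resources.add(name);
    }

    // Resources that would actually be consulted for properties.
    public List<String> effectiveResources() {
        List<String> out = new ArrayList<>();
        if (loadDefaults) {
            out.addAll(DEFAULT_RESOURCES);
        }
        out.addAll(resources);
        return out;
    }

    public static void main(String[] args) {
        addDefaultResource("myfs-site.xml");      // MyFileSystem's static block
        ConfModel pigConf = new ConfModel(false); // new Configuration(false) in Pig's Main.java
        System.out.println(pigConf.effectiveResources().contains("myfs-site.xml")); // false
        pigConf.addResource("myfs-site.xml");     // the addResource workaround
        System.out.println(pigConf.effectiveResources().contains("myfs-site.xml")); // true
    }
}
```

This also explains why the addResource workaround succeeds: non-default resources are kept per-instance and survive regardless of the loadDefaults flag.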
