Hi Prashant,

Any update regarding this?
-
Bhooshan.

On Wed, May 29, 2013 at 4:55 PM, Bhooshan Mogal <[email protected]> wrote:

> Hi Prashant,
>
> Apologies for the delay on this. I took some more time over the past couple of weeks to investigate the issue. It does turn out that Pig eliminates parameters from non-standard configuration resources. The reason parameters from my config file were unavailable to Pig is that I was adding the file to the Configuration object using the Configuration.addDefaultResource() method, so the problem is the same one that I originally described. If I use conf.addResource("my-conf-site.xml") as opposed to Configuration.addDefaultResource("my-conf-site.xml"), the problem does not occur. Pig does not use parameters from the defaultResources list in the Configuration class. In Main.java at https://github.com/apache/pig/blob/trunk/src/org/apache/pig/Main.java, I can see that the Configuration object is created as Configuration conf = new Configuration(false); (line 168). Because false is passed to the constructor, defaultResources are not included while building the Configuration object.
>
> Could you (or anyone else on the list) explain why Pig does not use defaultResources from the Configuration object? If my findings are correct, is there a case for not passing false to the constructor, based on whether a Pig parameter is set? I would be more than happy to provide a patch for this if required.
>
> Thanks,
> Bhooshan.
>
> On Mon, Apr 15, 2013 at 5:57 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> Pig actually does not add core-site.xml or hadoop-site.xml explicitly; it merely looks for these resources to be present on the classpath.
>>
>> JobConf is the interface describing MR specifics to Hadoop, and Pig uses it to define jobs for execution. It loads up mapred*.xml. It does extend Configuration and uses the props loaded by it.
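[Editorial note: the two registration paths discussed above can be sketched with a toy model. MiniConfiguration and DefaultResourceDemo are invented names for illustration only; this is not the real org.apache.hadoop.conf.Configuration, just a simplified model of its loadDefaults behavior as described in this thread.]

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for Hadoop's Configuration, modeling
// only the loadDefaults behavior discussed in this thread.
class MiniConfiguration {
    private static final List<String> DEFAULT_RESOURCES = new ArrayList<>();
    private final boolean loadDefaults;
    private final List<String> resources = new ArrayList<>();

    MiniConfiguration(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    // Mirrors Configuration.addDefaultResource: registers a resource globally,
    // but it is only seen by instances built with loadDefaults == true.
    static void addDefaultResource(String name) {
        DEFAULT_RESOURCES.add(name);
    }

    // Mirrors conf.addResource: attaches a resource to this instance only.
    void addResource(String name) {
        resources.add(name);
    }

    // The resources this instance would actually consult.
    List<String> effectiveResources() {
        List<String> all = new ArrayList<>();
        if (loadDefaults) {
            all.addAll(DEFAULT_RESOURCES);
        }
        all.addAll(resources);
        return all;
    }
}

public class DefaultResourceDemo {
    public static void main(String[] args) {
        // What MyFileSystem's static block does:
        MiniConfiguration.addDefaultResource("myfs-site.xml");

        // What Pig's Main.java does: new Configuration(false)
        MiniConfiguration pigConf = new MiniConfiguration(false);
        System.out.println(pigConf.effectiveResources()); // prints []

        // The workaround from this thread: add the file as a plain resource.
        pigConf.addResource("myfs-site.xml");
        System.out.println(pigConf.effectiveResources()); // prints [myfs-site.xml]
    }
}
```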
>>
>> On Mon, Apr 15, 2013 at 5:34 PM, Bhooshan Mogal <[email protected]> wrote:
>>
>>> Thanks! Quick question before starting this, though. Since resources are added to the Configuration object in various classes in Hadoop (Configuration.java adds core-*.xml, HdfsConfiguration.java adds hdfs-*.xml), why does Pig create a new JobConf object with selected resources before submitting a job instead of reusing the Configuration object that may have been created earlier? I am trying to understand why Pig adds core-site.xml, hdfs-site.xml, and yarn-site.xml again.
>>>
>>> On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <[email protected]> wrote:
>>>
>>>> Sounds good. Here is a doc on contributing a patch (for some pointers): https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>>>>
>>>> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>
>>>>> Hey Prashant,
>>>>>
>>>>> Yup, I can take a stab at it. This is the first time I am looking at the Pig code, so I might take some time to get started. Will get back to you if I have questions in the meantime. And yes, I will write it so it reads a Pig property.
>>>>>
>>>>> -
>>>>> Bhooshan.
>>>>>
>>>>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <[email protected]> wrote:
>>>>>
>>>>>> Hi Bhooshan,
>>>>>>
>>>>>> This makes more sense now. I think overriding the fs implementation should go into core-site.xml, but it would be useful to be able to add resources if you have a bunch of other properties.
>>>>>>
>>>>>> Would you like to submit a patch? It should be based on a Pig property that specifies the additional resource names (myfs-site.xml in your case).
>>>>>>
>>>>>> -Prashant
>>>>>>
>>>>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Prashant,
>>>>>>>
>>>>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the scenario that I am trying to test:
>>>>>>>
>>>>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a filesystem I am trying to implement - let's call it MyFileSystem.class. This filesystem uses the scheme myfs:// for its URIs.
>>>>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made the class available through a jar file that is part of HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>>>>> 3. In MyFileSystem.class, I have a static block:
>>>>>>>        static {
>>>>>>>            Configuration.addDefaultResource("myfs-default.xml");
>>>>>>>            Configuration.addDefaultResource("myfs-site.xml");
>>>>>>>        }
>>>>>>>    Both these files are on the classpath. To be safe, I have also added myfs-site.xml in the constructor of MyFileSystem as conf.addResource("myfs-site.xml"), so that it is part of both the default resources and the non-default resources in the Configuration object.
>>>>>>> 4. I am trying to access the filesystem in my pig script as:
>>>>>>>        A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS (name:chararray, age:int); -- loading data
>>>>>>>        B = FOREACH A GENERATE name;
>>>>>>>        STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>>>>>> 5. The execution seems to start correctly, and MyFileSystem.class is invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml is loaded and the properties defined in it are available.
>>>>>>> 6. However, when Pig tries to submit the job, it cannot find these properties and the job fails to submit.
>>>>>>> 7. If I move all the properties defined in myfs-site.xml to core-site.xml, the job gets submitted successfully, and it even succeeds. However, this is not ideal, as I do not want to clutter core-site.xml with all of the properties for a separate filesystem.
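[Editorial note: the registration in step 2 above would look something like the fragment below, following Hadoop's fs.<scheme>.impl convention. The package name com.example is an assumption; MyFileSystem is this thread's hypothetical class.]

```xml
<!-- core-site.xml: map the myfs:// scheme to the custom FileSystem class.
     com.example.MyFileSystem is a placeholder for the actual implementation. -->
<property>
  <name>fs.myfs.impl</name>
  <value>com.example.MyFileSystem</value>
</property>
```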
>>>>>>> 8. As I said earlier, upon taking a closer look at the Pig code, I saw that while creating the JobConf object for a job, Pig adds very specific resources to the job object and ignores the resources that may have already been added (e.g. myfs-site.xml) to the Configuration object.
>>>>>>> 9. I have tested this with native map-reduce code as well as Hive, and this approach of having a separate config file for MyFileSystem works fine in both those cases.
>>>>>>>
>>>>>>> So, to summarize, I am looking for a way to ask Pig to load parameters from my own config file before submitting a job.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -
>>>>>>> Bhooshan.
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>>
>>>>>>>> +User group
>>>>>>>>
>>>>>>>> Hi Bhooshan,
>>>>>>>>
>>>>>>>> By default you should be running in MapReduce mode unless specified otherwise. Are you creating a PigServer object to run your jobs? Can you provide your code here?
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Apologies for the premature send. I may have some more information. After I applied the patch and set "pig.use.overriden.hadoop.configs=true", I saw an NPE (stacktrace below) and a message saying Pig was running in exectype local:
>>>>>>>>
>>>>>>>> 2013-04-13 07:37:13,758 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: local
>>>>>>>> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>>>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>
>>>>>>>> Here is the stacktrace:
>>>>>>>>
>>>>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Pig script failed to parse:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>>>>>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>>>>         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>>>>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>>>>         ... 14 more
>>>>>>>> Caused by:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>         at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>>>>         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>>>>         ... 15 more
>>>>>>>>
>>>>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml, or yarn-site.xml - only my-filesystem-site.xml, using both Configuration.addDefaultResource and Configuration.addResource.
>>>>>>>>>
>>>>>>>>> I see what you are saying, though. The patch might require users to take care of adding the default config resources as well, apart from their own resources?
>>>>>>>>>
>>>>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your configuration resources?
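[Editorial note: the property discussed above comes from the PIG-3135 patch; spelled exactly as quoted in this thread, it would be enabled in pig.properties as a fragment like the following.]

```properties
# pig.properties - opt in to user-supplied Hadoop configuration resources
# (property name as quoted in this thread, introduced by PIG-3135)
pig.use.overriden.hadoop.configs=true
```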
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your response to my question, and sorry for the delayed reply. I was not subscribed to the dev mailing list and hence did not get a notification about your reply. I have copied our thread below so you can get some context.
>>>>>>>>>>>
>>>>>>>>>>> I tried the patch that you pointed to; however, with that patch it looks like Pig is unable to find core-site.xml. It indicates that it is running the script in local mode in spite of having fs.default.name defined as the location of the HDFS namenode.
>>>>>>>>>>>
>>>>>>>>>>> Here is what I am trying to do: I have developed my own org.apache.hadoop.fs.FileSystem implementation and am trying to use it in my pig script. This implementation requires its own *-default.xml and *-site.xml files. I have added the path to these files to PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm that Hadoop can find them, as I am able to read these configurations in my code. However, the Pig code cannot find these configuration parameters. Upon doing some debugging in the Pig code, it seems to me that Pig does not use all the resources added to the Configuration object, but only certain specific ones like hadoop-site.xml, core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml, and hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to have Pig load user-defined resources, say foo-default.xml and foo-site.xml, while creating the JobConf object? I am narrowing in on this as the problem, because Pig can find my config parameters if I define them in core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>>>>
>>>>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>>>>
>>>>>>>>>>> Here is our previous conversation:
>>>>>>>>>>>
>>>>>>>>>>> Hi Bhooshan,
>>>>>>>>>>>
>>>>>>>>>>> There is a patch that addresses what you need, and it is part of 0.12 (unreleased). Take a look and see if you can apply the patch to the version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>>>>>>>>
>>>>>>>>>>> With this patch, the following property will allow you to override the default and pass in your own configuration:
>>>>>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Hi Folks,
>>>>>>>>>>> >
>>>>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system at work. This implementation uses some config files that are similar in structure to Hadoop config files: they have a *-default.xml and a *-site.xml for users to override default properties. In the class that implemented the Hadoop FileSystem, I had added these configuration files as default resources in a static block, using Configuration.addDefaultResource("my-default.xml") and Configuration.addDefaultResource("my-site.xml"). This was working fine, and we were able to run the Hadoop filesystem CLI and map-reduce jobs just fine for our storage system. However, when we tried using this storage system in pig scripts, we saw errors indicating that our configuration parameters were not available. Upon further debugging, we saw that the config files were added to the Configuration object as resources, but as part of defaultResources. However, in Main.java in the Pig source, we saw that the Configuration object was created as Configuration conf = new Configuration(false);, thereby setting loadDefaults to false in the conf object. As a result, properties from the default resources (including my config files) were not loaded and hence were unavailable.
>>>>>>>>>>> >
>>>>>>>>>>> > We solved the problem by using Configuration.addResource instead of Configuration.addDefaultResource, but still could not figure out why Pig does not use default resources.
>>>>>>>>>>> >
>>>>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> > --
>>>>>>>>>>> > Bhooshan

--
Bhooshan
