Thanks! Quick question before starting this, though. Since resources are
added to the Configuration object in various classes in Hadoop
(Configuration.java adds core-*.xml, HdfsConfiguration.java adds
hdfs-*.xml), why does Pig create a new JobConf object with selected
resources before submitting a job instead of reusing the Configuration
object that may have been created earlier? I am trying to understand why
Pig adds core-site.xml, hdfs-site.xml, and yarn-site.xml again.


On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <[email protected]> wrote:

> Sounds good. Here is a doc on contributing patch (for some pointers)
> https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>
>
> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal 
> <[email protected]>wrote:
>
>> Hey Prashant,
>>
>> Yup, I can take a stab at it. This is the first time I am looking at Pig
>> code, so I might take some time to get started. Will get back to you if I
>> have questions in the meantime. And yes, I will write it so it reads a pig
>> property.
>>
>> -
>> Bhooshan.
>>
>>
>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <
>> [email protected]> wrote:
>>
>>> Hi Bhooshan,
>>>
>>> This makes more sense now. I think overriding fs implementation should
>>> go into core-site.xml, but it would be useful to be able to add
>>> resources if you have a bunch of other properties.
>>>
>>> Would you like to submit a patch? It should be based on a pig property
>>> that suggests the additional resource names (myfs-site.xml) in your case.
>>>
>>> -Prashant
>>>
>>>
>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>>> [email protected]> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>>
>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>>> scenario that I am trying to test -
>>>>
>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for
>>>> a filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>>> This filesystem uses the scheme myfs:// for its URIs
>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>>> made the class available through a jar file that is part of
>>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>> 3. In MyFileSystem.class, I have a static block as -
>>>> static {
>>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>>     Configuration.addDefaultResource("myfs-site.xml");
>>>> }
>>>> Both these files are in the classpath. To be safe, I have also added
>>>> myfs-site.xml in the constructor of MyFileSystem as
>>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>>> resources as well as the non-default resources in the Configuration object.
>>>> 4. I am trying to access the filesystem in my pig script as -
>>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>>> (name:chararray, age:int); -- loading data
>>>> B = FOREACH A GENERATE name;
>>>> STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>>> is loaded and the properties defined in it are available.
>>>> 6. However, when Pig tries to submit the job, it cannot find these
>>>> properties and the job fails to submit successfully.
>>>> 7. If I move all the properties defined in myfs-site.xml to
>>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>>> However, this is not ideal as I do not want to proliferate core-site.xml
>>>> with all of the properties for a separate filesystem.
>>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>>> that while creating the JobConf object for a job, pig adds very specific
>>>> resources to the job object, and ignores the resources that may have been
>>>> added already (eg myfs-site.xml) in the Configuration object.
>>>> 9. I have tested this with native map-reduce code as well as hive, and
>>>> this approach of having a separate config file for MyFileSystem works fine
>>>> in both those cases.
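Since steps 2 and 3 above both depend on the XML files being visible on the classpath, a standalone probe (plain JDK, no Hadoop dependency; the file names are the ones from my scenario) can confirm whether the JVM actually sees them before Hadoop or Pig enters the picture:

```java
// Classpath probe: prints a file:/... URL when the resource is found on
// the classpath, or "null" when it is not. If this prints null for
// myfs-site.xml, no Configuration in the same JVM could load that file
// by name either.
class ClasspathCheck {
    public static void main(String[] args) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        System.out.println("myfs-site.xml -> " + cl.getResource("myfs-site.xml"));
        System.out.println("core-site.xml -> " + cl.getResource("core-site.xml"));
    }
}
```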
>>>>
>>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>>> from my own config file before submitting a job.
>>>>
>>>> Thanks,
>>>> -
>>>> Bhooshan.
>>>>
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <
>>>> [email protected]> wrote:
>>>>
>>>>> +User group
>>>>>
>>>>> Hi Bhooshan,
>>>>>
>>>>> By default you should be running in MapReduce mode unless specified
>>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>>> provide your code here?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]>
>>>>> wrote:
>>>>>
>>>>>  Apologies for the premature send. I may have some more information.
>>>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>>>> I saw an NPE (stacktrace below) and a message saying pig was running in
>>>>> exectype local -
>>>>>
>>>>> 2013-04-13 07:37:13,758 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
>>>>> Connecting
>>>>> to hadoop file system at: local
>>>>> 2013-04-13 07:37:13,760 [main] WARN
>>>>> org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is
>>>>> deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>> - ERROR 1200: Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>
>>>>>
>>>>> Here is the stacktrace -
>>>>>
>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>>>> during parsing. Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>>         at
>>>>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>         at
>>>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>         at
>>>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>         at
>>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>         at
>>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>         at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>         at
>>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>         ... 14 more
>>>>> Caused by:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>         at
>>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>         ... 15 more
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Yes; however, I did not add core-site.xml, hdfs-site.xml, or
>>>>>> yarn-site.xml. I only added my-filesystem-site.xml, using both
>>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>>
>>>>>> I see what you are saying though. The patch might require users to
>>>>>> take care of adding the default config resources as well apart from their
>>>>>> own resources?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add
>>>>>>> your configuration resources?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Prashant,
>>>>>>>>
>>>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>>>> reply. I was not subscribed to the dev mailing list and hence did not 
>>>>>>>> get a
>>>>>>>> notification about your reply. I have copied our thread below so you 
>>>>>>>> can
>>>>>>>> get some context.
>>>>>>>>
>>>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>>>> looks like Pig is unable to find core-site.xml. It indicates that it is
>>>>>>>> running the script in local mode in spite of having fs.default.name
>>>>>>>> defined as the location of the HDFS namenode.
>>>>>>>>
>>>>>>>> Here is what I am trying to do - I have developed my own
>>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it 
>>>>>>>> in
>>>>>>>> my pig script. This implementation requires its own *-default and
>>>>>>>> *-site.xml files. I have added the path to these files in 
>>>>>>>> PIG_CLASSPATH as
>>>>>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these 
>>>>>>>> files,
>>>>>>>> as I am able to read these configurations in my code. However, pig code
>>>>>>>> cannot find these configuration parameters. Upon doing some debugging 
>>>>>>>> in
>>>>>>>> the pig code, it seems to me that pig does not use all the resources 
>>>>>>>> added
>>>>>>>> in the Configuration object, but only seems to use certain specific 
>>>>>>>> ones
>>>>>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml, yarn-site.xml,
>>>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible 
>>>>>>>> to
>>>>>>>> have pig load user-defined resources like say foo-default.xml and
>>>>>>>> foo-site.xml while creating the JobConf object? I am narrowing in on
>>>>>>>> this as
>>>>>>>> the problem, because pig can find my config parameters if I define 
>>>>>>>> them in
>>>>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>
>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>
>>>>>>>>
>>>>>>>> Here is our previous conversation -
>>>>>>>>
>>>>>>>> Hi Bhooshan,
>>>>>>>>
>>>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>>>> (unreleased). Take a look and see if you can apply the patch to the 
>>>>>>>> version
>>>>>>>> you are using: https://issues.apache.org/jira/browse/PIG-3135.
>>>>>>>>
>>>>>>>> With this patch, the following property will allow you to override the
>>>>>>>> default and pass in your own configuration.
>>>>>>>> pig.use.overriden.hadoop.configs=true
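For the record, my reading is that with the PIG-3135 patch applied the flag goes into the Pig properties file (conf/pig.properties, or passed on the command line via -D; both locations are my assumption, not stated in the thread). The spelling "overriden" matches the property name quoted above:

```properties
# Assumes the PIG-3135 patch is applied; the misspelling "overriden"
# is the actual property name, not a typo here.
pig.use.overriden.hadoop.configs=true
```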
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal 
>>>>>>>> <[email protected]>wrote:
>>>>>>>>
>>>>>>>> > Hi Folks,
>>>>>>>> >
>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage 
>>>>>>>> > system
>>>>>>>> > at work. This implementation uses some config files that are similar 
>>>>>>>> > in
>>>>>>>> > structure to hadoop config files. They have a *-default.xml and a
>>>>>>>> > *-site.xml for users to override default properties. In the class 
>>>>>>>> > that
>>>>>>>> > implemented the Hadoop FileSystem, I had added these configuration 
>>>>>>>> > files as
>>>>>>>> > default resources in a static block using
>>>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working 
>>>>>>>> > fine and
>>>>>>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs 
>>>>>>>> > just fine
>>>>>>>> > for our storage system. However, when we tried using this storage 
>>>>>>>> > system in
>>>>>>>> > pig scripts, we saw errors indicating that our configuration 
>>>>>>>> > parameters
>>>>>>>> > were not available. Upon further debugging, we saw that the config 
>>>>>>>> > files
>>>>>>>> > were added to the Configuration object as resources, but were part of
>>>>>>>> > defaultResources. However, in Main.java in the pig source, we saw 
>>>>>>>> > that the
>>>>>>>> > Configuration object was created as Configuration conf = new
>>>>>>>> > Configuration(false), thereby setting loadDefaults to false in the 
>>>>>>>> > conf
>>>>>>>> > object. As a result, properties from the default resources 
>>>>>>>> > (including my
>>>>>>>> > config files) were not loaded and hence, unavailable.
>>>>>>>> >
>>>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>>>> > Configuration.addDefaultResource, but still could not figure out why 
>>>>>>>> > Pig
>>>>>>>> > does not use default resources.
>>>>>>>> >
>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > --
>>>>>>>> > Bhooshan
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bhooshan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bhooshan
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Bhooshan
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>
>>>
>>
>>
>> --
>> Bhooshan
>>
>
>


-- 
Bhooshan
