Hi Prashant,

Yes, I am running in MapReduce mode. Here are the steps of the scenario I
am trying to test -

1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
filesystem I am building - let's call it MyFileSystem.class. This
filesystem uses the scheme myfs:// for its URIs.
2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made
the class available through a jar file that is part of HADOOP_CLASSPATH (or
PIG_CLASSPATH).
3. In MyFileSystem.class, I have a static block as -
static {
    Configuration.addDefaultResource("myfs-default.xml");
    Configuration.addDefaultResource("myfs-site.xml");
}
Both these files are on the classpath. To be safe, I have also added
myfs-site.xml in the constructor of MyFileSystem via
conf.addResource("myfs-site.xml"), so that it is part of both the default
resources and the non-default resources in the Configuration object.
4. I am trying to access the filesystem in my pig script as -
A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':')
    AS (name:chararray, age:int); -- loading data
B = FOREACH A GENERATE name;
STORE B INTO 'myfs://myhost.com:8999/testoutput';
5. The execution seems to start correctly, and MyFileSystem.class is
invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
is loaded and the properties defined in it are available.
6. However, when Pig tries to submit the job, it cannot find these
properties and job submission fails.
7. If I move all the properties defined in myfs-site.xml to core-site.xml,
the job gets submitted successfully, and it even succeeds. However, this is
not ideal, as I do not want to clutter core-site.xml with all of the
properties for a separate filesystem.
8. As I said earlier, upon taking a closer look at the Pig code, I saw
that while creating the JobConf object for a job, Pig adds only very
specific resources to the job object and ignores resources that may have
already been added (e.g. myfs-site.xml) to the Configuration object.
9. I have tested this with native MapReduce code as well as Hive, and this
approach of having a separate config file for MyFileSystem works fine in
both of those cases.
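
For reference, here is the kind of myfs-site.xml I am describing. The
property names below are just illustrative placeholders, not the real
ones:

```xml
<?xml version="1.0"?>
<!-- myfs-site.xml: site-specific overrides for MyFileSystem.
     Property names here are hypothetical examples. -->
<configuration>
  <property>
    <name>myfs.server.endpoint</name>
    <value>myhost.com:8999</value>
  </property>
  <property>
    <name>myfs.connection.timeout.ms</name>
    <value>30000</value>
  </property>
</configuration>
```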

So, to summarize, I am looking for a way to ask Pig to load parameters from
my own config file before submitting a job.
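
To make the effect in steps 6-8 concrete, here is a toy model of what I
believe is happening. This is NOT Pig's or Hadoop's actual code, and the
resource and property names are hypothetical; it just sketches the idea
that if the job configuration is rebuilt from a fixed whitelist of
resource names, properties contributed by any other resource are silently
dropped:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (not Hadoop's real Configuration class) of the
// resource-whitelisting behavior described above.
public class ResourceWhitelistDemo {

    // Pretend each resource file on the classpath contributes these properties.
    static final Map<String, Map<String, String>> RESOURCES = new HashMap<>();
    static {
        RESOURCES.put("core-site.xml", Map.of("fs.default.name", "hdfs://nn:8020"));
        RESOURCES.put("myfs-site.xml", Map.of("myfs.endpoint", "myhost.com:8999"));
    }

    // Build a configuration from an explicit list of resource names only;
    // anything not on the list never makes it into the result.
    static Map<String, String> buildFromResources(List<String> names) {
        Map<String, String> conf = new HashMap<>();
        for (String name : names) {
            Map<String, String> props = RESOURCES.get(name);
            if (props != null) conf.putAll(props);
        }
        return conf;
    }

    public static void main(String[] args) {
        // Whitelist of well-known resources only: the user property is lost.
        Map<String, String> jobConf =
                buildFromResources(List.of("core-site.xml", "hdfs-site.xml"));
        System.out.println("myfs.endpoint present: " + jobConf.containsKey("myfs.endpoint"));

        // Including the user resource in the list restores the property.
        Map<String, String> fixed =
                buildFromResources(List.of("core-site.xml", "myfs-site.xml"));
        System.out.println("myfs.endpoint present: " + fixed.containsKey("myfs.endpoint"));
    }
}
```

This also matches why moving the properties into core-site.xml (step 7)
works: core-site.xml is on the whitelist, so its properties always survive.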

Thanks,
-
Bhooshan.



On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]> wrote:

> +User group
>
> Hi Bhooshan,
>
> By default you should be running in MapReduce mode unless specified
> otherwise. Are you creating a PigServer object to run your jobs? Can you
> provide your code here?
>
> Sent from my iPhone
>
> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]>
> wrote:
>
> Apologies for the premature send. I may have some more information. After
> I applied the patch and set "pig.use.overriden.hadoop.configs=true", I saw
> an NPE (stacktrace below) and a message saying pig was running in exectype
> local -
>
> 2013-04-13 07:37:13,758 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
> to hadoop file system at: local
> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration
> - mapred.used.genericoptionsparser is deprecated. Instead, use
> mapreduce.client.genericoptionsparser.used
> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: Pig script failed to parse:
> <file test.pig, line 1, column 4> pig script failed to validate:
> java.lang.NullPointerException
>
>
> Here is the stack trace -
>
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
> during parsing. Pig script failed to parse:
> <file test.pig, line 1, column 4> pig script failed to validate:
> java.lang.NullPointerException
>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>         at
> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>         at
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>         at
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>         at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>         at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>         at org.apache.pig.Main.run(Main.java:555)
>         at org.apache.pig.Main.main(Main.java:111)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:616)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> Caused by: Failed to parse: Pig script failed to parse:
> <file test.pig, line 1, column 4> pig script failed to validate:
> java.lang.NullPointerException
>         at
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>         ... 14 more
> Caused by:
> <file test.pig, line 1, column 4> pig script failed to validate:
> java.lang.NullPointerException
>         at
> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>         at
> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>         at
> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>         at
> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>         at
> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>         at
> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>         at
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>         ... 15 more
>
>
>
>
> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <[email protected]>
> wrote:
>
>> Yes; however, I did not add core-site.xml, hdfs-site.xml, or
>> yarn-site.xml - only my-filesystem-site.xml, using both
>> Configuration.addDefaultResource and Configuration.addResource.
>>
>> I see what you are saying, though. Does the patch require users to take
>> care of adding the default config resources as well, in addition to their
>> own resources?
>>
>>
>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <[email protected]
>> > wrote:
>>
>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>>> configuration resources?
>>>
>>>
>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>> [email protected]> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>> Thanks for your response to my question, and sorry for the delayed
>>>> reply. I was not subscribed to the dev mailing list and hence did not get a
>>>> notification about your reply. I have copied our thread below so you can
>>>> get some context.
>>>>
>>>> I tried the patch that you pointed to; however, with that patch it
>>>> looks like Pig is unable to find core-site.xml. It indicates that it is
>>>> running the script in local mode despite fs.default.name being defined
>>>> as the location of the HDFS namenode.
>>>>
>>>> Here is what I am trying to do - I have developed my own
>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>> my pig script. This implementation requires its own *-default and
>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>>> as I am able to read these configurations in my code. However, pig code
>>>> cannot find these configuration parameters. Upon doing some debugging in
>>>> the pig code, it seems to me that pig does not use all the resources added
>>>> in the Configuration object, but only seems to use certain specific ones
>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml,yarn-site.xml,
>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>>> have Pig load user-defined resources, say foo-default.xml and
>>>> foo-site.xml, while creating the JobConf object? I am narrowing in on
>>>> this as the problem, because Pig can find my config parameters if I
>>>> define them in core-site.xml instead of my-filesystem-site.xml.
>>>>
>>>> Let me know if you need more details about the issue.
>>>>
>>>>
>>>> Here is our previous conversation -
>>>>
>>>> Hi Bhooshan,
>>>>
>>>> There is a patch that addresses what you need, and is part of 0.12
>>>> (unreleased). Take a look and see if you can apply the patch to the
>>>> version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>
>>>> With this patch, the following property will allow you to override the
>>>> default and pass in your own configuration.
>>>> pig.use.overriden.hadoop.configs=true
>>>>
>>>>
>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <[email protected]>
>>>> wrote:
>>>>
>>>> > Hi Folks,
>>>> >
>>>> > I had implemented the Hadoop FileSystem abstract class for a storage 
>>>> > system
>>>> > at work. This implementation uses some config files that are similar in
>>>> > structure to hadoop config files. They have a *-default.xml and a
>>>> > *-site.xml for users to override default properties. In the class that
>>>> > implemented the Hadoop FileSystem, I had added these configuration files 
>>>> > as
>>>> > default resources in a static block using
>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine and
>>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs just 
>>>> > fine
>>>> > for our storage system. However, when we tried using this storage system 
>>>> > in
>>>> > pig scripts, we saw errors indicating that our configuration parameters
>>>> > were not available. Upon further debugging, we saw that the config files
>>>> > were added to the Configuration object as resources, but were part of
>>>> > defaultResources. However, in Main.java in the pig source, we saw that 
>>>> > the
>>>> > Configuration object was created as Configuration conf = new
>>>> > Configuration(false);, thereby setting loadDefaults to false in the conf
>>>> > object. As a result, properties from the default resources (including my
>>>> > config files) were not loaded and hence, unavailable.
>>>> >
>>>> > We solved the problem by using Configuration.addResource instead of
>>>> > Configuration.addDefaultResource, but still could not figure out why
>>>> > Pig does not use default resources.
>>>> >
>>>> > Could someone on the list explain why this is the case?
>>>> >
>>>> > Thanks,
>>>> > --
>>>> > Bhooshan
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>
>>>
>>
>>
>> --
>> Bhooshan
>>
>
>
>
> --
> Bhooshan
>
>


-- 
Bhooshan
