Thanks! Quick question before starting on this, though. Since resources are added to the Configuration object in various Hadoop classes (Configuration.java adds core-*.xml, HdfsConfiguration.java adds hdfs-*.xml), why does Pig create a new JobConf object with a selected set of resources before submitting a job, rather than reusing the Configuration object that may have been created earlier? I am trying to understand why Pig adds core-site.xml, hdfs-site.xml, and yarn-site.xml again.
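To make the question concrete, here is a toy sketch of the behavior being asked about. MiniConf is a hypothetical stand-in for Hadoop's Configuration/JobConf (the class, file names, and property keys are illustrative, not real Pig internals): a Configuration populated earlier knows about an extra resource, but a JobConf rebuilt from a fixed allowlist of resource names silently drops it.

```java
import java.util.*;

// Hypothetical mini-model (not the real Hadoop/Pig API) of the question above.
class MiniConf {
    // Stand-in for "files on the classpath": resource name -> its properties.
    static final Map<String, Map<String, String>> CLASSPATH = new HashMap<>();

    private final List<String> resources = new ArrayList<>();

    void addResource(String name) { resources.add(name); }

    // Effective properties: later resources override earlier ones.
    Map<String, String> props() {
        Map<String, String> merged = new HashMap<>();
        for (String r : resources) {
            merged.putAll(CLASSPATH.getOrDefault(r, Collections.emptyMap()));
        }
        return merged;
    }
}

class JobConfSketch {
    public static void main(String[] args) {
        MiniConf.CLASSPATH.put("core-site.xml", Map.of("fs.myfs.impl", "MyFileSystem"));
        MiniConf.CLASSPATH.put("myfs-site.xml", Map.of("myfs.server", "myhost.com:8999"));

        // Configuration populated earlier (e.g. by a FileSystem implementation).
        MiniConf earlier = new MiniConf();
        earlier.addResource("core-site.xml");
        earlier.addResource("myfs-site.xml");

        // A fresh "JobConf" that only re-adds an allowlist of well-known files,
        // as the question describes, never sees myfs-site.xml.
        MiniConf jobConf = new MiniConf();
        for (String r : List.of("core-site.xml", "hdfs-site.xml", "yarn-site.xml")) {
            jobConf.addResource(r);
        }

        System.out.println(earlier.props().containsKey("myfs.server")); // true
        System.out.println(jobConf.props().containsKey("myfs.server")); // false
    }
}
```

The point of the sketch: rebuilding a conf from resource names alone discards anything registered on a previously populated Configuration instance.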
On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <[email protected]> wrote:

> Sounds good. Here is a doc on contributing a patch (for some pointers):
> https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>
> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <[email protected]> wrote:
>
>> Hey Prashant,
>>
>> Yup, I can take a stab at it. This is the first time I am looking at Pig
>> code, so I might take some time to get started. Will get back to you if I
>> have questions in the meantime. And yes, I will write it so it reads a Pig
>> property.
>>
>> --
>> Bhooshan.
>>
>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <[email protected]> wrote:
>>
>>> Hi Bhooshan,
>>>
>>> This makes more sense now. I think overriding the fs implementation should
>>> go into core-site.xml, but it would be useful to be able to add
>>> resources if you have a bunch of other properties.
>>>
>>> Would you like to submit a patch? It should be based on a Pig property
>>> that suggests the additional resource names (myfs-site.xml in your case).
>>>
>>> -Prashant
>>>
>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <[email protected]> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>>> scenario that I am trying to test:
>>>>
>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for
>>>>    a filesystem I am trying to implement. Let's call it MyFileSystem.class.
>>>>    This filesystem uses the scheme myfs:// for its URIs.
>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>>>    made the class available through a jar file that is part of
>>>>    HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>> 3. In MyFileSystem.class, I have a static block:
>>>>
>>>>    static {
>>>>        Configuration.addDefaultResource("myfs-default.xml");
>>>>        Configuration.addDefaultResource("myfs-site.xml");
>>>>    }
>>>>
>>>>    Both these files are in the classpath.
>>>>    To be safe, I have also added myfs-site.xml in the constructor of
>>>>    MyFileSystem as conf.addResource("myfs-site.xml"), so that it is part
>>>>    of both the default resources and the non-default resources in the
>>>>    Configuration object.
>>>> 4. I am trying to access the filesystem in my Pig script as:
>>>>
>>>>    A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>>>        (name:chararray, age:int); -- loading data
>>>>    B = FOREACH A GENERATE name;
>>>>    STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>>>
>>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>>>    invoked correctly. In MyFileSystem.class, I can also see that
>>>>    myfs-site.xml is loaded and the properties defined in it are available.
>>>> 6. However, when Pig tries to submit the job, it cannot find these
>>>>    properties and the job fails to submit successfully.
>>>> 7. If I move all the properties defined in myfs-site.xml to
>>>>    core-site.xml, the job gets submitted successfully, and it even
>>>>    succeeds. However, this is not ideal, as I do not want to clutter
>>>>    core-site.xml with all of the properties for a separate filesystem.
>>>> 8. As I said earlier, upon taking a closer look at the Pig code, I saw
>>>>    that while creating the JobConf object for a job, Pig adds very
>>>>    specific resources to the job object and ignores the resources that
>>>>    may have already been added (e.g. myfs-site.xml) to the Configuration
>>>>    object.
>>>> 9. I have tested this with native map-reduce code as well as Hive, and
>>>>    this approach of having a separate config file for MyFileSystem works
>>>>    fine in both those cases.
>>>>
>>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>>> from my own config file before submitting a job.
>>>>
>>>> Thanks,
>>>> --
>>>> Bhooshan.
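The two registration paths used in step 3 of the quoted scenario behave differently, which is worth spelling out. The sketch below is a toy model, not the real org.apache.hadoop.conf.Configuration: addDefaultResource is static and shared by every instance that loads defaults, while addResource applies only to the instance it is called on.

```java
import java.util.*;

// Toy model of addDefaultResource (static, global) vs addResource (per
// instance). Illustrative only; not the real Hadoop Configuration class.
class LayeredConf {
    private static final List<String> DEFAULT_RESOURCES = new ArrayList<>();
    private final List<String> ownResources = new ArrayList<>();
    private final boolean loadDefaults;

    LayeredConf(boolean loadDefaults) { this.loadDefaults = loadDefaults; }

    static void addDefaultResource(String name) { DEFAULT_RESOURCES.add(name); }
    void addResource(String name) { ownResources.add(name); }

    // Resources this instance would actually read, in load order.
    List<String> effectiveResources() {
        List<String> out = new ArrayList<>();
        if (loadDefaults) out.addAll(DEFAULT_RESOURCES);
        out.addAll(ownResources);
        return out;
    }

    public static void main(String[] args) {
        // The static block in MyFileSystem registers the defaults once...
        LayeredConf.addDefaultResource("myfs-default.xml");
        LayeredConf.addDefaultResource("myfs-site.xml");

        // ...and the belt-and-braces constructor call adds one per instance.
        LayeredConf conf = new LayeredConf(true);
        conf.addResource("myfs-site.xml");
        System.out.println(conf.effectiveResources());
        // [myfs-default.xml, myfs-site.xml, myfs-site.xml]

        // An instance built with loadDefaults=false sees only explicit adds.
        LayeredConf bare = new LayeredConf(false);
        bare.addResource("myfs-site.xml");
        System.out.println(bare.effectiveResources()); // [myfs-site.xml]
    }
}
```

Under these semantics, registering myfs-site.xml both ways (as step 3 does) is the only way to survive a consumer that constructs its conf with loadDefaults disabled.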
>>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>
>>>>> +User group
>>>>>
>>>>> Hi Bhooshan,
>>>>>
>>>>> By default you should be running in MapReduce mode unless specified
>>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>>> provide your code here?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>
>>>>> Apologies for the premature send. I may have some more information.
>>>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>>>> I saw an NPE (stack trace below) and a message saying Pig was running in
>>>>> exectype local:
>>>>>
>>>>> 2013-04-13 07:37:13,758 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: local
>>>>> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>
>>>>> Here is the stack trace:
>>>>>
>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing.
>>>>> Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>         ... 14 more
>>>>> Caused by:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>         ... 15 more
>>>>>
>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>
>>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml, or
>>>>>> yarn-site.xml. Only my-filesystem-site.xml, using both
>>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>>
>>>>>> I see what you are saying, though. The patch might require users to
>>>>>> take care of adding the default config resources as well, apart from
>>>>>> their own resources?
>>>>>>
>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>
>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add
>>>>>>> your configuration resources?
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Prashant,
>>>>>>>>
>>>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>>>> reply. I was not subscribed to the dev mailing list and hence did not
>>>>>>>> get a notification about your reply. I have copied our thread below so
>>>>>>>> you can get some context.
>>>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>>>> looks like Pig is unable to find core-site.xml. It indicates that it
>>>>>>>> is running the script in local mode in spite of having fs.default.name
>>>>>>>> defined as the location of the HDFS namenode.
>>>>>>>>
>>>>>>>> Here is what I am trying to do: I have developed my own
>>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it
>>>>>>>> in my Pig script. This implementation requires its own *-default.xml
>>>>>>>> and *-site.xml files. I have added the path to these files to
>>>>>>>> PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm that Hadoop
>>>>>>>> can find these files, as I am able to read these configurations in my
>>>>>>>> code. However, the Pig code cannot find these configuration
>>>>>>>> parameters. Upon doing some debugging in the Pig code, it seems to me
>>>>>>>> that Pig does not use all the resources added to the Configuration
>>>>>>>> object, but only certain specific ones such as hadoop-site.xml,
>>>>>>>> core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml, and
>>>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible
>>>>>>>> to have Pig load user-defined resources, say foo-default.xml and
>>>>>>>> foo-site.xml, while creating the JobConf object? I am narrowing in on
>>>>>>>> this as the problem, because Pig can find my config parameters if I
>>>>>>>> define them in core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>
>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>
>>>>>>>> Here is our previous conversation:
>>>>>>>>
>>>>>>>> Hi Bhooshan,
>>>>>>>>
>>>>>>>> There is a patch that addresses what you need, and it is part of 0.12
>>>>>>>> (unreleased). Take a look and see if you can apply the patch to the
>>>>>>>> version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>>>>> With this patch, the following property will allow you to override
>>>>>>>> the default and pass in your own configuration:
>>>>>>>>
>>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>>
>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>
>>>>>>>> > Hi Folks,
>>>>>>>> >
>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage
>>>>>>>> > system at work. This implementation uses some config files that are
>>>>>>>> > similar in structure to Hadoop config files. They have a
>>>>>>>> > *-default.xml and a *-site.xml for users to override default
>>>>>>>> > properties. In the class that implemented the Hadoop FileSystem, I
>>>>>>>> > had added these configuration files as default resources in a static
>>>>>>>> > block using Configuration.addDefaultResource("my-default.xml") and
>>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working
>>>>>>>> > fine, and we were able to run the Hadoop FileSystem CLI and
>>>>>>>> > map-reduce jobs just fine for our storage system. However, when we
>>>>>>>> > tried using this storage system in Pig scripts, we saw errors
>>>>>>>> > indicating that our configuration parameters were not available.
>>>>>>>> > Upon further debugging, we saw that the config files were added to
>>>>>>>> > the Configuration object as resources, but were part of
>>>>>>>> > defaultResources. However, in Main.java in the Pig source, we saw
>>>>>>>> > that the Configuration object was created as
>>>>>>>> > Configuration conf = new Configuration(false);, thereby setting
>>>>>>>> > loadDefaults to false in the conf object. As a result, properties
>>>>>>>> > from the default resources (including my config files) were not
>>>>>>>> > loaded and hence were unavailable.
>>>>>>>> >
>>>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>>>> > Configuration.addDefaultResource, but still could not figure out why
>>>>>>>> > Pig does not use default resources.
>>>>>>>> >
>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > --
>>>>>>>> > Bhooshan

--
Bhooshan
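For readers skimming the quoted thread, the root cause and the workaround can be condensed into a short toy model. This is illustrative only, not the real Configuration class (in particular, real Hadoop resolves default resources lazily at property-read time rather than in the constructor): Pig's Main.java builds `new Configuration(false)`, so anything registered via the static addDefaultResource is skipped, while a per-instance addResource call still takes effect.

```java
import java.util.*;

// Condensed toy model of the root cause in the thread above: with
// loadDefaults=false, statically registered default resources are skipped,
// but explicit per-instance addResource calls still work.
class SkipDefaultsConf {
    private static final List<String> DEFAULTS = new ArrayList<>();
    private final List<String> resources = new ArrayList<>();

    SkipDefaultsConf(boolean loadDefaults) {
        if (loadDefaults) resources.addAll(DEFAULTS); // new Configuration() path
    }

    static void addDefaultResource(String name) { DEFAULTS.add(name); }
    void addResource(String name) { resources.add(name); }
    boolean knows(String name) { return resources.contains(name); }

    public static void main(String[] args) {
        // Registered by MyFileSystem's static block.
        SkipDefaultsConf.addDefaultResource("my-site.xml");

        SkipDefaultsConf pigConf = new SkipDefaultsConf(false); // as in Pig's Main.java
        System.out.println(pigConf.knows("my-site.xml")); // false: defaults skipped

        pigConf.addResource("my-site.xml"); // the workaround from the thread
        System.out.println(pigConf.knows("my-site.xml")); // true
    }
}
```

This matches the resolution in the last quoted email: switching from addDefaultResource to addResource makes the resource visible even to a conf created with loadDefaults disabled.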
