Hi Prashant,

Any update regarding this?
-
Bhooshan.

On Wed, May 29, 2013 at 4:55 PM, Bhooshan Mogal <[email protected]> wrote:

> Hi Prashant,
>
> Apologies for the delay on this. I took some more time over the past couple of weeks to investigate the issue. It does turn out that Pig eliminates parameters from non-standard configuration resources. The reason parameters from my config file were unavailable to Pig is that I was adding the file to the Configuration object using the Configuration.addDefaultResource() method, so the problem is the same one that I originally described. If I use conf.addResource("my-conf-site.xml") as opposed to Configuration.addDefaultResource("my-conf-site.xml"), the problem does not occur. Pig does not use parameters from the defaultResources list in the Configuration class. In Main.java at https://github.com/apache/pig/blob/trunk/src/org/apache/pig/Main.java, I can see that the Configuration object is created as Configuration conf = new Configuration(false); (line 168). Because false is passed to the constructor, defaultResources are not included while building the Configuration object.
>
> Could you (or anyone else on the list) explain why Pig does not use defaultResources from the Configuration object? If my findings are correct, is there a case for not passing false to the constructor, based on whether a Pig parameter is set? I would be more than happy to provide a patch for this if required.
>
> Thanks,
> Bhooshan.
>
> On Mon, Apr 15, 2013 at 5:57 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> Pig actually does not add core-site.xml or hadoop-site.xml explicitly; it merely looks for these resources to be present on the classpath.
>>
>> JobConf is the interface describing MR specifics to Hadoop, and Pig uses it to define jobs for execution. It loads up mapred*.xml. It does extend Configuration and uses the props loaded by it.
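[Editorial note: the two registration paths discussed above can be sketched with a toy model. MiniConfiguration and DefaultResourceDemo are invented names for illustration only; this is not the real org.apache.hadoop.conf.Configuration, just a simplified model of its loadDefaults behavior as described in this thread.]

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for Hadoop's Configuration, modeling
// only the loadDefaults behavior discussed in this thread.
class MiniConfiguration {
    private static final List<String> DEFAULT_RESOURCES = new ArrayList<>();
    private final boolean loadDefaults;
    private final List<String> resources = new ArrayList<>();

    MiniConfiguration(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    // Mirrors Configuration.addDefaultResource: registers a resource globally,
    // but it is only seen by instances built with loadDefaults == true.
    static void addDefaultResource(String name) {
        DEFAULT_RESOURCES.add(name);
    }

    // Mirrors conf.addResource: attaches a resource to this instance only.
    void addResource(String name) {
        resources.add(name);
    }

    // The resources this instance would actually consult.
    List<String> effectiveResources() {
        List<String> all = new ArrayList<>();
        if (loadDefaults) {
            all.addAll(DEFAULT_RESOURCES);
        }
        all.addAll(resources);
        return all;
    }
}

public class DefaultResourceDemo {
    public static void main(String[] args) {
        // What MyFileSystem's static block does:
        MiniConfiguration.addDefaultResource("myfs-site.xml");

        // What Pig's Main.java does: new Configuration(false)
        MiniConfiguration pigConf = new MiniConfiguration(false);
        System.out.println(pigConf.effectiveResources()); // prints []

        // The workaround from this thread: add the file as a plain resource.
        pigConf.addResource("myfs-site.xml");
        System.out.println(pigConf.effectiveResources()); // prints [myfs-site.xml]
    }
}
```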
>>
>> On Mon, Apr 15, 2013 at 5:34 PM, Bhooshan Mogal <[email protected]> wrote:
>>
>>> Thanks! Quick question before starting this, though. Since resources are added to the Configuration object in various classes in Hadoop (Configuration.java adds core-*.xml, HdfsConfiguration.java adds hdfs-*.xml), why does Pig create a new JobConf object with selected resources before submitting a job instead of reusing the Configuration object that may have been created earlier? I am trying to understand why Pig adds core-site.xml, hdfs-site.xml, and yarn-site.xml again.
>>>
>>> On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <[email protected]> wrote:
>>>
>>>> Sounds good. Here is a doc on contributing a patch (for some pointers): https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>>>>
>>>> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>
>>>>> Hey Prashant,
>>>>>
>>>>> Yup, I can take a stab at it. This is the first time I am looking at the Pig code, so I might take some time to get started. Will get back to you if I have questions in the meantime. And yes, I will write it so it reads a Pig property.
>>>>>
>>>>> -
>>>>> Bhooshan.
>>>>>
>>>>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <[email protected]> wrote:
>>>>>
>>>>>> Hi Bhooshan,
>>>>>>
>>>>>> This makes more sense now. I think overriding the fs implementation should go into core-site.xml, but it would be useful to be able to add resources if you have a bunch of other properties.
>>>>>>
>>>>>> Would you like to submit a patch? It should be based on a Pig property that specifies the additional resource names (myfs-site.xml in your case).
>>>>>>
>>>>>> -Prashant
>>>>>>
>>>>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Prashant,
>>>>>>>
>>>>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the scenario that I am trying to test:
>>>>>>>
>>>>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a filesystem I am trying to implement - let's call it MyFileSystem.class. This filesystem uses the scheme myfs:// for its URIs.
>>>>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made the class available through a jar file that is part of HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>>>>> 3. In MyFileSystem.class, I have a static block:
>>>>>>>        static {
>>>>>>>            Configuration.addDefaultResource("myfs-default.xml");
>>>>>>>            Configuration.addDefaultResource("myfs-site.xml");
>>>>>>>        }
>>>>>>>    Both these files are on the classpath. To be safe, I have also added myfs-site.xml in the constructor of MyFileSystem as conf.addResource("myfs-site.xml"), so that it is part of both the default resources and the non-default resources in the Configuration object.
>>>>>>> 4. I am trying to access the filesystem in my pig script as:
>>>>>>>        A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS (name:chararray, age:int); -- loading data
>>>>>>>        B = FOREACH A GENERATE name;
>>>>>>>        STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>>>>>> 5. The execution seems to start correctly, and MyFileSystem.class is invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml is loaded and the properties defined in it are available.
>>>>>>> 6. However, when Pig tries to submit the job, it cannot find these properties and the job fails to submit.
>>>>>>> 7. If I move all the properties defined in myfs-site.xml to core-site.xml, the job gets submitted successfully, and it even succeeds. However, this is not ideal, as I do not want to clutter core-site.xml with all of the properties for a separate filesystem.
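[Editorial note: the registration in step 2 above would look something like the fragment below, following Hadoop's fs.<scheme>.impl convention. The package name com.example is an assumption; MyFileSystem is this thread's hypothetical class.]

```xml
<!-- core-site.xml: map the myfs:// scheme to the custom FileSystem class.
     com.example.MyFileSystem is a placeholder for the actual implementation. -->
<property>
  <name>fs.myfs.impl</name>
  <value>com.example.MyFileSystem</value>
</property>
```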
>>>>>>> 8. As I said earlier, upon taking a closer look at the Pig code, I saw that while creating the JobConf object for a job, Pig adds very specific resources to the job object and ignores the resources that may have already been added (e.g. myfs-site.xml) to the Configuration object.
>>>>>>> 9. I have tested this with native map-reduce code as well as Hive, and this approach of having a separate config file for MyFileSystem works fine in both those cases.
>>>>>>>
>>>>>>> So, to summarize, I am looking for a way to ask Pig to load parameters from my own config file before submitting a job.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -
>>>>>>> Bhooshan.
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>>
>>>>>>>> +User group
>>>>>>>>
>>>>>>>> Hi Bhooshan,
>>>>>>>>
>>>>>>>> By default you should be running in MapReduce mode unless specified otherwise. Are you creating a PigServer object to run your jobs? Can you provide your code here?
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Apologies for the premature send. I may have some more information. After I applied the patch and set "pig.use.overriden.hadoop.configs=true", I saw an NPE (stacktrace below) and a message saying Pig was running in exectype local:
>>>>>>>>
>>>>>>>> 2013-04-13 07:37:13,758 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: local
>>>>>>>> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>>>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>
>>>>>>>> Here is the stacktrace:
>>>>>>>>
>>>>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Pig script failed to parse:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>>>>>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>>>>         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>>>>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>>>>         ... 14 more
>>>>>>>> Caused by:
>>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>>>>>>>         at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>>>>         at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>>>>         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>>>>         ... 15 more
>>>>>>>>
>>>>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml, or yarn-site.xml - only my-filesystem-site.xml, using both Configuration.addDefaultResource and Configuration.addResource.
>>>>>>>>>
>>>>>>>>> I see what you are saying, though. The patch might require users to take care of adding the default config resources as well, apart from their own resources?
>>>>>>>>>
>>>>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your configuration resources?
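[Editorial note: the property discussed above comes from the PIG-3135 patch; spelled exactly as quoted in this thread, it would be enabled in pig.properties as a fragment like the following.]

```properties
# pig.properties - opt in to user-supplied Hadoop configuration resources
# (property name as quoted in this thread, introduced by PIG-3135)
pig.use.overriden.hadoop.configs=true
```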
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Prashant,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your response to my question, and sorry for the delayed reply. I was not subscribed to the dev mailing list and hence did not get a notification about your reply. I have copied our thread below so you can get some context.
>>>>>>>>>>>
>>>>>>>>>>> I tried the patch that you pointed to; however, with that patch it looks like Pig is unable to find core-site.xml. It indicates that it is running the script in local mode in spite of having fs.default.name defined as the location of the HDFS namenode.
>>>>>>>>>>>
>>>>>>>>>>> Here is what I am trying to do: I have developed my own org.apache.hadoop.fs.FileSystem implementation and am trying to use it in my pig script. This implementation requires its own *-default.xml and *-site.xml files. I have added the path to these files to PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm that Hadoop can find them, as I am able to read these configurations in my code. However, the Pig code cannot find these configuration parameters. Upon doing some debugging in the Pig code, it seems to me that Pig does not use all the resources added to the Configuration object, but only certain specific ones like hadoop-site.xml, core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml, and hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to have Pig load user-defined resources, say foo-default.xml and foo-site.xml, while creating the JobConf object? I am narrowing in on this as the problem, because Pig can find my config parameters if I define them in core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>>>>
>>>>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>>>>
>>>>>>>>>>> Here is our previous conversation:
>>>>>>>>>>>
>>>>>>>>>>> Hi Bhooshan,
>>>>>>>>>>>
>>>>>>>>>>> There is a patch that addresses what you need, and it is part of 0.12 (unreleased). Take a look and see if you can apply the patch to the version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>>>>>>>>
>>>>>>>>>>> With this patch, the following property will allow you to override the default and pass in your own configuration:
>>>>>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Hi Folks,
>>>>>>>>>>> >
>>>>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system at work. This implementation uses some config files that are similar in structure to Hadoop config files: they have a *-default.xml and a *-site.xml for users to override default properties. In the class that implemented the Hadoop FileSystem, I had added these configuration files as default resources in a static block, using Configuration.addDefaultResource("my-default.xml") and Configuration.addDefaultResource("my-site.xml"). This was working fine, and we were able to run the Hadoop filesystem CLI and map-reduce jobs just fine for our storage system. However, when we tried using this storage system in pig scripts, we saw errors indicating that our configuration parameters were not available. Upon further debugging, we saw that the config files were added to the Configuration object as resources, but as part of defaultResources. However, in Main.java in the Pig source, we saw that the Configuration object was created as Configuration conf = new Configuration(false);, thereby setting loadDefaults to false in the conf object. As a result, properties from the default resources (including my config files) were not loaded and hence were unavailable.
>>>>>>>>>>> >
>>>>>>>>>>> > We solved the problem by using Configuration.addResource instead of Configuration.addDefaultResource, but still could not figure out why Pig does not use default resources.
>>>>>>>>>>> >
>>>>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> > --
>>>>>>>>>>> > Bhooshan

--
Bhooshan
