Re: hadoop input/output format advanced control

2015-03-25 Thread Aaron Davidson
Should we mention that you should synchronize on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :) On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell wrote: > Great - that's even easier. Maybe we could ha
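A minimal sketch of what Aaron is describing, assuming you are working inside Spark's own codebase (the lock lives on the `HadoopRDD` companion object, which is not part of the public API): before HADOOP-10456 was fixed in Hadoop 2.7.0, the `Configuration` copy constructor was not thread-safe, so Spark serializes all clones through a single lock.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.HadoopRDD

// Hypothetical helper: clone a Configuration under Spark's global lock.
// Needed because the Configuration copy constructor races with itself
// on Hadoop versions prior to 2.7.0 (HADOOP-10456).
def cloneConf(conf: Configuration): Configuration =
  HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    new Configuration(conf)
  }
```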

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc. On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza wrote: > Regarding Patrick's question, you can just do "new Configuration(oldConf)" > to get a cloned Configuration object and add any new properties to it. > > -Sandy > > On W

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do "new Configuration(oldConf)" to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid wrote: > Hi Nick, > > I don't remember the exact details of these scenarios, but I think the use
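Sandy's suggestion, spelled out as a sketch (`sc` is assumed to be an in-scope `SparkContext`, and the property key/value are illustrative): the copy constructor gives you an independent `Configuration`, so per-RDD settings never touch the shared `sc.hadoopConfiguration`.

```scala
import org.apache.hadoop.conf.Configuration

// Copy-construct from the context-wide conf, then layer per-RDD
// properties on top; sc.hadoopConfiguration is left untouched.
val cloned = new Configuration(sc.hadoopConfiguration)
cloned.set("mapreduce.input.fileinputformat.split.minsize", "134217728")
```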

Re: hadoop input/output format advanced control

2015-03-25 Thread Imran Rashid
Hi Nick, I don't remember the exact details of these scenarios, but I think the user wanted a lot more control over how the files got grouped into partitions, to group the files together by some arbitrary function. I didn't think that was possible w/ CombineFileInputFormat, but maybe there is a w

Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers
yeah fair enough On Wed, Mar 25, 2015 at 2:41 PM, Patrick Wendell wrote: > Yeah I agree that might have been nicer, but I think for consistency > with the input APIs maybe we should do the same thing. We can also > give an example of how to clone sc.hadoopConfiguration and then set > some new v

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah I agree that might have been nicer, but I think for consistency with the input APIs maybe we should do the same thing. We can also give an example of how to clone sc.hadoopConfiguration and then set some new values: val conf = sc.hadoopConfiguration.clone() .set("k1", "v1") .set("k2", "v
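One caveat about the snippet Patrick sketches: Hadoop's `Configuration.set` returns `Unit`, so the calls can't be chained, and the usual way to copy a `Configuration` is the copy constructor rather than `clone()`. A version that compiles might look like this (`sc` assumed in scope; keys and values are placeholders):

```scala
import org.apache.hadoop.conf.Configuration

// Configuration.set returns Unit, so set the properties one per line
// on a copy-constructed Configuration.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("k1", "v1")
conf.set("k2", "v2")
```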

Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers
my personal preference would be something like a Map[String, String] that only reflects the changes you want to make to the Configuration for the given input/output format (so system-wide defaults continue to come from sc.hadoopConfiguration), similarly to what cascading/scalding did, but an arbitrary
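Koert's proposal can be modeled with a plain `Map` to keep the sketch self-contained (a real implementation would layer the overrides onto a cloned Hadoop `Configuration`; the key shown is just an example): system-wide defaults are the base layer, and the per-format map wins on conflicts.

```scala
// Per-format overrides layered over system-wide defaults.
// Right-hand side of ++ wins, so overrides shadow the defaults.
def withOverrides(defaults: Map[String, String],
                  overrides: Map[String, String]): Map[String, String] =
  defaults ++ overrides

val base      = Map("mapreduce.input.fileinputformat.split.minsize" -> "1")
val effective = withOverrides(
  base, Map("mapreduce.input.fileinputformat.split.minsize" -> "134217728"))
```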

Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
I see - if you look, in the saving functions we have the option for the user to pass an arbitrary Configuration. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894 It seems fine to have the same option for the loading functions, if it'

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers
the (compression) codec parameter that is now part of many saveAs... methods came from a similar need. see SPARK-763. hadoop has many options like this. you're either going to have to allow many more of these optional arguments to all the methods that r
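The codec overloads Koert refers to (added under SPARK-763) take a codec class directly, so no `Configuration` surgery is needed for that one option. A sketch, with `rdd` and the output path as placeholders:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// saveAsTextFile has an overload accepting a CompressionCodec class;
// rdd is any RDD[String] (or anything with a useful toString).
rdd.saveAsTextFile("/tmp/out", classOf[GzipCodec])
```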

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers
i would like to use objectFile with some tweaks to the hadoop conf. currently there is no way to do that, except recreating objectFile myself. and some of the code objectFile uses i have no access to, since it's private to spark. On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell wrote: > Yeah - t

Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature? On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath wrote: > Imran, on your point to read multiple

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point about reading multiple files together in a partition, is it not simpler to use the approach of copying the Hadoop conf and setting per-RDD settings for min split size to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran
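Nick's approach, sketched end to end (`sc` assumed in scope; the input path and 128 MB split size are placeholders): clone the conf, raise the minimum split size so small files get packed into fewer, larger splits, and pass the conf to just this one input RDD via `newAPIHadoopFile`.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Per-RDD conf: a copy of the context-wide conf with a larger min split
// size, so this RDD gets fewer, bigger partitions than the default.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.split.minsize",
         (128L * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile(
  "/data/in",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  conf)
```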

Re: hadoop input/output format advanced control

2015-03-24 Thread Imran Rashid
I think this would be a great addition, I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose wh