Imran, on your point to read multiple files together in a partition, is it
not simpler to use the approach of copy Hadoop conf and set per-RDD
settings for min split to control the input size per partition, together
with something like CombineFileInputFormat?

On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <iras...@cloudera.com> wrote:

> I think this would be a great addition, I totally agree that you need to be
> able to set these at a finer context than just the SparkContext.
>
> Just to play devil's advocate, though -- the alternative is for you just
> subclass HadoopRDD yourself, or make a totally new RDD, and then you could
> expose whatever you need.  Why is this solution better?  IMO the criteria
> are:
> (a) common operations
> (b) error-prone / difficult to implement
> (c) non-obvious, but important for performance
>
> I think this case fits (a) & (c), so I think its still worthwhile.  But its
> also worth asking whether or not its too difficult for a user to extend
> HadoopRDD right now.  There have been several cases in the past week where
> we've suggested that a user should read from hdfs themselves (eg., to read
> multiple files together in one partition) -- with*out* reusing the code in
> HadoopRDD, though they would lose things like the metric tracking &
> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
> refactoring to make that easier to do?  Or do we just need a good example?
>
> Imran
>
> (sorry for hijacking your thread, Koert)
>
>
>
> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> > see email below. reynold suggested i send it to dev instead of user
> >
> > ---------- Forwarded message ----------
> > From: Koert Kuipers <ko...@tresata.com>
> > Date: Mon, Mar 23, 2015 at 4:36 PM
> > Subject: hadoop input/output format advanced control
> > To: "u...@spark.apache.org" <u...@spark.apache.org>
> >
> >
> > currently its pretty hard to control the Hadoop Input/Output formats used
> > in Spark. The conventions seems to be to add extra parameters to all
> > methods and then somewhere deep inside the code (for example in
> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> into
> > settings on the Hadoop Configuration object.
> >
> > for example for compression i see "codec: Option[Class[_ <:
> > CompressionCodec]] = None" added to a bunch of methods.
> >
> > how scalable is this solution really?
> >
> > for example i need to read from a hadoop dataset and i dont want the
> input
> > (part) files to get split up. the way to do this is to set
> > "mapred.min.split.size". now i dont want to set this at the level of the
> > SparkContext (which can be done), since i dont want it to apply to input
> > formats in general. i want it to apply to just this one specific input
> > dataset i need to read. which leaves me with no options currently. i
> could
> > go add yet another input parameter to all the methods
> > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
> > etc.). but that seems ineffective.
> >
> > why can we not expose a Map[String, String] or some other generic way to
> > manipulate settings for hadoop input/output formats? it would require
> > adding one more parameter to all methods to deal with hadoop input/output
> > formats, but after that its done. one parameter to rule them all....
> >
> > then i could do:
> > val x = sc.textFile("/some/path", formatSettings =
> > Map("mapred.min.split.size" -> "12345"))
> >
> > or
> > rdd.saveAsTextFile("/some/path, formatSettings =
> > Map(mapred.output.compress" -> "true", "mapred.output.compression.codec"
> ->
> > "somecodec"))
> >
>

Reply via email to