Yes, it's AvroParquetInputFormat, which is splittable.  If I force a
repartitioning, it works. If I don't, Spark chokes on my not-terribly-large
250 MB files.

PySpark's documentation says that the dictionary is turned into a
Configuration object.

@param conf: Hadoop configuration, passed in as a dict (None by default)

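For concreteness, here's roughly what my call looks like (a sketch only --
the path and the key/value class names below are placeholders, not exactly
what I have):

# Suggested maximum split size in bytes; I've tried both a numeric value
# and a string here.
conf = {
    "mapreduce.input.fileinputformat.split.maxsize": str(64 * 1024 * 1024),
}

rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/parquet-dir",              # placeholder path
    "parquet.avro.AvroParquetInputFormat",      # assumed fully-qualified class name
    "java.lang.Void",                           # Parquet's key type
    "org.apache.avro.generic.IndexedRecord",    # value class; may differ
    conf=conf,
)

print(rdd.getNumPartitions())  # still comes back as a single partition
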
On Mon, Sep 15, 2014 at 3:26 PM, Sean Owen <so...@cloudera.com> wrote:

> Heh, it's still just a suggestion to Hadoop I guess, not guaranteed.
>
> Is it a splittable format? For example, some compressed formats are
> not splittable and Hadoop has to process whole files at a time.
>
> I'm also not sure if this is something to do with pyspark, since the
> underlying Scala API takes a Configuration object rather than
> dictionary.
>
> On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
> > That would be awesome, but doesn't seem to have any effect.
> >
> > In PySpark, I created a dict with that key and a numeric value, then
> passed
> > it into newAPIHadoopFile as a value for the "conf" keyword.  The returned
> > RDD still has a single partition.
> >
> > On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I think the reason is simply that there is no longer an explicit
> >> min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
> >> At least, I didn't see it when I glanced just now.
> >>
> >> However, you should be able to get the same effect by setting a
> >> Configuration property, and you can do so through the newAPIHadoopFile
> >> method. You set it as a suggested maximum split size rather than a
> >> suggested minimum number of splits.
> >>
> >> Although I think the old config property mapred.max.split.size is
> >> still respected, you may try
> >> mapreduce.input.fileinputformat.split.maxsize instead, which appears
> >> to be the intended replacement in the new APIs.
> >>
> >> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
> >> <eric.d.fried...@gmail.com> wrote:
> >> > sc.textFile takes a minimum # of partitions to use.
> >> >
> >> > Is there a way to get sc.newAPIHadoopFile to do the same?
> >> >
> >> > I know I can repartition() and get a shuffle.  I'm wondering if
> there's
> >> > a
> >> > way to tell the underlying InputFormat (AvroParquet, in my case) how
> >> > many
> >> > partitions to use at the outset.
> >> >
> >> > What I'd really prefer is to get the partitions automatically defined
> >> > based
> >> > on the number of blocks.
> >
> >
>
