Yes, it's AvroParquetInputFormat, which is splittable. If I force a repartitioning, it works. If I don't, Spark chokes on my not-terribly-large 250 MB files.
PySpark's documentation says that the dictionary is turned into a Configuration object:

    @param conf: Hadoop configuration, passed in as a dict (None by default)

On Mon, Sep 15, 2014 at 3:26 PM, Sean Owen <so...@cloudera.com> wrote:

> Heh, it's still just a suggestion to Hadoop I guess, not guaranteed.
>
> Is it a splittable format? For example, some compressed formats are not
> splittable, and Hadoop has to process whole files at a time.
>
> I'm also not sure if this is something to do with PySpark, since the
> underlying Scala API takes a Configuration object rather than a
> dictionary.
>
> On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
> > That would be awesome, but it doesn't seem to have any effect.
> >
> > In PySpark, I created a dict with that key and a numeric value, then
> > passed it into newAPIHadoopFile as the value for the "conf" keyword.
> > The returned RDD still has a single partition.
> >
> > On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I think the reason is simply that there is no longer an explicit
> >> min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
> >> At least, I didn't see it when I glanced just now.
> >>
> >> However, you should be able to get the same effect by setting a
> >> Configuration property, and you can do so through the newAPIHadoopFile
> >> method. You set it as a suggested maximum split size rather than a
> >> suggested minimum number of splits.
> >>
> >> Although I think the old config property mapred.max.split.size is
> >> still respected, you may try
> >> mapreduce.input.fileinputformat.split.maxsize instead, which appears
> >> to be the intended replacement in the new APIs.
> >>
> >> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
> >> <eric.d.fried...@gmail.com> wrote:
> >> > sc.textFile takes a minimum # of partitions to use.
> >> >
> >> > Is there a way to get sc.newAPIHadoopFile to do the same?
> >> >
> >> > I know I can repartition() and get a shuffle. I'm wondering if
> >> > there's a way to tell the underlying InputFormat (AvroParquet, in my
> >> > case) how many partitions to use at the outset.
> >> >
> >> > What I'd really prefer is to get the partitions automatically
> >> > defined based on the number of blocks.
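
For reference, here is a minimal, untested sketch of the approach discussed
above: passing the max-split-size property through the conf dict of
sc.newAPIHadoopFile. It uses the plain Hadoop TextInputFormat just to show the
mechanism; for AvroParquet you would substitute its input-format, key, and
value classes (and likely a value converter). The file path is hypothetical.

    from pyspark import SparkContext

    sc = SparkContext(appName="max-split-size-demo")

    # Hadoop Configuration values are strings, so pass the byte count as a
    # string. ~32 MB per split, so a 250 MB file should come back as roughly
    # 8 partitions if the format is splittable.
    hadoop_conf = {
        "mapreduce.input.fileinputformat.split.maxsize": str(32 * 1024 * 1024),
        # older property name mentioned above, set as well just in case
        "mapred.max.split.size": str(32 * 1024 * 1024),
    }

    rdd = sc.newAPIHadoopFile(
        "hdfs:///tmp/big-input.txt",  # hypothetical path
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=hadoop_conf,
    )

    print(rdd.getNumPartitions())

If the RDD still comes back as a single partition, the fallback remains
rdd.repartition(n), at the cost of a shuffle, as noted at the top of the
thread.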