That's a good idea and one I had considered too.  Unfortunately I'm not
aware of an API in PySpark for enumerating paths on HDFS.  Have I
overlooked one?
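
The closest I can think of is dropping down to the Hadoop FileSystem
API through the Py4J gateway, e.g. (untested, and it leans on the
private _jvm/_jsc handles):

# list HDFS paths from PySpark via the JVM gateway; assumes a live
# SparkContext `sc` and a hypothetical top-level directory
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path("/data/topLevelDir"))
# skip administrative files like .foo / _foo
paths = [s.getPath().toString() for s in statuses
         if not s.getPath().getName().startswith(('.', '_'))]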

On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu <dav...@databricks.com> wrote:

> In PySpark, I think you could enumerate all the valid files, create an
> RDD from each with newAPIHadoopFile(), and then union them together.
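>
> As a rough sketch (untested; class names assume a pre-org.apache
> Parquet release, GenericRecord stands in for your Avro type, and you
> may still need key/value converters to get friendly Python objects
> back):
>
> # assumes `sc` and `paths`, a list of the valid data file paths
> conf = {"parquet.read.support.class": "parquet.avro.AvroReadSupport"}
> rdds = [sc.newAPIHadoopFile(path,
>                             "parquet.hadoop.ParquetInputFormat",
>                             "java.lang.Void",
>                             "org.apache.avro.generic.GenericRecord",
>                             conf=conf)
>         for path in paths]
> combined = sc.union(rdds)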
>
> On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
> > I neglected to specify that I'm using pyspark. Doesn't look like these
> > APIs have been bridged.
> >
> > ----
> > Eric Friedman
> >
> > On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com>
> > wrote:
> >>
> >> Hi Eric,
> >>
> >> Something along the lines of the following should work
> >>
> >> val fs = getFileSystem(...) // standard hadoop API call
> >> // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
> >> val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
> >>   .map(_.getPath.toString).mkString(",")
> >> // ParquetInputFormat is a new-API (mapreduce) input format, so it goes
> >> // through newAPIHadoopFile rather than hadoopFile
> >> val parquetRdd = sc.newAPIHadoopFile(filteredConcatenatedPaths,
> >>   classOf[ParquetInputFormat[SomeAvroType]], classOf[Void],
> >>   classOf[SomeAvroType], getConfiguration(...))
> >>
> >> You have to do some initialization on ParquetInputFormat, such as
> >> setting the AvroReadSupport/AvroWriteSupport classes, but I am
> >> guessing you are doing that already.
> >>
> >> Cheers,
> >> Nat
> >>
> >>
> >> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
> >> <eric.d.fried...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> I have a directory structure with parquet+avro data in it. There are
> >>> a couple of administrative files (.foo and/or _foo) that I need to
> >>> ignore when processing this data, or Spark tries to read them as
> >>> containing parquet content, which they do not.
> >>>
> >>> How can I set a PathFilter on the FileInputFormat used to construct
> >>> an RDD?
> >
>
