That's a good idea and one I had considered too. Unfortunately I'm not aware of an API in PySpark for enumerating paths on HDFS. Have I overlooked one?
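The closest thing I can see is reaching through the Py4J gateway to the Hadoop FileSystem classes, which is essentially Nat's snippet translated into Python. A rough, untested sketch (sc._jvm and sc._jsc are private handles, and the directory path here is made up):

# Rough sketch: list HDFS paths from PySpark via the Py4J gateway.
# sc._jvm and sc._jsc are internal handles; "/data/parquet" is a placeholder.
hadoop = sc._jvm.org.apache.hadoop
conf = sc._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)

statuses = fs.listStatus(hadoop.fs.Path("/data/parquet"))

# Drop the administrative files (.foo / _foo) rather than using a PathFilter.
valid_paths = [s.getPath().toString()
               for s in statuses
               if not s.getPath().getName().startswith((".", "_"))]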
On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu <dav...@databricks.com> wrote:
> In PySpark, I think you could enumerate all the valid files, create an
> RDD for each with newAPIHadoopFile(), then union them together.
>
> On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman <eric.d.fried...@gmail.com> wrote:
> > I neglected to specify that I'm using pyspark. It doesn't look like these
> > APIs have been bridged.
> >
> > ----
> > Eric Friedman
> >
> >> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com> wrote:
> >>
> >> Hi Eric,
> >>
> >> Something along the lines of the following should work:
> >>
> >> val fs = getFileSystem(...) // standard Hadoop API call
> >> val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
> >>   .map(_.getPath.toString).mkString(",") // pathFilter is an instance of
> >>   // org.apache.hadoop.fs.PathFilter
> >> val parquetRdd = sc.hadoopFile(filteredConcatenatedPaths,
> >>   classOf[ParquetInputFormat[Something]], classOf[Void],
> >>   classOf[SomeAvroType], getConfiguration(...))
> >>
> >> You have to do some initialization on ParquetInputFormat, such as setting
> >> AvroReadSupport/AvroWriteSupport, but I'm guessing you are doing that
> >> already.
> >>
> >> Cheers,
> >> Nat
> >>
> >> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman <eric.d.fried...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> I have a directory structure with parquet+avro data in it. There are a
> >>> couple of administrative files (.foo and/or _foo) that I need to ignore
> >>> when processing this data, or Spark tries to read them as containing
> >>> parquet content, which they do not.
> >>>
> >>> How can I set a PathFilter on the FileInputFormat used to construct an RDD?
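For completeness, I think Davies' suggestion would look roughly like the following in PySpark once the valid paths are in hand. The input format and value class names are guesses and depend on the parquet/avro jars on the classpath, and a custom valueConverter may be needed to turn Avro records into Python objects:

# Rough sketch of Davies' suggestion: one RDD per valid file, then a union.
# Class names below are guesses; adjust for the parquet/avro versions in use.
rdds = [sc.newAPIHadoopFile(
            path,
            "parquet.avro.AvroParquetInputFormat",       # guess
            "java.lang.Void",
            "org.apache.avro.generic.GenericRecord")     # guess
        for path in valid_paths]

# A valueConverter= argument may be needed so the Avro records can be pickled.
records = sc.union(rdds)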