I'm using CDH 5.1 with Spark 1.0.

I'm trying to read a Parquet file with Spark SQL, following the Programming Guide:

val parquetFile = sqlContext.parquetFile(path)

If the "path" is a file, it throws an exception:

 Exception in thread "main" java.lang.IllegalArgumentException: Expected hdfs://*/file.parquet for be a directory with Parquet files/metadata
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetRelation.scala:301)
    at org.apache.spark.sql.parquet.ParquetRelation.parquetSchema(ParquetRelation.scala:62)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:69)
    at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:98)

However, if the "path" is the parent directory of the file, it succeeds.
Note: there is only one file in that directory.
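
The parent-directory workaround can be sketched as below. This is only a minimal sketch, not anything provided by Spark: the object name `ParquetPathWorkaround` and the helper `parquetDirFor` are made up for illustration, and it assumes the Hadoop client libraries are on the classpath and that the parent directory contains nothing but the Parquet file(s) to be read.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, not part of Spark or Hadoop.
object ParquetPathWorkaround {
  // Returns a directory that can be handed to sqlContext.parquetFile:
  // the location itself if it is already a directory, otherwise its parent.
  def parquetDirFor(location: String, conf: Configuration): Path = {
    val path = new Path(location)
    // Resolve the file system (HDFS, local, ...) from the path's scheme.
    val fs = path.getFileSystem(conf)
    if (fs.getFileStatus(path).isDirectory) path else path.getParent
  }
}
```

With an existing `SparkContext` named `sc`, it might be used as `sqlContext.parquetFile(ParquetPathWorkaround.parquetDirFor(path, sc.hadoopConfiguration).toString)`. Note that this reads whatever is in the parent directory, so it only helps when the file is alone there, as in my case.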

I looked into the source:

  /**
   * Try to read Parquet metadata at the given Path. We first see if there is a summary file
   * in the parent directory. If so, this is used. Else we read the actual footer at the given
   * location.
   * @param origPath The path at which we expect one (or more) Parquet files.
   * @return The `ParquetMetadata` containing among other things the schema.
   */
  def readMetaData(origPath: Path): ParquetMetadata

The doc comment doesn't say the path has to be a directory, but the method throws an exception when it isn't one:

 if (!fs.getFileStatus(path).isDir) {
  throw new IllegalArgumentException(
    s"Expected $path for be a directory with Parquet files/metadata")
 }


This seems odd to me. Can anybody explain why it behaves this way, and how to
read a single file rather than a directory?
