My Parquet files are partitioned first by environment and then by date, like so:

env=testing/
   date=2018-03-04/
          part1.parquet
          part2.parquet
          part3.parquet
   date=2018-03-05/
          part1.parquet
          part2.parquet
          part3.parquet
   date=2018-03-06/
          part1.parquet
          part2.parquet
          part3.parquet
In our read stream, we do the following:

val tunerParquetDF = spark
      .readStream
      .schema(...)
      .format("parquet")
      .option("basePath", basePath)
      .option("path", basePath+"/env*")
      .option("maxFilesPerTrigger", 5)
      .load()

The expected behavior is that readStream will read the files in date order, but the observed behavior is that the files come back in a seemingly random order. How do I force Parquet files to be read in date order?
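One thing to note (as I understand it, not authoritative): the structured streaming file source selects files by modification timestamp, not by directory path, so the `date=` partition value in the path has no influence on batch order. If the order must follow the partition date, one workaround is to collect the file paths yourself and sort them by the `date=` component before processing. A minimal, self-contained sketch of that sorting step, where `DateOrder` and `sortByDate` are hypothetical names of my own:

```scala
object DateOrder {
  // Matches the date=YYYY-MM-DD partition component anywhere in a path.
  private val DatePattern = raw".*date=(\d{4}-\d{2}-\d{2})/.*".r

  // Extract the partition date from a path; paths without one sort first.
  def dateOf(path: String): String = path match {
    case DatePattern(d) => d
    case _              => ""
  }

  // Sort paths by partition date, then by the full path for a stable order.
  def sortByDate(paths: Seq[String]): Seq[String] =
    paths.sortBy(p => (dateOf(p), p))
}
```

The sorted list could then be fed to a plain (non-streaming) `spark.read.parquet(...)` one date at a time, which guarantees processing order at the cost of giving up the streaming trigger semantics.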




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
