Hello,
I have a two-stage processing pipeline:
1. A Spark streaming job receives data from Kafka and saves it to
partitioned ORC.
2. A Spark ETL job runs once per day and compacts each partition, i.e.
merges small files into bigger ones (I partition by two columns, e.g.
dt=20180529/location=mumbai). The argument to the compactor job is the
full path to a single partition, so the compactor job cannot update the
metadata. (A rough sketch of both jobs is below.)
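To make the setup concrete, here is a minimal sketch of the two jobs. The paths, broker address, topic name and the dt/location derivations are made up for illustration; only the overall shape matters:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()

  // Stage 1: streaming job - reads from Kafka and writes partitioned ORC.
  // This writer is what creates the _spark_metadata directory inside the
  // output path.
  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
    .option("subscribe", "events")                    // hypothetical topic
    .load()

  val query = stream
    .selectExpr(
      "CAST(value AS STRING) AS value",
      "date_format(timestamp, 'yyyyMMdd') AS dt",     // hypothetical derivation
      "'mumbai' AS location")                         // hypothetical derivation
    .writeStream
    .format("orc")
    .option("path", "/warehouse/events")              // hypothetical output path
    .option("checkpointLocation", "/checkpoints/events")
    .partitionBy("dt", "location")
    .start()

  // Stage 2: daily compactor - gets the full path of one partition, rewrites
  // its many small files as a few bigger ones, and therefore never sees or
  // updates the table-level _spark_metadata directory.
  val partitionPath = "/warehouse/events/dt=20180529/location=mumbai"
  val small = spark.read.orc(partitionPath)
  small.coalesce(1).write.mode("overwrite").orc(partitionPath + ".compacted")
  // ...after which the original small files are removed and the compacted
  // files are moved into place (exact mechanics omitted).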
So the next time I want to read this table as ORC (if I read it as a Hive
table, it works), Spark reads the metadata directory, finds the structure of
the ORC table (the partitions and the files placed in those partitions),
tries to read one of those files and fails with a file-not-found error,
because the compactor job has already removed that file and merged it into
another one (see the sketch just below).
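In other words, with a hypothetical output path and Hive table name, the behaviour looks like this:

  // Batch read by path: Spark notices /warehouse/events/_spark_metadata,
  // treats the streaming sink's file log as the list of data files, and then
  // tries to open files that the compactor has already deleted.
  val byPath = spark.read.orc("/warehouse/events")
  byPath.count() // fails with FileNotFoundException after compaction

  // Reading the same data as a Hive table works, because the partition/file
  // listing does not go through _spark_metadata.
  val byTable = spark.table("events") // hypothetical Hive table name
  byTable.count() // works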
I see three workarounds:
1. Remove _spark_metadata manually (a sketch is below, after this list).
2. Modify the Spark compactor job so that it also updates the metadata.
3. Find a configuration property that makes Spark ignore _spark_metadata.
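For workaround 1, a minimal sketch of what I mean, assuming I had the rights to do it, using the Hadoop FileSystem API (the path is hypothetical):

  import org.apache.hadoop.fs.Path

  // Delete the streaming sink's metadata log so that subsequent batch reads
  // fall back to plain directory/partition listing.
  val metadataDir = new Path("/warehouse/events/_spark_metadata")
  val fs = metadataDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.delete(metadataDir, true) // recursive delete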
Options 1 and 2 are good, but it may be that I do not have the access rights
for them. So does option 3 exist? I checked
https://github.com/apache/spark/blob/56e9e97073cf1896e301371b3941c9307e42ff77/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L199
and could not find any such property. If it does not exist, I think it should
be added to Spark in some way. Maybe it should not be a global property but a
per-query option.
