Hello, I think I may have jumped to the wrong conclusion about symlinks, and I was able to get what I want working perfectly.
I added these two settings in my importer application:

    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Then when I read the parquet table, I set the "basePath" option to the parent of
each of the partitions, e.g.:

    val df = sqlContext.read.options(Map("basePath" -> "/path/to/table")).parquet("/path/to/table/a=*")

I also checked that the symlinks were followed the way I wanted, by removing one
of the symlinks after creating the DataFrame, and I was able to query the
DataFrame without error.

- Philip

On Fri, Apr 29, 2016 at 9:56 AM, Philip Weaver <philip.wea...@gmail.com> wrote:

> Hello,
>
> I have a parquet dataset, partitioned by a column 'a'. I want to take advantage
> of Spark SQL's ability to filter to the partition when you filter on 'a'. I also
> want to periodically update individual partitions without disrupting any jobs
> that are querying the data.
>
> The obvious solution was to write parquet datasets to a separate directory and
> then update a symlink to point to it. Readers resolve the symlink to construct
> the DataFrame, so that when an update occurs any jobs continue to read the
> version of the data that they started with. Old data is cleaned up after no jobs
> are using it.
>
> This strategy works fine when updating an entire top-level parquet
> database. However, it seems like Spark SQL (or parquet) cannot handle partition
> directories being symlinks (and even if it could, it probably wouldn't resolve
> those symlinks so that it doesn't blow up when the symlink changes at
> runtime). For example, if you create symlinks a=1, a=2 and a=3 in a directory
> and then try to load that directory in Spark SQL, you get the "Conflicting
> partition column names detected" error.
>
> So my question is, can anyone think of another solution that meets my
> requirements (i.e. to take advantage of partitioning and perform safe updates of
> existing partitions)?
>
> Thanks!
>
> - Philip
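
P.S. For anyone who wants to try the symlink swap described in the quoted
message, here is a minimal sketch of the filesystem side only, using
java.nio.file and no Spark. The object name SymlinkSwap, the repoint helper and
the a=1.v1 / a=1.v2 directory names are hypothetical, chosen just for
illustration. It also assumes a POSIX filesystem where an atomic rename replaces
an existing symlink; the Java documentation leaves that behaviour
implementation-specific, so treat it as an assumption rather than a guarantee.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object SymlinkSwap {
  // Point `link` at `target` by creating a temporary symlink next to it and
  // renaming that over the old link. Assumption: the rename replaces an
  // existing symlink atomically (true for rename(2) on typical Unix systems).
  def repoint(link: Path, target: Path): Unit = {
    val tmp = link.resolveSibling(link.getFileName.toString + ".tmp")
    Files.deleteIfExists(tmp)
    Files.createSymbolicLink(tmp, target)
    Files.move(tmp, link, StandardCopyOption.ATOMIC_MOVE)
  }

  def main(args: Array[String]): Unit = {
    val root = Files.createTempDirectory("table")
    val v1 = Files.createDirectory(root.resolve("a=1.v1")) // old partition data
    val v2 = Files.createDirectory(root.resolve("a=1.v2")) // new partition data
    val link = root.resolve("a=1")

    repoint(link, v1)
    val pinned = link.toRealPath() // a reader resolves the link up front
    repoint(link, v2)              // an update arrives while the reader runs

    println(s"reader still pinned to old data: ${pinned == v1.toRealPath()}")
    println(s"new readers see new data: ${link.toRealPath() == v2.toRealPath()}")
  }
}
```

A reader that resolves the link once keeps the old directory for as long as it
runs, which is why the old data can only be removed after no jobs are using it.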