> Or, is this an artifact of an incompatibility between ORC files written by
> the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?
> 3. Is there a difference in the ORC file format spec. at play here?
Nope, we're still defaulting to hive-0.12 format ORC files in Hive 2.x. We haven't changed the format compatibility in 5 years, so we're due for a refresh soon.

> 5. What's the mechanism that affects Spark here?

SparkSQL has never properly supported ACID, because to do this correctly Spark has to grab locks on the table and heartbeat the lock, to prevent a compaction from removing a currently used ACID snapshot. AFAIK, there's no code in SparkSQL to handle Hive transactions - this is not related to the file format, it is related to the directory structure used to maintain ACID snapshots, so that you can delete a row without failing queries in progress. However, that's mostly an operational issue for production.

Off the raw filesystem (i.e. not through the table), I've used SparkSQL to read the ACID 2.x raw data to write an acidfsck which checks the underlying structures by reading them as plain files, so that I could easily run tests like "there's only 1 delete for each ROW__ID" while ACID 2.x was in development. You can think of the ACID data as basically Struct<ROW__ID>, Struct<Row> when reading it raw (see the P.S. below for a rough sketch of that kind of read).

> 6. Any similar issues with Parquet format in Hive 1.x and 2.x?

Not similar - but a different set of Parquet incompatibilities is inbound, with parquet.writer.version=v2.

Cheers,
Gopal
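
P.S. For anyone who wants to poke at the raw layout themselves, here is a minimal, hypothetical PySpark sketch of that kind of check (not the actual acidfsck). The warehouse path is a placeholder, and it assumes the standard ACID event columns (operation, originalTransaction, bucket, rowId, currentTransaction, row) that Hive writes into the delta/delete_delta ORC files.

# Hypothetical sketch, not the real acidfsck: read delete deltas as plain ORC,
# bypassing the Hive table (so no locks / snapshot isolation), and verify that
# each ROW__ID shows up in at most one delete event.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("acid-raw-check").getOrCreate()

# Placeholder path; point it at the table's delete_delta_* directories.
deletes = spark.read.orc("hdfs:///warehouse/mydb.db/mytable/delete_delta_*")

# ROW__ID is the (originalTransaction, bucket, rowId) triple; more than one
# delete event per triple would indicate a broken delta.
dupes = (deletes
         .groupBy("originalTransaction", "bucket", "rowId")
         .count()
         .filter(F.col("count") > 1))

dupes.show(truncate=False)  # empty output == every ROW__ID deleted at most once

Reading the files directly like this is fine for a consistency check on quiesced data, but it's exactly the kind of read that a compaction can pull the rug out from under - which is why Spark would need the lock/heartbeat machinery to do it safely against a live table.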