> Or, is this an artifact of an incompatibility, where ORC files written by 
> the Hive 2.x ORC serde are not readable by the Hive 1.x ORC serde?  
> 3. Is there a difference in the ORC file format spec. at play here?

Nope, we're still defaulting to hive-0.12 format ORC files in Hive-2.x.

We haven't changed the format compatibility in 5 years, so we're due for a 
refresh soon.
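
FWIW, you can confirm that per file with hive --orcfiledump, which prints 
the file version. A programmatic sketch of the same check, assuming the 
org.apache.orc core reader API and a made-up path:

    // sketch: print the format version an ORC file claims it was written as
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.orc.OrcFile

    val reader = OrcFile.createReader(
      new Path("/warehouse/tbl/000000_0"),          // made-up path
      OrcFile.readerOptions(new Configuration()))
    println(reader.getFileVersion)  // expect 0.12 from Hive 1.x and 2.x alike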

> 5. What’s the mechanism that affects Spark here?

SparkSQL has never properly supported ACID, because to do this correctly, 
Spark has to grab locks on the table and heartbeat the lock, to prevent a 
compaction from removing an ACID snapshot that is still in use.

AFAIK, there's no code in SparkSQL to handle Hive transactions. This is not 
related to the file format; it is related to the directory structure used to 
maintain ACID snapshots, so that you can delete a row without failing queries 
in progress.
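
To make that concrete, a snapshot is just a set of base/delta directories 
on disk - roughly (from memory, with made-up write ids):

    /warehouse/t/base_0000005/bucket_00000                        <- compacted
    /warehouse/t/delta_0000006_0000006_0000/bucket_00000          <- inserts
    /warehouse/t/delete_delta_0000007_0000007_0000/bucket_00000   <- deletes

A compaction folds those into a new base and removes the old directories, 
which is exactly what breaks a reader that never took (and heartbeated) a 
lock.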

However, that's mostly an operational issue for production. Off the raw 
filesystem (i.e. not through the table), I've used SparkSQL to read the ACID 
2.x raw data to write an acidfsck, which checks the underlying structures by 
reading them as raw data, so that I could easily run tests like "there's only 
1 delete for each ROW__ID" when ACID 2.x was in dev.

You can think of the ACID data as basically

Struct<ROW__ID>, Struct<Row>

when reading it raw.
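
Spelled out (from memory, so double-check against your own files), the raw 
reader schema for ACID 2.x looks roughly like

    struct<operation:int, originalTransaction:bigint, bucket:int,
           rowId:bigint, currentTransaction:bigint, row:struct<...>>

where the first five fields are the flattened ROW__ID bookkeeping and row 
holds the actual table columns (null in delete deltas).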

> 6. Any similar issues with Parquet format in Hive 1.x and 2.x?

Not similar - but a different set of Parquet incompatibilities are inbound, 
with parquet.writer.version=v2.
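
If you want to see what older readers will choke on, here's a sketch of 
writing v2 data pages from Spark, assuming the stock parquet-mr config key 
is picked up from the Hadoop conf:

    // force v2 data pages (parquet-mr's "parquet.writer.version", default v1)
    spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")
    spark.range(10).write.parquet("/tmp/parquet_v2_test")  // made-up path
    // readers that only understand v1 data pages will fail on these files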

Cheers,
Gopal
