Hi, I'm very new to Spark development and would like to get some guidance from more experienced members. Sorry in advance, this email will be a bit long as I try to explain the details.
Started to investigate the issue SPARK-22267 <https://issues.apache.org/jira/browse/SPARK-22267> and added some test cases to highlight the problem in the PR <https://github.com/apache/spark/pull/19744>.

Here are my findings:
- For parquet the test case succeeds as expected.
- For the orc sql test case (a rough sketch is at the end of this mail):
  - when CONVERT_METASTORE_ORC is set to "true", the data fields are returned in the desired order;
  - when it is "false", the columns are read in the wrong order.
  - Reason: when `isConvertible` returns true in `RelationConversions`, the plan executes `convertToLogicalRelation`, which in turn uses `OrcFileFormat` to read the data; otherwise it uses the classes in "hive-exec:1.2.1".
- The HadoopRDD test case was added to investigate the parameter values further and find a working combination, but unfortunately no combination of "serialization.ddl" and "columns" resulted in success. Those properties seem to have no effect on the order of the resulting data fields. (A sketch of this test is also at the end of this mail.)

At this point I do not see any way to fix this issue without risking backward-compatibility problems. The possible actions, as I see them, are:
- Link a newer version of "hive-exec": this bug has surely been fixed in a later release.
- Use `OrcFileFormat` for reading orc data regardless of the CONVERT_METASTORE_ORC setting.
- There is also an `OrcNewInputFormat` class in "hive-exec", but it implements the InputFormat interface from a different package, so it is currently incompatible with HadoopRDD.

Did I miss any viable options? Any guidance would be much appreciated.

Thanks,
Mark
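P.S. For reference, this is roughly how the orc sql test exercises the two read paths. The table name `spark_22267_orc` is just a placeholder here; the actual test cases are in the PR linked above. CONVERT_METASTORE_ORC corresponds to the "spark.sql.hive.convertMetastoreOrc" key.

```scala
// Native path: RelationConversions converts the read to OrcFileFormat.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SELECT * FROM spark_22267_orc").show()   // fields come back in the desired order

// Hive path: the read goes through the hive-exec 1.2.1 classes.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
spark.sql("SELECT * FROM spark_22267_orc").show()   // columns are read in the wrong order
```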
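And here is a rough sketch of the HadoopRDD test. The table location, column names (c1, c2) and the "serialization.ddl"/"columns" values are placeholders I used for experimenting; the real test is in the PR. In my runs, no combination of these two properties changed the order of the returned fields.

```scala
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.spark.sql.SparkSession

object Spark22267Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-22267 HadoopRDD sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder warehouse path of the ORC-backed Hive table.
    val tableLocation = "/user/hive/warehouse/spark_22267_orc"

    val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, tableLocation)

    // The two properties I experimented with; neither seems to influence
    // the order of the fields returned by the hive-exec OrcInputFormat.
    jobConf.set("columns", "c1,c2")
    jobConf.set("serialization.ddl", "struct spark_22267_orc { string c1, string c2}")

    val rdd = spark.sparkContext.hadoopRDD(
      jobConf,
      classOf[OrcInputFormat],
      classOf[NullWritable],
      classOf[OrcStruct],
      1)

    // The record reader reuses the same Writable instance, so convert each
    // OrcStruct to a String before collecting a few rows for inspection.
    rdd.values.map(_.toString).take(10).foreach(println)
  }
}
```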