Hi, I'm very new to Spark development and would like to get some guidance from more experienced members. Sorry in advance, this email will be a bit long as I try to explain the details.
Started to investigate the issue SPARK-22267 <https://issues.apache.org/jira/browse/SPARK-22267> and added some test cases to highlight the problem in the PR <https://github.com/apache/spark/pull/19744>.

Here are my findings:
- For parquet the test case succeeds as expected.
- For the orc sql test case (a rough sketch is at the end of this mail):
  - when CONVERT_METASTORE_ORC is set to "true", the data fields are returned in the desired order;
  - when it is "false", the columns are read in the wrong order.
  - Reason: when `isConvertible` returns true in `RelationConversions`, the plan executes `convertToLogicalRelation`, which in turn uses `OrcFileFormat` to read the data; otherwise it uses the classes in "hive-exec:1.2.1".
- The HadoopRDD test case was added to investigate the parameter values further and find a working combination, but unfortunately no combination of "serialization.ddl" and "columns" resulted in success. Those properties seem to have no effect on the order of the resulting data fields. (A sketch of this test is also at the end of this mail.)

At this point I do not see any way to fix this issue without risking backward-compatibility problems. The possible actions, as I see them, are:
- Link a newer version of "hive-exec": this bug has surely been fixed in a later release.
- Use `OrcFileFormat` for reading orc data regardless of the CONVERT_METASTORE_ORC setting.
- There is also an `OrcNewInputFormat` class in "hive-exec", but it implements the InputFormat interface from a different package, so it is currently incompatible with HadoopRDD.

Did I miss any viable options? Any guidance would be much appreciated.

Thanks,
Mark
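P.S. For reference, this is roughly how the orc sql test exercises the two read paths. The table name `spark_22267_orc` is just a placeholder here; the actual test cases are in the PR linked above. CONVERT_METASTORE_ORC corresponds to the "spark.sql.hive.convertMetastoreOrc" key.

```scala
// Native path: RelationConversions converts the read to OrcFileFormat.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SELECT * FROM spark_22267_orc").show()   // fields come back in the desired order

// Hive path: the read goes through the hive-exec 1.2.1 classes.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
spark.sql("SELECT * FROM spark_22267_orc").show()   // columns are read in the wrong order
```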
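And here is a rough sketch of the HadoopRDD test. The table location, column names (c1, c2) and the "serialization.ddl"/"columns" values are placeholders I used for experimenting; the real test is in the PR. In my runs, no combination of these two properties changed the order of the returned fields.

```scala
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.spark.sql.SparkSession

object Spark22267Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-22267 HadoopRDD sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder warehouse path of the ORC-backed Hive table.
    val tableLocation = "/user/hive/warehouse/spark_22267_orc"

    val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, tableLocation)

    // The two properties I experimented with; neither seems to influence
    // the order of the fields returned by the hive-exec OrcInputFormat.
    jobConf.set("columns", "c1,c2")
    jobConf.set("serialization.ddl", "struct spark_22267_orc { string c1, string c2}")

    val rdd = spark.sparkContext.hadoopRDD(
      jobConf,
      classOf[OrcInputFormat],
      classOf[NullWritable],
      classOf[OrcStruct],
      1)

    // The record reader reuses the same Writable instance, so convert each
    // OrcStruct to a String before collecting a few rows for inspection.
    rdd.values.map(_.toString).take(10).foreach(println)
  }
}
```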