Bobby Wang created ORC-1075:
-------------------------------

             Summary: Failed to read rows from the ORC file without statistics in RowIndex when filter is pushed down for 1.6.11
                 Key: ORC-1075
                 URL: https://issues.apache.org/jira/browse/ORC-1075
             Project: ORC
          Issue Type: Bug
          Components: Java, Reader
    Affects Versions: 1.6.11
            Reporter: Bobby Wang
         Attachments: none-1.orc

I have attached an ORC file that does not appear to include ColumnStatistics in its RowIndex entries.

{color:#FF0000}From the ORC spec, it seems that the statistics field of RowIndexEntry is optional, not required?{color}
{code:java}
message RowIndexEntry {
  repeated uint64 positions = 1 [packed=true];
  optional ColumnStatistics statistics = 2;
}

message RowIndex {
  repeated RowIndexEntry entry = 1;
}
{code}
The metadata of the ORC file:
{code:java}
$ orctools meta none.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file none.orc [length: 124]
Structure for none.orc
File Version: 0.12 with ORIGINAL
Rows: 3
Compression: NONE
Calendar: Julian/Gregorian
Type: struct<INT:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 3 hasNull: true
    Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6

File Statistics:

Stripes:
  Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
    Stream: column 0 section ROW_INDEX start: 3 length 4
    Stream: column 1 section ROW_INDEX start: 7 length 6
    Stream: column 1 section DATA start: 13 length 4
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 124 bytes
Padding length: 0 bytes
Padding ratio: 0%
{code}
The data of the ORC file:
{code:java}
$ orctools data none.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file none.orc [length: 124]
{"INT":1}
{"INT":2}
{"INT":3}
{code}
I use the code below to read each row of the ORC file:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

// Pick the schema we want to read using schema evolution
TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");

// Get the information from the file footer
Reader reader = OrcFile.createReader(new Path("none.orc"),
    OrcFile.readerOptions(new Configuration()));
System.out.println("File schema: " + reader.getSchema());
System.out.println("Row count: " + reader.getNumberOfRows());

RecordReader rowIterator = reader.rows(
    reader.options()
        .schema(readSchema)
        // predicate pushdown: INT = 2
        .searchArgument(SearchArgumentFactory.newBuilder()
            .equals("INT", PredicateLeaf.Type.LONG, 2L)
            .build(), new String[]{"INT"})
);

// Read the row data
VectorizedRowBatch batch = readSchema.createRowBatch();
LongColumnVector x = (LongColumnVector) batch.cols[0];
while (rowIterator.nextBatch(batch)) {
  System.out.println(batch.size);
  for (int row = 0; row < batch.size; ++row) {
    // A repeating vector stores its single value at index 0
    int xRow = x.isRepeating ? 0 : row;
    System.out.println("INT: " +
        (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
  }
}
rowIterator.close();
{code}
h2. output from 1.6.11

File schema: struct<INT:int>
Row count: 3

No batch is returned, so no rows are printed.

h2. output from 1.5.10

File schema: struct<INT:int>
Row count: 3
3
INT: 1
INT: 2
INT: 3

I found this issue on Spark 3.2, which depends on ORC 1.6.11. There is no such issue on Spark 3.0.x, which depends on ORC 1.5.10.
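
To make the expected behavior concrete, here is a minimal, self-contained sketch in plain Java. It is not ORC's reader code and does not claim to show where the regression is. It only illustrates the rule I would expect when a filter is pushed down: an index entry whose optional statistics are absent cannot prove the predicate false, so the row group still has to be read. The names RowGroupPruning, IntStats, and mightContain are hypothetical and exist only for this illustration.

```java
import java.util.Optional;

public class RowGroupPruning {

    /** Hypothetical min/max summary for one row group of an integer column. */
    static final class IntStats {
        final long min;
        final long max;

        IntStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns false only when the statistics prove that no row in the group
     * can equal the literal. Missing statistics prove nothing, so the row
     * group is kept and must be read.
     */
    static boolean mightContain(Optional<IntStats> stats, long literal) {
        return stats
            .map(s -> s.min <= literal && literal <= s.max)
            .orElse(true);
    }

    public static void main(String[] args) {
        // No statistics in the index entry (the case in this report): keep the group.
        System.out.println(mightContain(Optional.empty(), 2L));
        // Statistics cover the literal: keep the group.
        System.out.println(mightContain(Optional.of(new IntStats(1, 3)), 2L));
        // Statistics exclude the literal: the group can be skipped.
        System.out.println(mightContain(Optional.of(new IntStats(5, 9)), 2L));
    }
}
```

Under this rule the attached file would behave as it does with 1.5.10: the single row group is read and all three rows are returned, because filtering at this level only skips row groups and does not filter individual rows.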