[ https://issues.apache.org/jira/browse/ARROW-18198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653209#comment-17653209 ]
David Dali Susanibar Arce commented on ARROW-18198:
---------------------------------------------------

Hi [~georeth],

Please consider this [PR|https://github.com/apache/arrow-cookbook/pull/289], which adds a cookbook recipe for reading compressed files:

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

File file = new File("src/main/resources/compare/lz4.arrow");
try (
    BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    // Plain "new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)" cannot decode
    // compressed record batches; pass CommonsCompressionFactory for compressed files:
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator,
        CommonsCompressionFactory.INSTANCE)
) {
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.println("Size: --> " + vectorSchemaRootRecover.getRowCount());
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}
{code}
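For reference, the same factory also plugs into the writer side. Below is a minimal, untested sketch of producing an LZ4-compressed IPC file from Java, modeled on the cookbook's write-with-compression recipe; it assumes the ArrowFileWriter constructor overload that accepts a compression factory and codec type, and the output name example_lz4.arrow is just illustrative. LZ4_FRAME matches the codec pandas/pyarrow uses by default for to_feather():

{code:java}
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.compression.CompressionUtil;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.IpcOption;

try (BufferAllocator allocator = new RootAllocator();
     BigIntVector vector = new BigIntVector("a", allocator)) {
    // One int64 column with values 0..9999, mirroring pd.DataFrame({'a': range(10000)}).
    vector.allocateNew(10_000);
    for (int i = 0; i < 10_000; i++) {
        vector.set(i, i);
    }
    vector.setValueCount(10_000);
    try (VectorSchemaRoot root = VectorSchemaRoot.of(vector);
         FileOutputStream out = new FileOutputStream("example_lz4.arrow");
         // The last two arguments select the codec (assumed overload, as in the cookbook):
         ArrowFileWriter writer = new ArrowFileWriter(root, /*provider*/ null, out.getChannel(),
             /*metaData*/ null, IpcOption.DEFAULT,
             CommonsCompressionFactory.INSTANCE, CompressionUtil.CodecType.LZ4_FRAME)) {
        writer.start();
        writer.writeBatch();
        writer.end();
    }
} catch (IOException e) {
    e.printStackTrace();
}
{code}

Note that CommonsCompressionFactory lives in the separate org.apache.arrow:arrow-compression artifact, so that dependency has to be on the classpath in addition to arrow-vector.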
> IndexOutOfBoundsException when loading compressed IPC format
> ------------------------------------------------------------
>
>                 Key: ARROW-18198
>                 URL: https://issues.apache.org/jira/browse/ARROW-18198
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 4.0.1, 9.0.0, 10.0.0
>         Environment: Linux and Windows.
> Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
> Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
>            Reporter: Georeth Zhou
>            Priority: Major
>
> I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.
>
> {code:java}
> // Java code from the "Apache Arrow Java Cookbook"
> File file = new File("example.arrow");
> try (
>     BufferAllocator rootAllocator = new RootAllocator();
>     FileInputStream fileInputStream = new FileInputStream(file);
>     ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
> ) {
>     System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
>     for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
>         reader.loadRecordBatch(arrowBlock);
>         VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
>         System.out.print(vectorSchemaRootRecover.contentToTSVString());
>     }
> } catch (IOException e) {
>     e.printStackTrace();
> }
> {code}
> Call stack:
> {noformat}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
>     at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
>     at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
>     at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
>     at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
>     at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
>     at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
> {noformat}
> This bug can be reproduced with a simple dataframe created by pandas:
>
> {code:python}
> pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
> {code}
> Pandas compresses the dataframe by default. If the compression is turned off, Java can load the dataframe. Thus, I guess the bounds-checking code is buggy when loading compressed files.
>
> That dataframe can be loaded in polars, pandas, and pyarrow, so it's unlikely to be a pandas bug.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)