[ https://issues.apache.org/jira/browse/ARROW-18198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653209#comment-17653209 ]
David Dali Susanibar Arce commented on ARROW-18198:
---------------------------------------------------

Hi [~georeth],

Please consider this [PR|https://github.com/apache/arrow-cookbook/pull/289], which adds a cookbook recipe for reading compressed files:

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

File file = new File("src/main/resources/compare/lz4.arrow");
try (
    BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    // Plain "new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)" cannot decode
    // compressed record batches; pass CommonsCompressionFactory for compressed files:
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator,
        CommonsCompressionFactory.INSTANCE)
) {
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.println("Size: --> " + vectorSchemaRootRecover.getRowCount());
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}
{code}
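For reference, the same factory also plugs into the writer side. Below is a minimal, untested sketch of producing an LZ4-compressed IPC file from Java, modeled on the cookbook's write-with-compression recipe; it assumes the ArrowFileWriter constructor overload that accepts a compression factory and codec type, and the output name example_lz4.arrow is just illustrative. LZ4_FRAME matches the codec pandas/pyarrow uses by default for to_feather():

{code:java}
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.compression.CompressionUtil;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.IpcOption;

try (BufferAllocator allocator = new RootAllocator();
     BigIntVector vector = new BigIntVector("a", allocator)) {
    // One int64 column with values 0..9999, mirroring pd.DataFrame({'a': range(10000)}).
    vector.allocateNew(10_000);
    for (int i = 0; i < 10_000; i++) {
        vector.set(i, i);
    }
    vector.setValueCount(10_000);
    try (VectorSchemaRoot root = VectorSchemaRoot.of(vector);
         FileOutputStream out = new FileOutputStream("example_lz4.arrow");
         // The last two arguments select the codec (assumed overload, as in the cookbook):
         ArrowFileWriter writer = new ArrowFileWriter(root, /*provider*/ null, out.getChannel(),
             /*metaData*/ null, IpcOption.DEFAULT,
             CommonsCompressionFactory.INSTANCE, CompressionUtil.CodecType.LZ4_FRAME)) {
        writer.start();
        writer.writeBatch();
        writer.end();
    }
} catch (IOException e) {
    e.printStackTrace();
}
{code}

Note that CommonsCompressionFactory lives in the separate org.apache.arrow:arrow-compression artifact, so that dependency has to be on the classpath in addition to arrow-vector.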
> IndexOutOfBoundsException when loading compressed IPC format
> ------------------------------------------------------------
>
>                 Key: ARROW-18198
>                 URL: https://issues.apache.org/jira/browse/ARROW-18198
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 4.0.1, 9.0.0, 10.0.0
>         Environment: Linux and Windows.
> Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
> Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
>            Reporter: Georeth Zhou
>            Priority: Major
>
> I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.
>
> {code:java}
> // Java code from the "Apache Arrow Java Cookbook"
> File file = new File("example.arrow");
> try (
>     BufferAllocator rootAllocator = new RootAllocator();
>     FileInputStream fileInputStream = new FileInputStream(file);
>     ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
> ) {
>     System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
>     for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
>         reader.loadRecordBatch(arrowBlock);
>         VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
>         System.out.print(vectorSchemaRootRecover.contentToTSVString());
>     }
> } catch (IOException e) {
>     e.printStackTrace();
> }
> {code}
> Call stack:
> {noformat}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
>     at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
>     at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
>     at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
>     at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
>     at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
>     at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
> {noformat}
> This bug can be reproduced with a simple dataframe created by pandas:
>
> {code:python}
> pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
> {code}
> Pandas compresses the dataframe by default. If the compression is turned off, Java can load the dataframe. Thus, I guess the bounds-checking code is buggy when loading compressed files.
>
> That dataframe can be loaded in polars, pandas, and pyarrow, so it's unlikely to be a pandas bug.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)