[ https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-17062:
------------------------------------
    Summary: [C#] write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()  (was: write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch())

> [C#] write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-17062
>                 URL: https://issues.apache.org/jira/browse/ARROW-17062
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C#, R
>    Affects Versions: 8.0.0
>         Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4
>            Reporter: Todd West
>            Priority: Major
>             Fix For: 8.0.2
>
>
> Hello world between write_feather() and ArrowFileReader.ReadNextRecordBatch()
> fails with default settings. This is specific to compressed files (see
> workaround below) and it looks like what happens is C# correctly decompresses
> the batches but provides the caller with the compressed versions of the data
> arrays instead of the uncompressed ones. While all of the various Length
> properties are set correctly in C#, the data arrays are too short to contain
> all of the values in the file, the bytes do not match what the decompressed
> bytes should be, and basic data accessors like PrimitiveArray<T>.Values can't
> be used because they throw ArgumentOutOfRangeException. Looking through the
> C# classes in the github repo it doesn't appear there's a way for the caller
> to request decompression. So I'm guessing decompression is supposed to be
> automatic but, for some reason, isn't.
>
> While functionally successful, the workaround of using uncompressed feather
> isn't great as the uncompressed files are bigger than .csv. In my application
> the resulting disk space penalty is hundreds of megabytes compared to the
> footprint of using compressed feather.
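The mismatch described above (Length properties correct, value buffer too short) can be detected before touching any accessor. The following is only a minimal sketch under stated assumptions: `BufferCheck` and `CanHoldDoubles` are hypothetical names introduced here for illustration, not part of Apache.Arrow, and the check assumes a non-sliced array of 8-byte doubles.

```csharp
using System;

public static class BufferCheck
{
    // Hypothetical helper, not part of Apache.Arrow: returns true when a
    // value buffer of bufferBytes bytes can hold valueCount 8-byte doubles.
    public static bool CanHoldDoubles(int valueCount, int bufferBytes)
    {
        if (valueCount < 0 || bufferBytes < 0)
        {
            throw new ArgumentOutOfRangeException(
                valueCount < 0 ? nameof(valueCount) : nameof(bufferBytes));
        }

        // Widen to long so a large valueCount cannot overflow the product.
        long requiredBytes = (long)valueCount * sizeof(double);
        return bufferBytes >= requiredBytes;
    }
}
```

A caller could pass a DoubleArray's Length and its ValueBuffer.Length. For the file in this report, 21 values need 168 bytes, so a buffer that only yields 15 doubles would fail the check, which is consistent with the ArgumentOutOfRangeException mentioned above.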
>
> Simple single field reprex:
> In R (arrow 8.0.0):
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test lz4.feather")}}
> In C# (Apache.Arrow 8.0.0):
> {{using Apache.Arrow;}}
> {{using Apache.Arrow.Ipc;}}
> {{using System;}}
> {{using System.IO;}}
> {{using System.Linq;}}
> {{using System.Runtime.InteropServices;}}
> {{ using FileStream stream = new("test lz4.feather", FileMode.Open, FileAccess.Read, FileShare.Read);}}
> {{ using ArrowFileReader arrowFile = new(stream);}}
> {{ for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch != null; batch = arrowFile.ReadNextRecordBatch())}}
> {{ {}}
> {{     IArrowArray[] fields = batch.Arrays.ToArray();}}
> {{     ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}}
> {{ }}}
> Workaround in R:
> {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", compression = "uncompressed")}}
>
> Apologies if this is a known issue. I didn't find anything on a Jira search
> and this isn't included in the [known issues list on
> github|http://example.com/].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)