It still seems notable that our generic LZ4-compressed output stream
cannot be read by Java (independent of Arrow and the Arrow IPC
format).

On Thu, Jan 28, 2021 at 12:30 PM Antoine Pitrou <anto...@python.org> wrote:
>
> On Thu, 28 Jan 2021 18:19:00 +0000
> Joris Peeters <joris.mg.peet...@gmail.com> wrote:
>
> > To be fair, I'm happy to apply it at IPC level; I just didn't realise that
> > was a thing. If I understand Antoine's suggestion correctly, though, then
> > (leaving Python as-is) just changing my Java to
> >
> >     var is = new FileInputStream(path.toFile());
> >     var reader = new ArrowStreamReader(is, allocator);
> >     var schema = reader.getVectorSchemaRoot().getSchema();
> >
> > (i.e. just get rid of the lz4 input stream) should work, i.e. let the
> > reader figure it out? I see no option to specify the compression in the
> > reader, so it might detect it?
>
> You would specify the compression in the *writer* (on the Python side),
> using the *options* argument here:
> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream
> or here:
> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_file.html#pyarrow.ipc.new_file
>
> (unfortunately, it seems we didn't document IpcWriteOptions, but you
> can inspect it on the Python prompt:
>
> >>> pa.ipc.IpcWriteOptions?
> Init signature: pa.ipc.IpcWriteOptions(self, /, *args, **kwargs)
> Docstring:
> IpcWriteOptions(metadata_version=MetadataVersion.V5, *,
>     use_legacy_format=False, compression=None, bool use_threads=True,
>     bool emit_dictionary_deltas=False)
>
> Serialization options for the IPC format.
>
>     Parameters
>     ----------
>     metadata_version : MetadataVersion, default MetadataVersion.V5
>         The metadata version to write.  V5 is the current and latest,
>         V4 is the pre-1.0 metadata version (with incompatible Union
>         layout).
>     use_legacy_format : bool, default False
>         Whether to use the pre-Arrow 0.15 IPC format.
>     compression: str or None
>         If not None, compression codec to use for record batch buffers.
>         May only be "lz4", "zstd" or None.
>     use_threads: bool
>         Whether to use the global CPU thread pool to parallelize any
>         computational tasks like compression.
>     emit_dictionary_deltas: bool
>         Whether to emit dictionary deltas.  Default is false for maximum
>         stream compatibility.
>
> )
>
> Regards
>
> Antoine.
>
>
