It still seems notable that our generic LZ4-compressed output stream cannot be read by Java (independent of Arrow and the Arrow IPC format).

On Thu, Jan 28, 2021 at 12:30 PM Antoine Pitrou <anto...@python.org> wrote:
>
> On Thu, 28 Jan 2021 18:19:00 +0000
> Joris Peeters <joris.mg.peet...@gmail.com> wrote:
>
> > To be fair, I'm happy to apply it at IPC level. Just didn't realise that
> > was a thing. IIUC what Antoine suggests, though, then just (leaving Python
> > as-is and) changing my Java to
> >
> > var is = new FileInputStream(path.toFile());
> > var reader = new ArrowStreamReader(is, allocator);
> > var schema = reader.getVectorSchemaRoot().getSchema();
> >
> > (i.e. just get rid of the lz4 input stream) should work, i.e. let the
> > reader figure it out? I see no option to specify the compression in the
> > reader, so it might detect it?
>
> You would specify the compression in the *writer* (in the Python side),
> using the *options* argument here:
> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream
> or here:
> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_file.html#pyarrow.ipc.new_file
>
> (unfortunately, it seems we didn't document IpcWriteOptions, but you
> can inspect it on the Python prompt:
>
> >>> pa.ipc.IpcWriteOptions?
> Init signature: pa.ipc.IpcWriteOptions(self, /, *args, **kwargs)
> Docstring:
> IpcWriteOptions(metadata_version=MetadataVersion.V5, *,
> use_legacy_format=False, compression=None, bool use_threads=True, bool
> emit_dictionary_deltas=False)
>
> Serialization options for the IPC format.
>
> Parameters
> ----------
> metadata_version : MetadataVersion, default MetadataVersion.V5
>     The metadata version to write. V5 is the current and latest,
>     V4 is the pre-1.0 metadata version (with incompatible Union
>     layout).
> use_legacy_format : bool, default False
>     Whether to use the pre-Arrow 0.15 IPC format.
> compression: str or None
>     If not None, compression codec to use for record batch buffers.
>     May only be "lz4", "zstd" or None.
> use_threads: bool
>     Whether to use the global CPU thread pool to parallelize any
>     computational tasks like compression.
> emit_dictionary_deltas: bool
>     Whether to emit dictionary deltas. Default is false for maximum
>     stream compatibility.
>
> )
>
> Regards
>
> Antoine.