Hi Wes,

Yes, currently the only purpose of the key-value metadata is to act as a hint 
that the parquet file was compressed by a plugin, so that the parquet reader can 
load the plugin library and use the plugin to decompress the file.
There are many optimized GZIP implementations, and they may not be compatible 
with standard gzip. For example, due to hardware limits, the HW-GZIP history 
window size may be smaller than that of standard gzip, so HW-GZIP can't 
decompress a file compressed by standard gzip. Because we still use 
Compression::GZIP as the Compression::type, we need that metadata to 
distinguish plugin-compressed files from standard gzip files.
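
As a minimal sketch of that check (not the actual reader code): the 
plugin_library_name key is the one from the proposal, while KeyValueMetadata, 
GzipBackend, CodecChoice and ChooseGzipBackend are hypothetical names used only 
for illustration, and a plain std::map stands in for the real parquet metadata 
classes.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the parquet key-value metadata (assumption:
// the real reader exposes this through its own metadata classes).
using KeyValueMetadata = std::map<std::string, std::string>;

// Which GZIP implementation should decompress the file.
enum class GzipBackend { kStandard, kPlugin };

struct CodecChoice {
  GzipBackend backend;
  std::string plugin_library;  // empty when the standard codec is chosen
};

// Both kinds of file are tagged Compression::GZIP, so the metadata key is the
// only signal available. Files without the key are treated as standard gzip
// and are never handed to the plugin, which may be unable to decompress them.
CodecChoice ChooseGzipBackend(const KeyValueMetadata& metadata) {
  auto it = metadata.find("plugin_library_name");
  if (it == metadata.end() || it->second.empty()) {
    return {GzipBackend::kStandard, ""};
  }
  return {GzipBackend::kPlugin, it->second};
}
```

A file carrying plugin_library_name:libgzipplugin.so would select the plugin 
backend; any other GZIP file would fall back to the built-in codec.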

Thanks,
XieQi

-----Original Message-----
From: Wes McKinney <wesmck...@gmail.com> 
Sent: Tuesday, October 20, 2020 11:06 AM
To: dev <dev@arrow.apache.org>
Cc: Xie, Qi <qi....@intel.com>; Xu, Cheng A <cheng.a...@intel.com>; Dong, Xin 
<xin.d...@intel.com>; Zhang, Jie1 <jie1.zh...@intel.com>
Subject: Re: [Discuss] Provide pluggable APIs to support user customized 
compression codec

What is the purpose of the key-value metadata aside from automatically loading 
the plugin library if it's available (which seems like a security risk if 
reading a data file can cause a shared library to be loaded dynamically)? Is it 
necessary to have that metadata for it to be safe to use the optimized GZIP 
plugin (or could you just always have the plugin enabled on a system that 
supports it, even for files that were not compressed using the plugin but 
rather the system / standard gzip)?

On Mon, Oct 19, 2020 at 8:42 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi,
>
> Again, I think the whole plugin concept falls outside of Arrow.
>
> It should be much simpler to simply allow people to override the 
> compression codec factory.  Then applications can define "plugins" if 
> they want to.
>
> Regards
>
> Antoine.
>
>
> Le 19/10/2020 à 03:30, Xie, Qi a écrit :
> > Hi, all
> >
> > Again, as we discussed in the previous email, we are proposing pluggable 
> > APIs to support user-customized compression codecs in Arrow.
> > See the proposal:
> > https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
> > We want to redefine the scope of the pluggable API and discuss it with 
> > the community.
> >
> > 1. Goal
> > Through the plugin API, the end user can use a customized compression 
> > codec to override the built-in compression codec, e.g. use the HW-GZIP 
> > codec in place of the Arrow built-in GZIP codec to speed up compression 
> > and decompression.
> > We do not plan to add new compression codecs to Arrow.
> > Currently we are focused on the parquet format. In the future we will 
> > support the Arrow format. However, some components should be common to 
> > Arrow, such as the plugin manager module, the dynamic library loading 
> > module, etc.
> >
> > 2. Compatibility with the Java implementation
> > Both implementations will write the plugin information to the parquet 
> > key-value metadata, either at the parquet FileMetaData level or at the 
> > ColumnMetaData level.
> > The plugin information includes the plugin library name used by native 
> > parquet and the plugin class name used by Java parquet.
> > E.g. plugin_library_name:libgzipplugin.so, 
> > plugin_class_name:com.intel.icl.customizedGzipCodec
> > We are working together with the Parquet community to refine our 
> > proposal:
> > https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
> >
> > 3. The end user API.
> > For write, the end user should call out that they want to use a plugin 
> > codec, so we add a compression_plugin API to the parquet WriterProperties 
> > builder. When this function is called, the internal parquet writer will 
> > write the plugin_library_name and plugin_class_name to the parquet 
> > key-value metadata. The end-user code looks like this:
> > parquet::WriterProperties::Builder builder; 
> > builder.compression(parquet::Compression::GZIP);
> > builder.compression_plugin("libGzipPlugin.so");
> > std::shared_ptr<parquet::WriterProperties> props = builder.build();
> >
> >
> >
> > For read, the internal parquet reader will first check whether there is 
> > plugin information in the metadata. For native parquet, it will read 
> > plugin_library_name from the key-value metadata; if the key exists, it 
> > will load the plugin library automatically and return the plugin codec 
> > from GetReadCodec.
> >
> > So no code change is needed for read; it is transparent to the end user 
> > on the parquet read side.
> >
> >
> >
> > Looking forward to any other suggestions or feedback.
> >
> > Thanks,
> > XieQi
> >
> >
