Hi all,

As discussed in the previous email, we are proposing a pluggable API to
support user-customized compression codecs in Arrow.
See the proposal:
https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
We want to redefine the scope of the pluggable API and discuss it with the
community.

1. Goal
Through the plugin API, the end user can use a customized compression codec
to override a built-in compression codec, e.g. use a HW-GZIP codec to
replace the Arrow built-in GZIP codec and speed up compression/decompression.
We do not plan to add new compression codecs to Arrow.
Currently we are focused on the Parquet format. In the future we will support
the Arrow format. Some components should still be common to Arrow, such as the
plugin manager module, the dynamic library loading module, etc.
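
To make the override semantics in the goal concrete, here is a minimal sketch
of what a plugin manager could look like. Everything in it (Codec,
CodecRegistry, RegisterBuiltin, RegisterPlugin, Create) is a hypothetical name
for illustration, not an existing Arrow API, and the dynamic library loading
step is left out. It assumes that a plugin may only replace a compression type
that already has a built-in codec.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical codec interface for illustration; not Arrow's actual Codec class.
class Codec {
 public:
  virtual ~Codec() = default;
  virtual std::string name() const = 0;
};

class BuiltinGzipCodec : public Codec {
 public:
  std::string name() const override { return "builtin-gzip"; }
};

// Stands in for a codec that a plugin library would provide.
class HwGzipCodec : public Codec {
 public:
  std::string name() const override { return "hw-gzip"; }
};

using CodecFactory = std::function<std::unique_ptr<Codec>()>;

// Hypothetical plugin manager: maps a compression type to a codec factory.
class CodecRegistry {
 public:
  void RegisterBuiltin(const std::string& type, CodecFactory factory) {
    assert(factory != nullptr);
    builtin_[type] = std::move(factory);
  }

  // Assumption: a plugin can only override a type that has a built-in codec,
  // so plugins never introduce new compression types.
  bool RegisterPlugin(const std::string& type, CodecFactory factory) {
    assert(factory != nullptr);
    if (builtin_.count(type) == 0) return false;
    plugin_[type] = std::move(factory);
    return true;
  }

  // Prefers the plugin codec; falls back to the built-in one.
  // Returns nullptr for an unknown compression type.
  std::unique_ptr<Codec> Create(const std::string& type) const {
    auto p = plugin_.find(type);
    if (p != plugin_.end()) return p->second();
    auto b = builtin_.find(type);
    if (b != builtin_.end()) return b->second();
    return nullptr;
  }

 private:
  std::map<std::string, CodecFactory> builtin_;
  std::map<std::string, CodecFactory> plugin_;
};
```

Keeping the built-in and plugin factories in separate maps means a plugin can
later be unregistered without losing the built-in codec.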

2. Compatibility with the Java implementation
Both implementations will write the plugin information to the Parquet key-value
metadata, either at the FileMetaData level or at the ColumnMetaData level.
The plugin information includes the plugin library name used by native Parquet
and the plugin class name used by Java Parquet.
E.g. plugin_library_name:libgzipplugin.so,
plugin_class_name:com.intel.icl.customizedGzipCodec
We are working with the Parquet community to refine our proposal:
https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
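
As a rough illustration of the metadata layout described above, the sketch
below stores both names in a plain std::map that stands in for the Parquet
key-value metadata. The two key names come from the example above; the
KeyValueMetadata alias and the AddPluginInfo and FindPluginInfo helpers are
hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for Parquet key-value metadata; the real metadata type differs.
using KeyValueMetadata = std::map<std::string, std::string>;

// Key names taken from the example in this proposal.
const char kPluginLibraryKey[] = "plugin_library_name";
const char kPluginClassKey[] = "plugin_class_name";

// Writer side: record both names so that the native and the Java
// implementation can each find the plugin they need.
void AddPluginInfo(KeyValueMetadata* metadata,
                   const std::string& library_name,
                   const std::string& class_name) {
  assert(metadata != nullptr);
  (*metadata)[kPluginLibraryKey] = library_name;
  (*metadata)[kPluginClassKey] = class_name;
}

// Reader side: returns true and fills *value if the key is present.
bool FindPluginInfo(const KeyValueMetadata& metadata, const std::string& key,
                    std::string* value) {
  assert(value != nullptr);
  auto it = metadata.find(key);
  if (it == metadata.end()) return false;
  *value = it->second;
  return true;
}
```

A native reader would look up plugin_library_name and ignore
plugin_class_name, while a Java reader would do the opposite.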

3. The end-user API
For write, the end user should state explicitly that they want to use a plugin
codec, so we add a compression_plugin API to the Parquet WriterProperties
builder. When this function is called, the internal Parquet writer will write
plugin_library_name and plugin_class_name to the Parquet key-value metadata.
The end-user code looks like this:
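parquet::WriterProperties::Builder builder;
// Select the compression type as usual.
builder.compression(parquet::Compression::GZIP);
// Proposed API: name the plugin library that provides the GZIP codec.
builder.compression_plugin("libGzipPlugin.so");
std::shared_ptr<parquet::WriterProperties> props = builder.build();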



For read, the internal Parquet reader will first check whether the metadata
contains plugin information. For native Parquet, it will read
plugin_library_name from the key-value metadata; if the key exists, it will
load the plugin library automatically and return the plugin codec from
GetReadCodec.

So no code change is needed for read; the plugin is transparent to the end user
on the Parquet read side.
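
The read-side decision could be sketched as follows. This is only an
illustration: SelectReadCodec, CodecSource and LibraryLoader are hypothetical
names, the loader is passed in as a callback instead of calling the real
dynamic library loading module, and falling back to the built-in codec when
the library cannot be loaded is an assumption that the proposal does not
specify.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Stand-in for Parquet key-value metadata; the real metadata type differs.
using KeyValueMetadata = std::map<std::string, std::string>;

// Hypothetical loader callback: returns true if the named library was loaded.
// A real implementation would call the dynamic library loading module here.
using LibraryLoader = std::function<bool(const std::string&)>;

enum class CodecSource { kBuiltin, kPlugin };

// Sketch of the choice a GetReadCodec-style function might make.
CodecSource SelectReadCodec(const KeyValueMetadata& metadata,
                            const LibraryLoader& load_library) {
  assert(load_library != nullptr);
  auto it = metadata.find("plugin_library_name");
  // No plugin information in the file: use the built-in codec.
  if (it == metadata.end()) return CodecSource::kBuiltin;
  // Assumption: if the plugin library cannot be loaded, fall back to the
  // built-in codec instead of failing the read.
  return load_library(it->second) ? CodecSource::kPlugin
                                  : CodecSource::kBuiltin;
}
```

Because a plugin only replaces a built-in codec of the same compression type,
a file written with the plugin should stay readable on machines where the
plugin library is missing, which is what makes the fallback reasonable.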



Looking forward to any other suggestions or feedback.

Thanks,
XieQi
