Hi all,

As discussed in the previous email, we are proposing a pluggable API to support user-customized compression codecs in Arrow. See the proposal:
https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit

We would like to redefine the scope of the pluggable API and discuss it with the community.

1. Goal

Through the plugin API, the end user can use a customized compression codec to override a built-in compression codec, e.g. use a HW-GZIP codec in place of the Arrow built-in GZIP codec to speed up compression and decompression. The plugin API is not intended to add new compression codecs to Arrow.

For now we are focused on the Parquet format; support for the Arrow format will follow later. Some components, such as the plugin manager module and the dynamic library loading module, should nonetheless be common to Arrow.

2. Compatibility with the Java implementation

Both implementations will write the plugin information to the Parquet key-value metadata, either at the FileMetaData level or at the ColumnMetaData level. The plugin information includes the plugin library name, used by native Parquet, and the plugin class name, used by Java Parquet. E.g.

    plugin_library_name:libgzipplugin.so
    plugin_class_name:com.intel.icl.customizedGzipCodec

We are working with the Parquet community to refine our proposal:
https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html

3. The end-user API

For write, the end user should state explicitly that they want to use a plugin codec, so we add a compression_plugin API to the Parquet WriterProperties builder. When this function is called, the internal Parquet writer writes plugin_library_name and plugin_class_name to the Parquet key-value metadata. The end-user code would look like this:

    parquet::WriterProperties::Builder builder;
    builder.compression(parquet::Compression::GZIP);
    builder.compression_plugin("libGzipPlugin.so");
    std::shared_ptr<parquet::WriterProperties> props = builder.build();

For read, the internal Parquet reader first checks whether there is plugin information in the metadata. For native Parquet, it reads plugin_library_name from the key-value metadata; if the key exists, it loads the plugin library automatically and returns the plugin codec from GetReadCodec. So no code change is needed for read: the plugin is transparent to the end user on the Parquet read side.
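
To make the read-side lookup a bit more concrete, here is a rough, self-contained sketch of the intended flow. It is not the actual Arrow/Parquet code: KeyValueMetadata, Codec, BuiltinGzipCodec, StubPluginCodec, PluginLoader and RecordPlugin are hypothetical stand-ins invented for illustration, and the loader is injected as a callback where a real implementation would presumably wrap the dynamic library loading module. Falling back to the built-in codec when the library cannot be loaded is an assumption on my part (it relies on the plugin producing the same format as the built-in codec), not something the proposal has settled.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the Parquet key-value metadata.
using KeyValueMetadata = std::map<std::string, std::string>;

// Hypothetical minimal codec interface (not arrow::util::Codec).
struct Codec {
  virtual ~Codec() = default;
  virtual std::string name() const = 0;
};

// Stand-in for the built-in GZIP codec.
struct BuiltinGzipCodec : Codec {
  std::string name() const override { return "builtin-gzip"; }
};

// Stand-in for a codec obtained from a plugin library.
struct StubPluginCodec : Codec {
  explicit StubPluginCodec(std::string lib) : lib_(std::move(lib)) {}
  std::string name() const override { return "plugin:" + lib_; }
  std::string lib_;
};

// Maps a library name to a codec and returns nullptr on failure.
// Assumption: a real loader would open the shared library here.
using PluginLoader =
    std::function<std::unique_ptr<Codec>(const std::string&)>;

// Write side: record the plugin information under the two proposed keys.
void RecordPlugin(KeyValueMetadata& kv, const std::string& library_name,
                  const std::string& class_name) {
  kv["plugin_library_name"] = library_name;
  kv["plugin_class_name"] = class_name;
}

// Read side: use the plugin codec if the metadata names a library and it
// loads; otherwise fall back to the built-in codec (assumed behaviour).
std::unique_ptr<Codec> GetReadCodec(const KeyValueMetadata& kv,
                                    const PluginLoader& loader) {
  assert(loader && "a loader must be supplied");
  auto it = kv.find("plugin_library_name");
  if (it != kv.end()) {
    if (auto codec = loader(it->second)) return codec;
  }
  return std::make_unique<BuiltinGzipCodec>();
}
```

With this shape, a file written without plugin metadata, or read on a machine where the library is missing, still goes through the built-in codec, which is what keeps the read side transparent to the end user.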

Looking forward to any other suggestions or feedback.

Thanks,
XieQi