Thanks Micah and Wes for the reply. Regarding the scope, we are working together with the Parquet community to refine our proposal: https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html
This proposal is more general to Arrow (indeed, it can be used by native Parquet as well). Since Arrow is an in-memory format used mostly for intermediate data, I would expect backward compatibility to be less of a concern than it is for the on-disk Parquet format. Given that, we can discuss the two parts separately: the Parquet part should behave consistently with Java Parquet, and the Arrow part should also be compatible with the new extensible Parquet compression codec framework. We can start with the Parquet part first.

Thanks,
Cheng Xu

From: Micah Kornfield <emkornfi...@gmail.com>
Sent: Tuesday, June 23, 2020 12:11 PM
To: dev <dev@arrow.apache.org>
Cc: Xu, Cheng A <cheng.a...@intel.com>; Xie, Qi <qi....@intel.com>
Subject: Re: Proposal for the plugin API to support user customized compression codec

It would be good to clarify the exact scope of this. If it is particular to Parquet, then we should wait for the discussion on dev@parquet to conclude before moving forward. If it is more general to Arrow, then working through scenarios of how this would be used for decompression when the Codec can't support generic input would be useful (the codec library is a singleton across the Arrow codebase).

On Mon, Jun 22, 2020 at 4:23 PM Wes McKinney <wesmck...@gmail.com> wrote:

hi XieQi,

Is the idea that your custom Gzip implementation would automatically override any places in the codebase where the built-in one would be used (like the Parquet codebase)? I see some things in the design doc about serializing the plugin information in the Parquet file metadata (assuming you want to speed up decompression of Parquet data pages) -- is there a reason to believe that the plugin would be _required_ in order to read the file? I recall some messages to the Parquet mailing list about user-defined codecs.
In general, having a plugin API that provides a means to substitute one functionally identical implementation for another seems reasonable to me (I could envision people customizing kernel execution in the future). We should try to create a general enough API that it can be used for customizations beyond compression codecs, so we don't have to go through another design exercise to support plugin/algorithm overrides for something else. This is something we could hash out during code review -- I should have some opinions and I'm sure others will as well.

- Wes

On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi <qi....@intel.com> wrote:
>
> Hi,
>
> In pursuit of better performance, quite a few end users want to leverage
> accelerators (e.g. FPGA, Intel QAT) to offload compression. However, the
> current Arrow compression framework only supports codec-name-based
> compression implementations and cannot be customized to leverage
> accelerators. For example, for the gzip format, we can't call a customized
> codec to accelerate the compression. We would like to propose a plugin API
> to support customized compression codecs. We've put the proposal here:
>
> https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
>
> Any comments are welcome; please let us know your feedback.
>
> Thanks,
>
> XieQi
>