gowa commented on PR #3121:
URL: https://github.com/apache/parquet-java/pull/3121#issuecomment-2608733828

Hi @gszadovszky, @wgtmac. Thank you for your feedback.

Yes, I see that it is a big feature and the implementation is far from being a simple fix. Maybe it should be a pluggable component rather than a first-class resident of the codebase. However, if you feel the changes can be incorporated into the main codebase, I could try to find someone to review the ByteBuddy part, and I could implement the reader part as well.

As for benchmarks, I've implemented some and committed them. I attempted to replicate the original org.apache.parquet.benchmarks.WriteBenchmarks for protobuf messages in org.apache.parquet.benchmarks.ProtoWriteBenchmarks.

The results are as follows: the more fields a message has (especially primitive ones), the bigger the gain. E.g. for 100 int32 fields:

```
Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF  Test100Int32    ss    5  13.171 ± 1.206   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.075 ± 1.258   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF  Test100Int32    ss    5  13.304 ± 1.497   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.235 ± 0.617   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF  Test100Int32    ss    5  13.450 ± 3.429   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   5.947 ± 0.430   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF  Test100Int32    ss    5  13.433 ± 3.879   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.523 ± 2.831   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF  Test100Int32    ss    5  13.288 ± 0.429   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL  Test100Int32    ss    5   6.333 ± 0.444   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF  Test100Int32    ss    5  13.197 ± 1.396   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL  Test100Int32    ss    5   6.855 ± 2.689   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF  Test100Int32    ss    5  13.473 ± 1.930   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL  Test100Int32    ss    5   6.006 ± 0.285   s/op
```

For 30 int32 fields:

```
Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF   Test30Int32    ss    5   3.421 ± 1.303   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL   Test30Int32    ss    5   2.410 ± 0.357   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF   Test30Int32    ss    5   3.396 ± 0.708   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL   Test30Int32    ss    5   2.362 ± 0.174   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF   Test30Int32    ss    5   3.250 ± 0.721   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL   Test30Int32    ss    5   2.310 ± 0.168   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF   Test30Int32    ss    5   3.447 ± 0.884   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL   Test30Int32    ss    5   2.416 ± 0.387   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF   Test30Int32    ss    5   3.156 ± 0.276   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL   Test30Int32    ss    5   2.514 ± 0.687   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF   Test30Int32    ss    5   3.398 ± 0.853   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL   Test30Int32    ss    5   2.501 ± 0.323   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF   Test30Int32    ss    5   3.644 ± 3.423   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL   Test30Int32    ss    5   2.384 ± 0.203   s/op
```

For 30 strings ("fieldXX:XX"):

```
Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF  Test30String    ss    5   9.426 ± 3.621   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.257 ± 1.113   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF  Test30String    ss    5   9.848 ± 1.141   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.302 ± 1.910   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF  Test30String    ss    5  10.216 ± 1.843   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.173 ± 1.419   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF  Test30String    ss    5   9.940 ± 1.680   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.242 ± 1.270   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF  Test30String    ss    5   9.833 ± 1.010   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL  Test30String    ss    5   8.247 ± 1.284   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF  Test30String    ss    5   9.638 ± 0.502   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL  Test30String    ss    5   7.935 ± 0.889   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF  Test30String    ss    5   9.968 ± 1.651   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL  Test30String    ss    5   8.356 ± 1.319   s/op
```

For 5-7 fields the gain is negligible.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org