andygrove commented on issue #2845:
URL: https://github.com/apache/datafusion-comet/issues/2845#issuecomment-3791058388

Here is a brief summary of the changes since the 0.12.0 release. I think that we should go ahead and create the 0.13.0 release next week unless there are objections.

# Changelog Since Commit 85b64e5b

This document summarizes all changes since commit `85b64e5b`, organized by category.

## New Features

### Native Parquet Writer

- **Experimental native Parquet writes** (#2812) - Initial support for native Parquet writes
- **Complex type support** for native Parquet writer (#3214)
- **File commit protocol** partially implemented for native Parquet writes (#2828)
- **Remote HDFS writer** with openDAL support (#2929)
- **Object store settings** now respected by Comet Writer (#3042)
- **CometNativeWriteExec** support with native scan as a child (#2839)
- Remove unnecessary transition for native writes (#2960)

### Native CSV Reader

- **Experimental native CSV files read** (#3044)

### Iceberg Integration

- **REST catalog support** for CometNativeIcebergScan (#2895)
- **S3 session token** passthrough to Iceberg native scan (#2913)
- **Bucket pruning** enabled with native_datafusion scans (#2888)

### Expression Support

- `left` expression (#3206)
- `last_day` expression (#3143)
- `unix_timestamp` function (#2936)
- `datediff` expression (#3145)
- `date_format` expression (partial support) (#3201)
- `unix_date` expression (#3141)
- `from_json` (partial support) (#2934)
- `explode` and `explode_outer` for array inputs (#2836)
- `size` function for array inputs (#2862)
- `murmur3` hash support expanded to complex types (#3077)

### ANSI Mode Support

- ANSI mode `avg` expression for int inputs (#2817)
- ANSI mode `sum` expression for int inputs (#2600)
- ANSI mode `sum` for decimal types (#2826)

### Cast Improvements

- String to float type casting (#2835)
- Improved string to decimal cast compatibility (#2925)

### Configuration & Architecture

- Shuffle writer buffer size now configurable (#2899)
- Native operator registry implementation (#2875)
- Expression registry for native planner (#2851)
- Improved fallback reporting for `native_datafusion` scan (#2879)
- PySpark benchmark framework (#3080)

## Performance Improvements

### Serialization & Memory

- **Cache serialized query plans** to avoid per-partition serialization (#3246)
- **Reduce GC pressure** in protobuf serialization (#3242)
- **Reduce nativeIcebergScanMetadata** serialization points (#3243)
- **Deduplicate serialized metadata** for Iceberg native scan (#2933)

### Iceberg Optimizations

- Remove IcebergFileStream, use iceberg-rust's parallelization, cache SchemaAdapter (#3051)

### Hash & Type Operations

- Optimize complex-type hash implementations (#3140)
- Improve performance of `normalize_nan` (#2999)
- Improve date truncate performance (#2997)

### Cast Operations

- Improve performance of CAST from string to int (#3017, #3048)

### String Operations

- Optimize `lpad`/`rpad` to remove unnecessary memory allocations per element (#2963)

### Runtime & Async

- Don't busy-poll Tokio stream for plans without CometScan (#3063)
- Set DataFusion session context's `target_partitions` to match Spark's `spark.task.cpus` (#3062)
- Use await instead of block_on in native shuffle writer (#2937)
- Refactor executePlan to avoid constantly entering tokio runtime (#2938)
- Reduce syscall overhead in metrics updates (#2940)

### Shuffle & Sorting

- Minor optimizations in `process_sorted_row_partition` (#3059)

## Bug Fixes

### Shuffle

- Native shuffle now reports spill metrics correctly (#3197)
- Fix CometShuffleManager hang by deferring SparkEnv access (#3002)

### Decimal & Numeric

- Format decimal to string when casting decimal with overflow (#2916)
- Modulus on decimal data type mismatch (#2922)
- Fall back to Spark for MakeDecimal with unsupported input type (#2815)

### Scans

- Reduce granularity of metrics updates in IcebergFileStream (#3050)
- Modify CometNativeScan to generate file partitions without instantiating RDD (#2891)
- NativeScan count assert firing for no reason (#2850)
- Mark CometIcebergNativeScanExec's nativeIcebergScanMetadata @transient (#2930)

### Writer

- Avoid duplicated writer nodes when AQE enabled (#2982)

### Compatibility

- Remove fallback for maps containing complex types (#2943)
- Enable cast tests for Spark 4.0 (#2919)

### Cloud/Storage

- Normalize s3 paths for PME key retriever (#2874)

### Build & Dependencies

- Add missing datafusion-datasource dependency (#3252)
- Fix clippy warnings for Rust 1.93 (#3239)
- Fix generate-release-docs.sh to compile common module

### Documentation

- Correct link to tracing guide in CometConf (#2866)
- Fix broken link to Apache DataFusion Comet Overview in README (#2846)

## Build & CI Improvements

- Build native library once and share across CI test jobs (#3249)
- Stop generating dynamic docs content in build
- Skip CI workflows for benchmark changes (#3030, #3034)
- Set thread thresholds envs for spark test on macOS (#2987)
- Reinstate macOS CI builds of Comet with Spark 4.0 (#2950)
- Disable caching on macOS (#2816)
- Update actions/checkout from v4 to v6 (#2857, #2818)
- Update actions/cache, upload-artifact, download-artifact (#2908, #2909, #2910, #3037)

## Documentation

- Add common pitfalls and improve PR checklist in development guide (#3231)
- Add guidance on disabling constant folding for literal tests (#3200)
- Add documentation for native + JVM shuffle implementation (#3055)
- Add documentation for fully-native Iceberg scans (#2868)
- Improve cast documentation to add support per eval mode (#3056)
- Update Iceberg install docs (#2824)

## Refactoring & Code Quality

### Rule Refactoring

- Refactor `CometScanRule` to improve scan selection and fallback logic (#2978)
- Refactor `CometExecRule` handling of `BroadcastHashJoin` and fix fallback reporting (#2856)
- Refactor scan and sink handling in `CometExecRule` to reduce duplicate code (#2844)
- Clean up shuffle transformation code in `CometExecRule` (#2840)
- Move shuffle logic from `CometExecRule` to `CometShuffleExchangeExec` serde implementation (#2853)
- Move methods from `CometSparkSessionExtensions` to `CometScanRule` and `CometExecRule` (#2873)

### Shuffle Refactoring

- Split CometShuffleExternalSorter into sync/async implementations (#3192)
- Refactor JVM shuffle: Move `SpillSorter` to top level class and add tests (#3081)

### Other Refactoring

- Deprecate native_comet scan in favor of native_iceberg_compat (#2949)
- Move string function handling to new expression registry (#2931)
- Use datafusion impl of hex function (#2915)
- Refactor bit_not (#2896)
- Remove `row_step` from `process_sorted_row_partition` (#2920)
- `ScanExec::new` no longer fetches data (#2881)
- Remove low-level ffi/jvm timers from native `ScanExec` (#2939)
- Reduce timer overhead in native shuffle writer (#2941)
- Respect legacySizeOfNull option for size function (#3036)

### Testing

- Add unit tests for `CometScanRule` (#2867)
- Add unit tests for `CometExecRule` (#2863)
- Enable plan stability suite for `native_datafusion` scans (#2877)
- Add script to regenerate golden files for plan stability tests (#3204)
- Use fixed seed in RNG in tests (#2917)
- Add checks to microbenchmarks for plan running natively in Comet (#3045)

## Benchmarks

### New Benchmarks

- Shuffle benchmark for deeply nested schemas (#2902)
- Microbenchmark for hash expressions (#3028)
- Microbenchmark for comparison expressions (#3026)
- Microbenchmark for casting string to numeric (#2979)
- Microbenchmark for casting string to temporal types (#2980)
- Microbenchmarks for Comet `cast` expression (#2932)
- to_json unit/benchmark tests (#3011)
- Microbenchmark for sum int (#2805)
- PySpark-based benchmarks, starting with ETL example (#3065)

### Benchmark Improvements

- Improve criterion benchmarks for cast string to int (#3049)
- Implement more microbenchmarks for cast expressions (#3031)
- Improve conditional expression microbenchmarks (#3024)
- Improve aggregate expression microbenchmarks (#3021)
- Improve date/time microbenchmarks to avoid redundant/duplicate benchmarks (#3020)
- Improve string expression microbenchmarks (#3012, #2964)
- Refactor string benchmarks (~10x reduction in LOC) (#2907)
- Refactor expression microbenchmarks to remove duplicate code (#2956)
- Add ANSI support to aggregate microbenchmarks (#2901)

## Dependency Updates

### Major Upgrades

- **DataFusion 51, Arrow 57, Iceberg to latest, MSRV to 1.88** (#2729)
- bump lz4_flex, downgrade prost from yanked version (#2847)

### Library Updates

- parquet 57.1.0 to 57.2.0 (#3074)
- arrow 57.1.0 to 57.2.0 (#3073)
- tokio 1.48.0 to 1.49.0 (#3039)
- opendal 0.54.1 to 0.55.0 (#2821)
- object_store_opendal 0.54.1 to 0.55.0 (#2819)
- Various other dependency bumps (cc, libc, serde_json, tempfile, reqwest, assertables, proto)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
