andygrove opened a new pull request, #3747: URL: https://github.com/apache/datafusion-comet/pull/3747
## Which issue does this PR close?

Closes #3162.

## Rationale for this change

`get_json_object` is a widely-used Spark function for extracting values from JSON strings using JSONPath expressions. Without native support, queries using this function fall back to Spark's JVM execution. This PR adds an initial native implementation to allow Comet to accelerate these queries.

This is a starting point. The expression is marked `Incompatible` and is **disabled by default**. Users must set `spark.comet.expression.GetJsonObject.allowIncompatible=true` to enable it.

## What changes are included in this PR?

**Rust implementation** (`native/spark-expr/src/string_funcs/get_json_object.rs`):

- Custom JSONPath parser supporting `$` (root), `.field`, `['field']` (bracket notation), `[n]` (array index), and `[*]` (array wildcard)
- Path evaluation with separate fast-path for non-wildcard paths (zero Vec allocations) and wildcard paths
- Uses `serde_json` with `preserve_order` feature for Spark-compatible key ordering
- 19 unit tests

**Scala serde** (`spark/src/main/scala/org/apache/comet/serde/strings.scala`):

- `CometGetJsonObject` with `getSupportLevel` returning `Incompatible` (Spark's Jackson parser allows single-quoted JSON and unescaped control characters that `serde_json` does not)

**Registration and wiring:**

- Added to `stringExpressions` map in `QueryPlanSerde.scala`
- Registered in `comet_scalar_funcs.rs` via `scalarFunctionExprToProtoWithReturnType`

**SQL tests** (`get_json_object.sql`): 30 test queries covering field extraction, nested objects, arrays, wildcards, nulls, invalid JSON, bracket notation, edge cases.

**Docs**: Updated `expressions.md` and `spark_expressions_support.md`.

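
For readers unfamiliar with the path grammar listed under the Rust implementation, here is a minimal sketch of how a parser for `$`, `.field`, `['field']`, `[n]`, and `[*]` might look. It is not the code in `get_json_object.rs`: the names `PathSegment` and `parse_json_path` are hypothetical, it uses only the standard library, and it treats `.*` the same as `[*]`, which is one of the known differences from Spark.

```rust
// Illustrative sketch only; these names are hypothetical and are not taken
// from get_json_object.rs.
#[derive(Debug, PartialEq)]
enum PathSegment {
    Field(String), // `.field` or `['field']`
    Index(usize),  // `[n]`
    Wildcard,      // `[*]` (and, in this sketch, `.*`)
}

/// Parses a JSONPath string into segments. Returns `None` for anything
/// outside the small grammar handled here.
fn parse_json_path(path: &str) -> Option<Vec<PathSegment>> {
    // Every path must start at the root `$`.
    let mut rest = path.strip_prefix('$')?;
    let mut segments = Vec::new();
    while !rest.is_empty() {
        if let Some(after_dot) = rest.strip_prefix('.') {
            // A dotted field name runs until the next `.` or `[`.
            let end = after_dot
                .find(|c: char| c == '.' || c == '[')
                .unwrap_or(after_dot.len());
            let name = &after_dot[..end];
            if name.is_empty() {
                return None;
            }
            segments.push(if name == "*" {
                PathSegment::Wildcard
            } else {
                PathSegment::Field(name.to_string())
            });
            rest = &after_dot[end..];
        } else if let Some(after_bracket) = rest.strip_prefix('[') {
            // Bracket content is `*`, a quoted field name, or an index.
            let close = after_bracket.find(']')?;
            let inner = &after_bracket[..close];
            let segment = if inner == "*" {
                PathSegment::Wildcard
            } else if let Some(quoted) = inner
                .strip_prefix('\'')
                .and_then(|s| s.strip_suffix('\''))
            {
                PathSegment::Field(quoted.to_string())
            } else {
                PathSegment::Index(inner.parse::<usize>().ok()?)
            };
            segments.push(segment);
            rest = &after_bracket[close + 1..];
        } else {
            return None;
        }
    }
    Some(segments)
}
```

Under these assumptions, `parse_json_path("$.items[0]")` yields a `Field("items")` segment followed by `Index(0)`, and a path that does not start with `$` yields `None`.
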
### Current performance

Benchmarked with 1M rows of JSON (~200 bytes each) on Apple M3 Ultra:

| Case | Spark (ms) | Comet (ms) | Relative |
|------|-----------|------------|----------|
| Simple field (`$.name`) | 705 | 785 | 0.9X |
| Numeric field (`$.age`) | 725 | 789 | 0.9X |
| Nested field (`$.address.city`) | 773 | 805 | 1.0X |
| Array element (`$.items[0]`) | 734 | 795 | 0.9X |
| Nested object (`$.address`) | 869 | 926 | 0.9X |

Comet is currently ~10% slower than Spark. The primary reason is that `serde_json` parses the full JSON document into a DOM tree on every row, while Spark's Jackson-based implementation uses a streaming parser that can skip irrelevant fields without allocating.

### Known limitations and future work

This is an initial implementation. Known gaps that could be addressed in follow-up PRs:

1. **Streaming JSON parser**: Replace `serde_json::from_str` (full DOM parse) with a streaming approach (e.g., `jiter` or custom `serde_json::Deserializer` with `IgnoredAny`) to skip irrelevant JSON content without allocating. This would likely close the performance gap with Spark.
2. **`$.*` on arrays**: Spark distinguishes `$.*` (object wildcard, using `Wildcard` token) from `$[*]` (array wildcard, using `Subscript::Wildcard`). Our parser treats both as the same `Wildcard` segment. Currently `$.*` on arrays returns values in Comet but null in Spark.
3. **Double wildcard flattening**: Spark's `$[*][*]` triggers `FlattenStyle` which flattens nested arrays. Our implementation doesn't handle this special case.
4. **Single wildcard match after index**: For patterns like `$.arr[0][*].field`, Spark's `WriteStyle` state machine may produce different wrapping behavior than our count-based approach.
5. **`preserve_order` is workspace-wide**: Cargo unifies features, so enabling `preserve_order` on `serde_json` in `spark-expr` also enables it for all other crates in the workspace. Could be addressed by isolating the JSON parsing behind a feature flag.

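
To illustrate what "skip irrelevant JSON content without allocating" in item 1 could mean, here is a rough sketch of a byte-level skipper. It is not `jiter`, not `serde_json`'s `IgnoredAny`, and not part of this PR: `skip_value` and `skip_string` are hypothetical helpers written against the standard library only. They assume well-formed input and do not validate scalars.

```rust
// Hypothetical helper, not part of the PR: returns the index just past the
// JSON value that starts at `pos`, without building a tree or allocating.
fn skip_value(json: &[u8], mut pos: usize) -> Option<usize> {
    // Skip leading whitespace; running off the end yields `None`.
    while json.get(pos)?.is_ascii_whitespace() {
        pos += 1;
    }
    match json[pos] {
        b'"' => skip_string(json, pos),
        b'{' | b'[' => {
            // Track nesting depth. Strings are skipped whole so that
            // brackets inside them are not counted.
            let mut depth = 0usize;
            while pos < json.len() {
                match json[pos] {
                    b'"' => {
                        pos = skip_string(json, pos)?;
                        continue;
                    }
                    b'{' | b'[' => depth += 1,
                    b'}' | b']' => {
                        depth -= 1;
                        if depth == 0 {
                            return Some(pos + 1);
                        }
                    }
                    _ => {}
                }
                pos += 1;
            }
            None // unterminated container
        }
        _ => {
            // Number, true, false, or null: runs until a delimiter.
            while pos < json.len()
                && !matches!(json[pos], b',' | b'}' | b']')
                && !json[pos].is_ascii_whitespace()
            {
                pos += 1;
            }
            Some(pos)
        }
    }
}

/// Returns the index just past the string whose opening quote is at `pos`.
fn skip_string(json: &[u8], pos: usize) -> Option<usize> {
    let mut i = pos + 1;
    while i < json.len() {
        match json[i] {
            b'\\' => i += 2, // skip the escaped byte as well
            b'"' => return Some(i + 1),
            _ => i += 1,
        }
    }
    None // unterminated string
}
```

A path evaluator built on a helper like this could step over every field that does not match the current path segment and materialize only the value that is actually requested. Whether that closes the gap with Spark would need to be confirmed with the benchmark.
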
## How are these changes tested?

- 19 Rust unit tests covering path parsing and evaluation edge cases
- 30 SQL-file-based tests (`CometSqlFileTestSuite`) that run each query through both Spark and Comet and compare results, with dictionary encoding on/off
- Microbenchmark (`CometGetJsonObjectBenchmark`) comparing Spark vs Comet performance across 5 query patterns
