This is an automated email from the ASF dual-hosted git repository.
mbutrovich pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new 64ebfcc33 docs: various documentation updates in preparation for next release (#3254)
64ebfcc33 is described below
commit 64ebfcc33ee77e05be65a64797047ab84db34c13
Author: Andy Grove <[email protected]>
AuthorDate: Fri Jan 23 12:23:23 2026 -0700
docs: various documentation updates in preparation for next release (#3254)
* docs: add missing expressions to user guide
Add documentation for expressions that were implemented but not
documented in the supported expressions list:
- Left (string function)
- DateDiff, DateFormat, LastDay, UnixDate, UnixTimestamp (date/time)
- Sha1 (hashing)
- JsonToStructs (struct)
Also fixes TruncTimestamp SQL from `trunc_date` to `date_trunc`.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: update documentation to reflect auto scan no longer falls back to
native_comet
Updates contributor guide documentation following the change in c9af2c6e:
- parquet_scans.md: clarify that auto mode falls back to Spark's native
scan, not native_comet
- roadmap.md: reflect that the switch to native_iceberg_compat for auto
mode is complete
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: add REST catalog example to Iceberg user guide
Adds documentation showing how to configure Comet's native Iceberg scan
with a REST catalog, including example Spark configuration and sample
queries demonstrating namespace creation.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: add supported features section to Iceberg user guide
Adds comprehensive documentation of native Iceberg reader capabilities:
- Table spec versions (v1, v2 supported; v3 falls back)
- Schema and data types including complex types and schema evolution
- Time travel and branch reads
- MOR table delete handling (positional and equality deletes)
- Filter pushdown operations
- Partitioning with various transform types
- Storage backends (local, HDFS, S3)
Also improves the limitations section with clearer explanations.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* chore: apply prettier formatting to iceberg.md
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: update CSV section to document native CSV scan support
The datasources.md previously stated that Comet does not provide native
CSV scan, but experimental native CSV support was added in commit f538424d.
Updated to reflect the new spark.comet.scan.csv.v2.enabled option.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: fix incomplete sentence and update roadmap in contributor guide
- parquet_scans.md: complete the truncated sentence in S3 Support section
- roadmap.md: update native_comet removal section to reflect that auto
mode now uses native_iceberg_compat (milestone achieved)
- roadmap.md: fix typo "originally" -> "original"
Co-Authored-By: Claude Opus 4.5 <[email protected]>
---------
Co-authored-by: Claude Opus 4.5 <[email protected]>
---
docs/source/contributor-guide/parquet_scans.md | 8 +--
docs/source/contributor-guide/roadmap.md | 13 ++--
docs/source/user-guide/latest/datasources.md | 6 +-
docs/source/user-guide/latest/expressions.md | 22 ++++--
docs/source/user-guide/latest/iceberg.md | 95 +++++++++++++++++++++++---
5 files changed, 117 insertions(+), 27 deletions(-)
diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
index 8c2fbfd8f..48fda7dc3 100644
--- a/docs/source/contributor-guide/parquet_scans.md
+++ b/docs/source/contributor-guide/parquet_scans.md
@@ -45,8 +45,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li
- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
- rather than signed. By default, Comet will fall back to `native_comet` when scanning Parquet files containing `byte` or `short`
- types (regardless of the logical type). This behavior can be disabled by setting
+ rather than signed. By default, Comet will fall back to Spark's native scan when scanning Parquet files containing
+ `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
`spark.comet.scan.allowIncompatible=true`.
- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
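[Editor's note: a minimal spark-shell sketch of the override described above. Whether `spark.comet.scan.allowIncompatible` can be toggled on a running session rather than at startup is an assumption, and the Parquet path is a placeholder.]

```scala
// Assumption: the setting can be applied to the active session; if not,
// pass it at startup with --conf spark.comet.scan.allowIncompatible=true.
spark.conf.set("spark.comet.scan.allowIncompatible", "true")

// Placeholder path to a Parquet file containing byte/short columns that
// would otherwise trigger the fallback described above.
val df = spark.read.parquet("/tmp/unsigned_ints.parquet")
df.show()
```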
@@ -59,11 +59,11 @@ The `native_datafusion` scan has some additional limitations:
## S3 Support
-There are some
+There are some differences in S3 support between the scan implementations.
### `native_comet`
-The default `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
+The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
configurations work the same way as in vanilla Spark.
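[Editor's note: since credential handling is the same as in vanilla Spark, a minimal sketch using the standard Hadoop S3A keys; the credentials and bucket path are placeholders.]

```scala
// Standard Hadoop S3A credential configuration, as in vanilla Spark.
// <ACCESS_KEY>, <SECRET_KEY>, and the bucket path are placeholders.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

val df = spark.read.parquet("s3a://example-bucket/warehouse/table/")
df.show()
```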
diff --git a/docs/source/contributor-guide/roadmap.md b/docs/source/contributor-guide/roadmap.md
index 9428fa682..b1ccfb604 100644
--- a/docs/source/contributor-guide/roadmap.md
+++ b/docs/source/contributor-guide/roadmap.md
@@ -27,11 +27,10 @@ helpful to have a roadmap for some of the major items that require coordination
### Iceberg Integration
Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few
-releases. Once this integration is complete, we plan on switching from the `native_comet` scan to the
-`native_iceberg_compat` scan ([#2189]) so that complex types can be supported.
+releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling
+support for complex types.
[#2060]: https://github.com/apache/datafusion-comet/issues/2060
-[#2189]: https://github.com/apache/datafusion-comet/issues/2189
### Spark 4.0 Support
@@ -44,12 +43,12 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor
### Removing the native_comet scan implementation
We are working towards deprecating ([#2186]) and removing ([#2177]) the `native_comet` scan implementation, which
-is the originally scan implementation that uses mutable buffers (which is incompatible with best practices around
+is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
Arrow FFI) and does not support complex types.
-Once we are using the `native_iceberg_compat` scan (which is based on DataFusion's `DataSourceExec`) in the Iceberg
-integration, we will be able to remove the `native_comet` scan implementation, and can then improve the efficiency
-of our use of Arrow FFI ([#2171]).
+Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
+we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
+Arrow FFI ([#2171]).
[#2186]: https://github.com/apache/datafusion-comet/issues/2186
[#2171]: https://github.com/apache/datafusion-comet/issues/2171
diff --git a/docs/source/user-guide/latest/datasources.md b/docs/source/user-guide/latest/datasources.md
index cfce4d13d..daae6955d 100644
--- a/docs/source/user-guide/latest/datasources.md
+++ b/docs/source/user-guide/latest/datasources.md
@@ -36,7 +36,11 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
### CSV
-Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
+Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
+are read natively for improved performance. This feature is experimental and performance benefits are workload-dependent.
+
+Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
converted into Arrow format, allowing native execution to happen after that.
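[Editor's note: an illustrative sketch of the two CSV options described above. The file path is a placeholder, and setting these flags mid-session rather than at startup is an assumption.]

```scala
// Option 1 (assumed togglable per session): experimental native CSV scan.
spark.conf.set("spark.comet.scan.csv.v2.enabled", "true")
val df = spark.read.option("header", "true").csv("/tmp/data.csv") // placeholder path
df.show()

// Option 2: keep Spark's CSV reader, but convert its output to Arrow so
// downstream operators run natively.
spark.conf.set("spark.comet.convert.csv.enabled", "true")
```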
### JSON
diff --git a/docs/source/user-guide/latest/expressions.md b/docs/source/user-guide/latest/expressions.md
index 6bd0fada4..0339cd2a3 100644
--- a/docs/source/user-guide/latest/expressions.md
+++ b/docs/source/user-guide/latest/expressions.md
@@ -68,6 +68,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Contains | Yes | |
| EndsWith | Yes | |
| InitCap | No | Behavior is different in some cases, such as hyphenated names. |
+| Left | Yes | Length argument must be a literal value |
| Length | Yes | |
| Like | Yes | |
| Lower | No | Results can vary depending on locale and character set. Requires `spark.comet.caseConversion.enabled=true` |
@@ -94,15 +95,20 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Expression | SQL | Spark-Compatible? | Compatibility Notes |
| -------------- | ---------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------- |
| DateAdd | `date_add` | Yes | |
+| DateDiff | `datediff` | Yes | |
+| DateFormat | `date_format` | Yes | Partial support. Only specific format patterns are supported. |
| DateSub | `date_sub` | Yes | |
| DatePart | `date_part(field, source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| Extract | `extract(field FROM source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| FromUnixTime | `from_unixtime` | No | Does not support format, supports only -8334601211038 <= sec <= 8210266876799 |
| Hour | `hour` | Yes | |
+| LastDay | `last_day` | Yes | |
| Minute | `minute` | Yes | |
| Second | `second` | Yes | |
| TruncDate | `trunc` | Yes | |
-| TruncTimestamp | `trunc_date` | Yes | |
+| TruncTimestamp | `date_trunc` | Yes | |
+| UnixDate | `unix_date` | Yes | |
+| UnixTimestamp | `unix_timestamp` | Yes | |
| Year | `year` | Yes | |
| Month | `month` | Yes | |
| DayOfMonth | `day`/`dayofmonth` | Yes | |
@@ -163,6 +169,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| ----------- | ----------------- |
| Md5 | Yes |
| Murmur3Hash | Yes |
+| Sha1 | Yes |
| Sha2 | Yes |
| XxHash64 | Yes |
@@ -256,12 +263,13 @@ Comet supports using the following aggregate functions within window contexts wi
## Struct Expressions
-| Expression | Spark-Compatible? |
-| -------------------- | ----------------- |
-| CreateNamedStruct | Yes |
-| GetArrayStructFields | Yes |
-| GetStructField | Yes |
-| StructsToJson | Yes |
+| Expression | Spark-Compatible? | Compatibility Notes |
+| -------------------- | ----------------- | ------------------------------------------ |
+| CreateNamedStruct | Yes | |
+| GetArrayStructFields | Yes | |
+| GetStructField | Yes | |
+| JsonToStructs | No | Partial support. Requires explicit schema. |
+| StructsToJson | Yes | |
## Conversion Expressions
diff --git a/docs/source/user-guide/latest/iceberg.md b/docs/source/user-guide/latest/iceberg.md
index 9bf681cb0..ad6e5b243 100644
--- a/docs/source/user-guide/latest/iceberg.md
+++ b/docs/source/user-guide/latest/iceberg.md
@@ -183,14 +183,93 @@ $SPARK_HOME/bin/spark-shell \
The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
however the scan node to look for is `CometIcebergNativeScan`.
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
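[Editor's note: an illustration of the time-travel and filter-pushdown features listed above, as a sketch of a snapshot read with pushdown-eligible predicates; the catalog, table, and snapshot id are placeholders.]

```scala
// Time travel to a specific snapshot plus pushdown-eligible predicates.
// demo.db.events and the snapshot id are placeholders.
spark.sql("""
  SELECT id, name
  FROM demo.db.events VERSION AS OF 4357868857950435531
  WHERE id BETWEEN 10 AND 20 AND name IS NOT NULL
""").show()
```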
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+  --conf spark.comet.scan.icebergNative.enabled=true \
+  --conf spark.comet.explainFallback.enabled=true \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING)
USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2,
'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
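[Editor's note: as noted earlier in this guide, the scan node to look for is `CometIcebergNativeScan`; one way to confirm the native scan is active:]

```scala
scala> spark.sql("SELECT * FROM rest_cat.db.test_table").explain()
// The physical plan should contain a CometIcebergNativeScan node; if it shows
// a regular Iceberg BatchScan instead, the spark.comet.explainFallback.enabled
// setting above will log the reason for the fallback.
```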
+
### Current limitations
-The following scenarios are not yet supported, but are work in progress:
+The following scenarios will fall back to Spark's native Iceberg reader:
-- Iceberg table spec v3 scans will fall back.
-- Iceberg writes will fall back.
-- Iceberg table scans backed by Avro or ORC data files will fall back.
-- Iceberg table scans partitioned on `BINARY` or `DECIMAL` (with precision >28) columns will fall back.
-- Iceberg scans with residual filters (_i.e._, filter expressions that are not partition values,
- and are evaluated on the column values at scan time) of `truncate`, `bucket`, `year`, `month`,
- `day`, `hour` will fall back.
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+ transform functions (partition pruning still works, but row-level filtering of these
+ transforms falls back)