This is an automated email from the ASF dual-hosted git repository.
mbutrovich pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new 64ebfcc33 docs: various documentation updates in preparation for next release (#3254)
64ebfcc33 is described below
commit 64ebfcc33ee77e05be65a64797047ab84db34c13
Author: Andy Grove <[email protected]>
AuthorDate: Fri Jan 23 12:23:23 2026 -0700
docs: various documentation updates in preparation for next release (#3254)
* docs: add missing expressions to user guide
Add documentation for expressions that were implemented but not
documented in the supported expressions list:
- Left (string function)
- DateDiff, DateFormat, LastDay, UnixDate, UnixTimestamp (date/time)
- Sha1 (hashing)
- JsonToStructs (struct)
Also fixes TruncTimestamp SQL from `trunc_date` to `date_trunc`.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: update documentation to reflect auto scan no longer falls back to
native_comet
Updates contributor guide documentation following the change in c9af2c6e:
- parquet_scans.md: clarify that auto mode falls back to Spark's native
scan, not native_comet
- roadmap.md: reflect that the switch to native_iceberg_compat for auto
mode is complete
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: add REST catalog example to Iceberg user guide
Adds documentation showing how to configure Comet's native Iceberg scan
with a REST catalog, including example Spark configuration and sample
queries demonstrating namespace creation.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: add supported features section to Iceberg user guide
Adds comprehensive documentation of native Iceberg reader capabilities:
- Table spec versions (v1, v2 supported; v3 falls back)
- Schema and data types including complex types and schema evolution
- Time travel and branch reads
- MOR table delete handling (positional and equality deletes)
- Filter pushdown operations
- Partitioning with various transform types
- Storage backends (local, HDFS, S3)
Also improves the limitations section with clearer explanations.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* chore: apply prettier formatting to iceberg.md
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: update CSV section to document native CSV scan support
The datasources.md previously stated that Comet does not provide native
CSV scan, but experimental native CSV support was added in commit f538424d.
Updated to reflect the new spark.comet.scan.csv.v2.enabled option.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: fix incomplete sentence and update roadmap in contributor guide
- parquet_scans.md: complete the truncated sentence in S3 Support section
- roadmap.md: update native_comet removal section to reflect that auto
mode now uses native_iceberg_compat (milestone achieved)
- roadmap.md: fix typo "originally" -> "original"
Co-Authored-By: Claude Opus 4.5 <[email protected]>
---------
Co-authored-by: Claude Opus 4.5 <[email protected]>
---
docs/source/contributor-guide/parquet_scans.md | 8 +--
docs/source/contributor-guide/roadmap.md | 13 ++--
docs/source/user-guide/latest/datasources.md | 6 +-
docs/source/user-guide/latest/expressions.md | 22 ++++--
docs/source/user-guide/latest/iceberg.md | 95 +++++++++++++++++++++++---
5 files changed, 117 insertions(+), 27 deletions(-)
diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
index 8c2fbfd8f..48fda7dc3 100644
--- a/docs/source/contributor-guide/parquet_scans.md
+++ b/docs/source/contributor-guide/parquet_scans.md
@@ -45,8 +45,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li
- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
- rather than signed. By default, Comet will fall back to `native_comet` when scanning Parquet files containing `byte` or `short`
- types (regardless of the logical type). This behavior can be disabled by setting
+ rather than signed. By default, Comet will fall back to Spark's native scan when scanning Parquet files containing
+ `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
`spark.comet.scan.allowIncompatible=true`.
- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
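[Editor's note: a minimal spark-shell sketch of the override described above. Whether `spark.comet.scan.allowIncompatible` can be toggled on a running session rather than at startup is an assumption, and the Parquet path is a placeholder.]

```scala
// Assumption: the setting can be applied to the active session; if not,
// pass it at startup with --conf spark.comet.scan.allowIncompatible=true.
spark.conf.set("spark.comet.scan.allowIncompatible", "true")

// Placeholder path to a Parquet file containing byte/short columns that
// would otherwise trigger the fallback described above.
val df = spark.read.parquet("/tmp/unsigned_ints.parquet")
df.show()
```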
@@ -59,11 +59,11 @@ The `native_datafusion` scan has some additional limitations:
## S3 Support
-There are some
+There are some differences in S3 support between the scan implementations.
### `native_comet`
-The default `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
+The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
configurations work the same way as in vanilla Spark.
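[Editor's note: since credential handling is the same as in vanilla Spark, a minimal sketch using the standard Hadoop S3A keys; the credentials and bucket path are placeholders.]

```scala
// Standard Hadoop S3A credential configuration, as in vanilla Spark.
// <ACCESS_KEY>, <SECRET_KEY>, and the bucket path are placeholders.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

val df = spark.read.parquet("s3a://example-bucket/warehouse/table/")
df.show()
```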
diff --git a/docs/source/contributor-guide/roadmap.md b/docs/source/contributor-guide/roadmap.md
index 9428fa682..b1ccfb604 100644
--- a/docs/source/contributor-guide/roadmap.md
+++ b/docs/source/contributor-guide/roadmap.md
@@ -27,11 +27,10 @@ helpful to have a roadmap for some of the major items that require coordination
### Iceberg Integration
Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few
-releases. Once this integration is complete, we plan on switching from the `native_comet` scan to the
-`native_iceberg_compat` scan ([#2189]) so that complex types can be supported.
+releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling
+support for complex types.
[#2060]: https://github.com/apache/datafusion-comet/issues/2060
-[#2189]: https://github.com/apache/datafusion-comet/issues/2189
### Spark 4.0 Support
@@ -44,12 +43,12 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor
### Removing the native_comet scan implementation
We are working towards deprecating ([#2186]) and removing ([#2177]) the `native_comet` scan implementation, which
-is the originally scan implementation that uses mutable buffers (which is incompatible with best practices around
+is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
Arrow FFI) and does not support complex types.
-Once we are using the `native_iceberg_compat` scan (which is based on DataFusion's `DataSourceExec`) in the Iceberg
-integration, we will be able to remove the `native_comet` scan implementation, and can then improve the efficiency
-of our use of Arrow FFI ([#2171]).
+Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
+we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
+Arrow FFI ([#2171]).
[#2186]: https://github.com/apache/datafusion-comet/issues/2186
[#2171]: https://github.com/apache/datafusion-comet/issues/2171
diff --git a/docs/source/user-guide/latest/datasources.md b/docs/source/user-guide/latest/datasources.md
index cfce4d13d..daae6955d 100644
--- a/docs/source/user-guide/latest/datasources.md
+++ b/docs/source/user-guide/latest/datasources.md
@@ -36,7 +36,11 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
### CSV
-Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
+Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
+are read natively for improved performance. This feature is experimental and performance benefits are workload-dependent.
+
+Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
converted into Arrow format, allowing native execution to happen after that.
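[Editor's note: an illustrative sketch of the two CSV options described above. The file path is a placeholder, and setting these flags mid-session rather than at startup is an assumption.]

```scala
// Option 1 (assumed togglable per session): experimental native CSV scan.
spark.conf.set("spark.comet.scan.csv.v2.enabled", "true")
val df = spark.read.option("header", "true").csv("/tmp/data.csv") // placeholder path
df.show()

// Option 2: keep Spark's CSV reader, but convert its output to Arrow so
// downstream operators run natively.
spark.conf.set("spark.comet.convert.csv.enabled", "true")
```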
### JSON
diff --git a/docs/source/user-guide/latest/expressions.md b/docs/source/user-guide/latest/expressions.md
index 6bd0fada4..0339cd2a3 100644
--- a/docs/source/user-guide/latest/expressions.md
+++ b/docs/source/user-guide/latest/expressions.md
@@ -68,6 +68,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Contains | Yes | |
| EndsWith | Yes | |
| InitCap | No | Behavior is different in some cases, such as hyphenated names. |
+| Left | Yes | Length argument must be a literal value |
| Length | Yes | |
| Like | Yes | |
| Lower | No | Results can vary depending on locale and character set. Requires `spark.comet.caseConversion.enabled=true` |
@@ -94,15 +95,20 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Expression | SQL | Spark-Compatible? | Compatibility Notes |
| -------------- | ---------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------- |
| DateAdd | `date_add` | Yes | |
+| DateDiff | `datediff` | Yes | |
+| DateFormat | `date_format` | Yes | Partial support. Only specific format patterns are supported. |
| DateSub | `date_sub` | Yes | |
| DatePart | `date_part(field, source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| Extract | `extract(field FROM source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| FromUnixTime | `from_unixtime` | No | Does not support format, supports only -8334601211038 <= sec <= 8210266876799 |
| Hour | `hour` | Yes | |
+| LastDay | `last_day` | Yes | |
| Minute | `minute` | Yes | |
| Second | `second` | Yes | |
| TruncDate | `trunc` | Yes | |
-| TruncTimestamp | `trunc_date` | Yes | |
+| TruncTimestamp | `date_trunc` | Yes | |
+| UnixDate | `unix_date` | Yes | |
+| UnixTimestamp | `unix_timestamp` | Yes | |
| Year | `year` | Yes | |
| Month | `month` | Yes | |
| DayOfMonth | `day`/`dayofmonth` | Yes | |
@@ -163,6 +169,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| ----------- | ----------------- |
| Md5 | Yes |
| Murmur3Hash | Yes |
+| Sha1 | Yes |
| Sha2 | Yes |
| XxHash64 | Yes |
@@ -256,12 +263,13 @@ Comet supports using the following aggregate functions within window contexts wi
## Struct Expressions
-| Expression | Spark-Compatible? |
-| -------------------- | ----------------- |
-| CreateNamedStruct | Yes |
-| GetArrayStructFields | Yes |
-| GetStructField | Yes |
-| StructsToJson | Yes |
+| Expression | Spark-Compatible? | Compatibility Notes |
+| -------------------- | ----------------- | ------------------------------------------ |
+| CreateNamedStruct | Yes | |
+| GetArrayStructFields | Yes | |
+| GetStructField | Yes | |
+| JsonToStructs | No | Partial support. Requires explicit schema. |
+| StructsToJson | Yes | |
## Conversion Expressions
diff --git a/docs/source/user-guide/latest/iceberg.md b/docs/source/user-guide/latest/iceberg.md
index 9bf681cb0..ad6e5b243 100644
--- a/docs/source/user-guide/latest/iceberg.md
+++ b/docs/source/user-guide/latest/iceberg.md
@@ -183,14 +183,93 @@ $SPARK_HOME/bin/spark-shell \
The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
however the scan node to look for is `CometIcebergNativeScan`.
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
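[Editor's note: an illustration of the time-travel and filter-pushdown features listed above, as a sketch of a snapshot read with pushdown-eligible predicates; the catalog, table, and snapshot id are placeholders.]

```scala
// Time travel to a specific snapshot plus pushdown-eligible predicates.
// demo.db.events and the snapshot id are placeholders.
spark.sql("""
  SELECT id, name
  FROM demo.db.events VERSION AS OF 4357868857950435531
  WHERE id BETWEEN 10 AND 20 AND name IS NOT NULL
""").show()
```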
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+  --conf spark.comet.scan.icebergNative.enabled=true \
+  --conf spark.comet.explainFallback.enabled=true \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING)
USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2,
'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
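[Editor's note: as noted earlier in this guide, the scan node to look for is `CometIcebergNativeScan`; one way to confirm the native scan is active:]

```scala
scala> spark.sql("SELECT * FROM rest_cat.db.test_table").explain()
// The physical plan should contain a CometIcebergNativeScan node; if it shows
// a regular Iceberg BatchScan instead, the spark.comet.explainFallback.enabled
// setting above will log the reason for the fallback.
```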
+
### Current limitations
-The following scenarios are not yet supported, but are work in progress:
+The following scenarios will fall back to Spark's native Iceberg reader:
-- Iceberg table spec v3 scans will fall back.
-- Iceberg writes will fall back.
-- Iceberg table scans backed by Avro or ORC data files will fall back.
-- Iceberg table scans partitioned on `BINARY` or `DECIMAL` (with precision >28) columns will fall back.
-- Iceberg scans with residual filters (_i.e._, filter expressions that are not partition values,
- and are evaluated on the column values at scan time) of `truncate`, `bucket`, `year`, `month`,
- `day`, `hour` will fall back.
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+ transform functions (partition pruning still works, but row-level filtering of these
+ transforms falls back)