This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new ee8d0a2d3 Publish built docs triggered by 64ebfcc33ee77e05be65a64797047ab84db34c13
ee8d0a2d3 is described below
commit ee8d0a2d32e586693e11720328cc77d61723c2ff
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Fri Jan 23 19:24:34 2026 +0000
Publish built docs triggered by 64ebfcc33ee77e05be65a64797047ab84db34c13
---
_sources/contributor-guide/parquet_scans.md.txt | 8 +-
_sources/contributor-guide/roadmap.md.txt | 13 ++-
_sources/user-guide/latest/datasources.md.txt | 6 +-
_sources/user-guide/latest/expressions.md.txt | 22 +++--
_sources/user-guide/latest/iceberg.md.txt | 95 ++++++++++++++++++--
contributor-guide/parquet_scans.html | 8 +-
contributor-guide/roadmap.html | 12 +--
searchindex.js | 2 +-
user-guide/latest/datasources.html | 5 +-
user-guide/latest/expressions.html | 113 ++++++++++++++++--------
user-guide/latest/iceberg.html | 92 +++++++++++++++++--
11 files changed, 293 insertions(+), 83 deletions(-)
diff --git a/_sources/contributor-guide/parquet_scans.md.txt b/_sources/contributor-guide/parquet_scans.md.txt
index 8c2fbfd8f..48fda7dc3 100644
--- a/_sources/contributor-guide/parquet_scans.md.txt
+++ b/_sources/contributor-guide/parquet_scans.md.txt
@@ -45,8 +45,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li
- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
  or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
  logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
-  rather than signed. By default, Comet will fall back to `native_comet` when scanning Parquet files containing `byte` or `short`
-  types (regardless of the logical type). This behavior can be disabled by setting
+  rather than signed. By default, Comet will fall back to Spark's native scan when scanning Parquet files containing
+  `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
  `spark.comet.scan.allowIncompatible=true`.
- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
@@ -59,11 +59,11 @@ The `native_datafusion` scan has some additional limitations:
## S3 Support
-There are some
+There are some differences in S3 support between the scan implementations.
### `native_comet`
-The default `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
+The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
configurations works the same way as in vanilla Spark.
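Stepping outside the patch for a moment: the fallback described in the parquet_scans change above is governed by a single setting. A minimal illustration follows (generic spark-shell invocation; only `spark.comet.scan.allowIncompatible` is taken from the diff, the rest is standard Comet setup and the version numbers are assumptions):

```shell
# Opt in to Comet's native byte/short handling despite the documented
# divergence from Spark on unsigned Parquet logical types.
$SPARK_HOME/bin/spark-shell \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.comet.scan.allowIncompatible=true
```

With this set, Comet keeps scanning Parquet files containing `byte` or `short` columns natively instead of falling back, accepting that unsigned logical types are read as unsigned.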
diff --git a/_sources/contributor-guide/roadmap.md.txt b/_sources/contributor-guide/roadmap.md.txt
index 9428fa682..b1ccfb604 100644
--- a/_sources/contributor-guide/roadmap.md.txt
+++ b/_sources/contributor-guide/roadmap.md.txt
@@ -27,11 +27,10 @@ helpful to have a roadmap for some of the major items that require coordination
### Iceberg Integration
Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few
-releases. Once this integration is complete, we plan on switching from the `native_comet` scan to the
-`native_iceberg_compat` scan ([#2189]) so that complex types can be supported.
+releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling
+support for complex types.
[#2060]: https://github.com/apache/datafusion-comet/issues/2060
-[#2189]: https://github.com/apache/datafusion-comet/issues/2189
### Spark 4.0 Support
@@ -44,12 +43,12 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor
### Removing the native_comet scan implementation
We are working towards deprecating ([#2186]) and removing ([#2177]) the `native_comet` scan implementation, which
-is the originally scan implementation that uses mutable buffers (which is incompatible with best practices around
+is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
Arrow FFI) and does not support complex types.
-Once we are using the `native_iceberg_compat` scan (which is based on DataFusion's `DataSourceExec`) in the Iceberg
-integration, we will be able to remove the `native_comet` scan implementation, and can then improve the efficiency
-of our use of Arrow FFI ([#2171]).
+Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
+we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
+Arrow FFI ([#2171]).
[#2186]: https://github.com/apache/datafusion-comet/issues/2186
[#2171]: https://github.com/apache/datafusion-comet/issues/2171
diff --git a/_sources/user-guide/latest/datasources.md.txt b/_sources/user-guide/latest/datasources.md.txt
index dcf0424ae..444b14c2d 100644
--- a/_sources/user-guide/latest/datasources.md.txt
+++ b/_sources/user-guide/latest/datasources.md.txt
@@ -36,7 +36,11 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
### CSV
-Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
+Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
+are read natively for improved performance. This feature is experimental and performance benefits are
+workload-dependent.
+
+Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
converted into Arrow format, allowing native execution to happen after that.
### JSON
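As context for the datasources change above, the two CSV paths it describes can be toggled like this (illustrative spark-shell invocation; only the two `spark.comet.*.csv.*` config names come from the diff, everything else is generic Comet setup):

```shell
# Experimental native CSV scan (new path introduced in this commit's docs).
$SPARK_HOME/bin/spark-shell \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.comet.scan.csv.v2.enabled=true \
  --conf spark.comet.explainFallback.enabled=true
```

If the experimental scan causes problems for a workload, it can be disabled and `spark.comet.convert.csv.enabled=true` used instead, which keeps Spark's CSV parser but converts its output to Arrow so downstream operators still run natively.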
diff --git a/_sources/user-guide/latest/expressions.md.txt b/_sources/user-guide/latest/expressions.md.txt
index 6bd0fada4..0339cd2a3 100644
--- a/_sources/user-guide/latest/expressions.md.txt
+++ b/_sources/user-guide/latest/expressions.md.txt
@@ -68,6 +68,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Contains | Yes | |
| EndsWith | Yes | |
| InitCap | No | Behavior is different in some cases, such as hyphenated names. |
+| Left | Yes | Length argument must be a literal value |
| Length | Yes | |
| Like | Yes | |
| Lower | No | Results can vary depending on locale and character set. Requires `spark.comet.caseConversion.enabled=true` |
@@ -94,15 +95,20 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Expression | SQL | Spark-Compatible? | Compatibility Notes |
| -------------- | ---------------------------- | ----------------- | -------------------------------------------------------------------------------------------------------------------- |
| DateAdd | `date_add` | Yes | |
+| DateDiff | `datediff` | Yes | |
+| DateFormat | `date_format` | Yes | Partial support. Only specific format patterns are supported. |
| DateSub | `date_sub` | Yes | |
| DatePart | `date_part(field, source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| Extract | `extract(field FROM source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| FromUnixTime | `from_unixtime` | No | Does not support format, supports only -8334601211038 <= sec <= 8210266876799 |
| Hour | `hour` | Yes | |
+| LastDay | `last_day` | Yes | |
| Minute | `minute` | Yes | |
| Second | `second` | Yes | |
| TruncDate | `trunc` | Yes | |
-| TruncTimestamp | `trunc_date` | Yes | |
+| TruncTimestamp | `date_trunc` | Yes | |
+| UnixDate | `unix_date` | Yes | |
+| UnixTimestamp | `unix_timestamp` | Yes | |
| Year | `year` | Yes | |
| Month | `month` | Yes | |
| DayOfMonth | `day`/`dayofmonth` | Yes | |
@@ -163,6 +169,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| ----------- | ----------------- |
| Md5 | Yes |
| Murmur3Hash | Yes |
+| Sha1 | Yes |
| Sha2 | Yes |
| XxHash64 | Yes |
@@ -256,12 +263,13 @@ Comet supports using the following aggregate functions within window contexts wi
## Struct Expressions
-| Expression | Spark-Compatible? |
-| -------------------- | ----------------- |
-| CreateNamedStruct | Yes |
-| GetArrayStructFields | Yes |
-| GetStructField | Yes |
-| StructsToJson | Yes |
+| Expression | Spark-Compatible? | Compatibility Notes |
+| -------------------- | ----------------- | ------------------------------------------ |
+| CreateNamedStruct | Yes | |
+| GetArrayStructFields | Yes | |
+| GetStructField | Yes | |
+| JsonToStructs | No | Partial support. Requires explicit schema. |
+| StructsToJson | Yes | |
## Conversion Expressions
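As an editorial aside on the corrected `TruncTimestamp` mapping above: Spark's `date_trunc(fmt, ts)` takes the format first, which is the opposite argument order from `trunc(date, fmt)` (the `TruncDate` mapping). A quick illustration, as a hypothetical spark-shell session that is not part of the commit:

```scala
// TruncDate: trunc(date, fmt) -- date first
scala> spark.sql("SELECT trunc(DATE'2026-01-23', 'MONTH')").show()

// TruncTimestamp: date_trunc(fmt, ts) -- format first
scala> spark.sql("SELECT date_trunc('HOUR', TIMESTAMP'2026-01-23 19:24:34')").show()
```

The earlier docs listed `trunc_date` as the SQL name for `TruncTimestamp`, which is not a Spark function; the diff corrects it to `date_trunc`.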
diff --git a/_sources/user-guide/latest/iceberg.md.txt b/_sources/user-guide/latest/iceberg.md.txt
index 3ecea45aa..122c80494 100644
--- a/_sources/user-guide/latest/iceberg.md.txt
+++ b/_sources/user-guide/latest/iceberg.md.txt
@@ -183,14 +183,93 @@ $SPARK_HOME/bin/spark-shell \
The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
however the scan node to look for is `CometIcebergNativeScan`.
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+ --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+ --conf spark.comet.scan.icebergNative.enabled=true \
+ --conf spark.comet.explainFallback.enabled=true \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
+
### Current limitations
-The following scenarios are not yet supported, but are work in progress:
+The following scenarios will fall back to Spark's native Iceberg reader:
-- Iceberg table spec v3 scans will fall back.
-- Iceberg writes will fall back.
-- Iceberg table scans backed by Avro or ORC data files will fall back.
-- Iceberg table scans partitioned on `BINARY` or `DECIMAL` (with precision >28) columns will fall back.
-- Iceberg scans with residual filters (_i.e._, filter expressions that are not partition values,
-  and are evaluated on the column values at scan time) of `truncate`, `bucket`, `year`, `month`,
-  `day`, `hour` will fall back.
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+  transform functions (partition pruning still works, but row-level filtering of these
+  transforms falls back)
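The time-travel and branch-read features documented in the iceberg.md change above can be exercised with queries along these lines (hypothetical snapshot ID and branch name, session assumed to be configured as in the REST catalog example; syntax per Iceberg's Spark integration):

```scala
// Time travel to a historical snapshot (snapshot ID is illustrative)
scala> spark.sql("SELECT * FROM rest_cat.db.test_table VERSION AS OF 123456789").show()

// Read from a named branch via Iceberg's branch_<name> identifier
scala> spark.sql("SELECT * FROM rest_cat.db.test_table.branch_dev").show()
```

In both cases the plan should still show `CometIcebergNativeScan` when the native reader handles the query, per the docs above.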
diff --git a/contributor-guide/parquet_scans.html b/contributor-guide/parquet_scans.html
index c31994090..65f6de183 100644
--- a/contributor-guide/parquet_scans.html
+++ b/contributor-guide/parquet_scans.html
@@ -493,8 +493,8 @@ implementation:</p>
<li><p>When reading Parquet files written by systems other than Spark that contain columns with the logical types <code class="docutils literal notranslate"><span class="pre">UINT_8</span></code>
or <code class="docutils literal notranslate"><span class="pre">UINT_16</span></code>, Comet will produce different results than Spark because Spark does not preserve or understand these
logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
-rather than signed. By default, Comet will fall back to <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> when scanning Parquet files containing <code class="docutils literal notranslate"><span class="pre">byte</span></code> or <code class="docutils literal notranslate"><span class="pre">short</span></code>
-types (regardless of the logical type). This behavior can be disabled by setting
+rather than signed. By default, Comet will fall back to Spark’s native scan when scanning Parquet files containing
+<code class="docutils literal notranslate"><span class="pre">byte</span></code> or <code class="docutils literal notranslate"><span class="pre">short</span></code> types (regardless of the logical type). This behavior can be disabled by setting
<code class="docutils literal notranslate"><span class="pre">spark.comet.scan.allowIncompatible=true</span></code>.</p></li>
<li><p>No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.</p></li>
</ul>
@@ -507,10 +507,10 @@ types (regardless of the logical type). This behavior can be disabled by setting
</ul>
<section id="s3-support">
<h2>S3 Support<a class="headerlink" href="#s3-support" title="Link to this heading">#</a></h2>
-<p>There are some</p>
+<p>There are some differences in S3 support between the scan implementations.</p>
<section id="native-comet">
<h3><code class="docutils literal notranslate"><span class="pre">native_comet</span></code><a class="headerlink" href="#native-comet" title="Link to this heading">#</a></h3>
-<p>The default <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> Parquet scan implementation reads data from S3 using the <a class="reference external" href="https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html">Hadoop-AWS module</a>, which
+<p>The <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> Parquet scan implementation reads data from S3 using the <a class="reference external" href="https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html">Hadoop-AWS module</a>, which
is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
configurations works the same way as in vanilla Spark.</p>
</section>
diff --git a/contributor-guide/roadmap.html b/contributor-guide/roadmap.html
index a55d609bd..881625012 100644
--- a/contributor-guide/roadmap.html
+++ b/contributor-guide/roadmap.html
@@ -462,8 +462,8 @@ helpful to have a roadmap for some of the major items that require coordination
<section id="iceberg-integration">
<h3>Iceberg Integration<a class="headerlink" href="#iceberg-integration" title="Link to this heading">#</a></h3>
<p>Iceberg integration is still a work-in-progress (<a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2060">#2060</a>), with major improvements expected in the next few
-releases. Once this integration is complete, we plan on switching from the <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> scan to the
-<code class="docutils literal notranslate"><span class="pre">native_iceberg_compat</span></code> scan (<a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2189">#2189</a>) so that complex types can be supported.</p>
+releases. The default <code class="docutils literal notranslate"><span class="pre">auto</span></code> scan mode now uses <code class="docutils literal notranslate"><span class="pre">native_iceberg_compat</span></code> instead of <code class="docutils literal notranslate"><span class="pre">native_comet</span></code>, enabling
+support for complex types.</p>
</section>
<section id="spark-4-0-support">
<h3>Spark 4.0 Support<a class="headerlink" href="#spark-4-0-support" title="Link to this heading">#</a></h3>
@@ -473,11 +473,11 @@ more Spark SQL tests and fully implementing ANSI support (<a class="reference ex
<section id="removing-the-native-comet-scan-implementation">
<h3>Removing the native_comet scan implementation<a class="headerlink" href="#removing-the-native-comet-scan-implementation" title="Link to this heading">#</a></h3>
<p>We are working towards deprecating (<a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2186">#2186</a>) and removing (<a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2177">#2177</a>) the <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> scan implementation, which
-is the originally scan implementation that uses mutable buffers (which is incompatible with best practices around
+is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
Arrow FFI) and does not support complex types.</p>
-<p>Once we are using the <code class="docutils literal notranslate"><span class="pre">native_iceberg_compat</span></code> scan (which is based on DataFusion’s <code class="docutils literal notranslate"><span class="pre">DataSourceExec</span></code>) in the Iceberg
-integration, we will be able to remove the <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> scan implementation, and can then improve the efficiency
-of our use of Arrow FFI (<a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2171">#2171</a>).</p>
+<p>Now that the default <code class="docutils literal notranslate"><span class="pre">auto</span></code> scan mode uses <code class="docutils literal notranslate"><span class="pre">native_iceberg_compat</span></code> (which is based on DataFusion’s <code class="docutils literal notranslate"><span class="pre">DataSourceExec</span></code>),
+we can proceed with removing the <code class="docutils literal notranslate"><span class="pre">native_comet</span></code> scan implementation, and then improve the efficiency of our use of
+Arrow FFI (<a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2171">#2171</a>).</p>
</section>
</section>
<section id="ongoing-improvements">
diff --git a/searchindex.js b/searchindex.js
index 29f590a44..089c44d20 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"alltitles": {"1. Format Your Code": [[12, "format-your-code"]], "1. Install Comet": [[21, "install-comet"]], "1. Native Operators (nativeExecs map)": [[4, "native-operators-nativeexecs-map"]], "2. Build and Verify": [[12, "build-and-verify"]], "2. Clone Spark and Apply Diff": [[21, "clone-spark-and-apply-diff"]], "2. Sink Operators (sinks map)": [[4, "sink-operators-sinks-map"]], "3. Comet JVM Operators": [[4, "comet-jvm-operators"]], "3. Run Clippy (Recommended)": [[12 [...]
\ No newline at end of file
+Search.setIndex({"alltitles": {"1. Format Your Code": [[12, "format-your-code"]], "1. Install Comet": [[21, "install-comet"]], "1. Native Operators (nativeExecs map)": [[4, "native-operators-nativeexecs-map"]], "2. Build and Verify": [[12, "build-and-verify"]], "2. Clone Spark and Apply Diff": [[21, "clone-spark-and-apply-diff"]], "2. Sink Operators (sinks map)": [[4, "sink-operators-sinks-map"]], "3. Comet JVM Operators": [[4, "comet-jvm-operators"]], "3. Run Clippy (Recommended)": [[12 [...]
\ No newline at end of file
diff --git a/user-guide/latest/datasources.html b/user-guide/latest/datasources.html
index 84cd94923..15308e221 100644
--- a/user-guide/latest/datasources.html
+++ b/user-guide/latest/datasources.html
@@ -476,7 +476,10 @@ execution to happen after that, but the process may not be efficient.</p>
</section>
<section id="csv">
<h3>CSV<a class="headerlink" href="#csv" title="Link to this heading">#</a></h3>
-<p>Comet does not provide native CSV scan, but when <code class="docutils literal notranslate"><span class="pre">spark.comet.convert.csv.enabled</span></code> is enabled, data is immediately
+<p>Comet provides experimental native CSV scan support. When <code class="docutils literal notranslate"><span class="pre">spark.comet.scan.csv.v2.enabled</span></code> is enabled, CSV files
+are read natively for improved performance. This feature is experimental and performance benefits are
+workload-dependent.</p>
+<p>Alternatively, when <code class="docutils literal notranslate"><span class="pre">spark.comet.convert.csv.enabled</span></code> is enabled, data from Spark’s CSV reader is immediately
converted into Arrow format, allowing native execution to happen after that.</p>
</section>
<section id="json">
diff --git a/user-guide/latest/expressions.html b/user-guide/latest/expressions.html
index b8325ab72..f26865331 100644
--- a/user-guide/latest/expressions.html
+++ b/user-guide/latest/expressions.html
@@ -600,83 +600,87 @@ of expressions that be disabled.</p>
<td><p>No</p></td>
<td><p>Behavior is different in some cases, such as hyphenated names.</p></td>
</tr>
-<tr class="row-even"><td><p>Length</p></td>
+<tr class="row-even"><td><p>Left</p></td>
+<td><p>Yes</p></td>
+<td><p>Length argument must be a literal value</p></td>
+</tr>
+<tr class="row-odd"><td><p>Length</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>Like</p></td>
+<tr class="row-even"><td><p>Like</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>Lower</p></td>
+<tr class="row-odd"><td><p>Lower</p></td>
<td><p>No</p></td>
<td><p>Results can vary depending on locale and character set. Requires <code class="docutils literal notranslate"><span class="pre">spark.comet.caseConversion.enabled=true</span></code></p></td>
</tr>
-<tr class="row-odd"><td><p>OctetLength</p></td>
+<tr class="row-even"><td><p>OctetLength</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>Reverse</p></td>
+<tr class="row-odd"><td><p>Reverse</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>RLike</p></td>
+<tr class="row-even"><td><p>RLike</p></td>
<td><p>No</p></td>
<td><p>Uses Rust regexp engine, which has different behavior to Java regexp engine</p></td>
</tr>
-<tr class="row-even"><td><p>StartsWith</p></td>
+<tr class="row-odd"><td><p>StartsWith</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>StringInstr</p></td>
+<tr class="row-even"><td><p>StringInstr</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>StringRepeat</p></td>
+<tr class="row-odd"><td><p>StringRepeat</p></td>
<td><p>Yes</p></td>
<td><p>Negative argument for number of times to repeat causes exception</p></td>
</tr>
-<tr class="row-odd"><td><p>StringReplace</p></td>
+<tr class="row-even"><td><p>StringReplace</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>StringLPad</p></td>
+<tr class="row-odd"><td><p>StringLPad</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>StringRPad</p></td>
+<tr class="row-even"><td><p>StringRPad</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>StringSpace</p></td>
+<tr class="row-odd"><td><p>StringSpace</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>StringTranslate</p></td>
+<tr class="row-even"><td><p>StringTranslate</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>StringTrim</p></td>
+<tr class="row-odd"><td><p>StringTrim</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>StringTrimBoth</p></td>
+<tr class="row-even"><td><p>StringTrimBoth</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>StringTrimLeft</p></td>
+<tr class="row-odd"><td><p>StringTrimLeft</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>StringTrimRight</p></td>
+<tr class="row-even"><td><p>StringTrimRight</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>Substring</p></td>
+<tr class="row-odd"><td><p>Substring</p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>Upper</p></td>
+<tr class="row-even"><td><p>Upper</p></td>
<td><p>No</p></td>
<td><p>Results can vary depending on locale and character set. Requires <code class="docutils literal notranslate"><span class="pre">spark.comet.caseConversion.enabled=true</span></code></p></td>
</tr>
@@ -700,6 +704,16 @@ of expressions that be disabled.</p>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
+<tr class="row-odd"><td><p>DateDiff</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">datediff</span></code></p></td>
+<td><p>Yes</p></td>
+<td><p></p></td>
+</tr>
+<tr class="row-even"><td><p>DateFormat</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">date_format</span></code></p></td>
+<td><p>Yes</p></td>
+<td><p>Partial support. Only specific format patterns are supported.</p></td>
+</tr>
<tr class="row-odd"><td><p>DateSub</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">date_sub</span></code></p></td>
<td><p>Yes</p></td>
@@ -725,62 +739,77 @@ of expressions that be disabled.</p>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>Minute</p></td>
+<tr class="row-even"><td><p>LastDay</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">last_day</span></code></p></td>
+<td><p>Yes</p></td>
+<td><p></p></td>
+</tr>
+<tr class="row-odd"><td><p>Minute</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">minute</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>Second</p></td>
+<tr class="row-even"><td><p>Second</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">second</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>TruncDate</p></td>
+<tr class="row-odd"><td><p>TruncDate</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">trunc</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>TruncTimestamp</p></td>
-<td><p><code class="docutils literal notranslate"><span class="pre">trunc_date</span></code></p></td>
+<tr class="row-even"><td><p>TruncTimestamp</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">date_trunc</span></code></p></td>
+<td><p>Yes</p></td>
+<td><p></p></td>
+</tr>
+<tr class="row-odd"><td><p>UnixDate</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">unix_date</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>Year</p></td>
+<tr class="row-even"><td><p>UnixTimestamp</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">unix_timestamp</span></code></p></td>
+<td><p>Yes</p></td>
+<td><p></p></td>
+</tr>
+<tr class="row-odd"><td><p>Year</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">year</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>Month</p></td>
+<tr class="row-even"><td><p>Month</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">month</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>DayOfMonth</p></td>
+<tr class="row-odd"><td><p>DayOfMonth</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">day</span></code>/<code class="docutils literal notranslate"><span class="pre">dayofmonth</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>DayOfWeek</p></td>
+<tr class="row-even"><td><p>DayOfWeek</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">dayofweek</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>WeekDay</p></td>
+<tr class="row-odd"><td><p>WeekDay</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">weekday</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>DayOfYear</p></td>
+<tr class="row-even"><td><p>DayOfYear</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">dayofyear</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-even"><td><p>WeekOfYear</p></td>
+<tr class="row-odd"><td><p>WeekOfYear</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">weekofyear</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>Quarter</p></td>
+<tr class="row-even"><td><p>Quarter</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">quarter</span></code></p></td>
<td><p>Yes</p></td>
<td><p></p></td>
@@ -1019,10 +1048,13 @@ of expressions that be disabled.</p>
<tr class="row-odd"><td><p>Murmur3Hash</p></td>
<td><p>Yes</p></td>
</tr>
-<tr class="row-even"><td><p>Sha2</p></td>
+<tr class="row-even"><td><p>Sha1</p></td>
<td><p>Yes</p></td>
</tr>
-<tr class="row-odd"><td><p>XxHash64</p></td>
+<tr class="row-odd"><td><p>Sha2</p></td>
+<td><p>Yes</p></td>
+</tr>
+<tr class="row-even"><td><p>XxHash64</p></td>
<td><p>Yes</p></td>
</tr>
</tbody>
@@ -1345,20 +1377,29 @@ of expressions that be disabled.</p>
<thead>
<tr class="row-odd"><th class="head"><p>Expression</p></th>
<th class="head"><p>Spark-Compatible?</p></th>
+<th class="head"><p>Compatibility Notes</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>CreateNamedStruct</p></td>
<td><p>Yes</p></td>
+<td><p></p></td>
</tr>
<tr class="row-odd"><td><p>GetArrayStructFields</p></td>
<td><p>Yes</p></td>
+<td><p></p></td>
</tr>
<tr class="row-even"><td><p>GetStructField</p></td>
<td><p>Yes</p></td>
+<td><p></p></td>
</tr>
-<tr class="row-odd"><td><p>StructsToJson</p></td>
+<tr class="row-odd"><td><p>JsonToStructs</p></td>
+<td><p>No</p></td>
+<td><p>Partial support. Requires explicit schema.</p></td>
+</tr>
+<tr class="row-even"><td><p>StructsToJson</p></td>
<td><p>Yes</p></td>
+<td><p></p></td>
</tr>
</tbody>
</table>
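The JsonToStructs row above notes that an explicit schema is required. A minimal spark-shell sketch of passing such a schema to `from_json` (the DataFrame `df`, the column `json_col`, and the field names are illustrative assumptions, not from the docs):

```scala
scala> import org.apache.spark.sql.functions.from_json
scala> import org.apache.spark.sql.types._
scala> // Explicit schema for the JSON payload; Comet's partial JsonToStructs
scala> // support requires the schema to be known up front.
scala> val schema = StructType(Seq(
     |   StructField("id", IntegerType),
     |   StructField("name", StringType)))
scala> val parsed = df.withColumn("parsed", from_json($"json_col", schema))
```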
diff --git a/user-guide/latest/iceberg.html b/user-guide/latest/iceberg.html
index af6d18889..0320b76f3 100644
--- a/user-guide/latest/iceberg.html
+++ b/user-guide/latest/iceberg.html
@@ -607,17 +607,93 @@ configuration should <strong>not</strong> be used with
the hybrid Iceberg config
</div>
<p>The same sample queries from above can be used to test Comet’s fully-native
Iceberg integration;
however, the scan node to look for is <code class="docutils literal
notranslate"><span class="pre">CometIcebergNativeScan</span></code>.</p>
+<section id="supported-features">
+<h3>Supported features<a class="headerlink" href="#supported-features"
title="Link to this heading">#</a></h3>
+<p>The native Iceberg reader supports the following features:</p>
+<p><strong>Table specifications:</strong></p>
+<ul class="simple">
+<li><p>Iceberg table spec v1 and v2 (v3 will fall back to Spark)</p></li>
+</ul>
+<p><strong>Schema and data types:</strong></p>
+<ul class="simple">
+<li><p>All primitive types including UUID</p></li>
+<li><p>Complex types: arrays, maps, and structs</p></li>
+<li><p>Schema evolution (adding and dropping columns)</p></li>
+</ul>
+<p><strong>Time travel and branching:</strong></p>
+<ul class="simple">
+<li><p><code class="docutils literal notranslate"><span
class="pre">VERSION</span> <span class="pre">AS</span> <span
class="pre">OF</span></code> queries to read historical snapshots</p></li>
+<li><p>Branch reads for accessing named branches</p></li>
+</ul>
+<p><strong>Delete handling (Merge-On-Read tables):</strong></p>
+<ul class="simple">
+<li><p>Positional deletes</p></li>
+<li><p>Equality deletes</p></li>
+<li><p>Mixed delete types</p></li>
+</ul>
+<p><strong>Filter pushdown:</strong></p>
+<ul class="simple">
+<li><p>Equality and comparison predicates (<code class="docutils literal
notranslate"><span class="pre">=</span></code>, <code class="docutils literal
notranslate"><span class="pre">!=</span></code>, <code class="docutils literal
notranslate"><span class="pre">></span></code>, <code class="docutils
literal notranslate"><span class="pre">>=</span></code>, <code
class="docutils literal notranslate"><span class="pre"><</span></code>,
<code class="docutils literal notranslate"><span [...]
+<li><p>Logical operators (<code class="docutils literal notranslate"><span
class="pre">AND</span></code>, <code class="docutils literal notranslate"><span
class="pre">OR</span></code>)</p></li>
+<li><p>NULL checks (<code class="docutils literal notranslate"><span
class="pre">IS</span> <span class="pre">NULL</span></code>, <code
class="docutils literal notranslate"><span class="pre">IS</span> <span
class="pre">NOT</span> <span class="pre">NULL</span></code>)</p></li>
+<li><p><code class="docutils literal notranslate"><span
class="pre">IN</span></code> and <code class="docutils literal
notranslate"><span class="pre">NOT</span> <span class="pre">IN</span></code>
list operations</p></li>
+<li><p><code class="docutils literal notranslate"><span
class="pre">BETWEEN</span></code> operations</p></li>
+</ul>
+<p><strong>Partitioning:</strong></p>
+<ul class="simple">
+<li><p>Standard partitioning with partition pruning</p></li>
+<li><p>Date partitioning with <code class="docutils literal notranslate"><span
class="pre">days()</span></code> transform</p></li>
+<li><p>Bucket partitioning</p></li>
+<li><p>Truncate transform</p></li>
+<li><p>Hour transform</p></li>
+</ul>
+<p><strong>Storage:</strong></p>
+<ul class="simple">
+<li><p>Local filesystem</p></li>
+<li><p>Hadoop Distributed File System (HDFS)</p></li>
+<li><p>S3-compatible storage (AWS S3, MinIO)</p></li>
+</ul>
+</section>
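The time-travel and filter-pushdown features listed above can be exercised with queries like the following sketch (the catalog, table, snapshot ID, branch name, and column names are illustrative and assume an existing Iceberg catalog is configured):

```scala
scala> // Time travel to a historical snapshot by snapshot ID
scala> spark.sql("SELECT * FROM my_cat.db.events VERSION AS OF 8510954546733480000").show()
scala> // Read a named branch (also via VERSION AS OF in Iceberg's Spark integration)
scala> spark.sql("SELECT * FROM my_cat.db.events VERSION AS OF 'audit-branch'").show()
scala> // Predicates of the supported shapes should push down to the native scan
scala> spark.sql("""SELECT * FROM my_cat.db.events
     |   WHERE id >= 10 AND region IN ('us', 'eu') AND ts IS NOT NULL""").show()
```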
+<section id="rest-catalog">
+<h3>REST Catalog<a class="headerlink" href="#rest-catalog" title="Link to this
heading">#</a></h3>
+<p>Comet’s native Iceberg reader also supports REST catalogs. The following
example shows how to
+configure Spark to use a REST catalog with Comet’s native Iceberg scan:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span
class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--packages<span class="w">
</span>org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--repositories<span class="w">
</span>https://repo1.maven.org/maven2/<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.extensions<span
class="o">=</span>org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.rest_cat<span
class="o">=</span>org.apache.iceberg.spark.SparkCatalog<span class="w">
</span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.rest_cat.catalog-impl<span
class="o">=</span>org.apache.iceberg.rest.RESTCatalog<span class="w">
</span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.rest_cat.uri<span
class="o">=</span>http://localhost:8181<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.rest_cat.warehouse<span
class="o">=</span>/tmp/warehouse<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--conf<span class="w"> </span>spark.plugins<span
class="o">=</span>org.apache.spark.CometPlugin<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.shuffle.manager<span
class="o">=</span>org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.extensions<span
class="o">=</span>org.apache.comet.CometSparkSessionExtensions<span class="w">
</span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.comet.scan.icebergNative.enabled<span class="o">=</span><span
class="nb">true</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.comet.explainFallback.enabled<span class="o">=</span><span
class="nb">true</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.memory.offHeap.enabled<span class="o">=</span><span
class="nb">true</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.memory.offHeap.size<span class="o">=</span>2g
+</pre></div>
+</div>
+<p>Note that REST catalogs require explicit namespace creation before creating
tables:</p>
+<div class="highlight-scala notranslate"><div
class="highlight"><pre><span></span><span class="n">scala</span><span
class="o">></span><span class="w"> </span><span class="n">spark</span><span
class="p">.</span><span class="n">sql</span><span class="p">(</span><span
class="s">"CREATE NAMESPACE rest_cat.db"</span><span
class="p">)</span>
+<span class="n">scala</span><span class="o">></span><span class="w">
</span><span class="n">spark</span><span class="p">.</span><span
class="n">sql</span><span class="p">(</span><span class="s">"CREATE TABLE
rest_cat.db.test_table (id INT, name STRING) USING iceberg"</span><span
class="p">)</span>
+<span class="n">scala</span><span class="o">></span><span class="w">
</span><span class="n">spark</span><span class="p">.</span><span
class="n">sql</span><span class="p">(</span><span class="s">"INSERT INTO
rest_cat.db.test_table VALUES (1, 'Alice'), (2,
'Bob')"</span><span class="p">)</span>
+<span class="n">scala</span><span class="o">></span><span class="w">
</span><span class="n">spark</span><span class="p">.</span><span
class="n">sql</span><span class="p">(</span><span class="s">"SELECT * FROM
rest_cat.db.test_table"</span><span class="p">).</span><span
class="n">show</span><span class="p">()</span>
+</pre></div>
+</div>
+</section>
<section id="current-limitations">
<h3>Current limitations<a class="headerlink" href="#current-limitations"
title="Link to this heading">#</a></h3>
-<p>The following scenarios are not yet supported, but are work in progress:</p>
+<p>The following scenarios will fall back to Spark’s native Iceberg reader:</p>
<ul class="simple">
-<li><p>Iceberg table spec v3 scans will fall back.</p></li>
-<li><p>Iceberg writes will fall back.</p></li>
-<li><p>Iceberg table scans backed by Avro or ORC data files will fall
back.</p></li>
-<li><p>Iceberg table scans partitioned on <code class="docutils literal
notranslate"><span class="pre">BINARY</span></code> or <code class="docutils
literal notranslate"><span class="pre">DECIMAL</span></code> (with precision
>28) columns will fall back.</p></li>
-<li><p>Iceberg scans with residual filters (<em>i.e.</em>, filter expressions
that are not partition values,
-and are evaluated on the column values at scan time) of <code class="docutils
literal notranslate"><span class="pre">truncate</span></code>, <code
class="docutils literal notranslate"><span class="pre">bucket</span></code>,
<code class="docutils literal notranslate"><span
class="pre">year</span></code>, <code class="docutils literal
notranslate"><span class="pre">month</span></code>,
-<code class="docutils literal notranslate"><span
class="pre">day</span></code>, <code class="docutils literal notranslate"><span
class="pre">hour</span></code> will fall back.</p></li>
+<li><p>Iceberg table spec v3 scans</p></li>
+<li><p>Iceberg writes (reads are accelerated, writes use Spark)</p></li>
+<li><p>Tables backed by Avro or ORC data files (only Parquet is
accelerated)</p></li>
+<li><p>Tables partitioned on <code class="docutils literal notranslate"><span
class="pre">BINARY</span></code> or <code class="docutils literal
notranslate"><span class="pre">DECIMAL</span></code> (with precision >28)
columns</p></li>
+<li><p>Scans with residual filters using <code class="docutils literal
notranslate"><span class="pre">truncate</span></code>, <code class="docutils
literal notranslate"><span class="pre">bucket</span></code>, <code
class="docutils literal notranslate"><span class="pre">year</span></code>,
<code class="docutils literal notranslate"><span
class="pre">month</span></code>, <code class="docutils literal
notranslate"><span class="pre">day</span></code>, or <code class="docutils
literal notrans [...]
+transform functions (partition pruning still works, but row-level filtering of
these
+transforms falls back)</p></li>
</ul>
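When a query hits one of the scenarios above, the fallback is silent unless `spark.comet.explainFallback.enabled=true` is set (as in the REST catalog example). One way to check which scan node a query actually uses is to inspect the executed plan; this is a sketch with an illustrative table name:

```scala
scala> // Look for CometIcebergNativeScan (accelerated) vs. a plain
scala> // BatchScan (fell back to Spark's Iceberg reader) in the output.
scala> println(spark.sql("SELECT * FROM my_cat.db.events").queryExecution.executedPlan)
```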
</section>
</section>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]