Re: Iceberg - PySpark overwrite with a condition

Fokko Driesprong Fri, 28 Jun 2024 11:59:48 -0700

Hey Ha,

What version of Spark are you using? Can you share the whole stack trace? I
tried to reproduce it locally and it worked fine:


pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
    --conf spark.sql.defaultCatalog=local
Python 3.9.6 (default, Feb  3 2024, 15:58:27)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.9.6 (default, Feb  3 2024 15:58:27)
Spark context Web UI available at http://192.168.1.10:4040
Spark context available as 'sc' (master = local[*], app id = local-1719599873923).
SparkSession available as 'spark'.

>>> table_name = "local.test.person_with_age"
>>>
>>> spark.sql(f"""
... CREATE TABLE {table_name} (
...     name string,
...     age int
... )
... USING iceberg
... PARTITIONED BY (age);
... """).show()
++
||
++
++

>>> spark.table(table_name).show()
+----+---+
|name|age|
+----+---+
+----+---+

>>> persons = [('Fokko', 1), ('Gurbe', 2), ('Pieter', 2)]
>>> df = spark.createDataFrame(persons, ['name', 'age'])
>>> df.writeTo(table_name).append()
>>> spark.table(table_name).show()
+------+---+
|  name|age|
+------+---+
| Fokko|  1|
| Gurbe|  2|
|Pieter|  2|
+------+---+

>>> new_person = [('Rho', 2)]
>>> df_overwrite = spark.createDataFrame(new_person, ['name', 'age'])
>>> from pyspark.sql.functions import col
>>> df_overwrite.writeTo(table_name).overwrite(col("age") >= 2)
>>> spark.table(table_name).show()
+-----+---+
| name|age|
+-----+---+
|  Rho|  2|
|Fokko|  1|
+-----+---+

Passing a Column expression built with col(), as in
overwrite(col("age") >= 2), is the way to go. I hope this helps; let me know
if it doesn't work for you.
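
A note on what the result above shows, since it may differ from the stated
goal: per the Spark documentation, overwrite(condition) replaces the rows
matching the filter with the contents of the DataFrame, so Gurbe and Pieter
are both removed and only Rho is written back. Below is a minimal plain-Python
sketch of that behaviour with no Spark involved; overwrite_where is a
hypothetical helper made up for illustration, not a Spark or Iceberg API.

```python
# Plain-Python model of overwrite-by-filter semantics (no Spark required).
# Assumption: overwrite(condition) acts like "drop every existing row that
# matches the condition, then append all rows of the incoming DataFrame".

def overwrite_where(existing, incoming, predicate):
    """Hypothetical helper: table contents after replacing matching rows."""
    kept = [row for row in existing if not predicate(row)]
    return kept + list(incoming)

table = [("Fokko", 1), ("Gurbe", 2), ("Pieter", 2)]
new_rows = [("Rho", 2)]

# Mirrors df_overwrite.writeTo(table_name).overwrite(col("age") >= 2)
result = overwrite_where(table, new_rows, lambda row: row[1] >= 2)
print(sorted(result))
```

If the intent is to change only some of the rows that currently match, the
condition would need to be narrowed so that it selects exactly those rows.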

Kind regards,
Fokko

On Fri, 28 Jun 2024 at 18:09, Ha Cao <[email protected]> wrote:

> Hi Ajantha,
>
>
>
> Thanks for replying! The example, however, is in Java, and I suspect that
> syntax only works for Java and Scala. I tried the equivalent in PySpark but
> still got `Column is not iterable` with:
>
> df.writeTo(spark_table_path).using("iceberg").overwrite(col("time") > target_timestamp)
>
>
>
> For this, I get `Column object is not callable`:
>
>
> df.writeTo(spark_table_path).using("iceberg").overwrite(col("time").less(target_timestamp))
>
>
>
> The only example I can find in the PySpark codebase is
> https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/test_readwriter.py#L251
> but even with this, it throws `Column is not iterable`. I cannot find any
> other test case that tests `overwrite()` as a method.
>
>
>
> Thank you!
>
> Best,
>
> Ha
>
>
>
> *From:* Ajantha Bhat <[email protected]>
> *Sent:* Friday, June 28, 2024 3:52 AM
> *To:* [email protected]
> *Subject:* Re: Iceberg - PySpark overwrite with a condition
>
>
>
> Hi,
>
> Please refer to this doc:
> https://iceberg.apache.org/docs/nightly/spark-writes/#overwriting-data
>
> We do have some test cases for the same:
> https://github.com/apache/iceberg/blob/91fbcaa62c25308aa815557dd2c0041f75530705/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/PartitionedWritesTestBase.java#L153
>
> - Ajantha
>
>
>
> On Fri, Jun 28, 2024 at 1:00 AM Ha Cao <[email protected]> wrote:
>
> Hello,
>
>
>
> I am experimenting with PySpark’s DataFrameWriterV2 overwrite()
> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.overwrite.html>
> to an Iceberg table with existing data in a target partition. My goal is
> that instead of overwriting the entire partition, it will only overwrite
> specific rows that match the condition. However, I can’t get it to work
> with any syntax and I keep getting “Column is not iterable”. I have tried:
>
>
>
> df.writeTo(spark_table_path).using("iceberg").overwrite(df.tid)
>
> df.writeTo(spark_table_path).using("iceberg").overwrite(df.tid.isin(1))
>
> df.writeTo(spark_table_path).using("iceberg").overwrite(df.tid >= 1)
>
>
>
> and all of these syntaxes fail with “Column is not iterable”.
>
>
>
> What is the correct syntax for this? I also think that there is a
> possibility that Iceberg-PySpark integration doesn’t support overwrite, but
> I don’t know how to confirm this.
>
>
>
> Thank you so much!
>
> Best,
> Ha
>
>
