Re: Question about schema evolution in iceberg table

sudsport s Thu, 14 Feb 2019 15:02:05 -0800

Adding dev@iceberg.apache.org


On Thu, Feb 14, 2019 at 3:00 PM sudsport s <sudssf2...@gmail.com> wrote:

> Hi, I am doing some testing with schema evolution. I looked at the
> testSchemaUpdate method and the SchemaUpdate class for reference.
>
>
> Here are the steps I am following to test schema evolution validation:
>
> Initially, data is created with the following schema, using "key" as the
> partition key:
>
> root
>  |-- id: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- key: integer (nullable = false)
>  |-- value1: string (nullable = true)
>  |-- value2: string (nullable = true)
>
> Schema update to rename value1 -> v1:
>
> root
>  |-- id: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- key: integer (nullable = false)
>  |-- v1: string (nullable = true)
>  |-- value2: string (nullable = true)
>
> Schema update to rename key -> newKey (I know changing the partition key
> is not a good idea, but this is a test :) ):
>
> root
>  |-- id: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- newKey: integer (nullable = false)
>  |-- v1: string (nullable = true)
>  |-- value2: string (nullable = true)
>
>
> When I read the data frame using Spark, I get the following schema:
>
> root
>  |-- id: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- newKey: integer (nullable = false)
>  |-- v1: string (nullable = true)
>  |-- value2: string (nullable = true)
>
> But when I try to run a query or scan using a changed column in the where
> clause, I get the following exception:
>
>
> INFO TableScan: Scanning table /tmp/schema-evolution snapshot
> 1550184572006 created at 2019-02-14 14:49:32.189 with filter
> not_null(ref(name="v1"))
> Exception in thread "main"
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
> tree:
> Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)],
> output=[count#77L])
>    +- *(1) Project
>       +- *(1) Filter (isnotnull(v1#60) && (cast(v1#60 as int) = 0))
>          +- *(1) DataSourceV2Scan [v1#60],
> IcebergScan(table=/tmp/schema-evolution, type=struct<4: v1: optional
> string>, filters=[not_null(ref(name="v1"))])
>
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>
> Caused by: com.netflix.iceberg.exceptions.ValidationException: Cannot find
> field 'v1' in struct: struct<1: id: optional string, 2: value: optional
> string, 3: key: required int, 4: value1: optional string, 5: value2:
> optional string>
> at com.netflix.iceberg.exceptions.ValidationException.check(ValidationException.java:39)
> at com.netflix.iceberg.expressions.UnboundPredicate.bind(UnboundPredicate.java:46)
>
>
> I ran the same query with various where clauses: "v1 = 0", "value1 = 0",
> "key = 0" and "newKey = 0".
>
> What is the best way to query data in an Iceberg table when the schema has
> changed?
>
>
> The following is the diff output from the metadata JSON:
>
>
> <       "name" : "key",
> ---
> >       "name" : "newKey",
> 25c25
> <       "name" : "value1",
> ---
> >       "name" : "v1",
>
>
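
A note on the exception quoted above: the struct in the ValidationException
still lists the original names (key, value1) under field IDs 3 and 4, while
the scan filter refers to v1 by name. That suggests the filter was bound by
name against the older schema. The Python sketch below only illustrates the
difference between name-based and ID-based lookup, using the field IDs printed
in the exception. Field, bind_by_name and bind_by_id are hypothetical names
made up for this illustration; this is not Iceberg's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema column: a stable ID plus a name."""
    field_id: int
    name: str


# Schema printed in the ValidationException (original column names).
FILE_SCHEMA = [
    Field(1, "id"), Field(2, "value"), Field(3, "key"),
    Field(4, "value1"), Field(5, "value2"),
]

# Schema after both renames; assumption: a rename keeps the field ID.
TABLE_SCHEMA = [
    Field(1, "id"), Field(2, "value"), Field(3, "newKey"),
    Field(4, "v1"), Field(5, "value2"),
]


def bind_by_name(column, schema):
    """Look a column up by name only; fails if the schema uses an older name."""
    for field in schema:
        if field.name == column:
            return field
    raise LookupError(f"Cannot find field '{column}' in struct")


def bind_by_id(column, table_schema, file_schema):
    """Resolve the name in the current schema, then match the older schema by ID."""
    field_id = bind_by_name(column, table_schema).field_id
    for field in file_schema:
        if field.field_id == field_id:
            return field
    raise LookupError(f"No field with id {field_id} in struct")
```

In this sketch, bind_by_name("v1", FILE_SCHEMA) raises LookupError, which
mirrors the message in the stack trace, while
bind_by_id("v1", TABLE_SCHEMA, FILE_SCHEMA) finds the field that the older
schema calls value1.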
