Re: Question about schema evolution in iceberg table

Ryan Blue Wed, 20 Feb 2019 11:14:04 -0800

Sudsport,

Good catch here, and thank you for the gist that reproduces the issue.


The problem happens when pushing predicates down to manifest files.
Each manifest keeps track of the schema and partition spec that were used to
write it. The reader currently uses that stored schema when converting and
binding predicates to evaluate on the partition data in the manifest, so a
column renamed after the manifest was written (like v1) can't be found.
This is a bug: we haven't passed the current table schema down to the
manifest reader.
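
To illustrate the mismatch, here is a minimal, self-contained Python sketch.
It is not Iceberg code: Field, bind_by_name, and bind_with_current_schema are
hypothetical names, and the model ignores partition specs entirely. The field
IDs and names are taken from the error message quoted below. It only shows
why resolving a name against the manifest's older schema fails, while
resolving it against the current schema and then matching by field ID
succeeds; the actual fix may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema column: a stable ID plus a name."""
    field_id: int
    name: str


# Schema stored with the manifest, written before the renames.
MANIFEST_SCHEMA = [
    Field(1, "id"), Field(2, "value"), Field(3, "key"),
    Field(4, "value1"), Field(5, "value2"),
]

# Current table schema, after value1 -> v1 and key -> newKey.
# Renames change the name but keep the field ID.
CURRENT_SCHEMA = [
    Field(1, "id"), Field(2, "value"), Field(3, "newKey"),
    Field(4, "v1"), Field(5, "value2"),
]


def bind_by_name(column, schema):
    """Resolve a column name to a field ID within one schema."""
    for field in schema:
        if field.name == column:
            return field.field_id
    raise ValueError(f"Cannot find field '{column}' in struct")


def bind_with_current_schema(column, current_schema, manifest_schema):
    """Resolve the name in the current schema, then find that ID in the
    manifest's schema. Returns None if the column did not exist yet."""
    field_id = bind_by_name(column, current_schema)
    for field in manifest_schema:
        if field.field_id == field_id:
            return field
    return None


if __name__ == "__main__":
    # Binding against the stale manifest schema fails, as in the report.
    try:
        bind_by_name("v1", MANIFEST_SCHEMA)
    except ValueError as err:
        print("stale schema:", err)

    # Binding against the current schema finds the same column by ID.
    print("current schema:",
          bind_with_current_schema("v1", CURRENT_SCHEMA, MANIFEST_SCHEMA))
```

In this sketch the name is only ever looked up in the current schema; the
older schema is consulted by ID, which is what stays stable across renames.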

I'll open an issue for it and fix this. Thanks!

rb

On Fri, Feb 15, 2019 at 11:34 AM suds <[email protected]> wrote:

> Thanks for the reply, Ryan.
>
> I created a gist with a code example:
>
> https://gist.github.com/sudssf/e5f2de7463487f98c0a269221bbe0f1a
>
> Please let me know if I am not using the API correctly.
>
>
> On Thu, Feb 14, 2019 at 5:38 PM Ryan Blue <[email protected]> wrote:
>
>> Sudsport,
>>
>> I'm wondering if you had the table cached somewhere? Those renames should
>> work. My guess is that the query used a table version that was out of date.
>>
>> Can you put together a minimal script that reproduces the error and open
>> an issue? That way I can fix it.
>>
>> rb
>>
>> On Thu, Feb 14, 2019 at 3:01 PM sudsport s <[email protected]> wrote:
>>
>>> Adding [email protected]
>>>
>>>
>>> On Thu, Feb 14, 2019 at 3:00 PM sudsport s <[email protected]> wrote:
>>>
>>>> Hi, I am doing some testing with schema evolution. I looked at the
>>>> testSchemaUpdate method and the SchemaUpdate class for reference.
>>>>
>>>>
>>>> Here are the steps I am following to test schema evolution:
>>>>
>>>> Initially, data is created with the following schema, using "key" as
>>>> the partition key:
>>>>
>>>> root
>>>>  |-- id: string (nullable = true)
>>>>  |-- value: string (nullable = true)
>>>>  |-- key: integer (nullable = false)
>>>>  |-- value1: string (nullable = true)
>>>>  |-- value2: string (nullable = true)
>>>>
>>>> Then a schema update renames value1 -> v1:
>>>>
>>>> root
>>>>  |-- id: string (nullable = true)
>>>>  |-- value: string (nullable = true)
>>>>  |-- key: integer (nullable = false)
>>>>  |-- v1: string (nullable = true)
>>>>  |-- value2: string (nullable = true)
>>>>
>>>> Then a schema update renames key -> newKey (I know changing the
>>>> partition key is not a good idea, but this is a test :) ):
>>>>
>>>> root
>>>>  |-- id: string (nullable = true)
>>>>  |-- value: string (nullable = true)
>>>>  |-- newKey: integer (nullable = false)
>>>>  |-- v1: string (nullable = true)
>>>>  |-- value2: string (nullable = true)
>>>>
>>>>
>>>> When I read the data frame using Spark, I get the following schema:
>>>>
>>>> root
>>>>  |-- id: string (nullable = true)
>>>>  |-- value: string (nullable = true)
>>>>  |-- newKey: integer (nullable = false)
>>>>  |-- v1: string (nullable = true)
>>>>  |-- value2: string (nullable = true)
>>>>
>>>> But when I try to run a query or scan that uses a changed column in
>>>> the where clause, I get the following exception:
>>>>
>>>>
>>>> INFO TableScan: Scanning table /tmp/schema-evolution snapshot
>>>> 1550184572006 created at 2019-02-14 14:49:32.189 with filter
>>>> not_null(ref(name="v1"))
>>>> Exception in thread "main"
>>>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
>>>> tree:
>>>> Exchange SinglePartition
>>>> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)],
>>>> output=[count#77L])
>>>>    +- *(1) Project
>>>>       +- *(1) Filter (isnotnull(v1#60) && (cast(v1#60 as int) = 0))
>>>>          +- *(1) DataSourceV2Scan [v1#60],
>>>> IcebergScan(table=/tmp/schema-evolution, type=struct<4: v1: optional
>>>> string>, filters=[not_null(ref(name="v1"))])
>>>>
>>>> at
>>>> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>>>> at
>>>> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
>>>> at
>>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>>>>
>>>> Caused by: com.netflix.iceberg.exceptions.ValidationException: Cannot
>>>> find field 'v1' in struct: struct<1: id: optional string, 2: value:
>>>> optional string, 3: key: required int, 4: value1: optional string, 5:
>>>> value2: optional string>
>>>> at
>>>> com.netflix.iceberg.exceptions.ValidationException.check(ValidationException.java:39)
>>>> at
>>>> com.netflix.iceberg.expressions.UnboundPredicate.bind(UnboundPredicate.java:46)
>>>>
>>>>
>>>> I ran the same query using various where-clause combinations:
>>>> "v1 = 0", "value1 = 0", "key = 0", and "newKey = 0".
>>>>
>>>> What is the best way to query data in an Iceberg table when the schema
>>>> has changed?
>>>>
>>>>
>>>> The following is the diff output from the metadata JSON:
>>>>
>>>>
>>>> <       "name" : "key",
>>>> ---
>>>> >       "name" : "newKey",
>>>> 25c25
>>>> <       "name" : "value1",
>>>> ---
>>>> >       "name" : "v1",
>>>>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Iceberg Developers" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/iceberg-devel/3efe985e-2302-412b-a899-8efe1fbf13c8%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/iceberg-devel/3efe985e-2302-412b-a899-8efe1fbf13c8%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


-- 
Ryan Blue
Software Engineer
Netflix
