Hi Ryan,

I'll give it a try.
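
My rough plan, as a sketch only (not tested): accumulate per-field
min/max while records are appended, and report them as lower/upper
bounds from metrics(). I'm assuming the six-argument Metrics
constructor and Conversions.toByteBuffer are the right way to hand
bounds back; BoundsTracker is a made-up helper name:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.iceberg.Metrics;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    // Accumulates per-field min/max while rows are appended, then
    // surfaces them as lower/upper bounds so StrictMetricsEvaluator
    // has something to evaluate.
    class BoundsTracker {
      private final Schema schema;
      private final Map<Integer, Object> mins = new HashMap<>();
      private final Map<Integer, Object> maxs = new HashMap<>();
      private long rowCount = 0L;

      BoundsTracker(Schema schema) {
        this.schema = schema;
      }

      @SuppressWarnings("unchecked")
      void update(int fieldId, Object value) {
        if (value == null) {
          return; // null value counts would be tracked separately
        }
        mins.merge(fieldId, value, (a, b) ->
            ((Comparable<Object>) a).compareTo(b) <= 0 ? a : b);
        maxs.merge(fieldId, value, (a, b) ->
            ((Comparable<Object>) a).compareTo(b) >= 0 ? a : b);
      }

      void rowWritten() {
        rowCount += 1;
      }

      Metrics toMetrics() {
        Map<Integer, ByteBuffer> lowerBounds = new HashMap<>();
        Map<Integer, ByteBuffer> upperBounds = new HashMap<>();
        for (Map.Entry<Integer, Object> entry : mins.entrySet()) {
          Types.NestedField field = schema.findField(entry.getKey());
          lowerBounds.put(entry.getKey(),
              Conversions.toByteBuffer(field.type(), entry.getValue()));
          upperBounds.put(entry.getKey(),
              Conversions.toByteBuffer(field.type(), maxs.get(entry.getKey())));
        }
        return new Metrics(rowCount, null, null, null,
            lowerBounds, upperBounds);
      }
    }

The appender's add() would call update(...) for each top-level field
plus rowWritten(), and metrics() would delegate to toMetrics().
Capping/truncating large values (e.g. long strings) can come later.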

Regards,
L.

On Thu, 12 Mar 2020 at 18:16, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Luis,
>
> You're right about what's happening. Because the Avro appender doesn't
> track column-level stats, Iceberg can't determine that the file only
> contains matching data rows and can be deleted. Parquet does keep those
> stats, so even though the partitioning doesn't guarantee the delete is
> safe, Iceberg can determine that it is.
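>
> Concretely, the failing check is strict evaluation of the delete filter
> against each file's column stats. A minimal sketch of that check, using
> the SCHEMA and file2 from your test below (assuming the
> StrictMetricsEvaluator(Schema, Expression) constructor and its
> eval(DataFile) method):
>
>     import org.apache.iceberg.expressions.StrictMetricsEvaluator;
>     import static org.apache.iceberg.expressions.Expressions.equal;
>
>     // True only if the file's stats *prove* every row matches the
>     // filter. With Parquet, file2 carries lower == upper == 1 for
>     // "id", so this returns true and the delete is safe; with Avro
>     // there are no bounds, so it must return false and the overwrite
>     // commit is rejected.
>     boolean allRowsMatch =
>         new StrictMetricsEvaluator(SCHEMA, equal("id", 1L)).eval(file2);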
>
> The solution is to add column-level stats for Avro files. Is that
> something you're interested in working on?
>
> rb
>
> On Thu, Mar 12, 2020 at 10:09 AM Luis Otero <lote...@gmail.com> wrote:
>
>> Hi,
>>
>> AvroFileAppender doesn't report min/max values (
>> https://github.com/apache/incubator-iceberg/blob/80cbc60ee55911ee627a7ad3013804394d7b5e9a/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L60
>> ).
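>> (For context: metrics() there only reports a row count, something like
>> new Metrics(numRecords, null, null, null), with no value counts or
>> lower/upper bounds.)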
>>
>> As a side effect (I think), overwrite operations fail with "Cannot
>> delete file where some, but not all, rows match filter" when there are
>> data files in the same partition, because StrictMetricsEvaluator can't
>> confirm that all rows in the file match the filter.
>>
>> For instance, if you modify TestLocalScan with:
>>
>>     this.partitionSpec = PartitionSpec.builderFor(SCHEMA)
>>         .bucket("id", 10)
>>         .build();
>>
>>     this.file1Records = new ArrayList<Record>();
>>     file1Records.add(record.copy(ImmutableMap.of(
>>         "id", 60L, "data", UUID.randomUUID().toString())));
>>     DataFile file1 = writeFile(sharedTable.location(),
>>         format.addExtension("file-1"), file1Records);
>>
>>     this.file2Records = new ArrayList<Record>();
>>     file2Records.add(record.copy(ImmutableMap.of(
>>         "id", 1L, "data", UUID.randomUUID().toString())));
>>     DataFile file2 = writeFile(sharedTable.location(),
>>         format.addExtension("file-2"), file2Records);
>>
>>     this.file3Records = new ArrayList<Record>();
>>     file3Records.add(record.copy(ImmutableMap.of(
>>         "id", 1L, "data", UUID.randomUUID().toString())));
>>     DataFile file3 = writeFile(sharedTable.location(),
>>         format.addExtension("file-3"), file3Records);
>>
>>     sharedTable.newAppend()
>>         .appendFile(file1)
>>         .commit();
>>
>>     sharedTable.newAppend()
>>         .appendFile(file2)
>>         .commit();
>>
>>     sharedTable.newOverwrite()
>>         .overwriteByRowFilter(equal("id", 1L))
>>         .addFile(file3)
>>         .commit();
>>
>>
>> This fails with 'org.apache.iceberg.exceptions.ValidationException:
>> Cannot delete file where some, but not all, rows match filter
>> ref(name="id") == 1: file:/AVRO/file-2.avro' for the AVRO format, but
>> works fine for PARQUET.
>>
>> Am I missing something here?
>>
>> Thanks!!
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
