Feedback/guidance request:

Byte size info in Avro is encapsulated in the encoder
(org.apache.avro.io.BufferedBinaryEncoder) and is not exposed by the Avro API.

Should we carry on with the task, ignoring that metric and gathering as much
info as we can inside Iceberg?
Or is it feasible to get Avro modified to expose that info?
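
To illustrate the first option, a rough sketch of measuring a value's encoded
size by re-encoding it with a throwaway encoder (AvroSizeEstimator/encodedSize
are made-up names, not existing code; this ignores block compression and
doubles the encoding work, so it's only an estimate):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    /** Hypothetical helper: approximates the encoded size of one field value. */
    class AvroSizeEstimator {
      static long encodedSize(Schema fieldSchema, Object value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // binary encoders buffer internally, so flush before reading the size
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(buffer, null);
        new GenericDatumWriter<>(fieldSchema).write(value, encoder);
        encoder.flush();
        return buffer.size();
      }
    }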

Thanks,
L.

On Thu, 12 Mar 2020 at 18:19, Luis Otero <lote...@gmail.com> wrote:

> Hi Ryan,
>
> I'll give it a try.
>
> Regards,
> L.
>
> On Thu, 12 Mar 2020 at 18:16, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Hi Luis,
>>
>> You're right about what's happening. Because the Avro appender doesn't
>> track column-level stats, Iceberg can't determine that the file only
>> contains matching data rows and can be deleted. Parquet does keep those
>> stats, so even though the partitioning doesn't guarantee the delete is
>> safe, Iceberg can determine that it is.
>>
>> The solution is to add column-level stats for Avro files. Is that
>> something you're interested in working on?
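>>
>> To make the idea concrete, a minimal sketch of per-column stats tracking in
>> an appender might look like the following (ColumnStatsTracker and its fields
>> are made up for illustration, not Iceberg's actual metrics API):
>>
>>     import java.util.HashMap;
>>     import java.util.Map;
>>
>>     /** Hypothetical sketch: per-column counts and bounds, keyed by field id. */
>>     class ColumnStatsTracker {
>>       private final Map<Integer, Long> valueCounts = new HashMap<>();
>>       private final Map<Integer, Long> nullCounts = new HashMap<>();
>>       private final Map<Integer, Comparable<Object>> lowerBounds = new HashMap<>();
>>       private final Map<Integer, Comparable<Object>> upperBounds = new HashMap<>();
>>
>>       @SuppressWarnings("unchecked")
>>       void update(int fieldId, Object value) {
>>         valueCounts.merge(fieldId, 1L, Long::sum);
>>         if (value == null) {
>>           nullCounts.merge(fieldId, 1L, Long::sum);
>>           return;
>>         }
>>         Comparable<Object> val = (Comparable<Object>) value;
>>         lowerBounds.merge(fieldId, val, (old, v) -> v.compareTo(old) < 0 ? v : old);
>>         upperBounds.merge(fieldId, val, (old, v) -> v.compareTo(old) > 0 ? v : old);
>>       }
>>     }
>>
>> With lower/upper bounds like these attached to the DataFile, the strict
>> evaluator can see that every row in the file matches the filter and allow
>> the delete.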
>>
>> rb
>>
>> On Thu, Mar 12, 2020 at 10:09 AM Luis Otero <lote...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> AvroFileAppender doesn't report min/max values (
>>> https://github.com/apache/incubator-iceberg/blob/80cbc60ee55911ee627a7ad3013804394d7b5e9a/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L60
>>> ).
>>>
>>> As a side effect (I think), overwrite operations fail with "Cannot delete
>>> file where some, but not all, rows match filter" when there are existing
>>> data files in the same partition, because StrictMetricsEvaluator can't
>>> confirm that all rows in those files match the filter.
>>>
>>> For instance, if you modify TestLocalScan with:
>>>
>>>     this.partitionSpec =
>>>         PartitionSpec.builderFor(SCHEMA).bucket("id", 10).build();
>>>
>>>     this.file1Records = new ArrayList<Record>();
>>>     file1Records.add(record.copy(
>>>         ImmutableMap.of("id", 60L, "data", UUID.randomUUID().toString())));
>>>     DataFile file1 = writeFile(
>>>         sharedTable.location(), format.addExtension("file-1"), file1Records);
>>>
>>>     this.file2Records = new ArrayList<Record>();
>>>     file2Records.add(record.copy(
>>>         ImmutableMap.of("id", 1L, "data", UUID.randomUUID().toString())));
>>>     DataFile file2 = writeFile(
>>>         sharedTable.location(), format.addExtension("file-2"), file2Records);
>>>
>>>     this.file3Records = new ArrayList<Record>();
>>>     file3Records.add(record.copy(
>>>         ImmutableMap.of("id", 1L, "data", UUID.randomUUID().toString())));
>>>     DataFile file3 = writeFile(
>>>         sharedTable.location(), format.addExtension("file-3"), file3Records);
>>>
>>>     sharedTable.newAppend()
>>>         .appendFile(file1)
>>>         .commit();
>>>
>>>     sharedTable.newAppend()
>>>         .appendFile(file2)
>>>         .commit();
>>>
>>>     sharedTable.newOverwrite()
>>>         .overwriteByRowFilter(equal("id", 1L))
>>>         .addFile(file3)
>>>         .commit();
>>>
>>>
>>> The overwrite fails with 'org.apache.iceberg.exceptions.ValidationException:
>>> Cannot delete file where some, but not all, rows match filter
>>> ref(name="id") == 1: file:/AVRO/file-2.avro' for the AVRO format, but works
>>> fine for the PARQUET format.
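>>>
>>> For what it's worth, the missing stats should also be visible directly on
>>> the DataFile (a quick check, not part of the test above):
>>>
>>>     // with PARQUET these maps contain an entry for the "id" field;
>>>     // with AVRO they are null/empty, so StrictMetricsEvaluator has
>>>     // nothing to prove that every row in file-2 has id == 1
>>>     System.out.println(file2.lowerBounds());
>>>     System.out.println(file2.upperBounds());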
>>>
>>> Am I missing something here?
>>>
>>> Thanks!!
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
