Re: AvroFileAppender metrics

2020-03-13 Thread Ryan Blue
Yeah, I would probably ignore the column size metric. That's really more for columnar formats, where we could use it to estimate how much data from a row group is being projected. For Avro, we'd have to read the same amount either way. For this, I'd probably create an appender that wraps another a
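The wrapping-appender idea in the message above could look roughly like the sketch below. It is written in Python for brevity rather than in Iceberg's Java, and every name in it (`ListAppender`, `ColumnStats`, `MetricsTrackingAppender`) is hypothetical, not part of Iceberg or Avro. It only shows the shape of the approach: the wrapper counts records and tracks per-column value counts, null counts, and lower/upper bounds, then delegates the actual write to the wrapped appender. Column sizes are left out, as suggested.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


class ListAppender:
    """Hypothetical stand-in for a real file appender; it just collects rows."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    def add(self, row: Dict[str, Any]) -> None:
        self.rows.append(row)


@dataclass
class ColumnStats:
    """Per-column metrics gathered while rows pass through the wrapper."""

    value_count: int = 0
    null_count: int = 0
    lower: Optional[Any] = None
    upper: Optional[Any] = None

    def update(self, value: Any) -> None:
        self.value_count += 1
        if value is None:
            self.null_count += 1
            return
        if self.lower is None or value < self.lower:
            self.lower = value
        if self.upper is None or value > self.upper:
            self.upper = value


class MetricsTrackingAppender:
    """Wraps another appender and records metrics before delegating the write."""

    def __init__(self, wrapped: Any) -> None:
        self._wrapped = wrapped
        self.record_count = 0
        self.columns: Dict[str, ColumnStats] = {}

    def add(self, row: Dict[str, Any]) -> None:
        self.record_count += 1
        for name, value in row.items():
            self.columns.setdefault(name, ColumnStats()).update(value)
        self._wrapped.add(row)


# Usage: wrap the stand-in appender and write two rows through it.
inner = ListAppender()
appender = MetricsTrackingAppender(inner)
for row in [{"id": 3, "day": "2020-03-12"}, {"id": 1, "day": None}]:
    appender.add(row)
```

Keeping the metrics in the wrapper means the underlying writer does not need to expose anything about its encoder, which is why the byte-size metric is the one piece this approach cannot supply.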

Re: AvroFileAppender metrics

2020-03-13 Thread Luis Otero
Feedback/guidance request: Byte size info in Avro is encapsulated in the encoder (org.apache.avro.io.BufferedBinaryEncoder) and is not exposed by the Avro API. Should we carry on with the task and ignore that metric (gathering as much info as we can inside Iceberg)? Is it feasible to get Avro modified (to

Re: AvroFileAppender metrics

2020-03-12 Thread Luis Otero
Hi Ryan,
I'll give it a try.
Regards,
L.
On Thu, 12 Mar 2020 at 18:16, Ryan Blue wrote:
> Hi Luis,
>
> You're right about what's happening. Because the Avro appender doesn't
> track column-level stats, Iceberg can't determine that the file only
> contains matching data rows and can be deleted.

Re: AvroFileAppender metrics

2020-03-12 Thread Ryan Blue
Hi Luis, You're right about what's happening. Because the Avro appender doesn't track column-level stats, Iceberg can't determine that the file only contains matching data rows and can be deleted. Parquet does keep those stats, so even though the partitioning doesn't guarantee the delete is safe,
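To illustrate why column-level stats matter for the delete described above, here is a minimal Python sketch of the kind of check they make possible. It is not Iceberg's actual evaluator: `FileColumnStats` and `all_rows_equal` are hypothetical names, and the sketch assumes the stored bounds are exact rather than truncated. A whole file can be dropped for a delete on `column = value` only if the stats prove that every row matches; with no stats, as for the Avro files discussed here, the check has to answer "unknown" and the file cannot be removed.

```python
from typing import Any, NamedTuple, Optional


class FileColumnStats(NamedTuple):
    """Hypothetical per-file stats for one column."""

    lower: Any
    upper: Any
    null_count: int


def all_rows_equal(stats: Optional[FileColumnStats], value: Any) -> bool:
    """Return True only when the stats prove every row has column == value."""
    if stats is None:
        # No column-level stats were recorded, so nothing can be proven.
        return False
    return (
        stats.null_count == 0
        and stats.lower == value
        and stats.upper == value
    )


# A file with stats showing a single distinct, non-null value.
with_stats = FileColumnStats(lower="2020-03-12", upper="2020-03-12", null_count=0)
# A file whose bounds span more than one value.
mixed = FileColumnStats(lower="2020-03-11", upper="2020-03-12", null_count=0)

safe_with_stats = all_rows_equal(with_stats, "2020-03-12")
safe_mixed = all_rows_equal(mixed, "2020-03-12")
safe_without_stats = all_rows_equal(None, "2020-03-12")
```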