Iceberg stores this information, along with other footer- and file-level 
details, in manifests for exactly this use case. The goal is always to read 
the files once and then save metrics and statistics in the manifest so that 
they do not need to be read again. 

The value is expected to be accurate. If it is not, there is a bug in Iceberg 
(there was one recently involving improperly recorded file sizes). 

I would suggest taking a look at the snapshot and migrate procedures, since we 
already have code there for determining these values for existing files and 
Hive tables.
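
To give a sense of the one-time scan involved in the Avro case raised in the 
question, here is a rough sketch. It is not Iceberg's code. It assumes the 
standard Avro object container layout (magic bytes, a metadata map, a 16-byte 
sync marker, then blocks that each start with a record count and a byte size), 
and the names read_long, skip_metadata_map and count_avro_records are made up 
for illustration. It reads only the block headers and seeks past the record 
data.

```python
import io

MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file
SYNC_SIZE = 16       # length of the per-file sync marker


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def skip_metadata_map(stream):
    """Skip the header's map<string, bytes> without interpreting it."""
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            return
        if count < 0:
            # A negative count is followed by the block's size in bytes,
            # so the whole map block can be skipped in one seek.
            stream.seek(read_long(stream), io.SEEK_CUR)
            continue
        for _ in range(count):
            for _ in range(2):  # key, then value: both length-prefixed
                stream.seek(read_long(stream), io.SEEK_CUR)


def count_avro_records(stream):
    """Sum the record counts in every block header of a seekable stream."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    skip_metadata_map(stream)
    sync = stream.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise EOFError("stream ended inside the header")
    total = 0
    while True:
        if not stream.read(1):  # clean end of file between blocks
            return total
        stream.seek(-1, io.SEEK_CUR)
        count = read_long(stream)           # records in this block
        size = read_long(stream)            # bytes of serialized records
        stream.seek(size, io.SEEK_CUR)      # skip the data itself
        if stream.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch")
        total += count
```

Calling count_avro_records on a file opened in binary mode returns the exact 
record count, which is the kind of value that gets stored in the manifest 
once so the file does not have to be scanned again.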


> On Apr 7, 2021, at 3:41 AM, Vivekanand Vellanki <vi...@dremio.com> wrote:
> 
> 
> Hi,
> 
> We are in the process of converting Hive datasets to Iceberg datasets.
> 
> In this process, we noticed that each data-file entry in the manifest file 
> has a required record_count field.
> 
> Populating this accurately would require reading the footer/tail for 
> Parquet/ORC files. For Avro files, it requires reading the block headers of 
> all blocks to determine the number of records in the file.
> 
> Is the record_count in the data-file entry expected to be accurate, or can 
> we estimate it from the size of the file and an estimated row size?
> 
> Thanks
> Vivek
> 
