Re: Usage of parquet field_id

2021-05-18 Thread Ryan Blue
Hi Weston, #1 is a problem and we should remove the auto-generation. The issue is that auto-generating an ID can result in a collision between Iceberg's field IDs and the generated IDs. Since Iceberg uses the ID to identify a field, that would result in unrelated data being mistaken for a column's

Re: Usage of parquet field_id

2021-05-18 Thread Weston Pace
OK, this matches my understanding of how field_id is used as well. I believe #1 will not be an issue because I think Iceberg always sets the field_id property when writing data? If that is the case then Iceberg would never have noticed the old behavior. In other words, Iceberg never relied on

Re: Referencing data files in manifest files

2021-05-18 Thread Jack Ye
You can read https://iceberg.apache.org/custom-catalog/#custom-file-io-implementation for more details of loading your custom FileIO, and see http://iceberg.apache.org/aws/#s3-fileio as an example. -Jack On Tue, May 18, 2021 at 10:16 AM Vivekanand Vellanki wrote: > Is it possible to make the Fil

Re: Referencing data files in manifest files

2021-05-18 Thread Vivekanand Vellanki
Is it possible to make the FileIO implementation extensible per URI scheme? For example, for the hdfs:// scheme, can I ensure that Iceberg uses my custom implementation of FileIO at run time? On Tue, May 18, 2021 at 9:45 PM Daniel Weeks wrote: > Hey Vivek, > > The file_path per spec is technically just a

Re: Referencing data files in manifest files

2021-05-18 Thread Daniel Weeks
Hey Vivek, The file_path per spec is technically just a string, but the representation is expected to be a URI. How this URI is interpreted is really up to the FileIO implementation. So for example, the most common FileIO implementation is probably HadoopFileIO, which is going to use whatever fi
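As a rough illustration of the point that the interpretation of file_path is left to the FileIO implementation, the sketch below picks a handler from the URI scheme of a path. It is not Iceberg's API: the `pick_file_io` function, the registry, and the example path are all hypothetical, and the handler names are plain string labels rather than real classes.

```python
from urllib.parse import urlparse

def pick_file_io(file_path, registry, default):
    """Return the handler registered for the URI scheme of file_path.

    registry and default are hypothetical: they stand in for whatever
    a custom FileIO implementation would delegate to.
    """
    scheme = urlparse(file_path).scheme  # empty string for a bare path
    return registry.get(scheme, default)

# Hypothetical mapping from URI scheme to a handler label.
registry = {"hdfs": "MyHdfsFileIO", "s3": "S3FileIO"}

chosen = pick_file_io(
    "hdfs://namenode/warehouse/db/t/data/f.parquet",  # made-up path
    registry,
    "HadoopFileIO",
)
```

A path without a scheme, or with an unregistered one, falls through to the default handler.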

Re: Querying older versions of an Iceberg table

2021-05-18 Thread Daniel Weeks
Hey Vivek, I think as you can see throughout this discussion there are a number of issues with modifying the data files outside of Iceberg APIs. To maintain data integrity, it's advised to only operate on the data through Iceberg. In many ways this is similar to trying to change the history of a

Re: Usage of parquet field_id

2021-05-18 Thread Daniel Weeks
Hey Weston, From Iceberg's perspective, the field_id is necessary to track the evolution of the schema over time. It's best to think of the problem from a dataset perspective as opposed to a file perspective. Iceberg maintains the mapping of the schema with respect to the field ids because
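To make the dataset-level view concrete, here is a minimal sketch of resolving columns by field ID rather than by name, assuming a deliberately simplified model in which a schema is a dict from field ID to column name. The function `project_by_field_id` and the sample schemas are hypothetical and are not taken from Iceberg or Parquet.

```python
def project_by_field_id(table_schema, file_schema, row):
    """Read one row using the table's current column names, matching by field ID.

    table_schema: dict of field_id -> current column name (hypothetical model)
    file_schema:  dict of field_id -> column name as written in the file
    row:          dict of file column name -> value
    """
    result = {}
    for field_id, current_name in table_schema.items():
        file_name = file_schema.get(field_id)
        # A field ID missing from the file means the column was added
        # after the file was written, so it reads as None.
        result[current_name] = row.get(file_name) if file_name is not None else None
    return result

# The file was written before "name" was renamed and "email" was added.
old_file_schema = {1: "id", 2: "name"}
table_schema = {1: "id", 2: "full_name", 3: "email"}
row = {"id": 7, "name": "Ada"}

projected = project_by_field_id(table_schema, old_file_schema, row)
```

Because the match is on the ID, the renamed column still resolves to the old data. In this model, a file whose IDs were generated independently could reuse ID 2 or 3 for an unrelated column, and that column would be read as if it belonged to the table.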