Hi Chen,

Iceberg's API requires that the caller divide the data correctly into files
according to the partition spec. Most of the time, users interact with
Iceberg through a processing engine like Spark or Presto that does this for
you. If you're using the API directly, then you'll need to partition the
rows into data files yourself and pass the correct partition tuple for each
file when appending it to the table.
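
In rough terms, that looks like the sketch below. It's untested:
`writeDataFile` is a placeholder for however you write one group of rows to
a file (for example, with the Avro appender from your snippet), and the
`InternalRecordWrapper` step is my assumption for converting values like
timestamps into the internal representation the `hour` transform expects:

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.InternalRecordWrapper;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.InputFile;

// Group rows by partition tuple, then write and append one file per group.
Table table = loadTable();
Map<PartitionKey, List<Record>> groups = new HashMap<>();
// Wraps generic records so the partition transforms see internal
// representations, e.g. timestamps as long micros rather than OffsetDateTime.
InternalRecordWrapper wrapper = new InternalRecordWrapper(table.schema().asStruct());
for (Record record : records) {
  PartitionKey key = new PartitionKey(table.spec(), table.schema());
  key.partition(wrapper.wrap(record));
  groups.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
}

AppendFiles append = table.newAppend();
for (Map.Entry<PartitionKey, List<Record>> entry : groups.entrySet()) {
  // writeDataFile is a placeholder: write entry.getValue() to a new Avro
  // file and return its InputFile.
  InputFile file = writeDataFile(entry.getValue());
  append.appendFile(DataFiles.builder(table.spec())
      .withInputFile(file)
      .withPartition(entry.getKey()) // the partition tuple for this file
      .withRecordCount(entry.getValue().size())
      .withFormat(FileFormat.AVRO)
      .build());
}
append.commit();
```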

The core API is mainly intended for use by the processing engines, but
we're expanding support in the `iceberg-data` module for people who want to
interact with tables directly. There are probably some things we could do
to make this easier, especially around partitioning data. If you have
suggestions, please feel free to open an issue or pull request.
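
For example, writing generic records with the Avro appender from
`iceberg-data` looks roughly like this (a sketch, assuming a Hadoop path
and configuration as in your snippet; `recordsForOneHour` is a placeholder
for the rows of a single partition):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.avro.Avro;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.avro.DataWriter;
import org.apache.iceberg.hadoop.HadoopOutputFile;
import org.apache.iceberg.io.FileAppender;

FileAppender<Record> appender = Avro.write(HadoopOutputFile.fromPath(path, conf))
    .schema(table.schema())               // write files with the table schema
    .createWriterFunc(DataWriter::create) // writer for iceberg-data generic records
    .build();
try {
  appender.addAll(recordsForOneHour);     // rows for a single partition
} finally {
  appender.close();                       // close before using file length or metrics
}
```

After closing, `appender.metrics()` can be passed to
`DataFiles.builder(...).withMetrics(...)` so the file's stats make it into
the table metadata.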

rb



On Thu, Jul 2, 2020 at 9:19 AM Chen Song <chen.song...@gmail.com> wrote:

> I have a question on how hidden partitioning works in Iceberg using Java
> API.
> The code is something like the following.
>
> ```
> // records is the list of records with a time column
> // table is created using partition spec hour(time)
> // records have different rows with different hours
>
> Table table = loadTable();
>
> Path path = new Path(...);
> FileAppender<Record> appender = Avro.write(fromPath(path, conf)).build();
> appender.addAll(records);
> appender.close();
>
> DataFile dataFile = DataFiles.builder(table.spec())
>     .withInputFile(HadoopInputFile.fromPath(path, conf))
>     .build();
>
> table.newAppend().appendFile(dataFile).commit();
> ```
> However, once committed, I still see only one partition count updated and
> one data file persisted, even though the underlying records span
> different hours.
>
> I think I'm using the API the wrong way, but I'd appreciate it if someone
> could show me the right way to write partitioned data.
>
>
> Thanks,
> --
> Chen Song
>
>

-- 
Ryan Blue
Software Engineer
Netflix
