Thanks for the clarification, Ryan. In its simplest form, if using Parquet, it should be possible to add a layer on top of the existing GenericParquetWriter that distributes the records into separate data files based on the table's partition spec.
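Here is a rough, untested sketch of the kind of layer I have in mind: group the generic records by partition tuple with PartitionKey, write one Parquet file per group via GenericParquetWriter, and register each file with DataFiles.builder(...).withPartition(...). The file naming/location below is just a placeholder, and I may be misreading the PartitionKey/InternalRecordWrapper usage, so please correct me if any of this is off.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Files;
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.InternalRecordWrapper;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.parquet.Parquet;

public class PartitionedAppend {
  // Untested sketch: one Parquet file per partition tuple, all appended in one commit.
  public static void appendPartitioned(Table table, List<Record> records, String dataDir)
      throws IOException {
    Schema schema = table.schema();
    PartitionSpec spec = table.spec();

    // Group records by their partition tuple (e.g. hour(time)).
    // InternalRecordWrapper converts generic values (LocalDateTime, etc.) into the
    // internal representation that the partition transforms expect.
    InternalRecordWrapper wrapper = new InternalRecordWrapper(schema.asStruct());
    PartitionKey key = new PartitionKey(spec, schema);
    Map<PartitionKey, List<Record>> groups = new HashMap<>();
    for (Record record : records) {
      key.partition(wrapper.wrap(record));
      groups.computeIfAbsent(key.copy(), k -> new ArrayList<>()).add(record);
    }

    AppendFiles append = table.newAppend();
    for (Map.Entry<PartitionKey, List<Record>> group : groups.entrySet()) {
      // Placeholder file location; a real writer would follow the table's location scheme.
      File file = new File(dataDir, UUID.randomUUID() + ".parquet");

      FileAppender<Record> appender = Parquet.write(Files.localOutput(file))
          .schema(schema)
          .createWriterFunc(GenericParquetWriter::buildWriter)
          .build();
      try {
        appender.addAll(group.getValue());
      } finally {
        appender.close();
      }

      // Register the file together with its partition tuple so the table
      // knows which partition the file belongs to.
      DataFile dataFile = DataFiles.builder(spec)
          .withPartition(group.getKey())
          .withPath(file.toString())
          .withFormat(FileFormat.PARQUET)
          .withFileSizeInBytes(appender.length())
          .withMetrics(appender.metrics())
          .build();
      append.appendFile(dataFile);
    }
    append.commit();
  }
}
```

If that is roughly right, then the main piece the engines add on top of this is the shuffle that co-locates rows for the same partition before the write.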
One key feature that processing engines like Spark or Presto provide is shuffling, which efficiently groups rows by partition tuple before writing. When using the core data API directly, data will eventually become fragmented across many small files, and users/developers may need to build an efficient compaction service to rewrite data periodically. Let me know if my understanding is correct.

Chen

On Thu, Jul 2, 2020 at 1:42 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Chen,
>
> Iceberg's API requires that the caller divides data correctly into files
> according to the partition spec. Most of the time, users interact with
> Iceberg using a processing engine like Spark or Presto that will do it for
> you. If you're using the API directly, then you'll need to ensure you
> partition the rows into data files and pass the correct partition tuples
> when appending those files to the table.
>
> The core API is mainly intended for use by the processing engines, but
> we're expanding support in the `iceberg-data` module for people who want
> to interact directly. There are probably some things we could do to make
> this easier, especially when partitioning data. If you have suggestions,
> please feel free to open an issue or pull request.
>
> rb
>
> On Thu, Jul 2, 2020 at 9:19 AM Chen Song <chen.song...@gmail.com> wrote:
>
>> I have a question on how hidden partitioning works in Iceberg using the
>> Java API. The code is something like the following.
>>
>> ```
>> // records is the list of records with a time column
>> // table is created using partition spec hour(time)
>> // records have different rows with different hours
>>
>> Table table = loadTable();
>>
>> Path path = new Path(...);
>> FileAppender<Record> appender = Avro.write(fromPath(path, conf)).build();
>> appender.addAll(records);
>> appender.close();
>>
>> DataFile dataFile = DataFiles.builder(table.spec())
>>     .withInputFile(HadoopInputFile.fromPath(path, conf))
>>     .build();
>>
>> table.newAppend().appendFile(dataFile).commit();
>> ```
>>
>> However, once committed, I still see only one partition count updated and
>> one data file persisted, even though the underlying records spread across
>> different hours.
>>
>> I think I am using the API in the wrong way, but I'd appreciate it if
>> someone can help me with the right way to write partitioned data.
>>
>> Thanks,
>> --
>> Chen Song
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Chen Song