Yes, that's correct. The small-files problem you're describing is a major
reason to use a processing framework to write (or later rewrite) the data.
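
If the small files do pile up, the Spark rewrite action can bin-pack them
back into larger files. A minimal sketch, assuming an active SparkSession, a
loaded Table handle, and the SparkActions entry point from recent
iceberg-spark releases (older releases expose the same thing through
Actions.forTable); the option name and target size are illustrative:

```
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

public class CompactSmallFiles {
  // Bin-pack small data files into larger ones for the given table.
  public static void compact(Table table) {
    SparkActions.get()
        .rewriteDataFiles(table)
        .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
        .execute();
  }
}
```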

On Mon, Jul 6, 2020 at 8:34 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks for the clarification, Ryan.
>
> In its simplest form, if using Parquet, it should be possible to add a layer
> on top of the existing GenericParquetWriter that distributes the list of
> records into separate data files based on the table's partition spec (rough
> sketch below).
>
> One key feature that processing engines like Spark or Presto provide is
> shuffling, which efficiently groups data rows by partition tuple before
> writing. When using the core data API directly, the data will eventually
> become fragmented across many small files as writes accumulate, so
> users/developers may need to build an efficient compaction service that
> rewrites the data periodically.
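>
> A rough, untested sketch of the layer I have in mind is below; `baseDir`,
> `conf`, and the class name are just placeholders, and error handling is
> omitted:
>
> ```
> import java.io.IOException;
> import java.io.UncheckedIOException;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.UUID;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.iceberg.PartitionKey;
> import org.apache.iceberg.Table;
> import org.apache.iceberg.data.InternalRecordWrapper;
> import org.apache.iceberg.data.Record;
> import org.apache.iceberg.data.parquet.GenericParquetWriter;
> import org.apache.iceberg.hadoop.HadoopOutputFile;
> import org.apache.iceberg.io.FileAppender;
> import org.apache.iceberg.parquet.Parquet;
>
> public class PartitionedParquetWriter {
>   // Route each record to a Parquet appender keyed by its partition tuple.
>   public static Map<PartitionKey, FileAppender<Record>> write(
>       Table table, Iterable<Record> records, String baseDir,
>       Configuration conf) throws IOException {
>     Map<PartitionKey, FileAppender<Record>> appenders = new HashMap<>();
>     // The wrapper converts generic values (e.g. timestamps) into the
>     // representation the partition transforms expect.
>     InternalRecordWrapper wrapper =
>         new InternalRecordWrapper(table.schema().asStruct());
>     PartitionKey key = new PartitionKey(table.spec(), table.schema());
>
>     for (Record record : records) {
>       key.partition(wrapper.wrap(record));  // evaluate hour(time) for this row
>       FileAppender<Record> appender = appenders.computeIfAbsent(key.copy(), k -> {
>         Path path = new Path(baseDir,
>             table.spec().partitionToPath(k) + "/" + UUID.randomUUID() + ".parquet");
>         try {
>           return Parquet.write(HadoopOutputFile.fromPath(path, conf))
>               .schema(table.schema())
>               .createWriterFunc(GenericParquetWriter::buildWriter)
>               .build();
>         } catch (IOException e) {
>           throw new UncheckedIOException(e);
>         }
>       });
>       appender.add(record);
>     }
>
>     for (FileAppender<Record> appender : appenders.values()) {
>       appender.close();
>     }
>     return appenders;
>   }
> }
> ```
>
> The caller would then turn each closed appender into a DataFile, attach the
> matching partition tuple, and commit them in a single append.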
>
> Let me know if my understanding is correct.
>
> Chen
>
> On Thu, Jul 2, 2020 at 1:42 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Chen,
>>
>> Iceberg's API requires that the caller divide the data into files according
>> to the partition spec. Most of the time, users interact with Iceberg through
>> a processing engine like Spark or Presto that handles this for them. If
>> you're using the API directly, you'll need to partition the rows into data
>> files yourself and pass the correct partition tuples when appending those
>> files to the table.
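>>
>> As a rough illustration (untested, and assuming the file at `path` holds
>> rows for exactly one hour), the partition tuple is attached when building
>> the DataFile:
>>
>> ```
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.iceberg.DataFile;
>> import org.apache.iceberg.DataFiles;
>> import org.apache.iceberg.PartitionKey;
>> import org.apache.iceberg.Table;
>> import org.apache.iceberg.data.InternalRecordWrapper;
>> import org.apache.iceberg.data.Record;
>> import org.apache.iceberg.hadoop.HadoopInputFile;
>>
>> public class AppendPartitionedFile {
>>   // Append one data file whose rows all belong to a single partition tuple.
>>   public static void append(Table table, Path path, Configuration conf,
>>                             Record anyRecordFromFile, long recordCount) {
>>     // Derive the partition tuple (e.g. hour(time)) from any row in the file.
>>     InternalRecordWrapper wrapper =
>>         new InternalRecordWrapper(table.schema().asStruct());
>>     PartitionKey partition = new PartitionKey(table.spec(), table.schema());
>>     partition.partition(wrapper.wrap(anyRecordFromFile));
>>
>>     DataFile dataFile = DataFiles.builder(table.spec())
>>         .withInputFile(HadoopInputFile.fromPath(path, conf))
>>         .withPartition(partition)    // the piece missing from your example
>>         .withRecordCount(recordCount)
>>         .build();
>>
>>     table.newAppend().appendFile(dataFile).commit();
>>   }
>> }
>> ```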
>>
>> The core API is mainly intended for use by the processing engines, but
>> we're expanding support in the `iceberg-data` module for people who want to
>> interact directly. There are probably some things we could do to make this
>> easier, especially when partitioning data. If you have suggestions, please
>> feel free to open an issue or pull request.
>>
>> rb
>>
>>
>>
>> On Thu, Jul 2, 2020 at 9:19 AM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> I have a question about how hidden partitioning works in Iceberg when
>>> using the Java API.
>>> The code is something like the following.
>>>
>>> ```
>>> // records is a list of records with a time column
>>> // the table was created with partition spec hour(time)
>>> // the records span several different hours
>>>
>>> Table table = loadTable();
>>>
>>> Path path = new Path(...);
>>> FileAppender<Record> appender = Avro.write(HadoopOutputFile.fromPath(path, conf))
>>>     .schema(table.schema())
>>>     .createWriterFunc(DataWriter::create)
>>>     .build();
>>> appender.addAll(records);
>>> appender.close();
>>>
>>> DataFile dataFile = DataFiles.builder(table.spec())
>>>     .withInputFile(HadoopInputFile.fromPath(path, conf))
>>>     .withRecordCount(records.size())
>>>     .build();
>>>
>>> table.newAppend().appendFile(dataFile).commit();
>>> ```
>>> However, once the append is committed, I see only one partition count
>>> updated and one data file persisted, even though the underlying records
>>> span different hours.
>>>
>>> I think I am using the API in the wrong way, but I would appreciate it if
>>> someone could point me to the right way to write partitioned data.
>>>
>>>
>>> Thanks,
>>> --
>>> Chen Song
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Chen Song
>
>

-- 
Ryan Blue
Software Engineer
Netflix
