Yes, that sounds correct. The small-files problem you're describing is a major reason to use a processing framework to write (or later rewrite) the data.
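For anyone who does want to drive the core API directly, a rough, untested sketch of the approach described in the thread below would be: group the rows by partition tuple, write one file per group, and hand each partition tuple to DataFiles.builder via withPartition. The helper method name, file naming, and choice of Avro here are only placeholders, and PartitionKey / InternalRecordWrapper are the iceberg-core / iceberg-data helpers I'd reach for, so adjust for your version:

```java
// Untested sketch: write one data file per partition tuple, then append them in a single commit.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.Table;
import org.apache.iceberg.avro.Avro;
import org.apache.iceberg.data.InternalRecordWrapper;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.avro.DataWriter;
import org.apache.iceberg.hadoop.HadoopInputFile;
import org.apache.iceberg.hadoop.HadoopOutputFile;
import org.apache.iceberg.io.FileAppender;

public class PartitionedAppendSketch {
  // Hypothetical helper: groups records by their partition tuple (e.g. hour(time)),
  // writes one Avro file per group, and commits all files atomically.
  public static void writePartitioned(Table table, List<Record> records,
                                      Configuration conf, String baseLocation) throws IOException {
    PartitionKey key = new PartitionKey(table.spec(), table.schema());
    // InternalRecordWrapper converts generic values (e.g. timestamps) to the
    // internal representation the partition transforms expect.
    InternalRecordWrapper wrapper = new InternalRecordWrapper(table.schema().asStruct());

    // Group rows by partition tuple.
    Map<PartitionKey, List<Record>> groups = new HashMap<>();
    for (Record record : records) {
      key.partition(wrapper.wrap(record));
      groups.computeIfAbsent(key.copy(), k -> new ArrayList<>()).add(record);
    }

    AppendFiles append = table.newAppend();
    int fileNum = 0;
    for (Map.Entry<PartitionKey, List<Record>> entry : groups.entrySet()) {
      // Placeholder path; a real writer should use the table's location provider and unique names.
      Path path = new Path(baseLocation, "data-" + (fileNum++) + ".avro");

      FileAppender<Record> appender = Avro.write(HadoopOutputFile.fromPath(path, conf))
          .schema(table.schema())
          .createWriterFunc(DataWriter::create)
          .build();
      try {
        appender.addAll(entry.getValue());
      } finally {
        appender.close();
      }

      DataFile dataFile = DataFiles.builder(table.spec())
          .withInputFile(HadoopInputFile.fromPath(path, conf))
          .withPartition(entry.getKey())    // the partition tuple for this file
          .withMetrics(appender.metrics())  // row count and column stats from the writer
          .withFormat(FileFormat.AVRO)
          .build();

      append.appendFile(dataFile);
    }

    // One atomic commit covering all of the partitioned files.
    append.commit();
  }
}
```

Engines avoid fragmentation by shuffling on the partition tuple before this step; without that, you'd still want periodic compaction as discussed below.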
On Mon, Jul 6, 2020 at 8:34 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks for the clarification, Ryan.
>
> In its simplest form, if using Parquet, it is possible to add a layer on
> top of the existing GenericParquetWriter to distribute writing the list of
> records into data files based on the table's partition spec.
>
> One key feature that processing engines like Spark or Presto provide is
> the shuffling to efficiently group data rows by partition tuple before
> writing. When using the core data API directly, data would eventually
> become fragmented after many writes. Users/developers may need to build an
> efficient compaction service to rewrite data periodically.
>
> Let me know if my understanding is correct.
>
> Chen
>
> On Thu, Jul 2, 2020 at 1:42 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Chen,
>>
>> Iceberg's API requires that the caller divides data correctly into files
>> according to the partition spec. Most of the time, users interact with
>> Iceberg using a processing engine like Spark or Presto that will do it for
>> you. If you're using the API directly, then you'll need to ensure you
>> partition the rows into data files and pass the correct partition tuples
>> when appending those files to the table.
>>
>> The core API is mainly intended for use by the processing engines, but
>> we're expanding support in the `iceberg-data` module for people who want
>> to interact directly. There are probably some things we could do to make
>> this easier, especially when partitioning data. If you have suggestions,
>> please feel free to open an issue or pull request.
>>
>> rb
>>
>> On Thu, Jul 2, 2020 at 9:19 AM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> I have a question on how hidden partitioning works in Iceberg using the
>>> Java API. The code is something like the following.
>>>
>>> ```
>>> // records is the list of records with a time column
>>> // table is created using partition spec hour(time)
>>> // records have different rows with different hours
>>>
>>> Table table = loadTable();
>>>
>>> Path path = new Path(...);
>>> FileAppender<Record> appender = Avro.write(fromPath(path, conf)).build();
>>> appender.addAll(records);
>>> appender.close();
>>>
>>> DataFile dataFile = DataFiles.builder(table.spec())
>>>     .withInputFile(HadoopInputFile.fromPath(path, conf))
>>>     .build();
>>>
>>> table.newAppend().appendFile(dataFile).commit();
>>> ```
>>>
>>> However, once committed, I still see only one partition count updated
>>> and one data file persisted, even though the underlying records span
>>> different hours.
>>>
>>> I think I am using the API in the wrong way, but I would appreciate it
>>> if someone could show me the right way to write partitioned data.
>>>
>>> Thanks,
>>> --
>>> Chen Song
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
> --
> Chen Song
>

--
Ryan Blue
Software Engineer
Netflix