Replies inline.

On Sat, Feb 2, 2019 at 9:34 AM Manish Malhotra <
[email protected]> wrote:

> Does Iceberg stores table's metadata files and version information in HMS
> database?
>

Iceberg stores the location of the current table metadata file. It also
syncs the schema with the table schema for convenience, but the Hive schema
is informational. The only metadata that matters is the table metadata
file, which is the source of truth.


> How do you see Iceberg+HMS performance when its being used for big clusters
>   a. large number of partitions?, this should be good with iceberg?
>

Iceberg tables appear unpartitioned to HMS because they cannot be read
using the conventions of Hive tables.

The way Iceberg stores partition information actually scales better than
HMS in our experience.

HMS is only used for table-level commits, so it isn't a bottle-neck for
partition operations. Split planning and table updates are done by jobs.
For Spark, this is on the driver. That way, resources used for this scale
with the cluster.

Iceberg can also handle more unique partitions because it is not listing
directories. Each file could be a unique partition without incurring much
more overhead.


>   b. large number of concurrent tasks to add data/partitions. (should add
> in batches and not streaming, as Iceberg is more for batching systems?)
>

Iceberg can handle concurrent write tasks. We have tables that are updated
every couple of minutes. The problem is really the number of files that
this generates. If you write often, you will probably need to compact files
to avoid the small files problem.


>   c. large number of concurrent queries? this should also be ok as there
> is no locking?
>

Concurrent reads are fine. There is no locking required so there is no
scale bottleneck on the read side.

rb

-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to