Replies inline. On Sat, Feb 2, 2019 at 9:34 AM Manish Malhotra < [email protected]> wrote:
> Does Iceberg stores table's metadata files and version information in HMS > database? > Iceberg stores the location of the current table metadata file. It also syncs the schema with the table schema for convenience, but the Hive schema is informational. The only metadata that matters is the table metadata file, which is the source of truth. > How do you see Iceberg+HMS performance when its being used for big clusters > a. large number of partitions?, this should be good with iceberg? > Iceberg tables appear unpartitioned to HMS because they cannot be read using the conventions of Hive tables. The way Iceberg stores partition information actually scales better than HMS in our experience. HMS is only used for table-level commits, so it isn't a bottle-neck for partition operations. Split planning and table updates are done by jobs. For Spark, this is on the driver. That way, resources used for this scale with the cluster. Iceberg can also handle more unique partitions because it is not listing directories. Each file could be a unique partition without incurring much more overhead. > b. large number of concurrent tasks to add data/partitions. (should add > in batches and not streaming, as Iceberg is more for batching systems?) > Iceberg can handle concurrent write tasks. We have tables that are updated every couple of minutes. The problem is really the number of files that this generates. If you write often, you will probably need to compact files to avoid the small files problem. > c. large number of concurrent queries? this should also be ok as there > is no locking? > Concurrent reads are fine. There is no locking required so there is no scale bottleneck on the read side. rb -- Ryan Blue Software Engineer Netflix
