Re: [DISCUSS] Distinct count map

2021-07-01 Thread Daniel Weeks
I would agree with including distinct counts. As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even …
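To make the strategy discussed above concrete, here is a minimal, hypothetical sketch of how an engine could combine per-file distinct counts using column bounds. Everything in it (class names, the long-typed bounds, the containment heuristic) is an assumption for illustration, not Iceberg code: files with disjoint bounds can have their counts summed, while files with overlapping bounds get a more conservative merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NdvEstimate {
  // Per-file stats for a single long-typed column (hypothetical shape,
  // not an Iceberg API).
  record FileStats(long lowerBound, long upperBound, long distinctCount) {}

  static long estimateCombinedNdv(List<FileStats> files) {
    List<FileStats> sorted = new ArrayList<>(files);
    sorted.sort(Comparator.comparingLong(FileStats::lowerBound));
    long total = 0;      // NDV of groups whose ranges are already closed
    long groupNdv = 0;   // running estimate for the current overlapping group
    long groupHigh = Long.MIN_VALUE;
    for (FileStats f : sorted) {
      if (f.lowerBound() > groupHigh) {
        // Disjoint range: the previous group's values cannot reappear
        // here, so its count is final.
        total += groupNdv;
        groupNdv = f.distinctCount();
      } else {
        // Overlapping ranges: conservatively assume the smaller value set
        // is contained in the larger one, so the group's NDV is the max.
        groupNdv = Math.max(groupNdv, f.distinctCount());
      }
      groupHigh = Math.max(groupHigh, f.upperBound());
    }
    return total + groupNdv;
  }

  public static void main(String[] args) {
    // First two files are disjoint, so their counts add; the third
    // overlaps the second, so only the larger count is kept.
    System.out.println(estimateCombinedNdv(List.of(
        new FileStats(0, 99, 80),
        new FileStats(100, 199, 90),
        new FileStats(150, 210, 40)))); // -> 80 + max(90, 40) = 170
  }
}
```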

[DISCUSS] Distinct count map

2021-07-01 Thread Ryan Blue
Hi everyone, I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts. This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. …

Re: Parquet with - Snappy vs gzip

2021-07-01 Thread Russell Spitzer
Our tests showed Snappy being about 2x the size of Gzip but about half the speed. Zstd ended up about the same size as Gzip and as fast as Snappy. That said, memory usage was way up with Zstd.
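The numbers in this thread come from the posters' own workloads. For anyone who wants to run a similar comparison, here is a rough, self-contained parquet-java sketch that writes the same synthetic data with each codec and reports write time and file size. The schema, row count, and output paths are arbitrary assumptions, ZSTD and LZ4 may require native Hadoop codec support depending on your Parquet version, and real results depend entirely on your data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class CodecBench {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message t { required int64 id; required binary payload (UTF8); }");
    Configuration conf = new Configuration();
    GroupWriteSupport.setSchema(schema, conf);
    SimpleGroupFactory rows = new SimpleGroupFactory(schema);

    for (CompressionCodecName codec : new CompressionCodecName[] {
        CompressionCodecName.SNAPPY, CompressionCodecName.GZIP,
        CompressionCodecName.ZSTD, CompressionCodecName.LZ4}) {
      Path out = new Path("/tmp/bench-" + codec + ".parquet");
      long start = System.nanoTime();
      try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(out)
          .withConf(conf)
          .withCompressionCodec(codec)
          .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
          .build()) {
        // Write the same synthetic rows under every codec.
        for (long i = 0; i < 1_000_000L; i++) {
          writer.write(rows.newGroup().append("id", i).append("payload", "row-" + i));
        }
      }
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      long bytes = out.getFileSystem(conf).getFileStatus(out).getLen();
      System.out.printf("%-7s %6d ms %12d bytes%n", codec, elapsedMs, bytes);
    }
  }
}
```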

rowGroup:File = 1:1

2021-07-01 Thread Sreeram Garlapati
Hello Iceberg devs, We are leaning towards having one row group per file. We would love to know if there are any additional considerations that we potentially would have missed. Here's my understanding of how/why Parquet historically needed to hold multiple row groups - more like the major reason …
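One way to approximate this in Iceberg, sketched below under the assumption that the table is already loaded and that these two standard table properties govern the writer: make the Parquet row-group size at least as large as the target file size, so the writer rolls to a new file before it ever starts a second row group. Note that a full row group is buffered in memory before flush, so this raises writer memory pressure.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

public class OneRowGroupPerFile {
  // `table` is assumed to be loaded from a catalog elsewhere.
  static void configure(Table table) {
    long targetBytes = 512L * 1024 * 1024; // arbitrary 512 MB assumption
    table.updateProperties()
        // Roll to a new data file once it reaches the target size...
        .set(TableProperties.WRITE_TARGET_FILE_SIZE_BYTES, String.valueOf(targetBytes))
        // ...and let a single row group grow to that same size.
        .set(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, String.valueOf(targetBytes))
        .commit();
  }
}
```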

Re: Parquet with - Snappy vs gzip

2021-07-01 Thread Sreeram Garlapati
Will include Zstd as well, thank you. However, we are interested in compression speed rather than just the ratio.

Re: Parquet with - Snappy vs gzip

2021-07-01 Thread Ryan Blue
You should probably try Zstd while you're at it. We had great results with Zstd as well. My conclusion was that Zstd is probably the right choice when you want higher compression ratios, and LZ4 was the right choice when you didn't need great compression but wanted fast compression and decompression.

Re: Parquet with - Snappy vs gzip

2021-07-01 Thread Sreeram Garlapati
Slick, thanks @Ryan Blue. We will add LZ4 to our mix and report back if we find anything different.

Re: Parquet with - Snappy vs gzip

2021-07-01 Thread Ryan Blue
The default should probably be LZ4. In our testing, LZ4 beat snappy for every dataset for read time, write time, and compression ratio. I believe it also typically got a better compression ratio than gzip. Gzip was the previous default because it does a better job on compression ratio than snappy.

Parquet with - Snappy vs gzip

2021-07-01 Thread Sreeram Garlapati
Hello Iceberg devs! Do any of you folks use the underlying file format as Parquet + Snappy? Iceberg configures this by default as Parquet + gzip (write.parquet.compression-codec). Is there any specific reason for this choice? In our preliminary tests we found better numbers with Parquet + …
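For anyone wanting to run the same experiment: the property named above is a per-table setting, so a minimal sketch of switching an existing table from the gzip default via the Java API might look like the following. The table handle and the codec choice are assumptions; the codec string is whatever Parquet accepts, e.g. "gzip", "snappy", "zstd", or "lz4".

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

public class SetCompression {
  // `table` is assumed to be loaded from a catalog elsewhere.
  static void useSnappy(Table table) {
    table.updateProperties()
        // TableProperties.PARQUET_COMPRESSION is "write.parquet.compression-codec".
        .set(TableProperties.PARQUET_COMPRESSION, "snappy")
        .commit();
  }
}
```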

Re: migrating Hadoop tables to tables with hive catalog

2021-07-01 Thread Huadong Liu
Thank you all. That saves rewriting all the manifest files, which is a lot. I did the following and it seems to be working fine. 1. Create an Iceberg table using the Hive catalog with the table schema, partition spec, etc. 2. Copy the Hadoop table's latest vd.metadata.json to the Hive table metadata …
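A related sketch: later Iceberg releases expose Catalog#registerTable, which does roughly what the manual steps above describe, creating a catalog entry that points at an existing metadata.json without rewriting manifests. Whether it is available depends on your Iceberg version, and the metastore URI, identifier, and metadata path below are all placeholders:

```java
import java.util.Map;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class RegisterHadoopTable {
  public static void main(String[] args) {
    HiveCatalog catalog = new HiveCatalog();
    catalog.initialize("hive", Map.of("uri", "thrift://metastore:9083"));
    // Point the new Hive catalog entry at the Hadoop table's latest
    // metadata file instead of rewriting any manifests.
    catalog.registerTable(
        TableIdentifier.of("db", "tbl"),
        "hdfs://nn/warehouse/db/tbl/metadata/v3.metadata.json");
  }
}
```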

Re: [NOTES] 23 June 2021 Iceberg Community Meeting

2021-07-01 Thread Ryan Blue
Also, we keep historical notes and a running agenda for the next sync in this doc: https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?pli=1#heading=h.z3dncl7gr8m1 Feel free to add topics for the next one, which will be on Wednesday, 21 July 2021 at 16:00 UTC.

[NOTES] 23 June 2021 Iceberg Community Meeting

2021-07-01 Thread Carl Steinbach
Iceberg Community Meetings are open to everyone. To receive an invitation to the next meeting, please join the iceberg-s...@googlegroups.com list. Special thanks to Ryan Blue for contributing most of these notes. Attendees: Anjali Norwood, Badrul Chowdhury, …

3 Iceberg talks at Subsurface

2021-07-01 Thread Dave Nielsen
There are a few Iceberg talks at the Subsurface virtual conference (free): - Iceberg Case Studies, by Ryan Blue - Why & How Net…

Re: migrating Hadoop tables to tables with hive catalog

2021-07-01 Thread Ryan Murray
I had a short proposal here [1] suggesting the same as Russell. I think this is probably a more broadly useful operation, but I don't really know the best place for it to live. I'm happy to finish the proposal if there are some opinions on where in Iceberg it is appropriate to add such functionality.

Re: migrating Hadoop tables to tables with hive catalog

2021-07-01 Thread Russell Spitzer
I think you could probably also do this by just creating a Hive table and then changing the location to point to the most recent Hadoop metadata.json file. > On Jul 1, 2021, at 1:42 AM, Huadong Liu wrote: > FYI, I was able to do the migration by casting ManifestFile to GenericManifestFile, …