Re: [DISCUSS] Community code reviews

2019-02-26 Thread RD
+1 On Tue, Feb 26, 2019 at 5:49 PM Jacques Nadeau wrote: > I'm +1 (non-binding) if you allow a window for review (for example, I > think others have suggested 1-2 business days before self-+1). The post, > self-+1, merge-in-two-minutes pattern is not a great situation for anyone. > -- > Jacques Nadeau > CTO

Re: [DISCUSS] Community code reviews

2019-02-26 Thread Jacques Nadeau
I'm +1 (non-binding) if you allow a window for review (for example, I think others have suggested 1-2 business days before self-+1). The post, self-+1, merge-in-two-minutes pattern is not a great situation for anyone. -- Jacques Nadeau CTO and Co-Founder, Dremio On Tue, Feb 26, 2019 at 4:51 PM Ryan Blue wrote: …

[DISCUSS] Community code reviews

2019-02-26 Thread Ryan Blue
Hi everyone, I’d like to give a shout out to some of the awesome people who have joined this community and taken the time to review pull requests: Matt Cheah, Anton Okolnychyi, Ratandeep Ratti, Filip Bocse, and Uwe Korn. Thanks to all of you! This work is really helpful to growing the community and …

Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Matt Cheah
This idea has been discussed before in several cases; see https://github.com/apache/incubator-iceberg/issues/16. We originally thought this would be the best way to support encryption metadata. However, we instead made encryption a first-class concept in Iceberg. The design premise we decided …

Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Arvind Pruthi
I can create a proposal to add an Optional Extended Metadata Attribute to the “data_files” structure. I am thinking of something simple, equivalent to extended attributes in file systems: basically a list of key-value pairs. Such an abstraction would help address many unforeseen cases.
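A minimal sketch of what such an open-ended attribute map on a data file entry could look like. This is purely illustrative: the record shape, field names, and attribute keys below are hypothetical and do not come from the Iceberg spec or the actual proposal.

```python
# Hypothetical sketch: a data-file record carrying an optional map of
# extended attributes, analogous to xattrs in file systems.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataFile:
    file_path: str
    record_count: int
    # Open-ended key-value pairs, intended to absorb unforeseen use cases
    # without further spec changes.
    extended_attributes: Dict[str, str] = field(default_factory=dict)


df = DataFile("s3://bucket/tbl/part-00000.parquet", record_count=1000)
df.extended_attributes["encryption.key_id"] = "key-42"
df.extended_attributes["source.etag"] = "abc123"
print(df.extended_attributes["source.etag"])  # -> abc123
```

Readers and writers that don't understand a given key could simply ignore it, which is what makes the xattr analogy attractive.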

Re: Split Planning - Expensive Tasks First

2019-02-26 Thread Ryan Blue
Split planning works the way it currently does to prioritize newer files and avoid reordering them. The idea is that newer files are the most likely to be read, so engines like Presto that continue to plan splits as the first splits run will return results faster. I like the idea of returning the
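The trade-off described above can be sketched in a few lines: planning in file order favors the newest (most likely to be read) files, while "expensive tasks first" sorts by split size so the longest tasks start early. The data and field names here are made up for illustration; this is not Iceberg's planner.

```python
# Two candidate orderings for the same set of splits (illustrative only).
splits = [
    {"file": "new-1.parquet", "length": 10 * 1024},    # newest file
    {"file": "new-2.parquet", "length": 900 * 1024},
    {"file": "old-1.parquet", "length": 500 * 1024},   # oldest file
]

# Current behavior described in the thread: keep file order, so the newest
# files are planned first and engines that stream splits return results fast.
in_file_order = list(splits)

# "Expensive tasks first": schedule the largest splits first so potential
# stragglers start early and overall job latency drops.
expensive_first = sorted(splits, key=lambda s: s["length"], reverse=True)

print([s["file"] for s in expensive_first])
# -> ['new-2.parquet', 'old-1.parquet', 'new-1.parquet']
```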

Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Arvind Pruthi
Well, the question/use case has to do with legacy Hive Metastore behavior. Since the Hive Metastore only tracked partitions and not individual files, use cases were built around that assumption. In this case, we have an important housekeeping flow which replaces existing files with files of the same …

Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Jacques Nadeau
We're using etags for better clarity on this at Dremio (for a different use case). I wonder if the same thing should be available in Iceberg. -- Jacques Nadeau CTO and Co-Founder, Dremio On Tue, Feb 26, 2019 at 9:48 AM Ryan Blue wrote: > Hi Arvind, > > Iceberg assumes that all file locations are …
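The idea of using an etag to notice that a file at an unchanged location now has different contents can be sketched as below. The helper is hypothetical; real S3 etags are computed server-side (and are only an MD5 for simple, non-multipart uploads), so this is just the comparison pattern, not Dremio's or Iceberg's implementation.

```python
# Sketch: detect in-place replacement of a file by comparing a stored
# content fingerprint with the current one (names and helper are made up).
import hashlib


def etag(data: bytes) -> str:
    """Content fingerprint; stands in for a storage-provided etag."""
    return hashlib.md5(data).hexdigest()


stored = etag(b"original contents")     # recorded when the file was added
current = etag(b"rewritten contents")   # recomputed at read/validation time

print(stored != current)  # True -> the file was replaced in place
```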

Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Ryan Blue
You could always embed version information in the file location, like S3's @ syntax. That's just another way to make it unique. Why is it necessary to overwrite the original file location though? That's why I don't think I understand the use case. On Tue, Feb 26, 2019 at 9:50 AM Jacques Nadeau wrote: …
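The suggestion above, embedding a version in the location so rewrites never reuse a path, can be sketched as follows. The path scheme is invented for illustration and only borrows the spirit of S3's "@" syntax; it is not something Iceberg prescribes.

```python
# Sketch: derive a unique, versioned location from a base file path
# (hypothetical naming scheme, loosely modeled on S3's "@" syntax).
def versioned_location(path: str, version: int) -> str:
    base, dot, ext = path.rpartition(".")
    if dot:
        return f"{base}@v{version}.{ext}"
    return f"{path}@v{version}"


loc1 = versioned_location("s3://bucket/tbl/part-00000.parquet", 1)
loc2 = versioned_location("s3://bucket/tbl/part-00000.parquet", 2)
print(loc1)  # -> s3://bucket/tbl/part-00000@v1.parquet
```

Because each rewrite gets a fresh location, Iceberg's assumption that file locations are unique keeps holding, and older snapshots still point at the bytes they were committed with.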

Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Ryan Blue
Hi Arvind, Iceberg assumes that all file locations are unique. If two snapshots refer to the same location, then whatever data file (or version) is in that location is what is read. What is your use case? Apache Iceberg has no official releases yet. We still need to do some license work for binaries …

Re: Would we consider adding support for metrics collection/tracing instrumentation such as opencensus or opentracing?

2019-02-26 Thread Ryan Blue
I'm not sure what I would want from DropWizard metrics. Most of the things we want to time happen just a few times in a job and are specific to a table. For example, we want to know how long a particular query takes to plan. That is dependent on how large the table is and what filters were applied
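The kind of measurement described here, a per-table timing of a rare event like query planning rather than a process-wide aggregate, can be sketched with a simple tagged timer. All names are illustrative; this is not an Iceberg or DropWizard API.

```python
# Sketch: time a rare, table-specific event (e.g. query planning) and tag
# the measurement with the table name, instead of folding it into a
# process-wide histogram (hypothetical names throughout).
import time
from contextlib import contextmanager

timings = []  # (event, table, seconds)


@contextmanager
def timed(event: str, table: str):
    start = time.monotonic()
    try:
        yield
    finally:
        timings.append((event, table, time.monotonic() - start))


with timed("plan", "db.events"):
    sum(range(1000))  # stand-in for split-planning work

event, table, elapsed = timings[0]
print(event, table, elapsed >= 0)
```

Because the timing is tagged with the table, it stays interpretable: planning time depends on table size and filters, which a global histogram would wash out.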

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-26 Thread Anton Okolnychyi
Unfortunately, Spark doesn’t push down filters for nested columns. I remember an effort to implement it [1]. However, it is not merged. So, even if we have proper statistics in Iceberg, we cannot leverage them from Spark. [1] - https://github.com/apache/spark/pull/22573

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-26 Thread Gautam
Thanks Anton, this is very helpful! I will apply the patch from pull #63 and give it a shot. Re: collecting min/max stats on nested structures (https://github.com/apache/incubator-iceberg/issues/78) ... We have the exact same use case for …

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-26 Thread Anton Okolnychyi
Hi Gautam, I believe you see this behaviour because SparkAppenderFactory is configured to use ParquetWriteAdapter. It only tracks the number of records and uses ParquetWriteSupport from Spark. This means that statistics are not collected on writes and cannot be used on reads. Once [1] is merged …

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-26 Thread Gautam
.. Just to be clear, my concern is around Iceberg not skipping files. Iceberg does skip row groups when scanning files, as *iceberg.parquet.ParquetReader* uses the Parquet stats underneath while skipping, although none of these stats come from the manifests. On Tue, Feb 26, 2019 at 7:24 PM Gautam wrote: …

Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-26 Thread Gautam
Hello Devs, I am looking into leveraging Iceberg to speed up split generation and to minimize file scans. My understanding was that Iceberg keeps key statistics as listed under Metrics.java [1] viz. column lower/upper bounds, nullValues, distinct value counts, etc. and that table
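The file-skipping idea behind those per-column lower/upper bounds can be sketched as follows: a file can be pruned during planning when the predicate value cannot fall inside the file's recorded [lower, upper] range. The data layout and helper below are illustrative only, not Iceberg's Metrics or evaluator APIs.

```python
# Sketch: pruning whole files from a scan using per-file column bounds,
# the kind of statistics Metrics.java tracks (illustrative data, not
# Iceberg's actual manifest format).
files = [
    {"path": "a.parquet", "lower": {"ts": 100}, "upper": {"ts": 200}},
    {"path": "b.parquet", "lower": {"ts": 300}, "upper": {"ts": 400}},
]


def may_contain(f, col, value):
    """True if an equality predicate col == value could match rows in f."""
    return f["lower"][col] <= value <= f["upper"][col]


# Planning "ts == 350" only needs to produce splits for b.parquet.
matches = [f["path"] for f in files if may_contain(f, "ts", 350)]
print(matches)  # -> ['b.parquet']
```

This is the same min/max principle Parquet applies per row group, lifted to the manifest level so non-matching files never generate splits at all.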