Re: merge-on-read?

2018-12-07 Thread Erik Wright
Owen, something similar has come up in a roadmap discussion of mine. I have a question about the solution you mentioned. "The requirements would be that there is a 1:1 mapping between rows in the matching files and stripes." Were you thinking that there would really be a 1:1 mapping and that th

[GitHub] rdblue opened a new pull request #30: Spark 2.4

2018-12-07 Thread GitBox
rdblue opened a new pull request #30: Spark 2.4 URL: https://github.com/apache/incubator-iceberg/pull/30 This updates the Spark dependency to 2.4.0. Changes include: * Remove ORC support that uses a now-private Spark API (BufferHolder) * Use Spark filters instead of expressions,

[GitHub] rdblue opened a new issue #31: Add startsWith predicate

2018-12-07 Thread GitBox
rdblue opened a new issue #31: Add startsWith predicate URL: https://github.com/apache/incubator-iceberg/issues/31 Some users have requested prefix matching or startsWith. This issue was migrated from https://github.com/Netflix/iceberg/issues/49.

[GitHub] rdblue opened a new issue #32: Ignore unsupported partition fields

2018-12-07 Thread GitBox
rdblue opened a new issue #32: Ignore unsupported partition fields URL: https://github.com/apache/incubator-iceberg/issues/32 Iceberg may add new transforms to the partition spec. When a transform is not recognized, Iceberg should ignore the field so that the format is forward-compatible w
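A minimal sketch of the forward-compatibility idea in issue #32, assuming a partition spec is a list of fields that each name a transform. The `KNOWN_TRANSFORMS` set, the dictionary layout, and the `zorder` transform are hypothetical and are not Iceberg's actual API or spec format.

```python
from typing import Dict, List, Tuple

# Illustrative set of transform names this (hypothetical) reader understands.
KNOWN_TRANSFORMS = frozenset({"identity", "year", "month", "day", "hour"})

Field = Dict[str, str]


def usable_partition_fields(spec: List[Field]) -> Tuple[List[Field], List[Field]]:
    """Split a partition spec into fields the reader supports and fields it ignores.

    Fields with an unrecognized transform are skipped instead of raising,
    so a spec written by a newer version can still be read.
    """
    supported: List[Field] = []
    ignored: List[Field] = []
    for field in spec:
        if field["transform"] in KNOWN_TRANSFORMS:
            supported.append(field)
        else:
            ignored.append(field)
    return supported, ignored


spec = [
    {"name": "event_day", "source": "ts", "transform": "day"},
    {"name": "loc_z", "source": "location", "transform": "zorder"},  # made-up future transform
]
supported, ignored = usable_partition_fields(spec)
```

A reader that ignores a field can no longer prune files on it, but it can still plan a correct scan from the fields it does understand.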

[GitHub] rdblue opened a new issue #33: Add Avro support to Pig reader

2018-12-07 Thread GitBox
rdblue opened a new issue #33: Add Avro support to Pig reader URL: https://github.com/apache/incubator-iceberg/issues/33 The initial implementation for Pig only supports Parquet. We need an Avro read path. This issue was migrated from https://github.com/Netflix/iceberg/issues/51

[GitHub] rdblue opened a new issue #34: Add operation details to snapshots

2018-12-07 Thread GitBox
rdblue opened a new issue #34: Add operation details to snapshots URL: https://github.com/apache/incubator-iceberg/issues/34 Snapshots that are append-only can be aged off more aggressively than deletes because all data files must be tracked in the next snapshot. Adding an operation type a

[GitHub] rdblue opened a new issue #35: Implement strict projection in more transforms

2018-12-07 Thread GitBox
rdblue opened a new issue #35: Implement strict projection in more transforms URL: https://github.com/apache/incubator-iceberg/issues/35 Strict projection isn't required and wasn't implemented for several of the partitioning transformations. When strict projection isn't implemented (the `p

[GitHub] rdblue opened a new issue #36: Split files when planning scan tasks

2018-12-07 Thread GitBox
rdblue opened a new issue #36: Split files when planning scan tasks URL: https://github.com/apache/incubator-iceberg/issues/36 When building a scan, the TableScan API can plan the files to read (`planFiles`) or group the files into combined splits (`planTasks`). Split planning should also
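One way to picture the request in issue #36 is to split each file into chunks first and then group the chunks into combined tasks. The sketch below assumes a fixed target size and a simple greedy grouping; `split_file` and `plan_combined_tasks` are hypothetical names, not the `TableScan` API.

```python
from typing import List, Tuple

Split = Tuple[str, int, int]  # (path, start offset, length)


def split_file(path: str, length: int, target: int) -> List[Split]:
    """Cut one file into consecutive chunks of at most `target` bytes."""
    return [(path, start, min(target, length - start))
            for start in range(0, length, target)]


def plan_combined_tasks(files: List[Tuple[str, int]], target: int) -> List[List[Split]]:
    """Greedily pack splits into tasks whose total size stays within `target`."""
    tasks: List[List[Split]] = []
    current: List[Split] = []
    current_size = 0
    for path, length in files:
        for split in split_file(path, length, target):
            # Start a new task when adding this split would exceed the target.
            if current and current_size + split[2] > target:
                tasks.append(current)
                current, current_size = [], 0
            current.append(split)
            current_size += split[2]
    if current:
        tasks.append(current)
    return tasks


tasks = plan_combined_tasks([("a", 250), ("b", 30), ("c", 40)], target=100)
```

With splitting in place, one large file no longer forces one oversized task, and small files can still share a task.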

[GitHub] rdblue opened a new issue #37: Add split offsets to manifest files

2018-12-07 Thread GitBox
rdblue opened a new issue #37: Add split offsets to manifest files URL: https://github.com/apache/incubator-iceberg/issues/37 Instead of storing a single HDFS block size for each data file, Iceberg should store a list of split offsets. That will allow split planning to be more precise by u
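To illustrate the difference issue #37 describes, the sketch below compares splits cut at a fixed block size with splits cut at recorded offsets. It assumes the offsets mark positions where a reader can safely start (for example, the start of a row group or stripe); that interpretation and both function names are assumptions, not the manifest format.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start offset, length)


def splits_by_fixed_size(file_length: int, split_size: int) -> List[Range]:
    """Cut a file at multiples of `split_size`, ignoring its internal layout."""
    return [(start, min(split_size, file_length - start))
            for start in range(0, file_length, split_size)]


def splits_by_offsets(file_length: int, offsets: List[int]) -> List[Range]:
    """Cut a file exactly at the recorded offsets; each split runs to the next offset."""
    starts = sorted(o for o in offsets if 0 <= o < file_length)
    ends = starts[1:] + [file_length]
    return [(start, end - start) for start, end in zip(starts, ends)]


fixed = splits_by_fixed_size(250, 100)
precise = splits_by_offsets(250, [4, 120, 200])
```

Fixed-size splits can begin in the middle of a storage unit, whereas offset-based splits line up with the recorded boundaries.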

[GitHub] rdblue opened a new issue #38: Add column comments to Iceberg schemas

2018-12-07 Thread GitBox
rdblue opened a new issue #38: Add column comments to Iceberg schemas URL: https://github.com/apache/incubator-iceberg/issues/38 Iceberg schemas should allow storing comments as documentation for struct fields.

[GitHub] dongjoon-hyun commented on a change in pull request #30: Update to Spark 2.4

2018-12-07 Thread GitBox
dongjoon-hyun commented on a change in pull request #30: Update to Spark 2.4 URL: https://github.com/apache/incubator-iceberg/pull/30#discussion_r239900431 ## File path: build.gradle ## @@ -302,7 +300,7 @@ project(':iceberg-presto-runtime') { shadow "org.apache.avr

[GitHub] rdblue commented on issue #38: Add column comments to Iceberg schemas

2018-12-07 Thread GitBox
rdblue commented on issue #38: Add column comments to Iceberg schemas URL: https://github.com/apache/incubator-iceberg/issues/38#issuecomment-445321961 @govi20, I just saw your comment on the old issue. If you're still interested in working on this, feel free to open a PR! You'll ne

[GitHub] rdblue opened a new issue #39: Add in and notIn predicates

2018-12-07 Thread GitBox
rdblue opened a new issue #39: Add in and notIn predicates URL: https://github.com/apache/incubator-iceberg/issues/39 Currently, set inclusion is implemented using a tree of `equals` predicates joined with `or` predicates. It would be much more efficient to add support for `in` and `notIn`
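The contrast described in issue #39 can be sketched with plain Python predicates: set inclusion built as a tree of `equals` joined with `or`, versus a single membership test. Every name below is hypothetical and stands in for the expression API rather than reproducing it.

```python
from functools import reduce
from typing import Any, Callable, Dict, Iterable

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def equals(column: str, value: Any) -> Predicate:
    return lambda row: row.get(column) == value


def or_(left: Predicate, right: Predicate) -> Predicate:
    return lambda row: left(row) or right(row)


def in_as_or_tree(column: str, values: Iterable[Any]) -> Predicate:
    """Set inclusion as nested `or` nodes: one comparison per value, nesting grows with the set."""
    preds = [equals(column, v) for v in values]
    if not preds:
        return lambda row: False
    return reduce(or_, preds)


def in_set(column: str, values: Iterable[Any]) -> Predicate:
    """Set inclusion as one predicate backed by a hash set lookup."""
    allowed = frozenset(values)
    return lambda row: row.get(column) in allowed


def not_in_set(column: str, values: Iterable[Any]) -> Predicate:
    excluded = frozenset(values)
    return lambda row: row.get(column) not in excluded


row = {"id": 2}
tree_result = in_as_or_tree("id", [1, 2, 3])(row)
set_result = in_set("id", [1, 2, 3])(row)
```

Both forms give the same answer. The set-based form evaluates in roughly constant time per row and keeps the expression tree flat, which is the efficiency gain the issue points to.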

[GitHub] rdblue commented on issue #39: Add in and notIn predicates

2018-12-07 Thread GitBox
rdblue commented on issue #39: Add in and notIn predicates URL: https://github.com/apache/incubator-iceberg/issues/39#issuecomment-44543 There's some discussion on the linked issue from the old Netflix repository.

[GitHub] rdblue opened a new issue #40: Add external schema mappings for files written with name-based schemas

2018-12-07 Thread GitBox
rdblue opened a new issue #40: Add external schema mappings for files written with name-based schemas URL: https://github.com/apache/incubator-iceberg/issues/40 Files written by Iceberg writers contain Iceberg field IDs that are used for column projection. Iceberg doesn't currently support
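As a rough illustration of what an external schema mapping could look like for issue #40, the sketch below assumes the mapping is a dictionary from column name to field ID and that projection is done by ID. The function name and the data layout are hypothetical.

```python
from typing import Any, Dict, Set


def project_by_field_id(record: Dict[str, Any],
                        name_to_id: Dict[str, int],
                        wanted_ids: Set[int]) -> Dict[int, Any]:
    """Project a name-keyed record onto the requested field IDs.

    Columns with no entry in the mapping are dropped, because there is
    no ID to match them against.
    """
    return {
        name_to_id[name]: value
        for name, value in record.items()
        if name in name_to_id and name_to_id[name] in wanted_ids
    }


# A record from a file that carries column names only, with no field IDs.
record = {"user": "a", "ts": 5, "extra": 1}
mapping = {"user": 1, "ts": 2}  # hypothetical external mapping
projected = project_by_field_id(record, mapping, wanted_ids={2})
```

Keeping the mapping outside the data files means the files do not need to be rewritten to take part in ID-based projection.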

[GitHub] rdblue commented on issue #40: Add external schema mappings for files written with name-based schemas

2018-12-07 Thread GitBox
rdblue commented on issue #40: Add external schema mappings for files written with name-based schemas URL: https://github.com/apache/incubator-iceberg/issues/40#issuecomment-445411381 An incomplete implementation is available as a PR in the Netflix repository: https://github.com/Netflix/i

[GitHub] rdblue edited a comment on issue #39: Add in and notIn predicates

2018-12-07 Thread GitBox
rdblue edited a comment on issue #39: Add in and notIn predicates URL: https://github.com/apache/incubator-iceberg/issues/39#issuecomment-44543 There's some discussion on the linked issue from the old Netflix repository, and there is also an incomplete PR that is a good starting point:

[GitHub] rdblue opened a new issue #41: Add an API to maintain external schema mappings

2018-12-07 Thread GitBox
rdblue opened a new issue #41: Add an API to maintain external schema mappings URL: https://github.com/apache/incubator-iceberg/issues/41 Once Iceberg supports external schema mappings (#40), it should also support an easy way to maintain those mappings by notifying Iceberg when an external

[GitHub] rdblue commented on issue #31: Add startsWith predicate

2018-12-07 Thread GitBox
rdblue commented on issue #31: Add startsWith predicate URL: https://github.com/apache/incubator-iceberg/issues/31#issuecomment-445411845 The Netflix repository had an incomplete PR that is a good starting point: https://github.com/Netflix/iceberg/pull/78

[GitHub] rdblue opened a new issue #42: Add an action to cherry-pick changes in a snapshot and apply them on another snapshot

2018-12-07 Thread GitBox
rdblue opened a new issue #42: Add an action to cherry-pick changes in a snapshot and apply them on another snapshot URL: https://github.com/apache/incubator-iceberg/issues/42 In an audit workflow, new data is written to an orphan snapshot that is not committed as the table's current state

[GitHub] rdblue opened a new issue #9: Vectorize reads and deserialize to Arrow

2018-12-07 Thread GitBox
rdblue opened a new issue #9: Vectorize reads and deserialize to Arrow URL: https://github.com/apache/incubator-iceberg/issues/9 Iceberg does not use vectorized reads to produce data for Spark. For cases where Spark can use its vectorized read path (flat schemas, no evolution) Spark will b

[GitHub] rdblue closed issue #9: Vectorize reads and deserialize to Arrow

2018-12-07 Thread GitBox
rdblue closed issue #9: Vectorize reads and deserialize to Arrow URL: https://github.com/apache/incubator-iceberg/issues/9 This is an automated message from the Apache Git Service. To respond to the message, please log on Git

[GitHub] rdblue commented on issue #9: Vectorize reads and deserialize to Arrow

2018-12-07 Thread GitBox
rdblue commented on issue #9: Vectorize reads and deserialize to Arrow URL: https://github.com/apache/incubator-iceberg/issues/9#issuecomment-445412220 There's more context and discussion on the issue in the old Netflix project: https://github.com/Netflix/iceberg/issues/90

[GitHub] rdblue opened a new issue #43: Support snapshot selection in Spark read options

2018-12-07 Thread GitBox
rdblue opened a new issue #43: Support snapshot selection in Spark read options URL: https://github.com/apache/incubator-iceberg/issues/43 Spark passes query options from `DataFrameReader` to the Iceberg source. Iceberg should support selecting a specific snapshot ID or the table state at

[GitHub] rdblue opened a new issue #44: Support cryptographic integrity

2018-12-07 Thread GitBox
rdblue opened a new issue #44: Support cryptographic integrity URL: https://github.com/apache/incubator-iceberg/issues/44 Parquet encryption protects the integrity of individual data files. However, in untrusted storage, removal of one or more data files in a table might go unnoticed. Replac