Owen, something similar has come up in a roadmap discussion of mine. I have
a question about the solution you mentioned.
> The requirements would be that there is a 1:1 mapping between rows in the
> matching files and stripes.
>
Were you thinking that there would really be a 1:1 mapping and that th
rdblue opened a new pull request #30: Spark 2.4
URL: https://github.com/apache/incubator-iceberg/pull/30
This updates the Spark dependency to 2.4.0.
Changes include:
* Remove ORC support that uses a now-private Spark API (BufferHolder)
* Use Spark filters instead of expressions,
rdblue opened a new issue #31: Add startsWith predicate
URL: https://github.com/apache/incubator-iceberg/issues/31
Some users have requested prefix matching or startsWith.
This issue was migrated from https://github.com/Netflix/iceberg/issues/49.
rdblue opened a new issue #32: Ignore unsupported partition fields
URL: https://github.com/apache/incubator-iceberg/issues/32
Iceberg may add new transforms to the partition spec. When a transform is
not recognized, Iceberg should ignore the field so that the format is
forward-compatible w
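
A minimal sketch of the forward-compatibility idea in issue #32, written in Python for brevity rather than in Iceberg's own code. The dictionary layout of a partition field (`name`, `transform`) and the helper names are assumptions made for illustration, not Iceberg's API; the transform names listed are the ones Iceberg's partitioning documents.

```python
import re

# Transforms this (hypothetical) reader understands.
_SIMPLE_TRANSFORMS = {"identity", "year", "month", "day", "hour"}
_PARAMETERIZED = re.compile(r"^(bucket|truncate)\[\d+\]$")


def is_known_transform(transform):
    """Return True if the transform string is one this reader recognizes."""
    return transform in _SIMPLE_TRANSFORMS or bool(_PARAMETERIZED.match(transform))


def usable_partition_fields(spec_fields):
    """Keep recognized fields and silently skip the rest.

    Skipping an unknown field only loses a pruning opportunity; it does not
    change which rows a scan returns, so older readers stay correct.
    """
    return [f for f in spec_fields if is_known_transform(f["transform"])]


spec = [
    {"name": "ts_day", "transform": "day"},
    {"name": "id_bucket", "transform": "bucket[16]"},
    {"name": "future_field", "transform": "some-future-transform"},  # made-up name
]
usable = usable_partition_fields(spec)
```

Here `usable` holds only the `day` and `bucket[16]` fields; the unrecognized one is dropped instead of raising an error.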
rdblue opened a new issue #33: Add Avro support to Pig reader
URL: https://github.com/apache/incubator-iceberg/issues/33
The initial implementation for Pig only supports Parquet. We need an Avro
read path.
This issue was migrated from https://github.com/Netflix/iceberg/issues/51
rdblue opened a new issue #34: Add operation details to snapshots
URL: https://github.com/apache/incubator-iceberg/issues/34
Snapshots that are append-only can be aged off more aggressively than
deletes because all data files must be tracked in the next snapshot. Adding an
operation type a
rdblue opened a new issue #35: Implement strict projection in more transforms
URL: https://github.com/apache/incubator-iceberg/issues/35
Strict projection isn't required and wasn't implemented for several of the
partitioning transformations. When strict projection isn't implemented (the
`p
rdblue opened a new issue #36: Split files when planning scan tasks
URL: https://github.com/apache/incubator-iceberg/issues/36
When building a scan, the TableScan API can plan the files to read
(`planFiles`) or group the files into combined splits (`planTasks`). Split
planning should also
rdblue opened a new issue #37: Add split offsets to manifest files
URL: https://github.com/apache/incubator-iceberg/issues/37
Instead of storing a single HDFS block size for each data file, Iceberg
should store a list of split offsets. That will allow split planning to be more
precise by u
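
A rough sketch of how a list of split offsets could drive split planning, as issue #37 proposes. This is Python for illustration only; `plan_splits` and its arguments are hypothetical names, and the choice to always start the first split at byte 0 is an assumption, not something the issue states.

```python
def plan_splits(file_length, split_offsets):
    """Turn recorded split offsets into (start, length) byte ranges.

    Offsets outside the file are ignored, and a split starting at 0 is
    added if missing so the ranges cover the whole file (an assumption).
    """
    offsets = sorted(o for o in set(split_offsets) if 0 <= o < file_length)
    if not offsets or offsets[0] != 0:
        offsets.insert(0, 0)
    ends = offsets[1:] + [file_length]
    return [(start, end - start) for start, end in zip(offsets, ends)]


splits = plan_splits(100, [0, 40, 80])
# splits == [(0, 40), (40, 40), (80, 20)]
```

Because each range begins at a recorded offset, a reader never has to guess boundaries from a single block size.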
rdblue opened a new issue #38: Add column comments to Iceberg schemas
URL: https://github.com/apache/incubator-iceberg/issues/38
Iceberg schemas should allow storing comments as documentation for struct
fields.
dongjoon-hyun commented on a change in pull request #30: Update to Spark 2.4
URL: https://github.com/apache/incubator-iceberg/pull/30#discussion_r239900431
##
File path: build.gradle
##
@@ -302,7 +300,7 @@ project(':iceberg-presto-runtime') {
shadow "org.apache.avr
rdblue commented on issue #38: Add column comments to Iceberg schemas
URL: https://github.com/apache/incubator-iceberg/issues/38#issuecomment-445321961
@govi20, I just saw your comment on the old issue. If you're still
interested in working on this, feel free to open a PR!
You'll ne
rdblue opened a new issue #39: Add in and notIn predicates
URL: https://github.com/apache/incubator-iceberg/issues/39
Currently, set inclusion is implemented using a tree of `equals` predicates
joined with `or` predicates. It would be much more efficient to add support for
`in` and `notIn`
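
To illustrate the efficiency argument in issue #39, here is a small Python sketch contrasting the two shapes of predicate. The function names are hypothetical and this is not Iceberg's expression API; it only shows why a chain of `equals` joined with `or` costs one comparison per value while a set lookup does not.

```python
from functools import reduce


def equals(value):
    """Predicate that matches a single value."""
    return lambda x: x == value


def or_(left, right):
    """Predicate that matches when either child matches."""
    return lambda x: left(x) or right(x)


def in_as_or_tree(values):
    """Set inclusion as nested `or` over `equals`: up to len(values) checks per row."""
    return reduce(or_, (equals(v) for v in values), lambda x: False)


def in_as_set(values):
    """Set inclusion as one hash lookup per row."""
    lookup = frozenset(values)
    return lambda x: x in lookup


def not_in_as_set(values):
    """Negated set inclusion, also one hash lookup per row."""
    lookup = frozenset(values)
    return lambda x: x not in lookup


tree_pred = in_as_or_tree([1, 2, 3])
set_pred = in_as_set([1, 2, 3])
not_pred = not_in_as_set([1, 2, 3])
```

Both `tree_pred` and `set_pred` accept the same values; they differ only in how much work each evaluation does.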
rdblue commented on issue #39: Add in and notIn predicates
URL: https://github.com/apache/incubator-iceberg/issues/39#issuecomment-44543
There's some discussion on the linked issue from the old Netflix repository.
This is
rdblue opened a new issue #40: Add external schema mappings for files written
with name-based schemas
URL: https://github.com/apache/incubator-iceberg/issues/40
Files written by Iceberg writers contain Iceberg field IDs that are used for
column projection. Iceberg doesn't currently support
rdblue commented on issue #40: Add external schema mappings for files written
with name-based schemas
URL: https://github.com/apache/incubator-iceberg/issues/40#issuecomment-445411381
An incomplete implementation is available as a PR in the Netflix repository:
https://github.com/Netflix/i
rdblue edited a comment on issue #39: Add in and notIn predicates
URL: https://github.com/apache/incubator-iceberg/issues/39#issuecomment-44543
There's some discussion on the linked issue from the old Netflix repository,
and there is also an incomplete PR that is a good starting point:
rdblue opened a new issue #41: Add an API to maintain external schema mappings
URL: https://github.com/apache/incubator-iceberg/issues/41
Once Iceberg supports external schema mappings (#40), it should also support
an easy way to maintain those mappings by notifying Iceberg when an external
rdblue commented on issue #31: Add startsWith predicate
URL: https://github.com/apache/incubator-iceberg/issues/31#issuecomment-445411845
The Netflix repository had an incomplete PR that is a good starting point:
https://github.com/Netflix/iceberg/pull/78
rdblue opened a new issue #42: Add an action to cherry-pick changes in a
snapshot and apply them on another snapshot
URL: https://github.com/apache/incubator-iceberg/issues/42
In an audit workflow, new data is written to an orphan snapshot that is not
committed as the table's current state
rdblue opened a new issue #9: Vectorize reads and deserialize to Arrow
URL: https://github.com/apache/incubator-iceberg/issues/9
Iceberg does not use vectorized reads to produce data for Spark. For cases
where Spark can use its vectorized read path (flat schemas, no evolution) Spark
will b
rdblue closed issue #9: Vectorize reads and deserialize to Arrow
URL: https://github.com/apache/incubator-iceberg/issues/9
This is an automated message from the Apache Git Service.
To respond to the message, please log on Git
rdblue commented on issue #9: Vectorize reads and deserialize to Arrow
URL: https://github.com/apache/incubator-iceberg/issues/9#issuecomment-445412220
There's more context and discussion on the issue in the old Netflix project:
https://github.com/Netflix/iceberg/issues/90
rdblue opened a new issue #43: Support snapshot selection in Spark read options
URL: https://github.com/apache/incubator-iceberg/issues/43
Spark passes query options from `DataFrameReader` to the Iceberg source.
Iceberg should support selecting a specific snapshot ID or the table state at
rdblue opened a new issue #44: Support cryptographic integrity
URL: https://github.com/apache/incubator-iceberg/issues/44
Parquet encryption protects the integrity of individual data files. However, in
untrusted storage, removal of one or more data files in a table might go
unnoticed. Replac