Re: Question about replacing files and about Publishing Jars

2019-02-26 Thread Jacques Nadeau
We're using ETag for better clarity on this at Dremio (for a different use
case). I wonder if the same thing should be available in Iceberg.
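
For context, the idea is a conditional check against the store's version
tag. A minimal sketch (hypothetical helper, not Dremio or Iceberg code),
assuming an HTTP-fronted object store that returns ETag headers:

import java.net.HttpURLConnection;
import java.net.URL;

public class EtagCheck {
  // Returns true if the file's current ETag still matches the one recorded
  // when the snapshot was committed, i.e. the file was not replaced in place.
  public static boolean unchanged(String fileUrl, String etagAtCommit) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
    conn.setRequestMethod("HEAD");                // fetch metadata only, no body
    String current = conn.getHeaderField("ETag"); // store-assigned version tag
    conn.disconnect();
    return etagAtCommit.equals(current);
  }
}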
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, Feb 26, 2019 at 9:48 AM Ryan Blue  wrote:

> Hi Arvind,
>
> Iceberg assumes that all file locations are unique. If two snapshots refer
> to the same location, then whatever data file (or version) is in that
> location is what is read. What is your use case?
>
> Apache Iceberg has no official releases yet. We still need to do some
> license work for binaries, get the build set up for Apache publication,
> finish a few more PRs, and rename packages. In the meantime, you can use
> JitPack to build binaries for specific commits. That should allow you to
> easily test the project if you don't want to build it yourself.
>
> On Mon, Feb 25, 2019 at 6:27 PM Arvind Pruthi 
> wrote:
>
>> Hello There,
>>
>>
>>
>> Q1. What happens In case a file is deleted and a new file is to be added
>> with the same name, but the snapshot in which the delete was registered is
>> still around? There is no ambiguity from listing the manifest entries point
>> of view. However, there will be ambiguity at the Hdfs level. How is that
>> resolved? Also any thoughts on if a file needs to be replaced with a
>> different file with the same name (We have a use case for this)?
>>
>>
>>
>> Q2. Are the iceberg jars being published anywhere? I couldn’t find them
>> in maven central.
>>
>>
>>
>> Thanks,
>>
>> Arvind
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Community code reviews

2019-02-26 Thread Jacques Nadeau
I'm +1 (non-binding) if you allow a window for review (for example, I think
others have suggested 1-2 business days before a self-+1). The post, self-+1,
merge-in-two-minutes pattern is not a great situation for anyone.
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, Feb 26, 2019 at 4:51 PM Ryan Blue  wrote:

> Hi everyone,
>
> I’d like to give a shout out to some of the awesome people that have
> joined this community and taken the time to review pull requests: Matt
> Cheah, Anton Okolnychyi, Ratandeep Ratti, Filip Bocse, and Uwe Korn. Thanks
> to all of you!
>
> This work is really helpful to growing the community and is a significant step
> toward becoming a committer.
>
> Since we have such great community support, I’d like to suggest an option
> for getting pull requests merged more quickly while we’re in the current
> phase. We don’t have many committers to review pull requests, but we do
> have several people on that path. I suggest we allow committers to merge
> their own pull requests if they are reviewed by the community.
>
> I think this could be helpful, but I have seen it go wrong in the past
> when people from the same company don’t make good faith reviews and instead
> +1 a PR just to get it in. That said, I think we can address that problem
> if and when it happens. We can also limit this policy to this year, after
> which we should have more committers.
>
> What does everyone think?
>
> If there’s enough support on this thread, I’ll start a vote.
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Updates/Deletes/Upserts in Iceberg

2019-05-07 Thread Jacques Nadeau
Awesome. This was on my list for some time. Glad you got it started.

On Wed, May 8, 2019, 3:42 AM Anton Okolnychyi 
wrote:

> Hi folks,
>
> Miguel (cc) and I have spent some time thinking about how to perform
> updates/deletes/upserts on top of Iceberg tables. This functionality is
> essential for many modern use cases. We've summarized our ideas in a doc
> [1], which, hopefully, will trigger a discussion in the community. The
> document presents different conceptual approaches alongside their
> trade-offs. We will be glad to consider any other ideas as well.
>
> Thanks,
> Anton
>
> [1] -
> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/
>
>
>


Re: Updates/Deletes/Upserts in Iceberg

2019-05-09 Thread Jacques Nadeau
This is a nice doc and it covers many different options. Upon first skim, I
don't see a strong argument for a particular approach.

In our own development, we've been leaning heavily towards what you
describe in the document as "lazy with SRI". I believe this is consistent
with what the Hive community did on top of Orc. It's interesting because my
(maybe incorrect) understanding of the Databricks Delta approach is they
chose what you title "eager" in their approach to upserts. They may also
have a lazy approach for other types of mutations, but I don't think they do.
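
To make the distinction concrete, here is a toy sketch (hypothetical types,
not Iceberg, Hive, or Delta code) of the two strategies for a row-level
delete:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MutationStyles {
  // Eager: rewrite the affected data file without the deleted rows; the
  // next snapshot points at the rewritten file, so reads do no extra work.
  static List<String> eagerDelete(List<String> dataFile, Set<String> deletedKeys) {
    List<String> rewritten = new ArrayList<>();
    for (String row : dataFile) {
      if (!deletedKeys.contains(row)) {
        rewritten.add(row);
      }
    }
    return rewritten;
  }

  // Lazy (merge-on-read): leave the data file untouched, record the delete
  // in a small delta, and filter at read time until compaction runs.
  static List<String> lazyRead(List<String> dataFile, Set<String> deleteDelta) {
    List<String> visible = new ArrayList<>();
    for (String row : dataFile) {
      if (!deleteDelta.contains(row)) {
        visible.add(row);
      }
    }
    return visible;
  }
}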

Thanks again for putting this together!
Jacques
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi
 wrote:

> Hi folks,
>
> Miguel (cc) and I have spent some time thinking about how to perform
> updates/deletes/upserts on top of Iceberg tables. This functionality is
> essential for many modern use cases. We've summarized our ideas in a doc
> [1], which, hopefully, will trigger a discussion in the community. The
> document presents different conceptual approaches alongside their
> trade-offs. We will be glad to consider any other ideas as well.
>
> Thanks,
> Anton
>
> [1] -
> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/
>
>
>


Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Jacques Nadeau
I think we just need to have further discussion about keys. Ryan said:

3. Synthetic keys should be based on filename and position


But I'm not clear there is consensus around that. I'm also not sure whether
he means lossless inclusion, simply derived-from, or something else. My
earlier thinking was that you must have synthetic keys in many cases, since
solving concurrency becomes challenging otherwise.
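
As one concrete reading of that proposal, a synthetic key could be as
simple as the pair below (hypothetical class, not an Iceberg API): it is
unique as long as file locations are unique, and it carries no natural-key
semantics.

import java.util.Objects;

public final class SyntheticKey {
  final String fileName; // data file containing the row
  final long position;   // row ordinal within that file

  SyntheticKey(String fileName, long position) {
    this.fileName = fileName;
    this.position = position;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof SyntheticKey
        && position == ((SyntheticKey) o).position
        && fileName.equals(((SyntheticKey) o).fileName);
  }

  @Override
  public int hashCode() {
    return Objects.hash(fileName, position);
  }
}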

Erik said:

> C might not be partitioned or sorted by the natural key, either, which
> means that finding the synthetic key can be expensive.


But I'm not sure why this needs to be true. If we give control over the
creation of the synthetic key, wouldn't that resolve this issue?

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, May 21, 2019 at 7:54 AM Erik Wright 
wrote:

> On Thu, May 16, 2019 at 4:13 PM Ryan Blue  wrote:
>
>> Replies inline.
>>
>> On Thu, May 16, 2019 at 10:07 AM Erik Wright 
>> wrote:
>>
>>> I would be happy to participate. Iceberg with merge-on-read capabilities
>>> is a technology choice that my team is actively considering. It appears
>>> that our scenario differs meaningfully from the one that Anton and Miguel
>>> are considering. It would be great to take the time to compare the two and
>>> see if there is a single implementation that can meet the needs of each
>>> scenario.
>>>
>>
>> Can you be more specific about where the use cases differ meaningfully? I
>> think that if we agree that operations on natural keys can be implemented
>> using synthetic keys to encode deletes (#2), then everyone is aligned on
>> the core parts of a design. We can figure out the implications of how
>> synthetic keys are encoded, but I don't see that issue (#3) having a huge
>> impact on use cases. So is #2 the main disagreement?
>>
>
> We are mainly interested in upserts and deletes by natural key. We are not
> interested in more powerful queries of the types mentioned in the doc.
>
> On the other hand, in our cases, the upserts and deletes are generated
> upstream. In other words, I have an incremental transformation job J1 with
> inputs A and B, producing an output C. I can generate the upserts and
> deletes directly from my inputs, without referring to the current state of
> C.
>
> C might not be partitioned or sorted by the natural key, either, which
> means that finding the synthetic key can be expensive.
>
> In our bespoke model, we are able to generate a new delta (and append it
> to the dataset) without referring to the existing base or existing deltas.
>
> Another apparent divergence is that my jobs are incremental and want to be
> able to track the state they have seen so far in order to consume only new
> deltas in their next execution. So if job J2 consumes C, it needs to know
> that the deltas since its last read will not have been compacted into the
> base. This implies support for multiple generations of deltas, as different
> consumers could be at different points in the stream of deltas.
>
>
>> On Wed, May 15, 2019 at 3:55 PM Ryan Blue 
>>> wrote:
>>>
>>>> *2. Iceberg diff files should use synthetic keys*
>>>>
>>>> A lot of the discussion on the doc is about whether natural keys are
>>>> practical or what assumptions we can make or trade about them. In my
>>>> opinion, Iceberg tables will absolutely need natural keys for reasonable
>>>> use cases. And those natural keys will need to be unique. And Iceberg will
>>>> need to rely on engines to enforce that uniqueness.
>>>>
>>>> But, there is a difference between table behavior and implementation.
>>>> We can use synthetic keys to implement the requirements of natural keys.
>>>> Each row should be identified by its file and position in a file. When
>>>> deleting by a natural key, we just need to find out what the synthetic key
>>>> is and encode that in the delete diff.
>>>>
>>> This comment has important implications for the effort required to
>>> generate delete diff files. I've tried to cover why in comments I added
>>> today to the doc, but it could also be a topic of the hangout.
>>>
>>
>> Do you mean that you can't encode a delete without reading data to locate
>> the affected rows?
>>
>
> Yes.
>
>
>>
>> *3. Synthetic keys should be based on filename and position*
>>>>
>>>> I think identifying the file in a synthetic key makes a lot of sense.
>>>> This would allow for delta file reuse as individual files are rewritten by
>>>> a “major” compaction and provides

Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Jacques Nadeau
It would be useful to describe the types of concurrent operations that
> would be supported (i.e., failed snapshotting could easily be recovered,
> vs. the whole operation needing to be re-executed) vs. those that wouldn't.
> Solving for unlimited concurrency cases may create way more complexity than
> is necessary.
>

I'd like to restate my comment a little bit. We need unique keys to make
things work. They can be synthetic or not, but they should not have any
retrievable Iceberg-related data in them.

The main thing I'm talking about is how you target a deletion across time.
If you have a file A, and you want to delete record X in A, you define
delete A.X. At the same time, another process may be compacting A into A'.
In so doing, the position of A.X in A' is something other than X. At this
point, the deletion needs to be rerun against A' so that we can ensure that
the deletion is propagated forward. If the only thing you have is A.X, you
need to have some way of getting to the same location in A'. You should be
able to take the delta file that lists the delete of A.X and apply it
directly to A' without having to also consult A. If you didn't need to
solve this problem, then you could simply use A.X as opposed to the key of
A.X in your delta files.
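
To spell out the cost being described: if delete files reference positions
in A, a compaction of A into A' forces a translation step like the sketch
below (hypothetical helper, not Iceberg code). Stable keys that carry no
file/position information avoid this rewrite entirely.

import java.util.HashSet;
import java.util.Set;

public class DeleteRemap {
  // positionMap[i] holds the position in A' of the row that sat at
  // position i in A, or -1 if the compaction itself dropped that row.
  static Set<Long> remap(Set<Long> deletesOnA, long[] positionMap) {
    Set<Long> deletesOnAPrime = new HashSet<>();
    for (long pos : deletesOnA) {
      long newPos = positionMap[(int) pos];
      if (newPos >= 0) {
        deletesOnAPrime.add(newPos);
      }
    }
    return deletesOnAPrime; // now applies directly to A'
  }
}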

Synthetic seems relative. If the synthetic key is client-supplied, in what
> way is it relevant to Iceberg whether it is synthetic vs. natural? By
> calling it synthetic within Iceberg there is a strong implication that it
> is the implementation that generates it (the filename/position key suggests
> that). If it's the client that supplies it, it _may_ be synthetic (from the
> point of view of the overall data model; i.e. a customer key in a database
> vs. a customer ID that shows up on a bill) but in Iceberg's case that
> doesn't matter. Only the unicity constraint does.
>

I agree with the main statement here: the only real requirement is keys
need to be unique across all existing snapshots. There could be two
generators: one that uses an iceberg internal behavior to generate keys and
one that is user-definable. While there could be a third that uses an
existing field (or set of fields) to define the key, I think we should
probably avoid implementing this, as it has a whole other set of problems
that are best left outside of Iceberg's area of concern.
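
A rough sketch of what those two generators could look like (hypothetical
interface, not an Iceberg API); the only contract either must honor is
uniqueness across all existing snapshots:

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

interface KeyGenerator {
  String nextKey();
}

// Internal-style generator: opaque, nothing about the table is recoverable
// from the key itself.
class RandomKeyGenerator implements KeyGenerator {
  public String nextKey() {
    return UUID.randomUUID().toString();
  }
}

// User-defined generator: the caller owns the uniqueness guarantee, e.g.
// via a writer-scoped prefix plus a local counter.
class PrefixedKeyGenerator implements KeyGenerator {
  private final String writerId;
  private final AtomicLong counter = new AtomicLong();

  PrefixedKeyGenerator(String writerId) {
    this.writerId = writerId;
  }

  public String nextKey() {
    return writerId + "-" + counter.incrementAndGet();
  }
}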


Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Jacques Nadeau
>
> It’s not at all clear why unique keys would be needed at all.


If we turn your questions around, you answer yourself. If you have
independent writers, you need unique keys.

Also truly independent writers (like a job writing while a job compacts),
> means effectively a distributed transaction, and I believe it’s clearly out
> of scope for Iceberg to solve that ?
>

Assuming a single process is writing seems severely limiting in design and
scale. I'm also surprised that you would think this is outside of Iceberg's
scope. A table format that can only be modified by a single process
basically locks that format into a single tool for a particular deployment.

Uniqueness - enforcing uniqueness at scale is not feasible (provably so).


Expecting uniqueness is different than enforcing it. If you're saying it is
impossible to enforce, I understand that. But I don't see why we can't define
a system where it is expected and where there are ramifications if it is not
maintained.

Also, at scale, it’s really only feasible to do query and update/upsert on
> the partition/bucket/sort key, any other access is likely a full scan of
> terabytes of data, on remote storage.


I'm not sure why you would say that unless you assume a particular
implementation. Single-record deletion is definitely an important use case.
There is no need to do a full table scan to accomplish that unless you're
assuming an eager approach to deletion.

I do continue to wonder how much of this back and forth is the mixing of
thinking around restatement (eager) versus delta (lazy) implementations.
Maybe we should separate them out as two different conversations?


Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Jacques Nadeau
> That's my point, truly independent writers (two Spark jobs, or a Spark job
> and Dremio job) means a distributed transaction. It would need yet another
> external transaction coordinator on top of both Spark and Dremio, Iceberg
> by itself
> cannot solve this.
>

I'm not ready to accept this. Iceberg already supports a set of semantics
around multiple writers committing simultaneously and how conflict
resolution is done. The same can be done here.


> By single writer, I don't mean single process, I mean multiple coordinated
> processes like Spark executors coordinated by Spark driver. The coordinator
> ensures that the data is pre-partitioned on
> each executor, and the coordinator commits the snapshot.
>
> Note however that single writer job/multiple concurrent reader jobs is
> perfectly feasible, i.e. it shouldn't be a problem to write from a Spark
> job and read from multiple Dremio queries concurrently (for example)
>

:D This is still "single process" from my perspective. That process may be
coordinating other processes to do distributed work but ultimately it is a
single process.


> I'm not sure what you mean exactly. If we can't enforce uniqueness we
> shouldn't assume it.
>

I disagree. We can specify that as a requirement and state that you'll get
unintended consequences if you provide your own keys and don't maintain
this.


> We do expect that most of the time the natural key is unique, but the
> eager and lazy with natural key designs can handle duplicates
> consistently. Basically it's not a problem to have duplicate natural keys,
> everything works fine.
>

That heavily depends on how things are implemented. For example, we may
write a bunch of code that generates internal data structures based on this
expectation. If we have to support duplicate matches, all of sudden we can
no longer size various data structures to improve performance and may be
unable to preallocate memory associated with a guaranteed completion.
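
A small illustration of the sizing concern (hypothetical code, not any
existing implementation): with a uniqueness guarantee, a match is at most
one row and lookups can use fixed-size bookkeeping; allowing duplicates
forces every structure to hold an unbounded result set.

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class MatchSizing {
  // Unique keys: at most one match, O(1) lookup, no growable result set.
  static Long findUnique(Map<String, Long> keyToPosition, String key) {
    return keyToPosition.get(key);
  }

  // Duplicates allowed: must materialize a list of unknown size per key.
  static List<Long> findAll(Map<String, List<Long>> keyToPositions, String key) {
    return keyToPositions.getOrDefault(key, Collections.emptyList());
  }
}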

> Let me try and clarify each point:
>
> - lookup for query or update on a non-(partition/bucket/sort) key
> predicate implies scanning large amounts of data - because these are the
> only data structures that can narrow down the lookup, right ? One could
> argue that the min/max index (file skipping) can be applied to any column,
> but in reality if that column is not sorted the min/max intervals can have
> huge overlaps so it may be next to useless.
> - remote storage - this is a critical architecture decision -
> implementations on local storage imply a vastly different design for the
> entire system, storage and compute.
> - deleting single records per snapshot is unfeasible in eager but also
> particularly in the lazy design: each deletion creates a very small
> snapshot. Deleting 1 million records one at a time would create 1 million
> small files, and 1 million RPC calls.
>

Why is this unfeasible? If I have a dataset of 100mm files including 1mm
small files, is that a major problem? It seems like your use case isn't one
where you want to support single-record deletes, but it is definitely
something important to many people.


> Eager is conceptually just lazy + compaction done, well, eagerly. The
> logic for both is exactly the same; the trade-off is just that with eager
> you implicitly compact every time so that you don't do any work on read,
> while with lazy
> you want to amortize the cost of compaction over multiple snapshots.
>
> Basically there should be no difference between the two conceptually, or
> with regard to keys, etc. The only difference is some mechanics in
> implementation.
>

I think you have deconstructed the problem too much to say these are the
same (or at least that is what I'm starting to think given this thread). It
seems like real-world implementation decisions (per our discussion here)
are in conflict. For example, you just argued against having 1mm
arbitrary mutations, but I think that is because you aren't thinking about
things over time with a delta implementation. Having 10,000 mutations a day
where we do delta compaction once a week and local file mappings (key to
offset sparse bitmaps) seems like it could result in very good performance
in a case where we're mutating small amounts of data. In this scenario, you
may not do major compaction ever unless you get to a high enough percentage
of records that have been deleted in the original dataset. That drives a
very different set of implementation decisions from a situation where
you're trying to restate an entire partition at once.
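
For what it's worth, a toy version of the weekly delta compaction described
above might look like this (hypothetical structures, not Iceberg code):
many small per-day position deltas are folded into one deletion bitmap per
data file, so reads apply a single mask instead of many tiny delta files.

import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeltaCompaction {
  // dailyDeltas: one map per day of data file name -> deleted row positions.
  static Map<String, BitSet> compact(List<Map<String, long[]>> dailyDeltas) {
    Map<String, BitSet> perFileMask = new HashMap<>();
    for (Map<String, long[]> delta : dailyDeltas) {
      for (Map.Entry<String, long[]> entry : delta.entrySet()) {
        BitSet mask = perFileMask.computeIfAbsent(entry.getKey(), f -> new BitSet());
        for (long pos : entry.getValue()) {
          mask.set((int) pos); // mark this row position as deleted
        }
      }
    }
    return perFileMask; // one sparse deletion mask per data file
  }
}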


Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Jacques Nadeau
I agree with Anton that we should probably spend some time on hangouts
further discussing things. Definitely differing expectations here and we
seem to be talking a bit past each other.
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, May 21, 2019 at 3:44 PM Cristian Opris 
wrote:

> I love a good flame war :P
>
> On 21 May 2019, at 22:57, Jacques Nadeau  wrote:
>
>
> That's my point, truly independent writers (two Spark jobs, or a Spark job
>> and Dremio job) means a distributed transaction. It would need yet another
>> external transaction coordinator on top of both Spark and Dremio, Iceberg
>> by itself
>> cannot solve this.
>>
>
> I'm not ready to accept this. Iceberg already supports a set of semantics
> around multiple writers committing simultaneously and how conflict
> resolution is done. The same can be done here.
>
>
>
>
> MVCC (which is what Iceberg tries to implement) requires a total ordering
> of snapshots. Also the snapshots need to be non-conflicting. I really don't
> see how any metadata data structures can solve this without an outside
> coordinator.
>
> Consider this:
>
> Snapshot 0: (K,A) = 1
> Job X: UPDATE K SET A=A+1
> Job Y: UPDATE K SET A=10
>
> What should the final value of A be and who decides ?
>
>
>
>> By single writer, I don't mean single process, I mean multiple
>> coordinated processes like Spark executors coordinated by Spark driver. The
>> coordinator ensures that the data is pre-partitioned on
>> each executor, and the coordinator commits the snapshot.
>>
>> Note however that single writer job/multiple concurrent reader jobs is
>> perfectly feasible, i.e. it shouldn't be a problem to write from a Spark
>> job and read from multiple Dremio queries concurrently (for example)
>>
>
> :D This is still "single process" from my perspective. That process may be
> coordinating other processes to do distributed work but ultimately it is a
> single process.
>
>
> Fair enough
>
>
>
>> I'm not sure what you mean exactly. If we can't enforce uniqueness we
>> shouldn't assume it.
>>
>
> I disagree. We can specify that as a requirement and state that you'll get
> unintended consequences if you provide your own keys and don't maintain
> this.
>
>
> There's no need for unintended consequences, we can specify consistent
> behaviour (and I believe the document says what that is)
>
>
>
>
>> We do expect that most of the time the natural key is unique, but the
>> eager and lazy with natural key designs can handle duplicates
>> consistently. Basically it's not a problem to have duplicate natural
>> keys, everything works fine.
>>
>
> That heavily depends on how things are implemented. For example, we may
> write a bunch of code that generates internal data structures based on this
> expectation. If we have to support duplicate matches, all of a sudden we can
> no longer size various data structures to improve performance and may be
> unable to preallocate memory associated with a guaranteed completion.
>
>
> Again we need to operate on the assumption that this is a large-scale
> distributed compute/remote storage scenario. Key matching is done with
> shuffles that move data across the network, so such optimizations would
> really have little impact on overall performance. Not to mention that most
> query engines would already optimize the shuffle as much as it can
> be optimized.
>
> It is true that actual duplicate keys would make the key-matching join
> (anti-join) somewhat more expensive; however, it can be done in such a way
> that if the keys are in practice unique, the join is as efficient as it can
> be.
>
>
>
> Let me try and clarify each point:
>>
>> - lookup for query or update on a non-(partition/bucket/sort) key
>> predicate implies scanning large amounts of data - because these are the
>> only data structures that can narrow down the lookup, right ? One could
>> argue that the min/max index (file skipping) can be applied to any column,
>> but in reality if that column is not sorted the min/max intervals can have
>> huge overlaps so it may be next to useless.
>> - remote storage - this is a critical architecture decision -
>> implementations on local storage imply a vastly different design for the
>> entire system, storage and compute.
>> - deleting single records per snapshot is unfeasible in eager but also
>> particularly in the lazy design: each deletion creates a very small
>> snapshot. Deleting 1 million records one at a time would create 1 million
>> small files

Re: Updates/Deletes/Upserts in Iceberg

2019-05-22 Thread Jacques Nadeau
works for me.

To make things easier, we can use my zoom meeting if people like:

Join Zoom Meeting
https://zoom.us/j/4157302092

One tap mobile
+16465588656,,4157302092# US (New York)
+16699006833,,4157302092# US (San Jose)

Dial by your location
+1 646 558 8656 US (New York)
+1 669 900 6833 US (San Jose)
877 853 5257 US Toll-free
888 475 4499 US Toll-free
Meeting ID: 415 730 2092
Find your local number: https://zoom.us/u/aH9XYBfm






--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, May 22, 2019 at 8:54 AM Ryan Blue  wrote:

> 9AM on Friday works best for me. How about then?
>
> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi 
> wrote:
>
>> What about this Friday? One hour slot from 9:00 to 10:00 am or 10:00 to
>> 11:00 am PST? Some folks are based in London, so meeting later than this is
>> hard. If Friday doesn’t work, we can consider Tuesday or Wednesday next
>> week.
>>
>> On 22 May 2019, at 00:54, Jacques Nadeau  wrote:
>>
>> I agree with Anton that we should probably spend some time on hangouts
>> further discussing things. Definitely differing expectations here and we
>> seem to be talking a bit past each other.
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris 
>> wrote:
>>
>>> I love a good flame war :P
>>>
>>> On 21 May 2019, at 22:57, Jacques Nadeau  wrote:
>>>
>>>
>>> That's my point, truly independent writers (two Spark jobs, or a Spark
>>>> job and Dremio job) means a distributed transaction. It would need yet
>>>> another external transaction coordinator on top of both Spark and Dremio,
>>>> Iceberg by itself
>>>> cannot solve this.
>>>>
>>>
>>> I'm not ready to accept this. Iceberg already supports a set of
>>> semantics around multiple writers committing simultaneously and how
>>> conflict resolution is done. The same can be done here.
>>>
>>>
>>>
>>>
>>> MVCC (which is what Iceberg tries to implement) requires a total
>>> ordering of snapshots. Also the snapshots need to be non-conflicting. I
>>> really don't see how any metadata data structures can solve this without an
>>> outside coordinator.
>>>
>>> Consider this:
>>>
>>> Snapshot 0: (K,A) = 1
>>> Job X: UPDATE K SET A=A+1
>>> Job Y: UPDATE K SET A=10
>>>
>>> What should the final value of A be and who decides ?
>>>
>>>
>>>
>>>> By single writer, I don't mean single process, I mean multiple
>>>> coordinated processes like Spark executors coordinated by Spark driver. The
>>>> coordinator ensures that the data is pre-partitioned on
>>>> each executor, and the coordinator commits the snapshot.
>>>>
>>>> Note however that single writer job/multiple concurrent reader jobs is
>>>> perfectly feasible, i.e. it shouldn't be a problem to write from a Spark
>>>> job and read from multiple Dremio queries concurrently (for example)
>>>>
>>>
>>> :D This is still "single process" from my perspective. That process may
>>> be coordinating other processes to do distributed work but ultimately it is
>>> a single process.
>>>
>>>
>>> Fair enough
>>>
>>>
>>>
>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness we
>>>> shouldn't assume it.
>>>>
>>>
>>> I disagree. We can specify that as a requirement and state that you'll
>>> get unintended consequences if you provide your own keys and don't maintain
>>> this.
>>>
>>>
>>> There's no need for unintended consequences, we can specify consistent
>>> behaviour (and I believe the document says what that is)
>>>
>>>
>>>
>>>
>>>> We do expect that most of the time the natural key is unique, but the
>>>> eager and lazy with natural key designs can handle duplicates
>>>> consistently. Basically it's not a problem to have duplicate natural
>>>> keys, everything works fine.
>>>>
>>>
>>> That heavily depends on how things are implemented. For example, we may
>>> write a bunch of code that generates internal data structures based on this
>>> expectation. If we have to support duplicate matches, all of a sudden we can
>>> no longer size various data structures 

Re: Updates/Deletes/Upserts in Iceberg

2019-05-29 Thread Jacques Nadeau
Yeah, I totally forgot to record our discussion. Will do so next time,
sorry.
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, May 29, 2019 at 4:24 PM Ryan Blue  wrote:

> It wasn't recorded, but I can summarize what we talked about. Sorry I
> haven't sent this out earlier.
>
> We talked about the options and some of the background in Iceberg --
> basically that it isn't possible to determine the order of commits before
> you commit so you can't rely on some monotonically increasing value from a
> snapshot to know which deltas to apply to a file. The result is that we
> can't apply diffs to data files using a rule like "files older than X"
> because we can't identify those files without the snapshot history.
>
> That gives us basically 2 options for scoping delete diffs: either
> identify the files to apply a diff to when writing the diff, or log changes
> applied to a snapshot and keep the snapshot history around (which is how we
> know the order of snapshots). The first option is not good if you want to
> write without reading data to determine where the deleted records are. The
> second prevents cleaning up snapshot history.
>
> We also talked about whether we should encode IDs in data files. Jacques
> pointed out that retrying a commit is easier if you don't need to re-read
> the original data to reconcile changes. For example, if a data file was
> compacted in a concurrent write, how do we reconcile a delete for it? We
> discussed other options, like rolling back the compaction for delete
> events. I think that's a promising option.
>
> For action items, Jacques was going to think about whether we need to
> encode IDs in data files or if we could use positions to identify rows and
> write up a summary/proposal. Erik was going to take on planning how
> identifying rows without reading data would work and similarly write up a
> summary/proposal.
>
> That's from memory, so if I've missed anything, I hope that other
> attendees will fill in the details!
>
> rb
>
> On Wed, May 29, 2019 at 3:34 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
>> Hi Ryan,
>>
>> I couldn't attend the meeting. Just curious, if this is recorded by any
>> chance.
>>
>> Regards
>> Venkata krishnan
>>
>>
>> On Fri, May 24, 2019 at 8:49 AM Ryan Blue 
>> wrote:
>>
>>> Yes, I agree. I'll talk a little about a couple of the constraints of
>>> this as well.
>>>
>>> On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi 
>>> wrote:
>>>
>>>> The agenda looks good to me. I think it would also make sense to
>>>> clarify the responsibilities of query engines and Iceberg. Not only in
>>>> terms of uniqueness, but also in terms of applying diffs on read, for
>>>> example.
>>>>
>>>> On 23 May 2019, at 01:59, Ryan Blue  wrote:
>>>>
>>>> Here’s a rough agenda:
>>>>
>>>>- Use cases: everyone come with a use case that you’d like to have
>>>>supported. We’ll go around and introduce ourselves and our use cases.
>>>>- Main topic: How should Iceberg identify rows that are deleted?
>>>>- Side topics from my initial email, if we have time: should we use
>>>>insert diffs, should we support dense and sparse formats, etc.
>>>>
>>>> The main topic I think we should discuss is: *How should Iceberg
>>>> identify rows that are deleted?*
>>>>
>>>> I’m phrasing it this way to avoid where I think we’re talking past one
>>>> another because we are making assumptions. The important thing is that
>>>> there are two main options:
>>>>
>>>>- Filename and position, vs
>>>>- Specific values of (few) columns in the data
>>>>
>>>> This phrasing also avoids discussing uniqueness constraints. Once we
>>>> get down to behavior, I think we agree. For example, I think we all agree
>>>> that uniqueness cannot be enforced in Iceberg.
>>>>
>>>> If uniqueness can’t be enforced in Iceberg, the main choice comes down
>>>> to how we identify rows that are deleted. If we use (filename, position)
>>>> then we know that there is only one row. On the other hand, if we use data
>>>> values to identify rows then a delete may identify more than one row
>>>> because there are no uniqueness guarantees. I think we also agree that if
>>>> there is more than one row identified, all of them should be deleted.
>>>>
&

Re: Subscribing to dev mailing list

2019-10-01 Thread Jacques Nadeau
You should send an email to dev-subscr...@iceberg.apache.org

On Tue, Oct 1, 2019, 4:42 AM Thippana Vamsi Kalyan  wrote:

> Please add my email id.
>
> Thank you so much
> --
> Best regards
> T.Vamsi Kalyan
>


Re: [DISCUSS] Iceberg community sync?

2019-10-03 Thread Jacques Nadeau
Sounds good to me. I'd vote for once a month.
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Thu, Oct 3, 2019 at 4:56 PM Ryan Blue  wrote:

> Hi everyone,
>
> Other projects I'm involved in use a hangouts meetup every few weeks to
> sync up about the status of different ongoing projects. Iceberg is getting
> to the point where we might want to consider doing this as well. There are
> some significant efforts, like vectorization, row-level delete, and our
> first release.
>
> Usually how this works is we talk over hangouts or some other video call
> platform. Also, someone takes notes and sends those notes to the dev list
> to keep a record of what we discussed for anyone that couldn't attend.
>
> Does this sound like a good idea to anyone?
>
> If so, how often? I know we have people in different time zones, so
> depending on who wants to attend, we may need to alternate times.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Iceberg community sync?

2019-10-06 Thread Jacques Nadeau
Tuesdays work best for me.

On Sun, Oct 6, 2019, 4:18 PM Anton Okolnychyi 
wrote:

> Tuesday/Wednesday/Thursday works fine for me. Anything up to 19:00 UTC /
> 20:00 BST / 12:00 PDT is OK if 09:00 PDT is too early for someone.
>
> Thanks,
> Anton
>
> On 4 Oct 2019, at 19:59, Ryan Blue  wrote:
>
> Sounds good. How about the first sync next week?
>
> Since Anton replied, let's do the first one at a time that's reasonable
> for people in BST. How about 16:00 UTC / 17:00 BST / 09:00 PDT? I could
> make that time Tuesday, Wednesday, or Thursday next week, the 8th, 9th, or
> 10th.
>
> If there are people in CST that want to attend these, please say so and we
> will find a time that works for the next one.
>
> On Fri, Oct 4, 2019 at 10:36 AM Xabriel Collazo Mojica <
> xcoll...@adobe.com.invalid> wrote:
>
>> +1
>>
>>
>>
>> *Xabriel J Collazo Mojica*  |  Senior Software Engineer  |  Adobe  |
>> xcoll...@adobe.com
>>
>>
>>
>> *From: * on behalf of Anton Okolnychyi <
>> aokolnyc...@apple.com.INVALID>
>> *Reply-To: *"dev@iceberg.apache.org" 
>> *Date: *Friday, October 4, 2019 at 1:41 AM
>> *To: *Iceberg Dev List 
>> *Subject: *Re: [DISCUSS] Iceberg community sync?
>>
>>
>>
>> +1
>>
>>
>>
>> On 4 Oct 2019, at 07:14, Julien Le Dem 
>> wrote:
>>
>>
>>
>> +1
>>
>>
>>
>> On Thu, Oct 3, 2019 at 17:52 Xinli shang  wrote:
>>
>> Good to me for once a month!
>>
>>
>>
>> On Thu, Oct 3, 2019 at 5:13 PM Jacques Nadeau  wrote:
>>
>> Sounds good to me. I'd vote for once a month.
>>
>> --
>>
>> Jacques Nadeau
>>
>> CTO and Co-Founder, Dremio
>>
>>
>>
>>
>>
>> On Thu, Oct 3, 2019 at 4:56 PM Ryan Blue 
>> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> Other projects I'm involved in use a hangouts meetup every few weeks to
>> sync up about the status of different ongoing projects. Iceberg is getting
>> to the point where we might want to consider doing this as well. There are
>> some significant efforts, like vectorization, row-level delete, and our
>> first release.
>>
>>
>>
>> Usually how this works is we talk over hangouts or some other video call
>> platform. Also, someone takes notes and sends those notes to the dev list
>> to keep a record of what we discussed for anyone that couldn't attend.
>>
>>
>>
>> Does this sound like a good idea to anyone?
>>
>>
>>
>> If so, how often? I know we have people in different time zones, so
>> depending on who wants to attend, we may need to alternate times.
>>
>>
>>
>> rb
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>> --
>>
>> Xinli Shang
>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>


Re: [VOTE] Release Apache Iceberg 0.7.0-incubating RC1

2019-10-14 Thread Jacques Nadeau
Ran steps, did some greps and random discovery to see if I saw any issues.

Couple questions:

   - Can someone remind me of the rules around noting licenses of dependencies
   for a binary release? It seems like a binary release is being proposed here
   via Maven, but we don't have any LICENSE/NOTICE for dependencies. If the
   binary release didn't include dependencies (was just a jar with no shaded
   dependencies), I think this would be fine. However, I believe some of the
   jars include dependencies, right?
   - The source release tarball includes the gradle-wrapper.jar file but I
   don't see any reference to it. Not sure if it needs to go in both NOTICE and
   LICENSE or only one.

It has been a long time since I did an incubator check, so I may be wrong on
both of these and would love for someone who has done it more recently to
chime in...
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Mon, Oct 14, 2019 at 9:38 AM John Zhuge  wrote:

> BTW, the failure was:
>
> Job aborted due to stage failure: Exception while getting task result:
> com.esotericsoftware.kryo.KryoException:
> java.lang.UnsupportedOperationException
> Serialization trace:
> splitOffsets (org.apache.iceberg.GenericDataFile)
> files (com.netflix.iceberg.spark.source.Writer$TaskCommit)
>
> On Mon, Oct 14, 2019 at 9:35 AM John Zhuge  wrote:
>
>> - Passed all 7 steps
>> - Build source code at tag apache-iceberg-0.7.0-incubating-rc1 locally,
>> unit tests passed. However, my downstream Spark 2.3 branch failed
>> integration tests, possibly due to
>> https://github.com/apache/incubator-iceberg/issues/446. I will try
>> Anton's suggestion and report back.
>>
>> On Sun, Oct 13, 2019 at 11:48 PM Gautam  wrote:
>>
>>> Ran all steps successfully.
>>>
>>>  +1 from me.
>>>
>>> On Mon, Oct 14, 2019 at 7:30 AM 俊杰陈  wrote:
>>>
>>>> Ran all steps successfully, +1
>>>>
>>>> On Mon, Oct 14, 2019 at 7:39 AM Ted Gooch 
>>>> wrote:
>>>> >
>>>> > Ran all steps no issues from me.
>>>> > +1
>>>> >
>>>> > On Sun, Oct 13, 2019 at 12:09 PM Ryan Blue 
>>>> wrote:
>>>> >>
>>>> >> +1 (binding)
>>>> >>
>>>> >> I went through all of the validation and it looks good.
>>>> >>
>>>> >> I also tested the iceberg-spark-runtime Jar with the Apache Spark
>>>> 2.4.4 download. Copying the runtime Jar into Spark's jars folder works
>>>> without problems to read and write both path-based tables and Hive tables.
>>>> Metadata tables work correctly, same with time travel, and metadata tables
>>>> with time travel also work. I also didn't run out of threads in the test
>>>> Hive metastore as I did with the last candidate.
>>>> >>
>>>> >> On Sun, Oct 13, 2019 at 11:30 AM Anton Okolnychyi <
>>>> aokolnyc...@apple.com> wrote:
>>>> >>>
>>>> >>> +1 from me then
>>>> >>>
>>>> >>> On 13 Oct 2019, at 18:33, Ryan Blue 
>>>> wrote:
>>>> >>>
>>>> >>> The publish steps will now sign all of the artifacts, which is
>>>> required for an Apache release. That's why the publish steps fail in
>>>> master. To fix this in master, we can come up with a way to only turn on
>>>> release signatures if `-Prelease` is set, which is how we also select the
>>>> Apache releases repository.
>>>> >>>
>>>> >>> I don't think this is a problem with the release. The convenience
>>>> binaries in the release must be signed and published from an Apache
>>>> repository, so this is necessary. If you're trying to use the release, then
>>>> you don't need to be using JitPack.
>>>> >>>
>>>> >>> On Sun, Oct 13, 2019 at 6:53 AM Anton Okolnychyi
>>>>  wrote:
>>>> >>>>
>>>> >>>> Verified signature/checksum/rat, run tests.
>>>> >>>>
>>>> >>>> No other pending questions except what Arina and Gautam brought up.
>>>> >>>>
>>>> >>>> - Anton
>>>> >>>>
>>>> >>>> On 13 Oct 2019, at 09:17, Gautam  wrote:
>>>> >>>>
>>>> >>>> I was able to run steps in Ryan's mail just fine but ran  into the
>>>> same thing Arina mentioned  .. when running &

random comment

2020-01-02 Thread Jacques Nadeau
I have a random comment on this project versus others I'm involved in. This
is not meant to be critical, it's just an observation.

It feels like very little discussion happens on the dev list other than the
random technical support email. Basically, all interaction is on GitHub (?),
but there are no notifications of GitHub issue creation sent to the dev
list. If you look at the dev list, the last three months had email
counts of Oct: 109, Nov: 35, Dec: 15. When I saw the ~35 PRs closed/month
number in the report, I was shocked given the lack of email on the dev list.

What do other people think about this?


On Thu, Jan 2, 2020 at 10:28 AM Ryan Blue  wrote:

> Hi everyone,
>
> I've posted the initial draft of our report to the IPMC. If you have
> anything to add, please reply!
>
> rb
>
> 
> ## Iceberg
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> Iceberg has been incubating since 2018-11-16.
>
> ### Three most important unfinished issues to address before graduating:
>
>   1. Grow the Iceberg community
>   2. Add more committers and PPMC members
>
> ### Are there any issues that the IPMC or ASF Board need to be aware of?
>
> No issues.
>
> ### How has the community developed since the last report?
>
> In the 4 months since the last report, 138 pull requests were merged for
> an average of 34.5 per month. While this is down from the previous monthly
> average of 49.6 per month for June through August, this contribution rate
> is still very active and healthy. Contributions are coming from a regular
> group of contributors outside of the initial set of committers, which is a
> positive indication for adding new committers and PPMC members over the
> next few months.
>
> The community released the first version of Apache Iceberg,
> 0.7.0-incubating. This release used the "standard" incubator disclaimer and
> included convenience binaries. The release candidate votes were very active
> with community members testing out the release and reporting problems.
>
> There was an Apache Iceberg talk at ApacheCon NA in September.
>
> ### How has the project developed since the last report?
>
>   - The community is building support for the upcoming Spark 3.0 release
>   - The first PR from the vectorization branch has been merged into master
>   - Support for IN and NOT IN predicates was contributed
>   - Python added support for Hive metastore tables and the read path is
> near commit
>   - Flaky tests have been fixed
>   - Baseline checks (style, errorprone, findbugs) are now applied to all
> modules
>
> ### How would you assess the podling's maturity?
> Please feel free to add your own commentary.
>
>   - [ ] Initial setup
>   - [ ] Working towards first release
>   - [x] Community building
>   - [x] Nearing graduation
>   - [ ] Other:
>
> ### Date of last release:
>
>   - 0.7.0-incubating was released 25 October 2019
>
> ### When were the last committers or PPMC members elected?
>
>   - Anton Okolnychyi was added 30 August 2019
>
> ### Have your mentors been helpful and responsive?
>
> Yes. 4 of 5 mentors voted on the 0.7.0-incubating IPMC vote. Thanks to our
> mentors for being active!
>
> ### Is the PPMC managing the podling's brand / trademarks?
>
> Yes, the podling is managing the brand and is not aware of any issues.
> The project name has been approved.
>
> ### Signed-off-by:
>
>   - [x] (iceberg) Ryan Blue
>  Comments:
>   - [ ] (iceberg) Julien Le Dem
>  Comments:
>   - [ ] (iceberg) Owen O'Malley
>  Comments:
>   - [ ] (iceberg) James Taylor
>  Comments:
>   - [ ] (iceberg) Carl Steinbach
>  Comments:
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Welcome new committer and PPMC member Ratandeep Ratti

2020-02-16 Thread Jacques Nadeau
Congrats!

On Sun, Feb 16, 2020, 7:06 PM xiaokun ding  wrote:

> CONGRATULATIONS
>
> 李响 (Xiang Li) wrote on Mon, Feb 17, 2020 at 11:05 AM:
>
>> CONGRATULATIONS!!!
>>
>> On Mon, Feb 17, 2020 at 9:50 AM Junjie Chen 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Mon, Feb 17, 2020 at 5:48 AM Ryan Blue  wrote:
>>>
 Hi everyone,

 I'd like to congratulate Ratandeep Ratti, who was just invited to join
 the Iceberg committers and PPMC!

 Thanks for your contributions and reviews, Ratandeep!

 rb

 --
 Ryan Blue

>>>
>>>
>>> --
>>> Best Regards
>>>
>>
>>
>> --
>>
>>李响 Xiang Li
>>
>> cellphone: +86-136-8113-8972
>> e-mail: wate...@gmail.com
>>
>


Re: [DISCUSS] Graduating from the Apache Incubator

2020-05-11 Thread Jacques Nadeau
Agree with Owen. Great to see Iceberg's growth.
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Mon, May 11, 2020 at 12:16 PM Owen O'Malley 
wrote:

> +1 to graduation. It is exciting watching the project and its community
> grow.
>
> .. Owen
>
> On Mon, May 11, 2020 at 11:26 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I think that Iceberg is about ready to graduate from the Apache
>> Incubator. We now have 2 releases — that include convenience binaries — and
>> have added 2 committers/PPMC members and 2 PPMC members from the original
>> set of committers. We are seeing a steady rate of contributions from a
>> diverse group of people and companies interested in Iceberg. Thank you all
>> for your contributions and for being part of this community!
>>
>> The next step is to agree as a community that we would like to graduate.
>> If you have any concerns about graduation, please raise them.
>>
>> Below is the draft resolution for the board to create an Apache Iceberg
>> TLP. This is mostly boilerplate, but I’ve added 2 things:
>>
>>1. I’d like to volunteer to be the PMC chair of the project so I’ve
>>added myself to the draft. Others are welcome to volunteer as well and we
>>can decide as a community.
>>2. The project description I filled in is: software related to
>>“managing huge analytic datasets using a standard at-rest table format 
>> that
>>is designed for high performance and ease of use”.
>>
>> Establish the Apache Iceberg Project
>>
>> WHEREAS, the Board of Directors deems it to be in the best interests of
>> the Foundation and consistent with the Foundation's purpose to establish
>> a Project Management Committee charged with the creation and maintenance
>> of open-source software, for distribution at no charge to the public,
>> related to managing huge analytic datasets using a standard at-rest
>> table format that is designed for high performance and ease of use.
>>
>> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
>> (PMC), to be known as the "Apache Iceberg Project", be and hereby is
>> established pursuant to Bylaws of the Foundation; and be it further
>>
>> RESOLVED, that the Apache Iceberg Project be and hereby is responsible
>> for the creation and maintenance of software related to managing huge
>> analytic datasets using a standard at-rest table format that is designed
>> for high performance and ease of use; and be it further
>>
>> RESOLVED, that the office of "Vice President, Apache Iceberg" be and
>> hereby is created, the person holding such office to serve at the
>> direction of the Board of Directors as the chair of the Apache Iceberg
>> Project, and to have primary responsibility for management of the
>> projects within the scope of responsibility of the Apache Iceberg
>> Project; and be it further
>>
>> RESOLVED, that the persons listed immediately below be and hereby are
>> appointed to serve as the initial members of the Apache Iceberg Project:
>>
>>  * Anton Okolnychyi 
>>  * Carl Steinbach   
>>  * Daniel C. Weeks  
>>  * James R. Taylor  
>>  * Julien Le Dem
>>  * Owen O'Malley
>>  * Parth Brahmbhatt 
>>  * Ratandeep Ratti  
>>  * Ryan Blue
>>
>> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
>> the office of Vice President, Apache Iceberg, to serve in accordance
>> with and subject to the direction of the Board of Directors and the
>> Bylaws of the Foundation until death, resignation, retirement, removal
>> or disqualification, or until a successor is appointed; and be it
>> further
>>
>> RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
>> the migration and rationalization of the Apache Incubator Iceberg
>> podling; and be it further
>>
>> RESOLVED, that all responsibilities pertaining to the Apache Incubator
>> Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
>> discharged.
>>
>> --
>> Ryan Blue
>>
>


Re: [VOTE] Graduate to a top-level project

2020-05-12 Thread Jacques Nadeau
I'm +1.

(I think that is non-binding here but binding at the incubator level)
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, May 12, 2020 at 2:35 PM Romin Parekh  wrote:

> +1
>
> On Tue, May 12, 2020 at 2:32 PM Owen O'Malley 
> wrote:
>
>> +1
>>
>> On Tue, May 12, 2020 at 2:16 PM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> I propose that the Iceberg community should petition to graduate from
>>> the Apache Incubator to a top-level project.
>>>
>>> Here is the draft board resolution:
>>>
>>> Establish the Apache Iceberg Project
>>>
>>> WHEREAS, the Board of Directors deems it to be in the best interests of
>>> the Foundation and consistent with the Foundation's purpose to establish
>>> a Project Management Committee charged with the creation and maintenance
>>> of open-source software, for distribution at no charge to the public,
>>> related to managing huge analytic datasets using a standard at-rest
>>> table format that is designed for high performance and ease of use.
>>>
>>> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
>>> (PMC), to be known as the "Apache Iceberg Project", be and hereby is
>>> established pursuant to Bylaws of the Foundation; and be it further
>>>
>>> RESOLVED, that the Apache Iceberg Project be and hereby is responsible
>>> for the creation and maintenance of software related to managing huge
>>> analytic datasets using a standard at-rest table format that is designed
>>> for high performance and ease of use; and be it further
>>>
>>> RESOLVED, that the office of "Vice President, Apache Iceberg" be and
>>> hereby is created, the person holding such office to serve at the
>>> direction of the Board of Directors as the chair of the Apache Iceberg
>>> Project, and to have primary responsibility for management of the
>>> projects within the scope of responsibility of the Apache Iceberg
>>> Project; and be it further
>>>
>>> RESOLVED, that the persons listed immediately below be and hereby are
>>> appointed to serve as the initial members of the Apache Iceberg Project:
>>>
>>>  * Anton Okolnychyi 
>>>  * Carl Steinbach   
>>>  * Daniel C. Weeks  
>>>  * James R. Taylor  
>>>  * Julien Le Dem
>>>  * Owen O'Malley
>>>  * Parth Brahmbhatt 
>>>  * Ratandeep Ratti  
>>>  * Ryan Blue
>>>
>>> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
>>> the office of Vice President, Apache Iceberg, to serve in accordance
>>> with and subject to the direction of the Board of Directors and the
>>> Bylaws of the Foundation until death, resignation, retirement, removal
>>> or disqualification, or until a successor is appointed; and be it
>>> further
>>>
>>> RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
>>> the migration and rationalization of the Apache Incubator Iceberg
>>> podling; and be it further
>>>
>>> RESOLVED, that all responsibilities pertaining to the Apache Incubator
>>> Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
>>> discharged.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Petition the IPMC to graduate to top-level project
>>> [ ] +0
>>> [ ] -1 Wait to graduate because . . .
>>> --
>>> Ryan Blue
>>>
>>
>
> --
> Thanks,
> Romin
>
>
>


Re: [DISCUSS] August board report

2020-08-12 Thread Jacques Nadeau
The conference was free so all the recordings are available on-demand for
free:
https://subsurfaceconf.com/summer2020/recordings
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, Aug 12, 2020 at 7:07 PM OpenInx  wrote:

> > Community members gave 2 Iceberg talks at Subsurface Conf, on enabling Hive
> > queries against Iceberg tables and working with petabyte-scale Iceberg
> > tables. Iceberg was also mentioned in the keynotes.
>
> Are there slides or videos for the two Iceberg talks? I'd like to
> read/watch them, but it seems I could not find the resources after a few
> Google searches. How about creating a page to collect all those talks (and
> also a 'powered by' page)?
>
>
>
> On Thu, Aug 13, 2020 at 7:50 AM Owen O'Malley 
> wrote:
>
>> +1 looks good.
>>
>> On Wed, Aug 12, 2020 at 4:41 PM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> Here's a draft of the board report for this month. Please reply with
>>> anything that you'd like to see added or that I've missed. Thanks!
>>>
>>> rb
>>>
>>> ## Description:
>>> Apache Iceberg is a table format for huge analytic datasets that is
>>> designed
>>> for high performance and ease of use.
>>>
>>> ## Issues:
>>> There are no issues requiring board attention.
>>>
>>> ## Membership Data:
>>> Apache Iceberg was founded 2020-05-19 (2 months ago)
>>> There are currently 10 committers and 9 PMC members in this project.
>>> The Committer-to-PMC ratio is roughly 1:1.
>>>
>>> Community changes, past quarter:
>>> - No new PMC members (project graduated recently).
>>> - Shardul Mahadik was added as committer on 2020-07-25
>>>
>>> ## Project Activity:
>>> 0.9.0 was released, including support for Spark 3 and SQL DDL commands,
>>> support
>>> for JDK 11, vectorized Parquet reads, and an action to compact data
>>> files.
>>>
>>> Since the 0.9.0 release, the community has made progress in several
>>> areas:
>>> - The Hive StorageHandler now provides access to query Iceberg tables
>>>   (work is ongoing to implement projection and predicate pushdown).
>>> - Flink integration has made substantial progress toward using native
>>> RowData,
>>>   and the first stage of the Flink sink (data file writers) has been
>>> committed.
>>> - An action to expire snapshots using Spark was added and is an
>>> improvement on
>>>   the incremental approach because it compares the reachable file sets.
>>> - The implementation of row-level deletes is nearing completion. Scan
>>> planning
>>>   now supports delete files, merge-based and set-based row filters have
>>> been
>>>   committed, and delete file writers are under review. The delete file
>>> writers
>>>   allow storing deleted row data in support of Flink CDC use cases.
>>>
>>> Releases:
>>> - 0.9.0 was released on 2020-07-13
>>> - 0.9.1 has an ongoing vote
>>>
>>> ## Community Health:
>>> The month since the last report has been one of the busiest since the
>>> project
>>> started. 80 pull requests were merged in the last 4 weeks, and more
>>> importantly,
>>> came from 21 different contributors. Both of these are new high
>>> watermarks.
>>>
>>> Community members gave 2 Iceberg talks at Subsurface Conf, on enabling
>>> Hive
>>> queries against Iceberg tables and working with petabyte-scale Iceberg
>>> tables.
>>> Iceberg was also mentioned in the keynotes.
>>>
>>> --
>>> Ryan Blue
>>>
>>


Re: [DISCUSS] August board report

2020-08-24 Thread Jacques Nadeau
The talks are posted as a youtube playlist now:

https://www.youtube.com/watch?v=L8WQZeeV6Yw&list=PL-gIUf9e9CCtewYqIGUKvz0fVcoyOYU1H

Iceberg specific videos:
Adrian/Christine
https://www.youtube.com/watch?v=9azStU4aDFE&list=PL-gIUf9e9CCtewYqIGUKvz0fVcoyOYU1H&index=5

Anton
https://www.youtube.com/watch?v=5RJrqS8_u68&list=PL-gIUf9e9CCtewYqIGUKvz0fVcoyOYU1H&index=10

Dan
https://www.youtube.com/watch?v=9uiaCN3tJyI&list=PL-gIUf9e9CCtewYqIGUKvz0fVcoyOYU1H&index=3
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Thu, Aug 13, 2020 at 7:49 PM OpenInx  wrote:

> Thanks for the links, Jacques. I will try to create a pull request to
> add those sharing links.
>
> On Thu, Aug 13, 2020 at 10:24 AM Jacques Nadeau 
> wrote:
>
>> The conference was free so all the recordings are available on-demand for
>> free:
>> https://subsurfaceconf.com/summer2020/recordings
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Aug 12, 2020 at 7:07 PM OpenInx  wrote:
>>
>>> > Community members gave 2 Iceberg talks at Subsurface Conf, on enabling Hive
>>> > queries against Iceberg tables and working with petabyte-scale Iceberg
>>> > tables. Iceberg was also mentioned in the keynotes.
>>>
>>> Are there slides or videos for the two Iceberg talks? I'd like to
>>> read/watch them, but it seems I could not find the resources after a few
>>> Google searches. How about creating a page to collect all those talks (and
>>> also a 'powered by' page)?
>>>
>>>
>>>
>>> On Thu, Aug 13, 2020 at 7:50 AM Owen O'Malley 
>>> wrote:
>>>
>>>> +1 looks good.
>>>>
>>>> On Wed, Aug 12, 2020 at 4:41 PM Ryan Blue  wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Here's a draft of the board report for this month. Please reply with
>>>>> anything that you'd like to see added or that I've missed. Thanks!
>>>>>
>>>>> rb
>>>>>
>>>>> ## Description:
>>>>> Apache Iceberg is a table format for huge analytic datasets that is
>>>>> designed
>>>>> for high performance and ease of use.
>>>>>
>>>>> ## Issues:
>>>>> There are no issues requiring board attention.
>>>>>
>>>>> ## Membership Data:
>>>>> Apache Iceberg was founded 2020-05-19 (2 months ago)
>>>>> There are currently 10 committers and 9 PMC members in this project.
>>>>> The Committer-to-PMC ratio is roughly 1:1.
>>>>>
>>>>> Community changes, past quarter:
>>>>> - No new PMC members (project graduated recently).
>>>>> - Shardul Mahadik was added as committer on 2020-07-25
>>>>>
>>>>> ## Project Activity:
>>>>> 0.9.0 was released, including support for Spark 3 and SQL DDL
>>>>> commands, support
>>>>> for JDK 11, vectorized Parquet reads, and an action to compact data
>>>>> files.
>>>>>
>>>>> Since the 0.9.0 release, the community has made progress in several
>>>>> areas:
>>>>> - The Hive StorageHandler now provides access to query Iceberg tables
>>>>>   (work is ongoing to implement projection and predicate pushdown).
>>>>> - Flink integration has made substantial progress toward using native
>>>>> RowData,
>>>>>   and the first stage of the Flink sink (data file writers) has been
>>>>> committed.
>>>>> - An action to expire snapshots using Spark was added and is an
>>>>> improvement on
>>>>>   the incremental approach because it compares the reachable file sets.
>>>>> - The implementation of row-level deletes is nearing completion. Scan
>>>>> planning
>>>>>   now supports delete files, merge-based and set-based row filters
>>>>> have been
>>>>>   committed, and delete file writers are under review. The delete file
>>>>> writers
>>>>>   allow storing deleted row data in support of Flink CDC use cases.
>>>>>
>>>>> Releases:
>>>>> - 0.9.0 was released on 2020-07-13
>>>>> - 0.9.1 has an ongoing vote
>>>>>
>>>>> ## Community Health:
>>>>> The month since the last report has been one of the busiest since the
>>>>> project
>>>>> started. 80 pull requests were merged in the last 4 weeks, and more
>>>>> importantly,
>>>>> came from 21 different contributors. Both of these are new high
>>>>> watermarks.
>>>>>
>>>>> Community members gave 2 Iceberg talks at Subsurface Conf, on enabling
>>>>> Hive
>>>>> queries against Iceberg tables and working with petabyte-scale Iceberg
>>>>> tables.
>>>>> Iceberg was also mentioned in the keynotes.
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>


New project integrated with Iceberg

2020-10-01 Thread Jacques Nadeau
Hey All,

Ryan Murray, Laurent Goujon and I have been working since early this year
on a way to introduce git-like capabilities, branches, tags and cross-table
transactions into Iceberg (as well as Delta Lake and Hive tables). We just
announced this work as a new OSS project. We're calling it Project Nessie (
projectnessie.org) and we'd love your feedback.

Our goal is to contribute our Iceberg integration into the project. You can
check that work out here:
https://github.com/projectnessie/nessie/tree/main/clients/iceberg

Thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: [VOTE] Release Apache Iceberg 0.8.0-incubating RC2

2020-11-01 Thread Jacques Nadeau
+1 (non-binding)

Ran through steps 1-7, completed successfully.

I also updated Nessie to pull from the staging maven repository and ran the
Nessie test suite and it completed successfully with the staged 0.10.0
artifacts.


--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Sat, May 2, 2020 at 8:54 PM tison  wrote:

> +1 (non-binding)
>
> √ RAT checks passed
> √ signature is correct
> √ checksum is correct
> √ build from source (with java 8)
> √ run tests locally
>
> Best,
> tison.
>
>
> Carl Steinbach  于2020年5月3日周日 上午11:09写道:
>
>> +1 (binding)
>>
>>
>> On Fri, May 1, 2020 at 9:38 AM RD  wrote:
>>
>>> +1
>>> Validated all the steps mentioned.
>>>
>>> -R
>>>
>>> On Fri, May 1, 2020 at 9:31 AM Ryan Blue 
>>> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Ran rat, validated checksums and signature, and ran the build.
>>>>
>>>> I noticed that the iceberg-spark-runtime Jar is about 22MB larger and
>>>> it looks like the problem is mainly that parquet-avro 1.11.0 is shading all
>>>> of fastutil without minimizing the Jar like parquet-column does. I tried
>>>> rolling back to 1.10.1, but that requires rolling back Avro as well, so I
>>>> think the best option right now is to continue with a 37MB runtime Jar. We
>>>> can fix this in a 0.8.1 release when Parquet releases 1.11.1 with a fix.
>>>>
>>>> rb
>>>>
>>>> On Thu, Apr 30, 2020 at 11:41 PM Gautam 
>>>> wrote:
>>>>
>>>>>
>>>>> Ran checks on
>>>>> https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc2/
>>>>>
>>>>> √ RAT checks passed
>>>>> √ signature is correct
>>>>> √ checksum is correct
>>>>> √ build from source (with java 8)
>>>>> √ run tests locally
>>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 30, 2020 at 4:18 PM Samarth Jain 
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>> all checks passed
>>>>>>
>>>>>> On Thu, Apr 30, 2020 at 4:06 PM John Zhuge  wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>>1. Checked signature and checksum
>>>>>>>2. Checked license
>>>>>>>3. Built and ran unit tests.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 30, 2020 at 2:24 PM Owen O'Malley <
>>>>>>> owen.omal...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>>1. Checked signature and checksum
>>>>>>>>2. Built and ran unit tests.
>>>>>>>>3. Checked ORC version :)
>>>>>>>>
>>>>>>>> On Monday, ORC released 1.6.3, so we should grab those fixes soon.
>>>>>>>>
>>>>>>>> .. Owen
>>>>>>>>
>>>>>>>> On Thu, Apr 30, 2020 at 12:34 PM Dongjoon Hyun <
>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1.
>>>>>>>>>
>>>>>>>>> 1. Verified checksum, sig, and license
>>>>>>>>> 3. Build from the source and run UTs.
>>>>>>>>> 4. Run some manual ORC write/read tests with Apache Spark
>>>>>>>>> 2.4.6-SNAPSHOT (as of today).
>>>>>>>>>
>>>>>>>>> Thank you, all!
>>>>>>>>>
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>>
>>>>>>>>> On Thu, Apr 30, 2020 at 10:28 AM parth brahmbhatt <
>>>>>>>>> brahmbhatt.pa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1. checks passed, did not observe the unit test failure.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Parth
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 30, 2020 at 9:13 AM Daniel W

Re: Integrating Existing Iceberg Tables with a Metastore

2020-11-20 Thread Jacques Nadeau
FYI, I would avoid adopting HMS, because you need a better catalog. While
the HMS Iceberg catalog is mature, you're adopting something (HMS) that
carries a lot of baggage. I'd look at the other catalogs that are up and
coming if you can.

For example, Nessie (projectnessie.org) was built to provide a cloud native
approach to Iceberg transaction arbitration (along with some other nifty
features around cross-table transactions and git semantics) so that people
who work in the cloud but don't use Hive metastore, don't have to start.
The HA complexity, scaling dynamics and overall operational load of Nessie
are targeted to be a fraction of what they are for HMS.

Full disclosure, I work on Nessie.

Food for thought, anyway.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Fri, Nov 20, 2020 at 10:58 AM Marko Babic 
wrote:

> Hi Peter. Thanks for responding.
>
> > The command you mention below: `CREATE EXTERNAL TABLE` above an existing
> Iceberg table will not transfer the "responsibility" of tracking the
> snapshot to HMS. It only creates a HMS external table ...
>
> So my understanding is that the HiveCatalog is basically just using HMS as
> an atomically updateable pointer to a metadata file (excepting recent work
> to make Iceberg tables queryable _from_ Hive which we won't be doing) so
> what I'm doing with that command is mimicking the DDL for a
> HiveCatalog-created table, which sets up Iceberg tables as external
> tables in HMS
> <https://github.com/apache/iceberg/blob/f8c68ebcb4e35db5d7f5ccb8e20d53df3abdf8b1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L230-L292>,
>  and
> manually updating the `metadata_location` table property to point to latest
> metadata file for the existing table that I want to integrate with HMS.
> Updating the metadata pointer, along with obviously updating all
> readers/writers to load the table via the HiveCatalog, seems to be all I
> need to do to make that work but I'm just naively dipping my toes in here
> and could absolutely be missing something. E.g. I figured out I'd have to
> rename the latest metadata file from the existing table I want to integrate
> with HMS so that BaseMetastoreTableOperations could parse the version
> number, but only realized later that I'd have to rename _all_ the old
> metadata files + twiddle the metadata log entries to use the updated names.
>
> > What I would do is this: ...
>
> Makes sense, the external table creation + metadata pointer mangling is my
> attempt to do basically this but I'm not confident I know everything that
> needs to go into making step 2 happen. :)
>
> The following is what I'm thinking:
>
> - Given an existing Hadoop table on top of S3 at s3://old_table/, create a
> new table with the same schema + partition spec via HiveCatalog.
> - Parse metadata files from the old table and update them to be
> HadoopCatalog-compatible: all I'd be updating is metadata file names + the
> metadata log as described above.
> - Write updated metadata files to s3://old_table/metadata/. Update new
> table in HMS to point to latest, updated metadata file and update the table
> location to point to s3://old_table/.
>
> I could alternatively `aws s3 sync` data files from the old table to the
> new one, rewrite all the old metadata + snapshot manifest lists + manifest
> files to point to the new data directory, and leave s3://old_table/
> untouched, but I guess that's a decision I'd make once I'm into things and
> have a better sense of what'd be less error-prone.
>
> Thanks again!
>
> Marko
>
>
> On Fri, Nov 20, 2020 at 12:39 AM Peter Vary 
> wrote:
>
>> Hi Marko,
>>
>> The command you mention below: `CREATE EXTERNAL TABLE` above an existing
>> Iceberg table will not transfer the "responsibility" of tracking the
>> snapshot to HMS. It only creates a HMS external table which will allow Hive
>> queries to read the given table. If you want to track the snapshot in the
>> HMS then you have to originally create a table in HMS using HiveCatalog.
>>
>> What I would do is this:
>>
>>1. Create a new Iceberg table in a catalog which supports concurrent
>>writes (Hive/Hadoop/Custom)
>>2. Migrate the tables to the new catalog. Maybe there are some
>>already existing tools there, or with some java/spark code the snapshot
>>files can be read and rewritten. By my understanding you definitely do not
>>have to rewrite the data files, just the snapshot files (and maybe the
>>manifest files)
>>
>> Hope this helps,
>> Peter
>>
>>
>> On Nov 19, 2020, at 21:29, John Clara 
>> wrote:
>>

Re: Iceberg/Hive properties handling

2020-11-25 Thread Jacques Nadeau
I agree with Ryan on the core principles here. As I understand them:

   1. Iceberg metadata describes all properties of a table
   2. Hive table properties describe "how to get to" Iceberg metadata
   (which catalog + possibly ptr, path, token, etc)
   3. There could be default "how to get to" information set at a global
   level
   4. Best-effort schema should be stored in the table properties in HMS.
   This should be done for information schema retrieval purposes within Hive
   but should be ignored during Hive/other tool execution.

Is that a fair summary of your statements, Ryan (except 4, which I just
added)?

One comment I have on #2 is that for different catalogs and use cases, I
think it can be somewhat more complex where it would be desirable for a
table that initially existed without Hive that was later exposed in Hive to
support a ptr/path/token for how the table is named externally. For
example, in a Nessie context we support arbitrary paths for an Iceberg
table (such as folder1.folder2.folder3.table1). If you then want to expose
that table to Hive, you might have this mapping for #2

db1.table1 => nessie:folder1.folder2.folder3.table1

Similarly, you might want to expose a particular branch version of a table.
So it might say:

db1.table1_etl_branch => nessie.folder1@etl_branch

Just saying that the address to the table in the catalog could itself have
several properties. The key being that no matter what those are, we should
follow #1 and only store properties that are about the ptr, not the
content/metadata.
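
To make #2 concrete, here is a rough sketch of the pointer stored as HMS table
properties (the key names below are invented for illustration, not an agreed
convention):

  ALTER TABLE db1.table1_etl_branch SET TBLPROPERTIES (
    'iceberg.catalog'='nessie',
    'nessie.path'='folder1.folder2.folder3.table1',
    'nessie.ref'='etl_branch'
  );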

Lastly, I believe #4 is the case but haven't tested it. Can someone confirm
that it is true? And that it is possible/not problematic?


--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue  wrote:

> Thanks for working on this, Laszlo. I’ve been thinking about these
> problems as well, so this is a good time to have a discussion about Hive
> config.
>
> I think that Hive configuration should work mostly like other engines,
> where different configurations are used for different purposes. Different
> purposes means that there is not a global configuration priority.
> Hopefully, I can explain how we use the different config sources elsewhere
> to clarify.
>
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
> Configuration, but it also has its own global configuration. There are also
> Iceberg table properties, and all of the various Hive properties if you’re
> tracking tables with a Hive MetaStore.
>
> The first step is to simplify where we can, so we effectively eliminate 2
> sources of config:
>
>- The Hadoop Configuration is only used to instantiate Hadoop classes,
>like FileSystem. Iceberg should not use it for any other config.
>- Config in the Hive MetaStore is only used to identify that a table
>is Iceberg and point to its metadata location. All other config in HMS is
>informational. For example, the input format is FileInputFormat so that
>non-Iceberg readers cannot actually instantiate the format (it’s abstract)
>but it is available so they also don’t fail trying to load the class.
>Table-specific config should not be stored in table or serde properties.
>
> That leaves Spark configuration and Iceberg table configuration.
>
> Iceberg differs from other tables because it is opinionated: data
> configuration should be maintained at the table level. This is cleaner for
> users because config is standardized across engines and in one place. And
> it also enables services that analyze a table and update its configuration
> to tune options that users almost never do, like row group or stripe size
> in the columnar formats. Iceberg table configuration is used to configure
> table-specific concerns and behavior.
>
> Spark configuration is used for engine-specific concerns, and runtime
> overrides. A good example of an engine-specific concern is the catalogs
> that are available to load Iceberg tables. Spark has a way to load and
> configure catalog implementations and Iceberg uses that for all
> catalog-level config. Runtime overrides are things like target split size.
> Iceberg has a table-level default split size in table properties, but this
> can be overridden by a Spark option for each table, as well as an option
> passed to the individual read. Note that these necessarily have different
> config names for how they are used: Iceberg uses read.split.target-size
> and the read-specific option is target-size.
>
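> For example, a sketch using the names above (the exact option keys may
> differ by version):
>
>   -- table-level default, stored in Iceberg table properties
>   ALTER TABLE db.t SET TBLPROPERTIES ('read.split.target-size'='268435456');
>
>   -- per-read override in Spark (shown as a comment since it isn't SQL):
>   --   spark.read.option("target-size", "134217728").table("db.t")
>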
> Applying this to Hive is a little strange for a couple reasons. First,
> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
> think the right place to store engine-specific config is there, including
> Iceberg catalogs using a strategy similar to what Spark does: what external
>

Re: Iceberg/Hive properties handling

2020-11-25 Thread Jacques Nadeau
Minor error, my last example should have been:

db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau  wrote:

> I agree with Ryan on the core principles here. As I understand them:
>
>1. Iceberg metadata describes all properties of a table
>2. Hive table properties describe "how to get to" Iceberg metadata
>(which catalog + possibly ptr, path, token, etc)
>3. There could be default "how to get to" information set at a global
>level
>4. Best-effort schema should be stored in the table properties in HMS.
>This should be done for information schema retrieval purposes within Hive
>but should be ignored during Hive/other tool execution.
>
> Is that a fair summary of your statements, Ryan (except 4, which I just
> added)?
>
> One comment I have on #2 is that for different catalogs and use cases, I
> think it can be somewhat more complex where it would be desirable for a
> table that initially existed without Hive that was later exposed in Hive to
> support a ptr/path/token for how the table is named externally. For
> example, in a Nessie context we support arbitrary paths for an Iceberg
> table (such as folder1.folder2.folder3.table1). If you then want to expose
> that table to Hive, you might have this mapping for #2
>
> db1.table1 => nessie:folder1.folder2.folder3.table1
>
> Similarly, you might want to expose a particular branch version of a
> table. So it might say:
>
> db1.table1_etl_branch => nessie.folder1@etl_branch
>
> Just saying that the address to the table in the catalog could itself have
> several properties. The key being that no matter what those are, we should
> follow #1 and only store properties that are about the ptr, not the
> content/metadata.
>
> Lastly, I believe #4 is the case but haven't tested it. Can someone
> confirm that it is true? And that it is possible/not problematic?
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue 
> wrote:
>
>> Thanks for working on this, Laszlo. I’ve been thinking about these
>> problems as well, so this is a good time to have a discussion about Hive
>> config.
>>
>> I think that Hive configuration should work mostly like other engines,
>> where different configurations are used for different purposes. Different
>> purposes means that there is not a global configuration priority.
>> Hopefully, I can explain how we use the different config sources elsewhere
>> to clarify.
>>
>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>> Configuration, but it also has its own global configuration. There are also
>> Iceberg table properties, and all of the various Hive properties if you’re
>> tracking tables with a Hive MetaStore.
>>
>> The first step is to simplify where we can, so we effectively eliminate 2
>> sources of config:
>>
>>- The Hadoop Configuration is only used to instantiate Hadoop
>>classes, like FileSystem. Iceberg should not use it for any other config.
>>- Config in the Hive MetaStore is only used to identify that a table
>>is Iceberg and point to its metadata location. All other config in HMS is
>>informational. For example, the input format is FileInputFormat so that
>>non-Iceberg readers cannot actually instantiate the format (it’s abstract)
>>but it is available so they also don’t fail trying to load the class.
>>Table-specific config should not be stored in table or serde properties.
>>
>> That leaves Spark configuration and Iceberg table configuration.
>>
>> Iceberg differs from other tables because it is opinionated: data
>> configuration should be maintained at the table level. This is cleaner for
>> users because config is standardized across engines and in one place. And
>> it also enables services that analyze a table and update its configuration
>> to tune options that users almost never do, like row group or stripe size
>> in the columnar formats. Iceberg table configuration is used to configure
>> table-specific concerns and behavior.
>>
>> Spark configuration is used for engine-specific concerns, and runtime
>> overrides. A good example of an engine-specific concern is the catalogs
>> that are available to load Iceberg tables. Spark has a way to load and
>> configure catalog implementations and Iceberg uses that for all
>> catalog-level config. Runtime overrides are things like target split size.
>> Iceberg has a table-level default split size in table properti

Re: Iceberg/Hive properties handling

2020-12-01 Thread Jacques Nadeau
Would someone be willing to create a document that states the current
proposal?

It is becoming somewhat difficult to follow this thread. I also worry that,
without a complete statement of the current shape, people may incorrectly
think they are in alignment.



--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy 
wrote:

> Thanks, Ryan. I answered inline.
>
> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue  wrote:
>
>> This sounds like a good plan overall, but I have a couple of notes:
>>
>>1. We need to keep in mind that users plug in their own catalogs, so
>>iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
>>Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
>>and hive.catalog values. Those should just be names for catalogs 
>> configured
>>in Configuration, i.e., via hive-site.xml. We then only need a
>>special value for loading Hadoop tables from paths.
>>
>> About extensibility, I think the usual Hive way is to use Java class
> names. So this way the value for 'iceberg.catalog' could be e.g.
> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
> would need to have a factory method that constructs the catalog object from
> a properties object (Map). E.g.
> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
> 'iceberg.catalog_location' to be present in properties.
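>
> A quick sketch of the DDL shape I have in mind (illustrative only; the
> property keys are the ones proposed above):
>
>   CREATE EXTERNAL TABLE t (i int)
>   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>   TBLPROPERTIES (
>     'iceberg.catalog'='org.apache.iceberg.hadoop.HadoopCatalog',
>     'iceberg.catalog_location'='hdfs:///warehouse/iceberg_catalog'
>   );
>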
>
>>
>>1. I don’t think that catalog configuration should be kept in table
>>properties. A catalog should not be loaded for each table. So I don’t 
>> think
>>we need iceberg.catalog_location. Instead, we should have a way to
>>define catalogs in the Configuration for tables in the metastore to
>>reference.
>>
>  I think it makes sense; on the other hand, it would make adding new
> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
> and restart/reinit services. Maybe it can be cumbersome in some
> environments.
>
>>
>>1. I’d rather use a prefix to exclude properties from being passed to
>>Iceberg than to include them. Otherwise, users don’t know what to do to
>>pass table properties from Hive or Impala. If we exclude a prefix or
>>specific properties, then everything but the properties reserved for
>>locating the table are passed as the user would expect.
>>
>> I don't have a strong opinion about this, but yeah, maybe this behavior
> would cause the least surprises.
>
>>
>>
>>
>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy 
>> wrote:
>>
>>> Thanks, Peter. I answered inline.
>>>
>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary 
>>> wrote:
>>>
>>>> Hi Zoltan,
>>>>
>>>> Answers below:
>>>>
>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>> borokna...@cloudera.com.INVALID> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the replies. My take for the above questions are as follows
>>>>
>>>>- Should 'iceberg.catalog' be a required property?
>>>>- Yeah, I think it would be nice if this would be required to avoid
>>>>   any implicit behavior
>>>>
>>>> Currently we have a Catalogs class to get/initialize/use the different
>>>> Catalogs. At that time the decision was to use HadoopTables as a default
>>>> catalog.
>>>> It might be worthwhile to use the same class in Impala as well, so the
>>>> behavior is consistent.
>>>>
>>>
>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>> whenever possible. The Catalogs class is very similar to what we have
>>> currently in Impala.
>>>
>>>>
>>>>- 'hadoop.catalog' LOCATION and catalog_location
>>>>   - In Impala we don't allow setting LOCATION for tables stored in
>>>>   'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg
>>>>   table's actual location. We were also thinking about using only the 
>>>> table
>>>>   LOCATION, and set it to the catalog location, but we also found it
>>>>   confusing.
>>>>
>>>> It could definitely work, but it is somewhat strange that we have an
>>>> external table location set to an arbitrary path, and we have a different
>>>> location generated by other configs. It would be ni

Re: Iceberg At Adobe

2020-12-03 Thread Jacques Nadeau
Yeah, thanks for sharing

On Thu, Dec 3, 2020 at 11:57 AM John Zhuge  wrote:

> Very nice!
>
> On Thu, Dec 3, 2020 at 10:36 AM Miao Wang 
> wrote:
>
>> Hi,
>>
>>
>>
>> Our team post 1 blog about Iceberg use case at Adobe.
>>
>>
>>
>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866
>>
>>
>>
>> There will be a series of blogs to show more details.
>>
>>
>>
>> Miao
>>
>
>
> --
> John Zhuge
>
-- 
--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Iceberg/Hive properties handling

2020-12-07 Thread Jacques Nadeau
Hey Peter, thanks for updating the doc, and for the heads-up in the other
thread about your capacity to look at this before EOY.

I'm going to try to create a specification document based on the discussion
document you put together. I think there is general consensus around what
you call "Spark-like catalog configuration" so I'd like to formalize that
more.

It seems like there is less consensus around the whitelist/blacklist side
of things. You outline four approaches:

   1. Hard coded HMS only property list
   2. Hard coded Iceberg only property list
   3. Prefix for Iceberg properties
   4. Prefix for HMS only properties

I generally think #2 is a no-go as it creates too much coupling between
catalog implementations and core iceberg. It seems like Ryan Blue would
prefer #4 (correct?). Any other strong opinions?
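
To make the prefix options concrete, a quick sketch (the prefixes themselves
are hypothetical, not agreed names):

  -- #3: only keys carrying an Iceberg prefix reach the Iceberg table
  ALTER TABLE t SET TBLPROPERTIES ('iceberg.write.format.default'='orc');

  -- #4: every key reaches the Iceberg table except those carrying an
  -- HMS-only prefix
  ALTER TABLE t SET TBLPROPERTIES (
    'write.format.default'='orc',  -- goes to the Iceberg table
    'hms.numRows'='100'            -- stays in HMS only
  );
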
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Thu, Dec 3, 2020 at 9:27 AM Peter Vary 
wrote:

> As Jacques suggested (with the help of Zoltan) I have collected the
> current state and the proposed solutions in a document:
>
> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>
> My feeling is that we do not have a final decision, so I tried to list all
> the possible solutions.
> Please comment!
>
> Thanks,
> Peter
>
> On Dec 2, 2020, at 18:10, Peter Vary  wrote:
>
> When I was working on the CREATE TABLE patch I found the following
> TBLPROPERTIES on newly created tables:
>
>- external.table.purge
>- EXTERNAL
>- bucketing_version
>- numRows
>- rawDataSize
>- totalSize
>- numFiles
>- numFileErasureCoded
>
>
> I am afraid that we can not change the name of most of these properties,
> and might not be useful to have most of them along with Iceberg statistics
> already there. Also my feeling is that this is only the top of the Iceberg
> (pun intended :)) so this is why I think we should be more targeted way to
> push properties to the Iceberg tables.
>
> On Dec 2, 2020, at 18:04, Ryan Blue  wrote:
>
> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>
> Also I expect that we want to add Hive write specific configs to table
> level when the general engine independent configuration is not ideal for
> Hive, but every Hive query for a given table should use some specific
> config.
>
> Hive may need configuration, but I think these should still be kept in the
> Iceberg table. There is no reason to make Hive config inaccessible from
> other engines. If someone wants to view all of the config for a table from
> Spark, the Hive config should also be included right?
>
> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary  wrote:
>
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there
>> is an easy default then I do not mind any of the proposed solutions.
>>
>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>> solution, because in Hive it is common to add new keys to the property list
>> - no restriction is in place (I am not even sure that the currently
>> implemented blacklist for preventing to propagate properties to Iceberg
>> tables is complete). Also I expect that we want to add Hive write specific
>> configs to table level when the general engine independent configuration is
>> not ideal for Hive, but every Hive query for a given table should use some
>> specific config.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau  ezt írta (időpont: 2020. dec. 1., Ke
>> 17:06):
>>
>>> Would someone be willing to create a document that states the current
>>> proposal?
>>>
>>> It is becoming somewhat difficult to follow this thread. I also worry
>>> that without a complete statement of the current shape that people may be
>>> incorrectly thinking they are in alignment.
>>>
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>> borokna...@cloudera.com> wrote:
>>>
>>>> Thanks, Ryan. I answered inline.
>>>>
>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue  wrote:
>>>>
>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>
>>>>>1. We need to keep in mind that users plug in their own catalogs,
>>>>>so iceberg.catalog could be a Glue or Nessie catalog, not just
>>>>>Hive or Hadoop. I don’t think it makes much sense to use separate
>

Re: Adobe Blog ..

2021-01-15 Thread Jacques Nadeau
+1. This is a great series.

I think it would be great to add a section to the website linking to
helpful articles, slide decks, etc. about Iceberg. In-the-trenches
information is often the most useful.

On Fri, Jan 15, 2021 at 3:43 PM Ryan Blue  wrote:

> Thanks, Gautam! I was just reading the one on query optimizations. Great
> that you are writing this series, I think it will be helpful.
>
> On Fri, Jan 15, 2021 at 3:36 PM Gautam  wrote:
>
>> Hello Devs,
>>   We at Adobe have been penning down our experiences with
>> Apache Iceberg thus far. Here is the third blog in that series titled:
>> "Taking Query Optimizations to the Next Level with Iceberg" *[1]*. In
>> case you haven't, here are the first two blogs titled "Iceberg at Adobe"
>> *[2]* and "High Throughput Ingestion with Iceberg" *[3]*.
>>
>> Hoping these are helpful to others..
>>
>> thanks and regards,
>> -Gautam.
>>
>> [1] -
>> https://medium.com/adobetech/taking-query-optimizations-to-the-next-level-with-iceberg-6c968b83cd6f
>> [2] - https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866
>> [3] -
>> https://medium.com/adobetech/high-throughput-ingestion-with-iceberg-ccf7877a413f
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Welcoming Peter Vary as a new committer!

2021-01-25 Thread Jacques Nadeau
Congrats Peter! Thanks for all your great work

On Mon, Jan 25, 2021 at 10:24 AM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to welcome Peter Vary as a new Iceberg committer.
>
> Thanks for all your contributions, Peter!
>
> rb
>
> --
> Ryan Blue
>


Re: Proposal: Support for views in Iceberg

2021-07-22 Thread Jacques Nadeau
Some thoughts...

   - In general, many engines want (or may require) a resolved SQL field.
   This typically includes, at minimum, star expansion, since traditional view
   behavior is that stars are expanded at view creation time (the only way to
   guarantee that the view returns the same logical definition even if the
   underlying table changes; see the sketch after this list). It may also
   include replacing relative object names with absolute object names based
   on the session catalog & namespace. If I recall correctly, Hive does both
   of these things.
   - It isn't clear in the spec whether the table references used in views
   are restricted to other Iceberg objects or can be arbitrary objects in the
   context of a particular engine. Maybe I missed this? For example, can I
   have a Trino engine view that references an Elasticsearch table stored in
   an Iceberg view?
   - Restricting schemas to the Iceberg types will likely lead to
   unintended consequences. I appreciate the attraction to it but I think it
   would either create artificial barriers around the types of SQL that are
   allowed and/or mean that replacing a CTE with a view could potentially
   change the behavior of the query which I believe violates most typical
   engine behaviors. A good example of this is the simple sql statement of
   "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
   default I believe), c2 will be specified as a CHAR(3). In this Iceberg
   context, is this view disallowed? If it isn't disallowed then you have an
   issue where the view schema will be required to be different from a CTE
   since the engine will resolve it differently than Iceberg. Even if you
   ignore CHAR(X), you've still got VARCHAR(X) to contend with...
   - It is important to remember that Calcite is a set of libraries and not
   a specification. There are things that can be specified in Calcite but in
   general it doesn't have formal specification as a first principle. It is
   more implementation as a first principle. This is in contrast to projects
   like Arrow and Iceberg, which start with well-formed specifications. I've
   been working with Calcite since before it was an Apache project and I
   wouldn't recommend adopting it as any form of a specification. On the
   flipside, I am very supportive of using it for a reference implementation
   standard for Iceberg view consumption, manipulation, etc.  If anything, I'd
   suggest we start with the adoption of a relatively clear grammar, e.g. the
   Antlr grammar file that Spark [1] and/or Trino [2] use. Even that is not a
   complete specification as grammar must still be interpreted with regards to
   type promotion, function resolution, consistent unnamed expression naming,
   etc that aren't defined at the grammar level. I'd definitely avoid using
   Calcite's JavaCC grammar as it heavily embeds implementation details (in a
   good way) and relies on some fairly complex logic in the validator and
   sql2rel components to be fully resolved/comprehended.
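
To make the first and third bullets concrete, a small sketch (table and column
names are made up):

  -- Star expansion at view creation time: the stored text becomes
  --   SELECT c1, c2 FROM db.table1
  -- even if db.table1 later gains columns.
  CREATE VIEW v1 AS SELECT * FROM db.table1;

  -- Type inference: many engines resolve 'foo' as CHAR(3), a type Iceberg
  -- does not define, so the recorded view schema could differ from what the
  -- same text yields as a CTE.
  CREATE VIEW v2 AS SELECT c1, 'foo' AS c2 FROM db.table1;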

Given the above, I suggest having a field which describes the
dialect(origin?) of the view and then each engine can decide how they want
to consume/mutate that view (and whether they want to or not). It does risk
being a dumping ground. Nonetheless, I'd expect the alternative of
establishing a formal SQL specification to be a similarly long process to
the couple of years it took to build the Arrow and Iceberg specifications.
(Realistically, there is far more to specify here than there is in either
of those two domains.)

Some other notes:

   - Calcite does provide a nice reference document [3] but it is not
   sufficient to implement what is necessary for parsing/validating/resolving
   a SQL string correctly/consistently.
   - Projects like Coral [4] are interesting here but even Coral is based
   roughly on "HiveQL" which also doesn't have a formal specification process
   outside of the Hive version you're running. See this thread in Coral slack
   [5]
   - ZetaSQL [6] also seems interesting in this space. It feels closer to
   specification based [7] than Calcite but is much less popular in the big
   data domain. I also haven't reviewed it's SQL completeness closely, a
   strength of Calcite.
   - One of the other problems with building against an implementation as
   opposed to a specification (e.g. Calcite) is it can make it difficult or
   near impossible to implement the same algorithms again without a bunch of
   reverse engineering. If interested in an example of this, see the
   discussion behind LZ4 deprecation on the Parquet spec [8] for how painful
   this kind of mistake can become.
   - I'd love to use the SQL specification itself but nobody actually
   implements that in its entirety and it has far too many places where things
   are "implementation-defined" [9].

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
[2]
https://github.

Re: [DISCUSS] UUID type

2021-07-27 Thread Jacques Nadeau
What specific arguments are there for it being a first class type besides
the fact that it exists elsewhere? Is there some kind of optimization iceberg
or an engine could do if it were typed versus just a bucket of bits? Fixed
width binary
seems to cover the cases I see in terms of actual functionality in the
iceberg libraries or engines…



On Tue, Jul 27, 2021 at 6:54 PM Yan Yan  wrote:

> One conversation I used to come across regarding UUID deprecation was from
> https://github.com/apache/iceberg/pull/1611
>
> Thanks,
> Yan
>
> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary 
> wrote:
>
>> Hi Joshua,
>>
>> I do not have a strong preference about the UUID type, but I would like
>> the highlight, that the type is handled inconsistently in Iceberg with
>> different file formats. (See:
>> https://github.com/apache/iceberg/issues/1881)
>>
>> If we keep the type, it would be good to standardize the handling in
>> every file format.
>>
>> Thanks, Peter
>>
>> On Tue, 27 Jul 2021, 17:08 Joshua Howard,  wrote:
>>
>>> Hi.
>>>
>>> UUID is a current data type according to the Iceberg spec (
>>> https://iceberg.apache.org/spec/#primitive-types), but there seems to
>>> have been some discussion about removing it? I could not find the original
>>> discussion, but a reference to the discussion can be found here (
>>> https://github.com/trinodb/trino/issues/6663).
>>>
>>> I generally agree with the consensus in the Trino issue to keep UUID in
>>> Iceberg. To summarize…
>>>
>>> - It makes sense to keep the type now that row identifiers are supported
>>> - Some engines (Trino) have support for the UUID type
>>> - Engines w/o support for UUID type can determine how to map
>>>
>>> Does anyone want to remove the type? If so, why?
>>
>>


Re: [DISCUSS] Moving to apache-iceberg Slack workspace

2021-07-28 Thread Jacques Nadeau
My one recommendation would be that, if you go off Apache infra, you make
sure all the PMC members are admins of the new account.

On Wed, Jul 28, 2021 at 8:35 AM Ryan Blue  wrote:

> No problem, the site hasn't been deployed yet.
>
> I've also looked a bit into the invite issue for the apache-iceberg
> community and I think we should be able to set something up so that people
> can invite themselves. Thanks for raising this as an issue, Piotr. I didn't
> realize that the shared invite URLs expire quickly. We'll either keep that
> refreshed or set up a self-invite form.
>
> On Wed, Jul 28, 2021 at 8:23 AM Russell Spitzer 
> wrote:
>
>> +1 Also I merged the change, sorry if that was premature.
>>
>> On Jul 28, 2021, at 10:19 AM, Eduard Tudenhoefner 
>> wrote:
>>
>> +1 moving to the apache-iceberg 
>> slack workspace
>>
>> On Wed, Jul 28, 2021 at 5:16 PM Ryan Blue  wrote:
>>
>>> Yes, we can update the Slack link. But we want to give everyone a chance
>>> to speak up either in support or not before we make changes to how we
>>> recommend communicating in the community.
>>>
>>> If you're in favor of moving to the apache-iceberg Slack space, please
>>> speak up!
>>>
>>> Ryan
>>>
>>> On Wed, Jul 28, 2021 at 12:51 AM Eduard Tudenhoefner 
>>> wrote:
>>>
 Could we just update the slack link to
 https://join.slack.com/t/apache-iceberg/ on the website (see PR#2882
 )?

 On Wed, Jul 28, 2021 at 7:13 AM Jack Ye  wrote:

> Any updates on this? Given the fact of the currently broken invitation
> link, I think we should move asap.
>
> -Jack Ye
>
> On Tue, Jul 27, 2021 at 2:15 AM Piotr Findeisen <
> pi...@starburstdata.com> wrote:
>
>> Hi,
>>
>> I don't have opinion which Slack workspace this is in, as long as
>> it's easy to join.
>> Manual joining process is not healthy for sure.
>> Btw, the apache-iceberg is currently limited to @apache.org emails,
>> which some people do not have (e.g. i do not).
>> Will you be sharing an invite link or something?
>>
>> Best,
>> PF
>>
>>
>>
>> On Sun, Jul 25, 2021 at 9:48 PM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> A few weeks ago, we talked about whether to use a Slack workspace
>>> for Iceberg or whether to continue using the ASF's workspace. At the 
>>> time,
>>> we thought it was possible for anyone to invite themselves to the ASF's
>>> Slack, so we thought it would be best to use the common one. But after 
>>> the
>>> self-invite link broke recently, we found out that ASF infra doesn't 
>>> want
>>> to fix the link anymore because of spammers, which leaves us without a 
>>> way
>>> for people to join us on Slack without requesting an invite.
>>>
>>> I think that requesting an invite is too much friction for someone
>>> that wants to join this community, so I propose we reconsider our 
>>> previous
>>> decision and move to apache-iceberg.slack.com.
>>>
>>> Any objections?
>>>
>>> Ryan
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: [DISCUSS] UUID type

2021-07-29 Thread Jacques Nadeau
I think points 1&2 don't really apply since a fixed width binary already
covers those properties.

It seems like this isn't really a concern of iceberg but rather a cosmetic
layer that exists primarily (only?) in trino. In that case I would be
inclined to say that trino should just use custom metadata and a fixed
binary type. That way you still have the desired ux without exposing those
extra concepts to iceberg. It actually feels like better encapsulation
imo.

On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen 
wrote:

> Hi,
>
> I agree with Ryan, that it takes some precautions before one can assume
> uniqueness of UUID values, and that this shouldn't be any special for UUIDs
> at all.
> After all, this is just a primitive type, which is commonly used for
> certain things, but "commonly" doesn't mean "always".
>
> The advantages of having a dedicated type are on 3 layers.
> The compact representation in the file, and compact representation in
> memory in the query engine are the ones mentioned above.
>
> The third layer is the usability. Seeing a UUID column i know what values
> i can expect, so it's more descriptive than `id char(36)`.
> This also means i can CREATE TABLE ... AS SELECT uuid(),  without need
> for casting to varchar.
> It also removes temptation of casting uuid to varbinary to achieve compact
> representation.
>
> Thus i think it would be good to have them.
>
> Best
> PF
>
>
>
> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue  wrote:
>
>> The original reason why I added UUID to the spec was that I thought there
>> would be opportunities to take advantage of UUIDs as unique values and to
>> optimize the use of UUIDs. I was thinking about auto-increment ID fields
>> and how we might do something similar in Iceberg.
>>
>> The reason we have thought about removing UUID is that there aren't as
>> many opportunities to take advantage of UUIDs as I thought. My original
>> assumption was that we could do things like bucket on UUID fields or assume
>> that a UUID field has a high NDV. But that's not necessarily the case with
>> when a UUID field is a foreign key, only when it is used as an identifier
>> or primary key. Before Jack added tracking for row identifier fields, we
>> couldn't know that a UUID was unique in a table. As a result, we didn't
>> invest in support for UUID.
>>
>> Quick aside: Now that row identifier fields are tracked, we can do some
>> of these things with the row identifier fields. Engines can assume that the
>> tuple of row identifier fields is unique in a table for join estimation.
>> And engines can use row identifier fields in sort keys to ensure lots of
>> partition split locations (this is really important for Spark).
>>
>> Coming back to UUIDs, the second reason to have a UUID type is still
>> valid: it is better to represent UUIDs as fixed[16] than as 36 byte UTF-8
>> strings that are more than twice as large, or even worse UCS-16 Strings
>> that are 4x as large. Since UUIDs are likely to be used in joins, this
>> could really help engines as long as they can keep the values as
>> fixed-width binary.
>>
>> I could go either way on this. I think it is valuable to have a compact
>> representation for UUIDs rather than using the string representation. But
>> that will require investing in the type and building support in engines
>> that won't take advantage of it. If Trino can use this, I think it may be
>> worth keeping and investing in.
>>
>> Ryan
>>
>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye  wrote:
>>
>>> Yes I agree with Jacques that fixed binary is what it is in the end. I
>>> think It is more about user experience, whether the conversion is done at
>>> the user side or Iceberg and engine side. Many people just store UUID as a
>>> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
>>> Iceberg can optimize this common use case internally for users. There might
>>> be some other benefits I overlooked, but maybe the complication introduced
>>> by this type does not really justify the slightly better user experience. I
>>> am also on the fence about it.
>>>
>>> -Jack Ye
>>>
>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau 
>>> wrote:
>>>
>>>> What specific arguments are there for it being a first class type
>>>> besides it is elsewhere? Is there some kind of optimization iceberg or an
>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>> binary seems to co

Re: [DISCUSS] UUID type

2021-07-29 Thread Jacques Nadeau
It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
type. Which engines are you thinking of that have a native UUID type
besides the Presto derivatives and support Iceberg?

I agree that Trino should expose a UUID type on top of Iceberg tables. All
the user experience things that you are describing as important (compact
storage, friendly display, ddl, clean literals) are possible without it
being a first class type in Iceberg, using a trino-specific property.

I don't really have a strong opinion about UUID. In general, type bloat is
probably just a part of this kind of project. Generally, CHAR(X) and
VARCHAR(X) feel like much bigger concerns given that they exist in all of
the engines but not Iceberg--especially when we start talking about views.

Some of this argues for physical vs logical type abstraction. (Something
that was always challenging in Parquet but also helped to resolve how these
types are managed in engines that don't support them.)

thanks,
Jacques

PS: Funny aside, the bloat on an IP address is actually worse than on a UUID,
right? IPv4 = 4 bytes, IPv4 string = 15 bytes, so 15/4 => 275% bloat; UUID:
36/16 => 125% bloat.

On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue  wrote:

> I don't think this is just a problem in Trino.
>
> If there is no UUID type, then a user must choose between a 36-byte string
> and a 16-byte binary. That's not a good choice to force people into. If
> someone chooses binary, then it's harder to work with rows and construct
> queries even though there is a standard representation for UUIDs. To avoid
> the user headache, people will probably choose to store values as strings.
> Using a string would mean that more than half the value is needlessly
> discarded by default in Iceberg lower/upper bounds instead of keeping the
> entire value. And since engines don't know what's in the string, the full
> value must be used in comparison, which is extra work and extra space.
>
> Inflated values may not be a problem in some cases. IPv4 addresses are one
> case where you could argue that it doesn't matter very much that they are
> typically stored as strings. But I expect the use of UUIDs to be common for
> ID columns because you can generate them without coordination (unlike an
> incrementing ID) and that's a concern because the use as an ID makes them
> likely to be join keys.
>
> If we want the values to be stored as 16-byte fixed, then we need to make
> it easy to get the expected string representation in and out, just like we
> do with date/time types. I don't think that's specific to any engine.
>
> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau 
> wrote:
>
>> I think points 1&2 don't really apply since a fixed width binary already
>> covers those properties.
>>
>> It seems like this isn't really a concern of iceberg but rather a
>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>> be inclined to say that trino should just use custom metadata and a fixed
>> binary type. That way you still have the desired ux without exposing those
>> extra concepts to iceberg. It actually feels like better encapsulation
>> imo.
>>
>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen 
>> wrote:
>>
>>> Hi,
>>>
>>> I agree with Ryan, that it takes some precautions before one can assume
>>> uniqueness of UUID values, and that this shouldn't be any special for UUIDs
>>> at all.
>>> After all, this is just a primitive type, which is commonly used for
>>> certain things, but "commonly" doesn't mean "always".
>>>
>>> The advantages of having a dedicated type are on 3 layers.
>>> The compact representation in the file, and compact representation in
>>> memory in the query engine are the ones mentioned above.
>>>
>>> The third layer is the usability. Seeing a UUID column i know what
>>> values i can expect, so it's more descriptive than `id char(36)`.
>>> This also means i can CREATE TABLE ... AS SELECT uuid(),  without
>>> need for casting to varchar.
>>> It also removes temptation of casting uuid to varbinary to achieve
>>> compact representation.
>>>
>>> Thus i think it would be good to have them.
>>>
>>> Best
>>> PF
>>>
>>>
>>>
>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue  wrote:
>>>
>>>> The original reason why I added UUID to the spec was that I thought
>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>> and to optimize the use of UUIDs. I was thinking about a

Re: Proposal: Support for views in Iceberg

2021-08-26 Thread Jacques Nadeau
gt;>>> of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>>>> engine, it would be a breaking change.
>>>>>>>> Unless we make the compatible approach as expressive as full power
>>>>>>>> of SQL, some views that are possible to create in v1 will not be 
>>>>>>>> possible
>>>>>>>> to create in v2.
>>>>>>>> Thus, if v1  is "some SQL" and v2 is "something awesomely
>>>>>>>> compatible", we may not be able to roll it out.
>>>>>>>>
>>>>>>>> > the convention of common SQL has been working for a majority of
>>>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>>>> application, joins, grouping and common aggregate and scalar function. 
>>>>>>>> A
>>>>>>>> few users occasionally would like to use Trino or Spark specific 
>>>>>>>> functions
>>>>>>>> but are sometimes able to find a way to use a function that is common 
>>>>>>>> to
>>>>>>>> both the engines.
>>>>>>>>
>>>>>>>>
>>>>>>>> it's an awesome summary of what constructs are necessary to be able
>>>>>>>> to define useful views, while also keep them portable.
>>>>>>>>
>>>>>>>> To be able to express column projections, simple filter
>>>>>>>> application, joins, grouping and common aggregate and scalar function 
>>>>>>>> in a
>>>>>>>> structured IR, how much effort do you think would be required?
>>>>>>>> We didn't really talk about downsides of a structured approach,
>>>>>>>> other than it looks complex.
>>>>>>>> if we indeed estimate it as a multi-year effort, i wouldn't argue
>>>>>>>> for that. Maybe i were overly optimistic though.
>>>>>>>>
>>>>>>>>
>>>>>>>> As Jack mentioned, for engine-specific approach that's not supposed
>>>>>>>> to be consumed by multiple engines, we may be better served with 
>>>>>>>> approach
>>>>>>>> that's outside of Iceberg spec, like
>>>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thank you for all the comments. I will try to address them all
>>>>>>>>> here together.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>- @all Cross engine compatibility of view definition: Multiple
>>>>>>>>>options such as engine agnostic SQL or IR of some form have been 
>>>>>>>>> mentioned.
>>>>>>>>>We can all agree that all of these options are non-trivial to
>>>>>>>>>design/implement (perhaps a multi-year effort based on the option 
>>>>>>>>> chosen)
>>>>>>>>>and merit further discussion. I would like to suggest that we 
>>>>>>>>> continue this
>>>>>>>>>discussion but target this work for the future (v2?). In v1, we 
>>>>>>>>> can add an
>>>>>>>>>optional dialect field and an optional expanded/resolved SQL field 
>>>>>>>>> that can
>>>>>>>>>be interpreted by engines as they see fit. V1 can unlock many use 
>>>>>>>>> cases
>>>>>>>>>where the views are either accessed by a single engine or 
>>>>>>>>> multi-engine use
>>>>>>>>>cases where a (common) subset of SQL is supported. This proposal 
>>>>>>>>> allows for
>>>>>>

Re: Proposal: Support for views in Iceberg

2021-08-26 Thread Jacques Nadeau
On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue  wrote:

> Would a physical plan be portable for the purpose of an engine-agnostic
> view?
>

My goal is it would be. There may be optional "hints" that a particular
engine could leverage and others wouldn't but I think the goal should be
that the IR is entirely engine-agnostic. Even in the Arrow project proper,
there are really two independent heavy-weight engines that have their own
capabilities and trajectories (c++ vs rust).


> Physical plan details seem specific to an engine to me, but maybe I'm
> thinking too much about how Spark is implemented. My inclination would be
> to accept only logical IR, which could just mean accepting a subset of the
> standard.
>

I think it is very likely that different consumers will only support a
subset of plans. That being said, I'm not sure what you're specifically
trying to mitigate or avoid. I'd be inclined to simply allow the full
breadth of IR within Iceberg. If it is well specified, an engine can either
choose to execute or not (same as the proposal wrt to SQL syntax or if a
function is missing on an engine). The engine may even have internal
rewrites if it likes doing things a different way than what is requested.


> The document that Micah linked to is interesting, but I'm not sure that
> our goals are aligned.
>

I think there is much commonality here and I'd argue it would be best to
really try to see if a unified set of goals works well. I think Arrow IR is
young enough that it can still be shaped/adapted. It may be that there
should be some give or take on each side. It's possible that the goals are
too far apart to unify but my gut is that they are close enough that we
should try since it would be a great force multiplier.


> For one thing, it seems to make assumptions about the IR being used for
> Arrow data (at least in Wes' proposal), when I think that it may be easier
> to be agnostic to vectorization.
>

Other than using the Arrow schema/types, I'm not at all convinced that the
IR should be Arrow centric. I've actually argued to some that Arrow IR
should be independent of Arrow to be its best self. Let's try to review it
and see if/where we can avoid a tight coupling between plans and arrow
specific concepts.


> It also delegates forward/backward compatibility to flatbuffers, when I
> think compatibility should be part of the semantics and not delegated to
> serialization. For example, if I have Join("inner", a.id, b.id) and I
> evolve that to allow additional predicates Join("inner", a.id, b.id, a.x
> < b.y) then just because I can deserialize it doesn't mean it is compatible.
>

I don't think that flatbuffers alone can solve all compatibility problems.
It can solve some and I'd expect that implementation libraries will have to
solve others. Would love to hear if others disagree (and think flatbuffers
can solve everything wrt compatibility).

J

>


A new project focused on serialized algebra

2021-09-08 Thread Jacques Nadeau
Hey all,

For some time I've been thinking that having a common serialized
representation of query plans would be helpful across multiple related
projects. I started working on something independently in this vein several
months ago. Since then, Arrow has started exploring "Arrow IR" and in
Iceberg, Piotr and others were proposing something similar to support a
cross-engine structured view. Given the different veins of interest, I
think we should combine forces on a consolidated consensus-driven solution.

As I've had more conversations with different people, I've come to the
conclusion that given the complexity of the task and people's
competing priorities, a separate "Switzerland" project is the best way to
find common ground. As such, I've started to sketch out a specification [1]
called Substrait. I'd love to collaborate with the Iceberg community to
ensure the specification does a good job of supporting the needs of this
project.

For those that are interested, please join Slack and/or start a discussion
on GitHub. My first goal is to come to consensus on the type system of
simple [2], compound [3] and physical [4] types. The general approach I'm
trying to follow is:

   - Use Spark, Trino, Arrow and Iceberg as the four indicators of whether
   something should be first class. It must exist in at least two systems to
   be formalized.
   - Avoid a formal distinction between logical and physical (types,
   operators, etc)
   - Lean more towards simple types than compound types when systems
   generally use only a constrained set of parameters (e.g. timestamp(3) and
   timestamp(6) as opposed to timestamp(x)); see the sketch below.
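
A rough sketch of that last distinction (illustrative only; the real
type definitions are at the links below). Simple types bake the few
parameterizations systems actually use into distinct names, while a
compound type keeps an open parameter:

    // Hypothetical sketch of simple vs compound types. Timestamp
    // precision is baked into distinct simple types; decimal stays
    // compound because precision and scale vary freely.
    sealed interface DataType permits SimpleType, DecimalType {}

    enum SimpleType implements DataType {
      I32,
      STRING,
      TIMESTAMP_MILLIS,  // timestamp(3) as its own first-class type
      TIMESTAMP_MICROS   // timestamp(6) as its own first-class type
    }

    record DecimalType(int precision, int scale) implements DataType {}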


Links for Substrait:
Site: https://substrait.io
Spec source: https://github.com/substrait-io/substrait/tree/main/site/docs
Binary format: https://github.com/substrait-io/substrait/tree/main/binary

Please let me know your thoughts,
Jacques

[1] https://substrait.io/spec/specification/#components
[2] https://substrait.io/types/simple_logical_types/
[3] https://substrait.io/types/compound_logical_types/
[4] https://substrait.io/types/physical_types/


Re: A new project focused on serialized algebra

2021-09-10 Thread Jacques Nadeau
There are also some good conversations happening in on the Github
discussion forum [1].

We're trying to drive consensus around the type system to start [2]. Would
love the Iceberg community members to weigh in as the contributors are
fairly Arrow heavy atm.

Thanks,
Jacques


[1] https://github.com/substrait-io/substrait/discussions
[2] https://github.com/substrait-io/substrait/discussions/2


On Fri, Sep 10, 2021 at 9:19 AM Ryan Blue  wrote:

> Nevermind, I see there's a Substrait Slack community. Here's the invite
> link for anyone else that's interested:
> https://join.slack.com/t/substrait/shared_invite/zt-vivbux2c-~B1jEWcR0wYhq5k4LHuoLQ
>
> On Fri, Sep 10, 2021 at 9:16 AM Ryan Blue  wrote:
>
>> Thanks, Jacques! I think it's a great idea to have this as an external
>> project so that it doesn't get tied to a particular set of goals for an
>> existing project.
>>
>> Where is a good place to discuss this? Should we create a #substrait room
>> on Iceberg Slack? ASF Slack? On this thread?
>>
>> Ryan
>>
>> On Wed, Sep 8, 2021 at 8:21 AM Jacques Nadeau 
>> wrote:
>>
>>> Hey all,
>>>
>>> For some time I've been thinking that having a common serialized
>>> representation of query plans would be helpful across multiple related
>>> projects. I started working on something independently in this vein several
>>> months ago. Since then, Arrow has started exploring "Arrow IR" and in
>>> Iceberg, Piotr and others were proposing something similar to support a
>>> cross-engine structured view. Given the different veins of interest, I
>>> think we should combine forces on a consolidated consensus-driven solution.
>>>
>>> As I've had more conversations with different people, I've come to the
>>> conclusion that given the complexity of the task and people's
>>> competing priorities, a separate "Switzerland" project is the best way to
>>> find common ground. As such, I've started to sketch out a specification [1]
>>> called Substrait. I'd love to collaborate with the Iceberg community to
>>> ensure the specification does a good job of supporting the needs of this
>>> project.
>>>
>>> For those that are interested, please join Slack and/or start a
>>> discussion on GitHub. My first goal is to come to consensus on the type
>>> system of simple [2], compound [3] and physical [4] types. The general
>>> approach I'm trying to follow is:
>>>
>>>- Use Spark, Trino, Arrow and Iceberg as the four indicators of
>>>whether something should be first class. It must exist in at least two
>>>systems to be formalized.
>>>- Avoid a formal distinction between logical and physical (types,
>>>operators, etc)
>>>- Lean more towards simple types than compound types when systems
>>>generally use only a constrained set of parameters (e.g. timestamp(3) and
>>>timestamp(6) as opposed to timestamp(x)).
>>>
>>>
>>> Links for Substrait:
>>> Site: https://substrait.io
>>> Spec source:
>>> https://github.com/substrait-io/substrait/tree/main/site/docs
>>> Binary format:
>>> https://github.com/substrait-io/substrait/tree/main/binary
>>>
>>> Please let me know your thoughts,
>>> Jacques
>>>
>>> [1] https://substrait.io/spec/specification/#components
>>> [2] https://substrait.io/types/simple_logical_types/
>>> [3] https://substrait.io/types/compound_logical_types/
>>> [4] https://substrait.io/types/physical_types/
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>
> --
> Ryan Blue
> Tabular
>


Re: [DISCUSS] UUID type

2021-09-17 Thread Jacques Nadeau
I already added it to Substrait because of Iceberg lazy consensus :D


On Fri, Sep 17, 2021 at 2:05 PM Ryan Blue  wrote:

> Let's move forward with it. I'm not hearing much dissent after saying the
> general trend is to keep UUID. So let's call it lazy consensus.
>
> Ryan
>
> On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen 
> wrote:
>
>> Hi Ryan,
>>
>> Please advise whatever feels more appropriate from your perspective.
>> From my perspective, we could just go ahead and merge Trino Iceberg
>> support for UUID, since this is just fulfilling the spec as it is defined
>> today.
>>
>> Best
>> PF
>>
>>
>> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue  wrote:
>>
>>> I don't think we necessarily reached consensus, but I think the general
>>> trend toward the end was to keep support for UUID. Should we start a vote
>>> to validate consensus?
>>>
>>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard 
>>> wrote:
>>>
>>>> Just following up on Piotr's message here.
>>>>
>>>> Have we converged? I think most people would assume that silence is a
>>>> vote for the status-quo.
>>>>
>>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <
>>>> pi...@starburstdata.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It seems we converged here that UUID should remain included.
>>>>> I read this as a consensus reached, but it may be subjective. Did we
>>>>> objectively reach consensus on this?
>>>>>
>>>>> From Iceberg project perspective there isn't anything to do, as UUID
>>>>> already *is* part of the spec (
>>>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>>>> Trino Iceberg PR adding support for UUID
>>>>> https://github.com/trinodb/trino/pull/8747 was pending merge while
>>>>> this conversation has been ongoing.
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B  wrote:
>>>>>
>>>>>> Hi Ryan and all,
>>>>>>
>>>>>> That sounds like a reasonable reason to leave IP address types out.
>>>>>> In my experience, dedicated IP address types are mostly found in logging
>>>>>> tools and other things for sysadmins / DevOps etc.
>>>>>>
>>>>>> When querying data with IP addresses, I’ve seen it done quite a lot
>>>>>> (eg security reasons) but usually stored as string or manipulated in a 
>>>>>> UDF.
>>>>>> They’re not commonly supported types.
>>>>>>
>>>>>> I would also draw the line at UUID types.
>>>>>>
>>>>>> - Kyle Bendickson
>>>>>>
>>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue  wrote:
>>>>>>
>>>>>> 
>>>>>> Jacques, you make some good points here. I think my argument about
>>>>>> usability leading to performance issues is a stronger argument for 
>>>>>> engines
>>>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>>>
>>>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>>>> table that's difficult to query in other engines, where people would
>>>>>> probably choose to store the data as a string. On the other hand, if
>>>>>> Iceberg keeps the UUID type then integrations would simply translate to 
>>>>>> the
>>>>>> UUID string representation before passing data to the other engines.
>>>>>> While the engines would be using 36-byte values in join keys, the user
>>>>>> experience issue is fixed and the data is more compact on disk and in
>>>>>> Iceberg's bounds metadata.
>>>>>>
>>>>>> While having a UUID type in Iceberg can't really help engines that
>>>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>>>> slightly better to have the UUID type in general since at least one 
>>>>

Re: support of RCFile

2021-09-29 Thread Jacques Nadeau
I actually wonder if file formats should be an extension api so someone can
implement a file format without any changes in Iceberg core (I don't think
this is possible today). Let's say one wanted to create a proprietary format
but use Iceberg semantics (not me). Could we make it such that one could do
so by building an extension and leveraging off-the-shelf Iceberg? That seems
the best option for something like RCFile. For sure people are going to have
a desire to add new formats, given the pain of rewriting large datasets, but
I'd hate to see lots of partially implemented file formats in Iceberg proper.
Better for people to build against an extension api and have them serve the
purposes they need. Maybe go so far as having the extension api only allow
read, not write, so that people don't do crazy things...
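
To sketch the shape of it (entirely hypothetical; no such extension
interfaces exist in Iceberg today):

    // Hypothetical sketch of a read-only file format extension point.
    // Nothing like this exists in Iceberg core today.
    import java.io.Closeable;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;

    interface FileFormatExtension {
      // Name recorded in table metadata, e.g. "rcfile".
      String name();

      // Deliberately read-only: extensions can open existing files, but
      // the API offers no writer, so new data always lands in the
      // formats Iceberg supports natively.
      RecordReader openReader(InputStream file, long start, long length)
          throws IOException;
    }

    interface RecordReader extends Closeable, Iterator<Object[]> {}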


On Wed, Sep 29, 2021 at 6:43 PM yuan youjun  wrote:

> Hi Ryan and Russell
>
> Thanks very much for your response.
>
> Well, I want the ACID and row-level update capability that iceberg provides.
> I believe a data lake is a better way to manage our dataset than hive.
> I also want our transition from hive to the data lake to be as smooth as
> possible, which means:
> 1, the transition should be transparent to consumers (dashboards, data
> scientists, downstream pipelines). If we start a new table with iceberg for
> new data, then those consumers will NOT be able to query old data (without
> splitting their queries into two and combining the results).
> 2, it should not impose significant infra cost. Converting historical data
> from RCFile into ORC or Parquet would be time consuming and costly (though
> it's a one-time cost). I get your point that a new format would probably save
> storage cost in the long term; that would be a separate interesting topic.
>
> Here is what is in my mind now:
> 1, if iceberg supports (or will support) legacy formats, that would be ideal.
> 2, if not, is it possible for us to develop that feature (maybe in a
> fork)?
> 3, converting historical data into a new format should be our last resort;
> that path needs more evaluation on our side.
>
>
> youjun
>
> On Sep 30, 2021, at 12:15 AM, Ryan Blue  wrote:
>
> Youjun, what are you trying to do?
>
> If you have existing tables in an incompatible format, you may just want
> to leave them as they are for historical data. It depends on why you want
> to use Iceberg. If you want to be able to query larger ranges of that data
> because you've clustered across files by filter columns, then you'd want to
> build the Iceberg metadata. But if you have a lot of historical data that
> hasn't been clustered and is unlikely to be rewritten, then keeping old
> tables in RCFile and doing new work in Iceberg could be a better option.
>
> You may also want to check how much savings you get out of using Iceberg
> with Parquet files vs RCFile. If you find that you can cluster your data
> for better queries and that ends up making your dataset considerably
> smaller then maybe it's worth the conversion that Russell suggested. RCFile
> is pretty old so I think there's a good chance you'd save a lot of space --
> just updating from an old compression codec to something more modern like
> snappy to lz4 or gzip to zstd could be a big win.
>
> Ryan
>
> On Wed, Sep 29, 2021 at 8:49 AM Russell Spitzer 
> wrote:
>
>> Within Iceberg it would take a bit of effort; we would need custom
>> readers at a minimum even if we just wanted read-only support. I
>> think the main complexity would be designing the specific readers for the
>> platform you want to use like "Spark" or "Flink", the actual metadata
>> handling and such would probably be pretty straightforward. I would
>> definitely size it as at least a several week project and I'm not sure we
>> would want to support it in OSS Iceberg.
>>
>> On Wed, Sep 29, 2021 at 10:40 AM 袁尤军  wrote:
>>
>>> Thanks for the suggestion. We need to evaluate the cost to convert the
>>> format, as those hive tables have been there for many years, so PBs of
>>> data would need to be reformatted.
>>>
>>> also, do you think it is possible to develop support for a new
>>> format? how costly would it be?
>>>
>>> Sent from my iPhone
>>>
>>> > On Sep 29, 2021, at 9:34 PM, Russell Spitzer  wrote:
>>> >
>>> > There is no plan I am aware of to use RCFiles directly in Iceberg.
>>> While we could work to support other file formats, I don't think it is very
>>> widely used compared to ORC and Parquet (Iceberg has native support for
>>> these formats).
>>> >
>>> > My suggestion for conversion would be to do a CTAS statement in Spark
>>> and have the table completely converted over to Parquet (or ORC). This is
>>> probably the simplest way.
>>> >
>>> >> On Sep 29, 2021, at 7:01 AM, yuan youjun 
>>> wrote:
>>> >>
>>> >> Hi community,
>>> >>
>>> >> I am exploring ways to evolve existing hive tables (RCFile) into a
>>> data lake. However I found out that iceberg (or Hudi, delta lake) does not
>>> support RCFile. So my questions are:
>>> >> 1, is there any plan (or is it possible) to support RCFile in the
>>> future? So we can m

Re: Iceberg python library sync

2021-10-05 Thread Jacques Nadeau
This might be a dumb question but...why is the iceberg python mailing list
a Google group as opposed to an Apache mailing list?

On Sat, Oct 2, 2021 at 9:57 PM Jun H.  wrote:

> Hi everyone,
>
> I just sent the invite for the next python library meeting on Tuesday
> (10/12) at 9 AM (UTC-7, PDT). In this meeting, we will continue the
> discussion of the high level design and the project planning.
> Here is the meeting agenda:
> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
> .
>
> Please join the iceberg-python-sync
>  Google Group to
> receive an invitation.
>
> Thanks.
>
> Jun
>
>
> On Sun, Sep 19, 2021 at 11:37 PM Jun H.  wrote:
>
>> Hi everyone,
>>
>> I just sent the invite for the next python library meeting on Tuesday
>> (9/27) at 9 AM (UTC-7, PDT). In this meeting, we will discuss the high
>> level design and the project planning.
>> Here is the meeting agenda:
>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
>> .
>>
>> Please join the iceberg-python-sync
>>  list on Google
>> Groups to receive an invitation.
>>
>> Thanks.
>>
>> Jun
>>
>>
>> On Tue, Aug 24, 2021 at 10:54 PM Jun H.  wrote:
>>
>>> Hi everyone,
>>>
>>> I have sent the brainstorm meeting invite to discuss the python library
>>> redesign on Tuesday (9/7) at 9 AM (UTC-7, PDT).
>>>
>>> Here is the meeting agenda:
>>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
>>> .
>>>
>>> Please join the iceberg-python-sync
>>>  list on Google
>>> Groups to receive an invitation.
>>>
>>> Thanks.
>>>
>>> Jun
>>>
>>>
>>> On Fri, Aug 20, 2021 at 9:18 AM Jun H.  wrote:
>>>
 Hi everyone,

 Here is the doc for the meeting notes from the past python library
 sync:
 https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit#heading=h.864lyyx0j7ax.
 Similar to community sync, we will use it as a running log. Please feel
 free to add additional notes or agenda items.

 As discussed in the sync, we will have a separate brainstorm meeting to
 discuss the python library redesign in the coming weeks. Is Wednesday (9/8)
 at 9 AM (UTC-7, PDT) a good time for everyone (there is a community sync on
 9/1)?

 Please join the iceberg-python-sync
  list on
 Google Groups to receive an invitation.

 Thanks.

 Jun


 On Mon, Aug 16, 2021 at 8:04 AM Jun H.  wrote:

> Hi everyone,
>
> I have sent the meeting invite using the replied emails in the
> threads. Here is the meeting agenda:
> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
> .
>
> Similar to iceberg community sync, please join the iceberg-python-sync
>  list on
> Google Groups to receive an invitation.
>
> Thanks.
>
> Jun
>
>
> On Sat, Aug 14, 2021 at 6:18 AM Uwe L. Korn  wrote:
>
>> Please also invite me as well. I currently don’t have the time to
>> join but would be interested in joining in future.
>>
>> Am 13.08.2021 um 23:36 schrieb Ryan Blue :
>>
>> 
>> Thanks, Jun!
>>
>> On Fri, Aug 13, 2021 at 2:29 PM Jun H.  wrote:
>>
>>> Thanks everyone. I will set up the sync meeting to kick off the
>>> discussion at 9 AM (UTC-7, PDT) on 08/18/2021 (coming Wednesday). I will
>>> create and share a meeting agenda and notes doc soon.
>>>
>>> Best regards,
>>>
>>> Jun
>>>
>>>
>>>
>>>
>>> On Thu, Aug 12, 2021 at 1:49 PM Szehon Ho
>>>  wrote:
>>>
 +1, would love to listen in as well

 Thanks,
 Szehon

 On 12 Aug 2021, at 12:48, Arthur Wiedmer <
 arthur.wiedmer+apa...@gmail.com> wrote:

 Hi Jun,

 Please add me as well!

 Best,
 Arthur



 On Thu, Aug 12, 2021 at 12:19 AM Jun H.  wrote:

> Hi everyone,
>
> Since early this year, we have started working on the iceberg
> python library to bring it up to date and support the new V2 spec. 
> Here is
> a summary of the current feature plan
> .
> We have a lot of interesting work to do.
>
> To keep the community in sync, we plan to set up a recurring
> iceberg python library sync meeting. Please let me know if you are
> interested in or have any questions.
>
> Thanks.

Re: Iceberg python library sync

2021-10-05 Thread Jacques Nadeau
Got it, thanks.

That is a particularly sad form of missing feature...

On Tue, Oct 5, 2021 at 9:55 AM Daniel Weeks 
wrote:

> Hey Jacques,
>
> The google group is mostly for signing up for meeting attendance (if you
> join the GG, you get the invite for the sync on your calendar).
>
> Notes, correspondence, scheduling, etc. should be done via the dev list.
>
> That's my understanding at least,
> -Dan
>
>
>
> On Tue, Oct 5, 2021 at 9:31 AM Jacques Nadeau 
> wrote:
>
>> This might be a dumb question but...why is the iceberg python mailing
>> list a Google group as opposed to an Apache mailing list?
>>
>> On Sat, Oct 2, 2021 at 9:57 PM Jun H.  wrote:
>>
>>> Hi everyone,
>>>
>>> I just sent the invite for the next python library meeting on Tuesday
>>> (10/12) at 9 AM (UTC-7, PDT). In this meeting, we will continue the
>>> discussion of the high level design and the project planning.
>>> Here is the meeting agenda:
>>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
>>> .
>>>
>>> Please join the iceberg-python-sync
>>> <https://groups.google.com/search?q=iceberg-python-sync> Google Group
>>> to receive an invitation.
>>>
>>> Thanks.
>>>
>>> Jun
>>>
>>>
>>> On Sun, Sep 19, 2021 at 11:37 PM Jun H.  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I just sent the invite for the next python library meeting on Tuesday
>>>> (9/27) at 9 AM (UTC-7, PDT). In this meeting, we will discuss the high
>>>> level design and the project planning.
>>>> Here is the meeting agenda:
>>>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
>>>> .
>>>>
>>>> Please join the iceberg-python-sync
>>>> <https://groups.google.com/search?q=iceberg-python-sync> list on
>>>> Google Groups to receive an invitation.
>>>>
>>>> Thanks.
>>>>
>>>> Jun
>>>>
>>>>
>>>> On Tue, Aug 24, 2021 at 10:54 PM Jun H.  wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I have sent the brainstorm meeting invite to discuss the python
>>>>> library redesign on Tuesday (9/7) at 9 AM (UTC-7, PDT).
>>>>>
>>>>> Here is the meeting agenda:
>>>>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing
>>>>> .
>>>>>
>>>>> Please join the iceberg-python-sync
>>>>> <https://groups.google.com/search?q=iceberg-python-sync> list on
>>>>> Google Groups to receive an invitation.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Jun
>>>>>
>>>>>
>>>>> On Fri, Aug 20, 2021 at 9:18 AM Jun H.  wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Here is the doc for the meeting notes from the past python library
>>>>>> sync:
>>>>>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit#heading=h.864lyyx0j7ax.
>>>>>> Similar to community sync, we will use it as a running log. Please feel
>>>>>> free to add additional notes or agenda items.
>>>>>>
>>>>>> As discussed in the sync, we will have a separate brainstorm meeting
>>>>>> to discuss the python library redesign in the coming weeks. Is Wednesday
>>>>>> (9/8) at 9 AM (UTC-7, PDT) a good time for everyone (there is a community
>>>>>> sync on 9/1)?
>>>>>>
>>>>>> Please join the iceberg-python-sync
>>>>>> <https://groups.google.com/search?q=iceberg-python-sync> list on
>>>>>> Google Groups to receive an invitation.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Jun
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 16, 2021 at 8:04 AM Jun H.  wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I have sent the meeting invite using the replied emails in the
>>>>>>> threads. Here is the meeting agenda:
>>>>>>> https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?usp=sharing

Re: [DISCUSS] Iceberg roadmap

2021-11-07 Thread Jacques Nadeau
A few additional observations about StarRocks...

- As far as I can tell, StarRocks has an ASF incompatible license (Elastic
License 2.0).
- It appears to be a hard fork of Apache Doris, a project still in the
incubator (and looks like it probably is destructive to the Doris project)
- The project has only existed for ~2 months.





On Sun, Nov 7, 2021 at 7:34 PM OpenInx  wrote:

> Any thoughts for adding StarRocks integration to the roadmap ?
>
> I think the guys from StarRocks community can provide more background and
> inputs.
>
> On Thu, Nov 4, 2021 at 5:59 PM OpenInx  wrote:
>
>> Update:
>>
>> StarRocks[1] is a next-gen sub-second MPP database for full analysis
>> scenarios, including multi-dimensional analytics, real-time analytics and
>> ad-hoc query.  Their team is planning to integrate iceberg tables as
>> StarRocks external tables in the next month [2], so that people could
>> connect the data lake and StarRocks warehouse in the same engine.
>> The excellent performance of StarRocks will also help accelerate analysis
>> and access of iceberg tables; I think this is a great thing for
>> both the iceberg community and the StarRocks community. Could we
>> add an extra project for the StarRocks integration work to the apache iceberg
>> roadmap [3]?
>>
>> [1].  https://github.com/StarRocks/starrocks
>> [2].  https://github.com/StarRocks/starrocks/issues/1030
>> [3].  https://github.com/apache/iceberg/projects
>>
>> On Mon, Nov 1, 2021 at 11:52 PM Ryan Blue  wrote:
>>
>>> I closed the upgrade project and marked the FLIP-27 project priority 1.
>>> Thanks for all the work to get this done!
>>>
>>> On Sun, Oct 31, 2021 at 8:10 PM OpenInx  wrote:
>>>
 Update:

 I think the project [Flink: Upgrade to 1.13.2][1] in the roadmap can be
 closed now, because all of the issues have been addressed.

 [1]. https://github.com/apache/iceberg/projects/12

 On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner 
 wrote:

> I created a Roadmap section in
>  https://github.com/apache/iceberg/pull/3163
>  that links to the
> planning boards that Jack created. I figured it makes sense if we link
> available Design Docs directly on those Boards (as was already done),
> because then the Design docs are closer to the set of related issues.
>
> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue  wrote:
>
>> Thanks, Jack!
>>
>> Eduard, I think that's a good idea. We should have a roadmap page as
>> well that links to the projects that Jack just created.
>>
>> On Mon, Sep 20, 2021 at 12:57 PM Jack Ye  wrote:
>>
>>> It seems like we have reached some consensus around the projects
>>> listed here. I have created corresponding Github projects for each:
>>> https://github.com/apache/iceberg/projects
>>>
>>> Related design docs are also linked there.
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Sun, Sep 19, 2021 at 11:18 PM Eduard Tudenhoefner <
>>> edu...@dremio.com> wrote:
>>>
 Would it make sense to have a section on the website where we
 collect all the links to the design docs/specs as that would be easier 
 to
 find than searching for things on the ML?

 I was thinking about something like for each component:
 * link to the ML discussion
 * link to the actual Spec/Design Doc

 Thoughts?

 On Fri, Sep 10, 2021 at 11:38 PM Ryan Blue  wrote:

> Hi everyone,
>
> At the last sync meeting, we brought up publishing a community
> roadmap and brainstormed the many features and initiatives that the
> community is working on. In this thread, I want to make sure that we 
> have a
> good list of what people are thinking about and I think we should try 
> to
> categorize the projects by size and general priority. When we reach a 
> rough
> agreement, I’ll write this up and post it on the ASF site along with 
> links
> to some projects in Github.
>
> My rationale for attempting to prioritize projects is that if we
> try to do too many things, it will be slower progress across 
> everything
> rather than getting a few important items done. I know that priorities
> don’t align very cleanly in practice, but it is hopefully worth 
> trying. To
> come up with a priority, I’m trying to keep top priority items to a 
> minimum
> by including only one from each group (Spark, Flink, Python, etc.). 
> The
> remaining items are split between priority 2 and 3. Priority 3 is not
> urgent, including things that can be plugged in (like other IO 
> libraries),
> docs, etc. Everything else is priority 2.
>
> That something isn’t priorit

Re: [Discuss] Iceberg View Interoperability

2024-11-29 Thread Jacques Nadeau
Hey Ajantha, thanks for looping me in. This is a great conversation.

FYI, I'm a co-creator of Substrait so read this all with that in mind.

Substrait has a couple of key underpinnings that are worth noting:
1. It's a specification first and foremost (with tools to help work with
the specification). This is how the project started and continues to work.
Clear specifications are required to succeed at having multiple different
communities and languages work independently but be compatible. This is
inspired by work I've been a part of in Arrow, Iceberg and Parquet.
2. Efficient IR (aka bytecode) is about creating a representation that is
easy for computers to consume and manipulate (as opposed to humans).

On the first item, I think we've already reaped the benefits of this. A
good example is the independent implementations of Substrait in multiple
languages (e.g. Ibis in python, Duckdb in C and Datafusion in Rust,
Calcite/Isthmus in Java). And a fun recent paper by MSFT on their use of
Substrait as a standard plan representation across engines [1]. The success
of the specification is defined by having multiple implementations written
independently (and sometimes competitively). FWIW, I know some lament all
the different Parquet readers and writers written in the world. While there
are probably too many, that's actually a testament to the success of the
specification.

On the second item, I disagree that IR is "just another dialect". Substrait
is built and operates much more like LLVM IR Bytecode or JVM Bytecode.
Swift, Rust and C++ aren't in the same class as LLVM IR. Scala, Kotlin and
Java aren't in the same class as JVM Bytecode. A key characteristic of
formal IRs is deliberate design decisions on the representation for
simplicity, specification and consumption so that it represents well across a
range of systems. It's a key effort in the Substrait community. Good
concrete examples would be our work to define simple decisions like
disallowing implicit casts as well as more complex topics like how to
represent subqueries [2], specify output type derivation rules [3], and
producing engine independent function compliance tests [4], etc.
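
As a small illustration of the implicit-cast rule (hypothetical plan
classes, not the actual Substrait messages): a producer comparing an
i32 column to an i64 literal has to emit the widening cast itself, so
every consumer sees the same fully resolved expression instead of
applying its own coercion rules.

    // Hypothetical sketch: with implicit casts disallowed, the producer
    // materializes the widening cast, so consumers never guess at
    // coercion rules.
    sealed interface Expr permits ColumnRef, Literal, Cast, Call {}

    record ColumnRef(String name, String type) implements Expr {}
    record Literal(long value, String type) implements Expr {}
    record Cast(Expr input, String targetType) implements Expr {}
    record Call(String function, Expr... args) implements Expr {}

    final class Plans {
      // a (i32) = 42 (i64): the cast is explicit in the plan rather
      // than inferred independently by each engine.
      static Expr example() {
        return new Call("equal",
            new Cast(new ColumnRef("a", "i32"), "i64"),
            new Literal(42L, "i64"));
      }
    }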

WRT to multiple IRs, I don't think there are any others like Substrait
right now. (e.g. (1) well-specified, (2) language agonistic, (3)
serializable, (4) domain specific, (5) system agnostic and (6) built for
"dumb" consumers). I think the most mature attempt prior to Substrait was
probably the GPORCA XML representation [5] (which was used across Hawq &
Greenplum). That project ultimately failed due to single-vendor ownership
and focus on advanced query optimization as opposed to interoperability.
(For what it's worth--when starting on Substrait I spoke to some of the
creators of that project to get feedback on what went well and what
didn't).

I agree with Walaa that any translation engine will need an intermediate
representation to do m=>n translations. In most translation systems, the ir
is something internal that was built over time (what I'd call a "lowercase
ir"). I believe sqlglot and Coral both exist in that category. Both systems
may one day move towards formalizing a representation for input and output.
I think the world would be better if they used an existing standard like
Substrait as opposed to introducing a new standard (it's a lot of work to
get right, let's do it just once as a community). The strengths of tools
like Coral and sqlglot are that they are complete products. You can pick each up
and translate SQL today. The strength of Substrait isn't SQL translations,
it's being a clear serializable way to represent data processing
instructions (including SQL).
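
The structural argument is simple: with a shared representation, m
source dialects and n target systems need m producers plus n consumers
rather than m*n pairwise translators. A minimal sketch (hypothetical
interfaces, not the Substrait or Coral APIs):

    // Hypothetical sketch: a shared IR turns m*n pairwise translators
    // into m producers + n consumers. Adding a dialect means writing
    // one class, not one per existing system.
    import java.util.Map;

    interface Plan {}  // stand-in for a serialized, engine-neutral plan

    interface Producer {   // one per source dialect (m of these)
      Plan toIr(String sourceSql);
    }

    interface Consumer {   // one per target system (n of these)
      String fromIr(Plan plan);
    }

    final class Translator {
      private final Map<String, Producer> producers;
      private final Map<String, Consumer> consumers;

      Translator(Map<String, Producer> producers,
                 Map<String, Consumer> consumers) {
        this.producers = producers;
        this.consumers = consumers;
      }

      String translate(String from, String to, String sql) {
        Plan ir = producers.get(from).toIr(sql);
        return consumers.get(to).fromIr(ir);
      }
    }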

In many ways I'm reminded of engines that came before Arrow. Each engine
had an internal representation of data. Some were columnar. Some even
looked somewhat like Arrow. When Arrow arrived, they said "we already have
an intermediate representation for data" and they did. People would say
"I'm not going to change my internal representation of data". We never
expected they would. Rewriting an existing engine to a new native format is
near impossible. The goal was new engines would just use Arrow rather than
reinventing the wheel and existing systems would build high-efficiency
adapters to their internal representations. And that's what happened.
Datafusion and Velox didn't need to invent entirely new representations.
When they needed new features they just added them to Arrow (like
Stringview, thanks Meta folks!) I believe Substrait has much the same
opportunity [6]. When people are working on new translations systems, they
use Substrait because it makes things easier. Systems that already have
internal representations pre-Substrait will build adapters over time
(sounds like Coral already has some).

TL;DR, serialization of Iceberg views is exactly the kind of use case that
Substrait was built for and I hope it becomes the main way to get reliable
interoperability. The ecosystem around Substrait (including Coral v

Re: [DISCUSS] Apache Iceberg Summit 2025 - Selection Committee

2024-11-29 Thread Jacques Nadeau
happy to help on selection committee.

On Mon, Nov 25, 2024, 11:43 PM Jean-Baptiste Onofré  wrote:

> Hi everyone,
>
> As you probably know, we've been having discussions about the Iceberg
> Summit 2025.
>
> The PMC pre-approved the Iceberg Summit proposal, and one of the first
> steps is to put together a selection committee that will be
> responsible for choosing talks and guiding the process.
> Once we have a selection committee, I will complete the concrete
> proposal for the ASF and the Iceberg PMC to request the ability to use
> the name Iceberg/Apache Iceberg.
>
> If you'd like to help and be part of the selection committee, please
> volunteer in a reply to this thread. Since we likely can't include
> everyone that volunteers, I propose that the PMC should choose the
> final committee from the set of people that volunteer.
>
> We'll leave this open up to Dec 10th to give people time (as
> Thanksgiving is this week).
>
> Thanks !
> Regards
> JB
>