Re: [Format] Semantics for dictionary batches in streams

2019-09-09 Thread Micah Kornfield
Yes, I opened a JIRA. I'm going to try to make a proposal that consolidates all the recent dictionary discussions. On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney wrote: > hi Micah, > > I think we should formulate changes to format/Columnar.rst and have a > vote, what do you think? > > On Thu, Aug

Re: [Format] Semantics for dictionary batches in streams

2019-09-09 Thread Wes McKinney
hi Micah, I think we should formulate changes to format/Columnar.rst and have a vote, what do you think? On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield wrote: >> >> >> > I was thinking the file format must satisfy one of two conditions: >> > 1. Exactly one dictionarybatch per encoded column >>

Re: [Format] Semantics for dictionary batches in streams

2019-08-29 Thread Micah Kornfield
> > > > I was thinking the file format must satisfy one of two conditions: > > 1. Exactly one dictionarybatch per encoded column > > 2. DictionaryBatches are interleaved correctly. Could you clarify? I think you clarified it very well :) My motivation for suggesting the additional complexity is

Re: [Format] Semantics for dictionary batches in streams

2019-08-28 Thread Wes McKinney
On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield wrote: > > I was thinking the file format must satisfy one of two conditions: > 1. Exactly one dictionarybatch per encoded column > 2. DictionaryBatches are interleaved correctly. Could you clarify? In the first case, there is no issue with dictio

Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Micah Kornfield
I was thinking the file format must satisfy one of two conditions: 1. Exactly one dictionarybatch per encoded column 2. DictionaryBatches are interleaved correctly. On Tuesday, August 27, 2019, Wes McKinney wrote: > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote: > > > > > > Le 27/08/20

Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Wes McKinney
On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote: > > > On 27/08/2019 at 22:31, Wes McKinney wrote: > > So the current situation we have right now in C++ is that if we tried > > to create an IPC stream from a sequence of record batches that don't > > all have the same dictionary, we'd run in

Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Antoine Pitrou
On 27/08/2019 at 22:31, Wes McKinney wrote: > So the current situation we have right now in C++ is that if we tried > to create an IPC stream from a sequence of record batches that don't > all have the same dictionary, we'd run into two scenarios: > > * Batches that either have a prefix of a p

Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Wes McKinney
So the current situation we have right now in C++ is that if we tried to create an IPC stream from a sequence of record batches that don't all have the same dictionary, we'd run into two scenarios: * Batches that either have a prefix of a prior-observed dictionary, or the prior dictionary is a pre
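The snippet above is cut off, so what follows is only a rough sketch of the prefix relationship it mentions, not the Arrow C++ implementation. It assumes dictionaries are plain Python sequences; the helper names `is_prefix` and `delta_for` are hypothetical and do not come from the thread or from any Arrow library.

```python
from typing import List, Optional, Sequence


def is_prefix(shorter: Sequence, longer: Sequence) -> bool:
    """Return True if `shorter` equals the first len(shorter) entries of `longer`."""
    return len(shorter) <= len(longer) and list(longer[:len(shorter)]) == list(shorter)


def delta_for(prior: Sequence, new: Sequence) -> Optional[List]:
    """Return the entries to append to `prior` in order to cover `new`.

    Hypothetical helper for illustration only:
    - [] means `new` is a prefix of (or equal to) `prior`, so nothing is appended.
    - a non-empty list means `prior` is a prefix of `new`; the list is the tail.
    - None means neither dictionary is a prefix of the other.
    """
    if is_prefix(new, prior):
        return []
    if is_prefix(prior, new):
        return list(new[len(prior):])
    return None


if __name__ == "__main__":
    print(delta_for(["a", "b"], ["a", "b", "c"]))
    print(delta_for(["a", "b", "c"], ["a"]))
    print(delta_for(["a", "b"], ["b", "a"]))
```

Under these assumptions, a non-empty result is the kind of tail that could be sent as a delta, while `None` marks the case where the two dictionaries are unrelated by prefix.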

Re: [Format] Semantics for dictionary batches in streams

2019-08-11 Thread Micah Kornfield
I'm not sure what you mean by record-in-dictionary-id, so it is possible this is a solution that I just don't understand :) The only two references to dictionary IDs that I could find are one in schema.fbs [1], which is attached to a column in a schema, and the one referenced above in DictionaryBatch

Re: [Format] Semantics for dictionary batches in streams

2019-08-11 Thread Jacques Nadeau
Wow, you've shown how little I've thought about Arrow dictionaries for a while. I thought we had a dictionary id and a record-in-dictionary-id. Wouldn't that approach make more sense? Does no one do this today? (We frequently use compound values for this type of scenario...) On Sat, Aug 10, 2019 a

Re: [Format] Semantics for dictionary batches in streams

2019-08-10 Thread Micah Kornfield
Reading data from two different Parquet files sequentially with different dictionaries for the same column. This could be handled by re-encoding data but that seems potentially sub-optimal. On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau wrote: > What situation are you anticipating where you're goi

Re: [Format] Semantics for dictionary batches in streams

2019-08-10 Thread Jacques Nadeau
What situation are you anticipating where you're going to be restating ids mid-stream? On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield wrote: > The IPC specification [1] defines behavior when isDelta on a > DictionaryBatch [2] is "true". I might have missed it in the > specification, but I couldn'

Re: [Format] Semantics for dictionary batches in streams

2019-08-10 Thread Micah Kornfield
I should add that Option #1 above would be my preference, even though it adds some complications (especially for the file format). On Sat, Aug 10, 2019 at 12:12 AM Micah Kornfield wrote: > The IPC specification [1] defines behavior when isDelta on a > DictionaryBatch [2] is "true". I might have

[Format] Semantics for dictionary batches in streams

2019-08-10 Thread Micah Kornfield
The IPC specification [1] defines behavior when isDelta on a DictionaryBatch [2] is "true". I might have missed it in the specification, but I couldn't find what the expected behavior is when isDelta=false and two dictionary batches with the same ID are sent. It seems
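To make the question concrete, here is a minimal sketch of reader-side state for dictionary batches. It is not Arrow's reader and does not reflect whatever the thread eventually settled on. The class `DictionaryState` and its `on_repeat` parameter are hypothetical: `is_delta=True` appends to the existing dictionary, while a second non-delta batch for an already-seen ID is handled by a policy chosen by the caller, precisely because that case is the open question here. Rejecting a delta for an unseen ID is also an assumption of this sketch.

```python
from typing import Dict, List


class DictionaryState:
    """Hypothetical reader-side state: dictionary ID -> current dictionary values."""

    def __init__(self, on_repeat: str = "replace") -> None:
        if on_repeat not in ("replace", "error"):
            raise ValueError("on_repeat must be 'replace' or 'error'")
        self.on_repeat = on_repeat
        self.dictionaries: Dict[int, List[str]] = {}

    def apply(self, dict_id: int, values: List[str], is_delta: bool) -> None:
        """Apply one dictionary batch to the state."""
        if is_delta:
            # Delta batch: append the new entries to the existing dictionary.
            # Assumption of this sketch: a delta for an unseen ID is an error.
            if dict_id not in self.dictionaries:
                raise KeyError(f"delta for unknown dictionary id {dict_id}")
            self.dictionaries[dict_id].extend(values)
        elif dict_id in self.dictionaries and self.on_repeat == "error":
            # Policy "error": treat a repeated non-delta batch as invalid.
            raise ValueError(f"dictionary id {dict_id} was already sent")
        else:
            # First batch for this ID, or policy "replace": overwrite the state.
            self.dictionaries[dict_id] = list(values)


if __name__ == "__main__":
    state = DictionaryState(on_repeat="replace")
    state.apply(0, ["a", "b"], is_delta=False)
    state.apply(0, ["c"], is_delta=True)
    print(state.dictionaries[0])
    state.apply(0, ["x"], is_delta=False)
    print(state.dictionaries[0])
```

The two policies are only there to show what an implementation has to decide when isDelta=false and the ID repeats; which one (if either) is right is what this thread is asking.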