Yes, I opened a JIRA. I'm going to try to make a proposal that consolidates
all the recent dictionary discussions.
On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney wrote:
> hi Micah,
>
> I think we should formulate changes to format/Columnar.rst and have a
> vote, what do you think?
>
> On Thu, Aug
hi Micah,
I think we should formulate changes to format/Columnar.rst and have a
vote, what do you think?
On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield wrote:
>>
>>
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1. Exactly one dictionarybatch per encoded column
>>
>
>
> > I was thinking the file format must satisfy one of two conditions:
> > 1. Exactly one dictionarybatch per encoded column
> > 2. DictionaryBatches are interleaved correctly.
Could you clarify?
I think you clarified it very well :) My motivation for suggesting the
additional complexity is
On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield wrote:
>
> I was thinking the file format must satisfy one of two conditions:
> 1. Exactly one dictionarybatch per encoded column
> 2. DictionaryBatches are interleaved correctly.
Could you clarify? In the first case, there is no issue with
dictio
I was thinking the file format must satisfy one of two conditions:
1. Exactly one dictionarybatch per encoded column
2. DictionaryBatches are interleaved correctly.
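A minimal sketch of how a reader might check those two conditions, assuming a simplified message model that is not part of Arrow: each message is a hypothetical tuple, either ("dictionary", dict_id) or ("record", [dict_ids used]). "Interleaved correctly" is read here as "every record batch is preceded by a DictionaryBatch for each dictionary id it references", which is only one possible reading.

```python
# Hypothetical message model (not the Arrow API):
#   ("dictionary", dict_id)           -> a DictionaryBatch for that id
#   ("record", [dict_id, dict_id...]) -> a record batch using those ids

def one_dictionary_per_column(messages, encoded_ids):
    """Condition 1: exactly one DictionaryBatch per encoded column."""
    counts = {dict_id: 0 for dict_id in encoded_ids}
    for kind, payload in messages:
        if kind == "dictionary":
            counts[payload] = counts.get(payload, 0) + 1
    return all(count == 1 for count in counts.values())


def correctly_interleaved(messages):
    """Condition 2 (assumed meaning): each record batch comes after a
    DictionaryBatch for every dictionary id it references."""
    seen = set()
    for kind, payload in messages:
        if kind == "dictionary":
            seen.add(payload)
        elif kind == "record" and not set(payload) <= seen:
            return False
    return True


def file_is_valid(messages, encoded_ids):
    """A file is accepted if it satisfies either condition."""
    return (one_dictionary_per_column(messages, encoded_ids)
            or correctly_interleaved(messages))


single = [("dictionary", 0), ("record", [0]), ("record", [0])]
restated = [("dictionary", 0), ("record", [0]), ("dictionary", 0), ("record", [0])]
broken = [("record", [0]), ("dictionary", 0), ("dictionary", 0)]
```

Under these assumptions, `single` passes condition 1, `restated` passes only condition 2, and `broken` passes neither.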
On Tuesday, August 27, 2019, Wes McKinney wrote:
> On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote:
> >
> >
> > Le 27/08/20
On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote:
>
>
> Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > So the current situation we have right now in C++ is that if we tried
> > to create an IPC stream from a sequence of record batches that don't
> > all have the same dictionary, we'd run in
Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> So the current situation we have right now in C++ is that if we tried
> to create an IPC stream from a sequence of record batches that don't
> all have the same dictionary, we'd run into two scenarios:
>
> * Batches that either have a prefix of a p
So the current situation we have right now in C++ is that if we tried
to create an IPC stream from a sequence of record batches that don't
all have the same dictionary, we'd run into two scenarios:
* Batches that either have a prefix of a prior-observed dictionary, or
the prior dictionary is a pre
I'm not sure what you mean by record-in-dictionary-id, so it is possible
this is a solution that I just don't understand :)
The only two references to dictionary IDs that I could find are one in
schema.fbs [1], which is attached to a column in a schema, and the one
referenced above in DictionaryBatch
Wow, you've shown how little I've thought about Arrow dictionaries for a
while. I thought we had a dictionary id and a record-in-dictionary-id.
Wouldn't that approach make more sense? Does no one do this today? (We
frequently use compound values for this type of scenario...)
On Sat, Aug 10, 2019 a
Reading data from two different parquet files sequentially with different
dictionaries for the same column. This could be handled by re-encoding
data but that seems potentially sub-optimal.
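To make the re-encoding cost concrete, here is a rough sketch in plain Python (no Arrow calls; the function name `unify` and the chunk layout are illustrative assumptions). Each chunk is a (dictionary, indices) pair for the same column, as might come from two files. Re-encoding builds one unified dictionary and has to rewrite every index in every chunk, which is the work a replacement dictionary mid-stream would avoid.

```python
def unify(chunks):
    """Re-encode chunks of (dictionary, indices) against one shared dictionary.

    Returns (unified_dictionary, remapped_indices_per_chunk).
    """
    unified = []      # values in first-seen order
    position = {}     # value -> index in the unified dictionary
    remapped = []
    for dictionary, indices in chunks:
        # Map each old dictionary slot to its slot in the unified dictionary.
        mapping = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            mapping.append(position[value])
        # Every index in the chunk must be rewritten.
        remapped.append([mapping[i] for i in indices])
    return unified, remapped


chunks = [
    (["a", "b"], [0, 1, 0]),   # column as encoded in the first file
    (["b", "c"], [1, 0, 0]),   # same column, different dictionary
]
unified, remapped = unify(chunks)
```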
On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau wrote:
> What situation are you anticipating where you're goi
What situation are you anticipating where you're going to be restating ids mid
stream?
On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield
wrote:
> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true". I might have missed it in the
> specification, but I couldn'
I should add that Option #1 above would be my preference, even though it
adds some complications (especially for the file format).
On Sat, Aug 10, 2019 at 12:12 AM Micah Kornfield
wrote:
> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true". I might have
The IPC specification [1] defines behavior when isDelta on a
DictionaryBatch [2] is "true". I might have missed it in the
specification, but I couldn't find the interpretation for what the expected
behavior is when isDelta=false and two dictionary batches with the
same ID are sent.
It seems
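
For illustration only, a small sketch of the reader-side state involved. The class and method names are hypothetical and this is not the Arrow implementation. A batch with isDelta=true appends to the existing dictionary. For isDelta=false with an ID that was already sent, the sketch assumes replacement, which is just one possible interpretation of the case being asked about.

```python
class DictionaryState:
    """Tracks the current dictionary for each dictionary ID (hypothetical)."""

    def __init__(self):
        self.dictionaries = {}

    def on_dictionary_batch(self, dict_id, values, is_delta):
        if is_delta:
            # Delta: append the new values to the existing dictionary.
            if dict_id not in self.dictionaries:
                raise KeyError(f"delta for unknown dictionary id {dict_id}")
            self.dictionaries[dict_id].extend(values)
        else:
            # ASSUMPTION: a non-delta batch with a known ID replaces the
            # dictionary. Rejecting it as an error is another possibility.
            self.dictionaries[dict_id] = list(values)

    def decode(self, dict_id, indices):
        """Resolve indices against the dictionary currently in effect."""
        dictionary = self.dictionaries[dict_id]
        return [dictionary[i] for i in indices]


state = DictionaryState()
state.on_dictionary_batch(0, ["a", "b"], is_delta=False)
first = state.decode(0, [1, 0])
state.on_dictionary_batch(0, ["c"], is_delta=True)
after_delta = list(state.dictionaries[0])
state.on_dictionary_batch(0, ["x", "y"], is_delta=False)
second = state.decode(0, [1, 0])
```

Under the replacement assumption, the same indices decode to different values before and after the second non-delta batch, so the order of dictionary batches relative to record batches matters.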