I don't think you are missing anything.  The Parquet encoding is baked
into the data on disk, so re-encoding at some stage is inevitable.
Re-encoding in Python like you are doing is going to be inefficient;
I think you will want to do the re-encoding in C++.  Unfortunately, I
don't think we have a kernel function for this.

I think one could be created.  The input would be the current
dictionary and an incoming dictionary array.  The output would be a
newly encoded array and a delta dictionary.  The dictionary builder is
already implemented in this way [1], so the logic is probably already
lying around.
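
For reference, here's a rough Python sketch of the shape such a kernel
could take.  This is not an existing compute function (the name and
helpers are made up), and a real implementation would remap indices
directly in C++ rather than decoding to values first:

    import pyarrow as pa
    import pyarrow.compute as pc

    def reencode_with_delta(current_dictionary, incoming):
        # `incoming` is a DictionaryArray whose dictionary may differ
        # from `current_dictionary`; decode it back to plain values
        # (a real kernel would remap the indices without decoding).
        values = incoming.dictionary.take(incoming.indices)
        # Values not yet present in the current dictionary form the delta.
        seen = pc.is_in(values, value_set=current_dictionary)
        delta = pc.unique(values.filter(pc.invert(seen))).drop_null()
        # Combined dictionary is current + delta, in that order, so the
        # delta can simply be appended on the reader side.
        combined = pa.concat_arrays([current_dictionary, delta])
        indices = pc.index_in(values, value_set=combined)
        return pa.DictionaryArray.from_arrays(indices, combined), delta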

If such a kernel function were implemented, it would be nice to extend
the IPC file writer to support using it internally.  Then it would be
invisible to the general Arrow user (they could just use the record
batch file writer with this option turned on).

On Tue, May 31, 2022 at 8:58 AM Niklas Bivald <niklas.biv...@enplore.com> wrote:
>
> Hi,
>
> Background:
> I have a need to optimize read speed for few-column lookups in large
> datasets. Currently I keep the data in Plasma for fast reads, but Plasma
> is cumbersome to manage when the data changes frequently (and it “locks”
> the RAM). Instead I’m trying to figure out a fast-enough approach to
> reading columns from an Arrow file on disk (100-200ms range). Reading
> from an Arrow file appears to be fast enough (even though I unfortunately
> have string values, so zero-copy is out of the question). However, to
> read an Arrow file I first need to generate it.
>
> Current solution:
> I’m re-writing a Parquet file to an Arrow file, row group by row group
> (the dataset is bigger than RAM). Initially I had a naive implementation
> that read a batch from the Parquet file (using pyarrow.parquet) and tried
> to write it to a RecordBatchFileWriter. However, that quickly led to:
>
> pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC
> file format. Arrow IPC files only support a single non-delta dictionary for
> a given field across all batches.[1]
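>
> (The naive loop was roughly of this shape, with file and column names
> simplified; the full version is in the gist linked below:)
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     # The relevant columns are dictionary-encoded (categorical) when read.
>     pf = pq.ParquetFile("data.parquet", read_dictionary=["my_column"])
>     batches = pf.iter_batches()
>     first = next(batches)
>     with pa.ipc.new_file("data.arrow", first.schema) as writer:
>         writer.write_batch(first)
>         for batch in batches:
>             writer.write_batch(batch)  # raises once a batch's dictionary differs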
>
> While trying to fix this (depending on my solution), I had a question:
>
>
>    1. Is there a way, when creating the pyarrow schema, to define what
>    categorical values exist in the dictionaries? Or to force a specific
>    dictionary when using `pa.DictionaryArray.from_pandas`?
>
>
> Right now I use `pa.DictionaryArray.from_arrays` with the same dictionary
> values as an array, but it’s pretty cumbersome since I basically - per
> column per row group - need to convert the column values into the indices.
> Naive implementation:
>
> import pandas as pd
> import pyarrow as pa
>
> def create_dictionary_array_indices(column_name, arrow_array):
>     # dictionary_values maps column name -> list of known dictionary values
>     global dictionary_values
>     values = arrow_array.to_pylist()
>     indices = []
>     for value in values:
>         # Treat None, empty strings and NaN (value != value) as nulls.
>         if not value or value != value:
>             indices.append(None)
>         else:
>             # Linear scan per value; this is the slow part.
>             indices.append(dictionary_values[column_name].index(value))
>     indices = pd.array(indices, dtype=pd.Int32Dtype())
>     return pa.DictionaryArray.from_arrays(
>         indices, dictionary_values[column_name]
>     )
>
> I also tried using pa.DictionaryArray.from_pandas with a Series, but even
> though the Series had the same dictionary content, I didn’t manage to get
> it to generate the same Dictionary (it still gave "Dictionary replacement
> detected…").
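>
> Perhaps the trick is to build the Series from a pandas Categorical with an
> explicit, fixed category list, so that every batch carries an identical
> dictionary? Untested sketch, reusing the dictionary_values/values names
> from the code above:
>
>     import pandas as pd
>     import pyarrow as pa
>
>     categories = dictionary_values[column_name]  # fixed, known up front
>     series = pd.Series(pd.Categorical(values, categories=categories))
>     # Converting a categorical Series gives a DictionaryArray whose
>     # dictionary should be exactly `categories`, in the same order.
>     arr = pa.array(series)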
>
> But does this process make sense? Am I missing something? I can probably
> speed it up (I need to figure out how to vectorize looking up indices in
> an array), but before spending a lot of time on that I just wanted to
> check whether the approach is sane at all. Full sample code (which works,
> but is super slow) is at
> https://gist.github.com/bivald/f8e0a7625af2eabbf7c5fa055da91d61
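>
> For the vectorization, something along these lines is what I have in mind
> (untested sketch, using pyarrow.compute.index_in in place of the
> per-value list.index() lookup):
>
>     import pyarrow as pa
>     import pyarrow.compute as pc
>
>     def create_dictionary_array_vectorized(column_name, arrow_array):
>         # Fixed list of dictionary values, as in the naive version.
>         dictionary = pa.array(dictionary_values[column_name])
>         # index_in returns each value's position in the value set
>         # (null for nulls and for values not present in the set).
>         indices = pc.index_in(arrow_array, value_set=dictionary)
>         return pa.DictionaryArray.from_arrays(indices, dictionary)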
>
> Regards,
> Niklas
>
> [1] Using the IPC stream format (RecordBatchStreamWriter) instead works,
> but I got slower read times with it... though I might need to redo my
> timings
> [2] This is a continuation of
> https://stackoverflow.com/questions/72438916/batch-by-batch-convert-parquet-to-arrow-with-categorical-values-arrow-ipc-files
