Re: Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

simba nyatsanga Mon, 22 Jan 2018 14:24:07 -0800

Great! Thanks Wes. It's really interesting to see a concerted effort to
convert language-specific implementations of common data structures into a
common memory layout that can be consumed by another language (HashMap in
Java, Hash in Ruby, etc.). Excited to see how the API evolves in this
regard.


On Mon, 22 Jan 2018 at 23:54 Wes McKinney <wesmck...@gmail.com> wrote:

> Note we have https://issues.apache.org/jira/browse/ARROW-1705 (and
> maybe some other JIRAs, I'd have to go digging) about improving
> support for converting Python dicts to the right Arrow memory layout.
>
> - Wes
>
> On Mon, Jan 22, 2018 at 4:50 PM, simba nyatsanga <simnyatsa...@gmail.com>
> wrote:
> > Hi Uwe,
> >
> > Thank you very much for the detailed explanation. I have a much better
> > understanding now.
> >
> > Cheers
> >
> > On Mon, 22 Jan 2018 at 19:37 Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> >> Hello Simba,
> >>
> >> find the answers inline.
> >>
> >> On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote:
> >> > Hi Everyone,
> >> >
> >> > I've got two questions that I'd like help with:
> >> >
> >> > 1. Pandas and numpy arrays can handle multiple types in a sequence,
> >> > e.g. a float and a string, by using the dtype=object. From what I
> >> > gather, Arrow arrays enforce a uniform type depending on the type of
> >> > the first encountered element in a sequence. This looks like a
> >> > deliberate choice and I'd like to get a better understanding of the
> >> > reason for ensuring this conformity. Does making the data structure's
> >> > type deterministic allow for efficient pointer arithmetic when reading
> >> > contiguous blocks and thus make reading performant?
> >>
> >> Like NumPy arrays, Arrow arrays are statically typed. In the case of
> >> NumPy you simply have the limitation that the type system can only
> >> represent a small number of types. In particular, all of these types are
> >> primitive and allow no nesting (e.g. you cannot implement a NumPy array
> >> of NumPy arrays of varying lengths). In NumPy you can work around this
> >> limitation by using the object type. This simply means you have an
> >> array of (64-bit) pointers to Python objects about which NumPy knows
> >> nothing. In the most simplistic form, you could achieve the same
> >> behaviour by allocating an INT64 Arrow array, increasing the reference
> >> count of each object and then storing the pointers to the objects in
> >> this array. While this may work, please don't use this kind of hack.
> >>
> >> The main concept of Arrow is to define data structures that can be
> >> exchanged between applications that are implemented in different
> >> languages and ecosystems. Storing Python objects in them goes somewhat
> >> against this use case (we might support this one day for convenience in
> >> Python but it will be discouraged). In Arrow we have the concept of a
> >> UNION type, i.e. we can specify that a row can contain an object of one
> >> of a fixed set of types. This gives you nearly the same abilities you
> >> have with the object type, with the improvement that you can also pass
> >> this data to another Arrow consumer in any language and it can cope
> >> with the data. But this comes at some cost in usability: you need to
> >> specify the types that occur in the array (this is also an "at least
> >> for now"; we may write some auto-detection in the future but this is a
> >> bit of work).
> >>
> >> > 2. Pandas and numpy can also handle dictionary elements using the
> >> > dtype=object while pyarrow arrays don't. I'd like to understand the
> >> > reasoning behind the choice here as well.
> >>
> >> This is again due to Arrow being more statically typed than just
> >> supporting pointers to generic objects. For this we currently have a
> >> STRUCT type in Arrow, in which each row holds a set of named entries
> >> where each entry has a fixed type (but the types can differ between
> >> entries). Alternatively we also have a MAP<KEY, VALUE> type (that
> >> probably needs some more specification work). Here you store data as
> >> you do in a typical Python dictionary but KEY and VALUE are fixed
> >> types. Depending on your data, either STRUCT or MAP might be the
> >> correct type to use.
> >>
> >> As we are talking about columnar data in the Arrow context, we expect
> >> the data in a column to be of the same or a similar type in every row.
> >>
> >> Uwe
> >>
>
