Hello Simba,

please find the answers inline.

On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote:
> Hi Everyone,
> 
> I've got two questions that I'd like help with:
> 
> 1. Pandas and numpy arrays can handle multiple types in a sequence eg. a
> float and a string by using the dtype=object. From what I gather, Arrow
> arrays enforce a uniform type depending on the type of the first
> encountered element in a sequence. This looks like a deliberate choice and
> I'd like to get a better understanding of the reason for ensuring this
> conformity. Does making the data structure's type deterministic allow for
> efficient pointer arithmetic when reading contiguous blocks and thus making
> reading performant?

Like NumPy arrays, Arrow arrays are statically typed. In the case of NumPy you 
simply have the limitation that the type system can only represent a small 
number of types. In particular, all of these types are primitive and allow no 
nesting (e.g. you cannot have a NumPy array of NumPy arrays of varying 
lengths). In NumPy you can work around this limitation by using the object 
dtype. This simply means you have an array of (64-bit) pointers to Python 
objects about which NumPy knows nothing. In the most simplistic form, you 
could achieve the same behaviour by allocating an INT64 Arrow array, 
increasing the reference count of each object and then storing the objects' 
pointers in this array. While this may work, please don't use this kind of 
hack.
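
To make the difference concrete, here is a minimal NumPy/pyarrow sketch (the 
exact exception class raised for mixed input may vary between pyarrow 
versions, so the except clause is deliberately broad):

    import numpy as np
    import pyarrow as pa

    # NumPy with dtype=object: an array of pointers to arbitrary Python objects.
    mixed = np.array([1.5, "a string", {"even": "a dict"}], dtype=object)

    # Arrow infers one static type for the whole array ...
    floats = pa.array([1.5, 2.5, None])   # float64 array with a null slot

    # ... and rejects mixed input instead of silently falling back to pointers.
    try:
        pa.array([1.5, "a string"])
    except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
        print("mixed types rejected:", exc)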

The main concept of Arrow is to define data structures that can be exchanged 
between applications implemented in different languages and ecosystems. 
Storing Python objects in them goes a bit against this use case (we might 
support it one day for convenience in Python, but it will be discouraged). In 
Arrow we have the concept of a UNION type, i.e. we can specify that a row can 
contain an object from a fixed set of types. This gives you nearly the same 
abilities as the object dtype, with the improvement that you can also pass 
this data to an Arrow consumer in any other language and it can cope with the 
data. But this also comes at a bit of a cost in usability: you need to specify 
up front the types that occur in the array (at least for now; we may add some 
auto-detection in the future, but that is a bit of work).
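
As a rough sketch of what a dense UNION looks like from Python (the 
UnionArray construction APIs have varied between pyarrow versions, so treat 
this as illustrative rather than definitive):

    import pyarrow as pa

    # Dense UNION of int64 and string: each slot is tagged with the child
    # type it uses and an offset into that child array.
    type_ids = pa.array([0, 1, 0], type=pa.int8())    # which child each row uses
    offsets = pa.array([0, 0, 1], type=pa.int32())    # index into that child
    ints = pa.array([5, 7], type=pa.int64())
    strs = pa.array(["hello"], type=pa.string())

    union_arr = pa.UnionArray.from_dense(type_ids, offsets, [ints, strs])
    # Logical rows: 5 (int64), "hello" (string), 7 (int64) -- one column,
    # but every consumer knows the fixed set of possible types up front.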

> 2. Pandas and numpy can also handle dictionary elements using the
> dtype=object while pyarrow arrays don't. I'd like to understand the
> reasoning behind the choice here as well.

This is again due to Arrow being more statically typed rather than just 
supporting pointers to generic objects. For this we currently have a STRUCT 
type in Arrow, where each row contains a set of named entries and each entry 
has a fixed type (but the types can differ between entries). Alternatively we 
also have a MAP<KEY, VALUE> type (which probably needs some more 
specification work). There you store data as you do in a typical Python 
dictionary, but KEY and VALUE have fixed types. Depending on your data, 
either STRUCT or MAP might be the correct type to use.
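
A small sketch of both from Python (pa.map_ may not be available in older 
pyarrow releases, so the MAP half is the more speculative part):

    import pyarrow as pa

    # STRUCT: every row has the same named fields, each with a fixed type.
    struct_arr = pa.array(
        [{"name": "a", "value": 1}, {"name": "b", "value": 2}],
        type=pa.struct([("name", pa.string()), ("value", pa.int64())]),
    )

    # MAP: each row holds arbitrary key/value pairs, but the KEY and VALUE
    # types are fixed for the whole column.
    map_arr = pa.array(
        [[("x", 1), ("y", 2)], [("z", 3)]],
        type=pa.map_(pa.string(), pa.int64()),
    )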

Since we generally talk about columnar data in the Arrow context, we expect 
the data in a column to be of the same or a similar type in every row.

Uwe
