Hello Simba, please find the answers inline.
On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote:
> Hi Everyone,
>
> I've got two questions that I'd like help with:
>
> 1. Pandas and numpy arrays can handle multiple types in a sequence eg. a
> float and a string by using the dtype=object. From what I gather, Arrow
> arrays enforce a uniform type depending on the type of the first
> encountered element in a sequence. This looks like a deliberate choice and
> I'd like to get a better understanding of the reason for ensuring this
> conformity. Does making the data structure's type deterministic allow for
> efficient pointer arithmetic when reading contiguous blocks and thus making
> reading performant?

Like NumPy arrays, Arrow arrays are statically typed. In the case of NumPy, you simply have the limitation that its type system can only represent a small number of types. In particular, all of these types are primitive and allow no nesting (e.g. you cannot implement a NumPy array of NumPy arrays of varying lengths). In NumPy you can work around this limitation by using the object dtype. This simply means you have an array of (64-bit) pointers to Python objects about which NumPy knows nothing. In the most simplistic form, you could achieve the same behaviour by allocating an INT64 Arrow array, increasing the reference count of each object and then storing the pointers to the objects in this array. While this may work, please don't use this kind of hack. The main concept of Arrow is to define data structures that can be exchanged between applications implemented in different languages and ecosystems. Storing Python objects in them goes somewhat against that use case (we might support this one day for convenience in Python, but it will be discouraged).

In Arrow we instead have the concept of a UNION type, i.e. we can specify that a row can contain a value from a fixed set of types. This gives you nearly the same abilities as the object dtype, with the improvement that you can also pass the data to an Arrow consumer in any other language and it will be able to cope with it. But this also comes at a slight cost in usability: you need to specify the types that occur in the array up front (at least for now; we may add some auto-detection in the future, but that is a bit of work). A rough pyarrow sketch of a UNION array is appended below my sign-off.

> 2. Pandas and numpy can also handle dictionary elements using the
> dtype=object while pyarrow arrays don't. I'd like to understand the
> reasoning behind the choice here as well.

This is again due to Arrow being more statically typed than just supporting pointers to generic objects. For this we currently have a STRUCT type in Arrow, where each row has a set of named entries and each entry has a fixed type (but the types can differ between entries). Alternatively, we also have a MAP<KEY, VALUE> type (which probably needs some more specification work). There you store data as you would in a typical Python dictionary, but KEY and VALUE are fixed types. Depending on your data, either STRUCT or MAP might be the correct type to use. A small STRUCT sketch is appended below as well.

Since we generally deal with columnar data in the Arrow context, we expect the data in a column to be of the same or a similar type in every row.

Uwe
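P.S. As a rough illustration of the UNION answer, here is a minimal pyarrow sketch that builds a dense union array holding int64 and string values. Treat it as an assumption-laden sketch: it relies on a pyarrow version that exposes UnionArray.from_dense, and the exact printed output may differ between releases.

    import pyarrow as pa

    # A plain Arrow array cannot mix types; something like pa.array([1, "a"])
    # is rejected, unlike numpy.array([1, "a"], dtype=object).

    # Dense UNION: each row records which child array it comes from (type ids)
    # and the position inside that child (value offsets).
    type_ids = pa.array([0, 1, 0], type=pa.int8())        # row 0 -> ints, row 1 -> strs, row 2 -> ints
    value_offsets = pa.array([0, 0, 1], type=pa.int32())  # index within the chosen child
    ints = pa.array([1, 2], type=pa.int64())
    strs = pa.array(["a"], type=pa.string())

    arr = pa.UnionArray.from_dense(type_ids, value_offsets, [ints, strs])
    print(arr.type)  # a dense union of int64 and string
    print(arr)       # logically [1, "a", 2]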
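And a sketch for the STRUCT side of the dictionary question: with a recent pyarrow, a list of dictionaries whose keys have fixed, typed values is inferred as a struct array. The field names below are made up purely for illustration.

    import pyarrow as pa

    # Dictionaries with a fixed set of typed fields map to STRUCT rather
    # than to an "object" column of arbitrary Python dicts.
    records = [{"name": "alice", "score": 1.5},
               {"name": "bob", "score": 2.0}]

    arr = pa.array(records)   # inferred as struct<name: string, score: double>
    print(arr.type)
    print(arr.to_pylist())    # round-trips back to the Python dicts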