Great! Thanks Wes. It's really great and interesting to see a concerted effort to have a conversion from a language specific implementation of common data structures into a common memory layout that can be consumed by another language (HashMap in Java/ Hash in Ruby etc). Excited to see how the API evolves in this regard.
On Mon, 22 Jan 2018 at 23:54 Wes McKinney <wesmck...@gmail.com> wrote: > Note we have https://issues.apache.org/jira/browse/ARROW-1705 (and > maybe some other JIRAs, I'd have to go digging) about improving > support for converting Python dicts to the right Arrow memory layout. > > - Wes > > On Mon, Jan 22, 2018 at 4:50 PM, simba nyatsanga <simnyatsa...@gmail.com> > wrote: > > Hi Uwe, > > > > Thank you very much for the detailed explanation. I have a much better > > understanding now. > > > > Cheers > > > > On Mon, 22 Jan 2018 at 19:37 Uwe L. Korn <uw...@xhochy.com> wrote: > > > >> Hello Simba, > >> > >> find the answers inline. > >> > >> On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote: > >> > Hi Everyone, > >> > > >> > I've got two questions that I'd like help with: > >> > > >> > 1. Pandas and numpy arrays can handle multiple types in a sequence > eg. a > >> > float and a string by using the dtype=object. From what I gather, > Arrow > >> > arrays enforce a uniform type depending on the type of the first > >> > encountered element in a sequence. This looks like a deliberate choice > >> and > >> > I'd like to get a better understanding of the reason for ensuring this > >> > conformity. Does making the data structure's type deterministic allow > for > >> > efficient pointer arithmetic when reading contiguous blocks and thus > >> making > >> > reading performant? > >> > >> As NumPy arrays, Arrow arrays are statically typed. In the case of NumPy > >> you simply have the limitation that the type system can only represent a > >> small number of types. Especially all these types are primitive and > allow > >> no nesting (e.g. you cannot implement a NumPy array of NumPy arrays of > >> varying lengths). In NumPy you have the way to work around this > limitation > >> by using the object type. This simply means you have any array of > (64bit) > >> pointers to Python objects of which NumPy does know nothing. In the most > >> simplistic form, you could achieve the same behaviour by allocating an > >> INT64 Arrow Array, increase the reference count of each object and then > >> store the pointers of the object in this array. While this may work, > please > >> don't use this kind of hack. > >> > >> The main concept of Arrow is to define data structures that can be > >> exchanged between applications that are implemented in different > languages > >> and ecosystems. Storing Python objects in them is a bit against its use > >> case (we might support this one day for convenience in Python but it > will > >> be discouraged). In Arrow we have the concept of a UNION type, i.e. we > can > >> specify that a row can contain an object of a fixed set of types. This > will > >> bring you nearly the same abilities you have with the object type but > with > >> the improvement that you could also pass this data to another Arrow > >> consumer of any language and it can cope with the data. But this also > comes > >> a bit at the cost of usability: You need to specify the types that > occur in > >> the array (this one is also an "at least for", we may write some > >> auto-detection in the future but this a bit of work). > >> > >> > 2. Pandas and numpy can also handle dictionary elements using the > >> > dtype=object while pyarrow arrays don't. I'd like to understand the > >> > reasoning behind the choice here as well. > >> > >> This is again to due being more statically typed than just supporting > >> pointers to generic objects. For this we actually have at the moment a > >> STRUCT type in Arrow that supports in each row we have a set of named > >> entries where each entry has a fixed type (but the types can be > different > >> between entries). Alternatively we also have a MAP<KEY, VALUE> type > (that > >> probably needs some more specification work). Here you store data as > you do > >> in a typical Python dictionary but KEY and VALUE are fixed types. > Depending > >> on your data either STRUCT or MAP might be the correct types to use. > >> > >> As we talk in general about columnar data in the Arrow context, we > expect > >> that the data in a column is of the same or a similar type in each row > of a > >> column. > >> > >> Uwe > >> >