Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-18 Thread Robert Nishihara
Hi Mike, 1. I think yes, though we'd need to turn off the automatic LRU eviction that happens when the store fills up. 3. I think there are some edge cases and it depends what is in your DataFrame, but at least if it consists of numerical data then the two representations should use the same unde

Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-18 Thread Mike Sam
I am interested in implementing an Arrow-based persistent cache store, and I have a few related questions: 1. Is it possible to use just Plasma for this goal? (My understanding is that it is not persistable.) If not, what is the recommended way to do so? 2. Is Feather the better file f

[jira] [Created] (ARROW-2011) Allow setting the pickler to use in pyarrow serialization.

2018-01-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2011: --- Summary: Allow setting the pickler to use in pyarrow serialization. Key: ARROW-2011 URL: https://issues.apache.org/jira/browse/ARROW-2011 Project: Apache Arrow

[jira] [Created] (ARROW-2010) [C++] Compiler warnings with CHECKIN warning level in ORC adapter

2018-01-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2010: --- Summary: [C++] Compiler warnings with CHECKIN warning level in ORC adapter Key: ARROW-2010 URL: https://issues.apache.org/jira/browse/ARROW-2010 Project: Apache Arrow

Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Great, thank you for the explanation - it makes so much sense. I have a use case where once I've converted an Arrow table back to pandas I then convert it into a dictionary (with to_dict()). This dictionary then gets JSON serialised and sent over the wire for display on the client side. I encounter
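The message above is cut off, so the exact problem encountered is not known. One assumption that fits the described pipeline (Arrow table to pandas, then to_dict(), then JSON) is that the standard-library json encoder does not accept NumPy arrays. A minimal sketch of one possible workaround is a `default` hook that converts any value exposing a tolist() method into a plain list. The helper name to_jsonable is hypothetical, and array.array stands in for numpy.ndarray here (both provide tolist()) so the sketch runs without third-party packages.

```python
import json
from array import array


def to_jsonable(obj):
    """Hypothetical `default` hook for json.dumps.

    Converts array-like values (anything with a callable tolist(), such as
    numpy.ndarray or array.array) into plain Python lists.
    """
    tolist = getattr(obj, "tolist", None)
    if callable(tolist):
        return tolist()
    # Mirror the encoder's own behaviour for anything we cannot convert.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Stand-in for one record of df.to_dict(); array.array plays the role of the
# NumPy array that to_pandas() would return for a list-typed column.
record = {"name": "a", "values": array("d", [1.0, 2.0, 3.0])}

# json.dumps calls to_jsonable only for values it cannot encode itself.
payload = json.dumps(record, default=to_jsonable)
print(payload)
```

An alternative under the same assumption would be to convert the affected column back to lists before calling to_dict(), at the cost of an extra copy of the data.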

Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread Wes McKinney
Upon converting to Arrow, the information about whether the original input was a list or ndarray was lost. So any kind of sequence ends up as an Arrow List type. When converting back to pandas, we could return either a list or an ndarray. Returning ndarray is faster and much more memory efficient;

Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Hi Wes, Great! Thanks for the pointer. From what I gather this is a fundamental and deliberate design decision. Would I be correct in saying the memory footprint and access speed of a NumPy array compared to that of a Python list is the reason why the conversion is done? Kind Regards Simba On Th

Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread Wes McKinney
hi Simba, Yes -- Arrow list types are converted to NumPy arrays when converting back to pandas with to_pandas(...). This conversion happens in C++ code in https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L541 - Wes On Thu, Jan 18, 2018 at 1:26 PM, simba nyatsan

PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Good day everyone, I noticed what looks like type inference happening after persisting a pandas DataFrame where one of the column values is a list. When I load up the DataFrame again and do df.to_dict(), the value is no longer a list but a numpy array. I dug through functions in the pandas_compat.