Wes McKinney created ARROW-39:
---------------------------------

             Summary: C++: Logical chunked arrays / columns: conforming to a 
fixed chunk size
                 Key: ARROW-39
                 URL: https://issues.apache.org/jira/browse/ARROW-39
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Wes McKinney


Implementing algorithms on large arrays assembled in physical chunks is 
problematic if:

- The chunks are not all the same size (except possibly the last chunk, which 
may be smaller). Otherwise, retrieving a particular element is in general an 
O(log num_chunks) operation

- The chunk size is not a power of 2. Computing an integer modulus with a 
non-power-of-2 divisor requires more clock cycles (in other words, {{i % p}} 
is much more expensive to compute than {{i & (p - 1)}}, but the latter only 
works if p is a power of 2)
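To illustrate the point above, here is a minimal sketch (the names and the 64K chunk size are illustrative assumptions, not Arrow API): with a power-of-2 chunk size, locating an element reduces to a shift and a mask rather than a division, a modulus, or a binary search over chunk boundaries.

```cpp
#include <cassert>
#include <cstdint>

// Assumed fixed chunk size of 64K values per chunk (hypothetical constant).
constexpr int64_t kChunkSizeLog2 = 16;
constexpr int64_t kChunkSize = int64_t(1) << kChunkSizeLog2;  // 65536

// Which chunk holds logical element i: a single shift, O(1).
inline int64_t ChunkIndex(int64_t i) { return i >> kChunkSizeLog2; }

// Offset of element i within its chunk: a single mask, equivalent to
// i % kChunkSize but cheaper, valid only because kChunkSize is a power of 2.
inline int64_t ChunkOffset(int64_t i) { return i & (kChunkSize - 1); }
```

Both operations are branch-free and constant-time, which is what makes a fixed power-of-2 chunk size attractive for element access in inner loops.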

Most of the Arrow data adapters will either produce contiguous data (1 chunk, 
so chunking is not an issue) or use a regular chunk size, so this isn't an 
immediate concern, but we should consider making it a contract of any data 
structures dealing in multiple arrays. 

In general, it would be preferable to reorganize memory into either a regular 
chunk size (like 64K values per chunk) or a contiguous memory region. I would 
prefer for the moment not to invest significant energy in writing algorithms 
for data with irregular chunk sizes. 
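For contrast, a sketch of what element lookup costs with irregular chunks (the struct and function names here are hypothetical, not Arrow API): each access needs a binary search over the cumulative chunk boundaries, i.e. O(log num_chunks) per element rather than O(1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Result of resolving a logical index against an irregularly chunked array.
struct ChunkLocation {
  int64_t chunk;
  int64_t offset;
};

// `offsets` holds the cumulative start index of each chunk plus the total
// length as a final sentinel, e.g. {0, 100, 4196, 5000} for three chunks.
// Resolving element i requires a binary search: O(log num_chunks).
ChunkLocation Locate(const std::vector<int64_t>& offsets, int64_t i) {
  auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
  int64_t chunk = (it - offsets.begin()) - 1;
  return {chunk, i - offsets[chunk]};
}
```

The per-element search is exactly the overhead that a one-time rechunking into a regular (ideally power-of-2) chunk size would amortize away.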



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)