[ https://issues.apache.org/jira/browse/ARROW-39?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657140#comment-17657140 ]
Rok Mihevc commented on ARROW-39:
---------------------------------

This issue has been migrated to [issue #15413|https://github.com/apache/arrow/issues/15413] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
> ----------------------------------------------------------------------
>
>                 Key: ARROW-39
>                 URL: https://issues.apache.org/jira/browse/ARROW-39
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 0.3.0
>
> Implementing algorithms on large arrays assembled in physical chunks is
> problematic if:
> - The chunks are not all the same size (except possibly the last chunk, which
>   may be smaller). In that case, retrieving a particular element generally
>   requires an O(log num_chunks) search over the chunk boundaries.
> - The chunk size is not a power of 2. Computing an integer modulus with a
>   non-power-of-2 divisor requires more clock cycles; in other words, {{i % p}}
>   is much more expensive to compute than {{i & (p - 1)}}, but the latter only
>   works if p is a power of 2.
> Most of the Arrow data adapters will either produce contiguous data (1 chunk,
> so chunking is not an issue) or a regular chunk size, so this isn't as much of
> an immediate concern, but we should consider making it a contract of any data
> structures dealing in multiple arrays.
> In general, it would be preferable to reorganize memory into either a regular
> chunk size (like 64K values per chunk) or a contiguous memory region. I would
> prefer, for the moment, not to invest significant energy in writing
> algorithms for data with irregular chunk sizes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)