[ https://issues.apache.org/jira/browse/ARROW-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-645: ------------------------------- Description: In very large schemas, due of the way we are flattening the field and buffer metadata in the RecordBatch: https://github.com/apache/arrow/blob/master/format/Message.fbs#L271 The cost to reconstruct / load a single array from a RecordBatch can be arbitrarily high. As an example, let's consider a schema: {code} f0: int32 f1: string ... omitting 999996 duplicate f999998: int32 f999999: string {code} Here, a record batch has 1 million fields, and in total 2.5 million buffers. The problem with this is: to select a single field out of a record batch, we have to inspect all types leading up to the field of interest to know how many {{FieldNode}} and {{Buffer}} metadata values will have occurred in the serialized metadata before that field's metadata appears. Solving this is a little bit tricky. One way would be to add optional "field position" and "buffer position" attributes to the {{Field}} table: https://github.com/apache/arrow/blob/master/format/Message.fbs#L188 So here, we would know that for the {{f1}} field, the field index is 1 and the buffer index is 2. Because a string has 3 buffers associated with it, we would know to select buffers in slots 2, 3, 4 to reconstruct the vector container. Let me know if the problem is not clear, and any other ideas about solutions was: In very large schemas, due of the way we are flattening the field and buffer metadata in the RecordBatch: https://github.com/apache/arrow/blob/master/format/Message.fbs#L271 The cost to reconstruct / load a single array from a RecordBatch can be arbitrarily high. As an example, let's consider a schema: {code} f0: int32 f1: string ... omitting 999996 duplicate f999998: int32 f999999: string {code} Here, a record batch has 1 million fields, and in total 1.5 million buffers. The problem with this is: to select a single field out of a record batch, we have to inspect all types leading up to the field of interest to know how many {{FieldNode}} and {{Buffer}} metadata values will have occurred in the serialized metadata before that field's metadata appears. Solving this is a little bit tricky. One way would be to add optional "field position" and "buffer position" attributes to the {{Field}} table: https://github.com/apache/arrow/blob/master/format/Message.fbs#L188 So here, we would know that for the {{f1}} field, the field index is 1 and the buffer index is 2. Because a string has 3 buffers associated with it, we would know to select buffers in slots 2, 3, 4 to reconstruct the vector container. Let me know if the problem is not clear, and any other ideas about solutions > [Format] Mitigating the cost of random access in "wide" record batches > ---------------------------------------------------------------------- > > Key: ARROW-645 > URL: https://issues.apache.org/jira/browse/ARROW-645 > Project: Apache Arrow > Issue Type: New Feature > Components: Format > Reporter: Wes McKinney > > In very large schemas, due of the way we are flattening the field and buffer > metadata in the RecordBatch: > https://github.com/apache/arrow/blob/master/format/Message.fbs#L271 > The cost to reconstruct / load a single array from a RecordBatch can be > arbitrarily high. > As an example, let's consider a schema: > {code} > f0: int32 > f1: string > ... omitting 999996 duplicate > f999998: int32 > f999999: string > {code} > Here, a record batch has 1 million fields, and in total 2.5 million buffers. > The problem with this is: to select a single field out of a record batch, we > have to inspect all types leading up to the field of interest to know how > many {{FieldNode}} and {{Buffer}} metadata values will have occurred in the > serialized metadata before that field's metadata appears. > Solving this is a little bit tricky. One way would be to add optional "field > position" and "buffer position" attributes to the {{Field}} table: > https://github.com/apache/arrow/blob/master/format/Message.fbs#L188 > So here, we would know that for the {{f1}} field, the field index is 1 and > the buffer index is 2. Because a string has 3 buffers associated with it, we > would know to select buffers in slots 2, 3, 4 to reconstruct the vector > container. > Let me know if the problem is not clear, and any other ideas about solutions -- This message was sent by Atlassian JIRA (v6.3.15#6346)