[ https://issues.apache.org/jira/browse/ARROW-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489465#comment-17489465 ]
Jeroen van Straten edited comment on ARROW-15583 at 2/9/22, 11:07 AM: ---------------------------------------------------------------------- In the same vein, the Arrow C++ consumer assumes that the custom type anchor and type variation anchor sets are disjoint, because it uses the same vector for those. That is, if anchor 42 is used for a type, it can no longer be used for a type variation, and vice versa. AFAICT this is also not guaranteed by the Substrait specification. This is technically a different issue, but it might be easier to solve both at once. ETA: less important because outside of testing we're not a producer, but when allocating new anchors the counter starts at 0. I'm fairly certain however that, at least for type variations, anchor 0 is reserved for "undefined." The spec is currently unclear about this, but it's probably better to be safe than sorry and avoid using 0 for anchors altogether. was (Author: JIRAUSER282962): In the same vein, the Arrow C++ consumer assumes that the custom type anchor and type variation anchor sets are disjoint, because it uses the same vector for those. That is, if anchor 42 is used for a type, it can no longer be used for a type variation, and vice versa. AFAICT this is also not guaranteed by the Substrait specification. This is technically a different issue, but it might be easier to solve both at once. > [C++] The Substrait consumer could potentially use a massive amount of RAM if > the producer uses large anchors > ------------------------------------------------------------------------------------------------------------- > > Key: ARROW-15583 > URL: https://issues.apache.org/jira/browse/ARROW-15583 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Weston Pace > Priority: Major > Labels: substrait > > In Substrait a function is referred to by a "fully qualified name" which > consists of a URI and a function name. For example, the "add" function is > something like > {{https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml}}. > To avoid serializing these long names multiple times in the plan the > producer should pick an anchor value (an int32 in protobuf) and use that > everywhere (with a single lookup table at the top level of the plan). > To avoid map lookups the Arrow C++ consumer currently assumes that this > lookup table will be small enough it can be stored in a vector... > {noformat} > { > > "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add", > > "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#subtract" > } > {noformat} > However, this sort of assumes that a plan is going to use numbers like 0, 1, > 2, ... N to create N anchors. There is nothing that prevents a consumer from > using whatever numbers it wants (e.g. a pointer value). If the producer uses > a really large anchor value then the C++ Substrait consumer will create a > lookup table with a lot of blank values. This could lead to a lot of wasted > memory. > We could try and request the Substrait spec enfoce small anchors or we could > change the extension set handling in the C++ consumer to use an unordered_map. -- This message was sent by Atlassian Jira (v8.20.1#820001)