[ 
https://issues.apache.org/jira/browse/ARROW-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489465#comment-17489465
 ] 

Jeroen van Straten edited comment on ARROW-15583 at 2/9/22, 11:07 AM:
----------------------------------------------------------------------

In the same vein, the Arrow C++ consumer assumes that the custom type anchor 
and type variation anchor sets are disjoint, because it uses the same vector 
for those. That is, if anchor 42 is used for a type, it can no longer be used 
for a type variation, and vice versa. AFAICT this is also not guaranteed by the 
Substrait specification. This is technically a different issue, but it might be 
easier to solve both at once.

ETA: less important because outside of testing we're not a producer, but when 
allocating new anchors the counter starts at 0. I'm fairly certain however 
that, at least for type variations, anchor 0 is reserved for "undefined." The 
spec is currently unclear about this, but it's probably better to be safe than 
sorry and avoid using 0 for anchors altogether.


was (Author: JIRAUSER282962):
In the same vein, the Arrow C++ consumer assumes that the custom type anchor 
and type variation anchor sets are disjoint, because it uses the same vector 
for those. That is, if anchor 42 is used for a type, it can no longer be used 
for a type variation, and vice versa. AFAICT this is also not guaranteed by the 
Substrait specification. This is technically a different issue, but it might be 
easier to solve both at once.

> [C++] The Substrait consumer could potentially use a massive amount of RAM if 
> the producer uses large anchors
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15583
>                 URL: https://issues.apache.org/jira/browse/ARROW-15583
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>              Labels: substrait
>
> In Substrait a function is referred to by a "fully qualified name" which 
> consists of a URI and a function name.  For example, the "add" function is 
> something like 
> {{https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml}}.
>   To avoid serializing these long names multiple times in the plan the 
> producer should pick an anchor value (an int32 in protobuf) and use that 
> everywhere (with a single lookup table at the top level of the plan).
> To avoid map lookups the Arrow C++ consumer currently assumes that this 
> lookup table will be small enough it can be stored in a vector...
> {noformat}
> {
>   
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add";,
>   
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#subtract";
> }
> {noformat}
> However, this sort of assumes that a plan is going to use numbers like 0, 1, 
> 2, ... N to create N anchors.  There is nothing that prevents a consumer from 
> using whatever numbers it wants (e.g. a pointer value).  If the producer uses 
> a really large anchor value then the  C++ Substrait consumer will create a 
> lookup table with a lot of blank values.  This could lead to a lot of wasted 
> memory.
> We could try and request the Substrait spec enfoce small anchors or we could 
> change the extension set handling in the C++ consumer to use an unordered_map.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to