Le 06/10/2023 à 17:54, Felipe Oliveira Carvalho a écrit :
Hello,

Since existing C Data Interface integrations sometimes don't parse beyond
the first `l` (or `L`) I'm going to start a new [VOTE] thread with Dewey's
suggestion:

Regardless of which format string we choose for ListView, a bug should certainly be reported to these implementations. A robust implementation should ensure the imported format string is conformant, otherwise there is a risk that the exporter actually meant something else.

Regards

Antoine.



+vl and +vL

If anyone objects to that and has a different suggestion, reply here so I
don't have to spam the list with too many new threads.

--
Felipe

On Thu, Oct 5, 2023 at 6:49 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

I won't belabour the point any more, but the difference in layout
between a list and a list view is consequential enough to deserve its
own top-level character in my opinion. My vote would be +1 for +vl and
+vL.

On Thu, Oct 5, 2023 at 6:40 PM Felipe Oliveira Carvalho
<felipe...@gmail.com> wrote:

Union format strings share enough properties that having them in the
same switch case doesn't result in additional complexity...lists and
list views are completely different types (for the purposes of parsing
the format string).

Dense and sparse union differ a bit more than list and list-view.

Not starting with `+l` for list-views would be a deviation from this
pattern started by unions.


+------------------------+---------------------------------------------------+------------+
| ``+ud:I,J,...``        | dense union with type ids I,J...
  |            |

+------------------------+---------------------------------------------------+------------+
| ``+us:I,J,...``        | sparse union with type ids I,J...
   |            |

+------------------------+---------------------------------------------------+------------+

Is sharing prefixes an issue?

To make this more concrete, these are the parser changes for supporting
`+lv` and `+Lv` as I proposed in the beginning:

@@ -1097,9 +1101,9 @@ struct SchemaImporter {
      RETURN_NOT_OK(f_parser_.CheckHasNext());
      switch (f_parser_.Next()) {
        case 'l':
-        return ProcessListLike<ListType>();
+        return ProcessVarLengthList<false>();
        case 'L':
-        return ProcessListLike<LargeListType>();
+        return ProcessVarLengthList<true>();
        case 'w':
          return ProcessFixedSizeList();
        case 's':
@@ -1195,12 +1199,30 @@ struct SchemaImporter {
      return CheckNoChildren(type);
    }

-  template <typename ListType>
-  Status ProcessListLike() {
-    RETURN_NOT_OK(f_parser_.CheckAtEnd());
-    RETURN_NOT_OK(CheckNumChildren(1));
-    ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
-    type_ = std::make_shared<ListType>(field);
+  template <bool is_large_variation>
+  Status ProcessVarLengthList() {
+    if (f_parser_.AtEnd()) {
+      RETURN_NOT_OK(CheckNumChildren(1));
+      ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
+      if constexpr (is_large_variation) {
+        type_ = large_list(field);
+      } else {
+        type_ = list(field);
+      }
+    } else {
+      if (f_parser_.Next() == 'v') {
+        RETURN_NOT_OK(CheckNumChildren(1));
+        ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
+        if constexpr (is_large_variation) {
+          type_ = large_list_view(field);
+        } else {
+          type_ = list_view(field);
+        }
+      } else {
+        return f_parser_.Invalid();
+      }
+    }
+
      return Status::OK();
    }

--
Felipe


On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou <anto...@python.org>
wrote:


I don't think the parsing will be a problem even in C. It's not like
you
have to backtrack anyway.

+1 from me on Felipe's proposal.

Regards

Antoine.


Le 05/10/2023 à 20:33, Felipe Oliveira Carvalho a écrit :
This mailing list thread is going to be the discussion.

The union types also use two characters, so I didn’t think it would
be a
problem.

—
Felipe

On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington
<de...@voltrondata.com.invalid>
wrote:

I'm sorry for missing earlier discussion on this or a PR into the
format where this discussion may have occurred...is there a reason
that +lv and +Lv were chosen over a single-character version (i.e.,
maybe +v and +V)? A single-character version is (slightly) easier to
parse in C.

On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
<felipe...@gmail.com> wrote:

Hello,

I'm writing to propose "+lv" and "+Lv" as format strings for
list-view
and
large list-view arrays passing through the Arrow C data interface
[1].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because....

Thanks everyone!

--
Felipe

[1] https://arrow.apache.org/docs/format/CDataInterface.html





Reply via email to