Hello Arrow Team,

My name is Igor Guzenko. I'm currently working on a task related to
complex types in Apache Drill [1], and I ran into the problem that Drill
has no appropriate vector for representing the canonical (Java-like) Map
data type [2]. So I'm looking for inspiration on how an efficient
columnar map vector can be implemented. I believe that such a map can be
composed of three value vectors (like in Hive):
  1) keys vector;
  2) values vector;
  3) an offsets vector which points to the start index of each map in
the two previous vectors.
But there is a major issue with such an implementation: it's hard to
quickly retrieve values by key, and some advanced tricks are required
to do this efficiently.
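
To make the layout concrete, here is a minimal sketch in plain Java
arrays. It is not Drill's or Arrow's actual vector code: the class name
ColumnarMapSketch, the String/int key and value types, and the
convention of storing one extra trailing offset (so that row i spans
offsets[i]..offsets[i + 1]) are all assumptions made for illustration.
It also shows why lookup by key is the weak spot: without an extra
index, get() has to scan the entries of the row linearly.

```java
/** Hypothetical illustration only; not a real Drill or Arrow class. */
public class ColumnarMapSketch {
    // Keys and values of all rows, flattened and stored back to back.
    private final String[] keys;
    private final int[] values;
    // offsets[i] is where row i starts; offsets[i + 1] is where it ends.
    // Assumption: one extra trailing entry marks the end of the last row.
    private final int[] offsets;

    public ColumnarMapSketch(String[] keys, int[] values, int[] offsets) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must align");
        }
        this.keys = keys;
        this.values = values;
        this.offsets = offsets;
    }

    /** Number of maps (rows) stored in this column. */
    public int rowCount() {
        return offsets.length - 1;
    }

    /** Number of key/value entries in the map at the given row. */
    public int size(int row) {
        return offsets[row + 1] - offsets[row];
    }

    /** Linear scan over one row's entries; returns null if the key is absent. */
    public Integer get(int row, String key) {
        for (int i = offsets[row]; i < offsets[row + 1]; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Row 0: {a=1, b=2}, row 1: {} (empty map), row 2: {c=3}
        ColumnarMapSketch maps = new ColumnarMapSketch(
            new String[] {"a", "b", "c"},
            new int[] {1, 2, 3},
            new int[] {0, 2, 2, 3});
        System.out.println(maps.get(0, "b")); // prints 2
        System.out.println(maps.get(1, "a")); // prints null
        System.out.println(maps.get(2, "c")); // prints 3
    }
}
```

Each lookup costs time proportional to the size of that row's map.
Keeping the keys of each row sorted (for binary search) or building a
per-row hash index would be possible ways to speed this up, at the cost
of extra work on write.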

I would be happy if you could share your expertise on this topic.
After reading through the history of changes in Arrow, I found
that the old map vector was renamed to struct, and the map data type was
declared as a list of structs, each of them containing vectors for keys
and values.
I'm still very interested in how Maps work internally in Arrow, and I'd
like to implement a similar one in Drill (so that future integration
with Arrow goes more smoothly). Also, if you need a new vector
for Map too, I would be happy to contribute it to both the Drill and
Arrow projects.

[1] : https://issues.apache.org/jira/browse/DRILL-3290
[2] : https://github.com/paul-rogers/drill/wiki/Drill-Maps

Thanks for your attention,
Igor Guzenko
