Re: Arrow Flight usage with graph databases

David Li Mon, 18 Jul 2022 17:23:10 -0700

Hi Valentyn,

Just to make sure, is this Flight or Flight SQL? I ask since Flight itself does 
not have a notion of transactions in the first place. I'm also curious what the 
intended target client application is.

Not being familiar with graph databases myself, I'll try to give some comments…

Lack of a schema does make things hard. There were some prior discussions about 
schema evolution during a (Flight) data stream, which would let you add/remove 
fields as the query progresses. And unions would let you accommodate 
inconsistent types. But if the changes are frequent, you'd negate many of the 
benefits of Arrow/Flight. And both of these could make client-side usage 
inconvenient.
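
To make the point about frequent changes concrete, here is a minimal plain-Python sketch. It does not use any Arrow or Flight API; the helper names infer_schema and batch_by_schema and the sample vertices are made up for illustration. It groups consecutive vertices that share a schema into one batch, which is roughly what a stream with mid-stream schema changes would have to do.

```python
from itertools import groupby


def infer_schema(vertex):
    """Return a hashable (property name, type name) signature for one vertex."""
    return tuple(sorted((key, type(value).__name__) for key, value in vertex.items()))


def batch_by_schema(vertices):
    """Group consecutive vertices with an identical schema into one batch each."""
    return [
        (schema, list(group))
        for schema, group in groupby(vertices, key=infer_schema)
    ]


# Hypothetical sample data: "age" appears as both int and str, "lang" only once.
vertices = [
    {"name": "marko", "age": 29},
    {"name": "vadas", "age": 27},
    {"name": "lop", "lang": "java"},
    {"name": "josh", "age": "32"},
    {"name": "peter", "age": 35},
]

batches = batch_by_schema(vertices)
for schema, rows in batches:
    print(len(rows), schema)
```

With this input, five vertices end up in four batches, three of which hold a single row. That is the degenerate case where per-batch overhead outweighs any columnar benefit.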

Which type/size metadata are you referring to? Presumably this would end up 
in the schema instead, once you are using Arrow?

Is there any possibility of (say) unifying (chunks of) the result into a 
consistent schema at least? Or possibly encoding (some) properties as a 
Map<String, Union<...>> instead of as columns? (This negates the benefits of 
columnar data, of course, if you are interested in a particular property, but 
if you know those properties up front, the server could pull them out into 
(consistently typed) columns.)
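
As a rough sketch of that last idea, again in plain Python rather than Arrow itself: properties known up front go into consistently typed columns, and everything else (including values whose type does not match) lands in a per-vertex map of type-tagged values, standing in for Map<String, Union<...>>. The function to_columns, the "known" mapping, and the sample vertices are all assumptions for illustration.

```python
def to_columns(vertices, known):
    """Split vertices into typed columns plus a leftover map per vertex.

    known maps a property name to the Python type expected for its column.
    Values that are missing or of another type become None in the column;
    mismatched and unknown properties go to the leftover map as
    (type name, value) pairs.
    """
    columns = {name: [] for name in known}
    extras = []
    for vertex in vertices:
        leftover = {}
        for key, value in vertex.items():
            if key in known and isinstance(value, known[key]):
                continue  # this value fits its typed column
            leftover[key] = (type(value).__name__, value)
        for name, expected in known.items():
            value = vertex.get(name)
            columns[name].append(value if isinstance(value, expected) else None)
        extras.append(leftover)
    return columns, extras


# Hypothetical sample data with one inconsistently typed "age" value.
vertices = [
    {"name": "marko", "age": 29},
    {"name": "lop", "lang": "java"},
    {"name": "josh", "age": "32"},
]

columns, extras = to_columns(vertices, {"name": str, "age": int})
print(columns)
print(extras)
```

The typed columns stay cheap to scan, while the leftover map keeps the schemaless part without forcing a schema change for every new property.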

> We are currently working on a prototype in which we are trying to use Arrow 
> Flight as a transport for transmitting requests and data to Gremlin Server. 
> Serialization is still based on an internal format due to schema creation 
> complexity.

So effectively, the internal format is being carried in a string/binary column?

On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
> Hi All,
> 
> I'm investigating the possibility of using Arrow Flight with graph databases, 
> and exploring how to enable an Arrow Flight endpoint in Apache TinkerPop 
> Gremlin Server.
> 
> Currently, graph databases use several incompatible protocols, which makes 
> it difficult to use and spread the technology.
> Common features of graph databases are:
> 1. Lack of a schema. Each vertex of the graph can have its own set of 
> properties, including properties with the same name but different types. 
> Metadata such as type and size are also passed with each value, which 
> increases the amount of data transferred. Some data types are not supported 
> by all languages.
> 2. Internal representation of data is different for all implementations. For 
> data exchange we used a set of formats like customized JSON and custom 
> binary, but we would like to get a performance gain from using Arrow Flight.
> 3. The difference in concepts like transactions, sessions, etc. Conceptually 
> this may differ from the implementation in SQL.
> Gremlin server does not natively support transactions, so we use the Neo4J 
> plugin.
> 
> We are currently working on a prototype in which we are trying to use Arrow 
> Flight as a transport for transmitting requests and data to Gremlin Server. 
> Serialization is still based on an internal format due to schema creation 
> complexity.
> 
> Ideas are welcome.
> 
> Regards, Valentyn
