I'm working on something similar for Ariadne, which is a Python GraphQL server package.

https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py

I'm basically calling pa.Table.from_pylist, which infers the schema from the first JSON record, but that record could be incomplete, so passing an explicit schema is preferable.

arrow_data = pa.Table.from_pylist([result])

Basically I need to look at the GraphQL query and then go into the GraphQL SDL (Schema Definition Language) and generate an equivalent Arrow schema based on the subset of data points requested.
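Something along these lines is what I have in mind -- a rough sketch, not code that's in the repo yet. The SDL, query, scalar mapping and function name below are made-up placeholders; the real thing would hook into Ariadne's resolver info, but Ariadne sits on top of graphql-core so the same parsing utilities apply:

import pyarrow as pa
from graphql import GraphQLNonNull, build_schema, get_named_type, parse

# Placeholder SDL and query, used only for illustration.
SDL = """
type Query {
  people: [Person]
}
type Person {
  name: String!
  age: Int
  score: Float
}
"""

QUERY = "{ people { name age } }"

# Assumed mapping from GraphQL scalar names to Arrow types; custom scalars
# (dates, decimals, etc.) would need their own entries.
SCALAR_TO_ARROW = {
    "ID": pa.string(),
    "String": pa.string(),
    "Int": pa.int64(),
    "Float": pa.float64(),
    "Boolean": pa.bool_(),
}

def arrow_schema_for_query(sdl: str, query: str) -> pa.Schema:
    """Build an Arrow schema covering only the fields the query selects."""
    gql_schema = build_schema(sdl)
    document = parse(query)
    # Simplifying assumption: a single operation selecting one list field
    # of objects with flat scalar sub-selections, e.g. { people { name age } }.
    operation = document.definitions[0]
    root_field = operation.selection_set.selections[0]
    root_type = get_named_type(
        gql_schema.query_type.fields[root_field.name.value].type
    )

    fields = []
    for selection in root_field.selection_set.selections:
        name = selection.name.value
        gql_field = root_type.fields[name]
        nullable = not isinstance(gql_field.type, GraphQLNonNull)
        scalar = get_named_type(gql_field.type)
        fields.append(pa.field(name, SCALAR_TO_ARROW[scalar.name], nullable=nullable))
    return pa.schema(fields)

schema = arrow_schema_for_query(SDL, QUERY)
# Pass the explicit schema instead of letting from_pylist infer it from the
# first record, so an incomplete first row no longer changes the table schema.
rows = [{"name": "Ada", "age": 36}, {"name": "Bob"}]
arrow_data = pa.Table.from_pylist(rows, schema=schema)

Nested selections should map naturally onto pa.struct / pa.list_ fields, but I haven't gotten that far yet.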
-----Original Message-----
From: Gavin Ray <ray.gavi...@gmail.com>
Sent: Wednesday, July 20, 2022 11:15 AM
To: dev@arrow.apache.org
Subject: Re: Arrow Flight usage with graph databases

External Email: Use caution with links and attachments

> We considered the option to analyze data to build a schema on the fly,
> however it will be quite an expensive operation which will not allow
> us to get performance benefits from using Arrow Flight.

I'm not sure you'll be able to avoid generating a schema on the fly if it's anything like SQL or GraphQL queries, since each query would have a unique shape based on the user's selection.

Have you benchmarked this, out of curiosity? (It's not an uncommon use case from what I've seen.)

For example, Matt Topol does this to dynamically generate response schemas in his implementation of GraphQL-via-Flight, and he says the overhead is negligible.

On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk
<valent...@bitquilltech.com.invalid> wrote:

> Hi David,
>
> We are planning to use Flight for the prototype. We are also planning
> to use Flight SQL as a reference; however, we wanted to explore whether
> Arrow Flight Graph can be implemented on top of Arrow Flight (similar
> to Arrow Flight SQL).
>
> Graph databases generally do not expose or enforce a schema, which
> indeed makes it challenging. While we do have ideas on building
> extensions for graph databases to add schema, and we do see some other
> ideas related to this, we will not be able to rely on this as part of
> the initial prototype.
> We considered the option to analyze data to build a schema on the fly,
> however it will be quite an expensive operation which will not allow
> us to get performance benefits from using Arrow Flight.
>
> > What type/size metadata are you referring to?
> Metadata usually includes information about data type, size and
> type-specific properties. Some complex types are made up of 10 or more
> parts. Each vertex or edge of the graph can have its own distinct set
> of properties, but the total number of types is several dozen, and this
> can serve as a basis for constructing a schema. The total size of
> metadata can be quite big, as we wanted to support cases where the
> graph database can be very large (e.g. hundreds of GBs, with vertices
> and edges possibly containing different properties).
> More information about the serialization format we are using right now
> can be found at https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary
>
> > So effectively, the internal format is being carried in a
> > string/binary column?
> Yes, I am considering this option for the first stage of implementation.
>
> David, thank you again for your reply, and please let me know your
> thoughts or whether you might have any suggestions around adopting
> Arrow Flight for schema-less databases.
>
> Regards, Valentyn.
>
> On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org> wrote:
>
> > Hi Valentyn,
> >
> > Just to make sure, is this Flight or Flight SQL? I ask since Flight itself
> > does not have a notion of transactions in the first place. I'm also
> > curious what the intended target client application is.
> >
> > Not being familiar with graph databases myself, I'll try to give
> > some comments…
> >
> > Lack of a schema does make things hard. There were some prior
> > discussions about schema evolution during a (Flight) data stream,
> > which would let you add/remove fields as the query progresses. And
> > unions would let you accommodate inconsistent types. But if the
> > changes are frequent, you'd negate many of the benefits of
> > Arrow/Flight. And both of these could make client-side usage inconvenient.
> >
> > What type/size metadata are you referring to? Presumably, this would
> > instead end up in the schema, once using Arrow?
> >
> > Is there any possibility to (say) unify (chunks of) the result to a
> > consistent schema at least? Or possibly, encoding (some) properties
> > as a Map<String, Union<...>> instead of as columns. (This negates
> > the benefits of columnar data, of course, if you are interested in a
> > particular property, but if you know those properties up front, the
> > server could pull those out into (consistently typed) columns.)
> >
> > > We are currently working on a prototype in which we are trying to use
> > > Arrow Flight as a transport for transmitting requests and data to
> > > Gremlin Server. Serialization is still based on an internal format
> > > due to schema creation complexity.
> >
> > So effectively, the internal format is being carried in a
> > string/binary column?
> >
> > On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
> > > Hi All,
> > >
> > > I'm investigating the possibility of using Arrow Flight with graph
> > > databases, and exploring how to enable an Arrow Flight endpoint in
> > > the Apache TinkerPop Gremlin Server.
> > >
> > > Right now graph databases use several incompatible protocols, which
> > > makes it difficult to use and spread the technology.
> > > Common features of graph databases are:
> > > 1. Lack of a schema. Each vertex of the graph can have its own set of
> > > properties, including properties with the same name but different types.
> > > Metadata such as type and size are also passed with each value, which
> > > increases the amount of data transferred. Some data types are not
> > > supported by all languages.
> > > 2. The internal representation of data is different for all
> > > implementations. For data exchange we use a set of formats like
> > > customized JSON and custom binary, but we would like to get a
> > > performance gain from using Arrow Flight.
> > > 3. Differences in concepts like transactions, sessions, etc.
> > > Conceptually this may differ from the implementation in SQL.
> > > Gremlin Server does not natively support transactions, so we use the
> > > Neo4j plugin.
> > >
> > > We are currently working on a prototype in which we are trying to use
> > > Arrow Flight as a transport for transmitting requests and data to
> > > Gremlin Server. Serialization is still based on an internal format
> > > due to schema creation complexity.
> > >
> > > Ideas are welcome.
> > >
> > > Regards, Valentyn
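On David Li's Map<String, Union<...>> suggestion in the quoted thread: for the schema-less vertex/edge property case, the type itself is easy enough to declare in pyarrow. A minimal sketch with hypothetical field names (building the union-valued arrays efficiently is a separate question):

import pyarrow as pa

# Hypothetical property bag: map<string, dense_union<int64, float64, string, bool>>.
# Properties that are known up front and consistently typed could still be
# pulled out into ordinary columns, as suggested in the thread.
property_value = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("float64", pa.float64()),
    pa.field("string", pa.string()),
    pa.field("bool", pa.bool_()),
])

vertex_schema = pa.schema([
    pa.field("vertex_id", pa.int64()),
    pa.field("label", pa.string()),
    pa.field("properties", pa.map_(pa.string(), property_value)),
])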