So this is slightly different from what I was doing and spoke about. As far
as I can tell from your links, you are evaluating the GraphQL query with
that GraphQL server and then converting the JSON response into Arrow format
(correct me if I'm wrong, please).

What I did was hook into a GraphQL parser and make my own evaluator that
was Arrow-native the whole way through, using the GraphQL request to define
the resulting Arrow schema based on the shape of the requested data. I had
a planner and an executor, with the executor using the plan to set up a
pipeline to stream the record batches through.

Just something to think about :)

--Matt

On Wed, Jul 27, 2022, 7:19 PM Lee, David <david....@blackrock.com.invalid>
wrote:

> I'm working on something similar for Ariadne, which is a Python GraphQL
> server package.
>
>
> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
>
> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py
>
> I'm basically calling pa.Table.from_pylist, which infers the schema from
> the first JSON record, but that record could be incomplete, so passing a
> schema is preferable.
>
> arrow_data = pa.Table.from_pylist([result])
>
> Basically I need to look at the graphql query and then go into the graphql
> SDL (Schema Definition Language) and generate an equivalent Arrow schema
> based on the subset of data points requested.
>
> -----Original Message-----
> From: Gavin Ray <ray.gavi...@gmail.com>
> Sent: Wednesday, July 20, 2022 11:15 AM
> To: dev@arrow.apache.org
> Subject: Re: Arrow Flight usage with graph databases
>
> >
> > We considered the option to analyze data to build a schema on the fly,
> > however it will be quite an expensive operation which will not allow
> > us to get performance benefits from using Arrow Flight.
>
>
> I'm not sure you'll be able to avoid generating a schema on the fly if
> it's anything like SQL or GraphQL queries, since each query would have a
> unique shape based on the user's selection.
>
> Have you benchmarked this, out of curiosity?
> (It's not an uncommon use case from what I've seen.)
>
> For example, Matt Topol does this to dynamically generate response schemas
> in his implementation of GraphQL-via-Flight, and he says the overhead is
> negligible.
>
> On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk <
> valent...@bitquilltech.com.invalid> wrote:
>
> > Hi David,
> >
> > We are planning to use Flight for the prototype. We are also planning
> > to use Flight SQL as a reference, however we wanted to explore ideas
> > whether Arrow Flight Graph can be implemented on top of Arrow Flight
> > (similar to Arrow Flight SQL).
> >
> > Graph databases generally do not expose or enforce a schema, which
> > indeed makes it challenging. While we do have ideas on building
> > extensions for graph databases to add a schema, and we do see some
> > other ideas related to this, we will not be able to rely on this as
> > part of the initial prototype.
> > We considered the option to analyze data to build a schema on the fly,
> > however it will be quite an expensive operation which will not allow
> > us to get performance benefits from using Arrow Flight.
> >
> > >What type/size metadata are you referring to?
> > Metadata usually includes information about data type, size and
> > type-specific properties. Some complex types are made up of 10 or more
> > parts. Each vertex or edge of the graph can have its own distinct set of
> > properties, but the total number of types is several dozen and this
> > can serve as a basis for constructing a schema. The total size of
> > metadata can be quite big, as we wanted to support cases where the
> > graph database can be very large (e.g. hundreds of GBs, with vertices
> > and edges possibly containing different properties).
> > More information about the serialization format we are using right now
> > can be found at
> > https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary.
> >
> > > So effectively, the internal format is being carried in a
> > > string/binary column?
> > Yes, I am considering this option for the first stage of implementation.
> >
> > David, thank you again for your reply, and please let me know your
> > thoughts or whether you might have any suggestions around adopting
> > Arrow Flight for schema-less databases.
> >
> > Regards, Valentyn.
> >
> > On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org> wrote:
> >
> > > Hi Valentyn,
> > >
> > > Just to make sure, is this Flight or Flight SQL? I ask since Flight
> > itself
> > > does not have a notion of transactions in the first place. I'm also
> > curious
> > > what the intended target client application is.
> > >
> > > Not being familiar with graph databases myself, I'll try to give
> > > some comments…
> > >
> > > Lack of a schema does make things hard. There were some prior
> > > discussions about schema evolution during a (Flight) data stream,
> > > which would let you add/remove fields as the query progresses. And
> > > unions would let you accommodate inconsistent types. But if the
> > > changes are frequent, you'd negate many of the benefits of
> > > Arrow/Flight. And both of these could make client-side usage
> inconvenient.
> > >
> > > What type/size metadata are you referring to? Presumably, this would
> > > instead end up in the schema, once using Arrow?
> > >
> > > Is there any possibility to (say) unify (chunks of) the result to a
> > > consistent schema at least? Or possibly, encoding (some) properties
> > > as a Map<String, Union<...>> instead of as columns. (This negates
> > > the benefits of columnar data, of course, if you are interested in a
> > > particular property, but if you know those properties up front, the
> > > server could pull those out into (consistently typed) columns.)
> > >
> > > > We are currently working on a prototype in which we are trying to
> > > > use
> > > Arrow Flight as a transport for transmitting requests and data to
> > > Gremlin Server. Serialization is still based on an internal format
> > > due to schema creation complexity.
> > >
> > > So effectively, the internal format is being carried in a
> > > string/binary column?
> > >
> > > On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
> > > > Hi All,
> > > >
> > > > I'm investigating the possibility of using Arrow Flight with graph
> > > databases, and exploring how to enable Arrow Flight endpoint in
> > > Apache Tinkerpop Gremlin server.
> > > >
> > > > Graph databases currently use several incompatible protocols, which
> > > > makes the technology difficult to use and adopt.
> > > > Common features of graph databases are:
> > > > 1. Lack of a schema. Each vertex of the graph can have its own set
> > > > of properties, including properties with the same name but
> > > > different types. Metadata such as type and size are also passed
> > > > with each value, which increases the amount of data transferred.
> > > > Some data types are not supported by all languages.
> > > > 2. Internal representation of data differs across implementations.
> > > > For data exchange we have used a set of formats like customized
> > > > JSON and custom binary, but we would like to get a performance gain
> > > > from using Arrow Flight.
> > > > 3. Differences in concepts like transactions, sessions, etc.
> > > > Conceptually these may differ from their implementation in SQL.
> > > > Gremlin Server does not natively support transactions, so we use
> > > > the Neo4j plugin.
> > > >
> > > > We are currently working on a prototype in which we are trying to
> > > > use Arrow Flight as a transport for transmitting requests and data
> > > > to Gremlin Server. Serialization is still based on an internal
> > > > format due to schema creation complexity.
> > > >
> > > > Ideas are welcome.
> > > >
> > > > Regards, Valentyn
> > >
> >
>
>
