Yeah, the drawback you'll find there is that you can't effectively stream
record batches as they become available with that setup, since you wait for
all of the results before converting to an Arrow table.

The result is higher memory usage for larger result sets, and your time to
first byte is bottlenecked by the whole request instead of getting the first
record batch immediately.

If your requests are small on average and/or are very quick to come back,
then these aren't necessarily issues for your use case, lol.

--Matt

On Wed, Jul 27, 2022, 8:32 PM Lee, David <david....@blackrock.com.invalid>
wrote:

> Correct, more or less. It is Arrow Flight native end to end.
>
> The GraphQL query is a string (saved as a Flight Ticket) that is sent from
> a client using Arrow Flight RPC.
> The GraphQL query is executed on the GraphQL flight server that produces
> python record objects (JSON structured records).
> Those Python record objects are then converted into an Arrow Formatted
> Table using pa.Table.from_pylist().
> The Arrow Table is then sent back to the client to complete the original
> Flight RPC request.
>
> -----Original Message-----
> From: Matthew Topol <m...@voltrondata.com.INVALID>
> Sent: Wednesday, July 27, 2022 5:10 PM
> To: dev@arrow.apache.org
> Subject: Re: Arrow Flight usage with graph databases
>
> So this is slightly different from what I was doing and spoke about. As far
> as I can tell from your links, you are evaluating the GraphQL using that
> GraphQL server and then converting the JSON response into Arrow format
> (correct me if I'm wrong, please).
>
> What I did was hook into a GraphQL parser and make my own evaluator which
> was Arrow-native the whole way through, using the GraphQL request to define
> the resulting Arrow schema based on the shape of the requested data. I had a
> planner and an executor, with the executor using the plan to set up a
> pipeline to stream the record batches through.
>
> Just something to think about :)
>
> --Matt
>
> On Wed, Jul 27, 2022, 7:19 PM Lee, David <david....@blackrock.com.invalid>
> wrote:
>
> > I'm working on something similar for Ariadne, which is a Python GraphQL
> > server package.
> >
> >
> > https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
> >
> > https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py
> >
> > I'm basically calling pa.Table.from_pylist, which infers the schema from
> > the first JSON record, but that record could be incomplete, so passing a
> > schema is preferable.
> >
> > arrow_data = pa.Table.from_pylist([result])
> >
> > Basically I need to look at the GraphQL query, go into the GraphQL SDL
> > (Schema Definition Language), and generate an equivalent Arrow schema
> > based on the subset of data points requested.
> >
> > -----Original Message-----
> > From: Gavin Ray <ray.gavi...@gmail.com>
> > Sent: Wednesday, July 20, 2022 11:15 AM
> > To: dev@arrow.apache.org
> > Subject: Re: Arrow Flight usage with graph databases
> >
> > >
> > > We considered the option to analyze data to build a schema on the
> > > fly, however it will be quite an expensive operation which will not
> > > allow us to get performance benefits from using Arrow Flight.
> >
> >
> > I'm not sure you'll be able to avoid generating a schema on the fly if
> > it's anything like SQL or GraphQL queries, since each query would have a
> > unique shape based on the user's selection.
> >
> > Have you benchmarked this out of curiosity?
> > (It's not an uncommon use case from what I've seen.)
> >
> > For example, Matt Topol does this to dynamically generate response
> > schemas in his implementation of GraphQL-via-Flight and he says the
> > overhead is negligible.
> >
> > On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk <
> > valent...@bitquilltech.com.invalid> wrote:
> >
> > > Hi David,
> > >
> > > We are planning to use Flight for the prototype. We are also planning
> > > to use Flight SQL as a reference; however, we wanted to explore whether
> > > Arrow Flight Graph can be implemented on top of Arrow Flight (similar
> > > to Arrow Flight SQL).
> > >
> > > Graph databases generally do not expose or enforce a schema, which
> > > indeed makes things challenging. While we do have ideas on building
> > > extensions for graph databases to add schema, and we do see some other
> > > ideas related to this, we will not be able to rely on this as part of
> > > the initial prototype.
> > > We considered the option to analyze data to build a schema on the
> > > fly, however it will be quite an expensive operation which will not
> > > allow us to get performance benefits from using Arrow Flight.
> > >
> > > > What type/size metadata are you referring to?
> > > Metadata usually includes information about data type, size, and
> > > type-specific properties. Some complex types are made up of 10 or more
> > > parts. Each vertex or edge of the graph can have its own distinct set
> > > of properties, but the total number of types is several dozen, and this
> > > can serve as a basis for constructing a schema. The total size of the
> > > metadata can be quite large, as we wanted to support cases where the
> > > graph database can be very large (e.g. hundreds of GBs, with vertices
> > > and edges possibly containing different properties).
> > > More information about the serialization format we are using right
> > > now can be found at
> > > https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary.
> > >
> > > > So effectively, the internal format is being carried in a
> > > > string/binary column?
> > > Yes, I am considering this option for the first stage of implementation.
> > >
> > > David, thank you again for your reply, and please let me know your
> > > thoughts or whether you might have any suggestions around adopting
> > > Arrow Flight for schema-less databases.
> > >
> > > Regards, Valentyn.
> > >
> > > On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org> wrote:
> > >
> > > > Hi Valentyn,
> > > >
> > > > Just to make sure, is this Flight or Flight SQL? I ask since Flight
> > > > itself does not have a notion of transactions in the first place. I'm
> > > > also curious what the intended target client application is.
> > > >
> > > > Not being familiar with graph databases myself, I'll try to give
> > > > some comments…
> > > >
> > > > Lack of a schema does make things hard. There were some prior
> > > > discussions about schema evolution during a (Flight) data stream,
> > > > which would let you add/remove fields as the query progresses. And
> > > > unions would let you accommodate inconsistent types. But if the
> > > > changes are frequent, you'd negate many of the benefits of
> > > > Arrow/Flight. And both of these could make client-side usage
> > > > inconvenient.
> > > >
> > > > What type/size metadata are you referring to? Presumably, this
> > > > would instead end up in the schema, once using Arrow?
> > > >
> > > > Is there any possibility to (say) unify (chunks of) the result to a
> > > > consistent schema at least? Or possibly, encode (some) properties as
> > > > a Map<String, Union<...>> instead of as columns. (This negates the
> > > > benefits of columnar data, of course, if you are interested in a
> > > > particular property, but if you know those properties up front, the
> > > > server could pull those out into (consistently typed) columns.)
> > > >
> > > > > We are currently working on a prototype in which we are trying to
> > > > > use Arrow Flight as a transport for transmitting requests and data
> > > > > to Gremlin Server. Serialization is still based on an internal
> > > > > format due to schema creation complexity.
> > > >
> > > > So effectively, the internal format is being carried in a
> > > > string/binary column?
> > > >
> > > > On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
> > > > > Hi All,
> > > > >
> > > > > I'm investigating the possibility of using Arrow Flight with graph
> > > > > databases, and exploring how to enable an Arrow Flight endpoint in
> > > > > the Apache TinkerPop Gremlin Server.
> > > > >
> > > > > Now graph databases use several incompatible protocols, which makes
> > > > > the technology difficult to use and spread.
> > > > > Common features of graph databases are:
> > > > > 1. Lack of a schema. Each vertex of the graph can have its own set
> > > > > of properties, including properties with the same name but
> > > > > different types. Metadata such as type and size are also passed
> > > > > with each value, which increases the amount of data transferred.
> > > > > Some data types are not supported by all languages.
> > > > > 2. Internal representation of data is different across
> > > > > implementations. For data exchange we used a set of formats like
> > > > > customized JSON and custom binary, but we would like to get a
> > > > > performance gain from using Arrow Flight.
> > > > > 3. Differences in concepts like transactions, sessions, etc., which
> > > > > may conceptually differ from their implementation in SQL.
> > > > > Gremlin Server does not natively support transactions, so we use
> > > > > the Neo4j plugin.
> > > > >
> > > > > We are currently working on a prototype in which we are trying to
> > > > > use Arrow Flight as a transport for transmitting requests and data
> > > > > to Gremlin Server. Serialization is still based on an internal
> > > > > format due to schema creation complexity.
> > > > >
> > > > > Ideas are welcome.
> > > > >
> > > > > Regards, Valentyn
> > > >
> > >
> >
> >
>
