The two forks are very different, and patches are not really sharable.
Features/improvements may get re-implemented, but the code has diverged
significantly enough that it is pretty much always a re-implementation. The
overall pushdown approach is also very different between the two forks, so
it is unlikely that the implementation can be shared.

> In the overall warehouse + Druid setup you're envisioning, would Druid be
the main way of querying the tables that it stores? Or would they all be
synced periodically from the warehouse into Druid, using the warehouse as a
source of truth? I'm asking since I'm wondering how important it is to
think about functionality that might help load datasources based on tables
that are in the Presto metastore.

In most cases, when our users build a custom viz, they query Druid directly
for the tables it stores, with batch jobs syncing data from the warehouse
to Druid. Druid is never the source of truth, as its tables are always
derived from warehouse tables. However, the cost of building a custom viz
is generally higher, and currently there is no good/reliable way to build a
Tableau dashboard that queries Druid directly. It is hard to say what users
might do in the future, but it largely depends on how performant and
comprehensive this new route of Druid connectivity through Presto turns out
to be. Personally, I would expect a lot more Tableau dashboards, and thus
more Druid tables being queried from Presto, if we cover most query
patterns in a performant way.

For loading datasources based on tables that are in Presto, at some point
we may develop INSERT INTO support for the Druid connector, via either
Presto or Spark. Right now we just run the Hadoop batch indexer.

> Druid SQL is ANSI SQL for the most part but there are two big
differences. First, it doesn't support everything in ANSI SQL (two
examples: it currently doesn't support shuffle joins and windowed
aggregations). Second, it supports some functionality that is not in ANSI
SQL (like the TIME_ and DS_ operators). So it is smaller in some ways and
bigger in other ways. I was thinking a reverse translator could let you
write a Druid SQL query that uses our special operators, but also requires
a shuffle join, and then translate and execute it as an equivalent Presto
SQL query. The idea being you can express your query in either dialect and
get routed to the right place in the end.

I don't see a use case on our platform for Druid-to-Presto connectivity.
Most of Druid's special operators have equivalent Presto functions, and for
the missing operators we could just add connector-level procedures in
Presto. From the user's standpoint, as long as we offer acceptable
performance and a full set of Druid features (including scans down the
line, once we can support partial aggregation pushdown in prestosql), we
don't see a reason to support the Druid-to-Presto route, assuming Presto
should be a superset in both syntax and feature set.
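To illustrate the kind of operator mapping I mean, here is a hypothetical
sketch in Python. The two rewrites shown (TIME_FLOOR to date_trunc, and
APPROX_COUNT_DISTINCT_DS_HLL to approx_distinct) use real function names
from each dialect, but the string-rewriting approach and its coverage are
purely illustrative; a real translator would work on a parsed query plan,
not on query text:

```python
import re

# Illustrative mapping of a couple of Druid SQL constructs to Presto SQL
# equivalents. These rewrites are simplistic regex substitutions and only
# handle the exact shapes shown.
DRUID_TO_PRESTO = [
    # TIME_FLOOR(__time, 'PT1H') -> date_trunc('hour', __time)
    (re.compile(r"TIME_FLOOR\((\w+),\s*'PT1H'\)"), r"date_trunc('hour', \1)"),
    # APPROX_COUNT_DISTINCT_DS_HLL(col) -> approx_distinct(col)
    (re.compile(r"APPROX_COUNT_DISTINCT_DS_HLL\((\w+)\)"), r"approx_distinct(\1)"),
]

def translate(druid_sql: str) -> str:
    """Rewrite the handful of Druid-specific operators covered above."""
    for pattern, replacement in DRUID_TO_PRESTO:
        druid_sql = pattern.sub(replacement, druid_sql)
    return druid_sql

query = ("SELECT TIME_FLOOR(__time, 'PT1H'), "
         "APPROX_COUNT_DISTINCT_DS_HLL(user_id) FROM t GROUP BY 1")
print(translate(query))
# -> SELECT date_trunc('hour', __time), approx_distinct(user_id) FROM t GROUP BY 1
```

Operators with no scalar equivalent would be the ones needing
connector-level procedures instead.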

Thanks
Parth



On Fri, Jul 10, 2020 at 9:36 AM Mainak Ghosh <mgh...@twitter.com> wrote:

> + Zhenxiao
>
> On Jul 9, 2020, at 11:48 PM, Gian Merlino <g...@apache.org> wrote:
>
> One other thing I'm wondering is how similar are the two forks of Presto?
> Are patches generally being shared between them or are they going off in
> different directions? One example: as I understand it, aggregate pushdown
> support was added to the core of both forks relatively recently — within
> the last year or so — does it work the same way in each one? I'm wondering
> how much work can be shared between these different efforts and perhaps
> between these efforts and the Druid project itself.
>
> On Thu, Jul 9, 2020 at 11:24 PM Gian Merlino <g...@apache.org> wrote:
>
>> Hey Samarth,
>>
>> Thanks for sharing these details.
>>
>> In the overall warehouse + Druid setup you're envisioning, would Druid be
>> the main way of querying the tables that it stores? Or would they all be
>> synced periodically from the warehouse into Druid, using the warehouse as a
>> source of truth? I'm asking since I'm wondering how important it is to
>> think about functionality that might help load datasources based on tables
>> that are in the Presto metastore.
>>
>> >  You bring up an interesting idea on the reverse connector. What do you
>> think the value of such a connector will be? I am assuming Druid SQL for
>> the most part is ANSI SQL.
>>
>> Druid SQL is ANSI SQL for the most part but there are two big
>> differences. First, it doesn't support everything in ANSI SQL (two
>> examples: it currently doesn't support shuffle joins and windowed
>> aggregations). Second, it supports some functionality that is not in ANSI
>> SQL (like the TIME_ and DS_ operators). So it is smaller in some ways and
>> bigger in other ways. I was thinking a reverse translator could let you
>> write a Druid SQL query that uses our special operators, but also requires
>> a shuffle join, and then translate and execute it as an equivalent Presto
>> SQL query. The idea being you can express your query in either dialect and
>> get routed to the right place in the end.
>>
>> On Thu, Jul 9, 2020 at 4:36 PM Samarth Jain <sama...@apache.org> wrote:
>>
>>> Gian,
>>>
>>> For the presto-sql version of Druid connector, for V1, we decided to
>>> pursue
>>> the JDBC route. You can follow along on the progress here -
>>> https://github.com/prestosql/presto/issues/1855
>>> My colleague, Parth (cc'ed as well), is working on implementing Druid
>>> aggregation push down, including support for top-n style queries. Our
>>> immediate use cases, and what we think Druid is generally more suitable
>>> for, are aggregate group-by style queries. Having a presto-druid
>>> connector also enables us to join data in Druid with the rest of our
>>> warehouse.
>>> In general though, for queries that don't do any aggregations, i.e.
>>> which get translated to Druid SCAN queries, it makes sense to bypass the
>>> Druid data nodes altogether and go directly to deep storage. I think
>>> Druid provides enough metadata about the active segment files to be able
>>> to do that relatively easily.
>>>
>>> You bring up an interesting idea on the reverse connector. What do you
>>> think the value of such a connector will be? I am assuming Druid SQL for
>>> the most part is ANSI SQL.
>>>
>>> On Thu, Jul 9, 2020 at 12:56 PM Zhenxiao Luo <z...@twitter.com.invalid>
>>> wrote:
>>>
>>> > Thank you, Mainak.
>>> >
>>> > Hi Gian,
>>> >
>>> > Glad to see you are interested in Presto Druid connector.
>>> >
>>> > My colleagues, @Hao Luo <h...@twitter.com> and @Beinan Wang
>>> > <bein...@twitter.com>, and I implemented the Presto Druid connector in
>>> > PrestoDB:
>>> > https://prestodb.io/docs/current/connector/druid.html
>>> >
>>> > Our implementation includes:
>>> > 1. Presto could scan Druid segments to compute SQL results
>>> > 2. aggregation pushdown, where Presto leverages Druid's fast
>>> > aggregation capabilities and streams aggregated results from Druid
>>> > Actually, we implemented two execution paths; users can use
>>> > configurations to control whether they'd like to scan segments or push
>>> > down all sub-queries to Druid.
>>> >
>>> > We have run benchmarks comparing the Presto Druid connector with
>>> > other SQL engines, and are ready to run production workloads.
>>> >
>>> > Thanks,
>>> > Zhenxiao
>>> >
>>> > On Thu, Jul 9, 2020 at 12:40 PM Mainak Ghosh <mgh...@twitter.com>
>>> wrote:
>>> >
>>> > > Hello Gian,
>>> > >
>>> > > We are currently testing the (other) Presto Druid connector at our
>>> end.
>>> > It
>>> > > has aggregation push down support. Adding Zhenxiao to this thread
>>> since
>>> > he
>>> > > is the primary developer of the connector. He can provide the kind of
>>> > > details you are looking for.
>>> > >
>>> > > Thanks,
>>> > > Mainak
>>> > >
>>> > > > On Jul 9, 2020, at 12:25 PM, Gian Merlino <g...@apache.org> wrote:
>>> > > >
>>> > > > By the way, I see that the other Presto has a Druid connector too:
>>> > > > https://prestodb.io/docs/current/connector/druid.html. From the
>>> docs
>>> > it
>>> > > > looks like it has different lineage and might even work
>>> differently.
>>> > > >
>>> > > > On Thu, Jul 9, 2020 at 12:22 PM Gian Merlino <g...@apache.org>
>>> wrote:
>>> > > >
>>> > > >> I was thinking of exploring ideas like pushing down aggregations,
>>> > > enabling
>>> > > >> Presto to query directly from deep storage (in cases where there
>>> > aren't
>>> > > any
>>> > > >> interesting things to push down, this may be more efficient than
>>> > > querying
>>> > > >> Druid servers), enabling translation from Druid's SQL dialect to
>>> > > Presto's
>>> > > >> SQL dialect (a "reverse connector"), etc. Do you (or anyone else
>>> on
>>> > this
>>> > > >> list) have any thoughts on any of those?
>>> > > >>
>>> > > >> I'm also curious what kinds of improvements you're planning to the
>>> > > >> connector you built.
>>> > > >>
>>> > > >> On Thu, Jul 9, 2020 at 10:18 AM Samarth Jain <
>>> samarth.j...@gmail.com>
>>> > > >> wrote:
>>> > > >>
>>> > > >>> Hi Gian,
>>> > > >>>
>>> > > >>> I contributed the jdbc based presto-druid connector in prestosql
>>> > which
>>> > > >>> went
>>> > > >>> out in release 337
>>> > > >>> https://prestosql.io/docs/current/release/release-337.html. The
>>> v1
>>> > > >>> version
>>> > > >>> of the connector doesn’t support aggregate push down yet. It is
>>> being
>>> > > >>> actively worked on and we expect it to be improved over the next
>>> few
>>> > > >>> releases. We are currently evaluating using the presto-druid
>>> > connector
>>> > > in
>>> > > >>> our Tableau setup. It would be interesting to see what changes in
>>> > Druid
>>> > > >>> would be needed to support that integration.
>>> > > >>>
>>> > > >>> Thanks,
>>> > > >>> Samarth
>>> > > >>>
>>> > > >>> On Thu, Jul 9, 2020 at 10:07 AM Gian Merlino <g...@apache.org>
>>> > wrote:
>>> > > >>>
>>> > > >>>> Hey Druids,
>>> > > >>>>
>>> > > >>>> I was wondering, is anyone on this list using Druid + Presto
>>> > together?
>>> > > >>> If
>>> > > >>>> so, what does your architecture look like and which edition /
>>> flavor
>>> > > of
>>> > > >>>> Presto and Druid connector are you using? What's your experience
>>> > been
>>> > > >>> like?
>>> > > >>>> I'm asking since I'm starting to think about whether it makes
>>> sense
>>> > to
>>> > > >>> look
>>> > > >>>> at ways to improve the integration between the two projects.
>>> > > >>>>
>>> > > >>>> Gian
>>> > > >>>>
>>> > > >>>
>>> > > >>
>>> > >
>>> > >
>>> >
>>>
>>
>
