I'm not that familiar with machine learning, but is there potential value
in having Druid be a "consumer" of machine learning, such as for
optimization purposes?

For example, training a model on a cluster's past queries to serve as a
query cost estimator.
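
Something like the toy sketch below is roughly what I have in mind. To be
clear, it is only an illustration: the query-log columns, the scikit-learn
model, and the idea of a planner consulting it are my own assumptions, not
anything that exists in Druid today.

# Hypothetical sketch: fit a cost estimator on past Druid query metrics.
# The CSV and its columns are invented for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

logs = pd.read_csv("query_metrics.csv")  # hypothetical export of query logs
features = logs[["interval_hours", "num_filters", "num_aggregators", "segment_count"]]
latency = logs["query_time_ms"]

X_train, X_test, y_train, y_test = train_test_split(features, latency, test_size=0.2)

model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# The broker (or an external planner) could consult such a model to estimate
# the cost of an incoming query before scheduling it.
print("R^2 on held-out queries:", model.score(X_test, y_test))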



On Tue, Jan 28, 2020 at 12:39 AM Roman Leventov <leventov...@gmail.com>
wrote:

> However, I now see Charles' point -- the data which is typically stored
> in Druid rows is simple and is not something models are typically applied
> to. Timeseries themselves (that is, the results of timeseries queries in
> Druid) may be an input for anomaly detection or phase transition models,
> but there is no point in applying those models inside Druid.
>
> One corner case is sketches which are time series, so models could be
> applied to them individually.
>
> On Tue, 28 Jan 2020 at 08:59, Roman Leventov <leventov...@gmail.com>
> wrote:
>
> > I was thinking about model training on the Druid indexing side and
> > evaluation on the Druid querying side.
> >
> > The advantage Druid has over Spark at querying is faster row filtering
> > thanks to bitset indexes. But since model evaluation is a pretty heavy
> > operation (I suppose; does anyone have ballpark time estimates? how does
> > it compare to a sketch update?), row scanning may not be the bottleneck,
> > and therefore there is no significant reason to use Druid instead of just
> > plugging the Spark engine into Druid segments.
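
A throwaway micro-benchmark along these lines could give ballpark numbers on
one's own hardware. It pits a small scikit-learn model, evaluated per row and
in a batch, against an HLL update from the Apache DataSketches Python package;
both the model and the sketch are arbitrary stand-ins for whatever Druid would
actually run, so treat any numbers as order-of-magnitude only.

# Micro-benchmark sketch: per-row model evaluation vs. a sketch update.
# Purely illustrative; results vary widely by model, sketch, and hardware.
import timeit
import numpy as np
from sklearn.linear_model import LogisticRegression
from datasketches import hll_sketch

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

row = X[:1]              # a single row, shaped (1, 20)
sketch = hll_sketch(12)  # lg_k = 12

per_row_eval = timeit.timeit(lambda: model.predict(row), number=10_000)
sketch_update = timeit.timeit(lambda: sketch.update("some-dimension-value"), number=10_000)
batch_eval = timeit.timeit(lambda: model.predict(X), number=10)  # vectorized over 10k rows

print(f"per-row predict: {per_row_eval / 10_000 * 1e6:.1f} us")
print(f"HLL update:      {sketch_update / 10_000 * 1e6:.1f} us")
print(f"batched predict: {batch_eval / 10 / len(X) * 1e6:.3f} us per row")
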
> >
> > On the indexing side, the Druid indexer may be considered a
> > general-purpose job scheduler, so somebody who already has Druid could
> > leverage it instead of setting up a separate Airflow scheduler.
> >
> > On Tue, 28 Jan 2020, 06:46 Charles Allen, <cral...@apache.org> wrote:
> >
> >> > it makes more sense to have tooling around Druid, to slice and dice
> >> > the data that you need, and do the ML work in sklearn, or even in Spark
> >>
> >> I agree with this sentiment. Druid as an execution engine is very good
> >> at doing distributed aggregation (distributed reduce). What advantage
> >> does Druid as an engine have that Spark does not for ML?
> >>
> >> Are you talking about training, model evaluation, or both?
> >>
> >> It *might* be possible to have a likeness mechanism, whereby you can
> >> pass in a model as a filter and aggregate on rows (dimension tuples?)
> >> that match the model by some minimum criterion, but I'm not really sure
> >> what utility that would have. Maybe as a quick backtesting engine? I
> >> feel like I'm a solution searching for a problem going down this route,
> >> though.
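
One way to picture that "model as a filter" idea in plain Python is sketched
below; the row shape, feature names, and threshold are invented, and nothing
like this exists as a Druid filter today.

# Conceptual sketch of "pass a model in as a filter": keep only rows whose
# model score clears a threshold, then aggregate the survivors.
# Entirely hypothetical; not an actual Druid filter or aggregator.
from typing import Iterable

def model_filter(rows: Iterable[dict], model, feature_keys, threshold=0.8):
    """Yield rows (dimension tuples + metrics) that a binary classifier scores highly."""
    for row in rows:
        features = [[row[k] for k in feature_keys]]
        score = model.predict_proba(features)[0][1]
        if score >= threshold:
            yield row

# Downstream, a normal aggregation runs over the filtered rows, e.g.:
#   sum(row["revenue"] for row in model_filter(rows, model, ["age", "visits"]))
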
> >>
> >> On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko <fo...@driesprong.frl>
> >> wrote:
> >>
> >> > > Vertica has it. Good idea to introduce it in Druid.
> >> >
> >> > I'm not sure if this is a valid argument. With this argument, you can
> >> > introduce anything into Druid. I think it is good to be opinionated
> >> > and, as a community, decide why we do or don't introduce ML
> >> > capabilities into the software.
> >> >
> >> > For example, databases like Postgres and BigQuery allow users to build
> >> > simple regression models:
> >> > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also
> >> > don't think it is that hard to introduce linear regression using
> >> > gradient descent into Druid:
> >> > https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> >> > However, how many people are going to use this?
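
As the linked article shows, gradient-descent linear regression itself is only
a few lines. The sketch below is plain NumPy on synthetic data, nothing
Druid-shaped, which is arguably the point: the math is the easy part, the
integration and operational story are the hard part.

# Plain gradient-descent linear regression, in the spirit of the linked article.
# Nothing Druid-specific here; the data is synthetic.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=500)

m, b = 0.0, 0.0          # slope and intercept
learning_rate = 0.01

for _ in range(2000):
    pred = m * x + b
    error = pred - y
    # Gradients of mean squared error with respect to m and b.
    m -= learning_rate * 2.0 * np.mean(error * x)
    b -= learning_rate * 2.0 * np.mean(error)

print(f"fitted: y = {m:.2f} * x + {b:.2f}  (true relation: 3x + 7)")
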
> >> >
> >> > For me, it makes more sense to have tooling around Druid, to slice and
> >> > dice the data that you need, and do the ML work in sklearn, or even in
> >> > Spark. For example, using https://github.com/druid-io/pydruid or having
> >> > the ability to use Spark to read directly from the deep storage.
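
That workflow might look roughly like the sketch below. The broker URL,
datasource, and column names are made up, and while the pydruid calls follow
its documented groupby/export_pandas pattern, treat this as a sketch rather
than tested code.

# Sketch of "slice and dice in Druid, model in sklearn" via pydruid.
# Broker address, datasource, and column names are hypothetical.
import pandas as pd
from pydruid.client import PyDruid
from pydruid.utils.aggregators import doublesum, longsum
from sklearn.ensemble import RandomForestRegressor

client = PyDruid("http://broker.example.com:8082", "druid/v2")

result = client.groupby(
    datasource="events",
    granularity="day",
    intervals="2020-01-01/2020-01-31",
    dimensions=["country", "device"],
    aggregations={"revenue": doublesum("revenue"),
                  "visits": longsum("visits")},
)

# Druid has done the heavy aggregation; the small result fits in pandas.
df = result.export_pandas()
X = pd.get_dummies(df[["country", "device", "visits"]])
model = RandomForestRegressor().fit(X, df["revenue"])
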
> >> >
> >> > Introducing models using stored procedures or UDFs is also a
> >> > possibility, but here I share Sayat's concerns when it comes to
> >> > performance and scalability.
> >> >
> >> > Cheers, Fokko
> >> >
> >> >
> >> >
> >> > On Sat, Jan 25, 2020 at 08:51, Gaurav Bhatnagar <gaura...@gmail.com>
> >> > wrote:
> >> >
> >> > > +1
> >> > >
> >> > > Vertica has it. Good idea to introduce it in Druid.
> >> > >
> >> > > On Mon, Jan 13, 2020 at 12:52 AM Dusan Maric <thema...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > +1
> >> > > >
> >> > > > That would be a great idea! Thanks for sharing this.
> >> > > >
> >> > > > I would just like to chime in on Druid + ML model cases: predictions
> >> > > > and anomaly detection on top of TensorFlow ❤
> >> > > >
> >> > > > Regards,
> >> > > >
> >> > > > On Fri, Jan 10, 2020 at 6:41 AM Roman Leventov <leventov...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Hello Druid developers, what do you think about the future of
> >> > > > > Druid & machine learning?
> >> > > > >
> >> > > > > Druid has been great at complex aggregations. Could (should?) it
> >> > > > > make inroads into ML? Perhaps aggregators that apply rows against
> >> > > > > some pre-trained model and summarize the results.
> >> > > > >
> >> > > > > Should model training stay completely external to Druid, or could
> >> > > > > it be incorporated into Druid's data lifecycle on a conceptual
> >> > > > > level, such as a recurring "indexing" task that stores the result
> >> > > > > (the model) in Druid's deep storage, with the model automatically
> >> > > > > loaded on historical nodes as needed (just like segments) and
> >> > > > > certain aggregators picking up the latest model?
> >> > > > >
> >> > > > > Does this make any sense? In what cases will Druid & ML work well
> >> > > > > together, and in what cases should ML stay Spark's prerogative?
> >> > > > >
> >> > > > > I would be very interested to hear any thoughts on the topic,
> >> > > > > vague ideas, and questions.
> >> > > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Dušan Marić
> >> > > > mob.: +381 64 1124779 | e-mail: thema...@gmail.com | skype: themaric
> >> > > >
> >> > >
> >> >
> >>
> >
>
