Having a smart segment balancer in the coordinator that used a "segment
work" based distribution model would be awesome. Matching up the work a
segment is likely to induce to the historical capacity of a node to do
work, all in a dynamic way... you wouldn't even need retention periods
anymore! Especially cool if the system could fall back to pulling from deep
storage ad hoc.



On Tue, Jan 28, 2020 at 3:28 PM Jonathan Wei <jon...@apache.org> wrote:

> I'm not that familiar with machine learning, but is there potential value
> in having Druid be a "consumer" of machine learning, such as for
> optimization purposes?
>
> For example, training a Druid cluster on past queries as part of a query
> cost estimator.
>
>
>
> On Tue, Jan 28, 2020 at 12:39 AM Roman Leventov <leventov...@gmail.com>
> wrote:
>
> > However, I now see the Charles' point -- the data which is typically
> stored
> > in Druid rows is simple and is not something models are typically applied
> > to. Timeseries themselves (that is, the results of timeseries queries in
> > Druid) may be an input for anomaly detection or phase transition models,
> > but there is not point in applying them inside Druid.
> >
> > One corner case is sketches which are time series, so models could be
> > applied to them individually.
> >
> > On Tue, 28 Jan 2020 at 08:59, Roman Leventov <leventov...@gmail.com>
> > wrote:
> >
> > > I was thinking about model training at Druid indexing side and
> evaluation
> > > at Druid querying side.
> > >
> > > The advantage Druid has over Spark at querying is faster row filtering
> > > thanks to bitset indexes. But since model evaluation is a pretty heavy
> > > operation (I suppose; does anyone has ballpark time estimates? how does
> > it
> > > compare to Sketch update?) then row scanning may not be the bottleneck
> > and
> > > therefore no significant reason to use Druid instead of just plugging
> > Spark
> > > engine to Druid segments.
> > >
> > > At indexing side, Druid indexer may be considered a general-purpose job
> > > scheduler so that somebody who already has Druid may leverage it
> instead
> > of
> > > setting up a separate Airflow scheduler.
> > >
> > > On Tue, 28 Jan 2020, 06:46 Charles Allen, <cral...@apache.org> wrote:
> > >
> > >> >  it makes more sense to have tooling around Druid, to do slice and
> > dice
> > >> the data that you need, and do the ml stuff in sklearn, or even in
> spark
> > >>
> > >> I agree with this sentiment. Druid as an execution engine is very good
> > at
> > >> doing distributed aggregation (distributed reduce). What advantage
> does
> > >> Druid as an engine have that Spark does not for ML?
> > >>
> > >> Are you talking training or model evaluation? or any?
> > >>
> > >> It *might* be possible to have a likeness mechanism, whereby you can
> > pass
> > >> in a model as a filter and aggregate on rows (dimension tuples?) that
> > >> match
> > >> the model by some minimum criteria, but I'm not really sure what
> utility
> > >> that would be. Maybe as a quick backtesting engine? I feel like I'm a
> > >> solution searching for a problem going down this route though.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko
> <fo...@driesprong.frl
> > >
> > >> wrote:
> > >>
> > >> > > Vertica has it. Good idea to introduce it in Druid.
> > >> >
> > >> > I'm not sure if this is a valid argument. With this argument, you
> can
> > >> > introduce anything into Druid. I think it is good to be opinionated,
> > >> and as
> > >> > a community why we do or don't introduce ML possibilities into the
> > >> > software.
> > >> >
> > >> > For example, databases like Postgres and Bigquery allow users to do
> > >> simple
> > >> > regression models:
> > >> > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also
> > >> don't
> > >> > think it isn't that hard to introduce linear regression using
> gradient
> > >> > decent into Druid:
> > >> >
> > >> >
> > >>
> >
> https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> > >> > However,
> > >> > how many people are going to use this?
> > >> >
> > >> > For me, it makes more sense to have tooling around Druid, to do
> slice
> > >> and
> > >> > dice the data that you need, and do the ml stuff in sklearn, or even
> > in
> > >> > spark. For example using https://github.com/druid-io/pydruid or
> > having
> > >> the
> > >> > ability to use Spark to read directly from the deep storage.
> > >> >
> > >> > Introducing models using SP or UDF's is also a possibility, but
> here I
> > >> > share the concerns of Sayat when it comes to performance and
> > >> scalability.
> > >> >
> > >> > Cheers, Fokko
> > >> >
> > >> >
> > >> >
> > >> > Op za 25 jan. 2020 om 08:51 schreef Gaurav Bhatnagar <
> > >> gaura...@gmail.com>:
> > >> >
> > >> > > +1
> > >> > >
> > >> > > Vertica has it. Good idea to introduce it in Druid.
> > >> > >
> > >> > > On Mon, Jan 13, 2020 at 12:52 AM Dusan Maric <thema...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > +1
> > >> > > >
> > >> > > > That would be a great idea! Thanks for sharing this.
> > >> > > >
> > >> > > > Would just like to chime in on Druid + ML model cases:
> predictions
> > >> and
> > >> > > > anomaly detection on top of TensorFlow ❤
> > >> > > >
> > >> > > > Regards,
> > >> > > >
> > >> > > > On Fri, Jan 10, 2020 at 6:41 AM Roman Leventov <
> > >> leventov...@gmail.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hello Druid developers, what do you think about the future of
> > >> Druid &
> > >> > > > > machine learning?
> > >> > > > >
> > >> > > > > Druid has been great at complex aggregations. Could (should?)
> It
> > >> make
> > >> > > > > inroads into ML? Perhaps aggregators which apply the rows
> > against
> > >> > some
> > >> > > > > pre-trained model and summarize results.
> > >> > > > >
> > >> > > > > Should model training stay completely external to Druid, or it
> > >> could
> > >> > be
> > >> > > > > incorporated into Druid's data lifecycle on a conceptual
> level,
> > >> such
> > >> > > as a
> > >> > > > > recurring "indexing" task which stores the result (the model)
> in
> > >> > > Druid's
> > >> > > > > deep storage, the model automatically loaded on historical
> nodes
> > >> as
> > >> > > > needed
> > >> > > > > (just like segments) and certain aggregators pick up the
> latest
> > >> > model?
> > >> > > > >
> > >> > > > > Does this make any sense? In what cases Druid & ML will and
> will
> > >> not
> > >> > > work
> > >> > > > > well together, and ML should stay a Spark's prerogative?
> > >> > > > >
> > >> > > > > I would be very interested to hear any thoughts on the topic,
> > >> vague
> > >> > > ideas
> > >> > > > > and questions.
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Dušan Marić
> > >> > > > mob.: +381 64 1124779 | e-mail: thema...@gmail.com | skype:
> > >> themaric
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Reply via email to