Having a smart segment balancer in the coordinator that uses a "segment work" based distribution model would be awesome. Matching the work a segment is likely to induce against a historical node's capacity to do work, all in a dynamic way... you wouldn't even need retention periods anymore! It would be especially cool if the system could fall back to pulling from deep storage ad hoc.
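To make the "segment work" idea above a bit more concrete, here is a minimal Python sketch of a work-based greedy balancer. Everything here (the `Segment` fields, the `estimated_work` cost model, the capacity numbers) is an illustrative assumption, not a Druid coordinator API:

```python
from dataclasses import dataclass
import heapq

@dataclass
class Segment:
    id: str
    size_bytes: int
    recent_query_hits: int  # how often the segment was scanned recently

def estimated_work(seg: Segment) -> float:
    # Toy cost model: scan cost scales with size, weighted by query "heat".
    return seg.size_bytes * (1 + seg.recent_query_hits)

def balance(segments: list[Segment],
            capacities: dict[str, float]) -> dict[str, list[str]]:
    """Greedy longest-processing-time assignment: place the heaviest
    segments first, each on the node with the most spare capacity."""
    heap = [(-cap, node) for node, cap in capacities.items()]  # max-heap via negation
    heapq.heapify(heap)
    assignment: dict[str, list[str]] = {node: [] for node in capacities}
    for seg in sorted(segments, key=estimated_work, reverse=True):
        neg_spare, node = heapq.heappop(heap)
        assignment[node].append(seg.id)
        heapq.heappush(heap, (neg_spare + estimated_work(seg), node))
    return assignment
```

A real balancer would of course need to learn per-node capacity from observed load rather than take it as input, but the greedy shape stays the same.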
On Tue, Jan 28, 2020 at 3:28 PM Jonathan Wei <jon...@apache.org> wrote:

> I'm not that familiar with machine learning, but is there potential value
> in having Druid be a "consumer" of machine learning, such as for
> optimization purposes?
>
> For example, training a Druid cluster on past queries as part of a query
> cost estimator.
>
> On Tue, Jan 28, 2020 at 12:39 AM Roman Leventov <leventov...@gmail.com> wrote:
>
> > However, I now see Charles's point -- the data which is typically stored
> > in Druid rows is simple and is not something models are typically
> > applied to. Timeseries themselves (that is, the results of timeseries
> > queries in Druid) may be an input for anomaly detection or phase
> > transition models, but there is no point in applying them inside Druid.
> >
> > One corner case is sketches which are themselves time series, so models
> > could be applied to them individually.
> >
> > On Tue, 28 Jan 2020 at 08:59, Roman Leventov <leventov...@gmail.com> wrote:
> >
> > > I was thinking about model training on the Druid indexing side and
> > > evaluation on the Druid querying side.
> > >
> > > The advantage Druid has over Spark at query time is faster row
> > > filtering thanks to bitset indexes. But since model evaluation is a
> > > pretty heavy operation (I suppose; does anyone have ballpark time
> > > estimates? how does it compare to a sketch update?), row scanning may
> > > not be the bottleneck, and therefore there is no significant reason
> > > to use Druid instead of just plugging a Spark engine into Druid
> > > segments.
> > >
> > > On the indexing side, the Druid indexer may be considered a
> > > general-purpose job scheduler, so somebody who already has Druid may
> > > leverage it instead of setting up a separate Airflow scheduler.
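As a rough illustration of the query cost estimator Jonathan describes, here is a toy sketch: plain batch gradient-descent linear regression fit on a synthetic "past queries" history. The features (segment count, scanned rows) and all numbers are invented for illustration, not anything Druid records today:

```python
def train_cost_model(features, latencies, lr=0.005, epochs=20000):
    """Fit latency ~ w0 + w1*x1 + w2*x2 with batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    n = len(features)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in zip(features, latencies):
            err = w[0] + w[1] * x1 + w[2] * x2 - y  # prediction error
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        # Step against the mean gradient of the squared error.
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

def estimate_latency(w, num_segments, scanned_rows_millions):
    return w[0] + w[1] * num_segments + w[2] * scanned_rows_millions

# Synthetic history: latency_ms = 5 + 2*segments + 10*rows_in_millions
history_x = [(1, 1), (2, 1), (4, 3), (8, 5), (16, 10)]
history_y = [5 + 2 * s + 10 * r for s, r in history_x]
weights = train_cost_model(history_x, history_y)
```

A broker could use such an estimate to prioritize or lane queries; the hard part is feature collection and retraining, not the model itself.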
> > >
> > > On Tue, 28 Jan 2020, 06:46 Charles Allen, <cral...@apache.org> wrote:
> > >
> > > > > it makes more sense to have tooling around Druid, to slice and
> > > > > dice the data that you need, and do the ml stuff in sklearn, or
> > > > > even in spark
> > > >
> > > > I agree with this sentiment. Druid as an execution engine is very
> > > > good at doing distributed aggregation (distributed reduce). What
> > > > advantage does Druid as an engine have that Spark does not for ML?
> > > >
> > > > Are you talking about training or model evaluation? Or either?
> > > >
> > > > It *might* be possible to have a likeness mechanism, whereby you
> > > > pass in a model as a filter and aggregate on rows (dimension
> > > > tuples?) that match the model by some minimum criteria, but I'm not
> > > > really sure what utility that would have. Maybe as a quick
> > > > backtesting engine? I feel like I'm a solution searching for a
> > > > problem going down this route, though.
> > > >
> > > > On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko
> > > > <fo...@driesprong.frl> wrote:
> > > >
> > > > > > Vertica has it. Good idea to introduce it in Druid.
> > > > >
> > > > > I'm not sure this is a valid argument: with it, you can introduce
> > > > > anything into Druid. I think it is good to be opinionated, and to
> > > > > decide as a community why we do or don't introduce ML
> > > > > capabilities into the software.
> > > > >
> > > > > For example, databases like Postgres and BigQuery allow users to
> > > > > build simple regression models:
> > > > > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I
> > > > > also don't think it is that hard to introduce linear regression
> > > > > using gradient descent into Druid:
> > > > > https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> > > > > However, how many people are going to use this?
> > > > >
> > > > > For me, it makes more sense to have tooling around Druid, to
> > > > > slice and dice the data that you need, and do the ml stuff in
> > > > > sklearn, or even in spark. For example, using
> > > > > https://github.com/druid-io/pydruid or having the ability to use
> > > > > Spark to read directly from the deep storage.
> > > > >
> > > > > Introducing models using stored procedures or UDFs is also a
> > > > > possibility, but here I share the concerns of Sayat when it comes
> > > > > to performance and scalability.
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > > On Sat, Jan 25, 2020 at 08:51, Gaurav Bhatnagar
> > > > > <gaura...@gmail.com> wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Vertica has it. Good idea to introduce it in Druid.
> > > > > >
> > > > > > On Mon, Jan 13, 2020 at 12:52 AM Dusan Maric
> > > > > > <thema...@gmail.com> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > That would be a great idea! Thanks for sharing this.
> > > > > > >
> > > > > > > I would just like to chime in on Druid + ML use cases:
> > > > > > > predictions and anomaly detection on top of TensorFlow ❤
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > On Fri, Jan 10, 2020 at 6:41 AM Roman Leventov
> > > > > > > <leventov...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello Druid developers, what do you think about the future
> > > > > > > > of Druid & machine learning?
> > > > > > > >
> > > > > > > > Druid has been great at complex aggregations. Could
> > > > > > > > (should?) it make inroads into ML? Perhaps aggregators
> > > > > > > > which apply the rows against some pre-trained model and
> > > > > > > > summarize the results.
> > > > > > > >
> > > > > > > > Should model training stay completely external to Druid,
> > > > > > > > or could it be incorporated into Druid's data lifecycle on
> > > > > > > > a conceptual level? For example, a recurring "indexing"
> > > > > > > > task which stores its result (the model) in Druid's deep
> > > > > > > > storage, with the model automatically loaded on historical
> > > > > > > > nodes as needed (just like segments) and certain
> > > > > > > > aggregators picking up the latest model?
> > > > > > > >
> > > > > > > > Does this make any sense? In what cases will Druid & ML
> > > > > > > > work well together, in what cases won't they, and should
> > > > > > > > ML stay Spark's prerogative?
> > > > > > > >
> > > > > > > > I would be very interested to hear any thoughts on the
> > > > > > > > topic, vague ideas, and questions.
> > > > > > >
> > > > > > > --
> > > > > > > Dušan Marić
> > > > > > > mob.: +381 64 1124779 | e-mail: thema...@gmail.com | skype: themaric
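Roman's aggregator idea ("apply the rows against some pre-trained model and summarize results") and Charles's "model as a filter" likeness mechanism can both be illustrated in a few lines of Python. This is purely a conceptual sketch under assumed row and model shapes; Druid exposes no such aggregator today:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def model_match_aggregator(rows, weights, bias, threshold=0.5):
    """Score every row with a frozen logistic model and aggregate only
    the rows whose score clears the threshold -- effectively a
    model-driven filtered aggregation."""
    matched, metric_sum = 0, 0.0
    for row in rows:
        score = sigmoid(bias + sum(w * x for w, x in
                                   zip(weights, row["features"])))
        if score >= threshold:
            matched += 1
            metric_sum += row["metric"]
    return {"matched_rows": matched, "metric_sum": metric_sum}
```

In Druid terms the model would play the role the filter bitmap plays today, which is exactly where the cost concern raised upthread bites: the model must be evaluated per row, so the bitset-index advantage disappears.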