Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Alexey Zinoviev Wed, 15 Aug 2018 20:22:34 -0700

Sorry, for the delay, dear Pavel and Denis.

Yes, I need a kind of indexing to improve KNN algorithms during training
the model.

In my draft solution I've implemented building of
https://en.wikipedia.org/wiki/K-d_tree
<https://en.wikipedia.org/wiki/K-d_tree> on the training data set.
It collects the information about data distributed in our specific ML
Datasets (distributed by nodes over Ignite cache)

Pavel, you ask me any questions and I've prepared answers.

1) Should be this index in-memory only or you want to persist it?
*Maybe it should be persisted (to reuse that for next predictions)*

2) In general index can be implemented in two ways per-partition and
per-node.
*Thank you for your explanation. Of course the per-node is better.*

3) I think K-d tree is preferable
*You are absolutely right, it should be K-d tree*

4) Could you please share use cases you're trying to solve with QuadTree?
With
close to real data and examples of queries?

I need to solve *k-nearest neighbors search task *on a set of vectors with
unique keys presented in Ignite Cache (training set),
during the training phase I'm going to build a temporary index as a KD-Tree
(based on distance between vectors).

The distance metric is a parameter too.

After that, in prediction phase the *k-nearest neighbors search task *is
solved by brute-force iteration over all vectors to find the *k-nearest
neighbors.*
I'd like to improve the search part with queries to index to extract
closest vectors.

Of course, it's kind of experiment, and maybe it's bad idea to patch SQL
internals to solve this private task, but as you mentioned it can be a good
point of collaboration between ML and SQL components.

Can I get one of the implemented indices as a primary example and implement
it by myself?
Could you recommend something as an start point for me?

Thanks for your help.

2018-08-04 11:18 GMT+06:00 Denis Magda <[email protected]>:

> Alexey, are you working on some new ML/DL APIs/algorithms? Please elaborate
> what you'd like to add to Ignite.
>
> --
> Denis
>
> On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <[email protected]> wrote:
>
> > Hello Alexey,
> >
> > It's not so difficult to implement new type of indexing of data, but if
> you
> > want to reach performance in distributed environment you need to have
> > strong knowledge of a data you're indexing and what kind of queries you
> > want to execute.
> > Should be this index in-memory only or you want to persist it?
> > In case of persistence your index should fit our page memory model
> > requirements.
> > In both cases your index should be ready to work in concurrent
> environment.
> >
> > In general index can be implemented in two ways per-partition and
> per-node.
> > Per-partition may be efficient if you have a lot of points (x,y)
> > representing a big one, e.g. image. In this case it's required that all
> of
> > these points will be in one partition that query e.g. makes images
> > intersection will execute in one node. But if you have multiple images,
> > your index will contain also another points from other object and will
> > overload it.
> > Per-node may be efficient if you have a lot of points (x,y) that are
> > independent of each other, that you will use it as spatial e.g.. But in
> > this case, I think K-d tree is preferable as it can be used in more wide
> > way.
> >
> > Could you please share use cases you're trying to solve with QuadTree?
> With
> > close to real data and examples of queries? Because now the question is
> too
> > abstract and it's hard to understand how it should be implemented to
> reach
> > good results.
> >
> >
> > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <[email protected]>:
> >
> > > Hi, Igniters.
> > >
> > > Currently I'm working on different math stuff over the Apache Ignite
> and
> > in
> > > a few tasks I need to implement in memory something like this
> > >
> > > https://en.wikipedia.org/wiki/Quadtree
> > >
> > > I didn't find such index in Apache Ignite, but maybe it's under
> > development
> > > by somebody?
> > >
> > > Is it a difficult to add a new index type to our distributed SQL (from
> > > point of view of different infrastructure issues and so on P.S I don't
> > > worry the math stuff here because I've implemented it many times in
> > > non-distributed version)?
> > >
> > > It will be great to hear any kind of your thoughts and maybe somebody
> > could
> > > help with implementation
> > >
> > > zaleslaw
> > >
> >
>

Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Reply via email to