Sorry for the delay, dear Pavel and Denis.

Yes, I need a kind of indexing to improve the KNN algorithms during model training.
In my draft solution I've implemented building a K-d tree (https://en.wikipedia.org/wiki/K-d_tree) on the training data set. It collects information about the data held in our ML Datasets (distributed across nodes in an Ignite cache).

Pavel, you asked me several questions, and I've prepared answers.

1) Should be this index in-memory only or you want to persist it?
*Maybe it should be persisted (to reuse it for subsequent predictions).*

2) In general index can be implemented in two ways per-partition and per-node.
*Thank you for your explanation. Of course, per-node is better.*

3) I think K-d tree is preferable
*You are absolutely right, it should be a K-d tree.*

4) Could you please share use cases you're trying to solve with QuadTree? With close to real data and examples of queries?

I need to solve the *k-nearest neighbors search task* on a set of vectors with unique keys stored in an Ignite cache (the training set). During the training phase I'm going to build a temporary index as a K-d tree (based on the distance between vectors). The distance metric is a parameter too. After that, in the prediction phase, the *k-nearest neighbors search task* is currently solved by brute-force iteration over all vectors to find the *k nearest neighbors*. I'd like to improve the search part by querying the index to extract the closest vectors.

Of course, it's a kind of experiment, and maybe it's a bad idea to patch SQL internals to solve this specific task, but as you mentioned, it can be a good point of collaboration between the ML and SQL components.

Can I take one of the already implemented indices as a reference example and implement mine myself? Could you recommend something as a starting point for me?

Thanks for your help.

2018-08-04 11:18 GMT+06:00 Denis Magda <dma...@apache.org>:

> Alexey, are you working on some new ML/DL APIs/algorithms? Please elaborate
> what you'd like to add to Ignite.
>
> --
> Denis
>
> On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
>
> > Hello Alexey,
> >
> > It's not so difficult to implement new type of indexing of data, but if
> > you want to reach performance in distributed environment you need to have
> > strong knowledge of a data you're indexing and what kind of queries you
> > want to execute.
> > Should be this index in-memory only or you want to persist it?
> > In case of persistence your index should fit our page memory model
> > requirements.
> > In both cases your index should be ready to work in concurrent
> > environment.
> >
> > In general index can be implemented in two ways per-partition and
> > per-node.
> > Per-partition may be efficient if you have a lot of points (x,y)
> > representing a big one, e.g. image. In this case it's required that all
> > of these points will be in one partition that query e.g. makes images
> > intersection will execute in one node. But if you have multiple images,
> > your index will contain also another points from other object and will
> > overload it.
> > Per-node may be efficient if you have a lot of points (x,y) that are
> > independent of each other, that you will use it as spatial e.g.. But in
> > this case, I think K-d tree is preferable as it can be used in more wide
> > way.
> >
> > Could you please share use cases you're trying to solve with QuadTree?
> > With close to real data and examples of queries? Because now the question
> > is too abstract and it's hard to understand how it should be implemented
> > to reach good results.
> >
> >
> > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <zaleslaw....@gmail.com>:
> >
> > > Hi, Igniters.
> > >
> > > Currently I'm working on different math stuff over the Apache Ignite
> > > and in a few tasks I need to implement in memory something like this
> > >
> > > https://en.wikipedia.org/wiki/Quadtree
> > >
> > > I didn't find such index in Apache Ignite, but maybe it's under
> > > development by somebody?
> > >
> > > Is it a difficult to add a new index type to our distributed SQL (from
> > > point of view of different infrastructure issues and so on P.S I don't
> > > worry the math stuff here because I've implemented it many times in
> > > non-distributed version)?
> > >
> > > It will be great to hear any kind of your thoughts and maybe somebody
> > > could help with implementation
> > >
> > > zaleslaw
> > >
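
As a rough illustration of the search discussed in answer 4 at the top of this message, here is a minimal single-node sketch in Python. It is not Ignite code and not the draft implementation: the names `build_kd_tree`, `knn_kd_tree` and `knn_brute_force` are hypothetical, the training set is assumed to be a plain list of `(key, vector)` pairs, and the pruning step assumes a metric in which the difference along one coordinate never exceeds the full distance (true for Euclidean and Manhattan distance, not for arbitrary metrics).

```python
import heapq
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Entry = Tuple[int, Vector]  # (unique key, feature vector): an assumed layout


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class KdNode:
    def __init__(self, entry: Entry, axis: int,
                 left: Optional["KdNode"], right: Optional["KdNode"]):
        self.entry, self.axis, self.left, self.right = entry, axis, left, right


def build_kd_tree(entries: List[Entry], depth: int = 0) -> Optional[KdNode]:
    """Training phase: split on the median, cycling through the axes."""
    if not entries:
        return None
    axis = depth % len(entries[0][1])
    ordered = sorted(entries, key=lambda e: e[1][axis])
    mid = len(ordered) // 2
    return KdNode(ordered[mid], axis,
                  build_kd_tree(ordered[:mid], depth + 1),
                  build_kd_tree(ordered[mid + 1:], depth + 1))


def knn_kd_tree(root: Optional[KdNode], query: Vector, k: int,
                dist: Callable[[Vector, Vector], float] = euclidean
                ) -> List[Tuple[float, int]]:
    """Prediction phase: k nearest (distance, key) pairs, closest first."""
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # max-heap via negated distances

    def visit(node: Optional[KdNode]) -> None:
        if node is None:
            return
        key, vec = node.entry
        d = dist(query, vec)
        if len(heap) < k:
            heapq.heappush(heap, (-d, key))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, key))
        diff = query[node.axis] - vec[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Descend into the far side only if the splitting plane is closer
        # than the current k-th best candidate (or we still lack k results).
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far)

    visit(root)
    return sorted((-neg_d, key) for neg_d, key in heap)


def knn_brute_force(entries: List[Entry], query: Vector, k: int,
                    dist: Callable[[Vector, Vector], float] = euclidean
                    ) -> List[Tuple[float, int]]:
    """Baseline: scan every vector, as the prediction phase does today."""
    return sorted((dist(query, vec), key) for key, vec in entries)[:k]
```

Both functions return the same neighbours when distances are distinct; with exact ties at the k-th position they may pick different keys. A distributed version would additionally need to build one tree per node (or per partition) and merge the per-node candidate lists, which this sketch does not attempt.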