I drafted my proposal about the above topic at https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing . Looking forward to your feedback.
On Wed, Apr 3, 2019 at 10:55 PM GUO Rui <ru...@uci.edu> wrote: > Dear Andrey Borodin, > > I discussed the above topic with the professors at my school, and got the > following points: > > 1. The case that the volume of an MBB is 0 should be very rare, and if the > data is skewed (e.g. only a few nodes have non-NULL value on a dimension) > then the data can be pre-proceeded and normalized before it goes to the > database, thus the storage and query can be much faster; > 2. The performance of the R-tree family may depend on the specific data > set and even the order of the data insertions, so one algorithm may be > better on one dataset and slower on another, thus the benchmark should > include different datasets; > > I totally agree that by adopting the RR*-tree algorithm we can improve the > performance of PostgreSQL. For my proposal, I'll: > 1. Document the benchmarks I found available online (e.g. > https://github.com/ambling/rtree-benchmark), and then state how we'd like > to generate data ourselves (e.g. data with a Gaussian distribution, or the > same dataset but different insertion order...) to test with for a wilder > coverage; > 2. Create tools to generate a report on current PostgreSQL performance > with the benchmark; > 3. Plan to improve the R-tree and GiST part of PostgreSQL. For the > discussion in the email thread > https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#cajeawvfmo-fxaj6lkj8wtb1br0mtby48egmvejboodroegy...@mail.gmail.com > <https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#cajeawvfmo-fxaj6lkj8wtb1br0mtby48egmvejboodroegy...@mail.gmail.com> > , I prefer to do a *scale-based* trick rather than using bits in a float > or creating a new struct; > 4. Generate a performance report on PostgreSQL with the above R-tree patch; > The following would be marked as *optional*: > 5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree > function, also discussed in the above email thread); > 6. For skewed data, try to warn the user, and then suggest methods to cook > the data (e.g. the normalization algorithms in ML); pre-proceeding the data > should not be the duty of the database; > 7. Other advanced features of RR*-tree and GiST bulk loading; > > Any comments or feedback on the above ideas? I'll work on a draft proposal > ASAP. > > Many thanks, > Rui Guo > > > > > On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4...@yandex-team.ru> > wrote: > >> Hi! >> >> > 31 марта 2019 г., в 14:58, GUO Rui <ru...@uci.edu> написал(а): >> > >> > I'm Rui Guo, a PhD student focusing on database at the University of >> California, Irvine. I'm interested in the "GiST API advancement" project >> for the Google Summer of Code 2019 which is listed at >> https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 >> . >> > >> > I'm still reading about RR*-tree, GiST and the PostgreSQL source code >> to have a better idea on my proposal. Meanwhile, I have a very basic and >> simple question: >> > >> > Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are >> heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the >> least), is it possible to apply machine learning algorithm to improve it? >> The only related reference I got is to use deep learning in database join >> operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use >> machine learning here or someone already did? >> >> If you are interested in ML and DBs you should definitely look into [0]. >> You do not have to base your proposal on mentor ideas, you can use your >> own. Implementing learned indexes - seems reasonable. >> >> RR*-tree algorithms are heuristic in some specific parts, but in general >> they are designed to optimize very clear metrics. Generally, ML algorithms >> tend to compose much bigger pile of heuristics and solve less >> mathematically clear tasks than splitting subtrees or choosing subtree for >> insertion. >> R*-tree algorithms are heuristic only to be faster. >> >> Best regards, Andrey Borodin. >> >> [0] https://arxiv.org/pdf/1712.01208.pdf > >