Dear Olegs, dear Nikolay, dear all,

Allow me to revive this thread:
Are there any advances in a learned index for PostgreSQL?

Background: I'm trying to benchmark those experimental indices. For this I did some bibliography work (see below). Fun fact: not only Postgres people love high-proof drinks - sorry, high-proof index names: see "From WiscKey to Bourbon" :-).

My main discovery is a discussion report - which included Stonebraker - entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first two takeaways are:

"#1: The initial comparisons of learned indices with optimized traditional indices should be further expanded to include concurrency control and multi-user settings. (...)

#2: A key benefit of learned indices may come from the learned structures requiring lower space utilization, rather than a reduction in search time."

Yours,
Stefan

[1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel, J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment and Prognosis. Data Engineering, 3. Web access: https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf

Bibliography (without any claim to completeness):

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018, May). The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (pp. 489-504). Web access: https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

"Recursive model index, a learned index structure" (based on Kraska et al.; arXiv preprint 2017, SIGMOD 2018). Source code: https://github.com/BenJoyenConseil/rmi (Go), https://github.com/learnedsystems/RMI (Rust)

Kaur, T. (2018). Learned Index Structures: Practical Implementations and Future Directions. Master's thesis. Web access: https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf (claims an open-source C++ implementation, but none was found)

Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska, T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned index. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (pp. 1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659. Source code: https://github.com/learnedsystems/RadixSpline (C++)

Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B., Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From WiscKey to Bourbon: A learned index for log-structured merge trees. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (pp. 155-171). Web access: https://www.usenix.org/system/files/osdi20-dai_0.pdf

Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... & Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (pp. 969-984). Web access: https://doi.org/10.1145/3318464.3389711 / https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20. Source code: https://github.com/microsoft/ALEX (C++, MIT license)
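For anyone who wants a feel for what these indexes actually do before digging into the repositories above, here is a minimal sketch of the core lookup idea from Kraska et al. 2018 - my own illustration in C++, not code from any of the cited projects: a model predicts the position of a key in a sorted array, and a binary search bounded by the model's worst training error makes the lookup exact. It assumes non-empty input with at least two distinct keys, and that keys are representable in double precision well enough for a sketch.

// Minimal learned-index sketch (hypothetical, for illustration only).
// A single linear model maps key -> approximate position in a sorted
// array; lookup does a binary search bounded by the model's worst
// prediction error observed during "training".
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct LearnedIndex {
    std::vector<uint64_t> keys;         // sorted key array
    double slope = 0.0, intercept = 0.0;
    long max_err = 0;                   // worst prediction error seen in training

    // "Train": least-squares fit of position ~ slope * key + intercept.
    void build(std::vector<uint64_t> sorted_keys) {
        keys = std::move(sorted_keys);
        const double n = double(keys.size());
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (size_t i = 0; i < keys.size(); ++i) {
            sx += double(keys[i]);
            sy += double(i);
            sxx += double(keys[i]) * double(keys[i]);
            sxy += double(keys[i]) * double(i);
        }
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        intercept = (sy - slope * sx) / n;
        for (size_t i = 0; i < keys.size(); ++i) {  // record the error bound
            long p = predict(keys[i]);
            max_err = std::max(max_err, std::labs(p - long(i)));
        }
    }

    long predict(uint64_t key) const {
        long p = std::lround(slope * double(key) + intercept);
        return std::clamp(p, 0L, long(keys.size()) - 1);
    }

    // Lookup: search only within [pred - max_err, pred + max_err].
    bool contains(uint64_t key) const {
        long p = predict(key);
        auto lo = keys.begin() + std::max(0L, p - max_err);
        auto hi = keys.begin() + std::min(long(keys.size()), p + max_err + 1);
        return std::binary_search(lo, hi, key);
    }
};

On key distributions the model fits well, max_err stays small and the bounded search touches only a few cache lines; the RMI from the paper replaces the single linear model with a hierarchy of models to keep that error small on harder distributions.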
On Tue, Dec 12, 2017 at 18:02, Oleg Ivanov <o.iva...@postgrespro.ru> wrote:
>
> On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> > On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> > <samokhva...@gmail.com> wrote:
> >> Very interesting read: https://arxiv.org/abs/1712.01208
> >>
> >> HN discussion: https://news.ycombinator.com/item?id=15894896
> >>
> >> Some of the comments (from Twitter
> >> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
> >> at GOOG just released a paper showing how machine-learned indexes can
> >> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
> >> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
> >> unlike CPU. Amazing."
> >>
> >> Can those ideas be applied to Postgres in its current state? Or is it not
> >> really down-to-earth?
> > Oleg made some analysis of the paper.
>
> If the keys are comparable and the data distribution is simple enough
> (which seems to happen rather often) and the data does not change, we
> can learn the distribution, and so perform the search faster and save
> memory as well. That is what the authors wrote, and that is what must
> work in my opinion.
>
> The limitations of the approach, which the authors mention in their work:
> + The data does not change. The proposed method can be extended more or
> less straightforwardly to nearly append-only workloads and to workloads
> in which the data distribution does not change or changes very slowly
> (the data itself or its amount may change, nevertheless). Otherwise it
> is challenging, because the key positions can no longer be considered
> independent values, but it still looks possible. The other option
> proposed by the authors is to store new objects in a buffer and retrain
> the model in the background, but that does not look like a nice solution.
> + GPUs are fast, but the latency of doing operations on the GPU almost
> certainly cancels out all speed benefits for such small models. The only
> case in which it is reasonable to use the GPU is training the models
> (i.e. building such indexes).
>
> The following was left unclear to me from the paper:
> + How effectively the approach can be implemented taking into account
> the limitations above.
> + For hash functions the authors propose to replace the hash function
> with the learned CDF, which can decrease the number of collisions, which
> decreases memory consumption. That is reasonable, but this hash function
> implicitly uses the information about the order on the keys. I suppose
> that by using the ordering on the data, with access to the data
> histogram, one could achieve a similar decrease in memory consumption.
> + The proposed hierarchical learning looks natural for the problem, but
> it is not clear that it is the best option. For example, one might use
> an end-to-end learning procedure with weight sparsity regularizers as
> well. I wonder whether the latter approach can obtain results
> competitive with the first one, and if not, then why. Anyway, I have a
> feeling that the proposed method has room for improvement.
>
> In general, the approach still needs some additional research (mostly
> about the changing data) and has some points to think about how to
> implement them properly (paging, for example), but it looks promising
> enough nevertheless.
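PS: To make the learned-CDF hashing idea quoted above concrete, here is a minimal hypothetical sketch (mine, not code from the paper): approximate the key distribution's CDF with a sorted sample, then map a key to bucket floor(CDF(key) * m). Keys then spread across buckets according to their actual distribution rather than uniformly by accident, which reduces collisions. Note that, exactly as Oleg observes, the empirical CDF relies on the ordering of the keys.

// Hypothetical "learned CDF as hash function" sketch, for illustration.
// The "model" is just an empirical CDF over a sorted sample of keys;
// assumes a non-empty sample and m >= 1 buckets.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CdfHash {
    std::vector<uint64_t> sample;   // sorted sample of keys (the "model")
    size_t num_buckets;

    CdfHash(std::vector<uint64_t> sorted_sample, size_t m)
        : sample(std::move(sorted_sample)), num_buckets(m) {}

    // Empirical CDF: fraction of sampled keys <= key.
    double cdf(uint64_t key) const {
        auto it = std::upper_bound(sample.begin(), sample.end(), key);
        return double(it - sample.begin()) / double(sample.size());
    }

    size_t bucket(uint64_t key) const {
        size_t b = size_t(cdf(key) * double(num_buckets));
        return std::min(b, num_buckets - 1);  // cdf == 1.0 maps to the last bucket
    }
};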