On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
<samokhva...@gmail.com> wrote:
Very interesting read: https://arxiv.org/abs/1712.01208

HN discussion: https://news.ycombinator.com/item?id=15894896

Some of the comments (from Twitter
https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
at GOOG just released a paper showing how machine-learned indexes can
replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
B-Trees, 10-100x less space. Executes on GPU, which are getting faster
unlike CPU. Amazing."

Can those ideas be applied to Postgres in its current state? Or it's not
really down-to-earth?
Oleg did some analysis of the paper.
If the keys are comparable, the data distribution is simple enough (which seems to happen rather often), and the data does not change, we can learn the distribution and thus perform the search faster while also saving memory. That is what the authors write, and in my opinion that is what should work.
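To make the idea concrete, here is a minimal sketch of such a learned index in Python (my own illustration under simple assumptions, not the authors' multi-stage architecture; the class and method names are hypothetical): a linear model approximates the CDF of a sorted key array, predicts a key's position, and a binary search bounded by the model's worst-case error corrects the prediction.

import bisect

class LearnedIndexSketch:
    """Linear-model approximation of the key CDF plus a bounded local search."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept with simple least squares.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        var = sum((k - mean_k) ** 2 for k in sorted_keys) or 1.0
        self.slope = cov / var
        self.intercept = mean_p - self.slope * mean_k
        # The worst-case prediction error bounds the local search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

# Example: a nearly uniform key distribution is learned almost exactly,
# so the correction window collapses to a couple of slots.
idx = LearnedIndexSketch(list(range(0, 1_000_000, 3)))
print(idx.lookup(999))    # -> 333
print(idx.lookup(1000))   # -> None (key absent)

The memory saving comes from keeping only the model parameters and the error bound instead of the inner nodes of a B-tree; the lookup stays fast as long as the model's error stays small.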

The limitations of the approach, which the authors mention in their work:

+ The data does not change. The proposed method can be extended more or less straightforwardly to nearly append-only workloads and to workloads in which the data distribution does not change or changes very slowly (the data itself or its amount may still change, nevertheless). Otherwise it is challenging, because the key positions can no longer be treated as independent values, but it still looks possible. The other option proposed by the authors is to store new objects in a buffer and retrain the model in the background (a rough sketch of that idea follows this list), but that does not look like a nice solution.

+ GPUs are fast, but the latency of invoking operations on the GPU almost certainly cancels out all the speed benefits for such small models. The only case in which it is reasonable to use a GPU is training the models (i.e. building such indexes).
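For what it is worth, the buffer-and-retrain option could look roughly like the following sketch, which reuses the LearnedIndexSketch class from above; the wrapper class, its names, and the retraining threshold are all hypothetical, and the paper only mentions the idea rather than prescribing this design.

class BufferedLearnedIndex:
    """Hypothetical wrapper: a learned index plus a side buffer for new keys."""

    def __init__(self, sorted_keys, retrain_threshold=1024):
        self.main = LearnedIndexSketch(sorted_keys)  # class from the sketch above
        self.buffer = set()                          # keys inserted since the last retrain
        self.retrain_threshold = retrain_threshold

    def insert(self, key):
        self.buffer.add(key)
        if len(self.buffer) >= self.retrain_threshold:
            self._retrain()          # the paper suggests doing this asynchronously

    def contains(self, key):
        # Lookups must consult the buffer first, then the learned structure.
        return key in self.buffer or self.main.lookup(key) is not None

    def _retrain(self):
        merged = sorted(set(self.main.keys) | self.buffer)
        self.main = LearnedIndexSketch(merged)
        self.buffer.clear()

In a real system the merge and retraining would have to run in the background, and every lookup pays for the extra buffer probe, which is part of why this does not look like a complete answer to changing data.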

The following remained unclear to me from the paper:

+ How effectively the approach can be implemented, taking into account the limitations above.

+ For hash indexes the authors propose to replace the hash function with the learned CDF, which can decrease the number of collisions and therefore the memory consumption (a minimal sketch of this idea follows the list). That is reasonable, but such a hash function implicitly uses the information about the ordering of the keys. I suspect that, by using the ordering of the data together with access to a histogram of the data, one could achieve a similar decrease in memory consumption.

+ The proposed hierarchical learning looks natural for the problem, but it is not clear that it is the best option. For example, one might instead use an end-to-end learning procedure with weight-sparsity regularizers. I wonder whether the latter approach can obtain results competitive with the former, and if not, why. Anyway, I have a feeling that the proposed method has room for improvement.
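As an illustration of the learned-hash point (again my own sketch with hypothetical names; an empirical CDF stands in for the learned model from the paper), mapping a key to a bucket through a scaled CDF spreads even a skewed key set almost uniformly over the buckets, which is where the reduction in collisions, and hence in memory, comes from.

import bisect
from collections import Counter

class CDFHash:
    """Hypothetical learned hash: bucket = scaled empirical CDF of the key."""

    def __init__(self, sample_keys, num_buckets):
        self.sample = sorted(sample_keys)   # sample used to approximate the CDF
        self.num_buckets = num_buckets

    def bucket(self, key):
        # Fraction of sampled keys <= key, scaled into the bucket range.
        cdf = bisect.bisect_right(self.sample, key) / len(self.sample)
        return min(int(cdf * self.num_buckets), self.num_buckets - 1)

# Even a heavily skewed key set maps almost evenly onto the buckets.
keys = [x * x for x in range(10_000)]
h = CDFHash(keys, num_buckets=256)
load = Counter(h.bucket(k) for k in keys)
print(min(load.values()), max(load.values()))   # both close to 10_000 / 256

Note that this only works because the bucket function sees (a sample of) the key order, which is exactly the implicit use of ordering information mentioned above.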

In general, the approach still needs some additional research (mostly about changing data), and there are some points that require thought on how to implement them properly (paging, for example), but it looks promising enough nevertheless.
