On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > If I get what you propose, you want to have a "top" tree for (job, nlp, > year), which "splits" the data set into subsets of ~5000-7000 rows. And > then for each subset you want a separate "small" trees on each of the > other columns, so in this case three trees. > > Well, the problem with this is pretty obvious - each of the small trees > requires separate copies of the leaf pages. And remember, in a btree the > internal pages are usually less than 1% of the index, so this pretty > much triples the size of the index. And if you insert a row into the > index, it has to insert the item pointer into each of the small trees, > likely requiring a separate I/O for each. > > So I'd bet this is not any different from just having three separate > indexes - it doesn't save space, doesn't save I/O, nothing.
I agree. In a lot of cases, it's actually useful to define the index on fewer columns, like (job, nlp, year) or even just (job, nlp) or even just (job) because it makes the index so much smaller and that's pretty important. If you have enough duplicate entries in a (job, nlp, year) index to justify create a (job, nlp, year, sequence) index, the number of distinct (job, nlp, year) tuples has to be small compared to the number of (job, nlp, year, sequence) tuples - and that means that you wouldn't actually save much by combining your (job, nlp, year, sequence) index with a (job, nlp, year, other-stuff) index. As you say, the internal pages aren't the problem. I don't intend to say that there's no possible use for this kind of technology. Peter G. says that people are writing research papers about that and they probably wouldn't be doing that unless they'd found some case where it's a big win. But this example seems extremely unconvincing. -- Robert Haas EDB: http://www.enterprisedb.com