I realized later that "these data structures" was referring to "non-balanced data structures, such as quad-trees, k-d trees, and radix trees (tries)" from the previous paragraph, and not SP-GiST, which isn't a data structure per se. I feel a bit silly about this confusion and am not sure others would have the same confusion I did.
Alex On Mon, Oct 16, 2023 at 11:09 PM Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote: > On 2023-Oct-16, PG Doc comments form wrote: > > > I'm confused by this paragraph: > > > > > These popular data structures were originally developed for in-memory > > > usage. In main memory, they are usually designed as a set of > dynamically > > > allocated nodes linked by pointers. This is not suitable for direct > storing > > > on disk, since these chains of pointers can be rather long which would > > > require too many disk accesses. In contrast, disk-based data structures > > > should have a high fanout to minimize I/O. The challenge addressed by > > > SP-GiST is to map search tree nodes to disk pages in such a way that a > > > search need access only a few disk pages, even if it traverses many > nodes. > > > > In particular, "These popular datastructures" is ambiguous -- based on > how > > the previous paragraph ends, it sounds like the "popular datastructures" > or > > SP-GiSTs, but then it goes on to say they were designed for in-memory, > and > > then it mentions that SP-GiST's space partitioning (with high fanout) is > > more appropriate to minimize disk access. I think maybe the solution here > > would be to replace "These popular data structures" with something like > > "Classic Postgres indexes such as B-tree indexes..." or something like > > that. > > Yeah, this paragraph is a rewording of Oleg's[1] > > SP-GiST is an abbreviation of space-partitioned GiST - the search > tree, which allows to implement a wide range of different > non-balanced disk-based data structures, such as quadtree, kd-tree, > tries - popular data structures, originally developed for memory > storage. Main memory access structures usually designed as a set of > dynamically allocated nodes linked by pointers, which is not suitable > for direct storing on disk, since these chains of pointers can be > rather long and require too many disk accesses. In opposite, disk > based data structures have a high fanout to minimize I/O. The > challenge is to map nodes of tree to disk pages in such a way, so > search algorithm accesses only a few disk pages, even if it traverse > many nodes. > > where the point is (IMO) much clearer. > > [1] http://www.sai.msu.su/~megera/wiki/spgist_dev > > > Also, I think a short parenthetical definition of "fanout" would be > useful > > here, something like "high fanout (i.e. where each node has a potentially > > large number of child nodes that reside in the same disk page)". Took me > > some googling to realize what fanout meant in this context. > > Hmm. We also use the term (hypenated as fan-out) in the reference to > recursive_worktable_factor. Maybe we should list it in the glossary in > a way that works for both. > > -- > Álvaro Herrera PostgreSQL Developer — > https://www.EnterpriseDB.com/ > "La primera ley de las demostraciones en vivo es: no trate de usar el > sistema. > Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) >