On Fri, Apr 5, 2019 at 11:25 PM Andres Freund <and...@anarazel.de> wrote: > I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the > others that collaborated on making tableam happen. It was/is a huge > project.
Thank you so much for bringing this project to commit! Excellent work! Your explanation of existing limitations looks very good and convincing. But I think there is one you didn't mention. We require new table AMs to basically save old "contract" between heap and indexes. We have "all or nothing" choice during updates. So, storage can either insert to none of indexes or insert to all of them (HOT-like update). I think any storage, which is going to address "write amplification" point raised by Uber, needs to break this "contract". For example, zheap is promised to implement delete-marking indexes. But it's not yet published. And for me it's not clear that this approach is better among the alternatives. With delete-marking approach you need to update index tuples corresponding to old values of updated fields. But additionally to that it's not trivial to delete index tuple. In order to do that, you need to both locate this index tuple and know that this index value isn't present in undo chain. So, it's likely required another index lookup during purging of undo chain. Thus, we basically need to random lookup index twice for every deleted index tuple. Also, it becomes more complex to lookup appropriate heap tuple during index scan. Then you need to check not only visibility, but also matching index value (here we need to adjust index_fetch_tuple interface). Because it might happen that visible to you version have different index value. That may lead to O(N^2) performance while accessing single row with N versions (MySQL InnoDB has this problem). Alternative idea is to have MVCC-aware indexes. This approach looks more attractive for me. In this approach you basically need xmin, xmax fields in index tuples. On insertion of index tuple you fill it's xmin. On update, previous index tuple is marked with xmax. After that outdated index tuples might be deleted in the lazy manner when page space is required. So, only one random access is required for deleted index tuple. With this approach fetching single row is O(N). Also, index-only scan becomes very easy and doesn't even need a visibility map. The only problem here is extra space requirements for index tuples. But I think, this is well-isolated problem, which is easy to attack. For instance, some visibility information could be evicted to undo chain (like zheap does for its tuples). Also, we can have special bit for "all visible" index tuples. With "all visible" bit set this tuple can get rid of visibility fields. We can do this for index tuples, because if index tuple requires extra space we can split the page, in spite of heap where tuples are fixed in pages and xmax needs to be updated in-place. I understand that delete-marking indexes have some advantages, and some people find them more appealing. But my point is that we shouldn't builtin one of this approaches into API unless we have concrete proof that this approach is strongly overcomes another. It would be better to have our table-AM API flexible enough to implement both. I can imagine we have proper encapsulation here bringing more interaction with indexes to the table AM side. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company