Hi!

On 2017-12-06 19:40:00 +0300, Konstantin Knizhnik wrote:
> As far as I remember, several years ago when implementation of intra-query
> parallelism was just started there was discussion whether to use threads or
> leave traditional Postgres process architecture. The decision was made to
> leave processes. So now we have bgworkers, shared message queue, DSM, ...
> The main argument for such decision was that switching to threads will
> require rewriting of most of Postgres code.
> It seems to be quite reasonable argument and until now I agreed with it.
>
> But recently I wanted to check it myself.

I think that's something pretty important to play with. There've been
several discussions lately, both on and off list / in person, that we're
taking on more-and-more technical debt just because we're using
processes. Besides the above, we've grown:
- a shared memory allocator
- a shared memory hashtable
- weird looking thread aware pointers
- significant added complexity in various projects due to addresses not
  being mapped to the same address etc.

> The first problem with porting Postgres to pthreads is static variables
> widely used in Postgres code.
> Most of modern compilers support thread local variables, for example GCC
> provides __thread keyword.
> Such variables are placed in separate segment which is addressed through
> segment register (at Intel).
> So access time to such variables is the same as to normal static variables.

I experimented similarly. Although I'm not 100% sure that if we were to
go for it, we wouldn't instead want to abstract our session concept
further, or well, at all.

> Certainly may be not all compilers have builtin support of TLS and may be
> not at all hardware platforms them are implemented as efficiently as at
> Intel.
> So certainly such approach decreases portability of Postgres. But IMHO it is
> not so critical.

I'd agree there, but I don't think the project necessarily does.

> What I have done:
> 1. Add session_local (defined as __thread) to definition of most of static
> and global variables.
> I left some variables pointed to shared memory as static. Also I had to
> change initialization of some static variables,
> because address of TLS variable can not be used in static initializers.
> 2. Change implementation of GUCs to make them thread specific.
> 3. Replace fork() with pthread_create
> 4. Rewrite file descriptor cache to be global (shared by all threads).

That one I'm very unconvinced of, that's going to add a ton of new
contention.

> What are the advantages of using threads instead of processes?
>
> 1. No need to use shared memory. So there is no static limit for amount of
> memory which can be used by Postgres. No need in distributed shared memory
> and other stuff designed to share memory between backends and
> bgworkers.

This imo is the biggest part. We can stop duplicating OS and our own
implementations in a shmem aware way.

> 2. Threads significantly simplify implementation of parallel algorithms:
> interaction and transferring data between threads can be done easily and
> more efficiently.

That's imo the same as 1.

> 3. It is possible to use more efficient/lightweight synchronization
> primitives. Postgres now mostly relies on its own low level sync. primitives
> which user-level implementation
> is using spinlocks and atomics and then fallback to OS semaphores/poll. I am
> not sure how much gain can we get by replacing these primitives with one
> optimized for threads.
> My colleague from Firebird community told me that just replacing processes
> with threads can obtain 20% increase of performance, but it is just first
> step and replacing sync. primitive
> can give much greater advantage. But may be for Postgres with its low level
> primitives it is not true.

I don't believe that that's actually the case to any significant degree.

> 6. Faster backend startup. Certainly starting backend at each user's request
> is bad thing in any case. Some kind of connection pooling should be used in
> any case to provide acceptable performance. But in any case, start of new
> backend process in postgres causes a lot of page faults which have
> dramatical impact on performance. And there is no such problem with threads.

I don't buy this in itself. The connection establishment overhead isn't
largely the fork, it's all the work afterwards. I do think it makes
connection pooling etc easier.
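
For readers following along, steps 1 and 3 of the quoted list (session_local
defined as __thread, and fork() replaced with pthread_create) can be
sketched roughly as below. This is only an illustration, not code from the
patch: session_counter, counter_ptr, session_main, run_session and
caller_counter are made-up names, and the sketch assumes a compiler that
supports __thread (GCC or Clang).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: session_local is simply the GCC/Clang __thread keyword. */
#define session_local __thread

/* Hypothetical per-session state: every thread gets its own copy. */
static session_local int session_counter = 0;

/*
 * The address of a TLS variable is not a constant expression, so
 *     static int *counter_ptr = &session_counter;
 * is rejected by the compiler. Such pointers have to be set up at
 * session start instead.
 */
static session_local int *counter_ptr = NULL;

/* Thread entry point, standing in for a backend's main function. */
static void *
session_main(void *arg)
{
    int     increments = (int) (intptr_t) arg;

    counter_ptr = &session_counter;     /* runtime initialization */
    for (int i = 0; i < increments; i++)
        (*counter_ptr)++;

    return (void *) (intptr_t) session_counter;
}

/*
 * Start one "session" as a thread (where a process model would fork()),
 * wait for it, and return the final value of its private counter.
 * Returns -1 if the thread could not be created or joined.
 */
int
run_session(int increments)
{
    pthread_t   tid;
    void       *result;

    assert(increments >= 0);

    if (pthread_create(&tid, NULL, session_main,
                       (void *) (intptr_t) increments) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;

    return (int) (intptr_t) result;
}

/* Value of session_counter as seen by the calling thread. */
int
caller_counter(void)
{
    return session_counter;
}
```

Each call to run_session() starts from a fresh zero-initialized copy of
session_counter, and the caller's own copy is never touched by the
sessions, which is the isolation that plain static variables get for free
under fork(). Build with -pthread.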

> I just want to receive some feedback and know if community is interested in
> any further work in this direction.

I personally am. I think it's beyond high time that we move to take
advantage of threads.

That said, I don't think just replacing processes with threads is the
right thing. I'm pretty sure we'd still want to have postmaster as a
separate process, for robustness. Possibly we even want to continue
having various processes around besides that; the most interesting cases
involving threads are around intra-query parallelism, and pooling, and
for both a hybrid model could be beneficial.

I think that we probably initially want some optional move to
threads. Most extensions won't initially be thread ready, and imo we
should continue to work with that for a while, just refusing to use
parallelism if any loaded shared library doesn't signal parallelism
support. We also don't necessarily want to require threads on all
platforms at the same time.

I think the biggest problem with doing this for real is that it's a huge
project, and that it'll take a long time.

Thanks for working on this!

Andres Freund