On Thu, Jul 2, 2015 at 5:57 PM, Armin Rigo <ar...@tunes.org> wrote:
> Hi all,
>
> I implemented support for %fs and %gs segment prefixes on the x86 and
> x86-64 platforms, in what turns out to be a small patch.
>
> For those not familiar with it, at least on x86-64, %fs and %gs are
> two special registers that a user program can ask be added to any
> address machine instruction. This is done with a one-byte instruction
> prefix, "%fs:" or "%gs:". The actual value stored in these two
> registers cannot quickly be modified (at least before the Haswell
> CPU), but the general idea is that they are rarely modified.
> Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs
> at the same speed as a "movq (%rdx), %rax" would. (I failed to
> measure any difference, but I guess that the instruction is one more
> byte in length, which means that a large quantity of them would tax
> the instruction caches a bit more.)
>
> For reference, the pthread library on x86-64 uses %fs to point to
> thread-local variables. There are a number of special modes in gcc to
> already produce instructions like "movq %fs:(16), %rax" to load
> thread-local variables (declared with __thread). However, this
> support is special-case only. The %gs register is free to use. (On
> x86, %gs is used by pthread and %fs is free to use.)
>
>
> So what I did is to add the __seg_fs and __seg_gs address spaces. It
> is used like this, for example:
>
>     typedef __seg_gs struct myobject_s {
>         int a, b, c;
>     } myobject_t;
>
> You can then use variables of type "struct myobject_s *o1" as regular
> pointers, and "myobject_t *o2" as %gs-based pointers. Accesses to
> "o2->a" are compiled to instructions that use the %gs prefix; accesses
> to "o1->a" are compiled as usual. These two pointer types are
> incompatible. The way you obtain %gs-based pointers, or control the
> value of %gs itself, is out of the scope of gcc; you do that by using
> the correct system calls and by manual arithmetic.
> There is no automatic conversion; the C code can contain casts
> between the three address spaces (regular, %fs and %gs) which, like
> regular pointer casts, are no-ops.
>
>
> My motivation comes from the PyPy-STM project ("removing the Global
> Interpreter Lock" for this Python interpreter). In this project, I
> want *almost all* pointer manipulations to resolve to different
> addresses depending on which thread runs the code. The idea is to use
> mmap() tricks to ensure that the actual memory usage remains
> reasonable, by sharing most of the pages (but not all of them) between
> each thread's "segment". So most accesses to a %gs-prefixed address
> actually access the same physical memory in all threads; but not all
> of them. This gives me a dynamic way to have a large quantity of data
> which every thread can read, and by changing occasionally the mapping
> of a single page, I can make some changes be thread-local, i.e.
> invisible to other threads.
>
> Of course, the same effect can be achieved in other ways, like
> declaring a regular "__thread intptr_t base;" and adding the "base"
> explicitly to every pointer access. Clearly, this would have a large
> performance impact. The %gs solution comes at almost no cost. The
> patched gcc is able to compile the hundreds of MBs of (generated) C
> code with systematic %gs usage and seems to work well (with one
> exception, see below).
>
>
> Is there interest in that? And if so, how to progress?
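(As an aside for readers of the thread: the "explicit base" alternative mentioned in the quoted paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions. The names obj_off_t, resolve(), select_segment(), segment_a/segment_b and demo_explicit_base() are hypothetical and do not come from the patch or from PyPy-STM, and two static arrays stand in for the mmap()-based per-thread segments.)

```c
#include <stddef.h>
#include <stdint.h>

struct myobject_s {
    int a, b, c;
};

/* Per-thread base address, as in the quoted "__thread intptr_t base;". */
static __thread intptr_t base;

/* Hypothetical "offset pointer": only meaningful once base is added. */
typedef uintptr_t obj_off_t;

/* Every access pays for an explicit addition of the thread's base. */
static inline struct myobject_s *resolve(obj_off_t off)
{
    return (struct myobject_s *)(base + (intptr_t)off);
}

/* Two stand-in "segments"; a real system would use mmap()ed regions. */
static struct myobject_s segment_a[2], segment_b[2];

/* Hypothetical helper: point the calling thread at one segment. */
static void select_segment(struct myobject_s *seg)
{
    base = (intptr_t)seg;
}

/* The same offset resolves to different objects depending on base. */
int demo_explicit_base(void)
{
    obj_off_t o = sizeof(struct myobject_s); /* offset of element 1 */

    select_segment(segment_a);
    resolve(o)->a = 1;

    select_segment(segment_b);
    resolve(o)->a = 2;

    return segment_a[1].a * 10 + segment_b[1].a;
}
```

(The addition in resolve() is the per-access cost that the %gs prefix is meant to avoid.)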

It's nice to have the ability to test address-space issues on a commonly
available target at least (not sure if adding runtime testcases is easy
though).

> * The patch included here is very minimal. It is against the
> gcc_5_1_0_release branch but adapting it to "trunk" should be
> straightforward.
>
> * I'm unclear if target_default_pointer_address_modes_p() should
> return "true" or not in this situation: i386-c.c now defines more than
> the default address mode, but the new ones also use pointers of the
> same standard size.
>
> * One case in which this patched gcc miscompiles code is found in the
> attached bug1.c/bug1.s. (This case almost never occurs in PyPy-STM,
> so I could work around it easily.) I think that some early, pre-RTL
> optimization is to "blame" here, possibly getting confused because the
> nonstandard address spaces also use the same size for pointers. Of
> course it is also possible that I messed up somewhere, or that the
> whole idea is doomed because many optimizations make a similar
> assumption. Hopefully not: it is the only issue I encountered.

Hmm, without being able to dive into it with a debugger it's hard to tell ;)
You might want to open a bugreport in bugzilla for this at least.

> * The extra byte needed for the "%gs:" prefix is not explicitly
> accounted for. Is it only by chance that I did not observe gcc
> underestimating how large the code it writes is, and then e.g. use
> jump instructions that would be rejected by the assembler?

Yes, I think you are just lucky here.

Richard.

> * For completeness: this is very similar to clang's
> __attribute__((addressspace(256))) but a few details differ. (Also,
> not to discredit other projects in their concurrent's mailing list,
> but I had to fix three distinct bugs in llvm before I could use it.
> It contributes to me having more trust in gcc...)
>
>
> Links for more info about pypy-stm:
>
> * http://morepypy.blogspot.ch/2015/03/pypy-stm-251-released.html
> * https://bitbucket.org/pypy/stmgc/src/use-gcc/gcc-seg-gs/
> * https://bitbucket.org/pypy/stmgc/src/use-gcc/c8/stmgc.h
>
>
> Thanks for reading so far!
>
> Armin