Hi all, I implemented support for %fs and %gs segment prefixes on the x86 and x86-64 platforms, in what turns out to be a small patch.
For those not familiar with them: at least on x86-64, %fs and %gs are two special registers whose value a user program can ask to be added to the address in any machine instruction that accesses memory. This is done with a one-byte instruction prefix, "%fs:" or "%gs:". The actual value stored in these two registers cannot be modified quickly (at least on CPUs without the FSGSBASE instructions, i.e. roughly before Haswell), but the general idea is that they are rarely modified. Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs at the same speed as "movq (%rdx), %rax" would. (I failed to measure any difference, but the instruction is one byte longer, so I guess that a large quantity of them would tax the instruction caches a bit more.)

For reference, the pthread library on x86-64 uses %fs to point to thread-local variables. gcc already has a number of special modes that produce instructions like "movq %fs:(16), %rax" to load thread-local variables (declared with __thread). However, this support is special-case only. The %gs register is free to use. (On x86, %gs is used by pthread and %fs is free to use.)

So what I did is to add the __seg_fs and __seg_gs address spaces. They are used like this, for example:

    typedef __seg_gs struct myobject_s {
        int a, b, c;
    } myobject_t;

You can then use variables of type "struct myobject_s *o1" as regular pointers, and "myobject_t *o2" as %gs-based pointers. Accesses to "o2->a" are compiled to instructions that use the %gs prefix; accesses to "o1->a" are compiled as usual. These two pointer types are incompatible.

The way you obtain %gs-based pointers, or control the value of %gs itself, is out of the scope of gcc; you do that by using the correct system calls and by manual arithmetic. There is no automatic conversion; the C code can contain casts between the three address spaces (regular, %fs and %gs) which, like regular pointer casts, are no-ops.

My motivation comes from the PyPy-STM project ("removing the Global Interpreter Lock" for this Python interpreter).
In this project, I want *almost all* pointer manipulations to resolve to different addresses depending on which thread runs the code. The idea is to use mmap() tricks to ensure that the actual memory usage remains reasonable, by sharing most of the pages (but not all of them) between each thread's "segment". So most accesses to a %gs-prefixed address actually access the same physical memory in all threads; but not all of them. This gives me a dynamic way to have a large quantity of data which every thread can read, and by occasionally changing the mapping of a single page, I can make some changes thread-local, i.e. invisible to other threads.

Of course, the same effect can be achieved in other ways, like declaring a regular "__thread intptr_t base;" and adding "base" explicitly to every pointer access. Clearly, this would have a large performance impact. The %gs solution comes at almost no cost. The patched gcc is able to compile the hundreds of MBs of (generated) C code with systematic %gs usage and seems to work well (with one exception, see below).

Is there interest in that? And if so, how should I proceed?

* The patch included here is very minimal. It is against the gcc_5_1_0_release branch but adapting it to "trunk" should be straightforward.

* I'm unclear if target_default_pointer_address_modes_p() should return "true" or not in this situation: i386-c.c now registers address spaces beyond the default one, but the new ones use pointers of the same standard size.

* One case in which this patched gcc miscompiles code is found in the attached bug1.c/bug1.s. (This case almost never occurs in PyPy-STM, so I could work around it easily.) I think that some early, pre-RTL optimization is to "blame" here, possibly getting confused because the nonstandard address spaces also use the same size for pointers. Of course it is also possible that I messed up somewhere, or that the whole idea is doomed because many optimizations make a similar assumption. Hopefully not: it is the only issue I encountered.

* The extra byte needed for the "%gs:" prefix is not explicitly accounted for. Is it only by chance that I have not seen gcc underestimate the size of the code it emits, and then e.g. use jump instructions that the assembler would reject?

* For completeness: this is very similar to clang's __attribute__((addressspace(256))) but a few details differ. (Also, not to discredit other projects on their competitor's mailing list, but I had to fix three distinct bugs in llvm before I could use it. That contributes to my having more trust in gcc...)

Links for more info about pypy-stm:

* http://morepypy.blogspot.ch/2015/03/pypy-stm-251-released.html
* https://bitbucket.org/pypy/stmgc/src/use-gcc/gcc-seg-gs/
* https://bitbucket.org/pypy/stmgc/src/use-gcc/c8/stmgc.h

Thanks for reading this far!

Armin
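The segment idea described above (most pages shared between all segments, a few made private) can be sketched in plain C. This is only an illustration under stated assumptions, not the PyPy-STM implementation: it emulates %gs with a thread-local base pointer (the slow alternative mentioned above), backs the segments with a temporary file, and makes a page private by remapping it with MAP_PRIVATE. All names here (segment_base, seg_int, setup_segments, privatize_page) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

enum { NUM_SEGMENTS = 2, PAGES_PER_SEGMENT = 4 };

/* Hypothetical stand-in for %gs: each thread points it at its own segment. */
__thread char *segment_base;

char *segments[NUM_SEGMENTS];   /* start address of each segment */
size_t page_size;               /* filled in by setup_segments() */
int backing_fd = -1;            /* file that backs the shared pages */

/* Emulates a %gs-relative access to the int stored at "offset". */
int *seg_int(uintptr_t offset)
{
    assert(segment_base != NULL);
    return (int *)(segment_base + offset);
}

/* Maps the same backing file once per segment, so that every page starts
   out shared between all segments.  Returns 0 on success, -1 on error.
   The temporary file is deliberately left open for later remapping. */
int setup_segments(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    FILE *backing = tmpfile();
    if (ps <= 0 || backing == NULL)
        return -1;
    page_size = (size_t)ps;
    size_t seg_size = PAGES_PER_SEGMENT * page_size;
    backing_fd = fileno(backing);
    if (ftruncate(backing_fd, (off_t)seg_size) != 0)
        return -1;
    for (int i = 0; i < NUM_SEGMENTS; i++) {
        void *p = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, backing_fd, 0);
        if (p == MAP_FAILED)
            return -1;
        segments[i] = p;
    }
    return 0;
}

/* Remaps one page of one segment as a private copy-on-write mapping of the
   same file page: later writes through that segment stay invisible to the
   other segments.  Returns 0 on success, -1 on error. */
int privatize_page(int seg, size_t page_index)
{
    char *addr = segments[seg] + page_index * page_size;
    void *p = mmap(addr, page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED, backing_fd,
                   (off_t)(page_index * page_size));
    return p == MAP_FAILED ? -1 : 0;
}
```

To use it, call setup_segments() once, then have each thread set segment_base to its own entry of segments[]. A store through seg_int() is seen by every segment until privatize_page() is called for that page in some segment, after which writes made through that segment are local to it. With the __seg_gs address space, the explicit addition inside seg_int() is what the "%gs:" prefix would do for free.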
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 223859)
+++ gcc/config/i386/i386.c (working copy)
@@ -15963,6 +15963,20 @@
             fputs (" PTR ", file);
         }
 
+      /**** <AR> ****/
+      switch (MEM_ADDR_SPACE(x))
+        {
+        case ADDR_SPACE_SEG_FS:
+          fputs (ASSEMBLER_DIALECT == ASM_ATT ? "%fs:" : "fs:", file);
+          break;
+        case ADDR_SPACE_SEG_GS:
+          fputs (ASSEMBLER_DIALECT == ASM_ATT ? "%gs:" : "gs:", file);
+          break;
+        default:
+          break;
+        }
+      /**** </AR> ****/
+
       x = XEXP (x, 0);
       /* Avoid (%rip) for call operands. */
       if (CONSTANT_ADDRESS_P (x) && code == 'P'
@@ -51816,6 +51830,120 @@
 }
 #endif
 
+
+/***** <AR> *****/
+
+/*** GS segment register addressing mode ***/
+
+static machine_mode
+ix86_addr_space_pointer_mode (addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+              as == ADDR_SPACE_SEG_FS ||
+              as == ADDR_SPACE_SEG_GS);
+  return ptr_mode;
+}
+
+/* Return the appropriate mode for a named address address. */
+static machine_mode
+ix86_addr_space_address_mode (addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+              as == ADDR_SPACE_SEG_FS ||
+              as == ADDR_SPACE_SEG_GS);
+  return Pmode;
+}
+
+/* Named address space version of valid_pointer_mode. */
+static bool
+ix86_addr_space_valid_pointer_mode (machine_mode mode, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+              as == ADDR_SPACE_SEG_FS ||
+              as == ADDR_SPACE_SEG_GS);
+  return targetm.valid_pointer_mode (mode);
+}
+
+/* Like ix86_legitimate_address_p, except with named addresses. */
+static bool
+ix86_addr_space_legitimate_address_p (machine_mode mode, rtx x,
+                                      bool reg_ok_strict, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+              as == ADDR_SPACE_SEG_FS ||
+              as == ADDR_SPACE_SEG_GS);
+  return ix86_legitimate_address_p (mode, x, reg_ok_strict);
+}
+
+/* Named address space version of LEGITIMIZE_ADDRESS. */
+static rtx
+ix86_addr_space_legitimize_address (rtx x, rtx oldx,
+                                    machine_mode mode, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+              as == ADDR_SPACE_SEG_FS ||
+              as == ADDR_SPACE_SEG_GS);
+  return ix86_legitimize_address (x, oldx, mode);
+}
+
+/* The default, SEG_FS and SEG_GS address spaces are all "subsets" of
+   each other. */
+bool static
+ix86_addr_space_subset_p (addr_space_t subset, addr_space_t superset)
+{
+  gcc_assert (subset == ADDR_SPACE_GENERIC ||
+              subset == ADDR_SPACE_SEG_FS ||
+              subset == ADDR_SPACE_SEG_GS);
+  gcc_assert (superset == ADDR_SPACE_GENERIC ||
+              superset == ADDR_SPACE_SEG_FS ||
+              superset == ADDR_SPACE_SEG_GS);
+  return true;
+}
+
+/* Convert from one address space to another: it is a no-op.
+   It is the C code's responsibility to write sensible casts. */
+static rtx
+ix86_addr_space_convert (rtx op, tree from_type, tree to_type)
+{
+  addr_space_t from_as = TYPE_ADDR_SPACE (TREE_TYPE (from_type));
+  addr_space_t to_as = TYPE_ADDR_SPACE (TREE_TYPE (to_type));
+
+  gcc_assert (from_as == ADDR_SPACE_GENERIC ||
+              from_as == ADDR_SPACE_SEG_FS ||
+              from_as == ADDR_SPACE_SEG_GS);
+  gcc_assert (to_as == ADDR_SPACE_GENERIC ||
+              to_as == ADDR_SPACE_SEG_FS ||
+              to_as == ADDR_SPACE_SEG_GS);
+
+  return op;
+}
+
+#undef TARGET_ADDR_SPACE_POINTER_MODE
+#define TARGET_ADDR_SPACE_POINTER_MODE ix86_addr_space_pointer_mode
+
+#undef TARGET_ADDR_SPACE_ADDRESS_MODE
+#define TARGET_ADDR_SPACE_ADDRESS_MODE ix86_addr_space_address_mode
+
+#undef TARGET_ADDR_SPACE_VALID_POINTER_MODE
+#define TARGET_ADDR_SPACE_VALID_POINTER_MODE ix86_addr_space_valid_pointer_mode
+
+#undef TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P
+#define TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P \
+  ix86_addr_space_legitimate_address_p
+
+#undef TARGET_ADDR_SPACE_LEGITIMIZE_ADDRESS
+#define TARGET_ADDR_SPACE_LEGITIMIZE_ADDRESS \
+  ix86_addr_space_legitimize_address
+
+#undef TARGET_ADDR_SPACE_SUBSET_P
+#define TARGET_ADDR_SPACE_SUBSET_P ix86_addr_space_subset_p
+
+#undef TARGET_ADDR_SPACE_CONVERT
+#define TARGET_ADDR_SPACE_CONVERT ix86_addr_space_convert
+
+/***** </AR> *****/
+
+
 /* Initialize the GCC target structure. */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
Index: gcc/config/i386/i386.h
===================================================================
--- gcc/config/i386/i386.h (revision 223859)
+++ gcc/config/i386/i386.h (working copy)
@@ -2568,6 +2568,11 @@
 /* For switching between functions with different target attributes. */
 #define SWITCHABLE_TARGET 1
 
+enum {
+  ADDR_SPACE_SEG_FS = 1,
+  ADDR_SPACE_SEG_GS = 2
+};
+
 /*
 Local variables:
 version-control: t
Index: gcc/config/i386/i386-c.c
===================================================================
--- gcc/config/i386/i386-c.c (revision 223859)
+++ gcc/config/i386/i386-c.c (working copy)
@@ -572,6 +572,9 @@
                                ix86_tune,
                                ix86_fpmath,
                                cpp_define);
+
+  cpp_define (parse_in, "__SEG_FS");
+  cpp_define (parse_in, "__SEG_GS");
 }
 
 
@@ -586,6 +589,9 @@
   /* Update pragma hook to allow parsing #pragma GCC target. */
   targetm.target_option.pragma_parse = ix86_pragma_target_parse;
 
+  c_register_addr_space ("__seg_fs", ADDR_SPACE_SEG_FS);
+  c_register_addr_space ("__seg_gs", ADDR_SPACE_SEG_GS);
+
 #ifdef REGISTER_SUBTARGET_PRAGMAS
   REGISTER_SUBTARGET_PRAGMAS ();
 #endif
typedef __seg_gs struct foo_s {
    int a[20];
} foo_t;

int sum1(foo_t *p)
{
    int i, total=0;
    for (i=0; i<20; i++)
        total += p->a[i];    // <= the %gs: prefix is correctly inserted
    return total;
}

int sum2(void)
{
    foo_t *p = (foo_t *)0x1234;
    int i, total=0;
    for (i=0; i<20; i++)
        total += p->a[i];    // <= this memory read is missing %gs:
    return total;
}
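For comparison, the value that sum2() is meant to return can be written in plain C with the segment base added by hand. This is a sketch only: sum_with_explicit_base and its parameters are hypothetical names, with "base" standing in for whatever %gs points to and "offset" for the pointer constant (0x1234 in bug1.c).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FOO_LEN = 20 };

/* Hypothetical reference version: sums the FOO_LEN ints stored at
   base + offset, which is what a %gs-relative read of p->a[i] should
   access when %gs points to "base" and p holds "offset". */
int sum_with_explicit_base(const char *base, uintptr_t offset)
{
    int a[FOO_LEN];
    int total = 0;

    assert(base != NULL);
    /* memcpy avoids any alignment assumption about base + offset. */
    memcpy(a, base + offset, sizeof a);
    for (int i = 0; i < FOO_LEN; i++)
        total += a[i];
    return total;
}
```

A miscompiled sum2() that drops the "%gs:" prefix would instead read from the absolute address 0x1234, i.e. behave as if "base" were zero.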
Attachment: bug1.s (binary data)