I don't know whether this is worth it. I would need to spend a lot more time with the simulator and oprofile. But once I had the idea I had to write it down. And if somebody decides to build a non-PIC ABI, this is definitely worth it.
I mentioned before that I'm wondering about the tiny number of TLB entries on the low-end MIPS-based routers. Otherwise-decent ar71xx routers like the DIR-825 have 16 JTLB entries to cover 64M of RAM. The DIR-601-A1 has 32M, and it's selling for $13[1] refurbished on newegg.com right now. The 3.3.8 kernel on these devices seems to see ~47 cycles of latency per TLB miss.[2] These devices have 64k icache/32k dcache; cache misses are ~42 cycles. Because of how jumpy PIC is, I think we can start missing TLB before missing the cache....

One no-code solution is to turn on 16kB pages. Unlike 64kB pages, 16k at least boots on qemu. Looking at /proc/self/maps for a shell, the number of TLB entries needed to cover the whole process would go from 167 pairs to 54.[3] The reason it's not 4:1 is granularity; there are costs to separating libc.so from libm, libgcc_s, and libcrypt. Which reminds me: it might be worth doing some really cheap profiling to let gcc separate hot functions from cold ones.

Just about every process has libc and libgcc_s mapped, and many will have libm. Busybox is a good proportion of the workload too. Counting the purely read-only pages, I see 127 page pairs out of that 167 (or 36 out of 54)[4] which could be mapped by a single large TLB entry.

There's existing infrastructure in the Linux kernel for manually mapping huge pages. Unfortunately, on MIPS it only works on 64-bit kernels. But it's easy to take a power-of-two-aligned chunk of physmem and just slam it into every process's address space at some aligned virtual address. That's just adding a WIRED TLB entry with the Global bit set, and marking that range as unavailable to mmap.

This does not necessarily interfere with the basic shared library scheme. ELF PIC code does need per-process initialized data, and data including the GOT needs to be at a fixed distance from the code segment. But nothing says they have to be adjacent, just reachable via PC-relative addressing.
A linker append script[5] can just push the data segment 2M away. The address space would look like:

    main program
    heap
    ... normal shared libraries ...
    === fixed global read-only segment ===
    libc.text ... libm.text ... libbusybox.text
    === end global read-only segment ===
    libc.data       = libc.text + 2M
    libm.data       = libm.text + 2M
    libbusybox.data = libbusybox.text + 2M
    ...
    stack

The primary change required would be to teach ld.so about the global segment and the objects present in it. When ld.so started to load /lib/libc.so, it would notice that the hard read-only libc.text segment was already present at address x, skip mapping libc.text, but keep x as the vma offset for the rest of /lib/libc.so's segments, which would be mmapped per process as usual.

Note that a squashfs containing libraries with these 2M gaps should function as normal if the stock ld.so is used. (Well, the libraries will eat up 2M more virtual address space each, but it's just PROT_NONE mappings.)

The global segment can be positioned randomly on boot, subject to alignment constraints, and the positions of the individual text segments can be shuffled. So although you keep per-boot ASLR, you do lose per-process ASLR.

The segment is read-only; how would you get anything in there? My guess is that the global segment could stay read/write until ld.so launches the "real" init. The /lib filesystem is available at that point, and if the global segment were still read-write, ld.so could position the text segments normally, although it'd have to memcpy them into place instead of mmapping them.

OK, I'm done. Need to get back to Real Work....

Jay

[1]: Well, the DIR-601 is $13 plus a heatsink. I think they have some serious overheating issues--which would explain why there are so many refurbished ones. I had one in my basement, and when the weather turned cold it stopped locking up. Or maybe it was some 12.09 fix....
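The ld.so change could look something like the sketch below. Everything here is hypothetical: the descriptor table, its addresses, and the function name are invented for illustration, not real uClibc/glibc internals; a real implementation would need some way for ld.so to discover what the boot-time loader put in the global segment.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one text segment preloaded into the
 * boot-time global read-only segment. */
struct global_text {
    const char   *name;   /* e.g. "libc.so.0" */
    unsigned long vaddr;  /* where its .text already sits */
};

/* Made-up example contents; a real table would be built at boot. */
static const struct global_text global_segment[] = {
    { "libc.so.0", 0x70000000UL },
    { "libm.so.0", 0x70200000UL },
    { NULL, 0 },
};

/* If the library's text is already globally mapped, return its address
 * so the caller can skip the text mmap and use it as the load bias;
 * the data segment (text + 2M in the layout above) is still mmapped
 * privately per process as usual. */
unsigned long global_segment_lookup(const char *soname)
{
    for (const struct global_text *g = global_segment; g->name; g++)
        if (strcmp(g->name, soname) == 0)
            return g->vaddr;
    return 0;  /* not preloaded: fall back to the normal mmap path */
}
```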
[2]: There are non-architected 4-entry micro-ITLBs and -DTLBs; a micro-TLB miss that is present in the JTLB eats a cycle.

[3]: For mips16 busybox it's 153->148 pairs, or 54->48 pairs.

[4]: And for mips16 busybox there are 108 read-only pairs, or 30 pairs with 16k pages.

[5]: gcc -shared -T offset.ld, where offset.ld is:

    SECTIONS
    {
      . = . + 0x00200000;
      .fakedata : { *(.fakedata) }
    }
    INSERT AFTER .exception_ranges;

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel