I don't know whether this is worth it. I would need to spend a lot more time 
with the simulator and oprofile. But once I had the idea I had to write it 
down. And if somebody decides to build a non-PIC ABI, this is definitely worth 
it.

I mentioned before that I've been wondering about the tiny number of TLB
entries on the low-end MIPS-based routers. Otherwise-decent ar71xx routers like
the DIR-825 have 16 JTLB entries to cover 64M of RAM. The DIR-601-A1 has 32M,
and it's selling for $13[1] refurbished on newegg.com right now.

The 3.3.8 kernel on these devices seems to see ~47 cycles of latency per TLB
miss.[2] These devices have 64k icache/32k dcache; cache misses are ~42 cycles.
Because of how jumpy PIC code is, I think we can start missing the TLB before
missing the cache....

One no-code solution is to turn on 16kB pages. Unlike 64kB pages, 16k at least
boots on qemu. Looking at /proc/self/maps for a shell, the number of TLB
entries needed to cover the whole process would go from 167 pairs to 54.[3] The
reason it's not 4:1 is granularity; there are costs to separating libc.so from
libm, libgcc_s, and libcrypt. Which reminds me: it might be worth it to do some
really cheap profiling to let gcc separate hot functions from cold ones.
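
For reference, 16 kB pages on MIPS are just the kernel's page-size choice in
Kconfig, roughly (option names from arch/mips/Kconfig):

# CONFIG_PAGE_SIZE_4KB is not set
CONFIG_PAGE_SIZE_16KB=y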

Just about every process has libc and libgcc_s mapped, and many will have libm.
Busybox is a good proportion of the workload too. Counting the purely read-only
pages, I see 127 page pairs out of that 167 (or 36 out of 54) which could be
mapped by a single large TLB entry.[4]

There's existing infrastructure in the Linux kernel for manually mapping huge
pages. Unfortunately, on MIPS it only works on 64-bit kernels. But it's easy to
take a power-of-two aligned chunk of physmem and just slam it into every
process's address space at some aligned virtual address. That's just adding a
wired TLB entry with the Global bit set, and marking that range as not
available to mmap.
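
Here's a rough, untested sketch of that kernel bit, assuming add_wired_entry()
from arch/mips/mm/tlb-r4k.c; the segment address, size, and physmem chunk are
made up, and the EntryLo/PageMask encodings are spelled out by hand rather than
relying on any particular kernel version's macro names:

#include <linux/init.h>
#include <asm/tlbmisc.h>    /* add_wired_entry(); older trees declare it elsewhere */

#define GSEG_VADDR    0x70000000UL    /* hypothetical fixed virtual base */
#define GSEG_PADDR    0x02000000UL    /* hypothetical reserved physmem chunk */
#define GSEG_PGSIZE   0x00100000UL    /* 1M per EntryLo => the pair covers 2M */
#define GSEG_PAGEMASK 0x001fe000UL    /* PageMask encoding for 1M pages */

/* EntryLo = PFN | cache attribute | Valid | Global; leaving the Dirty bit
 * clear keeps the mapping read-only. */
static unsigned long gseg_entrylo(unsigned long paddr)
{
    return (paddr >> 6)    /* PFN field: paddr >> 12, shifted left by 6 */
           | (3 << 3)      /* C: cacheable, noncoherent */
           | (1 << 1)      /* V */
           | (1 << 0);     /* G: match every ASID */
}

static void __init map_global_readonly_segment(void)
{
    add_wired_entry(gseg_entrylo(GSEG_PADDR),
                    gseg_entrylo(GSEG_PADDR + GSEG_PGSIZE),
                    GSEG_VADDR, GSEG_PAGEMASK);
    /* Still need to keep [GSEG_VADDR, GSEG_VADDR + 2*GSEG_PGSIZE) out of
     * whatever arch_get_unmapped_area() will hand back to mmap. */
}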

This does not necessarily interfere with the basic shared library scheme. ELF 
PIC code does need per-process initialized data, and data including the GOT 
needs to be at a fixed distance from the code segment. But nothing says they 
have to be adjacent, just reachable via PC-relative addressing. A linker append 
script[5] can just push the data segment 2M away. The address space would look 
like:

++++++
main program
heap
...
normal shared libraries
....
=== fixed global readonly segment ===
libc.text
...
libm.text
...
libbusybox.text
=== end global readonly segment ===
...
libc.data = libc.text + 2M
...
libm.data = libm.text + 2M
...
libbusybox.data = libbusybox.text + 2M
....
stack
++++++

The primary change required would be to teach ld.so about the global segment
and the objects present in it. When ld.so went to load /lib/libc.so, it would
notice that the hard read-only libc.text segment was already present at
address x, skip mapping libc.text, and keep x as the vma offset for the rest of
/lib/libc.so's segments, which would be mmapped per process as usual.
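
A hypothetical sketch of that lookup (the table and the names are invented for
illustration; the real ld.so data structures differ):

#include <string.h>

/* One entry per object whose text lives in the global read-only segment;
 * the table would be published by whatever populates the segment at boot. */
struct gseg_entry {
    const char *soname;          /* e.g. "libc.so.0" */
    unsigned long text_vaddr;    /* where the wired mapping holds its text */
};

/* Returns the preloaded text address, or 0 to fall back to mmapping the
 * text segment normally. */
static unsigned long gseg_lookup(const char *soname,
                                 const struct gseg_entry *tab, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (strcmp(tab[i].soname, soname) == 0)
            return tab[i].text_vaddr;
    return 0;
}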

Note that a squashfs containing libraries with these 2M gaps should function as 
normal if the normal ld.so is used. (Well, the libraries will eat up 2M more 
virtual address space each, but it's just PROT_NONE mappings.) The global 
segment can be positioned randomly on boot, subject to alignment constraints, 
and the position of individual text segments can be shuffled. Although you keep 
per-boot ASLR, you do lose per-process ASLR.

The segment is read-only; how would you get anything in there? My guess is that
the global segment could be read/write until ld.so launches the "real" init.
The /lib filesystem is available, and if the global segment were read-write at
that point, ld.so could position the text segments normally, although it'd have
to memcpy them into place instead of mmapping them.
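
A rough sketch of that one-time population step, assuming a hypothetical
gseg_alloc() that hands out the next aligned slot inside the still-writable
global segment (error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void *gseg_alloc(const char *soname, size_t len);    /* hypothetical */

/* Copy one library's text PT_LOAD into the global segment.  Runs once per
 * library, before the segment is sealed read-only and the real init starts. */
static void place_text(const char *path, off_t text_off, size_t text_len)
{
    int fd = open(path, O_RDONLY);
    void *src = mmap(NULL, text_len, PROT_READ, MAP_PRIVATE, fd, text_off);
    void *dst = gseg_alloc(path, text_len);

    memcpy(dst, src, text_len);    /* copy: the target isn't file-backed */
    munmap(src, text_len);
    close(fd);
}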

OK, I'm done. Need to get back to Real Work....

Jay

[1]: Well, the DIR-601 is $13 plus a heatsink. I think they have some serious 
overheating issues--which would explain why there are so many refurbished ones. 
I had one in my basement, and when the weather turned cold it stopped locking 
up. Or maybe it was some 12.09 fix....

[2]: There are non-architected 4-entry micro-ITLB and DTLBs; a micro-TLB miss
that hits in the JTLB eats a cycle.

[3]: For mips16 busybox it's 153->148 pairs, or 54->48 pairs. 

[4]: And for mips16 busybox there are 108 read-only pairs, or 30 pairs with 16k 
pages.

[5]: gcc -shared -T offset.ld, where offset.ld is:

SECTIONS
{
   . = . + 0x00200000;
   .fakedata : { *(.fakedata) }
}
INSERT AFTER .exception_ranges;
