the new auth server, which uses the fs as its root rather than
a stand-alone fs, happens to be faster than our now-old
cpu server, so i did a quick build test with a kernel including
the massive-fw myricom driver.  suspecting that latency kills
even on 10gbe, i tried a second build with NPROC=24.  a
table comparing ken fs, fossil+venti, and ramfs follows.
unfortunately, i was not able to use the same system for the
fossil+venti tests, but there's a ramfs test on each system
to put the large differences in processor generation, network,
&c. into perspective.  here's an example test:

        tyty; echo $NPROC
        4
        tyty; time mk>/dev/null && mk clean>/dev/null
        2.93u 1.30s 3.36r        mk
        tyty; NPROC=24 time mk >/dev/null && mk clean>/dev/null
        1.32u 0.22s 2.29r        mk
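
to repeat the runs, a small rc loop over both NPROC settings
is enough; this is just a sketch of the same mk/mk clean
cycle shown above:

        # one timed build at each level of parallelism
        for(n in 4 24){
                NPROC=$n time mk >/dev/null
                mk clean >/dev/null
        }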

and here are the compiled results:

a       Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
        4 active cores (8 threads; 4 enabled);
        http://ark.intel.com/Product.aspx?id=35365
        intel 82598 10gbe nic; fs has myricom 10gbe nic; 54µs latency
b       Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
        4 active cores (4 threads; 4 enabled);
        http://www.intel.com/p/en_US/products/server/processor/xeon5000/specifications
        intel 82563-style gbe nic; 70µs latency

mach    fs      nproc   time (u=user, s=sys, r=real, seconds)
a       ken     4       2.93u 1.30s 3.36r        mk
                24      1.32u 0.22s 2.29r        mk
        ramfs   4       3.10u 1.67s 3.01r        mk
                24      2.98u 1.23s 2.42r        mk
b       venti   4       2.65u 3.44s 21.46r       mk
                24      2.98u 3.56s 21.58r       mk
        ramfs   4       3.55u 2.22s 9.08r        mk
                24      3.50u 2.67s 9.41r        mk

it's interesting that neither venti nor ramfs gets any faster
on machine b with NPROC set to 24, but both get
faster on machine a, and the fastest time of all is not
ramfs, but ken's fs with NPROC=24.  so i suppose the
64-bit question is: is that because moving data in and
out of user space is slower than 10gbe, or because ramfs
is single threaded and slow?
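
as a back-of-envelope check on the latency theory (charging
the whole venti/ramfs gap on machine b to round-trip waits,
which surely overstates it, since fossil+venti do real work
too):

        ; hoc -e '(21.46-9.08)/70e-6'

that's roughly 177,000 serialized 70µs round trips' worth of
time, so even a fraction of that many synchronous 9p rpcs
would be visible at this scale.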

in any event, it's clear that if the fs is good, latency
can kill even on a 10gbe lan.  it would naively seem to me
that using the Tstream model would be too expensive,
requiring thousands of new streams and modifications to at
least 8c, 8l, mk, rc, and awk (what am i forgetting?).  but
it would be worth a test.

- erik
