I haven't done measurements of this in years, but...
I'll wager that compression is memory bound, not CPU bound, for today's
servers. A system with low latency and high bandwidth memory will perform
well (UltraSPARC-T1). Threading may not help much on systems with a single
memory interface, but should help some on systems with multiple memory
interfaces (UltraSPARC-*, Opteron, Athlon FX, etc.)
-- richard
A rather simple test using CSQamp.pkg from the cooltools download site. There's nothing magical about this file - it just happens to be a largish file that I had on hand.
$ time gzip -c < CSQamp.pkg > /dev/null
V40z:
real 0m15.339s
user 0m14.534s
sys 0m0.485s
V240:
real 0m35.825s
user 0m35.335s
sys 0m0.284s
T2000:
real 1m33.669s
user 1m32.768s
sys 0m0.881s
If I do 8 gzips in parallel:
V40z:
time ~/scripts/pgzip
real 0m32.632s
user 1m53.382s
sys 0m1.653s
V240:
time ~/scripts/pgzip
real 2m24.704s
user 4m42.430s
sys 0m2.305s
T2000:
time ~/scripts/pgzip
real 1m40.165s
user 13m10.475s
sys 0m6.578s
In each of the tests, the file was in /tmp. As expected, the V40z running 8 gzip processes (using 4 cores) took twice as long as it did running 1 (using 1 core). The V240 took 4 times as long (8 processes, 2 threads) as the original, and the T2000 ran 8 (8 processes, 8 cores) in just about the same amount of time as it ran 1.
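The ~/scripts/pgzip script itself wasn't posted; here's a minimal sketch of what such a script might look like (the function name and argument handling are my assumptions):

```shell
# Hypothetical reconstruction of a pgzip-style script: launch N gzip
# processes against the same input file and wait for all of them.
pgzip_like() {
    file=$1
    n=${2:-8}            # default to 8 parallel gzips, as in the test above
    i=0
    while [ "$i" -lt "$n" ]; do
        gzip -c < "$file" > /dev/null &
        i=$((i + 1))
    done
    wait                 # the slowest gzip determines elapsed time
}
```

Timing it with something like `time pgzip_like CSQamp.pkg 8` would reproduce the runs above.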
For giggles, I ran 32 processes on the T2000 and came up with 5m4.585s (real), 158m33.380s (user), and 42.484s (sys). In other words, with 32 gzip processes the T2000's elapsed time was about 3 times greater than with 8 processes. And while the elapsed time jumped by 3x, the sys time jumped by nearly 7x.
Here's a summary:
Server  gzips  Seconds   MB/sec
V40z        8   32.632   49.445
T2000      32  304.585   21.189
T2000       8  100.165   16.108
V40z        1   15.339   13.149
V240        8  144.704   11.150
V240        1   35.825    5.630
T2000       1   99.669    2.024
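The MB/sec column is consistent with throughput = (gzips x file size) / elapsed seconds for a file of roughly 201.7 MB (the actual size of CSQamp.pkg isn't stated in the post; that figure is inferred from the rows themselves). For example:

```shell
# Reproduce two rows of the MB/sec column.  file_mb = 201.7 is an
# inference from the table, not a number given in the post.
awk 'BEGIN {
    file_mb = 201.7
    printf "V40z, 8 gzips: %.1f MB/sec\n", 8 * file_mb / 32.632   # ~49.4
    printf "V40z, 1 gzip:  %.1f MB/sec\n", 1 * file_mb / 15.339   # ~13.1
}'
```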
Clearly, more threads doing compression with gzip give better performance than a single thread. How that breaks down into memory bandwidth vs. CPU speed, I am not sure. However, I can't help but think that if my file server compresses every data block it writes, it would be able to write more data - and I would come out ahead - if it used a thread (or more) per core.
I am a firm believer that the next generation of compression commands and libraries needs to use parallel algorithms. The simplest way to do this would be to divide the data into chunks and farm each chunk out to worker threads. This will likely come at some cost in compression efficiency, but in initial tests I have done, the difference in size is very small relative to the speedup achieved. Those initial tests were with a chunk of C code and zlib.
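As a sketch of the chunking approach using ordinary gzip (the function name and chunk size are my choices; this leans on the fact that concatenated gzip streams form a valid gzip stream):

```shell
# Chunk-parallel compression sketch: split the input into fixed-size
# chunks, gzip each chunk in a separate process, then concatenate the
# results.  Each chunk starts with an empty dictionary, which is where
# the small loss in compression ratio comes from.
chunked_gzip() {
    in=$1; out=$2
    dir=$(mktemp -d) || return 1
    split -b 1m "$in" "$dir/chunk."    # 1 MB chunks: chunk.aa, chunk.ab, ...
    for c in "$dir"/chunk.*; do
        gzip "$c" &                    # one worker process per chunk
    done
    wait
    cat "$dir"/chunk.*.gz > "$out"     # glob sorts, so chunk order is preserved
    rm -rf "$dir"
}
```

Because gzip members concatenate cleanly, `gunzip -c` decompresses the combined output as if it were a single stream, so the result is an ordinary .gz file.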
Mike
--
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss