One additional detail: we also did filestore testing using Jewel and saw substantially similar results to those on Kraken.
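For context on the pre-splitting that comes up below: filestore splits a PG's on-disk directory once a subdirectory holds more than roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 files. A minimal sketch of the relevant ceph.conf section, using stock Kraken-era defaults rather than anything we have specifically tuned:

    [osd]
    # defaults: a directory splits at about 2 * 10 * 16 = 320 files
    filestore split multiple = 2
    filestore merge threshold = 10
    # a negative merge threshold disables merging; combined with
    # expected_num_objects at pool-creation time it triggers the pre-split

Passing an expected object count to "ceph osd pool create", as in the command quoted below, is what asks for the split to happen up front.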
On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdin...@gmail.com> wrote:
> Hello Ceph-users,
>
> Florian has been helping with some issues we've been experiencing on our proof-of-concept cluster. Thanks for the replies so far. I wanted to jump in with some extra details.
>
> All of our testing has been with scrubbing turned off, to remove that as a factor.
>
> Our use case requires a Ceph cluster to indefinitely store ~10 billion files 20-60KB in size. We'll begin with 4 billion files migrated from a legacy storage system. Ongoing writes will be handled by ~10 client machines and come in at a fairly steady 10-20 million files/day. Every file (excluding the legacy 4 billion) will be read once by a single client within hours of its initial write to the cluster. Future file read requests will come from a single server and follow a long-tail distribution, with popular files read thousands of times a year but most read never or virtually never.
>
> Our "production" design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with SSD journals at a 1:4 ratio to HDDs. Each node looks like this:
>
> - 2 x E5-2660 8-core Xeons
> - 64GB RAM DDR-3 PC1600
> - 10Gb ceph-internal network (SFP+)
> - LSI 9210-8i controller (IT mode)
> - 4 x OSD 8TB HDDs, mix of two types
>   - Seagate ST8000DM002
>   - HGST HDN728080ALE604
> - Mount options = xfs (rw,noatime,attr2,inode64,noquota)
> - 1 x SSD journal Intel 200GB DC S3700
>
> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a replication level of 2. We're using rados bench to shotgun a lot of files into our test pools, specifically following these two steps:
>
> ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
> rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
>
> We leave the bench running for days at a time and watch the objects-in-cluster count. We see performance that starts off decent and degrades over time. There's a very brief initial surge in write performance, after which things settle into the downward-trending pattern.
>
> 1st hour - 2 million objects/hour
> 20th hour - 1.9 million objects/hour
> 40th hour - 1.7 million objects/hour
>
> This performance is not encouraging for us. We need to be writing 40 million objects per day (20 million files, times two replicas). The rates we're seeing at the 40th hour of our bench would be sufficient to achieve that. Those write rates are still falling, though, and we're only at a fraction of the number of objects in cluster that we need to handle. So the trend in performance suggests we shouldn't count on having the write performance we need for long.
>
> If we repeat the process of creating a new pool and running the bench, the same pattern holds: good initial performance that gradually degrades.
>
> https://postimg.org/image/ovymk7n2d/
> [caption: 90 million objects written to a brand new, pre-split pool (poolofhopes). There are already 330 million objects on the cluster in other pools.]
>
> Our working theory is that the degradation over time may be related to inode or dentry lookups that miss cache and lead to additional disk reads and seek activity. There's a suggestion that filestore directory splitting may exacerbate that problem, as additional/longer disk seeks occur depending on what's in which XFS allocation group.
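One way to sanity-check the inode/dentry theory on an OSD node while a bench runs is to watch the kernel's dentry and xfs_inode slabs. A rough sketch using generic Linux tooling, nothing Ceph-specific; the cache-pressure value is only an example, not a recommendation:

    # dentry counts, total vs unused (no root needed)
    watch -n 10 cat /proc/sys/fs/dentry-state
    # slab usage for dentries and XFS inodes (needs root)
    sudo slabtop -o | egrep 'dentry|xfs_inode'
    # make the kernel more reluctant to drop dentries/inodes (default is 100)
    sudo sysctl -w vm.vfs_cache_pressure=50

If those caches plateau while the object count keeps climbing, lookups that miss cache, and the extra seeks they cause, become steadily more likely.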
> We have found pre-split pools useful in one major way: they avoid the periods of near-zero write performance that we have put down to the active splitting of directories (the "thundering herd" effect). The overall downward curve seems to remain the same whether we pre-split or not.
>
> The thundering herd seems to be kept in check by an appropriate pre-split. Bluestore may or may not be a solution, but uncertainty about its stability within our fairly tight timeline doesn't recommend it to us. Right now our big question is "how can we avoid the gradual degradation in write performance over time?".
>
> Thank you, Patrick
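For anyone wanting to reproduce the objects/hour figures quoted above, this is roughly how the objects-in-cluster count can be sampled alongside the bench; the pool name and the one-hour interval are just examples:

    POOL=poolofhopes
    while true; do
        date
        rados df | grep "$POOL"   # the objects column is the figure to track
        sleep 3600
    done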