One additional detail: we also did filestore testing on Jewel and saw
substantially similar results to those on Kraken.

On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdin...@gmail.com> wrote:

> Hello Ceph-users,
>
> Florian has been helping with some issues on our proof-of-concept cluster,
> where we've been experiencing these issues. Thanks for the replies so far.
> I wanted to jump in with some extra details.
>
> All of our testing has been with scrubbing turned off, to remove that as a
> factor.
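>
> (For completeness: by "scrubbing turned off" we mean the usual cluster-wide
> flags stay set for the duration of each test, i.e. roughly:
>
> ceph osd set noscrub
> ceph osd set nodeep-scrub
>
> so neither regular nor deep scrubbing runs while the benchmarks do.)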
>
> Our use case requires a Ceph cluster to indefinitely store ~10 billion
> files 20-60KB in size. We’ll begin with 4 billion files migrated from a
> legacy storage system. Ongoing writes will be handled by ~10 client
> machines and come in at a fairly steady 10-20 million files/day. Every file
> (excluding the legacy 4 billion) will be read once by a single client
> within hours of its initial write to the cluster. Future read requests
> will come from a single server and follow a long-tail distribution: popular
> files will be read thousands of times a year, but most will be read rarely
> or never.
>
> Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs),
> with SSD journals at a 1:4 SSD:HDD ratio. Each node looks like this:
>
>    - 2 x E5-2660 8-core Xeons
>    - 64GB RAM DDR-3 PC1600
>    - 10Gb ceph-internal network (SFP+)
>    - LSI 9210-8i controller (IT mode)
>    - 4 x 8TB OSD HDDs, a mix of two types:
>       - Seagate ST8000DM002
>       - HGST HDN728080ALE604
>    - OSD mount options = xfs (rw,noatime,attr2,inode64,noquota)
>    - 1 x SSD journal: Intel 200GB DC S3700
>
>
> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with
> replication level 2. We’re using rados bench to shotgun a lot of files into
> our test pools, specifically following these two steps:
> ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
> rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
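>
> For anyone reproducing this, our reading of those arguments (which may be
> off in the details): 2048 is both pg_num and pgp_num, the empty "" is a
> placeholder for the erasure-code profile so that the rule name and the
> expected_num_objects of 500,000,000 land in the right positional slots, and
> the bench writes 20,000-byte objects with 32 concurrent ops for an
> effectively unlimited run, leaving the objects in place afterwards.
> 500,000,000 objects across 2048 PGs works out to roughly 244,000 objects
> per PG, which is what the pre-split is sized for. A rough sanity check that
> the pre-split actually happened (default filestore paths; <osd-id> and
> <pgid> are placeholders):
>
> ls -d /var/lib/ceph/osd/ceph-<osd-id>/current/<pgid>_head/DIR_*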
>
> We leave the bench running for days at a time and watch the
> objects-in-cluster count. We see performance that starts off decent and
> degrades over time. There’s a very brief initial surge in write
> performance, after which things settle into a downward-trending pattern.
>
> 1st hour - 2 million objects/hour
> 20th hour - 1.9 million objects/hour
> 40th hour - 1.7 million objects/hour
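>
> (Those figures are hourly deltas of the objects-in-cluster count; one way
> to sample it, and roughly all we do, is:
>
> while true; do date; ceph -s | grep -i objects; sleep 3600; done
>
> then diff consecutive samples to get objects/hour.)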
>
> This performance is not encouraging for us. We need to be writing 40
> million objects per day (20 million files at two replicas each). The rates
> we’re seeing at the 40th hour of our bench would be sufficient to achieve
> that. Those write rates are still falling, though, and we’re only at a
> fraction of the number of objects in cluster that we will need to handle.
> So, the trend in performance suggests we shouldn’t count on having the
> write performance we need for much longer.
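>
> (The arithmetic behind that: 40,000,000 objects/day ÷ 24 h ≈ 1.67 million
> objects/hour, so the ~1.7 million objects/hour we see at the 40th hour only
> just clears the bar, and only if it stops falling.)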
>
> If we repeat the process of creating a new pool and running the bench, the
> same pattern holds: good initial performance that gradually degrades.
>
> https://postimg.org/image/ovymk7n2d/
> [caption:90 million objects written to a brand new, pre-split pool
> (poolofhopes). There are already 330 million objects on the cluster in
> other pools.]
>
> Our working theory is that the degradation over time may be related to
> inode or dentry lookups that miss the cache and lead to additional disk
> reads and seek activity. There’s a suggestion that filestore directory
> splitting may exacerbate that problem, as additional/longer disk seeks
> occur depending on what lands in which XFS allocation group. We have found
> pre-split pools useful in one major way: they avoid the periods of
> near-zero write performance that we have put down to the active splitting
> of directories (the "thundering herd" effect). The overall downward curve
> seems to remain the same whether we pre-split or not.
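>
> For reference, the filestore settings we understand to govern this (values
> shown are the stock defaults, not our tuning, and the formula is our
> reading of the docs):
>
> [osd]
> # a subdirectory splits once it holds more than roughly
> #   filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects
> # (320 with the defaults below); pre-splitting at pool creation is, as we
> # understand it, triggered by expected_num_objects together with a negative
> # filestore merge threshold
> filestore merge threshold = 10
> filestore split multiple = 2
>
> With ~244,000 objects expected per PG and ~320 objects per leaf directory,
> each PG ends up a few directory levels deep, which is where we suspect the
> extra seeks come from.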
>
> The thundering herd seems to be kept in check by an appropriate pre-split.
> Bluestore may or may not be a solution, but questions about its maturity
> and stability, given our fairly tight timeline, don't recommend it to us.
> Right now our big question is "how can we avoid the gradual degradation in
> write performance over time?".
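>
> If anyone wants to poke at the cache-miss theory with us, the kind of
> signals that should show it on the OSD nodes are roughly these (our guess
> at what is informative, not a definitive method):
>
> # how much RAM the dentry and XFS inode caches currently occupy
> sudo slabtop -o | grep -E 'dentry|xfs_inode'
> # how aggressively the kernel reclaims those caches (kernel default is 100)
> cat /proc/sys/vm/vfs_cache_pressure
> # per-disk read activity and latencies while the bench runs
> iostat -x 5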
>
> Thank you, Patrick
>
>
>