On Mon, Apr 30, 2018 at 10:57 PM Wido den Hollander <w...@42on.com> wrote:
> On 04/30/2018 10:25 PM, Gregory Farnum wrote:
> > On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander <w...@42on.com> wrote:
> >
> >     Hi,
> >
> >     I've been investigating the per-object overhead for BlueStore, as I've
> >     seen this has become a topic for a lot of people who want to store a
> >     lot of small objects in Ceph using BlueStore.
> >
> >     I've written a piece of Python code which can be run on a server
> >     running OSDs and will print the overhead.
> >
> >     https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> >
> >     Feedback on this script is welcome, but also the output of what
> >     people are observing.
> >
> >     The results from my tests are below, but what I see is that the
> >     overhead seems to range from 10kB to 30kB per object.
> >
> >     On RBD-only clusters the overhead seems to be around 11kB, but on
> >     clusters with an RGW workload the overhead goes higher, to 20kB.
> >
> > This change seems implausible, as RGW always writes full objects, whereas
> > RBD will frequently write pieces of them and do overwrites.
> > I'm not sure what knobs are available and which diagnostics BlueStore
> > exports, but is it possible you're looking at the total RocksDB data
> > store rather than the per-object overhead? The distinction here being
> > that the RocksDB instance will also store "client" (i.e., RGW) omap data
> > and xattrs, in addition to the actual BlueStore onodes.
>
> Yes, that is possible. But in the end, the number of onodes is the number
> of objects you store, and then you want to know how many bytes the RocksDB
> database uses.
>
> I do agree that RGW doesn't do partial writes and has more metadata, but
> eventually that all has to be stored.
>
> We just need to come up with some good numbers on how to size the DB.

Ah yeah, this makes sense if you're trying to size for the DB partitions. I
just don't want people to look at it and go "RADOS + BlueStore require 30KB
per object!?!?!?" ;)

(And in a similar vein, the RGW-imposed overhead will depend a great deal on
the object names you use; they can get pretty large and have to get written
down in a few different places...)
-Greg

> Currently I assume a 10GB:1TB ratio and that is working out, but with
> people wanting to use 12TB disks we need to drill those numbers down even
> more. Otherwise you will need a lot of SSD space if you want to store the
> DB on SSD.
>
> Wido
>
> > -Greg
> >
> >     I know that partial overwrites and appends contribute to higher
> >     overhead on objects, and I'm trying to investigate this and share my
> >     information with the community.
> >
> >     I have two use-cases that want to store >2 billion objects with an
> >     avg object size of 50kB (8 - 80kB), and the RocksDB overhead is
> >     likely to become a big problem.
> >
> >     Anybody willing to share the overhead they are seeing with what
> >     use-case?
> >
> >     The more data we have on this the better we can estimate how DBs
> >     need to be sized for BlueStore deployments.
> >
> >     Wido
> >
> >     # Cluster #1
> >     osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529 overhead=12254
> >     osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002 overhead=10996
> >     osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645 overhead=12255
> >     osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453 overhead=12858
> >     osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883 overhead=10589
> >     osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928 overhead=10162
> >     osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715 overhead=11687
> >
> >     # Cluster #2
> >     osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992 overhead_per_obj=12508
> >     osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248 overhead_per_obj=13473
> >     osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150 overhead_per_obj=12924
> >     osd.2 onodes=185757 db_used_bytes=3567255552 avg_obj_size=5359974 overhead_per_obj=19203
> >     osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679 overhead_per_obj=15257
> >     osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323 overhead_per_obj=13261
> >     osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527 overhead_per_obj=11551
> >     osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688 overhead_per_obj=14215
> >     osd.9 onodes=195166 db_used_bytes=2538602496 avg_obj_size=6237672 overhead_per_obj=13007
> >     osd.8 onodes=203946 db_used_bytes=3279945728 avg_obj_size=6523555 overhead_per_obj=16082
> >
> >     # Cluster #3
> >     osd.133 onodes=68558 db_used_bytes=15868100608 avg_obj_size=14743206 overhead_per_obj=231455
> >     osd.132 onodes=60164 db_used_bytes=13911457792 avg_obj_size=14539445 overhead_per_obj=231225
> >     osd.137 onodes=62259 db_used_bytes=15597568000 avg_obj_size=15138484 overhead_per_obj=250527
> >     osd.136 onodes=70361 db_used_bytes=14540603392 avg_obj_size=13729154 overhead_per_obj=206657
> >     osd.135 onodes=68003 db_used_bytes=12285116416 avg_obj_size=12877744 overhead_per_obj=180655
> >     osd.134 onodes=64962 db_used_bytes=14056161280 avg_obj_size=15923550 overhead_per_obj=216375
> >     osd.139 onodes=68016 db_used_bytes=20782776320 avg_obj_size=13619345 overhead_per_obj=305557
> >     osd.138 onodes=66209 db_used_bytes=12850298880 avg_obj_size=14593418 overhead_per_obj=194086
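For reference, a minimal sketch of the kind of per-OSD calculation that
produces numbers like those above. This is not the linked gist itself: the
admin-socket path and the counter names (bluefs/db_used_bytes,
bluestore/bluestore_onodes, bluestore/bluestore_allocated) are assumptions
based on Luminous-era "ceph daemon osd.N perf dump" output, so adjust them
to whatever your OSDs actually expose.

#!/usr/bin/env python
# Rough per-object RocksDB overhead per OSD, read from the admin socket.
# Assumes OSD admin sockets under /var/run/ceph/ and Luminous-era counter
# names; a sketch of the approach, not the gist linked above.

import glob
import json
import subprocess


def perf_dump(socket_path):
    # 'ceph daemon <asok> perf dump' prints the OSD's perf counters as JSON.
    out = subprocess.check_output(['ceph', 'daemon', socket_path,
                                   'perf', 'dump'])
    return json.loads(out.decode('utf-8'))


def main():
    for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
        osd_id = sock.split('.')[-2]          # ceph-osd.25.asok -> 25
        perf = perf_dump(sock)

        onodes = perf['bluestore']['bluestore_onodes']
        db_used = perf['bluefs']['db_used_bytes']
        allocated = perf['bluestore']['bluestore_allocated']

        if onodes == 0:
            continue

        print('osd.%s onodes=%d db_used_bytes=%d avg_obj_size=%d '
              'overhead_per_obj=%d' % (osd_id, onodes, db_used,
                                       allocated // onodes,
                                       db_used // onodes))


if __name__ == '__main__':
    main()

With avg_obj_size printed next to the overhead, it is easy to see whether an
OSD sits closer to the ~11kB RBD-style overhead or the 20kB+ RGW-style
overhead discussed above.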
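To put the sizing discussion into numbers, a quick back-of-the-envelope
sketch comparing the 10GB:1TB rule of thumb with the per-object overhead
reported in this thread. The inputs (a 12TB data disk filled entirely with
~50kB objects, 20kB of DB space per object) are illustrative assumptions
pulled from the figures above, not recommendations.

# Back-of-the-envelope DB sizing, using the figures from this thread as
# illustrative inputs (assumptions, not recommendations).

KiB = 1024
GiB = 1024 ** 3
TiB = 1024 ** 4

disk_size = 12 * TiB          # the 12TB-drive case mentioned above
avg_obj_size = 50 * KiB       # the ~50kB-average-object use-case
overhead_per_obj = 20 * KiB   # mid-range of the observed 10-30kB overhead

objects_per_osd = disk_size // avg_obj_size
db_needed = objects_per_osd * overhead_per_obj

# The 10GB-of-DB-per-1TB-of-data rule of thumb, for comparison.
rule_of_thumb = 12 * 10 * GiB

print('objects per OSD   : %d' % objects_per_osd)
print('DB space needed   : %.0f GiB' % (db_needed / float(GiB)))
print('10GB:1TB estimate : %.0f GiB' % (rule_of_thumb / float(GiB)))

Under these assumptions the DB requirement comes out in the terabyte range
while the 10GB:1TB ratio suggests about 120 GiB, which is exactly the concern
raised above: for small-object workloads the DB size is driven by object
count and per-object overhead, not by raw capacity.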