On Mon, Apr 30, 2018 at 10:57 PM Wido den Hollander <w...@42on.com> wrote:
> On 04/30/2018 10:25 PM, Gregory Farnum wrote:
> > On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander <w...@42on.com> wrote:
> >
> >     Hi,
> >
> >     I've been investigating the per-object overhead for BlueStore, as I've
> >     seen this has become a topic for a lot of people who want to store a
> >     lot of small objects in Ceph using BlueStore.
> >
> >     I've written a piece of Python code which can be run on a server
> >     running OSDs and will print the overhead.
> >
> >     https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> >
> >     Feedback on this script is welcome, but also the output of what
> >     people are observing.
> >
> >     The results from my tests are below, but what I see is that the
> >     overhead seems to range from 10kB to 30kB per object.
> >
> >     On RBD-only clusters the overhead seems to be around 11kB, but on
> >     clusters with an RGW workload the overhead goes higher, to 20kB.
> >
> > This change seems implausible, as RGW always writes full objects, whereas
> > RBD will frequently write pieces of them and do overwrites.
> > I'm not sure what knobs are available and which diagnostics BlueStore
> > exports, but is it possible you're looking at the total RocksDB data
> > store rather than the per-object overhead? The distinction here being
> > that the RocksDB instance will also store "client" (i.e., RGW) omap data
> > and xattrs, in addition to the actual BlueStore onodes.
>
> Yes, that is possible. But in the end, the number of onodes is the number
> of objects you store, and then you want to know how many bytes the RocksDB
> database uses.
>
> I do agree that RGW doesn't do partial writes and has more metadata, but
> eventually that all has to be stored.
>
> We just need to come up with some good numbers on how to size the DB.

Ah yeah, this makes sense if you're trying to size for the DB partitions. I
just don't want people to look at it and go "RADOS + BlueStore require 30KB
per object!?!?!?" ;)

(And in a similar vein, the RGW-imposed overhead will depend a great deal on
the object names you use; they can get pretty large and have to get written
down in a few different places...)
-Greg

> Currently I assume a 10GB:1TB ratio and that is working out, but with
> people wanting to use 12TB disks we need to drill those numbers down even
> more. Otherwise you will need a lot of SSD space if you want to store the
> DB on SSD.
>
> Wido
>
> > -Greg
> >
> >     I know that partial overwrites and appends contribute to higher
> >     overhead on objects, and I'm trying to investigate this and share my
> >     information with the community.
> >
> >     I have two use-cases that want to store >2 billion objects with an
> >     avg object size of 50kB (8 - 80kB), and the RocksDB overhead is
> >     likely to become a big problem.
> >
> >     Anybody willing to share the overhead they are seeing with what
> >     use-case?
> >
> >     The more data we have on this the better we can estimate how DBs
> >     need to be sized for BlueStore deployments.
> >
> >     Wido
> >
> >     # Cluster #1
> >     osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529 overhead=12254
> >     osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002 overhead=10996
> >     osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645 overhead=12255
> >     osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453 overhead=12858
> >     osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883 overhead=10589
> >     osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928 overhead=10162
> >     osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715 overhead=11687
> >
> >     # Cluster #2
> >     osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992 overhead_per_obj=12508
> >     osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248 overhead_per_obj=13473
> >     osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150 overhead_per_obj=12924
> >     osd.2 onodes=185757 db_used_bytes=3567255552 avg_obj_size=5359974 overhead_per_obj=19203
> >     osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679 overhead_per_obj=15257
> >     osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323 overhead_per_obj=13261
> >     osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527 overhead_per_obj=11551
> >     osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688 overhead_per_obj=14215
> >     osd.9 onodes=195166 db_used_bytes=2538602496 avg_obj_size=6237672 overhead_per_obj=13007
> >     osd.8 onodes=203946 db_used_bytes=3279945728 avg_obj_size=6523555 overhead_per_obj=16082
> >
> >     # Cluster #3
> >     osd.133 onodes=68558 db_used_bytes=15868100608 avg_obj_size=14743206 overhead_per_obj=231455
> >     osd.132 onodes=60164 db_used_bytes=13911457792 avg_obj_size=14539445 overhead_per_obj=231225
> >     osd.137 onodes=62259 db_used_bytes=15597568000 avg_obj_size=15138484 overhead_per_obj=250527
> >     osd.136 onodes=70361 db_used_bytes=14540603392 avg_obj_size=13729154 overhead_per_obj=206657
> >     osd.135 onodes=68003 db_used_bytes=12285116416 avg_obj_size=12877744 overhead_per_obj=180655
> >     osd.134 onodes=64962 db_used_bytes=14056161280 avg_obj_size=15923550 overhead_per_obj=216375
> >     osd.139 onodes=68016 db_used_bytes=20782776320 avg_obj_size=13619345 overhead_per_obj=305557
> >     osd.138 onodes=66209 db_used_bytes=12850298880 avg_obj_size=14593418 overhead_per_obj=194086
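For reference, a minimal sketch of the kind of per-OSD calculation that
produces numbers like those above. This is not the linked gist itself: the
admin-socket path and the counter names (bluefs/db_used_bytes,
bluestore/bluestore_onodes, bluestore/bluestore_allocated) are assumptions
based on Luminous-era "ceph daemon osd.N perf dump" output, so adjust them
to whatever your OSDs actually expose.

#!/usr/bin/env python
# Rough per-object RocksDB overhead per OSD, read from the admin socket.
# Assumes OSD admin sockets under /var/run/ceph/ and Luminous-era counter
# names; a sketch of the approach, not the gist linked above.

import glob
import json
import subprocess


def perf_dump(socket_path):
    # 'ceph daemon <asok> perf dump' prints the OSD's perf counters as JSON.
    out = subprocess.check_output(['ceph', 'daemon', socket_path,
                                   'perf', 'dump'])
    return json.loads(out.decode('utf-8'))


def main():
    for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
        osd_id = sock.split('.')[-2]          # ceph-osd.25.asok -> 25
        perf = perf_dump(sock)

        onodes = perf['bluestore']['bluestore_onodes']
        db_used = perf['bluefs']['db_used_bytes']
        allocated = perf['bluestore']['bluestore_allocated']

        if onodes == 0:
            continue

        print('osd.%s onodes=%d db_used_bytes=%d avg_obj_size=%d '
              'overhead_per_obj=%d' % (osd_id, onodes, db_used,
                                       allocated // onodes,
                                       db_used // onodes))


if __name__ == '__main__':
    main()

With avg_obj_size printed next to the overhead, it is easy to see whether an
OSD sits closer to the ~11kB RBD-style overhead or the 20kB+ RGW-style
overhead discussed above.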
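To put the sizing discussion into numbers, a quick back-of-the-envelope
sketch comparing the 10GB:1TB rule of thumb with the per-object overhead
reported in this thread. The inputs (a 12TB data disk filled entirely with
~50kB objects, 20kB of DB space per object) are illustrative assumptions
pulled from the figures above, not recommendations.

# Back-of-the-envelope DB sizing, using the figures from this thread as
# illustrative inputs (assumptions, not recommendations).

KiB = 1024
GiB = 1024 ** 3
TiB = 1024 ** 4

disk_size = 12 * TiB          # the 12TB-drive case mentioned above
avg_obj_size = 50 * KiB       # the ~50kB-average-object use-case
overhead_per_obj = 20 * KiB   # mid-range of the observed 10-30kB overhead

objects_per_osd = disk_size // avg_obj_size
db_needed = objects_per_osd * overhead_per_obj

# The 10GB-of-DB-per-1TB-of-data rule of thumb, for comparison.
rule_of_thumb = 12 * 10 * GiB

print('objects per OSD   : %d' % objects_per_osd)
print('DB space needed   : %.0f GiB' % (db_needed / float(GiB)))
print('10GB:1TB estimate : %.0f GiB' % (rule_of_thumb / float(GiB)))

Under these assumptions the DB requirement comes out in the terabyte range
while the 10GB:1TB ratio suggests about 120 GiB, which is exactly the concern
raised above: for small-object workloads the DB size is driven by object
count and per-object overhead, not by raw capacity.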