Re: [ceph-users] RGW: Reshard index of non-master zones in multi-site

2019-02-06 Thread Iain Buclaw
On Tue, 5 Feb 2019 at 10:04, Iain Buclaw  wrote:
>
> On Tue, 5 Feb 2019 at 09:46, Iain Buclaw  wrote:
> >
> > Hi,
> >
> > Following the update of one secondary site from 12.2.8 to 12.2.11, the
> > following warning has come up.
> >
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool '.rgw.buckets.index'
> > Search the cluster log for 'Large omap object found' for more details.
> >
>
> [...]
>
> > Is this the reason why resharding hasn't propagated?
> >
>
> Furthermore, it looks like the index is in fact broken on the secondaries.
>
> On the master:
>
> # radosgw-admin bi get --bucket=mybucket --object=myobject
> {
> "type": "plain",
> "idx": "myobject",
> "entry": {
> "name": "myobject",
> "instance": "",
> "ver": {
> "pool": 28,
> "epoch": 8848
> },
> "locator": "",
> "exists": "true",
> "meta": {
> "category": 1,
> "size": 9200,
> "mtime": "2018-03-27 21:12:56.612172Z",
> "etag": "c365c324cda944d2c3b687c0785be735",
> "owner": "mybucket",
> "owner_display_name": "Bucket User",
> "content_type": "application/octet-stream",
> "accounted_size": 9194,
> "user_data": ""
> },
> "tag": "0ef1a91a-4aee-427e-bdf8-30589abb2d3e.36603989.137292",
> "flags": 0,
> "pending_map": [],
> "versioned_epoch": 0
> }
> }
>
>
> On the secondaries:
>
> # radosgw-admin bi get --bucket=mybucket --object=myobject
> ERROR: bi_get(): (2) No such file or directory
>
> How does one go about rectifying this mess?
>

A random blog post in a language I don't understand seems to allude to
using radosgw-admin bi put to restore backed-up indexes, but it does not
say under what circumstances you would use such a command.

https://cloud.tencent.com/developer/article/1032854

Would this be safe to run on secondaries?
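
For reference, my rough reading of what the blog describes, as a sketch
(the bi list half is read-only and should be harmless; the bi put flags
are my assumption, so check radosgw-admin's help output first):

# back up the bucket's current index entries (read-only):
radosgw-admin bi list --bucket=mybucket > mybucket-index-backup.json

# restore a single entry from a backup (the object name and --infile are
# placeholders/assumptions on my part):
radosgw-admin bi put --bucket=mybucket --object=myobject --infile=entry.json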

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Janne Johansson
Multicast traffic from storage has a point in things like the old Windows
provisioning software Ghost, where you could netboot a room full of
computers, have them listen to a multicast stream of the same data/image,
all apply it at the same time, and perhaps re-sync any missing pieces at
the end, which would be far less data overall than having each client ask
the server(s) for the same image over and over.
In the case of ceph, I would say it is much less probable that many
clients would ask for exactly the same data in the same order, so it would
just mean all clients hear all traffic (or at least more traffic than they
asked for) and need to skip past a lot of it.


Den tis 5 feb. 2019 kl 22:07 skrev Marc Roos :

>
>
> I am still testing with ceph mostly, so my apologies for bringing up
> something totally useless. But I just had a chat about compuverde
> storage. They seem to implement multicast in a scale out solution.
>
> I was wondering if there is any experience here with compuverde and how
> it compared to ceph. And maybe this multicast approach could be
> interesting to use with ceph?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Marc Roos


Yes indeed, but for OSDs writing the replication or erasure objects you
get a sort of parallel processing, no?



Multicast traffic from storage has a point in things like the old Windows
provisioning software Ghost, where you could netboot a room full of
computers, have them listen to a multicast stream of the same data/image,
all apply it at the same time, and perhaps re-sync any missing pieces at
the end, which would be far less data overall than having each client ask
the server(s) for the same image over and over.
In the case of ceph, I would say it is much less probable that many
clients would ask for exactly the same data in the same order, so it would
just mean all clients hear all traffic (or at least more traffic than they
asked for) and need to skip past a lot of it.


Den tis 5 feb. 2019 kl 22:07 skrev Marc Roos :




I am still testing with ceph mostly, so my apologies for bringing 
up 
something totally useless. But I just had a chat about compuverde 
storage. They seem to implement multicast in a scale out solution. 

I was wondering if there is any experience here with compuverde and 
how 
it compared to ceph. And maybe this multicast approach could be 
interesting to use with ceph?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

May the most significant bit of your life be positive.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Burkhard Linke

Hi,


we have a compuverde cluster, and AFAIK it uses multicast for node 
discovery, not for data distribution.



If you need more information, feel free to contact me either by email or 
via IRC (-> Be-El).



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Janne Johansson
For EC-coded stuff, at 10+4 with 13 others needing data apart from the
primary, they are specifically NOT getting the same data: they are getting
either one of the 10 data pieces or one of the 4 different parity chunks,
so it would be nasty to send the full data to all OSDs when each expects
only a 14th of it.


Den ons 6 feb. 2019 kl 10:14 skrev Marc Roos :

>
> Yes indeed, but for OSDs writing the replication or erasure objects you
> get a sort of parallel processing, no?
>
>
>
> Multicast traffic from storage has a point in things like the old Windows
> provisioning software Ghost, where you could netboot a room full of
> computers, have them listen to a multicast stream of the same data/image,
> all apply it at the same time, and perhaps re-sync any missing pieces at
> the end, which would be far less data overall than having each client ask
> the server(s) for the same image over and over.
> In the case of ceph, I would say it is much less probable that many
> clients would ask for exactly the same data in the same order, so it would
> just mean all clients hear all traffic (or at least more traffic than they
> asked for) and need to skip past a lot of it.
>
>
> Den tis 5 feb. 2019 kl 22:07 skrev Marc Roos :
>
>
>
>
> I am still testing with ceph mostly, so my apologies for bringing
> up
> something totally useless. But I just had a chat about compuverde
> storage. They seem to implement multicast in a scale out solution.
>
> I was wondering if there is any experience here with compuverde
> and
> how
> it compared to ceph. And maybe this multicast approach could be
> interesting to use with ceph?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> May the most significant bit of your life be positive.
>
>
>
>

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-06 Thread M Ranga Swami Reddy
Hello - Are there any limits on mon_data_size for a cluster with 2PB
(with 2000+ OSDs)?

Currently it is set to 15G. What is the logic behind this? Can we increase
it when we get the mon_data_size_warn messages?

I am getting the mon_data_size_warn message even though there is ample
free space on the disk (around 300G free).

Earlier thread on the same discussion:
https://www.spinics.net/lists/ceph-users/msg42456.html

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multicast communication compuverde

2019-02-06 Thread Maged Mokhtar



On 06/02/2019 11:14, Marc Roos wrote:

Yes indeed, but for OSDs writing the replication or erasure objects you
get a sort of parallel processing, no?



Multicast traffic from storage has a point in things like the old Windows
provisioning software Ghost, where you could netboot a room full of
computers, have them listen to a multicast stream of the same data/image,
all apply it at the same time, and perhaps re-sync any missing pieces at
the end, which would be far less data overall than having each client ask
the server(s) for the same image over and over.
In the case of ceph, I would say it is much less probable that many
clients would ask for exactly the same data in the same order, so it would
just mean all clients hear all traffic (or at least more traffic than they
asked for) and need to skip past a lot of it.


Den tis 5 feb. 2019 kl 22:07 skrev Marc Roos :




I am still testing with ceph mostly, so my apologies for bringing
up
something totally useless. But I just had a chat about compuverde
storage. They seem to implement multicast in a scale out solution.

I was wondering if there is any experience here with compuverde and
how
it compared to ceph. And maybe this multicast approach could be
interesting to use with ceph?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




It could be used for sending cluster maps or other configuration in a
push model; I believe corosync uses this by default. For use in sending
actual data during write ops, a primary OSD could send to its replicas:
they would not have to process all traffic but could listen on a specific
group address associated with that PG, which could be an increment from a
defined base multicast address. Some additional erasure codes and
acknowledgment messages would need to be added to account for
errors/dropped packets. I doubt it would give an appreciable boost given
that most pools use 3 replicas in total; additionally, there could be
issues getting multicast working correctly, like setting up IGMP, so all
in all it could be a hassle.


/Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd and image striping

2019-02-06 Thread James Dingwall
Hi,

I have been doing some testing with striped rbd images and have a
question about the calculation of the optimal_io_size and
minimum_io_size parameters.  My test image was created using a 4M object
size, stripe unit 64k and stripe count 16.
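
For reference, the image was created with something along these lines
(image name and size here are just placeholders):

# create a test image with 4M objects, striped 64K x 16
rbd create testimg --size 100G --object-size 4M --stripe-unit 64K --stripe-count 16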

In the kernel rbd_init_disk() code:

unsigned int objset_bytes =
 rbd_dev->layout.object_size * rbd_dev->layout.stripe_count;

 blk_queue_io_min(q, objset_bytes);
 blk_queue_io_opt(q, objset_bytes);

Which resulted in 64M minimal / optimal io sizes.  If I understand the
meaning correctly, then even for a small write there is going to be at
least 64M of data written?

My use case is a ceph cluster (13.2.4) hosting rbd images for VMs
running on Xen.  The rbd volumes are mapped to dom0 and then passed
through to the guest using standard blkback/blkfront drivers.

I am doing a bit of testing with different stripe unit sizes but keeping
object size * count = 4M.  Does anyone have any experience finding
optimal rbd parameters for this scenario?
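
For what it's worth, the values the kernel ends up exporting can be read
back from sysfs for the mapped device (assuming it shows up as rbd0):

# what the rbd driver reported to the block layer
cat /sys/block/rbd0/queue/minimum_io_size
cat /sys/block/rbd0/queue/optimal_io_size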

Thanks,
James
Zynstra is a private limited company registered in England and Wales 
(registered number 07864369). Our registered office and Headquarters are at The 
Innovation Centre, Broad Quay, Bath, BA1 1UD. This email, its contents and any 
attachments are confidential. If you have received this message in error please 
delete it from your system and advise the sender immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph dashboard cert documentation bug?

2019-02-06 Thread Junk
I was trying to set my mimic dashboard cert using the instructions
from 

http://docs.ceph.com/docs/mimic/mgr/dashboard/

and I'm pretty sure the lines


$ ceph config-key set mgr mgr/dashboard/crt -i dashboard.crt
$ ceph config-key set mgr mgr/dashboard/key -i dashboard.key

should be

$ ceph config-key set mgr/dashboard/crt -i dashboard.crt
$ ceph config-key set mgr/dashboard/key -i dashboard.key

Can anyone confirm?
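
For what it's worth, one way to see which key path actually got stored is
to list the keys (I believe config-key ls is available on mimic):

$ ceph config-key ls | grep dashboard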


-- 
Junk 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd and image striping

2019-02-06 Thread Ilya Dryomov
On Wed, Feb 6, 2019 at 11:09 AM James Dingwall
 wrote:
>
> Hi,
>
> I have been doing some testing with striped rbd images and have a
> question about the calculation of the optimal_io_size and
> minimum_io_size parameters.  My test image was created using a 4M object
> size, stripe unit 64k and stripe count 16.
>
> In the kernel rbd_init_disk() code:
>
> unsigned int objset_bytes =
>  rbd_dev->layout.object_size * rbd_dev->layout.stripe_count;
>
>  blk_queue_io_min(q, objset_bytes);
>  blk_queue_io_opt(q, objset_bytes);
>
> Which resulted in 64M minimal / optimal io sizes.  If I understand the
> meaning correctly then even for a small write there is going to be at
> least 64M data written?

No, these are just hints.  The exported values are pretty stupid even
in the default case, and more so in the custom striping case, and should
be changed.  It's certainly not the case that any write is going to be
turned into an io_min- or io_opt-sized write.

>
> My use case is a ceph cluster (13.2.4) hosting rbd images for VMs
> running on Xen.  The rbd volumes are mapped to dom0 and then passed
> through to the guest using standard blkback/blkfront drivers.
>
> I am doing a bit of testing with different stripe unit sizes but keeping
> object size * count = 4M.  Does anyone have any experience finding
> optimal rbd parameters for this scenario?

I'd recommend focusing on the client side performance numbers for the
expected workload(s), not io_min/io_opt or object size * count target.
su = 64k and sc = 16 means that a 1M request will need responses from
up to 16 OSDs at once, which is probably not what you want unless you
have a small sequential write workload (where a custom striping layout
can prove very useful).

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-06 Thread Sage Weil
Hi Swami

The limit is somewhat arbitrary, based on cluster sizes we had seen when 
we picked it.  In your case it should be perfectly safe to increase it.
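
For example, something like this should do it (the value is in bytes, and
whether injectargs takes effect without a mon restart may vary, so treat
it as a sketch):

    # warn at 30 GiB instead of the default 15 GiB, at runtime
    ceph tell mon.* injectargs '--mon_data_size_warn=32212254720'

    # or persist it in ceph.conf on the mon hosts
    [mon]
        mon data size warn = 32212254720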

sage


On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:

> Hello - Are there any limits on mon_data_size for a cluster with 2PB
> (with 2000+ OSDs)?
> 
> Currently it is set to 15G. What is the logic behind this? Can we increase
> it when we get the mon_data_size_warn messages?
> 
> I am getting the mon_data_size_warn message even though there is ample
> free space on the disk (around 300G free).
> 
> Earlier thread on the same discussion:
> https://www.spinics.net/lists/ceph-users/msg42456.html
> 
> Thanks
> Swami
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-06 Thread Dan van der Ster
Hi,

With HEALTH_OK a mon data dir should be under 2GB for even such a large cluster.

During backfilling scenarios, the mons keep old maps and grow quite
quickly. So if you have balancing, pg splitting, etc. ongoing for
a while, the mon stores will eventually trigger that 15GB alarm.
But the intended behavior is that once the PGs are all active+clean,
the old maps should be trimmed and the disk space freed.

However, several people have noted that (at least in luminous
releases) the old maps are not trimmed until after HEALTH_OK *and* all
mons are restarted. This ticket seems related:
http://tracker.ceph.com/issues/37875

(Over here we're restarting mons every ~2-3 weeks, resulting in the
mon stores dropping from >15GB to ~700MB each time).
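
For the curious, the check/restart cycle is roughly this, assuming
systemd-managed mons and the default store path (a sketch, not a recipe):

    # check the store size on each mon host
    du -sh /var/lib/ceph/mon/*/store.db

    # restart one mon at a time, confirming quorum before moving on
    systemctl restart ceph-mon@$(hostname -s)
    ceph quorum_status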

-- Dan


On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
>
> Hi Swami
>
> The limit is somewhat arbitrary, based on cluster sizes we had seen when
> we picked it.  In your case it should be perfectly safe to increase it.
>
> sage
>
>
> On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
>
> > Hello - Are there any limits on mon_data_size for a cluster with 2PB
> > (with 2000+ OSDs)?
> >
> > Currently it is set to 15G. What is the logic behind this? Can we increase
> > it when we get the mon_data_size_warn messages?
> >
> > I am getting the mon_data_size_warn message even though there is ample
> > free space on the disk (around 300G free).
> >
> > Earlier thread on the same discussion:
> > https://www.spinics.net/lists/ceph-users/msg42456.html
> >
> > Thanks
> > Swami
> >
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help with upmap feature on luminous

2019-02-06 Thread Dan van der Ster
Note that there are some improved upmap balancer heuristics in
development here: https://github.com/ceph/ceph/pull/26187
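
Also, for the offline osdmaptool route mentioned below, you should be able
to restrict the optimization to the hdd-backed pool instead of trying to
exclude the three "10k" OSDs; roughly (the pool name is a placeholder):

    # grab the current osdmap and compute upmaps for a single pool only
    ceph osd getmap -o om
    osdmaptool om --upmap out.txt --upmap-pool my-hdd-pool --upmap-max 100
    # review out.txt before applying it, e.g. with: source out.txt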

-- dan

On Tue, Feb 5, 2019 at 10:18 PM Kári Bertilsson  wrote:
>
> Hello
>
> I previously enabled upmap and used automatic balancing with "ceph balancer
> on". I got very good results and OSDs ended up with perfectly distributed
> PGs.
>
> Now, after adding several new OSDs, auto balancing does not seem to be
> working anymore. OSDs have 30-50% usage where previously all had almost the
> same %.
>
> I turned off auto balancer and tried manually running a plan
>
> # ceph balancer reset
> # ceph balancer optimize myplan
> # ceph balancer show myplan
> ceph osd pg-upmap-items 41.1 106 125 95 121 84 34 36 99 72 126
> ceph osd pg-upmap-items 41.5 12 121 65 3 122 52 5 126
> ceph osd pg-upmap-items 41.b 117 99 65 125
> ceph osd pg-upmap-items 41.c 49 121 81 131
> ceph osd pg-upmap-items 41.e 61 82 73 52 122 46 84 118
> ceph osd pg-upmap-items 41.f 71 127 15 121 56 82
> ceph osd pg-upmap-items 41.12 81 92
> ceph osd pg-upmap-items 41.17 35 127 71 44
> ceph osd pg-upmap-items 41.19 81 131 21 119 18 52
> ceph osd pg-upmap-items 41.25 18 52 37 125 40 3 41 34 71 127 4 128
>
>
> After running this plan there's no difference and still a huge imbalance on
> the OSDs. Creating a new plan gives the same plan again.
>
> # ceph balancer eval
> current cluster score 0.015162 (lower is better)
>
> Balancer eval shows quite a low number, so it seems to think the PG
> distribution is already optimized?
>
> Since I'm not getting this working again, I looked into the offline
> optimization at http://docs.ceph.com/docs/mimic/rados/operations/upmap/
>
> I have 2 pools.
> A replicated pool using 3 OSDs with the "10k" device class.
> The remaining OSDs have the "hdd" device class.
>
> The resulting out.txt creates a much larger plan, but would map a lot of PGs
> to the "10k" OSDs (where they should not be). And I can't seem to find any
> way to exclude these 3 OSDs.
>
> Any ideas on how to proceed?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Orchestration weekly meeting location change

2019-02-06 Thread Mike Perez
Hey all,

The Orchestration weekly team meeting on Mondays at 16:00 UTC has a
new meeting location. The BlueJeans URL has changed so we can start
recording the meetings. Please see instructions below. The event also
has updated information:

To join the meeting on a computer or mobile phone:
https://bluejeans.com/908675367?src=calendarLink

To join from a Red Hat Deskphone or Softphone, dial: 84336.

Connecting directly from a room system?

1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367

Just want to dial in on your phone?

1.) Dial one of the following numbers:
 408-915-6466 (US)

See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367

3.) Press #


Want to test your video connection?
https://bluejeans.com/111

--
Mike Perez (thingee)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using Cephfs Snapshots in Luminous

2019-02-06 Thread Nicolas Huillard
Le lundi 12 novembre 2018 à 15:31 +0100, Marc Roos a écrit :
> > > is anybody using cephfs with snapshots on luminous? Cephfs
> > > snapshots are declared stable in mimic, but I'd like to know
> > > about the risks using them on luminous. Do I risk a complete
> > > cephfs failure or just some not working snapshots? It is one
> > > namespace, one fs, one data and one metadata pool.
> > > 
> > 
> > For luminous, snapshots in a single-MDS setup basically work.
> > But snapshots are completely broken in a multi-MDS setup.
> > 
> 
> Single active MDS only, no? And hardlinks are not supported with
> snapshots?

What's the final feeling on snapshots?
* Luminous 12.2.10 on Debian stretch
* ceph-fuse clients
* 1 active MDS, some standbys
* single FS, single namespace, no hardlinks
* will probably create nested snapshots, i.e. /1/.snaps/first and
/1/2/3/.snaps/nested
* will use the facility through VirtFS from within VMs, where ceph-fuse 
runs on the host server
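* will enable the feature the documented way, which if I read the luminous
docs right is something like this (assuming the fs is named "cephfs"):
  ceph fs set cephfs allow_new_snaps true --yes-i-really-mean-it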

What's the risk of using this experimental feature (as described in [1])?
* losing snapshots ?
* losing the main/last contents ?
* losing some directory trees, entire filesystem ?
* other ?

TIA,

[1] http://docs.ceph.com/docs/luminous/cephfs/experimental-features/#snapshots

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proxmox 4.4, Ceph hammer, OSD cache link...

2019-02-06 Thread Marco Gaiarin


I come back here.

> I've recently added a host to my ceph cluster, using proxmox 'helpers'
> to add OSD, eg:
> 
>   pveceph createosd /dev/sdb -journal_dev /dev/sda5
> 
> and now i've:
> 
>  root@blackpanther:~# ls -la /var/lib/ceph/osd/ceph-12
>  totale 60
>  drwxr-xr-x   3 root root   199 nov 21 17:02 .
>  drwxr-xr-x   6 root root  4096 nov 21 23:08 ..
>  -rw-r--r--   1 root root   903 nov 21 17:02 activate.monmap
>  -rw-r--r--   1 root root 3 nov 21 17:02 active
>  -rw-r--r--   1 root root37 nov 21 17:02 ceph_fsid
>  drwxr-xr-x 432 root root 12288 dic  1 18:21 current
>  -rw-r--r--   1 root root37 nov 21 17:02 fsid
>  lrwxrwxrwx   1 root root 9 nov 21 17:02 journal -> /dev/sda5
>  -rw---   1 root root57 nov 21 17:02 keyring
>  -rw-r--r--   1 root root21 nov 21 17:02 magic
>  -rw-r--r--   1 root root 6 nov 21 17:02 ready
>  -rw-r--r--   1 root root 4 nov 21 17:02 store_version
>  -rw-r--r--   1 root root53 nov 21 17:02 superblock
>  -rw-r--r--   1 root root 0 nov 21 17:02 sysvinit
>  -rw-r--r--   1 root root 3 nov 21 17:02 whoami
> 
> and all works as expected, except that I expected the journal link to point
> not to the device (/dev/sda5) but to the UUID (/dev/disk/by-uuid/).
> 
> But it seems that the cache partition does not have a UUID associated:
> 
>   root@blackpanther:~# ls -la /dev/disk/by-uuid/ | grep sda5
>   root@blackpanther:~# blkid /dev/sda5
>   /dev/sda5: PARTUUID="a222c6bf-05"
> 
> I'm a bit ''puzzled'' because if I have to add a disk ''before'' sda, all
> device names will change with, I suppose, unexpected results.
> 
> Am I missing something? Thanks.

I was forced to change some journals, using another partition (MBR); I
stopped the OSD, flushed the old journal, changed the symlink and then did
a 'journal format':

 root@deadpool:/var/lib/ceph/osd/ceph-6# ls -la
 totale 64
 drwxr-xr-x   3 root root   199 feb  6 17:45 .
 drwxr-xr-x   6 root root  4096 dic 14  2016 ..
 -rw-r--r--   1 root root   751 dic 14  2016 activate.monmap
 -rw-r--r--   1 root root 3 dic 14  2016 active
 -rw-r--r--   1 root root37 dic 14  2016 ceph_fsid
 drwxr-xr-x 378 root root 20480 feb  6 17:12 current
 -rw-r--r--   1 root root37 dic 14  2016 fsid
 lrwxrwxrwx   1 root root 9 feb  6 17:45 journal -> /dev/sda5
 -rw---   1 root root56 dic 14  2016 keyring
 -rw-r--r--   1 root root21 dic 14  2016 magic
 -rw-r--r--   1 root root 6 dic 14  2016 ready
 -rw-r--r--   1 root root 4 dic 14  2016 store_version
 -rw-r--r--   1 root root53 dic 14  2016 superblock
 -rw-r--r--   1 root root 0 feb  6 17:10 sysvinit
 -rw-r--r--   1 root root 2 dic 14  2016 whoami
 root@deadpool:/var/lib/ceph/osd/ceph-6# ceph-osd -i 6 --mkjournal
 2019-02-06 17:45:35.030359 7ff679c24880 -1 journal check: ondisk fsid 
---- doesn't match expected 
70357923-3227-4d57-980f-92b8c853dc76, invalid (someone else's?) journal
 2019-02-06 17:45:35.038522 7ff679c24880 -1 created new journal 
/var/lib/ceph/osd/ceph-6/journal for object store /var/lib/ceph/osd/ceph-6

Clearly I changed the journal partition by hand (i.e., a direct link), so
I expect that link to be 'direct to partition'; but, note the warning
about the fsid, there is still no 'id' associated with that partition
(e.g., no link in /dev/disk/by-*/).


If I rerun the 'mkjournal':

 root@deadpool:/var/lib/ceph/osd/ceph-6# ceph-osd -i 6 --mkjournal
 2019-02-06 17:45:37.621855 7f3391377880 -1 created new journal 
/var/lib/ceph/osd/ceph-6/journal for object store /var/lib/ceph/osd/ceph-6

So it seems that the journal partition effectively gets 'tagged' in some way.


But I'm still confused... does using an ID link for journal partitions work
only with GPT partitioning?


Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backfill_toofull after adding new OSDs

2019-02-06 Thread Brad Hubbard
Let's try to restrict discussion to the original thread
"backfill_toofull while OSDs are not full" and get a tracker opened up
for this issue.
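
As a side note, when checking whether the flag looks justified it helps to
compare against the configured ratios as well, e.g. (a quick sketch):

# the cluster-wide ratios that backfill_toofull is evaluated against
ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
ceph osd df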

On Sat, Feb 2, 2019 at 11:52 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> Right now, after adding OSD:
>
> # ceph health detail
> HEALTH_ERR 74197563/199392333 objects misplaced (37.212%); Degraded data 
> redundancy (low space): 1 pg backfill_toofull
> OBJECT_MISPLACED 74197563/199392333 objects misplaced (37.212%)
> PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
> pg 6.eb is active+remapped+backfill_wait+backfill_toofull, acting 
> [21,0,47]
>
> # ceph pg ls-by-pool iscsi backfill_toofull
> PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES  LOG  STATE 
>  STATE_STAMPVERSION   REPORTED   UP   
>   ACTING   SCRUB_STAMPDEEP_SCRUB_STAMP
> 6.eb 6450  1290   0 1645654016 3067 
> active+remapped+backfill_wait+backfill_toofull 2019-02-02 00:20:32.975300 
> 7208'6567 9790:16214 [5,1,21]p5 [21,0,47]p21 2019-01-18 04:13:54.280495 
> 2019-01-18 04:13:54.280495
>
> All OSDs have less than 40% USE.
>
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
>  0   hdd 9.56149  1.0 9.6 TiB 3.2 TiB 6.3 TiB 33.64 1.31 313
>  1   hdd 9.56149  1.0 9.6 TiB 3.3 TiB 6.3 TiB 34.13 1.33 295
>  5   hdd 9.56149  1.0 9.6 TiB 756 GiB 8.8 TiB  7.72 0.30 103
> 47   hdd 9.32390  1.0 9.3 TiB 3.1 TiB 6.2 TiB 33.75 1.31 306
>
> (all other OSDs are also below 40%)
>
> ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
>
> Maybe the developers will pay attention to the letter and say something?
>
> - Original Message -
> From: "Fyodor Ustinov" 
> To: "Caspar Smit" 
> Cc: "Jan Kasprzak" , "ceph-users" 
> Sent: Thursday, 31 January, 2019 16:50:24
> Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
>
> Hi!
>
> I saw the same thing several times when I added a new OSD to the cluster:
> one or two PGs in the "backfill_toofull" state.
>
> In all versions of mimic.
>
> - Original Message -
> From: "Caspar Smit" 
> To: "Jan Kasprzak" 
> Cc: "ceph-users" 
> Sent: Thursday, 31 January, 2019 15:43:07
> Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
>
> Hi Jan,
>
> You might be hitting the same issue as Wido here:
>
> https://www.spinics.net/lists/ceph-users/msg50603.html
>
> Kind regards,
> Caspar
>
> Op do 31 jan. 2019 om 14:36 schreef Jan Kasprzak :
>
>
> Hello, ceph users,
>
> I see the following HEALTH_ERR during cluster rebalance:
>
> Degraded data redundancy (low space): 8 pgs backfill_toofull
>
> Detailed description:
> I have upgraded my cluster to mimic and added 16 new bluestore OSDs
> on 4 hosts. The hosts are in a separate region in my crush map, and crush
> rules prevented data from being moved onto the new OSDs. Now I want to move
> all data to the new OSDs (and possibly decommission the old filestore OSDs).
> I have created the following rule:
>
> # ceph osd crush rule create-replicated on-newhosts newhostsroot host
>
> after this, I am slowly moving the pools one-by-one to this new rule:
>
> # ceph osd pool set test-hdd-pool crush_rule on-newhosts
>
> When I do this, I get the above error. This is misleading, because
> ceph osd df does not suggest the OSDs are getting full (the most full
> OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR
> disappears. Why am I getting this error?
>
> # ceph -s
> cluster:
> id: ...my UUID...
> health: HEALTH_ERR
> 1271/3803223 objects misplaced (0.033%)
> Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs 
> degraded, 67 pgs undersized
> Degraded data redundancy (low space): 8 pgs backfill_toofull
>
> services:
> mon: 3 daemons, quorum mon1,mon2,mon3
> mgr: mon2(active), standbys: mon1, mon3
> osd: 80 osds: 80 up, 80 in; 90 remapped pgs
> rgw: 1 daemon active
>
> data:
> pools: 13 pools, 5056 pgs
> objects: 1.27 M objects, 4.8 TiB
> usage: 15 TiB used, 208 TiB / 224 TiB avail
> pgs: 40124/3803223 objects degraded (1.055%)
> 1271/3803223 objects misplaced (0.033%)
> 4963 active+clean
> 41 active+recovery_wait+undersized+degraded+remapped
> 21 active+recovery_wait+undersized+degraded
> 17 active+remapped+backfill_wait
> 5 active+remapped+backfill_wait+backfill_toofull
> 3 active+remapped+backfill_toofull
> 2 active+recovering+undersized+remapped
> 2 active+recovering+undersized+degraded+remapped
> 1 active+clean+remapped
> 1 active+recovering+undersized+degraded
>
> io:
> client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
> recovery: 2.0 MiB/s, 92 objects/s
>
> Thanks for any hint,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak http://fi.muni.cz/ | fi.muni.cz ] - work | 
> [ http://yenya.net/ | yenya.net ] - private}> |
> | [ http://www.fi.muni.cz/~kas/ | http://www.fi.muni.cz/~kas/ ] GPG: 
> 4096R/A45477D5 |

[ceph-users] CephFS overwrite/truncate performance hit

2019-02-06 Thread Hector Martin
I'm seeing some interesting performance issues with file overwriting on 
CephFS.


Creating lots of files is fast:

for i in $(seq 1 1000); do
echo $i; echo test > a.$i
done

Deleting lots of files is fast:

rm a.*

As is creating them again.

However, repeatedly creating the same file over and over again is slow:

for i in $(seq 1 1000); do
echo $i; echo test > a
done

And it's still slow if the file is created with a new name and then 
moved over:


for i in $(seq 1 1000); do
echo $i; echo test > a.$i; mv a.$i a
done

While appending to a single file is really fast:

for i in $(seq 1 1000); do
echo $i; echo test >> a
done

As is repeatedly writing to offset 0:

for i in $(seq 1 1000); do
echo $i; echo $RANDOM | dd of=a bs=128 conv=notrunc
done

But truncating the file first slows it back down again:

for i in $(seq 1 1000); do
echo $i; truncate -s 0 a; echo test >> a
done

All of these things are reasonably fast on a local FS, of course. I'm 
using the kernel client (4.18) with Ceph 13.2.4, and the relevant CephFS 
data and metadata pools are rep-3 on HDDs. It seems to me that any 
operation that *reduces* a file's size for any given filename, or 
replaces it with another inode, has a large overhead.


I have an application that stores some flag data in a file, using the 
usual open/write/close/rename dance to atomically overwrite it, and this 
operation is currently the bottleneck (while doing a bunch of other 
processing on files on CephFS). I'm considering changing it to use an 
xattr to store the data instead, which seems like it should be atomic 
and performs a lot better:


for i in $(seq 1 1000); do
echo $i; setfattr -n user.foo -v "test$RANDOM" a
done

Alternatively, is there a more CephFS-friendly atomic overwrite pattern 
than the usual open/write/close/rename? Can it e.g. guarantee that a 
write at offset 0 of less than the page size is atomic? I could easily 
make the writes equal-sized and thus avoid truncations and remove the 
rename dance, if I can guarantee they're atomic.


Is there any documentation on what write operations incur significant 
overhead on CephFS like this, and why? This particular issue isn't 
mentioned in http://docs.ceph.com/docs/master/cephfs/app-best-practices/ 
(which seems like it mostly deals with reads, not writes).


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread jesper
Hi List

We are in the process of moving to the next use case for our ceph cluster.
Bulk, cheap, slow, erasure-coded cephfs storage was the first - and
that works fine.

We're currently on luminous / bluestore, if upgrading is deemed to
change what we're seeing then please let us know.

We have 6 OSD hosts, each with a single 1TB S4510 SSD, connected
through an H700 MegaRaid Perc BBWC (each disk as a RAID0) - with the
scheduler set to deadline, nomerges = 1, rotational = 0.

Each disk "should" give approximately 36K IOPS random write and the double
random read.

The pool is set up with 3x replication. We would like a "scaleout" setup of
well-performing SSD block devices - potentially to host databases and
things like that. I read through this nice document [0]; I know the
HW is radically different from mine, but I still think I'm in the
very low end of what 6 x S4510 should be capable of doing.

Since it is IOPS I care about, I have lowered the block size to 4096 -- a 4M
block size nicely saturates the NICs in both directions.


$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  16  5857  5841   22.8155   22.8164  0.00238437  0.00273434
2  15 11768 11753   22.9533   23.0938   0.0028559  0.00271944
3  16 17264 17248   22.4564   21.4648  0.0024  0.00278101
4  16 22857 22841   22.3037   21.84770.002716  0.00280023
5  16 28462 28446   22.2213   21.8945  0.002201860.002811
6  16 34216 34200   22.2635   22.4766  0.00234315  0.00280552
7  16 39616 39600   22.0962   21.0938  0.00290661  0.00282718
8  16 45510 45494   22.2118   23.0234   0.0033541  0.00281253
9  16 50995 50979   22.1243   21.4258  0.00267282  0.00282371
   10  16 56745 56729   22.1577   22.4609  0.00252583   0.0028193
Total time run: 10.002668
Total writes made:  56745
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 22.1601
Stddev Bandwidth:   0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:   5672
Stddev IOPS:182
Max IOPS:   5912
Min IOPS:   5400
Average Latency(s): 0.00281953
Stddev Latency(s):  0.00190771
Max latency(s): 0.0834767
Min latency(s): 0.00120945

Min latency is fine -- but Max latency of 83ms ?
Average IOPS @ 5672 ?

$ sudo rados bench -p scbench  10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  15 23329 23314   91.0537   91.0703 0.000349856 0.000679074
2  16 48555 48539   94.7884   98.5352 0.000499159 0.000652067
3  16 76193 76177   99.1747   107.961 0.000443877 0.000622775
4  15  103923  103908   101.459   108.324 0.000678589 0.000609182
5  15  132720  132705   103.663   112.488 0.000741734 0.000595998
6  15  161811  161796   105.323   113.637 0.000333166 0.000586323
7  15  190196  190181   106.115   110.879 0.000612227 0.000582014
8  15  221155  221140   107.966   120.934 0.000471219 0.000571944
9  16  251143  251127   108.984   117.137 0.000267528 0.000566659
Total time run:   10.000640
Total reads made: 282097
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   110.187
Average IOPS: 28207
Stddev IOPS:  2357
Max IOPS: 30959
Min IOPS: 23314
Average Latency(s):   0.000560402
Max latency(s):   0.109804
Min latency(s):   0.000212671

This is also quite far from expected. I have 12GB of memory for the OSD
daemon for caching on each host - close to an idle cluster - thus 50GB+ for
caching with a working set of < 6GB .. this should, in this case,
not really be bound by the underlying SSD. But if it were:

IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?

No measurable service time in iostat when running tests, thus I have
come to the conclusion that it has to be either the client side, the
network path, or the OSD daemon that delivers the increased latency /
decreased IOPS.

Are there any suggestions on how to get more insight into that?

Has anyone replicated close to the numbers Micron is reporting on NVMe?

Thanks a lot.

[0]
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread Brett Chancellor
This seems right. You are doing a single benchmark from a single client,
so your limiting factor will be the network latency. For most networks this
is between 0.2 and 0.3 ms.  If you're trying to test the potential of your
cluster, you'll need multiple workers and clients.

On Thu, Feb 7, 2019, 2:17 AM jes...@krogh.cc wrote:
> Hi List
>
> We are in the process of moving to the next use case for our ceph cluster
> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
> that works fine.
>
> We're currently on luminous / bluestore, if upgrading is deemed to
> change what we're seeing then please let us know.
>
> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. Connected
> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.
>
> Each disk "should" give approximately 36K IOPS random write and the double
> random read.
>
> Pool is set up with 3x replication. We would like a "scaleout" setup of
> well performing SSD block devices - potentially to host databases and
> things like that. I read through this nice document [0], I know the
> HW are radically different from mine, but I still think I'm in the
> very low end of what 6 x S4510 should be capable of doing.
>
> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
> blocksize nicely saturates the NIC's in both directions.
>
>
> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  16  5857  5841   22.8155   22.8164  0.00238437
> 0.00273434
> 2  15 11768 11753   22.9533   23.0938   0.0028559
> 0.00271944
> 3  16 17264 17248   22.4564   21.4648  0.0024
> 0.00278101
> 4  16 22857 22841   22.3037   21.84770.002716
> 0.00280023
> 5  16 28462 28446   22.2213   21.8945  0.00220186
> 0.002811
> 6  16 34216 34200   22.2635   22.4766  0.00234315
> 0.00280552
> 7  16 39616 39600   22.0962   21.0938  0.00290661
> 0.00282718
> 8  16 45510 45494   22.2118   23.0234   0.0033541
> 0.00281253
> 9  16 50995 50979   22.1243   21.4258  0.00267282
> 0.00282371
>10  16 56745 56729   22.1577   22.4609  0.00252583
>  0.0028193
> Total time run: 10.002668
> Total writes made:  56745
> Write size: 4096
> Object size:4096
> Bandwidth (MB/sec): 22.1601
> Stddev Bandwidth:   0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:   5672
> Stddev IOPS:182
> Max IOPS:   5912
> Min IOPS:   5400
> Average Latency(s): 0.00281953
> Stddev Latency(s):  0.00190771
> Max latency(s): 0.0834767
> Min latency(s): 0.00120945
>
> Min latency is fine -- but Max latency of 83ms ?
> Average IOPS @ 5672 ?
>
> $ sudo rados bench -p scbench  10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  15 23329 23314   91.0537   91.0703 0.000349856
> 0.000679074
> 2  16 48555 48539   94.7884   98.5352 0.000499159
> 0.000652067
> 3  16 76193 76177   99.1747   107.961 0.000443877
> 0.000622775
> 4  15103923103908   101.459   108.324 0.000678589
> 0.000609182
> 5  15132720132705   103.663   112.488 0.000741734
> 0.000595998
> 6  15161811161796   105.323   113.637 0.000333166
> 0.000586323
> 7  15190196190181   106.115   110.879 0.000612227
> 0.000582014
> 8  15221155221140   107.966   120.934 0.000471219
> 0.000571944
> 9  16251143251127   108.984   117.137 0.000267528
> 0.000566659
> Total time run:   10.000640
> Total reads made: 282097
> Read size:4096
> Object size:  4096
> Bandwidth (MB/sec):   110.187
> Average IOPS: 28207
> Stddev IOPS:  2357
> Max IOPS: 30959
> Min IOPS: 23314
> Average Latency(s):   0.000560402
> Max latency(s):   0.109804
> Min latency(s):   0.000212671
>
> This is also quite far from expected. I have 12GB of memory on the OSD
> daemon for caching on each host - close to idle cluster - thus 50GB+ for
> caching with a working set of < 6GB .. this should - in this case
> not really be bound by the underlying SSD. But if it were:
>
> IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?
>
> No measureable service time in iostat when running tests, thus I have
> come to the conclusion that it has to be either client side,

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread Christian Balzer
Hello,

On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote:

> Hi List
> 
> We are in the process of moving to the next use case for our ceph cluster
> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
> that works fine.
> 
> We're currently on luminous / bluestore, if upgrading is deemed to
> change what we're seeing then please let us know.
> 
> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. Connected
> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.
> 
I'd make sure that the endurance of these SSDs is in line with your
expected usage.

> Each disk "should" give approximately 36K IOPS random write and the double
> random read.
>
Only locally, latency is your enemy.

Tell us more about your network.

> Pool is set up with 3x replication. We would like a "scaleout" setup of
> well performing SSD block devices - potentially to host databases and
> things like that. I read through this nice document [0], I know the
> HW are radically different from mine, but I still think I'm in the
> very low end of what 6 x S4510 should be capable of doing.
> 
> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
> blocksize nicely saturates the NIC's in both directions.
> 
> 
rados bench is not the sharpest tool in the shed for this, as it needs
to allocate objects to begin with, amongst other things.

And before you go "fio with RBD engine", that had major issues in my
experience, too.
Your best and most realistic results will come from doing the testing
inside a VM (I presume from your use case) or a mounted RBD block device.

And then using fio, of course.
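
A sketch of what that could look like (the image name and job parameters
are placeholders, adjust to taste):

# map a test image and run a 4k random-write job against it
rbd map scbench/testimg
fio --name=4krandwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
    --time_based --group_reporting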

> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> 0   0 0 0 0 0   -   0
> 1  16  5857  5841   22.8155   22.8164  0.00238437  0.00273434
> 2  15 11768 11753   22.9533   23.0938   0.0028559  0.00271944
> 3  16 17264 17248   22.4564   21.4648  0.0024  0.00278101
> 4  16 22857 22841   22.3037   21.84770.002716  0.00280023
> 5  16 28462 28446   22.2213   21.8945  0.002201860.002811
> 6  16 34216 34200   22.2635   22.4766  0.00234315  0.00280552
> 7  16 39616 39600   22.0962   21.0938  0.00290661  0.00282718
> 8  16 45510 45494   22.2118   23.0234   0.0033541  0.00281253
> 9  16 50995 50979   22.1243   21.4258  0.00267282  0.00282371
>10  16 56745 56729   22.1577   22.4609  0.00252583   0.0028193
> Total time run: 10.002668
> Total writes made:  56745
> Write size: 4096
> Object size:4096
> Bandwidth (MB/sec): 22.1601
> Stddev Bandwidth:   0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:   5672
> Stddev IOPS:182
> Max IOPS:   5912
> Min IOPS:   5400
> Average Latency(s): 0.00281953
> Stddev Latency(s):  0.00190771
> Max latency(s): 0.0834767
> Min latency(s): 0.00120945
> 
> Min latency is fine -- but Max latency of 83ms ?
Outliers during setup are to be expected and can be ignored.

> Average IOPS @ 5672 ?
> 
Plenty of good reasons to come up with that number, yes.
> $ sudo rados bench -p scbench  10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> 0   0 0 0 0 0   -   0
> 1  15 23329 23314   91.0537   91.0703 0.000349856 0.000679074
> 2  16 48555 48539   94.7884   98.5352 0.000499159 0.000652067
> 3  16 76193 76177   99.1747   107.961 0.000443877 0.000622775
> 4  15103923103908   101.459   108.324 0.000678589 0.000609182
> 5  15132720132705   103.663   112.488 0.000741734 0.000595998
> 6  15161811161796   105.323   113.637 0.000333166 0.000586323
> 7  15190196190181   106.115   110.879 0.000612227 0.000582014
> 8  15221155221140   107.966   120.934 0.000471219 0.000571944
> 9  16251143251127   108.984   117.137 0.000267528 0.000566659
> Total time run:   10.000640
> Total reads made: 282097
> Read size:4096
> Object size:  4096
> Bandwidth (MB/sec):   110.187
> Average IOPS: 28207
> Stddev IOPS:  2357
> Max IOPS: 30959
> Min IOPS: 23314
> Average Latency(s):   0.000560402
> Max latency(s):   0.109804
> Min latency(s):   0.000212671
> 
> This is also quite far from expected. I have 12GB of memory on the