Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Christian Wuerdig
Well, preferring faster-clocked CPUs for SSD scenarios has been floated
several times over the last few months on this list. And realistic or not,
Nick's and Kostas' setups are similar enough (testing a single disk) that it's
a distinct possibility.
Anyway, as mentioned measuring the performance counters would probably
provide more insight.
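For example, one low-impact way to capture those counters around a test run
(assuming the default admin socket path, with osd.0 as a stand-in id) would be:

 ceph daemon osd.0 perf dump > perf_before.json
 # ... run the fio test ...
 ceph daemon osd.0 perf dump > perf_after.json

and then compare counters such as op_r_latency / op_w_latency (sum and
avgcount) between the two dumps.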


On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen 
wrote:

>
>
> On 24 Jun 2017 at 14:17, Maged Mokhtar wrote the
> following:
>
> My understanding was this test is targeting latency more than IOPS. This
> is probably why it was run using QD=1. It also makes sense that CPU freq
> will be more important than cores.
>
>
> But then it is not generic enough to be used as advice!
> It is just a line in 3D-space.
> As there are so many
>
> --WjW
>
> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>
> On 24-6-2017 05:30, Christian Wuerdig wrote:
>
> The general advice floating around is that you want CPUs with high
> clock speeds rather than more cores to reduce latency and increase IOPS
> for SSD setups (see also
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/). So
> something like an E5-2667V4 might bring better results in that situation.
> Also there was some talk about disabling the processor C states in order
> to bring latency down (something like this should be easy to test:
> https://stackoverflow.com/a/22482722/220986)
>
>
> I would be very careful to call this general advice...
>
> Although the article is interesting, it is rather one-sided.
>
> The only thing it shows is that there is a linear relation between
> clock speed and write or read speeds???
> The article is rather vague on how and what is actually tested.
>
> By just running a single OSD with no replication, a lot of the
> functionality is left out of the equation.
> Nobody is running just 1 OSD on a box in a normal cluster host.
>
> Not using a serious SSD is another source of noise in the conclusion.
> A higher queue depth can/will certainly have an impact on concurrency.
>
> I would call this an observation, and nothing more.
>
> --WjW
>
>
> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> mailto:reverend...@gmail.com>> wrote:
>
> Hello,
>
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD running Ubuntu Server
> 16.04
>
> Each server has 32 threads @2.1GHz and 128GB RAM.
> The disk distribution per server is:
> 38 * HUS726020ALS210 (SAS rotational)
> 2 * HUSMH8010BSS200 (SAS SSD for journals)
> 2 * ST1920FM0043 (SAS SSD for data)
> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>
> Since we don't currently have a 10Gbit switch, we test the performance
> with the cluster in a degraded state, the noout flag set and we mount
> rbd images on the powered on osd node. We confirmed that the network
> is not saturated during the tests.
>
> We ran tests on the NVME disk and the pool created on this disk where
> we hoped to get the most performance without getting limited by the
> hardware specs since we have more disks than CPU threads.
>
> The nvme disk was at first partitioned with one partition and the
> journal on the same disk. The performance on random 4K reads was
> topped at 50K iops. We then removed the osd and partitioned with 4
> data partitions and 4 journals on the same disk. The performance
> didn't increase significantly. Also, since we run read tests, the
> journals shouldn't cause performance issues.
>
> We then ran 4 fio processes in parallel on the same rbd mounted image
> and the total iops reached 100K. More parallel fio processes didn't
> increase the measured iops.
>
> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
> the crushmap just defines the different buckets/rules for the disk
> separation (rotational, ssd, nvme) in order to create the required
> pools
>
> Is the performance of 100,000 IOPS for random 4K reads normal for a
> disk that, on the same benchmark, runs at more than 300K IOPS on the
> same hardware, or are we missing something?
>
> Best regards,
> Kostas

Re: [ceph-users] cannot open /dev/xvdb: Input/output error

2017-06-26 Thread Mykola Golub
On Sun, Jun 25, 2017 at 11:28:37PM +0200, Massimiliano Cuttini wrote:
> 
> On 25/06/2017 21:52, Mykola Golub wrote:
> >On Sun, Jun 25, 2017 at 06:58:37PM +0200, Massimiliano Cuttini wrote:
> >>I can see the error even if I easily run list-mapped:
> >>
> >># rbd-nbd list-mapped
> >>/dev/nbd0
> >>2017-06-25 18:49:11.761962 7fcdd9796e00 -1 asok(0x7fcde3f72810) 
> >> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
> >> to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': 
> >> (17) File exists/dev/nbd1
> >"AdminSocket::bind_and_listen: failed to bind" errors are harmless,
> >you can safely ignore them (or configure admin_socket in ceph.conf to
> >avoid name collisions).
> I read around that this can lead to a lock on opening.
> http://tracker.ceph.com/issues/7690
> If the daemon exists, then you have to wait until it finishes its operation
> before you can connect.

In your case (rbd-nbd) this error is harmless. You can avoid it by
setting something like the below in the [client] section of ceph.conf:

 admin socket = /var/run/ceph/$name.$pid.asok

Also, to make every rbd-nbd process log to a separate file, you can
set (in the [client] section):

 log file = /var/log/ceph/$name.$pid.log
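Putting the two together, a minimal [client] section (using exactly the paths
above) would look like:

 [client]
 admin socket = /var/run/ceph/$name.$pid.asok
 log file = /var/log/ceph/$name.$pid.log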

> root 12610  0.0  0.2 1836768 11412 ?   Sl   Jun23   0:43 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-602b05be-395d-442e-bd68-7742deaf97bd
>  --name client.admin
> root 17298  0.0  0.2 1644244 8420 ?Sl   21:15   0:01 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-3e16395d-7dad-4680-a7ad-7f398da7fd9e
>  --name client.admin
> root 18116  0.0  0.2 1570512 8428 ?Sl   21:15   0:01 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-41a76fe7-c9ff-4082-adb4-43f3120a9106
>  --name client.admin
> root 19063  0.1  1.3 2368252 54944 ?   Sl   21:15   0:10 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-6da2154e-06fd-4063-8af5-ae86ae61df50
>  --name client.admin
> root 21007  0.0  0.2 1570512 8644 ?Sl   21:15   0:01 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-c8aca7bd-1e37-4af4-b642-f267602e210f
>  --name client.admin
> root 21226  0.0  0.2 1703640 8744 ?Sl   21:15   0:01 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-cf2139ac-b1c4-404d-87da-db8f992a3e72
>  --name client.admin
> root 21615  0.5  1.4 2368252 60256 ?   Sl   21:15   0:33 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-acb2a9b0-e98d-474e-aa42-ed4e5534ddbe
>  --name client.admin
> root 21653  0.0  0.2 1703640 11100 ?   Sl   04:12   0:14 rbd-nbd 
> --nbds_max 64 map 
> RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-8631ab86-c85c-407b-9e15-bd86e830ba74
>  --name client.admin

Do you observe the issue for all these volumes? I see many of them
were started recently (21:15) while others are older.

Do you observe sporadic crashes/restarts of rbd-nbd processes? You
can associate an nbd device with its rbd-nbd process (and rbd volume)
by looking at /sys/block/nbd*/pid and ps output.
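For example, a quick sketch to print that mapping (assuming a pid file exists
for every connected device):

 for p in /sys/block/nbd*/pid; do
   echo "$(basename $(dirname $p)) -> rbd-nbd pid $(cat $p)"
 done

Then match those pids against the rbd-nbd command lines in the ps output above
to see which image each device belongs to.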

-- 
Mykola Golub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 6 osds on 2 hosts, does Ceph always write data in one osd on host1 and replica in osd on host2?

2017-06-26 Thread Stéphane Klein
Hi,

I have this OSD:

root@ceph-storage-rbx-1:~# ceph osd tree
ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 21.70432 root default
-2 10.85216 host ceph-storage-rbx-1
 0  3.61739 osd.0up  1.0  1.0
 2  3.61739 osd.2up  1.0  1.0
 4  3.61739 osd.4up  1.0  1.0
-3 10.85216 host ceph-storage-rbx-2
 1  3.61739 osd.1up  1.0  1.0
 3  3.61739 osd.3up  1.0  1.0
 5  3.61739 osd.5up  1.0  1.0

with:

  osd_pool_default_size: 2
  osd_pool_default_min_size: 1

Question: does Ceph always write data to one OSD on host1 and the replica to
host2?
I fear that Ceph sometimes writes data to osd.0 and the replica to osd.2.

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 6 osds on 2 hosts, does Ceph always write data in one osd on host1 and replica in osd on host2?

2017-06-26 Thread Ashley Merrick
Hello,

Will need to see a full export of your crush map rules.

Depends what the failure domain is set to.
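For reference, the map can be exported and decompiled with something like
(file names are arbitrary):

 ceph osd getcrushmap -o crushmap.bin
 crushtool -d crushmap.bin -o crushmap.txt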

,Ash
Sent from my iPhone

On 26 Jun 2017, at 4:11 PM, Stéphane Klein 
mailto:cont...@stephane-klein.info>> wrote:

Hi,

I have this OSD:

root@ceph-storage-rbx-1:~# ceph osd tree
ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 21.70432 root default
-2 10.85216 host ceph-storage-rbx-1
 0  3.61739 osd.0up  1.0  1.0
 2  3.61739 osd.2up  1.0  1.0
 4  3.61739 osd.4up  1.0  1.0
-3 10.85216 host ceph-storage-rbx-2
 1  3.61739 osd.1up  1.0  1.0
 3  3.61739 osd.3up  1.0  1.0
 5  3.61739 osd.5up  1.0  1.0

with:

  osd_pool_default_size: 2
  osd_pool_default_min_size: 1

Question: does Ceph always write data to one OSD on host1 and the replica to host2?
I fear that Ceph sometimes writes data to osd.0 and the replica to osd.2.

Best regards,
Stéphane
--
Stéphane Klein mailto:cont...@stephane-klein.info>>
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread Marco Gaiarin

I've hit some strange things in my ceph cluster, and I'm asking for some
feedback here.

Some cluster info: 3 nodes, 12 OSD (4 per node, symmetrical), size=3.
Proxmox based, still on hammer, so used for RBD only.
The cluster was built using some spare servers, and there's a node that
is 'underpowered', so a bit of stress when there's some load is
expected. ;(


Last week I used the snapshot feature for the first time. I did some
tests beforehand on a ''spare'' VM, taking a snapshot of a powered-off
VM (as expected, it was merely instantaneous) and of a powered-on one
(clearly, snapshotting the RAM poses some stress on that VM, but not so
much on the overall system, as expected).
I also did some tests of deleting the created snapshot a few minutes
after taking it, and nothing relevant happened.


Friday, after 18.00 local time, so with very little load on the system,
I removed the snapshot I had taken of my principal VM a week before:

 Jun 23 18:10:10 thor pvedaemon[39766]:  starting task 
UPID:thor:A0E2:09D67B5F:594D3D62:qmdelsnapshot:107:gaio@PASIAN:
 Jun 23 18:10:10 thor pvedaemon[41186]:  delete snapshot VM 107: 
Jessie
 [...]
 Jun 23 18:11:58 thor pvedaemon[39766]:  end task 
UPID:thor:A0E2:09D67B5F:594D3D62:qmdelsnapshot:107:gaio@PASIAN: OK

and then suddenly the *WHOLE* system got thrashed, ceph went into HEALTH_WARN
status and so on:

 2017-06-23 18:12:24.941538 mon.0 10.27.251.7:6789/0 1408585 : cluster [INF] 
pgmap v17394099: 768 pgs: 768 active+clean; 2356 GB data, 7172 GB used, 9586 GB 
/ 16758 GB avail; 176 kB/s rd, 2936 kB/s wr, 46 op/s
 2017-06-23 18:12:26.020387 mon.0 10.27.251.7:6789/0 1408586 : cluster [INF] 
pgmap v17394100: 768 pgs: 768 active+clean; 2356 GB data, 7171 GB used, 9586 GB 
/ 16758 GB avail; 176 kB/s rd, 2913 kB/s wr, 48 op/s
 2017-06-23 18:12:27.086199 mon.0 10.27.251.7:6789/0 1408587 : cluster [INF] 
pgmap v17394101: 768 pgs: 768 active+clean; 2356 GB data, 7171 GB used, 9587 GB 
/ 16758 GB avail; 7582 B/s rd, 36017 B/s wr, 11 op/s
 2017-06-23 18:12:28.147427 mon.0 10.27.251.7:6789/0 1408588 : cluster [INF] 
pgmap v17394102: 768 pgs: 768 active+clean; 2356 GB data, 7170 GB used, 9588 GB 
/ 16758 GB avail; 5742 B/s rd, 36366 B/s wr, 9 op/s
 2017-06-23 18:12:28.588263 osd.11 10.27.251.7:6800/3810 1511 : cluster [WRN] 1 
slow requests, 1 included below; oldest blocked for > 30.083296 secs
 2017-06-23 18:12:28.588337 osd.11 10.27.251.7:6800/3810 1512 : cluster [WRN] 
slow request 30.083296 seconds old, received at 2017-06-23 18:11:58.504270: 
osd_op(client.37904158.0:60595026 rbd_data.4384f22ae8944a.0042 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1384448~16384] 
1.6e0a2eb snapc 2=[] ack+ondisk+write+known_if_redirected e4132) currently 
waiting for subops from 2,6
 2017-06-23 18:12:28.955268 osd.6 10.27.251.9:6800/4523 3167 : cluster [WRN] 2 
slow requests, 2 included below; oldest blocked for > 30.449628 secs
 2017-06-23 18:12:28.955454 osd.6 10.27.251.9:6800/4523 3168 : cluster [WRN] 
slow request 30.340693 seconds old, received at 2017-06-23 18:11:58.614354: 
osd_op(client.37904104.1:731096 rbd_data.723c0238e1f29.0600 
[set-alloc-hint object_size 4194304 write_size 4194304,write 16384~4096] 
3.d31445db ondisk+write e4132) currently waiting for subops from 3,11
 2017-06-23 18:12:28.955461 osd.6 10.27.251.9:6800/4523 3169 : cluster [WRN] 
slow request 30.449628 seconds old, received at 2017-06-23 18:11:58.505420: 
osd_op(client.37904158.0:60595031 rbd_data.4384f22ae8944a.6881 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1646592~4096] 
1.2610cd9b snapc 2=[] ack+ondisk+write+known_if_redirected e4132) currently 
waiting for subops from 1,3
 2017-06-23 18:12:29.588719 osd.11 10.27.251.7:6800/3810 1513 : cluster [WRN] 
23 slow requests, 22 included below; oldest blocked for > 31.084351 secs
 2017-06-23 18:12:29.588729 osd.11 10.27.251.7:6800/3810 1514 : cluster [WRN] 
slow request 30.974787 seconds old, received at 2017-06-23 18:11:58.613834: 
osd_op(client.37904104.1:731102 rbd_data.723c0238e1f29.0800 
[set-alloc-hint object_size 4194304 write_size 4194304,write 409600~4096] 
3.338471cd ondisk+write e4132) currently waiting for subops from 4,7
 2017-06-23 18:12:29.588738 osd.11 10.27.251.7:6800/3810 1515 : cluster [WRN] 
slow request 30.974705 seconds old, received at 2017-06-23 18:11:58.613916: 
osd_op(client.37904104.1:731103 rbd_data.723c0238e1f29.0800 
[set-alloc-hint object_size 4194304 write_size 4194304,write 483328~4096] 
3.338471cd ondisk+write e4132) currently waiting for subops from 4,7
 2017-06-23 18:12:29.588744 osd.11 10.27.251.7:6800/3810 1516 : cluster [WRN] 
slow request 30.974635 seconds old, received at 2017-06-23 18:11:58.613986: 
osd_op(client.37904104.1:731104 rbd_data.723c0238e1f29.0800 
[set-alloc-hint object_size 4194304 write_size 4194304,write 974848~4096] 
3.338471cd ondisk+write e4132) currently waiting for subops from 4,7
 2017-06

Re: [ceph-users] 6 osds on 2 hosts, does Ceph always write data in one osd on host1 and replica in osd on host2?

2017-06-26 Thread Stéphane Klein
2017-06-26 11:15 GMT+02:00 Ashley Merrick :

> Will need to see a full export of your crush map rules.
>

This is my crush map rules:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-storage-rbx-1 {
id -2 # do not change unnecessarily
# weight 10.852
alg straw
hash 0 # rjenkins1
item osd.0 weight 3.617
item osd.2 weight 3.617
item osd.4 weight 3.617
}
host ceph-storage-rbx-2 {
id -3 # do not change unnecessarily
# weight 10.852
alg straw
hash 0 # rjenkins1
item osd.1 weight 3.617
item osd.3 weight 3.617
item osd.5 weight 3.617
}
root default {
id -1 # do not change unnecessarily
# weight 21.704
alg straw
hash 0 # rjenkins1
item ceph-storage-rbx-1 weight 10.852
item ceph-storage-rbx-2 weight 10.852
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 6 osds on 2 hosts, does Ceph always write data in one osd on host1 and replica in osd on host2?

2017-06-26 Thread Ashley Merrick
You're going across hosts, so each replica will be on a different host.

,Ashley

Sent from my iPhone

On 26 Jun 2017, at 4:39 PM, Stéphane Klein 
mailto:cont...@stephane-klein.info>> wrote:



2017-06-26 11:15 GMT+02:00 Ashley Merrick 
mailto:ash...@amerrick.co.uk>>:
Will need to see a full export of your crush map rules.

This is my crush map rules:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-storage-rbx-1 {
id -2 # do not change unnecessarily
# weight 10.852
alg straw
hash 0 # rjenkins1
item osd.0 weight 3.617
item osd.2 weight 3.617
item osd.4 weight 3.617
}
host ceph-storage-rbx-2 {
id -3 # do not change unnecessarily
# weight 10.852
alg straw
hash 0 # rjenkins1
item osd.1 weight 3.617
item osd.3 weight 3.617
item osd.5 weight 3.617
}
root default {
id -1 # do not change unnecessarily
# weight 21.704
alg straw
hash 0 # rjenkins1
item ceph-storage-rbx-1 weight 10.852
item ceph-storage-rbx-2 weight 10.852
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 6 osds on 2 hosts, does Ceph always write data in one osd on host1 and replica in osd on host2?

2017-06-26 Thread Stéphane Klein
2017-06-26 11:48 GMT+02:00 Ashley Merrick :

> You're going across hosts, so each replica will be on a different host.
>

Thanks :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ideas on the UI/UX improvement of ceph-mgr: Cluster Status Dashboard

2017-06-26 Thread Massimiliano Cuttini

Hi Saumay,

I think you should take into account tracking SMART on every SSD found.
If it has SMART capabilities, then track its tests (or commit tests) and
display the values on the dashboard (or a separate graph).

This allows ADMINS to forecast which OSD will die next.

Preventing is better than restoring! :)
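As a rough sketch (device names are illustrative, and SAS/NVMe devices may
need an extra -d option), such a collector could simply loop over the disks:

 for dev in /dev/sd?; do
   echo "== $dev =="
   smartctl -H -A $dev
 done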



On 26/06/2017 06:49, saumay agrawal wrote:

Hi everyone!

I am working on the improvement of the web-based dashboard for Ceph.
My intention is to add some UI elements to visualise some performance
counters of a Ceph cluster. This gives a better overview to the users
of the dashboard about how the Ceph cluster is performing and, if
necessary, where they can make necessary optimisations to get even
better performance from the cluster.

Here is my suggestion on the two perf counters, commit latency and
apply latency. They are visualised using line graphs. I have prepared
UI mockups for the same.
1. OSD apply latency
[https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
2. OSD commit latency
[https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]

These mockups show the latency values (y-axis) against the instant of
time (x-axis). The latency values for different OSDs are highlighted
using different colours. The average latency value of all OSDs is
shown specifically in red. This representation allows the dashboard
user to compare the performances of an OSD with other OSDs, as well as
with the average performance of the cluster.

The line width in these graphs is deliberately kept small, so as to give a
crisp and clear representation for a larger number of OSDs. However, this
approach may clutter the graph and make it incomprehensible for a
cluster having significantly higher number of OSDs. For such
situations, we can retain only the average latency indications from
both the graphs to make things more simple for the dashboard user.

Also, higher latency values suggest bad performance. We can come up
with some specific values for both the counters, above which we can
say that the cluster is performing very badly. If the value of any of
the OSDs exceeds this value, we can highlight the entire graph in a light
red shade to draw the attention of the user towards it.

I am planning to use AJAX based templates and plugins (like
Flotcharts) for these graphs. This would allow real-time update of the
graphs without having any need to reload the entire dashboard page.

Another feature I propose to add is the representation of the version
distribution of all the clients in a cluster. This can be categorised
into distribution
1. on the basis of ceph version
[https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
2. on the basis of kernel version
[https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]

I have used doughnut charts instead of regular pie charts, as they
have some whitespace at their centre. This whitespace makes the chart
appear less cluttered, while properly indicating the appropriate
fraction of the total value. Also, we can later add some data to
display at this centre space when we hover over a particular slice of
the chart.

The main purpose of this visualisation is to identify any number of
clients left behind while updating the clients of the cluster. Suppose
a cluster has 50 clients running ceph jewel. In the process of
updating this cluster, 40 clients get updated to ceph luminous, while
the other 10 clients remain behind on ceph jewel. This may occur due
to some bug or any interruption in the update process. In such
scenarios, the user can find which clients have not been updated and
update them according to his needs.  It may also give a clear picture
for troubleshooting, during any package dependency issues due to the
kernel. The clients are represented in both, absolutes numbers as well
as the percentage of the entire cluster, for a better overview.

An interesting approach could be highlighting the older version(s)
specifically to grab the attention of the user. For example, a user
running ceph jewel may not need to update as necessarily compared to
the user running ceph hammer.

As of now, I am looking for plugins in AdminLTE to implement these two
elements in the dashboard. I would like to have feedbacks and
suggestions on these two from the ceph community, on how can I make
them more informative about the cluster.

Also a request to the various ceph users and developers. It would be
great if you could share the various metrics you are using as a
performance indicator for your cluster, and how you are using them.
Any metrics being used to identify the issues in a cluster can also be
shared.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Primary Affinity / EC Pool

2017-06-26 Thread Ashley Merrick
I have some 8TB drives I am looking to remove from the cluster long term; however, I
would like to make use of primary affinity to decrease the reads going to these
drives.

I have a replicated and an erasure-coded pool. I understand that when setting the
primary affinity of an OSD to 0, no PGs will have their primary on that OSD (the
primary will be set to another OSD), and thus reads will avoid it.

On an erasure-coded pool, does it do the same, but still allow the other shards of
the EC object to be placed on the 8TB drives per the crush map rule? Hence the only
I/O hitting the 8TB drives should be from the EC pool?
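For what it's worth, the affinity itself is set per OSD with something like
the following (osd ids are illustrative; depending on the release you may also
need mon_osd_allow_primary_affinity = true on the monitors):

 ceph osd primary-affinity osd.20 0
 ceph osd primary-affinity osd.21 0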

,Ashley

Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread Peter Maloney
On 06/26/17 11:36, Marco Gaiarin wrote:
> ...
> Three question:
>
> a) while a 'snapshot remove' action put system on load?
>
> b) as for options like:
>
>   osd scrub during recovery = false
> osd recovery op priority = 1
> osd recovery max active = 5
> osd max backfills = 1
>
>  (for recovery), there are option to reduce the impact of a stapshot
>  remove?
>
> c) snapshot are handled differently from other IO ops, or doing some
>  similar things (eg, a restore from a backup) i've to expect some
>  similar result?
>
>
> Thanks.
>
You also have to set:

> osd_pg_max_concurrent_snap_trims=1
> osd_snap_trim_sleep=0
The 2nd defaults to 0, but just make sure. Or maybe it doesn't exist in
hammer. It's bugged in jewel, and holds a lock during the sleep, so you have
to set it to 0.

And I think maybe this helps a little:

> filestore_split_multiple=8

And setting this one lower can reduce performance, but can also reduce blocked
requests (due to blocking locks):

> osd_op_threads = 2
(currently I have 8 set on the last one... long ago I found that lowest
was the minimum bearable when doing snapshots and snap removal)


And keep in mind all the "priority" stuff possibly doesn't have any
effect without the cfq disk scheduler (at least in hammer... I think
I've heard different for jewel and later). Check with:

> grep . /sys/block/*/queue/scheduler
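and, if needed, switch a device over at runtime with something like the below
(sdb is just an example; make it persistent via udev rules or the elevator=
kernel parameter):

> echo cfq > /sys/block/sdb/queue/scheduler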


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.1.0 Luminous RC released

2017-06-26 Thread Ashley Merrick
With the EC overwrite support, if currently running behind a cache tier in Jewel,
will the overwrite still be of benefit through the cache tier and remove the
need to promote the full block to make any edits?

Or are we better off totally removing the cache tier once fully upgraded?

,Ashley

Sent from my iPhone

On 26 Jun 2017, at 2:47 AM, Wido den Hollander 
mailto:w...@42on.com>> wrote:


On 23 June 2017 at 23:06, Sage Weil
mailto:s...@newdream.net>> wrote:


On Fri, 23 Jun 2017, Abhishek L wrote:
This is the first release candidate for Luminous, the next long term
stable release.

I just want to reiterate that this is a release candidate, not the final
luminous release.  We're still squashing bugs and merging a few last
items.  Testing is welcome, but you probably should not deploy this in any
production environments.


Understood! Question though, as BlueStore is now marked as stable and the 
default backend, are there any gotchas?

The release notes don't say anything about it vs FileStore, just that it's the
new default.

Is there anything users should look into when going to BlueStore or deploying
their new clusters with it?

Wido

Thanks!
sage


Ceph Luminous will be the foundation for the next long-term
stable release series.  There have been major changes since Kraken
(v11.2.z) and Jewel (v10.2.z).

Major Changes from Kraken
-

- *General*:

 * Ceph now has a simple, built-in web-based dashboard for monitoring
   cluster status.

- *RADOS*:

 * *BlueStore*:

   - The new *BlueStore* backend for *ceph-osd* is now stable and the new
 default for newly created OSDs.  BlueStore manages data stored by each OSD
 by directly managing the physical HDDs or SSDs without the use of an
 intervening file system like XFS.  This provides greater performance
 and features.
   - BlueStore supports *full data and metadata checksums* of all
 data stored by Ceph.
   - BlueStore supports inline compression using zlib, snappy, or LZ4.  (Ceph
 also supports zstd for RGW compression but zstd is not recommended for
 BlueStore for performance reasons.)

 * *Erasure coded* pools now have full support for *overwrites*,
   allowing them to be used with RBD and CephFS.

 * *ceph-mgr*:

   - There is a new daemon, *ceph-mgr*, which is a required part of any
 Ceph deployment.  Although IO can continue when *ceph-mgr* is
 down, metrics will not refresh and some metrics-related calls
 (e.g., ``ceph df``) may block.  We recommend deploying several instances of
 *ceph-mgr* for reliability.  See the notes on `Upgrading`_ below.
   - The *ceph-mgr* daemon includes a REST-based management API.  The
 API is still experimental and somewhat limited but will form the basis
 for API-based management of Ceph going forward.

 * The overall *scalability* of the cluster has improved. We have
   successfully tested clusters with up to 10,000 OSDs.
 * Each OSD can now have a *device class* associated with it (e.g., `hdd` or
   `ssd`), allowing CRUSH rules to trivially map data to a subset of devices
   in the system.  Manually writing CRUSH rules or manual editing of the CRUSH
   is normally not required.
  * *CRUSH weights* can now be optimized to
    maintain a *near-perfect distribution of data* across OSDs.
 * There is also a new `upmap` exception mechanism that allows
   individual PGs to be moved around to achieve a *perfect
   distribution* (this requires luminous clients).
 * Each OSD now adjusts its default configuration based on whether the
   backing device is an HDD or SSD.  Manual tuning generally not required.
 * The prototype *mclock QoS queueing algorithm* is now available.
 * There is now a *backoff* mechanism that prevents OSDs from being
   overloaded by requests to objects or PGs that are not currently able to
   process IO.
 * There is a *simplified OSD replacement process* that is more robust.
 * You can query the supported features and (apparent) releases of
   all connected daemons and clients with ``ceph features``.
 * You can configure the oldest Ceph client version you wish to allow to
   connect to the cluster via ``ceph osd set-require-min-compat-client`` and
   Ceph will prevent you from enabling features that will break compatibility
   with those clients.
 * Several `sleep` settings, include ``osd_recovery_sleep``,
   ``osd_snap_trim_sleep``, and ``osd_scrub_sleep`` have been
   reimplemented to work efficiently.  (These are used in some cases
   to work around issues throttling background work.)

- *RGW*:

 * RGW *metadata search* backed by ElasticSearch now supports end
   user requests service via RGW itself, and also supports custom
   metadata fields. A query language a set of RESTful APIs were
   created for users to be able to search objects by their
   metadata. New APIs that allow control of custom metadata fields
   were also added.
 * RGW now supports *dynamic bucket index sharding*.  As the number
   of objects in a

Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-26 Thread Piotr Dalek
On 17-06-21 03:24 PM, Sage Weil wrote:
> On Wed, 21 Jun 2017, Piotr Dałek wrote:
>> On 17-06-14 03:44 PM, Sage Weil wrote:
>>> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
 [snip]

 Is it safe to enable "filestore seek hole", are there any tests that
 verifies that everything related to RBD works fine with this enabled?
 Can we make this enabled by default?
>>>
>>> We would need to enable it in the qa environment first.  The risk here is
>>> that users run a broad range of kernels and we are exposing ourselves to
>>> any bugs in any kernel version they may run.  I'd prefer to leave it off
>>> by default.
>>
>> Is that a common regression? If not, we could blacklist particular kernels and
>> call it a day.

>>> We can enable it in the qa suite, though, which covers
>>> centos7 (latest kernel) and ubuntu xenial and trusty.
>>
>> +1. Do you need some particular PR for that?
> 
> Sure.  How about a patch that adds the config option to several of the
> files in qa/suites/rados/thrash/thrashers?

Does
https://github.com/ovh/ceph/commit/fe65e3a19470eea16c9d273d1aac1c7eff7d2ff1
look reasonable?

-- 
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread Lindsay Mathieson

On 26/06/2017 7:36 PM, Marco Gaiarin wrote:

Last week i've used by the first time the snapshot feature. I've done
some test, before, on some ''spare'' VM doing snapshot on a powered off
VM (as expected, was merely istantaneus) and on a powered on one
(clearly, snapshotting the RAM pose some stress on that VM, but not so
much on the overral system, as expected).
I've also do some test of deleting the snapshot created, but some
minute after i've done that snapshot, and nothing relevant happens.




Have you tried restoring a snapshot? I found it unusably slow - as in hours

--
Lindsay Mathieson

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object repair not going as planned

2017-06-26 Thread Brady Deetz
Resolved.

After all of the involved OSDs had been down for a while, I brought them
back up and issued another ceph pg repair. We are clean now.
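For anyone hitting something similar on jewel, the rough inspection sequence
(the pg id below is a placeholder) is:

 ceph health detail | grep inconsistent
 rados list-inconsistent-obj <pgid> --format=json-pretty
 ceph pg repair <pgid>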

On Sun, Jun 25, 2017 at 11:54 PM, Brady Deetz  wrote:

> I should have mentioned, I'm running ceph jewel 10.2.7
>
> On Sun, Jun 25, 2017 at 11:46 PM, Brady Deetz  wrote:
>
>> Over the course of the past year, I've had 3 instances where I had to
>> manually repair an object due to size. In this case, I was immediately
>> disappointed to discover what I think is evidence that only 1 of 3 replicas
>> is good. It got worse when a segfault occurred when I attempted to flush the
>> journal for one of the seemingly bad replicas.
>>
>> Below is a segfault from ceph-osd -i 160 --flush-journal
>> https://pastebin.com/GQkCn9T9
>>
>> More logs and command history can be found here:
>> https://pastebin.com/5knjNTd0
>>
>> So far, I've copied the object file to a tmp backup location, set noout,
>> stopped the osd service for the associated osds for that pg, flushed the
>> journals, and made a second copy of the objects post flush.
>>
>> Any help would be greatly appreciated.
>>
>> I'm considering just deleting the 2 known bad files and attempting a ceph
>> pg repair. But, I'm not really sure that will work with only 1 good replica.
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Willem Jan Withagen
On 26-6-2017 09:01, Christian Wuerdig wrote:
> Well, preferring faster clock CPUs for SSD scenarios has been floated
> several times over the last few months on this list. And realistic or
> not, Nick's and Kostas' setup are similar enough (testing single disk)
> that it's a distinct possibility.
> Anyway, as mentioned measuring the performance counters would probably
> provide more insight.

I read the advice as:
prefer GHz over cores.

And especially since there is a sort of balance between either GHz or
cores, that can be an expensive one. Getting both means you have to pay
substantially more money.

And for an average Ceph server with plenty of OSDs, I personally just don't
buy that. There you'd have to look at the total throughput of the
system, and latency is only one of the many factors.

Let alone in a cluster with several hosts (and/or racks). There the
latency is dictated by the network. So a bad choice of network card or
switch will outdo any extra cycles that your CPU can burn.

I think that just testing 1 OSD is testing artifacts, and has very
little to do with running an actual ceph cluster.

So if one would like to test this, the test setup should be something
like: 3 hosts with something like 3 disks per host, min_size=2 and a
nice workload.
Then turn the GHz-knob and see what happens with client latency and
throughput.
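Turning that knob can be done from userspace, for example (governor names and
limits depend on the cpufreq driver in use):

 cpupower frequency-info
 cpupower frequency-set -g performance
 cpupower frequency-set -d 2.1GHz -u 2.1GHz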

--WjW

> On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen  > wrote:
> 
> 
> 
> Op 24 jun. 2017 om 14:17 heeft Maged Mokhtar  > het volgende geschreven:
> 
>> My understanding was this test is targeting latency more than
>> IOPS. This is probably why its was run using QD=1. It also makes
>> sense that cpu freq will be more important than cores. 
>>
> 
> But then it is not generic enough to be used as an advise!
> It is just a line in 3D-space. 
> As there are so many
> 
> --WjW
> 
>> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>>
>>> On 24-6-2017 05:30, Christian Wuerdig wrote:
 The general advice floating around is that your want CPUs with high
 clock speeds rather than more cores to reduce latency and
 increase IOPS
 for SSD setups (see also
 http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
 )
 So
 something like a E5-2667V4 might bring better results in that
 situation.
 Also there was some talk about disabling the processor C states
 in order
 to bring latency down (something like this should be easy to test:
 https://stackoverflow.com/a/22482722/220986
 )
>>>
>>> I would be very careful to call this a general advice...
>>>
>>> Although the article is interesting, it is rather single sided.
>>>
>>> The only thing is shows that there is a lineair relation between
>>> clockspeed and write or read speeds???
>>> The article is rather vague on how and what is actually tested.
>>>
>>> By just running a single OSD with no replication a lot of the
>>> functionality is left out of the equation.
>>> Nobody is running just 1 osD on a box in a normal cluster host.
>>>
>>> Not using a serious SSD is another source of noise on the conclusion.
>>> More Queue depth can/will certainly have impact on concurrency.
>>>
>>> I would call this an observation, and nothing more.
>>>
>>> --WjW

 On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
 mailto:reverend...@gmail.com>
 >>
 wrote:

 Hello,

 We are in the process of evaluating the performance of a testing
 cluster (3 nodes) with ceph jewel. Our setup consists of:
 3 monitors (VMs)
 2 physical servers each connected with 1 JBOD running Ubuntu
 Server
 16.04

 Each server has 32 threads @2.1GHz and 128GB RAM.
 The disk distribution per server is:
 38 * HUS726020ALS210 (SAS rotational)
 2 * HUSMH8010BSS200 (SAS SSD for journals)
 2 * ST1920FM0043 (SAS SSD for data)
 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)

 Since we don't currently have a 10Gbit switch, we test the
 performance
 with the cluster in a degraded state, the noout flag set and
 we mount
 rbd images on the powered on osd node. We confirmed that the
 network
 is not saturated during the tests.

 We ran tests on the NVME disk and the pool created on this
 disk where
 we hoped to get the most performance without getting limited
 by the
 hardware specs since we have more 

Re: [ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread David Turner
Snapshots are not a free action.  Creating them is near enough free, but
deleting objects in Ceph is an n^2 operation.  Being on Hammer, you do not
have access to the object-map feature on RBDs, which drastically reduces the
n^2 problem by keeping track of which objects it actually needs to delete.
For your week-old snapshot, the cluster needs to throw every object of the
snapshot (whether it exists or not) into the snap_trim_q to be
deleted.  So what n^2 means, if you aren't familiar, is that if a 1GB
volume/snapshot takes 4 minutes to delete, then a 2GB volume takes 16
minutes.

Peter mentioned the setting that was implemented in Hammer and is the ONLY
setting in Hammer that can help with snapshot deletions that thrash your
cluster.  You NEED to use osd_snap_trim_sleep.  Jewel broke that without
properly implementing adequate work-arounds for the setting, but Jewel is
back on track now.  I would recommend an osd_snap_trim_sleep of about .05
to start with to see if that alleviates your pressure.  It was a bad
solution to fix a problem quickly that they've finally revisited to address
it properly.  What it does is every time it deletes snap shot objects, it
sleeps for .05 seconds and then does the next one.  In Jewel that was
broken because they moved snap shot deletions into the main op thread and
the snap trim sleep just put a sleep onto the main op thread telling the
osd thread to do nothing after deleting a snap trim object.
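For reference, the sleep can be injected at runtime and then persisted in
ceph.conf, for example (value per the suggestion above):

 ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05'

 [osd]
 osd snap trim sleep = 0.05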

Upgrading to Jewel and enabling object_map on all of your rbds would help
this problem as well as researching the new options in Jewel to fine-tune
snap trim settings for your environment and hardware.  I personally still
just use a small osd_snap_trim_sleep on my 3 node proxmox cluster and it
works fine.  I don't get slow requests when I delete snapshots.  I used to
before putting in a little snap trim sleep.  I only create snapshots about
once/mo and cycle out the old ones, but it works well for me.

On Mon, Jun 26, 2017 at 8:07 AM Lindsay Mathieson <
lindsay.mathie...@gmail.com> wrote:

> On 26/06/2017 7:36 PM, Marco Gaiarin wrote:
> > Last week i've used by the first time the snapshot feature. I've done
> > some test, before, on some ''spare'' VM doing snapshot on a powered off
> > VM (as expected, was merely istantaneus) and on a powered on one
> > (clearly, snapshotting the RAM pose some stress on that VM, but not so
> > much on the overral system, as expected).
> > I've also do some test of deleting the snapshot created, but some
> > minute after i've done that snapshot, and nothing relevant happens.
>
>
>
> Have you tried restoring a snapshot? I found it unusablly slow - as in
> hours
>
> --
> Lindsay Mathieson
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ideas on the UI/UX improvement of ceph-mgr: Cluster Status Dashboard

2017-06-26 Thread Brady Deetz
+1 on SMART tracking

On Mon, Jun 26, 2017 at 5:19 AM, Massimiliano Cuttini 
wrote:

> Hi Saumay,
>
> I think you should take into account tracking SMART on every SSD found.
> If it has SMART capabilities, then track its tests (or commit tests) and
> display the values on the dashboard (or a separate graph).
> This allows ADMINS to forecast which OSD will die next.
>
> Preventing is better than restoring! :)
>
>
>
>
> On 26/06/2017 06:49, saumay agrawal wrote:
>
>> Hi everyone!
>>
>> I am working on the improvement of the web-based dashboard for Ceph.
>> My intention is to add some UI elements to visualise some performance
>> counters of a Ceph cluster. This gives a better overview to the users
>> of the dashboard about how the Ceph cluster is performing and, if
>> necessary, where they can make necessary optimisations to get even
>> better performance from the cluster.
>>
>> Here is my suggestion on the two perf counters, commit latency and
>> apply latency. They are visualised using line graphs. I have prepared
>> UI mockups for the same.
>> 1. OSD apply latency
>> [https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
>> 2. OSD commit latency
>> [https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]
>>
>> These mockups show the latency values (y-axis) against the instant of
>> time (x-axis). The latency values for different OSDs are highlighted
>> using different colours. The average latency value of all OSDs is
>> shown specifically in red. This representation allows the dashboard
>> user to compare the performances of an OSD with other OSDs, as well as
>> with the average performance of the cluster.
>>
>> The line width in these graphs is deliberately kept small, so as to give a
>> crisp and clear representation for a larger number of OSDs. However, this
>> approach may clutter the graph and make it incomprehensible for a
>> cluster having significantly higher number of OSDs. For such
>> situations, we can retain only the average latency indications from
>> both the graphs to make things more simple for the dashboard user.
>>
>> Also, higher latency values suggest bad performance. We can come up
>> with some specific values for both the counters, above which we can
>> say that the cluster is performing very badly. If the value of any of
>> the OSDs exceeds this value, we can highlight the entire graph in a light
>> red shade to draw the attention of the user towards it.
>>
>> I am planning to use AJAX based templates and plugins (like
>> Flotcharts) for these graphs. This would allow real-time update of the
>> graphs without having any need to reload the entire dashboard page.
>>
>> Another feature I propose to add is the representation of the version
>> distribution of all the clients in a cluster. This can be categorised
>> into distribution
>> 1. on the basis of ceph version
>> [https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
>> 2. on the basis of kernel version
>> [https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]
>>
>> I have used doughnut charts instead of regular pie charts, as they
>> have some whitespace at their centre. This whitespace makes the chart
>> appear less cluttered, while properly indicating the appropriate
>> fraction of the total value. Also, we can later add some data to
>> display at this centre space when we hover over a particular slice of
>> the chart.
>>
>> The main purpose of this visualisation is to identify any number of
>> clients left behind while updating the clients of the cluster. Suppose
>> a cluster has 50 clients running ceph jewel. In the process of
>> updating this cluster, 40 clients get updated to ceph luminous, while
>> the other 10 clients remain behind on ceph jewel. This may occur due
>> to some bug or any interruption in the update process. In such
>> scenarios, the user can find which clients have not been updated and
>> update them according to his needs.  It may also give a clear picture
>> for troubleshooting, during any package dependency issues due to the
>> kernel. The clients are represented in both, absolutes numbers as well
>> as the percentage of the entire cluster, for a better overview.
>>
>> An interesting approach could be highlighting the older version(s)
>> specifically to grab the attention of the user. For example, a user
>> running ceph jewel may not need to update as necessarily compared to
>> the user running ceph hammer.
>>
>> As of now, I am looking for plugins in AdminLTE to implement these two
>> elements in the dashboard. I would like to have feedbacks and
>> suggestions on these two from the ceph community, on how can I make
>> them more informative about the cluster.
>>
>> Also a request to the various ceph users and developers. It would be
>> great if you could share the various metrics you are using as a
>> performance indicator for your cluster, and how you are using them.
>> Any metrics being used to identify the issues in a cluster can also be
>> shared.

Re: [ceph-users] 6 osds on 2 hosts, does Ceph always write data in one osd on host1 and replica in osd on host2?

2017-06-26 Thread David Turner
Just so you're aware of why that's the case, the line
  step chooseleaf firstn 0 type host
in your crush map under the rules section says "host".  If you changed that
to "osd", then your replicas would be unique per OSD instead of per
server.  If you had a larger cluster and changed it to "rack" and
implemented racks as buckets with hosts in them in your crush map, then
your replicas would be unique per rack that you configured.
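A hypothetical sketch of that rack variant, in the same decompiled crush map
syntax as the one posted earlier (ids and weights are illustrative), would add
rack buckets containing the hosts:

 rack rack-1 {
 id -4
 alg straw
 hash 0 # rjenkins1
 item ceph-storage-rbx-1 weight 10.852
 }

put the racks (instead of the hosts) under root default, and change the rule to:

 step chooseleaf firstn 0 type rack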

On Mon, Jun 26, 2017 at 5:50 AM Stéphane Klein 
wrote:

> 2017-06-26 11:48 GMT+02:00 Ashley Merrick :
>
>> Your going across host’s so each replication will be on a different host.
>>
>
> Thanks :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Willem Jan Withagen
> Sent: 26 June 2017 14:35
> To: Christian Wuerdig 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Ceph random read IOPS
> 
> On 26-6-2017 09:01, Christian Wuerdig wrote:
> > Well, preferring faster clock CPUs for SSD scenarios has been floated
> > several times over the last few months on this list. And realistic or
> > not, Nick's and Kostas' setup are similar enough (testing single disk)
> > that it's a distinct possibility.
> > Anyway, as mentioned measuring the performance counters would
> probably
> > provide more insight.
> 
> I read the advise as:
>   prefer GHz over cores.
> 
> And especially since there is a sort of balance between either GHz or
cores,
> that can be an expensive one. Getting both means you have to pay
relatively
> substantial more money.
> 
> And for an average Ceph server with plenty OSDs, I personally just don't
buy
> that. There you'd have to look at the total throughput of the the system,
and
> latency is only one of the many factors.
> 
> Let alone in a cluster with several hosts (and or racks). There the
latency is
> dictated by the network. So a bad choice of network card or switch will
out
> do any extra cycles that your CPU can burn.
> 
> I think that just testing 1 OSD is testing artifacts, and has very little
to do with
> running an actual ceph cluster.
> 
> So if one would like to test this, the test setup should be something
> like: 3 hosts with something like 3 disks per host, min_disk=2  and a nice
> workload.
> Then turn the GHz-knob and see what happens with client latency and
> throughput.

Did similar tests last summer. 5 nodes with 12x 7.2k disks each, connected
via 10G. NVME journal. 3x replica pool.

The first test was with C-states left on auto and frequency scaling leaving the
cores at the lowest frequency of 900MHz. The cluster will quite happily do a
couple of thousand IOs without generating enough CPU load to boost the 4
cores up to max C-state or frequency.

With a small amount of background IO going on, a QD=1 sequential 4kB write
was done with the following results:

write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
 lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
clat percentiles (usec):
 |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
 | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
 | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
 | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
 | 99.99th=[22400]

So just under 2.5ms write latency.

I don't have the results from the separate C-states/frequency scaling, but
adjusting either got me a boost. Forcing to C1 and max frequency of 3.6Ghz
got me:

write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
 lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
clat percentiles (usec):
 |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
 | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
 | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
 | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
 | 99.99th=[24448]

Quite a bit faster. Although these are best case figures, if any substantial
workload is run, the average tends to hover around 1ms latency.
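For anyone wanting to reproduce this, one common approach (assuming an Intel
CPU using the intel_idle driver) is to cap C-states on the kernel command line
with intel_idle.max_cstate=1 processor.max_cstate=1, force the performance
governor, and then check the result:

 cpupower frequency-set -g performance
 cat /sys/module/intel_idle/parameters/max_cstate
 cpupower idle-info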

Nick

> 
> --WjW
> 
> > On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen  > > wrote:
> >
> >
> >
> > Op 24 jun. 2017 om 14:17 heeft Maged Mokhtar
>  > > het volgende geschreven:
> >
> >> My understanding was this test is targeting latency more than
> >> IOPS. This is probably why its was run using QD=1. It also makes
> >> sense that cpu freq will be more important than cores.
> >>
> >
> > But then it is not generic enough to be used as an advise!
> > It is just a line in 3D-space.
> > As there are so many
> >
> > --WjW
> >
> >> On 2017-06-24 12:52, Willem Jan Withagen wrote:
> >>
> >>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>  The general advice floating around is that your want CPUs with
high
>  clock speeds rather than more cores to reduce latency and
>  increase IOPS
>  for SSD setups (see also
>  http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
>   performance/>)
>  So
>  something like a E5-2667V4 might bring better results in that
>  situation.
>  Also there was some talk about disabling the processor C stat

Re: [ceph-users] Multi Tenancy in Ceph RBD Cluster

2017-06-26 Thread David Turner
I don't know specifics on Kubernetes or creating multiple keyrings for
servers, so I'll leave those for someone else.  I will say that if you are
kernel mapping your RBDs, then the first tenant to do so will lock the RBD
and no other tenant can map it.  This is built into Ceph.  The original
tenant would need to unmap it for the second to be able to access it.  This
is different if you are not mapping RBDs and just using librbd to deal with
them.

Multiple pools in Ceph are not free.  Pools are a fairly costly resource in
Ceph because data for pools is stored in PGs, the PGs are stored and
distributed between the OSDs in your cluster, and the more PGs an OSD has
the more memory requirements that OSD has.  It does not scale infinitely.
If you are talking about one Pool per customer on a dozen or less
customers, then it might work for your use case, but again it doesn't scale
to growing the customer base.

RBD map could be run remotely via SSH, but that isn't what you were asking
about.  I don't know of any functionality that allows you to use a keyring
on server A to map an RBD on server B.

"Ceph Statistics" is VERY broad.  Are you talking IOPS, disk usage,
throughput, etc.?  Disk usage is incredibly simple to calculate, especially
if the RBD has object-map enabled.  A simple rbd du rbd_name would give you
the disk usage per RBD and return in seconds.
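For example (pool and image names are hypothetical; the second form reports
usage for every image in the pool):

 rbd du rbd/volume-tenant1
 rbd du -p rbd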

On Mon, Jun 26, 2017 at 2:00 AM Mayank Kumar  wrote:

> Hi Ceph Users
> I am relatively new to Ceph and trying to Provision CEPH RBD Volumes using
> Kubernetes.
>
> I would like to know what are the best practices for hosting a multi
> tenant CEPH cluster. Specifically i have the following questions:-
>
> - Is it ok to share a single Ceph Pool amongst multiple tenants ?  If yes,
> how do you guarantee that volumes of one Tenant are not
>  accessible(mountable/mapable/unmappable/deleteable/mutable) to other
> tenants ?
> - Can a single Ceph Pool have multiple admin and user keyrings generated
> for rbd create and rbd map commands ? This way i want to assign different
> keyrings to each tenant
>
> - can a rbd map command be run remotely for any node on which we want to
> mount RBD Volumes or it must be run from the same node on which we want to
> mount ? Is this going to be possible in the future ?
>
> - In terms of ceph fault tolerance and resiliency, is one ceph pool per
> customer a better design or a single pool must be shared with mutiple
> customers
> - In a single pool for all customers, how can we get the ceph statistics
> per customer ? Is it possible to somehow derive this from the RBD volumes ?
>
> Thanks for your responses
> Mayank
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread Marco Gaiarin
Mandi! Lindsay Mathieson
  In chel di` si favelave...

> Have you tried restoring a snapshot? I found it unusablly slow - as in hours

No, still no; i've never restored a snapshot...

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread Jason Dillaman
Restoring a snapshot involves copying the entire image from the
snapshot revision to the HEAD revision. The faster approach would be
to just create a clone from the snapshot.
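
For example, something along these lines (a sketch; the pool, image, and
snapshot names are placeholders, and the snapshot must be protected before
it can be cloned):

  rbd snap protect rbd/myimage@mysnap
  rbd clone rbd/myimage@mysnap rbd/myclone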

2017-06-26 10:59 GMT-04:00 Marco Gaiarin :
> Mandi! Lindsay Mathieson
>   In chel di` si favelave...
>
>> Have you tried restoring a snapshot? I found it unusablly slow - as in hours
>
> No, still no; i've never restored a snapshot...
>
> --
> dott. Marco Gaiarin GNUPG Key ID: 240A3D66
>   Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
>   marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797
>
> Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
>   http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
> (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Maged Mokhtar
On 2017-06-26 15:34, Willem Jan Withagen wrote:

> On 26-6-2017 09:01, Christian Wuerdig wrote: 
> 
>> Well, preferring faster clock CPUs for SSD scenarios has been floated
>> several times over the last few months on this list. And realistic or
>> not, Nick's and Kostas' setup are similar enough (testing single disk)
>> that it's a distinct possibility.
>> Anyway, as mentioned measuring the performance counters would probably
>> provide more insight.
> 
> I read the advise as:
> prefer GHz over cores.
> 
> And especially since there is a sort of balance between either GHz or
> cores, that can be an expensive one. Getting both means you have to pay
> relatively substantial more money.
> 
> And for an average Ceph server with plenty OSDs, I personally just don't
> buy that. There you'd have to look at the total throughput of the the
> system, and latency is only one of the many factors.
> 
> Let alone in a cluster with several hosts (and or racks). There the
> latency is dictated by the network. So a bad choice of network card or
> switch will out do any extra cycles that your CPU can burn.
> 
> I think that just testing 1 OSD is testing artifacts, and has very
> little to do with running an actual ceph cluster.
> 
> So if one would like to test this, the test setup should be something
> like: 3 hosts with something like 3 disks per host, min_disk=2  and a
> nice workload.
> Then turn the GHz-knob and see what happens with client latency and
> throughput.
> 
> --WjW 
> 
In a high concurrency/queue depth situation, which is probably the most
common workload, there is no question that adding more cores will increase
IOPS almost linearly, as long as you have enough disk and network bandwidth,
i.e. your disk and network % utilization is low and your CPU is near 100%.
Adding more cores is also a more economical way to increase IOPS than
increasing frequency.
But adding more cores will not lower latency below the value you get from the
QD=1 test. To achieve lower latency you need a faster CPU frequency. Yes, it is
expensive, and as you said you also need lower-latency switches and so on, but
you just have to pay more to achieve this.

/Maged
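
One way to see which side of that trade-off you are on is to watch OSD CPU
usage and the op latency counters while the benchmark runs (a sketch; it
assumes access to the OSD admin socket on the OSD host):

  top -b -n 1 | grep ceph-osd
  ceph daemon osd.0 perf dump | grep -A 3 '"op_w_latency"'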
> 
> On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen wrote:
> 
> On 24 Jun 2017, at 14:17, Maged Mokhtar wrote:
> 
> My understanding was this test is targeting latency more than
> IOPS. This is probably why its was run using QD=1. It also makes
> sense that cpu freq will be more important than cores. 
> 
> But then it is not generic enough to be used as an advise!
> It is just a line in 3D-space. 
> As there are so many
> 
> --WjW
> 
> On 2017-06-24 12:52, Willem Jan Withagen wrote:
> 
> On 24-6-2017 05:30, Christian Wuerdig wrote:
> The general advice floating around is that your want CPUs with high
> clock speeds rather than more cores to reduce latency and
> increase IOPS
> for SSD setups (see also
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/)
> So
> something like a E5-2667V4 might bring better results in that
> situation.
> Also there was some talk about disabling the processor C states
> in order
> to bring latency down (something like this should be easy to test:
> https://stackoverflow.com/a/22482722/220986)
> I would be very careful to call this a general advice...
> 
> Although the article is interesting, it is rather single sided.
> 
> The only thing is shows that there is a lineair relation between
> clockspeed and write or read speeds???
> The article is rather vague on how and what is actually tested.
> 
> By just running a single OSD with no replication a lot of the
> functionality is left out of the equation.
> Nobody is running just 1 osD on a box in a normal cluster host.
> 
> Not using a serious SSD is another source of noise on the conclusion.
> More Queue depth can/will certainly have impact on concurrency.
> 
> I would call this an observation, and nothing more.
> 
> --WjW 
> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> wrote:
> 
> Hello,
> 
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD running Ubuntu
> Server
> 16.04
> 
> Each server has 32 threads @2.1GHz and 128GB RAM.
> The disk distribution per server is:
> 38 * HUS726020ALS210 (SAS rotational)
> 2 * HUSMH8010BSS200 (SAS SSD for journals)
> 2 * ST1920FM0043 (SAS SSD for data)
> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
> 
> Since we don't currently have a 10Gbit switch, we test the
> performance
> with the cluster in a degraded state, the noout flag set and
> we mount
> rbd images on the po

Re: [ceph-users] cannot open /dev/xvdb: Input/output error

2017-06-26 Thread Massimiliano Cuttini



On Sun, Jun 25, 2017 at 11:28:37PM +0200, Massimiliano Cuttini wrote:

On 25/06/2017 21:52, Mykola Golub wrote:

On Sun, Jun 25, 2017 at 06:58:37PM +0200, Massimiliano Cuttini wrote:

I can see the error even if I easily run list-mapped:

# rbd-nbd list-mapped
/dev/nbd0
2017-06-25 18:49:11.761962 7fcdd9796e00 -1 asok(0x7fcde3f72810) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) 
File exists/dev/nbd1

"AdminSocket::bind_and_listen: failed to bind" errors are harmless,
you can safely ignore them (or configure admin_socket in ceph.conf to
avoid names collisions).

I read around that this can lead to a lock in the opening.
http://tracker.ceph.com/issues/7690
If the daemon exists than you have to wait that it ends its operation before
you can connect.

In your case (rbd-nbd) this error is harmless. You can avoid them
setting in ceph.conf, [client] section something like below:

  admin socket = /var/run/ceph/$name.$pid.asok

Also to make every rbd-nbd process to log to a separate file you can
set (in [client] section):

  log file = /var/log/ceph/$name.$pid.log

I need to create all the users in the ceph cluster before using this.
At the moment the whole cluster is running with the ceph admin keyring.
However, this is not an issue; I can rapidly deploy all the users needed.


root 12610  0.0  0.2 1836768 11412 ?   Sl   Jun23   0:43 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-602b05be-395d-442e-bd68-7742deaf97bd
 --name client.admin
root 17298  0.0  0.2 1644244 8420 ?Sl   21:15   0:01 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-3e16395d-7dad-4680-a7ad-7f398da7fd9e
 --name client.admin
root 18116  0.0  0.2 1570512 8428 ?Sl   21:15   0:01 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-41a76fe7-c9ff-4082-adb4-43f3120a9106
 --name client.admin
root 19063  0.1  1.3 2368252 54944 ?   Sl   21:15   0:10 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-6da2154e-06fd-4063-8af5-ae86ae61df50
 --name client.admin
root 21007  0.0  0.2 1570512 8644 ?Sl   21:15   0:01 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-c8aca7bd-1e37-4af4-b642-f267602e210f
 --name client.admin
root 21226  0.0  0.2 1703640 8744 ?Sl   21:15   0:01 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-cf2139ac-b1c4-404d-87da-db8f992a3e72
 --name client.admin
root 21615  0.5  1.4 2368252 60256 ?   Sl   21:15   0:33 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-acb2a9b0-e98d-474e-aa42-ed4e5534ddbe
 --name client.admin
root 21653  0.0  0.2 1703640 11100 ?   Sl   04:12   0:14 rbd-nbd 
--nbds_max 64 map 
RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-8631ab86-c85c-407b-9e15-bd86e830ba74
 --name client.admin

Do you observe the issue for all these volumes? I see many of them
were started recently (21:15) while other are older.

Only some of them.
But it seems random.
Some old ones and some just plugged in become unavailable to Xen.

Don't you observe sporadic crashes/restarts of rbd-nbd processes? You
can associate a nbd device with rbd-nbd process (and rbd volume)
looking at /sys/block/nbd*/pid and ps output.

I really don't know where to look for the rbd-nbd log.
Can you point it out?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot open /dev/xvdb: Input/output error

2017-06-26 Thread Mykola Golub
On Mon, Jun 26, 2017 at 07:12:31PM +0200, Massimiliano Cuttini wrote:

> >In your case (rbd-nbd) this error is harmless. You can avoid them
> >setting in ceph.conf, [client] section something like below:
> >
> >  admin socket = /var/run/ceph/$name.$pid.asok
> >
> >Also to make every rbd-nbd process to log to a separate file you can
> >set (in [client] section):
> >
> >  log file = /var/log/ceph/$name.$pid.log
> I need to create all the user in ceph cluster before use this.
> At the moment all the cluster was runnig with ceph admin keyring.
> However, this is not an issue, I  can rapidly deploy all user
> >needed.

I don't understand this. I think just adding these parameters to
ceph.conf should work.
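
For reference, a combined [client] section would look something like this
(a sketch built from the two settings above):

  [client]
      admin socket = /var/run/ceph/$name.$pid.asok
      log file = /var/log/ceph/$name.$pid.log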

> 
> >>root 12610  0.0  0.2 1836768 11412 ?   Sl   Jun23   0:43 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-602b05be-395d-442e-bd68-7742deaf97bd
> >> --name client.admin
> >>root 17298  0.0  0.2 1644244 8420 ?Sl   21:15   0:01 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-3e16395d-7dad-4680-a7ad-7f398da7fd9e
> >> --name client.admin
> >>root 18116  0.0  0.2 1570512 8428 ?Sl   21:15   0:01 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-41a76fe7-c9ff-4082-adb4-43f3120a9106
> >> --name client.admin
> >>root 19063  0.1  1.3 2368252 54944 ?   Sl   21:15   0:10 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-6da2154e-06fd-4063-8af5-ae86ae61df50
> >> --name client.admin
> >>root 21007  0.0  0.2 1570512 8644 ?Sl   21:15   0:01 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-c8aca7bd-1e37-4af4-b642-f267602e210f
> >> --name client.admin
> >>root 21226  0.0  0.2 1703640 8744 ?Sl   21:15   0:01 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-cf2139ac-b1c4-404d-87da-db8f992a3e72
> >> --name client.admin
> >>root 21615  0.5  1.4 2368252 60256 ?   Sl   21:15   0:33 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-acb2a9b0-e98d-474e-aa42-ed4e5534ddbe
> >> --name client.admin
> >>root 21653  0.0  0.2 1703640 11100 ?   Sl   04:12   0:14 rbd-nbd 
> >>--nbds_max 64 map 
> >>RBD_XenStorage-51a45fd8-a4d1-4202-899c-00a0f81054cc/VHD-8631ab86-c85c-407b-9e15-bd86e830ba74
> >> --name client.admin
> >Do you observe the issue for all these volumes? I see many of them
> >were started recently (21:15) while other are older.
> Only some of them.
> But it's randomly.
> Some of old and some just plugged becomes unavailable to xen.

Do you mean by "unavailable" that image is corrupted or does it
reports IO errors? If this is the first case then it was corrupted
some time ago and we would need logs for that period to understand
what happened.

> >Don't you observe sporadic crashes/restarts of rbd-nbd processes? You
> >can associate a nbd device with rbd-nbd process (and rbd volume)
> >looking at /sys/block/nbd*/pid and ps output.
> I really don't know where to look for the rbd-nbd log.
> Can you point it out?

According to some of your previous messages rbd-nbd is writing to
/var/log/ceph/client.log:

> Under /var/log/ceph/client.log
> I see this error:
> 
> 2017-06-25 05:25:32.833202 7f658ff04e00  0 ceph version 10.2.7
> (50e863e0f4bc8f4b9e31156de690d765af245185), process rbd-nbd, pid 8524

You could look for errors in older log files if they are rotated.
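
For example (a sketch, assuming the default logrotate naming under
/var/log/ceph):

  grep -iE 'error|abort|assert' /var/log/ceph/client.log
  zgrep -iE 'error|abort|assert' /var/log/ceph/client.log.*.gz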

-- 
Mykola Golub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi Tenancy in Ceph RBD Cluster

2017-06-26 Thread Mayank Kumar
Thanks David, a few more questions:
- Is there a way to limit the capability of the keyring which is used to
map/unmap/lock so that it only allows those operations and nothing else?
- For a single pool, is there a way to generate multiple keyrings where an
RBD cannot be mapped by tenant A using keyring A if it was mapped using a
different keyring created only for tenant B? I understand that tenant A
has to unlock and unmap it, which would happen during the garbage
collection phase in our deployment.
- For us, these are internal customers. Are 12-13 pools too many? I was
thinking that if this scales up to 100, we are good.
- I heard something about Ceph namespaces which would scale for different
customers. Isn't that implemented yet? I couldn't find any documentation
for it.





On Mon, Jun 26, 2017 at 7:12 AM, David Turner  wrote:

> I don't know specifics on Kubernetes or creating multiple keyrings for
> servers, so I'll leave those for someone else.  I will say that if you are
> kernel mapping your RBDs, then the first tenant to do so will lock the RBD
> and no other tenant can map it.  This is built into Ceph.  The original
> tenant would need to unmap it for the second to be able to access it.  This
> is different if you are not mapping RBDs and just using librbd to deal with
> them.
>
> Multiple pools in Ceph are not free.  Pools are a fairly costly resource
> in Ceph because data for pools is stored in PGs, the PGs are stored and
> distributed between the OSDs in your cluster, and the more PGs an OSD has
> the more memory requirements that OSD has.  It does not scale infinitely.
> If you are talking about one Pool per customer on a dozen or less
> customers, then it might work for your use case, but again it doesn't scale
> to growing the customer base.
>
> RBD map could be run remotely via SSH, but that isn't what you were asking
> about.  I don't know of any functionality that allows you to use a keyring
> on server A to map an RBD on server B.
>
> "Ceph Statistics" is VERY broad.  Are you talking IOPS, disk usage,
> throughput, etc?  disk usage is incredibly simple to calculate, especially
> if the RBD has object-map enabled.  A simple rbd du rbd_name would give you
> the disk usage per RBD and return in seconds.
>
> On Mon, Jun 26, 2017 at 2:00 AM Mayank Kumar  wrote:
>
>> Hi Ceph Users
>> I am relatively new to Ceph and trying to Provision CEPH RBD Volumes
>> using Kubernetes.
>>
>> I would like to know what are the best practices for hosting a multi
>> tenant CEPH cluster. Specifically i have the following questions:-
>>
>> - Is it ok to share a single Ceph Pool amongst multiple tenants ?  If
>> yes, how do you guarantee that volumes of one Tenant are not
>>  accessible(mountable/mapable/unmappable/deleteable/mutable) to other
>> tenants ?
>> - Can a single Ceph Pool have multiple admin and user keyrings generated
>> for rbd create and rbd map commands ? This way i want to assign different
>> keyrings to each tenant
>>
>> - can a rbd map command be run remotely for any node on which we want to
>> mount RBD Volumes or it must be run from the same node on which we want to
>> mount ? Is this going to be possible in the future ?
>>
>> - In terms of ceph fault tolerance and resiliency, is one ceph pool per
>> customer a better design or a single pool must be shared with mutiple
>> customers
>> - In a single pool for all customers, how can we get the ceph statistics
>> per customer ? Is it possible to somehow derive this from the RBD volumes ?
>>
>> Thanks for your responses
>> Mayank
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] free space calculation

2017-06-26 Thread Papp Rudolf Péter

Dear cephers,

Could someone show me a URL where I can find how Ceph calculates the
available space?


I've installed a small Ceph (Kraken) environment with BlueStore OSDs.
Each server contains 2 disks and 1 SSD. On each disk, partition 1 is UEFI
(~500 MB), partition 2 is RAID (~50 GB), and partition 3 is the Ceph OSD
(450-950 GB). One server has two 500 GB HDDs and two servers have 1 TB
HDDs, three servers in total.


For example the HDD parts:
/dev/sdb1  2048 976895 974848   476M EFI System
/dev/sdb2976896   98633727   97656832  46,6G Linux RAID
/dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
info from ceph-disk:
 /dev/sda :
 /dev/sda1 other, vfat
 /dev/sda2 other, linux_raid_member
 /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db /dev/sdc1, 
block.wal /dev/sdc2

/dev/sdb :
 /dev/sdb1 other, vfat, mounted on /boot/efi
 /dev/sdb2 other, linux_raid_member
 /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db /dev/sdc3, 
block.wal /dev/sdc4

/dev/sdc :
 /dev/sdc1 ceph block.db, for /dev/sda3
 /dev/sdc2 ceph block.wal, for /dev/sda3
 /dev/sdc3 ceph block.db, for /dev/sdb3
 /dev/sdc4 ceph block.wal, for /dev/sdb3

The reported size from ceph osd df tree:
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
-1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
-2 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl2
 0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
 3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
-3 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl3
 1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
 4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
-4 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl1
 2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
 5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
  TOTAL   179G   104M   179G 0.06
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

~30 GB each is about 10 percent of the smallest real size, with 3x
replication. Could it be possible that the system is using the wrong
partition (partition 2 in this scenario) for the usable space calculation?
Can I write more data than the calculated amount?

Any other hint?

Thank you!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] free space calculation

2017-06-26 Thread David Turner
What is the output of `lsblk`?

On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter  wrote:

> Dear cephers,
>
> Could someone show me an url where can I found how ceph calculate the
> available space?
>
> I've installed a small ceph (Kraken) environment with bluestore OSDs.
> The servers contains 2 disks and 1 ssd. The disk 1. part is UEFI (~500
> MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with 2 500 GB
> HDDs, 2 with 1 TB HDDs total 3 servers.
>
> For example the HDD parts:
> /dev/sdb1  2048 976895 974848   476M EFI System
> /dev/sdb2976896   98633727   97656832  46,6G Linux RAID
> /dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
> info from ceph-disk:
>   /dev/sda :
>   /dev/sda1 other, vfat
>   /dev/sda2 other, linux_raid_member
>   /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db /dev/sdc1,
> block.wal /dev/sdc2
> /dev/sdb :
>   /dev/sdb1 other, vfat, mounted on /boot/efi
>   /dev/sdb2 other, linux_raid_member
>   /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db /dev/sdc3,
> block.wal /dev/sdc4
> /dev/sdc :
>   /dev/sdc1 ceph block.db, for /dev/sda3
>   /dev/sdc2 ceph block.wal, for /dev/sda3
>   /dev/sdc3 ceph block.db, for /dev/sdb3
>   /dev/sdc4 ceph block.wal, for /dev/sdb3
>
> The reported size from ceph osd df tree:
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
> -1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
> -2 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl2
>   0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
>   3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
> -3 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl3
>   1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
>   4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
> -4 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl1
>   2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
>   5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
>TOTAL   179G   104M   179G 0.06
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>
> ~ 30GB each 10 percent of the smallest real size. 3x replication. Could
> be possible that the system using wrong partition (2. in this scenario)
> for usable space calculation? Can I write more data than the calculated?
>
> Another hint?
>
> Thank you!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] free space calculation

2017-06-26 Thread David Turner
The output of `sudo df -h` would also be helpful.  Sudo/root is generally
required because the OSD folders are only readable by the Ceph user.

On Mon, Jun 26, 2017 at 4:37 PM David Turner  wrote:

> What is the output of `lsblk`?
>
> On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter  wrote:
>
>> Dear cephers,
>>
>> Could someone show me an url where can I found how ceph calculate the
>> available space?
>>
>> I've installed a small ceph (Kraken) environment with bluestore OSDs.
>> The servers contains 2 disks and 1 ssd. The disk 1. part is UEFI (~500
>> MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with 2 500 GB
>> HDDs, 2 with 1 TB HDDs total 3 servers.
>>
>> For example the HDD parts:
>> /dev/sdb1  2048 976895 974848   476M EFI System
>> /dev/sdb2976896   98633727   97656832  46,6G Linux RAID
>> /dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
>> info from ceph-disk:
>>   /dev/sda :
>>   /dev/sda1 other, vfat
>>   /dev/sda2 other, linux_raid_member
>>   /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db /dev/sdc1,
>> block.wal /dev/sdc2
>> /dev/sdb :
>>   /dev/sdb1 other, vfat, mounted on /boot/efi
>>   /dev/sdb2 other, linux_raid_member
>>   /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db /dev/sdc3,
>> block.wal /dev/sdc4
>> /dev/sdc :
>>   /dev/sdc1 ceph block.db, for /dev/sda3
>>   /dev/sdc2 ceph block.wal, for /dev/sda3
>>   /dev/sdc3 ceph block.db, for /dev/sdb3
>>   /dev/sdc4 ceph block.wal, for /dev/sdb3
>>
>> The reported size from ceph osd df tree:
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
>> -1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
>> -2 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl2
>>   0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
>>   3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
>> -3 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl3
>>   1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
>>   4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
>> -4 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl1
>>   2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
>>   5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
>>TOTAL   179G   104M   179G 0.06
>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>
>> ~ 30GB each 10 percent of the smallest real size. 3x replication. Could
>> be possible that the system using wrong partition (2. in this scenario)
>> for usable space calculation? Can I write more data than the calculated?
>>
>> Another hint?
>>
>> Thank you!
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] free space calculation

2017-06-26 Thread Papp Rudolf Péter

Hi David!

lsblk:

NAMEMAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda   8:00 931,5G  0 disk
├─sda18:10   476M  0 part
├─sda28:20  46,6G  0 part
│ └─md0   9:00  46,5G  0 raid1 /
└─sda38:30 884,5G  0 part /var/lib/ceph/osd/ceph-3
sdb   8:16   0 931,5G  0 disk
├─sdb18:17   0   476M  0 part  /boot/efi
├─sdb28:18   0  46,6G  0 part
│ └─md0   9:00  46,5G  0 raid1 /
└─sdb38:19   0 884,5G  0 part /var/lib/ceph/osd/ceph-0
sdc   8:32   0 232,9G  0 disk
├─sdc18:33   020G  0 part
├─sdc28:34   0   576M  0 part
├─sdc38:35   020G  0 part
└─sdc48:36   0   576M  0 part


2017-06-26 22:37 keltezéssel, David Turner írta:

What is the output of `lsblk`?

On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter > wrote:


Dear cephers,

Could someone show me an url where can I found how ceph calculate the
available space?

I've installed a small ceph (Kraken) environment with bluestore OSDs.
The servers contains 2 disks and 1 ssd. The disk 1. part is UEFI (~500
MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with 2 500 GB
HDDs, 2 with 1 TB HDDs total 3 servers.

For example the HDD parts:
/dev/sdb1  2048 976895 974848   476M EFI System
/dev/sdb2976896   98633727   97656832  46,6G Linux RAID
/dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
info from ceph-disk:
  /dev/sda :
  /dev/sda1 other, vfat
  /dev/sda2 other, linux_raid_member
  /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db
/dev/sdc1,
block.wal /dev/sdc2
/dev/sdb :
  /dev/sdb1 other, vfat, mounted on /boot/efi
  /dev/sdb2 other, linux_raid_member
  /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db
/dev/sdc3,
block.wal /dev/sdc4
/dev/sdc :
  /dev/sdc1 ceph block.db, for /dev/sda3
  /dev/sdc2 ceph block.wal, for /dev/sda3
  /dev/sdc3 ceph block.db, for /dev/sdb3
  /dev/sdc4 ceph block.wal, for /dev/sdb3

The reported size from ceph osd df tree:
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
-1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
-2 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl2
  0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
  3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
-3 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl3
  1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
  4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
-4 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl1
  2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
  5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
   TOTAL   179G   104M   179G 0.06
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

~ 30GB each 10 percent of the smallest real size. 3x replication.
Could
be possible that the system using wrong partition (2. in this
scenario)
for usable space calculation? Can I write more data than the
calculated?

Another hint?

Thank you!


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] free space calculation

2017-06-26 Thread Papp Rudolf Péter

sudo df -h:
udev3,9G 0  3,9G   0% /dev
tmpfs   790M   19M  771M   3% /run
/dev/md0 46G  2,5G   41G   6% /
tmpfs   3,9G 0  3,9G   0% /dev/shm
tmpfs   5,0M 0  5,0M   0% /run/lock
tmpfs   3,9G 0  3,9G   0% /sys/fs/cgroup
/dev/sdb1   476M  3,4M  472M   1% /boot/efi
/dev/sda3   885G  1,4G  883G   1% /var/lib/ceph/osd/ceph-3
/dev/sdb3   885G  1,6G  883G   1% /var/lib/ceph/osd/ceph-0
tmpfs   790M 0  790M   0% /run/user/1001


2017-06-26 22:39 keltezéssel, David Turner írta:
The output of `sudo df -h` would also be helpful. Sudo/root is 
generally required because the OSD folders are only readable by the 
Ceph user.


On Mon, Jun 26, 2017 at 4:37 PM David Turner > wrote:


What is the output of `lsblk`?

On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter mailto:p...@peer.hu>> wrote:

Dear cephers,

Could someone show me an url where can I found how ceph
calculate the
available space?

I've installed a small ceph (Kraken) environment with
bluestore OSDs.
The servers contains 2 disks and 1 ssd. The disk 1. part is
UEFI (~500
MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with
2 500 GB
HDDs, 2 with 1 TB HDDs total 3 servers.

For example the HDD parts:
/dev/sdb1  2048 976895 974848   476M EFI System
/dev/sdb2976896   98633727   97656832  46,6G Linux RAID
/dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
info from ceph-disk:
  /dev/sda :
  /dev/sda1 other, vfat
  /dev/sda2 other, linux_raid_member
  /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db
/dev/sdc1,
block.wal /dev/sdc2
/dev/sdb :
  /dev/sdb1 other, vfat, mounted on /boot/efi
  /dev/sdb2 other, linux_raid_member
  /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db
/dev/sdc3,
block.wal /dev/sdc4
/dev/sdc :
  /dev/sdc1 ceph block.db, for /dev/sda3
  /dev/sdc2 ceph block.wal, for /dev/sda3
  /dev/sdc3 ceph block.db, for /dev/sdb3
  /dev/sdc4 ceph block.wal, for /dev/sdb3

The reported size from ceph osd df tree:
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
-1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
-2 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl2
  0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
  3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
-3 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl3
  1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
  4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
-4 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl1
  2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
  5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
   TOTAL   179G   104M   179G 0.06
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

~ 30GB each 10 percent of the smallest real size. 3x
replication. Could
be possible that the system using wrong partition (2. in this
scenario)
for usable space calculation? Can I write more data than the
calculated?

Another hint?

Thank you!


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi Tenancy in Ceph RBD Cluster

2017-06-26 Thread Jason Dillaman
On Mon, Jun 26, 2017 at 2:55 PM, Mayank Kumar  wrote:
> Thanks David, few more questions:-
> - Is there a way to limit the capability of the keyring which is used to
> map/unmap/lock to only allow those operations and nothing else using that
> specific keyring

Since RBD is basically just a collection of helper logic that wraps
low-level RADOS commands, there is no (current) way to restrict
operations at such a granularity.
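
Pool-scoped caps are about as fine-grained as it gets today; a per-tenant
keyring might be created like this (a sketch; the client and pool names are
hypothetical), and it still cannot restrict individual RBD operations inside
that pool:

  ceph auth get-or-create client.tenant-a \
      mon 'allow r' \
      osd 'allow rwx pool=tenant-a-pool' \
      -o /etc/ceph/ceph.client.tenant-a.keyring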

> - For a single pool, is there a way to generate multiple keyrings where a
> rbd cannot be mapped by tenant A using keyring A, if it was mapped using a
> different keyring created only for tenant B. I understand, that tenant A has
> to unlock and unmap it, which would happen during the garbage collection
> phase in our deployment.

Negative.

> - For us ,these are internal customers. Is 12-13 pools too much. I was
> thinking if this scales upto 100, we are good.
> - I heard something about ceph namespaces which would scale for different
> customers. Isnt that implemented yet ? I couldnt find a any documentation
> for it ?

Adding namespace support to RBD would probably be the best approach to
segregate tenants. However, this item is still on the backlog for a
future release since it's never really bubbled up to the top given the
current use-cases for RBD.

>
>
>
> On Mon, Jun 26, 2017 at 7:12 AM, David Turner  wrote:
>>
>> I don't know specifics on Kubernetes or creating multiple keyrings for
>> servers, so I'll leave those for someone else.  I will say that if you are
>> kernel mapping your RBDs, then the first tenant to do so will lock the RBD
>> and no other tenant can map it.  This is built into Ceph.  The original
>> tenant would need to unmap it for the second to be able to access it.  This
>> is different if you are not mapping RBDs and just using librbd to deal with
>> them.
>>
>> Multiple pools in Ceph are not free.  Pools are a fairly costly resource
>> in Ceph because data for pools is stored in PGs, the PGs are stored and
>> distributed between the OSDs in your cluster, and the more PGs an OSD has
>> the more memory requirements that OSD has.  It does not scale infinitely.
>> If you are talking about one Pool per customer on a dozen or less customers,
>> then it might work for your use case, but again it doesn't scale to growing
>> the customer base.
>>
>> RBD map could be run remotely via SSH, but that isn't what you were asking
>> about.  I don't know of any functionality that allows you to use a keyring
>> on server A to map an RBD on server B.
>>
>> "Ceph Statistics" is VERY broad.  Are you talking IOPS, disk usage,
>> throughput, etc?  disk usage is incredibly simple to calculate, especially
>> if the RBD has object-map enabled.  A simple rbd du rbd_name would give you
>> the disk usage per RBD and return in seconds.
>>
>> On Mon, Jun 26, 2017 at 2:00 AM Mayank Kumar  wrote:
>>>
>>> Hi Ceph Users
>>> I am relatively new to Ceph and trying to Provision CEPH RBD Volumes
>>> using Kubernetes.
>>>
>>> I would like to know what are the best practices for hosting a multi
>>> tenant CEPH cluster. Specifically i have the following questions:-
>>>
>>> - Is it ok to share a single Ceph Pool amongst multiple tenants ?  If
>>> yes, how do you guarantee that volumes of one Tenant are not
>>> accessible(mountable/mapable/unmappable/deleteable/mutable) to other tenants
>>> ?
>>> - Can a single Ceph Pool have multiple admin and user keyrings generated
>>> for rbd create and rbd map commands ? This way i want to assign different
>>> keyrings to each tenant
>>>
>>> - can a rbd map command be run remotely for any node on which we want to
>>> mount RBD Volumes or it must be run from the same node on which we want to
>>> mount ? Is this going to be possible in the future ?
>>>
>>> - In terms of ceph fault tolerance and resiliency, is one ceph pool per
>>> customer a better design or a single pool must be shared with mutiple
>>> customers
>>> - In a single pool for all customers, how can we get the ceph statistics
>>> per customer ? Is it possible to somehow derive this from the RBD volumes ?
>>>
>>> Thanks for your responses
>>> Mayank
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] free space calculation

2017-06-26 Thread David Turner
And the `sudo df -h`?  Also a `ceph df` might be helpful to see what's
going on.

On Mon, Jun 26, 2017 at 4:41 PM Papp Rudolf Péter  wrote:

> Hi David!
>
> lsblk:
>
> NAMEMAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sda   8:00 931,5G  0 disk
> ├─sda18:10   476M  0 part
> ├─sda28:20  46,6G  0 part
> │ └─md0   9:00  46,5G  0 raid1 /
> └─sda38:30 884,5G  0 part  /var/lib/ceph/osd/ceph-3
> sdb   8:16   0 931,5G  0 disk
> ├─sdb18:17   0   476M  0 part  /boot/efi
> ├─sdb28:18   0  46,6G  0 part
> │ └─md0   9:00  46,5G  0 raid1 /
> └─sdb38:19   0 884,5G  0 part  /var/lib/ceph/osd/ceph-0
> sdc   8:32   0 232,9G  0 disk
> ├─sdc18:33   020G  0 part
> ├─sdc28:34   0   576M  0 part
> ├─sdc38:35   020G  0 part
> └─sdc48:36   0   576M  0 part
>
>
> 2017-06-26 22:37 keltezéssel, David Turner írta:
>
> What is the output of `lsblk`?
>
> On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter  wrote:
>
>> Dear cephers,
>>
>> Could someone show me an url where can I found how ceph calculate the
>> available space?
>>
>> I've installed a small ceph (Kraken) environment with bluestore OSDs.
>> The servers contains 2 disks and 1 ssd. The disk 1. part is UEFI (~500
>> MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with 2 500 GB
>> HDDs, 2 with 1 TB HDDs total 3 servers.
>>
>> For example the HDD parts:
>> /dev/sdb1  2048 976895 974848   476M EFI System
>> /dev/sdb2976896   98633727   97656832  46,6G Linux RAID
>> /dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
>> info from ceph-disk:
>>   /dev/sda :
>>   /dev/sda1 other, vfat
>>   /dev/sda2 other, linux_raid_member
>>   /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db /dev/sdc1,
>> block.wal /dev/sdc2
>> /dev/sdb :
>>   /dev/sdb1 other, vfat, mounted on /boot/efi
>>   /dev/sdb2 other, linux_raid_member
>>   /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db /dev/sdc3,
>> block.wal /dev/sdc4
>> /dev/sdc :
>>   /dev/sdc1 ceph block.db, for /dev/sda3
>>   /dev/sdc2 ceph block.wal, for /dev/sda3
>>   /dev/sdc3 ceph block.db, for /dev/sdb3
>>   /dev/sdc4 ceph block.wal, for /dev/sdb3
>>
>> The reported size from ceph osd df tree:
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
>> -1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
>> -2 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl2
>>   0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
>>   3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
>> -3 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl3
>>   1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
>>   4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
>> -4 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl1
>>   2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
>>   5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
>>TOTAL   179G   104M   179G 0.06
>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>
>> ~ 30GB each 10 percent of the smallest real size. 3x replication. Could
>> be possible that the system using wrong partition (2. in this scenario)
>> for usable space calculation? Can I write more data than the calculated?
>>
>> Another hint?
>>
>> Thank you!
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm vms start or reboot hang long time while using the rbd mapped image

2017-06-26 Thread Jason Dillaman
May I ask why you are using krbd with QEMU instead of librbd?

On Fri, Jun 16, 2017 at 12:18 PM, 码云  wrote:
> Hi All,
> Recently.I meet a question and I did'nt find out any thing for explain it.
>
> Ops process like blow:
> ceph 10.2.5  jewel, qemu 2.5.0  centos 7.2 x86_64
> create pool  rbd_vms  3  replications with cache tier pool 3 replication
> too.
> create 100 images in rbd_vms
> rbd map 100 image to local device, like  /dev/rbd0  ... /dev/rbd100
> dd if=/root/win7.qcow2  of=/dev/rbd0 bs=1M count=3000
> virsh define 100 vms(vm0... vm100), 1 vms  configured 1 /dev/rbd .
> virsh start  100 vms.
>
> when the 100 vms start concurrence, will caused some vms hang.
> when do fio testing in those vms, will casued some vms hang .
>
> I checked ceph status ,osd status , log etc.  all like same as before.
>
> but check device with  iostat -dx 1,   some  rbd* device are  strange.
> util% are 100% full, but  read and wirte count all are zero.
>
> i checked virsh log, vms log etc, but not found any useful info.
>
> Can help to fingure out some infomartion.  librbd krbd or other place is
> need to adjust some arguments?
>
> Thanks All.
>
> --
> 王勇
> 上海德拓信息技术股份有限公司-成都研发中心
> 手机:15908149443
> 邮箱:wangy...@datatom.com
> 地址:四川省成都市天府大道666号希顿国际广场C座1409
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] free space calculation

2017-06-26 Thread David Turner
I'm not seeing anything that would indicate a problem. The
weights, cluster size, etc. all say that Ceph only sees 30GB per OSD. I
don't see what is causing the discrepancy.  Anyone else have any ideas?

On Mon, Jun 26, 2017, 5:02 PM Papp Rudolf Péter  wrote:

> sudo df -h:
> udev3,9G 0  3,9G   0% /dev
> tmpfs   790M   19M  771M   3% /run
> /dev/md0 46G  2,5G   41G   6% /
> tmpfs   3,9G 0  3,9G   0% /dev/shm
> tmpfs   5,0M 0  5,0M   0% /run/lock
> tmpfs   3,9G 0  3,9G   0% /sys/fs/cgroup
> /dev/sdb1   476M  3,4M  472M   1% /boot/efi
> /dev/sda3   885G  1,4G  883G   1% /var/lib/ceph/osd/ceph-3
> /dev/sdb3   885G  1,6G  883G   1% /var/lib/ceph/osd/ceph-0
> tmpfs   790M 0  790M   0% /run/user/1001
>
>
> 2017-06-26 22:39 keltezéssel, David Turner írta:
>
> The output of `sudo df -h` would also be helpful.  Sudo/root is generally
> required because the OSD folders are only readable by the Ceph user.
>
> On Mon, Jun 26, 2017 at 4:37 PM David Turner 
> wrote:
>
>> What is the output of `lsblk`?
>>
>> On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter  wrote:
>>
>>> Dear cephers,
>>>
>>> Could someone show me an url where can I found how ceph calculate the
>>> available space?
>>>
>>> I've installed a small ceph (Kraken) environment with bluestore OSDs.
>>> The servers contains 2 disks and 1 ssd. The disk 1. part is UEFI (~500
>>> MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with 2 500 GB
>>> HDDs, 2 with 1 TB HDDs total 3 servers.
>>>
>>> For example the HDD parts:
>>> /dev/sdb1  2048 976895 974848   476M EFI System
>>> /dev/sdb2976896   98633727   97656832  46,6G Linux RAID
>>> /dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
>>> info from ceph-disk:
>>>   /dev/sda :
>>>   /dev/sda1 other, vfat
>>>   /dev/sda2 other, linux_raid_member
>>>   /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db /dev/sdc1,
>>> block.wal /dev/sdc2
>>> /dev/sdb :
>>>   /dev/sdb1 other, vfat, mounted on /boot/efi
>>>   /dev/sdb2 other, linux_raid_member
>>>   /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db /dev/sdc3,
>>> block.wal /dev/sdc4
>>> /dev/sdc :
>>>   /dev/sdc1 ceph block.db, for /dev/sda3
>>>   /dev/sdc2 ceph block.wal, for /dev/sda3
>>>   /dev/sdc3 ceph block.db, for /dev/sdb3
>>>   /dev/sdc4 ceph block.wal, for /dev/sdb3
>>>
>>> The reported size from ceph osd df tree:
>>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
>>> -1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
>>> -2 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl2
>>>   0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
>>>   3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
>>> -3 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl3
>>>   1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
>>>   4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
>>> -4 0.05859- 61439M 35696k 61405M 0.06 1.00   0 host cl1
>>>   2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
>>>   5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
>>>TOTAL   179G   104M   179G 0.06
>>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>>
>>> ~ 30GB each 10 percent of the smallest real size. 3x replication. Could
>>> be possible that the system using wrong partition (2. in this scenario)
>>> for usable space calculation? Can I write more data than the calculated?
>>>
>>> Another hint?
>>>
>>> Thank you!
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Christian Balzer
On Mon, 26 Jun 2017 15:06:46 +0100 Nick Fisk wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Willem Jan Withagen
> > Sent: 26 June 2017 14:35
> > To: Christian Wuerdig 
> > Cc: Ceph Users 
> > Subject: Re: [ceph-users] Ceph random read IOPS
> > 
> > On 26-6-2017 09:01, Christian Wuerdig wrote:  
> > > Well, preferring faster clock CPUs for SSD scenarios has been floated
> > > several times over the last few months on this list. And realistic or
> > > not, Nick's and Kostas' setup are similar enough (testing single disk)
> > > that it's a distinct possibility.
> > > Anyway, as mentioned measuring the performance counters would  
> > probably  
> > > provide more insight.  
> > 
> > I read the advise as:
> > prefer GHz over cores.
> > 
> > And especially since there is a sort of balance between either GHz or  
> cores,
> > that can be an expensive one. Getting both means you have to pay  
> relatively
> > substantial more money.
> > 
> > And for an average Ceph server with plenty OSDs, I personally just don't  
> buy
> > that. There you'd have to look at the total throughput of the the system,  
> and
> > latency is only one of the many factors.
> > 
> > Let alone in a cluster with several hosts (and or racks). There the  
> latency is
> > dictated by the network. So a bad choice of network card or switch will  
> out
> > do any extra cycles that your CPU can burn.
> > 
> > I think that just testing 1 OSD is testing artifacts, and has very little  
> to do with
> > running an actual ceph cluster.
> > 
> > So if one would like to test this, the test setup should be something
> > like: 3 hosts with something like 3 disks per host, min_disk=2  and a nice
> > workload.
> > Then turn the GHz-knob and see what happens with client latency and
> > throughput.  
> 
> Did similar tests last summer. 5 nodes with 12x 7.2k disks each, connected
> via 10G. NVME journal. 3x replica pool.
> 
> First test was with C-states left to auto and frequency scaling leaving the
> cores at lowest frequency of 900mhz. The cluster will quite happily do a
> couple of thousand IO's without generating enough CPU load to boost the 4
> cores up to max C-state or frequency.
> 
> With small background IO going on in background, a QD=1 sequential 4kb write
> was done with the following results:
> 
> write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
> slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
> clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
>  lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
> clat percentiles (usec):
>  |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
>  | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
>  | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
>  | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
>  | 99.99th=[22400]
> 
> So just under 2.5ms write latency.
> 
> I don't have the results from the separate C-states/frequency scaling, but
> adjusting either got me a boost. Forcing to C1 and max frequency of 3.6Ghz
> got me:
> 
> write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
> slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
> clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
>  lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
> clat percentiles (usec):
>  |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
>  | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
>  | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
>  | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
>  | 99.99th=[24448]
> 
> Quite a bit faster. Although these are best case figures, if any substantial
> workload is run, the average tends to hover around 1ms latency.
> 

And that's that.

If you care about latency and/or your "high" IOPS load is such that it
would still fit on a single core (real CPU usage of the OSD process less
than 100%), then fewer, faster cores are definitely the way to go.
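
For anyone who wants to repeat this kind of comparison, a rough sketch (the
target device and the use of cpupower/fio are assumptions about the test
host, and the write test destroys whatever is on the target):

  # force maximum frequency (cpupower comes from linux-tools / kernel-tools)
  cpupower frequency-set -g performance

  # QD=1 sequential 4k write latency test against a mapped RBD (destructive!)
  fio --name=qd1-write --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based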

Unfortunately single chip systems with current Intel offerings tend to
limit you as far as size and PCIe connections are concerned, not more than
6 SSDs realistically. So if you want more storage devices (need more cores)
or use NVMe (need more PCIe lanes), then you're forced into using dual CPU
systems, both paying for that pleasure by default and with 2 NUMA nodes as
well.

I predict that servers based on the new AMD Epyc CPUs will make absolutely
lovely OSD hosts, having loads of I/O, PCIe channels, plenty of fast
cores, full speed interconnect between dies (if you need more than 8 real
cores), thus basically all in a single NUMA zone with single chip systems.


As for the OP, if you're still reading this thread that is, your
assumption that a device that can do 300K IOPS (reads, also not s

Re: [ceph-users] free space calculation

2017-06-26 Thread Papp Rudolf Péter

sudo df -h:
udev3,9G 0  3,9G   0% /dev
tmpfs   790M   19M  771M   3% /run
/dev/md0 46G  2,5G   41G   6% /
tmpfs   3,9G 0  3,9G   0% /dev/shm
tmpfs   5,0M 0  5,0M   0% /run/lock
tmpfs   3,9G 0  3,9G   0% /sys/fs/cgroup
/dev/sdb1   476M  3,4M  472M   1% /boot/efi
/dev/sda3   885G  1,4G  883G   1% /var/lib/ceph/osd/ceph-3
/dev/sdb3   885G  1,6G  883G   1% /var/lib/ceph/osd/ceph-0
tmpfs   790M 0  790M   0% /run/user/1001

ceph df:
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
179G  179G 116M  0.06
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
dev  6 0 061401M   0


2017-06-26 22:55 keltezéssel, David Turner írta:
And the `sudo df -h`?  Also a `ceph df` might be helpful to see what's 
going on.


On Mon, Jun 26, 2017 at 4:41 PM Papp Rudolf Péter > wrote:


Hi David!

lsblk:

NAMEMAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda   8:00 931,5G  0 disk
├─sda18:10   476M  0 part
├─sda28:20  46,6G  0 part
│ └─md0   9:00  46,5G  0 raid1 /
└─sda38:30 884,5G  0 part /var/lib/ceph/osd/ceph-3
sdb   8:16   0 931,5G  0 disk
├─sdb18:17   0   476M  0 part  /boot/efi
├─sdb28:18   0  46,6G  0 part
│ └─md0   9:00  46,5G  0 raid1 /
└─sdb38:19   0 884,5G  0 part /var/lib/ceph/osd/ceph-0
sdc   8:32   0 232,9G  0 disk
├─sdc18:33   020G  0 part
├─sdc28:34   0   576M  0 part
├─sdc38:35   020G  0 part
└─sdc48:36   0   576M  0 part


2017-06-26 22:37 keltezéssel, David Turner írta:

What is the output of `lsblk`?

On Mon, Jun 26, 2017 at 4:32 PM Papp Rudolf Péter mailto:p...@peer.hu>> wrote:

Dear cephers,

Could someone show me an url where can I found how ceph
calculate the
available space?

I've installed a small ceph (Kraken) environment with
bluestore OSDs.
The servers contains 2 disks and 1 ssd. The disk 1. part is
UEFI (~500
MB), 2. raid (~50GB), 3. ceph disk (450-950MB). 1 server with
2 500 GB
HDDs, 2 with 1 TB HDDs total 3 servers.

For example the HDD parts:
/dev/sdb1  2048 976895 974848   476M EFI System
/dev/sdb2976896   98633727   97656832  46,6G Linux RAID
/dev/sdb3  98633728 1953525134 1854891407 884,5G Ceph OSD
info from ceph-disk:
  /dev/sda :
  /dev/sda1 other, vfat
  /dev/sda2 other, linux_raid_member
  /dev/sda3 ceph data, active, cluster ceph, osd.4, block.db
/dev/sdc1,
block.wal /dev/sdc2
/dev/sdb :
  /dev/sdb1 other, vfat, mounted on /boot/efi
  /dev/sdb2 other, linux_raid_member
  /dev/sdb3 ceph data, active, cluster ceph, osd.1, block.db
/dev/sdc3,
block.wal /dev/sdc4
/dev/sdc :
  /dev/sdc1 ceph block.db, for /dev/sda3
  /dev/sdc2 ceph block.wal, for /dev/sda3
  /dev/sdc3 ceph block.db, for /dev/sdb3
  /dev/sdc4 ceph block.wal, for /dev/sdb3

The reported size from ceph osd df tree:
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE VAR  PGS TYPE NAME
-1 0.17578-   179G   104M   179G 0.06 1.00   0 root default
-2 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl2
 0 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.0
 3 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.3
-3 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl3
 1 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.1
 4 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.4
-4 0.05859- 61439M 35696k 61405M 0.06 1.00   0  host cl1
 2 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.2
 5 0.02930  1.0 30719M 17848k 30702M 0.06 1.00   0 osd.5
   TOTAL   179G   104M   179G 0.06
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

~ 30GB each 10 percent of the smallest real size. 3x
replication. Could
be possible that the system using wrong partition (2. in this
scenario)
for usable space calculation? Can I write more data than the
calculated?

Another hint?

Thank you!


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] ceph-mon not starting on Ubuntu 16.04 with Luminous RC

2017-06-26 Thread Wido den Hollander
Hi,

Just checking before I start looking into ceph-deploy if the behavior I'm 
seeing is correct.

On a freshly installed Ubuntu 16.04 + Luminous 12.1.0 system I see that my 
ceph-mon services aren't starting on boot.

Deployed Ceph on three machines: alpha, bravo and charlie. Using 'alpha' I've 
deployed using ceph-deploy.

I found my thread from last year and followed the hints there: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009617.html

I checked the dependencies and ceph-mon.target wasn't enabled nor was it 
depending on ceph-mon@bravo (same happens on alpha and charlie).

I enabled the ceph-mon.target and re-enabled ceph-mon@bravo and now it works.
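
Concretely, the manual fix on each node was roughly (hostname 'bravo' as in
this setup):

  systemctl enable ceph-mon.target
  systemctl enable ceph-mon@bravo
  systemctl start ceph-mon@bravo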

root@bravo:~# systemctl list-dependencies ceph.target
ceph.target
● ├─ceph-mon.target
● ├─ceph-mon.target
● └─ceph-osd.target
●   ├─ceph-osd@1.service
●   └─ceph-osd@1.service
root@bravo:~# systemctl enable ceph-mon@bravo
Created symlink from 
/etc/systemd/system/ceph-mon.target.wants/ceph-mon@bravo.service to 
/lib/systemd/system/ceph-mon@.service.
root@bravo:~# systemctl list-dependencies ceph.target
ceph.target
● ├─ceph-mgr.target
● │ ├─ceph-mgr@bravo.service
● │ └─ceph-mgr@bravo.service
● ├─ceph-mon.target
● │ ├─ceph-mon@bravo.service
● │ └─ceph-mon@bravo.service
● ├─ceph-mon.target
● │ ├─ceph-mon@bravo.service
● │ └─ceph-mon@bravo.service
● └─ceph-osd.target
●   ├─ceph-osd@1.service
●   └─ceph-osd@1.service
root@bravo:~#

Looking in my ceph-deploy log I found this:

[2017-06-26 11:20:57,386][bravo][INFO  ] Running command: systemctl enable 
ceph.target
[2017-06-26 11:20:57,396][bravo][WARNING] Created symlink from 
/etc/systemd/system/multi-user.target.wants/ceph.target to 
/lib/systemd/system/ceph.target.
[2017-06-26 11:20:57,465][bravo][INFO  ] Running command: systemctl enable 
ceph-mon@bravo
[2017-06-26 11:20:57,475][bravo][WARNING] Created symlink from 
/etc/systemd/system/ceph-mon.target.wants/ceph-mon@bravo.service to 
/lib/systemd/system/ceph-mon@.service.
[2017-06-26 11:20:57,541][bravo][INFO  ] Running command: systemctl start 
ceph-mon@bravo

So it enables ceph.target, ceph-mon@bravo and then starts the MON. But it never 
enabled ceph-mon.target. Is that correct?

If not I'll open a PR for ceph-deploy to enable ceph-mon.target as well.

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ?????? qemu-kvm vms start or reboot hang long time whileusing the rbd mapped image

2017-06-26 Thread 码云
Hi Jason,
In one integrated VDI test environment, we need to know the best practice.
It seems that librbd performance is weaker than krbd.
qemu 2.5.0 is not linked to librbd unless we manually configure and compile it.
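
One quick way to check whether a given qemu binary was built with rbd support
(a sketch; the binary path and image name are just examples):

  ldd /usr/bin/qemu-system-x86_64 | grep librbd
  qemu-img info rbd:rbd_vms/some-image   # should fail with an unknown
                                         # protocol error if rbd is missing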
By the way, the rbd and libceph kernel module code was changed in a lot of
places in CentOS 7.3; were those changes fixing something?
Thanks and regards.


 




-- Original Message --
From: "Jason Dillaman"
Date: 2017-06-27 07:28
To: "码云"
Cc: "ceph-users"
Subject: Re: [ceph-users] qemu-kvm vms start or reboot hang long time while using the rbd mapped image



May I ask why you are using krbd with QEMU instead of librbd?

On Fri, Jun 16, 2017 at 12:18 PM,   wrote:
> Hi All,
> Recently.I meet a question and I did'nt find out any thing for explain it.
>
> Ops process like blow:
> ceph 10.2.5  jewel, qemu 2.5.0  centos 7.2 x86_64
> create pool  rbd_vms  3  replications with cache tier pool 3 replication
> too.
> create 100 images in rbd_vms
> rbd map 100 image to local device, like  /dev/rbd0  ... /dev/rbd100
> dd if=/root/win7.qcow2  of=/dev/rbd0 bs=1M count=3000
> virsh define 100 vms(vm0... vm100), 1 vms  configured 1 /dev/rbd .
> virsh start  100 vms.
>
> when the 100 vms start concurrence, will caused some vms hang.
> when do fio testing in those vms, will casued some vms hang .
>
> I checked ceph status ,osd status , log etc.  all like same as before.
>
> but check device with  iostat -dx 1,   some  rbd* device are  strange.
> util% are 100% full, but  read and wirte count all are zero.
>
> i checked virsh log, vms log etc, but not found any useful info.
>
> Can help to fingure out some infomartion.  librbd krbd or other place is
> need to adjust some arguments?
>
> Thanks All.
>
> --
> 
> 王勇
> 上海德拓信息技术股份有限公司-成都研发中心
> 手机：15908149443
> 邮箱：wangy...@datatom.com
> 地址：四川省成都市天府大道666号希顿国际广场C座1409
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com