Re: [ceph-users] OSD size and performance

2016-01-06 Thread Srinivasula Maram
Hi Prabu,

We generally use SCSI-PR (persistent reservation) capable drives (the drive 
firmware must support it) for HA/cluster file systems. RBD does not support this 
feature because it is not a physical drive.
But, as you did, we can mount the same rbd across multiple clients after writing 
a file system (here OCFS2) to it.
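
As a rough sketch of that kind of setup (not from the original mail; the image 
name, node-slot count and mount point below are only examples, and the o2cb 
cluster must already be configured on each node):

    # Map the same image on every client node.
    rbd map rbd/ocfs2-shared            # shows up as e.g. /dev/rbd0

    # Create the OCFS2 filesystem once, from a single node, with enough node slots.
    mkfs.ocfs2 -N 4 -L shared /dev/rbd0

    # Mount it on each client.
    mount -t ocfs2 /dev/rbd0 /mnt/shared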

For more understanding about SCSI-PR in OCFS follow this link: 
http://www.dba-oracle.com/real_application_clusters_rac_grid/io_fencing.html

Thanks,
Srinivas

From: gjprabu [mailto:gjpr...@zohocorp.com]
Sent: Wednesday, January 06, 2016 10:58 AM
To: gjprabu
Cc: Srinivasula Maram; ceph-users; Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance

Hi srinivas,

Do we have any other options to check this issue?

Regards
Prabu

On Mon, 04 Jan 2016 17:32:03 +0530, gjprabu <gjpr...@zohocorp.com> wrote:


Hi  Srinivas,

I am not sure whether RBD supports SCSI reservations, but OCFS2 has the capability 
to lock and unlock while writing.

(kworker/u192:5,71152,28):dlm_unlock_lock_handler:424 lvb: none
(kworker/u192:5,71152,28):__dlm_lookup_lockres:232 
O00946c510c
(kworker/u192:5,71152,28):__dlm_lookup_lockres_full:198 
O00946c510c
(kworker/u192:5,71152,28):dlmunlock_common:111 master_node = 1, valblk = 0
(kworker/u192:5,71152,28):dlmunlock_common:251 lock 4:7162177 should be gone 
now! refs=1
(kworker/u192:5,71152,28):__dlm_dirty_lockres:483 
A895BC216BE641A8A7E20AA89D57E051: res O00946c510c
(kworker/u192:5,71152,28):dlm_lock_detach_lockres:393 removing lock's lockres 
reference
(kworker/u192:5,71152,28):dlm_lock_release:371 freeing kernel-allocated lksb
(kworker/u192:5,71152,28):__dlm_lookup_lockres_full:198 
O00946c4fd2
(kworker/u192:5,71152,28):dlm_lockres_clear_refmap_bit:651 res 
O00946c4fd2, clr node 4, dlm_deref_lockres_handler()

Regards
Prabu

On Mon, 04 Jan 2016 13:58:21 +0530, Srinivasula Maram <srinivasula.ma...@sandisk.com> wrote:




My point is that the rbd device should support SCSI reservations, so that OCFS2 can 
take a write lock while writing on a particular client, to avoid corruption.



Thanks,

Srinivas



From: gjprabu [mailto:gjpr...@zohocorp.com]
Sent: Monday, January 04, 2016 1:40 PM
To: Srinivasula Maram
Cc: Somnath Roy; ceph-users; Siva Sokkumuthu
Subject: RE: [ceph-users] OSD size and performance



Hi Srinivas,



  In our case, OCFS2 is not directly interacting with SCSI. Here we have 
Ceph storage that is mounted on many client systems using OCFS2. Moreover, OCFS2 
supports SCSI.



https://blogs.oracle.com/wim/entry/what_s_up_with_ocfs2

http://www.linux-mag.com/id/7809/



Regards

Prabu





On Mon, 04 Jan 2016 12:46:48 +0530, Srinivasula Maram <srinivasula.ma...@sandisk.com> wrote:



I doubt the rbd driver supports the SCSI reservations needed to mount the same rbd 
across multiple clients with OCFS?



Generally, the underlying device (here rbd) should have SCSI reservation support 
for a cluster file system.



Thanks,

Srinivas



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Somnath Roy
Sent: Monday, January 04, 2016 12:29 PM
To: gjprabu
Cc: ceph-users; Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance



Hi Prabu,

Check the krbd version (and libceph) running in the kernel. You can try 
building the latest krbd source for the 7.1 kernel if this is an option for you.

As I mentioned in my earlier mail, please isolate the problem the way I suggested, 
if that seems reasonable to you.



Thanks & Regards

Somnath



From: gjprabu [mailto:gjpr...@zohocorp.com]
Sent: Sunday, January 03, 2016 10:53 PM
To: gjprabu
Cc: Somnath Roy; ceph-users; Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance



Hi Somnath,



   Please check the details below and let us know if you need any 
other information.



Regards

Prabu



On Sat, 02 Jan 2016 08:47:05 +0530, gjprabu <gjpr...@zohocorp.com> wrote:



Hi Somnath,



   Please check the details and help me on this issue.



Regards

Prabu



On Thu, 31 Dec 2015 12:50:36 +0530, gjprabu <gjpr...@zohocorp.com> wrote:








Hi Somnath,



 We are using RBD; please find the Linux and rbd versions below. I agree this is 
related to a client-side issue. My thought went to backups, because we take a full 
(not incremental) backup once a week, and we noticed the issue once at that time, 
but I am not sure.



Linux version

CentOS Linux release 7.1.1503 (Core)

Kernel: 3.10.91



rbd --version

ceph

[ceph-users] very high OSD RAM usage values

2016-01-06 Thread Kenneth Waegeman

Hi all,

We experienced some serious trouble with our cluster: a running cluster 
started failing and set off a chain reaction until the ceph cluster was 
down, with about half of the OSDs down (in an EC pool).


Each host has 8 OSDs of 8 TB each (i.e. RAID 0 of two 4 TB disks) for an EC pool 
(10+3, 14 hosts), plus 2 cache OSDs, and 32 GB of RAM.
The reason we have the RAID 0 of the disks is that we tried with 16 
disks before, but 32 GB didn't seem to be enough to keep the cluster stable.


We don't know for sure what triggered the chain reaction, but what we 
certainly see is that, while recovering, our OSDs are using a lot of 
memory. We've seen some OSDs using almost 8 GB of RAM (resident; virtual 
11 GB).
So right now we don't have enough memory to recover the cluster, because 
the OSDs get killed by the OOM killer before they can recover.

And I don't know whether doubling our memory will be enough.

A few questions:

* Has anyone seen this before?
* 2 GB was still normal, but 8 GB seems a lot; is this expected behaviour?
* We didn't see this with a nearly empty cluster. Now it was filled to 
about 1/4 (270 TB). I guess it would become worse when filled half or more?
* How high can this memory usage become? Can we calculate the maximum 
memory usage of an OSD? Can we limit it?

* We can upgrade/reinstall to infernalis; will that solve anything?

This is related to a previous post of mine: 
http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22259
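
(A rough way to inspect per-OSD memory is the tcmalloc heap commands -- a 
sketch, assuming the OSDs are built with tcmalloc; osd.12 is a placeholder id:)

    # Heap statistics for one OSD (resident, mapped and freed-but-unreleased pages).
    ceph tell osd.12 heap stats

    # Ask tcmalloc to hand freed pages back to the OS, which sometimes lowers RSS.
    ceph tell osd.12 heap release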



Thank you very much !!

Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Long peering - throttle at FileStore::queue_transactions

2016-01-06 Thread Sage Weil
On Tue, 5 Jan 2016, Guang Yang wrote:
> On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil  wrote:
> > On Mon, 4 Jan 2016, Guang Yang wrote:
> >> Hi Cephers,
> >> Happy New Year! I got question regards to the long PG peering..
> >>
> >> Over the last several days I have been looking into the *long peering*
> >> problem when we start a OSD / OSD host, what I observed was that the
> >> two peering working threads were throttled (stuck) when trying to
> >> queue new transactions (writing pg log), thus the peering process are
> >> dramatically slow down.
> >>
> >> The first question came to me was, what were the transactions in the
> >> queue? The major ones, as I saw, included:
> >>
> >> - The osd_map and incremental osd_map, this happens if the OSD had
> >> been down for a while (in a large cluster), or when the cluster got
> >> upgrade, which made the osd_map epoch the down OSD had, was far behind
> >> the latest osd_map epoch. During the OSD booting, it would need to
> >> persist all those osd_maps and generate lots of filestore transactions
> >> (linear with the epoch gap).
> >> > As the PG was not involved in most of those epochs, could we only take 
> >> > and persist those osd_maps which matter to the PGs on the OSD?
> >
> > This part should happen before the OSD sends the MOSDBoot message, before
> > anyone knows it exists.  There is a tunable threshold that controls how
> > recent the map has to be before the OSD tries to boot.  If you're
> > seeing this in the real world, we probably just need to adjust that value
> > way down to something small(er).
> It would queue the transactions and then sends out the MOSDBoot, thus
> there is still a chance that it could have contention with the peering
> OPs (especially on large clusters where there are lots of activities
> which generates many osdmap epoch). Any chance we can change the
> *queue_transactions* to *apply_transactions*, so that we block there
> waiting for the osdmap to be persisted? At least we may be able to
> do that during OSD booting? The concern is, if the OSD is active, the
> apply_transaction would take longer with holding the osd_lock..
> I don't find such tuning, could you elaborate? Thanks!

Yeah, that sounds like a good idea (and clearly safe).  Probably a simpler 
fix is to just call store->flush() or similar before sending the boot 
message?

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] KVM problems when rebalance occurs

2016-01-06 Thread nick
Heya,
we are using a ceph cluster (6 Nodes with each having 10x4TB HDD + 2x SSD (for 
journal)) in combination with KVM virtualization. All our virtual machine hard 
disks are stored on the ceph cluster. The ceph cluster was updated to the 
'infernalis' release recently.

We are experiencing problems during cluster maintenance. A normal workflow for 
us looks like this:

- set the noout flag for the cluster
- stop all OSDs on one node
- update the node
- reboot the node
- start all OSDs
- wait for the backfilling to finish
- unset the noout flag

After we start all OSDs on the node again the cluster backfills and tries to 
get all the OSDs in sync. During the beginning of this process we experience 
'stalls' in our running virtual machines. On some, the load rises to a very 
high value. On others, a running webserver responds only with 5xx HTTP codes. 
It takes around 5-6 minutes until everything is ok again. After those 5-6 minutes the 
cluster is still backfilling, but the virtual machines behave normally again.

I already set the following parameters in ceph.conf on the nodes to have a 
better rebalance traffic/user traffic ratio:

"""
[osd]
osd max backfills = 1
osd backfill scan max = 8
osd backfill scan min = 4
osd recovery max active = 1
osd recovery op priority = 1
osd op threads = 8
"""

It helped a bit, but we are still experiencing the problems described above. It 
feels as if, for a short time, some virtual hard disks are locked. Our ceph 
nodes are using bonded 10G network interfaces for the 'OSD network', so I do 
not think the network is a bottleneck.
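
For what it's worth, the same limits can also be pushed to running OSDs without a 
restart via injectargs (a sketch; the values simply mirror the ceph.conf fragment above):

    # Apply the recovery/backfill limits to all OSDs at runtime.
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'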

After reading this blog post:
http://dachary.org/?p=2182
I wonder if there is really a 'read lock' during the object push.

Does anyone know more about this or do others have the same problems and were 
able to fix it?

Best Regards
Nick
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

signature.asc
Description: This is a digitally signed message part.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad sectors on rbd device?

2016-01-06 Thread Jan Schermer
I think you are running out of memory(?), or at least out of memory for the type of 
allocation krbd tries to use.
I'm not going to decode all the logs, but you can try increasing min_free_kbytes 
as the first step. I assume this is amd64, so there's no HIGHMEM trouble (I 
don't remember how to solve those).
It can happen either due to the system being under memory pressure (from device 
drivers and other in-kernel allocations) or because it is too slow to satisfy the 
allocation request in time (if it's a VM, for example). It can also be caused by a 
bug in the rbd client, of course...

A newer kernel almost always helps with VM troubles like this :-)
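
(A sketch of the min_free_kbytes change mentioned above; the value is only an 
example and should be sized to the host's RAM:)

    # Check the current reserve.
    sysctl vm.min_free_kbytes

    # Raise it, e.g. to 128 MB, and persist the change across reboots.
    sysctl -w vm.min_free_kbytes=131072
    echo 'vm.min_free_kbytes = 131072' >> /etc/sysctl.conf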

Jan


> On 05 Jan 2016, at 14:55, Philipp Schwaha  wrote:
> 
> Hi List,
> 
> I have an issue with an rbd device. I have an rbd device on which I
> created a file system. When I copy files to the file system I get issues
> about failing to write to a sector to sectors on the rbd block device.
> I see the following in the log file:
> 
> [88931.224311] rbd: rbd0: write 8 at 202e777000 result -12
> [88931.224317] blk_update_request: I/O error, dev rbd0, sector 269958072
> [88931.224542] rbd: rbd0: write 8 at 202e6f7000 result -12
> [88931.225908] rbd: rbd0: write 8 at 202e677000 result -12
> [88931.226198] rbd: rbd0: write 8 at 202e7f7000 result -12
> [88931.227501] rbd: rbd0: write 8 at 202e877000 result -12
> [88931.247151] rbd: rbd0: write 8 at 202eff7000 result -12
> [88931.247827] rbd: rbd0: write 8 at 202f077000 result -12
> 
> Looking further I found the following:
> 
> [88931.181608] warn_alloc_failed: 119 callbacks suppressed
> [88931.181616] kworker/2:13: page allocation failure: order:1, mode:0x204020
> [88931.181621] CPU: 2 PID: 7300 Comm: kworker/2:13 Tainted: G W 4.3.3-ge
> [88931.181636] Workqueue: rbd rbd_queue_workfn [rbd]
> [88931.181641] 88013c483ae0 813656c3 00204020
> 8114c438
> [88931.181645]  88017fff9b00 
> 
> [88931.181648]  0f12 00244220
> 
> [88931.181652] Call Trace:
> [88931.181665] [] ? dump_stack+0x40/0x5d
> [88931.181670] [] ? warn_alloc_failed+0xd8/0x130
> [88931.181673] [] ? __alloc_pages_nodemask+0x2b3/0x9e0
> [88931.181679] [] ? kmem_getpages+0x5d/0x100
> [88931.181683] [] ? fallback_alloc+0x141/0x1f0
> [88931.181686] [] ? kmem_cache_alloc+0x1e3/0x450
> [88931.181696] [] ? ceph_osdc_alloc_request+0x51/0x250
> [libceph]
> [88931.181700] [] ?
> rbd_osd_req_create.isra.25+0x51/0x1a0 [rbd]
> [88931.181704] [] ? rbd_img_request_fill+0x228/0x850 [rbd]
> [88931.181708] [] ? rbd_queue_workfn+0x2b9/0x3b0 [rbd]
> [88931.181713] [] ? process_one_work+0x14c/0x3b0
> [88931.181717] [] ? worker_thread+0x4d/0x440
> [88931.181720] [] ? rescuer_thread+0x2e0/0x2e0
> [88931.181724] [] ? kthread+0xbd/0xe0
> [88931.181727] [] ? kthread_park+0x50/0x50
> [88931.181731] [] ? ret_from_fork+0x3f/0x70
> [88931.181734] [] ? kthread_park+0x50/0x50
> [88931.181736] Mem-Info:
> [88931.181745] active_anon:57146 inactive_anon:65771 isolated_anon:0
> [88931.181745] active_file:405123 inactive_file:397563 isolated_file:0
> [88931.181745] unevictable:0 dirty:192 writeback:16100 unstable:0
> [88931.181745] slab_reclaimable:28501 slab_unreclaimable:8143
> [88931.181745] mapped:14501 shmem:24976 pagetables:1962 bounce:0
> [88931.181745] free:8824 free_pcp:816 free_cma:0
> [88931.181750] Node 0 DMA free:15436kB min:28kB low:32kB high:40kB
> active_anon:4kB in
> B inactive_file:28kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:15
> kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:48kB
> slab_unreclaima
> tables:4kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB writeback_
> eclaimable? no
> [88931.181758] lowmem_reserve[]: 0 1873 3856 3856
> [88931.181762] Node 0 DMA32 free:13720kB min:3800kB low:4748kB
> high:5700kB active_ano
> B active_file:806264kB inactive_file:776948kB unevictable:0kB
> isolated(anon):0kB isol
> B managed:1921632kB mlocked:0kB dirty:104kB writeback:9384kB
> mapped:35712kB shmem:514
> lab_unreclaimable:12532kB kernel_stack:2224kB pagetables:3464kB
> unstable:0kB bounce:0
> 0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [88931.181769] lowmem_reserve[]: 0 0 1982 1982
> [88931.181773] Node 0 Normal free:6140kB min:4024kB low:5028kB
> high:6036kB active_ano
> kB active_file:814224kB inactive_file:813276kB unevictable:0kB
> isolated(anon):0kB iso
> kB managed:2030320kB mlocked:0kB dirty:664kB writeback:55016kB
> mapped:22292kB shmem:4
> slab_unreclaimable:19964kB kernel_stack:2352kB pagetables:4380kB
> unstable:0kB bounce
> 100kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [88931.181780] lowmem_reserve[]: 0 0 0 0
> [88931.181784] Node 0 DMA: 11*4kB (UEM) 8*8kB (EM) 8*16kB (UEM) 3*32kB
> (UE) 2*64kB (U
> 12kB (EM) 3*1024kB (UEM) 1*2048kB (E) 2*4096kB (M) = 15436kB
> [88931.181801] Node 0 DMA32: 3432*4k

Re: [ceph-users] KVM problems when rebalance occurs

2016-01-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

There has been a lot of "discussion" about osd_backfill_scan[min,max]
lately. My experience with hammer has been opposite that of what
people have said before. Increasing those values for us has reduced
the load of recovery and has prevented a lot of the disruption seen in
our cluster caused by backfilling. It does increase the amount of time
to do the recovery (a new node added to the cluster took about 3-4
hours before, now takes about 24 hours).

We are currently using these values and they seem to work well for us.
osd_max_backfills = 1
osd_backfill_scan_min = 16
osd_recovery_max_active = 1
osd_backfill_scan_max = 32

I would be interested in your results if you try these values.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWjUu/CRDmVDuy+mK58QAArdMQAI+0Er/sdN7TF7knGey2
5wJ6Ie81KJlrt/X9fIMpFdwkU2g5ET+sdU9R2hK4XcBpkonfGvwS8Ctha5Aq
XOJPrN4bMMeDK9Z4angK86ioLJevTH7tzp3FZL0U4Kbt1s9ZpwF6t+wlvkKl
mt6Tkj4VKr0917TuXqk58AYiZTYcEjGAb0QUe/gC24yFwZYrPO0vUVb4gmTQ
klNKAdTinGSn4Ynj+lBsEstWGVlTJiL3FA6xRBTz1BSjb4vtb2SoIFwHlAp+
GO+bKSh19YIasXCZfRqC/J2XcNauOIVfb4l4viV23JN2fYavEnLCnJSglYjF
Rjxr0wK+6NhRl7naJ1yGNtdMkw+h+nu/xsbYhNqT0EVq1d0nhgzh6ZjAhW1w
oRiHYA4KNn2uWiUgigpISFi4hJSP4CEPToO8jbhXhARs0H6v33oWrR8RYKxO
dFz+Lxx969rpDkk+1nRks9hTeIF+oFnW7eezSiR6TILYxvCZQ0ThHXQsL4ph
bvUr0FQmdV3ukC+Xwa/cePIlVY6JsIQfOlqmrtG7caTZWLvLUDwrwcleb272
243GXlbWCxoI7+StJDHPnY2k7NHLvbN2yG3f5PZvZaBgqqyAP8Fnq6CDtTIE
vZ/p+ZcuRw8lqoDgjjdiFyMmhQnFcCtDo3vtIy/UXDw23AVsI5edUyyP/sHt
ruPt
=X7SH
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 6, 2016 at 7:13 AM, nick  wrote:
> Heya,
> we are using a ceph cluster (6 Nodes with each having 10x4TB HDD + 2x SSD (for
> journal)) in combination with KVM virtualization. All our virtual machine hard
> disks are stored on the ceph cluster. The ceph cluster was updated to the
> 'infernalis' release recently.
>
> We are experiencing problems during cluster maintenance. A normal workflow for
> us looks like this:
>
> - set the noout flag for the cluster
> - stop all OSDs on one node
> - update the node
> - reboot the node
> - start all OSDs
> - wait for the backfilling to finish
> - unset the noout flag
>
> After we start all OSDs on the node again the cluster backfills and tries to
> get all the OSDs in sync. During the beginning of this process we experience
> 'stalls' in our running virtual machines. On some the load raises to a very
> high value. On others a running webserver responses only with 5xx HTTP codes.
> It takes around 5-6 minutes until all is ok again. After those 5-6 minutes the
> cluster is still backfilling, but the virtual machines behave normal again.
>
> I already set the following parameters in ceph.conf on the nodes to have a
> better rebalance traffic/user traffic ratio:
>
> """
> [osd]
> osd max backfills = 1
> osd backfill scan max = 8
> osd backfill scan min = 4
> osd recovery max active = 1
> osd recovery op priority = 1
> osd op threads = 8
> """
>
> It helped a bit, but we are still experiencing the above written problems. It
> feels like that for a short time some virtual hard disks are locked. Our ceph
> nodes are using bonded 10G network interfaces for the 'OSD network', so I do
> not think that network is a bottleneck.
>
> After reading this blog post:
> http://dachary.org/?p=2182
> I wonder if there is really a 'read lock' during the object push.
>
> Does anyone know more about this or do others have the same problems and were
> able to fix it?
>
> Best Regards
> Nick
>
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM problems when rebalance occurs

2016-01-06 Thread Josef Johansson
Hi,

Also make sure that you optimize the debug log config. There's a lot on the
ML about how to set them all to low values (0/0).

I am not sure how it is in infernalis, but it made a big difference in previous versions.
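
The settings usually meant by "all to 0/0" look roughly like the following ceph.conf
fragment (a sketch, not an exhaustive list of debug subsystems):

"""
[global]
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug monc = 0/0
debug auth = 0/0
"""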

Regards,
Josef
On 6 Jan 2016 18:16, "Robert LeBlanc"  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> There has been a lot of "discussion" about osd_backfill_scan[min,max]
> lately. My experience with hammer has been opposite that of what
> people have said before. Increasing those values for us has reduced
> the load of recovery and has prevented a lot of the disruption seen in
> our cluster caused by backfilling. It does increase the amount of time
> to do the recovery (a new node added to the cluster took about 3-4
> hours before, now takes about 24 hours).
>
> We are currently using these values and seem to work well for us.
> osd_max_backfills = 1
> osd_backfill_scan_min = 16
> osd_recovery_max_active = 1
> osd_backfill_scan_max = 32
>
> I would be interested in your results if you try these values.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.2
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWjUu/CRDmVDuy+mK58QAArdMQAI+0Er/sdN7TF7knGey2
> 5wJ6Ie81KJlrt/X9fIMpFdwkU2g5ET+sdU9R2hK4XcBpkonfGvwS8Ctha5Aq
> XOJPrN4bMMeDK9Z4angK86ioLJevTH7tzp3FZL0U4Kbt1s9ZpwF6t+wlvkKl
> mt6Tkj4VKr0917TuXqk58AYiZTYcEjGAb0QUe/gC24yFwZYrPO0vUVb4gmTQ
> klNKAdTinGSn4Ynj+lBsEstWGVlTJiL3FA6xRBTz1BSjb4vtb2SoIFwHlAp+
> GO+bKSh19YIasXCZfRqC/J2XcNauOIVfb4l4viV23JN2fYavEnLCnJSglYjF
> Rjxr0wK+6NhRl7naJ1yGNtdMkw+h+nu/xsbYhNqT0EVq1d0nhgzh6ZjAhW1w
> oRiHYA4KNn2uWiUgigpISFi4hJSP4CEPToO8jbhXhARs0H6v33oWrR8RYKxO
> dFz+Lxx969rpDkk+1nRks9hTeIF+oFnW7eezSiR6TILYxvCZQ0ThHXQsL4ph
> bvUr0FQmdV3ukC+Xwa/cePIlVY6JsIQfOlqmrtG7caTZWLvLUDwrwcleb272
> 243GXlbWCxoI7+StJDHPnY2k7NHLvbN2yG3f5PZvZaBgqqyAP8Fnq6CDtTIE
> vZ/p+ZcuRw8lqoDgjjdiFyMmhQnFcCtDo3vtIy/UXDw23AVsI5edUyyP/sHt
> ruPt
> =X7SH
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Jan 6, 2016 at 7:13 AM, nick  wrote:
> > Heya,
> > we are using a ceph cluster (6 Nodes with each having 10x4TB HDD + 2x
> SSD (for
> > journal)) in combination with KVM virtualization. All our virtual
> machine hard
> > disks are stored on the ceph cluster. The ceph cluster was updated to the
> > 'infernalis' release recently.
> >
> > We are experiencing problems during cluster maintenance. A normal
> workflow for
> > us looks like this:
> >
> > - set the noout flag for the cluster
> > - stop all OSDs on one node
> > - update the node
> > - reboot the node
> > - start all OSDs
> > - wait for the backfilling to finish
> > - unset the noout flag
> >
> > After we start all OSDs on the node again the cluster backfills and
> tries to
> > get all the OSDs in sync. During the beginning of this process we
> experience
> > 'stalls' in our running virtual machines. On some the load raises to a
> very
> > high value. On others a running webserver responses only with 5xx HTTP
> codes.
> > It takes around 5-6 minutes until all is ok again. After those 5-6
> minutes the
> > cluster is still backfilling, but the virtual machines behave normal
> again.
> >
> > I already set the following parameters in ceph.conf on the nodes to have
> a
> > better rebalance traffic/user traffic ratio:
> >
> > """
> > [osd]
> > osd max backfills = 1
> > osd backfill scan max = 8
> > osd backfill scan min = 4
> > osd recovery max active = 1
> > osd recovery op priority = 1
> > osd op threads = 8
> > """
> >
> > It helped a bit, but we are still experiencing the above written
> problems. It
> > feels like that for a short time some virtual hard disks are locked. Our
> ceph
> > nodes are using bonded 10G network interfaces for the 'OSD network', so
> I do
> > not think that network is a bottleneck.
> >
> > After reading this blog post:
> > http://dachary.org/?p=2182
> > I wonder if there is really a 'read lock' during the object push.
> >
> > Does anyone know more about this or do others have the same problems and
> were
> > able to fix it?
> >
> > Best Regards
> > Nick
> >
> > --
> > Sebastian Nickel
> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd partition table

2016-01-06 Thread Dan Nica
Hi guys,

Should I create a partition table on an rbd image, or is it enough to create the 
filesystem only?

Every time I map an rbd image I get the message "unknown partition table", but I 
was able to create the filesystem.

Thanks
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados images sync

2016-01-06 Thread Dan Nica
Hi guys,

Is there a way to replicate the rbd images of a pool to another cluster, other 
than clone/snap?

--
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] double rebalance when removing osd

2016-01-06 Thread Rafael Lopez
Hi all,

I am curious what practices other people follow when removing OSDs from a
cluster. According to the docs, you are supposed to:

1. ceph osd out
2. stop daemon
3. ceph osd crush remove
4. ceph auth del
5. ceph osd rm

What value does ceph osd out (1) add to the removal process, and why is it
in the docs? We have found (as have others) that by out-ing (1) and then
crush removing (3), the cluster has to do two recoveries. Is it necessary?
Can you just do a crush remove without step 1?

I found this earlier message from GregF in which he seems to affirm that just
doing the crush remove is fine:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/007227.html

This recent blog post from Sebastien suggests reweighting to 0 first, but I
haven't tested it:
http://www.sebastien-han.fr/blog/2015/12/11/ceph-properly-remove-an-osd/

I thought that marking it out sets the reweight to 0 anyway, so I am not
sure how this would make a difference in terms of two rebalances, but maybe
there is a subtle difference?

Thanks,
Raf

-- 
Senior Storage Engineer - Automation and Delivery
Infrastructure Services - eSolutions
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] double rebalance when removing osd

2016-01-06 Thread Dan Nica
I followed these steps and it worked just fine:

http://www.sebastien-han.fr/blog/2015/12/11/ceph-properly-remove-an-osd/
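
Reweighting first boils down to roughly the following sequence (a sketch, based on 
the description above; osd.12 / id 12 are placeholders and the stop command depends 
on your init system):

    # Drain the OSD by taking its CRUSH weight to 0, then wait for rebalancing to finish.
    ceph osd crush reweight osd.12 0

    # Once the cluster is healthy again, stop the daemon and remove the OSD.
    ceph osd out 12
    systemctl stop ceph-osd@12          # or: service ceph stop osd.12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12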

--
Dan
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Rafael 
Lopez
Sent: Thursday, January 7, 2016 1:53 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] double rebalance when removing osd

Hi all,

I am curious what practices other people follow when removing OSDs from a 
cluster. According to the docs, you are supposed to:

1. ceph osd out
2. stop daemon
3. ceph osd crush remove
4. ceph auth del
5. ceph osd rm

What value does ceph osd out (1) add to the removal process, and why is it in 
the docs? We have found (as have others) that by out-ing (1) and then crush 
removing (3), the cluster has to do two recoveries. Is it necessary? Can you 
just do a crush remove without step 1?

I found this earlier message from GregF in which he seems to affirm that just 
doing the crush remove is fine:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/007227.html

This recent blog post from Sebastien suggests reweighting to 0 first, but I 
haven't tested it:
http://www.sebastien-han.fr/blog/2015/12/11/ceph-properly-remove-an-osd/

I thought that marking it out sets the reweight to 0 anyway, so I am not sure 
how this would make a difference in terms of two rebalances, but maybe there is 
a subtle difference?

Thanks,
Raf

--
Senior Storage Engineer - Automation and Delivery
Infrastructure Services - eSolutions
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd partition table

2016-01-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

It is just a block device; you can use it with or without a partition table.
I should be careful with that statement, as bcache also looks like a block
device, but you cannot partition it directly.
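
A quick sketch of both variants (device and mount point names are examples):

    # Filesystem directly on the mapped device, no partition table:
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/data

    # Or with a GPT label and a single partition:
    parted -s /dev/rbd0 mklabel gpt mkpart primary xfs 0% 100%
    mkfs.xfs /dev/rbd0p1
    mount /dev/rbd0p1 /mnt/data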
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 6, 2016 at 4:27 PM, Dan Nica  wrote:
> Hi guys,
>
>
>
> Should I create a partition table on a rbd image or it is enough to create
> the filesystem only ?
>
>
>
> Every time I map a rbd image I get the message “unknown partition table” but
> I was able to create the filesystem.
>
>
>
> Thank
>
> Dan
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWjbIJCRDmVDuy+mK58QAAPGoP/jL0IuuKCJ6ojwVdTmq3
rUqVl+1FFDdxu8C+k4m9joOGh1aAx63DwALknZUfy/gwGWMn9ZT66ABc3WSm
i2Qc85uCIA5UUdjKlpnvYKAUA+846mPOxRFJcPJ1XczVCZbm5ocqPrsKLrie
GTW9a1124OXnbeOHRPZXyMBwzDhDZCIHsMSj5nk/ldUy+inEVjFeDqguCvAE
a4Vgmf1y9Y48hWvqlRL4l/daGWOzDPnaFZ7GIp+ni63MwBhAb/XYaygLp14c
nPnmUXZLNA7sTL0PBcSJ8ztRZ+zpfEd8MFJbiYFFnuxklO08Synx6/4g9OkD
1Q1BE9Hr+BN9JBspsWm0Zo4tWipEMkvx5Rg4ieNQBMWBilhzs6RJNm+j7T2Q
X6qIAzXbmmDClpFTthiB8m1tyiTV/ORUeV9AmsUK8bVymSXOxquxjGXFXsfP
1dmmJnSCQ7wD8i1x0lZgeAgvbaXgn9bsS6dQo00sG9upk+5RDvysF1vRoF3J
Fs07SEU11j9rrJ0JTLcD5hZcMnbc2Z4fYiU7wZJb9fk05rwFIQPitBtqeF8+
elidULbph+6wWSqXDLbPmGV57OTcr7ubneTttpsj0rlYBp7FlE24y+z/PSTK
dWIlcXjDu9wOAEdVyFWzNw3F3rP/B8rgvmuh6GWwwoeb7K3ESb8I+sBuYiNl
2coL
=osw9
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd partition table

2016-01-06 Thread Shinobu Kinjo
I'm not sure which is *better* performance-wise in terms of RBD, but I've never 
created a partition table on top of it.

Thank you,
Shinobu

- Original Message -
From: "Dan Nica" 
To: ceph-users@lists.ceph.com
Sent: Thursday, January 7, 2016 8:27:54 AM
Subject: [ceph-users] rbd partition table



Hi guys, 



Should I create a partition table on a rbd image or it is enough to create the 
filesystem only ? 



Every time I map a rbd image I get the message “unknown partition table” but I 
was able to create the filesystem. 



Thank 

Dan 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados images sync

2016-01-06 Thread Le Quang Long
Hi,

You can export rbd images and import to another cluster.
Note:you have to purge images if it has snapshots.

AFAIK, there is no way to export images keeping their clones and snapshots.
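
A minimal sketch of the export/import path (pool/image names and the remote host 
are placeholders; '-' streams the image over stdout/stdin):

    # Stream an image from the local cluster straight into the remote one.
    rbd export rbd/vm-disk-01 - | ssh backup-host rbd import - rbd/vm-disk-01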
On Jan 7, 2016 6:36 AM, "Dan Nica"  wrote:

> Hi guys,
>
>
>
> I there a way to replicate the rbd images of a pool on another cluster  ?
> other than clone/snap ?
>
>
>
> --
>
> Dan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unable to see LTTng tracepoints in Ceph

2016-01-06 Thread Aakanksha Pudipeddi-SSI
Hello Cephers,

A very happy new year to you all!

I wanted to enable LTTng tracepoints for a few tests with infernalis and 
configured Ceph with the --with-lttng option. After seeing a recent post on conf file 
options for tracing, I added these lines:

osd_tracing = true
osd_objectstore_tracing = true
rados_tracing = true
rbd_tracing = true

However, I am unable to see LTTng tracepoints within ceph-osd. I can see 
tracepoints in ceph-mon though. The main difference with respect to tracing 
between ceph-mon and ceph-osd seems to be TracepointProvider and I thought the 
addition in my config file should do the trick but that didn't change anything. 
I do not know if this is relevant but I also checked with lsof and I see 
ceph-osd is accessing the lttng library as is ceph-mon. Did anyone come across 
this issue and if so, could you give me some direction on this? Thanks a lot 
for your help!
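
(For reference, a minimal LTTng session for checking the OSD tracepoints looks 
roughly like this -- a sketch, assuming lttng-tools is installed and ceph-osd runs 
with the tracing options above:)

    # List userspace tracepoints currently registered with lttng.
    lttng list --userspace | grep osd

    # Record osd events for a short while, then inspect them.
    lttng create ceph-osd-trace
    lttng enable-event --userspace 'osd:*'
    lttng start
    # ... generate some I/O ...
    lttng stop
    lttng view | head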

Aakanksha
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Any suggestion to deal with slow request?

2016-01-06 Thread Jevon Qiao

Hi Cephers,

We have a Ceph cluster running 0.80.9, which consists of 36 OSDs with 3 
replicas. Recently, some OSDs keep reporting slow requests and the 
cluster's performance has degraded.


From the log of one OSD, I observe that all the slow requests result 
from waiting for the replicas to complete, and the replica 
OSDs are not always specific ones but could be any other two OSDs.


   2016-01-06 08:17:11.887016 7f175ef25700  0 log [WRN] : slow request
   1.162776 seconds old, received at 2016-01-06 08:17:11.887092:
   osd_op(client.13302933.0:839452
   rbd_data.c2659c728b0ddb.0024 [stat,set-alloc-hint
   object_size 16777216 write_size 16777216,write 12099584~8192]
   3.abd08522 ack+ondisk+write e4661) v4 currently waiting for subops
   from 24,31

I dumped out the historic Ops of the OSD and noticed the following 
information:

1) waited about 8 seconds for the replies from the replica OSDs.
{ "time": "2016-01-06 08:17:03.879264",
  "event": "op_applied"},
{ "time": "2016-01-06 08:17:11.684598",
  "event": "sub_op_applied_rec"},
{ "time": "2016-01-06 08:17:11.687016",
  "event": "sub_op_commit_rec"},

2) spent more than 3 seconds in the writeq and 2 seconds writing the journal.
  { "time": "2016-01-06 08:19:16.887519",
  "event": "commit_queued_for_journal_write"},
{ "time": "2016-01-06 08:19:20.109339",
  "event": "write_thread_in_journal_buffer"},
{ "time": "2016-01-06 08:19:22.177952",
  "event": "journaled_completion_queued"},

Any ideas or suggestions?

BTW, I checked the underlying network with iperf; it works fine.
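
(For reference, the op history quoted above is typically pulled from the OSD admin 
socket -- a sketch; 24 is a placeholder OSD id and the socket path is the default one:)

    # In-flight and recently completed ops, with their per-event timestamps.
    ceph daemon osd.24 dump_ops_in_flight
    ceph daemon osd.24 dump_historic_ops

    # Equivalent form addressing the admin socket directly.
    ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok dump_historic_ops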

Thanks,
Jevon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS 7.2, Infernalis, preparing osd's and partprobe issues.

2016-01-06 Thread Goncalo Borges

Hi All...

If I can step in on this issue, I would just like to report that I am 
experiencing the same problem.


1./ I am installing my infernalis OSDs on CentOS 7.2.1511, and 'ceph-disk 
prepare' fails with the following message:


   # ceph-disk prepare --cluster ceph --cluster-uuid
   a9431bc6-3ee1-4b0a-8d21-0ad883a4d2ed --fs-type xfs /dev/sdd /dev/sdb
   WARNING:ceph-disk:OSD will not be hot-swappable if journal is not
   the same device as the osd data
   The operation has completed successfully.
   Error: Error informing the kernel about modifications to partition
   /dev/sdb1 -- Device or resource busy.  This means Linux won't know
   about any changes you made to /dev/sdb1 until you reboot -- so you
   shouldn't mount it or use it in any way before rebooting.
   Error: Failed to add partition 1 (Device or resource busy)
   ceph-disk: Error: Command '['/usr/sbin/partprobe', '/dev/sdb']'
   returned non-zero exit status 1


2./ I then followed the discussion in 
http://tracker.ceph.com/issues/14080 and tried the last ceph-disk 
suggestion by Luc [1]. Sometimes it succeeds and sometimes it doesn't. 
But it is taking a lot more time than before, since there is now a 5-iteration 
loop with a 60-second sleep per iteration to wait for partprobe to succeed. 
Besides the time it takes, when it fails I then have to zap the 
partitions manually, because sometimes the journal partition is ok but 
the data partition is the one where partprobe is timing out.



3./ In the cases where ceph-disk succeeds, the partition was neither mounted 
nor the daemon started. This was because python-setuptools was not 
installed, and ceph-disk depends on it. It would be worthwhile to make this an 
explicit rpm dependency.



I am not sure why this behavior is showing up much more on the new 
servers I am configuring. Some weeks ago, the same exercise with other 
servers (but using a different storage controller) succeeded without 
problems.


Is there a clear idea of how to improve this behavior?


Cheers
Goncalo
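
(The workaround being tried in that ceph-disk change amounts to letting udev settle 
and retrying partprobe; done by hand it looks roughly like this -- a sketch, /dev/sdb 
is a placeholder:)

    # Let pending udev events finish, then ask the kernel to re-read the partition
    # table, retrying a few times since the device can stay busy for a while.
    for i in 1 2 3 4 5; do
        udevadm settle --timeout=600
        partprobe /dev/sdb && break
        sleep 60
    done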





On 12/17/2015 10:02 AM, Matt Taylor wrote:

Hi Loic,

No problems, I'll add my report to your bug report.

I also tried adding the sleep prior to invoking partprobe, but it 
didn't work (same error).


See pastebin for complete output:

http://pastebin.com/Q26CeUge

Cheers,
Matt.


On 16/12/2015 19:57, Loic Dachary wrote:

Hi Matt,

Could you please add your report to 
http://tracker.ceph.com/issues/14080 ? I think what you're seeing is 
a partprobe timeout because things take too long to complete (that's 
also why adding a sleep, as mentioned in the mail thread, sometimes 
helps). There is a variant of that problem where udevadm settle also 
times out (but it is less common on real hardware). I'm testing a fix 
to make this more robust.


Cheers

On 16/12/2015 07:17, Matt Taylor wrote:

Hi all,

After recently upgrading to CentOS 7.2 and installing a new Ceph 
cluster using Infernalis v9.2.0, I have noticed that disks are 
failing to prepare.


I have observed the same behaviour over multiple Ceph servers when 
preparing disks. All the servers are identical.


Disks are zapping fine, however when running 'ceph-deploy disk 
prepare' we're encountering the following error:


[ceph_deploy.cli][INFO ] Invoked (1.5.30): /usr/bin/ceph-deploy 
disk prepare kvsrv02:/dev/sdr

[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] disk : [('kvsrv02', '/dev/sdr', None)]
[ceph_deploy.cli][INFO ] dmcrypt : False
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : prepare
[ceph_deploy.cli][INFO ] dmcrypt_key_dir : /etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : 


[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] fs_type : xfs
[ceph_deploy.cli][INFO ] func : 
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.cli][INFO ] zap_disk : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks 
kvsrv02:/dev/sdr:

[kvsrv02][DEBUG ] connection detected need for sudo
[kvsrv02][DEBUG ] connected to host: kvsrv02
[kvsrv02][DEBUG ] detect platform information from remote host
[kvsrv02][DEBUG ] detect machine type
[kvsrv02][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.2.1511 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to kvsrv02
[kvsrv02][DEBUG ] write cluster configuration to 
/etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host kvsrv02 disk /dev/sdr 
journal None activate False
[kvsrv02][INFO ] Running command: sudo ceph-disk -v prepare 
--cluster ceph --fs-type xfs -- /dev/sdr
[kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--check-allows-journal -i 0 --cluster ceph
[kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--check-wants-journal -i 0 --cluster ceph
[kvsrv02][WARNIN] INFO:ceph-dis