Dear all,
I'm reading the docs at
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
regarding the cluster network and I wonder which nodes are connected to the
dedicated cluster network?
The diagram on the mentioned page only shows the OSDs connected to the cluster network.
mds services on the network will do nothing.
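For reference, a minimal sketch of how the two networks are declared in ceph.conf (the subnets below are placeholders, not taken from anyone's setup):

[global]
    public network  = 192.168.100.0/24
    cluster network = 192.168.200.0/24

Only the OSDs actually use the cluster network (replication, recovery and heartbeat traffic between OSDs), so only the OSD hosts need an interface on it; monitors, MDS daemons and clients all talk over the public network.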
On Fri, Jul 14, 2017, 11:39 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Dear all,
I'm reading the docs at
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
regarding the cluster network and I wond
Dear all,
we are planning to add new hosts to our existing hammer clusters, and I'm
looking for best practices recommendations.
currently we have 2 clusters with 72 OSDs and 6 nodes each. We want to add 3
more nodes (36 OSDs) to each cluster, and we have some questions about what
would be the
https://www.spinics.net/lists/ceph-users/msg37252.html
On Tue, Jul 18, 2017, 9:07 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Dear all,
we are planning to add new hosts to our existing hammer clusters, and I'm
looking for best practices recommendations.
cur
e the hosts
into them.
Sage explains a lot of the crush map here.
https://www.slideshare.net/mobile/sageweil1/a-crash-course-in-crush
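For what it's worth, the bucket work itself is only a couple of CRUSH commands; a rough sketch with invented names (storage7 being a new host, rack2 an existing rack bucket):

ceph osd crush add-bucket storage7 host
ceph osd crush move storage7 rack=rack2

OSDs created on that host should then appear under it in the CRUSH tree and data starts rebalancing according to their weights.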
On Wed, Jul 19, 2017, 2:43 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Hi David,
thank you for pointing this out. Google wasn't
olled the impact
of the recovery/backfilling operation on your clients' data traffic? What settings
have you used to avoid slow requests?
Kind regards,
Laszlo
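In case it is useful to others following the thread, the knobs most often mentioned for keeping recovery/backfill gentle are these (not necessarily what was used here, just the usual suspects):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

The same values can also go into the [osd] section of ceph.conf so they survive restarts.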
On 19.07.2017 17:40, Richard Hesketh wrote:
On 19/07/17 15:14, Laszlo Budai wrote:
Hi David,
Thank you for that reference about CRUSH
Dear all,
Where can I read more about how the space used by a snapshot of an RBD image is
calculated? Or can someone explain it here?
I can see that before the snapshot is created, the size of the image is let's
say 100M as reported by the rbd du command, while after taking the snapshot, I
ca
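For anyone searching later, the numbers in question come from running rbd du against the image (pool/image names below are placeholders):

rbd du mypool/myimage

Once the image has snapshots this prints one PROVISIONED/USED line per snapshot plus one for the image head and a <TOTAL> line. As far as I understand, USED is computed at whole-object granularity (the backing RADOS objects that have been written), so it is an upper bound rather than an exact byte count.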
Dear all,
I need to expand a ceph cluster with minimal impact. Reading previous threads
on this topic from the list I've found the ceph-gentle-reweight script
(https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight)
created by Dan van der Ster (Thank you Dan for sharing
reduce the extra data movement we were seeing with smaller weight
increases. Maybe something to try out next time?
Bryan
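For context, the core of what the gentle-reweight approach automates is roughly this loop (a sketch of the idea only, not the actual CERN script):

# bring osd.72 towards its target weight in small steps
ceph osd crush reweight osd.72 0.2
# wait for backfill to finish and health to return to OK, watch for slow requests
ceph health detail
# then repeat with 0.4, 0.6, ... up to the full weight

If I read Bryan's (truncated) note correctly, very small increments caused extra data movement in their case, so the step size is worth experimenting with.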
From: ceph-users on behalf of Dan van der Ster
Date: Friday, August 4, 2017 at 1:59 AM
To: Laszlo Budai
Cc: ceph-users
Subject: Re: [ceph-users] expanding cluster with minimal impact
at 8:12 PM, Laszlo Budai wrote:
Dear all,
I need to expand a ceph cluster with minimal impact. Reading previous
threads on this topic from the list I've found the ceph-gentle-reweight
script
(https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight)
created by Dan van der Ster
Dear all!
In our Hammer cluster we are planning to switch our failure domain from host to
chassis. We have performed some simulations, and regardless of the settings we
have used some slow requests have appeared all the time.
we had the following settings:
"osd_max_backfills": "1",
"
iling drives) which could easily cause things to block. Also
checking if your disks or journals are maxed out with iostat could shine some
light on any mitigating factor.
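For the iostat check mentioned there, something along these lines is usually enough to spot a saturated data disk or journal (run it during a backfill):

iostat -x 5
# watch %util and await: a device pinned near 100% utilisation, or with await far
# above its normal value, is the likely source of the slow requests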
On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Dear all!
In our Ha
Most people find that they can use 3-5 before the disks are active
enough to come close to impacting customer traffic. That would lead me to
think you have a dying drive that you're reading from/writing to in sectors
that are bad or at least slower.
On Fri, Sep 1, 2017, 6:13 AM Laszlo B
Hi,
I've just started up the dashboard component of the ceph mgr. It looks OK, but
from what I can see, and what I was able to find in the docs, the dashboard
is just for monitoring. Is there any plugin that allows management of the ceph
resources (pool create/delete)?
Thanks,
Laszlo
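For reference, listing and enabling mgr modules looks like this (whether your release ships any module that does more than monitoring is a separate question):

ceph mgr module ls
ceph mgr module enable dashboard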
Hello,
I have these settings in my /etc/ceph/ceph.conf:
[client]
rbd cache = true
rbd cache writethrough until flush = true
admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/qemu/qemu-guest-$pid.log
rbd concurrent management ops = 20
Currently
Hello,
Thank you for the answer.
I don't have the admin socket either :(
the ceph subdirectory is missing in /var/run.
What would be the steps to get the socket?
Kind regards,
Laszlo
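For what it's worth, the usual fix in setups like this is to create the directory with permissions for the qemu user and then restart (or live-migrate) the VMs so librbd re-reads ceph.conf; user and group names vary by distro, so treat this as a sketch:

mkdir -p /var/run/ceph/guests /var/log/qemu
chown qemu:libvirt /var/run/ceph/guests /var/log/qemu

# once a socket shows up it can be queried, e.g.
ceph --admin-daemon /var/run/ceph/guests/<asok file> perf dump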
On 28.02.2017 05:32, Jason Dillaman wrote:
On Mon, Feb 27, 2017 at 12:36 PM, Laszlo Budai wrote:
Curr
Hello,
I have a strange situation:
On a host server we are running 5 VMs. The VMs have their disks provisioned by
cinder from a ceph cluster and are attached by qemu-kvm using librbd.
We have a very strange situation where the VMs apparently stop working
for a few seconds (10-20), and a
Hello,
is there any risk of cluster overload when scrubbing is re-enabled
after having been disabled for a certain amount of time?
I am thinking of the following scenario:
1. scrub/deep scrub are disabled.
2. after a while (a few days) we re-enable them. How will the cluster perform?
Will it run a
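(For reference, the standard way to disable and re-enable scrubbing cluster-wide is via the cluster flags:

ceph osd set noscrub
ceph osd set nodeep-scrub
...
ceph osd unset noscrub
ceph osd unset nodeep-scrub
)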
t the above is slowed down enough that everything is
scrubbed within this long scrub interval, but might need adjustment for
a more normal setting here:
# 60 days ... default is 7 days
osd deep scrub interval = 5259488
And more inline answers below
On 03/08/17 10:46, Laszlo Budai wrote:
Hello
Hello,
After a major network outage our ceph cluster ended up with an inactive PG:
# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1
requests are blocked > 32 sec; 1 osds have slow requests
pg 3.367 is stuck inactive for 912263.766607, current state
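(A PG stuck like this can be inspected directly with

ceph pg 3.367 query

which dumps its peering and recovery state as JSON; the fragment below looks like that kind of output.)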
pdate": "0'0",
"current_last_stamp": "0.00",
"current_info": {
"begin": "0.00",
"end": "0.00",
"versio
are marked DNE and seem to be uncontactable.
This seems to be more than a network issue (unless the outage is still
happening).
http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete
On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai wrote:
Hello,
I was informed that
/msg17820.html
If you want to abandon the pg see
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
for a possible solution.
http://ceph.com/community/incomplete-pgs-oh-my/ may also give some ideas.
On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai wrote:
The OSDs are al
Hello,
Can someone explain the meaning of osd_disk_thread_ioprio_priority? I'm reading
the definition from this page:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/configuration_guide/osd_configuration_reference
it says: "It sets the ioprio_set(2) I/O scheduling p
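For context, this option is normally set together with its companion class option, along these lines (the values are the commonly suggested ones for de-prioritising the disk thread, not a recommendation):

osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7

As far as I know it only has an effect when the disk's I/O scheduler is CFQ.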
On 11.03.2017 16:25, Nick Fisk wrote:
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Laszlo Budai
Sent: 11 March 2017 13:51
To: ceph-users
Subject: [ceph-users] osd_disk_thread_ioprio_priority help
Hello,
Can someone explain the meaning
I'll read it. So far, searching for the architecture of an OSD,
I could not find the gory details about these directories.
Kind regards,
Laszlo
On 12.03.2017 02:12, Brad Hubbard wrote:
On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai wrote:
Hello,
Thank you for your answer.
indeed the min_size
ions which would help to improve the cluster's
responsiveness during deep scrub operations.
Kind regards,
Laszlo
On 12.03.2017 21:21, Florian Haas wrote:
On Sat, Mar 11, 2017 at 4:24 PM, Laszlo Budai wrote:
Can someone explain the meaning of osd_disk_thread_ioprio_priority. I'm
[...
e.
What else can I try?
Thank you,
Laszlo
On 12.03.2017 13:06, Brad Hubbard wrote:
On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai wrote:
Hello,
I have already done the export with ceph_objectstore_tool. I just have to
decide which OSDs to keep.
Can you tell me why the directory structur
Hello,
So, I've done the following steps:
1. set noout
2. stop osd2
3. ceph-objectstore-tool remove (the exact command form is sketched below)
4. start osd2
5. repeat steps 2-4 on osd 28 and 35
then I've run ceph pg force_create_pg 3.367.
This has left the PG in creating state:
# ceph -s
cluster 6713d1b8-83da-11e6-aa79-525400d98
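(For completeness, the remove in step 3 above normally takes this form, with the data/journal paths and pgid as placeholders, and is run while the OSD is stopped:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --journal-path /var/lib/ceph/osd/ceph-2/journal --pgid 3.367 --op remove
)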
on the disk).
Use force_create_pg to recreate the pg empty.
Use ceph-objectstore-tool to do a rados import on the exported pg copy.
On Wed, Mar 15, 2017 at 12:00 PM, Laszlo Budai wrote:
Hello,
I have tried to recover the pg using the following steps:
Preparation:
1. set noout
2. stop osd.2
Hello,
I'm trying to do an import-rados operation, but the ceph-objectstore-tool
crashes with segfault:
[root@storage1 ~]# ceph-objectstore-tool import-rados images pg6.6exp-osd1
*** Caught signal (Segmentation fault) **
in thread 7f84e0b24880
ceph version 0.94.10 (b1e0532418e4631af01acbc0ced
the debuginfo for ceph (how this works depends on your
distro) and run the following?
# gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args ceph-objectstore-tool
import-rados volumes pg.3.367.export.OSD.35
On Thu, Mar 16, 2017 at 12:02 AM, Laszlo Budai wrote:
Hello,
the
My mistake, I've run it on a wrong system ...
I've attached the terminal output.
I've run this on a test system where I was getting the same segfault when
trying import-rados.
Kind regards,
Laszlo
On 16.03.2017 07:41, Laszlo Budai wrote:
[root@storage2 ~]# gdb -ex 'r
h/$cluster-$name.$pid.log
Then run the ceph-objectstore-tool again taking careful note of what
file is created in /var/log/ceph/ and upload that.
On Thu, Mar 16, 2017 at 5:21 PM, Laszlo Budai wrote:
My mistake, I've run it on a wrong system ...
I've attached the terminal output
Hi all,
I've found that the problem was due to the missing
/etc/ceph/ceph.client.admin.keyring file on the storage node where I was trying
to do the import-rados operation.
Kind regards,
Laszlo
On 15.03.2017 20:22, Laszlo Budai wrote:
Hello,
I'm trying to do an import-rados operatio
Hello,
we have been patching our ceph cluster 0.94.7 to 0.94.10. We were updating one
node at a time, and after each OSD node has been rebooted we were waiting for
the cluster health status to be OK.
In the docs we have "stale - The placement group status has not been updated by a
ceph-osd, in
Hello,
can someone tell me the meaning of the last_scrub and last_deep_scrub values
from the ceph pg dump output?
I could not find it with google nor in the documentation.
for example I can see here the last_scrub being 61092'4385, and the
last_deep_scrub=61086'4379
pg_stat objects mip
Hello cephers,
I have a situation where from time to time the write operation to the ceph
storage hangs for 3-5 seconds. For testing we have a simple line like:
while sleep 1; do date >> logfile; done &
with this we can see that rarely there are 3 seconds or more differences
between the consecuti
Hello,
We have an issue when writing to ceph. From time to time the write operation
seems to hang for a few seconds.
We've seen the https://bugzilla.redhat.com/show_bug.cgi?id=1389503, and there it is said
that when the qemu process would reach the max open files limit, then "the guest OS
shou
|finalize)" log entries
3) use the asok file during one of these events to dump the objecter requests
[1] http://docs.ceph.com/docs/jewel/rbd/rbd-replay/
[2] http://tracker.ceph.com/issues/14629
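(Concretely, item 3 would be something along the lines of

ceph --admin-daemon /var/run/ceph/guests/<your .asok file> objecter_requests

which lists the client's in-flight or stuck requests to the OSDs at that moment.)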
On Tue, Apr 4, 2017 at 7:36 AM, Laszlo Budai wrote:
Hello cephers,
I have a situation whe
Hello,
we have observed that there are null characters written into the open files
when hard rebooting a VM. Is this a known issue?
Our VM is using ceph (0.94.10) storage.
we have a script like this:
while sleep 1; do date >> somefile ; done
if we hard reset the VM while the above line is running
keystone_token_cache_size": "1",
"rgw_bucket_quota_cache_size": "1",
I did some tests and the problem has appeared when I was using ext4 in the VM,
but not in the case of xfs.
I did another test where I was calling sync at the end of the while loop,
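(i.e. a variant of the test loop along the lines of:

while sleep 1; do date >> somefile ; sync ; done
)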
mstances
aren't the same, but the patterns of behaviour are similar enough that I wanted
to raise awareness.
k8
On Sat, Apr 8, 2017 at 6:39 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Hello Peter,
Thank you for your answer.
In our setup we have the virtu
Hello,
yesterday one of our compute nodes recorded the following message for one
of the ceph connections:
submit_message osd_op(client.28817736.0:690186
rbd_data.15c046b11ab57b7.00c4 [read 2097152~380928] 3.6f81364a
ack+read+known_if_redirected e3617) v5 remote, 10.12.68.71:68
On 12.04.2017 22:19, Alex Gorbachev wrote:
Hi Laszlo,
On Wed, Apr 12, 2017 at 6:26 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Hello,
yesterday one of our compute nodes has recorded the following message for
one of the ceph connections:
submit_message osd_op(
e connection.
Maybe both are wrong and the truth is a third variant ... :) This is what I
would like to understand.
Kind regards,
Laszlo
On 13.04.2017 00:36, Gregory Farnum wrote:
On Wed, Apr 12, 2017 at 3:00 AM, Laszlo Budai wrote:
Hello,
yesterday one of our compute nodes has record
17 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Hello Greg,
Thank you for the answer.
I'm still in doubt about "lossy". What does it mean in this context? I
can think of different variants:
1. The designer of the protocol from start is consid
Hello all,
We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In
our crush map we are distributing the PGs on chassis (complete crush map
below):
# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
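(For comparison only, not our actual map: a rule whose failure domain is the chassis would normally continue with steps like

        step take default
        step chooseleaf firstn 0 type chassis
        step emit

before the closing brace.)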
29.05.2017 14:58, Laszlo Budai wrote:
Hello all,
We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In
our crush map we are distributing the PGs on chassis (complete crush map
below):
# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
Hello,
can someone give me some directions on how ceph recovery works?
Let's suppose we have a ceph cluster with several nodes grouped in 3 racks (2
nodes/rack). The crush map is configured to distribute PGs on OSDs from
different racks.
What happens if a node fails? Where can I read a des
s can work
if you replace failed storage quickly.
On Mon, May 29, 2017, 12:07 PM Laszlo Budai <las...@componentsoft.eu> wrote:
Dear all,
How should ceph react in case of a host failure when 12 out of a total of 72
OSDs are out?
Is it normal that for the remapping of the PG
see if any of the PGs are reporting that they are running on multiple OSDs
inside of the same failure domain.
On Tue, May 30, 2017 at 12:34 PM Laszlo Budai <las...@componentsoft.eu> wrote:
Hello David,
Thank you for your message.
Indeed we were exp
the crush map, an osd being marked
out changes the crush map, an osd being removed from the cluster changes the
crush map... The crush map changes all the time even if you aren't modifying it
directly.
On Tue, May 30, 2017 at 2:08 PM Laszlo Budai <las...@componentsoft.eu> wrote:
2017 at 6:17 AM, Gregory Farnum wrote:
On Mon, May 29, 2017 at 4:58 AM, Laszlo Budai wrote:
Hello all,
We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In
our crush map we are distributing the PGs on chassis (complete crush map
below):
# rules
rule repli
on by default?
Yesterday we were able to reproduce the issue on a test cluster. Hammer has
performed the same way, but Jewel has worked properly.
Upgrading to jewel is planned, but it was not decided yet when to happen.
Thank you,
Laszlo
On 30.05.2017 23:17, Gregory Farnum wrote:
On Mon, May
Hi David,
If I understand correctly your suggestion is the following:
If we have for instance 12 servers grouped into 3 racks (4/rack) then you would
build a crush map saying that you have 6 racks (virtual ones), and 2 servers in
each of them, right?
In this case if we are setting the failure
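(To make the "virtual racks" idea concrete, they would just be additional rack buckets in the CRUSH map, e.g. with invented names:

ceph osd crush add-bucket vrack1 rack
ceph osd crush move vrack1 root=default
ceph osd crush move server1 rack=vrack1
ceph osd crush move server2 rack=vrack1

and the rule's chooseleaf type would then be rack.)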
position where you need to rush to the datacenter to fix the
hardware problems ASAP.
On Fri, Jun 2, 2017, 5:14 AM Laszlo Budai <las...@componentsoft.eu> wrote:
Hi David,
If I understand correctly your suggestion is the following:
If we have for instance 12 servers gr