Now I have also discovered that, by mistake, someone has put production
data on a virtual machine of the cluster. I need Ceph to resume I/O so I
can boot that virtual machine.
Can I mark the incomplete PGs as valid?
If needed, where can I buy some paid support?
Thanks again,
Mario
On Wed
Hi Oliver,
This is my problem:
I have deployed Ceph AIO with two interfaces, 192.168.1.67 and 10.0.0.67, but
at the moment of installation I used 192.168.1.67, and I have an OpenStack
installed with two interfaces, 192.168.1.65 and 10.0.0.65.
OpenStack has its storage in Ceph but it is working on 192
Hi,
if you need fast access to your remaining data you can use
ceph-objectstore-tool to mark those PGs as complete; however, this will
irreversibly lose the missing data.
If you understand the risks, this procedure is explained pretty well here:
http://ceph.com/community/incomplete-pgs-oh-my/
Sinc
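For illustration only, the mark-complete path looks roughly like the following
(a hedged sketch: osd.N, the data/journal paths and the PG id 6.263 are
placeholders, the mark-complete op has to exist in your ceph-objectstore-tool
build, and anything Ceph considers missing in that PG is gone for good):

systemctl stop ceph-osd@N
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
  --journal-path /var/lib/ceph/osd/ceph-N/journal --pgid 6.263 --op info
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
  --journal-path /var/lib/ceph/osd/ceph-N/journal --pgid 6.263 --op mark-complete
systemctl start ceph-osd@N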
I have read the post "incomplete pgs, oh my" many times.
I think my case is different.
The broken disk is completely broken.
So how can I simply mark the incomplete PGs as complete?
Should I stop Ceph first?
On Wed, 29 Jun 2016 at 09:36, Tomasz Kuzemko <
tomasz.kuze...@corp.ovh.com> wrote:
I have searched Google and I see that there is no official procedure.
On Wed, 29 Jun 2016 at 09:43, Mario Giammarco <
mgiamma...@gmail.com> wrote:
> I have read the post "incomplete pgs, oh my" many times.
> I think my case is different.
> The broken disk is completely broken.
> So
As far as I know there isn't, which is a shame. We have covered a
situation like this in our dev environment to be ready for it in
production, and it worked; however, be aware that the data that Ceph
believes is missing will be lost after you mark a PG complete.
In your situation I would find OSD wh
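A hedged sketch of how to find the incomplete PGs and the OSDs that currently
hold them (6.263 is just the PG id that appears later in this thread):

ceph health detail | grep incomplete
ceph pg dump_stuck inactive
ceph pg 6.263 query

The query output lists the up/acting OSD sets and, in the recovery_state
section, hints such as "blocked_by" or "down_osds_we_would_probe".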
Hi Mario,
in my opinion you should
1. fix the "too many PGs per OSD (307 > max 300)" warning (see the note
below this list)
2. stop scrubbing / deep scrubbing
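For point 1, the PG count of a pool cannot be reduced, so the realistic options
are adding OSDs or, as a stopgap, raising the warning threshold. A hedged
sketch (the option name may differ between releases):

ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd 400'

and the same setting in the [mon] or [global] section of ceph.conf to make it
persistent across restarts.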
--
What does your current
ceph osd tree
look like?
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:i...@ip-interactive.de
Anschrift:
Hello,
On Wed, 29 Jun 2016 06:02:59 + Mario Giammarco wrote:
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
^
And that's the root cause of all your woes.
The default replication size is 3 for a reason and while I do run pools
with repli
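Once the cluster is healthy again, the pool can be moved back towards the
defaults at runtime; a hedged sketch using the rbd pool from the dump above
(expect recovery traffic while the extra replicas are created):

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2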
Thank you for your reply, so I can add my experience:
1) the other time this happened to me I had a cluster with min_size=2
and size=3 and the problem was the same. That time I set min_size=1 to
recover the pool but it did not help. So I do not understand where the
advantage is in putting three
Just losing one disk doesn't automagically delete it from CRUSH, but in the
output you had 10 disks listed, so there must be something else going on - did you
delete the disk from the crush map as well?
Ceph waits 300 secs by default, AFAIK, to mark an OSD out; after that it will start to
recover.
> On
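That 300 s figure matches the default of mon_osd_down_out_interval; a hedged
way to check or raise it at runtime (the mon id and the new value are examples):

ceph daemon mon.<id> config get mon_osd_down_out_interval
ceph tell mon.* injectargs '--mon_osd_down_out_interval 600'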
Yes, I removed it from CRUSH because it was broken. I waited 24
hours to see if Ceph would heal itself. Then I removed the disk
completely (it was broken...) and I waited 24 hours again. Then I started
getting worried.
Are you saying that I should not remove a broken disk fro
Hi,
removing ONE disk while your replication is 2 is no problem.
You don't need to wait a single second to replace or remove it. It is
not used anyway and is out/down, so from Ceph's point of view it does not exist.
But as Christian told you already, what we see now fits a szenari
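For reference, the usual removal sequence for a dead OSD looks roughly like
this (a hedged sketch, with N standing for the id of the broken OSD):

ceph osd out N
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N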
In fact I am worried because:
1) Ceph is under Proxmox, and Proxmox may decide to reboot a server if it
is not responding
2) probably a server was rebooted while Ceph was reconstructing
3) even using max=3 does not help
Anyway, this is the "unofficial" procedure that I am using, much simpler
than blo
xiaoxi chen writes:
>
> Hmm, I asked in the ML some days ago :) Likely you hit the kernel bug
fixed by commit 5e804ac482 "ceph: don't invalidate page cache when
inode is no longer used". This fix is in 4.4 but not in 4.2. I haven't had a
chance to play with 4.4; it would be great i
Dear ceph-users,
Are there any expressions / calculators available to estimate the
maximum expected random write IOPS of a Ceph cluster?
To my understanding of Ceph I/O, this should be something like
MAXIOPS = (1-OVERHEAD) * OSD_BACKENDSTORAGE_IOPS * NUM_OSD /
REPLICA_COUNT
So the questio
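As a purely illustrative, hedged plug-in of numbers (none of these figures come
from a real cluster): with 30 OSDs that each sustain about 150 write IOPS, a
replica count of 3 and an assumed 50% overhead for journaling and metadata,
MAXIOPS = 0.5 * 150 * 30 / 3 = 750 random write IOPS.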
Hi Alex/Stefan,
I'm in the middle of testing 4.7rc5 on our test cluster to confirm
once and for all this particular issue has been completely resolved by
Peter's recent patch to sched/fair.c referred to by Stefan above. For
us anyway the patches that Stefan applied did not solve the issue and
neit
Now the problem is that Ceph has put two disks out because scrubbing has
failed (I think it is not a disk fault but due to the mark-complete).
How can I:
- disable scrubbing
- put the two disks back in
I will wait for the end of recovery anyway, to be sure it really works again.
On Wed, 29 Jun 2016 at
Hi,
to be precise, I have far more patches applied to the sched part of the
kernel (around 20). So maybe that's the reason why it helps for me.
Could you please post a complete stack trace? Qemu / KVM also triggers this.
Stefan
On 29.06.2016 at 11:41, Campbell Steven wrote:
> Hi Alex/Stefan,
>
>
hi,
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd in <osd-id>
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:i...@ip-interactive.de
Anschrift:
IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen
HRB 93402 beim Amtsgericht Hanau
Thanks,
I can put the OSDs in but they do not stay in, and I am pretty sure they are not
broken.
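If an OSD keeps dropping out, the usual hedged checks are whether the daemon
actually stays up and what its log says at the moment it is marked down (N is
the OSD id; service names and paths may differ on Proxmox):

systemctl status ceph-osd@N
tail -n 200 /var/log/ceph/ceph-osd.N.log
ceph osd set noout    (stops down OSDs from being marked out while debugging)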
On Wed, 29 Jun 2016 at 12:07, Oliver Dzombic <
i...@ip-interactive.de> wrote:
> hi,
>
> ceph osd set noscrub
> ceph osd set nodeep-scrub
>
> ceph osd in <osd-id>
>
>
> --
> Mit freundlichen Gruesse
Hi,
again:
You >must< check all your logs ( as fucky as it is for sure ).
That means: on the Ceph nodes, in /var/log/ceph/*
And go back to the time when things went downhill.
There must be something else going on, beyond a normal OSD crash.
And your manual pg repair/pg remove/pg set complete is,
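A hedged example of that kind of log sweep, run on every Ceph node:

grep -iE 'err|fail|abort|assert' /var/log/ceph/ceph-osd.*.log
grep -i 'wrongly marked me down' /var/log/ceph/ceph-osd.*.log
grep -i 'slow request' /var/log/ceph/ceph.log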
Just one question: why, when Ceph has some incomplete PGs, does it refuse to do
I/O on good PGs?
On Wed, 29 Jun 2016 at 12:55, Oliver Dzombic <
i...@ip-interactive.de> wrote:
> Hi,
>
> again:
>
> You >must< check all your logs ( as fucky as it is for sure ).
>
> Means on the ceph nodes
Hi,
it does not.
But in your case you have 10 OSDs, and 7 of them have incomplete PGs.
So since your Proxmox VPSs are not on a single PG but spread across
many PGs, you have a good chance that at least some data of any VPS is
on one of the defective PGs.
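A hedged way to see that spreading, using the object-name convention of RBD
images (pool, image and prefix below are placeholders):

rbd info rbd/myvm-disk1        (note the block_name_prefix, e.g. rbd_data.1234abcd)
ceph osd map rbd rbd_data.1234abcd.0000000000000000

Each data object of the image maps to its own PG, so one VM disk touches many
PGs and therefore many OSDs.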
--
Mit freundlichen Gruessen / Best reg
Hi,
On 29/06/2016 at 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put out two disks because scrub has
> failed (I think it is not a disk fault but due to mark-complete)
There is something odd going on. I've only seen deep-scrub failing (i.e.
detecting one inconsistency and marking
This time, at the end of the recovery procedure you described, it ended with
most PGs active+clean and 20 PGs incomplete.
After that, when trying to use the cluster, I got "request blocked more than"
warnings and no VM can start.
I know that something happened after the broken disk, probably a server
reboot. I am inv
> On 28.06.2016 at 09:43, Lionel Bouton
> wrote:
>
> Hi,
>
>> On 28/06/2016 at 08:34, Stefan Priebe - Profihost AG wrote:
>> [...]
>> Yes but at least BTRFS is still not working for ceph due to
>> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
>> doubles its I/O after a
Greetings,
I have a lab cluster running Hammer 0.94.6 and being used exclusively for
object storage. The cluster consists of four servers running 60 6TB OSDs
each. The main .rgw.buckets pool is using k=3 m=1 erasure coding and
contains 8192 placement groups.
Last week, one of our guys out-ed an
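For context, a pool with that layout is typically created along these lines (a
hedged sketch with a placeholder profile name; the actual cluster may have been
set up differently):

ceph osd erasure-code-profile set ec-3-1 k=3 m=1
ceph osd pool create .rgw.buckets 8192 8192 erasure ec-3-1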
Hi,
On 29/06/2016 at 18:33, Stefan Priebe - Profihost AG wrote:
>> On 28.06.2016 at 09:43, Lionel Bouton
>> wrote:
>>
>> Hi,
>>
>> On 28/06/2016 at 08:34, Stefan Priebe - Profihost AG wrote:
>>> [...]
>>> Yes but at least BTRFS is still not working for ceph due to
>>> fragmentation. I've even t
Hi all,
Is there anyone using rbd for XenServer VM storage? I have XenServer 7 and the
latest Ceph, and I am looking for the best way to mount the rbd volume under
XenServer. There is not much recent info out there that I have found, except for
this:
http://www.mad-hacking.net/documentation/linux/h
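In the generic, non-XenServer-specific case a kernel-mapped RBD looks roughly
like this (a hedged sketch; pool and image names are placeholders and the dom0
kernel needs the rbd module):

rbd create rbd/xen-disk1 --size 102400
rbd map rbd/xen-disk1

The image then appears as /dev/rbd0 (or /dev/rbd/rbd/xen-disk1) and can be
handed to whatever storage repository type XenServer accepts for a block device.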
I am starting to work with and benchmark our Ceph cluster. While
throughput so far looks good, metadata performance so far looks to
be suffering. Is there anything that can be done to speed up the
response time of looking through a lot of small files and folders?
Right now, I am running
On Wednesday, June 29, 2016, Mike Jacobacci wrote:
> Hi all,
>
> Is there anyone using rbd for xenserver vm storage? I have XenServer 7
> and the latest Ceph, I am looking for the the best way to mount the rbd
> volume under XenServer. There is not much recent info out there I have
> found exce
On Thu, Jun 30, 2016 at 3:22 AM, Brian Felton wrote:
> Greetings,
>
> I have a lab cluster running Hammer 0.94.6 and being used exclusively for
> object storage. The cluster consists of four servers running 60 6TB OSDs
> each. The main .rgw.buckets pool is using k=3 m=1 erasure coding and
> cont
Hello, everyone.
When I want to modify the access_key using the following command:
radosgw-admin user modify --uid=user --access_key="userak"
I got:
{
"user_id": "user",
"display_name": "User name",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": []
were only appearing in osd.56
logs but not in others.
# cat ceph-osd.56.log-20160629 | grep -Hn 'ERR'
(standard input):8569:2016-06-29 08:09:50.952397 7fd023322700 -1
log_channel(cluster) log [ERR] : scrub 6.263
6:c645f18e:::12a343d.:head on disk size (1836) does not mat
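For a scrub error of that kind the usual hedged first step is to let Ceph try
to repair the PG from the other copy; with size 2 there is no majority, so
check first which replica holds the correct object:

ceph pg repair 6.263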
errors
> crush map has legacy tunables (require bobtail, min is firefly); see
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
> We have started by looking to pg 6.263. Errors were only appearing in
> osd.56 logs but not in others.
>
> # cat ceph-osd.56.log-
": "6.263",
- "last_update": "1005'2273061",
-"last_complete": "1005'2273061",
-"log_tail": "1005'227",
-"last_user_version": 2273061,
+"last_update&quo
[],
> "pushing": []
> }
> },
> "scrub": {
> "scrubber.epoch_start": "995",
> "scrubber.active": 0,
> "scrubber.st
"1005'2273061",
-"last_complete": "1005'2273061",
-"log_tail": "1005'227",
-"last_user_version": 2273061,
+"last_update": "1005'2273745",
+"last_comp
I've had two OSDs fail and I'm pretty sure they won't recover from
this. I'm looking for help trying to get them back online if
possible...
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: bad checksum on pg_log_entry_t
- I'm having
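When an OSD dies on a corrupt pg_log entry like that, one hedged avenue before
giving up on it is to see what ceph-objectstore-tool can still read while the
daemon is stopped (paths, the osd id N and the PG id X.YY are placeholders):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
  --journal-path /var/lib/ceph/osd/ceph-N/journal --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
  --journal-path /var/lib/ceph/osd/ceph-N/journal \
  --pgid X.YY --op export --file /root/X.YY.export

An exported PG can later be imported into another OSD with --op import, though
the export may trip over the same checksum error.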
Hey all,
I am interested in running Ceph in Docker containers. This is extremely
attractive given the recent integration of Swarm into the Docker engine,
making it really easy to set up a Docker cluster.
When running Ceph in Docker, should monitors, radosgw and OSDs all be on
separate physic
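For a rough idea of what a containerized monitor looks like with the ceph/daemon
image from the ceph-docker project (a hedged sketch; image tag and environment
variables are from memory and may differ):

docker run -d --net=host \
  -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph \
  -e MON_IP=192.168.1.10 -e CEPH_PUBLIC_NETWORK=192.168.1.0/24 \
  ceph/daemon mon

OSD containers generally need --net=host plus --privileged (or the disks passed
through explicitly), which is one reason many people keep OSDs on dedicated hosts.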
Last two questions:
1) I have used other systems in the past. In case of split brain or serious
problems they let me choose which copy is "good" and then work
again. Is there a way to tell Ceph that all is OK? This morning I again have
19 incomplete PGs after recovery.
2) Where can I find pai
kend": {
> "pull_from_peer": [],
> "pushing": []
> }
> },
> "scrub": {
> "scrubber.epoch_start": "995",
> "scrubb