Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Sage Weil
On Thu, 17 Mar 2016, Robert LeBlanc wrote:
> 
> I'm having trouble finding documentation about using ceph_test_rados. Can I
> run this on the existing cluster and will that provide useful info? It seems
> running it in the build will not have the caching set up (vstart.sh).
> 
> I have accepted a job with another company and only have until Wednesday to
> help with getting information about this bug. My new job will not be using
> Ceph, so I won't be able to provide any additional info after Tuesday. I want
> to leave the company on a good trajectory for upgrading, so any input you can
> provide will be helpful.

I'm sorry to hear it!  You'll be missed.  :)

> I've found:
> 
> ./ceph_test_rados --op read 100 --op write 100 --op delete 50
> --max-ops 400000 --objects 1024 --max-in-flight 64 --size 4000000
> --min-stride-size 400000 --max-stride-size 800000 --max-seconds 600
> --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op
> rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0
> 
> Is that enough if I change --pool to the cached pool and do the toggling
> while ceph_test_rados is running? I think this will run for 10 minutes.

Precisely.  You can probably drop copy_from and snap ops from the list 
since your workload wasn't exercising those.
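
For reference, a trimmed invocation along those lines might look like the
following. This is only a sketch: the cache-pool name is a placeholder, and
the weights are simply the ones quoted above with the copy_from/snap/rollback
ops removed.

  ceph_test_rados --op read 100 --op write 100 --op delete 50 \
      --max-ops 400000 --objects 1024 --max-in-flight 64 \
      --size 4000000 --min-stride-size 400000 --max-stride-size 800000 \
      --max-seconds 600 --op setattr 25 --op rmattr 25 \
      --pool <your-cache-pool>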

Thanks!
sage


> 
> Thanks,
> 
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil  wrote:
>   On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>   > We are trying to figure out how to use rados bench to reproduce. Ceph
>   > itself doesn't seem to think there is any corruption, but when you do a
>   > verify inside the RBD, there is. Can rados bench verify the objects after
>   > they are written? It also seems to be primarily the filesystem metadata
>   > that is corrupted. If we fsck the volume, there is missing data (put into
>   > lost+found), but if it is there it is primarily OK. There only seems to
>   > be a few cases where a file's contents are corrupted. I would suspect on
>   > an object boundary. We would have to look at blockinfo to map that out
>   > and see if that is what is happening.
> 
>   'rados bench' doesn't do validation.  ceph_test_rados does, though--if you
>   can reproduce with that workload then it should be pretty easy to track
>   down.
> 
>   Thanks!
>   sage
> 
> 
>   > We stopped all the IO and did put the tier in writeback mode with
>   > recency 1, set the recency to 2 and started the test, and there was
>   > corruption, so it doesn't seem to be limited to changing the mode. I
>   > don't know how that patch could cause the issue either. Unless there is
>   > a bug that reads from the back tier but writes to the cache tier, then
>   > the object gets promoted, wiping that last write; but then it seems like
>   > it should not be as much corruption, since the metadata should be in the
>   > cache pretty quickly. We usually evicted the cache before each try, so
>   > we should not be evicting on writeback.
>   >
>   > Sent from a mobile device, please excuse any typos.
>   > On Mar 17, 2016 6:26 AM, "Sage Weil"  wrote:
>   >
>   > > On Thu, 17 Mar 2016, Nick Fisk wrote:
>   > > > There's got to be something else going on here. All that PR does is
>   > > > to potentially delay the promotion to hit_set_period*recency instead
>   > > > of just doing it on the 2nd read regardless; it's got to be
>   > > > uncovering another bug.
>   > > >
>   > > > Do you see the same problem if the cache is in writeback mode before
>   > > > you start the unpacking? I.e. is it the switching mid-operation which
>   > > > causes the problem? If it only happens mid-operation, does it still
>   > > > occur if you pause IO when you make the switch?

[ceph-users] [cephfs] About feature 'snapshot'

2016-03-19 Thread 施柏安
Hi all,
I've run into trouble with CephFS snapshots. The folder '.snap' seems to
exist, but 'll -a' won't show it. And when I enter that folder and create a
folder inside it, I get an error about using snapshots.

Please check : http://imgur.com/elZhQvD


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Irek Fasikhov
Hi, All.

I can confirm the problem: with min_read_recency_for_promote > 1, data
corruption occurs.

Regards, Irek Fasikhov
Mob.: +79229045757

2016-03-17 15:26 GMT+03:00 Sage Weil :

> On Thu, 17 Mar 2016, Nick Fisk wrote:
> > There's got to be something else going on here. All that PR does is to
> > potentially delay the promotion to hit_set_period*recency instead of
> > just doing it on the 2nd read regardless, it's got to be uncovering
> > another bug.
> >
> > Do you see the same problem if the cache is in writeback mode before you
> > start the unpacking? I.e. is it the switching mid-operation which causes
> > the problem? If it only happens mid-operation, does it still occur if
> > you pause IO when you make the switch?
> >
> > Do you also see this if you perform on a RBD mount, to rule out any
> > librbd/qemu weirdness?
> >
> > Do you know if it’s the actual data that is getting corrupted or if it's
> > the FS metadata? I'm only wondering as unpacking should really only be
> > writing to each object a couple of times, whereas FS metadata could
> > potentially be being updated+read back lots of times for the same group
> > of objects and ordering is very important.
> >
> > Thinking through it logically the only difference is that with recency=1
> > the object will be copied up to the cache tier, where recency=6 it will
> > be proxy read for a long time. If I had to guess I would say the issue
> > would lie somewhere in the proxy read + writeback<->forward logic.
>
> That seems reasonable.  Was switching from writeback -> forward always
> part of the sequence that resulted in corruption?  Note that there is a
> known ordering issue when switching to forward mode.  I wouldn't really
> expect it to bite real users, but it's possible...
>
> http://tracker.ceph.com/issues/12814
>
> I've opened a ticket to track this:
>
> http://tracker.ceph.com/issues/15171
>
> What would be *really* great is if you could reproduce this with a
> ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados
> running, and then find the sequence of operations that are sufficient to
> trigger a failure.
>
> sage
>
>
>
>  >
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > > Mike Lovell
> > > Sent: 16 March 2016 23:23
> > > To: ceph-users ; sw...@redhat.com
> > > Cc: Robert LeBlanc ; William Perkins
> > > 
> > > Subject: Re: [ceph-users] data corruption with hammer
> > >
> > > just got done with a test against a build of 0.94.6 minus the two commits
> > > that were backported in PR 7207. everything worked as it should with the
> > > cache-mode set to writeback and the min_read_recency_for_promote set to
> > > 2. assuming it works properly on master, there must be a commit that
> > > we're missing on the backport to support this properly.
> > >
> > > sage,
> > > i'm adding you to the recipients on this so hopefully you see it. the
> > > tl;dr version is that the backport of the cache recency fix to hammer
> > > doesn't work right and potentially corrupts data when
> > > the min_read_recency_for_promote is set to greater than 1.
> > >
> > > mike
> > >
> > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell  wrote:
> > > robert and i have done some further investigation the past couple days
> > > on this. we have a test environment with a hard drive tier and an ssd
> > > tier as a cache. several vms were created with volumes from the ceph
> > > cluster. i did a test in each guest where i un-tarred the linux kernel
> > > source multiple times and then did a md5sum check against all of the
> > > files in the resulting source tree. i started off with the monitors and
> > > osds running 0.94.5 and never saw any problems.
> > >
> > > a single node was then upgraded to 0.94.6 which has osds in both the ssd
> > > and hard drive tier. i then proceeded to run the same test and, while
> > > the untar and md5sum operations were running, i changed the ssd tier
> > > cache-mode from forward to writeback. almost immediately the vms started
> > > reporting io errors and odd data corruption. the remainder of the
> > > cluster was updated to 0.94.6, including the monitors, and the same
> > > thing happened.
> > >
> > > things were cleaned up and reset and then a test was run where
> > > min_read_recency_for_promote for the ssd cache pool was set to 1. we
> > > previously had it set to 6. there was never an error with the recency
> > > setting set to 1. i then tested with it set to 2 and it immediately
> > > caused failures. we are currently thinking that it is related to the
> > > backport of the fix for the recency promotion and are in progress of
> > > making a .6 build without that backport to see if we can cause
> > > corruption. is anyone using a version from after the original recency
> > > fix (PR 6702) with a cache tier in writeback mode? anyone have a similar
> > > problem?
> > >
> > > mike
> > >
> > > On Mon, Mar

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Also, is this ceph_test_rados rewriting objects quickly? I think that
the issue is with rewriting objects so if we can tailor the
ceph_test_rados to do that, it might be easier to reproduce.
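
For anyone trying to reproduce this, the writeback/forward and recency
toggling described below boils down to something like the following while
ceph_test_rados is running (a sketch; the cache-pool name is a placeholder):

  # flip the cache tier between writeback and forward while the test runs
  ceph osd tier cache-mode <cache-pool> forward
  ceph osd tier cache-mode <cache-pool> writeback
  # and move the recency setting between the apparently-safe and bad values
  ceph osd pool set <cache-pool> min_read_recency_for_promote 1
  ceph osd pool set <cache-pool> min_read_recency_for_promote 2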

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  wrote:
> I'll miss the Ceph community as well. There were a few things I really
> wanted to work on with Ceph.
>
> I got this:
>
> update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
> dirty exists
> 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
> 1040:  finishing write tid 1 to nodez23350-256
> 1040:  finishing write tid 2 to nodez23350-256
> 1040:  finishing write tid 3 to nodez23350-256
> 1040:  finishing write tid 4 to nodez23350-256
> 1040:  finishing write tid 6 to nodez23350-256
> 1035: done (4 left)
> 1037: done (3 left)
> 1038: done (2 left)
> 1043: read oid 430 snap -1
> 1043:  expect (ObjNum 429 snap 0 seq_num 429)
> 1040:  finishing write tid 7 to nodez23350-256
> update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
> dirty exists
> 1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
> 1042:  expect (ObjNum 664 snap 0 seq_num 664)
> 1043: Error: oid 430 read returned error code -2
> ./test/osd/RadosModel.h: In function 'virtual void
> ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
> 2016-03-17 10:47:19.085414
> ./test/osd/RadosModel.h: 1109: FAILED assert(0)
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x76) [0x4db956]
> 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
> 3: (()+0x9791d) [0x7fa1d472191d]
> 4: (()+0x72519) [0x7fa1d46fc519]
> 5: (()+0x13c178) [0x7fa1d47c6178]
> 6: (()+0x80a4) [0x7fa1d425a0a4]
> 7: (clone()+0x6d) [0x7fa1d2bd504d]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> Aborted
>
> I had to toggle writeback/forward and min_read_recency_for_promote a
> few times to get it, but I don't know if that is because I only have one
> job running. Even with six jobs running, it is not easy to trigger
> with ceph_test_rados, but it is almost instant in the RBD VMs.
>
> Here are the six run crashes (I have about the last 2000 lines of each
> if needed):
>
> nodev:
> update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num
> 1014) dirty exists
> 1015:  left oid 1015 (ObjNum 1014 snap 0 seq_num 1014)
> 1016:  finishing write tid 1 to nodev21799-1016
> 1016:  finishing write tid 2 to nodev21799-1016
> 1016:  finishing write tid 3 to nodev21799-1016
> 1016:  finishing write tid 4 to nodev21799-1016
> 1016:  finishing write tid 6 to nodev21799-1016
> 1016:  finishing write tid 7 to nodev21799-1016
> update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num
> 1015) dirty exists
> 1016:  left oid 1016 (ObjNum 1015 snap 0 seq_num 1015)
> 1017:  finishing write tid 1 to nodev21799-1017
> 1017:  finishing write tid 2 to nodev21799-1017
> 1017:  finishing write tid 3 to nodev21799-1017
> 1017:  finishing write tid 5 to nodev21799-1017
> 1017:  finishing write tid 6 to nodev21799-1017
> update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num
> 1016) dirty exists
> 1017:  left oid 1017 (ObjNum 1016 snap 0 seq_num 1016)
> 1018:  finishing write tid 1 to nodev21799-1018
> 1018:  finishing write tid 2 to nodev21799-1018
> 1018:  finishing write tid 3 to nodev21799-1018
> 1018:  finishing write tid 4 to nodev21799-1018
> 1018:  finishing write tid 6 to nodev21799-1018
> 1018:  finishing write tid 7 to nodev21799-1018
> update_object_version oid 1018 v 1093 (ObjNum 1017 snap 0 seq_num
> 1017) dirty exists
> 1018:  left oid 1018 (ObjNum 1017 snap 0 seq_num 1017)
> 1019:  finishing write tid 1 to nodev21799-1019
> 1019:  finishing write tid 2 to nodev21799-1019
> 1019:  finishing write tid 3 to nodev21799-1019
> 1019:  finishing write tid 5 to nodev21799-1019
> 1019:  finishing write tid 6 to nodev21799-1019
> update_object_version oid 1019 v 462 (ObjNum 1018 snap 0 seq_num 1018)
> dirty exists
> 1019:  left oid 1019 (ObjNum 1018 snap 0 seq_num 1018)
> 1021:  finishing write tid 1 to nodev21799-1021
> 1020:  finishing write tid 1 to nodev21799-1020
> 1020:  finishing write tid 2 to nodev21799-1020
> 1020:  finishing write tid 3 to nodev21799-1020
> 1020:  finishing write tid 5 to nodev21799-1020
> 1020:  finishing write tid 6 to nodev21799-1020
> update_object_version oid 1020 v 1287 (ObjNum 1019 snap 0 seq_num
> 1019) dirty exists
> 1020:  left oid 1020 (ObjNum 1019 snap 0 seq_num 1019)
> 1021:  finishing write tid 2 to nodev21799-1021
> 1021:  finishing write tid 3 to nodev21799-1021
> 1021:  finishing write tid 5 to nodev21799-1021
> 1021:  finishing write tid 6 to nodev21799-1021
> update_object_version oid 1021 v 1077 (ObjNum 1020 snap 0 seq_num
> 1020) dirty exists
>

Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Heath Albritton
Neither of these file systems is recommended for production use underlying an 
OSD.  The general direction for ceph is to move away from having a file system 
at all.

That effort is called "bluestore" and is supposed to show up in the jewel 
release.

-H

> On Mar 18, 2016, at 11:15, Schlacta, Christ  wrote:
> 
> Insofar as I've been able to tell, both BTRFS and ZFS provide similar
> capabilities back to CEPH, and both are sufficiently stable for the
> basic CEPH use case (Single disk -> single mount point), so the
> question becomes this:  Which actually provides better performance?
> Which is the more highly optimized single write path for ceph?  Does
> anybody have a handful of side-by-side benchmarks?  I'm more
> interested in higher IOPS, since you can always scale-out throughput,
> but throughput is also important.


Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-03-19 Thread Heath Albritton
The rule of thumb is to match the journal throughput to the OSD throughput.
I'm seeing ~180MB/s sequential write on my OSDs and I'm using one of the P3700
400GB units per six OSDs.  The 400GB P3700 yields around 1200MB/s* and has
around 1/10th the latency of any SATA SSD I've tested.

I put a pair of them in a 12-drive chassis and get excellent performance.  One
could probably do the same in an 18-drive chassis without any issues.  Failure
domain for a journal starts to get pretty large at that point.  I have dozens
of the "Fultondale" SSDs deployed and have had zero failures.  Endurance is
excellent, etc.

*the larger units yield much better write throughput but don't make sense
financially for journals.
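
As an aside, a quick way to sanity-check a candidate journal device's sync
write throughput is an fio run along these lines (a sketch only; the device
path is a placeholder and the test overwrites whatever is on it):

  # sequential O_DSYNC writes, roughly the shape of the filestore journal load
  fio --name=journal-test --filename=/dev/nvme0n1 --direct=1 --sync=1 \
      --rw=write --bs=1M --numjobs=1 --iodepth=1 \
      --runtime=60 --time_based --group_reporting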

-H




Re: [ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-19 Thread Dan van der Ster
Hi,

Is there a tracker for this? We just hit the same problem on 10.0.5.

Cheers, Dan

# rpm -q ceph
ceph-10.0.5-0.el7.x86_64

# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

# ceph-disk -v prepare /dev/sdc
DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph
--show-config-value=fsid
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_mkfs_type
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_mkfs_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_fs_mkfs_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_mount_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph
--show-config-value=osd_journal_size
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_cryptsetup_parameters
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_dmcrypt_key_size
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph
--name=osd. --lookup osd_dmcrypt_type
DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
INFO:ceph-disk:Will colocate journal with data on /dev/sdc
DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
DEBUG:ceph-disk:Creating journal partition num 2 size 20480 on /dev/sdc
INFO:ceph-disk:Running command: /usr/sbin/sgdisk --new=2:0:20480M
--change-name=2:ceph journal
--partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt --
/dev/sdc
The operation has completed successfully.
DEBUG:ceph-disk:Calling partprobe on prepared device /dev/sdc
INFO:ceph-disk:Running command: /usr/bin/udevadm settle
INFO:ceph-disk:Running command: /usr/sbin/partprobe /dev/sdc
Error: Error informing the kernel about modifications to partition
/dev/sdc2 -- Device or resource busy.  This means Linux won't know
about any changes you made to /dev/sdc2 until you reboot -- so you
shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 2 (Device or resource busy)
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 3528, in 
main(sys.argv[1:])
  File "/usr/sbin/ceph-disk", line 3482, in main
args.func(args)
  File "/usr/sbin/ceph-disk", line 1817, in main_prepare
luks=luks
  File "/usr/sbin/ceph-disk", line 1447, in prepare_journal
return prepare_journal_dev(data, journal, journal_size,
journal_uuid, journal_dm_keypath, cryptsetup_parameters, luks)
  File "/usr/sbin/ceph-disk", line 1401, in prepare_journal_dev
raise Error(e)
__main__.Error: Error: Command '['/usr/sbin/partprobe', '/dev/sdc']'
returned non-zero exit status 1

On Tue, Mar 15, 2016 at 8:38 PM, Vasu Kulkarni  wrote:
> Thanks for the steps, that should be enough to test it out. I hope you got
> the latest ceph-deploy, either from pip or through github.
>
> On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord 
> wrote:
>>
>> I would have to nuke my cluster right now, and I do not have a spare one..
>>
>> The procedure though is literally this, given a 3 node redhat 7.2 cluster,
>> ceph00, ceph01 and ceph02
>>
>> ceph-deploy install --testing ceph00 ceph01 ceph02
>> ceph-deploy new ceph00 ceph01 ceph02
>>
>> ceph-deploy mon create  ceph00 ceph01 ceph02
>> ceph-deploy gatherkeys  ceph00
>>
>> ceph-deploy osd create ceph00:sdb:/dev/sdi
>> ceph-deploy osd create ceph00:sdc:/dev/sdi
>>
>> All devices have their partition tables wiped before this. They are all
>> just SATA devices, no special devices in the way.
>>
>> sdi is an ssd and it is being carved up for journals. The first osd create
>> works, the second one gets stuck in a loop in the update_partition call in
>> ceph_disk for the 5 iterations before it gives up. When I look in
>> /sys/block/sdi the partition for the first osd is visible, the one for the
>> second is not. However looking at /proc/partitions it sees the correct
>> thing. So something about partprobe is not kicking udev into doing the right
>> thing when the second partition is added I suspect.
>>
>> If I do not use the separate journal device then it usually works, but
>> occasionally I see a single retry in that same loop.
>>
>> There is code in ceph_deploy which uses partprobe or partx depending on
>> which distro it detects, that is how I worked out what to change here.
>>
>> If I have to tear things down again I will reproduce and post here.
>>
>> Steve
>>
>> > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni  wrote:
>> >
>> > Do you mind giving the full failed logs somewhere

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Basically, the lookup process is:

try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/DIR_7...doesn't exist
try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/...doesn't exist
try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/...doesn't exist
try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/...does exist, object must be here

If DIR_E did not exist, then it would check DIR_9/DIR_5/DIR_4/DIR_D
and so on.  The hash is always 32 bit (8 hex digits) -- baked into the
rados object distribution algorithms.  When DIR_E hits the threshold
(320 iirc), the objects (files) in that directory will be moved one
more directory deeper.  An object with hash 79CED459 would then be in
DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/.

Basically, the depth of the tree is dynamic.  The file will be in the
deepest existing path that matches the hash (might even be different
between replicas, the tree structure is purely internal to the
filestore).
-Sam
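
A small shell sketch of that lookup, using the hash and PG from this thread
(the OSD path is illustrative only):

  # Walk the DIR_ levels for hash 79CED459: take the hex digits in reverse
  # order and descend while the matching DIR_<digit> exists; the object lives
  # in the deepest directory reached.
  hash=79CED459
  path=/var/lib/ceph/osd/ceph-307/current/70.459s0_head
  for c in $(echo "$hash" | rev | fold -w1); do
      [ -d "$path/DIR_$c" ] || break
      path="$path/DIR_$c"
  done
  echo "files for hash $hash should be under: $path"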

On Wed, Mar 16, 2016 at 10:46 AM, Jeffrey McDonald  wrote:
> OK, I think I have it now. I do have one more question: in this case, the
> hash indicates the directory structure, but how do I know from the hash how
> many levels I should go down? If the hash is a 32-bit hex integer, *how do I
> know how many should be included as part of the hash for the directory
> structure*?
>
> e.g. our example: the hash is 79CED459 and the directory is then the last
> five taken in reverse order; what happens if there are only 4 levels of
> hierarchy? I only have this one example so far. Is the 79C of the hash
> constant? Would the hash pick up another hex character if the pg splits
> again?
>
> Thanks,
> Jeff
>
> On Wed, Mar 16, 2016 at 10:24 AM, Samuel Just  wrote:
>>
>> There is a directory structure hash, it's just that it's at the end of
>> the name and you'll have to check the xattr I mentioned to find it.
>>
>> I think that file is actually the one we are talking about removing.
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> user.cephos.lfn3:
>>
>> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0
>>
>> Notice that the user.cephos.lfn3 attr has the full name, and it
>> *does* have a hash 79CED459 (you referred to it as a directory hash I
>> think, but it's actually the hash we used to place it on this osd to
>> begin with).
>>
>> In specifically this case, you shouldn't find any files in the
>> DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories
>> (so all hash values should hash to one of those).
>>
>> The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's
>> the actual object file, don't remove that.  If you look at the attr:
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> user.cephos.lfn3:
>>
>> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0
>>
>> The hash is 79CED459, which means that (assuming
>> DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the
>> right place.
>>
>> The ENOENT return
>>
>> 2016-03-07 16:11:41.828332 7ff30cdad700 10
>> filestore(/var/lib/ceph/osd/ceph-307) remove
>>
>> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
>> = -2
>> 2016-03-07 21:44:02.197676 7fe96b56f700 10
>> filestore(/var/lib/ceph/osd/ceph-307) remove
>>
>> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
>> = -2
>>
>> actually was a symptom in this case, but, in general, it's not
>> indicative of anything -- the filestore can get ENOENT return values
>> for legitimate reasons.
>>
>> To reiterate: files that end in something like
>> fa202ec9b4b3b217275a_0_long are *not* necessarily orphans -- you need
>> to check the user.cephos.lfn3 attr (as you did before) for the full
>> length file name and determine whether the file is in the right place.
>

[ceph-users] Single key delete performance against increasing bucket size

2016-03-19 Thread Robin H. Johnson
On Wed, Mar 16, 2016 at 06:36:33AM +0000, Pavan Rallabhandi wrote:
> I find this to have been discussed here before, but couldn't find any
> solution, hence the mail. In RGW, for a bucket holding objects in the range
> of ~millions, one can find it taking forever to delete the bucket (via
> radosgw-admin). I understand the gc (and its parameters) would reclaim the
> space eventually, but am looking more at the bucket deletion options that
> can possibly speed up the operation.
This ties well into a mail I had sitting in my drafts, but never got
around to sending.

Whilst doing some rough benchmarking on bucket index sharding, I ran
into some terrible performance for key deletion on non-existent keys.

Shards did NOT alleviate this performance issue, but did help elsewhere.
Numbers given below are for unsharded buckets; relatively empty buckets
perform worse when sharded, before performance picks up again.

Test methodology:
- Fire single DELETE key ops to the RGW; not using multi-object delete. 
- I measured the time taken for each delete, and report it here for the
  99% percentile (1% of operations took longer than this). 
- I took at least 1K samples for #keys up to and including 10k keys per
  bucket. For 50k keys/bucket I capped it to the first 100 samples
  instead of waiting 10 hours for the run to complete.
- The DELETE operations were run single-threaded, with no concurrency.
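
For comparison, one rough way to collect this kind of per-op latency is shown
below. It is a sketch, not the tooling used for the numbers that follow, and
it assumes a configured s3cmd plus a throwaway bucket name:

  # time 1000 single-key DELETEs and pull out the 99th percentile
  BUCKET=bench-bucket
  : > /tmp/del-times.txt
  for i in $(seq 1 1000); do
      /usr/bin/time -f "%e" -a -o /tmp/del-times.txt \
          s3cmd --no-progress del "s3://$BUCKET/key-$i" >/dev/null 2>&1
  done
  sort -n /tmp/del-times.txt | awk '{a[NR]=$1} END {print "p99:", a[int(NR*0.99)] "s"}'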

Test environments:
Clusters were both running Hammer 0.94.5 on Ubuntu precise; the hardware is a
long way from being new; there are no SSDs, and the journal is the first
partition on each OSD's disk. The test source host was unloaded, and approx
1ms of latency away from the RGWs.

Cluster 1 (Congress, ~1350 OSDs; production cluster; haproxy of 10 RGWs)
#keys-in-bucket   time per single key delete
0                 6.899ms
10                7.507ms
100               13.573ms
1000              327.936ms
10000             4825.597ms
50000             33802.497ms
100000            did-not-finish

Cluster 2 (Benjamin, ~50 OSDs; test cluster, practically idle; haproxy of 2 RGWs)
#keys-in-bucket   time per single key delete
0                 4.825ms
10                6.749ms
100               6.146ms
1000              6.816ms
10000             1233.727ms
50000             64262.764ms
100000            did-not-finish

The cases marked did-not-finish are where the RGW seems to time out the
operation even with the client having an unlimited timeout. This also occurred
when connecting directly to CivetWeb rather than through HAProxy.

I'm not sure why the 100-keys case on the second cluster seems to have
been faster than the 10-key case, but I'm willing to put it down to
statistical noise.

The huge increase at the end, and the operation not returning at all at 100k
keys, is concerning.

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: [ceph-users] v10.0.4 released

2016-03-19 Thread Loic Dachary
Hi,

Because of a tiny mistake that prevented deb packages from being built, v10.0.5
was released shortly after v10.0.4 and is now the current development release.
The Stable release team[0] collectively decided to help by publishing
development packages[1], starting with v10.0.5.

The packages for v10.0.5 are available at http://ceph-releases.dachary.org/ 
which can be used as a replacement for http://download.ceph.com/ for both 
http://download.ceph.com/rpm-testing and 
http://download.ceph.com/debian-testing . The only difference is the key used 
to sign the releases which can be imported with

wget -q -O- 'http://ceph-releases.dachary.org/release-key.asc' | sudo apt-key add -

or 

rpm --import http://ceph-releases.dachary.org/release-key.asc

The instructions to install development packages found at 
http://docs.ceph.com/docs/master/install/get-packages/ can otherwise be applied 
with no change.

Cheers

[0] Stable release team 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#Whos-who
[1] Publishing development releases 
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/30126

On 08/03/2016 22:35, Sage Weil wrote:
> This is the fourth and last development release before Jewel. The next 
> release will be a release candidate with the final set of features. Big 
> items include RGW static website support, librbd journal framework, fixed 
> mon sync of config-key data, C++11 updates, and bluestore/kstore.
> 
> Note that, due to general developer busyness, we aren’t building official 
> release packages for this dev release. You can fetch autobuilt gitbuilder 
> packages from the usual location (http://gitbuilder.ceph.com).
> 
> Notable Changes
> ---
> 
> http://ceph.com/releases/v10-0-4-released/
> 
> Getting Ceph
> 
> 
> * Git at git://github.com/ceph/ceph.git
> * For packages, see 
> http://ceph.com/docs/master/install/get-packages#add-ceph-development
> * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[ceph-users] RBD hanging on some volumes of a pool

2016-03-19 Thread Adrien Gillard
Hi,

I have been facing issues with some of my RBD volumes since yesterday. Some of
them completely hang at some point before eventually resuming IO, whether a
few minutes or several hours later.

First and foremost, my setup: I already detailed it on the mailing list
[0][1]. Some changes have been made: the 3 monitors are now VMs and we are
trying kernel 4.4.5 on the clients (the cluster is still on 3.10, CentOS 7).

Using EC pools, I already had some trouble with RBD features not supported
by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the
hassle. Everything has been working pretty smoothly since.

All my volumes (currently 5) are on an EC pool with a writeback cache. Two of
them are perfectly fine. On the other 3 it's a different story: doing IO is
impossible; if I start a simple copy I get a new file of a few dozen MB (or
sometimes 0) and then it hangs. Doing dd with the direct and sync flags shows
the same behaviour.

I tried switching back to 3.10, no change. On the client I rebooted, I
currently cannot mount the filesystem; mount hangs (the volume seems correctly
mapped, however).

strace on the cp command freezes in the middle of a read:

11:17:56 write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 read(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 read(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 read(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536
11:17:56 read(3,


I tried to bump up the logging but I don't really know what to look for
exactly and didn't see anything obvious.

Any input or lead on how to debug this would be highly appreciated :)
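
In case it helps anyone else chasing the same symptom, a minimal checklist for
finding where a krbd request is stuck might look like this (angle-bracket names
are placeholders, and it assumes debugfs is mounted on the client):

  cat /sys/kernel/debug/ceph/*/osdc          # in-flight requests from this kernel client
  ceph -s                                    # does the cluster report blocked requests?
  ceph daemon osd.<id> dump_ops_in_flight    # run on the host of the OSD the op maps to
  ceph daemon osd.<id> dump_historic_ops     # recent slow ops on that OSD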

Adrien

[0] http://www.spinics.net/lists/ceph-users/msg23990.html
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-January/007004.html
[2]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007746.html


Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-03-19 Thread Nick Fisk
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Stephen Harker
> Sent: 16 March 2016 16:22
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which
> is better?
> 
> On 2016-02-17 11:07, Christian Balzer wrote:
> >
> > On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote:
> >
> >> > > Let's consider both cases:
> >> > > Journals on SSDs - for writes, the write operation returns right
> >> > > after data lands on the Journal's SSDs, but before it's written
> >> > > to the backing HDD. So, for writes, SSD journal approach should
> >> > > be comparable to having a SSD cache tier.
> >> > Not quite, see below.
> >> >
> >> >
> >> Could you elaborate a bit more?
> >>
> >> Are you saying that with a Journal on a SSD writes from clients,
> >> before they can return from the operation to the client, must end up
> >> on both the SSD (Journal) *and* HDD (actual data store behind that
> >> journal)?
> >
> > No, your initial statement is correct.
> >
> > However that burst of speed doesn't last indefinitely.
> >
> > Aside from the size of the journal (which is incidentally NOT the most
> > limiting factor) there are various "filestore" parameters in Ceph, in
> > particular the sync interval ones.
> > There was a more in-depth explanation by a developer about this in
> > this ML, try your google-foo.
> >
> > For short bursts of activity, the journal helps a LOT.
> > If you send a huge number of for example 4KB writes to your cluster,
> > the speed will eventually (after a few seconds) go down to what your
> > backing storage (HDDs) are capable of sustaining.
> >
> >> > (Which SSDs do you plan to use anyway?)
> >> >
> >>
> >> Intel DC S3700
> >>
> > Good choice, with the 200GB model prefer the 3700 over the 3710
> > (higher sequential write speed).
> 
> Hi All,
> 
> I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, each
> of which has 6 4TB SATA drives within. I had my eye on these:
> 
> 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
> 
> but reading through this thread, it might be better to go with the P3700
> given the improved iops. So a couple of questions.
> 
> * Are the PCI-E versions of these drives different in any other way than
> the interface?

Yes and no. Internally they are probably not much different, but the
NVMe/PCIe interface is a lot faster than SATA/SAS, both in terms of minimum
latency and bandwidth.

> 
> * Would one of these as a journal for 6 4TB OSDs be overkill (connectivity
> is 10GE, or will be shortly anyway), would the SATA S3700 be sufficient?

Again depends on your use case. The S3700 may suffer if you are doing large
sequential writes, it might not have a high enough sequential write speed
and will become the bottleneck. 6 Disks could potentially take around
500-700MB/s of writes. A P3700 will have enough and will give slightly lower
write latency as well if this is important. You may even be able to run more
than 6 disk OSD's on it if needed.

> 
> Given they're not hot-swappable, it'd be good if they didn't wear out in
> 6 months too.

Probably won't unless you are doing some really extreme write workloads and
even then I would imagine they would last 1-2 years.

> 
> I realise I've not given you much to go on and I'm Googling around as well,
> I'm really just asking in case someone has tried this already and has some
> feedback or advice..

That's ok, I'm currently running S3700 100GB's on current cluster and new
cluster that's in planning stages will be using the 400Gb P3700's.

> 
> Thanks! :)
> 
> Stephen
> 
> --
> Stephen Harker
> Chief Technology Officer
> The Positive Internet Company.
> 
> --
> All postal correspondence to:
> The Positive Internet Company, 24 Ganton Street, London. W1F 7QY
> 
> *Follow us on Twitter* @posipeople
> 
> The Positive Internet Company Limited is registered in England and Wales.
> Registered company number: 3673639. VAT no: 726 7072 28.
> Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.


Re: [ceph-users] ssd only storage and ceph

2016-03-19 Thread Jan Schermer



> On 17 Mar 2016, at 17:28, Erik Schwalbe  wrote:
> 
> Hi,
> 
> At the moment I'm doing some tests with SSDs and Ceph.
> My question is: how should I mount an SSD OSD? With or without the discard option?

I recommend running without discard but running the "fstrim" command every now
and then (depends on how fast your SSD is - some SSDs hang for quite a while
when fstrim is run on them, so test it).
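
A minimal sketch of that approach (the paths and schedule are assumptions
based on the default filestore layout, not something from this thread):

  # /etc/cron.d/fstrim-osds -- weekly trim of each OSD's filestore mount
  0 3 * * 7  root  for d in /var/lib/ceph/osd/ceph-*; do fstrim -v "$d"; done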

> Where should I do the fstrim, when I mount the OSD without discard? On the 
> ceph storage node? Inside the vm, running on rbd?
> 

Discard on the SSD itself makes garbage collection easier - that might make the
SSD faster and let it last longer (how much faster and how much longer depends
on the SSD; generally, if you use DC-class SSDs you won't notice anything).
Discard in the VM (assuming everything supports it) makes thin-provisioning
more effective, but you (IMO) need virtio-scsi for that. I have no real-life
experience with whether Ceph actually frees the unneeded space even if you make
it work...



> What is the best practice there.
> 
> Thanks for your answers.
> 
> Regards,
> Erik
> 


Re: [ceph-users] v0.94.6 Hammer released

2016-03-19 Thread Chris Dunlop
Hi Chen,

On Thu, Mar 17, 2016 at 12:40:28AM +0000, Chen, Xiaoxi wrote:
> It’s already there, in 
> http://download.ceph.com/debian-hammer/pool/main/c/ceph/.

I can only see ceph*_0.94.6-1~bpo80+1_amd64.deb there. Debian wheezy would
be bpo70.

Cheers,

Chris

> On 3/17/16, 7:20 AM, "Chris Dunlop"  wrote:
> 
>> Hi Stable Release Team for v0.94,
>>
>> On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote:
>>> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
 I think you misread what Sage wrote : "The intention was to
 continue building stable releases (0.94.x) on the old list of
 supported platforms (which includes 12.04 and el6)". In other
 words, the old OS'es are still supported. Their absence is a
 glitch in the release process that will be fixed.
>>> 
>>> Any news on a release of v0.94.6 for debian wheezy?
>>
>> Any news on a release of v0.94.6 for debian wheezy?
>>
>> Cheers,
>>
>> Chris


Re: [ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-19 Thread Dan van der Ster
Hi,

It's true, partprobe works intermittently. I extracted the key
commands to show the problem:

[18:44]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph
journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt --
/dev/sdc
The operation has completed successfully.
[18:44]# partprobe /dev/sdc
Error: Error informing the kernel about modifications to partition
/dev/sdc2 -- Device or resource busy.  This means Linux won't know
about any changes you made to /dev/sdc2 until you reboot -- so you
shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 2 (Device or resource busy)
[18:44]# partprobe /dev/sdc
[18:44]# partprobe /dev/sdc
Error: Error informing the kernel about modifications to partition
/dev/sdc2 -- Device or resource busy.  This means Linux won't know
about any changes you made to /dev/sdc2 until you reboot -- so you
shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 2 (Device or resource busy)
[18:44]# partprobe /dev/sdc
Error: Error informing the kernel about modifications to partition
/dev/sdc2 -- Device or resource busy.  This means Linux won't know
about any changes you made to /dev/sdc2 until you reboot -- so you
shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 2 (Device or resource busy)

But partx works every time:

[18:46]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph
journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt --
/dev/sdd
The operation has completed successfully.
[18:46]# partx -u /dev/sdd
[18:46]# partx -u /dev/sdd
[18:46]# partx -u /dev/sdd
[18:46]#

-- Dan

On Thu, Mar 17, 2016 at 6:31 PM, Vasu Kulkarni  wrote:
> I can raise a tracker for this issue since it looks like an intermittent
> issue and mostly dependent on specific hardware, or it would be better if you
> add all the hardware/OS details in tracker.ceph.com. Also, from your logs it
> looks like you have a resource-busy issue: Error: Failed to add partition 2
> (Device or resource busy)
>
>  From my test run logs on centos 7.2 , 10.0.5 (
> http://qa-proxy.ceph.com/teuthology/vasu-2016-03-15_15:34:41-selinux-master---basic-mira/62626/teuthology.log
> )
>
> 2016-03-15T18:49:56.305
> INFO:teuthology.orchestra.run.mira041.stderr:[ceph_deploy.osd][DEBUG ]
> Preparing host mira041 disk /dev/sdb journal None activate True
> 2016-03-15T18:49:56.305
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][DEBUG ] find the
> location of an executable
> 2016-03-15T18:49:56.309
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][INFO  ] Running
> command: sudo /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs --
> /dev/sdb
> 2016-03-15T18:49:56.546
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
> 2016-03-15T18:49:56.611
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster
> ceph
> 2016-03-15T18:49:56.643
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster ceph
> 2016-03-15T18:49:56.708
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster ceph
> 2016-03-15T18:49:56.708
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid:
> get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> 2016-03-15T18:49:56.709
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] set_type:
> Will colocate journal with data on /dev/sdb
> 2016-03-15T18:49:56.709
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-osd --cluster=ceph
> --show-config-value=osd_journal_size
> 2016-03-15T18:49:56.774
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid:
> get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> 2016-03-15T18:49:56.774
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid:
> get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> 2016-03-15T18:49:56.775
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid:
> get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> 2016-03-15T18:49:56.775
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup
> osd_mkfs_options_xfs
> 2016-03-15T18:49:56.777
> INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup
> osd_fs_mkfs_options_xfs
> 2016-03-15T18:49:56.809
> INFO

Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-03-19 Thread Stephen Harker
Thanks all for your suggestions and advice. I'll let you know how it 
goes :)


Stephen


[ceph-users] radosgw_agent sync issues

2016-03-19 Thread ceph new
Hi,
I set up 2 clusters and I'm using radosgw_agent to sync them. Last week the
sync stopped working. If I run the agent from the command line I can see it's
stuck on 2 files, and in the console I'm getting:
2016-03-17 21:11:57,391 14323 [radosgw_agent.worker][DEBUG ] op state is []
2016-03-17 21:11:57,391 14323 [radosgw_agent.worker][DEBUG ] error geting
op state: list index out of range
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/radosgw_agent/worker.py", line
275, in wait_for_object
state = state[0]['state']

and in the log i see :

2016-03-17 21:38:53,221 30848 [boto][DEBUG ] Signature:
AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk=
2016-03-17 21:38:53,221 30848 [boto][DEBUG ] url = '
http://s3-us-west.test.com/admin/opstate'
params={'client-id': 'radosgw-agent', 'object':
u'test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg',
'op-id': 'nyprceph1.ops.test.com:30568:135'}
headers={'Date': 'Thu, 17 Mar 2016 21:38:53 GMT', 'Content-Length': '0',
'Authorization': u'AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk=',
'User-Agent': 'Boto/2.38.0 Python/2.6.6 Linux/2.6.32-504.8.1.el6.x86_64'}
data=None
2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Method: GET
2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Path:
/admin/opstate?client-id=radosgw-agent&object=test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg&op-id=
nyprceph1.ops.test.com%3A30568%3A135
2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Data:
2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Headers: {}
2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Host: s3-us-west.test.com
2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Port: 80
2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Params: {'client-id':
'radosgw-agent', 'object':
'test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg', 'op-id':
'nyprceph1.ops.test.com%3A30568%3A135'}
2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Token: None
2016-03-17 21:38:53,223 30848 [boto][DEBUG ] StringToSign:
GET


Thu, 17 Mar 2016 21:38:53 GMT
/admin/opstate
2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Signature:
AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk=
2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Final headers: {'Date': 'Thu,
17 Mar 2016 21:38:53 GMT', 'Content-Length': '0', 'Authorization': u'AWS
WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk=', 'User-Agent':
'Boto/2.38.0 Python/2.6.6 Linux/2.6.32-504.8.1.el6.x86_64'}
2016-03-17 21:38:53,298 30848 [boto][DEBUG ] Response headers: [('date',
'Thu, 17 Mar 2016 21:38:53 GMT'), ('content-length', '2'),
('x-amz-request-id', 'tx00019c09c-0056eb23ed-f149c-us-west')]
2016-03-17 21:38:53,369 30848 [radosgw_agent.worker][DEBUG ] op state is []
2016-03-17 21:38:53,369 30848 [radosgw_agent.worker][DEBUG ] error geting
op state: list index out of range
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/radosgw_agent/worker.py", line
275, in wait_for_object
state = state[0]['state']
IndexError: list index out of range



I can download the file from the master and upload it to the slave and rerun
the sync, but it still didn't work.
Is there any way to skip the file and get the sync done? (i.e. just remove it
and re-upload it under a new name?)
Does this need to be fixed on the master side or the slave side?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs] About feature 'snapshot'

2016-03-19 Thread 施柏安
Hi John,
How do I turn this feature on?

Thank you

2016-03-17 21:41 GMT+08:00 Gregory Farnum :

> On Thu, Mar 17, 2016 at 3:49 AM, John Spray  wrote:
> > Snapshots are disabled by default:
> >
> http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration
>
> Which makes me wonder if we ought to be hiding the .snaps directory
> entirely in that case. I haven't previously thought about that, but it
> *is* a bit weird.
> -Greg
>
> >
> > John
> >
> > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安  wrote:
> >> Hi all,
> >> I've run into a problem with cephfs snapshots. It seems that the folder
> >> '.snap' exists, but 'll -a' won't show it. And when I enter that folder
> >> and create a folder in it, it reports an error about using snapshots.
> >>
> >> Please check : http://imgur.com/elZhQvD
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD/Ceph as Physical boot volume

2016-03-19 Thread Schlacta, Christ
I posted about this a while ago, and someone else has since inquired,
but I am seriously wanting to know if anybody has figured out how to
boot from an RBD device yet using iPXE or similar.  Last I read,
loading the kernel and initrd from object storage would be
theoretically easy, and would only require making an initramfs to
initialize and mount the rbd..  But I couldn't find any documented
instances of anybody having done this yet..  So..  Has anybody done
this yet?  If so, which distros is it working on, and where can I find
more info?
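(For context, the initramfs piece I mean would be roughly the following -- an
untested sketch using the kernel rbd module's sysfs interface, where the
monitor address, key, pool and image name are all placeholders:

  modprobe rbd
  echo "10.0.0.1:6789 name=admin,secret=<key> rbd rootimage" > /sys/bus/rbd/add
  mount /dev/rbd0 /new_root

i.e. map the image, mount it, and switch_root into it as usual.)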
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does object map feature lock snapshots ?

2016-03-19 Thread Christoph Adomeit
Hi,

I had no special logging activated.

Today I re-enabled exclusive-lock object-map and fast-diff on an image in 9.2.1

As soon as I ran an rbd export-diff I had lots of these error messages on the 
console of the rbd export process:

2016-03-18 11:18:21.546658 7f77245d1700  1 heartbeat_map is_healthy 
'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
2016-03-18 11:18:26.546750 7f77245d1700  1 heartbeat_map is_healthy 
'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
2016-03-18 11:18:31.546840 7f77245d1700  1 heartbeat_map is_healthy 
'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
2016-03-18 11:18:36.546928 7f77245d1700  1 heartbeat_map is_healthy 
'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
2016-03-18 11:18:41.547017 7f77245d1700  1 heartbeat_map is_healthy 
'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
2016-03-18 11:18:46.547105 7f77245d1700  1 heartbeat_map is_healthy 
'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60


Is this a known issue ? 





On Tue, Mar 08, 2016 at 11:22:17AM -0500, Jason Dillaman wrote:
> Is there anyway for you to provide debug logs (i.e. debug rbd = 20) from your 
> rbd CLI and qemu process when you attempt to create a snapshot?  In v9.2.0, 
> there was an issue [1] where the cache flush writeback from the snap create 
> request was being blocked when the exclusive lock feature was enabled, but 
> that should have been fixed in v9.2.1.
> 
> [1] http://tracker.ceph.com/issues/14542
> 
> -- 
> 
> Jason Dillaman 
> 
> 
> - Original Message -
> > From: "Christoph Adomeit" 
> > To: ceph-us...@ceph.com
> > Sent: Tuesday, March 8, 2016 11:13:04 AM
> > Subject: [ceph-users] Does object map feature lock snapshots ?
> > 
> > Hi,
> > 
> > i have installed ceph 9.21 on proxmox with kernel 4.2.8-1-pve.
> > 
> > Afterwards I have enabled the features:
> > 
> > rbd feature enable $IMG exclusive-lock
> > rbd feature enable $IMG object-map
> > rbd feature enable $IMG fast-diff
> > 
> > 
> > During the night I have a cronjob which does a rbd snap create on each
> > of my images and then an rbd export-diff
> > 
> > I found out that my cronjob was hanging during the rbd snap create and
> > does not create the snapshot.
> > 
> > Also more worse, sometimes also the vms were hanging.
> > 
> > What are your experiences with object maps ? For me it looks that they
> > are not yet production ready.
> > 
> > Thanks
> >   Christoph
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Jeffrey McDonald
Great, I just recovered the first placement group from this error.   To be
sure, I  ran a deep-scrub and that comes back clean.

Thanks for all your help.
Regards,
Jeff

On Thu, Mar 17, 2016 at 11:58 AM, Samuel Just  wrote:

> Oh, it's getting a stat mismatch.  I think what happened is that on
> one of the earlier repairs it reset the stats to the wrong value (the
> orphan was causing the primary to scan two objects twice, which
> matches the stat mismatch I see here).  A pg repair repair will clear
> that up.
> -Sam
>
> On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald 
> wrote:
> > Thanks Sam.
> >
> > Since I have prepared a script for this, I decided to go ahead with the
> > checks.(patience isn't one of my extended attributes)
> >
> > I've got a file that searches the full erasure encoded spaces and does
> your
> > checklist below.   I have operated only on one PG so far, the 70.459 one
> > that we've been discussing.There was only the one file that I found
> to
> > be out of place--the one we already discussed/found and it has been
> removed.
> >
> > The pg is still marked as inconsistent.   I've scrubbed it a couple of
> times
> > now and what I've seen is:
> >
> > 2016-03-17 09:29:53.202818 7f2e816f8700  0 log_channel(cluster) log
> [INF] :
> > 70.459 deep-scrub starts
> > 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log
> [ERR] :
> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
> > 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log
> [ERR] :
> > 70.459 deep-scrub 1 errors
> > 2016-03-17 09:44:23.592302 7f2e816f8700  0 log_channel(cluster) log
> [INF] :
> > 70.459 deep-scrub starts
> > 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log
> [ERR] :
> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
> > 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log
> [ERR] :
> > 70.459 deep-scrub 1 errors
> >
> >
> > Should the scrub be sufficient to remove the inconsistent flag?   I took
> the
> > osd offline during the repairs.I've looked at files in all of the
> osds
> > in the placement group and I'm not finding any more problem files.The
> > vast majority of files do not have the user.cephos.lfn3 attribute.
> There
> > are 22321 objects that I seen and only about 230 have the
> user.cephos.lfn3
> > file attribute.   The files will have other attributes, just not
> > user.cephos.lfn3.
> >
> > Regards,
> > Jeff
> >
> >
> > On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just  wrote:
> >>
> >> Ok, like I said, most files with _long at the end are *not orphaned*.
> >> The generation number also is *not* an indication of whether the file
> >> is orphaned -- some of the orphaned files will have 
> >> as the generation number and others won't.  For each long filename
> >> object in a pg you would have to:
> >> 1) Pull the long name out of the attr
> >> 2) Parse the hash out of the long name
> >> 3) Turn that into a directory path
> >> 4) Determine whether the file is at the right place in the path
> >> 5) If not, remove it (or echo it to be checked)
> >>
> >> You probably want to wait for someone to get around to writing a
> >> branch for ceph-objectstore-tool.  Should happen in the next week or
> >> two.
> >> -Sam
> >>
> >
> > --
> >
> > Jeffrey McDonald, PhD
> > Assistant Director for HPC Operations
> > Minnesota Supercomputing Institute
> > University of Minnesota Twin Cities
> > 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> > 117 Pleasant St SE   phone: +1 612 625-6905
> > Minneapolis, MN 55455fax:   +1 612 624-8861
> >
> >
>



-- 

Jeffrey McDonald, PhD
Assistant Director for HPC Operations
Minnesota Supercomputing Institute
University of Minnesota Twin Cities
599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
117 Pleasant St SE   phone: +1 612 625-6905
Minneapolis, MN 55455fax:   +1 612 624-8861
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs] About feature 'snapshot'

2016-03-19 Thread John Spray
Snapshots are disabled by default:
http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration
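To actually turn them on (at your own risk), it's something like the following
-- going from memory, so treat it as a sketch and check the docs for your
release:

  ceph mds set allow_new_snaps true --yes-i-really-mean-it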

John

On Thu, Mar 17, 2016 at 10:02 AM, 施柏安  wrote:
> Hi all,
> I've run into a problem with cephfs snapshots. It seems that the folder
> '.snap' exists, but 'll -a' won't show it. And when I enter that folder and
> create a folder in it, it reports an error about using snapshots.
>
> Please check : http://imgur.com/elZhQvD
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Yep, thanks for all the help tracking down the root cause!
-Sam

On Thu, Mar 17, 2016 at 10:50 AM, Jeffrey McDonald  wrote:
> Great, I just recovered the first placement group from this error.   To be
> sure, I  ran a deep-scrub and that comes back clean.
>
> Thanks for all your help.
> Regards,
> Jeff
>
> On Thu, Mar 17, 2016 at 11:58 AM, Samuel Just  wrote:
>>
>> Oh, it's getting a stat mismatch.  I think what happened is that on
>> one of the earlier repairs it reset the stats to the wrong value (the
>> orphan was causing the primary to scan two objects twice, which
>> matches the stat mismatch I see here).  A pg repair repair will clear
>> that up.
>> -Sam
>>
>> On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald 
>> wrote:
>> > Thanks Sam.
>> >
>> > Since I have prepared a script for this, I decided to go ahead with the
>> > checks.(patience isn't one of my extended attributes)
>> >
>> > I've got a file that searches the full erasure encoded spaces and does
>> > your
>> > checklist below.   I have operated only on one PG so far, the 70.459 one
>> > that we've been discussing.There was only the one file that I found
>> > to
>> > be out of place--the one we already discussed/found and it has been
>> > removed.
>> >
>> > The pg is still marked as inconsistent.   I've scrubbed it a couple of
>> > times
>> > now and what I've seen is:
>> >
>> > 2016-03-17 09:29:53.202818 7f2e816f8700  0 log_channel(cluster) log
>> > [INF] :
>> > 70.459 deep-scrub starts
>> > 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
>> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
>> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
>> > 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459 deep-scrub 1 errors
>> > 2016-03-17 09:44:23.592302 7f2e816f8700  0 log_channel(cluster) log
>> > [INF] :
>> > 70.459 deep-scrub starts
>> > 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
>> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
>> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
>> > 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459 deep-scrub 1 errors
>> >
>> >
>> > Should the scrub be sufficient to remove the inconsistent flag?   I took
>> > the
>> > osd offline during the repairs.I've looked at files in all of the
>> > osds
>> > in the placement group and I'm not finding any more problem files.
>> > The
>> > vast majority of files do not have the user.cephos.lfn3 attribute.
>> > There
>> > are 22321 objects that I seen and only about 230 have the
>> > user.cephos.lfn3
>> > file attribute.   The files will have other attributes, just not
>> > user.cephos.lfn3.
>> >
>> > Regards,
>> > Jeff
>> >
>> >
>> > On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just  wrote:
>> >>
>> >> Ok, like I said, most files with _long at the end are *not orphaned*.
>> >> The generation number also is *not* an indication of whether the file
>> >> is orphaned -- some of the orphaned files will have 
>> >> as the generation number and others won't.  For each long filename
>> >> object in a pg you would have to:
>> >> 1) Pull the long name out of the attr
>> >> 2) Parse the hash out of the long name
>> >> 3) Turn that into a directory path
>> >> 4) Determine whether the file is at the right place in the path
>> >> 5) If not, remove it (or echo it to be checked)
>> >>
>> >> You probably want to wait for someone to get around to writing a
>> >> branch for ceph-objectstore-tool.  Should happen in the next week or
>> >> two.
>> >> -Sam
>> >>
>> >
>> > --
>> >
>> > Jeffrey McDonald, PhD
>> > Assistant Director for HPC Operations
>> > Minnesota Supercomputing Institute
>> > University of Minnesota Twin Cities
>> > 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
>> > 117 Pleasant St SE   phone: +1 612 625-6905
>> > Minneapolis, MN 55455fax:   +1 612 624-8861
>> >
>> >
>
>
>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs infernalis (ceph version 9.2.1) - bonnie++

2016-03-19 Thread Oliver Dzombic
Hi,

on ubuntu 14.04 client and centos 7.2 client with centos 7 Hammer

it's working without problems.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 19.03.2016 um 02:38 schrieb Michael Hanscho:
> Hi!
> 
> Trying to run bonnie++ on cephfs mounted via the kernel driver on a
> centos 7.2.1511 machine resulted in:
> 
> # bonnie++ -r 128 -u root -d /data/cephtest/bonnie2/
> Using uid:0, gid:0.
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...done
> Reading a byte at a time...done
> Reading intelligently...done
> start 'em...done...done...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...Bonnie: drastic I/O error (rmdir):
> Directory not empty
> Cleaning up test directory after error.
> 
> # ceph -w
> cluster 
>  health HEALTH_OK
>  monmap e3: 3 mons at
> {cestor4=:6789/0,cestor5=:6789/0,cestor6=:6789/0}
> election epoch 62, quorum 0,1,2 cestor4,cestor5,cestor6
>  mdsmap e30: 1/1/1 up {0=cestor2=up:active}, 1 up:standby
>  osdmap e703: 60 osds: 60 up, 60 in
> flags sortbitwise
>   pgmap v135437: 1344 pgs, 4 pools, 4315 GB data, 2315 kobjects
> 7262 GB used, 320 TB / 327 TB avail
> 1344 active+clean
> 
> Any ideas?
> 
> Gruesse
> Michael
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Irek Fasikhov
Hi, Nick

I switched between forward and writeback. (forward -> writeback)

С уважением, Фасихов Ирек Нургаязович
Моб.: +79229045757

2016-03-17 16:10 GMT+03:00 Nick Fisk :

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Irek Fasikhov
> > Sent: 17 March 2016 13:00
> > To: Sage Weil 
> > Cc: Robert LeBlanc ; ceph-users  > us...@lists.ceph.com>; Nick Fisk ; William Perkins
> > 
> > Subject: Re: [ceph-users] data corruption with hammer
> >
> > Hi,All.
> >
> > I confirm the problem. When min_read_recency_for_promote > 1, data
> > corruption occurs.
>
> But what scenario is this? Are you switching between forward and
> writeback, or just running in writeback?
>
> >
> >
> > С уважением, Фасихов Ирек Нургаязович
> > Моб.: +79229045757
> >
> > 2016-03-17 15:26 GMT+03:00 Sage Weil :
> > On Thu, 17 Mar 2016, Nick Fisk wrote:
> > > There's got to be something else going on here. All that PR does is to
> > > potentially delay the promotion to hit_set_period*recency instead of
> > > just doing it on the 2nd read regardless, it's got to be uncovering
> > > another bug.
> > >
> > > Do you see the same problem if the cache is in writeback mode before
> you
> > > start the unpacking. Ie is it the switching mid operation which causes
> > > the problem? If it only happens mid operation, does it still occur if
> > > you pause IO when you make the switch?
> > >
> > > Do you also see this if you perform on a RBD mount, to rule out any
> > > librbd/qemu weirdness?
> > >
> > > Do you know if it’s the actual data that is getting corrupted or if
> it's
> > > the FS metadata? I'm only wondering as unpacking should really only be
> > > writing to each object a couple of times, whereas FS metadata could
> > > potentially be being updated+read back lots of times for the same group
> > > of objects and ordering is very important.
> > >
> > > Thinking through it logically the only difference is that with
> recency=1
> > > the object will be copied up to the cache tier, where recency=6 it will
> > > be proxy read for a long time. If I had to guess I would say the issue
> > > would lie somewhere in the proxy read + writeback<->forward logic.
> >
> > That seems reasonable.  Was switching from writeback -> forward always
> > part of the sequence that resulted in corruption?  Note that there is a
> > known ordering issue when switching to forward mode.  I wouldn't really
> > expect it to bite real users but it's possible..
> >
> > http://tracker.ceph.com/issues/12814
> >
> > I've opened a ticket to track this:
> >
> > http://tracker.ceph.com/issues/15171
> >
> > What would be *really* great is if you could reproduce this with a
> > ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados
> > running, and then find the sequence of operations that are sufficient to
> > trigger a failure.
> >
> > sage
> >
> >
> >
> >  >
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > Behalf Of
> > > > Mike Lovell
> > > > Sent: 16 March 2016 23:23
> > > > To: ceph-users ; sw...@redhat.com
> > > > Cc: Robert LeBlanc ; William Perkins
> > > > 
> > > > Subject: Re: [ceph-users] data corruption with hammer
> > > >
> > > > just got done with a test against a build of 0.94.6 minus the two
> commits
> > that
> > > > were backported in PR 7207. everything worked as it should with the
> > cache-
> > > > mode set to writeback and the min_read_recency_for_promote set to 2.
> > > > assuming it works properly on master, there must be a commit that
> we're
> > > > missing on the backport to support this properly.
> > > >
> > > > sage,
> > > > i'm adding you to the recipients on this so hopefully you see it.
> the tl;dr
> > > > version is that the backport of the cache recency fix to hammer
> doesn't
> > work
> > > > right and potentially corrupts data when
> > > > the min_read_recency_for_promote is set to greater than 1.
> > > >
> > > > mike
> > > >
> > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > > >  wrote:
> > > > robert and i have done some further investigation the past couple
> days
> > on
> > > > this. we have a test environment with a hard drive tier and an ssd
> tier as a
> > > > cache. several vms were created with volumes from the ceph cluster. i
> > did a
> > > > test in each guest where i un-tarred the linux kernel source multiple
> > times
> > > > and then did a md5sum check against all of the files in the resulting
> > source
> > > > tree. i started off with the monitors and osds running 0.94.5 and
> never
> > saw
> > > > any problems.
> > > >
> > > > a single node was then upgraded to 0.94.6 which has osds in both the
> ssd
> > and
> > > > hard drive tier. i then proceeded to run the same test and, while the
> > untar
> > > > and md5sum operations were running, i changed the ssd tier cache-mode
> > > > from forward to writeback. almost immediately the vms started
> reporting
> > io
> > > > errors a

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Jeffrey McDonald
Hi Sam,

In the 70.459 logs from the deep-scrub, there is an error:

 $ zgrep "= \-2$" ceph-osd.307.log.1.gz
2016-03-07 16:11:41.828332 7ff30cdad700 10
filestore(/var/lib/ceph/osd/ceph-307) remove
70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
= -2
2016-03-07 21:44:02.197676 7fe96b56f700 10
filestore(/var/lib/ceph/osd/ceph-307) remove
70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
= -2

I'm taking this as an indication of the error you mentioned.It looks to
me as if this bug leaves two files with "issues" based upon what I see on
the filesystem.

First, I have a size-0 file in a directory where I expect only to have
directories:

root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D#
ls -ltr
total 320
-rw-r--r-- 1 root root 0 Jan 23 21:49
default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
drwxr-xr-x 2 root root 16384 Feb  5 15:13 DIR_6
drwxr-xr-x 2 root root 16384 Feb  5 17:26 DIR_3
drwxr-xr-x 2 root root 16384 Feb 10 00:01 DIR_C
drwxr-xr-x 2 root root 16384 Mar  4 10:50 DIR_7
drwxr-xr-x 2 root root 16384 Mar  4 16:46 DIR_A
drwxr-xr-x 2 root root 16384 Mar  5 02:37 DIR_2
drwxr-xr-x 2 root root 16384 Mar  5 17:39 DIR_4
drwxr-xr-x 2 root root 16384 Mar  8 16:50 DIR_F
drwxr-xr-x 2 root root 16384 Mar 15 15:51 DIR_8
drwxr-xr-x 2 root root 16384 Mar 15 21:18 DIR_D
drwxr-xr-x 2 root root 16384 Mar 15 22:25 DIR_0
drwxr-xr-x 2 root root 16384 Mar 15 22:35 DIR_9
drwxr-xr-x 2 root root 16384 Mar 15 22:56 DIR_E
drwxr-xr-x 2 root root 16384 Mar 15 23:21 DIR_1
drwxr-xr-x 2 root root 12288 Mar 16 00:07 DIR_B
drwxr-xr-x 2 root root 16384 Mar 16 00:34 DIR_5

I assume that this file is an issue as well..and needs to be removed.



then, in the directory where the file should be, I have the same file:

root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D/DIR_E#
ls -ltr | grep -v __head_
total 64840
-rw-r--r-- 1 root root 1048576 Jan 23 21:49
default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long

In the directory DIR_E here (from above), there is only one file without a
__head_ in the pathname -- the file above.   Should I be deleting both of
these _long files without the __head_, the one in DIR_E and the one above
.../DIR_E?

Since there is no directory structure HASH in these files, is that the
indication that it is an orphan?
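(For anyone checking this by hand: a rough sketch, assuming the 8-hex-digit
object hash taken from the deep-scrub log line above (79ced459) or dug out of
the user.cephos.lfn3 xattr, e.g. with getfattr -n user.cephos.lfn3 <file>.
Filestore nests its DIR_ subdirectories by the hash digits in reverse order,
so:

  HASH=79ced459   # placeholder: the hash for the object in question
  echo "$HASH" | rev | fold -w1 | tr 'a-f' 'A-F' | sed 's/^/DIR_/' | paste -sd/ -
  # prints DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/DIR_7; the file should sit
  # in whatever prefix of that path the PG has actually split down to.)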

Thanks,
Jeff




On Tue, Mar 15, 2016 at 8:38 PM, Samuel Just  wrote:

> Ah, actually, I think there will be duplicates only around half the
> time -- either the old link or the new link could be orphaned
> depending on which xfs decides to list first.  Only if the old link is
> orphaned will it match the name of the object once it's recreated.  I
> should be able to find time to put together a branch in the next week
> or two if you want to wait.  It's still probably worth trying removing
> that object in 70.459.
> -Sam
>
> On Tue, Mar 15, 2016 at 6:03 PM, Samuel Just  wrote:
> > The bug is entirely independent of hardware issues -- entirely a ceph
> > bug.  xfs doesn't let us specify an ordering when reading a directory,
> > so we have to keep directory sizes small.  That means that when one of
> > those pg collection subfolders has 320 files in it, we split it into
> > up to 16 smaller directories.  Overwriting or removing an ec object
> > requires us to rename the old version out of the way in case we need
> > to roll back (that's the generation number I mentioned above).  For
> > crash safety, this involves first creating a link to the new name,
> > then removing the old one.  Both the old and new link will be in the
> > same subdirectory.  If creating the new link pushes the directory to
> > 320 files then we do a split while both links are present.  If the
> > file in question is using the special long filename handling, then a
> > bug in the resulting link juggling causes us to orphan the old version
> > of the file.  Your cluster seems to have an unusual number of objects
> > with very long names, which is why it is so visible on your cluster.
> >
> > There are critical pool sizes where the PGs will all be close to one
> > of those limits.  It's possible you are not close to one of those
> > limits.  It's als

Re: [ceph-users] v10.0.4 released

2016-03-19 Thread Sage Weil
On Wed, 16 Mar 2016, Eric Eastman wrote:
> Thank you for doing this.  It will make testing 10.0.x easier for all of us
> in the field, and will make it easier to report bugs, as we will know that
> the problems we find were not caused by our build process. 

Note that you can also always pull builds from the gitbuilders (which is 
what we run QA against).  Both of these should work:

 ceph-deploy install --dev jewel HOST
 ceph-deploy install --dev v10.0.5 HOST

or you can grab builds directly from gitbuilder.ceph.com.

sage



> Eric
> 
> On Wed, Mar 16, 2016 at 7:14 AM, Loic Dachary  wrote:
>   Hi,
> 
>   Because of a tiny mistake preventing deb packages to be built,
>   v10.0.5 was released shortly after v10.0.4 and is now the
>   current development release. The Stable release team[0]
>   collectively decided to help by publishing development
>   packages[1], starting with v10.0.5.
> 
>   The packages for v10.0.5 are available at
>   http://ceph-releases.dachary.org/ which can be used as a
>   replacement for http://download.ceph.com/ for both
>   http://download.ceph.com/rpm-testing and
>   http://download.ceph.com/debian-testing . The only difference is
>   the key used to sign the releases which can be imported with
> 
>       wget -q -O-
>   'http://ceph-releases.dachary.org/release-key.asc' | sudo
>   apt-key add -
> 
>   or
> 
>       rpm --import
>   http://ceph-releases.dachary.org/release-key.asc
> 
>   The instructions to install development packages found at
>   http://docs.ceph.com/docs/master/install/get-packages/ can
>   otherwise be applied with no change.
> 
>   Cheers
> 
>   [0] Stable release team
>   http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#Whos-who
>   [1] Publishing development releases
>   http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/30126
> 
>   On 08/03/2016 22:35, Sage Weil wrote:
>   > This is the fourth and last development release before Jewel.
>   The next
>   > release will be a release candidate with the final set of
>   features. Big
>   > items include RGW static website support, librbd journal
>   framework, fixed
>   > mon sync of config-key data, C++11 updates, and
>   bluestore/kstore.
>   >
>   > Note that, due to general developer busyness, we aren’t
>   building official
>   > release packages for this dev release. You can fetch autobuilt
>   gitbuilder
>   > packages from the usual location (http://gitbuilder.ceph.com).
>   >
>   > Notable Changes
>   > ---
>   >
>   >         http://ceph.com/releases/v10-0-4-released/
>   >
>   > Getting Ceph
>   > 
>   >
>   > * Git at git://github.com/ceph/ceph.git
>   > * For packages, see
>   http://ceph.com/docs/master/install/get-packages#add-ceph-development
>   > * For ceph-deploy, see
>   http://ceph.com/docs/master/install/install-ceph-deploy
>   >
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
I'll miss the Ceph community as well. There were a few things I really
wanted to work on with Ceph.

I got this:

update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
dirty exists
1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
1040:  finishing write tid 1 to nodez23350-256
1040:  finishing write tid 2 to nodez23350-256
1040:  finishing write tid 3 to nodez23350-256
1040:  finishing write tid 4 to nodez23350-256
1040:  finishing write tid 6 to nodez23350-256
1035: done (4 left)
1037: done (3 left)
1038: done (2 left)
1043: read oid 430 snap -1
1043:  expect (ObjNum 429 snap 0 seq_num 429)
1040:  finishing write tid 7 to nodez23350-256
update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
dirty exists
1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
1042:  expect (ObjNum 664 snap 0 seq_num 664)
1043: Error: oid 430 read returned error code -2
./test/osd/RadosModel.h: In function 'virtual void
ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
2016-03-17 10:47:19.085414
./test/osd/RadosModel.h: 1109: FAILED assert(0)
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x76) [0x4db956]
2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
3: (()+0x9791d) [0x7fa1d472191d]
4: (()+0x72519) [0x7fa1d46fc519]
5: (()+0x13c178) [0x7fa1d47c6178]
6: (()+0x80a4) [0x7fa1d425a0a4]
7: (clone()+0x6d) [0x7fa1d2bd504d]
NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted

I had to toggle writeback/forward and min_read_recency_for_promote a
few times to get it, but I don't know if it is because I only have one
job running. Even with six jobs running, it is not easy to trigger
with ceph_test_rados, but it is very instant in the RBD VMs.
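(For the record, the toggling was just the standard tier commands, with the
pool name here as a placeholder:

  ceph osd tier cache-mode ssd-cache forward
  ceph osd tier cache-mode ssd-cache writeback
  ceph osd pool set ssd-cache min_read_recency_for_promote 2
  rados -p ssd-cache cache-flush-evict-all   # to empty the cache between runs)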

Here are the six run crashes (I have about the last 2000 lines of each
if needed):

nodev:
update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num
1014) dirty exists
1015:  left oid 1015 (ObjNum 1014 snap 0 seq_num 1014)
1016:  finishing write tid 1 to nodev21799-1016
1016:  finishing write tid 2 to nodev21799-1016
1016:  finishing write tid 3 to nodev21799-1016
1016:  finishing write tid 4 to nodev21799-1016
1016:  finishing write tid 6 to nodev21799-1016
1016:  finishing write tid 7 to nodev21799-1016
update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num
1015) dirty exists
1016:  left oid 1016 (ObjNum 1015 snap 0 seq_num 1015)
1017:  finishing write tid 1 to nodev21799-1017
1017:  finishing write tid 2 to nodev21799-1017
1017:  finishing write tid 3 to nodev21799-1017
1017:  finishing write tid 5 to nodev21799-1017
1017:  finishing write tid 6 to nodev21799-1017
update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num
1016) dirty exists
1017:  left oid 1017 (ObjNum 1016 snap 0 seq_num 1016)
1018:  finishing write tid 1 to nodev21799-1018
1018:  finishing write tid 2 to nodev21799-1018
1018:  finishing write tid 3 to nodev21799-1018
1018:  finishing write tid 4 to nodev21799-1018
1018:  finishing write tid 6 to nodev21799-1018
1018:  finishing write tid 7 to nodev21799-1018
update_object_version oid 1018 v 1093 (ObjNum 1017 snap 0 seq_num
1017) dirty exists
1018:  left oid 1018 (ObjNum 1017 snap 0 seq_num 1017)
1019:  finishing write tid 1 to nodev21799-1019
1019:  finishing write tid 2 to nodev21799-1019
1019:  finishing write tid 3 to nodev21799-1019
1019:  finishing write tid 5 to nodev21799-1019
1019:  finishing write tid 6 to nodev21799-1019
update_object_version oid 1019 v 462 (ObjNum 1018 snap 0 seq_num 1018)
dirty exists
1019:  left oid 1019 (ObjNum 1018 snap 0 seq_num 1018)
1021:  finishing write tid 1 to nodev21799-1021
1020:  finishing write tid 1 to nodev21799-1020
1020:  finishing write tid 2 to nodev21799-1020
1020:  finishing write tid 3 to nodev21799-1020
1020:  finishing write tid 5 to nodev21799-1020
1020:  finishing write tid 6 to nodev21799-1020
update_object_version oid 1020 v 1287 (ObjNum 1019 snap 0 seq_num
1019) dirty exists
1020:  left oid 1020 (ObjNum 1019 snap 0 seq_num 1019)
1021:  finishing write tid 2 to nodev21799-1021
1021:  finishing write tid 3 to nodev21799-1021
1021:  finishing write tid 5 to nodev21799-1021
1021:  finishing write tid 6 to nodev21799-1021
update_object_version oid 1021 v 1077 (ObjNum 1020 snap 0 seq_num
1020) dirty exists
1021:  left oid 1021 (ObjNum 1020 snap 0 seq_num 1020)
1022:  finishing write tid 1 to nodev21799-1022
1022:  finishing write tid 2 to nodev21799-1022
1022:  finishing write tid 3 to nodev21799-1022
1022:  finishing write tid 5 to nodev21799-1022
1022:  finishing write tid 6 to nodev21799-1022
update_object_version oid 1022 v 1213 (ObjNum 1021 snap 0 seq_num
1021) dirty exists
1022:  left oid 1022 (ObjNum 1021 snap 0 seq_num 1021)
1023:  finishing write tid 1 to nodev21799-1023
1023:  finishing write tid 2 to nodev21799-1023
1023:  finishing wri

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Cherry-picking that commit onto v0.94.6 wasn't clean so I'm just
building your branch. I'm not sure what the difference between your
branch and 0.94.6 is, I don't see any commits against
osd/ReplicatedPG.cc in the last 5 months other than the one you did
today.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:38 AM, Robert LeBlanc  wrote:
> Yep, let me pull and build that branch. I tried installing the dbg
> packages and running it in gdb, but it didn't load the symbols.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil  wrote:
>> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>>> Also, is this ceph_test_rados rewriting objects quickly? I think that
>>> the issue is with rewriting objects so if we can tailor the
>>> ceph_test_rados to do that, it might be easier to reproduce.
>>
>> It's doing lots of overwrites, yeah.
>>
>> I was albe to reproduce--thanks!  It looks like it's specific to
>> hammer.  The code was rewritten for jewel so it doesn't affect the
>> latest.  The problem is that maybe_handle_cache may proxy the read and
>> also still try to handle the same request locally (if it doesn't trigger a
>> promote).
>>
>> Here's my proposed fix:
>>
>> https://github.com/ceph/ceph/pull/8187
>>
>> Do you mind testing this branch?
>>
>> It doesn't appear to be directly related to flipping between writeback and
>> forward, although it may be that we are seeing two unrelated issues.  I
>> seemed to be able to trigger it more easily when I flipped modes, but the
>> bug itself was a simple issue in the writeback mode logic.  :/
>>
>> Anyway, please see if this fixes it for you (esp with the RBD workload).
>>
>> Thanks!
>> sage
>>
>>
>>
>>
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  
>>> wrote:
>>> > I'll miss the Ceph community as well. There were a few things I really
>>> > wanted to work on with Ceph.
>>> >
>>> > I got this:
>>> >
>>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
>>> > dirty exists
>>> > 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
>>> > 1040:  finishing write tid 1 to nodez23350-256
>>> > 1040:  finishing write tid 2 to nodez23350-256
>>> > 1040:  finishing write tid 3 to nodez23350-256
>>> > 1040:  finishing write tid 4 to nodez23350-256
>>> > 1040:  finishing write tid 6 to nodez23350-256
>>> > 1035: done (4 left)
>>> > 1037: done (3 left)
>>> > 1038: done (2 left)
>>> > 1043: read oid 430 snap -1
>>> > 1043:  expect (ObjNum 429 snap 0 seq_num 429)
>>> > 1040:  finishing write tid 7 to nodez23350-256
>>> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
>>> > dirty exists
>>> > 1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
>>> > 1042:  expect (ObjNum 664 snap 0 seq_num 664)
>>> > 1043: Error: oid 430 read returned error code -2
>>> > ./test/osd/RadosModel.h: In function 'virtual void
>>> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
>>> > 2016-03-17 10:47:19.085414
>>> > ./test/osd/RadosModel.h: 1109: FAILED assert(0)
>>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> > const*)+0x76) [0x4db956]
>>> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
>>> > 3: (()+0x9791d) [0x7fa1d472191d]
>>> > 4: (()+0x72519) [0x7fa1d46fc519]
>>> > 5: (()+0x13c178) [0x7fa1d47c6178]
>>> > 6: (()+0x80a4) [0x7fa1d425a0a4]
>>> > 7: (clone()+0x6d) [0x7fa1d2bd504d]
>>> > NOTE: a copy of the executable, or `objdump -rdS ` is
>>> > needed to interpret this.
>>> > terminate called after throwing an instance of 'ceph::FailedAssertion'
>>> > Aborted
>>> >
>>> > I had to toggle writeback/forward and min_read_recency_for_promote a
>>> > few times to get it, but I don't know if it is because I only have one
>>> > job running. Even with six jobs running, it is not easy to trigger
>>> > with ceph_test_rados, but it is very instant in the RBD VMs.
>>> >
>>> > Here are the six run crashes (I have about the last 2000 lines of each
>>> > if needed):
>>> >
>>> > nodev:
>>> > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num
>>> > 1014) dirty exists
>>> > 1015:  left oid 1015 (ObjNum 1014 snap 0 seq_num 1014)
>>> > 1016:  finishing write tid 1 to nodev21799-1016
>>> > 1016:  finishing write tid 2 to nodev21799-1016
>>> > 1016:  finishing write tid 3 to nodev21799-1016
>>> > 1016:  finishing write tid 4 to nodev21799-1016
>>> > 1016:  finishing write tid 6 to nodev21799-1016
>>> > 1016:  finishing write tid 7 to nodev21799-1016
>>> > update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num
>>> > 1015) dirty exists
>>> > 1016:  left oid 1016 (ObjNum 1015 snap 0 seq_num 1015)
>>> > 1017:  finishing write tid 1 

Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-19 Thread Daniel Niasoff
Hi Nick,

Your solution requires manual configuration for each VM and cannot be set up
as part of an automated OpenStack deployment.

It would be really nice if it was a hypervisor based setting as opposed to a VM 
based setting.

Thanks 

Daniel

-Original Message-
From: Nick Fisk [mailto:n...@fisk.me.uk] 
Sent: 16 March 2016 08:59
To: Daniel Niasoff ; 'Van Leeuwen, Robert' 
; 'Jason Dillaman' 
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Daniel Niasoff
> Sent: 16 March 2016 08:26
> To: Van Leeuwen, Robert ; Jason Dillaman 
> 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> Hi Robert,
> 
> >Caching writes would be bad because a hypervisor failure would result 
> >in
> loss of the cache which pretty much guarantees inconsistent data on 
> the ceph volume.
> >Also live-migration will become problematic compared to running
> everything from ceph since you will also need to migrate the
local-storage.

I tested a solution using iSCSI for the cache devices. Each VM was using
flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This gets
around the problem of moving things around, or of the hypervisor going down.
It's not local caching, but the write latency is at least 10x lower than the
RBD. Note I tested it, I didn't put it into production :-)
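Roughly what that looked like inside a guest, as a from-memory sketch (device
paths and the cache name are placeholders, and flashcache_create's flags may
differ by version):

  # SSD arrives in the guest as an iSCSI LUN (say /dev/sdb), the RBD as /dev/vdb
  flashcache_create -p back rbd_cache /dev/sdb /dev/vdb
  mount /dev/mapper/rbd_cache /mnt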

> 
> My understanding of how a writeback cache should work is that it 
> should only take a few seconds for writes to be streamed onto the 
> network and is focussed on resolving the speed issue of small sync 
> writes. The writes
would
> be bundled into larger writes that are not time sensitive.
> 
> So there is potential for a few seconds data loss but compared to the
current
> trend of using ephemeral storage to solve this issue, it's a major 
> improvement.

Yeah, the problem is that a couple of seconds of data loss means different
things to different people.

> 
> > (considering the time required for setting up and maintaining the 
> > extra
> caching layer on each vm, unless you work for free ;-)
> 
> Couldn't agree more there.
> 
> I am just so surprised that the OpenStack community hasn't looked to
> resolve this issue. Ephemeral storage is a HUGE compromise unless you
> have built failure handling into every aspect of your application, but many
> people use OpenStack as a general-purpose dev stack.
> 
> (Jason pointed out his blueprint but I guess it's at least a year or 2
away -
> http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-
> consistent_write-back_caching_extension)
> 
> I see articles discussing the idea such as this one
> 
> http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-
> scalable-cache/
> 
> but no real straightforward  validated setup instructions.
> 
> Thanks
> 
> Daniel
> 
> 
> -Original Message-
> From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> Sent: 16 March 2016 08:11
> To: Jason Dillaman ; Daniel Niasoff 
> 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> >Indeed, well understood.
> >
> >As a shorter term workaround, if you have control over the VMs, you 
> >could
> always just slice out an LVM volume from local SSD/NVMe and pass it 
> through to the guest.  Within the guest, use dm-cache (or similar) to 
> add
a
> cache front-end to your RBD volume.
> 
> If you do this you need to setup your cache as read-cache only.
> Caching writes would be bad because a hypervisor failure would result 
> in
loss
> of the cache which pretty much guarantees inconsistent data on the 
> ceph volume.
> Also live-migration will become problematic compared to running 
> everything from ceph since you will also need to migrate the local-storage.
> 
> The question will be if adding more ram (== more read cache) would not 
> be more convenient and cheaper in the end.
> (considering the time required for setting up and maintaining the 
> extra caching layer on each vm, unless you work for free ;-) Also 
> reads from
ceph
> are pretty fast compared to the biggest bottleneck: (small) sync writes.
> So it is debatable how much performance you would win except for some 
> use-cases with lots of reads on very large data sets which are also 
> very latency sensitive.
> 
> Cheers,
> Robert van Leeuwen
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v10.0.4 released

2016-03-19 Thread Eric Eastman
Thank you for doing this.  It will make testing 10.0.x easier for all of us
in the field, and will make it easier to report bugs, as we will know that
the problems we find were not caused by our build process.

Eric

On Wed, Mar 16, 2016 at 7:14 AM, Loic Dachary  wrote:

> Hi,
>
> Because of a tiny mistake preventing deb packages from being built, v10.0.5 was
> released shortly after v10.0.4 and is now the current development release.
> The Stable release team[0] collectively decided to help by publishing
> development packages[1], starting with v10.0.5.
>
> The packages for v10.0.5 are available at
> http://ceph-releases.dachary.org/ which can be used as a replacement for
> http://download.ceph.com/ for both http://download.ceph.com/rpm-testing
> and http://download.ceph.com/debian-testing . The only difference is the
> key used to sign the releases which can be imported with
>
> wget -q -O- 'http://ceph-releases.dachary.org/release-key.asc' | sudo
> apt-key add -
>
> or
>
> rpm --import http://ceph-releases.dachary.org/release-key.asc
>
> The instructions to install development packages found at
> http://docs.ceph.com/docs/master/install/get-packages/ can otherwise be
> applied with no change.
>
> Cheers
>
> [0] Stable release team
> http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#Whos-who
> [1] Publishing development releases
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/30126
>
> On 08/03/2016 22:35, Sage Weil wrote:
> > This is the fourth and last development release before Jewel. The next
> > release will be a release candidate with the final set of features. Big
> > items include RGW static website support, librbd journal framework, fixed
> > mon sync of config-key data, C++11 updates, and bluestore/kstore.
> >
> > Note that, due to general developer busyness, we aren’t building official
> > release packages for this dev release. You can fetch autobuilt gitbuilder
> > packages from the usual location (http://gitbuilder.ceph.com).
> >
> > Notable Changes
> > ---
> >
> > http://ceph.com/releases/v10-0-4-released/
> >
> > Getting Ceph
> > 
> >
> > * Git at git://github.com/ceph/ceph.git
> > * For packages, see
> http://ceph.com/docs/master/install/get-packages#add-ceph-development
> > * For ceph-deploy, see
> http://ceph.com/docs/master/install/install-ceph-deploy
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Sage Weil
On Thu, 17 Mar 2016, Nick Fisk wrote:
> There's got to be something else going on here. All that PR does is to
> potentially delay the promotion to hit_set_period*recency instead of 
> just doing it on the 2nd read regardless, it's got to be uncovering 
> another bug.
> 
> Do you see the same problem if the cache is in writeback mode before you 
> start the unpacking. Ie is it the switching mid operation which causes 
> the problem? If it only happens mid operation, does it still occur if 
> you pause IO when you make the switch?
> 
> Do you also see this if you perform on a RBD mount, to rule out any 
> librbd/qemu weirdness?
> 
> Do you know if it’s the actual data that is getting corrupted or if it's 
> the FS metadata? I'm only wondering as unpacking should really only be 
> writing to each object a couple of times, whereas FS metadata could 
> potentially be being updated+read back lots of times for the same group 
> of objects and ordering is very important.
> 
> Thinking through it logically the only difference is that with recency=1 
> the object will be copied up to the cache tier, where recency=6 it will 
> be proxy read for a long time. If I had to guess I would say the issue 
> would lie somewhere in the proxy read + writeback<->forward logic.

That seems reasonable.  Was switching from writeback -> forward always 
part of the sequence that resulted in corruption?  Note that there is a
known ordering issue when switching to forward mode.  I wouldn't really 
expect it to bite real users but it's possible..

http://tracker.ceph.com/issues/12814

I've opened a ticket to track this:

http://tracker.ceph.com/issues/15171

What would be *really* great is if you could reproduce this with a 
ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados 
running, and then find the sequence of operations that are sufficient to 
trigger a failure.

sage



 > 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Mike Lovell
> > Sent: 16 March 2016 23:23
> > To: ceph-users ; sw...@redhat.com
> > Cc: Robert LeBlanc ; William Perkins
> > 
> > Subject: Re: [ceph-users] data corruption with hammer
> > 
> > just got done with a test against a build of 0.94.6 minus the two commits 
> > that
> > were backported in PR 7207. everything worked as it should with the cache-
> > mode set to writeback and the min_read_recency_for_promote set to 2.
> > assuming it works properly on master, there must be a commit that we're
> > missing on the backport to support this properly.
> > 
> > sage,
> > i'm adding you to the recipients on this so hopefully you see it. the tl;dr
> > version is that the backport of the cache recency fix to hammer doesn't work
> > right and potentially corrupts data when
> > the min_read_recency_for_promote is set to greater than 1.
> > 
> > mike
> > 
> > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> >  wrote:
> > robert and i have done some further investigation the past couple days on
> > this. we have a test environment with a hard drive tier and an ssd tier as a
> > cache. several vms were created with volumes from the ceph cluster. i did a
> > test in each guest where i un-tarred the linux kernel source multiple times
> > and then did a md5sum check against all of the files in the resulting source
> > tree. i started off with the monitors and osds running 0.94.5 and never saw
> > any problems.
> > 
> > a single node was then upgraded to 0.94.6 which has osds in both the ssd and
> > hard drive tier. i then proceeded to run the same test and, while the untar
> > and md5sum operations were running, i changed the ssd tier cache-mode
> > from forward to writeback. almost immediately the vms started reporting io
> > errors and odd data corruption. the remainder of the cluster was updated to
> > 0.94.6, including the monitors, and the same thing happened.
> > 
> > things were cleaned up and reset and then a test was run
> > where min_read_recency_for_promote for the ssd cache pool was set to 1.
> > we previously had it set to 6. there was never an error with the recency
> > setting set to 1. i then tested with it set to 2 and it immediately caused
> > failures. we are currently thinking that it is related to the backport of 
> > the fix
> > for the recency promotion and are in progress of making a .6 build without
> > that backport to see if we can cause corruption. is anyone using a version
> > from after the original recency fix (PR 6702) with a cache tier in writeback
> > mode? anyone have a similar problem?
> > 
> > mike
> > 
> > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell
> >  wrote:
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> > 
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush 

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
We are trying to figure out how to use rados bench to reproduce. Ceph
itself doesn't seem to think there is any corruption, but when you do a
verify inside the RBD, there is. Can rados bench verify the objects after
they are written? It also seems to be primarily the filesystem metadata
that is corrupted. If we fsck the volume, there is missing data (put into
lost+found), but if it is there it is primarily OK. There only seems to be
a few cases where a file's contents are corrupted. I would suspect on an
object boundary. We would have to look at blockinfo to map that out and see
if that is what is happening.

We stopped all the IO and did put the tier in writeback mode with recency
1,  set the recency to 2 and started the test and there was corruption, so
it doesn't seem to be limited to changing the mode. I don't know how that
patch could cause the issue either. Unless there is a bug that reads from
the back tier, but writes to cache tier, then the object gets promoted
wiping that last write, but then it seems like it should not be as much
corruption since the metadata should be in the cache pretty quick. We
usually evicted the cache before each try so we should not be evicting on
writeback.

Sent from a mobile device, please excuse any typos.
On Mar 17, 2016 6:26 AM, "Sage Weil"  wrote:

> On Thu, 17 Mar 2016, Nick Fisk wrote:
> > There's got to be something else going on here. All that PR does is to
> > potentially delay the promotion to hit_set_period*recency instead of
> > just doing it on the 2nd read regardless, it's got to be uncovering
> > another bug.
> >
> > Do you see the same problem if the cache is in writeback mode before you
> > start the unpacking. Ie is it the switching mid operation which causes
> > the problem? If it only happens mid operation, does it still occur if
> > you pause IO when you make the switch?
> >
> > Do you also see this if you perform on a RBD mount, to rule out any
> > librbd/qemu weirdness?
> >
> > Do you know if it’s the actual data that is getting corrupted or if it's
> > the FS metadata? I'm only wondering as unpacking should really only be
> > writing to each object a couple of times, whereas FS metadata could
> > potentially be being updated+read back lots of times for the same group
> > of objects and ordering is very important.
> >
> > Thinking through it logically the only difference is that with recency=1
> > the object will be copied up to the cache tier, where recency=6 it will
> > be proxy read for a long time. If I had to guess I would say the issue
> > would lie somewhere in the proxy read + writeback<->forward logic.
>
> That seems reasonable.  Was switching from writeback -> forward always
> part of the sequence that resulted in corruption?  Note that there is a
> known ordering issue when switching to forward mode.  I wouldn't really
> expect it to bite real users but it's possible..
>
> http://tracker.ceph.com/issues/12814
>
> I've opened a ticket to track this:
>
> http://tracker.ceph.com/issues/15171
>
> What would be *really* great is if you could reproduce this with a
> ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados
> running, and then find the sequence of operations that are sufficient to
> trigger a failure.
>
> sage
>
>
>
>  >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> > > Mike Lovell
> > > Sent: 16 March 2016 23:23
> > > To: ceph-users ; sw...@redhat.com
> > > Cc: Robert LeBlanc ; William Perkins
> > > 
> > > Subject: Re: [ceph-users] data corruption with hammer
> > >
> > > just got done with a test against a build of 0.94.6 minus the two
> commits that
> > > were backported in PR 7207. everything worked as it should with the
> cache-
> > > mode set to writeback and the min_read_recency_for_promote set to 2.
> > > assuming it works properly on master, there must be a commit that we're
> > > missing on the backport to support this properly.
> > >
> > > sage,
> > > i'm adding you to the recipients on this so hopefully you see it. the
> tl;dr
> > > version is that the backport of the cache recency fix to hammer
> doesn't work
> > > right and potentially corrupts data when
> > > the min_read_recency_for_promote is set to greater than 1.
> > >
> > > mike
> > >
> > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > >  wrote:
> > > robert and i have done some further investigation the past couple days
> on
> > > this. we have a test environment with a hard drive tier and an ssd
> tier as a
> > > cache. several vms were created with volumes from the ceph cluster. i
> did a
> > > test in each guest where i un-tarred the linux kernel source multiple
> times
> > > and then did a md5sum check against all of the files in the resulting
> source
> > > tree. i started off with the monitors and osds running 0.94.5 and
> never saw
> > > any problems.
> > >
> > > a single node was then upgraded to 0.94.6 which has osds in both the
> ssd and
>

Re: [ceph-users] RBD hanging on some volumes of a pool

2016-03-19 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrien Gillard
> Sent: 17 March 2016 10:23
> To: ceph-users 
> Subject: [ceph-users] RBD hanging on some volumes of a pool
> 
> Hi,
> 
> I am facing issues with some of my rbd volumes since yesterday. Some of
> them completely hang at some point before eventually resuming IO, may it
> be a few minutes or several hours later.
> 
> First and foremost, my setup : I already detailed it on the mailing list 
> [0][1].
> Some changes have been made : the 3 monitors are now VM and we are
> trying kernel 4.4.5 on the clients (cluster is still 3.10 centos7).
> 
> Using EC pools, I already had some trouble with RBD features not supported
> by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the
> hassle. Everything has been working pretty smoothly since.
> 
> All my volumes (currently 5) are on an EC pool with writeback cache. Two of
> them are perfectly fine. On the other 3, different story : doing IO is
> impossible, if I start a simple copy I get a new file of a few dozen MB (or
> sometimes 0) then it hangs. Doing dd with direct and sync flags has the same
> behaviour.

I can only guess that you are having problems with your cache tier not flushing 
and so writes are stalling on waiting for space to become available. Can you 
post 

ceph osd dump | grep pool

and 

ceph df detail
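
In particular I'd check whether the cache pool actually has flush/evict targets
set at all; something along these lines would show them (pool name below is
just a placeholder):

ceph osd pool get <cache-pool> target_max_bytes
ceph osd pool get <cache-pool> target_max_objects
ceph osd pool get <cache-pool> cache_target_dirty_ratio
ceph osd pool get <cache-pool> cache_target_full_ratio

If target_max_bytes/target_max_objects were never set, the tiering agent has
nothing to flush or evict against and the cache pool can simply fill up.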

> 
> I tried switching back to 3.10, no changes, on the client I rebooted I 
> currently
> cannot mount the filesystem, mount hangs (the volume seems correctly
> mapped however).
> 
> strace on the cp command freezes in the middle of a read :
> 
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> 
> 
> I tried to bump up the logging but I don't really know what to look for 
> exactly
> and didn't see anything obvious.
> 
> Any input or lead on how to debug this would be highly appreciated :)
> 
> Adrien
> 
> [0] http://www.spinics.net/lists/ceph-users/msg23990.html
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-
> January/007004.html
> [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-
> February/007746.html
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD hanging on some volumes of a pool

2016-03-19 Thread Adrien Gillard
Hi Nick,

Thank you for your feedback. The cache tiers was fine. We identified some
packet loss between two switches. As usual with network, relatively easy to
identify but not something that comes to mind at first :)

Adrien

On Thu, Mar 17, 2016 at 2:32 PM, Nick Fisk  wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Adrien Gillard
> > Sent: 17 March 2016 10:23
> > To: ceph-users 
> > Subject: [ceph-users] RBD hanging on some volumes of a pool
> >
> > Hi,
> >
> > I am facing issues with some of my rbd volumes since yesterday. Some of
> > them completely hang at some point before eventually resuming IO, may it
> > be a few minutes or several hours later.
> >
> > First and foremost, my setup : I already detailed it on the mailing list
> [0][1].
> > Some changes have been made : the 3 monitors are now VM and we are
> > trying kernel 4.4.5 on the clients (cluster is still 3.10 centos7).
> >
> > Using EC pools, I already had some trouble with RBD features not
> supported
> > by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the
> > hassle. Everything has been working pretty smoothly since.
> >
> > All my volumes (currently 5) are on an EC pool with writeback cache. Two
> of
> > them are perfectly fine. On the other 3, different story : doing IO is
> > impossible, if I start a simple copy I get a new file of a few dozen MB
> (or
> > sometimes 0) then it hangs. Doing dd with direct and sync flags has the
> same
> > behaviour.
>
> I can only guess that you are having problems with your cache tier not
> flushing and so writes are stalling on waiting for space to become
> available. Can you post
>
> ceph osd dump | grep pool
>
> and
>
> ceph df detail
>
> >
> > I tried switching back to 3.10, no changes, on the client I rebooted I
> currently
> > cannot mount the filesystem, mount hangs (the volume seems correctly
> > mapped however).
> >
> > strace on the cp command freezes in the middle of a read :
> >
> > 11:17:56 write(4,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 read(3,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 write(4,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 read(3,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 write(4,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 read(3,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 write(4,
> > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 65536) = 65536
> > 11:17:56 read(3,
> >
> >
> > I tried to bump up the logging but I don't really know what to look for
> exactly
> > and didn't see anything obvious.
> >
> > Any input or lead on how to debug this would be highly appreciated :)
> >
> > Adrien
> >
> > [0] http://www.spinics.net/lists/ceph-users/msg23990.html
> > [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-
> > January/007004.html
> > [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-
> > February/007746.html
> >
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy rgw

2016-03-19 Thread Derek Yarnell
For clusters that were created pre-hammer and want to use ceph-deploy to
create additional rgw instances is there a way to create the
bootstrap-rgw keyring?

http://docs.ceph.com/ceph-deploy/docs/rgw.html
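
What I had in mind, unless there is a more official way, was creating the key
by hand on a monitor and then letting ceph-deploy pick it up, roughly:

ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
    -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
ceph-deploy gatherkeys MON_HOST

but I am not sure that is the sanctioned approach.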

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Irek Fasikhov
> Sent: 17 March 2016 13:00
> To: Sage Weil 
> Cc: Robert LeBlanc ; ceph-users  us...@lists.ceph.com>; Nick Fisk ; William Perkins
> 
> Subject: Re: [ceph-users] data corruption with hammer
> 
> Hi,All.
> 
> I confirm the problem. When min_read_recency_for_promote> 1 data
> failure.

But what scenario is this? Are you switching between forward and writeback, or 
just running in writeback?

> 
> 
> Best regards, Фасихов Ирек Нургаязович
> Mobile: +79229045757
> 
> 2016-03-17 15:26 GMT+03:00 Sage Weil :
> On Thu, 17 Mar 2016, Nick Fisk wrote:
> > There has got to be something else going on here. All that PR does is to
> > potentially delay the promotion to hit_set_period*recency instead of
> > just doing it on the 2nd read regardless, it's got to be uncovering
> > another bug.
> >
> > Do you see the same problem if the cache is in writeback mode before you
> > start the unpacking. Ie is it the switching mid operation which causes
> > the problem? If it only happens mid operation, does it still occur if
> > you pause IO when you make the switch?
> >
> > Do you also see this if you perform on a RBD mount, to rule out any
> > librbd/qemu weirdness?
> >
> > Do you know if it’s the actual data that is getting corrupted or if it's
> > the FS metadata? I'm only wondering as unpacking should really only be
> > writing to each object a couple of times, whereas FS metadata could
> > potentially be being updated+read back lots of times for the same group
> > of objects and ordering is very important.
> >
> > Thinking through it logically the only difference is that with recency=1
> > the object will be copied up to the cache tier, where recency=6 it will
> > be proxy read for a long time. If I had to guess I would say the issue
> > would lie somewhere in the proxy read + writeback<->forward logic.
> 
> That seems reasonable.  Was switching from writeback -> forward always
> part of the sequence that resulted in corruption?  Note that there is a
> known ordering issue when switching to forward mode.  I wouldn't really
> expect it to bite real users but it's possible..
> 
> http://tracker.ceph.com/issues/12814
> 
> I've opened a ticket to track this:
> 
> http://tracker.ceph.com/issues/15171
> 
> What would be *really* great is if you could reproduce this with a
> ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados
> running, and then find the sequence of operations that are sufficient to
> trigger a failure.
> 
> sage
> 
> 
> 
>  >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> Behalf Of
> > > Mike Lovell
> > > Sent: 16 March 2016 23:23
> > > To: ceph-users ; sw...@redhat.com
> > > Cc: Robert LeBlanc ; William Perkins
> > > 
> > > Subject: Re: [ceph-users] data corruption with hammer
> > >
> > > just got done with a test against a build of 0.94.6 minus the two commits
> that
> > > were backported in PR 7207. everything worked as it should with the
> cache-
> > > mode set to writeback and the min_read_recency_for_promote set to 2.
> > > assuming it works properly on master, there must be a commit that we're
> > > missing on the backport to support this properly.
> > >
> > > sage,
> > > i'm adding you to the recipients on this so hopefully you see it. the 
> > > tl;dr
> > > version is that the backport of the cache recency fix to hammer doesn't
> work
> > > right and potentially corrupts data when
> > > the min_read_recency_for_promote is set to greater than 1.
> > >
> > > mike
> > >
> > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > >  wrote:
> > > robert and i have done some further investigation the past couple days
> on
> > > this. we have a test environment with a hard drive tier and an ssd tier 
> > > as a
> > > cache. several vms were created with volumes from the ceph cluster. i
> did a
> > > test in each guest where i un-tarred the linux kernel source multiple
> times
> > > and then did a md5sum check against all of the files in the resulting
> source
> > > tree. i started off with the monitors and osds running 0.94.5 and never
> saw
> > > any problems.
> > >
> > > a single node was then upgraded to 0.94.6 which has osds in both the ssd
> and
> > > hard drive tier. i then proceeded to run the same test and, while the
> untar
> > > and md5sum operations were running, i changed the ssd tier cache-mode
> > > from forward to writeback. almost immediately the vms started reporting
> io
> > > errors and odd data corruption. the remainder of the cluster was updated
> to
> > > 0.94.6, including the monitors, and the same thing happened.
> > >
> > > things were cleaned up and reset and then a test was run
> > > where min_read_recency_for_promote for the ssd cache pool was set to
> 1.
> > > we previously had it set to 6. there was never an error with the recency
> > > setting set to 1. i then tested with it 

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Possible, it looks like all the messages come from a test suite. Is
there some logging that would expose this or an assert that could be
added? We are about ready to do some testing in our lab to see if we
can replicate it and work around the issue. I also can't tell which
version introduced this in Hammer, it doesn't look like it has been
resolved.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW6bqRCRDmVDuy+mK58QAANTsP/1jceRh9zYDlm2rkVq3e
F6UKgezyCWV7h1cou8/rSVkxOfyyWEDSy1nMPBTHCtfMuOHzlx9VZftmPCiY
BmxbclpUhAbAbjMb/E7t0jFR7fAZylX4okjUTN1y7NII+6xMXyxb51drYrZv
AJzNcXfWYL1+y0Mz/QqOgEyij27OF8vYpSTJqXFDUcXtZNPfyvTjJ1ttYtuR
saFJJ6SrFXA5LliGBNQK+pTDq0ZF0Bn0soE73rpzwpQvIdiOf/Jg7hAbERCc
Vqjhg34YVLdpGd8W7IvaT0RirYbz8SmRdwOw1IIkBcqe0r9Mt08OgKu5NPT3
Rm0MKYynE1E7nKgutPisJQidT9QuaSVuY40oRDBIlrFA1BxNjGjwFxZn7y8r
WyNMHKqB9Y+78uWdtEZtGfiSwyxC2UZTQFI4+eLs/XOoRLWv9oxRYV55Co0W
e8zPW0nL1pm9iD9J+3fCRlNEL+cyDjsLLmW005BkF2q7da1XgxkoNndUBTlM
Az9RGHoCELfI6kle315/2BEGfE2aRokLngbyhQWKAWmrdTCTDZaJwDKIi4hb
69LGT2eHofTWB5KgMHoCFLUSy2lYa86GxLLsBvPuqOfAXPWHMZERGv94qH/E
CppgbnchgRHuI68rNM6nFYPJa4C3MlyQhu2WmOialAGgQi+IQP/g6h70e0RR
eqLX
=DcjE
-END PGP SIGNATURE-



Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Mar 16, 2016 at 1:40 PM, Gregory Farnum  wrote:

> This tracker ticket happened to go by my eyes today:
> http://tracker.ceph.com/issues/12814 . There isn't a lot of detail
> there but the headline matches.
> -Greg
>
> On Wed, Mar 16, 2016 at 2:02 AM, Nick Fisk  wrote:
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> >> Christian Balzer
> >> Sent: 16 March 2016 07:08
> >> To: Robert LeBlanc 
> >> Cc: Robert LeBlanc ; ceph-users  >> us...@lists.ceph.com>; William Perkins 
> >> Subject: Re: [ceph-users] data corruption with hammer
> >>
> >>
> >> Hello Robert,
> >>
> >> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote:
> >>
> >> > -BEGIN PGP SIGNED MESSAGE-
> >> > Hash: SHA256
> >> >
> >> > There are no monitors on the new node.
> >> >
> >> So one less possible source of confusion.
> >>
> >> > It doesn't look like there has been any new corruption since we
> >> > stopped changing the cache modes. Upon closer inspection, some files
> >> > have been changed such that binary files are now ASCII files and vice
> >> > versa. These are readable ASCII files and are things like PHP or
> >> > script files. Or C files where ASCII files should be.
> >> >
> >> What would be most interesting is if the objects containing those
> > corrupted
> >> files did reside on the new OSDs (primary PG) or the old ones, or both.
> >>
> >> Also, what cache mode was the cluster in before the first switch
> > (writeback I
> >> presume from the timeline) and which one is it in now?
> >>
> >> > I've seen this type of corruption before when a SAN node misbehaved
> >> > and both controllers were writing concurrently to the backend disks.
> >> > The volume was only mounted by one host, but the writes were split
> >> > between the controllers when it should have been active/passive.
> >> >
> >> > We have killed off the OSDs on the new node as a precaution and will
> >> > try to replicate this in our lab.
> >> >
> >> > I suspicion is that is has to do with the cache promotion code update,
> >> > but I'm not sure how it would have caused this.
> >> >
> >> While blissfully unaware of the code, I have a hard time imagining how
> it
> >> would cause that as well.
> >> Potentially a regression in the code that only triggers in one cache
> mode
> > and
> >> when wanting to promote something?
> >>
> >> Or if it is actually the switching action, not correctly promoting
> things
> > as it
> >> happens?
> >> And thus referencing a stale object?
> >
> > I can't think of any other reason why the recency would break things in
> any
> > other way. Can the OP confirm what recency setting is being used?
> >
> > When you switch to writeback, if you haven't reached the required recency
> > yet, all reads will be proxied, previous behaviour would have pretty much
> > promoted all the time regardless. So unless something is happening where
> > writes are getting sent to one tier in forward mode and then read from a
> > different tier in WB mode, I'm out of ideas.  I'm pretty sure the code
> says
> > Proxy Read then check for promotion, so I'm not even convinced that there
> > should be any difference anyway.
> >
> > I note the documentation states that in forward mode, modified objects
> get
> > written to the backing tier, I'm not sure if that sounds correct to me. But if
> > that is what is happening, that could also be related to the problem???
> >
> > I think this might be easyish to reproduce using the get/put commands
> with a
> > couple of objects on a test pool if anybody out there is running 94.6 on
> the
> > whole c

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Mike Lovell
just got done with a test against a build of 0.94.6 minus the two commits
that were backported in PR 7207. everything worked as it should with the
cache-mode set to writeback and the min_read_recency_for_promote set to 2.
assuming it works properly on master, there must be a commit that we're
missing on the backport to support this properly.

sage,
i'm adding you to the recipients on this so hopefully you see it. the tl;dr
version is that the backport of the cache recency fix to hammer doesn't
work right and potentially corrupts data when
the min_read_recency_for_promote is set to greater than 1.

mike

On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell 
wrote:

> robert and i have done some further investigation the past couple days on
> this. we have a test environment with a hard drive tier and an ssd tier as
> a cache. several vms were created with volumes from the ceph cluster. i did
> a test in each guest where i un-tarred the linux kernel source multiple
> times and then did a md5sum check against all of the files in the resulting
> source tree. i started off with the monitors and osds running 0.94.5 and
> never saw any problems.
>
> a single node was then upgraded to 0.94.6 which has osds in both the ssd
> and hard drive tier. i then proceeded to run the same test and, while the
> untar and md5sum operations were running, i changed the ssd tier cache-mode
> from forward to writeback. almost immediately the vms started reporting io
> errors and odd data corruption. the remainder of the cluster was updated to
> 0.94.6, including the monitors, and the same thing happened.
>
> things were cleaned up and reset and then a test was run
> where min_read_recency_for_promote for the ssd cache pool was set to 1. we
> previously had it set to 6. there was never an error with the recency
> setting set to 1. i then tested with it set to 2 and it immediately caused
> failures. we are currently thinking that it is related to the backport of
> the fix for the recency promotion and are in the process of making a .6 build
> without that backport to see if we can cause corruption. is anyone using a
> version from after the original recency fix (PR 6702) with a cache tier in
> writeback mode? anyone have a similar problem?
>
> mike
>
> On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell 
> wrote:
>
>> something weird happened on one of the ceph clusters that i administer
>> tonight which resulted in virtual machines using rbd volumes seeing
>> corruption in multiple forms.
>>
>> when everything was fine earlier in the day, the cluster was a number of
>> storage nodes spread across 3 different roots in the crush map. the first
>> bunch of storage nodes have both hard drives and ssds in them with the hard
>> drives in one root and the ssds in another. there is a pool for each and
>> the pool for the ssds is a cache tier for the hard drives. the last set of
>> storage nodes were in a separate root with their own pool that is being
>> used for burn in testing.
>>
>> these nodes had run for a while with test traffic and we decided to move
>> them to the main root and pools. the main cluster is running 0.94.5 and the
>> new nodes got 0.94.6 due to them getting configured after that was
>> released. i removed the test pool and did a ceph osd crush move to move the
>> first node into the main cluster, the hard drives into the root for that
>> tier of storage and the ssds into the root and pool for the cache tier.
>> each set was done about 45 minutes apart and they ran for a couple hours
>> while performing backfill without any issue other than high load on the
>> cluster.
>>
>> we normally run the ssd tier in the forward cache-mode due to the ssds we
>> have not being able to keep up with the io of writeback. this results in io
>> on the hard drives slowly going up and performance of the cluster starting
>> to suffer. about once a week, i change the cache-mode between writeback and
>> forward for short periods of time to promote actively used data to the
>> cache tier. this moves io load from the hard drive tier to the ssd tier and
>> has been done multiple times without issue. i normally don't do this while
>> there are backfills or recoveries happening on the cluster but decided to
>> go ahead while backfill was happening due to the high load.
>>
>> i tried this procedure to change the ssd cache-tier between writeback and
>> forward cache-mode and things seemed okay from the ceph cluster. about 10
>> minutes after the first attempt at changing the mode, vms using the ceph
>> cluster for their storage started seeing corruption in multiple forms. the
>> mode was flipped back and forth multiple times in that time frame and its
>> unknown if the corruption was noticed with the first change or subsequent
>> changes. the vms were having issues of filesystems having errors and
>> getting remounted RO and mysql databases seeing corruption (both myisam and
>> innodb). some of this was recoverable but on some filesystems there was
>> corruption that

Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-03-19 Thread Stephen Harker

On 2016-02-17 11:07, Christian Balzer wrote:


On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote:


> > Let's consider both cases:
> > Journals on SSDs - for writes, the write operation returns right
> > after data lands on the Journal's SSDs, but before it's written to
> > the backing HDD. So, for writes, SSD journal approach should be
> > comparable to having a SSD cache tier.
> Not quite, see below.
>
>
Could you elaborate a bit more?

Are you saying that with a Journal on a SSD writes from clients, before
they can return from the operation to the client, must end up on both the
SSD (Journal) *and* HDD (actual data store behind that journal)?


No, your initial statement is correct.

However that burst of speed doesn't last indefinitely.

Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this in this ML,
try your google-foo.

For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster, the
speed will eventually (after a few seconds) go down to what your backing
storage (HDDs) are capable of sustaining.


> (Which SSDs do you plan to use anyway?)
>

Intel DC S3700


Good choice, with the 200GB model prefer the 3700 over the 3710 (higher
sequential write speed).


Hi All,

I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, 
each of which has 6 4TB SATA drives within. I had my eye on these:


400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0

but reading through this thread, it might be better to go with the P3700 
given the improved iops. So a couple of questions.


* Are the PCI-E versions of these drives different in any other way than 
the interface?


* Would one of these as a journal for 6 4TB OSDs be overkill 
(connectivity is 10GE, or will be shortly anyway), would the SATA S3700 
be sufficient?


Given they're not hot-swappable, it'd be good if they didn't wear out in 
6 months too.
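
On the sizing question, the rule of thumb from the docs that I've been working
from is roughly (numbers below are only an example, not measurements):

  journal size >= 2 * (expected throughput per OSD * filestore max sync interval)
  e.g. 2 * ~150 MB/s (one 4TB SATA drive) * 5 s  =  ~1.5 GB per OSD journal

so capacity-wise even a small partition per OSD is fine; the thing to compare
against the SSD spec sheet is the combined sequential write rate of the 6 OSDs
behind it.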


I realise I've not given you much to go on and I'm Googling around as 
well, I'm really just asking in case someone has tried this already and 
has some feedback or advice..


Thanks! :)

Stephen

--
Stephen Harker
Chief Technology Officer
The Positive Internet Company.

--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw (civetweb) hangs once around 850 established connections

2016-03-19 Thread seapasu...@uchicago.edu
I have a cluster of around 630 OSDs with 3 dedicated monitors and 2 
dedicated gateways. The entire cluster is running hammer (0.94.5 
(9764da52395923e0b32908d83a9f7304401fee43)).


(Both of my gateways have stopped responding to curl right now.
root@host:~# timeout 5 curl localhost ; echo $?
124

From here I checked and it looks like radosgw has over 1 million open 
files:

root@host:~# grep -i rados whatisopen.files.list | wc -l
1151753

And around 750 open connections:
root@host:~# netstat -planet | grep radosgw | wc -l
752
root@host:~# ss -tnlap | grep rados | wc -l
752

I don't think that the backend storage is hanging based on the following 
dump:


root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok 
objecter_requests | grep -i mtime

"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
[...]
"mtime": "0.00",

The radosgw log is still showing lots of activity and so does strace 
which makes me think this is a config issue or limit of some kind that 
is not triggering a log. Of what I am not sure as the log doesn't seem 
to show any open file limit being hit and I don't see any big errors 
showing up in the logs.

(last 500 lines of /var/log/radosgw/client.radosgw.log)
http://pastebin.com/jmM1GFSA

Perf dump of radosgw
http://pastebin.com/rjfqkxzE

Radosgw objecter requests:
http://pastebin.com/skDJiyHb

After restarting the gateway with '/etc/init.d/radosgw restart' the old 
process remains, no error is sent, and then I get connection refused via 
curl or netcat::

root@kh11-9:~# curl localhost
curl: (7) Failed to connect to localhost port 80: Connection refused

Once I kill the old radosgw via sigkill the new radosgw instance 
restarts automatically and starts responding::

root@kh11-9:~# curl localhost
xmlns="http://s3.amazonaws.com/doc/2006-03-01/";>anonymous

What is going on here?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-19 Thread Chris Dunlop
Hi Stable Release Team for v0.94,

On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote:
> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
>> I think you misread what Sage wrote : "The intention was to
>> continue building stable releases (0.94.x) on the old list of
>> supported platforms (which inclues 12.04 and el6)". In other
>> words, the old OS'es are still supported. Their absence is a
>> glitch in the release process that will be fixed.
> 
> Any news on a release of v0.94.6 for debian wheezy?

Any news on a release of v0.94.6 for debian wheezy?

Cheers,

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW quota

2016-03-19 Thread Marius Vaitiekunas
On Wednesday, 16 March 2016, Derek Yarnell  wrote:

> Hi,
>
> We have a user with a 50GB quota and has now a single bucket with 20GB
> of files.  They had previous buckets created and removed but the quota
> has not decreased.  I understand that we do garbage collection but it
> has been significantly longer than the defaults that we have not
> overridden.  They get 403 QuotaExceeded when trying to write additional
> data to a new bucket or the existing bucket.
>
> # radosgw-admin user info --uid=username
> ...
> "user_quota": {
> "enabled": true,
> "max_size_kb": 52428800,
> "max_objects": -1
> },
>
> # radosgw-admin bucket stats --bucket=start
> ...
> "usage": {
> "rgw.main": {
> "size_kb": 21516505,
> "size_kb_actual": 21516992,
> "num_objects": 243
> }
> },
>
> # radosgw-admin user stats --uid=username
> ...
> {
> "stats": {
> "total_entries": 737,
> "total_bytes": 55060794604,
> "total_bytes_rounded": 55062102016
> },
> "last_stats_sync": "2016-03-16 14:16:25.205060Z",
> "last_stats_update": "2016-03-16 14:16:25.190605Z"
> }
>
> Thanks,
> derek
>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

 Hi,
It's possible that somebody changed the owner of some bucket, but all
objects in that bucket still belong to this user. That way you can get
quota exceeded. We had the same situation.
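
If it is only the usage stats that are stale, forcing a recalculation might
also be worth a try, something like this (assuming your radosgw-admin has the
flag):

radosgw-admin user stats --uid=username --sync-stats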

-- 
Marius Vaitiekūnas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD/Ceph as Physical boot volume

2016-03-19 Thread Josh Durgin

On 03/17/2016 03:51 AM, Schlacta, Christ wrote:

I posted about this a while ago, and someone else has since inquired,
but I am seriously wanting to know if anybody has figured out how to
boot from a RBD device yet using ipxe or similar.  Last I read.
loading the kernel and initrd from object storage would be
theoretically easy, and would only require making an initramfs to
initialize and mount the rbd..  But I couldn't find any documented
instances of anybody having done this yet..  So..  Has anybody done
this yet?  If so, which distros is it working on, and where can I find
more info?


Not sure if anyone is doing this, though there was a patch for creating
an initramfs that would mount rbd:

https://lists.debian.org/debian-kernel/2015/06/msg00161.html

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Schlacta, Christ
On Mar 18, 2016 4:31 PM, "Lionel Bouton" 
>
> Will bluestore provide the same protection against bitrot than BTRFS?
> Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s)
> with invalid data get IO errors when trying to read corrupted data and
> as such can't be used as the source for repairs even if they are primary
> OSD(s). So with BTRFS you get a pretty good overall protection against
> bitrot in Ceph (it allowed us to automate the repair process in the most
> common cases). With XFS IIRC unless  you override the default behavior
> the primary OSD is always the source for repairs (even if all the
> secondaries agree on another version of the data).

I have a functionally identical question about bluestore, but with zfs
instead of btrfs.  Do you have more info on this  bluestore?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Heath Albritton
If you google "ceph bluestore" you'll be able to find a couple slide decks on 
the topic.  One of them by Sage is easy to follow without the benefit of the 
presentation.  There's also the " Redhat Ceph Storage Roadmap 2016" deck.

In any case, bluestore is not intended to address bitrot.  Given that ceph is a 
distributed file system, many of the posix file system features are not 
required for the underlying block storage device.  Bluestore is intended to 
address this and reduce the disk IO required to store user data.

Ceph protects against bitrot at a much higher level by validating the checksum 
of the entire placement group during a deep scrub.

-H

> On Mar 19, 2016, at 10:06, Schlacta, Christ  wrote:
> 
> 
> On Mar 18, 2016 4:31 PM, "Lionel Bouton"  
> >
> > Will bluestore provide the same protection against bitrot than BTRFS?
> > Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s)
> > with invalid data get IO errors when trying to read corrupted data and
> > as such can't be used as the source for repairs even if they are primary
> > OSD(s). So with BTRFS you get a pretty good overall protection against
> > bitrot in Ceph (it allowed us to automate the repair process in the most
> > common cases). With XFS IIRC unless  you override the default behavior
> > the primary OSD is always the source for repairs (even if all the
> > secondaries agree on another version of the data).
> 
> I have a functionally identical question about bluestore, but with zfs 
> instead of btrfs.  Do you have more info on this  bluestore? 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot remove rbd locks

2016-03-19 Thread Jason Dillaman
Try the following; the lock ID contains a space ("auto 140454012457856"), so it
needs to be quoted as a single argument:

# rbd lock remove vm-114-disk-1 "auto 140454012457856" client.71260575

-- 

Jason Dillaman 


- Original Message -
> From: "Christoph Adomeit" 
> To: ceph-us...@ceph.com
> Sent: Friday, March 18, 2016 11:14:00 AM
> Subject: [ceph-users] Cannot remove rbd locks
> 
> Hi,
> 
> some of my rbds show they have an exclusive lock.
> 
> I think the lock can be stale or weeks old.
> 
> We have also once added feature exclusive lock and later removed that feature
> 
> I can see the lock:
> 
> root@machine:~# rbd lock list vm-114-disk-1
> There is 1 exclusive lock on this image.
> Locker  ID   Address
> client.71260575 auto 140454012457856 10.67.1.14:0/1131494432
> 
> But I cannot remove the lock:
> 
> root@machine:~# rbd lock remove vm-114-disk-1 auto client.71260575
> rbd: releasing lock failed: (2) No such file or directory
> 
> How can I remove the locks ?
> 
> Thanks
>   Christoph
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] reallocate when OSD down

2016-03-19 Thread Trelohan Christophe
Hello,

I have a problem with the following crushmap :

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 device
type 1 host
type 2 chassis
type 3 rack
type 4 room
type 5 datacenter
type 6 root

# buckets
host testctrcephosd1 {
id -1   # do not change unnecessarily
# weight 3.000
alg straw
hash 0  # rjenkins1
item osd.0 weight 1.000
item osd.1 weight 1.000
item osd.2 weight 1.000
}
host testctrcephosd2 {
id -2   # do not change unnecessarily
# weight 3.000
alg straw
hash 0  # rjenkins1
item osd.3 weight 1.000
item osd.4 weight 1.000
item osd.5 weight 1.000
}
host testctrcephosd3 {
id -3   # do not change unnecessarily
# weight 3.000
alg straw
hash 0  # rjenkins1
item osd.6 weight 1.000
item osd.7 weight 1.000
item osd.8 weight 1.000
}
host testctrcephosd4 {
id -4   # do not change unnecessarily
# weight 3.000
alg straw
hash 0  # rjenkins1
item osd.9 weight 1.000
item osd.10 weight 1.000
item osd.11 weight 1.000
}
chassis chassis1 {
id -5   # do not change unnecessarily
# weight 6.000
alg straw
hash 0  # rjenkins1
item testctrcephosd1 weight 3.000
item testctrcephosd2 weight 3.000
}
chassis chassis2 {
id -6   # do not change unnecessarily
# weight 6.000
alg straw
hash 0  # rjenkins1
item testctrcephosd3 weight 3.000
item testctrcephosd4 weight 3.000
}

room salle1 {
id -7
# weight 6.000
alg straw
hash 0
item chassis1 weight 6.000
}

room salle2 {
id -8
# weight 6.000
alg straw
hash 0
item chassis2 weight 6.000
}

root dc1 {
id -9
# weight 6.000
alg straw
hash 0
item salle1 weight 6.000
item salle2 weight 6.000
}


# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take dc1
step chooseleaf firstn 0 type host
step emit
}

rule dc {
ruleset 1
type replicated
min_size 2
max_size 10
step take dc1
step choose firstn 0 type room
step chooseleaf firstn 0 type chassis
step emit
}

ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-9 12.0 root dc1
-7  6.0 room salle1
-5  6.0 chassis chassis1
-1  3.0 host testctrcephosd1
0  1.0 osd.0 up  1.0  1.0
1  1.0 osd.1 up  1.0  1.0
2  1.0 osd.2 up  1.0  1.0
-2  3.0 host testctrcephosd2
3  1.0 osd.3 up  1.0  1.0
4  1.0 osd.4 up  1.0  1.0
5  1.0 osd.5 up  1.0  1.0
-8  6.0 room salle2
-6  6.0 chassis chassis2
-3  3.0 host testctrcephosd3
6  1.0 osd.6 up  1.0  1.0
7  1.0 osd.7 up  1.0  1.0
8  1.0 osd.8 up  1.0  1.0
-4  3.0 host testctrcephosd4
9  1.0 osd.9 up  1.0  1.0
10  1.0 osd.10up  1.0  1.0
11  1.0 osd.11up  1.0  1.0


Allocating when creating is ok, my datas are replicated in 2 rooms.

ceph osd map rbdnew testvol1
osdmap e127 pool 'rbdnew' (1) object 'testvol1' -> pg 1.c657d5a4 (1.a4) -> up 
([9,5], p9) acting ([9,5], p9)

but when one of these host is down, I want to create another replica on the 
other host in the same room. For example, when host "testctrcephosd2" is down, 
I want CRUSH to create another copy in "testctrcephosd1" (keeping another copy 
on one of the host in room "salle 2".
In place of this, cluster stays with only one osd used (instead of 2) :

ceph osd map rbdnew testvol1
osdmap e130 pool 'rbdnew' (1) object 'testvol1' -> pg 1.c657d5a4 (1.a4) -> up 
([9], p9) acting ([9], p9)

Do you have any idea to do this ?

Regards

Christophe

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://l

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Mike Lovell
robert and i have done some further investigation the past couple days on
this. we have a test environment with a hard drive tier and an ssd tier as
a cache. several vms were created with volumes from the ceph cluster. i did
a test in each guest where i un-tarred the linux kernel source multiple
times and then did a md5sum check against all of the files in the resulting
source tree. i started off with the monitors and osds running 0.94.5 and
never saw any problems.

a single node was then upgraded to 0.94.6 which has osds in both the ssd
and hard drive tier. i then proceeded to run the same test and, while the
untar and md5sum operations were running, i changed the ssd tier cache-mode
from forward to writeback. almost immediately the vms started reporting io
errors and odd data corruption. the remainder of the cluster was updated to
0.94.6, including the monitors, and the same thing happened.

things were cleaned up and reset and then a test was run
where min_read_recency_for_promote for the ssd cache pool was set to 1. we
previously had it set to 6. there was never an error with the recency
setting set to 1. i then tested with it set to 2 and it immediately caused
failures. we are currently thinking that it is related to the backport of
the fix for the recency promotion and are in the process of making a .6 build
without that backport to see if we can cause corruption. is anyone using a
version from after the original recency fix (PR 6702) with a cache tier in
writeback mode? anyone have a similar problem?
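
(for anyone trying to reproduce: both knobs involved here are just the usual
tier/pool commands, something along these lines, with the pool name as a
placeholder:

ceph osd tier cache-mode <ssd-cache-pool> writeback      # or: forward
ceph osd pool set <ssd-cache-pool> min_read_recency_for_promote 2
)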

mike

On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell 
wrote:

> something weird happened on one of the ceph clusters that i administer
> tonight which resulted in virtual machines using rbd volumes seeing
> corruption in multiple forms.
>
> when everything was fine earlier in the day, the cluster was a number of
> storage nodes spread across 3 different roots in the crush map. the first
> bunch of storage nodes have both hard drives and ssds in them with the hard
> drives in one root and the ssds in another. there is a pool for each and
> the pool for the ssds is a cache tier for the hard drives. the last set of
> storage nodes were in a separate root with their own pool that is being
> used for burn in testing.
>
> these nodes had run for a while with test traffic and we decided to move
> them to the main root and pools. the main cluster is running 0.94.5 and the
> new nodes got 0.94.6 due to them getting configured after that was
> released. i removed the test pool and did a ceph osd crush move to move the
> first node into the main cluster, the hard drives into the root for that
> tier of storage and the ssds into the root and pool for the cache tier.
> each set was done about 45 minutes apart and they ran for a couple hours
> while performing backfill without any issue other than high load on the
> cluster.
>
> we normally run the ssd tier in the forward cache-mode due to the ssds we
> have not being able to keep up with the io of writeback. this results in io
> on the hard drives slowly going up and performance of the cluster starting
> to suffer. about once a week, i change the cache-mode between writeback and
> forward for short periods of time to promote actively used data to the
> cache tier. this moves io load from the hard drive tier to the ssd tier and
> has been done multiple times without issue. i normally don't do this while
> there are backfills or recoveries happening on the cluster but decided to
> go ahead while backfill was happening due to the high load.
>
> i tried this procedure to change the ssd cache-tier between writeback and
> forward cache-mode and things seemed okay from the ceph cluster. about 10
> minutes after the first attempt at changing the mode, vms using the ceph
> cluster for their storage started seeing corruption in multiple forms. the
> mode was flipped back and forth multiple times in that time frame and its
> unknown if the corruption was noticed with the first change or subsequent
> changes. the vms were having issues of filesystems having errors and
> getting remounted RO and mysql databases seeing corruption (both myisam and
> innodb). some of this was recoverable but on some filesystems there was
> corruption that lead to things like lots of data ending up in the
> lost+found and some of the databases were un-recoverable (backups are
> helping there).
>
> i'm not sure what would have happened to cause this corruption. the
> libvirt logs for the qemu processes for the vms did not provide any output
> of problems from the ceph client code. it doesn't look like any of the qemu
> processes had crashed. also, it has now been several hours since this
> happened with no additional corruption noticed by the vms. it doesn't
> appear that we had any corruption happen before i attempted the flipping of
> the ssd tier cache-mode.
>
> the only think i can think of that is different between this time doing
> this procedure vs previous attempts was that there was the one stor

[ceph-users] Cannot remove rbd locks

2016-03-19 Thread Christoph Adomeit
Hi,

some of my rbds show they have an exclusive lock.

I think the lock can be stale or weeks old.

We have also once added feature exclusive lock and later removed that feature

I can see the lock:

root@machine:~# rbd lock list vm-114-disk-1
There is 1 exclusive lock on this image.
Locker  ID   Address 
client.71260575 auto 140454012457856 10.67.1.14:0/1131494432 

But I cannot remove the lock:

root@machine:~# rbd lock remove vm-114-disk-1 auto client.71260575
rbd: releasing lock failed: (2) No such file or directory

How can I remove the locks ?

Thanks
  Christoph


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections

2016-03-19 Thread Ben Hines
What OS are you using?

I have a lot more open connections than that. (though i have some other
issues, where rgw sometimes returns 500 errors, it doesn't stop like yours)

You might try tuning civetweb's num_threads and 'rgw num rados handles':

rgw frontends = civetweb num_threads=125
error_log_file=/var/log/radosgw/civetweb.error.log
access_log_file=/var/log/radosgw/civetweb.access.log
rgw num rados handles = 32

You can also up civetweb loglevel:

debug civetweb = 20
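
Given the million-plus open files you mention, it is probably also worth
confirming the radosgw process isn't sitting at its file descriptor limit,
e.g.:

cat /proc/$(pidof radosgw)/limits | grep -i 'open files'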

-Ben

On Wed, Mar 16, 2016 at 5:03 PM, seapasu...@uchicago.edu <
seapasu...@uchicago.edu> wrote:

> I have a cluster of around 630 OSDs with 3 dedicated monitors and 2
> dedicated gateways. The entire cluster is running hammer (0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43)).
>
> (Both of my gateways have stopped responding to curl right now.
> root@host:~# timeout 5 curl localhost ; echo $?
> 124
>
> From here I checked and it looks like radosgw has over 1 million open
> files:
> root@host:~# grep -i rados whatisopen.files.list | wc -l
> 1151753
>
> And around 750 open connections:
> root@host:~# netstat -planet | grep radosgw | wc -l
> 752
> root@host:~# ss -tnlap | grep rados | wc -l
> 752
>
> I don't think that the backend storage is hanging based on the following
> dump:
>
> root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok
> objecter_requests | grep -i mtime
> "mtime": "0.00",
> "mtime": "0.00",
> "mtime": "0.00",
> "mtime": "0.00",
> "mtime": "0.00",
> "mtime": "0.00",
> [...]
> "mtime": "0.00",
>
> The radosgw log is still showing lots of activity and so does strace which
> makes me think this is a config issue or limit of some kind that is not
> triggering a log. Of what I am not sure as the log doesn't seem to show any
> open file limit being hit and I don't see any big errors showing up in the
> logs.
> (last 500 lines of /var/log/radosgw/client.radosgw.log)
> http://pastebin.com/jmM1GFSA
>
> Perf dump of radosgw
> http://pastebin.com/rjfqkxzE
>
> Radosgw objecter requests:
> http://pastebin.com/skDJiyHb
>
> After restarting the gateway with '/etc/init.d/radosgw restart' the old
> process remains, no error is sent, and then I get connection refused via
> curl or netcat::
> root@kh11-9:~# curl localhost
> curl: (7) Failed to connect to localhost port 80: Connection refused
>
> Once I kill the old radosgw via sigkill the new radosgw instance restarts
> automatically and starts responding::
> root@kh11-9:~# curl localhost
> http://s3.amazonaws.com/doc/2006-03-01/
> ">anonymous
> What is going on here?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Jeffrey McDonald
Hi Sam,

I've written a script but I'm a little leery of unleashing it until I find a
few more cases to test.   The script successfully removed the file
mentioned above.
I took the next pg which was marked inconsistent and ran the following
command over those pg directory structures:

find . -name "*_long" -exec xattr -p user.cephos.lfn3 {} +  | grep -v


I didn't find any files that were "orphaned" by this command.   All of these
files should have "_long" and the grep should pull out the invalid
generation, correct?
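
For reference, a rough, untested sketch of the kind of check I mean, based on
the rule Sam described (a *_long file looks misplaced if its directory has a
deeper DIR_ subdirectory matching the next nibble of the hash stored in its
user.cephos.lfn3 attr):

find . -name '*_long' | while read -r f; do
    full=$(xattr -p user.cephos.lfn3 "$f")
    hash=$(echo "$full" | sed -n 's/.*__head_\([0-9A-Fa-f]*\)__.*/\1/p' | tr '[:lower:]' '[:upper:]')
    dir=$(dirname "$f")
    depth=$(echo "$dir" | grep -o 'DIR_' | wc -l)
    next=$(echo "$hash" | rev | cut -c $((depth + 1)))
    if [ -n "$next" ] && [ -d "$dir/DIR_$next" ]; then
        echo "possible orphan: $f (hash $hash)"
    fi
done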

I'm looking wider but in the next pg marked inconsistent I didn't find any
orphans.

Thanks,
Jeff

-- 

Jeffrey McDonald, PhD
Assistant Director for HPC Operations
Minnesota Supercomputing Institute
University of Minnesota Twin Cities
599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
117 Pleasant St SE   phone: +1 612 625-6905
Minneapolis, MN 55455fax:   +1 612 624-8861
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Jeffrey McDonald
OK, I think I have it now.   I do have one more question, in this case, the
hash indicates the directory structure but how do I know from the hash how
many levels I should go down.If the hash is a 32-bit hex integer, *how
do I know how many should be included as part of the hash for the directory
structure*?

e.g. our example: the hash is 79CED459 and the directory is then the last
five taken in reverse order, what happens if there are only 4 levels of
hierarchy?I only have this one example so far.is the 79C of the
hash constant?   Would the hash pick up another hex character if the pg
splits again?

Thanks,
Jeff

On Wed, Mar 16, 2016 at 10:24 AM, Samuel Just  wrote:

> There is a directory structure hash, it's just that it's at the end of
> the name and you'll have to check the xattr I mentioned to find it.
>
> I think that file is actually the one we are talking about removing.
>
>
> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
> user.cephos.lfn3:
>
> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0
>
> Notice that the user.cephosd.lfn3 attr has the full name, and it
> *does* have a hash 79CED459 (you referred to it as a directory hash I
> think, but it's actually the hash we used to place it on this osd to
> begin with).
>
> In specifically this case, you shouldn't find any files in the
> DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories
> (so all hash values should hash to one of those).
>
> The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's
> the actual object file, don't remove that.  If you look at the attr:
>
>
> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
> user.cephos.lfn3:
>
> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0
>
> The hash is 79CED459, which means that (assuming
> DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the
> right place.
>
> The ENOENT return
>
> 2016-03-07 16:11:41.828332 7ff30cdad700 10
> filestore(/var/lib/ceph/osd/ceph-307) remove
>
> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
> = -2
> 2016-03-07 21:44:02.197676 7fe96b56f700 10
> filestore(/var/lib/ceph/osd/ceph-307) remove
>
> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
> = -2
>
> actually was a symptom in this case, but, in general, it's not
> indicative of anything -- the filestore can get ENOENT return values
> for legitimate reasons.
>
> To reiterate: files that end in something like
> fa202ec9b4b3b217275a_0_long are *not* necessarily orphans -- you need
> to check the user.cephos.lfn3 attr (as you did before) for the full
> length file name and determine whether the file is in the right place.
> -Sam
>
> On Wed, Mar 16, 2016 at 7:49 AM, Jeffrey McDonald 
> wrote:
> > Hi Sam,
> >
> > In the 70.459 logs from the deep-scrub, there is an error:
> >
> >  $ zgrep "= \-2$" ceph-osd.307.log.1.gz
> > 2016-03-07 16:11:41.828332 7ff30cdad700 10
> > filestore(/var/lib/ceph/osd/ceph-307) remove
> >
> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
> > = -2
> > 2016-03-07 21:44:02.197676 7fe96b56f700 10
> > filestore(/var/lib/ceph/osd/ceph-307) remove
> >
> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
> > = -2
> >
> > I'm taking this as an indication of the error you mentioned.It looks
> to
> > me as if t

[ceph-users] ssd only storage and ceph

2016-03-19 Thread Erik Schwalbe
Hi, 

at the moment I do some tests with SSD's and ceph. 
My Question is, how to mount an SSD OSD? With or without discard option? 
Where should I do the fstrim, when I mount the OSD without discard? On the ceph 
storage node? Inside the vm, running on rbd? 

What is the best practice there. 

Thanks for your answers. 

Regards, 
Erik 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Infernalis .rgw.buckets.index objects becoming corrupted on RHEL 7.2 during recovery

2016-03-19 Thread Brandon Morris, PMP
List,

We have stood up an Infernalis 9.2.0 cluster on RHEL 7.2.  We are using the
radosGW to store potentially billions of small to medium sized objects (64k
- 1MB).

We have run into an issue twice thus far where .rgw.buckets.index placement
groups will become corrupt during recovery after a drive failure.  This
corruption will cause the OSD to crash with a  suicide_timeout error when
trying to backfill the corrupted index file to a different OSD.  Exporting
the corrupted placement group using the ceph-objectstore-tool will also
hang. When this first came up, we were able to simply rebuild the .rgw
pools and start from scratch.  There were no underlying XFS issues.

Before we put this cluster into full operation, we are looking to determine
what caused this and if there is a hard limit to the number of objects in a
bucket.  We are currently putting all objects into 1 bucket, but should
probably divide these up.

I have uploaded the OSD and ceph-objectstore tool debug files here:
https://github.com/garignack/ceph_misc/raw/master/ceph-osd.388.zip   Any
help would be greatly appreciated.

I'm not a ceph expert by any means, but here is where I've gotten to thus
far. (And may be way off base)

The PG in question only has 1 object - .dir.default.808642.1.163
| [root@node13 ~]# ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-388/ --journal-path
/var/lib/ceph/osd/ceph-388/journal --pgid 24.197 --op list
| SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 18 00 00
00 00 20 00 00 00 00 00 83 1c 00 00 00 00 00 00 00 00 00 00 00 00
| SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 18 00 00
00 00 20 00 00 00 00 00 83 1c 00 00 00 00 00 00 00 00 00 00 00 00
|
["24.197",{"oid":".dir.default.808642.1.163","key":"","snapid":-2,"hash":491874711,"max":0,"pool":24,"namespace":"","max":0}]

Here are the final lines of the ceph-objectstore-tool before it hangs:

| e140768: 570 osds: 558 up, 542 in
| Read 24/1d516997/.dir.default.808642.1.163/head
| size=0
| object_info:
24/1d516997/.dir.default.808642.1.163/head(139155'2197754
client.1137891.0:20837319 dirty|omap|data_digest s 0 uv 2197754 dd )
| attrs size 2

This leads me to suspect something between line 564 and line 576 in the
tool is hanging.
https://github.com/ceph/ceph/blob/master/src/tools/ceph_objectstore_tool.cc#L564.
Current suspect is the objectstore read command.

| ret = store->read(cid, obj, offset, len, rawdatabl);

Looking through the OSD debug logs, I also see a strange
size (18446744073709551615) on the recovery operation for the
24/1d516997/.dir.default.808642.1.163/head object

| 2016-03-17 12:12:29.753446 7f972ca3d700 10 osd.388 154849 dequeue_op
0x7f97580d3500 prio 2 cost 1049576 latency 0.000185 MOSDPGPull(24.197
154849 [PullOp(24/1d516997/.dir.default.808642.1.163/head, recovery_info:
ObjectRecoveryInfo(24/1d516997/.dir.default.808642.1.163/head@139155'2197754,
size: 18446744073709551615, copy_subset: [0~18446744073709551615],
clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first,
data_recovered_to:0, data_complete:false, omap_recovered_to:,
omap_complete:false))]) v2 pg pg[24.197( v 139155'2197754
(139111'2194700,139155'2197754] local-les=154480 n=1 ec=128853 les/c/f
154268/138679/0 154649/154650/154650) [179,443,517]/[306,441] r=-1
lpr=154846 pi=138674-154649/37 crt=139155'2197752 lcod 0'0 inactive NOTIFY
NIBBLEWISE]

this error eventually causes the thread to hang and trigger the
suicide timeout
| 2016-03-17 12:12:45.541528 7f973524e700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f972ca3d700' had timed out after 15
| 2016-03-17 12:12:45.541533 7f973524e700 20 heartbeat_map is_healthy =
NOT HEALTHY, total workers: 29, number of unhealthy: 1
| 2016-03-17 12:12:45.541534 7f973524e700 10 osd.388 154849 internal
heartbeat not healthy, dropping ping request

| 2016-03-17 12:15:02.148193 7f973524e700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f972ca3d700' had timed out after 15
| 2016-03-17 12:15:02.148195 7f973524e700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f972ca3d700' had suicide timed out after 150


| ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
|  1: (()+0x7e6ab2) [0x7f9753701ab2]
|  2: (()+0xf100) [0x7f9751893100]
| 3: (gsignal()+0x37) [0x7f97500705f7]
| 4: (abort()+0x148) [0x7f9750071ce8]
| 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f97509749d5]
| 6: (()+0x5e946) [0x7f9750972946]
| 7: (()+0x5e973) [0x7f9750972973]
| 8: (()+0x5eb93) [0x7f9750972b93]
| 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0x7f97537f6dda]
| 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x2d9) [0x7f97537363b9]
| 11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0x7f9753736bf6]
| 12: (OSD::handle_osd_ping(MOSDPing*)+0x933) [0x7f9753241593]
| 13: (OSD::heartbeat_dispatch(M

Re: [ceph-users] RGW quota

2016-03-19 Thread Derek Yarnell
On 3/17/16 1:41 PM, Marius Vaitiekunas wrote:
> It's possible that somebody changed the owner of some bucket. But all
> objects in that bucket still belongs to this user. That way you can get
> quota exceeded. We had the same situation.

Well the user says he didn't write to any other buckets than his own.
The usage shows that he did have two other buckets boston_bombing,
charlie_hebdo and the buckets no longer exist (and we have apache logs
for the DELETE for them) but from the usage they were never deleted.  I
am concerned that, since the delete never shows up, this is where the
quota is being lost.

ceph-access.log:192.168.79.51 - - [16/Mar/2016:02:50:01 -0400] "DELETE
/boston_bombing/ HTTP/1.1" 204 - "-" "Boto/2.34.0 Python/2.6.6
Linux/2.6.32-573.18.1.el6.x86_64"

ceph-access.log:192.168.79.51 - - [16/Mar/2016:02:51:47 -0400] "DELETE
/charlie_hebdo/ HTTP/1.1" 204 - "-" "Boto/2.34.0 Python/2.6.6
Linux/2.6.32-573.18.1.el6.x86_64"

# radosgw-admin usage show --uid=username --start-date=2015-01-01
--end-date=2016-03-16
...
{
    "bucket": "boston",
    "time": "2016-03-07 23:00:00.00Z",
    "epoch": 1457391600,
    "categories": [
        {
            "category": "create_bucket",
            "bytes_sent": 19,
            "bytes_received": 0,
            "ops": 1,
            "successful_ops": 1
        },
        {
            "category": "get_acls",
            "bytes_sent": 174400,
            "bytes_received": 0,
            "ops": 352,
            "successful_ops": 352
        },
        {
            "category": "get_obj",
            "bytes_sent": 86170638,
            "bytes_received": 0,
            "ops": 14,
            "successful_ops": 10
        },
        {
            "category": "list_bucket",
            "bytes_sent": 381327,
            "bytes_received": 0,
            "ops": 10,
            "successful_ops": 10
        },
        {
            "category": "put_acls",
            "bytes_sent": 3230,
            "bytes_received": 73031,
            "ops": 170,
            "successful_ops": 170
        },
        {
            "category": "put_obj",
            "bytes_sent": 0,
            "bytes_received": 14041021516,
            "ops": 169,
            "successful_ops": 169
        },
        {
            "category": "stat_bucket",
            "bytes_sent": 6688,
            "bytes_received": 0,
            "ops": 353,
            "successful_ops": 352
        }
    ]
},
{
    "bucket": "charlie_hebdo",
    "time": "2016-03-07 23:00:00.00Z",
    "epoch": 1457391600,
    "categories": [
        {
            "category": "create_bucket",
            "bytes_sent": 19,
            "bytes_received": 0,
            "ops": 1,
            "successful_ops": 1
        },
        {
            "category": "get_acls",
            "bytes_sent": 79062,
            "bytes_received": 0,
            "ops": 159,
            "successful_ops": 159
        },
        {
            "category": "get_obj",
            "bytes_sent": 1096,
            "bytes_received": 0,
            "ops": 9,
            "successful_ops": 4
        },
        {
            "category": "list_bucket",
            "bytes_sent": 84129,
            "bytes_received": 0,
            "ops": 6,
            "successful_ops": 6
        },
        {
            "category": "put_acls",
            "bytes_sent": 1406,
            "bytes_received": 31655,
            "ops": 74,
            "successful_ops": 74
        },
   

[ceph-users] Infernalis: chown ceph:ceph at runtime ?

2016-03-19 Thread Christoph Adomeit
Hi,

we have upgraded our ceph-cluster to infernalis from hammer.

Ceph is still running as root and we are using the 
"setuser match path = /var/lib/ceph/$type/$cluster-$id" directive in ceph.conf

Now we would like to change the ownership of data-files and devices to ceph at 
runtime.

What ist the best way to do this ?

I am thinking about removing the "setuser match path" directive from ceph.conf 
and then stopping one osd after the other, change all files to ceph:ceph and 
then restart the daemon.

Is this the best and recommended way ?

I also once read about a fast parallel chown invocation on this mailing list but I 
have not yet found that mail. Does someone remember how it was done?
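
A minimal sketch of the per-OSD procedure described above (the OSD id, the
xargs parallelism and the init commands are placeholders, not official
upgrade documentation):

  # stop the daemon first, e.g. "systemctl stop ceph-osd@12" or the sysvinit equivalent
  chown -R ceph:ceph /var/lib/ceph/osd/ceph-12
  # or run the same chown in parallel:
  find /var/lib/ceph/osd/ceph-12 -print0 | xargs -0 -P 8 -n 2048 chown ceph:ceph
  # a separate journal partition also needs its ownership changed before the
  # OSD is restarted without the "setuser match path" directive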

Thanks
  Christoph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Mark Nelson
FWIW, from purely a performance perspective Ceph usually looks pretty 
fantastic on a fresh BTRFS filesystem.  In fact it will probably 
continue to look great until you do small random writes to large objects 
(like say to blocks in an RBD volume).  Then COW starts fragmenting the 
objects into oblivion.  I've seen sequential read performance drop by 
300% after 5 minutes of 4K random writes to the same RBD blocks.


Autodefrag might help.  A long time ago I recall Josef told me it was 
dangerous to use (I think it could run the node out of memory and 
corrupt the FS), but it may be that it's safer now.  In any event we 
don't really do a lot of testing with BTRFS these days as bluestore is 
indeed the next gen OSD backend.  If you do decide to give either BTRFS 
or ZFS a go with filestore, let us know how it goes. ;)
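
As a quick aside, two stock ways to observe or mitigate the fragmentation
described above (paths are placeholders, and this is a sketch rather than a
recommendation):

  # show the extent count of an individual object file (filefrag is from e2fsprogs)
  filefrag -v /var/lib/ceph/osd/ceph-12/current/<pg_head_dir>/<rbd_object_file>
  # autodefrag is a mount option, e.g. in fstab:
  #   /dev/sdb1  /var/lib/ceph/osd/ceph-12  btrfs  noatime,autodefrag  0 0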


Mark

On 03/18/2016 02:42 PM, Heath Albritton wrote:

Neither of these file systems is recommended for production use underlying an 
OSD.  The general direction for ceph is to move away from having a file system 
at all.

That effort is called "bluestore" and is supposed to show up in the jewel 
release.

-H


On Mar 18, 2016, at 11:15, Schlacta, Christ  wrote:

Insofar as I've been able to tell, both BTRFS and ZFS provide similar
capabilities back to CEPH, and both are sufficiently stable for the
basic CEPH use case (Single disk -> single mount point), so the
question becomes this:  Which actually provides better performance?
Which is the more highly optimized single write path for ceph?  Does
anybody have a handful of side-by-side benchmarks?  I'm more
interested in higher IOPS, since you can always scale-out throughput,
but throughput is also important.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-19 Thread Vasu Kulkarni
Thanks Dan, I have raised the tracker for this issue
http://tracker.ceph.com/issues/15176

On Thu, Mar 17, 2016 at 10:47 AM, Dan van der Ster 
wrote:

> Hi,
>
> It's true, partprobe works intermittently. I extracted the key
> commands to show the problem:
>
> [18:44]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph
> journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d
> --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt --
> /dev/sdc
> The operation has completed successfully.
> [18:44]# partprobe /dev/sdc
> Error: Error informing the kernel about modifications to partition
> /dev/sdc2 -- Device or resource busy.  This means Linux won't know
> about any changes you made to /dev/sdc2 until you reboot -- so you
> shouldn't mount it or use it in any way before rebooting.
> Error: Failed to add partition 2 (Device or resource busy)
> [18:44]# partprobe /dev/sdc
> [18:44]# partprobe /dev/sdc
> Error: Error informing the kernel about modifications to partition
> /dev/sdc2 -- Device or resource busy.  This means Linux won't know
> about any changes you made to /dev/sdc2 until you reboot -- so you
> shouldn't mount it or use it in any way before rebooting.
> Error: Failed to add partition 2 (Device or resource busy)
> [18:44]# partprobe /dev/sdc
> Error: Error informing the kernel about modifications to partition
> /dev/sdc2 -- Device or resource busy.  This means Linux won't know
> about any changes you made to /dev/sdc2 until you reboot -- so you
> shouldn't mount it or use it in any way before rebooting.
> Error: Failed to add partition 2 (Device or resource busy)
>
> But partx works every time:
>
> [18:46]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph
> journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d
> --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt --
> /dev/sdd
> The operation has completed successfully.
> [18:46]# partx -u /dev/sdd
> [18:46]# partx -u /dev/sdd
> [18:46]# partx -u /dev/sdd
> [18:46]#
>
> -- Dan
>
> On Thu, Mar 17, 2016 at 6:31 PM, Vasu Kulkarni 
> wrote:
> > I can raise a tracker for this issue since it looks like an intermittent
> > issue and mostly dependent on specific hardware or it would be better if
> you
> > add all the hardware/os details in tracker.ceph.com,  also from your
> logs it
> > looks like you have
> >  Resource busy issue: Error: Failed to add partition 2 (Device or
> resource
> > busy)
> >
> >  From my test run logs on centos 7.2 , 10.0.5 (
> >
> http://qa-proxy.ceph.com/teuthology/vasu-2016-03-15_15:34:41-selinux-master---basic-mira/62626/teuthology.log
> > )
> >
> > 2016-03-15T18:49:56.305
> > INFO:teuthology.orchestra.run.mira041.stderr:[ceph_deploy.osd][DEBUG ]
> > Preparing host mira041 disk /dev/sdb journal None activate True
> > 2016-03-15T18:49:56.305
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][DEBUG ] find the
> > location of an executable
> > 2016-03-15T18:49:56.309
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][INFO  ] Running
> > command: sudo /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type
> xfs --
> > /dev/sdb
> > 2016-03-15T18:49:56.546
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> > Running command: /usr/bin/ceph-osd --cluster=ceph
> --show-config-value=fsid
> > 2016-03-15T18:49:56.611
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> > Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster
> > ceph
> > 2016-03-15T18:49:56.643
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> > Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster
> ceph
> > 2016-03-15T18:49:56.708
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> > Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster
> ceph
> > 2016-03-15T18:49:56.708
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING]
> get_dm_uuid:
> > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> > 2016-03-15T18:49:56.709
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] set_type:
> > Will colocate journal with data on /dev/sdb
> > 2016-03-15T18:49:56.709
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command:
> > Running command: /usr/bin/ceph-osd --cluster=ceph
> > --show-config-value=osd_journal_size
> > 2016-03-15T18:49:56.774
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING]
> get_dm_uuid:
> > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> > 2016-03-15T18:49:56.774
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING]
> get_dm_uuid:
> > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> > 2016-03-15T18:49:56.775
> > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING]
> get_dm_uuid:
> > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid
> > 2016-03-15T18:49:56.775
> > INFO:teuthology.orchestra.run.mi

Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-19 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Daniel Niasoff
> Sent: 16 March 2016 21:02
> To: Nick Fisk ; 'Van Leeuwen, Robert'
> ; 'Jason Dillaman' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> Hi Nick,
> 
> Your solution requires manual configuration for each VM and cannot be
> setup as part of an automated OpenStack deployment.

Absolutely, potentially flaky as well.

> 
> It would be really nice if it was a hypervisor based setting as opposed to
a VM
> based setting.

Yes, I can't wait until we can just specify "rbd_cache_device=/dev/ssd" in
the ceph.conf and get it to write to that instead. Ideally ceph would also
provide some sort of lightweight replication for the cache devices, but
otherwise an iSCSI SSD farm or switched SAS could be used so that the caching
device is not tied to one physical host.

> 
> Thanks
> 
> Daniel
> 
> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: 16 March 2016 08:59
> To: Daniel Niasoff ; 'Van Leeuwen, Robert'
> ; 'Jason Dillaman' 
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Daniel Niasoff
> > Sent: 16 March 2016 08:26
> > To: Van Leeuwen, Robert ; Jason Dillaman
> > 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> >
> > Hi Robert,
> >
> > >Caching writes would be bad because a hypervisor failure would result
> > >in
> > loss of the cache which pretty much guarantees inconsistent data on
> > the ceph volume.
> > >Also live-migration will become problematic compared to running
> > everything from ceph since you will also need to migrate the
> local-storage.
> 
> I tested a solution using iSCSI for the cache devices. Each VM was using
> flashcache with a combination of an iSCSI LUN from an SSD and an RBD. This
gets
> around the problem of moving things around or if the hypervisor goes down.
> It's not local caching but the write latency is at least 10x lower than
the RBD.
> Note I tested it, I didn't put it into production :-)
> 
> >
> > My understanding of how a writeback cache should work is that it
> > should only take a few seconds for writes to be streamed onto the
> > network and is focussed on resolving the speed issue of small sync
> > writes. The writes
> would
> > be bundled into larger writes that are not time sensitive.
> >
> > So there is potential for a few seconds data loss but compared to the
> current
> > trend of using ephemeral storage to solve this issue, it's a major
> > improvement.
> 
> Yeah, problem is a couple of seconds of data loss means different things to
> different people.
> 
> >
> > > (considering the time required for setting up and maintaining the
> > > extra
> > caching layer on each vm, unless you work for free ;-)
> >
> > Couldn't agree more there.
> >
> > I am just so surprised how the openstack community haven't looked to
> > resolve this issue. Ephemeral storage is a HUGE compromise unless you
> > have built in failure into every aspect of your application but many
> > people use openstack as a general purpose devstack.
> >
> > (Jason pointed out his blueprint but I guess it's at least a year or 2
> away -
> > http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-
> > consistent_write-back_caching_extension)
> >
> > I see articles discussing the idea such as this one
> >
> > http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering-
> > scalable-cache/
> >
> > but no real straightforward  validated setup instructions.
> >
> > Thanks
> >
> > Daniel
> >
> >
> > -Original Message-
> > From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> > Sent: 16 March 2016 08:11
> > To: Jason Dillaman ; Daniel Niasoff
> > 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
> >
> > >Indeed, well understood.
> > >
> > >As a shorter term workaround, if you have control over the VMs, you
> > >could
> > always just slice out an LVM volume from local SSD/NVMe and pass it
> > through to the guest.  Within the guest, use dm-cache (or similar) to
> > add
> a
> > cache front-end to your RBD volume.
> >
> > If you do this you need to setup your cache as read-cache only.
> > Caching writes would be bad because a hypervisor failure would result
> > in
> loss
> > of the cache which pretty much guarantees inconsistent data on the
> > ceph volume.
> > Also live-migration will become problematic compared to running
> > everything from ceph since you will also need to migrate the
local-storage.
> >
> > The question will be if adding more ram (== more read cache) would not
> > be more convenient and cheaper in the end.
> > (considering the time required for setting up and maintaining 

Re: [ceph-users] rgw bucket deletion woes

2016-03-19 Thread Yehuda Sadeh-Weinraub
On Tue, Mar 15, 2016 at 11:36 PM, Pavan Rallabhandi
 wrote:
> Hi,
>
> I find this to be discussed here before, but couldn¹t find any solution
> hence the mail. In RGW, for a bucket holding objects in the range of ~
> millions, one can find it to take for ever to delete the bucket(via
> radosgw-admin). I understand the gc(and its parameters) that would reclaim
> the space eventually, but am looking more at the bucket deletion options
> that can possibly speed up the operation.
>
> I realize, currently rgw_remove_bucket(), does it 1000 objects at a time,
> serially. Wanted to know if there is a reason(that am possibly missing and
> discussed) for this to be left that way, otherwise I was considering a
> patch to make it happen better.
>

There is no real reason. You might want to have a version of that
command that doesn't schedule the removal to gc, but rather removes
all the object parts by itself. Otherwise, you're just going to flood
the gc. You'll need to iterate through all the objects, and for each
object you'll need to remove all of its rados objects (starting with
the tail, then the head). Removal of each rados object can be done
asynchronously, but you'll need to throttle the operations, not send
everything to the osds at once (which will be impossible, as the
objecter will throttle the requests anyway, which will lead to a high
memory consumption).
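
A rough shell-level illustration of that idea (not the patch under discussion;
the pool name and bucket marker are assumptions, and real code would remove
each object's tail parts before its head):

  # look up the bucket marker first: radosgw-admin bucket stats --bucket=<name>
  marker=default.12345.6          # hypothetical marker
  rados -p .rgw.buckets ls | grep "^${marker}_" \
      | xargs -d '\n' -n 1 -P 8 rados -p .rgw.buckets rm
  # -P is the client-side throttle: how many deletions run in parallel against
  # the OSDs instead of everything being queued at once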

Thanks,
Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CfP 11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16)

2016-03-19 Thread VHPC 16
CfP 11th Workshop on Virtualization in High-Performance Cloud
Computing (VHPC '16)


CALL FOR PAPERS


11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16)
held in conjunction with the International Supercomputing Conference -
High Performance,
June 19-23, 2016, Frankfurt, Germany.



Date: June 23, 2016
Workshop URL: http://vhpc.org

Paper Submission Deadline: April 25, 2016


Call for Papers

Virtualization technologies constitute a key enabling factor for
flexible resource
management in modern data centers, and particularly in cloud environments.
Cloud providers need to manage complex infrastructures in a seamless
fashion to support
the highly dynamic and heterogeneous workloads and hosted applications customers
deploy. Similarly, HPC environments have been increasingly adopting
techniques that
enable flexible management of vast computing and networking resources,
close to marginal
provisioning cost, which is unprecedented in the history of scientific
and commercial
computing.

Various virtualization technologies contribute to the overall picture
in different ways: machine
virtualization, with its capability to enable consolidation of
multiple underutilized servers with
heterogeneous software and operating systems (OSes), and its
capability to live-migrate a
fully operating virtual machine (VM) with a very short downtime,
enables novel and dynamic
ways to manage physical servers; OS-level virtualization (i.e.,
containerization), with its
capability to isolate multiple user-space environments and to allow
for their coexistence
within the same OS kernel, promises to provide many of the advantages of machine
virtualization with high levels of responsiveness and performance; I/O
Virtualization allows
physical NICs/HBAs to take traffic from multiple VMs or containers;
network virtualization,
with its capability to create logical network overlays that are
independent of the underlying
physical topology and IP addressing, provides the fundamental ground
on top of which
evolved network services can be realized with an unprecedented level
of dynamicity and
flexibility; the increasingly adopted paradigm of Software-Defined
Networking (SDN)
promises to extend this flexibility to the control and data planes of
network paths.


Topics of Interest

The VHPC program committee solicits original, high-quality submissions
related to
virtualization across the entire software stack with a special focus
on the intersection of HPC
and the cloud. Topics include, but are not limited to:

- Virtualization in supercomputing environments, HPC clusters, cloud
HPC and grids
- OS-level virtualization including container runtimes (Docker, rkt et al.)
- Lightweight compute node operating systems/VMMs
- Optimizations of virtual machine monitor platforms, hypervisors
- QoS and SLA in hypervisors and network virtualization
- Cloud based network and system management for SDN and NFV
- Management, deployment and monitoring of virtualized environments
- Virtual per job / on-demand clusters and cloud bursting
- Performance measurement, modelling and monitoring of
virtualized/cloud workloads
- Programming models for virtualized environments
- Virtualization in data intensive computing and Big Data processing
- Cloud reliability, fault-tolerance, high-availability and security
- Heterogeneous virtualized environments, virtualized accelerators,
GPUs and co-processors
- Optimized communication libraries/protocols in the cloud and for HPC
in the cloud
- Topology management and optimization for distributed virtualized applications
- Adaptation of emerging HPC technologies (high performance networks,
RDMA, etc..)
- I/O and storage virtualization, virtualization aware file systems
- Job scheduling/control/policy in virtualized environments
- Checkpointing and migration of VM-based large compute jobs
- Cloud frameworks and APIs
- Energy-efficient / power-aware virtualization


The Workshop on Virtualization in High-Performance Cloud Computing
(VHPC) aims to
bring together researchers and industrial practitioners facing the challenges
posed by virtualization in order to foster discussion, collaboration,
mutual exchange
of knowledge and experience, enabling research to ultimately provide novel
solutions for virtualized computing systems of tomorrow.

The workshop will be one day in length, composed of 20 min paper
presentations, each
followed by 10 min discussion sections, plus lightning talks that are
limited to 5 minutes.
Presentations may be accompanied by interactive demonstrations.

Important Dates

April 25, 2016 - Paper submission deadline
May 30, 2016 Acceptance notification
June 23, 2016 - Workshop Day
July 25, 2016 - Camera-ready version due


Chair

Michael Alexander (chair), TU Wien, Austria
Anastassios Nanos (co-chair), NTUA, Greece
Balazs Gerofi (co-chair), RIKEN Advan

Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Lindsay Mathieson

On 20/03/2016 3:38 AM, Heath Albritton wrote:
Ceph protects against bitrot at a much higher level by validating the 
checksum of the entire placement group during a deep scrub.





Ceph has checksums? I didn't think it did.

It's my understanding that it just compares blocks between replicas 
and marks the pg invalid when it finds a mismatch, unlike btrfs/zfs 
which auto-repair the block if the mirror has a valid checksum.


--
Lindsay Mathieson

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Lionel Bouton
Le 19/03/2016 18:38, Heath Albritton a écrit :
> If you google "ceph bluestore" you'll be able to find a couple slide
> decks on the topic.  One of them by Sage is easy to follow without the
> benefit of the presentation.  There's also the " Redhat Ceph Storage
> Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot.  Given that
> ceph is a distributed file system, many of the posix file system
> features are not required for the underlying block storage device.
>  Bluestore is intended to address this and reduce the disk IO required
> to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the
> checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is provided by
the underlying filesystem which means that you don't get any if you use
XFS or EXT4.

I can't trust Ceph on this alone until its bitrot protection (if any) is
clearly documented. The situation is far from clear right now. The
documentations states that deep scrubs are using checksums to validate
data, but this is not good enough at least because we don't known what
these checksums are supposed to cover (see below for another reason).
There is even this howto by Sebastien Han about repairing a PG :
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
which clearly concludes that with only 2 replicas you can't reliably
find out which object is corrupted with Ceph alone. If Ceph really
stored checksums to verify all the objects it stores we could manually
check which replica is valid.

Even if deep scrubs would use checksums to verify data this would not be
enough to protect against bitrot: there is a window between a corruption
event and a deep scrub where the data on a primary can be returned to a
client. BTRFS solves this problem by returning an IO error for any data
read that doesn't match its checksum (or automatically rebuilds it if
the allocation group is using RAID1/10/5/6). I've never seen this kind
of behavior documented for Ceph.
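
For what it's worth, that BTRFS-level protection can be exercised directly on a
btrfs-backed OSD (the mount point is a placeholder):

  btrfs scrub start -B /var/lib/ceph/osd/ceph-12   # -B waits for the scrub to finish
  btrfs scrub status /var/lib/ceph/osd/ceph-12     # reports any checksum errors found
  # a normal read that hits a bad checksum fails with EIO and logs a csum error
  # in dmesg, which is the behaviour described above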

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs] About feature 'snapshot'

2016-03-19 Thread 施柏安
Hi John,
Thank you very much for your help, and sorry for asking such a basic
question about the setting...
So isn't this feature ready in Jewel? I found some info saying that these
features (snapshot, quota, ...) become stable in Jewel

Thank you

2016-03-18 21:07 GMT+09:00 John Spray :

> On Fri, Mar 18, 2016 at 1:33 AM, 施柏安  wrote:
> > Hi John,
> > How to set this feature on?
>
> ceph mds set allow_new_snaps true --yes-i-really-mean-it
>
> John
>
> > Thank you
> >
> > 2016-03-17 21:41 GMT+08:00 Gregory Farnum :
> >>
> >> On Thu, Mar 17, 2016 at 3:49 AM, John Spray  wrote:
> >> > Snapshots are disabled by default:
> >> >
> >> >
> http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration
> >>
> >> Which makes me wonder if we ought to be hiding the .snaps directory
> >> entirely in that case. I haven't previously thought about that, but it
> >> *is* a bit weird.
> >> -Greg
> >>
> >> >
> >> > John
> >> >
> >> > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安 
> wrote:
> >> >> Hi all,
> >> >> I've run into trouble with cephfs snapshots. It seems that the folder
> >> >> '.snap' exists, but 'll -a' doesn't show it. And when I enter that folder
> >> >> and create a folder in it, it reports an error about using snapshots.
> >> >>
> >> >> Please check : http://imgur.com/elZhQvD
> >> >>
> >> >>
> >> >> ___
> >> >> ceph-users mailing list
> >> >> ceph-users@lists.ceph.com
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs] About feature 'snapshot'

2016-03-19 Thread John Spray
On Fri, Mar 18, 2016 at 1:33 AM, 施柏安  wrote:
> Hi John,
> How to set this feature on?

ceph mds set allow_new_snaps true --yes-i-really-mean-it

John

> Thank you
>
> 2016-03-17 21:41 GMT+08:00 Gregory Farnum :
>>
>> On Thu, Mar 17, 2016 at 3:49 AM, John Spray  wrote:
>> > Snapshots are disabled by default:
>> >
>> > http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration
>>
>> Which makes me wonder if we ought to be hiding the .snaps directory
>> entirely in that case. I haven't previously thought about that, but it
>> *is* a bit weird.
>> -Greg
>>
>> >
>> > John
>> >
>> > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安  wrote:
>> >> Hi all,
>> >> I've run into trouble with cephfs snapshots. It seems that the folder
>> >> '.snap' exists, but 'll -a' doesn't show it. And when I enter that folder
>> >> and create a folder in it, it reports an error about using snapshots.
>> >>
>> >> Please check : http://imgur.com/elZhQvD
>> >>
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs] About feature 'snapshot'

2016-03-19 Thread Gregory Farnum
On Thu, Mar 17, 2016 at 3:49 AM, John Spray  wrote:
> Snapshots are disabled by default:
> http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration

Which makes me wonder if we ought to be hiding the .snaps directory
entirely in that case. I haven't previously thought about that, but it
*is* a bit weird.
-Greg

>
> John
>
> On Thu, Mar 17, 2016 at 10:02 AM, 施柏安  wrote:
>> Hi all,
>> I've run into trouble with cephfs snapshots. It seems that the folder '.snap'
>> exists, but 'll -a' doesn't show it. And when I enter that folder and create
>> a folder in it, it reports an error about using snapshots.
>>
>> Please check : http://imgur.com/elZhQvD
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Nmz
Yes, I'm missing protection from Ceph too. 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007680.html

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
There is a directory structure hash, it's just that it's at the end of
the name and you'll have to check the xattr I mentioned to find it.

I think that file is actually the one we are talking about removing.

./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
user.cephos.lfn3:
default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0

Notice that the user.cephos.lfn3 attr has the full name, and it
*does* have a hash 79CED459 (you referred to it as a directory hash I
think, but it's actually the hash we used to place it on this osd to
begin with).

In specifically this case, you shouldn't find any files in the
DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories
(so all hash values should hash to one of those).

The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's
the actual object file, don't remove that.  If you look at the attr:

./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
user.cephos.lfn3:
default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0

The hash is 79CED459, which means that (assuming
DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the
right place.

The ENOENT return

2016-03-07 16:11:41.828332 7ff30cdad700 10
filestore(/var/lib/ceph/osd/ceph-307) remove
70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
= -2
2016-03-07 21:44:02.197676 7fe96b56f700 10
filestore(/var/lib/ceph/osd/ceph-307) remove
70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
= -2

actually was a symptom in this case, but, in general, it's not
indicative of anything -- the filestore can get ENOENT return values
for legitimate reasons.

To reiterate: files that end in something like
fa202ec9b4b3b217275a_0_long are *not* necessarily orphans -- you need
to check the user.cephos.lfn3 attr (as you did before) for the full
length file name and determine whether the file is in the right place.
-Sam
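
A small sketch of that check (the file name is only an example of the *_long
form, and getfattr comes from the attr package):

  f='./DIR_9/DIR_5/DIR_4/DIR_D/<something>_fa202ec9b4b3b217275a_0_long'
  name=$(getfattr --only-values -n user.cephos.lfn3 "$f")
  hash=$(printf '%s\n' "$name" | sed -n 's/.*__head_\([0-9A-F]\{8\}\)__.*/\1/p')
  echo "$f -> placement hash $hash"
  # as explained above, the filestore nests directories by the hash digits read
  # right to left (79CED459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/...), and the
  # object belongs in the deepest of those directories that actually exists; a
  # *_long file sitting higher up the tree than that is the orphan case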

On Wed, Mar 16, 2016 at 7:49 AM, Jeffrey McDonald  wrote:
> Hi Sam,
>
> In the 70.459 logs from the deep-scrub, there is an error:
>
>  $ zgrep "= \-2$" ceph-osd.307.log.1.gz
> 2016-03-07 16:11:41.828332 7ff30cdad700 10
> filestore(/var/lib/ceph/osd/ceph-307) remove
> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
> = -2
> 2016-03-07 21:44:02.197676 7fe96b56f700 10
> filestore(/var/lib/ceph/osd/ceph-307) remove
> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
> = -2
>
> I'm taking this as an indication of the error you mentioned. It looks to
> me as if this bug leaves two files with "issues" based upon what I see on
> the filesystem.
>
> First, I have a size-0 file in a directory where I expect only to have
> directories:
>
> root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D#
> ls -ltr
> total 320
> -rw-r--r-- 1 root root 0 Jan 23 21:49
> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
> drwxr-xr-x 2 root root 16384 Feb  5 15:13 DIR_6
> drwxr-xr-x 2 root root 16384 Feb  5 17:26 DIR_3
> drwxr-xr-x 2 root root 16384 Feb 10 00:01 DIR_C
> drwxr-xr-x 2 root root 16384 Mar  4 10:50 DIR_7
> drwxr-xr-x 2 root root 16384 Mar  4 16:46 D

[ceph-users] Upgrade from .94 to 10.0.5

2016-03-19 Thread RDS
Is there documentation on all the steps showing how to upgrade from .94 to 
10.0.5?
Thanks

Rick 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Nick Fisk
There's got to be something else going on here. All that PR does is to 
potentially delay the promotion to hit_set_period*recency instead of just doing 
it on the 2nd read regardless, it's got to be uncovering another bug.

Do you see the same problem if the cache is in writeback mode before you start 
the unpacking. Ie is it the switching mid operation which causes the problem? 
If it only happens mid operation, does it still occur if you pause IO when you 
make the switch?

Do you also see this if you perform on a RBD mount, to rule out any librbd/qemu 
weirdness?

Do you know if it’s the actual data that is getting corrupted or if it's the FS 
metadata? I'm only wondering as unpacking should really only be writing to each 
object a couple of times, whereas FS metadata could potentially be being 
updated+read back lots of times for the same group of objects and ordering is 
very important.

Thinking through it logically the only difference is that with recency=1 the 
object will be copied up to the cache tier, whereas with recency=6 it will be proxy 
read for a long time. If I had to guess I would say the issue would lie 
somewhere in the proxy read + writeback<->forward logic.
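
For reference, the knobs being flipped in these tests look like this ("hot-ssd"
is a placeholder pool name):

  ceph osd pool set hot-ssd min_read_recency_for_promote 2   # the recency setting discussed
  ceph osd pool get hot-ssd hit_set_period                   # promotion delay is period * recency
  ceph osd tier cache-mode hot-ssd forward                   # the mid-test switch
  ceph osd tier cache-mode hot-ssd writeback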



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mike Lovell
> Sent: 16 March 2016 23:23
> To: ceph-users ; sw...@redhat.com
> Cc: Robert LeBlanc ; William Perkins
> 
> Subject: Re: [ceph-users] data corruption with hammer
> 
> just got done with a test against a build of 0.94.6 minus the two commits that
> were backported in PR 7207. everything worked as it should with the cache-
> mode set to writeback and the min_read_recency_for_promote set to 2.
> assuming it works properly on master, there must be a commit that we're
> missing on the backport to support this properly.
> 
> sage,
> i'm adding you to the recipients on this so hopefully you see it. the tl;dr
> version is that the backport of the cache recency fix to hammer doesn't work
> right and potentially corrupts data when
> the min_read_recency_for_promote is set to greater than 1.
> 
> mike
> 
> On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
>  wrote:
> robert and i have done some further investigation the past couple days on
> this. we have a test environment with a hard drive tier and an ssd tier as a
> cache. several vms were created with volumes from the ceph cluster. i did a
> test in each guest where i un-tarred the linux kernel source multiple times
> and then did a md5sum check against all of the files in the resulting source
> tree. i started off with the monitors and osds running 0.94.5 and never saw
> any problems.
> 
> a single node was then upgraded to 0.94.6 which has osds in both the ssd and
> hard drive tier. i then proceeded to run the same test and, while the untar
> and md5sum operations were running, i changed the ssd tier cache-mode
> from forward to writeback. almost immediately the vms started reporting io
> errors and odd data corruption. the remainder of the cluster was updated to
> 0.94.6, including the monitors, and the same thing happened.
> 
> things were cleaned up and reset and then a test was run
> where min_read_recency_for_promote for the ssd cache pool was set to 1.
> we previously had it set to 6. there was never an error with the recency
> setting set to 1. i then tested with it set to 2 and it immediately caused
> failures. we are currently thinking that it is related to the backport of the 
> fix
> for the recency promotion and are in progress of making a .6 build without
> that backport to see if we can cause corruption. is anyone using a version
> from after the original recency fix (PR 6702) with a cache tier in writeback
> mode? anyone have a similar problem?
> 
> mike
> 
> On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell
>  wrote:
> something weird happened on one of the ceph clusters that i administer
> tonight which resulted in virtual machines using rbd volumes seeing
> corruption in multiple forms.
> 
> when everything was fine earlier in the day, the cluster was a number of
> storage nodes spread across 3 different roots in the crush map. the first
> bunch of storage nodes have both hard drives and ssds in them with the hard
> drives in one root and the ssds in another. there is a pool for each and the
> pool for the ssds is a cache tier for the hard drives. the last set of storage
> nodes were in a separate root with their own pool that is being used for burn
> in testing.
> 
> these nodes had run for a while with test traffic and we decided to move
> them to the main root and pools. the main cluster is running 0.94.5 and the
> new nodes got 0.94.6 due to them getting configured after that was
> released. i removed the test pool and did a ceph osd crush move to move
> the first node into the main cluster, the hard drives into the root for that 
> tier
> of storage and the ssds into the root and pool for the cache tier. each set 
> was
> done about 45 minutes apart and they r

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Oh, it's getting a stat mismatch.  I think what happened is that on
one of the earlier repairs it reset the stats to the wrong value (the
orphan was causing the primary to scan two objects twice, which
matches the stat mismatch I see here).  A pg repair will clear
that up.
-Sam
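
In concrete terms, with the pg id from this thread:

  ceph pg repair 70.459
  # then re-scrub and confirm the inconsistent flag clears
  ceph pg deep-scrub 70.459
  ceph health detail | grep 70.459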

On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald  wrote:
> Thanks Sam.
>
> Since I have prepared a script for this, I decided to go ahead with the
> checks. (patience isn't one of my extended attributes)
>
> I've got a file that searches the full erasure encoded spaces and does your
> checklist below.   I have operated only on one PG so far, the 70.459 one
> that we've been discussing. There was only the one file that I found to
> be out of place--the one we already discussed/found and it has been removed.
>
> The pg is still marked as inconsistent.   I've scrubbed it a couple of times
> now and what I've seen is:
>
> 2016-03-17 09:29:53.202818 7f2e816f8700  0 log_channel(cluster) log [INF] :
> 70.459 deep-scrub starts
> 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
> 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
> 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459 deep-scrub 1 errors
> 2016-03-17 09:44:23.592302 7f2e816f8700  0 log_channel(cluster) log [INF] :
> 70.459 deep-scrub starts
> 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
> 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
> 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459 deep-scrub 1 errors
>
>
> Should the scrub be sufficient to remove the inconsistent flag?   I took the
> osd offline during the repairs. I've looked at files in all of the osds
> in the placement group and I'm not finding any more problem files. The
> vast majority of files do not have the user.cephos.lfn3 attribute. There
> are 22321 objects that I've seen and only about 230 have the user.cephos.lfn3
> file attribute.   The files will have other attributes, just not
> user.cephos.lfn3.
>
> Regards,
> Jeff
>
>
> On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just  wrote:
>>
>> Ok, like I said, most files with _long at the end are *not orphaned*.
>> The generation number also is *not* an indication of whether the file
>> is orphaned -- some of the orphaned files will have 
>> as the generation number and others won't.  For each long filename
>> object in a pg you would have to:
>> 1) Pull the long name out of the attr
>> 2) Parse the hash out of the long name
>> 3) Turn that into a directory path
>> 4) Determine whether the file is at the right place in the path
>> 5) If not, remove it (or echo it to be checked)
>>
>> You probably want to wait for someone to get around to writing a
>> branch for ceph-objectstore-tool.  Should happen in the next week or
>> two.
>> -Sam
>>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Yep, let me pull and build that branch. I tried installing the dbg
packages and running it in gdb, but it didn't load the symbols.
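
Roughly, pulling and building that branch on a hammer-era tree looks like this
(a sketch; build prerequisites and parallelism depend on the host):

  git clone https://github.com/ceph/ceph.git && cd ceph
  git fetch origin pull/8187/head:wip-test-8187
  git checkout wip-test-8187
  git submodule update --init --recursive
  ./autogen.sh && ./configure && make -j8   # autotools build used by the hammer series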

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil  wrote:
> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>> Also, is this ceph_test_rados rewriting objects quickly? I think that
>> the issue is with rewriting objects so if we can tailor the
>> ceph_test_rados to do that, it might be easier to reproduce.
>
> It's doing lots of overwrites, yeah.
>
> I was able to reproduce--thanks!  It looks like it's specific to
> hammer.  The code was rewritten for jewel so it doesn't affect the
> latest.  The problem is that maybe_handle_cache may proxy the read and
> also still try to handle the same request locally (if it doesn't trigger a
> promote).
>
> Here's my proposed fix:
>
> https://github.com/ceph/ceph/pull/8187
>
> Do you mind testing this branch?
>
> It doesn't appear to be directly related to flipping between writeback and
> forward, although it may be that we are seeing two unrelated issues.  I
> seemed to be able to trigger it more easily when I flipped modes, but the
> bug itself was a simple issue in the writeback mode logic.  :/
>
> Anyway, please see if this fixes it for you (esp with the RBD workload).
>
> Thanks!
> sage
>
>
>
>
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  
>> wrote:
>> > I'll miss the Ceph community as well. There were a few things I really
>> > wanted to work on with Ceph.
>> >
>> > I got this:
>> >
>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
>> > dirty exists
>> > 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
>> > 1040:  finishing write tid 1 to nodez23350-256
>> > 1040:  finishing write tid 2 to nodez23350-256
>> > 1040:  finishing write tid 3 to nodez23350-256
>> > 1040:  finishing write tid 4 to nodez23350-256
>> > 1040:  finishing write tid 6 to nodez23350-256
>> > 1035: done (4 left)
>> > 1037: done (3 left)
>> > 1038: done (2 left)
>> > 1043: read oid 430 snap -1
>> > 1043:  expect (ObjNum 429 snap 0 seq_num 429)
>> > 1040:  finishing write tid 7 to nodez23350-256
>> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
>> > dirty exists
>> > 1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
>> > 1042:  expect (ObjNum 664 snap 0 seq_num 664)
>> > 1043: Error: oid 430 read returned error code -2
>> > ./test/osd/RadosModel.h: In function 'virtual void
>> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
>> > 2016-03-17 10:47:19.085414
>> > ./test/osd/RadosModel.h: 1109: FAILED assert(0)
>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > const*)+0x76) [0x4db956]
>> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
>> > 3: (()+0x9791d) [0x7fa1d472191d]
>> > 4: (()+0x72519) [0x7fa1d46fc519]
>> > 5: (()+0x13c178) [0x7fa1d47c6178]
>> > 6: (()+0x80a4) [0x7fa1d425a0a4]
>> > 7: (clone()+0x6d) [0x7fa1d2bd504d]
>> > NOTE: a copy of the executable, or `objdump -rdS ` is
>> > needed to interpret this.
>> > terminate called after throwing an instance of 'ceph::FailedAssertion'
>> > Aborted
>> >
>> > I had to toggle writeback/forward and min_read_recency_for_promote a
>> > few times to get it, but I don't know if it is because I only have one
>> > job running. Even with six jobs running, it is not easy to trigger
>> > with ceph_test_rados, but it is very instant in the RBD VMs.
>> >
>> > Here are the six run crashes (I have about the last 2000 lines of each
>> > if needed):
>> >
>> > nodev:
>> > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num
>> > 1014) dirty exists
>> > 1015:  left oid 1015 (ObjNum 1014 snap 0 seq_num 1014)
>> > 1016:  finishing write tid 1 to nodev21799-1016
>> > 1016:  finishing write tid 2 to nodev21799-1016
>> > 1016:  finishing write tid 3 to nodev21799-1016
>> > 1016:  finishing write tid 4 to nodev21799-1016
>> > 1016:  finishing write tid 6 to nodev21799-1016
>> > 1016:  finishing write tid 7 to nodev21799-1016
>> > update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num
>> > 1015) dirty exists
>> > 1016:  left oid 1016 (ObjNum 1015 snap 0 seq_num 1015)
>> > 1017:  finishing write tid 1 to nodev21799-1017
>> > 1017:  finishing write tid 2 to nodev21799-1017
>> > 1017:  finishing write tid 3 to nodev21799-1017
>> > 1017:  finishing write tid 5 to nodev21799-1017
>> > 1017:  finishing write tid 6 to nodev21799-1017
>> > update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num
>> > 1016) dirty exists
>> > 1017:  left oid 1017 (ObjNum 1016 snap 0 seq_num 1016)
>> > 1018:  finishing write tid 1 to nodev21799-1018
>> > 1018:  finishing write tid 2 to nodev21799-1018
>> > 1018:  finishing write tid 3 to nodev217

Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Christian Balzer

Hello,

On Sun, 20 Mar 2016 00:45:47 +0100 Lionel Bouton wrote:

> Le 19/03/2016 18:38, Heath Albritton a écrit :
> > If you google "ceph bluestore" you'll be able to find a couple slide
> > decks on the topic.  One of them by Sage is easy to follow without the
> > benefit of the presentation.  There's also the " Redhat Ceph Storage
> > Roadmap 2016" deck.
> >
> > In any case, bluestore is not intended to address bitrot.  Given that
> > ceph is a distributed file system, many of the posix file system
> > features are not required for the underlying block storage device.
> >  Bluestore is intended to address this and reduce the disk IO required
> > to store user data.
> >
> > Ceph protects against bitrot at a much higher level by validating the
> > checksum of the entire placement group during a deep scrub.
> 
That's not protection, that's an "uh-oh, something is wrong, you better
check it out" notification, after which you get to spend a lot of time
figuring out which is the good replica and as Lionel wrote in the case of
just 2 replicas and faced with binary data you might as well roll a dice.

Completely unacceptable and my oldest pet peeve about Ceph.

I'd be deeply disappointed if bluestore would go ahead ignoring that
elephant in the room as well.

> My impression is that the only protection against bitrot is provided by
> the underlying filesystem which means that you don't get any if you use
> XFS or EXT4.
> 
Indeed.

> I can't trust Ceph on this alone until its bitrot protection (if any) is
> clearly documented. The situation is far from clear right now. The
> documentation states that deep scrubs are using checksums to validate
> data, but this is not good enough at least because we don't know what
> these checksums are supposed to cover (see below for another reason).
> There is even this howto by Sebastien Han about repairing a PG :
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> which clearly concludes that with only 2 replicas you can't reliably
> find out which object is corrupted with Ceph alone. If Ceph really
> stored checksums to verify all the objects it stores we could manually
> check which replica is valid.
> 
AFAIK it uses checksums created on the fly to compare the data during
deep-scrubs.
I also recall talks about having permanent checksums stored, but no idea
what the status is.
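
For anyone who does end up having to do that comparison by hand, the howto
Lionel linked boils down to roughly the following. This is a sketch only:
the PG id, OSD numbers and object name are made up, and a filestore layout
is assumed.

ceph health detail | grep inconsistent      # find the inconsistent PG
ceph pg map 17.1c1                          # which OSDs hold it
# on each of those OSD hosts, locate the object's file and checksum it
find /var/lib/ceph/osd/ceph-21/current/17.1c1_head/ -name '*rb.0.1234*' \
    -exec md5sum {} \;
# with only two replicas and no stored checksum a mismatch only tells you
# the copies differ, not which one is right; in hammer "ceph pg repair"
# simply pushes the primary's copy over the others
ceph pg repair 17.1c1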

> Even if deep scrubs would use checksums to verify data this would not be
> enough to protect against bitrot: there is a window between a corruption
> event and a deep scrub where the data on a primary can be returned to a
> client. BTRFS solves this problem by returning an IO error for any data
> read that doesn't match its checksum (or automatically rebuilds it if
> the allocation group is using RAID1/10/5/6). I've never seen this kind
> of behavior documented for Ceph.
> 
Ditto.
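
For OSDs that do sit on BTRFS you can at least watch that checking happen
from the filesystem side. Illustrative only, assuming the OSD data dir is
its own BTRFS mount at /var/lib/ceph/osd/ceph-0:

# verify every block against its checksum (-B: stay in the foreground,
# -d: per-device stats); with a redundant profile it also repairs
btrfs scrub start -B -d /var/lib/ceph/osd/ceph-0
# running counters of checksum and IO errors seen on each device
btrfs device stats /var/lib/ceph/osd/ceph-0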

And if/when Ceph has reliable checksumming (in the storage layer) it
should definitely get auto-repair abilities as well.


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Sage Weil
On Thu, 17 Mar 2016, Robert LeBlanc wrote:
> We are trying to figure out how to use rados bench to reproduce. Ceph
> itself doesn't seem to think there is any corruption, but when you do a
> verify inside the RBD, there is. Can rados bench verify the objects after
> they are written? It also seems to be primarily the filesystem metadata
> that is corrupted. If we fsck the volume, there is missing data (put into
> lost+found), but if it is there it is primarily OK. There only seems to be
> a few cases where a file's contents are corrupted. I would suspect on an
> object boundary. We would have to look at blockinfo to map that out and see
> if that is what is happening.

'rados bench' doesn't do validation.  ceph_test_rados does, though--if you 
can reproduce with that workload then it should be pretty easy to track 
down.

Thanks!
sage

 
> We stopped all the IO and did put the tier in writeback mode with recency
> 1,  set the recency to 2 and started the test and there was corruption, so
> it doesn't seem to be limited to changing the mode. I don't know how that
> patch could cause the issue either. Unless there is a bug that reads from
> the back tier, but writes to cache tier, then the object gets promoted
> wiping that last write, but then it seems like it should not be as much
> corruption since the metadata should be in the cache pretty quick. We
> usually evicted the cache before each try so we should not be evicting on
> writeback.
> 
> Sent from a mobile device, please excuse any typos.
> On Mar 17, 2016 6:26 AM, "Sage Weil"  wrote:
> 
> > On Thu, 17 Mar 2016, Nick Fisk wrote:
> > > There has got to be something else going on here. All that PR does is to
> > > potentially delay the promotion to hit_set_period*recency instead of
> > > just doing it on the 2nd read regardless, it's got to be uncovering
> > > another bug.
> > >
> > > Do you see the same problem if the cache is in writeback mode before you
> > > start the unpacking. Ie is it the switching mid operation which causes
> > > the problem? If it only happens mid operation, does it still occur if
> > > you pause IO when you make the switch?
> > >
> > > Do you also see this if you perform on a RBD mount, to rule out any
> > > librbd/qemu weirdness?
> > >
> > > Do you know if it’s the actual data that is getting corrupted or if it's
> > > the FS metadata? I'm only wondering as unpacking should really only be
> > > writing to each object a couple of times, whereas FS metadata could
> > > potentially be being updated+read back lots of times for the same group
> > > of objects and ordering is very important.
> > >
> > > Thinking through it logically the only difference is that with recency=1
> > > the object will be copied up to the cache tier, where recency=6 it will
> > > be proxy read for a long time. If I had to guess I would say the issue
> > > would lie somewhere in the proxy read + writeback<->forward logic.
> >
> > That seems reasonable.  Was switching from writeback -> forward always
> > part of the sequence that resulted in corruption?  Note that there is a
> > known ordering issue when switching to forward mode.  I wouldn't really
> > expect it to bite real users but it's possible..
> >
> > http://tracker.ceph.com/issues/12814
> >
> > I've opened a ticket to track this:
> >
> > http://tracker.ceph.com/issues/15171
> >
> > What would be *really* great is if you could reproduce this with a
> > ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados
> > running, and then find the sequence of operations that are sufficient to
> > trigger a failure.
> >
> > sage
> >
> >
> >
> >  >
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of
> > > > Mike Lovell
> > > > Sent: 16 March 2016 23:23
> > > > To: ceph-users ; sw...@redhat.com
> > > > Cc: Robert LeBlanc ; William Perkins
> > > > 
> > > > Subject: Re: [ceph-users] data corruption with hammer
> > > >
> > > > just got done with a test against a build of 0.94.6 minus the two commits
> > > > that were backported in PR 7207. everything worked as it should with the
> > > > cache-mode set to writeback and the min_read_recency_for_promote set to 2.
> > > > assuming it works properly on master, there must be a commit that we're
> > > > missing on the backport to support this properly.
> > > >
> > > > sage,
> > > > i'm adding you to the recipients on this so hopefully you see it. the
> > > > tl;dr version is that the backport of the cache recency fix to hammer
> > > > doesn't work right and potentially corrupts data when the
> > > > min_read_recency_for_promote is set to greater than 1.
> > > >
> > > > mike
> > > >
> > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > > >  wrote:
> > > > robert and i have done some further investigation the past couple days on
> > > > this. we have a test environment with a hard drive tier and an ssd tier as
> > > > a cache. several vms
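
For reference, the knobs being flipped in these tests are all pool/tier
settings. A minimal sketch, with a hypothetical cache pool "hot" in front
of a base pool "cold" (hammer syntax):

ceph osd tier add cold hot
ceph osd tier cache-mode hot writeback
ceph osd tier set-overlay cold hot
ceph osd pool set hot hit_set_type bloom
ceph osd pool set hot hit_set_count 4
ceph osd pool set hot hit_set_period 1200
# the setting at the heart of this thread
ceph osd pool set hot min_read_recency_for_promote 2
# switching writeback <-> forward while clients are doing IO is one of the
# sequences under suspicion above
ceph osd tier cache-mode hot forward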

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Ok, like I said, most files with _long at the end are *not orphaned*.
The generation number also is *not* an indication of whether the file
is orphaned -- some of the orphaned files will have 
as the generation number and others won't.  For each long filename
object in a pg you would have to (a rough sketch follows this list):
1) Pull the long name out of the attr
2) Parse the hash out of the long name
3) Turn that into a directory path
4) Determine whether the file is at the right place in the path
5) If not, remove it (or echo it to be checked)
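
Roughly, and only as an illustration (echo-only; this assumes the hash is
the 8-hex-digit "__head_XXXXXXXX" field stored in the lfn3 attr and that
the directory fan-out uses the hash nibbles in reverse order, so verify
both against your own tree before acting on the output):

cd /var/lib/ceph/osd/ceph-NN/current/70.1s0_head   # hypothetical PG dir
find . -name '*_long' | while read -r f; do
  long=$(xattr -p user.cephos.lfn3 "$f")                              # 1)
  hash=$(echo "$long" | sed -n 's/.*__head_\([0-9A-F]\{8\}\).*/\1/p') # 2)
  want="."                                                            # 3)
  for n in $(echo "$hash" | rev | fold -w1); do   # nibbles, last digit first,
    [ -d "$want/DIR_$n" ] || break                # descend while dirs exist
    want="$want/DIR_$n"
  done
  # 4) + 5) report files that are not in the deepest existing directory
  [ "$(dirname "$f")" = "$want" ] || echo "possible orphan: $f"
done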

You probably want to wait for someone to get around to writing a
branch for ceph-objectstore-tool.  Should happen in the next week or
two.
-Sam

On Wed, Mar 16, 2016 at 1:36 PM, Jeffrey McDonald  wrote:
> Hi Sam,
>
> I've written a script but I'm a little leery of unleashing it until I find a
> few more cases to test.   The script successfully removed the file mentioned
> above.
> I took the next pg which was marked inconsistent and ran the following
> command over those pg directory structures:
>
> find . -name "*_long" -exec xattr -p user.cephos.lfn3 {} +  | grep -v
> 
>
> I didn't find any files that were "orphaned" by this command.   All of these
> files should have "_long" and the grep should pull out the invalid
> generation, correct?
>
> I'm looking wider but in the next pg marked inconsistent I didn't find any
> orphans.
>
> Thanks,
> Jeff
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455    fax:   +1 612 624-8861
>
>


Re: [ceph-users] rgw bucket deletion woes

2016-03-19 Thread Ben Hines
We would be a big user of this. We delete large buckets often and it takes
forever.

Though didn't I read that 'object expiration' support is on the near-term
RGW roadmap? That may do what we want: we're creating thousands of objects
a day, and thousands of objects a day will be expiring, so RGW will need to
handle that anyway.


-Ben

On Wed, Mar 16, 2016 at 9:40 AM, Yehuda Sadeh-Weinraub 
wrote:

> On Tue, Mar 15, 2016 at 11:36 PM, Pavan Rallabhandi
>  wrote:
> > Hi,
> >
> > I find this to be discussed here before, but couldn't find any solution,
> > hence the mail. In RGW, for a bucket holding objects in the range of ~
> > millions, one can find it takes forever to delete the bucket (via
> > radosgw-admin). I understand the gc (and its parameters) would reclaim
> > the space eventually, but am looking more at the bucket deletion options
> > that can possibly speed up the operation.
> >
> > I realize that currently rgw_remove_bucket() does it 1000 objects at a
> > time, serially. I wanted to know if there is a reason (that I am possibly
> > missing and was discussed) for this to be left that way; otherwise I was
> > considering a patch to make it better.
> >
>
> There is no real reason. You might want to have a version of that
> command that doesn't schedule the removal to gc, but rather removes
> all the object parts by itself. Otherwise, you're just going to flood
> the gc. You'll need to iterate through all the objects, and for each
> object you'll need to remove all of its rados objects (starting with
> the tail, then the head). Removal of each rados object can be done
> asynchronously, but you'll need to throttle the operations, not send
> everything to the osds at once (which will be impossible, as the
> objecter will throttle the requests anyway, which will lead to a high
> memory consumption).
>
> Thanks,
> Yehuda
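
A very rough sketch of what that looks like from the command line, for
anyone who can't wait for a patched radosgw-admin. Bucket name and pool are
made up (default hammer-era pool layout assumed), and it ignores the
head-vs-tail ordering and multipart corner cases, so treat it as an
illustration rather than a procedure:

marker=$(radosgw-admin bucket stats --bucket=big-bucket \
         | grep -Po '"marker": "\K[^"]+')
# every rados object belonging to the bucket is prefixed with its marker;
# xargs -P provides the throttling mentioned above
rados -p .rgw.buckets ls | grep "^${marker}_" \
    | xargs -P 16 -n 1 rados -p .rgw.buckets rm
# finally drop the now-empty bucket and its index
radosgw-admin bucket rm --bucket=big-bucket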