Re: [ceph-users] Question about PGMonitor::waiting_for_finished_proposal

2017-06-01 Thread Joao Eduardo Luis

On 06/01/2017 05:35 AM, 许雪寒 wrote:

Hi, everyone.

Recently, I’m reading the source code of Monitor. I found that, in

PGMonitor::prepare_pg_stats() method, a callback C_Stats is put into
PGMonitor::waiting_for_finished_proposal. I wonder, if a previous PGMap
incremental is in PAXOS's propose/accept phase at the moment C_Stats
is put into PGMonitor::waiting_for_finished_proposal, would this C_Stats
be called when that PGMap incremental's PAXOS procedure is complete and
PaxosService::_active() is invoked? If so, there exists the possibility
that an MPGStats request gets responded to before going through the PAXOS
procedure.


Is this right? Thank you:-)


Much like the other PaxosServices, the PGMonitor will only handle 
requests with potential side-effects (i.e., updates) if the service is 
writeable.


A precondition on being writeable is not having a PGMonitor proposal 
currently in progress. Other proposals, from other PaxosServices, may be 
happening, but not from PGMonitor.


When your request reaches PGMonitor::prepare_pg_stats(), it is 
guaranteed (except in case of unexpected behavior) that the service is 
not currently undergoing a proposal.


This means that when we queue C_Stats waiting for a finished proposal, 
it will be called back upon once the next proposal finishes.


We may bundle other update requests to PGMonitor (much like what happens 
on other PaxosServices) into the same proposal, in which case all the 
callbacks that were waiting for a finished proposal will be woken up 
once the proposal is finished.


So, to answer your question, no.

  -Joao

P.S.: If you are curious to know how the writeable decision is made, 
check out PaxosService::dispatch().




[ceph-users] RGW: Truncated objects and bad error handling

2017-06-01 Thread Jens Rosenboom
On a large Hammer-based cluster (> 1 Gobjects) we are seeing a small
amount of objects being truncated. All of these objects are between
512kB and 4MB in size and they are not uploaded as multipart, so the
first 512kB get stored into the head object and the next chunks should
be in tail objects named __shadow__N, but the latter
seem to go missing sometimes. The PUT operation for these objects is
logged as successful (HTTP code 200), so I currently have two
hypotheses as to what might be happening:

1. The object is received by the radosgw process, the head object is
written successfully, then the write for the tail object somehow
fails. So the question is whether this is possible or whether radosgw
will always wait until all operations have completed successfully
before returning the 200. This blog [1] at least mentions some
asynchronous operations.

2. The full object is written correctly, but the tail objects are
getting deleted somehow afterwards. This might happen during garbage
collection if there was a collision between the tail object names for
two objects, but again I'm not sure whether this is possible.

So the question is whether anyone else has seen this issue, also
whether it may possibly be fixed in Jewel or later.

The second issue is what happens when a client tries to access such a
truncated object. The radosgw first answers with the full headers and
a content-length of e.g. 60, then sends the first chunk of data
(524288 bytes) from the head object. After that it tries to read the
first tail object, but receives an error -2 (file not found). radosgw
now tries to send a 404 status with a NoSuchKey error in XML body, but
of course this is too late, the client sees this as part of the
object data. After that, the connection stays open, the client waits
for the rest of the object to be sent and times out with an error in
the end. Or, if the original object was just slightly larger than
512k, the client will append the 404 header at that point and continue
with corrupted data, hopefully checking the MD5 sum and noticing the
issue. This behaviour is still unchanged at least in Jewel and you can
easily reproduce it by manually deleting the shadow object from the
bucket pool after you have uploaded an object of the proper size.
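
(For reference, a rough sketch of how to reproduce the second issue as
described above -- the pool name is the Hammer default and the object name
is only a placeholder, adjust both for your setup:)

# upload an object between 512kB and 4MB with your usual S3 client, then
# find its tail object in the bucket data pool
rados -p .rgw.buckets ls | grep shadow

# delete the tail object to simulate the truncation
rados -p .rgw.buckets rm <marker>_<object>__shadow__1

# a GET through radosgw now returns the full Content-Length, the first
# 512kB from the head object, and then the stray 404/NoSuchKey XML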

I have created a bug report with the first issue[2], please let me
know whether you would like a different ticket for the second one.

[1] 
http://www.ksingh.co.in/blog/2017/01/15/ceph-object-storage-performance-improvement-using-indexless-buckets/
[2] http://tracker.ceph.com/issues/20107


[ceph-users] PG Stuck EC Pool

2017-06-01 Thread Ashley Merrick
Have a PG which is stuck in this state (Is an EC with K=10 M=3)



pg 6.14 is active+undersized+degraded+remapped+inconsistent+backfilling, acting 
[2147483647,2147483647,84,83,22,26,69,72,53,59,8,4,46]


Currently have no-recover set; if I unset no-recover, both OSD 83 + 84 start to 
flap and go up and down, and I see the following in the logs of the OSDs.


*

-5> 2017-06-01 10:08:29.658593 7f430ec97700  1 -- 172.16.3.14:6806/5204 <== 
osd.17 172.16.3.3:6806/2006016 57  MOSDECSubOpWriteReply(6.31as0 71513 
ECSubWriteReply(tid=152, last_complete=0'0, committed=0, applied=1)) v1  
67+0+0 (245959818 0 0) 0x563c9db7be00 con 0x563c9cfca480
-4> 2017-06-01 10:08:29.658620 7f430ec97700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658620, event: queued_for_pg, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
-3> 2017-06-01 10:08:29.658649 7f4319e11700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658649, event: reached_pg, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
-2> 2017-06-01 10:08:29.658661 7f4319e11700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658660, event: done, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
-1> 2017-06-01 10:08:29.663107 7f43320ec700  5 -- op tracker -- seq: 2317, 
time: 2017-06-01 10:08:29.663107, event: sub_op_applied, op: 
osd_op(osd.79.66617:8675008 6.82058b1a rbd_data.e5208a238e1f29.00025f3e 
[copy-from ver 4678410] snapc 0=[] 
ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e71513)
 0> 2017-06-01 10:08:29.663474 7f4319610700 -1 *** Caught signal (Aborted) 
**
 in thread 7f4319610700 thread_name:tp_osd_recov

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9564a7) [0x563c6a6f24a7]
 2: (()+0xf890) [0x7f4342308890]
 3: (gsignal()+0x37) [0x7f434034f067]
 4: (abort()+0x148) [0x7f4340350448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x256) [0x563c6a7f83d6]
 6: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x62f) 
[0x563c6a2850ff]
 7: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xa8a) 
[0x563c6a2b878a]
 8: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x36d) [0x563c6a131bbd]
 9: (ThreadPool::WorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x1d) [0x563c6a17c88d]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa9f) [0x563c6a7e8e3f]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x563c6a7e9d70]
 12: (()+0x8064) [0x7f4342301064]
 13: (clone()+0x6d) [0x7f434040262d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.
*



What should my next steps be?


Thanks!


Re: [ceph-users] rbd map fails, ceph release jewel

2017-06-01 Thread Shambhu Rajak
Thanks David, I upgraded the kernel version and the rbd map worked.
Regards,
Shambhu

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Wednesday, May 31, 2017 9:35 PM
To: Shambhu Rajak; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd map fails, ceph release jewel

You are trying to use the kernel client to map the RBD in Jewel.  Jewel RBDs 
have features enabled by default that require you to run kernel 4.9 or newer.  
You can disable the features that require the newer kernel, but that's not 
ideal as those new features are very nice to have.  You can use rbd-fuse to 
mount them instead, which is up to date for your Ceph version.  I would 
probably go the rbd-fuse route in your position, unless upgrading your kernel 
to 4.9 is an option.
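
(For reference, a sketch of the feature-disabling route, using the pool/image
names from the original post; "rbd default features = 3" keeps only
layering+striping on newly created images:)

# see which features the image currently has
rbd info pool1/pool1-img1

# drop the features the 3.13 kernel client cannot handle (layering stays)
rbd feature disable pool1/pool1-img1 deep-flatten fast-diff object-map exclusive-lock

# and/or make new images kernel-friendly by default, in ceph.conf on the client:
#   [client]
#   rbd default features = 3

sudo rbd map pool1-img1 -p pool1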

On Wed, May 31, 2017 at 9:36 AM Shambhu Rajak 
<sra...@sandvine.com> wrote:
Hi Cepher,
I have created a pool and am trying to create an rbd image on the ceph client; 
while mapping the rbd image it fails as:

ubuntu@shambhucephnode0:~$ sudo rbd map pool1-img1 -p pool1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (5) Input/output error


so I checked the dmesg as suggested:

ubuntu@shambhucephnode0:~$ dmesg | tail
[788743.741818] libceph: mon2 10.186.210.243:6789 feature set mismatch, my 4a042a42 < server's 2004a042a42, missing 200
[788743.746352] libceph: mon2 10.186.210.243:6789 socket error on read
[788753.757934] libceph: mon2 10.186.210.243:6789 feature set mismatch, my 4a042a42 < server's 2004a042a42, missing 200
[788753.777578] libceph: mon2 10.186.210.243:6789 socket error on read
[788763.773857] libceph: mon0 10.186.210.241:6789 feature set mismatch, my 4a042a42 < server's 2004a042a42, missing 200
[788763.780539] libceph: mon0 10.186.210.241:6789 socket error on read
[788773.790371] libceph: mon1 10.186.210.242:6789 feature set mismatch, my 4a042a42 < server's 2004a042a42, missing 200
[788773.811208] libceph: mon1 10.186.210.242:6789 socket error on read
[788783.805987] libceph: mon1 10.186.210.242:6789 feature set mismatch, my 4a042a42 < server's 2004a042a42, missing 200
[788783.826907] libceph: mon1 10.186.210.242:6789 socket error on read

I am not sure what is going wrong here, my cluster health is HEALTH_OK though.



My configuration details:
Ceph version: ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
OSD: 12 on 3 storage nodes
Monitor : 3 running on the 3 osd nodes

OS:
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 14.04.5 LTS
Release:14.04
Codename:   trusty

Ceph Client Kernel Version:
Linux version 3.13.0-95-generic (buildd@lgw01-58) (gcc version 4.8.4 (Ubuntu 
4.8.4-2ubuntu1~14.04.3) )

KRBD:
ubuntu@shambhucephnode0:~$ /sbin/modinfo rbd
filename:       /lib/modules/3.13.0-95-generic/kernel/drivers/block/rbd.ko
license:        GPL
author:         Jeff Garzik <j...@garzik.org>
description:    rados block device
author:         Yehuda Sadeh <yeh...@hq.newdream.net>
author:         Sage Weil <s...@newdream.net>
author:         Alex Elder <el...@inktank.com>
srcversion:     48BFBD5C3D31D799F01D218
depends:        libceph
intree:         Y
vermagic:       3.13.0-95-generic SMP mod_unload modversions
signer:         Magrathea: Glacier signing key
sig_key:        51:D5:D7:73:F1:07:BA:1B:C0:9D:33:68:38:C4:3C:DE:74:9E:4E:05
sig_hashalgo:   sha512

Thanks,
Shambhu Rajak


[ceph-users] Read errors on OSD

2017-06-01 Thread Oliver Humpage

Hello,

We have a small cluster of 44 OSDs across 4 servers.

A few times a week, ceph health reports a pg is inconsistent. Looking at the 
relevant OSD’s logs, it always says "head candidate had a read error”. No other 
info, i.e. it’s not that the digest is wrong, it just has an I/O error. It’s 
usually a different OSD each time, so it’s not a specific 
disk/controller/server.

Manually running a deep scrub on the pg succeeds, and ceph health goes back to 
normal.
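
(Not an answer, but for reference these are the commands usually used to
inspect and clear such an inconsistency -- <pgid> is a placeholder, and
list-inconsistent-obj needs jewel or newer:)

ceph health detail | grep inconsistent                    # which pg is inconsistent
rados list-inconsistent-obj <pgid> --format=json-pretty   # which object/shard, and why
ceph pg deep-scrub <pgid>                                 # re-scrub the pg
ceph pg repair <pgid>                                     # rewrite the bad copy from a good replica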

As a test today, before scrubbing the pg I found the relevant file in 
/var/lib/ceph/osd/… and cat(1)ed it. The first time I ran cat(1) on it I got an 
Input/output error. The second time I did it, however, it worked fine.

These read errors are all on Samsung 850 Pro 2TB disks (journals are on 
separate enterprise SSDs). The SMART status on all of them are similar and show 
nothing out of the ordinary.

Has anyone else experienced anything similar? Is this just a curse of 
non-enterprise SSDs, or do you think there might be something else going on, 
e.g. could it be an XFS issue? Any suggestions as to what to look at would be 
welcome.

Many thanks,

Oliver.



Re: [ceph-users] Read errors on OSD

2017-06-01 Thread Matthew Vernon

Hi,

On 01/06/17 10:38, Oliver Humpage wrote:


These read errors are all on Samsung 850 Pro 2TB disks (journals are
on separate enterprise SSDs). The SMART status on all of them are
similar and show nothing out of the ordinary.

Has anyone else experienced anything similar? Is this just a curse of
non-enterprise SSDs, or do you think there might be something else
going on, e.g. could it be an XFS issue? Any suggestions as to what
to look at would be welcome.


You don't say what's in kern.log - we've had (rotating) disks that were 
throwing read errors but still saying they were OK on SMART.


Regards,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 


Re: [ceph-users] Read errors on OSD

2017-06-01 Thread Oliver Humpage

> On 1 Jun 2017, at 11:55, Matthew Vernon  wrote:
> 
> You don't say what's in kern.log - we've had (rotating) disks that were 
> throwing read errors but still saying they were OK on SMART.

Fair point. There was nothing correlating to the time that ceph logged an error 
this morning, which is why I didn’t mention it, but looking harder I see 
yesterday there was a

May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Sense Key : Hardware Error [current]
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Add. Sense: Internal target failure
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 CDB: Read(10) 28 00 77 51 42 d8 00 02 00 00
May 31 07:20:13 osd1 kernel: blk_update_request: critical target error, dev sdi, sector 2001814232

sdi was the disk with the OSD affected today. Guess it’s flakey SSDs then. 

Weird that just re-reading the file makes everything OK though - wondering how 
much it’s worth worrying about that, or if there’s a way of making ceph retry 
reads automatically?
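
(One quick way to see whether the reported sector is still unreadable at the
block layer, using the device and sector from the log above -- O_DIRECT
bypasses the page cache so you actually hit the disk:)

dd if=/dev/sdi of=/dev/null bs=512 skip=2001814232 count=512 iflag=direct
smartctl -x /dev/sdi    # what the drive itself has logged since then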

Oliver.



[ceph-users] tools to display information from ceph report

2017-06-01 Thread Loic Dachary
Hi,

Is there a tool that displays information (such as the total bytes in each 
pool) using the content of the "ceph report" json ?
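
(As an aside, per-pool byte counts can also be pulled straight out of
"ceph df -f json" with jq -- a sketch, with field names as of hammer/jewel,
so verify them on your version:)

ceph df -f json | jq -r '.pools[] | "\(.name)\t\(.stats.bytes_used)\t\(.stats.objects)"'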

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


Re: [ceph-users] http://planet.eph.com/ is down

2017-06-01 Thread Patrick McGarry
Hey Loic,

I have updated planet.ceph.com with a 301 that redirects to the
/category/planet, so I think it should be set now. Thanks.


On Sun, May 28, 2017 at 1:29 AM, Loic Dachary  wrote:
> The URL is http://ceph.com/category/planet/ and works like a charm :-) There 
> is a blog at http://eph.com/ but it's more about Bible than Squids.
>
> On 05/28/2017 08:27 AM, Loic Dachary wrote:
>> Hi Patrick,
>>
>> http://planet.eph.com/ is down and shows a white page containing "pageok" 
>> (amusing ;-). I kind of remember reading messages about troubles regarding 
>> planet.ceph.com but forgot the specifics. Is this a permanent situation ?
>>
>> Cheers
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Read errors on OSD

2017-06-01 Thread Steve Taylor
I've seen similar issues in the past with 4U Supermicro servers populated with 
spinning disks. In my case it turned out to be a specific firmware+BIOS 
combination on the disk controller card that was buggy. I fixed it by updating 
the firmware and BIOS on the card to the latest versions.

I saw this on several servers, and it took a while to track down as you can 
imagine. Same symptoms you're reporting.

There was a data corruption problem a while back with the Linux kernel and 
Samsung 850 Pro drives, but your problem doesn't sound like data corruption. 
Still, I'd check to make sure the kernel version you're running has the fix.






Steve Taylor | Senior Software Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




On Thu, 2017-06-01 at 13:40 +0100, Oliver Humpage wrote:


On 1 Jun 2017, at 11:55, Matthew Vernon <m...@sanger.ac.uk> wrote:

You don't say what's in kern.log - we've had (rotating) disks that were 
throwing read errors but still saying they were OK on SMART.



Fair point. There was nothing correlating to the time that ceph logged an error 
this morning, which is why I didn’t mention it, but looking harder I see 
yesterday there was a

May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Sense Key : Hardware Error [current]
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Add. Sense: Internal target failure
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 CDB: Read(10) 28 00 77 51 42 d8 00 02 00 00
May 31 07:20:13 osd1 kernel: blk_update_request: critical target error, dev sdi, sector 2001814232

sdi was the disk with the OSD affected today. Guess it’s flakey SSDs then.

Weird that just re-reading the file makes everything OK though - wondering how 
much it’s worth worrying about that, or if there’s a way of making ceph retry 
reads automatically?

Oliver.



[ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-01 Thread koukou73gr

Hello list,

Today I had to create a new image for a VM. This was the first time
since our cluster was updated from Hammer to Jewel. So far I had just been
copying an existing golden image and resizing it as appropriate. But this
time I used rbd create.

So I "rbd create"d a 2T image and attached it to an existing VM guest
with librbd, using a libvirt <disk> definition (the XML itself was stripped
by the list archive).

Booted the guest and tried to partition the new drive from inside the
guest. That's it, parted (and anything else for that matter) that tried
to access the new disk would freeze. After 2 minutes the kernel would
start complaining:

[  360.212391] INFO: task parted:1836 blocked for more than 120 seconds.
[  360.216001]   Not tainted 4.4.0-78-generic #99-Ubuntu
[  360.218663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

After much headbanging, trial and error, I finaly thought of checking
the enabled rbd features of an existing image versus the new one.

pre-existing: layering, striping
new: layering, exclusive-lock, object-map, fast-diff, deep-flatten

Disabling exclusive-lock (and fast-diff and object-map before that)
would allow the new image to become usable in the guest at last.
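
(For reference, a sketch of how to avoid this for future images -- either
disable the features after creation as described above, or create images
with only the old feature set; pool/image names below are placeholders:)

# create the image with only the features older librbd/krbd clients understand
rbd create mypool/newimage --size 2048G --image-feature layering

# or set the default once in ceph.conf on the clients that create images:
#   [client]
#   rbd default features = 3      # layering + striping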

This is with:

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
qemu-img version 2.6.0 (qemu-kvm-ev-2.6.0-28.el7_3.3.1), Copyright (c)
2004-2008 Fabrice Bellard

on a host running:
CentOS Linux release 7.3.1611 (Core)
Linux host-10-206-123-184.physics.auth.gr 3.10.0-327.36.2.el7.x86_64 #1
SMP Mon Oct 10 23:08:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

and a guest
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
Linux srv-10-206-123-87.physics.auth.gr 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I vaguely remember references to problems when exclusive-lock was
enabled on rbd images, but trying Google didn't reveal much to me.

So what is it with exclusive lock? Why does it fail like this? Could you
please point me to some documentation on this behaviour?

Thanks for any feedback.

-K.



[ceph-users] Editing Ceph source code and debugging

2017-06-01 Thread Oleg Kolosov
Hi
I'm interested in writing an original erasure code in Ceph for my research
purposes. I was wondering if there is any tool or method supporting quick
compilation and debugging.
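
(Not authoritative, but a rough sketch of the usual developer loop on a
current, cmake-based checkout -- the build target and plugin/profile names
below are placeholders:)

git clone https://github.com/ceph/ceph.git && cd ceph
./do_cmake.sh                                  # configures a debug build in ./build
cd build
make -j$(nproc) ceph-osd                       # rebuild only what you changed
MON=1 OSD=3 MDS=0 ../src/vstart.sh -d -n -x    # throwaway local test cluster
./bin/ceph osd erasure-code-profile set myprofile plugin=jerasure k=2 m=1   # swap in your plugin
../src/stop.sh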

Thanks


Re: [ceph-users] Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-01 Thread Jake Grimmett
Hi Mark,

Firstly, many thanks for looking into this.

Jayaram appears to have a similar config to me;

v12.0.3, EC 4+1 bluestore
SciLin,7.3 - 3.10.0-514.21.1.el7.x86_64

I have 5 EC nodes (10 x 8TB ironwolf each) plus 2 nodes with replicated
NVMe (Cephfs hot tier)

I now think the Highpoint r750 rocket cards are not at fault; I swapped
the r750 on one node for LSI cards, but still had OSD errors occurring
on this node.

The OSD logs recovered are huge, typically exceeding 500MB. I've trimmed
one down to just over 500K and pasted it here...

http://pasted.co/f0c49591

Hopefully this has some useful info.


One other thing, probably a red herring, is that when trying to run
"ceph pg repair" or "ceph pg deep-scrub" I get this...
"Error EACCES: access denied"

thanks again for your help,

Jake


Re: [ceph-users] PG Stuck EC Pool

2017-06-01 Thread Ashley Merrick
Have attached the full pg query for the affected PG in case it shows anything 
of interest.

Thanks

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ashley 
Merrick
Sent: 01 June 2017 17:19
To: ceph-us...@ceph.com
Subject: [ceph-users] PG Stuck EC Pool




Have a PG which is stuck in this state (Is an EC with K=10 M=3)





pg 6.14 is active+undersized+degraded+remapped+inconsistent+backfilling, acting 
[2147483647,2147483647,84,83,22,26,69,72,53,59,8,4,46]



Currently have no-recover set, if I unset no recover both OSD 83 + 84 start to 
flap and go up and down, I see the following in the log's of the OSD.



*
-5> 2017-06-01 10:08:29.658593 7f430ec97700  1 -- 172.16.3.14:6806/5204 <== 
osd.17 172.16.3.3:6806/2006016 57  MOSDECSubOpWriteReply(6.31as0 71513 
ECSubWriteReply(tid=152, last_complete=0'0, committed=0, applied=1)) v1  
67+0+0 (245959818 0 0) 0x563c9db7be00 con 0x563c9cfca480
-4> 2017-06-01 10:08:29.658620 7f430ec97700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658620, event: queued_for_pg, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
-3> 2017-06-01 10:08:29.658649 7f4319e11700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658649, event: reached_pg, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
-2> 2017-06-01 10:08:29.658661 7f4319e11700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658660, event: done, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
-1> 2017-06-01 10:08:29.663107 7f43320ec700  5 -- op tracker -- seq: 2317, 
time: 2017-06-01 10:08:29.663107, event: sub_op_applied, op: 
osd_op(osd.79.66617:8675008 6.82058b1a rbd_data.e5208a238e1f29.00025f3e 
[copy-from ver 4678410] snapc 0=[] 
ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e71513)
 0> 2017-06-01 10:08:29.663474 7f4319610700 -1 *** Caught signal (Aborted) 
**
 in thread 7f4319610700 thread_name:tp_osd_recov

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9564a7) [0x563c6a6f24a7]
 2: (()+0xf890) [0x7f4342308890]
 3: (gsignal()+0x37) [0x7f434034f067]
 4: (abort()+0x148) [0x7f4340350448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x256) [0x563c6a7f83d6]
 6: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x62f) 
[0x563c6a2850ff]
 7: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xa8a) 
[0x563c6a2b878a]
 8: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x36d) [0x563c6a131bbd]
 9: (ThreadPool::WorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x1d) [0x563c6a17c88d]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa9f) [0x563c6a7e8e3f]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x563c6a7e9d70]
 12: (()+0x8064) [0x7f4342301064]
 13: (clone()+0x6d) [0x7f434040262d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.
*




What should my next steps be?



Thanks!
{
"state": "active+undersized+degraded+remapped+inconsistent+backfilling",
"snap_trimq": "[]",
"epoch": 71527,
"up": [
11,
10,
84,
83,
22,
26,
69,
72,
53,
59,
8,
4,
46
],
"acting": [
2147483647,
2147483647,
84,
83,
22,
26,
69,
72,
53,
59,
8,
4,
46
],
"backfill_targets": [
"10(1)",
"11(0)"
],
"actingbackfill": [
"4(11)",
"8(10)",
"10(1)",
"11(0)",
"22(4)",
"26(5)",
"46(12)",
"53(8)",
"59(9)",
"69(6)",
"72(7)",
"83(3)",
"84(2)"
],
"info": {
"pgid": "6.14s2",
"last_update": "71527'34088",
"last_complete": "71527'34088",
"log_tail": "68694'30812",
"last_user_version": 3736168,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": "[1~3]",
"history": {
"epoch_created": 31534,
"last_epoch_started": 71525,
"last_epoch_clean": 69510,
"last_epoch_split": 0,
"last_epoch_marked_full": 67943,
"same_up_since": 71521,
"same_interval_since": 71524,
"same_primary_since": 71524,
"last_scrub": "69507'33607",
"last_scrub_stamp": "2017-05-30 03:19:34.992284",
"last_deep_scrub": "69507'33607",
"last_deep_scrub_stamp": "2017-05-30 03:19:34.992284",
"last_clean_scrub_st

[ceph-users] Crushmap from Rack aware to Node aware

2017-06-01 Thread Deepak Naidu
Greetings Folks.

Wanted to understand how ceph behaves when we start with a rack-aware crush 
map (rack-level replicas, e.g. 3 racks and 3 replicas) that is later replaced 
by a node-aware one (node-level replicas, i.e. 3 replicas spread across nodes).

This can also happen in reverse. If it does, how does ceph rearrange the "old" 
data? Do I need to trigger any command to ensure the data placement follows 
the latest crush map, or does ceph take care of it automatically?

Thanks for your time.

--
Deepak

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [ceph-users] Editing Ceph source code and debugging

2017-06-01 Thread David Turner
I'm pretty sure this is a question for ceph-devel.

On Thu, Jun 1, 2017 at 11:22 AM Oleg Kolosov  wrote:

> Hi
> I'm interested in writing an original erasure code in Ceph for my research
> purposes. I was wondering if there is any tool or method supporting quick
> compilation and debugging.
>
> Thanks


Re: [ceph-users] Read errors on OSD

2017-06-01 Thread Oliver Humpage

> On 1 Jun 2017, at 14:38, Steve Taylor  wrote:
> 
> I saw this on several servers, and it took a while to track down as you can 
> imagine. Same symptoms you're reporting.

Thanks, that’s very useful info. We’re using separate Adaptec controllers, but 
will double check firmware on them. Who knows, it may even be a read cache 
issue.

I think we’re OK with the kernel, running recent CentOS.

Cheers all,

Oliver.


Re: [ceph-users] RGW: Truncated objects and bad error handling

2017-06-01 Thread Gregory Farnum
On Thu, Jun 1, 2017 at 2:03 AM Jens Rosenboom  wrote:

> On a large Hammer-based cluster (> 1 Gobjects) we are seeing a small
> amount of objects being truncated. All of these objects are between
> 512kB and 4MB in size and they are not uploaded as multipart, so the
> first 512kB get stored into the head object and the next chunks should
> be in tail objects named __shadow__N, but the latter
> seem to go missing sometimes. The PUT operation for these objects is
> logged as successful (HTTP code 200), so I'm currently having two
> hypotheses as to what might be happening:
>
> 1. The object is received by the radosgw process, the head object is
> written successfully, then the write for the tail object somehow
> fails. So the question is whether this is possible or whether radosgw
> will always wait until all operations have completed successfully
> before returning the 200. This blog [1] at least mentions some
> asynchronous operations.
>
> 2. The full object is written correctly, but the tail objects are
> getting deleted somehow afterwards. This might happen during garbage
> collection if there was a collision between the tail object names for
> two objects, but again I'm not sure whether this is possible.
>
> So the question is whether anyone else has seen this issue, also
> whether it may possibly be fixed in Jewel or later.
>
> The second issue is what happens when a client tries to access such an
> truncated object. The radosgw first answers with the full headers and
> a content-length of e.g. 60, then sends the first chunk of data
> (524288 bytes) from the head object. After that it tries to read the
> first tail object, but receives an error -2 (file not found). radosgw
> now tries to send a 404 status with a NoSuchKey error in XML body, but
> of course this is too late, the clients sees this as part of the
> object data. After that, the connection stays open, the clients waits
> for the rest of the object to be sent and times out with an error in
> the end. Or, if the original object was just slightly larger than
> 512k, the client will append the 404 header at that point and continue
> with corrupted data, hopefully checking the MD5 sum and noticing the
> issue. This behaviour is still unchanged at least in Jewel and you can
> easily reproduce it by manually deleting the shadow object from the
> bucket pool after you have uploaded an object of the proper size.
>
> I have created a bug report with the first issue[2], please let me
> know whether you would like a different ticket for the second one.
>


No idea what's going on here but they definitely warrant separate issues.
The second one is about handling error states; the first is about inducing
them. :)


Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-01 Thread David Turner
The way to do this is to download your crush map, modify it manually after
decompiling it to text format or modify it using the crushtool.  Once you
have your crush map with the rules in place that you want, you will upload
the crush map to the cluster.  When you change your failure domain from
host to rack, or any other change to failure domain, it will cause all of
your PGs to peer at the same time.  You want to make sure that you have
enough memory to handle this scenario.  After that point, your cluster will
just backfill the PGs from where they currently are to their new location
and then clean up after itself.  It is recommended to monitor your cluster
usage and modify osd_max_backfills during this process to optimize how fast
you can finish your backfilling while keeping your cluster usable by the
clients.
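
(For reference, a sketch of that workflow on the command line -- the rule
number and the exact step line will differ in your map:)

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt, e.g. change the relevant rule's
#     step chooseleaf firstn 0 type host
# to
#     step chooseleaf firstn 0 type rack
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --show-statistics --rule 0 --num-rep 3   # sanity-check mappings first
ceph osd setcrushmap -i crush.new
# throttle the resulting backfill while it runs
ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'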

I generally recommend starting a cluster with at least n+2 failure domains
so would recommend against going to a rack failure domain with only 3
racks.  As an alternative that I've done, I've set up 6 "racks" when I only
have 3 racks with planned growth to a full 6 racks.  When I added servers
and expanded to fill more racks, I moved the servers to where they are
represented in the crush map.  So if it's physically in rack1 but it's set
as rack4 in the crush map, then I would move those servers to the physical
rack 4 and start filling out rack 1 and rack 4 to complete their capacity,
then do the same for rack 2/5 when I start into the 5th rack.

Another option to having full racks in your crush map is having half
racks.  I've also done this for clusters that wouldn't grow larger than 3
racks.  Have 6 failure domains at half racks.  It lowers your chance of
having random drives fail in different failure domains at the same time and
gives you more servers that you can run maintenance on at a time over using
a host failure domain.  It doesn't resolve the issue of using a single
cross-link for the entire rack or a full power failure of the rack, but
it's closer.

The problem with having 3 failure domains with replica 3 is that if you
lose a complete failure domain, then you have nowhere for the 3rd replica
to go.  If you have 4 failure domains with replica 3 and you lose an entire
failure domain, then you over fill the remaining 3 failure domains and can
only really use 55% of your cluster capacity.  If you have 5 failure
domains, then you start normalizing and losing a failure domain doesn't
impact as severely.  The more failure domains you get to, the less it
affects you when you lose one.

Let's do another scenario with 3 failure domains and replica size 3.  Every
OSD you lose inside of a failure domain gets backfilled directly onto the
remaining OSDs in that failure domain.  There reaches a point where a
switch failure in a rack or losing a node in the rack could over-fill the
remaining OSDs in that rack.  If you have enough servers and OSDs in the
rack, then this becomes moot but if you have a smaller cluster with
only 3 nodes and <4 drives in each... if you lose a drive in one of your
nodes, then all of its data gets distributed to the other 3 drives in that
node.  That means you either have to replace your storage ASAP when it
fails or never fill your cluster up more than 55% if you want to be able to
automatically recover from a drive failure.

tl;dr: Make sure you calculate what your failure domain, replica size,
drive size, etc means for how fast you have to replace storage when it
fails and how full you can fill your cluster to afford a hardware loss.

On Thu, Jun 1, 2017 at 12:40 PM Deepak Naidu  wrote:

> Greetings Folks.
>
>
>
> Wanted to understand how ceph works when we start with rack aware(rack
> level replica) example 3 racks and 3 replica in crushmap in future is
> replaced by node aware(node level replica) ie 3 replica spread across nodes.
>
>
>
> This can be vice-versa. If this happens. How does ceph rearrange the “old”
> data. Do I need to trigger any command to ensure the data placement is
> based on latest crushmap algorithm or ceph takes care of it automatically.
>
>
>
> Thanks for your time.
>
>
>
> --
>
> Deepak


Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-01 Thread Deepak Naidu
Perfect David for detailed explanation. Appreciate it!.

In my case I have 10 OSD servers with each 60 Disks(ya I know…) ie total 600 
OSD and I have 3 racks to spare.

--
Deepak


Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-01 Thread David Turner
If all 6 racks are tagged for Ceph storage nodes, I'd go ahead and just put
the nodes in there now and configure the crush map accordingly.  That way
you can grow each of the racks while keeping each failure domain closer in
size to the rest of the cluster.

On Thu, Jun 1, 2017 at 3:40 PM Deepak Naidu  wrote:

> Perfect David for detailed explanation. Appreciate it!.
>
>
>
> In my case I have 10 OSD servers with each 60 Disks(ya I know…) ie total
> 600 OSD and I have 3 racks to spare.
>
>
>
> --
>
> Deepak
>
>
>

Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-01 Thread Deepak Naidu
>> If all 6 racks are tagged for Ceph storage nodes, I'd go ahead and just put 
>> the nodes in there now and configure the crush map accordingly
I just have 3 racks. That’s the max I have for now. 10 OSD Nodes.

--
Deepak

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Thursday, June 01, 2017 2:05 PM
To: Deepak Naidu; ceph-users
Subject: Re: [ceph-users] Crushmap from Rack aware to Node aware

If all 6 racks are tagged for Ceph storage nodes, I'd go ahead and just put the 
nodes in there now and configure the crush map accordingly.  That way you can 
grow each of the racks while keeping each failure domain closer in size to the 
rest of the cluster.

On Thu, Jun 1, 2017 at 3:40 PM Deepak Naidu 
<dna...@nvidia.com> wrote:
Perfect David for detailed explanation. Appreciate it!.

In my case I have 10 OSD servers with each 60 Disks(ya I know…) ie total 600 
OSD and I have 3 racks to spare.

--
Deepak


Re: [ceph-users] Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-01 Thread Mark Nelson
Looking at this gdb output, it looks like all of the tp_osd_tp threads 
are idling around except for three that are all waiting on a PG lock.  I 
bet those sit there for 60s and eventually time out.  The kv_sync_thread 
looks idle so I don't think that's it.  Thread 16 is doing 
OSD::trim_maps but I don't really know what the locking semantics are 
there.  Maybe Sage or Josh can chime in.
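
(For anyone wanting to capture the same picture from a live OSD, a rough
sketch -- substitute the right pid/osd id, here osd.294 from the report
quoted below:)

# thread backtraces without taking a full core dump
gdb -p <ceph-osd pid> --batch -ex 'thread apply all bt' > gdb.txt
# and what the OSD itself thinks it is working on
ceph daemon osd.294 dump_ops_in_flight
ceph daemon osd.294 dump_historic_ops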


In any event, here are the tp_osd_tp threads waiting on a pg lock:

Thread 64 (Thread 0x7f5ed184e700 (LWP 3545048)):
#0  __lll_lock_wait () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

#1  0x7f5ef68f3d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x7f5ef68f3c08 in __GI___pthread_mutex_lock 
(mutex=0x7f5f50baf4e8) at pthread_mutex_lock.c:64
#3  0x7f5ef9c95e48 in Mutex::Lock (this=this@entry=0x7f5f50baf4d8, 
no_lockdep=no_lockdep@entry=false)

at /usr/src/debug/ceph-12.0.3/src/common/Mutex.cc:110
#4  0x7f5ef97f4d63 in PG::lock (this=0x7f5f50baf000, 
no_lockdep=no_lockdep@entry=false) at 
/usr/src/debug/ceph-12.0.3/src/osd/PG.cc:363
#5  0x7f5ef9796751 in OSD::ShardedOpWQ::_process 
(this=0x7f5f04437198, thread_index=, hb=0x7f5f05b27100)

at /usr/src/debug/ceph-12.0.3/src/osd/OSD.cc:9706
#6  0x7f5ef9cbb235 in ShardedThreadPool::shardedthreadpool_worker 
(this=0x7f5f04436958, thread_index=)

at /usr/src/debug/ceph-12.0.3/src/common/WorkQueue.cc:354
#7  0x7f5ef9cbd390 in ShardedThreadPool::WorkThreadSharded::entry 
(this=)

at /usr/src/debug/ceph-12.0.3/src/common/WorkQueue.h:685
#8  0x7f5ef68f1dc5 in start_thread (arg=0x7f5ed184e700) at 
pthread_create.c:308
#9  0x7f5ef57e173d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Thread 61 (Thread 0x7f5ed3051700 (LWP 3545041)):
#0  __lll_lock_wait () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

#1  0x7f5ef68f3d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x7f5ef68f3c08 in __GI___pthread_mutex_lock 
(mutex=0x7f5f07f844e8) at pthread_mutex_lock.c:64
#3  0x7f5ef9c95e48 in Mutex::Lock (this=this@entry=0x7f5f07f844d8, 
no_lockdep=no_lockdep@entry=false)

at /usr/src/debug/ceph-12.0.3/src/common/Mutex.cc:110
#4  0x7f5ef97f4d63 in PG::lock (this=0x7f5f07f84000, 
no_lockdep=no_lockdep@entry=false) at 
/usr/src/debug/ceph-12.0.3/src/osd/PG.cc:363
#5  0x7f5ef9796751 in OSD::ShardedOpWQ::_process 
(this=0x7f5f04437198, thread_index=, hb=0x7f5f05b26d80)

at /usr/src/debug/ceph-12.0.3/src/osd/OSD.cc:9706
#6  0x7f5ef9cbb235 in ShardedThreadPool::shardedthreadpool_worker 
(this=0x7f5f04436958, thread_index=)

at /usr/src/debug/ceph-12.0.3/src/common/WorkQueue.cc:354
#7  0x7f5ef9cbd390 in ShardedThreadPool::WorkThreadSharded::entry 
(this=)

at /usr/src/debug/ceph-12.0.3/src/common/WorkQueue.h:685
#8  0x7f5ef68f1dc5 in start_thread (arg=0x7f5ed3051700) at 
pthread_create.c:308
#9  0x7f5ef57e173d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Thread 57 (Thread 0x7f5ed5055700 (LWP 3545035)):
#0  __lll_lock_wait () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

#1  0x7f5ef68f3d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x7f5ef68f3c08 in __GI___pthread_mutex_lock 
(mutex=0x7f5f4f7674e8) at pthread_mutex_lock.c:64
#3  0x7f5ef9c95e48 in Mutex::Lock (this=this@entry=0x7f5f4f7674d8, 
no_lockdep=no_lockdep@entry=false)

at /usr/src/debug/ceph-12.0.3/src/common/Mutex.cc:110
#4  0x7f5ef97f4d63 in PG::lock (this=0x7f5f4f767000, 
no_lockdep=no_lockdep@entry=false) at 
/usr/src/debug/ceph-12.0.3/src/osd/PG.cc:363
#5  0x7f5ef9796751 in OSD::ShardedOpWQ::_process 
(this=0x7f5f04437198, thread_index=, hb=0x7f5f05b26900)

at /usr/src/debug/ceph-12.0.3/src/osd/OSD.cc:9706
#6  0x7f5ef9cbb235 in ShardedThreadPool::shardedthreadpool_worker 
(this=0x7f5f04436958, thread_index=)

at /usr/src/debug/ceph-12.0.3/src/common/WorkQueue.cc:354
#7  0x7f5ef9cbd390 in ShardedThreadPool::WorkThreadSharded::entry 
(this=)

at /usr/src/debug/ceph-12.0.3/src/common/WorkQueue.h:685
#8  0x7f5ef68f1dc5 in start_thread (arg=0x7f5ed5055700) at 
pthread_create.c:308
#9  0x7f5ef57e173d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Mark

On 06/01/2017 01:31 AM, nokia ceph wrote:


Hello Mark,

Please find  the attached gdb.txt file which having 'thread apply all
bt' result.

Thanks
Jayaram






On Wed, May 31, 2017 at 5:43 PM, Mark Nelson <mnel...@redhat.com> wrote:



On 05/31/2017 05:21 AM, nokia ceph wrote:

+ ceph-devel ..

$ps -ef | grep 294
ceph 3539720   1 14 08:04 ?00:16:35
/usr/bin/ceph-osd -f
--cluster ceph --id 294 --setuser ceph --setgroup ceph

$gcore -o coredump-osd  3539720


$(gdb) bt
#0  0x7f5ef68f56d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x7f5ef9cc45ab in ceph::logging::Log::entry() ()
#2  0x

[ceph-users] is there any way to speed up cache evicting?

2017-06-01 Thread jiajia zhong
hi guys:

Our ceph cluster is working with a cache tier.
I am running "rados -p data_cache cache-try-flush-evict-all" to evict all
the objects.
But it is a bit slow.

1. Is there any way to speed up the evicting?

2. Is evicting triggered by itself good enough for the cluster?

3. Does the flushing and evicting slow down the whole cluster?



thx


Re: [ceph-users] ceph packages on stretch from eu.ceph.com

2017-06-01 Thread Christian Balzer

Hello,

Sorry for the thread necromancy.

With Stretch deep-frozen and amazingly enough on schedule for release in 2
weeks (and me having to finish a new cluster deployment by July), I sure
hope that whoever is in charge of this has everything set up and just
needs to push a button for things to be build the day Stretch is released.

Regards,

Christian

On Tue, 25 Apr 2017 20:07:50 +0200 Ronny Aasen wrote:

> Hello
> 
> i am trying to install ceph on debian stretch from
> 
> http://eu.ceph.com/debian-jewel/dists/
> 
> but there is no stretch repo there.
> 
> now with stretch being frozen, it is a good time to be testing ceph on 
> stretch. is it possible to get packages for stretch on jewel, kraken, 
> and lumious ?
> 
> 
> 
> kind regards
> 
> Ronny Aasen
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] is there any way to speed up cache evicting?

2017-06-01 Thread Christian Balzer
On Fri, 2 Jun 2017 10:30:46 +0800 jiajia zhong wrote:

> hi guys:
> 
> Our ceph cluster is working with tier cache.
If so, then I suppose you read all the discussions here as well and not
only the somewhat lacking documentation?

> I am running "rados -p data_cache cache-try-flush-evict-all" to evict all
> the objects.
Why?
And why all of it?

> But It a bit slow
>
Define slow, but it has to do a LOT of work and housekeeping to do this,
so unless your cluster is very fast (probably not, or you wouldn't
want/need a cache tier) and idle, that's the way it is.
 
> 1. Is there any way to speed up the evicting?
>
Not really, see above.
 
> 2. Is evicting triggered by itself good enough for cluster ?
>
See above, WHY are you manually flushing/evicting?

Are you aware that flushing is the part that's very I/O intensive, while
evicting is a very low cost/impact operation?

In normal production, the various parameters that control this will do
fine, if properly configured of course.
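
(Concretely, these are the per-pool settings usually meant here -- the values
below are only examples, for a cache pool named data_cache:)

ceph osd pool set data_cache cache_target_dirty_ratio 0.4        # start background flushing at 40% dirty
ceph osd pool set data_cache cache_target_dirty_high_ratio 0.6   # flush more aggressively above 60% (jewel+)
ceph osd pool set data_cache cache_target_full_ratio 0.8         # start evicting clean objects at 80% full
ceph osd pool set data_cache cache_min_flush_age 600             # seconds before an object may be flushed
ceph osd pool set data_cache cache_min_evict_age 1800            # seconds before an object may be evicted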

> 3. Does the flushing and evicting slow down the whole cluster?
>
Of course, as any good sysadmin with the correct tools (atop, iostat,
etc, graphing Ceph performance values with Grafana/Graphite) will be able
to see instantly.


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] is there any way to speed up cache evicting?

2017-06-01 Thread jiajia zhong
christian, thanks for your reply.

2017-06-02 11:39 GMT+08:00 Christian Balzer :

> On Fri, 2 Jun 2017 10:30:46 +0800 jiajia zhong wrote:
>
> > hi guys:
> >
> > Our ceph cluster is working with tier cache.
> If so, then I suppose you read all the discussions here as well and not
> only the somewhat lacking documentation?
>
> > I am running "rados -p data_cache cache-try-flush-evict-all" to evict all
> > the objects.
> Why?
> And why all of it?


we found that when the flush/evict thresholds were triggered, the
performance would make us a bit upset :), so I wish to flush/evict the tier
in a spare time, e.g. the middle of the night. That way the tier would not
have to spend any effort on flushing/evicting while there are heavy r/w
operations on the cephfs which we are using.
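
(If the goal is just to push the flushing into a quiet window, one low-tech
sketch is to tighten and relax the dirty ratio on a schedule instead of
running cache-try-flush-evict-all -- the values and times are only examples:)

# crontab sketch: flush harder at night, back off during the day
0 2 * * *  ceph osd pool set data_cache cache_target_dirty_ratio 0.1
0 7 * * *  ceph osd pool set data_cache cache_target_dirty_ratio 0.4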

>
> > But It a bit slow
> >
> Define slow, but it has to do a LOT of work and housekeeping to do this,
> so unless your cluster is very fast (probably not, or you wouldn't
> want/need a cache tier) and idle, that's the way it is.
>
> > 1. Is there any way to speed up the evicting?
> >
> Not really, see above.
>
> > 2. Is evicting triggered by itself good enough for cluster ?
> >
> See above, WHY are you manually flushing/evicting?
>
explained above.


> Are you aware that flushing is the part that's very I/O intensive, while
> evicting is a very low cost/impact operation?
>
not very sure; that's just what my instinct told me.


> In normal production, the various parameters that control this will do
> fine, if properly configured of course.
>
> > 3. Does the flushing and evicting slow down the whole cluster?
> >
> Of course, as any good sysadmin with the correct tools (atop, iostat,
> etc, graphing Ceph performance values with Grafana/Graphite) will be able
> to see instantly.

actually, we are using graphite, but I could not see that instantly, lol
:(, I could only tell the threshold had been triggered by calculating it
after the fact.

btw, we use cephfs to store a huge number of small files (64T in total,
about 100K per file).


>
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Rakuten Communications
>