Re: [ceph-users] Question about "osd objectstore = keyvaluestore-dev" setting

2014-05-22 Thread Gregory Farnum
On Thu, May 22, 2014 at 5:04 AM, Geert Lindemulder  wrote:
> Hello All
>
> Trying to implement the osd leveldb backend at an existing ceph test
> cluster.
> The test cluster was updated from 0.72.1 to 0.80.1. The update was ok.
> After the update, the "osd objectstore = keyvaluestore-dev" setting was
> added to ceph.conf.

Does that mean you tried to switch to the KeyValueStore on one of your
existing OSDs? That isn't going to work; you'll need to create new
ones (or knock out old ones and recreate them with it).

> After restarting an osd it gives the following error:
> 2014-05-22 12:28:06.805290 7f2e7d9de800 -1 KeyValueStore::mount : stale
> version stamp 3. Please run the KeyValueStore update script before starting
> the OSD, or set keyvaluestore_update_to to 1
>
> How can the "keyvaluestore_update_to" parameter be set or where can i find
> the "KeyValueStore update script"

Hmm, it looks like that config value isn't actually plugged in to the
KeyValueStore, so you can't set it with the stock binaries. Maybe
Haomai has an idea?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
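
For anyone following along: the setting itself lives in ceph.conf, but per the answer above it should only ever apply to OSDs that are created fresh with that backend. A minimal sketch (the OSD id is an assumption, not from this thread):

  [osd.12]                                  # a brand-new OSD, never used with FileStore
      osd objectstore = keyvaluestore-dev   # experimental key/value backend

Existing FileStore OSDs have to be removed and recreated before they can use it; there is no in-place conversion path.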


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-22 Thread Gregory Farnum
On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman
 wrote:
>
> - Message from Gregory Farnum  -
>Date: Wed, 21 May 2014 15:46:17 -0700
>
>    From: Gregory Farnum 
> Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
>  To: Kenneth Waegeman 
>  Cc: ceph-users 
>
>
>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
>>  wrote:
>>>
>>> Thanks! I increased the max processes parameter for all daemons quite a
>>> lot
>>> (up to ulimit -u 3802720)
>>>
>>> These are the limits for the daemons now..
>>> [root@ ~]# cat /proc/17006/limits
>>> Limit                     Soft Limit   Hard Limit   Units
>>> Max cpu time              unlimited    unlimited    seconds
>>> Max file size             unlimited    unlimited    bytes
>>> Max data size             unlimited    unlimited    bytes
>>> Max stack size            10485760     unlimited    bytes
>>> Max core file size        unlimited    unlimited    bytes
>>> Max resident set          unlimited    unlimited    bytes
>>> Max processes             3802720      3802720      processes
>>> Max open files            32768        32768        files
>>> Max locked memory         65536        65536        bytes
>>> Max address space         unlimited    unlimited    bytes
>>> Max file locks            unlimited    unlimited    locks
>>> Max pending signals       95068        95068        signals
>>> Max msgqueue size         819200       819200       bytes
>>> Max nice priority         0            0
>>> Max realtime priority     0            0
>>> Max realtime timeout      unlimited    unlimited    us
>>>
>>> But this didn't help. Are there other parameters I should change?
>>
>>
>> Hrm, is it exactly the same stack trace? You might need to bump the
>> open files limit as well, although I'd be surprised. :/
>
>
> I increased the open file limit as test to 128000, still the same results.
>
> Stack trace:



> But I see some things happening on the system while doing this too:
>
>
>
> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
> set pool 16 pgp_num to 4096
> [root@ ~]# ceph status
> Traceback (most recent call last):
>   File "/usr/bin/ceph", line 830, in <module>
> sys.exit(main())
>   File "/usr/bin/ceph", line 590, in main
> conffile=conffile)
>   File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
> librados_path = find_library('rados')
>   File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
> return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>   File "/usr/lib64/python2.6/ctypes/util.py", line 203, in
> _findSoname_ldconfig
> os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
> OSError: [Errno 12] Cannot allocate memory
> [root@ ~]# lsof | wc
> -bash: fork: Cannot allocate memory
> [root@ ~]# lsof | wc
>   21801  211209 3230028
> [root@ ~]# ceph status
> ^CError connecting to cluster: InterruptedOrTimeoutError
> ^[[A[root@ ~]# lsof | wc
>2028   17476  190947
>
>
>
> And meanwhile the daemons have crashed.
>
> I verified the memory never ran out.

Is there anything in dmesg? It sure looks like the OS thinks it's run
out of memory one way or another.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
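
When fork() or thread creation fails with ENOMEM as above, a few quick checks can narrow down which limit is actually being hit (illustrative commands; adjust to your distribution):

  dmesg | egrep -i 'out of memory|oom|fork'   # kernel-side evidence of memory/OOM trouble
  cat /proc/sys/kernel/threads-max            # system-wide cap on threads
  cat /proc/sys/vm/max_map_count              # mmap regions per process (each thread stack uses one)
  ps -eLf | wc -l                             # total threads currently running on the box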


Re: [ceph-users] ceph.conf public network

2014-05-27 Thread Gregory Farnum
On Tue, May 27, 2014 at 9:55 AM, Ignazio Cassano
 wrote:
> Hi all,
> I have read a lot of email messages and I am confused, because in some
> /etc/ceph/ceph.conf examples the public network is written like:
> public_network = a.b.c.d/netmask
> in others like :
>
> public network = a.b.c.d/netmask

These are equivalent in Ceph's parsing system. In general, anywhere
you are setting config values, you can use a space (" ") or an
underscore ("_") interchangeably.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-27 Thread Gregory Farnum
On Sun, May 25, 2014 at 6:24 PM, Guang Yang  wrote:
> On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:
>
>> This failure means the messenger subsystem is trying to create a
>> thread and is getting an error code back — probably due to a process
>> or system thread limit that you can turn up with ulimit.
>>
>> This is happening because a replicated PG primary needs a connection
>> to only its replicas (generally 1 or 2 connections), but with an
>> erasure-coded PG the primary requires a connection to m+n-1 replicas
>> (everybody who's in the erasure-coding set, including itself). Right
>> now our messenger requires a thread for each connection, so kerblam.
>> (And it actually requires a couple such connections because we have
>> separate heartbeat, cluster data, and client data systems.)
> Hi Greg,
> Is there any plan to refactor the messenger component to reduce the num of 
> threads? For example, use event-driven mode.

We've discussed it in very broad terms, but there are no concrete
designs and it's not on the schedule yet. If anybody has conclusive
evidence that it's causing them trouble they can't work around, that
would be good to know...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
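
To put rough numbers on the thread growth described above (the EC profile and PG count below are illustrative assumptions, not figures from this thread): with k=8, m=3 each PG spans 11 OSDs, so a primary holds roughly 10 peer connections per PG, multiplied across the separate heartbeat, cluster, and client messengers; with a thread per connection, a few hundred primary PGs quickly means thousands of threads. Two quick checks:

  ps -o nlwp= -p "$(pidof -s ceph-osd)"                      # threads in use by one ceph-osd
  grep 'Max processes' /proc/"$(pidof -s ceph-osd)"/limits   # the per-process thread/process cap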


Re: [ceph-users] Is there a way to repair placement groups?

2014-05-27 Thread Gregory Farnum
Note that while the "repair" command *will* return your cluster to
consistency, it is not guaranteed to restore the data you want to see
there — in general, it will simply put the primary OSD's view of the
world on the replicas. If you have a massive inconsistency like that,
you probably want to figure out what happened and if it's simply one
bad OSD you can remove, or a more general problem.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
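
Before reaching for repair, it can help to see which OSD is common to the inconsistent PGs and what the scrub actually complained about. An illustrative sequence (the PG id 42.57 is taken from the report below; log paths are assumptions):

  ceph health detail | grep inconsistent       # list the inconsistent PGs
  ceph pg map 42.57                            # acting set; the first OSD listed is the primary
  ceph pg 42.57 query | less                   # scrub and peering state for that PG
  grep -H 'ERR' /var/log/ceph/ceph-osd.*.log   # run on each host in the acting set
  ceph pg repair 42.57                         # only once you trust the primary's copy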


On Tue, May 27, 2014 at 3:17 PM, Michael  wrote:
> Hi Peter,
>
> Please use "ceph pg repair XX.xx". It might take a few seconds to kick in
> after being instructed.
>
> -Michael
>
>
> On 27/05/2014 21:40, phowell wrote:
>>
>> Hi
>>
>> First apologies if this is the wrong place to ask this question.
>>
>> We are running a small Ceph (0.79) cluster with about 12 osd's which are
>> on top of a zfs raid 1+0 (for another discussion)... which were created on
>> this version.
>>
>> Below I have included the ceph health detail which has rather a lot of
>> inconsistent placement groups. The underlying zfs arrays have reported no
>> problems and all the zpools are good. None of the 12 osd's are down or out.
>>
>> I could not find a way to repair these placement groups.  I have tried
>> using 'ceph pg scrub XX.xx' with no effect. I would be grateful if someone
>> could point me in the right direction to fix the problem.
>>
>> Thanks
>>
>> Peter Howell.
>>
>> root@durotar:/var/log/ceph# ceph health detail
>> HEALTH_ERR 86 pgs inconsistent; 292 scrub errors
>> pg 42.57 is active+clean+inconsistent, acting [2,3,10]
>> pg 42.56 is active+clean+inconsistent, acting [8,9,1]
>> pg 42.50 is active+clean+inconsistent, acting [4,1,9]
>> pg 42.53 is active+clean+inconsistent, acting [3,1,8]
>> pg 42.52 is active+clean+inconsistent, acting [11,7,3]
>> pg 42.5d is active+clean+inconsistent, acting [4,1,6]
>> pg 42.5c is active+clean+inconsistent, acting [7,9,5]
>> pg 42.5e is active+clean+inconsistent, acting [6,10,0]
>> pg 42.59 is active+clean+inconsistent, acting [5,0,11]
>> pg 42.5b is active+clean+inconsistent, acting [1,3,6]
>> pg 42.5a is active+clean+inconsistent, acting [8,9,4]
>> pg 42.45 is active+clean+inconsistent, acting [3,0,9]
>> pg 42.44 is active+clean+inconsistent, acting [10,8,5]
>> pg 42.46 is active+clean+inconsistent, acting [5,0,6]
>> pg 42.41 is active+clean+inconsistent, acting [5,2,8]
>> pg 42.42 is active+clean+inconsistent, acting [7,10,5]
>> pg 42.4d is active+clean+inconsistent, acting [1,5,11]
>> pg 42.4c is active+clean+inconsistent, acting [0,5,6]
>> pg 42.4f is active+clean+inconsistent, acting [6,9,1]
>> pg 42.4e is active+clean+inconsistent, acting [5,0,6]
>> pg 42.49 is active+clean+inconsistent, acting [6,9,0]
>> pg 42.48 is active+clean+inconsistent, acting [2,4,7]
>> pg 42.4a is active+clean+inconsistent, acting [9,8,4]
>> pg 42.74 is active+clean+inconsistent, acting [3,2,11]
>> pg 42.77 is active+clean+inconsistent, acting [5,2,8]
>> pg 42.76 is active+clean+inconsistent, acting [9,7,3]
>> pg 42.71 is active+clean+inconsistent, acting [8,10,0]
>> pg 42.7c is active+clean+inconsistent, acting [8,10,0]
>> pg 42.7f is active+clean+inconsistent, acting [10,6,0]
>> pg 42.7e is active+clean+inconsistent, acting [2,4,9]
>> pg 42.78 is active+clean+inconsistent, acting [4,1,11]
>> pg 42.65 is active+clean+inconsistent, acting [8,11,4]
>> pg 42.67 is active+clean+inconsistent, acting [0,3,11]
>> pg 42.66 is active+clean+inconsistent, acting [4,1,7]
>> pg 42.6c is active+clean+inconsistent, acting [4,2,6]
>> pg 42.6f is active+clean+inconsistent, acting [9,7,5]
>> pg 42.6e is active+clean+inconsistent, acting [8,11,4]
>> pg 42.69 is active+clean+inconsistent, acting [10,8,4]
>> pg 42.68 is active+clean+inconsistent, acting [9,7,0]
>> pg 42.6b is active+clean+inconsistent, acting [10,7,4]
>> pg 42.6a is active+clean+inconsistent, acting [10,8,2]
>> pg 42.15 is active+clean+inconsistent, acting [9,6,4]
>> pg 42.14 is active+clean+inconsistent, acting [6,10,3]
>> pg 42.16 is active+clean+inconsistent, acting [4,2,9]
>> pg 42.11 is active+clean+inconsistent, acting [6,9,2]
>> pg 42.10 is active+clean+inconsistent, acting [11,6,2]
>> pg 42.1f is active+clean+inconsistent, acting [4,0,10]
>> pg 42.1e is active+clean+inconsistent, acting [10,7,1]
>> pg 42.19 is active+clean+inconsistent, acting [3,0,9]
>> pg 42.18 is active+clean+inconsistent, acting [5,0,9]
>> pg 42.1b is active+clean+inconsistent, acting [7,9,5]
>> pg 42.1a is active+clean+inconsistent, acting [3,0,9]
>> pg 42.5 is active+clean+inconsistent, acting [11,7,3]
>> pg 42.7 is active+clean+inconsistent, acting [7,11,0]
>> pg 42.0 is active+clean+inconsistent, acting [7,10,5]
>> pg 42.2 is active+clean+inconsistent, acting [11,6,4]
>> pg 42.c is active+clean+inconsistent, acting [7,9,5]
>> pg 42.f is active+clean+inconsistent, acting [2,3,7]
>> pg 42.9 is active+clean+inconsistent, acting [2,3,10]
>> pg 42.8 is active+clean+inconsistent, acting [10,7,2]
>> pg 42.a is active+clean+incons

Re: [ceph-users] Is there a way to repair placement groups?

2014-05-27 Thread Gregory Farnum
Yeah, there are a lot of smarts like that which could be added in.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 27, 2014 at 4:17 PM, Michael  wrote:
> Would it be feasible to try for an odd one out policy by default when
> repairing from a pool of 3 or more disks? Or is the most common cause of
> inconsistency most likely to not effect the primary?
>
> -Michael
>
>
> On 27/05/2014 23:55, Gregory Farnum wrote:
>>
>> Note that while the "repair" command *will* return your cluster to
>> consistency, it is not guaranteed to restore the data you want to see
>> there — in general, it will simply put the primary OSD's view of the
>> world on the replicas. If you have a massive inconsistency like that,
>> you probably want to figure out what happened and if it's simply one
>> bad OSD you can remove, or a more general problem.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Tue, May 27, 2014 at 3:17 PM, Michael 
>> wrote:
>>>
>>> Hi Peter,
>>>
>>> Please use "ceph pg repair XX.xx". It might take a few seconds to kick in
>>> after being instructed.
>>>
>>> -Michael
>>>
>>>
>>> On 27/05/2014 21:40, phowell wrote:
>>>>
>>>> Hi
>>>>
>>>> First apologies if this is the wrong place to ask this question.
>>>>
>>>> We are running a small Ceph (0.79) cluster with about 12 osd's which are
>>>> on top of a zfs raid 1+0 (for another discussion)... which were created
>>>> on
>>>> this version.
>>>>
>>>> Below I have included the ceph health detail which has rather a lot of
>>>> inconsistent placement groups. The underlying zfs arrays have reported
>>>> no
>>>> problems and all the zpools are good. None of the 12 osd's are down or
>>>> out.
>>>>
>>>> I could not find a way to repair these placement groups.  I have tried
>>>> using 'ceph pg scrub XX.xx' with no effect. I would be grateful if
>>>> someone
>>>> could point me in the right direction to fix the problem.
>>>>
>>>> Thanks
>>>>
>>>> Peter Howell.
>>>>
>>>> root@durotar:/var/log/ceph# ceph health detail
>>>> HEALTH_ERR 86 pgs inconsistent; 292 scrub errors
>>>> pg 42.57 is active+clean+inconsistent, acting [2,3,10]
>>>> pg 42.56 is active+clean+inconsistent, acting [8,9,1]
>>>> pg 42.50 is active+clean+inconsistent, acting [4,1,9]
>>>> pg 42.53 is active+clean+inconsistent, acting [3,1,8]
>>>> pg 42.52 is active+clean+inconsistent, acting [11,7,3]
>>>> pg 42.5d is active+clean+inconsistent, acting [4,1,6]
>>>> pg 42.5c is active+clean+inconsistent, acting [7,9,5]
>>>> pg 42.5e is active+clean+inconsistent, acting [6,10,0]
>>>> pg 42.59 is active+clean+inconsistent, acting [5,0,11]
>>>> pg 42.5b is active+clean+inconsistent, acting [1,3,6]
>>>> pg 42.5a is active+clean+inconsistent, acting [8,9,4]
>>>> pg 42.45 is active+clean+inconsistent, acting [3,0,9]
>>>> pg 42.44 is active+clean+inconsistent, acting [10,8,5]
>>>> pg 42.46 is active+clean+inconsistent, acting [5,0,6]
>>>> pg 42.41 is active+clean+inconsistent, acting [5,2,8]
>>>> pg 42.42 is active+clean+inconsistent, acting [7,10,5]
>>>> pg 42.4d is active+clean+inconsistent, acting [1,5,11]
>>>> pg 42.4c is active+clean+inconsistent, acting [0,5,6]
>>>> pg 42.4f is active+clean+inconsistent, acting [6,9,1]
>>>> pg 42.4e is active+clean+inconsistent, acting [5,0,6]
>>>> pg 42.49 is active+clean+inconsistent, acting [6,9,0]
>>>> pg 42.48 is active+clean+inconsistent, acting [2,4,7]
>>>> pg 42.4a is active+clean+inconsistent, acting [9,8,4]
>>>> pg 42.74 is active+clean+inconsistent, acting [3,2,11]
>>>> pg 42.77 is active+clean+inconsistent, acting [5,2,8]
>>>> pg 42.76 is active+clean+inconsistent, acting [9,7,3]
>>>> pg 42.71 is active+clean+inconsistent, acting [8,10,0]
>>>> pg 42.7c is active+clean+inconsistent, acting [8,10,0]
>>>> pg 42.7f is active+clean+inconsistent, acting [10,6,0]
>>>> pg 42.7e is active+clean+inconsistent, acting [2,4,9]
>>>> pg 42.78 is active+clean+inconsistent, acting [4,1,11]
>>>> pg 42.65 is active+clean+inconsistent, acting [8,11,4]
>>>> pg 42.67 is active+clean+inc

Re: [ceph-users] why use hadoop with ceph ?

2014-05-30 Thread Gregory Farnum
On Friday, May 30, 2014, Ignazio Cassano  wrote:

> Hi all,
> I am testing ceph because I found it is very interesting as far as remote
> block
> device is concerned.
> But my company is very interested in big data.
> So I read something about hadoop and ceph integration.
> Anyone can suggest me some documentation explaining the purpose of
> ceph/hadoop integration ?
> Why don't use only hadoop for big data ?
>

It has a couple of advantages now:
1) if you're already running Ceph, you only need to manage one storage
cluster
2) you get all of Ceph's reliability, resiliency, and dynamism
3) you get a real POSIX filesystem that you can run Hadoop workloads
against (which enables things like using other data-analytics systems
against it)

In the future, when CephFS is more fully supported for production use,
you'll also be able to do things like use Ceph as the canonical location of
all your data, and run Hadoop loads against it without having to do an
export/import, etc.
-Greg


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Replication

2014-05-30 Thread Gregory Farnum
Depending on what level of verification you need, you can just do a "ceph
pg dump" and look to see which OSDs host every PG. If you want to
demonstrate replication to a skeptical audience, sure, turn off the
machines and show that data remains accessible.
-Greg
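
A quick illustration of the first approach (pool and object names are examples only):

  rados -p rbd put demo-object /etc/hosts     # write a small test object into the pool
  ceph osd map rbd demo-object                # shows the PG and the set of OSDs holding it
  ceph pg dump pgs_brief | head               # up/acting OSD sets for every PG at a glance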

On Friday, May 30, 2014,  wrote:

>  Hi All,
>
>
>
> I have a four node ceph storage cluster. By default all the storage
> objects are replicated in Ceph.
>
> My installation consists of three OSDs.  If I create some volumes as block
> devices in ceph and write some data
>
> onto it,  how to verify or test that the data is replicated ? Is it by
> stopping one or more of the OSDs and check the other
>
> running ones or how  ?
>
>
>
>
>
>
>
> Thanks
>
> Kumar
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
>
> __
>
> www.accenture.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] RGW: Multi Part upload and resulting objects

2014-06-04 Thread Gregory Farnum
On Wed, Jun 4, 2014 at 7:58 AM, Sylvain Munaut
 wrote:
> Hi,
>
>
> During a multi part upload you can't upload parts smaller than 5M, and
> radosgw also slices object in slices of 4M. Having those two being
> different is a bit unfortunate because if you slice your files in the
> minimum chunk size you end up with a main file of 4M and a shadowfile
> of 1M for each part ...
>
>
> Would it make sense to allow either multipart upload of 4M, or to raise
> the slice size to something more than 4M (4M or 8M if you want power
> of 2) ?

Huh. We took the 5MB limit from S3, but it definitely is unfortunate
in combination with our 4MB chunking. You can change the default slice
size using a config option, though. I believe you want to change
rgw_obj_stripe_size (default: 4 << 20). There might be some other
considerations around the initial 512KB "head" objects,
though...Yehuda?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
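
For illustration, the option Greg names would go in the radosgw section of ceph.conf; the section name and the 5 MB value below are assumptions for a setup that wants stripes to match the S3 minimum part size:

  [client.radosgw.gateway]
      rgw obj stripe size = 5242880    # 5 MB, instead of the 4 MB (4 << 20) default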


Re: [ceph-users] PGs inconsistency, deep-scrub / repair won't fix (v0.80.1)

2014-06-05 Thread Gregory Farnum
On Thu, Jun 5, 2014 at 4:38 AM, Dennis Kramer  wrote:
> Hi all,
>
> A couple of weeks ago i've upgraded from emperor to firefly.
> I'm using Cloudstack /w CEPH as the storage backend for VMs and templates.

Which versions exactly were you and are you running?

>
> Since the upgrade, ceph is in a HEALTH_ERR with 500+ pgs inconsistent and
> 2000+ scrub errors. Not sure if it has to do with firefly though, but the
> upgrade was the only major change I had.
>
> After the upgrade i've noticed that some of my OSDs were near-full. My
> current ceph setup has two racks defined, each with a couple of hosts. One
> rack was purely for archiving/backup purposes and wasn't that active at all,
> so I've changed the crushmap and moved some hosts from one rack to another.
> I've noticed no problems during this move at all and the cluster was
> rebalancing itself after this change. The current problems I have began
> after the upgrade and the hosts move.
>
> The logs shows messages like:
>
> 2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid
> 1e3d14ac/rbd_data.867c0514e5cb0.00e3/head//9 digest 693024524 !=
> known digest 2075700712
>
> Manual repair with for example "ceph osd repair"

How did you invoke this command?

> doesn't fix the
> inconsistency. I've investigated the rbd image(s) and can pinpoint it to a
> specific VM. When I delete this VM (with the inconsistency pgs in it) from
> ceph and run a deep-scrub again, the inconsistency is gone (makes sense,
> because the rbd image is removed). But when I re-create the VM, I get the
> same inconsistency errors again. The errors are showing in the same ceph
> pool, but different pg. First I thought the base template was the faulty
> image, but even after removing the base VM template and re-creating a new
> template the inconsistencies still occur.
>
> In total I have 8 pools, and the problem exists in at least half of them.
>
> It doesn't look like the osd itself has any problems or has HDD bad sectors.
> The inconsistency is spread over a bunch of different (almost all actually)
> OSDs.
>
> It seems the VMs are running fine though, even with all these inconsistency
> errors, but I'm still worried because I doubt this is a false-positive..
>
> I'm at a loss at the moment and not sure what my next step would be.
> Is there anyone who can shed some light over this issue?

If you're still seeing this, you probably want to compare the objects
directly. When the system reports a bad object, go to each OSD which
stores it, grab the file involved from each replica, and do a manual
diff to see how they compare.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
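
A sketch of that manual comparison for the object named in the log above (the pool name and FileStore paths are assumptions; on-disk filenames are escaped forms of the object name, so match on a fragment):

  ceph osd map rbd rbd_data.867c0514e5cb0.00e3    # pool name 'rbd' is an assumption; gives the acting OSD set
  # on each OSD host in that acting set (FileStore layout):
  find /var/lib/ceph/osd/ceph-*/current/9.ac_head -name '*867c0514e5cb0*'
  # md5sum the file each replica returns and compare the sums between hosts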


Re: [ceph-users] RGW: Multi Part upload and resulting objects

2014-06-05 Thread Gregory Farnum
I don't believe that should cause any issues; the chunk sizes are in
the metadata.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 5, 2014 at 12:23 AM, Sylvain Munaut
 wrote:
> Hello,
>
>> Huh. We took the 5MB limit from S3, but it definitely is unfortunate
>> in combination with our 4MB chunking. You can change the default slice
>> size using a config option, though. I believe you want to change
>> rgw_obj_stripe_size (default: 4 << 20). There might be some other
>> considerations around the initial 512KB "head" objects,
>> though...Yehuda?
>
> Ah great. Can you change this option on a cluster with existing data ?
> That won't prevent files added prior to the change to be accessed right ?
>
>
> Cheers,
>
> Sylvain


Re: [ceph-users] Minimal io block in rbd

2014-06-05 Thread Gregory Farnum
There's some prefetching and stuff, but the rbd library and RADOS storage
are capable of issuing reads and writes in any size (well, down to the
minimal size of the underlying physical disk).
There are some scenarios where you will see it writing a lot more if you
use layering -- promotion of data happens a full object at a time.
-Greg
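
One way to convince yourself of this (image and pool names are illustrative):

  rbd create rbd/iotest --size 1024              # 1 GiB image, striped over 4 MiB objects
  rbd map rbd/iotest                             # appears as e.g. /dev/rbd0
  dd if=/dev/rbd0 of=/dev/null bs=4k count=1 skip=1000 iflag=direct
  # the OSD serves a ~4 KiB read out of one 4 MiB object; it does not have to
  # read the whole object back just to satisfy that request.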

On Thursday, June 5, 2014, Cédric Lemarchand 
wrote:

> I would think that rbd block are like stripes for RAID or blocks for hard
> drives, even if you only need to read or write 1k, the full stripe has to
> be read or write.
>
> Cheers
>
> --
> Cédric Lemarchand
>
> > On 5 June 2014, at 22:56, Timofey Koolin wrote:
> >
> > Do for every read/write rbd read/write full block of data (4MB) or rbd
> can read/write part of block?
> >
> > For example - I have a 500MB file (database) and need random read/write
> by blocks about 1-10Kb.
> >
> > Do for every read 1 Kb rbd will read 4MB from hdd?
> > for write?
> >
> >
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] cephfs snapshots : mkdir: cannot create directory `.snap/test': Operation not permitted

2014-06-06 Thread Gregory Farnum
Snapshots are disabled by default; there's a command you can run to
enable them if you want, but the reason they're disabled is because
they're significantly more likely to break your filesystem than
anything else is!
ceph mds set allow_new_snaps true
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
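
For completeness, the enable-and-test sequence looks roughly like this (the mount point is an assumption, and depending on the exact release the command may also insist on a --yes-i-really-mean-it style confirmation):

  ceph mds set allow_new_snaps true            # accept the "may break your filesystem" risk
  mkdir /mnt/cephfs/mydir/.snap/test           # snapshot the directory 'mydir'
  ls /mnt/cephfs/mydir/.snap                   # the snapshot shows up here
  rmdir /mnt/cephfs/mydir/.snap/test           # and is removed the same way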


On Fri, Jun 6, 2014 at 12:22 AM, Micha Krause  wrote:
> Hi,
>
> I'm playing around with cephfs, everything works fine except creating
> snapshots:
>
> # mkdir .snap/test
> mkdir: cannot create directory `.snap/test': Operation not permitted
>
> Client Kernel version:
>
> 3.14
>
> Ceph Cluster version:
>
> 0.80.1
>
> I tried it on 2 different clients, both Debian, one with jessie, one with
> wheezy + backports kernel.
>
> Is there some config-option to enable snapshots, or is this a bug?
>
>
> Micha Krause


Re: [ceph-users] fail to add osd to cluster

2014-06-06 Thread Gregory Farnum
I haven't used ceph-deploy to do this much, but I think you need to
"prepare" before you "activate" and it looks like you haven't done so.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
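
The usual ceph-deploy flow, for reference (host and device names are taken from the output below; zapping first is an assumption that the disks hold nothing you need):

  ceph-deploy disk zap ceph-4:/dev/sdb ceph-4:/dev/sdc ceph-4:/dev/sdd          # wipe stale partitions/journals
  ceph-deploy osd prepare ceph-4:/dev/sdb ceph-4:/dev/sdc ceph-4:/dev/sdd       # partition, mkfs, write the fsid
  ceph-deploy osd activate ceph-4:/dev/sdb1 ceph-4:/dev/sdc1 ceph-4:/dev/sdd1   # mount and start the OSDs
  # or collapse prepare+activate into one step:
  ceph-deploy osd create ceph-4:/dev/sdb ceph-4:/dev/sdc ceph-4:/dev/sdd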


On Fri, Jun 6, 2014 at 3:54 PM, Jonathan Gowar  wrote:
> Assitance really appreciated.  This output says it all:-
>
> ceph@ceph-admin:~$ ceph-deploy osd activate ceph-4:/dev/sdb1
> ceph-4:/dev/sdc1 ceph-4:/dev/sdd1
> [ceph_deploy.conf][DEBUG ] found configuration file
> at: /home/ceph/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.2): /usr/bin/ceph-deploy osd
> activate ceph-4:/dev/sdb1 ceph-4:/dev/sdc1 ceph-4:/dev/sdd1
> [ceph_deploy.osd][DEBUG ] Activating cluster ceph disks
> ceph-4:/dev/sdb1: ceph-4:/dev/sdc1: ceph-4:/dev/sdd1:
> [ceph-4][DEBUG ] connected to host: ceph-4
> [ceph-4][DEBUG ] detect platform information from remote host
> [ceph-4][DEBUG ] detect machine type
> [ceph_deploy.osd][INFO  ] Distro info: debian 7.5 wheezy
> [ceph_deploy.osd][DEBUG ] activating host ceph-4 disk /dev/sdb1
> [ceph_deploy.osd][DEBUG ] will use init type: sysvinit
> [ceph-4][INFO  ] Running command: sudo ceph-disk-activate --mark-init
> sysvinit --mount /dev/sdb1
> [ceph-4][WARNIN] got latest monmap
> [ceph-4][WARNIN] 2014-06-06 23:25:12.798014 7fbf1085d780 -1 journal
> FileJournal::_open: disabling aio for non-block journal.  Use
> journal_force_aio to force use of aio anyway
> [ceph-4][WARNIN] 2014-06-06 23:25:12.798042 7fbf1085d780 -1 journal
> check: ondisk fsid ---- doesn't match
> expected 382cf137-3891-42ef-b23c-7b4664d97466, invalid (someone else's?)
> journal
> [ceph-4][WARNIN] 2014-06-06 23:25:12.798060 7fbf1085d780 -1
> filestore(/dev/sdb1) mkjournal error creating journal
> on /dev/sdb1/journal: (22) Invalid argument
> [ceph-4][WARNIN] 2014-06-06 23:25:12.798068 7fbf1085d780 -1 OSD::mkfs:
> ObjectStore::mkfs failed with error -22
> [ceph-4][WARNIN] 2014-06-06 23:25:12.798097 7fbf1085d780 -1  ** ERROR:
> error creating empty object store in /dev/sdb1: (22) Invalid argument
> [ceph-4][WARNIN] Traceback (most recent call last):
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 2579, in <module>
> [ceph-4][WARNIN] main()
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 2557, in main
> [ceph-4][WARNIN] args.func(args)
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 1917, in
> main_activate
> [ceph-4][WARNIN] init=args.mark_init,
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 1749, in
> activate_dir
> [ceph-4][WARNIN] (osd_id, cluster) = activate(path,
> activate_key_template, init)
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 1849, in activate
> [ceph-4][WARNIN] keyring=keyring,
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 1484, in mkfs
> [ceph-4][WARNIN] '--keyring', os.path.join(path, 'keyring'),
> [ceph-4][WARNIN]   File "/usr/sbin/ceph-disk", line 303, in
> command_check_call
> [ceph-4][WARNIN] return subprocess.check_call(arguments)
> [ceph-4][WARNIN]   File "/usr/lib/python2.7/subprocess.py", line 511, in
> check_call
> [ceph-4][WARNIN] raise CalledProcessError(retcode, cmd)
> [ceph-4][WARNIN] subprocess.CalledProcessError: Command
> '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i',
> '7', '--monmap', '/dev/sdb1/activate.monmap', '--osd-data', '/dev/sdb1',
> '--osd-journal', '/dev/sdb1/journal', '--osd-uuid',
> '382cf137-3891-42ef-b23c-7b4664d97466', '--keyring',
> '/dev/sdb1/keyring']' returned non-zero exit status 1
> [ceph-4][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command:
> ceph-disk-activate --mark-init sysvinit --mount /dev/sdb1
>
>


Re: [ceph-users] failed assertion on AuthMonitor

2014-06-09 Thread Gregory Farnum
Barring a newly-introduced bug (doubtful), that assert basically means
that your computer lied to the ceph monitor about the durability or
ordering of data going to disk, and the store is now inconsistent. If
you don't have data you care about on the cluster, by far your best
option is:
1) Figure out what part of the system is lying about data durability
(probably your filesystem or controller is ignoring barriers),
2) start the Ceph install over
It's possible that the ceph-monstore-tool will let you edit the store
back into a consistent state, but it looks like the system can't find
the *initial* commit, which means you'll need to manufacture a new one
wholesale with the right keys from the other system components.

(I am assuming that the system didn't crash right while you were
turning on the monitor for the first time; if it did that makes it
slightly more likely to be a bug on our end, but again it'll be
easiest to just start over since you don't have any data in it yet.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
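
Some illustrative places to look for step 1 (device names are examples):

  mount | grep -E 'ceph|mon'        # look for nobarrier / barrier=0 on the mon/osd filesystems
  hdparm -W /dev/sda                # is the drive's volatile write cache enabled?
  dmesg | grep -i barrier           # some filesystems log when barriers get disabled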


On Sun, Jun 8, 2014 at 10:26 PM, Mohammad Salehe  wrote:
> Hi,
>
> I'm receiving failed assertion in AuthMonitor::update_from_paxos(bool*)
> after a system crash. I've saved a complete monitor log with 10/20 for 'mon'
> and 'paxos' here.
> There is only one monitor and two OSDs in the cluster as I was just at the
> beginning of deployment.
>
> I will be thankful if someone could help.
>
> --
> Mohammad Salehe
> sal...@gmail.com
>


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Gregory Farnum
On Mon, Jun 9, 2014 at 3:22 PM, Craig Lewis  wrote:
> I've correlated a large deep scrubbing operation to cluster stability
> problems.
>
> My primary cluster does a small amount of deep scrubs all the time, spread
> out over the whole week.  It has no stability problems.
>
> My secondary cluster doesn't spread them out.  It saves them up, and tries
> to do all of the deep scrubs over the weekend.  The secondary starts losing
> OSDs about an hour after these deep scrubs start.
>
> To avoid this, I'm thinking of writing a script that continuously scrubs the
> oldest outstanding PG.  In pseudo-bash:
> # Sort by the deep-scrub timestamp, taking the single oldest PG
> while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}'
> | sort | head -1 | read date time pg
>  do
>   ceph pg deep-scrub ${pg}
>   while ceph status | grep scrubbing+deep
>do
> sleep 5
>   done
>   sleep 30
> done
>
>
> Does anybody think this will solve my problem?
>
> I'm also considering disabling deep-scrubbing until the secondary finishes
> replicating from the primary.  Once it's caught up, the write load should
> drop enough that opportunistic deep scrubs should have a chance to run.  It
> should only take another week or two to catch up.

If the problem is just that your secondary cluster is under a heavy
write load, and so the scrubbing won't run automatically until the PGs
hit their time limit, maybe it's appropriate to change the limits so
they can run earlier. You can bump up "osd scrub load threshold".
Or maybe that would be a terrible thing to do, not sure. But it sounds
like the cluster is just skipping the voluntary scrubs, and then they
all come due at once (probably from some earlier event).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
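
As an aside, the pseudo-bash sketch quoted above has one bash gotcha: `... | read date time pg` runs `read` in a subshell, so the variables never reach the loop body. A runnable variant under the same assumptions (firefly-era `ceph pg dump` column layout, one deep scrub at a time):

  #!/bin/bash
  # Continuously deep-scrub the PG with the oldest deep-scrub stamp.
  while true; do
    read -r date time pg < <(ceph pg dump 2>/dev/null \
        | awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ {print $20, $21, $1}' \
        | sort | head -1)
    [ -n "$pg" ] || break                        # nothing parsed; bail out
    ceph pg deep-scrub "$pg"
    while ceph status | grep -q scrubbing+deep; do
      sleep 5
    done
    sleep 30
  done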


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Gregory Farnum
On Mon, Jun 9, 2014 at 6:42 PM, Mike Dawson  wrote:
> Craig,
>
> I've struggled with the same issue for quite a while. If your i/o is similar
> to mine, I believe you are on the right track. For the past month or so, I
> have been running this cronjob:
>
> * * * * *   for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' |
> sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg;
> done
>
> That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7
> days. Your script may be a bit better, but this quick and dirty method has
> helped my cluster maintain more consistency.
>
> The real key for me is to avoid the "clumpiness" I have observed without
> that hack where concurrent deep-scrubs sit at zero for a long period of time
> (despite having PGs that were months overdue for a deep-scrub), then
> concurrent deep-scrubs suddenly spike up and stay in the teens for hours,
> killing client writes/second.
>
> The scrubbing behavior table[0] indicates that a periodic tick initiates
> scrubs on a per-PG basis. Perhaps the timing of ticks aren't sufficiently
> randomized when you restart lots of OSDs concurrently (for instance via
> pdsh).
>
> On my cluster I suffer a significant drag on client writes/second when I
> exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent
> deep-scrubs get into the teens, I get a massive drop in client
> writes/second.
>
> Greg, is there locking involved when a PG enters deep-scrub? If so, is the
> entire PG locked for the duration or is each individual object inside the PG
> locked as it is processed? Some of my PGs will be in deep-scrub for minutes
> at a time.

It locks very small regions of the key space, but the expensive part
is that deep scrub actually has to read all the data off disk, and
that's often a lot more disk seeks than simply examining the metadata
is.
-Greg

>
> 0: http://ceph.com/docs/master/dev/osd_internals/scrub/
>
> Thanks,
> Mike Dawson
>
>
>
> On 6/9/2014 6:22 PM, Craig Lewis wrote:
>>
>> I've correlated a large deep scrubbing operation to cluster stability
>> problems.
>>
>> My primary cluster does a small amount of deep scrubs all the time,
>> spread out over the whole week.  It has no stability problems.
>>
>> My secondary cluster doesn't spread them out.  It saves them up, and
>> tries to do all of the deep scrubs over the weekend.  The secondary
>> starts loosing OSDs about an hour after these deep scrubs start.
>>
>> To avoid this, I'm thinking of writing a script that continuously scrubs
>> the oldest outstanding PG.  In pseudo-bash:
>> # Sort by the deep-scrub timestamp, taking the single oldest PG
>> while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21,
>> $1}' | sort | head -1 | read date time pg
>>   do
>>ceph pg deep-scrub ${pg}
>>while ceph status | grep scrubbing+deep
>> do
>>  sleep 5
>>done
>>sleep 30
>> done
>>
>>
>> Does anybody think this will solve my problem?
>>
>> I'm also considering disabling deep-scrubbing until the secondary
>> finishes replicating from the primary.  Once it's caught up, the write
>> load should drop enough that opportunistic deep scrubs should have a
>> chance to run.  It should only take another week or two to catch up.
>>
>>


Re: [ceph-users] PG Selection Criteria for Deep-Scrub

2014-06-10 Thread Gregory Farnum
Hey Mike, has your manual scheduling resolved this? I think I saw
another similar-sounding report, so a feature request to improve scrub
scheduling would be welcome. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 20, 2014 at 5:46 PM, Mike Dawson  wrote:
> I tend to set it whenever I don't want to be bothered by storage performance
> woes (nights I value sleep, etc).
>
> This cluster is bounded by relentless small writes (it has a couple dozen
> rbd volumes backing video surveillance DVRs). Some of the software we run is
> completely unaffected whereas other software falls apart during periods of
> deep-scrubs. I theorize it has to do with the individual software's attitude
> about flushing to disk / buffering.
>
> - Mike
>
>
>
> On 5/20/2014 8:31 PM, Aaron Ten Clay wrote:
>>
>> For what it's worth, version 0.79 has different headers, and the awk
>> command needs $19 instead of $20. But here is the output I have on a
>> small cluster that I recently rebuilt:
>>
>> $ ceph pg dump all | grep active | awk '{ print $19}' | sort -k1 | uniq -c
>> dumped all in format plain
>>1 2014-05-15
>>2 2014-05-17
>>   19 2014-05-18
>>  193 2014-05-19
>>  105 2014-05-20
>>
>> I have set noscrub and nodeep-scrub, as well as noout and nodown off and
>> on while I performed various maintenance, but that hasn't (apparently)
>> impeded the regular schedule.
>>
>> With what frequency are you setting the nodeep-scrub flag?
>>
>> -Aaron
>>
>>
>> On Tue, May 20, 2014 at 5:21 PM, Mike Dawson > > wrote:
>>
>> Today I noticed that deep-scrub is consistently missing some of my
>> Placement Groups, leaving me with the following distribution of PGs
>> and the last day they were successfully deep-scrubbed.
>>
>> # ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 |
>> uniq -c
>>5 2013-11-06
>>  221 2013-11-20
>>1 2014-02-17
>>   25 2014-02-19
>>   60 2014-02-20
>>4 2014-03-06
>>3 2014-04-03
>>6 2014-04-04
>>6 2014-04-05
>>   13 2014-04-06
>>4 2014-04-08
>>3 2014-04-10
>>2 2014-04-11
>>   50 2014-04-12
>>   28 2014-04-13
>>   14 2014-04-14
>>3 2014-04-15
>>   78 2014-04-16
>>   44 2014-04-17
>>8 2014-04-18
>>1 2014-04-20
>>   16 2014-05-02
>>   69 2014-05-04
>>  140 2014-05-05
>>  569 2014-05-06
>> 9231 2014-05-07
>>  103 2014-05-08
>>  514 2014-05-09
>> 1593 2014-05-10
>>  393 2014-05-16
>> 2563 2014-05-17
>> 1283 2014-05-18
>> 1640 2014-05-19
>> 1979 2014-05-20
>>
>> I have been running the default "osd deep scrub interval" of once
>> per week, but have disabled deep-scrub on several occasions in an
>> attempt to avoid the associated degraded cluster performance I have
>> written about before.
>>
>> To get the PGs longest in need of a deep-scrub started, I set the
>> nodeep-scrub flag, and wrote a script to manually kick off
>> deep-scrub according to age. It is processing as expected.
>>
>> Do you consider this a feature request or a bug? Perhaps the code
>> that schedules PGs to deep-scrub could be improved to prioritize PGs
>> that have needed a deep-scrub the longest.
>>
>> Thanks,
>> Mike Dawson


Re: [ceph-users] How to implement a rados plugin to encode/decode data while r/w

2014-06-10 Thread Gregory Farnum
On Tue, May 27, 2014 at 7:44 PM, Plato  wrote:
> For certain security issue, I need to make sure the data finally saved to
> disk is encrypted.
> So, I'm trying to write a rados class, which would be triggered to reading
> and writing process.
> That is, before data is written, encrypting method of the class will be
> invoked; and then after data is readed, decrypting method of the class will
> be invoked.
>
> I checked the interfaces in objclass.h, and found that cls_link perhaps is
> what I need.
> However, the interface not implemented yet. So, how to write such a rados
> plugin? Is it possible.

There are a number of existing class plugins in the source tree. Look
at the different folders in ceph/src/cls (probably start with "hello"
as a simple example).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
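
For orientation, a stripped-down class modeled on src/cls/hello might look like the sketch below. It is meant to be built inside the Ceph tree like the other cls plugins, the encryption itself is left as a placeholder, and the registration details should be checked against the objclass.h in your source version:

  #include "objclass/objclass.h"

  CLS_VER(1,0)
  CLS_NAME(crypt_demo)

  cls_handle_t h_class;
  cls_method_handle_t h_write_enc;

  // Invoked via exec() from the client; a real class would transform *in
  // (encrypt it) before persisting, and expose a matching read/decrypt method.
  static int write_enc(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
  {
    return cls_cxx_write_full(hctx, in);   // placeholder: store the data as-is
  }

  void __cls_init()
  {
    cls_register("crypt_demo", &h_class);
    cls_register_cxx_method(h_class, "write_enc",
                            CLS_METHOD_RD | CLS_METHOD_WR,
                            write_enc, &h_write_enc);
  }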


Re: [ceph-users] failed assertion on AuthMonitor

2014-06-10 Thread Gregory Farnum
I'd have to look for details, but I don't think the auth monitor ever
removes those keys, so if there are some missing, it sounds like some
data got lost out from underneath it. That could have happened if the
filesystem dropped a file, which we have seen on some kernels.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Jun 10, 2014 at 3:31 AM, Mohammad Salehe  wrote:
> Hi Greg,
>
> Thank for your suggestion and information. I've installed the cluster over
> again.
>
> I just wanted to investigate a little more based on your information. I can
> see that auth/paxos values in monitor K/V store are these:
> 'authfirst_commited': 251
> 'authlast_commited': 329
>
> and I have all the keys 'auth251'...'auth329' in there. However, there is no
> 'auth1' or 'auth250' but it seems monitor failed while reading 'auth1'. Is
> this normal?
> As a side note, I did not use cephx in this cluster.
>
> Thanks,
>
>
> 2014-06-09 22:11 GMT+04:30 Gregory Farnum :
>>
>> Barring a newly-introduced bug (doubtful), that assert basically means
>> that your computer lied to the ceph monitor about the durability or
>> ordering of data going to disk, and the store is now inconsistent. If
>> you don't have data you care about on the cluster, by far your best
>> option is:
>> 1) Figure out what part of the system is lying about data durability
>> (probably your filesystem or controller is ignoring barriers),
>> 2) start the Ceph install over
>> It's possible that the ceph-monstore-tool will let you edit the store
>> back into a consistent state, but it looks like the system can't find
>> the *initial* commit, which means you'll need to manufacture a new one
>> wholesale with the right keys from the other system components.
>>
>> (I am assuming that the system didn't crash right while you were
>> turning on the monitor for the first time; if it did that makes it
>> slightly more likely to be a bug on our end, but again it'll be
>> easiest to just start over since you don't have any data in it yet.)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Sun, Jun 8, 2014 at 10:26 PM, Mohammad Salehe  wrote:
>> > Hi,
>> >
>> > I'm receiving failed assertion in AuthMonitor::update_from_paxos(bool*)
>> > after a system crash. I've saved a complete monitor log with 10/20 for
>> > 'mon'
>> > and 'paxos' here.
>> > There is only one monitor and two OSDs in the cluster as I was just at
>> > the
>> > beginning of deployment.
>> >
>> > I will be thankful if someone could help.
>> >
>> > --
>> > Mohammad Salehe
>> > sal...@gmail.com
>> >
>> >
>
>
>
>
> --
> Mohammad Salehe
> sal...@gmail.com


Re: [ceph-users] MDS crash dump ?

2014-06-11 Thread Gregory Farnum
On Wednesday, June 11, 2014, Florent B  wrote:

> Hi every one,
>
> Sometimes my MDS crashes... sometimes after a few hours, sometimes after
> a few days.
>
> I know I could enable debugging and so on to get more information. But
> if it crashes after a few days, it generates gigabytes of debugging data
> that are not related to the crash.
>
> Is it possible to get just a crash dump when MDS is crashing, to see
> what's wrong ?


You should be getting a backtrace regardless of what debugging levels are
enabled, so I assume you mean having it dump out prior log lines when that
happens. And indeed you can.
Normally you specify something like
debug mds = 10
And that dumps out the log. You can instead specify two values, separated
by a slash, and the daemon will take the time to generate all the log lines
at the second value but only dump to disk the first value:
debug mds = 0/10
That will put nothing in the log, but will generate debug output level 10
in a memory ring buffer (1 entries), and dump it on a crash. You can do
this with any debug setting.
-Greg
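
In ceph.conf that might look like the following (the subsystems and levels shown are illustrative):

  [mds]
      debug mds = 1/20          # level 1 written to the log, level 20 kept in memory
      debug journaler = 1/20    # the in-memory entries are dumped only on a crash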



>


> Thank you.
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Unable to remove mds

2014-06-11 Thread Gregory Farnum
On Wed, Jun 11, 2014 at 4:56 AM,   wrote:
> Hi All,
>
>
>
> I have a four node ceph cluster. The metadata service is showing as degraded
> in health. How to remove the mds service from ceph ?

Unfortunately you can't remove it entirely right now, but if you
create a new filesystem using the "newfs" command, and don't turn on
an MDS daemon after that, it won't report a health error.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
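
A sketch of that workaround (the pool IDs are examples; note that newfs wipes the existing filesystem metadata, so only do this if you have no CephFS data to keep):

  ceph osd lspools                            # note the numeric IDs of the metadata and data pools
  ceph mds newfs 1 0 --yes-i-really-mean-it   # <metadata pool id> <data pool id>
  # then make sure no ceph-mds daemon is started afterwards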


Re: [ceph-users] Can we map OSDs from different hosts (servers) to a Pool in Ceph

2014-06-11 Thread Gregory Farnum
On Wed, Jun 11, 2014 at 5:18 AM, Davide Fanciola  wrote:
> Hi,
>
> we have a similar setup where we have SSD and HDD in the same hosts.
> Our very basic crushmap is configured as follows:
>
> # ceph osd tree
> # id  weight  type name                   up/down  reweight
> -6    3       root ssd
> 3     1           osd.3                   up       1
> 4     1           osd.4                   up       1
> 5     1           osd.5                   up       1
> -5    3       root platters
> 0     1           osd.0                   up       1
> 1     1           osd.1                   up       1
> 2     1           osd.2                   up       1
> -1    3       root default
> -2    1           host chgva-srv-stor-001
> 0     1               osd.0               up       1
> 3     1               osd.3               up       1
> -3    1           host chgva-srv-stor-002
> 1     1               osd.1               up       1
> 4     1               osd.4               up       1
> -4    1           host chgva-srv-stor-003
> 2     1               osd.2               up       1
> 5     1               osd.5               up       1
>
>
> We do not seem to have problems with this setup, but i'm not sure if it's a
> good practice to have elements appearing multiple times in different
> branches.
> On the other hand, I see no way to follow the physical hierarchy of a
> datacenter for pools, since a pool can be spread among
> servers/racks/rooms...
>
> Can someone confirm this crushmap is any good for our configuration?

If you accidentally use the "default" node anywhere, you'll get data
scattered across both classes of device. If you try and use both the
"platters" and "ssd" nodes within a single CRUSH rule, you might end
up with copies of data on the same host (reducing your data
resiliency). Otherwise this is just fine.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
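
To actually point a pool at one of those roots, something along these lines works (rule and pool names and PG counts are assumptions):

  ceph osd crush rule create-simple ssd-only ssd host    # replicate from root "ssd", one OSD per host
  ceph osd crush rule dump ssd-only                      # note its ruleset id
  ceph osd pool create fastpool 128 128
  ceph osd pool set fastpool crush_ruleset 3             # the id from the previous step (3 is an example)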


Re: [ceph-users] tiering : hit_set_count && hit_set_period memory usage ?

2014-06-11 Thread Gregory Farnum
On Wed, Jun 11, 2014 at 12:44 PM, Alexandre DERUMIER
 wrote:
> Hi,
>
> I'm reading tiering doc here
> http://ceph.com/docs/firefly/dev/cache-pool/
>
> "
> The hit_set_count and hit_set_period define how much time each HitSet should 
> cover, and how many such HitSets to store. Binning accesses over time allows 
> Ceph to independently determine whether an object was accessed at least once 
> and whether it was accessed more than once over some time period (“age” vs 
> “temperature”). Note that the longer the period and the higher the count the 
> more RAM will be consumed by the ceph-osd process. In particular, when the 
> agent is active to flush or evict cache objects, all hit_set_count HitSets 
> are loaded into RAM"
>
> about how much memory do we talk here ? any formula ? (nr object x ? )

We haven't really quantified that yet. In particular, it's going to
depend on how many objects are accessed within a period; the OSD sizes
them based on the previous access count and the false positive
probability that you give it.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
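
For reference, these are the knobs in question, set per cache pool (the pool name and values below are illustrative, not recommendations):

  ceph osd pool set cachepool hit_set_type bloom     # HitSets are bloom filters
  ceph osd pool set cachepool hit_set_count 4        # keep 4 of them
  ceph osd pool set cachepool hit_set_period 1200    # each covering 1200 seconds
  ceph osd pool set cachepool hit_set_fpp 0.05       # target false-positive probability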


Re: [ceph-users] tiering : hit_set_count && hit_set_period memory usage ?

2014-06-11 Thread Gregory Farnum
Any user access to an object promotes it into the cache pool.

On Wednesday, June 11, 2014, Alexandre DERUMIER  wrote:

> >>We haven't really quantified that yet. In particular, it's going to
> >>depend on how many objects are accessed within a period; the OSD sizes
> >>them based on the previous access count and the false positive
> >>probability that you give it
>
> Ok, thanks Greg.
>
>
>
> Another question, the doc describe how the objects are going from cache
> tier to base tier.
> But how does it work from base tier to cache tier ? (cache-mode writeback)
> Does any read on base tier promote the object in the cache tier ?
> Or they are also statistics on the base tier ?
>
> (I ask because I have cold data, but I have full backup jobs running
> each week, reading all of this cold data.)
>
>
>
> - Original Message -
>
> From: "Gregory Farnum"
> To: "Alexandre DERUMIER"
> Cc: "ceph-users"
> Sent: Wednesday, 11 June 2014 21:56:29
> Subject: Re: [ceph-users] tiering : hit_set_count && hit_set_period memory
> usage ?
>
> On Wed, Jun 11, 2014 at 12:44 PM, Alexandre DERUMIER
> > wrote:
> > Hi,
> >
> > I'm reading tiering doc here
> > http://ceph.com/docs/firefly/dev/cache-pool/
> >
> > "
> > The hit_set_count and hit_set_period define how much time each HitSet
> should cover, and how many such HitSets to store. Binning accesses over
> time allows Ceph to independently determine whether an object was accessed
> at least once and whether it was accessed more than once over some time
> period (“age” vs “temperature”). Note that the longer the period and the
> higher the count the more RAM will be consumed by the ceph-osd process. In
> particular, when the agent is active to flush or evict cache objects, all
> hit_set_count HitSets are loaded into RAM"
> >
> > about how much memory do we talk here ? any formula ? (nr object x ? )
>
> We haven't really quantified that yet. In particular, it's going to
> depend on how many objects are accessed within a period; the OSD sizes
> them based on the previous access count and the false positive
> probability that you give it.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Can we map OSDs from different hosts (servers) to a Pool in Ceph

2014-06-12 Thread Gregory Farnum
On Thu, Jun 12, 2014 at 2:21 AM, VELARTIS Philipp Dürhammer
 wrote:
> Hi,
>
> Will ceph support mixing different disk pools (example spinners and ssds) in 
> the future a little bit better (more safe)?

There are no immediate plans to do so, but this is an extension to the
CRUSH language that we're interested in.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


[ceph-users] error (24) Too many open files

2014-06-12 Thread Gregory Farnum
You probably just want to increase the ulimit settings. You can change the
OSD setting, but that only covers file descriptors against the backing
store, not sockets for network communication -- the latter is more often
the one that runs out.
-Greg
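
A few illustrative knobs and checks (values are examples, not recommendations):

  # in ceph.conf, picked up by the init script when the daemons start:
  #   [global]
  #   max open files = 131072
  grep 'Max open files' /proc/$(pidof -s ceph-osd)/limits   # what a running OSD actually got
  sysctl fs.file-max                                        # the system-wide ceiling
  ulimit -n                                                 # the default per-process limit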

On Thursday, June 12, 2014, Christian Kauhaus > wrote:

> Hi,
>
> we have a Ceph cluster with 32 OSDs running on 4 servers (8 OSDs per
> server,
> one for each disk).
>
> From time to time, I see Ceph servers running out of file descriptors. It
> logs
> lines like:
>
> > 2014-06-08 22:15:35.154759 7f850ac25700  0
> filestore(/srv/ceph/osd/ceph-20)
> write couldn't open
> 86.37_head/a63e7df7/rbd_data.1933fe2ae8944a.042c/head//86: (24)
> Too many open files
> > 2014-06-08 22:15:35.255955 7f850ac25700 -1 os/FileStore.cc: In function
> 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&,
> uint64_t,
> int, ThreadPool::TPHandle*)' thread 7f850ac25700 time
> > 2014-06-08 22:15:35.191181 os/FileStore.cc: 2448: FAILED assert(0 ==
> "unexpected error")
>
> but apparently everything proceeds normally after that.
>
> Is the error considered critical? Should I lower "max open files" in
> ceph.conf? Or should I increase the value in /proc/sys/fs/file-max? Has
> anyone
> a good recommendation?
>
> TIA
>
> Christian
>
>
> Reference:
>
> * we are running Ceph Emperor 0.72.2 on Linux 3.10.7.
>
> * full log follows:
>
> 2014-06-08 22:15:34.928660 7f84e6770700  0  cls/lock/cls_lock.cc:89:
> error reading xattr lock.rbd_lock: -24
> 2014-06-08 22:15:34.934733 7f84e6770700  0  cls/lock/cls_lock.cc:384:
> Could not read lock info: Unknown error -24
> 2014-06-08 22:15:35.085361 7f84ecf7d700  0 accepter.accepter no incoming
> connection?  sd = -1 errno 24 Too many open files
> 2014-06-08 22:15:35.125393 7f84ecf7d700  0 accepter.accepter no incoming
> connection?  sd = -1 errno 24 Too many open files
> 2014-06-08 22:15:35.125403 7f84ecf7d700  0 accepter.accepter no incoming
> connection?  sd = -1 errno 24 Too many open files
> 2014-06-08 22:15:35.125407 7f84ecf7d700  0 accepter.accepter no incoming
> connection?  sd = -1 errno 24 Too many open files
> 2014-06-08 22:15:35.125410 7f84ecf7d700  0 accepter.accepter no incoming
> connection?  sd = -1 errno 24 Too many open files
> 2014-06-08 22:15:35.154759 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
> write couldn't open
> 86.37_head/a63e7df7/rbd_data.1933fe2ae8944a.042c/head//86: (24)
> Too many open files
> 2014-06-08 22:15:35.159074 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
> error (24) Too many open files not handled on operation 10 (488954466.1.0,
> or
> op 0, counting from 0)
> 2014-06-08 22:15:35.159095 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
> unexpected error code
> 2014-06-08 22:15:35.159098 7f850ac25700  0 filestore(/srv/ceph/osd/ceph-20)
> transaction dump:
> { "ops": [
> { "op_num": 0,
>   "op_name": "write",
>   "collection": "86.37_head",
>   "oid":
> "a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86",
>   "length": 4096,
>   "offset": 3104768,
>   "bufferlist length": 4096},
> { "op_num": 1,
>   "op_name": "setattr",
>   "collection": "86.37_head",
>   "oid":
> "a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86",
>   "name": "_",
>   "length": 251},
> { "op_num": 2,
>   "op_name": "setattr",
>   "collection": "86.37_head",
>   "oid":
> "a63e7df7\/rbd_data.1933fe2ae8944a.042c\/head\/\/86",
>   "name": "snapset",
>   "length": 31}]}
> 2014-06-08 22:15:35.255955 7f850ac25700 -1 os/FileStore.cc: In function
> 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&,
> uint64_t,
> int, ThreadPool::TPHandle*)' thread 7f850ac25700 time
> 2014-06-08 22:15:35.191181 os/FileStore.cc: 2448: FAILED assert(0 ==
> "unexpected error")
>
> --
> Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
> gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
> http://gocept.com · tel +49 345 219401-11
> Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [ceph] OSD priority / client localization

2014-06-12 Thread Gregory Farnum
You can set up pools which have all their primaries in one data
center, and point the clients at those pools. But writes will still
have to traverse the network link because Ceph does synchronous
replication for strong consistency.

If you want them to both write to the same pool, but use local OSDs:
no, you can't do that.
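For the pool-per-datacenter approach, the rule would look roughly like this
(bucket names are made up; adjust to your CRUSH map):

    rule dc1-primary {
        ruleset 4
        type replicated
        min_size 1
        max_size 10
        step take dc1
        step chooseleaf firstn 1 type host
        step emit
        step take dc2
        step chooseleaf firstn -1 type host
        step emit
    }

and then point the pool at it with "ceph osd pool set <pool> crush_ruleset 4".
That puts the primary in dc1 and the remaining replicas in dc2 -- but again,
writes still wait for all the replicas.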
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 12, 2014 at 12:12 AM, NEVEU Stephane
 wrote:
> Hi all,
>
>
>
> One short question quite useful for me :
>
> Is there a way to set up a highest osd/host priority for some clients in a
> datacenter and do the opposite in another datacenter ? I mean, my network
> links between those datacenters will be used in case of failover for clients
> accessing data on ceph. So clearly I’d like my clients to have higher
> priorities on the nearest hosts/osds.
>
> Is it possible to do so ?
>
>
>
> Thanks,
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] spiky io wait within VMs running on rbd

2014-06-12 Thread Gregory Farnum
To be clear, that's the solution to one of the causes of this issue.
The log message is very general, and just means that a disk access
thread has been gone for a long time (15 seconds, in this case)
without checking in (so usually, it's been inside of a read/write
syscall for >=15 seconds).
Other causes include simple overload of the OSDs in question, or a
broken local filesystem, or...
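If you want to see what the OSD was actually stuck on, the admin socket is the
place to look, e.g. (osd id made up):

    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops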
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Jun 12, 2014 at 1:59 PM, Mark Nelson  wrote:
> Can you check and see if swap is being used on your OSD servers when this
> happens, and even better, use something like collectl or another tool to
> look for major page faults?
>
> If you see anything like this, you may want to tweak swappiness to be lower
> (say 10).
>
> Mark
>
>
> On 06/12/2014 03:17 PM, Xu (Simon) Chen wrote:
>>
>> I've done some more tracing. It looks like the high IO wait in VMs are
>> somewhat correlated when some OSDs have high inflight ops (ceph admin
>> socket, dump_ops_in_flight).
>>
>> When in_flight_ops is high, I see something like this in the OSD log:
>> 2014-06-12 19:57:24.572338 7f4db6bdf700  1 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15
>>
>> Any ideas why this happens?
>>
>> Thanks.
>> -Simon
>>
>>
>>
>> On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson wrote:
>>
>> On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
>>
>> 1) I did check iostat on all OSDs, and iowait seems normal.
>> 2) ceph -w shows no correlation between high io wait and high
>> iops.
>> Sometimes the reverse is true: when io wait is high (since it's a
>> cluster wide thing), the overall ceph iops drops too.
>>
>>
>> Not sure if you are doing it yet, but you may want to look at the
>> statistics the OSDs can provide via the admin socket, especially
>> outstanding operations and dump_historic_ops.  If you look at these
>> for all of your OSDs you can start getting a feel for whether any
>> specific OSDs are slow and if so, what slow ops are hanging up on.
>>
>> 3) We have collectd running in VMs, and that's how we identified
>> the
>> frequent high io wait. This happens for even lightly used VMs.
>>
>> Thanks.
>> -Simon
>>
>>
>> On Thu, Jun 12, 2014 at 9:26 AM, David wrote:
>>
>>  Hi Simon,
>>
>>  Did you check iostat on the OSDs to check their
>> utilization? What
>>  does your ceph -w say - pehaps you’re maxing your cluster’s
>> IOPS?
>>  Also, are you running any monitoring of your VMs iostats?
>> We’ve
>>  often found some culprits overusing IOs..
>>
>>  Kind Regards,
>>  David Majchrzak
>>
>>  On 12 Jun 2014, at 15:22, Xu (Simon) Chen wrote:
>>
>>
>>
>>   > Hi folks,
>>   >
>>   > We have two similar ceph deployments, but one of them is
>> having
>>  trouble: VMs running with ceph-provided block devices are
>> seeing
>>  frequent high io wait, every a few minutes, usually 15-20%,
>> but as
>>  high as 60-70%. This is cluster-wide and not correlated
>> with VM's IO
>>  load. We turned on rbd cache and enabled writeback in qemu,
>> but the
>>  problem persists. No-deepscrub doesn't help either.
>>   >
>>   > Without providing any one of our probably wrong
>> theories, any
>>  ideas on how to troubleshoot?
>>   >
>>   > Thanks.
>>   > -Simon

Re: [ceph-users] Fixing inconsistent placement groups

2014-06-12 Thread Gregory Farnum
The OSD should have logged the identities of the inconsistent objects
to the central log on the monitors, as well as to its own local log
file. You'll need to identify for yourself which version is correct,
which will probably involve going and looking at them inside each
OSD's data store. If the primary is correct for all the objects in a
PG, you can just run repair; otherwise you'll want to copy the
replica's copy to the primary. Sorry. :/
(If you have no way of checking yourself which is correct, and you
have more than 2 replicas, you can compare the stored copies and just
take the one held by the majority — that's probably correct.)
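The rough sequence is something like this (PG id made up):

    ceph health detail                   # lists the inconsistent PGs
    grep ERR /var/log/ceph/ceph.log      # on a monitor; names the objects that failed scrub
    # ...inspect the copies on each OSD, and only once you're happy that the
    # primary's copy is the good one:
    ceph pg repair 2.37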
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 12, 2014 at 7:27 PM, Aaron Ten Clay  wrote:
> I'm having trouble finding a concise set of steps to repair inconsistent
> placement groups. I know from other threads that issuing a 'ceph pg repair
> ...' command could cause loss of data integrity if the primary OSD happens
> to have the bad copy of the placement group. I know how to find which PG's
> are bad (ceph pg dump), but I'm not sure how to figure out which objects in
> the PG failed their CRCs during the deep scrub, and I'm not sure how to get
> the correct CRC so I can determine which OSD holds the correct copy.
>
> Maybe I'm on the wrong path entirely? If someone knows how to resolve this,
> I'd appreciate some insight. I think this would be a good topic for adding
> to the OSD/PG operations section of the manual, or at least a wiki article.
>
> Thanks!
> -Aaron
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Run ceph from source code

2014-06-13 Thread Gregory Farnum
I don't know anybody who makes much use of "make install", so it's
probably not putting the init system scripts into place. So make sure
they aren't there, copy them from the source tree, and try again?
Patches to fix are welcome! :)
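Something along these lines usually does it (assuming a sysvinit distro and a
source tree in ~/ceph; the script is produced by the build, so run make first):

    sudo cp ~/ceph/src/init-ceph /etc/init.d/ceph
    sudo chmod +x /etc/init.d/ceph
    sudo /etc/init.d/ceph -c /etc/ceph/ceph.conf start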
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jun 13, 2014 at 1:41 PM, Zhe Zhang  wrote:
> Hello, there,
>
>
>
> I am trying to run ceph from source code. configure, make and make install
> worked fine. But after done these steps, I can't see the binary files in
> /etc/init.d/. My current OS is Centos6.5. I also tried Ubuntu 12.04, the
> same issue occurred which said "unknown job ceph..." when I tried to use
> upstart to run monitors and osds. How should I start ceph with source code?
> basically I hope I could modified the code and run it from there.
>
>
>
> Zhe
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD turned itself off

2014-06-13 Thread Gregory Farnum
The OSD did a read off of the local filesystem and it got back the EIO
error code. That means the store got corrupted or something, so it
killed itself to avoid spreading bad data to the rest of the cluster.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson  wrote:
> Hey,
>
> Just examining what happened to an OSD that was just turned off. Data has
> been moved away from it, so I'm hesitating to turn it back on.
>
> Got the below in the logs, any clues to what the assert talks about?
>
> Cheers,
> Josef
>
> -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const
> hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88
> c700 time 2014-06-11 21:13:54.036982
> os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio ||
> got != -5)
>
>  ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>  1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long,
> ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
>  2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector std::allocator >&)+0x350) [0x708230]
>  3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86)
> [0x713366]
>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0x3095) [0x71acb5]
>  5: (PG::do_request(std::tr1::shared_ptr,
> ThreadPool::TPHandle&)+0x3f0) [0x812340]
>  6: (OSD::dequeue_op(boost::intrusive_ptr,
> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x2ea) [0x75c80a]
>  7: (OSD::OpWQ::_process(boost::intrusive_ptr,
> ThreadPool::TPHandle&)+0x198) [0x770da8]
>  8: (ThreadPool::WorkQueueVal,
> std::tr1::shared_ptr >, boost::intrusive_ptr
>>::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89
> ce]
>  9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
>  10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
>  11: (()+0x6b50) [0x7fdadffdfb50]
>  12: (clone()+0x6d) [0x7fdade53b0ed]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD turned itself off

2014-06-13 Thread Gregory Farnum
On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson  wrote:
> Hi Greg,
>
> Thanks for the clarification. I believe the OSD was in the middle of a deep
> scrub (sorry for not mentioning this straight away), so then it could've
> been a silent error that got wind during scrub?

Yeah.

>
> What's best practice when the store is corrupted like this?

Remove the OSD from the cluster, and either reformat the disk or
replace as you judge appropriate.
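The usual removal sequence, assuming the dead OSD is osd.12:

    ceph osd out 12              # if the cluster hasn't already marked it out
    # wait for recovery to finish, then:
    /etc/init.d/ceph stop osd.12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12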
-Greg

>
> Cheers,
> Josef
>
> Gregory Farnum skrev 2014-06-14 02:21:
>
>> The OSD did a read off of the local filesystem and it got back the EIO
>> error code. That means the store got corrupted or something, so it
>> killed itself to avoid spreading bad data to the rest of the cluster.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson 
>> wrote:
>>>
>>> Hey,
>>>
>>> Just examining what happened to an OSD that was just turned off. Data has
>>> been moved away from it, so I'm hesitating to turn it back on.
>>>
>>> Got the below in the logs, any clues to what the assert talks about?
>>>
>>> Cheers,
>>> Josef
>>>
>>> -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t,
>>> const
>>> hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88
>>> c700 time 2014-06-11 21:13:54.036982
>>> os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio
>>> ||
>>> got != -5)
>>>
>>>   ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>>>   1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned
>>> long,
>>> ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
>>>   2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
>>> std::vector>> std::allocator >&)+0x350) [0x708230]
>>>   3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86)
>>> [0x713366]
>>>   4: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0x3095)
>>> [0x71acb5]
>>>   5: (PG::do_request(std::tr1::shared_ptr,
>>> ThreadPool::TPHandle&)+0x3f0) [0x812340]
>>>   6: (OSD::dequeue_op(boost::intrusive_ptr,
>>> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x2ea) [0x75c80a]
>>>   7: (OSD::OpWQ::_process(boost::intrusive_ptr,
>>> ThreadPool::TPHandle&)+0x198) [0x770da8]
>>>   8: (ThreadPool::WorkQueueVal,
>>> std::tr1::shared_ptr >, boost::intrusive_ptr
>>>>
>>>> ::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89
>>>
>>> ce]
>>>   9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
>>>   10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
>>>   11: (()+0x6b50) [0x7fdadffdfb50]
>>>   12: (clone()+0x6d) [0x7fdade53b0ed]
>>>   NOTE: a copy of the executable, or `objdump -rdS ` is
>>> needed to
>>> interpret this.
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixing inconsistent placement groups

2014-06-16 Thread Gregory Farnum
On Mon, Jun 16, 2014 at 7:13 AM, Markus Blank-Burian  wrote:
> I am also having inconsistent PGs (running ceph v0.80.1), where some
> objects are missing. Excerpt from the logs (many similar lines):
> "0.7f1 shard 66 missing a32857f1/1129786./head//0"

Shard...66? Really, that's what it says? Can you copy a few lines of the output?


> The primary PG and one copy only have 453MB data of the PG, but a
> third copy exists with 3.1GB data. The referenced objects (identified
> by filename) are also present on another third OSD. First try: Move
> "0.7f1_head" to a backup directory on both first and second OSD. This
> resulted in a same 453MB copy with missing objects on the primary OSD.
> Shouldn't all the data be copied automatically?
>
> So i tried to copy the whole PG directory "0.7f1_head" from the third
> OSD to the primary. This results the following assert:
> 2014-06-16T15:49:29+02:00 kaa-96 ceph-osd: -2> 2014-06-16
> 15:49:29.046925 7f2e86b93780 10 osd.1 197813 pgid 0.7f1 coll
> 0.7f1_head
> 2014-06-16T15:49:29+02:00 kaa-96 ceph-osd: -1> 2014-06-16
> 15:49:29.047033 7f2e86b93780 10 filestore(/local/ceph)
> collection_getattr /local/ceph/current/0.7f1_head 'info' = -61
> 2014-06-16T15:49:29+02:00 kaa-96 ceph-osd:  0> 2014-06-16
> 15:49:29.048966 7f2e86b93780 -1 osd/PG.cc: In function 'static epoch_t
> PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> ceph::bufferlist*)' thread 7f2e86b93780 time 2014-06-16
> 15:49:29.047045
> osd/PG.cc: 2559: FAILED assert(r > 0)
>
>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>  1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> ceph::buffer::list*)+0x48d) [0x742a8b]
>  2: (OSD::load_pgs()+0xda3) [0x64c419]
>  3: (OSD::init()+0x780) [0x64e9ce]
>  4: (main()+0x25d9) [0x602cbf]
>
> Am i missing something?

This may be tangling up some of the other issues you're seeing, but it
looks like you didn't preserve xattrs (at least on the directory).
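If you do copy PG directories around by hand, stop both OSDs first and use
something that keeps the xattrs, e.g.:

    rsync -a -X /path/to/osd.X/current/0.7f1_head/ /path/to/osd.Y/current/0.7f1_head/

(paths made up; plain cp won't carry the xattrs over unless you ask it to).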


> And wouldn't it be relatively easy to
> implement an option to "pg repair" to choose a backup OSD as source
> instead of the primary OSD?

Umm, maybe. Tickets welcome!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


>
> It is still unclear, where these inconsistencies (i.e. missing objects
> / empty directories) result from, see also:
> http://tracker.ceph.com/issues/8532.
>
> On Fri, Jun 13, 2014 at 4:58 AM, Gregory Farnum  wrote:
>> The OSD should have logged the identities of the inconsistent objects
>> to the central log on the monitors, as well as to its own local log
>> file. You'll need to identify for yourself which version is correct,
>> which will probably involve going and looking at them inside each
>> OSD's data store. If the primary is correct for all the objects in a
>> PG, you can just run repair; otherwise you'll want to copy the
>> replica's copy to the primary. Sorry. :/
>> (If you have no way of checking yourself which is correct, and you
>> have more than 2 replicas, you can compare the stored copies and just
>> take the one held by the majority — that's probably correct.)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Thu, Jun 12, 2014 at 7:27 PM, Aaron Ten Clay  wrote:
>>> I'm having trouble finding a concise set of steps to repair inconsistent
>>> placement groups. I know from other threads that issuing a 'ceph pg repair
>>> ...' command could cause loss of data integrity if the primary OSD happens
>>> to have the bad copy of the placement group. I know how to find which PG's
>>> are bad (ceph pg dump), but I'm not sure how to figure out which objects in
>>> the PG failed their CRCs during the deep scrub, and I'm not sure how to get
>>> the correct CRC so I can determine which OSD holds the correct copy.
>>>
>>> Maybe I'm on the wrong path entirely? If someone knows how to resolve this,
>>> I'd appreciate some insight. I think this would be a good topic for adding
>>> to the OSD/PG operations section of the manual, or at least a wiki article.
>>>
>>> Thanks!
>>> -Aaron
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fixing inconsistent placement groups

2014-06-16 Thread Gregory Farnum
On Mon, Jun 16, 2014 at 11:11 AM, Aaron Ten Clay  wrote:
> I would also like to see Ceph get smarter about inconsistent PGs. If we
> can't automate the repair, at least the "ceph pg repair" command should
> figure out which copy is correct and use that, instead of overwriting all
> OSDs with whatever the primary has.
>
> Is it impossible to get the expected CRC out of Ceph so I can detect which
> object is correct, instead of looking at the contents or comparing copies
> from multiple OSDs?

Maintaining such CRCs is pretty unlikely to happen (for replicated pools) until
there's kernel support for end-to-end data integrity. I imagine that
our next step will be a vote-for-correctness system, but it needs to
be designed and up until nowish there just haven't been enough people
running the software and getting inconsistent PGs for it to be a pain
point.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster status reported wrongly as HEALTH_WARN

2014-06-17 Thread Gregory Farnum
Try running "ceph health detail" on each of the monitors. Your disk space
thresholds probably aren't configured correctly or something.
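If the warnings do turn out to be about the monitors' disks, the relevant knobs
in ceph.conf (under [mon] or [global]) are:

    mon data avail warn = 30
    mon data avail crit = 5

(those should be the defaults, if I remember right -- raise or lower to taste).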
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Jun 17, 2014 at 2:09 AM, Andrija Panic 
wrote:

> Hi,
>
> thanks for that, but is not space issue:
>
> OSD drives are only 12% full.
> and /var drive on which MON lives is over 70% only on CS3 server, but I
> have increased alert treshold in ceph.conf (mon data avail warn = 15, mon
> data avail crit = 5), and since I increased them those alerts are gone
> (anyway, these alerts for /var full over 70% can be normally seen in logs
> and in ceph -w output).
>
> Here I get no normal/visible warning in eather logs or ceph -w output...
>
> Thanks,
> Andrija
>
>
>
>
> On 17 June 2014 11:00, Stanislav Yanchev  wrote:
>
>> Try grep in cs1 and cs3 could be a disk space issue.
>>
>>
>>
>>
>>
>> Regards,
>>
>> *Stanislav Yanchev*
>> Core System Administrator
>>
>>
>> Mobile: +359 882 549 441
>> s.yanc...@maxtelecom.bg
>> www.maxtelecom.bg
>>
>>
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrija Panic
>> Sent: Tuesday, June 17, 2014 11:57 AM
>> To: Christian Balzer
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Cluster status reported wrongly as HEALTH_WARN
>>
>>
>>
>> Hi Christian,
>>
>>
>>
>> that seems true, thanks.
>>
>>
>>
>> But again, there are only occurence in GZ logs files (that were
>> logrotated, not in current log files):
>>
>> Example:
>>
>>
>>
>> [root@cs2 ~]# grep -ir "WRN" /var/log/ceph/
>>
>> Binary file /var/log/ceph/ceph-mon.cs2.log-20140612.gz matches
>>
>> Binary file /var/log/ceph/ceph.log-20140614.gz matches
>>
>> Binary file /var/log/ceph/ceph.log-20140611.gz matches
>>
>> Binary file /var/log/ceph/ceph.log-20140612.gz matches
>>
>> Binary file /var/log/ceph/ceph.log-20140613.gz matches
>>
>>
>>
>> Thanks,
>>
>> Andrija
>>
>>
>>
>> On 17 June 2014 10:48, Christian Balzer  wrote:
>>
>>
>> Hello,
>>
>>
>> On Tue, 17 Jun 2014 10:30:44 +0200 Andrija Panic wrote:
>>
>> > Hi,
>> >
>> > I have 3 node (2 OSD per node) CEPH cluster, running fine, not much
>> data,
>> > network also fine:
>> > Ceph ceph-0.72.2.
>> >
>> > When I issue "ceph status" command, I get randomly HEALTH_OK, and
>> > imidiately after that when repeating command, I get HEALTH_WARN
>> >
>> > Examle given down - these commands were issues within less than 1 sec
>> > between them
>> > There are NO occuring of word "warn" in the logs (grep -ir "warn"
>> > /var/log/ceph) on any of the servers...
>> > I get false alerts with my status monitoring script, for this reason...
>> >
>>
>> If I recall correctly, the logs will show INF, WRN and ERR, so grep for
>> WRN.
>>
>> Regards,
>>
>> Christian
>>
>>
>> > Any help would be greatly appriciated.
>> >
>> > Thanks,
>> >
>> > [root@cs3 ~]# ceph status
>> > cluster cab20370-bf6a-4589-8010-8d5fc8682eab
>> >  health HEALTH_OK
>> >  monmap e2: 3 mons at
>> >
>> {cs1=10.44.xxx.10:6789/0,cs2=10.44.xxx.11:6789/0,cs3=10.44.xxx.12:6789/0},
>> > election epoch 122, quorum 0,1,2 cs1,cs2,cs3
>> >  osdmap e890: 6 osds: 6 up, 6 in
>> >   pgmap v2379904: 448 pgs, 4 pools, 862 GB data, 217 kobjects
>> > 2576 GB used, 19732 GB / 22309 GB avail
>> >  448 active+clean
>> >   client io 17331 kB/s rd, 113 kB/s wr, 176 op/s
>> >
>> > [root@cs3 ~]# ceph status
>> > cluster cab20370-bf6a-4589-8010-8d5fc8682eab
>> >  health HEALTH_WARN
>> >  monmap e2: 3 mons at
>> >
>> {cs1=10.44.xxx.10:6789/0,cs2=10.44.xxx.11:6789/0,cs3=10.44.xxx.12:6789/0},
>> > election epoch 122, quorum 0,1,2 cs1,cs2,cs3
>> >  osdmap e890: 6 osds: 6 up, 6 in
>> >   pgmap v2379905: 448 pgs, 4 pools, 862 GB data, 217 kobjects
>> > 2576 GB used, 19732 GB / 22309 GB avail
>> >  448 active+clean
>> >   client io 28383 kB/s rd, 566 kB/s wr, 321 op/s
>> >
>> > [root@cs3 ~]# ceph status
>> > cluster cab20370-bf6a-4589-8010-8d5fc8682eab
>> >  health HEALTH_OK
>> >  monmap e2: 3 mons at
>> >
>> {cs1=10.44.xxx.10:6789/0,cs2=10.44.xxx.11:6789/0,cs3=10.44.xxx.12:6789/0},
>> > election epoch 122, quorum 0,1,2 cs1,cs2,cs3
>> >  osdmap e890: 6 osds: 6 up, 6 in
>> >   pgmap v2379913: 448 pgs, 4 pools, 862 GB data, 217 kobjects
>> > 2576 GB used, 19732 GB / 22309 GB avail
>> >  448 active+clean
>> >   client io 21632 kB/s rd, 49354 B/s wr, 283 op/s
>> >
>>
>>
>> --
>>
>> Christian BalzerNetwork/Systems Engineer
>> ch...@gol.com   Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
>>
>>
>>
>>
>>
>> --
>>
>>
>>
>> Andrija Panić
>>
>> --
>>
>>   http://admintweets.com
>>
>> --
>>

Re: [ceph-users] Data versus used space inconsistency

2014-06-17 Thread Gregory Farnum
You probably have sparse objects from RBD. The PG statistics are built off
of file sizes (apparent sizes), but the total used space comes from df
output (actually allocated blocks).
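You can see the same effect with any sparse file:

    truncate -s 1G sparsefile
    ls -lh sparsefile     # 1.0G apparent size -- what the pool stats count
    du -h sparsefile      # 0 allocated blocks -- what df / "total used" sees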
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 16, 2014 at 7:34 PM, Christian Balzer  wrote:
>
> Hello,
>
> this is is a 0.80.1 cluster, upgraded from emperor. I'm mentioning the
> later since I don't recall seeing this back with emperor, it was a perfect
> match then.
> The pools are all set to a replication of 2, only the rbd one is used.
> So a having less than 2x the amount of actual data being used gives me
> quite the pause and cause to worries:
>
>   pgmap v2876480: 1152 pgs, 3 pools, 642 GB data, 168 kobjects
> 1246 GB used, 98932 GB / 100178 GB avail
> 1152 active+clean
>
> My test cluster and every other ceph -s output I've seen always used
> double (tripple) or more than that compared to the actual data, never less
> than the replication factor.
>
> So are there some objects that are not replicated twice, despite having a
> clean health and after several scrubs including deep ones?
>
> Or is that some stale data that very much intentionally isn't getting
> replicated? (I never used snapshots, FWIW)
>
> Either way how can I find out what is going on here?
>
> Regards,
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about RADOS object consistency

2014-06-17 Thread Gregory Farnum
On Tue, Jun 17, 2014 at 3:22 AM, Ke-fei Lin  wrote:
> Hi list,
>
> How does RADOS check an object and its replica are consistent? Is there
> a checksum in object's metadata or some other mechanisms? Does the
> mechanism depend on
> OSD's underlying file system?

It does not check consistency on read. On scrub it compares the local
FS metadata (size et al) and RADOS metadata (object versions and
things); on deep scrub it computes a checksum of each replica and
compares them.
RADOS does not maintain checksums alongside the objects in replicated pools.
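You can trigger both kinds by hand on a single PG if you want to check
something specific (PG id made up):

    ceph pg scrub 0.1a
    ceph pg deep-scrub 0.1a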

> And what would happen if a corrupted object being readed (like a
> corrupted block in traditional file system)?

If the local filesystem doesn't return an error, it will return the
data it was given to the end user. (btrfs maintains its own checksums
and will return errors, but unfortunately xfs will not.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding private network AFTER cluster creation ?

2014-06-17 Thread Gregory Farnum
On Tue, Jun 17, 2014 at 5:00 AM, Florent B  wrote:
> Hi all,
>
> I would like to know if I can add a private network to my running Ceph
> cluster ?
>
> And how to proceed ? I add the config to ceph.conf, then restart osd's ?
> So, some OSD will have both networks and others not.

Yeah. As long as the OSDs can route to each other on both networks,
you can do a rolling upgrade.
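i.e. just add something like this (example subnets) to ceph.conf and restart
the OSDs one at a time:

    [global]
        public network  = 192.168.1.0/24
        cluster network = 10.10.1.0/24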
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> What is the best way to do it ? :)
>
> Thank you
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephx authentication issue

2014-06-17 Thread Gregory Farnum
It's unlikely to be the issue, but you might check the times on your OSDs.
cephx is clock-sensitive if you're off by more than an hour or two.
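A quick way to compare them (hostnames made up, assumes NTP is in use):

    for h in osd-node1 osd-node2 osd-node3; do ssh $h date +%s; done
    ntpq -p    # on each node, to see whether it's actually syncing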
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Jun 17, 2014 at 8:30 AM, Fred Yang  wrote:

> What's strange is that the OSD rebalance obviously has no problem; it's just
> that new objects can't be written, since the new segments can't be distributed
> to the new OSDs.
>
> Here is the error from radosgw.log:
>
> 2014-06-17 10:34:01.568754 7fc7e83f4700  0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 2014-06-17 10:34:01.568763 7fc7e83f4700  0 -- 172.17.9.218:0/1034041 >>
> 10.122.134.204:6820/14745 pipe(0x1e24710 sd=11 :54045 s=1 pgs=0 cs=0 l=1
> c=0x1e23db0).failed verifying authorize reply
>
> So it appears OSD can authenticate with each other, but the key generated
> between client and mon are only visible to existing OSDs, but not new OSDs
> just added?
>
> I'm trying increase cephx debug level on mon but it seems hanging:
>
> # ceph tell mon.* injectarts '--debug-auth=5'
> no valid command found; 10 closest matches:
> config-key exists 
> config-key list
> config-key put  {}
> config-key del 
> osd tier remove-overlay 
> config-key get 
> osd tier cache-mode  none|writeback|invalidate+forward|readonly
> osd tier set-overlay  
> mon remove 
> osd tier remove  
> mon.nysanlab04: Error EINVAL: invalid command
> mon.nysanlab04: invalid command
> 2014-06-17 11:08:20.995510 7ffcd078b700  0 -- 172.17.9.218:0/1001296 >>
> 172.17.9.219:6789/0 pipe(0x7ffccc021fe0 sd=4 :0 s=1 pgs=0 cs=0 l=1
> c=0x7ffccc022240).fault
>
> Am I using the wrong syntax to increase the debug level for auth in mon? Or
> something wrong with the cluster?
>
>
> On Mon, Jun 16, 2014 at 5:56 PM, John Wilkins 
> wrote:
>
>> Did you run ceph-deploy in the directory where you ran ceph-deploy new
>> and ceph-deploy gatherkeys? That's where the monitor bootstrap key should
>> be.
>>
>>
>> On Mon, Jun 16, 2014 at 8:49 AM, Fred Yang 
>> wrote:
>>
>>>  I'm adding three OSD nodes(36 osds in total) to existing 3-node
>>> cluster(35 osds) using ceph-deploy, after disks prepared and OSDs
>>> activated, the cluster re-balanced and shows all pgs active+clean:
>>>
>>>  osdmap e820: 72 osds: 71 up, 71 in
>>>   pgmap v173328: 15920 pgs, 17 pools, 12538 MB data, 3903 objects
>>> 30081 MB used, 39631 GB / 39660 GB avail
>>>15920 active+clean
>>>
>>> However, the object write start having issue since the new OSDs added to
>>> cluster:
>>>
>>> 2014-06-16 11:36:36.421868 osd.35 [WRN] slow request 30.317529 seconds
>>> old, received at 2014-06-16 11:36:06.104256: osd_op(client.5568.0:1502400
>>> default.5250.4_loadtest/512B_file [getxattrs,stat] 9.552a7900 e820) v4
>>> currently waiting for rw locks
>>>
>>> And from existing osd log, it seems it's having problem to authenticate
>>> the new OSDs (10.122.134.204 is the IP of one of new OSD nodes) :
>>>
>>> 2014-06-16 11:38:25.281270 7f58562ce700  0 cephx: verify_reply couldn't
>>> decrypt with error: error decoding block for decryption
>>> 2014-06-16 11:38:25.281288 7f58562ce700  0 -- 172.17.9.218:6811/2047255
>>> >> 10.122.134.204:6831/17571 pipe(0x2891280 sd=90 :48493 s=1 pgs=3091
>>> cs=10 l=0 c=0x62d1840).failed verifying authorize reply
>>>
>>>
>>> The cephx auth list shows good to me:
>>>
>>> exported keyring for osd.45
>>> [osd.45]
>>> key = AQAoCp5TqBq/MhAANwclbs1nCgefNfxqqPnkZQ==
>>> caps mon = "allow profile osd"
>>> caps osd = "allow *"
>>>
>>> The key above does not match the keyring on osd.45.
>>>
>>> Anybody have any clue what might be the authentication issue here? I'm
>>> running Ceph 0.72.2.
>>>
>>> Thanks in advance,
>>> Fred
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>>
>> --
>> John Wilkins
>> Senior Technical Writer
>> Intank
>> john.wilk...@inktank.com
>> (415) 425-9599
>> http://inktank.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster status reported wrongly as HEALTH_WARN

2014-06-18 Thread Gregory Farnum
The lack of warnings in ceph -w for this issue is a bug in Emperor.
It's resolved in Firefly.
-Greg

On Wed, Jun 18, 2014 at 3:49 AM, Andrija Panic  wrote:
>
> Hi Gregory,
>
> indeed - I still have warnings about 20% free space on CS3 server, where MON 
> lives...strange is that I don't get these warnings with prolonged "ceph -w" 
> output...
> [root@cs2 ~]# ceph health detail
> HEALTH_WARN
> mon.cs3 addr 10.44.xxx.12:6789/0 has 20% avail disk space -- low disk space!
>
> I don't understand how it is possible to get warnings - I have the following in
> each ceph.conf file, under the general section:
>
> mon data avail warn = 15
> mon data avail crit = 5
>
> I found this settings on ceph mailing list...
>
> Thanks a lot,
> Andrija
>
>
> On 17 June 2014 19:22, Gregory Farnum  wrote:
>>
>> Try running "ceph health detail" on each of the monitors. Your disk space 
>> thresholds probably aren't configured correctly or something.
>> -Greg
>>
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Tue, Jun 17, 2014 at 2:09 AM, Andrija Panic  
>> wrote:
>>>
>>> Hi,
>>>
>>> thanks for that, but is not space issue:
>>>
>>> OSD drives are only 12% full.
>>> and /var drive on which MON lives is over 70% only on CS3 server, but I 
>>> have increased alert treshold in ceph.conf (mon data avail warn = 15, mon 
>>> data avail crit = 5), and since I increased them those alerts are gone 
>>> (anyway, these alerts for /var full over 70% can be normally seen in logs 
>>> and in ceph -w output).
>>>
>>> Here I get no normal/visible warning in eather logs or ceph -w output...
>>>
>>> Thanks,
>>> Andrija
>>>
>>>
>>>
>>>
>>> On 17 June 2014 11:00, Stanislav Yanchev  wrote:
>>>>
>>>> Try grep in cs1 and cs3 could be a disk space issue.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Stanislav Yanchev
>>>> Core System Administrator
>>>>
>>>>
>>>>
>>>> Mobile: +359 882 549 441
>>>> s.yanc...@maxtelecom.bg
>>>> www.maxtelecom.bg
>>>>
>>>>
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>>> Andrija Panic
>>>> Sent: Tuesday, June 17, 2014 11:57 AM
>>>> To: Christian Balzer
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Cluster status reported wrongly as HEALTH_WARN
>>>>
>>>>
>>>>
>>>> Hi Christian,
>>>>
>>>>
>>>>
>>>> that seems true, thanks.
>>>>
>>>>
>>>>
>>>> But again, there are only occurence in GZ logs files (that were 
>>>> logrotated, not in current log files):
>>>>
>>>> Example:
>>>>
>>>>
>>>>
>>>> [root@cs2 ~]# grep -ir "WRN" /var/log/ceph/
>>>>
>>>> Binary file /var/log/ceph/ceph-mon.cs2.log-20140612.gz matches
>>>>
>>>> Binary file /var/log/ceph/ceph.log-20140614.gz matches
>>>>
>>>> Binary file /var/log/ceph/ceph.log-20140611.gz matches
>>>>
>>>> Binary file /var/log/ceph/ceph.log-20140612.gz matches
>>>>
>>>> Binary file /var/log/ceph/ceph.log-20140613.gz matches
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Andrija
>>>>
>>>>
>>>>
>>>> On 17 June 2014 10:48, Christian Balzer  wrote:
>>>>
>>>>
>>>> Hello,
>>>>
>>>>
>>>> On Tue, 17 Jun 2014 10:30:44 +0200 Andrija Panic wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > I have 3 node (2 OSD per node) CEPH cluster, running fine, not much data,
>>>> > network also fine:
>>>> > Ceph ceph-0.72.2.
>>>> >
>>>> > When I issue "ceph status" command, I get randomly HEALTH_OK, and
>>>> > imidiately after that when repeating command, I get HEALTH_WARN
>>>> >
>>>> > Examle given down - these commands were issues within less than 1 sec
>>>> > between them
>>>> > There are NO occuring of word "warn" in the logs (grep -ir "warn"

Re: [ceph-users] Adding private network AFTER cluster creation ?

2014-06-18 Thread Gregory Farnum
On Tue, Jun 17, 2014 at 4:08 PM, Florent B  wrote:
> Ok so during "upgrade", I need routing between both networks ?
>
> And then, when all nodes are reconfigured, no more routing is needed (if
> I understood well).

Yeah, although the OSDs will of course need to be able to route to the monitors.

>
> Does monitors need restart ?

Not from Ceph's perspective!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> On 06/17/2014 07:29 PM, Gregory Farnum wrote:
>> On Tue, Jun 17, 2014 at 5:00 AM, Florent B  wrote:
>>> Hi all,
>>>
>>> I would like to know if I can add a private network to my running Ceph
>>> cluster ?
>>>
>>> And how to proceed ? I add the config to ceph.conf, then restart osd's ?
>>> So, some OSD will have both networks and others not.
>> Yeah. As long as the OSDs can route to each other on both networks,
>> you can do a rolling upgrade.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>> What is the best way to do it ? :)
>>>
>>> Thank you
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about RADOS object consistency

2014-06-18 Thread Gregory Farnum
On Tue, Jun 17, 2014 at 9:46 PM, Ke-fei Lin  wrote:
> 2014-06-18 1:28 GMT+08:00 Gregory Farnum :
>> On Tue, Jun 17, 2014 at 3:22 AM, Ke-fei Lin  wrote:
>>> Hi list,
>>>
>>> How does RADOS check an object and its replica are consistent? Is there
>>> a checksum in object's metadata or some other mechanisms? Does the
>>> mechanism depend on OSD's underlying file system?
>>
>> It does not check consistency on read. On scrub it compares the local
>> FS metadata (size et al) and RADOS metadata (object versions and
>> things); on deep scrub it computes a checksum of each replica and
>> compares them.
> Thank you Greg.
> Let's say if there are an object A and its replica B. On deep scrubbing RADOS
> find that two objects have different checksums. How does RADOS determine
> and repair the corrupted object?

You have to explicitly trigger a scrub "repair". Right now, whatever
the primary has wins; that's obviously suboptimal. (So generally you
should try and get manually involved with repairs.)

>> RADOS does not maintain checksums alongside the objects in replicated pools.
>>
>>> And what would happen if a corrupted object being readed (like a
>>> corrupted block in traditional file system)?
>>
>> If the local filesystem doesn't return an error, it will return the
>> data it was given to the end user. (btrfs maintains its own checksums
> This sounds kind of dangerous. I think corrupted objects will be normal 
> instead
> of exception because we usually build up Ceph cluster by commodity hardware.
>> and will return errors, but unfortunately xfs will not.)
> And it seems there are lots of people still using XFS...
> By the way, is this the main reason that Ceph officially suggests btrfs?

Well, we officially suggest XFS for other reasons, but it is why our
long-term vision is to run on btrfs.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier pool in CephFS

2014-06-18 Thread Gregory Farnum
On Wed, Jun 18, 2014 at 12:54 AM, Sherry Shahbazi  wrote:
> Hi everyone,
>
> If I have a pool called cold-storage (1) and a pool called hot-storage (2)
> that hot-storage is a cache tier for the cold-storage.
>
> I normally do the followings in order to map a directory in my client to a
> pool.
>
> on a Ceph monitor,
> ceph mds add_data_pool 1
> ceph mds add_data_pool 2
> Q1) Do I need to add both cold-storage and hot-storage to the data pools?

Nooo...generally whenever using cache pools, you should only refer to
the base pool. (You might change that if using read-only caching or
something, but none of that is really "certified".)

>
> on client,
> mkdir /mnt/test
> mount -t ceph ceph-mon1:6789,ceph-mon2:6789,ceph-mon3:6789:/ /mnt/test -o
> name=admin,secretfile=/etc/ceph/client.admin
>
> cephfs /mnt/test set_layout -p 1 OR cephfs /mnt/test set_layout -p 2
> Q2) Do I need to set the layout of the directory to pool1 (cold-storage) or
> pool2 (hot-storage)?

Again, only refer to the base pool (cold pool). The cache handling
happens within the RADOS layer and the filesystem isn't really aware
of it.
The only bit that's different is that if you're granting cephx
permissions on individual pools, you'll need to grant permissions on
both the cold and hot storage pool.
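e.g. something along these lines (client name made up):

    ceph auth get-or-create client.cephfs mon 'allow r' mds 'allow' \
        osd 'allow rwx pool=cold-storage, allow rwx pool=hot-storage'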
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding private network AFTER cluster creation ?

2014-06-18 Thread Gregory Farnum
Yeah, the OSDs connect to the monitors over the  OSD's public address.
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jun 18, 2014 at 11:37 AM, Florent B  wrote:
> On 06/18/2014 04:34 PM, Gregory Farnum wrote:
>> On Tue, Jun 17, 2014 at 4:08 PM, Florent B  wrote:
>>> Ok so during "upgrade", I need routing between both networks ?
>>>
>>> And then, when all nodes are reconfigured, no more routing is needed (if
>>> I understood well).
>> Yeah, although the OSDs will of course need to be able to route to the 
>> monitors.
>>
>
> Yes of course but this is done via cluster or public network ? I think
> public one.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about RADOS object consistency

2014-06-18 Thread Gregory Farnum
On Wed, Jun 18, 2014 at 12:07 PM, Ke-fei Lin  wrote:
> 2014-06-18 22:44 GMT+08:00 Gregory Farnum :
>> On Tue, Jun 17, 2014 at 9:46 PM, Ke-fei Lin  wrote:
>>> 2014-06-18 1:28 GMT+08:00 Gregory Farnum :
>>>> On Tue, Jun 17, 2014 at 3:22 AM, Ke-fei Lin  wrote:
>>>>> Hi list,
>>>>>
>>>>> How does RADOS check an object and its replica are consistent? Is there
>>>>> a checksum in object's metadata or some other mechanisms? Does the
>>>>> mechanism depend on OSD's underlying file system?
>>>>
>>>> It does not check consistency on read. On scrub it compares the local
>>>> FS metadata (size et al) and RADOS metadata (object versions and
>>>> things); on deep scrub it computes a checksum of each replica and
>>>> compares them.
>>> Thank you Greg.
>>> Let's say if there are an object A and its replica B. On deep scrubbing 
>>> RADOS
>>> find that two objects have different checksums. How does RADOS determine
>>> and repair the corrupted object?
>>
>> You have to explicitly trigger a scrub "repair". Right now, whatever
>> the primary has wins; that's obviously suboptimal. (So generally you
>> should try and get manually involved with repairs.)
>
> If I choose XFS as the underlying file system, according to my understanding,
> the corrupted object will be detected if and only if a deep scrub
> happened. Then it's
> possible that an inconsistent object (on primary) being accidentally readed 
> and
> without any error, right?

You are correct.

>
> So, in such a case, a higher level application logic (or the file
> system sitting on
> RBD) should take responsibility for data consistency. Am I worried too much?

Well, I don't know if you're worried too much, but the scenarios you
describe are possible. You need to evaluate what guarantees you need
about that. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Level DB with RADOS

2014-06-18 Thread Gregory Farnum
On Wed, Jun 18, 2014 at 9:14 PM, Shesha Sreenivasamurthy
 wrote:
> I am doing some research work at UCSC and wanted to use LevelDB to store OMAP
> key/value pairs. What is the best way to start playing with it. I am a
> newbie to RADOS/CEPH code. Can any one point me in the right direction ?

I'm not quite sure what you're asking — omap entries *are* stored in
leveldb. If you want a way of using that interface, you probably want
to use the librados library, although you can also look at them via
the rados cli tool (but it's a poor choice for high-frequency
programmatic access).
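For quick experiments the CLI route looks like this (pool and object names
made up):

    rados -p rbd create myobj
    rados -p rbd setomapval myobj mykey myvalue
    rados -p rbd listomapkeys myobj
    rados -p rbd getomapval myobj mykey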
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding rados df statistics

2014-06-19 Thread Gregory Farnum
The total used/available/capacity is calculated by running the syscall
which "df" uses across all OSDs and summing the results. The "total data"
is calculated by summing the sizes of the objects stored.

It depends on how you've configured your system, but I'm guessing the
markup is due to the (constant size) overhead of your journals. Or anything
else which you might have stored on the disks besides Ceph?
-Greg

On Thursday, June 19, 2014, George Ryall wrote:

>  Hi all,
>
> I’m struggling to understand some Ceph usage statistics and I was hoping
> someone might be able to explain them to me.
>
>
>
> If I run ‘rados df’ I get the following:
>
> # rados df
>
> pool name category KB  objects   clones
> degraded  unfound   rdrd KB   wrwr KB
>
> pool-1-  00
> 00   000
> 00
>
> pool-2-2339809 1299
> 00   0  300   540600 3301
> 2340798
>
> pool-3-409574914654
> 00   0 396917256  3337952
> 70296734
>
> pool-4-180283239332
> 00   000
> 22059790
>
> pool-5-  19310248582397
> 00   0   668938102410614  5230404
> 254457331
>
>   total used  5402116076   137682
>
>   total avail   854277445084
>
>   total space   859679561160
>
>
>
> Pools 2 and 4 have a size of 2, whilst pools 3 and 5 have a size of 3.
>
>
>
> ‘ceph status’ tells me the following stats: “192 GB data, 134 kobjects,
> 5151 GB used, 795 TB / 800 TB avail”
>
>
>
> The 192 GB of data is equal to the sum of the ‘KB’ column of the rados df
> data.  The used and available numbers are the same as the totals given by
> rados df.
>
>
>
> What I don’t understand is how we have used 5,151 GB of data. Given the
> sizes of each pool I would expect it to be closer to 572 GB (sum of the
> size of each pool multiplied by pool ‘size’)   plus some overhead of some
> kind. This is a factor of 9 different. So my question is:  what have I
> missed?
>
>
>
> Cheers,
>
>
>
> George Ryall
>
>
> Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell
> Oxford | Didcot | OX11 0QX
>
> (01235 44) 5021
>
>
>
> --
> Scanned by iCritical.
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier pool in CephFS

2014-06-19 Thread Gregory Farnum
1) it will take time for the deleted objects to flush out of the cache pool
and then be deleted in the cold pool. They will disappear eventually,
though!
2) you can't delete pools which are in the MDSMap.
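For (2), take the pool out of the MDS map first and then delete it, roughly:

    ceph mds remove_data_pool <the id/name you passed to add_data_pool>
    ceph osd pool delete cold-storage cold-storage --yes-i-really-really-mean-it

(and obviously only once nothing in the filesystem still points at it).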

On Thursday, June 19, 2014, Sherry Shahbazi  wrote:

> Hi Greg,
>
> Thanks for your prompt reply. I appreciate, if you could also help me
> with the following issues:
>
> 1) After mounting a directory to a pool called cold-pool, I started to
> save data through CephFS. By removing all of the created files from CephFS,
> I could not remove objects from the cold-pool!
> 2) Then I thought to remove the cold-pool instead. After removing the
> cache tier pool of cold-pool, I was not able to remove the cold-pool! I got
> the following error:
> pool cold-storage does not exist
> error 16: (16) Device or resource busy.
> This is where all of my PGs are clean+active! By the way, I disabled CephX.
>
> Thanks in advance,
> Sherry
>
>
>
>
>   On Thursday, June 19, 2014 3:16 AM, Gregory Farnum wrote:
>
>
> On Wed, Jun 18, 2014 at 12:54 AM, Sherry Shahbazi wrote:
> > Hi everyone,
> >
> > If I have a pool called cold-storage (1) and a pool called hot-storage
> (2)
> > that hot-storage is a cache tier for the cold-storage.
> >
> > I normally do the followings in order to map a directory in my client to
> a
> > pool.
> >
> > on a Ceph monitor,
> > ceph mds add_data_pool 1
> > ceph mds add_data_pool 2
> > Q1) Do I need to add both cold-storage and hot-storage to the data pools?
>
> Nooo...generally whenever using cache pools, you should only refer to
> the base pool. (You might change that if using read-only caching or
> something, but none of that is really "certified".)
>
>
> >
> > on client,
> > mkdir /mnt/test
> > mount -t ceph ceph-mon1:6789,ceph-mon2:6789,ceph-mon3:6789:/ /mnt/test -o
> > name=admin,secretfile=/etc/ceph/client.admin
> >
> > cephfs /mnt/test set_layout -p 1 OR cephfs /mnt/test set_layout -p 2
> > Q2) Do I need to set the layout of the directory to pool1 (cold-storage)
> or
> > pool2 (hot-storage)?
>
>
> Again, only refer to the base pool (cold pool). The cache handling
> happens within the RADOS layer and the filesystem isn't really aware
> of it.
> The only bit that's different is that if you're granting cephx
> permissions on individual pools, you'll need to grant permissions on
> both the cold and hot storage pool.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] switch pool from replicated to erasure coded

2014-06-19 Thread Gregory Farnum
On Thursday, June 19, 2014, Pavel V. Kaygorodov  wrote:

> Hi!
>
> May be I have missed something in docs, but is there a way to switch a
> pool from replicated to erasure coded?


No.


> Or I have to create a new pool an somehow manually transfer data from old
> pool to new one?


Yes. Please keep in mind that erasure-coded pools are significantly limited
compared to the replicated pools. They're really only usable directly by
RGW or somebody using librados who can handle the limitations.
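For the copy itself, if the pool only holds plain objects (no omap, no
snapshots), something like this can work -- but test it on scratch data first:

    ceph osd pool create newpool 1024 1024 erasure
    rados cppool oldpool newpool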



>
> Pavel.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding rados df statistics

2014-06-19 Thread Gregory Farnum
Yeah, the journal is a fixed size; it won't grow!

On Thursday, June 19, 2014, George Ryall wrote:

>  Having looked at a sample of OSDs it appears that it is indeed the case
> that for every GB of data we have 9 GB of Journal. Is this normal? Or are
> we not doing some Journal/cluster management that we should be?
>
>
>
>
>
> George
>
>
>
> From: Gregory Farnum [mailto:g...@inktank.com]
> Sent: 19 June 2014 13:53
> To: Ryall, George (STFC,RAL,SC)
> Cc: ceph-users@lists.ceph.com
> 
> *Subject:* Re: [ceph-users] understanding rados df statistics
>
>
>
> The total used/available/capacity is calculated by running the syscall
> which "df" uses across all OSDs and summing the results. The "total data"
> is calculated by summing the sizes of the objects stored.
>
>
>
> It depends on how you've configured your system, but I'm guessing the
> markup is due to the (constant size) overhead of your journals. Or anything
> else which you might have stored on the disks besides Ceph?
>
> -Greg
>
>
> On Thursday, June 19, 2014, George Ryall wrote:
>
> Hi all,
>
> I’m struggling to understand some Ceph usage statistics and I was hoping
> someone might be able to explain them to me.
>
>
>
> If I run ‘rados df’ I get the following:
>
> # rados df
>
> pool name category KB  objects   clones
> degraded  unfound   rdrd KB   wrwr KB
>
> pool-1-  00
> 00   000
> 00
>
> pool-2-2339809 1299
> 00   0  300   540600 3301
> 2340798
>
> pool-3-409574914654
> 00   0 396917256  3337952
> 70296734
>
> pool-4-180283239332
> 00   000
> 22059790
>
> pool-5-  19310248582397
> 00   0   668938102410614  5230404
> 254457331
>
>   total used  5402116076   137682
>
>   total avail   854277445084
>
>   total space   859679561160
>
>
>
> Pools 2 and 4 have a size of 2, whilst pools 3 and 5 have a size of 3.
>
>
>
> ‘ceph status’ tells me the following stats: “192 GB data, 134 kobjects,
> 5151 GB used, 795 TB / 800 TB avail”
>
>
>
> The 192 GB of data is equal to the sum of the ‘KB’ column of the rados df
> data.  The used and available numbers are the same as the totals given by
> rados df.
>
>
>
> What I don’t understand is how we have used 5,151 GB of data. Given the
> sizes of each pool I would expect it to be closer to 572 GB (sum of the
> size of each pool multiplied by pool ‘size’)   plus some overhead of some
> kind. This is a factor of 9 different. So my question is:  what have I
> missed?
>
>
>
> Cheers,
>
>
>
> George Ryall
>
>
> Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell
> Oxford | Didcot | OX11 0QX
>
> (01235 44) 5021
>
>
>
>
>
> --
> Scanned by iCritical.
>
>
>
>
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Taking down one OSD node (10 OSDs) for maintenance - best practice?

2014-06-19 Thread Gregory Farnum
No, you definitely don't need to shut down the whole cluster. Just do
a polite shutdown of the daemons, optionally with the noout flag that
Wido mentioned.
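i.e. roughly (exact commands depend on your init system):

    ceph osd set noout
    service ceph stop osd        # on the node going down; 'stop ceph-osd-all' on Upstart
    # ...do the maintenance and boot it back up...
    service ceph start osd
    ceph osd unset noout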
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 19, 2014 at 1:55 PM, Alphe Salas Michels  wrote:
> Hello, the best practice is to simply shut down the whole cluster starting
> form the clients,  monitors the mds and the osd. You do your maintenance
> then you bring back everyone starting from monitors, mds, osd. clients.
>
> Other while the osds missing will lead to a reconstruction of your cluster
> that will not end with the return of the "faulty" osd(s). In the case you
> turn off everything related to ceph cluster then it will be transparent for
> the monitors and will not have to deal with partial reconstruction to clean
> up and rescrubing of the returned OSD(s).
>
> best regards.
>
> Alphe Salas
> T.I ingeneer.
>
>
>
> On 06/13/2014 04:56 AM, David wrote:
>>
>> Hi,
>>
>> We’re going to take down one OSD node for maintenance (add cpu + ram)
>> which might take 10-20 minutes.
>> What’s the best practice here in a production cluster running dumpling
>> 0.67.7-1~bpo70+1?
>>
>> Kind Regards,
>> David Majchrzak
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Level DB with RADOS

2014-06-23 Thread Gregory Farnum
Well, it's in the Ceph repository, in the "OSD" and "os" directories,
available at https://github.com/ceph/ceph.

But it's not the kind of thing you can really extract from Ceph, and
if you're interested in getting involved in the project you're going
to need to spend a lot of time poking around things like this on your
own, so be prepared! :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 19, 2014 at 10:48 AM, Shesha Sreenivasamurthy
 wrote:
> Thanks, What is the right GIT repo from where I can download (clone) the
> RADOS code in which OMAP uses LevelDB. I am a newbie hence the question.
>
>
> On Wed, Jun 18, 2014 at 7:28 PM, Gregory Farnum  wrote:
>>
>> On Wed, Jun 18, 2014 at 9:14 PM, Shesha Sreenivasamurthy
>>  wrote:
>> > I am doing some research work at UCSC and wanted use LevelDB to store
>> > OMAP
>> > key/value pairs. What is the best way to start playing with it. I am a
>> > newbie to RADOS/CEPH code. Can any one point me in the right direction ?
>>
>> I'm not quite sure what you're asking — omap entries *are* stored in
>> leveldb. If you want a way of using that interface, you probably want
>> to use the librados library, although you can also look at them via
>> the rados cli tool (but it's a poor choice for high-frequency
>> programmatic access).
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple hierarchies and custom placement

2014-06-23 Thread Gregory Farnum
On Fri, Jun 20, 2014 at 4:23 PM, Shayan Saeed  wrote:
> Is it allowed for crush maps to have multiple hierarchies for different
> pools. So for example, I want one pool to treat my cluster as flat with
> every host being equal but the other pool to have a more hierarchical idea
> as hosts->racks->root?

Yes. It can get complicated, so make sure you know exactly what you're
doing, but you can create different "root" buckets and link the OSDs
in to each root in different ways.
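A rough sketch of what that can look like in the decompiled CRUSH map (the
ids, names, and weights are made up -- compile and sanity-check with
crushtool before injecting anything):

  # flat view: OSDs hang directly off their own root
  root flat {
          id -10          # pick an id that doesn't collide with existing buckets
          alg straw
          hash 0  # rjenkins1
          item osd.0 weight 1.000
          item osd.1 weight 1.000
          item osd.2 weight 1.000
          item osd.3 weight 1.000
  }

  rule flat_rule {
          ruleset 3
          type replicated
          min_size 1
          max_size 10
          step take flat
          step choose firstn 0 type osd
          step emit
  }

  # hierarchical view: keep using the normal default root (hosts/racks)
  # and spread replicas across hosts
  rule racked_rule {
          ruleset 4
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type host
          step emit
  }

Then point each pool at the ruleset you want, e.g.
"ceph osd pool set yourpool crush_ruleset 3".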

>
> Also, is it currently possible in ceph to have a custom placement of erasure
> coded chunks. So for example within a pool, I want objects to reside exactly
> on the OSDs I choose instead of doing placement for load balancing. Can I
> specify something like: "For object 1, I want systematic chunks on rack1 and
> non systematic distributed between rack2 and rack3 and then for object 2, I
> want systematic ones on rack2 and non systematic distributed between rack1
> and rack3"?

Not generally, no — you need to let the CRUSH algorithm place them.
You can do things like specify specific buckets within a CRUSH rule,
but that applies on a pool level.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> I would greatly appreciate any suggestions I get.
>
> Regards,
> Shayan Saeed
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub versus osd scrub load threshold

2014-06-23 Thread Gregory Farnum
Looks like it's a doc error (at least on master), but it might have
changed over time. If you're running Dumpling we should change the
docs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer  wrote:
>
> Hello,
>
> This weekend I noticed that the deep scrubbing took a lot longer than
> usual (long periods without a scrub running/finishing), even though the
> cluster wasn't all that busy.
> It was however busier than in the past and the load average was above 0.5
> frequently.
>
> Now according to the documentation "osd scrub load threshold" is ignored
> when it comes to deep scrubs.
>
> However after setting it to 1.5 and restarting the OSDs the floodgates
> opened and all those deep scrubs are now running at full speed.
>
> Documentation error or did I "unstuck" something by the OSD restart?
>
> Regards,
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] trying to interpret lines in osd.log

2014-06-23 Thread Gregory Farnum
On Mon, Jun 23, 2014 at 4:26 AM, Christian Kauhaus  wrote:
> I see several instances of the following log messages in the OSD logs each 
> day:
>
> 2014-06-21 02:05:27.740697 7fbc58b78700  0 -- 172.22.8.12:6810/31918 >>
> 172.22.8.12:6800/28827 pipe(0x7fbe400029f0 sd=764 :6810 s=0 pgs=0 cs=0 l=0
> c=0x7fbe40003190).accept connect_seq 30 vs existing 29 state standby

"I'm getting an incoming connection, and it's claiming to be the 30th
round, against an existing connection I have from the 29th round
(which is in the standby state)."

>
> 2014-06-21 07:44:29.437810 7fbc452cb700  0 -- 172.22.8.12:6810/31918 >>
> 172.22.8.16:6802/31292 pipe(0x7fbe40002d90 sd=748 :6810 s=2 pgs=11345 cs=57
> l=0 c=0x7fbf68eb2a70).fault with nothing to send, going to standby

"A tcp socket got killed or timed out or something, but I have nothing
to send on that socket, so I'm going to the standby state on this
connection".

This is expected and not anything to worry about; you'll generally see
the second whenever a connection has been idle for 15 minutes, and
then the first when one of them has something they need to send.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> What does this mean? Anything to worry about?
>
> TIA
>
> Christian
>
> --
> Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
> gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
> http://gocept.com · tel +49 345 219401-11
> Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behaviour of ceph pg repair on different replication levels

2014-06-23 Thread Gregory Farnum
On Mon, Jun 23, 2014 at 4:54 AM, Christian Eichelmann
 wrote:
> Hi ceph users,
>
> since our cluster has had a few inconsistent PGs lately, I was wondering
> what ceph pg repair does depending on the replication level.
> So I just wanted to check whether my assumptions are correct:
>
> Replication 2x
> Since the cluster cannot decide which version is the correct one, it would
> just copy the primary copy (the active one) over the secondary copy,
> which is a 50/50 chance of getting the correct version.
>
> Replication 3x or more
> Now the cluster has a quorum and a ceph pg repair will replace the
> corrupt replica with one of the correct ones. No manual intervention needed.

Well, actually it always takes the primary copy, unless the primary
has some way of locally telling that its version is corrupt. (This
might happen if the primary thinks it should have an object, but it
doesn't exist on disk.) But there's no voting or anything like that at this
time.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple hierarchies and custom placement

2014-06-24 Thread Gregory Farnum
There's not really a simple way to do this. There are functions in the
OSDMap structure to calculate the location of a particular PG, but there
are a lot of independent places that map objects into PGs.

On Monday, June 23, 2014, Shayan Saeed  wrote:

> Thanks for getting back with a helpful reply. Assuming that I change the
> source code to do custom placement, what are the places I need to look in
> the code to do that? I am currently trying to change the CRUSH code, but is
> there any place else I need to be concerned about?
>
> Regards,
> Shayan Saeed
>
>
> On Mon, Jun 23, 2014 at 2:14 PM, Gregory Farnum  > wrote:
>
>> On Fri, Jun 20, 2014 at 4:23 PM, Shayan Saeed > > wrote:
>> > Is it allowed for crush maps to have multiple hierarchies for different
>> > pools. So for example, I want one pool to treat my cluster as flat with
>> > every host being equal but the other pool to have a more hierarchical
>> idea
>> > as hosts->racks->root?
>>
>> Yes. It can get complicated, so make sure you know exactly what you're
>> doing, but you can create different "root" buckets and link the OSDs
>> in to each root in different ways.
>>
>> >
>> > Also, is it currently possible in ceph to have a custom placement of
>> erasure
>> > coded chunks. So for example within a pool, I want objects to reside
>> exactly
>> > on the OSDs I choose instead of doing placement for load balancing. Can
>> I
>> > specify something like: "For object 1, I want systematic chunks on
>> rack1 and
>> > non systematic distributed between rack2 and rack3 and then for object
>> 2, I
>> > want systematic ones on rack2 and non systematic distributed between
>> rack1
>> > and rack3"?
>>
>> Not generally, no — you need to let the CRUSH algorithm place them.
>> You can do things like specify specific buckets within a CRUSH rule,
>> but that applies on a pool level.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>> >
>> > I would greatly appreciate any suggestions I get.
>> >
>> > Regards,
>> > Shayan Saeed
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> 
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Continuing placement group problems

2014-06-25 Thread Gregory Farnum
You probably want to look at the central log (on your monitors) and
see exactly what scrub errors it's reporting. There might also be
useful info if you dump the pg info on the inconsistent PGs. But if
you're getting this frequently, you're either hitting some unknown
issues with the OSDs around some peering issue (unlikely, but
possible), or there's an issue in ZFS or the way the OSD is using ZFS
(much more possible).
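Something along these lines should show you what's going on (the pg id is
just an example):

  ceph health detail                    # lists which PGs are inconsistent
  ceph pg dump | grep inconsistent      # same info, plus which OSDs they map to
  ceph pg 4.b6 query                    # dump the info for one suspect PG
  grep ERR /var/log/ceph/ceph.log       # the central log, on a monitor host

and only once you understand what went wrong:

  ceph pg repair 4.b6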
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Jun 24, 2014 at 9:38 AM, Peter Howell  wrote:
> We are running two Ceph clusters, both version 0.80 and both on ZFS. We are
> frequently getting inconsistent placement groups on both clusters.
>
> We are suspecting that there is a problem with the network that is randomly
> corrupting the update of placement groups. Does anyone have any suggestions
> as to where and how to look for the problem? The network does not seem to
> have any problems. ZFS is not reporting any problems with the disks and the
> OSD's are fine.
>
> Thanks
>
> Peter.
>
> Log as follows
>
>  health HEALTH_ERR 50 pgs inconsistent; 121 scrub errors
>  monmap e8: 6 mons at
> {broll=10.5.8.9:6789/0,gelbin=10.5.8.10:6789/0,magni=10.5.8.12:6789/0,sicco=10.5.8.11:6789/0,tyrande=10.5.8.8:6789/0,varian=10.5.8.14:6789/0},
> election epoch 272, quorum 0,1,2,3,4,5
> tyrande,broll,gelbin,sicco,magni,varian
>  mdsmap e430: 1/1/1 up {0=broll=up:active}, 5 up:standby
>  osdmap e18928: 7 osds: 7 up, 7 in
>   pgmap v4910054: 512 pgs, 4 pools, 13043 MB data, 3681 objects
> 40800 MB used, 856 GB / 895 GB avail
>  462 active+clean
>   50 active+clean+inconsistent
>   client io 12769 B/s rd, 5 op/s
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with RadosGW and special characters

2014-06-25 Thread Gregory Farnum
Unfortunately Yehuda's out for a while as he could best handle this,
but it sounds familiar so I think you probably want to search the list
archives and the bug tracker (http://tracker.ceph.com/projects/rgw).
What version precisely are you on?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jun 25, 2014 at 2:58 PM, Brian Rak  wrote:
> I'm trying to track down an issue with RadosGW and special characters in
> filenames.  Specifically, it seems that filenames with a + in them are not
> being handled correctly, and that I need to explicitly escape them.
>
> For example:
>
> ---request begin---
> HEAD /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb HTTP/1.0
> User-Agent: Wget/1.12 (linux-gnu)
>
> Will fail with a 404 error, but
>
> ---request begin---
> HEAD /ubuntu/pool/main/a/adduser/adduser_3.113%2Bnmu3ubuntu3_all.deb
> HTTP/1.0
> User-Agent: Wget/1.12 (linux-gnu)
>
> will work properly.
>
> I enabled debug mode on radosgw, and see this:
>
> 2014-06-25 17:30:37.383029 7f7ca7fff700 20 RGWWQ:
> 2014-06-25 17:30:37.383040 7f7ca7fff700 20 req: 0x7f7ca000b180
> 2014-06-25 17:30:37.383053 7f7ca7fff700 10 allocated request
> req=0x7f7ca0015ef0
> 2014-06-25 17:30:37.383064 7f7c6cfa9700 20 dequeued request
> req=0x7f7ca000b180
> 2014-06-25 17:30:37.383070 7f7c6cfa9700 20 RGWWQ: empty
> 2014-06-25 17:30:37.383121 7f7c6cfa9700 20 CONTENT_LENGTH=
> 2014-06-25 17:30:37.383123 7f7c6cfa9700 20 CONTENT_TYPE=
> 2014-06-25 17:30:37.383124 7f7c6cfa9700 20 DOCUMENT_ROOT=/etc/nginx/html
> 2014-06-25 17:30:37.383125 7f7c6cfa9700 20
> DOCUMENT_URI=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383126 7f7c6cfa9700 20 FCGI_ROLE=RESPONDER
> 2014-06-25 17:30:37.383127 7f7c6cfa9700 20 GATEWAY_INTERFACE=CGI/1.1
> 2014-06-25 17:30:37.383128 7f7c6cfa9700 20 HTTP_ACCEPT=*/*
> 2014-06-25 17:30:37.383129 7f7c6cfa9700 20 HTTP_CONNECTION=Keep-Alive
> 2014-06-25 17:30:37.383129 7f7c6cfa9700 20 HTTP_HOST=xxx
> 2014-06-25 17:30:37.383130 7f7c6cfa9700 20 HTTP_USER_AGENT=Wget/1.12
> (linux-gnu)
> 2014-06-25 17:30:37.383131 7f7c6cfa9700 20 QUERY_STRING=
> 2014-06-25 17:30:37.383131 7f7c6cfa9700 20 REDIRECT_STATUS=200
> 2014-06-25 17:30:37.383132 7f7c6cfa9700 20 REMOTE_ADDR=yyy
> 2014-06-25 17:30:37.383133 7f7c6cfa9700 20 REMOTE_PORT=43855
> 2014-06-25 17:30:37.383134 7f7c6cfa9700 20 REQUEST_METHOD=HEAD
> 2014-06-25 17:30:37.383134 7f7c6cfa9700 20
> REQUEST_URI=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383135 7f7c6cfa9700 20
> SCRIPT_NAME=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383136 7f7c6cfa9700 20 SERVER_ADDR=yyy
> 2014-06-25 17:30:37.383136 7f7c6cfa9700 20 SERVER_NAME=xxx
> 2014-06-25 17:30:37.383137 7f7c6cfa9700 20 SERVER_PORT=80
> 2014-06-25 17:30:37.383138 7f7c6cfa9700 20 SERVER_PROTOCOL=HTTP/1.0
> 2014-06-25 17:30:37.383138 7f7c6cfa9700 20 SERVER_SOFTWARE=nginx/1.4.6
> 2014-06-25 17:30:37.383140 7f7c6cfa9700  1 == starting new request
> req=0x7f7ca000b180 =
> 2014-06-25 17:30:37.383152 7f7c6cfa9700  2 req 1:0.13::HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb::initializing
> 2014-06-25 17:30:37.383158 7f7c6cfa9700 10 host= rgw_dns_name=
> 2014-06-25 17:30:37.383199 7f7c6cfa9700 10
> s->object=ubuntu/pool/main/a/adduser/adduser_3.113 nmu3ubuntu3_all.deb
> s->bucket=ubuntu
> 2014-06-25 17:30:37.383207 7f7c6cfa9700  2 req 1:0.68:s3:HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb::getting op
> 2014-06-25 17:30:37.383211 7f7c6cfa9700  2 req 1:0.72:s3:HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb:get_obj:authorizing
> 2014-06-25 17:30:37.383218 7f7c6cfa9700  2 req 1:0.79:s3:HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb:get_obj:reading
> permissions
> 2014-06-25 17:30:37.383268 7f7c6cfa9700 20 get_obj_state:
> rctx=0x7f7c6cfa8640 obj=.rgw:ubuntu state=0x7f7c6800c0a8 s->prefetch_data=0
> 2014-06-25 17:30:37.383279 7f7c6cfa9700 10 cache get: name=.rgw+ubuntu :
> miss
>
>
> It seems that Ceph is attempting to urldecode the filename, even when it
> shouldn't be.  (Going by
> http://stackoverflow.com/questions/1005676/urls-and-plus-signs ).  Is this a
> bug, or is this the desired behavior?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behaviour of ceph pg repair on different replication levels

2014-06-25 Thread Gregory Farnum
On Wed, Jun 25, 2014 at 12:22 AM, Christian Kauhaus  wrote:
> Am 23.06.2014 20:24, schrieb Gregory Farnum:
>> Well, actually it always takes the primary copy, unless the primary
>> has some way of locally telling that its version is corrupt. (This
>> might happen if the primary thinks it should have an object, but it
>> doesn't exist on disk.) But there's not a voting or anything at this
>> time.
>
> Thanks Greg for the clarification. I wonder if some sort of voting during
> recovery would be feasible to implement. Having this available would make a 3x
> replica scheme immensely more useful.

It's a good idea, and in fact there was a discussion yesterday during
the Ceph Developer Summit about making scrub repair significantly more
powerful; they're keeping that use case in mind in addition to very
fine-grained ones like specifying a particular replica for every
object.

>
> In my current understanding Ceph has no guards against local bit rot (e.g.,
> when a local disk returns incorrect data).

Yeah, it's got nothing and is relying on the local filesystem to barf
if that happens. Unfortunately, neither xfs nor ext4 provide that
checking functionality (which is one of the reasons we continue to
look to btrfs as our long-term goal).
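(Deep scrub is the closest thing available today: it reads the objects back
and compares digests across the replicas, so bit rot at least surfaces as an
inconsistent PG after the fact. You can also kick one off by hand, e.g.:

  ceph pg deep-scrub <pgid>
  ceph osd deep-scrub <osd-id>   # deep-scrub everything on one OSD
)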
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

> Or is there already a voting scheme
> in place during deep scrub?
>
> Regards
>
> Christian
>
> --
> Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
> gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
> http://gocept.com · tel +49 345 219401-11
> Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?

2014-06-25 Thread Gregory Farnum
Sorry we let this drop; we've all been busy traveling and things.

There have been a lot of changes to librados between Dumpling and
Firefly, but we have no idea what would have made it slower. Can you
provide more details about how you were running these tests?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jun 13, 2014 at 7:59 AM, Stefan Priebe  wrote:
> Hi,
>
> while testing firefly I came into the situation where I had a client with the
> latest dumpling packages installed (0.67.9).
>
> As my pool has hashpspool false and the tunables are set to default, it can
> talk to my firefly ceph storage.
>
> For random 4k writes using fio with librbd and 32 jobs and an iodepth of 32.
>
> I get these results:
>
> librbd / librados2 from dumpling:
>   write: io=3020.9MB, bw=103083KB/s, iops=25770, runt= 30008msec
>   WRITE: io=3020.9MB, aggrb=103082KB/s, minb=103082KB/s, maxb=103082KB/s,
> mint=30008msec, maxt=30008msec
>
> librbd / librados2 from firefly:
>   write: io=7344.3MB, bw=83537KB/s, iops=20884, runt= 90026msec
>   WRITE: io=7344.3MB, aggrb=83537KB/s, minb=83537KB/s, maxb=83537KB/s,
> mint=90026msec, maxt=90026msec
>
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Difference between "ceph osd reweight" and "ceph osd crush reweight"

2014-06-26 Thread Gregory Farnum
On Thu, Jun 26, 2014 at 7:03 AM, Micha Krause  wrote:
> Hi,
>
> could someone explain to me what the difference is between
>
> ceph osd reweight
>
> and
>
> ceph osd crush reweight

"ceph osd crush reweight" sets the CRUSH weight of the OSD. This
weight is an arbitrary value (generally the size of the disk in TB or
something) and controls how much data the system tries to allocate to
the OSD.

"ceph osd reweight" sets an override weight on the OSD. This value is
in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
data that would otherwise live on this drive. It does *not* change the
weights assigned to the buckets above the OSD, and is a corrective
measure in case the normal CRUSH distribution isn't working out quite
right. (For instance, if one of your OSDs is at 90% and the others are
at 50%, you could reduce this weight to try and compensate for it.)
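
In command form that's roughly (names and numbers here are only examples):

  # permanent CRUSH weight, usually roughly the disk size in TB
  ceph osd crush reweight osd.7 2.72

  # temporary override between 0 and 1, e.g. to shed data from a too-full OSD
  ceph osd reweight 7 0.85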

It looks like our docs aren't very clear on the difference, when it
even mentions them...and admittedly it's a pretty subtle issue!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Continuing placement group problems

2014-06-26 Thread Gregory Farnum
On Thu, Jun 26, 2014 at 12:52 PM, Kevin Horan
 wrote:
> I am also getting inconsistent object errors on a regular basis, about 1-2
> every week or so for about 300GB of data. All OSDs are using XFS
> filesystems. Some OSDs are individual 3TB internal hard drives and some are
> external FC attached raid6 arrays. I am using this cluster to store kvm
> images and I've noticed that the inconsistent objects always occur on my two
> most recently created VM images, even though one of them is hardly ever used
> (just a bare VM not put into production yet). This all started about 4
> months ago on 0.72 and now is continuing to occur on version .80. I also
> changed the number of replicas from 2 to 3 for the pool containing these
> images and that had no effect.
>
> Here is an example log entry:
>
> 2014-06-24 18:11:51.683310 7faf44297700  0 log [ERR] : 4.b6 shard 0: soid
> c539a8b6/rbd_data.9fdea2ae8944a.04e2/head//4 digest 2541762784
> != known digest 3305022936
> 2014-06-24 18:11:52.107321 7faf50f60700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:52.215752 7faf5075f700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:52.365798 7faf50f60700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:52.674643 7faf5075f700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:52.749641 7faf50f60700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:55.194967 7faf5075f700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:55.259322 7faf50f60700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:55.526157 7faf5075f700  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
> Invalid argument
> 2014-06-24 18:11:55.547270 7faf44297700  0 log [ERR] : 4.b6 deep-scrub 0
> missing, 1 inconsistent objects
> 2014-06-24 18:11:55.547282 7faf44297700  0 log [ERR] : 4.b6 deep-scrub 1
> errors

Can you go find out what about those files is different? Are they
different sizes, with the overlapping pieces being the same? Are they
completely different? Are your systems losing power or otherwise doing
mean things to the local filesystem? Have you noticed a pattern of
distribution in terms of the underlying storage system on the
inconsistent OSDs?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> Sometimes one of the objects has 0 size. I've also started getting the
> FSSETXATTR errors recently, though I think that started after this problem
> started. I've read elsewhere that these are harmless and will go away in a
> future version.  I also looked in the monitor logs but didn't see any
> reference to inconsistent or scrubbed objects.
>
> Kevin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Difference between "ceph osd reweight" and "ceph osd crush reweight"

2014-06-27 Thread Gregory Farnum
Yep, definitely use "osd crush reweight" for your permanent data placement.
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jun 27, 2014 at 12:13 AM, Micha Krause  wrote:
> Hi,
>
>
>> "ceph osd crush reweight" sets the CRUSH weight of the OSD. This
>> weight is an arbitrary value (generally the size of the disk in TB or
>> something) and controls how much data the system tries to allocate to
>> the OSD.
>>
>> "ceph osd reweight" sets an override weight on the OSD. This value is
>> in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
>> data that would otherwise live on this drive. It does *not* change the
>> weights assigned to the buckets above the OSD, and is a corrective
>> measure in case the normal CRUSH distribution isn't working out quite
>> right. (For instance, if one of your OSDs is at 90% and the others are
>> at 50%, you could reduce this weight to try and compensate for it.)
>
>
> thanks, so if I have some older osds, and I want them to receive less
> data/iops
> than the other nodes, I would use "ceph osd crush reweight"?
>
> Micha Krause
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Data not evenly distributed

2014-06-28 Thread Gregory Farnum
Did you also increase the "pgp_num"?
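That is, something along the lines of (with your own pool name and target
count):

  ceph osd pool set yourpool pg_num 32768
  ceph osd pool set yourpool pgp_num 32768
  ceph osd pool get yourpool pgp_num      # verify it actually took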

On Saturday, June 28, 2014, Jianing Yang  wrote:

> Actually, I did increase the PG number to 32768 (120 OSDs) and I also use
> "tunables optimal". But the data is still not distributed evenly.
>
>
> On Sun, Jun 29, 2014 at 3:42 AM, Konrad Gutkowski  > wrote:
>
>> Hi,
>>
>> Increasing PG number for pools that hold data might help if you didn't do
>> that already.
>>
>> Check out this thread:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/
>> 2014-January/027094.html
>>
>> You might find some tips there (although it was pre firefly).
>>
>> W dniu 28.06.2014 o 14:44 Jianing Yang > > pisze:
>>
>>
>>> Hi, all
>>>
>>> My cluster has been running for about 4 months now. I have about 108
>>> osds and all are 600G SAS disks. Their disk usage is between 70% and 85%.
>>> It seems that ceph cannot distribute data evenly with the default settings. Is
>>> there any configuration that helps distribute data more evenly?
>>>
>>> Thanks very much
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>>
>> Konrad Gutkowski
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD backfill full tunings

2014-06-30 Thread Gregory Farnum
It looks like that value isn't live-updateable, so you'd need to
restart after changing the daemon's config. Sorry!
Made a ticket: http://tracker.ceph.com/issues/8695
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 12:41 AM, Kostis Fardelas  wrote:
> Hi,
> during PGs remapping, the cluster recovery process sometimes gets
> stuck on PGs with backfill_toofull state. The obvious solution is to
> reweight the impacted OSD until we add new OSDs to the cluster. In
> order to force the remapping process to complete asap we try to inject
> a higher value on "osd_backfill_full_ratio" tunable (by default on
> 85%). However, after applying the higher backfill full ratio values,
> the remapping does not seem to start and continues to be stuck with
> backfill_toofull PGs. Is there something more we should try?
>
> Thanks,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS : directory sharding ?

2014-06-30 Thread Gregory Farnum
Directory sharding is even less stable than the rest of the MDS, but
if you need it I have some hope that things will work. You just need
to set the "mds bal frag" option to "true". You can configure the
limits as well; see the options following:
https://github.com/ceph/ceph/blob/master/src/common/config_opts.h#L323
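For reference, that would look something like this in ceph.conf on the MDS
hosts (restart the MDS daemons afterwards):

  [mds]
  mds bal frag = true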
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 7:09 AM, Florent B  wrote:
> Hi,
>
> I would like to do Multi-MDS (directory sharding) with CephFS.
>
> But I didn't find any documentation.
>
> Is it supported in Firefly ?
>
> Where can I find some information ?
>
> Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-06-30 Thread Gregory Farnum
What's the backtrace from the crashing OSDs?

Keep in mind that 0.82 is a dev release; it's generally best not to upgrade
to unnamed versions like that (but it's probably too late to go back
now).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 wrote:
> Hi,
>
> After the upgrade to firefly, I have some PGs stuck in the peering state.
> I saw that 0.82 was out, so I tried upgrading to solve my problem.
>
> My three MDSes crash, and some OSDs trigger a chain reaction that kills other
> OSDs.
> I think my MDSes will not start because their metadata are on the OSDs.
>
> I have 36 OSDs on three servers and I have identified 5 OSDs which make the
> others crash. If I do not start those, the cluster goes into a recovery state with
> 31 OSDs, but I have 378 PGs in down+peering state.
>
> What can I do? Would you like more information (OS, crash logs, etc.)?
>
> Regards
>
> --
> --
> Pierre BLONDEAU
> Administrateur Systèmes & réseaux
> Université de Caen
> Laboratoire GREYC, Département d'informatique
>
> tel : 02 31 56 75 42
> bureau  : Campus 2, Science 3, 406
> --
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS : directory sharding ?

2014-06-30 Thread Gregory Farnum
Umm...there are hooks for that, but they're for debug purposes only.
And running multiple MDSes *will* break something, in ways that
fragmenting the directories won't.
If you're dead set on this course, you can dig through the qa
directory for the MDS tests to see what commands it's running to force
fragments into a specific MDS.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 8:51 AM, Florent B  wrote:
> Ok thank you. So it is not possible to set a specific directory assigned
> to a MDS ?
>
> On 06/30/2014 05:34 PM, Gregory Farnum wrote:
>> Directory sharding is even less stable than the rest of the MDS, but
>> if you need it I have some hope that things will work. You just need
>> to set the "mds bal frag" option to "true". You can configure the
>> limits as well; see the options following:
>> https://github.com/ceph/ceph/blob/master/src/common/config_opts.h#L323
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Mon, Jun 30, 2014 at 7:09 AM, Florent B  wrote:
>>> Hi,
>>>
>>> I would like to do Multi-MDS (directory sharding) with CephFS.
>>>
>>> But I didn't find any documentation.
>>>
>>> Is it supported in Firefly ?
>>>
>>> Where can I find some information ?
>>>
>>> Thank you.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD backfill full tunings

2014-06-30 Thread Gregory Farnum
Oh, you're right — I just ran a grep and didn't look closely enough.
It looks like once they're in that too_full state, they need to get
kicked by the OSD to try again though. I believe (haven't checked)
that that can happen if other backfills finish, but if none are
running and all the PGs needing backfill are in this state, a restart
isn't terribly lightweight but will provide the kick needed.
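So, roughly (the ratio and osd id here are just examples, and the restart
command depends on your init system):

  # raise the threshold on the running OSDs
  ceph tell 'osd.*' injectargs '--osd-backfill-full-ratio 0.90'

  # if the backfill_toofull PGs never get retried, restarting one of the
  # acting OSDs for those PGs provides the kick
  sudo service ceph restart osd.12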
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 1:25 PM, Henrik Korkuc  wrote:
> well, at least for me it is live-updateable (0.80.1). It may be that
> during recovery the OSDs are currently backfilling other PGs, so the stats are
> not updated (because the stuck PGs were not retried for backfill after the setting change).
>
> On 2014.06.30 18:31, Gregory Farnum wrote:
>> It looks like that value isn't live-updateable, so you'd need to
>> restart after changing the daemon's config. Sorry!
>> Made a ticket: http://tracker.ceph.com/issues/8695
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Mon, Jun 30, 2014 at 12:41 AM, Kostis Fardelas  
>> wrote:
>>> Hi,
>>> during PGs remapping, the cluster recovery process sometimes gets
>>> stuck on PGs with backfill_toofull state. The obvious solution is to
>>> reweight the impacted OSD until we add new OSDs to the cluster. In
>>> order to force the remapping process to complete asap we try to inject
>>> a higher value on "osd_backfill_full_ratio" tunable (by default on
>>> 85%). However, after applying the higher backfill full ratio values,
>>> the remapping does not seem to start and continues to be stuck with
>>> backfill_toofull PGs. Is there something more we should try?
>>>
>>> Thanks,
>>> Kostis
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iscsi and cache pool

2014-07-01 Thread Gregory Farnum
It looks like you're using a kernel RBD mount in the second case? I imagine
your kernel doesn't support caching pools and you'd need to upgrade for it
to work.
-Greg

On Tuesday, July 1, 2014, Никитенко Виталий  wrote:

> Good day!
> I have a server with Ubuntu 14.04 and ceph firefly installed. I configured
> main_pool (2 OSDs) and ssd_pool (1 SSD OSD). I want to use ssd_pool as a cache
> pool for main_pool
>
>   ceph osd tier add main_pool ssd_pool
>   ceph osd tier cache-mode ssd_pool writeback
>   ceph osd tier set-overlay main_pool ssd_pool
>
>   ceph osd pool set ssd_pool hit_set_type bloom
>   ceph osd pool set ssd_pool hit_set_count 1
>   ceph osd pool set ssd_pool hit_set_period 600
>   ceph osd pool set ssd_pool target_max_bytes 1000
>
>  If I use tgt as:
>  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype
> rbd --backing-store main_pool/store_main --bsopts "conf=/etc/ceph/ceph.conf"
>  and then connect from the iSCSI initiator to this Lun1, I see that ssd_pool
> is used as a cache (I can see it through iostat -x 1), but the speed is slow
>
>  If I use tgt as (or others such as scst, iscsitarget):
>  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 -b
> /dev/rbd1 (where rbd1=main_pool/store_main)
>  and then connect from the iSCSI initiator to this Lun1, I see that ssd_pool
> is not used, and writes go straight through to the 2 OSDs
>
>  Help me, has anyone gotten iSCSI working with a cache pool?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN active+degraded on fresh install CENTOS 6.5

2014-07-01 Thread Gregory Farnum
What's the output of "ceph osd map"?

Your CRUSH map probably isn't trying to segregate properly, with 2
hosts and 4 OSDs each.
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Jul 1, 2014 at 11:22 AM, Brian Lovett
 wrote:
> I'm pulling my hair out with ceph. I am testing things with a 5 server
> cluster. I have 3 monitors, and two storage machines each with 4 osd's. I
> have started from scratch 4 times now, and can't seem to figure out how to
> get a clean status. Ceph health reports:
>
> HEALTH_WARN 34 pgs degraded; 192 pgs stuck unclean; recovery 40/60 objects
> degraded (66.667%)
>
> ceph status reports:
>
> cluster 99567882-2e01-4dec-8ca5-692e439a5a47
>  health HEALTH_WARN 34 pgs degraded; 192 pgs stuck unclean; recovery
> 40/60 objects degraded (66.667%)
>  monmap e2: 3 mons at
> {monitor01=192.168.1.200:6789/0,monitor02=192.168.1.201:6789/0,monitor03=192
> .168.1.202:6789/0}, election epoch 8, quorum 0,1,2
> monitor01,monitor02,monitor03
>  mdsmap e4: 1/1/1 up {0=monitor01.mydomain.com=up:active}
>  osdmap e49: 8 osds: 8 up, 8 in
>   pgmap v85: 192 pgs, 3 pools, 1884 bytes data, 20 objects
> 297 MB used, 14856 GB / 14856 GB avail
> 40/60 objects degraded (66.667%)
>1 active
>   34 active+degraded
>  157 active+remapped
>
>
> My ceph.conf contains the following:
>
> [default]
> osd_pool_default_size = 2
>
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> mon_host = 192.168.1.200,192.168.1.201,192.168.1.202
> mon_initial_members = monitor01, monitor02, monitor03
> fsid = 99567882-2e01-4dec-8ca5-692e439a5a47
>
>
>
> Any suggestions are welcome at this point.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN active+degraded on fresh install CENTOS 6.5

2014-07-01 Thread Gregory Farnum
On Tue, Jul 1, 2014 at 11:33 AM, Brian Lovett
 wrote:
> Brian Lovett  writes:
>
>
> I restarted all of the osd's and noticed that ceph shows 2 osd's up even if
> the servers are completely powered down:  osdmap e95: 8 osds: 2 up, 8 in
>
> Why would that be?

The OSDs report each other down much more quickly (~30s) than the
monitor timeout (~15 minutes). They'd get marked down eventually.
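(For reference, the knobs involved are, as far as I recall, roughly these --
the values shown are the defaults, don't change them without a good reason:

  [osd]
  osd heartbeat grace = 20        # peers report an OSD down after ~20-30s of missed heartbeats

  [mon]
  mon osd report timeout = 900    # the monitors mark an OSD down on their own after ~15 minutes
)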

On Tue, Jul 1, 2014 at 11:43 AM, Brian Lovett
 wrote:
> Gregory Farnum  writes:
>
>>
>> What's the output of "ceph osd map"?
>>
>> Your CRUSH map probably isn't trying to segregate properly, with 2
>> hosts and 4 OSDs each.
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>  Is this what you are looking for?
>
> ceph osd map rbd ceph
> osdmap e104 pool 'rbd' (2) object 'ceph' -> pg 2.3482c180 (2.0) -> up ([3,5],
> p3) acting ([3,5,0], p3)

Whoops, I mean "ceph osd list", sorry! (That should output a textual
representation of how they're arranged in the CRUSH map.)


>
> We're bringing on a 3rd host tomorrow with 4 more osd's. Would this correct
> the issue?

There's a good chance, but you're seeing a lot more degraded PGs than
one normally does when it's just a mapping failure, so I'd like to see
a few more details. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN active+degraded on fresh install CENTOS 6.5

2014-07-01 Thread Gregory Farnum
On Tue, Jul 1, 2014 at 11:45 AM, Gregory Farnum  wrote:
> On Tue, Jul 1, 2014 at 11:33 AM, Brian Lovett
>  wrote:
>> Brian Lovett  writes:
>>
>>
>> I restarted all of the osd's and noticed that ceph shows 2 osd's up even if
>> the servers are completely powered down:  osdmap e95: 8 osds: 2 up, 8 in
>>
>> Why would that be?
>
> The OSDs report each other down much more quickly (~30s) than the
> monitor timeout (~15 minutes). They'd get marked down eventually.
>
> On Tue, Jul 1, 2014 at 11:43 AM, Brian Lovett
>  wrote:
>> Gregory Farnum  writes:
>>
>>>
>>> What's the output of "ceph osd map"?
>>>
>>> Your CRUSH map probably isn't trying to segregate properly, with 2
>>> hosts and 4 OSDs each.
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>  Is this what you are looking for?
>>
>> ceph osd map rbd ceph
>> osdmap e104 pool 'rbd' (2) object 'ceph' -> pg 2.3482c180 (2.0) -> up ([3,5],
>> p3) acting ([3,5,0], p3)
>
> Whoops, I mean "ceph osd list", sorry! (That should output a textual
> representation of how they're arranged in the CRUSH map.)

...and one more time, because apparently my brain's out to lunch today:

ceph osd tree

*sigh*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN active+degraded on fresh install CENTOS 6.5

2014-07-01 Thread Gregory Farnum
On Tue, Jul 1, 2014 at 11:57 AM, Brian Lovett
 wrote:
> Gregory Farnum  writes:
>
>> ...and one more time, because apparently my brain's out to lunch today:
>>
>> ceph osd tree
>>
>> *sigh*
>>
>
> haha, we all have those days.
>
> [root@monitor01 ceph]# ceph osd tree
> # id    weight  type name       up/down reweight
> -1  14.48   root default
> -2  7.24host ceph01
> 0   2.72osd.0   up  1
> 1   0.9 osd.1   up  1
> 2   0.9 osd.2   up  1
> 3   2.72osd.3   up  1
> -3  7.24host ceph02
> 4   2.72osd.4   up  1
> 5   0.9 osd.5   up  1
> 6   0.9 osd.6   up  1
> 7   2.72osd.7   up  1
>
> I notice that the weights are all over the place. I was planning on the
> following once I got things going.
>
> 6 1tb ssd osd's (across 3 hosts) as a writeback cache pool, and 6 3tb sata's
> behind them in another pool for data that isn't accessed as often.

So those disks are actually different sizes, in proportion to their
weights? It could be having an impact on this, although it *shouldn't*
be an issue. And your tree looks like it's correct, which leaves me
thinking that something is off about your crush rules. :/
Anyway, having looked at that, what are your crush rules? ("ceph osd
crush dump" will provide that and some other useful data in json
format. I checked the command this time.)
And can you run "ceph pg dump" and put that on pastebin for viewing?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN active+degraded on fresh install CENTOS 6.5

2014-07-01 Thread Gregory Farnum
On Tue, Jul 1, 2014 at 1:26 PM, Brian Lovett
 wrote:
>   "profile": "bobtail",

Okay. That's unusual. What's the oldest client you need to support,
and what Ceph version are you using? You probably want to set the
crush tunables to "optimal"; the "bobtail" ones are going to have all
kinds of issues with a small map like this. (Specifically, a map where
the number of buckets/items at each level is similar to the number of
requested replicas.)
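If nothing older than that needs to talk to the cluster, something like this
should do it (expect some data movement afterwards, and note that old kernel
clients may not understand the newer tunables):

  ceph osd crush tunables optimal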
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?

2014-07-01 Thread Gregory Farnum
On Thu, Jun 26, 2014 at 11:49 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi Greg,
>
> Am 26.06.2014 02:17, schrieb Gregory Farnum:
>> Sorry we let this drop; we've all been busy traveling and things.
>>
>> There have been a lot of changes to librados between Dumpling and
>> Firefly, but we have no idea what would have made it slower. Can you
>> provide more details about how you were running these tests?
>
> it's just a normal fio run:
> fio --ioengine=rbd --bs=4k --name=foo --invalidate=0
> --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor
> --runtime=90 --numjobs=32 --direct=1 --group
>
> Running one time with firefly libs and one time with dumpling libs.
> Traget is always the same pool on a firefly ceph storage.

What's the backing cluster you're running against? What kind of CPU
usage do you see with both? 25k IOPS is definitely getting up there,
but I'd like some guidance about whether we're looking for a reduction
in parallelism, or an increase in per-op costs, or something else.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iscsi and cache pool

2014-07-01 Thread Gregory Farnum
Yeah, the features are new from January or something so you need a
very new kernel to support it. There are no options to set.
But in general I wouldn't use krbd if you can use librbd instead; it's
easier to update and more featureful!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Jul 1, 2014 at 5:44 PM, Никитенко Виталий  wrote:
> Hi!
>
> Is there some option in the kernel which must be enabled, or should I just
> upgrade to the latest version of the kernel? I use 3.13.0-24
>
> Thanks
>
> 01.07.2014, 20:17, "Gregory Farnum" :
>
> It looks like you're using a kernel RBD mount in the second case? I imagine
> your kernel doesn't support caching pools and you'd need to upgrade for it
> to work.
> -Greg
>
> On Tuesday, July 1, 2014, Никитенко Виталий  wrote:
>
> Good day!
> I have server with Ubunu 14.04 and installed ceph firefly. Configured
> main_pool (2 osd) and ssd_pool (1 ssd osd). I want use ssd_pool as cache
> pool for main_pool
>
>   ceph osd tier add main_pool ssd_pool
>   ceph osd tier cache-mode ssd_pool writeback
>   ceph osd tier set-overlay main_pool ssd_pool
>
>   ceph osd pool set ssd_pool hit_set_type bloom
>   ceph osd pool set ssd_pool hit_set_count 1
>   ceph osd pool set ssd_pool hit_set_period 600
>   ceph osd pool set ssd_pool target_max_bytes 1000
>
>  If use tgt as:
>  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype rbd
> --backing-store main_pool/store_main --bsopts "conf=/etc/ceph/ceph.conf"
>  and then connected from iscsi initiator to this Lun1, i see that ssd_pool
> is used as cache (i see through iostat -x 1) but slow speed
>
>  If use tgt as (or other sush as scst, iscsitarget):
>  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 -b /dev/rbd1
> (where rbd1=main_pool/store_main)
>  and then connected from iscsi initiator to this Lun1, i see that ssd_pool
> is not used, that write through to 2 osd
>
>  Help me, anyone work this iscsi and cache pool?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?

2014-07-02 Thread Gregory Farnum
Yeah, it's fighting for attention with a lot of other urgent stuff. :(

Anyway, even if you can't look up any details or reproduce at this
time, I'm sure you know what shape the cluster was (number of OSDs,
running on SSDs or hard drives, etc), and that would be useful
guidance. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 6:12 AM, Stefan Priebe - Profihost AG
 wrote:
>
> Am 02.07.2014 15:07, schrieb Haomai Wang:
>> Could you give some perf counter from rbd client side? Such as op latency?
>
> Sorry haven't any counters. As this mail was some days unseen - i
> thought nobody has an idea or could help.
>
> Stefan
>
>> On Wed, Jul 2, 2014 at 9:01 PM, Stefan Priebe - Profihost AG
>>  wrote:
>>> Am 02.07.2014 00:51, schrieb Gregory Farnum:
>>>> On Thu, Jun 26, 2014 at 11:49 PM, Stefan Priebe - Profihost AG
>>>>  wrote:
>>>>> Hi Greg,
>>>>>
>>>>> Am 26.06.2014 02:17, schrieb Gregory Farnum:
>>>>>> Sorry we let this drop; we've all been busy traveling and things.
>>>>>>
>>>>>> There have been a lot of changes to librados between Dumpling and
>>>>>> Firefly, but we have no idea what would have made it slower. Can you
>>>>>> provide more details about how you were running these tests?
>>>>>
>>>>> it's just a normal fio run:
>>>>> fio --ioengine=rbd --bs=4k --name=foo --invalidate=0
>>>>> --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor
>>>>> --runtime=90 --numjobs=32 --direct=1 --group
>>>>>
>>>>> Running one time with firefly libs and one time with dumpling libs.
>>>>> Traget is always the same pool on a firefly ceph storage.
>>>>
>>>> What's the backing cluster you're running against? What kind of CPU
>>>> usage do you see with both? 25k IOPS is definitely getting up there,
>>>> but I'd like some guidance about whether we're looking for a reduction
>>>> in parallelism, or an increase in per-op costs, or something else.
>>>
>>> Hi Greg,
>>>
>>> i don't have that test cluster anymore. It had to go into production
>>> with dumpling.
>>>
>>> So i can't tell you.
>>>
>>> Sorry.
>>>
>>> Stefan
>>>
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majord...@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues upgrading from 0.72.x (emperor) to 0.81.x (firefly)

2014-07-02 Thread Gregory Farnum
On Wed, Jul 2, 2014 at 6:18 AM, Sylvain Munaut
 wrote:
> Hi,
>
>
> I'm having a couple of issues during this update. On the test cluster
> it went fine, but when running it on production I have a few issues.
> (I guess there is some subtle difference I missed, I updated the test
> one back when emperor came out).
>
> For reference, I'm on ubuntu precise, I use self-built packages
> (because I'm hitting bugs that are not fixed in the latest official
> ones, but there is no change whatsoever to the debian/ directory
> except the changelog and they're built with the dpkg-buildpackage). I
> did a 'apt-get dist-upgrade' to upgrade everything despite the new
> requirements.
>
>
> * The first one is essentially the same as
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/19632
>
> dpkg: error processing
> /var/cache/apt/archives/ceph-common_0.80.1-1we3_amd64.deb (--unpack):
>  trying to overwrite '/etc/ceph/rbdmap', which is also in package ceph
> 0.80.1-1we3
>
> apt complained about /etc/ceph/rbdmap being in two package and refused
> to go further. I ended up using -o Dpkg::Options::="--force-overwrite"
>  to force it to go on (because it just left some weird inconsistent
> state and I needed to clean up the mess), but this seems wrong.
>
>
> * The second one is that apparently it ran a "rm /etc/ceph" somehow
> ... on my setup this is not a directory, but a symlink to the real
> place the config is stored (the root partition is considered
> 'expendable', so machine specific config is elsewhere). It also tried
> to erase the /var/log/ceph but failed:

I can't help you with packaging issues, but I can tell you that the
rbdmap executable got moved to a different package at some point, but
I believe the official ones handle it properly.
And I'm just guessing here (like I said, can't help with packaging),
but I think the deleted /etc/ceph is a result of the force-overwrite
option you used.

>
> ---
> Replacing files in old package ceph-common ...
> dpkg: warning: unable to delete old directory '/var/log/ceph':
> Directory not empty
> ---
>
>
> * And finally the upgraded monitor can't join the existing quorum.
> Nowhere in the firefly update notes does it say that the new mon can't
> join an old quorum. When this was the case back in dumpling, there was
> a very explicit explanation but here it just doesn't join and spits
> out "pipe fault" in the logs continuously.
>
> Now it might be "normal", but being the production cluster, I can't
> risk and upgrading more than half the mons if I'm not sure this is
> indeed normal and not a symptom that the install/update failed and
> that the mon is not actually working.

That's not normal. A first guess is that you didn't give the new
monitor the same keyring as the old ones, but I couldn't say for sure
without more info. Turn up logging and post it somewhere?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
>
>
> Cheers,
>
> Sylvain
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?

2014-07-02 Thread Gregory Farnum
On Wed, Jul 2, 2014 at 12:00 PM, Stefan Priebe  wrote:
>
> Am 02.07.2014 16:00, schrieb Gregory Farnum:
>
>> Yeah, it's fighting for attention with a lot of other urgent stuff. :(
>>
>> Anyway, even if you can't look up any details or reproduce at this
>> time, I'm sure you know what shape the cluster was (number of OSDs,
>> running on SSDs or hard drives, etc), and that would be useful
>> guidance. :)
>
>
> Sure
>
> Number of OSDs: 24
> Each OSD has an SSD capable of doing tested with fio before installing ceph
> (70.000 iop/s 4k write, 580MB/s seq. write 1MB blocks)
>
> Single Xeon E5-1620 v2 @ 3.70GHz
>
> 48GB RAM

Awesome, thanks.

I went through the changelogs on the librados/, osdc/, and msg/
directories to see if I could find any likely change candidates
between Dumpling and Firefly and couldn't see any issues. :( But I
suspect that the sharding changes coming will more than make up the
difference, so you might want to plan on checking that out when it
arrives, even if you don't want to deploy it to production.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?

2014-07-02 Thread Gregory Farnum
On Wed, Jul 2, 2014 at 12:44 PM, Stefan Priebe  wrote:
> Hi Greg,
>
> Am 02.07.2014 21:36, schrieb Gregory Farnum:
>>
>> On Wed, Jul 2, 2014 at 12:00 PM, Stefan Priebe 
>> wrote:
>>>
>>>
>>> Am 02.07.2014 16:00, schrieb Gregory Farnum:
>>>
>>>> Yeah, it's fighting for attention with a lot of other urgent stuff. :(
>>>>
>>>> Anyway, even if you can't look up any details or reproduce at this
>>>> time, I'm sure you know what shape the cluster was (number of OSDs,
>>>> running on SSDs or hard drives, etc), and that would be useful
>>>> guidance. :)
>>>
>>>
>>>
>>> Sure
>>>
>>> Number of OSDs: 24
>>> Each OSD has an SSD capable of doing tested with fio before installing
>>> ceph
>>> (70.000 iop/s 4k write, 580MB/s seq. write 1MB blocks)
>>>
>>> Single Xeon E5-1620 v2 @ 3.70GHz
>>>
>>> 48GB RAM
>>
>>
>> Awesome, thanks.
>>
>> I went through the changelogs on the librados/, osdc/, and msg/
>> directories to see if I could find any likely change candidates
>> between Dumpling and Firefly and couldn't see any issues. :( But I
>> suspect that the sharding changes coming will more than make up the
>> difference, so you might want to plan on checking that out when it
>> arrives, even if you don't want to deploy it to production.n
>
>
> To which changes do you refer? Will they be part or backported of/to
> firefly?

Yehuda's got a pretty big patchset that is sharding up the "big
Objecter lock" into many smaller mutexes and RWLocks that will make it
much more parallel. He's on vacation just now but I understand it's
almost ready to merge; I don't think it'll be suitable for backport to
firefly, though (it's big).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread Gregory Farnum
It looks like you're just putting in data faster than your cluster can
handle (in terms of IOPS).
The first big hole (queue_op_wq->reached_pg) is it sitting in a queue
and waiting for processing. The second parallel blocks are
1) write_thread_in_journal_buffer->journaled_completion_queued, and
that is again a queue while it's waiting to be written to disk,
2) waiting for subops from [19,9]->sub_op_commit_received(x2) is
waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices
live in one object, so every write has to touch the same set of OSDs
(twice! to mark an object as "putting", and "put"). 2*3/360 = 166,
which is probably past what those disks can do, and artificially
increasing the latency.
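If you want to watch this from the OSD side, you can poll the slow-op history
from the admin socket of the OSDs holding the bucket index PG (a rough sketch;
osd.19 is just one of the replicas from your dump, adjust the id and path):

  # on the host running osd.19
  ceph --admin-daemon /var/run/ceph/ceph-osd.19.asok dump_historic_ops

If most of the multi-second entries are rgw.bucket_prepare_op (and its
completion counterpart) against the same .dir.* object, the single bucket
index object is the bottleneck.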
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com  wrote:
> hi, everyone
>
> when I use rest-bench to test RGW with the cmd: rest-bench --access-key=ak
> --secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup
> write
>
> I found that the call to the method "bucket_prepare_op" in RGW is very slow, so I
> looked at 'dump_historic_ops' and saw this:
> { "description": "osd_op(client.4211.0:265984 .dir.default.4148.1 [call
> rgw.bucket_prepare_op] 3.b168f3d0 e37)",
>   "received_at": "2014-07-03 11:07:02.465700",
>   "age": "308.315230",
>   "duration": "3.401743",
>   "type_data": [
> "commit sent; apply or cleanup",
> { "client": "client.4211",
>   "tid": 265984},
> [
> { "time": "2014-07-03 11:07:02.465852",
>   "event": "waiting_for_osdmap"},
> { "time": "2014-07-03 11:07:02.465875",
>   "event": "queue op_wq"},
> { "time": "2014-07-03 11:07:03.729087",
>   "event": "reached_pg"},
> { "time": "2014-07-03 11:07:03.729120",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.729126",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.804366",
>   "event": "waiting for subops from [19,9]"},
> { "time": "2014-07-03 11:07:03.804431",
>   "event": "commit_queued_for_journal_write"},
> { "time": "2014-07-03 11:07:03.804509",
>   "event": "write_thread_in_journal_buffer"},
> { "time": "2014-07-03 11:07:03.934419",
>   "event": "journaled_completion_queued"},
> { "time": "2014-07-03 11:07:05.297282",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.297319",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.311217",
>   "event": "op_applied"},
> { "time": "2014-07-03 11:07:05.867384",
>   "event": "op_commit finish lock"},
> { "time": "2014-07-03 11:07:05.867385",
>   "event": "op_commit"},
> { "time": "2014-07-03 11:07:05.867424",
>   "event": "commit_sent"},
> { "time": "2014-07-03 11:07:05.867428",
>   "event": "op_commit finish"},
> { "time": "2014-07-03 11:07:05.867443",
>   "event": "done"}]]}]}
>
> So I see two areas of performance degradation: one is from "queue op_wq" to
> "reached_pg", the other is from "journaled_completion_queued" to "op_commit".
> And I must stress that there are very many ops writing to this one bucket
> object, so how can I reduce the latency?
>
>
> 
> baijia...@126.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pools do not respond

2014-07-03 Thread Gregory Farnum
The PG in question isn't being properly mapped to any OSDs. There's a
good chance that those trees (with 3 OSDs in 2 hosts) aren't going to
map well anyway, but the immediate problem should resolve itself if
you change the "choose" to "chooseleaf" in your rules.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo  wrote:
> Hi folks,
>   I am following the test installation step by step, and checking some
> configuration before trying to deploy a production cluster.
>
>   Now I have a healthy cluster with 3 mons + 4 OSDs.
>   I have created one pool containing all the osd.x, plus two more pools: one
> for two of the servers and the other for the remaining two.
>
>   The general pool works fine (I can create images and mount them on remote
> machines).
>
>   But the other two do not work (the commands rados put, or rbd ls "pool",
> hang forever).
>
>   this is the tree:
>
>[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
> # id weight type name up/down reweight
> -7 5.4 root 4x1GbFCnlSAS
> -3 2.7 host node04
> 1 2.7 osd.1 up 1
> -4 2.7 host node03
> 2 2.7 osd.2 up 1
> -6 8.1 root 4x4GbFCnlSAS
> -5 5.4 host node01
> 3 2.7 osd.3 up 1
> 4 2.7 osd.4 up 1
> -2 2.7 host node04
> 0 2.7 osd.0 up 1
> -1 13.5 root default
> -2 2.7 host node04
> 0 2.7 osd.0 up 1
> -3 2.7 host node04
> 1 2.7 osd.1 up 1
> -4 2.7 host node03
> 2 2.7 osd.2 up 1
> -5 5.4 host node01
> 3 2.7 osd.3 up 1
> 4 2.7 osd.4 up 1
>
>
> And this is the crushmap:
>
> ...
> root 4x4GbFCnlSAS {
> id -6 #do not change unnecessarily
> alg straw
> hash 0  # rjenkins1
> item node01 weight 5.400
> item node04 weight 2.700
> }
> root 4x1GbFCnlSAS {
> id -7 #do not change unnecessarily
> alg straw
> hash 0  # rjenkins1
> item node04 weight 2.700
> item node03 weight 2.700
> }
> # rules
> rule 4x4GbFCnlSAS {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take 4x4GbFCnlSAS
> step choose firstn 0 type host
> step emit
> }
> rule 4x1GbFCnlSAS {
> ruleset 2
> type replicated
> min_size 1
> max_size 10
> step take 4x1GbFCnlSAS
> step choose firstn 0 type host
> step emit
> }
> ..
> I of course set the crush_rules:
> sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
> sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1
>
> but it seems that something is wrong (4x4GbFCnlSAS.pool is a 512MB file):
>sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object
> 4x4GbFCnlSAS.pool
> !!HANGS forever!
>
> from the ceph-client the same thing happens:
>  rbd ls cloud-4x1GbFCnlSAS
>  !!HANGS forever!
>
>
> [root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
> 4x1GbFCnlSAS.object
> osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg
> 3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)
>
> Any idea what i am doing wrong??
>
> Thanks in advance, I
> Bertrand Russell:
> "El problema con el mundo es que los estúpidos están seguros de todo y los
> inteligentes están llenos de dudas"
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)

2014-07-03 Thread Gregory Farnum
On Wed, Jul 2, 2014 at 3:06 PM, Marc  wrote:
> Hi,
>
> I was wondering, having a cache pool in front of an RBD pool is all fine
> and dandy, but imagine you want to pull backups of all your VMs (or one
> of them, or multiple...). Going to the cache for all those reads isn't
> only pointless, it'll also potentially fill up the cache and possibly
> evict actually frequently used data. Which got me thinking... wouldn't
> it be nifty if there was a special way of doing specific backup reads
> where you'd bypass the cache, ensuring the dirty cache contents get
> written to cold pool first? Or at least doing special reads where a
> cache-miss won't actually cache the requested data?

Yeah, these are nifty features but the cache coherency implications
are a bit difficult. More options will come as we are able to develop
and (more importantly, by far) validate them.
-Greg

>
> AFAIK the backup routine for an RBD-backed KVM usually involves creating
> a snapshot of the RBD and putting that into a backup storage/tape, all
> done via librbd/API.
>
> Maybe something like that even already exists?
>
>
> KR,
> Marc
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why lock th whole osd handle thread

2014-07-03 Thread Gregory Farnum
On Thu, Jul 3, 2014 at 8:24 AM, baijia...@126.com  wrote:
> When I look at the function "OSD::OpWQ::_process", I see that the pg lock
> covers the whole function. So when I use multiple threads to write to the
> same object, must they be serialized all the way from the osd op-handling
> thread to the journal write thread?

It's serialized while processing the write, but that doesn't include
the wait time for the data to be placed on disk — merely sequencing it
and feeding it into the journal queue. Writes have to be ordered, so
that's not likely to change.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pools do not respond

2014-07-03 Thread Gregory Farnum
On Thu, Jul 3, 2014 at 11:17 AM, Iban Cabrillo  wrote:
> Hi Gregory,
>   Thanks a lot I begin to understand who ceph works.
>   I add a couple of osd servers, and balance the disk between them.
>
>
> [ceph@cephadm ceph-cloud]$ sudo ceph osd tree
> # id   weight  type name           up/down reweight
> -7     16.2    root 4x1GbFCnlSAS
> -9     5.4             host node02
> 7      2.7                     osd.7   up      1
> 8      2.7                     osd.8   up      1
> -4     5.4             host node03
> 2      2.7                     osd.2   up      1
> 9      2.7                     osd.9   up      1
> -3     5.4             host node04
> 1      2.7                     osd.1   up      1
> 10     2.7                     osd.10  up      1
> -6     16.2    root 4x4GbFCnlSAS
> -5     5.4             host node01
> 3      2.7                     osd.3   up      1
> 4      2.7                     osd.4   up      1
> -8     5.4             host node02
> 5      2.7                     osd.5   up      1
> 6      2.7                     osd.6   up      1
> -2     5.4             host node04
> 0      2.7                     osd.0   up      1
> 11     2.7                     osd.11  up      1
> -1     32.4    root default
> -2     5.4             host node04
> 0      2.7                     osd.0   up      1
> 11     2.7                     osd.11  up      1
> -3     5.4             host node04
> 1      2.7                     osd.1   up      1
> 10     2.7                     osd.10  up      1
> -4     5.4             host node03
> 2      2.7                     osd.2   up      1
> 9      2.7                     osd.9   up      1
> -5     5.4             host node01
> 3      2.7                     osd.3   up      1
> 4      2.7                     osd.4   up      1
> -8     5.4             host node02
> 5      2.7                     osd.5   up      1
> 6      2.7                     osd.6   up      1
> -9     5.4             host node02
> 7      2.7                     osd.7   up      1
> 8      2.7                     osd.8   up      1
>
> The idea is to have at least 4 servers and 3 disks (2.7 TB, SAN attached) per
> server per pool.
> Now I have to adjust the pg and pgp numbers and run some performance tests.
>
> P.S. What is the difference between choose and chooseleaf?

"choose" instructs the system to choose N different buckets of the
given type (where N is specified by the "firstn 0" block to be the
replication level, but could be 1: "firstn 1", or replication - 1:
"firstn -1"). Since you're saying "choose firstn 0 type host", that's
what you're getting out, and then you're emitting those 3 (by default)
hosts. But they aren't valid "devices" (OSDs), so it's not a valid
mapping; you're supposed to then say "choose firstn 1 device" or
similar.
"chooseleaf" instead tells the system to choose N different buckets,
and then descend from each of those buckets to a leaf ("device") in
the CRUSH hierarchy. It's a little more robust against different
mappings and failure conditions, so generally a better choice than
"choose" if you don't need the finer granularity provided by choose.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error initializing cluster client: Error

2014-07-07 Thread Gregory Farnum
Do you have a ceph.conf file that the "ceph" tool can access in a
known location? Try specifying it manually with the "-c ceph.conf"
argument. You can also add "--debug-ms 1, --debug-monc 10" and see if
it outputs more useful error logs.
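Something along these lines, assuming the standard config path (adjust to
wherever your ceph.conf actually lives):

  ceph -c /etc/ceph/ceph.conf --debug-ms 1 --debug-monc 10 -s
  # or point it straight at one of your monitors
  ceph -c /etc/ceph/ceph.conf -m 10.92.8.81:6789 --debug-ms 1 --debug-monc 10 status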
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sat, Jul 5, 2014 at 2:23 AM, Pavel V. Kaygorodov  wrote:
> Hi!
>
> I still have the same problem with "Error initializing cluster client: Error" 
> on all monitor nodes:
>
> root@bastet-mon2:~# ceph -w
> Error initializing cluster client: Error
>
> root@bastet-mon2:~# ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok 
> mon_status
> { "name": "2",
>   "rank": 1,
>   "state": "peon",
>   "election_epoch": 1566,
>   "quorum": [
> 0,
> 1,
> 2],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 3,
>   "fsid": "fffeafa2-a664-48a7-979a-517e3ffa0da1",
>   "modified": "2014-06-19 18:16:01.074917",
>   "created": "2014-06-19 18:14:43.350501",
>   "mons": [
> { "rank": 0,
>   "name": "1",
>   "addr": "10.92.8.80:6789\/0"},
> { "rank": 1,
>   "name": "2",
>   "addr": "10.92.8.81:6789\/0"},
> { "rank": 2,
>   "name": "3",
>   "addr": "10.92.8.82:6789\/0"}]}}
>
> root@bastet-mon2:~# ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok 
> quorum_status
> { "election_epoch": 1566,
>   "quorum": [
> 0,
> 1,
> 2],
>   "quorum_names": [
> "1",
> "2",
> "3"],
>   "quorum_leader_name": "1",
>   "monmap": { "epoch": 3,
>   "fsid": "fffeafa2-a664-48a7-979a-517e3ffa0da1",
>   "modified": "2014-06-19 18:16:01.074917",
>   "created": "2014-06-19 18:14:43.350501",
>   "mons": [
> { "rank": 0,
>   "name": "1",
>   "addr": "10.92.8.80:6789\/0"},
> { "rank": 1,
>   "name": "2",
>   "addr": "10.92.8.81:6789\/0"},
> { "rank": 2,
>   "name": "3",
>   "addr": "10.92.8.82:6789\/0"}]}}
>
> root@bastet-mon2:~# ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok version
> {"version":"0.80.1"}
>
> /
>
> The same situation on all 3 monitor nodes, but the cluster is alive and all 
> clients works fine.
> Any ideas how to fix this?
>
> Pavel.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2014-07-07 Thread Gregory Farnum
What was the exact sequence of events — were you rebalancing when you
did the upgrade? Did the marked out OSDs get upgraded?
Did you restart all the monitors prior to changing the tunables? (Are
you *sure*?)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sat, Jul 5, 2014 at 10:31 PM, James Harper  wrote:
>>
>> I have 4 physical boxes each running 2 OSD's. I needed to retire one so I set
>> the 2 OSD's on it to 'out' and everything went as expected. Then I noticed
>> that 'ceph health' was reporting that my crush map had legacy tunables. The
>> release notes told me I needed to do 'ceph osd crush tunables optimal' to fix
>> this, and I wasn't running any old kernel clients, so I made it so. Shortly 
>> after
>> that, my OSD's started dying until only one remained. I eventually figured 
>> out
>> that they would stay up until I started the OSD's on the 'out' node. I hadn't
>> made the connection to the tunables until I turned up an old mailing list 
>> post,
>> but sure enough setting the tunables back to legacy got everything stable
>> again. I assume that the churn introduced by 'optimal' resulted in the
>> situation where the 'out' node stored the only copy of some data, because
>> there were down pgs until I got all the OSD's running again
>>
>
> Forgot to add, on the 'out' node, the following would be logged in the osd 
> logfile:
>
> 7f5688e59700 -1 osd/PG.cc: In function 'void PG::fulfill_info(pg_shard_t, 
> const pg_query_t&, std::pair&)' thread 7f5688e59700 
> time 2014-07-05 21:47:51.595687
> osd/PG.cc: 4424: FAILED assert(from == primary)
>
> and in the others when they crashed:
>
> 7fdcb9600700 -1 osd/PG.cc: In function 
> 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state  PG::RecoveryState::RecoveryMachine>::my_context)' thread 7fdcb9600700 time 
> 2014-07-05 21:14:57.260547
> osd/PG.cc: 5307: FAILED assert(0 == "we got a bad state machine event")
> (sometimes that would appear in the 'out' node too).
>
> Even after the rebalance is complete and the old node is completely retired,  
> with one node down and 2 still running (as a test), I get a very small number 
> (0.006%) of "unfound" pg's. This is a bit of a worry...
>
> James
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor -> firefly : Significant increase in RAM usage

2014-07-07 Thread Gregory Farnum
We don't test explicitly for this, but I'm surprised to hear about a
jump of that magnitude. Do you have any more detailed profiling? Can
you generate some? (With the tcmalloc heap dumps.)
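If it helps, this is roughly how to collect them via the tell interface (a
sketch; osd.0 is just an example id, and the dump lands in that OSD's log
directory):

  ceph tell osd.0 heap start_profiler   # begin tcmalloc sampling
  ceph tell osd.0 heap dump             # write a .heap file for later analysis
  ceph tell osd.0 heap stats            # quick summary without a full dump
  ceph tell osd.0 heap stop_profiler

The .heap files can then be inspected with google-pprof (or pprof) against the
ceph-osd binary.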
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jul 7, 2014 at 3:03 AM, Sylvain Munaut
 wrote:
> Hi,
>
>
>> We actually saw a decrease in memory usage after upgrading to Firefly,
>> though we did reboot the nodes after the upgrade while we had the
>> maintenance window. This is with 216 OSDs total (32-40 per node):
>> http://i.imgur.com/BC7RuXJ.png
>
>
> Interesting. Is that cluster for RBD or RGW ?  My RBD OSDs are a bit
> better behaved but still had this 25% bump in mem usage ...
>
>
>
> Here the memory pretty much just grows continually.
>
> This is the log over the last year.
>
> http://i.imgur.com/0NUFjpz.png
>
> At the very beginning (~250M per process) those OSD were empty, just
> added. Then we changed the crushmap to map all the RGW pools we have
> to them, then it just grows slowly with a bump at pretty much each
> update.
>
> And this is a pretty small pool of OSDs, for theses there is only 8
> OSD processes over 4 nodes, storing barely 1 To in 2.5 millions
> objects, split into 7 pools and 5376 PGs (some pools have size=3,
> other size=2)
>
> 1.5 Go per OSD process seems a bit big to me.
>
>
> Cheers,
>
>Sylvain
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] clear active+degraded pgs

2014-07-07 Thread Gregory Farnum
CRUSH is a probabilistic algorithm. By having all those non-existent
OSDs in the map, you made it so that 10/12 attempts at mapping would
fail and need to be retried. CRUSH handles a lot of retries, but not
enough for that to work out well.
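If you want to sanity-check a CRUSH map for that kind of problem offline, you
can test it before (or after) loading it (a sketch, with example rule and
replica numbers):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings

Every line crushtool prints there is an input that failed to map to the
requested number of OSDs.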
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jul 7, 2014 at 4:09 AM, hua peng  wrote:
> Hi,
>
> I have resolved it by run these:
> root@ceph2:~# ceph osd crush rm osd.0
> removed item id 0 name 'osd.0' from crush map
> root@ceph2:~# ceph osd crush rm osd.1
> removed item id 1 name 'osd.1' from crush map
> root@ceph2:~# ceph osd crush rm osd.2
> removed item id 2 name 'osd.2' from crush map
> root@ceph2:~# ceph osd crush rm osd.3
> removed item id 3 name 'osd.3' from crush map
> root@ceph2:~# ceph osd crush rm osd.4
> removed item id 4 name 'osd.4' from crush map
> root@ceph2:~# ceph osd crush rm osd.5
> removed item id 5 name 'osd.5' from crush map
> root@ceph2:~# ceph osd crush rm osd.6
> removed item id 6 name 'osd.6' from crush map
> root@ceph2:~# ceph osd crush rm osd.7
> removed item id 7 name 'osd.7' from crush map
> root@ceph2:~# ceph osd crush rm osd.8
> removed item id 8 name 'osd.8' from crush map
> root@ceph2:~# ceph osd crush rm osd.9
> removed item id 9 name 'osd.9' from crush map
>
> Though I am not sure why it resolved my problem...
> Now it's oK:
>
> root@ceph2:~# ceph health
> HEALTH_OK
>
> Thanks.
>
>
>> Hi,
>>
>> I have 135 pgs degraded in the system. How can I remove them?
>> They are in test environment, all data are not important.
>>
>> Thanks for the kind helps.
>>
>> root@ceph2:~# ceph osd tree
>>
>> # idweight  type name   up/down reweight
>> -1  0.8398  root default
>> -2  0.8398  host ceph2
>> 0   0.01999 osd.0   DNE
>> 1   0.01999 osd.1   DNE
>> 2   0.07999 osd.2   DNE
>> 3   0.07999 osd.3   DNE
>> 4   0.07999 osd.4   DNE
>> 5   0.07999 osd.5   DNE
>> 6   0.07999 osd.6   DNE
>> 7   0.07999 osd.7   DNE
>> 8   0.07999 osd.8   DNE
>> 9   0.07999 osd.9   DNE
>> 10  0.07999 osd.10  up  1
>> 11  0.07999 osd.11  up  1
>>
>> root@ceph2:~# ceph -s
>> health HEALTH_WARN 135 pgs degraded; 135 pgs stuck unclean
>> monmap e1: 1 mons at {ceph2=172.17.6.176:6789/0}, election epoch 1,
>> quorum 0 ceph2
>> osdmap e95: 2 osds: 2 up, 2 in
>>  pgmap v2351: 192 pgs: 57 active+clean, 135 active+degraded; 0 bytes
>> data, 33243 MB used, 114 GB / 154 GB avail
>> mdsmap e1: 0/0/1 up
>>
>> root@ceph2:~#
>> root@ceph2:~# ceph health
>> HEALTH_WARN 135 pgs degraded; 135 pgs stuck unclean
>> root@ceph2:~#
>> root@ceph2:~# ceph health detail
>> HEALTH_WARN 135 pgs degraded; 135 pgs stuck unclean
>> pg 2.3d is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.3e is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 0.3f is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.3f is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 2.3f is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 2.3e is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 2.39 is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 1.3a is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 0.3b is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 2.38 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 0.3a is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.38 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 0.39 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.39 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 2.34 is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 0.36 is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 2.37 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.34 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 0.35 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.35 is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 0.34 is stuck unclean since forever, current state active+degraded,
>> last acting [10]
>> pg 2.31 is stuck unclean since forever, current state active+degraded,
>> last acting [11]
>> pg 1.32 is stuck unclean sin

Re: [ceph-users] Temporary degradation when adding OSD's

2014-07-07 Thread Gregory Farnum
On Mon, Jul 7, 2014 at 7:03 AM, Erik Logtenberg  wrote:
> Hi,
>
> If you add an OSD to an existing cluster, ceph will move some existing
> data around so the new OSD gets its respective share of usage right away.
>
> Now I noticed that during this moving around, ceph reports the relevant
> PG's as degraded. I can more or less understand the logic here: if a
> piece of data is supposed to be in a certain place (the new OSD), but it
> is not yet there, it's degraded.
>
> However I would hope that the movement of data is executed in such a way
> that first a new copy is made on the new OSD and only after successfully
> doing that, one of the existing copies is removed. If so, there is never
> actually any "degradation" of that PG.
>
> More to the point, if I have a PG replicated over three OSD's: 1, 2 and
> 3; now I add an OSD 4, and ceph decides to move the copy of OSD 3 to the
> new OSD 4; if it turns out that ceph can't read the copies on OSD 1 and
> 2 due to some disk error, I would assume that ceph would still use the
> copy that exists on OSD 3 to populate the copy on OSD 4. Is that indeed
> the case?

Yeah, Ceph will never voluntarily reduce the redundancy. I believe
splitting the "degraded" state into separate "wrongly placed" and
"degraded" (reduced redundancy) states is currently on the menu for
the Giant release, but it's not been done yet.

>
>
> I have a very similar question about removing an OSD. You can tell ceph
> to mark an OSD as "out" before physically removing it. The OSD is still
> "up" but ceph will no longer assign PG's to it, and will make new copies
> of the PG's that are on this OSD to other OSD's.
> Now again ceph will report degradation, even though the "out" OSD is
> still "up", so the existing copies are not actually lost. Does ceph use
> the OSD that is marked "out" as a source for making the new copies on
> other OSD's?

Yep!
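For reference, the usual low-impact retirement sequence looks roughly like this
(osd.3 is just an example id; the service commands vary by distro):

  ceph osd out 3              # data migrates while osd.3 is still up and usable as a source
  ceph -w                     # wait for the cluster to return to HEALTH_OK
  service ceph stop osd.3     # or the init script / upstart equivalent on your platform
  ceph osd crush remove osd.3
  ceph auth del osd.3
  ceph osd rm 3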
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2014-07-07 Thread Gregory Farnum
Okay. Based on your description I think the reason for the tunables
crashes is that either the "out" OSDs, or possibly one of the
monitors, never got restarted. You should be able to update the
tunables now, if you want to. (Or there's also a config option that
will disable the warning; check the release notes.)
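(If memory serves, the option the release notes point at is the one below, but
please verify the exact name in the firefly notes before relying on it:

  [mon]
      mon warn on legacy crush tunables = false

set in ceph.conf on the monitor hosts, followed by a monitor restart.)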
As for why the MDSes (plural? if you have multiple, be aware that's
less stable than a single MDS) were blocked, you might want to check
your CRUSH map and make sure it's segregating replicas across hosts.
I'm betting you knocked out the only copies of some of your PGs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jul 7, 2014 at 3:57 PM, James Harper  wrote:
>>
>> What was the exact sequence of events
>>
>
> Exact sequence of events was:
> . set retiring node OSD's out
> . noticed that mds's were now stuck in 'rejoining' state
> . messed around with restarting mds's but couldn't fix
> . google told me that upgrading ceph resolved such a problem for them
> . upgraded all binaries (apt-get install ...)
> . restarted all mons
> . (noticed that apt-get had grabbed firefly from Jessie instead of dumpling 
> from ceph.com - I thought I might just be grabbing a bugfix for dumpling)
> . restarted all osds
> . restarted all mds's
> . mds's came good and cluster was healthy again (and still moving pg's from 
> retiring node), but getting a  warning about legacy tunables
> . read the release notes for instructions on what the tunables message meant. 
> I am running kernel 3.14 but not using the kernel rbd driver so assume that 
> would be okay (is that correct?). Set tunables to optimal
> . alerts that cluster was degraded with osd's down
> . messed around restarting osd's until I found that the cluster remained 
> stable with the osd's on the retiring node stopped - starting either of the 2 
> osd's on there resulted in the cascade of crashing osd's
> . on a whim, set tunables back to legacy and the cluster became stable again. 
> The pg's all migrated from the retiring node and I removed it from the cluster
>
> It was getting late by then so things got a bit hazy towards the end but I'm 
> pretty sure that's how it all went down. The fact that my mds's got stuck 
> after setting one node out makes me think there is something else at work and 
> it was an indirect force at work that meant legacy=stable and optimal=crashy. 
> I can't see what though - everything had been working great up until that 
> point. I haven't touched the tunables since then so I still get the constant 
> warning.
>
> I'd kind of prefer to stick with the deb's from ceph.com - I hadn't noticed 
> that they were included in Jessie until it was too late, and qemu now depends 
> on them so I guess I'm stuck with the Debian repo versions anyway...
>
> thanks
>
> James
>
>> — were you rebalancing when you
>> did the upgrade? Did the marked out OSDs get upgraded?
>> Did you restart all the monitors prior to changing the tunables? (Are
>> you *sure*?)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Sat, Jul 5, 2014 at 10:31 PM, James Harper 
>> wrote:
>> >>
>> >> I have 4 physical boxes each running 2 OSD's. I needed to retire one so I
>> set
>> >> the 2 OSD's on it to 'out' and everything went as expected. Then I noticed
>> >> that 'ceph health' was reporting that my crush map had legacy tunables.
>> The
>> >> release notes told me I needed to do 'ceph osd crush tunables optimal' to
>> fix
>> >> this, and I wasn't running any old kernel clients, so I made it so. 
>> >> Shortly
>> after
>> >> that, my OSD's started dying until only one remained. I eventually figured
>> out
>> >> that they would stay up until I started the OSD's on the 'out' node. I 
>> >> hadn't
>> >> made the connection to the tunables until I turned up an old mailing list
>> post,
>> >> but sure enough setting the tunables back to legacy got everything stable
>> >> again. I assume that the churn introduced by 'optimal' resulted in the
>> >> situation where the 'out' node stored the only copy of some data,
>> because
>> >> there were down pgs until I got all the OSD's running again
>> >>
>> >
>> > Forgot to add, on the 'out' node, the following would be logged in the osd
>> logfile:
>> >
>> > 7f5688e59700 -1 osd/PG.cc: In function 'void PG::fulfill_info(pg_shard_t,
>> const pg_query_t&, std::pair&)' thread
>> 7f5688e59700 time 2014-07-05 21:47:51.595687
>> > osd/PG.cc: 4424: FAILED assert(from == primary)
>> >
>> > and in the others when they crashed:
>> >
>> > 7fdcb9600700 -1 osd/PG.cc: In function
>> 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state> yState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)'
>> thread 7fdcb9600700 time 2014-07-05 21:14:57.260547
>> > osd/PG.cc: 5307: FAILED assert(0 == "we got a bad state machine event")
>> > (sometimes that would appear in the 'out' node too).
>> >
>> > Even after the rebalance is complete and the old node is completely
>> retired,  wi

Re: [ceph-users] inconsistent pgs

2014-07-07 Thread Gregory Farnum
On Mon, Jul 7, 2014 at 4:21 PM, James Harper  wrote:
>>
>> Okay. Based on your description I think the reason for the tunables
>> crashes is that either the "out" OSDs, or possibly one of the
>> monitors, never got restarted. You should be able to update the
>> tunables now, if you want to. (Or there's also a config option that
>> will disable the warning; check the release notes.)
>
> There was never a monitor on the node with the 'out' OSDs. And even if I 
> forgot to restart the OSD's, they definitely got restarted once things got 
> crashy, although maybe it was too late by then?

Yeah, that's probable.

>> As for why the MDSes (plural? if you have multiple, be aware that's
>> less stable than a single MDS) were blocked, you might want to check
>> your CRUSH map and make sure it's segregating replicas across hosts.
>> I'm betting you knocked out the only copies of some of your PGs.
>
> Yeah I had a question about that. In a setup with 3 (was 4) nodes with 2 
> OSD's on each, why are there a very small number of pg's that only exist on 
> one node? That kind of defeats the purpose. I haven't checked that that's 
> still the case after the migration is all completed, and maybe it was an 
> artefact of the tunables change, but taking one node out completely for a 
> reboot definitely results in 'not found' pg's.

It sounds like maybe you've got a bad CRUSH map if you're seeing that.
One of the things the tunables do is make the algorithm handle a
variety of maps better, but if PGs are only mapping to one OSD you
need to fix that.

> And are you saying that when I took the 2 OSD's on one node 'out' that some 
> pg's were now inaccessible, even though the OSD's with the pg's on them were 
> still running (and that there should have been other OSDs with replicas)? My 
> setup is with 2 replicas.

That's what I'm guessing. An "out" PG cannot be used to serve client
IO, so if there were (improperly) no other replicas elsewhere (as you
just said is the case), the MDS would need to wait for the PG to
migrate before it could do IO against it. (Strictly speaking it
doesn't need to wait for the whole PG, just "enough" of it, but there
are a bunch of throttles that probably prevented even that minimal
amount of data from being moved over while other PGs were backfilled.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2014-07-07 Thread Gregory Farnum
You can look at which OSDs the PGs map to. If the PGs have
insufficient replica counts they'll report as degraded in "ceph -s" or
"ceph -w".
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jul 7, 2014 at 4:30 PM, James Harper  wrote:
>>
>> It sounds like maybe you've got a bad CRUSH map if you're seeing that.
>> One of the things the tunables do is make the algorithm handle a
>> variety of maps better, but if PGs are only mapping to one OSD you
>> need to fix that.
>>
>
> How can I tell that this is definitely the case (all copies of a pg on a 
> single osd or a single node)?
>
> Thanks
>
> James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2014-07-07 Thread Gregory Farnum
On Mon, Jul 7, 2014 at 4:39 PM, James Harper  wrote:
>>
>> You can look at which OSDs the PGs map to. If the PGs have
>> insufficient replica counts they'll report as degraded in "ceph -s" or
>> "ceph -w".
>
> I meant in a general sense. If I have a pg that I suspect might be 
> insufficiently redundant I can look that up, but I'd like to know in advance 
> any pgs that do not have the required spread across osds and nodes.

Any PG which is not replicated according to the dictates of the CRUSH
map will also be marked as "degraded". If there are PGs not placed
like that, and they aren't degraded, your CRUSH map isn't set how you
think it is.

I recommend going over ceph.com/docs and looking at all the pages about CRUSH.
-Greg

>
> Ideally the crush map would ensure the highest level of redundancy, right? no 
> pg should be replicated to the same osd. If there are osd's on other nodes 
> that have sufficient capacity then no pg should be replicated to an osd in 
> the same node. Probably the same for other levels in the hierarchy (rack, 
> etc) too. Is there a health check I can run that can tell me that my cluster 
> is all as it should be?
>
> Thanks
>
> James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error on firefly

2014-07-07 Thread Gregory Farnum
It's not very intuitive or easy to look at right now (there are plans
from the recent developer summit to improve things), but the central
log should have output about exactly what objects are busted. You'll
then want to compare the copies manually to determine which ones are
good or bad, get the good copy on the primary (make sure you preserve
xattrs), and run repair.
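A rough sketch of that workflow (paths, object names, and placeholders like
<objectname> are examples; 3.c6 and OSDs [2,5] are from your health detail):

  # 1) find the objects the scrub flagged, in the central log on a monitor host
  grep ERR /var/log/ceph/ceph.log | grep '3.c6'

  # 2) locate and compare the copies on the acting OSDs (FileStore layout)
  find /var/lib/ceph/osd/ceph-2/current/3.c6_head/ -name '*<objectname>*'
  find /var/lib/ceph/osd/ceph-5/current/3.c6_head/ -name '*<objectname>*'
  md5sum <copy-on-osd-2> <copy-on-osd-5>
  getfattr -d -m '-' <copy-on-osd-2>    # compare xattrs as well

  # 3) with the good copy in place on the primary (osd.2), kick off the repair
  ceph pg repair 3.c6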
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
> Greetings,
>
> I upgraded to firefly last week and I suddenly received this error:
>
> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>
> ceph health detail shows the following:
>
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 3.c6 is active+clean+inconsistent, acting [2,5]
> 1 scrub errors
>
> The docs say that I can run `ceph pg repair 3.c6` to fix this. What I want
> to know is what are the risks of data loss if I run that command in this
> state and how can I mitigate them?
>
> --
> Randall Smith
> Computing Services
> Adams State University
> http://www.adams.edu/
> 719-587-7741
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Throttle pool pg_num/pgp_num increase impact

2014-07-08 Thread Gregory Farnum
The impact won't be 300 times bigger, but it will be bigger. There are two
things impacting your cluster here
1) the initial "split" of the affected PGs into multiple child PGs. You can
mitigate this by stepping through pg_num at small multiples.
2) the movement of data to its new location (when you adjust pgp_num). This
can be adjusted by setting the "OSD max backfills" and related parameters;
check the docs; a rough sketch of both steps follows below.
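Roughly, something along these lines (the numbers are illustrative, not a
recommendation; <pool> is whichever pool you're splitting):

  # throttle recovery/backfill before touching the pool
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

  # 1) split in modest steps, letting the cluster settle between each
  ceph osd pool set <pool> pg_num 2048
  ceph osd pool set <pool> pg_num 3072
  ceph osd pool set <pool> pg_num 4096

  # 2) then move the data by raising pgp_num, again in steps if you like
  ceph osd pool set <pool> pgp_num 4096

  # restore your previous values once HEALTH_OK (the defaults are 10 and 15)
  ceph tell osd.* injectargs '--osd-max-backfills 10 --osd-recovery-max-active 15'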
-Greg

On Tuesday, July 8, 2014, Kostis Fardelas  wrote:

> Hi,
> we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
> used space. We store data objects basically on two pools, the one
> being appr. 300x larger in data stored and # of objects terms than the
> other. Based on the formula provided here
> http://ceph.com/docs/master/rados/operations/placement-groups/ we
> computed that we need to increase our per pool pg_num & pgp_num to
> appr 6300 PGs / pool (100 * 126 / 2).
> We started by increasing the pg & pgp number on the smaller pool from
> 1800 to 2048 PGs (first the pg_num, then the pgp_num) and we
> experienced a 10X increase in Ceph total operations and an appr 3X
> disk latency increase in some underlying OSD disks. At the same time,
> for appr 10 seconds we experienced very low values of client io and
> op/s
>
> Should we be worried that the pg/pgp num increase on the bigger pool
> will have a 300X larger impact?
> Can we throttle this impact by injecting any thresholds or applying an
> appropriate configuration on our ceph conf?
>
> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Throttle pool pg_num/pgp_num increase impact

2014-07-08 Thread Gregory Farnum
On Tue, Jul 8, 2014 at 10:14 AM, Dan Van Der Ster
 wrote:
> Hi Greg,
> We're also due for a similar splitting exercise in the not too distant
> future, and will also need to minimize the impact on latency.
>
> In addition to increasing pg_num in small steps and using a minimal
> max_backfills/recoveries configuration, I was planning to increase pgp_num
> very slowly as well. In fact, I don't mind if the whole splitting exercise
> takes weeks to complete. Do you think that'd work, or are intermediate
> values for pgp_num somehow counterproductive?

Yeah, it should work fine. Depending on how much you're increasing the
values by, it might move some of the data more than once, but that's
the only counterproductive impact of it.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

