[ceph-users] Crush Location

2014-09-08 Thread Jakes John
Hi all,
I have been reading about the ceph-crush-location hook (
http://ceph.com/docs/master/rados/operations/crush-map/ ), which says to add
a crush location field in the conf file to provide location awareness to Ceph
daemons and clients. I would like to understand whether Ceph uses this
location awareness information internally when placing read/write requests
for objects. I couldn't find related details anywhere in the documentation.
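(For reference, what I mean is the per-daemon entry the docs describe, something
along these lines -- the bucket names here are only an example, not a real layout:

[osd.0]
    osd crush location = "root=default rack=rack1 host=node1"

plus optionally an "osd crush location hook" script that prints such a string.)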

Thanks



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] delete performance

2014-09-08 Thread Luis Periquito
Hi,

I've been trying to tweak and improve the performance of our ceph cluster.

One of the operations that I can't seem to improve much is the
delete. From what I've gathered, every delete goes
directly to the HDD, hurting its performance - the op may be recorded in
the journal, but I notice almost no benefit from that.

From my tests (1M files of 512k each), writing the data only takes about 2x as long
as the delete operation - should there be a bigger difference? And whilst the
delete operation is running, all the remaining operations are slower -
it impacts the whole cluster's performance in a significant way.

Is there any way to improve the delete performance on the cluster? I'm
using S3 to do all the tests, and the .rgw.bucket.index is already running
from SSDs as is the journal. I'm running firefly 0.80.5.
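For reference, the test boils down to an S3 put/delete loop, roughly like this
(simplified; the tool, bucket and object names are just placeholders):

$ for i in $(seq 1 1000000); do s3cmd put obj-512k s3://test-bucket/obj-$i; done
$ for i in $(seq 1 1000000); do s3cmd del s3://test-bucket/obj-$i; done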

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush Location

2014-09-08 Thread Wido den Hollander

On 09/08/2014 09:57 AM, Jakes John wrote:

Hi all,
 I have been reading  ceph-crush-location hook  (
http://ceph.com/docs/master/rados/operations/crush-map/ ) which says to
add crush location field in the conf file to provide location awareness
to ceph deamons and clients. I would like to understand whether ceph use
this location awareness information internally for placing read/write
requests of objects. I couldn't find related details anywhere in the
documentation.



I used this recently for a deployment where the servers mix SSDs and 
HDDs. I used this location lookup script: 
https://gist.github.com/wido/5d26d88366e28e25e23d


This way I have a 'hostname-ssd' and a 'hostname-hdd' bucket for each 
machine. I can then create CRUSH rules using those buckets to put data 
on SSDs or HDDs.
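As an example, a CRUSH rule that targets the SSD buckets then looks roughly like 
this (a sketch; the 'ssd' root and the numbers are illustrative, adjust to your map):

rule ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}

with the 'hostname-ssd' buckets placed under that 'ssd' root, and an equivalent 
rule taking an 'hdd' root for the HDD buckets.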



Thanks


  




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd crash: trim_object could not find coid

2014-09-08 Thread Francois Deppierraz
Hi,

This issue is on a small two-server (44 OSDs) ceph cluster running 0.72.2
under Ubuntu 12.04. The cluster was filling up (a few OSDs near full)
and I tried to increase the number of PGs per pool to 1024 for each of
the 14 pools to improve storage space balancing. This increase triggered
high memory usage on the servers, which were unfortunately
under-provisioned (16 GB RAM for 22 OSDs) and started to swap and crash.

After installing memory into the servers, the result is a broken cluster
with unfound objects and two osds (osd.6 and osd.43) crashing at startup.

$ ceph health
HEALTH_WARN 166 pgs backfill; 326 pgs backfill_toofull; 2 pgs
backfilling; 765 pgs degraded; 715 pgs down; 1 pgs incomplete; 715 pgs
peering; 5 pgs recovering; 2 pgs recovery_wait; 716 pgs stuck inactive;
1856 pgs stuck unclean; 164 requests are blocked > 32 sec; recovery
517735/15915673 objects degraded (3.253%); 1241/7910367 unfound
(0.016%); 3 near full osd(s); 1/43 in osds are down; noout flag(s) set

osd.6 is crashing due to an assertion ("trim_object could not find coid"),
which leads to a resolved bug report that unfortunately doesn't give
any advice on how to repair the OSD.

http://tracker.ceph.com/issues/5473

It is much less obvious why osd.43 is crashing; please have a look at
the following OSD logs:

http://paste.ubuntu.com/8288607/
http://paste.ubuntu.com/8288609/

Any advice on how to repair both OSDs and recover the unfound objects
would be more than welcome.

Thanks!

François

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-08 Thread Dan Van Der Ster
Hi Scott,

> On 06 Sep 2014, at 20:39, Scott Laird  wrote:
> 
> IOPS are weird things with SSDs.  In theory, you'd see 25% of the write IOPS 
> when writing to a 4-way RAID5 device, since you write to all 4 devices in 
> parallel.  Except that's not actually true--unlike HDs where an IOP is an 
> IOP, SSD IOPS limits are really just a function of request size.  Because 
> each operation would be ~1/3rd the size, you should see a net of about 3x the 
> performance of one drive overall, or 75% of the sum of the drives.  

Which chunk size are you using? I presume this would only work if our writes 
are larger than the chunk size, which is normally around 128k, right? In our 
cluster we are dominated by 4k writes, so I don't expect to get the IOPS boost 
you mention. Or did I miss something?

Cheers, Dan

> The CPU use will be higher, but it may or may not be a substantial hit for 
> your use case.  Journals are basically write-only, and 200G S3700s are 
> supposed to be able to sustain around 360 MB/sec, so RAID 5 would give you 
> somewhere around 1 GB/sec writing on paper.  Depending on your access 
> patterns, that may or may not be a win vs single SSDs; it should give you 
> slightly lower latency for uncongested writes at the very least.  It's 
> probably worth benchmarking if you have the time.  
> 
> OTOH, S3700s seem to be pretty reliable, and if your cluster is big enough to 
> handle the loss of 5 OSDs without a big hit, then the lack of complexity may 
> be a bigger win all on its own.
> 
> 
> Scott

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I fail to add a monitor in a ceph cluster

2014-09-08 Thread Pascal GREGIS
Thanks, it seems to work.
I also had to complete the "mon host" line in the config file to add my second (and 
then my third) monitor:
mon host = 172.16.1.11,172.16.1.12,172.16.1.13
otherwise it still didn't work when I stopped 1 of the 3 monitors, I mean when I 
stopped grenier, the only one that was referenced in the config file.

However, the last command you gave me seemed to produce no useful result:
$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.gail.asok 
add_bootstrap_peer_hint 172.16.1.11
mon already active; ignoring bootstrap hint

Then another detail that I find not quite right, but which is probably normal: 
when there is no quorum, ceph -s doesn't return for a long time (maybe 5 
minutes) and then prints an error message.
This makes diagnosis a little harder, as it is not possible to get precise 
information when there is no quorum, or maybe there is a trick to do it.
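(If it helps: asking a monitor directly over its admin socket seems to answer even 
without quorum, e.g.

$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.gail.asok mon_status

but maybe there is a nicer way.)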

Thanks again

Pascal

Craig Lewis a écrit, le Wed 03 Sep 2014 à 05:26:32PM :
> "monclient: hunting for new mon" happens whenever the monmap changes.  It
> will hang if there's no quorum.
> 
> I haven't done this manually in a long time, so I'll refer to the Chef
> recipes.  The recipe doesn't do the 'ceph-mon add', it just starts the
> daemon up.
> 
> 
> Try:
> sudo ceph-mon -i gail --mkfs --monmap /var/tmp/monmap --keyring
> /var/tmp/ceph.mon.keyring
> sudo ceph-mon -i gail --public-addr 172.16.1.12
> sudo ceph --admin-daemon /var/run/ceph/ceph-mon.gail.asok
> add_bootstrap_peer_hint 172.16.1.11
> 
> 
> 
> 
> 
> On Mon, Sep 1, 2014 at 8:21 AM, Pascal GREGIS  wrote:
> 
> > Hello,
> >
> > I am currently testing ceph to make a replicated block device for a
> > project that would involve 2 data servers accessing this block device, so
> > that if one fails or crashes, the data can still be used and the cluster
> > can be rebuilt.
> >
> > This project requires that both machines run an OSD and a monitor, and
> > that a 3rd monitor is run somewhere else, so that there is not a single
> > point of failure.
> > I know it is not the best thing to run an OSD and a monitor on the same
> > machine, but I cannot really find a better solution.
> >
> > My problem is that, after having read several times and followed the
> > documentation, I cannot succeed to add a second monitor.
> >
> > I have bootstrapped a first monitor, added 2 OSDs (one on the machine with
> > the monitor, one on the other), and I try to add a second monitor but it
> > doesn't work.
> > I think I misunderstood something.
> >
> > Here's what I did :
> >
> > On the first machine named grenier:
> > # setup the configuration file /etc/ceph/ceph.conf (see content further)
> > # bootstrap monitor:
> > $ ceph-authtool --create-keyring /var/tmp/ceph.mon.keyring --gen-key -n
> > mon. --cap mon 'allow *'
> > $ sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
> > --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow
> > *' --cap mds 'allow'
> > $ sudo chown myuser /etc/ceph/ceph.client.admin.keyring
> > $ ceph-authtool /var/tmp/ceph.mon.keyring --import-keyring
> > /etc/ceph/ceph.client.admin.keyring
> > $ monmaptool --create --add grenier 172.16.1.11 --fsid $monuuid $tmp/monmap
> > $ sudo mkdir -p /var/lib/ceph/mon/ceph-grenier
> > $ sudo chown $ID -R /var/lib/ceph/mon/ceph-grenier
> > $ ceph-mon --mkfs -i grenier --monmap /var/tmp/monmap --keyring
> > /var/tmp/ceph.mon.keyring
> > # start monitor:
> > $ sudo start ceph-mon id=grenier
> > # add OSD:
> > $ sudo ceph osd create $osduuid
> > $ sudo mkdir -p /var/lib/ceph/osd/ceph-0
> > $ sudo ceph-osd -i 0 --mkfs --mkkey --osd-uuid $osduuid
> > $ sudo ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' -i
> > /var/lib/ceph/osd/ceph-0/keyring
> > $ ceph osd crush add-bucket grenier host
> > $ ceph osd crush move grenier root=default
> > $ ceph osd crush add osd.0 1.0 host=grenier
> > # start this OSD
> > $ sudo ceph-osd -i 0
> >
> > # copy /etc/ceph/ceph.conf, /etc/ceph/ceph.client.admin.keyring,
> > /var/tmp/ceph/ceph.mon.keyring and /var/tmp/ceph/monmap from grenier to
> > second node named gail:
> > # add and start OSD on the second node
> > $ sudo ceph osd create $newosduuid
> > $ sudo mkdir -p /var/lib/ceph/osd/ceph-1
> > $ sudo ceph-osd -i 1 --mkfs --mkkey --osd-uuid $newosduuid
> > $ sudo ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' -i
> > /var/lib/ceph/osd/ceph-1/keyring
> > $ ceph osd crush add-bucket gail host
> > $ ceph osd crush move gail root=default
> > $ ceph osd crush add osd.1 1.0 host=gail
> > # start this OSD
> > $ sudo ceph-osd -i 1
> >
> > There, everything works correctly, I can create and map a block device,
> > and then write on it and the data is replicated on both nodes.
> > When I perform a ceph -s I get :
> > cluster a98faf65-b105-4ec7-913c-f8a33a4db4d1
> >  health HEALTH_OK
> >  monmap e1: 1 mons at {grenier=172.16.1.11:6789/0}, election epoch 2,
> > quorum 0 grenier
> >  osdmap e13: 2 osds: 2 up, 2 in
> >   pgmap v47: 192 pgs, 3 pool

Re: [ceph-users] I fail to add a monitor in a ceph cluster

2014-09-08 Thread Pascal GREGIS
Oh, I forgot to say:
I made a mistake in my first message; the command you suggested to remove was 
in fact:
$ sudo ceph mon add $HOSTNAME $IP
and not
$ sudo ceph-mon add $HOSTNAME $IP

Anyway, removing it makes the whole thing work. But the doc says to execute it. 
Should the doc be changed?
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-a-monitor-manual

Pascal

 Original message 
Thanks, it seems to work.
I also had to complete the "mon host" line in config file to add my second (and 
then my third) monitor :
mon host = 172.16.1.11,172.16.1.12,172.16.1.13
otherwise it still didn't work when I stoppd 1 of the 3 monitors, I mean when I 
stopped grenier, the only one which was referenced in the config file.

However, the last command you gave me seemed to produce no useful result :
$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.gail.asok 
add_bootstrap_peer_hint 172.16.1.11
mon already active; ignoring bootstrap hint

Then another detail that I find not really fine, but that is probably normal: 
when there is no quorum, ceph -s doesn't return before a long time (maybe 5 
minutes) and prints an error message.
This makes diagnostic a little harder as it is not possible to get precise 
infos when there is no quorum, or maybe there is a hint to do it.

Thanks again

Pascal

Craig Lewis a écrit, le Wed 03 Sep 2014 à 05:26:32PM :
> "monclient: hunting for new mon" happens whenever the monmap changes.  It
> will hang if there's no quorum.
> 
> I haven't done this manually in a long time, so I'll refer to the Chef
> recipes.  The recipe doesn't do the 'ceph-mon add', it just starts the
> daemon up.
> 
> 
> Try:
> sudo ceph-mon -i gail --mkfs --monmap /var/tmp/monmap --keyring
> /var/tmp/ceph.mon.keyring
> sudo ceph-mon -i gail --public-addr 172.16.1.12
> sudo ceph --admin-daemon /var/run/ceph/ceph-mon.gail.asok
> add_bootstrap_peer_hint 172.16.1.11
> 
> 
> 
> 
> 
> On Mon, Sep 1, 2014 at 8:21 AM, Pascal GREGIS  wrote:
> 
> > Hello,
> >
> > I am currently testing ceph to make a replicated block device for a
> > project that would involve 2 data servers accessing this block device, so
> > that if one fails or crashes, the data can still be used and the cluster
> > can be rebuilt.
> >
> > This project requires that both machines run an OSD and a monitor, and
> > that a 3rd monitor is run somewhere else, so that there is not a single
> > point of failure.
> > I know it is not the best thing to run an OSD and a monitor on the same
> > machine, but I cannot really find a better solution.
> >
> > My problem is that, after having read several times and followed the
> > documentation, I cannot succeed to add a second monitor.
> >
> > I have bootstrapped a first monitor, added 2 OSDs (one on the machine with
> > the monitor, one on the other), and I try to add a second monitor but it
> > doesn't work.
> > I think I misunderstood something.
> >
> > Here's what I did :
> >
> > On the first machine named grenier:
> > # setup the configuration file /etc/ceph/ceph.conf (see content further)
> > # bootstrap monitor:
> > $ ceph-authtool --create-keyring /var/tmp/ceph.mon.keyring --gen-key -n
> > mon. --cap mon 'allow *'
> > $ sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
> > --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow
> > *' --cap mds 'allow'
> > $ sudo chown myuser /etc/ceph/ceph.client.admin.keyring
> > $ ceph-authtool /var/tmp/ceph.mon.keyring --import-keyring
> > /etc/ceph/ceph.client.admin.keyring
> > $ monmaptool --create --add grenier 172.16.1.11 --fsid $monuuid $tmp/monmap
> > $ sudo mkdir -p /var/lib/ceph/mon/ceph-grenier
> > $ sudo chown $ID -R /var/lib/ceph/mon/ceph-grenier
> > $ ceph-mon --mkfs -i grenier --monmap /var/tmp/monmap --keyring
> > /var/tmp/ceph.mon.keyring
> > # start monitor:
> > $ sudo start ceph-mon id=grenier
> > # add OSD:
> > $ sudo ceph osd create $osduuid
> > $ sudo mkdir -p /var/lib/ceph/osd/ceph-0
> > $ sudo ceph-osd -i 0 --mkfs --mkkey --osd-uuid $osduuid
> > $ sudo ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' -i
> > /var/lib/ceph/osd/ceph-0/keyring
> > $ ceph osd crush add-bucket grenier host
> > $ ceph osd crush move grenier root=default
> > $ ceph osd crush add osd.0 1.0 host=grenier
> > # start this OSD
> > $ sudo ceph-osd -i 0
> >
> > # copy /etc/ceph/ceph.conf, /etc/ceph/ceph.client.admin.keyring,
> > /var/tmp/ceph/ceph.mon.keyring and /var/tmp/ceph/monmap from grenier to
> > second node named gail:
> > # add and start OSD on the second node
> > $ sudo ceph osd create $newosduuid
> > $ sudo mkdir -p /var/lib/ceph/osd/ceph-1
> > $ sudo ceph-osd -i 1 --mkfs --mkkey --osd-uuid $newosduuid
> > $ sudo ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' -i
> > /var/lib/ceph/osd/ceph-1/keyring
> > $ ceph osd crush add-bucket gail host
> > $ ceph osd crush move gail root=default
> > $ ceph osd crush add osd.1 1.0 host=gail
> > # start this OSD
> > $ sudo ceph-osd -i 1
> >
> > There, everythi

Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-08 Thread Kenneth Waegeman


Thank you very much !

Is this problem then related to the weird sizes I see:
  pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
418 GB used, 88130 GB / 88549 GB avail

a calculation with df indeed shows that there is about 400 GB used on the  
disks, but the tests I ran should have generated 3.5 TB, as is  
also seen in rados df:


pool name  category            KB  objects  clones  degraded  unfound       rd       rd KB       wr       wr KB
cache      -             59150443    15466       0         0        0  1388365  5686734850  3665984  4709621763
ecdata     -           3512807425   857620       0         0        0  1109938   312332288   857621  3512807426


I thought it was related to the inconsistency?
Or can this be a sparse-objects thing? (But I can't seem to find  
anything in the docs about that.)


Thanks again!

Kenneth



- Message from Haomai Wang  -
   Date: Sun, 7 Sep 2014 20:34:39 +0800
   From: Haomai Wang 
Subject: Re: ceph cluster inconsistency keyvaluestore
 To: Kenneth Waegeman 
 Cc: ceph-users@lists.ceph.com



I have found the root cause. It's a bug.

When a chunky scrub happens, it iterates over the whole PG's objects, and
in each iteration only a few objects are scanned.

osd/PG.cc:3758
ret = get_pgbackend()-> objects_list_partial(
  start,
  cct->_conf->osd_scrub_chunk_min,
  cct->_conf->osd_scrub_chunk_max,
  0,
  &objects,
  &candidate_end);

candidate_end is the end of the object set and is used to indicate the
next scrub pass's start position. But it will be truncated:

osd/PG.cc:3777
while (!boundary_found && objects.size() > 1) {
  hobject_t end = objects.back().get_boundary();
  objects.pop_back();

  if (objects.back().get_filestore_key() !=
end.get_filestore_key()) {
candidate_end = end;
boundary_found = true;
  }
}
end, an hobject_t which only contains the "hash" field, will be assigned to
candidate_end.  So in the next scrub pass an hobject_t containing only the
"hash" field will be passed in to get_pgbackend()->
objects_list_partial.

This causes incorrect results for the KeyValueStore backend, because it
uses strict key ordering for the "collection_list_partial" method. An
hobject_t that only contains the "hash" field will be:

1%e79s0_head!972F1B5D!!none!!!!0!0

and the actual object is
1%e79s0_head!972F1B5D!!1!!!object-name!head

In other words, an object that only contains the "hash" field can't be used to
search for the actual object that has the same "hash" field.

@sage The simple way is to modify the obj->key function, which will change the
storage format. Because it's an experimental backend I would like to
provide an external format-conversion program to help users do it. Is that
OK?


On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
 wrote:

I also can reproduce it on a new slightly different set up (also EC on KV
and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
'inconsistent' status



- Message from Kenneth Waegeman  -
   Date: Mon, 01 Sep 2014 16:28:31 +0200
   From: Kenneth Waegeman 
Subject: Re: ceph cluster inconsistency keyvaluestore
 To: Haomai Wang 
 Cc: ceph-users@lists.ceph.com




Hi,


The cluster got installed with quattor, which uses ceph-deploy for
installation of daemons, writes the config file and installs the crushmap.
I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
ECdata pool and a small cache partition (50G) for the cache

I manually did this:

ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3
ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

(But the previous time I had the problem already without the cache part)



Cluster live since 2014-08-29 15:34:16

Config file on host ceph001:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.143.8.0/24
filestore_xattr_use_omap = 1
fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
mon_cluster_log_to_syslog = 1
mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
mon_initial_members = ceph001, ceph002, ceph003
osd_crush_update_on_start = 0
osd_journal_size = 10240
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_pool_default_size = 3
public_network = 10.141.8.0/24

[osd.11]
osd_objectstore = keyvaluestore-dev

[osd.13]
osd_o

Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-08 Thread BG
Apologies for piggybacking on this issue, but I appear to have a similar problem
with Firefly on a CentOS 7 install; I thought it better to add it here rather
than start a new thread.

$ ceph --version
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)

$ ceph health
HEALTH_WARN 96 pgs degraded; 96 pgs peering; 192 pgs stale; 96 pgs stuck
inactive; 192 pgs stuck 
stale; 192 pgs stuck unclean

$ ceph osd dump
epoch 11
fsid 809d719a-65b5-40a5-b8c2-572d03b43da4
created 2014-09-08 10:46:38.446033
modified 2014-09-08 10:55:03.342147
flags 
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 
64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 64 
pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 
last_change 1 flags hashpspool stripe_width 0
max_osd 2
osd.0 down out weight 0 up_from 4 up_thru 8 down_at 10 last_clean_interval
[0,0) 
10.119.16.15:6800/4433 10.119.16.15:6801/4433 10.119.16.15:6802/4433 
10.119.16.15:6803/4433 autoout,exists ed7d9e41-6976-4a6e-b929-82d77f916470
osd.1 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0)
10.119.16.16:6800/4418 10.119.16.16:6801/4418 10.119.16.16:6802/4418 
10.119.16.16:6803/4418 exists,up 89e49ab3-b22b-41e9-9b7b-89c33f9cb0fb

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-08 Thread Loic Dachary
Hi,

It looks like your osd.0 is down and you only have one OSD left (osd.1), 
which would explain why the cluster cannot get to a healthy state. The "size 2" 
in "pool 0 'data' replicated size 2 ..." means the pool needs at least two 
OSDs up to function properly. Do you know why osd.0 is not up?
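A quick way to check is:

$ ceph osd tree

which shows which OSDs are up/down and under which host; the log of osd.0 
(typically /var/log/ceph/ceph-osd.0.log) should tell you why it went down.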

Cheers

On 08/09/2014 12:55, BG wrote:
> Apologies for piggybacking this issue but I appear to have  a similar problem
> with Firefly on a CentOS 7 install, thought it better to add it here rather
> than start a new thread.
> 
> $ ceph --version
> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
> 
> $ ceph health
> HEALTH_WARN 96 pgs degraded; 96 pgs peering; 192 pgs stale; 96 pgs stuck
> inactive; 192 pgs stuck 
> stale; 192 pgs stuck unclean
> 
> $ ceph osd dump
> epoch 11
> fsid 809d719a-65b5-40a5-b8c2-572d03b43da4
> created 2014-09-08 10:46:38.446033
> modified 2014-09-08 10:55:03.342147
> flags 
> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
> rjenkins
> pg_num 64 pgp_num 
> 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 64 
> pgp_num 64 last_change 1 flags hashpspool stripe_width 0
> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
> pg_num 64 pgp_num 64 
> last_change 1 flags hashpspool stripe_width 0
> max_osd 2
> osd.0 down out weight 0 up_from 4 up_thru 8 down_at 10 last_clean_interval
> [0,0) 
> 10.119.16.15:6800/4433 10.119.16.15:6801/4433 10.119.16.15:6802/4433 
> 10.119.16.15:6803/4433 autoout,exists ed7d9e41-6976-4a6e-b929-82d77f916470
> osd.1 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval 
> [0,0)
> 10.119.16.16:6800/4418 10.119.16.16:6801/4418 10.119.16.16:6802/4418 
> 10.119.16.16:6803/4418 exists,up 89e49ab3-b22b-41e9-9b7b-89c33f9cb0fb
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-08 Thread BG
Sorry, no idea; this is a first-time install and I'm following the
"Storage Cluster Quick Start" guide.

Looking in the "ceph.log" file I do see warnings related to osd.0:
2014-09-08 11:06:44.000667 osd.0 10.119.16.15:6800/4433 1 : [WRN] map e10
wrongly marked me down

I've also just noticed the monitor in my setup is spooling an enormous log file
(multiple GB) with the same message as below over and over again:
2014-09-08 11:43:13.764953 7f0ee985c700  1 mon.hp09@0(leader).paxos(paxos
active c 1..64) is_readable now=2014-09-08 11:43:13.764954
lease_expire=0.00 has v0 lc 64



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-08 Thread BG
Also, for info, this is from the osd.0 log file:

2014-09-08 11:06:44.000663 7f41144c7700  0 log [WRN] : map e10 wrongly marked
me down
2014-09-08 11:06:44.002595 7f41144c7700  0 osd.0 10 crush map has features
1107558400, adjusting msgr requires for mons
2014-09-08 11:06:44.003346 7f41072ab700  0 -- 10.119.16.15:6805/1004433 >>
10.119.16.16:6801/4418 pipe(0x27c0c80 sd=105 :0 s=1 pgs=0 cs=0 l=0
c=0x26c6880).fault with nothing to send, going to standby
2014-09-08 11:06:44.003752 7f41071aa700  0 -- :/4433 >> 10.119.16.16:6802/4418
pipe(0x27c0a00 sd=106 :0 s=1 pgs=0 cs=0 l=1 c=0x26c6b40).fault
2014-09-08 11:06:44.003766 7f41073ac700  0 -- :/4433 >> 10.119.16.16:6803/4418
pipe(0x27c1900 sd=107 :0 s=1 pgs=0 cs=0 l=1 c=0x26c6f60).fault
2014-09-08 11:07:04.004352 7f4106fa9700  0 -- :/4433 >> 10.119.16.16:6802/4418
pipe(0x27c1180 sd=114 :0 s=1 pgs=0 cs=0 l=1 c=0x26c70c0).fault
2014-09-08 11:07:04.004573 7f4106ea8700  0 -- :/4433 >> 10.119.16.16:6803/4418
pipe(0x27c0f00 sd=106 :0 s=1 pgs=0 cs=0 l=1 c=0x26c7220).fault

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-08 Thread Haomai Wang
I'm not very sure; it's possible that keyvaluestore uses sparse
writes, which make a big difference to the ceph space statistics.
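(A generic illustration of sparse files, outside of ceph -- names are just an
example:

$ truncate -s 1G sparse-file
$ ls -lh sparse-file    # logical size: 1.0G
$ du -h  sparse-file    # blocks actually allocated: ~0

The "data" figure counts logical object size, while "used" only counts blocks
that were really written.)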

On Mon, Sep 8, 2014 at 6:35 PM, Kenneth Waegeman
 wrote:
>
> Thank you very much !
>
> Is this problem then related to the weird sizes I see:
>   pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
> 418 GB used, 88130 GB / 88549 GB avail
>
> a calculation with df shows indeed that there is about 400GB used on disks,
> but the tests I ran should indeed have generated 3,5 TB, as also seen in
> rados df:
>
> pool name   category KB  objects   clones
> degraded  unfound   rdrd KB   wrwr KB
> cache   -   59150443154660
> 0   0  1388365   5686734850  3665984   4709621763
> ecdata  - 3512807425   8576200
> 0   0  1109938312332288   857621   3512807426
>
> I thought it was related to the inconsistency?
> Or can this be a sparse objects thing? (But I don't seem to found anything
> in the docs about that)
>
> Thanks again!
>
> Kenneth
>
>
>
> - Message from Haomai Wang  -
>Date: Sun, 7 Sep 2014 20:34:39 +0800
>
>From: Haomai Wang 
> Subject: Re: ceph cluster inconsistency keyvaluestore
>  To: Kenneth Waegeman 
>  Cc: ceph-users@lists.ceph.com
>
>
>> I have found the root cause. It's a bug.
>>
>> When chunky scrub happen, it will iterate the who pg's objects and
>> each iterator only a few objects will be scan.
>>
>> osd/PG.cc:3758
>> ret = get_pgbackend()-> objects_list_partial(
>>   start,
>>   cct->_conf->osd_scrub_chunk_min,
>>   cct->_conf->osd_scrub_chunk_max,
>>   0,
>>   &objects,
>>   &candidate_end);
>>
>> candidate_end is the end of object set and it's used to indicate the
>> next scrub process's start position. But it will be truncated:
>>
>> osd/PG.cc:3777
>> while (!boundary_found && objects.size() > 1) {
>>   hobject_t end = objects.back().get_boundary();
>>   objects.pop_back();
>>
>>   if (objects.back().get_filestore_key() !=
>> end.get_filestore_key()) {
>> candidate_end = end;
>> boundary_found = true;
>>   }
>> }
>> end which only contain "hash" field as hobject_t will be assign to
>> candidate_end.  So the next scrub process a hobject_t only contains
>> "hash" field will be passed in to get_pgbackend()->
>> objects_list_partial.
>>
>> It will cause incorrect results for KeyValueStore backend. Because it
>> will use strict key ordering for "collection_list_paritial" method. A
>> hobject_t only contains "hash" field will be:
>>
>> 1%e79s0_head!972F1B5D!!none!!!!0!0
>>
>> and the actual object is
>> 1%e79s0_head!972F1B5D!!1!!!object-name!head
>>
>> In other word, a object only contain "hash" field can't used by to
>> search a absolute object has the same "hash" field.
>>
>> @sage The simply way is modify obj->key function which will change
>> storage format. Because it's a experiment backend I would like to
>> provide with a external format change program help users do it. Is it
>> OK?
>>
>>
>> On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
>>  wrote:
>>>
>>> I also can reproduce it on a new slightly different set up (also EC on KV
>>> and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
>>> 'inconsistent' status
>>>
>>>
>>>
>>> - Message from Kenneth Waegeman  -
>>>Date: Mon, 01 Sep 2014 16:28:31 +0200
>>>From: Kenneth Waegeman 
>>> Subject: Re: ceph cluster inconsistency keyvaluestore
>>>  To: Haomai Wang 
>>>  Cc: ceph-users@lists.ceph.com
>>>
>>>
>>>
 Hi,


 The cluster got installed with quattor, which uses ceph-deploy for
 installation of daemons, writes the config file and installs the
 crushmap.
 I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for
 the
 ECdata pool and a small cache partition (50G) for the cache

 I manually did this:

 ceph osd pool create cache 1024 1024
 ceph osd pool set cache size 2
 ceph osd pool set cache min_size 1
 ceph osd erasure-code-profile set profile11 k=8 m=3
 ruleset-failure-domain=osd
 ceph osd pool create ecdata 128 128 erasure profile11
 ceph osd tier add ecdata cache
 ceph osd tier cache-mode cache writeback
 ceph osd tier set-overlay ecdata cache
 ceph osd pool set cache hit_set_type bloom
 ceph osd pool set cache hit_set_count 1
 ceph osd pool set cache hit_set_period 3600
 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

 (But the previous time I had the problem already without the cache part)



 Cluster live since 2014-08-29 15:34:16

 Config file on host ceph001:

 [global]
 auth_client_required = cephx
 

[ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread JR
Greetings all,

I have a small ceph cluster (4 nodes, 2 osds per node) which recently
started showing:

root@ocd45:~# ceph health
HEALTH_WARN 1 near full osd(s)

admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
'Filesystem|osd/ceph'; done
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdc1   442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
/dev/sdb1   442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdc1   442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
/dev/sdb1   442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
/dev/sdc1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdc1   442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
/dev/sdb1   442G  278G  165G  63% /var/lib/ceph/osd/ceph-0


This cluster has been running for weeks, under significant load, and has
been 100% stable. Unfortunately we have to ship it out of the building
to another part of our business (where we will have little access to it).

Based on what I've read about 'ceph osd reweight' I'm a bit hesitant to
just run it (I don't want to do anything that impacts this cluster's
stability).

Is there another, better way to equalize the distribution of the data across
the OSD partitions?

I'm running dumpling.

Thanks much,
JR

-- 
Your electronic communications are being monitored; strong encryption is
an answer. My public key

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph object back up details

2014-09-08 Thread Yehuda Sadeh
Not sure I understand what you're asking. Configuring multiple zones within the
same region is described here:

http://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication

Yehuda

On Sun, Sep 7, 2014 at 10:32 PM, M Ranga Swami Reddy
 wrote:
> Hi Yahuda,
> I need more info on Ceph object backup mechanism.. Could  please share
> a related doc or link for this?
> Thanks
> Swami
>
> On Thu, Sep 4, 2014 at 10:58 PM, M Ranga Swami Reddy
>  wrote:
>> Hi,
>> I need more info on Ceph object backup mechanism.. Could someone share a
>> related doc or link for this?
>> Thanks
>> Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread Christian Balzer

Hello,

On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:

> Greetings all,
> 
> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
> started showing:
> 
> root@ocd45:~# ceph health
> HEALTH_WARN 1 near full osd(s)
> 
> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
> 'Filesystem|osd/ceph'; done
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdc1   442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
> /dev/sdb1   442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdc1   442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
> /dev/sdb1   442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdb1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
> /dev/sdc1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdc1   442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
> /dev/sdb1   442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
> 
>
See the very recent "Uneven OSD usage" thread for a discussion about this.
What are your PG/PGP values?
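You can check them per pool with e.g.:

$ ceph osd pool get rbd pg_num
$ ceph osd pool get rbd pgp_num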

> This cluster has been running for weeks, under significant load, and has
> been 100% stable. Unfortunately we have to ship it out of the building
> to another part of our business (where we will have little access to it).
> 
> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant to
> just run it (I don't want to do anything that impacts this cluster's
> stability).
> 
> Is there another, better way to equalize the distribution the data on
> the osd partitions?
> 
> I'm running dumpling.
> 
As per the thread and my experience, Firefly would solve this. If you can
upgrade during a weekend or whenever there is little to no access, do it.

Another option (of course any and all of these will result in data
movement, so pick an appropriate time) would be to use "ceph osd
reweight" to lower the weight of osd.7 in particular.

Lastly, given the utilization of your cluster, you really ought to deploy
more OSDs and/or more nodes; if a node went down you'd easily get into
a "real" near-full or full situation.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resizing the OSD

2014-09-08 Thread JIten Shah

On Sep 6, 2014, at 8:22 PM, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 06 Sep 2014 10:28:19 -0700 JIten Shah wrote:
> 
>> Thanks Christian.  Replies inline.
>> On Sep 6, 2014, at 8:04 AM, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
>>> 
 Hello Cephers,
 
 We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of
 the stuff seems to be working fine but we are seeing some degrading
 on the osd's due to lack of space on the osd's. 
>>> 
>>> Please elaborate on that degradation.
>> 
>> The degradation happened on few OSD's because it got quickly filled up.
>> They were not of the same size as the other OSD's. Now I want to remove
>> these OSD's and readd them with correct size to match the others.
> 
> Alright, that's good idea, uniformity helps. ^^
> 
>>> 
 Is there a way to resize the
 OSD without bringing the cluster down?
 
>>> 
>>> Define both "resize" and "cluster down".
>> 
>> Basically I want to remove the OSD's with incorrect size and readd them
>> with the size matching the other OSD's. 
>>> 
>>> As in, resizing how? 
>>> Are your current OSDs on disks/LVMs that are not fully used and thus
>>> could be grown?
>>> What is the size of your current OSDs?
>> 
>> The size of current OSD's is 20GB and we do have more unused space on
>> the disk that we can make the LVM bigger and increase the size of the
>> OSD's. I agree that we need to have all the disks of same size and I am
>> working towards that.Thanks.
>>> 
> OK, so your OSDs are backed by LVM. 
> A curious choice, any particular reason to do so?

We already had LVMs carved out for another project and were not using them, so 
we decided to put the OSDs on those LVMs.

> 
> Either way, in theory you could grow things in place, obviously first the
> LVM and then the underlying filesystem. Both ext4 and xfs support online
> growing, so the OSD can keep running the whole time.
> If you're unfamiliar with these things, play with them on a test machine
> first. 
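Just to make sure I understand, that would be roughly (VG/LV and mount point
names are placeholders for my setup):

$ sudo lvextend -L +400G /dev/myvg/osd0-lv
$ sudo xfs_growfs /var/lib/ceph/osd/ceph-0    # resize2fs for ext4

with the OSD still running, right?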
> 
> Now for the next step we will really need to know how you deployed ceph
> and the result of "ceph osd tree" (not all 100 OSDs are needed, a sample of
> a "small" and "big" OSD is sufficient).

Fixed all the sizes so all of them now have a weight of 1:
[jshah@pv11p04si-mzk001 ~]$ ceph osd tree
# id    weight  type name   up/down reweight
-1  99  root default
-2  1   host pv11p04si-mslave0005
0   1   osd.0   up  1   
-3  1   host pv11p04si-mslave0006
1   1   osd.1   up  1   
-4  1   host pv11p04si-mslave0007
2   1   osd.2   up  1   
-5  1   host pv11p04si-mslave0008
3   1   osd.3   up  1   
-6  1   host pv11p04si-mslave0009
4   1   osd.4   up  1   
-7  1   host pv11p04si-mslave0010
5   1   osd.5   up  1   
> 
> Depending on the results (it will probably have varying weights depending
> on the size and a reweight value of 1 for all) you will need to adjust the
> weight of the grown OSD in question accordingly with "ceph osd crush
> reweight". 
> That step will incur data movement, so do it one OSD at a time.
> 
>>> The normal way of growing a cluster is to add more OSDs.
>>> Preferably of the same size and same performance disks.
>>> This will not only simplify things immensely but also make them a lot
>>> more predictable.
>>> This of course depends on your use case and usage patterns, but often
>>> when running out of space you're also running out of other resources
>>> like CPU, memory or IOPS of the disks involved. So adding more instead
>>> of growing them is most likely the way forward.
>>> 
>>> If you were to replace actual disks with larger ones, take them (the
>>> OSDs) out one at a time and re-add it. If you're using ceph-deploy, it
>>> will use the disk size as basic weight, if you're doing things
>>> manually make sure to specify that size/weight accordingly.
>>> Again, you do want to do this for all disks to keep things uniform.
>>> 
>>> If your cluster (pools really) are set to a replica size of at least 2
>>> (risky!) or 3 (as per Firefly default), taking a single OSD out would
>>> of course never bring the cluster down.
>>> However taking an OSD out and/or adding a new one will cause data
>>> movement that might impact your cluster's performance.
>>> 
>> 
>> We have a current replica size of 2 with 100 OSD's. How many can I loose
>> without affecting the performance? I understand the impact of data
>> movement.
>> 
> Unless your LVMs are in turn living on a RAID, a replica of 2 with 100
> OSDs is begging Murphy for a double disk failure. I'm also curious on how
> many actual physical disks those OSD live and how many physical hosts are
> in your cluster.

we have 1 physical disk on each 

[ceph-users] Updating the pg and pgp values

2014-09-08 Thread JIten Shah
While checking the health of the cluster, I ran into the following warning:

warning: health HEALTH_WARN too few pgs per osd (1 < min 20)

When I checked the pg and pgp numbers, I saw they were at the default value of 
64

ceph osd pool get data pg_num
pg_num: 64
ceph osd pool get data pgp_num
pgp_num: 64

Checking the ceph documents, I updated the numbers to 2000 using the following 
commands:

ceph osd pool set data pg_num 2000
ceph osd pool set data pgp_num 2000

It started resizing the data and saw health warnings again:

health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num 2000 > 
pgp_num 64

and then:

ceph health detail
HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
5 ops are blocked > 65.536 sec
1 ops are blocked > 32.768 sec
1 ops are blocked > 32.768 sec on osd.16
1 ops are blocked > 65.536 sec on osd.77
4 ops are blocked > 65.536 sec on osd.98
3 osds have slow requests

This error also went away after a day.

ceph health detail
HEALTH_OK


Now, the question I have is, will this pg number remain effective on the 
cluster, even if we restart the MONs or the OSDs on the individual disks?  I haven't 
changed the values in /etc/ceph/ceph.conf. Do I need to make a change to 
ceph.conf and push that change to all the MONs, MDSs and OSDs?


Thanks.

—Jiten


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating the pg and pgp values

2014-09-08 Thread Gregory Farnum
On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah  wrote:
> While checking the health of the cluster, I ran to the following error:
>
> warning: health HEALTH_WARN too few pgs per osd (1< min 20)
>
> When I checked the pg and php numbers, I saw the value was the default value
> of 64
>
> ceph osd pool get data pg_num
> pg_num: 64
> ceph osd pool get data pgp_num
> pgp_num: 64
>
> Checking the ceph documents, I updated the numbers to 2000 using the
> following commands:
>
> ceph osd pool set data pg_num 2000
> ceph osd pool set data pgp_num 2000
>
> It started resizing the data and saw health warnings again:
>
> health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num 2000 >
> pgp_num 64
>
> and then:
>
> ceph health detail
> HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
> 5 ops are blocked > 65.536 sec
> 1 ops are blocked > 32.768 sec
> 1 ops are blocked > 32.768 sec on osd.16
> 1 ops are blocked > 65.536 sec on osd.77
> 4 ops are blocked > 65.536 sec on osd.98
> 3 osds have slow requests
>
> This error also went away after a day.
>
> ceph health detail
> HEALTH_OK
>
>
> Now, the question I have is, will this pg number remain effective on the
> cluster, even if we restart MON or OSD’s on the individual disks?  I haven’t
> changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the
> ceph.conf and push that change to all the MON, MSD and OSD’s ?

It's durable once the commands are successful on the monitors. You're all done.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating the pg and pgp values

2014-09-08 Thread JIten Shah
Thanks Greg.

—Jiten

On Sep 8, 2014, at 10:31 AM, Gregory Farnum  wrote:

> On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah  wrote:
>> While checking the health of the cluster, I ran to the following error:
>> 
>> warning: health HEALTH_WARN too few pgs per osd (1< min 20)
>> 
>> When I checked the pg and php numbers, I saw the value was the default value
>> of 64
>> 
>> ceph osd pool get data pg_num
>> pg_num: 64
>> ceph osd pool get data pgp_num
>> pgp_num: 64
>> 
>> Checking the ceph documents, I updated the numbers to 2000 using the
>> following commands:
>> 
>> ceph osd pool set data pg_num 2000
>> ceph osd pool set data pgp_num 2000
>> 
>> It started resizing the data and saw health warnings again:
>> 
>> health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num 2000 >
>> pgp_num 64
>> 
>> and then:
>> 
>> ceph health detail
>> HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
>> 5 ops are blocked > 65.536 sec
>> 1 ops are blocked > 32.768 sec
>> 1 ops are blocked > 32.768 sec on osd.16
>> 1 ops are blocked > 65.536 sec on osd.77
>> 4 ops are blocked > 65.536 sec on osd.98
>> 3 osds have slow requests
>> 
>> This error also went away after a day.
>> 
>> ceph health detail
>> HEALTH_OK
>> 
>> 
>> Now, the question I have is, will this pg number remain effective on the
>> cluster, even if we restart MON or OSD’s on the individual disks?  I haven’t
>> changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the
>> ceph.conf and push that change to all the MON, MSD and OSD’s ?
> 
> It's durable once the commands are successful on the monitors. You're all 
> done.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread JR
Hi Christian,

I have 448 PGs and 448 PGPs (according to ceph -s).

This seems borne out by:

root@osd45:~# rados lspools
data
metadata
rbd
volumes
images
root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
data pg(pg_num: 64, pgppg_num: 64
metadata pg(pg_num: 64, pgppg_num: 64
rbd pg(pg_num: 64, pgppg_num: 64
volumes pg(pg_num: 128, pgppg_num: 128
images pg(pg_num: 128, pgppg_num: 128

According to the formula discussed in 'Uneven OSD usage,'

"The formula is actually OSDs * 100 / replication

in my case:

8*100/2=400

So I'm erring on the large side?

Or, does this formula apply on a per-pool basis?  Of my 5 pools I'm using 3:

root@nebula45:~# rados df|cut -c1-45
pool name   category KB
data-  0
images  -  0
metadata- 10
rbd -  568489533
volumes -  594078601
  total used  2326235048   285923
  total avail 1380814968
  total space 3707050016

So should I up the number of PGs for the rbd and volumes pools?
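(If so, I assume that would be something along the lines of

$ ceph osd pool set rbd pg_num 256
$ ceph osd pool set rbd pgp_num 256

-- the number is just an example -- done at a quiet time given the data movement
it causes.)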

I'll continue looking at docs, but for now I'll send this off.

Thanks very much, Christian.

ps. This cluster is self-contained and all nodes in it are completely
loaded (i.e., I can't add any more nodes nor disks).  It's also not an
option at the moment to upgrade to firefly (can't make a big change
before sending it out the door).



On 9/8/2014 12:09 PM, Christian Balzer wrote:
> 
> Hello,
> 
> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
> 
>> Greetings all,
>>
>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
>> started showing:
>>
>> root@ocd45:~# ceph health
>> HEALTH_WARN 1 near full osd(s)
>>
>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
>> 'Filesystem|osd/ceph'; done
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sdc1   442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>> /dev/sdb1   442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sdc1   442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>> /dev/sdb1   442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sdb1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>> /dev/sdc1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sdc1   442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>> /dev/sdb1   442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>
>>
> See the very recent "Uneven OSD usage" for a discussion about this.
> What are your PG/PGP values?
> 
>> This cluster has been running for weeks, under significant load, and has
>> been 100% stable. Unfortunately we have to ship it out of the building
>> to another part of our business (where we will have little access to it).
>>
>> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant to
>> just run it (I don't want to do anything that impacts this cluster's
>> stability).
>>
>> Is there another, better way to equalize the distribution the data on
>> the osd partitions?
>>
>> I'm running dumpling.
>>
> As per the thread and my experience, Firefly would solve this. If you can
> upgrade during a weekend or whenever there is little to no access, do it.
> 
> Another option (of course any and all of these will result in data
> movement, so pick an appropriate time), would be to "use ceph osd
> reweight" to lower the weight of osd.7 in particular.
> 
> Lastly, given the utilization of your cluster, your really ought to deploy
> more OSDs and/or more nodes, if a node would go down you'd easily get into
> a "real" near full or full situation.
> 
> Regards,
> 
> Christian
> 

-- 
Your electronic communications are being monitored; strong encryption is
an answer. My public key

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating the pg and pgp values

2014-09-08 Thread JIten Shah
So, if it doesn't refer to the entry in ceph.conf, where does it actually store 
the new value?

—Jiten

On Sep 8, 2014, at 10:31 AM, Gregory Farnum  wrote:

> On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah  wrote:
>> While checking the health of the cluster, I ran to the following error:
>> 
>> warning: health HEALTH_WARN too few pgs per osd (1< min 20)
>> 
>> When I checked the pg and php numbers, I saw the value was the default value
>> of 64
>> 
>> ceph osd pool get data pg_num
>> pg_num: 64
>> ceph osd pool get data pgp_num
>> pgp_num: 64
>> 
>> Checking the ceph documents, I updated the numbers to 2000 using the
>> following commands:
>> 
>> ceph osd pool set data pg_num 2000
>> ceph osd pool set data pgp_num 2000
>> 
>> It started resizing the data and saw health warnings again:
>> 
>> health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num 2000 >
>> pgp_num 64
>> 
>> and then:
>> 
>> ceph health detail
>> HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
>> 5 ops are blocked > 65.536 sec
>> 1 ops are blocked > 32.768 sec
>> 1 ops are blocked > 32.768 sec on osd.16
>> 1 ops are blocked > 65.536 sec on osd.77
>> 4 ops are blocked > 65.536 sec on osd.98
>> 3 osds have slow requests
>> 
>> This error also went away after a day.
>> 
>> ceph health detail
>> HEALTH_OK
>> 
>> 
>> Now, the question I have is, will this pg number remain effective on the
>> cluster, even if we restart MON or OSD’s on the individual disks?  I haven’t
>> changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the
>> ceph.conf and push that change to all the MON, MSD and OSD’s ?
> 
> It's durable once the commands are successful on the monitors. You're all 
> done.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating the pg and pgp values

2014-09-08 Thread Gregory Farnum
It's stored in the OSDMap on the monitors.
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Sep 8, 2014 at 10:50 AM, JIten Shah  wrote:
> So, if it doesn’t refer to the entry in ceph.conf. Where does it actually 
> store the new value?
>
> —Jiten
>
> On Sep 8, 2014, at 10:31 AM, Gregory Farnum  wrote:
>
>> On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah  wrote:
>>> While checking the health of the cluster, I ran to the following error:
>>>
>>> warning: health HEALTH_WARN too few pgs per osd (1< min 20)
>>>
>>> When I checked the pg and php numbers, I saw the value was the default value
>>> of 64
>>>
>>> ceph osd pool get data pg_num
>>> pg_num: 64
>>> ceph osd pool get data pgp_num
>>> pgp_num: 64
>>>
>>> Checking the ceph documents, I updated the numbers to 2000 using the
>>> following commands:
>>>
>>> ceph osd pool set data pg_num 2000
>>> ceph osd pool set data pgp_num 2000
>>>
>>> It started resizing the data and saw health warnings again:
>>>
>>> health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num 2000 >
>>> pgp_num 64
>>>
>>> and then:
>>>
>>> ceph health detail
>>> HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
>>> 5 ops are blocked > 65.536 sec
>>> 1 ops are blocked > 32.768 sec
>>> 1 ops are blocked > 32.768 sec on osd.16
>>> 1 ops are blocked > 65.536 sec on osd.77
>>> 4 ops are blocked > 65.536 sec on osd.98
>>> 3 osds have slow requests
>>>
>>> This error also went away after a day.
>>>
>>> ceph health detail
>>> HEALTH_OK
>>>
>>>
>>> Now, the question I have is, will this pg number remain effective on the
>>> cluster, even if we restart MON or OSD’s on the individual disks?  I haven’t
>>> changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the
>>> ceph.conf and push that change to all the MON, MSD and OSD’s ?
>>
>> It's durable once the commands are successful on the monitors. You're all 
>> done.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating the pg and pgp values

2014-09-08 Thread JIten Shah
Thanks. How do I query the OSDMap on monitors? 

Using "ceph osd pool get data pg” ? or is there a way to get the full list of 
settings?
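(I can see that e.g.

$ ceph osd dump | grep '^pool'

prints the per-pool settings including pg_num and pgp_num -- is that the right
place to look?)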

—jiten


On Sep 8, 2014, at 10:52 AM, Gregory Farnum  wrote:

> It's stored in the OSDMap on the monitors.
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Mon, Sep 8, 2014 at 10:50 AM, JIten Shah  wrote:
>> So, if it doesn’t refer to the entry in ceph.conf. Where does it actually 
>> store the new value?
>> 
>> —Jiten
>> 
>> On Sep 8, 2014, at 10:31 AM, Gregory Farnum  wrote:
>> 
>>> On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah  wrote:
 While checking the health of the cluster, I ran to the following error:
 
 warning: health HEALTH_WARN too few pgs per osd (1< min 20)
 
 When I checked the pg and php numbers, I saw the value was the default 
 value
 of 64
 
 ceph osd pool get data pg_num
 pg_num: 64
 ceph osd pool get data pgp_num
 pgp_num: 64
 
 Checking the ceph documents, I updated the numbers to 2000 using the
 following commands:
 
 ceph osd pool set data pg_num 2000
 ceph osd pool set data pgp_num 2000
 
 It started resizing the data and saw health warnings again:
 
 health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num 2000 >
 pgp_num 64
 
 and then:
 
 ceph health detail
 HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
 5 ops are blocked > 65.536 sec
 1 ops are blocked > 32.768 sec
 1 ops are blocked > 32.768 sec on osd.16
 1 ops are blocked > 65.536 sec on osd.77
 4 ops are blocked > 65.536 sec on osd.98
 3 osds have slow requests
 
 This error also went away after a day.
 
 ceph health detail
 HEALTH_OK
 
 
 Now, the question I have is, will this pg number remain effective on the
 cluster, even if we restart MON or OSD’s on the individual disks?  I 
 haven’t
 changed the values in /etc/ceph/ceph.conf. Do I need to make a change to 
 the
 ceph.conf and push that change to all the MON, MSD and OSD’s ?
>>> 
>>> It's durable once the commands are successful on the monitors. You're all 
>>> done.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delays while waiting_for_osdmap according to dump_historic_ops

2014-09-08 Thread Gregory Farnum
On Sun, Sep 7, 2014 at 4:28 PM, Alex Moore  wrote:
> I recently found out about the "ceph --admin-daemon
> /var/run/ceph/ceph-osd..asok dump_historic_ops" command, and noticed
> something unexpected in the output on my cluster, after checking numerous
> output samples...
>
> It looks to me like "normal" write ops on my cluster spend roughly:
>
> <1ms between "received_at" and "waiting_for_osdmap"
> <1ms between "waiting_for_osdmap" and "reached_pg"
> <15ms between "reached_pg" and "commit_sent"
> <15ms between "commit_sent" and "done"
>
> For reference, this is a small (3-host) all-SSD cluster, with monitors
> co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the
> journal, and 1 SSD for the OSD + monitor data (I initially had the monitor
> data on the same drive as the OS, but encountered performance problems -
> which have since been alleviated by moving the monitor data to the same
> drives as the OSDs). Networking is InfiniBand (8 Gbps dedicated
> point-to-point link between each pair of hosts). I'm running v0.80.5. And
> the OSDs use XFS.
>
> Anyway, as this command intentionally shows the worst few recent IOs, I only
> rarely see examples that match the above "norm". Rather, the typical
> outliers that it highlights are usually write IOs with ~100-300ms latency,
> where the extra latency exists purely between the "received_at" and
> "reached_pg" timestamps, and mostly in the "waiting_for_osdmap" step. Also
> it looks like these slow IOs come in batches. Every write IO arriving within
> the same ~1 second period will suffer from these strangely slow initial two
> steps, with the additional latency being almost identical for each one
> within the same batch. After which things return to normal again in that
> those steps take <1ms. So compared to the above "norm", these look more
> like:
>
> ~50ms between "received_at" and "waiting_for_osdmap"
> ~150ms between "waiting_for_osdmap" and "reached_pg"
> <15ms between "reached_pg" and "commit_sent"
> <15ms between "commit_sent" and "done"
>
> This seems unexpected to me. I don't see why those initial steps in the IO
> should ever take such a long time to complete. Where should I be looking
> next to track down the cause? I'm guessing that "waiting_for_osdmap"
> involves OSD<->Mon communication, and so perhaps indicates poor performance
> of the Mons. But for there to be any non-negligible delay between
> "received_at" and "waiting_for_osdmap" makes no sense to me at all.

First thing here is to explain what each of these events actually mean.
"received_at" is the point at which we *started* reading the message
off the wire. We have to finish reading it off and dispatch it to the
OSD before the next one.
"waiting_for_osdmap" is slightly misnamed; it's the point at which the
op was submitted to the OSD. It's called that because receiving a
message with a newer OSDMap epoch than we have is the most common
long-term delay in this phase, but we also have to do some other
preprocessing and queue the Op up.
"reached_pg" is the point at which the Op is dequeued by a worker
thread and has the necessary mutexes to get processed. After this
point we're going to try and actually do the operations described
(reads or writes).
"commit_sent" indicates that we've actually sent back the commit to
the client or primary OSD.
"done" indicates that the op has been completed (commit_sent doesn't
wait for the op to have been applied to the backing filesystem; this
does).
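
As a hedged illustration (not from the original message), one way to collect
these timings for comparison is to snapshot the admin socket output while a
slow period is happening and then compare the per-op event timestamps; osd.0
and the paths below are only examples:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops > /tmp/historic.json
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight > /tmp/in_flight.json
# then compare the received_at, waiting_for_osdmap, reached_pg, commit_sent
# and done timestamps per op to see where the latency accumulates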

There are probably a bunch of causes for the behavior you're seeing,
but the most likely is that you've occasionally got a whole bunch of
operations going to a single object/placement group and they're taking
some time to process because they have to be serialized. This would
prevent the PG from handling newer ops while the old ones are still
being processed, and that could back up through the pipeline to slow
down the reads off the wire as well.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crash: trim_objectcould not find coid

2014-09-08 Thread Gregory Farnum
On Mon, Sep 8, 2014 at 1:42 AM, Francois Deppierraz
 wrote:
> Hi,
>
> This issue is on a small 2 servers (44 osds) ceph cluster running 0.72.2
> under Ubuntu 12.04. The cluster was filling up (a few osds near full)
> and I tried to increase the number of pg per pool to 1024 for each of
> the 14 pools to improve storage space balancing. This increase triggered
> high memory usage on the servers which were unfortunately
> under-provisioned (16 GB RAM for 22 osds) and started to swap and crash.
>
> After installing memory into the servers, the result is a broken cluster
> with unfound objects and two osds (osd.6 and osd.43) crashing at startup.
>
> $ ceph health
> HEALTH_WARN 166 pgs backfill; 326 pgs backfill_toofull; 2 pgs
> backfilling; 765 pgs degraded; 715 pgs down; 1 pgs incomplete; 715 pgs
> peering; 5 pgs recovering; 2 pgs recovery_wait; 716 pgs stuck inactive;
> 1856 pgs stuck unclean; 164 requests are blocked > 32 sec; recovery
> 517735/15915673 objects degraded (3.253%); 1241/7910367 unfound
> (0.016%); 3 near full osd(s); 1/43 in osds are down; noout flag(s) set
>
> osd.6 is crashing due to an assertion ("trim_objectcould not find coid")
> which leads to a resolved bug report which unfortunately doesn't give
> any advise on how to repair the osd.
>
> http://tracker.ceph.com/issues/5473
>
> It is much less obvious why osd.43 is crashing, please have a look at
> the following osd logs:
>
> http://paste.ubuntu.com/8288607/
> http://paste.ubuntu.com/8288609/

The first one is not caused by the same thing as the ticket you
reference (it was fixed well before emperor), so it appears to be some
kind of disk corruption.
The second one is definitely corruption of some kind as it's missing
an OSDMap it thinks it should have. It's possible that you're running
into bugs in emperor that were fixed after we stopped doing regular
support releases of it, but I'm more concerned that you've got disk
corruption in the stores. What kind of crashes did you see previously;
are there any relevant messages in dmesg, etc?

Given these issues, you might be best off identifying exactly which
PGs are missing, carefully copying them to working OSDs (use the osd
store tool), and killing these OSDs. Do lots of backups at each
stage...
-Greg
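
A rough sketch of the identification step (the pgid 2.1f is purely
illustrative; the commands below exist in emperor-era releases, but double-check
against your version):

ceph health detail | grep unfound    # lists the PGs with unfound objects
ceph pg 2.1f list_missing            # shows the missing/unfound objects in that PG
ceph pg map 2.1f                     # shows which OSDs the PG maps to
# the actual copy would then be done with ceph_filestore_tool against the
# stopped OSD's data and journal paths; check its --help for the exact
# export/import options in your build
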
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-08 Thread Sebastien Han
They definitely are Warren!

Thanks for bringing this here :).

On 05 Sep 2014, at 23:02, Wang, Warren  wrote:

> +1 to what Cedric said.
> 
> Anything more than a few minutes of heavy sustained writes tended to get our 
> solid state devices into a state where garbage collection could not keep up. 
> Originally we used small SSDs and did not overprovision the journals by much. 
> Manufacturers publish their SSD stats, and then in very small font, state 
> that the attained IOPS are with empty drives, and the tests are only run for 
> very short amounts of time.  Even if the drives are new, it's a good idea to 
> perform an hdparm secure erase on them (so that the SSD knows that the blocks 
> are truly unused), and then overprovision them. You'll know if you have a 
> problem by watching for utilization and wait data on the journals.
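
A hedged sketch of that secure-erase step (it irreversibly wipes the drive, the
device name is only an example, and it only works if the drive is not in the
"frozen" ATA security state):

hdparm -I /dev/sdX | grep -i frozen                      # must report "not frozen"
hdparm --user-master u --security-set-pass p /dev/sdX    # set a throwaway ATA password
hdparm --user-master u --security-erase p /dev/sdX       # issue the secure erase
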
> 
> One of the other interesting performance issues is that the Intel 10GbE NICs 
> + default kernel that we typically use max out around 1 million packets/sec. 
> It's worth tracking this metric to see if you are close. 
> 
> I know these aren't necessarily relevant to the test parameters you gave 
> below, but they're worth keeping in mind.
> 
> -- 
> Warren Wang
> Comcast Cloud (OpenStack)
> 
> 
> From: Cedric Lemarchand 
> Date: Wednesday, September 3, 2014 at 5:14 PM
> To: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
> IOPS
> 
> 
> Le 03/09/2014 22:11, Sebastien Han a écrit :
>> Hi Warren,
>> 
>> What do mean exactly by secure erase? At the firmware level with constructor 
>> softwares?
>> SSDs were pretty new so I don’t we hit that sort of things. I believe that 
>> only aged SSDs have this behaviour but I might be wrong.
>> 
> Sorry I forgot to reply to the real question ;-)
> So yes, it only comes into play after some time. In your case, if the SSD still 
> delivers the write IOPS specified by the manufacturer, it won't help in 
> any way.
> 
> But it seems this practice is nowadays increasingly used.
> 
> Cheers
>> On 02 Sep 2014, at 18:23, Wang, Warren 
>>  wrote:
>> 
>> 
>>> Hi Sebastien,
>>> 
>>> Something I didn't see in the thread so far, did you secure erase the SSDs 
>>> before they got used? I assume these were probably repurposed for this 
>>> test. We have seen some pretty significant garbage collection issue on 
>>> various SSD and other forms of solid state storage to the point where we 
>>> are overprovisioning pretty much every solid state device now. By as much 
>>> as 50% to handle sustained write operations. Especially important for the 
>>> journals, as we've found.
>>> 
>>> Maybe not an issue on the short fio run below, but certainly evident on 
>>> longer runs or lots of historical data on the drives. The max transaction 
>>> time looks pretty good for your test. Something to consider though.
>>> 
>>> Warren
>>> 
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
>>> Sent: Thursday, August 28, 2014 12:12 PM
>>> To: ceph-users
>>> Cc: Mark Nelson
>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
>>> IOPS
>>> 
>>> Hey all,
>>> 
>>> It has been a while since the last thread performance related on the ML :p 
>>> I've been running some experiment to see how much I can get from an SSD on 
>>> a Ceph cluster.
>>> To achieve that I did something pretty simple:
>>> 
>>> * Debian wheezy 7.6
>>> * kernel from debian 3.14-0.bpo.2-amd64
>>> * 1 cluster, 3 mons (i'd like to keep this realistic since in a real 
>>> deployment i'll use 3)
>>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>>> * 1 replica count of 1
>>> * partitions are perfectly aligned
>>> * io scheduler is set to noop but deadline was showing the same results
>>> * no updatedb running
>>> 
>>> About the box:
>>> 
>>> * 32GB of RAM
>>> * 12 cores with HT @ 2,4 GHz
>>> * WB cache is enabled on the controller
>>> * 10Gbps network (doesn't help here)
>>> 
>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K 
>>> iops with random 4k writes (my fio results) As a benchmark tool I used fio 
>>> with the rbd engine (thanks deutsche telekom guys!).
>>> 
>>> O_DIRECT and D_SYNC don't seem to be a problem for the SSD:
>>> 
>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>> 65536+0 records in
>>> 65536+0 records out
>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>> 
>>> # du -sh rand.file
>>> 256M    rand.file
>>> 
>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>> 65536+0 records in
>>> 65536+0 records out
>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>> 
>>> See my ceph.conf:
>>> 
>>> [global]
>>>  auth cluster required = cephx
>>>  auth service required = cephx
>>>  auth client required = cephx
>>>  fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>  osd pool default pg num = 4096
>>>  osd pool default pgp num = 4096
>>>  osd pool default size = 2

Re: [ceph-users] osd crash: trim_objectcould not find coid

2014-09-08 Thread Francois Deppierraz
Hi Greg,

Thanks for your support!

On 08. 09. 14 20:20, Gregory Farnum wrote:

> The first one is not caused by the same thing as the ticket you
> reference (it was fixed well before emperor), so it appears to be some
> kind of disk corruption.
> The second one is definitely corruption of some kind as it's missing
> an OSDMap it thinks it should have. It's possible that you're running
> into bugs in emperor that were fixed after we stopped doing regular
> support releases of it, but I'm more concerned that you've got disk
> corruption in the stores. What kind of crashes did you see previously;
> are there any relevant messages in dmesg, etc?

Nothing special in dmesg except probably irrelevant XFS warnings:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

All logs from before the disaster are still there, do you have any
advise on what would be relevant?

> Given these issues, you might be best off identifying exactly which
> PGs are missing, carefully copying them to working OSDs (use the osd
> store tool), and killing these OSDs. Do lots of backups at each
> stage...

This sounds scary, I'll keep fingers crossed and will do a bunch of
backups. There are 17 pg with missing objects.

What do you exactly mean by the osd store tool? Is it the
'ceph_filestore_tool' binary?

François

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crash: trim_objectcould not find coid

2014-09-08 Thread Gregory Farnum
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz
 wrote:
> Hi Greg,
>
> Thanks for your support!
>
> On 08. 09. 14 20:20, Gregory Farnum wrote:
>
>> The first one is not caused by the same thing as the ticket you
>> reference (it was fixed well before emperor), so it appears to be some
>> kind of disk corruption.
>> The second one is definitely corruption of some kind as it's missing
>> an OSDMap it thinks it should have. It's possible that you're running
>> into bugs in emperor that were fixed after we stopped doing regular
>> support releases of it, but I'm more concerned that you've got disk
>> corruption in the stores. What kind of crashes did you see previously;
>> are there any relevant messages in dmesg, etc?
>
> Nothing special in dmesg except probably irrelevant XFS warnings:
>
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

Hmm, I'm not sure what the outcome of that could be. Googling for the
error message returns this as the first result, though:
http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58429
Which indicates that it's a real deadlock and capable of messing up
your OSDs pretty good.

>
> All logs from before the disaster are still there, do you have any
> advise on what would be relevant?
>
>> Given these issues, you might be best off identifying exactly which
>> PGs are missing, carefully copying them to working OSDs (use the osd
>> store tool), and killing these OSDs. Do lots of backups at each
>> stage...
>
> This sounds scary, I'll keep fingers crossed and will do a bunch of
> backups. There are 17 pg with missing objects.
>
> What do you exactly mean by the osd store tool? Is it the
> 'ceph_filestore_tool' binary?

Yeah, that one.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread JR
Hi Christian, all,

Having researched this a bit more, it seemed that just doing

ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128

might be the answer.  Alas, it was not. After running the above the
cluster just sat there.

Finally, reading some more, I ran:

 ceph osd reweight-by-utilization

All this accomplished was moving the utilization of the first drive on the
affected node to the 2nd drive! E.g.:

---
BEFORE RUNNING:
---
Filesystem Use%
/dev/sdc1 57%
/dev/sdb1 65%
Filesystem Use%
/dev/sdc1 90%
/dev/sdb1 75%
Filesystem Use%
/dev/sdb1 52%
/dev/sdc1 52%
Filesystem Use%
/dev/sdc1 54%
/dev/sdb1 63%

---
AFTER RUNNING:
---
Filesystem Use%
/dev/sdc1 57%
/dev/sdb1 65%
Filesystem Use%
/dev/sdc1 70%  ** these two swapped (roughly) **
/dev/sdb1 92%  ** ^ ^^^ ^^^   **
Filesystem Use%
/dev/sdb1 52%
/dev/sdc1 52%
Filesystem Use%
/dev/sdc1 54%
/dev/sdb1 63%

root@osd45:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      3.44    root default
-2      0.86            host osd45
0       0.43                    osd.0   up      1
4       0.43                    osd.4   up      1
-3      0.86            host osd42
1       0.43                    osd.1   up      1
5       0.43                    osd.5   up      1
-4      0.86            host osd44
2       0.43                    osd.2   up      1
6       0.43                    osd.6   up      1
-5      0.86            host osd43
3       0.43                    osd.3   up      1
7       0.43                    osd.7   up      0.7007

So this isn't the answer either.

Could someone please chime in with an explanation/suggestion?

I suspect it might make sense to use 'ceph osd reweight osd.7 1' and
then run some form of 'ceph osd crush ...'?

Of course, I've read a number of things which suggest that the two
things I've done should have fixed my problem.

Is it (gasp!) possible that this, as Christian suggests, is a dumpling
issue and, were I running on firefly, it would be sufficient?


Thanks much
JR
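
For reference on the two commands being weighed here, a sketch with purely
illustrative weights:

ceph osd reweight 7 1                # resets the temporary 0..1 override that
                                     # reweight-by-utilization adjusts
ceph osd crush reweight osd.7 0.43   # sets the permanent CRUSH weight, normally
                                     # sized to the disk capacity in TB
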
On 9/8/2014 1:50 PM, JR wrote:
> Hi Christian,
> 
> I have 448 PGs and 448 PGPs (according to ceph -s).
> 
> This seems borne out by:
> 
> root@osd45:~# rados lspools
> data
> metadata
> rbd
> volumes
> images
> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
> get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
> data pg(pg_num: 64, pgppg_num: 64
> metadata pg(pg_num: 64, pgppg_num: 64
> rbd pg(pg_num: 64, pgppg_num: 64
> volumes pg(pg_num: 128, pgppg_num: 128
> images pg(pg_num: 128, pgppg_num: 128
> 
> According to the formula discussed in 'Uneven OSD usage,'
> 
> "The formula is actually OSDs * 100 / replication
> 
> in my case:
> 
> 8*100/2=400
> 
> So I'm erroring on the large size?
> 
> Or, does this formula apply on by pool basis?  Of my 5 pools I'm using 3:
> 
> root@osd45:~# rados df|cut -c1-45
> pool name   category KB
> data-  0
> images  -  0
> metadata- 10
> rbd -  568489533
> volumes -  594078601
>   total used  2326235048   285923
>   total avail 1380814968
>   total space 3707050016
> 
> So should I up the number of PGs for the rbd and volumes pools?
> 
> I'll continue looking at docs, but for now I'll send this off.
> 
> Thanks very much, Christain.
> 
> ps. This cluster is self-contained and all nodes in it are completely
> loaded (i.e., I can't add any more nodes nor disks).  It's also not an
> option at the moment to upgrade to firefly (can't make a big change
> before sending it out the door).
> 
> 
> 
> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>
>>> Greetings all,
>>>
>>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
>>> started showing:
>>>
>>> root@ocd45:~# ceph health
>>> HEALTH_WARN 1 near full osd(s)
>>>
>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
>>> 'Filesystem|osd/ceph'; done
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/sdc1   442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>>> /dev/sdb1   442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/sdc1   442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>>> /dev/sdb1   442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/sdb1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>>> /dev/sdc1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/sdc1   442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>>> /dev/sdb1   442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>>
>>>
>> See the very recen

[ceph-users] OSD is crashing while running admin socket

2014-09-08 Thread Somnath Roy
Hi Sage/Sam,

I faced a crash in OSD with latest Ceph master. Here is the log trace for the 
same.

ceph version 0.85-677-gd5777c4 (d5777c421548e7f039bb2c77cb0df2e9c7404723)
1: ceph-osd() [0x990def]
2: (()+0xfbb0) [0x7f72ae6e6bb0]
3: (gsignal()+0x37) [0x7f72acc08f77]
4: (abort()+0x148) [0x7f72acc0c5e8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
6: (()+0x5e856) [0x7f72ad512856]
7: (()+0x5e883) [0x7f72ad512883]
8: (()+0x5eaae) [0x7f72ad512aae]
9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int, 
unsigned int)+0x277) [0xa88747]
10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81) [0xa89541]
11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) [0x6b6483]
14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057]
15: (OSD::asok_command(std::string, std::map >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_>, 
std::less, std::allocator >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_> > > >&, 
std::string, std::ostream&)+0x1d7) [0x612cb7]
16: (OSDSocketHook::call(std::string, std::map >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_>, 
std::less, std::allocator >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_> > > >&, 
std::string, ceph::buffer::list&)+0x67) [0x67c8b7]
17: (AdminSocket::do_accept()+0x1007) [0xa79817]
18: (AdminSocket::entry()+0x258) [0xa7b448]
19: (()+0x7f6e) [0x7f72ae6def6e]
20: (clone()+0x6d) [0x7f72a9cd]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

Steps to reproduce:
---


1.   Run IOs.

2.   While the IOs are running, run the following command continuously (e.g. in a loop like the one sketched below):

"ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight"

3.   At some point the OSD will crash.
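
A concrete sketch of step 2, looping the admin socket command while the I/O
load runs (the osd id and asok path are examples):

while true; do
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight > /dev/null
done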

I think I have root-caused it.


1.   OpTracker::RemoveOnDelete::operator() is calling op->_unregistered() 
which clears out message->data() and payload

2.   After that, if optracking is enabled, we are calling 
unregister_inflight_op(), which removes the op from the xlist.

3.   Now, while dumping ops, we are calling _dump_op_descriptor_unlocked() 
from TrackedOP::dump, which tries to print the message.

4.   So, there is a race condition when it tries to print a message whose 
ops (data) field is already cleared.

A fix could be to call op->_unregistered() (when optracking is enabled) only after 
the op has been removed from the xlist.

With this fix, I am not getting the crash anymore.

If my observation is correct, please let me know. I will raise a bug and will 
fix that as part of the overall optracker performance improvement (I will 
submit that pull request soon).

Thanks & Regards
Somnath



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
teleph

Re: [ceph-users] OSD is crashing while running admin socket

2014-09-08 Thread Samuel Just
That seems reasonable.  Bug away!
-Sam

On Mon, Sep 8, 2014 at 5:11 PM, Somnath Roy  wrote:
> Hi Sage/Sam,
>
>
>
> I faced a crash in OSD with latest Ceph master. Here is the log trace for
> the same.
>
>
>
> ceph version 0.85-677-gd5777c4 (d5777c421548e7f039bb2c77cb0df2e9c7404723)
>
> 1: ceph-osd() [0x990def]
>
> 2: (()+0xfbb0) [0x7f72ae6e6bb0]
>
> 3: (gsignal()+0x37) [0x7f72acc08f77]
>
> 4: (abort()+0x148) [0x7f72acc0c5e8]
>
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
>
> 6: (()+0x5e856) [0x7f72ad512856]
>
> 7: (()+0x5e883) [0x7f72ad512883]
>
> 8: (()+0x5eaae) [0x7f72ad512aae]
>
> 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int,
> unsigned int)+0x277) [0xa88747]
>
> 10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81)
> [0xa89541]
>
> 11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
>
> 12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
>
> 13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) [0x6b6483]
>
> 14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057]
>
> 15: (OSD::asok_command(std::string, std::map boost::variant std::allocator >, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_>,
> std::less, std::allocator boost::variant std::allocator >, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_> > > >&,
> std::string, std::ostream&)+0x1d7) [0x612cb7]
>
> 16: (OSDSocketHook::call(std::string, std::map boost::variant std::allocator >, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_>,
> std::less, std::allocator boost::variant std::allocator >, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_> > > >&,
> std::string, ceph::buffer::list&)+0x67) [0x67c8b7]
>
> 17: (AdminSocket::do_accept()+0x1007) [0xa79817]
>
> 18: (AdminSocket::entry()+0x258) [0xa7b448]
>
> 19: (()+0x7f6e) [0x7f72ae6def6e]
>
> 20: (clone()+0x6d) [0x7f72a9cd]
>
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>
>
> Steps to reproduce:
>
> ---
>
>
>
> 1.   Run ios
>
> 2.   While ios running , run the following command continuously.
>
>
>
> “ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight”
>
>
>
> 3.   At some point the osd will be crashed.
>
>
>
> I think I have root caused it..
>
>
>
> 1.   OpTracker::RemoveOnDelete::operator() is calling
> op->_unregistered() which clears out message->data() and payload
>
> 2.   After that, if optracking is enabled we are calling
> unregister_inflight_op() which removed the op from the xlist.
>
> 3.   Now, while dumping ops, we are calling
> _dump_op_descriptor_unlocked() from TrackedOP::dump, which tries to print
> the message.
>
> 4.   So, there is a race condition when it tries to print the message
> whoes ops (data) field is already cleared.
>
>
>
> Fix could be, call this op->_unregistered (in case optracking is enabled)
> after it is removed from xlist.
>
>
>
> With this fix, I am not getting the crash anymore.
>
>
>
> If my observation is correct, please let me know. I will raise a bug and
> will fix that as part of the overall optracker performance improvement (I
> will submit that pull request soon).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
> 
>
> PLEASE NOTE: T

Re: [ceph-users] OSD is crashing while running admin socket

2014-09-08 Thread Somnath Roy
Created the following tracker and assigned to me.

http://tracker.ceph.com/issues/9384

Thanks & Regards
Somnath

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com]
Sent: Monday, September 08, 2014 5:22 PM
To: Somnath Roy
Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
ceph-users@lists.ceph.com
Subject: Re: OSD is crashing while running admin socket

That seems reasonable.  Bug away!
-Sam

On Mon, Sep 8, 2014 at 5:11 PM, Somnath Roy  wrote:
> Hi Sage/Sam,
>
>
>
> I faced a crash in OSD with latest Ceph master. Here is the log trace
> for the same.
>
>
>
> ceph version 0.85-677-gd5777c4
> (d5777c421548e7f039bb2c77cb0df2e9c7404723)
>
> 1: ceph-osd() [0x990def]
>
> 2: (()+0xfbb0) [0x7f72ae6e6bb0]
>
> 3: (gsignal()+0x37) [0x7f72acc08f77]
>
> 4: (abort()+0x148) [0x7f72acc0c5e8]
>
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
>
> 6: (()+0x5e856) [0x7f72ad512856]
>
> 7: (()+0x5e883) [0x7f72ad512883]
>
> 8: (()+0x5eaae) [0x7f72ad512aae]
>
> 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned
> int, unsigned int)+0x277) [0xa88747]
>
> 10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81)
> [0xa89541]
>
> 11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
>
> 12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
>
> 13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223)
> [0x6b6483]
>
> 14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057]
>
> 15: (OSD::asok_command(std::string, std::map boost::variant std::vector >,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_>, std::less,
> std::allocator boost::variant std::vector >,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_> > > >&, std::string,
> std::ostream&)+0x1d7) [0x612cb7]
>
> 16: (OSDSocketHook::call(std::string, std::map boost::variant std::vector >,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_>, std::less,
> std::allocator boost::variant std::vector >,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_, boost::detail::variant::void_,
> boost::detail::variant::void_> > > >&, std::string,
> ceph::buffer::list&)+0x67) [0x67c8b7]
>
> 17: (AdminSocket::do_accept()+0x1007) [0xa79817]
>
> 18: (AdminSocket::entry()+0x258) [0xa7b448]
>
> 19: (()+0x7f6e) [0x7f72ae6def6e]
>
> 20: (clone()+0x6d) [0x7f72a9cd]
>
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
>
>
> Steps to reproduce:
>
> ---
>
>
>
> 1.   Run ios
>
> 2.   While ios running , run the following command continuously.
>
>
>
> “ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight”
>
>
>
> 3.   At some point the osd will be crashed.
>
>
>
> I think I have root caused it..
>
>
>
> 1.   OpTracker::RemoveOnDelete::operator() is calling
> op->_unregistered() which clears out message->data() and payload
>
> 2.   After that, if optracking is enabled we are calling
> unregister_inflight_op() which removed the op from the xlist.
>
> 3.   Now, while dumping ops, we are calling
> _dump_op_descriptor_unlocked() from TrackedOP::dump, which tries to
> print the message.
>
> 4.   So, there is a race condition when it tries to print the message
> whoes ops (data) field is already cleared.
>
>
>
> Fix could be, call this op->_unregistered (in case optracking is

Re: [ceph-users] OSD is crashing while running admin socket

2014-09-08 Thread Sage Weil
On Tue, 9 Sep 2014, Somnath Roy wrote:
> Created the following tracker and assigned to me.
> 
> http://tracker.ceph.com/issues/9384

By the way, this might be the same as or similar to
http://tracker.ceph.com/issues/8885

Thanks!
sage


> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Monday, September 08, 2014 5:22 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OSD is crashing while running admin socket
> 
> That seems reasonable.  Bug away!
> -Sam
> 
> On Mon, Sep 8, 2014 at 5:11 PM, Somnath Roy  wrote:
> > Hi Sage/Sam,
> >
> >
> >
> > I faced a crash in OSD with latest Ceph master. Here is the log trace
> > for the same.
> >
> >
> >
> > ceph version 0.85-677-gd5777c4
> > (d5777c421548e7f039bb2c77cb0df2e9c7404723)
> >
> > 1: ceph-osd() [0x990def]
> >
> > 2: (()+0xfbb0) [0x7f72ae6e6bb0]
> >
> > 3: (gsignal()+0x37) [0x7f72acc08f77]
> >
> > 4: (abort()+0x148) [0x7f72acc0c5e8]
> >
> > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
> >
> > 6: (()+0x5e856) [0x7f72ad512856]
> >
> > 7: (()+0x5e883) [0x7f72ad512883]
> >
> > 8: (()+0x5eaae) [0x7f72ad512aae]
> >
> > 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned
> > int, unsigned int)+0x277) [0xa88747]
> >
> > 10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81)
> > [0xa89541]
> >
> > 11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
> >
> > 12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
> >
> > 13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223)
> > [0x6b6483]
> >
> > 14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057]
> >
> > 15: (OSD::asok_command(std::string, std::map > boost::variant > std::vector >,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_>, std::less,
> > std::allocator > boost::variant > std::vector >,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_> > > >&, std::string,
> > std::ostream&)+0x1d7) [0x612cb7]
> >
> > 16: (OSDSocketHook::call(std::string, std::map > boost::variant > std::vector >,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_>, std::less,
> > std::allocator > boost::variant > std::vector >,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_, boost::detail::variant::void_,
> > boost::detail::variant::void_> > > >&, std::string,
> > ceph::buffer::list&)+0x67) [0x67c8b7]
> >
> > 17: (AdminSocket::do_accept()+0x1007) [0xa79817]
> >
> > 18: (AdminSocket::entry()+0x258) [0xa7b448]
> >
> > 19: (()+0x7f6e) [0x7f72ae6def6e]
> >
> > 20: (clone()+0x6d) [0x7f72a9cd]
> >
> > NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> >
> >
> > Steps to reproduce:
> >
> > ---
> >
> >
> >
> > 1.   Run ios
> >
> > 2.   While ios running , run the following command continuously.
> >
> >
> >
> > "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight"
> >
> >
> >
> > 3.   At some point the osd will be crashed.
> >
> >
> >
> > I think I have root caused it..
> >
> >
> >
> > 1.   OpTracker::RemoveOnDelete::operator() is calling
> > op->_unregistered() which clears out message->data() and payload
> >
> > 2.   After tha

Re: [ceph-users] OSD is crashing while running admin socket

2014-09-08 Thread Somnath Roy
Yeah!!..Looks similar but not entirely..
There is another potential race condition that may cause this.

We are protecting the TrackedOp::events structure only during 
TrackedOp::mark_event() with the lock mutex; I couldn't find it protected anywhere 
else. The events structure should also be protected during dump, and more 
specifically within _dump().
I am taking care of that as well.

Thanks & Regards
Somnath
-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Monday, September 08, 2014 5:59 PM
To: Somnath Roy
Cc: Samuel Just; ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: RE: OSD is crashing while running admin socket

On Tue, 9 Sep 2014, Somnath Roy wrote:
> Created the following tracker and assigned to me.
> 
> http://tracker.ceph.com/issues/9384

By the way, this might be the same as or similar to
http://tracker.ceph.com/issues/8885

Thanks!
sage


> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Monday, September 08, 2014 5:22 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OSD is crashing while running admin socket
> 
> That seems reasonable.  Bug away!
> -Sam
> 
> On Mon, Sep 8, 2014 at 5:11 PM, Somnath Roy  wrote:
> > Hi Sage/Sam,
> >
> >
> >
> > I faced a crash in OSD with latest Ceph master. Here is the log 
> > trace for the same.
> >
> >
> >
> > ceph version 0.85-677-gd5777c4
> > (d5777c421548e7f039bb2c77cb0df2e9c7404723)
> >
> > 1: ceph-osd() [0x990def]
> >
> > 2: (()+0xfbb0) [0x7f72ae6e6bb0]
> >
> > 3: (gsignal()+0x37) [0x7f72acc08f77]
> >
> > 4: (abort()+0x148) [0x7f72acc0c5e8]
> >
> > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
> >
> > 6: (()+0x5e856) [0x7f72ad512856]
> >
> > 7: (()+0x5e883) [0x7f72ad512883]
> >
> > 8: (()+0x5eaae) [0x7f72ad512aae]
> >
> > 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, 
> > unsigned int, unsigned int)+0x277) [0xa88747]
> >
> > 10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81) 
> > [0xa89541]
> >
> > 11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
> >
> > 12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
> >
> > 13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) 
> > [0x6b6483]
> >
> > 14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) 
> > [0x6b7057]
> >
> > 15: (OSD::asok_command(std::string, std::map > boost::variant > std::vector >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_>, std::less, 
> > std::allocator > boost::variant > std::vector >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_> > > >&, std::string,
> > std::ostream&)+0x1d7) [0x612cb7]
> >
> > 16: (OSDSocketHook::call(std::string, std::map > boost::variant > std::vector >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_>, std::less, 
> > std::allocator > boost::variant > std::vector >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_> > > >&, std::string,
> > ceph::buffer::list&)+0x67) [0x67c8b7]
> >
> > 17: (AdminSocket::do_accept()+0x1007) [0xa79817]
> >
> > 18: (AdminSocket::entry()+0x258) [0xa7b448]
> >
> > 19: (()+0x7f6e) [0x7

Re: [ceph-users] resizing the OSD

2014-09-08 Thread Christian Balzer

Hello,

On Mon, 08 Sep 2014 09:53:58 -0700 JIten Shah wrote:

> 
> On Sep 6, 2014, at 8:22 PM, Christian Balzer  wrote:
> 
> > 
> > Hello,
> > 
> > On Sat, 06 Sep 2014 10:28:19 -0700 JIten Shah wrote:
> > 
> >> Thanks Christian.  Replies inline.
> >> On Sep 6, 2014, at 8:04 AM, Christian Balzer  wrote:
> >> 
> >>> 
> >>> Hello,
> >>> 
> >>> On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
> >>> 
>  Hello Cephers,
>  
>  We created a ceph cluster with 100 OSD, 5 MON and 1 MSD and most of
>  the stuff seems to be working fine but we are seeing some degrading
>  on the osd's due to lack of space on the osd's. 
> >>> 
> >>> Please elaborate on that degradation.
> >> 
> >> The degradation happened on few OSD's because it got quickly filled
> >> up. They were not of the same size as the other OSD's. Now I want to
> >> remove these OSD's and readd them with correct size to match the
> >> others.
> > 
> > Alright, that's good idea, uniformity helps. ^^
> > 
> >>> 
>  Is there a way to resize the
>  OSD without bringing the cluster down?
>  
> >>> 
> >>> Define both "resize" and "cluster down".
> >> 
> >> Basically I want to remove the OSD's with incorrect size and readd
> >> them with the size matching the other OSD's. 
> >>> 
> >>> As in, resizing how? 
> >>> Are your current OSDs on disks/LVMs that are not fully used and thus
> >>> could be grown?
> >>> What is the size of your current OSDs?
> >> 
> >> The size of current OSD's is 20GB and we do have more unused space on
> >> the disk that we can make the LVM bigger and increase the size of the
> >> OSD's. I agree that we need to have all the disks of same size and I
> >> am working towards that.Thanks.
> >>> 
> > OK, so your OSDs are backed by LVM. 
> > A curious choice, any particular reason to do so?
> 
> We already had lvm’s carved out for some other project and were not
> using it so we decided to have OSD’s on those LVMs
> 
I see. ^^
You might want to do things quite a bit differently with your next cluster,
based on what you're learning from this one.

> > 
> > Either way, in theory you could grow things in place, obviously first
> > the LVM and then the underlying filesystem. Both ext4 and xfs support
> > online growing, so the OSD can keep running the whole time.
> > If you're unfamiliar with these things, play with them on a test
> > machine first. 
> > 
> > Now for the next step we will really need to know how you deployed ceph
> > and the result of "ceph osd tree" (not all 100 OSDs are needed, a
> > sample of a "small" and "big" OSD is sufficient).
> 
> Fixed all the sizes so all of them weight as 1
> [jshah@pv11p04si-mzk001 ~]$ ceph osd tree
> # id    weight  type name       up/down reweight
> -1      99      root default
> -2      1               host pv11p04si-mslave0005
> 0       1                       osd.0   up      1
> -3      1               host pv11p04si-mslave0006
> 1       1                       osd.1   up      1
> -4      1               host pv11p04si-mslave0007
> 2       1                       osd.2   up      1
> -5      1               host pv11p04si-mslave0008
> 3       1                       osd.3   up      1
> -6      1               host pv11p04si-mslave0009
> 4       1                       osd.4   up      1
> -7      1               host pv11p04si-mslave0010
> 5       1                       osd.5   up      1
> > 
Alright then, your cluster already thinks all OSDs are the same, even if
they're not.

So go ahead with what I wrote below, grow the LVs to the size of the
others, grow the filesystem and you should be done. 

No further activity needed, zero impact to the cluster.
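
A minimal sketch of that in-place grow, assuming an XFS-backed OSD on an LV
named /dev/vg0/osd0 mounted at /var/lib/ceph/osd/ceph-0 (both names and the
40G target are illustrative; adjust to your volume group and OSD id):

lvextend -L 40G /dev/vg0/osd0         # grow the LV to the size of the other OSDs
xfs_growfs /var/lib/ceph/osd/ceph-0   # XFS grows online via the mountpoint
# for an ext4-backed OSD it would be:  resize2fs /dev/vg0/osd0
df -h /var/lib/ceph/osd/ceph-0        # confirm the new size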

> > Depending on the results (it will probably have varying weights
> > depending on the size and a reweight value of 1 for all) you will need
> > to adjust the weight of the grown OSD in question accordingly with
> > "ceph osd crush reweight". 
> > That step will incur data movement, so do it one OSD at a time.
> > 
> >>> The normal way of growing a cluster is to add more OSDs.
> >>> Preferably of the same size and same performance disks.
> >>> This will not only simplify things immensely but also make them a lot
> >>> more predictable.
> >>> This of course depends on your use case and usage patterns, but often
> >>> when running out of space you're also running out of other resources
> >>> like CPU, memory or IOPS of the disks involved. So adding more
> >>> instead of growing them is most likely the way forward.
> >>> 
> >>> If you were to replace actual disks with larger ones, take them (the
> >>> OSDs) out one at a time and re-add it. If you're using ceph-deploy,
> >>> it will use the disk size as basic weight, if you're doing things
> >>> manually make sure to specify that size/weight accordingly.
> >>> Again, you do want to do this for all disks to keep things uniform.
> >>> 
> >>> If your cluster (pools really) are set to a replica size of at least
> >>> 2 (risky!) or 3 (as per Firefly default), taki

Re: [ceph-users] Updating the pg and pgp values

2014-09-08 Thread Christian Balzer

Hello,

On Mon, 08 Sep 2014 10:08:27 -0700 JIten Shah wrote:

> While checking the health of the cluster, I ran to the following error:
> 
> warning: health HEALTH_WARN too few pgs per osd (1< min 20)
> 
> When I checked the pg and php numbers, I saw the value was the default
> value of 64
> 
> ceph osd pool get data pg_num
> pg_num: 64
> ceph osd pool get data pgp_num
> pgp_num: 64
> 
> Checking the ceph documents, I updated the numbers to 2000 using the
> following commands:
> 
If that is the same cluster as in the other thread, you have 100 OSDs and a
replica count of 2, which gives a PG target of 5000, rounded up to the next power
of two: 8192!

> ceph osd pool set data pg_num 2000
> ceph osd pool set data pgp_num 2000
>

At the very least increase this to 2048, for a better chance at even data
distribution, but 4096 would be definitely better and 8192 the
recommended target. 
 
> It started resizing the data and saw health warnings again:
> 
> health HEALTH_WARN 1 requests are blocked > 32 sec; pool data pg_num
> 2000 > pgp_num 64
> 
> and then:
> 
> ceph health detail
> HEALTH_WARN 6 requests are blocked > 32 sec; 3 osds have slow requests
> 5 ops are blocked > 65.536 sec
> 1 ops are blocked > 32.768 sec
> 1 ops are blocked > 32.768 sec on osd.16
> 1 ops are blocked > 65.536 sec on osd.77
> 4 ops are blocked > 65.536 sec on osd.98
> 3 osds have slow requests
> 
> This error also went away after a day.
> 
That's caused by the data movement; given that you're on 100 hosts with a
single disk each, I would have thought it would be faster and have less
impact. It could of course also be related to other things; network
congestion comes to mind.

Increase PGs and PGPs in small steps; in my case Firefly won't let me add more
than 256 at a time.

You can also limit the impact of this to a point with the appropriate
settings, see the documentation.
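
As an illustration (not from the original message), the stepwise increase plus
recovery throttling could look like this; the option names are the usual
firefly-era ones, so double-check them against your release:

ceph osd pool set data pg_num 2256    # +256 over the current 2000
ceph osd pool set data pgp_num 2256
# ...repeat in steps of 256 until the target (4096 or 8192) is reached

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'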

Christian

> ceph health detail
> HEALTH_OK
> 
> 
> Now, the question I have is, will this pg number remain effective on the
> cluster, even if we restart MON or OSD’s on the individual disks?  I
> haven’t changed the values in /etc/ceph/ceph.conf. Do I need to make a
> change to the ceph.conf and push that change to all the MON, MSD and
> OSD’s ?
> 
> 
> Thanks.
> 
> —Jiten
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread Christian Balzer

Hello,

On Mon, 08 Sep 2014 13:50:08 -0400 JR wrote:

> Hi Christian,
> 
> I have 448 PGs and 448 PGPs (according to ceph -s).
> 
> This seems borne out by:
> 
> root@osd45:~# rados lspools
> data
> metadata
> rbd
> volumes
> images
> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
> get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
> data pg(pg_num: 64, pgppg_num: 64
> metadata pg(pg_num: 64, pgppg_num: 64
> rbd pg(pg_num: 64, pgppg_num: 64
> volumes pg(pg_num: 128, pgppg_num: 128
> images pg(pg_num: 128, pgppg_num: 128
> 
> According to the formula discussed in 'Uneven OSD usage,'
> 
> "The formula is actually OSDs * 100 / replication
> 
> in my case:
> 
> 8*100/2=400
> 
> So I'm erroring on the large size?
>
No, because for starters in the documentation (and if I recall correctly
also in that thread) the suggestion is to round up to nearest power of 2.
That would be 512 in your case, but with really small clusters
overprovisioning PGs/PGPs makes a lot of sense.

> Or, does this formula apply on by pool basis?  Of my 5 pools I'm using 3:
>
Not strictly, but clearly only used pools will have any impact and
benefit from this.
 
> root@nebula45:~# rados df|cut -c1-45
> pool name   category KB
> data-  0
> images  -  0
> metadata- 10
> rbd -  568489533
> volumes -  594078601
>   total used  2326235048   285923
>   total avail 1380814968
>   total space 3707050016
> 
> So should I up the number of PGs for the rbd and volumes pools?
> 
Definitely. 256 at least, but I personally would even go for 512,
especially with dumpling.

Christian
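
Concretely, that advice translates into something like the following, reusing
the pool-set syntax already shown in this thread (512 per the suggestion above):

ceph osd pool set rbd pg_num 512
ceph osd pool set rbd pgp_num 512
ceph osd pool set volumes pg_num 512
ceph osd pool set volumes pgp_num 512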

> I'll continue looking at docs, but for now I'll send this off.
> 
> Thanks very much, Christain.
> 
> ps. This cluster is self-contained and all nodes in it are completely
> loaded (i.e., I can't add any more nodes nor disks).  It's also not an
> option at the moment to upgrade to firefly (can't make a big change
> before sending it out the door).
> 
> 
> 
> On 9/8/2014 12:09 PM, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
> > 
> >> Greetings all,
> >>
> >> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
> >> started showing:
> >>
> >> root@ocd45:~# ceph health
> >> HEALTH_WARN 1 near full osd(s)
> >>
> >> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
> >> 'Filesystem|osd/ceph'; done
> >> Filesystem  Size  Used Avail Use% Mounted on
> >> /dev/sdc1   442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
> >> /dev/sdb1   442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
> >> Filesystem  Size  Used Avail Use% Mounted on
> >> /dev/sdc1   442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
> >> /dev/sdb1   442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
> >> Filesystem  Size  Used Avail Use% Mounted on
> >> /dev/sdb1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
> >> /dev/sdc1   442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
> >> Filesystem  Size  Used Avail Use% Mounted on
> >> /dev/sdc1   442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
> >> /dev/sdb1   442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
> >>
> >>
> > See the very recent "Uneven OSD usage" for a discussion about this.
> > What are your PG/PGP values?
> > 
> >> This cluster has been running for weeks, under significant load, and
> >> has been 100% stable. Unfortunately we have to ship it out of the
> >> building to another part of our business (where we will have little
> >> access to it).
> >>
> >> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant
> >> to just run it (I don't want to do anything that impacts this
> >> cluster's stability).
> >>
> >> Is there another, better way to equalize the distribution the data on
> >> the osd partitions?
> >>
> >> I'm running dumpling.
> >>
> > As per the thread and my experience, Firefly would solve this. If you
> > can upgrade during a weekend or whenever there is little to no access,
> > do it.
> > 
> > Another option (of course any and all of these will result in data
> > movement, so pick an appropriate time), would be to "use ceph osd
> > reweight" to lower the weight of osd.7 in particular.
> > 
> > Lastly, given the utilization of your cluster, you really ought to
> > deploy more OSDs and/or more nodes, if a node would go down you'd
> > easily get into a "real" near full or full situation.
> > 
> > Regards,
> > 
> > Christian
> > 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-08 Thread Quenten Grasso
This reminds me of something I was trying to find out a while back.

If we have 2000 random 4K IOPS, our cluster (assuming 3x replicas) will 
generate 6000 IOPS @ 4K onto the journals.

Does this mean our journals will absorb 6000 IOPS and turn these into X IOPS 
onto our spindles? 

If this is the case, is it possible to calculate how many IOPS a journal would 
"absorb" and how this would translate into X IOPS on the spindle disks?

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Sunday, 7 September 2014 1:38 AM
To: ceph-users
Subject: Re: [ceph-users] SSD journal deployment experiences

On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:

> September 6 2014 4:01 PM, "Christian Balzer"  wrote: 
> > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > 
> >> Hi Christian,
> >> 
> >> Let's keep debating until a dev corrects us ;)
> > 
> > For the time being, I give the recent:
> > 
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > 
> > And not so recent:
> > http://www.spinics.net/lists/ceph-users/msg04152.html
> > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > 
> > And I'm not going to use BTRFS for mainly RBD backed VM images 
> > (fragmentation city), never mind the other stability issues that 
> > crop up here ever so often.
> 
> 
> Thanks for the links... So until I learn otherwise, I better assume 
> the OSD is lost when the journal fails. Even though I haven't 
> understood exactly why :( I'm going to UTSL to understand the consistency 
> better.
> An op state diagram would help, but I didn't find one yet.
> 
Using the source as an option of last resort is always nice, having to actually 
do so for something like this feels a bit lacking in the documentation 
department (that or my google foo being weak). ^o^

> BTW, do you happen to know, _if_ we re-use an OSD after the journal 
> has failed, are any object inconsistencies going to be found by a 
> scrub/deep-scrub?
> 
No idea. 
And really a scenario I hope to never encounter. ^^;;
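
If someone did want to check, a rough (untested) sketch would be to force a
deep scrub of everything on that OSD and then look for inconsistent PGs:

  ceph osd deep-scrub <osd-id>            # deep-scrub all PGs on the re-used OSD
  ceph health detail | grep inconsistent  # any inconsistent PGs show up here
  ceph pg repair <pgid>                   # per-PG repair, if the other replicas are trusted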

> >> 
> >> We have 4 servers in a 3U rack, then each of those servers is 
> >> connected to one of these enclosures with a single SAS cable.
> >> 
>  With the current config, when I dd to all drives in parallel I 
>  can write at 24*74MB/s = 1776MB/s.
> >>> 
> >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 
> >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> >>> And given your storage pod I assume it is connected with 2 
> >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s 
> >>> SATA bandwidth.
> >> 
> >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> > 
> > Alright, that explains that then. Any reason for not using both ports?
> > 
> 
> Probably to minimize costs, and since the single 10Gig-E is a 
> bottleneck anyway. The whole thing is suboptimal anyway, since this 
> hardware was not purchased for Ceph to begin with. Hence retrofitting SSDs, 
> etc...
>
The single 10Gb/s link is the bottleneck for sustained stuff, but when looking 
at spikes...
Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port might 
also get some loving. ^o^

The cluster I'm currently building is based on storage nodes with 4 SSDs (100GB 
DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 HDDs. 
Connected with 40Gb/s Infiniband. Dual port, dual switch for redundancy, not 
speed. ^^ 
 
> >>> Impressive, even given your huge cluster with 1128 OSDs.
> >>> However that's not really answering my question, how much data is 
> >>> on an average OSD and thus gets backfilled in that hour?
> >> 
> >> That's true -- our drives have around 300TB on them. So I guess it 
> >> will take longer - 3x longer - when the drives are 1TB full.
> > 
> > On your slides, when the crazy user filled the cluster with 250 
> > million objects and thus 1PB of data, I recall seeing a 7 hour backfill 
> > time?
> > 
> 
> Yeah that was fun :) It was 250 million (mostly) 4k objects, so not 
> close to 1PB. The point was that to fill the cluster with RBD, we'd 
> need
> 250 million (4MB) objects. So, object-count-wise this was a full 
> cluster, but for the real volume it was more like 70TB IIRC (there 
> were some other larger objects too).
> 
Ah, I see. ^^

> In that case, the backfilling was CPU-bound, or perhaps 
> wbthrottle-bound, I don't remember... It was just that there were many 
> tiny tiny objects to synchronize.
> 
Indeed. This is something me and others have seen as well, as in backfilling 
being much slower than the underlying HW would permit and being CPU intensive.

> > Anyway, I guess the lesson to take away from this is that size and 
> > parallelism does indeed help, but even in a cluster like yours 
> > recovering from a 2TB loss would likely be in the 10 hour range...
> 
> Bigger clusters probably backfill faster 

[ceph-users] all my osds are down, but ceph -s says they are up and in.

2014-09-08 Thread yuelongguang
Hi all,
 
That is crazy.
1.
All my OSDs are down, but ceph -s says they are up and in. Why?
2.
Now all OSDs are down. A VM is using RBD as its disk, and inside the VM fio is 
reading/writing the disk, but it hangs and cannot be killed. Why?
 
thanks
 
[root@cephosd2-monb ~]# ceph -v
ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 
 [root@cephosd2-monb ~]# ceph -s
cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
 health HEALTH_WARN mds 0 is laggy
 monmap e13: 3 mons at 
{cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0},
 election epoch 154, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc
 mdsmap e21: 1/1/1 up {0=0=up:active(laggy or crashed)}
 osdmap e196: 5 osds: 5 up, 5 in
  pgmap v21836: 512 pgs, 5 pools, 3115 MB data, 805 objects
9623 MB used, 92721 MB / 102344 MB avail
                 512 active+clean
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread Christian Balzer

Hello,

On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:

> Hi Christian, all,
> 
> Having researched this a bit more, it seemed that just doing
> 
> ceph osd pool set rbd pg_num 128
> ceph osd pool set rbd pgp_num 128
> 
> might be the answer.  Alas, it was not. After running the above the
> cluster just sat there.
> 
Really now? No data movement, no health warnings during that in the logs,
no other error in the logs or when issuing that command?
Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?

You really want to get this addressed as per the previous reply before
doing anything further. Because with just 64 PGs (as in only 8 per OSD!)
massive imbalances are a given.

> Finally, reading some more, I ran:
> 
>  ceph osd reweight-by-utilization
> 
Reading can be dangerous. ^o^

I didn't mention this, as it never worked for me in any predictable way
and with a desirable outcome, especially in situations like yours.

> This accomplished moving the utilization of the first drive on the
> affected node to the 2nd drive! .e.g.:
> 
> ---
> BEFORE RUNNING:
> ---
> Filesystem Use%
> /dev/sdc1 57%
> /dev/sdb1 65%
> Filesystem Use%
> /dev/sdc1 90%
> /dev/sdb1 75%
> Filesystem Use%
> /dev/sdb1 52%
> /dev/sdc1 52%
> Filesystem Use%
> /dev/sdc1 54%
> /dev/sdb1 63%
> 
> ---
> AFTER RUNNING:
> ---
> Filesystem Use%
> /dev/sdc1 57%
> /dev/sdb1 65%
> Filesystem Use%
> /dev/sdc1 70%  ** these two swapped (roughly) **
> /dev/sdb1 92%  ** ^ ^^^ ^^^   **
> Filesystem Use%
> /dev/sdb1 52%
> /dev/sdc1 52%
> Filesystem Use%
> /dev/sdc1 54%
> /dev/sdb1 63%
> 
> root@osd45:~# ceph osd tree
> # idweight  type name   up/down reweight
> -1  3.44root default
> -2  0.86host osd45
> 0   0.43osd.0   up  1
> 4   0.43osd.4   up  1
> -3  0.86host osd42
> 1   0.43osd.1   up  1
> 5   0.43osd.5   up  1
> -4  0.86host osd44
> 2   0.43osd.2   up  1
> 6   0.43osd.6   up  1
> -5  0.86host osd43
> 3   0.43osd.3   up  1
> 7   0.43osd.7   up  0.7007
> 
> So this isn't the answer either.
> 
It might have been, if it had more PGs to distribute things along, see
above. But even then with the default dumpling tunables it might not be
much better.

> Could someone please chime in with an explanation/suggestion?
> 
> I suspect that might make sense to use 'ceph osd reweight osd.7 1' and
> then run some form of 'ceph osd crush ...'?
>
No need to crush anything, reweight it to 1 after adding PGs/PGPs and
after all that data movement has finished slowly dial down any still
overly utilized OSD.

Also per the "Uneven OSD usage" thread, you might run into a "full"
situation during data re-distribution. Increase PGs in small (64)
increments.
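
As a sketch, for the rbd pool that would look roughly like:

  ceph osd pool set rbd pg_num 192
  ceph osd pool set rbd pgp_num 192
  # wait for backfill to finish, watch "ceph df" and the fullest OSDs,
  # reweight if one of them creeps towards the full ratio, then repeat:
  ceph osd pool set rbd pg_num 256
  ceph osd pool set rbd pgp_num 256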

> Of course, I've read a number of things which suggest that the two
> things I've done should have fixed my problem.
> 
> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
> issue and, were I running on firefly, it would be sufficient?
> 
Running Firefly with all the tunables and probably hashpspool. 
Most of the tunables with the exception of "chooseleaf_vary_r" are
available on dumpling, hashpspool isn't AFAIK.
See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
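
One way to see what a cluster is actually running with (read-only until the
last step, which does trigger data movement) is roughly:

  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt  # non-default "tunable ..." lines show at the top
  ceph osd crush tunables bobtail                  # or "optimal" where the release supports it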

Christian
> 
> Thanks much
> JR
> On 9/8/2014 1:50 PM, JR wrote:
> > Hi Christian,
> > 
> > I have 448 PGs and 448 PGPs (according to ceph -s).
> > 
> > This seems borne out by:
> > 
> > root@osd45:~# rados lspools
> > data
> > metadata
> > rbd
> > volumes
> > images
> > root@osd45:~# for i in $(rados lspools); do echo "$i: $(ceph osd pool
> > get $i pg_num), $(ceph osd pool get $i pgp_num)"; done
> > data: pg_num: 64, pgp_num: 64
> > metadata: pg_num: 64, pgp_num: 64
> > rbd: pg_num: 64, pgp_num: 64
> > volumes: pg_num: 128, pgp_num: 128
> > images: pg_num: 128, pgp_num: 128
> > 
> > According to the formula discussed in 'Uneven OSD usage,'
> > 
> > "The formula is actually OSDs * 100 / replication
> > 
> > in my case:
> > 
> > 8*100/2=400
> > 
> > So I'm erring on the large side?
> > 
> > Or, does this formula apply on by pool basis?  Of my 5 pools I'm using
> > 3:
> > 
> > root@osd45:~# rados df|cut -c1-45
> > pool name   category KB
> > data-  0
> > images  -  0
> > metadata- 10
> > rbd -  568489533
> > volumes -  594078601
> >   total used  2326235048   285923
> >   total avail 1380814968
> >   total space 3707050016
> > 
> > So should I up the number of

[ceph-users] mix ceph version with 0.80.5 and 0.85

2014-09-08 Thread 廖建锋
Dear all,
 As there are a lot of bugs in the keyvalue backend of the 0.80.5 firefly version,
I want to upgrade to 0.85 for some OSDs which are already down and unable to
start, and keep the other OSDs on 0.80.5. I am wondering, will it work?


廖建锋 Derek
Operations Manager
Add: 上海市浦东新区金桥路525号654幢2楼
Tel : +86 21 6133 0163-165
Fax : +86 21 6133 0262
Mob : +86 137 0181 5755
Mail : de...@f-club.cn
Http : www.fclub.cn
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-08 Thread Christian Balzer
On Tue, 9 Sep 2014 01:40:42 + Quenten Grasso wrote:

> This reminds me of something I was trying to find out a while back.
> 
> If we have 2000 random 4K IOPS, our cluster
> (assuming 3x replicas) will generate 6000 IOPS @ 4K onto the journals.
> 
> Does this mean our journals will absorb 6000 IOPS and turn these into X
> IOPS onto our spindles? 
> 
In theory, yes.

> If this is the case, is it possible to calculate how many IOPS a journal
> would "absorb" and how this would translate into X IOPS on the spindle disks?
> 
It very much depends, there are a number of configuration parameters that
will influence this and what those IOPS actually are.

As an example, with "rados -p rbd bench 30 write -t 32 -b 4096" I see a
ratio of 3:1 on a cluster here, as measured with the ole mark 1 eyeball
and atop or iostat.
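
For anyone who wants to reproduce that kind of measurement, a rough sketch
(device names are examples only):

  # on a client, generate small writes:
  rados -p rbd bench 30 write -t 32 -b 4096
  # meanwhile on an OSD node, compare w/s on the journal device (here sdb)
  # with w/s on the data disks (here sdc and sdd):
  iostat -x 1 sdb sdc sdd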

Christian
> Regards,
> Quenten Grasso
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer Sent: Sunday, 7 September 2014 1:38 AM
> To: ceph-users
> Subject: Re: [ceph-users] SSD journal deployment experiences
> 
> On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:
> 
> > September 6 2014 4:01 PM, "Christian Balzer"  wrote: 
> > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > > 
> > >> Hi Christian,
> > >> 
> > >> Let's keep debating until a dev corrects us ;)
> > > 
> > > For the time being, I give the recent:
> > > 
> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > > 
> > > And not so recent:
> > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > > 
> > > And I'm not going to use BTRFS for mainly RBD backed VM images 
> > > (fragmentation city), never mind the other stability issues that 
> > > crop up here ever so often.
> > 
> > 
> > Thanks for the links... So until I learn otherwise, I better assume 
> > the OSD is lost when the journal fails. Even though I haven't 
> > understood exactly why :( I'm going to UTSL to understand the
> > consistency better. An op state diagram would help, but I didn't find
> > one yet.
> > 
> Using the source as an option of last resort is always nice, having to
> actually do so for something like this feels a bit lacking in the
> documentation department (that or my google foo being weak). ^o^
> 
> > BTW, do you happen to know, _if_ we re-use an OSD after the journal 
> > has failed, are any object inconsistencies going to be found by a 
> > scrub/deep-scrub?
> > 
> No idea. 
> And really a scenario I hope to never encounter. ^^;;
> 
> > >> 
> > >> We have 4 servers in a 3U rack, then each of those servers is 
> > >> connected to one of these enclosures with a single SAS cable.
> > >> 
> >  With the current config, when I dd to all drives in parallel I 
> >  can write at 24*74MB/s = 1776MB/s.
> > >>> 
> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 
> > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> > >>> And given your storage pod I assume it is connected with 2 
> > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s 
> > >>> SATA bandwidth.
> > >> 
> > >> From above, we are only using 4 lanes -- so around 2GB/s is
> > >> expected.
> > > 
> > > Alright, that explains that then. Any reason for not using both
> > > ports?
> > > 
> > 
> > Probably to minimize costs, and since the single 10Gig-E is a 
> > bottleneck anyway. The whole thing is suboptimal anyway, since this 
> > hardware was not purchased for Ceph to begin with. Hence retrofitting
> > SSDs, etc...
> >
> The single 10Gb/s link is the bottleneck for sustained stuff, but when
> looking at spikes... Oh well, I guess if you ever connect that 2nd 10GbE
> card that 2nd port might also get some loving. ^o^
> 
> The cluster I'm currently building is based on storage nodes with 4 SSDs
> (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and
> 8 HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
> redundancy, not speed. ^^ 
> > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > >>> However that's not really answering my question, how much data is 
> > >>> on an average OSD and thus gets backfilled in that hour?
> > >> 
> > >> That's true -- our drives have around 300TB on them. So I guess it 
> > >> will take longer - 3x longer - when the drives are 1TB full.
> > > 
> > > On your slides, when the crazy user filled the cluster with 250 
> > > million objects and thus 1PB of data, I recall seeing a 7 hour
> > > backfill time?
> > > 
> > 
> > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not 
> > close to 1PB. The point was that to fill the cluster with RBD, we'd 
> > need
> > 250 million (4MB) objects. So, object-count-wise this was a full 
> > cluster, but for the real volume it was more like 70TB IIRC (there 
> > were some other larger objects too).
> > 
> Ah, I se

Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread JR
Hi Christian,

Ha ...

root@osd45:~# ceph osd pool get rbd pg_num
pg_num: 128
root@osd45:~# ceph osd pool get rbd pgp_num
pgp_num: 64

That's the explanation!  I did run the command but it spit out some
(what I thought was a harmless) warning; should have checked more carefully.

I now have the expected data movement.

Thanks alot!
JR

On 9/8/2014 10:04 PM, Christian Balzer wrote:
> 
> Hello,
> 
> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
> 
>> Hi Christian, all,
>>
>> Having researched this a bit more, it seemed that just doing
>>
>> ceph osd pool set rbd pg_num 128
>> ceph osd pool set rbd pgp_num 128
>>
>> might be the answer.  Alas, it was not. After running the above the
>> cluster just sat there.
>>
> Really now? No data movement, no health warnings during that in the logs,
> no other error in the logs or when issuing that command?
> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
> 
> You really want to get this addressed as per the previous reply before
> doing anything further. Because with just 64 PGs (as in only 8 per OSD!)
> massive imbalances are a given.
> 
>> Finally, reading some more, I ran:
>>
>>  ceph osd reweight-by-utilization
>>
> Reading can be dangerous. ^o^
> 
> I didn't mention this, as it never worked for me in any predictable way
> and with a desirable outcome, especially in situations like yours.
> 
>> This accomplished moving the utilization of the first drive on the
>> affected node to the 2nd drive! .e.g.:
>>
>> ---
>> BEFORE RUNNING:
>> ---
>> Filesystem Use%
>> /dev/sdc1 57%
>> /dev/sdb1 65%
>> Filesystem Use%
>> /dev/sdc1 90%
>> /dev/sdb1 75%
>> Filesystem Use%
>> /dev/sdb1 52%
>> /dev/sdc1 52%
>> Filesystem Use%
>> /dev/sdc1 54%
>> /dev/sdb1 63%
>>
>> ---
>> AFTER RUNNING:
>> ---
>> Filesystem Use%
>> /dev/sdc1 57%
>> /dev/sdb1 65%
>> Filesystem Use%
>> /dev/sdc1 70%  ** these two swapped (roughly) **
>> /dev/sdb1 92%  ** ^ ^^^ ^^^   **
>> Filesystem Use%
>> /dev/sdb1 52%
>> /dev/sdc1 52%
>> Filesystem Use%
>> /dev/sdc1 54%
>> /dev/sdb1 63%
>>
>> root@osd45:~# ceph osd tree
>> # idweight  type name   up/down reweight
>> -1  3.44root default
>> -2  0.86host osd45
>> 0   0.43osd.0   up  1
>> 4   0.43osd.4   up  1
>> -3  0.86host osd42
>> 1   0.43osd.1   up  1
>> 5   0.43osd.5   up  1
>> -4  0.86host osd44
>> 2   0.43osd.2   up  1
>> 6   0.43osd.6   up  1
>> -5  0.86host osd43
>> 3   0.43osd.3   up  1
>> 7   0.43osd.7   up  0.7007
>>
>> So this isn't the answer either.
>>
> It might have been, if it had more PGs to distribute things along, see
> above. But even then with the default dumpling tunables it might not be
> much better.
> 
>> Could someone please chime in with an explanation/suggestion?
>>
>> I suspect that might make sense to use 'ceph osd reweight osd.7 1' and
>> then run some form of 'ceph osd crush ...'?
>>
> No need to crush anything, reweight it to 1 after adding PGs/PGPs and
> after all that data movement has finished slowly dial down any still
> overly utilized OSD.
> 
> Also per the "Uneven OSD usage" thread, you might run into a "full"
> situation during data re-distribution. Increase PGs in small (64)
> increments.
> 
>> Of course, I've read a number of things which suggest that the two
>> things I've done should have fixed my problem.
>>
>> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
>> issue and, were I running on firefly, it would be sufficient?
>>
> Running Firefly with all the tunables and probably hashpspool. 
> Most of the tunables with the exception of "chooseleaf_vary_r" are
> available on dumpling, hashpspool isn't AFAIK.
> See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> 
> Christian
>>
>> Thanks much
>> JR
>> On 9/8/2014 1:50 PM, JR wrote:
>>> Hi Christian,
>>>
>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>
>>> This seems borne out by:
>>>
>>> root@osd45:~# rados lspools
>>> data
>>> metadata
>>> rbd
>>> volumes
>>> images
>>> root@osd45:~# for i in $(rados lspools); do echo "$i: $(ceph osd pool
>>> get $i pg_num), $(ceph osd pool get $i pgp_num)"; done
>>> data: pg_num: 64, pgp_num: 64
>>> metadata: pg_num: 64, pgp_num: 64
>>> rbd: pg_num: 64, pgp_num: 64
>>> volumes: pg_num: 128, pgp_num: 128
>>> images: pg_num: 128, pgp_num: 128
>>>
>>> According to the formula discussed in 'Uneven OSD usage,'
>>>
>>> "The formula is actually OSDs * 100 / replication
>>>
>>> in my case:
>>>
>>> 8*100/2=400
>>>
>>> So I'm erring on the large side?
>>>
>>> Or, does this formula apply on by pool basis?  Of

[ceph-users] Re: mix ceph version with 0.80.5 and 0.85

2014-09-08 Thread 廖建锋
Looks like it doesn't work. I noticed that 0.85 added a superblock to the leveldb
OSD; the OSDs which I already have do not have a superblock.
Can anybody tell me how to upgrade the OSDs?



From: ceph-users
Sent: 2014-09-09 10:32
To: ceph-users
Subject: [ceph-users] mix ceph version with 0.80.5 and 0.85
Dear all,
 As there are a lot of bugs in the keyvalue backend of the 0.80.5 firefly version,
I want to upgrade to 0.85 for some OSDs which are already down and unable to
start, and keep the other OSDs on 0.80.5. I am wondering, will it work?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85

2014-09-08 Thread Jason King
Check the docs.

2014-09-09 11:02 GMT+08:00 廖建锋 :

>  Looks like it doesn't work. I noticed that 0.85 added a superblock to the
> leveldb OSD; the OSDs which I already have do not have a superblock.
> Can anybody tell me how to upgrade the OSDs?
>
>
>
>  *From:* ceph-users 
> *Sent:* 2014-09-09 10:32
> *To:* ceph-users 
> *Subject:* [ceph-users] mix ceph version with 0.80.5 and 0.85
>   Dear all,
>  As there are a lot of bugs in the keyvalue backend of the 0.80.5 firefly
> version, I want to upgrade to 0.85 for some OSDs which are already down
> and unable to start,
> and keep the other OSDs on 0.80.5. I am wondering, will it work?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] all my osds are down, but ceph -s says they are up and in.

2014-09-08 Thread Sage Weil
On Tue, 9 Sep 2014, yuelongguang wrote:
> Hi all,
>  
> That is crazy.
> 1.
> All my OSDs are down, but ceph -s says they are up and in. Why?

Peer OSDs normally handle failure detection.  If all OSDs are down, 
there is nobody to report the failures.

After 5 or 10 minutes if the OSDs don't report any stats to the monitor it 
will eventually assume they are dead and mark them down.
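
The knobs involved are roughly these (a sketch; the values shown are only
illustrative, not a recommendation):

  [global]
  osd heartbeat grace = 20          # how long peers wait before reporting an OSD down
  mon osd min down reporters = 1    # how many peer reports the monitor requires
  mon osd report timeout = 900      # mark OSDs down if they stop reporting to the monitor
  mon osd down out interval = 300   # how long after "down" an OSD is also marked "out"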

> 2.
> Now all OSDs are down. A VM is using RBD as its disk, and inside the VM fio is
> reading/writing the disk, but it hangs and cannot be killed. Why?

The IOs will block indefinitely until the cluster is available.  Once 
the OSDs are started the VM will become responsive again.

sage

>  
> thanks
>  
> [root@cephosd2-monb ~]# ceph -v
> ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
>  
>  [root@cephosd2-monb ~]# ceph -s
>     cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
>  health HEALTH_WARN mds 0 is laggy
>  monmap e13: 3 mons 
> at{cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephos
> d3-monc=10.154.249.5:6789/0}, election epoch 154, quorum 0,1,2
> cephosd1-mona,cephosd2-monb,cephosd3-monc
>  mdsmap e21: 1/1/1 up {0=0=up:active(laggy or crashed)}
>  osdmap e196: 5 osds: 5 up, 5 in
>   pgmap v21836: 512 pgs, 5 pools, 3115 MB data, 805 objects
>     9623 MB used, 92721 MB / 102344 MB avail
>  512 active+clean
> 
> 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster inconsistency keyvaluestore

2014-09-08 Thread Sage Weil
On Sun, 7 Sep 2014, Haomai Wang wrote:
> I have found the root cause. It's a bug.
> 
> When a chunky scrub happens, it will iterate over the whole pg's objects, and
> in each iteration only a few objects will be scanned.
> 
> osd/PG.cc:3758
> ret = get_pgbackend()-> objects_list_partial(
>   start,
>   cct->_conf->osd_scrub_chunk_min,
>   cct->_conf->osd_scrub_chunk_max,
>   0,
>   &objects,
>   &candidate_end);
> 
> candidate_end is the end of the object set and it's used to indicate the
> next scrub process's start position. But it will be truncated:
> 
> osd/PG.cc:3777
> while (!boundary_found && objects.size() > 1) {
>   hobject_t end = objects.back().get_boundary();
>   objects.pop_back();
> 
>   if (objects.back().get_filestore_key() !=
> end.get_filestore_key()) {
> candidate_end = end;
> boundary_found = true;
>   }
> }
> end, which is an hobject_t containing only the "hash" field, will be assigned
> to candidate_end.  So in the next scrub pass an hobject_t that only contains
> the "hash" field will be passed in to get_pgbackend()->
> objects_list_partial.
> 
> It will cause incorrect results for the KeyValueStore backend, because it
> uses strict key ordering for the "collection_list_partial" method. An
> hobject_t that only contains the "hash" field will be:
> 
> 1%e79s0_head!972F1B5D!!none!!!!0!0
> 
> and the actual object is
> 1%e79s0_head!972F1B5D!!1!!!object-name!head
> 
> In other words, an object that only contains the "hash" field can't be used
> to search for an actual object that has the same "hash" field.

You mean the problem is that the sort order is wrong and the hash-only 
hobject_t key doesn't sort before the other objects, right?

> @sage The simple way is to modify the obj->key function, which will change
> the storage format. Because it's an experimental backend I would like to
> provide an external format-change program to help users do it. Is it
> OK?

Yeah, I think it's okay to just go ahead and make an incompatible change.

If it is easy to do an upgrade converter, it might be worthwhile, but this 
is an experimental backend so you are certainly not required to.  :)

sage



> 
> 
> On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
>  wrote:
> > I also can reproduce it on a new slightly different set up (also EC on KV
> > and Cache) by running ceph pg scrub on a KV pg: this pg will then get the
> > 'inconsistent' status
> >
> >
> >
> > - Message from Kenneth Waegeman  -
> >Date: Mon, 01 Sep 2014 16:28:31 +0200
> >From: Kenneth Waegeman 
> > Subject: Re: ceph cluster inconsistency keyvaluestore
> >  To: Haomai Wang 
> >  Cc: ceph-users@lists.ceph.com
> >
> >
> >
> >> Hi,
> >>
> >>
> >> The cluster got installed with quattor, which uses ceph-deploy for
> >> installation of daemons, writes the config file and installs the crushmap.
> >> I have 3 hosts, each 12 disks, having a large KV partition (3.6T) for the
> >> ECdata pool and a small cache partition (50G) for the cache
> >>
> >> I manually did this:
> >>
> >> ceph osd pool create cache 1024 1024
> >> ceph osd pool set cache size 2
> >> ceph osd pool set cache min_size 1
> >> ceph osd erasure-code-profile set profile11 k=8 m=3
> >> ruleset-failure-domain=osd
> >> ceph osd pool create ecdata 128 128 erasure profile11
> >> ceph osd tier add ecdata cache
> >> ceph osd tier cache-mode cache writeback
> >> ceph osd tier set-overlay ecdata cache
> >> ceph osd pool set cache hit_set_type bloom
> >> ceph osd pool set cache hit_set_count 1
> >> ceph osd pool set cache hit_set_period 3600
> >> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
> >>
> >> (But the previous time I had the problem already without the cache part)
> >>
> >>
> >>
> >> Cluster live since 2014-08-29 15:34:16
> >>
> >> Config file on host ceph001:
> >>
> >> [global]
> >> auth_client_required = cephx
> >> auth_cluster_required = cephx
> >> auth_service_required = cephx
> >> cluster_network = 10.143.8.0/24
> >> filestore_xattr_use_omap = 1
> >> fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
> >> mon_cluster_log_to_syslog = 1
> >> mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
> >> mon_initial_members = ceph001, ceph002, ceph003
> >> osd_crush_update_on_start = 0
> >> osd_journal_size = 10240
> >> osd_pool_default_min_size = 2
> >> osd_pool_default_pg_num = 512
> >> osd_pool_default_pgp_num = 512
> >> osd_pool_default_size = 3
> >> public_network = 10.141.8.0/24
> >>
> >> [osd.11]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.13]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.15]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.17]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.19]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.21]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.23]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >> [osd.25]
> >> osd_objectstore = keyvaluestore-dev
> >>
> >

Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85

2014-09-08 Thread 廖建锋
There is nothing about this on ceph.com.


From: Jason King
Sent: 2014-09-09 11:19
To: 廖建锋
Cc: ceph-users; 
ceph-users
Subject: Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85
Check the docs.

2014-09-09 11:02 GMT+08:00 廖建锋 <de...@f-club.cn>:
Looks like it doesn't work. I noticed that 0.85 added a superblock to the leveldb
OSD; the OSDs which I already have do not have a superblock.
Can anybody tell me how to upgrade the OSDs?



From: ceph-users
Sent: 2014-09-09 10:32
To: ceph-users
Subject: [ceph-users] mix ceph version with 0.80.5 and 0.85
Dear all,
 As there are a lot of bugs in the keyvalue backend of the 0.80.5 firefly version,
I want to upgrade to 0.85 for some OSDs which are already down and unable to
start, and keep the other OSDs on 0.80.5. I am wondering, will it work?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread JR
Greetings

After running for a couple of hours, my attempt to re-balance a near full
disk has stopped with a stuck unclean error:

root@osd45:~# ceph -s
  cluster c8122868-27af-11e4-b570-52540004010f
   health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery
13086/1158268 degraded (1.130%)
   monmap e1: 3 mons at
{osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
election epoch 80, quorum 0,1,2 osd42,osd43,osd45
   osdmap e723: 8 osds: 8 up, 8 in
pgmap v543113: 640 pgs: 634 active+clean, 6
active+remapped+backfilling;  GB data, 2239 GB used, 1295 GB / 3535
GB avail; 8268B/s wr, 0op/s; 13086/1158268 degraded (1.130%)
   mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby


The sequence of events today that led to this were:

# starting state: pg_num/pgp_num == 64
ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128
# there was a warning thrown up (which I've lost) and which left pgp_num
== 64
# nothing happens since pgp_num was inadvertently not raised
ceph osd reweight-by-utilization
# data moves from one osd on a host to another osd on same host
ceph osd reweight  7 1
# data moves back to roughly what it had been
ceph osd pool set volumes pg_num 192
ceph osd pool set volumes pgp_num 192
# data moves successfully
ceph osd pool set rbd pg_num 192
ceph osd pool set rbd pgp_num 192
# data stuck
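
Some read-only commands that show what the stuck PGs are waiting on (a
sketch; the pgid comes from the health output):

  ceph health detail          # lists the stuck/backfilling PGs by id
  ceph pg dump_stuck unclean  # the same, in a more compact form
  ceph pg <pgid> query        # shows why one particular PG is stuck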

googling (nowadays known as research) reveals that these might be helpful:

- ceph osd crush tunables optimal
- setting crush weights to 1

I resist doing anything for now in the hopes that someone has something
coherent to say (Christian? ;-)

Thanks
JR


On 9/8/2014 10:37 PM, JR wrote:
> Hi Christian,
> 
> Ha ...
> 
> root@osd45:~# ceph osd pool get rbd pg_num
> pg_num: 128
> root@osd45:~# ceph osd pool get rbd pgp_num
> pgp_num: 64
> 
> That's the explanation!  I did run the command but it spit out some
> (what I thought was a harmless) warning; should have checked more carefully.
> 
> I now have the expected data movement.
> 
> Thanks alot!
> JR
> 
> On 9/8/2014 10:04 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
>>
>>> Hi Christian, all,
>>>
>>> Having researched this a bit more, it seemed that just doing
>>>
>>> ceph osd pool set rbd pg_num 128
>>> ceph osd pool set rbd pgp_num 128
>>>
>>> might be the answer.  Alas, it was not. After running the above the
>>> cluster just sat there.
>>>
>> Really now? No data movement, no health warnings during that in the logs,
>> no other error in the logs or when issuing that command?
>> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
>>
>> You really want to get this addressed as per the previous reply before
>> doing anything further. Because with just 64 PGs (as in only 8 per OSD!)
>> massive imbalances are a given.
>>
>>> Finally, reading some more, I ran:
>>>
>>>  ceph osd reweight-by-utilization
>>>
>> Reading can be dangerous. ^o^
>>
>> I didn't mention this, as it never worked for me in any predictable way
>> and with a desirable outcome, especially in situations like yours.
>>
>>> This accomplished moving the utilization of the first drive on the
>>> affected node to the 2nd drive! .e.g.:
>>>
>>> ---
>>> BEFORE RUNNING:
>>> ---
>>> Filesystem Use%
>>> /dev/sdc1 57%
>>> /dev/sdb1 65%
>>> Filesystem Use%
>>> /dev/sdc1 90%
>>> /dev/sdb1 75%
>>> Filesystem Use%
>>> /dev/sdb1 52%
>>> /dev/sdc1 52%
>>> Filesystem Use%
>>> /dev/sdc1 54%
>>> /dev/sdb1 63%
>>>
>>> ---
>>> AFTER RUNNING:
>>> ---
>>> Filesystem Use%
>>> /dev/sdc1 57%
>>> /dev/sdb1 65%
>>> Filesystem Use%
>>> /dev/sdc1 70%  ** these two swapped (roughly) **
>>> /dev/sdb1 92%  ** ^ ^^^ ^^^   **
>>> Filesystem Use%
>>> /dev/sdb1 52%
>>> /dev/sdc1 52%
>>> Filesystem Use%
>>> /dev/sdc1 54%
>>> /dev/sdb1 63%
>>>
>>> root@osd45:~# ceph osd tree
>>> # idweight  type name   up/down reweight
>>> -1  3.44root default
>>> -2  0.86host osd45
>>> 0   0.43osd.0   up  1
>>> 4   0.43osd.4   up  1
>>> -3  0.86host osd42
>>> 1   0.43osd.1   up  1
>>> 5   0.43osd.5   up  1
>>> -4  0.86host osd44
>>> 2   0.43osd.2   up  1
>>> 6   0.43osd.6   up  1
>>> -5  0.86host osd43
>>> 3   0.43osd.3   up  1
>>> 7   0.43osd.7   up  0.7007
>>>
>>> So this isn't the answer either.
>>>
>> It might have been, if it had more PGs to distribute things along, see
>> above. But even then with the default dumpling tunables it might not be
>> much better.
>>
>>> Could someone please chime in with an explanation/suggestion?
>>>
>>> I suspect that might make sense to use 'ceph osd reweight 

Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-08 Thread Christian Balzer

Hello,

On Tue, 09 Sep 2014 01:25:17 -0400 JR wrote:

> Greetings
> 
> After running for a couple of hours, my attempt to re-balance a near ful
> disk has stopped with a stuck unclean error:
> 
Which is exactly what I warned you about below and what you should have
also taken away from fully reading the "Uneven OSD usage" thread.

This also should hammer my previous point about your current cluster
size/utilization home. Even with a better (don't expect perfect) data
distribution, loss of one node might well find you with a full OSD again. 
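
Rough arithmetic from the ceph -s output quoted below (assuming raw usage
stays at about 2239 GB once the lost node's data has been re-replicated):
4 nodes share 3535 GB raw, so roughly 884 GB per node; losing one leaves about
2650 GB for the same 2239 GB, i.e. around 84% average utilization, and with any
imbalance some OSDs will cross the usual 85% near-full and quite possibly the
95% full thresholds.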

> root@osd45:~# ceph -s
>   cluster c8122868-27af-11e4-b570-52540004010f
>health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery
> 13086/1158268 degraded (1.130%)
>monmap e1: 3 mons at
> {osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
> election epoch 80, quorum 0,1,2 osd42,osd43,osd45
>osdmap e723: 8 osds: 8 up, 8 in
> pgmap v543113: 640 pgs: 634 active+clean, 6
> active+remapped+backfilling;  GB data, 2239 GB used, 1295 GB / 3535
> GB avail; 8268B/s wr, 0op/s; 13086/1158268 degraded (1.130%)
>mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby
> 
From what I've read in the past the way forward here is to increase the
full ratio setting so it can finish the recovery. 
Or add more OSDs, at least temporarily. See:
http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity

Read that and apply that knowledge to your cluster, I personally wouldn't
deploy it in this state.
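
Raising the full ratio in practice boils down to something like this (a
sketch; pick your own values, and remember to lower them again once the
backfill is done):

  ceph pg set_full_ratio 0.97
  ceph pg set_nearfull_ratio 0.90
  # or the corresponding ceph.conf options on the monitors:
  #   mon osd full ratio = .97
  #   mon osd nearfull ratio = .90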

Once the recovery is finished I'd proceed cautiously, see below.

> 
> The sequence of events today that led to this were:
> 
> # starting state: pg_num/pgp_num == 64
> ceph osd pool set rbd pg_num 128
> ceph osd pool set rbd pgp_num 128
> # there was a warning thrown up (which I've lost) and which left pgg_num
> == 64
> # nothing happens since pgp_num was inadvertently not raised
> ceph osd reweight-by-utilization
> # data moves from one osd on a host to another osd on same host
> ceph osd reweight  7 1
> # data moves back to roughly what it had been
Never mind the lack of PGs to play with, manually lowering the weight
of the fullest OSD (in small steps) at this time might have given you at
least a more level playing field.
 
> ceph osd pool set volumes pg_num 192
> ceph osd pool set volumes pgp_num 192
> # data moves successfully
This would have been the time to check what actually happened and if
things improved or not (just adding PGs/PGPs might not be enough) and
again to manually reweight overly full OSDs.

> ceph osd pool set rbd pg_num 192
> ceph osd pool set rbd pgp_num 192
> # data stuck
> 
Baby steps. As in, applying the rise to 128 PGPs first. 
But I guess you would have run into the full OSD either way w/o
reweighting things between steps.

> googling (nowadays known as research) reveals that these might be
> helpful:
> 
> - ceph osd crush tunables optimal
Yes, this might help. 
Not sure if that works with dumpling, but as I already mentioned dumpling
doesn't support "chooseleaf_vary_r". And hashpspool.
And while the data movement caused by this probably will result in a
better balanced cluster (again, with too little PGs it will still do
poorly), in the process of getting there it might still run into a full
OSD scenario.

> - setting crush weights to 1
> 
Dunno about that one; my crush weights were 1 when I deployed things
manually for the first time, and the size of the OSD for the 2nd manual
deployment; ceph-deploy also uses the OSD size in TB. 

Christian

> I resist doing anything for now in the hopes that someone has something
> coherent to say (Christian? ;-)
> 
> Thanks
> JR
> 
> 
> On 9/8/2014 10:37 PM, JR wrote:
> > Hi Christian,
> > 
> > Ha ...
> > 
> > root@osd45:~# ceph osd pool get rbd pg_num
> > pg_num: 128
> > root@osd45:~# ceph osd pool get rbd pgp_num
> > pgp_num: 64
> > 
> > That's the explanation!  I did run the command but it spit out some
> > (what I thought was a harmless) warning; should have checked more
> > carefully.
> > 
> > I now have the expected data movement.
> > 
> > Thanks alot!
> > JR
> > 
> > On 9/8/2014 10:04 PM, Christian Balzer wrote:
> >>
> >> Hello,
> >>
> >> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
> >>
> >>> Hi Christian, all,
> >>>
> >>> Having researched this a bit more, it seemed that just doing
> >>>
> >>> ceph osd pool set rbd pg_num 128
> >>> ceph osd pool set rbd pgp_num 128
> >>>
> >>> might be the answer.  Alas, it was not. After running the above the
> >>> cluster just sat there.
> >>>
> >> Really now? No data movement, no health warnings during that in the
> >> logs, no other error in the logs or when issuing that command?
> >> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
> >>
> >> You really want to get this addressed as per the previous reply before
> >> doing anything further. Because with just 64 PGs (as in only 8 per
> >> OSD!) massive imbalances are a given.
> >>
> >>> Finally, reading some more, I ran:
> >>>

[ceph-users] heterogeneous set of storage disks as a single storage

2014-09-08 Thread pragya jain
Hi all!

I have a very low-level query. Please help to clarify it.

To store data on a storage cluster, at the bottom there is a heterogeneous set 
of storage disks, which can include a variety of devices such as 
SSDs, HDDs, flash drives, tapes and any other type. The documentation says that 
the provider views this heterogeneous set of storage disks as a single pool. My 
questions are:

#1. How does the provider arrive at this abstraction layer, viewing the different 
storage disks as a single storage pool?
#2. Is it a part of storage virtualization?
#3. Is there any set of APIs to interact with this heterogeneous set of 
storage disks?

Regards
Pragya Jain
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com