[ceph-users] Reply: Re: RBD read-ahead didn't improve 4K read performance

2014-11-21 Thread duan . xufeng
Hi,

I tested in the VM with fio; here is the config:

[global]
direct=1
ioengine=libaio
iodepth=1

[sequence read 4K]
rw=read
bs=4K
size=1024m
directory=/mnt
filename=test


sequence read 4K: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.3
Starting 1 process
sequence read 4K: Laying out IO file(s) (1 file(s) / 1024MB)
^CJobs: 1 (f=1): [R] [18.0% done] [1994KB/0KB/0KB /s] [498/0/0 iops] [eta 07m:14s]
fio: terminating on signal 2

sequence read 4K: (groupid=0, jobs=1): err= 0: pid=1156: Fri Nov 21 12:32:53 2014
  read : io=187408KB, bw=1984.1KB/s, iops=496, runt= 94417msec
slat (usec): min=22, max=878, avg=48.36, stdev=22.63
clat (usec): min=1335, max=17618, avg=1956.45, stdev=247.26
 lat (usec): min=1371, max=17680, avg=2006.97, stdev=248.47
clat percentiles (usec):
 |  1.00th=[ 1560],  5.00th=[ 1640], 10.00th=[ 1704], 20.00th=[ 1784],
 | 30.00th=[ 1848], 40.00th=[ 1896], 50.00th=[ 1944], 60.00th=[ 1992],
 | 70.00th=[ 2064], 80.00th=[ 2128], 90.00th=[ 2192], 95.00th=[ 2288],
 | 99.00th=[ 2448], 99.50th=[ 2640], 99.90th=[ 3856], 99.95th=[ 4256],
 | 99.99th=[ 9408]
bw (KB  /s): min= 1772, max= 2248, per=100.00%, avg=1986.55, stdev=85.76
lat (msec) : 2=60.69%, 4=39.23%, 10=0.07%, 20=0.01%
  cpu  : usr=1.92%, sys=2.98%, ctx=47125, majf=0, minf=28
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=46852/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=187408KB, aggrb=1984KB/s, minb=1984KB/s, maxb=1984KB/s, mint=94417msec, maxt=94417msec

Disk stats (read/write):
  sda: ios=46754/11, merge=0/10, ticks=91144/40, in_queue=91124, util=96.73%




the rados benchmark:

# rados -p volumes bench 60 seq -b 4096 -t 1
Total time run:        44.922178
Total reads made:      24507
Read size:             4096
Bandwidth (MB/sec):    2.131

Average Latency:   0.00183069
Max latency:   0.004598
Min latency:   0.001224
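
The numbers line up: at iodepth=1 every 4K read costs roughly 2 ms, which caps
things at about 500 IOPS regardless of readahead. A quick way to confirm this is
latency-bound (a hedged example, not from the original run; adjust paths to the
VM) is to rerun the same job with a deeper queue and see whether IOPS scale:

fio --name=seqread4k-qd32 --directory=/mnt --filename=test --size=1024m \
    --rw=read --bs=4k --direct=1 --ioengine=libaio --iodepth=32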







Re: [ceph-users] RBD read-ahead didn't improve 4K read performance

From: Alexandre DERUMIER
To: duan xufeng
Cc: si dawei, ceph-users
Date: 2014/11/21 14:51


Hi, 

I haven't tested rbd readahead yet, but maybe you are hitting a qemu limit
(by default qemu can use only 1 thread/1 core to manage I/Os; check your qemu
CPU usage).

Do you have some performance results? How many iops?

I have had a 4x improvement in qemu-kvm with virtio-scsi + num_queues +
recent kernels (4k sequential reads were coalesced in qemu, so it was doing
bigger iops to ceph).
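
(A hedged way to check for that qemu limit, assuming the guest's qemu process
is visible on the hypervisor; the process name is a placeholder and may be
qemu-kvm on some distros:)

# Per-thread CPU of the qemu process backing the VM; if one thread is pinned
# near 100% while the guest runs the 4K read test, you are hitting this limit.
top -H -p "$(pgrep -f qemu-system-x86_64 | head -n 1)"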

libvirt : 


Regards,

Alexandre
- Original Message -

From: "duan xufeng"
To: "ceph-users"
Cc: "si dawei"
Sent: Friday, 21 November 2014 03:58:38
Subject: [ceph-users] RBD read-ahead didn't improve 4K read performance


hi, 

I upgraded Ceph to 0.87 for rbd readahead, but I can't see any performance
improvement in 4K sequential reads in the VM.
How can I tell whether the readahead is taking effect?

thanks. 

ceph.conf 
[client] 
rbd_cache = true 
rbd_cache_size = 335544320 
rbd_cache_max_dirty = 251658240 
rbd_cache_target_dirty = 167772160 

rbd readahead trigger requests = 1 
rbd readahead max bytes = 4194304 
rbd readahead disable after bytes = 0 
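
(One hedged way to verify librbd is actually picking these settings up is to
enable a client admin socket and query the running process; the socket path
below is illustrative only:)

# in ceph.conf [client]:
#   admin socket = /var/run/ceph/$cluster-$type.$id.$pid.asok
# then, on the hypervisor:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config show | grep readahead
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok perf dump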


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph inconsistency after deep-scrub

2014-11-21 Thread Paweł Sadowski
Hi,

During deep-scrub Ceph discovered some inconsistency between OSDs on my
cluster (size 3, min size 2). I have fund broken object and calculated
md5sum of it on each OSD (osd.195 is acting_primary):
 osd.195 - md5sum_
 osd.40 - md5sum_
 osd.314 - md5sum_

I ran ceph pg repair and Ceph reported that everything went OK. I checked the
md5sums of the objects again:
 osd.195 - md5sum_
 osd.40 - md5sum_
 osd.314 - md5sum_

This is a bit odd. How does Ceph decide which copy is the correct one? Based
on the last modification time/sequence number (or similar)? If yes, then why
can the correct object end up on only one node? If not, then why did Ceph
select osd.314 as the correct one? What would happen if osd.314 went down?
Would ceph return wrong (old?) data, even with three copies and no failure in
the cluster?

For now I'm unable to reproduce this on my test cluster. I'll post here
if I reproduce this.

Thanks for any help,
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-announce list

2014-11-21 Thread JuanFra Rodriguez Cardoso
Hi all:

As was asked a few weeks ago: what channel does the ceph community use to
stay up to date on new features and bug fixes?

Thanks!

Best,
---
JuanFra Rodriguez Cardoso
es.linkedin.com/in/jfrcardoso/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rest-bench ERROR: failed to create bucket: XmlParseFailure

2014-11-21 Thread Frank Li
Hi,

Can anyone help me resolve the following error? Thanks a lot.

rest-bench --api-host=172.20.10.106 --bucket=test
--access-key=BXXX --secret=z
--protocol=http --uri_style=path --concurrent-ios=3 --block-size=4096 write

host=172.20.10.106

ERROR: failed to create bucket: XmlParseFailure

failed initializing benchmark
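
(A hedged first check, since XmlParseFailure from libs3 usually just means the
endpoint returned something other than S3 XML, e.g. the default Apache page
instead of radosgw; the host is taken from the command above:)

curl -si http://172.20.10.106/ | head -n 20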
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD in uninterruptible sleep

2014-11-21 Thread Jon Kåre Hellan
We are testing a Giant cluster - on virtual machines for now. We have seen
the same problem two nights in a row: one of the OSDs gets stuck in
uninterruptible sleep. The only way to get rid of it is apparently to reboot -
kill -9, -11 and -15 have all been tried.

The monitor apparently believes it is gone, because every 30 minutes we see
in the log:
  lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd
  still running? (11) Resource temporarily unavailable
We interpret this as an attempt to start a new instance.

There is a pastebin of the osd log from the night before last in: 
http://pastebin.com/Y42GvGjr

Pastebin of syslog from last evening: http://pastebin.com/7riNWRsy
The pid of the stuck OSD is 4222. syslog has call traces of pids 4405, 4406,
4435 and 4436, which have been blocked for > 120 s.

What can we do to get to the bottom of this?

Context: This is a test cluster to evaluate Ceph. There are 3 monitor vms,
3 OSD vms each running 2 OSDs, 1 MDS vm and 1 radosgw vm. The vms are running
Debian Wheezy under Hyper-V. OSD storage is xfs on virtual disks. The test
load was a linux kernel compilation with the tree in cephfs. Silly, I know,
but we needed a test load. We do not intend to use cephfs in production.
Obviously, we would use physical OSD nodes if we were to decide to deploy
ceph in production.

Jon
Jon Kåre Hellan, UNINETT AS, Trondheim, Norway

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Cache Considered Harmful? (on all-SSD pools, at least)

2014-11-21 Thread Florian Haas
Hi everyone,

been trying to get to the bottom of this for a few days; thought I'd
take this to the list to see if someone had insight to share.

Situation: Ceph 0.87 (Giant) cluster with approx. 250 OSDs. One set of
OSD nodes with just spinners put into one CRUSH ruleset assigned to a
"spinner" pool, another set of OSD nodes with just SSDs put into another
ruleset, assigned to an "ssd" pool. Both pools use size 3. In the
default rados bench write (16 threads, 4MB object size), the spinner
pool gets about 500 MB/s throughput, the ssd pool gets about 850. All
relatively normal and what one would expect:

$ sudo rados -p spinner-test bench 30 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
[...]
Total time run: 30.544917
Total writes made:  3858
Write size: 4194304
Bandwidth (MB/sec): 505.223
[...]
Average Latency:0.126193

$ sudo rados -p ssd-test bench 30 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
[...]
Total time run: 30.046918
Total writes made:  6410
Write size: 4194304
Bandwidth (MB/sec): 853.332
[...]
Average Latency:0.0749883

So we see a bandwidth increase and a latency drop as we go from spinners
to SSDs (note: 80ms latency still isn't exactly great, but that's a
different discussion to have).

Now I'm trying to duplicate the rados bench results with rbd
bench-write. My assumption would be (and generally this assumption holds
true, in my experience) that when duplicating the rados bench parameters
with rbd bench-write, results should be *roughly* equivalent without RBD
caching, and slightly better with caching.

So here is the spinner pool, no caching:

$ sudo rbd -p spinner-test \
  --rbd_cache=false \
  --rbd_cache_writethrough_until_flush=false \
  bench-write rbdtest \
  --io-threads 16 \
  --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    114    112.88   473443678.94
2   156 77.91  326786116.90
3   197 65.35  274116599.05
4   240 59.57  249866261.64
elapsed: 4  ops:  256  ops/sec:55.83  bytes/sec: 234159074.98

Throughput dropped from 500 MB/s (rados bench) to less than half of that
(rbd bench-write).

With caching (all cache related settings at their defaults, unless
overridden with --rbd_* args):

$ sudo rbd -p spinner-test \
  --rbd_cache=true \
  --rbd_cache_writethrough_until_flush=false \
  bench-write rbdtest \
  --io-threads 16 \
  --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    126    110.44   463201062.29
    2    232    108.33   454353540.71
elapsed: 2  ops:  256  ops/sec:   105.97  bytes/sec: 62860.84

So somewhat closer to what rados bench can do, but not nearly where
you'd expect to be.

And then for the ssd pool, things get weird. Here's rbd bench-write with
no caching:

$ sudo rbd -p ssd-test \
  --rbd_cache=false \
  --rbd_cache_writethrough_until_flush=false \
  bench-write rbdtest \
  --io-threads 16 \
  --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    208    193.64   812202592.78
elapsed: 1  ops:  256  ops/sec:   202.14  bytes/sec: 847828574.27

850MB/s, which is what rados bench reports too. No overhead at all? That
would be nice. Let's write 4GB instead of 1GB:

$ sudo rbd -p ssd-test \
  --rbd_cache=false \
  --rbd_cache_writethrough_until_flush=false \
  bench-write rbdtest \
  --io-threads 16 \
  --io-size $((4<<20)) \
  --io-total $((4<<30))
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    208    197.41   827983956.90
    2    416    207.91   872038511.36
    3    640    211.52   887162647.59
    4    864    213.98   897482175.07
elapsed: 4  ops: 1024  ops/sec:   216.39  bytes/sec: 907597866.21

Well, that's kinda nice, except it seems illogical that RBD would be
faster than RADOS, without caching. Let's turn caching on:

$ sudo rbd -p ssd-test \
  --rbd_cache=true \
  --rbd_cache_writethrough_until_flush=false \
  bench-write rbdtest \
  --io-threads 16 \
  --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    152    141.46   593324418.90
elapsed: 1  ops:  256  ops/sec:   148.64  bytes/sec: 623450766.90

Oddly, we've dropped back to 620 MB/s. Try the 4GB total write for good
measure:

$ sudo rbd -p ssd-test \
  --rbd_cache=true \
  --rbd_cache_writethrough_until_flush=false \
  bench-write rbdtest \
  --io-threads 16 \
  --io-size $((4<<20)) \
  --io-total $((4<<30))
bench-write  io_size 4194304 io_threads 16 bytes 4294967296 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    150    138.46   580729593.09
    2    302    145.23   609132960.16
   

Re: [ceph-users] RBD Cache Considered Harmful? (on all-SSD pools, at least)

2014-11-21 Thread Mark Nelson

On 11/21/2014 08:14 AM, Florian Haas wrote:

[...]

[ceph-users] Calamari install issues

2014-11-21 Thread Shain Miley

Hello all,

I followed the setup steps provided here:

http://karan-mj.blogspot.com/2014/09/ceph-calamari-survival-guide.html

I was able to build and install everything correctly as far as I can 
tell...however I am still not able to get the server to see the cluster.


I am getting the following errors after I log into the web gui:

4 Ceph servers are connected to Calamari, but no Ceph cluster has been 
created yet.



The ceph nodes have salt installed and are being managed by the salt-master:

root@calamari:/home/# salt-run manage.up
hqceph1.npr.org
hqceph2.npr.org
hqceph3.npr.org
hqosd1.npr.org

However something still seems to be missing:

root@calamari:/home/#  salt '*' test.ping; salt '*' ceph.get_heartbeats
hqceph1.npr.org:
True
hqceph2.npr.org:
True
hqosd1.npr.org:
True
hqceph3.npr.org:
True
hqceph1.npr.org:
'ceph.get_heartbeats' is not available.
hqceph3.npr.org:
'ceph.get_heartbeats' is not available.
hqceph2.npr.org:
'ceph.get_heartbeats' is not available.
hqosd1.npr.org:
'ceph.get_heartbeats' is not available.
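
(A hedged sketch of what usually clears the "not available" errors: the
Calamari-managed states and custom salt modules have not been pushed to the
minions yet. Run on the Calamari/salt master:)

salt '*' state.highstate        # push the calamari-managed states to the minions
salt '*' saltutil.sync_modules  # sync custom execution modules (ceph.*)
salt '*' ceph.get_heartbeats    # should now return data instead of "not available"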


Any help trying to move forward would be great!

Thanks in advance,

Shain


--
_NPR | Shain Miley| Manager of Systems and Infrastructure, Digital Media 
| smi...@npr.org | p: 202-513-3649
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg's degraded

2014-11-21 Thread JIten Shah
Thanks Michael. That was a good idea.

I did:

1. sudo service ceph stop mds

2. ceph mds newfs 1 0 --yes-i-really-mean-it (where 1 and 0 are the pool IDs for metadata and data)

3. ceph health (It was healthy now!!!)

4. sudo service ceph start mds.$(hostname -s)

And I am back in business.

Thanks again.

—Jiten



On Nov 20, 2014, at 5:47 PM, Michael Kuriger  wrote:

> Maybe delete the pool and start over?
>  
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> JIten Shah
> Sent: Thursday, November 20, 2014 5:46 PM
> To: Craig Lewis
> Cc: ceph-users
> Subject: Re: [ceph-users] pg's degraded
>  
> Hi Craig,
>  
> Recreating the missing PG’s fixed it.  Thanks for your help.
>  
> But when I tried to mount the Filesystem, it gave me the “mount error 5”. I 
> tried to restart the MDS server but it won’t work. It tells me that it’s 
> laggy/unresponsive.
>  
> BTW, all these machines are VM’s.
>  
> [jshah@Lab-cephmon001 ~]$ ceph health detail
> HEALTH_WARN mds cluster is degraded; mds Lab-cephmon001 is laggy
> mds cluster is degraded
> mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 rank 0 is replaying journal
> mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 is laggy/unresponsive
>  
>  
> —Jiten
>  
> On Nov 20, 2014, at 4:20 PM, JIten Shah  wrote:
> 
> 
> Ok. Thanks.
>  
> —Jiten
>  
> On Nov 20, 2014, at 2:14 PM, Craig Lewis  wrote:
> 
> 
> If there's no data to lose, tell Ceph to re-create all the missing PGs.
>  
> ceph pg force_create_pg 2.33
>  
> Repeat for each of the missing PGs.  If that doesn't do anything, you might 
> need to tell Ceph that you lost the OSDs.  For each OSD you moved, run ceph 
> osd lost <osd-id>, then try the force_create_pg command again.
>  
> If that doesn't work, you can keep fighting with it, but it'll be faster to 
> rebuild the cluster.
>  
>  
>  
> On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah  wrote:
> Thanks for your help.
>  
> I was using puppet to install the OSD’s where it chooses a path over a device 
> name. Hence it created the OSD in the path within the root volume since the 
> path specified was incorrect.
>  
> And all 3 of the OSD’s were rebuilt at the same time because it was unused 
> and we had not put any data in there.
>  
> Any way to recover from this or should i rebuild the cluster altogether.
>  
> —Jiten
>  
> On Nov 20, 2014, at 1:40 PM, Craig Lewis  wrote:
> 
> 
> So you have your crushmap set to choose osd instead of choose host?
>  
> Did you wait for the cluster to recover between each OSD rebuild?  If you 
> rebuilt all 3 OSDs at the same time (or without waiting for a complete 
> recovery between them), that would cause this problem.
>  
>  
>  
> On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah  wrote:
> Yes, it was a healthy cluster and I had to rebuild because the OSD’s got 
> accidentally created on the root disk. Out of 4 OSD’s I had to rebuild 3 of 
> them.
>  
>  
> [jshah@Lab-cephmon001 ~]$ ceph osd tree
> # id weight type name up/down reweight
> -1 0.5 root default
> -2 0.0 host Lab-cephosd005
> 4 0.0 osd.4 up 1
> -3 0.0 host Lab-cephosd001
> 0 0.0 osd.0 up 1
> -4 0.0 host Lab-cephosd002
> 1 0.0 osd.1 up 1
> -5 0.0 host Lab-cephosd003
> 2 0.0 osd.2 up 1
> -6 0.0 host Lab-cephosd004
> 3 0.0 osd.3 up 1
>  
>  
> [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
> Error ENOENT: i don't have pgid 2.33
>  
> —Jiten
>  
>  
> On Nov 20, 2014, at 11:18 AM, Craig Lewis  wrote:
> 
> 
> Just to be clear, this is from a cluster that was healthy, had a disk 
> replaced, and hasn't returned to healthy?  It's not a new cluster that has 
> never been healthy, right?
>  
> Assuming it's an existing cluster, how many OSDs did you replace?  It almost 
> looks like you replaced multiple OSDs at the same time, and lost data because 
> of it.
>  
> Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?
>  
>  
> On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah  wrote:
> After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded mode. 
> Some are in the unclean state and others in the stale state. Somehow the MDS is 
> also degraded. How do I recover the OSD’s and the MDS back to healthy ? Read 
> through the documentation and on the web but no luck so far.
>  
> pg 2.33 is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 0.30 is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 1.31 is stuck unclean since forever, current state stale+active+degraded, 
> last acting [2]
> pg 2.32 is stuck unclean for 597129.903922, current state 
> stale+active+degraded, last acting [2]
> pg 0.2f is stuck unclean for 597129.903951, current state 
> stale+active+degraded, last acting [2]
> pg 1.2e is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 2.2d is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [2]

Re: [ceph-users] Calamari install issues

2014-11-21 Thread Michael Kuriger
I had to run "salt-call state.highstate" on my ceph nodes.
Also, if you're running giant you'll have to make a small change to get your
disk stats to show up correctly.


/opt/calamari/venv/lib/python2.6/site-packages/calamari_rest_api-0.1-py2.6.egg/calamari_rest/views/v1.py


$ diff v1.py v1.py.ori
105c105
< return kb
---
> return kb * 1024
111,113c111,113
< 'used_bytes': to_bytes(get_latest_graphite(df_path('total_used_bytes'))),
< 'capacity_bytes': to_bytes(get_latest_graphite(df_path('total_bytes'))),
< 'free_bytes': to_bytes(get_latest_graphite(df_path('total_avail_bytes')))
---
> 'used_bytes': to_bytes(get_latest_graphite(df_path('total_used'))),
> 'capacity_bytes': to_bytes(get_latest_graphite(df_path('total_space'))),
> 'free_bytes': to_bytes(get_latest_graphite(df_path('total_avail')))
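
(If you apply a change like the one above, the Calamari web stack needs a
restart to pick it up; a hedged sketch, assuming a stock install where the
REST API runs under Apache and the backend services under supervisord:)

sudo service apache2 restart
sudo service supervisor restart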



Michael Kuriger
mk7...@yp.com
818-649-7235
MikeKuriger (IM)

From: Shain Miley <smi...@npr.org>
Date: Friday, November 21, 2014 at 8:51 AM
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: [ceph-users] Calamari install issues

[...]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg's degraded

2014-11-21 Thread Michael Kuriger
I have started over from scratch a few times myself ;-)


Michael Kuriger
mk7...@yp.com
818-649-7235
MikeKuriger (IM)

From: JIten Shah <jshah2...@me.com>
Date: Friday, November 21, 2014 at 9:44 AM
To: Michael Kuriger <mk7...@yp.com>
Cc: Craig Lewis <cle...@centraldesktop.com>, ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] pg's degraded

[...]

Re: [ceph-users] Ceph inconsistency after deep-scrub

2014-11-21 Thread Gregory Farnum
On Fri, Nov 21, 2014 at 2:35 AM, Paweł Sadowski  wrote:
> Hi,
>
> During deep-scrub Ceph discovered some inconsistency between OSDs on my
> cluster (size 3, min size 2). I found the broken object and calculated
> md5sum of it on each OSD (osd.195 is acting_primary):
>  osd.195 - md5sum_
>  osd.40 - md5sum_
>  osd.314 - md5sum_
>
> I run ceph pg repair and Ceph successfully reported that everything went
> OK. I checked md5sum of the objects again:
>  osd.195 - md5sum_
>  osd.40 - md5sum_
>  osd.314 - md5sum_
>
> This is a bit odd. How Ceph decides which copy is the correct one? Based
> on last modification time/sequence number (or similar)? If yes, then why
> object can be stored on one node only? If not, then why Ceph selected
> osd.314 as a correct one? What would happen if osd.314 goes down? Will
> ceph return wrong (old?) data, even with three copies and no failure in
> the cluster?

Right now, Ceph recovers replicated PGs by pushing the primary's copy
to everybody. There are tickets to improve this, but for now it's best
if you handle this yourself by moving the right things into place, or
removing the primary's copy if it's incorrect before running the
repair command. This is why it doesn't do repair automatically.
-Greg
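
(For reference, a minimal sketch of the manual approach described above, for
FileStore OSDs; the pool, object, pgid and OSD id below are placeholders, not
values from this report:)

ceph osd map volumes rbd_data.1234.0000000000000001   # shows the pgid and acting set
service ceph stop osd.195                             # the OSD holding the bad copy
mkdir -p /root/bad-replica-backup
find /var/lib/ceph/osd/ceph-195/current/3.7f_head \
     -name '*1234.0000000000000001*' -exec mv {} /root/bad-replica-backup/ \;
service ceph start osd.195
ceph pg repair 3.7f                                   # repair then pushes a surviving good copy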
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD in uninterruptible sleep

2014-11-21 Thread Gregory Farnum
On Fri, Nov 21, 2014 at 4:56 AM, Jon Kåre Hellan
 wrote:
> We are testing a Giant cluster - on virtual machines for now. We have seen
> the same
> problem two nights in a row: One of the OSDs gets stuck in uninterruptible
> sleep.
> The only way to get rid of it is apparently to reboot - kill -9, -11 and -15
> have all
> been tried.
>
> The monitor apparently believes it is gone, because every 30 minutes we see
> in the log:
>   lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another
> ceph-osd still
>   running? (11) Resource temporarily unavailable
> We interpret this as an attempt to start a new instance.
>
> There is a pastebin of the osd log from the night before last in:
> http://pastebin.com/Y42GvGjr
> Pastebin of syslog from last evening: http://pastebin.com/7riNWRsy
> The pid of the stuck OSD is 4222. syslog has call traces of pids 4405, 4406,
> 4435, 4436,
> which have been blocked for > 120 s.
>
> What can we do to get to the bottom of this?

So, the OSD log you pasted includes a backtrace of an assert failure
from the internal heartbeating, indicating that some threads went off
and never came back (these are probably the threads making the
syscalls that syslog is reporting on). It asserted and the OSD should
be gone now since it triggers an unfriendly coredump and termination.
The only thing I can think of is that maybe the system calls
responsible for dumping the core out have *also* failed in a way we
haven't seen before and so nothing's terminated.
In any case, it's definitely related to your disks being too slow and
the OS not handling it appropriately; I'd look at why the kernel is
getting stuck. The specific backtraces there aren't familiar to me,
but maybe somebody else has seen them.
-Greg
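
(A hedged sketch of how to dig into the stuck threads from the kernel side;
pid 4222 is the one from the syslog above, the rest is generic:)

# threads in uninterruptible sleep (state D) and the kernel function they wait in
ps -eLo pid,lwp,state,wchan:32,cmd | awk '$3 == "D"'
# kernel stack of the stuck ceph-osd process
cat /proc/4222/stack
# ask the kernel to log all blocked tasks with backtraces (requires sysrq enabled)
echo w > /proc/sysrq-trigger; dmesg | tail -n 200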
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw agent only syncing metadata

2014-11-21 Thread Mark Kirkwood

On 21/11/14 16:05, Mark Kirkwood wrote:

On 21/11/14 15:52, Mark Kirkwood wrote:

On 21/11/14 14:49, Mark Kirkwood wrote:


The only things that look odd in the destination zone logs are 383
requests getting 404 rather than 200:

$ grep "http_status=404" ceph-client.radosgw.us-west-1.log
...
2014-11-21 13:48:58.435201 7ffc4bf7f700  1 == req done
req=0x7ffca002df00 http_status=404 ==
2014-11-21 13:49:05.891680 7ffc35752700  1 == req done
req=0x7ffca00301e0 http_status=404 ==
...




Adding in "debug rgw = 20" and redoing the setup again, I see what looks
to be an http 500 during the data sync:

2014-11-21 15:13:31.920510 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920525 7fb5e3f87700 10 received header:HTTP/1.1 411
Length Required
2014-11-21 15:13:31.920531 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920534 7fb5e3f87700 10 received header:Date: Fri, 21
Nov 2014 02:13:31 GMT
2014-11-21 15:13:31.920574 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920578 7fb5e3f87700 10 received header:Server:
Apache/2.4.7 (Ubuntu)
2014-11-21 15:13:31.920586 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920588 7fb5e3f87700 10 received
header:Content-Length: 238
2014-11-21 15:13:31.920593 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920594 7fb5e3f87700 10 received header:Connection:
close
2014-11-21 15:13:31.920597 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920599 7fb5e3f87700 10 received header:Content-Type:
text/html; charset=iso-8859-1
2014-11-21 15:13:31.920602 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920603 7fb5e3f87700 10 received header:
2014-11-21 15:13:31.934664 7fb5e3f87700  0 WARNING: set_req_state_err
err_no=5 resorting to 500
2014-11-21 15:13:31.934725 7fb5e3f87700  2 req 502:0.048719:s3:PUT
/bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta:copy_obj:http

status=500



I am wondering if I am running into http://tracker.ceph.com/issues/9206
? I'll see if changing to apache 2.2 changes anything.




Don't think so - seeing that same thing on Ubuntu 12.04 with apache 2.2.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph inconsistency after deep-scrub

2014-11-21 Thread Paweł Sadowski
On 21.11.2014 at 20:12, Gregory Farnum wrote:
> On Fri, Nov 21, 2014 at 2:35 AM, Paweł Sadowski  wrote:
>> Hi,
>>
>> During deep-scrub Ceph discovered some inconsistency between OSDs on my
>> cluster (size 3, min size 2). I found the broken object and calculated
>> md5sum of it on each OSD (osd.195 is acting_primary):
>>  osd.195 - md5sum_
>>  osd.40 - md5sum_
>>  osd.314 - md5sum_
>>
>> I run ceph pg repair and Ceph successfully reported that everything went
>> OK. I checked md5sum of the objects again:
>>  osd.195 - md5sum_
>>  osd.40 - md5sum_
>>  osd.314 - md5sum_
>>
>> This is a bit odd. How Ceph decides which copy is the correct one? Based
>> on last modification time/sequence number (or similar)? If yes, then why
>> object can be stored on one node only? If not, then why Ceph selected
>> osd.314 as a correct one? What would happen if osd.314 goes down? Will
>> ceph return wrong (old?) data, even with three copies and no failure in
>> the cluster?
> Right now, Ceph recovers replicated PGs by pushing the primary's copy
> to everybody. There are tickets to improve this, but for now it's best
> if you handle this yourself by moving the right things into place, or
> removing the primary's copy if it's incorrect before running the
> repair command. This is why it doesn't do repair automatically.
> -Greg
But in my case Ceph used a non-primary's copy to repair the data while the
two other OSDs had the same data (and one of them was the primary). That
should not happen.

Besides that, there should be a big red warning in the documentation[1]
regarding ceph pg repair.

1:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

Cheers,
PS
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Calamari install issues

2014-11-21 Thread Shain Miley

Michael,

Thanks for the info.

We are running ceph version 0.80.7 so I don't think the 2nd part applies 
here.


However when I run the salt command on the ceph nodes it fails:

root@hqceph1:~# salt-call state.highstate
[INFO] Loading fresh modules for state activity
local:
--
  ID: states
Function: no.None
  Result: False
 Comment: No Top file or external nodes data matches found
 Changes:

Summary

Succeeded: 0
Failed:1

Total: 1


Something must be missing...however I followed the writeup exactly...and 
I don't remember encountering any failures along the way :-/
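
(A hedged sketch of things to check for the "No Top file or external nodes
data matches found" error; it generally means the minion is not getting any
state files from the master it is pointed at:)

# on the ceph node: which master is the minion configured to talk to?
grep -R '^master:' /etc/salt/minion /etc/salt/minion.d/ 2>/dev/null
# on the calamari host: are the keys accepted, and does a highstate pushed
# from the master side work?
salt-key -L
salt 'hqceph1*' state.highstate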



Thanks again,

Shain

On 11/21/14, 1:15 PM, Michael Kuriger wrote:

[...]



--
_NPR | Shain Miley| Manager of Systems and Infrastructure, Digital Media 
| smi...@npr.org | p: 202-513-3649
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw agent only syncing metadata

2014-11-21 Thread Yehuda Sadeh
On Thu, Nov 20, 2014 at 6:52 PM, Mark Kirkwood
 wrote:
> On 21/11/14 14:49, Mark Kirkwood wrote:
>>
>>
>> The only things that look odd in the destination zone logs are 383
>> requests getting 404 rather than 200:
>>
>> $ grep "http_status=404" ceph-client.radosgw.us-west-1.log
>> ...
>> 2014-11-21 13:48:58.435201 7ffc4bf7f700  1 == req done
>> req=0x7ffca002df00 http_status=404 ==
>> 2014-11-21 13:49:05.891680 7ffc35752700  1 == req done
>> req=0x7ffca00301e0 http_status=404 ==
>> ...
>>
>>
>
> Adding in "debug rgw = 20" and redoing the setup again, I see what looks to
> be an http 500 during the data sync:
>
>
> 2014-11-21 15:13:31.886006 7fb5e3f87700  1 == starting new request
> req=0x7fb640032580 =
> 2014-11-21 15:13:31.886025 7fb5e3f87700  2 req 502:0.20::PUT
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta::initializing
> 2014-11-21 15:13:31.886034 7fb5e3f87700 10 host=ceph1 rgw_dns_name=ceph1
> 2014-11-21 15:13:31.886054 7fb5e3f87700 10 meta>> HTTP_X_AMZ_COPY_SOURCE
> 2014-11-21 15:13:31.886080 7fb5e3f87700 10 x>>
> x-amz-copy-source:bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
> 2014-11-21 15:13:31.886124 7fb5e3f87700 10
> s->object=_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
> s->bucket=bucketbig
> 2014-11-21 15:13:31.886166 7fb5e3f87700  2 req 502:0.000160:s3:PUT
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta::getting
> op
> 2014-11-21 15:13:31.886188 7fb5e3f87700  2 req 502:0.000182:s3:PUT
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta:copy_obj:authorizing
> 2014-11-21 15:13:31.886232 7fb5e3f87700 10 get_canon_resource():
> dest=/bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
> 2014-11-21 15:13:31.886239 7fb5e3f87700 10 auth_hdr:
> PUT
>
> application/json; charset=UTF-8
> Fri, 21 Nov 2014 02:13:31 GMT
> x-amz-copy-source:bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
> 2014-11-21 15:13:31.886271 7fb5e3f87700 15 calculated
> digest=A0wQIVW7UZ5j2GTDqxbHimNEN1o=
> 2014-11-21 15:13:31.886275 7fb5e3f87700 15
> auth_sign=A0wQIVW7UZ5j2GTDqxbHimNEN1o=
> 2014-11-21 15:13:31.886277 7fb5e3f87700 15 compare=0
> 2014-11-21 15:13:31.886281 7fb5e3f87700 20 system request
> 2014-11-21 15:13:31.886289 7fb5e3f87700  2 req 502:0.000283:s3:PUT
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta:copy_obj:reading
> permissions
> 2014-11-21 15:13:31.886325 7fb5e3f87700 20 get_obj_state:
> rctx=0x7fb5e3f861d0 obj=.us-west.domain.rgw:bucketbig state=0x7fb63c057008
> s->prefetch_data=0
> 2014-11-21 15:13:31.886339 7fb5e3f87700 10 cache get:
> name=.us-west.domain.rgw+bucketbig : type miss (requested=22, cached=19)
> 2014-11-21 15:13:31.887967 7fb5e3f87700 10 cache put:
> name=.us-west.domain.rgw+bucketbig
> 2014-11-21 15:13:31.887985 7fb5e3f87700 10 moving
> .us-west.domain.rgw+bucketbig to cache LRU end
> 2014-11-21 15:13:31.887998 7fb5e3f87700 20 get_obj_state: s->obj_tag was set
> empty
> 2014-11-21 15:13:31.888008 7fb5e3f87700 10 cache get:
> name=.us-west.domain.rgw+bucketbig : hit
> 2014-11-21 15:13:31.888031 7fb5e3f87700 20 rgw_get_bucket_info: bucket
> instance:
> bucketbig(@{i=.us-east.rgw.buckets.index}.us-east.rgw.buckets[us-east.4697.1])
> 2014-11-21 15:13:31.888043 7fb5e3f87700 20 reading from
> .us-west.domain.rgw:.bucket.meta.bucketbig:us-east.4697.1
> 2014-11-21 15:13:31.888059 7fb5e3f87700 20 get_obj_state:
> rctx=0x7fb5e3f861d0
> obj=.us-west.domain.rgw:.bucket.meta.bucketbig:us-east.4697.1
> state=0x7fb63c057bf8 s->prefetch_data=0
> 2014-11-21 15:13:31.888068 7fb5e3f87700 10 cache get:
> name=.us-west.domain.rgw+.bucket.meta.bucketbig:us-east.4697.1 : hit
> 2014-11-21 15:13:31.888075 7fb5e3f87700 20 get_obj_state: s->obj_tag was set
> empty
> 2014-11-21 15:13:31.888078 7fb5e3f87700 20 Read xattr: user.rgw.acl
> 2014-11-21 15:13:31.888080 7fb5e3f87700 20 Read xattr: user.rgw.idtag
> 2014-11-21 15:13:31.888081 7fb5e3f87700 20 Read xattr: user.rgw.manifest
> 2014-11-21 15:13:31.888084 7fb5e3f87700 10 cache get:
> name=.us-west.domain.rgw+.bucket.meta.bucketbig:us-east.4697.1 : hit
> 2014-11-21 15:13:31.888097 7fb5e3f87700 10 chain_cache_entry:
> cache_locator=.us-west.domain.rgw+bucketbig
> 2014-11-21 15:13:31.888099 7fb5e3f87700 10 chain_cache_entry:
> cache_locator=.us-west.domain.rgw+.bucket.meta.bucketbig:us-east.4697.1
> 2014-11-21 15:13:31.888181 7fb5e3f87700 15 Read
> AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/";>markirMark xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
> xsi:type="CanonicalUser">markirMarkFULL_CONTROL
> 2014-11-21 15:13:31.896569 7fb5e3f87700  2 req 502:0.010563:s3:PUT
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta:copy_obj:init
> op
> 2014-11-21 15:13:31.896579 7fb5e3f87700  2 req 502:0.010573:s3:PUT
> /bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_E

[ceph-users] Multiple MDS servers...

2014-11-21 Thread JIten Shah
I am trying to set up 3 MDS servers (one on each MON), but after I am done
setting up the first one, it gives me the error below when I try to start it on
the other ones. I understand that only 1 MDS is active at a time, but I thought
you could have multiple of them up, in case the first one dies? Or is that not
true?

[jshah@Lab-cephmon002 mds.Lab-cephmon002]$ sudo service ceph start 
mds.Lab-cephmon002
/etc/init.d/ceph: mds.Lab-cephmon002 not found (/etc/ceph/ceph.conf defines 
mon.Lab-cephmon002 mds.cephmon002 , /var/lib/ceph defines mon.Lab-cephmon002 
mds.cephmon002)

[jshah@Lab-cephmon002 mds.Lab-cephmon002]$ ls -l 
/var/lib/ceph/mds/mds.Lab-cephmon002/
total 0
-rwxr-xr-x 1 root root 0 Nov 14 18:42 done
-rwxr-xr-x 1 root root 0 Nov 14 18:42 sysvinit

[jshah@Lab-cephmon002 mds.Lab-cephmon002]$ grep cephmon002 /etc/ceph/ceph.conf 
mon_initial_members = Lab-cephmon001, Lab-cephmon002, Lab-cephmon003
mds_data = /var/lib/ceph/mds/mds.Lab-cephmon002
keyring = /var/lib/ceph/mds/mds.Lab-cephmon002/keyring
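
(A hedged reading of that error: the sysvinit script only starts daemon ids it
can find in ceph.conf, and it is reporting the id as mds.cephmon002 rather than
mds.Lab-cephmon002, so the service id, the [mds.*] section name and the data
directory have to agree. Purely as an illustration:)

# either start the id ceph.conf already defines...
sudo service ceph start mds.cephmon002
# ...or make the names line up, e.g. an [mds.Lab-cephmon002] section whose
# "mds data"/"keyring" point at /var/lib/ceph/mds/mds.Lab-cephmon002, then
sudo service ceph start mds.Lab-cephmon002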

—Jiten
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds cluster degraded

2014-11-21 Thread JIten Shah
This got taken care of after I deleted the pools for metadata and data and 
started it again. 

I did:

1. sudo service ceph stop mds

2. ceph mds newfs 1 0 --yes-i-really-mean-it (where 1 and 0 are the pool IDs for metadata and data)

3. ceph health (It was healthy now!!!)

4. sudo service ceph start mds.$(hostname -s)

And I am back in business.

On Nov 18, 2014, at 3:27 PM, Gregory Farnum  wrote:

> Hmm, last time we saw this it meant that the MDS log had gotten
> corrupted somehow and was a little short (in that case due to the OSDs
> filling up). What do you mean by "rebuilt the OSDs"?
> -Greg
> 
> On Mon, Nov 17, 2014 at 12:52 PM, JIten Shah  wrote:
>> After i rebuilt the OSD’s, the MDS went into the degraded mode and will not
>> recover.
>> 
>> 
>> [jshah@Lab-cephmon001 ~]$ sudo tail -100f
>> /var/log/ceph/ceph-mds.Lab-cephmon001.log
>> 2014-11-17 17:55:27.855861 7fffef5d3700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=0 pgs=0 cs=0 l=0
>> c=0x1e02c00).accept peer addr is really X.X.16.114:0/838757053 (socket is
>> X.X.16.114:34672/0)
>> 2014-11-17 17:57:27.855519 7fffef5d3700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=2 pgs=2 cs=1 l=0
>> c=0x1e02c00).fault with nothing to send, going to standby
>> 2014-11-17 17:58:47.883799 7fffef3d1700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=0 pgs=0 cs=0 l=0
>> c=0x1e04ba0).accept peer addr is really X.X.16.114:0/26738200 (socket is
>> X.X.16.114:34699/0)
>> 2014-11-17 18:00:47.882484 7fffef3d1700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=2 pgs=2 cs=1 l=0
>> c=0x1e04ba0).fault with nothing to send, going to standby
>> 2014-11-17 18:01:47.886662 7fffef1cf700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=0 pgs=0 cs=0 l=0
>> c=0x1e05540).accept peer addr is really X.X.16.114:0/3673954317 (socket is
>> X.X.16.114:34718/0)
>> 2014-11-17 18:03:47.885488 7fffef1cf700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=2 pgs=2 cs=1 l=0
>> c=0x1e05540).fault with nothing to send, going to standby
>> 2014-11-17 18:04:47.888983 7fffeefcd700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=0 pgs=0 cs=0 l=0
>> c=0x1e05280).accept peer addr is really X.X.16.114:0/3403131574 (socket is
>> X.X.16.114:34744/0)
>> 2014-11-17 18:06:47.888427 7fffeefcd700  0 -- X.X.16.111:6800/3046050 >>
>> X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=2 pgs=2 cs=1 l=0
>> c=0x1e05280).fault with nothing to send, going to standby
>> 2014-11-17 20:02:03.558250 707de700 -1 mds.0.1 *** got signal Terminated
>> ***
>> 2014-11-17 20:02:03.558297 707de700  1 mds.0.1 suicide.  wanted
>> down:dne, now up:active
>> 2014-11-17 20:02:56.053339 77fe77a0  0 ceph version 0.80.5
>> (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 3424727
>> 2014-11-17 20:02:56.121367 730e4700  1 mds.-1.0 handle_mds_map standby
>> 2014-11-17 20:02:56.124343 730e4700  1 mds.0.2 handle_mds_map i am now
>> mds.0.2
>> 2014-11-17 20:02:56.124345 730e4700  1 mds.0.2 handle_mds_map state
>> change up:standby --> up:replay
>> 2014-11-17 20:02:56.124348 730e4700  1 mds.0.2 replay_start
>> 2014-11-17 20:02:56.124359 730e4700  1 mds.0.2  recovery set is
>> 2014-11-17 20:02:56.124362 730e4700  1 mds.0.2  need osdmap epoch 93,
>> have 92
>> 2014-11-17 20:02:56.124363 730e4700  1 mds.0.2  waiting for osdmap 93
>> (which blacklists prior instance)
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com