Re: [ceph-users] monitor quorum

2014-09-18 Thread James Eckersall
Is anyone able to offer any advice on how to fix this?
I've tried re-injecting the monmap into mon03 as that was mentioned in the
mon troubleshooting docs, but that has not helped at all.  mon03 is still
stuck in the same electing state :(
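
For reference, the re-injection was done along the lines of the troubleshooting
docs, roughly like this (a sketch; the exact stop/start commands depend on the
init system):

# on a healthy monitor, export the current monmap
ceph mon getmap -o /tmp/monmap
# on mon03: stop the daemon, inject the map, start it again
sudo service ceph stop mon.ceph-mon-03
sudo ceph-mon -i ceph-mon-03 --inject-monmap /tmp/monmap
sudo service ceph start mon.ceph-mon-03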

I've increased the debug level on mon03 and it is reporting the following,
repeatedly:

2014-09-18 10:22:12.788061 7f30f9818700  5
mon.ceph-mon-03@2(electing).elector(947)
start -- can i be leader?
2014-09-18 10:22:12.788105 7f30f9818700  1
mon.ceph-mon-03@2(electing).elector(947)
init, last seen epoch 947
2014-09-18 10:22:12.788111 7f30f9818700  1 -- 10.1.1.66:6789/0 --> mon.0
10.1.1.64:6789/0 -- election(XXX propose 947) v5 -- ?+0 0x7f3104568dc0
2014-09-18 10:22:12.788129 7f30f9818700  1 -- 10.1.1.66:6789/0 --> mon.1
10.1.1.65:6789/0 -- election(XXX propose 947) v5 -- ?+0 0x7f3104568b00
2014-09-18 10:22:14.470715 7f30f7f14700  1 -- 10.1.1.66:6789/0 >> :/0
pipe(0x7f31020a5c00 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f31036be7e0).accept
sd=13 10.1.1.10:50568/0
2014-09-18 10:22:14.470926 7f30f7f14700 10 mon.ceph-mon-03@2(electing) e3
ms_verify_authorizer 10.1.1.10:0/1007970 client protocol 0
2014-09-18 10:22:14.471281 7f30f9017700  1 -- 10.1.1.66:6789/0 <== client.?
10.1.1.10:0/1007970 1  auth(proto 0 30 bytes epoch 0) v1  60+0+0
(673663173 0 0) 0x7f310282d600 con 0x7f31036be7e0
2014-09-18 10:22:14.471296 7f30f9017700  5 mon.ceph-mon-03@2(electing) e3
waitlisting message auth(proto 0 30 bytes epoch 0) v1
2014-09-18 10:22:14.866689 7f30f9818700  5 mon.ceph-mon-03@2(electing) e3
waitlisting message auth(proto 0 30 bytes epoch 0) v1

2014-09-18 10:22:17.470417 7f30f9017700 10 mon.ceph-mon-03@2(electing) e3
ms_handle_reset 0x7f31036be7e0 10.1.1.10:0/1007970
2014-09-18 10:22:17.788184 7f30f9818700  5
mon.ceph-mon-03@2(electing).elector(947)
election timer expired


J

On 17 September 2014 17:05, James Eckersall 
wrote:

> Hi,
>
> Now I feel dumb for jumping to the conclusion that it was a simple
> networking issue - it isn't.
> I've just checked connectivity properly and I can ping and telnet to port 6789
> from all mon servers to all other mon servers.
>
> I've just restarted the mon03 service and the log is showing the following:
>
> 2014-09-17 16:49:02.355148 7f7ef9f8c800  0 starting mon.ceph-mon-03 rank 2
> at 10.1.1.66:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-mon-03 fsid
> 74069c87-b361-4bb8-8ce8-6ae9deb8a9bd
> 2014-09-17 16:49:02.355375 7f7ef9f8c800  1 mon.ceph-mon-03@-1(probing) e2
> preinit fsid 74069c87-b361-4bb8-8ce8-6ae9deb8a9bd
> 2014-09-17 16:49:02.356347 7f7ef9f8c800  1 
> mon.ceph-mon-03@-1(probing).paxosservice(pgmap
> 18241250..18241952) refresh upgraded, format 0 -> 1
> 2014-09-17 16:49:02.356360 7f7ef9f8c800  1 mon.ceph-mon-03@-1(probing).pg
> v0 on_upgrade discarding in-core PGMap
> 2014-09-17 16:49:02.400316 7f7ef9f8c800  0 mon.ceph-mon-03@-1(probing).mds
> e1 print_map
> epoch 1
> flags 0
> created 2013-12-09 10:19:58.534310
> modified 2013-12-09 10:19:58.534332
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> last_failure 0
> last_failure_osd_epoch 0
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding}
> max_mds 1
> in
> up {}
> failed
> stopped
> data_pools 0
> metadata_pool 1
> inline_data disabled
>
> 2014-09-17 16:49:02.402373 7f7ef9f8c800  0 mon.ceph-mon-03@-1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.402384 7f7ef9f8c800  0 mon.ceph-mon-03@-1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.402386 7f7ef9f8c800  0 mon.ceph-mon-03@-1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.402388 7f7ef9f8c800  0 mon.ceph-mon-03@-1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.403725 7f7ef9f8c800  1 
> mon.ceph-mon-03@-1(probing).paxosservice(auth
> 26001..26154) refresh upgraded, format 0 -> 1
> 2014-09-17 16:49:02.404834 7f7ef9f8c800  0 mon.ceph-mon-03@-1(probing) e2
>  my rank is now 2 (was -1)
> 2014-09-17 16:49:02.407439 7f7ef331b700  1 mon.ceph-mon-03@2(synchronizing)
> e2 sync_obtain_latest_monmap
> 2014-09-17 16:49:02.407588 7f7ef331b700  1 mon.ceph-mon-03@2(synchronizing)
> e2 sync_obtain_latest_monmap obtained monmap e2
> 2014-09-17 16:49:09.514365 7f7ef331b700  0 log [INF] : mon.ceph-mon-03
> calling new monitor election
> 2014-09-17 16:49:09.514523 7f7ef331b700  1 
> mon.ceph-mon-03@2(electing).elector(931)
> init, last seen epoch 931
> 2014-09-17 16:49:09.514658 7f7ef331b700  1 
> mon.ceph-mon-03@2(electing).paxos(paxos
> recovering c 31223899..31224482) is_readable now=2014-09-17 16:49:09.514659
> lease_expire=0.00 has v0 lc 31224482
> 2014-09-17 16:49:09.514665 7f7ef331b700  1 
> mon.ceph-mon-03@2(electing).paxos(paxos
> recovering c 31223899..312

Re: [ceph-users] Still seeing scrub errors in .80.5

2014-09-18 Thread Marc
Hi,

we did run a deep scrub on everything yesterday, and a repair
afterwards. Then a new deep scrub today, which brought new scrub errors.

I did check the OSD config; they report "filestore_xfs_extsize": "false",
as it should be if I understood things correctly.

FTR the deep scrub has been initiated like this:

for pgnum in `ceph pg dump|grep active|awk '{print $1}'`; do ceph pg
deep-scrub $pgnum; done
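
For completeness, the inconsistent PGs can also be listed and repaired
individually, along the same lines as the loop above (a sketch):

ceph health detail | grep inconsistent
for pgnum in `ceph health detail | grep '^pg .*inconsistent' | awk '{print $2}'`; do ceph pg repair $pgnum; done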

How do we proceed from here?

Thanks


On 16/09/2014 20:36, Gregory Farnum wrote:
> Ah, you're right — it wasn't popping up in the same searches and I'd
> forgotten that was so recent.
>
> In that case, did you actually deep scrub *everything* in the cluster,
> Marc? You'll need to run and fix every PG in the cluster, and the
> background deep scrubbing doesn't move through the data very quickly.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Sep 16, 2014 at 11:32 AM, Dan Van Der Ster
>  wrote:
>> Hi Greg,
>> I believe Marc is referring to the corruption triggered by set_extsize on
>> xfs. That option was disabled by default in 0.80.4... See the thread
>> "firefly scrub error".
>> Cheers,
>> Dan
>>
>>
>>
>> From: Gregory Farnum 
>> Sent: Sep 16, 2014 8:15 PM
>> To: Marc
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Still seeing scrub errors in .80.5
>>
>> On Tue, Sep 16, 2014 at 12:03 AM, Marc  wrote:
>>> Hello fellow cephalopods,
>>>
>>> every deep scrub seems to dig up inconsistencies (i.e. scrub errors)
>>> that we could use some help with diagnosing.
>>>
>>> I understand there used to be a data corruption issue before .80.3 so we
>>> made sure that all the nodes were upgraded to .80.5 and all the daemons
>>> were restarted (they all report .80.5 when contacted via socket).
>>> *After* that we ran a deep scrub, which obviously found errors, which we
>>> then repaired. But unfortunately, it's now a week later, and the next
>>> deep scrub has dug up new errors, which shouldn't have happened I
>>> think...?
>>>
>>> ceph.log shows these errors in between the deep scrub messages:
>>>
>>> 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR]
>>> 3.335 shard 2: soid
>>> 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest
>>> 3090820441 != known digest 3787996302
>>> 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR]
>>> 3.335 shard 6: soid
>>> 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest
>>> 3259686791 != known digest 3787996302
>>> 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR]
>>> 3.335 deep-scrub 0 missing, 1 inconsistent objects
>>> 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR]
>>> 3.335 deep-scrub 2 errors
>> Uh, I'm afraid those errors were never output as a result of bugs in
>> Firefly. These are indicating actual data differences between the
>> nodes, whereas the Firefly issue was a metadata flag that wasn't
>> handled properly in mixed-version OSD clusters.
>>
>> I don't think Ceph has ever had a bug that would change the data
>> payload between OSDs. Searching the tracker logs, the only entries
>> with this error message come down to one of two causes:
>> 1) The local filesystem is misbehaving under the workload we give
>> it (and there are no known filesystem issues that are exposed by
>> running firefly OSDs in default config that I can think of — certainly
>> none with this error)
>> 2) The disks themselves are bad.
>>
>> :/
>>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com



[ceph-users] Newbie Ceph Design Questions

2014-09-18 Thread Christoph Adomeit

Hello Ceph-Community,

we are considering using a Ceph cluster to serve VMs.
We need good performance and absolute stability.

Regarding Ceph I have a few questions.

Presently we use Solaris ZFS Boxes as NFS Storage for VMs.

The ZFS boxes are very fast because they use all free RAM
for read caching. With ARC stats we can see that 90% of all read
operations are served from memory. The ZFS read cache is also very
intelligent about which blocks to keep in the cache.

From reading about Ceph it seems that Ceph clusters don't have
such an optimized read cache. Do you think we can still perform
as well as the Solaris boxes?

Next question: I read that in Ceph an OSD is marked invalid as
soon as its journal disk fails. So what should I do? I don't
want to use one journal disk for each OSD, but I also don't want to use
one journal disk per 4 OSDs, because then I will lose 4 OSDs if an SSD
fails. Putting the journals on the OSD disks will, I am afraid, be slow.
Again, I am afraid of slow Ceph performance compared to ZFS, because
ZFS supports ZIL write-cache disks.

Last question: someone told me Ceph snapshots are slow. Is this true?
I always thought making a snapshot is just moving around some pointers
to data.

And a very last question: what about btrfs, still not recommended?

Thanks for helping

Christoph



[ceph-users] Frequent Crashes on rbd to nfs gateway Server

2014-09-18 Thread Micha Krause

Hi,

I have built an NFS server based on Sebastien's blog post here:
http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

I'm using kernel 3.14-0.bpo.1-amd64 on Debian wheezy; the host is a VM on VMware.

Using rsync I'm writing data via NFS from one client to this server.

The NFS server crashes multiple times per day, and I can't even log in to the
server when it happens.
After a reset there is no kernel log about the crash, so I guess something is
blocking all I/Os.
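
One way to at least capture which tasks are blocked on the next hang (a sketch,
assuming the console still responds): enable sysrq and the hung-task detector
beforehand, then dump the blocked tasks when it happens:

echo 1 > /proc/sys/kernel/sysrq
sysctl -w kernel.hung_task_timeout_secs=120
# when the hang occurs, dump all blocked (D-state) tasks to the kernel log:
echo w > /proc/sysrq-trigger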


Any ideas on how to debug this?

Micha Krause


[ceph-users] three way replication on pool a failed

2014-09-18 Thread m.channappa.negalur
Hello Sebastien,

I am configuring Ceph with a 3-node storage cluster plus one Ceph admin node.

I have few questions.

I have created a pool named 'storage' with replication size 3 and I have set
the CRUSH rule for it.

root@node1:/home/oss# ceph osd dump | grep -E 'storage'
pool 9 'storage' replicated size 3 min_size 1 crush_ruleset 3 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 160 flags hashpspool stripe_width 0

Note: the command used to set the replication size was: # ceph osd pool set storage size 3

Even after setting replication size 3, my data is not getting replicated to
all 3 nodes.

Example:
root@Cephadmin:/home/oss# ceph osd map storage check1
osdmap e122 pool 'storage' (9) object 'check1' -> pg 9.7c9c5619 (9.1) -> up 
([0,2,1], p0) acting ([0,2,1], p0)

But if I shut down 2 of the nodes I am unable to access the data. In that
scenario I would expect to still be able to read/write data, since my 3rd node
is up (if my understanding is correct). Please let me know where I am wrong.

Crush Map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 pool

# buckets
host node2 {
id -2   # do not change unnecessarily
# weight 0.030
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.030
}
host node3 {
id -3   # do not change unnecessarily
# weight 0.030
alg straw
hash 0  # rjenkins1
item osd.1 weight 0.030
}
host node1 {
id -4   # do not change unnecessarily
# weight 0.030
alg straw
hash 0  # rjenkins1
item osd.2 weight 0.030
}
root default {
id -1   # do not change unnecessarily
# weight 0.090
alg straw
hash 0  # rjenkins1
item node2 weight 0.030
item node3 weight 0.030
item node1 weight 0.030
}
pool storage {
id -5   # do not change unnecessarily
# weight 0.090
alg straw
hash 0  # rjenkins1
item node2 weight 0.030
item node3 weight 0.030
item node1 weight 0.030
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

rule storage {
ruleset 3
type replicated
min_size 1
max_size 10
step take storage
step choose firstn 0 type osd
step emit
}
# end crush map
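
For what it's worth, a map like this can also be checked offline with crushtool
to see which OSDs rule 3 would pick for a given replica count; a sketch,
assuming the compiled map is exported first:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-mappings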


root@node1:/home/oss# ceph osd tree
# id    weight  type name       up/down reweight
-5      0.09    pool storage
-2      0.03            host node2
0       0.03                    osd.0   up      1
-3      0.03            host node3
1       0.03                    osd.1   up      1
-4      0.03            host node1
2       0.03                    osd.2   up      1
-1      0.09    root default
-2      0.03            host node2
0       0.03                    osd.0   up      1
-3      0.03            host node3
1       0.03                    osd.1   up      1
-4      0.03            host node1
2       0.03                    osd.2   up      1

Reference:
http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/


-Original Message-
From: Sebastien Han [mailto:sebastien@enovance.com] 
Sent: Tuesday, September 16, 2014 7:43 PM
To: Channappa Negalur, M.
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] vdb busy error when attaching to instance

Did you follow this ceph.com/docs/master/rbd/rbd-openstack/ to configure your 
env?

On 12 Sep 2014, at 14:38, m.channappa.nega...@accenture.com wrote:

> Hello Team,
>  
> I have configured ceph as a multibackend for openstack.
>  
> I have created 2 pools .
> 1.   Volumes (replication size =3 )
> 2.   poolb (replication size =2 )
>  
> Below is the details from /etc/cinder/cinder.conf
>  
> enabled_backends=rbd-ceph,rbd-cephrep
> [rbd-ceph]
> volume_driver=cinder.volume.drivers.rbd.RBDDriver
> rbd_pool=volumes
> volume_backend_name=ceph
> rbd_user=volumes
> rbd_secret_uuid=34c88ed2-1cf6-446d-8564-f888934eec35
> volumes_dir=/var/lib/cinder/volumes
> [rbd-cephrep]
> volume_driver=cinder.volume.drivers.rbd.RBDDriver
> rbd_pool=poolb
> volume_backend_name=ceph1
> rbd_user=poolb
> rbd_secret_uuid=d62b0df6-ee26-46f0-8d90-4ef4d55caa5b
> volumes_dir=/var/lib/cinder/volumes1
>  
> when I am attaching a volume to an instance I am getting a "DeviceIsBusy: The 
> supplied device (vdb) is busy" error.
>  
> Please let me know how to correct this..
>  
> Regards,
> Malleshi CN

Re: [ceph-users] Newbie Ceph Design Questions

2014-09-18 Thread Christian Balzer

Hello,

On Thu, 18 Sep 2014 13:07:35 +0200 Christoph Adomeit wrote:

> 
> Hello Ceph-Community,
> 
> we are considering using a Ceph cluster to serve VMs.
> We need good performance and absolute stability.
> 
I really don't want to sound snarky here, but you get what you pay for; the
old adage of "cheap, fast, reliable: pick one" still holds.

That said, Ceph can probably fulfill your needs if you're willing to invest
the time (learning curve, testing) and money (resources). 

> Regarding Ceph I have a few questions.
> 
> Presently we use Solaris ZFS Boxes as NFS Storage for VMs.
> 
That sounds slower than I would expect Ceph RBD to be in nearly all cases.

Also, how do you replicate the filesystems to cover for node failures? 

> The ZFS boxes are very fast because they use all free RAM
> for read caching. With ARC stats we can see that 90% of all read
> operations are served from memory. The ZFS read cache is also very
> intelligent about which blocks to keep in the cache.
> 
> From reading about Ceph it seems that Ceph clusters don't have
> such an optimized read cache. Do you think we can still perform
> as well as the Solaris boxes?
> 
It's called the Linux page cache. If you're spending enough money to fill
your OSD nodes with similar amounts of RAM, the ratio will also be similar.
I have a Ceph storage cluster with just 2 storage nodes (don't ask, read
my older posts if you want to know how and why) with 32GB RAM each, and
they serve nearly all reads for about 100 VMs out of that cache space,
too.
More memory in OSD nodes is definitely one of the best ways to improve
performance with Ceph. 

In the future (not now really, the feature is much too new) Ceph cache
pools (SSD based) are likely to be very helpful with working sets that go
beyond OSD RAM size.
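
Purely as an illustration of what that will look like (a sketch with
hypothetical pool names, not a recommendation to deploy it yet), a cache pool
gets attached roughly like this in firefly and later:

ceph osd tier add rbd ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay rbd ssd-cache
ceph osd pool set ssd-cache hit_set_type bloom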

> Next question: I read that in Ceph an OSD is marked invalid as
> soon as its journal disk fails. So what should I do? I don't
> want to use one journal disk for each OSD, but I also don't want to use
> one journal disk per 4 OSDs, because then I will lose 4 OSDs if an SSD
> fails. Putting the journals on the OSD disks will, I am afraid, be slow.
> Again, I am afraid of slow Ceph performance compared to ZFS, because
> ZFS supports ZIL write-cache disks.
> 
I don't do ZFS, but it is my understanding that losing the ZIL cache
(presumably on an SSD for speed reasons) will also potentially lose you
the latest writes. So not really all that different from Ceph.

With Ceph you will (if you strive for that "absolute stability" you
mention, which I also interpret as reliability) have at least a replica
size of 3, so 3 copies of your data. On different nodes of course.
So losing an OSD or 4 isn't great, but it's not the end of the world
either. There are many discussions here that cover these subjects; you can
go from maximum speed and low cost when all is working (the classic Ceph
setup with SSD journals for 2-5 HDDs) to things like a RAID1 journal in
front of RAID6 OSD storage. This all depends on your goals/needs and
wallet size. 

If you have no control or idea over what your VMs are doing and thus want
a generic, high performance cluster, look at:

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

I find this document a bit dated and optimistic at points given recent
developments, but it is a very good basis to start from.

You can juggle numbers, but unless you're willing to do the tests in your
specific case yourself and optimize for it, I would recommend something
like 9 storage nodes (with n OSDs depending on your requirements) and at
least 3 monitors for the initial deployment. 
If your storage nodes have SSDs for the OS and plenty of CPU/RAM reserves,
I see no reason to not put monitors on them if you're tight for space or
money. 

In that scenario losing even 4 OSDs due to a journal SSD failure would
not be the end of the world by a long shot. Never mind that if you're using
the right SSDs (Intel DC S3700 for example) you're unlikely to ever
experience such a failure. 
And even if so, there are again plenty of discussions in this ML how to
mitigate the effects of such failure (in terms of replication traffic and
its impact on the cluster performance, data redundancy should really never
be the issue). 

> Last Question: Someone told me Ceph Snapshots are slow. Is this true ?
> I always thought making a snapshot is just moving around some pointers 
> to data.
>
No idea, I don't use them.
But from what I gather the DELETION of them (like RBD images) is a rather
resource intensive process, not the creation.
 
> And very last question: What about btrfs, still not recommended ?
> 
Definitely not from where I'm standing.
Between the inherent disadvantage of using BTRFS (CoW, thus fragmentation
galore) for VM storage and actual bugs people run into I don't think it
ever will be.

I venture that Key/Value store systems will be both faster and more
reliable than BTRFS within a year or so.

Christian

> Thanks 

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-18 Thread Alexandre DERUMIER
>>Has anyone ever tested multi-volume performance on a *FULL* SSD setup?

I know that Stefan Priebe runs full SSD clusters in production and has done
benchmarks. (As far as I remember, he benched around 20K peak with dumpling.)

>>We are able to get ~18K IOPS for 4K random read on a single volume with fio 
>>(with rbd engine) on a 12x DC3700 Setup, but only able to get ~23K (peak) 
>>IOPS even with multiple volumes. 
>>Seems the maximum random write performance we can get on the entire cluster 
>>is quite close to single volume performance. 
Firefly or Giant ?

I'll do benchmarks with 6 DC S3500 OSDs tomorrow to compare Firefly and Giant.
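
For reference, a 4K random read with the fio rbd engine can be driven with a
job file roughly like this (a sketch; pool, image and client names are
placeholders):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
invalidate=0
bs=4k
iodepth=32

[randread-4k]
rw=randread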

- Original Message - 

From: "Jian Zhang"  
To: "Sebastien Han" , "Alexandre DERUMIER" 
Cc: ceph-users@lists.ceph.com 
Sent: Thursday, 18 September 2014 08:12:32 
Subject: RE: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS 

Has anyone ever tested multi-volume performance on a *FULL* SSD setup? 
We are able to get ~18K IOPS for 4K random read on a single volume with fio 
(with rbd engine) on a 12x DC3700 Setup, but only able to get ~23K (peak) IOPS 
even with multiple volumes. 
Seems the maximum random write performance we can get on the entire cluster is 
quite close to single volume performance. 

Thanks 
Jian 


-Original Message- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Sebastien Han 
Sent: Tuesday, September 16, 2014 9:33 PM 
To: Alexandre DERUMIER 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS 

Hi, 

Thanks for keeping us updated on this subject. 
dsync is definitely killing the ssd. 

I don't have much to add; I'm just surprised that you're only getting 5299 with 
0.85, since I've been able to get 6.4K. Well, I was using the 200GB model, that 
might explain this. 


On 12 Sep 2014, at 16:32, Alexandre DERUMIER  wrote: 

> here the results for the intel s3500 
>  
> max performance is with ceph 0.85 + optracker disabled. 
> intel s3500 don't have d_sync problem like crucial 
> 
> %util show almost 100% for read and write, so maybe the ssd disk performance 
> is the limit. 
> 
> I have some stec zeusram 8GB in stock (I used them for zfs zil), I'll try to 
> bench them next week. 
> 
> 
> 
> 
> 
> 
> INTEL s3500 
> --- 
> raw disk 
>  
> 
> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
> --iodepth=32 --group_reporting --invalidate=0 --name=abc 
> --ioengine=aio bw=288207KB/s, iops=72051 
> 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 0,00 0,00 73454,00 0,00 293816,00 0,00 8,00 30,96 0,42 0,42 0,00 0,01 
> 99,90 
> 
> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k 
> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
> --sync=1 bw=48131KB/s, iops=12032 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 0,00 0,09 0,04 
> 100,00 
> 
> 
> ceph 0.80 
> - 
> randread: no tuning: bw=24578KB/s, iops=6144 
> 
> 
> randwrite: bw=10358KB/s, iops=2589 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 0,00 373,00 0,00 8878,00 0,00 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 
> 50,90 
> 
> 
> ceph 0.85 : 
> - 
> 
> randread : bw=41406KB/s, iops=10351 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 8,02 1,36 0,13 0,13 0,00 0,07 75,90 
> 
> randwrite : bw=17204KB/s, iops=4301 
> 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 0,00 333,00 0,00 9788,00 0,00 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 
> 67,80 
> 
> 
> ceph 0.85 tuning op_tracker=false 
>  
> 
> randread : bw=86537KB/s, iops=21634 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 25,00 0,00 21428,00 0,00 86444,00 0,00 8,07 3,13 0,15 0,15 0,00 0,05 
> 98,00 
> 
> randwrite: bw=21199KB/s, iops=5299 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> sdb 0,00 1563,00 0,00 9880,00 0,00 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 
> 80,00 
> 
> 
> - Original Message - 
> 
> From: "Alexandre DERUMIER"  
> To: "Cedric Lemarchand"  
> Cc: ceph-users@lists.ceph.com 
> Sent: Friday, 12 September 2014 08:15:08 
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 
> 3, 2K IOPS 
> 
> results of fio on rbd with kernel patch 
> 
> 
> 
> fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
> result): 
> --- 
> bw=12327KB/s, iops=3081 
> 
> So not much better than before, but this time, iostat shows only 15% 
> utils

Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Steven Timm

thanks Luke, I will try that.

Steve


On Wed, 17 Sep 2014, Luke Jing Yuan wrote:


Hi,

From the ones we managed to configure in our lab here, I noticed that using image format 
"raw" instead of "qcow2" worked for us.

Regards,
Luke

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steven 
Timm
Sent: Thursday, 18 September, 2014 5:01 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm


I am trying to use Ceph as a data store with OpenNebula 4.6 and have followed 
the instructions in OpenNebula's documentation at 
http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html

and compared them against the "using libvirt with ceph"

http://ceph.com/docs/master/rbd/libvirt/

We are using the ceph-recompiled qemu-kvm and qemu-img as found at

http://ceph.com/packages/qemu-kvm/

under Scientific Linux 6.5 which is a Redhat clone.  Also a kernel-lt-3.10 
kernel.

[root@fgtest15 qemu]# kvm -version
QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
2003-2008 Fabrice Bellard


From qemu-img

Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat qcow2
qed parallels nbd blkdebug host_cdrom host_floppy host_device file rbd


--
Libvirt is trying to execute the following KVM command:

2014-09-17 19:50:12.774+: starting up
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
/usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m 4096
-smp 2,sockets=2,cores=1,threads=1 -uuid
572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive
file=/var/lib/one//datastores/102/60/disk.1,if=none,id=drive-virtio-disk1,format=raw,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
-drive
file=/var/lib/one//datastores/102/60/disk.2,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:02:0b:04,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:60 -k en-us -vga
cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
char device redirected to /dev/pts/3
qemu-kvm: -drive
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none:
could not open disk image
rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789:
Invalid argument
2014-09-17 19:50:12.980+: shutting down

---

just to show that from the command line I can see the rbd pool fine

[root@fgtest15 qemu]# rbd list one
foo
one-19
one-19-58-0
one-19-60-0
[root@fgtest15 qemu]# rbd info one/one-19-60-0
rbd image 'one-19-60-0':
   size 40960 MB in 10240 objects
   order 22 (4096 kB objects)
   block_name_prefix: rb.0.3c39.238e1f29
   format: 1


and even mount stuff with rbd map, etc.

It's only inside libvirt that we had the problem.

At first we were getting "permission denied" but then I upped the
permissions allowed to the libvirt user (client.libvirt2) and then
we are just getting  "invalid argument"


client.libvirt2
   key: AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==
   caps: [mon] allow r
   caps: [osd] allow *, allow rwx pool=one

--

Any idea why kvm doesn't like the argument I am delivering in the file=
argument?  Better--does anyone have a working kvm command out
of either opennebula or openstack against which I can compare?

Thanks

Steve Timm





--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing



Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Steven Timm

With the default user "libvirt" (corresponding to the client.libvirt ceph
token) and with the permissions that were suggested in both the Ceph
manual and the OpenNebula manual, I get a different error, namely
permission denied.  I am not sure why that is. I then tried
with the full ceph admin privileges and got the error I am getting now.

I will have a look at the patch.

Steve Timm



On Thu, 18 Sep 2014, Stijn De Weirdt wrote:


hi steven,

we ran into issues when trying to use a non-default ceph user in OpenNebula
(I don't remember what the default was, but it's probably not libvirt2).
Patches are in https://github.com/OpenNebula/one/pull/33; the devs sort-of
confirmed they will be in 4.8.1. This way you can set CEPH_USER in the
datastore template. (But if this is the case, I think that onedatastore
list fails to show the size of the datastore.)


stijn


On 09/18/2014 04:38 AM, Luke Jing Yuan wrote:

 Hi,

From the ones we managed to configure in our lab here, I noticed that 
using image format "raw" instead of "qcow2" worked for us.


 Regards,
 Luke

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Steven Timm
 Sent: Thursday, 18 September, 2014 5:01 AM
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm


 I am trying to use Ceph as a data store with OpenNebula 4.6 and have
 followed the instructions in OpenNebula's documentation at
 http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html

 and compared them against the "using libvirt with ceph"

 http://ceph.com/docs/master/rbd/libvirt/

 We are using the ceph-recompiled qemu-kvm and qemu-img as found at

 http://ceph.com/packages/qemu-kvm/

 under Scientific Linux 6.5 which is a Redhat clone.  Also a kernel-lt-3.10
 kernel.

 [root@fgtest15 qemu]# kvm -version
 QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
 2003-2008 Fabrice Bellard


From qemu-img

 Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat qcow2
 qed parallels nbd blkdebug host_cdrom host_floppy host_device file rbd


 --
 Libvirt is trying to execute the following KVM command:

 2014-09-17 19:50:12.774+: starting up
 LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
 /usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m 4096
 -smp 2,sockets=2,cores=1,threads=1 -uuid
 572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
 
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
 -device
 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive
 
file=/var/lib/one//datastores/102/60/disk.1,if=none,id=drive-virtio-disk1,format=raw,cache=none
 -device
 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
 -drive
 
file=/var/lib/one//datastores/102/60/disk.2,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:02:0b:04,bus=pci.0,addr=0x3
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:60 -k en-us -vga
 cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 char device redirected to /dev/pts/3
 qemu-kvm: -drive
 
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none:
 could not open disk image
 
rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789:
 Invalid argument
 2014-09-17 19:50:12.980+: shutting down

 ---

 just to show that from the command line I can see the rbd pool fine

 [root@fgtest15 qemu]# rbd list one
 foo
 one-19
 one-19-58-0
 one-19-60-0
 [root@fgtest15 qemu]# rbd info one/one-19-60-0
 rbd image 'one-19-60-0':
  size 40960 MB in 10240 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.3c39.238e1f29
  format: 1


 and even mount stuff with rbd map, etc.

 It's only inside libvirt that we had the problem.

 At first we were getting "permission denied" but then I upped the
 permissions allowed to the libvirt user (client.libvirt2) and then
 we are just getting  "invalid argument"


 client.libvirt2
  key: AQAV5BlU2OV7NBAApurqxG0K8UkZlQV

Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Steven Timm

On Thu, 18 Sep 2014, Osier Yang wrote:



On 2014年09月18日 10:38, Luke Jing Yuan wrote:

 Hi,

  From the ones we managed to configure in our lab here. I noticed that
  using image format "raw" instead of "qcow2" worked for us.

 Regards,
 Luke

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Steven Timm
 Sent: Thursday, 18 September, 2014 5:01 AM
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm


 I am trying to use Ceph as a data store with OpenNebula 4.6 and have
 followed the instructions in OpenNebula's documentation at
 http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html

 and compared them against the "using libvirt with ceph"

 http://ceph.com/docs/master/rbd/libvirt/

 We are using the ceph-recompiled qemu-kvm and qemu-img as found at

 http://ceph.com/packages/qemu-kvm/

 under Scientific Linux 6.5 which is a Redhat clone.  Also a kernel-lt-3.10
 kernel.

 [root@fgtest15 qemu]# kvm -version
 QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
 2003-2008 Fabrice Bellard


  From qemu-img

 Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat qcow2
 qed parallels nbd blkdebug host_cdrom host_floppy host_device file rbd


 --
 Libvirt is trying to execute the following KVM command:

 2014-09-17 19:50:12.774+: starting up
 LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
 /usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m 4096
 -smp 2,sockets=2,cores=1,threads=1 -uuid
 572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
 
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
 -device
 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive
 
file=/var/lib/one//datastores/102/60/disk.1,if=none,id=drive-virtio-disk1,format=raw,cache=none
 -device
 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
 -drive
 
file=/var/lib/one//datastores/102/60/disk.2,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:02:0b:04,bus=pci.0,addr=0x3
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:60 -k en-us -vga
 cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 char device redirected to /dev/pts/3
 qemu-kvm: -drive
 
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none:
 could not open disk image
 
rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789:
 Invalid argument



The error is from qemu-kvm.

You need to check whether your qemu-kvm supports all the arguments listed
above for the "-drive" option. As you mentioned, the qemu-kvm was built by
yourself. It's likely that you missed something, or the qemu-kvm version is
old and doesn't support some of the arguments.

Regards,
Osier



Thanks for the feedback, Osier. I did not build the qemu-kvm myself; it 
is the recommended version from Ceph and was built by them.

Short of looking into the source code, how do I determine what
parameters my qemu-kvm may support in the -drive option? Thus far I have
found nothing online describing the syntax of that option, and furthermore
it looks like the allowed arguments vary by device type, i.e. rbd
accepts different sub-arguments than any other subtype.

Steve Timm



--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-18 Thread Mark Nelson
Couple of questions:  Are those client IOPS or real IOPS after 
replication and journal writes?  Also, how's CPU usage?  Are interrupts 
being distributed to all cores?


Mark

On 09/18/2014 01:12 AM, Zhang, Jian wrote:

Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
We are able to get ~18K IOPS for 4K random read on a single volume with fio 
(with rbd engine) on a 12x DC3700 Setup, but only able to get ~23K (peak) IOPS 
even with multiple volumes.
Seems the maximum random write performance we can get on the entire cluster is 
quite close to single volume performance.

Thanks
Jian


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Sebastien Han
Sent: Tuesday, September 16, 2014 9:33 PM
To: Alexandre DERUMIER
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS

Hi,

Thanks for keeping us updated on this subject.
dsync is definitely killing the ssd.

I don't have much to add; I'm just surprised that you're only getting 5299 with 
0.85, since I've been able to get 6.4K. Well, I was using the 200GB model, that 
might explain this.


On 12 Sep 2014, at 16:32, Alexandre DERUMIER  wrote:


here the results for the intel s3500

max performance is with ceph 0.85 + optracker disabled.
intel s3500 don't have d_sync problem like crucial

%util show almost 100% for read and write, so maybe the ssd disk performance is 
the limit.

I have some stec zeusram 8GB in stock (I used them for zfs zil), I'll try to 
bench them next week.






INTEL s3500
---
raw disk


randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
--iodepth=32 --group_reporting --invalidate=0 --name=abc
--ioengine=aio bw=288207KB/s, iops=72051

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0,00 0,00 73454,000,00 293816,00 0,00 8,00
30,960,420,420,00   0,01  99,90

randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
--iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
--sync=1 bw=48131KB/s, iops=12032
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0,00 0,000,00 24120,00 0,00 48240,00 4,00 
2,080,090,000,09   0,04 100,00


ceph 0.80
-
randread: no tuning:  bw=24578KB/s, iops=6144


randwrite: bw=10358KB/s, iops=2589
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0,00   373,000,00 8878,00 0,00 34012,50 7,66 
1,630,180,000,18   0,06  50,90


ceph 0.85 :
-

randread :  bw=41406KB/s, iops=10351
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   2,00 0,00 10425,000,00 41816,00 0,00 8,02 
1,360,130,130,00   0,07  75,90

randwrite : bw=17204KB/s, iops=4301

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0,00   333,000,00 9788,00 0,00 57909,0011,83 
1,460,150,000,15   0,07  67,80


ceph 0.85 tuning op_tracker=false


randread :  bw=86537KB/s, iops=21634
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb  25,00 0,00 21428,000,00 86444,00 0,00 8,07 
3,130,150,150,00   0,05  98,00

randwrite:  bw=21199KB/s, iops=5299
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0,00  1563,000,00 9880,00 0,00 75223,5015,23 
2,090,210,000,21   0,07  80,00


- Original Message -

From: "Alexandre DERUMIER" 
To: "Cedric Lemarchand" 
Cc: ceph-users@lists.ceph.com
Sent: Friday, 12 September 2014 08:15:08
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
3, 2K IOPS

results of fio on rbd with kernel patch



fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
result):
---
bw=12327KB/s, iops=3081

So not much better than before, but this time, iostat shows only 15%
utils, and latencies are lower

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await
r_await w_await svctm %util sdb 0,00 29,00 0,00 3075,00 0,00 36748,50
23,90 0,29 0,10 0,00 0,10 0,05 15,20


So, the write bottleneck seem to be in ceph.



I will send s3500 result today

- Original Message -

From: "Alexandre DERUMIER" 
To: "Cedric Lemarchand" 
Cc: ceph-users@lists.ceph.com
Sent: Friday, 12 September 2014 07:58:05
Subject: Re: [c

Re: [ceph-users] getting ulimit set error while installing ceph in admin node

2014-09-18 Thread Subhadip Bagui
Thanks John,

I tried what you suggested and disabled SELinux and requiretty as well,
but I am still getting the same issue.
I have attached the log for debugging. Please help to resolve this.


Regards,
Subhadip
---

On Thu, Sep 18, 2014 at 3:30 AM, John Wilkins 
wrote:

> Subhadip,
>
> I updated the master branch of the preflight docs here:
> http://ceph.com/docs/master/start/  We did encounter some issues that
> were resolved with those preflight steps.
>
> I think it might be either requiretty or SELinux. I will keep you
> posted. Let me know if it helps.
>
> On Wed, Sep 17, 2014 at 12:13 PM, Subhadip Bagui 
> wrote:
> > Hi,
> >
> > any suggestions ?
> >
> > Regards,
> > Subhadip
> >
> >
> ---
> >
> > On Wed, Sep 17, 2014 at 9:05 AM, Subhadip Bagui 
> wrote:
> >>
> >> Hi
> >>
> >> I'm getting the below error while installing ceph in admin node. Please
> >> let me know how to resolve the same.
> >>
> >>
> >> [ceph@ceph-admin ceph-cluster]$ ceph-deploy mon create-initial
> ceph-admin
> >>
> >>
> >> [ceph_deploy.conf][DEBUG ] found configuration file at:
> >> /home/ceph/.cephdeploy.conf
> >>
> >> [ceph_deploy.cli][INFO  ] Invoked (1.5.14): /usr/bin/ceph-deploy mon
> >> create-initial ceph-admin
> >>
> >> [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts ceph-admin
> >>
> >> [ceph_deploy.mon][DEBUG ] detecting platform for host ceph-admin ...
> >>
> >> [ceph-admin][DEBUG ] connected to host: ceph-admin
> >>
> >> [ceph-admin][DEBUG ] detect platform information from remote host
> >>
> >> [ceph-admin][DEBUG ] detect machine type
> >>
> >> [ceph_deploy.mon][INFO  ] distro info: CentOS 6.5 Final
> >>
> >> [ceph-admin][DEBUG ] determining if provided host has same hostname in
> >> remote
> >>
> >> [ceph-admin][DEBUG ] get remote short hostname
> >>
> >> [ceph-admin][DEBUG ] deploying mon to ceph-admin
> >>
> >> [ceph-admin][DEBUG ] get remote short hostname
> >>
> >> [ceph-admin][DEBUG ] remote hostname: ceph-admin
> >>
> >> [ceph-admin][DEBUG ] write cluster configuration to
> >> /etc/ceph/{cluster}.conf
> >>
> >> [ceph-admin][DEBUG ] create the mon path if it does not exist
> >>
> >> [ceph-admin][DEBUG ] checking for done path:
> >> /var/lib/ceph/mon/ceph-ceph-admin/done
> >>
> >> [ceph-admin][DEBUG ] done path does not exist:
> >> /var/lib/ceph/mon/ceph-ceph-admin/done
> >>
> >> [ceph-admin][INFO  ] creating keyring file:
> >> /var/lib/ceph/tmp/ceph-ceph-admin.mon.keyring
> >>
> >> [ceph-admin][DEBUG ] create the monitor keyring file
> >>
> >> [ceph-admin][INFO  ] Running command: sudo ceph-mon --cluster ceph
> --mkfs
> >> -i ceph-admin --keyring /var/lib/ceph/tmp/ceph-ceph-admin.mon.keyring
> >>
> >> [ceph-admin][DEBUG ] ceph-mon: set fsid to
> >> a36227e3-a39f-41cb-bba1-fea098a4fc65
> >>
> >> [ceph-admin][DEBUG ] ceph-mon: created monfs at
> >> /var/lib/ceph/mon/ceph-ceph-admin for mon.ceph-admin
> >>
> >> [ceph-admin][INFO  ] unlinking keyring file
> >> /var/lib/ceph/tmp/ceph-ceph-admin.mon.keyring
> >>
> >> [ceph-admin][DEBUG ] create a done file to avoid re-doing the mon
> >> deployment
> >>
> >> [ceph-admin][DEBUG ] create the init path if it does not exist
> >>
> >> [ceph-admin][DEBUG ] locating the `service` executable...
> >>
> >> [ceph-admin][INFO  ] Running command: sudo /sbin/service ceph -c
> >> /etc/ceph/ceph.conf start mon.ceph-admin
> >>
> >> [ceph-admin][DEBUG ] === mon.ceph-admin ===
> >>
> >> [ceph-admin][DEBUG ] Starting Ceph mon.ceph-admin on ceph-admin...
> >>
> >> [ceph-admin][DEBUG ] failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i
> >> ceph-admin --pid-file /var/run/ceph/mon.ceph-admin.pid -c
> >> /etc/ceph/ceph.conf --cluster ceph '
> >>
> >> [ceph-admin][DEBUG ] Starting ceph-create-keys on ceph-admin...
> >>
> >> [ceph-admin][WARNIN] No data was received after 7 seconds,
> >> disconnecting...
> >>
> >> [ceph-admin][INFO  ] Running command: sudo ceph --cluster=ceph
> >> --admin-daemon /var/run/ceph/ceph-mon.ceph-admin.asok mon_status
> >>
> >> [ceph-admin][ERROR ] admin_socket: exception getting command
> descriptions:
> >> [Errno 2] No such file or directory
> >>
> >> [ceph-admin][WARNIN] monitor: mon.ceph-admin, might not be running yet
> >>
> >> [ceph-admin][INFO  ] Running command: sudo ceph --cluster=ceph
> >> --admin-daemon /var/run/ceph/ceph-mon.ceph-admin.asok mon_status
> >>
> >> [ceph-admin][ERROR ] admin_socket: exception getting command
> descriptions:
> >> [Errno 2] No such file or directory
> >>
> >> [ceph-admin][WARNIN] ceph-admin is not defined in `mon initial members`
> >>
> >> [ceph-admin][WARNIN] monitor ceph-admin does not exist in monmap
> >>
> >> [ceph-admin][WARNIN] neither `public_addr` nor `public_network` keys are
> >> defined for monitors
> >>
> >> [ceph-admin][WARNIN] monitors may not be able to form quorum
> >>
> >> [ceph_deploy.mon][INFO  ] 
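
For reference, those last warnings point at the generated ceph.conf; the
entries they refer to look roughly like this (a sketch: the fsid is the one
from the log above, the monitor address and network below are hypothetical):

[global]
fsid = a36227e3-a39f-41cb-bba1-fea098a4fc65
mon initial members = ceph-admin
mon host = 192.168.1.10
public network = 192.168.1.0/24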

[ceph-users] Timeout on ceph-disk activate

2014-09-18 Thread BG
I've hit a timeout issue on calls to ceph-disk activate.

Initially, I followed the 'Storage Cluster Quick Start' on the CEPH website to
get a cluster up and running. I wanted to tweak the configuration however and
decided to blow away the initial setup using the purge / purgedata / forgetkeys
commands with ceph-deploy.

Next time around I'm getting a timeout error when attempting to activate an OSD
on two out of the three boxes I'm using:
[ceph_deploy.cli][INFO  ] Invoked (1.5.15): /usr/bin/ceph-deploy osd activate
hp10:/data/osd
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks hp10:/data/osd:
[hp10][DEBUG ] connected to host: hp10 
[hp10][DEBUG ] detect platform information from remote host
[hp10][DEBUG ] detect machine type
[ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.0.1406 Core
[ceph_deploy.osd][DEBUG ] activating host hp10 disk /data/osd
[ceph_deploy.osd][DEBUG ] will use init type: sysvinit
[hp10][INFO  ] Running command: sudo ceph-disk -v activate --mark-init sysvinit
--mount /data/osd
[hp10][WARNIN] No data was received after 300 seconds, disconnecting...
[hp10][INFO  ] checking OSD status...
[hp10][INFO  ] Running command: sudo ceph --cluster=ceph osd stat --format=json

This is on CentOS 7, ceph-deploy version is 1.5.15. The firewalld service is
disabled, network connectivity should be good as the cluster previously worked
on these boxes.
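
For reference, the step that times out can also be run by hand on the affected
node (it is the same command ceph-deploy issues, per the log above), which
should show where it blocks:

sudo ceph-disk -v activate --mark-init sysvinit --mount /data/osd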

Any suggestions where I should start looking to track down the root cause
of the timeout?



Re: [ceph-users] three way replication on pool a failed

2014-09-18 Thread Michael

On 18/09/2014 13:50, m.channappa.nega...@accenture.com wrote:

Even after setting replication size 3, my data is not getting replicated to 
all 3 nodes.

Example:
root@Cephadmin:/home/oss# ceph osd map storage check1
osdmap e122 pool 'storage' (9) object 'check1' -> pg 9.7c9c5619 (9.1) -> up 
([0,2,1], p0) acting ([0,2,1], p0)


pg 9.7c9c5619 (9.1) -> up ([0,2,1], p0) acting ([0,2,1], p0)

Right here it says your data is being replicated for that PG across osd.0, osd.2 
and osd.1 ([0,2,1]), so yes, your data is being replicated across the three nodes.


But if I shut down 2 of the nodes I am unable to access the data. In that 
scenario I would expect to still be able to read/write data, since my 3rd node 
is up (if my understanding is correct). Please let me know where I am wrong.


Where are your mons situated? If you have 3 mons across 3 nodes, once two are 
shut down you'll only have 1 mon left; 1 out of 3 cannot form a quorum, so the 
cluster will stop taking data to prevent split-brain scenarios. For 2 nodes to 
be down and the cluster to continue to operate you'd need a minimum of 5 mons, 
or you'd need to move your mons away from your OSDs.
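
For reference, the monitor quorum state at any point can be checked with, e.g.:

ceph quorum_status --format json-pretty
ceph mon stat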

-Michael



Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-18 Thread Chen, Xiaoxi
Hi Mark
 It's client IOPS, and we use replica = 2; journal and OSD are hosted on the 
same SSDs, so the real IOPS is 23K * 2 * 2 = ~90K, still far from the HW limit 
(30K+ for a single DC S3700).
 CPU% is ~62% at peak (2 VMs), with interrupts distributed.
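
To spell the IOPS arithmetic out: 23K client write IOPS x 2 replicas = ~46K 
OSD-level writes, and with journal and data on the same SSDs each of those hits 
the device twice, which is where the ~90K figure comes from; spread over the 12 
SSDs that is well under 10K writes per device, far below the 30K+ a single 
DC S3700 can sustain.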
 
One additional piece of information: the cluster seems rather busy, and we do 
notice the Ceph cluster becomes unstable after we increase the number of client 
VMs to 7. Lots of heartbeat timeouts (I believe it's due to OP thread timeouts 
leading the OSDs to drop the ping requests, but I haven't checked that yet).

Xiaoxi

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Thursday, September 18, 2014 11:06 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS

Couple of questions:  Are those client IOPS or real IOPS after replication and 
journal writes?  Also, how's CPU usage?  Are interrupts being distributed to 
all cores?

Mark

On 09/18/2014 01:12 AM, Zhang, Jian wrote:
> Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> We are able to get ~18K IOPS for 4K random read on a single volume with fio 
> (with rbd engine) on a 12x DC3700 Setup, but only able to get ~23K (peak) 
> IOPS even with multiple volumes.
> Seems the maximum random write performance we can get on the entire cluster 
> is quite close to single volume performance.
>
> Thanks
> Jian
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Sebastien Han
> Sent: Tuesday, September 16, 2014 9:33 PM
> To: Alexandre DERUMIER
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go 
> over 3, 2K IOPS
>
> Hi,
>
> Thanks for keeping us updated on this subject.
> dsync is definitely killing the ssd.
>
> I don't have much to add; I'm just surprised that you're only getting 5299 
> with 0.85, since I've been able to get 6.4K. Well, I was using the 200GB model, 
> that might explain this.
>
>
> On 12 Sep 2014, at 16:32, Alexandre DERUMIER  wrote:
>
>> here the results for the intel s3500
>> 
>> max performance is with ceph 0.85 + optracker disabled.
>> intel s3500 don't have d_sync problem like crucial
>>
>> %util show almost 100% for read and write, so maybe the ssd disk performance 
>> is the limit.
>>
>> I have some stec zeusram 8GB in stock (I used them for zfs zil), I'll try to 
>> bench them next week.
>>
>>
>>
>>
>>
>>
>> INTEL s3500
>> ---
>> raw disk
>> 
>>
>> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc 
>> --ioengine=aio bw=288207KB/s, iops=72051
>>
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb   0,00 0,00 73454,000,00 293816,00 0,00 8,00 
>>30,960,420,420,00   0,01  99,90
>>
>> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
>> --sync=1 bw=48131KB/s, iops=12032
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb   0,00 0,000,00 24120,00 0,00 48240,00 4,00  
>>2,080,090,000,09   0,04 100,00
>>
>>
>> ceph 0.80
>> -
>> randread: no tuning:  bw=24578KB/s, iops=6144
>>
>>
>> randwrite: bw=10358KB/s, iops=2589
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb   0,00   373,000,00 8878,00 0,00 34012,50 7,66   
>>   1,630,180,000,18   0,06  50,90
>>
>>
>> ceph 0.85 :
>> -
>>
>> randread :  bw=41406KB/s, iops=10351
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb   2,00 0,00 10425,000,00 41816,00 0,00 8,02  
>>1,360,130,130,00   0,07  75,90
>>
>> randwrite : bw=17204KB/s, iops=4301
>>
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb   0,00   333,000,00 9788,00 0,00 57909,0011,83   
>>   1,460,150,000,15   0,07  67,80
>>
>>
>> ceph 0.85 tuning op_tracker=false
>> 
>>
>> randread :  bw=86537KB/s, iops=21634
>> Device:         rrqm/s   wrqm/s      r/s      w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb              25,00     0,00 21428,00     0,00  86444,00     0,00     8,07     3,13    0,15    0,15    0,00   0,05  98,00
>>
>> randwrite:  bw=21199KB/s, iops=5299
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB

Re: [ceph-users] ceph mds unable to start with 0.85

2014-09-18 Thread Gregory Farnum
On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋  wrote:
> dear,
>  my ceph cluster worked for about two weeks, but the mds crashed every 2-3
> days.
> Now it is stuck on replay; it looks like replay crashes and the mds process
> restarts again.
>  what can i do about this?
>
>  1015 => # ceph -s
> cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18
> health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is
> degraded; noscrub,nodeep-scrub flag(s) set
> monmap e1: 2 mons at
> {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election
> epoch 26, quorum 0,1 storage-1-213,storage-1-214
> mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby
> osdmap e1932: 18 osds: 18 up, 18 in
> flags noscrub,nodeep-scrub
> pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects
> 4479 GB used, 32292 GB / 36772 GB avail
> 444 active+clean
> 56 active+clean+inconsistent
> client io 125 MB/s rd, 31 op/s
>
> MDS log here:
>
> 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620
> from mon.0
> 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby
> 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621
> from mon.0
> 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now
> mds.0.272
> 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state
> change up:standby --> up:replay
> 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start
> 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set
> 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is
> 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929,
> have 1927
> 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929
> (which blacklists prior instance)
> 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for
> myself; not doing anything
> 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening
> inotable
> 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening
> sessionmap
> 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening
> mds log
> 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log
> bounds
> 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening
> snap table
> 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.213:6806/6114
> 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.213:6811/6385
> 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.213:6801/6110
> 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200
> to recover...
> 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.214:6806/6238
> 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered.
> 2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in
> format 1
> 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1:
> loading/discovering base inodes
> 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode
> with ino:100
> 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode
> with ino:1
> 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.214:6811/6276
> 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying
> mds log
> 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n()
> hs=0+0,ss=0+0 0x5da]
> 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1
> / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 |
> subtree=1 0x5da]
> 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824
> f() n() hs=0+0,ss=0+0 0x5da03b8]
> 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir
> 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 |
> subtree=1 0x5da03b8]
> 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache
> adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth
> v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09
> 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135
> 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226
> b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da]
> bound_dfs []
> 2014-09-18 12:38:55.675116 7f823ca08700 7 mds.0.cache
> adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth
> v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09
> 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135
> 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226
> b1824464654503 31746894=31708437+38457) hs=0+0,ss

Re: [ceph-users] Still seing scrub errors in .80.5

2014-09-18 Thread Gregory Farnum
On Thu, Sep 18, 2014 at 3:09 AM, Marc  wrote:
> Hi,
>
> we did run a deep scrub on everything yesterday, and a repair
> afterwards. Then a new deep scrub today, which brought new scrub errors.
>
> I did check the osd config, they report "filestore_xfs_extsize": "false",
> as it should be if I understood things correctly.
>
> FTR the deep scrub has been initiated like this:
>
> for pgnum in `ceph pg dump|grep active|awk '{print $1}'`; do ceph pg
> deep-scrub $pgnum; done
>
> How do we proceed from here?

Did the deep scrubs all actually complete yesterday, so these are new
errors and not just scrubs which weren't finished until now?

If so, I'd start looking at the scrub errors and which OSDs are
involved. Hopefully they'll have one or a few OSDs in common that you
can examine more closely.
But like I said before, my money's on faulty hardware or local
filesystems. Depending on how you're set up it's probably a good idea
to just start checking dmesg for any indications of trouble before you
start tackling it from the RADOS side.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
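
For anyone wanting to do that, a rough way to see which OSDs the inconsistent PGs have in common and to check the hardware side is something like the following (log path and device name are assumptions, adjust to your layout):

  ceph health detail | grep inconsistent       # each inconsistent PG with its acting OSD set
  ceph pg dump | grep inconsistent
  grep -i 'scrub' /var/log/ceph/ceph-osd.*.log | grep -i error
  dmesg | egrep -i 'xfs|i/o error|sector'      # on every OSD host
  smartctl -a /dev/sdX                         # per suspect disk (needs smartmontools)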
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS : rm file does not remove object in rados

2014-09-18 Thread Gregory Farnum
On Thu, Sep 18, 2014 at 10:39 AM, Florent B  wrote:
> On 09/12/2014 07:38 PM, Gregory Farnum wrote:
>> On Fri, Sep 12, 2014 at 6:49 AM, Florent Bautista  
>> wrote:
>>> Hi all,
>>>
>>> Today I have a problem using CephFS. I use firefly last release, with
>>> kernel 3.16 client (Debian experimental).
>>>
>>> I have a directory in CephFS, associated to a pool "pool2" (with
>>> set_layout).
>>>
>>> All is working fine, I can add and remove files, objects are stored in
>>> the right pool.
>>>
>>> But when Ceph cluster is overloaded (or for another reason, I don't
>>> know), sometimes when I remove a file, objects are not deleted in rados !
>> CephFS file removal is asynchronous with you removing it from the
>> filesystem. The files get moved into a "stray" directory and will get
>> deleted once nobody holds references to them any more.
>
> My client is the only mounted and does not use files.

"does not use files"...what?

>
> This problems occurs when I delete files with "rm", but not when I use
> given rsync command.
>
>>
>>> I explain : I want to remove a large directory, containing millions of
>>> files. For a moment, objects are really deleted in rados (I see it in
>>> "rados df"), but when I start to do some heavy operations (like moving
>>> volumes in rdb), objects are not deleted anymore, "rados df" returns a
>>> fixed number of objects. I can see that files are still deleting because
>>> I use rsync (rsync -avP --stats --delete /empty/dir/ /dir/to/delete/).
>> What do you mean you're rsyncing and can see files deleting? I don't 
>> understand.
>
> When you run command I gave, syncing an empty dir with the dir you want
> deleted, rsync is telling you "Deleting (file)" for each file to unlink.
>
>>
>> Anyway, It's *possible* that the client is holding capabilities on the
>> deleted files and isn't handing them back, in which case unmounting it
>> would drop them (and then you could remount). I don't think we have
>> any commands designed to hasten that, though.
>
> unmounting does not help.
>
> When I unlink() via rsync, objects are deleted in rados (it makes all
> cluster slow down, and have slow requests).
>
> When I use rm command, it is much faster but objects are not deleted in
> rados !

I think you're not doing what you think you're doing, then...those two
actions should look the same to CephFS.

>
> When I re-mount root CephFS, there are no files, all empty.
>
> But still have 125 MB of objects in "metadata" pool and 21.57 GB in my
> data pool (and it does not decrease...)...

Well, the metadata pool is never going to be emptied; that holds your
MDS journals. The data pool might not get entirely empty either; how
many objects does it say it has?
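
A quick way to get those numbers, for what it's worth (replace pool2 with the actual data pool name):

  ceph df
  rados df
  rados -p pool2 ls | wc -l      # object count left in the data pool
  rados -p pool2 ls | head       # spot-check what the leftover objects are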
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Can't Start-up MDS

2014-09-18 Thread Gregory Farnum
None of your PGs exist. Since you only have one OSD, they're probably
not capable of fulfilling their default size requirements. You should
go through the generic quick start guides and configuration before
moving on to using the MDS.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
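
For a throwaway single-OSD box, the usual workaround is to relax the replication and CRUSH defaults, roughly like this (pool names assume the firefly defaults):

  # in ceph.conf before creating the cluster:
  [global]
      osd pool default size = 1
      osd crush chooseleaf type = 0

  # or, on an existing cluster, per pool:
  for p in data metadata rbd; do
      ceph osd pool set $p size 1
      ceph osd pool set $p min_size 1
  done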


On Wed, Sep 17, 2014 at 11:41 PM, Shun-Fa Yang  wrote:
> hi Gregory,
>
> Thanks for your response.
>
> I installed ceph v0.80.5 on a single node, and my mds status is always
> "creating".
>
> The output of "ceph -s" as following:
>
> root@ubuntu165:~# ceph -s
> cluster 3cd658c3-34ca-43f3-93c7-786e5162e412
>  health HEALTH_WARN 200 pgs incomplete; 200 pgs stuck inactive; 200 pgs
> stuck unclean; 50 requests are blocked > 32 sec
>  monmap e1: 1 mons at {ubuntu165=10.62.170.165:6789/0}, election epoch
> 1, quorum 0 ubuntu165
>  mdsmap e19: 1/1/1 up {0=ubuntu165=up:creating}
>  osdmap e32: 1 osds: 1 up, 1 in
>   pgmap v64: 200 pgs, 4 pools, 0 bytes data, 0 objects
> 1059 MB used, 7448 GB / 7449 GB avail
>  200 creating+incomplete
> root@ubuntu165:~# ceph -v
> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>
>
> thanks.
>
> 2014-09-18 1:22 GMT+08:00 Gregory Farnum :
>>
>> That looks like the beginning of an mds creation to me. What's your
>> problem in more detail, and what's the output of "ceph -s"?
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Mon, Sep 15, 2014 at 5:34 PM, Shun-Fa Yang  wrote:
>> > Hi all,
>> >
>> > I'm installed ceph v 0.80.5 on Ubuntu 14.04 server version by using
>> > apt-get...
>> >
>> > The log of mds shows as following:
>> >
>> > 2014-09-15 17:24:58.291305 7fd6f6d47800  0 ceph version 0.80.5
>> > (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 10487
>> >
>> > 2014-09-15 17:24:58.302164 7fd6f6d47800 -1 mds.-1.0 *** no OSDs are up
>> > as of
>> > epoch 8, waiting
>> >
>> > 2014-09-15 17:25:08.302930 7fd6f6d47800 -1 mds.-1.-1 *** no OSDs are up
>> > as
>> > of epoch 8, waiting
>> >
>> > 2014-09-15 17:25:19.322092 7fd6f1938700  1 mds.-1.0 handle_mds_map
>> > standby
>> >
>> > 2014-09-15 17:25:19.325024 7fd6f1938700  1 mds.0.3 handle_mds_map i am
>> > now
>> > mds.0.3
>> >
>> > 2014-09-15 17:25:19.325026 7fd6f1938700  1 mds.0.3 handle_mds_map state
>> > change up:standby --> up:creating
>> >
>> > 2014-09-15 17:25:19.325196 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:1
>> >
>> > 2014-09-15 17:25:19.325377 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:100
>> >
>> > 2014-09-15 17:25:19.325381 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:600
>> >
>> > 2014-09-15 17:25:19.325449 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:601
>> >
>> > 2014-09-15 17:25:19.325489 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:602
>> >
>> > 2014-09-15 17:25:19.325538 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:603
>> >
>> > 2014-09-15 17:25:19.325564 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:604
>> >
>> > 2014-09-15 17:25:19.325603 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:605
>> >
>> > 2014-09-15 17:25:19.325627 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:606
>> >
>> > 2014-09-15 17:25:19.325655 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:607
>> >
>> > 2014-09-15 17:25:19.325682 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:608
>> >
>> > 2014-09-15 17:25:19.325714 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:609
>> >
>> > 2014-09-15 17:25:19.325738 7fd6f1938700  0 mds.0.cache creating system
>> > inode
>> > with ino:200
>> >
>> > Could someone tell me how to solve it?
>> >
>> > Thanks.
>> >
>> > --
>> > 楊順發(yang shun-fa)
>> >
>> > ___
>> > Ceph-community mailing list
>> > ceph-commun...@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>> >
>
>
>
>
> --
> 楊順發(yang shun-fa)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Steven Timm

Using image type raw actually got kvm to create the VM

but then the virt-viewer console shows

Booting from Hard Disk
Geom Error
---
We do not even get as far as GRUB.


Below is the network stanza from XML.









uuid='3bd5a6a5-6b2a-44b2-a07

5-ec9da47ae3f4'/>






Any other tweak I might be missing?
Thanks

Steve Timm



By the way--the following is what the raw file in question
looked like before I loaded it into CEPH

[root@one4dev timm]# file gcso_sl6_giwms.raw
gcso_sl6_giwms.raw: x86 boot sector; GRand Unified Bootloader, stage1 
version 0x3, boot drive 0x80, 1st sector stage2 0x1307f70, GRUB version 
0.94; partition 1: ID=0x83, active, starthead 32, startsector 2048, 
6291456 sectors, code offset 0x4




On Thu, 18 Sep 2014, Steven Timm wrote:


thanks Luke, I will try that.

Steve


On Wed, 17 Sep 2014, Luke Jing Yuan wrote:


 Hi,

 From the ones we managed to configure in our lab here. I noticed that
 using image format "raw" instead of "qcow2" worked for us.

 Regards,
 Luke

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Steven Timm
 Sent: Thursday, 18 September, 2014 5:01 AM
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm


 I am trying to use Ceph as a data store with OpenNebula 4.6 and have
 followed the instructions in OpenNebula's documentation at
 http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html

 and compared them against the "using libvirt with ceph"

 http://ceph.com/docs/master/rbd/libvirt/

 We are using the ceph-recompiled qemu-kvm and qemu-img as found at

 http://ceph.com/packages/qemu-kvm/

 under Scientific Linux 6.5 which is a Redhat clone.  Also a kernel-lt-3.10
 kernel.

 [root@fgtest15 qemu]# kvm -version
 QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
 2003-2008 Fabrice Bellard


 From qemu-img

 Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat qcow2
 qed parallels nbd blkdebug host_cdrom host_floppy host_device file rbd


 --
 Libvirt is trying to execute the following KVM command:

 2014-09-17 19:50:12.774+: starting up
 LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
 /usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m 4096
 -smp 2,sockets=2,cores=1,threads=1 -uuid
 572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
 
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
 -device
 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive
 
file=/var/lib/one//datastores/102/60/disk.1,if=none,id=drive-virtio-disk1,format=raw,cache=none
 -device
 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
 -drive
 
file=/var/lib/one//datastores/102/60/disk.2,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:02:0b:04,bus=pci.0,addr=0x3
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:60 -k en-us -vga
 cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 char device redirected to /dev/pts/3
 qemu-kvm: -drive
 
file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,format=qcow2,cache=none:
 could not open disk image
 
rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZlQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;stkendca04a\:6789\;stkendca02a\:6789:
 Invalid argument
 2014-09-17 19:50:12.980+: shutting down

 ---

 just to show that from the command line I can see the rbd pool fine

 [root@fgtest15 qemu]# rbd list one
 foo
 one-19
 one-19-58-0
 one-19-60-0
 [root@fgtest15 qemu]# rbd info one/one-19-60-0
 rbd image 'one-19-60-0':
size 40960 MB in 10240 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.3c39.238e1f29
format: 1


 and even mount stuff with rbd map, etc.

 It's only inside libvirt that we had the problem.

 At first we were getting "p

Re: [ceph-users] ceph mds unable to start with 0.85

2014-09-18 Thread 廖建锋
if i turn on debug=20, the log will be more than 100G,

and there seems to be no way to post something that big.  do you have any other good way to figure it out?

would you like to log into the server to check?


From: Gregory Farnum
Date: 2014-09-19 02:33
To: 廖建锋
CC: ceph-users
Subject: Re: [ceph-users] ceph mds unable to start with 0.85

On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋  wrote:
> dear,
>  my ceph cluster worked for about two weeks,  mds crashed every 2-3
> days,
> Now it stuck on replay , looks like replay crash and restart mds process
> again
>  what can i do for this?
>
>  1015 => # ceph -s
> cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18
> health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is
> degraded; noscrub,nodeep-scrub flag(s) set
> monmap e1: 2 mons at
> {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election
> epoch 26, quorum 0,1 storage-1-213,storage-1-214
> mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby
> osdmap e1932: 18 osds: 18 up, 18 in
> flags noscrub,nodeep-scrub
> pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects
> 4479 GB used, 32292 GB / 36772 GB avail
> 444 active+clean
> 56 active+clean+inconsistent
> client io 125 MB/s rd, 31 op/s
>
> MDS log here:
>
> 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620
> from mon.0
> 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby
> 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621
> from mon.0
> 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now
> mds.0.272
> 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state
> change up:standby --> up:replay
> 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start
> 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set
> 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is
> 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929,
> have 1927
> 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929
> (which blacklists prior instance)
> 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for
> myself; not doing anything
> 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening
> inotable
> 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening
> sessionmap
> 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening
> mds log
> 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log
> bounds
> 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening
> snap table
> 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.213:6806/6114
> 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.213:6811/6385
> 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.213:6801/6110
> 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200
> to recover...
> 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.214:6806/6238
> 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered.
> 2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in
> format 1
> 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1:
> loading/discovering base inodes
> 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode
> with ino:100
> 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode
> with ino:1
> 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on
> 10.1.0.214:6811/6276
> 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying
> mds log
> 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n()
> hs=0+0,ss=0+0 0x5da]
> 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1
> / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 |
> subtree=1 0x5da]
> 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824
> f() n() hs=0+0,ss=0+0 0x5da03b8]
> 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir
> 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 |
> subtree=1 0x5da03b8]
> 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache
> adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth
> v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09
> 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135
> 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226
> b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da]
> bound_df

Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

2014-09-18 Thread Craig Lewis
No, removing the snapshots didn't solve my problem.  I eventually traced
this problem to XFS deadlocks caused by
[osd]
  "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s
size=4096"

Changing to just "-s size=4096", and reformatting all OSDs solved this
problem.


Since then, I ran into http://tracker.ceph.com/issues/5699.  Snapshots are
off until I've deployed Firefly.


On Wed, Sep 17, 2014 at 8:09 AM, Florian Haas  wrote:

> Hi Craig,
>
> just dug this up in the list archives.
>
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis 
> wrote:
> > In the interest of removing variables, I removed all snapshots on all
> pools,
> > then restarted all ceph daemons at the same time.  This brought up osd.8
> as
> > well.
>
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
>
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything
> in the description sounds familiar?
>
> You mentioned in a later message in the same thread that you would
> keep your snapshot script running and "repeat the experiment". Did the
> situation change in any way after that? Did the issue come back? Or
> did you just stop using snapshots altogether?
>
> Cheers,
> Florian
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd going down every 15m blocking recovery from degraded state

2014-09-18 Thread Craig Lewis
The magic in Sage's steps was really setting noup.  That gives the OSD time
to apply the osdmap changes, without starting the timeout.  Set noup,
nodown, noout, restart the OSD, and wait until the CPU usage goes to zero.
 Some of mine took 5 minutes.  Once it's done, unset noup, and restart
again.  The OSD should join the cluster, and not spin the CPU forever.
 Repeat for every OSD.
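
For reference, the whole procedure boils down to roughly this per OSD (osd.8 is just an example id, and the restart syntax assumes Ubuntu/upstart; use your init system's equivalent):

  ceph osd set noup
  ceph osd set nodown
  ceph osd set noout
  restart ceph-osd id=8      # let it chew through the osdmap backlog
  # wait until the ceph-osd process drops to ~0% CPU, then:
  ceph osd unset noup
  restart ceph-osd id=8
  # repeat for each OSD, then clear the remaining flags:
  ceph osd unset nodown
  ceph osd unset noout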


The XFS params caused my OSDs to crash often enough to cause the big osdmap
backlog.  I was seeing "XFS: possible memory allocation deadlock in
kmem_alloc" in dmesg.  ceph.conf had
[osd]
   "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s
size=4096"

I fixed the problem by changing the config to
[osd]
   "osd mkfs options xfs": "-s size=4096"

Then reformated every OSD in my cluster (one at a time).  The -n size=64k
was the problem.  It looks like the 3.14 kernels have a fix:
http://tracker.ceph.com/issues/6301.  Upgrading the kernel might be less
painful than reformatting everything.


On Tue, Sep 16, 2014 at 3:19 PM, Christopher Thorjussen <
christopher.thorjus...@onlinebackupcompany.com> wrote:

> I've been throught your post many times (google likes it ;)
> I've been trying all the noout/nodown/noup.
> But I will look into the XFS issue you are talking about. And read all of
> the post one more time..
>
> /C
>
>
> On Wed, Sep 17, 2014 at 12:01 AM, Craig Lewis 
> wrote:
>
>> I ran into a similar issue before.  I was having a lot of OSD crashes
>> caused by XFS memory allocation deadlocks.  My OSDs crashed so many times
>> that they couldn't replay the OSD Map before they would be marked
>> unresponsive.
>>
>> See if this sounds familiar:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html
>>
>> If so, Sage's procedure to apply the osdmaps fixed my cluster:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html
>>
>>
>>
>>
>>
>> On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen <
>> christopher.thorjus...@onlinebackupcompany.com> wrote:
>>
>>> I've got several osds that are spinning at 100%.
>>>
>>> I've retained some professional services to have a look. Its out of my
>>> newbie reach..
>>>
>>> /Christopher
>>>
>>> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis >> > wrote:
>>>
 Is it using any CPU or Disk I/O during the 15 minutes?

 On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
 christopher.thorjus...@onlinebackupcompany.com> wrote:

> I'm waiting for my cluster to recover from a crashed disk and a second
> osd that has been taken out (crushmap, rm, stopped).
>
> Now I'm stuck at looking at this output ('ceph -w') while my osd.58
> goes down every 15 minute.
>
> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
>
> Here is a log from when I restarted osd.58 and through the next reboot
> 15 minutes later: http://pastebin.com/rt64vx9M
> Short, it just waits for 15 minutes not doing anything and then goes
> down putting lots of lines like this in the log for that osd:
>
> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234
> >> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159
> cs=1 l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234
> >> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370
> cs=1 l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>
> Then I have to restart it. And it repeats.
>
> What should/can I do? Take it out?
>
> I've got 4 servers with 24 disks each. Details about servers:
> http://pastebin.com/XQeSh8gJ
> Running dumpling - 0.67.10
>
> Cheers,
> Christopher Thorjussen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

>>>
>>
>
_

Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Luke Jing Yuan
Hi Steven,

Assuming the original image was in qcow2 format, did you convert it back to raw 
before registering it?

Another tweak I did was enabling NFS and sharing the system datastore (id: 0)
from the frontend to the other hosts:

nebula@z4-hn01:~$ onedatastore list
  ID NAMESIZE AVAIL CLUSTER  IMAGES TYPE DS   TM
   0 system457.7G 92%   z4-cluster-w  0 sys  -shared
   1 default   457.7G 92%   - 0 img  fs   shared
   2 files 457.7G 92%   - 0 fil  fs   ssh
 100 mi-cloud-ceph   7.8T 81%   z4-cluster-w  3 img  ceph ceph

Regards,
Luke
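
For completeness, the convert-and-register step I mean is roughly the following (pool and image names are only examples):

  qemu-img convert -f qcow2 -O raw vm-disk.qcow2 vm-disk.raw
  rbd -p one import vm-disk.raw vm-disk

  # or, with an rbd-enabled qemu-img, convert straight into the pool:
  qemu-img convert -f qcow2 -O raw vm-disk.qcow2 rbd:one/vm-disk

It is also worth checking that the disk ends up with driver type 'raw' in the libvirt XML, since that is what libvirt passes on to qemu as format=.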

-Original Message-
From: Steven Timm [mailto:t...@fnal.gov]
Sent: Friday, 19 September, 2014 5:18 AM
To: Luke Jing Yuan
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

Using image type raw actually got kvm to create the VM

but then the virt-viewer console shows

Booting from Hard Disk
Geom Error
---
We do not even get as far as GRUB.


Below is the network stanza from XML.
















Any other tweak I might be missing?
Thanks

Steve Timm



By the way--the following is what the raw file in question looked like before I 
loaded it into CEPH

[root@one4dev timm]# file gcso_sl6_giwms.raw
gcso_sl6_giwms.raw: x86 boot sector; GRand Unified Bootloader, stage1 version 
0x3, boot drive 0x80, 1st sector stage2 0x1307f70, GRUB version 0.94; partition 
1: ID=0x83, active, starthead 32, startsector 2048,
6291456 sectors, code offset 0x4



On Thu, 18 Sep 2014, Steven Timm wrote:

> thanks Luke, I will try that.
>
> Steve
>
>
> On Wed, 17 Sep 2014, Luke Jing Yuan wrote:
>
>>  Hi,
>>
>>  From the ones we managed to configure in our lab here. I noticed
>> that  using image format "raw" instead of "qcow2" worked for us.
>>
>>  Regards,
>>  Luke
>>
>>  -Original Message-
>>  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> Behalf Of  Steven Timm
>>  Sent: Thursday, 18 September, 2014 5:01 AM
>>  To: ceph-users@lists.ceph.com
>>  Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm
>>
>>
>>  I am trying to use Ceph as a data store with OpenNebula 4.6 and have
>> followed the instructions in OpenNebula's documentation at
>> http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html
>>
>>  and compared them against the "using libvirt with ceph"
>>
>>  http://ceph.com/docs/master/rbd/libvirt/
>>
>>  We are using the ceph-recompiled qemu-kvm and qemu-img as found at
>>
>>  http://ceph.com/packages/qemu-kvm/
>>
>>  under Scientific Linux 6.5 which is a Redhat clone.  Also a
>> kernel-lt-3.10  kernel.
>>
>>  [root@fgtest15 qemu]# kvm -version
>>  QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
>>  2003-2008 Fabrice Bellard
>>
>>
>>  From qemu-img
>>
>>  Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat
>> qcow2  qed parallels nbd blkdebug host_cdrom host_floppy host_device
>> file rbd
>>
>>
>>  --
>>  Libvirt is trying to execute the following KVM command:
>>
>>  2014-09-17 19:50:12.774+: starting up  LC_ALL=C
>> PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
>> /usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m
>> 4096  -smp 2,sockets=2,cores=1,threads=1 -uuid
>>  572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults
>> -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,serve
>> r,nowait  -mon chardev=charmonitor,id=monitor,mode=control -rtc
>> base=utc  -no-shutdown -device
>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
>> file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZ
>> lQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;s
>> tkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,for
>> mat=qcow2,cache=none
>>  -device
>>
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,i
>> d=virtio-disk0,bootindex=1
>>  -drive
>>
>> file=/var/lib/one//datastores/102/60/disk.1,if=none,id=drive-virtio-d
>> isk1,format=raw,cache=none
>>  -device
>>
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,i
>> d=virtio-disk1
>>  -drive
>>
>> file=/var/lib/one//datastores/102/60/disk.2,if=none,media=cdrom,id=dr
>> ive-ide0-1-0,readonly=on,format=raw
>>  -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
>>  -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device
>>
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:02:0b:04,bus=pci.
>> 0,addr=0x3
>>  -chardev pty,id=charserial0 -device
>>  isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:60 -k en-us
>> -vga  cirrus -device
>> 

[ceph-users] do you have any test case that lost data mostlikely

2014-09-18 Thread yuelongguang
hi all,
 
i want to test some cases that are most likely to lose data.
for now i just test by killing osds.
 
do you have any such test cases?
 
 
thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?

2014-09-18 Thread Aegeaner


I noticed ceph added a key/value store OSD backend feature in firefly, but 
i can hardly find any documentation about how to use it. At last I found 
that i can add a line in ceph.conf:


osd objectstore = keyvaluestore-dev

but ceph-deploy failed when creating the OSDs. According to the log, 
ceph-disk still tried to create a journal partition but failed.


The commands i used  are:

ceph-deploy disk zap CVM-0-11:/dev/hioa

ceph-deploy osd prepare CVM-0-11:/dev/hioa

ceph-deploy osd activate CVM-0-11:/dev/hioa1

Can anyone help me to create a kvstore backend OSD?

Thanks!

=

Aegeaner
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Steven C Timm
Yes--the image was converted back to raw.
Since the image is mapped via rbd I can run fdisk on it and see both the 
partition
tables and a normal set of files inside of it.  

My system datastore is local to each node.  Have been in that mode for quite 
some time.

Steve Timm




From: Luke Jing Yuan [jyl...@mimos.my]
Sent: Thursday, September 18, 2014 9:44 PM
To: Steven C Timm
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph issue:  rbd vs. qemu-kvm

Hi Steven,

Assuming the original image was in qcow2 format, did you convert it back to raw 
before registering it?

Another tweak I did was enabling and NFS shared the system datastore (id: 0) 
from the frontend to the other hosts:

nebula@z4-hn01:~$ onedatastore list
  ID NAMESIZE AVAIL CLUSTER  IMAGES TYPE DS   TM
   0 system457.7G 92%   z4-cluster-w  0 sys  -shared
   1 default   457.7G 92%   - 0 img  fs   shared
   2 files 457.7G 92%   - 0 fil  fs   ssh
 100 mi-cloud-ceph   7.8T 81%   z4-cluster-w  3 img  ceph ceph

Regards,
Luke

-Original Message-
From: Steven Timm [mailto:t...@fnal.gov]
Sent: Friday, 19 September, 2014 5:18 AM
To: Luke Jing Yuan
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

Using image type raw actually got kvm to create the VM

but then the virt-viewer console shows

Booting from Hard Disk
Geom Error
---
We do not even get as far as GRUB.


Below is the network stanza from XML.
















Any other tweak I might be missing?
Thanks

Steve Timm



By the way--the following is what the raw file in question looked like before I 
loaded it into CEPH

[root@one4dev timm]# file gcso_sl6_giwms.raw
gcso_sl6_giwms.raw: x86 boot sector; GRand Unified Bootloader, stage1 version 
0x3, boot drive 0x80, 1st sector stage2 0x1307f70, GRUB version 0.94; partition 
1: ID=0x83, active, starthead 32, startsector 2048,
6291456 sectors, code offset 0x4



On Thu, 18 Sep 2014, Steven Timm wrote:

> thanks Luke, I will try that.
>
> Steve
>
>
> On Wed, 17 Sep 2014, Luke Jing Yuan wrote:
>
>>  Hi,
>>
>>  From the ones we managed to configure in our lab here. I noticed
>> that  using image format "raw" instead of "qcow2" worked for us.
>>
>>  Regards,
>>  Luke
>>
>>  -Original Message-
>>  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> Behalf Of  Steven Timm
>>  Sent: Thursday, 18 September, 2014 5:01 AM
>>  To: ceph-users@lists.ceph.com
>>  Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm
>>
>>
>>  I am trying to use Ceph as a data store with OpenNebula 4.6 and have
>> followed the instructions in OpenNebula's documentation at
>> http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html
>>
>>  and compared them against the "using libvirt with ceph"
>>
>>  http://ceph.com/docs/master/rbd/libvirt/
>>
>>  We are using the ceph-recompiled qemu-kvm and qemu-img as found at
>>
>>  http://ceph.com/packages/qemu-kvm/
>>
>>  under Scientific Linux 6.5 which is a Redhat clone.  Also a
>> kernel-lt-3.10  kernel.
>>
>>  [root@fgtest15 qemu]# kvm -version
>>  QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
>>  2003-2008 Fabrice Bellard
>>
>>
>>  From qemu-img
>>
>>  Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat
>> qcow2  qed parallels nbd blkdebug host_cdrom host_floppy host_device
>> file rbd
>>
>>
>>  --
>>  Libvirt is trying to execute the following KVM command:
>>
>>  2014-09-17 19:50:12.774+: starting up  LC_ALL=C
>> PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
>> /usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m
>> 4096  -smp 2,sockets=2,cores=1,threads=1 -uuid
>>  572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults
>> -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,serve
>> r,nowait  -mon chardev=charmonitor,id=monitor,mode=control -rtc
>> base=utc  -no-shutdown -device
>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
>> file=rbd:one/one-19-60-0:id=libvirt2:key=AQAV5BlU2OV7NBAApurqxG0K8UkZ
>> lQVy6hKmkA==:auth_supported=cephx\;none:mon_host=stkendca01a\:6789\;s
>> tkendca04a\:6789\;stkendca02a\:6789,if=none,id=drive-virtio-disk0,for
>> mat=qcow2,cache=none
>>  -device
>>
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,i
>> d=virtio-disk0,bootindex=1
>>  -drive
>>
>> file=/var/lib/one//datastores/102/60/disk.1,if=none,id=drive-virtio-d
>> isk1,format=raw,cache=none
>>  -device
>>
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,i

[ceph-users] confusion when kill 3 osds that store the same pg

2014-09-18 Thread yuelongguang
hi all,
in order to test ceph stability, i try killing osds.
in this case, i kill 3 osds (osd.3, osd.2, osd.0) that store the same pg 2.30.
 
---crush---
osdmap e1342 pool 'rbd' (2) object 'rbd_data.19d92ae8944a.' -> 
pg 2.c59a45b0 (2.30) -> up ([3,2,0], p3) acting ([3,2,0], p3)
[root@cephosd5-gw current]# ceph osd tree
# idweight  type name   up/down reweight
-1  0.09995 root default
-2  0.01999 host cephosd1-mona
0   0.01999 osd.0   down0
-3  0.01999 host cephosd2-monb
1   0.01999 osd.1   up  1
-4  0.01999 host cephosd3-monc
2   0.01999 osd.2   down0
-5  0.01999 host cephosd4-mdsa
3   0.01999 osd.3   down0
-6  0.01999 host cephosd5-gw
4   0.01999 osd.4   up  1
-
according to the test results, i am confused about a few things.
 
1.
[root@cephosd5-gw current]# ceph pg 2.30 query
Error ENOENT: i don't have pgid 2.30
 
why can i not query information about this pg?  how can i dump this pg?
 
2.
#ceph osd map rbd rbd_data.19d92ae8944a.
osdmap e1451 pool 'rbd' (2) object 'rbd_data.19d92ae8944a.' -> 
pg 2.c59a45b0 (2.30) -> up ([4,1], p4) acting ([4,1], p4)
 
does the 'ceph osd map' command just calculate the mapping, without checking the real pg 
state?  i do not find 2.30 on osd.1 and osd.4.
now that the client will get the new map, why does the client hang?
 
 
thanks very much
 
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?

2014-09-18 Thread Haomai Wang
Sorry for the poor documentation, it's in progress.

KeyValueStore doesn't need a journal, so we may need some changes to
ceph-disk. But I'm not familiar with it.

KeyValueStore is an experimental backend, so some bugs still exist in the
Firefly version. There are no known remaining bugs in the master branch.
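
Until ceph-disk understands this, a rough sketch of preparing a kv-backend OSD by hand would look something like the following (device, hostname and auth caps are assumptions, and this is untested with the kv backend):

  # ceph.conf on the OSD host, before --mkfs, so the store is created with the kv backend:
  [osd]
      osd objectstore = keyvaluestore-dev

  OSD_ID=$(ceph osd create)
  mkfs.xfs -f /dev/hioa
  mkdir -p /var/lib/ceph/osd/ceph-$OSD_ID
  mount /dev/hioa /var/lib/ceph/osd/ceph-$OSD_ID
  ceph-osd -i $OSD_ID --mkfs --mkkey
  ceph auth add osd.$OSD_ID osd 'allow *' mon 'allow rwx' \
      -i /var/lib/ceph/osd/ceph-$OSD_ID/keyring
  ceph osd crush add osd.$OSD_ID 1.0 host=CVM-0-11
  start ceph-osd id=$OSD_ID      # upstart; or run ceph-osd -i $OSD_ID directly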

On Fri, Sep 19, 2014 at 11:11 AM, Aegeaner  wrote:
>
> I noticed ceph added key/value store OSD backend feature in firefly, but i
> can hardly get any documentation about how to use it. At last I found that i
> can add a line in ceph.conf:
>
> osd objectstore = keyvaluestore-dev
>
> but got failed with ceph-deploy creating OSDs. According to the log,
> ceph-disk still tried to part a journal partition but failed.
>
> The commands i used  are:
>
> ceph-deploy disk zap CVM-0-11:/dev/hioa
>
> ceph-deploy osd prepare CVM-0-11:/dev/hioa
>
> ceph-deploy osd activate CVM-0-11:/dev/hioa1
>
> Can anyone help me to create a kvstore backend OSD?
>
> Thanks!
>
> =
>
> Aegeaner
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds unable to start with 0.85

2014-09-18 Thread Gregory Farnum
On Thu, Sep 18, 2014 at 5:35 PM, 廖建锋  wrote:
> if i turn on debug=20, the log will be more than 100G,
>
> looks no way to put,  do you have any other good way to figure it out?

It should compress well and you can use ceph-post-file if you don't
have a place to host it yourself.
-Greg
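
Concretely, something along these lines should do (log path and MDS name are assumptions):

  # on the MDS node, add to ceph.conf and restart the mds:
  [mds]
      debug mds = 20
      debug ms = 1

  # then compress and upload the resulting log:
  gzip /var/log/ceph/ceph-mds.storage-1-214.log
  ceph-post-file /var/log/ceph/ceph-mds.storage-1-214.log.gz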

>
> would you like to log into the server to check?
>
>
> From: Gregory Farnum
> Date: 2014-09-19 02:33
> To: 廖建锋
> CC: ceph-users
> Subject: Re: [ceph-users] ceph mds unable to start with 0.85
>
> On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋  wrote:
>> dear,
>>  my ceph cluster worked for about two weeks,  mds crashed every 2-3
>> days,
>> Now it stuck on replay , looks like replay crash and restart mds process
>> again
>>  what can i do for this?
>>
>>  1015 => # ceph -s
>> cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18
>> health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is
>> degraded; noscrub,nodeep-scrub flag(s) set
>> monmap e1: 2 mons at
>> {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0},
>> election
>> epoch 26, quorum 0,1 storage-1-213,storage-1-214
>> mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby
>> osdmap e1932: 18 osds: 18 up, 18 in
>> flags noscrub,nodeep-scrub
>> pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects
>> 4479 GB used, 32292 GB / 36772 GB avail
>> 444 active+clean
>> 56 active+clean+inconsistent
>> client io 125 MB/s rd, 31 op/s
>>
>> MDS log here:
>>
>> 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch
>> 620
>> from mon.0
>> 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby
>> 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch
>> 621
>> from mon.0
>> 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am
>> now
>> mds.0.272
>> 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state
>> change up:standby --> up:replay
>> 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start
>> 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set
>> 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is
>> 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch
>> 1929,
>> have 1927
>> 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap
>> 1929
>> (which blacklists prior instance)
>> 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for
>> myself; not doing anything
>> 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening
>> inotable
>> 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening
>> sessionmap
>> 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening
>> mds log
>> 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log
>> bounds
>> 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening
>> snap table
>> 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on
>> 10.1.0.213:6806/6114
>> 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on
>> 10.1.0.213:6811/6385
>> 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on
>> 10.1.0.213:6801/6110
>> 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal
>> 200
>> to recover...
>> 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on
>> 10.1.0.214:6806/6238
>> 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered.
>> 2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200
>> in
>> format 1
>> 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1:
>> loading/discovering base inodes
>> 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system
>> inode
>> with ino:100
>> 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system
>> inode
>> with ino:1
>> 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on
>> 10.1.0.214:6811/6276
>> 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2:
>> replaying
>> mds log
>> 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth
>> -1,-2 -> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f()
>> n()
>> hs=0+0,ss=0+0 0x5da]
>> 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir
>> 1
>> / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 |
>> subtree=1 0x5da]
>> 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth
>> -1,-2 -> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0
>> state=1073741824
>> f() n() hs=0+0,ss=0+0 0x5da03b8]
>> 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir
>> 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0
>> |
>> subtree=1 0x5da03b8]
>> 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache
>> adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth
>> v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09
>

[ceph-users] ceph health related message

2014-09-18 Thread shiva rkreddy
Hi,

I've set up a cluster with 3 monitors and 2 OSD nodes with 2 disks
each. The cluster is in the active+clean state. But "ceph -s" keeps throwing the
following message every other time "ceph -s" is run.

 #ceph -s
2014-09-19 04:13:07.116662 7fc88c3f9700  0 -- :/1011833 >> 192.168.240.200:6789/0 pipe(0x7fc890021200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc890021470).fault

If ceph -s is run from the same IP that is listed above, ceph -s doesn't
throw the message, not even once.

Appreciate your suggestions.

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?

2014-09-18 Thread Mark Kirkwood

On 19/09/14 15:11, Aegeaner wrote:


I noticed ceph added key/value store OSD backend feature in firefly, but
i can hardly get any documentation about how to use it. At last I found
that i can add a line in ceph.conf:

osd objectstore = keyvaluestore-dev

but got failed with ceph-deploy creating OSDs. According to the log,
ceph-disk still tried to part a journal partition but failed.

The commands i used  are:

ceph-deploy disk zap CVM-0-11:/dev/hioa

ceph-deploy osd prepare CVM-0-11:/dev/hioa

ceph-deploy osd activate CVM-0-11:/dev/hioa1

Can anyone help me to create a kvstore backend OSD?



Attached script should work (configured to use rocksdb, but just change 
to leveldb in the obvious place). It does the whole job assuming you want 
a MON and an OSD on the same host, so you may need to customize it - 
or only use the osd part after editing your existing ceph.conf.


It expects that the OSD_DATAPATH points to a mounted filesystem (i.e 
does not make use of ceph-disk).


Also some assumption of Ubuntu (i.e upstart) is made when it tries to 
start the daemons.


Best wishes

Mark



deploy-keyvalue.sh
Description: application/shellscript
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Troubleshooting down OSDs: Invalid command: ceph osd start osd.1

2014-09-18 Thread Piers Dawson-Damer
Has the command for manually starting and stopping OSDs changed? 

The documentation for troubleshooting OSDs  
(http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/) 
mentions restarting OSDs with the command:

ceph osd start osd.{num}
Yet I find, using Firefly 0.80.5

piers@sol:/etc/ceph$ ceph osd start osd.1
no valid command found; 10 closest matches:
osd tier remove  
osd tier cache-mode  none|writeback|forward|readonly
osd thrash 
osd tier add   {--force-nonempty}
osd pool stats {}
osd reweight-by-utilization {}
osd pool set  
size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid
  {--yes-i-really-mean-it}
osd pool set-quota  max_objects|max_bytes 
osd pool rename  
osd pool get  
size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid
Error EINVAL: invalid command


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Troubleshooting down OSDs: Invalid command: ceph osd start osd.1

2014-09-18 Thread Piers Dawson-Damer
Also, using the init.d framework seems to fail.

piers@sol:/etc/ceph$ sudo service ceph start osd.1
/etc/init.d/ceph: osd.1 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph 
defines )

The disks are mounted

piers@sol:~$ cat /etc/mtab | sort
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
devpts /dev/pts devpts rw,noexec,nosuid,gid=5,mode=0620 0 0
/dev/sdc1 /var/lib/ceph/osd/ceph-16 xfs rw,noatime 0 0
/dev/sdd1 /var/lib/ceph/osd/ceph-17 xfs rw,noatime 0 0
/dev/sde1 /var/lib/ceph/osd/ceph-0 xfs rw,noatime 0 0
/dev/sdf1 /var/lib/ceph/osd/ceph-1 xfs rw,noatime 0 0
/dev/sdg1 /var/lib/ceph/osd/ceph-2 xfs rw,noatime 0 0
/dev/sdh1 /var/lib/ceph/osd/ceph-3 xfs rw,noatime 0 0
/dev/sdi1 /var/lib/ceph/osd/ceph-4 xfs rw,noatime 0 0
/dev/sdj1 /var/lib/ceph/osd/ceph-5 xfs rw,noatime 0 0
/dev/sdk1 /var/lib/ceph/osd/ceph-6 xfs rw,noatime 0 0
/dev/sdl1 /var/lib/ceph/osd/ceph-7 xfs rw,noatime 0 0
/dev/sdm1 /var/lib/ceph/osd/ceph-8 xfs rw,noatime 0 0
/dev/sdn1 /var/lib/ceph/osd/ceph-9 xfs rw,noatime 0 0
/dev/sdo1 /var/lib/ceph/osd/ceph-10 xfs rw,noatime 0 0
/dev/sdp1 /var/lib/ceph/osd/ceph-11 xfs rw,noatime 0 0
/dev/sdq1 /var/lib/ceph/osd/ceph-12 xfs rw,noatime 0 0
/dev/sdr1 /var/lib/ceph/osd/ceph-13 xfs rw,noatime 0 0
/dev/sds1 /var/lib/ceph/osd/ceph-14 xfs rw,noatime 0 0
/dev/sdt1 /var/lib/ceph/osd/ceph-15 xfs rw,noatime 0 0
...

Thanks in advance,

piers
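
For what it's worth, on Ubuntu with ceph-disk/ceph-deploy provisioned disks the OSDs are normally managed by udev/upstart rather than the sysvinit script, which only knows about daemons listed explicitly in ceph.conf (which would explain the "osd.1 not found"). The per-OSD commands then look like this (id 1 as the example):

  sudo stop ceph-osd id=1
  sudo start ceph-osd id=1
  # or all OSDs on the host:
  sudo stop ceph-osd-all
  sudo start ceph-osd-all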


Begin forwarded message:

> From: Piers Dawson-Damer 
> Subject: Troubleshooting down OSDs: Invalid command: ceph osd start osd.1 
> Date: 19 September 2014 4:06:31 pm AEST
> To: "ceph-us...@ceph.com" 
> 
> Has the command for manually starting and stopping OSDs changed? 
> 
> The documentation for troubleshooting OSDs  
> (http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/) 
> mentions restarting OSDs with the command;
> 
> ceph osd start osd.{num}
> Yet I find, using Firefly 0.80.5
> 
> piers@sol:/etc/ceph$ ceph osd start osd.1
> no valid command found; 10 closest matches:
> osd tier remove  
> osd tier cache-mode  none|writeback|forward|readonly
> osd thrash 
> osd tier add   {--force-nonempty}
> osd pool stats {}
> osd reweight-by-utilization {}
> osd pool set  
> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid
>   {--yes-i-really-mean-it}
> osd pool set-quota  max_objects|max_bytes 
> osd pool rename  
> osd pool get  
> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid
> Error EINVAL: invalid command
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can ceph-deploy be used with 'osd objectstore = keyvaluestore-dev' in config file ?

2014-09-18 Thread Mark Kirkwood

On 19/09/14 18:02, Mark Kirkwood wrote:

On 19/09/14 15:11, Aegeaner wrote:


I noticed ceph added key/value store OSD backend feature in firefly, but
i can hardly get any documentation about how to use it. At last I found
that i can add a line in ceph.conf:

osd objectstore = keyvaluestore-dev

but got failed with ceph-deploy creating OSDs. According to the log,
ceph-disk still tried to part a journal partition but failed.

The commands i used  are:

ceph-deploy disk zap CVM-0-11:/dev/hioa

ceph-deploy osd prepare CVM-0-11:/dev/hioa

ceph-deploy osd activate CVM-0-11:/dev/hioa1

Can anyone help me to create a kvstore backend OSD?



Attached script should work (configured to use rocksdb, but just change
to leveldb in the obvious place. It does the whole job assuming you want
a MON and an OSD on the same host, so you may needed to customize it -
or only use the osd part after editing your existing ceph.conf.

It expects that the OSD_DATAPATH points to a mounted filesystem (i.e
does not make use of ceph-disk).

Also some assumption of Ubuntu (i.e upstart) is made when it tries to
start the daemons.



...oh, and it removes *everything* under OSD_DATAPATH (so be careful what 
you set this to... I probably should have a check in there to abort if it is 
any of /, /var or /usr)!
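
Something like this at the top of the script would do for that guard (the path list is arbitrary):

  case "$OSD_DATAPATH" in
      ""|"/"|"/var"|"/usr"|"/home")
          echo "refusing to wipe '$OSD_DATAPATH'" >&2
          exit 1
          ;;
  esac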


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

2014-09-18 Thread Luke Jing Yuan
Hi Steven,

I am not sure what else would be different; the stanza you showed is similar to 
those I have. The only possibility may be that I am using a different Linux 
distro?

FYI, I am using Ubuntu 12.04 but I had Ubuntu CloudArchive's Havana repo 
enabled (https://wiki.ubuntu.com/ServerTeam/CloudArchive) not so much for 
OpenStack but for the libvirt and qemu packages there since they are more 
recent than those from the original distro, and as for Ceph, I am using Emperor.

Regards,
Luke

-Original Message-
From: Steven C Timm [mailto:t...@fnal.gov]
Sent: Friday, 19 September, 2014 11:16 AM
To: Luke Jing Yuan
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph issue: rbd vs. qemu-kvm

Yes--the image was converted back to raw.
Since the image is mapped via rbd I can run fdisk on it and see both the 
partition tables and a normal set of files inside of it.

My system datastore is local to each node. It has been that way for quite
some time.

Steve Timm




From: Luke Jing Yuan [jyl...@mimos.my]
Sent: Thursday, September 18, 2014 9:44 PM
To: Steven C Timm
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph issue:  rbd vs. qemu-kvm

Hi Steven,

Assuming the original image was in qcow2 format, did you convert it back to raw 
before registering it?
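
If not, a conversion along these lines (file names are placeholders) should do it:

qemu-img convert -f qcow2 -O raw gcso_sl6_giwms.qcow2 gcso_sl6_giwms.raw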

Another tweak I did was to NFS-share the system datastore (id: 0) from the
frontend to the other hosts:

nebula@z4-hn01:~$ onedatastore list
  ID NAME            SIZE   AVAIL CLUSTER       IMAGES TYPE DS   TM
   0 system          457.7G 92%   z4-cluster-w       0 sys  -    shared
   1 default         457.7G 92%   -                  0 img  fs   shared
   2 files           457.7G 92%   -                  0 fil  fs   ssh
 100 mi-cloud-ceph   7.8T   81%   z4-cluster-w       3 img  ceph ceph
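
The sharing itself is plain NFS, nothing Ceph-specific; a minimal sketch,
assuming OpenNebula's default datastore path, a placeholder subnet, and a
frontend host simply called "frontend":

# on the frontend
echo '/var/lib/one/datastores/0 192.168.0.0/24(rw,no_root_squash,async)' | sudo tee -a /etc/exports
sudo exportfs -ra

# on each host
sudo mount -t nfs frontend:/var/lib/one/datastores/0 /var/lib/one/datastores/0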

Regards,
Luke

-Original Message-
From: Steven Timm [mailto:t...@fnal.gov]
Sent: Friday, 19 September, 2014 5:18 AM
To: Luke Jing Yuan
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph issue: rbd vs. qemu-kvm

Using image type raw actually got kvm to create the VM, but then the
virt-viewer console shows:

Booting from Hard Disk
Geom Error
---
We do not even get as far as GRUB.


Below is the network stanza from the XML.

[XML stanza not preserved by the list archive]

Any other tweak I might be missing?
Thanks

Steve Timm



By the way--the following is what the raw file in question looked like before
I loaded it into Ceph:

[root@one4dev timm]# file gcso_sl6_giwms.raw
gcso_sl6_giwms.raw: x86 boot sector; GRand Unified Bootloader, stage1 version 
0x3, boot drive 0x80, 1st sector stage2 0x1307f70, GRUB version 0.94; partition 
1: ID=0x83, active, starthead 32, startsector 2048,
6291456 sectors, code offset 0x4
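
One sanity check worth doing when the boot loader reports a geometry error is
to compare the source file's size with the imported image; a sketch, where the
"one" pool name is a placeholder:

qemu-img info gcso_sl6_giwms.raw     # virtual size of the source raw file
rbd info one/gcso_sl6_giwms          # size of the image as imported into the pool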



On Thu, 18 Sep 2014, Steven Timm wrote:

> thanks Luke, I will try that.
>
> Steve
>
>
> On Wed, 17 Sep 2014, Luke Jing Yuan wrote:
>
>>  Hi,
>>
>>  From the ones we managed to configure in our lab here, I noticed
>>  that using image format "raw" instead of "qcow2" worked for us.
>>
>>  Regards,
>>  Luke
>>
>>  -Original Message-
>>  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> Behalf Of  Steven Timm
>>  Sent: Thursday, 18 September, 2014 5:01 AM
>>  To: ceph-users@lists.ceph.com
>>  Subject: [ceph-users] ceph issue: rbd vs. qemu-kvm
>>
>>
>>  I am trying to use Ceph as a data store with OpenNebula 4.6 and have
>> followed the instructions in OpenNebula's documentation at
>> http://docs.opennebula.org/4.8/administration/storage/ceph_ds.html
>>
>>  and compared them against the "using libvirt with ceph"
>>
>>  http://ceph.com/docs/master/rbd/libvirt/
>>
>>  We are using the ceph-recompiled qemu-kvm and qemu-img as found at
>>
>>  http://ceph.com/packages/qemu-kvm/
>>
>>  under Scientific Linux 6.5, which is a Red Hat clone, with a
>>  kernel-lt-3.10 kernel.
>>
>>  [root@fgtest15 qemu]# kvm -version
>>  QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c)
>>  2003-2008 Fabrice Bellard
>>
>>
>>  From qemu-img
>>
>>  Supported formats: raw cow qcow vdi vmdk cloop dmg bochs vpc vvfat
>> qcow2  qed parallels nbd blkdebug host_cdrom host_floppy host_device
>> file rbd
>>
>>
>>  --
>>  Libvirt is trying to execute the following KVM command:
>>
>>  2014-09-17 19:50:12.774+: starting up  LC_ALL=C
>> PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none
>> /usr/libexec/qemu-kvm -name one-60 -S -M rhel6.3.0 -enable-kvm -m
>> 4096  -smp 2,sockets=2,cores=1,threads=1 -uuid
>>  572499bf-07f3-3014-8d6a-dfa1ebb99aa4 -nodefconfig -nodefaults
>> -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-60.monitor,serve
>> r,nowa