Re: [ceph-users] Radosgw Timeout

2014-05-22 Thread Craig Lewis

On 5/22/14 06:16 , Georg Höllrigl wrote:


I have created one bucket that holds many small files, separated into 
different "directories". But whenever I try to acess the bucket, I 
only run into some timeout. The timeout is at around 30 - 100 seconds. 
This is smaller then the Apache timeout of 300 seconds.


Just so we're all talking about the same things, what does "many small 
files" mean to you?  Also, how are you separating them into 
"directories"?  Are you just giving files in the same "directory" the 
same leading string, like "dir1_subdir1_filename"?


I'm putting about 1M objects, random sizes, in each bucket.  I'm not 
having problems getting individual files, or uploading new ones.  It 
does take a long time for s3cmd to list the contents of the bucket. The 
only time I get timeouts is when my cluster is very unhealthy.


If you're doing a lot more than that, say 10M or 100M objects, then that 
could cause a hot spot on disk.  You might be better off taking your 
"directories", and putting them in their own bucket.



--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com <mailto:cle...@centraldesktop.com>

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/>  | Twitter 
<http://www.twitter.com/centraldesktop>  | Facebook 
<http://www.facebook.com/CentralDesktop>  | LinkedIn 
<http://www.linkedin.com/groups?gid=147417>  | Blog 
<http://cdblog.centraldesktop.com/>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests

2014-05-23 Thread Craig Lewis

On 5/22/14 11:51 , Győrvári Gábor wrote:

Hello,

I got this kind of log on two nodes of a 3-node cluster. Both nodes have 2 
OSDs, but only 2 OSDs on two separate nodes were affected, which is why I 
don't understand the situation. There wasn't any extra I/O on the system 
at the given time.


We're using radosgw with the S3 API to store objects in Ceph. Average ops 
are around 20-150, with bandwidth usage of 100-2000 KB/s read and only 
50-1000 KB/s written.


osd_op(client.7821.0:67251068 
default.4181.1_products/800x600/537e28022fdcc.jpg [cmpxattr 
user.rgw.idtag (22) op 1 mode 1,setxattr user.rgw.idtag (33),call 
refcount.put] 11.fe53a6fb e590) v4 currently waiting for subops from [2]


Are any of your PGs in recovery or backfill?

I've seen this happen two different ways.  The first time was because I 
had the recovery and backfill parameters set too high for my cluster.  
If your journals aren't SSDs, the default parameters are too high.  The 
recovery operation will use most of the IOps, and starve the clients.
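
For reference, throttling recovery comes down to a few settings in the [osd] 
section of ceph.conf.  The values below are only an example of "low"; tune 
them for your own hardware:

  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

They can also be changed on a running cluster with "ceph tell osd.* injectargs".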


The second time I saw this is when one disk was starting to fail. 
Sectors started failing, and the drive spent a lot of time reading and 
remapping bad sectors.  Consumer class SATA disks will retry bad sectors 
for 30+ seconds.  It happens in the drive firmware, so it's not something 
you can stop.  Enterprise class drives will give up quicker, since they 
know you have another copy of the data.  (Nobody uses enterprise class 
drives stand-alone; they're always in some sort of storage array.)


I've had reports of 6+ OSDs blocking subops, and I traced it back to one 
disk that was blocking others.  I replaced that disk, and the warnings 
went away.



If your cluster is healthy, check the SMART attributes for osd.2's disk.  If 
osd.2 looks good, it might be another OSD.  Check osd.2's logs, and check any 
OSDs that are blocking osd.2.  If your cluster is small, it might be 
faster to just check all disks instead of following the trail.
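
If you have smartmontools installed, checking a disk is a one-liner 
(assuming osd.2 lives on /dev/sdc; adjust for your layout):

  smartctl -a /dev/sdc | egrep -i 'reallocated|pending|uncorrect'

Non-zero raw values on those attributes are a good hint that the disk is on 
its way out.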








Re: [ceph-users] osd pool default pg num problem

2014-05-23 Thread Craig Lewis
If you're not using CephFS, you don't need metadata or data pools.  You 
can delete them.

If you're not using RBD, you don't need the rbd pool.

If you are using CephFS, and you do delete and recreate the 
metadata/data pools, you'll need to tell CephFS.  I think the command is 
ceph mds add_data_pool <pool name>.  I'm not using CephFS, so I 
can't test that.  I don't see any commands to set the metadata pool 
for CephFS, but it seems strange that you would have to tell it about the 
data pool but not the metadata pool.
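
Untested, since I'm not running CephFS, but the delete/recreate sequence 
should look roughly like this (pool name and PG counts are just examples):

  ceph osd pool delete data data --yes-i-really-really-mean-it
  ceph osd pool create data 375 375
  ceph mds add_data_pool data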




On 5/23/14 11:22 , McNamara, Bradley wrote:

The other thing to note, too, is that it appears you're trying to decrease the 
PG/PGP_num parameters, which is not supported.  In order to decrease those 
settings, you'll need to delete and recreate the pools.  All new pools created 
will use the settings defined in the ceph.conf file.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John 
Spray
Sent: Friday, May 23, 2014 6:38 AM
To: Cao, Buddy
Cc: ceph-users@lists.ceph.com; ceph-u...@ceph.com
Subject: Re: [ceph-users] osd pool default pg num problem

Those settings are applied when creating new pools with "osd pool create", but 
not to the pools that are created automatically during cluster setup.

We've had the same question before
(http://comments.gmane.org/gmane.comp.file-systems.ceph.user/8150), so maybe 
it's worth opening a ticket to do something about it.

Cheers,
John

On Fri, May 23, 2014 at 2:01 PM, Cao, Buddy  wrote:

In Firefly, I added below lines to [global] section in ceph.conf,
however, after creating the cluster, the default pool
“metadata/data/rbd”’s pg num is still over 900 but not 375.  Any suggestion?





osd pool default pg num = 375

osd pool default pgp num = 375













Re: [ceph-users] How to backup mon-data?

2014-05-23 Thread Craig Lewis

On 5/23/14 09:30 , Fabian Zimmermann wrote:

Hi,


On 23.05.2014 at 17:31, "Wido den Hollander" wrote:

I wrote a blog about this: 
http://blog.widodh.nl/2014/03/safely-backing-up-your-ceph-monitors/

so you assume restoring the old data works, or did you prove this?


I did some of the same things, but never tested a restore 
(http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3087). 
There is a discussion, but I can't figure out how to get gmane to show 
me the threaded version from a google search.



I stopped doing the backups, because they seemed rather useless.

The monitors have a snapshot of the cluster state right now.  If you 
ever need to restore a monitor backup, you're effectively rolling the 
whole cluster back to that point in time.


What happens if you've added disks after the backup?
What happens if a disk has failed after the backup?
What happens if you write data to the cluster after the backup?
What happens if you delete data after the backup, and it gets garbage 
collected?


All questions that can be tested and answered... with a lot of time and 
experimentation.  I decided to add more monitors and stop taking backups.



I'm still thinking about doing manual backups before a major Ceph 
version upgrade.  In that case, I'd only need to test the write/delete 
cases, because I can control the add/remove disk cases.  The backups 
would only be useful between restarting the MON and the OSD processes, 
though.  I can't really back up the OSD state[1], so once they're 
upgraded, there's no going back.
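
If I do end up doing it, the mon backup itself is simple (paths assume the 
default locations and a monitor named after the short hostname):

  service ceph stop mon.$(hostname -s)
  tar czf /root/mon.$(hostname -s).$(date +%Y%m%d).tar.gz \
      /var/lib/ceph/mon/ceph-$(hostname -s)
  service ceph start mon.$(hostname -s)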



1: ZFS or Btrfs snapshots could do this, but neither one is recommended 
for production.  I do plan to make snapshots once either FS is 
production ready.  LVM snapshots could do it, but they're such a pain 
that I never bothered.  And I still have the scripts I used to make LVM 
snapshots of MySQL data directories.







Re: [ceph-users] Radosgw Timeout

2014-05-23 Thread Craig Lewis

On 5/23/14 03:47 , Georg Höllrigl wrote:



On 22.05.2014 17:30, Craig Lewis wrote:

On 5/22/14 06:16 , Georg Höllrigl wrote:


I have created one bucket that holds many small files, separated into
different "directories". But whenever I try to acess the bucket, I
only run into some timeout. The timeout is at around 30 - 100 seconds.
This is smaller then the Apache timeout of 300 seconds.


Just so we're all talking about the same things, what does "many small
files" mean to you?  Also, how are you separating them into
"directories"?  Are you just giving files in the same "directory" the
same leading string, like "dir1_subdir1_filename"?


I can only estimate how many files. ATM I've 25M files on the origin, 
but only 1/10th has been synced to radosgw. These are distributed 
through 20 folders, each containing about 2k directories with ~ 100 - 
500 files each.


Do you think that's too much for that use case?

The recommendations I've seen indicate that 25M objects per bucket is 
doable, but painful.  The bucket is itself an object stored in Ceph, 
which stores the list of objects in that bucket.   With a single bucket 
containing 25M objects, you're going to hotspot on the bucket.  Think of 
a bucket like a directory on a filesystem.  You wouldn't store 25M files 
in a single directory.


Buckets are a bit simpler than directories.  They don't have to track 
permissions, per file ACLs, and all the other things that POSIX 
filesystems do.  You can push them harder than a normal directory, but 
the same concepts still apply.  The more files you put in a 
bucket/directory, the slower it gets.  Most filesystems impose a hard 
limit on the number of files in a directory.  RadosGW doesn't have a 
limit, it just gets slower.


Even the list of buckets has this problem.  You wouldn't want to create 
25M buckets with one object each.  By default, there is a 1000 bucket 
limit per user, but you can increase that.
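
Raising the per-user limit is a single radosgw-admin call, something like 
(the uid is an example):

  radosgw-admin user modify --uid=johndoe --max-buckets=10000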



If you can handle using 20 buckets, it would be worthwhile to put each 
one of your top 20 folders into its own bucket.  If you can break it 
apart even more, that would be even better.


I mentioned that I have a bunch of buckets with ~1M objects each. GET 
and PUT of objects is still fast, but listing the contents of the bucket 
takes a long time.  Each bucket takes 20-30 minutes to get a full 
listing.  If you're going to be doing a lot of bucket listing, you might 
want to keep each bucket below 1000 items.  Maybe each of your 2k 
directories gets its own bucket.



If using more than one bucket is difficult, then 25M objects in one 
bucket will work.







Re: [ceph-users] Questions about zone and disater recovery

2014-05-23 Thread Craig Lewis

On 5/21/14 19:49 , wsnote wrote:

Hi,everyone!
I have 2 ceph clusters, one master zone, another secondary zone.
Now I have some question.
1. Can ceph have two or more secondary zones?


It's supposed to work, but I haven't tested it.



2. Can the roles of the master zone and secondary zone be swapped?
I mean, can I change the secondary zone to be the master and the master 
zone to be a secondary?
Yes and no.  You can promote the slave to a master at any time by 
disabling replication, and writing to it.  You'll want to update your 
region and zone maps, but that's only required to make replication 
between zones work.
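
The region map update is roughly: dump the region, point master_zone at the 
zone you're promoting, and push it back.  Something like this, depending on 
your region names:

  radosgw-admin region get > region.json
  # edit region.json and change "master_zone" to the surviving zone
  radosgw-admin region set --infile region.json
  radosgw-admin regionmap update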


Converting the master to a secondary zone... I don't know.  Everything 
will work if you delete the contents of the old master, set it up as a 
new secondary of the new master, and re-replicate everything.  Nobody 
wants to do that.  It would be nice if you could just point the old 
master (with its existing data) at the new master, and it would start 
replicating.  I can't answer that.




3. How do you deal with the situation when the master zone is down?
Right now the secondary zone forbids all file operations, such as 
creating or deleting objects.
When the master zone is down, users can't do anything with their files 
except read objects from the secondary zone.
It's a bad user experience. Additionally, it will hurt users' confidence.
I know the limits on the secondary zone are there for data 
consistency. However, is there another way to improve the experience?

I think:
There could be a config option that allows file operations on the secondary 
zone. If the master zone is down, the admin can enable it, and then 
users can do file operations as usual. The secondary would record all the 
file operations. When the master zone comes back, the admin could sync 
the files to the master zone manually.




The secondary zone tracks which metadata operations it has replayed 
from the master zone.  It does this per bucket.


In theory, there's no reason you can't have additional buckets in the 
slave zone that the master zone doesn't have.  Since these buckets 
aren't replicated, there shouldn't be a problem writing to them.  In 
theory, you should even be able to write objects to the existing buckets 
in the slave, as long as the master doesn't have those objects.  I don't 
know what would happen if you later created one of those buckets or objects 
on the master.  Maybe replication breaks, or maybe it just overwrites the 
data in the slave.


That's a lot of "in theory" though.  I wouldn't attempt it without a lot 
of simulation in test clusters.






Re: [ceph-users] How to backup mon-data?

2014-05-27 Thread Craig Lewis

On 5/23/14 16:20 , Cédric Lemarchand wrote:
Out of curiosity, what's the current beef with zfs? I know what 
problems are cited for btrfs, but I haven't heard much about zfs lately.

The Linux implementation (ZoL) is actually stable for production, but is quite 
memory hungry because of a spl/slab fragmentation issue ...

But I would ask a question: even with a snapshot-capable FS, is it sufficient 
to achieve a consistent backup of a running leveldb? Or did you plan to 
stop/snap/start the mon? (No knowledge at all about leveldb ...)

Cheers



A ZFS snapshot is atomic, but it doesn't tell the daemons to flush their 
logs to disk.  Reverting to a snapshot looks the same as if you turned 
off the machine by yanking the power cord at the instant the snapshot 
was taken.


It's not a nice thing to do to a daemon, but the monitors need to be 
able to handle a dirty shutdown.


It would be better to stop the monitor, snapshot, and start the 
monitor.  It shouldn't cause any problems if you don't, and I wouldn't 
bother.
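
If you did want the cleaner version, the snapshot is a one-liner on either 
side of a mon restart (dataset name is an example):

  zfs snapshot tank/ceph-mon@pre-upgrade
  zfs rollback tank/ceph-mon@pre-upgrade    # only if you ever need to revert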








Re: [ceph-users] ceph-deploy or manual?

2014-05-27 Thread Craig Lewis

On 5/27/14 09:19 , Don Talton (dotalton) wrote:

I'd love to know how people are deploying their production clouds now. I've heard mixed 
answers about whether or not the "right" way is with ceph-deploy, or manual 
deployment. Are people using automation tools like puppet or ansible?


Donald Talton
Cloud Systems Development
Cisco Systems




I'm using Chef and the Ceph cookbook 
(https://github.com/ceph/ceph-cookbooks).  This does the heavy lifting 
of installing and configuring machines.  I partition the boot disks, 
install the OS, then Chef does everything after that.


It works, but there's no cluster wide automation.  For example, it won't 
handle major Ceph upgrades.  A major upgrade has the steps:


1. Update packages
2. Restart all monitors
3. Restart all osds
4. Restart all MDS/RadosGW


Chef doesn't handle that.  It's only aware of the individual nodes, so 
it'll restart the monitors, osds, mds, and radosgw without regard for 
the other nodes.  I'm sure I could add the logic, but it's a lot of work 
to make it work correctly in all cases.  I just do the upgrades by hand.
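
For reference, the by-hand version is just a package update plus ordered 
restarts, roughly like this (sysvinit-style commands as an example; package 
names and init system vary by distro):

  apt-get update && apt-get install ceph ceph-common radosgw
  service ceph restart mon     # on each monitor node first, one at a time
  service ceph restart osd     # then on each OSD node, one at a time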


This isn't unique to Chef.  cfEngine, Puppet, Chef, and Salt all have 
the same limitations.  Ansible claims to be cluster aware, but I haven't 
really seen any actual support for that statement.




I am a recent convert to Config Management.  I might be a bit of a 
zealot, but I don't plan to manage nodes by hand ever again.







Re: [ceph-users] Is there a way to repair placement groups? [Offtopic - ZFS]

2014-05-27 Thread Craig Lewis

On 5/27/14 13:40 , phowell wrote:

Hi

First apologies if this is the wrong place to ask this question.

We are running a small Ceph (0.79) cluster with about 12 OSDs, which 
are on top of a ZFS RAID 1+0 (for another discussion)... and which were 
created on this version.




Just a reminder to benchmark everything, especially things you have 
known to be true since the dawn of time.  I benchmarked RAID10 vs. RAID5 
so long ago, I had to find a 3.5" floppy to open the spreadsheet.



Recently, I was testing ZFS on software encrypted volumes, and wanted to 
see how badly it would impact a PostgreSQL server.  My test setup was 
using RAIDZ2, so I just ran the benchmark on that zpool.


Imagine my surprise when an untuned and encrypted RAIDZ2 posted better 
benchmarks than a tuned ZFS RAID10.



I really think the "RAID5 is bad for performance" rule is a nasty hold-over 
from when parity calculations needed dedicated hardware. I won't be 
building any more ZFS RAID10 arrays.







Re: [ceph-users] Ceph-deploy to deploy osds simultaneously

2014-05-27 Thread Craig Lewis
In practice, it's not a big deal.  Just deploy your disks sequentially, 
and Ceph will sort it out.


Sure, you'll waste a bit of time watching data copy to a new disk, only 
to see it get remapped to a newer disk.  It's a small period of time 
relative to how long it's going to take to remap to all of the new disks 
anyway.  Ceph handles it fine, and won't lose data.



If you still want to, check out the "mon osd auto mark new in" setting: 
http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/#configuration-settings



Alfredo's comments about a race condition shouldn't apply, because 
you'll create the OSDs sequentially, and only mark them up and in after 
they're created.
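
A sketch of that approach: turn off auto-in in ceph.conf before deploying, 
then bring the new OSDs in yourself once they all exist (OSD ids are 
examples):

  [mon]
    mon osd auto mark new in = false

  for i in 12 13 14; do ceph osd in $i; done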




On 5/27/14 07:20 , Alfredo Deza wrote:

There is no simultaneous support in ceph-deploy as this is something
that is not needed for users setting up a cluster to try out
ceph (the main objective of ceph-deploy)

However, there have been users that used scripts that can call
ceph-deploy in parallel so it might be possible to do so.

For OSDs specifically, be warned about a possible race condition, you
might find a bit of info in this ticket:
http://tracker.ceph.com/issues/3309



On Mon, May 26, 2014 at 2:14 AM, Cao, Buddy  wrote:

Hi, does ceph-deploy support deploying OSDs simultaneously in a large-scale 
cluster? It looks like mkcephfs does not support simultaneous OSD deployment.

If there are many hosts with a very large number and size of OSDs/devices, how 
do I improve the performance of deploying the whole cluster at once?

Wei Cao (Buddy)








Re: [ceph-users] 70+ OSD are DOWN and not coming up

2014-05-27 Thread Craig Lewis

On 5/22/14 00:26 , Craig Lewis wrote:

On 5/21/14 21:15 , Sage Weil wrote:

On Wed, 21 May 2014, Craig Lewis wrote:

If you do this over IRC, can you please post a summary to the mailing
list?

I believe I'm having this issue as well.

In the other case, we found that some of the OSDs were behind processing
maps (by several thousand epochs).  The trick here to give them a chance
to catch up is

  ceph osd set noup
  ceph osd set nodown
  ceph osd set noout

and wait for them to stop spinning on the CPU.  You can check which map
each OSD is on with

  ceph daemon osd.NNN status

to see which epoch they are on and compare that to

  ceph osd stat

Once they are within 100 or less epochs,

  ceph osd unset noup

and let them all start up.

We haven't determined whether the original problem was caused by this or
the other way around; we'll see once they are all caught up.

sage


I was seeing the CPU spinning too, so I think it is the same issue.  
Thanks for the explanation!  I've been pulling my hair out for weeks.




This process solved my problem, with one caveat.  When I followed it, I 
filled up /var/log/ceph/ and the recovery failed.  I had to manually run 
each OSD in debugging mode until it completed the map update.  Aside 
from that, I followed your procedure.


After that, I was able to start everything normally, and the cluster 
recovered within a couple of hours.



This has been keeping me awake at night.  So far, it only happened to my 
slave cluster.  I've been living in dread of seeing this happen to my 
master cluster.  Now I know why the master cluster has been safe.  When 
my master cluster had problems, I intervened quickly (usually rebooting 
the node).  When the slave had problems, I fixed it in the morning.  
That extra delay was enough time to cause this issue.


Thank you!







Re: [ceph-users] How to implement a rados plugin to encode/decode data while r/w

2014-05-28 Thread Craig Lewis

On 5/27/14 19:44 , Plato wrote:
For a certain security issue, I need to make sure the data finally saved 
to disk is encrypted.
So, I'm trying to write a rados class which would be hooked into the 
reading and writing process.
That is, before data is written, the encrypting method of the class will 
be invoked; and after data is read, the decrypting method of the 
class will be invoked.


I checked the interfaces in objclass.h, and found that cls_link is 
perhaps what I need.
However, that interface is not implemented yet. So, how do I write such a 
rados plugin? Is it possible?


Plato




If you're looking for encryption at rest, can you use ceph-disk prepare 
--dmcrypt or ceph-deploy disk --dmcrypt ?
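
The command forms are roughly (device and host names are examples; both 
tools also take --dmcrypt-key-dir if you want the keys somewhere else):

  ceph-disk prepare --dmcrypt /dev/sdb
  ceph-deploy osd prepare --dmcrypt node1:sdb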


The encryption is top notch, but the actual security is a bit weak. The 
keys are stored unencrypted in /etc/ceph/dmcrypt-keys/, which allows the 
OSDs to start at boot without a pass-phrase.  If you're looking to check 
a box on your security auditor's form, it meets the requirements: The 
disk without the key is useless.


If you want stronger security (encrypted keys with a pass-phrase on boot), 
the --dmcrypt arg just calls cryptsetup.  Open up your deployment tool 
of choice, and look at the innards.  It wouldn't be very hard to set up 
better security manually.  It will complicate reboots, but actual 
security usually does.


It looks like only AES-256 is compiled into cryptsetup on Ubuntu.  If you need 
stronger crypto, I'm sure it's available with a bit more effort.







Re: [ceph-users] Is there a way to repair placement groups? [Offtopic - ZFS]

2014-05-28 Thread Craig Lewis

On 5/28/14 09:45 , Dimitri Maziuk wrote:

On 05/28/2014 09:32 AM, Christian Balzer wrote:

I was about to write something similar yesterday, but work interfered. ^o^

For bandwidth a RAID(Z*/6, don't even think about RAID5 or equivalent) is
indeed very nice, but for IOPS it will be worse than a RAID10.

Of course a controller with a large writeback cache can pretty alleviate
or at least hide those issues up to a point. ^.^

Also, all benchmarks suck(tm). Are you comparing the exact same workload
on the exact same disks on the exact same controller etc. Sure you can
have a software raid 6 that's faster than hardware raid 10 -- it may
take some work but it should be perfectly doable.




Agreed.  I rate all benchmark tools on the "least useless" scale.


In that case, I was using the same single server, with different zpool 
configurations and tunables.  The disks were single disk RAID0, with 
battery backed write cache.  All tests included a mirrored ZIL on SSD, 
and an L2ARC on SSD.  This was a while ago, so those SSDs would've been 
the Intel X25E 64GB.


The PostgreSQL benchmark tool is a modified TPC-B benchmark.  TPC is 
more IOps constrained than throughput constrained, but it had a 
component of both.  It doesn't really match my access patterns, but it's 
Close Enough (tm) that I'm not forced to do a better job.  I care about 
the overall latency and IOps in a many user scenario, not single thread 
performance.


I was tuning ZFS parameters for my database server.  It wasn't meant to 
be definitive, just something to quickly narrow down values for real 
production tests.  In the end, the parameters that gave the best pgbench 
score also gave the best performing production database server, despite 
the difference in access patterns.  I was surprised too.  It's probably 
because, in the end, I didn't have to change much.


I ended up with four optimizations, in order of effectiveness:

1. Adding SSD ZIL and L2ARC
2. Using 5 disk RAIDZ over 4 disk RAID10 (I had 8 drive bays. With the
   3 SSDs, that left 5 bays for spinners)
3. Adjusting ZFS recordsize to match PostgreSQL's 8k object size (the
   exact commands are shown below).
4. Enabling compression.  This tied for effectiveness with #3, but was
   statistically insignificant when combined with #3.  I would re-test under Ceph.

Everything else I tested was either counter-productive or statistically 
insignificant.
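
For reference, #3 and #4 from that list are one-liners (dataset name is an 
example; available compression algorithms depend on your ZFS version):

  zfs set recordsize=8k tank/pgdata
  zfs set compression=lz4 tank/pgdata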



Bringing it back to Ceph, I plan to re-test all of them on Ceph on ZFS.  
Prior to testing, my hypothesis is that I'll end up with #1, #2, and #4.


My initial Ceph cluster uses the same chassis as my database server.  If 
I willfully ignore some things, the PostgreSQL benchmark sounds like a 
reasonable first-order approximation for my Ceph nodes.


Using rados bench, my benchmarks on XFS told me to skip the SSD journals 
and put more spinners in.  That benchmarked really well in my mostly 
read workload.  It proved to be a disaster in production, when I started 
expanding the cluster.  My benchmarks only had the cluster 10% full, and 
there wasn't enough volume to actually stress things properly.  
Production load indicates that I need the SSD journals.  I'm in the 
process of adjusting the older nodes.
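
(rados bench itself is easy to run; something like this against a scratch 
pool:

  rados bench -p scratch 60 write --no-cleanup
  rados bench -p scratch 60 seq

but as I found out, a benchmark that's easy to run isn't necessarily 
representative.)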



The reason my PostgreSQL benchmark was successful, and my Ceph benchmark 
failed so miserably?  I had enough production PostgreSQL experience to 
know that the benchmark was somewhat reasonable, and a way to test the 
results in production.  I had neither one of those things when I was 
running my Ceph benchmarks.  Which mostly boils down to hubris: "I'm 
good enough that I don't need those things anymore."



Hence my assertion to re-test things you think you know.  :-)







Re: [ceph-users] someone using btrfs with ceph

2014-05-28 Thread Craig Lewis

On 5/28/14 07:19 , Cedric Lemarchand wrote:

On 28/05/2014 16:15, Stefan Priebe - Profihost AG wrote:

On 28.05.2014 16:13, Wido den Hollander wrote:

On 05/28/2014 04:11 PM, VELARTIS Philipp Dürhammer wrote:

Is someone using btrfs in production?
I know people say it’s still not stable. But do we use so many features
with ceph? And facebook uses it also in production. Would be a big speed
gain.

As far as I know the main problem is still performance degradation over
time. On a SSD-only cluster this would be less of a problem since seek
times on SSDs aren't a really big problem, but on spinning disks they are.

I haven't seen btrfs in production on any Ceph cluster I encountered.

It heavily fragments over time.

I would just add that this is inherent to *all* COW-based file systems, and
not specific to BTRFS ;-)

Cheers

Cédric


I agree, but it appears to affect BtrFS more than others.

I'm using ZFS for other things (not Ceph).  Those filesystems are slower 
after several years, but only by a few percent (estimating from RRD 
graphs).  ZFS made a design decision to ignore fragmentation, as long as 
the zpool is less than 80% full.  Once there, it switches to an optimal 
placement algorithm instead of the fast-but-inefficient placement 
algorithm.  This drives up the CPU usage and kills the IOps.  So it does 
suffer, but the pain doesn't hit until > 80% full.


I don't recall VxFS having any issue when using COW, but it's been a 
while.  The multi-million dollar storage array probably helped.


I never used ReiserFS in production, so I can't comment.

I haven't tried any other COW filesystems.






Re: [ceph-users] Inter-region data replication through radosgw

2014-05-28 Thread Craig Lewis

On 5/21/14 22:55 , wsnote wrote:

Hi, Lewis!
With your approach, there will be a contradiction because of the limits on 
the secondary zone.

In the secondary zone, one can't do any file operations.
Let me give an example. I'll define the symbols first.

The instances of cluster 1:
M1: master zone of cluster 1
S2: slave zone for M2 of cluster 2; the files of cluster 2 will be 
synced from M2 to S2
I13: the third instance of cluster 1 (M1 and S2 are instances 
too)


The instances of cluster 2:
M2: master zone of cluster 2
S1: Slave zone for M1 of cluster 1, the files of cluster1 will be 
synced from M1 to S1
I23: the third instance of cluster 1(M2 and S1 are both the instances 
too.)


cluster 1: M1 S2 I13
cluster 2: M2 S1 I23

Questions:
1. If I upload objects form I13 of cluster 1, is it synced to cluster 
2 from M1?
I'm assuming that I23's description should be "the third instance of 
cluster 2", not "the third instance of cluster 1".


If so, the answer is no: you haven't configured I13 to replicate to 
I23. Replication happens between zones.


In this example, you'll have two replication agents running.
One in Cluster2, copying data from M1 to S1.
One in Cluster1, copying data from M2 to S2.

There's no reason you couldn't set up replication from I13 to I23 if you 
want. But I don't see why you wouldn't just use M1 in that case.


2. In cluster 1, can I do some operations for the files synced from 
cluster2 through M1 or I13?
In cluster1, all operations you do to M1 will be replicated to S1 in 
cluster2. Uploading, overwriting, or deleting objects in M1 will do the 
same to them in S1.



3. If I upload an object in cluster 1, the metadata will be synced to 
cluster 2 before the file data. Suppose the metadata has been synced but the 
file data has not, and cluster 1 goes down; that is to say, the object has 
not been fully synced yet. Then I upload the same object in cluster 2. Can 
it succeed?

Metadata is synced at pretty much the same time as the data.

I tested replication by deliberately importing into the master zone 
faster than replication could handle. It will take the slave another 2 
weeks to finish catching up. It has ~50% of the objects right now. If 
the object hasn't been replicated yet, the slave zone doesn't know it 
exists. Here's an object that I just created in the master zone:
clewis@clewis ~ (-) $ s3prod.master ls 
s3://live-23/17c23967ca275cf606f3cd5151b03d393eed836d754e670a00b878a4fe9abc73
2014-05-29 02:26 1354k 91decb5e8bc658079f030517937ff6b8 
s3://live-23/17c23967ca275cf606f3cd5151b03d393eed836d754e670a00b878a4fe9abc73
clewis@clewis ~ (-) $ s3prod.slave ls 
s3://live-23/17c23967ca275cf606f3cd5151b03d393eed836d754e670a00b878a4fe9abc73


The slave has no record that this object exists (yet).


I think it will fail. Cluster 2 has the metadata of the object and will 
consider the object to be in cluster 2, and this object was synced from 
cluster 1, so I have no permission to operate on it.

Am I right?


As to whether or not you can upload that object to the slave zone, I replied 
with a lot of guesses in your other question, titled "Questions about zone 
and disater recovery".


My intuition is that once you do this, you're going to break replication. 
At that point, the slave becomes the new master, and you need to delete 
the old master and replicate back.


This is pretty common in replication scenarios. I have to do this when 
my PostgreSQL servers fail over from master to secondary.





Because of the limits on file operations in the slave zone, I think there 
will be some contradictions.


Looking forward to your reply.
Thanks!







Re: [ceph-users] OSD not up

2014-05-30 Thread Craig Lewis

On 5/30/14 03:08 , Ta Ba Tuan wrote:

Dear all,
I'm using Firefly. One disk failed, so I replaced the failed disk and 
started that OSD.

But that OSD is still down.

Help me,
Thank you





You need to re-initialize the disk after replacing it.  Ceph stores 
cluster information on the disk, and ceph-osd needs that information to 
start.  The process is pretty much removing the osd, then adding it again.


This blog walks you through the details: 
http://karan-mj.blogspot.com/2014/03/admin-guide-replacing-failed-disk-in.html


Or you can search through the mailing list for "replace osd" for more 
discussions.
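
The short version of that procedure: remove the dead OSD from the cluster, 
then create a new one on the replacement disk (OSD id, host, and device are 
examples):

  ceph osd out 12
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  ceph-deploy osd create node1:sdb     # or ceph-disk prepare + activate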











Re: [ceph-users] OSD suffers problems after filesystem crashed and recovered.

2014-05-30 Thread Craig Lewis

On 5/29/14 01:09 , Felix Lee wrote:

Dear experts,
Recently, a disk for one of our OSDs failed and took the OSD down. 
After I recovered the disk and filesystem, I noticed two problems:


1. Journal corruption, which prevents the OSD from starting.



2. I guess I can use ceph-osd with the "--mkjournal" option to fix the 
journal corruption, but there is another thing that bothers me: the 
previous OSD daemon is stuck in the "D" state, so it can't be 
terminated. Usually, once the filesystem is recovered, a process should be 
able to leave the D state, so I am not sure what causes this and whether 
I can ignore it without any bad consequences.


In any case, I would be very grateful if you experts could shed some 
light on this.


Our current ceph version is ceph-0.72.2-0.el6.x86_64
And the filesystem backend is XFS on fiber direct-attached storage. 



I can't speak to the specific errors you're seeing, but it looks like 
you have a failing or corrupted disk.


Things I would investigate:

1. Is the disk itself failing?  If this were a SATA disk, I'd check the
   SMART stats on the disk.  I haven't dealt with Fiber Channel disks
   since before SMART was standardized, so I can't tell you how to do that.
2. Get rid of the old ceph-osd process.  Reboot the node if you have
   to.  If things come up cleanly, then you're done.
3. Fsck the filesystem (see the sketch after this list).  If the FS is
   clean, then you probably corrupted the OSD journal.
4. How quickly do you need this fixed?  At this point, I'm out of
   suggestions, so I'd remove the osd, zap it, and add it back in. If
   you can wait, somebody might have a better suggestion.
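
For #3 and the --mkjournal repair you mentioned, the commands are roughly 
(OSD id and device are examples; the OSD must be stopped, and xfs_repair 
needs the filesystem unmounted):

  umount /var/lib/ceph/osd/ceph-12
  xfs_repair /dev/sdb1
  mount /dev/sdb1 /var/lib/ceph/osd/ceph-12
  ceph-osd -i 12 --mkjournal     # only if the journal really is corrupt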


Fiber Channel hardware is much more complicated than SATA and SAS.  
There are a lot more parts involved, which leaves more room for bugs.


If you see this problem come back on the same disk, I'd replace the 
disk.  If you see this happen again to other disks, I would get your 
Fiber Channel vendor involved.  It wouldn't hurt to make sure you have 
the latest firmware on the disks, enclosure, and FC adapter.







[ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Craig Lewis
I've correlated a large deep scrubbing operation to cluster stability
problems.

My primary cluster does a small amount of deep scrubs all the time, spread
out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and tries
to do all of the deep scrubs over the weekend.  The secondary starts
losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs
the oldest outstanding PG.  Roughly, in bash (columns 20 and 21 of "ceph pg
dump" are the deep-scrub date and time):

# Repeatedly deep-scrub the PG with the oldest deep-scrub timestamp
while true
do
  pg=$(ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' \
       | sort | head -1 | awk '{print $3}')
  [ -n "${pg}" ] || break
  ceph pg deep-scrub ${pg}
  # Wait for the running deep scrub to finish before starting the next one
  while ceph status | grep -q scrubbing+deep
  do
    sleep 5
  done
  sleep 30
done


Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary finishes
replicating from the primary.  Once it's caught up, the write load should
drop enough that opportunistic deep scrubs should have a chance to run.  It
should only take another week or two to catch up.
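
(For reference, toggling it cluster-wide is just:

  ceph osd set nodeep-scrub
  ceph osd unset nodeep-scrub

and individual PGs can still be deep-scrubbed by hand while the flag is set.)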


[ceph-users] I have PGs that I can't deep-scrub

2014-06-10 Thread Craig Lewis
Every time I deep-scrub one PG, all of the OSDs responsible get kicked
out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
fails the same way every time.  OSD logs are linked at the bottom.

What can I do to get this deep-scrub to complete cleanly?

This is the first time I've deep-scrubbed these PGs since Sage helped
me recover from some OSD problems
(http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html)

I can trigger the issue easily in this cluster, but have not been able
to re-create in other clusters.






The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
owned by OSDs [11,0]

This is a secondary cluster, so I stopped all external I/O on it.  I
set nodeep-scrub, and restarted both OSDs with:
  debug osd = 5/5
  debug filestore = 5/5
  debug journal = 1
  debug monc = 20/20

then I ran a deep-scrub on this PG.
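
(For reference, the manual scrub itself is just:

  ceph pg deep-scrub 40.11e

where 40.11e is the PG discussed here.)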

2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
77870 GB / 130 TB avail
2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
27701 GB data, 56218 GB used, 77870 GB / 130 TB avail


At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU
(100.3% +/- 1.0%).  Prior to this, they were both using ~30% CPU.  It
might've started a few seconds sooner, I'm watching top.

I forgot to watch IO stat until 10:56.  At this point, both OSDs are
reading.  iostat reports that they're both doing ~100
transactions/sec, reading ~1 MiBps, 0 writes.


At 11:01:26, iostat reports that both osds are no longer consuming any
disk I/O.  They both go for > 30 seconds with 0 transactions, and 0
kiB read/write.  There are small bumps of 2 transactions/sec for one
second, then it's back to 0.


At 11:02:41, the primary OSD gets kicked out by the monitors:
2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555
active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2
op/s
2014-06-10 11:02:57.801047 mon.0 [INF] osd.11 marked down after no pg
stats for 903.825187seconds
2014-06-10 11:02:57.823115 mon.0 [INF] osdmap e58834: 36 osds: 35 up, 36 in

Both ceph-osd processes (11 and 0) continue to use 100% CPU (same range).


At ~11:10, I see that osd.11 has resumed reading from disk at the
original levels (~100 tps, ~1MiBps read, 0 MiBps write).  Since it's
down, but doing something, I let it run.

Both the osd.11 and osd.0 repeat this pattern.  Reading for a while at
~1 MiBps, then nothing.  The duty cycle seems about 50%, with a 20
minute period, but I haven't timed anything.  CPU usage remains at
100%, regardless of whether IO is happening or not.


At 12:24:15, osd.11 rejoins the cluster:
2014-06-10 12:24:15.294646 mon.0 [INF] osd.11 10.193.0.7:6804/7100 boot
2014-06-10 12:24:15.294725 mon.0 [INF] osdmap e58838: 36 osds: 35 up, 36 in
2014-06-10 12:24:15.343869 mon.0 [INF] pgmap v8832827: 2560 pgs: 1
stale+active+clean+scrubbing+deep, 2266 active+clean, 5
stale+active+clean, 287 active+degraded, 1 active+clean+scrubbing;
27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 15650 B/s rd,
18 op/s; 3617854/61758142 objects degraded (5.858%)


osd.0's CPU usage drops back to normal when osd.11 rejoins the
cluster.  The PG stats have not changed.   The last_deep_scrub and
deep_scrub_stamp are still 48009'1904117 2014-05-21 07:28:01.315996
respectively.


This time, osd.0 did not get kicked out by the monitors.  In previous
attempts, osd.0 was kicked out 5-10 minutes after osd.11.  When that
happens, osd.0 rejoins the cluster after osd.11.


I have several more PGs exhibiting the same behavior.  At least 3 that
I know of, and many more that I haven't attempted to deep-scrub.






ceph -v: ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
ceph.conf: https://cd.centraldesktop.com/p/eAAADvxuAHJRUk4
ceph-osd.11.log (5.7 MiB):
https://cd.centraldesktop.com/p/eAAADvxyABPwaeM
ceph-osd.0.log (6.3 MiB):
https://cd.centraldesktop.com/p/eAAADvx0ADWEGng
ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADvxvAAylTW0

(the pg query was collected at 13:24, after the above events)




Things that probably don't matter:
The OSD partitions were created using ceph-disk-prepare --dmcrypt.


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-10 Thread Craig Lewis
After doing this, I've found that I'm having problems with a few
specific PGs.  If I set nodeep-scrub, then manually deep-scrub one
specific PG, the responsible OSDs get kicked out.  I'm starting a new
discussion, subject: "I have PGs that I can't deep-scrub"

I'll re-test this correlation after I fix the broken PGs.

On Mon, Jun 9, 2014 at 10:20 PM, Gregory Farnum  wrote:
> On Mon, Jun 9, 2014 at 6:42 PM, Mike Dawson  wrote:
>> Craig,
>>
>> I've struggled with the same issue for quite a while. If your i/o is similar
>> to mine, I believe you are on the right track. For the past month or so, I
>> have been running this cronjob:
>>
>> * * * * *   for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' |
>> sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg;
>> done
>>
>> That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7
>> days. Your script may be a bit better, but this quick and dirty method has
>> helped my cluster maintain more consistency.
>>
>> The real key for me is to avoid the "clumpiness" I have observed without
>> that hack where concurrent deep-scrubs sit at zero for a long period of time
>> (despite having PGs that were months overdue for a deep-scrub), then
>> concurrent deep-scrubs suddenly spike up and stay in the teens for hours,
>> killing client writes/second.
>>
>> The scrubbing behavior table[0] indicates that a periodic tick initiates
>> scrubs on a per-PG basis. Perhaps the timing of ticks aren't sufficiently
>> randomized when you restart lots of OSDs concurrently (for instance via
>> pdsh).
>>
>> On my cluster I suffer a significant drag on client writes/second when I
>> exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent
>> deep-scrubs get into the teens, I get a massive drop in client
>> writes/second.
>>
>> Greg, is there locking involved when a PG enters deep-scrub? If so, is the
>> entire PG locked for the duration or is each individual object inside the PG
>> locked as it is processed? Some of my PGs will be in deep-scrub for minutes
>> at a time.
>
> It locks very small regions of the key space, but the expensive part
> is that deep scrub actually has to read all the data off disk, and
> that's often a lot more disk seeks than simply examining the metadata
> is.
> -Greg
>
>>
>> 0: http://ceph.com/docs/master/dev/osd_internals/scrub/
>>
>> Thanks,
>> Mike Dawson
>>
>>
>>
>> On 6/9/2014 6:22 PM, Craig Lewis wrote:
>>>
>>> I've correlated a large deep scrubbing operation to cluster stability
>>> problems.
>>>
>>> My primary cluster does a small amount of deep scrubs all the time,
>>> spread out over the whole week.  It has no stability problems.
>>>
>>> My secondary cluster doesn't spread them out.  It saves them up, and
>>> tries to do all of the deep scrubs over the weekend.  The secondary
>>> starts loosing OSDs about an hour after these deep scrubs start.
>>>
>>> To avoid this, I'm thinking of writing a script that continuously scrubs
>>> the oldest outstanding PG.  In psuedo-bash:
>>> # Sort by the deep-scrub timestamp, taking the single oldest PG
>>> while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21,
>>> $1}' | sort | head -1 | read date time pg
>>>   do
>>>ceph pg deep-scrub ${pg}
>>>while ceph status | grep scrubbing+deep
>>> do
>>>  sleep 5
>>>done
>>>sleep 30
>>> done
>>>
>>>
>>> Does anybody think this will solve my problem?
>>>
>>> I'm also considering disabling deep-scrubbing until the secondary
>>> finishes replicating from the primary.  Once it's caught up, the write
>>> load should drop enough that opportunistic deep scrubs should have a
>>> chance to run.  It should only take another week or two to catch up.
>>>
>>>
>>>
>>


Re: [ceph-users] about rgw region and zone

2014-06-10 Thread Craig Lewis
The idea of regions and zones is to replicate Amazon's S3 storage.
Here are some links from Amazon describing EC2 regions and zones
(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html)
and S3 Regions 
(http://docs.aws.amazon.com/AmazonS3/latest/dev/LocationSelection.html).
The S3 links are more appropriate, but you need some of the EC2
background.


Clusters are (usually) a single group of servers in a single location.
This doesn't have to be the case, but people generally don't mix
multi-location clusters and federation.

Regions and Zones are however you want to set them up.  You could
have a single region called "Earth", or many Regions like Amazon does.
Each cluster belongs to a single region, and a region has one or more
clusters.

Zones are inside regions.  Federated replication happens between
zones.  A cluster can host any number of zones, as long as they all
belong to the same region.  Only the master zone is writable.  You can
read from the master or slave.

Several common setups are cross replication or ring replication
between clusters.  In a cross replication setup, cluster1 would have
zone1master and zone2slave.  cluster2 would have zone1slave and
zone2master.  In a ring, cluster1 would have zone1master and
zone3slave.  cluster2 would have zone2master and zone1slave.  cluster3
would have zone3master and zone2slave.  Cross replication is just the
two-cluster version of ring replication.  You're welcome to make this
replication strategy as complicated as you're willing to deal with.


As far as I know, there isn't any published disaster recovery
documentation.  My understanding is that in the event of a disaster in
the master zone, you disable replication and start writing to the
slave.  When the old master zone comes back online, you delete it, and
set it up as a slave of the current master.   Once re-replication
completes, you could fail back to the old master.  I'm not aware that
anybody has tested this, but the general idea should work.

I'm hoping that it might be possible to skip the delete, but it will
involve some extensive testing.  I'm not aware that anybody has
tested this, much less gotten it to work.  I plan to try it and write
it up... but I'm not sure when.



On Wed, Jun 4, 2014 at 11:15 PM, lijie8...@126.com  wrote:
> hi, I have some questions about regions and zones:
>
> 1.  Why define the concepts of "REGION" and "ZONE"?  What is their purpose?
> 2.  What is the relation between region, zone, and cluster?  How should
> the federated architecture be designed, and how does disaster recovery work?
>
>
> Expecting your early reply, thank you.
> 
> lijie8...@126.com
>
>


Re: [ceph-users] I have PGs that I can't deep-scrub

2014-06-11 Thread Craig Lewis
New logs, with debug ms = 1, debug osd = 20.
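
(For reference, something like this will bump those log levels at runtime and
re-trigger the scrub; the PG id and OSD ids are the ones from the earlier
message:)

$ ceph tell osd.11 injectargs '--debug-ms 1 --debug-osd 20'
$ ceph tell osd.0 injectargs '--debug-ms 1 --debug-osd 20'
$ ceph pg deep-scrub 40.11e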


In this timeline, I started the deep-scrub at 11:04:00.  Ceph started
deep-scrubbing at 11:04:03.

osd.11 started consuming 100% CPU around 11:07.  Same for osd.0.  CPU
usage is all user; iowait is < 0.10%.  There is more variance in the
CPU usage now, ranging between 98.5% and 101.2%

This time, I didn't see any major IO, read or write.

osd.11 was marked down at 11:22:00:
2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg
stats for 902.656777seconds

osd.0 was marked down at 11:36:00:
 2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg
stats for 902.498894seconds




ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
ceph-osd.0.log.gz (140MiB, 18MiB compressed):
https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
ceph-osd.11.log.gz (131MiB, 17MiB compressed):
https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
ceph pg 40.11e query: https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc





On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil  wrote:
> Hi Craig,
>
> It's hard to say what is going wrong with that level of logs.  Can you
> reproduce with debug ms = 1 and debug osd = 20?
>
> There were a few things fixed in scrub between emperor and firefly.  Are
> you planning on upgrading soon?
>
> sage
>
>
> On Tue, 10 Jun 2014, Craig Lewis wrote:
>
>> Every time I deep-scrub one PG, all of the OSDs responsible get kicked
>> out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
>> fails the same way every time.  OSD logs are linked at the bottom.
>>
>> What can I do to get this deep-scrub to complete cleanly?
>>
>> This is the first time I've deep-scrubbed these PGs since Sage helped
>> me recover from some OSD problems
>> (http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html)
>>
>> I can trigger the issue easily in this cluster, but have not been able
>> to re-create in other clusters.
>>
>>
>>
>>
>>
>>
>> The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
>> are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
>> owned by OSDs [11,0]
>>
>> This is a secondary cluster, so I stopped all external I/O on it.  I
>> set nodeep-scrub, and restarted both OSDs with:
>>   debug osd = 5/5
>>   debug filestore = 5/5
>>   debug journal = 1
>>   debug monc = 20/20
>>
>> then I ran a deep-scrub on this PG.
>>
>> 2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
>> active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
>> 77870 GB / 130 TB avail
>> 2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
>> active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
>> 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
>>
>>
>> At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU
>> (100.3% +/- 1.0%).  Prior to this, they were both using ~30% CPU.  It
>> might've started a few seconds sooner, I'm watching top.
>>
>> I forgot to watch IO stat until 10:56.  At this point, both OSDs are
>> reading.  iostat reports that they're both doing ~100
>> transactions/sec, reading ~1 MiBps, 0 writes.
>>
>>
>> At 11:01:26, iostat reports that both osds are no longer consuming any
>> disk I/O.  They both go for > 30 seconds with 0 transactions, and 0
>> kiB read/write.  There are small bumps of 2 transactions/sec for one
>> second, then it's back to 0.
>>
>>
>> At 11:02:41, the primary OSD gets kicked out by the monitors:
>> 2014-06-10 11:02:41.168443 mon.0 [INF] pgmap v8832125: 2560 pgs: 2555
>> active+clean, 4 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
>> 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail; 1996 B/s rd, 2
>> op/s
>> 2014-06-10 11:02:57.801047 mon.0 [INF] osd.11 marked down after no pg
>> stats for 903.825187seconds
>> 2014-06-10 11:02:57.823115 mon.0 [INF] osdmap e58834: 36 osds: 35 up, 36 in
>>
>> Both ceph-osd processes (11 and 0) continue to use 100% CPU (same range).
>>
>>
>> At ~11:10, I see that osd.11 has resumed reading from disk at the
>> original levels (~100 tps, ~1MiBps read, 0 MiBps write).  Since it's
>> down, but doing something, I let it run.
>>
>> Both the osd.11 and osd.0 repeat this pattern.  Reading for a while at
>> ~1 MiBps, then nothing.  The duty cycle seems about 50%, with a 20
>> minute period, but I haven't timed anything.  CPU usage remains at
>> 100%, regar

Re: [ceph-users] Some easy questions

2014-06-17 Thread Craig Lewis
> 3. You must use MDS from the start, because it's a metadata
> structure/directory that only gets populated when writing files through
> cephfs / FUSE. Otherwise, it doesn't even know about other objects and
> therefore isn't visible on cephfs.
> 4. MDS does not get updated when radosgw / S3 is used.

You can use MDS whenever you want to start using CephFS.  CephFS and
RadosGW are independent; they use different pools.  Data added to
RadosGW is not visible to CephFS, and data added to CephFS is not
visible to RadosGW.

It's all visible to RADOS, because both are implemented on top of
RADOS.  More on that later.



> So my questions are:
> * radosgw uses the ".bucket" pool for managing and controlling which buckets
> there are?

By default, RadosGW uses .rgw, .rgw.buckets.index, and .rgw.buckets.
Once you start creating RadosGW users, it will create some more pools,
depending on which of the features you're using.

You can create different pools (using placement targets), and assign
users and buckets to them.  The common example is one user's data
should be on SSDs, and another user should be on HDDs.



> * a new bucket is written in ".bucket"  and there will be an entry of some
> sort in ".bucket-index" to keep track of objects created within that bucket?

Actually, my .rgw.buckets.index pool has 0 bytes in use.  It looks
like everything goes into the .rgw.buckets pool.

The bucket is just an object that contains a list of the files (and
some metadata about those files).



> * I.e., buckets and metadata about objects that live inside buckets are not
> as such available from rados?  (you can't query rados for these objects,
> grouped by user/bucket?)

Effectively, that's true.  The objects are available using RADOS, but
they're not in a human readable format.  If you list the contents of
the .rgw.buckets pool, you'll see stuff like:
us-west-1.35026898.2__shadow__pcARf6VxB_ZPy0AwF-FKSADrV_H5l_m_2
us-west-1.43275004.2_5f33e39093fda01db84b6d32a1a1b3352b4b23f2778a756f751c0e9e51d62f6e
us-west-1.43273412.2_669465eb9b41d94c4ebfc1bff7a26c9eee4ff065297f972132fc54942b160994
us-west-1.50224305.2__shadow__nJadmBD7-3loD4kBw7ug0HlO5RSxSLP_1
us-west-1.35026898.2__shadow__5EKY5xkrxmqr7iSrscl6emryM9DxBYx_1
us-west-1.43275004.2_70fcd7b628b7a13e92002b15dbfb7354668ada2eefbe9dfd2ffa5f5dd432ac59
us-west-1.51921289.1_989ecb8f4b786d60c4dc656c39220ea5cca76716789a1b2bfe66810ffd7846f3
...

RadosGW breaks every file up into 4 MiB chunks.  With enough effort, you
could reconstruct the bucket and object manually.  I've done it once
(to prove I could), but I don't plan to do it again.
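
If you're curious, the raw pieces can be pulled out with the rados tool.  A
rough sketch; the object names below are made up, the real ones come from the
listing:

$ rados -p .rgw.buckets ls | grep myobject       # find the head and __shadow_ pieces
$ rados -p .rgw.buckets get us-west-1.35026898.2_myobject /tmp/part.0
$ rados -p .rgw.buckets get us-west-1.35026898.2__shadow__XXXXXXXX_1 /tmp/part.1
$ cat /tmp/part.0 /tmp/part.1 > /tmp/myobject    # head plus shadow pieces, in order

Like I said: doable, but tedious.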



> * Are there alternatives for accessing the radosgw information besides going
> through the S3 interface (command line)?

I've been using s3cmd for command line access and general maintenance.
It's a python script that talks to the S3 interface though, so you'll
still need a website.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] about rgw region and zone

2014-06-17 Thread Craig Lewis
Metadata replication is about keeping a global namespace in all zones.  It
will replicate all of your users and bucket names, but not the data itself.
That way you don't end up with buckets named "mybucket" in your US and EU
zones that are owned by different people.  It's up to you to decide if this
is something you want or not.  Metadata replication won't protected against
the primary zone going offline.

Data replication will copy the metadata and data.  If the primary goes
offline, you'll be able to read everything that has replicated to the
secondary zone.  You should make sure you have enough bandwidth between the
zones (and that latency is low enough) so that replication can keep up.
 If replication falls behind, anything not replicated will catch up when
the primary comes back up.

I haven't found any docs on the process to promote a secondary zone to
primary.  Right now, it doesn't look like a good idea.  If the master goes
offline, you can read from the secondary while you get the master back
online.  The failover/failback are expensive (time and bandwidth wise), so
it would take a pretty big problem before it's a good idea to promote the
secondary to primary.



Regarding your FastCGI error, when I see that, it's because my RadosGW
daemon isn't running.  Check if it's running (`ps auxww | grep radosgw`).
 If it's not, try `start radosgw-all`, then restart apache.  If that
doesn't work, you might need some extra configs in ceph.conf.
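
In other words, something like this (Upstart commands, as used by the Ubuntu
packages):

$ ps auxww | grep radosgw          # is the daemon actually running?
$ sudo start radosgw-all           # if not, start it
$ sudo service apache2 restart     # then restart Apache so FastCGI reconnects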



Wido den Hollander just posted some WSGI examples in a thread titled "REST
API and uWSGI?"  If you're still interested in getting WSGI to work, check
th
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what is the Recommandation configure for a ceph cluster with 10 servers without memory leak?

2014-06-18 Thread Craig Lewis
I haven't seen behavior like that.  I have seen my OSDs use a lot of RAM
while they're doing a recovery, but it goes back down when they're done.

Your OSD is doing something; it's using 126% CPU.  What do `ceph osd tree`
and `ceph health detail` say?


When you say you're installing Ceph on 10 servers, are you running a monitor
on all 10 servers?




On Wed, Jun 18, 2014 at 4:18 AM, wsnote  wrote:

> If I install ceph on 10 servers with one disk in each server, the problem
> remains.
> This is the memory usage of ceph-osd.
> ceph-osd VIRT:10.2G, RES: 4.2G
> The usage of ceph-osd is too big!
>
>
> At 2014-06-18 16:51:02,wsnote  wrote:
>
> Hi, Lewis!
> I come up with a question and don't know how to solve, so I ask you for
> help.
> I can successfully install ceph in a cluster with 3 or 4 servers, but I fail
> when I try to do it with 10 servers.
> I install it and start it, and then there is a server whose memory usage rises
> to 100% and that server crashes.  I have to restart it.
> All the configs are the same.  I don't know what the problem is.
> Can you give me some suggestions?
> Thanks!
>
> ceph.conf:
> [global]
> auth supported = none
>
> ;auth_service_required = cephx
> ;auth_client_required = cephx
> ;auth_cluster_required = cephx
> filestore_xattr_use_omap = true
>
> max open files = 131072
> log file = /var/log/ceph/$name.log
> pid file = /var/run/ceph/$name.pid
> keyring = /etc/ceph/keyring.admin
>
> ;mon_clock_drift_allowed = 1 ;clock skew detected
>
> [mon]
> mon data = /data/mon$id
> keyring = /etc/ceph/keyring.$name
>  [mds]
> mds data = /data/mds$id
> keyring = /etc/ceph/keyring.$name
> [osd]
> osd data = /data/osd$id
> osd journal = /data/osd$id/journal
> osd journal size = 1024
> keyring = /etc/ceph/keyring.$name
> osd mkfs type = xfs
> osd mount options xfs = rw,noatime
> osd mkfs options xfs = -f
> filestore fiemap = false
>
> In every server, there is an mds, an mon, 11 osd with 4TB space each.
> mon address is public IP, and osd address has an public IP and an cluster
> IP.
>
> wsnote
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] java.net.UnknownHostException while creating a bucket

2014-06-18 Thread Craig Lewis
> java.net.UnknownHostException:
my-new-ceph-bucket.svl-cephstack-05.cisco.com

Amazon's S3 libraries generate the URL by prepending the bucket name to the
hostname.  See
https://ceph.com/docs/master/radosgw/config/#enabling-subdomain-s3-calls

Aside from the RadosGW configuration mentioned above, you also need a real
DNS.  I have these DNS entries:
us-west-1.ceph.local. IN A 192.168.0.2
*.us-west-1.ceph.local. IN CNAME us-west-1.ceph.local.


If you don't have access to a DNS server, the ceph.com documentation
mentions setting up Dnsmasq.
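
If you do go the Dnsmasq route, a minimal sketch looks something like this (the
hostname, IP, and client section name are placeholders matching the example
records above):

# ceph.conf, on the gateway host:
[client.radosgw.us-west-1]
    rgw dns name = us-west-1.ceph.local

# /etc/dnsmasq.conf, answering for the zone name and all bucket subdomains:
address=/us-west-1.ceph.local/192.168.0.2
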
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using S3 REST API

2014-06-18 Thread Craig Lewis
You went through the RadosGW configuration at
https://ceph.com/docs/master/radosgw/config/ ?

Once you complete that, you can test it by going to http://cluster.hostname/.
 You should get
<ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Owner>
    <ID>anonymous</ID>
    <DisplayName></DisplayName>
  </Owner>
  <Buckets></Buckets>
</ListAllMyBucketsResult>

If you get a 500 error, you'll need to go through the apache error_log and
the radosgw logs to figure out what's wrong.  A common mistake is not
starting both apache and the radosgw daemon.



On Wed, Jun 18, 2014 at 9:49 AM, Prabhat Kumar -X (prabhaku - INFOSYS
LIMITED at Cisco)  wrote:

>  Hi All,
>
>
>
> I am new to ceph and trying to make use of S3 REST API but not able to
> create connection with my end point.
>
>
>
> Is there any document or guide to implement connectivity using S3 REST API?
>
>
>
>
>
> Thanks
>
> Prabhat
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using S3 REST API

2014-06-18 Thread Craig Lewis
I just replied to another user with a similar issue.  Take a look at a
recent post with the subject line "java.net.UnknownHostException while
creating a bucket".
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what is the Recommandation configure for a ceph cluster with 10 servers without memory leak?

2014-06-18 Thread Craig Lewis
etail
> HEALTH_WARN 374 pgs degraded; 11053 pgs down; 6805 pgs incomplete; 51729
> pgs peering; 62838 pgs stale; 187069 pgs stuck inactive; 62838 pgs stuck
> stale; 187153 pgs stuck unclean; 45 requests are blocked > 32 sec; 10 osds
> have slow requests; 1/15 in osds are down
> pg 0.fb7f is stuck inactive since forever, current state down+peering,
> last acting [404]
> pg 1.fb7e is stuck inactive since forever, current state
> stale+down+peering, last acting [202]
> pg 2.fb7d is stuck inactive since forever, current state creating, last
> acting [204,902,505]
> pg 0.fb7e is stuck inactive since forever, current state stale+peering,
> last acting [704,1004,504]
> pg 1.fb7f is stuck inactive since forever, current state creating, last
> acting [404,204,804]
> pg 2.fb7c is stuck inactive since forever, current state creating, last
> acting [805,901,404]
> pg 0.fb7d is stuck inactive since forever, current state creating, last
> acting [903,204,803]
> pg 1.fb7c is stuck inactive since forever, current state creating, last
> acting [802,905,505]
> pg 2.fb7f is stuck inactive since forever, current state creating, last
> acting [804,902]
> pg 1.fb7d is stuck inactive since forever, current state creating, last
> acting [803,404,204]
> pg 2.fb7e is stuck inactive since forever, current state creating, last
> acting [404,905,802]
> pg 0.fb7c is stuck inactive since forever, current state stale+peering,
> last acting [904,405]
> pg 2.fb79 is stuck inactive since forever, current state creating, last
> acting [804,901]
> pg 1.fb7a is stuck inactive since forever, current state creating, last
> acting [903,505,803]
> pg 0.fb7b is stuck inactive since forever, current state stale+peering,
> last acting [801,503]
> pg 2.fb78 is stuck inactive since forever, current state creating, last
> acting [903,404]
> pg 1.fb7b is stuck inactive since forever, current state creating, last
> acting [505,803,303]
> pg 0.fb7a is stuck inactive since forever, current state
> stale+remapped+peering, last acting [1003]
> pg 2.fb7b is stuck inactive since forever, current state creating, last
> acting [303,505,802]
> pg 1.fb78 is stuck inactive since forever, current state creating, last
> acting [803,303,905]
> pg 0.fb79 is stuck inactive since forever, current state creating, last
> acting [901,403,805]
> pg 0.fb78 is stuck inactive since forever, current state creating, last
> acting [404,901]
> pg 2.fb7a is stuck inactive since forever, current state creating, last
> acting [403,303,805]
> pg 1.fb79 is stuck inactive since forever, current state creating, last
> acting [803,901,404]
> pg 0.fb77 is stuck inactive for 24155.756030, current state stale+peering,
> last acting [101,1005]
> pg 1.fb76 is stuck inactive since forever, current state creating, last
> acting [901,505,403]
> pg 2.fb75 is stuck inactive since forever, current state creating, last
> acting [905,403,204]
> pg 1.fb77 is stuck inactive since forever, current state creating, last
> acting [905,805,204]
> pg 2.fb74 is stuck inactive since forever, current state creating, last
> acting [901,404]
> pg 0.fb75 is stuck inactive since forever, current state creating, last
> acting [903,403]
> pg 1.fb74 is stuck inactive since forever, current state creating, last
> acting [901,204,403]
> pg 2.fb77 is stuck inactive since forever, current state creating, last
> acting [505,802,905]
> pg 0.fb74 is stuck inactive for 24042.660267, current state
> stale+incomplete, last acting [101]
> pg 1.fb75 is stuck inactive since forever, current state creating, last
> acting [905,403,804]
>
>
>
>
> At 2014-06-19 04:31:09,"Craig Lewis"  wrote:
>
> I haven't seen behavior like that.  I have seen my OSDs use a lot of RAM
> while they're doing a recovery, but it goes back down when they're done.
>
> Your OSD is doing something, it's using 126% CPU. What does `ceph osd
> tree` and `ceph health detail` say?
>
>
> When you say you're installing Ceph on 10 severs, are you running a
> monitor on all 10 servers?
>
>
>
>
> On Wed, Jun 18, 2014 at 4:18 AM, wsnote  wrote:
>
>> If I install ceph in 10 servers with one disk each servers, the problem
>> remains.
>> This is the memory usage of ceph-osd.
>> ceph-osd VIRT:10.2G, RES: 4.2G
>> The usage of ceph-osd is too big!
>>
>>
>> At 2014-06-18 16:51:02,wsnote  wrote:
>>
>> Hi, Lewis!
>> I come up with a question and don't know how to solve, so I ask you for
>> help.
>> I can succeed to install ceph in a cluster with 3 or 4 servers but fail
>> to do it with 10 servers.
>> I install it and start it, then there would be a server whose m

Re: [ceph-users] Some easy questions

2014-06-19 Thread Craig Lewis
>
>
>> Just to clarify. Suppose you insert an object into rados directly, you
> won't be able to see that file
> in cephfs anywhere, since it won't be listed in MDS. Correct?
>
> Meaning, you can start using CephFS+MDS at any point in time, but it will
> only ever list objects/files
> that were inserted through cephfs/FUSE, nothing else.?
>

Correct.




>
> As S3 and Swift are just interfaces to tlak to radosgw, I believe objects
> added through S3 are also visible
> on Swift?
>

I believe so, but I haven't setup Swift to test it.  RGW has users, and the
user has S3 and Swift credentials.  So I'm pretty sure.

Swift has more features than S3.  IIRC, extra ACLs and metadata.  So you're
better off sticking with one rather than mixing protocols.  It should work,
but you might get some weirdness.



>
> When adding the S3/Swift users, I noticed that radosgw-admin is capable of
> a lot more. The use of how the commands
> are tied together though isn't very intuitive and I had to rely on
> examples to get things done.
>
> I'm now figuring out how to use that so I can list the size of a single
> bucket and user.
>
>

Check out `radosgw-admin bucket stats`.  I don't think you can get usage
for a single user, but you can sum over the user's buckets.
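
A rough sketch (the user and bucket names are placeholders; size_kb_actual is
the field firefly-era radosgw-admin prints in the bucket's usage section):

$ radosgw-admin bucket list --uid=johndoe
$ radosgw-admin bucket stats --bucket=bucket1 | grep size_kb_actual
$ radosgw-admin bucket stats --bucket=bucket2 | grep size_kb_actual
# ...then add the size_kb_actual values together.  `s3cmd du` run as that user
# gives a similar per-bucket total from the client side.
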
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOSGW + OpenStack basic question

2014-06-19 Thread Craig Lewis
Unfortunately, I can't help much.  I'm just using the S3 interface for
object storage.

Looking back at the archives, this question does come up a lot, and there
aren't a lot of replies.  The best thread I see in the archive is
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/006283.html

Sébastien Han wrote

Well after restarting the services run:

$ cinder create 1

Then you can check both status in Cinder and Ceph:

For Cinder run:
$ cinder list

For Ceph run:
$ rbd -p  ls

If the image is there, you’re good.

Cheers.


There is some follow up in the thread, so you might want to read the whole
discussion.


On Thu, Jun 19, 2014 at 12:38 AM, Vickey Singh 
wrote:

> Hello Craig
>
> I am following you up on ceph mailing list, looks like you are the expert
> for CEPH RGW. can you please guide me here.
>
> Thanks in advance for your help.
>
> - Vickram -
>
>
> -- Forwarded message --
> From: Vickey Singh 
> Date: Wed, Jun 18, 2014 at 7:57 PM
> Subject: RADOSGW + OpenStack basic question
> To: ceph-users@lists.ceph.com
>
>
> Hello Cephers
>
>
> I have been following ceph documentation to install and configure RGW and
> fortunately everything went fine and RGW is correctly setup.
>
> Next i would like to use RGW with OpenStack , and for this i have followed
>  http://ceph.com/docs/master/radosgw/keystone/   , as per the document i
> have done all the steps.
>
> But how should i test RGW and OpenStack integration , the document does
> not show steps to verify the integration or how to use it further with
> openstack.
>
> Can you please point me to the right direction , for testing Ceph RGW and
> OpenStack integration . Is there any work / blog on internet by
> someone which can demonstrate the things “How to use Ceph RGW with
> Openstack"
>
> Please help.
>
>
> Regards
> Vickram Singh
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOSGW + OpenStack basic question

2014-06-19 Thread Craig Lewis
There is a tool named s3cmd; I use it for minor things.  The first time
you run it, use `s3cmd --configure`.

Most of my access to the cluster is using the Amazon S3 library.
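
A minimal s3cmd session against RGW looks something like this (bucket and file
names are just examples):

$ s3cmd --configure                 # enter the RGW user's access and secret key
# then edit ~/.s3cfg and point host_base / host_bucket at your gateway hostname
$ s3cmd mb s3://testbucket
$ s3cmd put ./somefile s3://testbucket/
$ s3cmd ls s3://testbucket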


On Thu, Jun 19, 2014 at 1:07 PM, Vickey Singh 
wrote:

> Hello Craig
>
> I want to use object storage only, NOT cinder and glance.  I have
> successfully deployed and installed RGW, and I can use it from the swift
> client.
>
> But I need to access it using other clients, like an S3 client, or with
> openstack keystone.
>
> Do you know how I can test it from an S3 client?  What would be the tool name
> or command for that?  How are you accessing your object storage?
>
> Thanks for writing back.
>
>
>
> On Thu, Jun 19, 2014 at 8:43 PM, Craig Lewis 
> wrote:
>
>> Unfortunately, I can't help much.  I'm just using the S3 interface for
>> object storage.
>>
>>  Looking back at the archives, this question does come up a lot, and
>> there aren't a lot of replies.  The best thread I see in the archive is
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/006283.html
>> .
>>
>> Sébastien Han wrote
>>
>>
>> Well after restarting the services run:
>>
>> $ cinder create 1
>>
>> Then you can check both status in Cinder and Ceph:
>>
>> For Cinder run:
>> $ cinder list
>>
>> For Ceph run:
>> $ rbd -p  ls
>>
>> If the image is there, you’re good.
>>
>> Cheers.
>>
>>
>> There is some follow up in the thread, so you might want to read the
>> whole discussion.
>>
>>
>>
>>
>> On Thu, Jun 19, 2014 at 12:38 AM, Vickey Singh <
>> vickey.singh22...@gmail.com> wrote:
>>
>>> Hello Craig
>>>
>>> I am following you up on ceph mailing list, looks like you are the
>>> expert for CEPH RGW. can you please guide me here.
>>>
>>> Thanks in advance for your help.
>>>
>>> - Vickram -
>>>
>>>
>>> -- Forwarded message --
>>> From: Vickey Singh 
>>> Date: Wed, Jun 18, 2014 at 7:57 PM
>>> Subject: RADOSGW + OpenStack basic question
>>> To: ceph-users@lists.ceph.com
>>>
>>>
>>> Hello Cephers
>>>
>>>
>>> I have been following ceph documentation to install and configure RGW
>>> and fortunately everything went fine and RGW is correctly setup.
>>>
>>> Next i would like to use RGW with OpenStack , and for this i have
>>> followed  http://ceph.com/docs/master/radosgw/keystone/   , as per the
>>> document i have done all the steps.
>>>
>>> But how should i test RGW and OpenStack integration , the document does
>>> not show steps to verify the integration or how to use it further with
>>> openstack.
>>>
>>> Can you please point me to the right direction , for testing Ceph RGW
>>> and OpenStack integration . Is there any work / blog on internet by
>>> someone which can demonstrate the things “How to use Ceph RGW with
>>> Openstack"
>>>
>>> Please help.
>>>
>>>
>>> Regards
>>> Vickram Singh
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve performance of ceph objcect storage cluster

2014-06-26 Thread Craig Lewis
Cern noted that they needed to reformat to put the journal in a partition
rather than on the OSD's filesystem, like you did.  See
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern, slide 24.

When I saw that ceph-disk prepare created a journal partition, I thought it
was stupid to force a seek like that.  (This was before I saw Cern's
slides.)  I really should've known better; there's a reason it's the
default behavior.  I didn't even benchmark the two. *hangs head in shame*

I really can't tell you why it's a bad idea, but I can say that my recoveries
are extremely painful.  I'm using RadosGW, and I only care about seconds of
latency.  During large recoveries (like adding new nodes), people complain
about how slow the cluster is.

I'm in the middle of rolling out SSD journals to all machines.
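
For anyone curious, moving an existing OSD's journal onto an SSD partition goes
roughly like this (the OSD id, partition, and default data path are
placeholders; do one OSD at a time):

$ stop ceph-osd id=12                      # or: service ceph stop osd.12
$ ceph-osd -i 12 --flush-journal           # drain the old journal
$ rm /var/lib/ceph/osd/ceph-12/journal
$ ln -s /dev/disk/by-partuuid/<ssd-partition-uuid> /var/lib/ceph/osd/ceph-12/journal
$ ceph-osd -i 12 --mkjournal               # initialize the journal on the SSD
$ start ceph-osd id=12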




On Tue, Jun 24, 2014 at 11:52 PM, wsnote  wrote:

> OS: CentOS 6.5
> Version: Ceph 0.79
>
> Hi, everybody!
> I have installed a ceph cluster with 10 servers.
> I tested the throughput of the ceph cluster within the same datacenter.
> Uploading 1GB files from one or several servers to one or several servers,
> the total is about 30MB/s.
> That is to say, there is no throughput difference between one server and one
> cluster when uploading files.
> How to optimize the performance of ceph object storage?
> Thanks!
>
>
> 
> Info about ceph cluster:
> 4 MONs in the first 4 nodes in the cluster.
> 11 OSDs in each server, 109 OSDs in total (one disk was bad).
> 4TB each disk, 391TB in total (109*4-391=45TB.  Where did that 45TB of space go?)
> 1 RGW in each server, 10 RGWs in total.That is to say, I can use S3 API in
> each Server.
>
> ceph.conf:
> [global]
> auth supported = none
>
> ;auth_service_required = cephx
> ;auth_client_required = cephx
> ;auth_cluster_required = cephx
> filestore_xattr_use_omap = true
>
> max open files = 131072
> log file = /var/log/ceph/$name.log
> pid file = /var/run/ceph/$name.pid
> keyring = /etc/ceph/keyring.admin
>
> mon_clock_drift_allowed = 2 ;clock skew detected
>
> [mon]
> mon data = /data/mon$id
> keyring = /etc/ceph/keyring.$name
>  [osd]
> osd data = /data/osd$id
> osd journal = /data/osd$id/journal
> osd journal size = 1024;
> keyring = /etc/ceph/keyring.$name
> osd mkfs type = xfs
> osd mount options xfs = rw,noatime
> osd mkfs options xfs = -f
>
> [client.radosgw.cn-bj-1]
> rgw region = cn
> rgw region root pool = .cn.rgw.root
> rgw zone = cn-bj
> rgw zone root pool = .cn-wz.rgw.root
> host = yun168
> public_addr = 192.168.10.115
> rgw dns name = s3.domain.com
> keyring = /etc/ceph/ceph.client.radosgw.keyring
> rgw socket path = /var/run/ceph/$name.sock
> log file = /var/log/ceph/radosgw.log
> debug rgw = 20
> rgw print continue = true
> rgw should log = true
>
>
>
>
> [root@yun168 ~]# ceph -s
> cluster e48b0d5b-ff08-4a8e-88aa-4acd3f5a6204
>  health HEALTH_OK
>  monmap e7: 4 mons at {... ...  ...}, election epoch 78, quorum
> 0,1,2,3 0,1,2,3
>  mdsmap e49: 0/0/1 up
>  osdmap e3722: 109 osds: 109 up, 109 in
>   pgmap v106768: 29432 pgs, 19 pools, 12775 GB data, 12786 kobjects
> 640 GB used, 390 TB / 391 TB avail
>29432 active+clean
>   client io 1734 kB/s rd, 29755 kB/s wr, 443 op/s
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Difference between "ceph osd reweight" and "ceph osd crush reweight"

2014-06-26 Thread Craig Lewis
Note that 'ceph osd reweight' is not a persistent setting.  When an OSD
gets marked out, the osd weight will be set to 0.  When it gets marked in
again, the weight will be changed to 1.

Because of this, 'ceph osd reweight' is a temporary solution.  You should
only use it to keep your cluster running while you're ordering more
hardware.
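
For reference, the two commands look like this (the OSD id and weights are just
examples):

$ ceph osd crush reweight osd.12 3.64   # persistent CRUSH weight, usually the disk size in TB
$ ceph osd reweight 12 0.8              # temporary override between 0.0 and 1.0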




On Thu, Jun 26, 2014 at 10:05 AM, Gregory Farnum  wrote:

> On Thu, Jun 26, 2014 at 7:03 AM, Micha Krause  wrote:
> > Hi,
> >
> > could someone explain to me what the difference is between
> >
> > ceph osd reweight
> >
> > and
> >
> > ceph osd crush reweight
>
> "ceph osd crush reweight" sets the CRUSH weight of the OSD. This
> weight is an arbitrary value (generally the size of the disk in TB or
> something) and controls how much data the system tries to allocate to
> the OSD.
>
> "ceph osd reweight" sets an override weight on the OSD. This value is
> in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
> data that would otherwise live on this drive. It does *not* change the
> weights assigned to the buckets above the OSD, and is a corrective
> measure in case the normal CRUSH distribution isn't working out quite
> right. (For instance, if one of your OSDs is at 90% and the others are
> at 50%, you could reduce this weight to try and compensate for it.)
>
> It looks like our docs aren't very clear on the difference, when it
> even mentions them...and admittedly it's a pretty subtle issue!
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with RadosGW and special characters

2014-06-26 Thread Craig Lewis
Note that wget did URL encode the space ("test file" became "test%20file"),
because it knows that a space is never valid.  It can't know if you meant
an actual plus or an encoded space in "test+file", so it left it alone.

I will say that I would prefer that the + be left alone.  If I have a
static "test+file", Apache will serve that static file correctly.



How badly do you need this to work, right now?  If you need it now, I can
suggest a workaround.  This is a dirty hack, and I'm not saying it's a good
idea.  It's more of a thought exercise.

A quick google indicates that mod_rewrite might help:
http://stackoverflow.com/questions/459667/how-to-encode-special-characters-using-mod-rewrite-apache
.

But that might make the problem worse for other characters... If it does,
I'm sure I could get it working by installing an Apache hook.  Off the top
of my head, I'd try a hook in
http://perl.apache.org/docs/2.0/user/handlers/http.html#PerlFixupHandler to
replace all + characters with the correct escape sequence, %2B.  I know
mod_python can hook into Apache too.  I don't know if nginx has
a similar capability.


As with all dirty hacks, if you actually implement it, you'll want to watch
the release notes.  Once you work around a bug, someone will fix the bug
and break your hack.




On Thu, Jun 26, 2014 at 8:54 AM, Brian Rak  wrote:

>  Going back to my first post, I linked to this
> http://stackoverflow.com/questions/1005676/urls-and-plus-signs
>
> Per the defintion of application/x-www-form-urlencoded:
> http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
>
> "Control names and values are escaped. Space characters are replaced by
> `+', and then reserved characters are escaped as described in [RFC1738]
> ,"
>
> The whole +=space thing is only for the query portion of the URL, not the
> filename.
>
> I've done some testing with nginx, and this is how it behaves:
>
> On the server, somewhere in the webroot:
>
> echo space > "test file"
>
> Then, from a client:
> $ wget --spider "http://example.com/test/test file"
> 
>
> Spider mode enabled. Check if remote file exists.
> --2014-06-26 11:46:54--  http://example.com/test/test%20file
> Connecting to example.com:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 6 [application/octet-stream]
> Remote file exists.
>
> $ wget --spider "http://example.com/test/test+file";
> 
>
> Spider mode enabled. Check if remote file exists.
> --2014-06-26 11:46:57--  http://example.com/test/test+file
> Connecting to example.com:80... connected.
> HTTP request sent, awaiting response... 404 Not Found
>
> Remote file does not exist -- broken link!!!
>
> These tests were done just with the standard filesystem.  I wasn't using
> radosgw for this.  Feel free to repeat with the web server of your choice,
> you'll find the same thing happens.
>
> URL decoding the path is not the correct behavior.
>
>
>
> On 6/26/2014 11:36 AM, Sylvain Munaut wrote:
>
> Hi,
>
>
>  Based on the debug log, radosgw is definitely the software that's
> incorrectly parsing the URL.  For example:
>
>
> 2014-06-25 17:30:37.383134 7f7c6cfa9700 20
> REQUEST_URI=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383199 7f7c6cfa9700 10
> s->object=ubuntu/pool/main/a/adduser/adduser_3.113 nmu3ubuntu3_all.deb
> s->bucket=ubuntu
>
> I'll dig into this some more, but it definitely looks like radosgw is the
> one that's unencoding the + character here.  How else would it be receiving
> the request_uri with the + in it, but then a little bit later the request
> has a space in it instead?
>
>  Note that AFAIK, in fastcgi, REQUEST_URI is _supposed_ to be an URL
> encoded version and should be URL-decoded by the fastcgi handler. So
> converting the + to ' ' seems valid to me.
>
>
> Cheers,
>
>Sylvain
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with RadosGW and special characters

2014-06-27 Thread Craig Lewis
backslash characters (\) are known to cause problems for some clients (like
s3cmd).

Try removing them from your secret, and see if that works.  If it doesn't,
just remove the key and secret, and regenerate until the secret doesn't
have any backslashes.
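
A rough sketch of doing that (user and subuser names are placeholders; repeat
until the generated secret has no backslashes):

# For an S3 key:
$ radosgw-admin key rm --uid=johndoe --key-type=s3 --access-key=<old access key>
$ radosgw-admin key create --uid=johndoe --key-type=s3 --gen-access-key --gen-secret
# For a Swift subuser, as in the question below:
$ radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret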


On Fri, Jun 27, 2014 at 12:22 AM, Florent B  wrote:

>  I have a similar issue but I don't know if it is related to yours.
>
> When the secret key of a swift subuser contains "+" or "\" (I don't know which),
> authentication fails using python-swiftclient.
>
> Is it an issue ?
>
>
> On 06/25/2014 11:58 PM, Brian Rak wrote:
>
> I'm trying to find an issue with RadosGW and special characters in
> filenames.  Specifically, it seems that filenames with a + in them are not
> being handled correctly, and that I need to explicitly escape them.
>
> For example:
>
> ---request begin---
> HEAD /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb HTTP/1.0
> User-Agent: Wget/1.12 (linux-gnu)
>
> Will fail with a 404 error, but
>
> ---request begin---
> HEAD /ubuntu/pool/main/a/adduser/adduser_3.113%2Bnmu3ubuntu3_all.deb
> HTTP/1.0
> User-Agent: Wget/1.12 (linux-gnu)
>
> will work properly.
>
> I enabled debug mode on radosgw, and see this:
>
> 2014-06-25 17:30:37.383029 7f7ca7fff700 20 RGWWQ:
> 2014-06-25 17:30:37.383040 7f7ca7fff700 20 req: 0x7f7ca000b180
> 2014-06-25 17:30:37.383053 7f7ca7fff700 10 allocated request
> req=0x7f7ca0015ef0
> 2014-06-25 17:30:37.383064 7f7c6cfa9700 20 dequeued request
> req=0x7f7ca000b180
> 2014-06-25 17:30:37.383070 7f7c6cfa9700 20 RGWWQ: empty
> 2014-06-25 17:30:37.383121 7f7c6cfa9700 20 CONTENT_LENGTH=
> 2014-06-25 17:30:37.383123 7f7c6cfa9700 20 CONTENT_TYPE=
> 2014-06-25 17:30:37.383124 7f7c6cfa9700 20 DOCUMENT_ROOT=/etc/nginx/html
> 2014-06-25 17:30:37.383125 7f7c6cfa9700 20
> DOCUMENT_URI=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383126 7f7c6cfa9700 20 FCGI_ROLE=RESPONDER
> 2014-06-25 17:30:37.383127 7f7c6cfa9700 20 GATEWAY_INTERFACE=CGI/1.1
> 2014-06-25 17:30:37.383128 7f7c6cfa9700 20 HTTP_ACCEPT=*/*
> 2014-06-25 17:30:37.383129 7f7c6cfa9700 20 HTTP_CONNECTION=Keep-Alive
> 2014-06-25 17:30:37.383129 7f7c6cfa9700 20 HTTP_HOST=xxx
> 2014-06-25 17:30:37.383130 7f7c6cfa9700 20 HTTP_USER_AGENT=Wget/1.12
> (linux-gnu)
> 2014-06-25 17:30:37.383131 7f7c6cfa9700 20 QUERY_STRING=
> 2014-06-25 17:30:37.383131 7f7c6cfa9700 20 REDIRECT_STATUS=200
> 2014-06-25 17:30:37.383132 7f7c6cfa9700 20 REMOTE_ADDR=yyy
> 2014-06-25 17:30:37.383133 7f7c6cfa9700 20 REMOTE_PORT=43855
> 2014-06-25 17:30:37.383134 7f7c6cfa9700 20 REQUEST_METHOD=HEAD
>
> 2014-06-25 17:30:37.383134 7f7c6cfa9700 20
> REQUEST_URI=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383135 7f7c6cfa9700 20
> SCRIPT_NAME=/ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb
> 2014-06-25 17:30:37.383136 7f7c6cfa9700 20 SERVER_ADDR=yyy
> 2014-06-25 17:30:37.383136 7f7c6cfa9700 20 SERVER_NAME=xxx
> 2014-06-25 17:30:37.383137 7f7c6cfa9700 20 SERVER_PORT=80
> 2014-06-25 17:30:37.383138 7f7c6cfa9700 20 SERVER_PROTOCOL=HTTP/1.0
> 2014-06-25 17:30:37.383138 7f7c6cfa9700 20 SERVER_SOFTWARE=nginx/1.4.6
> 2014-06-25 17:30:37.383140 7f7c6cfa9700  1 == starting new request
> req=0x7f7ca000b180 =
> 2014-06-25 17:30:37.383152 7f7c6cfa9700  2 req 1:0.13::HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb::initializing
> 2014-06-25 17:30:37.383158 7f7c6cfa9700 10 host= rgw_dns_name=
> 2014-06-25 17:30:37.383199 7f7c6cfa9700 10 
> *s->object=ubuntu/pool/main/a/adduser/adduser_3.113
> nmu3ubuntu3_all.deb s->bucket=ubuntu*
> 2014-06-25 17:30:37.383207 7f7c6cfa9700  2 req 1:0.68:s3:HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb::getting op
> 2014-06-25 17:30:37.383211 7f7c6cfa9700  2 req 1:0.72:s3:HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb:get_obj:authorizing
> 2014-06-25 17:30:37.383218 7f7c6cfa9700  2 req 1:0.79:s3:HEAD
> /ubuntu/pool/main/a/adduser/adduser_3.113+nmu3ubuntu3_all.deb:get_obj:reading
> permissions
> 2014-06-25 17:30:37.383268 7f7c6cfa9700 20 get_obj_state:
> rctx=0x7f7c6cfa8640 obj=.rgw:ubuntu state=0x7f7c6800c0a8 s->prefetch_data=0
> 2014-06-25 17:30:37.383279 7f7c6cfa9700 10 cache get: name=.rgw+ubuntu :
> miss
>
>
> It seems that Ceph is attempting to urldecode the filename, even when it
> shouldn't be.  (Going by
> http://stackoverflow.com/questions/1005676/urls-and-plus-signs ).  Is
> this a bug, or is this the desired behavior?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph

Re: [ceph-users] about rgw region and zone

2014-06-30 Thread Craig Lewis
Well, I was hoping for a reply from Inktank, but I'll describe the process
I plan to test:

Best Case:
Primary zone is down
Disable radosgw-agent in secondary zone
Update the region in the secondary to enable data and metadata logging
Update DNS/Load balancer to send primary traffic to secondary
Secondary is now the primary

When the old primary comes back online, don't write to it
Update the old primary's zone configs to configure it as a secondary
Setup radosgw-agent, and start replication
Wait for replication to catch up.

I think this will work, as long as you enable the data and metadata logging
in the secondary before you start writing to it.  Once the new secondary
has caught up on replication, you can repeat the process to promote the new
secondary to the new new master.



Worst Case:
Primary zone is down
Disable radosgw-agent in secondary zone
Update DNS/Load balancer to send primary traffic to secondary
Secondary is now the primary

When the old primary comes back online, drop all of its rgw pools
Rebuild the old primary as the new secondary
Setup radosgw-agent, and start replication
Wait a long time for replication to catch up.

That's pretty extreme, but it should work.



As far as replication delay goes, I'm not aware of any tunables that would do
that.  It's an alert you should set up in your monitoring tool.  Time-based
delay is harder to do; you'd have to set up a heartbeat file that you could
watch.  Bytewise, you can monitor the replication backlog by summing up the
totals of radosgw-admin bucket stats in both zones.

I do know that replication can get pretty far behind and still catch up.  I
deliberately imported into the primary faster than I could replicate.  In
the end, I ended up with the primary about 20 TiB ahead of the
secondary.  It's still catching up weeks later (I only have a 200 Mbps
link).

There are some problems when the secondary gets that far behind.  If you're
using an older radosgw-agent, it might stop replicating buckets because
there wasn't any new write activity.  I wrote a bash script that uploads a
0-byte file to every bucket every 10 minutes (sketched below).  If you're using a newer
radosgw-agent, it works around that, but it doesn't persist its progress.
Restarting the radosgw-agent makes it start over.  Depending on how big
your replication backlog is, letting radosgw-agent run uninterrupted to
completion may or may not be a problem.
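
The bucket-touching workaround mentioned above looks roughly like this (not my
actual script; it assumes s3cmd is configured against the primary zone and runs
from cron every 10 minutes):

#!/bin/bash
# Upload a zero-byte object to every bucket so an older radosgw-agent
# keeps picking the buckets up for sync.
: > /tmp/replication-heartbeat
for b in $(s3cmd ls | awk '{print $3}'); do
    s3cmd put /tmp/replication-heartbeat "$b/replication-heartbeat"
done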




On Tue, Jun 17, 2014 at 3:37 PM, Fred Yang  wrote:

> I have been looking for documents regarding DR procedure for Federated
> Gateway as well and not much luck. Can somebody from Inktank comment on
> that?
> In the event of site failure, what's the current procedure to switch
> master/secondary zone role? or Ceph currently does not have that capability
> yet? If that's the case, any roadmap to add that in future release?
>
> Also, for data sync from master to secondary, are there any parameter to
> control the maximum amount of data or time window that secondary zone can
> be lagging behind?
>
> Thanks,
> Fred
> On Jun 17, 2014 4:46 PM, "Craig Lewis"  wrote:
>
>> Metadata replication is about keeping a global namespace in all zones.
>>  It will replicate all of your users and bucket names, but not the data
>> itself.  That way you don't end up with a bucket named "mybucket" in your
>> US and EU zones that are owned by different people.  It's up to you to
>> decide if this is something you want or not.  Metadata replication won't
>> protected against the primary zone going offline.
>>
>> Data replication will copy the metadata and data.  If the primary goes
>> offline, you'll be able to read everything that has replicated to the
>> secondary zone.  You should make sure you have enough bandwidth between the
>> zones (and that latency is low enough) to allow replication can keep up.
>>  If replication falls behind, anything not replicated will catch up when
>> the primary comes back up.
>>
>> I haven't found any docs on the process to promote a secondary zone to
>> primary.  Right now, it doesn't look like a good idea.  If the master goes
>> offline, you can read from the secondary while you get the master back
>> online.  The failover/failback are expensive (time and bandwidth wise), so
>> it would take a pretty big problem before it's a good idea to promote the
>> secondary to primary.
>>
>>
>>
>> Regarding your FastCGI error, when I see that, it's because my RadosGW
>> daemon isn't running.  Check if it's running (`ps auxww | grep radosgw`).
>>  If it's not, try `start radosgw-all`, then restart apache.  If that
>> doesn't work, you might need some

Re: [ceph-users] external monitoring tools for ceph

2014-06-30 Thread Craig Lewis
You should check out Calamari (https://github.com/ceph/calamari), Inktank's
monitoring and administration tool.


I started before Calamari was announced, so I rolled my own using
Zabbix.  It handles all the monitoring, graphing, and alerting in one tool.
It's kind of a pain to set up, but works OK now that it's going.
I don't know how to handle the cluster view though.  I'm monitoring
individual machines.  Whenever something happens, like an OSD stops
responding, I get an alert from every monitor.  Otherwise it's not a big
deal.

I'm in the middle of re-factoring the data gathering from poll to push.  If
you're interested, I can publish my templates and scripts when I'm done.
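
To give a flavor of the data gathering, here are a couple of illustrative
UserParameter lines (not my actual templates) for zabbix_agentd.conf on a
monitor host:

UserParameter=ceph.health,/usr/bin/ceph health | cut -d' ' -f1
UserParameter=ceph.osd.up,/usr/bin/ceph osd stat | sed 's/.*: \([0-9]*\) up.*/\1/'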





On Sun, Jun 29, 2014 at 1:17 AM, pragya jain  wrote:

> Hello all,
>
> I am working on ceph storage cluster with rados gateway for object storage.
> I am looking for external monitoring tools that can be used to monitor
> ceph storage cluster and rados gateway interface.
> I find various monitoring tools, such as nagios, collectd, ganglia,
> diamond, sensu, logstash.
> but i don't get details of anyone about what features do these monitoring
> tools monitor in ceph.
>
> Has somebody implemented anyone of these tools?
>
> Can somebody help me in identifying the features provided by these tools?
>
> Is there any other tool which can also be used to monitor ceph specially
> for object storage?
>
> Regards
> Pragya Jain
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW & data striping

2014-06-30 Thread Craig Lewis
RadosGW stripes data by default.  Objects larger than 4MiB are broken up
into 4MiB chunks.


On Wed, Jun 25, 2014 at 3:49 AM, Florent B  wrote:

> Hi,
>
> Is it possible to get data striped with radosgw, as in RBD or CephFS ?
>
> Thank you
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw scalability questions

2014-07-08 Thread Craig Lewis
You can and should run multiple RadosGW and Apache instances per zone.  The
whole point of Ceph is eliminating as many points of failure as possible.
You'll want to setup a load balancer just like you would for any website.
 You'll want your load balancer to recognize and forward both
http://us-west-1.domain/ and http://*.us-west-1.domain/.  Most load
balancers shouldn't have a problem with that.
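
As an example of the host matching, an haproxy frontend might look something
like this (hostnames and backend servers are placeholders, not my actual
config):

frontend rgw_frontend
    bind *:80
    mode http
    acl zone_host   hdr(host)     -i us-west-1.domain
    acl bucket_host hdr_end(host) -i .us-west-1.domain
    use_backend rgw_us_west_1 if zone_host or bucket_host

backend rgw_us_west_1
    mode http
    balance roundrobin
    server rgw1 10.0.0.11:80 check
    server rgw2 10.0.0.12:80 check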

I am having a bit of trouble with the load balancer in my secondary zones.
 The primary is working fine, but radosgw-agent's locking doesn't work
correctly in the secondary zones.  I haven't taken the time to figure out
the problem; I probably need to add some kind of stickyness to the pool.
 In the mean time, I've configured the secondary zone's load balancer to
have one machine in the pool, and one backup machine if the first one stops
responding.

You can setup multiple zones in a single cluster, but you don't have to.
 Some instances that might make sense are if you have 3 physical locations,
and you want each location to be a primary zone and a secondary zone for
one of the other locations.

radosgw-agent should have multiple instances too.  I'm currently running
radosgw-agent on all of the radosgw nodes in the secondary zone.  I just
adjust the --num-threads on each node so the total is the number of
connections I want.

I'm not using CephX.  It was causing me some problems early on with the Chef
recipes, and I just disabled it.  In general though, I favor independent
encryption keys rather than one shared key.  It's a bit more work to
manage, but it makes key revocation much easier.


For atomic operations, RadosGW handles it fine.  In my tests, I never had
any corruptions.  The most recently completed write wins.
It looks something like:
Start uploading largefile to bucket/testfile; bucket/testfile doesn't exist
Start uploading smallfile to bucket/testfile; bucket/testfile doesn't exist
smallfile upload completes; bucket/testfile is smallfile
largefile upload completes; bucket/testfile is largefile


On Mon, Jul 7, 2014 at 2:34 PM, Fred Yang  wrote:

> I'm setting up federated gateway following
> https://ceph.com/docs/master/radosgw/federated-config/, it seems one
> cluster can have multiple instances serving multiple zone each(be it master
> or slave), but it's not clear whether I can have multiple radosgw/httpd
> instances in the same cluster to serve request for same master zone, can
> anybody help answering this:
>
> 1. Should I setup multiple instances like us-east-1, us-east2.. for each
> physical host ? Should I be creating a separate keyring for each phyiscal
> host? Or just create single keyring and on every physical host refer it to
> us-east-1?
>
> 2. Can I use load-balance in front of these radosgw instances? If yes,
> should the "host="  entry set the the load balancer name or local host name?
>
> 3. While librados is atomic but REST API does not have write lock, so in
> the case when multiple user try to perform the write to same object through
> different radosgw instances for same master zone, what will happen? Will
> the write request to object being written be denied or potentially data
> corruption can occur?
>
> Thanks,
> Fred
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Throttle pool pg_num/pgp_num increase impact

2014-07-09 Thread Craig Lewis
FWIW, I'm beginning to think that SSD journals are a requirement.

Even with minimal recovery/backfilling settings, it's very easy to kick off
an operation that will bring a cluster to its knees.  Increasing PG/PGP,
increasing replication, adding too many new OSDs, etc.  These operations
can cause latency to increase 50x.  The SSDs won't completely hide it, but
they've brought latency down to "painful but tolerable".


As Kostis suggests, the only option I've found so far is to do smaller
operations, and I hope I made them small enough.  Try to do something that
will affect less than 10% of your OSDs.  I.e., instead of adding 10% new OSDs
in one operation, add one per node, and wait until the recovery finishes.
It takes a lot longer, and moves data many times, but my latency generally
only doubles instead of 50x.

I've figured out how to do that for OSD additions and PG/PGP increases.  I
haven't figured out a way to do it for replication levels.  If I want to
change a replication level, I think it will be better to create new pools,
and migrate the data manually.
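
For the PG/PGP case, the stepped approach looks roughly like this (the pool
name and step size are just examples; repeat until you reach the target):

$ ceph osd pool set mypool pg_num 2048
# wait for the new PGs to finish creating and peering (watch `ceph -s`)
$ ceph osd pool set mypool pgp_num 2048
# wait for the backfilling to finish before the next step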




On Wed, Jul 9, 2014 at 6:59 AM, Gregory Farnum  wrote:

> You're physically moving (lots of) data around between most of your
> disks. There's going to be an IO impact from that, although we are
> always working on ways to make it more controllable and try to
> minimize its impact. Your average latency increase sounds a little
> high to me, but I don't have much data to draw from; maybe others who
> have done this on large clusters can discuss.
> Basically, think of what happens to IO performance on a resilvering
> RAID array. We should be a lot better than that, but it's the same
> concept.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Tue, Jul 8, 2014 at 11:15 PM, Kostis Fardelas 
> wrote:
> > Hi Greg,
> > thanks for your immediate feedback. My comments follow.
> >
> > Initially we thought that the 248 PG (15%) increment we used was
> > really small, but it seems that we should increase PGs in even small
> > increments. I think that the term "multiples" is not the appropriate
> > term here, I fear someone would assume that it is the same (or even
> > the right way to do) to go from 10 PGs to 20 PGs and from 1000 PGs to
> > 2000 PGs just because he/she uses a small 2X multiple.
> >
> > Regarding, the data movement due to pgp_num increase, we had already
> > set osd_max_backfills, osd_recovery_max_active,
> > osd_recovery_op_priority, osd_recovery_threads to their minimum values
> > but we still got impacted. The first two are also set in ceph.conf but
> > we use to change all four of them at runtime (through injecting). Is
> > there anything else we should check? Is it some known issue?
> >
> > Another question that came up from our exercise is related to pool
> > isolation during PG remapping. As I reported we only changed the
> > pg/pgp num in one of our pools but ceph client io and ceph ops seem to
> > have dropped at cluster level (verified by looking at ceph status).
> > Did our second pool got impacted too or we should take from granted
> > that the pools are indeed isolated during remapping and there is a
> > ceph status view granularity issue here?
> >
> > Regards,
> > Kostis
> >
> > On 8 July 2014 20:01, Gregory Farnum  wrote:
> >> The impact won't be 300 times bigger, but it will be bigger. There are
> two
> >> things impacting your cluster here
> >> 1) the initial "split" of the affected PGs into multiple child PGs. You
> can
> >> mitigate this by stepping through pg_num at small multiples.
> >> 2) the movement of data to its new location (when you adjust pgp_num).
> This
> >> can be adjusted by setting the "OSD max backfills" and related
> parameters;
> >> check the docs.
> >> -Greg
> >>
> >>
> >> On Tuesday, July 8, 2014, Kostis Fardelas  wrote:
> >>>
> >>> Hi,
> >>> we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
> >>> used space. We store data objects basically on two pools, the one
> >>> being appr. 300x larger in data stored and # of objects terms than the
> >>> other. Based on the formula provided here
> >>> http://ceph.com/docs/master/rados/operations/placement-groups/ we
> >>> computed that we need to increase our per pool pg_num & pgp_num to
> >>> appr 6300 PGs / pool (100 * 126 / 2).
> >>> We started by increasing the pg & pgp number on the smaller pool from
> >>> 1800 to 2048 PGs (first the pg_num, then the pgp_num) and we
> >>> experienced a 10X increase in Ceph total operations and an appr 3X
> >>> disk latency increase in some underlying OSD disks. At the same time,
> >>> for appr 10 seconds we experienced very low values of client io and
> >>> op/s
> >>>
> >>> Should we be worried that the pg/pgp num increase on the bigger pool
> >>> will have a 300X larger impact?
> >>> Can we throttle this impact by injecting any thresholds or applying an
> >>> appropriate configuration on our ceph conf?
> >>>
> >>> Regards,
> >>> Kostis
> >>> __

Re: [ceph-users] radosgw-agent failed to parse

2014-07-09 Thread Craig Lewis
Just to ask a couple obvious questions...

You didn't accidentally put 'http://us-secondary.example.comhttp://
us-secondary.example.com/' in any of your region or zone configuration
files?  The fact that it's missing the :80 makes me think it's getting that
URL from someplace that isn't the command line.

You do have both system users on both clusters, with the same access and
secret keys?

You can resolve us-secondary.example.com. from this host?


I tested URLs of the form http://us-secondary.example.com/ and
http://us-secondary.example.com:80 in my setup, and both work fine.
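
A quick way to check both of those (the zone name is a placeholder):

$ radosgw-admin region get                        # look at the "endpoints" entries
$ radosgw-admin zone get --rgw-zone=us-secondary
$ dig +short us-secondary.example.com             # confirm the name resolves from this host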



On Wed, Jul 9, 2014 at 3:56 AM, Peter  wrote:

> thank you for your reply. I am running ceph 0.80.1, radosgw-agent 1.2 on
> Ubuntu 14.04 LTS (GNU/Linux 3.13.0-24-generic x86_64) . I also ran into
> this same issue with ubuntu 12.04 previously.
> There are no special characters in the access or secret key (ive had
> issues with this before so i make sure of this).
>
> here is the output python interpreter:
>
>  Python 2.7.6 (default, Mar 22 2014, 22:59:56)
>> [GCC 4.8.2] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> >>> import urlparse
>> >>> result = urlparse.urlparse('http://us-secondary.example.com:80')
>> >>> print result.hostname, result.port
>> us-secondary.example.com 80
>>
>
> that looks ok to me.
>
>
>
> On 07/07/14 22:57, Josh Durgin wrote:
>
>> On 07/04/2014 08:36 AM, Peter wrote:
>>
>>> i am having issues running radosgw-agent to sync data between two
>>> radosgw zones. As far as i can tell both zones are running correctly.
>>>
>>> My issue is when i run the radosgw-agent command:
>>>
>>>
>>>  radosgw-agent -v --src-access-key  --src-secret-key
>>>   --dest-access-key  --dest-secret-key
>>>   --src-zone us-master http://us-secondary.example.com:80

>>>
>>> i get the following error:
>>>
>>> DEBUG:boto:Using access key provided by client.
>>> DEBUG:boto:Using secret key provided by client.
>>> DEBUG:boto:StringToSign:
>>> GET
>>>
>>> Fri, 04 Jul 2014 15:25:53 GMT
>>> /admin/config
>>> DEBUG:boto:Signature:
>>> AWS EA20YO07DA8JJJX7ZIPJ:WbykwyXu5m5IlbEsBzo8bKEGIzg=
>>> DEBUG:boto:url = 'http://us-secondary.example.comhttp://us-secondary.example.com/admin/config'
>>> params={}
>>> headers={'Date': 'Fri, 04 Jul 2014 15:25:53 GMT', 'Content-Length': '0',
>>> 'Authorization': 'AWS EA20YO07DA8JJJX7ZIPJ:WbykwyXu5m5IlbEsBzo8bKEGIzg=',
>>> 'User-Agent': 'Boto/2.20.1 Python/2.7.6 Linux/3.13.0-24-generic'}
>>> data=None
>>> ERROR:root:Could not retrieve region map from destination
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 269, in main
>>>     region_map = client.get_region_map(dest_conn)
>>>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 391, in get_region_map
>>>     region_map = request(connection, 'get', 'admin/config')
>>>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 153, in request
>>>     result = handler(url, params=params, headers=request.headers, data=data)
>>>   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
>>>     return request('get', url, **kwargs)
>>>   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
>>>     return session.request(method=method, url=url, **kwargs)
>>>   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 349, in request
>>>     prep = self.prepare_request(req)
>>>   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 287, in prepare_request
>>>     hooks=merge_hooks(request.hooks, self.hooks),
>>>   File "/usr/lib/python2.7/dist-packages/requests/models.py", line 287, in prepare
>>>     self.prepare_url(url, params)
>>>   File "/usr/lib/python2.7/dist-packages/requests/models.py", line 334, in prepare_url
>>>     scheme, auth, host, port, path, query, fragment = parse_url(url)
>>>   File "/usr/lib/python2.7/dist-packages/urllib3/util.py", line 390, in parse_url
>>>     raise LocationParseError("Failed to parse: %s" % url)
>>> LocationParseError: Failed to parse: Failed to parse: us-secondary.example.comhttp:
>>>
>>>
>>> |||Is this a bug? or is my setup wrong? i can navigate to
>>> http://us-secondary.example.com/admin/config and it correctly outputs
>>> zone details. at the output above
>>>
>>
>> It seems like an issue with your environment. What version of
>> radosgw-agent and which distro is this running on?
>>
>> Are there any special characters in the access or secret keys that
>> might need to be escaped on the command line?
>>
>>  DEBUG:boto:url =
>>> 'http://us-secondary.example.comhttp://us-secondary.
>>> example.com/admin/config'
>>> 

Re: [ceph-users] I have PGs that I can't deep-scrub

2014-07-10 Thread Craig Lewis
I fixed this issue by reformatting all of the OSDs.  I changed the mkfs
options from

[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

to
[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -s size=4096

(I have a mix of 512 and 4k sector drives, and I want to treat them all
like 4k sector).


Now deep scrub runs to completion, and CPU usage of the daemon never goes
over 30%.  I did have to restart a few OSDs when I scrubbed known problem
PGs, but they scrubbed the 2nd time successfully.  The cluster is still
scrubbing, but it's completed half with no more issues.



It took me a long time to correlate the "XFS: possible memory allocation
deadlock in kmem_alloc" message in dmesg to OSD problems.  It was only when
I started having these deep-scrub issues that the XFS deadlock messages
were well correlated with OSD issues.

Looking back at previous issues I had with OSDs flapping, the XFS deadlocks
were present, but usually preceded the issues by several hours.


I strongly recommend to anybody that sees "XFS: possible memory allocation
deadlock in kmem_alloc" in dmesg to reformat your XFS filesystems.  It's
painful, but my cluster has been rock solid since I finished.
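
A quick way to check for the warning (log path assumes a stock syslog setup):

  dmesg | grep -i 'possible memory allocation deadlock'
  grep -i 'kmem_alloc' /var/log/kern.log*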





On Wed, Jun 11, 2014 at 2:23 PM, Craig Lewis 
wrote:

> New logs, with debug ms = 1, debug osd = 20.
>
>
> In this timeline, I started the deep-scrub at 11:04:00  Ceph start
> deep-scrubing at 11:04:03.
>
> osd.11 started consuming 100% CPU around 11:07.  Same for osd.0.  CPU
> usage is all user; iowait is < 0.10%.  There is more variance in the
> CPU usage now, ranging between 98.5% and 101.2%
>
> This time, I didn't see any major IO, read or write.
>
> osd.11 was marked down at 11:22:00:
> 2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg
> stats for 902.656777seconds
>
> osd.0 was marked down at 11:36:00:
>  2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg
> stats for 902.498894seconds
>
>
>
>
> ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
> ceph-osd.0.log.gz
> <https://cd.centraldesktop.com/p/eAAADwbcABIDZuEceph-osd.0.log.gz>
> (140MiB, 18MiB compressed):
> https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
> ceph-osd.11.log.gz (131MiB, 17MiB compressed):
> https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
> ceph pg 40.11e query:
> https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc
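>
> For anyone reproducing this, the debug settings can also be injected at runtime
> instead of editing ceph.conf and restarting; the PG id is the one from this thread:
>
>   ceph tell osd.0 injectargs '--debug-ms 1 --debug-osd 20'
>   ceph tell osd.11 injectargs '--debug-ms 1 --debug-osd 20'
>   ceph pg deep-scrub 40.11e
>   ceph pg 40.11e query > pg-40.11e-query.json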
>
>
>
>
>
> On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil  wrote:
> > Hi Craig,
> >
> > It's hard to say what is going wrong with that level of logs.  Can you
> > reproduce with debug ms = 1 and debug osd = 20?
> >
> > There were a few things fixed in scrub between emperor and firefly.  Are
> > you planning on upgrading soon?
> >
> > sage
> >
> >
> > On Tue, 10 Jun 2014, Craig Lewis wrote:
> >
> >> Every time I deep-scrub one PG, all of the OSDs responsible get kicked
> >> out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
> >> fails the same way every time.  OSD logs are linked at the bottom.
> >>
> >> What can I do to get this deep-scrub to complete cleanly?
> >>
> >> This is the first time I've deep-scrubbed these PGs since Sage helped
> >> me recover from some OSD problems
> >> (
> http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html
> )
> >>
> >> I can trigger the issue easily in this cluster, but have not been able
> >> to re-create in other clusters.
> >>
> >>
> >>
> >>
> >>
> >>
> >> The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
> >> are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
> >> owned by OSDs [11,0]
> >>
> >> This is a secondary cluster, so I stopped all external I/O on it.  I
> >> set nodeep-scrub, and restarted both OSDs with:
> >>   debug osd = 5/5
> >>   debug filestore = 5/5
> >>   debug journal = 1
> >>   debug monc = 20/20
> >>
> >> then I ran a deep-scrub on this PG.
> >>
> >> 2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
> >> active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
> >> 77870 GB / 130 TB avail
> >> 2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
> >> active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
> >> 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
> >>
> >>
> &

Re: [ceph-users] Creating a bucket on a non-master region in a multi-region configuration with unified namespace/replication

2014-07-14 Thread Craig Lewis
There's no reason you can't create another set of zones that have a master
in us-west, call it us-west-2.  Then users that need low latency to us-west
write to http://us-west-2.cluster/, and users that need low latency to
us-east write to http://us-east-1.cluster/.

In general, you want your replication zones geographically separated.  In
my setup, I have a single region, us.  I have the primary zone in
California and the secondary zone in Texas.  When I create objects in the
primary, they appear in the secondary zone in a couple minutes.  Then if I
need low latency read access in Texas, I read from the secondary.  I don't
currently need to write from Texas.  When I do, I'll create a new set of
zones that are primary in Texas and secondary in California.  Then Texas
can write to that new zone.
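
If you want to verify where a bucket's data actually lands, a rough check looks like
this (bucket name is made up; s3cmd is pointed at the master region's endpoint):

  s3cmd mb --bucket-location=us-west s3://placement-test
  radosgw-admin bucket stats --bucket=placement-test    # check which pool/zone holds it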




On Mon, Jul 14, 2014 at 7:50 AM, Bachelder, Kurt <
kurt.bachel...@sierra-cedar.com> wrote:

>  Hi Craig – Thanks for the info… My understanding is the same as what
> you’ve laid out below – I’ve scraped together enough information to
> understand that Ceph should be working the same as Amazon S3, which is a
> relief.
>
>
>
> Here’s my issue – I have 2 regions, geographically separated.  Each region
> contains 2 zones, which replicate metadata + data – and that works
> perfectly – no issues whatsoever.  However, when I add regional replication
> for a unified namespace, the location constraint SEEMS to be completely
> ignored.  Let’s say “us-east” is my master region and “us-west” is my slave
> – for latency reasons, I want the actual data to be stored in us-west.  So
> I submit a command to us-east to create a bucket with
> location_constraint=us-west.  However, the radosgw seems to completely
> ignore the constraint and it creates the bucket in us-east.  When I start
> writing data to the bucket, it physically resides in us-east, not us-west,
> which doesn’t work for us from a latency standpoint.
>
>
>
> Creating users and buckets in us-east works fine, and the metadata filters
> down to all other regions/zones – so I know the replication working to some
> extent.
>
>
>
> Oh, and you’re right about the redirect – when I try to create a bucket
> against the “slave” region us-west, it DOES redirect the request to the
> master region… but still fails to create the bucket in us-west.  It is
> created in pools in us-east instead.
>
>
>
> Frustrating – the documentation is so sparse around this whole
> multi-region setup – especially in regards to location constraints.  I’m
> still not sure whether my configuration is off or whether the RGW just
> isn’t working as expected…
>
>
>
> Thank you for your reply – keep in touch if you end up doing some
> multi-region replication, would love to hear your experience.
>
>
>
> Kurt
>
>
>
> *From:* Craig Lewis [mailto:cle...@centraldesktop.com]
> *Sent:* Saturday, July 12, 2014 7:51 PM
> *To:* Bachelder, Kurt
> *Subject:* Re: [ceph-users] Creating a bucket on a non-master region in a
> multi-region configuration with unified namespace/replication
>
>
>
> Disclaimer: I haven't used Amazon S3 yet, so I'm only familiar with
> regions, zones, and placements as they apply to Ceph.  I'm pretty sure Ceph
> is trying hard to emulate Amazon, but it's obviously missing some features.
>  I was hoping somebody else would chime in, but I'll give it a shot.
>
>
>
>
>
>
>
> Bucket creation (really, all writes) should only occur in a master zone.
>
>
>
> Zones (within a region) are intended to have data+metadata replication, so
> the primary zone will get the writes, then replicate everything to the
> secondary zone(s).  You can read from the primary or secondary, as long as
> you remember that replication is asynchronous, and generally takes a few
> minutes even when it's keeping up.  You can have multiple zones in a region
> that partition instead of replicate as well.  For example, us-west-1
> replicates to us-east-1, and us-east-2 replicates to us-west-2.
>
>
>
> Regions can have metadata replication (without data replication), if you
> want a globally unique list of users and pools.  It's up to you if you want
> that.   If you plan to move data between regions, then you probably do.
>  It's not required if you want us-west-1/bucket/ and us-east-2/bucket/ to
> be different objects (but it's confusing).
>
>
>
> If you want geo-replication, each zone in a region should have a secondary
> zone with data replication.  zones & data replication are the method of
> accomplishing geo-replication.  Regions are a way to isolate data for
> latency/legal reasons.
>
>
>
>
>
> Placement targets are setup inside a region.  They let t

Re: [ceph-users] HW recommendations for OSD journals?

2014-07-16 Thread Craig Lewis
The good SSDs will report how much of their estimated life has been used.
 It's not in the SMART spec though, so different manufacturers do it
differently (or not at all).


I've got Intel DC S3700s, and the SMART attributes include:
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always
-   0

I'm planning to monitor that value, and replace the SSD when it "gets old".
 I don't know exactly what that means yet, but I'll figure it out.  It's
easy to replace SSDs before they fail, without losing the whole OSD.

With my write volume, I might just make it a quarterly manual task instead
of adding it to my monitoring tool.  TBD.
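
For anyone who wants to script the check, a minimal version looks something like this
(device name is illustrative, and the attribute name varies by vendor):

  smartctl -A /dev/sda | egrep -i 'wearout|media_wear|wear_leveling'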



Of course, this won't prevent other types of SSD failure.  I've lost both
SSDs in a RAID1 when I triggered an Intel firmware bug.  I've lost both
SSDs in a RAID1 when the colo lost power (older SSDs without super caps).

The only way I can think of that would make RAID1 SSDs safer than a single
SSD is if you use two SSDs from different manufacturers.

Ceph's mantra is "failure is constant".  I'm not going to RAID my journal
devices.  I will use SSDs with power loss protection though.  I can handle
one or two SSDs dropping out at a time.  I can't handle a large percentage
of them dropping out at the same time.




On Wed, Jul 16, 2014 at 8:28 AM, Mark Nelson 
wrote:

> On 07/16/2014 09:58 AM, Riccardo Murri wrote:
>
>> Hello,
>>
>> I am new to Ceph; the group I'm working in is currently evaluating it
>> for our new large-scale storage.
>>
>> Is there any recommendation for the OSD journals?  E.g., does it make
>> sense to keep them on SSDs?  Would it make sense to host the journal
>> on a RAID-1 array for added safety? (IOW: what happens if the journal
>> device fails and the journal is lost?)
>>
>> Thanks for any explanation and suggestion!
>>
>
> Hi,
>
> There are a couple of common configurations that make sense imho:
>
> 1) Leave journals on the same disks as the data (best to have them in
> their own partition).  This is a fairly safe option since the OSDs only
> have a single disk they rely on (IE minimize potential failures).  It can
> be slow, but it depends on the controller you use and possibly the IO
> scheduler.  Often times a controller with writeback cache seems to help
> avoid seek contention during writes, but you will currently lose about half
> your disk throughput to journal writes during sequential write IO.
>
> 2) Put journals on SSDs.  In this scenario you want to match your per
> journal SSD speed and disk speed.  IE if you have an SSD that can do
> 400MB/s and disks that can do ~125MB/s of sequential writes, you probably
> want to put somewhere around 3-5 journals on the SSD depending on how much
> sequential write throughput matters to you.  OSDs are now dependant on both
> the spinning disk and the SSD not to fail, and one SSD failure will take
> down multiple OSDs.  You gain speed though and may not need more expensive
> controllers with WB cache (though they may still be useful to protect
> against power failure).
>
> Some folks have used raid-1 LUNs for the journals and it works fine, but
> I'm not really a fan of it, especially with SSDs.  You are causing double
> the writes to the SSDs, and SSDs tend to fail in clumps based on the number
> of writes.  If the choice is between 6 journals per SSD RAID-1 or 3
> journals per SSD JBOD, I'd choose the later.  I'd want to keep my overall
> OSD count high though to minimize the fallout from 3 OSDs going down at
> once.
>
> Arguably if you do the RAID1, can swap failed SSDs quickly, and anticipate
> that the remaining SSD is likely going to die soon after the first, maybe
> the RAID1 is worth it.  The disadvantages seem pretty steep to me though.
>
> Mark
>
>
>
>> Riccardo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-16 Thread Craig Lewis
One of the things I've learned is that many small changes to the cluster
are better than one large change.  Adding 20% more OSDs?  Don't add them
all at once, trickle them in over time.  Increasing pg_num & pgp_num from
128 to 1024?  Go in steps, not one leap.

I try to avoid operations that will touch more than 20% of the disks
simultaneously.  When I had journals on HDD, I tried to avoid going over
10% of the disks.


Is there a way to execute `ceph osd crush tunables optimal` in a way that
takes smaller steps?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-16 Thread Craig Lewis
Thanks, that's worth a try.  Half as bad might make all the difference.

I have the luxury of a federated setup, and I can test on the secondary
cluster fairly safely.  If the change doesn't cause replication timeouts,
it's probably ok to deploy on the primary.

I'll go to CRUSH_TUNABLES2 manually by making the changes in
http://ceph.com/docs/master/rados/operations/crush-map/#tunables, one at a
time.  Then do chooseleaf_vary_r => 4, and see what happens.
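
The mechanical steps for that are the usual decompile/edit/recompile cycle (file names
are arbitrary):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt and set:  tunable chooseleaf_vary_r 4
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new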


I won't get a chance to try for at least a couple weeks, probably longer.




On Wed, Jul 16, 2014 at 5:06 PM, Sage Weil  wrote:

> On Wed, 16 Jul 2014, Gregory Farnum wrote:
> > On Wed, Jul 16, 2014 at 4:45 PM, Craig Lewis 
> wrote:
> > > One of the things I've learned is that many small changes to the
> cluster are
> > > better than one large change.  Adding 20% more OSDs?  Don't add them
> all at
> > > once, trickle them in over time.  Increasing pg_num & pgp_num from 128
> to
> > > 1024?  Go in steps, not one leap.
> > >
> > > I try to avoid operations that will touch more than 20% of the disks
> > > simultaneously.  When I had journals on HDD, I tried to avoid going
> over 10%
> > > of the disks.
> > >
> > >
> > > Is there a way to execute `ceph osd crush tunables optimal` in a way
> that
> > > takes smaller steps?
> >
> > Unfortunately not; the crush tunables are changes to the core
> > placement algorithms at work.
>
> Well, there is one way, but it is only somewhat effective.  If you
> decompile the crush maps for bobtail vs firefly the actual difference is
>
>  tunable chooseleaf_vary_r 1
>
> and this is written such that a value of 1 is the optimal 'new' way, 0 is
> the legacy old way, but values > 1 are less-painful steps between the two
> (though mostly closer to the firefly value of 1).  So, you could set
>
>  tunable chooseleaf_vary_r 4
>
> wait for it to settle, and then do
>
>  tunable chooseleaf_vary_r 3
>
> ...and so forth down to 1.  I did some limited testing of the data
> movement involved and noted it here:
>
>
> https://github.com/ceph/ceph/commit/37f840b499da1d39f74bfb057cf2b92ef4e84dc6
>
> In my test case, going from 0 to 4 was about 1/10th as bad as going
> straight from 0 to 1, but the final step from 2 to 1 is still about 1/2 as
> bad.  I'm not sure if that means it's not worth the trouble of not just
> jumping straight to the firefly tunables, or whether it means legacy users
> should just set (and leave) this at 2 or 3 or 4 and get almost all the
> benefit without the rebalance pain.
>
> sage
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-17 Thread Craig Lewis
I'd like to see some way to cap recovery IOPS per OSD.  Don't allow
backfill to do more than 50 operations per second.  It will slow
backfill down, but reserve plenty of IOPS for normal operation.  I know
that implementing this well is not a simple task.


I know I did some stupid things that caused a lot of my problems.  Most of
my problems can be traced back to
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
 and the kernel malloc problems it caused.

Reformatting all of the disks fixed a lot of my issues, but it didn't fix
them all.




While I was reformatting my secondary cluster, I tested the stability by
reformatting all of the disks on the last node at once.  I didn't mark them
out and wait for the rebuild; I removed the OSDs, reformatted, and added
them back to the cluster.  It was 10 disks out of 36 total, in a 4 node
cluster (I'm waiting for hardware to free up to build the 5th node).
 Everything was fine for the first hour or so.  After several hours, there
was enough latency that the HTTP load balancer was marking RadosGW nodes
down.  My load balancer has a 30s timeout.  Since the latency was cluster
wide, all RadosGW nodes were marked down together.  When the latency spike
subsided, they'd all get marked up again.  This continued until the
backfill completed.  They were mostly up.  I don't have numbers, but I
think they were marked down about 5 times an hour, for less than a minute
each time.  That really messes with radosgw-agent.


I had recovery tuned down:
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

I have journals on SSD, and single GigE public and cluster networks.  This
cluster has 2x replication (I'm waiting for the 5th node to go to 3x).  The
cluster network was pushing 950 Mbps.  The SSDs and OSDs had plenty of
write bandwidth, but the HDDs were saturating their IOPs.  These are
consumer class 7200 RPM SATA disks, so they don't have very many IOPS.

The average write latency on these OSDs is normally ~10ms.  While this
backfill was going on, the average write latency was 100ms, with plenty of
times when the latency was 200ms.  The average read latency increased, but
not as bad.  It averaged 50ms, with occasional spikes up to 400ms.  Since I
formatted 27% of my cluster, I was seeing higher latency on 55% of my
OSDs (readers and writers).

Instead, if I trickle in the disks, everything works fine.  I was able to
reformat 2 OSDs at a time without a problem.  The cluster latency increase
was barely noticeable, even though the IOPS on those two disks were
saturated.  A bit of latency here and there (5% of the time) doesn't hurt
much.  When it's 55% of the time, it hurts a lot more.


When I finally get the 5th node, and increase replication from 2x to 3x, I
expect this cluster to be unusable for about a week.







On Thu, Jul 17, 2014 at 9:02 AM, Andrei Mikhailovsky 
wrote:

> Comments inline
>
>
> --
> *From: *"Sage Weil" 
> *To: *"Quenten Grasso" 
> *Cc: *ceph-users@lists.ceph.com
> *Sent: *Thursday, 17 July, 2014 4:44:45 PM
>
> *Subject: *Re: [ceph-users] ceph osd crush tunables optimal AND add new
> OSD at the same time
>
> On Thu, 17 Jul 2014, Quenten Grasso wrote:
>
> > Hi Sage & List
> >
> > I understand this is probably a hard question to answer.
> >
> > I mentioned previously our cluster is co-located MON's on OSD servers,
> which
> > are R515's w/ 1 x AMD 6 Core processor & 11 3TB OSD's w/ dual 10GBE.
> >
> > When our cluster is doing these busy operations and IO has stopped as in
> my
> > case, I mentioned earlier running/setting tuneable to optimal or heavy
> > recovery
> >
> > operations is there a way to ensure our IO doesn't get completely
> > blocked/stopped/frozen in our vms?
> >
> > Could it be as simple as putting all 3 of our mon servers on baremetal
> >  w/ssd's? (I recall reading somewhere that a mon disk was doing several
> > thousand IOPS during a recovery operation)
> >
> > I assume putting just one on baremetal won't help because our mon's will
> only
> > ever be as fast as our slowest mon server?
>
> I don't think this is related to where the mons are (most likely).  The
> big question for me is whether IO is getting completely blocked, or just
> slowed enough that the VMs are all timing out.
>
>
> AM: I was looking at the cluster status while the rebalancing was taking
> place and I was seeing very little client IO reported by ceph -s output.
> The numbers were around 20-100 whereas our typical IO for the cluster is
> around 1000. Having said that, this was not enough as _all_ of our vms
> become unresponsive and didn't recover after rebalancing finished.
>
>
> What slow request messages
> did you see during the rebalance?
>
> AM: As I was experimenting with different options while trying to gain
> some client IO back i've noticed that when I am limiting the options to 1
> per osd ( osd max backfills = 1, osd recovery max active = 1, osd
> recovery threads

Re: [ceph-users] radosgw-agent failed to parse

2014-07-21 Thread Craig Lewis
I was hoping for some easy fixes :-P

I created two system users, in both zones.  Each user has different access
and secret, but I copied the access and secret from the primary to the
secondary.  I can't imagine that this would cause the problem you're
seeing, but it is something different from the examples.

Sorry, I'm out of ideas.



On Mon, Jul 21, 2014 at 7:13 AM, Peter  wrote:

>  hello again,
>
> i couldn't find  'http://us-secondary.example.comhttp://
> us-secondary.example.com/' in any zone or regions config files. How could
> it be getting the URL from someplace else if i am specifying as command
> line option after radosgw-agent ?
>
>
> Here is region config:
>
> { "name": "us",
>   "api_name": "us",
>   "is_master": "True",
>   "endpoints": [
> "http:\/\/us-master.example.com:80\/"
> ],
>   "master_zone": "us-master",
>   "zones": [
> { "name": "us-master",
>   "endpoints": [
> "http:\/\/us-master.example.com:80\/"
> ],
>   "log_meta": "true",
>   "log_data": "true"},
> { "name": "us-secondary",
>   "endpoints": [
> "http:\/\/us-master.example.com:80\/"
> ],
>   "log_meta": "true",
>   "log_data": "true"}
> ],
>   "placement_targets": [
>{
>  "name": "default-placement",
>  "tags": []
>}
>   ],
>   "default_placement": "default-placement"}
>
>
> I also get the above when i navigate to
> http://us-master.example.com/admin/config  and
> http://us-secondary.example.com/admin/config .
>
> us-master zone looks like this:
>
> { "domain_root": ".us-master.domain.rgw",
>   "control_pool": ".us-master.rgw.control",
>   "gc_pool": ".us-master.rgw.gc",
>   "log_pool": ".us-master.log",
>   "intent_log_pool": ".us-master.intent-log",
>   "usage_log_pool": ".us-master.usage",
>   "user_keys_pool": ".us-master.users",
>   "user_email_pool": ".us-master.users.email",
>   "user_swift_pool": ".us-master.users.swift",
>   "user_uid_pool": ".us-master.users.uid",
>   "system_key": { "access_key": "EA02UO07DA8JJJX7ZIPJ", "secret_key":
> "InmPlbQhsj7dqjdNabqkZaqR8ShWC6fS0XVo"},
>   "placement_pools": [
> { "key": "default-placement",
>   "val": { "index_pool": ".us-master.rgw.buckets.index",
>"data_pool": ".us-master.rgw.buckets"}
> }
>   ]
> }
>
>
> us-secondary zone:
>
> { "domain_root": ".us-secondary.domain.rgw",
>   "control_pool": ".us-secondary.rgw.control",
>   "gc_pool": ".us-secondary.rgw.gc",
>   "log_pool": ".us-secondary.log",
>   "intent_log_pool": ".us-secondary.intent-log",
>   "usage_log_pool": ".us-secondary.usage",
>   "user_keys_pool": ".us-secondary.users",
>   "user_email_pool": ".us-secondary.users.email",
>   "user_swift_pool": ".us-secondary.users.swift",
>   "user_uid_pool": ".us-secondary.users.uid",
>   "system_key": { "access_key": "EA02UO07DA8JJJX7ZIPJ", "secret_key":
> "InmPlbQhsj7dqjdNabqkZaqR8ShWC6fS0XVo"},
>   "placement_pools": [
> { "key": "default-placement",
>   "val": { "index_pool": ".us-secondary.rgw.buckets.index",
>"data_pool": ".us-secondary.rgw.buckets"}
> }
>   ]
> }
>
>
> us-master user exists on us-master cluster gateway, us-secondary user
> exists on us-secondary cluster gateway. both us-master and us-secondary
> gateway users have same access and secret key. should us-master and
> us-secondary users exist on both clusters?
>
> i can resolve us-master.example.com and us-secondary.example.com from
> both gateways.
>
>
> Thanks
>
>
> On 09/07/14 22:20, Craig Lewis wrote:
>
>  Just to ask a couple obvious questions...
>
>  You didn't accidentally put 'h

Re: [ceph-users] radosgw monitoring

2014-07-28 Thread Craig Lewis
(Sorry for the duplicate email, I forgot to CC the list)

Assuming you're using the default setup (RadosGW, FastCGI, and Apache),
it's the same as monitoring a web site.  On every node, verify that a request
for / returns a 200.  If the RadosGW agent is down, or FastCGI is
mis-configured, the request will return a 500 error.  If Apache is down,
you won't be able to connect.

I'm also monitoring my load balancer (HAProxy).  I added alerts if HAProxy
marks a node offline.


That's the basics, but you can get more complicated if you want.  You could
add a heartbeat file, and verify it's being updated.  You can monitor the
performance stats returned by /usr/bin/ceph --admin-daemon
/var/run/ceph/radosgw.asok --format=json perf dump.

I'm not doing a heartbeat, but I am monitoring performance.  If the latency
per operation gets too high, I alert on that too.  It's really noisy during
recovery, but useful when the cluster is healthy.
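
A minimal version of those checks, assuming the default admin socket path mentioned
above (hostname and thresholds are up to you):

  # availability: expect a 200 from each gateway node
  curl -s -o /dev/null -w '%{http_code}\n' http://radosgw-node.example.com/
  # performance counters from the admin socket
  ceph --admin-daemon /var/run/ceph/radosgw.asok perf dump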


On Sat, Jul 26, 2014 at 2:58 AM, pragya jain  wrote:

> Thanks zhu qiang for your response
>
> that means there are only the logs with the help of which we can monitor
> radosgw instances for coming user request traffic for uploading and
> downloading the stored data and also for monitoring other features of
> radosgw
> no external monitoring tool, such as calamari, nagios collectd, zabbix
> etc., provide the functionality to monitor radosgw instances.
>
> Am I right?
>
> Thanks again
> Pragya Jain
>
>
>   On Friday, 25 July 2014 8:12 PM, zhu qiang 
> wrote:
>
>
>
> Hi,
>Maybe you can try the ways below:
>   1. Set “debug rgw = 2”, then view the radosgw daemon’s log; you can also use
> ‘sed, grep, awk’ to get the info you want.
>   2. Periodically run the “ceph daemon client.radosgw.X perf dump” command to get
> the statistics from the radosgw daemon.
>
> This is all I know; I hope this will be useful for you.
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *pragya jain
> *Sent:* Friday, July 25, 2014 6:39 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] radosgw monitoring
>
> Hi all,
>
> Please suggest me some open source monitoring tools which can monitor
> radosgw instances for coming user request traffic for uploading and
> downloading the stored data and also for monitoring other features of
> radosgw
>
> Regards
> Pragya Jain
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deployment scenario with 2 hosts

2014-07-28 Thread Craig Lewis
That's expected.  You need > 50% of the monitors up.  If you only have 2
machines, rebooting one means that 50% are up, so the cluster halts
operations.  That's done on purpose to avoid problems when the cluster is
divided in exactly half, and both halves continue to run thinking the other
half is down.  Monitors don't need a lot of resources.  I'd recommend that
you add a small box as a third monitor.  A VM is fine, as long as it has
enough IOPS to its disks.

It's best to have 3 storage nodes.  A new, out of the box install tries to
store data on at least 3 separate hosts.  You can lower the replication
level to 2, or change the rules so that it will store data on 3 separate
disks.  It might store all 3 copies on the same host though, so lowering
the replication level to 2 is probably better.

I think it's possible to require data stored on 3 disks, with 2 of the
disks coming from different nodes.  Editing the CRUSH rules is a bit
advanced: http://ceph.com/docs/master/rados/operations/crush-map/
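
For the two-node case, the relevant knobs are just the pool size settings; a sketch
with the values discussed above:

  # ceph.conf defaults for new pools
  [global]
    osd pool default size = 2
    osd pool default min size = 1

  # or change an existing pool, e.g. the default rbd pool
  ceph osd pool set rbd size 2
  ceph osd pool set rbd min_size 1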




On Mon, Jul 28, 2014 at 9:59 AM, Don Pinkster  wrote:

> Hi,
>
> Currently I am evaluating multiple distributed storage solutions with an
> S3-like interface.
> We have two huge machines with big amounts of storage. Is it possible to
> let these two behave exactly the same with Ceph? My idea is running both
> MON and OSD on these two machines.
>
> With quick tests the cluster is degraded after a reboot of 1 host and is
> not able to recover from the reboot.
>
> Thanks in advance!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD daemon code in /var/lib/ceph/osd/ceph-2/ "dissapears" after creating pool/rbd -

2014-08-05 Thread Craig Lewis
You can manually mount it and start the daemon, run ceph-disk-activate, or
just reboot the node.  A reboot is the easiest.


Most setups use udev rules to mount the disks on boot, instead of writing
to /etc/fstab.  If you want the details of how that works, take a look at
/lib/udev/rules.d/95-ceph-osd.rules
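
A sketch of the manual route (device and OSD id are illustrative; ceph-disk works out
both for you):

  mount /dev/sdb1 /var/lib/ceph/osd/ceph-2
  service ceph start osd.2
  # or simply
  ceph-disk activate /dev/sdb1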





On Tue, Aug 5, 2014 at 12:08 PM, Bruce McFarland <
bruce.mcfarl...@taec.toshiba.com> wrote:

>  No it’s not. How would I recover that mount point? As part of the
> ceph-deploy I don’t see anything in fstab so it’s not clear to me yet who
> or what handles the mount.
>
>
>
> *From:* Craig Lewis [mailto:cle...@centraldesktop.com]
> *Sent:* Tuesday, August 05, 2014 11:35 AM
> *To:* Bruce McFarland
> *Subject:* Re: [ceph-users] OSD daemon code in /var/lib/ceph/osd/ceph-2/
> "dissapears" after creating pool/rbd -
>
>
>
> Is /var/lib/ceph/osd/ceph-2/ mounted on ess59?
>
>
>
>
>
>
>
> On Mon, Aug 4, 2014 at 6:53 PM, Bruce McFarland <
> bruce.mcfarl...@taec.toshiba.com> wrote:
>
> /var/lib/ceph/osd/ceph-2/
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Remote replication

2014-08-05 Thread Craig Lewis
That depends on which features of Ceph you're using.

RadosGW supports replication.  It's not real time, but it's near real time.
 Everything in my primary cluster is copied to my secondary within a few
minutes.  Take a look at
http://ceph.com/docs/master/radosgw/federated-config/  .  The details of
disaster recovery is still being figured out.

For RBD, people have been rolling their own using incremental snapshots:
http://ceph.com/dev-notes/incremental-snapshots-with-rbd/  .  It's not real
time, and it's up to you how often it runs.
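
The usual building blocks for that are rbd export-diff / import-diff; a rough sketch of
one backup cycle (pool, image, and snapshot names are made up, and the destination
image must already have been seeded with a full export/import):

  rbd snap create rbd/vm-disk@backup1
  rbd export-diff rbd/vm-disk@backup1 vm-disk-backup1.diff
  # later, ship only the changes since the previous snapshot
  rbd snap create rbd/vm-disk@backup2
  rbd export-diff --from-snap backup1 rbd/vm-disk@backup2 vm-disk-1to2.diff
  # on the remote cluster
  rbd import-diff vm-disk-1to2.diff rbd/vm-disk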

There currently isn't a backup tool for CephFS.  CephFS is a POSIX
filesystem, so your normal tools should work.  It's a really large POSIX
filesystem though, so normal tools may not scale well.

There's no generic replication tool for RADOS itself.  If you're using
librados directly, you'll have to build your own replication system.



On Mon, Aug 4, 2014 at 10:16 AM, Patrick McGarry 
wrote:

> This is probably a question best asked on the ceph-user list.  I have
> added it here.
>
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
>
>
> On Mon, Aug 4, 2014 at 2:17 AM, Santhosh Fernandes
>  wrote:
> > Hi all,
> >
> > Do we have continuous access or remote replication  feature in ceph ?
> When
> > we can get this functionality  implemented?
> >
> > Thank you.
> >
> > Regards,
> > Santhosh
> >
> >
> > ___
> > Ceph-community mailing list
> > ceph-commun...@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using Crucial MX100 for journals or cache pool

2014-08-05 Thread Craig Lewis
You really do want power-loss protection on your journal SSDs.  Data
centers do have power outages, even with all the redundant grid
connections, UPSes, and diesel generators.

Losing an SSD will lose all of the OSDs that are using it as a journal.
 If the data center loses power, you're probably going to lose more than
one SSD.  It's a probability, so the likelihood of multiple failures goes
up as you add more SSDs.

For me, the possibility of losing data after a sudden power outage isn't
worth the cost savings.




On Fri, Aug 1, 2014 at 1:38 AM, Andrei Mikhailovsky 
wrote:

> Hello guys,
>
> Was wondering if anyone has tried using the Crucial MX100 ssds either for
> osd journals or cache pool? It seems like a good cost effective alternative
> to the more expensive drives and read/write performance is very good as
> well.
>
> Thanks
>
> --
> Andrei Mikhailovsky
> Director
> Arhont Information Security
>
> Web: http://www.arhont.com
> http://www.wi-foo.com
> Tel: +44 (0)870 4431337
> Fax: +44 (0)208 429 3111
> PGP: Key ID - 0x2B3438DE
> PGP: Server - keyserver.pgp.com
>
> DISCLAIMER
>
> The information contained in this email is intended only for the use of
> the person(s) to whom it is addressed and may be confidential or contain
> legally privileged information. If you are not the intended recipient you
> are hereby notified that any perusal, use, distribution, copying or
> disclosure is strictly prohibited. If you have received this email in error
> please immediately advise us by return email at and...@arhont.com and
> delete and purge the email and any attachments without making a copy.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph can't seem to forget

2014-08-07 Thread Craig Lewis
Have you re-formatted and re-added all of the lost OSDs?  I've found that
if you lose an OSD, you can tell Ceph the data is gone (ceph osd lost
<id>), but it won't believe you until it can talk to that OSD ID again.

If you have OSDs that are offline, you can verify that Ceph is waiting on
them with ceph pg <pgid> query, looking at the recovery_state section.
You're looking for probing_osds and down_osds_we_would_probe.  If
down_osds_we_would_probe isn't empty, then Ceph won't do anything until
it can talk to those down OSDs again.

Once Ceph is able to probe all of the OSDs, you'll need to tell it to
create PGs that have lost all copies.  ceph pg <pgid> mark_unfound_lost
revert is used when the latest version of an object was lost, but previous
versions are still available.  ceph pg force_create_pg <pgid> is used when
all copies of a PG have been deleted.  You may need to scrub the PGs too.
 I've been through this once, and I ran these commands many times before
Ceph finally agreed to re-create the missing PGs.
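
Putting those together, the sequence looks roughly like this (the OSD id is made up;
the PG id is the one from the thread below):

  ceph osd lost 21 --yes-i-really-mean-it    # only once the OSD is truly gone
  ceph pg 5.f4f query                        # check recovery_state / down_osds_we_would_probe
  ceph pg 5.f4f mark_unfound_lost revert
  ceph pg force_create_pg 5.f4f              # only if every copy of the PG is lost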


What your data looks like once you get the cluster healthy again, I can't
say.  I don't know how RBD or RadosGW will handle it.  Based on the number
of pools you have, I'm guessing you're using RadosGW.  If so, once this is
healthy, I would try to retrieve a copy of every object that RadosGW claims
to have, and verify its MD5 hash.  In my case, after finally getting the
PGs created, I ended up deleting the whole pool, and re-running replication.






On Wed, Aug 6, 2014 at 11:33 AM, Sean Sullivan  wrote:

> I think I have a split issue or I can't seem to get rid of these objects.
> How can I tell ceph to forget the objects and revert?
>
> How this happened is that due to the python 2.7.8/ceph bug ( a whole rack
> of ceph went down (it had ubuntu 14.10 and that seemed to have 2.7.8 before
> 14.04). I didn't know what was going on and tried re-installing which
> killed the vast majority of the data. 2/3. The drives are gone and the data
> on them is lost now.
>
> I tried deleting them via rados but that didn't seem to work either and
> just froze there.  Any help would be much appreciated.
>
>
> Pastebin data below
> http://pastebin.com/HU8yZ1ae
>
>
> cephuser@host:~/CephPDC$ ceph --version
> ceph version 0.82-524-gbf04897 (bf048976f50bd0142f291414ea893ef0f205b51a)
>
> cephuser@host:~/CephPDC$ ceph -s
> cluster 9e0a4a8e-91fa-4643-887a-c7464aa3fd14
>  health HEALTH_WARN 2 pgs recovering; 2 pgs stuck unclean; 5 requests
> are blocked > 32 sec; recovery 478/15386946 objects degraded (0.003%);
> 23/5128982 unfound (0.000%)
>  monmap e9: 5 mons at {kg37-12=
> 10.16.0.124:6789/0,kg37-17=10.16.0.129:6789/0,kg37-23=10.16.0.135:6789/0,kg37-28=10.16.0.140:6789/0,kg37-5=10.16.0.117:6789/0},
> election epoch 1450, quorum 0,1,2,3,4 kg37-5,kg37-12,kg37-17,kg37-23,kg37-28
>  mdsmap e100: 1/1/1 up {0=kg37-5=up:active}
>  osdmap e46061: 245 osds: 245 up, 245 in
>   pgmap v3268915: 22560 pgs, 19 pools, 20020 GB data, 5008 kobjects
> 61956 GB used, 830 TB / 890 TB avail
> 478/15386946 objects degraded (0.003%); 23/5128982 unfound
> (0.000%)
>22558 active+clean
>2 active+recovering
>   client io 95939 kB/s rd, 80854 B/s wr, 795 op/s
>
>
> cephuser@host:~/CephPDC$ ceph health detail
> HEALTH_WARN 2 pgs recovering; 2 pgs stuck unclean; 5 requests are blocked
> > 32 sec; 1 osds have slow requests; recovery 478/15386946 objects degraded
> (0.003%); 23/5128982 unfound (0.000%)
> pg 5.f4f is stuck unclean since forever, current state active+recovering,
> last acting [279,115,78]
> pg 5.27f is stuck unclean since forever, current state active+recovering,
> last acting [213,0,258]
> pg 5.f4f is active+recovering, acting [279,115,78], 10 unfound
> pg 5.27f is active+recovering, acting [213,0,258], 13 unfound
> 5 ops are blocked > 67108.9 sec
> 5 ops are blocked > 67108.9 sec on osd.279
> 1 osds have slow requests
> recovery 478/15386946 objects degraded (0.003%); 23/5128982 unfound
> (0.000%)
>
> cephuser@host:~/CephPDC$ ceph pg 5.f4f mark_unfound_lost revert
> 2014-08-06 12:59:42.282672 7f7d4a6fb700  0 -- 10.16.0.117:0/1005129 >>
> 10.16.64.29:6844/718 pipe(0x7f7d4005c120 sd=4 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f7d4005c3b0).fault
> 2014-08-06 12:59:51.890574 7f7d4a4f9700  0 -- 10.16.0.117:0/1005129 >>
> 10.16.64.29:6806/7875 pipe(0x7f7d4005f180 sd=4 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f7d4005fae0).fault
> pg has no unfound objects
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph can't seem to forget

2014-08-07 Thread Craig Lewis
For your RBD volumes, you've lost random 4MiB chunks from your virtual
disks.  Think of it as unrecoverable bad sectors on the HDD.  It was only a
few unfound objects though (ceph status said 23 out of 5128982).  You can
probably recovery from that.

I'd fsck all of the volumes, and perform any application level checks for
anything high level (database tests for MySQL, stuff like that).

If you still have the list of unfound objects, you might be able to trace
it back to the specific RBD volume.  That would give you a short list of
volumes to check, instead of doing them all.




On Thu, Aug 7, 2014 at 3:54 PM, Sean Sullivan  wrote:

> Thanks craig! I think I got it back up. The odd thing is that only 2 of
> the pgs using the osds on the downed nodes were corrupted.
>
> I ended up forcing all of the osds in the pool groups down, rebooting the
> hosts. Then restarting the osds and bringing them back up to get it
> working.
>
> I had previously rebooted the osds in the pgs but something must have been
> stuck.
>
> Now I am seeing corrupt data like you mentioned and am beginning to
> question the integrity of the pool.
>
> So far the cinder volume for our main login node had some corruption but
> no complaints so far. Repaired without issue.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Apache on Trusty

2014-08-08 Thread Craig Lewis
Is anybody running Ubuntu Trusty, but using Ceph's apache 2.2 and fastcgi
packages?

I'm a bit of a Ubuntu noob.  I can't figure out the correct
/etc/apt/preferences.d/ configs to prioritize  Ceph's version of the
packages.  I keep getting Ubuntu's apache 2.4 packages.

Can somebody that has this working tell me what configs I need to change?



Alternately, should I just ditch apache entirely, and migrate to Firefly
and Civetweb?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk: Error: ceph osd start failed: Command '['/sbin/service', 'ceph', 'start', 'osd.5']' returned non-zero exit status 1

2014-08-11 Thread Craig Lewis
Are the disks mounted?  You should have a single mount for each OSD
in /var/lib/ceph/osd/ceph-<id>/.

If they're not mounted, is there anything complicated about your disks?
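
A quick way to check (you should see one xfs mount per OSD):

  mount | grep /var/lib/ceph/osd
  df -h /var/lib/ceph/osd/ceph-*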


On Mon, Aug 11, 2014 at 6:32 AM, Yitao Jiang  wrote:

> Hi,
>
> I launched a ceph (ceph version 0.80.5) lab on my laptop with 7 disks for
> osd.
> Yesterday everything worked fine, and I could create new pools and mount them.
> But after a reboot, ceph is not working; more specifically, the OSDs do not start.
> Below are the logs:
>
> [root@cephnode1 ~]# ceph-disk activate-all
> === osd.5 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-5
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5
> --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.5']' returned non-zero exit status 1
> === osd.7 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-7
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7
> --keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.7']' returned non-zero exit status 1
> === osd.3 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-3
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.3
> --keyring=/var/lib/ceph/osd/ceph-3/keyring osd crush create-or-move -- 3
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.3']' returned non-zero exit status 1
> === osd.4 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-4
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.4
> --keyring=/var/lib/ceph/osd/ceph-4/keyring osd crush create-or-move -- 4
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.4']' returned non-zero exit status 1
> === osd.1 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-1
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1
> --keyring=/var/lib/ceph/osd/ceph-1/keyring osd crush create-or-move -- 1
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.1']' returned non-zero exit status 1
> === osd.2 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-2
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.2
> --keyring=/var/lib/ceph/osd/ceph-2/keyring osd crush create-or-move -- 2
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.2']' returned non-zero exit status 1
> === osd.6 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-6
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.6
> --keyring=/var/lib/ceph/osd/ceph-6/keyring osd crush create-or-move -- 6
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.6']' returned non-zero exit status 1
> === osd.0 ===
> Mounting xfs on cephnode1:/var/lib/ceph/osd/ceph-0
> failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0
> --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0
> 0.02 host=cephnode1 root=default'
> ceph-disk: Error: ceph osd start failed: Command '['/sbin/service',
> 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
> ceph-disk: Error: One or more partitions failed to activate
>
> [root@cephnode1 ~]# ps -aef | grep ceph
> root  2021 1  0 21:02 ?00:00:03 /usr/bin/ceph-mon -i
> cephnode1 --pid-file /var/run/ceph/mon.cephnode1.pid -c /etc/ceph/ceph.conf
> --cluster ceph
> root  2110 1  0 21:02 ?00:00:03 /usr/bin/ceph-mds -i
> cephnode1 --pid-file /var/run/ceph/mds.cephnode1.pid -c /etc/ceph/ceph.conf
> --cluster ceph
> root  6965  2278  0 21:31 pts/100:00:00 grep ceph
>
>
> Do you have any ideas ?
> ​
> ---
> Thanks,
> Yitao(依涛 姜)
> jiangyt.github.io
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map advice

2014-08-11 Thread Craig Lewis
Your MON nodes are separate hardware from the OSD nodes, right?  If so,
with replication=2, you should be able to shut down one of the two OSD
nodes, and everything will continue working.  Since it's for
experimentation, I wouldn't deal with the extra hassle of replication=4 and
custom CRUSH rules to make it work.  If you have your heart set on that, it
should be possible.  I'm no CRUSH expert though, so I can't say for certain
until I've actually done it.
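
For completeness, the usual sketch of such a rule looks like the following. It's
untested here, so treat it as a starting point only (it assumes the default root and
host bucket types):

  rule two_hosts_two_osds {
          ruleset 1
          type replicated
          min_size 4
          max_size 4
          step take default
          step choose firstn 2 type host
          step chooseleaf firstn 2 type osd
          step emit
  }

With size=4 and that rule, each PG should get two OSDs on each of two hosts.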

I'm a bit confused why your performance is horrible though.  I'm assuming
your HDDs are 7200 RPM.  With the SSD journals and replication=3, you won't
have a ton of IO, but you shouldn't have any problem doing > 100 MB/s with
4 MB blocks.  Unless your SSDs are very low quality, the HDDs should be
your bottleneck.




On Fri, Aug 8, 2014 at 10:24 PM, John Morris  wrote:

> Our experimental Ceph cluster is performing terribly (with the operator to
> blame!), and while it's down to address some issues, I'm curious to hear
> advice about the following ideas.
>
> The cluster:
> - two disk nodes (6 * CPU, 16GB RAM each)
> - 8 OSDs (4 each)
> - 3 monitors
> - 10Gb front + back networks
> - 2TB Enterprise SATA drives
> - HP RAID controller w/battery-backed cache
> - one SSD journal drive for each two OSDs
>
> First, I'd like to play with taking one machine down, but with the other
> node continuing to serve the cluster.  To maintain redundancy in this
> scenario, I'm thinking of setting the pool size to 4 and the min_size to 2,
> with the idea that a proper CRUSH map should always keep two copies on each
> disk node.  Again, *this is for experimentation* and probably raises red
> flags for production, but I'm just asking if it's *possible*:  Could one
> node go down and the other node continue to serve r/w data?  Any anecdotes
> of performance differences between size=4 and size=3 in other clusters?
>
> Second, does it make any sense to divide the CRUSH map into an extra level
> for the SSD disks, which each hold journals for two OSDs?  This might
> increase redundancy in case of a journal disk failure, but ISTR something
> about too few OSDs in a bucket causing problems with the CRUSH algorithm.
>
> Thanks-
>
> John
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph network

2014-08-11 Thread Craig Lewis
Only the OSDs use the cluster network.  OSD heartbeats use both networks, to
verify connectivity.

Check out the Network Configuration Reference:
http://ceph.com/docs/master/rados/configuration/network-config-ref/
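
The relevant ceph.conf lines look like this (subnets are placeholders); OSDs bind to
both networks, while MONs and MDSes only use the public one:

  [global]
    public network  = 192.168.1.0/24
    cluster network = 192.168.10.0/24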



On Mon, Aug 11, 2014 at 6:30 PM, yuelongguang  wrote:

> hi,all
> i know ceph differentiates networks; mostly it uses the public, cluster,
> and heartbeat networks.
> do mon and mds have those networks? i only know osd has them.
> is there a place that introduces ceph's networking?
>
>
> thanks.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practice of installing ceph(large-scale deployment)

2014-08-11 Thread Craig Lewis
Take a look at Cern's "Scaling Ceph at Cern" slides, as well as Inktank's
Hardware Configuration Guide.


You need at least 3 MONs for production.  You might want more if depending
on the size and your failure domains.

Since you're doing RBD, you're going to be more concerned with latency than
raw storage space.  You'll want to look at the number of IOPS each disk can
do, divided by the number of replicas you want, reduced by the percentage
of the cluster you're willing to lose and still have acceptable
performance.  Add more OSDs until you get the IOPS you want.

SSD journals will really help to get the full IOPS out of each disk.  Make
sure the SSD has enough write speed to match the OSDs using it.  ie, if
your SSDs can write 400MB/s, and the OSDs can write 100MB/s, then you only
want 4 OSDs sharing an SSD for journals.

Make sure you have enough network bandwidth to handle all of the OSDs.  10x
disks at 100 MB/s is 1 GB/s.  You'll need 10GigE to handle that.


If you're concerned about latency, you probably want a dedicated cluster
network.


To really get the best performance, you need more money and a lot of
testing.  :-)  It's up to you to determine if you need those SSDs and
battery backed write caching RAID cards to meet your performance numbers.
 A larger cluster is a faster cluster (until you bottleneck on network IO).
 More spindles are faster.  If you favor speed over space, you're better off
with twice as many 1TB disks than with 2TB disks.  That'll cost more though,
because you need twice as many nodes to hold those twice as many disks.

Consider a caching tier using SSDs.




On Mon, Aug 11, 2014 at 6:23 PM, yuelongguang  wrote:

> hi,all
> i am using ceph-rbd with openstack as its backend storage.
> is there a best practice?
> 1.
> at least how many osds and mons does it need, and in what proportion?
>
> 2. how do you deploy the network? public, cluster network...
>
> 3. as for performance, what do you do? journals...
>
> 4. anything that improves ceph performance.
> thanks.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map advice

2014-08-12 Thread Craig Lewis
On Mon, Aug 11, 2014 at 11:26 PM, John Morris  wrote:

> On 08/11/2014 08:26 PM, Craig Lewis wrote:
>
>> Your MON nodes are separate hardware from the OSD nodes, right?
>>
>
> Two nodes are OSD + MON, plus a separate MON node.
>
>
>  If so,
>> with replication=2, you should be able to shut down one of the two OSD
>> nodes, and everything will continue working.
>>
>
> IIUC, the third MON node is sufficient for a quorum if one of the OSD +
> MON nodes shuts down, is that right?
>

So yeah, if you lose any one node, you'll be fine.


>
> Replication=2 is a little worrisome, since we've already seen two disks
> simultaneously fail just in the year the cluster has been running.  That
> statistically unlikely situation is the first and probably last time I'll
> see that, but they say lightning can strike twice


That's a low probability, given the number of disks you have.  I would've
taken that bet (with backups).  As the number of OSDs goes up, the
probability of multiple simultaneous failures goes up, and slowly becomes a
bad bet.



>
>
>  Since it's for
>> experimentation, I wouldn't deal with the extra hassle of replication=4
>> and custom CRUSH rules to make it work.  If you have your heart set on
>> that, it should be possible.  I'm no CRUSH expert though, so I can't say
>> for certain until I've actually done it.
>>
>> I'm a bit confused why your performance is horrible though.  I'm
>> assuming your HDDs are 7200 RPM.  With the SSD journals and
>> replication=3, you won't have a ton of IO, but you shouldn't have any
>> problem doing > 100 MB/s with 4 MB blocks.  Unless your SSDs are very
>> low quality, the HDDs should be your bottleneck.
>>
>
> The below setup is tomorrow's plan; today's reality is 3 OSDs on one node
> and 2 OSDs on another, crappy SSDs, 1Gb networks, pgs stuck unclean and no
> monitoring to pinpoint bottlenecks.  My work is cut out for me.  :)
>
> Thanks for the helpful reply.  I wish we could just add a third OSD node
> and have these issues just go away, but it's not in the budget ATM.
>
>
Ah, yeah, that explains the performance problems.  Although, crappy SSD
journals are still better than no SSD journals.  When I added SSD journals
to my existing cluster, I saw my write bandwidth go from 10 MBps/disk to
50MBps/disk.  Average latency dropped a bit, and the variance in latency
dropped a lot.

Just adding more disks to your existing nodes would help performance,
assuming you have room to add them.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-12 Thread Craig Lewis
For the incomplete PGs, can you give me the output of
ceph pg  dump

I'm interested in the recovery_state key of that JSON data.



On Tue, Aug 12, 2014 at 5:29 AM, Riederer, Michael 
wrote:

>  Sorry, but I think that does not help me. I forgot to mention something about
> the operating system:
>
> root@ceph-1-storage:~# dpkg -l | grep libleveldb1
> ii  libleveldb1   1.12.0-1precise.ceph
> fast key-value storage library
> root@ceph-1-storage:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 12.04.5 LTS
> Release:12.04
> Codename:   precise
> root@ceph-1-storage:~# uname -a
> Linux ceph-1-storage 3.5.0-52-generic #79~precise1-Ubuntu SMP Fri Jul 4
> 21:03:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> libleveldb1 is greater than the mentioned version 1.9.0-1 ~ bpo70 + 1.
>
> All ceph nodes are IBM x3650 with Intel Xeon CPUs 2.00 GHz and 8 GB RAM,
> ok all very old, about eight years,
> but are still running.
>
> Mike
>
>
>
>  --
> *Von:* Karan Singh [karan.si...@csc.fi]
> *Gesendet:* Dienstag, 12. August 2014 13:00
>
> *An:* Riederer, Michael
> *Cc:* ceph-users@lists.ceph.com
> *Betreff:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>  I am not sure if this helps , but have a look
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10078.html
>
> - Karan -
>
>  On 12 Aug 2014, at 12:04, Riederer, Michael 
> wrote:
>
>  Hi Karan,
>
> root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# ceph
> osd getcrushmap -o crushmap.bin
> got crush map from osdmap epoch 30748
> root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list#
> crushtool -d crushmap.bin -o crushmap.txt
> root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# cat
> crushmap.txt
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 device21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 device27
> device 28 osd.28
> device 29 osd.29
> device 30 osd.30
> device 31 osd.31
> device 32 osd.32
> device 33 osd.33
> device 34 osd.34
> device 35 osd.35
> device 36 osd.36
> device 37 osd.37
> device 38 osd.38
> device 39 osd.39
> device 40 device40
> device 41 device41
> device 42 osd.42
> device 43 osd.43
> device 44 osd.44
> device 45 osd.45
> device 46 osd.46
> device 47 osd.47
> device 48 osd.48
> device 49 osd.49
> device 50 osd.50
> device 51 osd.51
> device 52 osd.52
> device 53 osd.53
> device 54 osd.54
> device 55 osd.55
> device 56 osd.56
> device 57 osd.57
> device 58 osd.58
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host ceph-1-storage {
> id -2# do not change unnecessarily
> # weight 19.330
> alg straw
> hash 0# rjenkins1
> item osd.0 weight 0.910
> item osd.2 weight 0.910
> item osd.3 weight 0.910
> item osd.4 weight 1.820
> item osd.9 weight 1.360
> item osd.11 weight 0.680
> item osd.6 weight 3.640
> item osd.5 weight 1.820
> item osd.7 weight 3.640
> item osd.8 weight 3.640
> }
> host ceph-2-storage {
> id -3# do not change unnecessarily
> # weight 20.000
> alg straw
> hash 0# rjenkins1
> item osd.14 weight 3.640
> item osd.18 weight 1.360
> item osd.19 weight 1.360
> item osd.15 weight 3.640
> item osd.1 weight 3.640
> item osd.12 weight 3.640
> item osd.22 weight 0.680
> item osd.23 weight 0.680
> item osd.26 weight 0.680
> item osd.36 weight 0.680
> }
> host ceph-5-storage {
> id -4# do not change unnecessarily
> # weight 11.730
> alg straw
> hash 0# rjenkins1
> item osd.32 weight 0.270
> item osd.37 weight 0.270
> item osd.42 weight 0.270
> item osd.43 weight 1.820
> item osd.44 weight 1.820
> item osd.45 weight 1.820
> item osd.46 weight 1.820
> item osd.47 weight 1.820
> item osd.48 weight 1.820
> }
> room room0 {
> id -8# do not change unnecessarily
> # weight 51.060
> alg straw
> hash 0# rjenkins1
> item ceph-1-storage weight 19.330
> item ceph-2-storage weight 20.000
> item ceph-5-storage weight 11.730
> }
> host ceph-3-storage {
> id -5# do not change unnecessarily
> # weight 15.920
> alg straw
> has

Re: [ceph-users] Power Outage

2014-08-12 Thread Craig Lewis
I can't really help with MDS.  Hopefully somebody else will chime in here.

(Resending, because my last reply was too large.)


On Tue, Aug 12, 2014 at 12:44 PM, hjcho616  wrote:

> Craig,
>
> Thanks.  It turns out one of my memory stick went bad after that power
> outage.  While trying to fix the OSDs I ran in to many kernel crashes.
>  After removing that bad memory, I was able to fix them.  I did remove all
> OSD on that machine and rebuilt it as I didn't trust that data anymore. =P
>
> I was hoping MDS would come up after that.  But it didn't.  It shows this
> and kills itself.  Is this related to 0.82 MDS issue?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-13 Thread Craig Lewis
Yes, ceph pg  query, not dump.  Sorry about that.

Are you having problems with OSD stability?  There's a lot of history in
the [recovery_state][past_intervals]. That's normal when OSDs go down, and
out, and come back up and in. You have a lot of history there. You might
even be getting into the point that you have so much failover history, the
OSDs can't process it all before they hit the suicide timeout.

[recovery_state][probing_osds] lists a lot of OSDs that have recently owned
these PGs. If the OSDs are crashing frequently, you need to get that under
control before proceeding.

Once the OSDs are stable, I think Ceph just needs to scrub and deep-scrub
those PGs.


Until Ceph clears out the [recovery_state][probing_osds] section in the pg
query, it's not going to do anything.  ceph osd lost hears you, but doesn't
trust you.  Ceph won't do anything until it's actually checked those OSDs
itself.  Scrubbing and Deep scrubbing should convince it.

Once that [recovery_state][probing_osds] section is gone, you should see
the [recovery_state][past_intervals] section shrink or disappear. I don't
have either section in my pg query. Once that happens, your ceph pg repair
or ceph pg force_create_pg should finally have some effect.  You may or may
not need to re-issue those commands.




On Tue, Aug 12, 2014 at 9:32 PM, Riederer, Michael 
wrote:

>  Hi Craig,
>
> # ceph pg 2.587 query
> # ceph pg 2.c1 query
> # ceph pg 2.92 query
> # ceph pg 2.e3 query
>
> Please download the output form here:
> http://server.riederer.org/ceph-user/
>
> #
>
>
> It is not possible to map a rbd:
>
> # rbd map testshareone --pool rbd --name client.admin
> rbd: add failed: (5) Input/output error
>
> I found that:
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
> # ceph osd getcrushmap -o crushmap.bin
> got crush map from osdmap epoch 3741
> # crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
> # ceph osd setcrushmap -i crushmap-new.bin
> set crush map
>
> The Cluster had to do some. Now it looks a bit different.
>
> It is still not possible to map a rbd.
>
> root@ceph-admin-storage:~# ceph -s
> cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>  health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs
> stuck unclean
>  monmap e2: 3 mons at {ceph-1-storage=
> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
> election epoch 5010, quorum 0,1,2
> ceph-1-storage,ceph-2-storage,ceph-3-storage
>   osdmap e34206: 55 osds: 55 up, 55 in
>   pgmap v10838368: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
> 22078 GB used, 79932 GB / 102010 GB avail
> 6140 active+clean
>4 incomplete
>
> root@ceph-admin-storage:~# ceph health detail
> HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
> pg 2.92 is stuck inactive since forever, current state incomplete, last
> acting [8,13]
> pg 2.c1 is stuck inactive since forever, current state incomplete, last
> acting [13,8]
> pg 2.e3 is stuck inactive since forever, current state incomplete, last
> acting [20,8]
> pg 2.587 is stuck inactive since forever, current state incomplete, last
> acting [13,8]
>
> pg 2.92 is stuck unclean since forever, current state incomplete, last
> acting [8,13]
> pg 2.c1 is stuck unclean since forever, current state incomplete, last
> acting [13,8]
> pg 2.e3 is stuck unclean since forever, current state incomplete, last
> acting [20,8]
> pg 2.587 is stuck unclean since forever, current state incomplete, last
> acting [13,8]
> pg 2.587 is incomplete, acting [13,8]
> pg 2.e3 is incomplete, acting [20,8]
> pg 2.c1 is incomplete, acting [13,8]
>
> pg 2.92 is incomplete, acting [8,13]
>
> ###
>
> After updating to firefly, I did the following:
>
> # ceph health detail
> HEALTH_WARN crush map has legacy tunables crush map has legacy tunables;
> see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
> # ceph osd crush tunables optimal
> adjusted tunables profile to optimal
>
> Mike
>  --
> *Von:* Craig Lewis [cle...@centraldesktop.com]
> *Gesendet:* Dienstag, 12. August 2014 20:02
> *An:* Riederer, Michael
> *Cc:* Karan Singh; ceph-users@lists.ceph.com
>
> *Betreff:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>   For the incomplete PGs, can you give me the output of
> ceph pg  dump
>
>  I'm interested in the recovery_state key of that JSON data.
>
>
>
> On Tue, Aug 12, 2014 at 5:2

Re: [ceph-users] can osd start up if journal is lost and it has not been replayed?

2014-08-13 Thread Craig Lewis
If the journal is lost, the OSD is lost.  This can be a problem if you use
1 SSD for journals for many OSDs.

There has been some discussion about making the OSDs able to recover from a
lost journal, but I haven't heard anything else about it.  I haven't been
paying much attention to the developer mailing list though.


For your second question, I'd start by looking at the source code
in src/osd/ReplicatedPG.cc (for standard replication), or
src/osd/ECBackend.cc (for Erasure Coding).  I'm not a Ceph developer
though, so that might not be the right place to start.



On Tue, Aug 12, 2014 at 7:08 PM, yuelongguang  wrote:

> hi,all
>
> 1.
> can osd start up  if journal is lost and it has not been replayed?
>
> 2.
> how it catchs up latest epoch?  take osd as example,  where is the code?
> it better you consider journal is lost or not.
> in my mind journal only includes meta/R/W operations, does not include
> data(file data).
>
>
> thanks
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-14 Thread Craig Lewis
It sounds like you need to throttle recovery.  I have this in my ceph.conf:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1


Those configs, plus SSD journals, really helped the stability of my cluster
during recovery.  Before I made those changes, I would see OSDs get voted
down by other OSDs for not responding to heartbeats quickly.  Messages in
ceph.log like:
osd.# IP:PORT 420 : [WRN] map e41738 wrongly marked me down

are an indication that OSDs are so overloaded that they're getting kicked
out.
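
If you want to apply those throttles to a running cluster without restarting
anything, injectargs should do it (from memory, so double-check the option
names against your version):

  ceph tell osd.* injectargs '--osd-max-backfills 1 \
      --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

That takes effect immediately but isn't persistent, so put the same values in
ceph.conf as well.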



I also ran into problems when OSDs were getting kicked repeatedly.  It
caused those really large sections in pg query's
[recovery_state][past_intervals]
that you also have. I would restart an OSD, it would peer, and then hit the
suicide timeout 300 seconds after starting the peering process.  When I first saw
it, it was only affecting a few OSDs.  If you're seeing repeated suicide
timeouts in the OSD's logs, there's a manual process to catch them up.



On Thu, Aug 14, 2014 at 12:25 AM, Riederer, Michael 
wrote:

>  Hi Craig,
>
> Yes we have stability problems. The cluster is definitely not suitable for
> a production environment. I will not describe the details here. I want to get
> to know ceph and this is possible with the Test-cluster. Some osds are
> very slow, less than 15 MB / sec writable. Also increases the load on the
> ceph nodes to over 30 when a osd is removed and a reorganistation of the
> data is necessary. If the load is very high (over 30) I have seen exactly
> what you describe. osds go down and out and come back up and in.
>
> OK. I'll try the slow osd to remove and then to scrub, deep-scrub the pgs.
>
> Many thanks for your help.
>
> Regards,
> Mike
>
>  --
> *Von:* Craig Lewis [cle...@centraldesktop.com]
> *Gesendet:* Mittwoch, 13. August 2014 19:48
>
> *An:* Riederer, Michael
> *Cc:* Karan Singh; ceph-users@lists.ceph.com
> *Betreff:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>   Yes, ceph pg  query, not dump.  Sorry about that.
>
>  Are you having problems with OSD stability?  There's a lot of history in
> the [recovery_state][past_intervals]. That's normal when OSDs go down,
> and out, and come back up and in. You have a lot of history there. You
> might even be getting into the point that you have so much failover
> history, the OSDs can't process it all before they hit the suicide timeout.
>
>  [recovery_state][probing_osds] lists a lot of OSDs that have recently
> owned these PGs. If the OSDs are crashing frequently, you need to get that
> under control before proceeding.
>
>  Once the OSDs are stable, I think Ceph just needs to scrub and
> deep-scrub those PGs.
>
>
>  Until Ceph clears out the [recovery_state][probing_osds] section in the
> pg query, it's not going to do anything.  ceph osd lost hears you, but
> doesn't trust you.  Ceph won't do anything until it's actually checked
> those OSDs itself.  Scrubbing and Deep scrubbing should convince it.
>
>  Once that [recovery_state][probing_osds] section is gone, you should see
> the [recovery_state][past_intervals] section shrink or disappear. I don't
> have either section in my pg query. Once that happens, your ceph pg repair
> or ceph pg force_create_pg should finally have some effect.  You may or
> may not need to re-issue those commands.
>
>
>
>
> On Tue, Aug 12, 2014 at 9:32 PM, Riederer, Michael  > wrote:
>
>>  Hi Craig,
>>
>> # ceph pg 2.587 query
>> # ceph pg 2.c1 query
>> # ceph pg 2.92 query
>> # ceph pg 2.e3 query
>>
>> Please download the output form here:
>> http://server.riederer.org/ceph-user/
>>
>> #
>>
>>
>> It is not possible to map a rbd:
>>
>> # rbd map testshareone --pool rbd --name client.admin
>> rbd: add failed: (5) Input/output error
>>
>> I found that:
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
>>  # ceph osd getcrushmap -o crushmap.bin
>>  got crush map from osdmap epoch 3741
>> # crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
>> # ceph osd setcrushmap -i crushmap-new.bin
>> set crush map
>>
>> The Cluster had to do some. Now it looks a bit different.
>>
>> It is still not possible to map a rbd.
>>
>>  root@ceph-admin-storage:~# ceph -s
>> cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>>  health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs
>> stuck unclean
>>  monmap e2: 3 mons at {ceph-1-storage=
>> 10.65.150.101:678

[ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-14 Thread Craig Lewis
In my effort to learn more of the details of Ceph, I'm trying to
figure out how to get from an object name in RadosGW, through the
layers, down to the files on disk.

clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
2014-08-13 23:0214M  28dde9db15fdcb5a342493bc81f91151
s3://cpltest/vmware-freebsd-tools.tar.gz

Looking at the .rgw pool's contents tells me that the cpltest bucket
is default.73886.55:
root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep cpltest
cpltest
.bucket.meta.cpltest:default.73886.55

The rados objects that belong to that bucket are:
root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
default.73886.55_vmware-freebsd-tools.tar.gz
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4


I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
bucket only has a single file (and the sum of the sizes matches).
With many files, I can't infer the link anymore.

How do I look up that link?

I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.



My real goal is the reverse.  I recently repaired an inconsistent PG.
The primary replica had the bad data, so I want to verify that the
repaired object is correct.  I have a database that stores the SHA256
of every object.  If I can get from the filename on disk back to an S3
object, I can verify the file.  If it's bad, I can restore from the
replicated zone.


Aside from today's task, I think it's really handy to understand these
low level details.  I know it's been handy in the past, when I had
disk corruption under my PostgreSQL database.  Knowing (and
practicing) ahead of time really saved me a lot of downtime then.


Thanks for any pointers.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance really drops from 700MB/s to 10MB/s

2014-08-14 Thread Craig Lewis
I find graphs really help here.  One screen that has all the disk I/O
and latency for all OSDs makes it easy to pinpoint the bottleneck.

If you don't have that, I'd go low tech: Watch the blinky lights. It's
really easy to see which disk is the hotspot.



On Thu, Aug 14, 2014 at 6:56 AM, Mariusz Gronczewski
 wrote:
> Actual OSD (/var/log/ceph/ceph-osd.$id) logs would be more useful.
>
> Few ideas:
>
> * do 'ceph health detail' to get detail of which OSD is stalling
> * 'ceph osd perf' to see latency of each osd
> * 'ceph --admin-daemon /var/run/ceph/ceph-osd.$id.asok dump_historic_ops' 
> shows "recent slow" ops
>
> I actually have very similiar problem, cluster goes full speed (sometimes 
> even for hours) and suddenly everything stops for a minute or 5, no disk IO, 
> no IO wait (so disks are fine), no IO errors in kernel log, and OSDs only 
> complain that other OSD subop is slow (but on that OSD everything looks fine 
> too)
>
> On Wed, 13 Aug 2014 16:04:30 -0400, German Anders
>  wrote:
>
>> Also, even a "ls -ltr" could be done inside the /mnt of the RBD that
>> it freeze the prompt. Any ideas? I've attach some syslogs from one of
>> the OSD servers and also from the client. Both are running Ubuntu
>> 14.04LTS with Kernel  3.15.8.
>> The cluster is not usable at this point, since I can't run a "ls" on
>> the rbd.
>>
>> Thanks in advance,
>>
>> Best regards,
>>
>>
>> German Anders
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> > --- Original message ---
>> > Asunto: Re: [ceph-users] Performance really drops from 700MB/s to
>> > 10MB/s
>> > De: German Anders 
>> > Para: Mark Nelson 
>> > Cc: 
>> > Fecha: Wednesday, 13/08/2014 11:09
>> >
>> >
>> > Actually is very strange, since if i run the fio test on the client,
>> > and also un parallel run a iostat on all the OSD servers, i don't see
>> > any workload going on over the disks, I mean... nothing! 0.00and
>> > also the fio script on the client is reacting very rare too:
>> >
>> >
>> > $ sudo fio --filename=/dev/rbd1 --direct=1 --rw=write --bs=4m
>> > --size=10G --iodepth=16 --ioengine=libaio --runtime=60
>> > --group_reporting --name=file99
>> > file99: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio,
>> > iodepth=16
>> > fio-2.1.3
>> > Starting 1 process
>> > Jobs: 1 (f=1): [W] [2.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> > 01h:26m:43s]
>> >
>> > It's seems like is doing nothing..
>> >
>> >
>> >
>> > German Anders
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >> --- Original message ---
>> >> Asunto: Re: [ceph-users] Performance really drops from 700MB/s to
>> >> 10MB/s
>> >> De: Mark Nelson 
>> >> Para: 
>> >> Fecha: Wednesday, 13/08/2014 11:00
>> >>
>> >> On 08/13/2014 08:19 AM, German Anders wrote:
>> >>>
>> >>> Hi to all,
>> >>>
>> >>>I'm having a particular behavior on a new Ceph cluster.
>> >>> I've map
>> >>> a RBD to a client and issue some performance tests with fio, at this
>> >>> point everything goes just fine (also the results :) ), but then I try
>> >>> to run another new test on a new RBD on the same client, and suddenly
>> >>> the performance goes below 10MB/s and it took almost 10 minutes to
>> >>> complete a 10G file test, if I issue a *ceph -w* I don't see anything
>> >>> suspicious, any idea what can be happening here?
>> >>
>> >> When things are going fast, are your disks actually writing data out
>> >> as
>> >> fast as your client IO would indicate? (don't forgot to count
>> >> replication!)  It may be that the great speed is just writing data
>> >> into
>> >> the tmpfs journals (if the test is only 10GB and spread across 36
>> >> OSDs,
>> >> it could finish pretty quickly writing to tmpfs!).  FWIW, tmpfs
>> >> journals
>> >> aren't very safe.  It's not something you want to use outside of
>> >> testing
>> >> except in unusual circumstances.
>> >>
>> >> In your tests, when things are bad: it's generally worth checking to
>> >> see
>> >> if any one disk/osd is backed up relative to the others.  There are a
>> >> couple of ways to accomplish this.  the Ceph admin socket can tell you
>> >> information about each OSD ie how many outstanding IOs and a history
>> >> of
>> >> slow ops.  You can also look at per-disk statistics with something
>> >> like
>> >> iostat or collectl.
>> >>
>> >> Hope this helps!
>> >>
>> >>>
>> >>>
>> >>>The cluster is made of:
>> >>>
>> >>> 3 x MON Servers
>> >>> 4 x OSD Servers (3TB SAS 6G disks for OSD daemons & tmpfs for Journal
>> >>> ->
>> >>> there's one tmpfs of 36GB that is share by 9 OSD daemons, on each
>> >>> server)
>> >>> 2 x Network SW (Cluster and Public)
>> >>> 10GbE speed on both networks
>> >>>
>> >>>The ceph.conf file is the following:
>> >>>
>> >>> [global]
>> >>> fsid = 56e56e4c-ea59-4157-8b98-acae109bebe1
>> >>> mon_initial_members = cephmon01, cephmon02, cephmon03
>> >>> mon_host = 10.97.10.1,10.97.10.2,10.97.10.3
>> >>> auth_client_required = cephx
>> >>> auth_cluster_

Re: [ceph-users] CRUSH map advice

2014-08-14 Thread Craig Lewis
On Thu, Aug 14, 2014 at 12:47 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Tue, 12 Aug 2014 10:53:21 -0700 Craig Lewis wrote:
>
>> That's a low probability, given the number of disks you have.  I would've
>> taken that bet (with backups).  As the number of OSDs goes up, the
>> probability of multiple simultaneous failures goes up, and slowly
>> becomes a bad bet.
>>
>
> I must be very unlucky then. ^o^
> As in, I've had dual disk failures in a set of 8 disks 3 times now
> (within the last 6 years).
> And twice that lead to data loss, once with RAID5 (no surprise there) and
> once with RAID10 (unlucky failure of neighboring disks).
> Granted, that was with consumer HDDs and the last one with rather well
> aged ones, too. But there you go.

Yeah, I'd say you're unlucky, unless you're running a pretty large cluster.
 I usually run my 8 disk arrays in RAID-Z2 / RAID6 though; 5 disks is my
limit for RAID-Z1 / RAID5.

I've been lucky so far.  No double failures in my RAID-Z1 / RAID5 arrays,
and no triple failures in my RAID-Z2 / RAID6 arrays.  After 15 years and
hundreds of arrays, I should've had at least one.  I have had several
double failures in RAID1, but none of those were important.


If this isn't a big cluster, I would suspect that you have a vibration or
power issue.  Both are known to cause premature death in HDDs.  Of course,
rebuilding a degraded RAID is also a well known cause of premature HDD
death.



> As for backups, those are for when somebody does something stupid and
> deletes stuff they shouldn't have.
> A storage system should be a) up all the time and b) not loose data.


I completely agree, but never trust it.

Over the years, I've used backups to recover when:

   - I do something stupid
   - My developers do something stupid
   - Hardware does something stupid
   - Manufacturer firmware does something stupid
   - Manufacturer Tech support tells me to do something stupid
   - My datacenter does something stupid
   - My power companies do something stupid

I've lost data from a software RAID0, all the way up to a
quadruply-redundant multi-million dollar hardware storage array.
 Regardless of the promises printed on the box, it's the contingency plans
that keep the paychecks coming.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can osd start up if journal is lost and it has not been replayed?

2014-08-15 Thread Craig Lewis
It just hasn't been implemented yet.  The developers are mostly working on
big features, and waiting to do these small optimizations later.  I'm sure
there are plans to address this, but I doubt it will be soon.

If you're interested, you're welcome to contribute:
http://ceph.com/community/contribute/
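
For what it's worth, a planned journal move is a different story.  Something
along these lines should be safe (a sketch only; test it on one OSD first):

  stop ceph-osd id=N
  ceph-osd -i N --flush-journal
  (move or recreate the journal device/file, and point the OSD's journal
   symlink or ceph.conf at the new location)
  ceph-osd -i N --mkjournal
  start ceph-osd id=N

That only helps while the old journal is still readable, so it doesn't change
the answer above for a journal that's actually lost.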


On Thu, Aug 14, 2014 at 6:21 PM, yuelongguang  wrote:

> hi
> could you tell the reason, why 'the journal is lost, the OSD is lost'? if
> journal is lost, actually it only lost part  which ware not replayed.
> let take a similar case as example, a osd is down for some time , its
> journal is out of date(lose part of journal), but it can catch up with
> other osds. why?
> that example can tell that  either outdated osd can get all journal from
> others  or 'catch up' has different theory with journal.
> could you explain?
>
>
>
> thanks
>
>
>
>
>
>
> At 2014-08-14 05:21:20, "Craig Lewis"  wrote:
>
> If the journal is lost, the OSD is lost.  This can be a problem if you use
> 1 SSD for journals for many OSDs.
>
> There has been some discussion about making the OSDs able to recover from
> a lost journal, but I haven't heard anything else about it.  I haven't been
> paying much attention to the developer mailing list though.
>
>
> For your second question, I'd start by looking at the source code
> in src/osd/ReplicatedPG.cc (for standard replication), or
> src/osd/ECBackend.cc (for Erasure Coding).  I'm not a Ceph developer
> though, so that might not be the right place to start.
>
>
>
> On Tue, Aug 12, 2014 at 7:08 PM, yuelongguang  wrote:
>
>> hi,all
>>
>> 1.
>>  can osd start up  if journal is lost and it has not been replayed?
>>
>> 2.
>> how it catchs up latest epoch?  take osd as example,  where is the code?
>> it better you consider journal is lost or not.
>> in my mind journal only includes meta/R/W operations, does not include
>> data(file data).
>>
>>
>> thanks
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-18 Thread Craig Lewis
I take it that OSDs 8, 13, and 20 are some of the stopped OSDs.

I wasn't able to get ceph to execute ceph pg force_create_pg until the OSDs in
[recovery_state][probing_osds] from ceph pg query were online.  I ended up
reformatting most of them, and re-adding them to the cluster.

What's wrong with those OSDs?  How slow are they?  If the problem is just
that they're really slow, try starting them up, and manually marking them
UP and OUT.  That way Ceph will read from them, but not write to them.  If
they won't stay up, I'd replace them, and get the replacements back in the
cluster.  I'd leave the replacements UP and OUT.  You can rebalance later,
after the cluster is healthy again.



I've never seen the replay state, so I'm not sure what to do with that.



On Mon, Aug 18, 2014 at 5:05 AM, Riederer, Michael 
wrote:

>  What has changed in the cluster compared to my first mail, the cluster
> was in a position to repair one pg, but now has a different pg in status
> "active+clean+replay"
>
> root@ceph-admin-storage:~# ceph pg dump | grep "^2.92"
> dumped all in format plain
> 2.920000000active+clean2014-08-18
> 10:37:20.9628580'036830:577[8,13]8[8,13]80'0
> 2014-08-18 10:37:20.96272813503'13904192014-08-14 10:37:12.497492
> root@ceph-admin-storage:~# ceph pg dump | grep replay
> dumped all in format plain
> 0.49a0000000active+clean+replay
> 2014-08-18 13:09:15.3172210'036830:1704[12,10]12
> [12,10]120'02014-08-18 13:09:15.3171310'02014-08-18
> 13:09:15.317131
>
> Mike
>
>  --
> *Von:* ceph-users [ceph-users-boun...@lists.ceph.com]" im Auftrag von
> "Riederer, Michael [michael.riede...@br.de]
> *Gesendet:* Montag, 18. August 2014 13:40
> *An:* Craig Lewis
> *Cc:* ceph-users@lists.ceph.com; Karan Singh
>
> *Betreff:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>   Hi Craig,
>
> I brought the cluster in a stable condition. All slow osds are no longer in
> the cluster. All remaining 36 osds are more than 100 MB / sec writeable
> (dd if=/dev/zero of=testfile-2.txt bs=1024 count=4096000). No ceph client
> is connected to the cluster. The ceph nodes are in idle. Now sees the
> state as follows:
>
> root@ceph-admin-storage:~# ceph -s
> cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>  health HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck
> inactive; 3 pgs stuck unclean
>  monmap e2: 3 mons at {ceph-1-storage=
> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
> election epoch 5018, quorum 0,1,2
> ceph-1-storage,ceph-2-storage,ceph-3-storage
>  osdmap e36830: 36 osds: 36 up, 36 in
>   pgmap v10907190: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
> 22051 GB used, 68206 GB / 90258 GB avail
> 6140 active+clean
>3 down+incomplete
>1 active+clean+replay
>
> root@ceph-admin-storage:~# ceph health detail
> HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs
> stuck unclean
> pg 2.c1 is stuck inactive since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.e3 is stuck inactive since forever, current state down+incomplete,
> last acting [20,8]
> pg 2.587 is stuck inactive since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.c1 is stuck unclean since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.e3 is stuck unclean since forever, current state down+incomplete,
> last acting [20,8]
> pg 2.587 is stuck unclean since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.587 is down+incomplete, acting [13,8]
> pg 2.e3 is down+incomplete, acting [20,8]
> pg 2.c1 is down+incomplete, acting [13,8]
>
> I have tried the following:
>
> root@ceph-admin-storage:~# ceph pg scrub 2.587
> instructing pg 2.587 on osd.13 to scrub
> root@ceph-admin-storage:~# ceph pg scrub 2.e3
> ^[[Ainstructing pg 2.e3 on osd.20 to scrub
> root@ceph-admin-storage:~# ceph pg scrub 2.c1
> instructing pg 2.c1 on osd.13 to scrub
>
> root@ceph-admin-storage:~# ceph pg deep-scrub 2.587
> instructing pg 2.587 on osd.13 to deep-scrub
> root@ceph-admin-storage:~# ceph pg deep-scrub 2.e3
> instructing pg 2.e3 on osd.20 to deep-scrub
> root@ceph-admin-storage:~# ceph pg deep-scrub 2.c1
> instructing pg 2.c1 on osd.13 to deep-scrub
>
> root@ceph-admin-storage:~# ceph pg repair 2.587
> instructing pg 2.587 on osd.13 to repair
> root@ceph-admin-storage:~# ceph pg repair 2.e

Re: [ceph-users] [radosgw-admin] bilog list confusion

2014-08-18 Thread Craig Lewis
I have the same results.  The primary zone (with log_meta and log_data
true) has bilog data; the secondary zone (with log_meta and log_data
false) does not have bilog data.

I'm just guessing here (I can't test it right now)...  I would think that
disabling log_meta and log_data will stop adding new information to the
bilog, but keep existing bilogs.  If that's true, bilog trim should clean
up the old logs (along with mdlog trim and datalog trim).
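
The commands I'd experiment with (on a test bucket first) look roughly like
this; the trim variants also take marker / time-range options that I'm not
listing here, and "mybucket" is just a placeholder:

  radosgw-admin bilog list --bucket=mybucket
  radosgw-admin bilog trim --bucket=mybucket
  radosgw-admin mdlog list
  radosgw-admin datalog list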





On Mon, Aug 18, 2014 at 5:43 AM, Patrycja Szabłowska <
szablowska.patry...@gmail.com> wrote:

> Hi,
>
>
> Is there any configuration option in ceph.conf for enabling/disabling
> the bilog list?
> I mean the result of this command:
> radosgw-admin bilog list
>
> One ceph cluster gives me results - list of operations which were made
> to the bucket, and the other one gives me just an empty list. I can't
> see what's the reason.
>
>
> I can't find it anywhere here in the ceph.conf file.
> http://ceph.com/docs/master/rados/configuration/ceph-conf/
>
> My guess is it's in region info, but when I've changed these values to
> false for the cluster with working bilog, the bilog would still show.
>
> 1. cluster with empty bilog list:
>   "zones": [
> { "name": "default",
>   "endpoints": [],
>   "log_meta": "false",
>   "log_data": "false"}],
> 2. cluster with *proper* bilog list:
>   "zones": [
> { "name": "master-1",
>   "endpoints": [
> "http:\/\/[...]"],
>   "log_meta": "true",
>   "log_data": "true"}],
>
>
> Here are pools on both of the clusters:
>
> 1. cluster with *proper* bilog list:
> rbd
> .rgw.root
> .rgw.control
> .rgw
> .rgw.gc
> .users.uid
> .users.email
> .users
> .rgw.buckets
> .rgw.buckets.index
> .log
> ''
>
> 2. cluster with empty bilog list:
> data
> metadata
> rbd
> .rgw.root
> .rgw.control
> .rgw
> .rgw.gc
> .users.uid
> .users.email
> .users
> ''
> .rgw.buckets.index
> .rgw.buckets
> .log
>
>
> And here is the zone info (just the placement_pools, rest of the
> config is the same):
> 1. cluster with *proper* bilog list:
> "placement_pools": []
>
> 2. cluster with *empty* bilog list:
>   "placement_pools": [
> { "key": "default-placement",
>   "val": { "index_pool": ".rgw.buckets.index",
>   "data_pool": ".rgw.buckets",
>   "data_extra_pool": ""}}]}
>
>
> Any thoughts? I've tried to figure it out by myself, but no luck.
>
>
>
> Thanks,
> Patrycja Szabłowska
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean

2014-08-19 Thread Craig Lewis
On Tue, Aug 19, 2014 at 1:22 AM, Riederer, Michael 
wrote:

>
>
> root@ceph-admin-storage:~# ceph pg force_create_pg 2.587
> pg 2.587 now creating, ok
> root@ceph-admin-storage:~# ceph pg 2.587 query
> ...
>   "probing_osds": [
> "5",
> "8",
> "10",
> "13",
> "20",
> "35",
> "46",
> "56"],
> ...
>
> All mentioned osds "probing_osds" are up and in, but the cluster can not
> create the pg. Not even scrub, deep-scrub or repair it.
>


My experience is that as long as you have down_osds_we_would_probe in the
pg query, ceph pg force_create_pg won't do anything. ceph osd lost didn't
help. The PGs would go into the creating state, then revert to incomplete. The
only way I was able to get them to stay in the creating state was to
re-create all of the OSD IDs listed in down_osds_we_would_probe.

Even then, it wasn't deterministic. I issued the ceph pg force_create_pg,
and it actually took effect sometime in the middle of the night, after an
unrelated OSD went down and up.

It was a very frustrating experience.



>  Just to be sure, that I did it the right way:
> # stop ceph-osd id=x
> # ceph osd out x
> # ceph osd crush remove osd.x
> # ceph auth del osd.x
> # ceph osd rm x
>



My procedure was the same as yours, with the addition of a ceph osd lost x
before ceph osd rm.
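
Spelled out, with x as the OSD id:

  stop ceph-osd id=x
  ceph osd out x
  ceph osd crush remove osd.x
  ceph auth del osd.x
  ceph osd lost x --yes-i-really-mean-it
  ceph osd rm x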
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] some pgs active+remapped, Ceph can not recover itself.

2014-08-19 Thread Craig Lewis
I believe you need to remove the authorization for osd.4 and osd.6 before
re-creating them.

When I re-format disks, I migrate data off of the disk using:
  ceph osd out $OSDID

Then wait for the remapping to finish.  Once it does:
  stop ceph-osd id=$OSDID
  ceph osd out $OSDID
  ceph auth del osd.$OSDID
  ceph osd crush remove osd.$OSDID
  ceph osd rm $OSDID

Ceph will migrate the data off of it.  When it's empty, you can delete it
using the above commands. Since osd.4 and osd.6 are already lost, you can
just do the part after remapping finishes for them.


You could be having trouble because the sizes of the OSDs are so different.
 I wouldn't mix OSDs that are 100GB and 1.8TB.  Most of the stuck PGs are
on osd.5, osd.7, and one of the small OSDs.  You can migrate data off of
those small disks the same way I said to do osd.10.



On Tue, Aug 19, 2014 at 6:34 AM, debian Only  wrote:

> this is happen after some OSD fail and i recreate osd.
>
> i have did  "ceph osd rm osd.4"  to remove the osd.4 and osd.6. but when i
> use ceph-deploy to install OSD by
>  "ceph-deploy osd --zap-disk --fs-type btrfs create ceph0x-vm:sdb",
> ceph-deploy result said new osd is ready,
>  but the OSD can not start. said that ceph-disk failure.
>  /var/lib/ceph/bootstrap-osd/ceph.keyring and auth:error
>  and i have check the ceph.keyring is same as other on live OSD.
>
>  when i run ceph-deploy twice. first it will create osd.4, failed , will
> display in osd tree.  then osd.6 same.
>  next ceph-deploy osd again, create osd.10, this OSD can start successful.
>  but osd.4 osd.6 display down in osd tree.
>
>  when i use ceph osd reweight-by-utilization,  run one time, more pgs
> active+remapped. Ceph can not recover itself
>
>  and Crush map tunables already optimize.  do not how to solve it.
>
> root@ceph-admin:~# ceph osd crush dump
> { "devices": [
> { "id": 0,
>   "name": "osd.0"},
> { "id": 1,
>   "name": "osd.1"},
> { "id": 2,
>   "name": "osd.2"},
> { "id": 3,
>   "name": "osd.3"},
> { "id": 4,
>   "name": "device4"},
> { "id": 5,
>   "name": "osd.5"},
> { "id": 6,
>   "name": "device6"},
> { "id": 7,
>   "name": "osd.7"},
> { "id": 8,
>   "name": "osd.8"},
> { "id": 9,
>   "name": "osd.9"},
> { "id": 10,
>   "name": "osd.10"}],
>   "types": [
> { "type_id": 0,
>   "name": "osd"},
> { "type_id": 1,
>   "name": "host"},
> { "type_id": 2,
>   "name": "chassis"},
> { "type_id": 3,
>   "name": "rack"},
> { "type_id": 4,
>   "name": "row"},
> { "type_id": 5,
>   "name": "pdu"},
> { "type_id": 6,
>   "name": "pod"},
> { "type_id": 7,
>   "name": "room"},
> { "type_id": 8,
>   "name": "datacenter"},
> { "type_id": 9,
>   "name": "region"},
> { "type_id": 10,
>   "name": "root"}],
>   "buckets": [
> { "id": -1,
>   "name": "default",
>   "type_id": 10,
>   "type_name": "root",
>   "weight": 302773,
>   "alg": "straw",
>   "hash": "rjenkins1",
>   "items": [
> { "id": -2,
>   "weight": 5898,
>   "pos": 0},
> { "id": -3,
>   "weight": 5898,
>   "pos": 1},
> { "id": -4,
>   "weight": 5898,
>   "pos": 2},
> { "id": -5,
>   "weight": 12451,
>   "pos": 3},
> { "id": -6,
>   "weight": 13107,
>   "pos": 4},
> { "id": -7,
>   "weight": 87162,
>   "pos": 5},
> { "id": -8,
>   "weight": 49807,
>   "pos": 6},
> { "id": -9,
>   "weight": 116654,
>   "pos": 7},
> { "id": -10,
>   "weight": 5898,
>   "pos": 8}]},
> { "id": -2,
>   "name": "ceph02-vm",
>   "type_id": 1,
>   "type_name": "host",
>   "weight": 5898,
>   "alg": "straw",
>   "hash": "rjenkins1",
>   "items": [
> { "id": 0,
>   "weight": 5898,
>   "pos": 0}]},
> { "id": -3,
>   "name": "ceph03-vm",
>   "type_id": 1,
>   "type_name": "host",
>   "weight": 5898,
>   "alg": "straw",
>   "hash": "rjenkins1",
>   "items": [
> { "id": 1,
>   "weight": 5898,
>   "pos": 0}]},
> { "id": -4,
>   "name": "ceph01-vm",
>   "type_id": 1,
>   "type_name": "host",
>   "weight": 5898,
>   "alg": "straw"

Re: [ceph-users] how radosgw recycle bucket index object and bucket meta object

2014-08-19 Thread Craig Lewis
By default, Ceph will wait two hours to garbage collect those RGW objects.

You can adjust that time by changing
rgw gc obj min wait

See http://ceph.com/docs/master/radosgw/config-ref/ for the full list of
configs.
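
For example, to shorten the delay to 30 minutes you'd put something like this
in the radosgw client section of ceph.conf (the value is in seconds):

  rgw gc obj min wait = 1800

You can also watch what's pending and kick off a collection pass by hand
(from memory, so check radosgw-admin --help on your version):

  radosgw-admin gc list
  radosgw-admin gc process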




On Tue, Aug 19, 2014 at 7:18 PM, baijia...@126.com 
wrote:

>   I create a bucket and put some objects in the bucket。but I delete the
> all the objects and the bucket, why the bucket.meta object and bucket index
> object
> are exist? when ceph recycle them?
>
> --
>  baijia...@126.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-19 Thread Craig Lewis
Looks like I need to upgrade to Firefly to get ceph-kvstore-tool before I
can proceed.
I am getting some hits just from grepping the LevelDB store, but so far
nothing has panned out.

Thanks for the help!



On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum  wrote:

> It's been a while since I worked on this, but let's see what I remember...
>
> On Thu, Aug 14, 2014 at 11:34 AM, Craig Lewis 
> wrote:
> > In my effort to learn more of the details of Ceph, I'm trying to
> > figure out how to get from an object name in RadosGW, through the
> > layers, down to the files on disk.
> >
> > clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
> > 2014-08-13 23:0214M  28dde9db15fdcb5a342493bc81f91151
> > s3://cpltest/vmware-freebsd-tools.tar.gz
> >
> > Looking at the .rgw pool's contents tells me that the cpltest bucket
> > is default.73886.55:
> > root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls |
> grep cpltest
> > cpltest
> > .bucket.meta.cpltest:default.73886.55
>
> Okay, what you're seeing here are two different types, whose names I'm
> not going to get right:
> 1) The bucket link "cpltest", which maps from the name "cpltest" to a
> "bucket instance". The contents of cpltest, or one of its xattrs, are
> pointing at ".bucket.meta.cpltest:default.73886.55"
> 2) The "bucket instance" .bucket.meta.cpltest:default.73886.55. I
> think this contains the bucket index (list of all objects), etc.
>
> > The rados objects that belong to that bucket are:
> > root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
> > default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
> > default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
> > default.73886.55_vmware-freebsd-tools.tar.gz
> > default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
> > default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4
>
> Okay, so when you ask RGW for the object vmware-freebsd-tools.tar.gz
> from the cpltest bucket, it will look up (or, if we're lucky, have
> cached) the cpltest link, and find out that the "bucket prefix" is
> default.73886.55. It will then try and access the object
> "default.73886.55_vmware-freebsd-tools.tar.gz" (whose construction I
> hope is obvious — bucket instance ID as a prefix, _ as a separate,
> then the object name). This RADOS object is called the "head" for the
> RGW object. In addition to (usually) the beginning bit of data, it
> will also contain some xattrs with things like a "tag" for any extra
> RADOS objects which include data for this RGW object. In this case,
> that tag is "RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ". (This construction is
> how we do atomic overwrites of RGW objects which are larger than a
> single RADOS object, in addition to a few other things.)
>
> I don't think there's any way of mapping from a shadow (tail) object
> name back to its RGW name. but if you look at the rados object xattrs,
> there might (? or might not) be an attr which contains the parent
> object in one form or another. Check that out.
>
> (Or, if you want to check out the source, I think all the relevant
> bits for this are somewhere in the
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> > I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
> > rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
> > bucket only has a single file (and the sum of the sizes matches).
> > With many files, I can't infer the link anymore.
> >
> > How do I look up that link?
> >
> > I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.
> >
> >
> >
> > My real goal is the reverse.  I recently repaired an inconsistent PG.
> > The primary replica had the bad data, so I want to verify that the
> > repaired object is correct.  I have a database that stores the SHA256
> > of every object.  If I can get from the filename on disk back to an S3
> > object, I can verify the file.  If it's bad, I can restore from the
> > replicated zone.
> >
> >
> > Aside from today's task, I think it's really handy to understand these
> > low level details.  I know it's been handy in the past, when I had
> > disk corruption under my PostgreSQL database.  Knowing (and
> > practicing) ahead of time really saved me a lot of downtime then.
> >
> >
> > Thanks for any pointers.
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem setting tunables for ceph firefly

2014-08-21 Thread Craig Lewis
There was a good discussion of this a month ago:
https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg11483.html

That'll give you some things you can try, and information on how to undo it
if it does cause problems.


You can disable the warning by adding this to the [mon] section of
ceph.conf:
  mon warn on legacy crush tunables = false
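
You can also inspect and switch profiles from the CLI instead of editing the
map by hand; roughly (double-check against your version):

  ceph osd crush show-tunables
  ceph osd crush tunables legacy     # put things back the way they were
  ceph osd crush tunables optimal    # opt in to the new behavior

Switching profiles on a cluster with data will move a lot of data around, as
that thread describes.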





On Thu, Aug 21, 2014 at 7:17 AM, Gerd Jakobovitsch 
wrote:

> Dear all,
>
> I have a ceph cluster running in 3 nodes, 240 TB space with 60% usage,
> used by rbd and radosgw clients. Recently I upgraded from emperor to
> firefly, and I got the message about legacy tunables described in
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables. After
> some data rearrangement to minimize risks, I tried to apply the optimal
> settings. This resulted in 28% of object degradation, much more than I
> expected, and worse, I lost communication for the rbd clients, running in
> kernels 3.10 or 3.11.
>
> Searching for a solution, I got to this proposed solution:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11199.html.
> Applying it (before the data was all moved), I got additional 2% of object
> degradation, but the rbd clients came back into working. But then I got a
> large number of degraded or staled PGs, that are not backfilling. Looking
> for the definition of chooseleaf_vary_r, I reached the definition in
> http://ceph.com/docs/master/rados/operations/crush-map/:
> "chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with
> a non-zero value of r, based on how many attempts the parent has already
> made. Legacy default is 0, but with this value CRUSH is sometimes unable to
> find a mapping. The optimal value (in terms of computational cost and
> correctness) is 1. However, for legacy clusters that have lots of existing
> data, changing from 0 to 1 will cause a lot of data to move; a value of 4
> or 5 will allow CRUSH to find a valid mapping but will make less data move."
>
> Is there any suggestion to handle it? Have I to set chooseleaf_vary_r to
> some other value? Will I lose communication with my rbd clients? Or should
> I return to legacy tunables?
>
> Regards,
>
> Gerd Jakobovitsch
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on OSD node failure recovery

2014-08-21 Thread Craig Lewis
The default rules are sane for small clusters with few failure domains.
 Anything larger than a single rack should have customized rules.

It's a good idea to figure this out early.  Changes to your CRUSH rules can
result in a large percentage of data moving around, which will make your
cluster unusable until the migration completes.

It is possible to make changes after the cluster has a lot of data.  From
what I've been able to figure out, it involves a lot of work to manually
migrate data to new pools using the new rules.
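
For reference, the piece of a CRUSH rule that sets the failure domain is the
chooseleaf step.  A rule that spreads replicas across racks instead of hosts
would look roughly like this (the name and ruleset number are illustrative):

  rule replicated_racks {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
  }

The stock rule is the same thing with "type host".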




On Thu, Aug 21, 2014 at 6:23 AM, Sean Noonan 
wrote:

> Ceph uses CRUSH (http://ceph.com/docs/master/rados/operations/crush-map/)
> to determine object placement.  The default generated crush maps are sane,
> in that they will put replicas in placement groups into separate failure
> domains.  You do not need to worry about this simple failure case, but you
> should consider the network and disk i/o consequences of re-replicating
> large amounts of data.
>
> Sean
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> LaBarre, James  (CTR)  A6IT [james.laba...@cigna.com]
> Sent: Thursday, August 21, 2014 9:17 AM
> To: ceph-us...@ceph.com
> Subject: [ceph-users] Question on OSD node failure recovery
>
> I understand the concept with Ceph being able to recover from the failure
> of an OSD (presumably with a single OSD being on a single disk), but I’m
> wondering what the scenario is if an OSD server node containing  multiple
> disks should fail.  Presuming you have a server containing 8-10 disks, your
> duplicated placement groups could end up on the same system.  From diagrams
> I’ve seen they show duplicates going to separate nodes, but is this in fact
> how it handles it?
>
>
> --
> CONFIDENTIALITY NOTICE: If you have received this email in error,
> please immediately notify the sender by e-mail at the address shown.
> This email transmission may contain confidential information.  This
> information is intended only for the use of the individual(s) or entity to
> whom it is intended even if addressed incorrectly.  Please delete it from
> your files if you are not the intended recipient.  Thank you for your
> compliance.  Copyright (c) 2014 Cigna
>
> ==
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Craig Lewis
My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
backfills = 1).  I believe that increases my risk of failure by 48^2.
Since your numbers are failure rate per hour per disk, I need to consider
the risk for the whole time for each disk.  So more formally, the risk scales
with rebuild time to the power of (replicas - 1).

So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
higher risk than 1 / 10^8.


A risk of 1/43,000 means that I'm more likely to lose data due to human
error than disk failure.  Still, I can put a small bit of effort in to
optimize recovery speed, and lower this number.  Managing human error is
much harder.






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary  wrote:

> Using percentages instead of numbers led me to calculation errors. Here
> it is again using 1/100 instead of % for clarity ;-)
>
> Assuming that:
>
> * The pool is configured for three replicas (size = 3 which is the default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 1/100,000 chance to fail within the hour following
> the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> 1/100,000)
> * A given disk does not participate in more than 100 PG
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Craig Lewis
I had a similar problem once.  I traced the problem to a failed battery
on my RAID card, which disabled write caching.  One of the many things I
need to add to monitoring.



On Tue, Aug 26, 2014 at 3:58 AM,  wrote:

>  Hello Gentelmen:-)
>
> Let me point one important aspect of this "low performance" problem: from
> all 4 nodes of our ceph cluster only one node shows bad metrics, that is
> very high latency on its osd's (from 200-600ms), while the other three nodes
> behave normally, that is, latency of their osds is between 1-10ms.
>
> So, the idea of putting journals on SSD is something that we are looking
> at, but we think that we have in general some problem with that particular
> node, what affects whole cluster.
>
> So can the number (4) of hosts be a reason for that? Any other hints?
>
> Thanks
>
> Pawel
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-27 Thread Craig Lewis
I am using GigE.  I'm building a cluster using existing hardware, and the
network hasn't been my bottleneck (yet).

I've benchmarked the single disk recovery speed as about 50 MB/s, using max
backfills = 4, with SSD journals.  If I go higher, the disk bandwidth
increases slightly, and the latency starts increasing.
 At max backfills = 10, I regularly see OSD latency hit the 1 second mark.
 With max backfills = 4, OSD latency is pretty much the same as max
backfills = 1.  I haven't tested 5-9 yet.

I'm tracking latency by polling the OSD perf numbers every minute,
recording the delta from the previous poll, and calculating the average
latency over the last minute.  Given that it's an average over the last
minute, a 1 second average latency is way too high.  That usually means one
operation took > 30 seconds, and the other operations were mostly ok.  It's
common to see blocked operations in ceph -w when latency is this high.
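
One way to do that kind of polling is to sample the admin socket once a minute
(a sketch; the socket path and field names assume a default install, and
osd.12 is just an example id):

ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump > sample.$(date +%s).json
# each sample's osd.op_latency has "avgcount" and "sum" (seconds); the
# average latency over the last minute is
#   (sum_now - sum_prev) / (avgcount_now - avgcount_prev)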


Using 50 MB/s for a single disk, that takes at least 14 hours to rebuild my
disks (4TB disk, 60% full).  If I'm not sitting in front of the computer, I
usually only run 2 backfills.  I'm very paranoid, due to some problems I
had early in the production release.  Most of these problems were caused by
64k XFS inodes, not Ceph.  But I have things working now, so I'm hesitant
to change anything.  :-)




On Tue, Aug 26, 2014 at 11:21 AM, Loic Dachary  wrote:

> Hi Craig,
>
> I assume the reason for the 48 hours recovery time is to keep the cost of
> the cluster low ? I wrote "1h recovery time" because it is roughly the time
> it would take to move 4TB over a 10Gb/s link. Could you upgrade your
> hardware to reduce the recovery time to less than two hours ? Or are there
> factors other than cost that prevent this ?
>
> Cheers
>
> On 26/08/2014 19:37, Craig Lewis wrote:
> > My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
> backfills = 1).   I believe that increases my risk of failure by 48^2 .
> Since your numbers are failure rate per hour per disk, I need to consider
> the risk for the whole time for each disk.  So more formally, rebuild time
> to the power of (replicas -1).
> >
> > So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
> higher risk than 1 / 10^8.
> >
> >
> > A risk of 1/43,000 means that I'm more likely to lose data due to human
> error than disk failure.  Still, I can put a small bit of effort in to
> optimize recovery speed, and lower this number.  Managing human error is
> much harder.
> >
> >
> >
> >
> >
> >
> > On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary wrote:
> >
> > Using percentages instead of numbers led me to calculation errors.
> Here it is again using 1/100 instead of % for clarity ;-)
> >
> > Assuming that:
> >
> > * The pool is configured for three replicas (size = 3 which is the
> default)
> > * It takes one hour for Ceph to recover from the loss of a single OSD
> > * Any other disk has a 1/100,000 chance to fail within the hour
> following the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> 1/100,000)
> > * A given disk does not participate in more than 100 PG
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] do RGW have billing feature? If have, how do we use it ?

2014-08-27 Thread Craig Lewis
Not directly, no.

There is data recorded per bucket that could be used for billing.  Take a
look at radosgw-admin bucket --bucket= stats .

That only covers storage.  If you're looking to bill the same way Amazon
does, I believe that you'll need to query your web server logs to get
number of uploads/downloads, and bandwidth used.
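
For the storage part, it's just (a sketch; the bucket name is a placeholder,
and the exact field names can vary a bit between versions):

radosgw-admin bucket stats --bucket=customer-bucket
# the JSON output has a "usage" section with size_kb and num_objects per bucket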




On Tue, Aug 26, 2014 at 7:10 PM, baijia...@126.com 
wrote:

>
>
> --
>  baijia...@126.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Craig Lewis
My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate higher
latencies.

I was running my cluster with noout and nodown set for weeks at a time.
 Recovery of a single OSD might cause other OSDs to crash.  In the primary
cluster, I was always able to get it under control before it cascaded too
wide.  In my secondary cluster, it did spiral out to 40% of the OSDs, with
2-5 OSDs down at any time.

I traced my problems to a combination of osd max backfills being too high for
my cluster, and my mkfs.xfs arguments causing memory starvation
issues.  I lowered osd max backfills, added SSD journals, and reformatted
every OSD with better mkfs.xfs arguments.  Now both clusters are stable,
and I don't want to break it.

I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up will
also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson 
wrote:

>
> We use 3x replication and have drives that have relatively high
> steady-state IOPS. Therefore, we tend to prioritize client-side IO more
> than a reduction from 3 copies to 2 during the loss of one disk. The
> disruption to client io is so great on our cluster, we don't want our
> cluster to be in a recovery state without operator-supervision.
>
> Letting OSDs get marked out without operator intervention was a disaster
> in the early going of our cluster. For example, an OSD daemon crash would
> trigger automatic recovery where it was unneeded. Ironically, the unneeded
> recovery would often trigger additional daemons to crash,
> making a bad situation worse. During the recovery, rbd client io would
> often times go to 0.
>
> To deal with this issue, we set "mon osd down out interval = 14400", so as
> operators we have 4 hours to intervene before Ceph attempts to self-heal.
> When hardware is at fault, we remove the osd, replace the drive, re-add the
> osd, then allow backfill to begin, thereby completely skipping step B in
> your timeline above.
>
> - Mike
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven OSD usage

2014-09-03 Thread Craig Lewis
ceph osd reweight-by-utilization is ok to use, as long as it's temporary.
 I've used it while waiting for new hardware to arrive.  It adjusts the
weight displayed in ceph osd tree, but not the weight used in the crushmap.
 Yeah, there are two different weights for an OSD.  Leave the crushmap
weight as the size of the disk in TB, and just adjust the tree weight.


It will cause data migration (obviously, that's what you want).  I prefer
to use ceph osd reweight rather than reweight-by-utilization.  I can slowly
dial down the weight, one OSD at a time.
If I recall, I started with something like ceph osd reweight 9 0.975, and
lowered the weight by 0.025 each step.  Most OSDs were fine after one or
two steps, but some made it down to 0.80 before I was happy with them.  It
was an iterative process; sometimes reweighting the next OSD pushed data
back to the OSD I'd just finished reweighting.
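
The loop itself is nothing special (a sketch; osd.9 and the weights are just
examples, and the mount point assumes a default install):

ceph osd reweight 9 0.975
# wait for backfill to settle, then check how full it is
df -h /var/lib/ceph/osd/ceph-9
# still too full?  take another step down
ceph osd reweight 9 0.95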


I do remember running into problems with backfill_toofull though.  Doing a
reweight changes the CRUSH mapping.  If I recall, I got in a state where two
OSDs wanted to exchange PGs, but they were both too full to accept them.
 They had other PGs they could vacate, but because I had osd_max_backfills
= 1, it was stuck.  In the end, I increased the osd backfill full ratio,
like you did.  You can do it without restarting the daemons, using
ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.90'

I also recall telling the monitors too (I don't recall why though), using:
ceph tell mon.\* injectargs '--mon_osd_nearfull_ratio 0.90'



Be aware that marking an OSD as OUT will set its tree weight to 0, and
marking it IN will set the weight to 1.  Once you start using ceph osd
reweight, it's a good idea to keep track of the weights outside of ceph.
 If any OSDs go OUT, you'll want to manually set the
weight, preferably before it backfills itself toofull.


Once you get your new hardware, you should return all the osd weights to
1, and just live with the uneven distribution until you can take
Christian's suggestion for chooseleaf_vary_r.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering

2014-09-03 Thread Craig Lewis
If you're running ntpd, then I believe your clocks were too skewed for the
authentication to work.  Once ntpd got the clocks syncing, authentication
would start working again.


You can query ntpd for how skewed the clock is relative to the NTP servers:
clewis@ceph2:~$ sudo ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+192.168.0.1     132.239.1.6      2 u  519 1024  377    0.290    0.106   0.315
*192.168.0.2     132.239.1.6      2 u  847 1024  377    0.410   -0.732   0.096
 LOCAL(0)        .LOCL.          10 l  60d   64    0    0.000    0.000   0.000

Offset is the number of milliseconds that your local clock differs from
the NTP server.  Repeat on all your monitors.

A difference of more than 50ms between the monitors will cause the clock
skew warnings.  I'm not sure how far out it needs to be to cause
authentication problems.
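
Ceph will also tell you which monitors it considers skewed, and by roughly how
much (the exact wording varies by version):

ceph health detail
# e.g.  mon.cephosd2-monb addr 10.154.249.4:6789/0 clock skew 0.123s > max 0.05s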





On Thu, Aug 28, 2014 at 3:00 AM, yuelongguang  wrote:

>
> the next day  it returnes to normal.
> i have no idea.
>
>
>
>
> At 2014-08-27 00:38:29, "Michael"  wrote:
>
> How far out are your clocks? It's showing a clock skew, if they're too far
> out it can cause issues with cephx.
> Otherwise you're probably going to need to check your cephx auth keys.
>
> -Michael
>
> On 26/08/2014 12:26, yuelongguang wrote:
>
>  hi,all
>
> i have 5 osds and 3 mons. its status is ok then.
>
> to be mentioned , this cluster has no any data.  i just deploy it and to
> be familar with some command lines.
> what is the probpem and how to fix?
>
> thanks
>
>
> ---environment-
> ceph-release-1-0.el6.noarch
> ceph-deploy-1.5.11-0.noarch
> ceph-0.81.0-5.el6.x86_64
> ceph-libs-0.81.0-5.el6.x86_64
> -ceph -s --
> [root@cephosd1-mona ~]# ceph -s
> cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
>  health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 pgs
> stuck unclean; clock skew detected on mon.cephosd2-monb, mon.cephosd3-monc
>  monmap e13: 3 mons at {cephosd1-mona=
> 10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0},
> election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc
>  osdmap e151: 5 osds: 5 up, 5 in
>   pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects
> 201 MB used, 102143 MB / 102344 MB avail
>  167 peering
>  201 active+clean
>   16 remapped+peering
>
>
> --log--osd.0
> 2014-08-26 19:16:13.926345 7f114a8d2700  0 cephx: verify_authorizer could
> not decrypt ticket info: error: decryptor.MessageEnd::Exception:
> StreamTransformationFilter: invalid PKCS #7 block padding found
> 2014-08-26 19:16:13.926355 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >>
> 11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 l=0
> c=0x45d5960).accept: got bad authorizer
> 2014-08-26 19:16:28.928023 7f114a8d2700  0 cephx: verify_authorizer could
> not decrypt ticket info: error: decryptor.MessageEnd::Exception:
> StreamTransformationFilter: invalid PKCS #7 block padding found
> 2014-08-26 19:16:28.928050 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >>
> 11.154.249.7:6800/1599 pipe(0x4dc2800 sd=25 :6800 s=0 pgs=0 cs=0 l=0
> c=0x45d56a0).accept: got bad authorizer
> 2014-08-26 19:16:28.929139 7f114c009700  0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 2014-08-26 19:16:28.929237 7f114c009700  0 -- 11.154.249.2:6800/1667 >>
> 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38071 s=1 pgs=0 cs=0 l=0
> c=0x45d23c0).failed verifying authorize reply
> 2014-08-26 19:16:43.930846 7f114a8d2700  0 cephx: verify_authorizer could
> not decrypt ticket info: error: decryptor.MessageEnd::Exception:
> StreamTransformationFilter: invalid PKCS #7 block padding found
> 2014-08-26 19:16:43.930899 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >>
> 11.154.249.7:6800/1599 pipe(0x4dc2580 sd=25 :6800 s=0 pgs=0 cs=0 l=0
> c=0x45d0b00).accept: got bad authorizer
> 2014-08-26 19:16:43.932204 7f114c009700  0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 2014-08-26 19:16:43.932230 7f114c009700  0 -- 11.154.249.2:6800/1667 >>
> 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38073 s=1 pgs=0 cs=0 l=0
> c=0x45d23c0).failed verifying authorize reply
> 2014-08-26 19:16:58.933526 7f114a8d2700  0 cephx: verify_authorizer could
> not decrypt ticket info: error: decryptor.MessageEnd::Exception:
> StreamTransformationFilter: invalid PKCS #7 block padding found
> 2014-08-26 19:16:58.935094 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >>
> 11.154.249.7:6800/1599 pipe(0x4dc2300 sd=25 :6800 s=0 pgs=0 cs=0 l=0
> c=0x45d0840).accept: got bad authorizer
> 2014-08-26 19:16:58.936239 7f114c009700  0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 2014-08-26 19:16:58.936261 7f114c009700  0 -- 11.154.249.2:

Re: [ceph-users] I fail to add a monitor in a ceph cluster

2014-09-03 Thread Craig Lewis
"monclient: hunting for new mon" happens whenever the monmap changes.  It
will hang if there's no quorum.

I haven't done this manually in a long time, so I'll refer to the Chef
recipes.  The recipe doesn't do the 'ceph-mon add', it just starts the
daemon up.


Try:
sudo ceph-mon -i gail --mkfs --monmap /var/tmp/monmap --keyring
/var/tmp/ceph.mon.keyring
sudo ceph-mon -i gail --public-addr 172.16.1.12
sudo ceph --admin-daemon /var/run/ceph/ceph-mon.gail.asok
add_bootstrap_peer_hint 172.16.1.11





On Mon, Sep 1, 2014 at 8:21 AM, Pascal GREGIS  wrote:

> Hello,
>
> I am currently testing ceph to make a replicated block device for a
> project that would involve 2 data servers accessing this block device, so
> that if one fails or crashes, the data can still be used and the cluster
> can be rebuilt.
>
> This project requires that both machines run an OSD and a monitor, and
> that a 3rd monitor is run somewhere else, so that there is not a single
> point of failure.
> I know it is not the best thing to run an OSD and a monitor on the same
> machine, but I cannot really find a better solution.
>
> My problem is that, after having read several times and followed the
> documentation, I cannot succeed to add a second monitor.
>
> I have bootstrapped a first monitor, added 2 OSDs (one on the machine with
> the monitor, one on the other), and I try to add a second monitor but it
> doesn't work.
> I think I misunderstood something.
>
> Here's what I did :
>
> On the first machine named grenier:
> # setup the configuration file /etc/ceph/ceph.conf (see content further)
> # bootstrap monitor:
> $ ceph-authtool --create-keyring /var/tmp/ceph.mon.keyring --gen-key -n
> mon. --cap mon 'allow *'
> $ sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
> --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow
> *' --cap mds 'allow'
> $ sudo chown myuser /etc/ceph/ceph.client.admin.keyring
> $ ceph-authtool /var/tmp/ceph.mon.keyring --import-keyring
> /etc/ceph/ceph.client.admin.keyring
> $ monmaptool --create --add grenier 172.16.1.11 --fsid $monuuid $tmp/monmap
> $ sudo mkdir -p /var/lib/ceph/mon/ceph-grenier
> $ sudo chown $ID -R /var/lib/ceph/mon/ceph-grenier
> $ ceph-mon --mkfs -i grenier --monmap /var/tmp/monmap --keyring
> /var/tmp/ceph.mon.keyring
> # start monitor:
> $ sudo start ceph-mon id=grenier
> # add OSD:
> $ sudo ceph osd create $osduuid
> $ sudo mkdir -p /var/lib/ceph/osd/ceph-0
> $ sudo ceph-osd -i 0 --mkfs --mkkey --osd-uuid $osduuid
> $ sudo ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' -i
> /var/lib/ceph/osd/ceph-0/keyring
> $ ceph osd crush add-bucket grenier host
> $ ceph osd crush move grenier root=default
> $ ceph osd crush add osd.0 1.0 host=grenier
> # start this OSD
> $ sudo ceph-osd -i 0
>
> # copy /etc/ceph/ceph.conf, /etc/ceph/ceph.client.admin.keyring,
> /var/tmp/ceph/ceph.mon.keyring and /var/tmp/ceph/monmap from grenier to
> second node named gail:
> # add and start OSD on the second node
> $ sudo ceph osd create $newosduuid
> $ sudo mkdir -p /var/lib/ceph/osd/ceph-1
> $ sudo ceph-osd -i 1 --mkfs --mkkey --osd-uuid $newosduuid
> $ sudo ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' -i
> /var/lib/ceph/osd/ceph-1/keyring
> $ ceph osd crush add-bucket gail host
> $ ceph osd crush move gail root=default
> $ ceph osd crush add osd.1 1.0 host=gail
> # start this OSD
> $ sudo ceph-osd -i 1
>
> There, everything works correctly, I can create and map a block device,
> and then write on it and the data is replicated on both nodes.
> When I perform a ceph -s I get :
> cluster a98faf65-b105-4ec7-913c-f8a33a4db4d1
>  health HEALTH_OK
>  monmap e1: 1 mons at {grenier=172.16.1.11:6789/0}, election epoch 2,
> quorum 0 grenier
>  osdmap e13: 2 osds: 2 up, 2 in
>   pgmap v47: 192 pgs, 3 pools, 0 bytes data, 0 objects
> 18400 MB used, 105 GB / 129 GB avail
>  192 active+clean
>
> And here is what I do when trying to add a second monitor on gail:
> $ sudo mkdir -p /var/lib/ceph/mon/ceph-gail
> $ ceph mon getmap -o /var/tmp/monmap
> $ sudo ceph-mon -i gail --mkfs --monmap /var/tmp/monmap --keyring
> /var/tmp/ceph.mon.keyring
>   which prints:
> ceph-mon: set fsid to a98faf65-b105-4ec7-913c-f8a33a4db4d1
> ceph-mon: created monfs at /var/lib/ceph/mon/ceph-gail for mon.gail
>   which seems correct (same uuid as in ceph.conf)
> $ sudo ceph-mon add gail 172.16.1.12
>   This command prints:
> 2014-09-01 17:07:26.033688 7f5538ada700  0 monclient: hunting for new mon
>   and hangs
>
> Then I would like to do this:
> $ sudo ceph-mon -i gail --public-addr 172.16.1.12
>   but it is useless as the previous command failed.
>
>
> Would anybody guess what I am doing wrong ?
>
> I use ceph 0.80 on an Ubuntu trusty.
> My ceph.conf is as follows :
> [global]
>   fsid = a98faf65-b105-4ec7-913c-f8a33a4db4d1
>   mon initial members = grenier
>   mon host = 172.16.1.11
>   public network = 172.16.0.0/16
>   auth cluster requ

Re: [ceph-users] SSD journal deployment experiences

2014-09-04 Thread Craig Lewis
On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster 
wrote:

>
>
> 1) How often are DC S3700's failing in your deployments?
>

None of mine have failed yet.  I am planning to monitor the wear level
indicator, and preemptively replace any SSDs that go below 10%.  Manually
flushing the journal, replacing the SSD, and building a new journal is much
faster than backfilling all the dependent OSDs.
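
The check and the swap look roughly like this (a sketch; the device, OSD id,
and init commands are placeholders, and the wear attribute name is what Intel
drives report):

smartctl -A /dev/sdb | grep -i wearout     # Media_Wearout_Indicator on the DC S3700
# for each OSD journaling on that SSD:
stop ceph-osd id=12                        # or however you normally stop the daemon
ceph-osd -i 12 --flush-journal
# swap the SSD, recreate the journal partition/symlink, then
ceph-osd -i 12 --mkjournal
start ceph-osd id=12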



> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
> backfilling which results from an SSD failure? Have you considered tricks
> like increasing the down out interval so backfilling doesn’t happen in this
> case (leaving time for the SSD to be replaced)?
>

Replacing a failed SSD won't help your backfill.  I haven't actually tested
it, but I'm pretty sure that losing the journal effectively corrupts your
OSDs.  I don't know what steps are required to complete this operation, but
it wouldn't surprise me if you need to re-format the OSD.



> Next, I wonder how people with puppet/chef/… are handling the
> creation/re-creation of the SSD devices. Are you just wiping and rebuilding
> all the dependent OSDs completely when the journal dev fails? I’m not keen
> on puppetizing the re-creation of journals for OSDs...
>

So far, I'm doing my disk zapping manually.  Automatically zapping disks
makes me nervous.  :-)

I'm of the opinion that you shouldn't automate something until it saves you
time versus doing it by hand.  My cluster is small enough that it's faster to
do it manually.



>
> We also have this crazy idea of failing over to a local journal file in
> case an SSD fails. In this model, when an SSD fails we’d quickly create a
> new journal either on another SSD or on the local OSD filesystem, then
> restart the OSDs before backfilling started. Thoughts?
>

See #2.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-09 Thread Craig Lewis
On Sat, Sep 6, 2014 at 7:50 AM, Dan van der Ster 
wrote:

>
> BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> failed, are any object inconsistencies going to be found by a
> scrub/deep-scrub?
>

I haven't tested this, but I did something I *think* is similar.  I deleted
an OSD, removed it from the crushmap, marked it lost, then added it back
without reformatting.  It got the same OSD ID.  I think I spent about 10
minutes doing it.  I don't remember exactly why... I think I was trying to
force_pg_create or something.

If I recall correctly, the backfill was much faster than I expected.  It
should have taken >24 hours.  IIRC, it completed in about 2 hours.  It
wasn't as fast as marking the OSD out and in, but much faster than a
freshly formatted OSD.

It's possible that this only worked because the PGs hadn't completed
backfilling.  Despite my marking the OSD lost, the OSD was still listed in
the pg query, in the osds to probe section.


I want to experiment with losing an SSD.  I'm trying to think of a way to
run the test using VMs, but I haven't come up with anything yet.  All of my
test clusters are virtual, and I'm not ready to test this on a production
cluster yet.

I *think* losing an SSD will be similar to the above, possibly followed by
some inconsistencies found during scrub and deep-scrub.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-09 Thread Craig Lewis
On Sat, Sep 6, 2014 at 9:27 AM, Christian Balzer  wrote:

> On Sat, 06 Sep 2014 16:06:56 + Scott Laird wrote:
>
> > Backing up slightly, have you considered RAID 5 over your SSDs?
> >  Practically speaking, there's no performance downside to RAID 5 when
> > your devices aren't IOPS-bound.
> >
>
> Well...
> For starters, with RAID5 you would lose 25% throughput in both Dan's and
> my case (4 SSDs) compared to JBOD SSD journals.
> In Dan's case that might not matter due to other bottlenecks, in my case
> it certainly would.
>

It's a trade-off between lower performance all the time and much lower
performance while you're backfilling those OSDs.  To me, this seems like a
somewhat reasonable idea for a small cluster, where losing one SSD could
lose >5% of the OSDs.  It doesn't seem worth the effort for a large
cluster, where losing one SSD would lose < 1% of the OSDs.


>
> And while you're quite correct when it comes to IOPS, doing RAID5 will
> either consume significant CPU resource in a software RAID case or require
> a decent HW RAID controller.
>
> Christian


 I haven't worried about CPU with software RAID5 in a very long time...
maybe Pentium 4 days?  It's so rare to actually have 0% Idle CPU, even
under high loads.

Most of my RAID5 is ZFS, but the CPU hasn't been the limiting factor on my
database or NFS servers.  I'm even doing software crypto, without CPU
support, with only a 10% performance penalty.  If the CPU has AES support,
crypto is free.  Obviously, RAID0 (or fully parallel JBOD) will be faster
than RAID5, but RAID5 is faster than RAID10 for all but the most heavily
read biased workloads.  Surprised the hell out of me.  I'll be converting
all of my database servers from RAID10 to RAIDZ.  Of course, benchmarks
that match your workload trump some random yahoo on the internet.  :-)


Ceph OSD nodes are a bit different though.  They're one of the few beasts
I've dealt with that are CPU, Disk, and network bound all at the same time.
 If you have some idle CPU during a big backfill, then I'd consider
Software RAID5 a possibility.  If you ever sustain 0% idle, then I wouldn't
try it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd going down every 15m blocking recovery from degraded state

2014-09-16 Thread Craig Lewis
Is it using any CPU or Disk I/O during the 15 minutes?

On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
christopher.thorjus...@onlinebackupcompany.com> wrote:

> I'm waiting for my cluster to recover from a crashed disk and a second osd
> that has been taken out (crushmap, rm, stopped).
>
> Now I'm stuck at looking at this output ('ceph -w') while my osd.58 goes
> down every 15 minute.
>
> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
>
> Here is a log from when I restarted osd.58 and through the next reboot 15
> minutes later: http://pastebin.com/rt64vx9M
> Short, it just waits for 15 minutes not doing anything and then goes down
> putting lots of lines like this in the log for that osd:
>
> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234 >>
> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159 cs=1 l=0
> c=0x35bcf1e0).fault with nothing to send, going to standby
> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234 >>
> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370 cs=1
> l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>
> Then I have to restart it. And it repeats.
>
> What should/can I do? Take it out?
>
> I've got 4 servers with 24 disks each. Details about servers:
> http://pastebin.com/XQeSh8gJ
> Running dumpling - 0.67.10
>
> Cheers,
> Christopher Thorjussen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] full/near full ratio

2014-09-16 Thread Craig Lewis
On Fri, Sep 12, 2014 at 4:35 PM, JIten Shah  wrote:

>
> 1. If we need to modify those numbers, do we need to update the values in
> ceph.conf and restart every OSD or we can run a command on MON, that will
> overwrite it?
>

That will work.  You can also update the values without a restart using:
ceph tell mon.\* injectargs '--mon_osd_nearfull_ratio 0.85'


You might also need to look at mon_osd_full_ratio, osd_backfill_full_ratio,
osd_failsafe_full_ratio, and  osd_failsafe_nearfull_ratio.

Variables that start with mon should be sent to all the monitors (ceph tell
mon.\* ...), variables that start with osd should be sent to the osds (ceph
tell osd.\* ...).



>
> 2. What is the best way to get the OSD’s to work again, if we reach the
> full ratio amount?  You can’t delete the data because read/write is
> blocked.
>

Add more OSDs.  Preferably before they become full, but it'll work if
they're toofull.  It may take a while though; Ceph doesn't seem to
prioritize which backfills should be done first, so it can take some time
to get to the OSDs that are toofull.

Since not everybody has nodes and disks laying around, you can stop all of
your writes, and bump the nearfull and full ratios.  I've bumped them while
I was using ceph osd reweight, and had some toofull disks that wanted to
exchange PGs.  Keep in mind that Ceph stops writes once usage goes above the
full ratio, so don't set full_ratio to 0.99.  You really don't want to fill
up your disks.
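
If you do bump them, it's the same injectargs pattern as earlier in this
message (a sketch; pick values that still leave real headroom, and put them
back once backfill finishes).  Depending on the version, the monitor's full
ratio may also need to be set in the pgmap with ceph pg set_full_ratio:

ceph pg set_full_ratio 0.97
ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'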

If all else fails (or you get a disk down to 0 kB free) you can manually
delete some PGs on disk.  This is fairly risky, and prone to human error
causing data loss.  You'll have to figure out the best ones to delete, and
you'll want to make sure you don't delete every replica of the PG.  You'll
want to disable backfilling (ceph osd set nobackfill), otherwise Ceph will
repair things back to toofull.



>
> 3. If we add new OSD’s, will it start rebalancing the OSD’s or do I need
> to trigger it manually and how?
>

Adding and starting the OSDs will start rebalancing.  The expected location
will change as soon as you add the OSD to the crushmap.  Shortly after the
OSD starts, it will begin updating to make reality match expectations.  For
most people, that happens in a single step, with ceph-deploy or a Config
Management tool.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crash: trim_objectcould not find coid

2014-09-16 Thread Craig Lewis
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz 
wrote:

>
>
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
>
> All logs from before the disaster are still there, do you have any
> advise on what would be relevant?
>
>

This is a problem.  It's not necessarily a deadlock.  The warning is
printed if the XFS memory allocator has to retry more than 100 times when
it's trying to allocate memory.  It either indicates extremely low memory,
or extremely fragmented memory.  Either way, your OSDs are sitting there
trying to allocate memory instead of doing something useful.
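
A quick way to check whether a host is hitting this is to look for that
warning string:

dmesg | grep -i 'possible memory allocation deadlock'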



By any chance, does your ceph.conf have:
osd mkfs options xfs = -n size=64k

If so, you should start planning to remove that arg, and reformat every
OSD.  Here's a thread where I discuss my (mis)adventures with XFS
allocation deadlocks:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041336.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd going down every 15m blocking recovery from degraded state

2014-09-16 Thread Craig Lewis
I ran into a similar issue before.  I was having a lot of OSD crashes
caused by XFS memory allocation deadlocks.  My OSDs crashed so many times
that they couldn't replay the OSD Map before they would be marked
unresponsive.

See if this sounds familiar:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html

If so, Sage's procedure to apply the osdmaps fixed my cluster:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html





On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen <
christopher.thorjus...@onlinebackupcompany.com> wrote:

> I've got several osds that are spinning at 100%.
>
> I've retained some professional services to have a look. Its out of my
> newbie reach..
>
> /Christopher
>
> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis 
> wrote:
>
>> Is it using any CPU or Disk I/O during the 15 minutes?
>>
>> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
>> christopher.thorjus...@onlinebackupcompany.com> wrote:
>>
>>> I'm waiting for my cluster to recover from a crashed disk and a second
>>> osd that has been taken out (crushmap, rm, stopped).
>>>
>>> Now I'm stuck at looking at this output ('ceph -w') while my osd.58 goes
>>> down every 15 minute.
>>>
>>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>>
>>> Here is a log from when I restarted osd.58 and through the next reboot
>>> 15 minutes later: http://pastebin.com/rt64vx9M
>>> Short, it just waits for 15 minutes not doing anything and then goes
>>> down putting lots of lines like this in the log for that osd:
>>>
>>> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234 >>
>>> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159 cs=1
>>> l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
>>> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234 >>
>>> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370 cs=1
>>> l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>>>
>>> Then I have to restart it. And it repeats.
>>>
>>> What should/can I do? Take it out?
>>>
>>> I've got 4 servers with 24 disks each. Details about servers:
>>> http://pastebin.com/XQeSh8gJ
>>> Running dumpling - 0.67.10
>>>
>>> Cheers,
>>> Christopher Thorjussen
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

2014-09-18 Thread Craig Lewis
No, removing the snapshots didn't solve my problem.  I eventually traced
this problem to XFS deadlocks caused by
[osd]
  "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s
size=4096"

Changing to just "-s size=4096", and reformatting all OSDs solved this
problem.


Since then, I ran into http://tracker.ceph.com/issues/5699.  Snapshots are
off until I've deployed Firefly.


On Wed, Sep 17, 2014 at 8:09 AM, Florian Haas  wrote:

> Hi Craig,
>
> just dug this up in the list archives.
>
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis 
> wrote:
> > In the interest of removing variables, I removed all snapshots on all
> pools,
> > then restarted all ceph daemons at the same time.  This brought up osd.8
> as
> > well.
>
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
>
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything
> in the description sounds familiar?
>
> You mentioned in a later message in the same thread that you would
> keep your snapshot script running and "repeat the experiment". Did the
> situation change in any way after that? Did the issue come back? Or
> did you just stop using snapshots altogether?
>
> Cheers,
> Florian
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd going down every 15m blocking recovery from degraded state

2014-09-18 Thread Craig Lewis
The magic in Sage's steps was really setting noup.  That gives the OSD time
to apply the osdmap changes, without starting the timeout.  Set noup,
nodown, noout, restart the OSD, and wait until the CPU usage goes to zero.
 Some of mine took 5 minutes.  Once it's done, unset noup, and restart
again.  The OSD should join the cluster, and not spin the CPU forever.
 Repeat for every OSD.
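
As a sketch, the per-OSD sequence is roughly (osd.58 and the init command are
just examples; restart the daemon however you normally do):

ceph osd set noup
ceph osd set nodown
ceph osd set noout
/etc/init.d/ceph restart osd.58   # comes up, stays marked down, chews through osdmaps
# wait for the ceph-osd process to go idle
ceph osd unset noup
/etc/init.d/ceph restart osd.58
# repeat for the other OSDs, then
ceph osd unset nodown
ceph osd unset noout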


The XFS params caused my OSDs to crash often enough to cause the big osdmap
backlog.  I was seeing "XFS: possible memory allocation deadlock in
kmem_alloc" in dmesg.  ceph.conf had
[osd]
   "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s
size=4096"

I fixed the problem by changing the config to
[osd]
   "osd mkfs options xfs": "-s size=4096"

Then I reformatted every OSD in my cluster (one at a time).  The -n size=64k
was the problem.  It looks like the 3.14 kernels have a fix:
http://tracker.ceph.com/issues/6301.  Upgrading the kernel might be less
painful that reformatting everything.


On Tue, Sep 16, 2014 at 3:19 PM, Christopher Thorjussen <
christopher.thorjus...@onlinebackupcompany.com> wrote:

> I've been throught your post many times (google likes it ;)
> I've been trying all the noout/nodown/noup.
> But I will look into the XFS issue you are talking about. And read all of
> the post one more time..
>
> /C
>
>
> On Wed, Sep 17, 2014 at 12:01 AM, Craig Lewis 
> wrote:
>
>> I ran into a similar issue before.  I was having a lot of OSD crashes
>> caused by XFS memory allocation deadlocks.  My OSDs crashed so many times
>> that they couldn't replay the OSD Map before they would be marked
>> unresponsive.
>>
>> See if this sounds familiar:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html
>>
>> If so, Sage's procedure to apply the osdmaps fixed my cluster:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html
>>
>>
>>
>>
>>
>> On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen <
>> christopher.thorjus...@onlinebackupcompany.com> wrote:
>>
>>> I've got several osds that are spinning at 100%.
>>>
>>> I've retained some professional services to have a look. Its out of my
>>> newbie reach..
>>>
>>> /Christopher
>>>
>>> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis wrote:
>>>
>>>> Is it using any CPU or Disk I/O during the 15 minutes?
>>>>
>>>> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
>>>> christopher.thorjus...@onlinebackupcompany.com> wrote:
>>>>
>>>>> I'm waiting for my cluster to recover from a crashed disk and a second
>>>>> osd that has been taken out (crushmap, rm, stopped).
>>>>>
>>>>> Now I'm stuck at looking at this output ('ceph -w') while my osd.58
>>>>> goes down every 15 minute.
>>>>>
>>>>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>>
>>>>> Here is a log from when I restarted osd.58 and through the next reboot
>>>>> 15 minutes later: http://pastebin.com/rt64vx9M
>>>>> Short, it just waits for 15 minutes not doing anything and then goes
>>>>> down putting lots of lines like this in the log for that osd:
>>>>>
>>>>> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812

Re: [ceph-users] osd crash: trim_objectcould not find coid

2014-09-19 Thread Craig Lewis
On Fri, Sep 19, 2014 at 2:35 AM, Francois Deppierraz  wrote:

> Hi Craig,
>
> I'm planning to completely re-install this cluster with firefly because
> I started to see other OSDs crashes with the same trim_object error...
>

I did lose data because of this, but it was unrelated to the XFS issues.
 Luckily, it was only RGW replication state, and not something more
important.

I was having issues with OSDs crashing.  I'd mark them out, and the problem
would move to a new OSD.  I tried using the patch in
http://tracker.ceph.com/issues/6101.  It worked, but only as long as I ran
the patch.  When I went back to a stock binary, it started crashing again.
 It also spammed the logs with warnings instead of crashing.

The problem PG was in my RGW .$zone.log pool.  It's small, so I pulled all
of the objects out of the pool, recreated the pool, and uploaded the
objects again.  It messed up my replication state, so I'm still sorting
that out.

It appears to me that the code fix in Firefly (
http://tracker.ceph.com/issues/7595) will prevent the problem from
happening, but not correct an already corrupted store.  I dropped all my
snapshots, and disabled new ones, until I can complete the upgrade.

Rebuilding on FireFly should solve your problem.



>
> So now, I'm more interested in figuring out exactly why data corruption
> happened in the first place than repairing the cluster.
>

I'm not entirely sure from reading http://tracker.ceph.com/issues/7595, but
it looks like occasionally creating a snapshot doesn't save the correct
information.  Then when removing the snapshot, it gets confused and asserts.



>
> Comments in-line.
>
>
> >
> > This is a problem.  It's not necessarily a deadlock.  The warning is
> > printed if the XFS memory allocator has to retry more than 100 times
> > when it's trying to allocate memory.  It either indicates extremely low
> > memory, or extremely fragmented memory.  Either way, your OSDs are
> > sitting there trying to allocate memory instead of doing something
> useful.
>
> Do you mean that this particular error doesn't imply data corruption but
> only bad OSD performance?
>

That was my experience.  That cluster was pretty much unusable, but I was
able to access all of my data once I got the cluster healthy.


> > By any chance, does your ceph.conf have:
> > osd mkfs options xfs = -n size=64k
> >
> > If so, you should start planning to remove that arg, and reformat every
> > OSD.  Here's a thread where I discussion my (mis) adventures with XFS
> > allocation deadlocks:
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041336.html
>
> Yes! Thanks for the details, I'm actually using the puppet-ceph module
> from enovance which indeed uses [1] the '-n size=64k' option when
> formatting a new disk.
>

I would avoid that option when you rebuild your cluster.  There is a fix in
the 3.14 kernels, but it's not really necessary.  That option makes the
inodes larger, which should make directories with millions of files in them
a bit faster.  None of my PGs have more than 10 files in a directory.
 Every time a directory gets more than a few files in it, Ceph creates some
subdirectories, and splits things up.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

2014-09-19 Thread Craig Lewis
Excellent find.

On Fri, Sep 19, 2014 at 7:11 AM, Florian Haas  wrote:

> Hi Craig,
>
> On Fri, Sep 19, 2014 at 2:49 AM, Craig Lewis 
> wrote:
> > No, removing the snapshots didn't solve my problem.  I eventually traced
> > this problem to XFS deadlocks caused by
> > [osd]
> >   "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s
> > size=4096"
> >
> > Changing to just "-s size=4096", and reformatting all OSDs solved this
> > problem.
> >
> >
> > Since then, I ran into http://tracker.ceph.com/issues/5699.  Snapshots
> are
> > off until I've deployed Firefly.
>
> Thanks for responding. We've tracked our issue down to
> http://tracker.ceph.com/issues/9487 in the interim. :)
>
> Cheers,
> Florian
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] confusion when kill 3 osds that store the same pg

2014-09-19 Thread Craig Lewis
Comments inline.

On Thu, Sep 18, 2014 at 8:33 PM, yuelongguang  wrote:

>
> 1.
> [root@cephosd5-gw current]# ceph pg 2.30 query
> Error ENOENT: i don't have pgid 2.30
>
> why i can not query infomations of this pg?  how to dump this pg?
>


I haven't actually tried this, but I expect something like that.  The
primary OSD has all the data about the PG.  In your next question, you show
the acting OSDs as [4,1].  But you shut down all OSDs that did have pg 2.30
before osd.4 or osd.1 could backfill, so osd.4 doesn't know anything about
pg 2.30.

If you bring up one of the other OSDs, osd.4 and osd.1 can backfill, and
then osd.4 will be able to answer your query.

If this was a real 3 disk failure, you would have lost this PG, and all the
data on it.



>
> 2.
> #ceph osd map rbd rbd_data.19d92ae8944a.
> osdmap e1451 pool 'rbd' (2) object
> 'rbd_data.19d92ae8944a.' -> pg 2.c59a45b0 (2.30) -> up
> ([4,1], p4) acting ([4,1], p4)
>
> does 'ceph osd map' command just calculate map , but does not check real
> pg stat?  i do not find 2.30  on osd1 and osd.4.
> new that client will get the new map, why client hang ?
>

I know less about RBD.  I have seen Ceph block on reads, because the
current primary osd doesn't have the latest data about the PG.  Once the
current primary gets the history it's missing from the previous primary,
then it can start to return data.
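
To spell out the difference between the two commands in the question:

ceph osd map rbd rbd_data.19d92ae8944a.   # pure CRUSH calculation from the current osdmap
ceph pg 2.30 query                        # asks the acting primary for the PG's real state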
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

