Re: [ceph-users] mds isn't working anymore after osd's running full

2014-08-20 Thread Jasper Siero
Unfortunately that doesn't help. I restarted both the active and standby mds 
but that doesn't change the state of the mds. Is there a way to force the mds 
to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, 
have 1832)? 

Thanks,

Jasper

From: Gregory Farnum [g...@inktank.com]
Sent: Tuesday, 19 August 2014 19:49
To: Jasper Siero
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
 wrote:
> Hi all,
>
> We have a small ceph cluster running version 0.80.1 with cephfs on five
> nodes.
> Last week some osd's were full and shut themselves down. To help the osd's start
> again I added some extra osd's and moved some placement group directories on
> the full osd's (which have a copy on another osd) to another place on the
> node (as mentioned in
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/)
> After clearing some space on the full osd's I started them again. After a
> lot of deep scrubbing and two pg inconsistencies which needed to be repaired
> everything looked fine except the mds which still is in the replay state and
> it stays that way.
> The log below says that mds need osdmap epoch 1833 and have 1832.
>
> 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
> 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now
> mds.0.25
> 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state
> change up:standby --> up:replay
> 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
> 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25  recovery set is
> 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833,
> have 1832
> 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833
> (which blacklists prior instance)
>
>  # ceph status
> cluster c78209f5-55ea-4c70-8968-2231d2b05560
>  health HEALTH_WARN mds cluster is degraded
>  monmap e3: 3 mons at
> {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0},
> election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>  mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>  osdmap e1951: 12 osds: 12 up, 12 in
>   pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
> 124 GB used, 175 GB / 299 GB avail
>  492 active+clean
>
> # ceph osd tree
> # id   weight    type name          up/down  reweight
> -1     0.2399    root default
> -2     0.05997       host th1-osd001
> 0      0.01999           osd.0      up       1
> 1      0.01999           osd.1      up       1
> 2      0.01999           osd.2      up       1
> -3     0.05997       host th1-osd002
> 3      0.01999           osd.3      up       1
> 4      0.01999           osd.4      up       1
> 5      0.01999           osd.5      up       1
> -4     0.05997       host th1-mon003
> 6      0.01999           osd.6      up       1
> 7      0.01999           osd.7      up       1
> 8      0.01999           osd.8      up       1
> -5     0.05997       host th1-mon002
> 9      0.01999           osd.9      up       1
> 10     0.01999           osd.10     up       1
> 11     0.01999           osd.11     up       1
>
> What is the way to get the mds up and running again?
>
> I still have all the placement group directories which I moved from the full
> osds which where down to create disk space.

Try just restarting the MDS daemon. This sounds a little familiar so I
think it's a known bug which may be fixed in a later dev or point
release on the MDS, but it's a soft-state rather than a disk state
issue.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] some pgs active+remapped, Ceph can not recover itself.

2014-08-20 Thread debian Only
Thanks, Lewis. And I take the suggestion that it is better to use similar OSD
sizes.


2014-08-20 9:24 GMT+07:00 Craig Lewis :

> I believe you need to remove the authorization for osd.4 and osd.6 before
> re-creating them.
>
> When I re-format disks, I migrate data off of the disk using:
>   ceph osd out $OSDID
>
> Then wait for the remapping to finish.  Once it does:
>   stop ceph-osd id=$OSDID
>   ceph osd out $OSDID
>   ceph auth del osd.$OSDID
>   ceph osd crush remove osd.$OSDID
>   ceph osd rm $OSDID
>
> Ceph will migrate the data off of it.  When it's empty, you can delete it
> using the above commands. Since osd.4 and osd.6 are already lost, you can
> just do the part after remapping finishes for them.
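
A minimal sketch of that tail end for the already-lost osd.4 and osd.6, built from the commands above (some of these may report that the id is already gone, which is fine):

  ceph auth del osd.4
  ceph osd crush remove osd.4
  ceph osd rm 4
  ceph auth del osd.6
  ceph osd crush remove osd.6
  ceph osd rm 6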
>
>
> You could be having trouble because the sizes of the OSDs are so different.
>  I wouldn't mix OSDs that are 100GB and 1.8TB.  Most of the stuck PGs are
> on osd.5, osd.7, and one of the small OSDs.  You can migrate data off of
> those small disks the same way I said to do osd.10.
>
>
>
> On Tue, Aug 19, 2014 at 6:34 AM, debian Only  wrote:
>
>> This happened after some OSDs failed and I recreated them.
>>
>> I ran "ceph osd rm osd.4" to remove osd.4 and osd.6, but when I used
>> ceph-deploy to install the OSDs with
>> "ceph-deploy osd --zap-disk --fs-type btrfs create ceph0x-vm:sdb",
>> ceph-deploy reported that the new osd was ready.
>> However, the OSD could not start: ceph-disk failed with an auth error on
>> /var/lib/ceph/bootstrap-osd/ceph.keyring,
>> and I have checked that the ceph.keyring is the same as on the live OSDs.
>>
>> When I ran ceph-deploy twice, the first run created osd.4, which failed but
>> still shows in the osd tree; osd.6 was the same.
>> The next ceph-deploy osd run created osd.10, and that OSD starts
>> successfully, but osd.4 and osd.6 show as down in the osd tree.
>>
>> When I ran ceph osd reweight-by-utilization once, more pgs went
>> active+remapped, and Ceph cannot recover by itself.
>>
>> The CRUSH map tunables are already set to optimal. I do not know how to solve this.
>>
>> root@ceph-admin:~# ceph osd crush dump
>> { "devices": [
>> { "id": 0,
>>   "name": "osd.0"},
>> { "id": 1,
>>   "name": "osd.1"},
>> { "id": 2,
>>   "name": "osd.2"},
>> { "id": 3,
>>   "name": "osd.3"},
>> { "id": 4,
>>   "name": "device4"},
>> { "id": 5,
>>   "name": "osd.5"},
>> { "id": 6,
>>   "name": "device6"},
>> { "id": 7,
>>   "name": "osd.7"},
>> { "id": 8,
>>   "name": "osd.8"},
>> { "id": 9,
>>   "name": "osd.9"},
>> { "id": 10,
>>   "name": "osd.10"}],
>>   "types": [
>> { "type_id": 0,
>>   "name": "osd"},
>> { "type_id": 1,
>>   "name": "host"},
>> { "type_id": 2,
>>   "name": "chassis"},
>> { "type_id": 3,
>>   "name": "rack"},
>> { "type_id": 4,
>>   "name": "row"},
>> { "type_id": 5,
>>   "name": "pdu"},
>> { "type_id": 6,
>>   "name": "pod"},
>> { "type_id": 7,
>>   "name": "room"},
>> { "type_id": 8,
>>   "name": "datacenter"},
>> { "type_id": 9,
>>   "name": "region"},
>> { "type_id": 10,
>>   "name": "root"}],
>>   "buckets": [
>> { "id": -1,
>>   "name": "default",
>>   "type_id": 10,
>>   "type_name": "root",
>>   "weight": 302773,
>>   "alg": "straw",
>>   "hash": "rjenkins1",
>>   "items": [
>> { "id": -2,
>>   "weight": 5898,
>>   "pos": 0},
>> { "id": -3,
>>   "weight": 5898,
>>   "pos": 1},
>> { "id": -4,
>>   "weight": 5898,
>>   "pos": 2},
>> { "id": -5,
>>   "weight": 12451,
>>   "pos": 3},
>> { "id": -6,
>>   "weight": 13107,
>>   "pos": 4},
>> { "id": -7,
>>   "weight": 87162,
>>   "pos": 5},
>> { "id": -8,
>>   "weight": 49807,
>>   "pos": 6},
>> { "id": -9,
>>   "weight": 116654,
>>   "pos": 7},
>> { "id": -10,
>>   "weight": 5898,
>>   "pos": 8}]},
>> { "id": -2,
>>   "name": "ceph02-vm",
>>   "type_id": 1,
>>   "type_name": "host",
>>   "weight": 5898,
>>   "alg": "straw",
>>   "hash": "rjenkins1",
>>   "items": [
>> { "id": 0,
>>   "weight": 5898,
>>   "pos": 0}]},
>> { "id": -3,
>>   "name": "ceph03-vm",
>>   "type_id": 1,
>>   "type_name": "host",
>>   "weight": 5898,
>>   "alg": "straw",
>>   "hash": "rjenkins1

Re: [ceph-users] Problem when building&running cuttlefish from source on Ubuntu 14.04 Server

2014-08-20 Thread NotExist
Hello Gregory:
I'm doing some performance comparisons between different combinations of
environments, so I have to try such an old version.
Thanks for your kind help! The solution you provided does work. I
think I was relying on ceph-disk too much, which is why I didn't notice
this.

2014-08-20 1:44 GMT+08:00 Gregory Farnum :
> On Thu, Aug 14, 2014 at 2:28 AM, NotExist  wrote:
>> Hello everyone:
>>
>> Since there's no cuttlefish package for 14.04 server on ceph
>> repository (only ceph-deploy there), I tried to build cuttlefish from
>> source on 14.04.
>
> ...why? Cuttlefish is old and no longer provided updates. You really
> want to be using either Dumpling or Firefly.
>
>>
>> Here's what I did:
>> Get source by following http://ceph.com/docs/master/install/clone-source/
>> Enter the sourcecode directory
>> git checkout cuttlefish
>> git submodule update
>> rm -rf src/civetweb/ src/erasure-code/ src/rocksdb/
>> to get the latest cuttlefish repo.
>>
>> Build source by following http://ceph.com/docs/master/install/build-ceph/
>> beside the package this url mentioned for Ubuntu:
>>
>> sudo apt-get install autotools-dev autoconf automake cdbs gcc g++ git
>> libboost-dev libedit-dev libssl-dev libtool libfcgi libfcgi-dev
>> libfuse-dev linux-kernel-headers libcrypto++-dev libcrypto++
>> libexpat1-dev pkg-config
>> sudo apt-get install uuid-dev libkeyutils-dev libgoogle-perftools-dev
>> libatomic-ops-dev libaio-dev libgdata-common libgdata13 libsnappy-dev
>> libleveldb-dev
>>
>> I also found it will need
>>
>> sudo apt-get install libboost-filesystem-dev libboost-thread-dev
>> libboost-program-options-dev
>>
>> (And xfsprogs if you need xfs)
>> after all packages are installed, I start to compile according to the doc:
>>
>> ./autogen.sh
>> ./configure
>> make -j8
>>
>> And install following
>> http://ceph.com/docs/master/install/install-storage-cluster/#installing-a-build
>>
>> sudo make install
>>
>> everything seems fine, but I found ceph_common.sh had been put into
>> '/usr/local/lib/ceph', and some tools were put into
>> /usr/local/usr/local/sbin/ (ceph-disk* and ceph-create-keys). I used
>> to use ceph-disk to prepare the disks on other deployments (on
>> other machines with Emperor), but I can't do that now (maybe the
>> path is the reason), so I chose to do everything manually.
>>
>> I follow the doc
>> http://ceph.com/docs/master/install/manual-deployment/ to deploy the
>> cluster many times, but it turns out different this time.
>> /etc/ceph isn't there, therefore I sudo mkdir /etc/ceph
>> Put a ceph.conf into /etc/ceph
>> Generate all required keys in /etc/ceph instead of /tmp/ to keep them
>>
>> ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n
>> mon. --cap mon 'allow *'
>> ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
>> --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd
>> 'allow *' --cap mds 'allow'
>> ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring
>> /etc/ceph/ceph.client.admin.keyring
>>
>> Generate monmap with monmaptool
>>
>> monmaptool --create --add storage01 192.168.11.1 --fsid
>> 9f8fffe3-040d-4641-b35a-ffa90241f723 /etc/ceph/monmap
>>
>> /var/lib/ceph is not there either
>>
>> sudo mkdir -p /var/lib/ceph/mon/ceph-storage01
>> sudo ceph-mon --mkfs -i storage01 --monmap /etc/ceph/monmap --keyring
>> /etc/ceph/ceph.mon.keyring
>>
>> log directory are not there, so I create it manually:
>>
>> sudo mkdir /var/log/ceph
>>
>> since service doesn't work, I start mon daemon manually:
>>
>> sudo /usr/local/bin/ceph-mon -i storage01
>>
>> and ceph -s looks like these:
>> storage@storage01:~/ceph$ ceph -s
>>health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
>>monmap e1: 1 mons at {storage01=192.168.11.1:6789/0}, election
>> epoch 2, quorum 0 storage01
>>osdmap e1: 0 osds: 0 up, 0 in
>> pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB 
>> avail
>>mdsmap e1: 0/0/1 up
>>
>> And I add disks as osd by following manual commands:
>> sudo mkfs -t xfs -f /dev/sdb
>> sudo mkdir /var/lib/ceph/osd/ceph-1
>> sudo mount /dev/sdb /var/lib/ceph/osd/ceph-1/
>> sudo ceph-osd -i 1 --mkfs --mkkey
>> ceph osd create
>> ceph osd crush add osd.1 1.0 host=storage01
>> sudo ceph-osd -i 1
>>
>> for 10 times, and I got:
>> storage@storage01:~/ceph$ ceph osd tree
>>
>> # id  weight  type name       up/down reweight
>> -2  10  host storage01
>> 0   1   osd.0   up  1
>> 1   1   osd.1   up  1
>> 2   1   osd.2   up  1
>> 3   1   osd.3   up  1
>> 4   1   osd.4   up  1
>> 5   1   osd.5   up  1
>> 6   1   osd.6   up  1
>> 7   1   osd.7   up  1
>> 8   1   osd.8   up  1
>> 9   1   osd.9   up  1
>> -1  0   root default
>>
>> and
>>
>> storage@storage01:~/ceph$ ceph -s
>>

[ceph-users] Starting Ceph OSD

2014-08-20 Thread Pons
 

Hi All,
Two of our OSDs are reported as down by the ceph osd tree
command. We tried starting them using the following commands, but ceph osd
tree still reports them as down. Please see below for the commands
used.

command: sudo start ceph-osd id=osd.0
output:  ceph-osd (ceph/osd.0) stop/pre-start, process 3831

ceph osd tree output:
# id  weight  type name          up/down  reweight
-1    5.13    root default
-2    1.71        host ceph-node1
0     0.8             osd.0      down     0
2     0.91            osd.2      down     0
-3    1.71        host ceph-node2

command: sudo start ceph-osd id=0
output:  ceph-osd (ceph/0) start/running, process 3887

ceph osd tree output:
# id  weight  type name          up/down  reweight
-1    5.13    root default
-2    1.71        host ceph-node1
0     0.8             osd.0      down     0
2     0.91            osd.2      down     0
-3    1.71        host ceph-node2

command: sudo start ceph-osd id=0
output:  ceph-osd (ceph/0) start/running, process 4348

ceph osd tree output:
# id  weight  type name          up/down  reweight
-1    5.22    root default
-2    1.8         host ceph-node1
0     0.8             osd.0      down     0
2     0.91            osd.2      down     0

Is there any other way to start an OSD? I'm out
of ideas. What we do is execute the ceph-deploy activate command to
bring an OSD up. Is that the right way to do it? We are using ceph
version 0.80.4

Thanks!

Regards,
Pons

 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW problems

2014-08-20 Thread Marco Garcês
Hello,

Yehuda, I know I was using the correct fastcgi module; it was the one from the
Ceph repositories. I had also disabled all other modules in Apache.

I tried to create a second swift user, using the provided instructions,
only to get the following:

# radosgw-admin user create --uid=marcogarces --display-name="Marco Garces"
# radosgw-admin subuser create --uid=marcogarces
--subuser=marcogarces:swift --access=full
# radosgw-admin key create --subuser=marcogarces:swift --key-type=swift
--gen-secret
could not create key: unable to add access key, unable to store user info
2014-08-20 13:19:33.664945 7f925b130880  0 WARNING: can't store user info,
swift id () already mapped to another user (marcogarces)


So I have created another user, some other way:

# radosgw-admin user create --subuser=testuser:swift --display-name="Test
User One" --key-type=swift --access=full
{ "user_id": "testuser",
  "display_name": "Test User One",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [],
  "swift_keys": [
{ "user": "testuser:swift",
  "secret_key": "MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "user_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "temp_url_keys": []}


Now, when I do, from the client:

swift -V 1 -A http://gateway.bcitestes.local/auth -U testuser:swift -K
MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
   Account: v1
Containers: 0
   Objects: 0
 Bytes: 0
Server: Tengine/2.0.3
Connection: keep-alive
X-Account-Bytes-Used-Actual: 0
  Content-Type: text/plain; charset=utf-8


If I try using https, I still have errors:

swift --insecure -V 1 -A https://gateway.bcitestes.local/auth -U
testuser:swift -K MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
Account HEAD failed: http://gateway.bcitestes.local:443/swift/v1 400 Bad
Request


And I could not validate this account using a Swift client (Cyberduck);
also, there are no S3 credentials!
How can I create a user with both S3 and Swift credentials that is valid
over http/https and on all clients (command line and GUI)? The
first user works great with the S3 credentials in all scenarios.

Thank you,
Marco Garcês

On Tue, Aug 19, 2014 at 7:59 PM, Yehuda Sadeh  wrote:

> On Tue, Aug 19, 2014 at 5:32 AM, Marco Garcês  wrote:
> >
> > UPDATE:
> >
> > I have installed Tengine (nginx fork) and configured both HTTP and HTTPS
> to use radosgw socket.
>
> Looking back at this thread, and considering this solution it seems to
> me that you were running the wrong apache fastcgi module.
>
> >
> > I can login with S3, create buckets and upload objects.
> >
> > It's still not possible to use Swift credentials, can you help me on
> this part? What do I use when I login (url, username, password) ?
> > Here is the info for the user:
> >
> > radosgw-admin user info --uid=mgarces
> > { "user_id": "mgarces",
> >   "display_name": "Marco Garces",
> >   "email": "marco.gar...@bci.co.mz",
> >   "suspended": 0,
> >   "max_buckets": 1000,
> >   "auid": 0,
> >   "subusers": [
> > { "id": "mgarces:swift",
> >   "permissions": "full-control"}],
> >   "keys": [
> > { "user": "mgarces:swift",
> >   "access_key": "AJW2BCBXHFJ1DPXT112O",
> >   "secret_key": ""},
> > { "user": "mgarces",
> >   "access_key": "S88Y6ZJRACZG49JFPY83",
> >   "secret_key": "PlubMMjfQecJ5Py46e2kZz5VuUgHgsjLmYZDRdFg"}],
> >   "swift_keys": [
> > { "user": "mgarces:swift",
> >   "secret_key": "TtKWhY67ujhjn36\/nhv44A2BVPw5wDi3Sp13YrMM"}],
> >   "caps": [],
> >   "op_mask": "read, write, delete",
> >   "default_placement": "",
> >   "placement_tags": [],
> >   "bucket_quota": { "enabled": false,
> >   "max_size_kb": -1,
> >   "max_objects": -1},
> >   "user_quota": { "enabled": false,
> >   "max_size_kb": -1,
> >   "max_objects": -1},
> >   "temp_url_keys": []}
> >
>
> You might be hitting issue #8587 (aka #9155). Try creating a second
> swift user, see if it still happens.
>
> Yehuda
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

   Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)

   The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

   We have a relatively small number of concurrent users (typically
4-6 at most), who use GUI tools to examine their data, and then
complex sets of MATLAB scripts to process it, with processing often
being distributed across all the machines using Condor.

   It's not unusual to see the analysis scripts write out large
numbers (thousands, possibly tens or hundreds of thousands) of small
files, often from many client machines at once in parallel. When this
happens, the ceph cluster becomes almost completely unresponsive for
tens of seconds (or even for minutes) at a time, until the writes are
flushed through the system. Given the nature of modern GUI desktop
environments (often reading and writing small state files in the
user's home directory), this means that desktop interactiveness and
responsiveness for all the other users of the cluster suffer.

   1-minute load on the servers typically peaks at about 8 during
these events (on 4-core machines). Load on the clients also peaks
high, because of the number of processes waiting for a response from
the FS. The MDS shows little sign of stress -- it seems to be entirely
down to the OSDs. ceph -w shows requests blocked for more than 10
seconds, and in bad cases, ceph -s shows up to many hundreds of
requests blocked for more than 32s.

   We've had to turn off scrubbing and deep scrubbing completely --
except between 01.00 and 04.00 every night -- because it triggers the
exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
up to 7 PGs being scrubbed, as it did on Monday, it's completely
unusable.

   Is this problem something that's often seen? If so, what are the
best options for mitigation or elimination of the problem? I've found
a few references to issue #6278 [1], but that seems to be referencing
scrub specifically, not ordinary (if possibly pathological) writes.

   What are the sorts of things I should be looking at to work out
where the bottleneck(s) are? I'm a bit lost about how to drill down
into the ceph system for identifying performance issues. Is there a
useful guide to tools somewhere?

   Is an upgrade to 0.84 likely to be helpful? How "development" are
the development releases, from a stability / dangerous bugs point of
view?

   Thanks,
   Hugo.

[1] http://tracker.ceph.com/issues/6278

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Dan Van Der Ster
Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
Are your MDSs using a lot of CPU? did you try tuning anything in the MDS (I 
think the default config is still conservative, and there are options to cache 
more entries, etc…)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is 
the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills  wrote:

>   We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
> 
>   Specifications:
> 
> ceph 0.80.5
> 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> 2 machines running primary and standby MDS
> 3 monitors on the same machines as the OSDs
> Infiniband to about 8 CephFS clients (headless, in the machine room)
> Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
>   machines, in the analysis lab)
> 
>   The cluster stores home directories of the users and a larger area
> of scientific data (approx 15 TB) which is being processed and
> analysed by the users of the cluster.
> 
>   We have a relatively small number of concurrent users (typically
> 4-6 at most), who use GUI tools to examine their data, and then
> complex sets of MATLAB scripts to process it, with processing often
> being distributed across all the machines using Condor.
> 
>   It's not unusual to see the analysis scripts write out large
> numbers (thousands, possibly tens or hundreds of thousands) of small
> files, often from many client machines at once in parallel. When this
> happens, the ceph cluster becomes almost completely unresponsive for
> tens of seconds (or even for minutes) at a time, until the writes are
> flushed through the system. Given the nature of modern GUI desktop
> environments (often reading and writing small state files in the
> user's home directory), this means that desktop interactiveness and
> responsiveness for all the other users of the cluster suffer.
> 
>   1-minute load on the servers typically peaks at about 8 during
> these events (on 4-core machines). Load on the clients also peaks
> high, because of the number of processes waiting for a response from
> the FS. The MDS shows little sign of stress -- it seems to be entirely
> down to the OSDs. ceph -w shows requests blocked for more than 10
> seconds, and in bad cases, ceph -s shows up to many hundreds of
> requests blocked for more than 32s.
> 
>   We've had to turn off scrubbing and deep scrubbing completely --
> except between 01.00 and 04.00 every night -- because it triggers the
> exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> up to 7 PGs being scrubbed, as it did on Monday, it's completely
> unusable.
> 
>   Is this problem something that's often seen? If so, what are the
> best options for mitigation or elimination of the problem? I've found
> a few references to issue #6278 [1], but that seems to be referencing
> scrub specifically, not ordinary (if possibly pathological) writes.
> 
>   What are the sorts of things I should be looking at to work out
> where the bottleneck(s) are? I'm a bit lost about how to drill down
> into the ceph system for identifying performance issues. Is there a
> useful guide to tools somewhere?
> 
>   Is an upgrade to 0.84 likely to be helpful? How "development" are
> the development releases, from a stability / dangerous bugs point of
> view?
> 
>   Thanks,
>   Hugo.
> 
> [1] http://tracker.ceph.com/issues/6278
> 
> -- 
> Hugo Mills :: IT Services, University of Reading
> Specialist Engineer, Research Servers
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Dan Van Der Ster
Hi,

On 20 Aug 2014, at 16:55, German Anders 
mailto:gand...@despegar.com>> wrote:

Hi Dan,

  How are you? I want to know how you disabled the indexing of the
/var/lib/ceph directory on the OSDs?


# grep ceph /etc/updatedb.conf
PRUNEPATHS = "/afs /media /net /sfs /tmp /udev /var/cache/ccache 
/var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph"



Did you disable deep scrub on you OSDs?


No but this can be an issue. If you get many PGs scrubbing at once, performance 
will suffer.

There is a new feature in 0.67.10 to sleep between scrubbing “chunks”. I set 
that sleep to 0.1 (and the chunk_max to 5, and the scrub size to 1MB). In 
0.67.10+1 there are some new options to set the iopriority of the scrubbing 
threads. Set that to class = 3, priority = 0 to give the scrubbing thread the 
idle priority. You need to use the cfq disk scheduler for io priorities to 
work. (cfq will also help if updatedb is causing any problems, since it runs 
with ionice -c 3).

I’m pretty sure those features will come in 0.80.6 as well.
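
A rough ceph.conf sketch of those settings (the option names are my best mapping for those releases; verify them against your version's defaults before applying):

    [osd]
        osd scrub sleep = 0.1                  # pause between scrub chunks
        osd scrub chunk max = 5                # max objects scrubbed per chunk
        osd deep scrub stride = 1048576        # read size used by deep scrub (1 MB)
        osd disk thread ioprio class = idle    # "idle" is ioprio class 3
        osd disk thread ioprio priority = 0    # only honoured with the cfq scheduler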

Do you have the journals on SSD's or RAMDISK?


Never use RAMDISK.

We currently have the journals on the same spinning disk as the OSD, but the 
iops performance is low for the rbd and fs use-cases. (For object store it 
should be OK). But for rbd or fs, you really need journals on SSDs or your 
cluster will suffer.

We now have SSDs on order to augment our cluster. (The way I justified this is 
that our cluster has X TB of storage capacity and Y iops capacity. With disk 
journals we will run out of iops capacity well before we run out of storage 
capacity. So you can either increase the iops capacity substantially by 
decreasing the volume of the cluster by 20% and replacing those disks with SSD 
journals, or you can just leave 50% of the disk capacity empty since you can’t 
use it anyway).


What's the performance of your cluster? rados bench? fio? I've set up a new cluster 
and I want to know which setup would be the best option to go with.

It’s not really meaningful to compare performance of different clusters with 
different hardware. Some “constants” I can advise:
  - with few clients, large write throughput is limited by the clients 
bandwidth, as long as you have enough OSDs and the client is striping over many 
objects.
  - with disk journals, small write latency will be ~30-50ms even when the 
cluster is idle. if you have SSD journals, maybe ~10ms.
  - count your iops. Each disk OSD can do ~100, and you need to divide by the 
number of replicas. With SSDs you can do a bit better than this since the 
synchronous writes go to the SSDs not the disks. In my tests with our hardware 
I estimate that going from disk to SSD journal will multiply the iops capacity 
by around 5x.
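
To make that arithmetic concrete with a back-of-the-envelope sketch (assuming, purely for illustration, 18 spinning OSDs, 3 replicas and journals colocated on the same disks):

    18 OSDs x ~100 IOPS each        = ~1800 raw write IOPS
    / 3 replicas                    = ~600 client write IOPS
    / 2 for the colocated journal   = ~300 client write IOPS, before reads, scrubbing or recovery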

I also found that I needed to increase some of the journal max write and journal 
queue max limits, also the filestore limits, to squeeze the best performance 
out of the SSD journals. Try increasing filestore queue max ops/bytes, 
filestore queue committing max ops/bytes, and the filestore wbthrottle xfs * 
options. (I’m not going to publish exact configs here because I haven’t 
finished tuning yet).

Cheers, Dan


Thanks a lot!!

Best regards,

German Anders


On Wednesday 20/08/2014 at 11:51, Dan Van Der Ster wrote:
Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
Are your MDSs using a lot of CPU? did you try tuning anything in the MDS (I 
think the default config is still conservative, and there are options to cache 
more entries, etc…)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is 
the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills 
mailto:h.r.mi...@reading.ac.uk>> wrote:

We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
machines, in the analysis lab)

The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

We have a relatively small number of concurrent users

Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   Hi, Dan,

   Some questions below I can't answer immediately, but I'll spend
tomorrow morning irritating people by triggering these events (I think
I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
files in it) and giving you more details. For the ones I can answer
right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
> Do you get slow requests during the slowness incidents?

   Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 
kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; 
oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, 
received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 
10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, 
received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 
1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, 
received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 
1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, 
received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 
1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, 
received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 
1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; 
oldest blocked for > 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, 
received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 
100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, 
received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 
100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 
B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; 
oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, 
received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 
10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, 
received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 
10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, 
received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 
10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 12231 
B/s rd, 5534 kB/s wr, 370 op/s
2014-08-20 15:51:26.925996 mon.1 [INF] pgmap v2287929: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 26498 
B/s rd, 8121 kB/s wr, 367 op/s
2014-08-20 15:51:27.933424 mon.1 [INF] pgmap v2287930: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 706 kB/s 
rd, 7552 kB/s wr, 444 op/s

> What about monitor elections?

   No, that's been reporting "monmap e3" and "election epoch 130" for
a week or two. I assume that to mean we've had no elections. We're
actually running without one monitor at the moment, because one
machine is down, but we've had the same problems with the machine
present.

> Are your MDSs using a lot of CPU?

   No, they're showing load averages well under 1 the whole time. Peak
load average is about 0.6.

> did you try tuning anything in the MDS (I think the default config
> is still conservative, and there are options to cache more entries,
> etc…)

   Not much. We

Re: [ceph-users] mds isn't working anymore after osd's running full

2014-08-20 Thread Gregory Farnum
After restarting your MDS, it still says it has epoch 1832 and needs
epoch 1833? I think you didn't really restart it.
If the epoch numbers have changed, can you restart it with "debug mds
= 20", "debug objecter = 20", "debug ms = 1" in the ceph.conf and post
the resulting log file somewhere?
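Something like this in the [mds] section of ceph.conf on the MDS host ([global] works too), before the restart:

    [mds]
        debug mds = 20
        debug objecter = 20
        debug ms = 1

The resulting log will normally end up under /var/log/ceph/ on that host.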
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero
 wrote:
> Unfortunately that doesn't help. I restarted both the active and standby mds 
> but that doesn't change the state of the mds. Is there a way to force the mds 
> to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 
> 1833, have 1832)?
>
> Thanks,
>
> Jasper
> 
> From: Gregory Farnum [g...@inktank.com]
> Sent: Tuesday, 19 August 2014 19:49
> To: Jasper Siero
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>
> On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
>  wrote:
>> Hi all,
>>
>> We have a small ceph cluster running version 0.80.1 with cephfs on five
>> nodes.
>> Last week some osd's were full and shut themselves down. To help the osd's start
>> again I added some extra osd's and moved some placement group directories on
>> the full osd's (which have a copy on another osd) to another place on the
>> node (as mentioned in
>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/)
>> After clearing some space on the full osd's I started them again. After a
>> lot of deep scrubbing and two pg inconsistencies which needed to be repaired
>> everything looked fine except the mds which still is in the replay state and
>> it stays that way.
>> The log below says that mds need osdmap epoch 1833 and have 1832.
>>
>> 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
>> 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now
>> mds.0.25
>> 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state
>> change up:standby --> up:replay
>> 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
>> 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25  recovery set is
>> 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833,
>> have 1832
>> 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833
>> (which blacklists prior instance)
>>
>>  # ceph status
>> cluster c78209f5-55ea-4c70-8968-2231d2b05560
>>  health HEALTH_WARN mds cluster is degraded
>>  monmap e3: 3 mons at
>> {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0},
>> election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>>  mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>>  osdmap e1951: 12 osds: 12 up, 12 in
>>   pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>> 124 GB used, 175 GB / 299 GB avail
>>  492 active+clean
>>
>> # ceph osd tree
>> # id   weight    type name          up/down  reweight
>> -1     0.2399    root default
>> -2     0.05997       host th1-osd001
>> 0      0.01999           osd.0      up       1
>> 1      0.01999           osd.1      up       1
>> 2      0.01999           osd.2      up       1
>> -3     0.05997       host th1-osd002
>> 3      0.01999           osd.3      up       1
>> 4      0.01999           osd.4      up       1
>> 5      0.01999           osd.5      up       1
>> -4     0.05997       host th1-mon003
>> 6      0.01999           osd.6      up       1
>> 7      0.01999           osd.7      up       1
>> 8      0.01999           osd.8      up       1
>> -5     0.05997       host th1-mon002
>> 9      0.01999           osd.9      up       1
>> 10     0.01999           osd.10     up       1
>> 11     0.01999           osd.11     up       1
>>
>> What is the way to get the mds up and running again?
>>
>> I still have all the placement group directories which I moved from the full
>> osds which where down to create disk space.
>
> Try just restarting the MDS daemon. This sounds a little familiar so I
> think it's a known bug which may be fixed in a later dev or point
> release on the MDS, but it's a soft-state rather than a disk state
> issue.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best Practice to Copy/Move Data Across Clusters

2014-08-20 Thread Larry Liu
Hi guys,

Has anyone done copying/moving of data between clusters? If yes, what are the best
practices for you?

Thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best Practice to Copy/Move Data Across Clusters

2014-08-20 Thread Brian Rak
We do it with rbd volumes.  We're using rbd export/import and netcat to 
transfer it across clusters.  This was the most efficient solution that 
did not require one cluster to have access to the other clusters (though 
it does require some way of starting the process on the different machines).
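
A minimal sketch of that pipeline (pool/image names, host and port are placeholders; plain netcat moves the data unencrypted and unauthenticated):

    # on the destination cluster: listen and import the stream into a new image
    nc -l 4444 | rbd import - backups/vm-disk-1
    # on the source cluster: export the image to stdout and stream it across
    rbd export rbd/vm-disk-1 - | nc dest-host 4444

(Some netcat variants want "nc -l -p 4444", and stdin/stdout support in rbd import/export depends on your rbd version.)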




On 8/20/2014 12:49 PM, Larry Liu wrote:

Hi guys,

Has anyone done copying/moving of data between clusters? If yes, what are the best
practices for you?

Thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-20 Thread Craig Lewis
Looks like I need to upgrade to Firefly to get ceph-kvstore-tool
before I can proceed.
I am getting some hits just from grepping the LevelDB store, but so
far nothing has panned out.

Thanks for the help!

On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum  wrote:
> It's been a while since I worked on this, but let's see what I remember...
>
> On Thu, Aug 14, 2014 at 11:34 AM, Craig Lewis  
> wrote:
>> In my effort to learn more of the details of Ceph, I'm trying to
>> figure out how to get from an object name in RadosGW, through the
>> layers, down to the files on disk.
>>
>> clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
>> 2014-08-13 23:0214M  28dde9db15fdcb5a342493bc81f91151
>> s3://cpltest/vmware-freebsd-tools.tar.gz
>>
>> Looking at the .rgw pool's contents tells me that the cpltest bucket
>> is default.73886.55:
>> root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep 
>> cpltest
>> cpltest
>> .bucket.meta.cpltest:default.73886.55
>
> Okay, what you're seeing here are two different types, whose names I'm
> not going to get right:
> 1) The bucket link "cpltest", which maps from the name "cpltest" to a
> "bucket instance". The contents of cpltest, or one of its xattrs, are
> pointing at ".bucket.meta.cpltest:default.73886.55"
> 2) The "bucket instance" .bucket.meta.cpltest:default.73886.55. I
> think this contains the bucket index (list of all objects), etc.
>
>> The rados objects that belong to that bucket are:
>> root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
>> default.73886.55_vmware-freebsd-tools.tar.gz
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4
>
> Okay, so when you ask RGW for the object vmware-freebsd-tools.tar.gz
> from the cpltest bucket, it will look up (or, if we're lucky, have
> cached) the cpltest link, and find out that the "bucket prefix" is
> default.73886.55. It will then try and access the object
> "default.73886.55_vmware-freebsd-tools.tar.gz" (whose construction I
> hope is obvious — bucket instance ID as a prefix, _ as a separator,
> then the object name). This RADOS object is called the "head" for the
> RGW object. In addition to (usually) the beginning bit of data, it
> will also contain some xattrs with things like a "tag" for any extra
> RADOS objects which include data for this RGW object. In this case,
> that tag is "RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ". (This construction is
> how we do atomic overwrites of RGW objects which are larger than a
> single RADOS object, in addition to a few other things.)
>
> I don't think there's any way of mapping from a shadow (tail) object
> name back to its RGW name. but if you look at the rados object xattrs,
> there might (? or might not) be an attr which contains the parent
> object in one form or another. Check that out.
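
A sketch of how to poke at that, using the object names from the listing above (which attrs actually exist on the head vs. the shadow objects is exactly what you'd be checking, so treat user.rgw.manifest below as an example name rather than a promise):

    rados -p .rgw.buckets listxattr default.73886.55_vmware-freebsd-tools.tar.gz
    rados -p .rgw.buckets listxattr default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
    rados -p .rgw.buckets getxattr default.73886.55_vmware-freebsd-tools.tar.gz user.rgw.manifest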
>
> (Or, if you want to check out the source, I think all the relevant
> bits for this are somewhere in the
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>> I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
>> rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
>> bucket only has a single file (and the sum of the sizes matches).
>> With many files, I can't infer the link anymore.
>>
>> How do I look up that link?
>>
>> I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.
>>
>>
>>
>> My real goal is the reverse.  I recently repaired an inconsistent PG.
>> The primary replica had the bad data, so I want to verify that the
>> repaired object is correct.  I have a database that stores the SHA256
>> of every object.  If I can get from the filename on disk back to an S3
>> object, I can verify the file.  If it's bad, I can restore from the
>> replicated zone.
>>
>>
>> Aside from today's task, I think it's really handy to understand these
>> low level details.  I know it's been handy in the past, when I had
>> disk corruption under my PostgreSQL database.  Knowing (and
>> practicing) ahead of time really saved me a lot of downtime then.
>>
>>
>> Thanks for any pointers.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-20 Thread Sage Weil
On Wed, 20 Aug 2014, Craig Lewis wrote:
> Looks like I need to upgrade to Firefly to get ceph-kvstore-tool
> before I can proceed.
> I am getting some hits just from grepping the LevelDB store, but so
> far nothing has panned out.

FWIW if you just need the tool, you can wget the .deb and 'dpkg -x foo.deb 
/tmp/whatever' and grab the binary from there.
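
A minimal sketch of that (the URL and package name are placeholders; pick the right release and architecture, and the tool may live in the ceph or ceph-test package depending on the version):

    wget <url-of-the-firefly-ceph-.deb>
    dpkg -x <the-downloaded-.deb> /tmp/cephpkg
    find /tmp/cephpkg -name 'ceph-kvstore-tool*'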

sage


> 
> Thanks for the help!
> 
> On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum  wrote:
> > It's been a while since I worked on this, but let's see what I remember...
> >
> > On Thu, Aug 14, 2014 at 11:34 AM, Craig Lewis  
> > wrote:
> >> In my effort to learn more of the details of Ceph, I'm trying to
> >> figure out how to get from an object name in RadosGW, through the
> >> layers, down to the files on disk.
> >>
> >> clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
> >> 2014-08-13 23:0214M  28dde9db15fdcb5a342493bc81f91151
> >> s3://cpltest/vmware-freebsd-tools.tar.gz
> >>
> >> Looking at the .rgw pool's contents tells me that the cpltest bucket
> >> is default.73886.55:
> >> root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep 
> >> cpltest
> >> cpltest
> >> .bucket.meta.cpltest:default.73886.55
> >
> > Okay, what you're seeing here are two different types, whose names I'm
> > not going to get right:
> > 1) The bucket link "cpltest", which maps from the name "cpltest" to a
> > "bucket instance". The contents of cpltest, or one of its xattrs, are
> > pointing at ".bucket.meta.cpltest:default.73886.55"
> > 2) The "bucket instance" .bucket.meta.cpltest:default.73886.55. I
> > think this contains the bucket index (list of all objects), etc.
> >
> >> The rados objects that belong to that bucket are:
> >> root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
> >> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
> >> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
> >> default.73886.55_vmware-freebsd-tools.tar.gz
> >> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
> >> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4
> >
> > Okay, so when you ask RGW for the object vmware-freebsd-tools.tar.gz
> > from the cpltest bucket, it will look up (or, if we're lucky, have
> > cached) the cpltest link, and find out that the "bucket prefix" is
> > default.73886.55. It will then try and access the object
> > "default.73886.55_vmware-freebsd-tools.tar.gz" (whose construction I
> > hope is obvious — bucket instance ID as a prefix, _ as a separator,
> > then the object name). This RADOS object is called the "head" for the
> > RGW object. In addition to (usually) the beginning bit of data, it
> > will also contain some xattrs with things like a "tag" for any extra
> > RADOS objects which include data for this RGW object. In this case,
> > that tag is "RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ". (This construction is
> > how we do atomic overwrites of RGW objects which are larger than a
> > single RADOS object, in addition to a few other things.)
> >
> > I don't think there's any way of mapping from a shadow (tail) object
> > name back to its RGW name. but if you look at the rados object xattrs,
> > there might (? or might not) be an attr which contains the parent
> > object in one form or another. Check that out.
> >
> > (Or, if you want to check out the source, I think all the relevant
> > bits for this are somewhere in the
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >> I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
> >> rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
> >> bucket only has a single file (and the sum of the sizes matches).
> >> With many files, I can't infer the link anymore.
> >>
> >> How do I look up that link?
> >>
> >> I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.
> >>
> >>
> >>
> >> My real goal is the reverse.  I recently repaired an inconsistent PG.
> >> The primary replica had the bad data, so I want to verify that the
> >> repaired object is correct.  I have a database that stores the SHA256
> >> of every object.  If I can get from the filename on disk back to an S3
> >> object, I can verify the file.  If it's bad, I can restore from the
> >> replicated zone.
> >>
> >>
> >> Aside from today's task, I think it's really handy to understand these
> >> low level details.  I know it's been handy in the past, when I had
> >> disk corruption under my PostgreSQL database.  Knowing (and
> >> practicing) ahead of time really saved me a lot of downtime then.
> >>
> >>
> >> Thanks for any pointers.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/maj

Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Andrei Mikhailovsky
Hugo,

I would look at setting up a cache pool made of 4-6 ssds to start with. So, if 
you have 6 osd servers, stick at least 1 ssd disk in each server for the cache 
pool. It should greatly reduce the OSDs' stress from writing a large number of 
small files. Your cluster should become more responsive and the end user's 
experience should also improve.

I am planning on doing so in the near future, but according to my friend's 
experience, introducing a cache pool has greatly increased the overall 
performance of the cluster and has removed the performance issues that he was 
having during scrubbing/deep-scrubbing/recovery activities.

The size of your working data set should determine the size of the cache pool, 
but in general it will create a nice speedy buffer between your clients and 
those terribly slow spindles.
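
As a rough sketch of the moving parts (pool names and the SSD CRUSH ruleset are placeholders, and the hit_set/target sizing that a writeback cache also needs is deliberately left out here):

    ceph osd pool create cachepool 512 512 replicated ssd-ruleset
    ceph osd tier add cephfs-data cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay cephfs-data cachepool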

Andrei





- Original Message -
From: "Hugo Mills" 
To: "Dan Van Der Ster" 
Cc: "Ceph Users List" 
Sent: Wednesday, 20 August, 2014 4:54:28 PM
Subject: Re: [ceph-users] Serious performance problems with small file writes

   Hi, Dan,

   Some questions below I can't answer immediately, but I'll spend
tomorrow morning irritating people by triggering these events (I think
I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
files in it) and giving you more details. For the ones I can answer
right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
> Do you get slow requests during the slowness incidents?

   Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 
kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; 
oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, 
received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 
10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, 
received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 
1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, 
received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 
1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, 
received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 
1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, 
received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 
1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; 
oldest blocked for > 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, 
received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 
100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, 
received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 
100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 
B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; 
oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, 
received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 
10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, 
received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 
10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, 
received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 
10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 
active+clean; 1

[ceph-users] MON running 'ceph -w' doesn't see OSD's booting

2014-08-20 Thread Bruce McFarland
I have a cluster with 1 monitor and 3 OSD Servers. Each server has multiple 
OSD's running on it. When I start the OSD using /etc/init.d/ceph start osd.0
I see the expected interaction between the OSD and the monitor authenticating 
keys etc and finally the OSD starts.

Watching the cluster with 'ceph -w' running on the monitor, I never see 
the INFO messages I expect. There isn't a message from osd.0 for the boot event, nor 
the expected INFO messages from osdmap and pgmap for the osd and its PGs 
being added to those maps.  I only see the last time the monitor was booted, when 
it won the monitor election and reported monmap, pgmap, and mdsmap info.

The firewalls are disabled, with SELinux disabled and iptables turned off. All 
hosts can ssh into each other without passwords, and I've verified traffic 
between hosts using tcpdump captures. Any ideas on what I need to add to 
ceph.conf, or what I may have overlooked, would be greatly appreciated.
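
A few quick checks that should show whether the OSD actually registers with 
the monitor (assuming osd.0 and the default cluster name; the last command 
needs the admin socket on the OSD host, if this version supports it):

ceph osd dump | grep '^osd.0 '   # is it marked up/in, and with which public/cluster addresses?
ceph osd tree                    # does the OSD appear under the expected host bucket?
ceph daemon osd.0 status         # the OSD's own view: state and newest osdmap epoch
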
Thanks,
Bruce

[root@ceph0 ceph]# /etc/init.d/ceph restart osd.0
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on ceph0...kill 15676...done
=== osd.0 ===
2014-08-20 17:43:46.456592 7fa51a034700  1 -- :/0 messenger.start
2014-08-20 17:43:46.457363 7fa51a034700  1 -- :/1025971 --> 
209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 
0x7fa51402f9e0 con 0x7fa51402f570
2014-08-20 17:43:46.458229 7fa5189f0700  1 -- 209.243.160.83:0/1025971 learned 
my addr 209.243.160.83:0/1025971
2014-08-20 17:43:46.459664 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 1  mon_map v1  200+0+0 (3445960796 0 0) 
0x7fa508000ab0 con 0x7fa51402f570
2014-08-20 17:43:46.459849 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.460180 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 
0x7fa4fc0012d0 con 0x7fa51402f570
2014-08-20 17:43:46.461341 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
206+0+0 (409581826 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.461514 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 
0x7fa4fc001cf0 con 0x7fa51402f570
2014-08-20 17:43:46.462824 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
393+0+0 (2134012784 0 0) 0x7fa5080011d0 con 0x7fa51402f570
2014-08-20 17:43:46.463011 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7fa51402bbc0 
con 0x7fa51402f570
2014-08-20 17:43:46.463073 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fa4fc0025d0 
con 0x7fa51402f570
2014-08-20 17:43:46.463329 7fa51a034700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7fa514030490 con 0x7fa51402f570
2014-08-20 17:43:46.463363 7fa51a034700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7fa5140309b0 con 0x7fa51402f570
2014-08-20 17:43:46.463564 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 5  mon_map v1  200+0+0 (3445960796 0 0) 
0x7fa508001100 con 0x7fa51402f570
2014-08-20 17:43:46.463639 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0 
(540052875 0 0) 0x7fa5080013e0 con 0x7fa51402f570
2014-08-20 17:43:46.463707 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 7  auth_reply(proto 2 0 (0) Success) v1  
194+0+0 (1040860857 0 0) 0x7fa5080015d0 con 0x7fa51402f570
2014-08-20 17:43:46.468877 7fa51a034700  1 -- 209.243.160.83:0/1025971 --> 
209.243.160.84:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v 
0) v1 -- ?+0 0x7fa514030e20 con 0x7fa51402f570
2014-08-20 17:43:46.469862 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 8  osd_map(554..554 src has 1..554) v3  
59499+0+0 (2180258623 0 0) 0x7fa50800f980 con 0x7fa51402f570
2014-08-20 17:43:46.470428 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 9  mon_subscribe_ack(300s) v1  20+0+0 
(540052875 0 0) 0x7fa50800fc40 con 0x7fa51402f570
2014-08-20 17:43:46.475021 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 10  osd_map(554..554 src has 1..554) v3  
59499+0+0 (2180258623 0 0) 0x7fa508001100 con 0x7fa51402f570
2014-08-20 17:43:46.475081 7fa5135fe700  1 -- 209.243.160.83:0/1025971 <== 
mon.0 209.243.160.84:6789/0 11  mon_subscribe_ack(300s) v1  20+0+0 
(540052875 0 0) 0x7fa508001310 con 0x7fa51402f570
2014-08-20 17:43:46.47755

Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Christian Balzer

Hello,

On Wed, 20 Aug 2014 15:39:11 +0100 Hugo Mills wrote:

>We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
> 
>Specifications:
> 
> ceph 0.80.5
> 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2
> threads) 
> 2 machines running primary and standby MDS
> 3 monitors on the same machines as the OSDs
> Infiniband to about 8 CephFS clients (headless, in the machine room)
> Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
>machines, in the analysis lab)
> 
Please let us know the CPU and memory specs of the OSD nodes as well, and the 
replication factor (I presume 3 if you value that data), plus the PG and PGP 
values for the pool(s) you're using.
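
If it helps, those values can be read straight off the cluster; a quick 
sketch, with the pool name as a placeholder:

ceph osd pool get <pool> size      # replication factor
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num
ceph osd dump | grep 'replicated size'   # size and pg_num for every pool at once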

>The cluster stores home directories of the users and a larger area
> of scientific data (approx 15 TB) which is being processed and
> analysed by the users of the cluster.
> 
>We have a relatively small number of concurrent users (typically
> 4-6 at most), who use GUI tools to examine their data, and then
> complex sets of MATLAB scripts to process it, with processing often
> being distributed across all the machines using Condor.
> 
>It's not unusual to see the analysis scripts write out large
> numbers (thousands, possibly tens or hundreds of thousands) of small
> files, often from many client machines at once in parallel. When this
> happens, the ceph cluster becomes almost completely unresponsive for
> tens of seconds (or even for minutes) at a time, until the writes are
> flushed through the system. Given the nature of modern GUI desktop
> environments (often reading and writing small state files in the
> user's home directory), this means that desktop interactiveness and
> responsiveness for all the other users of the cluster suffer.
> 
>1-minute load on the servers typically peaks at about 8 during
> these events (on 4-core machines). Load on the clients also peaks
> high, because of the number of processes waiting for a response from
> the FS. The MDS shows little sign of stress -- it seems to be entirely
> down to the OSDs. ceph -w shows requests blocked for more than 10
> seconds, and in bad cases, ceph -s shows up to many hundreds of
> requests blocked for more than 32s.
> 
>We've had to turn off scrubbing and deep scrubbing completely --
> except between 01.00 and 04.00 every night -- because it triggers the
> exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> up to 7 PGs being scrubbed, as it did on Monday, it's completely
> unusable.
> 
Note that I know nothing about CephFS, and while there are probably tunables, 
the slow requests you're seeing and the hardware described above definitely 
suggest slow OSDs.
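
To see which OSDs the slow requests actually pile up on, something along these 
lines should work on 0.80.x (the OSD id is a placeholder; the daemon commands 
use the admin socket on that OSD's host):

ceph health detail | grep -i blocked     # how many ops are blocked, and on which OSDs
ceph daemon osd.<id> dump_historic_ops   # slowest recent ops and where they spent their time
ceph daemon osd.<id> dump_ops_in_flight  # what that OSD is working on right now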

Now, with a replication factor of 3, your total sustained cluster performance 
is that of just 6 disks (18 OSDs, each write landing on 3 of them), and 4TB 
drives are never any speed wonders. Minus the latency overheads from the 
network, which should be minimal in your case though.
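
As a rough back-of-the-envelope figure (assuming ~100 sustained write IOPS per 
7200 rpm spinner, which is an assumption on my part): 18 OSDs x ~100 IOPS / 3 
replicas comes to roughly 600 sustained small-write IOPS for the whole 
cluster, before the journals absorb anything.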

Your old NFS setup (cluster?) had twice the spindles, you wrote, so if that 
means 36 disks it was quite a bit faster.

A cluster I'm just building with 3 nodes, 4 journal SSDs and 8 OSD HDDs
per node can do about 7000 write IOPS (4KB), so I would expect yours to be
worse off.

Having the journals on dedicated partitions instead of files on the rootfs 
would not only be faster (though probably not significantly so), but would 
also prevent any potential failures caused by FS corruption.
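
A minimal ceph.conf sketch for that, with hypothetical device names as 
placeholders (not a recommendation for your hardware):

[osd]
    osd journal size = 10240      # in MB; used for file journals, ignored for a raw partition

[osd.0]
    osd journal = /dev/disk/by-partlabel/journal-0   # dedicated SSD partition instead of a file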

The SSD journals will compensate for some spikes of high IOPS, but tens or 
hundreds of thousands of files at once is clearly beyond that.

Putting lots of RAM (relatively cheap these days) into the OSD nodes has
the big benefit that reads of hot objects will not have to go to disk and
thus compete with write IOPS.

>Is this problem something that's often seen? If so, what are the
> best options for mitigation or elimination of the problem? I've found
> a few references to issue #6278 [1], but that seems to be referencing
> scrub specifically, not ordinary (if possibly pathological) writes.
> 
You need to match your cluster to your workload.
Aside from tuning things (which tends to have limited effects), you can
either scale out by adding more servers or scale up by using faster
storage and/or a cache pool.
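
For the cache pool route, 0.80.x already has cache tiering built in; a minimal 
sketch, with the pool names and the byte limit as placeholders and no sizing 
advice implied:

ceph osd tier add <datapool> <cachepool>
ceph osd tier cache-mode <cachepool> writeback
ceph osd tier set-overlay <datapool> <cachepool>
ceph osd pool set <cachepool> hit_set_type bloom
ceph osd pool set <cachepool> target_max_bytes <bytes>   # when flushing/eviction should kick in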

>What are the sorts of things I should be looking at to work out
> where the bottleneck(s) are? I'm a bit lost about how to drill down
> into the ceph system for identifying performance issues. Is there a
> useful guide to tools somewhere?
> 
Reading/scouring this ML can be quite helpful. 

Watch your OSD nodes (all of them!) with iostat or preferably atop (which will 
also show you how your CPUs and network are doing) while running the tests 
below.
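
For example, plain iostat will do if atop isn't installed:

iostat -x 2     # watch %util and await on the OSD data disks and journal devices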

To get a baseline do:
"rados -p <pool> bench 60 write -t 64"
This will test your throughput most of all and, due to the 4MB block size, 
spread the load very evenly amongst the OSDs.
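
For the small-file case, a 4KB variant of the same bench is probably more 
telling, as it stresses IOPS rather than throughput (pool name again a 
placeholder):

rados -p <pool> bench 60 write -t 64 -b 4096
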
During th