[ceph-users] Kernel Module

2013-05-07 Thread Gandalf Corvotempesta
Do I need the kernel module? I'm planning an infrastructure for
CephFS, QEmu and RGW. Will these need the kernel module, or is everything
done in userspace?

If I understood the docs properly, the only case where the kernel module is
needed is when using an RBD block device directly from Linux, like a
mount point. We never use this feature; our whole system is based on
KVM/qemu or RGW/CephFS.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EPEL packages for QEMU-KVM with rbd support?

2013-05-07 Thread Dan van der Ster
Hi Barry,

On Mon, May 6, 2013 at 7:06 PM, Barry O'Rourke  wrote:
> Hi,
>
> I built a modified version of the fc17 package that I picked up from
> koji [1]. That might not be ideal for you as fc17 uses systemd rather
> than init, we use an in-house configuration management system which
> handles service start-up so it's not an issue for us.
>
> I'd be interested to hear how others install qemu on el6 derivatives,
> especially those of you running newer versions.
>

Are you by chance trying to use qemu with OpenStack, RDO OpenStack in
particular?

We've done a naive backport of rbd.c from qemu 1.5 to latest qemu-kvm
0.12.1.2+ in el6 (patch available upon request, but I wouldn't trust
it in production since we may have made a mistake). We then recompiled
libvirt from el6 to force the enabling of rbd support:

[root@xxx rpmbuild]# diff SPECS/libvirt.spec.orig SPECS/libvirt.spec
1671c1671
<%{?_without_storage_rbd} \
---
>--with-storage-rbd \

(without this patch, libvirt only enables rbd support for Fedora releases, not RHEL).

We're at the point where qemu-kvm alone works with rbd, but RDO
OpenStack Cinder and Glance are still failing to attach rbd volumes or
boot from volumes for some unknown reason. We'd be very interested if
someone else is trying/succeeding to achieve the same setup, RDO
OpenStack + RBD.
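
For anyone reproducing this, a quick sanity check that the rebuilt qemu
actually picked up librbd (just a sketch; the pool and image names are
placeholders):

[root@xxx ~]# qemu-img create -f raw rbd:rbd/test-image 1G

If rbd support is missing from the build, qemu-img reports an unknown
protocol instead of creating the image.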
Cheers,
Dan van der Ster
CERN IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH WARN: clock skew detected

2013-05-07 Thread Joao Eduardo Luis

On 05/06/2013 01:07 PM, Michael Lowe wrote:

Um, start it? You must have synchronized clocks in a fault-tolerant system 
(google "Byzantine generals clock"), and the way to do that is ntp; therefore ntp 
is required.


On May 6, 2013, at 1:34 AM, Varun Chandramouli  wrote:


Hi Michael,

Thanks for your response. No, the ntp daemon is not running. Any other 
suggestions?

Regards
Varun



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




The monitors have a low tolerance for clock skew.  It used to be common to hit 
strange behaviours due to unsynchronized clocks, which can manifest 
themselves in so many weird ways that we decided to introduce these 
warning messages whenever the monitors' clocks drift too far apart.


You should run ntpd (or something of the sort) as Michael and others 
have suggested.  Failing to keep the clocks synchronized on a monitor 
cluster will cause all sorts of weirdness.  Keep your clocks 
synchronized, people!
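
For what it's worth, a quick way to check (just a sketch; run on each monitor host):

ceph health detail     # lists which monitors are skewed and by how much
ntpq -p                # confirms ntpd has actually selected a peer and shows the offset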


  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Format of option string for rados_conf_set()

2013-05-07 Thread Guido Winkelmann
Hi,

The API documentation for librados says that, instead of providing command 
line options or a configuration file, the rados object can also be configured 
by manually setting options with rados_conf_set() (or Rados::conf_set() for 
the C++ interface). This takes both the option and value as C-strings, but the 
documentation fails to mention what the string for the option should look like.

If I want to set the address for a monitor, would the option be "[mon.alpha] 
mon addr" or just "mon addr"? If the latter, how do I set multiple monitor 
addresses? Also, how do I set Cephx authentication keys?

I assume that apart from at least one mon address and an authentication key 
(if needed), the rados object will get pretty much all other relevant options 
from the cluster once it has connected. Is this correct?

Guido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Format of option string for rados_conf_set()

2013-05-07 Thread Wido den Hollander

On 05/07/2013 12:08 PM, Guido Winkelmann wrote:

Hi,

The API documentation for librados says that, instead of providing command
line options or a configuration file, the rados object can also be configured
by manually setting options with rados_conf_set() (or Rados::conf_set() for
the C++ interface). This takes both the option and value as C-strings, but the
documentation fails to mention how the string for option should look like.

If I want to set the address for a monitor, would the option be "[mon.alpha]
mon addr" or just "mon addr"? If the latter, how do I set multiple monitor
addresses? Also, how do I set Cephx authentication keys?



The setting is actually "mon_host" and you set it this way:

rados_conf_set(cluster, "mon_host", "127.0.1.1,127.0.1.2,127.0.1.3")

The value for the cephx secret is "key", so you set:

rados_conf_set(cluster, "key", "<your cephx secret>")

Depending on the version, you might want to explicitly enable cephx just to be safe:

rados_conf_set(cluster, "auth_supported", "cephx")

The ID you connect with should be set when creating the cluster handle:

rados_create(&cluster, "admin")
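
Putting those calls together, here is a minimal sketch in C (the monitor
addresses and the key value are placeholders, and error handling is kept short):

#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;

    /* create the handle, connecting as client.admin */
    if (rados_create(&cluster, "admin") < 0) {
        fprintf(stderr, "rados_create failed\n");
        return 1;
    }

    /* plain option names, no "[mon.alpha]"-style section prefix */
    rados_conf_set(cluster, "mon_host", "127.0.1.1,127.0.1.2,127.0.1.3");
    rados_conf_set(cluster, "key", "<your cephx secret>");
    rados_conf_set(cluster, "auth_supported", "cephx");

    if (rados_connect(cluster) < 0) {
        fprintf(stderr, "rados_connect failed\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* ... use the cluster handle: rados_ioctx_create(), etc. ... */

    rados_shutdown(cluster);
    return 0;
}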


I assume that apart from at least one mon address and an authentication key
(if needed), the rados object will get pretty much all other relevant options
from the cluster once it has connected. Is this correct?



It doesn't get any configuration options, but for a client usually just 
the monitors and the cephx data is enough.


Wido


Guido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW High Availability

2013-05-07 Thread Igor Laskovy
I tried to do that and put the gateways behind round-robin DNS, but unfortunately
only one host can serve requests from clients - the second host does not respond
at all. I am not too familiar with apache, and there is nothing helpful in the
standard log files. Maybe this whole HA design is wrong? Has anybody solved HA
for the Rados Gateway endpoint? How?


On Wed, May 1, 2013 at 12:28 PM, Igor Laskovy wrote:

> Hello,
>
> Are there any best practices for making RadosGW highly available?
> For example, is this right way to create two or tree RadosGW (keys for
> ceph-auth, directory and so on) and having for example this is ceph.conf:
>
> [client.radosgw.a]
> host = ceph01
> ...options...
>
> [client.radosgw.b]
> host = ceph02
> ...options...
>
> Will these rgws run simultaneously?
> Will radosgw.b be able to continue serving the load if the ceph01 host goes down?
>
> --
> Igor Laskovy
> facebook.com/igor.laskovy
> studiogrizzly.com
>



-- 
Igor Laskovy
facebook.com/igor.laskovy
studiogrizzly.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 0.61 Cuttlefish released

2013-05-07 Thread Igor Laskovy
Hi,

where can I read more about ceph-disk?


On Tue, May 7, 2013 at 5:51 AM, Sage Weil  wrote:

> Spring has arrived (at least for some of us), and a new stable release of
> Ceph is ready!  Thank you to everyone who has contributed to this release!
>
> Bigger ticket items since v0.56.x "Bobtail":
>
>  * ceph-deploy: our new deployment tool to replace 'mkcephfs'
>  * robust RHEL/CentOS support
>  * ceph-disk: many improvements to support hot-plugging devices via chef
>and ceph-deploy
>  * ceph-disk: dm-crypt support for OSD disks
>  * ceph-disk: 'list' command to see available (and used) disks
>  * rbd: incremental backups
>  * rbd-fuse: access RBD images via fuse
>  * librbd: autodetection of VM flush support to allow safe enablement of
>the writeback cache
>  * osd: improved small write, snap trimming, and overall performance
>  * osd: PG splitting
>  * osd: per-pool quotas (object and byte)
>  * osd: tool for importing, exporting, removing PGs from OSD data store
>  * osd: improved clean-shutdown behavior
>  * osd: noscrub, nodeepscrub options
>  * osd: more robust scrubbing, repair, ENOSPC handling
>  * osd: improved memory usage, log trimming
>  * osd: improved journal corruption detection
>  * ceph: new 'df' command
>  * mon: new storage backend (leveldb)
>  * mon: config-keys service
>  * mon, crush: new commands to manage CRUSH entirely via CLI
>  * mon: avoid marking entire subtrees (e.g., racks) out automatically
>  * rgw: CORS support
>  * rgw: misc API fixes
>  * rgw: ability to listen to fastcgi on a port
>  * sysvinit, upstart: improved support for standardized data locations
>  * mds: backpointers on all data and metadata objects
>  * mds: faster fail-over
>  * mds: many many bug fixes
>  * ceph-fuse: many stability improvements
>
> Notable changes since v0.60:
>
>  * rbd: incremental backups
>  * rbd: only set STRIPINGV2 feature if striping parameters are
>incompatible with old versions
>  * rbd: require allow-shrink for resizing images down
>  * librbd: many bug fixes
>  * rgw: fix object corruption on COPY to self
>  * rgw: new sysvinit script for rpm-based systems
>  * rgw: allow buckets with _
>  * rgw: CORS support
>  * mon: many fixes
>  * mon: improved trimming behavior
>  * mon: fix data conversion/upgrade problem (from bobtail)
>  * mon: ability to tune leveldb
>  * mon: config-keys service to store arbitrary data on monitor
>  * mon: osd crush add|link|unlink|add-bucket ... commands
>  * mon: trigger leveldb compaction on trim
>  * osd: per-rados pool quotas (objects, bytes)
>  * osd: tool to export, import, and delete PGs from an individual OSD data
>store
>  * osd: notify mon on clean shutdown to avoid IO stall
>  * osd: improved detection of corrupted journals
>  * osd: ability to tune leveldb
>  * osd: improve client request throttling
>  * osd, librados: fixes to the LIST_SNAPS operation
>  * osd: improvements to scrub error repair
>  * osd: better prevention of wedging OSDs with ENOSPC
>  * osd: many small fixes
>  * mds: fix xattr handling on root inode
>  * mds: fixed bugs in journal replay
>  * mds: many fixes
>  * librados: clean up snapshot constant definitions
>  * libcephfs: calls to query CRUSH topology (used by Hadoop)
>  * ceph-fuse, libcephfs: misc fixes to mds session management
>  * ceph-fuse: disabled cache invalidation (again) due to potential
>deadlock with kernel
>  * sysvinit: try to start all daemons despite early failures
>  * ceph-disk: new list command
>  * ceph-disk: hotplug fixes for RHEL/CentOS
>  * ceph-disk: fix creation of OSD data partitions on >2TB disks
>  * osd: fix udev rules for RHEL/CentOS systems
>  * fix daemon logging during initial startup
>
> There are a few things to keep in mind when upgrading from Bobtail,
> specifically with the monitor daemons.  Please see the upgrade guide
> and/or the complete release notes.  In short: upgrade all of your monitors
> (more or less) at once.
>
> Cuttlefish is the first Ceph release on our new three-month stable release
> cycle.  We are very pleased to have pulled everything together on schedule
> (well, only a week later than planned).  The next stable release, which
> will be code-named Dumpling, is slated for three months from now
> (beginning of August).
>
> You can download v0.61 Cuttlefish from the usual locations:
>
>  * Git at git://github.com/ceph/ceph.git
>  * Tarball at http://ceph.com/download/ceph-0.61.tar.gz
>  * For Debian/Ubuntu packages, see
> http://ceph.com/docs/master/install/debian
>  * For RPMs, see http://ceph.com/docs/master/install/rpm
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Igor Laskovy
facebook.com/igor.laskovy
studiogrizzly.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH WARN: clock skew detected

2013-05-07 Thread Varun Chandramouli
Hi All,

Thanks for the replies. I started the ntp daemon and the warnings as well
as the crashes seem to have gone. This is the first time I set up a cluster
(of physical machines), and was unaware of the need to synchronize the
clocks. Probably should have googled it more :). Pardon my ignorance.

Thanks Again,
Varun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EPEL packages for QEMU-KVM with rbd support?

2013-05-07 Thread Barry O'Rourke

Hi,

I'm not using OpenStack, I've only really been playing around with Ceph 
on test machines. I'm currently speccing up my production cluster and 
will probably end up running it along with OpenNebula.


Barry

On 07/05/13 10:01, Dan van der Ster wrote:

Hi Barry,

On Mon, May 6, 2013 at 7:06 PM, Barry O'Rourke  wrote:

Hi,

I built a modified version of the fc17 package that I picked up from
koji [1]. That might not be ideal for you as fc17 uses systemd rather
than init, we use an in-house configuration management system which
handles service start-up so it's not an issue for us.

I'd be interested to hear how others install qemu on el6 derivatives,
especially those of you running newer versions.



Are you by chance trying to use qemu with OpenStack, RDO OpenStack in
particular?

We've done a naive backport of rbd.c from qemu 1.5 to latest qemu-kvm
0.12.1.2+ in el6 (patch available upon request, but I wouldn't trust
it in production since we may have made a mistake). We then recompiled
libvirt from el6 to force the enabling of rbd support:

[root@xxx rpmbuild]# diff SPECS/libvirt.spec.orig SPECS/libvirt.spec
1671c1671
<%{?_without_storage_rbd} \
---
>--with-storage-rbd \

(without this patch, libvirt only enables rbd support for Fedora releases, not RHEL).

We're at the point where qemu-kvm alone works with rbd, but RDO
OpenStack Cinder and Glance are still failing to attach rbd volumes or
boot from volumes for some unknown reason. We'd be very interested if
someone else is trying/succeeding to achieve the same setup, RDO
OpenStack + RBD.
Cheers,
Dan van der Ster
CERN IT



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Barry O'Rourke

Hi,

I'm looking to purchase a production cluster of 3 Dell Poweredge R515's 
which I intend to run in 3 x replication. I've opted for the following 
configuration;


2 x 6 core processors
32Gb RAM
H700 controller (1Gb cache)
2 x SAS OS disks (in RAID1)
2 x 1Gb ethernet (bonded for cluster network)
2 x 1Gb ethernet (bonded for client network)

and either 4 x 2Tb nearline SAS OSDs or 8 x 1Tb nearline SAS OSDs.

At the moment I'm undecided on the OSDs, although I'm swaying towards 
the second option at the moment as it would give me more flexibility and 
the option of using some of the disks as journals.


I'm intending to use this cluster to host the images for ~100 virtual 
machines, which will run on different hardware and will most likely be managed by 
OpenNebula.


I'd be interested to hear from anyone running a similar configuration 
with a similar use case, especially people who have spent some time 
benchmarking a similar configuration and still have a copy of the results.


I'd also welcome any comments or critique on the above specification. 
Purchases have to be made via Dell and 10Gb ethernet is out of the 
question at the moment.


Cheers,

Barry


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] scrub error: found clone without head

2013-05-07 Thread Dzianis Kahanovich
I have 4 scrub errors (3 PGs - "found clone without head") on one OSD, and they
are not repairing. How can I repair this without re-creating the OSD?

For now it is "easy" to clean and re-create the OSD, but in theory - in case
multiple OSDs are affected - this could cause data loss.

-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Jens Kristian Søgaard

Hi,

I'd be interested to hear from anyone running a similar configuration 


I'm running a somewhat similar configuration here. I'm wondering why you 
have left out SSDs for the journals?


I gather they would be quite important to achieve a level of performance 
for hosting 100 virtual machines - unless that is not important for you?


Have you considered having more than 3 servers?

If you want to run with a replication count of 3, I imagine that a 
failed server would be problematic. But perhaps it is not important for 
you if you have to live with paused VMs for a while if a server dies?


Do you know what kind of disk access patterns those 100 virtual machines 
will have? (i.e. is it a cluster computing setup with minimal disk 
access or are they running all sorts of general purpose systems?)


Purchases have to be made via Dell and 10Gb ethernet is out of the 
question at the moment.


Why do you consider 10 Gb ethernet to be out of the question?

I looked briefly at Dell's site and the added cost of getting a dual-port 
10 Gb NIC instead of a quad-port gigabit is just $368.


You could opt to buy their 2 TB SATA drives and 10 Gb NIC instead of the 
2 TB SAS drives - and the cost would be the same.


A 10 Gb switch with enough ports for this setup can be had for 920$. As 
you would need double the amount of ports in a 1 Gb setup, the 
difference in cost for the switch would be quite small. Especially 
considering that you would get 5 times the bandwidth adding only a few 
percent to the total costs.


--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best practice for osd_min_down_reporters

2013-05-07 Thread Wido den Hollander

Hi,

I was just upgrading a 9-node, 36-OSD cluster running the next branch 
from a few days ago to the Cuttlefish release.


While rebooting the nodes one by one and waiting for active+clean on 
all PGs, I noticed that some weird things happened.


I reboot a node and see:

"osdmap e580: 36 osds: 4 up, 36 in"

After a few seconds I see all the OSDs reporting:

osd.33 [WRN] map e582 wrongly marked me down
osd.5 [WRN] map e582 wrongly marked me down
osd.6 [WRN] map e582 wrongly marked me down

I didn't check what was happening here, but it seems like the 4 OSDs that 
were shutting down reported everybody but themselves out (I should have 
printed ceph osd tree).


Thinking about that, there is the following configuration option:

OPTION(osd_min_down_reporters, OPT_INT, 1)
OPTION(osd_min_down_reports, OPT_INT, 3)

So if just one OSD sends 3 reports it can mark anybody in the cluster 
down, right?


Shouldn't the best practice be to set osd_min_down_reporters to at least 
numosdperhost+1?


In this case I have 4 OSDs per host, so shouldn't I use 5 here?

This might as well be a bug, but it still doesn't seem right that all 
the OSDs on one machine can mark the whole cluster down.
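
For reference, the override would look something like this in ceph.conf (just a
sketch, assuming 4 OSDs per host as above; the options are read by the monitors
when deciding whether to mark an OSD down):

[global]
    osd min down reporters = 5
    osd min down reports = 3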


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Mike Lowe
FWIW, here is what I have for my ceph cluster:

4 x HP DL 180 G6
12Gb RAM
P411 with 512MB Battery Backed Cache
10GigE
4 HP MSA 60's with 12 x 1TB 7.2k SAS and SATA drives (bought at different times 
so there is a mix)
2 HP D2600 with 12 x 3TB 7.2k SAS Drives

I'm currently running 79 qemu/kvm vm's for Indiana University and xsede.org.

On May 7, 2013, at 7:50 AM, "Barry O'Rourke"  wrote:

> Hi,
> 
> I'm looking to purchase a production cluster of 3 Dell Poweredge R515's which 
> I intend to run in 3 x replication. I've opted for the following 
> configuration;
> 
> 2 x 6 core processors
> 32Gb RAM
> H700 controller (1Gb cache)
> 2 x SAS OS disks (in RAID1)
> 2 x 1Gb ethernet (bonded for cluster network)
> 2 x 1Gb ethernet (bonded for client network)
> 
> and either 4 x 2Tb nearline SAS OSDs or 8 x 1Tb nearline SAS OSDs.
> 
> At the moment I'm undecided on the OSDs, although I'm swaying towards the 
> second option at the moment as it would give me more flexibility and the 
> option of using some of the disks as journals.
> 
> I'm intending to use this cluster to host the images for ~100 virtual 
> machines, which will run on different hardware most likely be managed by 
> OpenNebula.
> 
> I'd be interested to hear from anyone running a similar configuration with a 
> similar use case, especially people who have spent some time benchmarking a 
> similar configuration and still have a copy of the results.
> 
> I'd also welcome any comments or critique on the above specification. 
> Purchases have to be made via Dell and 10Gb ethernet is out of the question 
> at the moment.
> 
> Cheers,
> 
> Barry
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice for osd_min_down_reporters

2013-05-07 Thread Andrey Korolyov
Hi Wido,

I experienced the same problem almost half a year ago and finally
set this value to 3 - no more wrong marks were given, except under extremely
high disk load when an OSD really did go down for a couple of seconds.

On Tue, May 7, 2013 at 4:59 PM, Wido den Hollander  wrote:
> Hi,
>
> I was just upgrading a 9 nodes, 36 OSD cluster running the next branch from
> some days ago to the Cuttlefish release.
>
> While rebooting the nodes one by one and waiting for a active+clean for all
> PGs I noticed that some weird things happened.
>
> I reboot a node and see:
>
> "osdmap e580: 36 osds: 4 up, 36 in"
>
> After a few seconds I see all the OSDs reporting:
>
> osd.33 [WRN] map e582 wrongly marked me down
> osd.5 [WRN] map e582 wrongly marked me down
> osd.6 [WRN] map e582 wrongly marked me down
>
> I didn't check what was happening here, but it seems like the 4 OSDs who
> were shutting down reported everybody but themselves out (Should have
> printed ceph osd tree).
>
> Thinking about that, there is the following configuration option:
>
> OPTION(osd_min_down_reporters, OPT_INT, 1)
> OPTION(osd_min_down_reports, OPT_INT, 3)
>
> So if just one OSD sends 3 reports it can mark anybody in the cluster down,
> right?
>
> Shouldn't the best practice be to set osd_min_down_reporters to at least
> numosdperhost+1
>
> In this case I have 4 OSDs per host, so shouldn't I use 5 here?
>
> This might as well be a bug, but it still doesn't seem right that all the
> OSDs on one machine can mark the whole cluster down.
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Barry O'Rourke

Hi,


I'm running a somewhat similar configuration here. I'm wondering why you
have left out SSDs for the journals?


I can't go into exact prices due to our NDA, but I can say that getting 
a couple of decent SSD disks from Dell will increase the cost per server 
by a four-figure sum, and we're on a limited budget. Dell do offer a 
"budget" range of SSDs with a limited warranty; I'm not too sure how much 
a "budget" SSD can be trusted.



I gather they would be quite important to achieve a level of performance
for hosting 100 virtual machines - unless that is not important for you?


> Do you know what kind of disk access patterns those 100 virtual
> machines will have? (i.e. is it a cluster computing setup with
> minimal disk access or are they running all sorts of general purpose 
> systems?)


The majority of our virtual machines are web servers and subversion 
repositories with quite a low amount of traffic, so I don't imagine the 
disk I/O being that high.



Have you considered having more than 3 servers?

If you want to run with a replication count of 3, I imagine that a
failed server would be problematic. But perhaps it is not important for
you if you have to live with paused VMs for a while if a server dies?


We have three server rooms, which is why I decided to go for three with 
3 x replication. I don't think I could squeeze any more than that into 
my budget either.



Why do you consider 10 Gb ethernet to be out of the question?


I was told that it is out of the question at the moment, I'll need to 
mention it again.


Thanks,

Barry

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Mark Nelson

On 05/07/2013 06:50 AM, Barry O'Rourke wrote:

Hi,

I'm looking to purchase a production cluster of 3 Dell Poweredge R515's
which I intend to run in 3 x replication. I've opted for the following
configuration;

2 x 6 core processors
32Gb RAM
H700 controller (1Gb cache)
2 x SAS OS disks (in RAID1)
2 x 1Gb ethernet (bonded for cluster network)
2 x 1Gb ethernet (bonded for client network)

and either 4 x 2Tb nearline SAS OSDs or 8 x 1Tb nearline SAS OSDs.


Hi Barry,

With so few disks and the inability to do 10GbE, you may want to 
consider doing something like 5-6 R410s or R415s and just using the 
on-board controller with a couple of SATA disks and 1 SSD for the 
journal.  That should give you better aggregate performance since in 
your case you can't use 10GbE.  It will also spread your OSDs across 
more hosts for better redundancy and may not cost that much more per GB 
since you won't need to use the H700 card if you are using an SSD for 
journals.  It's not as dense as R515s or R720XDs can be when fully 
loaded, but for small clusters with few disks I think it's a good 
trade-off to get the added redundancy and avoid expander/controller 
complications.




At the moment I'm undecided on the OSDs, although I'm swaying towards
the second option at the moment as it would give me more flexibility and
the option of using some of the disks as journals.

I'm intending to use this cluster to host the images for ~100 virtual
machines, which will run on different hardware most likely be managed by
OpenNebula.

I'd be interested to hear from anyone running a similar configuration
with a similar use case, especially people who have spent some time
benchmarking a similar configuration and still have a copy of the results.

I'd also welcome any comments or critique on the above specification.
Purchases have to be made via Dell and 10Gb ethernet is out of the
question at the moment.

Cheers,

Barry




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH WARN: clock skew detected

2013-05-07 Thread Mike Lowe
You've learned one of the three computer science facts you need to know about 
distributed systems, and I'm glad I could pass something on:

1. Consistent, Available, Distributed - pick any two
2. To completely guard against k failures where you don't know which one failed 
just by looking you need 2k+1 redundant copies
3. Fault tolerant systems must all agree on what time it is

On May 7, 2013, at 6:29 AM, Varun Chandramouli  wrote:

> Hi All,
> 
> Thanks for the replies. I started the ntp daemon and the warnings as well as 
> the crashes seem to have gone. This is the first time I set up a cluster (of 
> physical machines), and was unaware of the need to synchronize the clocks. 
> Probably should have googled it more :). Pardon my ignorance.
> 
> Thanks Again,
> Varun
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH WARN: clock skew detected

2013-05-07 Thread Joao Eduardo Luis

On 05/07/2013 03:20 PM, Mike Lowe wrote:

You've learned one of the three computer science facts you need to know about 
distributed systems, and I'm glad I could pass something on:

1. Consistent, Available, Distributed - pick any two


To some degree of Consistent, Available and Distributed. :-P



2. To completely guard against k failures where you don't know which one failed 
just by looking you need 2k+1 redundant copies
3. Fault tolerant systems must all agree on what time it is

On May 7, 2013, at 6:29 AM, Varun Chandramouli  wrote:


Hi All,

Thanks for the replies. I started the ntp daemon and the warnings as well as 
the crashes seem to have gone. This is the first time I set up a cluster (of 
physical machines), and was unaware of the need to synchronize the clocks. 
Probably should have googled it more :). Pardon my ignorance.

Thanks Again,
Varun



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel Module

2013-05-07 Thread Gregory Farnum
On Tuesday, May 7, 2013, Gandalf Corvotempesta wrote:

> Do I need the kernel module? I'm planning an Infrastructure for
> CephFS, QEmu and RGW. Will these need the kernel module or all is done
> in userspace?
>
> If I understood docs properly, the only case when the kernel module is
> needed is for use of a RBD block device directly from Linux, like a
> mount point. We never use this features, all of our system is based on
> KVM/qemu or RGW/CephFS.


As long as you're planning to use ceph-fuse for your filesystem access, you
don't need anything in the kernel.
-Greg
Software Engineer #42 @ inktank.com | ceph.com


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice for osd_min_down_reporters

2013-05-07 Thread Gregory Farnum
On Tuesday, May 7, 2013, Wido den Hollander wrote:

> Hi,
>
> I was just upgrading a 9 nodes, 36 OSD cluster running the next branch
> from some days ago to the Cuttlefish release.
>
> While rebooting the nodes one by one and waiting for a active+clean for
> all PGs I noticed that some weird things happened.
>
> I reboot a node and see:
>
> "osdmap e580: 36 osds: 4 up, 36 in"
>
> After a few seconds I see all the OSDs reporting:
>
> osd.33 [WRN] map e582 wrongly marked me down
> osd.5 [WRN] map e582 wrongly marked me down
> osd.6 [WRN] map e582 wrongly marked me down
>
> I didn't check what was happening here, but it seems like the 4 OSDs who
> were shutting down reported everybody but themselves out (Should have
> printed ceph osd tree).
>
> Thinking about that, there is the following configuration option:
>
> OPTION(osd_min_down_reporters, OPT_INT, 1)
> OPTION(osd_min_down_reports, OPT_INT, 3)
>
> So if just one OSD sends 3 reports it can mark anybody in the cluster
> down, right?
>
> Shouldn't the best practice be to set osd_min_down_reporters to at least
> numosdperhost+1
>
> In this case I have 4 OSDs per host, so shouldn't I use 5 here?
>
> This might as well be a bug, but it still doesn't seem right that all the
> OSDs on one machine can mark the whole cluster down.


I'm a little surprised that OSDs turning off could have marked anybody down
at all. :/ Do you have any more info?

In any case, yeah, you probably want to increase your "reporters" required.
That value is set at 1 so it works on a 2-node cluster. :)
-Greg


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Dave Spano
Barry, I have a similar setup and found that the 600GB 15K SAS drives work 
well. The 2TB 7200 disks did not work as well due to my not using SSD. Running 
the journal and the data on big slow drives will result in slow writes. All the 
big boys I've encountered are running SSDs. 

Currently, I'm using two 515s with 5 600GB drives in a RAID0x1 configuration on 
each host. I do not currently have a cluster network setup, but my OSDs have 8 
bonded 1gbe nics. 

Even with the fastest drives that you can get for the R515, I am considering 
trying to get an SSD sometime in the near future. This is due to performance 
issues I've run into trying to run an Oracle VM on OpenStack Folsom. To not 
make this sound all doom and gloom: I am also running a website consisting of 
six VMs that does not have as heavy random read/write I/O, and it runs fine. 

Here's a quick performance display with various block sizes on a host with 1 
public 1Gbe link and 1 1Gbe link on the same vlan as the ceph cluster. I'm 
using RBD writeback caching on this VM. To accomplish this, I had to hack the 
libvirt volume.py file in Openstack, and enable it in the /etc/ceph/ceph.conf 
file on the host I was running it on. I know nothing of OpenNebula, so I can't 
speak to what it can or cannot do, and how to enable writeback caching in it. 

The rbd caching settings for ceph.conf can be found here. 
http://ceph.com/docs/master/rbd/rbd-config-ref/ 
Ex. 
;Global Client Setting 
[client] 
rbd cache = true 


4K: 
[root@optog3 temp]# dd if=/dev/zero of=here bs=4k count=50k oflag=direct 
51200+0 records in 
51200+0 records out 
209715200 bytes (210 MB) copied, 7.09731 seconds, 29.5 MB/s 
[root@optog3 temp]# 

8K 
[root@optog3 temp]# dd if=/dev/zero of=here bs=8192 count=50k oflag=direct 
51200+0 records in 
51200+0 records out 
419430400 bytes (419 MB) copied, 7.36243 seconds, 57.0 MB/s 
[root@optog3 temp]# 


4MB blocks. 
[root@optog3 temp]# dd if=/dev/zero of=here bs=4M count=500 oflag=direct 
500+0 records in 
500+0 records out 
2097152000 bytes (2.1 GB) copied, 23.5803 seconds, 88.9 MB/s 
[root@optog3 temp]# 

1GB blocks: 
[root@optog3 temp]# dd if=/dev/zero of=here bs=1G count=1 oflag=direct 
1+0 records in 
1+0 records out 
1073741824 bytes (1.1 GB) copied, 12.0053 seconds, 89.4 MB/s 
[root@optog3 temp]# 



This article by Hastexo, which I wish I had seen before going to 
production, may help you greatly with this decision. 

http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals
 


Dave Spano 
Optogenics 
Systems Administrator 


- Original Message -

From: "Mark Nelson"  
To: ceph-users@lists.ceph.com 
Sent: Tuesday, May 7, 2013 9:17:24 AM 
Subject: Re: [ceph-users] Dell R515 performance and specification question 

On 05/07/2013 06:50 AM, Barry O'Rourke wrote: 
> Hi, 
> 
> I'm looking to purchase a production cluster of 3 Dell Poweredge R515's 
> which I intend to run in 3 x replication. I've opted for the following 
> configuration; 
> 
> 2 x 6 core processors 
> 32Gb RAM 
> H700 controller (1Gb cache) 
> 2 x SAS OS disks (in RAID1) 
> 2 x 1Gb ethernet (bonded for cluster network) 
> 2 x 1Gb ethernet (bonded for client network) 
> 
> and either 4 x 2Tb nearline SAS OSDs or 8 x 1Tb nearline SAS OSDs. 

Hi Barry, 

With so few disks and the inability to do 10GbE, you may want to 
consider doing something like 5-6 R410s or R415s and just using the 
on-board controller with a couple of SATA disks and 1 SSD for the 
journal. That should give you better aggregate performance since in 
your case you can't use 10GbE. It will also spread your OSDs across 
more hosts for better redundancy and may not cost that much more per GB 
since you won't need to use the H700 card if you are using an SSD for 
journals. It's not as dense as R515s or R720XDs can be when fully 
loaded, but for small clusters with few disks I think it's a good 
trade-off to get the added redundancy and avoid expander/controller 
complications. 

> 
> At the moment I'm undecided on the OSDs, although I'm swaying towards 
> the second option at the moment as it would give me more flexibility and 
> the option of using some of the disks as journals. 
> 
> I'm intending to use this cluster to host the images for ~100 virtual 
> machines, which will run on different hardware most likely be managed by 
> OpenNebula. 
> 
> I'd be interested to hear from anyone running a similar configuration 
> with a similar use case, especially people who have spent some time 
> benchmarking a similar configuration and still have a copy of the results. 
> 
> I'd also welcome any comments or critique on the above specification. 
> Purchases have to be made via Dell and 10Gb ethernet is out of the 
> question at the moment. 
> 
> Cheers, 
> 
> Barry 
> 
> 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


Re: [ceph-users] 0.61 Cuttlefish released

2013-05-07 Thread John Wilkins
Igor,

I haven't closed out 3674, because I haven't covered that part yet. Chef
docs are now in the wiki, but I'll be adding ceph-disk docs shortly.


On Tue, May 7, 2013 at 3:25 AM, Igor Laskovy  wrote:

> Hi,
>
> where can I read more about ceph-disk?
>
>
> On Tue, May 7, 2013 at 5:51 AM, Sage Weil  wrote:
>
>> Spring has arrived (at least for some of us), and a new stable release of
>> Ceph is ready!  Thank you to everyone who has contributed to this release!
>>
>> Bigger ticket items since v0.56.x "Bobtail":
>>
>>  * ceph-deploy: our new deployment tool to replace 'mkcephfs'
>>  * robust RHEL/CentOS support
>>  * ceph-disk: many improvements to support hot-plugging devices via chef
>>and ceph-deploy
>>  * ceph-disk: dm-crypt support for OSD disks
>>  * ceph-disk: 'list' command to see available (and used) disks
>>  * rbd: incremental backups
>>  * rbd-fuse: access RBD images via fuse
>>  * librbd: autodetection of VM flush support to allow safe enablement of
>>the writeback cache
>>  * osd: improved small write, snap trimming, and overall performance
>>  * osd: PG splitting
>>  * osd: per-pool quotas (object and byte)
>>  * osd: tool for importing, exporting, removing PGs from OSD data store
>>  * osd: improved clean-shutdown behavior
>>  * osd: noscrub, nodeepscrub options
>>  * osd: more robust scrubbing, repair, ENOSPC handling
>>  * osd: improved memory usage, log trimming
>>  * osd: improved journal corruption detection
>>  * ceph: new 'df' command
>>  * mon: new storage backend (leveldb)
>>  * mon: config-keys service
>>  * mon, crush: new commands to manage CRUSH entirely via CLI
>>  * mon: avoid marking entire subtrees (e.g., racks) out automatically
>>  * rgw: CORS support
>>  * rgw: misc API fixes
>>  * rgw: ability to listen to fastcgi on a port
>>  * sysvinit, upstart: improved support for standardized data locations
>>  * mds: backpointers on all data and metadata objects
>>  * mds: faster fail-over
>>  * mds: many many bug fixes
>>  * ceph-fuse: many stability improvements
>>
>> Notable changes since v0.60:
>>
>>  * rbd: incremental backups
>>  * rbd: only set STRIPINGV2 feature if striping parameters are
>>incompatible with old versions
>>  * rbd: require allow-shrink for resizing images down
>>  * librbd: many bug fixes
>>  * rgw: fix object corruption on COPY to self
>>  * rgw: new sysvinit script for rpm-based systems
>>  * rgw: allow buckets with _
>>  * rgw: CORS support
>>  * mon: many fixes
>>  * mon: improved trimming behavior
>>  * mon: fix data conversion/upgrade problem (from bobtail)
>>  * mon: ability to tune leveldb
>>  * mon: config-keys service to store arbitrary data on monitor
>>  * mon: osd crush add|link|unlink|add-bucket ... commands
>>  * mon: trigger leveldb compaction on trim
>>  * osd: per-rados pool quotas (objects, bytes)
>>  * osd: tool to export, import, and delete PGs from an individual OSD data
>>store
>>  * osd: notify mon on clean shutdown to avoid IO stall
>>  * osd: improved detection of corrupted journals
>>  * osd: ability to tune leveldb
>>  * osd: improve client request throttling
>>  * osd, librados: fixes to the LIST_SNAPS operation
>>  * osd: improvements to scrub error repair
>>  * osd: better prevention of wedging OSDs with ENOSPC
>>  * osd: many small fixes
>>  * mds: fix xattr handling on root inode
>>  * mds: fixed bugs in journal replay
>>  * mds: many fixes
>>  * librados: clean up snapshot constant definitions
>>  * libcephfs: calls to query CRUSH topology (used by Hadoop)
>>  * ceph-fuse, libcephfs: misc fixes to mds session management
>>  * ceph-fuse: disabled cache invalidation (again) due to potential
>>deadlock with kernel
>>  * sysvinit: try to start all daemons despite early failures
>>  * ceph-disk: new list command
>>  * ceph-disk: hotplug fixes for RHEL/CentOS
>>  * ceph-disk: fix creation of OSD data partitions on >2TB disks
>>  * osd: fix udev rules for RHEL/CentOS systems
>>  * fix daemon logging during initial startup
>>
>> There are a few things to keep in mind when upgrading from Bobtail,
>> specifically with the monitor daemons.  Please see the upgrade guide
>> and/or the complete release notes.  In short: upgrade all of your monitors
>> (more or less) at once.
>>
>> Cuttlefish is the first Ceph release on our new three-month stable release
>> cycle.  We are very pleased to have pulled everything together on schedule
>> (well, only a week later than planned).  The next stable release, which
>> will be code-named Dumpling, is slated for three months from now
>> (beginning of August).
>>
>> You can download v0.61 Cuttlefish from the usual locations:
>>
>>  * Git at git://github.com/ceph/ceph.git
>>  * Tarball at http://ceph.com/download/ceph-0.61.tar.gz
>>  * For Debian/Ubuntu packages, see
>> http://ceph.com/docs/master/install/debian
>>  * For RPMs, see http://ceph.com/docs/master/install/rpm
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> h

[ceph-users] OSD crash during script, 0.56.4

2013-05-07 Thread Travis Rhoden
Hey folks,

Saw this crash the other day:

 ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f19d1889cb0]
 3: (gsignal()+0x35) [0x7f19d0248425]
 4: (abort()+0x17b) [0x7f19d024bb8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f19d0b9a69d]
 6: (()+0xb5846) [0x7f19d0b98846]
 7: (()+0xb5873) [0x7f19d0b98873]
 8: (()+0xb596e) [0x7f19d0b9896e]
 9: (operator new[](unsigned long)+0x47e) [0x7f19d102db1e]
 10: (ceph::buffer::create(unsigned int)+0x67) [0x834727]
 11: (ceph::buffer::ptr::ptr(unsigned int)+0x15) [0x834a95]
 12: (FileStore::read(coll_t, hobject_t const&, unsigned long,
unsigned long, ceph::buffer::list&)+0x1ae) [0x6fbdde]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
bool)+0x347) [0x69ac57]
 14: (PG::chunky_scrub()+0x375) [0x69faf5]
 15: (PG::scrub()+0x145) [0x6a0e95]
 16: (OSD::ScrubWQ::_process(PG*)+0xc) [0x6384ec]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8297e6]
 18: (ThreadPool::WorkThread::entry()+0x10) [0x82b610]
 19: (()+0x7e9a) [0x7f19d1881e9a]
 20: (clone()+0x6d) [0x7f19d0305cbd]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Appears to have gone down during a scrub?

I don't see anything interesting in /var/log/syslog or anywhere else
at the same time.  It's actually the second time I've seen this exact
stack trace.  First time was reported here...  (was going to insert
GMane link, but search.gmane.org appears to be down for me).  Well,
for those inclined, the thread was titled "question about mon memory
usage", and was also started by me.

Any thoughts?  I do plan to upgrade to 0.56.6 when I can.  I'm a
little leery of doing it on a production system without a maintenance
window, though.  When I went from 0.56.3 --> 0.56.4 on a live system,
a system using the RBD kernel module kpanic'd.  =)

 - Travis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting CephFS - mount error 5 = Input/output error

2013-05-07 Thread Wyatt Gorman
Here's the result of running ceph-mds -i a -d

ceph-mds -i a -d
2013-05-07 13:33:11.816963 b732a710  0 starting mds.a at :/0
ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c), process
ceph-mds, pid 9900
2013-05-07 13:33:11.824077 b4a1bb70  0 mds.-1.0 ms_handle_connect on
10.81.2.100:6789/0
2013-05-07 13:33:11.825629 b732a710 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2013-05-07 13:33:11.825653 b732a710  1 mds.-1.0 suicide.  wanted down:dne,
now up:boot
2013-05-07 13:33:11.825973 b732a710  0 stopped.

This "ERROR: failed to authenticate: (1) Operation not permitted" indicates
some problem with the authentication, correct? Something about my keyring?
I created a new one with ceph-authtool -C and it still returns that error.


On Mon, May 6, 2013 at 1:53 PM, Jens Kristian Søgaard <
j...@mermaidconsulting.dk> wrote:

> Hi,
>
>
>  how? running ceph-mds just returns the help page, and I'm not sure what
>> arguments to use.
>>
>
> Try running
>
> ceph-mds -i a -d
>
> (if the id of your mds is a)
>
> The -d means to to into the foreground and output debug information.
>
> Normally you would start the mds from the service management system on
> your platform. On my Fedora system it look like this:
>
> service ceph start mds.a
>
>
> --
> Jens Kristian Søgaard, Mermaid Consulting ApS,
> j...@mermaidconsulting.dk,
> http://www.mermaidconsulting.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel Module

2013-05-07 Thread Gandalf Corvotempesta
2013/5/7 Gregory Farnum :
> As long as you're planning to use ceph-fuse for your filesystem access, you
> don't need anything in the kernel.

I will not use ceph-fuse but plain Ceph-fs once it is production ready.
Ceph-fs should not need the kernel module, right?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel Module

2013-05-07 Thread Gregory Farnum
To access CephFS you need to either use the kernel client or a
userspace client. The userspace CephFS client is called ceph-fuse; if
you want to use the kernel's built-in access then obviously you need
it on your machine...
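
For concreteness, the two access paths look something like this (just a sketch;
the monitor address, mount point, and credentials are placeholders):

# kernel client - needs the ceph filesystem support in the kernel:
mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secret=<key>

# userspace client - no kernel support needed:
ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs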
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 7, 2013 at 10:38 AM, Gandalf Corvotempesta
 wrote:
> 2013/5/7 Gregory Farnum :
>> As long as you're planning to use ceph-fuse for your filesystem access, you
>> don't need anything in the kernel.
>
> I will not use ceph-fuse but plain ceph-fs when production ready.
> Ceph-fs should not need kernel module, like ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error: found clone without head

2013-05-07 Thread Dzianis Kahanovich
Dzianis Kahanovich пишет:
> I have 4 scrub errors (3 PGs - "found clone without head"), on one OSD. Not
> repairing. How to repair it exclude re-creating of OSD?
> 
> Now it "easy" to clean+create OSD, but in theory - in case there are multiple
> OSDs - it may cause data lost.

OOPS! After re-creating OSD what I think failed:

2013-05-07 20:41:55.203022 7f91841ff700  0 log [ERR] : 2.81 osd.5: soid
e2080881/rb.0.1ee4.238e1f29.1300/54e//2 digest 0 != known digest
2631043861, size 0 != known size 4194304
2013-05-07 20:41:55.203054 7f91841ff700  0 log [ERR] : deep-scrub 2.81
e2080881/rb.0.1ee4.238e1f29.1300/54e//2 found clone without head
2013-05-07 20:41:56.683561 7f91841ff700  0 log [ERR] : 2.81 deep-scrub 0
missing, 1 inconsistent objects
2013-05-07 20:41:56.683583 7f91841ff700  0 log [ERR] : 2.81 deep-scrub 2 errors


-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel Module

2013-05-07 Thread Gandalf Corvotempesta
Any performance penalty with either solution?
On 07 May 2013 19:40, "Gregory Farnum"  wrote:

> To access CephFS you need to either use the kernel client or a
> userspace client. The userspace CephFS client is called ceph-fuse; if
> you want to use the kernel's built-in access then obviously you need
> it on your machine...
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, May 7, 2013 at 10:38 AM, Gandalf Corvotempesta
>  wrote:
> > 2013/5/7 Gregory Farnum :
> >> As long as you're planning to use ceph-fuse for your filesystem access,
> you
> >> don't need anything in the kernel.
> >
> > I will not use ceph-fuse but plain ceph-fs when production ready.
> > Ceph-fs should not need kernel module, like ?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel Module

2013-05-07 Thread Gregory Farnum
It actually depends on what your accesses look like; they have
different strengths and weaknesses. In general they perform about the
same, though.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 7, 2013 at 10:53 AM, Gandalf Corvotempesta
 wrote:
> Any performance penalty forma both solutions?
>
> On 07 May 2013 19:40, "Gregory Farnum"  wrote:
>
>> To access CephFS you need to either use the kernel client or a
>> userspace client. The userspace CephFS client is called ceph-fuse; if
>> you want to use the kernel's built-in access then obviously you need
>> it on your machine...
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Tue, May 7, 2013 at 10:38 AM, Gandalf Corvotempesta
>>  wrote:
>> > 2013/5/7 Gregory Farnum :
>> >> As long as you're planning to use ceph-fuse for your filesystem access,
>> >> you
>> >> don't need anything in the kernel.
>> >
>> > I will not use ceph-fuse but plain ceph-fs when production ready.
>> > Ceph-fs should not need kernel module, like ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Igor Laskovy
If I understand the idea correctly, when this 1 SSD fails, the whole node with
that SSD will fail. Correct?
What is the scenario for node recovery in this case?
Playing with "ceph-osd --flush-journal" and "ceph-osd --mkjournal" for each
OSD?


On Tue, May 7, 2013 at 4:17 PM, Mark Nelson  wrote:

> On 05/07/2013 06:50 AM, Barry O'Rourke wrote:
>
>> Hi,
>>
>> I'm looking to purchase a production cluster of 3 Dell Poweredge R515's
>> which I intend to run in 3 x replication. I've opted for the following
>> configuration;
>>
>> 2 x 6 core processors
>> 32Gb RAM
>> H700 controller (1Gb cache)
>> 2 x SAS OS disks (in RAID1)
>> 2 x 1Gb ethernet (bonded for cluster network)
>> 2 x 1Gb ethernet (bonded for client network)
>>
>> and either 4 x 2Tb nearline SAS OSDs or 8 x 1Tb nearline SAS OSDs.
>>
>
> Hi Barry,
>
> With so few disks and the inability to do 10GbE, you may want to consider
> doing something like 5-6 R410s or R415s and just using the on-board
> controller with a couple of SATA disks and 1 SSD for the journal.  That
> should give you better aggregate performance since in your case you can't
> use 10GbE.  It will also spread your OSDs across more hosts for better
> redundancy and may not cost that much more per GB since you won't need to
> use the H700 card if you are using an SSD for journals.  It's not as dense
> as R515s or R720XDs can be when fully loaded, but for small clusters with
> few disks I think it's a good trade-off to get the added redundancy and
> avoid expander/controller complications.
>
>
>
>> At the moment I'm undecided on the OSDs, although I'm swaying towards
>> the second option at the moment as it would give me more flexibility and
>> the option of using some of the disks as journals.
>>
>> I'm intending to use this cluster to host the images for ~100 virtual
>> machines, which will run on different hardware most likely be managed by
>> OpenNebula.
>>
>> I'd be interested to hear from anyone running a similar configuration
>> with a similar use case, especially people who have spent some time
>> benchmarking a similar configuration and still have a copy of the results.
>>
>> I'd also welcome any comments or critique on the above specification.
>> Purchases have to be made via Dell and 10Gb ethernet is out of the
>> question at the moment.
>>
>> Cheers,
>>
>> Barry
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Igor Laskovy
facebook.com/igor.laskovy
studiogrizzly.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Barry O'Rourke
Hi,

> With so few disks and the inability to do 10GbE, you may want to 
> consider doing something like 5-6 R410s or R415s and just using the 
> on-board controller with a couple of SATA disks and 1 SSD for the 
> journal.  That should give you better aggregate performance since in 
> your case you can't use 10GbE.  It will also spread your OSDs across 
> more hosts for better redundancy and may not cost that much more per GB 
> since you won't need to use the H700 card if you are using an SSD for 
> journals.  It's not as dense as R515s or R720XDs can be when fully 
> loaded, but for small clusters with few disks I think it's a good 
> trade-off to get the added redundancy and avoid expander/controller 
> complications.

I hadn't considered lowering the specification and increasing the number
of hosts; that seems like a really viable option and not too much more
expensive. When you say the on-board controller, do you mean the onboard
SATA or the H310 controller? 

Thanks,

Barry



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Barry O'Rourke
Hi,

On Tue, 2013-05-07 at 21:07 +0300, Igor Laskovy wrote:
> If I currently understand idea, when this 1 SSD will fail whole node
> with that SSD will fail. Correct? 

Only OSDs that use that SSD for the journal will fail as they will lose
any writes still in the journal. If I only have 2 OSDs sharing one SSD I
would lose the whole node.

> What scenario for node recovery in this case? 

There are a couple of options: replace the SSD, then remove the dead OSDs
from ceph and create them from scratch. Or, if you need the host back up
quickly, delete the OSDs and re-create them with journals on the same disks;
this will probably impact performance elsewhere.
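
For completeness, the journal commands Igor mentioned are used roughly like this
when the journal device is being replaced or moved while still readable (just a
sketch, assuming osd.3 and sysvinit; if the SSD has actually died, flushing is
impossible and the OSD has to be rebuilt as described above):

service ceph stop osd.3
ceph-osd -i 3 --flush-journal     # write out anything still sitting in the old journal
# repoint osd.3's journal at the new device (symlink or ceph.conf entry), then:
ceph-osd -i 3 --mkjournal
service ceph start osd.3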

Barry







-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Barry O'Rourke
Hi,

> Here's a quick performance display with various block sizes on a host
> with 1 public 1Gbe link and 1 1Gbe link on the same vlan as the ceph
> cluster.

Thanks for taking the time to look into this for me, I'll compare it
with my existing set-up in the morning.

Thanks,

Barry


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dell R515 performance and specification question

2013-05-07 Thread Mark Nelson

On 05/07/2013 03:36 PM, Barry O'Rourke wrote:

Hi,


With so few disks and the inability to do 10GbE, you may want to
consider doing something like 5-6 R410s or R415s and just using the
on-board controller with a couple of SATA disks and 1 SSD for the
journal.  That should give you better aggregate performance since in
your case you can't use 10GbE.  It will also spread your OSDs across
more hosts for better redundancy and may not cost that much more per GB
since you won't need to use the H700 card if you are using an SSD for
journals.  It's not as dense as R515s or R720XDs can be when fully
loaded, but for small clusters with few disks I think it's a good
trade-off to get the added redundancy and avoid expander/controller
complications.


I hadn't considered lowering the specification and increasing the number
of hosts, that seems like a really viable option and not too much more
expensive. When you say the on-board controller do you mean the onboard
SATA or the H310 controller?


Good question on the controller.  I suspect the on-board will be good 
enough for 1GbE or even bonded 1GbE throughput levels.  I've also heard 
some mixed things about the H310 but haven't gotten to test one myself. 
 What I've seen in the past is that if you are only using spinning 
disks, a controller with on-board cache will help performance quite a 
bit.  If you have an SSD drive for journals, you can get away with much 
cheaper sata/SAS controllers.  You mentioned earlier that the Dell SSDs 
were quite expensive.  Have you considered something like an Intel DC 
S3700?  If you can't get one through Dell, you might consider just doing 
3 disks from Dell and adding one yourself (you could put OS and journals 
on it, and use the 3 spinning disks for OSDs).  This does have the 
effect though of making the SSD a single point of failure (which is why 
it's good to use the enterprise grade drive here I think).
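
(To make the journal-on-SSD layout concrete, a minimal ceph.conf sketch; the
partition paths are invented for the example and assume journal partitions
were carved out of the shared SSD beforehand:)

[osd]
    osd journal size = 10240    # in MB; only relevant when the journal is a plain file

[osd.0]
    osd journal = /dev/sdd1     # hypothetical journal partition on the shared SSD

[osd.1]
    osd journal = /dev/sdd2     # second hypothetical journal partition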


Mark



Thanks,

Barry





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados Gateway Pools

2013-05-07 Thread Jeppesen, Nelson
Now that .61 is out I have tried getting a second radosgw farm working, but
I've run into an issue using a custom root/zone pool.

The 'radosgw-admin zone set' and 'radosgw-admin zone info' commands are
working fine, except they keep defaulting to .rgw.root. I've tried both
settings in my conf file, the one you gave and the one documented on
ceph.com, but still no luck.

My Ceph.conf

rgw root zone pool = .rgw.zone2
rgw cluster root pool = .rgw.zone2


Thanks for your help.

Nelson Jeppesen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados Gateway Pools

2013-05-07 Thread Yehuda Sadeh
On Tue, May 7, 2013 at 2:26 PM, Jeppesen, Nelson
 wrote:
> Now the .61 is out I have tried getting a second radosgw farm working but
> into an issue using a custom root/zone pool.
>
>
>
> The  ‘radosgw-admin zone set’ and ‘ radosgw-admin zone info’ commands are
> working fine except it keeps defaulting to using .rgw.root. I’ve tried the
> two settings, the one you gave and the one documented on ceph.com in my conf
> file but still no luck.
>
>
>
> My Ceph.conf
>
> ….
>
> rgw root zone pool = .rgw.zone2
>
> rgw cluster root pool = .rgw.zone2

Under what section is that? Note that usually you'd run radosgw-admin
as the client.admin user, so it might not pick up this configuration.
Try running radosgw-admin with --rgw-root-zone-pool=.rgw.zone2 and see if
it fixes it for you.

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados Gateway Pools

2013-05-07 Thread Jeppesen, Nelson
The settings are under the rgw client settings:

[client.radosgw.internal.01]
rgw root zone pool = .rgw.zone2
rgw cluster root pool = .rgw.zone2

I tried  'radosgw-admin zone set   --rgw-root-zone-pool=.rgw.zone2 < zone2'  
and 'radosgw-admin zone info  --rgw-root-zone-pool=.rgw.zone2'

Neither is reading from .rgw.zone2. How do you get radosgw-admin to run as a 
different user?


Nelson Jeppesen
   Disney Technology Solutions and Services
   Phone 206-588-5001

-Original Message-
From: yehud...@gmail.com [mailto:yehud...@gmail.com] On Behalf Of Yehuda Sadeh
Sent: Tuesday, May 07, 2013 2:46 PM
To: Jeppesen, Nelson
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados Gateway Pools

On Tue, May 7, 2013 at 2:26 PM, Jeppesen, Nelson  
wrote:
> Now the .61 is out I have tried getting a second radosgw farm working 
> but into an issue using a custom root/zone pool.
>
>
>
> The  'radosgw-admin zone set' and ' radosgw-admin zone info' commands 
> are working fine except it keeps defaulting to using .rgw.root. I've 
> tried the two settings, the one you gave and the one documented on 
> ceph.com in my conf file but still no luck.
>
>
>
> My Ceph.conf
>
> 
>
> rgw root zone pool = .rgw.zone2
>
> rgw cluster root pool = .rgw.zone2

Under what section is that? Note that usually you'd run radosgw-admin under the 
client.admin user, so it might not get this configuration.
Try running the radosgw-admin --rgw-root-zone-pool=.rgw.zone2, see if it fixes 
it for you.

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados Gateway Pools

2013-05-07 Thread Yehuda Sadeh
On Tue, May 7, 2013 at 2:54 PM, Jeppesen, Nelson
 wrote:
> The settings are under the the rgw client settings
>
> [client.radosgw.internal.01]
> rgw root zone pool = .rgw.zone2
> rgw cluster root pool = .rgw.zone2
>
> I tried  'radosgw-admin zone set   --rgw-root-zone-pool=.rgw.zone2 < zone2'  
> and 'radosgw-admin zone info  --rgw-root-zone-pool=.rgw.zone2'

That should have worked:

$ ./radosgw-admin --rgw-zone-root-pool=.rgw.root2 zone info
{ "domain_root": ".rgw",
  "control_pool": ".rgw.control",
  "gc_pool": ".rgw.gc",
  "log_pool": ".log",
  "intent_log_pool": ".intent-log",
  "usage_log_pool": ".usage",
  "user_keys_pool": ".users",
  "user_email_pool": ".users.email",
  "user_swift_pool": ".users.swift",
  "user_uid_pool ": ".users.uid"}
$ ./radosgw-admin --rgw-zone-root-pool=.rgw.root2 zone set < zone.1
{ "domain_root": ".rgw2",
  "control_pool": ".rgw.control2",
  "gc_pool": ".rgw.gc2",
  "log_pool": ".log2",
  "intent_log_pool": ".intent-log2",
  "usage_log_pool": ".usage2",
  "user_keys_pool": ".users2",
  "user_email_pool": ".users.email2",
  "user_swift_pool": ".users.swift2",
  "user_uid_pool ": ".users.uid2"}

Can you make sure you're running the correct version?


>
> Neither is reading from .rgw.zone2. How do you get radosgw-admin to run as a 
> different user?
>

$ radosgw-admin -n client.radosgw.internal.01 ...
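
(Putting the two suggestions together, the full invocation would presumably be
something along the lines of:)

$ radosgw-admin -n client.radosgw.internal.01 --rgw-zone-root-pool=.rgw.zone2 zone info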


>
> Nelson Jeppesen
>Disney Technology Solutions and Services
>Phone 206-588-5001
>
> -Original Message-
> From: yehud...@gmail.com [mailto:yehud...@gmail.com] On Behalf Of Yehuda Sadeh
> Sent: Tuesday, May 07, 2013 2:46 PM
> To: Jeppesen, Nelson
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Rados Gateway Pools
>
> On Tue, May 7, 2013 at 2:26 PM, Jeppesen, Nelson  
> wrote:
>> Now the .61 is out I have tried getting a second radosgw farm working
>> but into an issue using a custom root/zone pool.
>>
>>
>>
>> The  'radosgw-admin zone set' and ' radosgw-admin zone info' commands
>> are working fine except it keeps defaulting to using .rgw.root. I've
>> tried the two settings, the one you gave and the one documented on
>> ceph.com in my conf file but still no luck.
>>
>>
>>
>> My Ceph.conf
>>
>> 
>>
>> rgw root zone pool = .rgw.zone2
>>
>> rgw cluster root pool = .rgw.zone2
>
> Under what section is that? Note that usually you'd run radosgw-admin under 
> the client.admin user, so it might not get this configuration.
> Try running the radosgw-admin --rgw-root-zone-pool=.rgw.zone2, see if it 
> fixes it for you.
>
> Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CentOs kernel for ceph server

2013-05-07 Thread MinhTien MinhTien
Dear all,

I deploy Ceph on CentOS 6.3. When I upgraded to kernel 3.9.0, I had a few
problems with the RAID card.

I want to deploy Ceph with the default 2.6.32 kernel. Is that good enough?

The Ceph clients will use the latest kernel (3.9.0).


Thanks and regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy documentation fixes

2013-05-07 Thread Bryan Stillwell
With the release of cuttlefish, I decided to try out ceph-deploy and
ran into some documentation errors along the way:


http://ceph.com/docs/master/rados/deployment/preflight-checklist/

Under 'CREATE A USER' it has the following line:

To provide full privileges to the user, add the following to
/etc/sudoers.d/chef.

Based on the command that followed, chef should be replaced with ceph.
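
(For reference, the command that follows on that page is presumably along
these lines, which is why the filename should say ceph:)

echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
sudo chmod 0440 /etc/sudoers.d/ceph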


http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/

Under 'ZAP DISKS' it has an 'Important' message that states:

Important: This will delete all data in the partition.

If I understand it correctly, this should be changed to:

Important: This will delete all data on the disk.
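
(i.e. zap is given a whole device, something like:)

ceph-deploy disk zap osdserver1:/dev/sdb    # destroys the partition table and all data on /dev/sdb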


Under 'PREPARE OSDS' it first gives an example to prepare a disk:

ceph-deploy osd prepare {host-name}:{path/to/disk}[:{path/to/journal}]

And then it gives an example that attempts to prepare a partition:

ceph-deploy osd prepare osdserver1:/dev/sdb1:/dev/ssd1


The same issue exists for 'ACTIVATE OSDS' and 'CREATE OSDS'.
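
(If I follow, the disk and partition forms shouldn't be mixed; consistent
examples would be something like:)

ceph-deploy osd prepare osdserver1:/dev/sdb:/dev/ssd      # whole devices
ceph-deploy osd prepare osdserver1:/dev/sdb1:/dev/ssd1    # or pre-created partitions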


Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados Gateway Pools

2013-05-07 Thread Jeppesen, Nelson
Figured it out; in your post last month you were using 'rgw-zone-root-pool',
but today you're using 'rgw-root-zone-pool'. I didn't notice that root and
zone had switched and was using the old syntax. It's working now though.

Thank you for your help again! 

Nelson Jeppesen

-Original Message-
From: yehud...@gmail.com [mailto:yehud...@gmail.com] On Behalf Of Yehuda Sadeh
Sent: Tuesday, May 07, 2013 3:14 PM
To: Jeppesen, Nelson
Cc: Yehuda Sadeh; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados Gateway Pools

On Tue, May 7, 2013 at 2:54 PM, Jeppesen, Nelson  
wrote:
> The settings are under the the rgw client settings
>
> [client.radosgw.internal.01]
> rgw root zone pool = .rgw.zone2
> rgw cluster root pool = .rgw.zone2
>
> I tried  'radosgw-admin zone set   --rgw-root-zone-pool=.rgw.zone2 < zone2'  
> and 'radosgw-admin zone info  --rgw-root-zone-pool=.rgw.zone2'

That should have worked:

$ ./radosgw-admin --rgw-zone-root-pool=.rgw.root2 zone info { "domain_root": 
".rgw",
  "control_pool": ".rgw.control",
  "gc_pool": ".rgw.gc",
  "log_pool": ".log",
  "intent_log_pool": ".intent-log",
  "usage_log_pool": ".usage",
  "user_keys_pool": ".users",
  "user_email_pool": ".users.email",
  "user_swift_pool": ".users.swift",
  "user_uid_pool ": ".users.uid"}
$ ./radosgw-admin --rgw-zone-root-pool=.rgw.root2 zone set < zone.1 { 
"domain_root": ".rgw2",
  "control_pool": ".rgw.control2",
  "gc_pool": ".rgw.gc2",
  "log_pool": ".log2",
  "intent_log_pool": ".intent-log2",
  "usage_log_pool": ".usage2",
  "user_keys_pool": ".users2",
  "user_email_pool": ".users.email2",
  "user_swift_pool": ".users.swift2",
  "user_uid_pool ": ".users.uid2"}

Can you make sure you're running the correct version?


>
> Neither is reading from .rgw.zone2. How do you get radosgw-admin to run as a 
> different user?
>

$ radosgw-admin -n client.radosgw.internal.01 ...


>
> Nelson Jeppesen
>Disney Technology Solutions and Services
>Phone 206-588-5001
>
> -Original Message-
> From: yehud...@gmail.com [mailto:yehud...@gmail.com] On Behalf Of 
> Yehuda Sadeh
> Sent: Tuesday, May 07, 2013 2:46 PM
> To: Jeppesen, Nelson
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Rados Gateway Pools
>
> On Tue, May 7, 2013 at 2:26 PM, Jeppesen, Nelson  
> wrote:
>> Now the .61 is out I have tried getting a second radosgw farm working 
>> but into an issue using a custom root/zone pool.
>>
>>
>>
>> The  'radosgw-admin zone set' and ' radosgw-admin zone info' commands 
>> are working fine except it keeps defaulting to using .rgw.root. I've 
>> tried the two settings, the one you gave and the one documented on 
>> ceph.com in my conf file but still no luck.
>>
>>
>>
>> My Ceph.conf
>>
>> 
>>
>> rgw root zone pool = .rgw.zone2
>>
>> rgw cluster root pool = .rgw.zone2
>
> Under what section is that? Note that usually you'd run radosgw-admin under 
> the client.admin user, so it might not get this configuration.
> Try running the radosgw-admin --rgw-root-zone-pool=.rgw.zone2, see if it 
> fixes it for you.
>
> Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOs kernel for ceph server

2013-05-07 Thread Gregory Farnum
On Tue, May 7, 2013 at 4:45 PM, MinhTien MinhTien
 wrote:
> Dear all,
>
> I deploy ceph with Centos 6.3. When I upgrade kernel 3.9.0, I having few
> problems with card raid.
>
> I want to deploy ceph with default kernel 2.6.32. This is good isn't it?
>
> Ceph client will use the lartest kernel (3.9.0).

Yeah, the servers will run on whatever. If you're using multiple OSDs
on a server you want to make sure the kernel implements the syncfs
syscall (I believe it does for 6.3).
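
(One quick way to check, assuming nothing more than a shell on the box:)

grep -c syncfs /proc/kallsyms    # a non-zero count suggests the kernel exposes the syncfs syscall
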
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster unable to finish balancing

2013-05-07 Thread Berant Lemmenes
So just a little update... after replacing the original failed drive things
seem to be progressing a little better; however, I noticed something else
odd. Looking at 'rados df', it looks like the system thinks that the data
pool has 32 TB of data, but this is only an 18 TB raw system.

pool name       category        KB            objects   clones   degraded   unfound   rd      rd KB     wr          wr KB
data            -               32811540110   894927    0        240445     0         1       0         2720415     4223435021
media_video     -               1             1         0        0          0         2       1         2611361     1177389479
metadata        -               210246        18482     0        4592       1         6970    561296    1253955     19500149
rbd             -               330731965     82018     0        19584      0         26295   1612689   54606042    2127030019
  total used        10915771968     995428
  total avail       6657285104
  total space       17573057072


Any recommendations on how I can sort out why it thinks it has way more
data in that pool than it actually does?
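
(With the default 4 MB objects, the ~895k objects reported for 'data' could
only account for roughly 3.5 TB, so the 32 TB figure does look like a stats
accounting problem rather than real usage. A couple of low-level cross-checks,
for what they're worth:)

rados -p data ls | wc -l    # count the objects actually present in the pool
rados -p data ls | head     # spot-check what kind of objects they are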

Thanks in advance.
Berant


On Mon, May 6, 2013 at 4:43 PM, Berant Lemmenes  wrote:

> TL;DR
>
> bobtail Ceph cluster unable to finish rebalance after drive failure, usage
> increasing even with no clients connected.
>
>
> I've been running a test bobtail cluster for a couple of months and it's
> been working great. Last week I had a drive die and rebalance; durring that
> time another OSD crashed. All was still well, however as the second osd had
> just crashed I restarted made sure that it re-entered properly and
> rebalancing continued and then I went to bed.
>
> Waking up in the morning I found 2 OSDs were 100% full and two more were
> almost full. To get out of the situation I decreased the replication size
> from 3 to 2, and then also carefully (I believe carefully enough) remove
> some PGs in order to start things up again.
>
> I got things going again and things appeared to be rebalancing correctly;
> however it got to the point were it stopped at 1420 PGs active+clean and
> the rest were stuck backfilling.
>
> Looking at the PG dump, all of the PGs that were having issues were on
> osd.1. So I stopped it, verified things were continuing to rebalance after
> it was down/out and then formated osd.1's disk and put it back in.
>
> Since then I've not been able to get the cluster back to HEALTHY, due to a
> combination of OSDs dying while recovering (not due to disk failure, just
> crashes) as well as the used space in the cluster increasing abnormally.
>
> Right now I have all the clients disconnected and just the cluster
> rebalancing and the usage is increasing to the point where I have 12TB used
> when I have only < 3TB in cephfs and 2TB in a single RBD image
> (replication size 2). I've since shutdown the cluster so I don't fill it up.
>
> My crushmap is the default, here is the usual suspects. I'm happy to
> provide additional information.
>
> pg dump: http://pastebin.com/LUyu6Z09
>
> ceph osd tree:
> osd.8 is the failed drive (I will be replacing tonight), weight on osd.1
> and osd.6 was done via reweight-by-utilization
>
> # id weight type name up/down reweight
> -1 19.5 root default
> -3 19.5 rack unknownrack
> -2 19.5 host ceph-test
> 0 1.5 osd.0 up 1
> 1 1.5 osd.1 up 0.6027
> 2 1.5 osd.2 up 1
> 3 1.5 osd.3 up 1
> 4 1.5 osd.4 up 1
> 5 2 osd.5 up 1
> 6 2 osd.6 up 0.6676
> 7 2 osd.7 up 1
> 8 2 osd.8 down 0
> 9 2 osd.9 up 1
> 10 2 osd.10 up 1
>
>
> ceph -s:
>
>health HEALTH_WARN 24 pgs backfill; 85 pgs backfill_toofull; 29 pgs
> backfilling; 40 pgs degraded; 1 pgs recovery_wait; 121 pgs stuck unclean;
> recovery 109306/2091318 degraded (5.227%);  recovering 3 o/s, 43344KB/s; 2
> near full osd(s); noout flag(s) set
>monmap e2: 1 mons at {a=10.200.200.21:6789/0}, election epoch 1,
> quorum 0 a
>osdmap e16251: 11 osds: 10 up, 10 in
> pgmap v3145187: 1536 pgs: 1414 active+clean, 6
> active+remapped+wait_backfill, 10
> active+remapped+wait_backfill+backfill_toofull, 4
> active+degraded+wait_backfill+backfill_toofull, 22
> active+remapped+backfilling, 42 active+remapped+backfill_toofull, 7
> active+degraded+backfilling, 17 active+degraded+backfill_toofull, 1
> active+recovery_wait+remapped, 4
> active+degraded+remapped+wait_backfill+backfill_toofull, 8
> active+degraded+remapped+backfill_toofull, 1 active+clean+scrubbing+deep;
> 31607 GB data, 12251 GB used, 4042 GB / 16293 GB avail; 109306/2091318
> degraded (5.227%);  recovering 3 o/s, 43344KB/s
>mdsmap e3363: 1/1/1 up {0=a=up:active}
>
> rep size:
> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 384
> pgp_num 384 last_change 897 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num
> 384 pgp_num 384 last_change 13364 owner 0
> pool 2 'rbd' rep size 2 crush_r

[ceph-users] HELP: raid 6 - osd slow request

2013-05-07 Thread Lenon Join
Dear all,

I deploy Ceph on a RAID 6 array. I have one SSD partition (RAID 0), which I
use as the journal device for the OSDs.

The RAID 6 array contains 60 TB, divided into 4 OSDs.

When I deploy, the OSDs frequently report slow requests.

001080 [write 0~4194304 [5@0]] 0.72bf90bf snapc 1=[]) v4 currently commit
sent
2013-05-08 10:56:25.617376 osd.5 [WRN] slow request 40.220838 seconds old,
received at 2013-05-08 10:55:45.396435: osd_op(client.32118.1:23397
1253cb9.1083 [write 0~4194304 [5@0]] 0.10e44e90 snapc 1=[]) v4
currently commit sent
2013-05-08 10:56:25.617379 osd.5 [WRN] slow request 40.073943 seconds old,
received at 2013-05-08 10:55:45.543330: osd_op(client.32118.1:23409
1253cb9.108e [write 0~4194304 [5@0]] 0.d42b5452 snapc 1=[]) v4
currently commit sent
2013-05-08 10:56:26.617605 osd.5 [WRN] 5 slow requests, 1 included below;
oldest blocked for > 41.528346 secs
2013-05-08 10:56:26.617614 osd.5 [WRN] slow request 40.935174 seconds old,
received at 2013-05-08 10:55:45.682338: osd_op(client.32118.1:23414
1253cb9.1093 [write 0~4194304 [5@0]] 0.9543dc39 snapc 1=[]) v4
currently commit sent
2013-05-08 10:56:27.617784 osd.5 [WRN] 6 slow requests, 2 included below;
oldest blocked for > 42.371566 secs
2013-05-08 10:56:27.617789 osd.5 [WRN] slow request 40.425833 seconds old,
received at 2013-05-08 10:55:47.191899: osd_sub_op(osd.7.0:6648 0.d6
6bf7cbd6/1253cd2.0060/head//0 [push] v 5203'39393 snapset=0=[]:[]
snapc=0=[]) v7 currently no flag points reached


I use kernel 3.9.0 on CentOS, with Ceph 0.56.4.

I do not know the cause and cannot fix it.
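
(Not a fix, but a minimal way to see whether the journal SSD or the RAID 6
array is saturating while the slow requests accumulate; the device names are
placeholders for your actual devices:)

iostat -x sda sdb 5    # e.g. the RAID 6 LUN and the SSD journal device, 5-second samples
ceph -w                # watch the slow request warnings appear in real time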

Please help me!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com