Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-05 Thread Olivier Bonvalet
It's Xen, yes, but no, I haven't tried the RBD tap client, for two
reasons:
- it's too young to enable in production
- the Debian packages don't include the TAP driver


On Monday, 5 August 2013 at 01:43 +, James Harper wrote:
> What VM? If Xen, have you tried the rbd tap client?
> 
> James
> 
> > -Original Message-
> > From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
> > boun...@lists.ceph.com] On Behalf Of Olivier Bonvalet
> > Sent: Monday, 5 August 2013 11:07 AM
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103
> > 
> > 
> > Hi,
> > 
> > I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
> > 3.9.11 to Linux 3.10.5, and now I have kernel panic after launching some
> > VM which use RBD kernel client.
> > 
> > 
> > In kernel logs, I have :
> > 
> > Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at
> > net/ceph/osd_client.c:2103!
> > Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  [#1] 
> > SMP
> > Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc rbd
> > libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter
> > ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT
> > xt_tcpudp iptable_filter ip_tables x_tables bridge loop coretemp
> > ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper
> > ablk_helper cryptd iTCO_wdt iTCO_vendor_support gpio_ich microcode
> > serio_raw sb_edac edac_core evdev lpc_ich i2c_i801 mfd_core wmi ac
> > ioatdma shpchp button dm_mod hid_generic usbhid hid sg sd_mod
> > crc_t10dif crc32c_intel isci megaraid_sas libsas ahci libahci ehci_pci 
> > ehci_hcd
> > libata scsi_transport_sas igb scsi_mod i2c_algo_bit ixgbe usbcore i2c_core
> > dca usb_common ptp pps_core mdio
> > Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm:
> > blkback.3.xvda Not tainted 3.10-dae-dom0 #1
> > Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: Supermicro
> > X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> > Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 ti:
> > 88003803a000 task.ti: 88003803a000
> > Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: 
> > e030:[]
> > [] ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
> > Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: e02b:88003803b9f8
> > EFLAGS: 00010212
> > Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 RBX:
> > 880033a182ec RCX: 
> > Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af RSI:
> > 8050 RDI: 880030d34888
> > Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 R08:
> > 88003803ba58 R09: 
> > Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  R11:
> >  R12: 880033ba3500
> > Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 R14:
> > 88003847aa78 R15: 88003847ab58
> > Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  7f775da8c700()
> > GS:88003f84() knlGS:
> > Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES: 
> > CR0: 80050033
> > Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 CR3:
> > 2be14000 CR4: 00042660
> > Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  DR1:
> >  DR2: 
> > Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  DR6:
> > 0ff0 DR7: 0400
> > Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
> > Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000
> > 00243847aa78  880039949b40
> > Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201
> > 880033811d98 88003803ba80 88003847aa78
> > Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380
> > 880002a38400 2000 a029584c
> > Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
> > Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ?
> > rbd_osd_req_format_write+0x71/0x7c [rbd]
> > Aug  5 02:51:22 murmillia kernel: [  289.213459]  [] ?
> > rbd_img_request_fill+0x695/0x736 [rbd]
> > Aug  5 02:51:22 murmillia kernel: [  289.213562]  [] ?
> > arch_local_irq_restore+0x7/0x8
> > Aug  5 02:51:22 murmillia kernel: [  289.213667]  [] ?
> > down_read+0x9/0x19
> > Aug  5 02:51:22 murmillia kernel: [  289.213763]  [] ?
> > rbd_request_fn+0x191/0x22e [rbd]
> > Aug  5 02:51:22 murmillia kernel: [  289.213864]  [] ?
> > __blk_run_queue_uncond+0x1e/0x26
> > Aug  5 02:51:22 murmillia kernel: [  289.213962]  [] ?
> > blk_flush_plug_list+0x1c1/0x1e4
> > Aug  5 02:51:22 murmillia kernel: [  289.214059]  [] ?
> > blk_finish_plug+0xb/0x2a
> > Aug  5 02:51:22 murmillia kernel: [  289.214157]  [] ?
> > dispatch_rw_block_io+0x33e/0x3f0
> > Aug  5

Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-05 Thread James Harper
> 
> It's Xen yes, but no I didn't tried the RBD tab client, for two
> reasons :
> - too young to enable it in production
> - Debian packages don't have the TAP driver
> 

It works under Wheezy. blktap is available as a dkms package; then just replace 
the tapdisk binary with the rbd version and follow the other instructions (fixing the 
xend restriction). xen-utils-4.1 installs its own tap-ctl too... I seem to remember 
removing that.
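
From memory, the setup boiled down to something like this (package names and
paths here are from memory, so treat them as assumptions and adjust for your
system):

  # kernel module and userspace tools for blktap
  apt-get install blktap-dkms blktap-utils
  # drop in the rbd-enabled tapdisk build in place of the stock binary
  cp /path/to/rbd-enabled/tapdisk /usr/sbin/tapdisk
  # xen-utils-4.1 ships its own tap-ctl; I moved it aside so the blktap-utils one is used
  mv /usr/lib/xen-4.1/bin/tap-ctl /usr/lib/xen-4.1/bin/tap-ctl.disabled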

As for maturity, that's certainly a valid concern. I've had less trouble with 
rbd tapdisk than rbd kernel module though.

James


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

2013-08-05 Thread Stefan Hajnoczi
On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
> Am 02.08.2013 um 23:47 schrieb Mike Dawson :
> > We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh 
> > screenshot' command. After that, the guest resumes and runs as expected. At 
> > that point we can examine the guest. Each time we'll see:

If virsh screenshot works then this confirms that QEMU itself is still
responding.  Its main loop cannot be blocked since it was able to
process the screendump command.

This supports Josh's theory that a callback is not being invoked.  The
virtio-blk I/O request would be left in a pending state.

Now here is where the behavior varies between configurations:

On a Windows guest with 1 vCPU, you may see the symptom that the guest no
longer responds to ping.

On a Linux guest with multiple vCPUs, you may see the hung task message
from the guest kernel because other vCPUs are still making progress.
Just the vCPU that issued the I/O request and whose task is in
UNINTERRUPTIBLE state would really be stuck.

Basically, the symptoms depend not just on how QEMU is behaving but also
on the guest kernel and how many vCPUs you have configured.
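
Inside a Linux guest you can usually confirm this directly: the issuing task
sits in D (uninterruptible) state and the hung task watchdog fires after the
configured timeout. For example:

  # list tasks stuck in uninterruptible sleep
  ps -eo pid,stat,wchan,comm | awk '$2 ~ /D/'
  # the timeout that triggers the "blocked for more than N seconds" warning
  cat /proc/sys/kernel/hung_task_timeout_secs
  # dump stack traces of blocked tasks to the kernel log (needs sysrq enabled)
  echo w > /proc/sysrq-trigger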

I think this can explain how both problems you are observing, Oliver and
Mike, are a result of the same bug.  At least I hope they are :).

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Block device storage

2013-08-05 Thread Benito Chamberlain

Hi there

I have a few questions regarding the block device storage and the 
ceph-filesystem.


We want to cluster a database (Progress) on a clustered filesystem, but 
the database requires the operating system to see the clustered storage 
area as a block device, and not as a network storage area / filesystem.
It refuses to create database volumes on something like GFS2 / GlusterFS, 
because it sees them as unfit - our guess is that due to the filesystem 
attributes the block size is reported as invalid. It complains about the 
size when trying to create the db and says that the max size for any 
volume is 0.


Will Ceph be of any help in this regard?
The database needs to be replicated (not necessarily in realtime, but as 
close to realtime as possible). Does the block storage device in Ceph 
mean that it acts as a locally mounted filesystem to the OS, and not, 
say, an NFS/CIFS/GFS2 share?


Thank you in advance.
If someone can point me in the right direction, I'd appreciate it.

Benito

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Block device storage

2013-08-05 Thread James Harper
> 
> Hi there
> 
> I have a few questions regarding the block device storage and the
> ceph-filesystem.
> 
> We want to cluster a database (Progress) on a clustered filesystem , but
> the database requires the
> operating system to see the clustered storage area as a block device ,
> and not a network storage area / filesystem.
> It refuses to create database volumes on something like GFS2 / Glusterfs
> , because it sees it as unfit - guessing due
> to the filesystem attributes, the block size is invalid. It complains
> about the size when trying to create the db and says that the max size
> for any volume is 0.

I think that's the wrong way to do it. I'm pretty sure postgres assumes you are 
not running on a shared filesystem. You would use the rbd (which is a block 
device and not a filesystem) with a regular filesystem on top 
(ext3/xfs/whatever) and then make pacemaker or heartbeat control which node 
mounted the filesystem, add the IP address, and run the database service. I 
don't think postgres itself supports any sort of clustering (although it's been 
a while since I looked at it).

I'm not an expert on pacemaker although I do use it regularly. I think you 
would:

. Create an IP address resource for postgres
. Create a filesystem resource for /var/lib/postgres (or whatever)
. Create a service resource for postgres, listening on your clustered IP
. Group the above services together

You may also want to have a rbd resource (you'd probably need to write the RA 
yourself - I've not seen one) to map the rbd in the above group, or just always 
map on each node.
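
As a very rough crm shell sketch of that layout (resource, device and mount
names below are placeholders, and the rbd RA is the piece you would have to
supply or script yourself):

  crm configure primitive p_ip ocf:heartbeat:IPaddr2 params ip=192.168.0.100 cidr_netmask=24
  crm configure primitive p_fs ocf:heartbeat:Filesystem params device=/dev/rbd/rbd/pgdata directory=/var/lib/postgresql fstype=xfs
  crm configure primitive p_db ocf:heartbeat:pgsql
  crm configure group g_db p_ip p_fs p_db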

Lots of testing required, and it's a bit of a steep learning curve if you've 
never used pacemaker or heartbeat before. Definitely don't ignore the advice on 
STONITH.

This link explains how to do it http://clusterlabs.org/wiki/PostgresHowto (with 
DRBD, but the rest of it should be applicable)

Another approach is to run postgres in a VM and cluster that instead, so the VM 
itself would fail over to another node. This is basically what I'm doing for 
all my VMs. Of course neither approach protects you if the database itself gets 
corrupted (or the VM breaks, in the VM case), only if the active node itself fails.

James

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About single monitor recovery

2013-08-05 Thread Yu Changyuan
The good news is that with the new patch, ceph starts OK, cephfs mounts OK, and the
kvm virtual machine using rbd boots OK (and seems to be running fine). I checked the
timestamp of the last file written to cephfs, and it is fairly close to the time of
the reboot (which is what broke ceph in the first place). Since I don't have any
other way to check the integrity of the files stored in cephfs, I just randomly
picked some video files and played them; all seems OK.

So, thank you very much.

But I am not using the latest version of the files in /var/lib/ceph/mon/ceph-a:
with those files, ceph-mon starts up OK and ceph -s returns, but the osds still
think the monitor is the wrong node and refuse to work.
So I tried the files from two days ago (Aug 1st) to see what would happen,
and with those ceph-osd actually started to work.
I am therefore a bit curious why the patched version works with the ceph-mon
data from two days ago but the original version does not.
More importantly, do I need any extra steps to make the currently running ceph
cluster work with a normal (unpatched) version of ceph, and is there any chance
that the current cluster will run into problems in the future if I keep the
current state and take no extra steps?



On Mon, Aug 5, 2013 at 12:39 AM, Sage Weil  wrote:

> On Sun, 4 Aug 2013, Yu Changyuan wrote:
> > And here is the log of ceph-mon, with debug_mon set to 10, I run "ceph
> -s"
> > command(which is blocked) on 192.168.1.2 during recording this log.
> >
> > https://gist.github.com/yuchangyuan/ba3e72452215221d1e82
>
> I pushed one more patch to that branch that should get you up.  This one
> should go to master as well.
>
> sage
>
> >
> >
> > On Sun, Aug 4, 2013 at 3:25 PM, Yu Changyuan  wrote:
> >   I just try the branch, and mon start ok, here is the log:
> >   https://gist.github.com/yuchangyuan/3138952ac60508d18aed
> >   But ceph -s or ceph -w just block, without any message return(I
> >   just start monitor, no mds or osd).
> >
> >
> >
> > On Sun, Aug 4, 2013 at 12:23 PM, Yu Changyuan 
> > wrote:
> >
> >   On Sun, Aug 4, 2013 at 12:16 PM, Sage Weil
> >wrote:
> > It looks like the auth state wasn't trimmed
> > properly.  It also sort of
> > looks like you aren't using authentication on
> > this cluster... is that
> > true?  (The keyring file was empty.)
> >
> > Yes, your're right, I disable auth. It's just a personal
> > cluster, so the simpler the better.
> >
> >   This looks like a trim issue, but I don't remember
> >   what all we fixed since
> >   .1.. that was a while ago!  We certainly haven't
> >   seen anything like this
> >   recently.
> >
> >   I pushed a branch wip-mon-skip-auth-cuttlefish that
> >   skips the missing
> >   incrementals and will get your mon up, but you may
> >   lose some auth keys.
> >   If auth is on, you'll need ot add them back again.
> >If not, it may just
> >   work with this.
> >
> >   You can grab the packages from
> >
> >
> http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-mon-skip-
> >   auth-cuttlefish
> >
> >   or whatever the right dir is for your distro when
> >   they appear in about 15
> >   minutes.  Let me know if that resolves it.
> >
> >
> > Thank you for your work, I will try as soon as possible.
> > PS: My distro is Gentoo, so maybe I should build from source
> > directly.
> >
> >
> >   sage
> >
> >
> >   On Sun, 4 Aug 2013, Yu Changyuan wrote:
> >
> >   >
> >   >
> >   >
> >   > On Sun, Aug 4, 2013 at 12:13 AM, Sage Weil
> >wrote:
> >   >   On Sat, 3 Aug 2013, Yu Changyuan wrote:
> >   >   > I run a tiny ceph cluster with only one
> >   monitor. After a
> >   >   reboot the system,
> >   >   > the monitor refuse to start.
> >   >   > I try to start ceph-mon manually with
> >   command 'ceph -f -i a',
> >   >below is
> >   >   > first few lines of the output:
> >   >   >
> >   >   > starting mon.a rank 0 at
> >   192.168.1.10:6789/0 mon_data
> >   >   > /var/lib/ceph/mon/ceph-a fsid
> >   >   554bee60-9602-4017-a6e1-ceb6907a218c
> >   >   > mon/AuthMonitor.cc: In function 'virtual
> >   void
> >   >   > AuthMonitor::update_from_paxos()' thread
> >   7f9e3b0db780 time
> >   >   2013-08-03
> >   >   > 20:27:29.208156
> >   >   > mon/AuthMonitor.cc: 147: FAILED assert(ret
> >   == 0)
> >   >   >
> >   >   > The full log is at:
> >   >
> >   https://gist.github.com/yuchangyuan/0a0a56a14fa4649ec2c8
> >   >
> >   > This is 0.61.1.  Can you try again with 0.61.7 to
> >   rule out anything
> >   > there?
> >   >
> >   >
> >   > I just tried 0.61.7, still out of luck. Here is
> >   the log:
> >   >
> >   https://gist.github.com/yuchangyuan/34743c0abf1bfd8ef243
> >   >
> >   >
> >   >   > So, a

[ceph-users] re-initializing a ceph cluster

2013-08-05 Thread Jeff Moskow
After more than a week of trying to restore our cluster I've given up.

I'd like to reset the data, metadata and rbd pools to their initial clean
states (wiping out all data).  Is there an easy way to do this?  I tried
deleting and adding pools, but still have:

   health HEALTH_WARN 32 pgs degraded; 86 pgs stuck unclean
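
For reference, the delete/recreate I tried was roughly the following (exact
syntax and safety flags differ between releases, and the pg counts are just
placeholders):

  ceph osd pool delete data
  ceph osd pool delete metadata
  ceph osd pool delete rbd
  ceph osd pool create data 256
  ceph osd pool create metadata 256
  ceph osd pool create rbd 256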

Thanks!
Jeff

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Large storage nodes - best practices

2013-08-05 Thread Brian Candler
I am looking at evaluating ceph for use with large storage nodes (24-36 
SATA disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).


What would be the best practice for deploying this? I can see two main 
options.


(1) Run 24-36 osds per node. Configure ceph to replicate data to one or 
more other nodes. This means that if a disk fails, there will have to be 
an operational process to stop the osd, unmount and replace the disk, 
mkfs a new filesystem, mount it, and restart the osd - which could be 
more complicated and error-prone than a RAID swap would be.
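
In shell terms I imagine the per-disk swap looks roughly like this (osd id,
device name and init commands are placeholders, and with cephx you would also
have to re-create the OSD's key):

  service ceph stop osd.12
  umount /var/lib/ceph/osd/ceph-12
  # physically replace the drive, then:
  mkfs.xfs -f /dev/sdm
  mount /dev/sdm /var/lib/ceph/osd/ceph-12
  ceph-osd -i 12 --mkfs
  service ceph start osd.12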


(2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and 
run one osd per node. In this case:
* if I use RAID0 or LVM, then a single disk failure will cause all the 
data on the node to be lost and rebuilt

* if I use RAID5/6, then write performance is likely to be poor
* if I use RAID10, then capacity is reduced by half; with ceph 
replication each piece of data will be replicated 4 times (twice on one 
node, twice on the replica node)


It seems to me that (1) is what ceph was designed to achieve, maybe with 
2 or 3 replicas. Is this what's recommended?


I have seen some postings which imply one osd per node: e.g.
http://www.sebastien-han.fr/blog/2012/08/17/ceph-storage-node-maintenance/
shows three nodes each with one OSD - but maybe this was just a trivial 
example for simplicity.


Looking at
http://ceph.com/docs/next/install/hardware-recommendations/
it says " You *may* run multiple OSDs per host" (my emphasis), and goes 
on to caution against having more disk bandwidth than network bandwidth. 
Ah, but at another point it says " We recommend using a dedicated drive 
for the operating system and software, and one drive for each OSD daemon 
you run on the host." So I guess that's fairly clear.


Are there any other options I should be considering?

Regards,

Brian.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Block device storage

2013-08-05 Thread Bill Campbell
Here's a quick Google result for Ceph Resource Agents packages in the
Debian unstable branch.  These look to apply to 0.48, but could be used
as a base for a Resource Agent for RBD.


http://packages.debian.org/unstable/ceph-resource-agents



-Original Message-
From: James Harper 
To: Benito Chamberlain , ceph-us...@ceph.com

Subject: Re: [ceph-users] Block device storage
Date: Mon, 5 Aug 2013 10:12:53 +


> 
> Hi there
> 
> I have a few questions regarding the block device storage and the
> ceph-filesystem.
> 
> We want to cluster a database (Progress) on a clustered filesystem , but
> the database requires the
> operating system to see the clustered storage area as a block device ,
> and not a network storage area / filesystem.
> It refuses to create database volumes on something like GFS2 / Glusterfs
> , because it sees it as unfit - guessing due
> to the filesystem attributes, the block size is invalid. It complains
> about the size when trying to create the db and says that the max size
> for any volume is 0.

I think that's the wrong way to do it. I'm pretty sure postgres assumes you are 
not running on a shared filesystem. You would use the rbd (which is a block 
device and not a filesystem) with a regular filesystem on top 
(ext3/xfs/whatever) and then make pacemaker or heartbeat control which node 
mounted the filesystem, add the IP address, and run the database service. I 
don't think postgres itself supports any sort of clustering (although it's been 
a while since I looked at it).

I'm not an expert on pacemaker although I do use it regularly. I think you 
would:

. Create an IP address resource for postgres
. Create a filesystem resource for /var/lib/postgres (or whatever)
. Create a service resource for postgres, listening on your clustered IP
. Group the above services together

You may also want to have a rbd resource (you'd probably need to write the RA 
yourself - I've not seen one) to map the rbd in the above group, or just always 
map on each node.

Lots of testing required, and it's a bit of a steep learning curve if you've 
never used pacemaker or heartbeat before. Definitely don't ignore the advice on 
STONITH.

This link explains how to do it http://clusterlabs.org/wiki/PostgresHowto (with 
DRBD, but the rest of it should be applicable)

Another approach is run postgres in a VM and cluster that instead, so the VM 
itself would fail over to another node. This is basically what I'm doing for 
all my VMs. Of course neither approach protects you if the database itself got 
corrupt (or the VM broke in the VM case), only if the active node itself failed.

James

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] compile error on centos 5.9

2013-08-05 Thread Sage Weil
[Moving to ceph-devel]

On Mon, 5 Aug 2013, huangjun wrote:
> hi,all
> i compiled ceph 0.61.3 on centos 5.9,the "sh autogen.sh" and
> "./configure " is ok, but when i "make", an error occurs, the err log:
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:
> In function 'int rados::cls::lock::lock(librados::IoCtx*, const
> std::string&, const std::string&, ClsLockType, const std::string&, const
> std::string&, const std::string&, const utime_t&, uint8_t)':
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:83:
> error: 'class __gnu_cxx::lock' is not a function,
> cls/lock/cls_lock_client.cc:59: error: conflict with 'int
> rados::cls::lock::lock(librados::IoCtx*, const std::string&, const
> std::string&, ClsLockType, const std::string&, const std::string&, const
> std::string&, const utime_t&, uint8_t)'
> cls/lock/cls_lock_client.cc:62: error: in call to 'lock'
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:
> In member function 'void
> rados::cls::lock::Lock::lock_shared(librados::ObjectWriteOperation*)':
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:83:
> error: 'class __gnu_cxx::lock' is not a function,
> cls/lock/cls_lock_client.cc:59: error: conflict with 'int
> rados::cls::lock::lock(librados::IoCtx*, const std::string&, const
> std::string&, ClsLockType, const std::string&, const std::string&, const
> std::string&, const utime_t&, uint8_t)'
> cls/lock/cls_lock_client.cc:181: error: in call to 'lock'
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:
> In member function 'int
> rados::cls::lock::Lock::lock_shared(librados::IoCtx*, const std::string&)':
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:83:
> error: 'class __gnu_cxx::lock' is not a function,
> cls/lock/cls_lock_client.cc:59: error: conflict with 'int
> rados::cls::lock::lock(librados::IoCtx*, const std::string&, const
> std::string&, ClsLockType, const std::string&, const std::string&, const
> std::string&, const utime_t&, uint8_t)'
> cls/lock/cls_lock_client.cc:187: error: in call to 'lock'
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:
> In member function 'void
> rados::cls::lock::Lock::lock_exclusive(librados::ObjectWriteOperation*)':
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:83:
> error: 'class __gnu_cxx::lock' is not a function,
> cls/lock/cls_lock_client.cc:59: error: conflict with 'int
> rados::cls::lock::lock(librados::IoCtx*, const std::string&, const
> std::string&, ClsLockType, const std::string&, const std::string&, const
> std::string&, const utime_t&, uint8_t)'
> cls/lock/cls_lock_client.cc:193: error: in call to 'lock'
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:
> In member function 'int
> rados::cls::lock::Lock::lock_exclusive(librados::IoCtx*, const
> std::string&)':
> /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/concurrence.h:83:
> error: 'class __gnu_cxx::lock' is not a function,
> cls/lock/cls_lock_client.cc:59: error: conflict with 'int
> rados::cls::lock::lock(librados::IoCtx*, const std::string&, const
> std::string&, ClsLockType, const std::string&, const std::string&, const
> std::string&, const utime_t&, uint8_t)'
> cls/lock/cls_lock_client.cc:199: error: in call to 'lock'
> make[3]: *** [cls_lock_client.o] Error 1
> 
> the gcc version is 4.1.2, does this make a difference?

I suspect so.  Mark Nelson successfully built on RHEL5 a while back but 
needed to use a newer gcc.

> what should i do if i want to use ceph-fuse client on centos 5.9? must
> compile the ceph? or just compile the ceph-fuse code?

Right.. you only need the ceph-fuse code.  'make ceph-fuse' may do the 
trick.  Otherwise, you'll need to just strip out the osd stuff from 
Makefile.am.

Either way, let us know how it goes, as others would benefit from this as 
well!

Thanks-
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread Mike Dawson

Brian,

Short answer: Ceph generally is used with multiple OSDs per node. One 
OSD per storage drive with no RAID is the most common setup. At 24- or 
36-drives per chassis, there are several potential bottlenecks to consider.


Mark Nelson, the Ceph performance guy at Inktank, has published several 
articles you should consider reading. A few of interest are [0], [1], 
and [2]. The last link is a 5-part series.


There are lots of considerations:

- HBA performance
- Total OSD throughput vs network throughput
- SSD throughput vs. OSD throughput
- CPU / RAM overhead for the OSD processes

Also, note that there is on-going work to add erasure coding as an 
optional backend (as opposed to the current replication scheme). If you 
prioritize bulk storage over performance, you may be interested in 
following the progress [3], [4], and [5].


[0]: 
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1]: 
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2]: 
http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/
[3]: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[4]: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[5]: http://www.inktank.com/about-inktank/roadmap/


Cheers,
Mike Dawson


On 8/5/2013 9:50 AM, Brian Candler wrote:

I am looking at evaluating ceph for use with large storage nodes (24-36
SATA disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).

What would be the best practice for deploying this? I can see two main
options.

(1) Run 24-36 osds per node. Configure ceph to replicate data to one or
more other nodes. This means that if a disk fails, there will have to be
an operational process to stop the osd, unmount and replace the disk,
mkfs a new filesystem, mount it, and restart the osd - which could be
more complicated and error-prone than a RAID swap would be.

(2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and
run one osd per node. In this case:
* if I use RAID0 or LVM, then a single disk failure will cause all the
data on the node to be lost and rebuilt
* if I use RAID5/6, then write performance is likely to be poor
* if I use RAID10, then capacity is reduced by half; with ceph
replication each piece of data will be replicated 4 times (twice on one
node, twice on the replica node)

It seems to me that (1) is what ceph was designed to achieve, maybe with
2 or 3 replicas. Is this what's recommended?

I have seen some postings which imply one osd per node: e.g.
http://www.sebastien-han.fr/blog/2012/08/17/ceph-storage-node-maintenance/
shows three nodes each with one OSD - but maybe this was just a trivial
example for simplicity.

Looking at
http://ceph.com/docs/next/install/hardware-recommendations/
it says " You *may* run multiple OSDs per host" (my emphasis), and goes
on to caution against having more disk bandwidth than network bandwidth.
Ah, but at another point it says " We recommend using a dedicated drive
for the operating system and software, and one drive for each OSD daemon
you run on the host." So I guess that's fairly clear.

Anything other options I should be considering?

Regards,

Brian.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About single monitor recovery

2013-08-05 Thread Sage Weil
On Mon, 5 Aug 2013, Yu Changyuan wrote:
> The good news is, with new patch, ceph start OK, cephfs mount OK, and kvm
> virtual machine use rbd boot OK(and seems running ok), and I check the
> timestamp of last file write to cephfs, it's fair near to the time of
> reboot(which cause ceph not work any more). Since I don't have any other way
> to check the integrity of  the files store in cephfs, I just randomly pick
> some video files, and play it, all seems OK.
> 
> So, thank you very much.
> 
> But, I do not use the last version of files in /var/lib/ceph/mon/ceph-a,
> with these files, ceph-mon startup ok, and ceph -s returns, but osd still
> think the monitor is wrong node and refuse to work.
> Then I think I may try the files of 2 day ago(Aug 1st) and see what happen,
> and something actually happen, that is ceph-osd start to work.
> So, I am a bit curious about why patched version work with the ceph-mon data
> 2 days ago but original version not,
> and what more important, do I need extra step to make current running ceph
> cluster to work with a normal version(not patched) ceph,
> and are there any chance that current cluster will run into problem in the
> future(keep current state and do not take any extra step).

I think you will be fine with the current state and switching back to 
normal release code.

I'm confused why ceph-osds wouldn't start with the latest mon data, but 
can't speculate too much without spending time analyzing your logs from 
the failed startup. 

Glad to hear you're back online!
sage


> 
> 
> 
> On Mon, Aug 5, 2013 at 12:39 AM, Sage Weil  wrote:
>   On Sun, 4 Aug 2013, Yu Changyuan wrote:
> > And here is the log of ceph-mon, with debug_mon set to 10, I run
> "ceph -s"
> > command(which is blocked) on 192.168.1.2 during recording this log.
> >
> > https://gist.github.com/yuchangyuan/ba3e72452215221d1e82
> 
> I pushed one more patch to that branch that should get you up.  This
> one
> should go to master as well.
> 
> sage
> 
> >
> >
> > On Sun, Aug 4, 2013 at 3:25 PM, Yu Changyuan 
> wrote:
> >       I just try the branch, and mon start ok, here is the log:
> >       https://gist.github.com/yuchangyuan/3138952ac60508d18aed
> >       But ceph -s or ceph -w just block, without any message
> return(I
> >       just start monitor, no mds or osd).
> >
> >
> >
> > On Sun, Aug 4, 2013 at 12:23 PM, Yu Changyuan 
> > wrote:
> >
> >       On Sun, Aug 4, 2013 at 12:16 PM, Sage Weil
> >        wrote:
> >             It looks like the auth state wasn't trimmed
> >             properly.  It also sort of
> >             looks like you aren't using authentication on
> >             this cluster... is that
> >             true?  (The keyring file was empty.)
> >
> > Yes, your're right, I disable auth. It's just a personal
> > cluster, so the simpler the better.
> >
> >       This looks like a trim issue, but I don't remember
> >       what all we fixed since
> >       .1.. that was a while ago!  We certainly haven't
> >       seen anything like this
> >       recently.
> >
> >       I pushed a branch wip-mon-skip-auth-cuttlefish that
> >       skips the missing
> >       incrementals and will get your mon up, but you may
> >       lose some auth keys.
> >       If auth is on, you'll need ot add them back again.
> >        If not, it may just
> >       work with this.
> >
> >       You can grab the packages from
> >
> > http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-mon-skip-
> 
> >       auth-cuttlefish
> >
> >       or whatever the right dir is for your distro when
> >       they appear in about 15
> >       minutes.  Let me know if that resolves it.
> >
> >  
> > Thank you for your work, I will try as soon as possible.
> > PS: My distro is Gentoo, so maybe I should build from source
> > directly.
> >  
> >
> >       sage
> >
> >
> >       On Sun, 4 Aug 2013, Yu Changyuan wrote:
> >
> >       >
> >       >
> >       >
> >       > On Sun, Aug 4, 2013 at 12:13 AM, Sage Weil
> >        wrote:
> >       >       On Sat, 3 Aug 2013, Yu Changyuan wrote:
> >       >       > I run a tiny ceph cluster with only one
> >       monitor. After a
> >       >       reboot the system,
> >       >       > the monitor refuse to start.
> >       >       > I try to start ceph-mon manually with
> >       command 'ceph -f -i a',
> >       >        below is
> >       >       > first few lines of the output:
> >       >       >
> >       >       > starting mon.a rank 0 at
> >       192.168.1.10:6789/0 mon_data
> >       >       > /var/lib/ceph/mon/ceph-a fsid
> >       >       554bee60-9602-4017-a6e1-ceb6907a218c
> >       >       > mon/AuthMonitor.cc: In function 'virtual
> >       void
> >       >       > AuthMonitor::update_from_paxos()' thread
> >       7f9e3b0db780 time
> >       >       2013-08-03
> >       >       > 20:27:29.208156
> >       >       > mon/AuthMonitor.cc: 147: FAILED assert(ret
> >       == 0)
> >       >       >
> >       >       > The full log is at:
> >       >      
> > 

Re: [ceph-users] Ceph Hadoop Configuration

2013-08-05 Thread Scottix
Hey Noah,
Yes, it does look like an older version (0.56.6); I got it from the Ubuntu repo.
Is there another method or source I can pull from to get the latest? I am
having a hard time finding it.

Thanks


On Sun, Aug 4, 2013 at 10:33 PM, Noah Watkins wrote:

> Hey Scott,
>
> Things look OK, but I'm a little foggy on what exactly was shipping in
> the libcephfs-java jar file back at 0.61. There was definitely a time
> where Hadoop and libcephfs.jar in the Debian repos were out of sync,
> and that might be what you are seeing.
>
> Could you list the contents of the libcephfs.jar file, to see if
> CephPoolException.class is in there? It might just be that the
> libcephfs.jar is out-of-date.
>
> -Noah
>
> On Sun, Aug 4, 2013 at 8:44 PM, Scottix  wrote:
> > I am running into an issues connecting hadoop to my ceph cluster and I'm
> > sure I am missing something but can't figure it out.
> > I have a Ceph cluster with MDS running fine and I can do a basic mount
> > perfectly normal.
> > I have hadoop fs -ls with basic file:/// working well.
> >
> > Info:
> > ceph cluster version 0.61.7
> > Ubuntu Server 13.04 x86_64
> > hadoop 1.2.1-1 deb install (stable now I did try 1.1.2 same issue)
> > libcephfs-java both hadoop-cephfs.jar and libcephfs.jar show up in
> "hadoop
> > classpath"
> > libcephfs-jni with symlink trick
> /usr/share/hadoop/lib/native/Linux-amd64-64
> > listed here
> >
> http://thread.gmane.org/gmane.comp.file-systems.ceph.user/1788/focus=1806
> > and the LD_LIBRARY_PATH in hadoop-env.sh
> >
> > When I try to setup the ceph mount within Hadoop I get an exception
> >
> > $ hadoop fs -ls
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > com/ceph/fs/CephPoolException
> > at
> >
> org.apache.hadoop.fs.ceph.CephFileSystem.initialize(CephFileSystem.java:96)
> > at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
> > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
> > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:124)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:247)
> > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
> > at org.apache.hadoop.fs.FsShell.ls(FsShell.java:583)
> > at org.apache.hadoop.fs.FsShell.run(FsShell.java:1812)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> > at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)
> > Caused by: java.lang.ClassNotFoundException:
> com.ceph.fs.CephPoolException
> >
> > Followed the tutorial here
> > http://ceph.com/docs/next/cephfs/hadoop/
> >
> > core-site.xml settings
> > ...
> > <property>
> >   <name>fs.ceph.impl</name>
> >   <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
> > </property>
> > <property>
> >   <name>fs.default.name</name>
> >   <value>ceph://192.168.1.11:6789</value>
> > </property>
> > <property>
> >   <name>ceph.data.pools</name>
> >   <value>hadoop1</value>
> > </property>
> > <property>
> >   <name>ceph.auth.id</name>
> >   <value>admin</value>
> > </property>
> > <property>
> >   <name>ceph.auth.keyfile</name>
> >   <value>/etc/ceph/admin.secret</value>
> > </property>
> >
> > Any Help Appreciated
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
Follow Me: @Scottix 
http://about.me/scottix
scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Hadoop Configuration

2013-08-05 Thread Noah Watkins
The ceph.com repositories can be added on Ubuntu. Check out
http://ceph.com/docs/master/install/debian/ for details. If you
upgrade to the latest stable (cuttlefish) then all the dependencies
should be correct.
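
For cuttlefish on Ubuntu that amounts to roughly the following (see the linked
page for the authoritative steps; the key URL below is from memory):

  wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
  echo deb http://ceph.com/debian-cuttlefish/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
  sudo apt-get update && sudo apt-get install ceph libcephfs-java libcephfs-jni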

On Mon, Aug 5, 2013 at 9:38 AM, Scottix  wrote:
> Hey Noah,
> Yes it does look like an older version 56.6, I got it from the Ubuntu Repo.
> Is there another method or pull request I can run to get the latest? I am
> having a hard time finding it.
>
> Thanks
>
>
> On Sun, Aug 4, 2013 at 10:33 PM, Noah Watkins 
> wrote:
>>
>> Hey Scott,
>>
>> Things look OK, but I'm a little foggy on what exactly was shipping in
>> the libcephfs-java jar file back at 0.61. There was definitely a time
>> where Hadoop and libcephfs.jar in the Debian repos were out of sync,
>> and that might be what you are seeing.
>>
>> Could you list the contents of the libcephfs.jar file, to see if
>> CephPoolException.class is in there? It might just be that the
>> libcephfs.jar is out-of-date.
>>
>> -Noah
>>
>> On Sun, Aug 4, 2013 at 8:44 PM, Scottix  wrote:
>> > I am running into an issues connecting hadoop to my ceph cluster and I'm
>> > sure I am missing something but can't figure it out.
>> > I have a Ceph cluster with MDS running fine and I can do a basic mount
>> > perfectly normal.
>> > I have hadoop fs -ls with basic file:/// working well.
>> >
>> > Info:
>> > ceph cluster version 0.61.7
>> > Ubuntu Server 13.04 x86_64
>> > hadoop 1.2.1-1 deb install (stable now I did try 1.1.2 same issue)
>> > libcephfs-java both hadoop-cephfs.jar and libcephfs.jar show up in
>> > "hadoop
>> > classpath"
>> > libcephfs-jni with symlink trick
>> > /usr/share/hadoop/lib/native/Linux-amd64-64
>> > listed here
>> >
>> > http://thread.gmane.org/gmane.comp.file-systems.ceph.user/1788/focus=1806
>> > and the LD_LIBRARY_PATH in hadoop-env.sh
>> >
>> > When I try to setup the ceph mount within Hadoop I get an exception
>> >
>> > $ hadoop fs -ls
>> > Exception in thread "main" java.lang.NoClassDefFoundError:
>> > com/ceph/fs/CephPoolException
>> > at
>> >
>> > org.apache.hadoop.fs.ceph.CephFileSystem.initialize(CephFileSystem.java:96)
>> > at
>> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
>> > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>> > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
>> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
>> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:124)
>> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:247)
>> > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>> > at org.apache.hadoop.fs.FsShell.ls(FsShell.java:583)
>> > at org.apache.hadoop.fs.FsShell.run(FsShell.java:1812)
>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> > at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)
>> > Caused by: java.lang.ClassNotFoundException:
>> > com.ceph.fs.CephPoolException
>> >
>> > Followed the tutorial here
>> > http://ceph.com/docs/next/cephfs/hadoop/
>> >
>> > core-site.xml settings
>> > ...
>> > <property>
>> >   <name>fs.ceph.impl</name>
>> >   <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
>> > </property>
>> > <property>
>> >   <name>fs.default.name</name>
>> >   <value>ceph://192.168.1.11:6789</value>
>> > </property>
>> > <property>
>> >   <name>ceph.data.pools</name>
>> >   <value>hadoop1</value>
>> > </property>
>> > <property>
>> >   <name>ceph.auth.id</name>
>> >   <value>admin</value>
>> > </property>
>> > <property>
>> >   <name>ceph.auth.keyfile</name>
>> >   <value>/etc/ceph/admin.secret</value>
>> > </property>
>> >
>> > Any Help Appreciated
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
>
>
> --
> Follow Me: @Scottix
> http://about.me/scottix
> scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread Brian Candler

On 05/08/2013 17:15, Mike Dawson wrote:


Short answer: Ceph generally is used with multiple OSDs per node. One 
OSD per storage drive with no RAID is the most common setup. At 24- or 
36-drives per chassis, there are several potential bottlenecks to 
consider.


Mark Nelson, the Ceph performance guy at Inktank, has published 
several articles you should consider reading. A few of interest are 
[0], [1], and [2]. The last link is a 5-part series.


Yep, I saw [0] and [1]. He tries a 6-disk RAID0 array and generally gets 
lower throughput than 6 x JBOD disks (although I think he's using the 
controller RAID0 functionality, rather than mdraid).


AFAICS he has a 36-disk chassis but only runs tests with 6 disks, which 
is a shame as it would be nice to know which other bottleneck you could 
hit first with this type of setup.


Also, note that there is on-going work to add erasure coding as a 
optional backend (as opposed to the current replication scheme). If 
you prioritize bulk storage over performance, you may be interested in 
following the progress [3], [4], and [5].


[0]: 
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1]: 
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2]: 
http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/
[3]: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[4]: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[5]: http://www.inktank.com/about-inktank/roadmap/

Thank you - erasure coding is very much of interest for the 
archival-type storage I'm looking at. However, your links [3] and [4] are 
identical; did you mean to link to a different one?


Cheers,

Brian.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread Mike Dawson


On 8/5/2013 12:51 PM, Brian Candler wrote:

On 05/08/2013 17:15, Mike Dawson wrote:


Short answer: Ceph generally is used with multiple OSDs per node. One
OSD per storage drive with no RAID is the most common setup. At 24- or
36-drives per chassis, there are several potential bottlenecks to
consider.

Mark Nelson, the Ceph performance guy at Inktank, has published
several articles you should consider reading. A few of interest are
[0], [1], and [2]. The last link is a 5-part series.


Yep, I saw [0] and [1]. He tries a 6-disk RAID0 array and generally gets
lower throughput than 6 x JBOD disks (although I think he's using the
controller RAID0 functionality, rather than mdraid).

AFAICS he has a 36-disk chassis but only runs tests with 6 disks, which
is a shame as it would be nice to know which other bottleneck you could
hit first with this type of setup.


The third link I sent shows Mark's results with 24 spinners and 8 SSDs 
for journals. Specifically read:


http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/#setup

Florian Haas has also published some thoughts on bottlenecks:

http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals




Also, note that there is on-going work to add erasure coding as a
optional backend (as opposed to the current replication scheme). If
you prioritize bulk storage over performance, you may be interested in
following the progress [3], [4], and [5].

[0]:
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/

[1]:
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/

[2]:
http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/

[3]:
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[4]:
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

[5]: http://www.inktank.com/about-inktank/roadmap/


Thank you - erasure coding is very much of interest for the
archival-type storage I'm looking at. However your links [3] and [4] are
identical, did you mean to link to another one?


Oops.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29




Cheers,

Brian.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] trouble authenticating after bootstrapping monitors

2013-08-05 Thread Kevin Weiler
Thanks for looking Sage,

I came to this conclusion myself as well, and it seemed to work. I'm
trying to manually replicate a ceph cluster that was made with
ceph-deploy. I noted that these capabilities entries were not present in
the ceph-deploy cluster. Does ceph-deploy do something special when creating
the client.admin key so that it doesn't need capabilities? Thanks again!
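
For reference, the keyring that works for me now looks roughly like this (same
keys as quoted below, plus the caps lines Sage pointed out; client.admin will
likely also want osd/mds caps for normal use):

[mon.]
	key = AQD6yftRkKY3NxAA5VNbtUM23C3uPqUUXYSHeQ==
	caps mon = "allow *"

[client.admin]
	key = AQANyvtRYDHCCxAAwgcgdMJ9ue64m6+enYONOw==
	caps mon = "allow *"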

--

Kevin Weiler

IT


IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
60606 | http://imc-chicago.com/

Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail:
kevin.wei...@imc-chicago.com







On 8/3/13 11:16 AM, "Sage Weil"  wrote:

>On Fri, 2 Aug 2013, Kevin Weiler wrote:
>> I'm having some trouble bootstrapping my monitors using this page as a
>> guide:
>>
>> http://ceph.com/docs/next/dev/mon-bootstrap/
>>
>> I can't seem to authenticate to my monitors with client.admin after I've
>> created them and started them:
>>
>
>You also need
>
>> [root@camelot ~]# cat /etc/ceph/ceph.keyring
>> [mon.]
>>  key = AQD6yftRkKY3NxAA5VNbtUM23C3uPqUUXYSHeQ==
>caps mon = allow *
>
>> [client.admin]
>>  key = AQANyvtRYDHCCxAAwgcgdMJ9ue64m6+enYONOw==
>caps mon = allow *
>
>so that the mon knows these users are allowed to do everything.
>
>sage
>
>>
>> [root@camelot ~]# monmaptool --create --add camelot 10.198.1.3:6789
>>monmap
>> monmaptool: monmap file monmap
>> monmaptool: generated fsid 87a5f355-f7be-43aa-b26c-b6ad23f371bb
>> monmaptool: writing epoch 0 to monmap (1 monitors)
>>
>> [root@camelot ~]# ceph-mon --mkfs -i camelot --monmap monmap --keyring
>>/etc/
>> ceph/ceph.keyring
>> ceph-mon: created monfs at /srv/mon.camelot for mon.camelot
>>
>> [root@camelot ~]# service ceph start
>> === mon.camelot ===
>> Starting Ceph mon.camelot on camelot...
>> === mds.camelot ===
>> Starting Ceph mds.camelot on camelot...
>> starting mds.camelot at :/0
>>
>> [root@camelot ~]# ceph auth get mon.
>> access denied
>>
>> If someone could tell me what I'm doing wrong it would be greatly
>> appreciated. Thanks!
>>
>>
>> --
>>
>> Kevin Weiler
>>
>> IT
>>
>>
>>
>> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
>>60606
>> | http://imc-chicago.com/
>>
>> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 |
>> E-Mail: kevin.wei...@imc-chicago.com
>>
>>
>>
>>_
>>___
>>
>> The information in this e-mail is intended only for the person or
>>entity to
>> which it is addressed.
>>
>> It may contain confidential and /or privileged material. If someone
>>other
>> than the intended recipient should receive this e-mail, he / she shall
>>not
>> be entitled to read, disseminate, disclose or duplicate it.
>>
>> If you receive this e-mail unintentionally, please inform us
>>immediately by
>> "reply" and then delete it from your system. Although this information
>>has
>> been compiled with great care, neither IMC Financial Markets & Asset
>> Management nor any of its related entities shall accept any
>>responsibility
>> for any errors, omissions or other inaccuracies in this information or
>>for
>> the consequences thereof, nor shall it be bound in any way by the
>>contents
>> of this e-mail or its attachments. In the event of incomplete or
>>incorrect
>> transmission, please return the e-mail to the sender and permanently
>>delete
>> this message and any attachments.
>>
>> Messages and attachments are scanned for all known viruses. Always scan
>> attachments before opening them.
>>
>>




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Trying to identify performance bottlenecks

2013-08-05 Thread Lincoln Bryant
Hi all,

I'm trying to identify the performance bottlenecks in my experimental Ceph 
cluster. A little background on my setup:
10 storage servers, each configured with:
-(2) dual-core opterons
-8 GB of RAM
-(6) 750GB disks (1 OSD per disk, 7200 RPM SATA, probably 4-5 
years old). JBOD w/ BTRFS
-1GbE
-CentOS 6.4, custom kernel 3.7.8
1 dedicated mds/mon server
-same specs at OSD nodes
(2 more dedicated mons waiting in the wings, recently reinstalled ceph)
1 front-facing node mounting CephFS, with a 10GbE connection into the 
switch stack housing the storage machines
-CentOS 6.4, custom kernel 3.7.8

Some Ceph settings:
[osd]
osd journal size = 1000
filestore xattr use omap = true

When I try to transfer files in/out via CephFS (10GbE host), I'm seeing only 
about 230MB/s at peak. First, is this what I should expect? Given 60 OSDs 
spread across 10 servers, I would have thought I'd get something closer to 
400-500 MB/s or more. I tried upping the number of placement groups to 3000 for 
my 'data' pool (following the formula here: 
http://ceph.com/docs/master/rados/operations/placement-groups/) with no 
increase in performance. I also saw no performance difference between XFS and 
BTRFS.

I also see a lot of messages like this in the log: 
10.1.6.4:6815/30138 3518 : [WRN] slow request 30.874441 seconds old, received 
at 2013-07-31 10:52:49.721518: osd_op(client.7763.1:67060 10003ba.13d4 
[write 0~4194304] 0.102b9365 RETRY=-1 snapc 1=[] e1454) currently waiting for 
subops from [1]

Does anyone have any thoughts as to what the bottleneck may be, if there is 
one? Or, any idea what I should try to measure to determine the bottleneck?

Perhaps my disks are just that bad? :)

Cheers,
Lincoln
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-05 Thread Alex Elder
On 08/04/2013 08:07 PM, Olivier Bonvalet wrote:
> 
> Hi,
> 
> I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
> 3.9.11 to Linux 3.10.5, and now I have kernel panic after launching some
> VM which use RBD kernel client. 

A crash like this was reported last week.  I started looking at it
but I don't believe I ever sent out my findings.

The problem is that while formatting the write request it's
exhausting the space available in the front buffer for the
request message.  The size of that buffer is established at
request creation time, when rbd_osd_req_create() gets called
inside rbd_img_request_fill().

I think this is another unfortunate result of not setting the
image request pointer early enough.  Sort of related to this:

commit d2d1f17a0dad823a4cb71583433d26cd7f734e08
Author: Josh Durgin 
Date:   Wed Jun 26 12:56:17 2013 -0700

rbd: send snapshot context with writes

That is, when the osd request gets created, the object request
has not been associated with the image request yet.  And as a
result, the size set aside for the front of the osd write request
message does not take into account the bytes required to hold the
snapshot context.

It's possible a simple fix will be to move the call to
rbd_img_obj_request_add() in rbd_img_request_fill() even
further up, just after verifying the obj_request allocated
via rbd_obj_request_create() is non-null.

I haven't really verified this will work though, but it's a
hint at what might work.

-Alex


> 
> 
> In kernel logs, I have :
> 
> Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at 
> net/ceph/osd_client.c:2103!
> Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  [#1] 
> SMP 
> Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc rbd 
> libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter 
> ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT xt_tcpudp 
> iptable_filter ip_tables x_tables bridge loop coretemp ghash_clmulni_intel 
> aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt 
> iTCO_vendor_support gpio_ich microcode serio_raw sb_edac edac_core evdev 
> lpc_ich i2c_i801 mfd_core wmi ac ioatdma shpchp button dm_mod hid_generic 
> usbhid hid sg sd_mod crc_t10dif crc32c_intel isci megaraid_sas libsas ahci 
> libahci ehci_pci ehci_hcd libata scsi_transport_sas igb scsi_mod i2c_algo_bit 
> ixgbe usbcore i2c_core dca usb_common ptp pps_core mdio
> Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm: 
> blkback.3.xvda Not tainted 3.10-dae-dom0 #1
> Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: Supermicro 
> X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 ti: 
> 88003803a000 task.ti: 88003803a000
> Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: 
> e030:[]  [] 
> ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
> Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: e02b:88003803b9f8  
> EFLAGS: 00010212
> Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 RBX: 
> 880033a182ec RCX: 
> Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af RSI: 
> 8050 RDI: 880030d34888
> Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 R08: 
> 88003803ba58 R09: 
> Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  R11: 
>  R12: 880033ba3500
> Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 R14: 
> 88003847aa78 R15: 88003847ab58
> Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  7f775da8c700() 
> GS:88003f84() knlGS:
> Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES:  
> CR0: 80050033
> Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 CR3: 
> 2be14000 CR4: 00042660
> Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  DR1: 
>  DR2: 
> Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  DR6: 
> 0ff0 DR7: 0400
> Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
> Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000 
> 00243847aa78  880039949b40
> Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201 
> 880033811d98 88003803ba80 88003847aa78
> Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380 
> 880002a38400 2000 a029584c
> Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
> Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ? 
> rbd_osd_req_format_write+0x71/0x7c [rbd]
> Aug  5 02:51:22 murmillia kernel: [  289.213459]  [] ? 
> rbd_img_request_fill+0x695/0x736 [rbd]
>

Re: [ceph-users] inconsistent pg: no 'snapset' attr

2013-08-05 Thread John Nielsen
Can no one shed any light on this?

On Jul 30, 2013, at 1:51 PM, John Nielsen  wrote:

> I am running a ceph cluster with 24 OSD's across 3 nodes, Cuttlefish 0.61.3. 
> Recently an inconsistent PG cropped up:
> 
> # ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 11.2d5 is active+clean+inconsistent, acting [5,22,9]
> 1 scrub errors
> 
> Pool 11 is .rgw.buckets, used by a RADOS gateway on a separate machine.
> 
> I tried to repair the pg but it was ineffective. In the OSD log, I see this:
> 
> 2013-07-30 13:26:00.948312 7f83a9d32700  0 log [ERR] : scrub 11.2d5 
> 33c382d5/170509.178_fc845fcfbb504f0eb87c9061ebbaf477/head//11 no 'snapset' 
> attr
> 2013-07-30 13:26:03.358112 7f83a9d32700  0 log [ERR] : 11.2d5 scrub 1 errors
> 
> What does it mean, how did it happen and how can I fix it?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

2013-08-05 Thread Mike Dawson

Josh,

Logs are uploaded to cephdrop with the file name 
mikedawson-rbd-qemu-deadlock.


- At about 2013-08-05 19:46 or 19:47, we hit the issue and traffic went to 0
- At about 2013-08-05 19:53:51, ran a 'virsh screenshot'


Environment is:

- Ceph 0.61.7 (client is co-mingled with three OSDs)
- rbd cache = true and cache=writeback
- qemu 1.4.0 1.4.0+dfsg-1expubuntu4
- Ubuntu Raring with 3.8.0-25-generic

This issue is reproducible in my environment, and I'm willing to run any 
wip branch you need. What else can I provide to help?


Thanks,
Mike Dawson


On 8/5/2013 3:48 AM, Stefan Hajnoczi wrote:

On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:

Am 02.08.2013 um 23:47 schrieb Mike Dawson :

We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh 
screenshot' command. After that, the guest resumes and runs as expected. At that point we 
can examine the guest. Each time we'll see:


If virsh screenshot works then this confirms that QEMU itself is still
responding.  Its main loop cannot be blocked since it was able to
process the screendump command.

This supports Josh's theory that a callback is not being invoked.  The
virtio-blk I/O request would be left in a pending state.

Now here is where the behavior varies between configurations:

On a Windows guest with 1 vCPU, you may see the symptom that the guest no
longer responds to ping.

On a Linux guest with multiple vCPUs, you may see the hung task message
from the guest kernel because other vCPUs are still making progress.
Just the vCPU that issued the I/O request and whose task is in
UNINTERRUPTIBLE state would really be stuck.

Basically, the symptoms depend not just on how QEMU is behaving but also
on the guest kernel and how many vCPUs you have configured.
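
A quick way to confirm that from inside a Linux guest is to look for
tasks sitting in uninterruptible sleep ("D" state), e.g. with
ps -eo pid,stat,comm.  A minimal sketch of the same check in C --
assuming a standard /proc layout and with error handling kept to a
minimum -- looks like this:

/* List tasks currently in uninterruptible sleep ("D"), which is the
 * state the hung-task watchdog complains about.  Sketch only: assumes
 * /proc is mounted and skips most error handling. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;

    if (!proc)
        return 1;
    while ((de = readdir(proc)) != NULL) {
        char path[280], line[512];
        FILE *f;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;
        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(line, sizeof(line), f)) {
            /* The state field follows the ")" that closes the comm. */
            char *p = strrchr(line, ')');

            if (p && p[1] == ' ' && p[2] == 'D')
                printf("D-state task: %s", line);
        }
        fclose(f);
    }
    closedir(proc);
    return 0;
}

In this hang, the task you would expect to find stuck is the one that
issued the pending virtio-blk request.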

I think this can explain how both problems you are observing, Oliver and
Mike, are a result of the same bug.  At least I hope they are :).

Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread James Harper
> I am looking at evaluating ceph for use with large storage nodes (24-36 SATA
> disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).
> 
> What would be the best practice for deploying this? I can see two main
> options.
> 
> (1) Run 24-36 osds per node. Configure ceph to replicate data to one or more
> other nodes. This means that if a disk fails, there will have to be an
> operational process to stop the osd, unmount and replace the disk, mkfs a
> new filesystem, mount it, and restart the osd - which could be more
> complicated and error-prone than a RAID swap would be.
> 
> (2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and run
> one osd per node. In this case:
> * if I use RAID0 or LVM, then a single disk failure will cause all the data 
> on the
> node to be lost and rebuilt
> * if I use RAID5/6, then write performance is likely to be poor
> * if I use RAID10, then capacity is reduced by half; with ceph replication 
> each
> piece of data will be replicated 4 times (twice on one node, twice on the
> replica node)
> 
> It seems to me that (1) is what ceph was designed to achieve, maybe with 2
> or 3 replicas. Is this what's recommended?
> 

There is a middle ground to consider - 12-18 OSD's each running on a pair of 
disks in a RAID1 configuration. This would reduce most disk failures to a 
simple disk swap (assuming an intelligent hardware RAID controller). Obviously 
you still have a 50% reduction in disk space, but you have the advantage that 
your filesystem never sees the bad disk and all the problems that can cause.
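
For a rough sense of the trade-off, assuming a node of 24 x 4 TB disks
and 2x or 3x Ceph replication (numbers are purely illustrative, not a
recommendation), the usable capacity and the number of physical copies
per object work out as follows:

/* Back-of-envelope only: usable capacity and physical copies per
 * object for a 24 x 4 TB node under a few layouts.  Assumed numbers,
 * not a recommendation. */
#include <stdio.h>

static void show(const char *layout, double raw_tb, double raid_factor,
                 int ceph_replicas)
{
    double usable = raw_tb * raid_factor / ceph_replicas;
    double copies = ceph_replicas / raid_factor;

    printf("%-22s usable %5.1f TB, %2.0f physical copies per object\n",
           layout, usable, copies);
}

int main(void)
{
    double raw = 24 * 4.0;  /* 96 TB raw per node */

    show("JBOD, ceph size=2",   raw, 1.0, 2);
    show("JBOD, ceph size=3",   raw, 1.0, 3);
    show("RAID1 pairs, size=2", raw, 0.5, 2);  /* 4 copies total */
    return 0;
}

The RAID1-pairs layout keeps 4 physical copies of everything, which is
the 50% space cost mentioned above on top of the usual replication.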

James

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread Scottix
In the previous email, you are forgetting that RAID1 has a write penalty of
2, since it is mirroring, and at that point we are talking about different
types of RAID and not really about Ceph. One of the main advantages of Ceph
is that data is replicated by Ceph itself, so you don't have to rely on RAID
to that degree. I am sure there is math for this, but a larger quantity of
smaller nodes has better fail-over than a few large nodes. If you are
competing over CPU resources then you can use RAID0 with minimal write
penalty (never thought I'd suggest RAID0, haha). You may not max out the
drive speed because of CPU, but that is the cost of switching to a data
system the machine was not intended for. It would be good to know the limits
of what a machine can do with Ceph, so please do share if you run some
tests.

Overall, from my understanding, it is generally better to move to the ideal
node size for Ceph and then slowly deprecate the larger nodes; fundamentally,
since replication is done at a higher level than the individual spinners,
the case for doing RAID falls further behind.


On Mon, Aug 5, 2013 at 5:05 PM, James Harper
wrote:

> > I am looking at evaluating ceph for use with large storage nodes (24-36
> SATA
> > disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).
> >
> > What would be the best practice for deploying this? I can see two main
> > options.
> >
> > (1) Run 24-36 osds per node. Configure ceph to replicate data to one or
> more
> > other nodes. This means that if a disk fails, there will have to be an
> > operational process to stop the osd, unmount and replace the disk, mkfs a
> > new filesystem, mount it, and restart the osd - which could be more
> > complicated and error-prone than a RAID swap would be.
> >
> > (2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and
> run
> > one osd per node. In this case:
> > * if I use RAID0 or LVM, then a single disk failure will cause all the
> data on the
> > node to be lost and rebuilt
> > * if I use RAID5/6, then write performance is likely to be poor
> > * if I use RAID10, then capacity is reduced by half; with ceph
> replication each
> > piece of data will be replicated 4 times (twice on one node, twice on the
> > replica node)
> >
> > It seems to me that (1) is what ceph was designed to achieve, maybe with
> 2
> > or 3 replicas. Is this what's recommended?
> >
>
> There is a middle ground to consider - 12-18 OSD's each running on a pair
> of disks in a RAID1 configuration. This would reduce most disk failures to
> a simple disk swap (assuming an intelligent hardware RAID controller).
> Obviously you still have a 50% reduction in disk space, but you have the
> advantage that your filesystem never sees the bad disk and all the problems
> that can cause.
>
> James
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Follow Me: @Scottix 
http://about.me/scottix
scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large storage nodes - best practices

2013-08-05 Thread James Harper
> 
> In the previous email, you are forgetting that RAID1 has a write penalty of
> 2, since it is mirroring, and at that point we are talking about different
> types of RAID and not really about Ceph. One of the main advantages of Ceph
> is that data is replicated by Ceph itself, so you don't have to rely on RAID
> to that degree. I am sure there is math for this, but a larger quantity of
> smaller nodes has better fail-over than a few large nodes. If you are
> competing over CPU resources then you can use RAID0 with minimal write
> penalty (never thought I'd suggest RAID0, haha). You may not max out the
> drive speed because of CPU, but that is the cost of switching to a data
> system the machine was not intended for. It would be good to know the
> limits of what a machine can do with Ceph, so please do share if you run
> some tests.
> 
> Overall, from my understanding, it is generally better to move to the ideal
> node size for Ceph and then slowly deprecate the larger nodes;
> fundamentally, since replication is done at a higher level than the
> individual spinners, the case for doing RAID falls further behind.
> 

The reason for suggesting RAID1 was to ease the job of disk replacement, and 
also to minimise the chance of crashing. With 1 or 2 OSD's per node and many 
nodes it doesn't really matter if a screwy disk brings down your system. With 
24 OSD's per node, bringing down a node is more of a big deal. Or maybe there 
is no chance of a failing disk causing an XFS oops these days? (I know it has 
in the past)

Also, I think there won't be sufficient network bandwidth to saturate 24 
disks, so bringing it down to 12 RAID1 sets isn't such a problem. The RAID1 
write penalty isn't significant for throughput, as the writes are done in 
parallel, and you can get increased performance on reads, I think.
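
As a rough sanity check of that, with assumed round numbers (about
120 MB/s of sequential throughput per SATA spinner and roughly
1250 MB/s of payload on a 10G link):

/* Rough numbers only -- per-disk throughput varies a lot in practice. */
#include <stdio.h>

int main(void)
{
    const double disk_mbs = 120.0;   /* assumed MB/s per SATA disk */
    const double nic_mbs  = 1250.0;  /* ~10 Gbit/s of payload */
    const int disks = 24;

    printf("aggregate disk bandwidth: %.0f MB/s\n", disks * disk_mbs);
    printf("disks a single 10G link can keep busy: %.1f\n",
           nic_mbs / disk_mbs);
    return 0;
}

i.e. with these assumptions a single 10G link can only keep roughly 10
of the 24 spinners busy for sequential work.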

I don't use RAID1 on my setup, but then I don't have 24-36 disks per node!

James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CDS Day One Videos

2013-08-05 Thread Ross David Turk

Hi, all!  Thanks to those who attended the first day (unless you are in
Europe, where they all happen on the same glorious day) of the Ceph
Developer Summit.

For those of you who couldn’t make it, I have added links to videos of
the sessions to the wiki page - it’s actually just three large videos,
but the links should take you to the approximate start of each discussion.

http://ceph.com/cds (click on the video links near each session)

Cheers,
Ross

--
Ross Turk
Community, Inktank

@rossturk @inktank @ceph

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com