Re: [ceph-users] Weird problem with mkcephfs
Although it doesn't attempt to log in to my other machines as I thought it was designed to do (and as I know it did the last time I built a cluster). Not sure what I'm doing wrong.

-Steve

On 03/23/2013 10:35 PM, Steve Carter wrote:

I changed:

    for k in $dir/key.*

to:

    for k in $dir/key*

and it appeared to run correctly:

root@smon:/etc/ceph# mkcephfs -a -c /etc/ceph/ceph.conf -d /tmp -k /etc/ceph/keyring
preparing monmap in /tmp/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789 --print /tmp/monmap
/usr/bin/monmaptool: monmap file /tmp/monmap
/usr/bin/monmaptool: generated fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
epoch 0
fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
last_changed 2013-03-23 22:33:26.254974
created 2013-03-23 22:33:26.254974
0: 192.168.0.253:6789/0 mon.a
/usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
Building generic osdmap from /tmp/conf
/usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
/usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
Generating admin key at /tmp/keyring.admin
creating /tmp/keyring.admin
Building initial monitor keyring
placing client.admin keyring in /etc/ceph/keyring

On 03/23/2013 10:29 PM, Steve Carter wrote:

The part of the mkcephfs code below, specifically the 'for' loop, seems responsible for this. I wonder if I installed from the wrong place? I installed from the Ubuntu source rather than the Ceph source.

# admin keyring
echo Generating admin key at $dir/keyring.admin
$BINDIR/ceph-authtool --create-keyring --gen-key -n client.admin $dir/keyring.admin

# mon keyring
echo Building initial monitor keyring
cp $dir/keyring.admin $dir/keyring.mon
$BINDIR/ceph-authtool -n client.admin --set-uid=0 \
    --cap mon 'allow *' \
    --cap osd 'allow *' \
    --cap mds 'allow' \
    $dir/keyring.mon

$BINDIR/ceph-authtool --gen-key -n mon. $dir/keyring.mon

for k in $dir/key.*
do
    kname=`echo $k | sed 's/.*key\.//'`
    ktype=`echo $kname | cut -c 1-3`
    kid=`echo $kname | cut -c 4- | sed 's/^\\.//'`
    kname="$ktype.$kid"
    secret=`cat $k`
    if [ "$ktype" = "osd" ]; then
        $BINDIR/ceph-authtool -n $kname --add-key $secret $dir/keyring.mon \
            --cap mon 'allow rwx' \
            --cap osd 'allow *'
    fi
    if [ "$ktype" = "mds" ]; then
        $BINDIR/ceph-authtool -n $kname --add-key $secret $dir/keyring.mon \
            --cap mon "allow rwx" \
            --cap osd 'allow *' \
            --cap mds 'allow'
    fi
done

exit 0
fi

On 03/23/2013 01:50 PM, Steve Carter wrote:

This is consistently repeatable on my system. This is the latest of two cluster builds I have done, and it is a brand new deployment on hardware I haven't deployed on previously.

The error below references /tmp/key.*, while the keyring files that actually exist are keyring.*; because no key.* files match, the unexpanded glob /tmp/key.* is handed to cat literally, which produces the "No such file or directory" message.

Any help is much appreciated.

root@mon:~# uname -a
Linux mon.X.com 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
root@mon:~# ceph -v
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
root@mon:~# ls -al /tmp/
total 8
drwxrwxrwx  2 root root 4096 Mar 23 12:15 .
drwxr-xr-x 25 root root 4096 Mar 22 23:18 ..
root@mon:~# ls -al / | grep tmp
drwxrwxrwx  2 root root 4096 Mar 23 12:15 tmp
root@mon:~# mkcephfs -d /tmp -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring
preparing monmap in /tmp/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789 --print /tmp/monmap
/usr/bin/monmaptool: monmap file /tmp/monmap
/usr/bin/monmaptool: generated fsid 68b9c724-21c0-4d54-8237-674ced7adbfe
epoch 0
fsid 68b9c724-21c0-4d54-8237-674ced7adbfe
last_changed 2013-03-23 12:17:03.087018
created 2013-03-23 12:17:03.087018
0: 192.168.0.253:6789/0 mon.a
/usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
Building generic osdmap from /tmp/conf
/usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
/usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
Generating admin key at /tmp/keyring.admin
creating /tmp/keyring.admin
Building initial monitor keyring
cat: /tmp/key.*: No such file or directory
root@mon:~# ls -al /tmp/
total 32
drwxrwxrwx  2 root root 4096 Mar 23 12:17 .
drwxr-xr-x 25 root root 4096 Mar 22 23:18 ..
-rw-r--r--  1 root root  695 Mar 23 12:17 conf
-rw-------  1 root root   63 Mar 23 12:17 keyring.admin
-rw-------  1 root root  192 Mar 23 12:17 keyring.mon
-rw-r--r--  1 root root  187 Mar 23 12:17 monmap
-rw-r--r--  1 root root 6886 Mar 23 12:17 osdmap
[ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi list,

We have hit and reproduced this issue several times: ceph-osd will suicide because "FileStore: sync_entry timed out" after very heavy random IO on top of RBD.

My test environment is:

    4-node ceph cluster with 20 HDDs for OSDs and 4 Intel DC S3700 SSDs for journals per node, i.e. 80 spindles in total.
    48 VMs spread across 12 physical nodes, with 48 RBDs attached to the VMs 1:1 via Qemu.
    Ceph 0.58.
    XFS is used.

I am using aiostress (something like FIO) to produce random write requests on top of each RBD.

From ceph -w, ceph reports a very high op rate (1+ /s), but technically 80 spindles can only provide up to 150*80/2 = 6000 IOPS for 4K random writes (the /2 presumably accounting for two-way replication).

When digging into the code, I found that the OSD writes data to the page cache and then returns. It does call ::sync_file_range, but that syscall doesn't actually sync data to disk before it returns; it is an async call. So the situation is: random writes are extremely fast since they only go to the journal and the page cache, but once syncing starts it takes a very long time. The speed gap between the journal and the OSD data disks persists, the amount of data that needs to be synced keeps increasing, and it will eventually exceed 600s.

For more information: I have tried to reproduce this with rados bench, but failed.

Could you please let me know if you need any more information, and whether you have some solutions? Thanks.

Xiaoxi
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi,

this could be related to this issue, which has been reported multiple times:

http://tracker.ceph.com/issues/3737

In short: they're working on it, they know about it.

Wolfgang
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi Wolfgang,

Thanks for the reply, but why would my problem be related to issue #3737? I cannot find any direct link between them. I didn't turn on the QEMU cache, and my QEMU/VMs work fine.

Xiaoxi
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi Xiaoxi,

sorry, I thought you were testing within VMs with caching turned on (I assumed; you didn't tell us whether you really ran your benchmark inside VMs and, if not, how you tested RBD outside of VMs). It just triggered an alarm in me because we had also experienced issues when benchmarking inside a VM (it didn't crash, but it responded extremely slowly).

Wolfgang
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi,

On 2013-3-25, at 17:30, "Wolfgang Hennerbichler" wrote:

> sorry, I thought you were testing within VMs with caching turned on (I
> assumed; you didn't tell us whether you really ran your benchmark inside
> VMs and, if not, how you tested RBD outside of VMs).

Yes, I really am testing within VMs.

> It just triggered an alarm in me because we had also experienced issues
> when benchmarking inside a VM (it didn't crash, but it responded extremely
> slowly).

OK, but my VM didn't crash; it's the ceph-osd daemon that crashed. So is it safe for me to say the issue I hit is a different one (not #3737)?

> Wolfgang

Xiaoxi
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
On 03/25/2013 10:35 AM, Chen, Xiaoxi wrote:

> OK, but my VM didn't crash; it's the ceph-osd daemon that crashed. So is it
> safe for me to say the issue I hit is a different one (not #3737)?

Yes, then it surely is a different issue. Actually, you just said Ceph crashed, with no mention of an OSD, so it was hard to find out :)

Wolfgang
[ceph-users] kernel BUG when mapping a nonexistent rbd device
Hi,

Apologies if this is already a known bug (though I didn't find it). If we try to map a device that doesn't exist, we get an immediate and reproducible kernel BUG (see the P.S.). We hit this by accident because we forgot to add --pool.

This works:

[root@afs245 /]# rbd map afs254-vicepa --pool afs --id afs --keyring /etc/ceph/ceph.client.afs.keyring
[root@afs245 /]# rbd showmapped
id pool image         snap device
1  afs  afs254-vicepa -    /dev/rbd1

But this BUGs:

[root@afs245 /]# rbd map afs254-vicepa
BUG...

Any clue?

Cheers, Dan, CERN IT

Mar 25 11:48:25 afs245 kernel: kernel BUG at mm/slab.c:3130!
Mar 25 11:48:25 afs245 kernel: invalid opcode: [#1] SMP
Mar 25 11:48:25 afs245 kernel: Modules linked in: rbd libceph libcrc32c cpufreq_ondemand ipv6 ext2 iTCO_wdt iTCO_vendor_support coretemp acpi_cpufreq freq_table mperf kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr serio_raw i2c_i801 lpc_ich joydev e1000e ses enclosure sg ixgbe hwmon dca ptp pps_core mdio ext3 jbd mbcache sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci 3w_9xxx mpt2sas scsi_transport_sas raid_class video mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod
Mar 25 11:48:25 afs245 kernel: CPU 3
Mar 25 11:48:25 afs245 kernel: Pid: 7444, comm: rbd Not tainted 3.8.4-1.el6.elrepo.x86_64 #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM
Mar 25 11:48:25 afs245 kernel: RIP: 0010:[] [] cache_alloc_refill+0x270/0x3c0
Mar 25 11:48:25 afs245 kernel: RSP: 0018:8808028e5c48 EFLAGS: 00010082
Mar 25 11:48:25 afs245 kernel: RAX: RBX: 88082f000e00 RCX: 88082f000e00
Mar 25 11:48:25 afs245 kernel: RDX: 8808055fba80 RSI: 88082f0028d0 RDI: 88082f002900
Mar 25 11:48:25 afs245 kernel: RBP: 8808028e5ca8 R08: 88082f0028e0 R09: 8808010068c0
Mar 25 11:48:25 afs245 kernel: R10: dead00200200 R11: 0003 R12:
Mar 25 11:48:25 afs245 kernel: R13: 880807a71ec0 R14: 88082f0028c0 R15: 0004
Mar 25 11:48:25 afs245 kernel: FS: 7ff85056e760() GS:88082fd8() knlGS:
Mar 25 11:48:25 afs245 kernel: CS: 0010 DS: ES: CR0: 80050033
Mar 25 11:48:25 afs245 kernel: CR2: 00428220 CR3: 0007eee7e000 CR4: 001407e0
Mar 25 11:48:25 afs245 kernel: DR0: DR1: DR2:
Mar 25 11:48:25 afs245 kernel: DR3: DR6: 0ff0 DR7: 0400
Mar 25 11:48:25 afs245 kernel: Process rbd (pid: 7444, threadinfo 8808028e4000, task 8807ef6fb520)
Mar 25 11:48:25 afs245 kernel: Stack:
Mar 25 11:48:25 afs245 kernel: 8808028e5d68 8112fd5d 8808028e5de8 880800ac7000
Mar 25 11:48:25 afs245 kernel: 028e5c78 80d0 8808028e5fd8 88082f000e00
Mar 25 11:48:25 afs245 kernel: 1078 0010 80d0 80d0
Mar 25 11:48:25 afs245 kernel: Call Trace:
Mar 25 11:48:25 afs245 kernel: [] ? get_page_from_freelist+0x22d/0x710
Mar 25 11:48:25 afs245 kernel: [] __kmalloc+0x168/0x340
Mar 25 11:48:25 afs245 kernel: [] ? ceph_parse_options+0x65/0x410 [libceph]
Mar 25 11:48:25 afs245 kernel: [] ? kzalloc+0x20/0x20 [rbd]
Mar 25 11:48:25 afs245 kernel: [] ceph_parse_options+0x65/0x410 [libceph]
Mar 25 11:48:25 afs245 kernel: [] ? kmem_cache_alloc_trace+0x214/0x2e0
Mar 25 11:48:25 afs245 kernel: [] ? __kmalloc+0x277/0x340
Mar 25 11:48:25 afs245 kernel: [] ? kzalloc+0xf/0x20 [rbd]
Mar 25 11:48:25 afs245 kernel: [] rbd_add_parse_args+0x1fa/0x250 [rbd]
Mar 25 11:48:25 afs245 kernel: [] rbd_add+0x84/0x2b4 [rbd]
Mar 25 11:48:25 afs245 kernel: [] bus_attr_store+0x27/0x30
Mar 25 11:48:25 afs245 kernel: [] sysfs_write_file+0xef/0x170
Mar 25 11:48:25 afs245 kernel: [] vfs_write+0xb4/0x130
Mar 25 11:48:25 afs245 kernel: [] sys_write+0x5f/0xa0
Mar 25 11:48:25 afs245 kernel: [] ? __audit_syscall_exit+0x246/0x2f0
Mar 25 11:48:25 afs245 kernel: [] system_call_fastpath+0x16/0x1b
Mar 25 11:48:25 afs245 kernel: Code: 48 8b 00 48 8b 55 b0 8b 4d b8 48 8b 75 a8 4c 8b 45 a0 4c 8b 4d c0 a8 40 0f 84 b8 fe ff ff 49 83 cf 01 e9 af fe ff ff 0f 0b eb fe <0f> 0b eb fe 8b 75 c8 8b 55 cc 31 c9 48 89 df 81 ce 00 12 04 00
Mar 25 11:48:25 afs245 kernel: RIP [] cache_alloc_refill+0x270/0x3c0
Mar 25 11:48:25 afs245 kernel: RSP
Mar 25 11:48:25 afs245 kernel: ---[ end trace 46b67e5b8b69abcb ]---
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Rephrased to make it clearer.

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Chen, Xiaoxi
Sent: 25 March 2013 17:02
To: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)
Cc: ceph-de...@vger.kernel.org
Subject: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

Hi list,

We have hit and reproduced this issue several times: ceph-osd will suicide because "FileStore: sync_entry timed out" after very heavy random IO on top of RBD.

My test environment is:

    4-node ceph cluster with 20 HDDs for OSDs and 4 Intel DC S3700 SSDs for journals per node, i.e. 80 spindles in total.
    48 VMs spread across 12 physical nodes, 48 RBDs attached to the VMs 1:1 via QEMU, with the QEMU cache disabled.
    Ceph 0.58.
    XFS is used.

I am running aiostress (something like FIO) inside the VMs to produce random write requests on top of each RBD.

From ceph -w, ceph reports a very high op rate (1+ /s), but technically 80 spindles can only provide up to 150*80/2 = 6000 IOPS for 4K random writes.

When digging into the code, it is clear from FileStore.cc::_write() that the OSD opens object files without O_DIRECT, which means data writes are buffered by the page cache and the call then returns. ::sync_file_range is called, but with the flag SYNC_FILE_RANGE_WRITE this system call does not actually sync data to disk before it returns; it just initiates the write-out IOs.

So the situation is: since all writes just go to the page cache, the backend OSD data disk *seems* extremely fast for random writes, which is why we see such a high op rate from ceph -w. However, when the OSD sync thread tries to sync the filesystem it uses ::syncfs(), and before ::syncfs returns the OS has to ensure that all dirty pages in the page cache belonging to that particular filesystem have been written to disk. This obviously takes a long time, and you can only expect on the order of 100 IOPS per spindle for a non-btrfs filesystem. The performance gap is there: an SSD journal can do 4K random writes at 1K+ IOPS, but the 4 HDDs journaled by the same SSD can only provide about 400 IOPS.

With the random write pressure continuing, the amount of dirty pages in the page cache keeps increasing; sooner or later ::syncfs() cannot return within 600s (the default value of filestore_commit_timeout) and the assert is triggered, which makes the ceph-osd process suicide.

I have tried to reproduce this with rados bench, but failed, because rados bench *creates* objects rather than modifying them, so a batch of creates can be merged into a single big write. So I assume that anyone who wants to reproduce this issue has to use QEMU or the kernel client, with a fast journal (say, tmpfs) and slow data disks; choosing a small filestore_commit_timeout may also help reproduce the issue in a small-scale environment.

Could you please let me know if you need any more information, and whether you have some solutions? Thanks.

Xiaoxi
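A minimal standalone sketch of the syscall behaviour described above (not Ceph code; the file path and the 4 KiB write are made up for illustration): with only SYNC_FILE_RANGE_WRITE the call merely initiates writeback of the dirty range and returns, while adding the WAIT flags (or calling fdatasync()) blocks until the data is actually on disk, which is roughly the cost the filestore sync thread later pays all at once.

    // sketch: initiate-only vs blocking flush of a freshly written range (Linux-only APIs)
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    int main() {
      int fd = ::open("/tmp/sfr-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
      if (fd < 0)
        return 1;

      std::vector<char> buf(4096, 'x');
      if (::write(fd, buf.data(), buf.size()) != (ssize_t)buf.size())   // data lands in the page cache only
        return 1;

      // The initiate-only form the thread describes FileStore using:
      // start writeback of the range and return immediately. The pages
      // may still be dirty in memory when this call comes back.
      ::sync_file_range(fd, 0, buf.size(), SYNC_FILE_RANGE_WRITE);

      // A flush that actually waits needs the WAIT flags (or fdatasync()):
      ::sync_file_range(fd, 0, buf.size(),
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);

      ::close(fd);
      return 0;
    }

Whether the accumulated dirty data can be written back faster than it is produced is exactly the journal-versus-spindle gap described above.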
Re: [ceph-users] Weird problem with mkcephfs
The keyring.* vs key.* distinction in mkcephfs appears correct. Can you attach your ceph.conf? It looks a bit like no daemons are defined.

sage
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi Xiaoxi,

On Mon, 25 Mar 2013, Chen, Xiaoxi wrote:

> From ceph -w, ceph reports a very high op rate (1+ /s), but technically
> 80 spindles can provide up to 150*80/2 = 6000 IOPS for 4K random writes.
>
> When digging into the code, I found that the OSD writes data to the page
> cache and then returns. It does call ::sync_file_range, but that syscall
> doesn't actually sync data to disk before it returns; it's an async call.
> So the situation is: random writes are extremely fast since they only go
> to the journal and the page cache, but once syncing starts it takes a very
> long time. The speed gap between the journal and the OSDs persists, the
> amount of data that needs to be synced keeps increasing, and it will
> certainly exceed 600s.

The sync_file_range is only there to push things to disk sooner, so that the eventual syncfs(2) takes less time.

When the async flushing is enabled, there is a limit to the number of flushes that are in the queue, but if it hits the max it just does

    dout(10) << "queue_flusher ep " << sync_epoch << " fd " << fd << " "
             << off << "~" << len << " qlen " << flusher_queue_len
             << " hit flusher_max_fds " << m_filestore_flusher_max_fds
             << ", skipping async flush" << dendl;

Can you confirm that the filestore is taking this path? (debug filestore = 10 and then reproduce.)

You may want to try

    filestore flusher = false
    filestore sync flush = true

and see if that changes things--it will make the sync_file_range() happen inline after the write.

Anyway, it sounds like you may be queueing up so many random writes that the sync takes forever. I've never actually seen that happen, so if we can confirm that's what is going on that will be very interesting.

Thanks-
sage
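For anyone wanting to try this, the option names above are exactly as Sage gives them; a sketch of where they would go in ceph.conf (section placement assumed), applied on OSD restart:

    [osd]
            debug filestore = 10          # to confirm the "skipping async flush" path in the log
            filestore flusher = false
            filestore sync flush = true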
[ceph-users] SSD Capacity and Partitions for OSD Journals
Hi,

I have a couple of HW provisioning questions in regards to SSDs for OSD journals.

I'd like to provision 12 OSDs per node, and there are enough CPU clocks and memory. Each OSD is allocated one 3TB HDD for OSD data - these 12 * 3TB HDDs are in non-RAID. For increasing access and (sequential) write performance, I'd like to put in 2 SSDs for OSD journals - these two SSDs are not mirrored. By the rule of thumb, I'd like to mount the OSD journals (the path below) to the "SSD partitions" accordingly.

/var/lib/ceph/osd/$cluster-$id/journal

Question 1. Which way is recommended between: (1) partitions for OS/boot and 6 OSD journals on SSD #1, and partitions for the remaining 6 OSD journals on SSD #2; or (2) the OS/boot partition on SSD #1, and separately 12 OSD journals on SSD #2? BTW, for better utilization of expensive SSDs, I prefer the first way. Should it be okay?

Question 2. I have several capacity options for SSDs. What's the capacity requirement if there are 6 partitions for 6 OSD journals on one SSD? If it's hard to generalize, please provide me with some guidelines.

Thanks,
Peter
Re: [ceph-users] RadosGW fault tolerance
Hi Yehuda,

Thanks for the reply; my comments are inline below.

On 25/03/2013 04:32, Yehuda Sadeh wrote:
> On Sun, Mar 24, 2013 at 7:14 PM, Rustam Aliyev wrote:
>> Hi,
>>
>> I was testing a RadosGW setup and observed strange behavior - RGW becomes
>> unresponsive or won't start whenever cluster health is degraded (e.g.
>> restarting one of the OSDs). Probably I'm doing something wrong, but I
>> couldn't find any information about this.
>>
>> I'm running 0.56.3 on a 3-node cluster (3xMON, 3xOSD). I increased the
>> replication factor for the rgw-related pools so that the cluster can
>> survive a single node failure (quorum).
>>
>> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0 crash_replay_interval 45
>> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
>> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
>> pool 3 'pbench' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 150 pgp_num 150 last_change 11 owner 0
>> pool 4 '.rgw' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 90 pgp_num 8 last_change 111 owner 0
>> pool 5 '.rgw.gc' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 112 owner 0
>> pool 6 '.rgw.control' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 114 owner 0
>> pool 7 '.users.uid' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 117 owner 0
>> pool 8 '.users.email' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 118 owner 0
>> pool 9 '.users' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 115 owner 0
>> pool 11 '.rgw.buckets' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 108 owner 0
>>
>> Any idea how to fix this?
>
> We'll need some more specific info with regard to the actual scenario in
> order to determine what exactly it is that you're seeing. What is the exact
> scenario you're testing (osd goes down?)?

I'm just doing "service ceph stop osd" on one of the nodes.

> However, there are a few things to note:
>
> - you have only 3 osds, which means that a single osd going down affects a
> large portion of your data. How and what exactly happens really depends on
> your configuration.

The configuration is quite simple, 3 OSDs and 3 monitors with default params: http://pastebin.com/LP3X7cf9

> Note that it is not impossible that it takes some time to determine that an
> osd went down.

I tested that scenario - it seems that you are right. It basically takes some time, but I'm not sure if that's expected. When I shut down the osd, rgw becomes unresponsive for 2 minutes; then it works even though health is degraded. After some time I brought the osd back up (started it) and rgw became unresponsive again - this time, however, for 5 minutes. Then it started functioning again while pgs were recovering in the background.

> It is expected that this osd gets 1/3 of the traffic, which means that
> until there's a map change, the gateway will still try to contact it.

Does it mean that rgw/rados waits for all replicas to acknowledge success? Is it possible to configure it in a way where a quorum is enough - i.e. 2 out of 3 replicas written successfully and rgw returns OK?

> - some of your pools contain a very small number of pgs (8). Probably not
> related to your issue, but you'd want to change that.

Yes, I'm aware of that - I just kept the default values for now.

> Yehuda
[ceph-users] v0.56.4 released
There have been several important fixes that we've backported to bobtail that users are hitting in the wild. Most notably, there was a problem with pool names containing - or _ that OpenStack users were hitting, and memory usage by ceph-osd and other daemons due to the trimming of in-memory logs. This and more is fixed in v0.56.4. We recommend that all bobtail users upgrade.

Notable changes include:

 * mon: fix bug in bringup with IPv6
 * reduce default memory utilization by internal logging (all daemons)
 * rgw: fix for bucket removal
 * rgw: reopen logs after log rotation
 * rgw: fix multipart upload listing
 * rgw: don't copy object when copied onto self
 * osd: fix caps parsing for pools with - or _
 * osd: allow pg log trimming when degraded, scrubbing, recovering (reducing memory consumption)
 * osd: fix potential deadlock when 'journal aio = true'
 * osd: various fixes for collection creation/removal, rename, temp collections
 * osd: various fixes for PG split
 * osd: deep-scrub omap key/value data
 * osd: fix rare bug in journal replay
 * osd: misc fixes for snapshot tracking
 * osd: fix leak in recovery reservations on pool deletion
 * osd: fix bug in connection management
 * osd: fix for op ordering when rebalancing
 * ceph-fuse: report file system size with correct units
 * mds: get and set directory layout policies via virtual xattrs
 * mkcephfs, init-ceph: close potential security issues with predictable filenames

There is one minor change (fix) in the output of the 'ceph osd tree --format=json' command. Please see the full release notes.

You can get v0.56.4 from the usual locations:

 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.56.4.tar.gz
 * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
 * For RPMs, see http://ceph.com/docs/master/install/rpm
Re: [ceph-users] v0.56.4 released
On Mon, 25 Mar 2013, Sage Weil wrote:

> There is one minor change (fix) in the output of the 'ceph osd tree
> --format=json' command. Please see the full release notes.

Greg just reminded me about one additional note about upgrades (which should hopefully affect no one):

 * The MDS disk format has changed from prior releases *and* from v0.57. In particular, upgrades to v0.56.4 are safe, but you cannot move from v0.56.4 to v0.57 if you are using the MDS for CephFS; you must upgrade directly to v0.58 (or later) instead.

sage
Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.
Hi Sage,

Thanks for your mail. When I turn on filestore sync flush, it seems to work and the OSD process doesn't suicide any more.

I had already disabled the flusher long ago, since both Mark's reports and mine show that disabling the flusher seems to improve performance (so my original configuration was filestore_flusher = false, filestore_sync_flush = false (the default)), but now we have to reconsider this.

I would like to look at the internal code of ::sync_file_range() to learn more about how it works. My first guess is that ::sync_file_range will push requests into the disk queue, and if the disk queue is full the call will block and wait, but I'm not sure.

From the code path (by the way, these lines of code are a bit hard to follow):

    if (!should_flush || !m_filestore_flusher ||
        !queue_flusher(fd, offset, len)) {
      if (should_flush && m_filestore_sync_flush)
        ::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
      lfn_close(fd);
    }

With the default setting (m_filestore_flusher = true), the flusher queue will soon fill up. In that situation, if the user doesn't also turn on m_filestore_sync_flush = true, he or she is likely to hit the same situation: writes remain in the page cache and the OSD daemon dies when trying to sync.

I suppose the right logic should be (pseudocode):

    if (should_flush) {
      if (m_filestore_flusher) {
        if (queue_flusher(fd, offset, len))
          ; // queued, nothing more to do here
        else
          ::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
      } else if (m_filestore_sync_flush) {
        ::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
      }
      lfn_close(fd);
    }

Xiaoxi
Re: [ceph-users] Weird problem with mkcephfs
Sage,

Sure, here you go:

[global]
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx
        max open files = 4096

[mon]
        mon data = /data/${name}
        keyring = /data/${name}/keyring

[osd]
        osd data = /data/${name}
        keyring = /data/${name}/keyring
        btrfs devs = /dev/disk/by-label/${name}-data
        osd journal = /dev/sda_vg/${name}-journal

[mon.a]
        hostname = smon
        mon addr = 192.168.0.253:6789

[osd.0]
        hostname = s1
[osd.1]
        hostname = s1
[osd.2]
        hostname = s1
[osd.3]
        hostname = s1
[osd.4]
        hostname = s1
[osd.5]
        hostname = s1
[osd.6]
        hostname = s2
[osd.7]
        hostname = s2
[osd.8]
        hostname = s2
[osd.9]
        hostname = s2
[osd.10]
        hostname = s2
[osd.11]
        hostname = s2

----- Original Message -----
> From: "Sage Weil"
> To: "Steve Carter"
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, March 25, 2013 8:26:54 AM
> Subject: Re: [ceph-users] Weird problem with mkcephfs
>
> The keyring.* vs key.* distinction in mkcephfs appears correct. Can you
> attach your ceph.conf? It looks a bit like no daemons are defined.
>
> sage
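For comparison, the bobtail-era sample configurations spell the per-daemon placement with host = entries, which mkcephfs -a reads to decide which machines to ssh into; a minimal sketch of that layout is below. Whether the hostname = spelling above is also honoured is not established in this thread, so treat this purely as an assumption to check.

    [mon.a]
            host = smon
            mon addr = 192.168.0.253:6789

    [osd.0]
            host = s1

    [osd.6]
            host = s2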
Re: [ceph-users] SSD Capacity and Partitions for OSD Journals
On 03/25/2013 04:07 PM, peter_j...@dell.com wrote:

The journal size is configurable in /etc/ceph/ceph.conf, so with a journal size of 10 000 (MB) you'll need 10 GB for one journal, hence about 60 GB for six; add a safety factor (e.g. 20%) and you should be OK.

The size itself is 2 * desired throughput * interval between syncs, so if you have a hard drive that can do 100 MB/s and want an interval of 50 s (not sure that's highly recommended), then you'll need 2 * 100 * 50 = 10 000 (MB) as the journal size.

Matthieu.
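Expressed in ceph.conf, that sizing would look roughly like the sketch below (the osd journal size option takes megabytes; the 10 000 figure simply follows the 2 * 100 MB/s * 50 s example above):

    [osd]
            # 2 * expected throughput (MB/s) * filestore sync interval (s)
            osd journal size = 10000

Roughly speaking, when the journal lives on a raw SSD partition rather than a file, the partition size itself bounds the journal, so the same arithmetic drives how you carve up each SSD into the six journal partitions Peter asks about.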