Re: [ceph-users] Weird problem with mkcephfs

2013-03-25 Thread Steve Carter
It doesn't attempt to log in to my other machines, though, as I thought 
it was designed to do and as I know it did the last time I built a 
cluster.  Not sure what I'm doing wrong.


-Steve

On 03/23/2013 10:35 PM, Steve Carter wrote:

I changed:

for k in $dir/key.*

to:

for k in $dir/key*

and it appeared to run correctly:

root@smon:/etc/ceph# mkcephfs -a -c /etc/ceph/ceph.conf -d /tmp -k 
/etc/ceph/keyring

preparing monmap in /tmp/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789 
--print /tmp/monmap

/usr/bin/monmaptool: monmap file /tmp/monmap
/usr/bin/monmaptool: generated fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
epoch 0
fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
last_changed 2013-03-23 22:33:26.254974
created 2013-03-23 22:33:26.254974
0: 192.168.0.253:6789/0 mon.a
/usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
Building generic osdmap from /tmp/conf
/usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
/usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
Generating admin key at /tmp/keyring.admin
creating /tmp/keyring.admin
Building initial monitor keyring
placing client.admin keyring in /etc/ceph/keyring

On 03/23/2013 10:29 PM, Steve Carter wrote:
The part of the mkcephfs code below seems responsible for this, 
specifically the 'for' loop.  I wonder if I installed from the 
wrong place?  I installed from the Ubuntu source rather than the Ceph 
source.


# admin keyring
echo Generating admin key at $dir/keyring.admin
$BINDIR/ceph-authtool --create-keyring --gen-key -n client.admin $dir/keyring.admin

# mon keyring
echo Building initial monitor keyring
cp $dir/keyring.admin $dir/keyring.mon
$BINDIR/ceph-authtool -n client.admin --set-uid=0 \
    --cap mon 'allow *' \
    --cap osd 'allow *' \
    --cap mds 'allow' \
    $dir/keyring.mon

$BINDIR/ceph-authtool --gen-key -n mon. $dir/keyring.mon

for k in $dir/key.*
do
    kname=`echo $k | sed 's/.*key\.//'`
    ktype=`echo $kname | cut -c 1-3`
    kid=`echo $kname | cut -c 4- | sed 's/^\\.//'`
    kname="$ktype.$kid"
    secret=`cat $k`
    if [ "$ktype" = "osd" ]; then
        $BINDIR/ceph-authtool -n $kname --add-key $secret $dir/keyring.mon \
            --cap mon 'allow rwx' \
            --cap osd 'allow *'
    fi
    if [ "$ktype" = "mds" ]; then
        $BINDIR/ceph-authtool -n $kname --add-key $secret $dir/keyring.mon \
            --cap mon "allow rwx" \
            --cap osd 'allow *' \
            --cap mds 'allow'
    fi
done

exit 0
fi
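
(As an aside, the "cat: /tmp/key.*: No such file or directory" error shown 
further down is just what the shell does when the key.* glob matches nothing: 
the unexpanded pattern is passed through literally and cat fails on it. A 
minimal bash sketch of that behaviour, with a hypothetical directory:)

dir=/tmp
# With default globbing, an unmatched pattern stays literal, so the loop
# runs once with k='/tmp/key.*' and cat fails on it.
for k in $dir/key.*
do
    echo "loop variable: $k"
    cat "$k"          # -> cat: /tmp/key.*: No such file or directory
done

# With nullglob set (or an explicit [ -e "$k" ] test) the loop simply
# skips when there are no key.* files:
shopt -s nullglob
for k in $dir/key.*
do
    cat "$k"
done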


On 03/23/2013 01:50 PM, Steve Carter wrote:
This is consistently repeatable on my system.  This is the latest of 
two cluster builds I have done. This is a brand new deployment on 
hardware I haven't deployed on previously.


You can see the error below references /tmp/key.*, while the keyring 
files are actually named keyring.*.


Any help is much appreciated.

root@mon:~# uname -a
Linux mon.X.com 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 
00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

root@mon:~# ceph -v
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
root@mon:~# ls -al /tmp/
total 8
drwxrwxrwx  2 root root 4096 Mar 23 12:15 .
drwxr-xr-x 25 root root 4096 Mar 22 23:18 ..
root@mon:~# ls -al / | grep tmp
drwxrwxrwx  2 root root  4096 Mar 23 12:15 tmp
root@mon:~# mkcephfs -d /tmp -a -c /etc/ceph/ceph.conf -k 
/etc/ceph/keyring

preparing monmap in /tmp/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789 
--print /tmp/monmap

/usr/bin/monmaptool: monmap file /tmp/monmap
/usr/bin/monmaptool: generated fsid 68b9c724-21c0-4d54-8237-674ced7adbfe

epoch 0
fsid 68b9c724-21c0-4d54-8237-674ced7adbfe
last_changed 2013-03-23 12:17:03.087018
created 2013-03-23 12:17:03.087018
0: 192.168.0.253:6789/0 mon.a
/usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
Building generic osdmap from /tmp/conf
/usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
/usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
Generating admin key at /tmp/keyring.admin
creating /tmp/keyring.admin
Building initial monitor keyring
cat: /tmp/key.*: No such file or directory
root@mon:~# ls -al /tmp/
total 32
drwxrwxrwx  2 root root 4096 Mar 23 12:17 .
drwxr-xr-x 25 root root 4096 Mar 22 23:18 ..
-rw-r--r--  1 root root  695 Mar 23 12:17 conf
-rw-------  1 root root   63 Mar 23 12:17 keyring.admin
-rw-------  1 root root  192 Mar 23 12:17 keyring.mon
-rw-r--r--  1 root root  187 Mar 23 12:17 monmap
-rw-r--r--  1 root root 6886 Mar 23 12:17 osdmap







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi
Hi list,
 We have hit and reproduced this issue several times: ceph will 
suicide because FileStore: sync_entry timed out after very heavy random IO on 
top of the RBD.
 My test environment is:
A 4-node ceph cluster with 20 HDDs for OSDs and 4 
Intel DC S3700 SSDs for journals per node, that is 80 spindles in total.
48 VMs spread across 12 physical nodes, with 48 RBDs 
attached to the VMs 1:1 via Qemu.
Ceph @ 0.58
XFS was used.
 I am using aiostress (something like FIO) to produce random write 
requests on top of each RBD.

 From ceph -w, ceph reports a very high op rate (1+ /s), but 
technically, 80 spindles can provide up to 150*80/2 = 6000 IOPS for 4K random 
write.
 When digging into the code, I found that the OSD writes data to the 
page cache and then returns. Although it calls ::sync_file_range, that syscall 
doesn't actually sync data to disk before it returns; it is an async call. So the 
situation is: random writes are extremely fast since they only go to the 
journal and the page cache, but once syncing starts it takes a very long time. The 
speed gap between the journal and the OSDs remains, the amount of data that needs to 
be synced keeps increasing, and it will eventually exceed 600s.

 For more information, I have tried to reproduce this with rados 
bench, but failed.

 Could you please let me know if you need any more information, and 
whether you have any solutions? Thanks



Xiaoxi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Wolfgang Hennerbichler
Hi,

this could be related to this issue here and has been reported multiple
times:

http://tracker.ceph.com/issues/3737

In short: They're working on it, they know about it.

Wolfgang

On 03/25/2013 10:01 AM, Chen, Xiaoxi wrote:
> Hi list,
> 
>  We have hit and reproduce this issue for several times, ceph
> will suicide because FileStore: sync_entry timed out after a very heavy
> random IO on top of the RBD.
> 
>  My test environment is:
> 
> 4 Nodes ceph cluster with 20 HDDs for OSDs
> and 4 Intel DCS3700 ssds for journal per node, that is 80 spindles in total
> 
> 48 VMs spread across 12 Physical nodes, 48
> RBD attached to the VMs 1:1 via Qemu.
> 
> Ceph @ 0.58
> 
> XFS were used.
> 
>  I am using Aiostress (something like FIO) to produce random
> write requests on top of each RBDs.
> 
>  
> 
>  From Ceph-w , ceph reports a very high Ops (1+ /s) , but
> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K
> random write.
> 
>  When digging into the code, I found that the OSD write data to
> Pagecache than returned, although it called ::sync_file_range, but this
> syscall doesn’t actually sync data to disk when it return,it’s an aync
> call. So the situation is , the random write will be extremely fast
> since it only write to journal and pagecache, but once syncing , it will
> take very long time. The speed gap between journal and OSDs exist, the
> amount of data that need to be sync keep increasing, and it will
> certainly exceed 600s.
> 
>  
> 
>  For more information, I have tried to reproduce this by rados
> bench,but failed.
> 
>  
> 
>  Could you please let me know if you need any more informations
> & have some solutions? Thanks
> 
>   
>   
>   
> 
>  Xiaoxi
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi


Hi Wolfgang,

Thanks for the reply, but why is my problem related to issue #3737? I 
cannot find any direct link between them. I didn't turn on the QEMU cache, and my 
QEMU/VMs work fine.


Xiaoxi

On 2013-3-25, 17:07, "Wolfgang Hennerbichler" 
 wrote:

> Hi,
> 
> this could be related to this issue here and has been reported multiple
> times:
> 
> http://tracker.ceph.com/issues/3737
> 
> In short: They're working on it, they know about it.
> 
> Wolfgang
> 
> On 03/25/2013 10:01 AM, Chen, Xiaoxi wrote:
>> Hi list,
>> 
>> We have hit and reproduce this issue for several times, ceph
>> will suicide because FileStore: sync_entry timed out after a very heavy
>> random IO on top of the RBD.
>> 
>> My test environment is:
>> 
>>4 Nodes ceph cluster with 20 HDDs for OSDs
>> and 4 Intel DCS3700 ssds for journal per node, that is 80 spindles in total
>> 
>>48 VMs spread across 12 Physical nodes, 48
>> RBD attached to the VMs 1:1 via Qemu.
>> 
>>Ceph @ 0.58
>> 
>>XFS were used.
>> 
>> I am using Aiostress (something like FIO) to produce random
>> write requests on top of each RBDs.
>> 
>> 
>> 
>> From Ceph-w , ceph reports a very high Ops (1+ /s) , but
>> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K
>> random write.
>> 
>> When digging into the code, I found that the OSD write data to
>> Pagecache than returned, although it called ::sync_file_range, but this
>> syscall doesn’t actually sync data to disk when it return,it’s an aync
>> call. So the situation is , the random write will be extremely fast
>> since it only write to journal and pagecache, but once syncing , it will
>> take very long time. The speed gap between journal and OSDs exist, the
>> amount of data that need to be sync keep increasing, and it will
>> certainly exceed 600s.
>> 
>> 
>> 
>> For more information, I have tried to reproduce this by rados
>> bench,but failed.
>> 
>> 
>> 
>> Could you please let me know if you need any more informations
>> & have some solutions? Thanks
>> 
>> 
>> Xiaoxi
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> DI (FH) Wolfgang Hennerbichler
> Software Development
> Unit Advanced Computing Technologies
> RISC Software GmbH
> A company of the Johannes Kepler University Linz
> 
> IT-Center
> Softwarepark 35
> 4232 Hagenberg
> Austria
> 
> Phone: +43 7236 3343 245
> Fax: +43 7236 3343 250
> wolfgang.hennerbich...@risc-software.at
> http://www.risc-software.at
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Wolfgang Hennerbichler
Hi Xiaoxi,

sorry, I thought you were testing within VMs with caching turned on (I
assumed so; you didn't tell us whether you really ran your benchmark within
VMs and, if not, how you tested rbd outside of VMs).
It just triggered an alarm in me because we had also experienced issues
with benchmarking within a VM (it didn't crash but responded extremely
slowly).

Wolfgang

On 03/25/2013 10:15 AM, Chen, Xiaoxi wrote:
> 
> 
> Hi Wolfgang,
> 
> Thanks for the reply,but why my problem is related with issue#3737? I 
> cannot find any direct link between them. I didnt turn on qemu cache and my 
> qumu/VM work fine
> 
> 
> Xiaoxi
> 
> On 2013-3-25, 17:07, "Wolfgang Hennerbichler" 
>  wrote:
> 
>> Hi,
>>
>> this could be related to this issue here and has been reported multiple
>> times:
>>
>> http://tracker.ceph.com/issues/3737
>>
>> In short: They're working on it, they know about it.
>>
>> Wolfgang
>>
>> On 03/25/2013 10:01 AM, Chen, Xiaoxi wrote:
>>> Hi list,
>>>
>>> We have hit and reproduce this issue for several times, ceph
>>> will suicide because FileStore: sync_entry timed out after a very heavy
>>> random IO on top of the RBD.
>>>
>>> My test environment is:
>>>
>>>4 Nodes ceph cluster with 20 HDDs for OSDs
>>> and 4 Intel DCS3700 ssds for journal per node, that is 80 spindles in total
>>>
>>>48 VMs spread across 12 Physical nodes, 48
>>> RBD attached to the VMs 1:1 via Qemu.
>>>
>>>Ceph @ 0.58
>>>
>>>XFS were used.
>>>
>>> I am using Aiostress (something like FIO) to produce random
>>> write requests on top of each RBDs.
>>>
>>>
>>>
>>> From Ceph-w , ceph reports a very high Ops (1+ /s) , but
>>> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K
>>> random write.
>>>
>>> When digging into the code, I found that the OSD write data to
>>> Pagecache than returned, although it called ::sync_file_range, but this
>>> syscall doesn’t actually sync data to disk when it return,it’s an aync
>>> call. So the situation is , the random write will be extremely fast
>>> since it only write to journal and pagecache, but once syncing , it will
>>> take very long time. The speed gap between journal and OSDs exist, the
>>> amount of data that need to be sync keep increasing, and it will
>>> certainly exceed 600s.
>>>
>>>
>>>
>>> For more information, I have tried to reproduce this by rados
>>> bench,but failed.
>>>
>>>
>>>
>>> Could you please let me know if you need any more informations
>>> & have some solutions? Thanks
>>>
>>>
>>> Xiaoxi
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> -- 
>> DI (FH) Wolfgang Hennerbichler
>> Software Development
>> Unit Advanced Computing Technologies
>> RISC Software GmbH
>> A company of the Johannes Kepler University Linz
>>
>> IT-Center
>> Softwarepark 35
>> 4232 Hagenberg
>> Austria
>>
>> Phone: +43 7236 3343 245
>> Fax: +43 7236 3343 250
>> wolfgang.hennerbich...@risc-software.at
>> http://www.risc-software.at
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi

Hi,

On 2013-3-25, 17:30, "Wolfgang Hennerbichler" 
 wrote:

> Hi Xiaoxi,
> 
> sorry, I thought you were testing within VMs and caching turned on (I
> assumed, you didn't tell us if you really did use your benchmark within
> vms and if not, how you tested rbd outside of VMs).
Yes, I really am testing within VMs.
> It just triggered an alarm in me because we had also experienced issues
> with benchmarking within a VM (it didn't crash but responded extremely
> slow).
> 
OK, but my VM didn't crash; it's the ceph-osd daemon that crashed. So is it safe for me to 
say that the issue I hit is a different one (not #3737)?
 
> Wolfgang

   xiaoxi
> 
> On 03/25/2013 10:15 AM, Chen, Xiaoxi wrote:
>> 
>> 
>> Hi Wolfgang,
>> 
>>Thanks for the reply,but why my problem is related with issue#3737? I 
>> cannot find any direct link between them. I didnt turn on qemu cache and my 
>> qumu/VM work fine
>> 
>> 
>>Xiaoxi
>> 
>> On 2013-3-25, 17:07, "Wolfgang Hennerbichler" 
>>  wrote:
>> 
>>> Hi,
>>> 
>>> this could be related to this issue here and has been reported multiple
>>> times:
>>> 
>>> http://tracker.ceph.com/issues/3737
>>> 
>>> In short: They're working on it, they know about it.
>>> 
>>> Wolfgang
>>> 
>>> On 03/25/2013 10:01 AM, Chen, Xiaoxi wrote:
 Hi list,
 
We have hit and reproduce this issue for several times, ceph
 will suicide because FileStore: sync_entry timed out after a very heavy
 random IO on top of the RBD.
 
My test environment is:
 
   4 Nodes ceph cluster with 20 HDDs for OSDs
 and 4 Intel DCS3700 ssds for journal per node, that is 80 spindles in total
 
   48 VMs spread across 12 Physical nodes, 48
 RBD attached to the VMs 1:1 via Qemu.
 
   Ceph @ 0.58
 
   XFS were used.
 
I am using Aiostress (something like FIO) to produce random
 write requests on top of each RBDs.
 
 
 
From Ceph-w , ceph reports a very high Ops (1+ /s) , but
 technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K
 random write.
 
When digging into the code, I found that the OSD write data to
 Pagecache than returned, although it called ::sync_file_range, but this
 syscall doesn’t actually sync data to disk when it return,it’s an aync
 call. So the situation is , the random write will be extremely fast
 since it only write to journal and pagecache, but once syncing , it will
 take very long time. The speed gap between journal and OSDs exist, the
 amount of data that need to be sync keep increasing, and it will
 certainly exceed 600s.
 
 
 
For more information, I have tried to reproduce this by rados
 bench,but failed.
 
 
 
Could you please let me know if you need any more informations
 & have some solutions? Thanks
 
 
Xiaoxi
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
>>> 
>>> 
>>> -- 
>>> DI (FH) Wolfgang Hennerbichler
>>> Software Development
>>> Unit Advanced Computing Technologies
>>> RISC Software GmbH
>>> A company of the Johannes Kepler University Linz
>>> 
>>> IT-Center
>>> Softwarepark 35
>>> 4232 Hagenberg
>>> Austria
>>> 
>>> Phone: +43 7236 3343 245
>>> Fax: +43 7236 3343 250
>>> wolfgang.hennerbich...@risc-software.at
>>> http://www.risc-software.at
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> -- 
> DI (FH) Wolfgang Hennerbichler
> Software Development
> Unit Advanced Computing Technologies
> RISC Software GmbH
> A company of the Johannes Kepler University Linz
> 
> IT-Center
> Softwarepark 35
> 4232 Hagenberg
> Austria
> 
> Phone: +43 7236 3343 245
> Fax: +43 7236 3343 250
> wolfgang.hennerbich...@risc-software.at
> http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Wolfgang Hennerbichler


On 03/25/2013 10:35 AM, Chen, Xiaoxi wrote:

> OK,but my VM didnt crash, it's ceph-osd daemon crashed. So is it safe for me 
> to say the issue I hit is a different issue?(not #3737)

Yes, then it surely is a different issue. Actually you just said ceph
crashed, no mention of an OSD, so it was hard to find out :)

>> Wolfgang
> 
>xiaoxi
>>
>> On 03/25/2013 10:15 AM, Chen, Xiaoxi wrote:
>>>
>>>
>>> Hi Wolfgang,
>>>
>>>Thanks for the reply,but why my problem is related with issue#3737? 
>>> I cannot find any direct link between them. I didnt turn on qemu cache and 
>>> my qumu/VM work fine
>>>
>>>
>>>Xiaoxi
>>>
>>> On 2013-3-25, 17:07, "Wolfgang Hennerbichler" 
>>>  wrote:
>>>
 Hi,

 this could be related to this issue here and has been reported multiple
 times:

 http://tracker.ceph.com/issues/3737

 In short: They're working on it, they know about it.

 Wolfgang

 On 03/25/2013 10:01 AM, Chen, Xiaoxi wrote:
> Hi list,
>
>We have hit and reproduce this issue for several times, ceph
> will suicide because FileStore: sync_entry timed out after a very heavy
> random IO on top of the RBD.
>
>My test environment is:
>
>   4 Nodes ceph cluster with 20 HDDs for OSDs
> and 4 Intel DCS3700 ssds for journal per node, that is 80 spindles in 
> total
>
>   48 VMs spread across 12 Physical nodes, 48
> RBD attached to the VMs 1:1 via Qemu.
>
>   Ceph @ 0.58
>
>   XFS were used.
>
>I am using Aiostress (something like FIO) to produce random
> write requests on top of each RBDs.
>
>
>
>From Ceph-w , ceph reports a very high Ops (1+ /s) , but
> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K
> random write.
>
>When digging into the code, I found that the OSD write data to
> Pagecache than returned, although it called ::sync_file_range, but this
> syscall doesn’t actually sync data to disk when it return,it’s an aync
> call. So the situation is , the random write will be extremely fast
> since it only write to journal and pagecache, but once syncing , it will
> take very long time. The speed gap between journal and OSDs exist, the
> amount of data that need to be sync keep increasing, and it will
> certainly exceed 600s.
>
>
>
>For more information, I have tried to reproduce this by rados
> bench,but failed.
>
>
>
>Could you please let me know if you need any more informations
> & have some solutions? Thanks
>
>
>Xiaoxi
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


 -- 
 DI (FH) Wolfgang Hennerbichler
 Software Development
 Unit Advanced Computing Technologies
 RISC Software GmbH
 A company of the Johannes Kepler University Linz

 IT-Center
 Softwarepark 35
 4232 Hagenberg
 Austria

 Phone: +43 7236 3343 245
 Fax: +43 7236 3343 250
 wolfgang.hennerbich...@risc-software.at
 http://www.risc-software.at
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> -- 
>> DI (FH) Wolfgang Hennerbichler
>> Software Development
>> Unit Advanced Computing Technologies
>> RISC Software GmbH
>> A company of the Johannes Kepler University Linz
>>
>> IT-Center
>> Softwarepark 35
>> 4232 Hagenberg
>> Austria
>>
>> Phone: +43 7236 3343 245
>> Fax: +43 7236 3343 250
>> wolfgang.hennerbich...@risc-software.at
>> http://www.risc-software.at


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] kernel BUG when mapping a nonexistent rbd device

2013-03-25 Thread Dan van der Ster
Hi,
Apologies if this is already a known bug (though I didn't find it).

If we try to map a device that doesn't exist, we get an immediate and
reproducible kernel BUG (see the P.S.). We hit this by accident
because we forgot to add the --pool argument.

This works:

[root@afs245 /]# rbd map afs254-vicepa --pool afs --id afs --keyring
/etc/ceph/ceph.client.afs.keyring
[root@afs245 /]# rbd showmapped
id pool image         snap device
1  afs  afs254-vicepa -    /dev/rbd1

But this BUGS:

[root@afs245 /]# rbd map afs254-vicepa
BUG...

Any clue?

Cheers,
Dan, CERN IT


Mar 25 11:48:25 afs245 kernel: kernel BUG at mm/slab.c:3130!
Mar 25 11:48:25 afs245 kernel: invalid opcode:  [#1] SMP
Mar 25 11:48:25 afs245 kernel: Modules linked in: rbd libceph
libcrc32c cpufreq_ondemand ipv6 ext2 iTCO_wdt iTCO_vendor_support
coretemp acpi_cpufreq freq_tabl
e mperf kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode
pcspkr serio_raw i2c_i801 lpc_ich joydev e1000e ses enclosure sg ixgbe
hwmon dca ptp pps_core
mdio ext3 jbd mbcache sd_mod crc_t10dif aesni_intel ablk_helper cryptd
lrw aes_x86_64 xts gf128mul ahci libahci 3w_9xxx mpt2sas
scsi_transport_sas raid_class v
ideo mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod
Mar 25 11:48:25 afs245 kernel: CPU 3
Mar 25 11:48:25 afs245 kernel: Pid: 7444, comm: rbd Not tainted
3.8.4-1.el6.elrepo.x86_64 #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM
Mar 25 11:48:25 afs245 kernel: RIP: 0010:[]
[] cache_alloc_refill+0x270/0x3c0
Mar 25 11:48:25 afs245 kernel: RSP: 0018:8808028e5c48  EFLAGS: 00010082
Mar 25 11:48:25 afs245 kernel: RAX:  RBX:
88082f000e00 RCX: 88082f000e00
Mar 25 11:48:25 afs245 kernel: RDX: 8808055fba80 RSI:
88082f0028d0 RDI: 88082f002900
Mar 25 11:48:25 afs245 kernel: RBP: 8808028e5ca8 R08:
88082f0028e0 R09: 8808010068c0
Mar 25 11:48:25 afs245 kernel: R10: dead00200200 R11:
0003 R12: 
Mar 25 11:48:25 afs245 kernel: R13: 880807a71ec0 R14:
88082f0028c0 R15: 0004
Mar 25 11:48:25 afs245 kernel: FS:  7ff85056e760()
GS:88082fd8() knlGS:
Mar 25 11:48:25 afs245 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Mar 25 11:48:25 afs245 kernel: CR2: 00428220 CR3:
0007eee7e000 CR4: 001407e0
Mar 25 11:48:25 afs245 kernel: DR0:  DR1:
 DR2: 
Mar 25 11:48:25 afs245 kernel: DR3:  DR6:
0ff0 DR7: 0400
Mar 25 11:48:25 afs245 kernel: Process rbd (pid: 7444, threadinfo
8808028e4000, task 8807ef6fb520)
Mar 25 11:48:25 afs245 kernel: Stack:
Mar 25 11:48:25 afs245 kernel: 8808028e5d68 8112fd5d
8808028e5de8 880800ac7000
Mar 25 11:48:25 afs245 kernel: 028e5c78 80d0
8808028e5fd8 88082f000e00
Mar 25 11:48:25 afs245 kernel: 1078 0010
80d0 80d0
Mar 25 11:48:25 afs245 kernel: Call Trace:
Mar 25 11:48:25 afs245 kernel: [] ?
get_page_from_freelist+0x22d/0x710
Mar 25 11:48:25 afs245 kernel: [] __kmalloc+0x168/0x340
Mar 25 11:48:25 afs245 kernel: [] ?
ceph_parse_options+0x65/0x410 [libceph]
Mar 25 11:48:25 afs245 kernel: [] ? kzalloc+0x20/0x20 [rbd]
Mar 25 11:48:25 afs245 kernel: []
ceph_parse_options+0x65/0x410 [libceph]
Mar 25 11:48:25 afs245 kernel: [] ?
kmem_cache_alloc_trace+0x214/0x2e0
Mar 25 11:48:25 afs245 kernel: [] ? __kmalloc+0x277/0x340
Mar 25 11:48:25 afs245 kernel: [] ? kzalloc+0xf/0x20 [rbd]
Mar 25 11:48:25 afs245 kernel: []
rbd_add_parse_args+0x1fa/0x250 [rbd]
Mar 25 11:48:25 afs245 kernel: [] rbd_add+0x84/0x2b4 [rbd]
Mar 25 11:48:25 afs245 kernel: [] bus_attr_store+0x27/0x30
Mar 25 11:48:25 afs245 kernel: [] sysfs_write_file+0xef/0x170
Mar 25 11:48:25 afs245 kernel: [] vfs_write+0xb4/0x130
Mar 25 11:48:25 afs245 kernel: [] sys_write+0x5f/0xa0
Mar 25 11:48:25 afs245 kernel: [] ?
__audit_syscall_exit+0x246/0x2f0
Mar 25 11:48:25 afs245 kernel: []
system_call_fastpath+0x16/0x1b
Mar 25 11:48:25 afs245 kernel: Code: 48 8b 00 48 8b 55 b0 8b 4d b8 48
8b 75 a8 4c 8b 45 a0 4c 8b 4d c0 a8 40 0f 84 b8 fe ff ff 49 83 cf 01
e9 af fe ff ff 0f 0b eb fe <0f> 0b eb fe 8b 75 c8 8b 55 cc 31 c9 48 89
df 81 ce 00 12 04 00
Mar 25 11:48:25 afs245 kernel: RIP  []
cache_alloc_refill+0x270/0x3c0
Mar 25 11:48:25 afs245 kernel: RSP 
Mar 25 11:48:25 afs245 kernel: ---[ end trace 46b67e5b8b69abcb ]---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi
Rephrasing it to make it clearer:

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Chen, Xiaoxi
Sent: March 25, 2013 17:02
To: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)
Cc: ceph-de...@vger.kernel.org
Subject: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random 
writes.

Hi list,
 We have hit and reproduced this issue several times: ceph will 
suicide because FileStore: sync_entry timed out after very heavy random IO on 
top of the RBD.
 My test environment is:
    A 4-node ceph cluster with 20 HDDs for OSDs and 4 
Intel DC S3700 SSDs for journals per node, that is 80 spindles in total.
    48 VMs spread across 12 physical nodes, with 48 RBDs 
attached to the VMs 1:1 via QEMU; the QEMU cache is disabled.
    Ceph @ 0.58
    XFS was used.
 I am running aiostress (something like FIO) inside the VMs to produce 
random write requests on top of each RBD.

 From ceph -w, ceph reports a very high op rate (1+ /s), but 
technically, 80 spindles can provide up to 150*80/2 = 6000 IOPS for 4K random 
write.
 When digging into the code, from FileStore.cc::_write() it's clear 
that the OSD opens object files without O_DIRECT, which means data writes are 
buffered by the page cache before the call returns. Although ::sync_file_range is 
called, with the flag "SYNC_FILE_RANGE_WRITE" this system call doesn't actually sync 
data to disk before it returns; it just initiates the write-out IOs. 
So the situation is: since all writes just go to the page cache, 
the backend OSD data disk *seems* extremely fast for random writes, so we 
can see such a high op rate from ceph -w. However, when the OSD sync thread tries to 
sync the FS, it uses ::syncfs(); before ::syncfs returns, the OS has to ensure 
that all dirty pages in the page cache (related to that particular FS) have been 
written to disk. This obviously takes a long time, and you can only expect 100 IOPS 
for a non-btrfs filesystem.  The performance gap exists there: an SSD journal can 
do 4K random writes @ 1K+ IOPS, but the 4 HDDs (journaled by the same SSD) 
can only provide 400 IOPS.
With the random write pressure continuing, the amount of dirty pages in 
the page cache keeps increasing; sooner or later ::syncfs() cannot return 
within 600s (the default value of filestore_commit_timeout) and the 
ASSERT is triggered, suiciding the ceph-osd process.

   I have tried to reproduce this with rados bench, but failed, because rados bench 
**creates** objects rather than modifying them, so a bucket of creates can be merged 
into a single big write. So I assume that if anyone would like to reproduce this issue, 
you have to use the QEMU or kernel client, with a fast journal (say tmpfs) and a slow 
data disk; choosing a small filestore_commit_timeout may also be helpful to 
reproduce this issue in a small-scale environment.
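
(A sketch of that reproduction recipe; aiostress is what was actually used, so 
fio with libaio inside a guest is assumed here as a stand-in, and /dev/vdb is a 
hypothetical name for the RBD-backed disk:)

# run inside each guest, against the attached RBD-backed block device
fio --name=rbd-randwrite \
    --filename=/dev/vdb \
    --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=600 --time_based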

 Could you please let me know if you need any more information, and whether you 
have any solutions? Thanks



    Xiaoxi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird problem with mkcephfs

2013-03-25 Thread Sage Weil
The keyring.* vs key.* distinction in mkcephfs appears correct.  Can you 
attach your ceph.conf?  It looks a bit like no daemons are defined.

sage


On Mon, 25 Mar 2013, Steve Carter wrote:

> Although it doesn't attempt to login to my other machines as I thought it was
> designed to do, as I know it did the last time I built a cluster.  Not sure
> what I'm doing wrong.
> 
> -Steve
> 
> On 03/23/2013 10:35 PM, Steve Carter wrote:
> > I changed:
> > 
> > for k in $dir/key.*
> > 
> > to:
> > 
> > for k in $dir/key*
> > 
> > and it appeared to run correctly:
> > 
> > root@smon:/etc/ceph# mkcephfs -a -c /etc/ceph/ceph.conf -d /tmp -k
> > /etc/ceph/keyring
> > preparing monmap in /tmp/monmap
> > /usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789 --print
> > /tmp/monmap
> > /usr/bin/monmaptool: monmap file /tmp/monmap
> > /usr/bin/monmaptool: generated fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
> > epoch 0
> > fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
> > last_changed 2013-03-23 22:33:26.254974
> > created 2013-03-23 22:33:26.254974
> > 0: 192.168.0.253:6789/0 mon.a
> > /usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
> > Building generic osdmap from /tmp/conf
> > /usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
> > /usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
> > Generating admin key at /tmp/keyring.admin
> > creating /tmp/keyring.admin
> > Building initial monitor keyring
> > placing client.admin keyring in /etc/ceph/keyring
> > 
> > On 03/23/2013 10:29 PM, Steve Carter wrote:
> > > The below part of the mkcephfs code seems responsible for this.
> > > specifically the 'for' loop below.  I wonder if I installed from the wrong
> > > place?  I installed from the ubuntu source rather than the ceph source.
> > > 
> > > # admin keyring
> > > echo Generating admin key at $dir/keyring.admin
> > > $BINDIR/ceph-authtool --create-keyring --gen-key -n client.admin
> > > $dir/keyring.admin
> > > 
> > > # mon keyring
> > > echo Building initial monitor keyring
> > > cp $dir/keyring.admin $dir/keyring.mon
> > > $BINDIR/ceph-authtool -n client.admin --set-uid=0 \
> > > --cap mon 'allow *' \
> > > --cap osd 'allow *' \
> > > --cap mds 'allow' \
> > > $dir/keyring.mon
> > > 
> > > $BINDIR/ceph-authtool --gen-key -n mon. $dir/keyring.mon
> > > 
> > > for k in $dir/key.*
> > > do
> > > kname=`echo $k | sed 's/.*key\.//'`
> > > ktype=`echo $kname | cut -c 1-3`
> > > kid=`echo $kname | cut -c 4- | sed 's/^\\.//'`
> > > kname="$ktype.$kid"
> > > secret=`cat $k`
> > > if [ "$ktype" = "osd" ]; then
> > > $BINDIR/ceph-authtool -n $kname --add-key $secret
> > > $dir/keyring.mon \
> > > --cap mon 'allow rwx' \
> > > --cap osd 'allow *'
> > > fi
> > > if [ "$ktype" = "mds" ]; then
> > > $BINDIR/ceph-authtool -n $kname --add-key $secret
> > > $dir/keyring.mon \
> > > --cap mon "allow rwx" \
> > > --cap osd 'allow *' \
> > > --cap mds 'allow'
> > > fi
> > > done
> > > 
> > > exit 0
> > > fi
> > > 
> > > 
> > > On 03/23/2013 01:50 PM, Steve Carter wrote:
> > > > This is consistently repeatable on my system.  This is the latest of two
> > > > cluster builds I have done. This is a brand new deployment on hardware I
> > > > haven't deployed on previously.
> > > > 
> > > > You see the error below is referencing /tmp/key.* and the keyring files
> > > > are actually keyring.*.
> > > > 
> > > > Any help is much appreciated.
> > > > 
> > > > root@mon:~# uname -a
> > > > Linux mon.X.com 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53
> > > > UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> > > > root@mon:~# ceph -v
> > > > ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
> > > > root@mon:~# ls -al /tmp/
> > > > total 8
> > > > drwxrwxrwx  2 root root 4096 Mar 23 12:15 .
> > > > drwxr-xr-x 25 root root 4096 Mar 22 23:18 ..
> > > > root@mon:~# ls -al / | grep tmp
> > > > drwxrwxrwx  2 root root  4096 Mar 23 12:15 tmp
> > > > root@mon:~# mkcephfs -d /tmp -a -c /etc/ceph/ceph.conf -k
> > > > /etc/ceph/keyring
> > > > preparing monmap in /tmp/monmap
> > > > /usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789
> > > > --print /tmp/monmap
> > > > /usr/bin/monmaptool: monmap file /tmp/monmap
> > > > /usr/bin/monmaptool: generated fsid 68b9c724-21c0-4d54-8237-674ced7adbfe
> > > > epoch 0
> > > > fsid 68b9c724-21c0-4d54-8237-674ced7adbfe
> > > > last_changed 2013-03-23 12:17:03.087018
> > > > created 2013-03-23 12:17:03.087018
> > > > 0: 192.168.0.253:6789/0 mon.a
> > > > /usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
> > > > Building generic osdmap from /tmp/conf
> > > > /usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
> > > > /usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
> > > > Generating admin key at /tmp/keyring.admin

Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Sage Weil
Hi Xiaoxi,

On Mon, 25 Mar 2013, Chen, Xiaoxi wrote:
>  From Ceph-w , ceph reports a very high Ops (1+ /s) , but
> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K random
> write.
> 
>  When digging into the code, I found that the OSD write data to
> Pagecache than returned, although it called ::sync_file_range, but this
syscall doesn't actually sync data to disk when it return,it's an aync call.
> So the situation is , the random write will be extremely fast since it only
> write to journal and pagecache, but once syncing , it will take very long
> time. The speed gap between journal and OSDs exist, the amount of data that
> need to be sync keep increasing, and it will certainly exceed 600s.

The sync_file_range is only there to push things to disk sooner, so that 
the eventual syncfs(2) takes less time.  When the async flushing is 
enabled, there is a limit to the number of flushes that are in the queue, 
but if it hits the max it just does

dout(10) << "queue_flusher ep " << sync_epoch << " fd " << fd << " " << off << "~" << len
         << " qlen " << flusher_queue_len
         << " hit flusher_max_fds " << m_filestore_flusher_max_fds
         << ", skipping async flush" << dendl;

Can you confirm that the filestore is taking this path?  (debug filestore 
= 10 and then reproduce.)

You may want to try

 filestore flusher = false
 filestore sync flush = true

and see if that changes things--it will make the sync_file_range() happen 
inline after the write.
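
(For reference, a sketch of where those two settings, plus the debug level 
mentioned above, might sit in ceph.conf; the [osd] placement and the comments 
are assumptions, the option names are as given above:)

[osd]
    debug filestore = 10          # verbose FileStore logging, to see whether the "skipping async flush" path is hit
    filestore flusher = false     # disable the async flusher queue
    filestore sync flush = true   # call sync_file_range() inline after each write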

Anyway, it sounds like you may be queueing up so many random writes that 
the sync takes forever.  I've never actually seen that happen, so if we 
can confirm that's what is going on that will be very interesting.

Thanks-
sage


> 
>  
> 
>  For more information, I have tried to reproduce this by rados
> bench,but failed.
> 
>  
> 
>  Could you please let me know if you need any more informations &
> have some solutions? Thanks
> 
>   
>        Xiaoxi
> 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD Capacity and Partitions for OSD Journals

2013-03-25 Thread Peter_Jung
Hi,

I have a couple of HW provisioning questions in regards to SSD for OSD Journals.

I'd like to provision 12 OSDs per a node and there are enough CPU clocks and 
Memory.
Each OSD is allocated one 3TB HDD for OSD data - these 12 * 3TB HDDs are in 
non-RAID.

To increase access and (sequential) write performance, I'd like to put in 2 
SSDs for OSD journals - these two SSDs are not mirrored.
As a rule of thumb, I'd like to point the OSD journals (the path below) at 
the SSD partitions accordingly.
/var/lib/ceph/osd/$cluster-$id/journal


Question 1.
Which way is recommended between:
(1) Partitions for OS/Boot and 6 OSD journals on #1 SSD, and partitions for the 
rest 6 OSD journals on #2 SSD;
(2) OS/Boot partition on #1 SSD, and separately 12 OSD journals on #2 SSD?
BTW, for better utilization of expensive SSDs, I prefer the first way. Would 
that be okay?

Question 2.
I have several capacity options for SSDs.
What's the capacity requirement if there are 6 partitions for 6 OSD journals on 
a SSD?
If it's hard to generalize, please provide me with some guidelines.

Thanks,
Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW fault tolerance

2013-03-25 Thread Rustam Aliyev

Hi Yehuda,

Thanks for reply, my comments below inline.

On 25/03/2013 04:32, Yehuda Sadeh wrote:

On Sun, Mar 24, 2013 at 7:14 PM, Rustam Aliyev  wrote:

Hi,

I was testing RadosGW setup and observed strange behavior - RGW becomes
unresponsive or won't start whenever cluster health is degraded (e.g.
restarting one of the OSDs). Probably I'm doing something wrong but I
couldn't find any information about this.

I'm running 0.56.3 on 3 node cluster (3xMON, 3xOSD). I increased replication
factor for rgw related pools so that cluster can survive single node failure
(quorum).
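
(A sketch of how that replication bump is typically done, assuming the standard 
'ceph osd pool set ... size' syntax; one command per rgw-related pool, matching 
the "rep size 3" entries in the dump below:)

ceph osd pool set .rgw size 3
ceph osd pool set .rgw.buckets size 3
# ... and likewise for the remaining .rgw.* and .users* pools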

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 256
pgp_num 256 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 256
pgp_num 256 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 256
pgp_num 256 last_change 1 owner 0
pool 3 'pbench' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 150
pgp_num 150 last_change 11 owner 0
pool 4 '.rgw' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 90
pgp_num 8 last_change 111 owner 0
pool 5 '.rgw.gc' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
pgp_num 8 last_change 112 owner 0
pool 6 '.rgw.control' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num
8 pgp_num 8 last_change 114 owner 0
pool 7 '.users.uid' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
pgp_num 8 last_change 117 owner 0
pool 8 '.users.email' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num
8 pgp_num 8 last_change 118 owner 0
pool 9 '.users' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
pgp_num 8 last_change 115 owner 0
pool 11 '.rgw.buckets' rep size 3 crush_ruleset 0 object_hash rjenkins
pg_num 1024 pgp_num 1024 last_change 108 owner 0

Any idea how to fix this?


We'll need some more specific info with regard to the actual scenario
in order to determine what exactly it is that you're seeing. What is the
exact scenario you're testing (an osd goes down?).

I'm just doing "service ceph stop osd" on one of the nodes

However, there are a
few things to note:
  - you have only 3 osds, which means that a single osd going down
affects a large portion of your data. How and what exactly happens
really depends on your configuration.
Configuration is quite simple, 3 osd and 3 monitos with default params: 
http://pastebin.com/LP3X7cf9

Note that it is quite possible that it takes
some time to determine that an osd went down.
I tested that scenario - it seems that you are right. It basically takes 
some time, but I'm not sure if that's expected. When I shut down the osd, 
rgw becomes unresponsive for 2 minutes. Then it works, even though health 
is degraded. After some time I brought the osd back (started it) and rgw became 
unresponsive again - this time, however, for 5 minutes. Then it started 
functioning again while pgs were recovering in the background.

It is expected that this osd gets 1/3 of the traffic, which means
that until there's a map change, the gateway will still try to contact
it.
Does that mean that rgw/rados waits for all replicas to acknowledge 
success? Is it possible to configure it so that a quorum is enough 
- i.e. 2 out of 3 replicas written successfully and rgw returns OK?

  - some of your pools contain a very small number of pgs (8). Probably
not related to your issue, but you'd want to change that.

Yes, I'm aware of that - just kept their default values for now.


Yehuda


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v0.56.4 released

2013-03-25 Thread Sage Weil
There have been several important fixes that we've backported to bobtail 
for issues users are hitting in the wild. Most notably, there was a problem with 
pool names containing - and _ that OpenStack users were hitting, and memory 
usage by ceph-osd and other daemons related to the trimming of in-memory logs. 
This and more is fixed in v0.56.4. We recommend that all bobtail users 
upgrade.

Notable changes include:

 * mon: fix bug in bringup with IPv6
 * reduce default memory utilization by internal logging (all daemons)
 * rgw: fix for bucket removal
 * rgw: reopen logs after log rotation
 * rgw: fix multipart upload listing
 * rgw: don't copy object when copied onto self
 * osd: fix caps parsing for pools with - or _
 * osd: allow pg log trimming when degraded, scrubbing, recovering 
   (reducing memory consumption)
 * osd: fix potential deadlock when 'journal aio = true'
 * osd: various fixes for collection creation/removal, rename, temp 
   collections
 * osd: various fixes for PG split
 * osd: deep-scrub omap key/value data
 * osd: fix rare bug in journal replay
 * osd: misc fixes for snapshot tracking
 * osd: fix leak in recovery reservations on pool deletion
 * osd: fix bug in connection management
 * osd: fix for op ordering when rebalancing
 * ceph-fuse: report file system size with correct units
 * mds: get and set directory layout policies via virtual xattrs
 * mkcephfs, init-ceph: close potential security issues with predictable 
   filenames

There is one minor change (fix) in the output to the 'ceph osd tree 
--format=json' command. Please see the full release notes.

You can get v0.56.4 from the usual locations:

 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.56.4.tar.gz
 * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
 * For RPMs, see http://ceph.com/docs/master/install/rpm
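
(A hedged sketch of a typical in-place upgrade on a Debian/Ubuntu bobtail node, 
assuming the ceph.com apt repository is already configured and the standard 
package/service names:)

sudo apt-get update
sudo apt-get install --only-upgrade ceph ceph-common
# then restart the daemons on each host, typically monitors first, then osds:
sudo service ceph restart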
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.56.4 released

2013-03-25 Thread Sage Weil
On Mon, 25 Mar 2013, Sage Weil wrote:
> There is one minor change (fix) in the output to the 'ceph osd tree 
> --format=json' command. Please see the full release notes.

Greg just reminded me about one additional note about upgrades (which 
should hopefully affect no one):

* The MDS disk format has changed from prior releases *and* from v0.57. In 
  particular, upgrades to v0.56.4 are safe, but you cannot move from 
  v0.56.4 to v0.57 if you are using the MDS for CephFS; you must upgrade
  directly to v0.58 (or later) instead.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi
Hi Sage,
Thanks for your mail. When I turn on filestore sync flush, it seems to work 
and the OSD process doesn't suicide any more. I had already disabled the flusher long 
ago, since both Mark's report and mine showed that disabling the flusher seems to 
improve performance (so my original configuration was filestore_flusher=false, 
filestore_sync_flush=false (the default)), but now we have to reconsider this. I 
would like to look at the internal code of ::sync_file_range() to learn more about 
how it works. My first guess is that ::sync_file_range pushes requests to the disk 
queue and, if the disk queue is full, the call will block and wait, but I am not sure.

But from the code path (BTW, these lines of code are a bit hard to 
follow):

        if (!should_flush || !m_filestore_flusher ||
            !queue_flusher(fd, offset, len)) {
            if (should_flush && m_filestore_sync_flush)
                ::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
            lfn_close(fd);
        }
With the default setting (m_filestore_flusher = true), the flusher 
queue will soon fill up. In that situation, if the user doesn't turn on 
"m_filestore_sync_flush = true", he/she will likely hit the same situation: writes 
remain in the page cache and the OSD daemon dies when trying to sync. I 
suppose the right logic should be (pseudocode):

        if (should_flush) {
            if (m_filestore_flusher) {
                if (queue_flusher(fd, offset, len))
                    ; // do nothing, the flusher thread takes it from here
                else
                    ::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
            } else if (m_filestore_sync_flush) {
                ::sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
            }
            lfn_close(fd);
        }



Xiaoxi
-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: March 25, 2013 23:35
To: Chen, Xiaoxi
Cc: 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com); 
ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random 
writes.

Hi Xiaoxi,

On Mon, 25 Mar 2013, Chen, Xiaoxi wrote:
>  From Ceph-w , ceph reports a very high Ops (1+ /s) , but 
> technically , 80 spindles can provide up to 150*80/2=6000 IOPS for 4K 
> random write.
> 
>  When digging into the code, I found that the OSD write data 
> to Pagecache than returned, although it called ::sync_file_range, but 
> this syscall doesn't actually sync data to disk when it return,it's an aync 
> call.
> So the situation is , the random write will be extremely fast since it 
> only write to journal and pagecache, but once syncing , it will take 
> very long time. The speed gap between journal and OSDs exist, the 
> amount of data that need to be sync keep increasing, and it will certainly 
> exceed 600s.

The sync_file_range is only there to push things to disk sooner, so that the 
eventual syncfs(2) takes less time.  When the async flushing is enabled, there 
is a limit to the number of flushes that are in the queue, but if it hits the 
max it just does

dout(10) << "queue_flusher ep " << sync_epoch << " fd " << fd << " " << off 
<< "~" << len
 << " qlen " << flusher_queue_len 
 << " hit flusher_max_fds " << m_filestore_flusher_max_fds
 << ", skipping async flush" << dendl;

Can you confirm that the filestore is taking this path?  (debug filestore = 10 
and then reproduce.)

You may want to try

 filestore flusher = false
 filestore sync flush = true

and see if that changes things--it will make the sync_file_range() happen 
inline after the write.

Anyway, it sounds like you may be queueing up so many random writes that the 
sync takes forever.  I've never actually seen that happen, so if we can confirm 
that's what is going on that will be very interesting.

Thanks-
sage


> 
>  
> 
>  For more information, I have tried to reproduce this by rados 
> bench,but failed.
> 
>  
> 
>  Could you please let me know if you need any more 
> informations & have some solutions? Thanks
> 
>   
>        Xiaoxi
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird problem with mkcephfs

2013-03-25 Thread Steve Carter

Sage,

Sure, here you go:

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
max open files = 4096

[mon]
mon data = /data/${name}
keyring = /data/${name}/keyring

[osd]
osd data = /data/${name}
keyring = /data/${name}/keyring
btrfs devs = /dev/disk/by-label/${name}-data
osd journal = /dev/sda_vg/${name}-journal

[mon.a]
hostname = smon
mon addr = 192.168.0.253:6789

[osd.0]
hostname = s1

[osd.1]
hostname = s1

[osd.2]
hostname = s1

[osd.3]
hostname = s1

[osd.4]
hostname = s1

[osd.5]
hostname = s1

[osd.6]
hostname = s2

[osd.7]
hostname = s2

[osd.8]
hostname = s2

[osd.9]
hostname = s2

[osd.10]
hostname = s2

[osd.11]
hostname = s2
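
(For comparison, and as an assumption on my part rather than something stated 
in the thread: bobtail-era daemon sections conventionally use the `host` key, 
which is also what mkcephfs -a reads to decide which machines to ssh into. A 
minimal sketch:)

[mon.a]
    host = smon
    mon addr = 192.168.0.253:6789

[osd.0]
    host = s1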


- Original Message -
> From: "Sage Weil" 
> To: "Steve Carter" 
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, March 25, 2013 8:26:54 AM
> Subject: Re: [ceph-users] Weird problem with mkcephfs
>
> They keyring.* vs key.* distinction in mkcephfs appears correct.  Can you
> attach your ceph.conf?  It looks a bit like no daemons are defined.
>
> sage
>
>
> On Mon, 25 Mar 2013, Steve Carter wrote:
>
> > Although it doesn't attempt to login to my other machines as I 
thought it

> > was
> > designed to do, as I know it did the last time I built a cluster.  
Not sure

> > what I'm doing wrong.
> >
> > -Steve
> >
> > On 03/23/2013 10:35 PM, Steve Carter wrote:
> > > I changed:
> > >
> > > for k in $dir/key.*
> > >
> > > to:
> > >
> > > for k in $dir/key*
> > >
> > > and it appeared to run correctly:
> > >
> > > root@smon:/etc/ceph# mkcephfs -a -c /etc/ceph/ceph.conf -d /tmp -k
> > > /etc/ceph/keyring
> > > preparing monmap in /tmp/monmap
> > > /usr/bin/monmaptool --create --clobber --add a 192.168.0.253:6789 
--print

> > > /tmp/monmap
> > > /usr/bin/monmaptool: monmap file /tmp/monmap
> > > /usr/bin/monmaptool: generated fsid 
46e4ae99-3df6-41ae-8d45-474c95b98852

> > > epoch 0
> > > fsid 46e4ae99-3df6-41ae-8d45-474c95b98852
> > > last_changed 2013-03-23 22:33:26.254974
> > > created 2013-03-23 22:33:26.254974
> > > 0: 192.168.0.253:6789/0 mon.a
> > > /usr/bin/monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
> > > Building generic osdmap from /tmp/conf
> > > /usr/bin/osdmaptool: osdmap file '/tmp/osdmap'
> > > /usr/bin/osdmaptool: writing epoch 1 to /tmp/osdmap
> > > Generating admin key at /tmp/keyring.admin
> > > creating /tmp/keyring.admin
> > > Building initial monitor keyring
> > > placing client.admin keyring in /etc/ceph/keyring
> > >
> > > On 03/23/2013 10:29 PM, Steve Carter wrote:
> > > > The below part of the mkcephfs code seems responsible for this.
> > > > specifically the 'for' loop below.  I wonder if I installed 
from the

> > > > wrong
> > > > place?  I installed from the ubuntu source rather than the ceph 
source.

> > > >
> > > > # admin keyring
> > > > echo Generating admin key at $dir/keyring.admin
> > > > $BINDIR/ceph-authtool --create-keyring --gen-key -n 
client.admin

> > > > $dir/keyring.admin
> > > >
> > > > # mon keyring
> > > > echo Building initial monitor keyring
> > > > cp $dir/keyring.admin $dir/keyring.mon
> > > > $BINDIR/ceph-authtool -n client.admin --set-uid=0 \
> > > > --cap mon 'allow *' \
> > > > --cap osd 'allow *' \
> > > > --cap mds 'allow' \
> > > > $dir/keyring.mon
> > > >
> > > > $BINDIR/ceph-authtool --gen-key -n mon. $dir/keyring.mon
> > > >
> > > > for k in $dir/key.*
> > > > do
> > > > kname=`echo $k | sed 's/.*key\.//'`
> > > > ktype=`echo $kname | cut -c 1-3`
> > > > kid=`echo $kname | cut -c 4- | sed 's/^\\.//'`
> > > > kname="$ktype.$kid"
> > > > secret=`cat $k`
> > > > if [ "$ktype" = "osd" ]; then
> > > > $BINDIR/ceph-authtool -n $kname --add-key $secret
> > > > $dir/keyring.mon \
> > > > --cap mon 'allow rwx' \
> > > > --cap osd 'allow *'
> > > > fi
> > > > if [ "$ktype" = "mds" ]; then
> > > > $BINDIR/ceph-authtool -n $kname --add-key $secret
> > > > $dir/keyring.mon \
> > > > --cap mon "allow rwx" \
> > > > --cap osd 'allow *' \
> > > > --cap mds 'allow'
> > > > fi
> > > > done
> > > >
> > > > exit 0
> > > > fi
> > > >
> > > >
> > > > On 03/23/2013 01:50 PM, Steve Carter wrote:
> > > > > This is consistently repeatable on my system.  This is the 
latest of

> > > > > two
> > > > > cluster builds I have done. This is a brand new deployment on
> > > > > hardware I
> > > > > haven't deployed on previously.
> > > > >
> > > > > You see the error below is referencing /tmp/key.* and the keyring
> > > > > files
> > > > > are actually keyring.*.
> > > > >
> > > > > Any help is much appreciated.
> > > > >
> > > > > root@mon:~# uname -a
> > > > > Linux mon.X.com 3.2.0-39-gener

Re: [ceph-users] SSD Capacity and Partitions for OSD Journals

2013-03-25 Thread Matthieu Patou

On 03/25/2013 04:07 PM, peter_j...@dell.com wrote:


Hi,

I have a couple of HW provisioning questions in regards to SSD for OSD 
Journals.


I’d like to provision 12 OSDs per a node and there are enough CPU 
clocks and Memory.


Each OSD is allocated one 3TB HDD for OSD data – these 12 * 3TB HDDs 
are in non-RAID.


For increasing access and (sequential) write performance, I’d like to 
put 2 SSDs for OSD journals – these two SSDs are not mirrored.


By the rule of thumb, I’d like to mount the OSD journals (the path 
below) to the “SSD partitions” accordingly.


_/var/lib/ceph/osd/$cluster-$id/journal_

Question 1.

Which way is recommended between:

(1) Partitions for OS/Boot and 6 OSD journals on #1 SSD, and 
partitions for the rest 6 OSD journals on #2 SSD;


(2) OS/Boot partition on #1 SSD, and separately 12 OSD journals on #2 SSD?

BTW, for better utilization of expensive SSDs, I prefer the first way. 
Should it be okay?


Question 2.

I have several capacity options for SSDs.

What’s the capacity requirement if there are 6 partitions for 6 OSD 
journals on a SSD?


If it’s hard to generalize, please provide me with some guidelines.

The journal size is configurable in /etc/ceph/ceph.conf, so if you have a 
journal size of 10 000 (MB), you'll need 10G for 1 journal, and so for 6 it should 
be 60G; add a safety factor (i.e. 20%) and you should be ok.
The size itself is 2 * desired throughput * interval between syncs, so if 
you have a hard drive that is able to do 100MB/s and want an 
interval of 50s (not sure that's highly recommended), then you'll need 2 * 
100 * 50 = 10 000 (MB) as the journal size.
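
(A small sketch of that arithmetic and the resulting ceph.conf value; the 
100 MB/s and 50 s figures are just the example above, and `osd journal size` 
is expressed in MB:)

# journal size (MB) = 2 * throughput (MB/s) * interval between syncs (s)
throughput_mb_s=100
sync_interval_s=50
echo $(( 2 * throughput_mb_s * sync_interval_s ))   # -> 10000

# which corresponds to the following in ceph.conf:
#   [osd]
#       osd journal size = 10000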


Matthieu.


Thanks,

Peter



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com