Re: [ceph-users] CephFS metadata pool size

2016-09-26 Thread David
Ryan, a team at eBay recently did some metadata testing; have a search on
this list. Pretty sure they found there wasn't a huge benefit in putting
the metadata pool on solid state. As Christian says, it's all about RAM and
CPU. You want to get as many inodes into cache as possible.
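
As a rough illustration of the ~2KB-per-object rule of thumb Christian quotes
below, the arithmetic is easy to sketch in shell; the file count and
replication factor here are assumptions, not numbers from Ryan's cluster:

  files=100000000                  # assumed: 100 million files
  per_file=2048                    # ~2KB of metadata per file/object (rule of thumb below)
  replicas=3                       # assumed metadata pool replication
  echo "$(( files * per_file * replicas / 1024**3 )) GB raw"   # -> 572 GB raw in this example

Even with a generous file count, in this example the metadata pool stays small
next to 800 TB of data capacity; the OSD filesystem overhead mentioned below
comes on top of that.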

On 26 Sep 2016 2:09 a.m., "Christian Balzer"  wrote:

>
> Hello,
>
> On Sun, 25 Sep 2016 19:51:25 -0400 (EDT) Tyler Bishop wrote:
>
> > 800TB of NVMe?  That sounds wonderful!
> >
> That's not what he wrote at all.
> 800TB capacity, of which the meta-data will likely be a small fraction.
>
> As for the OP, try your google foo on the ML archives, this of course has
> been discussed before.
> See the "CephFS in the wild" thread 3 months ago for example.
>
> In short, you need to have an idea of the number of files and calculate
> 2KB per object (file).
> Plus some overhead for the underlying OSD FS, for the time being at least.
>
> And while having the meta-data pool on fast storage certainly won't hurt,
> the consensus here seems to be that the CPU (few, fast cores) and RAM of
> the MDS have a much higher priority/benefit.
>
> Christian
> >
> > - Original Message -
> > From: "Ryan Leimenstoll" 
> > To: "ceph new" 
> > Sent: Saturday, September 24, 2016 5:37:08 PM
> > Subject: [ceph-users] CephFS metadata pool size
> >
> > Hi all,
> >
> > We are in the process of expanding our current Ceph deployment (Jewel,
> 10.2.2) to incorporate CephFS for fast, network attached scratch storage.
> We are looking to have the metadata pool exist entirely on SSDs (or NVMe),
> however I am not sure how big to expect this pool to grow to. Is there any
> good rule of thumb or guidance to getting an estimate on this before
> purchasing hardware? We are expecting upwards of 800T usable capacity at
> the start.
> >
> > Thanks for any insight!
> >
> > Ryan Leimenstoll
> > rleim...@umiacs.umd.edu
> > University of Maryland Institute for Advanced Computer Studies
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS metadata pool size

2016-09-26 Thread Christian Balzer

Hello,

On Mon, 26 Sep 2016 08:28:02 +0100 David wrote:

> Ryan, a team at Ebay recently did some metadata testing, have a search on
> this list. Pretty sure they found there wasn't a huge benefit it putting
> the metadata pool on solid. As Christian says, it's all about ram and Cpu.
> You want to get as many inodes into cache as possible.
> 
This is the slide show / test in question, btw:
http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark

> On 26 Sep 2016 2:09 a.m., "Christian Balzer"  wrote:
> 
> >
> > Hello,
> >
> > On Sun, 25 Sep 2016 19:51:25 -0400 (EDT) Tyler Bishop wrote:
> >
> > > 800TB of NVMe?  That sounds wonderful!
> > >
> > That's not what he wrote at all.
> > 800TB capacity, of which the meta-data will likely be a small fraction.
> >
> > As for the OP, try your google foo on the ML archives, this of course has
> > been discussed before.
> > See the "CephFS in the wild" thread 3 months ago for example.
> >
> > In short, you need to have an idea of the number of files and calculate
> > 2KB per object (file).
> > Plus some overhead for the underlying OSD FS, for the time being at least.
> >
> > And while having the meta-data pool on fast storage certainly won't hurt,
> > the consensus here seems to be that the CPU (few, fast cores) and RAM of
> > the MDS have a much higher priority/benefit.
> >
> > Christian
> > >
> > > - Original Message -
> > > From: "Ryan Leimenstoll" 
> > > To: "ceph new" 
> > > Sent: Saturday, September 24, 2016 5:37:08 PM
> > > Subject: [ceph-users] CephFS metadata pool size
> > >
> > > Hi all,
> > >
> > > We are in the process of expanding our current Ceph deployment (Jewel,
> > 10.2.2) to incorporate CephFS for fast, network attached scratch storage.
> > We are looking to have the metadata pool exist entirely on SSDs (or NVMe),
> > however I am not sure how big to expect this pool to grow to. Is there any
> > good rule of thumb or guidance to getting an estimate on this before
> > purchasing hardware? We are expecting upwards of 800T usable capacity at
> > the start.
> > >
> > > Thanks for any insight!
> > >
> > > Ryan Leimenstoll
> > > rleim...@umiacs.umd.edu
> > > University of Maryland Institute for Advanced Computer Studies
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph repo is broken, no repodata at all

2016-09-26 Thread Chengwei Yang
On Fri, Sep 23, 2016 at 09:31:46AM +0200, Wido den Hollander wrote:
> 
> > Op 23 september 2016 om 5:59 schreef Chengwei Yang 
> > :
> > 
> > 
> > Hi list,
> > 
> > I found that ceph repo is broken these days, no any repodata in the repo at 
> > all.
> > 
> > http://us-east.ceph.com/rpm-jewel/el7/x86_64/repodata/
> > 
> > it's just empty, so how can I install ceph rpms from yum?
> > 
> 
> Thanks for the report! I contacted the mirror admin for you asking him to 
> check out why these files are not present.

Thanks Wido, expect it will be fixed soon.

> 
> Wido
> 
> > A workaround is I synced all the rpms to local and create repodata with
> > createrepo command, but I think the upstream has to be fixed.
> > 
> > -- 
> > Thanks,
> > Chengwei
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Thanks,
Chengwei
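
A minimal sketch of the local-mirror workaround mentioned above, with the repo
URL taken from the thread; the local path, the wget options and the .repo
wiring are assumptions, not the exact commands that were used:

  mkdir -p /srv/repo/ceph-jewel
  wget -r -np -nH --cut-dirs=3 -A '*.rpm' \
      http://us-east.ceph.com/rpm-jewel/el7/x86_64/ -P /srv/repo/ceph-jewel
  createrepo /srv/repo/ceph-jewel      # build the missing repodata locally
  # then point a yum .repo entry at baseurl=file:///srv/repo/ceph-jewel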


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-26 Thread Ilya Dryomov
On Mon, Sep 26, 2016 at 8:39 AM, Nikolay Borisov  wrote:
>
>
> On 09/22/2016 06:36 PM, Ilya Dryomov wrote:
>> On Thu, Sep 15, 2016 at 3:18 PM, Ilya Dryomov  wrote:
>>> On Thu, Sep 15, 2016 at 2:43 PM, Nikolay Borisov  wrote:

 [snipped]

 cat /sys/bus/rbd/devices/47/client_id
 client157729
 cat /sys/bus/rbd/devices/1/client_id
 client157729

 Client client157729 is alxc13, based on correlation by the ip address
 shown by the rados -p ... command. So it's the only client where the rbd
 images are mapped.
>>>
>>> Well, the watches are there, but cookie numbers indicate that they may
>>> have been re-established, so that's inconclusive.
>>>
>>> My suggestion would be to repeat the test and do repeated freezes to
>>> see if snapshot continues to follow HEAD.
>>>
>>> Further, to rule out a missed snap context update, repeat the test, but
>>> stick
>>>
>>> # echo 1 >/sys/bus/rbd/devices//refresh
>>>
>>> after "rbd snap create" (for the today's test, ID_OF_THE_ORIG_DEVICE
>>> would be 47).
>>
>> Hi Nikolay,
>>
>> Any news on this?
>
> Hello,
>
> I was on holiday hence the radio silence. Here is the latest set of
> tests that were run:
>
> Results:
>
> c11579 (100GB - used: 83GB):
> root@alxc13:~# rbd showmapped |grep c11579
> 47  rbd  c11579 -/dev/rbd47
> root@alxc13:~# fsfreeze -f /var/lxc/c11579
> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 686.382 s, 156 MB/s
> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
> root@alxc13:~# rbd snap create rbd/c11579@snap_test
> root@alxc13:~# rbd map c11579@snap_test
> /dev/rbd1
> root@alxc13:~# echo 1 >/sys/bus/rbd/devices/47/refresh
> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 915.225 s, 117 MB/s
> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 863.464 s, 143 MB/s
> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
> root@alxc13:~# file -s /dev/rbd1
> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files)
> (huge files)
> root@alxc13:~# fsfreeze -u /var/lxc/c11579
> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 730.243 s, 147 MB/s
> 65294ce9eae5694a56054ec4af011264  /dev/fd/63
> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 649.373 s, 165 MB/s
> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>
> 30min later:
> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 648.328 s, 166 MB/s
> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>
>
>
> c12607 (30GB - used: 4GB):
> root@alxc13:~# rbd showmapped |grep c12607
> 39  rbd  c12607 -/dev/rbd39
> root@alxc13:~# fsfreeze -f /var/lxc/c12607
> root@alxc13:~# md5sum <(dd if=/dev/rbd39 iflag=direct bs=8M)
> 3840+0 records in
> 3840+0 records out
> 32212254720 bytes (32 GB) copied, 228.2 s, 141 MB/s
> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
> root@alxc13:~# rbd snap create rbd/c12607@snap_test
> root@alxc13:~# rbd map c12607@snap_test
> /dev/rbd21
> root@alxc13:~# rbd snap protect rbd/c12607@snap_test
> root@alxc13:~# echo 1 >/sys/bus/rbd/devices/39/refresh
> root@alxc13:~# md5sum <(dd if=/dev/rbd39 iflag=direct bs=8M)
> 3840+0 records in
> 3840+0 records out
> 32212254720 bytes (32 GB) copied, 217.138 s, 148 MB/s
> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
> root@alxc13:~# md5sum <(dd if=/dev/rbd21 iflag=direct bs=8M)
> 3840+0 records in
> 3840+0 records out
> 32212254720 bytes (32 GB) copied, 212.254 s, 152 MB/s
> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
> root@alxc13:~# file -s /dev/rbd21
> /dev/rbd21: Linux rev 1.0 ext4 filesystem data (extents) (large files)
> (huge files)
> root@alxc13:~# fsfreeze -u /var/lxc/c12607
> root@alxc13:~# md5sum <(dd if=/dev/rbd39 iflag=direct bs=8M)
> 3840+0 records in
> 3840+0 records out
> 32212254720 bytes (32 GB) copied, 322.964 s, 99.7 MB/s
> 71c5efc24162452473cda50155cd4399  /dev/fd/63
> root@alxc13:~# md5sum <(dd if=/dev/rbd21 iflag=direct bs=8M)
> 3840+0 records in
> 3840+0 records out
> 32212254720 bytes (32 GB) copied, 326.273 s, 98.7 MB/s
> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
> root@alxc13:~# file -s /dev/rbd21
> /dev/rbd21: Linux rev 1.0 ext4 filesystem data (extents) (large files)
> (huge files)
> root@alxc13:~#
>
> 30min later:
> root@alxc13:~# md5sum <(dd if=/dev/rbd21 iflag=direct bs=8M)
> 3840+0 records in
> 3840+0 records out
> 32212254720 bytes (32 GB) copied, 359.917 s, 89.5 MB/s
> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
>
> Everything seems consistent, but 

Re: [ceph-users] CephFS metadata pool size

2016-09-26 Thread John Spray
On Mon, Sep 26, 2016 at 8:28 AM, David  wrote:
> Ryan, a team at Ebay recently did some metadata testing, have a search on
> this list. Pretty sure they found there wasn't a huge benefit it putting the
> metadata pool on solid. As Christian says, it's all about ram and Cpu. You
> want to get as many inodes into cache as possible.

This is generally good advice, but we're not quite at the point of
saying using SSDs is not useful.  In the case where the working set
exceeds the MDS cache size, and the metadata access is random, the MDS
will generate a large quantity of small latency sensitive IOs to read
in metadata.  In that kind of case, having those reads going to
dedicated SSDs (as opposed to the same spindles as bulk data) may well
give better (and certainly more predictable) performance.

As always this is a relatively young area and empirical testing is needed.

John
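
One way to tell whether a workload is in the cache-miss regime John describes
is to watch the MDS admin socket counters; a sketch for Jewel, with the daemon
name mds.a being an assumption:

  ceph daemon mds.a perf dump | grep -E '"inodes"|"inodes_expired"'   # inodes currently cached / expired
  ceph daemon mds.a config show | grep mds_cache_size                 # Jewel cache limit is an inode count

If the cached inode count sits pinned at mds_cache_size while clients keep
missing, the working set probably does not fit and metadata pool latency
starts to matter.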

>
>
> On 26 Sep 2016 2:09 a.m., "Christian Balzer"  wrote:
>>
>>
>> Hello,
>>
>> On Sun, 25 Sep 2016 19:51:25 -0400 (EDT) Tyler Bishop wrote:
>>
>> > 800TB of NVMe?  That sounds wonderful!
>> >
>> That's not what he wrote at all.
>> 800TB capacity, of which the meta-data will likely be a small fraction.
>>
>> As for the OP, try your google foo on the ML archives, this of course has
>> been discussed before.
>> See the "CephFS in the wild" thread 3 months ago for example.
>>
>> In short, you need to have an idea of the number of files and calculate
>> 2KB per object (file).
>> Plus some overhead for the underlying OSD FS, for the time being at least.
>>
>> And while having the meta-data pool on fast storage certainly won't hurt,
>> the consensus here seems to be that the CPU (few, fast cores) and RAM of
>> the MDS have a much higher priority/benefit.
>>
>> Christian
>> >
>> > - Original Message -
>> > From: "Ryan Leimenstoll" 
>> > To: "ceph new" 
>> > Sent: Saturday, September 24, 2016 5:37:08 PM
>> > Subject: [ceph-users] CephFS metadata pool size
>> >
>> > Hi all,
>> >
>> > We are in the process of expanding our current Ceph deployment (Jewel,
>> > 10.2.2) to incorporate CephFS for fast, network attached scratch storage. 
>> > We
>> > are looking to have the metadata pool exist entirely on SSDs (or NVMe),
>> > however I am not sure how big to expect this pool to grow to. Is there any
>> > good rule of thumb or guidance to getting an estimate on this before
>> > purchasing hardware? We are expecting upwards of 800T usable capacity at 
>> > the
>> > start.
>> >
>> > Thanks for any insight!
>> >
>> > Ryan Leimenstoll
>> > rleim...@umiacs.umd.edu
>> > University of Maryland Institute for Advanced Computer Studies
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>> --
>> Christian BalzerNetwork/Systems Engineer
>> ch...@gol.com   Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-26 Thread Ilya Dryomov
On Mon, Sep 26, 2016 at 11:13 AM, Ilya Dryomov  wrote:
> On Mon, Sep 26, 2016 at 8:39 AM, Nikolay Borisov  wrote:
>>
>>
>> On 09/22/2016 06:36 PM, Ilya Dryomov wrote:
>>> On Thu, Sep 15, 2016 at 3:18 PM, Ilya Dryomov  wrote:
 On Thu, Sep 15, 2016 at 2:43 PM, Nikolay Borisov  wrote:
>
> [snipped]
>
> cat /sys/bus/rbd/devices/47/client_id
> client157729
> cat /sys/bus/rbd/devices/1/client_id
> client157729
>
> Client client157729 is alxc13, based on correlation by the ip address
> shown by the rados -p ... command. So it's the only client where the rbd
> images are mapped.

 Well, the watches are there, but cookie numbers indicate that they may
 have been re-established, so that's inconclusive.

 My suggestion would be to repeat the test and do repeated freezes to
 see if snapshot continues to follow HEAD.

 Further, to rule out a missed snap context update, repeat the test, but
 stick

 # echo 1 >/sys/bus/rbd/devices//refresh

 after "rbd snap create" (for the today's test, ID_OF_THE_ORIG_DEVICE
 would be 47).
>>>
>>> Hi Nikolay,
>>>
>>> Any news on this?
>>
>> Hello,
>>
>> I was on holiday hence the radio silence. Here is the latest set of
>> tests that were run:
>>
>> Results:
>>
>> c11579 (100GB - used: 83GB):
>> root@alxc13:~# rbd showmapped |grep c11579
>> 47  rbd  c11579 -/dev/rbd47
>> root@alxc13:~# fsfreeze -f /var/lxc/c11579
>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 686.382 s, 156 MB/s
>> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>> root@alxc13:~# rbd snap create rbd/c11579@snap_test
>> root@alxc13:~# rbd map c11579@snap_test
>> /dev/rbd1
>> root@alxc13:~# echo 1 >/sys/bus/rbd/devices/47/refresh
>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 915.225 s, 117 MB/s
>> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 863.464 s, 143 MB/s
>> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>> root@alxc13:~# file -s /dev/rbd1
>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files)
>> (huge files)
>> root@alxc13:~# fsfreeze -u /var/lxc/c11579
>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 730.243 s, 147 MB/s
>> 65294ce9eae5694a56054ec4af011264  /dev/fd/63
>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 649.373 s, 165 MB/s
>> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>>
>> 30min later:
>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 648.328 s, 166 MB/s
>> f2edb5abb100de30c1301b0856e595aa  /dev/fd/63
>>
>>
>>
>> c12607 (30GB - used: 4GB):
>> root@alxc13:~# rbd showmapped |grep c12607
>> 39  rbd  c12607 -/dev/rbd39
>> root@alxc13:~# fsfreeze -f /var/lxc/c12607
>> root@alxc13:~# md5sum <(dd if=/dev/rbd39 iflag=direct bs=8M)
>> 3840+0 records in
>> 3840+0 records out
>> 32212254720 bytes (32 GB) copied, 228.2 s, 141 MB/s
>> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
>> root@alxc13:~# rbd snap create rbd/c12607@snap_test
>> root@alxc13:~# rbd map c12607@snap_test
>> /dev/rbd21
>> root@alxc13:~# rbd snap protect rbd/c12607@snap_test
>> root@alxc13:~# echo 1 >/sys/bus/rbd/devices/39/refresh
>> root@alxc13:~# md5sum <(dd if=/dev/rbd39 iflag=direct bs=8M)
>> 3840+0 records in
>> 3840+0 records out
>> 32212254720 bytes (32 GB) copied, 217.138 s, 148 MB/s
>> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
>> root@alxc13:~# md5sum <(dd if=/dev/rbd21 iflag=direct bs=8M)
>> 3840+0 records in
>> 3840+0 records out
>> 32212254720 bytes (32 GB) copied, 212.254 s, 152 MB/s
>> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
>> root@alxc13:~# file -s /dev/rbd21
>> /dev/rbd21: Linux rev 1.0 ext4 filesystem data (extents) (large files)
>> (huge files)
>> root@alxc13:~# fsfreeze -u /var/lxc/c12607
>> root@alxc13:~# md5sum <(dd if=/dev/rbd39 iflag=direct bs=8M)
>> 3840+0 records in
>> 3840+0 records out
>> 32212254720 bytes (32 GB) copied, 322.964 s, 99.7 MB/s
>> 71c5efc24162452473cda50155cd4399  /dev/fd/63
>> root@alxc13:~# md5sum <(dd if=/dev/rbd21 iflag=direct bs=8M)
>> 3840+0 records in
>> 3840+0 records out
>> 32212254720 bytes (32 GB) copied, 326.273 s, 98.7 MB/s
>> e6ce3ea688a778b9c732041164b4638c  /dev/fd/63
>> root@alxc13:~# file -s /dev/rbd21
>> /dev/rbd21: Linux rev 1.0 ext4 filesystem data (extents) (large files)
>> (huge files)
>> root@alxc13:~#
>>
>> 30min later:
>> root@alxc13:~# md5sum <(dd if=/dev/rbd21 iflag=direc

Re: [ceph-users] rgw multi-site replication issues

2016-09-26 Thread Orit Wasserman
Hi John,
Can you provide:
radosgw-admin zonegroupmap get on both us-dfw and us-phx?
radosgw-admin realm get and radosgw-admin period get on all the gateways?

Orit
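
For anyone collecting the same data, a sketch of gathering what Orit asks for
on each gateway host, redirected to files so it can be shared; the hyphenated
zonegroup-map spelling is an assumption about the local radosgw-admin version,
use whichever form your build accepts:

  radosgw-admin realm get          > realm.json
  radosgw-admin period get         > period.json
  radosgw-admin zonegroup-map get  > zonegroupmap.json
  radosgw-admin zonegroup get      > zonegroup.json
  radosgw-admin zone get           > zone.json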

On Thu, Sep 22, 2016 at 4:37 PM, John Rowe  wrote:
> Hello Orit, thanks.
>
> I will do all 6 just in case. Also as an FYI I originally had all 6 as
> endpoint (3 in each zone) but have it down to just the two "1" servers
> talking to each other until I can get it working. Eventually I would like to
> have all 6 cross connecting again.
>
> rgw-primary-1:
> radosgw-admin zonegroup get
> {
> "id": "235b010c-22e2-4b43-8fcc-8ae01939273e",
> "name": "us",
> "api_name": "us",
> "is_master": "true",
> "endpoints": [
> "http:\/\/LB_FQDN:80"
> ],
> "hostnames": [],
> "hostnames_s3website": [],
> "master_zone": "6c830b44-4e39-4e19-9bd8-03c37c2021f2",
> "zones": [
> {
> "id": "58aa3eef-fc1f-492c-a08e-9c6019e7c266",
> "name": "us-phx",
> "endpoints": [
> "http:\/\/PHX-RGW-1:80"
> ],
> "log_meta": "false",
> "log_data": "true",
> "bucket_index_max_shards": 0,
> "read_only": "false"
> },
> {
> "id": "6c830b44-4e39-4e19-9bd8-03c37c2021f2",
> "name": "us-dfw",
> "endpoints": [
> "http:\/\/DFW-RGW-1:80"
> ],
> "log_meta": "true",
> "log_data": "true",
> "bucket_index_max_shards": 0,
> "read_only": "false"
> }
> ],
> "placement_targets": [
> {
> "name": "default-placement",
> "tags": []
> }
> ],
> "default_placement": "default-placement",
> "realm_id": "3af93a86-916a-490f-b38f-17922b472b19"
> }
> radosgw-admin zone get
> {
> "id": "6c830b44-4e39-4e19-9bd8-03c37c2021f2",
> "name": "us-dfw",
> "domain_root": "us-dfw.rgw.data.root",
> "control_pool": "us-dfw.rgw.control",
> "gc_pool": "us-dfw.rgw.gc",
> "log_pool": "us-dfw.rgw.log",
> "intent_log_pool": "us-dfw.rgw.intent-log",
> "usage_log_pool": "us-dfw.rgw.usage",
> "user_keys_pool": "us-dfw.rgw.users.keys",
> "user_email_pool": "us-dfw.rgw.users.email",
> "user_swift_pool": "us-dfw.rgw.users.swift",
> "user_uid_pool": "us-dfw.rgw.users.uid",
> "system_key": {
> "access_key": "SYSTEM_ACCESS_KEY",
> "secret_key": "SYSTEM_SECRET_KEY"
> },
> "placement_pools": [
> {
> "key": "default-placement",
> "val": {
> "index_pool": "us-dfw.rgw.buckets.index",
> "data_pool": "us-dfw.rgw.buckets.data",
> "data_extra_pool": "us-dfw.rgw.buckets.non-ec",
> "index_type": 0
> }
> }
> ],
> "metadata_heap": "us-dfw.rgw.meta",
> "realm_id": "3af93a86-916a-490f-b38f-17922b472b19"
> }
>
> 
> rgw-primary-2:
> radosgw-admin zonegroup get
> {
> "id": "235b010c-22e2-4b43-8fcc-8ae01939273e",
> "name": "us",
> "api_name": "us",
> "is_master": "true",
> "endpoints": [
> "http:\/\/LB_FQDN:80"
> ],
> "hostnames": [],
> "hostnames_s3website": [],
> "master_zone": "6c830b44-4e39-4e19-9bd8-03c37c2021f2",
> "zones": [
> {
> "id": "58aa3eef-fc1f-492c-a08e-9c6019e7c266",
> "name": "us-phx",
> "endpoints": [
> "http:\/\/PHX-RGW-1:80"
> ],
> "log_meta": "false",
> "log_data": "true",
> "bucket_index_max_shards": 0,
> "read_only": "false"
> },
> {
> "id": "6c830b44-4e39-4e19-9bd8-03c37c2021f2",
> "name": "us-dfw",
> "endpoints": [
> "http:\/\/DFW-RGW-1:80"
> ],
> "log_meta": "true",
> "log_data": "true",
> "bucket_index_max_shards": 0,
> "read_only": "false"
> }
> ],
> "placement_targets": [
> {
> "name": "default-placement",
> "tags": []
> }
> ],
> "default_placement": "default-placement",
> "realm_id": "3af93a86-916a-490f-b38f-17922b472b19"
> }
>
> radosgw-admin zone get
> {
> "id": "6c830b44-4e39-4e19-9bd8-03c37c2021f2",
> "name": "us-dfw",
> "domain_root": "us-dfw.rgw.data.root",
> "control_pool": "us-dfw.rgw.control",
> "gc_pool": "us-dfw.rgw.gc",
> "log_pool": "us-dfw.rgw.log",
> "intent_log_pool": "us-dfw.rgw.intent-log",
> "usage_log_pool": "us-dfw.rgw.usage",
> "user_keys_pool": "us-dfw.rgw.users.keys",
> "user_email_pool": "us-dfw.rgw.users.email",
> "user_swift_pool": "us-dfw.rgw.users.swift",
> "user_uid_pool": "us-dfw.rgw.users.uid",
> "system_key": {
> "access_key": "SYSTEM_ACCESS_KEY",
> "secret_key": "SYSTEM_SECRET_KEY"
> },
> "placement_pools": [

[ceph-users] Bcache, partitions and BlueStore

2016-09-26 Thread Wido den Hollander
Hi,

This has been discussed on the ML before [0], but I would like to bring this up 
again with the outlook towards BlueStore.

Bcache [1] allows for block device level caching in Linux. This can be 
read/write(back) and vastly improves read and write performance to a block 
device.

With the current layout of Ceph with FileStore you can already use bcache, but 
not with ceph-disk.

The reason is that bcache currently does not support creating partitions on 
those devices. There are patches [2] out there, but they are not upstream.

I haven't tested it yet, but it looks like BlueStore can still benefit quite 
well from bcache, and it would be a lot easier if the patches [2] were merged 
upstream.

This way you would have:

- bcache0p1: XFS/EXT4 OSD metadata
- bcache0p2: RocksDB
- bcache0p3: RocksDB WAL
- bcache0p4: BlueStore DATA

With bcache you could create multiple bcache devices by creating partitions on 
the backing disk and creating bcache devices for all of them, but that's a lot 
of work and not easy to automate with ceph-disk.
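
Until bcache devices can be partitioned, that per-partition workaround looks
roughly like this; the device names and partition sizes are assumptions, and
the GPT type codes ceph-disk expects are left out:

  sgdisk -n 1:0:+10G -n 2:0:+10G -n 3:0:+2G -n 4:0:0 /dev/sdb   # carve up the backing HDD
  make-bcache -C /dev/nvme0n1p1                                 # cache set on the fast device
  for part in /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4; do
      make-bcache -B "$part"                                    # one bcache device per partition
  done
  # attach each bcacheN to the cache set:
  # echo <cset-uuid> > /sys/block/bcacheN/bcache/attach

It works, but it multiplies the number of devices that have to be handed to
ceph-disk, which is exactly why partitions on a single bcache device would be
nicer.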

So what I'm trying to find is the best route to get this upstream in the Linux 
kernel. That way next year when BlueStore becomes the default in L (luminous) 
users can use bcache underneath BlueStore easily.

Does anybody know the proper route we need to take to get this fixed upstream? 
Does anyone have contacts with the bcache developers?

Thanks!

Wido

[0]: http://www.spinics.net/lists/ceph-devel/msg29550.html
[1]: https://bcache.evilpiepirate.org/
[2]: https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph full cluster

2016-09-26 Thread Dmitriy Lock
Hello all!
I need some help with my Ceph cluster.
I've installed a Ceph cluster on two physical servers, each with a 40G OSD on
/data.
Here is ceph.conf:
[global]
fsid = 377174ff-f11f-48ec-ad8b-ff450d43391c
mon_initial_members = vm35, vm36
mon_host = 192.168.1.35,192.168.1.36
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

osd pool default size = 2  # Write an object 2 times.
osd pool default min size = 1 # Allow writing one copy in a degraded state.

osd pool default pg num = 200
osd pool default pgp num = 200

Right after creation it was HEALTH_OK, and I started filling it.
I wrote 40G of data to the cluster using the RADOS gateway, but the cluster
used all available space and kept growing even after I added two more OSDs,
a 10G /data1 on each server.
Here is tree output:
# ceph osd tree
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.09756 root default
-2 0.04878 host vm35
0 0.03899 osd.0  up  1.0  1.0
2 0.00980 osd.2  up  1.0  1.0
-3 0.04878 host vm36
1 0.03899 osd.1  up  1.0  1.0
3 0.00980 osd.3  up  1.0  1.0

and health:
root@vm35:/etc# ceph health
HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck unclean;
15 pgs undersized; recovery 87176/300483 objects degraded (29.012%);
recovery 62272/300483 obj
ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool
default.rgw.buckets.data has many more objects per pg than average (too few
pgs?)
root@vm35:/etc# ceph health detail
HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck unclean;
15 pgs undersized; recovery 87176/300483 objects degraded (29.012%);
recovery 62272/300483 obj
ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool
default.rgw.buckets.data has many more objects per pg than average (too few
pgs?)
pg 10.5 is stuck unclean since forever, current state
active+undersized+degraded, last acting [1,0]
pg 9.6 is stuck unclean since forever, current state
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.4 is stuck unclean since forever, current state active+remapped, last
acting [3,0,1]
pg 9.7 is stuck unclean since forever, current state
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.7 is stuck unclean since forever, current state
active+undersized+degraded+remapped+backfill_toofull, last acting [0,1]
pg 9.4 is stuck unclean since forever, current state
active+undersized+degraded, last acting [1,0]
pg 9.1 is stuck unclean since forever, current state
active+undersized+degraded, last acting [0,3]
pg 10.2 is stuck unclean since forever, current state
active+undersized+degraded, last acting [1,0]
pg 9.0 is stuck unclean since forever, current state
active+undersized+degraded, last acting [1,2]
pg 10.3 is stuck unclean since forever, current state
active+undersized+degraded, last acting [2,1]
pg 9.3 is stuck unclean since forever, current state
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.0 is stuck unclean since forever, current state
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 9.2 is stuck unclean since forever, current state
active+undersized+degraded, last acting [0,1]
pg 10.1 is stuck unclean since forever, current state
active+undersized+degraded, last acting [0,1]
pg 9.5 is stuck unclean since forever, current state
active+undersized+degraded, last acting [1,0]
pg 10.6 is stuck unclean since forever, current state
active+undersized+degraded, last acting [0,1]
pg 9.1 is active+undersized+degraded, acting [0,3]
pg 10.2 is active+undersized+degraded, acting [1,0]
pg 9.0 is active+undersized+degraded, acting [1,2]
pg 10.3 is active+undersized+degraded, acting [2,1]
pg 9.3 is active+undersized+degraded+remapped+backfill_toofull, acting
[1,0]
pg 10.0 is active+undersized+degraded+remapped+backfill_toofull, acting
[1,0]
pg 9.2 is active+undersized+degraded, acting [0,1]
pg 10.1 is active+undersized+degraded, acting [0,1]
pg 9.5 is active+undersized+degraded, acting [1,0]
pg 10.6 is active+undersized+degraded, acting [0,1]
pg 9.4 is active+undersized+degraded, acting [1,0]
pg 10.7 is active+undersized+degraded+remapped+backfill_toofull, acting
[0,1]
pg 9.7 is active+undersized+degraded+remapped+backfill_toofull, acting
[1,0]
pg 9.6 is active+undersized+degraded+remapped+backfill_toofull, acting
[1,0]
pg 10.5 is active+undersized+degraded, acting [1,0]
recovery 87176/300483 objects degraded (29.012%)
recovery 62272/300483 objects misplaced (20.724%)
osd.1 is full at 95%
osd.2 is near full at 91%
osd.3 is near full at 91%
pool default.rgw.buckets.data objects per pg (12438) is more than 17.8451
times cluster average (697)

In the log I see this:
2016-09-26 10:37:21.688849 mon.0 192.168.1.35:6789/0 4836 : cluster [INF]
pgmap v8364: 144 pgs: 5
active+undersized+degraded+remapped+backfill_toofull, 1 active+remapped,
128 active
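
When a cluster fills up unexpectedly like this, the first thing to check is
where the space actually went and how evenly the OSDs are filled; a quick
sequence from any admin node:

  ceph df detail    # raw usage plus per-pool usage and object counts
  ceph osd df       # per-OSD utilization, weight and variance
  rados df          # per-pool object counts and sizes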

Re: [ceph-users] Ceph full cluster

2016-09-26 Thread Burkhard Linke

Hi,


On 09/26/2016 12:58 PM, Dmitriy Lock wrote:

Hello all!
I need some help with my Ceph cluster.
I've installed ceph cluster with two physical servers with osd /data 
40G on each.

Here is ceph.conf:
[global]
fsid = 377174ff-f11f-48ec-ad8b-ff450d43391c
mon_initial_members = vm35, vm36
mon_host = 192.168.1.35,192.168.1.36
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

osd pool default size = 2  # Write an object 2 times.
osd pool default min size = 1 # Allow writing one copy in a degraded 
state.


osd pool default pg num = 200
osd pool default pgp num = 200

Right after creation it was HEALTH_OK, and i've started with filling 
it. I've wrote 40G data to cluster using Rados gateway, but cluster 
uses all avaiable space and keep growing after i've added two another 
osd - 10G /data1 on each server.

Here is tree output:
# ceph osd tree
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.09756 root default
-2 0.04878 host vm35
0 0.03899 osd.0  up  1.0  1.0
2 0.00980 osd.2  up  1.0  1.0
-3 0.04878 host vm36
1 0.03899 osd.1  up  1.0  1.0
3 0.00980 osd.3  up  1.0  1.0

and health:
root@vm35:/etc# ceph health
HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck 
unclean; 15 pgs undersized; recovery 87176/300483 objects degraded 
(29.012%); recovery 62272/300483 obj
ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool 
default.rgw.buckets.data has many more objects per pg than average 
(too few pgs?)

root@vm35:/etc# ceph health detail
HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck 
unclean; 15 pgs undersized; recovery 87176/300483 objects degraded 
(29.012%); recovery 62272/300483 obj
ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool 
default.rgw.buckets.data has many more objects per pg than average 
(too few pgs?)
pg 10.5 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 9.6 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.4 is stuck unclean since forever, current state active+remapped, 
last acting [3,0,1]
pg 9.7 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.7 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [0,1]
pg 9.4 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 9.1 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,3]
pg 10.2 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 9.0 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,2]
pg 10.3 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [2,1]
pg 9.3 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 10.0 is stuck unclean since forever, current state 
active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
pg 9.2 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,1]
pg 10.1 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,1]
pg 9.5 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [1,0]
pg 10.6 is stuck unclean since forever, current state 
active+undersized+degraded, last acting [0,1]

pg 9.1 is active+undersized+degraded, acting [0,3]
pg 10.2 is active+undersized+degraded, acting [1,0]
pg 9.0 is active+undersized+degraded, acting [1,2]
pg 10.3 is active+undersized+degraded, acting [2,1]
pg 9.3 is active+undersized+degraded+remapped+backfill_toofull, acting 
[1,0]
pg 10.0 is active+undersized+degraded+remapped+backfill_toofull, 
acting [1,0]

pg 9.2 is active+undersized+degraded, acting [0,1]
pg 10.1 is active+undersized+degraded, acting [0,1]
pg 9.5 is active+undersized+degraded, acting [1,0]
pg 10.6 is active+undersized+degraded, acting [0,1]
pg 9.4 is active+undersized+degraded, acting [1,0]
pg 10.7 is active+undersized+degraded+remapped+backfill_toofull, 
acting [0,1]
pg 9.7 is active+undersized+degraded+remapped+backfill_toofull, acting 
[1,0]
pg 9.6 is active+undersized+degraded+remapped+backfill_toofull, acting 
[1,0]

pg 10.5 is active+undersized+degraded, acting [1,0]
recovery 87176/300483 objects degraded (29.012%)
recovery 62272/300483 objects misplaced (20.724%)
osd.1 is full at 95%
osd.2 is near full at 91%
osd.3 is near full at 91%
pool default.rgw.buckets.data objects per pg (12438) is more than 
17.8451 times cluster average (697)


In log i see this:
2016-09-26 10:37:21.688849 mon.0 192.168.1.35:6789/0 
 483

Re: [ceph-users] Ceph full cluster

2016-09-26 Thread Dmitriy Lock
Yes, you are right!
I've changed this for all pools, but not for the last two!

pool 1 '.rgw.root' replicated size 2 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 27 owner 18446744073709551615 flags
hashpspool strip
e_width 0
pool 2 'default.rgw.control' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 29 owner
18446744073709551615 flags hashps
pool stripe_width 0
pool 3 'default.rgw.data.root' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 31 owner
18446744073709551615 flags hash
pspool stripe_width 0
pool 4 'default.rgw.gc' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 33 owner
18446744073709551615 flags hashpspool
stripe_width 0
pool 5 'default.rgw.log' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 35 owner
18446744073709551615 flags hashpspool
stripe_width 0
pool 6 'default.rgw.users.uid' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 37 owner
18446744073709551615 flags hash
pspool stripe_width 0
pool 7 'default.rgw.users.keys' replicated size 2 min_size 2 crush_ruleset
0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 owner
18446744073709551615 flags has
hpspool stripe_width 0
pool 8 'default.rgw.meta' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 41 owner
18446744073709551615 flags hashpspoo
l stripe_width 0
pool 9 'default.rgw.buckets.index' replicated size 3 min_size 2
crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 43
flags hashpspool stripe_width 0
pool 10 'default.rgw.buckets.data' replicated size 3 min_size 2
crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 45
flags hashpspool stripe_width 0

Changing right now.
Thank you very much!
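
For reference, the change being applied here would look something like the
following; the pool names come from the dump above, while the pg_num target of
128 is only an example for the "too few pgs" warning and should be sized to
the cluster:

  ceph osd pool set default.rgw.buckets.index size 2
  ceph osd pool set default.rgw.buckets.data  size 2
  ceph osd pool set default.rgw.buckets.data  pg_num 128
  ceph osd pool set default.rgw.buckets.data  pgp_num 128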

On Mon, Sep 26, 2016 at 2:05 PM, Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 09/26/2016 12:58 PM, Dmitriy Lock wrote:
>
> Hello all!
> I need some help with my Ceph cluster.
> I've installed ceph cluster with two physical servers with osd /data 40G
> on each.
> Here is ceph.conf:
> [global]
> fsid = 377174ff-f11f-48ec-ad8b-ff450d43391c
> mon_initial_members = vm35, vm36
> mon_host = 192.168.1.35,192.168.1.36
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> osd pool default size = 2  # Write an object 2 times.
> osd pool default min size = 1 # Allow writing one copy in a degraded
> state.
>
> osd pool default pg num = 200
> osd pool default pgp num = 200
>
> Right after creation it was HEALTH_OK, and i've started with filling it.
> I've wrote 40G data to cluster using Rados gateway, but cluster uses all
> avaiable space and keep growing after i've added two another osd - 10G
> /data1 on each server.
> Here is tree output:
> # ceph osd tree
> ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 0.09756 root default
> -2 0.04878 host vm35
> 0 0.03899 osd.0  up  1.0  1.0
> 2 0.00980 osd.2  up  1.0  1.0
> -3 0.04878 host vm36
> 1 0.03899 osd.1  up  1.0  1.0
> 3 0.00980 osd.3  up  1.0  1.0
>
> and health:
> root@vm35:/etc# ceph health
> HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck unclean;
> 15 pgs undersized; recovery 87176/300483 objects degraded (29.012%);
> recovery 62272/300483 obj
> ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool
> default.rgw.buckets.data has many more objects per pg than average (too few
> pgs?)
> root@vm35:/etc# ceph health detail
> HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck unclean;
> 15 pgs undersized; recovery 87176/300483 objects degraded (29.012%);
> recovery 62272/300483 obj
> ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool
> default.rgw.buckets.data has many more objects per pg than average (too few
> pgs?)
> pg 10.5 is stuck unclean since forever, current state
> active+undersized+degraded, last acting [1,0]
> pg 9.6 is stuck unclean since forever, current state
> active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
> pg 10.4 is stuck unclean since forever, current state active+remapped,
> last acting [3,0,1]
> pg 9.7 is stuck unclean since forever, current state
> active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
> pg 10.7 is stuck unclean since forever, current state
> active+undersized+degraded+remapped+backfill_toofull, last acting [0,1]
> pg 9.4 is stuck unclean since forever, current state
> active+undersized+degraded, last acting [1,0]
> pg 9.1 is stuck unclean since forever, current state
> active+undersized+degraded, last acting [0,3]
> pg 10.2 is stuck unclean since forever, current state
> active+undersized+degraded, last

Re: [ceph-users] Ceph full cluster

2016-09-26 Thread Yoann Moulin
Hello,

> Yes, you are right!
> I've changed this for all pools, but not for last two!
> 
> pool 1 '.rgw.root' replicated size 2 min_size 2 crush_ruleset 0 object_hash 
> rjenkins pg_num 8 pgp_num 8 last_change 27 owner
> 18446744073709551615 flags hashpspool strip
> e_width 0
> pool 2 'default.rgw.control' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 29 owner
> 18446744073709551615 flags hashps
> pool stripe_width 0
> pool 3 'default.rgw.data.root' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 31 owner
> 18446744073709551615 flags hash
> pspool stripe_width 0
> pool 4 'default.rgw.gc' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 33 owner
> 18446744073709551615 flags hashpspool
> stripe_width 0
> pool 5 'default.rgw.log' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 35 owner
> 18446744073709551615 flags hashpspool
> stripe_width 0
> pool 6 'default.rgw.users.uid' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 37 owner
> 18446744073709551615 flags hash
> pspool stripe_width 0
> pool 7 'default.rgw.users.keys' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 owner
> 18446744073709551615 flags has
> hpspool stripe_width 0
> pool 8 'default.rgw.meta' replicated size 2 min_size 2 crush_ruleset 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 41 owner
> 18446744073709551615 flags hashpspoo
> l stripe_width 0
> pool 9 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 
> 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 43 flags
> hashpspool stripe_width 0
> pool 10 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_ruleset 
> 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 45 flags
> hashpspool stripe_width 0

Be careful: if you set size 2 and min_size 2, your cluster will go to 
HEALTH_ERR as soon as you lose even a single OSD. If you want to set "size 2" 
(which is not recommended), you should set min_size to 1.
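
That adjustment is a one-liner per pool, e.g. for the data pool from the dump
above:

  ceph osd pool set default.rgw.buckets.data min_size 1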

Best Regards.

Yoann Moulin

> On Mon, Sep 26, 2016 at 2:05 PM, Burkhard Linke 
>  > wrote:
> 
> Hi,
> 
> 
> On 09/26/2016 12:58 PM, Dmitriy Lock wrote:
>> Hello all!
>> I need some help with my Ceph cluster.
>> I've installed ceph cluster with two physical servers with osd /data 40G 
>> on each.
>> Here is ceph.conf:
>> [global]
>> fsid = 377174ff-f11f-48ec-ad8b-ff450d43391c
>> mon_initial_members = vm35, vm36
>> mon_host = 192.168.1.35,192.168.1.36
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>>
>> osd pool default size = 2  # Write an object 2 times.
>> osd pool default min size = 1 # Allow writing one copy in a degraded 
>> state.
>>
>> osd pool default pg num = 200
>> osd pool default pgp num = 200
>>
>> Right after creation it was HEALTH_OK, and i've started with filling it. 
>> I've wrote 40G data to cluster using Rados gateway, but cluster
>> uses all avaiable space and keep growing after i've added two another 
>> osd - 10G /data1 on each server.
>> Here is tree output:
>> # ceph osd tree
>> ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY  
>> -1 0.09756 root default 
>> -2 0.04878 host vm35
>> 0 0.03899 osd.0  up  1.0  1.0  
>> 2 0.00980 osd.2  up  1.0  1.0  
>> -3 0.04878 host vm36
>> 1 0.03899 osd.1  up  1.0  1.0  
>> 3 0.00980 osd.3  up  1.0  1.0 
>>
>> and health:
>> root@vm35:/etc# ceph health
>> HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck 
>> unclean; 15 pgs undersized; recovery 87176/300483 objects degraded
>> (29.012%); recovery 62272/300483 obj
>> ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool 
>> default.rgw.buckets.data has many more objects per pg than average (too
>> few pgs?)
>> root@vm35:/etc# ceph health detail
>> HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs stuck 
>> unclean; 15 pgs undersized; recovery 87176/300483 objects degraded
>> (29.012%); recovery 62272/300483 obj
>> ects misplaced (20.724%); 1 full osd(s); 2 near full osd(s); pool 
>> default.rgw.buckets.data has many more objects per pg than average (too
>> few pgs?)
>> pg 10.5 is stuck unclean since forever, current state 
>> active+undersized+degraded, last acting [1,0]
>> pg 9.6 is stuck unclean since forever, current state 
>> active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]

[ceph-users] 10.2.3 release announcement?

2016-09-26 Thread Henrik Korkuc

Hey,

10.2.3 has been tagged in the jewel branch for more than 5 days already, but 
there has been no announcement for it yet. Is there any reason for that? 
The packages seem to be present too.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to maintain cluster properly

2016-09-26 Thread Eugen Block

Hi experts,

I need your help. I have a running cluster with 19 OSDs and 3 MONs. I  
created a separate LVM for /var/lib/ceph on one of the nodes. I  
stopped the mon service on that node, rsynced the content to the newly  
created LVM and restarted the monitor, but obviously, I didn't do that  
correctly as I'm stuck in ERROR state and can't repair the respective  
PGs.
How would I do that correctly? I want to do the same on the remaining  
nodes, but without bringing the cluster to error state.


One thing I already learned is to set the noout flag before stopping  
services, but what else is needed to accomplish that?
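
For the OSD side of planned maintenance, the usual pattern is to wrap the
downtime in the noout flag; a sketch, with the systemd unit names as
assumptions since they differ between distributions and releases:

  ceph osd set noout                 # stop CRUSH from re-replicating while the node is down
  systemctl stop ceph-osd@<id>       # or ceph-mon@<host> when moving a monitor store
  # ... do the maintenance: remount /var/lib/ceph, rsync, etc. ...
  systemctl start ceph-osd@<id>
  ceph osd unset noout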


But now that it is in error state, how can I repair my cluster? the  
current status is:


---cut here---
ceph@node01:~/ceph-deploy> ceph -s
cluster 655cb05a-435a-41ba-83d9-8549f7c36167
 health HEALTH_ERR
16 pgs inconsistent
261 scrub errors
 monmap e7: 3 mons at  
{mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}

election epoch 356, quorum 0,1,2 mon1,mon2,mon3
 osdmap e3394: 19 osds: 19 up, 19 in
  pgmap v7105355: 8432 pgs, 15 pools, 1003 GB data, 205 kobjects
2114 GB used, 6038 GB / 8153 GB avail
8413 active+clean
  16 active+clean+inconsistent
   3 active+clean+scrubbing+deep
  client io 0 B/s rd, 136 kB/s wr, 34 op/s

ceph@ndesan01:~/ceph-deploy> ceph health detail
HEALTH_ERR 16 pgs inconsistent; 261 scrub errors
pg 1.ffa is active+clean+inconsistent, acting [16,5]
pg 1.cc9 is active+clean+inconsistent, acting [5,18]
pg 1.bb1 is active+clean+inconsistent, acting [15,5]
pg 1.ac4 is active+clean+inconsistent, acting [0,5]
pg 1.a46 is active+clean+inconsistent, acting [13,4]
pg 1.a16 is active+clean+inconsistent, acting [5,18]
pg 1.9e4 is active+clean+inconsistent, acting [13,9]
pg 1.9b7 is active+clean+inconsistent, acting [5,6]
pg 1.950 is active+clean+inconsistent, acting [0,9]
pg 1.6db is active+clean+inconsistent, acting [15,5]
pg 1.5f6 is active+clean+inconsistent, acting [17,5]
pg 1.5c2 is active+clean+inconsistent, acting [8,4]
pg 1.5bc is active+clean+inconsistent, acting [9,6]
pg 1.505 is active+clean+inconsistent, acting [16,9]
pg 1.3e6 is active+clean+inconsistent, acting [2,4]
pg 1.32 is active+clean+inconsistent, acting [18,5]
261 scrub errors
---cut here---

And the number of scrub errors is increasing, although I started with  
more than 400 scrub errors.

What I have tried is to manually repair single PGs as described in [1]

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to maintain cluster properly (Part2)

2016-09-26 Thread Eugen Block

(Sorry, sometimes I use the wrong shortcuts too quickly)

Hi experts,

I need your help. I have a running cluster with 19 OSDs and 3 MONs. I  
created a separate LVM for /var/lib/ceph on one of the nodes. I  
stopped the mon service on that node, rsynced the content to the newly  
created LVM and restarted the monitor, but obviously, I didn't do that  
correctly as I'm stuck in ERROR state and can't repair the respective  
PGs.
How would I do that correctly? I want to do the same on the remaining  
nodes, but without bringing the cluster to error state.


One thing I already learned is to set the noout flag before stopping  
services, but what else is there to do to accomplish that?


But now that it is in error state, how can I repair my cluster? the  
current status is:


---cut here---
ceph@node01:~/ceph-deploy> ceph -s
cluster 655cb05a-435a-41ba-83d9-8549f7c36167
 health HEALTH_ERR
16 pgs inconsistent
261 scrub errors
 monmap e7: 3 mons at  
{mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}

election epoch 356, quorum 0,1,2 mon1,mon2,mon3
 osdmap e3394: 19 osds: 19 up, 19 in
  pgmap v7105355: 8432 pgs, 15 pools, 1003 GB data, 205 kobjects
2114 GB used, 6038 GB / 8153 GB avail
8413 active+clean
  16 active+clean+inconsistent
   3 active+clean+scrubbing+deep
  client io 0 B/s rd, 136 kB/s wr, 34 op/s

ceph@ndesan01:~/ceph-deploy> ceph health detail
HEALTH_ERR 16 pgs inconsistent; 261 scrub errors
pg 1.ffa is active+clean+inconsistent, acting [16,5]
pg 1.cc9 is active+clean+inconsistent, acting [5,18]
pg 1.bb1 is active+clean+inconsistent, acting [15,5]
pg 1.ac4 is active+clean+inconsistent, acting [0,5]
pg 1.a46 is active+clean+inconsistent, acting [13,4]
pg 1.a16 is active+clean+inconsistent, acting [5,18]
pg 1.9e4 is active+clean+inconsistent, acting [13,9]
pg 1.9b7 is active+clean+inconsistent, acting [5,6]
pg 1.950 is active+clean+inconsistent, acting [0,9]
pg 1.6db is active+clean+inconsistent, acting [15,5]
pg 1.5f6 is active+clean+inconsistent, acting [17,5]
pg 1.5c2 is active+clean+inconsistent, acting [8,4]
pg 1.5bc is active+clean+inconsistent, acting [9,6]
pg 1.505 is active+clean+inconsistent, acting [16,9]
pg 1.3e6 is active+clean+inconsistent, acting [2,4]
pg 1.32 is active+clean+inconsistent, acting [18,5]
261 scrub errors
---cut here---

And the number of scrub errors is increasing, although I started with  
more than 400 scrub errors.
What I have tried is to manually repair single PGs as described in  
[1]. But some of the broken PGs have no entries in the log file, so I  
don't have anything to look at.
In case an object exists on one OSD but is missing on the other,  
how do I get it copied back there? Everything I've tried so far  
only brought the number of scrub errors down for a while, but they  
are increasing again, so no success at all.


I'd be really grateful for your advice!

Regards,
Eugen

[1] http://ceph.com/planet/ceph-manually-repair-object/
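
A sketch of the usual inspection/repair sequence for one of the PGs listed
above (1.ffa as the example); list-inconsistent-obj needs Jewel, and note the
caveat discussed in [1] that on replicated pools repair treats the primary
copy as authoritative:

  rados list-inconsistent-obj 1.ffa --format=json-pretty   # which object/shard disagrees?
  ceph pg repair 1.ffa                                     # trigger the repair
  ceph -w                                                  # watch until the PG returns to active+clean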

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bcache, partitions and BlueStore

2016-09-26 Thread Jens Rosenboom
2016-09-26 11:31 GMT+02:00 Wido den Hollander :
...
> Does anybody know the proper route we need to take to get this fixed 
> upstream? Has any contacts with the bcache developers?

I do not have direct contacts either, but having partitions on bcache
would be really great. Currently we do some nasty hacks to get OSDs
started properly on systemd/udev with it. You may want to visit
#bcache on OFTC, there is also a dedicated ML[3].

[3] linux-bca...@vger.kernel.org
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bcache, partitions and BlueStore

2016-09-26 Thread Sam Yaple
On Mon, Sep 26, 2016 at 9:31 AM, Wido den Hollander  wrote:

> Hi,
>
> This has been discussed on the ML before [0], but I would like to bring
> this up again with the outlook towards BlueStore.
>
> Bcache [1] allows for block device level caching in Linux. This can be
> read/write(back) and vastly improves read and write performance to a block
> device.
>
> With the current layout of Ceph with FileStore you can already use bcache,
> but not with ceph-disk.
>
> The reason is that bcache currently does not support creating partitions
> on those devices. There are patches [2] out there, but they are not
> upstream.
>
> I haven't tested it yet, but it looks like BlueStore can still benefit
> quite good from Bcache and it would be a lot easier if the patches [2] were
> merged upstream.
>
> This way you would have:
>
> - bcache0p1: XFS/EXT4 OSD metadata
> - bcache0p2: RocksDB
> - bcache0p3: RocksDB WAL
> - bcache0p4: BlueStore DATA
>
> With bcache you could create multiple bcache devices by creating
> partitions on the backing disk and creating bcache devices for all of them,
> but that's a lot of work and not easy to automate with ceph-disk.
>
> So what I'm trying to find is the best route to get this upstream in the
> Linux kernel. That way next year when BlueStore becomes the default in L
> (luminous) users can use bcache underneath BlueStore easily.
>
> Does anybody know the proper route we need to take to get this fixed
> upstream? Has any contacts with the bcache developers?
>

Kent is pretty heavy into developing bcachefs at the moment. But you can
hit him up on IRC at OFTC #bcache. I've talked to him about this before
and he is 100% willing to accept any patch that solves this issue in the
standard way the kernel typically allocates major/minors for disks. The blog
post you listed from me does _not_ solve this in an upstream way, though
the final result is pretty accurate from my understanding.

I will look into a better way to patch this upstream since there is
renewed interest in this.

Also, check out bcachefs if you like bcache. It's up and coming, but it is
pretty sweet. My goal is to use bcachefs with bluestore in the future.


>
> Thanks!
>
> Wido
>
> [0]: http://www.spinics.net/lists/ceph-devel/msg29550.html
> [1]: https://bcache.evilpiepirate.org/
> [2]: https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
>


SamYaple
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] AWS ebs volume snapshot for ceph osd

2016-09-26 Thread sudhakar
Hello all

I need your help.


I have a running Ceph cluster on AWS with 3 MONs and 3 OSDs.

My question is: can I use EBS snapshots of the OSD volumes as a backup
solution? Would it work to create a volume from an OSD snapshot and add it
to the Ceph cluster as a new OSD?

Any advice on whether this approach is correct, and on what other methods I
could use to back up my Ceph data, would be appreciated.

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bcache, partitions and BlueStore

2016-09-26 Thread Wido den Hollander

> Op 26 september 2016 om 17:48 schreef Sam Yaple :
> 
> 
> On Mon, Sep 26, 2016 at 9:31 AM, Wido den Hollander  wrote:
> 
> > Hi,
> >
> > This has been discussed on the ML before [0], but I would like to bring
> > this up again with the outlook towards BlueStore.
> >
> > Bcache [1] allows for block device level caching in Linux. This can be
> > read/write(back) and vastly improves read and write performance to a block
> > device.
> >
> > With the current layout of Ceph with FileStore you can already use bcache,
> > but not with ceph-disk.
> >
> > The reason is that bcache currently does not support creating partitions
> > on those devices. There are patches [2] out there, but they are not
> > upstream.
> >
> > I haven't tested it yet, but it looks like BlueStore can still benefit
> > quite good from Bcache and it would be a lot easier if the patches [2] were
> > merged upstream.
> >
> > This way you would have:
> >
> > - bcache0p1: XFS/EXT4 OSD metadata
> > - bcache0p2: RocksDB
> > - bcache0p3: RocksDB WAL
> > - bcache0p4: BlueStore DATA
> >
> > With bcache you could create multiple bcache devices by creating
> > partitions on the backing disk and creating bcache devices for all of them,
> > but that's a lot of work and not easy to automate with ceph-disk.
> >
> > So what I'm trying to find is the best route to get this upstream in the
> > Linux kernel. That way next year when BlueStore becomes the default in L
> > (luminous) users can use bcache underneath BlueStore easily.
> >
> > Does anybody know the proper route we need to take to get this fixed
> > upstream? Has any contacts with the bcache developers?
> >
> 
> Kent is pretty heavy into developing bcachefs at the moment. But you can
> hit him up on IRC at OFTC #bcache . I've talked ot him about this before
> and he is 100% willing to accept any patch to solves this issue in the
> standard way the kernel typically allocs major/minors for disks. The blog
> post you listed from me does _not_ solve this in an upstream way, though
> the final result is pretty accurate from my understanding.
> 

No, I understood that the blog indeed doesn't solve that.

> I will look into a more better way to patch this upstream since there is
> renew interested in this.
> 

That would be great! My kernel knowledge is too limited to look into this, but 
if you could help with it that would be nice.

If this hits the kernel somewhere in Nov/Dec, we should be good for a kernel 
release that lines up with L for Ceph.

> Also, checkout bcachefs if you like bcache. It's up and coming, but it is
> pretty sweet. My goal is to use bcachefs with bluestore in the future.
> 

bcachefs with bluestore? The OSD doesn't require a filesystem with BlueStore, 
just a raw block device :)

Wido

> 
> >
> > Thanks!
> >
> > Wido
> >
> > [0]: http://www.spinics.net/lists/ceph-devel/msg29550.html
> > [1]: https://bcache.evilpiepirate.org/
> > [2]: https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
> >
> 
> 
> SamYaple
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bcache, partitions and BlueStore

2016-09-26 Thread Sam Yaple
On Mon, Sep 26, 2016 at 5:44 PM, Wido den Hollander  wrote:

>
> > On 26 September 2016 at 17:48, Sam Yaple wrote:
> >
> >
> > On Mon, Sep 26, 2016 at 9:31 AM, Wido den Hollander 
> wrote:
> >
> > > Hi,
> > >
> > > This has been discussed on the ML before [0], but I would like to bring
> > > this up again with the outlook towards BlueStore.
> > >
> > > Bcache [1] allows for block device level caching in Linux. This can be
> > > read/write(back) and vastly improves read and write performance to a
> block
> > > device.
> > >
> > > With the current layout of Ceph with FileStore you can already use
> bcache,
> > > but not with ceph-disk.
> > >
> > > The reason is that bcache currently does not support creating
> partitions
> > > on those devices. There are patches [2] out there, but they are not
> > > upstream.
> > >
> > > I haven't tested it yet, but it looks like BlueStore can still benefit
> > > quite good from Bcache and it would be a lot easier if the patches [2]
> were
> > > merged upstream.
> > >
> > > This way you would have:
> > >
> > > - bcache0p1: XFS/EXT4 OSD metadata
> > > - bcache0p2: RocksDB
> > > - bcache0p3: RocksDB WAL
> > > - bcache0p4: BlueStore DATA
> > >
> > > With bcache you could create multiple bcache devices by creating
> > > partitions on the backing disk and creating bcache devices for all of
> them,
> > > but that's a lot of work and not easy to automate with ceph-disk.
> > >
> > > So what I'm trying to find is the best route to get this upstream in
> the
> > > Linux kernel. That way next year when BlueStore becomes the default in
> L
> > > (luminous) users can use bcache underneath BlueStore easily.
> > >
> > > Does anybody know the proper route we need to take to get this fixed
> > > upstream? Has any contacts with the bcache developers?
> > >
> >
> > Kent is pretty heavy into developing bcachefs at the moment. But you can
> > hit him up on IRC at OFTC #bcache . I've talked ot him about this before
> > and he is 100% willing to accept any patch to solves this issue in the
> > standard way the kernel typically allocs major/minors for disks. The blog
> > post you listed from me does _not_ solve this in an upstream way, though
> > the final result is pretty accurate from my understanding.
> >
>
> No, I understood that the blog indeed doesn't solve that.
>
> > I will look into a more better way to patch this upstream since there is
> > renew interested in this.
> >
>
> That would be great! My kernel knowledge is to limited to look into this,
> but if you could help with this it would be nice.
>
> If this hits the kernel somewhere in Nov/Dec we should be good for a
> kernel release somewhere together with L for Ceph.
>
> > Also, checkout bcachefs if you like bcache. It's up and coming, but it is
> > pretty sweet. My goal is to use bcachefs with bluestore in the future.
> >
>
> bcachefs with bluestore? The OSD doesn't require a filesystem with
> BlueStore, just a raw block device :)
>
Well, there are parts of the OSD that still use a file system and can
benefit from the caching (RocksDB and the WAL); this is what I meant. There is a
tiering system in bcachefs which currently only supports 2 tiers but will
eventually allow for 15, so you could have a small, fast PCIe caching tier,
followed by SSD, followed by spinning disk, and control which data can live on
which tier (potentially with writeback/writethrough). Lots of room for
configurations that improve performance.

SamYaple


> Wido
>
> >
> > >
> > > Thanks!
> > >
> > > Wido
> > >
> > > [0]: http://www.spinics.net/lists/ceph-devel/msg29550.html
> > > [1]: https://bcache.evilpiepirate.org/
> > > [2]: https://yaple.net/2016/03/31/bcache-partitions-and-dkms/
> > >
> >
> >
> > SamYaple
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.3 release announcement?

2016-09-26 Thread Scottix
Agreed, no announcement like there usually is. What is going on?
Hopefully there is an explanation. :|

On Mon, Sep 26, 2016 at 6:01 AM Henrik Korkuc  wrote:

> Hey,
>
> 10.2.3 has been tagged in the jewel branch for more than 5 days already, but
> there has been no announcement for it yet. Is there any reason for that?
> Packages seem to be present too.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] filestore_split_multiple hardcoded maximum?

2016-09-26 Thread David Turner
We are running on Hammer 0.94.7 and have had very bad experiences with PG 
folders splitting into further subdirectories: OSDs being marked out, hundreds of 
blocked requests, etc.  We have modified our settings and watched the behavior 
match the Ceph documentation for splitting, but right now the subfolders are 
splitting at a different point than the documentation says they should.

filestore_split_multiple * abs(filestore_merge_threshold) * 16

Our filestore_merge_threshold is set to 40.  When we had our 
filestore_split_multiple set to 8, we were splitting subfolders when a 
subfolder had (8 * 40 * 16 = ) 5120 objects in the directory.  In a different 
cluster we had to push that back again with elevated settings and the 
subfolders split when they had (16 * 40 * 16 = ) 10240 objects.
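
To make the arithmetic explicit, the threshold calculation and the values an
OSD is actually running with can be checked like this (a sketch; osd.0 is just
an example, and the admin socket commands have to be run on the node hosting
that OSD):

# split threshold = filestore_split_multiple * abs(filestore_merge_threshold) * 16
#                 = 8 * 40 * 16 = 5120 objects per subfolder
$ ceph daemon osd.0 config get filestore_split_multiple
$ ceph daemon osd.0 config get filestore_merge_threshold

Both options live in the [osd] section of ceph.conf as "filestore split
multiple" and "filestore merge threshold".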

We have another cluster that we're working with that is splitting at a value 
that seems to be a hardcoded maximum.  The settings are (32 * 40 * 16 = ) 20480 
objects before it should split, but it seems to be splitting subfolders at 
12800 objects.

Normally I would expect this number to be a power of 2, but we recently found 
another hardcoded maximum where the object map only allows RBDs with at most 
256,000,000 objects in them.  The 12800 matches that pattern of a power of 2 
followed by a run of zeros, which is what makes it look like a hardcoded maximum.

Has anyone else encountered what seems to be a hardcoded maximum here?  Are we 
missing a setting elsewhere that is capping us or diminishing our value?  Much 
more to the point, though, is there any way to mitigate how painful it is to 
split subfolders in PGs?  So far it seems like the only way we can do it is to 
push the setting up and later drop it back down during a week in which we plan 
to have our cluster plagued with blocked requests, all while cranking up 
osd_heartbeat_grace so that we don't have flapping OSDs.
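
A rough sketch of what that temporary bump can look like (the numbers are
placeholders rather than recommendations, and the filestore options may only
take full effect after an OSD restart):

$ ceph tell osd.* injectargs '--osd-heartbeat-grace 60'
$ ceph tell osd.* injectargs '--filestore-split-multiple 32'
# ... let the subfolder splits happen during the maintenance window ...
$ ceph tell osd.* injectargs '--osd-heartbeat-grace 20 --filestore-split-multiple 8'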

A little more about our setup is that we have 32x 4TB HGST drives with 4x 200GB 
Intel DC3710 journals (8 drives per journal), dual hyper-threaded octa-core 
Xeon (32 virtual cores), 192GB memory, 10Gb redundant network... per storage 
node.



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] some Ceph questions for new install - newbie warning

2016-09-26 Thread Erick Perez - Quadrian Enterprises
We are looking to implement a small setup with Ceph + OpenStack + KVM for a
college that teaches IT careers. We want to empower teachers and students to
self-provision resources and to develop the skills to extend and/or build
multi-tenant portals.

Currently:

45 VMs (90% Linux and 10% Windows) using 70 vCPUs from 4 physical servers
(real cores, no more than 2 GHz) and 96GB RAM total.
Three Xen hypervisors with two bonded 1Gb Ethernet NICs for public traffic
and another bond for the connection to the NFS storage.
One NAS (Debian, 24x 2TB SATA HDD, 128GB RAM) serving NFS v3 to Xen. No usable
IOPS data.
The NAS serves its content over bonded 1Gbps links. No 10Gb Ethernet AT ALL.

The goal:
1- Use OpenStack with KVM on 3 physical nodes to run Linux VMs and Windows
VMs.
2- Use 3 physical hosts (or 5, we're still doing the $$$ math) for Ceph. Still
torn between damn expensive HP SAS drives and cheaper HP SATA.
3- Use Cinder/Ceph to provide block storage for the Windows VMs.
4- Shared File System (SFS) to mount several read-only filesystems on several
hosts/labs (is that a good idea?).
5- Swift to connect to some storage we have on Amazon.

Questions:
1- I know I need RBD/block to present storage to Nova/Cinder for my Windows
VMs. But what if the other VMs are Linux and I'm using containers? Block too? NFS?
2- On the 3 physical compute nodes that we are getting, can I run ceph-mon
in VMs there, or do I really need another 3 physical machines dedicated to
ceph-mon?
3- Since we got a pair of 10Gb switches, I wonder whether ceph-mon instances
generate heavy traffic. Do they sync much data, or is the heavy lifting on the
10Gb network performed by the storage nodes when CRUSH maps are updated and/or
OSDs change/fail/etc.?

Thanks for your comments.
-
Erick.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to maintain cluster properly (Part2)

2016-09-26 Thread lyt_yudi
Please try with:

ceph pg repair <pgid>

Most of the time this will fix it.

Good luck!
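
For example, something like this one-liner kicks off a repair for every PG
currently flagged inconsistent (a rough sketch; deep-scrub the PGs afterwards
to confirm they really are clean):

$ ceph health detail | awk '/is active\+clean\+inconsistent/ {print $2}' | \
    while read pg; do ceph pg repair ${pg}; done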


> On 26 September 2016, at 10:44 PM, Eugen Block wrote:
> 
> (Sorry, sometimes I use the wrong shortcuts too quick)
> 
> Hi experts,
> 
> I need your help. I have a running cluster with 19 OSDs and 3 MONs. I created 
> a separate LVM for /var/lib/ceph on one of the nodes. I stopped the mon 
> service on that node, rsynced the content to the newly created LVM and 
> restarted the monitor, but obviously, I didn't do that correctly as I'm stuck 
> in ERROR state and can't repair the respective PGs.
> How would I do that correctly? I want to do the same on the remaining nodes, 
> but without bringing the cluster to error state.
> 
> One thing I already learned is to set the noout flag before stopping 
> services, but what else is there to do to accomplish that?
> 
> But now that it is in error state, how can I repair my cluster? the current 
> status is:
> 
> ---cut here---
> ceph@node01:~/ceph-deploy> ceph -s
>cluster 655cb05a-435a-41ba-83d9-8549f7c36167
> health HEALTH_ERR
>16 pgs inconsistent
>261 scrub errors
> monmap e7: 3 mons at 
> {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
>election epoch 356, quorum 0,1,2 mon1,mon2,mon3
> osdmap e3394: 19 osds: 19 up, 19 in
>  pgmap v7105355: 8432 pgs, 15 pools, 1003 GB data, 205 kobjects
>2114 GB used, 6038 GB / 8153 GB avail
>8413 active+clean
>  16 active+clean+inconsistent
>   3 active+clean+scrubbing+deep
>  client io 0 B/s rd, 136 kB/s wr, 34 op/s
> 
> ceph@ndesan01:~/ceph-deploy> ceph health detail
> HEALTH_ERR 16 pgs inconsistent; 261 scrub errors
> pg 1.ffa is active+clean+inconsistent, acting [16,5]
> pg 1.cc9 is active+clean+inconsistent, acting [5,18]
> pg 1.bb1 is active+clean+inconsistent, acting [15,5]
> pg 1.ac4 is active+clean+inconsistent, acting [0,5]
> pg 1.a46 is active+clean+inconsistent, acting [13,4]
> pg 1.a16 is active+clean+inconsistent, acting [5,18]
> pg 1.9e4 is active+clean+inconsistent, acting [13,9]
> pg 1.9b7 is active+clean+inconsistent, acting [5,6]
> pg 1.950 is active+clean+inconsistent, acting [0,9]
> pg 1.6db is active+clean+inconsistent, acting [15,5]
> pg 1.5f6 is active+clean+inconsistent, acting [17,5]
> pg 1.5c2 is active+clean+inconsistent, acting [8,4]
> pg 1.5bc is active+clean+inconsistent, acting [9,6]
> pg 1.505 is active+clean+inconsistent, acting [16,9]
> pg 1.3e6 is active+clean+inconsistent, acting [2,4]
> pg 1.32 is active+clean+inconsistent, acting [18,5]
> 261 scrub errors
> ---cut here---
> 
> And the number of scrub errors is increasing, although I started with more 
> than 400 scrub errors.
> What I have tried is to manually repair single PGs as described in [1]. But 
> some of the broken PGs have no entries in the log file, so I don't have 
> anything to look at.
> In case there is one object in one OSD but it is missing in the other, how do I 
> get that copied back there? Everything I've tried so far didn't accomplish 
> anything except temporarily decreasing the number of scrub errors, and they are 
> increasing again, so no success at all.
> 
> I'd be really greatful for your advice!
> 
> Regards,
> Eugen
> 
> [1] http://ceph.com/planet/ceph-manually-repair-object/
> 
> -- 
> Eugen Block voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg e-mail  : ebl...@nde.ag
> 
>Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>  Sitz und Registergericht: Hamburg, HRB 90934
>  Vorstand: Jens-U. Mozdzen
>   USt-IdNr. DE 814 013 983
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to maintain cluster properly (Part2)

2016-09-26 Thread lyt_yudi

> On 26 September 2016, at 10:44 PM, Eugen Block wrote:
> 
> And the number of scrub errors is increasing, although I started with more 
> thatn 400 scrub errors.
> What I have tried is to manually repair single PGs as described in [1]. But 
> some of the broken PGs have no entries in the log file so I don't have 
> anything to look at.
> In case there is one object in one OSD but is missing in the other. how do I 
> get that copied back there? Everything I've tried so far didn't accomplish 
> anything except the decreasing number of scrub errors, but they are 
> increasing again, so no success at all.
> 
> I'd be really greatful for your advice!
> 
> Regards,
> Eugen
> 
> [1] http://ceph.com/planet/ceph-manually-repair-object/ 
> 

Sorry. You have tried. :(
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to maintain cluster properly (Part2)

2016-09-26 Thread lyt_yudi

> On 26 September 2016, at 10:44 PM, Eugen Block wrote:
> 
> What I have tried is to manually repair single PGs as described in [1]. But 
> some of the broken PGs have no entries in the log file so I don't have 
> anything to look at.
> In case there is one object in one OSD but is missing in the other. how do I 
> get that copied back there? Everything I've tried so far didn't accomplish 
> anything except the decreasing number of scrub errors, but they are 
> increasing again, so no success at all.

I've run into this before as well, with no errors in the log either.

This is how it got back to normal:

1. Take the OSD that holds the inconsistent PGs offline and mark it out, wait
for the data synchronization (backfill) to complete, and then delete the OSD.

2. Run the manual repair again and wait for a while; the cluster should return
to normal.

3. In the end, re-add the OSD that was deleted.
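
As a rough sketch, steps 1 and 3 could look like this on the command line
(osd.7 and /dev/sdX are made-up examples, and the systemctl unit assumes a
systemd-based install):

$ ceph osd out 7                      # mark it out and let backfill move the data
$ ceph -s                             # wait until the cluster is healthy again
$ systemctl stop ceph-osd@7
$ ceph osd crush remove osd.7
$ ceph auth del osd.7
$ ceph osd rm 7
# ... run the repair from step 2, then re-create the OSD, e.g. with ceph-disk:
$ ceph-disk prepare /dev/sdX
$ ceph-disk activate /dev/sdX1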

Good luck!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph user management question

2016-09-26 Thread 卢 迪
Hello all,


I'm a newbie to Ceph. I read the documentation and created a Ceph cluster on 
VMs. I have a question about how to apply user management to the cluster. I'm 
not asking how to create or modify users or user privileges; I have found that 
in the Ceph documentation.


I want to know:


1. Is there a way to know what each privilege is used for? For example, I created 
a user client.appuser with mon "allow r", and this user can access Ceph; if I 
remove the mon "allow r", the mount times out (in this case, I mount the 
cluster with CephFS). If someone has this information, could you please share 
it with me?


2. In what kind of situation would you create different users for the cluster? 
Currently, I use the admin user to access the whole cluster, for things such as 
starting the cluster, mounting the file system, etc. It looks like the appuser 
(created above) can mount the file system too. Is it possible to create a user 
like an OS user or a database user, so that one user uploads some data and the 
others can't see it, or can only read it?


Thanks,

 Dillon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object lost

2016-09-26 Thread Fran Barrera
Hi Jason,

I've been able to rebuild some of the images, but at this point they are all
corrupted; your procedure appears OK, though.

Thanks!

2016-09-22 15:07 GMT+02:00 Jason Dillaman :

> You can do something like the following:
>
> # create a sparse file the size of your image
> $ dd if=/dev/zero of=rbd_export bs=1 count=0 seek=<image size in bytes>
>
> # import the data blocks; each object name ends in the object number, and
> # object number * 4MB (the default object size) is its offset in the image
> $ POOL=images
> $ PREFIX=rbd_data.1014109cf92e
> $ BLOCK_SIZE=512
> $ for x in $(rados --pool ${POOL} ls | grep ${PREFIX} | sort) ; do
>   rm -f tmp_object
>   rados --pool ${POOL} get $x tmp_object
>   SUFFIX=0x$(echo ${x} | cut -d. -f3)
>   OFFSET=$((${SUFFIX} * 0x400000 / ${BLOCK_SIZE}))
>   echo ${x} @ ${OFFSET}
>   dd conv=notrunc if=tmp_object of=rbd_export seek=${OFFSET} bs=${BLOCK_SIZE}
> done
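
Once the sparse file has been reassembled, it can be imported back into the
cluster as a fresh image, for example like this (the target image name is
made up):

$ rbd import --pool images rbd_export restored-07e54256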
>
> On Thu, Sep 22, 2016 at 5:27 AM, Fran Barrera 
> wrote:
> > Hi Jason,
> >
> > I've followed your steps and now I can list all available data blocks of my
> > image, but I don't know how to rebuild a sparse image. I found this script
> > (https://raw.githubusercontent.com/smmoore/ceph/master/rbd_restore.sh) and
> > https://www.sebastien-han.fr/blog/2015/01/29/ceph-recover-a-rbd-image-from-a-dead-cluster/
> > but I don't know if this can help me.
> >
> > Any suggestions?
> >
> > Thanks.
> >
> > 2016-09-21 22:35 GMT+02:00 Jason Dillaman :
> >>
> >> Unfortunately, it sounds like the image's header object was lost
> >> during your corruption event. While you can manually retrieve the
> >> image data blocks from the cluster, undoubtedly many might be lost
> >> and/or corrupted as well.
> >>
> >> You'll first need to determine the internal id of your image:
> >> $ rados --pool images getomapval rbd_directory
> >> name_07e54256-d123-4e61-a23a-7f8008340751
> >> value (16 bytes) :
> >> 00000000  0c 00 00 00 31 30 31 34  31 30 39 63 66 39 32 65  |....1014109cf92e|
> >> 00000010
> >>
> >> In my example above, the image id (1014109cf92e in this case) is the
> >> string starting after the first four bytes (the id length). I can then
> >> use the rados tool to list all available data blocks:
> >>
> >> $ rados --pool images ls | grep rbd_data.1014109cf92e | sort
> >> rbd_data.1014109cf92e.
> >> rbd_data.1014109cf92e.000b
> >> rbd_data.1014109cf92e.0010
> >>
> >> The sequence of hex numbers at the end of each data object is the
> >> object number and it represents the byte offset within the image (4MB
> >> * object number = byte offset assuming default 4MB object size and no
> >> fancy striping enabled).
> >>
> >> You should be able to script something up to rebuild a sparse image
> >> with whatever data is still available in your cluster.
> >>
> >> On Wed, Sep 21, 2016 at 11:12 AM, Fran Barrera 
> >> wrote:
> >> > Hello,
> >> >
> >> > I have a Ceph Jewel cluster with 4 OSDs and only one monitor, integrated
> >> > with OpenStack Mitaka.
> >> >
> >> > Two OSDs were down; with a service restart one of them was recovered. The
> >> > cluster began to recover and was OK. Finally the disk of the other OSD was
> >> > corrupted, and the solution was to format it and recreate the OSD.
> >> >
> >> > Now I have the cluster OK, but the problem now is with some of the
> >> > images
> >> > stored in Ceph.
> >> >
> >> > $ rbd list -p images|grep 07e54256-d123-4e61-a23a-7f8008340751
> >> > 07e54256-d123-4e61-a23a-7f8008340751
> >> >
> >> > $ rbd export -p images 07e54256-d123-4e61-a23a-7f8008340751
> >> > /tmp/image.img
> >> > 2016-09-21 17:07:00.889379 7f51f9520700 -1 librbd::image::OpenRequest:
> >> > failed to retreive immutable metadata: (2) No such file or directory
> >> > rbd: error opening image 07e54256-d123-4e61-a23a-7f8008340751: (2) No
> >> > such
> >> > file or directory
> >> >
> >> > Ceph can list the image but nothing more, for example an export. So
> >> > OpenStack cannot retrieve this image. I tried repairing the PG but it
> >> > appears OK.
> >> > Is there any solution for this?
> >> >
> >> > Kind Regards,
> >> > Fran.
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >>
> >> --
> >> Jason
> >
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph user management question

2016-09-26 Thread Daleep Singh Bais
Hi Dillon,

Ceph uses CephX authentication, which lets you grant users read/write
permission on selected pools. We give mon 'allow r' so that the client can
fetch the cluster/CRUSH maps.
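
For example (a sketch only; the pool and user names are made up), a key that
can only read the cluster maps and read/write a single pool could be created
like this:

$ ceph auth get-or-create client.appuser mon 'allow r' osd 'allow rw pool=app-data' \
    -o /etc/ceph/ceph.client.appuser.keyring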

You can refer to the URL below for more information on CephX and on creating
user keyrings with access to selected/specific pools.

http://docs.ceph.com/docs/jewel/rados/configuration/auth-config-ref/

The URL below describes the various permissions that can be applied when
creating a CephX authentication key.

http://docs.ceph.com/docs/firefly/rados/operations/auth-intro/

Hope this gives you some insight and a way forward.

Thanks,

Daleep Singh Bais

On 09/27/2016 12:02 PM, 卢 迪 wrote:
>
> Hello all,
>
>
> I'm a newbie of Ceph. I read the document and created a ceph cluster
> against VM. I have a question about how to apply user managerment to
> the cluster. I'm not asking how to create or modify users or user
> privileges. I have found this in the Ceph document.
>
>
> I want to know:
>
>
> 1. Is there a way to know the usage of all privileges? For example, I
> created an user client.appuser with mon "allow r", this user can
> accsess the Ceph; If I removed the mon "allow r", it will be time out.
> (in this case, I mount the cluster with cephfs). If someone has these
> information, could you please share with me?
>
>
> 2. What kind of situation would you create differnet users for
> cluster? In currently, I user admin user to access the all cluster,
> such as start cluster, mount file system and etc. It looks like the
> appuser( I created above) can mount file system too. Is it possible to
> create an user liking the OS user or database user? So, one user
> upload some data, the others can't see them or can only read them.
>
>
> Thanks,
>
>  Dillon
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com