[ceph-users] centos and 'print continue' support

2014-05-23 Thread Bryan Stillwell
Yesterday I went through manually configuring a ceph cluster with a
rados gateway on CentOS 6.5, and I have a question about the
documentation.  On this page:

https://ceph.com/docs/master/radosgw/config/

It mentions "On CentOS/RHEL distributions, turn off print continue. If
you have it set to true, you may encounter problems with PUT
operations."  However, when I had 'rgw print continue = false' in my
ceph.conf, adding objects with the Python boto module would hang at:

key.set_contents_from_string('Hello World!')

After switching it to 'rgw print continue = true' things started working.
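
For reference, this is the relevant bit of my ceph.conf now -- the
section name will depend on how your gateway instance is named, so
treat it as a sketch rather than a drop-in config:

[client.radosgw.gateway]
rgw print continue = true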

I'm wondering if this is because I installed the custom
Apache/mod_fastcgi packages from the instructions on this page:

http://ceph.com/docs/master/install/install-ceph-gateway/#id2

If that's the case, could the docs be updated to mention that setting
'rgw print continue = false' is only needed if you're using the distro
packages?

Thanks,
Bryan


Re: [ceph-users] Performance issues with small files

2013-09-04 Thread Bryan Stillwell
Bill,

I've run into a similar issue with objects averaging ~100KiB.  The
explanation I received on IRC is that there are scaling issues if you're
uploading them all to the same bucket because the index isn't sharded.  The
recommended solution is to spread the objects out to a lot of buckets.
However, that ran me into another issue once I hit 1,000 buckets, which is a
per-user limit.  I switched the limit to be unlimited with this command:

radosgw-admin user modify --uid=your_username --max-buckets=0
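
To double-check that the change took effect, something like this should
show max_buckets set to 0 in the output:

radosgw-admin user info --uid=your_username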

Bryan


On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer  wrote:

> I'm testing ceph for storing a very large number of small files.  I'm
> seeing some performance issues and would like to see if anyone could offer
> any insight as to what I could do to correct this.
>
> Some numbers:
>
> Uploaded 184111 files, with an average file size of 5KB, using
> 10 separate servers to upload the request using Python and the cloudfiles
> module.  I stopped uploading after 53 minutes, which seems to average 5.7
> files per second per node.
>
>
> My storage cluster consists of 21 OSD's across 7 servers, with their
> journals written to SSD drives.  I've done a default installation, using
> ceph-deploy with the dumpling release.
>
> I'm using statsd to monitor the performance, and what's interesting is
> when I start with an empty bucket, performance is amazing, with average
> response times of 20-50ms.  However as time goes on, the response times go
> in to the hundreds, and the average number of uploads per second drops.
>
> I've installed radosgw on all 7 ceph servers.  I've tested using a load
> balancer to distribute the api calls, as well as pointing the 10 worker
> servers to a single instance.  I've not seen a real different
> in performance with this either.
>
>
> Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of ram, OCZ
> Vertex4 SSD drives for the journals and Seagate Barracuda ES2 drives for
> storage.
>
>
> Any help would be greatly appreciated.
>
>


-- 

*Bryan Stillwell*
SENIOR SYSTEM ADMINISTRATOR

E: bstillw...@photobucket.com
O: 303.228.5109
M: 970.310.6085



Re: [ceph-users] Performance issues with small files

2013-09-04 Thread Bryan Stillwell
So far I haven't seen much of a change.  It's still working through
removing the bucket that reached 1.5 million objects though (my guess is
that'll take a few more days), so I believe that might have something to do
with it.
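
In case anyone wants to do the same cleanup, removing a bucket along
with everything in it should be something like this (bucket name is a
placeholder):

radosgw-admin bucket rm --bucket=<bucket_name> --purge-objects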

Bryan


On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson wrote:

> Bryan,
>
> Good explanation.  How's performance now that you've spread the load over
> multiple buckets?
>
> Mark
>
> On 09/04/2013 12:39 PM, Bryan Stillwell wrote:
>
>> Bill,
>>
>> I've run into a similar issue with objects averaging ~100KiB.  The
>> explanation I received on IRC is that there are scaling issues if you're
>> uploading them all to the same bucket because the index isn't sharded.
>>   The recommended solution is to spread the objects out to a lot of
>> buckets.  However, that ran me into another issue once I hit 1000
>> buckets which is a per user limit.  I switched the limit to be unlimited
>> with this command:
>>
>> radosgw-admin user modify --uid=your_username --max-buckets=0
>>
>> Bryan
>>
>>
>> On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer <bill.o...@gmail.com> wrote:
>>
>> I'm testing ceph for storing a very large number of small files.
>>   I'm seeing some performance issues and would like to see if anyone
>> could offer any insight as to what I could do to correct this.
>>
>> Some numbers:
>>
>> Uploaded 184111 files, with an average file size of 5KB, using
>> 10 separate servers to upload the request using Python and the
>> cloudfiles module.  I stopped uploading after 53 minutes, which
>> seems to average 5.7 files per second per node.
>>
>>
>> My storage cluster consists of 21 OSD's across 7 servers, with their
>> journals written to SSD drives.  I've done a default installation,
>> using ceph-deploy with the dumpling release.
>>
>> I'm using statsd to monitor the performance, and what's interesting
>> is when I start with an empty bucket, performance is amazing, with
>> average response times of 20-50ms.  However as time goes on, the
>> response times go in to the hundreds, and the average number of
>> uploads per second drops.
>>
>> I've installed radosgw on all 7 ceph servers.  I've tested using a
>> load balancer to distribute the api calls, as well as pointing the
>> 10 worker servers to a single instance.  I've not seen a real
>> different in performance with this either.
>>
>>
>> Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of ram,
>> OCZ Vertex4 SSD drives for the journals and Seagate Barracuda ES2
>> drives for storage.
>>
>>
>> Any help would be greatly appreciated.
>>
>>



-- 

*Bryan Stillwell*
SENIOR SYSTEM ADMINISTRATOR

E: bstillw...@photobucket.com
O: 303.228.5109
M: 970.310.6085



Re: [ceph-users] Performance issues with small files

2013-09-05 Thread Bryan Stillwell
Wouldn't using only the first two characters in the file name result
in fewer than 65k buckets being used?

For example, if the file names contained 0-9 and a-f, that would only
be 256 buckets (16*16).  Or if they contained 0-9, a-z, and A-Z, that
would only be 3,844 buckets (62 * 62).
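
A quick shell sanity check of that math, just counting two-character
hex prefixes:

for a in {0..9} {a..f}; do for b in {0..9} {a..f}; do echo $a$b; done; done | wc -l
256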

Bryan


On Thu, Sep 5, 2013 at 8:19 AM, Bill Omer  wrote:
>
> That's correct.  We created 65k buckets, using two hex characters as the
> naming convention, then stored the files in each container based on their 
> first two characters in the file name.  The end result was 20-50 files per 
> bucket.  Once all of the buckets were created and files were being loaded, we 
> still observed an increase in latency over time.
>
> Is there a way to disable indexing?  Or are there other settings you can 
> suggest to attempt to speed this process up?
>
>
> On Wed, Sep 4, 2013 at 5:21 PM, Mark Nelson  wrote:
>>
>> Just for clarification, distributing objects over lots of buckets isn't 
>> helping improve small object performance?
>>
>> The degradation over time is similar to something I've seen in the past, 
>> with higher numbers of seeks on the underlying OSD device over time.  Is it 
>> always (temporarily) resolved writing to a new empty bucket?
>>
>> Mark
>>
>>
>> On 09/04/2013 02:45 PM, Bill Omer wrote:
>>>
>>> We've actually done the same thing, creating 65k buckets and storing
>>> 20-50 objects in each.  No change really, not noticeable anyway
>>>
>>>
>>> On Wed, Sep 4, 2013 at 2:43 PM, Bryan Stillwell
>>> <bstillw...@photobucket.com> wrote:
>>>
>>> So far I haven't seen much of a change.  It's still working through
>>> removing the bucket that reached 1.5 million objects though (my
>>> guess is that'll take a few more days), so I believe that might have
>>> something to do with it.
>>>
>>> Bryan
>>>
>>>
>>> On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson
>>> <mark.nel...@inktank.com> wrote:
>>>
>>> Bryan,
>>>
>>> Good explanation.  How's performance now that you've spread the
>>> load over multiple buckets?
>>>
>>> Mark
>>>
>>> On 09/04/2013 12:39 PM, Bryan Stillwell wrote:
>>>
>>> Bill,
>>>
>>> I've run into a similar issue with objects averaging
>>> ~100KiB.  The
>>> explanation I received on IRC is that there are scaling
>>> issues if you're
>>> uploading them all to the same bucket because the index
>>> isn't sharded.
>>>The recommended solution is to spread the objects out to
>>> a lot of
>>> buckets.  However, that ran me into another issue once I hit
>>> 1000
>>> buckets which is a per user limit.  I switched the limit to
>>> be unlimited
>>> with this command:
>>>
>>> radosgw-admin user modify --uid=your_username --max-buckets=0
>>>
>>> Bryan
>>>
>>>
>>> On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer
>>> <bill.o...@gmail.com> wrote:
>>>
>>>  I'm testing ceph for storing a very large number of
>>> small files.
>>>I'm seeing some performance issues and would like to
>>> see if anyone
>>>  could offer any insight as to what I could do to
>>> correct this.
>>>
>>>  Some numbers:
>>>
>>>  Uploaded 184111 files, with an average file size of
>>> 5KB, using
>>>  10 separate servers to upload the request using Python
>>> and the
>>>  cloudfiles module.  I stopped uploading after 53
>>> minutes, which
>>>  seems to average 5.7 files per second per node.
>>>
>>>
>>>  My storage cluster consists of 21 OSD's across 7
>>> servers, with their
>>>  journals written to SSD drives.  I've done a default
>>>  

Re: [ceph-users] Performance issues with small files

2013-09-05 Thread Bryan Stillwell
Mark,

Yesterday I blew away all the objects and restarted my test using
multiple buckets, and things are definitely better!

After ~20 hours I've already uploaded ~3.5 million objects, which is
much better than the ~1.5 million I did over ~96 hours this past
weekend.  Unfortunately it seems that things have slowed down a bit.
The average upload rate over those first 20 hours was ~48
objects/second, but now I'm only seeing ~20 objects/second.  This is
with 18,836 buckets.
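
(Those rates are just total objects divided by elapsed time, e.g. for
the first 20 hours:

echo $((3500000 / (20 * 3600)))
48

objects/second.)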

Bryan

On Wed, Sep 4, 2013 at 12:43 PM, Bryan Stillwell
 wrote:
> So far I haven't seen much of a change.  It's still working through removing
> the bucket that reached 1.5 million objects though (my guess is that'll take
> a few more days), so I believe that might have something to do with it.
>
> Bryan
>
>
> On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson 
> wrote:
>>
>> Bryan,
>>
>> Good explanation.  How's performance now that you've spread the load over
>> multiple buckets?
>>
>> Mark
>>
>> On 09/04/2013 12:39 PM, Bryan Stillwell wrote:
>>>
>>> Bill,
>>>
>>> I've run into a similar issue with objects averaging ~100KiB.  The
>>> explanation I received on IRC is that there are scaling issues if you're
>>> uploading them all to the same bucket because the index isn't sharded.
>>>   The recommended solution is to spread the objects out to a lot of
>>> buckets.  However, that ran me into another issue once I hit 1000
>>> buckets which is a per user limit.  I switched the limit to be unlimited
>>> with this command:
>>>
>>> radosgw-admin user modify --uid=your_username --max-buckets=0
>>>
>>> Bryan
>>>
>>>
>>> On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer <bill.o...@gmail.com> wrote:
>>>
>>> I'm testing ceph for storing a very large number of small files.
>>>   I'm seeing some performance issues and would like to see if anyone
>>> could offer any insight as to what I could do to correct this.
>>>
>>> Some numbers:
>>>
>>> Uploaded 184111 files, with an average file size of 5KB, using
>>> 10 separate servers to upload the request using Python and the
>>> cloudfiles module.  I stopped uploading after 53 minutes, which
>>> seems to average 5.7 files per second per node.
>>>
>>>
>>> My storage cluster consists of 21 OSD's across 7 servers, with their
>>> journals written to SSD drives.  I've done a default installation,
>>> using ceph-deploy with the dumpling release.
>>>
>>> I'm using statsd to monitor the performance, and what's interesting
>>> is when I start with an empty bucket, performance is amazing, with
>>> average response times of 20-50ms.  However as time goes on, the
>>> response times go in to the hundreds, and the average number of
>>> uploads per second drops.
>>>
>>> I've installed radosgw on all 7 ceph servers.  I've tested using a
>>> load balancer to distribute the api calls, as well as pointing the
>>> 10 worker servers to a single instance.  I've not seen a real
>>> different in performance with this either.
>>>
>>>
>>> Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of ram,
>>> OCZ Vertex4 SSD drives for the journals and Seagate Barracuda ES2
>>> drives for storage.
>>>
>>>
>>> Any help would be greatly appreciated.
>>>
>>>


Re: [ceph-users] Performance issues with small files

2013-09-05 Thread Bryan Stillwell
I need to restart the upload process again because all the objects
have a content-type of 'binary/octet-stream' instead of 'image/jpeg',
'image/png', etc.  I plan on enabling monitoring this time so we can
see if there are any signs of what might be going on.  Did you want me
to increase the number of buckets to see if that changes anything?
This is pretty easy for me to do.

Bryan

On Thu, Sep 5, 2013 at 11:07 AM, Mark Nelson  wrote:
> based on your numbers, you were at something like an average of 186 objects
> per bucket at the 20 hour mark?  I wonder how this trend compares to what
> you'd see with a single bucket.
>
> With that many buckets you should have indexes well spread across all of the
> OSDs.  It'd be interesting to know what the iops/throughput is on all of
> your OSDs now (blktrace/seekwatcher can help here, but they are not the
> easiest tools to setup/use).
>
> Mark
>
> On 09/05/2013 11:59 AM, Bryan Stillwell wrote:
>>
>> Mark,
>>
>> Yesterday I blew away all the objects and restarted my test using
>> multiple buckets, and things are definitely better!
>>
>> After ~20 hours I've already uploaded ~3.5 million objects, which is
>> much better than the ~1.5 million I did over ~96 hours this past
>> weekend.  Unfortunately it seems that things have slowed down a bit.
>> The average upload rate over those first 20 hours was ~48
>> objects/second, but now I'm only seeing ~20 objects/second.  This is
>> with 18,836 buckets.
>>
>> Bryan
>>
>> On Wed, Sep 4, 2013 at 12:43 PM, Bryan Stillwell
>>  wrote:
>>>
>>> So far I haven't seen much of a change.  It's still working through
>>> removing
>>> the bucket that reached 1.5 million objects though (my guess is that'll
>>> take
>>> a few more days), so I believe that might have something to do with it.
>>>
>>> Bryan
>>>
>>>
>>> On Wed, Sep 4, 2013 at 12:14 PM, Mark Nelson 
>>> wrote:
>>>>
>>>>
>>>> Bryan,
>>>>
>>>> Good explanation.  How's performance now that you've spread the load
>>>> over
>>>> multiple buckets?
>>>>
>>>> Mark
>>>>
>>>> On 09/04/2013 12:39 PM, Bryan Stillwell wrote:
>>>>>
>>>>>
>>>>> Bill,
>>>>>
>>>>> I've run into a similar issue with objects averaging ~100KiB.  The
>>>>> explanation I received on IRC is that there are scaling issues if
>>>>> you're
>>>>> uploading them all to the same bucket because the index isn't sharded.
>>>>>The recommended solution is to spread the objects out to a lot of
>>>>> buckets.  However, that ran me into another issue once I hit 1000
>>>>> buckets which is a per user limit.  I switched the limit to be
>>>>> unlimited
>>>>> with this command:
>>>>>
>>>>> radosgw-admin user modify --uid=your_username --max-buckets=0
>>>>>
>>>>> Bryan
>>>>>
>>>>>
>>>>> On Wed, Sep 4, 2013 at 11:27 AM, Bill Omer <bill.o...@gmail.com> wrote:
>>>>>
>>>>>  I'm testing ceph for storing a very large number of small files.
>>>>>I'm seeing some performance issues and would like to see if
>>>>> anyone
>>>>>  could offer any insight as to what I could do to correct this.
>>>>>
>>>>>  Some numbers:
>>>>>
>>>>>  Uploaded 184111 files, with an average file size of 5KB, using
>>>>>  10 separate servers to upload the request using Python and the
>>>>>  cloudfiles module.  I stopped uploading after 53 minutes, which
>>>>>  seems to average 5.7 files per second per node.
>>>>>
>>>>>
>>>>>  My storage cluster consists of 21 OSD's across 7 servers, with
>>>>> their
>>>>>  journals written to SSD drives.  I've done a default installation,
>>>>>  using ceph-deploy with the dumpling release.
>>>>>
>>>>>  I'm using statsd to monitor the performance, and what's
>>>>> interesting
>>>>>  is when I start with an empty bucket, performance is amazing, with
>>>>>  average response times of 20-50ms.  However as time goes on, the
>>>>>  response times go in to the hundreds, and the average number of
>>>>>  uploads per second drops.
>>>>>
>>>>>  I've installed radosgw on all 7 ceph servers.  I've tested using a
>>>>>  load balancer to distribute the api calls, as well as pointing the
>>>>>  10 worker servers to a single instance.  I've not seen a real
>>>>>  different in performance with this either.
>>>>>
>>>>>
>>>>>  Each of the ceph servers are 16 core Xeon 2.53GHz with 72GB of
>>>>> ram,
>>>>>  OCZ Vertex4 SSD drives for the journals and Seagate Barracuda ES2
>>>>>  drives for storage.
>>>>>
>>>>>
>>>>>  Any help would be greatly appreciated.
>>>>>
>>>>>


[ceph-users] Full OSD with 29% free

2013-10-14 Thread Bryan Stillwell
This appears to be more of an XFS issue than a ceph issue, but I've
run into a problem where some of my OSDs failed because the filesystem
was reported as full even though there was 29% free:

[root@den2ceph001 ceph-1]# touch blah
touch: cannot touch `blah': No space left on device
[root@den2ceph001 ceph-1]# df .
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/sdc1       486562672 342139340 144423332  71% /var/lib/ceph/osd/ceph-1
[root@den2ceph001 ceph-1]# df -i .
Filesystem        Inodes   IUsed    IFree IUse% Mounted on
/dev/sdc1       60849984 4097408 56752576    7% /var/lib/ceph/osd/ceph-1
[root@den2ceph001 ceph-1]#

I've tried remounting the filesystem with the inode64 option like a
few people recommended, but that didn't help (probably because it
doesn't appear to be running out of inodes).
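
(For reference, the remount attempt was basically:

mount -o remount,inode64 /var/lib/ceph/osd/ceph-1

in case anyone wants to try the same thing.)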

This happened while I was on vacation and I'm pretty sure it was
caused by another OSD failing on the same node.  I've been able to
recover from the situation by bringing the failed OSD back online, but
it's only a matter of time until I'll be running into this issue again
since my cluster is still being populated.

Any ideas on things I can try the next time this happens?

Thanks,
Bryan


Re: [ceph-users] Full OSD with 29% free

2013-10-14 Thread Bryan Stillwell
The filesystem isn't as full now, but the fragmentation is pretty low:

[root@den2ceph001 ~]# df /dev/sdc1
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/sdc1       486562672 270845628 215717044  56% /var/lib/ceph/osd/ceph-1
[root@den2ceph001 ~]# xfs_db -c frag -r /dev/sdc1
actual 3481543, ideal 3447443, fragmentation factor 0.98%

Bryan

On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe  wrote:
>
> How fragmented is that file system?
>
> Sent from my iPad
>
> > On Oct 14, 2013, at 5:44 PM, Bryan Stillwell  
> > wrote:
> >
> > This appears to be more of an XFS issue than a ceph issue, but I've
> > run into a problem where some of my OSDs failed because the filesystem
> > was reported as full even though there was 29% free:
> >
> > [root@den2ceph001 ceph-1]# touch blah
> > touch: cannot touch `blah': No space left on device
> > [root@den2ceph001 ceph-1]# df .
> > Filesystem      1K-blocks      Used Available Use% Mounted on
> > /dev/sdc1       486562672 342139340 144423332  71% /var/lib/ceph/osd/ceph-1
> > [root@den2ceph001 ceph-1]# df -i .
> > Filesystem        Inodes   IUsed    IFree IUse% Mounted on
> > /dev/sdc1       60849984 4097408 56752576    7% /var/lib/ceph/osd/ceph-1
> > [root@den2ceph001 ceph-1]#
> >
> > I've tried remounting the filesystem with the inode64 option like a
> > few people recommended, but that didn't help (probably because it
> > doesn't appear to be running out of inodes).
> >
> > This happened while I was on vacation and I'm pretty sure it was
> > caused by another OSD failing on the same node.  I've been able to
> > recover from the situation by bringing the failed OSD back online, but
> > it's only a matter of time until I'll be running into this issue again
> > since my cluster is still being populated.
> >
> > Any ideas on things I can try the next time this happens?
> >
> > Thanks,
> > Bryan


Re: [ceph-users] Full OSD with 29% free

2013-10-21 Thread Bryan Stillwell
So I'm running into this issue again and after spending a bit of time
reading the XFS mailing lists, I believe the free space is too
fragmented:

[root@den2ceph001 ceph-0]# xfs_db -r "-c freesp -s" /dev/sdb1
   from      to extents   blocks    pct
      1       1   85773    85773   0.24
      2       3  176891   444356   1.27
      4       7  430854  2410929   6.87
      8      15 2327527 30337352  86.46
     16      31   75871  1809577   5.16
total free extents 3096916
total free blocks 35087987
average free extent size 11.33


Compared to a drive which isn't reporting 'No space left on device':

[root@den2ceph008 ~]# xfs_db -r "-c freesp -s" /dev/sdc1
   from      to extents   blocks    pct
      1       1  133148   133148   0.15
      2       3  320737   808506   0.94
      4       7  809748  4532573   5.27
      8      15 4536681 59305608  68.96
     16      31   31531   751285   0.87
     32      63     364    16367   0.02
     64     127      90     9174   0.01
    128     255       9     2072   0.00
    256     511      48    18018   0.02
    512    1023     128   102422   0.12
   1024    2047     290   451017   0.52
   2048    4095     538  1649408   1.92
   4096    8191     851  5066070   5.89
   8192   16383     746  8436029   9.81
  16384   32767     194  4042573   4.70
  32768   65535      15   614301   0.71
  65536  131071       1    66630   0.08
total free extents 5835119
total free blocks 86005201
average free extent size 14.7392


What I'm wondering is if reducing the block size from 4K to 2K (or 1K)
would help?  I'm pretty sure this would require re-running
mkfs.xfs on every OSD to fix if that's the case...
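
If it comes to that, I'd expect the per-OSD steps to look roughly like
this (destructive, so only after the OSD has been drained or its data
replicated elsewhere):

mkfs.xfs -f -b size=2048 /dev/sdb1
mount -o noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0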

Thanks,
Bryan


On Mon, Oct 14, 2013 at 5:28 PM, Bryan Stillwell
 wrote:
>
> The filesystem isn't as full now, but the fragmentation is pretty low:
>
> [root@den2ceph001 ~]# df /dev/sdc1
> Filesystem      1K-blocks      Used Available Use% Mounted on
> /dev/sdc1       486562672 270845628 215717044  56% /var/lib/ceph/osd/ceph-1
> [root@den2ceph001 ~]# xfs_db -c frag -r /dev/sdc1
> actual 3481543, ideal 3447443, fragmentation factor 0.98%
>
> Bryan
>
> On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe  
> wrote:
> >
> > How fragmented is that file system?
> >
> > Sent from my iPad
> >
> > > On Oct 14, 2013, at 5:44 PM, Bryan Stillwell  
> > > wrote:
> > >
> > > This appears to be more of an XFS issue than a ceph issue, but I've
> > > run into a problem where some of my OSDs failed because the filesystem
> > > was reported as full even though there was 29% free:
> > >
> > > [root@den2ceph001 ceph-1]# touch blah
> > > touch: cannot touch `blah': No space left on device
> > > [root@den2ceph001 ceph-1]# df .
> > > Filesystem      1K-blocks      Used Available Use% Mounted on
> > > /dev/sdc1       486562672 342139340 144423332  71% /var/lib/ceph/osd/ceph-1
> > > [root@den2ceph001 ceph-1]# df -i .
> > > Filesystem        Inodes   IUsed    IFree IUse% Mounted on
> > > /dev/sdc1       60849984 4097408 56752576    7% /var/lib/ceph/osd/ceph-1
> > > [root@den2ceph001 ceph-1]#
> > >
> > > I've tried remounting the filesystem with the inode64 option like a
> > > few people recommended, but that didn't help (probably because it
> > > doesn't appear to be running out of inodes).
> > >
> > > This happened while I was on vacation and I'm pretty sure it was
> > > caused by another OSD failing on the same node.  I've been able to
> > > recover from the situation by bringing the failed OSD back online, but
> > > it's only a matter of time until I'll be running into this issue again
> > > since my cluster is still being populated.
> > >
> > > Any ideas on things I can try the next time this happens?
> > >
> > > Thanks,
> > > Bryan


Re: [ceph-users] Full OSD with 29% free

2013-10-30 Thread Bryan Stillwell
I wanted to report back on this since I've made some progress on
fixing this issue.

After converting every OSD on a single server to use a 2K block size,
I've been able to cross 90% utilization without running into the 'No
space left on device' problem.  They're currently between 51% and 75%,
but I hit 90% over the weekend after a couple OSDs died during
recovery.

This conversion was pretty rough though with OSDs randomly dying
multiple times during the process (logs point at suicide time outs).
When looking at top I would frequently see xfsalloc pegging multiple
cores, so I wonder if that has something to do with it.  I also had
the 'xfs_db -r "-c freesp -s"' command segfault on me a few times
which was fixed by running xfs_repair on those partitions.  This has
me wondering how well XFS is tested with non-default block sizes on
CentOS 6.4...

Anyways, after about a week I was finally able to get the cluster to
fully recover today.  Now I need to repeat the process on 7 more
servers before I can finish populating my cluster...

In case anyone is wondering how I switched to a 2K block size, this is
what I added to my ceph.conf:

[osd]
osd_mount_options_xfs = "rw,noatime,inode64"
osd_mkfs_options_xfs = "-f -b size=2048"


The cluster is currently running the 0.71 release.
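
An easy way to confirm which block size an OSD actually ended up with
is xfs_info on the mount point, e.g.:

xfs_info /var/lib/ceph/osd/ceph-1 | grep bsize

which should show bsize=2048 on the data line for the converted OSDs.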

Bryan

On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
 wrote:
> So I'm running into this issue again and after spending a bit of time
> reading the XFS mailing lists, I believe the free space is too
> fragmented:
>
> [root@den2ceph001 ceph-0]# xfs_db -r "-c freesp -s" /dev/sdb1
>    from      to extents   blocks    pct
>       1       1   85773    85773   0.24
>       2       3  176891   444356   1.27
>       4       7  430854  2410929   6.87
>       8      15 2327527 30337352  86.46
>      16      31   75871  1809577   5.16
> total free extents 3096916
> total free blocks 35087987
> average free extent size 11.33
>
>
> Compared to a drive which isn't reporting 'No space left on device':
>
> [root@den2ceph008 ~]# xfs_db -r "-c freesp -s" /dev/sdc1
>    from      to extents   blocks    pct
>       1       1  133148   133148   0.15
>       2       3  320737   808506   0.94
>       4       7  809748  4532573   5.27
>       8      15 4536681 59305608  68.96
>      16      31   31531   751285   0.87
>      32      63     364    16367   0.02
>      64     127      90     9174   0.01
>     128     255       9     2072   0.00
>     256     511      48    18018   0.02
>     512    1023     128   102422   0.12
>    1024    2047     290   451017   0.52
>    2048    4095     538  1649408   1.92
>    4096    8191     851  5066070   5.89
>    8192   16383     746  8436029   9.81
>   16384   32767     194  4042573   4.70
>   32768   65535      15   614301   0.71
>   65536  131071       1    66630   0.08
> total free extents 5835119
> total free blocks 86005201
> average free extent size 14.7392
>
>
> What I'm wondering is if reducing the block size from 4K to 2K (or 1K)
> would help?  I'm pretty sure this would require re-running
> mkfs.xfs on every OSD to fix if that's the case...
>
> Thanks,
> Bryan
>
>
> On Mon, Oct 14, 2013 at 5:28 PM, Bryan Stillwell
>  wrote:
>>
>> The filesystem isn't as full now, but the fragmentation is pretty low:
>>
>> [root@den2ceph001 ~]# df /dev/sdc1
>> Filesystem      1K-blocks      Used Available Use% Mounted on
>> /dev/sdc1       486562672 270845628 215717044  56% /var/lib/ceph/osd/ceph-1
>> [root@den2ceph001 ~]# xfs_db -c frag -r /dev/sdc1
>> actual 3481543, ideal 3447443, fragmentation factor 0.98%
>>
>> Bryan
>>
>> On Mon, Oct 14, 2013 at 4:35 PM, Michael Lowe  
>> wrote:
>> >
>> > How fragmented is that file system?
>> >
>> > Sent from my iPad
>> >
>> > > On Oct 14, 2013, at 5:44 PM, Bryan Stillwell 
>> > >  wrote:
>> > >
>> > > This appears to be more of an XFS issue than a ceph issue, but I've
>> > > run into a problem where some of my OSDs failed because the filesystem
>> > > was reported as full even though there was 29% free:
>> > >
>> > > [root@den2ceph001 ceph-1]# touch blah
>> > > touch: cannot touch `blah': No space left on device
>> > > [root@den2ceph001 ceph-1]# df .
>> > > Filesystem      1K-blocks      Used Available Use% Mounted on
>> > > /dev/sdc1       486562672 342139340 144423332  71% /var/lib/ceph/osd/ceph-1
>> > > [root@den2ceph001 ceph-1]# df -i .
>> > > Filesystem        Inodes   IUsed

Re: [ceph-users] Full OSD with 29% free

2013-10-31 Thread Bryan Stillwell
Shain,

After getting the segfaults when running 'xfs_db -r "-c freesp -s"' on
a couple partitions, I'm concerned that 2K block sizes aren't nearly
as well tested as 4K block sizes.  This could just be a problem with
RHEL/CentOS 6.4 though, so if you're using a newer kernel the problem
might already be fixed.  There also appears to be more overhead with
2K block sizes which I believe manifests as high CPU usage by the
xfsalloc processes.  However, my cluster has been running in a clean
state for over 24 hours and none of the scrubs have found a problem
yet.

According to 'ceph -s' my cluster has the following stats:

 osdmap e16882: 40 osds: 40 up, 40 in
  pgmap v3520420: 2808 pgs, 13 pools, 5694 GB data, 72705 kobjects
18095 GB used, 13499 GB / 31595 GB avail

That's about 78k per object on average, so if your files aren't that
small I would stay with 4K block sizes to avoid headaches.
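
(That figure is just the pgmap numbers divided out -- roughly 5694 GB
of data across 72,705 kobjects:

echo $((5694 * 1000 * 1000 * 1000 / 72705000))
78316

bytes per object, give or take.)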

Bryan


On Thu, Oct 31, 2013 at 6:43 AM, Shain Miley  wrote:
>
> Bryan,
>
> We are setting up a cluster using xfs and have been a bit concerned about 
> running into similar issues to the ones you described below.
>
> I am just wondering if you came across any potential downsides to using a 2K 
> block size with xfs on your osd's.
>
> Thanks,
>
> Shain
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
> smi...@npr.org | 202.513.3649
>
> ____
> From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] 
> on behalf of Bryan Stillwell [bstillw...@photobucket.com]
> Sent: Wednesday, October 30, 2013 2:18 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Full OSD with 29% free
>
> I wanted to report back on this since I've made some progress on
> fixing this issue.
>
> After converting every OSD on a single server to use a 2K block size,
> I've been able to cross 90% utilization without running into the 'No
> space left on device' problem.  They're currently between 51% and 75%,
> but I hit 90% over the weekend after a couple OSDs died during
> recovery.
>
> This conversion was pretty rough though with OSDs randomly dying
> multiple times during the process (logs point at suicide time outs).
> When looking at top I would frequently see xfsalloc pegging multiple
> cores, so I wonder if that has something to do with it.  I also had
> the 'xfs_db -r "-c freesp -s"' command segfault on me a few times
> which was fixed by running xfs_repair on those partitions.  This has
> me wondering how well XFS is tested with non-default block sizes on
> CentOS 6.4...
>
> Anyways, after about a week I was finally able to get the cluster to
> fully recover today.  Now I need to repeat the process on 7 more
> servers before I can finish populating my cluster...
>
> In case anyone is wondering how I switched to a 2K block size, this is
> what I added to my ceph.conf:
>
> [osd]
> osd_mount_options_xfs = "rw,noatime,inode64"
> osd_mkfs_options_xfs = "-f -b size=2048"
>
>
> The cluster is currently running the 0.71 release.
>
> Bryan
>
> On Mon, Oct 21, 2013 at 2:39 PM, Bryan Stillwell
>  wrote:
> > So I'm running into this issue again and after spending a bit of time
> > reading the XFS mailing lists, I believe the free space is too
> > fragmented:
> >
> > [root@den2ceph001 ceph-0]# xfs_db -r "-c freesp -s" /dev/sdb1
> >    from      to extents   blocks    pct
> >       1       1   85773    85773   0.24
> >       2       3  176891   444356   1.27
> >       4       7  430854  2410929   6.87
> >       8      15 2327527 30337352  86.46
> >      16      31   75871  1809577   5.16
> > total free extents 3096916
> > total free blocks 35087987
> > average free extent size 11.33
> >
> >
> > Compared to a drive which isn't reporting 'No space left on device':
> >
> > [root@den2ceph008 ~]# xfs_db -r "-c freesp -s" /dev/sdc1
> >    from      to extents   blocks    pct
> >       1       1  133148   133148   0.15
> >       2       3  320737   808506   0.94
> >       4       7  809748  4532573   5.27
> >       8      15 4536681 59305608  68.96
> >      16      31   31531   751285   0.87
> >      32      63     364    16367   0.02
> >      64     127      90     9174   0.01
> >     128     255       9     2072   0.00
> >     256     511      48    18018   0.02
> >     512    1023     128   102422   0.12
> >    1024    2047     290   451017   0.52
> >    2048    4095     538  1649408   1.92
> >    4096    8191     851  5066070   5.89
> >

Re: [ceph-users] Full OSD with 29% free

2013-10-31 Thread Bryan Stillwell
Shain,

I investigated the segfault a little more since I sent this message
and found this email thread:

http://oss.sgi.com/archives/xfs/2012-06/msg00066.html

After reading that I did the following:

[root@den2ceph001 ~]# xfs_db -r "-c freesp -s" /dev/sdb1
Segmentation fault (core dumped)
[root@den2ceph001 ~]# service ceph stop osd.0
=== osd.0 ===
Stopping Ceph osd.0 on den2ceph001...kill 2407...kill 2407...done
[root@den2ceph001 ~]# umount /dev/sdb1
[root@den2ceph001 ~]# xfs_db -r "-c freesp -s" /dev/sdb1
   from      to extents   blocks    pct
      1       1   44510    44510   0.05
      2       3   60341   142274   0.16
      4       7   68836   355735   0.39
      8      15  274122  3212122   3.50
     16      31 1429274 37611619  41.02
     32      63   43225  1945740   2.12
     64     127   39480  3585579   3.91
    128     255   36046  6544005   7.14
    256     511   30946 10899979  11.89
    512    1023   14119  9907129  10.80
   1024    2047    5727  7998938   8.72
   2048    4095    2647  6811258   7.43
   4096    8191     362  1940622   2.12
   8192   16383      59   603690   0.66
  16384   32767       5    90464   0.10
total free extents 2049699
total free blocks 91693664
average free extent size 44.7352


That gives me a little more confidence in using 2K block sizes now.  :)

Bryan

On Thu, Oct 31, 2013 at 11:02 AM, Bryan Stillwell
 wrote:
> Shain,
>
> After getting the segfaults when running 'xfs_db -r "-c freesp -s"' on
> a couple partitions, I'm concerned that 2K block sizes aren't nearly
> as well tested as 4K block sizes.  This could just be a problem with
> RHEL/CentOS 6.4 though, so if you're using a newer kernel the problem
> might already be fixed.  There also appears to be more overhead with
> 2K block sizes which I believe manifests as high CPU usage by the
> xfsalloc processes.  However, my cluster has been running in a clean
> state for over 24 hours and none of the scrubs have found a problem
> yet.
>
> According to 'ceph -s' my cluster has the following stats:
>
>  osdmap e16882: 40 osds: 40 up, 40 in
>   pgmap v3520420: 2808 pgs, 13 pools, 5694 GB data, 72705 kobjects
> 18095 GB used, 13499 GB / 31595 GB avail
>
> That's about 78k per object on average, so if your files aren't that
> small I would stay with 4K block sizes to avoid headaches.
>
> Bryan
>
>
> On Thu, Oct 31, 2013 at 6:43 AM, Shain Miley  wrote:
>>
>> Bryan,
>>
>> We are setting up a cluster using xfs and have been a bit concerned about 
>> running into similar issues to the ones you described below.
>>
>> I am just wondering if you came across any potential downsides to using a 2K 
>> block size with xfs on your osd's.
>>
>> Thanks,
>>
>> Shain
>>
>> Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
>> smi...@npr.org | 202.513.3649
>>
>> 
>> From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] 
>> on behalf of Bryan Stillwell [bstillw...@photobucket.com]
>> Sent: Wednesday, October 30, 2013 2:18 PM
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Full OSD with 29% free
>>
>> I wanted to report back on this since I've made some progress on
>> fixing this issue.
>>
>> After converting every OSD on a single server to use a 2K block size,
>> I've been able to cross 90% utilization without running into the 'No
>> space left on device' problem.  They're currently between 51% and 75%,
>> but I hit 90% over the weekend after a couple OSDs died during
>> recovery.
>>
>> This conversion was pretty rough though with OSDs randomly dying
>> multiple times during the process (logs point at suicide time outs).
>> When looking at top I would frequently see xfsalloc pegging multiple
>> cores, so I wonder if that has something to do with it.  I also had
>> the 'xfs_db -r "-c freesp -s"' command segfault on me a few times
>> which was fixed by running xfs_repair on those partitions.  This has
>> me wondering how well XFS is tested with non-default block sizes on
>> CentOS 6.4...
>>
>> Anyways, after about a week I was finally able to get the cluster to
>> fully recover today.  Now I need to repeat the process on 7 more
>> servers before I can finish populating my cluster...
>>
>> In case anyone is wondering how I switched to a 2K block size, this is
>> what I added to my ceph.conf:
>>
>> [osd]
>> osd_mount_options_xfs = "rw,noatime,inode64"
>> osd_mkfs_options_xfs = "-f -b size=2048"
>>

[ceph-users] Recover from corrupted journals

2013-11-12 Thread Bryan Stillwell
While updating my cluster to use a 2K block size for XFS, I've run
into a couple OSDs failing to start because of corrupted journals:

=== osd.1 ===
   -10> 2013-11-12 13:40:35.388177 7f030458a7a0  1
filestore(/var/lib/ceph/osd/ceph-1) mount detected xfs
-9> 2013-11-12 13:40:35.388194 7f030458a7a0  1
filestore(/var/lib/ceph/osd/ceph-1)  disabling 'filestore replica
fadvise' due to known issues with fadvise(DONTNEED) on xfs
-8> 2013-11-12 13:40:49.735893 7f030458a7a0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is supported and appears to work
-7> 2013-11-12 13:40:49.735955 7f030458a7a0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
-6> 2013-11-12 13:40:49.778879 7f030458a7a0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
syscall(SYS_syncfs, fd) fully supported
-5> 2013-11-12 13:41:02.512202 7f030458a7a0  0
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
-4> 2013-11-12 13:41:05.932177 7f030458a7a0  2 journal open
/var/lib/ceph/osd/ceph-1/journal fsid
f7bde53e-458a-4398-a949-770648ddc414 fs_op_seq 2973368
-3> 2013-11-12 13:41:05.964093 7f030458a7a0  1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 1072693248 bytes, block size
4096 bytes, directio = 1, aio = 1
-2> 2013-11-12 13:41:05.987641 7f030458a7a0  2 journal read_entry
361586688 : seq 2973370 55428 bytes
-1> 2013-11-12 13:41:05.988024 7f030458a7a0 -1 journal Unable to
read past sequence 2973369 but header indicates the journal has
committed up through 2980190, journal is corrupt
 0> 2013-11-12 13:41:06.070833 7f030458a7a0 -1 os/FileJournal.cc:
In function 'bool FileJournal::read_entry(ceph::bufferlist&,
uint64_t&, bool*)' thread 7f030458a7a0 time 2013-11-12 13:41:05.988054
os/FileJournal.cc: 1697: FAILED assert(0)

 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217)
 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
bool*)+0xa46) [0x6d9ab6]
 2: (JournalingObjectStore::journal_replay(unsigned long)+0x325) [0x865835]
 3: (FileStore::mount()+0x2db0) [0x70e330]
 4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x608dba]
 5: (OSD::convertfs(std::string const&, std::string const&)+0x49) [0x6097c9]
 6: (main()+0x3190) [0x5c65d0]
 7: (__libc_start_main()+0xfd) [0x3ee0e1ecdd]
 8: /usr/bin/ceph-osd() [0x5c3089]


=== osd.4 ===
   -10> 2013-11-11 16:31:52.697736 7fefe710e7a0  1
filestore(/var/lib/ceph/osd/ceph-4) mount detected xfs
-9> 2013-11-11 16:31:52.697764 7fefe710e7a0  1
filestore(/var/lib/ceph/osd/ceph-4)  disabling 'filestore replica
fadvise' due to known issues with fadvise(DONTNEED) on xfs
-8> 2013-11-11 16:32:06.301437 7fefe710e7a0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
FIEMAP ioctl is supported and appears to work
-7> 2013-11-11 16:32:06.301478 7fefe710e7a0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
-6> 2013-11-11 16:32:06.321094 7fefe710e7a0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
syscall(SYS_syncfs, fd) fully supported
-5> 2013-11-11 16:32:06.642899 7fefe710e7a0  0
filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
-4> 2013-11-11 16:32:10.047982 7fefe710e7a0  2 journal open
/var/lib/ceph/osd/ceph-4/journal fsid
1c68cdc3-4ba1-4711-86a2-517d32b352fa fs_op_seq 2964169
-3> 2013-11-11 16:32:10.062596 7fefe710e7a0  1 journal _open
/var/lib/ceph/osd/ceph-4/journal fd 21: 1072693248 bytes, block size
4096 bytes, directio = 1, aio = 1
-2> 2013-11-11 16:32:10.132954 7fefe710e7a0  2 journal read_entry
993447936 : seq 2964171 8007 bytes
-1> 2013-11-11 16:32:10.133125 7fefe710e7a0 -1 journal Unable to
read past sequence 2964170 but header indicates the journal has
committed up through 2967854, journal is corrupt
 0> 2013-11-11 16:32:10.135432 7fefe710e7a0 -1 os/FileJournal.cc:
In function 'bool FileJournal::read_entry(ceph::bufferlist&,
uint64_t&, bool*)' thread 7fefe710e7a0 time 2013-11-11 16:32:10.133149
os/FileJournal.cc: 1697: FAILED assert(0)

 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217)
 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
bool*)+0xa46) [0x6d9ab6]
 2: (JournalingObjectStore::journal_replay(unsigned long)+0x325) [0x865835]
 3: (FileStore::mount()+0x2db0) [0x70e330]
 4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x608dba]
 5: (OSD::convertfs(std::string const&, std::string const&)+0x49) [0x6097c9]
 6: (main()+0x3190) [0x5c65d0]
 7: (__libc_start_main()+0xfd) [0x3ee0e1ecdd]
 8: /usr/bin/ceph-osd() [0x5c3089]


What's the best way to recover from this situation?

Thanks,
Bryan

Re: [ceph-users] CephFS First product release discussion

2013-03-05 Thread Bryan Stillwell
On Tue, Mar 5, 2013 at 12:44 PM, Kevin Decherf  wrote:
>
> On Tue, Mar 05, 2013 at 12:27:04PM -0600, Dino Yancey wrote:
> > The only two features I'd deem necessary for our workload would be
> > stable distributed metadata / MDS and a working fsck equivalent.
> > Snapshots would be great once the feature is deemed stable, as would
>
> We have the same needs here.

Stable distributed metadata and snapshots are the most important to me.

Bryan


[ceph-users] Bobtail & Precise

2013-04-03 Thread Bryan Stillwell
I have two test clusters running Bobtail (0.56.4) and Ubuntu Precise
(12.04.2).  The problem I'm having is that I'm not able to get either
of them into a state where I can both mount the filesystem and have
all the PGs in the active+clean state.

It seems that on both clusters I can get them into a 100% active+clean
state by setting "ceph osd crush tunables bobtail", but when I try to
mount the filesystem I get:

mount error 5 = Input/output error
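
(For reference, the mount command I'm using is along the lines of:

mount -t ceph 172.16.0.50:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret

with the mount point and secret file paths being placeholders.)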


However, if I set "ceph osd crush tunables legacy" I can mount both
filesystems, but then some of the PGs are stuck in the
"active+remapped" state:

# ceph -s
   health HEALTH_WARN 29 pgs stuck unclean; recovery 5/1604152 degraded (0.000%)
   monmap e1: 1 mons at {a=172.16.0.50:6789/0}, election epoch 1, quorum 0 a
   osdmap e10272: 20 osds: 20 up, 20 in
pgmap v1114740: 1920 pgs: 1890 active+clean, 29 active+remapped, 1
active+clean+scrubbing; 3086 GB data, 6201 GB used, 3098 GB / 9300 GB
avail; 232B/s wr, 0op/s; 5/1604152 degraded (0.000%)
   mdsmap e420: 1/1/1 up {0=a=up:active}


Is anyone else seeing this?

Thanks,
Bryan


Re: [ceph-users] Bobtail & Precise

2013-04-18 Thread Bryan Stillwell
John,

Thanks for your response.  I haven't spent a lot of time on this issue
since then, so I'm still in the same situation.  I do remember seeing an
error message about an unsupported feature at one point after setting the
tunables to bobtail.

Bryan


On Thu, Apr 18, 2013 at 1:51 PM, John Wilkins wrote:

> Bryan,
>
> It seems you got crickets with this question. Did you get any further? I'd
> like to add it to my upcoming CRUSH troubleshooting section.
>
>
> On Wed, Apr 3, 2013 at 9:27 AM, Bryan Stillwell <
> bstillw...@photobucket.com> wrote:
>
>> I have two test clusters running Bobtail (0.56.4) and Ubuntu Precise
>> (12.04.2).  The problem I'm having is that I'm not able to get either
>> of them into a state where I can both mount the filesystem and have
>> all the PGs in the active+clean state.
>>
>> It seems that on both clusters I can get them into a 100% active+clean
>> state by setting "ceph osd crush tunables bobtail", but when I try to
>> mount the filesystem I get:
>>
>> mount error 5 = Input/output error
>>
>>
>> However, if I set "ceph osd crush tunables legacy" I can mount both
>> filesystems, but then some of the PGs are stuck in the
>> "active+remapped" state:
>>
>> # ceph -s
>>health HEALTH_WARN 29 pgs stuck unclean; recovery 5/1604152 degraded
>> (0.000%)
>>monmap e1: 1 mons at {a=172.16.0.50:6789/0}, election epoch 1, quorum
>> 0 a
>>osdmap e10272: 20 osds: 20 up, 20 in
>> pgmap v1114740: 1920 pgs: 1890 active+clean, 29 active+remapped, 1
>> active+clean+scrubbing; 3086 GB data, 6201 GB used, 3098 GB / 9300 GB
>> avail; 232B/s wr, 0op/s; 5/1604152 degraded (0.000%)
>>mdsmap e420: 1/1/1 up {0=a=up:active}
>>
>>
>> Is any one else seeing this?
>>
>> Thanks,
>> Bryan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> John Wilkins
> Senior Technical Writer
> Intank
> john.wilk...@inktank.com
> (415) 425-9599
> http://inktank.com
>



-- 

*Bryan Stillwell*
SENIOR SYSTEM ADMINISTRATOR

E: bstillw...@photobucket.com
O: 303.228.5109
M: 970.310.6085



Re: [ceph-users] Bobtail & Precise

2013-04-18 Thread Bryan Stillwell
What's the fix for people running precise (12.04)?  I believe I see the
same issue with quantal (12.10) as well.


On Thu, Apr 18, 2013 at 1:56 PM, Gregory Farnum  wrote:

> Seeing this go by again it's simple enough to provide a quick
> answer/hint — by setting the tunables it's of course getting a better
> distribution of data, but the reason they're optional to begin with is
> that older clients won't support them. In this case, the kernel client
> being run; so it returns an error.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Thu, Apr 18, 2013 at 12:51 PM, John Wilkins 
> wrote:
> > Bryan,
> >
> > It seems you got crickets with this question. Did you get any further?
> I'd
> > like to add it to my upcoming CRUSH troubleshooting section.
> >
> >
> > On Wed, Apr 3, 2013 at 9:27 AM, Bryan Stillwell <
> bstillw...@photobucket.com>
> > wrote:
> >>
> >> I have two test clusters running Bobtail (0.56.4) and Ubuntu Precise
> >> (12.04.2).  The problem I'm having is that I'm not able to get either
> >> of them into a state where I can both mount the filesystem and have
> >> all the PGs in the active+clean state.
> >>
> >> It seems that on both clusters I can get them into a 100% active+clean
> >> state by setting "ceph osd crush tunables bobtail", but when I try to
> >> mount the filesystem I get:
> >>
> >> mount error 5 = Input/output error
> >>
> >>
> >> However, if I set "ceph osd crush tunables legacy" I can mount both
> >> filesystems, but then some of the PGs are stuck in the
> >> "active+remapped" state:
> >>
> >> # ceph -s
> >>health HEALTH_WARN 29 pgs stuck unclean; recovery 5/1604152 degraded
> >> (0.000%)
> >>monmap e1: 1 mons at {a=172.16.0.50:6789/0}, election epoch 1,
> quorum 0
> >> a
> >>osdmap e10272: 20 osds: 20 up, 20 in
> >> pgmap v1114740: 1920 pgs: 1890 active+clean, 29 active+remapped, 1
> >> active+clean+scrubbing; 3086 GB data, 6201 GB used, 3098 GB / 9300 GB
> >> avail; 232B/s wr, 0op/s; 5/1604152 degraded (0.000%)
> >>mdsmap e420: 1/1/1 up {0=a=up:active}
> >>
> >>
> >> Is any one else seeing this?
> >>
> >> Thanks,
> >> Bryan
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > --
> > John Wilkins
> > Senior Technical Writer
> > Intank
> > john.wilk...@inktank.com
> > (415) 425-9599
> > http://inktank.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 

*Bryan Stillwell*
SENIOR SYSTEM ADMINISTRATOR

E: bstillw...@photobucket.com
O: 303.228.5109
M: 970.310.6085



Re: [ceph-users] Bobtail & Precise

2013-04-18 Thread Bryan Stillwell
Ahh, I think I have a better understanding now.  I had my crush map set up
like this:

default
    basement
        rack1
            server1
                osd.0
                osd.1
                osd.2
                osd.3
                osd.4
            server2
                osd.5
                osd.6
                osd.7
                osd.8
                osd.9
        rack2
            server3
                osd.10
                osd.11
                osd.12
                osd.13
                osd.14
            server4
                osd.15
                osd.16
                osd.17
                osd.18
                osd.19

Since those failure domains are pretty small for the 2X replicas I
currently have set, I went ahead and changed it to be like this:

default
    server1
        osd.0
        osd.1
        osd.2
        osd.3
        osd.4
    server2
        osd.5
        osd.6
        osd.7
        osd.8
        osd.9
    server3
        osd.10
        osd.11
        osd.12
        osd.13
        osd.14
    server4
        osd.15
        osd.16
        osd.17
        osd.18
        osd.19

It's currently rebalancing with the new crushmap, so we shall see if that
clears things up in a few hours.
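
For anyone wanting to do the same, the edit was the usual
decompile/edit/recompile cycle, roughly:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt to flatten the hierarchy
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new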

Bryan


On Thu, Apr 18, 2013 at 2:11 PM, Gregory Farnum  wrote:

> There's not really a fix — either update all your clients so they support
> the tunables (I'm not sure how new a kernel you need), or else run without
> the tunables. In setups where your branching factors aren't very close to
> your replication counts they aren't normally needed, if you want to reshape
> your cluster a little bit.
> -Greg
>
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Thu, Apr 18, 2013 at 1:04 PM, Bryan Stillwell <
> bstillw...@photobucket.com> wrote:
>
>> What's the fix for people running precise (12.04)?  I believe I see the
>> same issue with quantal (12.10) as well.
>>
>>
>> On Thu, Apr 18, 2013 at 1:56 PM, Gregory Farnum  wrote:
>>
>>> Seeing this go by again it's simple enough to provide a quick
>>> answer/hint — by setting the tunables it's of course getting a better
>>> distribution of data, but the reason they're optional to begin with is
>>> that older clients won't support them. In this case, the kernel client
>>> being run; so it returns an error.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Thu, Apr 18, 2013 at 12:51 PM, John Wilkins 
>>> wrote:
>>> > Bryan,
>>> >
>>> > It seems you got crickets with this question. Did you get any further?
>>> I'd
>>> > like to add it to my upcoming CRUSH troubleshooting section.
>>> >
>>> >
>>> > On Wed, Apr 3, 2013 at 9:27 AM, Bryan Stillwell <
>>> bstillw...@photobucket.com>
>>> > wrote:
>>> >>
>>> >> I have two test clusters running Bobtail (0.56.4) and Ubuntu Precise
>>> >> (12.04.2).  The problem I'm having is that I'm not able to get either
>>> >> of them into a state where I can both mount the filesystem and have
>>> >> all the PGs in the active+clean state.
>>> >>
>>> >> It seems that on both clusters I can get them into a 100% active+clean
>>> >> state by setting "ceph osd crush tunables bobtail", but when I try to
>>> >> mount the filesystem I get:
>>> >>
>>> >> mount error 5 = Input/output error
>>> >>
>>> >>
>>> >> However, if I set "ceph osd crush tunables legacy" I can mount both
>>> >> filesystems, but then some of the PGs are stuck in the
>>> >> "active+remapped" state:
>>> >>
>>> >> # ceph -s
>>> >>health HEALTH_WARN 29 pgs stuck unclean; recovery 5/1604152
>>> degraded
>>> >> (0.000%)
>>> >>monmap e1: 1 mons at {a=172.16.0.50:6789/0}, election epoch 1,
>>> quorum 0
>>> >> a
>>> >>osdmap e10272: 20 osds: 20 up, 20 in
>>> >> pgmap v1114740: 1920 pgs: 1890 active+clean, 29 active+remapped, 1
>>> >> active+clean+scrubbing; 3086 GB data, 6201 GB used, 3098 GB / 9300 GB
>>> >> avail; 232B/s wr, 0op/s; 5/1604152 degraded (0.000%)
>>> >>mdsmap e420: 1/1/1 up {0=a=up:active}
>>> >>
>>> >>
>>> >> Is any one else seeing this?
>>> >>
>>> >> Thanks,
>>> >>

[ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I've run into an issue where after copying a file to my cephfs cluster
the md5sums no longer match.  I believe I've tracked it down to some
parts of the file which are missing:

$ obj_name=$(cephfs "title1.mkv" show_location -l 0 | grep object_name
| sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")
$ echo "Object name: $obj_name"
Object name: 1001120

$ file_size=$(stat "title1.mkv" | grep Size | awk '{ print $2 }')
$ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) $file_size
File size: 20074 MiB (21049178117 Bytes)

$ blocks=$((file_size/4194304+1))
$ printf "Blocks: %d\n" $blocks
Blocks: 5019

$ for b in `seq 0 $(($blocks-1))`; do rados -p data stat
${obj_name}.`printf '%8.8x\n' $b` | grep "error"; done
 error stat-ing data/1001120.1076: No such file or directory
 error stat-ing data/1001120.11c7: No such file or directory
 error stat-ing data/1001120.129c: No such file or directory
 error stat-ing data/1001120.12f4: No such file or directory
 error stat-ing data/1001120.1307: No such file or directory


Any ideas where to look to investigate what caused these blocks to not
be written?
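
(The block math above assumes the default 4 MiB object size.  If the file had
been written with a custom layout the object count would be off, so it's worth
confirming the layout with the same cephfs utility, e.g.:

$ cephfs "title1.mkv" show_layout

and checking that the object size it reports is 4194304.)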

Here's the current state of the cluster:

ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 0 a
   osdmap e22059: 24 osds: 24 up, 24 in
pgmap v1783615: 1920 pgs: 1917 active+clean, 3
active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB /
13592 GB avail
   mdsmap e437: 1/1/1 up {0=a=up:active}

Here's my current crushmap:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host b1 {
id -2   # do not change unnecessarily
# weight 2.980
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.500
item osd.1 weight 0.500
item osd.2 weight 0.500
item osd.3 weight 0.500
item osd.4 weight 0.500
item osd.20 weight 0.480
}
host b2 {
id -4   # do not change unnecessarily
# weight 4.680
alg straw
hash 0  # rjenkins1
item osd.5 weight 0.500
item osd.6 weight 0.500
item osd.7 weight 2.200
item osd.8 weight 0.500
item osd.9 weight 0.500
item osd.21 weight 0.480
}
host b3 {
id -5   # do not change unnecessarily
# weight 3.480
alg straw
hash 0  # rjenkins1
item osd.10 weight 0.500
item osd.11 weight 0.500
item osd.12 weight 1.000
item osd.13 weight 0.500
item osd.14 weight 0.500
item osd.22 weight 0.480
}
host b4 {
id -6   # do not change unnecessarily
# weight 3.480
alg straw
hash 0  # rjenkins1
item osd.15 weight 0.500
item osd.16 weight 1.000
item osd.17 weight 0.500
item osd.18 weight 0.500
item osd.19 weight 0.500
item osd.23 weight 0.480
}
pool default {
id -1   # do not change unnecessarily
# weight 14.620
alg straw
hash 0  # rjenkins1
item b1 weight 2.980
item b2 weight 4.680
item b3 weight 3.480
item b4 weight 3.480
}

# rules
rule data {
ruleset 0
type replicated
min_size 2
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 2
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map


Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I've tried a few different ones:

1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)

It's fairly reproducible, so I can collect logs for you.  Which ones
would you be interested in?

The cluster has been in a couple states during testing (during
expansion/rebalancing and during an all active+clean state).

BTW, all the nodes are running with the 0.56.4-1precise packages.

Bryan

On Tue, Apr 23, 2013 at 12:56 PM, Gregory Farnum  wrote:
> On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell
>  wrote:
>> I've run into an issue where after copying a file to my cephfs cluster
>> the md5sums no longer match.  I believe I've tracked it down to some
>> parts of the file which are missing:
>>
>> $ obj_name=$(cephfs "title1.mkv" show_location -l 0 | grep object_name
>> | sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")
>> $ echo "Object name: $obj_name"
>> Object name: 1001120
>>
>> $ file_size=$(stat "title1.mkv" | grep Size | awk '{ print $2 }')
>> $ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) $file_size
>> File size: 20074 MiB (21049178117 Bytes)
>>
>> $ blocks=$((file_size/4194304+1))
>> $ printf "Blocks: %d\n" $blocks
>> Blocks: 5019
>>
>> $ for b in `seq 0 $(($blocks-1))`; do rados -p data stat
>> ${obj_name}.`printf '%8.8x\n' $b` | grep "error"; done
>>  error stat-ing data/1001120.1076: No such file or directory
>>  error stat-ing data/1001120.11c7: No such file or directory
>>  error stat-ing data/1001120.129c: No such file or directory
>>  error stat-ing data/1001120.12f4: No such file or directory
>>  error stat-ing data/1001120.1307: No such file or directory
>>
>>
>> Any ideas where to look to investigate what caused these blocks to not
>> be written?
>
> What client are you using to write this? Is it fairly reproducible (so
> you could collect logs of it happening)?
>
> Usually the only times I've seen anything like this were when either
> the file data was supposed to go into a pool which the client didn't
> have write permissions on, or when the RADOS cluster was in bad shape
> and so the data never got flushed to disk. Has your cluster been
> healthy since you started writing the file out?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>>
>> Here's the current state of the cluster:
>>
>> ceph -s
>>health HEALTH_OK
>>monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 0 a
>>osdmap e22059: 24 osds: 24 up, 24 in
>> pgmap v1783615: 1920 pgs: 1917 active+clean, 3
>> active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB /
>> 13592 GB avail
>>mdsmap e437: 1/1/1 up {0=a=up:active}
>>
>> Here's my current crushmap:
>>
>> # begin crush map
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>> device 14 osd.14
>> device 15 osd.15
>> device 16 osd.16
>> device 17 osd.17
>> device 18 osd.18
>> device 19 osd.19
>> device 20 osd.20
>> device 21 osd.21
>> device 22 osd.22
>> device 23 osd.23
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 rack
>> type 3 row
>> type 4 room
>> type 5 datacenter
>> type 6 pool
>>
>> # buckets
>> host b1 {
>> id -2   # do not change unnecessarily
>> # weight 2.980
>> alg straw
>> hash 0  # rjenkins1
>> item osd.0 weight 0.500
>> item osd.1 weight 0.500
>> item osd.2 weight 0.500
>> item osd.3 weight 0.500
>> item osd.4 weight 0.500
>> item osd.20 weight 0.480
>> }
>> host b2 {
>> id -4   # do not change unnecessarily
>> # weight 4.680
>> alg straw
>> hash 0  # rjenkins1
>> item osd.5 weight 0.500
>> item osd.6 weight 0.500
>> item osd.7 weight 2.200
>> item osd.8 weight 0.500
>> item osd.9 weight 0.500
>> item osd.

Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I'm using the kernel client that's built into precise & quantal.

I could give the ceph-fuse client a try and see if it has the same
issue.  I haven't used it before, so I'll have to do some reading
first.
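
From what I've read so far, a minimal invocation should be something along
these lines (assuming the default admin keyring location and a /mnt/cephfs
mount point; adjust paths for your setup):

$ sudo apt-get install ceph-fuse
$ sudo mkdir -p /mnt/cephfs
$ sudo ceph-fuse -c /etc/ceph/ceph.conf --name client.admin \
    --keyring /etc/ceph/ceph.client.admin.keyring /mnt/cephfs
$ sudo fusermount -u /mnt/cephfs    # to unmount afterwards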

Bryan

On Tue, Apr 23, 2013 at 4:04 PM, Gregory Farnum  wrote:
> Sorry, I meant kernel client or ceph-fuse? Client logs would be enough
> to start with, I suppose — "debug client = 20" and "debug ms = 1" if
> using ceph-fuse; if using the kernel client things get trickier; I'd
> have to look at what logging is available without the debugfs stuff
> being enabled. :/
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Apr 23, 2013 at 3:00 PM, Bryan Stillwell
>  wrote:
>> I've tried a few different ones:
>>
>> 1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
>> 2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
>> 3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
>>
>> It's fairly reproducible, so I can collect logs for you.  Which ones
>> would you be interested in?
>>
>> The cluster has been in a couple states during testing (during
>> expansion/rebalancing and during an all active+clean state).
>>
>> BTW, all the nodes are running with the 0.56.4-1precise packages.
>>
>> Bryan
>>
>> On Tue, Apr 23, 2013 at 12:56 PM, Gregory Farnum  wrote:
>>> On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell
>>>  wrote:
>>>> I've run into an issue where after copying a file to my cephfs cluster
>>>> the md5sums no longer match.  I believe I've tracked it down to some
>>>> parts of the file which are missing:
>>>>
>>>> $ obj_name=$(cephfs "title1.mkv" show_location -l 0 | grep object_name
>>>> | sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")
>>>> $ echo "Object name: $obj_name"
>>>> Object name: 1001120
>>>>
>>>> $ file_size=$(stat "title1.mkv" | grep Size | awk '{ print $2 }')
>>>> $ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) 
>>>> $file_size
>>>> File size: 20074 MiB (21049178117 Bytes)
>>>>
>>>> $ blocks=$((file_size/4194304+1))
>>>> $ printf "Blocks: %d\n" $blocks
>>>> Blocks: 5019
>>>>
>>>> $ for b in `seq 0 $(($blocks-1))`; do rados -p data stat
>>>> ${obj_name}.`printf '%8.8x\n' $b` | grep "error"; done
>>>>  error stat-ing data/1001120.1076: No such file or directory
>>>>  error stat-ing data/1001120.11c7: No such file or directory
>>>>  error stat-ing data/1001120.129c: No such file or directory
>>>>  error stat-ing data/1001120.12f4: No such file or directory
>>>>  error stat-ing data/1001120.1307: No such file or directory
>>>>
>>>>
>>>> Any ideas where to look to investigate what caused these blocks to not
>>>> be written?
>>>
>>> What client are you using to write this? Is it fairly reproducible (so
>>> you could collect logs of it happening)?
>>>
>>> Usually the only times I've seen anything like this were when either
>>> the file data was supposed to go into a pool which the client didn't
>>> have write permissions on, or when the RADOS cluster was in bad shape
>>> and so the data never got flushed to disk. Has your cluster been
>>> healthy since you started writing the file out?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>>>
>>>> Here's the current state of the cluster:
>>>>
>>>> ceph -s
>>>>health HEALTH_OK
>>>>monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 
>>>> 0 a
>>>>osdmap e22059: 24 osds: 24 up, 24 in
>>>> pgmap v1783615: 1920 pgs: 1917 active+clean, 3
>>>> active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB /
>>>> 13592 GB avail
>>>>mdsmap e437: 1/1/1 up {0=a=up:active}
>>>>
>>>> Here's my current crushmap:
>>>>
>>>> # begin crush map
>>>>
>>>> # devices
>>>> device 0 osd.0
>>>> device 1 osd.1
>>>> device 2 osd.2
>>>> device 3 osd.3
>>>> device 4 osd.4
>>>> device 5 osd.5
>>&

Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I'm testing this now, but while going through the logs I saw something
that might have something to do with this:

Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
epoch 22146 off 102 (88021e0dc802 of
88021e0dc79c-88021e0dc802)
Apr 23 16:35:28 a1 kernel: [692455.505154] osdmap: : 05 00 69
17 a0 33 34 39 4f d7 88 db 46 c9 e1 df  ..i..349O...F...
Apr 23 16:35:28 a1 kernel: [692455.505158] osdmap: 0010: 0d 6e 82
56 00 00 b0 0c 77 51 00 1a 00 22 ff ff  .n.VwQ..."..
Apr 23 16:35:28 a1 kernel: [692455.505161] osdmap: 0020: ff ff ff
ff ff ff 00 00 00 00 00 00 00 00 ff ff  
Apr 23 16:35:28 a1 kernel: [692455.505163] osdmap: 0030: ff ff 00
00 00 00 00 00 00 00 00 00 00 00 00 00  
Apr 23 16:35:28 a1 kernel: [692455.505166] osdmap: 0040: 00 00 00
00 00 00 00 00 00 00 01 00 00 00 ff ff  
Apr 23 16:35:28 a1 kernel: [692455.505169] osdmap: 0050: 5c 02 00
00 00 00 03 00 00 00 0c 00 00 00 00 00  \...
Apr 23 16:35:28 a1 kernel: [692455.505171] osdmap: 0060: 00 00 02
00 00 00..
Apr 23 16:35:28 a1 kernel: [692455.505174] libceph: osdc handle_map corrupt msg
Apr 23 16:35:28 a1 kernel: [692455.513590] header: : 90 03 00
00 00 00 00 00 00 00 00 00 00 00 00 00  
Apr 23 16:35:28 a1 kernel: [692455.513593] header: 0010: 29 00 c4
00 01 00 86 00 00 00 00 00 00 00 00 00  )...
Apr 23 16:35:28 a1 kernel: [692455.513596] header: 0020: 00 00 00
00 01 00 00 00 00 00 00 00 00 01 00 00  
Apr 23 16:35:28 a1 kernel: [692455.513599] header: 0030: 00 5d 68
c5 e8   .]h..
Apr 23 16:35:28 a1 kernel: [692455.513602]  front: : 69 17 a0
33 34 39 4f d7 88 db 46 c9 e1 df 0d 6e  i..349O...Fn
Apr 23 16:35:28 a1 kernel: [692455.513605]  front: 0010: 01 00 00
00 82 56 00 00 66 00 00 00 05 00 69 17  .V..f.i.
Apr 23 16:35:28 a1 kernel: [692455.513607]  front: 0020: a0 33 34
39 4f d7 88 db 46 c9 e1 df 0d 6e 82 56  .349O...Fn.V
Apr 23 16:35:28 a1 kernel: [692455.513610]  front: 0030: 00 00 b0
0c 77 51 00 1a 00 22 ff ff ff ff ff ff  wQ..."..
Apr 23 16:35:28 a1 kernel: [692455.513613]  front: 0040: ff ff 00
00 00 00 00 00 00 00 ff ff ff ff 00 00  
Apr 23 16:35:28 a1 kernel: [692455.513616]  front: 0050: 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00  
Apr 23 16:35:28 a1 kernel: [692455.513618]  front: 0060: 00 00 00
00 00 00 01 00 00 00 ff ff 5c 02 00 00  \...
Apr 23 16:35:28 a1 kernel: [692455.513621]  front: 0070: 00 00 03
00 00 00 0c 00 00 00 00 00 00 00 02 00  
Apr 23 16:35:28 a1 kernel: [692455.513624]  front: 0080: 00 00 00
00 00 00..
Apr 23 16:35:28 a1 kernel: [692455.513627] footer: : ae ee 1e
d8 00 00 00 00 00 00 00 00 01   .

On Tue, Apr 23, 2013 at 4:41 PM, Gregory Farnum  wrote:
> On Tue, Apr 23, 2013 at 3:37 PM, Bryan Stillwell
>  wrote:
>> I'm using the kernel client that's built into precise & quantal.
>>
>> I could give the ceph-fuse client a try and see if it has the same
>> issue.  I haven't used it before, so I'll have to do some reading
>> first.
>
> If you've got the time that would be a good data point, and make
> debugging easier if it reproduces. There's not a ton to learn — you
> install the ceph-fuse package (I think it's packaged separately,
> anyway) and then instead of "mount" you run "ceph-fuse -c <config file> --name client.<name> --keyring <keyring>" or similar. :)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>>
>> Bryan
>>
>> On Tue, Apr 23, 2013 at 4:04 PM, Gregory Farnum  wrote:
>>> Sorry, I meant kernel client or ceph-fuse? Client logs would be enough
>>> to start with, I suppose — "debug client = 20" and "debug ms = 1" if
>>> using ceph-fuse; if using the kernel client things get trickier; I'd
>>> have to look at what logging is available without the debugfs stuff
>>> being enabled. :/
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Tue, Apr 23, 2013 at 3:00 PM, Bryan Stillwell
>>>  wrote:
>>>> I've tried a few different ones:
>>>>
>>>> 1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
>>>> 2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
>>>> 3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
>>>>
>>>> It's fairly reproducible, so I can collect logs for you.  Which ones
>>>> would you be intereste

Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil  wrote:
>
> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
> > I'm testing this now, but while going through the logs I saw something
> > that might have something to do with this:
> >
> > Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
> > epoch 22146 off 102 (88021e0dc802 of
> > 88021e0dc79c-88021e0dc802)
>
> Oh, that's not right...  What kernel version is this?  Which ceph version?

$ uname -a
Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
$ ceph -v
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:45 PM, Sage Weil  wrote:
> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>> On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil  wrote:
>> >
>> > On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>> > > I'm testing this now, but while going through the logs I saw something
>> > > that might have something to do with this:
>> > >
>> > > Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
>> > > epoch 22146 off 102 (88021e0dc802 of
>> > > 88021e0dc79c-88021e0dc802)
>> >
>> > Oh, that's not right...  What kernel version is this?  Which ceph version?
>>
>> $ uname -a
>> Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
>> x86_64 x86_64 x86_64 GNU/Linux
>
> Oh, that's a sufficiently old kernel that we don't support.  3.4 or later
> is considered stable.  You should be able to get recent mainline kernels
> from an ubuntu ppa...

It looks like Canonical released a 3.5.0 kernel as a security update
to precise that I'll give a try.
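
(If that doesn't pan out, the LTS backport kernel packages should also get a
3.5 kernel onto precise; package names are from memory, so double-check them:

$ sudo apt-get install linux-image-generic-lts-quantal linux-headers-generic-lts-quantal
$ sudo reboot
)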

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:54 PM, Gregory Farnum  wrote:
> On Tue, Apr 23, 2013 at 4:45 PM, Sage Weil  wrote:
>> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>>> On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil  wrote:
>>> >
>>> > On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>>> > > I'm testing this now, but while going through the logs I saw something
>>> > > that might have something to do with this:
>>> > >
>>> > > Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
>>> > > epoch 22146 off 102 (88021e0dc802 of
>>> > > 88021e0dc79c-88021e0dc802)
>>> >
>>> > Oh, that's not right...  What kernel version is this?  Which ceph version?
>>>
>>> $ uname -a
>>> Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
>>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> Oh, that's a sufficiently old kernel that we don't support.  3.4 or later
>> is considered stable.  You should be able to get recent mainline kernels
>> from an ubuntu ppa...
>
> By which he means "that could have caused the trouble and there are
> some osdmap decoding problems which are fixed in later kernels". :)
> I'd forgotten about these problems, although fortunately they're not
> consistent. But especially for CephFS you'll want to stick with
> userspace rather than kernelspace for a while if you aren't in the
> habit of staying very up-to-date.

Thanks, that's good to know.  :)

The first copy test using fuse finished and the MD5s match up!  I'm
going to do some more testing overnight, but this seems to be the
cause.

Thanks for the help!

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy documentation fixes

2013-05-07 Thread Bryan Stillwell
With the release of cuttlefish, I decided to try out ceph-deploy and
ran into some documentation errors along the way:


http://ceph.com/docs/master/rados/deployment/preflight-checklist/

Under 'CREATE A USER' it has the following line:

To provide full privileges to the user, add the following to
/etc/sudoers.d/chef.

Based on the command that followed, chef should be replaced with ceph.


http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/

Under 'ZAP DISKS' it has an 'Important' message that states:

Important: This will delete all data in the partition.

If I understand it correctly, this should be changed to:

Important: This will delete all data on the disk.


Under 'PREPARE OSDS' it first gives an example to prepare a disk:

ceph-deploy osd prepare {host-name}:{path/to/disk}[:{path/to/journal}]

And then it gives an example that attempts to prepare a partition:

ceph-deploy osd prepare osdserver1:/dev/sdb1:/dev/ssd1


The same issue exists for 'ACTIVATE OSDS' and 'CREATE OSDS'.
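
Presumably the intent was for the examples to match the disk syntax shown
above, something along the lines of:

ceph-deploy osd prepare osdserver1:/dev/sdb:/dev/ssd

(i.e. a whole disk for data plus a journal device), or else for the syntax
line to mention partitions as well.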


Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mon problems after upgrading to cuttlefish

2013-05-22 Thread Bryan Stillwell
I attempted to upgrade my bobtail cluster to cuttlefish tonight and I
believe I'm running into some mon related issues.  I did the original
install manually instead of with mkcephfs or ceph-deploy, so I think
that might have to do with this error:

root@a1:~# ceph-mon -d -c /etc/ceph/ceph.conf
2013-05-22 23:37:29.283975 7f8fb97b3780  0 ceph version 0.61.2
(fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 5531
IO error: /var/lib/ceph/mon/ceph-admin/store.db/LOCK: No such file or directory
2013-05-22 23:37:29.286534 7f8fb97b3780  1 unable to open monitor
store at /var/lib/ceph/mon/ceph-admin
2013-05-22 23:37:29.286544 7f8fb97b3780  1 check for old monitor store format
2013-05-22 23:37:29.286550 7f8fb97b3780  1
store(/var/lib/ceph/mon/ceph-admin) mount
2013-05-22 23:37:29.286559 7f8fb97b3780  1
store(/var/lib/ceph/mon/ceph-admin) basedir
/var/lib/ceph/mon/ceph-admin dne
2013-05-22 23:37:29.286564 7f8fb97b3780 -1 unable to mount monitor
store: (2) No such file or directory
2013-05-22 23:37:29.286577 7f8fb97b3780 -1 found errors while
attempting to convert the monitor store: (2) No such file or directory
root@a1:~# ls -l /var/lib/ceph/mon/
total 4
drwxr-xr-x 15 root root 4096 May 22 23:30 ceph-a


I only have one mon daemon in this cluster as well.  I was planning on
upgrading it to 3 tonight but when I try to run most commands they
just hang now.

I do see the store.db directory in the ceph-a directory if that helps:

root@a1:~# ls -l  /var/lib/ceph/mon/ceph-a/
total 868
drwxr-xr-x 2 root root   4096 May 22 23:30 auth
drwxr-xr-x 2 root root   4096 May 22 23:30 auth_gv
-rw--- 1 root root 37 Feb  4 14:22 cluster_uuid
-rw--- 1 root root  2 May 22 23:30 election_epoch
-rw--- 1 root root120 Feb  4 14:22 feature_set
-rw--- 1 root root  2 Dec 28 11:35 joined
-rw--- 1 root root 77 May 22 22:30 keyring
-rw--- 1 root root  0 Dec 28 11:35 lock
drwxr-xr-x 2 root root  20480 May 22 23:30 logm
drwxr-xr-x 2 root root  20480 May 22 23:30 logm_gv
-rw--- 1 root root 21 Dec 28 11:35 magic
drwxr-xr-x 2 root root  12288 May 22 23:30 mdsmap
drwxr-xr-x 2 root root  12288 May 22 23:30 mdsmap_gv
drwxr-xr-x 2 root root   4096 Dec 28 11:35 monmap
drwxr-xr-x 2 root root 233472 May 22 23:30 osdmap
drwxr-xr-x 2 root root 237568 May 22 23:30 osdmap_full
drwxr-xr-x 2 root root 253952 May 22 23:30 osdmap_gv
drwxr-xr-x 2 root root  20480 May 22 23:30 pgmap
drwxr-xr-x 2 root root  20480 May 22 23:30 pgmap_gv
drwxr-xr-x 2 root root   4096 May 22 23:36 store.db


Any help would be appreciated.

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon problems after upgrading to cuttlefish

2013-05-23 Thread Bryan Stillwell
On Thu, May 23, 2013 at 9:58 AM, Smart Weblications GmbH - Florian
Wiessner  wrote:
> you may need to update your [mon.a] section in your ceph.conf like this:
>
>
> [mon.a]
>mon data = /var/lib/ceph/mon/ceph-a/

That didn't seem to make a difference, it kept trying to use ceph-admin.

I tried adding this as well:

[mon]
mon data = /var/lib/ceph/mon/ceph-a


But it didn't work either:

# ceph-mon -d -c /etc/ceph/ceph.conf
2013-05-23 22:33:14.371601 7fb64f8cd780  0 ceph version 0.61.2
(fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid
31183
IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource
temporarily unavailable
2013-05-23 22:33:14.374219 7fb64f8cd780  1 unable to open monitor
store at /var/lib/ceph/mon/ceph-a
2013-05-23 22:33:14.374229 7fb64f8cd780  1 check for old monitor store format
2013-05-23 22:33:14.374239 7fb64f8cd780  1 store(/var/lib/ceph/mon/ceph-a) mount
2013-05-23 22:33:14.374300 7fb64f8cd780  1 found old GV monitor store
format -- should convert!
IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource
temporarily unavailable
mon/Monitor.cc: In function 'int Monitor::StoreConverter::convert()'
thread 7fb64f8cd780 time 2013-05-23 22:33:14.374451
mon/Monitor.cc: 4293: FAILED assert(!db->create_and_open(std::cerr))
 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
 1: (Monitor::StoreConverter::convert()+0x467) [0x4bf257]
 2: (main()+0x44e) [0x48073e]
 3: (__libc_start_main()+0xed) [0x7fb64db2a76d]
 4: ceph-mon() [0x48417d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
2013-05-23 22:33:14.375019 7fb64f8cd780 -1 mon/Monitor.cc: In function
'int Monitor::StoreConverter::convert()' thread 7fb64f8cd780 time
2013-05-23 22:33:14.374451
mon/Monitor.cc: 4293: FAILED assert(!db->create_and_open(std::cerr))

 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
 1: (Monitor::StoreConverter::convert()+0x467) [0x4bf257]
 2: (main()+0x44e) [0x48073e]
 3: (__libc_start_main()+0xed) [0x7fb64db2a76d]
 4: ceph-mon() [0x48417d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- begin dump of recent events ---
   -23> 2013-05-23 22:33:14.369804 7fb64f8cd780  5 asok(0x172e000)
register_command perfcounters_dump hook 0x1722010
   -22> 2013-05-23 22:33:14.369831 7fb64f8cd780  5 asok(0x172e000)
register_command 1 hook 0x1722010
   -21> 2013-05-23 22:33:14.369840 7fb64f8cd780  5 asok(0x172e000)
register_command perf dump hook 0x1722010
   -20> 2013-05-23 22:33:14.369848 7fb64f8cd780  5 asok(0x172e000)
register_command perfcounters_schema hook 0x1722010
   -19> 2013-05-23 22:33:14.369855 7fb64f8cd780  5 asok(0x172e000)
register_command 2 hook 0x1722010
   -18> 2013-05-23 22:33:14.369857 7fb64f8cd780  5 asok(0x172e000)
register_command perf schema hook 0x1722010
   -17> 2013-05-23 22:33:14.369861 7fb64f8cd780  5 asok(0x172e000)
register_command config show hook 0x1722010
   -16> 2013-05-23 22:33:14.369867 7fb64f8cd780  5 asok(0x172e000)
register_command config set hook 0x1722010
   -15> 2013-05-23 22:33:14.369869 7fb64f8cd780  5 asok(0x172e000)
register_command log flush hook 0x1722010
   -14> 2013-05-23 22:33:14.369873 7fb64f8cd780  5 asok(0x172e000)
register_command log dump hook 0x1722010
   -13> 2013-05-23 22:33:14.369880 7fb64f8cd780  5 asok(0x172e000)
register_command log reopen hook 0x1722010
   -12> 2013-05-23 22:33:14.371601 7fb64f8cd780  0 ceph version 0.61.2
(fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid
31183
   -11> 2013-05-23 22:33:14.373837 7fb64f8cd780  5 asok(0x172e000)
init /var/run/ceph/ceph-mon.admin.asok
   -10> 2013-05-23 22:33:14.373859 7fb64f8cd780  5 asok(0x172e000)
bind_and_listen /var/run/ceph/ceph-mon.admin.asok
-9> 2013-05-23 22:33:14.373910 7fb64f8cd780  5 asok(0x172e000)
register_command 0 hook 0x17210b0
-8> 2013-05-23 22:33:14.373917 7fb64f8cd780  5 asok(0x172e000)
register_command version hook 0x17210b0
-7> 2013-05-23 22:33:14.373922 7fb64f8cd780  5 asok(0x172e000)
register_command git_version hook 0x17210b0
-6> 2013-05-23 22:33:14.373929 7fb64f8cd780  5 asok(0x172e000)
register_command help hook 0x17220d0
-5> 2013-05-23 22:33:14.373965 7fb64b970700  5 asok(0x172e000) entry start
-4> 2013-05-23 22:33:14.374219 7fb64f8cd780  1 unable to open
monitor store at /var/lib/ceph/mon/ceph-a
-3> 2013-05-23 22:33:14.374229 7fb64f8cd780  1 check for old
monitor store format
-2> 2013-05-23 22:33:14.374239 7fb64f8cd780  1
store(/var/lib/ceph/mon/ceph-a) mount
-1> 2013-05-23 22:33:14.374300 7fb64f8cd780  1 found old GV
monitor store format -- should convert!
 0> 2013-05-23 22:33:14.375019 7fb64f8cd780 -1 mon/Monitor.cc: In
function 'int Monitor::StoreConverter::convert()' thread 7fb64f8cd780
time 2013-05-23 22:33:14.374451
mon/Monitor.cc: 4293: FAILED assert(!db->create_and_open(std::cerr))

 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
 1: (Monitor::StoreConverter::convert()+0x467) [0x

[ceph-users] Failure increasing mons from 1 to 3

2013-05-25 Thread Bryan Stillwell
Shortly after upgrading from bobtail to cuttlefish I tried increasing
the number of monitors in my small test cluster from 1 to 3, but I
believe I messed something up in the process.  At first I thought the
conversion to leveldb failed, but after digging into it a bit I
believe this explains it:

# ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
{ "name": "a",
  "rank": 0,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [
"a"],
  "extra_probe_peers": [],
  "monmap": { "epoch": 2,
  "fsid": "6917a033-3439-4fd7-88db-46c9e1df0d6e",
  "modified": "2013-05-22 21:45:39.239582",
  "created": "2012-12-28 11:35:06.671375",
  "mons": [
{ "rank": 0,
  "name": "a",
  "addr": "172.24.88.50:6789\/0"},
{ "rank": 1,
  "name": "mon.b",
  "addr": "172.24.88.53:6789\/0"}]}}


At this point I would like to remove the one named "mon.b" since it is
named wrong (should be just "b"), and I would like to get back to a
working state before attempting to expand to 3 monitors again.

I've tried using monmaptool and ceph-mon to inject it, but that hasn't
worked for me yet:

# monmaptool --print monmap.dat
monmaptool: monmap file monmap.dat
epoch 1
fsid 6917a033-3439-4fd7-88db-46c9e1df0d6e
last_changed 2012-12-28 11:35:06.671375
created 2012-12-28 11:35:06.671375
0: 172.24.88.50:6789/0 mon.a
# ceph-mon -i a --inject-monmap monmap.dat
[3506]: (33) Numerical argument out of domain


Is there something I'm missing?
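
My guess is the inject fails because monmap.dat is an older epoch (1) than the
map the monitor already has (2).  What I think should work instead is editing
a freshly extracted copy with the monitor stopped (assuming --extract-monmap
is available in this build):

# stop ceph-mon id=a
# ceph-mon -i a --extract-monmap /tmp/monmap
# monmaptool --rm mon.b /tmp/monmap
# ceph-mon -i a --inject-monmap /tmp/monmap
# start ceph-mon id=a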

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon problems after upgrading to cuttlefish

2013-05-28 Thread Bryan Stillwell
Joao,

I made sure there wasn't any other ceph-mon running, and now when I run it
I see this:

# ceph-mon -d -c /etc/ceph/ceph.conf
2013-05-28 13:46:47.060426 7fbc68a80780  0 ceph version 0.61.2
(fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 9640
2013-05-28 13:46:47.196659 7fbc68a80780  0 mon.admin does not exist in
monmap, will attempt to join an existing cluster
2013-05-28 13:46:47.196813 7fbc68a80780 -1 no public_addr or public_network
specified, and mon.admin not present in monmap or ceph.conf
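
(The mon.admin name above appears to come from running ceph-mon without -i,
which defaults the id to "admin"; that would also explain the
/var/lib/ceph/mon/ceph-admin path from my earlier attempt.  Passing the id
explicitly should point it at the right store:

# ceph-mon -i a -d -c /etc/ceph/ceph.conf
)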

BTW, I started another thread that hasn't received a response yet that
might have something to do with it.  I'm now thinking it might have to do
more with me attempting to add more mon servers than with the upgrade:

# start ceph-mon id=a
ceph-mon (ceph/a) start/running, process 9695
root@a1:/etc/ceph# ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status
{ "name": "a",
  "rank": 0,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [
"a"],
  "extra_probe_peers": [],
  "monmap": { "epoch": 2,
  "fsid": "6917a033-3439-4fd7-88db-46c9e1df0d6e",
  "modified": "2013-05-22 21:45:39.239582",
  "created": "2012-12-28 11:35:06.671375",
  "mons": [
{ "rank": 0,
  "name": "a",
  "addr": "172.24.88.50:6789\/0"},
{ "rank": 1,
  "name": "mon.b",
  "addr": "172.24.88.53:6789\/0"}]}}

Any ideas how to get rid of mon.b?

Thanks,
Bryan


On Mon, May 27, 2013 at 10:23 AM, Joao Eduardo Luis
wrote:

> On 05/24/2013 05:37 AM, Bryan Stillwell wrote:
>
>> On Thu, May 23, 2013 at 9:58 AM, Smart Weblications GmbH - Florian
>> Wiessner 
>> >
>> wrote:
>>
>>> you may need to update your [mon.a] section in your ceph.conf like this:
>>>
>>>
>>> [mon.a]
>>> mon data = /var/lib/ceph/mon/ceph-a/
>>>
>>
>> That didn't seem to make a difference, it kept trying to use ceph-admin.
>>
>> I tried adding this as well:
>>
>> [mon]
>>  mon data = /var/lib/ceph/mon/ceph-a
>>
>>
>> But it didn't work either:
>>
>> # ceph-mon -d -c /etc/ceph/ceph.conf
>> 2013-05-23 22:33:14.371601 7fb64f8cd780  0 ceph version 0.61.2
>> (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid
>> 31183
>> IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource
>> temporarily unavailable
>> 2013-05-23 22:33:14.374219 7fb64f8cd780  1 unable to open monitor
>> store at /var/lib/ceph/mon/ceph-a
>> 2013-05-23 22:33:14.374229 7fb64f8cd780  1 check for old monitor store
>> format
>> 2013-05-23 22:33:14.374239 7fb64f8cd780  1 store(/var/lib/ceph/mon/ceph-a) mount
>> 2013-05-23 22:33:14.374300 7fb64f8cd780  1 found old GV monitor store
>> format -- should convert!
>> IO error: lock /var/lib/ceph/mon/ceph-a/store.db/LOCK: Resource
>> temporarily unavailable
>>
>
> This looks as if you have another monitor running.  However, from the
> looks of it, given the whole email exchange, it shouldn't be happening.
>
> Nevertheless, please check if you have some other ceph-mon running.  If
> that is not the case, then 'rm /var/lib/ceph/mon/ceph-a/store.db/LOCK'.
>
> If after that the monitor complains about having an unfinished conversion,
> please 'mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.old'
> and re-run ceph-mon.
>
>   -Joao
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
*Bryan Stillwell*
SENIOR SYSTEM ADMINISTRATOR

E: bstillw...@photobucket.com
O: 303.228.5109
M: 970.310.6085

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Moving an MDS

2013-06-11 Thread Bryan Stillwell
I have a cluster I originally built on argonaut and have since
upgraded it to bobtail and then cuttlefish.  I originally configured
it with one node for both the mds node and mon node, and 4 other nodes
for hosting osd's:

a1: mon.a/mds.a
b1: osd.0, osd.1, osd.2, osd.3, osd.4, osd.20
b2: osd.5, osd.6, osd.7, osd.8, osd.9, osd.21
b3: osd.10, osd.11, osd.12, osd.13, osd.14, osd.22
b4: osd.15, osd.16, osd.17, osd.18, osd.19, osd.23

Yesterday I added two more mon nodes and moved mon.a off of a1 so it
now looks like:

a1: mds.a
b1: osd.0, osd.1, osd.2, osd.3, osd.4, osd.20
b2: mon.a, osd.5, osd.6, osd.7, osd.8, osd.9, osd.21
b3: mon.b, osd.10, osd.11, osd.12, osd.13, osd.14, osd.22
b4: mon.c, osd.15, osd.16, osd.17, osd.18, osd.19, osd.23

What I would like to do is move mds.a to server b1 so I can power-off
a1 and bring up b5 with another 6 osd's (power in my basement is at a
premium), but I'm not finding much in the way of documentation on how
to do that.  I found some docs on doing it with ceph-deploy, but since
I built this a while ago I haven't been using ceph-deploy (and I
haven't had a great experience using it for building a new cluster
either).

Could someone point me at some docs on how to do this?  Also should I
be running with multiple mds nodes at this time?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving an MDS

2013-06-11 Thread Bryan Stillwell
On Tue, Jun 11, 2013 at 3:50 PM, Gregory Farnum  wrote:
> You should not run more than one active MDS (less stable than a
> single-MDS configuration, bla bla bla), but you can run multiple
> daemons and let the extras serve as a backup in case of failure. The
> process for moving an MDS is pretty easy: turn on a daemon somewhere
> else, confirm it's connected to the cluster, then turn off the old
> one.
> Doing it that way will induce ~30 seconds of MDS unavailability while
> it times out, but on cuttlefish you should be able to force an instant
> takeover if the new daemon uses the same name as the old one (I
> haven't worked with this much myself so I might be missing a detail;
> if this is important you should check).
>
> (These relatively simple takeovers are thanks to the MDS only storing
> data in RADOS, and are one of the big design considerations in the
> system architecture).

Thanks Greg!

That sounds pretty easy.  Although it has me wondering what config
option differentiates between an active MDS and a backup MDS daemon?
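
From what I've read there isn't a per-daemon flag; max_mds controls how many
daemons go active and any extra daemons become standbys (with "mds standby for
name" available if you want a standby to follow a specific named daemon).  A
rough sketch of bringing a second daemon up on b1, assuming default paths:

# mkdir -p /var/lib/ceph/mds/ceph-b1
# ceph auth get-or-create mds.b1 mds 'allow' osd 'allow rwx' mon 'allow rwx' \
    -o /var/lib/ceph/mds/ceph-b1/keyring
# ceph-mds -i b1 -c /etc/ceph/ceph.conf
# ceph mds stat

The caps above are a guess; mirroring whatever 'ceph auth get mds.a' shows for
the existing daemon is probably safer.  Once 'ceph mds stat' shows the new
daemon, the old one on a1 can be stopped.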

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Bryan Stillwell
I'm in the process of cleaning up a test that an internal customer did on our 
production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
--bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate the 
objects are being removed.  From what I can tell a large number of the objects 
are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be set 
to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, or 
reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?
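
In case the answer is just "try it and see", my plan is to drop something like
this into the rgw section of ceph.conf (section name and values here are only
illustrative) and restart radosgw:

[client.radosgw.gateway]
    rgw gc max objs = 1024
    rgw gc obj min wait = 300
    rgw gc processor max time = 60
    rgw gc processor period = 60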

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Bryan Stillwell
Wouldn't doing it that way cause problems since references to the objects 
wouldn't be getting removed from .rgw.buckets.index?

Bryan

From: Roger Brown 
Date: Monday, July 24, 2017 at 2:43 PM
To: Bryan Stillwell , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Speeding up garbage collection in RGW

I hope someone else can answer your question better, but in my case I found 
something like this helpful to delete objects faster than I could through the 
gateway: 

rados -p default.rgw.buckets.data ls | grep 'replace this with pattern matching 
files you want to delete' | xargs -d '\n' -n 200 rados -p 
default.rgw.buckets.data rm


On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell  wrote:
I'm in the process of cleaning up a test that an internal customer did on our 
production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
--bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate the 
objects are being removed.  From what I can tell a large number of the objects 
are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be set 
to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, or 
reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Bryan Stillwell
Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi 
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well without marking 
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
mailto:ceph-users-boun...@lists.ceph.com> on 
behalf of bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Bryan Stillwell
Excellent, thank you!  It does exist in 0.94.10!  :)
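
For the archives, the combined command ends up looking like:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm --bucket={} --purge-objects --bypass-gc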

Bryan

From: Pavan Rallabhandi 
Date: Tuesday, July 25, 2017 at 11:21 AM
To: Bryan Stillwell , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Speeding up garbage collection in RGW

I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.

From: Bryan Stillwell 
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi , 
"ceph-users@lists.ceph.com" 
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW

Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi 
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well without marking 
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
mailto:ceph-users-boun...@lists.ceph.com> on 
behalf of bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] expanding cluster with minimal impact

2017-08-07 Thread Bryan Stillwell
Dan,

We recently went through an expansion of an RGW cluster and found that we 
needed 'norebalance' set whenever making CRUSH weight changes to avoid slow 
requests.  We were also increasing the CRUSH weight by 1.0 each time which 
seemed to reduce the extra data movement we were seeing with smaller weight 
increases.  Maybe something to try out next time?
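
The rough sequence looked something like this (osd id and weight below are
just illustrative):

ceph osd set norebalance
ceph osd crush reweight osd.120 1.0    # step each new OSD up by ~1.0 at a time
ceph osd unset norebalance             # once peering settles, let the backfill run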

Bryan

From: ceph-users  on behalf of Dan van der 
Ster 
Date: Friday, August 4, 2017 at 1:59 AM
To: Laszlo Budai 
Cc: ceph-users 
Subject: Re: [ceph-users] expanding cluster with minimal impact

Hi Laszlo,

The script defaults are what we used to do a large intervention (the
default delta weight is 0.01). For our clusters going any faster
becomes disruptive, but this really depends on your cluster size and
activity.

BTW, in case it wasn't clear, to use this script for adding capacity
you need to create the new OSDs to your cluster with initial crush
weight = 0.0

osd crush initial weight = 0
osd crush update on start = true

-- Dan



On Thu, Aug 3, 2017 at 8:12 PM, Laszlo Budai  wrote:
Dear all,

I need to expand a ceph cluster with minimal impact. Reading previous
threads on this topic from the list I've found the ceph-gentle-reweight
script
(https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight)
created by Dan van der Ster (Thank you Dan for sharing the script with us!).

I've done some experiments, and it looks promising, but it is needed to
properly set the parameters. Did any of you tested this script before? what
is the recommended delta_weight to be used? From the default parameters of
the script I can see that the default delta weight is .5% of the target
weight that means 200 reweighting cycles. I have experimented with a
reweight ratio of 5% while running a fio test on a client. The results were
OK (I mean no slow requests), but my  test cluster was a very small one.

If any of you has done some larger experiments with this script I would be
really interested to read about your results.

Thank you!
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Client features by IP?

2017-09-06 Thread Bryan Stillwell
I was reading this post by Josh Durgin today and was pretty happy to see we can 
get a summary of features that clients are using with the 'ceph features' 
command:

http://ceph.com/community/new-luminous-upgrade-complete/

However, I haven't found an option to display the IP address of those clients 
with the older feature sets.  Is there a flag I can pass to 'ceph features' to 
list the IPs associated with each feature set?

Thanks,
Bryan 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client features by IP?

2017-09-07 Thread Bryan Stillwell
On 09/07/2017 10:47 AM, Josh Durgin wrote:
> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
> > I was reading this post by Josh Durgin today and was pretty happy to
> > see we can get a summary of features that clients are using with the
> > 'ceph features' command:
> >
> > http://ceph.com/community/new-luminous-upgrade-complete/
> >
> > However, I haven't found an option to display the IP address of
> > those clients with the older feature sets.  Is there a flag I can
> > pass to 'ceph features' to list the IPs associated with each feature
> > set?
>
> There is not currently, we should add that - it'll be easy to backport
> to luminous too. The only place both features and IP are shown is in
> 'debug mon = 10' logs right now.

I think that would be great!  The first thing I would want to do after
seeing an old client listed would be to find it and upgrade it.  Having
the IP of the client would make that a ton easier!
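
In the meantime I'm assuming the workaround is to bump the mon debug level
briefly and grep the logs for those entries, something like:

ceph tell mon.* injectargs '--debug-mon 10'
# check /var/log/ceph/ceph-mon.*.log, then put it back to the default:
ceph tell mon.* injectargs '--debug-mon 1/5'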

Anything I could do to help make that happen?  File a feature request
maybe?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw crashing after buffer overflows detected

2017-09-08 Thread Bryan Stillwell
For about a week we've been seeing a decent number of buffer overflows
detected across all our RGW nodes in one of our clusters.  This started
happening a day after we started weighing in some new OSD nodes, so
we're thinking it's probably related to that.  Could someone help us
determine the root cause of this?

Cluster details:
  Distro: CentOS 7.2
  Release: 0.94.10-0.el7.x86_64
  OSDs: 1120
  RGW nodes: 10

See log messages below.  If you know how to improve the call trace
below I would like to hear that too.  I tried installing the
ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to
help.
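
My assumption was that with the debuginfo in place I could at least resolve
the raw addresses from the trace by hand, along the lines of:

gdb /bin/radosgw
(gdb) info symbol 0x6d3d92
(gdb) list *0x6d3d92

so pointers on whether that's even the right approach are welcome too.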

Thanks,
Bryan


# From /var/log/messages:

Sep  7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated


# From /var/log/ceph/client.radosgw.p3cephrgw003.log:

 0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) 
**
 in thread 7f7b296a2700

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: /bin/radosgw() [0x6d3d92]
 2: (()+0xf100) [0x7f7f425e9100]
 3: (gsignal()+0x37) [0x7f7f4141d5f7]
 4: (abort()+0x148) [0x7f7f4141ece8]
 5: (()+0x75317) [0x7f7f4145d317]
 6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]
 7: (()+0x10bc80) [0x7f7f414f3c80]
 8: (()+0x10da37) [0x7f7f414f5a37]
 9: (OS_Accept()+0xc1) [0x7f7f435bd8b1]
 10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c]
 11: (RGWFCGXProcess::run()+0x7bf) [0x58136f]
 12: (RGWProcessControlThread::entry()+0xe) [0x5821fe]
 13: (()+0x7dc5) [0x7f7f425e1dc5]
 14: (clone()+0x6d) [0x7f7f414de21d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client features by IP?

2017-09-08 Thread Bryan Stillwell
On 09/07/2017 01:26 PM, Josh Durgin wrote:
> On 09/07/2017 11:31 AM, Bryan Stillwell wrote:
>> On 09/07/2017 10:47 AM, Josh Durgin wrote:
>>> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
>>>> I was reading this post by Josh Durgin today and was pretty happy to
>>>> see we can get a summary of features that clients are using with the
>>>> 'ceph features' command:
>>>>
>>>> http://ceph.com/community/new-luminous-upgrade-complete/
>>>>
>>>> However, I haven't found an option to display the IP address of
>>>> those clients with the older feature sets.  Is there a flag I can
>>>> pass to 'ceph features' to list the IPs associated with each feature
>>>> set?
>>>
>>> There is not currently, we should add that - it'll be easy to backport
>>> to luminous too. The only place both features and IP are shown is in
>>> 'debug mon = 10' logs right now.
>>
>> I think that would be great!  The first thing I would want to do after
>> seeing an old client listed would be to find it and upgrade it.  Having
>> the IP of the client would make that a ton easier!
>
> Yup, should've included that in the first place!
>
>> Anything I could do to help make that happen?  File a feature request
>> maybe?
>
> Sure, adding a short tracker.ceph.com ticket would help, that way we can
> track the backport easily too.

Ticket created:

http://tracker.ceph.com/issues/21315

Thanks Josh!

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw crashing after buffer overflows detected

2017-09-11 Thread Bryan Stillwell
I found a couple OSDs that were seeing medium errors and marked them out
of the cluster.  Once all the PGs were moved off those OSDs all the
buffer overflows went away.

So there must be some kind of bug that's being triggered when an OSD is
misbehaving.
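
For anyone hitting the same thing, the fix was simply marking the suspect OSDs
out and letting the PGs drain off them (the ids below are made up):

ceph osd out 57
ceph osd out 112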

Bryan

From: ceph-users  on behalf of Bryan 
Stillwell 
Date: Friday, September 8, 2017 at 9:26 AM
To: ceph-users 
Subject: [ceph-users] radosgw crashing after buffer overflows detected

For about a week we've been seeing a decent number of buffer overflows
detected across all our RGW nodes in one of our clusters.  This started
happening a day after we started weighing in some new OSD nodes, so
we're thinking it's probably related to that.  Could someone help us
determine the root cause of this?

Cluster details:
  Distro: CentOS 7.2
  Release: 0.94.10-0.el7.x86_64
  OSDs: 1120
  RGW nodes: 10

See log messages below.  If you know how to improve the call trace
below I would like to hear that too.  I tried installing the
ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to
help.

Thanks,
Bryan


# From /var/log/messages:

Sep  7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated


# From /var/log/ceph/client.radosgw.p3cephrgw003.log:

 0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) 
**
in thread 7f7b296a2700

ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: /bin/radosgw() [0x6d3d92]
2: (()+0xf100) [0x7f7f425e9100]
3: (gsignal()+0x37) [0x7f7f4141d5f7]
4: (abort()+0x148) [0x7f7f4141ece8]
5: (()+0x75317) [0x7f7f4145d317]
6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]
7: (()+0x10bc80) [0x7f7f414f3c80]
8: (()+0x10da37) [0x7f7f414f5a37]
9: (OS_Accept()+0xc1) [0x7f7f435bd8b1]
10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c]
11: (RGWFCGXProcess::run()+0x7bf) [0x58136f]
12: (RGWProcessControlThread::entry()+0xe) [0x5821fe]
13: (()+0x7dc5) [0x7f7f425e1dc5]
14: (clone()+0x6d) [0x7f7f414de21d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Bryan Stillwell
There are a few references to the rgw-gc settings in the config, but nothing that
explained the times well enough for me to feel comfortable doing anything with
them.

On Tue, Jul 25, 2017 at 4:01 PM Bryan Stillwell  wrote:
Excellent, thank you!  It does exist in 0.94.10!  :)
 
Bryan
 
From: Pavan Rallabhandi 
Date: Tuesday, July 25, 2017 at 11:21 AM

To: Bryan Stillwell , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Speeding up garbage collection in RGW
 
I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.
 
From: Bryan Stillwell 
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi , 
"ceph-users@lists.ceph.com" 
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW
 
Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.
 
I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.
 
Thanks,
Bryan
 
From: Pavan Rallabhandi 
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Speeding up garbage collection in RGW
 
If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tails objects as well without marking 
them to be GCed.
 
Thanks,
 
On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
 wrote:
 
I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Bryan Stillwell
That helps a little bit, but overall the process would take years at this rate:

# for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"' 
|grep objects; sleep 60; done
"objects": 1660775838
"objects": 1660775733
"objects": 1660775548
"objects": 1660774825
"objects": 1660774790
"objects": 1660774735

This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up 
this process at all?

Bryan

From: Yehuda Sadeh-Weinraub 
Date: Wednesday, October 25, 2017 at 11:32 AM
To: Bryan Stillwell 
Cc: David Turner , Ben Hines , 
"ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Some of the options there won't do much for you as they'll only affect
newer object removals. I think the default number of gc objects is
just inadequate for your needs. You can try manually running
'radosgw-admin gc process' concurrently (for the start 2 or 3
processes), see if it makes any dent there. I think one of the problem
is that the gc omaps grew so much that operations on them are too
slow.

Yehuda

On Wed, Oct 25, 2017 at 9:05 AM, Bryan Stillwell 
<bstillw...@godaddy.com> wrote:
We tried various options like the one's Ben mentioned to speed up the garbage 
collection process and were unsuccessful.  Luckily, we had the ability to 
create a new cluster and move all the data that wasn't part of the POC which 
created our problem.

One of the things we ran into was the .rgw.gc pool became too large to handle 
drive failures without taking down the cluster.  We eventually had to move that 
pool to SSDs just to get the cluster healthy.  It was not obvious it was 
getting large though, because this is what it looked like in the 'ceph df' 
output:

 NAME      ID  USED  %USED  MAX AVAIL  OBJECTS
 .rgw.gc   17  0     0      235G       2647

However, if you look at the SSDs we used (repurposed journal SSDs to get out of 
the disaster) in 'ceph osd df' you can see quite a bit of data is being used:

410 0.2  1.0  181G 23090M   158G 12.44 0.18
411 0.2  1.0  181G 29105M   152G 15.68 0.22
412 0.2  1.0  181G   110G 72223M 61.08 0.86
413 0.2  1.0  181G 42964M   139G 23.15 0.33
414 0.2  1.0  181G 33530M   148G 18.07 0.26
415 0.2  1.0  181G 38420M   143G 20.70 0.29
416 0.2  1.0  181G 92215M 93355M 49.69 0.70
417 0.2  1.0  181G 64730M   118G 34.88 0.49
418 0.2  1.0  181G 61353M   121G 33.06 0.47
419 0.2  1.0  181G 77168M   105G 41.58 0.59

That's ~560G of omap data for the .rgw.gc pool that isn't being reported in 
'ceph df'.

Right now the cluster is still around while we wait to verify the new cluster 
isn't missing anything.  So if there is anything the RGW developers would like 
to try on it to speed up the gc process, we should be able to do that.

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner <drakonst...@gmail.com>
Date: Tuesday, October 24, 2017 at 4:07 PM
To: Ben Hines <bhi...@gmail.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Thank you so much for chiming in, Ben.

Can you explain what each setting value means? I believe I understand min wait, 
that's just how long to wait before allowing the object to be cleaned up.  gc 
max objs is how many will be cleaned up during each period?  gc processor 
period is how often it will kick off gc to clean things up?  And gc processor 
max time is the longest the process can run after the period starts?  Is that 
about right for that?  I read somewhere saying that prime numbers are optimal 
for gc max objs.  Do you know why that is?  I notice you're using one there.  
What is lc max objs?  I couldn't find a reference for that setting.

Additionally, do you know if the radosgw-admin gc list is ever cleaned up, or 
is it an ever growing list?  I got up to 3.6 Billion objects in the list before 
I killed the gc list command.

On Tue, Oct 24, 2017 at 4:47 PM Ben Hines 
<bhi...@gmail.com> wrote:
I agree the settings are rather confusing. We also have many millions of 
objects and had this trouble, so i set these rather aggressive gc settings on 
our cluster which result in gc almost always running. We also use lifecycles to 
expire objects.

rgw lifecycle work time = 00:01-23:59
rgw gc max objs = 2647
rgw lc max objs = 2647
rgw gc obj min wait = 300
rgw gc processor period = 600
rgw gc processor max time = 600


-Ben

On Tue, Oct 24, 2017 at 9:25 AM, David Turner wrote:

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-27 Thread Bryan Stillwell
On Wed, Oct 25, 2017 at 4:02 PM, Yehuda Sadeh-Weinraub  
wrote:
>
> On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell  
> wrote:
> > That helps a little bit, but overall the process would take years at this
> > rate:
> >
> > # for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"' 
> > |grep objects; sleep 60; done
> >  "objects": 1660775838
> >  "objects": 1660775733
> >  "objects": 1660775548
> >  "objects": 1660774825
> >  "objects": 1660774790
> >  "objects": 1660774735
> >
> > This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up
> > this process at all?
>
> I'm not sure it's going to help much, although the omap performance
> might improve there. The big problem is that the omaps are just too
> big, so that every operation on them take considerable time. I think
> the best way forward there is to take a list of all the rados objects
> that need to be removed from the gc omaps, and then get rid of the gc
> objects themselves (newer ones will be created, this time using the
> new configurable). Then remove the objects manually (and concurrently)
> using the rados command line tool.
> The one problem I see here is that even just removal of objects with
> large omaps can affect the availability of the osds that hold these
> objects. I discussed that now with Josh, and we think the best way to
> deal with that is not to remove the gc objects immediatly, but to
> rename the gc pool, and create a new one (with appropriate number of
> pgs). This way new gc entries will now go into the new gc pool (with
> higher number of gc shards), and you don't need to remove the old gc
> objects (thus no osd availability problem). Then you can start
> trimming the old gc objects (on the old renamed pool) by using the
> rados command. It'll take a very very long time, but the process
> should pick up speed slowly, as the objects shrink.

That's fine for us.  We'll be tearing down this cluster in a few weeks
and adding the nodes to the new cluster we created.  I just wanted to
explore other options now that we can use it as a test cluster.
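
If I'm following the rename idea correctly, the mechanics would be roughly
this (just a sketch -- the new pg count is a guess on my part, and 'rgw gc
max objs' would get bumped before creating the new pool):

ceph osd pool rename .rgw.gc .rgw.gc.old          # set the bloated gc pool aside
ceph osd pool create .rgw.gc 64 64                # fresh gc pool; new gc shards land here
rados -p .rgw.gc.old listomapkeys gc.0 | wc -l    # then slowly work through the old shards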

The solution you described with renaming the .rgw.gc pool and creating a
new one is pretty interesting.  I'll have to give that a try, but until
then I've been trying to remove some of the other buckets with the
--bypass-gc option and it keeps dying with output like this:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:00:00.865993 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
2017-10-27 08:00:04.385875 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
2017-10-27 08:00:04.517241 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
2017-10-27 08:00:05.791876 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
2017-10-27 08:00:26.815081 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1090645 stripe_ofs=1090645 part_ofs=0 rule->part_size=0
2017-10-27 08:00:46.757556 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
2017-10-27 08:00:47.093813 7f2b387228c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


I can typically make further progress by running it again:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:20:57.310859 7fae9c3d48c0  0 RGWObjManifest::operator++(): 
result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
2017-10-27 08:20:57.406684 7fae9c3d48c0  0 RGWObjManifest::operator++(): 
result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
2017-10-27 08:20:57.808050 7fae9c3d48c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


and again:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:22:04.992578 7ff8071038c0  0 RGWObjManifest::operator++(): 
result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
2017-10-27 08:22:05.726485 7ff8071038c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


What does this error mean, and is there any way to keep it from dying
like this?  This cluster is running 0.94.10, but I can upgrade it to jewel
pretty easily if you would like.

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems removing buckets with --bypass-gc

2017-10-31 Thread Bryan Stillwell
As mentioned in another thread I'm trying to remove several thousand buckets on 
a hammer cluster (0.94.10), but I'm running into a problem using --bypass-gc.

I usually see either this error:

# radosgw-admin bucket rm --bucket=sg2pl598 --purge-objects --bypass-gc
2017-10-31 09:21:04.111599 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=4194304 stripe_ofs=4194304 part_ofs=0 rule->part_size=15728640
2017-10-31 09:21:04.121664 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=8388608 stripe_ofs=8388608 part_ofs=0 rule->part_size=15728640
2017-10-31 09:21:04.126356 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=12582912 stripe_ofs=12582912 part_ofs=0 rule->part_size=15728640
2017-10-31 09:21:04.130582 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=15728640 stripe_ofs=15728640 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.135791 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=19922944 stripe_ofs=19922944 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.140240 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=24117248 stripe_ofs=24117248 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.145792 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=28311552 stripe_ofs=28311552 part_ofs=15728640 
rule->part_size=15728640
2017-10-31 09:21:04.149964 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=31457280 stripe_ofs=31457280 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.165820 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=35651584 stripe_ofs=35651584 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.171099 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=39845888 stripe_ofs=39845888 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.176765 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=44040192 stripe_ofs=44040192 part_ofs=31457280 
rule->part_size=15728640
2017-10-31 09:21:04.183664 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=47185920 stripe_ofs=47185920 part_ofs=47185920 rule->part_size=83674
2017-10-31 09:21:04.188140 7f45f5d108c0  0 RGWObjManifest::operator++(): 
result: ofs=47269594 stripe_ofs=47269594 part_ofs=47269594 rule->part_size=83674
2017-10-31 09:21:05.034837 7f45f5d108c0 -1 ERROR: failed to get obj ref with 
ret=-22
2017-10-31 09:21:05.034846 7f45f5d108c0 -1 ERROR: delete obj aio failed with -22

or this error:

# radosgw-admin bucket rm --bucket=sg2pl593 --purge-objects --bypass-gc
2017-10-31 09:24:09.082063 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=4194304 stripe_ofs=4194304 part_ofs=0 rule->part_size=15728640
2017-10-31 09:24:09.090394 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=8388608 stripe_ofs=8388608 part_ofs=0 rule->part_size=15728640
2017-10-31 09:24:09.095172 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=12582912 stripe_ofs=12582912 part_ofs=0 rule->part_size=15728640
2017-10-31 09:24:09.099116 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=15728640 stripe_ofs=15728640 part_ofs=15728640 
rule->part_size=15728640
[...snip...]
2017-10-31 09:24:09.245171 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=110100480 stripe_ofs=110100480 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.251659 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=114294784 stripe_ofs=114294784 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.269739 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=118489088 stripe_ofs=118489088 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.273871 7fe7f4be68c0  0 RGWObjManifest::operator++(): 
result: ofs=122683392 stripe_ofs=122683392 part_ofs=110100480 
rule->part_size=15728640
2017-10-31 09:24:09.274968 7fe7f4be68c0 -1 ERROR: could not drain handles as 
aio completion returned with -2

Then successive runs continue failing at the same spot preventing further 
progress.  I can then run it without --bypass-gc for a few seconds followed by 
running it with --bypass-gc, but usually it fails again after a few minutes.

For example, here's another run on sg2pl593 after running it without 
--bypass-gc for a few seconds:

# radosgw-admin bucket rm --bucket=sg2pl593 --purge-objects --bypass-gc
2017-10-31 09:28:03.704490 7efdb31d08c0  0 RGWObjManifest::operator++(): 
result: ofs=565628 stripe_ofs=565628 part_ofs=0 rule->part_size=0
2017-10-31 09:28:03.890675 7efdb31d08c0  0 RGWObjManifest::operator++(): 
result: ofs=1757663 stripe_ofs=1757663 part_ofs=0 rule->part_size=0
2017-10-31 09:28:04.144966 7efdb31d08c0  0 RGWObjManifest::operator++(): 
result: ofs=2723340 stripe_ofs=2723340 part_ofs=0 rule->part_size=0
2017-10-31 09:28:04.380761 7efdb31d08c0 -1 ERROR: could not drain handles as 
aio completion returned with -2

This cluster recently switched from a production cluster to a test cluster 
after a data migration, so I have the option to 

[ceph-users] Switching failure domains

2018-01-31 Thread Bryan Stillwell
We're looking into switching the failure domains on several of our
clusters from host-level to rack-level and I'm trying to figure out the
least impactful way to accomplish this.

First off, I've made this change before on a couple large (500+ OSDs)
OpenStack clusters where the volumes, images, and vms pools were all
about 33% of the cluster.  The way I did it then was to create a new
rule which had a switch-based failure domain and then did one pool at a
time.

That worked pretty well, but now I've inherited several large RGW
clusters (500-1000+ OSDs) where 99% of the data is in the .rgw.buckets
pool with slower and bigger disks (7200 RPM 4TB SATA HDDs vs. the 10k
RPM 1.2TB SAS HDDs I was using previously).  This makes the change take
longer and early testing has shown it being fairly impactful.

I'm wondering if there is a way to more gradually switch to a rack-based
failure domain?

One of the ideas we had was to create new hosts that are actually the
racks and gradually move all the OSDs to those hosts.  Once that is
complete we should be able to turn those hosts into racks and switch the
failure domain at the same time.
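
As a concrete sketch of that idea (bucket names, the weight, and the final
rule name are placeholders):

ceph osd crush add-bucket rack1 host        # a "host" that will eventually become a rack
ceph osd crush move rack1 root=default
ceph osd crush set osd.0 3.64 host=rack1    # gradually relocate OSDs into it
# turning the host buckets into racks later is a crushmap decompile/edit,
# after which a rack-level rule can take over:
ceph osd crush rule create-simple replicated_racks default rack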

Does anyone see a problem with that approach?

I was also wondering if we could take advantage of RGW in any way to
gradually move the data to a new pool with the proper failure domain set
on it?

BTW, these clusters will all be running jewel (10.2.10).  The time I
made the switch previously was done on hammer.

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Stillwell
Bryan,

Based off the information you've provided so far, I would say that your largest 
pool still doesn't have enough PGs.

If you originally had only 512 PGs for your largest pool (I'm guessing 
.rgw.buckets has 99% of your data), then on a balanced cluster you would have 
just ~11.5 PGs per OSD (3*512/133).  That's way lower than the recommended 100 
PGs/OSD.

Based on the number of disks and assuming your .rgw.buckets pool has 99% of the 
data, you should have around 4,096 PGs for that pool.  You'll still end up with 
an uneven distribution, but the outliers shouldn't be as far out.
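
(Roughly: 133 OSDs * 100 PGs per OSD / 3 copies is about 4,433, and rounding
down to the nearest power of two gives 4,096.)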

Sage recently wrote a new balancer plugin that makes balancing a cluster 
something that happens automatically.  He gave a great talk at LinuxConf 
Australia that you should check out, here's a link into the video where he 
talks about the balancer and the need for it:

https://youtu.be/GrStE7XSKFE?t=20m14s

Even though your objects are fairly large, they are getting broken up into 
chunks that are spread across the cluster.  You can see how large each of your 
PGs are with a command like this:

ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' |sort -n 
-k2

You'll see that within a pool the PG sizes are fairly close to the same size, 
but in your cluster the PGs are fairly large (~200GB would be my guess).

Bryan

From: ceph-users  on behalf of Bryan 
Banister 
Date: Monday, February 12, 2018 at 2:19 PM
To: Janne Johansson 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Hi Janne and others,
 
We used the “ceph osd reweight-by-utilization “ command to move a small amount 
of data off of the top four OSDs by utilization.  Then we updated the pg_num 
and pgp_num on the pool from 512 to 1024 which started moving roughly 50% of 
the objects around as a result.  The unfortunate issue is that the weights on 
the OSDs are still roughly equivalent and the OSDs that are nearfull were still 
getting allocated objects during the rebalance backfill operations.
 
At this point I have made some massive changes to the weights of the OSDs in an 
attempt to stop Ceph from allocating any more data to OSDs that are getting 
close to full.  Basically the OSD with the lowest utilization remains weighted 
at 1 and the rest of the OSDs are now reduced in weight based on the percent 
usage of the OSD + the %usage of the OSD with the amount of data (21% at the 
time).  This means the OSD that is at the most full at this time at 86% full 
now has a weight of only .33 (it was at 89% when reweight was applied).  I’m 
not sure this is a good idea, but it seemed like the only option I had.  Please 
let me know if I’m making a bad situation worse!
 
I still have the question on how this happened in the first place and how to 
prevent it from happening going forward without a lot of monitoring and 
reweighting on weekends/etc to keep things balanced.  It sounds like Ceph is 
really expecting that objects stored into a pool will roughly have the same 
size, is that right?
 
Our backups going into this pool have very large variation in size, so would it 
be better to create multiple pools based on expected size of objects and then 
put backups of similar size into each pool?
 
The backups also have basically the same names with the only difference being 
the date which it was taken (e.g. backup name difference in subsequent days can 
be one digit at times), so does this mean that large backups with basically the 
same name will end up being placed in the same PGs based on the CRUSH 
calculation using the object name?
 
Thanks,
-Bryan
 
From: Janne Johansson [mailto:icepic...@gmail.com] 
Sent: Wednesday, January 31, 2018 9:34 AM
To: Bryan Banister 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
 
 
2018-01-31 15:58 GMT+01:00 Bryan Banister :
 
 
Given that this will move data around (I think), should we increase the pg_num 
and pgp_num first and then see how it looks?
 
 
I guess adding pgs and pgps will move stuff around too, but if the PGCALC 
formula says you should have more then that would still be a good
start. Still, a few manual reweights first to take the 85-90% ones down might 
be good, some move operations are going to refuse adding things
to too-full OSDs, so you would not want to get accidentally bumped above such a 
limit due to some temp-data being created during moves.
 
Also, dont bump pgs like crazy, you can never move down. Aim for getting ~100 
per OSD at most, and perhaps even then in smaller steps so
that the creation (and evening out of data to the new empty PGs) doesn't kill 
normal client I/O perf in the meantime.

 
-- 
May the most significant bit of your life be positive.




Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-13 Thread Bryan Stillwell
It may work fine, but I would suggest limiting the number of operations going 
on at the same time.
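
Something along these lines keeps the concurrent recovery work down while the
split finishes (conservative example values):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'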

Bryan

From: Bryan Banister 
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell , Janne Johansson 

Cc: Ceph Users 
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Thanks for the response Bryan!

Would it be good to go ahead and do the increase up to 4096 PGs for the pool 
given that it's only at 52% done with the rebalance backfilling operations?

Thanks in advance!!
-Bryan

-Original Message-
From: Bryan Stillwell [mailto:bstillw...@godaddy.com]
Sent: Tuesday, February 13, 2018 12:43 PM
To: Bryan Banister <bbanis...@jumptrading.com>; Janne Johansson <icepic...@gmail.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2


Bryan,

Based off the information you've provided so far, I would say that your largest 
pool still doesn't have enough PGs.

If you originally had only 512 PGs for your largest pool (I'm guessing 
.rgw.buckets has 99% of your data), then on a balanced cluster you would have 
just ~11.5 PGs per OSD (3*512/133).  That's way lower than the recommended 100 
PGs/OSD.

Based on the number of disks and assuming your .rgw.buckets pool has 99% of the 
data, you should have around 4,096 PGs for that pool.  You'll still end up with 
an uneven distribution, but the outliers shouldn't be as far out.

Sage recently wrote a new balancer plugin that makes balancing a cluster 
something that happens automatically.  He gave a great talk at LinuxConf 
Australia that you should check out, here's a link into the video where he 
talks about the balancer and the need for it:

https://youtu.be/GrStE7XSKFE?t=20m14s

Even though your objects are fairly large, they are getting broken up into 
chunks that are spread across the cluster.  You can see how large each of your 
PGs are with a command like this:

ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' |sort -n 
-k2

You'll see that within a pool the PG sizes are fairly close to the same size, 
but in your cluster the PGs are fairly large (~200GB would be my guess).

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan Banister <bbanis...@jumptrading.com>
Date: Monday, February 12, 2018 at 2:19 PM
To: Janne Johansson <icepic...@gmail.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

Hi Janne and others,

We used the “ceph osd reweight-by-utilization “ command to move a small amount 
of data off of the top four OSDs by utilization.  Then we updated the pg_num 
and pgp_num on the pool from 512 to 1024 which started moving roughly 50% of 
the objects around as a result.  The unfortunate issue is that the weights on 
the OSDs are still roughly equivalent and the OSDs that are nearfull were still 
getting allocated objects during the rebalance backfill operations.

At this point I have made some massive changes to the weights of the OSDs in an 
attempt to stop Ceph from allocating any more data to OSDs that are getting 
close to full.  Basically the OSD with the lowest utilization remains weighted 
at 1 and the rest of the OSDs are now reduced in weight based on the percent 
usage of the OSD + the %usage of the OSD with the amount of data (21% at the 
time).  This means the OSD that is at the most full at this time at 86% full 
now has a weight of only .33 (it was at 89% when reweight was applied).  I’m 
not sure this is a good idea, but it seemed like the only option I had.  Please 
let me know if I’m making a bad situation worse!

I still have the question on how this happened in the first place and how to 
prevent it from happening going forward without a lot of monitoring and 
reweighting on weekends/etc to keep things balanced.  It sounds like Ceph is 
really expecting that objects stored into a pool will roughly have the same 
size, is that right?

Our backups going into this pool have very large variation in size, so would it 
be better to create multiple pools based on expected size of objects and then 
put backups of similar size into each pool?

The backups also have basically the same names with the only difference being 
the date which it was taken (e.g. backup name difference in subsequent days can 
be one digit at times), so does this mean that large backups with basically the 
same name will end up being placed in the same PGs based on the CRUSH 
calculation using the object name?

Thanks,
-Bryan

From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: Wednesday, January 31, 2018 9:34 AM
To: Bryan Banister <bbanis...@jumptrading.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>

Re: [ceph-users] v13.2.1 Mimic released

2018-07-27 Thread Bryan Stillwell
I decided to upgrade my home cluster from Luminous (v12.2.7) to Mimic (v13.2.1) 
today and ran into a couple issues:

1. When restarting the OSDs during the upgrade it seems to forget my upmap 
settings.  I had to manually return them to the way they were with commands 
like:

ceph osd pg-upmap-items 5.1 11 18 8 6 9 0
ceph osd pg-upmap-items 5.1f 11 17

I also saw this when upgrading from v12.2.5 to v12.2.7.
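
A quick way to snapshot the existing entries before restarting anything (so
they can be re-applied afterwards) is something like:

ceph osd dump | grep pg_upmap_items > upmap-backup.txt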

2. Also after restarting the first OSD during the upgrade I saw 21 messages 
like these in ceph.log:

2018-07-27 15:53:49.868552 osd.1 osd.1 10.0.0.207:6806/4029643 97 : cluster 
[WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.922365 osd.6 osd.6 10.0.0.16:6804/90400 25 : cluster [WRN] 
failed to encode map e100467 with expected crc
2018-07-27 15:53:49.925585 osd.6 osd.6 10.0.0.16:6804/90400 26 : cluster [WRN] 
failed to encode map e100467 with expected crc
2018-07-27 15:53:49.944414 osd.18 osd.18 10.0.0.15:6808/120845 8 : cluster 
[WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.944756 osd.17 osd.17 10.0.0.15:6800/120749 13 : cluster 
[WRN] failed to encode map e100467 with expected crc

Is this a sign that full OSD maps were sent out by the mons to every OSD like 
back in the hammer days?  I seem to remember that OSD maps should be a lot 
smaller now, so maybe this isn't as big of a problem as it was back then?

Thanks,
Bryan

From: ceph-users  on behalf of Sage Weil 

Date: Friday, July 27, 2018 at 1:25 PM
To: "ceph-annou...@lists.ceph.com" , 
"ceph-users@lists.ceph.com" , 
"ceph-maintain...@lists.ceph.com" , 
"ceph-de...@vger.kernel.org" 
Subject: [ceph-users] v13.2.1 Mimic released

This is the first bugfix release of the Mimic v13.2.x long term stable release
series. This release contains many fixes across all components of Ceph,
including a few security fixes. We recommend that all users upgrade.

Notable Changes
--

* CVE 2018-1128: auth: cephx authorizer subject to replay attack (issue#24836 
http://tracker.ceph.com/issues/24836, Sage Weil)
* CVE 2018-1129: auth: cephx signature check is weak (issue#24837 
http://tracker.ceph.com/issues/24837, Sage Weil)
* CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838
* 

Re: [ceph-users] rocksdb mon stores growing until restart

2018-09-19 Thread Bryan Stillwell
> On 08/30/2018 11:00 AM, Joao Eduardo Luis wrote:
> > On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> > Hi,
> > Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> > eventually triggering the 'mon is using a lot of disk space' warning?
> > Since upgrading to luminous, we've seen this happen at least twice.
> > Each time, we restart all the mons and then stores slowly trim down to
> > <500MB. We have 'mon compact on start = true', but it's not the
> > compaction that's shrinking the rockdb's -- the space used seems to
> > decrease over a few minutes only after *all* mons have been restarted.
> > This reminds me of a hammer-era issue where references to trimmed maps
> > were leaking -- I can't find that bug at the moment, though.
>
> Next time this happens, mind listing the store contents and check if you
> are holding way too many osdmaps? You shouldn't be holding more osdmaps
> than the default IF the cluster is healthy and all the pgs are clean.
>
> I've chased a bug pertaining this last year, even got a patch, but then
> was unable to reproduce it. Didn't pursue merging the patch any longer
> (I think I may still have an open PR for it though), simply because it
> was no longer clear if it was needed.

I just had this happen to me while using ceph-gentle-split on a 12.2.5
cluster with 1,370 OSDs.  Unfortunately, I restarted the mon nodes which
fixed the problem before finding this thread.  I'm only halfway done
with the split, so I'll see if the problem resurfaces again.
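
If it does resurface, the plan is to check how many osdmaps the store is
holding on to, roughly like this (run against a stopped mon; the path is the
usual default):

ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-$(hostname -s)/store.db list osdmap | wc -l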

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
I left some of the 'ceph pg dump' commands running and twice they returned 
results after 30 minutes, and three times it took 45 minutes.  Is there 
something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell 
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users@lists.ceph.com" 
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.
 
I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:
 
2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all
 
A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.
 
This problem also continued to appear after upgrading to 12.2.8.
 
Has anyone else seen this?
 
Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
Thanks Dan!

It does look like we're hitting the ms_tcp_read_timeout.  I changed it to 79 
seconds and I've had a couple dumps that were hung for ~2m40s 
(2*ms_tcp_read_timeout) and one that was hung for 8 minutes 
(6*ms_tcp_read_timeout).
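
For anyone else who wants to try it, the change is just this in ceph.conf
(plus a restart, or injectargs for a live change):

[global]
ms tcp read timeout = 79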

I agree that 15 minutes (900s) is a long timeout.  Anyone know the reasoning 
for that decision?

Bryan

From: Dan van der Ster 
Date: Thursday, October 18, 2018 at 2:03 PM
To: Bryan Stillwell 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

15 minutes seems like the ms tcp read timeout would be related.

Try shortening that and see if it works around the issue...

(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long to keep idle connections open)

-- dan


On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell 
<bstillw...@godaddy.com> wrote:

I left some of the 'ceph pg dump' commands running and twice they returned 
results after 30 minutes, and three times it took 45 minutes.  Is there 
something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell <bstillw...@godaddy.com>
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

2018-10-18 Thread Bryan Stillwell
I could see something related to that bug might be happening, but we're not 
seeing the "clock skew" or "signal: Hangup" messages in our logs.

One reason that this cluster might be running into this problem is that we 
appear to have a script that is gathering stats for collectd which is running 
'ceph pg dump' every 16-17 seconds.  I guess you could say we're stress testing 
that code path fairly well...  :)

Bryan

On Thu, Oct 18, 2018 at 6:17 PM Bryan Stillwell 
<bstillw...@godaddy.com> wrote:

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a 
problem where the new ceph-mgr would sometimes hang indefinitely when doing 
commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of 
our clusters (10+) aren't seeing the same issue, but they are all under 600 
OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but 
usually overnight it'll get back into the state where the hang reappears.  At 
first I thought it was a hardware issue, but switching the primary ceph-mgr to 
another node didn't fix the problem.



I've increased the logging to 20/20 for debug_mgr, and while a working dump 
looks like this:



2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3

2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg 
dump

2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  
client.admin capable

2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : 
from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' 
cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: 
dispatch

2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) 
Success dumped all



A failed dump call doesn't show up at all.  The "mgr.server handle_command 
prefix=pg dump" log entry doesn't seem to even make it to the logs.

This could be a manifestation of
https://tracker.ceph.com/issues/23460, as the "pg dump" path is one of
the places where the pgmap and osdmap locks are taken together.

Deadlockyness aside, this code path could use some improvement so that
both locks aren't being held unnecessarily, and so that we aren't
holding up all other accesses to pgmap while doing a dump.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

2018-02-21 Thread Bryan Stillwell
Bryan,

The good news is that there is progress being made on making this harder to 
screw up.  Read this article for example:

https://ceph.com/community/new-luminous-pg-overdose-protection/

The bad news is that I don't have a great solution for you regarding your 
peering problem.  I've run into things like that on testing clusters.  That 
almost always teaches me not to do too many operations at one time.  Usually
some combination of flags (norecover, norebalance, nobackfill, noout, etc.)
with OSD restarts will fix the problem; a rough sketch of that is below.  You
can also query PGs to figure out why they aren't peering, increase logging, or
if you want to get it back quickly you should consider Red Hat support or
contacting a Ceph consultant like Wido.
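
Something like this (exact flags and ordering depend on the situation):

ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set noout
# restart the OSDs involved in the longest blocked requests, then:
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset noout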

In fact, I would recommend watching Wido's presentation on "10 ways to break 
your Ceph cluster" from Ceph Days Germany earlier this month for other things 
to watch out for:

https://ceph.com/cephdays/germany/

Bryan

From: ceph-users  on behalf of Bryan 
Banister 
Date: Tuesday, February 20, 2018 at 2:53 PM
To: David Turner 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2

HI David [Resending with smaller message size],

I tried setting the OSDs down and that does clear the blocked requests 
momentarily but they just return back to the same state.  Not sure how to 
proceed here, but one thought was just to do a full cold restart of the entire 
cluster.  We have disabled our backups so the cluster is effectively down.  Any 
recommendations on next steps?

This also seems like a pretty serious issue, given that making this change has 
effectively broken the cluster.  Perhaps Ceph should not allow you to increase 
the number of PGs so drastically or at least make you put in a 
‘--yes-i-really-mean-it’ flag?

Or perhaps just some warnings on the docs.ceph.com placement groups page 
(http://docs.ceph.com/docs/master/rados/operations/placement-groups/ ) and the 
ceph command man page?

Would be good to help other avoid this pitfall.

Thanks again,
-Bryan

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Friday, February 16, 2018 3:21 PM
To: Bryan Banister <bbanis...@jumptrading.com>
Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2


That sounds like a good next step.  Start with OSDs involved in the longest 
blocked requests.  Wait a couple minutes after the osd marks itself back up and 
continue through them.  Hopefully things will start clearing up so that you 
don't need to mark all of them down.  There is usually a only a couple OSDs 
holding everything up.

On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister 
<bbanis...@jumptrading.com> wrote:
Thanks David,

Taking the list of all OSDs that are stuck reports that a little over 50% of 
all OSDs are in this condition.  There isn’t any discernable pattern that I can 
find and they are spread across the three servers.  All of the OSDs are online 
as far as the service is concern.

I have also taken all PGs that were reported the health detail output and 
looked for any that report “peering_blocked_by” but none do, so I can’t tell if 
any OSD is actually blocking the peering operation.

As suggested, I got a report of all peering PGs:
[root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort 
-k13
pg 14.fe0 is stuck peering since forever, current state peering, last 
acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last 
acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last 
acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last 
acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last 
acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last 
acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last 
acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last 
acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last 
acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last 
acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last 
acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last 
acting [59,124,39]
pg 14.ff2 is stuck peering since forever, current state peering, last 
acting [62,117,7]
pg 14.ff2 is stuck unclean since forever, current state peering, last 
acting [62,117,7]
pg 14.fe4 is stuck peering since forever, current state peering, last 
acting [84

[ceph-users] RGW (Swift) failures during upgrade from Jewel to Luminous

2018-05-08 Thread Bryan Stillwell
We recently began our upgrade testing for going from Jewel (10.2.10) to
Luminous (12.2.5) on our clusters.  The first part of the upgrade went
pretty smoothly (upgrading the mon nodes, adding the mgr nodes, upgrading
the OSD nodes), however, when we got to the RGWs we started seeing internal
server errors (500s) on the Jewel RGWs once the first RGW was upgraded to
Luminous.  Further testing found two different problems:

The first problem (internal server error) was seen when the container and
object were created by a Luminous RGW, but then a Jewel RGW attempted to
list the container.

The second problem (container appears to be empty) was seen when the
container was created by a Luminous RGW, an object was added using a Jewel
RGW, and then the container was listed by a Luminous RGW.

Here were all the tests I performed:

Test #1: Create container (Jewel),    Add object (Jewel),    List container (Jewel),    Result: Success
Test #2: Create container (Jewel),    Add object (Jewel),    List container (Luminous), Result: Success
Test #3: Create container (Jewel),    Add object (Luminous), List container (Jewel),    Result: Success
Test #4: Create container (Jewel),    Add object (Luminous), List container (Luminous), Result: Success
Test #5: Create container (Luminous), Add object (Jewel),    List container (Jewel),    Result: Success
Test #6: Create container (Luminous), Add object (Jewel),    List container (Luminous), Result: Failure (Container appears empty)
Test #7: Create container (Luminous), Add object (Luminous), List container (Jewel),    Result: Failure (Internal Server Error)
Test #8: Create container (Luminous), Add object (Luminous), List container (Luminous), Result: Success
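
For reference, each step was just the basic Swift CLI operations pointed at a
specific RGW (container and object names here are placeholders):

swift post testcontainer           # create container
swift upload testcontainer file1   # add object
swift list testcontainer           # list container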

It appears that we ran into these bugs because our load balancer was
alternating between the RGWs while they were running a mixture of the two
versions (like you would expect during an upgrade).

Has anyone run into this problem as well?  Is there a way to workaround it
besides disabling half the RGWs, upgrading that half, swinging all the
traffic to the upgraded RGWs, upgrading the other half, and then enabling
the second half?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread Bryan Stillwell
> We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB disks
> each to the cluster. All the 5 nodes rebalanced well without any issues and
> the sixth/last node OSDs started acting weird as I increase weight of one osd
> the utilization doesn't change but a different osd on the same node
> utilization is getting increased. Rebalance complete fine but utilization is
> not right.
>
> Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
> started increasing but its weight is 0.0. If I increase weight of OSD 611 to
> 0.2 then its overall utilization is growing to what if its weight is 0.4. So
> if I increase weight of 610 and 615 to their full weight then utilization on
> OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to stop and
> downsize the OSD's crush weight back to 0.0 to avoid any implications on ceph
> cluster. Its not just one osd but different OSD's on that one node. The only
> correlation I found out is 610 and 611 OSD Journal partitions are on the same
> SSD drive and all the OSDs are SAS drives. Any help on how to debug or
> resolve this will be helpful.

You didn't say which version of Ceph you were using, but based on the output
of 'ceph osd df' I'm guessing it's a pre-Jewel (maybe Hammer?) cluster.

I've found that data placement can be a little weird when you have really
low CRUSH weights (0.2) on one of the nodes where the other nodes have large
CRUSH weights (2.0).  I've had it where a single OSD in a node was getting
almost all the data.  It wasn't until I increased the weights to be more in
line with the rest of the cluster that it evened back out.

I believe this can also be caused by not having enough PGs in your cluster.
Or the PGs you do have aren't distributed correctly based on the data usage
in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
correct number of PGs you should have per pool?

Since you are likely running a pre-Jewel cluster it could also be that you
haven't switched your tunables to use the straw2 data placement algorithm:

http://docs.ceph.com/docs/master/rados/operations/crush-map/#hammer-crush-v4

That should help as well.  Once that's enabled you can convert your existing
buckets to straw2 as well.  Just be careful you don't have any old clients
connecting to your cluster that don't support that feature yet.
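
The bucket conversion itself can be done by round-tripping the CRUSH map --
a sketch, worth testing somewhere safe first:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
sed -i 's/alg straw$/alg straw2/' crushmap.txt    # flip each bucket to straw2
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new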

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Living with huge bucket sizes

2017-06-08 Thread Bryan Stillwell
This has come up quite a few times before, but since I was only working with
RBD before I didn't pay too close attention to the conversation.  I'm looking
for the best way to handle existing clusters that have buckets with a large
number of objects (>20 million) in them.  The cluster I'm doing test on is
currently running hammer (0.94.10), so if things got better in jewel I would
love to hear about it!

One idea I've played with is to create a new SSD pool by adding an OSD
to every journal SSD.  My thinking was that our data is mostly small
objects (~100KB) so the journal drives were unlikely to be getting close
to any throughput limitations.  They should also have plenty of IOPs
left to handle the .rgw.buckets.index pool.

So on our test cluster I created a separate root that I called
rgw-buckets-index, I added all the OSDs I created on the journal SSDs,
and created a new crush rule to place data on it:

ceph osd crush rule create-simple rgw-buckets-index_ruleset rgw-buckets-index 
chassis

Once everything was set up correctly I tried switching the
.rgw.buckets.index pool over to it by doing:

ceph osd set norebalance
ceph osd pool set .rgw.buckets.index crush_ruleset 1
# Wait for peering to complete
ceph osd unset norebalance

Things started off well, but once it got to backfilling the PGs which
have the large buckets on them, I started seeing a large number of slow
requests like these:

  ack+ondisk+write+known_if_redirected e68708) currently waiting for degraded 
object
  ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ack+ondisk+write+known_if_redirected e68708) currently waiting for rw locks

Digging in on the OSDs, it seems they would either restart or die after
seeing a lot of these messages:

  heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f8f5d604700' had timed 
out after 30

or:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f99ec2e4700' had timed out 
after 15

The ones that died saw messages like these:

  heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd59e7c700' had timed out after 60

Followed by:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd48c1d700' had suicide timed out after 150


The backfilling process would appear to hang on some of the PGs, but I
figured out that they were recovering omap data and was able to keep an
eye on the process by running:

watch 'ceph pg 272.22 query | grep omap_recovered_to'

A lot of the timeouts happened after the PGs finished the omap recovery,
which took over an hour on one of the PGs.

Has anyone found a good solution for this for existing large buckets?  I
know sharding is the solution going forward, but afaik it can't be done
on existing buckets yet (although the dynamic resharding work mentioned
on today's performance call sounds promising).

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Bryan Stillwell
Is this on an RGW cluster?

If so, you might be running into the same problem I was seeing with large 
bucket sizes:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018504.html

The solution is to shard your buckets so the bucket index doesn't get too big.
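
For new buckets you can also make RGW create sharded indexes by default with
something like this in ceph.conf (the section name and shard count here are
just examples, and it only affects buckets created after the change):

[client.rgw.gateway1]
rgw_override_bucket_index_max_shards = 16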

Bryan

From: ceph-users  on behalf of Tyler Bischel 

Date: Monday, June 12, 2017 at 5:12 PM
To: "ceph-us...@ceph.com" 
Subject: [ceph-users] osd_op_tp timeouts

Hi,
  We've been having this ongoing problem with threads timing out on the OSDs.  
Typically we'll see the OSD become unresponsive for about a minute, as threads 
from other OSDs time out.  The timeouts don't seem to be correlated to high 
load.  We turned up the logs to 10/10 for part of a day to catch some of these 
in progress, and saw the pattern below in the logs several times (grepping for 
individual threads involved in the time outs).

We are using Jewel 10.2.7.

Logs:

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967027 lcod 5484'12967028 active] add_log_entry 
5484'12967030 (0'0) modify   
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
 by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967028 lcod 5484'12967028 active] append_log: 
trimming to 5484'12967028 entries 5484'12967028 (5484'12967026) delete   
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
 by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700  5 write_log with: dirty_to: 0'0, 
dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, 
divergent_priors: 0, writeout_from: 5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15



Metrics:
OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0 to 16,
IO In Progress spikes from 0 to hundreds, and IO Time Weighted and IO Time both
spike.  Average Queue Size on the device spikes.  One minute later, Write Time,
Reads, and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Directory size doesn't match contents

2017-06-14 Thread Bryan Stillwell
I have a cluster running 10.2.7 that is seeing some extremely large directory 
sizes in CephFS according to the recursive stats:

$ ls -lhd Originals/
drwxrwxr-x 1 bryan bryan 16E Jun 13 13:27 Originals/

du reports a much smaller (and accurate) number:

$ du -sh Originals/
300G    Originals/

This directory recently saw some old rsync temporary files re-appear that I 
have since removed.  Perhaps that could be related?

Thanks,
Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Directory size doesn't match contents

2017-06-15 Thread Bryan Stillwell
On 6/15/17, 9:20 AM, "John Spray"  wrote:
>
> On Wed, Jun 14, 2017 at 4:31 PM, Bryan Stillwell  
> wrote:
> > I have a cluster running 10.2.7 that is seeing some extremely large 
> > directory sizes in CephFS according to the recursive stats:
> >
> > $ ls -lhd Originals/
> > drwxrwxr-x 1 bryan bryan 16E Jun 13 13:27 Originals/
>
> What client (and version of the client) are you using?

I'm using the ceph-fuse client from the 10.2.7-1trusty packages.


> rstats being out of date is a known issue, but getting a completely
> bogus value like this is not.
>
> Do you get the correct value if you mount a new client and look from there?

I tried doing a new ceph-fuse mount on another host running
10.2.7-1trusty and also see the same problem there:

$ ceph-fuse --version
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
[root@b3:/root]$ ls -ld /ceph/Originals
drwxrwxr-x 1 bryan bryan 1844674382704167 Jun 13 13:27 /ceph/Originals


I then tried mounting it with a newer kernel and rstats don't seem to be
working for that directory or any other directory:

[root@shilling:/root]$ uname -a
Linux shilling 4.8.0-52-generic #55~16.04.1-Ubuntu SMP Fri Apr 28 14:36:29 UTC 
2017 x86_64 x86_64 x86_64 GNU/Linux
[root@shilling:/root]$ ls -ld /ceph-old/{Logs,Music,Originals,Pictures}
drwxrwxr-x 1 bryan bryan 111 Feb 29  2016 /ceph-old/Logs
drwxr-xr-x 1 bryan bryan   5 Feb 17  2012 /ceph-old/Music
drwxrwxr-x 1 bryan bryan   1 Jun 13 13:27 /ceph-old/Originals
drwxr-xr-x 1 bryan bryan  25 Jul  1  2015 /ceph-old/Pictures

I also gave ceph-fuse in kraken a try too:

[root@shilling:/root]$ ceph-fuse --version
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
[root@shilling:/root]$ ls -ld /ceph-old/Originals
drwxrwxr-x 1 bryan bryan 1844674382704167 Jun 13 13:27 /ceph-old/Originals


Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing orphaned radosgw bucket indexes from pool

2018-11-29 Thread Bryan Stillwell
Wido,

I've been looking into this large omap objects problem on a couple of our 
clusters today and came across your script during my research.

The script has been running for a few hours now and I'm already over 100,000 
'orphaned' objects!

It appears that ever since upgrading to Luminous (12.2.5 initially, followed by 
12.2.8) this cluster has been resharding the large bucket indexes at least once 
a day and not cleaning up the previous bucket indexes:

for instance in $(radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep go-test-dashboard); do
  mtime=$(radosgw-admin metadata get bucket.instance:${instance} | grep mtime)
  num_shards=$(radosgw-admin metadata get bucket.instance:${instance} | grep num_shards)
  echo "${instance}: ${mtime} ${num_shards}"
done | column -t | sort -k3
go-test-dashboard:default.188839135.327804:  "mtime":  "2018-06-01 22:35:28.693095Z",  "num_shards":  0,
go-test-dashboard:default.617828918.2898:    "mtime":  "2018-06-02 22:35:40.438738Z",  "num_shards":  46,
go-test-dashboard:default.617828918.4:       "mtime":  "2018-06-02 22:38:21.537259Z",  "num_shards":  46,
go-test-dashboard:default.617663016.10499:   "mtime":  "2018-06-03 23:00:04.185285Z",  "num_shards":  46,
[...snip...]
go-test-dashboard:default.891941432.342061:  "mtime":  "2018-11-28 01:41:46.777968Z",  "num_shards":  7,
go-test-dashboard:default.928133068.2899:    "mtime":  "2018-11-28 20:01:49.390237Z",  "num_shards":  46,
go-test-dashboard:default.928133068.5115:    "mtime":  "2018-11-29 01:54:17.788355Z",  "num_shards":  7,
go-test-dashboard:default.928133068.8054:    "mtime":  "2018-11-29 20:21:53.733824Z",  "num_shards":  7,
go-test-dashboard:default.891941432.359004:  "mtime":  "2018-11-29 20:22:09.201965Z",  "num_shards":  46,

The num_shards is typically around 46, but looking at all 288 instances of that 
bucket index, it has varied between 3 and 62 shards.

Have you figured anything more out about this since you posted this originally 
two weeks ago?

Thanks,
Bryan

From: ceph-users  on behalf of Wido den 
Hollander 
Date: Thursday, November 15, 2018 at 5:43 AM
To: Ceph Users 
Subject: [ceph-users] Removing orphaned radosgw bucket indexes from pool

Hi,

Recently we've seen multiple messages on the mailinglists about people
seeing HEALTH_WARN due to large OMAP objects on their cluster. This is
due to the fact that starting with 12.2.6 OSDs warn about this.

I've got multiple people asking me the same questions and I've done some
digging around.

Somebody on the ML wrote this script:

for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort`; do
  actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
  for instance in `radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep ${bucket}: | cut -d ':' -f 2`
  do
if [ "$actual_id" != "$instance" ]
then
  radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
  radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
fi
  done
done

That partially works, but it doesn't catch 'orphaned' objects in the index pool.

So I wrote my own script [0]:

#!/bin/bash
INDEX_POOL=$1

if [ -z "$INDEX_POOL" ]; then
    echo "Usage: $0 <index pool>"
    exit 1
fi

INDEXES=$(mktemp)
METADATA=$(mktemp)

trap "rm -f ${INDEXES} ${METADATA}" EXIT

radosgw-admin metadata list bucket.instance|jq -r '.[]' > ${METADATA}
rados -p ${INDEX_POOL} ls > $INDEXES

for OBJECT in $(cat ${INDEXES}); do
    MARKER=$(echo ${OBJECT}|cut -d '.' -f 3,4,5)
    grep ${MARKER} ${METADATA} > /dev/null
    if [ "$?" -ne 0 ]; then
        echo $OBJECT
    fi
done

It does not remove anything, but for example, it returns these objects:

.dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10406917.5752
.dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6162
.dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6186

The output of:

$ radosgw-admin metadata list|jq -r '.[]'

Does not contain:
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10406917.5752
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6162
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6186

So for me these objects do not seem to be tied to any bucket and seem to
be leftovers which were not cleaned up.

For example, I see these objects tied to a bucket:

- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6160
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6188
- eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6167

But notice the difference: 6160, 6188, 6167, but not 6162 nor 6186

Before I remove these objects I want to verify with other users if they
see the same and if my thinking is correct.

Wido

[0]: https://gist.github.com/wido/6650e66b09770ef02df89636891bef04

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Compacting omap data

2019-01-02 Thread Bryan Stillwell
Recently on one of our bigger clusters (~1,900 OSDs) running Luminous (12.2.8), 
we had a problem where OSDs would frequently get restarted while deep-scrubbing.

After digging into it I found that a number of the OSDs had very large omap 
directories (50GiB+).  I believe these were OSDs that had previous held PGs 
that were part of the .rgw.buckets.index pool which I have recently moved to 
all SSDs, however, it seems like the data remained on the HDDs.

I was able to reduce the data usage on most of the OSDs (from ~50 GiB to < 200 
MiB!) by compacting the omap dbs offline by setting 'leveldb_compact_on_mount = 
true' in the [osd] section of ceph.conf, but that didn't work on the newer OSDs 
which use rocksdb.  On those I had to do an online compaction using a command 
like:

$ ceph tell osd.510 compact

That worked, but today when I tried doing that on some of the SSD-based OSDs 
which are backing .rgw.buckets.index I started getting slow requests and the 
compaction ultimately failed with this error:

$ ceph tell osd.1720 compact
osd.1720: Error ENXIO: osd down

When I tried it again it succeeded:

$ ceph tell osd.1720 compact
osd.1720: compacted omap in 420.999 seconds

The data usage on that OSD dropped from 57.8 GiB to 43.4 GiB which was nice, 
but I don't believe that'll get any smaller until I start splitting the PGs in 
the .rgw.buckets.index pool to better distribute that pool across the SSD-based 
OSDs.

The first question I have is what is the option to do an offline compaction of 
rocksdb so I don't impact our customers while compacting the rest of the 
SSD-based OSDs?
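
One approach I'm considering (an untested sketch, assuming the default
FileStore omap path) is to stop the OSD and compact it offline with
ceph-kvstore-tool:

systemctl stop ceph-osd@1720
ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-1720/current/omap compact
systemctl start ceph-osd@1720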

The next question is whether there's a way to configure Ceph to automatically
compact the omap dbs in the background in a way that doesn't affect user 
experience?

Finally, I was able to figure out that the omap directories were getting large 
because we're using filestore on this cluster, but how could someone determine 
this when using BlueStore?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Omap issues - metadata creating too many

2019-01-03 Thread Bryan Stillwell
Josef,

I've noticed that when dynamic resharding is on it'll reshard some of our 
bucket indices daily (sometimes more).  This causes a lot of wasted space in 
the .rgw.buckets.index pool which might be what you are seeing.

You can get a listing of all the bucket instances in your cluster with this 
command:

radosgw-admin metadata list bucket.instance | jq -r '.[]' | sort

Give that a try and see if you see the same problem.  It seems that once you 
remove the old bucket instances the omap dbs don't reduce in size until you 
compact them.
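
In practice that ends up looking something like this (the bucket name, instance
id, and OSD number below are made up -- only purge instances that are no longer
the active one for the bucket):

radosgw-admin bi purge --bucket=mybucket --bucket-id=default.123456.1
radosgw-admin metadata rm bucket.instance:mybucket:default.123456.1
ceph tell osd.12 compact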

Bryan

From: Josef Zelenka 
Date: Thursday, January 3, 2019 at 3:49 AM
To: "J. Eric Ivancich" 
Cc: "ceph-users@lists.ceph.com" , Bryan Stillwell 

Subject: Re: [ceph-users] Omap issues - metadata creating too many

Hi, i had the default - so it was on (according to ceph kb). turned it
off, but the issue persists. i noticed Bryan Stillwell (cc-ing him) had
the same issue (reported about it yesterday) - tried his tips about
compacting, but it doesn't do anything, however i have to add to his
last point, this happens even with bluestore. Is there anything we can
do to clean up the omap manually?

Josef

On 18/12/2018 23:19, J. Eric Ivancich wrote:
On 12/17/18 9:18 AM, Josef Zelenka wrote:
Hi everyone, i'm running a Luminous 12.2.5 cluster with 6 hosts on
ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs(three
nodes have an additional SSD i added to have more space to rebalance the
metadata). Currently, the cluster is used mainly as radosgw storage,
with 28tb data in total, replication 2x for both the metadata and data
pools (a cephfs instance is running alongside there, but i don't think
it's the perpetrator - this happened likely before we had it). All
pools aside from the data pool of the cephfs and data pool of the
radosgw are located on the SSD's. Now, the interesting thing - at random
times, the metadata OSD's fill up their entire capacity with OMAP data
and go to r/o mode and we have no other option currently than deleting
them and re-creating. The fillup comes at a random time, it doesn't seem
to be triggered by anything and it isn't caused by some data influx. It
seems like some kind of a bug to me to be honest, but i'm not certain -
anyone else seen this behavior with their radosgw? Thanks a lot
Hi Josef,

Do you have rgw_dynamic_resharding turned on? Try turning it off and see
if the behavior continues.

One theory is that dynamic resharding is triggered and possibly not
completing. This could add a lot of data to omap for the incomplete
bucket index shards. After a delay it tries resharding again, possibly
failing again, and adding more data to the omap. This continues.

If this is the ultimate issue we have some commits on the upstream
luminous branch that are designed to address this set of issues.

But we should first see if this is the cause.

Eric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-07 Thread Bryan Stillwell
I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
cleaning up old osdmaps after doing an expansion.  This is even after the 
cluster became 100% active+clean:

# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
46181

With the osdmaps being over 600KB in size this adds up:

# du -sh /var/lib/ceph/osd/ceph-1754/current/meta
31G /var/lib/ceph/osd/ceph-1754/current/meta

I remember running into this during the hammer days:

http://tracker.ceph.com/issues/13990

Did something change recently that may have broken this fix?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to increase Ceph Mon store?

2019-01-07 Thread Bryan Stillwell
I believe the option you're looking for is mon_data_size_warn.  The default is 
set to 16106127360.

I've found that sometimes the mons need a little help getting started with 
trimming if you just completed a large expansion.  Earlier today I had a 
cluster where the mon's data directory was over 40GB on all the mons.  When I 
restarted them one at a time with 'mon_compact_on_start = true' set in the 
'[mon]' section of ceph.conf, they stayed around 40GB in size.   However, when 
I was about to hit send on an email to the list about this very topic, the 
warning cleared up and now the data directory is now between 1-3GB on each of 
the mons.  This was on a cluster with >1900 OSDs.

Bryan

From: ceph-users  on behalf of Pardhiv Karri 

Date: Monday, January 7, 2019 at 11:08 AM
To: ceph-users 
Subject: [ceph-users] Is it possible to increase Ceph Mon store?

Hi,

We have a large Ceph cluster (Hammer version). We recently saw its mon store 
growing too big (> 15GB) on all 3 monitors without any rebalancing happening for
quite some time. We have compacted the DB using "#ceph tell mon.[ID] compact"
for now. But is there a way to increase the size of the mon store to 32GB or
something to avoid the Ceph health going to a warning state due to the mon store
growing too big?

--
Thanks,
Pardhiv Karri



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-08 Thread Bryan Stillwell
I was able to get the osdmaps to slowly trim (maybe 50 would trim with each 
change) by making small changes to the CRUSH map like this:

for i in {1..100}; do
ceph osd crush reweight osd.1754 4.1
sleep 5
ceph osd crush reweight osd.1754 4
sleep 5
done

I believe this was the solution Dan came across back in the hammer days.  It 
works, but not ideal for sure.  Across the cluster it freed up around 50TB of 
data!

Bryan

From: ceph-users  on behalf of Bryan 
Stillwell 
Date: Monday, January 7, 2019 at 2:40 PM
To: ceph-users 
Subject: [ceph-users] osdmaps not being cleaned up in 12.2.8

I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
cleaning up old osdmaps after doing an expansion.  This is even after the 
cluster became 100% active+clean:

# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
46181

With the osdmaps being over 600KB in size this adds up:

# du -sh /var/lib/ceph/osd/ceph-1754/current/meta
31G    /var/lib/ceph/osd/ceph-1754/current/meta

I remember running into this during the hammer days:

http://tracker.ceph.com/issues/13990

Did something change recently that may have broken this fix?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-11 Thread Bryan Stillwell
That thread looks like the right one.

So far I haven't needed to restart the osd's for the churn trick to work.  I 
bet you're right that something thinks it still needs one of the old osdmaps on 
your cluster.  Last night our cluster finished another round of expansions and 
we're seeing up to 49,272 osdmaps hanging around.  The churn trick seems to be 
working again too.

Bryan

From: Dan van der Ster 
Date: Thursday, January 10, 2019 at 3:13 AM
To: Bryan Stillwell 
Cc: ceph-users 
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

Hi Bryan,

I think this is the old hammer thread you refer to:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html

We also have osdmaps accumulating on v12.2.8 -- ~12000 per osd at the moment.

I'm trying to churn the osdmaps like before, but our maps are not being trimmed.

Did you need to restart the osd's before the churn trick would work?
If so, it seems that something is holding references to old maps, like
that old hammer issue.

Cheers, Dan


On Tue, Jan 8, 2019 at 5:39 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:

I was able to get the osdmaps to slowly trim (maybe 50 would trim with each 
change) by making small changes to the CRUSH map like this:



for i in {1..100}; do
  ceph osd crush reweight osd.1754 4.1
  sleep 5
  ceph osd crush reweight osd.1754 4
  sleep 5
done

I believe this was the solution Dan came across back in the hammer days.  It 
works, but not ideal for sure.  Across the cluster it freed up around 50TB of 
data!



Bryan



From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan Stillwell <bstillw...@godaddy.com>
Date: Monday, January 7, 2019 at 2:40 PM
To: ceph-users <ceph-users@lists.ceph.com>
Subject: [ceph-users] osdmaps not being cleaned up in 12.2.8

I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't
cleaning up old osdmaps after doing an expansion.  This is even after the
cluster became 100% active+clean:

# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
46181

With the osdmaps being over 600KB in size this adds up:

# du -sh /var/lib/ceph/osd/ceph-1754/current/meta
31G    /var/lib/ceph/osd/ceph-1754/current/meta

I remember running into this during the hammer days:

http://tracker.ceph.com/issues/13990

Did something change recently that may have broken this fix?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-11 Thread Bryan Stillwell
I've created the following bug report to address this issue:

http://tracker.ceph.com/issues/37875

Bryan

From: ceph-users  on behalf of Bryan 
Stillwell 
Date: Friday, January 11, 2019 at 8:59 AM
To: Dan van der Ster 
Cc: ceph-users 
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fixing a broken bucket index in RGW

2019-01-16 Thread Bryan Stillwell
I'm looking for some help in fixing a bucket index on a Luminous (12.2.8)
cluster running on FileStore.

First some background on how I believe the bucket index became broken.  Last
month we had a PG in our .rgw.buckets.index pool become inconsistent:

2018-12-11 09:12:17.743983 osd.1879 osd.1879 10.36.173.147:6820/60041 16 : 
cluster [ERR] 7.8e : soid 7:717333b6:::.dir.default.1110451812.43.2:head 
omap_digest 0x59e4f686 != omap_digest 0x37b99ba6 from shard 1879

We then attempted to repair the PG by using 'ceph pg repair 7.8e', but I
have a feeling the primary copy must have been corrupt (later that day I
learned about 'rados list-inconsistent-obj 7.8e -f json-pretty').  The
repair resulted in an unfound object:

2018-12-11 09:32:02.651241 osd.1753 osd.1753 10.32.12.32:6820/3455358 13 : 
cluster [ERR] 7.8e push 7:717333b6:::.dir.default.1110451812.43.2:head v 
767605'30158112 failed because local copy is 767605'30158924

A couple hours later we started getting reports of 503s from multiple
customers.  Believing that the unfound object was the cause of the problem
we used the 'mark_unfound_lost revert' option to roll back to the previous
version:

ceph pg 7.8e mark_unfound_lost revert

This fixed the cluster, but broke the bucket.

Attempting to list the bucket contents results in:

[root@p3cephrgw007 ~]# radosgw-admin bucket list --bucket=backups.579
ERROR: store->list_objects(): (2) No such file or directory


This bucket appears to have been automatically sharded after we upgraded to
Luminous, so we do have an old bucket instance available (but it's too old
to be very helpful):

[root@p3cephrgw007 ~]# radosgw-admin metadata list bucket.instance |grep 
backups.579
"backups.579:default.1110451812.43",
"backups.579:default.28086735.566138",


Looking for all the shards based on the name only pulls up the first 2
shards:

[root@p3cephrgw007 ~]# rados -p .rgw.buckets.index ls | grep 
"default.1110451812.43"
...
.dir.default.1110451812.43.0
...
.dir.default.1110451812.43.1
...


But the bucket metadata says there should be three:

[root@p3cephrgw007 ~]# radosgw-admin metadata get 
bucket.instance:backups.579:default.1110451812.43 | jq -r 
'.data.bucket_info.num_shards'
3


If we look in the log message above it said .dir.default.1110451812.43.2 was
the rados object that was slightly newer, so the revert command we ran must
have removed it instead of rolling it back to the previous version.

This leaves me with some questions:

What would have been the better way for dealing with this problem when the
whole cluster stopped working?

Is there a way to recreate the bucket index?  I see a couple options in the
docs for fixing the bucket index (--fix) and for rebuilding the bucket index
(--check-objects), but I don't see any explanations on how it goes about
doing that.  Will it attempt to scan all the objects in the cluster to
determine which ones belong in this bucket index?  Will the missing shard be
ignored and the fixed bucket index be missing 1/3rd of the objects?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rebuilding RGW bucket indices from objects

2019-01-17 Thread Bryan Stillwell
This is sort of related to my email yesterday, but has anyone ever rebuilt a 
bucket index using the objects themselves?

It seems like it would be possible since the bucket_id is contained
within the rados object name:

# rados -p .rgw.buckets.index listomapkeys .dir.default.56630221.139618
error getting omap key set .rgw.buckets.index/.dir.default.56630221.139618: (2) 
No such file or directory
# rados -p .rgw.buckets ls | grep default.56630221.139618
default.56630221.139618__shadow_.IxIe8byqV61eu6g7gSVXBpHfrB3BlC4_1
default.56630221.139618_backup.20181214
default.56630221.139618_backup.20181220
default.56630221.139618__shadow_.GQcmQKfbBkb9WEF1X-6qGBEVfppGKEJ_1
...[ many more snipped ]...

Thanks,
Bryan



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Bryan Stillwell
Since you're using jumbo frames, make sure everything between the nodes
properly supports them (NICs & switches).  I've tested this in the past by
using the size option in ping (you need to use a payload size of 8972 instead
of 9000 to account for the 28-byte header):

ping -s 8972 192.168.160.237

If that works, then you'll need to pull out tcpdump/wireshark to determine why 
the packets aren't able to return.
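
Something like this on both ends will usually show whether the heartbeat
packets and their replies are making it through (the interface name and port
are just examples):

tcpdump -i eth0 -nn host 192.168.160.237 and port 6810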

Bryan

From: ceph-users  on behalf of Johan Thomsen 

Date: Thursday, January 17, 2019 at 5:42 AM
To: Kevin Olbrich 
Cc: ceph-users 
Subject: Re: [ceph-users] pgs stuck in creating+peering state

Thank you for responding!

First thing: I disabled the firewall on all the nodes.
More specifically not firewalld, but the NixOS firewall, since I run NixOS.
I can netcat both udp and tcp traffic on all ports between all nodes
without problems.

Next, I tried raising the mtu to 9000 on the nics where the cluster
network is connected - although I don't see why the mtu should affect
the heartbeat?
I have two bonded nics connected to the cluster network (mtu 9000) and
two separate bonded nics hooked on the public network (mtu 1500).
I've tested traffic and routing on both pairs of nics and traffic gets
through without issues, apparently.


None of the above solved the problem :-(


On Thu, Jan 17, 2019 at 12:01 Kevin Olbrich <k...@sv01.de> wrote:

Are you sure, no service like firewalld is running?
Did you check that all machines have the same MTU and jumbo frames are
enabled if needed?

I had this problem when I first started with ceph and forgot to
disable firewalld.
Replication worked perfectly fine but the OSD was kicked out every few seconds.

Kevin

On Thu, Jan 17, 2019 at 11:57 AM Johan Thomsen <wr...@ownrisk.dk> wrote:
>
> Hi,
>
> I have a sad ceph cluster.
> All my osds complain about failed reply on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> .. I've checked the network sanity all I can, and all ceph ports are
> open between nodes both on the public network and the cluster network,
> and I have no problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
> pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
> pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
> pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
> pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
>()
>
>
> I've set "noout" and "nodown" to prevent all osd's from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
>  -1   249.73434 root default
> -25   166.48956 datacenter m1
> -24    83.24478 pod kube1
> -35    41.62239 rack 10
> -34    41.62239 host ceph-sto-p102
>  40   hdd   7.27689 osd.40 up  1.0 1.0
>  41   hdd   7.27689 osd.41 up  1.0 1.0
>  42   hdd   7.27689 osd.42 up  1.0 1.0
>()
>
> I'm at a point where I don't know which options and what logs to check 
> anymore?
>
> Any debug hint would be very much appreciated.
>
> btw. I have no important data in the cluster (yet), so if the solution
> is to drop all osd and recreate them, it's ok for now. But I'd really
> like to know how the cluster ended in this state.
>
> /Johan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to reduce min_size of an EC pool?

2019-01-17 Thread Bryan Stillwell
When you use 3+2 EC that means you have 3 data chunks and 2 erasure chunks for 
your data.  So you can handle two failures, but not three.  The min_size 
setting is preventing you from going below 3 because that's the number of data 
chunks you specified for the pool.  I'm sorry to say this, but since the data 
was wiped off the other 3 nodes there isn't anything that can be done to 
recover it.
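
You can double-check the k/m values behind the pool with something like this
(the profile name in the second command is just a placeholder):

ceph osd pool get default.rgw.buckets.data erasure_code_profile
ceph osd erasure-code-profile get <profile-name>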

Bryan


From: ceph-users  on behalf of Félix 
Barbeira 
Date: Thursday, January 17, 2019 at 1:27 PM
To: Ceph Users 
Subject: [ceph-users] How to reduce min_size of an EC pool?

I want to bring my cluster back to a HEALTHY state because right now I have no
access to the data.

I have an 3+2 EC pool on a 5 node cluster. 3 nodes were lost, all data wiped. 
They were reinstalled and added to cluster again.

The "ceph health detail" command says to reduce min_size number to a value 
lower than 3, but:

root@ceph-monitor02:~# ceph osd pool set default.rgw.buckets.data min_size 2
Error EINVAL: pool min_size must be between 3 and 5
root@ceph-monitor02:~#

This is the situation:

root@ceph-monitor01:~# ceph -s
  cluster:
id: ce78b02d-03df-4f9e-a35a-31b5f05c4c63
health: HEALTH_WARN
Reduced data availability: 515 pgs inactive, 512 pgs incomplete

  services:
mon: 3 daemons, quorum ceph-monitor01,ceph-monitor03,ceph-monitor02
mgr: ceph-monitor02(active), standbys: ceph-monitor01, ceph-monitor03
osd: 57 osds: 57 up, 57 in

  data:
pools:   8 pools, 568 pgs
objects: 4.48 M objects, 10 TiB
usage:   24 TiB used, 395 TiB / 419 TiB avail
pgs: 0.528% pgs unknown
 90.141% pgs not active
 512 incomplete
 53  active+clean
 3   unknown

root@ceph-monitor01:~#

And this is the output of health detail:

root@ceph-monitor01:~# ceph health detail
HEALTH_WARN Reduced data availability: 515 pgs inactive, 512 pgs incomplete
PG_AVAILABILITY Reduced data availability: 515 pgs inactive, 512 pgs incomplete
pg 10.1cd is stuck inactive since forever, current state incomplete, last 
acting [9,48,41,58,17] (reducing pool default.rgw.buckets.data min_size from 3 
may help; search ceph.com/docs for 'incomplete')
pg 10.1ce is incomplete, acting [3,13,14,42,21] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1cf is incomplete, acting [36,27,3,39,51] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d0 is incomplete, acting [29,9,38,4,56] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d1 is incomplete, acting [2,34,17,7,30] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d2 is incomplete, acting [41,45,53,13,32] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d3 is incomplete, acting [7,28,15,20,3] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d4 is incomplete, acting [11,40,25,23,0] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d5 is incomplete, acting [32,51,20,57,28] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d6 is incomplete, acting [2,53,8,16,15] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d7 is incomplete, acting [1,2,33,43,42] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d8 is incomplete, acting [27,49,9,48,20] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d9 is incomplete, acting [37,8,7,11,20] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1da is incomplete, acting [27,14,33,15,53] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1db is incomplete, acting [58,53,6,26,4] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1dc is incomplete, acting [21,12,47,35,19] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1dd is incomplete, acting [51,4,52,24,7] (red

Re: [ceph-users] Suggestions/experiences with mixed disk sizes and models from 4TB - 14TB

2019-01-17 Thread Bryan Stillwell
I've run my home cluster with drives ranging in size from 500GB to 8TB before 
and the biggest issue you run into is that the bigger drives will get a 
proportional more number of PGs which will increase the memory requirements on 
them.  Typically you want around 100 PGs/OSD, but if you mix 4TB and 14TB 
drives in a cluster the 14TB drives will have 3.5 times the number of PGs.  So 
if the 4TB drives have 100 PGs, the 14TB drives will have 350.   Or if the 14TB 
drives have 100 PGs, the 4TB drives will only have just 28 PGs on them.  Using 
the balancer plugin in the mgr will pretty much be required.
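
Enabling it is just a few commands (assuming a Luminous or newer cluster and
clients new enough to support upmap):

ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on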

Also since you're using EC you'll need to make sure the math works with these 
nodes receiving 2-3.5 times the data.

Bryan

From: ceph-users  on behalf of Götz Reinicke 

Date: Wednesday, January 16, 2019 at 2:33 AM
To: ceph-users 
Subject: [ceph-users] Suggestions/experiences with mixed disk sizes and models 
from 4TB - 14TB

Dear Ceph users,

I’d like to get some feedback for the following thought:

Currently I run some 24*4TB bluestore OSD nodes. The main focus is on storage 
space over IOPS.

We use erasure code and cephfs, and things look good right now.

The „but“ is, I do need more disk space and don’t have so much more rack space 
available, so I was thinking of adding some 8TB or even 12TB OSDs and/or 
exchange over time 4TB OSDs with bigger disks.

My question is: How are your experiences with the current >=8TB SATA disks?  Are
there some very bad models out there which I should avoid?

The current OSD nodes are connected by 4*10Gb bonds, so for
replication/recovery speed is a 24-disk chassis with bigger disks useful, or should
I go with smaller chassis?  Or does the chassis size not matter all that
much in my setup?

I know EC is quite computing-intensive, so maybe bigger disks also have an
impact there?

Lots of questions, maybe you can help answer some.

Best regards and thanks a lot for feedback.  Götz



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is repairing an RGW bucket index broken?

2019-03-11 Thread Bryan Stillwell
I'm wondering if the 'radosgw-admin bucket check --fix' command is broken in 
Luminous (12.2.8)?

I'm asking because I'm trying to reproduce a situation we have on one of our 
production clusters and it doesn't seem to do anything.  Here are the steps of my
test:

1. Create a bucket with 1 million objects
2. Verify the bucket got sharded into 10 shards of (100,000 objects each)
3. Remove one of the shards using the rados command
4. Verify the bucket is broken
5. Attempt to fix the bucket

I got as far as step 4:

# rados -p .rgw.buckets.index ls | grep "default.1434737011.12485" | sort
.dir.default.1434737011.12485.0
.dir.default.1434737011.12485.1
.dir.default.1434737011.12485.2
.dir.default.1434737011.12485.3
.dir.default.1434737011.12485.4
.dir.default.1434737011.12485.5
.dir.default.1434737011.12485.6
.dir.default.1434737011.12485.8
.dir.default.1434737011.12485.9
# radosgw-admin bucket list --bucket=bstillwell-1mil
ERROR: store->list_objects(): (2) No such file or directory

But step 5 is proving problematic:

# time radosgw-admin bucket check --fix --bucket=bstillwell-1mil

real    0m0.201s
user    0m0.105s
sys     0m0.033s

# time radosgw-admin bucket check --fix --check-objects --bucket=bstillwell-1mil

real    0m0.188s
user    0m0.102s
sys     0m0.025s


Could someone help me figure out what I'm missing?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-08 Thread Bryan Stillwell
We have two separate RGW clusters running Luminous (12.2.8) that have started
seeing an increase in PGs going active+clean+inconsistent, with the cause being
an omap_digest mismatch.  Both clusters are using FileStore and the
inconsistent PGs are happening on the .rgw.buckets.index pool which was moved 
from HDDs to SSDs within the last few months.

We've been repairing them by first making sure the copy with the odd omap_digest
is not on the primary (by setting the primary-affinity to 0 if needed), doing the
repair, and then setting the primary-affinity back to 1.

For example PG 7.3 went inconsistent earlier today:

# rados list-inconsistent-obj 7.3 -f json-pretty | jq -r '.inconsistents[] | 
.errors, .shards'
[
  "omap_digest_mismatch"
]
[
  {
"osd": 504,
"primary": true,
"errors": [],
"size": 0,
"omap_digest": "0x4c10ee76",
"data_digest": "0x"
  },
  {
"osd": 525,
"primary": false,
"errors": [],
"size": 0,
"omap_digest": "0x26a1241b",
"data_digest": "0x"
  },
  {
"osd": 556,
"primary": false,
"errors": [],
"size": 0,
"omap_digest": "0x26a1241b",
"data_digest": "0x"
  }
]

Since the odd omap_digest is on osd.504 and osd.504 is the primary, we would 
set the primary-affinity to 0 with:

# ceph osd primary-affinity osd.504 0

Do the repair:

# ceph pg repair 7.3

And then once the repair is complete we would set the primary-affinity back to 
1 on osd.504:

# ceph osd primary-affinity osd.504 1

There doesn't appear to be any correlation between the OSDs which would point 
to a hardware issue, and since it's happening on two different clusters I'm 
wondering if there's a race condition that has been fixed in a later version?

Also, what exactly is the omap digest?  From what I can tell it appears to be 
some kind of checksum for the omap data.  Is that correct?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-08 Thread Bryan Stillwell

> On Apr 8, 2019, at 4:38 PM, Gregory Farnum  wrote:
> 
> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell  wrote:
>> 
>> There doesn't appear to be any correlation between the OSDs which would 
>> point to a hardware issue, and since it's happening on two different 
>> clusters I'm wondering if there's a race condition that has been fixed in a 
>> later version?
>> 
>> Also, what exactly is the omap digest?  From what I can tell it appears to 
>> be some kind of checksum for the omap data.  Is that correct?
> 
> Yeah; it's just a crc over the omap key-value data that's checked
> during deep scrub. Same as the data digest.
> 
> I've not noticed any issues around this in Luminous but I probably
> wouldn't have, so will have to leave it up to others if there are
> fixes in since 12.2.8.

Thanks for adding some clarity to that Greg!

For some added information, this is what the logs reported earlier today:

2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster 
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
0x26a1241b != omap_digest 0x4c10ee76 from shard 504
2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster 
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
0x26a1241b != omap_digest 0x4c10ee76 from shard 504

I then tried deep scrubbing it again to see if the data was fine, but the 
digest calculation was just having problems.  It came back with the same 
problem with new digest values:

2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest
0x93bac8f != omap_digest 0xab1b9c6f from shard 504
2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster
[ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest
0x93bac8f != omap_digest 0xab1b9c6f from shard 504

Which makes sense, but doesn’t explain why the omap data is getting out of sync 
across multiple OSDs and clusters…

I’ll see what I can figure out tomorrow, but if anyone else has some hints I 
would love to hear them.

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-09 Thread Bryan Stillwell

> On Apr 8, 2019, at 5:42 PM, Bryan Stillwell  wrote:
> 
> 
>> On Apr 8, 2019, at 4:38 PM, Gregory Farnum  wrote:
>> 
>> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell  
>> wrote:
>>> 
>>> There doesn't appear to be any correlation between the OSDs which would 
>>> point to a hardware issue, and since it's happening on two different 
>>> clusters I'm wondering if there's a race condition that has been fixed in a 
>>> later version?
>>> 
>>> Also, what exactly is the omap digest?  From what I can tell it appears to 
>>> be some kind of checksum for the omap data.  Is that correct?
>> 
>> Yeah; it's just a crc over the omap key-value data that's checked
>> during deep scrub. Same as the data digest.
>> 
>> I've not noticed any issues around this in Luminous but I probably
>> wouldn't have, so will have to leave it up to others if there are
>> fixes in since 12.2.8.
> 
> Thanks for adding some clarity to that Greg!
> 
> For some added information, this is what the logs reported earlier today:
> 
> 2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 
> I then tried deep scrubbing it again to see if the data was fine, but the 
> digest calculation was just having problems.  It came back with the same 
> problem with new digest values:
> 
> 2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest
> 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest
> 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 
> Which makes sense, but doesn’t explain why the omap data is getting out of 
> sync across multiple OSDs and clusters…
> 
> I’ll see what I can figure out tomorrow, but if anyone else has some hints I 
> would love to hear them.

I’ve dug into this more today and it appears that the omap data contains an 
extra entry on the OSDs with the mismatched omap digests.  I then searched the 
RGW logs and found that a DELETE happened shortly after the OSD booted, but the 
omap data wasn’t updated on that OSD so it became mismatched.

Here’s a timeline of the events which caused PG 7.9 to become inconsistent:

2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call happened
2019-04-08 12:06:14 - omap_digest mismatch detected (pg 7.9 is 
active+clean+inconsistent, acting [492,546,523])

Here’s the timeline for PG 7.2b:

2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call happened
2019-04-08 12:42:21 - omap_digest mismatch detected (pg 7.2b is 
active+clean+inconsistent, acting [488,511,541])

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD node trying to possibly start OSDs that were purged

2019-10-29 Thread Bryan Stillwell
On Oct 29, 2019, at 11:23 AM, Jean-Philippe Méthot 
 wrote:
> A few months back, we had one of our OSD node motherboards die. At the time, 
> we simply waited for recovery and purged the OSDs that were on the dead node. 
> We just replaced that node and added back the drives as new OSDs. At the ceph 
> administration level, everything looks fine, no duplicate OSDs when I execute 
> map commands or ask Ceph to list what OSDs are on the node. However, on the 
> OSD node, in /var/log/ceph/ceph-volume, I see that every time the server 
> boots, ceph-volume tries to search for OSD fsids that don’t exist. Here’s the 
> error:
> 
> [2019-10-29 13:12:02,864][ceph_volume][ERROR ] exception caught by decorator
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
> in newfunc
> return f(*a, **kw)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in 
> main
> terminal.dispatch(self.mapper, subcommand_args)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/main.py", 
> line 40, in main
> terminal.dispatch(self.mapper, self.argv)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, **kw)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/trigger.py", 
> line 70, in main
> Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
>   File 
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 
> 339, in main
> self.activate(args)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, **kw)
>   File 
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 
> 249, in activate
> raise RuntimeError('could not find osd.%s with fsid %s' % (osd_id, 
> osd_fsid))
> RuntimeError: could not find osd.213 with fsid 
> 22800a80-2445-41a3-8643-69b4b84d598a
> 
> Of course this fsid ID isn’t listed anywhere in Ceph. Where does ceph-volume 
> get this fsid from? Even when looking at the code, it’s not particularly 
> obvious. This is ceph mimic running on CentOS 7 and bluestore.

That's not the cluster fsid, but the osd fsid.  Try running this command on 
your OSD node to get more details:

ceph-volume inventory --format json-pretty

My guess is there are some systemd files laying around for the old OSDs, or you 
were using 'ceph-volume simple' in the past (check for /etc/ceph/osd/).
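
A quick way to spot leftovers like that (the unit name below is built from the
osd id/fsid in your log and is only an example):

systemctl list-units --all 'ceph-volume@*'
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume
systemctl disable ceph-volume@lvm-213-22800a80-2445-41a3-8643-69b4b84d598a.service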

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Bryan Stillwell
Jelle,

Try putting just the WAL on the Optane NVMe.  I'm guessing your DB is too big 
to fit within 5GB.  We used a 5GB journal on our nodes as well, but when we 
switched to BlueStore (using ceph-volume lvm batch) it created 37GiB logical 
volumes (200GB SSD / 5 or 400GB SSD / 10) for our DBs.
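
If you end up rebuilding one of those OSDs, putting just the WAL on the Optane
looks roughly like this with ceph-volume (the device names are only examples):

ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p1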

Also, injecting those settings into the cluster will only work until the OSD is 
restarted.  You'll need to add them to ceph.conf to be persistent.

Bryan

> On Dec 12, 2019, at 3:40 PM, Jelle de Jong  wrote:
> 
> Notice: This email is from an external sender.
> 
> 
> 
> Hello everybody,
> 
> I got a three node ceph cluster made of E3-1220v3, 24GB ram, 6 hdd osd's
> with 32GB Intel Optane NVMe journal, 10GB networking.
> 
> I wanted to move to bluestore due to dropping support of filestore, our
> cluster was working fine with filestore and we could take complete nodes
> out for maintenance without issues.
> 
> root@ceph04:~# ceph osd pool get libvirt-pool size
> size: 3
> root@ceph04:~# ceph osd pool get libvirt-pool min_size
> min_size: 2
> 
> I removed all osds from one node, zapping the osd and journal devices,
> we recreated the osds as bluestore and used a small 5GB partition as
> rockdb device instead of journal for all osd's.
> 
> I saw the cluster suffer with pgs inactive and slow request.
> 
> I tried setting the following on all nodes, but no diffrence:
> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
> ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
> systemctl restart ceph-osd.target
> 
> It took three days to recover and during this time clients were not
> responsive.
> 
> How can I migrate to bluestore without inactive pgs or slow requests? I
> got several more filestore clusters and I would like to know how to
> migrate without inactive pgs and slow requests?
> 
> As a side question, I optimized our cluster for filestore, the Intel
> Optane NVMe journals showed good fio dsync write tests, does bluestore
> also use dsync writes for rockdb caching or can we select NVMe devices
> on other specifications? My test with filestores showed that Optane NVMe
> SSD was faster then the Samsung NVMe SSD 970 Pro and I only need a a few
> GB for filestore journals, but with bluestore rockdb caching the
> situation is different and I can't find documentation on how to speed
> test NVMe devices for bluestore.
> 
> Kind regards,
> 
> Jelle
> 
> root@ceph04:~# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
> -1   60.04524 root default
> -2   20.01263 host ceph04
> 0   hdd  2.72899 osd.0   up  1.0 1.0
> 1   hdd  2.72899 osd.1   up  1.0 1.0
> 2   hdd  5.45799 osd.2   up  1.0 1.0
> 3   hdd  2.72899 osd.3   up  1.0 1.0
> 14   hdd  3.63869 osd.14  up  1.0 1.0
> 15   hdd  2.72899 osd.15  up  1.0 1.0
> -3   20.01263 host ceph05
> 4   hdd  5.45799 osd.4   up  1.0 1.0
> 5   hdd  2.72899 osd.5   up  1.0 1.0
> 6   hdd  2.72899 osd.6   up  1.0 1.0
> 13   hdd  3.63869 osd.13  up  1.0 1.0
> 16   hdd  2.72899 osd.16  up  1.0 1.0
> 18   hdd  2.72899 osd.18  up  1.0 1.0
> -4   20.01997 host ceph06
> 8   hdd  5.45999 osd.8   up  1.0 1.0
> 9   hdd  2.73000 osd.9   up  1.0 1.0
> 10   hdd  2.73000 osd.10  up  1.0 1.0
> 11   hdd  2.73000 osd.11  up  1.0 1.0
> 12   hdd  3.64000 osd.12  up  1.0 1.0
> 17   hdd  2.73000 osd.17  up  1.0 1.0
> 
> 
> root@ceph04:~# ceph status
>  cluster:
>id: 85873cda-4865-4147-819d-8deda5345db5
>health: HEALTH_WARN
>18962/11801097 objects misplaced (0.161%)
>1/3933699 objects unfound (0.000%)
>Reduced data availability: 42 pgs inactive
>Degraded data redundancy: 3645135/11801097 objects degraded
> (30.888%), 959 pgs degraded, 960 pgs undersized
>110 slow requests are blocked > 32 sec. Implicated osds 3,10,11
> 
>  services:
>mon: 3 daemons, quorum ceph04,ceph05,ceph06
>mgr: ceph04(active), standbys: ceph06, ceph05
>osd: 18 osds: 18 up, 18 in; 964 remapped pgs
> 
>  data:
>pools:   1 pools, 1024 pgs
>objects: 3.93M objects, 15.0TiB
>usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
>pgs: 4.102% pgs not active
> 3645135/11801097 objects degraded (30.888%)
> 18962/11801097 objects misplaced (0.161%)
> 1/3933699 objects unfound (0.000%)
> 913 active+undersized+degraded+remapped+backfill_wait
> 60  active+clean
> 41  activating+undersized+degraded+remapped
> 4   active+remapped+backfill_wait
> 4   active+undersized+degraded+remapped+backfillin