[ceph-users] RadosGW cant list objects when there are too many of them

2019-10-17 Thread Arash Shams
Dear All

I have a bucket with 5 million objects and I can't list them with
radosgw-admin bucket list --bucket=bucket | jq .[].name
or by listing files using boto3:

s3 = boto3.client('s3',
  endpoint_url=credentials['endpoint_url'],
  aws_access_key_id=credentials['access_key'],
  aws_secret_access_key=credentials['secret_key'])

response = s3.list_objects_v2(Bucket=bucket_name)
for item in response['Contents']:
    print(item['Key'])

What is the solution? How can I get a list of my objects?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recovering from a Failed Disk (replication 1)

2019-10-17 Thread Burkhard Linke

Hi,


On 10/17/19 5:56 AM, Ashley Merrick wrote:
I think you're better off doing the dd method; you can export and import 
a PG at a time (ceph-objectstore-tool)


But if the disk is failing, a dd is probably your best method.



In case of hardware problems or broken sectors, I would recommend 
'dd_rescue' instead of dd. It can handle broken sectors, automatic 
retries, skipping, etc.



You will definitely need a second disk to rescue to.


Regards,

Burkhard


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Paul Emmerich
On Thu, Oct 17, 2019 at 12:17 AM Robert LeBlanc  wrote:
>
> On Wed, Oct 16, 2019 at 2:50 PM Paul Emmerich  wrote:
> >
> > On Wed, Oct 16, 2019 at 11:23 PM Robert LeBlanc  
> > wrote:
> > >
> > > On Tue, Oct 15, 2019 at 8:05 AM Robert LeBlanc  
> > > wrote:
> > > >
> > > > On Mon, Oct 14, 2019 at 2:58 PM Paul Emmerich  
> > > > wrote:
> > > > >
> > > > > Could the 4 GB GET limit saturate the connection from rgw to Ceph?
> > > > > Simple to test: just rate-limit the health check GET
> > > >
> > > > I don't think so, we have dual 25Gbp in a LAG, so Ceph to RGW has
> > > > multiple paths, but we aren't balancing on port yet, so RGW to HAProxy
> > > > is probably limited to one link.
> > > >
> > > > > Did you increase "objecter inflight ops" and "objecter inflight op 
> > > > > bytes"?
> > > > > You absolutely should adjust these settings for large RGW setups,
> > > > > defaults of 1024 and 100 MB are way too low for many RGW setups, we
> > > > > default to 8192 and 800MB
> > >
> > > On Nautilus the defaults already seem to be:
> > > objecter_inflight_op_bytes 104857600
> > >   default
> > = 100MiB
> >
> > > objecter_inflight_ops  24576
> > >   default
> >
> > not sure where you got this from, but the default is still 1024 even
> > in master: 
> > https://github.com/ceph/ceph/blob/4774808cb2923f65f6919fe8be5f98917075cdd7/src/common/options.cc#L2288
>
> Looks like it is overridden in
> https://github.com/ceph/ceph/blob/4774808cb2923f65f6919fe8be5f98917075cdd7/src/rgw/rgw_main.cc#L187

You are right, this is new in Nautilus. The last time I had to play around
with these settings was indeed on a Mimic deployment.

> I'm just not
> understanding how your suggestions would help, the problem doesn't
> seem to be on the RADOS side (which it appears your tweaks target),
> but on the HTTP side as an HTTP health check takes a long time to come
> back when a big transfer is going on.

I was guessing a bottleneck on the RADOS side because you mentioned
that you tried both civetweb and beast; it's somewhat unlikely to run into
the exact same problem with both.

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW cant list objects when there are too many of them

2019-10-17 Thread Paul Emmerich
Listing large buckets is slow due to S3 ordering requirements; it's
approximately O(n^2).
However, I wouldn't consider 5M to be a large bucket; it should go to
only ~50 shards, which should still perform reasonably. How fast are
your metadata OSDs?

Try --allow-unordered in radosgw-admin to get an unordered result
which is only O(n) as you'd expect.

For boto3: I'm not sure if v2 object listing is available yet (I think
it has only been merged into master but hasn't made it into a
release?). It doesn't support unordered listing, but there has been
some work to implement that; I'm not sure about the current state.
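
For reference, a minimal boto3 pagination sketch (reusing the credentials
dict and bucket_name from the original post; the paginator names are
standard boto3):

    import boto3

    # 'credentials' dict and 'bucket_name' as defined in the original post
    s3 = boto3.client('s3',
                      endpoint_url=credentials['endpoint_url'],
                      aws_access_key_id=credentials['access_key'],
                      aws_secret_access_key=credentials['secret_key'])

    # A single list_objects_v2 call returns at most 1000 keys; the paginator
    # follows continuation tokens until the whole bucket has been listed.
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        for item in page.get('Contents', []):
            print(item['Key'])

    # If the gateway doesn't support v2 listing yet, the same pattern works
    # with the v1 paginator: s3.get_paginator('list_objects').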



Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 17, 2019 at 9:19 AM Arash Shams  wrote:
>
> Dear All
>
> I have a bucket with 5 million Objects and I cant list objects with
> radosgw-admin bucket list --bucket=bucket | jq .[].name
> or listing files using boto3
>
> s3 = boto3.client('s3',
>   endpoint_url=credentials['endpoint_url'],
>   aws_access_key_id=credentials['access_key'],
>   aws_secret_access_key=credentials['secret_key'])
>
> response = s3.list_objects_v2(Bucket=bucket_name)
> for item in response['Contents']:
> print(item['Key'])
>
> what is the solution ? how can I find list of my objects ?
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recovering from a Failed Disk (replication 1)

2019-10-17 Thread Frank Schilder
You probably need to attempt a physical data rescue. Data access will be lost 
until this is done.

The first thing is to shut down the OSD to avoid any further damage to the disk.
The second is to try ddrescue: repair the data on a copy if possible, and then 
create a clone on a new disk from the copy.
If this doesn't help and you really need that last bit of data, you might need 
support from one of those companies that restore disk data with electron 
microscopy.

I successfully transferred OSDs between disks using ddrescue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: vladimir franciz blando 
Sent: 17 October 2019 05:29:13
To: ceph-users@ceph.io
Subject: [ceph-users] Recovering from a Failed Disk (replication 1)

Hi,

I have a less-than-ideal setup on one of my clusters: 3 Ceph nodes, but using 
replication 1 on all pools (don't ask me why replication 1; it's a long story).

Now a disk keeps crashing, possibly a hardware failure, and I need to recover 
from that.

What's my best option for recovering the data from the failed disk and 
transferring it to the other healthy disks?

This cluster is running Firefly.

- Vlad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-users Digest, Vol 81, Issue 39 Re:RadosGW cant list objects when there are too many of them

2019-10-17 Thread Romit Misra
Hi Arash,
If the number of objects in a bucket is very large, on the order of
millions, a paginated listing approach works better.
There are also certain RGW configs that control how big an RGW response
can be (in terms of number of objects; by default I believe this is 1000).
The code for paginated listing (the snippet can be modified):

try:
    buckethandle = s3_conn_src.get_bucket(bucket_name)
    marker = ''  # start from the beginning of the bucket
    while True:
        keys = buckethandle.get_all_keys(max_keys=1000, marker=marker)
        for k in keys:
            # do operation on keys (which are the objects)
            print k.name
            # update marker
            marker = k.name
        if keys.is_truncated is False:
            print "Breaking"
            break
except Exception, e:
    print e

Thanks
Romit Misra




On Thu, Oct 17, 2019 at 4:18 PM  wrote:


[ceph-users] Re: RadosGW cant list objects when there are too many of them

2019-10-17 Thread Casey Bodley
When you say that you can't list it with boto or radosgw-admin, what 
happens? Does it give you an error, or just hang/timeout? How many 
shards does the bucket have?


On 10/17/19 6:00 AM, Paul Emmerich wrote:

Listing large buckets is slow due to S3 ordering requirements, it's
approximately O(n^2).
However, I wouldn't consider 5M to be a large bucket, it should go to
only ~50 shards which should still perform reasonable. How fast are
your metadata OSDs?


I just wanted to share that recent work by Eric and Mark is showing huge 
improvements in sharded listing performance: 
https://github.com/ceph/ceph/pull/30853#issuecomment-541964967



Try --allow-unordered in radosgw-admin to get an unordered result
which is only O(n) as you'd expect.

For boto3: I'm not sure if v2 object listing is available yet (I think
it has only been merged into master but has not yet made it into a
release?). It doesn't support unordered listing but there has been
some work to implement it there, not sure about the current state.



Paul


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-17 Thread Stig Telfer
Hi All - 

I did some investigation into Ceph RDMA as part of a performance analysis 
project working with Ceph over Omni-Path and NVMe.

I wrote up some of the analysis here: 
https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-human-brain-project.html
 


My conclusion at the time was that Ceph’s RDMA support was not portable across 
different RDMA-capable network fabrics, but that RoCE worked pretty well.  
Unfortunately, on the hardware I had available for RoCE testing the network was 
not the bottleneck, so I didn’t see any compelling advantage.  It would be 
great to do this testing again on a system with the potential to really shine.

This work concluded about a year ago, so might be a little out of date.

Best wishes,
Stig


> On 15 Oct 2019, at 13:46, Paul Emmerich  wrote:
> 
> That's apply/commit latency (the exact same since BlueStore btw, no
> point in tracking both). It should not contain any network component.
> 
> Since the path you are optimizing is inter-OSD communication: check
> out subop latency, that's the one where this should show up.
> 
> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> On Tue, Oct 15, 2019 at 2:39 PM  wrote:
>> 
>> I don't see any changes here...
>> 
>>> There is graph here. It was pure Nautilus before 10-05 and
>>> Nautilus+RDMA after.
>>> https://nc.avalon.org.ua/s/LptPTEaTeTTyKtD
>>> Link expires on Nov 1.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 2:50 AM Paul Emmerich  wrote:
>
> On Thu, Oct 17, 2019 at 12:17 AM Robert LeBlanc  wrote:
> >
> > On Wed, Oct 16, 2019 at 2:50 PM Paul Emmerich  
> > wrote:
> > >
> > > On Wed, Oct 16, 2019 at 11:23 PM Robert LeBlanc  
> > > wrote:
> > > >
> > > > On Tue, Oct 15, 2019 at 8:05 AM Robert LeBlanc  
> > > > wrote:
> > > > >
> > > > > On Mon, Oct 14, 2019 at 2:58 PM Paul Emmerich 
> > > > >  wrote:
> > > > > >
> > > > > > Could the 4 GB GET limit saturate the connection from rgw to Ceph?
> > > > > > Simple to test: just rate-limit the health check GET
> > > > >
> > > > > I don't think so, we have dual 25Gbp in a LAG, so Ceph to RGW has
> > > > > multiple paths, but we aren't balancing on port yet, so RGW to HAProxy
> > > > > is probably limited to one link.
> > > > >
> > > > > > Did you increase "objecter inflight ops" and "objecter inflight op 
> > > > > > bytes"?
> > > > > > You absolutely should adjust these settings for large RGW setups,
> > > > > > defaults of 1024 and 100 MB are way too low for many RGW setups, we
> > > > > > default to 8192 and 800MB
> > > >
> > > > On Nautilus the defaults already seem to be:
> > > > objecter_inflight_op_bytes 104857600
> > > >   default
> > > = 100MiB
> > >
> > > > objecter_inflight_ops  24576
> > > >   default
> > >
> > > not sure where you got this from, but the default is still 1024 even
> > > in master: 
> > > https://github.com/ceph/ceph/blob/4774808cb2923f65f6919fe8be5f98917075cdd7/src/common/options.cc#L2288
> >
> > Looks like it is overridden in
> > https://github.com/ceph/ceph/blob/4774808cb2923f65f6919fe8be5f98917075cdd7/src/rgw/rgw_main.cc#L187
>
> you are right, this is new in Nautilus. Last time I had to play around
> with these settings was indeed on a Mimic deployment.
>
> > I'm just not
> > understanding how your suggestions would help, the problem doesn't
> > seem to be on the RADOS side (which it appears your tweaks target),
> > but on the HTTP side as an HTTP health check takes a long time to come
> > back when a big transfer is going on.
>
> I was guessing a bottleneck on the RADOS side because you mentioned
> that you tried both civetweb and beast, somewhat unlikely to run into
> the exact same problem with both

Looping in ceph-dev in case they have some insights into the inner
workings that may be helpful.

From what I understand, civetweb was not async and beast is, but if
beast is not coded exactly right, then it could behave similarly to
civetweb.

It seems that with beast, incoming requests are being assigned to beast
threads, and possibly each does a sync call to rados, therefore
blocking the requests behind it until the RADOS call is completed. I tried
looking through the code, but I'm not familiar with async in C++. I
could see two options that may resolve this. First, have a separate
thread pool for accessing RADOS objects, with a queue that beast
dispatches to and a callback on completion. The second
option is creating async RADOS calls so that it can yield the event
loop to another RADOS task. I couldn't tell if either of these is
being done, but that should help small IO not get stuck behind large
IO.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Casey Bodley



On 10/17/19 10:58 AM, Robert LeBlanc wrote:

On Thu, Oct 17, 2019 at 2:50 AM Paul Emmerich  wrote:

On Thu, Oct 17, 2019 at 12:17 AM Robert LeBlanc  wrote:

On Wed, Oct 16, 2019 at 2:50 PM Paul Emmerich  wrote:

On Wed, Oct 16, 2019 at 11:23 PM Robert LeBlanc  wrote:

On Tue, Oct 15, 2019 at 8:05 AM Robert LeBlanc  wrote:

On Mon, Oct 14, 2019 at 2:58 PM Paul Emmerich  wrote:

Could the 4 GB GET limit saturate the connection from rgw to Ceph?
Simple to test: just rate-limit the health check GET

I don't think so, we have dual 25Gbp in a LAG, so Ceph to RGW has
multiple paths, but we aren't balancing on port yet, so RGW to HAProxy
is probably limited to one link.


Did you increase "objecter inflight ops" and "objecter inflight op bytes"?
You absolutely should adjust these settings for large RGW setups,
defaults of 1024 and 100 MB are way too low for many RGW setups, we
default to 8192 and 800MB

On Nautilus the defaults already seem to be:
objecter_inflight_op_bytes 104857600
   default

= 100MiB


objecter_inflight_ops  24576
   default

not sure where you got this from, but the default is still 1024 even
in master: 
https://github.com/ceph/ceph/blob/4774808cb2923f65f6919fe8be5f98917075cdd7/src/common/options.cc#L2288

Looks like it is overridden in
https://github.com/ceph/ceph/blob/4774808cb2923f65f6919fe8be5f98917075cdd7/src/rgw/rgw_main.cc#L187

you are right, this is new in Nautilus. Last time I had to play around
with these settings was indeed on a Mimic deployment.


I'm just not
understanding how your suggestions would help, the problem doesn't
seem to be on the RADOS side (which it appears your tweaks target),
but on the HTTP side as an HTTP health check takes a long time to come
back when a big transfer is going on.

I was guessing a bottleneck on the RADOS side because you mentioned
that you tried both civetweb and beast, somewhat unlikely to run into
the exact same problem with both

Looping in ceph-dev in case they have some insights into the inner
workings that may be helpful.

 From what I understand civitweb was not async and beast is, but if
beast is not coded exactly right, then it could behave similarly as
civitweb.


With respect to this issue, civetweb and beast should behave the same. 
Both frontends have a large thread pool, and their calls to 
process_request() run synchronously (including blocking on rados 
requests) on a frontend thread. So once there are more concurrent client 
connections than there are frontend threads, new connections will block 
until there's a thread available to service them.
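
To illustrate the effect with a toy (non-RGW) Python sketch: a fixed pool of 
two "frontend threads" running requests synchronously, where a tiny health 
check has to wait for a free thread once both are busy with large GETs. The 
names and timings are made up for illustration only:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def large_get(seconds):
        time.sleep(seconds)          # stands in for a blocking rados read
        return "large GET done"

    def health_check():
        return "health check done"   # trivial amount of work

    with ThreadPoolExecutor(max_workers=2) as frontend_threads:
        t0 = time.monotonic()
        big = [frontend_threads.submit(large_get, 3) for _ in range(2)]
        probe = frontend_threads.submit(health_check)
        probe.result()               # only completes once a thread frees up
        print("health check latency: %.1fs" % (time.monotonic() - t0))  # ~3s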




It seems that with beast incoming requests are being assigned to BEAST
threads and possibly it is doing as sync call to rados therefore
blocking requests behind it until the RADOS call is completed. I tried
looking through the code, but I'm not familiar with async in C++. I
could see two options that may resolve this. First, have a seperate
thread pool for accessing RADOS objects with a queue that BEAST
dispatches to and callback the completion at the end. The second
option is creating async RADOS calls so that it can yield the event
loop to another RADOS task. I couldn't tell if either one of these are
being done, but that should help small IO not get stuck behind large
IO.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 9:22 AM Casey Bodley  wrote:

> With respect to this issue, civetweb and beast should behave the same.
> Both frontends have a large thread pool, and their calls to
> process_request() run synchronously (including blocking on rados
> requests) on a frontend thread. So once there are more concurrent client
> connections than there are frontend threads, new connections will block
> until there's a thread available to service them.

Okay, this really helps me understand what's going on here. Are there
plans to remove the synchronous calls and make them async, or improve
this flow a bit?

Currently I'm seeing 1024 max concurrent ops and a 512 thread pool. Does
this mean that, with equally distributed requests, one op could be
processing on the backend RADOS with another queued behind it waiting?
Is this done in round-robin fashion, so that with 99% small IO a very long
RADOS request can get many IOs blocked behind it because requests are
round-robin dispatched to the thread pool? (I assume the latter is
what I'm seeing.)

rgw_max_concurrent_requests1024
rgw_thread_pool_size   512

If I match the two, do you think it would help prevent small IO from
being blocked by larger IO?

I'm also happy to look into the code and suggest improvements; some
quick pointers into the code to get started would help.



Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus power outage - 2/3 mons and mgrs dead and no cephfs

2019-10-17 Thread Alex L
Hi,
I am still having issues accessing my CephFS and have managed to pull out more 
interesting logs. I have also enabled 20/20 logging, which I intend to upload as 
soon as my Ceph tracker account gets accepted.

Oct 17 16:35:22 pve21 kernel: libceph: read_partial_message 8ae0e636 
signature check failed
Oct 17 16:35:22 pve21 kernel: libceph: mds0 192.168.1.22:6801 bad crc/signature

Oct 17 16:49:14 pve23 pvestatd[3150]: mount error: exit code 5
Oct 17 16:49:19 pve23 ceph-mon[2373]: 2019-10-17 16:49:19.559 7ff2f21d0700 -1 
mon.pve23@2(electing) e20 failed to get devid for : fallback method has serial 
''but no model
[   39.843048] libceph: read_partial_message 10ae5ee0 
signature check failed
[   39.843062] libceph: mds0 192.168.1.22:6801 bad crc/signature

Oct 17 16:54:18 pve21 ceph-mon[2215]: 2019-10-17 16:54:18.163 7f3ccd47d700 -1 
log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline 
(MDS_ALL_DOWN)

Thanks!
A
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus power outage - 2/3 mons and mgrs dead and no cephfs

2019-10-17 Thread Alex L
Final update.

I switched the below from false and everything magically started working!
cephx_require_signatures = true
cephx_cluster_require_signatures = true
cephx_sign_messages = true
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Casey Bodley



On 10/17/19 12:59 PM, Robert LeBlanc wrote:

On Thu, Oct 17, 2019 at 9:22 AM Casey Bodley  wrote:


With respect to this issue, civetweb and beast should behave the same.
Both frontends have a large thread pool, and their calls to
process_request() run synchronously (including blocking on rados
requests) on a frontend thread. So once there are more concurrent client
connections than there are frontend threads, new connections will block
until there's a thread available to service them.

Okay, this really helps me understand what's going on here. Is there
plans to remove the synchronous calls and make them async or improve
this flow a bit?


Absolutely yes, this work has been in progress for a long time now, and 
octopus does get a lot of concurrency here. Eventually, all of 
process_request() will be async-enabled, and we'll be able to run beast 
with a much smaller thread pool.




Currently I'm seeing 1024 max concurrent ops and 512 thread pool. Does
this mean that on an equally distributed requests that one op could be
processing on the backend RADOS with another queued behind it waiting?
Is this done in round robin fashion so for 99% small io, a very long
RADOS request can get many IO blocked behind it because it is being
round-robin dispatched to the thread pool? (I assume the latter is
what I'm seeing).

rgw_max_concurrent_requests1024
rgw_thread_pool_size   512

If I match the two, do you think it would help prevent small IO from
being blocked by larger IO?
rgw_max_concurrent_requests was added in support of the beast/async 
work, precisely because (post-Nautilus) the number of beast threads will 
no longer limit the number of concurrent requests. This variable is what 
throttles incoming requests to prevent radosgw's resource consumption 
from ballooning under heavy workload. And unlike the existing model 
where a request remains in the queue until a thread is ready to service 
it, any requests that exceed rgw_max_concurrent_requests will be 
rejected with '503 SlowDown' in s3 or '498 Rate Limited' in swift.
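
On the client side, a hedged sketch of coping with that throttling response, 
assuming the error surfaces through botocore with the 'SlowDown' code as it 
does for S3 (the endpoint, bucket and retry budget here are illustrative 
assumptions):

    import time
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client('s3', endpoint_url='http://rgw.example.com')  # hypothetical endpoint

    def put_with_backoff(bucket, key, body, attempts=5):
        for attempt in range(attempts):
            try:
                return s3.put_object(Bucket=bucket, Key=key, Body=body)
            except ClientError as e:
                if e.response['Error']['Code'] != 'SlowDown':
                    raise                    # not a throttling error
                time.sleep(2 ** attempt)     # back off, then retry
        raise RuntimeError("still throttled after %d attempts" % attempts)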


With respect to prioritization, there isn't any by default but we do 
have a prototype request scheduler that uses dmclock to prioritize 
requests based on some hard-coded request classes. It's not especially 
useful in its current form, but we do have plans to further elaborate 
the classes and eventually pass the information down to osds for 
integrated QOS.


As of nautilus, though, the thread pool size is the only effective knob 
you have.



I'm also happy to look into the code to suggest improvements if you
can give me some quick points into the code to start will help.



Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 11:46 AM Casey Bodley  wrote:
>
>
> On 10/17/19 12:59 PM, Robert LeBlanc wrote:
> > On Thu, Oct 17, 2019 at 9:22 AM Casey Bodley  wrote:
> >
> >> With respect to this issue, civetweb and beast should behave the same.
> >> Both frontends have a large thread pool, and their calls to
> >> process_request() run synchronously (including blocking on rados
> >> requests) on a frontend thread. So once there are more concurrent client
> >> connections than there are frontend threads, new connections will block
> >> until there's a thread available to service them.
> > Okay, this really helps me understand what's going on here. Is there
> > plans to remove the synchronous calls and make them async or improve
> > this flow a bit?
>
> Absolutely yes, this work has been in progress for a long time now, and
> octopus does get a lot of concurrency here. Eventually, all of
> process_request() will be async-enabled, and we'll be able to run beast
> with a much smaller thread pool.

This is great news. Is there anything we can do to help in this effort,
as it is very important for us?

> > Currently I'm seeing 1024 max concurrent ops and 512 thread pool. Does
> > this mean that on an equally distributed requests that one op could be
> > processing on the backend RADOS with another queued behind it waiting?
> > Is this done in round robin fashion so for 99% small io, a very long
> > RADOS request can get many IO blocked behind it because it is being
> > round-robin dispatched to the thread pool? (I assume the latter is
> > what I'm seeing).
> >
> > rgw_max_concurrent_requests1024
> > rgw_thread_pool_size   512
> >
> > If I match the two, do you think it would help prevent small IO from
> > being blocked by larger IO?
> rgw_max_concurrent_requests was added in support of the beast/async
> work, precisely because (post-Nautilus) the number of beast threads will
> no longer limit the number of concurrent requests. This variable is what
> throttles incoming requests to prevent radosgw's resource consumption
> from ballooning under heavy workload. And unlike the existing model
> where a request remains in the queue until a thread is ready to service
> it, any requests that exceed rgw_max_concurrent_requests will be
> rejected with '503 SlowDown' in s3 or '498 Rate Limited' in swift.
>
> With respect to prioritization, there isn't any by default but we do
> have a prototype request scheduler that uses dmclock to prioritize
> requests based on some hard-coded request classes. It's not especially
> useful in its current form, but we do have plans to further elaborate
> the classes and eventually pass the information down to osds for
> integrated QOS.
>
> As of nautilus, though, the thread pool size is the only effective knob
> you have.

Do you see any problems with running 2k-4k threads if we have the RAM to do so?


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Matt Benjamin
My impression is that running a second gateway (assuming 1 at present)
on the same host would be preferable to running one with a very high
thread count, and that 1024 is a good maximum value for the thread count.

Matt

On Thu, Oct 17, 2019 at 4:01 PM Robert LeBlanc  wrote:
>
> On Thu, Oct 17, 2019 at 11:46 AM Casey Bodley  wrote:
> >
> >
> > On 10/17/19 12:59 PM, Robert LeBlanc wrote:
> > > On Thu, Oct 17, 2019 at 9:22 AM Casey Bodley  wrote:
> > >
> > >> With respect to this issue, civetweb and beast should behave the same.
> > >> Both frontends have a large thread pool, and their calls to
> > >> process_request() run synchronously (including blocking on rados
> > >> requests) on a frontend thread. So once there are more concurrent client
> > >> connections than there are frontend threads, new connections will block
> > >> until there's a thread available to service them.
> > > Okay, this really helps me understand what's going on here. Is there
> > > plans to remove the synchronous calls and make them async or improve
> > > this flow a bit?
> >
> > Absolutely yes, this work has been in progress for a long time now, and
> > octopus does get a lot of concurrency here. Eventually, all of
> > process_request() will be async-enabled, and we'll be able to run beast
> > with a much smaller thread pool.
>
> This is great news. Anything we can do to help in this effort as it is
> very important for us?
>
> > > Currently I'm seeing 1024 max concurrent ops and 512 thread pool. Does
> > > this mean that on an equally distributed requests that one op could be
> > > processing on the backend RADOS with another queued behind it waiting?
> > > Is this done in round robin fashion so for 99% small io, a very long
> > > RADOS request can get many IO blocked behind it because it is being
> > > round-robin dispatched to the thread pool? (I assume the latter is
> > > what I'm seeing).
> > >
> > > rgw_max_concurrent_requests1024
> > > rgw_thread_pool_size   512
> > >
> > > If I match the two, do you think it would help prevent small IO from
> > > being blocked by larger IO?
> > rgw_max_concurrent_requests was added in support of the beast/async
> > work, precisely because (post-Nautilus) the number of beast threads will
> > no longer limit the number of concurrent requests. This variable is what
> > throttles incoming requests to prevent radosgw's resource consumption
> > from ballooning under heavy workload. And unlike the existing model
> > where a request remains in the queue until a thread is ready to service
> > it, any requests that exceed rgw_max_concurrent_requests will be
> > rejected with '503 SlowDown' in s3 or '498 Rate Limited' in swift.
> >
> > With respect to prioritization, there isn't any by default but we do
> > have a prototype request scheduler that uses dmclock to prioritize
> > requests based on some hard-coded request classes. It's not especially
> > useful in its current form, but we do have plans to further elaborate
> > the classes and eventually pass the information down to osds for
> > integrated QOS.
> >
> > As of nautilus, though, the thread pool size is the only effective knob
> > you have.
>
> Do you see any problems with running 2k-4k threads if we have the RAM to do 
> so?
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>


-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 1:05 PM Matt Benjamin  wrote:
>
> My impression is that running a second gateway (assuming 1 at present)
> on the same host would be preferable to running one with very high
> thread count, also that 1024 is a good maximum value for thread count.

We are running 4 RGW containers per host and have 4 hosts for 16 RGW
instances. This cluster was doing 200,000+ IOs per second according to
`ceph -s`. We can expect more large objects as the system load is
increased, but the vast majority of objects will be really small. I'm
almost tempted to have more threads in the thread pool than max
concurrent requests so that there is a buffer of idle threads rather
than IOs being scheduled behind one another.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Casey Bodley



On 10/17/19 4:00 PM, Robert LeBlanc wrote:

On Thu, Oct 17, 2019 at 11:46 AM Casey Bodley  wrote:


On 10/17/19 12:59 PM, Robert LeBlanc wrote:

On Thu, Oct 17, 2019 at 9:22 AM Casey Bodley  wrote:


With respect to this issue, civetweb and beast should behave the same.
Both frontends have a large thread pool, and their calls to
process_request() run synchronously (including blocking on rados
requests) on a frontend thread. So once there are more concurrent client
connections than there are frontend threads, new connections will block
until there's a thread available to service them.

Okay, this really helps me understand what's going on here. Is there
plans to remove the synchronous calls and make them async or improve
this flow a bit?

Absolutely yes, this work has been in progress for a long time now, and
octopus does get a lot of concurrency here. Eventually, all of
process_request() will be async-enabled, and we'll be able to run beast
with a much smaller thread pool.

This is great news. Anything we can do to help in this effort as it is
very important for us?


We would love help here. Most of the groundwork is done, so the 
remaining work is mostly mechanical.


To summarize the strategy, the beast frontend spawns a coroutine for 
each client connection, and that coroutine is represented by a 
boost::asio::yield_context. We wrap this in an 'optional_yield' struct 
that gets passed to process_request(). The civetweb frontend always 
passes an empty object (ie null_yield) so that everything runs 
synchronously. When making calls into librados, we have a 
rgw_rados_operate() function that supports this optional_yield argument. 
If it gets a null_yield, it calls the blocking version of 
librados::IoCtx::operate(). Otherwise it calls a special 
librados::async_operate() function which suspends the coroutine until 
completion instead of blocking the thread.


So most of the remaining work is in plumbing this optional_yield 
variable through all of the code paths under process_request() that call 
into librados. The rgw_rados_operate() helpers will log a "WARNING: 
blocking librados call" whenever they block inside of a beast frontend 
thread, so we can go through the rgw log to identify all of the places 
that still need a yield context. By iterating on this process, we can 
eventually remove all of the blocking calls, then set up regression 
testing to verify that no rgw logs contain that warning.


Here's an example pr from Ali that adds the optional_yield to requests 
for bucket instance info: https://github.com/ceph/ceph/pull/27898. It 
extends the get_bucket_info() call to take optional_yield, and passes 
one in where available, using null_yield to mark the synchronous cases 
where one isn't available.





Currently I'm seeing 1024 max concurrent ops and 512 thread pool. Does
this mean that on an equally distributed requests that one op could be
processing on the backend RADOS with another queued behind it waiting?
Is this done in round robin fashion so for 99% small io, a very long
RADOS request can get many IO blocked behind it because it is being
round-robin dispatched to the thread pool? (I assume the latter is
what I'm seeing).

rgw_max_concurrent_requests1024
rgw_thread_pool_size   512

If I match the two, do you think it would help prevent small IO from
being blocked by larger IO?

rgw_max_concurrent_requests was added in support of the beast/async
work, precisely because (post-Nautilus) the number of beast threads will
no longer limit the number of concurrent requests. This variable is what
throttles incoming requests to prevent radosgw's resource consumption
from ballooning under heavy workload. And unlike the existing model
where a request remains in the queue until a thread is ready to service
it, any requests that exceed rgw_max_concurrent_requests will be
rejected with '503 SlowDown' in s3 or '498 Rate Limited' in swift.

With respect to prioritization, there isn't any by default but we do
have a prototype request scheduler that uses dmclock to prioritize
requests based on some hard-coded request classes. It's not especially
useful in its current form, but we do have plans to further elaborate
the classes and eventually pass the information down to osds for
integrated QOS.

As of nautilus, though, the thread pool size is the only effective knob
you have.

Do you see any problems with running 2k-4k threads if we have the RAM to do so?


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 2:03 PM Casey Bodley  wrote:
> > This is great news. Anything we can do to help in this effort as it is
> > very important for us?
>
> We would love help here. While most of the groundwork is done, so the
> remaining work is mostly mechanical.
>
> To summarize the strategy, the beast frontend spawns a coroutine for
> each client connection, and that coroutine is represented by a
> boost::asio::yield_context. We wrap this in an 'optional_yield' struct
> that gets passed to process_request(). The civetweb frontend always
> passes an empty object (ie null_yield) so that everything runs
> synchronously. When making calls into librados, we have a
> rgw_rados_operate() function that supports this optional_yield argument.
> If it gets a null_yield, it calls the blocking version of
> librados::IoCtx::operate(). Otherwise it calls a special
> librados::async_operate() function which suspends the coroutine until
> completion instead of blocking the thread.
>
> So most of the remaining work is in plumbing this optional_yield
> variable through all of the code paths under process_request() that call
> into librados. The rgw_rados_operate() helpers will log a "WARNING:
> blocking librados call" whenever they block inside of a beast frontend
> thread, so we can go through the rgw log to identify all of the places
> that still need a yield context. By iterating on this process, we can
> eventually remove all of the blocking calls, then set up regression
> testing to verify that no rgw logs contain that warning.
>
> Here's an example pr from Ali that adds the optional_yield to requests
> for bucket instance info: https://github.com/ceph/ceph/pull/27898. It
> extends the get_bucket_info() call to take optional_yield, and passes
> one in where available, using null_yield to mark the synchronous cases
> where one isn't available.

I'll work to get familiar with the code base and see if I can submit
some PRs to help out. Things are a bit crazy, but this is very
important to us too.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-17 Thread Matt Benjamin
Thanks very much, Robert.

Matt

On Thu, Oct 17, 2019 at 5:24 PM Robert LeBlanc  wrote:
>
> On Thu, Oct 17, 2019 at 2:03 PM Casey Bodley  wrote:
> > > This is great news. Anything we can do to help in this effort as it is
> > > very important for us?
> >
> > We would love help here. While most of the groundwork is done, so the
> > remaining work is mostly mechanical.
> >
> > To summarize the strategy, the beast frontend spawns a coroutine for
> > each client connection, and that coroutine is represented by a
> > boost::asio::yield_context. We wrap this in an 'optional_yield' struct
> > that gets passed to process_request(). The civetweb frontend always
> > passes an empty object (ie null_yield) so that everything runs
> > synchronously. When making calls into librados, we have a
> > rgw_rados_operate() function that supports this optional_yield argument.
> > If it gets a null_yield, it calls the blocking version of
> > librados::IoCtx::operate(). Otherwise it calls a special
> > librados::async_operate() function which suspends the coroutine until
> > completion instead of blocking the thread.
> >
> > So most of the remaining work is in plumbing this optional_yield
> > variable through all of the code paths under process_request() that call
> > into librados. The rgw_rados_operate() helpers will log a "WARNING:
> > blocking librados call" whenever they block inside of a beast frontend
> > thread, so we can go through the rgw log to identify all of the places
> > that still need a yield context. By iterating on this process, we can
> > eventually remove all of the blocking calls, then set up regression
> > testing to verify that no rgw logs contain that warning.
> >
> > Here's an example pr from Ali that adds the optional_yield to requests
> > for bucket instance info: https://github.com/ceph/ceph/pull/27898. It
> > extends the get_bucket_info() call to take optional_yield, and passes
> > one in where available, using null_yield to mark the synchronous cases
> > where one isn't available.
>
> I'll work to get familiar with the code base and see if I can submit
> some PRs to help out. Things are a bit crazy, but this is very
> important to us too.
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io