Cluster protection, mainly. Swift's rate limiting is based on write requests (e.g. PUT, POST, DELETE) per second per container. Since a large number of object writes to a single container can cause some background processes to back up and stop servicing other requests, limiting the ops/sec to a container limits the impact of one busy resource on the rest of the cluster.
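As a rough sketch of the idea (not Swift's actual middleware, which coordinates through memcache as described further down), give each container a "next allowed write" timestamp and push it forward by 1/rate on every write; a request that arrives early gets told how long to sleep. The class and method names here are made up for illustration:

```python
import time

class ContainerWriteLimiter:
    """Toy per-container write rate limiter. Each container gets a
    'next allowed request' timestamp, advanced by 1/rate per write."""

    def __init__(self, max_writes_per_sec):
        self.interval = 1.0 / max_writes_per_sec
        self.next_allowed = {}  # container name -> earliest allowed time

    def sleep_needed(self, container, now=None):
        """Return how many seconds the caller should sleep before
        performing this write against `container`."""
        now = time.time() if now is None else now
        allowed = max(self.next_allowed.get(container, now), now)
        self.next_allowed[container] = allowed + self.interval
        return max(0.0, allowed - now)
```

In a real deployment the `next_allowed` state would have to live somewhere shared (Swift uses a memcache pool) so every proxy process enforces the same limit.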
As others have mentioned, rate limiting at the load balancing layer is fine too
(assuming the LB can limit with enough granularity to matter). Most Swift
clusters have a bunch of proxy servers behind something like HAProxy. It
balances requests across the proxy servers (any proxy can service any request,
so that helps) and may also be used to terminate TLS to offload that from the
proxies.
Now, the more technical detail.
Here are some background bullet points to help understand the flow:
* stuff in Swift is organized logically as <account>/<container>/<object>
* when an object is created, some metadata about that object (name, size,
etag, etc) is sent to the related container
* the container keeps a list of the objects and some aggregated metadata
* containers are implemented as a SQLite3 DB
* containers are replicated in the cluster, typically 3 times
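To make those bullets concrete, here's a toy version of a container DB in SQLite. The schema is heavily simplified and hypothetical (the real one tracks more columns, deletion markers, and sync points), but each object update a container replica receives amounts to a row upsert like this:

```python
import sqlite3

# Simplified, illustrative container DB: one row of metadata per object.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE object (
                    name TEXT PRIMARY KEY,
                    created_at TEXT,
                    size INTEGER,
                    etag TEXT)""")

def container_update(name, timestamp, size, etag):
    # Each object replica sends one of these updates to a container replica.
    conn.execute("INSERT OR REPLACE INTO object VALUES (?, ?, ?, ?)",
                 (name, timestamp, size, etag))
    conn.commit()

container_update("photo.jpg", "1466000000.00000", 1048576, "d41d8cd9...")
```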
OK, so with that info, here's what happens. Let's assume, for simplicity, that
we have a 3x replica policy for objects. So when an object is written, each
object replica also sends one update to one of the container replicas. Suppose
we have a cluster of 100 servers with 10 drives each. Create an object, and it
will be stored on 3 of those 100 servers. Create another, and it will be stored
on 3 other servers. So basically, you've got 100 choose 3 choices of servers, followed by 10 choose 1 choices of drive on each, for where an object can land. As you keep creating objects, the load is spread out across every drive in the cluster, all 1000 of them.
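Plugging in the numbers from the example above (purely illustrative; real placement is decided deterministically by Swift's ring, not an independent combinatorial draw each time):

```python
from math import comb

servers, drives_per_server, replicas = 100, 10, 3

# 100 choose 3 server choices, then 10 choose 1 drive on each chosen server.
server_choices = comb(servers, replicas)        # 161,700 ways to pick 3 servers
drive_choices = drives_per_server ** replicas   # 1,000 ways to pick the drives
print(server_choices * drive_choices)           # 161,700,000 possible placements
```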
But let's say we're adding a bunch of objects to the same container. The
objects will be nicely spread out across all 1000 drives, but we've only got 3
replicas of the container. So with each object write, you've got the exact same three container replicas to update. Every object write ends up with the same 3 of the 100 servers updating the same three hard drives. Suppose you're trying to write 1000 objects/sec to the cluster; that's 1000 updates per second to each of the three container replicas. If you've just got spinning media, that just isn't possible.
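The arithmetic behind that claim, using an illustrative random-IOPS figure for spinning disks (actual drives and update costs vary):

```python
# Back-of-the-envelope numbers, illustrative rather than measured:
object_writes_per_sec = 1000   # cluster-wide writes to ONE container
container_replicas = 3

# Every object write triggers an update to each container replica,
# so each replica's SQLite DB must absorb all 1000 updates/sec.
updates_per_replica = object_writes_per_sec

# A 7200rpm spinning disk manages on the order of ~150 random IOPS.
spinning_disk_iops = 150

print(updates_per_replica / spinning_disk_iops)  # many times over budget
```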
And 1000 writes/sec is a rather low number, compared to what most Swift
clusters expect.
The above is somewhat simplified, but it gets the point across. The container
DBs can become overwhelmed, and the problem gets worse as the containers get
bigger since SQLite gets slower to update as the DB gets bigger. The easiest
way to mitigate the slowdown is for clients to spread writes across many
containers and to put container DBs on SSDs (more IOPS). But even with that,
you still need to give the operator a way to protect the cluster from this
access pattern.
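A sketch of the client-side mitigation: deterministically hash each object name into one of many container names, so writes spread over many container DBs instead of hammering one. The function and naming scheme here are hypothetical, not a Swift API:

```python
from hashlib import md5

def pick_container(object_name, base="logs", shards=64):
    """Hypothetical client-side sharding: map an object name to one of
    `shards` containers, spreading container updates across many DBs."""
    h = int(md5(object_name.encode("utf-8")).hexdigest(), 16)
    return "%s-%02d" % (base, h % shards)

print(pick_container("server42/2016-06-14/req-abc123"))
```

Because the mapping is deterministic, a client can later find an object by hashing its name the same way; listing "everything" then means listing all the shard containers.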
What problems are we protecting against? More and more hardware resources get consumed by background jobs trying to keep up with the container updates and keep the replicas of the container in sync. Since there's a fixed hardware
budget for requests (i.e. you can't get more IOPS or cycles), other requests
now have to wait. Everything slows down, and eventually the whole cluster can
get so far behind it just won't be able to catch up to the correct view of the
world.
We've spent a lot of time working on this problem in the past. We've vastly improved some parts, and we're still working on improving others. The rate limiting functionality that's in Swift is part of the overall solution for operators who manage large, active clusters.
--John
On 14 Jun 2016, at 16:23, Joshua Harlow wrote:
> Am curious,
>
> Any reason why swift got in the business of ratelimiting in the first place?
>
> -Josh
>
> John Dickinson wrote:
>> Swift does rate limiting across the proxy servers ("api servers" in nova
>> parlance) as described at
>> http://docs.openstack.org/developer/swift/ratelimit.html. It uses a memcache
>> pool to coordinate the rate limiting across proxy processes (local or across
>> machines).
>>
>> Code's at
>> https://github.com/openstack/swift/blob/master/swift/common/middleware/ratelimit.py
>>
>> --John
>>
>>
>>
>> On 14 Jun 2016, at 8:02, Matt Riedemann wrote:
>>
>>> A question came up in the nova IRC channel this morning about the
>>> api_rate_limit config option in nova which was only for the v2 API.
>>>
>>> Sean Dague explained that it never really worked because it was per API
>>> server so if you had more than one API server it was busted. There is no
>>> in-tree replacement in nova.
>>>
>>> So the open question here is, what are people doing as an alternative?
>>>
>>> --
>>>
>>> Thanks,
>>>
>>> Matt Riedemann
>>>
>>>
>>> _______________________________________________
>>> OpenStack-operators mailing list
>>> [email protected]
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
