On Thu, May 5, 2016 at 12:55 AM, Anssi Kääriäinen <[email protected]> wrote:
> On Thursday, May 5, 2016, Russell Keith-Magee <[email protected]>
> wrote:
>
>> I will admit that I haven’t been paying *close* attention to Andrew’s
>> work - I’m aware of the broad strokes, and I’ve skimmed some of the design
>> discussions, but I haven’t been keeping close tabs on things. From that
>> (admittedly weak) position, the only counterarguments that I’ve heard are
>> performance concerns.
>>
>
> I haven't brought this up before, but security is something we should
> discuss pre-merge.
>
> What I'm mainly worried about is malicious clients intentionally trying to
> choke the channels layer. I guess the approaches for DoS attack would fall
> under these categories:
> 1. Try to generate large responses and read those responses slowly.
>

This would likely lead to either the response packets expiring in Redis
after one minute, or Redis running out of memory as it overfills with
packets and doing whatever its configured OOM response is (which I will
suggest be "expire things early"). asgi_redis could likely be improved so
that the channel lists don't get overfull in this situation.

> 2. Fire a large request, don't read the response.
>

Same as above. ASGI has no backpressure on channels per se, so you can't
tell if the response is being read at all.

> 3. Try to cause exceptions in various parts of the stack - if the worker
> never writes to the response channel, what will happen?
>

Daphne will time out the request after 120 seconds from request start with
a 503 Service Unavailable in the default configuration, but I'm tempted to
drop that to 60 and reset the clock every time a response chunk turns up,
and have a second absolute timeout that handles the slow-reader DoS case.

>
> There are always DoS vectors, but checking there aren't easy ones should
> be done. The main concern with channels is potential resource leak.
>
> I found accidentally some presentations that seem relevant to this
> discussion.
> I recently watched some presentations about high availability
> at Braintree. There are two presentations available, one from 2013 by Paul
> Gross, where he explains their approach to HA, and one from 2015 by Lionel
> Barrow, explaining what changed. Both are very interesting and highly
> recommended.
>
> The 2013 presentation introduces one key piece to HA at Braintree, called
> Broxy. Broxy basically serves HTTP the same way as Daphne - write requests
> to Redis, and wait for response again through Redis.
>
> The 2015 presentation explains what changed. They removed Broxy because
> it turned out to be conceptually complex and fragile. It might be their
> implementation. But there is certain level of both complexity and possible
> fragility about the design. On the other hand their story pretty much
> verifies that the design does scale.
>

Do you have a link to the presentation about them removing it? I've tried
to solve the problems I had last time I wrote a reverse proxy like this by
making the thing drop requests and responses whenever a problem comes up
(previous ones I've worked with suffered from a bad recovery after high
traffic as they tried to get through the queue).

The main problem with the design that I can potentially foresee is the lack
of backpressure, which was raised before. While it keeps the design simple,
it also means that anyone writing into a channel has no idea about the
current state of it; the problem is, however, that what constitutes "full"
varies by channel (e.g. for http.request it could be 1000 packets, for
http.response.clientid it could be 10), and so you start needing
per-channel, per-project configuration.
I'm tempted to add this as a configurable option on the channel layer
(defaulting every channel to, say, 500), though, and extend the ASGI spec
just slightly to allow send() to raise a ChannelFull exception, which
clients can treat how they like (daphne might drop the request, django
might just wait while writing a response and give up after X seconds).

>
> All in all I'd feel a lot more confident if there were large production
> sites using the channels system, so that we wouldn't have to theorize on
> the excellence of the approach. Are there already production deployments
> out there?
>

Jacob mentioned he'd been running it on something, but I don't know what
exactly, and I only hear mention, not details, of people trying it in
production. Personally, it's only running on my own sites, which are
anything but high traffic.

I had discussed with several people before about getting Channels running
as a parallel to their site and redirecting some load to it via a hidden
iframe to do a slightly better load test; maybe now is the time to start
that process?

Andrew
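
The per-channel capacity idea discussed above could look roughly like the
following. This is only a sketch under stated assumptions: the class
InMemoryChannelLayer and the helpers capacity_for, send_or_drop and
send_or_wait are hypothetical names, looking capacities up by channel-name
prefix is an assumption, and none of this is the asgi_redis or Daphne
implementation. Only the ChannelFull name and the default of 500 come from
the message.

```python
import time
from collections import defaultdict, deque


class ChannelFull(Exception):
    """Raised by send() when a channel is at capacity."""


class InMemoryChannelLayer:
    """Toy in-process channel layer (hypothetical, for illustration only)."""

    def __init__(self, default_capacity=500, capacities=None):
        self.default_capacity = default_capacity
        # Assumed config shape: channel-name prefix -> max queued messages,
        # e.g. {"http.request": 1000, "http.response.": 10}.
        self.capacities = dict(capacities or {})
        self._queues = defaultdict(deque)

    def capacity_for(self, channel):
        # The longest matching prefix wins; otherwise use the default.
        matches = [p for p in self.capacities if channel.startswith(p)]
        if not matches:
            return self.default_capacity
        return self.capacities[max(matches, key=len)]

    def send(self, channel, message):
        queue = self._queues[channel]
        if len(queue) >= self.capacity_for(channel):
            raise ChannelFull(channel)
        queue.append(message)

    def receive(self, channel):
        # Non-blocking: returns None when the channel is empty.
        queue = self._queues[channel]
        return queue.popleft() if queue else None


def send_or_drop(layer, channel, message):
    """Policy a protocol server might choose: drop the message on overload."""
    try:
        layer.send(channel, message)
        return True
    except ChannelFull:
        return False


def send_or_wait(layer, channel, message, timeout=5.0, interval=0.05):
    """Policy a worker might choose: retry until a deadline, then give up."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            layer.send(channel, message)
            return True
        except ChannelFull:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

The point of raising from send() rather than blocking inside the layer is
that each caller picks its own reaction: send_or_drop and send_or_wait are
two such reactions built on the same exception.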
