On Sat, Sep 09, 2006 at 08:03:18PM -0400, Daniel Ouellet wrote:
> I am working on this idea and have put in place a series of defenses 
> that have proved effective so far, but obviously not as practical and 
> speedy as spamd is at the moment. It's a variety of scripts here and 
> there based on multiple aspects of standard web access.

> Some of the ideas are not new and are based on spamd, just not all in 
> place yet.
> 
> 1. For Crawlers and Bots
> 
> First is the proliferation of mom-and-pop bots and crawlers. After 
> testing different setups, I realized to my surprise, yes call me stupid, 
> that a handful are actually good citizens! The use and standard of 
> robots.txt is well known and all good-citizen robots should respect it. 
> It is not a means of protection for your site, but nevertheless they 
> should respect it. So, whatever is inside it, if you forbid some 
> directories or files, they should respect that, and any that do not, 
> well, I guess it's fine to kill them. Why should they be granted access 
> if they do not respect my wishes as the owner and/or operator of the site(s)?
> 
> 1.1 First defense: no crawling of paths forbidden in robots.txt, with 
> incremental denial of access to offenders.
> 
> Maybe not the best approach, but it is working: this method caught 381 
> bad-citizen crawlers in a week's time. The idea is very simple. I preset 
> my robots.txt file to include a file, or in this case a directory, that 
> is not to be crawled, and in that directory I put a file containing a 
> script that will block the source via PF and log the entry in an SQL 
> database, so it can be shared between all servers later on. I also put 
> on the front page of the site a very simple link to a 1-pixel image at 
> the bottom of the page that is simply not visible to users and is not 
> clickable either. So a regular user will never click on it or even see 
> it. But a crawler will follow all links, obviously, as that is the 
> definition of a crawler. Now don't forget that the crawler is supposed 
> to respect the robots.txt directives. So, this URL is in the forbidden 
> directory, and many crawlers do respect that very well; live tests 
> proved this only too well. However, all the bad ones will not, and as 
> such the URL triggers a script that logs their IP and adds them to PF 
> to block them right away! BYE BYE!
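
A minimal sketch of what such a trap handler could look like, written as
a Python CGI script for illustration. The PF table name "trapped", the
database path and the /trap/ directory are invented here; the only parts
taken as given are the pfctl table commands and the pf.conf lines shown
in the comments.

#!/usr/bin/env python3
# Hypothetical CGI trap behind the robots.txt-forbidden directory, e.g.
# robots.txt would carry:  Disallow: /trap/
# Any client fetching this URL gets logged and added to a PF table that
# a pf.conf rule blocks:
#   table <trapped> persist
#   block in quick from <trapped>
import os
import sqlite3
import subprocess
import time

ip = os.environ.get("REMOTE_ADDR", "")
ua = os.environ.get("HTTP_USER_AGENT", "")

if ip:
    # Record the offender so the block list can be shared between servers.
    db = sqlite3.connect("/var/www/trap/offenders.db")
    db.execute("CREATE TABLE IF NOT EXISTS offenders (ip TEXT, ua TEXT, ts INTEGER)")
    db.execute("INSERT INTO offenders VALUES (?, ?, ?)", (ip, ua, int(time.time())))
    db.commit()
    db.close()

    # Add the address to the PF table.  In practice this needs root, so a
    # safer split is to let the CGI only write the database entry and have
    # a small privileged cron job do the actual pfctl call.
    subprocess.call(["pfctl", "-t", "trapped", "-T", "add", ip])

# Give the crawler a short answer and nothing else.
print("Status: 403 Forbidden")
print("Content-Type: text/plain")
print()
print("Go away.")
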
> 
> Now you may ask why I do an incremental deny here. To be nice, I guess, 
> but also because some connections come from proxies, and not all proxies 
> add the header identifying themselves as such. You don't want to lose 
> traffic from legitimate users that are behind a proxy like AOL's. This 
> needs a bit more work, but the standard should help make sure that only 
> the offending remote user behind a proxy gets blocked, provided all 
> proxies respect the standard and add this part to their headers, as most 
> do. You can call this the bypass of broken proxies for now. Should all 
> proxies behave correctly, then the block could be permanent, maybe. This 
> also has the side benefit of stopping some lowlifes from stealing your 
> content by trying to mirror your whole site at once. Not the goal here, 
> but it's a side benefit should you want it.

Your worries about losing users behind proxies are justified; it looks
like you have that problem mostly covered. I'm not sure it would help
much against bandwidth hogs, though - I don't have any numbers on which
programs are most often used, but something like wget certainly does
respect robots.txt.
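
For the incremental part, one way to sketch it, reusing the hypothetical
SQLite log and "trapped" table from the earlier sketch, is a cron job
that doubles the block time for each repeat offence and removes
addresses from the table once their current penalty has expired, so a
shared proxy is never locked out for good:

#!/usr/bin/env python3
# Hypothetical cron job: escalate the block duration for repeat
# offenders and unblock addresses whose penalty has been served.
import sqlite3
import subprocess
import time

BASE = 3600  # first offence: one hour; doubles with every repeat

db = sqlite3.connect("/var/www/trap/offenders.db")
rows = db.execute(
    "SELECT ip, COUNT(*), MAX(ts) FROM offenders GROUP BY ip"
).fetchall()
db.close()

now = int(time.time())
for ip, hits, last in rows:
    penalty = BASE * (2 ** min(hits - 1, 10))  # cap the escalation
    if now - last > penalty:
        # Penalty served: drop the address from the blocking table.
        subprocess.call(["pfctl", "-t", "trapped", "-T", "delete", ip])
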

> 3. DDoS GET attacks & bandwidth suckers defense. Multiple approaches.
> 
> 3.1 Sanity-checking user-supplied data.
> 
> So far most/all of the variations of attacks on web sites are scripts 
> trying to inject themselves into your servers. Well, you need to do 
> sanity checks in your code. Nothing can really protect you from that if 
> you don't check what you expect to receive from user input. So, I have 
> nothing for that. No idea anyway on how to, other than maybe limiting 
> the size of the argument a GET can send, but even that is a bad idea I think.

This is not applicable to DDoS, really - though you are otherwise right,
of course.
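
As a trivial illustration of the "check what you expect" point (and, as
said, this has nothing to do with DDoS), a handler can whitelist-validate
its parameters instead of trying to blacklist bad input. The parameter
names and limits below are invented:

import re

# Hypothetical example: only accept the parameters we actually expect,
# each matched against a strict pattern, with a hard length cap up front.
EXPECTED = {
    "page": re.compile(r"^[a-z0-9_-]{1,32}$"),
    "id": re.compile(r"^[0-9]{1,10}$"),
}

def validate(params):
    """Return a clean dict, or None if anything unexpected shows up."""
    clean = {}
    for name, value in params.items():
        pattern = EXPECTED.get(name)
        if pattern is None or len(value) > 256 or not pattern.match(value):
            return None
        clean[name] = value
    return clean
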

> 3.2 Greylisting idea via a 302 temporary redirect.
> 
> Many scripts want to make you waste as much bandwidth as possible: if 
> they can't inject themselves into your servers, they will in turn attack 
> a specific page or section of your site and try to make you waste plenty 
> of bandwidth, or even SQL back-end power as well.
> 
> One simple approach to this defense came to me from the idea behind 
> spamd. But to do this, you don't want the users to wait, or they will go 
> elsewhere and you have just lost them. So, the idea is again simple: 
> just return the users a code telling them to come back, simply a 302 
> temporary redirect.
> 
> You might say this will affect my search engine ranking; well, not 
> really. There isn't any impact, as a search engine will not save content 
> behind a temporary redirect, and if it does, it is wrong. But should you 
> be concerned about this, then also add to the header a do-not-cache 
> directive, which is likewise defined in the standard.
> 
> So, what is happening is that GET attacks and the like, if you look at a 
> few different variations of them, do send GET requests, some HEAD, etc. 
> They impersonate a browser, an OS, etc. But NONE that I have seen so far 
> will also process the HTML code itself. This means they send the GET 
> request but will not process the content of what they get back, and will 
> not follow the 302 redirect.
> 
> So, what you have is obviously a connection established to your web 
> server, and you can't know whether it's from a good or a bad (fake) 
> browser. However, this virus or attack will not process the received 
> temporary redirect and come back to you at the new URL.
> 
> So, to process this quickly, you simply put this IP in your greylisting 
> process, just like spamd would do, and send back a 302 redirect with a 
> new URL (the same one if you want, with some useless token added to it).
> 
> Now real users come back to you, you see the new request coming in and 
> move them from grey to white listing, and subsequent requests go through 
> without any delay or extra processing from your servers. So, the impact 
> is minimal and on the first request only.
> 
> In effect you have just added a very simple greylisting to your server 
> and protected it from bandwidth hogs! At the same time you protected 
> your database back end, since no request was sent to it for that user's 
> content, nor did you send any images or objects to the requesting user, 
> and your redirect header is very short as well. So you don't waste 
> processing power, disk access or bandwidth, and your server(s) are still 
> fully replying to user requests.

This could be effective, indeed - though I am not sure it would block
many attackers.
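
A rough sketch of the 302 greylisting, written here as hypothetical WSGI
middleware (the "glt" token parameter and the in-memory tables are made
up for illustration): the first contact from an unknown address gets a
cheap redirect carrying a token, and only clients that actually follow
it reach the real application.

import secrets
import time
from urllib.parse import parse_qs

GREY = {}      # ip -> (token, timestamp)
WHITE = set()  # ips that proved they follow redirects
TTL = 600      # forget grey entries after ten minutes

def greylist(app):
    def wrapper(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        if ip in WHITE:
            return app(environ, start_response)

        qs = parse_qs(environ.get("QUERY_STRING", ""))
        token, added = GREY.get(ip, (None, 0.0))

        if token and qs.get("glt", [None])[0] == token and time.time() - added < TTL:
            # The client followed the redirect: promote it and serve normally.
            WHITE.add(ip)
            del GREY[ip]
            return app(environ, start_response)

        # First contact (or stale entry): hand out a token and a tiny 302.
        token = secrets.token_hex(8)
        GREY[ip] = (token, time.time())
        location = (environ.get("PATH_INFO") or "/") + "?glt=" + token
        start_response("302 Found", [
            ("Location", location),
            ("Cache-Control", "no-store"),  # the do-not-cache directive mentioned above
            ("Content-Length", "0"),
        ])
        return [b""]
    return wrapper

Wrapping the real application is then just app = greylist(real_app). The
state here is per-process and in memory only, and the original query
string is dropped on the redirect; anything real would keep the state in
shared storage and carry the query string along.
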

> 3.4 What about the compromised user computer itself, or a proxy server?
> 
> Here again, it's possible to handle this. The headers do provide the 
> differences needed to allow/deny connections from the same user 
> computer, telling apart requests sent by the real user's browser from 
> those sent by the virus/attack running on that computer, should you want 
> to allow this.
> 
> You would even have the possibility of providing feedback and alerts to 
> that user about the problem, advising them to clean their computer, if 
> you really wanted to go that far. The signature of the user's browser, 
> OS, etc. in the header and the fake header from the GET virus will 
> simply not match. Not until the virus gets smarter, finds that 
> information on the user's computer and presents itself accordingly. So 
> you can see the difference between them here, allowing real user 
> connections from a compromised computer should you want that, obviously. 
> Should you? Well, that's another question; it adds to the complexity of 
> the daemon and I am not sure it should.

Faking those headers is easily done, though; ideally, you'd want to
cross-check p0f and the headers. I'm not entirely sure it would hurt an
attacker more than it hurts you, though, and privileged code is always
scary, doubly so when close to essentially untrusted web apps.
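
A rough sketch of that cross-check, with the p0f side deliberately left
out (how the fingerprinted OS string is obtained, e.g. by parsing p0f's
log output, differs between p0f versions, so it is simply passed in
here):

# Hypothetical consistency check between the OS claimed in the
# User-Agent header and the OS seen by passive fingerprinting (p0f).
UA_HINTS = {
    "windows": "Windows",
    "linux": "Linux",
    "mac os x": "Mac OS X",
    "openbsd": "OpenBSD",
    "freebsd": "FreeBSD",
}

def claimed_os(user_agent):
    """Best-effort guess at the OS a User-Agent string claims to be."""
    ua = user_agent.lower()
    for needle, name in UA_HINTS.items():
        if needle in ua:
            return name
    return None

def headers_match_fingerprint(user_agent, p0f_os):
    """True if the claimed OS is at least consistent with the fingerprint."""
    claim = claimed_os(user_agent)
    if claim is None or not p0f_os:
        return True  # not enough information: don't punish anyone
    return claim.lower() in p0f_os.lower()

A mismatch should probably only feed the greylist rather than trigger a
hard block, for exactly the reasons above.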


> 4. What about more intelligent attacks?
> 
> It's possible that more intelligent attacks would be developed that read 
> the incoming response and follow the redirect, in which case most of the 
> above would be useless. So, what could be done then? Sign-up-only sites? 
> Who wants that; plus it's a deterrent for users. More elaborate 
> redirects, such as redirecting all the time? Not sure that's a good 
> idea. Having an image on the requested page, on all pages, and waiting 
> until the request for that image comes in before whitelisting the 
> client? This would require a much more complicated design and I am not 
> sure of the benefit of it. But it sure would improve the results. 
> Cost/benefit, I am not sure however. More complex setups and software 
> mean more bugs, and possibly work against what this is supposed to fight.

You *should* consider some unconventional browsers before going too far
down this path, though. Notably, your 1x1 image link will be plainly
visible in text-mode browsers; be sure to, at least, add a 'don't
click' alt attribute.

Also, neither text-based browsers nor most legitimate bots will request
images.
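
If anyone does experiment with the image-beacon variant from point 4, a
sketch of both halves could look like this; the paths and names are
invented, and per the caveat above a missing beacon request must never
lead to a block on its own, since text-mode browsers and well-behaved
bots simply won't ask for images.

BEACON_SEEN = set()

def beacon_tag(client_ip):
    """HTML fragment to embed in every served page (empty alt on purpose)."""
    return '<img src="/beacon/%s.png" width="1" height="1" alt="">' % client_ip

def handle_beacon_request(client_ip):
    """Called by the web server whenever /beacon/<ip>.png is fetched."""
    BEACON_SEEN.add(client_ip)

def looks_like_a_browser(client_ip):
    """Whitelist hint only; never block on a negative answer."""
    return client_ip in BEACON_SEEN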

                Joachim
