This bug was very serious, almost fatal, for me recently, and I thought I would share my story to emphasize that, for me, it was not a 'wishlist'-severity bug.
I research Tor black-markets (see http://www.gwern.net/Silk%20Road ) because I am interested in them from economic, historical, and statistical perspectives. Black-markets are risky enterprises, even when run as Tor hidden-services, and so people like me or Nicolas Christin often download or spider them so as to have copies to analyze later.
In October 2013, Silk Road was famously busted (to everyone's complete surprise). Fortunately, the FBI seizure left the SR forums alone, and it became a top priority for me to grab a copy of the forums while I still could, since they would be invaluable in the post-mortem of SR and the wave of arrests everyone expected to follow the bust. The wget spider of the public forum went fine. But even more importantly, I needed to get a copy of the members-only subforum, the Vendor Roundtable, where all the Silk Road drug dealers talked shop and, as it turned out, had discovered some of the early bits and pieces of how Silk Road/Ross Ulbricht was busted.
I'm not a drug dealer, but I know a few of the SR ones and was able to get login credentials. I logged in, checked that I had access to the Roundtable, exported my cookies, and read the wget man page for guidance:
    -R rejlist --reject rejlist
        Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.
Perfect. Exactly what I needed to avoid being logged out. I threw in a `--reject '*logout*'` to cover all possible logout links, and kicked the spider off. I watched for a few minutes, everything looked like it was going fine, with no suspicious 'index.php?logout' files showing up or anything, and I went off to deal with other aspects of breaking news.
Two days later, the spider was still running (it's a very big forum and Tor has high latency), and I needed to check a particular claim about a Roundtable thread. No problem, I had a copy of the Roundtable - I'd just check that. NOPE. The thread wasn't there at all. In fact, almost *nothing* in the Roundtable had been downloaded at all!
I panicked. No one knew why the FBI hadn't shut down the forums, who was running them, or when they would disappear into the digital ether. Christin wasn't spidering the Roundtable, and I was it. If I didn't have a copy, then likely no one did. It would be gone permanently. Luckily, the forums were still up... but for how long? Minutes, hours, or days? What had gone wrong and how could I fix it?
I logged in again, exported cookies, restarted, checked in a few hours. No Roundtable. WTF?! I logged in, exported, restarted, watched closely... I spotted in the stream a mention of 'index.php?logout'. But why? I went back to the `--reject` documentation. Had I called it wrong? Made a syntax error? Did `--reject` not do what it was supposed to do? But the documentation is perfectly clear: `--reject` rejects URLs from being downloaded. It doesn't do something remotely as absurd as download a URL and then delete it! There is no use case for that in combination with rejecting URLs; it's trivially broken for many use cases, and it would *definitely* be documented in the manpage.
I went back, logged in... Repeat 5 or 10 times with various invocations of `--reject` and regexps and escalating blood pressure, until I checked the downloaded pages and resigned myself to the conclusion that somehow or other, though I couldn't begin to explain either the how or the why, wget was logging itself out of the forums. As absurd as it sounded, nothing else fit the evidence. I started googling 'wget reject', only to discover this bug report, among others. Oh how I raged that night.
'principle of least surprise', 'betrayal', 'crime against posterity', 'moronic', 'deliberately malicious', 'what the hell', and more indelicate phrases were uttered. I was also not pleased to discover that, `--reject` aside, there was apparently no way whatsoever to genuinely reject URLs inside wget.
Eventually, I rigged up a hack where I pointed wget to Privoxy, and wrote Privoxy rules to block certain URLs, including the logout links. It's ugly, it's not easy to modify, and I'm not really familiar with Privoxy syntax, but at least it does, in fact, work. And I was able to get a good chunk of the Roundtable before the forums went down. (Not all of it, but that's another story, which is not wget's fault but the forum software's - I think.)
Summary: `--reject` is a problem. It can't be *that* hard to fix; there are short patches floating around. Please fix it.
-- gwern

