Yes, but normally you fork a worker process that tracks progress and scrapes N sites. If the worker process dies while processing a site, the site is marked “bad” and is scraped again only after a retry/backoff period.
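A minimal sketch of the bookkeeping described above, assuming a supervisor that is told when a worker died on a given site. The class name `SiteBackoff`, its methods, and the one-minute base and 24-hour cap are hypothetical choices for illustration, not something from this thread.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical supervisor-side bookkeeping: remembers which sites crashed a
// worker and when each one may be tried again. All names and delays here are
// illustrative assumptions.
class SiteBackoff {
    private static final Duration BASE_DELAY = Duration.ofMinutes(1);
    private static final Duration MAX_DELAY = Duration.ofHours(24);

    private final Map<String, Integer> crashCounts = new HashMap<>();
    private final Map<String, Instant> retryAfter = new HashMap<>();

    // Called when a worker process died while processing `site`.
    void recordCrash(String site, Instant now) {
        int crashes = crashCounts.merge(site, 1, Integer::sum);
        retryAfter.put(site, now.plus(delayFor(crashes)));
    }

    // Called after a successful scrape; the site is no longer "bad".
    void recordSuccess(String site) {
        crashCounts.remove(site);
        retryAfter.remove(site);
    }

    // A site is eligible if it never crashed a worker or its backoff expired.
    boolean isEligible(String site, Instant now) {
        Instant next = retryAfter.get(site);
        return next == null || !now.isBefore(next);
    }

    // Exponential backoff: BASE_DELAY * 2^(crashes - 1), capped at MAX_DELAY.
    static Duration delayFor(int crashes) {
        int shift = Math.max(0, Math.min(crashes - 1, 20)); // bounded to avoid overflow
        Duration delay = BASE_DELAY.multipliedBy(1L << shift);
        return delay.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : delay;
    }

    public static void main(String[] args) {
        SiteBackoff backoff = new SiteBackoff();
        Instant now = Instant.now();
        backoff.recordCrash("bad.example", now);
        System.out.println("eligible now: " + backoff.isEligible("bad.example", now));
        System.out.println("eligible in 2 min: "
                + backoff.isEligible("bad.example", now.plus(Duration.ofMinutes(2))));
    }
}
```

In a real supervisor, `recordCrash` would presumably be called when the child's `Process.waitFor()` returns a non-zero exit code, and the worker would write the site it is about to fetch somewhere the supervisor can read it back.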
There are probably a lot of ways to crash a worker process, intentionally or accidentally - a robust design is called for. As an aside, if I were writing a large-scale scraper I don’t think I would use HttpClient anyway - I think a custom URL accessor would be easier to monitor, etc.

> On Jul 29, 2024, at 3:43 PM, Ethan McCue <et...@mccue.dev> wrote:
>
> Scraping of unknown/untrusted websites is a common task in certain...fields?
> I don't want to comment on it too deeply, but I know that is something folks
> would do.
>
> Imagine a site where someone inputs a URL, clicks submit, and then with the
> power of funding they return a summary of the page.
>
> On Mon, Jul 29, 2024, 3:52 PM robert engels <reng...@ix.netcom.com> wrote:
> Isn’t the HttpClient almost always used to access other services?
>
> Why would a developer access a malicious service?
>
> I also think there are lots of ways for a service to crash the client - e.g.
> it could attempt to return a very large response - if the client uses a
> memory buffered reader, it will cause an OOM as well.
>
>> On Jul 29, 2024, at 2:42 PM, Andy Boothe <andy.boo...@gmail.com> wrote:
>>
>> Following up here.
>>
>> I believe I have discovered that it is possible to craft a malicious HTTP
>> response that can cause the built-in HttpURLConnection and HttpClient
>> implementations to throw exceptions. Specifically, HttpURLConnection can be
>> made to throw a NegativeArraySizeException, and HttpClient can be made to
>> throw an OutOfMemoryError. Proof of this behavior is in the attached (very
>> simple) Java programs.
>>
>> This seems like A Bad Thing to me.
>>
>> I've moved from the dev list to this list based on a recommendation from
>> that list. Is this the right list? If not, can you point me in the right
>> direction? Perhaps a security list?
>>
>> Thank you,
>>
>> Andy Boothe
>> Email: andy.boo...@gmail.com
>> On Wed, Jul 24, 2024 at 4:47 PM Andy Boothe <andy.boo...@gmail.com> wrote:
>> Hello,
>>
>> I'm moving this thread from jdk-dev to this list on the sage advice of Pavel
>> Rappo.
>>
>> As a brief recap, it looks like HttpClient and HttpURLConnection do not
>> currently support a way to set the maximum acceptable response header
>> length. As a result, sending HTTP requests with these classes that result in
>> a response with very long headers causes an OutOfMemoryError and a
>> NegativeArraySizeException, respectively. (Simple programs for reproducing
>> the issue are attached.) This seems like A Bad Thing. There is a (very
>> brief) discussion in the thread about how to handle, but of course you guys
>> are the experts.
>>
>> If my head is on straight and this turns out to be a real issue as opposed
>> to a mistake on my part, I'm keen to help however I can.
>>
>> Andy Boothe
>> Email: andy.boo...@gmail.com
>>
>>
>> ---------- Forwarded message ---------
>> From: Pavel Rappo <pavel.ra...@oracle.com>
>> Date: Wed, Jul 24, 2024 at 4:30 PM
>> Subject: Re: Very long response headers and java.net.http.HttpClient?
>> To: Andy Boothe <andy.boo...@gmail.com>
>> Cc: jdk-...@openjdk.org
>>
>>
>> A proper list would be net-dev at openjdk.java.net.
>>
>> > On 24 Jul 2024, at 21:13, Andy Boothe <andy.boo...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I'm documenting some guidelines for using java.net.http.HttpClient
>> > defensively for my team. For example: "Always set a request timeout",
>> > "Don't assume HTTP response entities are small and/or will fit in memory",
>> > etc.
>> >
>> > One guideline I'd like to document is "Set a maximum for HTTP response
>> > header size." However, I can't seem to find a way to set that limit,
>> > either in documentation or in OpenJDK code.
>> >
>> > I tried my best to search the archives for this mailing list for any
>> > mentions, but came up empty.
>> >
>> > To make sure my head is on straight and there isn't an undocumented limit
>> > set by default, I wrote the attached (very quick and dirty) client and
>> > server programs. LongResponseHeaderDemoServer opens a raw server socket
>> > and reads (what it assumes is) a well-formed HTTP request, and then prints
>> > an HTTP response which includes a response header of infinite length.
>> > LongResponseHeaderDemoHttpClient uses java.net.http.HttpClient to make a
>> > request and print the response body.
>> >
>> > When I run LongResponseHeaderDemoServer in one terminal and make a curl
>> > request to the server in another terminal, this is what curl spits out:
>> >
>> > $ curl -vvv -D - http://localhost:3000
>> > * Host localhost:3000 was resolved.
>> > * IPv6: ::1
>> > * IPv4: 127.0.0.1
>> > * Trying [::1]:3000...
>> > * Connected to localhost (::1) port 3000
>> > > GET / HTTP/1.1
>> > > Host: localhost:3000
>> > > User-Agent: curl/8.6.0
>> > > Accept: */*
>> > >
>> > < HTTP/1.1 200 OK
>> > HTTP/1.1 200 OK
>> > < Content-Type: text/plain
>> > Content-Type: text/plain
>> > < Connection: close
>> > Connection: close
>> > < Content-Length: 3
>> > Content-Length: 3
>> > * Closing connection
>> > curl: (100) A value or data field grew larger than allowed
>> >
>> > So curl detects the long response header and bails out. Safe and sane.
>> >
>> > However, when I run LongResponseHeaderDemoServer in one terminal and run
>> > LongResponseHeaderDemoHttpClient in another terminal, this is what happens:
>> >
>> > $ java LongResponseHeaderDemoHttpClient
>> > Exception in thread "main" java.io.IOException: Requested array size exceeds VM limit
>> > at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:966)
>> > at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133)
>> > at LongResponseHeaderDemoHttpClient.main(LongResponseHeaderDemoHttpClient.java:13)
>> > Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> > at java.base/java.util.Arrays.copyOf(Arrays.java:3541)
>> > at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:242)
>> > at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:806)
>> > at java.base/java.lang.StringBuilder.append(StringBuilder.java:246)
>> > at java.net.http/jdk.internal.net.http.Http1HeaderParser.readResumeHeader(Http1HeaderParser.java:250)
>> > at java.net.http/jdk.internal.net.http.Http1HeaderParser.parse(Http1HeaderParser.java:124)
>> > at java.net.http/jdk.internal.net.http.Http1Response$HeadersReader.handle(Http1Response.java:605)
>> > at java.net.http/jdk.internal.net.http.Http1Response$HeadersReader.handle(Http1Response.java:536)
>> > at java.net.http/jdk.internal.net.http.Http1Response$Receiver.accept(Http1Response.java:527)
>> > at java.net.http/jdk.internal.net.http.Http1Response$HeadersReader.tryAsyncReceive(Http1Response.java:583)
>> > at java.net.http/jdk.internal.net.http.Http1AsyncReceiver.flush(Http1AsyncReceiver.java:233)
>> > at java.net.http/jdk.internal.net.http.Http1AsyncReceiver$$Lambda/0x00000008010dbd50.run(Unknown Source)
>> > at java.net.http/jdk.internal.net.http.common.SequentialScheduler$LockingRestartableTask.run(SequentialScheduler.java:182)
>> > at java.net.http/jdk.internal.net.http.common.SequentialScheduler$CompleteRestartableTask.run(SequentialScheduler.java:149)
>> > at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SchedulableTask.run(SequentialScheduler.java:207)
>> > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>> > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>> > at java.base/java.lang.Thread.runWith(Thread.java:1596)
>> > at java.base/java.lang.Thread.run(Thread.java:1583)
>> >
>> > Ostensibly, HttpClient just keeps on reading the never-ending header until
>> > it OOMs. This seems to confirm that there is no default limit to header
>> > size. It also seems like A Very Bad Thing to me. This suggests that any
>> > time a program makes an HTTP request to an untrusted source using
>> > HttpClient, for example when crawling the web, they are at risk of an OOM.
>> >
>> > For grins, I also wrote an application
>> > LongResponseHeaderDemoHttpURLConnection that does the same thing as
>> > LongResponseHeaderDemoHttpClient, just using HttpURLConnection instead of
>> > HttpClient.
>> > When I run LongResponseHeaderDemoServer in one terminal and
>> > LongResponseHeaderDemoHttpURLConnection in another terminal, this is what
>> > happens:
>> >
>> > $ java LongResponseHeaderDemoHttpURLConnection
>> > Exception in thread "main" java.lang.NegativeArraySizeException: -1610612736
>> > at java.base/sun.net.www.MessageHeader.mergeHeader(MessageHeader.java:526)
>> > at java.base/sun.net.www.MessageHeader.parseHeader(MessageHeader.java:481)
>> > at java.base/sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:804)
>> > at java.base/sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:726)
>> > at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1688)
>> > at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
>> > at java.base/java.net.URL.openStream(URL.java:1161)
>> > at LongResponseHeaderDemoHttpURLConnection.main(LongResponseHeaderDemoHttpURLConnection.java:12)
>> >
>> > So HttpURLConnection doesn't handle things gracefully either, but at least
>> > it doesn't OOM. That seems like a bug, too, but perhaps less severe.
>> >
>> > For reference, here's my java version:
>> >
>> > $ java -version
>> > openjdk version "21.0.2" 2024-01-16 LTS
>> > OpenJDK Runtime Environment Corretto-21.0.2.13.1 (build 21.0.2+13-LTS)
>> > OpenJDK 64-Bit Server VM Corretto-21.0.2.13.1 (build 21.0.2+13-LTS, mixed mode, sharing)
>> >
>> > Can anyone check my work, and maybe reproduce? And ideally, can someone
>> > with more knowledge than me about java.net.http.HttpClient and/or
>> > java.net.HttpURLConnection please comment? Is this real, or have I made a
>> > mistake somewhere along the way? If it's real, what's next? A bug report?
>> >
>> > Andy Boothe
>> > Email: andy.boo...@gmail.com
>>
>> <LongResponseHeaderDemoHttpClient.java><LongResponseHeaderDemoHttpURLConnection.java><LongResponseHeaderDemoServer.java>
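
Two of the guidelines quoted in the original message ("Always set a request timeout" and "Don't assume HTTP response entities are small and/or will fit in memory") can be sketched roughly as follows. The class `DefensiveHttp` and its method names are hypothetical; only `HttpRequest.Builder.timeout` and the standard stream classes come from the JDK. This bounds the response body only and does nothing about the header-size problem this thread is about.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper illustrating two defensive guidelines from the thread.
class DefensiveHttp {

    // "Always set a request timeout": build a GET request with an explicit timeout.
    static HttpRequest requestWithTimeout(String url, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
    }

    // "Don't assume HTTP response entities are small": read at most `limit`
    // bytes and fail instead of buffering an arbitrarily large body in memory.
    static byte[] readAtMost(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
            if (total > limit) {
                throw new IOException("response body exceeds " + limit + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        HttpRequest request = requestWithTimeout("http://localhost:3000/", Duration.ofSeconds(10));
        System.out.println("timeout: " + request.timeout());

        // Stand-in for a response body stream; no network access is needed here.
        byte[] body = readAtMost(new ByteArrayInputStream(new byte[100]), 1024);
        System.out.println("read " + body.length + " bytes");
    }
}
```

With a real response, the stream passed to `readAtMost` could come from `HttpResponse.BodyHandlers.ofInputStream()`, so the body is consumed incrementally rather than collected into a string first.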