I've been running Storm successfully for a while with a fairly simple topology of this form:

spout with a stream of tweets --> bolt to check the tweet's user against a cache --> bolts to do some persistence based on tweet content

So far that's been humming along quite well, with execute latencies in the low single digits or sub-millisecond range. Other than setting the parallelism for various bolts, I've been able to run it with the default topology config pretty well.

Now I'm trying a topology of this form:

spout with a stream of tweets --> bolt to extract the URLs in the tweet --> bolt to fetch each URL and get the page's title

For this topology the fetch portion has a much longer latency: I'm seeing execute latencies in the 300-500 ms range to accommodate fetching these arbitrary URLs. I've implemented caching to avoid re-fetching URLs I already have titles for, and I'm using socket/connection timeouts to keep fetches from hanging too long, but even so, this is going to be a bottleneck. I've already set the parallelism for the fetch bolt fairly high, but are there any best practices for configuring a topology like this, where at least one bolt takes much more time to process than the rest?
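For concreteness, here's a sketch of how I'm wiring up the second topology. The spout/bolt class names and the parallelism numbers are placeholders, and this assumes the `org.apache.storm` package layout:

```java
import org.apache.storm.topology.TopologyBuilder;

public class UrlTitleTopology {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();

        // Placeholder components standing in for my actual spout/bolts.
        builder.setSpout("tweets", new TweetSpout(), 1);

        // Cheap CPU-bound work: modest parallelism.
        builder.setBolt("extract-urls", new ExtractUrlsBolt(), 4)
               .shuffleGrouping("tweets");

        // Slow, I/O-bound fetch (300-500 ms execute latency):
        // this is the bolt I've given a much higher parallelism hint.
        builder.setBolt("fetch-title", new FetchTitleBolt(), 64)
               .shuffleGrouping("extract-urls");

        return builder;
    }
}
```

Shuffle grouping is what I'm using between bolts, so tuples spread evenly across the many fetch-bolt executors.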
