Thanks Martin and Nathan! Didn't know about the custom schedulers.

On Tue, Mar 10, 2015 at 10:54 AM Martin Illecker <[email protected]>
wrote:

> Thanks Nathan, that's exactly what I meant. :-)
>
> 2015-03-10 17:45 GMT+01:00 Nathan Leung <[email protected]>:
>
>> Storm supports custom schedulers:
>> http://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
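For anyone wiring one of these up: Nimbus picks the scheduler implementation up from its config, so the jar containing the class has to be on Nimbus's classpath. A minimal sketch of the storm.yaml entry (the class name is a hypothetical placeholder for your own implementation):

```yaml
# storm.yaml on the Nimbus node.
# com.example.HighLatencyScheduler is a hypothetical placeholder for a class
# implementing Storm's scheduler interface; its jar must be on Nimbus's classpath.
storm.scheduler: "com.example.HighLatencyScheduler"
```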
>>
>> On Tue, Mar 10, 2015 at 12:37 PM, Martin Illecker <[email protected]>
>> wrote:
>>
>>> Curtis, I have made exactly the same observations. I have decreased the
>>> max spout pending to eliminate tuple timeouts.
>>> But that effectively throttles the whole topology because of a single
>>> high-latency bolt! (e.g., five bolts with 0.1 ms latency and one bolt
>>> with 1 ms)
>>>
>>> At some point, increasing the parallelism of the high-latency bolt will
>>> impact the overall performance of a worker. There has to be a better way.
>>>
>>> A possible solution might be to assign a bolt to a specific worker.
>>> Currently, if I understand correctly, each bolt is evenly distributed
>>> among multiple workers.
>>> (e.g., a bolt with parallelism 10 can be executed by 5 threads on 2
>>> workers or 2 threads on 5 workers)
>>>
>>> If a bolt could be assigned to a specific worker type, it would be
>>> possible to add more workers / nodes that exclusively execute multiple
>>> threads of a high-latency bolt.
>>> For example, we could have one worker that executes a high-latency bolt
>>> and another worker that executes the rest of the topology.
>>> The default behavior would still be to distribute the bolts evenly, but
>>> it should be possible to define different worker types and assign bolts
>>> to them.
>>>
>>> Does this make any sense?
>>> And could this be an additional feature of Storm?
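The "worker type" idea above can be sketched as a toy assignment policy. This is not Storm's actual scheduler API (a real version would implement `IScheduler` against `Topologies`/`Cluster`); it's a self-contained illustration, under the assumption that executors are tagged with their component name and worker slots carry a type label:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the proposed "worker type" policy: executors of a bolt tagged
// with a worker type land only on slots of that type; everything else is
// spread round-robin over slots of type "default". All names are illustrative.
public class WorkerTypeAssignment {
    public static Map<String, List<String>> assign(
            List<String> executors,            // e.g. "fetch-0", "parse-0"
            Map<String, String> componentType, // component -> worker type
            Map<String, String> slotType) {    // slot id -> worker type
        Map<String, List<String>> assignment = new HashMap<>();
        Map<String, List<String>> slotsByType = new HashMap<>();
        for (Map.Entry<String, String> e : slotType.entrySet()) {
            assignment.put(e.getKey(), new ArrayList<>());
            slotsByType.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        Map<String, Integer> cursor = new HashMap<>(); // round-robin position per type
        for (String exec : executors) {
            String comp = exec.substring(0, exec.lastIndexOf('-'));
            String type = componentType.getOrDefault(comp, "default");
            List<String> slots = slotsByType.get(type); // assumes a slot of this type exists
            int i = cursor.merge(type, 1, Integer::sum) - 1;
            assignment.get(slots.get(i % slots.size())).add(exec);
        }
        return assignment;
    }
}
```

With one "high-latency" slot and one "default" slot, two executors of a tagged `fetch` bolt end up isolated on the dedicated slot while the rest of the topology runs elsewhere.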
>>>
>>> 2015-03-10 16:59 GMT+01:00 Curtis Allen <[email protected]>:
>>>
>>>> Idan, use the Config class:
>>>> https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/Config.java#L1295
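Since `backtype.storm.Config` is essentially a Map of these keys, setting them looks roughly like this. A plain `HashMap` is used below so the snippet stands alone; with storm-core on the classpath you would use `new Config()` and its `setMessageTimeoutSecs(...)` / `setMaxSpoutPending(...)` helpers instead (the numbers are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class TimeoutConfig {
    // Keys from backtype.storm.Config. A plain Map is used here only so this
    // compiles without storm-core; pass the same entries to a real Config.
    public static Map<String, Object> build() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.message.timeout.secs", 60); // tuple tree must complete within 60 s
        conf.put("topology.max.spout.pending", 500);   // cap on un-acked tuples per spout task
        return conf;
    }
}
```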
>>>>
>>>> On Tue, Mar 10, 2015 at 9:49 AM Idan Fridman <[email protected]>
>>>> wrote:
>>>>
>>>>> Curtis, how do you set topology.message.timeout.secs?
>>>>>
>>>>> 2015-03-10 17:07 GMT+02:00 Curtis Allen <[email protected]>:
>>>>>
>>>>>> Tuning a topology that contains bolts with unpredictable execute
>>>>>> latency is extremely difficult. I've had to slow down the entire
>>>>>> topology by decreasing topology.max.spout.pending and increasing
>>>>>> topology.message.timeout.secs; otherwise tuples queue up and
>>>>>> time out.
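The coupling between the two settings can be sanity-checked with back-of-the-envelope arithmetic: in the worst case every pending tuple queues behind the slowest bolt, and the time to drain that backlog must stay below the message timeout. A small sketch with illustrative numbers:

```java
public class TimeoutBudget {
    // Rough worst case: maxSpoutPending tuples all queue behind the slowest
    // bolt, which has boltParallelism executors each taking boltLatencyMs per
    // tuple. If the drain time exceeds the message timeout, tuples time out
    // and get replayed. Numbers below are illustrative only.
    public static double drainSeconds(int maxSpoutPending, double boltLatencyMs, int boltParallelism) {
        return maxSpoutPending * (boltLatencyMs / 1000.0) / boltParallelism;
    }

    public static void main(String[] args) {
        // 1000 pending tuples, a 500 ms bolt, 10 executors -> 50 s to drain,
        // so the default 30 s message timeout would be too low here.
        System.out.println(drainSeconds(1000, 500.0, 10) + " s");
    }
}
```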
>>>>>>
>>>>>
>>>>> On Tue, Mar 10, 2015 at 8:53 AM Martin Illecker <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I would be interested in a solution for high-latency bolts as well.
>>>>>>
>>>>>> Maybe a custom scheduler that prioritizes high-latency bolts might
>>>>>> help?
>>>>>> (e.g., allowing a worker to exclusively run high-latency bolts)
>>>>>>
>>>>>> Does anyone have a working solution for a high-throughput topology
>>>>>> (x0000 tuples / sec) that includes an HTTPClient bolt (latency around
>>>>>> 100 ms)?
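Little's law gives a lower bound on the parallelism such a topology needs: the number of tuples in flight equals throughput times latency, so the slow bolt needs at least that many executors just to keep up. A minimal sketch, assuming (for illustration) 10,000 tuples/sec through a 100 ms bolt:

```java
public class LittlesLaw {
    // Little's law: in-flight work = arrival rate * service time, so a bolt
    // needs at least ceil(throughput * latency) executors to keep up.
    // Integer math (latency in ms) avoids floating-point rounding.
    public static long executorsNeeded(long tuplesPerSec, long latencyMs) {
        return (tuplesPerSec * latencyMs + 999) / 1000; // ceiling division
    }

    public static void main(String[] args) {
        // Illustrative numbers: 10,000 tuples/sec through a 100 ms HTTP bolt
        // requires ~1000 concurrent executions just to match the input rate.
        System.out.println(LittlesLaw.executorsNeeded(10_000, 100));
    }
}
```

This is why scaling the HTTP bolt's parallelism alone eventually hurts: those executors all land on the same shared workers unless the scheduler isolates them.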
>>>>>>
>>>>>>
>>>>>> 2015-03-08 20:35 GMT+01:00 Frank Jania <[email protected]>:
>>>>>>
>>>>>>> I've been running storm successfully now for a while with a fairly
>>>>>>> simple topology of this form:
>>>>>>>
>>>>>>> spout with a stream of tweets --> bolt to check tweet user against
>>>>>>> cache --> bolts to do some persistence based on tweet content.
>>>>>>>
>>>>>>> So far that's been humming along quite well, with execute latencies
>>>>>>> in the low single-digit or sub-millisecond range. Other than setting
>>>>>>> the parallelism for various bolts, I've been able to run it with the
>>>>>>> default topology config pretty well.
>>>>>>>
>>>>>>> Now I'm trying a topology of the form:
>>>>>>>
>>>>>>> spout with a stream of tweets --> bolt to extract the urls in the
>>>>>>> tweet --> bolt to fetch the url and get the page's title.
>>>>>>>
>>>>>>> For this topology the "fetch" portion can have a much longer
>>>>>>> latency; I'm seeing execute latencies in the 300-500 ms range for
>>>>>>> fetching these arbitrary URLs. I've implemented caching to avoid
>>>>>>> re-fetching URLs I already have titles for, and I use
>>>>>>> socket/connection timeouts to keep fetches from hanging for too
>>>>>>> long, but even so, this is going to be a bottleneck.
>>>>>>>
>>>>>>> I've already set the parallelism for the fetch bolt fairly high,
>>>>>>> but are there any best practices for configuring a topology like
>>>>>>> this, where at least one bolt takes much more time to process than
>>>>>>> the rest?
>>>>>>>
>>>>>>
>>>>>>
>>>
>>
>
