Hi Daniel,
In Trident it is possible to do batch aggregations. If the spout emits X pages
for a batch, the aggregation can run over exactly that batch.
In your example, the spout keeps emitting the pages of a website as tuples of
a single batch. Once it has no more pages to emit, the spout signals the
completion of the batch.
For that batch you can then do the aggregation using a State and persist the
values in any storage system. After that, the report can be generated.
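To make the per-batch aggregation concrete, here is a minimal sketch in plain
Java. It deliberately does not use the Trident API. PageResult, Report and
aggregate are hypothetical names made up for illustration, and the Google
Analytics check is the only metric shown. In Trident, the equivalent logic
would sit in an aggregator applied to the batch, with the result persisted
through a State.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these types come from Storm/Trident.
class BatchAggregationSketch {

    // Hypothetical per-page result, as the analyzer step might emit it.
    static final class PageResult {
        final String url;
        final boolean hasGoogleAnalytics;

        PageResult(String url, boolean hasGoogleAnalytics) {
            this.url = url;
            this.hasGoogleAnalytics = hasGoogleAnalytics;
        }
    }

    // Hypothetical aggregated report for one finished batch (one website).
    static final class Report {
        final int pagesSeen;
        final int pagesWithAnalytics;

        Report(int pagesSeen, int pagesWithAnalytics) {
            this.pagesSeen = pagesSeen;
            this.pagesWithAnalytics = pagesWithAnalytics;
        }

        // True only if at least one page was seen and every page has analytics.
        boolean allPagesHaveAnalytics() {
            return pagesSeen > 0 && pagesSeen == pagesWithAnalytics;
        }
    }

    // Runs once over a complete batch, i.e. after the spout has signalled
    // that no more pages will arrive for this website.
    static Report aggregate(List<PageResult> batch) {
        int withAnalytics = 0;
        for (PageResult page : batch) {
            if (page.hasGoogleAnalytics) {
                withAnalytics++;
            }
        }
        return new Report(batch.size(), withAnalytics);
    }

    public static void main(String[] args) {
        List<PageResult> batch = Arrays.asList(
                new PageResult("/", true),
                new PageResult("/about", true),
                new PageResult("/contact", false));

        Report report = aggregate(batch);
        System.out.println(report.pagesSeen + " pages, "
                + report.pagesWithAnalytics + " with analytics, all covered: "
                + report.allPagesHaveAnalytics());
    }
}
```

The point of the sketch is that the aggregation only sees a batch once it is
complete, so it never needs to know in advance how many pages there will be.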
-Nikhil
On Wednesday, June 3, 2015 6:04 AM, Daniel Sachse <[email protected]>
wrote:
Hey folks,
I am currently evaluating Trident as a replacement for our website analysis
tool. We currently have several components that do crawling, analyzing,
aggregation and reporting. They talk to each other via message queues.
I think that most of our current infrastructure code can be replaced by Storm's
Trident, but there is one point where I am unsure whether this is possible:
When we crawl a website, we don't know in advance how many pages will be
crawled. Once our Crawler does not detect any new pages, it fires an
aggregation event and we check, for example, whether all subpages have Google
Analytics installed. We include several more metrics and send a report.
A simple flowchart: 1 Crawler produces X pages, Analyzer consumes 1 page and
produces 1 result, Aggregator consumes X results and produces 1 report,
Reporting consumes 1 report and produces 1 enriched report in Y formats.
The critical part is the migration of our aggregation system because, as far
as I understand, aggregation in Trident is only possible in real time and not
batch-wise. What I would like to know is whether there is a way to say: "Do
the aggregation once there has not been any new data for 5 minutes or so".
Is this somehow achievable? Or do you see any other methods I could use? Or is
this the wrong use case for Trident?
Best regards,
Daniel