On Fri, Sep 2, 2011 at 08:54, Mick Semb Wever <m...@apache.org> wrote:
> Patrik: is it possible to describe the use-case you have here?

Sure.

We use Cassandra as storage for web pages: we store the HTML, all URLs
that share the same HTML data, and some computed data. We run Hadoop
MR jobs to compute lexical and thematic data for each page and to
export the data to binary files for later use. A URL gets into
Cassandra on a user request (a pageview), so if we delete a URL, it
comes back quickly if the page is active. Because of that, and because
there is a lot of data, we have the keyspace set to RF=1. We can drop
the whole keyspace and it will regenerate quickly and contain only
fresh data, so we don't care about losing a node. But Hadoop does
care; to be specific, the Cassandra ColumnFamilyInputFormat and
ColumnFamilyRecordReader are the problem parts. If I stop one
Cassandra node, all MR jobs that read/write Cassandra fail. In our
case the missing data doesn't matter; we could simply skip that node's
range of URLs. The MR jobs run in a tight loop, so when the node is
back with its data, we use it. It's not only about HW crashes; it also
makes maintenance quite difficult. To stop a Cassandra node, you have
to stop the tasktracker there too, which is unfortunate as there are
other MR jobs that don't need Cassandra and could happily run.
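
To illustrate the behaviour we would like, here is a minimal sketch in
Python. It is not the Cassandra or Hadoop API: TokenRange, plan_splits
and the node addresses are made-up names used only for illustration.
With RF=1 each range has a single replica, so a range whose replica is
down would be skipped instead of failing the whole job.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class TokenRange:
    """Hypothetical stand-in for one input split: a token range and its replicas."""
    start: int
    end: int
    replicas: Tuple[str, ...]  # with RF=1 this holds exactly one node


def plan_splits(
    ranges: Iterable[TokenRange], live_nodes: Set[str]
) -> Tuple[List[TokenRange], List[TokenRange]]:
    """Partition ranges into (readable, skipped) rather than failing the job."""
    readable: List[TokenRange] = []
    skipped: List[TokenRange] = []
    for r in ranges:
        # A range is readable if at least one of its replicas is up.
        if any(node in live_nodes for node in r.replicas):
            readable.append(r)
        else:
            skipped.append(r)
    return readable, skipped


# Three ranges, one per node (illustrative addresses).
ranges = [
    TokenRange(0, 100, ("10.0.0.1",)),
    TokenRange(100, 200, ("10.0.0.2",)),
    TokenRange(200, 300, ("10.0.0.3",)),
]

# Assume 10.0.0.2 has been stopped for maintenance.
readable, skipped = plan_splits(ranges, live_nodes={"10.0.0.1", "10.0.0.3"})
```

The skipped ranges would simply be picked up on a later pass of the
loop, once the node is back.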

Regards,
P.
