Greetings,

Started getting my feet wet with Cassandra in earnest this week. I'm
building a custom inverted index of sorts on top of Cassandra, in part
inspired by the work of Jake Luciani in Lucandra. I've successfully loaded
nearly a million documents over a 3-node cluster, and initial query tests
look promising.

The problem is that our target use case has hundreds of millions of
documents (each document is very small however). Loading time will be an
important factor. I've investigated using the BinaryMemtable interface (as
found in contrib/bmt_example) to speed up bulk insertion. I have a prototype
up that successfully inserts data using BMT, but there is a problem.

If I perform multiple writes for the same row key & column family, the row
ends up containing only one of the writes. I'm guessing this is because with
BMT I need to group all writes for a given row key & column family into a
single operation, rather than applying them incrementally as the Thrift
interface allows. Hadoop is the obvious tool for doing such a grouping.
Unfortunately, we can't run such a job over our entire dataset; we will
need to do it in increments.
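To illustrate the kind of grouping I mean (a plain-Java sketch only, with made-up names; nothing here is Cassandra's actual API): all pending writes sharing a row key are collapsed into one combined set of columns, so the BMT loader would see exactly one operation per row rather than several incremental ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical pre-grouping step before handing rows to a BMT-style loader:
// collapse every pending write that shares a row key (within one column
// family) into a single combined mutation. Illustrative names throughout.
public class WriteGrouper {
    // rowKey -> (columnName -> value); each inner map is one combined mutation
    private final Map<String, Map<String, byte[]>> pending = new LinkedHashMap<>();

    public void addWrite(String rowKey, String columnName, byte[] value) {
        pending.computeIfAbsent(rowKey, k -> new LinkedHashMap<>())
               .put(columnName, value); // a later write to the same column wins
    }

    // Each entry here would become exactly one bulk insert for that row.
    public Map<String, Map<String, byte[]>> groupedMutations() {
        return pending;
    }

    public static void main(String[] args) {
        WriteGrouper g = new WriteGrouper();
        g.addWrite("doc1", "title", "hello".getBytes());
        g.addWrite("doc1", "body", "world".getBytes());
        g.addWrite("doc2", "title", "other".getBytes());
        // doc1's two writes are merged into a single mutation
        System.out.println(g.groupedMutations().get("doc1").size()); // prints 2
        System.out.println(g.groupedMutations().size());             // prints 2
    }
}
```

The open question below is whether this grouping really must cover the whole dataset at once, or whether per-batch grouping plus a flush is enough.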

So my question is: if I properly flush every node after performing a large
bulk insert, can Cassandra merge multiple writes to a single row & column
family when using the BMT interface? Or is BMT only feasible for loading
data into rows that don't exist yet?

Thanks in advance,
Toby Jungen
