Greetings,

I started getting my feet wet with Cassandra in earnest this week. I'm building a custom inverted index of sorts on top of Cassandra, in part inspired by Jake Luciani's work on Lucandra. I've successfully loaded nearly a million documents onto a 3-node cluster, and initial query tests look promising.

The problem is that our target use case has hundreds of millions of documents (each document is very small, however), so loading time will be an important factor. I've investigated using the BinaryMemtable (BMT) interface, as found in contrib/bmt_example, to speed up bulk insertion. I have a prototype that successfully inserts data using BMT, but there is a problem: if I perform multiple writes for the same row key and column family, the row ends up containing only one of the writes.

I'm guessing this is because with BMT I need to group all writes for a given row key and column family into one operation, rather than writing incrementally as is possible with the Thrift interface. Hadoop is the obvious way to do such a grouping. Unfortunately, we can't run such a process over our entire dataset at once; we will need to do it in increments.

So my question is: if I properly flush every node after performing a larger bulk insert, can Cassandra merge multiple writes to a single row and column family when using the BMT interface? Or is BMT only feasible for loading data into rows that don't exist yet?

Thanks in advance,
Toby Jungen
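
For concreteness, the per-batch grouping described above could look roughly like the sketch below. It only shows the in-memory merge of writes that share a row key and column family, so that each such pair is emitted once per increment. `RowGrouper`, `add`, `drain`, and `key` are hypothetical names made up for illustration; none of this is Cassandra or BMT API, and the step that serializes a merged row and sends it through BMT is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: buffers writes and merges those that target the
// same (row key, column family) pair, so each pair is emitted only once.
public class RowGrouper {
    // Composite key -> (column name -> value), kept in insertion order.
    private final Map<String, Map<String, String>> pending = new LinkedHashMap<>();

    // Builds the composite key; assumes neither part contains the NUL character.
    public static String key(String rowKey, String columnFamily) {
        return rowKey + '\u0000' + columnFamily;
    }

    // Records one column write; a later write to the same column overwrites the earlier one.
    public void add(String rowKey, String columnFamily, String column, String value) {
        pending.computeIfAbsent(key(rowKey, columnFamily), k -> new TreeMap<>())
               .put(column, value);
    }

    // Returns all merged rows for this increment and clears the buffer.
    // Each returned entry would become a single bulk write.
    public Map<String, Map<String, String>> drain() {
        Map<String, Map<String, String>> out = new LinkedHashMap<>(pending);
        pending.clear();
        return out;
    }
}
```

This only removes duplicates within one increment. Whether rows written in separate increments are merged after a flush is exactly the open question above.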