I'm wondering what the performance considerations are for join-like queries in Cassandra.
I have a ColumnFamily that holds millions of records (not unusual, as I understand it) and I want to work on them using Pig and Hadoop. Until now we have always fetched all rows from Cassandra and done the filtering and processing in the job itself. The idea now is to introduce indices to speed up some of these analyses.

Let's assume we have page hits, each with an associated user, and many of our queries work on users. Creating an index ColumnFamily keyed by user id would be the logical choice, but that would mean storing all the data twice (once in the all-encompassing ColumnFamily and once as subcolumns in the index), and since we might add further indices, that would multiply our data size. In the relational world an index usually doesn't hold the data itself, only a pointer to the real entry. Would it be wise to store just the key of the referenced row in the index and then fetch the actual rows from the cluster iteratively (roughly as in the first sketch below)?

I'd also like to know how key range queries perform compared to simple key lookups, since I'd like to build a dynamic storage system that splits really large rows into smaller ones by specifying one more byte of the key: from a single row a\0\0\0\0 we might go to the rows a\0\0\0\0 through a\255\0\0\0, and then get all results back with a single range query from a\0\0\0\0 through a\255\255\255\255 (second sketch below). I have no idea if this is even possible, just playing around with some ideas :D

Regards,
Chris
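
P.S. To make the pointer idea concrete, here is a minimal sketch of what I mean, written against pycassa. The keyspace and ColumnFamily names ('MyKeyspace', 'PageHits', 'UserHitIndex') are made up; the index row stores the referenced row keys as column names with empty values:

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    hits = pycassa.ColumnFamily(pool, 'PageHits')            # big CF, key = hit id
    user_index = pycassa.ColumnFamily(pool, 'UserHitIndex')  # index CF, key = user id

    # Write the full row once, and only its key into the index row:
    # the column name is the pointer, the value stays empty.
    hits.insert('hit-00042', {'user': 'user-7', 'url': '/home'})
    user_index.insert('user-7', {'hit-00042': ''})

    # Query: read the pointers for a user, then fetch the real rows
    # in a single multiget instead of one lookup per pointer.
    pointers = user_index.get('user-7', column_count=1000).keys()
    for key, columns in hits.multiget(list(pointers)).items():
        print(key, columns)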
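
And here is the row-splitting idea sketched the same way (the 'SplitRows' CF is again hypothetical). As far as I understand, a key range like this only comes back as a contiguous slice when the cluster uses an order-preserving partitioner; under RandomPartitioner the keys are placed by hash and the range would be meaningless:

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    wide = pycassa.ColumnFamily(pool, 'SplitRows')

    # Writes pick one of 256 physical sub-rows of the logical row 'a'
    # by setting the second key byte (here byte 42).
    sub_key = 'a' + chr(42) + '\x00\x00\x00'
    wide.insert(sub_key, {'some-column': 'some-value'})

    # Reads collect all sub-rows of 'a' with one key-range scan.
    for key, columns in wide.get_range(start='a\x00\x00\x00\x00',
                                       finish='a\xff\xff\xff\xff'):
        print(key, columns)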