It looks like there is some interest, so I'm going to disgorge everything I've learned/considered in the past couple of weeks just so that we have a consistent base. I'm going to break down how the indexes work and the different optimizations and drawbacks, and try to address the points/questions people have raised. I've broken it down by subject.
Basic Equality: Bitmap indexes are essentially giant arrays of bits, one bit per possibility. If you want to know which rows have the boolean value "event1" set to true, you set the bit at the address of each of those rows to 1 in the index. For example, the index 0100100101 means that rows 1, 4, 7 and 9 contain "event1". If you want to know which rows contain both event1 and event2, you do a bitwise AND over the two indexes to get the set intersection, e.g. 00100101 AND 10000101 results in 00000101. From this you can build up complex queries by doing simple set arithmetic. Each of these sets is called a query dimension.

Range Queries: If you want to encode ranges, such as "give me all users who have a counter in the integer interval [0, 2]", you need a two dimensional bitmap index. The first dimension records which of the possible values in [0, 7] have been hit: 10010011. The second dimension records, for each possible value, which rows contain that value. So for value 0 there would be another index, 00100010, which means that rows 2 and 6 contain value 0. This forms a giant two dimensional array:

[1] [00100010]
[0] [00001010]
[0] [10100110]
[1] [00100010]
[0] [00101010]
[0] [01101010]
[0] [00111110]
[1] [00100110]
[1] [00100000]

To figure out who has a counter value in [0, 2], you take the union of the sets [00100010], [00001010] and [10100110], which is [10101110].

Binning: Each index has an address size which limits the total number of rows/values you can encode. A 32 bit index can encode a total of 4,294,967,296 positions. If you need to encode more values than the address space allows, you can do one of two things: increase the address space or perform binning. Binning is essentially deliberate hash collision: two or more values are assigned to the same position. For example, if you wanted to index floats you could use the same index as above, but to find the rows containing real numbers in [0, 1.9] you would then have to check the actual stored value for every row in the result set. This is called a candidate check, and it is very expensive, often the most expensive part of a query.

Segments: If you increase the address size of the index, you run into space/memory problems. A fully materialized 32 bit index is 512 MB of raw bits; a 64 bit index is 2 exabytes. To solve this you use segments/partitions, or compression. Segments work through the use of a sparse table or interval list: you break the index up into chunks of equal size, say 2,048 bits. If a bit gets flipped at position 4,098, you go to segment 0x02 and, if it exists, flip the bit there; if that segment doesn't exist yet, you create it and then set the bit. The advantage is that you only have to create segments that actually contain flipped bits. The downside is that with a wide distribution of flipped bits you end up with many segments containing only one or two set bits, and the rest is empty space/wasted memory.

Compression: The other approach is compression. The trick is that you need a compression scheme that doesn't require decompressing the index before doing the bitwise operations, which in practice means run length encoding (RLE). There are three leading contenders for king of the hill: CONCISE, which is used by Druid; WAH, which is used by FastBit; and PWAH, which I don't think anybody uses yet. They all work the same way: long runs of zeros or ones are stored as encoded fill words instead of being materialized bit for bit.
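To make the fill word idea concrete, here is a minimal sketch in Java of ANDing two run length encoded bitmaps without decompressing them first. This is not the actual CONCISE/WAH/PWAH format, all of the names are mine, and it assumes both bitmaps are non-empty and cover the same number of 64 bit words:

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of run-length-encoded bitmaps that can be ANDed without
// decompression. A "fill" run is N consecutive 64-bit words that are all
// zeros or all ones; a "literal" run is a single explicit word.
final class RleBitmap {
    final List<Integer> lengths = new ArrayList<>(); // run length in words
    final List<Long> words = new ArrayList<>();      // 0L/~0L for fills, else the literal word

    void addFill(int wordCount, boolean ones) {
        lengths.add(wordCount);
        words.add(ones ? ~0L : 0L);
    }

    void addLiteral(long word) {
        lengths.add(1);
        words.add(word);
    }

    // AND two compressed bitmaps into a compressed result. Assumes both are
    // non-empty and encode the same number of words.
    static RleBitmap and(RleBitmap a, RleBitmap b) {
        RleBitmap out = new RleBitmap();
        int ia = 0, ib = 0;                                   // current run in a / b
        int remA = a.lengths.get(0), remB = b.lengths.get(0); // words left in each run
        while (true) {
            long wa = a.words.get(ia), wb = b.words.get(ib);
            int step = Math.min(remA, remB);   // how many words we can handle at once
            if (wa == 0L || wb == 0L) {
                out.addFill(step, false);      // anything AND a zero fill is a zero fill
            } else if (wa == ~0L && wb == ~0L) {
                out.addFill(step, true);       // ones fill AND ones fill stays a ones fill
            } else {
                out.addLiteral(wa & wb);       // at least one side is a literal, so step == 1
            }
            remA -= step; remB -= step;
            if (remA == 0) { if (++ia == a.lengths.size()) break; remA = a.lengths.get(ia); }
            if (remB == 0) { if (++ib == b.lengths.size()) break; remB = b.lengths.get(ib); }
        }
        return out;
    }
}

The real formats word-align the fills and pack the run length into the same word as the flag bits, but the reason they never need a decompression pass is exactly the run-at-a-time loop above: a fill of a million zero words is consumed in a single step.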
As a concrete example, here is an index taken from PWAH: [101 10000000001011 111010000100101 000000000010010], which means 11 blocks of one bits, a literal block of 15 bits, and 18 blocks of zero bits. The downside to this approach is that you lose the ability to jump directly to a position in the index. If you want to perform an update you either have to read/decode up to that position and split the encoding, or rebuild the entire index. This is why Druid and FastBit are very good for static data sets but can't deal with fast changing data.

A hybrid approach to performance: You can take a hybrid approach, which means combining segments and compression. You break the index address space up into segments and then compress within each segment, which negates the problem of many segments with only one bit flipped still consuming a full segment's worth of memory. It also limits the cost of having to split the encoding of any compressed segment when it is updated. As segments fill up you can compress and join them together using the standard SSTable/LevelDB approach.

Distributing work across the ring: If you use segments and a uniform index size, you can assign segments of the index to different portions of the C* ring. One node would then have all the segments necessary to answer queries involving its portion of the address space. If that also corresponds to the rows it owns, it can answer with certainty which rows in its address space match the query, and to get the complete result you just merge the per-node segments into the final set.

Other optimizations possibly worth considering: If you know the size of each index beforehand, you can perform the bitwise operations across indexes in order of increasing size. If you know that the index for event1 only has one segment populated, you only need to look at the corresponding segment in the other indexes, even if those indexes are fully materialized (there is a rough sketch of this at the end of this post). There are also other bitmap index types that encode ranges more efficiently, but I don't know enough about them yet to render an opinion.

Other optimizations not worth considering: There are papers on limiting the IO cost of candidate checks through optimal binning strategies, but that requires modeling, or having full knowledge of, the distribution of your data, which isn't feasible.

Query planning/limitations: To be honest I haven't dug into the existing CQL query planner yet, so I can't speak to Jonathan's point about its complexity. What I can say is that the query logic for bitmap indexes is pretty simple: you break the query down into a set of boolean operations over the selected sets, and it can be run in parallel by performing the same operations over adjacent segments. If you know the segment sizes you can order the operations to limit the amount of work. To simplify things I would not allow indexing of CQL3 collections. My temptation would be to ship with 2D indexes so we get range query support (I also need it for my use case), but maybe that comes in a later iteration. Trigram support, fuzzy matching and regexes just get layered on top of these fundamental concepts, and you can steal the hard bits from existing libraries/projects.

Further research: Some thought needs to go into the appropriate segment size, index size, binning strategy, and hash selection for encoding non-numeric values. That's basically it. I'm sure I've forgotten something.
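Here is the rough sketch I mentioned for the segment-at-a-time evaluation. It is only meant to show the shape of the query logic; the class and method names are mine, the segment size is arbitrary, and it uses a plain java.util.BitSet per segment where the hybrid approach above would use a compressed bitmap:

import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of a segmented bitmap index: the row-id address space is cut
// into fixed-size segments and only segments that contain set bits are stored.
final class SegmentedIndex {
    static final int SEGMENT_BITS = 8 * 1024;               // arbitrary segment size for the sketch
    final Map<Integer, BitSet> segments = new HashMap<>();  // segment id -> bits

    void set(long rowId) {
        int segment = (int) (rowId / SEGMENT_BITS);         // which segment owns this row
        int offset = (int) (rowId % SEGMENT_BITS);          // bit position inside it
        segments.computeIfAbsent(segment, s -> new BitSet(SEGMENT_BITS)).set(offset);
    }

    // AND several indexes together, driving the loop from the index with the
    // fewest populated segments so we never touch segments that cannot
    // possibly survive the intersection, even if the other indexes are dense.
    static Map<Integer, BitSet> and(List<SegmentedIndex> indexes) {
        List<SegmentedIndex> bySize = new ArrayList<>(indexes);
        bySize.sort(Comparator.comparingInt((SegmentedIndex ix) -> ix.segments.size()));
        SegmentedIndex smallest = bySize.get(0);
        Map<Integer, BitSet> result = new HashMap<>();
        for (Map.Entry<Integer, BitSet> e : smallest.segments.entrySet()) {
            BitSet acc = (BitSet) e.getValue().clone();
            for (int i = 1; i < bySize.size() && !acc.isEmpty(); i++) {
                BitSet other = bySize.get(i).segments.get(e.getKey());
                if (other == null) { acc.clear(); break; } // missing segment = all zeros
                acc.and(other);
            }
            if (!acc.isEmpty()) result.put(e.getKey(), acc);
        }
        return result;
    }
}

Each segment's AND is independent, which is what makes it easy to run in parallel or to hand whole segments to the node that owns that slice of the address space.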
I'm really looking forward to any feedback, questions, or criticisms you may have. I would really like to see something like this in C*, and I'm offering up my dedicated time to make it happen if I can get the goals/timeline of my employer and the project to line up. I've already put the book I was writing on hold in order to make time. I understand if you say no; you have the greater project to consider. If you do say no, we will most likely continue with this as a separate query/index engine and C* will just act as the storage layer. If we go it alone, the resulting code may or may not be open source; that hasn't been decided yet.