> (This is a case where 1/3 of the rows are of type 2, but, say, only a few
> hundred rows of type 2 have e=5.)

How many rows would have e=5 without worrying about their type value?

Aaron

On 14 Apr 2011, at 23:48, David Boxenhorn wrote:

> Thanks. I'm aware that I can roll my own. I wanted to avoid that, for ease of
> use, but especially for atomicity concerns.
>
> I thought that the secondary index would bring into memory all keys where
> type=2, and then iterate over them to find keys where e=5. (This is a case where
> 1/3 of the rows are of type 2, but, say, only a few hundred rows of type 2
> have e=5.) The reason why I put "type" first is that queries on type will
> always be an exact match, whereas the other clauses might be inequalities.
>
> On Thu, Apr 14, 2011 at 2:07 PM, aaron morton <aa...@thelastpickle.com> wrote:
> You could make your own inverted index by using keys like "e=5-type=2" where
> the columns are either the keys for the object or the objects themselves.
> Then just grab the full row back, if you know you always want to run queries
> like that.
>
> This recent discussion and blog post from Ed is good background
> http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html
>
> I'm not sure how efficient the join from "e" to type would be. AFAIK it will
> iterate all keys where e=5 and look up the corresponding rows to find out if
> type = 2.
>
> If you know how you want to read things back and need to deal with
> lots-o-data, I would start testing with custom indexes. Then compare to the
> built-in ones; it should be reasonably simple to add them for a test.
>
> Hope that helps.
> Aaron
>
> On 14 Apr 2011, at 22:33, David Boxenhorn wrote:
>
>> Thank you for your answer, and sorry about the sloppy terminology.
>>
>> I'm thinking of the scenario where there are a small number of results in
>> the result set, but there are billions of rows in the first of your
>> secondary indexes.
>>
>> That is, I want to do something like (not sure of the CQL syntax):
>>
>> select * where type=2 and e=5
>>
>> where there are billions of rows of type 2, but some manageable number of
>> those rows have e=5.
>>
>> As I understand it, secondary indexes are like column families, where each
>> value is a column. So the billions of rows where type=2 would go into a
>> single row of the secondary index. This sounds like a problem to me. Is it?
>>
>> I'm assuming that the billions of rows that don't have column "e" at all
>> (those rows of other types) are not a problem at all...
>>
>> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com>
>> wrote:
>> Need to clear up some terminology here.
>>
>> Rows have a key and can be retrieved by key. This is *sort of* the primary
>> index, but not primary in the normal RDBMS sense.
>> Rows can have different columns and the column names are sorted and can be
>> efficiently selected.
>> There are "secondary indexes" in Cassandra 0.7 based on column values
>> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>>
>> So you could create secondary indexes on the a, e, and h columns and get
>> rows that have specific values. There are some limitations to secondary
>> indexes; read the linked article.
>>
>> Or you can make your own secondary indexes using row keys as the index
>> values.
>>
>> If you have billions of rows, how many do you need to read back at once?
>>
>> Hope that helps
>> Aaron
>>
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>>
>>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
>>> different sets of columns?
>>>
>>> For example, let's say you have three types of objects (1, 2, 3) which each
>>> had three members. If your rows had the following pattern
>>>
>>> type=1 a=? b=? c=?
>>> type=2 d=? e=? f=?
>>> type=3 g=? h=? i=?
>>>
>>> could you index "type" as your primary index, and also index "a", "e", "h"
>>> as secondary indexes, to get the objects of that type that you are looking
>>> for?
>>>
>>> Would it work if you had billions of rows of each type?
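
The custom inverted index suggested in the thread (index rows keyed like "e=5-type=2", whose columns are the keys of the matching objects) can be sketched with plain Python dicts standing in for column families. This is only an illustration of the data layout, not Cassandra client code: the names index_key, put_object and query, and the sample rows, are hypothetical.

```python
from collections import defaultdict

# Plain dicts stand in for two column families (assumption: no real
# Cassandra client or API is used anywhere in this sketch).
objects = {}                        # object row key -> that object's columns
inverted_index = defaultdict(set)   # index row key -> object row keys ("columns")


def index_key(e, type_):
    """Build an index row key in the "e=5-type=2" style from the thread."""
    return f"e={e}-type={type_}"


def put_object(row_key, columns):
    """Store an object and, if it has both 'e' and 'type', index it.

    Against a real cluster these would be two separate writes, which is
    the atomicity concern raised in the thread. Stale index entries are
    not cleaned up when an object is overwritten.
    """
    objects[row_key] = columns
    if "e" in columns and "type" in columns:
        inverted_index[index_key(columns["e"], columns["type"])].add(row_key)


def query(e, type_):
    """Read one index row, then fetch the objects it points to."""
    keys = inverted_index.get(index_key(e, type_), set())
    return {k: objects[k] for k in sorted(keys)}


# Hypothetical sample data following the type=2 d=? e=? f=? pattern.
put_object("obj1", {"type": 2, "d": 1, "e": 5, "f": 9})
put_object("obj2", {"type": 2, "d": 4, "e": 7, "f": 0})
put_object("obj3", {"type": 1, "a": 5, "b": 2, "c": 3})
put_object("obj4", {"type": 2, "d": 8, "e": 5, "f": 6})

print(query(5, 2))
```

With this layout the read touches a single index row, so its cost follows the number of matches rather than the number of type=2 rows. It only covers exact matches on both values; inequality clauses on "e" would need a different key or column layout.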