A few of us are working on a book on Cassandra and got to the point where we
(well, I did anyway) wanted to include an example of a non-trivial inverted index.

I've been playing around with different ideas on how I could store the data,
and I've had a look at the previous threads that touched on the subject, but
for each of the 2 or 3 ideas I've seen on the list, someone always points out
something in the approach that punches a hole in it.

I've been playing around with the idea of using a ColumnFamily for the index
where I store each term as the row key, each column name is a 64-bit long, and
the column value is the doc id. If the column name represents a ranking for the
doc id it stores and the CompareWith option is LongType, then once a term is
retrieved the first x columns would represent the most related docs for that
term.
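
To make the layout concrete, here is a minimal in-memory sketch in Python. It
is not Cassandra client code: the class InvertedIndexModel, its methods, and
the sample terms and doc ids are all made up for illustration. It assumes that
a smaller column name means a better rank, since a LongType comparator orders
columns by ascending value.

```python
from collections import defaultdict


class InvertedIndexModel:
    """In-memory stand-in for the index ColumnFamily (hypothetical, not a client)."""

    def __init__(self):
        # row key (term) -> {column name (64-bit rank) -> column value (doc id)}
        self.rows = defaultdict(dict)

    def add(self, term, rank, doc_id):
        # Column names are unique within a row, so a second doc written with
        # the same rank for this term replaces the first one.
        self.rows[term][rank] = doc_id

    def top(self, term, count):
        # Mimic a slice of the first `count` columns in ascending name order.
        columns = sorted(self.rows.get(term, {}).items())
        return [doc_id for _rank, doc_id in columns[:count]]


index = InvertedIndexModel()
index.add("cassandra", 3, "doc-c")
index.add("cassandra", 1, "doc-a")
index.add("cassandra", 2, "doc-b")
index.add("index", 1, "doc-b")

top_two = index.top("cassandra", 2)
print(top_two)  # ['doc-a', 'doc-b']
```

One thing the sketch makes visible: if two docs can end up with the same rank
for a term, the column name would need something extra to keep them apart.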

I'd go on in more detail but I'm using my phone to write this and I think that
gets the idea across.

Of course, my first question about this is: is it scalable? In a system where
possibly millions of docs are related to one term, is it a good idea to have
potentially that many columns in one row, all associated with the one row key,
which is the term?

I just want to know what others think, if you have any suggestions or have a 
similar thing implemented and you're able to share.

On a side note, there has been a bit of talk about secondary indexes in
0.7. Can anyone shed some light on that, or point me to any presentation or the
like where it's mentioned, so I can get a better idea of what it's for?

Thanks,
Courtney
                                          
