At 10:13 AM 10/22/2001 -0700, Chris Little wrote:
>Would that be possible for a RE that involved crossing a word boundary?
>Something like /\<Jesus \w+d\>/, for example.  I suppose you could split
>up the RE itself by word boundaries, collecting a list of words that
>match /\<Jesus\>/ and words that match /\<\w+d\>/, then finding all
>instances where they come in order, separated by spaces.  But then you
>have to account for \s+ and .+, at which point I would give up and just
>reconstitute the whole verse string. :)

For inverted indexes, I believe you need to restrict your regular expressions to 
matching individual words only (more precisely, whatever terms are indexed) if you 
want to take advantage of the performance increase.  That isn't necessarily bad, 
though, if you add more search operators such as phrase, within n words, and followed 
by within n words (a phrase is just "followed by within 1 word", but it can be 
optimized further if handled separately).  In fact, you could end up with a much 
richer operator set and still be lightning fast.
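
A minimal sketch of the idea, in Python, assuming a positional inverted index (each 
term maps to the verses and word positions where it occurs).  The regular expression 
is applied only to the indexed terms, never to verse text, and the proximity operators 
work on word positions.  All function names, the verse ids, and the sample sentences 
are made up for illustration; this is not Dave Orme's design or my own index format.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and keep only word-like runs, so punctuation is ignored.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(verses):
    # term -> {verse id -> [word positions]}
    index = defaultdict(lambda: defaultdict(list))
    for vid, text in verses.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][vid].append(pos)
    return index

def match_terms(index, pattern):
    # Run the regex against each indexed term (not the text) and
    # merge the postings of every term it fully matches.
    rx = re.compile(pattern)
    merged = defaultdict(list)
    for term, postings in index.items():
        if rx.fullmatch(term):
            for vid, positions in postings.items():
                merged[vid].extend(positions)
    return merged

def followed_by_within(index, first, second, n):
    # Verses where a `second` match occurs 1..n words after a `first` match.
    a, b = match_terms(index, first), match_terms(index, second)
    hits = []
    for vid in a.keys() & b.keys():
        later = set(b[vid])
        if any(p + d in later for p in a[vid] for d in range(1, n + 1)):
            hits.append(vid)
    return sorted(hits)

def within(index, first, second, n):
    # Verses where the two matches are at most n words apart, in either order.
    a, b = match_terms(index, first), match_terms(index, second)
    return sorted(
        vid for vid in a.keys() & b.keys()
        if any(0 < abs(p - q) <= n for p in a[vid] for q in b[vid])
    )

def phrase(index, first, second):
    # A phrase is "followed by within 1 word".
    return followed_by_within(index, first, second, 1)

# Made-up sample data, not real module text.
verses = {
    "v1": "Jesus said unto them, Follow me.",
    "v2": "And Jesus, moved with compassion, answered.",
    "v3": "They said that Jesus was near.",
}
index = build_index(verses)
```

With this sketch, something like /\<Jesus \w+d\>/ becomes 
phrase(index, "jesus", r"\w+d"): two per-word patterns joined by an operator.  A 
real index would use a more compact postings layout and merge sorted position lists 
instead of the nested loops used here.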

I'm not sure about punctuation; I need to review his documentation first.  I've done 
work on this kind of inverted index before and I'm itching to see how he did that part! 
 ;o)  I've usually thought it desirable for text search engines to ignore punctuation 
anyway and just match words.

See http://beaver.dburry.com/cgi-perl/bible to see what I've done on inverted indexes 
before, geared specifically toward speed at all costs.  It's not sword-based, 
but I've been interested in integrating some of my stuff with sword for a long time 
(at least an import tool to convert sword modules to my index format).  Maybe this 
thing by Dave Orme will get me motivated!  Or maybe I'll scrap my work and just work 
on his...  I just want to see the best tool possible; who cares about ego?  ;o)

Dave
