David Burry wrote:

>At 10:13 AM 10/22/2001 -0700, Chris Little wrote:
>
>>Would that be possible for a RE that involved crossing a word boundary?
>>Something like /\<Jesus \w+d\>/, for example.  I suppose you could split
>>up the RE itself by word boundaries, collecting a list of words that
>>match /\<Jesus\>/ and words that match /\<\w+d\>/, then finding all
>>instances where they come in order, separated by spaces.  But then you
>>have to account for \s+ and .+, at which point I would give up and just
>>reconstitute the whole verse string. :)
>>
>
>For inverted indexes, I believe you need to restrict your regular expressions to 
>matching individual words only (more precisely, whatever terms are indexed) if you 
>want to take advantage of the performance increase.... but this isn't necessarily bad 
>if you add more search operators such as phrase, within n words, followed by within n 
>words (phrase = followed by within 1 word but it can be more optimized if done 
>separately), etc...  In fact you could end up with a much richer operator set and 
>still be lightning fast.
>
This would be the easiest way to do it.  You'll probably have all the other operators 
anyway, because joe user doesn't know REs and doesn't want to learn. 
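
To make the "followed by within n words" idea concrete, here is a rough Python sketch. It is not taken from my code or from David's. The names build_index, postings_for and followed_within are made up for illustration, and it ignores punctuation entirely. Each regex is matched only against single indexed words, so the verse text is never rescanned.

```python
import re
from collections import defaultdict

def build_index(verses):
    """Map each word to a list of (verse_id, position) postings."""
    index = defaultdict(list)
    for verse_id, text in verses.items():
        # Assumption: a "word" is a run of \w characters; punctuation is dropped.
        for pos, word in enumerate(re.findall(r"\w+", text)):
            index[word].append((verse_id, pos))
    return index

def postings_for(index, word_pattern):
    """Collect postings of every indexed word that fully matches a one-word regex."""
    rx = re.compile(word_pattern)
    hits = []
    for word, postings in index.items():
        if rx.fullmatch(word):
            hits.extend(postings)
    return hits

def followed_within(index, first, second, n=1):
    """Verse ids where a match of `first` is followed by a match of `second` within n words."""
    second_pos = defaultdict(set)
    for verse_id, pos in postings_for(index, second):
        second_pos[verse_id].add(pos)
    result = set()
    for verse_id, pos in postings_for(index, first):
        if any(pos + k in second_pos[verse_id] for k in range(1, n + 1)):
            result.add(verse_id)
    return sorted(result)

verses = {
    "John 11:35": "Jesus wept.",
    "Mark 1:41": "And Jesus, moved with compassion, put forth his hand",
    "John 2:2": "And both Jesus was called, and his disciples, to the marriage.",
}
index = build_index(verses)
print(followed_within(index, "Jesus", r"\w+d"))       # ['Mark 1:41']
print(followed_within(index, "Jesus", r"\w+d", n=2))  # ['John 2:2', 'Mark 1:41']
```

With n=1 this behaves as a phrase search; a real implementation would presumably merge sorted posting lists instead of building sets.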

However, I think it would be possible to analyze the RE (probably using REs), and 
break it down into multiple REs that each match a word, which transforms the problem 
into the one you described.  Whether this is worth the hassle is another question, 
though, as Chris pointed out.
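
A deliberately naive sketch of that breakdown, in Python, might look like the following. The function name split_word_regex is hypothetical, and the rule it uses (split on literal spaces, refuse anything that could cross a word boundary) is only an assumption about how one might start; it gives up in exactly the \s+ and .+ cases Chris mentioned.

```python
import re

def split_word_regex(pattern):
    r"""Split a phrase regex into one regex per word, or return None if that is unsafe.

    The \< and \> word-boundary anchors are dropped, since each piece is
    meant to be matched against one whole indexed word anyway.
    """
    stripped = pattern.replace(r"\<", "").replace(r"\>", "")
    # Conservative check: \s, '.', character classes, groups and alternation
    # may all match across a word boundary, so refuse to split those.
    if re.search(r"\\s|\.|\[|\(|\|", stripped):
        return None
    return [piece for piece in stripped.split(" ") if piece]

print(split_word_regex(r"\<Jesus \w+d\>"))    # ['Jesus', '\\w+d']
print(split_word_regex(r"\<Jesus\s+\w+d\>"))  # None
```

Each returned piece can then be matched against the dictionary of indexed words, and the per-word results combined with a phrase or "followed by" operator. A None result means falling back to scanning the reconstituted verse text.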

>I'm not sure about punctuation, I need to review his documentation first.  I've done 
>work on this kind of inverted index before and I'm itchin to see how he did that 
>part!  ;o)  I've usually thought it desirable for text search engines to ignore 
>punctuation anyway and just match words.
>
In my code, punctuation is treated like a word; each punctuation mark gets its own 
entry in the dictionary file, ....  See the docs / code for details.  Actually, the 
code that tokenizes the Bible into "words" for the dictionary is generated using flex, 
so you might just want to dig into that.
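
For a feel of what "punctuation is treated like a word" means, here is a small Python stand-in. The real tokenizer is flex-generated and its rules are surely different; the regex and the names tokenize and build_dictionary below are assumptions for illustration only.

```python
import re

# Assumption: a token is either a run of word characters or a single
# non-space, non-word character (that is, one punctuation mark).
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into words and individual punctuation marks."""
    return TOKEN.findall(text)

def build_dictionary(tokens):
    """Assign each distinct token, punctuation included, its own entry id."""
    dictionary = {}
    for tok in tokens:
        dictionary.setdefault(tok, len(dictionary))
    return dictionary

tokens = tokenize("And Jesus, moved")
print(tokens)                    # ['And', 'Jesus', ',', 'moved']
print(build_dictionary(tokens))  # {'And': 0, 'Jesus': 1, ',': 2, 'moved': 3}
```

Because the comma gets its own entry, a search can either require it or skip over it when matching adjacent words.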

>See http://beaver.dburry.com/cgi-perl/bible to see what I've done on inverted indexes 
>before, specifically geared toward speed at all costs otherwise.  It's not 
>sword-based but I've been interested in integrating some of my stuff with sword for a 
>long time (at least an import tool to convert sword modules to my index format), 
>maybe this thing by Dave Orme will get me motivated!  Or maybe I'll scrap my work and 
>just work on his...  I just want to see the best tool possible who cares about ego... 
> ;o)
>
Let me know what you think after you've read the docs/code. Maybe you 
can do it better than I can.  ;-)  I'll check out your web site too.


Best,

Dave

-- 
The number of UNIX installations has grown to 10, with more expected.
   -- The Unix Programmer's Manual, 2nd Edition, June 1972



