David Burry wrote:

>At 10:13 AM 10/22/2001 -0700, Chris Little wrote:
>
>>Would that be possible for a RE that involved crossing a word boundary?
>>Something like /\<Jesus \w+d\>/, for example. I suppose you could split
>>up the RE itself by word boundaries, collecting a list of words that
>>match /\<Jesus\>/ and words that match /\<\w+d\>/, then finding all
>>instances where they come in order, separated by spaces. But then you
>>have to account for \s+ and .+, at which point I would give up and just
>>reconstitute the whole verse string. :)
>>
>
>For inverted indexes, I believe you need to restrict your regular expressions to
>matching individual words only (more precisely, whatever terms are indexed) if you
>want to take advantage of the performance increase.... but this isn't necessarily bad
>if you add more search operators such as phrase, within n words, followed by within n
>words (phrase = followed by within 1 word but it can be more optimized if done
>separately), etc... In fact you could end up with a much richer operator set and
>still be lightning fast.
>

This would be the easiest way to do it. You'll probably have all the other
operators anyway, because Joe User doesn't know REs and doesn't want to learn.
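
To make the word-at-a-time idea concrete, here is a rough Python sketch of what Chris describes above: match each per-word RE against the indexed terms, then keep only the places where the matches come in order. It is only an illustration under simple assumptions (one flat list of words, a positional index held in memory). The names build_index, positions_matching and phrase_search and the sample text are made up for this sketch; none of it is taken from the index code or from sword. Python's fullmatch stands in for the \< and \> word anchors.

```python
import re
from collections import defaultdict


def build_index(words):
    """Map each distinct term to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index


def positions_matching(index, pattern):
    """Positions of every indexed term that the per-word RE matches in full."""
    rx = re.compile(pattern)
    hits = set()
    # Only the dictionary of distinct terms is scanned, never the full text.
    for term, positions in index.items():
        if rx.fullmatch(term):
            hits.update(positions)
    return hits


def phrase_search(index, word_patterns):
    """Start positions where the per-word REs match consecutive terms."""
    candidates = positions_matching(index, word_patterns[0])
    for offset, pattern in enumerate(word_patterns[1:], start=1):
        following = positions_matching(index, pattern)
        candidates = {p for p in candidates if p + offset in following}
    return sorted(candidates)


# Hypothetical sample text, already split into words.
words = "and Jesus answered and said unto them Jesus wept".split()
index = build_index(words)

# /\<Jesus \w+d\>/ split by hand into one RE per word.
starts = phrase_search(index, [r"Jesus", r"\w+d"])
print(starts)
```

A "within n words" operator would follow the same pattern, with the exact offset test relaxed to a range. Splitting an arbitrary RE into per-word pieces automatically (handling \s+ and .+) is the hard part, and this sketch does not attempt it.
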
However, I think it would be possible to analyze the RE (probably using REs),
and break it down into multiple REs that each match a word, which transforms
the problem into the one you described. Whether this is worth the hassle is
another question, though, as Chris pointed out.

>I'm not sure about punctuation, I need to review his documentation first. I've done
>work on this kind of inverted index before and I'm itchin to see how he did that
>part! ;o) I've usually thought it desirable for text search engines to ignore
>punctuation anyway and just match words.
>

In my code, punctuation is treated like a word; each punctuation mark gets its
own entry in the dictionary file, .... See the docs / code for details.
Actually, the code that tokenizes the Bible into "words" for the dictionary is
generated using flex, so you might just want to dig into that.

>See http://beaver.dburry.com/cgi-perl/bible to see what I've done on inverted indexes
>before, specifically geared toward speed at all costs otherwise. It's not
>sword-based but I've been interested in integrating some of my stuff with sword for a
>long time (at least an import tool to convert sword modules to my index format),
>maybe this thing by Dave Orme will get me motivated! Or maybe I'll scrap my work and
>just work on his... I just want to see the best tool possible who cares about ego...
>;o)
>

Let me know what you think after you've read the docs/code. Maybe you can do it
better than I can. ;-) I'll check out your web site too.

Best,

Dave

--
The number of UNIX installations has grown to 10, with more expected.
    -- The Unix Programmer's Manual, 2nd Edition, June 1972