> On September 9, 2002 07:12, [EMAIL PROTECTED] wrote:
> > Bible is 31102 (if I counted correctly) verses. It is ~3.8Kbytes if a bit
> > for every verse.
>
> You counted all the verses in the Bible?! (grin)
>
> > Searching for "Christ & (God | Father)" we can construct 3 such bit vectors
> > (~10.6Kbytes) and then make logical operations over these.
>
> Bit vectors have some nice properties such as the ability to do very fast
> logical operations. However, they have some significant downsides as well:
>
> 1. They are very large to store for the Bible. I did a quick calculation and I
> figured the indexes I've built would increase approx 10 x if I stored them as
> bit vectors. The reason for this is that the average word occurs only 100
> times, at least in the KJV (I assume other word based languages are in the
> same order of magnitude). This means that 4K bit vectors are very sparse.

I am not suggesting storing every word this way, only the most frequently
occurring words (like "the").

> 2. Conversion to and from them can be costly computationally (especially
> converting from them). Since storing bit vectors and returning bit vectors to
> the frontends aren't options this would have to be considered.

If I remember correctly, the 80386 has a special instruction for finding set
bits in a bit vector. In any case, searching for non-zero bytes is fast.

> 3. Perhaps most significantly, bit vectors are only really a big improvement
> for logical operators. Verse and word proximity (i.e. within x verses, or
> within y words) are better done other ways. This could easily lead to
> multiple conversions to and from bit vectors just to complete one search
> expression.

I am not talking about verse proximity, but specifically about paragraphs
with specified boundaries!

> > I can (as will have time) even write necessary algorithms. If it will be
> > too slow for 80386, I can remember its assembler!
>
> Since Sword is a cross platform library, assembler isn't really an option (I
> know it is already compiled on at least 3 different CPU architectures). Plus,
> do you really think hand coded assembly would be much faster than what a good
> compiler could produce for a series of bitwise logical operations on arrays?

Isn't it only the 80386 that is slow?

-- 
Victor Porton ([EMAIL PROTECTED])
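
P.S. To make the idea concrete, here is a rough sketch in portable C (no
assembler) of the bit vector approach discussed above: one bit per verse,
the query Christ & (God | Father) as word-wise logical operations, and
conversion back to a verse list that skips all-zero words. This is only an
illustration under my own assumptions: the names VerseSet, vs_clear, vs_add,
vs_and_or and vs_to_list are made up and are not part of Sword, verses are
assumed to be numbered 0..31101, and how the per-word vectors get filled
from the real index is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_VERSES 31102u                      /* verse count quoted above */
#define NUM_WORDS  ((NUM_VERSES + 31u) / 32u)  /* 972 32-bit words = 3888 bytes */

/* Hypothetical type: one bit per verse, set if the verse contains the word. */
typedef struct { uint32_t w[NUM_WORDS]; } VerseSet;

/* Empty the set. */
static void vs_clear(VerseSet *s)
{
    memset(s->w, 0, sizeof s->w);
}

/* Mark verse number v (0-based) as containing the word. */
static void vs_add(VerseSet *s, unsigned v)
{
    assert(v < NUM_VERSES);
    s->w[v / 32u] |= (uint32_t)1 << (v % 32u);
}

/* out = a & (b | c), e.g. the query  Christ & (God | Father). */
static void vs_and_or(VerseSet *out, const VerseSet *a,
                      const VerseSet *b, const VerseSet *c)
{
    for (size_t i = 0; i < NUM_WORDS; i++)
        out->w[i] = a->w[i] & (b->w[i] | c->w[i]);
}

/* Convert back to an ascending verse list. Writes at most max entries and
   returns how many were written. All-zero words cost one comparison each. */
static size_t vs_to_list(const VerseSet *s, unsigned *out, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_WORDS; i++) {
        uint32_t word = s->w[i];
        if (word == 0)
            continue;                          /* sparse result: skip quickly */
        for (unsigned bit = 0; bit < 32u && n < max; bit++)
            if (word & ((uint32_t)1 << bit))
                out[n++] = (unsigned)(i * 32u + bit);
    }
    return n;
}
```

The whole query is a single pass over 972 words, which any reasonable
compiler should handle well on every architecture. Only the final result
is converted back to a verse list, once per search expression.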