On Mon, Mar 20, 2000 at 10:54:21PM -0700, Greg McGary wrote:
> "James A. Treacy" <[EMAIL PROTECTED]> writes:
> 
> > Is it really necessary to know the charset used on the page? As long
> > as searches are 8 bit clean I would think that it wouldn't make a
> > difference.
> 
> The issue is how does one delimit tokens? You need to know the
> character-classes in order to know when you have transitioned from one
> character class to another, and therefore ought to end the current
> token. You then need to know which sequences of character classes to
> keep and which to toss (keep "words", toss sequences of whitespace and
> punctuation). I suppose if non-word char classes (e.g., whitespace,
> punctuation) are consistent across all languages and charsets, then
> you can treat everything that's not a non-word as a word and be done
> with. I don't know enough about the subject to judge.
> 
Obviously, neither do I. :) All Americans should be taken out back and
forced to learn another language (actually, I understand 3 languages
poorly -- 4 if you include English).
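For what it's worth, here is roughly how I picture the character-class
idea, as a toy sketch in python. The word/non-word split below is made
up and certainly too naive for multibyte charsets:

    # Toy sketch: emit a token each time we cross from "word" characters
    # to "non-word" characters; runs of whitespace/punctuation are tossed.
    def is_word_char(c):
        # made-up "word" class; real code would need per-charset tables
        return c.isalnum() or c == "_"

    def tokens(text):
        token = ""
        for c in text:
            if is_word_char(c):
                token += c
            elif token:          # word -> non-word transition ends the token
                yield token
                token = ""
        if token:                # text ended while still inside a token
            yield token

    print(list(tokens("foo, bar -- baz?")))   # ['foo', 'bar', 'baz']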
> > Unless you are interested in creating a general purpose cgi frontend
> > it is probably better if you work on the searching/indexing and
> > specialized parsers while we create the cgi interface.
> 
> That's fine by me. The less I have to do the better. 8^) I'm sure I
> can mold the id-utils query interface to be whatever you like.
> 
Writing a general front end is tricky (there are a lot of details to
consider), but specific ones are easy. The only part that is a pain is
when you want to chop up results into multiple pages.

> > I'd love to help, but am already
> > overextended. :(
> 
> I think Craig is going to give that a go. We've been discussing it
> offline already. If you can handle the cgi frontend, that's plenty
> useful.
> 
Great.

> > With respect to parsers, do you have a suggestion on the best way to
> > handle the list archives? We currently generate a single file for each
> > list for each month (in standard mail format - some as big as 10MB).
> > Each file is then broken up into a directory containing one htmlized
> > file for each piece of mail. This generates a LOT of files. Do you
> > think it would be practical (from a speed point of view) to work
> > directly from the big files and extract the relevant mails on the fly?

> Don't queries need to return the html file names, since that's what
> the users will see?

No. The query is sent through a cgi script and we can simply let the
results be returned through that url. We also have the choice of
whether the input for the script is displayed. This results in those
funky urls you often see, e.g. (this is totally made up)

http://cgi.debian.org/cgi-bin/search?package=apache&arch=i386&version=stable

Of course, if the result of a cgi script is a static file, then it
makes sense to do a redirect to that file so the actual URL is
displayed.

> The users never see the 10 MB monthly files, do
> they?

No. We extract the relevant section of the big file, wrap it in html,
and return it to the user. It would clearly be faster to break up the
file beforehand. Maintaining all those little files is a pain though.
It's a typical size (number of files in this case) versus speed
tradeoff.

> Assuming the html files are what we want to index, we should
> just index them with no fancy footwork to save the open(2) system
> calls. Email archives are index-once-and-for-all things, especially
> when mkid can build incrementally. The development time of teaching
> the scanner about how to scan a large file but make index entries as
> though it had scanned the html files doesn't seem worth the trouble.
> 
There is no reason to pretend that the mail is html. html is not
something we care about from a searching perspective. In fact,
everything inside of html tags should be ignored (except for meta
tags). Note that being inside a tag (between '<' and '>') is different
from being between an opening and closing tag, e.g.
<p>This is a paragraph</p>. Craig is going to have fun with this,
because some container tags don't require the closing tag (<p> is a
good example).
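Just to be concrete about the kind of stripping I mean, something along
these lines (a rough sketch in python; the class name and the choice to
keep the content attribute of meta tags are just my guesses):

    # Sketch: pull out the text we would actually want to index from an
    # htmlized mail -- the text between tags plus the content of <meta>
    # tags -- while the markup itself is ignored. Unclosed container
    # tags like <p> need no special handling here.
    from html.parser import HTMLParser

    class IndexableText(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                # keep e.g. <meta name="description" content="...">
                self.chunks.extend(v for k, v in attrs if k == "content" and v)
        def handle_data(self, data):
            self.chunks.append(data)

    p = IndexableText()
    p.feed('<p>This is a paragraph<p>another one, never closed')
    print(" ".join(p.chunks))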
Since I started this, I'll finish the thought on using the large files
instead of many small ones so we can drop it. The more I think about
it, the more complicated it gets. The problem is we'd have to keep
track of two pieces of information to return a mail instead of just
one. When searching within a given month, a hit within a given piece
of mail would return the starting location (offset from the beginning
of the file) for that mail. Thus, each mail is uniquely specified by
file and offset.

Displaying that piece of mail should then be relatively fast: print
the html headers, open the file, move to the offset, read the mail
headers, mark up the relevant headers and print them (ignore the
others), read and print the body of the mail, print the html footers,
and close the file.

Drat. I forgot that the result page for a search should include the
sender, date and subject for each mail. More complications.

On the off chance that this is doable and worth pursuing, it would be
even better if we could compress each file and work from those. Picky,
aren't I. :)

-- 
James (Jay) Treacy
[EMAIL PROTECTED]
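P.S. In case it helps, here is the display step above as a rough python
sketch. The file layout, the encoding, and the assumption that the
offset points at the mail's mbox "From " separator line are all my
guesses:

    # Sketch: serve one mail straight out of the big monthly file, given
    # the (file, offset) pair a search hit would hand us.
    import html

    WANTED = ("From", "Date", "Subject")      # headers worth marking up

    def show_mail(mbox_path, offset):
        lines = []
        with open(mbox_path, "rb") as f:
            f.seek(offset)                    # jump straight to this mail
            f.readline()                      # skip the mbox "From " separator
            for raw in f:
                line = raw.decode("latin-1")
                if line.startswith("From "):  # start of the next mail
                    break
                lines.append(line)

        # mail headers end at the first blank line
        blank = lines.index("\n") if "\n" in lines else len(lines)
        headers, body = lines[:blank], lines[blank + 1:]

        print("Content-Type: text/html\n")    # cgi header, then the page
        print("<html><body><pre>")
        for h in headers:
            if h.split(":", 1)[0] in WANTED:  # only the headers we care about
                print("<b>%s</b>" % html.escape(h.rstrip()))
        print()
        print(html.escape("".join(body)))
        print("</pre></body></html>")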