On Mon, Mar 20, 2000 at 11:31:52AM -0700, Greg McGary wrote:
> > Files are of the form foo.lang.html, e.g. index.en.html.
>
> OK.  That makes it very easy.  What's the complete list of languages,
> and what charset encoding is used for each?  I'm a lowly mono-lingual
> ugly American, but I have a brain-trust of i18n pros, so I'll get them
> to help me figure out how best to code language-specific scanners.
>
Is it really necessary to know the charset used on the page? As long as
the searches are 8-bit clean, I would think it wouldn't make a
difference.
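[As a minimal sketch of the scanner side: the language tag can be pulled
straight off those foo.lang.html names. The charset table below is an
illustrative assumption, not the site's actual list; the real mapping
would have to come from the webmasters.]

#!/usr/bin/env python3
# Sketch: recover the language (and, hypothetically, a charset) from
# Debian-style translated filenames such as index.en.html.

import re

CHARSETS = {          # hypothetical mapping, for illustration only
    "en": "ISO-8859-1",
    "fr": "ISO-8859-1",
    "ja": "EUC-JP",
    "ru": "KOI8-R",
}

# Matches index.en.html, intro.zh-cn.html, and similar names.
FILENAME_RE = re.compile(r"\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?)\.html$")

def language_of(filename):
    """Return (lang, charset) for a translated page, or None."""
    m = FILENAME_RE.search(filename)
    if not m:
        return None
    lang = m.group("lang")
    return lang, CHARSETS.get(lang, "unknown")

print(language_of("index.en.html"))   # -> ('en', 'ISO-8859-1')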
> I should wget a representative sample of your site for use as test
> data.  Is there a subtree that has some of every language on the site?
>
The most commonly translated page is the main page: http://www.debian.org/ .
At the bottom are links to all the translations. If you start with the
English version, save it as index.en.html .

[snip]

> > Is this doable?
>
> Definitely.  What query API must I provide for Apache?  Just point me
> to the documentation.
>
> > <meta name="Keywords" content="debian, main, stable, size:88.3 apache">
>
> So, the above sample is for the package "apache", and the indexable
> key/value pair is labelled meta name/content, right?
>
> > This makes it easy for us to restrict the search to packages by
> > distribution (main, non-free or contrib), release (stable, unstable or
> > frozen) and package name (or substrings of the name).
>
> OK, that's another area that needs work.
>
Unless you are interested in creating a general-purpose CGI frontend, it
is probably better if you work on the searching/indexing and the
specialized parsers while we create the CGI interface.

In a separate mail you asked for help in creating the HTML parser (I hope
my terminology is correct). I'd love to help, but am already
overextended. :(

With respect to parsers, do you have a suggestion on the best way to
handle the list archives? We currently generate a single file for each
list for each month (in standard mail format, i.e. mbox - some as big as
10MB). Each file is then broken up into a directory containing one
htmlized file for each piece of mail. This generates a LOT of files. Do
you think it would be practical (from a speed point of view) to work
directly from the big files and extract the relevant mails on the fly?

-- 
James (Jay) Treacy
[EMAIL PROTECTED]
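[On the Keywords meta tag quoted above, a hedged parsing sketch. The
field order - debian, distribution, release, then "size:NN package" in
the last field - is inferred from that single example and may not hold
on every page.]

#!/usr/bin/env python3
# Sketch: extract the indexable key/value pairs from a package page's
# Keywords meta tag, guessing the field layout from the one sample
# quoted in this thread.

from html.parser import HTMLParser

class KeywordsParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "meta" or attrs.get("name", "").lower() != "keywords":
            return
        fields = [f.strip() for f in attrs.get("content", "").split(",")]
        if len(fields) >= 4 and fields[0] == "debian":
            size_tag, package = fields[3].split()
            self.record = {
                "distribution": fields[1],          # main, non-free, contrib
                "release": fields[2],               # stable, unstable, frozen
                "size": size_tag.split(":", 1)[1],  # "size:88.3" -> "88.3"
                "package": package,
            }

p = KeywordsParser()
p.feed('<meta name="Keywords" content="debian, main, stable, size:88.3 apache">')
print(p.record)
# {'distribution': 'main', 'release': 'stable', 'size': '88.3',
#  'package': 'apache'}

[A record like this is what makes the restrictions described above -
by distribution, release, or (sub)string of the package name - cheap
to apply at query time.]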
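[On the archive question at the end: one way the on-the-fly idea could
work is to record each mail's byte offset in the monthly file at index
time, then seek straight to it when serving a hit, instead of keeping
one htmlized file per message. A rough sketch under that assumption;
the archive filename is invented.]

#!/usr/bin/env python3
# Sketch: index messages in a monthly mbox file by byte offset and
# extract them on the fly.  Assumes standard mbox, where "From " at
# the start of a line delimits messages and body occurrences are
# escaped as ">From ".

def scan_offsets(mbox_path):
    """Return a list of (offset, length) pairs, one per message."""
    starts = []
    with open(mbox_path, "rb") as f:
        pos = 0
        for line in f:
            if line.startswith(b"From "):
                starts.append(pos)
            pos += len(line)
        starts.append(pos)          # sentinel: end of file
    return [(a, b - a) for a, b in zip(starts, starts[1:])]

def extract(mbox_path, offset, length):
    """Pull one message out of the big monthly file on the fly."""
    with open(mbox_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

table = scan_offsets("debian-devel.200003")   # hypothetical filename
print(extract("debian-devel.200003", *table[0])[:200])

[With an offset table like this stored alongside the index, extraction
is a seek plus a read, so serving hits out of a 10MB file should cost
about the same as opening one small per-message file.]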