Christian Perrier wrote:
> The first step of the process is to review the debconf source
> template file(s) of htdig. This review will start on Friday,
> December 21, 2007, or as soon as you acknowledge this mail with an
> agreement for us to carry out this process.
I'll be away from keyboard for the next week, so I'll share my rough
notes in advance.

The package description needs quite a lot of rephrasing. For a start,
its short description:

-Description: WWW search system for an intranet or small internet

ht://Dig is precisely not a World Wide Web search engine - it's a
local website search engine. And what's a "small internet"?

+Description: web search engine for intranets

- The ht://Dig system is a complete World Wide Web indexing and searching
+ The ht://Dig system is a complete web indexing and searching
  system for a small domain or intranet. This system is not meant to
  replace the need for powerful internet-wide search systems like Lycos,

(Dated - these days Lycos is a portal rather than a search engine)

  Google, or Yahoo!. Instead it is meant to cover the search needs of a
  single company, campus, or even a particular subsection of a website.
  .
  As opposed to some WAIS-based or web-server based search engines,

ht://Dig isn't opposed to WAIS, and "-based" is just fog as usual. I'd
boil it down to "Unlike some WAIS or web search engines" - but then I
wonder about the claim it's leading into:

  ht://Dig can span several web servers at a site. The type of these
  different web servers doesn't matter as long as they understand the
  HTTP 1.0 protocol.

Does ht://Dig really have rivals that can only index one server? Are
there web servers that still don't support HTTP 1.0? Perhaps these
"features" should be retired into the bulleted feature-list.

The list's bullet style should be standardised, but I'll take that
part for granted.

- * Intranet searching
- * It is free
- * Full source code included
- * Full support for the ISO-Latin-1 character set

Cut these non-features (what's htdig doing with ja.po and ru.po files
etcetera if it can't even handle š or €?).
Perhaps replace them with:

+  - indexing of any number of unrelated web servers;

Standardising on noun phrases:

- * Robot exclusion is supported
+  - robot exclusion support;
- * Keywords can be added to HTML documents
+  - keyword tagging of HTML documents;
- * A Protected server can be indexed
+  - indexing of protected servers;
- * The depth of the search can be limited
+  - configurable-depth searches;

Then the trailing caveat:

- Please note that ht://Dig is a resource-hog, with respect to processor usage,
- when indexing.
- .
- Disk space requirements:
- .
- 13.000 documents indexed: 150MB disk space with a 'wordlist database'
-  93MB disk space without a 'wordlist'

The first half is subtly bad en_US; the second half has a blatantly
wrong $LC_NUMERIC!

+ Please note that ht://Dig indexing is processor-intensive; and its disk
+ space requirements are approximately 12kB per document indexed (so e.g.
+ 13,000 documents indexed = 150MB with a wordlist database, 93MB without).
-- 
JBR     with qualifications in linguistics, experience as a Debian
        sysadmin, and probably no clue about this particular package