
On Tue, 25 Jul 2000, Craig Small wrote:
> On Fri, Jul 21, 2000 at 09:31:56PM +0200, Erik Rossen wrote:
> > entire website into a .deb package, searchable with htdig.  How many
> > megabytes would that make?
> Try Gig, like 4 Gig.

If I search on Altavista,

"url:www.debian.org" gives about 65,826 pages (say, 66,000 pages)

"url:www.debian.org AND NOT url:www.debian.org/Lists-Archives" gives about
9,334 (say, 9,300 pages)

Assuming that the 4GB number is due to the 66,000 pages, that makes an
average of about 64kB per page.  This number seems a bit high to me
- I suspect that Altavista has been obeying robots.txt and that in reality
there are many more pages.

Anyhow, assuming that one were to use htDig and budget 12kB per page for
word indices (so that the database could be built incrementally), one
would need:

For everything that AV has seen so far: 66,000 x 12kB = 792,000kB = 773MB

Ditto, minus the mail archives: 9,300 x 12kB = 111,600kB = 109MB
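
For anyone who wants to redo the arithmetic with other page counts, here is
a rough Python sketch.  The function names and the 1024-based unit
conversion are assumptions made purely for illustration, and the 12kB per
page default is only the budget suggested above, not a measured htDig
figure.

```python
# Back-of-the-envelope sizing for a site search index.
# All names here are hypothetical; the inputs are the figures quoted above.

KB_PER_MB = 1024          # assumption: binary units (1MB = 1024kB)
KB_PER_GB = 1024 * 1024


def average_page_kb(site_gb, pages):
    """Average page size in kB if `site_gb` gigabytes hold `pages` pages."""
    return site_gb * KB_PER_GB / pages


def index_size_mb(pages, kb_per_page=12):
    """Estimated index size in MB for a per-page index budget in kB."""
    return pages * kb_per_page / KB_PER_MB


if __name__ == "__main__":
    # 4GB spread over roughly 66,000 pages
    print(round(average_page_kb(4, 66_000)))   # 64
    # Everything Altavista has seen so far
    print(round(index_size_mb(66_000)))        # 773
    # Same, minus the mail archives
    print(round(index_size_mb(9_300)))         # 109
```

Changing the `kb_per_page` argument shows how sensitive the total is to the
per-page budget, which is the least certain number in the estimate.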

Would someone with more experience than me tell us if these numbers pose
any difficulties?  Unless there is a real need to keep all of the indices
in RAM, shouldn't it be fairly cheap and easy to get this thing
operational right now?  Even if the space required was one order of
magnitude greater than what I've calculated?

I was told at the conference that getting equipment is normally never a
problem for the Debian project.  The only problem that I am aware of with
htDig is that it is probably incapable of handling non-Western-European
languages.  If anyone has a better candidate search engine, let them
speak.  Incidentally, the GNU project uses htDig on their site.

> Website searching will occur when potato is frozen so the admins can
> update the servers.

Glad to hear that.  So, ummmm, when is potato going to be frozen? ;-)

Erik Rossen                         ^
[EMAIL PROTECTED]                 /e\
http://www.multimania.com/rossen   ---   GPG key ID: 2935D0B9

