I am indexing a huge set of logfiles with Lucene, about 130GB of plain text on
disk so far, and planning to build a system capable of performing searches over
terabytes of such data through a kind of meta-index built from a mesh of small
indexes, all of them created and maintained with Lucene.
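To make the mesh idea concrete: each batch of logs gets its own small index,
and a search runs over all of them as if they were one. A minimal sketch,
assuming a recent Lucene release and hypothetical shard paths:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class MeshSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical shard locations; one small index per batch of logfiles.
            IndexReader shard1 = DirectoryReader.open(FSDirectory.open(Paths.get("/indexes/logs-001")));
            IndexReader shard2 = DirectoryReader.open(FSDirectory.open(Paths.get("/indexes/logs-002")));

            // MultiReader presents the shards as one logical index,
            // so a single IndexSearcher covers all of them.
            try (IndexReader mesh = new MultiReader(shard1, shard2)) {
                IndexSearcher searcher = new IndexSearcher(mesh);
                // searcher.search(query, n) now spans every shard.
            }
        }
    }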

File sizes vary widely, from 1KB to several hundred MB of plain text, and I
have run tests with files of about 2GB, obtaining very good indexing and search
performance. Once we could get search results out of such a system, we grew
confident that Lucene was doing its job correctly, i.e., splitting the contents
and indexing every token as expected.

But last week, with our first beta release in our LAN environment, some
problems arose. In certain situations we have found that the analysis stage
"fails", or rather, behaves anomalously. We have isolated one case that can be
reproduced with Luke in its Search window: a URL domain ending with a dot, as
in "www.my.domain.es.", becomes a token with the following text:
"wwwmydomaines".
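For anyone who wants to check this outside Luke, here is a minimal
token-dumping sketch, assuming a recent Lucene (the field name "contents" and
the sample line are made up, and the exact tokens will depend on the analyzer
and Lucene version in use):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            // Feed the analyzer the problematic text and print each token it emits.
            try (TokenStream ts = analyzer.tokenStream("contents", "connect to www.my.domain.es. failed")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
            analyzer.close();
        }
    }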

This behavior may extend to e-mail addresses as well, since we are unable to
get search results for some addresses, and for some plain words, that are
definitely present in the logfile contents.

Such behavior is not acceptable to anybody, since in natural text it is
perfectly possible to find such URLs at the end of a sentence. Is this an
effect of document vectorization? I ask because the structure of log content
does not follow natural-language rules...

Any thoughts on this?

We are working on a Log Analyzer now, but I'm sure I'm not the only one in the
world with this issue... Does anyone know of others who have run into it?
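In case it helps, the direction we are exploring is roughly this: tokenize on
whitespace so URLs and e-mail addresses survive as single tokens, then
lowercase. A sketch only, assuming a recent Lucene ("LogAnalyzer" is just our
working name):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class LogAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Whitespace tokenization leaves "www.my.domain.es." and e-mail
            // addresses intact (trailing dot included) instead of stripping dots.
            Tokenizer source = new WhitespaceTokenizer();
            TokenStream filtered = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, filtered);
        }
    }

(Newer Lucene releases also ship a UAX29URLEmailTokenizer that recognizes full
URLs and e-mail addresses as single tokens; that may be a better starting point
than a hand-rolled analyzer.)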

Thanks for your attention.
 
Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

TFN: 981 173 206            FAX: 981 173 223

VIDEOCONFERENCIA: 981 173 596 

[EMAIL PROTECTED]