Hi, I tried this code:
TokenStream ts = analyzer.tokenStream("content", new StringReader("www.abc.com"));
Token t;
while ((t = ts.next()) != null) {
    System.out.println(t);
}

If I pass "www.abc.com" (without an extra '.'), it prints
(www.abc.com,0,11,type=<HOST>) ---> it recognizes the type HOST.

If I pass "www.abc.com." (with an extra '.'), it prints
(wwwabccom,0,12,type=<ACRONYM>) ---> it recognizes the type ACRONYM.

Personally, I think it is a bug, as ACRONYMs are usually of the form
A.B.C. and not ABC.DEF.

... maybe you can try the java-dev mailing list and consult them on
whether you should open an issue on that ...

On Nov 26, 2007 5:47 PM, Eugenio Martinez <[EMAIL PROTECTED]> wrote:
> I am indexing with Lucene a huge set of logfiles, about 130GB of plain
> text on disk (so far), planning to build a system capable of performing
> searches over terabytes of such info in a kind of metaindex built from a
> mesh of little ones, all of them created and maintained with Lucene.
>
> I have randomly variable file sizes, from 1KB to several hundred MB of
> plain text, and I have done tests with files of about 2GB, obtaining very
> good performance in time and search. Of course, once we can get search
> results from such a system we become confident that Lucene was capable of
> doing its job right, i.e., splitting all contents and indexing all tokens
> correctly.
>
> But last week, with our first beta release in our LAN environment, some
> problems arose. In certain situations we've found that the Analysis stage
> "fails", or rather, has anomalies in its activity. We have isolated one
> that can be reproduced with LUKE in its Search window: parsing URL domains
> that end with a period, as in "www.my.domain.es.", results in a token with
> the following text: "wwwmydomaines".
>
> Maybe this behavior extends to emails, as we aren't able to get search
> results with some emails that are indeed in the contents of the logfile,
> and with words too.
> Such behavior is not acceptable to anybody, as in natural speech it is
> possible to find such URLs at the end of a sentence. Is this an effect of
> document vectorization? I write this because the log's content structure
> doesn't match natural-language rules...
>
> Any notice about this?
>
> We are working on a Log Analyzer now, but I'm sure I'm not the only
> fellow with this issue in the world... Do you know anyone else?
>
> Thanks for your attention.
>
> Eugenio F. Martínez Pacheco
>
> Fundación Instituto Tecnológico de Galicia - Área TIC
>
> TFN: 981 173 206 FAX: 981 173 223
>
> VIDEOCONFERENCIA: 981 173 596
>
> [EMAIL PROTECTED]

--
Regards,
Shai Erera
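As a stopgap until the tokenizer behavior is sorted out, one possible workaround is to normalize the text before it ever reaches the analyzer, stripping the period that trails a dotted host so StandardTokenizer sees "www.abc.com" instead of "www.abc.com.". This is only a sketch under my own assumptions (the class name and the regex are mine, not anything from Lucene), and the regex will also trim abbreviations like "e.g." unless refined:

```java
import java.util.regex.Pattern;

public class TrailingDotStripper {

    // Matches a dotted host-like token (e.g. www.abc.com) followed by one
    // trailing period that sits before whitespace or end of input.
    // Note: this also strips abbreviations such as "e.g." -- refine if needed.
    private static final Pattern TRAILING_DOT =
        Pattern.compile("([\\w-]+(?:\\.[\\w-]+)+)\\.(?=\\s|$)");

    // Rewrites "www.abc.com." to "www.abc.com", leaving everything else intact.
    public static String strip(String text) {
        return TRAILING_DOT.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("visit www.my.domain.es. today"));
        // prints: visit www.my.domain.es today
    }
}
```

Running the analyzer on `strip(text)` instead of `text` should then keep the <HOST> token type for sentence-final URLs, at the cost of losing the literal trailing period (which StandardAnalyzer would discard anyway).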