Hi, I tried this code:
TokenStream ts = analyzer.tokenStream("content", new StringReader("www.abc.com"));
Token t;
while ((t = ts.next()) != null) {
    System.out.println(t);
}

If I pass "www.abc.com" (without an extra '.'), it prints
(www.abc.com,0,11,type=<HOST>) ---> it recognizes the type HOST.

If I pass "www.abc.com." (with an extra '.'), it prints
(wwwabccom,0,12,type=<ACRONYM>) ---> it recognizes the type ACRONYM.

Personally, I think it is a bug, as ACRONYMs are usually of the form
A.B.C. and not ABC.DEF.

... maybe you can try the java-dev mailing list and consult them on
whether you should open an issue on that ...

On Nov 26, 2007 5:47 PM, Eugenio Martinez <[EMAIL PROTECTED]> wrote:
> I am indexing with Lucene a huge set of logfiles, about 130GB of plain
> text on disk (so far), planning to build a system capable of performing
> searches over terabytes of such info in a kind of metaindex built from a
> mesh of little ones, all of them created and maintained with Lucene.
>
> I have randomly variable file sizes, from 1KB to several hundred MB of
> plain text, and I have done tests with files of about 2GB, obtaining very
> good performance in time and search. Of course, once we can get search
> results from such a system we become confident that Lucene was capable of
> doing its job right, i.e., splitting all contents and indexing all tokens
> correctly.
>
> But last week, with our first beta release in our LAN environment, some
> problems arose. In certain situations we've found that the Analysis stage
> "fails", or rather, has anomalies in its activity. We have isolated one
> that can be reproduced with LUKE in its Search window: parsing URL domains
> that end with a period, as in "www.my.domain.es.", results in a token with
> the following text: "wwwmydomaines".
>
> Maybe this behavior extends to emails, as we aren't able to get search
> results with some emails that are indeed in the contents of the logfile,
> and with words too.
> Such behavior is not acceptable to anybody, as in natural speech it is
> possible to find such URLs at the end of a sentence. Is this an effect of
> document vectorization? I write this because the log's content structure
> doesn't match natural-language rules...
>
> Any notice about this?
>
> We are working on a Log Analyzer now, but I'm sure I'm not the only
> fellow with this issue in the world... Do you know anyone else?
>
> Thanks for your attention.
>
> Eugenio F. Martínez Pacheco
>
> Fundación Instituto Tecnológico de Galicia - Área TIC
>
> TFN: 981 173 206 FAX: 981 173 223
>
> VIDEOCONFERENCIA: 981 173 596
>
> [EMAIL PROTECTED]

--
Regards,
Shai Erera
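As a stopgap until the tokenizer behavior is sorted out, one possible workaround is to normalize the text before it ever reaches the analyzer, stripping the period that trails a dotted host so StandardTokenizer sees "www.abc.com" instead of "www.abc.com.". This is only a sketch under my own assumptions (the class name and the regex are mine, not anything from Lucene), and the regex will also trim abbreviations like "e.g." unless refined:

```java
import java.util.regex.Pattern;

public class TrailingDotStripper {

    // Matches a dotted host-like token (e.g. www.abc.com) followed by one
    // trailing period that sits before whitespace or end of input.
    // Note: this also strips abbreviations such as "e.g." -- refine if needed.
    private static final Pattern TRAILING_DOT =
        Pattern.compile("([\\w-]+(?:\\.[\\w-]+)+)\\.(?=\\s|$)");

    // Rewrites "www.abc.com." to "www.abc.com", leaving everything else intact.
    public static String strip(String text) {
        return TRAILING_DOT.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("visit www.my.domain.es. today"));
        // prints: visit www.my.domain.es today
    }
}
```

Running the analyzer on `strip(text)` instead of `text` should then keep the <HOST> token type for sentence-final URLs, at the cost of losing the literal trailing period (which StandardAnalyzer would discard anyway).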