I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a
term, and then also the octets.
Are you adding the "contents" field into the index multiple times,
possibly with separate analyzers?
Could you try a very simple test case?
Just create an index with a single Lucene document, with that document's
contents being "aa.bb.cc.dd", and then take a look at the index via Luke
again.
When you look at the terms section (it's what comes up by default) you
SHOULD see only "aa", "bb", "cc", and "dd" as the top terms (and thus the
ONLY terms in the index). This could vary depending on your analyzer, as
some will produce an index containing only the single term "aa.bb.cc.dd".
What I would not expect is an index that contains both.
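To make that expectation concrete, here is a rough sketch in plain Java.
It does not call Lucene at all: the class TermExpectation and its two
methods are made-up names that only imitate the two outcomes described
above (an analyzer that breaks on the dots versus one that keeps the
string whole). What a real analyzer actually emits should still be
checked in Luke.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only: imitates two tokenization outcomes,
// it is not how any Lucene analyzer is implemented.
class TermExpectation {

    // Imitates an analyzer that breaks the text at every dot.
    static Set<String> splitOnDots(String text) {
        return new LinkedHashSet<>(Arrays.asList(text.split("\\.")));
    }

    // Imitates an analyzer that keeps the whole string as one term.
    static Set<String> keepWhole(String text) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(text);
        return terms;
    }

    public static void main(String[] args) {
        String contents = "aa.bb.cc.dd";
        System.out.println(splitOnDots(contents)); // [aa, bb, cc, dd]
        System.out.println(keepWhole(contents));   // [aa.bb.cc.dd]
    }
}
```

A single-document index built with one analyzer should look like one of
these two sets, not their union.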
Furthermore, by making the field not analyzed you will now have a
trickier time searching for it, as you will need to use a keyword
analyzer or something similar at search time, which, if I'm understanding
the spirit of your problem, isn't really something that you want to do.
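The search-side mismatch can be pictured like this, again in plain Java
with made-up names (SearchMismatch, matches) rather than real Lucene
calls. It assumes the field was indexed as the single term "aa.bb.cc.dd"
and that the query-time analyzer breaks on the dots; a term lookup only
succeeds when a query term equals an indexed term exactly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: models exact term matching, not Lucene's search code.
class SearchMismatch {

    // True if any query term is present among the indexed terms (exact string match).
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed: a not-analyzed field stores the value as one term.
        Set<String> indexed = Collections.singleton("aa.bb.cc.dd");

        // Query text broken on dots, as an analyzing query parser might do.
        List<String> analyzedQuery = Arrays.asList("aa.bb.cc.dd".split("\\."));
        // Query text kept whole, as a keyword-style analyzer would do.
        List<String> keywordQuery = Collections.singletonList("aa.bb.cc.dd");

        System.out.println(matches(indexed, analyzedQuery)); // false
        System.out.println(matches(indexed, keywordQuery));  // true
    }
}
```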
So, if you run the test scenario that I've outlined, I think you will
have a nice test bed for seeing what effect swapping to different
analyzers has on the data that you are trying to index. Then, after you
have played with that a bit, you should be able to expand your corpus
again and see if the analyzer you have chosen continues to hold up.
I... had thought that StandardAnalyzer already kept IP addresses together
as a single token, but maybe it's doing something... special and
interesting, and thus you are seeing the behavior that you are describing.
Matt
oh...@cox.net wrote:
Hi,
Oh. Ok, thanks! I'll give that a try.
Jim
---- "Armasu wrote:
Keyword: Field.Index.NOT_ANALYZED
-----Original Message-----
From: oh...@cox.net [mailto:oh...@cox.net]
Sent: Thursday, July 30, 2009 4:36 PM
To: java-user@lucene.apache.org
Subject: How to index IP addresses?
Hi,
I am trying to index information in some proprietary-formatted files.
In particular, these files contain some IP addresses in dotted notation, e.g.,
aa.bb.cc.dd.
For my initial test, I have a Document implementation, and after I extract what I need
into a String named "Info", I do:
doc.add(new Field("contents", Info, Field.Store.YES, Field.Index.ANALYZED));
From looking at the resulting index using Luke, it appears that I am getting terms for
the full IP address string (e.g., "aa.bb.cc.dd"), but I am also getting terms
for each octet of each IP address string, e.g.:
aa
bb
cc
dd
I'm still just getting started with Lucene, but from the research that I've done, it seems like
Lucene is treating the "." in the dotted notation strings as "noise". Is that
correct?
If so, is there a way to get it not to do that?
Thanks,
Jim
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012