Re: How to index IP addresses?

Matthew Hall Thu, 30 Jul 2009 07:03:28 -0700

I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a term, and then also the octets.

Are you adding the "contents" field into the index multiple times, possibly with separate analyzers?


Could you possibly try a test, very simple case?

Just create an index with a single Lucene document, with that document's contents being "aa.bb.cc.dd", and then take a look at the index via Luke again.

When you look at the terms section (it's what comes up by default) you SHOULD see only "aa", "bb", "cc", and "dd" as the top (and thus the ONLY) terms in the index. This could vary depending on your analyzer, as some will produce an index containing only a single term, "aa.bb.cc.dd". What I would not expect is an index that contains both.
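
As a rough illustration of the point above, here is a plain-Java sketch (not Lucene code) that stands in for two kinds of analyzer. The class and method names are hypothetical, and the splitting rules are assumptions chosen to mimic "break on punctuation" versus "break only on whitespace"; real Lucene analyzers may behave differently. It shows that either strategy alone yields one set of terms or the other, and that seeing both sets is what you would get if the same text were indexed twice with different strategies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java stand-in for two tokenization strategies; this is NOT Lucene code. */
final class IpTokenizationSketch {

    /** Mimics an analyzer that breaks on every non-alphanumeric character. */
    static List<String> splitOnPunctuation(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    /** Mimics an analyzer that breaks only on whitespace, so dotted strings stay whole. */
    static List<String> splitOnWhitespace(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    /** Simulates adding the same text to one index twice, once per strategy. */
    static Set<String> termsAfterIndexingWithBoth(String text) {
        Set<String> terms = new TreeSet<>();
        terms.addAll(splitOnPunctuation(text));
        terms.addAll(splitOnWhitespace(text));
        return terms;
    }

    public static void main(String[] args) {
        String contents = "aa.bb.cc.dd";
        System.out.println("punctuation split: " + splitOnPunctuation(contents));
        System.out.println("whitespace split:  " + splitOnWhitespace(contents));
        System.out.println("indexed both ways: " + termsAfterIndexingWithBoth(contents));
    }
}
```

Only the last case contains the full address and the individual octets together, which is why the question about adding the field more than once matters.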

Furthermore, by making the field not analyzed you will now have a trickier time searching for it, as you will need to use a keyword analyzer or something similar to search, which, if I'm understanding the spirit of your problem, isn't really something that you want to do.
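
To make the search-side concern concrete, here is a small plain-Java sketch (again not Lucene code; all names are hypothetical). It assumes a not-analyzed field contributes its whole value as one term, and it simplifies matching to "some query term appears verbatim in the index". Under those assumptions, a query that gets split on punctuation cannot find the whole-address term, while a keyword-style query that passes the text through untouched can.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java stand-in for index-time vs. query-time analysis; NOT Lucene code. */
final class NotAnalyzedSearchSketch {

    /** Index side: a not-analyzed field is assumed to store its value as a single term. */
    static Set<String> indexNotAnalyzed(String value) {
        Set<String> terms = new HashSet<>();
        terms.add(value);
        return terms;
    }

    /** Query side, option A: split the query text on non-alphanumeric characters. */
    static List<String> analyzeQuerySplitting(String query) {
        List<String> terms = new ArrayList<>();
        for (String piece : query.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    /** Query side, option B: keyword style, the whole query text is one term. */
    static List<String> analyzeQueryAsKeyword(String query) {
        return Collections.singletonList(query);
    }

    /** Simplified matching: a hit needs some query term to exist verbatim in the index. */
    static boolean matches(Set<String> indexTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (indexTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> index = indexNotAnalyzed("aa.bb.cc.dd");
        System.out.println("split query matches:   "
                + matches(index, analyzeQuerySplitting("aa.bb.cc.dd")));
        System.out.println("keyword query matches: "
                + matches(index, analyzeQueryAsKeyword("aa.bb.cc.dd")));
    }
}
```

The general lesson is that the query text has to be turned into terms the same way the indexed text was, otherwise the terms will not line up.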

So, if you could run the test scenario that I've outlined, I think you should have a nice test bed to see what effect swapping to different analyzers has on the data that you are trying to index. Then, after you have played with that a bit, you should be able to re-expand your corpus and see if the analyzer you have chosen continues to stand up.

I.. had thought that StandardAnalyzer already kept IP addresses together as a single token, but maybe it's doing something... special and interesting, and thus you are seeing the behavior that you are describing.


Matt

oh...@cox.net wrote:

Hi,

Oh.  Ok, thanks!  I'll give that a try.

Jim

---- "Armasu wrote:

Keyword: Field.Index.NOT_ANALYZED

-----Original Message-----

From: oh...@cox.net [mailto:oh...@cox.net]
Sent: Thursday, July 30, 2009 4:36 PM

To: java-user@lucene.apache.org
Subject: How to index IP addresses?

Hi,

I am trying to index information in some proprietary-formatted files.

In particular, these files contain some IP addresses in dotted notation, e.g., 
aa.bb.cc.dd.

For my initial test, I have a Document implementation, and after I extract what I need 
into a String named "Info", I do:

doc.add(new Field("contents", Info, Field.Store.YES, Field.Index.ANALYZED));

From looking at the resulting index using Luke, it appears that I am getting terms for 
the full IP address string (e.g., "aa.bb.cc.dd"), but I am also getting terms 
for each octet of each IP address string, e.g.:

aa
bb
cc
dd

I'm still just getting started with Lucene, but from the research that I've done, it seems like 
Lucene is treating the "." in the dotted notation strings as "noise".  Is that 
correct?

If so, is there a way to get it not to do that?

Thanks,
Jim

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org









--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



