I am working on a search mechanism to link a requirement (i.e., a query) to source code files (i.e., documents). For that purpose, I indexed the source code files using Lucene. Unlike a traditional natural-language search scenario, we search for code files that are relevant to a given requirement. One problem is that the source files usually contain many abbreviations, words joined by '_', or combinations of words and/or abbreviations (e.g., getAccountBalanceTbl). I am wondering whether any of you have already indexed (source) files or documents that contain that kind of word.
Yes, that's something we've spent a fair amount of time on; see http://www.krugle.org (public code search).
As Mathieu noted, the first thing you really want to do is split the file up into at least comments vs. code. Then you can use a regular analyzer (or perhaps something more human language-specific, e.g. with stemming support) on the comment text, and your own custom tokenizer on the code.
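A minimal sketch of that first splitting step might look like the following. This is illustrative only (a real splitter would also have to handle comment markers that appear inside string literals); the function name and the Java-style comment syntax are my own assumptions, not anything from Krugle's implementation.

```python
import re

# Matches C/Java-style block comments (/* ... */) and line comments (// ...).
# NOTE: naive sketch -- does not handle comment markers inside string literals.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def split_comments_and_code(source: str):
    """Separate comment text from code so each can go through
    its own analyzer (natural-language vs. custom code tokenizer)."""
    comments = " ".join(COMMENT_RE.findall(source))
    code = COMMENT_RE.sub(" ", source)
    return comments, code

comments, code = split_comments_and_code(
    "/* Returns the balance. */ int getAccountBalanceTbl() { return bal; } // TODO"
)
```

The comment text can then be fed to a standard (possibly stemming) analyzer, and the remaining code text to the custom tokenizer.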
In the code, you might further want to treat literals (strings, etc.) differently from other terms.
And for the "real" code terms, you essentially want to do synonym processing, where you turn a single term into multiple terms by automatically splitting it on '_', '-', camelCasing, letter/digit transitions, etc.
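That splitting step can be sketched roughly as below. This is just one plausible way to do it (the regex and function name are mine, not Krugle's); the resulting sub-terms would be emitted at the same position as the original term, which is how synonym-style expansion works in Lucene.

```python
import re

# One alternative per token shape: runs of capitals (acronyms),
# capitalized words, lowercase runs, and digit runs.
SUBTERM_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")

def split_identifier(term: str):
    """Split a code identifier into indexable sub-terms on '_', '-',
    camelCase boundaries, and letter/digit transitions."""
    tokens = []
    for part in re.split(r"[_\-]+", term):
        tokens.extend(SUBTERM_RE.findall(part))
    return [t.lower() for t in tokens]

split_identifier("getAccountBalanceTbl")  # → ['get', 'account', 'balance', 'tbl']
```

Indexing both the original term and these sub-terms lets a requirement mentioning "account balance" match the getAccountBalanceTbl method.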
-- Ken

Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"