I am working on a search mechanism that links a requirement (i.e., a
query) to source code files (i.e., documents). For that purpose, I indexed
the source code files using Lucene. Unlike a traditional natural-language
search scenario, we search for code files that are relevant to a given
requirement. One problem here is that the source files usually contain many
abbreviations, words joined by underscores, or combinations of words and/or
abbreviations (e.g., getAccountBalanceTbl). I am wondering whether any of
you have already indexed (source) files or documents that contain that
kind of term.

Yes, that's something we've spent a fair amount of time on; see http://www.krugle.org (public code search).

As Mathieu noted, the first thing you want to do is split each file into at least comments vs. code. Then you can use a regular analyzer (or perhaps something more human-language-specific, e.g. one with stemming support) on the comment text, and your own custom tokenizer on the code.
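A minimal sketch of that first step, assuming Java-style comment syntax. A real indexer would use a proper lexer (this regex version ignores corner cases such as comment markers inside string literals); the function name and sample source are my own, not Krugle's code:

```python
import re

# Matches //-style line comments and /* ... */ block comments.
COMMENT_RE = re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)

def split_comments_and_code(source):
    """Separate a Java-like source file into its comment text and
    its code, so each stream can be fed to a different analyzer."""
    comments = [m.group(0) for m in COMMENT_RE.finditer(source)]
    code = COMMENT_RE.sub(' ', source)  # blank out comments, keep code
    return comments, code

src = '''/* Returns the account balance. */
int getBalance() { return balance; } // in cents
'''
comments, code = split_comments_and_code(src)
# comments -> ['/* Returns the account balance. */', '// in cents']
# code still contains the identifiers, with comments blanked out
```

The comment stream would then go to a stemming analyzer and the code stream to the custom tokenizer described below.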

In the code, you might further want to treat literals (strings, etc.) differently than other terms.

For the actual code terms, you then want to do what is essentially synonym processing: turn a single term into multiple terms by automatically splitting it on '_', '-', camelCasing, letter/digit transitions, etc.
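That splitting can be sketched as follows. This is a hypothetical illustration of the idea, not Krugle's implementation; the function name is my own:

```python
import re

def split_identifier(term):
    """Split a code identifier into sub-tokens on '_', '-',
    camelCase boundaries, acronym runs, and letter/digit
    transitions, then lowercase them for indexing."""
    # Break on explicit separators first.
    parts = re.split(r'[_\-]+', term)
    tokens = []
    for part in parts:
        # Acronym followed by a word (HTTPServer -> HTTP, Server),
        # ordinary camelCase words, bare uppercase runs, digit runs.
        tokens.extend(re.findall(
            r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+', part))
    return [t.lower() for t in tokens]

print(split_identifier("getAccountBalanceTbl"))
# -> ['get', 'account', 'balance', 'tbl']
```

At index time you would typically emit these sub-tokens at the same position as the original term (synonym-style), so that both a search for "balance" and a search for the exact identifier still match.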

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
