I am working on a search mechanism to link a requirement (i.e., a query) to source code files (i.e., documents). For that purpose, I indexed the source code files using Lucene. Unlike a traditional natural-language search scenario, we search for code files that are relevant to a given requirement. One problem is that the source files usually contain many abbreviations, words joined by '_', or combinations of words and/or abbreviations (e.g., getAccountBalanceTbl). I am wondering whether any of you have already indexed (source) files or documents that contain that kind of word.
Yes, that's something we've spent a fair amount of time on; see http://www.krugle.org (public code search).
As Mathieu noted, the first thing you really want to do is split the file up into at least comments vs. code. Then you can use a regular analyzer (or perhaps something more human language-specific, e.g. with stemming support) on the comment text, and your own custom tokenizer on the code.
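A minimal sketch of that first splitting step might look like the following. This is illustrative only (a real splitter would also have to handle comment markers that appear inside string literals); the function name and the Java-style comment syntax are my own assumptions, not anything from Krugle's implementation.

```python
import re

# Matches C/Java-style block comments (/* ... */) and line comments (// ...).
# NOTE: naive sketch -- does not handle comment markers inside string literals.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def split_comments_and_code(source: str):
    """Separate comment text from code so each can go through
    its own analyzer (natural-language vs. custom code tokenizer)."""
    comments = " ".join(COMMENT_RE.findall(source))
    code = COMMENT_RE.sub(" ", source)
    return comments, code

comments, code = split_comments_and_code(
    "/* Returns the balance. */ int getAccountBalanceTbl() { return bal; } // TODO"
)
```

The comment text can then be fed to a standard (possibly stemming) analyzer, and the remaining code text to the custom tokenizer.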
In the code, you might further want to treat literals (strings, etc.) differently from other terms.
And for the "real" code terms, you essentially want to do synonym processing, where you turn a single term into multiple terms by automatically splitting it on '_', '-', camelCasing, letter/digit transitions, etc.
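That splitting step can be sketched roughly as below. This is just one plausible way to do it (the regex and function name are mine, not Krugle's); the resulting sub-terms would be emitted at the same position as the original term, which is how synonym-style expansion works in Lucene.

```python
import re

# One alternative per token shape: runs of capitals (acronyms),
# capitalized words, lowercase runs, and digit runs.
SUBTERM_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")

def split_identifier(term: str):
    """Split a code identifier into indexable sub-terms on '_', '-',
    camelCase boundaries, and letter/digit transitions."""
    tokens = []
    for part in re.split(r"[_\-]+", term):
        tokens.extend(SUBTERM_RE.findall(part))
    return [t.lower() for t in tokens]

split_identifier("getAccountBalanceTbl")  # → ['get', 'account', 'balance', 'tbl']
```

Indexing both the original term and these sub-terms lets a requirement mentioning "account balance" match the getAccountBalanceTbl method.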
-- Ken

Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"