Hello, I'd better begin by identifying myself as a newbie.
I am investigating Lucene as a search tool for a library of technical documents, much of which consists of pieces of source code and discussion of that code.

The standard analyzer does an adequate job with normal text but strips out non-alpha characters in code fragments. The whitespace analyzer does an adequate job with source code, but at the expense of treating punctuation characters as significant text.

As a couple of trivial examples, the line "The !F1 key." ideally needs to be analyzed as [the] [!f1] [key]. The standard analyzer turns it into [the] [f1] [key], while the whitespace analyzer turns it into [the] [!f1] [key.].

Similarly, "the abort() function, or the stop() function." ideally needs to be analyzed as [the] [abort()] [function] [or] [the] [stop()] [function], but neither analyzer retains the parentheses while discarding the comma and full stop.

Are there examples of analyzers for technical documentation around, or any helpful pointers? Or am I barking up a rotten tree here?

cheers
T
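
P.S. To make the behaviour I am after concrete, here is a rough sketch of the rule in plain Python. It is not Lucene code and uses no Lucene API; the function name `tokenize` and the constant `TRAILING_PUNCTUATION` are made up for illustration. It assumes that splitting on whitespace, lowercasing, and stripping only trailing sentence punctuation is enough for my two examples, which will certainly not hold for every code fragment.

```python
# Hypothetical sketch of the desired tokenization rule; not Lucene code.
# Assumption: sentence punctuation only ever appears at the END of a token.
TRAILING_PUNCTUATION = ".,;:?!"


def tokenize(text):
    """Split on whitespace, lowercase, and drop trailing sentence punctuation."""
    tokens = []
    for raw in text.split():
        # rstrip only touches the right-hand end, so a leading "!" in "!F1"
        # and the "()" in "abort()" are left alone.
        token = raw.rstrip(TRAILING_PUNCTUATION).lower()
        if token:  # skip items that consisted of punctuation only
            tokens.append(token)
    return tokens


print(tokenize("The !F1 key."))
print(tokenize("the abort() function, or the stop() function."))
```

This gives the tokens I listed above for both sentences, but it would also strip a trailing character that is genuinely part of the code (for example a statement ending in ";"), so it is only a statement of intent and not a proposed solution.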