Hello Trevor,

I don’t know of an analyzer for mixes of code and text, but I do know of an analyzer for mixes of code and formulæ.

Clearly, you could build a custom analyzer that tokenizes differently depending on whether you’re in code or in text. That’s not super hard.
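For the simpler end of that spectrum, here is a minimal sketch against Lucene 8.x (package names move around between versions, and the set of kept characters is of course something you’d tune). It is not a true mode-switching analyzer, which would need to track where code starts and ends; it just treats a hand-picked set of code characters as part of the token, which already covers your two examples below:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;

// A sketch: keep code punctuation inside tokens, so "The !F1 key."
// becomes [the] [!f1] [key] and "abort()," becomes [abort()], while
// whitespace, commas and full stops still split.
public final class CodeAwareAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new CharTokenizer() {
      @Override
      protected boolean isTokenChar(int c) {
        // Letters, digits, and a hand-picked set of code characters.
        // '.' is deliberately excluded, so "obj.method()" splits into
        // [obj] [method()] (a tuning decision, not a law).
        return Character.isLetterOrDigit(c)
            || c == '!' || c == '(' || c == ')' || c == '_';
      }
    };
    TokenStream stream = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, stream);
  }
}
```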

However, where things get complicated is the mixing, and that surfaces at query time at the latest: if you query `while`, you want to find matches for the literal word and for the stemmed word too. If you use Lucene for tasks other than search, however, this may be a problem (e.g. clustering, LSA…).

In the case of the formula-enabled search I built, the query modalities were different (two separate input fields), so you knew how to transform the query (for math, span queries were used).

I suspect you should decide on this first: if you just want to search and query a mix, then I’d recommend simply indexing into two different fields, one with a whitespace analyzer and one with a standard analyzer. Later on, the code-oriented field can be enriched with alternative names for code tokens (e.g. using “loop” as a weaker alternative of the “for” token). Solr and Lucene can do this really well (eDismax provides an easy parametrization).
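To make that concrete, here is a minimal sketch (again Lucene 8.x; the field names `body_text` and `body_code` are made up for the example) that indexes the same content into both fields via PerFieldAnalyzerWrapper and then queries them together:

```java
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TwoFieldSketch {
  public static void main(String[] args) throws Exception {
    // One analyzer per field: standard for prose, whitespace for code.
    Analyzer perField = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(),
        Map.of("body_code", new WhitespaceAnalyzer()));

    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer =
             new IndexWriter(dir, new IndexWriterConfig(perField))) {
      String content = "the abort() function, or the stop() function.";
      Document doc = new Document();
      // Same content, indexed twice under different analyses.
      doc.add(new TextField("body_text", content, Field.Store.NO));
      doc.add(new TextField("body_code", content, Field.Store.NO));
      writer.addDocument(doc);
    }

    // Query both fields at once. Escaping keeps "()" as text to be
    // analyzed rather than query syntax: "abort()" then matches the
    // whitespace token in body_code and plain "abort" in body_text.
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
        new String[] {"body_text", "body_code"}, perField);
    Query q = parser.parse(QueryParser.escape("abort()"));
    System.out.println(q);
  }
}
```

In Solr, the eDismax equivalent is roughly `defType=edismax` with `qf=body_text body_code`, where you can also weight the fields (e.g. `body_code^2`).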

But I’d be happy to read about others’ work on this!

In the W3C Math working group at the time, work stopped when the complexity of compound documents was considered: alternatives such as the above (mix the words, or recognize the math pieces?) certainly made things difficult.

paul


PS: [paper on my math search here](https://hoplahup.net/paul_pubs/AccessRetrievalAM.html). Please ask for the source code; it is old and built on Lucene 3.5, so it would need quite an upgrade.

On 23 Nov 2020, at 8:42, Trevor Nicholls wrote:

Hello, I'd better begin by identifying myself as a newbie.

I am investigating using Lucene as a search tool for a library of technical documents, much of which consists of pieces of source code and discussion of the content.

The standard analyzer does an adequate job with normal text but strips out non-alpha characters in code fragments; the whitespace analyzer does an adequate job with source code, but at the expense of treating punctuation characters as significant text.

As a couple of trivial examples, the line "The !F1 key." ideally needs to be analyzed as [the] [!f1] [key]. The standard analyzer turns it into [the] [f1] [key], while the whitespace analyzer turns it into [the] [!f1] [key.].

Similarly, "the abort() function, or the stop() function." ideally needs to be analyzed as [the] [abort()] [function] [or] [the] [stop()] [function]. But no analyzer will retain the parentheses while discarding the comma and full stop.

Are there examples of analyzers for technical documentation around, or any helpful pointers? Or am I barking up a rotten tree here?

cheers

T
