Hello Trevor,

I don’t know of an analyzer for mixes of code and text, but I do know of an analyzer for mixes of code and formulæ.

Clearly, you could build a custom analyzer that tokenizes differently depending on whether you’re in code or in text. That’s not super hard.
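For the simpler end of that spectrum, here is a minimal sketch against Lucene 8.x (package names move around between versions, and the set of kept characters is of course something you’d tune). It is not a true mode-switching analyzer, which would need to track where code starts and ends; it just treats a hand-picked set of code characters as part of the token, which already covers your two examples below:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;

// A sketch: keep code punctuation inside tokens, so "The !F1 key."
// becomes [the] [!f1] [key] and "abort()," becomes [abort()], while
// whitespace, commas and full stops still split.
public final class CodeAwareAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new CharTokenizer() {
      @Override
      protected boolean isTokenChar(int c) {
        // Letters, digits, and a hand-picked set of code characters.
        // '.' is deliberately excluded, so "obj.method()" splits into
        // [obj] [method()] (a tuning decision, not a law).
        return Character.isLetterOrDigit(c)
            || c == '!' || c == '(' || c == ')' || c == '_';
      }
    };
    TokenStream stream = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, stream);
  }
}
```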

However, where things get complicated is the mixing, and that surfaces at query time at the latest: if you query `while`, you want to find matches for the literal word and for the stemmed word too. If you use Lucene for tasks other than search, however, this may be a problem (e.g. clustering, LSA…).

In the case of the formula-enabled search I built, the query modalities were different (two separate input fields), so you knew how to transform the query (for math, span queries were used).

I suspect you should decide on this first: if you just want to search and query a mix, then I’d recommend simply indexing into two different fields, one with a whitespace analyzer and one with a standard analyzer. Later on, the code-oriented field can be enriched with alternative names for code tokens (e.g. using “loop” as a weaker alternative of the “for” token). Solr and Lucene can do this really well (eDismax provides an easy parametrization).
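To make that concrete, here is a minimal sketch (again Lucene 8.x; the field names `body_text` and `body_code` are made up for the example) that indexes the same content into both fields via PerFieldAnalyzerWrapper and then queries them together:

```java
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TwoFieldSketch {
  public static void main(String[] args) throws Exception {
    // One analyzer per field: standard for prose, whitespace for code.
    Analyzer perField = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(),
        Map.of("body_code", new WhitespaceAnalyzer()));

    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer =
             new IndexWriter(dir, new IndexWriterConfig(perField))) {
      String content = "the abort() function, or the stop() function.";
      Document doc = new Document();
      // Same content, indexed twice under different analyses.
      doc.add(new TextField("body_text", content, Field.Store.NO));
      doc.add(new TextField("body_code", content, Field.Store.NO));
      writer.addDocument(doc);
    }

    // Query both fields at once. Escaping keeps "()" as text to be
    // analyzed rather than query syntax: "abort()" then matches the
    // whitespace token in body_code and plain "abort" in body_text.
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
        new String[] {"body_text", "body_code"}, perField);
    Query q = parser.parse(QueryParser.escape("abort()"));
    System.out.println(q);
  }
}
```

In Solr, the eDismax equivalent is roughly `defType=edismax` with `qf=body_text body_code`, where you can also weight the fields (e.g. `body_code^2`).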

But I’d be happy to read about others’ work on this!

In the W3C Math working group at the time, work stopped when the complexity of compound documents was considered: alternatives such as the above (mix the words, or recognize the math pieces?) certainly made things difficult.

paul


PS: [paper on my math search here](https://hoplahup.net/paul_pubs/AccessRetrievalAM.html). Please ask for the source code; it is old and built on Lucene 3.5, so it would need quite an upgrade.

On 23 Nov 2020, at 8:42, Trevor Nicholls wrote:

Hello, I'd better begin by identifying myself as a newbie.

I am investigating using Lucene as a search tool for a library of technical documents, much of which consists of pieces of source code and discussion of the content.

The standard analyzer does an adequate job with normal text but strips out non-alpha characters in code fragments; the whitespace analyzer does an adequate job with source code, but at the expense of treating punctuation characters as significant text.

As a couple of trivial examples, the line "The !F1 key." ideally needs to be analyzed as [the] [!f1] [key]. The standard analyzer turns it into [the] [f1] [key], while the whitespace analyzer turns it into [the] [!f1] [key.].

Similarly, "the abort() function, or the stop() function." ideally needs to be analyzed as [the] [abort()] [function] [or] [the] [stop()] [function]. But no analyzer will retain the parentheses while discarding the comma and full stop.

Are there examples of analyzers for technical documentation around, or any helpful pointers? Or am I barking up a rotten tree here?

cheers

T
