Hey everyone,

My organization uses its own homebrew QueryParser class, unrelated to Lucene's JavaCC-based QueryParser, to parse our queries; we don't currently use anything from Solr. That class has grown quite cumbersome, and I'm looking into alternatives. Grammar-based parsing seems like the way to go, but I have some questions:
- ANTLR seems to be very well supported and well liked, but I see that Lucene's QueryParser and StandardTokenizer use JavaCC. Does anyone have experience writing a Lucene or Solr parser using ANTLR? Given Lucene's default use of JavaCC, is there any advantage to sticking with JavaCC, or any problem with using ANTLR?
- Does anyone have experience using ANTLR for tokenization?
- I was told that Solr might be componentizing its query parsing in a way that would let us use that instead of a homebrew grammar-based solution, but I haven't found anything written about it. I don't know much about Solr's query parsing beyond what I saw in QParser.java and QParserPlugin.java: it looks like one can plug in any parser needed (a rough sketch of what I mean is at the end of this message). That alone doesn't help us, though, since our goal is to simplify the parsing logic itself. Is there a way to structure our query parsing without writing a grammar from scratch, whether it's a Solr component or something else?

In a nutshell, I'm trying to get a sense of the best practices for this situation (custom query parsing that's getting very complex) before I dive into implementing a solution.

Thanks!
Tavi
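P.S. To make the QParserPlugin point concrete, here's the kind of thing I was imagining. It's just an untested sketch: HomebrewQParserPlugin and HomebrewQueryParser are made-up names for our own code, and I haven't checked which Solr version's API this matches (older releases seem to declare ParseException rather than SyntaxError on parse(), and may also require overriding init(NamedList)).

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

/**
 * Hypothetical plugin wrapping our in-house parser. HomebrewQueryParser is
 * our own class (not part of Lucene or Solr); all it would need to do is
 * turn the query string into a Lucene Query.
 */
public class HomebrewQParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        try {
          // Delegate to the existing homebrew parser.
          return new HomebrewQueryParser(req.getSchema()).parse(qstr);
        } catch (Exception e) {
          // Older Solr versions would throw ParseException here instead.
          throw new SyntaxError(e.getMessage());
        }
      }
    };
  }
}

If I understand the docs correctly, something like this would then be registered in solrconfig.xml with a <queryParser> element and selected per request with defType or a {!homebrew} local param, but that only swaps in our parser; it doesn't simplify it, which is the real problem.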
