Cycling back on this one... I'm in a bit of a bind now.
Using a SynonymMap with useOrig=true, the phrase recognition works:
CharsRef canonical = createCharsRef("canonical phrase");
CharsRef alias = createCharsRef("alias phrase");
builder.add(canonical, canonical, true);
builder.add(alias, canonical, true);
However, if I parse the string "alias phrase", I get this query:
foo:"canonical phrase" foo:"alias phrase"
This skews the scores, because another document that contains both
phrases scores higher. The scores are better with useOrig=false (the
third parameter of builder.add, named includeOrig in the Lucene API),
but then phrase recognition no longer works. The string "alias phrase"
now results in this query:
foo:"canonical" foo:"phrase"
It feels to me that this is a bug, and phrase recognition should also
work with useOrig=false.
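To make the skew concrete, here is a toy sketch in plain Java, with no
Lucene involved. It assumes, purely for illustration, that every
matching clause contributes a score of 1.0 and that the scores of all
matching optional clauses are summed; real similarity scores will
differ. The class ClauseSumDemo and its score method are made up for
this sketch.

```java
import java.util.List;

public class ClauseSumDemo {
    // Toy score: 1.0 for every query phrase that occurs in the document text.
    // Assumption: this stands in for summing the scores of matching optional clauses.
    static double score(String doc, List<String> phrases) {
        double sum = 0.0;
        for (String phrase : phrases) {
            if (doc.contains(phrase)) {
                sum += 1.0;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The two phrase clauses produced for the input "alias phrase".
        List<String> query = List.of("canonical phrase", "alias phrase");
        String onlyAlias = "a text with the alias phrase in it";
        String both = "the alias phrase next to the canonical phrase";
        System.out.println("only alias: " + score(onlyAlias, query));
        System.out.println("both:       " + score(both, query));
    }
}
```

In this toy model the document containing both phrases gets twice the
score of the one containing only the alias, although the user typed
just one phrase.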
What do people think?
Thanks,
Kai
On 2025-11-26 14:43, Kai Grossjohann wrote:
Thank you Mikhail, very interesting. It has taken me a long time to
reply because other priorities got in the way.
With “enable position increments” it works much better. “Split on
whitespace” has to be false (as you say) and “auto-generate phrase
queries” also has to be false. But interestingly enough,
“auto-generate multi-term synonyms phrase query” can be true, and
setting it to true helps.
This is now good enough for my actual application code. I do still
see some oddities. One of them is hopefully more cosmetic, and the
other can be worked around.
I will work around the following behavior:
* If a phrase appears as the /output/, but not as the /input/, of a
SynonymMap entry, then it is /not/ automatically recognized.
* A phrase that appears as the input of a SynonymMap entry is
automatically recognized.
“My” synonyms are structured in such a way that there is a canonical
term and multiple possible alias terms. My understanding was that I
should have one SynonymMap entry per alias term, each of them
specifying the alias term as input and the canonical term as output.
I will work around the problem by adding another SynonymMap entry,
specifying the canonical term as both input and output.
* If I map a phrase to itself (i.e. both input and output) then it's
doubled in the resulting query.
The workaround above means that the canonical terms are doubled in the
query, but I'm just going to live with that. I hope it doesn't skew
the weights too badly.
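The workaround (one entry per alias, plus one entry mapping the
canonical term to itself) can be sketched in plain Java roughly as
follows. The Entry holder and the entriesFor helper are hypothetical
names for this sketch, not Lucene types; in real code each pair would
be converted to CharsRef and handed to builder.add.

```java
import java.util.ArrayList;
import java.util.List;

public class SynonymEntries {
    /** One (input, output) pair; a hypothetical holder, not a Lucene type. */
    static final class Entry {
        final String input;
        final String output;

        Entry(String input, String output) {
            this.input = input;
            this.output = output;
        }

        @Override
        public String toString() {
            return input + " -> " + output;
        }
    }

    /** Builds the self-mapping for the canonical term plus one entry per alias. */
    static List<Entry> entriesFor(String canonical, List<String> aliases) {
        List<Entry> entries = new ArrayList<>();
        // Self-mapping, so that the canonical phrase also appears as an input.
        entries.add(new Entry(canonical, canonical));
        for (String alias : aliases) {
            // Skip an alias equal to the canonical term to avoid a second self-mapping.
            if (!alias.equals(canonical)) {
                entries.add(new Entry(alias, canonical));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : entriesFor("canonical phrase", List.of("alias phrase", "other alias"))) {
            // Each pair would be passed to builder.add(input, output, true) in real code.
            System.out.println(e);
        }
    }
}
```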
Kai
On 2025-11-03 21:38, Mikhail Khludnev wrote:
Hello Kai
Pardon the vibe coding, but this sample
https://github.com/mkhludnev/mutlyword-phrase-query-test/blob/3e3f1cce6b2b6790970e4a042ddb2967e49d0077/src/test/java/org/example/phrases/MultiWordTests.java#L88
parses plain biword "power grid" without quotes as a bool/should of
phrases
org.example.phrases.MultiWordTests#testPhraseQueryGeneratedFromPlainMultiWordSynonym
Parsed Query for 'power grid': ("electrical grid" "power grid")
Does it look closer to what you are looking for?
On Mon, Nov 3, 2025 at 1:50 PM Kai Grossjohann
<[email protected]> wrote:
Hi Mikhail,
I tried to change this to false, and this was the result:
java.lang.IllegalArgumentException:
setAutoGeneratePhraseQueries(true) is disallowed when
getSplitOnWhitespace() == false
I experimented with other combinations of setSplitOnWhitespace,
setAutoGeneratePhraseQueries, and
setAutoGenerateMultiTermSynonymsPhraseQuery. None of them got me
the phrase queries I'm looking for. Though some of them searched
for more synonyms.
In particular, false/false/true (in the order of the three setters
listed above) resulted in “synonym alias” being parsed as
Synonym(foo:canonical foo:synonym) Synonym(foo:alias foo:phrase),
which still doesn't produce the foo:"canonical phrase"~1 that I was
looking for.
Kai
On 2025-10-30 18:01, Mikhail Khludnev wrote:
Hello Kai
Briefly skimming through your message:
queryParser.setSplitOnWhitespace(true); // shouldn't false be here?
queryParser.setAutoGeneratePhraseQueries(true);
queryParser.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
queryParser.setPhraseSlop(1);
Query q = queryParser.parse("canonical phrase");
assertEquals("foo:canonical foo:phrase", q.toString(),
    "I was expecting a phrase query here: foo:\"canonical phrase\"~1");
On Thu, Oct 30, 2025 at 4:49 PM Kai Grossjohann
<[email protected]> wrote:
I thought if I have a synonym map that says “synonym alias” is an alias
for “canonical phrase”, and I noodle “canonical phrase” through the
query parser, telling it to auto generate multi term queries, I'd get a
multi term query. But that doesn't seem to be the case.
The only way to generate multi term queries seems to be when the synonym
says that “shortsyn” is an alias for “another phrase”, and then noodle
“shortsyn” through the query parser. Then I get foo:"another phrase"~1
which is what I expected.
My use case is as follows: I have some multi-word strings, and I need to
create queries from them. And if one of the synonym phrases appears in
the multi-word string, then I would like to generate a phrase query for
that part. For example, given the synonyms mentioned above, if the
multi-word string is, say, “my synonym alias is nice”, then I'd like to
generate a query that searches for the word “my”, the phrase “canonical
phrase”, and the words “is” and “nice”. Maybe I would like to
/also/ search for the words “synonym” and “alias”, or the words
“canonical” and “phrase”, or all four of them, I'm not sure.
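As a rough illustration of the output I have in mind (not of how
Lucene works internally), here is a plain-Java sketch that scans the
tokens, tries the longest synonym input first, and emits a phrase
clause for the mapped output. The class DesiredQuerySketch, its build
method, and the maxPhraseLen parameter are invented for this sketch,
and phrase slop is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DesiredQuerySketch {
    /** Builds a query string, replacing recognized synonym inputs by their output phrase. */
    static String build(String field, String text, Map<String, String> synonyms, int maxPhraseLen) {
        String trimmed = text.toLowerCase(Locale.ROOT).trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            String output = null;
            // Try the longest candidate phrase starting at position i first.
            for (int len = Math.min(maxPhraseLen, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String mapped = synonyms.get(candidate);
                if (mapped != null) {
                    matched = len;
                    output = mapped;
                    break;
                }
            }
            if (matched > 0) {
                clauses.add(clause(field, output)); // synonym input recognized
                i += matched;
            } else {
                clauses.add(field + ":" + tokens[i]); // plain single-term clause
                i++;
            }
        }
        return String.join(" ", clauses);
    }

    /** Formats a multi-word output as a quoted phrase, a single word as a term. */
    static String clause(String field, String termOrPhrase) {
        return termOrPhrase.contains(" ")
                ? field + ":\"" + termOrPhrase + "\""
                : field + ":" + termOrPhrase;
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = Map.of(
                "synonym alias", "canonical phrase",
                "shortsyn", "another phrase");
        System.out.println(build("foo", "my synonym alias is nice", synonyms, 2));
    }
}
```

With the two synonyms above, the input “my synonym alias is nice”
yields foo:my foo:"canonical phrase" foo:is foo:nice. Note that
“canonical phrase” itself would only be recognized here if it were
also added as an input.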
This description left out quite a bit of information, so I'll paste
some code below to clarify.
Kai
/**
 * This tests the behavior of the Lucene query
 * builder with synonyms
 */
public class SynonymGraphQueryBuilderTest {

    private static class MyAnalyzer extends Analyzer {
        private final CharArraySet stopwords;
        private final SynonymMap synonyms;

        public MyAnalyzer(Set<String> stopwords, SynonymMap synonyms) {
            this.stopwords = new CharArraySet(stopwords, true);
            this.synonyms = synonyms;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            final Tokenizer src = new SimplePatternTokenizer("[a-z0-9]+");
            TokenStream tok = new LowerCaseFilter(src);
            tok = new SynonymGraphFilter(tok, synonyms, true);
            tok = new FlattenGraphFilter(tok);
            tok = new StopFilter(tok, stopwords);
            return new TokenStreamComponents(src::setReader, tok);
        }
    }

    @Test
    void testSynonymPhrases() throws Exception {
        Builder builder = new Builder();

        // canonical phrase <- synonym alias
        CharsRef canonical = Builder.join(
            new String[] { "canonical", "phrase" }, new CharsRefBuilder());
        CharsRef synonym = Builder.join(
            new String[] { "synonym", "alias" }, new CharsRefBuilder());
        builder.add(synonym, canonical, true);

        // another phrase <- shortsyn
        canonical = Builder.join(
            new String[] { "another", "phrase" }, new CharsRefBuilder());
        synonym = Builder.join(
            new String[] { "shortsyn" }, new CharsRefBuilder());
        builder.add(synonym, canonical, true);

        SynonymMap synonyms = builder.build();
        Set<String> stopwords = Set.of("the");
        MyAnalyzer analyzer = new MyAnalyzer(stopwords, synonyms);

        QueryParser queryParser = new QueryParser("foo", analyzer);
        queryParser.setSplitOnWhitespace(true);
        queryParser.setAutoGeneratePhraseQueries(true);
        queryParser.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
        queryParser.setPhraseSlop(1);

        Query q = queryParser.parse("canonical phrase");
        assertEquals("foo:canonical foo:phrase", q.toString(),
            "I was expecting a phrase query here: foo:\"canonical phrase\"~1");

        q = queryParser.parse("synonym alias");
        assertEquals("foo:synonym foo:alias", q.toString(),
            "I was expecting a phrase query here: foo:\"canonical phrase\"~1");

        q = queryParser.parse("shortsyn");
        assertEquals("foo:\"another phrase\"~1 foo:shortsyn", q.toString(),
            "This is what I expected.");

        q = queryParser.parse("another phrase");
        assertEquals("foo:another foo:phrase", q.toString(),
            "I was expecting a phrase query here: foo:\"another phrase\"~1");
    }
}
--
Sincerely yours
Mikhail Khludnev