Re: How does auto-generating phrases work?

Kai Grossjohann Fri, 06 Mar 2026 07:05:50 -0800

Cycling back on this one...  I'm in a bit of a bind now.


Using a SynonymMap with useOrig=true, the phrase recognition works:

CharsRef canonical = createCharsRef("canonical phrase");
CharsRef alias = createCharsRef("alias phrase");
builder.add(canonical, canonical, true);
builder.add(alias, canonical, true);

However, if I parse the string "alias phrase", then I get as query: foo:"canonical phrase" foo:"alias phrase"

This results in skewed scores, as another document that contains both of them scores higher. The score is better with useOrig=false (third parameter of builder.add), but then phrase recognition no longer works: the string "alias phrase" now results in the query: foo:"canonical" foo:"phrase"

It feels to me that this is a bug, and phrase recognition should also work with useOrig=false.
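
To illustrate why keeping the original skews the ranking, here is a toy model in plain Java (no Lucene involved). It treats the parsed query as a list of optional phrase clauses and counts how many of them occur in a document. The class and method names are made up for this illustration, and the clause count is only a crude stand-in for Lucene's real scoring.

```java
import java.util.List;

public class ClauseOverlapSketch {

    // Counts how many of the optional phrase clauses occur in the document.
    // Hypothetical helper: a crude stand-in for scoring, not Lucene's similarity.
    public static int matchingClauses(String document, List<String> phraseClauses) {
        int matches = 0;
        for (String phrase : phraseClauses) {
            if (document.contains(phrase)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Clauses reported for the input "alias phrase" when the original is kept.
        List<String> clauses = List.of("canonical phrase", "alias phrase");

        String onlyAlias = "text with alias phrase in it";
        String both = "text with alias phrase and canonical phrase in it";

        System.out.println(matchingClauses(onlyAlias, clauses)); // prints 1
        System.out.println(matchingClauses(both, clauses));      // prints 2
    }
}
```

In this model the document containing both phrases matches two clauses instead of one, which is the kind of advantage described above.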


What do people think?

Thanks,
Kai

On 2025-11-26 14:43, Kai Grossjohann wrote:

Thank you Mikhail, very interesting. It has taken me a long time to reply because I got other priorities...

With “enable position increments” it works much better. “Split on whitespace” has to be false (as you say) and “auto-generate phrase queries” also has to be false. But interestingly enough, “auto-generate multi-term synonyms phrase query” can be true, and setting it to true helps.

This is now good enough for my actual application code. I do still see some oddities. One of them is hopefully more cosmetic, and the other can be worked around.


I will work around the following behavior:

  * If a phrase appears as the /output/, but not as the /input/, of a
    SynonymMap entry, then it is /not/ automatically recognized.
  * A phrase that appears as the input of a SynonymMap entry is
    automatically recognized.

“My” synonyms are structured in such a way that there is a canonical term and multiple possible alias terms. My understanding was that I should have one SynonymMap entry per alias term, each of them specifying the alias term as input and the canonical term as output. I will work around the problem by adding another SynonymMap entry, specifying the canonical term as both input and output.
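
A minimal sketch of that entry layout, in plain Java without Lucene on the classpath: for one canonical phrase and its aliases it lists the (input, output) pairs that would be handed to builder.add(...). The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymEntriesSketch {

    // Returns (input, output) pairs: key = input phrase, value = output phrase.
    // Hypothetical helper; each pair would correspond to one builder.add(input, output, ...) call.
    public static List<Map.Entry<String, String>> entriesFor(String canonical, List<String> aliases) {
        List<Map.Entry<String, String>> entries = new ArrayList<>();
        // Workaround entry: the canonical phrase is mapped to itself so that it
        // also appears as an input and is therefore recognized as a phrase.
        entries.add(Map.entry(canonical, canonical));
        // One entry per alias, with the alias as input and the canonical phrase as output.
        for (String alias : aliases) {
            entries.add(Map.entry(alias, canonical));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : entriesFor("canonical phrase", List.of("alias phrase"))) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For the example in main this prints "canonical phrase -> canonical phrase" followed by "alias phrase -> canonical phrase".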


  * If I map a phrase to itself (i.e. both input and output) then it's
    doubled in the resulting query.

The workaround above means that the canonical terms are doubled in the query, but I'm just going to live with that. I hope it doesn't skew the weights too badly.


Kai


On 2025-11-03 21:38, Mikhail Khludnev wrote:

Hello Kai

Pardon the vibe coding, but this sample
https://github.com/mkhludnev/mutlyword-phrase-query-test/blob/3e3f1cce6b2b6790970e4a042ddb2967e49d0077/src/test/java/org/example/phrases/MultiWordTests.java#L88

parses the plain biword "power grid" (without quotes) as a bool/should of phrases


org.example.phrases.MultiWordTests#testPhraseQueryGeneratedFromPlainMultiWordSynonym

Parsed Query for 'power grid': ("electrical grid" "power grid")
Does it look closer to what you are looking for?

On Mon, Nov 3, 2025 at 1:50 PM Kai Grossjohann <[email protected]> wrote:


    Hi Mikhail,

    I tried to change this to false, and this was the result:

    java.lang.IllegalArgumentException:
    setAutoGeneratePhraseQueries(true) is disallowed when
    getSplitOnWhitespace() == false

    I experimented with other combinations of setSplitOnWhitespace,
    setAutoGeneratePhraseQueries, and
    setAutoGenerateMultiTermSynonymsPhraseQuery.  None of them got me
    the phrase queries I'm looking for.  Though some of them searched
    for more synonyms.

    In particular, false/false/true resulted in “synonym alias” being
    parsed as Synonym(foo:canonical foo:synonym) Synonym(foo:alias
    foo:phrase) which still doesn't produce the foo:"canonical
    phrase"~1 that I was looking for.

    Kai

    On 2025-10-30 18:01, Mikhail Khludnev wrote:

    Hello Kai

    Briefly skimming through the letter

              queryParser.setSplitOnWhitespace(true); // shouldn't false be here?
              queryParser.setAutoGeneratePhraseQueries(true);
              queryParser.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
              queryParser.setPhraseSlop(1);

              Query q = queryParser.parse("canonical phrase");
              assertEquals("foo:canonical foo:phrase", q.toString(),
                      "I was expecting a phrase query here: foo:\"canonical phrase\"~1");



    On Thu, Oct 30, 2025 at 4:49 PM Kai Grossjohann <[email protected]> wrote:

    I thought if I have a synonym map that says “synonym alias” is an alias
    for “canonical phrase”, and I noodle “canonical phrase” through the
    query parser, telling it to auto generate multi term queries, I'd get a
    multi term query.  But that doesn't seem to be the case.

    The only way to generate multi term queries seems to be when the synonym
    says that “shortsyn” is an alias for “another phrase”, and then noodle
    “shortsyn” through the query parser.  Then I get foo:"another phrase"~1
    which is what I expected.

    My use case is as follows: I have some multi-word strings, and I need to
    create queries from them.  And if one of the synonym phrases appears in
    the multi-word string, then I would like to generate a phrase query for
    that part.  For example, given the synonyms mentioned above, if the
    multi-word string is, say, “my synonym alias is nice”, then I'd like to
    generate a query that searches for the word “my”, the phrase “canonical
    phrase”, and the words “is” and “nice”.  Maybe I would like to
    /also/ search for the words “synonym” and “alias”, or the words
    “canonical” and “phrase”, or all four of them, I'm not sure.
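
The desired output for this use case can be sketched by hand in plain Java. This is not how Lucene's QueryParser works; it is a hypothetical greedy longest-match rewrite over whitespace-separated tokens, with invented class and method names, that only shows what query string is being asked for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhraseRewriteSketch {

    // Replaces each known alias phrase with a quoted canonical phrase clause;
    // every other token becomes a plain term clause. Hypothetical helper.
    public static String rewrite(String field, String text, Map<String, String> aliasToCanonical) {
        if (text.isBlank()) {
            return "";
        }
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String canonical = null;
            int matchedLength = 0;
            // Try the longest candidate phrase starting at position i first.
            for (int end = tokens.length; end > i; end--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, end));
                if (aliasToCanonical.containsKey(candidate)) {
                    canonical = aliasToCanonical.get(candidate);
                    matchedLength = end - i;
                    break;
                }
            }
            if (canonical != null) {
                clauses.add(field + ":\"" + canonical + "\"");
                i += matchedLength;
            } else {
                clauses.add(field + ":" + tokens[i]);
                i++;
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = Map.of("synonym alias", "canonical phrase");
        // prints: foo:my foo:"canonical phrase" foo:is foo:nice
        System.out.println(rewrite("foo", "my synonym alias is nice", synonyms));
    }
}
```

Whether the individual words “synonym”, “alias”, “canonical” and “phrase” should be added as extra clauses is left open here, as it is in the description above.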

    This description left out quite a bit of information, I'll paste some
    code below to clarify.

    Kai

    // Imports added for completeness; package names assume Lucene 9.x and JUnit 5.
    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.FlattenGraphFilter;
    import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.analysis.synonym.SynonymMap.Builder;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.CharsRef;
    import org.apache.lucene.util.CharsRefBuilder;
    import org.junit.jupiter.api.Test;

    /**
       * This tests the behavior of the Lucene query
       * builder with synonyms
       */
    public class SynonymGraphQueryBuilderTest {

          private static class MyAnalyzer extends Analyzer {
              private final CharArraySet stopwords;
              private final SynonymMap synonyms;

              public MyAnalyzer(Set<String> stopwords, SynonymMap synonyms) {
                  this.stopwords = new CharArraySet(stopwords, true);
                  this.synonyms = synonyms;
              }

              @Override
              protected TokenStreamComponents createComponents(String fieldName) {
                  final Tokenizer src = new SimplePatternTokenizer("[a-z0-9]+");
                  TokenStream tok = new LowerCaseFilter(src);
                  tok = new SynonymGraphFilter(tok, synonyms, true);
                  tok = new FlattenGraphFilter(tok);
                  tok = new StopFilter(tok, stopwords);
                  return new TokenStreamComponents(
                          src::setReader,
                          tok);
              }
          }

          @Test
          void testSynonymPhrases() throws Exception {
              Builder builder = new Builder();

              // canonical phrase <- synonym alias
              CharsRef canonical = Builder.join(new String[] { "canonical", "phrase" },
                      new CharsRefBuilder());
              CharsRef synonym = Builder.join(new String[] { "synonym", "alias" },
                      new CharsRefBuilder());
              builder.add(synonym, canonical, true);

              // another phrase <- shortsyn
              canonical = Builder.join(new String[] { "another", "phrase" },
                      new CharsRefBuilder());
              synonym = Builder.join(new String[] { "shortsyn" }, new CharsRefBuilder());
              builder.add(synonym, canonical, true);

              SynonymMap synonyms = builder.build();

              Set<String> stopwords = Set.of("the");

              MyAnalyzer analyzer = new MyAnalyzer(stopwords, synonyms);

              QueryParser queryParser = new QueryParser("foo", analyzer);
              queryParser.setSplitOnWhitespace(true);
              queryParser.setAutoGeneratePhraseQueries(true);
              queryParser.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
              queryParser.setPhraseSlop(1);

              Query q = queryParser.parse("canonical phrase");
              assertEquals("foo:canonical foo:phrase", q.toString(),
                      "I was expecting a phrase query here: foo:\"canonical phrase\"~1");

              q = queryParser.parse("synonym alias");
              assertEquals("foo:synonym foo:alias", q.toString(),
                      "I was expecting a phrase query here: foo:\"canonical phrase\"~1");

              q = queryParser.parse("shortsyn");
              assertEquals("foo:\"another phrase\"~1 foo:shortsyn", q.toString(),
                      "This is what I expected.");

              q = queryParser.parse("another phrase");
              assertEquals("foo:another foo:phrase", q.toString(),
                      "I was expecting a phrase query here: foo:\"another phrase\"~1");
          }
    }



--
Sincerely yours
Mikhail Khludnev
