fabriziofortino commented on code in PR #2193:
URL: https://github.com/apache/jackrabbit-oak/pull/2193#discussion_r2021182572
##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -145,49 +157,93 @@ public static IndexSettingsAnalysis.Builder buildCustomAnalyzers(NodeState state

     @NotNull
     private static TokenizerDefinition loadTokenizer(NodeState state) {
-        String name = normalize(Objects.requireNonNull(state.getString(FulltextIndexConstants.ANL_NAME)));
-        Map<String, Object> args = convertNodeState(state);
+        String name;
+        Map<String, Object> args;
+        if (!state.exists()) {
+            LOG.warn("No tokenizer specified; using the standard tokenizer with an empty configuration");
+            name = "Standard";
+            args = new HashMap<>();
+        } else {
+            name = Objects.requireNonNull(state.getString(FulltextIndexConstants.ANL_NAME));
+            try {
+                args = convertNodeState(state);
+            } catch (IOException e) {
+                LOG.warn("Cannot load tokenizer; using an empty configuration", e);
+                args = new HashMap<>();
+            }
+        }
+        name = normalize(name);
+        if ("n_gram".equals(name)) {
+            // OAK-11568
+            // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
+            Integer minGramSize = getIntegerSetting(args, "minGramSize", 2);
+            Integer maxGramSize = getIntegerSetting(args, "maxGramSize", 3);
+            return TokenizerDefinition.of(t -> t.ngram(
+                    NGramTokenizer.of(n -> n.minGram(minGramSize).maxGram(maxGramSize))));
+        }

Review Comment:
This is okay for now. We should structure it better to cover all the possible tokenizers (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). This can go in a separate PR.
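For context, the branch above maps the Lucene-style `minGramSize`/`maxGramSize` arguments onto Elasticsearch's `ngram` tokenizer. A minimal sketch of the index settings this would produce (the index and analyzer names here are illustrative, not from the PR):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "oak_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "oak_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "oak_ngram"
        }
      }
    }
  }
}
```

Note that Elasticsearch rejects configurations where `max_gram - min_gram` exceeds `index.max_ngram_diff` (default 1), so the 2/3 defaults chosen here stay within the limit.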
##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -201,6 +257,21 @@ private static <FD> LinkedHashMap<String, FD> loadFilters(NodeState state,

             Map<String, Object> args = convertNodeState(child, transformers, content);

+            if (name.equals("word_delimiter")) {
+                // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
+                // Elastic recommends word_delimiter_graph over word_delimiter:
+                // the word_delimiter filter can produce invalid token graphs.
+                LOG.info("Replacing the word_delimiter filter with word_delimiter_graph");
+                name = "word_delimiter_graph";
+            }
+            if (name.equals("hyphenation_compound_word")) {
+                name = "hyphenation_decompounder";
+                String hyphenator = args.getOrDefault("hyphenator", "").toString();
+                LOG.info("Using the hyphenation_decompounder: {}", hyphenator);
+                args.put("hyphenation_patterns_path", "analysis/hyphenation_patterns.xml");

Review Comment:
Should `"analysis/hyphenation_patterns.xml"` be installed in the Elastic nodes?
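On the question above: Elasticsearch resolves `hyphenation_patterns_path` relative to its config directory, so the patterns file would indeed have to be present on every node of the cluster. A sketch of the filter definition the rewritten code would generate (the filter name and word list are illustrative):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "oak_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "word_list": ["Kaffee", "zucker", "tasse"]
        }
      }
    }
  }
}
```

The decompounder also requires a `word_list` (or `word_list_path`) in addition to the hyphenation patterns, so the conversion would need to carry that argument over as well.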
##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -221,14 +292,31 @@ private static <FD> LinkedHashMap<String, FD> loadFilters(NodeState state,
             }
             args.put(ANALYZER_TYPE, name);

-            filters.put(name + "_" + i, factory.apply(name, JsonData.of(args)));
+            if (skipEntry) {
+                continue;
+            }
+            String key = name + "_" + i;
+            filters.put(key, factory.apply(name, JsonData.of(args)));
+            if (name.equals("word_delimiter_graph")) {
+                wordDelimiterFilterKey = key;
+            } else if (name.equals("synonym")) {
+                if (wordDelimiterFilterKey != null) {
+                    LOG.info("Removing the word delimiter filter because a synonym filter is also configured: {}", wordDelimiterFilterKey);
+                    filters.remove(wordDelimiterFilterKey);
+                }
+            }

Review Comment:
Another option could be the use of https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-multiplexer-tokenfilter.html
We can work on this in a separate PR.
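The multiplexer alternative mentioned in the review comment would apply word delimiting and synonyms on parallel branches of the token stream instead of dropping the word delimiter outright. A minimal sketch in the style of the Elasticsearch docs (names are illustrative; note the docs warn that multi-word synonym and other read-ahead filters do not behave normally inside a multiplexer's `filters` array, so the exact branch composition would need verification):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "oak_multiplexer": {
          "type": "multiplexer",
          "filters": ["lowercase", "lowercase, word_delimiter"]
        }
      },
      "analyzer": {
        "oak_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["oak_multiplexer"]
        }
      }
    }
  }
}
```

Each entry in `filters` is one branch (a comma-separated chain is allowed), and every token is emitted once per branch at the same position, with duplicates removed.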