fabriziofortino commented on code in PR #2193:
URL: https://github.com/apache/jackrabbit-oak/pull/2193#discussion_r2021182572
##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -145,49 +157,93 @@ public static IndexSettingsAnalysis.Builder buildCustomAnalyzers(NodeState state

     @NotNull
     private static TokenizerDefinition loadTokenizer(NodeState state) {
-        String name = normalize(Objects.requireNonNull(state.getString(FulltextIndexConstants.ANL_NAME)));
-        Map<String, Object> args = convertNodeState(state);
+        String name;
+        Map<String, Object> args;
+        if (!state.exists()) {
+            LOG.warn("No tokenizer specified; using the standard tokenizer with an empty configuration");
+            name = "Standard";
+            args = new HashMap<>();
+        } else {
+            name = Objects.requireNonNull(state.getString(FulltextIndexConstants.ANL_NAME));
+            try {
+                args = convertNodeState(state);
+            } catch (IOException e) {
+                LOG.warn("Cannot load tokenizer; using an empty configuration", e);
+                args = new HashMap<>();
+            }
+        }
+        name = normalize(name);
+        if ("n_gram".equals(name)) {
+            // OAK-11568
+            // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
+            Integer minGramSize = getIntegerSetting(args, "minGramSize", 2);
+            Integer maxGramSize = getIntegerSetting(args, "maxGramSize", 3);
+            return TokenizerDefinition.of(t -> t.ngram(
+                    NGramTokenizer.of(n -> n.minGram(minGramSize).maxGram(maxGramSize))));
+        }

Review Comment:
This is okay for now. We should structure it better to cover all the possible tokenizers (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). This can go in a separate PR.
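For context, the branch above maps the Lucene-style `minGramSize`/`maxGramSize` arguments onto Elasticsearch's `ngram` tokenizer. A minimal sketch of the index settings this would produce (the index and analyzer names here are illustrative, not from the PR):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "oak_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "oak_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "oak_ngram"
        }
      }
    }
  }
}
```

Note that Elasticsearch rejects configurations where `max_gram - min_gram` exceeds `index.max_ngram_diff` (default 1), so the 2/3 defaults chosen here stay within the limit.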
##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -201,6 +257,21 @@ private static <FD> LinkedHashMap<String, FD> loadFilters(NodeState state,

             Map<String, Object> args = convertNodeState(child, transformers, content);

+            if (name.equals("word_delimiter")) {
+                // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
+                // Elastic recommends word_delimiter_graph over word_delimiter:
+                // the word_delimiter filter can produce invalid token graphs.
+                LOG.info("Replacing the word_delimiter filter with word_delimiter_graph");
+                name = "word_delimiter_graph";
+            }
+            if (name.equals("hyphenation_compound_word")) {
+                name = "hyphenation_decompounder";
+                String hyphenator = args.getOrDefault("hyphenator", "").toString();
+                LOG.info("Using the hyphenation_decompounder: {}", hyphenator);
+                args.put("hyphenation_patterns_path", "analysis/hyphenation_patterns.xml");

Review Comment:
Should `"analysis/hyphenation_patterns.xml"` be installed in the Elastic nodes?
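On the question above: Elasticsearch resolves `hyphenation_patterns_path` relative to its config directory, so the patterns file would indeed have to be present on every node of the cluster. A sketch of the filter definition the rewritten code would generate (the filter name and word list are illustrative):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "oak_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "word_list": ["Kaffee", "zucker", "tasse"]
        }
      }
    }
  }
}
```

The decompounder also requires a `word_list` (or `word_list_path`) in addition to the hyphenation patterns, so the conversion would need to carry that argument over as well.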
##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -221,14 +292,31 @@ private static <FD> LinkedHashMap<String, FD> loadFilters(NodeState state,
             }
             args.put(ANALYZER_TYPE, name);

-            filters.put(name + "_" + i, factory.apply(name, JsonData.of(args)));
+            if (skipEntry) {
+                continue;
+            }
+            String key = name + "_" + i;
+            filters.put(key, factory.apply(name, JsonData.of(args)));
+            if (name.equals("word_delimiter_graph")) {
+                wordDelimiterFilterKey = key;
+            } else if (name.equals("synonym")) {
+                if (wordDelimiterFilterKey != null) {
+                    LOG.info("Removing the word delimiter filter because a synonym filter is also configured: {}", wordDelimiterFilterKey);
+                    filters.remove(wordDelimiterFilterKey);
+                }
+            }

Review Comment:
Another option could be the use of https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-multiplexer-tokenfilter.html
We can work on this in a separate PR.
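The multiplexer alternative mentioned in the review comment would apply word delimiting and synonyms on parallel branches of the token stream instead of dropping the word delimiter outright. A minimal sketch in the style of the Elasticsearch docs (names are illustrative; note the docs warn that multi-word synonym and other read-ahead filters do not behave normally inside a multiplexer's `filters` array, so the exact branch composition would need verification):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "oak_multiplexer": {
          "type": "multiplexer",
          "filters": ["lowercase", "lowercase, word_delimiter"]
        }
      },
      "analyzer": {
        "oak_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["oak_multiplexer"]
        }
      }
    }
  }
}
```

Each entry in `filters` is one branch (a comma-separated chain is allowed), and every token is emitted once per branch at the same position, with duplicates removed.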