Hi, Rinat! On Oct 21, Rinat Ibragimov wrote: > > > + > > > + // Avoid creating partial n-grams by stopping at word boundaries. > > > + // Underscore is treated as a word character too, since identifiers > > > in > > > + // programming languages often contain them. > > > + if (!is_alnum(char_type) && !is_underscore(cs, next, end)) { > > > + start= next + char_len; > > > + next= start; > > > + n_chars= 0; > > > + continue; > > > > I don't think it's a good idea. On the opposite, I would rather create > > ngrams over word boundaries, so that, say, I'd split "n-gram plugin" to > > > > "n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin". > > I see how this can increase index size: more different strings are > included. Implementation will become a tiny bit simpler. But can it > make searching better somehow?
Yes, the idea is that it'll rank "One can use n-gram plugin in MariaDB fulltext search" higher than "Plugins are useful for extending functionality of a program" because some ngrams encode that "n-gram" and "plugin" words must be together as a phrase. > I can't decide. From my point of view, the current approach is fine. > Please pick a variant, and I'll try to implement that. No, I cannot guess which approach will produce more relevant searches. Implement something and then we test what works better > > > + } > > > + > > > + next += char_len; > > > + n_chars++; > > > + > > > + if (n_chars == ngram_token_size) { > > > + bool_info->position= static_cast<uint>(start - doc); > > > > nope, we don't have MYSQL_FTPARSER_FULL_BOOLEAN_INFO::position and this > > plugin is definitely not the reason to break the API and add it. > > There is a patch in the patchset that adds the "position" field. Without > that field, > it's possible to trigger an assertion in InnoDB's FTS code by executing: > > INSTALL PLUGIN simple_parser SONAME 'mypluglib'; > CREATE TABLE articles( > a TEXT DEFAULT NULL, > b TEXT DEFAULT NULL, > FULLTEXT (a, b) WITH PARSER simple_parser > ) ENGINE=InnoDB; > INSERT INTO articles VALUES ('111', '1234 1234 1234'); > DROP TABLE articles; > UNINSTALL PLUGIN simple_parser; > > Same assertion may be triggered by any fts plugin. Are you sure you > want that patch to be removed? I don't think InnoDB assert is a reason to extend the fulltext plugin API. What about this fix instead: diff --git a/storage/innobase/fts/fts0fts.cc b/storage/innobase/fts/fts0fts.cc --- a/storage/innobase/fts/fts0fts.cc +++ b/storage/innobase/fts/fts0fts.cc @@ -4669,7 +4669,7 @@ fts_tokenize_add_word_for_parser( ut_ad(boolean_info->position >= 0); position = boolean_info->position + fts_param->add_pos; */ - position = fts_param->add_pos; + position = fts_param->add_pos++; fts_add_token(result_doc, str, position); this fixed the crash for me. > > > + n_chars= ngram_token_size - 1; > > > + is_first= false; > > > + } > > > + } > > > + > > > + // If nothing was emitted yet, emit the remainder. > > > > what reminder, what is it for? > > It's "remainder", with "a" before "i". Remainder of the processed > string. That way strings that are too short for having even a single > n-gram, will still generate some index entries and will be > discoverable by exact queries. Perhaps the comment should say it? That if the string too short, it'll be emitted here? > > > + case MYSQL_FTPARSER_FULL_BOOLEAN_INFO: > > > + // Reusing InnoDB's boolean mode parser to handle all special > > > characters. > > > + while (fts_get_word(cs, &start, end, &word, &bool_info)) { > > > > nope. InnoDB might not even be compiled it. Use param->mysql_parse() > > Copied fts_get_word() implementation into the plugin code. > > As far as I understand, param->mysql_parse() cannot be used here as it > splits a string into words like default fulltext search does. This > plugin however must generate n-grams. Of course, it can. Note that fts_get_word() doesn't generate n-grams either, it gets the whole word and the n-gram plugin later splits it into n-grams. Similarly param->mysql_parse() will extract words for you and you'll split them into n-grams. Regards, Sergei VP of MariaDB Server Engineering and secur...@mariadb.org _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp