Re: Tokenization and SAI query syntax

Josh McKenzie Mon, 07 Aug 2023 14:27:07 -0700

Been chatting a bit w/Caleb about this offline and poking around to better 
educate myself.


> using functions (ignoring the implementation complexity) at least removes 
> ambiguity. 
This, plus using functions lets us kick the can down the road a bit in terms of 
landing on an integrated grammar we agree on. It seems to me there's a tension 
between:
 1. "SQL-like" (i.e. postgres-like)
 2. "Indexing and Search domain-specific-like" (i.e. lucene syntax which, as 
Benedict points out, doesn't really jell w/what we have in CQL at this point), 
and
 3. ??? Some other YOLO CQL / C* specific thing where we go our own road
I don't think we're really going to know what our feature-set in terms of 
indexing is going to look like or the shape it's going to take for awhile, so 
backing ourselves into any of the 3 corners above right now feels very 
premature to me.

So I'm coming around to the expr / method call approach to preserve that 
flexibility. It's maximally explicit and preserves optionality at the expense 
of being clunky. For now.

On Mon, Aug 7, 2023, at 4:00 PM, Caleb Rackliffe wrote:
> > I do not think we should start using lucene syntax for it, it will make 
> > people think they can do everything else lucene allows.
> 
> I'm sure we won't be supporting everything Lucene allows, but this is going 
> to evolve. Right off the bat, if you introduce support for tokenization and 
> filtering, someone is, for example, going to ask for phrase queries. ("John 
> Smith landed in Virginia" is tokenized, but someone wants to match exactly on 
> "John Smith".) The whole point of the Vector project is to do relevance, 
> right? Are we going to do term boosting? Do we need queries like "field: 
> quick brown +fox -news" where fox must be present, news cannot be present, 
> and quick and brown increase relevance?
> 
> SASI uses "=" and "LIKE" in a way that assumes the user understands the 
> tokenization scheme in use on the target field. I understand that's a bit 
> ambiguous.
> 
> If we object to allowing expr embedding of a subset of the Lucene syntax, I 
> can't imagine we're okay w/ then jamming a subset of that syntax into the 
> main CQL grammar.
> 
> If we want to do this in non-expr CQL space, I think using functions 
> (ignoring the implementation complexity) at least removes ambiguity. 
> "token_match", "phrase_match", "token_like", "=", and "LIKE" would all be 
> pretty clear, although there may be other problems. For instance, what 
> happens when I try to use "token_match" on an indexed field whose analyzer 
> does not tokenize? We obviously can't use the index, so we'd be reduced to 
> requiring a filtering query, but maybe that's fine. My point is that, if 
> we're going to make write and read analyzers symmetrical, there's really no 
> way to make the semantics of our queries totally independent of analysis. 
> (ex. "field : foo bar" behaves differently w/ read tokenization than it does 
> without. It could even be an OR or AND query w/ tokenization, depending on 
> our defaults.)
> 
> On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <a...@apache.org> wrote:
>> Why not start with SQLish operators supported by many databases (LIKE and 
>> CONTAINS)?
>> 
>> On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com> 
>> wrote:
>>> 
>>> I am also -1 on directly exposing lucene like syntax here. Besides being 
>>> ugly, SAI is not lucene, I do not think we should start using lucene syntax 
>>> for it, it will make people think they can do everything else lucene allows.
>>> 
>>>> On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote:
>>>> 
>>>> 
>>>> I’m strongly opposed to : 
>>>> 
>>>> It is very dissimilar to our current operators. CQL is already not the 
>>>> prettiest language, but let’s not make it a total mish mash.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote:
>>>>> 
>>>>> I am also in agreement with 'column : token' in that 'I don't hate it' 
>>>>> but I'd like to offer an alternative to this in 'column HAS token'. HAS 
>>>>> is currently not a keyword that we use so wouldn't cause any brain 
>>>>> conflicts.
>>>>> 
>>>>> While I don't hate ':' I have a particular dislike of the lucene search 
>>>>> syntax because of its terseness and lack of easy readability. 
>>>>> 
>>>>> Saying that, I'm happy to do with ':' if that is the decision. 
>>>>> 
>>>>> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> 
>>>>> wrote:
>>>>>> Assuming SAI is a superset of SASI, and we were to set up something so 
>>>>>> that SASI indexes auto convert to SAI, this gives even more weight to my 
>>>>>> point regarding how differing behavior for the same syntax can lead to 
>>>>>> issues.  Imo the best case scenario results in the user not even 
>>>>>> noticing their indexes have changed.
>>>>>> 
>>>>>> An (maybe better?) alternative is to add a flag to the index 
>>>>>> configuration for "compatibility mod", which might address the concerns 
>>>>>> around using an equality operator when it actually is a partial match.
>>>>>> 
>>>>>> For what it's worth, I'm in agreement that = should mean full equality 
>>>>>> and not token match.
>>>>>> 
>>>>>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>>>>>> > For what it's worth, I'd very much like to completely remove SASI from 
>>>>>> > the
>>>>>> > codebase for 6.0. The only remaining functionality gaps at the moment 
>>>>>> > are
>>>>>> > LIKE (prefix/suffix) queries and its limited tokenization
>>>>>> > capabilities, both of which already have SAI Phase 2 Jiras.
>>>>>> >
>>>>>> > On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > SASI just uses “=“ for the tokenized equality matching, which is the 
>>>>>> > > exact
>>>>>> > > thing this discussion is about changing/not liking.
>>>>>> > >
>>>>>> > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan 
>>>>>> > > > <jeremiah.jor...@gmail.com>
>>>>>> > > wrote:
>>>>>> > > >
>>>>>> > > > I do not think LIKE actually applies here. LIKE is used for 
>>>>>> > > > prefix,
>>>>>> > > contains, or suffix searches in SASI depending on the index type.
>>>>>> > > >
>>>>>> > > > This is about exact matching of tokens.
>>>>>> > > >
>>>>>> > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad 
>>>>>> > > >> <rustyrazorbl...@apache.org>
>>>>>> > > wrote:
>>>>>> > > >>
>>>>>> > > >> Certain bits of functionality also already exist on the SASI 
>>>>>> > > >> side of
>>>>>> > > things, but I'm not sure how much overlap there is.  Currently, 
>>>>>> > > there's a
>>>>>> > > LIKE keyword that handles token matching, although it seems to have 
>>>>>> > > some
>>>>>> > > differences from the feature set in SAI.
>>>>>> > > >>
>>>>>> > > >> That said, there seems to be enough of an overlap that it would 
>>>>>> > > >> make
>>>>>> > > sense to consider using LIKE in the same manner, doesn't it?  I 
>>>>>> > > think it
>>>>>> > > would be a little odd if we have different syntax for different 
>>>>>> > > indexes.
>>>>>> > > >>
>>>>>> > > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>>>>>> > > >>
>>>>>> > > >> I think one complication here is that there seems to be a desire, 
>>>>>> > > >> that
>>>>>> > > I very much agree with, to expose as much of the underlying 
>>>>>> > > flexibility of
>>>>>> > > Lucene as much as possible.  If it means we use Caleb's suggestion, 
>>>>>> > > I'd ask
>>>>>> > > that the queries that SASI and SAI both support use the same syntax, 
>>>>>> > > even
>>>>>> > > if it means there's two ways of writing the same query.  To use 
>>>>>> > > Caleb's
>>>>>> > > example, this would mean supporting both LIKE and the `expr` column.
>>>>>> > > >>
>>>>>> > > >> Jon
>>>>>> > > >>
>>>>>> > > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>>>>> > > >>> Here are some additional bits of prior art, if anyone finds them
>>>>>> > > useful:
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>> The Stratio Lucene Index -
>>>>>> > > >>> https://github.com/Stratio/cassandra-lucene-index#examples
>>>>>> > > >>>
>>>>>> > > >>> Stratio was the reason C* added the "expr" functionality. They 
>>>>>> > > >>> embedded
>>>>>> > > >>> something similar to ElasticSearch JSON, which probably isn't my
>>>>>> > > favorite
>>>>>> > > >>> choice, but it's there.
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>> The ElasticSearch match query syntax -
>>>>>> > > >>>
>>>>>> > > https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$
>>>>>> > > >>>
>>>>>> > > >>> Again, not my favorite. It's verbose, and probably too powerful 
>>>>>> > > >>> for us.
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>> ElasticSearch's documentation for the basic Lucene query syntax -
>>>>>> > > >>>
>>>>>> > > https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$
>>>>>> > > >>>
>>>>>> > > >>> One idea is to take the basic Lucene index, which it seems we 
>>>>>> > > >>> already
>>>>>> > > have
>>>>>> > > >>> some support for, and feed it to "expr". This is nice for two 
>>>>>> > > >>> reasons:
>>>>>> > > >>>
>>>>>> > > >>> 1.) People can just write Lucene queries if they already know 
>>>>>> > > >>> how.
>>>>>> > > >>> 2.) No changes to the grammar.
>>>>>> > > >>>
>>>>>> > > >>> Lucene has distinct concepts of filtering and querying, and this 
>>>>>> > > >>> is
>>>>>> > > kind of
>>>>>> > > >>> the latter. I'm not sure how, for example, we would want "expr" 
>>>>>> > > >>> to
>>>>>> > > interact
>>>>>> > > >>> w/ filters on other column indexes in vanilla CQL space...
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie 
>>>>>> > > >>>> <jmcken...@apache.org>
>>>>>> > > wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and 
>>>>>> > > >>>> Python for
>>>>>> > > >>>> substring searches, so at least some users will be surprised by
>>>>>> > > term-based
>>>>>> > > >>>> behavior.
>>>>>> > > >>>>
>>>>>> > > >>>> I wonder whether users are in their "programming language" 
>>>>>> > > >>>> headspace
>>>>>> > > or in
>>>>>> > > >>>> their "querying a database" headspace when interacting with 
>>>>>> > > >>>> CQL? i.e.
>>>>>> > > this
>>>>>> > > >>>> would only present confusion if we expected users to be 
>>>>>> > > >>>> thinking in
>>>>>> > > the
>>>>>> > > >>>> idioms of their respective programming languages. If they're 
>>>>>> > > >>>> thinking
>>>>>> > > in
>>>>>> > > >>>> terms of SQL, MATCHES would probably end up confusing them a bit
>>>>>> > > since it
>>>>>> > > >>>> doesn't match the general structure of the MATCH operator.
>>>>>> > > >>>>
>>>>>> > > >>>> That said, I also think CONTAINS loses something important that 
>>>>>> > > >>>> you
>>>>>> > > allude
>>>>>> > > >>>> to here Jonathan:
>>>>>> > > >>>>
>>>>>> > > >>>> with corresponding query-time tokenization and analysis.  This 
>>>>>> > > >>>> means
>>>>>> > > that
>>>>>> > > >>>> the query term is not always a substring of the original string!
>>>>>> > > Besides
>>>>>> > > >>>> obvious transformations like lowercasing, you have things like
>>>>>> > > >>>> PhoneticFilter available as well.
>>>>>> > > >>>>
>>>>>> > > >>>> So to me, neither MATCHES nor CONTAINS are particularly great
>>>>>> > > candidates.
>>>>>> > > >>>>
>>>>>> > > >>>> So +1 to the "I don't actually hate it" sentiment on:
>>>>>> > > >>>>
>>>>>> > > >>>> column : term`. Inspired by Lucene’s syntax
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>> I have a strong preference not to use the name of an SQL 
>>>>>> > > >>>> operator,
>>>>>> > > since
>>>>>> > > >>>> it precludes us later providing the SQL standard operator to 
>>>>>> > > >>>> users.
>>>>>> > > >>>>
>>>>>> > > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña 
>>>>>> > > >>>>> <adelap...@apache.org>
>>>>>> > > wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> 
>>>>>> > > >>>> `column = term` is definitively problematic because it creates 
>>>>>> > > >>>> an
>>>>>> > > >>>> ambiguity when the queried column belongs to the primary key. 
>>>>>> > > >>>> For some
>>>>>> > > >>>> queries we wouldn't know whether the user wants a primary key 
>>>>>> > > >>>> query
>>>>>> > > using
>>>>>> > > >>>> regular equality or an index query using the analyzer.
>>>>>> > > >>>>
>>>>>> > > >>>> `term_matches(column, term)` seems quite clear and hard to
>>>>>> > > misinterpret,
>>>>>> > > >>>> but it's quite long to write and its implementation will be
>>>>>> > > challenging
>>>>>> > > >>>> since we would need a bunch of special casing around 
>>>>>> > > >>>> SelectStatement
>>>>>> > > and
>>>>>> > > >>>> functions.
>>>>>> > > >>>>
>>>>>> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they 
>>>>>> > > >>>> seem
>>>>>> > > to
>>>>>> > > >>>> evoke different behaviours to what they would have.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE :term:` seems a bit redundant compared to just 
>>>>>> > > >>>> using
>>>>>> > > `column
>>>>>> > > >>>> : term`, and we are still introducing a new symbol.
>>>>>> > > >>>>
>>>>>> > > >>>> I think I like `column : term` the most, because it's brief, 
>>>>>> > > >>>> it's
>>>>>> > > similar
>>>>>> > > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash 
>>>>>> > > >>>> with
>>>>>> > > other
>>>>>> > > >>>> different meanings that I can think of.
>>>>>> > > >>>>
>>>>>> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis 
>>>>>> > > >>>>> <jbel...@gmail.com>
>>>>>> > > wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> Hi all,
>>>>>> > > >>>>
>>>>>> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball 
>>>>>> > > >>>> rolling on
>>>>>> > > >>>> aligning around phase 2 features.
>>>>>> > > >>>>
>>>>>> > > >>>> In particular, we need to nail down the syntax for doing 
>>>>>> > > >>>> non-exact
>>>>>> > > string
>>>>>> > > >>>> matches.  We have a proof of concept that includes full Lucene
>>>>>> > > analyzer and
>>>>>> > > >>>> filter functionality – just the text transformation pieces, 
>>>>>> > > >>>> none of
>>>>>> > > the
>>>>>> > > >>>> storage parts – which is the gold standard in this space.  For
>>>>>> > > example, the
>>>>>> > > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords
>>>>>> > > (common
>>>>>> > > >>>> words like “a”, “is”, “the” that are usually not useful to 
>>>>>> > > >>>> search
>>>>>> > > >>>> against).  Lucene also has classes that offer stemming, special 
>>>>>> > > >>>> case
>>>>>> > > >>>> handling for email, and many languages besides English [2].
>>>>>> > > >>>>
>>>>>> > > >>>> What syntax should we use to express “rows whose analyzed 
>>>>>> > > >>>> tokens match
>>>>>> > > >>>> this search term?”
>>>>>> > > >>>>
>>>>>> > > >>>> The syntax must be clear that we want to look for this term 
>>>>>> > > >>>> within the
>>>>>> > > >>>> column data using the configured index with corresponding 
>>>>>> > > >>>> query-time
>>>>>> > > >>>> tokenization and analysis.  This means that the query term is 
>>>>>> > > >>>> not
>>>>>> > > always a
>>>>>> > > >>>> substring of the original string!  Besides obvious 
>>>>>> > > >>>> transformations
>>>>>> > > like
>>>>>> > > >>>> lowercasing, you have things like PhoneticFilter available as 
>>>>>> > > >>>> well.
>>>>>> > > >>>>
>>>>>> > > >>>> Here are my thoughts on some of the options:
>>>>>> > > >>>>
>>>>>> > > >>>> `column = term`.  This is what the POC does today and it’s super
>>>>>> > > confusing
>>>>>> > > >>>> to overload = to mean something other than exact equality.  I 
>>>>>> > > >>>> am not
>>>>>> > > a fan.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL 
>>>>>> > > >>>> operator,
>>>>>> > > but
>>>>>> > > >>>> neither the wildcarded nor unwildcarded syntax matches the 
>>>>>> > > >>>> semantics
>>>>>> > > of
>>>>>> > > >>>> term-based search.
>>>>>> > > >>>>
>>>>>> > > >>>> `column MATCHES term`. I rather like this one, although Mike 
>>>>>> > > >>>> points
>>>>>> > > out
>>>>>> > > >>>> that “match” has a meaning in the context of regular 
>>>>>> > > >>>> expressions that
>>>>>> > > could
>>>>>> > > >>>> cause confusion here.
>>>>>> > > >>>>
>>>>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and 
>>>>>> > > >>>> Python for
>>>>>> > > >>>> substring searches, so at least some users will be surprised by
>>>>>> > > term-based
>>>>>> > > >>>> behavior.
>>>>>> > > >>>>
>>>>>> > > >>>> `term_matches(column, term)`. Postgresql FTS makes you use 
>>>>>> > > >>>> functions
>>>>>> > > like
>>>>>> > > >>>> this for everything.  It’s pretty clunky, and we would need to 
>>>>>> > > >>>> make
>>>>>> > > the
>>>>>> > > >>>> amazingly hairy SelectStatement even hairier to handle “use a 
>>>>>> > > >>>> function
>>>>>> > > >>>> result in a predicate” like this.
>>>>>> > > >>>>
>>>>>> > > >>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually 
>>>>>> > > >>>> hate
>>>>>> > > it.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new
>>>>>> > > symbol to
>>>>>> > > >>>> indicate term matching.  Arguably more SQL-ish than a new bare 
>>>>>> > > >>>> symbol
>>>>>> > > >>>> operator.
>>>>>> > > >>>>
>>>>>> > > >>>> [1]
>>>>>> > > >>>>
>>>>>> > > https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>>>>> > > >>>> [2] 
>>>>>> > > >>>> https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>>>>> > > >>>>
>>>>>> > > >>>> --
>>>>>> > > >>>> Jonathan Ellis
>>>>>> > > >>>> co-founder, http://www.datastax.com
>>>>>> > > >>>> @spyced
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>
>>>>>> > >
>>>>>> >
>>>>> 
>>>>> 
>>>>> --
>>>>> DataStax Logo Square <https://www.datastax.com/>
>>>>> *Mike Adamson*
>>>>> Engineering
>>>>> +1 650 389 6000 <tel:16503896000> | datastax.com 
>>>>> <https://www.datastax.com/>
>>>>> Find DataStax Online:
>>>>> LinkedIn Logo 
>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=akx0E6l2bnTjOvA-YxtonbW0M4b6bNg4nRwmcHNDo4Q&e=>
>>>>>    Facebook Logo 
>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.facebook.com_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=ncMlB41-6hHuqx-EhnM83-KVtjMegQ9c2l2zDzHAxiU&e=>
>>>>>    Twitter Logo <https://twitter.com/DataStax>   RSS Feed 
>>>>> <https://www.datastax.com/blog/rss.xml>   Github Logo 
>>>>> <https://github.com/datastax>
>> 
>> 
>> --
>> Regards,
>> Atri
>> Apache Concerted

Re: Tokenization and SAI query syntax

Reply via email to