Been chatting a bit w/Caleb about this offline and poking around to better educate myself.
> using functions (ignoring the implementation complexity) at least removes
> ambiguity.
This, plus using functions lets us kick the can down the road a bit in terms
of landing on an integrated grammar we agree on.

It seems to me there's a tension between:
1. "SQL-like" (i.e. postgres-like)
2. "Indexing and Search domain-specific-like" (i.e. lucene syntax which, as
   Benedict points out, doesn't really jell w/what we have in CQL at this
   point), and
3. ??? Some other YOLO CQL / C* specific thing where we go our own road

I don't think we're really going to know what our feature-set in terms of
indexing is going to look like or the shape it's going to take for a while,
so backing ourselves into any of the 3 corners above right now feels very
premature to me.

So I'm coming around to the expr / method call approach to preserve that
flexibility. It's maximally explicit and preserves optionality at the expense
of being clunky. For now.

On Mon, Aug 7, 2023, at 4:00 PM, Caleb Rackliffe wrote:
> > I do not think we should start using lucene syntax for it, it will make
> > people think they can do everything else lucene allows.
>
> I'm sure we won't be supporting everything Lucene allows, but this is going
> to evolve. Right off the bat, if you introduce support for tokenization and
> filtering, someone is, for example, going to ask for phrase queries. ("John
> Smith landed in Virginia" is tokenized, but someone wants to match exactly on
> "John Smith".) The whole point of the Vector project is to do relevance,
> right? Are we going to do term boosting? Do we need queries like "field:
> quick brown +fox -news" where fox must be present, news cannot be present,
> and quick and brown increase relevance?
>
> SASI uses "=" and "LIKE" in a way that assumes the user understands the
> tokenization scheme in use on the target field. I understand that's a bit
> ambiguous.
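
Inline, to make the phrase-vs-token distinction concrete: a toy Python sketch, purely illustrative. Nothing here is SAI or Lucene code. `analyze` is a made-up stand-in for an analyzer (lowercase, then split into word tokens), `token_match` and `phrase_match` only borrow function names floated in this thread, and the AND semantics for multi-token queries is an assumption (it could just as well default to OR). The point is only that the three predicates disagree on the same stored value.

```python
import re


def analyze(text):
    """Hypothetical analyzer: lowercase, then split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_equals(stored, query):
    """Plain equality on the raw value, with no analysis at all."""
    return stored == query


def token_match(stored, query):
    """True if every analyzed query token occurs in the stored value (AND assumed)."""
    stored_tokens = set(analyze(stored))
    return all(tok in stored_tokens for tok in analyze(query))


def phrase_match(stored, query):
    """True if the analyzed query tokens occur contiguously and in order."""
    haystack, needle = analyze(stored), analyze(query)
    if not needle:
        return False
    return any(
        haystack[i:i + len(needle)] == needle
        for i in range(len(haystack) - len(needle) + 1)
    )


stored = "John Smith landed in Virginia"
results = {
    "exact  'John Smith'": exact_equals(stored, "John Smith"),
    "token  'smith john'": token_match(stored, "smith john"),
    "phrase 'smith john'": phrase_match(stored, "smith john"),
    "phrase 'John Smith'": phrase_match(stored, "John Smith"),
}
for label, value in results.items():
    print(label, "->", value)
```

On the example from the thread, exact equality and the reordered phrase come back False, while the token match and the in-order phrase come back True. That's three different answers hiding behind what SASI spells as "=".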
>
> If we object to allowing expr embedding of a subset of the Lucene syntax, I
> can't imagine we're okay w/ then jamming a subset of that syntax into the
> main CQL grammar.
>
> If we want to do this in non-expr CQL space, I think using functions
> (ignoring the implementation complexity) at least removes ambiguity.
> "token_match", "phrase_match", "token_like", "=", and "LIKE" would all be
> pretty clear, although there may be other problems. For instance, what
> happens when I try to use "token_match" on an indexed field whose analyzer
> does not tokenize? We obviously can't use the index, so we'd be reduced to
> requiring a filtering query, but maybe that's fine. My point is that, if
> we're going to make write and read analyzers symmetrical, there's really no
> way to make the semantics of our queries totally independent of analysis.
> (ex. "field : foo bar" behaves differently w/ read tokenization than it does
> without. It could even be an OR or AND query w/ tokenization, depending on
> our defaults.)
>
> On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <a...@apache.org> wrote:
>> Why not start with SQLish operators supported by many databases (LIKE and
>> CONTAINS)?
>>
>> On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>
>>> I am also -1 on directly exposing lucene like syntax here. Besides being
>>> ugly, SAI is not lucene, I do not think we should start using lucene syntax
>>> for it, it will make people think they can do everything else lucene allows.
>>>
>>>> On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote:
>>>>
>>>> I’m strongly opposed to :
>>>>
>>>> It is very dissimilar to our current operators. CQL is already not the
>>>> prettiest language, but let’s not make it a total mish mash.
>>>>
>>>>> On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote:
>>>>>
>>>>> I am also in agreement with 'column : token' in that 'I don't hate it'
>>>>> but I'd like to offer an alternative to this in 'column HAS token'. HAS
>>>>> is currently not a keyword that we use so wouldn't cause any brain
>>>>> conflicts.
>>>>>
>>>>> While I don't hate ':' I have a particular dislike of the lucene search
>>>>> syntax because of its terseness and lack of easy readability.
>>>>>
>>>>> Saying that, I'm happy to go with ':' if that is the decision.
>>>>>
>>>>> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>> Assuming SAI is a superset of SASI, and we were to set up something so
>>>>>> that SASI indexes auto convert to SAI, this gives even more weight to my
>>>>>> point regarding how differing behavior for the same syntax can lead to
>>>>>> issues. Imo the best case scenario results in the user not even
>>>>>> noticing their indexes have changed.
>>>>>>
>>>>>> An (maybe better?) alternative is to add a flag to the index
>>>>>> configuration for "compatibility mode", which might address the concerns
>>>>>> around using an equality operator when it actually is a partial match.
>>>>>>
>>>>>> For what it's worth, I'm in agreement that = should mean full equality
>>>>>> and not token match.
>>>>>>
>>>>>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>>>>>> > For what it's worth, I'd very much like to completely remove SASI from the
>>>>>> > codebase for 6.0. The only remaining functionality gaps at the moment are
>>>>>> > LIKE (prefix/suffix) queries and its limited tokenization
>>>>>> > capabilities, both of which already have SAI Phase 2 Jiras.
>>>>>> >
>>>>>> > On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com> wrote:
>>>>>> >
>>>>>> > > SASI just uses “=” for the tokenized equality matching, which is the exact
>>>>>> > > thing this discussion is about changing/not liking.
>>>>>> > >
>>>>>> > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>> > > >
>>>>>> > > > I do not think LIKE actually applies here. LIKE is used for prefix,
>>>>>> > > > contains, or suffix searches in SASI depending on the index type.
>>>>>> > > >
>>>>>> > > > This is about exact matching of tokens.
>>>>>> > > >
>>>>>> > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>> > > >>
>>>>>> > > >> Certain bits of functionality also already exist on the SASI side of
>>>>>> > > >> things, but I'm not sure how much overlap there is. Currently, there's a
>>>>>> > > >> LIKE keyword that handles token matching, although it seems to have some
>>>>>> > > >> differences from the feature set in SAI.
>>>>>> > > >>
>>>>>> > > >> That said, there seems to be enough of an overlap that it would make
>>>>>> > > >> sense to consider using LIKE in the same manner, doesn't it? I think it
>>>>>> > > >> would be a little odd if we have different syntax for different indexes.
>>>>>> > > >>
>>>>>> > > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>>>>>> > > >>
>>>>>> > > >> I think one complication here is that there seems to be a desire, that
>>>>>> > > >> I very much agree with, to expose as much of the underlying flexibility of
>>>>>> > > >> Lucene as much as possible. If it means we use Caleb's suggestion, I'd ask
>>>>>> > > >> that the queries that SASI and SAI both support use the same syntax, even
>>>>>> > > >> if it means there's two ways of writing the same query.
>>>>>> > > >> To use Caleb's example, this would mean supporting both LIKE and the
>>>>>> > > >> `expr` column.
>>>>>> > > >>
>>>>>> > > >> Jon
>>>>>> > > >>
>>>>>> > > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>>>>> > > >>> Here are some additional bits of prior art, if anyone finds them useful:
>>>>>> > > >>>
>>>>>> > > >>> The Stratio Lucene Index -
>>>>>> > > >>> https://github.com/Stratio/cassandra-lucene-index#examples
>>>>>> > > >>>
>>>>>> > > >>> Stratio was the reason C* added the "expr" functionality. They embedded
>>>>>> > > >>> something similar to ElasticSearch JSON, which probably isn't my favorite
>>>>>> > > >>> choice, but it's there.
>>>>>> > > >>>
>>>>>> > > >>> The ElasticSearch match query syntax -
>>>>>> > > >>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
>>>>>> > > >>>
>>>>>> > > >>> Again, not my favorite. It's verbose, and probably too powerful for us.
>>>>>> > > >>>
>>>>>> > > >>> ElasticSearch's documentation for the basic Lucene query syntax -
>>>>>> > > >>> https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
>>>>>> > > >>>
>>>>>> > > >>> One idea is to take the basic Lucene index, which it seems we already have
>>>>>> > > >>> some support for, and feed it to "expr". This is nice for two reasons:
>>>>>> > > >>>
>>>>>> > > >>> 1.) People can just write Lucene queries if they already know how.
>>>>>> > > >>> 2.) No changes to the grammar.
>>>>>> > > >>>
>>>>>> > > >>> Lucene has distinct concepts of filtering and querying, and this is kind of
>>>>>> > > >>> the latter. I'm not sure how, for example, we would want "expr" to interact
>>>>>> > > >>> w/ filters on other column indexes in vanilla CQL space...
>>>>>> > > >>>
>>>>>> > > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>>>>> > > >>>> substring searches, so at least some users will be surprised by term-based
>>>>>> > > >>>> behavior.
>>>>>> > > >>>>
>>>>>> > > >>>> I wonder whether users are in their "programming language" headspace or in
>>>>>> > > >>>> their "querying a database" headspace when interacting with CQL? i.e. this
>>>>>> > > >>>> would only present confusion if we expected users to be thinking in the
>>>>>> > > >>>> idioms of their respective programming languages. If they're thinking in
>>>>>> > > >>>> terms of SQL, MATCHES would probably end up confusing them a bit since it
>>>>>> > > >>>> doesn't match the general structure of the MATCH operator.
>>>>>> > > >>>>
>>>>>> > > >>>> That said, I also think CONTAINS loses something important that you allude
>>>>>> > > >>>> to here Jonathan:
>>>>>> > > >>>>
>>>>>> > > >>>> with corresponding query-time tokenization and analysis. This means that
>>>>>> > > >>>> the query term is not always a substring of the original string! Besides
>>>>>> > > >>>> obvious transformations like lowercasing, you have things like
>>>>>> > > >>>> PhoneticFilter available as well.
>>>>>> > > >>>>
>>>>>> > > >>>> So to me, neither MATCHES nor CONTAINS are particularly great candidates.
>>>>>> > > >>>>
>>>>>> > > >>>> So +1 to the "I don't actually hate it" sentiment on:
>>>>>> > > >>>>
>>>>>> > > >>>> `column : term`. Inspired by Lucene’s syntax
>>>>>> > > >>>>
>>>>>> > > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> I have a strong preference not to use the name of an SQL operator, since
>>>>>> > > >>>> it precludes us later providing the SQL standard operator to users.
>>>>>> > > >>>>
>>>>>> > > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>>>>> > > >>>>
>>>>>> > > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org> wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> `column = term` is definitively problematic because it creates an
>>>>>> > > >>>> ambiguity when the queried column belongs to the primary key. For some
>>>>>> > > >>>> queries we wouldn't know whether the user wants a primary key query using
>>>>>> > > >>>> regular equality or an index query using the analyzer.
>>>>>> > > >>>>
>>>>>> > > >>>> `term_matches(column, term)` seems quite clear and hard to misinterpret,
>>>>>> > > >>>> but it's quite long to write and its implementation will be challenging
>>>>>> > > >>>> since we would need a bunch of special casing around SelectStatement and
>>>>>> > > >>>> functions.
>>>>>> > > >>>>
>>>>>> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
>>>>>> > > >>>> evoke different behaviours to what they would have.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE :term:` seems a bit redundant compared to just using
>>>>>> > > >>>> `column : term`, and we are still introducing a new symbol.
>>>>>> > > >>>>
>>>>>> > > >>>> I think I like `column : term` the most, because it's brief, it's similar
>>>>>> > > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash with other
>>>>>> > > >>>> different meanings that I can think of.
>>>>>> > > >>>>
>>>>>> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> Hi all,
>>>>>> > > >>>>
>>>>>> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
>>>>>> > > >>>> aligning around phase 2 features.
>>>>>> > > >>>>
>>>>>> > > >>>> In particular, we need to nail down the syntax for doing non-exact string
>>>>>> > > >>>> matches. We have a proof of concept that includes full Lucene analyzer and
>>>>>> > > >>>> filter functionality – just the text transformation pieces, none of the
>>>>>> > > >>>> storage parts – which is the gold standard in this space. For example, the
>>>>>> > > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
>>>>>> > > >>>> words like “a”, “is”, “the” that are usually not useful to search
>>>>>> > > >>>> against). Lucene also has classes that offer stemming, special case
>>>>>> > > >>>> handling for email, and many languages besides English [2].
>>>>>> > > >>>>
>>>>>> > > >>>> What syntax should we use to express “rows whose analyzed tokens match
>>>>>> > > >>>> this search term?”
>>>>>> > > >>>>
>>>>>> > > >>>> The syntax must be clear that we want to look for this term within the
>>>>>> > > >>>> column data using the configured index with corresponding query-time
>>>>>> > > >>>> tokenization and analysis. This means that the query term is not always a
>>>>>> > > >>>> substring of the original string! Besides obvious transformations like
>>>>>> > > >>>> lowercasing, you have things like PhoneticFilter available as well.
>>>>>> > > >>>>
>>>>>> > > >>>> Here are my thoughts on some of the options:
>>>>>> > > >>>>
>>>>>> > > >>>> `column = term`. This is what the POC does today and it’s super confusing
>>>>>> > > >>>> to overload = to mean something other than exact equality. I am not a fan.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
>>>>>> > > >>>> neither the wildcarded nor unwildcarded syntax matches the semantics of
>>>>>> > > >>>> term-based search.
>>>>>> > > >>>>
>>>>>> > > >>>> `column MATCHES term`. I rather like this one, although Mike points out
>>>>>> > > >>>> that “match” has a meaning in the context of regular expressions that could
>>>>>> > > >>>> cause confusion here.
>>>>>> > > >>>>
>>>>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>>>>> > > >>>> substring searches, so at least some users will be surprised by term-based
>>>>>> > > >>>> behavior.
>>>>>> > > >>>>
>>>>>> > > >>>> `term_matches(column, term)`.
>>>>>> > > >>>> Postgresql FTS makes you use functions like this for everything. It’s
>>>>>> > > >>>> pretty clunky, and we would need to make the amazingly hairy
>>>>>> > > >>>> SelectStatement even hairier to handle “use a function result in a
>>>>>> > > >>>> predicate” like this.
>>>>>> > > >>>>
>>>>>> > > >>>> `column : term`. Inspired by Lucene’s syntax. I don’t actually hate it.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
>>>>>> > > >>>> indicate term matching. Arguably more SQL-ish than a new bare symbol
>>>>>> > > >>>> operator.
>>>>>> > > >>>>
>>>>>> > > >>>> [1] https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>>>>> > > >>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>>>>> > > >>>>
>>>>>> > > >>>> --
>>>>>> > > >>>> Jonathan Ellis
>>>>>> > > >>>> co-founder, http://www.datastax.com
>>>>>> > > >>>> @spyced
>>>>>> > > >>>>
>>>>>
>>>>> --
>>>>> *Mike Adamson*
>>>>> Engineering
>>
>> --
>> Regards,
>> Atri
>> Apache Concerted