Ok I get around the issue about %w%a%

So this will be interpreter first by the CQL parser as LIKE CONTAINS with
searched term = w%a

And then things get complicated

1) if you're using NonTokeninzingAnalyzer or NoOpAnalyzer, everything is
fine, the % in 'w%a' is interpreted as simple literal and not wildcard
character

2) if you're using StandardAnalyzer, it's an entirely different story.
During the parsing of the search predicates by the query planer, the term
'w%a' is passed to the analyzer (StandardAnalyzer here):
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java#L303-L323

The StandardAnalyzer is tokenizing the search term so 'w%a' becomes 2
distinct token, 'w' OR 'a'. Why does it ignore the % ? Because according to
Unicode line breaking rule, % is a separator, read here:
http://www.unicode.org/Public/UNIDATA/LineBreak.txt

Nowhere in the source code we can see this, in fact you'll need to look
into the JFlex grammar file
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/analyzer/StandardTokenizerImpl.jflex
to see a reference to Unicode word breaking rules ...

So indeed when using StandardAnalyzer, any % character will be interpreter
as a separator so our LIKE '%w%a%' is indeed transformed into a LIKE '%w%'
OR LIKE '%a%' e.g all words containing 'w' OR 'a', irrespective of their
relative position to each other ...

Why is it an OR predicate and not an AND predicate ? The answer is a
comment in the source code here:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java#L290-L295

I'll end by a famous sentence : "It is not a bug, it is a feature"  :D

On Thu, Sep 15, 2016 at 4:11 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> Currently SASI can only understand the % for the beginning (suffix) or
> ending (prefix) position.
>
> Any expression containing the % in the middle like %w%a% will not be
> interpreter by SASI as wildcard.
>
> %w%a% will translate into "Give me all results containing w%a
>
> On Thu, Sep 15, 2016 at 3:58 PM, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
>> Thank you for the investigation. Will wait for a fix and news.
>>
>> Probably it’s not a directly related question but what do you think about
>> CASSANDRA-12573? Let me know if it’s better to create a separate thread for
>> it.
>>
>> Thanks,
>> Mikhail
>>
>>
>> On 15 Sep 2016, at 16:02, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> Ok so I've found the source of the issue, it's pretty well hidden because
>> it is NOT in the SASI source code directly.
>>
>> Here is the method where C* determines what kind of LIKE expression
>> you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>>
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/
>> apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>>
>> As you can see, it's pretty simple, maybe too simple. Indeed, they forget
>> to remove escape character BEFORE doing the matching so if your search is 
>> LIKE
>> '%%esc%', the detected expression is LIKE_CONTAINS.
>>
>> A possible fix would be:
>>
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>> the column data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced
>> regex to exclude the %% from parsing e.g
>>
>> Step 2) is dead easy but step 1) is harder because I don't know if
>> converting the bytebuffer into String at this stage of the CQL parser is
>> expensive or not (in term of computation)
>>
>> Let me try a patch
>>
>>
>>
>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> Ok you're right, I get your point
>>>
>>> LIKE '%%esc%' --> startWith('%esc')
>>>
>>> LIKE 'escape%%' -->  = 'escape%'
>>>
>>> What I strongly suspect is that in the source code of SASI, we parse the
>>> % xxx % expression BEFORE applying escape. That will explain the observed
>>> behavior. E.g:
>>>
>>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>>
>>> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>>>
>>> Let me check in the source code and try to reproduce the issue
>>>
>>>
>>>
>>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
>>> mikhail.krupits...@jetbrains.com> wrote:
>>>
>>>> Looks like we have different understanding of what results are expected.
>>>> I based my understanding on http://docs.datastax.com/en
>>>> /cql/3.3/cql/cql_using/useSASIIndex.html
>>>> According to the doc ‘esc’ is a pattern for exact match and I guess
>>>> that there is no semantical difference between two LIKE patterns (both of
>>>> patterns should be treated as ‘exact match'): ‘%%esc’ and ‘esc’.
>>>>
>>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>>>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>>>> *%esc*
>>>>
>>>> Why ‘containing’? I expect that it should be ’starting’..
>>>>
>>>>
>>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>>>> *starting* with 'escape%' so *escape%*me is a valid result and also
>>>> *escape%*esc
>>>>
>>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>>
>>>> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but it
>>>> returns nothing (CASSANDRA-12573).
>>>>
>>>> What I’m missing?
>>>>
>>>> Thanks,
>>>> Mikhail
>>>>
>>>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>
>>>> CREATE CUSTOM INDEX ON test.escape(val) USING '
>>>> org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode':
>>>> 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sa
>>>> si.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
>>>>
>>>> I don't see any problem in the results you got
>>>>
>>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>>>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>>>> *%esc*
>>>>
>>>> Why ‘containing’? I expect that it should be ’starting’..
>>>>
>>>>
>>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>>>> *starting* with 'escape%' so *escape%*me is a valid result and also
>>>> *escape%*esc
>>>>
>>>> Why ’starting’? I expect that it should be ‘exact matching’.
>>>>
>>>>
>>>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy <
>>>> mikhail.krupits...@jetbrains.com> wrote:
>>>>
>>>>> Thanks for the reply.
>>>>> Could you please provide what index definition did you use?
>>>>> With the index from my script I get the following results:
>>>>>
>>>>> cqlsh:test> select * from escape;
>>>>>
>>>>>  id | val
>>>>> ----+-----------
>>>>>   1 | %escapeme
>>>>>   2 | escape%me
>>>>> *  3 | escape%esc*
>>>>>
>>>>> Contains search
>>>>>
>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>>>
>>>>>  id | val
>>>>> ----+-----------
>>>>>   1 | %escapeme
>>>>>   3
>>>>> * | escape%esc*(2 rows)
>>>>>
>>>>>
>>>>> Prefix search
>>>>>
>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>>>
>>>>>  id | val
>>>>> ----+-----------
>>>>>   2 | escape%me
>>>>>   3
>>>>> * | escape%esc*
>>>>>
>>>>> Thanks,
>>>>> Mikhail
>>>>>
>>>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>
>>>>> Use % to escape %
>>>>>
>>>>> cqlsh:test> select * from escape;
>>>>>
>>>>>  id | val
>>>>> ----+-----------
>>>>>   1 | %escapeme
>>>>>   2 | escape%me
>>>>>
>>>>>
>>>>> Contains search
>>>>>
>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>>>
>>>>>  id | val
>>>>> ----+-----------
>>>>>   1 | %escapeme
>>>>>
>>>>> (1 rows)
>>>>>
>>>>>
>>>>> Prefix search
>>>>>
>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>>>
>>>>>  id | val
>>>>> ----+-----------
>>>>>   2 | escape%me
>>>>>
>>>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy <
>>>>> mikhail.krupits...@jetbrains.com> wrote:
>>>>>
>>>>>> Hi Cassandra guys,
>>>>>>
>>>>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a simple char in
>>>>>> a search pattern.
>>>>>> Here is my test script:
>>>>>>
>>>>>> DROP keyspace if exists kmv;
>>>>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class' :
>>>>>> 'SimpleStrategy', 'replication_factor':'1'} ;
>>>>>> USE kmv;
>>>>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY
>>>>>> KEY(id, c1));
>>>>>> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING '
>>>>>> org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>>>>>> 'analyzed' : 'true',
>>>>>> 'analyzer_class' : 'org.apache.cassandra.index.sa
>>>>>> si.analyzer.NonTokenizingAnalyzer',
>>>>>> 'case_sensitive' : 'false',
>>>>>> 'mode' : 'CONTAINS'
>>>>>> };
>>>>>>
>>>>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd');
>>>>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd');
>>>>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%');
>>>>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1');
>>>>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd');
>>>>>>
>>>>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_';
>>>>>>
>>>>>> _pattern_ '%%%' finds all columns that contain %.
>>>>>> How to find columns that start form ‘%’ or ‘%a’?
>>>>>> How to find columns that end with ‘%’?
>>>>>> What about more complex patterns: '%qwe%a%sd%’? How to differentiate
>>>>>> ‘%’ char form % as a command symbol? (Also there is a related issue
>>>>>> CASSANDRA-12573).
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Mikhail
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

Reply via email to