Ok so I've found the source of the issue, it's pretty well hidden because
it is NOT in the SASI source code directly.

Here is the method where C* determines what kind of LIKE expression you're
using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778

As you can see, it's pretty simple, maybe too simple. Indeed, they forget
to remove escape character BEFORE doing the matching so if your search is LIKE
'%%esc%', the detected expression is LIKE_CONTAINS.

A possible fix would be:

1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
the column data type)
2) remove the escape character e.g. before parsing OR use some advanced
regex to exclude the %% from parsing e.g

Step 2) is dead easy but step 1) is harder because I don't know if
converting the bytebuffer into String at this stage of the CQL parser is
expensive or not (in term of computation)

Let me try a patch



On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> Ok you're right, I get your point
>
> LIKE '%%esc%' --> startWith('%esc')
>
> LIKE 'escape%%' -->  = 'escape%'
>
> What I strongly suspect is that in the source code of SASI, we parse the %
> xxx % expression BEFORE applying escape. That will explain the observed
> behavior. E.g:
>
> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>
> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>
> Let me check in the source code and try to reproduce the issue
>
>
>
> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
>> Looks like we have different understanding of what results are expected.
>> I based my understanding on http://docs.datastax.com/en
>> /cql/3.3/cql/cql_using/useSASIIndex.html
>> According to the doc ‘esc’ is a pattern for exact match and I guess that
>> there is no semantical difference between two LIKE patterns (both of
>> patterns should be treated as ‘exact match'): ‘%%esc’ and ‘esc’.
>>
>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>> *%esc*
>>
>> Why ‘containing’? I expect that it should be ’starting’..
>>
>>
>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>> *starting* with 'escape%' so *escape%*me is a valid result and also
>> *escape%*esc
>>
>> Why ’starting’? I expect that it should be ‘exact matching’.
>>
>> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but it
>> returns nothing (CASSANDRA-12573).
>>
>> What I’m missing?
>>
>> Thanks,
>> Mikhail
>>
>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> CREATE CUSTOM INDEX ON test.escape(val) USING '
>> org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode':
>> 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sa
>> si.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
>>
>> I don't see any problem in the results you got
>>
>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
>> *containing* '%esc' so *%esc*apeme is a possible match and also escape
>> *%esc*
>>
>> Why ‘containing’? I expect that it should be ’starting’..
>>
>>
>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
>> *starting* with 'escape%' so *escape%*me is a valid result and also
>> *escape%*esc
>>
>> Why ’starting’? I expect that it should be ‘exact matching’.
>>
>>
>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy <
>> mikhail.krupits...@jetbrains.com> wrote:
>>
>>> Thanks for the reply.
>>> Could you please provide what index definition did you use?
>>> With the index from my script I get the following results:
>>>
>>> cqlsh:test> select * from escape;
>>>
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>   2 | escape%me
>>> *  3 | escape%esc*
>>>
>>> Contains search
>>>
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>   3
>>> * | escape%esc*(2 rows)
>>>
>>>
>>> Prefix search
>>>
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>
>>>  id | val
>>> ----+-----------
>>>   2 | escape%me
>>>   3
>>> * | escape%esc*
>>>
>>> Thanks,
>>> Mikhail
>>>
>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>> Use % to escape %
>>>
>>> cqlsh:test> select * from escape;
>>>
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>   2 | escape%me
>>>
>>>
>>> Contains search
>>>
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>>
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>
>>> (1 rows)
>>>
>>>
>>> Prefix search
>>>
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>>
>>>  id | val
>>> ----+-----------
>>>   2 | escape%me
>>>
>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy <
>>> mikhail.krupits...@jetbrains.com> wrote:
>>>
>>>> Hi Cassandra guys,
>>>>
>>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a simple char in a
>>>> search pattern.
>>>> Here is my test script:
>>>>
>>>> DROP keyspace if exists kmv;
>>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class' :
>>>> 'SimpleStrategy', 'replication_factor':'1'} ;
>>>> USE kmv;
>>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY
>>>> KEY(id, c1));
>>>> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING '
>>>> org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>>>> 'analyzed' : 'true',
>>>> 'analyzer_class' : 'org.apache.cassandra.index.sa
>>>> si.analyzer.NonTokenizingAnalyzer',
>>>> 'case_sensitive' : 'false',
>>>> 'mode' : 'CONTAINS'
>>>> };
>>>>
>>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd');
>>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd');
>>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%');
>>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1');
>>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd');
>>>>
>>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_';
>>>>
>>>> _pattern_ '%%%' finds all columns that contain %.
>>>> How to find columns that start form ‘%’ or ‘%a’?
>>>> How to find columns that end with ‘%’?
>>>> What about more complex patterns: '%qwe%a%sd%’? How to differentiate
>>>> ‘%’ char form % as a command symbol? (Also there is a related issue
>>>> CASSANDRA-12573).
>>>>
>>>>
>>>> Thanks,
>>>> Mikhail
>>>
>>>
>>>
>>>
>>
>>
>

Reply via email to