[ 
https://issues.apache.org/jira/browse/KAFKA-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367892#comment-16367892
 ] 

Xavier Léauté commented on KAFKA-5285:
--------------------------------------

[~davispw] yes I understand your concern, and that's part the reason I filed 
this JIRA in the first place. Currently our internal APIs assume that knowing 
keyTo and timeTo is sufficient to determine the upper bound. However, without 
information about the lower key, or timeFrom it is not possible to determine an 
optimal upper bound for the byte range. Let me illustrate the problem that 
[~guozhang] is trying to solve:

Let's say we query the key range {{A1}} through {{AAB}}, and timestamps 
{{0000}} through {{DFFF}} (let's assume timestamps consist of 4 bytes for this 
example). We could have the following order:
{code}
A13333   (A1  | 3333)
AA0000   (AA  | 0000)
AAA0000  (AAA | 0000)
AAB1234  (AAB | 1234)
AAC321   (AA  | C321)
AADFFF   (AA  | DFFF)
{code}
That is, it is possible for keys smaller than keyTo to show up after the bytes 
for (keyTo+maxSuffix).
This happens only if:
# keys within the range can be shorter than keyTo.
# The suffix (a.k.a. timestamp) bytes are greater than the bytes in the same 
position for keyTo

Without any additional information about the key length or or the lower bound, 
we can only assume that keys are at least 1 byte, and that byte has to be 
smaller or equal to the first byte of keyTo (i.e. our upper bound has to start 
with the first byte of keyTo), so our best guess for and upper bound in that 
case is ADFFF.

NB: [~guozhang] in this particular case MAX(keyFrom + timeEndEarliest, keyTo + 
timeStartLatest) = max(A1DFFF, AAB0000) = AAB0000 would not work since it does 
not include the full time range for the AAB keys, or maybe I did not understand 
your suggestion.

Any key smaller than keyTo that appears after (keyTo+maxSuffix) is necessarily 
a prefix of keyTo, so knowing the lower bound is only helpful if it shares a 
prefix with keyTo (e.g. if in this case keyFrom was {{7F}}, then the lower 
bound would not be helpful).

However because the lower bound is {{A1}}, we now know keys starting with A 
have at least two bytes, and the second byte has to be an A based on the 
previous statement. This means me can reduce our upper bound to {{AADFFFF}}.

Also, if we have a smaller timeTo we can include any prefix of keyTo that does 
not contain any byte larger than the first byte of  of timeTo, but it gets 
trickier once you have matching bytes in the timestamp and the key[

Let me think this though a bit more and I'll try to come up with something that 
will work optimally in most cases, but my hunch is that unless we introduce 
optimizations for stores with fixed size keys, or have some information about 
the minimum key length, we will end up scanning a large portion of the table 
for long time ranges.

> Optimize upper / lower byte range for key range scan on windowed stores
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-5285
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5285
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Xavier Léauté
>            Assignee: Guozhang Wang
>            Priority: Major
>              Labels: performance
>
> The current implementation of {{WindowKeySchema}} / {{SessionKeySchema}} 
> {{upperRange}} and {{lowerRange}} does not make any assumptions with respect 
> to the other key bound (e.g. the upper byte bound does not depends on lower 
> key bound).
> It should be possible to optimize the byte range somewhat further using the 
> information provided by the lower bound.
> More specifically, by incorporating that information, we should be able to 
> eliminate the corresponding {{upperRangeFixedSize}} and 
> {{lowerRangeFixedSize}}, since the result should be the same if we implement 
> that optimization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to