Re: [DISCUSS] CEP-7 Storage Attached Index

Mike Adamson Wed, 02 Feb 2022 05:24:57 -0800

Hi,

I’d like to restart this thread.


We merged the row-aware branch to the SAI codebase just before Christmas and 
have subsequently updated the CEP to reflect these changes.

I would like to move the discussion forward as to how we move this CEP towards 
a vote.

MikeA

> On 16 Sep 2021, at 19:49, DuyHai Doan <doanduy...@gmail.com> wrote:
> 
> Good news, Mike, that row-based indexing will be available; this was a major
> gap in SASI at the time!
> 
> On Thu, 16 Sep 2021 at 15:38, Mike Adamson <madam...@datastax.com> wrote:
> 
>> Hi,
>> 
>> Just to keep this thread up to date with development progress, we will be
>> adding row-aware support to SAI in the next few weeks. This is currently
>> going through the final stages of review and testing.
>> 
>> This feature also adds on-disk versioning to SAI. This allows SAI to
>> support multiple on-disk formats during upgrades.
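A minimal sketch of how on-disk versioning can let one reader handle several formats during an upgrade. This is an illustration only, not the SAI implementation: the version tags `v1`/`v2` and the functions `read_v1`, `read_v2` and `open_index` are hypothetical names, and the idea that the older format is partition-granular is an assumption made for the example.

```python
from typing import Callable, Dict


def read_v1(payload: bytes) -> dict:
    # Hypothetical older layout (assumed partition-granular for this sketch).
    return {"granularity": "partition", "size": len(payload)}


def read_v2(payload: bytes) -> dict:
    # Hypothetical newer layout (assumed row-granular for this sketch).
    return {"granularity": "row", "size": len(payload)}


# Registry of readers keyed by the version tag stored with each index component.
READERS: Dict[str, Callable[[bytes], dict]] = {"v1": read_v1, "v2": read_v2}


def open_index(version: str, payload: bytes) -> dict:
    """Dispatch to the reader matching the component's recorded version."""
    try:
        reader = READERS[version]
    except KeyError:
        raise ValueError(f"unsupported on-disk format: {version}") from None
    return reader(payload)
```

With a registry like this, files written before an upgrade stay readable while newly written files use the newer format.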
>> 
>> I am mentioning this now because the CEP mentions “Partition Based
>> Iteration” as an initial feature. We will change that to “Row Based
>> Iteration” when the feature is merged.
>> 
>> MikeA
>> 
>>> On 15 Sep 2021, at 19:42, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>> 
>>> Hey there,
>>> 
>>> In the spirit of trying to get as many possible objections to a successful
>>> vote out of the way, I've added a "Challenges" section to the CEP:
>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>>> 
>>> 
>>> Most of you will be familiar with these, but I think we need to be as
>>> open/candid as possible about the potential risk they pose to SAI's broader
>>> usability. I've described them from the point of view that they are not
>>> intractable, but if anyone thinks they are, let's hash that disagreement
>>> out.
>>> 
>>> Thanks!
>>> 
>>> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>> 
>>>> +1 on introducing this in an incremental manner, and after reading through
>>>> CASSANDRA-16092 that seems like a perfect place to start. I see that work
>>>> on that Jira has stopped until direction for CEP-7 has been voted in.
>>>> 
>>>> I say start the vote and let's get this really valuable developer feature
>>>> underway.
>>>> 
>>>> Patrick
>>>> 
>>>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <calebrackli...@gmail.com>
>>>> wrote:
>>>> 
>>>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>>>> trying to release 4.0.) My synthesis of the conversation to this point is
>>>>> that while there are some open questions about testing
>>>>> methodology/"definition of done" and our choice of particular on-disk data
>>>>> structures, neither of these should be a serious obstacle to moving forward
>>>>> w/ a vote. Having said that, is there anything left around the CEP that we
>>>>> feel should prevent it from moving to a vote?
>>>>> 
>>>>> In terms of how we would proceed from the point a vote passes, it seems
>>>>> like there have been enough concerns around the proposed/necessary breaking
>>>>> changes to the 2i API that we will start development by introducing
>>>>> components as incrementally as possible into a long-running feature branch
>>>>> off trunk. (This work would likely start w/ CASSANDRA-16092
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-16092>, which we could
>>>>> resolve as a sub-task of the SAI epic without interfering with other trunk
>>>>> development likely destined for a 4.x minor, etc.)
>>>>> 
>>>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
>>>>> jasonstack.z...@gmail.com> wrote:
>>>>> 
>>>>>>>> Question is: is this planned as a next step?
>>>>>>>> If yes, how are we going to mark SAI as experimental until it gets
>>>>>>>> row offsets? Also, it is likely that index format is going to change when
>>>>>>>> row offsets are added, so my concern is that we may have to support two
>>>>>>>> versions of a format for a smooth migration.
>>>>>> 
>>>>>> The goal is to support row-level indexing when SAI is merged; I will update
>>>>>> the CEP accordingly.
>>>>>> 
>>>>>>>> I think switching to row
>>>>>>>> offsets also has a huge impact on interaction with SPRC and has some
>>>>>>>> potential for optimisations.
>>>>>> 
>>>>>> Can you share more details on the optimizations?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <oleksandr.pet...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>>> But for improving overall index read performance, I think improving base
>>>>>>>> table read perf (because SAI/SASI executes LOTS of
>>>>>>>> SinglePartitionReadCommand after searching on-disk index) is more effective
>>>>>>>> than switching from Trie to Prefix BTree.
>>>>>>> 
>>>>>>> I haven't suggested switching to Prefix B-Tree or any other structure; the
>>>>>>> question was about rationale and motivation of picking one over the other,
>>>>>>> which I am curious about for personal reasons/interests that lie outside of
>>>>>>> Cassandra. Having this listed in CEP could have been helpful for future
>>>>>>> guidance. It's ok if this question is outside of the CEP scope.
>>>>>>> 
>>>>>>> I also agree that there are many areas that require improvement around the
>>>>>>> read/write path and 2i, many of which (even outside of base table format or
>>>>>>> read perf) can yield positive performance results.
>>>>>>> 
>>>>>>>> FWIW, I personally look forward to receiving that contribution when the
>>>>>>>> time is right.
>>>>>>> 
>>>>>>> I am very excited for this contribution, too, and it looks like very solid
>>>>>>> work.
>>>>>>> 
>>>>>>> I have one more question, about "Upon resolving partition keys, rows are
>>>>>>> loaded using Cassandra’s internal partition read command across SSTables
>>>>>>> and are post filtered". One of the criticisms of SASI and reasons for
>>>>>>> marking it as experimental was CASSANDRA-11990. I think switching to row
>>>>>>> offsets also has a huge impact on interaction with SPRC and has some
>>>>>>> potential for optimisations. Question is: is this planned as a next step?
>>>>>>> If yes, how are we going to mark SAI as experimental until it gets
>>>>>>> row offsets? Also, it is likely that index format is going to change when
>>>>>>> row offsets are added, so my concern is that we may have to support two
>>>>>>> versions of a format for a smooth migration.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
>>>>>>> jasonstack.z...@gmail.com> wrote:
>>>>>>> 
>>>>>>>>>> I think CEP should be more upfront with "eventually replace
>>>>>>>>>> it" bit, since it raises the question about what the people who are using
>>>>>>>>>> other index implementations can expect.
>>>>>>>> 
>>>>>>>> Will update the CEP to emphasize: SAI will replace other indexes.
>>>>>>>> 
>>>>>>>>>> Unfortunately, I do not have an
>>>>>>>>>> implementation sitting around for a direct comparison, but I can imagine
>>>>>>>>>> situations when B-Trees may perform better because of simpler construction.
>>>>>>>>>> Maybe we should even consider prototyping a prefix B-Tree to have a more
>>>>>>>>>> fair comparison.
>>>>>>>> 
>>>>>>>> As long as prefix BTree supports range/prefix aggregation (which is used to
>>>>>>>> speed up range/prefix query when matching entire subtree), we can plug it in
>>>>>>>> and compare. It won't affect the CEP design which focuses on sharing data
>>>>>>>> across indexes and posting aggregation.
>>>>>>>> 
>>>>>>>> But for improving overall index read performance, I think improving base
>>>>>>>> table read perf (because SAI/SASI executes LOTS of SinglePartitionReadCommand
>>>>>>>> after searching on-disk index) is more effective than switching from Trie to
>>>>>>>> Prefix BTree.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith <bened...@apache.org>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> FWIW, I personally look forward to receiving that contribution when the
>>>>>>>>> time is right.
>>>>>>>>> 
>>>>>>>>> On 23/09/2020, 18:45, "Josh McKenzie" <jmcken...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>   talking about that would involve some bits of information DataStax might
>>>>>>>>>   not be ready to share?
>>>>>>>>> 
>>>>>>>>>   At the risk of derailing, I've been poking and prodding this week at we
>>>>>>>>>   contributors at DS getting our act together w/a draft CEP for donating the
>>>>>>>>>   trie-based indices to the ASF project.
>>>>>>>>> 
>>>>>>>>>   More to come; the intention is certainly to contribute that code. The lack
>>>>>>>>>   of a destination to merge it into (i.e. no 5.0-dev branch) is removing
>>>>>>>>>   significant urgency from the process as well (not to open a 3rd Pandora's
>>>>>>>>>   box), but there's certainly an interrelatedness to the conversations going
>>>>>>>>>   on.
>>>>>>>>> 
>>>>>>>>>   ---
>>>>>>>>>   Josh McKenzie
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>   On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <
>>>>>>>>> calebrackli...@gmail.com <mailto:calebrackli...@gmail.com>>
>>>>>>>>>   wrote:
>>>>>>>>> 
>>>>>>>>>> As long as we can construct the on-disk indexes efficiently/directly from
>>>>>>>>>> a Memtable-attached index on flush, there's room to try other data
>>>>>>>>>> structures. Most of the innovation in SAI is around the layout of postings
>>>>>>>>>> (something we can expand on if people are interested) and having a
>>>>>>>>>> natively row-oriented design that scales w/ multiple indexed columns on
>>>>>>>>>> single SSTables. There are some broader implications of using the trie that
>>>>>>>>>> reach outside SAI itself, but talking about that would involve some bits of
>>>>>>>>>> information DataStax might not be ready to share?
>>>>>>>>>> 
>>>>>>>>>> On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan
>>>>>>>>>> <jeremiah.jordan@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Short question: looking forward, how are we going to maintain three 2i
>>>>>>>>>> implementations: SASI, SAI, and 2i?
>>>>>>>>>> 
>>>>>>>>>> I think one of the goals stated in the CEP is for SAI to have parity with
>>>>>>>>>> 2i such that it could eventually replace it.
>>>>>>>>>> 
>>>>>>>>>> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov
>>>>>>>>>> <oleksandr.pet...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Short question: looking forward, how are we going to maintain three 2i
>>>>>>>>>> implementations: SASI, SAI, and 2i?
>>>>>>>>>> 
>>>>>>>>>> Another thing I think this CEP is missing is rationale and motivation
>>>>>>>>>> about why trie-based indexes were chosen over, say, B-Tree. We did have a
>>>>>>>>>> short discussion about this on Slack, but both arguments that I've heard
>>>>>>>>>> (space-saving and keeping a small subset of nodes in memory) work only for
>>>>>>>>>> the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree
>>>>>>>>>> can have similar properties. There's been a lot of research on B-Trees and
>>>>>>>>>> optimisations in those. Unfortunately, I do not have an implementation
>>>>>>>>>> sitting around for a direct comparison, but I can imagine situations when
>>>>>>>>>> B-Trees may perform better because of simpler construction.
>>>>>>>>>> 
>>>>>>>>>> Maybe we should even consider prototyping a prefix B-Tree to have a more
>>>>>>>>>> fair comparison.
>>>>>>>>>> 
>>>>>>>>>> Thank you,
>>>>>>>>>> -- Alex
>>>>>>>>>> 
>>>>>>>>>> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang
>>>>>>>>>> <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7 SAI.
>>>>>>>>>> 
>>>>>>>>>> The recorded video is available here:
>>>>>>>>>> 
>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>> 
>>>>>>>>>> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang
>>>>>>>>>> <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thank you, Charles and Patrick
>>>>>>>>>> 
>>>>>>>>>> On Tue, 1 Sep 2020 at 04:56, Charles Cao <caohair...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thank you, Patrick!
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin <pmcfa...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> I just moved it to 8AM for this meeting to better accommodate APAC. Please
>>>>>>>>>> see the update here:
>>>>>>>>>> 
>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>> 
>>>>>>>>>> Patrick
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao <caohair...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Patrick,
>>>>>>>>>> 
>>>>>>>>>> 11AM PST is a bad time for the people in the APAC timezone. Can we move it
>>>>>>>>>> to 7 or 8AM PST in the morning to accommodate their needs?
>>>>>>>>>> 
>>>>>>>>>> ~Charles
>>>>>>>>>> 
>>>>>>>>>> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin <pmcfa...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Meeting scheduled.
>>>>>>>>>> 
>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>> 
>>>>>>>>>> Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda but
>>>>>>>>>> if there is more, edit away.
>>>>>>>>>> 
>>>>>>>>>> Patrick
>>>>>>>>>> 
>>>>>>>>>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang
>>>>>>>>>> <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> 
>>>>>>>>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <e.dimitr...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> 
>>>>>>>>>> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <calebrackli...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> 
>>>>>>>>>> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> This is related to the discussion Jordan and I had about the contributor
>>>>>>>>>> Zoom call. Instead of open mic for any issue, call it based on a discussion
>>>>>>>>>> thread or threads for higher bandwidth discussion.
>>>>>>>>>> 
>>>>>>>>>> I would be happy to schedule one for next week to specifically discuss
>>>>>>>>>> CEP-7. I can attach the recorded call to the CEP after.
>>>>>>>>>> 
>>>>>>>>>> +1 or -1?
>>>>>>>>>> 
>>>>>>>>>> Patrick
>>>>>>>>>> 
>>>>>>>>>> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <jmcken...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Does the community plan to open another discussion or CEP on
>>>>>>>>>> modularization?
>>>>>>>>>> 
>>>>>>>>>> We probably should have a discussion on the ML or monthly contrib call
>>>>>>>>>> about it first to see how aligned the interested contributors are. Could do
>>>>>>>>>> that through a CEP as well, but CEPs (at least thus far, sans k8s operator)
>>>>>>>>>> tend to start with a strong, deeply thought out point of view being
>>>>>>>>>> expressed.
>>>>>>>>>> 
>>>>>>>>>> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang
>>>>>>>>>> <jasonstack.z...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> SASI's performance, specifically the search in the B+ tree component,
>>>>>>>>>> depends a lot on the component file's header being available in the
>>>>>>>>>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound
>>>>>>>>>> to this same or similar limitation?
>>>>>>>>>> 
>>>>>>>>>> SAI also benefits from larger memory because SAI puts block info on heap
>>>>>>>>>> for searching on-disk components and having cross-index files on page cache
>>>>>>>>>> improves read performance of different indexes on the same table.
>>>>>>>>>> 
>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
>>>>>>>>>> pauses, and crashes on the node. SSDs are a must, along with a bit of
>>>>>>>>>> tuning, just to avoid bringing down your cluster. Beyond reducing space
>>>>>>>>>> requirements, does SAI improve on these things? Like SASI, how does SAI, in
>>>>>>>>>> its own way, change/narrow the recommendations on node hardware specs?
>>>>>>>>>> 
>>>>>>>>>> SAI won't crash the node during compaction and requires less CPU/IO.
>>>>>>>>>> 
>>>>>>>>>> * SAI defines a global memory limit for compaction instead of the per-index
>>>>>>>>>> memory limit used by SASI.
>>>>>>>>>> 
>>>>>>>>>> For example, compactions are running on 10 tables and each has 10
>>>>>>>>>> indexes. SAI will cap the memory usage with the global limit while SASI may
>>>>>>>>>> use up to 100 * per-index limit.
>>>>>>>>>> 
>>>>>>>>>> * After flushing in-memory segments to disk, SAI won't merge on-disk
>>>>>>>>>> segments while SASI attempts to merge them at the end.
>>>>>>>>>> 
>>>>>>>>>> There are pros and cons of not merging segments:
>>>>>>>>>> 
>>>>>>>>>> ** Pros: compaction runs faster and requires fewer resources.
>>>>>>>>>> 
>>>>>>>>>> ** Cons: small segments reduce compression ratio.
>>>>>>>>>> 
>>>>>>>>>> * SAI on-disk format with row ids compresses better.
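The memory-limit comparison in the first bullet can be worked through as simple arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and the 64 MB per-index and 1024 MB global limits are made-up figures for illustration, not defaults from SASI or SAI.

```python
def per_index_ceiling_mb(tables: int, indexes_per_table: int,
                         per_index_limit_mb: int) -> int:
    """Worst case when every index being compacted gets its own budget."""
    return tables * indexes_per_table * per_index_limit_mb


def global_ceiling_mb(global_limit_mb: int) -> int:
    """Worst case when all index builds share one budget."""
    return global_limit_mb


# The thread's example: 10 tables compacting, 10 indexes each.
# Both limits below are illustrative values, not real defaults.
per_index = per_index_ceiling_mb(10, 10, 64)
shared = global_ceiling_mb(1024)
print(f"per-index ceiling: {per_index} MB, global ceiling: {shared} MB")
```

The point of the comparison is that the per-index ceiling grows with the number of concurrently compacted indexes, while the global ceiling does not.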
>>>>>>>>>> 
>>>>>>>>>> I understand the desire in keeping out of scope the longer term deprecation
>>>>>>>>>> and migration plan, but… if SASI provides functionality that SAI doesn't,
>>>>>>>>>> like tokenisation and DelimiterAnalyzer, yet introduces a body of code
>>>>>>>>>> ~somewhat similar, shouldn't we be roughly sketching out how to reduce the
>>>>>>>>>> maintenance surface area?
>>>>>>>>>> 
>>>>>>>>>> Agreed that we should reduce maintenance area if possible, but only very
>>>>>>>>>> limited code base (eg. RangeIterator, QueryPlan) can be shared. The rest of
>>>>>>>>>> the code base is quite different because of on-disk format and cross-index
>>>>>>>>>> files. The goal of this CEP is to get community buy-in on SAI's design.
>>>>>>>>>> Tokenization, DelimiterAnalyzer should be straightforward to implement on
>>>>>>>>>> top of SAI.
>>>>>>>>>> 
>>>>>>>>>> Can we list what configurations of SASI will become deprecated once SAI
>>>>>>>>>> becomes non-experimental?
>>>>>>>>>> 
>>>>>>>>>> Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of SASI
>>>>>>>>>> can be replaced by SAI.
>>>>>>>>>> 
>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide some
>>>>>>>>>> overview, or rough indication, of how many of them we could "triage away"?
>>>>>>>>>> 
>>>>>>>>>> I believe most of the known bugs in 2i/SASI either have been addressed in
>>>>>>>>>> SAI or don't apply to SAI.
>>>>>>>>>> 
>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>> implementations as separate sub-modules and jar files that are only loaded
>>>>>>>>>> at runtime based on configuration settings? (sorry for the conflation on
>>>>>>>>>> this one, but maybe it's the right time to raise it :shrug:)
>>>>>>>>>> 
>>>>>>>>>> Agreed that modularization is the way to go and will speed up module
>>>>>>>>>> development.
>>>>>>>>>> 
>>>>>>>>>> Does the community plan to open another discussion or CEP on
>>>>>>>>>> modularization?
>>>>>>>>>> 
>>>>>>>>>> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Adding to Duy's questions…
>>>>>>>>>> 
>>>>>>>>>> * Hardware specs
>>>>>>>>>> 
>>>>>>>>>> SASI's performance, specifically the search in the B+ tree component,
>>>>>>>>>> depends a lot on the component file's header being available in the
>>>>>>>>>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound
>>>>>>>>>> to this same or similar limitation?
>>>>>>>>>> 
>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
>>>>>>>>>> pauses, and crashes on the node. SSDs are a must, along with a bit of
>>>>>>>>>> tuning, just to avoid bringing down your cluster. Beyond reducing space
>>>>>>>>>> requirements, does SAI improve on these things? Like SASI, how does SAI, in
>>>>>>>>>> its own way, change/narrow the recommendations on node hardware specs?
>>>>>>>>>> 
>>>>>>>>>> * Code Maintenance
>>>>>>>>>> 
>>>>>>>>>> I understand the desire in keeping out of scope the longer term deprecation
>>>>>>>>>> and migration plan, but… if SASI provides functionality that SAI doesn't,
>>>>>>>>>> like tokenisation and DelimiterAnalyzer, yet introduces a body of code
>>>>>>>>>> ~somewhat similar, shouldn't we be roughly sketching out how to reduce the
>>>>>>>>>> maintenance surface area?
>>>>>>>>>> 
>>>>>>>>>> Can we list what configurations of SASI will become deprecated once SAI
>>>>>>>>>> becomes non-experimental?
>>>>>>>>>> 
>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide some
>>>>>>>>>> overview, or rough indication, of how many of them we could "triage away"?
>>>>>>>>>> 
>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>> implementations as separate sub-modules and jar files that are only loaded
>>>>>>>>>> at runtime based on configuration settings? (sorry for the conflation on
>>>>>>>>>> this one, but maybe it's the right time to raise it :shrug:)
>>>>>>>>>> 
>>>>>>>>>> regards,
>>>>>>>>>> 
>>>>>>>>>> Mick
>>>>>>>>>> 
>>>>>>>>>> On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thank you Zhao Yang for starting this topic
>>>>>>>>>> 
>>>>>>>>>> After reading the short design doc, I have a few questions
>>>>>>>>>> 
>>>>>>>>>> 1) SASI was pretty inefficient at indexing wide partitions because the
>>>>>>>>>> index structure only retains the partition token, not the clustering
>>>>>>>>>> columns. As per the design doc, SAI has a row id mapping to partition
>>>>>>>>>> offset; can we hope that indexing wide partitions will be more efficient
>>>>>>>>>> with SAI? One detail that worries me is that at the beginning of the design
>>>>>>>>>> doc, it is said that the matching rows are post filtered while scanning the
>>>>>>>>>> partition. Can you confirm or deny that SAI is efficient with wide
>>>>>>>>>> partitions and provides the partition offsets to the matching rows?
>>>>>>>>>> 
>>>>>>>>>> 2) About space efficiency, one of the biggest drawbacks of SASI was
>>>>>>>>>> the huge space required for the index structure when using CONTAINS
>>>>>>>>>> logic, because of the decomposition of text columns into n-grams.
>>>>>>>>>> Will SAI suffer from the same issue in future iterations? I'm
>>>>>>>>>> anticipating a bit.
>>>>>>>>>> 
>>>>>>>>>> 3) If I'm querying using SAI and providing the complete partition
>>>>>>>>>> key, will it be more efficient than querying without the partition
>>>>>>>>>> key? In other words, does SAI provide any optimisation when the
>>>>>>>>>> partition key is specified?
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>>>>> 
>>>>>>>>>> Duy Hai DOAN
>>>>>>>>>> 
>>>>>>>>>> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> We are looking forward to the community's feedback and suggestions.
>>>>>>>>>> 
>>>>>>>>>> What comes immediately to mind is testing requirements. It has been
>>>>>>>>>> mentioned already that the project's testability and QA guidelines
>>>>>>>>>> are inadequate to successfully introduce new features and
>>>>>>>>>> refactorings to the codebase. During the 4.0 beta phase this was
>>>>>>>>>> intended to be addressed, i.e. defining more specific QA guidelines
>>>>>>>>>> for 4.0-rc. This would be an important step towards QA guidelines
>>>>>>>>>> for all changes and CEPs post-4.0.
>>>>>>>>>> 
>>>>>>>>>> Questions from me
>>>>>>>>>> 
>>>>>>>>>> - How will this be tested, how will its QA status and lifecycle be
>>>>>>>>>>   defined? (per above)
>>>>>>>>>> 
>>>>>>>>>> - With existing C* code needing to be changed, what is the proposed
>>>>>>>>>>   plan for making those changes while ensuring maintained QA, e.g.
>>>>>>>>>>   are there separate QA cycles planned for altering the SPI before
>>>>>>>>>>   adding a new SPI implementation?
>>>>>>>>>> 
>>>>>>>>>> - Despite being out of scope, it would be nice to have some idea
>>>>>>>>>>   from the CEP author of when users might still choose 2i or SASI
>>>>>>>>>>   afresh over SAI.
>>>>>>>>>> 
>>>>>>>>>> - Who fills the roles involved? Who are the contributors in this
>>>>>>>>>>   DataStax team? Who is the shepherd? Are there other stakeholders
>>>>>>>>>>   willing to be involved?
>>>>>>>>>> 
>>>>>>>>>> - Is there a preference to use gdoc instead of the project's wiki,
>>>>>>>>>>   and why? (the CEP process suggests a wiki page, and feedback on
>>>>>>>>>>   why another approach is considered better helps evolve the CEP
>>>>>>>>>>   process itself)
>>>>>>>>>> 
>>>>>>>>>> cheers,
>>>>>>>>>> 
>>>>>>>>>> Mick
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>>>>>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> alex p
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> alex p
