Re: [DISCUSS] CEP-7 Storage Attached Index

Ekaterina Dimitrova Wed, 26 Aug 2020 13:52:15 -0700

+1

On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <[email protected]>
wrote:


> +1
>
>
>
> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <[email protected]> wrote:
>
>
>
> > This is related to the discussion Jordan and I had about the contributor
>
> > Zoom call. Instead of open mic for any issue, call it based on a
> discussion
>
> > thread or threads for higher bandwidth discussion.
>
> >
>
> > I would be happy to schedule one for next week to specifically discuss
>
> > CEP-7. I can attach the recorded call to the CEP after.
>
> >
>
> > +1 or -1?
>
> >
>
> > Patrick
>
> >
>
> > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <[email protected]>
>
> > wrote:
>
> >
>
> > > >
>
> > > > Does the community plan to open another discussion or CEP on
>
> > modularization?
>
> > >
>
> > > We probably should have a discussion on the ML or monthly contrib call
>
> > > about it first to see how aligned the interested contributors are.
> Could
>
> > do
>
> > > that through a CEP as well, but CEPs (at least thus far, sans k8s
> operator)
>
> > > tend to start with a strong, deeply thought out point of view being
>
> > > expressed.
>
> > >
>
> > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
>
> > > [email protected]> wrote:
>
> > >
>
> > > > >>> SASI's performance, specifically the search in the B+ tree
>
> > component,
>
> > > > >>> depends a lot on the component file's header being available in
> the
>
> > > > >>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is
>
> > SAI
>
> > > > bound
>
> > > > >>> to this same or similar limitation?
>
> > > >
>
> > > > SAI also benefits from larger memory because SAI puts block info on
>
> > heap
>
> > > > for searching on-disk components and having cross-index files on page
>
> > > cache
>
> > > > improves read performance of different indexes on the same table.
>
> > > >
>
> > > >
>
> > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
>
> > saturation,
>
> > > > >>> pauses, and crashes on the node. SSDs are a must, along with a
> bit
>
> > of
>
> > > > >>> tuning, just to avoid bringing down your cluster. Beyond reducing
>
> > > space
>
> > > > >>> requirements, does SAI improve on these things? Like SASI how
> does
>
> > > SAI,
>
> > > > in
>
> > > > >>> its own way, change/narrow the recommendations on node hardware
>
> > > specs?
>
> > > >
>
> > > > SAI won't crash the node during compaction and requires less CPU/IO.
>
> > > >
>
> > > > > * SAI defines a global memory limit for compaction instead of the per-index
>
> > > > memory limit used by SASI.
>
> > > >   For example, compactions are running on 10 tables and each has 10
>
> > > > indexes. SAI will cap the
>
> > > >   memory usage with global limit while SASI may use up to 100 *
>
> > per-index
>
> > > > limit.
>
> > > >
>
> > > > * After flushing in-memory segments to disk, SAI won't merge on-disk
>
> > > > segments while SASI
>
> > > >   attempts to merge them at the end.
>
> > > >
>
> > > >   There are pros and cons of not merging segments:
>
> > > >     ** Pros: compaction runs faster and requires fewer resources.
>
> > > >     ** Cons: small segments reduce compression ratio.
>
> > > >
>
> > > > * SAI on-disk format with row ids compresses better.
>
> > > >
>
> > > >
>
> > > > >>> I understand the desire in keeping out of scope the longer term
>
> > > > deprecation
>
> > > > >>> and migration plan, but… if SASI provides functionality that SAI
>
> > > > doesn't,
>
> > > > >>> like tokenisation and DelimiterAnalyzer, yet introduces a body of
>
> > > code
>
> > > > >>> ~somewhat similar, shouldn't we be roughly sketching out how to
>
> > > reduce
>
> > > > the
>
> > > > >>> maintenance surface area?
>
> > > >
>
> > > > Agreed that we should reduce maintenance area if possible, but only
>
> > very
>
> > > > limited
>
> > > > code base (eg. RangeIterator, QueryPlan) can be shared. The rest of
> the
>
> > > > code base
>
> > > > is quite different because of on-disk format and cross-index files.
>
> > > >
>
> > > > The goal of this CEP is to get community buy-in on SAI's design.
>
> > > > Tokenization,
>
> > > > DelimiterAnalyzer should be straightforward to implement on top of
> SAI.
>
> > > >
>
> > > > >>> Can we list what configurations of SASI will become deprecated
> once
>
> > > SAI
>
> > > > >>> becomes non-experimental?
>
> > > >
>
> > > > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of
>
> > SASI
>
> > > > can
>
> > > > be replaced by SAI.
>
> > > >
>
> > > > >>> Given a few bugs are open against 2i and SASI, can we provide
> some
>
> > > > >>> overview, or rough indication, of how many of them we could
> "triage
>
> > > > away"?
>
> > > >
>
> > > > I believe most of the known bugs in 2i/SASI either have been
> addressed
>
> > in
>
> > > > SAI or
>
> > > > don't apply to SAI.
>
> > > >
>
> > > > >>> And, is it time for the project to start introducing new SPI
>
> > > > >>> implementations as separate sub-modules and jar files that are
> only
>
> > > > loaded
>
> > > > >>> at runtime based on configuration settings? (sorry for the
>
> > conflation
>
> > > > on
>
> > > > >>> this one, but maybe it's the right time to raise it :shrug:)
>
> > > >
>
> > > > Agreed that modularization is the way to go and will speed up module
>
> > > > > development.
>
> > > >
>
> > > > > Does the community plan to open another discussion or CEP on
>
> > modularization?
>
> > > >
>
> > > >
>
> > > > On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <[email protected]>
> wrote:
>
> > > >
>
> > > > > Adding to Duy's questions…
>
> > > > >
>
> > > > >
>
> > > > > * Hardware specs
>
> > > > >
>
> > > > > SASI's performance, specifically the search in the B+ tree
> component,
>
> > > > > depends a lot on the component file's header being available in the
>
> > > > > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is
> SAI
>
> > > > bound
>
> > > > > to this same or similar limitation?
>
> > > > >
>
> > > > > Flushing of SASI can be CPU+IO intensive, to the point of
> saturation,
>
> > > > > pauses, and crashes on the node. SSDs are a must, along with a bit
> of
>
> > > > > tuning, just to avoid bringing down your cluster. Beyond reducing
>
> > space
>
> > > > > requirements, does SAI improve on these things? Like SASI how does
>
> > SAI,
>
> > > > in
>
> > > > > its own way, change/narrow the recommendations on node hardware
>
> > specs?
>
> > > > >
>
> > > > >
>
> > > > > * Code Maintenance
>
> > > > >
>
> > > > > I understand the desire in keeping out of scope the longer term
>
> > > > deprecation
>
> > > > > and migration plan, but… if SASI provides functionality that SAI
>
> > > doesn't,
>
> > > > > like tokenisation and DelimiterAnalyzer, yet introduces a body of
>
> > code
>
> > > > > ~somewhat similar, shouldn't we be roughly sketching out how to
>
> > reduce
>
> > > > the
>
> > > > > maintenance surface area?
>
> > > > >
>
> > > > > Can we list what configurations of SASI will become deprecated once
>
> > SAI
>
> > > > > becomes non-experimental?
>
> > > > >
>
> > > > > Given a few bugs are open against 2i and SASI, can we provide some
>
> > > > > overview, or rough indication, of how many of them we could "triage
>
> > > > away"?
>
> > > > >
>
> > > > > And, is it time for the project to start introducing new SPI
>
> > > > > implementations as separate sub-modules and jar files that are only
>
> > > > loaded
>
> > > > > at runtime based on configuration settings? (sorry for the
> conflation
>
> > > on
>
> > > > > this one, but maybe it's the right time to raise it :shrug:)
>
> > > > >
>
> > > > > regards,
>
> > > > > Mick
>
> > > > >
>
> > > > >
>
> > > > > On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <[email protected]>
>
> > > wrote:
>
> > > > >
>
> > > > > > Thank you Zhao Yang for starting this topic
>
> > > > > >
>
> > > > > > After reading the short design doc, I have a few questions
>
> > > > > >
>
> > > > > > 1) SASI was pretty inefficient indexing wide partitions because
> the
>
> > > > index
>
> > > > > > structure only retains the partition token, not the clustering
>
> > > > columns.
>
> > > > As
>
> > > > > > per design doc SAI has row id mapping to partition offset, can we
>
> > > hope
>
> > > > > that
>
> > > > > > > indexing wide partitions will be more efficient with SAI? One
>
> > detail
>
> > > > that
>
> > > > > > > worries me is that in the beginning of the design doc, it is said
>
> > > that
>
> > > > > the
>
> > > > > > matching rows are post filtered while scanning the partition. Can
>
> > you
>
> > > > > > > confirm or deny that SAI is efficient with wide partitions and
>
> > > > provides
>
> > > > > > > the partition offsets to the matching rows?
>
> > > > > >
>
> > > > > > > 2) About space efficiency, one of the biggest drawbacks of SASI
> was
>
> > > the
>
> > > > > huge
>
> > > > > > space required for index structure when using CONTAINS logic
>
> > because
>
> > > of
>
> > > > > the
>
> > > > > > decomposition of text columns into n-grams. Will SAI suffer from
>
> > the
>
> > > > same
>
> > > > > > > issue in future iterations? I'm anticipating a bit
>
> > > > > >
>
> > > > > > 3) If I'm querying using SAI and providing complete partition
> key,
>
> > > will
>
> > > > > it
>
> > > > > > > be more efficient than querying without partition key? In other
>
> > > words,
>
> > > > > does
>
> > > > > > > SAI provide any optimisation when partition key is specified?
>
> > > > > >
>
> > > > > > Regards
>
> > > > > >
>
> > > > > > Duy Hai DOAN
>
> > > > > >
>
> > > > > > > On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <[email protected]>
>
> > > > > wrote:
>
> > > > > >
>
> > > > > > > >
>
> > > > > > > > We are looking forward to the community's feedback and
>
> > > suggestions.
>
> > > > > > > >
>
> > > > > > >
>
> > > > > > >
>
> > > > > > > What comes immediately to mind is testing requirements. It has
>
> > been
>
> > > > > > > mentioned already that the project's testability and QA
>
> > guidelines
>
> > > > are
>
> > > > > > > inadequate to successfully introduce new features and
>
> > refactorings
>
> > > to
>
> > > > > the
>
> > > > > > > codebase. During the 4.0 beta phase this was intended to be
>
> > > > addressed,
>
> > > > > > i.e.
>
> > > > > > > defining more specific QA guidelines for 4.0-rc. This would be
> an
>
> > > > > > important
>
> > > > > > > step towards QA guidelines for all changes and CEPs post-4.0.
>
> > > > > > >
>
> > > > > > > Questions from me
>
> > > > > > >  - How will this be tested, how will its QA status and
> lifecycle
>
> > be
>
> > > > > > > defined? (per above)
>
> > > > > > >  - With existing C* code needing to be changed, what is the
>
> > > proposed
>
> > > > > plan
>
> > > > > > > > for making those changes ensuring maintained QA, e.g. are there
>
> > > > separate
>
> > > > > > QA
>
> > > > > > > cycles planned for altering the SPI before adding a new SPI
>
> > > > > > implementation?
>
> > > > > > >  - Despite being out of scope, it would be nice to have some
> idea
>
> > > > from
>
> > > > > > the
>
> > > > > > > CEP author of when users might still choose afresh 2i or SASI
>
> > over
>
> > > > SAI,
>
> > > > > > >  - Who fills the roles involved? Who are the contributors in
> this
>
> > > > > > DataStax
>
> > > > > > > team? Who is the shepherd? Are there other stakeholders willing
>
> > to
>
> > > be
>
> > > > > > > involved?
>
> > > > > > >  - Is there a preference to use gdoc instead of the project's
>
> > wiki,
>
> > > > and
>
> > > > > > > > why? (the CEP process suggests a wiki page, and feedback on why
>
> > > > another
>
> > > > > > > approach is considered better helps evolve the CEP process
>
> > itself)
>
> > > > > > >
>
> > > > > > > cheers,
>
> > > > > > > Mick
>
> > > > > > >
>
> > > > > >
>
> > > > >
>
> > > >
>
> > >
>
> >
>
>
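
The memory-limit comparison quoted above (compactions on 10 tables with 10 indexes each) can be worked through as a small sketch. The thread gives no actual limit values, so the two constants below are made-up placeholders and the function names are hypothetical; this only illustrates the arithmetic, not how SAI or SASI implement their limits.

```python
# Assumed, illustrative budgets: the thread does not state real values.
PER_INDEX_LIMIT_MB = 64   # hypothetical per-index budget (SASI-style)
GLOBAL_LIMIT_MB = 1024    # hypothetical node-wide budget (SAI-style)


def per_index_worst_case_mb(tables, indexes_per_table, per_index_limit_mb):
    """Worst case when every index being compacted has its own budget."""
    return tables * indexes_per_table * per_index_limit_mb


def global_worst_case_mb(tables, indexes_per_table, global_limit_mb):
    """Worst case when all index builds share one budget.

    The table and index counts are accepted only to mirror the other
    function; they do not change the result.
    """
    return global_limit_mb


# The scenario from the thread: 10 tables, 10 indexes each.
sasi_style = per_index_worst_case_mb(10, 10, PER_INDEX_LIMIT_MB)
sai_style = global_worst_case_mb(10, 10, GLOBAL_LIMIT_MB)
print(f"per-index limits: up to {sasi_style} MB; global limit: up to {sai_style} MB")
```

With per-index limits the worst case grows with the number of indexes being compacted (100 times the per-index limit here), while a global limit stays fixed regardless of how many compactions run concurrently.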
