> About space efficiency, one of the biggest drawbacks of SASI was the huge space required for index structure when using CONTAINS logic because of the decomposition of text columns into n-grams. Will SAI suffer from the same issue in future iterations?
SAI does not have specific n-gram support at the moment, though that may be added with tokenizers. N-grams do indeed grow the index; that is a trade-off the user chooses: faster queries at the cost of more disk space.

On Tue, Aug 18, 2020 at 6:05 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>
> Thank you Zhao Yang for starting this topic
>
> After reading the short design doc, I have a few questions
>
> 1) SASI was pretty inefficient indexing wide partitions because the index
> structure only retains the partition token, not the clustering columns. As
> per design doc SAI has row id mapping to partition offset, can we hope that
> indexing wide partitions will be more efficient with SAI? One detail that
> worries me is that in the beginning of the design doc, it is said that the
> matching rows are post filtered while scanning the partition. Can you
> confirm or deny that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows?
>
> 2) About space efficiency, one of the biggest drawbacks of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations? I'm anticipating a bit
>
> 3) If I'm querying using SAI and providing complete partition key, will it
> be more efficient than querying without partition key? In other words, does
> SAI provide any optimisation when partition key is specified?
>
> Regards
>
> Duy Hai DOAN
>
> On Tue, Aug 18, 2020 at 11:39, Mick Semb Wever <m...@apache.org> wrote:
>
> > >
> > > We are looking forward to the community's feedback and suggestions.
> > >
> >
> >
> > What comes immediately to mind is testing requirements. It has been
> > mentioned already that the project's testability and QA guidelines are
> > inadequate to successfully introduce new features and refactorings to the
> > codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> > defining more specific QA guidelines for 4.0-rc. This would be an important
> > step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Questions from me
> > - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
> > - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. are there separate QA
> > cycles planned for altering the SPI before adding a new SPI implementation?
> > - Despite being out of scope, it would be nice to have some idea from the
> > CEP author of when users might still choose afresh 2i or SASI over SAI.
> > - Who fills the roles involved? Who are the contributors in this DataStax
> > team? Who is the shepherd? Are there other stakeholders willing to be
> > involved?
> > - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggests a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
> >
> > cheers,
> > Mick
> >
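On the n-gram point raised in this thread, here is a rough sketch of why decomposing text values into n-grams grows an index. It is not how SASI or SAI is implemented. The function names (ngrams, posting_counts), the sample values, and the 2..4 gram range are all made up for illustration. It only counts (term, row) postings, which is a crude proxy for on-disk size.

```python
def ngrams(text, min_n, max_n):
    """Return the distinct substrings of `text` whose length is in min_n..max_n."""
    terms = set()
    for n in range(min_n, max_n + 1):
        # A string of length L has L - n + 1 substrings of length n (none if n > L).
        for start in range(len(text) - n + 1):
            terms.add(text[start:start + n])
    return terms


def posting_counts(values, min_n, max_n):
    """Count (term, row) postings for a whole-value index vs. an n-gram index.

    Hypothetical model: one row per value, one posting per distinct term in a row.
    """
    whole_value = len(values)  # exactly one term per row
    ngram = sum(len(ngrams(v, min_n, max_n)) for v in values)
    return whole_value, ngram


if __name__ == "__main__":
    # Illustrative values only, not taken from any real table.
    values = ["cassandra", "storage attached index"]
    whole_value, ngram = posting_counts(values, 2, 4)
    print(f"whole-value postings: {whole_value}")
    print(f"n-gram postings:      {ngram}")
```

The number of n-gram postings grows roughly with (value length) x (number of gram sizes), while a whole-value index keeps one posting per row. That is the disk-space side of the trade-off mentioned in the reply.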