Re: About the relationship between the sstable compaction and the read path

Jinhua Luo Wed, 09 Jan 2019 21:54:28 -0800

> We stop at the memtable if we know that’s all we need. This depends on a lot 
> of factors (schema, point read vs slice, etc)


The code seems to search the sstables without checking whether the query
is already satisfied by the memtable alone.
Could you point me to the relevant code for what you described?




Could you give a quick and simple answer to my questions about the complex types?

For a collection: when I select a column of a collection type, e.g.
map<text, text>, it is necessary to search all sstables to ensure
that the whole set of map entries is collected.

For a cdt, all fields of the cdt need to be collected.

For a counter, all mutations distributed across the sstables need to be
merged to produce the final counter value.
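
The collection case can be made concrete with a toy model. This is only a
sketch under stated assumptions, not Cassandra's actual code: each source
(memtable or sstable) is modelled as a dict from map key to a
(value, timestamp) pair, and the function name merge_map is hypothetical.
Tombstones and whole-collection overwrites are deliberately left out.

```python
def merge_map(fragments):
    """Merge the per-element cells of a multi-cell map.

    Each fragment is a dict: map key -> (value, write timestamp).
    For every map key, the cell with the highest timestamp wins.
    Assumption: no tombstones or whole-collection overwrites are modelled.
    """
    merged = {}
    for frag in fragments:
        for key, (value, ts) in frag.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return {key: value for key, (value, _ts) in merged.items()}


# Hypothetical data: the entries of one map column are spread over three sources.
sstable_1 = {"a": ("1", 10)}
sstable_2 = {"b": ("2", 20)}
memtable = {"a": ("3", 30)}

full = merge_map([sstable_1, sstable_2, memtable])
memtable_only = merge_map([memtable])
```

In this model, reading only the memtable misses entry "b", which lives in
sstable_2 alone, so every source that may hold the partition has to take part
in the merge.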




Another related question: the sstable contains only a partition key
index and a clustering key index (inline within the index file), but no
index for collections such as map and set. So, to get a single field,
does Cassandra need to iterate over all fields, or can it do a quick
search over a sorted array?

Jeff Jirsa <jji...@gmail.com> wrote on Wed, Jan 9, 2019 at 10:43 PM:
>
> You’re comparing single machine key/value stores to a distributed db with a 
> much richer data model (partitions/slices, statics, range reads, range 
> deletions, etc). They’re going to read very differently. Instead of 
> explaining why they’re not like rocks/ldb, how about you tell us what you’re 
> trying to do / learn so we can answer the real question?
>
> Few other notes inline.
>
> --
> Jeff Jirsa
>
>
> > On Jan 8, 2019, at 10:51 PM, Jinhua Luo <luajit...@gmail.com> wrote:
> >
> > Thanks. Let me clarify my questions more.
> >
> > 1) For the memtable, if the selected columns (assuming they are of simple
> > types) can be found in the memtable alone, why bother to search the
> > sstables at all? leveldb and rocksdb stop consulting sstables if
> > the memtable already fulfills the query.
>
> We stop at the memtable if we know that’s all we need. This depends on a lot 
> of factors (schema, point read vs slice, etc)
>
> >
> > 2) For STCS and LCS, obviously, the sstables are grouped in
> > generations (old mutations are promoted into the next level or bucket),
> > so why not search the columns level by level (or bucket by bucket)
> > until all selected columns are collected? leveldb and rocksdb
> > do it this way.
>
> They’re single machine and Cassandra isn’t. There’s no guarantee in Cassandra 
> that the small sstables in stcs or low levels in LCS are newest:
>
> - you can write arbitrary timestamps into the memtable
> - read repair can put old data in the memtable
> - streaming (bootstrap/repair) can put old data into new files
> - user processes (nodetool refresh) can put old data into new files
>
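
The point of the list above can be shown with a toy model. It is a sketch
under stated assumptions, not Cassandra's implementation: Cell, reconcile and
the two read functions are hypothetical names, and the tie-breaking rules for
equal timestamps are ignored. The sources are ordered the way a level-by-level
search would visit them, but the newest write sits in the oldest file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write timestamp chosen by the writer, not by flush order


def reconcile(cells):
    """Last-write-wins: the cell with the highest timestamp is the result."""
    return max(cells, key=lambda c: c.timestamp)


# Hypothetical sources, listed from "newest container" to "oldest container".
memtable = {"k": Cell("written with an old client timestamp", 100)}
new_small_sstable = {"k": Cell("streamed in by repair", 200)}
old_large_sstable = {"k": Cell("actually the latest write", 300)}
sources = [memtable, new_small_sstable, old_large_sstable]


def read_stop_at_first_hit(key):
    """leveldb-style read: return the first cell found, newest container first."""
    for source in sources:
        if key in source:
            return source[key]
    return None


def read_merge_all(key):
    """Collect the cell from every source that has the key, then reconcile."""
    found = [source[key] for source in sources if key in source]
    return reconcile(found) if found else None


first_hit = read_stop_at_first_hit("k")
merged = read_merge_all("k")
```

Here first_hit carries timestamp 100 while merged carries timestamp 300, so
stopping at the first container that has the key would return stale data.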
>
> >
> > 3) Could you explain the collection, cdt and counter types in more
> > detail? Do they need to iterate over all sstables, given that they
> > cannot be simply filtered by timestamp or value range?
> >
>
> I can’t (combination of time available and it’s been a long time since I’ve 
> dealt with that code and I don’t want to misspeak).
>
>
> > For a collection: when I select a column of a collection type, e.g.
> > map<text, text>, it is necessary to search all sstables to ensure
> > that the whole set of map entries is collected.
> >
> > For a cdt, all fields of the cdt need to be collected.
> >
> > For a counter, all mutations distributed across the sstables need to be
> > merged to produce the final counter value.
> >
> > Am I correct? If so, then these three complex types seem less
> > efficient than simple types, right?
> >
> > Jeff Jirsa <jji...@gmail.com> wrote on Tue, Jan 8, 2019 at 11:58 PM:
> >>
> >> First:
> >>
> >> Compaction controls how sstables are combined but not how they’re read. 
> >> The read path (with one tiny exception) doesn’t know or care which 
> >> compaction strategy you’re using.
> >>
> >> A few more notes inline.
> >>
> >>> On Jan 8, 2019, at 3:04 AM, Jinhua Luo <luajit...@gmail.com> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> The compaction would organize the sstables, e.g. with LCS, the
> >>> sstables would be categorized into levels, and the read path should
> >>> read sstables level by level until the read is fulfilled, correct?
> >>
> >> LCS levels are to minimize the number of sstables scanned - at most one 
> >> per level - but there’s no attempt to fulfill the read with low levels 
> >> beyond the filtering done by timestamp.
> >>
> >>>
> >>> For STCS, it would search sstables in buckets from smallest to largest?
> >>
> >> Nope. No attempt to do this.
> >>
> >>>
> >>> What about other compaction cases? They would iterate all sstables?
> >>
> >> In all cases, we’ll use a combination of bloom filters and sstable 
> >> metadata and indices to include / exclude sstables. If the bloom filter 
> >> hits, we’ll consider things like timestamps and whether or not the min/max 
> >> clustering of the sstable matches the slice we care about. We don’t 
> >> consult the compaction strategy, though the compaction strategy may have 
> >> (in the case of LCS or TWCS) placed the sstables into a state that makes 
> >> this read less expensive.
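
The include / exclude step described above can be sketched roughly as follows.
This is a simplified model, not the real read path: SSTableMeta and all
function names are hypothetical, a plain set stands in for the bloom filter
(a real bloom filter can return false positives), and clustering values are
plain integers.

```python
from dataclasses import dataclass


@dataclass
class SSTableMeta:
    name: str
    partition_keys: set   # stand-in for the bloom filter
    min_clustering: int   # smallest clustering value stored in the sstable
    max_clustering: int   # largest clustering value stored in the sstable
    max_timestamp: int    # newest write timestamp stored in the sstable


def may_contain(sst, key):
    """Bloom-filter check: can this sstable hold the partition at all?"""
    return key in sst.partition_keys


def overlaps_slice(sst, lo, hi):
    """Does the sstable's clustering range intersect the slice [lo, hi]?"""
    return sst.min_clustering <= hi and lo <= sst.max_clustering


def select_sstables(sstables, key, lo, hi):
    """Keep only sstables that might hold rows for `key` inside [lo, hi]."""
    return [s for s in sstables
            if may_contain(s, key) and overlaps_slice(s, lo, hi)]


def can_skip_by_timestamp(sst, newest_ts_found):
    """Simplified timestamp check: if everything requested was already found
    with a newer timestamp, an older sstable cannot change the result."""
    return sst.max_timestamp < newest_ts_found


# Hypothetical sstables; note that no compaction strategy is consulted.
tables = [
    SSTableMeta("a", {"k1", "k2"}, 0, 10, 500),
    SSTableMeta("b", {"k1"}, 20, 30, 400),
    SSTableMeta("c", {"k3"}, 0, 10, 300),
    SSTableMeta("d", {"k1"}, 5, 15, 100),
]
picked = [s.name for s in select_sstables(tables, "k1", 0, 12)]
```

With this data, picked contains "a" and "d": "b" is excluded by its clustering
range and "c" by the bloom-filter stand-in.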
> >>
> >>>
> >>> But the code confuses me a lot:
> >>> In 
> >>> org.apache.cassandra.db.SinglePartitionReadCommand#queryMemtableAndDiskInternal,
> >>> it seems that no matter whether the selected columns (except the
> >>> collection/cdt and counter cases, let's assume here the selected
> >>> columns are simple cell) are collected and satisfied, it would search
> >>> both memtable and all sstables, regardless of the compaction strategy.
> >>
> >> There’s another that includes timestamps that will do some smart-ish 
> >> exclusion of sstables that aren’t needed for the read command.
> >>
> >>>
> >>> Why?
> >>>
> >>> Moreover, for collection/cdt (non-frozen) and counter types, it would
> >>> need to iterate over all sstables to ensure the whole set of fields is
> >>> collected, correct? If so, such multi-cell or counter types are
> >>> heavyweight in performance, correct?
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >>> For additional commands, e-mail: user-h...@cassandra.apache.org
> >>>

