There are two scaling factors to consider here. In general, the
worst-case growth of operations in Cassandra is kept close to
O(log2(N)). Anything worse would be considered a design problem, or
at least a high-priority target for improvement. This matters when
estimating the load generated by very large column families, because
binary search is used whenever the bloom filter doesn't exclude rows
from a query. O(log2(N)) is basically the best achievable growth for
this type of data, but the bloom filter improves on it in some cases:
when it can rule a row out, the binary search is skipped altogether,
in exchange for a small, fixed cost paid on every lookup.
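
To make the interplay concrete, here is a minimal Python sketch of a
bloom filter sitting in front of a binary search over sorted keys. It
is a toy illustration only, not Cassandra's implementation: the
BloomFilter class, the lookup() helper, the bit-array size, and the
hash scheme are all assumptions made up for this example.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: never a false negative, sometimes a false positive."""

    def __init__(self, num_bits=4096, num_hashes=3):
        # Sizes are arbitrary values picked for the example.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(sorted_keys, bloom, key):
    """Return (found, probes), where probes counts binary search steps."""
    if not bloom.might_contain(key):
        # Definite miss: the O(log2(N)) search is skipped entirely.
        return False, 0
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == key
    return found, probes


keys = sorted(f"row{i:04d}" for i in range(100))
bloom = BloomFilter()
for k in keys:
    bloom.add(k)

print(lookup(keys, bloom, "row0042"))   # present: pays the binary search
print(lookup(keys, bloom, "missing"))   # absent: usually rejected with 0 probes
```

A key that is present always pays the filter check plus the search.
A key that is absent is usually rejected by the filter alone, though
a false positive will still fall through to the search.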

The other factor to be aware of is that binary search performance
degrades on datasets large enough to push disk seek times into their
high range. This mostly matters for installations that will be doing
lots of cold reads (data not in cache) against large sets. Disk seek
times are much lower between adjacent or nearby tracks, and generally
much higher when the tracks are far apart (as in a very large data
set). This can compound with other factors when session times are
longer, but that is to be expected with any system. Your storage
system may have completely different characteristics depending on
caching, etc.
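
As a rough back-of-envelope for the two factors combined, the sketch
below multiplies the worst-case number of binary search probes by an
assumed per-probe seek time. The function name and both seek times
are hypothetical numbers chosen for illustration, not measurements,
and treating every probe as a disk seek is a deliberately pessimistic
assumption rather than a model of Cassandra's actual read path.

```python
import math


def cold_read_estimate_ms(num_rows, seek_ms):
    """Worst-case binary search probes times an assumed seek cost per probe."""
    probes = math.ceil(math.log2(num_rows))
    return probes * seek_ms


# 1 billion entries, as in the question below; seek times are assumed.
for label, seek_ms in (("near tracks", 1.0), ("far tracks", 10.0)):
    estimate = cold_read_estimate_ms(10**9, seek_ms)
    print(f"{label}: about {estimate:.0f} ms per uncached lookup")
```

The probe count grows only logarithmically with the data set, so in
this model the per-probe seek cost is what dominates the estimate.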

Read performance is still quite high relative to other systems for a
similar data set size, but the drop-off may be much worse than
expected if you are counting on it being linear. Again, this is not
unique to Cassandra. It's just an important consideration when
dealing with extremely large sets of data, where memory is unlikely
to hold enough hot data for the specific application.

As always, the real answers depend much more on your specific access
patterns, storage system, etc. I would look at the benchmarking info
available on the lists as a good starting point.

On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
<michael.widm...@gmail.com> wrote:
> Hi
>
> We plan to use cassandra as a data storage on at least 2 nodes with RF=2
> for about 1 billion small files.
> We do have about 48TB discspace behind for each node.
>
> now my question is - is this possible with cassandra - reliable - means
> (every blob is stored on 2 jbods)..
>
> we may grow up to nearly 40TB or more on cassandra "storage" data ...
>
> anyone out did something similar?
>
> for retrieval of the blobs we are going to index them with an hashvalue
> (means hashes are used to store the blob) ...
> so we can search fast for the entry in the database and combine the blobs to
> a normal file again ...
>
> thanks for answer
>
> michael
>
