Thanks for this detailed description ... You mentioned the secondary index in a standard column - would it be better to build several indexes? Is it even possible to build an index on, for example, 32 columns?
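[Editorial aside: at the time of this thread Cassandra has no native secondary indexes, so "several indexes" effectively means several manually maintained index CFs, one per attribute you want to look blobs up by. A minimal sketch of the write path under that assumption, in pycassa-style Python - the keyspace and CF names ('BackupKS', 'Blobs', 'ByFile', 'ByDay') are hypothetical, and the connection API varies by client version:]

import pycassa

pool = pycassa.ConnectionPool('BackupKS', ['localhost:9160'])
blobs = pycassa.ColumnFamily(pool, 'Blobs')      # row key = blob hash
by_file = pycassa.ColumnFamily(pool, 'ByFile')   # one manually maintained
by_day = pycassa.ColumnFamily(pool, 'ByDay')     # "index" CF per attribute

def store_blob(blob_hash, data, file_id, day):
    # One write for the blob itself, plus one small column write per
    # index CF. 32 indexed attributes would simply mean 32 index CFs -
    # possible, but every extra index multiplies the write load.
    blobs.insert(blob_hash, {'data': data})
    by_file.insert(file_id, {blob_hash: ''})  # column name carries the hash
    by_day.insert(day, {blob_hash: ''})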
The hint with the smaller boxes is very valuable!

Mike

2010/7/26 Aaron Morton <aa...@thelastpickle.com>

> For what it's worth...
>
> * Many smaller boxes with local disk storage are preferable to 2 with
> huge NAS storage.
> * To cache the hash values, look at the KeysCached setting in the
> storage-config.
> * There are some row size limits, see
> http://wiki.apache.org/cassandra/CassandraLimitations
> * If you want to get 1000 blobs, rather than grouping them in a single
> row using a super column, consider building a secondary index in a
> standard column: one CF for the blobs using your hash, and one CF that
> uses whatever the grouping key is, with a column for every blob's hash
> value. Read from the index first, then from the blobs themselves.
>
> Aaron
>
>
> On 24 Jul, 2010, at 06:51 PM, Michael Widmann
> <michael.widm...@gmail.com> wrote:
>
> Hi Jonathan
>
> Thanks for your very valuable input on this.
>
> Maybe I didn't explain it well enough - so I'll try to clarify.
>
> Here are some thoughts:
>
> - Binary data will not be indexed - only stored.
> - The file name of the binary data (a hash) should be indexed for
> search.
> - We could group the hashes into 62 "entry" points for search
> retrieval -> I think supercolumns (if I have the terminology right)
> (a-z, A-Z, 0-9).
> - The 64k blobs' metadata (which blob belongs to which file) should
> be stored separately in Cassandra.
> - For hardware we rely on Solaris / OpenSolaris with ZFS in the
> backend.
> - Write operations occur much more often than reads.
> - Memory should mainly hold the hash values for fast search (not the
> binary data).
> - Read operations (restore from Cassandra) may be async - get about
> 1000 blobs, group them, restore.
>
> So my questions are:
>
> 2 or 3 big boxes, or 10 to 20 small boxes for storage?
> Could we separate "caching" - hash value CFs cached and indexed,
> binary data CFs not?
> Writes happen around the clock - not at tremendous speed, but
> constantly.
> Would compaction of the database need a lot of disk space?
> Is it reliable at this size (more my fear)?
>
> Thanks for thinking and answering...
>
> greetings
>
> Mike
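[Editorial aside: a minimal sketch of the read pattern Aaron describes - index row first, then the blob rows - combined with Mike's 62 entry points (bucketing hashes by their first character). Again pycassa-style Python; the keyspace and CF names ('BackupKS', 'Blobs', 'HashIndex') are hypothetical:]

import pycassa

pool = pycassa.ConnectionPool('BackupKS', ['localhost:9160'])
blobs = pycassa.ColumnFamily(pool, 'Blobs')      # row key = blob hash
index = pycassa.ColumnFamily(pool, 'HashIndex')  # row key = group/bucket key

def bucket(blob_hash):
    # 62 entry points: group hashes by first character (a-z, A-Z, 0-9).
    return blob_hash[0]

def fetch_batch(group_key, batch=1000):
    # 1) read up to `batch` hash values (column names) from the index row,
    hashes = list(index.get(group_key, column_count=batch).keys())
    # 2) then multiget the blob rows themselves.
    return blobs.multiget(hashes)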
> 2010/7/23 Jonathan Shook <jsh...@gmail.com>
>
>> There are two scaling factors to consider here. In general, the worst
>> case growth of operations in Cassandra is kept near to O(log2(N)). Any
>> worse growth would be considered a design problem, or at least a high
>> priority target for improvement. This is important when considering
>> the load generated by very large column families, as binary search is
>> used when the bloom filter doesn't exclude rows from a query.
>> O(log2(N)) is basically the best achievable growth for this type of
>> data, but the bloom filter improves on it in some cases by paying a
>> lower cost every time.
>>
>> The other factor to be aware of is the reduction of binary search
>> performance for datasets which push disk seek times into high ranges.
>> This is mostly a direct consideration for installations which will be
>> doing lots of cold reads (uncached data) against large sets. Disk
>> seek times are much lower for adjacent or nearby tracks, and generally
>> much higher when tracks are sufficiently far apart (as in a very large
>> data set). This can compound with other factors when session times are
>> longer, but that is to be expected with any system. Your storage
>> system may have completely different characteristics depending on
>> caching, etc.
>>
>> The read performance is still quite high relative to other systems for
>> a similar data set size, but the drop-off in performance may be much
>> worse than expected if you want it to be linear. Again, this is not
>> unique to Cassandra. It's just an important consideration when dealing
>> with extremely large sets of data, when memory is not likely to be
>> able to hold enough hot data for the specific application.
>>
>> As always, the real questions have a lot more to do with your specific
>> access patterns, storage system, etc. I would look at the benchmarking
>> info available on the lists as a good starting point.
>>
>>
>> On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
>> <michael.widm...@gmail.com> wrote:
>> > Hi
>> >
>> > We plan to use Cassandra as a data store on at least 2 nodes with
>> > RF=2 for about 1 billion small files.
>> > We have about 48TB of disk space behind each node.
>> >
>> > Now my question is: is this possible with Cassandra, reliably -
>> > meaning every blob is stored on 2 JBODs?
>> >
>> > We may grow up to nearly 40TB or more of Cassandra "storage" data...
>> >
>> > Has anyone out there done something similar?
>> >
>> > For retrieval of the blobs we are going to index them with a hash
>> > value (meaning hashes are used to store the blobs)...
>> > so we can search fast for the entry in the database and combine the
>> > blobs into a normal file again...
>> >
>> > Thanks for answering
>> >
>> > michael
>> >
>>
>
>
> --
> bayoda.com - Professional Online Backup Solutions for Small and Medium
> Sized Companies
>

--
bayoda.com - Professional Online Backup Solutions for Small and Medium
Sized Companies
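[Editorial aside: a minimal sketch of the overall scheme Michael outlines above - split each file into 64k blobs keyed by their hash, keep an ordered per-file manifest row, and walk the manifest to reassemble. pycassa-style Python with hypothetical names ('BackupKS', 'Blobs', 'Files'); real code would page through the manifest columns rather than fetch them in one call:]

import hashlib
import pycassa

pool = pycassa.ConnectionPool('BackupKS', ['localhost:9160'])
blobs = pycassa.ColumnFamily(pool, 'Blobs')   # row key = blob hash
files = pycassa.ColumnFamily(pool, 'Files')   # row key = file id (manifest)

CHUNK = 64 * 1024

def backup(file_id, fh):
    seq = 0
    while True:
        buf = fh.read(CHUNK)
        if not buf:
            break
        h = hashlib.sha1(buf).hexdigest()
        blobs.insert(h, {'data': buf})            # identical chunks share a row
        files.insert(file_id, {'%08d' % seq: h})  # zero-padded seq keeps order
        seq += 1

def restore(file_id, out):
    # column_count is capped for the sketch; production code should page.
    for seq, h in sorted(files.get(file_id, column_count=100000).items()):
        out.write(blobs.get(h, columns=['data'])['data'])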