Okay. That really tied a knot in my brain - it's twisting a little bit now. I'll have
to draw it on the whiteboard to understand it better ... but I've seen some very
interesting cornerstones in your answer for our project.

Really, thanks a lot.

Mike

2010/7/26 Aaron Morton <aa...@thelastpickle.com>

> I see, got carried away thinking about it, so here are some thoughts....
>
> Your access patterns will determine the best storage design, so this is
> probably not the best solution. I would welcome thoughts from others.
>
> => Standard CF: Chunks
>
> * key is the chunk hash
> * a col named 'data' whose value is the chunk data
> * another col named 'hash' whose value is the hash
> * could also store access data against the individual chunks here
> * probably cache lots of keys, and only a few rows
>
> - to test if the hashes you have already exist, do a multiget slice. But you
> need to specify a col to return the value for, and not the big data one, so
> use the 'hash' column.
>
> => Standard CF: CustomerFiles
>
> * key is the customer id
> * one col per file name, col value perhaps the latest version number or last
> accessed time
> * cache keys and rows
>
> - to get all the files for a customer, slice the customer's row
> - if you want access data when you list all files for a customer, use a
> super CF with a super col for each file name. Store the meta data for the
> file in this CF *and* in the Files CF.
>
> => Super CF: Files
>
> * key is client_id.file_name
> * a super column called "meta" with path / accessed etc., including versions
> * a super column called "current" with columns named "0001" etc. and col
> values set to the chunk hash
> * a super column called "version.X" with the same format as "current"
> * cache keys and rows
>
> - assumes meta is shared across versions
> - to rebuild a file, get_slice for all cols in the "current" super col and
> then do multi gets for the chunks
> - the row grows as versions of the file grow, but it only stores links to
> the chunks, so it's probably OK. Consider how many versions times how many
> chunks; you may want to make the number of rows grow instead of the row
> size. Perhaps have a FileVersions CF where the key includes the version
> number, then maintain information about the current version in both the
> Files and FileVersions CFs. The Files CF would only ever have the current
> file; update access meta data in both CFs.
>
> => Standard CF: ChunkUsage
>
> * key is the chunk hash
> * col name is a versioned file key cust_id.file_name.version, col value is
> the count of usage in that version
> * no caching
>
> - cannot increment a counter until Cassandra 0.7, so you cannot keep a
> running count of chunk usage.
> - this is the reverse index to see where the chunk is used.
> - does not consider what is current, just all-time usage.
> - not sure how much re-use there is, but row size grows with reuse. Should
> be OK for a couple of million cols.
>
> Oh, and if you're going to use Hadoop / Pig to analyse the data in this
> beastie you need to think about that in the design. You'll probably want a
> single CF to serve those queries with.
>
> Hope that helps
> Aaron
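A minimal sketch of the existence check and write path described above. The thread
does not name a client, so this assumes the pycassa Python client; the keyspace
name, connection details and the SHA-1 hash choice are illustrative, while the
'hash' and 'data' column names follow the Chunks layout above:

    import hashlib
    import pycassa

    pool = pycassa.ConnectionPool('Backup', ['localhost:9160'])
    chunks = pycassa.ColumnFamily(pool, 'Chunks')

    def store_chunks(blobs):
        """Store only the 64K blobs whose hash is not already present;
        return the per-blob hash list (duplicates kept for the file's
        chunk list)."""
        hashes = [hashlib.sha1(b).hexdigest() for b in blobs]
        # multiget slice on the small 'hash' column only -- never pull back
        # the big 'data' column just to test for existence
        existing = set(chunks.multiget(list(set(hashes)), columns=['hash']))
        for h, blob in zip(hashes, blobs):
            if h not in existing:
                chunks.insert(h, {'hash': h, 'data': blob})
                existing.add(h)
        return hashes

The point of the extra 'hash' column is exactly the one made above: the existence
test never has to deserialize the 64K 'data' value.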
> On 26 Jul 2010, at 17:29, Michael Widmann wrote:
>
> Hi
>
> Wow, that was a lot of information...
>
> Think of users storing files online (under their customer name) - each
> customer maintains his own "hashtable" of files. Each file can consist of a
> few or several thousand entries (depending on the size of the whole file).
>
> For example:
>
> File Test.doc consists of 3 * 64K blobs - each blob has the hash value ABC -
> so we only store one blob, one hash value and the fact that this blob is
> needed 3 times. That way we try to avoid duplicate file data (and there are
> a lot of duplicates).
>
> So our model is:
>
> Filename: Test.doc
> Hashes: 3
> Hash 1: ABC
> Hash 2: ABC
> Hash 3: ABC
> BLOB: ABC
> Used: 3
> Binary: Data 1*
>
> Each customer has:
>
> Customer: Customer ID
>
> Filename: Test.doc
> MetaData (Path / Accessed / Modified / Size / Compression / OS-Type from /
> Security)
> Version: 0
> Hash 1 / Hash 2 / Hash 3
>
> Filename: Another.doc
> MetaData (Path / Accessed / Modified / Size / Compression / OS-Type from /
> Security)
> Version: 0
> Hash 1 / Hash 2 / Hash 3
> Version: 1
> Hash 1 / Hash 2 / Hash 4
> Version: 2
> Hash 3 / Hash 2 / Hash 2
>
> Hope this clears some things up :-)
>
> Mike
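To make the two messages concrete, here is a rough sketch (same pycassa assumption
as above, key and column names illustrative) of how the Test.doc example could land
in the Files and ChunkUsage CFs from the earlier layout - the three identical chunks
are stored once in Chunks, and the file's "current" super column simply references
the same hash three times:

    import pycassa

    pool = pycassa.ConnectionPool('Backup', ['localhost:9160'])
    files = pycassa.ColumnFamily(pool, 'Files')          # super CF
    chunk_usage = pycassa.ColumnFamily(pool, 'ChunkUsage')

    # key is client_id.file_name; "current" maps offset -> chunk hash
    files.insert('customer_1.Test.doc', {
        'meta': {'path': '/docs/Test.doc', 'size': str(3 * 65536), 'version': '0'},
        'current': {'0001': 'ABC', '0002': 'ABC', '0003': 'ABC'},
    })

    # reverse index: chunk hash -> usage per versioned file key. The count is
    # written as a plain value because counters only arrive in Cassandra 0.7.
    chunk_usage.insert('ABC', {'customer_1.Test.doc.0': '3'})

Without counters, keeping that usage value accurate means a read-modify-write on
every new version, which is why the layout records one column per versioned file
key rather than a single running total.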
> 2010/7/26 Aaron Morton <aa...@thelastpickle.com>
>
>> Some background reading...
>> http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
>>
>> Not sure on your follow-up question, so I'll just wildly blather on about
>> things :)
>>
>> My assumption about your data is that you have 64K chunks identified by a
>> hash, which can somehow be grouped together into larger files (so there is
>> a "file name" of sorts).
>>
>> One possible storage design (assuming the RandomPartitioner) is....
>>
>> A Chunks CF: each row uses the hash of the chunk as its key and has a
>> single column with the chunk data. You could use more columns to store
>> meta data here.
>>
>> A ChunkIndex CF: each row uses the file name (from above) as the key and
>> has one column for each chunk in the file. The column name *could* be the
>> offset of the chunk and the column value the hash of the chunk. Or you
>> could use the chunk hash as the col name and the offset as the col value
>> if needed.
>>
>> To rebuild the file, read the entire row from the ChunkIndex, then make a
>> series of multi gets to read all the chunks. Or you could lazily fetch
>> only the ones you need.
>>
>> This all assumes that the "1000s" comment below means you may want to
>> combine the chunks into 60+ MB files. It would be easier to keep all the
>> chunks for a file together in one row, but if you are going to have large
>> (unbounded) file sizes this may not be appropriate.
>>
>> You could also think about using the OrderPreservingPartitioner and a
>> compound key for each row such as "file_name_hash.offset". Then, by using
>> get_range_slices to scan the range of chunk rows for a file, you would not
>> need to maintain a secondary index. There are some drawbacks to that
>> approach; read the article above.
>>
>> Hope that helps
>> Aaron
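A rough sketch of that OrderPreservingPartitioner variant, under the same pycassa
assumption; it presumes zero-padded offsets so the compound keys sort correctly,
and all names are illustrative:

    import pycassa

    pool = pycassa.ConnectionPool('Backup', ['localhost:9160'])
    chunks = pycassa.ColumnFamily(pool, 'Chunks')

    def rebuild_file(file_name_hash):
        """Reassemble a file by scanning the key range
        <file_name_hash>.00000000 .. <file_name_hash>.99999999.
        Only meaningful under the OrderPreservingPartitioner."""
        parts = []
        rows = chunks.get_range(start=file_name_hash + '.00000000',
                                finish=file_name_hash + '.99999999',
                                columns=['data'])
        for key, cols in rows:
            parts.append(cols['data'])
        return b''.join(parts)

The trade-off is the one in the linked article: order-preserving keys make range
scans cheap, but make it much harder to keep the load balanced across nodes.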
>> On 26 Jul 2010, at 04:01 PM, Michael Widmann <michael.widm...@gmail.com>
>> wrote:
>>
>> Thanks for this detailed description ...
>>
>> You mentioned the secondary index in a standard column - would it be
>> better to build several indices? Is it even possible to build an index
>> on, for example, 32 columns?
>>
>> The hint about the smaller boxes is very valuable!
>>
>> Mike
>>
>> 2010/7/26 Aaron Morton <aa...@thelastpickle.com>
>>
>>> For what it's worth...
>>>
>>> * Many smaller boxes with local disk storage are preferable to 2 boxes
>>> with huge NAS storage.
>>> * To cache the hash values, look at the KeysCached setting in the
>>> storage-config.
>>> * There are some row size limits, see
>>> http://wiki.apache.org/cassandra/CassandraLimitations
>>> * If you want to get 1000 blobs, rather than grouping them in a single
>>> row using a super column, consider building a secondary index in a
>>> standard column family: one CF for the blobs keyed by your hash, and one
>>> CF keyed by whatever the grouping key is, with a col for every blob's
>>> hash value. Read from the index first, then from the blobs themselves.
>>>
>>> Aaron
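A sketch of that index-first read path (same pycassa assumption; the CF names, the
zero-padded offset convention and the batch size are illustrative), fetching the
blobs in batches of roughly 1000 as discussed further down the thread:

    import pycassa

    pool = pycassa.ConnectionPool('Backup', ['localhost:9160'])
    blob_index = pycassa.ColumnFamily(pool, 'BlobIndex')   # grouping key -> hashes
    blobs = pycassa.ColumnFamily(pool, 'Blobs')            # hash -> 64K blob

    def restore(group_key, batch_size=1000):
        """Read the index row (col name = zero-padded offset, value = blob
        hash), then multiget the blobs themselves in batches."""
        # a real implementation would page with column_start for very wide rows
        index_row = blob_index.get(group_key, column_count=100000)
        ordered_hashes = [index_row[offset] for offset in sorted(index_row)]
        fetched = {}
        for i in range(0, len(ordered_hashes), batch_size):
            wanted = list(set(ordered_hashes[i:i + batch_size]))  # skip repeats
            fetched.update(blobs.multiget(wanted, columns=['data']))
        return b''.join(fetched[h]['data'] for h in ordered_hashes)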
>>> On 24 Jul 2010, at 06:51 PM, Michael Widmann <michael.widm...@gmail.com>
>>> wrote:
>>>
>>> Hi Jonathan
>>>
>>> Thanks for your very valuable input on this.
>>>
>>> Maybe I didn't explain enough - so I'll try to clarify.
>>>
>>> Here are some thoughts:
>>>
>>> - binary data will not be indexed - only stored
>>> - the file name of the binary data (a hash) should be indexed for search
>>> - we could group the hashes into 62 "entry" points (a-z, A-Z, 0-9) for
>>> search and retrieval -> I think supercolumns, if I have the terms right
>>> - the 64K blobs' meta data (which one belongs to which file) should be
>>> stored separately in Cassandra
>>> - for hardware we rely on Solaris / OpenSolaris with ZFS in the backend
>>> - write operations occur much more often than reads
>>> - memory should mainly hold the hash values for fast search (not the
>>> binary data)
>>> - read operations (restore from Cassandra) may be async - get about 1000
>>> blobs, group them, restore
>>>
>>> So my questions also are:
>>>
>>> 2 or 3 big boxes, or 10 to 20 small boxes for storage?
>>> Could we separate "caching" - hash value CFs cached and indexed, binary
>>> data CFs not?
>>> Writes happen around the clock - not at tremendous speed, but constantly.
>>> Would compaction of the database need a lot of disk space?
>>> Is it reliable at this size? (more my fear)
>>>
>>> Thanks for thinking and answering...
>>>
>>> greetings
>>>
>>> Mike
>>>
>>> 2010/7/23 Jonathan Shook <jsh...@gmail.com>
>>>
>>>> There are two scaling factors to consider here. In general, the worst
>>>> case growth of operations in Cassandra is kept near O(log2(N)). Any
>>>> worse growth would be considered a design problem, or at least a high
>>>> priority target for improvement. This is important when considering
>>>> the load generated by very large column families, as binary search is
>>>> used when the bloom filter doesn't exclude rows from a query.
>>>> O(log2(N)) is basically the best achievable growth for this type of
>>>> data, but the bloom filter improves on it in some cases by paying a
>>>> lower cost every time.
>>>>
>>>> The other factor to be aware of is the reduction in binary search
>>>> performance for data sets that push disk seek times into high ranges.
>>>> This is mostly a direct consideration for installations which will be
>>>> doing lots of cold reads (not cached data) against large sets. Disk
>>>> seek times are much lower for adjacent or nearby tracks, and generally
>>>> much higher when tracks are sufficiently far apart (as in a very large
>>>> data set). This can compound with other factors when session times are
>>>> longer, but that is to be expected with any system. Your storage system
>>>> may have completely different characteristics depending on caching, etc.
>>>>
>>>> The read performance is still quite high relative to other systems for
>>>> a similar data set size, but the drop-off in performance may be much
>>>> worse than expected if you are wanting it to be linear. Again, this is
>>>> not unique to Cassandra. It's just an important consideration when
>>>> dealing with extremely large sets of data, when memory is not likely
>>>> to be able to hold enough hot data for the specific application.
>>>>
>>>> As always, the real questions have a lot more to do with your specific
>>>> access patterns, storage system, etc. I would look at the benchmarking
>>>> info available on the lists as a good starting point.
>>>>
>>>> On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
>>>> <michael.widm...@gmail.com> wrote:
>>>> > Hi
>>>> >
>>>> > We plan to use Cassandra as data storage on at least 2 nodes with
>>>> > RF=2 for about 1 billion small files.
>>>> > We have about 48TB of disk space behind each node.
>>>> >
>>>> > Now my question is - is this possible with Cassandra, reliably -
>>>> > meaning every blob is stored on 2 JBODs?
>>>> >
>>>> > We may grow up to nearly 40TB or more of Cassandra "storage" data ...
>>>> >
>>>> > Has anyone out there done something similar?
>>>> >
>>>> > For retrieval of the blobs we are going to index them with a hash
>>>> > value (meaning hashes are used to store the blob) ...
>>>> > so we can search fast for the entry in the database and combine the
>>>> > blobs into a normal file again ...
>>>> >
>>>> > Thanks for the answers
>>>> >
>>>> > Michael

--
bayoda.com - Professional Online Backup Solutions for Small and Medium
Sized Companies