Hi all - I am very new to Cassandra, so please bear with me if this is really a FAQ. We are exploring whether Cassandra is suitable for a data management project. The basic characteristics of the data are the following:
- It centers around data files. Each file can range from very small to very large, with 1 or 2 GB not uncommon, and each file has a set of properties (metadata) associated with it for various purposes. The eventual total size of the data can be big, but I am happy to start with < 500 TB.
- The data are likely generated and hosted in multiple, geographically dispersed locations; replication support and no single point of failure are *very much* desired.
- Data is usually written once and read multiple times, but occasionally a collection of data needs to be replaced altogether.

My questions are two-fold:

- Although Cassandra (and other decentralized NoSQL stores) has been reported to handle very large totals of data, my preliminary understanding is that the size of an individual column value is quite limited. I have read posts saying you shouldn't store files this big in Cassandra - store a path instead, for example, and let the file system handle the content. Is this true?
- If so, that approach would cut out much of the appeal of the peer-to-peer replication we so desire. Another reported solution is chunking: splitting each file into small pieces. Is this a common practice? Is there any client-side support for the split/merge, or is it completely up to the application? (A rough sketch of what I imagine the application-level approach would look like is below.)

I'd appreciate any input on this.

Thanks,
Ruby
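P.S. For concreteness, here is roughly the kind of application-level chunking I have in mind. It is only a sketch using the DataStax Python driver; the keyspace/table names, the 1 MB chunk size, and the schema are my own guesses rather than anything I have actually running:

    # Sketch of application-side chunking: one metadata row per file,
    # one row per fixed-size chunk keyed by (file_id, chunk_no).
    import uuid
    from cassandra.cluster import Cluster

    CHUNK_SIZE = 1024 * 1024  # 1 MB per column value, to stay well under size limits

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS filestore
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.set_keyspace("filestore")

    session.execute("""
        CREATE TABLE IF NOT EXISTS file_meta (
            file_id uuid PRIMARY KEY,
            name text,
            size bigint,
            properties map<text, text>
        )
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS file_chunks (
            file_id uuid,
            chunk_no int,
            data blob,
            PRIMARY KEY (file_id, chunk_no)
        )
    """)

    insert_meta = session.prepare(
        "INSERT INTO file_meta (file_id, name, size, properties) VALUES (?, ?, ?, ?)")
    insert_chunk = session.prepare(
        "INSERT INTO file_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)")
    select_chunks = session.prepare(
        "SELECT data FROM file_chunks WHERE file_id = ?")

    def put_file(path, properties):
        """Split a local file into fixed-size chunks and write them one by one."""
        file_id = uuid.uuid4()
        size = 0
        with open(path, "rb") as f:
            chunk_no = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                session.execute(insert_chunk, (file_id, chunk_no, chunk))
                size += len(chunk)
                chunk_no += 1
        # Write the metadata last, so a file only becomes "visible" once its chunks exist.
        session.execute(insert_meta, (file_id, path, size, properties))
        return file_id

    def get_file(file_id, out_path):
        """Reassemble a file by streaming its chunks back in clustering order."""
        with open(out_path, "wb") as out:
            for row in session.execute(select_chunks, (file_id,)):
                out.write(row.data)

One thing that worries me about a scheme like this is that all chunks of a file land in the same partition, so a 1-2 GB file becomes a very wide row. Is that something people work around in practice (e.g., by adding a bucket to the partition key), or is it a sign I should keep the blobs outside Cassandra entirely?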