hi, all -

I am very new to Cassandra, so please bear with me if this is really a
FAQ. We are exploring whether Cassandra is a suitable fit for a data
management project. The basic characteristics of the data are the
following:

- it centers around data files; each file can range from very small to
very large, with 1 or 2 GB not uncommon. Each file has a set of
properties (metadata) associated with it for various purposes. The
eventual total size of the data can be big, but I am happy to start
with <500 TB.

- the data will likely be generated and hosted in multiple,
geographically dispersed locations; replication support and no single
point of failure are *very much* desired.

- data is usually written once and read multiple times, but
occasionally a collection of data needs to be replaced altogether.

My questions are two-fold:

- Although Cassandra (and other decentralized NoSQL data stores) has
been reported to handle very large total data volumes, my preliminary
understanding is that individual "column values" are quite limited in
size. I have read posts saying you shouldn't store files this big in
Cassandra, and should instead store a path and let the file system
handle the bytes (roughly the sketch below). Is this true?
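
To make sure I understand the "store a path" suggestion, here is a
minimal sketch of what I imagine it means: only metadata and a
filesystem path go into Cassandra, and the large payload stays on a
shared filesystem. The insert_row(row_key, columns) callback is
hypothetical and would be whatever client library we end up using.

    def register_file(insert_row, file_id, path, properties):
        """Store only the file's metadata and its filesystem path as columns."""
        columns = dict(properties)   # e.g. {"owner": "...", "created": "..."}
        columns["path"] = path       # the large payload itself stays on disk
        insert_row(file_id, columns) # hypothetical Cassandra client call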

- That approach (if it is indeed the recommendation) would cut much of
the appeal of the p2p replication we so desire. Another reported
solution is chunking: splitting each file into small chunks. Is this
common practice? Is there any client-side support for the split/merge,
or is it completely up to the application (something like the sketch
after this list)?
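
For reference, this is roughly what I expect application-level chunking
would look like, assuming hypothetical store_chunk(row_key, index, data)
and fetch_chunk(row_key, index) callbacks backed by whatever Cassandra
client we choose; the chunk size is just a guess to stay well under any
column-value limit.

    import hashlib

    CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per chunk; assumed, not a Cassandra constant

    def split_and_store(path, store_chunk):
        """Read a file in fixed-size chunks and hand each chunk to store_chunk."""
        row_key = hashlib.sha1(path.encode("utf-8")).hexdigest()  # one row per file
        index = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                store_chunk(row_key, index, chunk)
                index += 1
        return row_key, index  # chunk count is needed later to reassemble

    def reassemble(row_key, chunk_count, fetch_chunk, out_path):
        """Fetch chunks in order and concatenate them back into a file."""
        with open(out_path, "wb") as out:
            for index in range(chunk_count):
                out.write(fetch_chunk(row_key, index))

If the application really has to own this split/merge logic, that is
what I would like to confirm.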

I'd appreciate any input on this.

Thanks

Ruby
