Thanks.
But 1) can be overcome with the C* API for the commitlog and memtables, or
with mixed access (direct IO plus traditional connectors, or pure CQL if the
data model allows; we have experimented with it).
2) is harder to solve in a universal way. In our case C* is used without
replication (RF=1) because of the huge data size (replication is too expensive).
On 09/16/2014 03:40 PM, DuyHai Doan wrote:
If you access the C* sstables directly from those frameworks, you will:
1) miss live data that is still in memory and not yet flushed to disk
2) skip the Dynamo layer of C*, which is responsible for data consistency
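To illustrate point 1), here is a minimal toy sketch in Python. It is not Cassandra code: the class ToyLsmStore and its methods are hypothetical names made up for this example, and it only models the general idea of an in-memory memtable sitting in front of immutable on-disk tables. Under those assumptions, a reader that looks only at the flushed tables sees stale or missing values.

```python
# Toy model of an LSM-style store. NOT Cassandra code; all names are hypothetical.
class ToyLsmStore:
    def __init__(self):
        self.memtable = {}   # recent writes, held in memory only
        self.sstables = []   # immutable flushed snapshots, oldest first

    def write(self, key, value):
        # New writes land in the memtable and are not yet on "disk".
        self.memtable[key] = value

    def flush(self):
        # Dump the current memtable into a new immutable table.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def _read_flushed(self, key):
        # Search flushed tables from newest to oldest.
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def read(self, key):
        # Normal read path: memtable first, then the flushed tables.
        if key in self.memtable:
            return self.memtable[key]
        return self._read_flushed(key)

    def read_sstables_only(self, key):
        # "Direct IO" style read: ignores whatever is still in memory.
        return self._read_flushed(key)


store = ToyLsmStore()
store.write("k1", "old")
store.flush()                 # "old" is now in a flushed table
store.write("k1", "new")      # overwrite, still in memory
store.write("k2", "fresh")    # brand new key, still in memory

via_read_path = (store.read("k1"), store.read("k2"))
via_direct_io = (store.read_sstables_only("k1"), store.read_sstables_only("k2"))
print(via_read_path)   # ('new', 'fresh')
print(via_direct_io)   # ('old', None)
```

The sketch says nothing about point 2): replica reconciliation is not modelled here at all.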
On 16 Sept 2014 10:58, "platon.tema" <platon.t...@yandex.ru
<mailto:platon.t...@yandex.ru>> wrote:
Hi.
As I see it, massive data processing tools (map/reduce) for C* data
include:
connectors
- Calliope http://tuplejump.github.io/calliope/
- Datastax spark cassandra connector
https://github.com/datastax/spark-cassandra-connector
- Stratio Deep https://github.com/Stratio/stratio-deep
- other free/commercial
runtime (job management and infrastructure)
- Spark
- Hadoop
But if I'm not mistaken, all these solutions load data over the
network. In the best case the logic instance (some "job") runs on the
same node where the corresponding range was found.
Why can't this logic use direct C* IO (reading sstables from disk)?
Are there any cons?
Some time ago I read an article (I still can't find it) about
academic research in which Hadoop was modified to support this
direct IO mode. According to those benchmarks, direct IO gave a
significant performance increase.