Re: Direct IO with Spark and Hadoop over Cassandra

platon.tema Tue, 16 Sep 2014 06:20:14 -0700

Yes, updates and deletes are a problem. At the moment, for updates, we refresh the collected result data with a query to C* (Java driver) before reporting to the user. For deletes, we can skip the affected records during scanning, for example by TTL (not tested yet).


On 09/16/2014 04:53 PM, moshe.kr...@barclays.com wrote:

You will also have to read/resolve multiple row instances (if you update records) and tombstones (if you delete records) yourself.
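
To make the point about resolving row instances and tombstones concrete, here is a minimal Python sketch of last-write-wins reconciliation. It assumes a simplified model that is not the real sstable format: each version is a (key, timestamp, value) record and a value of None stands for a tombstone. The names CellVersion and reconcile are hypothetical, and the tie-break (a tombstone wins on equal timestamps) is a simplification of what C* does internally.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CellVersion:
    """One version of a value as it might appear in some sstable (simplified, hypothetical model)."""
    key: str
    timestamp: int          # write timestamp; higher means newer
    value: Optional[str]    # None models a tombstone (a delete marker)


def reconcile(versions: Iterable[CellVersion]) -> Dict[str, str]:
    """Keep the newest version per key, then drop keys whose newest version is a tombstone."""
    newest: Dict[str, CellVersion] = {}
    for v in versions:
        cur = newest.get(v.key)
        is_newer = cur is None or v.timestamp > cur.timestamp
        # Assumption: on equal timestamps the delete wins over the live value.
        wins_tie = cur is not None and v.timestamp == cur.timestamp and v.value is None
        if is_newer or wins_tie:
            newest[v.key] = v
    return {k: v.value for k, v in newest.items() if v.value is not None}


if __name__ == "__main__":
    # Versions of the same keys scattered over several (imaginary) sstables.
    scanned = [
        CellVersion("a", 1, "old"),
        CellVersion("a", 2, "new"),   # update: supersedes the version at timestamp 1
        CellVersion("b", 1, "x"),
        CellVersion("b", 3, None),    # delete: tombstone hides the version at timestamp 1
    ]
    print(reconcile(scanned))
```

A job that scans sstables directly would need a step like this across every sstable holding a given partition, which is the work the normal read path otherwise does for you.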
*From:* platon.tema [mailto:platon.t...@yandex.ru]
*Sent:* Tuesday, September 16, 2014 1:51 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Direct IO with Spark and Hadoop over Cassandra

Thanks.
But 1) can be overcome with the C* API for the commitlog and memtables, or with mixed access (direct IO plus traditional connectors, or pure CQL if the data model allows; we have experimented with this).
2) is more complex for a universal solution. In our case C* is used without replication (RF=1) because of the huge data size (replication is too expensive).
On 09/16/2014 03:40 PM, DuyHai Doan wrote:

    If you access the C* sstables directly from those frameworks, you
    will:

    1) miss live data that is in memory and not yet flushed to disk

    2) skip the Dynamo layer of C* responsible for data consistency

    On 16 Sept. 2014 10:58, "platon.tema" <platon.t...@yandex.ru
    <mailto:platon.t...@yandex.ru>> wrote:

    Hi.

    As far as I can see, the massive data processing tools (map/reduce)
    for C* data include

    connectors
    - Calliope http://tuplejump.github.io/calliope/
    - Datastax spark cassandra connector
    https://github.com/datastax/spark-cassandra-connector
    - Stratio Deep https://github.com/Stratio/stratio-deep
    - other free\commercial

    runtime (job management and infrastructure)
    - Spark
    - Hadoop

    But if I'm not mistaken, all these solutions load data over the
    network. In the best case, the logic instance (some "job") runs on
    the same node where the corresponding range was found.

    Why can't this logic use direct C* IO (reading sstables from disk)?
    Any cons?

    Some time ago I read an article (I still can't find it) about
    academic research in which Hadoop was modified to support this
    direct IO mode. According to its benchmarks, direct IO gave a
    significant performance increase.

