Re: Cassandra bulk import confusion

aaron morton Mon, 01 Aug 2011 14:55:42 -0700

In case you missed it, fresh off the press:
http://www.datastax.com/dev/blog/bulk-loading


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 30 Jul 2011, at 04:10, Jeff Schmidt wrote:

> Hello:
> 
> I'm relatively new to Cassandra, but I've been searching around, and it looks 
> like Cassandra 0.8.x has improved support for bulk importing of data.  I keep 
> finding references to the json2sstable command, and I've read about that on 
> the DataStax and Apache documentation pages.
> 
> There's a lot of detail here if you want it, otherwise please skip to the 
> end. json2sstable seems to run successfully, but I cannot see the data in the 
> new CF using the CLI.
> 
> My goal is to extract data from various sources, munge it together in some 
> manner, and then bulk load it into Cassandra.  That is as opposed to using 
> Hector to programmatically insert the data.  I'd like to deploy these files 
> to the cloud (Puppet) and then instruct Cassandra to bulk load them, and then 
> inform the application that new data exists.  This is for a periodic content 
> update of certain column families of curated, read-only data that occurs on 
> a monthly basis. I'm thinking of using JMX to signal the application to 
> switch to a new set of CFs and keep running w/o downtime.  At a later time, 
> I'll delete the old CFs.
> 
> I'm using Cassandra 0.8.2 and I'm just playing with this concept.  I create a 
> test CF using the CLI
> 
> [default@Ingenuity] use Test;
> Authenticated to keyspace: Test
> [default@Test] create column family TestCF with comparator = UTF8Type and 
> column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
> 28991070-b9f9-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] update column family TestCF with 
> key_validation_class=UTF8Type; 
> 2af88440-b9f9-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';  
> Value inserted.
> [default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';  
> Value inserted.
> [default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003'; 
> Value inserted.
> [default@Test] list TestCF;
> Using default limit of 100
> -------------------
> RowKey: EG|3030
> => (column=nodeId, value=ING:002, timestamp=1311954072252000)
> -------------------
> RowKey: EG|3031
> => (column=nodeId, value=ING:003, timestamp=1311954073631000)
> -------------------
> RowKey: SID|123
> => (column=nodeId, value=ING:001, timestamp=1311954072249000)
> 
> 3 Rows Returned.
> [default@Test] 
> 
> Now, cassandra.yaml is stock, except I changed it to place the data in a 
> non-default location:
> 
> # directories where Cassandra should store data on disk.
> data_file_directories:
>     - /usr/local/ingenuity/isec/cassandra/datastore/data
> 
> # commit log
> commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog
> 
> # saved caches
> saved_caches_directory: 
> /usr/local/ingenuity/isec/cassandra/datastore/saved_caches
> 
> In that data directory:
> 
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> [imac:datastore/data/Test] jas% 
> 
> There is nothing there.  Perhaps Cassandra has not yet felt the need to write 
> the SSTables.  So, since I need to reference an actual data file with 
> sstable2json, I ran nodetool flush:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost 
> flush Test TestCF
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Now, I have files!
> 
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> TestCF-g-1-Data.db            TestCF-g-1-Index.db
> TestCF-g-1-Filter.db          TestCF-g-1-Statistics.db
> [imac:datastore/data/Test] jas% 
> 
> Given that, I'm able to run sstable2json and I can see I'm getting what's in 
> that CF:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas%  bin/sstable2json 
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > 
> testcf.jason
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason 
> {
> "45477c33303330": [["nodeId","ING:002",1311954072252000]],
> "45477c33303331": [["nodeId","ING:003",1311954073631000]],
> "5349447c313233": [["nodeId","ING:001",1311954072249000]]
> }
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Oops, okay, that file extension should be json not jason, but oh well... :)
> 
> Okay, so now I have data in the proper format for importing with 
> json2sstable.  Like I said, I want to import this data into a new CF. Let's 
> call it TestCF2 (in the same keyspace):
> 
> [default@Test] create column family TestCF2 with comparator = UTF8Type and 
> column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
> 4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] update column family TestCF2 with 
> key_validation_class=UTF8Type; 
> 5092dec0-b9fa-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] 
> 
> Again there are no files created in the data directory, so I do a flush for 
> the new CF:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost 
> flush Test TestCF2
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Well, that did not help, still no files for TestCF2.  There is no actual data 
> yet, so I'm guessing the system tables have what they need. So, I go ahead 
> and import the data using json2sstable:
> 
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c 
> TestCF2 testcf.jason 
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
> Importing 3 keys...
> 3 keys imported successfully.
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% 
> 
> Okay, and the files did show up:
> 
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> TestCF-g-1-Data.db            TestCF2-g-1-Data.db
> TestCF-g-1-Filter.db          TestCF2-g-1-Filter.db
> TestCF-g-1-Index.db           TestCF2-g-1-Index.db
> TestCF-g-1-Statistics.db      TestCF2-g-1-Statistics.db
> [imac:datastore/data/Test] jas% 
> 
> Back in the CLI:
> 
> [default@Test] list TestCF2;
> Using default limit of 100
> 
> 0 Row Returned.
> [default@Test] 
> 
> However, if I edit TestCF-g-1-Data.db, I can sort of see the data is present. 
>  Quitting and starting the CLI has no effect. What gets the CF data into 
> the MemTables so it's accessible to a Cassandra client?   I tried various 
> nodetool commands (repair, compact, cleanup, flush, invalidatekeycache, 
> invalidaterowcache) and I don't see any rows for TestCF2 in the CLI.
> 
> Anyway, it seems this procedure works as I'd expect, well except for not 
> seeing the new data. :)
> 
> What am I missing here?
> 
> Thanks,
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> http://www.535consulting.com
> (650) 423-1068
> 
>
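
A side note on the dump format quoted above: the row keys in the
sstable2json output are the hex-encoded bytes of the original keys
(45477c33303330 is EG|3030), while the column names and values appear as
plain strings. Below is a minimal Python sketch of how a file in that same
shape might be generated from in-memory rows before handing it to
json2sstable. The helper names (key_to_hex, hex_to_key, build_dump) are
hypothetical, and the sketch assumes UTF-8 row keys and the three-element
[name, value, timestamp] column layout shown in the 0.8.2 output; other
versions or column types may expect something different.

```python
import json


def key_to_hex(key):
    """Hex-encode a UTF-8 row key, matching the key format in the dump above."""
    return key.encode("utf-8").hex()


def hex_to_key(hex_key):
    """Decode a hex row key from a dump back into a readable string."""
    return bytes.fromhex(hex_key).decode("utf-8")


def build_dump(rows):
    """Build a JSON document shaped like the sstable2json output above.

    rows maps each row key to a list of (column_name, value, timestamp)
    tuples. Timestamps are passed through unchanged (the thread shows
    microsecond values).
    """
    dump = {
        key_to_hex(key): [[name, value, ts] for (name, value, ts) in columns]
        for key, columns in rows.items()
    }
    return json.dumps(dump, indent=0)


if __name__ == "__main__":
    rows = {
        "EG|3030": [("nodeId", "ING:002", 1311954072252000)],
        "SID|123": [("nodeId", "ING:001", 1311954072249000)],
    }
    print(build_dump(rows))
    print(hex_to_key("45477c33303331"))
```

This only covers producing the input file. Whether the resulting SSTable
becomes visible to clients is a separate question, which is what the blog
post linked in the reply addresses.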
