Cassandra bulk import confusion

Jeff Schmidt Fri, 29 Jul 2011 09:24:36 -0700

Hello:

I'm relatively new to Cassandra, but I've been searching around, and it looks 
like Cassandra 0.8.x has improved support for bulk importing of data.  I keep 
finding references to the json2sstable command, and I've read about that on the 
Datastax and Apache documentation pages.


There's a lot of detail here if you want it; otherwise, please skip to the end. 
In short, json2sstable seems to run successfully, but I cannot see the data in 
the new CF using the CLI.

My goal is to extract data from various sources, munge it together in some 
manner, and then bulk load it into Cassandra, as opposed to using Hector to 
programmatically insert the data.  I'd like to deploy these files to the cloud 
(Puppet), instruct Cassandra to bulk load them, and then inform the 
application that new data exists.  This is for a periodic content update of 
certain column families of curated, read-only data that occurs on a monthly 
basis. I'm thinking of using JMX to signal the application to switch to a new 
set of CFs and keep running w/o downtime.  At a later time, I'll delete the old 
CFs.
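
As a rough illustration of the "munge it together" step, here is a minimal 
Python sketch that builds a structure in the same shape as the sstable2json 
dump shown further down in this message: hex-encoded row keys mapping to lists 
of [column, value, timestamp] entries, with timestamps in microseconds.  The 
function name build_sstable_json and the sample rows are hypothetical, UTF-8 
keys are assumed, and whether json2sstable accepts this output as-is 
(row ordering in particular) has not been verified.

```python
import json
import time


def build_sstable_json(rows, timestamp_us=None):
    """Convert {row_key: {column: value}} into the dict shape seen in the
    sstable2json dump: {hex_key: [[column, value, timestamp], ...]}.

    Hypothetical helper; assumes UTF-8 row keys and string column values.
    """
    if timestamp_us is None:
        # The dump uses microseconds since the epoch for timestamps.
        timestamp_us = int(time.time() * 1000000)
    out = {}
    for key, columns in rows.items():
        hex_key = key.encode("utf-8").hex()
        out[hex_key] = [
            [name, value, timestamp_us]
            for name, value in sorted(columns.items())
        ]
    return out


if __name__ == "__main__":
    sample = {
        "SID|123": {"nodeId": "ING:001"},
        "EG|3030": {"nodeId": "ING:002"},
    }
    print(json.dumps(build_sstable_json(sample), indent=1))
```

The output could then be written to a file and passed to json2sstable, subject 
to the caveats above.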

I'm using Cassandra 0.8.2 and I'm just playing with this concept.  I created a 
test CF using the CLI:

[default@Ingenuity] use Test;
Authenticated to keyspace: Test
[default@Test] create column family TestCF with comparator = UTF8Type and 
column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
28991070-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF with key_validation_class=UTF8Type; 
2af88440-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';  
Value inserted.
[default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';  
Value inserted.
[default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003'; 
Value inserted.
[default@Test] list TestCF;
Using default limit of 100
-------------------
RowKey: EG|3030
=> (column=nodeId, value=ING:002, timestamp=1311954072252000)
-------------------
RowKey: EG|3031
=> (column=nodeId, value=ING:003, timestamp=1311954073631000)
-------------------
RowKey: SID|123
=> (column=nodeId, value=ING:001, timestamp=1311954072249000)

3 Rows Returned.
[default@Test] 

Now, cassandra.yaml is stock, except I changed it to place the data in a 
non-default location:

# directories where Cassandra should store data on disk.
data_file_directories:
    - /usr/local/ingenuity/isec/cassandra/datastore/data

# commit log
commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog

# saved caches
saved_caches_directory: /usr/local/ingenuity/isec/cassandra/datastore/saved_caches

In that data directory:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
[imac:datastore/data/Test] jas% 

There is nothing there.  Perhaps Cassandra has not yet felt the need to write 
the SSTables.  So, since I need to reference an actual data file with 
sstable2json, I ran nodetool flush:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost 
flush Test TestCF
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Now, I have files!

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db              TestCF-g-1-Index.db
TestCF-g-1-Filter.db            TestCF-g-1-Statistics.db
[imac:datastore/data/Test] jas% 

Given that, I'm able to run sstable2json and I can see I'm getting what's in 
that CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas%  bin/sstable2json 
/usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > 
testcf.jason
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason 
{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Oops, okay, that file extension should be json not jason, but oh well... :)
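
The row keys in that dump are hex-encoded bytes rather than the strings typed 
into the CLI.  A minimal Python sketch for decoding them back, assuming the 
keys are UTF-8 text (the CF uses key_validation_class=UTF8Type); the name 
decode_dump is illustrative only:

```python
import json

# The sstable2json output from above, pasted verbatim.
DUMP = """
{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}
"""


def decode_dump(text):
    """Return {row_key: {column: value}} with hex row keys decoded as UTF-8."""
    rows = {}
    for hex_key, columns in json.loads(text).items():
        key = bytes.fromhex(hex_key).decode("utf-8")
        # Each column entry is [name, value, timestamp]; drop the timestamp.
        rows[key] = {name: value for name, value, _timestamp in columns}
    return rows


if __name__ == "__main__":
    for key, columns in sorted(decode_dump(DUMP).items()):
        print(key, columns)
```

This makes it easy to confirm that the dump holds the same three rows that 
"list TestCF" returned.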

Okay, so now I have data in the proper format for importing with 
json2sstable.  Like I said, I want to import this data into a new CF. Let's 
call it TestCF2 (in the same keyspace):

[default@Test] create column family TestCF2 with comparator = UTF8Type and 
column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF2 with key_validation_class=UTF8Type; 
5092dec0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] 

Again there are no files created in the data directory, so I do a flush for the 
new CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost 
flush Test TestCF2
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Well, that did not help, still no files for TestCF2.  There is no actual data 
yet, so I'm guessing the system tables have what they need. So, I go ahead and 
import the data using json2sstable:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c 
TestCF2 testcf.jason 
/usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
Importing 3 keys...
3 keys imported successfully.
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Okay, and the files did show up:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db              TestCF2-g-1-Data.db
TestCF-g-1-Filter.db            TestCF2-g-1-Filter.db
TestCF-g-1-Index.db             TestCF2-g-1-Index.db
TestCF-g-1-Statistics.db        TestCF2-g-1-Statistics.db
[imac:datastore/data/Test] jas% 

Back in the CLI:

[default@Test] list TestCF2;
Using default limit of 100

0 Row Returned.
[default@Test] 

However, if I edit TestCF-g-1-Data.db, I can sort of see the data is present.  
Quitting and restarting the CLI has no effect. What gets the CF data into the 
MemTables so it's accessible to a Cassandra client?   I tried various nodetool 
commands (repair, compact, cleanup, flush, invalidatekeycache, 
invalidaterowcache) and I still don't see any rows for TestCF2 in the CLI.

Anyway, it seems this procedure works as I'd expect, well except for not seeing 
the new data. :)

What am I missing here?

Thanks,

Jeff
--
Jeff Schmidt
535 Consulting
[email protected]
http://www.535consulting.com
(650) 423-1068
