In case you missed it, fresh off the press: http://www.datastax.com/dev/blog/bulk-loading

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 30 Jul 2011, at 04:10, Jeff Schmidt wrote:

> Hello:
>
> I'm relatively new to Cassandra, but I've been searching around, and it looks
> like Cassandra 0.8.x has improved support for bulk importing of data. I keep
> finding references to the json2sstable command, and I've read about that on
> the DataStax and Apache documentation pages.
>
> There's a lot of detail here if you want it, otherwise please skip to the
> end. json2sstable seems to run successfully, but I cannot see the data in the
> new CF using the CLI.
>
> My goal is to extract data from various sources, munge it together in some
> manner, and then bulk load it into Cassandra. That is as opposed to using
> Hector to programmatically insert the data. I'd like to deploy these files
> to the cloud (Puppet) and then instruct Cassandra to bulk load them, and then
> inform the application that new data exists. This is for a periodic content
> update of certain column families of curated, read-only data that occurs on
> a monthly basis. I'm thinking of using JMX to signal the application to
> switch to a new set of CFs and keep running w/o downtime. At a later time,
> I'll delete the old CFs.
>
> I'm using Cassandra 0.8.2 and I'm just playing with this concept. I create a
> test CF using the CLI:
>
> [default@Ingenuity] use Test;
> Authenticated to keyspace: Test
> [default@Test] create column family TestCF with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
> 28991070-b9f9-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] update column family TestCF with key_validation_class=UTF8Type;
> 2af88440-b9f9-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';
> Value inserted.
> [default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';
> Value inserted.
> [default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003';
> Value inserted.
> [default@Test] list TestCF;
> Using default limit of 100
> -------------------
> RowKey: EG|3030
> => (column=nodeId, value=ING:002, timestamp=1311954072252000)
> -------------------
> RowKey: EG|3031
> => (column=nodeId, value=ING:003, timestamp=1311954073631000)
> -------------------
> RowKey: SID|123
> => (column=nodeId, value=ING:001, timestamp=1311954072249000)
>
> 3 Rows Returned.
> [default@Test]
>
> Now, cassandra.yaml is stock, except I changed it to place the data in a
> non-default location:
>
> # directories where Cassandra should store data on disk.
> data_file_directories:
> - /usr/local/ingenuity/isec/cassandra/datastore/data
>
> # commit log
> commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog
>
> # saved caches
> saved_caches_directory: /usr/local/ingenuity/isec/cassandra/datastore/saved_caches
>
> In that data directory:
>
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> [imac:datastore/data/Test] jas%
>
> There is nothing there. Perhaps Cassandra has not yet felt the need to write
> the SSTables. So, since I need to reference an actual data file with
> sstable2json, I ran nodetool flush:
>
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas%
>
> Now, I have files!
>
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> TestCF-g-1-Data.db      TestCF-g-1-Index.db
> TestCF-g-1-Filter.db    TestCF-g-1-Statistics.db
> [imac:datastore/data/Test] jas%
>
> Given that, I'm able to run sstable2json and I can see I'm getting what's in
> that CF:
>
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/sstable2json /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > testcf.jason
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason
> {
> "45477c33303330": [["nodeId","ING:002",1311954072252000]],
> "45477c33303331": [["nodeId","ING:003",1311954073631000]],
> "5349447c313233": [["nodeId","ING:001",1311954072249000]]
> }
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas%
>
> Oops, okay, that file extension should be json not jason, but oh well... :)
>
> Okay, so now I have data in the proper format for importing with
> json2sstable. Like I said, I want to import this data into a new CF. Let's
> call it TestCF2 (in the same keyspace):
>
> [default@Test] create column family TestCF2 with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
> 4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test] update column family TestCF2 with key_validation_class=UTF8Type;
> 5092dec0-b9fa-11e0-0000-242d50cf1fb5
> Waiting for schema agreement...
> ... schemas agree across the cluster
> [default@Test]
>
> Again there are no files created in the data directory, so I do a flush for
> the new CF:
>
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF2
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas%
>
> Well, that did not help, still no files for TestCF2. There is no actual data
> yet, so I'm guessing the system tables have what they need.
> So, I go ahead and import the data using json2sstable:
>
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c TestCF2 testcf.jason /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
> Importing 3 keys...
> 3 keys imported successfully.
> [imac:isec/cassandra/apache-cassandra-0.8.2] jas%
>
> Okay, and the files did show up:
>
> [imac:datastore/data/Test] jas% pwd
> /usr/local/ingenuity/isec/cassandra/datastore/data/Test
> [imac:datastore/data/Test] jas% ls
> TestCF-g-1-Data.db          TestCF2-g-1-Data.db
> TestCF-g-1-Filter.db        TestCF2-g-1-Filter.db
> TestCF-g-1-Index.db         TestCF2-g-1-Index.db
> TestCF-g-1-Statistics.db    TestCF2-g-1-Statistics.db
> [imac:datastore/data/Test] jas%
>
> Back in the CLI:
>
> [default@Test] list TestCF2;
> Using default limit of 100
>
> 0 Row Returned.
> [default@Test]
>
> However, if I edit TestCF-g-1-Data.db, I can sort of see the data is present.
> Quitting and starting the CLI has no effect. What gets the CF data into
> the MemTables so it's accessible to a Cassandra client? I tried various
> nodetool commands (repair, compact, cleanup, flush, invalidatekeycache,
> invalidaterowcache) and I don't see any rows for TestCF2 in the CLI.
>
> Anyway, it seems this procedure works as I'd expect, well except for not
> seeing the new data. :)
>
> What am I missing here?
>
> Thanks,
>
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> http://www.535consulting.com
> (650) 423-1068
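
For reference, the row keys in the sstable2json output quoted above appear to be the hex-encoded bytes of the keys that were set in the CLI. A minimal Python sketch, assuming the keys are UTF-8 text as in this example (the helper name decode_row_key is illustrative, not part of any Cassandra tool):

```python
def decode_row_key(hex_key: str) -> str:
    """Decode a hex-encoded row key back into its UTF-8 text form."""
    return bytes.fromhex(hex_key).decode("utf-8")


# Keys copied from the sstable2json output quoted in the thread.
hex_keys = ["45477c33303330", "45477c33303331", "5349447c313233"]

decoded_keys = [decode_row_key(k) for k in hex_keys]

for hex_key, key in zip(hex_keys, decoded_keys):
    print(hex_key, "->", key)
```

Comparing the decoded keys against the RowKey values from "list TestCF" is a quick sanity check on the exported JSON before feeding it to json2sstable.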