FWIW, I'm working on migrating a large amount of data out of Oracle into my test cluster. The data has been warehoused as CSV files on Amazon S3; having it there means I don't put extra load on the production service during repeated tests. I then parse the data with Python's csv module and, as Jonathan says, use threads to batch-upload it into Cassandra. One notable point: since the data is relatively sparse (i.e. many zeros in integer columns, empty strings in string columns, etc.), I keep a dictionary of default values and don't write those values to Cassandra at all -- they can be reconstructed as needed when reading back.
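The sparse-column trick looks roughly like this. A minimal sketch -- the column names, default values, and sample data are made up for illustration; the real schema will differ:

```python
import csv
import io

# Hypothetical per-column defaults; in practice this comes from your schema.
DEFAULTS = {"clicks": "0", "notes": ""}

def sparse_columns(row, defaults=DEFAULTS):
    """Drop columns whose value equals the known default.

    Dropped values are never written to Cassandra; the reader
    reconstructs them from the same defaults dictionary.
    """
    return {col: val for col, val in row.items() if defaults.get(col) != val}

# Parse CSV text the same way the warehoused S3 files would be parsed.
data = "id,clicks,notes\n1,0,\n2,7,hello\n"
rows = [sparse_columns(r) for r in csv.DictReader(io.StringIO(data))]
# Row 1 collapses to just its id; row 2 keeps all non-default columns.
```

Each resulting dict is what you'd hand to your batch writer; the fewer columns per row, the cheaper the writes.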
Also, make sure you wrap Cassandra writes in try/except blocks. When load is high, you may get timeouts at the TSocket level.
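A sketch of what that wrapping might look like, assuming a generic write callable and a stand-in exception class (your Thrift client will raise its own timeout type, e.g. from TSocket):

```python
import time

class WriteTimeout(Exception):
    """Stand-in for the timeout the Thrift transport layer raises."""

def safe_write(write_fn, *args, retries=3, backoff=0.5):
    """Call a Cassandra write, retrying on timeouts with linear backoff.

    write_fn is whatever your client exposes (e.g. a batch insert);
    the retry count and backoff here are illustrative, not tuned.
    """
    for attempt in range(retries):
        try:
            return write_fn(*args)
        except WriteTimeout:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))
```

Each uploader thread can then call safe_write around its batch instead of the raw client call, so a transient timeout doesn't kill the whole migration run.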