Re: full-table scan - extracting all data from C*

Xu Zhongxing Tue, 27 Jan 2015 23:20:46 -0800

This is hard to answer in general, because performance depends on the context.
There are various parameters you can tune.


At 2015-01-28 14:43:38, "Shenghua(Daniel) Wan" <wansheng...@gmail.com> wrote:

Cool. What about performance? E.g., how many records in how much time?


On Tue, Jan 27, 2015 at 10:16 PM, Xu Zhongxing <xu_zhong_x...@163.com> wrote:

For the Java driver, there is actually no special API; you just run the query and iterate over the result set:


ResultSet rs = session.execute("select * from ...");
for (Row r : rs) {
   ...
}


For Spark, the code skeleton is:


val rdd = sc.cassandraTable("ks", "table")


Then call the standard Spark APIs to process the table in parallel.


I have not used CqlInputFormat.


At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan" <wansheng...@gmail.com> wrote:
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of
records from C* using a MapReduce-related API such as CqlInputFormat.
You mentioned the Java driver. Could you suggest which API you used? Thanks.


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing <xu_zhong_x...@163.com> wrote:

Both the Java driver ("select * from table") and Spark's sc.cassandraTable() work well.
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller" <moham...@glassbeam.com> wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)      Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)      Extract the data directly from SSTable files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.
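
One common way to parallelize approach #1 is to split the token ring into contiguous ranges and have each worker scan one range with a token() predicate. The following is only a minimal sketch of that idea, assuming the cluster uses Murmur3Partitioner (tokens are signed 64-bit values). The class name TokenRangeSplitter and the keyspace, table, and partition key names passed to it are hypothetical, and the code only computes the ranges and builds the CQL text; it does not talk to a cluster.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Cassandra driver.
final class TokenRangeSplitter {
    // Assumption: Murmur3Partitioner, whose tokens fit in a signed 64-bit long.
    static final long MIN_TOKEN = Long.MIN_VALUE;
    static final long MAX_TOKEN = Long.MAX_VALUE;

    // Splits [MIN_TOKEN, MAX_TOKEN] into `parts` contiguous ranges.
    // Each element is {start, end}, both inclusive.
    static List<long[]> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(MIN_TOKEN);
        // Total number of token values (2^64); BigInteger avoids long overflow.
        BigInteger width = BigInteger.valueOf(MAX_TOKEN).subtract(min).add(BigInteger.ONE);
        BigInteger n = BigInteger.valueOf(parts);
        List<long[]> ranges = new ArrayList<>(parts);
        for (int i = 0; i < parts; i++) {
            BigInteger lo = min.add(width.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger hi = min.add(width.multiply(BigInteger.valueOf(i + 1L)).divide(n))
                               .subtract(BigInteger.ONE);
            ranges.add(new long[] {lo.longValueExact(), hi.longValueExact()});
        }
        return ranges;
    }

    // Builds the CQL text for scanning one range; the two '?' markers are
    // meant to be bound to the range's start and end tokens.
    static String query(String keyspace, String table, String partitionKey) {
        return "SELECT * FROM " + keyspace + "." + table
             + " WHERE token(" + partitionKey + ") >= ?"
             + " AND token(" + partitionKey + ") <= ?";
    }

    public static void main(String[] args) {
        String cql = query("ks", "tbl", "id"); // placeholder names
        for (long[] r : split(4)) {
            System.out.println(cql + "   -- bind " + r[0] + ", " + r[1]);
        }
    }
}
```

Each range could then be handed to a separate thread, Hadoop task, or Spark partition. For a composite partition key, the partitionKey argument would list all of its columns, e.g. "a, b". Splitting evenly by token value does not guarantee an even split by data size.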

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 






--



Regards,
Shenghua (Daniel) Wan




