Jonathan,

From what I have read in the docs, the Python API still has some limitations: it is not yet possible to use arbitrary Hadoop binary input formats. The Python example for Cassandra is only in the master branch:
https://github.com/apache/spark/blob/master/examples/src/main/python/cassandra_inputformat.py
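Condensing the linked example, the access pattern looks roughly like this (the config keys and converter class names are taken from that file and may change between versions, so treat it as a sketch, not a stable API; keyspace and column family names here are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="CassandraInputFormat")

    # Hadoop/Cassandra input settings, as set in the example
    conf = {
        "cassandra.input.thrift.address": "localhost",
        "cassandra.input.thrift.port": "9160",
        "cassandra.input.keyspace": "test",       # hypothetical keyspace
        "cassandra.input.columnfamily": "users",  # hypothetical CF
        "cassandra.input.partitioner.class": "Murmur3Partitioner",
        "cassandra.input.page.row.size": "3",
    }

    # The heavy lifting is still done by the Hadoop integration:
    # CqlPagingInputFormat, plus two Python converters shipped with
    # the Spark examples.
    cass_rdd = sc.newAPIHadoopRDD(
        "org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat",
        "java.util.Map",
        "java.util.Map",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "CassandraCQLKeyConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "CassandraCQLValueConverter",
        conf=conf)

    print(cass_rdd.collect())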
I may be lacking knowledge of Spark, but if I understood it correctly, access to Cassandra data is still made through CqlPagingInputFormat, from the Hadoop integration. Here is where I ask: even if Spark supports Cassandra, will it be fast enough?

My understanding (please, someone correct me if I am wrong) is that when you insert N items into a Cassandra CF, you execute N binary searches, so each item is inserted already indexed by its key, and when you read the data back it is already sorted. That means O(N * log(N)) total (binary-search cost per insert) to store all the data already sorted. However, with a fast sort algorithm you also take O(N * log(N)) to sort the data after it was inserted, just using more I/O.

If I write a job in Spark / Java against Cassandra, how will the mapped data be stored and sorted? Will it be stored in Cassandra too? Will Spark run a sort after the mapping phase?
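To make the comparison concrete, here is a toy sketch in pure Python contrasting the two strategies (illustrative only; it shows the comparison costs but does not model Cassandra's actual storage engine):

    import bisect
    import random

    data = [random.random() for _ in range(100000)]

    # "Index on insert" (Cassandra-like): binary-search the position of
    # each new item, so N inserts cost O(N * log N) comparisons and the
    # collection is sorted at all times. (The O(n) element shift inside
    # insort is an artifact of Python lists; a real storage engine
    # avoids it.)
    indexed = []
    for x in data:
        bisect.insort(indexed, x)

    # "Sort after" (HDFS / M-R style): append everything, then run one
    # O(N * log N) sort. Same asymptotic comparison cost, but the data
    # is written once unsorted and rewritten sorted, i.e. extra I/O.
    batched = list(data)
    batched.sort()

    assert indexed == batched  # both end up in the same sorted order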
Best regards,
Marcelo.

2014-07-21 14:06 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
> I haven't tried pyspark yet, but it's part of the distribution. My
> main language is Python too, so I intend on getting deep into it.
>
> On Mon, Jul 21, 2014 at 9:38 AM, Marcelo Elias Del Valle
> <marc...@s1mbi0se.com.br> wrote:
> > Hi Jonathan,
> >
> > Do you know if this RDD can be used with Python? AFAIK, Python +
> > Cassandra will be supported only in the next version, but I would
> > like to be wrong...
> >
> > Best regards,
> > Marcelo Valle.
> >
> > 2014-07-21 13:06 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
> >
> >> Hey Marcelo,
> >>
> >> You should check out Spark. It intelligently deals with a lot of
> >> the issues you're mentioning. Al Tobey did a walkthrough of how to
> >> set up the OSS side of things here:
> >>
> >> http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
> >>
> >> It'll be less work than writing an M/R framework from scratch :)
> >> Jon
> >>
> >> On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle
> >> <marc...@s1mbi0se.com.br> wrote:
> >> > Hi,
> >> >
> >> > I need to run a map/reduce job to identify data stored in
> >> > Cassandra before indexing this data into Elastic Search.
> >> >
> >> > I have already used ColumnFamilyInputFormat (before starting to
> >> > use CQL) to write Hadoop jobs for that, but I used to have a lot
> >> > of trouble with tuning, as Hadoop depends on how map tasks are
> >> > split in order to successfully execute things in parallel, for
> >> > I/O-bound processes.
> >> >
> >> > First question: am I the only one having problems with that? Is
> >> > anyone else using Hadoop jobs that read from Cassandra in
> >> > production?
> >> >
> >> > Second question is about the alternatives. I saw the new version
> >> > of Spark will have Cassandra support, but using
> >> > CqlPagingInputFormat, from Hadoop. I tried to use Hive with
> >> > Cassandra Community, but it seems it only works with Cassandra
> >> > Enterprise and doesn't do more than FB's Presto
> >> > (http://prestodb.io/), which we have been using to read from
> >> > Cassandra, and so far it has been great for SQL-like queries. For
> >> > custom map/reduce jobs, however, it is not enough.
> >> >
> >> > Does anyone know some other tool that performs MR on Cassandra?
> >> > My impression is that most tools were created to work on top of
> >> > HDFS, and reading from a NoSQL DB is some kind of "workaround".
> >> >
> >> > Third question is about how these tools work. Most of them write
> >> > mapped data to intermediate storage, then the data is shuffled
> >> > and sorted, then it is reduced. Even when using
> >> > CqlPagingInputFormat, Hadoop will write files to HDFS after the
> >> > mapping phase, shuffle and sort that data, and then reduce it.
> >> >
> >> > I wonder if a tool supporting Cassandra out of the box wouldn't
> >> > be smarter. Is it faster to write all your data to a file and
> >> > then sort it, or to batch-insert the data and index it as it
> >> > arrives, as happens when you store data in a Cassandra CF? I
> >> > didn't do the calculations to check the complexity of each one
> >> > (and one should consider that such an index in Cassandra could
> >> > become really large, as the maximum index size will always depend
> >> > on the maximum capacity of a single host), but my guess is that a
> >> > map/reduce tool written specifically for Cassandra, from the
> >> > beginning, could perform much better than a tool written for HDFS
> >> > and adapted. I hear people saying map/reduce on Cassandra/HBase
> >> > is usually 30% slower than M/R on HDFS. Does that really make
> >> > sense? Should we expect a result like this?
> >> >
> >> > Final question: do you think writing a new M/R tool like the one
> >> > described would be reinventing the wheel? Or does it make sense?
> >> >
> >> > Thanks in advance. Any opinions on this subject will be very much
> >> > appreciated.
> >> >
> >> > Best regards,
> >> > Marcelo Valle.
> >>
> >> --
> >> Jon Haddad
> >> http://www.rustyrazorblade.com
> >> skype: rustyrazorblade
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>