Re: Proposal: freeze Thrift starting with 2.1.0

Russell Bradberry Wed, 12 Mar 2014 07:23:41 -0700

I would love to help with the REST interface, however my point was not to add 
REST into Cassandra.  My point was that if we had an abstract interface that 
even CQL used to access data, and this interface was made available for other 
drop-in modules to access, then the project becomes extensible as a whole.  You 
get CQL out of the box, but it allows others to create interface projects of 
their own and keep them up without putting the burden of that maintenance on 
the core developers.


It could also mean that down the line, say, if CQL stops working out, like Avro 
and Thrift before it, then pulling it out would be less of a problem.  We can 
even get all cowboy up in here and put CQL in its own project that can grow by 
itself, as long as an interface in the Cassandra project is made available.
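
A minimal sketch of that idea, in Python for brevity. Everything here is 
hypothetical: StorageAccess, InMemoryStorage and RestLikeFrontEnd are 
illustrative names, not Cassandra APIs. The point is only that a front end 
(CQL, REST, or anything else) depends on one narrow interface and can live 
outside the core project.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class StorageAccess(ABC):
    """Hypothetical core data-access interface (not a real Cassandra API)."""

    @abstractmethod
    def put(self, table: str, key: str, columns: Dict[str, str]) -> None:
        ...

    @abstractmethod
    def get(self, table: str, key: str) -> Optional[Dict[str, str]]:
        ...


class InMemoryStorage(StorageAccess):
    """Toy stand-in for the storage engine, for illustration only."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[str, Dict[str, str]]] = {}

    def put(self, table: str, key: str, columns: Dict[str, str]) -> None:
        self._tables.setdefault(table, {}).setdefault(key, {}).update(columns)

    def get(self, table: str, key: str) -> Optional[Dict[str, str]]:
        row = self._tables.get(table, {}).get(key)
        return dict(row) if row is not None else None


class RestLikeFrontEnd:
    """A drop-in module: it only ever talks to StorageAccess."""

    def __init__(self, storage: StorageAccess) -> None:
        self.storage = storage

    def handle(self, method: str, path: str,
               body: Optional[Dict[str, str]] = None
               ) -> Tuple[int, Optional[Dict[str, str]]]:
        # "/users/alice" splits into ['', 'users', 'alice']
        _, table, key = path.split("/")
        if method == "PUT":
            self.storage.put(table, key, body or {})
            return 204, None
        if method == "GET":
            row = self.storage.get(table, key)
            return (200, row) if row is not None else (404, None)
        return 405, None
```

A CQL front end would be one more class written against StorageAccess, so 
swapping or removing it would not touch the storage side.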


On March 12, 2014 at 10:13:34 AM, Brian O'Neill (b...@alumni.brown.edu) wrote:


just when you thought the thread died…


First, let me say we are *WAY* off topic.  But that is a good thing.  
I love this community because there are a ton of passionate, smart people. 
(often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin)
We’ve had the same experience.  Pig + Hadoop is painful.  We are experimenting 
with Spark/Shark, operating directly against the data.
http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

The Shark layer gives you SQL and caching capabilities that make it easy to use 
and fast (for smaller data sets).  In front of this, we are going to add 
dimensional aggregations so we can operate at larger scales.  (then the Hive 
reports will run against the aggregations)

RE: REST Server (@Russell Bradberry)
We had moderate success with Virgil, a REST server built directly on top of 
Thrift so that one day it could easily be embedded in the C* server itself.  
It could be deployed separately, or run an embedded C*.  More often than not, 
we ended up running it separately to keep the layers apart (just like Titan 
and Rexster).  I’ve started on a rewrite of Virgil called Memnon that rides on 
top of CQL. (I’d love some help)
https://github.com/boneill42/memnon

RE: CQL vs. Thrift
We’ve hitched our wagons to CQL.  CQL != Relational.  
We’ve had success translating our “native” schemas into CQL, including all the 
NoSQL goodness of wide-rows, etc.  You just need a good understanding of how 
things translate into storage and underlying CFs.  If anything, I think we 
could add some DESCRIBE information, which would help users with this, along 
the lines of:
(https://issues.apache.org/jira/browse/CASSANDRA-6676)
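
As a rough illustration of how a CQL table translates into a wide row, here is 
a simplified Python sketch. It assumes the Thrift-era layout in which each 
storage row is keyed by the partition key and each cell name is the clustering 
values followed by the column name. The function name and the sample table are 
made up for the example, and details such as row markers, timestamps and 
collections are left out.

```python
from typing import Any, Dict, List, Tuple


def to_wide_rows(cql_rows: List[Dict[str, Any]],
                 partition_key: str,
                 clustering_keys: List[str]
                 ) -> Dict[Any, List[Tuple[tuple, Any]]]:
    """Flatten CQL rows into {partition key: sorted [(cell name, value)]}."""
    storage: Dict[Any, Dict[tuple, Any]] = {}
    for row in cql_rows:
        # Clustering values become the prefix of every cell name in this row.
        prefix = tuple(row[c] for c in clustering_keys)
        cells = storage.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column == partition_key or column in clustering_keys:
                continue
            cells[prefix + (column,)] = value
    # Cells inside one storage row are kept ordered by cell name.
    return {pk: sorted(cells.items()) for pk, cells in storage.items()}


# Hypothetical table: PRIMARY KEY (sensor, ts)
rows = [
    {"sensor": "s1", "ts": 2, "value": "21.5"},
    {"sensor": "s1", "ts": 1, "value": "20.0", "unit": "C"},
]
wide = to_wide_rows(rows, "sensor", ["ts"])
```

Both CQL rows end up as cells of the single storage row "s1", ordered by ts, 
which is the wide-row pattern expressed through a clustering column.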

CQL does open up the *opportunity* for users to articulate more complex queries 
using more familiar syntax.  (including future things such as joins, grouping, 
etc.)   To me, that is exciting, and again — one of the reasons we are leaning 
on it.

my two cents,
brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42  •  
healthmarketscience.com


From: Peter Lin <wool...@gmail.com>
Reply-To: <user@cassandra.apache.org>
Date: Wednesday, March 12, 2014 at 8:44 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Proposal: freeze Thrift starting with 2.1.0


Yes, I was looking at Intravert last night.

For the kinds of reports my customers ask us to do, joins and subqueries are 
important. Having tried to do a simple join in Pig, I can say the level of pain 
is high. I'm a masochist, so I don't mind breaking a simple join into multiple 
MR tasks, though I do find myself asking "why the hell does it need to be so 
painful in Pig?" Many of my friends say "what is this crap!" or "this is better 
than writing sql queries to run reports?"

Plus, using ETL techniques to extract summaries only works for cases where the 
data is small enough. Once it gets beyond a certain size, it's not practical, 
which means we're back to crappy reporting languages that make life painful. 
Lots of big healthcare companies have thousands of MOLAP cubes on dozens of 
mainframes. The old OLTP -> DW/OLAP approach creates its own set of management 
headaches.

Being able to report directly on the raw data avoids many of the issues, but 
that's my biased perspective.




On Wed, Mar 12, 2014 at 8:15 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
"I would love to see Cassandra get to the point where users can define complex 
queries with subqueries, like, group by and joins" --> Did you have a look at 
Intravert? I think it does union & intersection on the server side for you. Not 
sure about joins though.


On Wed, Mar 12, 2014 at 12:44 PM, Peter Lin <wool...@gmail.com> wrote:

Hi Ed,

I agree Solr is deeply integrated into DSE. I've looked at Solandra in the past 
and studied the code.

My understanding is DSE uses Cassandra for storage and the user has both APIs 
available. I do think it can be integrated further to make moderate to complex 
queries easier and probably faster. That's why we built our own JPA-like object 
query API. I would love to see Cassandra get to the point where users can 
define complex queries with subqueries, like, group by and joins. Clearly lots 
of people want these features and even Google built their own tools to do these 
types of queries.

I see lots of people trying to improve this with Presto, Impala, Drill, etc. To 
me, it's a natural progression as NoSql databases mature. For most people, at 
some point you want to be able to report on and analyze the data. Today some 
people use MapReduce to summarize the data and ETL it into a relational 
database or OLAP database for reporting. Even though I don't need CAS or atomic 
batch for what I do in Cassandra today, I'm sure in the future it will be 
handy. From my experience in the financial and insurance sector, features like 
CAS and "select for update" are important for the kinds of transactions they 
handle. I'm biased, but these kinds of features are a useful and good addition 
to Cassandra.

These are interesting times in database land!




On Tue, Mar 11, 2014 at 10:57 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
Peter,
Solr is deeply integrated into DSE. Seemingly this cannot efficiently be done 
client side (CQL/Thrift, whatever), but the Solandra approach was to embed Solr 
in Cassandra. I think that is actually the future of client dev: allowing users 
to embed custom server-side logic into their own API.

Things like this take a while. Back in the day no one wanted Cassandra to be 
heavyweight, and ideas like read-before-write operations were rejected. The 
common advice was "do them client side". Now, in the case of collections, 
sometimes they do read-before-write and it is the "stuff users want".



On Tue, Mar 11, 2014 at 10:07 PM, Peter Lin <wool...@gmail.com> wrote:

I'll give you a concrete example.

One of the things we often need to do is a keyword search on unstructured 
text. What we did in our tooling is combine Solr with Cassandra, but we put 
an Object API in front of it. The API is inspired by JPA, but designed 
specifically to fit our needs.

The user can do queries with like %blah%, and behind the scenes we issue a 
query to Solr to find the keys and then query Cassandra for the records.
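
The two-step lookup described above can be sketched roughly as follows. This 
is not the actual tooling: KeywordIndex stands in for Solr, RecordStore stands 
in for Cassandra, and ObjectApi is a hypothetical name for the layer in front 
of both. All three are in-memory toys so the flow is easy to follow.

```python
from typing import Dict, List, Set


class KeywordIndex:
    """Toy stand-in for Solr: remembers which record keys contain which words."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}

    def add(self, key: str, text: str) -> None:
        for word in text.lower().split():
            self._postings.setdefault(word, set()).add(key)

    def search_like(self, fragment: str) -> List[str]:
        """Keys of records with a word containing fragment, as in %fragment%."""
        fragment = fragment.lower()
        hits: Set[str] = set()
        for word, keys in self._postings.items():
            if fragment in word:
                hits |= keys
        return sorted(hits)


class RecordStore:
    """Toy stand-in for Cassandra: a plain key -> record map."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, key: str, record: Dict[str, str]) -> None:
        self._rows[key] = dict(record)

    def get(self, key: str) -> Dict[str, str]:
        return self._rows[key]


class ObjectApi:
    """Hypothetical object layer that hides which system is queried, and when."""

    def __init__(self, index: KeywordIndex, store: RecordStore) -> None:
        self.index = index
        self.store = store

    def save(self, key: str, record: Dict[str, str], text_field: str) -> None:
        self.store.put(key, record)
        self.index.add(key, record[text_field])

    def find_like(self, fragment: str) -> List[Dict[str, str]]:
        keys = self.index.search_like(fragment)    # step 1: index returns keys
        return [self.store.get(k) for k in keys]   # step 2: store returns records
```

The caller only sees find_like(); the order of the two queries stays inside 
the API.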

With plain Cassandra, the developer has to manually do all of this stuff and 
integrate Solr. Then they have to know which system to query and in what order. 
Our tooling lets the user define the schema in a modeler. Once the model is 
done, it compiles the classes, configuration files, data access objects and 
unit tests.

When the application makes a call, our query classes handle the details behind 
the scenes. I know lots of people would like to see Solr integrated more deeply 
into Cassandra and CQL. I hope it happens in the future. If DataStax accepts my 
talk, we will be showing our temporal database and modeler in September.




On Tue, Mar 11, 2014 at 9:54 PM, Steven A Robenalt <srobe...@stanford.edu> 
wrote:
I should add that I'm not trying to ignite a flame war. Just trying to 
understand your intentions.


On Tue, Mar 11, 2014 at 6:50 PM, Steven A Robenalt <srobe...@stanford.edu> 
wrote:
Okay, I'm officially lost on this thread. If you plan on forking Cassandra to 
preserve and continue to enhance the Thrift interface, you would also want to 
add a bunch of relational features to CQL as part of that same fork?


On Tue, Mar 11, 2014 at 6:20 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
"one of the things I'd like to see happen is for Cassandra to support queries 
with disjunction, exist, subqueries, joins and like. In theory CQL could 
support these features in the future. Cassandra would need a new query compiler 
and query planner. I don't see how the current design could do these things 
without a significant redesign/enhancement. In a past life, I implemented an 
inference rule engine, so I've spent over a decade studying and implementing 
query optimizers. All of these things can be done, it's just a matter of people 
finding the time to do it."

I see what you're saying. CQL started as a way to make slices easier, but it is 
not even a query language; retrofitting these things is going to be very hard.



On Tue, Mar 11, 2014 at 7:45 PM, Peter Lin <wool...@gmail.com> wrote:

I have no problem maintaining my own fork :) or joining others in forking 
Cassandra.

I'd be happy to work with you or anyone else to add features to Thrift. That's 
the great thing about open source. Each person can scratch a technical itch and 
do what they love. I see lots of potential for Cassandra, and much of it 
involves improving Thrift. Some of the features in theory "could" be done in 
CQL, but not with the current design.

one of the things I'd like to see happen is for Cassandra to support queries 
with disjunction, exist, subqueries, joins and like. In theory CQL could 
support these features in the future. Cassandra would need a new query compiler 
and query planner. I don't see how the current design could do these things 
without a significant redesign/enhancement. In a past life, I implemented an 
inference rule engine, so I've spent over a decade studying and implementing 
query optimizers. All of these things can be done, it's just a matter of people 
finding the time to do it.




On Tue, Mar 11, 2014 at 6:17 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
Peter,

My advice: do not bother. I have become very active recently in attempting to 
add features to Thrift. I had 4 open tickets I was actively working on. (I even 
found two bugs in Cassandra in the process.)

People were aware of this and still called this vote. Several committers have 
voted +1, and my -1 vote is non-binding. It is a clear message: the committers 
are unwilling to accept new Thrift features even if said features are 
contributed by others.

Edward



On Tue, Mar 11, 2014 at 5:51 PM, Peter Lin <wool...@gmail.com> wrote:

My biased opinion: even though some Cassandra developers want to abandon 
Thrift, I see benefits in continuing to improve it.

The great thing about open source is that as long as some people want to keep 
working on it and improve it, it can happen. I plan to do my best to keep 
Thrift going, since it gives me the fine-grained control that I want and need. 
If the ultimate goal of Cassandra is to be "as close to SQL" as practical, my 
biased take is to use a NewSQL database that gives you the full power of 
subqueries, like, exists and disjunction.

When customers ask me which database to choose and they really want a 
relational model, I tell them to use NewSql. I love that Cassandra sits between 
NoSql and NewSql. There are things I do in Cassandra today that are much harder 
in NewSql or NoSql document databases. NewSql databases can scale to similar 
sizes, so the "big" part of big data won't be a significant advantage forever. 
Looking at some of the recent NewSql performance numbers, it's clear the gap is 
closing.

peter



On Tue, Mar 11, 2014 at 3:59 PM, Tyler Hobbs <ty...@datastax.com> wrote:

On Tue, Mar 11, 2014 at 2:41 PM, Shao-Chuan Wang 
<shaochuan.w...@bloomreach.com> wrote:

So, does anyone know how to do "describing the splits" and "describing the 
local rings" using native protocol?

For a ring description, you would do something like "select peer, tokens from 
system.peers".  I'm not sure about describe_splits().
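
As a sketch of what a client could do with the result of that query: note that 
system.peers only lists the other nodes, so the tokens of the node you are 
connected to come from system.local instead. The Python below assumes the rows 
have already been fetched (driver code is left out), that tokens arrive as 
numeric strings as with the Murmur3 partitioner, and that all addresses and 
tokens shown are made up. build_ring and owner_of are hypothetical helper 
names.

```python
from bisect import bisect_left
from typing import Iterable, List, Set, Tuple

Ring = List[Tuple[int, str]]


def build_ring(rows: Iterable[Tuple[str, Set[str]]]) -> Ring:
    """Turn (address, tokens) rows into a list of (token, address) sorted by token."""
    return sorted((int(token), address)
                  for address, tokens in rows
                  for token in tokens)


def owner_of(ring: Ring, token: int) -> str:
    """Address of the first node whose token is >= the given token, wrapping around."""
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token)
    return ring[index % len(ring)][1]


# Made-up sample: two rows as from system.peers, one as from system.local.
rows = [
    ("10.0.0.2", {"-100", "500"}),
    ("10.0.0.3", {"0"}),
    ("10.0.0.1", {"900"}),
]
ring = build_ring(rows)
```

With this sample, owner_of(ring, 42) resolves to the node holding token 500, 
and a token larger than 900 wraps around to the node holding -100.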
 

Also, cqlsh uses the Python client, which talks via the Thrift protocol too. 
Does that mean it will be migrated to the native protocol soon as well?

Yes: https://issues.apache.org/jira/browse/CASSANDRA-6307


--
Tyler Hobbs
DataStax







--
Steve Robenalt
Software Architect
HighWire | Stanford University 
425 Broadway St, Redwood City, CA 94063 

srobe...@stanford.edu
http://highwire.stanford.edu 







