Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
Hi Ed,

I agree Solr is deeply integrated into DSE. I've looked at Solandra in the
past and studied the code.

My understanding is DSE uses Cassandra for storage and the user has both
APIs available. I do think it can be integrated further to make moderate to
complex queries easier and probably faster. That's why we built our own
JPA-like object query API. I would love to see Cassandra get to the point
where users can define complex queries with subqueries, like, group by and
joins. Clearly lots of people want these features, and even Google built
their own tools to do these types of queries.

I see lots of people trying to improve this with Presto, Impala, Drill,
etc. To me, it's a natural progression as NoSQL databases mature. For most
people, at some point you want to be able to report on and analyze the
data. Today some people use MapReduce to summarize the data and ETL it into
a relational or OLAP database for reporting. Even though I don't need CAS
or atomic batch for what I do in Cassandra today, I'm sure they will come
in handy in the future. From my experience in the financial and insurance
sector, features like CAS and "select for update" are important for the
kinds of transactions they handle. I'm biased, but these kinds of features
are useful and a good addition to Cassandra.
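For concreteness, CAS here means Cassandra 2.0's conditional writes (e.g. `UPDATE ... IF col = val`, which reports back an `[applied]` flag). The semantics can be sketched in a few lines -- a toy single-row model, not the Paxos-based server implementation:

```python
def compare_and_set(row, column, expected, new_value):
    # Apply the write only if the column currently holds the expected
    # value; return (applied, current_value), mirroring the [applied]
    # result row of a CQL conditional update.
    current = row.get(column)
    if current == expected:
        row[column] = new_value
        return True, new_value
    return False, current

user = {"email": "old@example.com"}
print(compare_and_set(user, "email", "old@example.com", "new@example.com"))
# (True, 'new@example.com')
print(compare_and_set(user, "email", "old@example.com", "x@example.com"))
# (False, 'new@example.com')
```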

These are interesting times in database land!




On Tue, Mar 11, 2014 at 10:57 PM, Edward Capriolo wrote:

> Peter,
> Solr is deeply integrated into DSE. Seemingly this cannot efficiently be
> done client side (CQL/Thrift/whatever), but the Solandra approach was to
> embed Solr in Cassandra. I think that is actually the future of client dev:
> allowing users to embed custom server-side logic into their own API.
>
> Things like this take a while. Back in the day no one wanted Cassandra to
> be heavy-weight and rejected ideas like read-before-write operations. The
> common advice was "do them client side". Now in the case of collections
> they sometimes do read-before-write, and it is the "stuff users want".
>
>
>
> On Tue, Mar 11, 2014 at 10:07 PM, Peter Lin  wrote:
>
>>
>> I'll give you a concrete example.
>>
>> One of the things we often need to do is a keyword search on
>> unstructured text. What we did in our tooling is combine Solr with
>> Cassandra, but we put an object API in front of it. The API is inspired by
>> JPA, but designed specifically to fit our needs.
>>
>> The user can do queries with like %blah%, and behind the scenes we issue
>> a query to Solr to find the keys and then query Cassandra for the records.
>>
>> With plain Cassandra, the developer has to do all of this manually
>> and integrate Solr. Then they have to know which system to query and in
>> what order. Our tooling lets the user define the schema in a modeler. Once
>> the model is done, it compiles the classes, configuration files, data
>> access objects and unit tests.
>>
>> When the application makes a call, our query classes handle the details
>> behind the scenes. I know lots of people would like to see Solr integrated
>> more deeply into Cassandra and CQL. I hope it happens in the future. If
>> DataStax accepts my talk, we will be showing our temporal database and
>> modeler in September.
>>
>>
>>
>>
>> On Tue, Mar 11, 2014 at 9:54 PM, Steven A Robenalt wrote:
>>
>>> I should add that I'm not trying to ignite a flame war. Just trying to
>>> understand your intentions.
>>>
>>>
>>> On Tue, Mar 11, 2014 at 6:50 PM, Steven A Robenalt <
>>> srobe...@stanford.edu> wrote:
>>>
 Okay, I'm officially lost on this thread. If you plan on forking
 Cassandra to preserve and continue to enhance the Thrift interface, you
 would also want to add a bunch of relational features to CQL as part of
 that same fork?


 On Tue, Mar 11, 2014 at 6:20 PM, Edward Capriolo wrote:

> "one of the things I'd like to see happen is for Cassandra to support
> queries with disjunction, exist, subqueries, joins and like. In theory CQL
> could support these features in the future. Cassandra would need a new
> query compiler and query planner. I don't see how the current design could
> do these things without a significant redesign/enhancement. In a past 
> life,
> I implemented an inference rule engine, so I've spent over a decade studying
> and implementing query optimizers. All of these things can be done, it's
> just a matter of people finding the time to do it."
>
> I see what you're saying. CQL started as a way to make slices easier, but
> it is not even a full query language; retrofitting these things is going
> to be very hard.
>
>
>
> On Tue, Mar 11, 2014 at 7:45 PM, Peter Lin  wrote:
>
>>
>> I have no problem maintaining my own fork :) or joining others forking
>> Cassandra.
>>
>> I'd be happy to work with you or anyone else to add features to
>> Thrift. That's the great thing about open source. Each person can
>> scratch a technical itch.

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread DuyHai Doan
"I would love to see Cassandra get to the point where users can define
complex queries with subqueries, like, group by and joins" --> Did you have
a look at Intravert? I think it does union & intersection on the server
side for you. Not sure about joins, though.


On Wed, Mar 12, 2014 at 12:44 PM, Peter Lin  wrote:


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
Yes, I was looking at Intravert last night.

For the kinds of reports my customers ask us to do, joins and subqueries
are important. Having tried to do a simple join in Pig, the level of pain
is high. I'm a masochist, so I don't mind breaking a simple join into
multiple MR tasks, though I do find myself asking "why the hell does it
need to be so painful in Pig?" Many of my friends say "what is this crap!"
or "this is better than writing SQL queries to run reports?"

Plus, using ETL techniques to extract summaries only works for cases where
the data is small enough. Once it gets beyond a certain size, it's not
practical, which means we're back to crappy reporting languages that make
life painful. Lots of big healthcare companies have thousands of MOLAP
cubes on dozens of mainframes. The old OLTP -> DW/OLAP path creates its own
set of management headaches.

Being able to report directly on the raw data avoids many of the issues,
but that's my biased perspective.




On Wed, Mar 12, 2014 at 8:15 AM, DuyHai Doan  wrote:

> "I would love to see Cassandra get to the point where users can define
> complex queries with subqueries, like, group by and joins" --> Did you have
> a look at Intravert? I think it does union & intersection on the server
> side for you. Not sure about joins, though.

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Brian O'Neill

just when you thought the thread died...


First, let me say we are *WAY* off topic.  But that is a good thing.
I love this community because there are a ton of passionate, smart people.
(often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin)
We've had the same experience.  Pig + Hadoop is painful.  We are
experimenting with Spark/Shark, operating directly against the data.
http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

The Shark layer gives you SQL and caching capabilities that make it easy to
use and fast (for smaller data sets).  In front of this, we are going to add
dimensional aggregations so we can operate at larger scales.  (Then the Hive
reports will run against the aggregations.)

RE: REST Server (@Russell Bradberry)
We had moderate success with Virgil, a REST server built directly on top of
Thrift so that one day it could be easily embedded in the C* server itself.
It could be deployed separately, or run an embedded C*.  More often than
not, we ended up running it separately to separate the layers.  (Just like
Titan and Rexster.)  I've started on a rewrite of Virgil called Memnon that
rides on top of CQL.  (I'd love some help)
https://github.com/boneill42/memnon

RE: CQL vs. Thrift
We've hitched our wagons to CQL.  CQL != Relational.
We've had success translating our "native" schemas into CQL, including all
the NoSQL goodness of wide rows, etc.  You just need a good understanding of
how things translate into storage and the underlying CFs.  If anything, I
think we could add some DESCRIBE information, which would help users with
this, along the lines of:
(https://issues.apache.org/jira/browse/CASSANDRA-6676)

CQL does open up the *opportunity* for users to articulate more complex
queries using more familiar syntax (including future things such as joins,
grouping, etc.).  To me, that is exciting, and again -- one of the reasons
we are leaning on it.

my two cents,
brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 •
healthmarketscience.com


This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or the
person responsible to deliver it to the intended recipient, please contact
the sender at the email above and delete this email and any attachments and
destroy any copies thereof. Any review, retransmission, dissemination,
copying or other use of, or taking any action in reliance upon, this
information by persons or entities other than the intended recipient is
strictly prohibited.
 



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Russell Bradberry
I would love to help with the REST interface, however my point was not to add
REST into Cassandra.  My point was that if we had an abstract interface that
even CQL used to access data, and this interface were made available to other
drop-in modules, then the project would become extensible as a whole.  You
get CQL out of the box, but it allows others to create interface projects of
their own and keep them up without putting the burden of that maintenance on
the core developers.

It could also mean that down the line, say if CQL stops working out (like Avro
and Thrift before it), then pulling it out would be less of a problem.  We can
even get all cowboy up in here and put CQL in its own project that can grow by
itself, as long as an interface in the Cassandra project is made available.
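One way to picture that abstract interface -- a hedged sketch; none of these names exist in Cassandra -- is a storage contract that CQL and any drop-in front end would both code against:

```python
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    """Hypothetical abstraction both CQL and drop-in front ends would use."""
    @abstractmethod
    def get(self, table, key): ...
    @abstractmethod
    def put(self, table, key, row): ...

class InMemoryStorage(StorageInterface):
    """A trivial backing implementation, standing in for the storage engine."""
    def __init__(self):
        self._tables = {}
    def get(self, table, key):
        return self._tables.get(table, {}).get(key)
    def put(self, table, key, row):
        self._tables.setdefault(table, {})[key] = row

class RestFrontEnd:
    """A drop-in module: talks only to the interface, never to internals."""
    def __init__(self, storage):
        self.storage = storage
    def handle_get(self, path):
        _, table, key = path.split("/")   # e.g. "/users/42"
        return self.storage.get(table, key)

storage = InMemoryStorage()
storage.put("users", "42", {"name": "ada"})
print(RestFrontEnd(storage).handle_get("/users/42"))  # {'name': 'ada'}
```

The point of the design is that the REST module (or a Thrift one, or CQL itself) can be maintained, swapped, or pulled out without touching the storage layer.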


On March 12, 2014 at 10:13:34 AM, Brian O'Neill (b...@alumni.brown.edu) wrote:



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
I'm enjoying the discussion also.

@Brian
I've been looking at Spark/Shark, along with other recent developments, over
the last few years. Berkeley has been doing some interesting stuff. One
reason I like Thrift is type safety and the benefits it brings for query
validation and query optimization. One could do similar things with CQL,
but it's just more work, especially with dynamic columns. I know others are
mixing static with dynamic columns, so I'm not alone. I have no clue how
long it will take to get there, but having tools like query explanation is
a big time saver. Writing business reports is hard enough, so every bit of
help the tool can provide makes it less painful.
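As one illustration of what up-front validation buys you, a typed schema lets a client reject a bad query before anything hits the server. The schema map and function below are purely hypothetical:

```python
# Illustrative only: check a query's bound values against a declared
# column-type map client side, the kind of early failure a typed
# schema makes possible.

SCHEMA = {"users": {"id": int, "name": str, "balance": float}}

def validate(table, bindings):
    """Return a list of problems; an empty list means the query is well typed."""
    columns = SCHEMA.get(table)
    if columns is None:
        return ["unknown table: " + table]
    errors = []
    for col, value in bindings.items():
        if col not in columns:
            errors.append("unknown column: " + col)
        elif not isinstance(value, columns[col]):
            errors.append("%s expects %s, got %s"
                          % (col, columns[col].__name__, type(value).__name__))
    return errors

print(validate("users", {"id": 1, "name": "ada"}))    # []
print(validate("users", {"id": "1", "nmae": "ada"}))  # two errors caught early
```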


On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Theo Hultberg
Speaking as a CQL driver maintainer (Ruby), I'm +1 for end-of-lining Thrift.

I agree with Edward that it's unfortunate that there are no official
drivers maintained by the Cassandra maintainers -- even though the
current state with the DataStax drivers is in practice very close (it is
not the same thing, though).

However, I don't agree that not having drivers in the same repo/project is
a problem. Whether or not there's a Java driver in the Cassandra source
doesn't matter at all to us non-Java developers, and I don't see any
difference between having no driver in the source and having just a Java
driver. I might have misunderstood Edward's point about this, though.

The CQL protocol is the key, as others have mentioned. As long as that is
maintained and respected, I think it's absolutely fine not having any
drivers shipped as part of Cassandra. However, I feel this has not been
the case lately. I'm thinking particularly of the UDT feature of 2.1,
which is not part of the CQL spec. There is no documentation on how
drivers should handle UDTs or what a user should be able to expect from a
driver; they're completely implemented as custom types.

I hope this will be fixed before 2.1 is released (and there have been good
discussions on the mailing lists about how a driver should handle UDTs),
but it shows a problem with the the-spec-is-the-truth argument. I think
we'll be fine as long as the spec is the truth, but that requires the spec
to be the truth and new features not to be bolted on outside of the spec.

T#


On Wed, Mar 12, 2014 at 3:23 PM, Peter Lin  wrote:

> I'm enjoying the discussion also.
>
> @Brian
> I've been looking at spark/shark along with other recent developments the
> last few years. Berkeley has been doing some interesting stuff. One reason
> I like Thrift is for type safety and the benefits for query validation and
> query optimization. One could do similar things with CQL, but it's just
> more work, especially with dynamic columns. I know others are mixing static
> with dynamic columns, so I'm not alone. I have no clue how long it will
> take to get there, but having tools like query explanation is a big time
> saver. Writing business reports is hard enough, so every bit of help the
> tool can provide makes it less painful.
>
>
> On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:
>
>>
>> just when you thought the thread died...
>>
>>
>> First, let me say we are *WAY* off topic.  But that is a good thing.
>> I love this community because there are a ton of passionate, smart
>> people. (often with differing perspectives ;)
>>
>> RE: Reporting against C* (@Peter Lin)
>> We've had the same experience.  Pig + Hadoop is painful.  We are
>> experimenting with Spark/Shark, operating directly against the data.
>> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>>
>> The Shark layer gives you SQL and caching capabilities that make it easy
>> to use and fast (for smaller data sets).  In front of this, we are going to
>> add dimensional aggregations so we can operate at larger scales.  (then the
>> Hive reports will run against the aggregations)
>>
>> RE: REST Server (@Russell Bradberry)
>> We had moderate success with Virgil, which was a REST server built
>> directly on Thrift.  We built it directly on top of Thrift, so one day it
>> could be easily embedded in the C* server itself.   It could be deployed
>> separately, or run an embedded C*.  More often than not, we ended up
>> running it separately to separate the layers.  (just like Titan and
>> Rexster)  I've started on a rewrite of Virgil called Memnon that rides on
>> top of CQL. (I'd love some help)
>> https://github.com/boneill42/memnon
>>
>> RE: CQL vs. Thrift
>> We've hitched our wagons to CQL.  CQL != Relational.
>> We've had success translating our "native" schemas into CQL, including
>> all the NoSQL goodness of wide-rows, etc.  You just need a good
>> understanding of how things translate into storage and underlying CFs.  If
>> anything, I think we could add some DESCRIBE information, which would help
>> users with this, along the lines of:
>> (https://issues.apache.org/jira/browse/CASSANDRA-6676)
>>
>> CQL does open up the *opportunity* for users to articulate more complex
>> queries using more familiar syntax.  (including future things such as
>> joins, grouping, etc.)   To me, that is exciting, and again -- one of the
>> reasons we are leaning on it.
>>
>> my two cents,
>> brian
>>
>> ---
>>
>> Brian O'Neill
>>
>> Chief Technology Officer
>>
>>
>> *Health Market Science*
>>
>> *The Science of Better Results*
>>
>> 2700 Horizon Drive * King of Prussia, PA * 19406
>>
>> M: 215.588.6024 * @boneill42   *
>>
>> healthmarketscience.com
>>
>>
>> This information transmitted in this email message is for the intended
>> recipient only and may contain confidential and/or privileged material. If
>> you received this email in error and are not the int

Re: NetworkTopologyStrategy ring distribution across 2 DC

2014-03-12 Thread Ramesh Natarajan
Thanks. The error is gone if I specify the keyspace name. However, the
replica count in the ring output is not correct. Shouldn't it say 3, since I
have DC1:3, DC2:3 in my schema?


thanks
Ramesh

Datacenter: DC1
==========
Replicas: 2

Address        Rack  Status  State   Load     Owns    Token
                                                      -9223372036854775808
192.168.1.107  RAC1  Up      Normal  4.72 MB  42.86%  6588122883467697004
192.168.1.106  RAC1  Up      Normal  4.73 MB  42.86%  3952873730080618202
192.168.1.105  RAC1  Up      Normal  4.8 MB   42.86%  1317624576693539400
192.168.1.104  RAC1  Up      Normal  4.77 MB  42.86%  -1317624576693539402
192.168.1.103  RAC1  Up      Normal  4.83 MB  42.86%  -3952873730080618204
192.168.1.102  RAC1  Up      Normal  4.69 MB  42.86%  -6588122883467697006
192.168.1.101  RAC1  Up      Normal  4.8 MB   42.86%  -9223372036854775808

Datacenter: DC2
==========
Replicas: 2

Address        Rack  Status  State   Load     Owns    Token
                                                      3952873730080618203
192.168.1.111  RAC1  Up      Normal  4.73 MB  42.86%  -1317624576693539401
192.168.1.110  RAC1  Up      Normal  4.79 MB  42.86%  -3952873730080618203
192.168.1.109  RAC1  Up      Normal  3.16 MB  42.86%  -6588122883467697005
192.168.1.108  RAC1  Up      Normal  3.22 MB  42.86%  -9223372036854775807
192.168.1.114  RAC1  Up      Normal  4.69 MB  42.86%  6588122883467697005
192.168.1.112  RAC1  Up      Normal  4.76 MB  42.86%  1317624576693539401
192.168.1.113  RAC1  Up      Normal  3.19 MB  42.86%  3952873730080618203
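As a sanity check on the output above: the tokens are exactly evenly spaced Murmur3Partitioner initial tokens for seven nodes, with DC2 offset by one, and 42.86% is just 3/7. A quick sketch (plain arithmetic, no Cassandra required):

```python
# Evenly spaced initial tokens for a 7-node DC on Murmur3Partitioner:
# token_i = -2**63 + i * (2**64 // 7). A second DC conventionally
# offsets each token by +1 to avoid collisions with the first DC.
num_nodes = 7
spacing = 2**64 // num_nodes

dc1 = [-2**63 + i * spacing for i in range(num_nodes)]
dc2 = [t + 1 for t in dc1]

print(dc1[0])  # -9223372036854775808  (192.168.1.101)
print(dc1[6])  # 6588122883467697004   (192.168.1.107)
print(dc2[0])  # -9223372036854775807  (192.168.1.108)

# With RF=3 across 7 evenly spaced nodes, each node owns 3/7 of the
# ring, which is the 42.86% shown in the Owns column.
print(round(3 / 7 * 100, 2))  # 42.86
```

(The dangling token at the top of each datacenter's listing is just nodetool ring's wrap-around display of the last node's token.)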


On Tue, Mar 11, 2014 at 7:24 PM, Tyler Hobbs  wrote:

>
> On Tue, Mar 11, 2014 at 1:37 PM, Ramesh Natarajan wrote:
>
>>
>> Note: Ownership information does not include topology; for complete
>> information, specify a keyspace
>>
>> Also the owns column is 0% for the second DC.
>>
>> Is this normal?
>>
>
> Yes.
>
> Without a keyspace specified, the Owns column is showing the equivalent of
> SimpleStrategy with replication_factor=1.  If you specify a keyspace, it
> will take the replication strategy and options into account.
>
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Theo
I totally understand that. Spending time to maintain support for 2
different protocols is a significant overhead. From my own experience
contributing to open source projects, time is the biggest limiting factor.
My biased perspective is that CQL can be extended with additional features so that
query validation and optimization is easier. If we look at the history of
RDBMS and the development of query planners/optimizers, having the type
metadata is important. RDBMS don't have to deal with dynamic columns, since
the schema is static. Even then, there are dozens of papers from researchers
and implementers on how to optimize a query plan. If we look at Data grid
products, we see a similar thing. Coherence gives users the ability to
query their key/value data and get a query plan. I hope projects like
presto, impala, etc will provide these features eventually. I favor thrift
for a simple reason. My modeling tool and framework retains the type
information, so that makes it easier to build query optimizers. I realize
not everyone cares about this kind of stuff or needs to write complex
reports. I'm not suggesting others spend their valuable time improving
thrift. At the same time, if I'm willing to work on thrift and the
enhancements are acceptable to others, then Cassandra should include them.
If not, I'm happy to fork Cassandra and do my own thing. I can't be the
only person that needs to do complex reports.

peter




On Wed, Mar 12, 2014 at 11:20 AM, Theo Hultberg  wrote:

> Speaking as a CQL driver maintainer (Ruby) I'm +1 for end-of-lining Thrift.
>
> I agree with Edward that it's unfortunate that there are no official
> drivers being maintained by the Cassandra maintainers -- even though the
> current state with the Datastax drivers is in practice very close (it is
> not the same thing though).
>
> However, I don't agree that not having drivers in the same repo/project is
> a problem. Whether or not there's a Java driver in the Cassandra source or
> not doesn't matter at all to us non-Java developers, and I don't see any
> difference between the situation where there's no driver in the source or
> just a Java driver. I might have misunderstood Edward's point about this,
> though.
>
> The CQL protocol is the key, as others have mentioned. As long as that is
> maintained and respected, I think it's absolutely fine not having any
> drivers shipped as part of Cassandra. However, I feel this has not been
> the case lately. I'm thinking particularly about the UDT feature of 2.1,
> which is not a part of the CQL spec. There is no documentation on how
> drivers should handle them and what a user should be able to expect from a
> driver, they're completely implemented as custom types.
>
> I hope this will be fixed before 2.1 is released (and there's been good
> discussions on the mailing lists about how a driver should handle UDTs),
> but it shows a problem with the the-spec-is-the-truth argument. I think
> we'll be fine as long as the spec is the truth, but that requires the spec
> to be the truth and new features to not be bolted on outside of the spec.
>
> T#
>
>
> On Wed, Mar 12, 2014 at 3:23 PM, Peter Lin  wrote:
>
>> I'm enjoying the discussion also.
>>
>> @Brian
>> I've been looking at spark/shark along with other recent developments the
>> last few years. Berkeley has been doing some interesting stuff. One reason
>> I like Thrift is for type safety and the benefits for query validation and
>> query optimization. One could do similar things with CQL, but it's just
>> more work, especially with dynamic columns. I know others are mixing static
>> with dynamic columns, so I'm not alone. I have no clue how long it will
>> take to get there, but having tools like query explanation is a big time
>> saver. Writing business reports is hard enough, so every bit of help the
>> tool can provide makes it less painful.
>>
>>
>> On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:
>>
>>>
>>> just when you thought the thread died...
>>>
>>>
>>> First, let me say we are *WAY* off topic.  But that is a good thing.
>>> I love this community because there are a ton of passionate, smart
>>> people. (often with differing perspectives ;)
>>>
>>> RE: Reporting against C* (@Peter Lin)
>>> We've had the same experience.  Pig + Hadoop is painful.  We are
>>> experimenting with Spark/Shark, operating directly against the data.
>>>
>>> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>>>
>>> The Shark layer gives you SQL and caching capabilities that make it easy
>>> to use and fast (for smaller data sets).  In front of this, we are going to
>>> add dimensional aggregations so we can operate at larger scales.  (then the
>>> Hive reports will run against the aggregations)
>>>
>>> RE: REST Server (@Russell Bradberry)
>>> We had moderate success with Virgil, which was a REST server built
>>> directly on Thrift.  We built it directly on top of Thrift, so one day it
>>> could be easily embedded in the C* server

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Nate McCall
IME/O one of the best things about Cassandra was the separation of (and I'm
over-simplifying a bit, but still):

- The transport/API layer
- The Datacenter layer
- The Storage layer


> I don't think we're well-served by the "construction kit" approach.
> It's difficult enough to evaluate NoSQL without deciding if you should
> run CQLSandra or Hectorsandra or Intravertandra etc.

In tree, or even documented, I agree completely. I've never argued CQL3 is
not the best approach for new users.

But I've been around long enough that I know precisely what I want to do
sometimes and any general purpose API will get in the way of that.

I would like the transport/API layer to at least remain pluggable
("hackable" if you will) in its current form. I really just want to be
able to create my own *Daemon - as I can now - and go on my merry way
without having to modify any internals. Much like with compaction
strategies and SSTable components.

Do you intend to change this current behavior of allowing a custom
transport without code modification? (as opposed to changing the daemon
class in a script?).


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Edward Capriolo
Great points about the CQL driver and the supposed spec. It shows how a
driver living outside the project poses a problem to open source
development. How could custom types have been implemented without a spec?
In the apache world the saying is "If it did not happen on the list, it did
not happen." Did that happen here?

I still do not understand how an open source Apache Java database can rely
on third party client software to connect to said database. However the
committers seem comfortable with this arrangement to the point they are
willing to remove support for the other way to connect to the database.

Again, I am glad that the project has officially ended support for thrift
with this clear decree. For years the project kept saying "Thrift is not
going anywhere". It was obviously meant literally, as in the project would do
the absolute minimum to support it until they could make the case to remove
it completely.




On Wed, Mar 12, 2014 at 11:20 AM, Theo Hultberg  wrote:

> Speaking as a CQL driver maintainer (Ruby) I'm +1 for end-of-lining Thrift.
>
> I agree with Edward that it's unfortunate that there are no official
> drivers being maintained by the Cassandra maintainers -- even though the
> current state with the Datastax drivers is in practice very close (it is
> not the same thing though).
>
> However, I don't agree that not having drivers in the same repo/project is
> a problem. Whether or not there's a Java driver in the Cassandra source or
> not doesn't matter at all to us non-Java developers, and I don't see any
> difference between the situation where there's no driver in the source or
> just a Java driver. I might have misunderstood Edward's point about this,
> though.
>
> The CQL protocol is the key, as others have mentioned. As long as that is
> maintained and respected, I think it's absolutely fine not having any
> drivers shipped as part of Cassandra. However, I feel this has not been
> the case lately. I'm thinking particularly about the UDT feature of 2.1,
> which is not a part of the CQL spec. There is no documentation on how
> drivers should handle them and what a user should be able to expect from a
> driver, they're completely implemented as custom types.
>
> I hope this will be fixed before 2.1 is released (and there's been good
> discussions on the mailing lists about how a driver should handle UDTs),
> but it shows a problem with the the-spec-is-the-truth argument. I think
> we'll be fine as long as the spec is the truth, but that requires the spec
> to be the truth and new features to not be bolted on outside of the spec.
>
> T#
>
>
> On Wed, Mar 12, 2014 at 3:23 PM, Peter Lin  wrote:
>
>> I'm enjoying the discussion also.
>>
>> @Brian
>> I've been looking at spark/shark along with other recent developments the
>> last few years. Berkeley has been doing some interesting stuff. One reason
>> I like Thrift is for type safety and the benefits for query validation and
>> query optimization. One could do similar things with CQL, but it's just
>> more work, especially with dynamic columns. I know others are mixing static
>> with dynamic columns, so I'm not alone. I have no clue how long it will
>> take to get there, but having tools like query explanation is a big time
>> saver. Writing business reports is hard enough, so every bit of help the
>> tool can provide makes it less painful.
>>
>>
>> On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:
>>
>>>
>>> just when you thought the thread died...
>>>
>>>
>>> First, let me say we are *WAY* off topic.  But that is a good thing.
>>> I love this community because there are a ton of passionate, smart
>>> people. (often with differing perspectives ;)
>>>
>>> RE: Reporting against C* (@Peter Lin)
>>> We've had the same experience.  Pig + Hadoop is painful.  We are
>>> experimenting with Spark/Shark, operating directly against the data.
>>>
>>> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>>>
>>> The Shark layer gives you SQL and caching capabilities that make it easy
>>> to use and fast (for smaller data sets).  In front of this, we are going to
>>> add dimensional aggregations so we can operate at larger scales.  (then the
>>> Hive reports will run against the aggregations)
>>>
>>> RE: REST Server (@Russell Bradberry)
>>> We had moderate success with Virgil, which was a REST server built
>>> directly on Thrift.  We built it directly on top of Thrift, so one day it
>>> could be easily embedded in the C* server itself.   It could be deployed
>>> separately, or run an embedded C*.  More often than not, we ended up
>>> running it separately to separate the layers.  (just like Titan and
>>> Rexster)  I've started on a rewrite of Virgil called Memnon that rides on
>>> top of CQL. (I'd love some help)
>>> https://github.com/boneill42/memnon
>>>
>>> RE: CQL vs. Thrift
>>> We've hitched our wagons to CQL.  CQL != Relational.
>>> We've had success translating our "native" schemas into CQL, including
>>> all the No

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Tupshin Harper
I agree that we are way off the initial topic, but I think we are spot on
the most important topic. As seen in various tickets, including #6704 (wide
row scanners), #6167 (end-slice termination predicate), the existence
of intravert-ug (Cassandra interface to intravert), and a number of others,
there is an increasing desire to do more complicated processing,
server-side, on a Cassandra cluster.

I very much share those goals, and would like to propose the following only
partially hand-wavey path forward.

Instead of creating a pluggable interface for Thrift, I'd like to create a
pluggable interface for arbitrary app-server deep integration.

Inspired by both the existence of intravert-ug, as well as there being a
long history of various parties embedding tomcat or jetty servlet engines
inside Cassandra, I'd like to propose the creation of an internal, somewhat
stable (versioned?) interface that could allow any app server to achieve
deep integration with Cassandra, and as a result, these servers could
1) host their own APIs (REST, for example)
2) extend core functionality by having limited (see triggers and wide row
scanners) access to the internals of cassandra

The hand-wavey part comes in because while I have been mulling this over for
a while, I have not spent any significant time looking at the actual
surface area of intravert-ug's integration. But, using it as a model, and
also keeping in mind the general needs of your more traditional
servlet/j2ee containers, I believe we could come up with a reasonable
interface to allow any jvm app server to be integrated and maintained in or
out of the Cassandra tree.

This would satisfy the need that many of us (both Ed and I, for example)
have for a much greater degree of control over server-side execution, and to
be able to start building much more interestingly (and simply) tiered
applications.

Anybody interested in working on a coherent proposal with me?

-Tupshin


On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:

>
> just when you thought the thread died...
>
>
> First, let me say we are *WAY* off topic.  But that is a good thing.
> I love this community because there are a ton of passionate, smart people.
> (often with differing perspectives ;)
>
> RE: Reporting against C* (@Peter Lin)
> We've had the same experience.  Pig + Hadoop is painful.  We are
> experimenting with Spark/Shark, operating directly against the data.
> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>
> The Shark layer gives you SQL and caching capabilities that make it easy
> to use and fast (for smaller data sets).  In front of this, we are going to
> add dimensional aggregations so we can operate at larger scales.  (then the
> Hive reports will run against the aggregations)
>
> RE: REST Server (@Russell Bradberry)
> We had moderate success with Virgil, which was a REST server built
> directly on Thrift.  We built it directly on top of Thrift, so one day it
> could be easily embedded in the C* server itself.   It could be deployed
> separately, or run an embedded C*.  More often than not, we ended up
> running it separately to separate the layers.  (just like Titan and
> Rexster)  I've started on a rewrite of Virgil called Memnon that rides on
> top of CQL. (I'd love some help)
> https://github.com/boneill42/memnon
>
> RE: CQL vs. Thrift
> We've hitched our wagons to CQL.  CQL != Relational.
> We've had success translating our "native" schemas into CQL, including all
> the NoSQL goodness of wide-rows, etc.  You just need a good understanding
> of how things translate into storage and underlying CFs.  If anything, I
> think we could add some DESCRIBE information, which would help users with
> this, along the lines of:
> (https://issues.apache.org/jira/browse/CASSANDRA-6676)
>
> CQL does open up the *opportunity* for users to articulate more complex
> queries using more familiar syntax.  (including future things such as
> joins, grouping, etc.)   To me, that is exciting, and again -- one of the
> reasons we are leaning on it.
>
> my two cents,
> brian
>
>
> From: Pete

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Nate
I don't want to change the separation of components in cassandra. My
ultimate goal is "make writing complex queries less painful and more
efficient." How that becomes reality is anyone's guess. There's different
ways to get there. I also like having a pluggable transport layer, which is
why I feel sad every time I hear people say "thrift is dead" or "thrift is
frozen beyond 2.1" or "don't use thrift". When people ask me what to learn
with Cassandra, I say both thrift and CQL. Not everyone has time to read
the native protocol spec or dive into cassandra code, but clearly "some"
people do and enjoy it. I understand some people don't want the burden of
maintaining Thrift, and it's totally valid. It's up to those that want to
keep thrift to make sure patches and enhancements are well tested and solid.





On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall wrote:

> IME/O one of the best things about Cassandra was the separation of (and
> I'm over-simplifying a bit, but still):
>
> - The transport/API layer
> - The Datacenter layer
> - The Storage layer
>
>
> > I don't think we're well-served by the "construction kit" approach.
> > It's difficult enough to evaluate NoSQL without deciding if you should
> > run CQLSandra or Hectorsandra or Intravertandra etc.
>
> In tree, or even documented, I agree completely. I've never argued CQL3 is
> not the best approach for new users.
>
> But I've been around long enough that I know precisely what I want to do
> sometimes and any general purpose API will get in the way of that.
>
> I would like the transport/API layer to at least remain pluggable
> ("hackable" if you will) in its current form. I really just want to be
> able to create my own *Daemon - as I can now - and go on my merry way
> without having to modify any internals. Much like with compaction
> strategies and SSTable components.
>
> Do you intend to change this current behavior of allowing a custom
> transport without code modification? (as opposed to changing the daemon
> class in a script?).
>
>


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Russell Bradberry
@Nate, @Tupshin, this is pretty close to what I had in mind. I would be open to 
helping out with a formal proposal.



On March 12, 2014 at 12:11:41 PM, Tupshin Harper (tups...@tupshin.com) wrote:

I agree that we are way off the initial topic, but I think we are spot on the 
most important topic. As seen in various tickets, including #6704 (wide row 
scanners), #6167 (end-slice termination predicate), the existence of 
intravert-ug (Cassandra interface to intravert), and a number of others, there 
is an increasing desire to do more complicated processing, server-side, on a 
Cassandra cluster.

I very much share those goals, and would like to propose the following only 
partially hand-wavey path forward.

Instead of creating a pluggable interface for Thrift, I'd like to create a 
pluggable interface for arbitrary app-server deep integration.

Inspired by both the existence of intravert-ug, as well as there being a long 
history of various parties embedding tomcat or jetty servlet engines inside 
Cassandra, I'd like to propose the creation of an internal, somewhat stable 
(versioned?) interface that could allow any app server to achieve deep 
integration with Cassandra, and as a result, these servers could 
1) host their own APIs (REST, for example)
2) extend core functionality by having limited (see triggers and wide row 
scanners) access to the internals of cassandra

The hand-wavey part comes in because while I have been mulling this over for a 
while, I have not spent any significant time looking at the actual surface 
area of intravert-ug's integration. But, using it as a model, and also keeping 
in mind the general needs of your more traditional servlet/j2ee containers, I 
believe we could come up with a reasonable interface to allow any jvm app 
server to be integrated and maintained in or out of the Cassandra tree.

This would satisfy the need that many of us (both Ed and I, for example) 
have for a much greater degree of control over server-side execution, and to be 
able to start building much more interestingly (and simply) tiered applications.

Anybody interested in working on a coherent proposal with me?

-Tupshin


On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill  wrote:

just when you thought the thread died…


First, let me say we are *WAY* off topic.  But that is a good thing.  
I love this community because there are a ton of passionate, smart people. 
(often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin)
We’ve had the same experience.  Pig + Hadoop is painful.  We are experimenting 
with Spark/Shark, operating directly against the data.
http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

The Shark layer gives you SQL and caching capabilities that make it easy to use 
and fast (for smaller data sets).  In front of this, we are going to add 
dimensional aggregations so we can operate at larger scales.  (then the Hive 
reports will run against the aggregations)

RE: REST Server (@Russell Bradberry)
We had moderate success with Virgil, which was a REST server built directly on 
Thrift.  We built it directly on top of Thrift, so one day it could be easily 
embedded in the C* server itself.   It could be deployed separately, or run an 
embedded C*.  More often than not, we ended up running it separately to 
separate the layers.  (just like Titan and Rexster)  I’ve started on a rewrite 
of Virgil called Memnon that rides on top of CQL. (I’d love some help)
https://github.com/boneill42/memnon

RE: CQL vs. Thrift
We’ve hitched our wagons to CQL.  CQL != Relational.  
We’ve had success translating our “native” schemas into CQL, including all the 
NoSQL goodness of wide-rows, etc.  You just need a good understanding of how 
things translate into storage and underlying CFs.  If anything, I think we 
could add some DESCRIBE information, which would help users with this, along 
the lines of:
(https://issues.apache.org/jira/browse/CASSANDRA-6676)

CQL does open up the *opportunity* for users to articulate more complex queries 
using more familiar syntax.  (including future things such as joins, grouping, 
etc.)   To me, that is exciting, and again — one of the reasons we are leaning 
on it.

my two cents,
brian


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Tupshin Harper
Peter,

I didn't specifically call it out, but the interface I just proposed in my
last email is very much aimed at the goal of "making writing complex
queries less painful and more efficient," by providing a deep integration
mechanism to host that code. It's very much an "enough rope to hang
ourselves" approach, but badly needed, IMO.

-Tupshin
On Mar 12, 2014 12:12 PM, "Peter Lin"  wrote:

>
> @Nate
> I don't want to change the separation of components in cassandra. My
> ultimate goal is "make writing complex queries less painful and more
> efficient." How that becomes reality is anyone's guess. There's different
> ways to get there. I also like having a pluggable transport layer, which is
> why I feel sad every time I hear people say "thrift is dead" or "thrift is
> frozen beyond 2.1" or "don't use thrift". When people ask me what to learn
> with Cassandra, I say both thrift and CQL. Not everyone has time to read
> the native protocol spec or dive into cassandra code, but clearly "some"
> people do and enjoy it. I understand some people don't want the burden of
> maintaining Thrift, and it's totally valid. It's up to those that want to
> keep thrift to make sure patches and enhancements are well tested and solid.
>
>
>
>
>
> On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall wrote:
>
>> IME/O one of the best things about Cassandra was the separation of (and
>> I'm over-simplifying a bit, but still):
>>
>> - The transport/API layer
>> - The Datacenter layer
>> - The Storage layer
>>
>>
>> > I don't think we're well-served by the "construction kit" approach.
>> > It's difficult enough to evaluate NoSQL without deciding if you should
>> > run CQLSandra or Hectorsandra or Intravertandra etc.
>>
>> In tree, or even documented, I agree completely. I've never argued CQL3
>> is not the best approach for new users.
>>
>> But I've been around long enough that I know precisely what I want to do
>> sometimes and any general purpose API will get in the way of that.
>>
>> I would like the transport/API layer to at least remain pluggable
>> ("hackable" if you will) in its current form. I really just want to be
>> able to create my own *Daemon - as I can now - and go on my merry way
>> without having to modify any internals. Much like with compaction
>> strategies and SSTable components.
>>
>> Do you intend to change this current behavior of allowing a custom
>> transport without code modification? (as opposed to changing the daemon
>> class in a script?).
>>
>>
>


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Nate & Tupshin, glad to help where I can


On Wed, Mar 12, 2014 at 12:14 PM, Russell Bradberry wrote:

> @Nate, @Tupshin, this is pretty close to what I had in mind. I would be
> open to helping out with a formal proposal.
>
>
>
> On March 12, 2014 at 12:11:41 PM, Tupshin Harper (tups...@tupshin.com)
> wrote:
>
> I agree that we are way off the initial topic, but I think we are spot on
> the most important topic. As seen in various tickets, including #6704 (wide
> row scanners), #6167 (end-slice termination predicate), the existence
> of intravert-ug (Cassandra interface to intravert), and a number of others,
> there is an increasing desire to do more complicated processing,
> server-side, on a Cassandra cluster.
>
> I very much share those goals, and would like to propose the following
> only partially hand-wavey path forward.
>
> Instead of creating a pluggable interface for Thrift, I'd like to create a
> pluggable interface for arbitrary app-server deep integration.
>
> Inspired both by the existence of intravert-ug and by the long history of
> various parties embedding Tomcat or Jetty servlet engines inside
> Cassandra, I'd like to propose the creation of an internal, somewhat
> stable (versioned?) interface that would allow any app server to achieve
> deep integration with Cassandra. As a result, these servers could:
> 1) host their own APIs (REST, for example)
> 2) extend core functionality through limited (see triggers and wide row
> scanners) access to the internals of Cassandra
>
> The hand-wavey part comes in because, while I have been mulling this over
> for a while, I have not spent any significant time looking at the actual
> surface area of intravert-ug's integration. But, using it as a model, and
> also keeping in mind the general needs of more traditional servlet/J2EE
> containers, I believe we could come up with a reasonable interface that
> would allow any JVM app server to be integrated and maintained in or out
> of the Cassandra tree.
>
> This would satisfy the need that many of us (both Ed and I, for example)
> have for a much greater degree of control over server-side execution, and
> let us start building much more interesting (and simpler) tiered
> applications.
>
> Anybody interested in working on a coherent proposal with me?
>
> -Tupshin
>
>
> On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:
>
>>
>> just when you thought the thread died...
>>
>>
>> First, let me say we are *WAY* off topic.  But that is a good thing.
>> I love this community because there are a ton of passionate, smart
>> people. (often with differing perspectives ;)
>>
>> RE: Reporting against C* (@Peter Lin)
>> We've had the same experience.  Pig + Hadoop is painful.  We are
>> experimenting with Spark/Shark, operating directly against the data.
>> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>>
>> The Shark layer gives you SQL and caching capabilities that make it easy
>> to use and fast (for smaller data sets).  In front of this, we are going to
>> add dimensional aggregations so we can operate at larger scales.  (then the
>> Hive reports will run against the aggregations)
>>
>> RE: REST Server (@Russell Bradberry)
>> We had moderate success with Virgil, a REST server built directly on top
>> of Thrift so that one day it could be easily embedded in the C* server
>> itself.  It could be deployed
>> separately, or run an embedded C*.  More often than not, we ended up
>> running it separately to separate the layers.  (just like Titan and
>> Rexster)  I've started on a rewrite of Virgil called Memnon that rides on
>> top of CQL. (I'd love some help)
>> https://github.com/boneill42/memnon
>>
>> RE: CQL vs. Thrift
>> We've hitched our wagons to CQL.  CQL != Relational.
>> We've had success translating our "native" schemas into CQL, including
>> all the NoSQL goodness of wide-rows, etc.  You just need a good
>> understanding of how things translate into storage and underlying CFs.  If
>> anything, I think we could add some DESCRIBE information, which would help
>> users with this, along the lines of:
>> (https://issues.apache.org/jira/browse/CASSANDRA-6676)
>>
>> CQL does open up the *opportunity* for users to articulate more complex
>> queries using more familiar syntax.  (including future things such as
>> joins, grouping, etc.)   To me, that is exciting, and again -- one of the
>> reasons we are leaning on it.
>>
>> my two cents,
>> brian
>>
>>   ---
>>
>> Brian O'Neill
>>
>> Chief Technology Officer
>>
>>
>>  *Health Market Science*
>>
>> *The Science of Better Results*
>>
>> 2700 Horizon Drive * King of Prussia, PA * 19406
>>
>> M: 215.588.6024 * @boneill42   *
>>
>> healthmarketscience.com
>>
>>

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Peter Lin
@Tupshin
LOL, there's always enough rope to hang oneself. I agree it's badly needed
for folks that really do need more "messy" queries. I was just discussing a
similar concept with a co-worker and going over the pros/cons of various
approaches to realizing the goal. I'm still digging into Presto. I saw some
people are working on support for Cassandra in Presto.



On Wed, Mar 12, 2014 at 12:15 PM, Tupshin Harper wrote:

> Peter,
>
> I didn't specifically call it out, but the interface I just proposed in my
> last email would very much serve the goal of "making writing complex
> queries less painful and more efficient" by providing a deep integration
> mechanism to host that code.  It's very much an "enough rope to hang
> ourselves" approach, but badly needed, IMO.
>
> -Tupshin
> On Mar 12, 2014 12:12 PM, "Peter Lin"  wrote:
>
>>
>> @Nate
>> I don't want to change the separation of components in cassandra. My
>> ultimate goal is "make writing complex queries less painful and more
>> efficient." How that becomes reality is anyone's guess. There's different
>> ways to get there. I also like having a pluggable transport layer, which is
>> why I feel sad every time I hear people say "thrift is dead" or "thrift is
>> frozen beyond 2.1" or "don't use thrift". When people ask me what to learn
>> with Cassandra, I say both thrift and CQL. Not everyone has time to read
>> the native protocol spec or dive into cassandra code, but clearly "some"
>> people do and enjoy it. I understand some people don't want the burden of
>> maintaining Thrift, and it's totally valid. It's up to those that want to
>> keep thrift to make sure patches and enhancements are well tested and solid.
>>
>>
>>
>>
>>
>> On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall wrote:
>>
>>> IME/O one of the best things about Cassandra was the separation of (and
>>> I'm over-simplifying a bit, but still):
>>>
>>> - The transport/API layer
>>> - The Datacenter layer
>>> - The Storage layer
>>>
>>>
>>> > I don't think we're well-served by the "construction kit" approach.
>>> > It's difficult enough to evaluate NoSQL without deciding if you should
>>> > run CQLSandra or Hectorsandra or Intravertandra etc.
>>>
>>> In tree, or even documented, I agree completely. I've never argued CQL3
>>> is not the best approach for new users.
>>>
>>> But I've been around long enough that I know precisely what I want to do
>>> sometimes and any general purpose API will get in the way of that.
>>>
>>> I would like the transport/API layer to at least remain pluggable
>>> ("hackable" if you will) in its current form. I really just want to be
>>> able to create my own *Daemon - as I can now - and go on my merry way
>>> without having to modify any internals. Much like with compaction
>>> strategies and SSTable components.
>>>
>>> Do you intend to change this current behavior of allowing a custom
>>> transport without code modification? (as opposed to changing the daemon
>>> class in a script?).
>>>
>>>
>>


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Tupshin Harper
OK, so I'm greatly encouraged by the level of interest in this. I went
ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846, and
will be starting to look into what the interface would have to look like.
Anybody should feel free to continue the discussion here, email me
privately, or comment on the ticket with your thoughts.

-Tupshin


On Wed, Mar 12, 2014 at 12:21 PM, Peter Lin  wrote:

>
> @Tupshin
> LOL, there's always enough rope to hang oneself. I agree it's badly needed
> for folks that really do need more "messy" queries. I was just discussing a
> similar concept with a co-worker and going over the pros/cons of various
> approaches to realizing the goal. I'm still digging into Presto. I saw some
> people are working on support for cassandra in presto.
>
>
>
> On Wed, Mar 12, 2014 at 12:15 PM, Tupshin Harper wrote:
>
>> Peter,
>>
>> I didn't specifically call it out, but the interface I just proposed in
>> my last email would be very much with the goal of "make writing complex
>> queries less painful and more efficient." by providing a deep integration
>> mechanism to host that code.  It's very much a "enough rope to hang
>> ourselves" approach, but badly needed,  IMO
>>
>> -Tupshin
>> On Mar 12, 2014 12:12 PM, "Peter Lin"  wrote:
>>
>>>
>>> @Nate
>>> I don't want to change the separation of components in cassandra. My
>>> ultimate goal is "make writing complex queries less painful and more
>>> efficient." How that becomes reality is anyone's guess. There's different
>>> ways to get there. I also like having a pluggable transport layer, which is
>>> why I feel sad every time I hear people say "thrift is dead" or "thrift is
>>> frozen beyond 2.1" or "don't use thrift". When people ask me what to learn
>>> with Cassandra, I say both thrift and CQL. Not everyone has time to read
>>> the native protocol spec or dive into cassandra code, but clearly "some"
>>> people do and enjoy it. I understand some people don't want the burden of
>>> maintaining Thrift, and it's totally valid. It's up to those that want to
>>> keep thrift to make sure patches and enhancements are well tested and solid.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall wrote:
>>>
 IME/O one of the best things about Cassandra was the separation of (and
 I'm over-simplifying a bit, but still):

 - The transport/API layer
 - The Datacenter layer
 - The Storage layer


 > I don't think we're well-served by the "construction kit" approach.
 > It's difficult enough to evaluate NoSQL without deciding if you should
 > run CQLSandra or Hectorsandra or Intravertandra etc.

 In tree, or even documented, I agree completely. I've never argued CQL3
 is not the best approach for new users.

 But I've been around long enough that I know precisely what I want to
 do sometimes and any general purpose API will get in the way of that.

 I would like the transport/API layer to at least remain pluggable
 ("hackable" if you will) in its current form. I really just want to be
 able to create my own *Daemon - as I can now - and go on my merry way
 without having to modify any internals. Much like with compaction
 strategies and SSTable components.

 Do you intend to change this current behavior of allowing a custom
 transport without code modification? (as opposed to changing the daemon
 class in a script?).


>>>
>


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Nate McCall
Awesome! Thanks Tupshin (and everyone else). I'll put some of my thoughts
up there shortly.

On Wed, Mar 12, 2014 at 11:26 AM, Tupshin Harper wrote:

> OK, so I'm greatly encouraged by the level of interest in this. I went
> ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846,
> and will be starting to look into what the interface would have to look
> like. Anybody feel free to continue the discussion here, email me
> privately, or comment on ticket with your thoughts.
>
> -Tupshin
>
>


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Edward Capriolo
@Tupshin

I like that approach; right now I think of that piece as the
"StorageProxy". I agree that over the years people have taken that
approach. Solandra is a good example, and I am guessing DSE Solr works the
same way. This says something about the entire "thrift vs cql" debate, as
there are clearly power users writing applications that use neither.

I do feel this vote was called to shoot down any attempt to add a feature
that was non-CQL. However, if you think you can drive something like this
forward, more power to you; I will help out.





On Wed, Mar 12, 2014 at 12:11 PM, Tupshin Harper wrote:

> I agree that we are way off the initial topic, but I think we are spot on
> the most important topic. As seen in various tickets, including #6704 (wide
> row scanners), #6167 (end-slice termination predicate), the existence
> of intravert-ug (Cassandra interface to intravert), and a number of others,
> there is an increasing desire to do more complicated processing,
> server-side, on a Cassandra cluster.
>
> I very much share those goals, and would like to propose the following
> only partially hand-wavey path forward.
>
> Instead of creating a pluggable interface for Thrift, I'd like to create a
> pluggable interface for arbitrary app-server deep integration.
>
> Inspired both by the existence of intravert-ug and by the long history of
> various parties embedding Tomcat or Jetty servlet engines inside
> Cassandra, I'd like to propose the creation of an internal, somewhat
> stable (versioned?) interface that would allow any app server to achieve
> deep integration with Cassandra. As a result, these servers could:
> 1) host their own APIs (REST, for example)
> 2) extend core functionality through limited (see triggers and wide row
> scanners) access to the internals of Cassandra
>
> The hand-wavey part comes in because, while I have been mulling this over
> for a while, I have not spent any significant time looking at the actual
> surface area of intravert-ug's integration. But, using it as a model, and
> also keeping in mind the general needs of more traditional servlet/J2EE
> containers, I believe we could come up with a reasonable interface that
> would allow any JVM app server to be integrated and maintained in or out
> of the Cassandra tree.
>
> This would satisfy the need that many of us (both Ed and I, for example)
> have for a much greater degree of control over server-side execution, and
> let us start building much more interesting (and simpler) tiered
> applications.
>
> Anybody interested in working on a coherent proposal with me?
>
> -Tupshin
>
>
> On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill wrote:
>
>>
>> just when you thought the thread died...
>>
>>
>> First, let me say we are *WAY* off topic.  But that is a good thing.
>> I love this community because there are a ton of passionate, smart
>> people. (often with differing perspectives ;)
>>
>> RE: Reporting against C* (@Peter Lin)
>> We've had the same experience.  Pig + Hadoop is painful.  We are
>> experimenting with Spark/Shark, operating directly against the data.
>> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>>
>> The Shark layer gives you SQL and caching capabilities that make it easy
>> to use and fast (for smaller data sets).  In front of this, we are going to
>> add dimensional aggregations so we can operate at larger scales.  (then the
>> Hive reports will run against the aggregations)
>>
>> RE: REST Server (@Russell Bradberry)
>> We had moderate success with Virgil, a REST server built directly on top
>> of Thrift so that one day it could be easily embedded in the C* server
>> itself.  It could be deployed
>> separately, or run an embedded C*.  More often than not, we ended up
>> running it separately to separate the layers.  (just like Titan and
>> Rexster)  I've started on a rewrite of Virgil called Memnon that rides on
>> top of CQL. (I'd love some help)
>> https://github.com/boneill42/memnon
>>
>> RE: CQL vs. Thrift
>> We've hitched our wagons to CQL.  CQL != Relational.
>> We've had success translating our "native" schemas into CQL, including
>> all the NoSQL goodness of wide-rows, etc.  You just need a good
>> understanding of how things translate into storage and underlying CFs.  If
>> anything, I think we could add some DESCRIBE information, which would help
>> users with this, along the lines of:
>> (https://issues.apache.org/jira/browse/CASSANDRA-6676)
>>
>> CQL does open up the *opportunity* for users to articulate more complex
>> queries using more familiar syntax.  (including future things such as
>> joins, grouping, etc.)   To me, that is exciting, and again -- one of the
>> reasons we are leaning on it.
>>
>> my two cents,
>> brian
>>

Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Robert Coli
On Wed, Mar 12, 2014 at 9:10 AM, Edward Capriolo wrote:

> Again, I am glad that the project has officially ended support for thrift
> with this clear decree. For years the project kept saying "Thrift is not
> going anywhere". It was obviously meant literally like the project would do
> the absolute minimum to support it until they could make the case to remove
> it completely.
>

Yes, I didn't realize at the time, but both meanings of "not going
anywhere" were apparently intended.

"Not going anywhere" as in not likely to be removed (for another few major
versions at least)
but also
"Not going anywhere" as in being the (un/semi/barely-)maintained second
class citizen API

For the record, I have always presumed that thrift will eventually be
removed from the codebase, so for me this new announcement does not
generate new surprise or outrage. Separate cannot be equal, and eventually
the pain of keeping it in there will outweigh the pain of deprecating it.
Even though I do not use CQL3 or the binary protocol and the removal of
thrift would force me to do so, having two APIs is so bizarro that I'm left
hoping that it *is* eventually deprecated...

=Rob


Opscenter help?

2014-03-12 Thread Drew from Zhrodague
	I am having a hard time installing the DataStax OpsCenter agents on EL6 
and EL5 hosts. Where is an appropriate place to ask for help? DataStax 
has moved their forums to Stack Exchange, which seems to be a waste of 
time, as I don't have enough reputation points to properly tag my questions.


The agent installation seems to be broken:
[] agent rpm conflicts with sudo
	[] install from OpsCenter does not work, even if manually installing 
the rpm (requires --force, conflicts with sudo)

[] error message re: log4j #noconf
[] Could not find the main class: opsagent.opsagent. Program will exit.
[] No other (helpful/more in-depth) documentation exists


--

Drew from Zhrodague
post-apocalyptic ad-hoc industrialist
d...@zhrodague.net


Java heap size does not change on Windows

2014-03-12 Thread Lukas Steiblys
I am running Windows Server 2008 R2 Enterprise on a 2 Core Intel Xeon with 16GB 
of RAM and I want to change the max heap size. I set MAX_HEAP_SIZE in 
cassandra-env.sh, but when I start Cassandra, it’s still reporting:

INFO 12:37:36,221 Global memtable threshold is enabled at 247MB
INFO 12:37:36,377 using multi-threaded compaction
INFO 12:37:36,705 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.7.0_51
INFO 12:37:36,705 Heap size: 1037959168/1037959168

My question is: how do I change the heap size?

Lukas Steiblys


Re: Java heap size does not change on Windows

2014-03-12 Thread Tyler Hobbs
cassandra-env.sh is only used on *nix systems.  You'll need to change
bin/cassandra.bat.  Interestingly, that's hardcoded to use a 1G heap, which
seems like a bug.
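Until that is fixed, a workaround is to edit the hardcoded flags in bin\cassandra.bat directly. A rough sketch (the exact layout varies by Cassandra version, and 8G is only an example for a 16GB box, not a recommendation):

```bat
REM In bin\cassandra.bat, find the JVM options and change the hardcoded
REM 1G heap flags, e.g. replace -Xms1G / -Xmx1G with:
set JAVA_OPTS=-ea^
 -Xms8G^
 -Xmx8G^
 ...
```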


On Wed, Mar 12, 2014 at 2:40 PM, Lukas Steiblys wrote:

>   I am running Windows Server 2008 R2 Enterprise on a 2 Core Intel Xeon
> with 16GB of RAM and I want to change the max heap size. I set
> MAX_HEAP_SIZE in cassandra-env.sh, but when I start Cassandra, it's still
> reporting:
>
> INFO 12:37:36,221 Global memtable threshold is enabled at 247MB
> INFO 12:37:36,377 using multi-threaded compaction
> INFO 12:37:36,705 JVM vendor/version: Java HotSpot(TM) 64-Bit Server
> VM/1.7.0_51
> INFO 12:37:36,705 Heap size: 1037959168/1037959168
>
> My question is: how do I change the heap size?
>
> Lukas Steiblys
>
>



-- 
Tyler Hobbs
DataStax 


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Edward Capriolo
This brainstorming idea has already been -1'ed in JIRA. ROFL.


On Wed, Mar 12, 2014 at 12:26 PM, Tupshin Harper wrote:

> OK, so I'm greatly encouraged by the level of interest in this. I went
> ahead and created https://issues.apache.org/jira/browse/CASSANDRA-6846,
> and will be starting to look into what the interface would have to look
> like. Anybody feel free to continue the discussion here, email me
> privately, or comment on ticket with your thoughts.
>
> -Tupshin
>
>
> On Wed, Mar 12, 2014 at 12:21 PM, Peter Lin  wrote:
>
>>
>> @Tupshin
>> LOL, there's always enough rope to hang oneself. I agree it's badly
>> needed for folks that really do need more "messy" queries. I was just
>> discussing a similar concept with a co-worker and going over the pros/cons
>> of various approaches to realizing the goal. I'm still digging into Presto.
>> I saw some people are working on support for cassandra in presto.
>>
>>
>>
>> On Wed, Mar 12, 2014 at 12:15 PM, Tupshin Harper wrote:
>>
>>> Peter,
>>>
>>> I didn't specifically call it out, but the interface I just proposed in
>>> my last email would be very much with the goal of "make writing complex
>>> queries less painful and more efficient." by providing a deep integration
>>> mechanism to host that code.  It's very much a "enough rope to hang
>>> ourselves" approach, but badly needed,  IMO
>>>
>>> -Tupshin
>>> On Mar 12, 2014 12:12 PM, "Peter Lin"  wrote:
>>>

 @Nate
 I don't want to change the separation of components in cassandra. My
 ultimate goal is "make writing complex queries less painful and more
 efficient." How that becomes reality is anyone's guess. There's different
 ways to get there. I also like having a pluggable transport layer, which is
 why I feel sad every time I hear people say "thrift is dead" or "thrift is
 frozen beyond 2.1" or "don't use thrift". When people ask me what to learn
 with Cassandra, I say both thrift and CQL. Not everyone has time to read
 the native protocol spec or dive into cassandra code, but clearly "some"
 people do and enjoy it. I understand some people don't want the burden of
 maintaining Thrift, and it's totally valid. It's up to those that want to
 keep thrift to make sure patches and enhancements are well tested and 
 solid.





 On Wed, Mar 12, 2014 at 11:52 AM, Nate McCall 
 wrote:

> IME/O one of the best things about Cassandra was the separation of
> (and I'm over-simplifying a bit, but still):
>
> - The transport/API layer
> - The Datacenter layer
> - The Storage layer
>
>
> > I don't think we're well-served by the "construction kit" approach.
> > It's difficult enough to evaluate NoSQL without deciding if you
> should
> > run CQLSandra or Hectorsandra or Intravertandra etc.
>
> In tree, or even documented, I agree completely. I've never argued
> CQL3 is not the best approach for new users.
>
> But I've been around long enough that I know precisely what I want to
> do sometimes and any general purpose API will get in the way of that.
>
> I would like the transport/API layer to at least remain pluggable
> ("hackable" if you will) in its current form. I really just want to be
> able to create my own *Daemon - as I can now - and go on my merry way
> without having to modify any internals. Much like with compaction
> strategies and SSTable components.
>
> Do you intend to change this current behavior of allowing a custom
> transport without code modification? (as opposed to changing the daemon
> class in a script?).
>
>

>>
>


[no subject]

2014-03-12 Thread Batranut Bogdan
Hello all,

The environment:

I have a 6 node Cassandra cluster. On each node I have:
- 32 G RAM
- 24 G RAM for cassa
- ~150 - 200 MB/s disk speed
- tomcat 6 with axis2 webservice that uses the datastax java driver to make
asynch reads / writes 
- replication factor for the keyspace is 3

All nodes in the same data center 
The clients that read / write are in the same datacenter so network is
Gigabit.

Writes are performed via exposed methods from Axis2 WS. The Cassandra Java
driver uses the round robin load balancing policy so all the nodes in the
cluster should be hit with write requests under heavy write or read load
from multiple clients.

I am monitoring all nodes with JConsole from another box.

The problem:

When writing to a particular column family, only 3 nodes have a high CPU
load, ~80 - 99%. The remaining 3 are at ~2 - 10% CPU. During writes, reads
time out.

I need more speed for both writes and reads. The fact that 3 nodes barely
have CPU activity leads me to think that the whole potential of C* is not
being touched.

I am running out of ideas...

If further details about the environment are needed, I can provide them.


Thank you very much.

Dead node seen as UP by replacement node

2014-03-12 Thread Paulo Ricardo Motta Gomes
Hello,

I'm trying to replace a dead node using the procedure in [1], but the
replacement node initially sees the dead node as UP, and after a few
minutes the node is marked as DOWN again, failing the streaming/bootstrap
procedure of the replacement node. This dead node is always seen as DOWN by
the rest of the cluster.

Could this be a bug? I can easily reproduce it in our production
environment, but don't know if it's reproducible in a clean environment.

Version: 1.2.13

Here is the log from the replacement node (192.168.1.10 is the dead node):

 INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node
/192.168.1.10 is now part of the cluster
 INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809)
InetAddress /192.168.1.10 is now UP
 INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823)
InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
(line 110) Stream failed because /192.168.1.10 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)
 WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246)
Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
(line 110) Stream failed because /192.168.1.10 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)
 WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246)
Streaming from /192.168.1.10 failed

[1]
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

Cheers,

Paulo

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br *
+55 48 3232.3200
+55 83 9690-1314


Re: Dead node seen as UP by replacement node

2014-03-12 Thread Paulo Ricardo Motta Gomes
Some further info:

I'm not using vnodes, so I'm using the 1.1 replace-node trick of setting
initial_token in the cassandra.yaml file to the value of the dead node's
token minus 1, and auto_bootstrap=true. However, according to the Apache
wiki (
https://wiki.apache.org/cassandra/Operations#For_versions_1.2.0_and_above),
on 1.2 you should actually remove the dead node from the ring, before
adding a replacement node.

Does that mean the trick of setting the initial token to the value of the
dead node's token minus 1 (described in
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node)
is no longer valid in 1.2 without vnodes?
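Concretely, the 1.1-style workaround described above amounts to something like this in the replacement node's cassandra.yaml (token values are illustrative):

```yaml
# Replacing a dead node that owned token 100 (1.1-style trick, no vnodes):
auto_bootstrap: true   # stream data from the surviving replicas
initial_token: 99      # the dead node's token minus 1
```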


On Wed, Mar 12, 2014 at 5:57 PM, Paulo Ricardo Motta Gomes <
paulo.mo...@chaordicsystems.com> wrote:

> Hello,
>
> I'm trying to replace a dead node using the procedure in [1], but the
> replacement node initially sees the dead node as UP, and after a few
> minutes the node is marked as DOWN again, failing the streaming/bootstrap
> procedure of the replacement node. This dead node is always seen as DOWN by
> the rest of the cluster.
>
> Could this be a bug? I can easily reproduce it in our production
> environment, but don't know if it's reproducible in a clean environment.
>
> Version: 1.2.13
>
> Here is the log from the replacement node (192.168.1.10 is the dead node):
>
>  INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843)
> Node /192.168.1.10 is now part of the cluster
>  INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809)
> InetAddress /192.168.1.10 is now UP
>  INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823)
> InetAddress /192.168.1.10 is now DOWN
> ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
> (line 110) Stream failed because /192.168.1.10 died or was
> restarted/removed (streams may still be active in background, but further
> streams won't be started)
>  WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line
> 246) Streaming from /192.168.1.10 failed
> ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
> (line 110) Stream failed because /192.168.1.10 died or was
> restarted/removed (streams may still be active in background, but further
> streams won't be started)
>  WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line
> 246) Streaming from /192.168.1.10 failed
>
> [1]
> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
>
> Cheers,
>
> Paulo
>
> --
> *Paulo Motta*
>
> Chaordic | *Platform*
> *www.chaordic.com.br *
> +55 48 3232.3200
> +55 83 9690-1314
>



-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br *
+55 48 3232.3200
+55 83 9690-1314


Re:

2014-03-12 Thread Edward Capriolo
That is too much RAM for Cassandra; make that 6G to 10G.

The uneven perf could be because your requests do not shard evenly.
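For reference, on *nix these knobs live in conf/cassandra-env.sh; a sketch in the range being suggested here (sizes are illustrative and should be tuned against your own GC behavior):

```sh
# conf/cassandra-env.sh -- set both together, or neither
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"   # commonly sized at roughly 100MB per CPU core
```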

On Wednesday, March 12, 2014, Batranut Bogdan  wrote:
> Hello all,
>
> The environment:
>
> I have a 6 node Cassandra cluster. On each node I have:
> - 32 G RAM
> - 24 G RAM for cassa
> - ~150 - 200 MB/s disk speed
> - tomcat 6 with axis2 webservice that uses the datastax java driver to
> make asynch reads / writes
> - replication factor for the keyspace is 3
>
> All nodes in the same data center
> The clients that read / write are in the same datacenter so network is
> Gigabit.
>
> Writes are performed via exposed methods from Axis2 WS. The Cassandra Java
> driver uses the round robin load balancing policy so all the nodes in the
> cluster should be hit with write requests under heavy write or read load
> from multiple clients.
>
> I am monitoring all nodes with JConsole from another box.
>
> The problem:
>
> When writing to a particular column family, only 3 nodes have a high CPU
> load, ~80 - 99%. The remaining 3 are at ~2 - 10% CPU. During writes,
> reads time out.
>
> I need more speed for both writes and reads. The fact that 3 nodes
> barely have CPU activity leads me to think that the whole potential of
> C* is not being touched.
>
> I am running out of ideas...
>
> If further details about the environment are needed, I can provide them.
>
>
> Thank you very much.

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re:

2014-03-12 Thread Russ Bradberry
I wouldn't go above 8G unless you have a very powerful machine that can keep 
the GC pauses low.

Sent from my iPhone

> On Mar 12, 2014, at 7:11 PM, Edward Capriolo  wrote:
> 
> That is too much RAM for Cassandra; make that 6G to 10G.
> 
> The uneven perf could be because your requests do not shard evenly.
> 
> On Wednesday, March 12, 2014, Batranut Bogdan  wrote:
> > Hello all,
> >
> > The environment:
> >
> > I have a 6 node Cassandra cluster. On each node I have:
> > - 32 G RAM
> > - 24 G RAM for cassa
> > - ~150 - 200 MB/s disk speed
> > - tomcat 6 with axis2 webservice that uses the datastax java driver to make
> > asynch reads / writes 
> > - replication factor for the keyspace is 3
> >
> > All nodes in the same data center 
> > The clients that read / write are in the same datacenter so network is
> > Gigabit.
> >
> > Writes are performed via exposed methods from Axis2 WS . The Cassandra Java
> > driver uses the round robin load balancing policy so all the nodes in the
> > cluster should be hit with write requests under heavy write or read load
> > from multiple clients.
> >
> > I am monitoring all nodes with JConsole from another box.
> >
> > The problem:
> >
> > When writing to a particular column family, only 3 nodes have high CPU
> > load (~80-99%). The remaining 3 are at ~2-10% CPU. During writes, reads
> > time out.
> >
> > I need more speed for both writes and reads. The fact that 3 nodes barely
> > show any CPU activity leads me to think that the full potential of C* is
> > not being used.
> >
> > I am running out of ideas...
> >
> > If you need further details about the environment, I can provide them.
> >
> >
> > Thank you very much.
> 
> -- 
> Sorry this was sent from mobile. Will do less grammar and spell check than 
> usual.
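The heap cap Edward and Russ suggest is normally pinned in conf/cassandra-env.sh rather than left to the auto-sizing default, which would grab roughly half of the 32 GB box. A minimal sketch; the exact values below are illustrative assumptions, not settings from the thread:

```shell
# conf/cassandra-env.sh -- pin the JVM heap instead of accepting the
# auto-computed default. The remaining RAM is not wasted: Cassandra
# leans heavily on the OS page cache for read performance.
MAX_HEAP_SIZE="8G"
# Young-generation size; a common rule of thumb is ~100 MB per CPU core.
HEAP_NEWSIZE="800M"
```

Keeping the heap at or below ~8 GB keeps CMS garbage collection pauses manageable, which is the concern Russ raises above.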


Re:

2014-03-12 Thread David McNelis
Not knowing anything about your data structure (to expand on what Edward
said), you could be running into hot keys that receive the majority of
writes during those heavy loads. More specifically, I would look for a
single hot partition key: since your RF=3 and exactly 3 nodes are the ones
struggling, that pattern fits.
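To see why one hot key would pin exactly three of six nodes, here is a toy model of replica placement (SimpleStrategy-style: the token owner plus the next RF-1 nodes clockwise on the ring). The hash and node count are illustrative stand-ins, not Cassandra's actual Murmur3 partitioner:

```java
import java.util.*;

// Toy model: with RF=3, every partition key maps to exactly 3 consecutive
// nodes on the token ring, so a single hot key loads exactly 3 of 6 nodes.
public class HotKeyDemo {
    static final int NODES = 6;

    // Replicas for a key: the node owning the key's token plus the next
    // RF-1 nodes clockwise (SimpleStrategy-style placement).
    static List<Integer> replicasFor(String key, int rf) {
        int owner = Math.floorMod(key.hashCode(), NODES); // stand-in for a Murmur3 token
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < rf; i++) replicas.add((owner + i) % NODES);
        return replicas;
    }

    public static void main(String[] args) {
        // A million writes to ONE key always land on the same 3 nodes.
        Set<Integer> hit = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) hit.addAll(replicasFor("hot-key", 3));
        System.out.println("nodes hit by one hot key: " + hit.size()); // 3

        // Writes spread over many distinct keys touch the whole ring.
        Set<Integer> spread = new HashSet<>();
        for (int i = 0; i < 1000; i++) spread.addAll(replicasFor("key-" + i, 3));
        System.out.println("nodes hit by many keys: " + spread.size());
    }
}
```

If `nodetool cfstats` shows one partition far larger or hotter than the rest, this placement arithmetic is why precisely RF nodes suffer.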



On Wed, Mar 12, 2014 at 7:28 PM, Russ Bradberry wrote:

> I wouldn't go above 8G unless you have a very powerful machine that can
> keep the GC pauses low.
>
> Sent from my iPhone
>
> On Mar 12, 2014, at 7:11 PM, Edward Capriolo 
> wrote:
>
> That is too much ram for cassandra make that 6g to 10g.
>
> The uneven perf could be because your requests do not shard evenly.
>
> On Wednesday, March 12, 2014, Batranut Bogdan  wrote:
> > Hello all,
> >
> > The environment:
> >
> > I have a 6 node Cassandra cluster. On each node I have:
> > - 32 G RAM
> > - 24 G RAM for cassa
> > - ~150 - 200 MB/s disk speed
> > - tomcat 6 with axis2 webservice that uses the datastax java driver to
> make
> > asynch reads / writes
> > - replication factor for the keyspace is 3
> >
> > All nodes in the same data center
> > The clients that read / write are in the same datacenter so network is
> > Gigabit.
> >
> > Writes are performed via exposed methods from Axis2 WS . The Cassandra
> Java
> > driver uses the round robin load balancing policy so all the nodes in the
> > cluster should be hit with write requests under heavy write or read load
> > from multiple clients.
> >
> > I am monitoring all nodes with JConsole from another box.
> >
> > The problem:
> >
> > When wrinting to a particular column family, only 3 nodes have high CPU
> load
> > ~ 80 - 99 %. The remaining 3 are at ~2 - 10 % CPU. During writes, reads
> > timeout.
> >
> > I need more speed for both writes of reads. Due to the fact that 3 nodes
> > barely have CPU activity leads me to think that the whole potential for
> C*
> > is not touched.
> >
> > I am running out of ideas...
> >
> > If further details about the environment I can provide them.
> >
> >
> > Thank you very much.
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>
>


Re: Driver documentation questions

2014-03-12 Thread Alex Popescu
While this is a question that would fit better on the Java driver group
[1], I'll try to provide a very short answer:

1. Cluster is a long-lived object and the application should have only one
instance.
2. Session is also a long-lived object and you should try to have one
Session per keyspace.

A Session manages the connection pools for the nodes in the cluster and is
an expensive resource.

2.1. If your application uses many keyspaces, try to limit the number of
Sessions and use fully qualified identifiers.

3. PreparedStatements should be prepared only once.

Session and PreparedStatements are thread-safe and should be shared across
your app.

[1]
https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user
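
Put together, points 1-3 look roughly like the sketch below (DataStax Java driver of this era; the contact point, keyspace name, and query are illustrative assumptions, not details from the thread):

```java
import com.datastax.driver.core.*;

public class UserDao {
    // One Cluster per application: heavyweight, long-lived, thread-safe.
    private static final Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .build();

    // One Session per keyspace: it owns the per-node connection pools and,
    // unlike a JDBC Connection, is shared by all request-handling threads.
    private static final Session session = cluster.connect("my_keyspace");

    // Prepare once at startup; reuse the PreparedStatement from every thread.
    private static final PreparedStatement findById =
            session.prepare("SELECT * FROM users WHERE id = ?");

    public static Row findUser(java.util.UUID id) {
        // bind() produces a cheap, per-request BoundStatement.
        return session.execute(findById.bind(id)).one();
    }
}
```

The per-request object here is only the BoundStatement; there is no per-request connection checkout as in the JDBC pattern John describes below.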


On Fri, Mar 7, 2014 at 12:42 PM, Green, John M (HP Education) <
john.gr...@hp.com> wrote:

>  I’ve been tinkering with both the C++ and Java drivers but in neither
> case have I got a good indication of how threading and resource mgmt should
> be implemented in a long-lived multi-threaded application server
> process. That is, what should be the scope of a builder, a cluster,
> session, and statement? A JDBC connection is typically a per-thread
> affair. When an application server receives a request, it typically
>
> a)  gets JDBC connection from a connection pool,
>
> b)  processes the request
>
> c)   returns the connection to the JDBC connection pool.
>
>
>
> All the Cassandra driver sample code I’ve seen so far is for
> single-threaded command-line applications, so I’m wondering what is
> thread-safe (if anything) and what objects are “expensive” to instantiate.
> I’m assuming a Session is analogous to a JDBC connection, so when a
> request comes into my multi-threaded application server, should I create a
> new Session (or find a way to pool Sessions)? And should I be creating a
> new Cluster first? What about a builder?
>
>
>
> John “lost in the abyss”
>



-- 

:- a)


Alex Popescu
Sen. Product Manager @ DataStax
@al3xandru


Re: Opscenter help?

2014-03-12 Thread Jack Krupansky
Please do use Stack Overflow - that is the appropriate forum for OpsCenter 
support (unless you are a DataStax customer). Use the OpsCenter tag:


http://stackoverflow.com/tags/opscenter/info

-- Jack Krupansky

-Original Message- 
From: Drew from Zhrodague

Sent: Wednesday, March 12, 2014 2:51 PM
To: user@cassandra.apache.org
Subject: Opscenter help?

I am having a hard time installing the DataStax OpsCenter agents on EL6
and EL5 hosts. Where is an appropriate place to ask for help? DataStax
has moved their forums to Stack Exchange, which seems to be a waste of
time, as I don't have enough reputation points to properly tag my questions.

The agent installation seems to be broken:
[] agent rpm conflicts with sudo
[] install from opscenter does not work, even if manually installing
the RPM (requires --force, conflicts with sudo)
[] error message re: log4j #noconf
[] Could not find the main class: opsagent.opsagent. Program will exit.
[] No other (helpful/more in-depth) documentation exists


--

Drew from Zhrodague
post-apocalyptic ad-hoc industrialist
d...@zhrodague.net 



750Gb compaction task

2014-03-12 Thread Plotnik, Alexey
After rebalance and cleanup I have leveled CF (SSTable size = 100MB) and a 
compaction Task that is going to process ~750GB:

> root@da1-node1:~# nodetool compactionstats
pending tasks: 10556
  compaction typekeyspace   column family   completed   
total  unit  progress
   Compaction cafs_chunks  chunks 41015024065
808740269082 bytes 5.07%

I don't have enough space for this operation; I only have 300 GB free. Is it
possible to resolve this situation?