Re: Mystery PIG issue with 1.2.10

2013-09-26 Thread Janne Jalkanen

Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear 
answers appeared, I'm going to assume that this is a regression and file a JIRA 
ticket on this.

/Janne

On 26 Sep 2013, at 08:00, Aaron Morton  wrote:

>> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
>> > dump testc
>> (foo,{(ivalue,
>> ),(svalue,bar),(value,A)})
> 
> 
> 
> If the CQL 3 data ye wish to read, CqlStorage be the driver of your success. 
> 
> (btw there is a ticket out to update the example if you get excited 
> https://issues.apache.org/jira/browse/CASSANDRA-5709)
> 
> Cheers
> 
> 
> -
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 26/09/2013, at 3:57 AM, Chad Johnston  wrote:
> 
>> As an FYI, creating the table without the "WITH COMPACT STORAGE" and using 
>> CqlStorage works just fine in 1.2.10.
>> 
>> I know that CqlStorage and AbstractCassandraStorage got changed for 1.2.10 - 
>> maybe there's a regression with the existing CassandraStorage?
>> 
>> Chad
>> 
>> 
>> On Wed, Sep 25, 2013 at 1:51 AM, Janne Jalkanen  
>> wrote:
>> Heya!
>> 
>> I am seeing something rather strange in the way Cass 1.2 + Pig seem to 
>> handle integer values.
>> 
>> Setup: Cassandra 1.2.10, OSX 10.8, JDK 1.7u40, Pig 0.11.1.  Single node for 
>> testing this.
>> 
>> First a table:
>> 
>> > CREATE TABLE testc (
>>   key text PRIMARY KEY,
>>   ivalue int,
>>   svalue text,
>>   value bigint
>> ) WITH COMPACT STORAGE;
>> 
>> > insert into testc (key,ivalue,svalue,value) values ('foo',10,'bar',65);
>> > select * from testc;
>> 
>>  key | ivalue | svalue | value
>> -----+--------+--------+-------
>>  foo |     10 |    bar |    65
>> 
>> For my Pig setup, I then use libraries from different C* versions to 
>> actually talk to my database (which stays on 1.2.10 all the time).
>> 
>> Cassandra 1.0.12 (using cassandra_storage.jar):
>> 
>> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
>> > dump testc
>> (foo,(svalue,bar),(ivalue,10),(value,65),{})
>> 
>> Cassandra 1.1.10:
>> 
>> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
>> > dump testc
>> (foo,(svalue,bar),(ivalue,10),(value,65),{})
>> 
>> Cassandra 1.2.10:
>> 
>> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
>> > dump testc
>> (foo,{(ivalue,
>> ),(svalue,bar),(value,A)})
>> 
>> 
>> To me it appears that ints and bigints are interpreted as ascii values in 
>> cass 1.2.10.  Did something change for CassandraStorage, is there a 
>> regression, or am I doing something wrong?  Quick perusal of the JIRA didn't 
>> reveal anything that I could directly pin on this.
>> 
>> Note that using compact storage does not seem to affect the issue, though it 
>> obviously changes the resulting pig format.
>> 
>> In addition, trying to use Pygmalion
>> 
>> > tf = foreach testc generate key, 
>> > flatten(FromCassandraBag('ivalue,svalue,value',columns)) as 
>> > (ivalue:int,svalue:chararray,lvalue:long);
>> > dump tf
>> 
>> (foo,
>> ,bar,A)
>> 
>> So no help there. Explicitly casting the values to (long) or (int) just 
>> results in a ClassCastException.
>> 
>> /Janne
>> 
> 



RE: 1.2.10 -> 2.0.1 migration issue

2013-09-26 Thread Christopher Wirt
Yes that was the problem.

 

I got confused between 2.0.1 and 2.1.0 after downloading the trunk source.

 

From: Robert Coli [mailto:rc...@eventbrite.com] 
Sent: 25 September 2013 18:10
To: user@cassandra.apache.org
Subject: Re: 1.2.10 -> 2.0.1 migration issue

 

On Wed, Sep 25, 2013 at 6:05 AM, Christopher Wirt 
wrote:

Should also say. I have managed to move one node from 1.2.10 to 2.0.0. I'm
seeing this error on the machine I tried to migrate earlier to 2.0.1

 

I'm confused... for the record :

 

1) you tried to upgrade from 1.2.10 to 2.0.1

2) the NEWS.txt snippet you posted refers to upgrading from versions below
2.0 to 2.1

3) 2.0.1 is 2.0, not 2.1

 

Therefore the problem is actually
https://issues.apache.org/jira/browse/CASSANDRA-6093 ?

 

=Rob 



RE: [Cassandra] Initial Setup - VMs for Research

2013-09-26 Thread shathawa
Thanks for the link.

I also found some useful infrastructure integration help
documents on your web pages.

Steven J. Hathaway

> What help are u looking for ?
>
> http://www.datastax.com/docs/datastax_enterprise3.1/install/install_deb_pkg
>
> -Original Message-
> From: shath...@e-z.net [mailto:shath...@e-z.net]
> Sent: 25 September 2013 15:27
> To: user@cassandra.apache.org
> Subject: [Cassandra] Initial Setup - VMs for Research
>
> Request some initial setup guidance for Cassandra deployment
>
> I expect to mentor a project at the Oregon State University
> computer science department for a senior engineering student
> project.
>
> I am trying to pre-configure one or more VMware virtual
> machines to hold an initial Cassandra database for a NOSQL
> project.
>
> Any guidance on the steps for initial deployment would
> be appreciated.
>
> My VMware machines already have the necessary 3rd party
> tools such as Oracle Java 7 and are running on a Debian Linux
> 7.1.0 release.  The Oregon State University computer science
> department will eventually host these virtual machines on
> their department servers if the student project is selected.
>
> Sincerely,
> Steven J. Hathaway
> (Senior IT Systems Architect)
>
>




Re: [Cassandra] Initial Setup - VMs for Research

2013-09-26 Thread shathawa
Thanks for the links!

Very Helpful

Steven J. Hathaway

> There is also Debian for the standard Apache Cassandra build
> http://wiki.apache.org/cassandra/DebianPackaging
>
> Or you can use DS Community which is a totally free distro of Apache
> Cassandra
> http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/install/installDeb_t.html
> (Instructions for DS Community will work for both)
>
> DataStax Enterprise is free for development purposes but cannot be used
> in production without a licence. In addition it ships with extra
> installation components (Hadoop, Mahout, Solr) that you probably won't
> need.
>
> Hope that helps.
>
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 26/09/2013, at 2:42 PM, Kanwar Sangha  wrote:
>
>> What help are u looking for ?
>>
>> http://www.datastax.com/docs/datastax_enterprise3.1/install/install_deb_pkg
>>
>> -Original Message-
>> From: shath...@e-z.net [mailto:shath...@e-z.net]
>> Sent: 25 September 2013 15:27
>> To: user@cassandra.apache.org
>> Subject: [Cassandra] Initial Setup - VMs for Research
>>
>> Request some initial setup guidance for Cassandra deployment
>>
>> I expect to mentor a project at the Oregon State University
>> computer science department for a senior engineering student
>> project.
>>
>> I am trying to pre-configure one or more VMware virtual
>> machines to hold an initial Cassandra database for a NOSQL
>> project.
>>
>> Any guidance on the steps for initial deployment would
>> be appreciated.
>>
>> My VMware machines already have the necessary 3rd party
>> tools such as Oracle Java 7 and are running on a Debian Linux
>> 7.1.0 release.  The Oregon State University computer science
>> department will eventually host these virtual machines on
>> their department servers if the student project is selected.
>>
>> Sincerely,
>> Steven J. Hathaway
>> (Senior IT Systems Architect)
>>
>
>








Re: Mystery PIG issue with 1.2.10

2013-09-26 Thread Robert Coli
On Thu, Sep 26, 2013 at 1:00 AM, Janne Jalkanen wrote:

>
> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear
> answers appeared, I'm going to assume that this is a regression and file a
> JIRA ticket on this.
>

Could you let the list know the ticket number, when you do? :)

=Rob


data schema for hourly runnning analytics

2013-09-26 Thread Renat Gilfanov
Hello,

We have a column family which stores incoming requests, and we would like to 
perform some analytics on that data using Hadoop. The analytic results should 
be available pretty soon: not in real time, but within an hour or so.
So we store the current hour number (calculated from the timestamp) as a 
"partition number" field with a secondary index.

Currently it looks like this (I skipped a few columns to avoid unnecessary 
details):

CREATE TABLE requests (
    request_id UUID PRIMARY KEY,
    partition_number INT,
    payload ASCII
 );

CREATE INDEX ON requests(partition_number);

Every hour we launch Hadoop jobs to process the data for the previous hour, so 
Hadoop queries the indexed "partition_number" column.
With several million rows I observe very poor performance for such queries, 
and realize that a secondary index on a high-cardinality field is a bad idea. 
However, I don't see good alternatives so far.
I was considering creating a temp column family every hour, writing data 
there, processing it with Hadoop the next hour, and throwing it away. However, 
there is a limitation: we need to store the raw incoming data, as in the 
future we'll have to provide new types of analytic reports.

So my questions are the following:

1. Is the approach of hourly Hadoop jobs solid for near-realtime analytics 
(where results should be available within 1 hour), or is it better to take a 
look at Storm or something like that?
2. What's the recommended data schema to store events "sharded" by hour, with 
the further possibility to quickly retrieve them by hour? (Assuming the hourly 
amount of data can fit in one wide row.)


Thank you.





How many Column Families can Cassandra handle?

2013-09-26 Thread Raihan Jamal
I am working on a use case for time-series data. I have been told to
create 600 column families in Cassandra, covering 10 minutes of data. Each
second will have its own column family, so for 10 minutes, which is 600
seconds, I will have 600 column families...

Each second, we will write into that second's column family; at 10 minutes
(the 600th second), we will write into the 600th second's column family.

I am wondering whether Cassandra will be able to handle 600 column families
or not. Right now, I am not sure how much data each column family will
have. What I know so far is that writes will be coming at a rate of 20,000
per second...

Can anyone shed some light on this?


Re: How many Column Families can Cassandra handle?

2013-09-26 Thread Hiller, Dean
600 is probably doable, but each CF takes up memory. PlayOrm goes with a 
strategy that can virtualize CFs into one CF, allowing less memory usage; we 
have 80,000 virtual CFs in Cassandra through PlayOrm, and you can copy 
PlayOrm's pattern if desired. 600 is probably doable but high; 10,000 is not 
very doable.

But you would have to try out 600 to see if it works for you. It may not; try 
it and find out under your own load and context.

NOTE: We have changed the 80,000 virtual CFs such that they are in 10 real CFs 
these days, so we get more parallel compaction going on.
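
The idea in CQL3 terms, as a rough sketch (hypothetical table, not PlayOrm's 
actual schema): the virtual CF name becomes part of the partition key, so many 
logical CFs share one physical CF.

  CREATE TABLE virtual_cfs (
      vcf_name text,    -- which virtual CF this row belongs to
      row_key text,
      col_name text,
      value blob,
      PRIMARY KEY ((vcf_name, row_key), col_name)
  );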

Dean

From: Raihan Jamal <jamalrai...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, September 26, 2013 11:39 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: How many Column Families can Cassandra handle?

I am working on a use case for time-series data. I have been told to create 
600 column families in Cassandra, covering 10 minutes of data. Each second 
will have its own column family, so for 10 minutes, which is 600 seconds, I 
will have 600 column families...

Each second, we will write into that second's column family; at 10 minutes 
(the 600th second), we will write into the 600th second's column family.

I am wondering whether Cassandra will be able to handle 600 column families or 
not. Right now, I am not sure how much data each column family will have. 
What I know so far is that writes will be coming at a rate of 20,000 per 
second...

Can anyone shed some light on this?


Re: Best version to upgrade from 1.1.10 to 1.2.X

2013-09-26 Thread Paulo Motta
Hello Charles,

Thank you very much for your detailed upgrade report. It'll be very helpful
during our upgrade operation (even though we'll do a rolling production
upgrade).

I'll also share our findings during the upgrade here.

Cheers,

Paulo


2013/9/24 Charles Brophy 

> Hi Paulo,
>
> I just completed a migration from 1.1.10 to 1.2.10 and it was surprisingly
> painless.
>
> The course of action that I took:
> 1) describe cluster - make sure all nodes are on the same schema
> 2) shutoff all maintenance tasks; i.e. make sure no scheduled repair is
> going to kick off in the middle of what you're doing
> 3) snapshot - maybe not necessary but it's so quick it makes no sense to
> skip this step
> 4) drain the nodes - I shut down the entire cluster rather than chance any
> incompatible gossip concerns that might come from a rolling upgrade. I have
> the luxury of controlling both the providers and consumers of our data, so
> this wasn't so disruptive for us.
> 5) Upgrade the nodes, turn them on one-by-one, monitor the logs for funny
> business.
> 6) nodetool upgradesstables
> 7) Turn various maintenance tasks back on, etc.
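>
> In command form, steps 4-6 looked roughly like this on each node (Debian
> packages assumed; a sketch rather than a verbatim script):
>
>   nodetool drain                   # flush memtables, stop accepting writes
>   sudo service cassandra stop
>   sudo apt-get install cassandra   # pull in the 1.2.10 package
>   sudo service cassandra start     # then watch system.log
>   nodetool upgradesstables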
>
> The worst part was managing the yaml/config changes between the versions.
> It wasn't horrible, but the diff was "noisier" than a more incremental
> upgrade typically is. A few things I recall that were special:
> 1) Since you have an existing cluster, you'll probably need to set the
> default partitioner back to RandomPartitioner in cassandra.yaml. I believe
> that is outlined in NEWS.
> 2) I set the initial tokens to be the same as what the nodes held
> previously.
> 3) The timeout is now divided into more atomic settings and you get to
> decide how (or if) to configure each one from its default appropriately.
>
> tldr; I did a standard upgrade and paid careful attention to the NEWS.txt
> upgrade notices. I did a full cluster restart and NOT a rolling upgrade. It
> went without a hitch.
>
> Charles
>
>
>
>
>
>
> On Tue, Sep 24, 2013 at 2:33 PM, Paulo Motta wrote:
>
>> Cool, sounds fair enough. Thanks for the help, Rob!
>>
>> If anyone has upgraded from 1.1.X to 1.2.X, please feel invited to share
>> any tips on issues you've encountered that are not yet documented.
>>
>> Cheers,
>>
>> Paulo
>>
>>
>> 2013/9/24 Robert Coli 
>>
>>> On Tue, Sep 24, 2013 at 1:41 PM, Paulo Motta 
>>> wrote:
>>>
 Doesn't the probability of something going wrong increase as the gap
 between the versions increases? So, using this reasoning, upgrading from
 1.1.10 to 1.2.6 would have less chance of something going wrong than from
 1.1.10 to 1.2.9 or 1.2.10.

>>>
>>> Sorta, but sorta not.
>>>
>>> https://github.com/apache/cassandra/blob/trunk/NEWS.txt
>>>
>>> Is the canonical source of concerns on upgrade. There are a few cases
>>> where upgrading to the "root" of X.Y.Z creates issues that do not exist if
>>> you upgrade to the "head" of that line. AFAIK there have been no cases
>>> where upgrading to the "head" of a line (where that line is mature, like
>>> 1.2.10) has created problems which would have been avoided by upgrading to
>>> the "root" first.
>>>
>>>
 I'm hoping this reasoning is wrong and I can update directly from
 1.1.10 to 1.2.10. :-)

>>>
>>> That's what I plan to do when we move to 1.2.X, FWIW.
>>>
>>> =Rob
>>>
>>
>>
>>
>> --
>> Paulo Ricardo
>>
>> --
>> European Master in Distributed Computing
>> Royal Institute of Technology - KTH
>> Instituto Superior Técnico - IST
>> http://paulormg.com
>>
>
>


-- 
Paulo Ricardo

-- 
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com


What is the best way to install & upgrade Cassandra on Ubuntu ?

2013-09-26 Thread Ertio Lew
How do you install Cassandra on Ubuntu, & later how do you upgrade the
installation on a node when an update has arrived? Do you simply download
the latest tar.gz and untar it to replace the older Cassandra files? How do
you do it? How does this upgrade process differ for a major version
upgrade, like say switching from the 1.2 series to the 2.0 series?


Re: How many Column Families can Cassandra handle?

2013-09-26 Thread Krishna Pisupat
I don't know the full use case. However, for a generic time-series scenario, we 
can make the timestamp (maybe up to second granularity) part of the key and 
write all the data into the same CF (one CF for all data). Again, it may not 
make sense in your case, given the full use case. Just my 2 cents.
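
As a rough CQL3 sketch of that idea (hypothetical table; pick the bucket 
granularity that fits your write rate):

  CREATE TABLE events_by_second (
      second_bucket bigint,   -- e.g. seconds since epoch
      event_id uuid,
      payload blob,
      PRIMARY KEY (second_bucket, event_id)
  );

One partition per second replaces one CF per second, and reading a given 
second back is a single-partition query.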


Thanks and Regards,
Krishna Pisupat
krishna.pisu...@gmail.com



On Sep 26, 2013, at 11:18 AM, "Hiller, Dean"  wrote:

> 600 is probably doable, but each CF takes up memory. PlayOrm goes with a 
> strategy that can virtualize CFs into one CF, allowing less memory usage; we 
> have 80,000 virtual CFs in Cassandra through PlayOrm, and you can copy 
> PlayOrm's pattern if desired. 600 is probably doable but high; 10,000 is not 
> very doable.
> 
> But you would have to try out 600 to see if it works for you. It may not; 
> try it and find out under your own load and context.
> 
> NOTE: We have changed the 80,000 virtual CFs such that they are in 10 real 
> CFs these days, so we get more parallel compaction going on.
> 
> Dean
> 
> From: Raihan Jamal <jamalrai...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, September 26, 2013 11:39 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: How many Column Families can Cassandra handle?
> 
> I am working on a use case for time-series data. I have been told to create 
> 600 column families in Cassandra, covering 10 minutes of data. Each second 
> will have its own column family, so for 10 minutes, which is 600 seconds, I 
> will have 600 column families...
> 
> Each second, we will write into that second's column family; at 10 minutes 
> (the 600th second), we will write into the 600th second's column family.
> 
> I am wondering whether Cassandra will be able to handle 600 column families 
> or not. Right now, I am not sure how much data each column family will 
> have. What I know so far is that writes will be coming at a rate of 20,000 
> per second...
> 
> Can anyone shed some light on this?



Re: What is the best way to install & upgrade Cassandra on Ubuntu ?

2013-09-26 Thread Robert Coli
On Thu, Sep 26, 2013 at 12:05 PM, Ertio Lew  wrote:

> How do you install Cassandra on Ubuntu, & later how do you upgrade the
> installation on a node when an update has arrived? Do you simply download
> the latest tar.gz and untar it to replace the older Cassandra files? How do
> you do it? How does this upgrade process differ for a major version
> upgrade, like say switching from the 1.2 series to the 2.0 series?
>

Use the deb packages. To upgrade, install the new package. Only upgrade a
single major version at a time, and be sure to consult NEWS.txt for any
upgrade caveats.
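
For example, with the apache.org Debian repo from the wiki's DebianPackaging 
page (a sketch; "12x" is whatever release series you track):

  # in /etc/apt/sources.list
  deb http://www.apache.org/dist/cassandra/debian 12x main

  sudo apt-get update
  sudo apt-get install cassandra   # the same command performs later upgrades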

Also be aware of this sub-optimal behavior of the debian packages :

https://issues.apache.org/jira/browse/CASSANDRA-2356

=Rob


How to create multiple column families using some script?

2013-09-26 Thread Raihan Jamal
I have to create multiple column families in my keyspace. One way is to
create the column families one by one, but in my case I have around 100
column families, so I cannot do it one by one... Is there any way I can
create multiple column families through some sort of script, which can
create them all for me in one shot?

create column family USER_DATA_SECOND_1
with comparator = 'UTF8Type'
and key_validation_class = 'CompositeType(DateType,UTF8Type)'
and default_validation_class = 'BytesType'
and gc_grace = 86400

create column family USER_DATA_SECOND_2
with comparator = 'UTF8Type'
and key_validation_class = 'CompositeType(DateType,UTF8Type)'
and default_validation_class = 'BytesType'
and gc_grace = 86400

create column family USER_DATA_SECOND_3
with comparator = 'UTF8Type'
and key_validation_class = 'CompositeType(DateType,UTF8Type)'
and default_validation_class = 'BytesType'
and gc_grace = 86400





create column family USER_DATA_SECOND_100
with comparator = 'UTF8Type'
and key_validation_class = 'CompositeType(DateType,UTF8Type)'
and default_validation_class = 'BytesType'
and gc_grace = 86400
Also, after creating these multiple column families, suppose I need to drop
all of them again; how can I do that with some script as well?

Below is the way I am creating the column families now, one by one, from my
local machine against my staging Cassandra server, which is not what I
want:

C:\Apache Cassandra\apache-cassandra-1.2.3\bin>cassandra-cli -h
sc-cdbhost01.vip.slc.qa.host.com
Starting Cassandra Client
Connected to: "Staging Cluster cass01" on
sc-cdbhost01.vip.slc.qa.host.com/9160
Welcome to Cassandra CLI version 1.2.3

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use profileks;
Authenticated to keyspace: profileks
[default@profileks] create column family USER_DATA_SECOND_1
... with comparator = 'UTF8Type'
... and key_validation_class = 'CompositeType(DateType,UTF8Type)'
... and default_validation_class = 'BytesType'
... and gc_grace = 86400;
27fe1848-c7de-3994-9289-486a9bbbf344
[default@profileks]

Can anyone tell me whether it is possible to create multiple column
families through some sort of script, and then drop those column families
through a script as well?


Re: How to create multiple column families using some script?

2013-09-26 Thread Robert Coli
On Thu, Sep 26, 2013 at 1:18 PM, Raihan Jamal  wrote:

> I have to create multiple column families in my keyspace. One way is to
> create the column families one by one, but in my case I have around 100
> column families, so I cannot do it one by one... Is there any way I can
> create multiple column families through some sort of script, which can
> create them all for me in one shot?
>

Historically, any sequence of operations other than :

1) modify schema
2) wait for schema agreement from all N nodes
3) modify schema again

Creates a non-zero chance of schema desynch. The rewrite of schema in 1.1
era supposedly eliminates this chance, but it is still safest to do the
above.
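
If you do want a single script, a minimal sketch (cassandra-cli's -f flag runs 
statements from a file; the host and keyspace below are from the original 
post, the rest is illustrative):

  echo "use profileks;" > create_cfs.txt
  for i in $(seq 1 100); do
    echo "create column family USER_DATA_SECOND_$i
      with comparator = 'UTF8Type'
      and key_validation_class = 'CompositeType(DateType,UTF8Type)'
      and default_validation_class = 'BytesType'
      and gc_grace = 86400;" >> create_cfs.txt
  done
  cassandra-cli -h sc-cdbhost01.vip.slc.qa.host.com -f create_cfs.txt

Dropping them again is the same loop emitting "drop column family 
USER_DATA_SECOND_$i;" statements. The CLI waits for schema agreement after 
each DDL statement, which keeps you within the safe sequence above.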

If you are going to programmatically create and delete keyspaces, be aware
of:

https://issues.apache.org/jira/browse/CASSANDRA-4857

=Rob


Re: Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-26 Thread Aaron Morton
> org.apache.cassandra.thrift.Column column; // initialize this with name, 
> value, timestamp, TTL
This is the wrong object to use. 

one overload of addColumn() accepts IColumn, which is from 
org.apache.cassandra.db. The thrift classes are only used for the thrift API. 

> What is the difference between calling writer.addColumn() on the column's 
> name, value and timestamp, and writer.addExpiringColumn() on the column's 
> name, value, TTL, timestamp and expiration timestamp ?
They both add a column to the row. addExpiringColumn() adds an expiring 
column, and addColumn adds a normal one. 

only addExpiringColumn accepts a TTL (in seconds) for the column.


> Does the former result in the column expiring still , in cassandra 1.2.x 
> (i.e. does setting the TTL on a Column object change the name or value in a 
> way so as to ensure the column will expire as required) ? 
No. 
An expiring column must be an ExpiringColumn column instance. 
The base IColumn interface does not have a TTL, only expiring columns do. 

>  If not , what is the TTL attribute used for in the Column object ?
The org.apache.cassandra.db.Column class does not have a TTL. 
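
To make the distinction concrete, here is a minimal sketch against the 1.2-era 
bulk-writing API, using the public ByteBuffer-based overloads (directory, 
keyspace, CF and comparator are hypothetical; adjust to your schema):

  import java.io.File;
  import org.apache.cassandra.db.marshal.AsciiType;
  import org.apache.cassandra.dht.Murmur3Partitioner;
  import org.apache.cassandra.io.sstable.SSTableSimpleWriter;
  import org.apache.cassandra.utils.ByteBufferUtil;

  public class ExpiringColumnExample
  {
      public static void main(String[] args) throws Exception
      {
          // Output directory must already exist; conventionally <path>/<ks>/<cf>.
          // The partitioner must match the target cluster's.
          SSTableSimpleWriter writer = new SSTableSimpleWriter(
                  new File("/tmp/myks/mycf"), new Murmur3Partitioner(),
                  "myks", "mycf", AsciiType.instance, null);

          long timestamp = System.currentTimeMillis() * 1000; // microseconds
          int ttl = 3600;                                     // seconds

          writer.newRow(ByteBufferUtil.bytes("row1"));

          // Normal column: no TTL anywhere, never expires.
          writer.addColumn(ByteBufferUtil.bytes("svalue"),
                           ByteBufferUtil.bytes("bar"), timestamp);

          // Expiring column: TTL in seconds plus the absolute expiration time
          // in milliseconds; this is what produces an ExpiringColumn on disk.
          writer.addExpiringColumn(ByteBufferUtil.bytes("ivalue"),
                                   ByteBufferUtil.bytes(10), timestamp, ttl,
                                   System.currentTimeMillis() + ttl * 1000L);

          writer.close();
      }
  }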

Cheers
  

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 26/09/2013, at 12:44 AM, Jayadev Jayaraman  wrote:

> Can someone answer this doubt reg. SSTableSimpleWriter? I'd asked about this 
> earlier but it probably got missed. Apologies for repeating the question 
> (with minor additions): 
> 
> """
> Let's say I've initialized a SSTableSimpleWriter instance and a new column 
> with TTL set : 
> 
> org.apache.cassandra.io.sstable.SSTableSimpleWriter writer = new 
> SSTableSimpleWriter( ... /* params here */);
> org.apache.cassandra.thrift.Column column; // initialize this with name, 
> value, timestamp, TTL
> 
> What is the difference between calling writer.addColumn() on the column's 
> name, value and timestamp, and writer.addExpiringColumn() on the column's 
> name, value, TTL, timestamp and expiration timestamp ? Does the former result 
> in the column expiring still , in cassandra 1.2.x (i.e. does setting the TTL 
> on a Column object change the name or value in a way so as to ensure the 
> column will expire as required) ? If not , what is the TTL attribute used for 
> in the Column object ?
> """
> 
> Thanks,
> Jayadev
> 
> 
> On Tue, Sep 24, 2013 at 2:48 PM, Jayadev Jayaraman  
> wrote:
> Let's say I've initialized a SSTableSimpleWriter instance and a new column 
> with TTL set : 
> 
> SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here */);
> Column column;
> 
> What is the difference between calling writer.addColumn() on the column's 
> name and value, and writer.addExpiringColumn() on the column and its TTL ? 
> Does the former result in the column expiring still , in cassandra 1.2.x ? Or 
> does it not ?
> 
> 
> 



Re: Nodes not added to existing cluster

2013-09-26 Thread Aaron Morton
>  INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
>  INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
If you can turn up logging to TRACE for 
org.apache.cassandra.net.OutboundTcpConnection, it will include the full error. 
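
e.g. in conf/log4j-server.properties (assuming the stock log4j setup):

  log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=TRACE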

> The two addresses that it is unable to handshake with are the other two 
> addresses of nodes in the cluster I'm unable to join.
Are you mixing versions ? 


Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 26/09/2013, at 5:13 PM, Skye Book  wrote:

> Hi Aaron, thanks for the clarification.
> 
> As might be expected, having the broadcast_address fixed hasn't fixed 
> anything.  What I did find after writing my last email is that output.log is 
> littered with these:
> 
>  INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
>  INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
>  INFO 05:03:49,803 Cannot handshake version with /ww.xx.yy.zz
>  INFO 05:03:49,805 Handshaking version with /ww.xx.yy.zz
> 
> The two addresses that it is unable to handshake with are the other two 
> addresses of nodes in the cluster I'm unable to join.  I started thinking 
> that maybe EC2 was having an un-advertised problem communicating between AZ's, 
> but bringing up nodes in both of the other availability zones resulted in the 
> same wrong behavior.
> 
> I've gist'd my cassandra.yaml; it's pretty standard and hasn't caused an issue 
> in the past for me.  https://gist.github.com/skyebook/ec9364cdcec02e803ffc
> 
> Skye Book
> http://skyebook.net -- @sbook
> 
> On Sep 26, 2013, at 12:34 AM, Aaron Morton  wrote:
> 
>>>  I am curious, though, how any of this worked in the first place spread 
>>> across three AZ's without that being set?
>> broadcast_address is only needed when you are going cross-region (IIRC it's 
>> the EC2MultiRegionSnitch that sets it). 
>> 
>> As Rob said, make sure the seed list includes one of the other nodes and that 
>> the cluster_name is set. 
>> 
>> Cheers
>> 
>> -
>> Aaron Morton
>> New Zealand
>> @aaronmorton
>> 
>> Co-Founder & Principal Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>> 
>> On 26/09/2013, at 8:12 AM, Skye Book  wrote:
>> 
>>> Thank you, both Michael and Robert, for your suggestions.  I actually saw 
>>> 5760, but we were running on 2.0.0, which it seems this was fixed in.
>>> 
>>> That said, I noticed that my Chef scripts were failing to set the 
>>> broadcast_address correctly, which I'm guessing is the cause of the 
>>> problem; I'm fixing that and trying a redeploy.  I am curious, though, how any 
>>> of this worked in the first place spread across three AZ's without that 
>>> being set?
>>> 
>>> -Skye
>>> 
>>> On Sep 25, 2013, at 3:56 PM, Robert Coli  wrote:
>>> 
 On Wed, Sep 25, 2013 at 12:41 PM, Skye Book  wrote:
 I have a three node cluster using the EC2 Multi-Region Snitch currently 
 operating only in US-EAST.  On having a node go down this morning, I 
 started a new node with an identical configuration, except for the seed 
 list, the listen address and the rpc address.  The new node comes up and 
 creates its own cluster rather than joining the pre-existing ring.  I've 
 tried creating a node both before and after using `nodetool remove` for the 
 bad node, each time with the same result.
 
 What version of Cassandra?
 
 This particular confusing behavior is fixed upstream, in a version you 
 should not deploy to production yet. Take some solace, however, that you 
 may be the last Cassandra administrator to die for a broken code path!
 
 https://issues.apache.org/jira/browse/CASSANDRA-5768
 
 Does anyone have any suggestions for where to look that might put me on 
 the right track?
 
 It must be that your seed list is wrong in some way, or your node state is 
 wrong. If you're trying to bootstrap a node, note that you can't bootstrap 
 a node when it is in its own seed list.
 
 If you have installed Cassandra via debian package, there is a possibility 
 that your node has started before you explicitly started it. If so, it 
 might have invalid node state.
 
 Have you tried wiping the data directory and trying again?
 
 What is your seed list? Are you sure the new node can reach the seeds on 
 the network layer?
 
 =Rob
>>> 
>> 
> 



Re: Mystery PIG issue with 1.2.10

2013-09-26 Thread Aaron Morton

> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear 
> answers appeared, I'm going to assume that this is a regression and file a 
> JIRA ticket on this.
Could you explain that a little more? 

You tried using CqlStorage to read a CQL 3 table and it did not work? 

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 5:04 AM, Robert Coli  wrote:

> On Thu, Sep 26, 2013 at 1:00 AM, Janne Jalkanen  
> wrote:
> 
> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear 
> answers appeared, I'm going to assume that this is a regression and file a 
> JIRA ticket on this.
> 
> Could you let the list know the ticket number, when you do? :)
> 
> =Rob



Re: data schema for hourly runnning analytics

2013-09-26 Thread Aaron Morton
> CREATE TABLE requests (
> request_id UUID PRIMARY KEY,
> partition_number INT,
> payload ASCII
>  );
> 
> CREATE INDEX ON requests(partition_number);
If reading all the requests in an hour is something you do frequently, then I 
strongly recommend modelling that with another table. 

e.g. 

CREATE TABLE requests_by_hour (
    hour bigint,   // YYYYMMDDHH
    request_id UUID,
    partition_number INT,
    payload ASCII,
    PRIMARY KEY (hour, request_id)
);

Check how much data you have per hour and split it further if more than a few 
10's of MB.
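
Reading an hour back is then a single-partition query, e.g. (the hour literal 
is illustrative):

  SELECT request_id, payload FROM requests_by_hour WHERE hour = 2013092614;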

> With several million rows I observe very poor performance for such queries, 
> and realize that a secondary index on a high-cardinality field is a bad idea. 
> However, I don't see good alternatives so far.
Above. 
There may also be an issue here about how Hadoop is allocating jobs to nodes as 
it tries to be smart about this based on the token assignments for the nodes. 

> I was considering creating a temp column family every hour, writing data 
> there, processing it with Hadoop the next hour, and throwing it away. 
> However, there is a limitation: we need to store the raw incoming data, as 
> in the future we'll have to provide new types of analytic reports.
This is where we say "denormalise to support reads". 
Store the raw requests as you were so there is a "database of record" sitting 
there. 
Denormalise / copy into a table that is tuned for the per-hour read case. If 
needed, use Deflate compression on that CF to reduce size; note that this will 
be slower. 
 
> 1. Is the approach of hourly Hadoop jobs solid for near-realtime analytics 
> (where results should be available within 1 hour), or is it better to take a 
> look at Storm or something like that?
Working on the data per hour (or some other measure) is something people often 
do. But you need to support the process in the data model as above. 

People have also been using Storm with Cassandra. 


> 2. What's the recommended data schema to store events "sharded" by hour, 
> with the further possibility to quickly retrieve them by hour? (Assuming the 
> hourly amount of data can fit in one wide row.)
Above and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 5:13 AM, Renat Gilfanov  wrote:

> Hello,
> 
> We have a column family which stores incoming requests, and we would like to 
> perform some analytics on that data using Hadoop. The analytic results 
> should be available pretty soon: not in real time, but within an hour or so. 
> So we store the current hour number (calculated from the timestamp) as a 
> "partition number" field with a secondary index.
> 
> Currently it looks like this (I skipped a few columns to avoid unnecessary 
> details):
> 
> CREATE TABLE requests (
> request_id UUID PRIMARY KEY,
> partition_number INT,
> payload ASCII
>  );
> 
> CREATE INDEX ON requests(partition_number);
> 
> Every hour we launch Hadoop jobs to process the data for the previous hour, 
> so Hadoop queries the indexed "partition_number" column.
> With several million rows I observe very poor performance for such queries, 
> and realize that a secondary index on a high-cardinality field is a bad idea. 
> However, I don't see good alternatives so far.
> I was considering creating a temp column family every hour, writing data 
> there, processing it with Hadoop the next hour, and throwing it away. 
> However, there is a limitation: we need to store the raw incoming data, as 
> in the future we'll have to provide new types of analytic reports.
> 
> So my questions are the following:
> 
> 1. Is the approach of hourly Hadoop jobs solid for near-realtime analytics 
> (where results should be available within 1 hour), or is it better to take a 
> look at Storm or something like that?
> 2. What's the recommended data schema to store events "sharded" by hour, 
> with the further possibility to quickly retrieve them by hour? (Assuming the 
> hourly amount of data can fit in one wide row.)
> 
> 
> Thank you.
> 
> 
> 



Re: Mystery PIG issue with 1.2.10

2013-09-26 Thread Chad Johnston
The OP was using a Thrift table and CassandraStorage. I verified that the
problem does not exist with a CQL3 table and CqlStorage.
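
For reference, the CqlStorage version looks roughly like this (cql:// is the 
URL scheme CqlStorage expects; keyspace and table names as in the original 
report):

grunt> testc = LOAD 'cql://keyspace/testc' USING CqlStorage();
grunt> dump testc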

Chad


On Thu, Sep 26, 2013 at 7:05 PM, Aaron Morton wrote:

>
> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear
>> answers appeared, I'm going to assume that this is a regression and file a
>> JIRA ticket on this.
>>
> Could you explain that a little more?
>
> You tried using the CqlStorage read with a CQL 3 table and it did not work
> ?
>
> Cheers
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 27/09/2013, at 5:04 AM, Robert Coli  wrote:
>
> On Thu, Sep 26, 2013 at 1:00 AM, Janne Jalkanen 
> wrote:
>
>>
>> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear
>> answers appeared, I'm going to assume that this is a regression and file a
>> JIRA ticket on this.
>>
>
> Could you let the list know the ticket number, when you do? :)
>
> =Rob
>
>
>


Re: Query about class org.apache.cassandra.io.sstable.SSTableSimpleWriter

2013-09-26 Thread Jayadev Jayaraman
Thanks for the reply. Isn't the addColumn(IColumn col) method in the writer
private though? In any case, I know what to do now in order to construct a
column with a TTL. Thanks.
On Sep 26, 2013 9:00 PM, "Aaron Morton"  wrote:

> > org.apache.cassandra.thrift.Column column; // initialize this with name,
> value, timestamp, TTL
> This is the wrong object to use.
>
> one overload of addColumn() accepts IColumn which is from
> org.apache.cassandra.db. The thrift classes are only used for the thrift API.
>
> > What is the difference between calling writer.addColumn() on the
> column's name, value and timestamp, and writer.addExpiringColumn() on the
> column's name, value, TTL, timestamp and expiration timestamp ?
> They both add a column to the row. addExpiringColumn() adds an expiring
> column, and addColumn adds a normal one.
>
> only addExpiringColumn accepts a TTL (in seconds) for the column.
>
>
> > Does the former result in the column expiring still , in cassandra 1.2.x
> (i.e. does setting the TTL on a Column object change the name or value in a
> way so as to ensure the column will expire as required) ?
> No.
> An expiring column must be an ExpiringColumn column instance.
> The base IColumn interface does not have a TTL, only expiring columns do.
>
> >  If not , what is the TTL attribute used for in the Column object ?
> The org.apache.cassandra.db.Column class does not have a TTL.
>
> Cheers
>
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 26/09/2013, at 12:44 AM, Jayadev Jayaraman  wrote:
>
> > Can someone answer this doubt reg. SSTableSimpleWriter? I'd asked about
> this earlier but it probably got missed. Apologies for repeating the question
> (with minor additions):
> >
> > """
> > Let's say I've initialized a SSTableSimpleWriter instance and a new
> column with TTL set :
> >
> > org.apache.cassandra.io.sstable.SSTableSimpleWriter writer = new
> SSTableSimpleWriter( ... /* params here */);
> > org.apache.cassandra.thrift.Column column; // initialize this with name,
> value, timestamp, TTL
> >
> > What is the difference between calling writer.addColumn() on the
> column's name, value and timestamp, and writer.addExpiringColumn() on the
> column's name, value, TTL, timestamp and expiration timestamp ? Does the
> former result in the column expiring still , in cassandra 1.2.x (i.e. does
> setting the TTL on a Column object change the name or value in a way so as
> to ensure the column will expire as required) ? If not , what is the TTL
> attribute used for in the Column object ?
> > """
> >
> > Thanks,
> > Jayadev
> >
> >
> > On Tue, Sep 24, 2013 at 2:48 PM, Jayadev Jayaraman 
> wrote:
> > Let's say I've initialized a SSTableSimpleWriter instance and a new
> column with TTL set :
> >
> > SSTableSimpleWriter writer = new SSTableSimpleWriter( ... /* params here
> */);
> > Column column;
> >
> > What is the difference between calling writer.addColumn() on the
> column's name and value, and writer.addExpiringColumn() on the column and
> its TTL ? Does the former result in the column expiring still , in
> cassandra 1.2.x ? Or does it not ?
> >
> >
> >
>
>


Re: What is the best way to install & upgrade Cassandra on Ubuntu ?

2013-09-26 Thread Ertio Lew
Could you please clarify:
1. When I upgrade to a newer version, would that retain my previous
configurations so that I don't need to configure everything again?
2. Would that smoothly replace the previous installation by itself?
3. What's the way (kindly, if you can, tell the command) to upgrade?
4. When should I prefer DataStax's DSC to that? (I need to install for a
production env.)


On Fri, Sep 27, 2013 at 12:50 AM, Robert Coli  wrote:

> On Thu, Sep 26, 2013 at 12:05 PM, Ertio Lew  wrote:
>
>> How do you install Cassandra on Ubuntu, & later how do you upgrade the
>> installation on a node when an update has arrived? Do you simply download
>> the latest tar.gz and untar it to replace the older Cassandra files? How do
>> you do it? How does this upgrade process differ for a major version
>> upgrade, like say switching from the 1.2 series to the 2.0 series?
>>
>
> Use the deb packages. To upgrade, install the new package. Only upgrade a
> single major version. and be sure to consult NEWS.txt for any upgrade
> caveats.
>
> Also be aware of this sub-optimal behavior of the debian packages :
>
> https://issues.apache.org/jira/browse/CASSANDRA-2356
>
> =Rob
>
>


Re: Mystery PIG issue with 1.2.10

2013-09-26 Thread Janne Jalkanen

Sorry, got sidetracked :)

https://issues.apache.org/jira/browse/CASSANDRA-6102

/Janne

On Sep 26, 2013, at 20:04 , Robert Coli  wrote:

> On Thu, Sep 26, 2013 at 1:00 AM, Janne Jalkanen  
> wrote:
> 
> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear 
> answers appeared, I'm going to assume that this is a regression and file a 
> JIRA ticket on this.
> 
> Could you let the list know the ticket number, when you do? :)
> 
> =Rob