Re: problem of inserting columns of a great amount

2012-08-10 Thread Jin Lei
Sorry, something was wrong with my previous problem description. In fact,
Cassandra rejects my requests when I try to insert 50k rows (rather
than 50k columns) into a column family at one time, each row with 1 column.
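
A likely culprit for the "Connection reset by peer" is the Thrift message/frame
size rather than any per-row or per-column limit: a single batch_mutate carrying
~50k mutations can bump into cassandra.yaml's thrift_framed_transport_size_in_mb /
thrift_max_message_length_in_mb defaults (15/16 MB in 1.1). A minimal pycassa
sketch of the usual workaround, flushing the batch in smaller chunks instead of
one giant Thrift call (keyspace, column family, and chunk size below are
illustrative, not taken from the original post):

import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace', server_list=['127.0.0.1:9160'])
cf = ColumnFamily(pool, 'MyColumnFamily')

rows = (('row-%d' % i, {'value': 'x'}) for i in range(50000))  # fake data

# queue_size makes the mutator call batch_mutate every 500 inserts, so no
# single Thrift message grows anywhere near the frame-size limit.
with cf.batch(queue_size=500) as batch:
    for row_key, columns in rows:
        batch.insert(row_key, columns)
# the context manager sends whatever is left in the queue on exit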

2012/8/10 Jin Lei 

> Hello everyone,
> I'm new to Cassandra and recently ran into a problem.
> I want to insert over 50k columns into Cassandra at one time, the total size
> of which doesn't exceed 16MB, but the database returns an exception as
> follows.
>
> [E 120809 15:37:31 service:1251] error in write to database
> Traceback (most recent call last):
>   File "/home/stoneiii/mycode/src/user/service.py", line 1248, in
> flush_mutator
> self.mutator.send()
>
>   File "/home/stoneiii/mycode/pylib/pycassa/batch.py", line 127, in
> send
>
> conn.batch_mutate(mutations, write_consistency_level)
>   File "/home/stoneiii/gaia2/pylib/pycassa/pool.py", line 145, in new_f
> return new_f(self, *args, **kwargs)
>   File "/home/stoneiii/mycode/pylib/pycassa/pool.py", line 145, in
> new_f
> return new_f(self, *args, **kwargs)
>   File "/home/stoneiii/mycode/pylib/pycassa/pool.py", line 145, in
> new_f
> return new_f(self, *args, **kwargs)
>   File "/home/stoneiii/mycode/pylib/pycassa/pool.py", line 145, in
> new_f
> return new_f(self, *args, **kwargs)
>   File "/home/stoneiii/mycode/pylib/pycassa/pool.py", line 145, in
> new_f
> return new_f(self, *args, **kwargs)
>   File "/home/stoneiii/mycode/pylib/pycassa/pool.py", line 140, in
> new_f
> (self._retry_count, exc.__class__.__name__, exc))
> MaximumRetryException: Retried 6 times. Last failure was error: [Errno
> 104] Connection reset by peer
>
> Since Cassandra supports up to 2 billion columns per row, why can't I
> insert 50k columns in this way? Or what settings should I adjust to remove
> this limit?
> Thanks in advance for any hints!
>
>
>
>


Re: Cassandra data model help

2012-08-10 Thread dinesh . simkhada
Thanks Aaron for your reply,
creating a vector for the raw data is a good workaround for reducing disk space, but
I am still not clear on tracking time for nodes. Say we want a query like "give
me the list of nodes for a cluster during this period of time": how do we
get that information? Do we scan through each node's row, since we will have a row for
each node?

thanks

-Aaron Turner  wrote: -
To: user@cassandra.apache.org
From: Aaron Turner 
Date: 08/09/2012 07:38PM
Subject: Re: Cassandra data model help

On Thu, Aug 9, 2012 at 5:52 AM,   wrote:
> Hi,
> I am trying to create a Cassandra schema for a cluster monitoring system, where
> one cluster can have multiple nodes and I am monitoring multiple metrics
> on each node. My raw data schema looks like this, taking values at 5-minute
> intervals:
>
> metric_name + daily timestamp as row key; composite column name made of node name
> and timestamp; metric value as column value.
>
> The problem I am facing is that a node can move back and forth between
> clusters (the system can have more than one cluster), so if I need monthly
> statistics plotted for a cluster I have to consider the nodes that are
> leaving and joining during this period of time. Some node might be part of
> the cluster for just 15 days and some could join the cluster in the last 10 days of
> the month, so to plot data for a particular cluster over a time interval I need to
> know the nodes which were part of that cluster for that period of time. What
> could be the best schema for this? I have tried a few ideas, so far with no
> luck. Any suggestions?

Store each node stat in its own row.  Then decide if you want to
track when a node joins/leaves a cluster so you can build the aggregates on
the fly, or just store cluster aggregates in their own row as well.  If
the latter, depending on your polling methodology, you may want to use
counters for the cluster aggregates.

Also, if you're doing 5 min intervals with each row = 1 day, then your
disk space usage is going to grow pretty quickly due to per-column
overhead.  You didn't say what the values are that you're storing,
but if they're just 64-bit integers or something like that, most of
your disk space is actually being used for column overhead, not your
data.

I worked around this by creating a 2nd CF, where each row = 1 year's
worth of data and each column = 1 day's worth of data.  The values are
just a vector of the 5-minute values from the original CF.  Then I just
have a cron job which reads the previous day's data, builds the
vectors in the new CF, and then deletes the original row.  By doing
this, my disk space requirements (before replication) went from over
1.1TB/year to 305GB/year.
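
For reference, one way such a day vector might be encoded (a sketch only; Aaron
doesn't say what his actual format is, and this assumes 64-bit integer samples
with a fixed 288 slots per day):

import struct

SLOTS_PER_DAY = 288          # 24h of 5-minute samples
MISSING = -1                 # sentinel for slots with no sample

def pack_day(samples):
    """Pack one day's samples into a single column value (~2.3 KB)."""
    assert len(samples) == SLOTS_PER_DAY
    return struct.pack('>%dq' % SLOTS_PER_DAY, *samples)

def unpack_day(blob):
    return list(struct.unpack('>%dq' % SLOTS_PER_DAY, blob))

The win is that the roughly 15 bytes of per-column overhead (plus the column
name itself) is paid once per day instead of 288 times, which is where the
1.1TB/year to 305GB/year reduction comes from.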


-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"



Question regarding tombstone removal and compaction

2012-08-10 Thread Fredrik
We've had a bug that caused one of our column families to grow very big:
280 GB on a 500 GB disk. We're using size-tiered compaction.
Since it's append-only data, I've now issued deletes for 260 GB of
superfluous data.


1. There are some quite large SSTables (80 GB, 40 GB, etc.). If I run a
major compaction before GC grace (which is 6 hours) has elapsed, will the
compaction succeed, or will it ignore the tombstones because GC grace
hasn't elapsed and then fail due to insufficient disk space?


2. If I wait until GC grace has elapsed, will it be possible to run a
major compaction, given that there are only deletes, which shouldn't require
double the SSTable size when merging the tombstones with the large
SSTables?


Regards
/Fredrik







Problem with building Cascading tap for Cassandra

2012-08-10 Thread Gijs Stuurman
Hi all,

I'm trying to build a Cascading tap for Cassandra. Cascading is a
layer on top of Hadoop. For this purpose I use ColumnFamilyInputFormat and
ColumnFamilyRecordReader from Cassandra.

I ran into a problem where the record reader creates an endless
iterator because something goes wrong with the start token of the
batches the ColumnFamilyRecordReader gets out of Cassandra.

This situation is explained in a comment on this JIRA issue:
https://issues.apache.org/jira/browse/CASSANDRA-4229

The reply on the issue says the behavior is caused by some of the row
keys being modified. The suggested solution is to copy all the
ByteBuffers that are used.

I have added ByteBufferUtil.clone liberally, but the problem persists.

Any suggestions on what might be causing this?

Below are the two files that make up the Cascading tap; they use the
ColumnFamilyInputFormat and ColumnFamilyRecordReader from Cassandra
version 1.1.2:

CassandraScheme.java
package cascalog.cassandra;

import org.apache.cassandra.thrift.*;
import org.apache.cassandra.utils.ByteBufferUtil;

import cascading.flow.FlowProcess;
import cascading.scheme.Scheme;
import cascading.scheme.SinkCall;
import cascading.scheme.SourceCall;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascading.util.Util;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.fs.Path;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.SortedMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.UUID;
import java.nio.ByteBuffer;

import org.apache.cassandra.db.IColumn;

public class CassandraScheme
extends Scheme {

private String pathUUID;
private String host;
private String port;
private String keyspace;
private String columnFamily;
private List columnFieldNames;


public CassandraScheme(String host, String port, String keyspace,
String columnFamily, List columnFieldNames) {

  this.host = host;
  this.port = port;
  this.keyspace = keyspace;
  this.columnFamily = columnFamily;
  this.columnFieldNames = columnFieldNames;

  this.pathUUID = UUID.randomUUID().toString();
  //setSourceFields(new Fields("text3")); // default is unknown
  //setSinkFields
  }

  @Override
  public void sourcePrepare(FlowProcess flowProcess,
  SourceCall sourceCall) {

  ByteBuffer key =
ByteBufferUtil.clone((ByteBuffer)sourceCall.getInput().createKey());
  SortedMap value = (SortedMap)sourceCall.getInput().createValue();

  Object[] obj = new Object[]{key, value};

sourceCall.setContext(obj);
  }

  @Override
  public void sourceCleanup(FlowProcess flowProcess,
  SourceCall sourceCall) {
  sourceCall.setContext(null);
  }

  @Override
  public boolean source(FlowProcess flowProcess,
  SourceCall sourceCall) throws IOException {
Tuple result = new Tuple();

Object key = sourceCall.getContext()[0];
Object value = sourceCall.getContext()[1];

boolean hasNext = sourceCall.getInput().next(key, value);
if (!hasNext) { return false; }

ByteBuffer orgkey = (ByteBuffer)key;
ByteBuffer rowkey = ByteBufferUtil.clone(orgkey);

SortedMap columns = (SortedMap) value;

String rowkey_str = ByteBufferUtil.string(rowkey);

result.add(rowkey_str);

for (String columnFieldName: columnFieldNames) {
IColumn col = columns.get(ByteBufferUtil.bytes(columnFieldName));
if (col != null) {

result.add(ByteBufferUtil.string(ByteBufferUtil.clone(col.value())));
} else {
result.add(null);
}
}
sourceCall.getIncomingEntry().setTuple(result);
return true;

  }


  @Override
  public void sink(FlowProcess flowProcess, SinkCall sinkCall)
  throws IOException {
  System.out.println("sink");
TupleEntry tupleEntry = sinkCall.getOutgoingEntry();
OutputCollector outputCollector = sinkCall.getOutput();
throw new UnsupportedOperationException("TODO");
//outputCollector.collect(null, put);
  }

  @Override
  public void sinkConfInit(FlowProcess process,
  Tap tap, JobConf conf) {
  System.out.println("sinkConfInit");
  }

  @Override
  public void sourceConfInit(FlowProcess process,
  Tap tap, JobConf conf) {

  FileInputFormat.addInputPaths(conf, getPath().toString());
  conf.setInputFormat(ColumnFamilyInputFormat.class);

  ConfigHelper.setRangeBatchSize(conf, 100);
  ConfigHelper.setInputSplitSize(conf, 30);
  ConfigHelper.setInputRpcPort(conf, port);
  ConfigHelper.setInputInitialAddress(conf, host);
  ConfigHelper.setInputPartit

Commit log + Data directory on same partition (software raid)

2012-08-10 Thread Thibaut Britz
Hi,

Has anyone here had experience with software RAID (RAID 1,
mirroring 2 disks)?

Our workload is rather read-heavy at the moment (the commit log directory only
grows by 128MB every 2-3 minutes), while the second disk is under high load
due to the read requests to our Cassandra cluster.

I was thinking about putting both the commit log and the data directory on
a software RAID partition spanning the two disks. Would this increase
general read performance? In theory I could get twice the read
performance, but I don't know how the commit log would influence the read
performance on both disks.

Thanks,
Thibaut


Re: Commit log + Data directory on same partition (software raid)

2012-08-10 Thread Radim Kolar


I was thinking about putting both the commit log and the data 
directory on a software raid partition spanning over the two disks. 
Would this increase the general read performance? In theory I could 
get twice the read performance, but I don't know how the commit log 
will influence the read performance on both disks?

ZFS + an SSD cache is best. Get FreeBSD 8.3 and install Cassandra from ports.



CQL connections

2012-08-10 Thread David McNelis
In using CQL (the python library, at least), I didn't see a way to pass in
multiple nodes as hosts.  With other libraries (like Hector and Pycassa) I
can set multiple hosts and my app will work with any node on that list.  Is
there something similar going on in the background with CQL?

If not, then is anyone aware of plans to do so?
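
For what it's worth, the cql driver's connect() appears to take a single host
(as David notes), so until it grows pool/failover support the fallback has to
live in the application. A small sketch of that (host list and keyspace are
placeholders, not a recommendation of the driver's own API):

import cql

HOSTS = ['10.0.0.1', '10.0.0.2', '10.0.0.3']   # hypothetical node list

def connect_any(hosts, port=9160, keyspace='mykeyspace'):
    """Try each host in turn and return the first connection that works."""
    last_error = None
    for host in hosts:
        try:
            return cql.connect(host, port, keyspace)
        except Exception as e:        # driver surfaces socket/Thrift errors here
            last_error = e
    raise last_error

conn = connect_any(HOSTS)
cursor = conn.cursor()

This only covers connect time, not balancing load across nodes, which is part
of what Pycassa's ConnectionPool and Hector give you.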


Re: Decision Making- YCSB

2012-08-10 Thread Edward Capriolo
There are many YCSB forks on GitHub that get optimized for specific
databases, but the default one is decent across the defaults. Cassandra
has its own internal stress tool that we like better.

The shortcomings are that generic tools and generic workloads are
generic and thus not real-world. But other than that, being able to
tweak the workload percentages and change the read patterns
(latest/random/etc.) does a decent job of stressing normal and worst-case
scenarios on the read path. Still, I would try to build my own real-world
use case as a tool to evaluate a solution before making a
choice.

Edward

On Thu, Aug 9, 2012 at 8:58 PM, Roshni Rajagopal
 wrote:
> Hi Folks,
>
> I'm coming up with a set of decision criteria on when to choose traditional
> RDBMS vs. various NoSQL options.
> So one aspect is the  application requirements around Consistency, 
> Availability, Partition Tolerance, Scalability, Data Modeling etc. These can 
> be decided at a theoretical level.
>
> Once we are sure we need NoSQL, to effectively benchmark the performance 
> around use-cases or application workloads, we need a standard method.
> Some tools are specific to a database, like Cassandra's stress tool. The only
> tool I could find which seems to compare across NoSQL databases, and can be 
> extended and is freely available is YCSB.
>
> Is YCSB updated for latest versions of cassandra and hbase? Does it work for 
> Datastax enterprise? Is it regularly updated for new versions of NoSQL 
> databases, or is this something we would need to take up as a development 
> effort?
>
> Are there any shortcomings to using YCSB- and would it be preferable to 
> develop own tool for performance benchmarking of NoSQL systems. Do share your 
> thoughts.
>
>
> Regards,
> Roshni
>


Re: Decision Making- YCSB

2012-08-10 Thread Mohit Anchlia
I agree with Edward. We always develop our own stress tool that tests each
use case of interest. Every use case is different in certain ways that can
only be tested using a custom stress tool.
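
To make that concrete, here is a toy example of what a use-case-specific stress
script might look like with pycassa. It models a time-series workload of 80%
single-column appends and 20% "latest 100 columns" slice reads and reports a
95th-percentile latency; the keyspace/CF names and the workload mix are invented
for illustration, and the CF is assumed to have a LongType comparator:

import time, random, threading
import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('StressKS', server_list=['127.0.0.1:9160'], pool_size=16)
events = ColumnFamily(pool, 'events')
latencies = []
lock = threading.Lock()

def worker(ops):
    for _ in range(ops):
        row = 'device-%d' % random.randint(0, 999)
        start = time.time()
        if random.random() < 0.8:
            # append one sample, column name = microsecond timestamp
            events.insert(row, {int(time.time() * 1e6): 'payload'})
        else:
            try:
                # read the newest 100 samples for a random device
                events.get(row, column_count=100, column_reversed=True)
            except pycassa.NotFoundException:
                pass                  # row not written yet
        with lock:
            latencies.append(time.time() - start)

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
latencies.sort()
print('95th percentile: %.1f ms' % (latencies[int(len(latencies) * 0.95)] * 1000))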

On Fri, Aug 10, 2012 at 7:25 AM, Edward Capriolo wrote:

> There are many YCSB forks on GitHub that get optimized for specific
> databases, but the default one is decent across the defaults. Cassandra
> has its own internal stress tool that we like better.
>
> The shortcomings are that generic tools and generic workloads are
> generic and thus not real-world. But other than that, being able to
> tweak the workload percentages and change the read patterns
> (latest/random/etc.) does a decent job of stressing normal and worst-case
> scenarios on the read path. Still, I would try to build my own real-world
> use case as a tool to evaluate a solution before making a
> choice.
>
> Edward
>
> On Thu, Aug 9, 2012 at 8:58 PM, Roshni Rajagopal
>  wrote:
> > Hi Folks,
> >
> > I'm coming up with a set of decision criteria on when to choose
> traditional RDBMS vs. various NoSQL options.
> > So one aspect is the  application requirements around Consistency,
> Availability, Partition Tolerance, Scalability, Data Modeling etc. These
> can be decided at a theoretical level.
> >
> > Once we are sure we need NoSQL, to effectively benchmark the performance
> around use-cases or application workloads, we need a standard method.
> > Some tools are specific to a database, like Cassandra's stress tool. The
> only tool I could find which seems to compare across NoSQL databases, and
> can be extended and is freely available is YCSB.
> >
> > Is YCSB updated for latest versions of cassandra and hbase? Does it work
> for Datastax enterprise? Is it regularly updated for new versions of NoSQL
> databases, or is this something we would need to take up as a development
> effort?
> >
> > Are there any shortcomings to using YCSB- and would it be preferable to
> develop own tool for performance benchmarking of NoSQL systems. Do share
> your thoughts.
> >
> >
> > Regards,
> > Roshni
> >
>


Re: Cassandra data model help

2012-08-10 Thread Aaron Turner
You need to track node membership separately.  I do that in a SQL
database, but you can use cassandra for that.  For example:

rowkey = cluster name
column name = Composite[<epoch>:<node name>], column value = join|leave

Then every time a node joins or leaves a cluster, write an entry.
Then you can just read the row (ordered by epoch times) to build your
list of active nodes for a given time period.  Note, you can set an
ending read range, but you basically have to start reading from 0.

Note that this is really for figuring out which nodes are in a cluster
for a given period of time.  You wouldn't want to model it that way if
you wanted to know which cluster(s) a single node was in over a given
period of time.  In that case you'd model it this way:

rowkey = node name
column name = Composite[<epoch>:<cluster name>], column value = join|leave

Depending on your needs, you may end up using both!
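
A rough pycassa sketch of the first model above (keyspace/CF names are made up,
and the epoch-first ordering of the composite is an assumption based on the
"ordered by epoch times" remark):

import time
import pycassa
from pycassa.system_manager import SystemManager, UTF8_TYPE
from pycassa.types import CompositeType, LongType, UTF8Type

# one-time schema setup: one wide row per cluster, columns sorted by time
sys_mgr = SystemManager('127.0.0.1:9160')
sys_mgr.create_column_family('monitoring', 'cluster_membership',
                             comparator_type=CompositeType(LongType(), UTF8Type()),
                             default_validation_class=UTF8_TYPE)
sys_mgr.close()

pool = pycassa.ConnectionPool('monitoring', server_list=['127.0.0.1:9160'])
membership = pycassa.ColumnFamily(pool, 'cluster_membership')

def record(cluster, node, event):              # event is 'join' or 'leave'
    membership.insert(cluster, {(int(time.time()), node): event})

def events_up_to(cluster, end_epoch):
    # As noted above, you have to read from the start of the row to know who
    # was already a member before the window of interest began.
    return membership.get(cluster,
                          column_start=(0,),
                          column_finish=(end_epoch,),
                          column_count=100000)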



On Fri, Aug 10, 2012 at 1:34 AM,   wrote:
> Thanks Aaron for your reply,
> creating a vector for the raw data is a good workaround for reducing disk space,
> but I am still not clear on tracking time for nodes. Say we want a query like
> "give me the list of nodes for a cluster during this period of time": how
> do we get that information? Do we scan through each node's row, since we will
> have a row for each node?
>
> thanks
>
> -Aaron Turner  wrote: -
> To: user@cassandra.apache.org
> From: Aaron Turner 
> Date: 08/09/2012 07:38PM
> Subject: Re: Cassandra data model help
>
> On Thu, Aug 9, 2012 at 5:52 AM,   wrote:
>> Hi,
>> I am trying to create a Cassandra schema for a cluster monitoring system,
>> where one cluster can have multiple nodes and I am monitoring multiple
>> metrics on each node. My raw data schema looks like this, taking values at
>> 5-minute intervals:
>>
>> metric_name + daily timestamp as row key; composite column name made of node
>> name and timestamp; metric value as column value.
>>
>> The problem I am facing is that a node can move back and forth between the
>> clusters (the system can have more than one cluster), so if I need monthly
>> statistics plotted for a cluster I have to consider the nodes that are
>> leaving and joining during this period of time. Some node might be part of
>> the cluster for just 15 days and some could join the cluster in the last 10 days of
>> the month, so to plot data for a particular cluster over a time interval I need
>> to know the nodes which were part of that cluster for that period of time.
>> What could be the best schema for this? I have tried a few ideas, so far no
>> luck. Any suggestions?
>
> Store each node stat in its own row.  Then decide if you want to
> track when a node joins/leaves a cluster so you can build the aggregates on
> the fly, or just store cluster aggregates in their own row as well.  If
> the latter, depending on your polling methodology, you may want to use
> counters for the cluster aggregates.
>
> Also, if you're doing 5 min intervals with each row = 1 day, then your
> disk space usage is going to grow pretty quickly due to per-column
> overhead.  You didn't say what the values are that you're storing,
> but if they're just 64-bit integers or something like that, most of
> your disk space is actually being used for column overhead, not your
> data.
>
> I worked around this by creating a 2nd CF, where each row = 1 year's
> worth of data and each column = 1 day's worth of data.  The values are
> just a vector of the 5-minute values from the original CF.  Then I just
> have a cron job which reads the previous day's data, builds the
> vectors in the new CF, and then deletes the original row.  By doing
> this, my disk space requirements (before replication) went from over
> 1.1TB/year to 305GB/year.
>
>
> --
> Aaron Turner
> http://synfin.net/ Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & 
> Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
>



-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
"carpe diem quam minimum credula postero"


Re: How to create a COLUMNFAMILY with Leveled Compaction?

2012-08-10 Thread Andy Ballingall TF
On 3 August 2012 21:31, Data Craftsman 木匠  wrote:
>
> Nobody uses Leveled Compaction with CQL 3.0?



I tried this, and I can't get it to work either.

I'm using:

[cqlsh 2.2.0 | Cassandra 1.1.2 | CQL spec 3.0.0 | Thrift protocol 19.32.0]


Here's what my create table looks like:

CREATE TABLE php_sessions(
domain_name text,
session_id text,
session_data text,
PRIMARY KEY (domain_name, session_id)
)
WITH COMPACT STORAGE AND
COMPACTION_STRATEGY_CLASS='LeveledCompactionStrategy';

It does create the table, but if you do DESCRIBE TABLE php_sessions it displays:

CREATE TABLE php_sessions (
  domain_name text,
  session_id text,
  session_data text,
  PRIMARY KEY (domain_name, session_id)
) WITH COMPACT STORAGE AND
  caching='KEYS_ONLY' AND
  read_repair_chance=0.10 AND
  gc_grace_seconds=864000 AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='true' AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='SnappyCompressor';


This table is created in a keyspace declared as follows:

CREATE KEYSPACE testkeyspace WITH
strategy_class = 'NetworkTopologyStrategy'
AND strategy_options:dc1=3;



I'd definitely like to use this new compaction type. Help!

Andy
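
One thing that might be worth trying while the CQL 3 path is being sorted out
(an unverified sketch, not something I have tested against 1.1.2): the Thrift
schema API also exposes the compaction strategy, e.g. through pycassa's
SystemManager, here using the keyspace/table names from Andy's example; the
sstable_size_in_mb value is just an illustration. cassandra-cli's UPDATE COLUMN
FAMILY can set the same attribute.

from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('localhost:9160')
sys_mgr.alter_column_family('testkeyspace', 'php_sessions',
                            compaction_strategy='LeveledCompactionStrategy',
                            compaction_strategy_options={'sstable_size_in_mb': '10'})
sys_mgr.close()

If the change took, DESCRIBE TABLE should then report
compaction_strategy_class='LeveledCompactionStrategy'.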





>
>
> -Z
>
> On Tue, Jul 31, 2012 at 11:17 AM, Data Craftsman 木匠
>  wrote:
> > Sorry for my stupid simple question. How to create a COLUMNFAMILY with
> > Leveled Compaction?
> >
> > There is no example in documentation:
> > http://www.datastax.com/docs/1.1/configuration/storage_configuration#compaction-strategy
> >
> > I tried it on Cassandra 1.1.0 and 1.1.2; both failed. The COLUMNFAMILY
> > is still using 'SizeTieredCompactionStrategy'.  :(
> >
> > Here is my test and output:
> >
> > @host01:/usr/share/cassandra>cqlsh host01 --cql3
> > Connected to BookCluster at host01:9160.
> > [cqlsh 2.2.0 | Cassandra 1.1.0 | CQL spec 3.0.0 | Thrift protocol 19.30.0]
> > Use HELP for help.
> > cqlsh>
> > cqlsh> use demo;
> >
> > cqlsh:demo> CREATE COLUMNFAMILY book
> > ... (isbn varchar,
> > ...  book_id bigint,
> > ...  price int,
> > ...  obj varchar,
> > ...  PRIMARY KEY (isbn, book_id)
> > ... )
> > ... WITH compaction_strategy_class='LeveledCompactionStrategy';
> > cqlsh:demo>
> > cqlsh:demo> describe COLUMNFAMILY book;
> >
> > CREATE COLUMNFAMILY book (
> >   isbn text PRIMARY KEY
> > ) WITH
> >   comment='' AND
> >   
> > comparator='CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.UTF8Type)'
> > AND
> >   read_repair_chance=0.10 AND
> >   gc_grace_seconds=864000 AND
> >   default_validation=text AND
> >   min_compaction_threshold=4 AND
> >   max_compaction_threshold=32 AND
> >   replicate_on_write=True AND
> >   compaction_strategy_class='SizeTieredCompactionStrategy' AND
> >   
> > compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
> >
> > cqlsh:demo>
> >
> > Thanks,
> > Charlie (@mujiang) 一个 木匠
> > ===
> > Data Architect Developer
> > http://mujiang.blogspot.com




--
Andy Ballingall
Senior Software Engineer

The Foundry
6th Floor, The Communications Building,
48, Leicester Square,
London, WC2H 7LT, UK
Tel: +44 (0)20 7968 6828 - Fax: +44 (0)20 7930 8906
Web: http://www.thefoundry.co.uk/

The Foundry Visionmongers Ltd.
Registered in England and Wales No: 4642027


Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Hi all

 

Just replaced (clean install) version 1.0.9 with 1.1.3 on a two-node
Amazon cluster.  After yaml modification and starting both nodes, they
do not see each other:

 

  Note: Ownership information does not include topology, please specify
a keyspace.

Address DC  RackStatus State   Load
OwnsToken

10.168.87.107   datacenter1 rack1   Up Normal  9.07 KB
100.00% 0

 

Address DC  RackStatus State   Load
Effective-Ownership Token

10.171.77.39datacenter1 rack1   Up Normal  36.16 KB
100.00% 85070591730234615865843651857942052863

 

Help please



Re: Problem with version 1.1.3

2012-08-10 Thread Derek Barnes
Do both nodes refer to one another as seeds in cassandra.yaml?

On Fri, Aug 10, 2012 at 1:46 PM, Dwight Smith
wrote:

> Hi all
>
>
> Just replaced ( clean install ) version 1.0.9 with 1.1.3 – two node amazon
> cluster.  After yaml modification and starting both nodes – they do not see
> each other:
>
>
>   Note: Ownership information does not include topology, please specify a
> keyspace.
>
> Address DC  RackStatus State   Load
> OwnsToken
>
> 10.168.87.107   datacenter1 rack1   Up Normal  9.07 KB
> 100.00% 0
>
>
> Address DC  RackStatus State   Load
> Effective-Ownership Token
>
> 10.171.77.39datacenter1 rack1   Up Normal  36.16 KB
> 100.00% 85070591730234615865843651857942052863
>
>
> Help please
>


RE: Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Yes - BUT they are the node hostnames and not the IP addresses

 

From: Derek Barnes [mailto:sj.clim...@gmail.com] 
Sent: Friday, August 10, 2012 2:00 PM
To: user@cassandra.apache.org
Subject: Re: Problem with version 1.1.3

 

Do both nodes refer to one another as seeds in cassandra.yaml?

On Fri, Aug 10, 2012 at 1:46 PM, Dwight Smith
 wrote:

Hi all

 

Just replaced ( clean install ) version 1.0.9 with 1.1.3 - two node
amazon cluster.  After yaml modification and starting both nodes - they
do not see each other:

 

  Note: Ownership information does not include topology, please specify
a keyspace.

Address DC  RackStatus State   Load
OwnsToken

10.168.87.107   datacenter1 rack1   Up Normal  9.07 KB
100.00% 0

 

Address DC  RackStatus State   Load
Effective-Ownership Token

10.171.77.39datacenter1 rack1   Up Normal  36.16 KB
100.00% 85070591730234615865843651857942052863

 

Help please

 



RE: Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Derek

 

I added both node hostnames to the seeds and it now has the correct
nodetool ring:

 

Address DC  RackStatus State   Load
OwnsToken

 
85070591730234615865843651857942052863

10.168.87.107   datacenter1 rack1   Up Normal  13.5 KB
50.00%  0

10.171.77.39datacenter1 rack1   Up Normal  13.5 KB
50.00%  85070591730234615865843651857942052863

 

Thanks for the hint. 

 

From: Derek Barnes [mailto:sj.clim...@gmail.com] 
Sent: Friday, August 10, 2012 2:00 PM
To: user@cassandra.apache.org
Subject: Re: Problem with version 1.1.3

 

Do both nodes refer to one another as seeds in cassandra.yaml?

On Fri, Aug 10, 2012 at 1:46 PM, Dwight Smith
 wrote:

Hi all

 

Just replaced ( clean install ) version 1.0.9 with 1.1.3 - two node
amazon cluster.  After yaml modification and starting both nodes - they
do not see each other:

 

  Note: Ownership information does not include topology, please specify
a keyspace.

Address DC  RackStatus State   Load
OwnsToken

10.168.87.107   datacenter1 rack1   Up Normal  9.07 KB
100.00% 0

 

Address DC  RackStatus State   Load
Effective-Ownership Token

10.171.77.39datacenter1 rack1   Up Normal  36.16 KB
100.00% 85070591730234615865843651857942052863

 

Help please

 



Re: CQL connections

2012-08-10 Thread Data Craftsman 木匠
I want to know it too.

http://www.datastax.com/support-forums/topic/when-will-pycassa-support-cql

Connection pooling and load balancing are necessary features for a multi-user
production application.

Thanks,
Charlie | DBA

On Fri, Aug 10, 2012 at 6:47 AM, David McNelis  wrote:
> In using CQL (the python library, at least), I didn't see a way to pass in
> multiple nodes as hosts.  With other libraries (like Hector and Pycassa) I
> can set multiple hosts and my app will work with any node on that list.  Is
> there something similar going on in the background with CQL?
>
> If not, then is anyone aware of plans to do so?


RE: Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Further info: it seems I had the seeds list backwards; it did not need
both nodes. I have corrected that, with each node pointing to the other as a
single seed entry, and it works fine.

 

Thanks again for the quick response.
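
For anyone hitting the same thing, the relevant cassandra.yaml fragment ends up
looking roughly like this (a sketch of what the 10.171.77.39 node might carry,
with the other node configured as the mirror image; the IPs are taken from the
ring output above, and the seed entries need to be addresses the nodes can
actually reach):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.168.87.107"

listen_address: 10.171.77.39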

 

From: Dwight Smith [mailto:dwight.sm...@genesyslab.com] 
Sent: Friday, August 10, 2012 2:15 PM
To: user@cassandra.apache.org
Subject: RE: Problem with version 1.1.3

 

Derek

 

I added both node hostnames to the seeds and it now has the correct
nodetool ring:

 

Address DC  RackStatus State   Load
OwnsToken

 
85070591730234615865843651857942052863

10.168.87.107   datacenter1 rack1   Up Normal  13.5 KB
50.00%  0

10.171.77.39datacenter1 rack1   Up Normal  13.5 KB
50.00%  85070591730234615865843651857942052863

 

Thanks for the hint. 

 

From: Derek Barnes [mailto:sj.clim...@gmail.com] 
Sent: Friday, August 10, 2012 2:00 PM
To: user@cassandra.apache.org
Subject: Re: Problem with version 1.1.3

 

Do both nodes refer to one another as seeds in cassandra.yaml?

On Fri, Aug 10, 2012 at 1:46 PM, Dwight Smith
 wrote:

Hi all

 

Just replaced ( clean install ) version 1.0.9 with 1.1.3 - two node
amazon cluster.  After yaml modification and starting both nodes - they
do not see each other:

 

  Note: Ownership information does not include topology, please specify
a keyspace.

Address DC  RackStatus State   Load
OwnsToken

10.168.87.107   datacenter1 rack1   Up Normal  9.07 KB
100.00% 0

 

Address DC  RackStatus State   Load
Effective-Ownership Token

10.171.77.39datacenter1 rack1   Up Normal  36.16 KB
100.00% 85070591730234615865843651857942052863

 

Help please

 



anyone have any performance numbers? and here are some perf numbers of my own...

2012-08-10 Thread Hiller, Dean
** 3. In my test below, I see there is now 8 GB of data and 9,000,000 rows.
Does that sound right? Nearly 1MB of space is used per row for a 50-column
row. That sounds like a huge amount of overhead (my values are longs in
every column, but that is still not much). I was expecting KB per row maybe, but
MB per row? My column names are "col"+i as well, so they are very short too.

A common configuration is 1TB drives per node, so I was wondering if anyone has run
any tests with map/reduce reading in all those rows (not doing anything with
the data, just reading it in).

** 1. How long does it take to go through the 500GB that would be on that
node?

I ran some tests just writing a fake table 50 columns wide, and I am seeing
it will take about 31 hours to write 500GB of information (a node is about full
at 500GB, since you need to reserve 30-50% of the space for compaction and such). I.e., if I
need to rerun any kind of indexing, it will take 31 hours… does this sound about
normal/ballpark? Obviously many nodes will be below that, so that would be the worst
case with 1TB drives.

** 2. Anyone have any other data?

Thanks,
Dean
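
As a sanity check on question 3, a back-of-the-envelope estimate using the
per-column (~15 byte) and per-row (~23 byte) overhead figures from the DataStax
1.0/1.1 sizing docs; the row key size is a guess and index/bloom filter space
is ignored:

cols_per_row  = 50
name_bytes    = 5                 # "col12"-style names
value_bytes   = 8                 # one long per column
column_size   = name_bytes + value_bytes + 15
row_key_bytes = 10                # assumption; the post doesn't give key sizes
row_size      = row_key_bytes + 23 + cols_per_row * column_size
print(row_size)                   # ~1433 bytes per 50-column row, uncompressed

With SSTable compression (the 1.1 default) roughly halving that, the ~733
bytes/row reported in the follow-up below is in the expected ballpark, i.e.
most of the space really is per-column overhead plus column names rather than
the 8-byte values themselves.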


Re: anyone have any performance numbers? and here are some perf numbers of my own...

2012-08-10 Thread Hiller, Dean
Ignore the third one, my math was bad… it worked out to 733 bytes/row, and it
ended up being 6.6 GB as it compacted some after it was done, when the
load was light (I noticed that a bit later).

But what about the other two? Is that approximately the expected time?

Thanks,
Dean

On 8/10/12 3:50 PM, "Hiller, Dean"  wrote:

>** 3. In my test below, I see there is now 8 GB of data and 9,000,000
>rows. Does that sound right? Nearly 1MB of space is used per row for a
>50-column row. That sounds like a huge amount of overhead (my values
>are longs in every column, but that is still not much). I was expecting
>KB per row maybe, but MB per row? My column names are "col"+i as well, so
>they are very short too.
>
>A common configuration is 1T drives per node, so I was wondering if
>anyone ran any tests with map/reduce on reading in all those rows(not
>doing anything with it, just reading it in).
>
>** 1. How long does it take to go through the 500GB that would be on
>that node?
>
>I ran some tests just writing a fake table 50 columns wide, and I am
>seeing it will take about 31 hours to write 500GB of information (a node
>is about full at 500GB, since you need to reserve 30-50% of the space for
>compaction and such). I.e., if I need to rerun any kind of indexing, it will take 31
>hours… does this sound about normal/ballpark? Obviously many nodes will
>be below that, so that would be the worst case with 1TB drives.
>
>** 2. Anyone have any other data?
>
>Thanks,
>Dean



quick question about data layout on disk

2012-08-10 Thread Aaron Turner
Curious, but does cassandra store the rowkey along with every
column/value pair on disk (pre-compaction) like Hbase does?  If so
(which makes the most sense), I assume that's something that is
optimized during compaction?


-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
"carpe diem quam minimum credula postero"


Re: quick question about data layout on disk

2012-08-10 Thread Terje Marthinussen
The row key is stored only once in any SSTable file.

That is, in the special case where you get one SSTable file per column/value, you
are correct, but normally I guess most of us are storing more than one column per key.

Regards,
Terje

On 11 Aug 2012, at 10:34, Aaron Turner  wrote:

> Curious, but does cassandra store the rowkey along with every
> column/value pair on disk (pre-compaction) like Hbase does?  If so
> (which makes the most sense), I assume that's something that is
> optimized during compaction?
> 
> 
> -- 
> Aaron Turner
> http://synfin.net/ Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & 
> Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>-- Benjamin Franklin
> "carpe diem quam minimum credula postero"


Re: Decision Making- YCSB

2012-08-10 Thread Roshni Rajagopal
Thanks Edward and Mohit.

We do have an in-house tool, but it tests pretty much the same things as
YCSB: read/write performance given a number of threads and the type of operations
as input.
The good thing here is that we own the code and we can modify it easily. YCSB
does not seem to be very well supported.

When you say you modify the tests for your use case, what exactly do you modify?
Could you give me an example of a use-case-driven approach?

Regards,
Roshni

From: Mohit Anchlia <mohitanch...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Decision Making- YCSB

I agree with Edward. We always develop our own stress tool that tests each use 
case of interest. Every use case is different in certain ways that can only be 
tested using custom stress tool.

On Fri, Aug 10, 2012 at 7:25 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
There are many YCSB forks on GitHub that get optimized for specific
databases, but the default one is decent across the defaults. Cassandra
has its own internal stress tool that we like better.

The shortcomings are that generic tools and generic workloads are
generic and thus not real-world. But other than that, being able to
tweak the workload percentages and change the read patterns
(latest/random/etc.) does a decent job of stressing normal and worst-case
scenarios on the read path. Still, I would try to build my own real-world
use case as a tool to evaluate a solution before making a
choice.

Edward

On Thu, Aug 9, 2012 at 8:58 PM, Roshni Rajagopal <roshni.rajago...@wal-mart.com> wrote:
> Hi Folks,
>
> I'm coming up with a set of decision criteria on when to choose traditional
> RDBMS vs. various NoSQL options.
> So one aspect is the  application requirements around Consistency, 
> Availability, Partition Tolerance, Scalability, Data Modeling etc. These can 
> be decided at a theoretical level.
>
> Once we are sure we need NoSQL, to effectively benchmark the performance 
> around use-cases or application workloads, we need a standard method.
> Some tools are specific to a database, like Cassandra's stress tool. The only
> tool I could find which seems to compare across NoSQL databases, and can be 
> extended and is freely available is YCSB.
>
> Is YCSB updated for latest versions of cassandra and hbase? Does it work for 
> Datastax enterprise? Is it regularly updated for new versions of NoSQL 
> databases, or is this something we would need to take up as a development 
> effort?
>
> Are there any shortcomings to using YCSB- and would it be preferable to 
> develop own tool for performance benchmarking of NoSQL systems. Do share your 
> thoughts.
>
>
> Regards,
> Roshni
>
