Here is a link to get started with DSBench:
https://github.com/datastax/dsbench-labs#getting-started
and DataStax Labs:
https://downloads.datastax.com/#labs
On Thu, Jan 30, 2020 at 11:47 AM Jonathan Shook wrote:
>
> Some of you may remember NGCC talks on metagener (now VirtualDataSet)
Some of you may remember NGCC talks on metagener (now VirtualDataSet)
and engineblock from 2015 and 2016. The main themes went something
along the lines of "testing c* with realistic workloads is hard,
sizing cassandra is hard, we need tools in this space that go beyond
what cassandra-stress can do
Benson,
I was considering using Redis for a specific project. Can you
elaborate a bit on your problem with it? What were the circumstances,
loading factors, etc?
On Fri, Feb 18, 2011 at 9:19 AM, Benson Margulies wrote:
> redis times out at random regardless of what we configure for client
> timeo
Would you share with us the changes you made, or problems you found?
On Wed, Jan 26, 2011 at 10:41 AM, Oleg Proudnikov wrote:
> Hi All,
>
> I was able to run contrib/stress at a very impressive throughput. Single
> threaded client was able to pump 2,000 inserts per second with 0.4 ms latency.
> M
clients:
Java and MVEL + Hector
Perl + thrift
Usage: high-traffic monitoring harness with dynamic mapping and
loading of handlers
Cassandra was part of the "do more with less hardware" approach to
designing this system.
On Fri, Jan 14, 2011 at 11:24 AM, Ertio Lew wrote:
> Hey,
>
> If you have
Perhaps. I use Hector. I have a bit of rework to do moving from .6 to
.7. This is something I wasn't anticipating in my earlier planning.
Had Pelops been around when I started using Hector, I would have
probably chosen it over Hector. The Pelops client seemed to be better
conceived as far as progr
I believe the following condition within submitMinorIfNeeded(...)
determines whether to continue, so it's not a hard loop.
// if (sstables.size() >= minThreshold) ...
On Thu, Jan 6, 2011 at 2:51 AM, shimi wrote:
> According to the code it makes sense.
> submitMinorIfNeeded() calls doCompaction(
but
> Jonathan did a good job of covering that. Don't forget about the effects of
> caching here, too.
>
> The only way to tell if it is cost-effective is to test your particular
> access patterns (using a configured stress.py test or, preferably, your
> actual application).
>
ns to ask about your data access:
Is there a "user session" which shows an access pattern to proximal data?
Are there sets of access which always happen close together?
Are there keys or maps which add extra indirection?
I'm not familiar with your situation. I was just providing
... some kind of what?
On Mon, Sep 6, 2010 at 3:38 AM, Michal Augustýn
wrote:
> Thank you for the great link!
> The mentioned solution is using locking but I would prefer some optimistic
> strategy (because the conflicts are rare in my situation) but I'm afraid
> that this is really the best solu
I have been able to reproduce this, although it was a bug in
application client code. If you keep a thrift client around longer
after it has had an exception, it may generate this error.
In my case, I was holding a reference via ThreadLocal<> to a stale
storage object.
Another symptom which may h
Don't forget about the tombstones (delete markers).
They are still present on the other two nodes, so they will
replicate to the 3rd node and finish off your deleted data.
On Mon, Aug 2, 2010 at 9:30 AM, Edward Capriolo wrote:
> On Mon, Aug 2, 2010 at 9:11 AM, john xie wrote:
>> ReplicationFac
Also, google trends is only a measure of what terms people are
searching for. To equate this directly to growth would be misleading.
On Tue, Jul 27, 2010 at 12:27 PM, Drew Dahlke wrote:
> There's a good post on stackoverflow comparing the two
> http://stackoverflow.com/questions/2892729/mongodb-vs-
As long as you only want to edit YEd files and print them, it's great.
Anything else to do with it is proprietary and expensive (for me, at
least).
On Mon, Jul 26, 2010 at 7:12 PM, Ashwin Jayaprakash
wrote:
>
> YEd ( http://www.yworks.com/en/products_yed_about.html
> http://www.yworks.com/en/prod
ot too bad
>> > but I
>> >would like to get something more along the lines of this example
>> >http://www.javageneration.com/?p=70
>> >
>> >Regards,
>> >
>> >Michael
>> >
>> >
>> >On Mon, Jul 26, 2010 at 1:24 PM,
+1 for Inkscape/SVG
On Mon, Jul 26, 2010 at 1:07 PM, uncle mantis wrote:
> What do you all use for this? I am currently using MySQL Workbench for my
> SQL projects.
>
> PowerPoint? Visio? Gimp? Pencil and Paper?
>
> Thanks for the help!
>
> Regards,
>
> Michael
>
My guess:
Your test is beating up your system. The system may need more memory
or disk throughput or CPU in order to keep up with that particular
test.
Check some of the posts on the list with "deferred processing" in the
body to see why.
Also, can you post the error log?
On Mon, Jul 26, 2010 at
>>> http://wiki.apache.org/cassandra/CassandraLimitations
>>>> * If you wanted to get 1000 blobs, rather than group them in a single
>>>> row using a super column consider building a secondary index in a standard
>>>> column. One CF for the blobs using
If only one instance of Cassandra is running on each node, then use
something like
pkill -f 'java.*cassandra'
If more than one (not recommended for various reasons), then you
should modify the scripts to put a unique token in the process name.
Something like -Dprocname=... will work. Then you can
CordiS,
The general approach for this kind of change is to implement it
yourself and submit a patch. In such a case, you may still have to be
thoughtful and patient in order to get everyone on board. I wish you
luck.
On Mon, Jul 26, 2010 at 6:51 AM, CordiS wrote:
> Thank you for nothing.
>
> 201
iple columns atomically. Do I have to use
> the batch_mutation for deletion, too?
> On Sat, Jul 24, 2010 at 2:36 PM, Jonathan Shook wrote:
>>
>> Just to clarify, microseconds may be used, but they provide the same
>> behavior as milliseconds if they aren't using
> (as well as the CLI) used milliseconds, not micro.
> So if you're using hector version 0.6.0-11 or earlier, or by any chance in
> some other ways are mixing millisec in your app (are you using
> System.currentTimeMillis() somewhere?) then the behavior you're seeing is
> expected.
>
mnPath cp1 = new ColumnPath("Super2");
> cp1.setSuper_column("hotel".getBytes());
> cp1.setColumn("Best Western".getBytes()); client.insert(KEYSPACE, "name",
> cp1, "Best Western of SF".getBytes(), System.currentTimeMillis(),
> Co
> Econolodge: {name: "Econolodge of SF"}
> }
> }
>
> Are the CRUD Operations not referencing this correctly?
>
>
>
> -Original Message-
> From: Jonathan Shook [mailto:jsh...@gmail.com]
> Sent: Friday, July 23, 2010 1:
suggestion. Unfortunately, CRUD test still does not work for
> me. Can you provide a simplest CRUD test possible that works?
> On Fri, Jul 23, 2010 at 10:59 AM, Jonathan Shook wrote:
>>
>> I suspect that it is still your timestamps.
>> You can verify this with a fake timesta
There are two scaling factors to consider here. In general the worst
case growth of operations in Cassandra is kept near to O(log2(N)). Any
worse growth would be considered a design problem, or at least a high
priority target for improvement. This is important for considering
the load generated by
I suspect that it is still your timestamps.
You can verify this with a fake timestamp generator that is simply
incremented on each getTimestamp().
1 millisecond is a long time for code that is wrapped tightly in a
test. You are likely using the same logical time stamp for multiple
operations.
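A minimal sketch of such a fake timestamp generator, in Python rather than the Java client under discussion. The class name and method name are hypothetical; the only idea taken from the message above is that each call returns a value one higher than the last, so no two operations can share a logical timestamp.

```python
import itertools
import threading


class FakeTimestampGenerator:
    """Hands out strictly increasing logical timestamps, ignoring the wall clock."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def get_timestamp(self):
        # Each call returns a value one higher than the previous call,
        # even when calls arrive within the same millisecond.
        with self._lock:
            return next(self._counter)


clock = FakeTimestampGenerator()
stamps = [clock.get_timestamp() for _ in range(5)]
```

If the test passes with a generator like this and fails with wall-clock milliseconds, timestamp collisions are the likely cause.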
On
You are correct. In this case, Cassandra would journal two writes to
the same logical row, but they would be 2 independent writes. Writes
do not depend on reads, so they are self-contained. If either column
exists already, it will be overwritten.
These journaled actions would then be applied to th
pping to CL.ONE and see if you only get one copy. If that
> fixes it, I'd suggest searching JIRA.
> Mike
>
> On Thu, Jul 8, 2010 at 6:40 PM, Jonathan Shook wrote:
>>
>> Should I ever expect multiples of the same key (with non-empty column
>> sets) from the same
Should I ever expect multiples of the same key (with non-empty column
sets) from the same get_range_slices call?
I've verified that the column data is identical byte-for-byte as
well, including column timestamps.
Or the same key, in some cases. If you have multiple operations
against the same columns 'at the same time', their ordering may be
indeterminate.
This can happen if the effective resolution of your time stamp is
coarse enough to bracket multiple operations. Milliseconds are not
fine enough in many case
Until then, a pragmatic solution, however undesirable, would be to
only have a single logical thread/task/actor that is allowed to
read, modify, update. If this doesn't work for your application, then a
(distributed) lock manager may be used until such time that you can
take it out. Some are using Zo
Ideas:
Use a checkpoint that moves forward in time for each logical partition
of the workload.
Establish a way of dividing up jobs between clients that doesn't
require synchronization. One way of doing this would be to modulo the
key by the number of logical workers, allowing them to graze direct
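The modulo idea can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, and hashing string keys with CRC32 before taking the modulo is an assumption made here so that non-numeric keys work; numeric keys could be taken modulo the worker count directly.

```python
import zlib


def worker_for(key: str, num_workers: int) -> int:
    """Map a key to a worker index using a stable hash modulo the worker count."""
    # zlib.crc32 gives the same value in every process, unlike the built-in
    # hash() for strings, so all clients agree without talking to each other.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def keys_for_worker(keys, worker_id: int, num_workers: int):
    """Return the subset of keys that a given worker is responsible for."""
    return [k for k in keys if worker_for(k, num_workers) == worker_id]


keys = [f"job-{i}" for i in range(20)]
assignments = [keys_for_worker(keys, w, 4) for w in range(4)]
```

Every key lands with exactly one worker, so the workers never need to coordinate over who owns which job.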
Doh! Replace "of" with "if" in the top line.
On Tue, Jun 15, 2010 at 7:57 PM, Jonathan Shook wrote:
> There is JSON import and export, of you want a form of external backup.
>
> No, you can't hook event subscribers into the storage engine. You can
> modify
There is JSON import and export, of you want a form of external backup.
No, you can't hook event subscribers into the storage engine. You can modify
it to do this, however. It may not be trivial.
An easier way to do this would be to have a boundary system (or dedicated
thread, for example) consum
Actually, you shouldn't expect errors in the general case, unless you
are simply trying to use data that can't fit in available heap. There
are some practical limitations, as always.
If there aren't enough resources on the server side to service the
clients, the expectation should be that the serv
ntinuous
> bulk writes?
> Thanks for all the help,
> Rishi
>
> From: Jonathan Shook
> To: user@cassandra.apache.org
> Sent: Thu, June 10, 2010 7:39:24 PM
> Subject: Re: Cassandra Write Performance, CPU usage
>
> You are testing Cassandra in a way
You are testing Cassandra in a way that it was not designed to be used.
Bandwidth to disk is not a meaningful example for nearly anything
except for filesystem benchmarking and things very nearly the same as
filesystem benchmarking.
Unless the usage patterns of your application match your test data
give us the data to insert that
> allows reproducing this?
>
> On Tue, Jun 8, 2010 at 10:20 AM, Jonathan Shook wrote:
>> Possible bug...
>>
>> Using a slice range with the empty sentinel values, and a count of 1
>> sometimes yields 2 ColumnOrSuperColumns, sometim
bug in my
client.
(Cassandra 0.6.1/Thrift/Perl)
On Tue, Jun 8, 2010 at 11:18 AM, Jonathan Shook wrote:
> I was misreading the result with the original slice range.
> I should have been expecting exactly 2 ColumnOrSuperColumns, which is
> what I got. I was erroneously expecting only 1.
>
I was misreading the result with the original slice range.
I should have been expecting exactly 2 ColumnOrSuperColumns, which is
what I got. I was erroneously expecting only 1.
Thanks!
Jonathan
2010/6/8 Ted Zlatanov :
> On Mon, 7 Jun 2010 17:20:56 -0500 Jonathan Shook wrote:
>
> JS>
I have a structure like this:
CF:"Status"
{
Row("Component42")
{
SuperColumn(1275948636203) (epoch millis)
{
sub columns...
}
}
}
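Read as a map of maps, the same layout could be sketched in Python like this. The subcolumn names and values are made up for illustration; only the column family, row key, and epoch-millis supercolumn name come from the structure above.

```python
# Column family "Status": row key -> supercolumn (epoch millis) -> subcolumns.
status = {
    "Component42": {
        1275948636203: {"state": "OK", "load": "0.42"},    # hypothetical subcolumns
        1275948696203: {"state": "WARN", "load": "0.91"},  # hypothetical subcolumns
    }
}

# Supercolumn names are timestamps, so the most recent entry is the largest name.
latest_ts = max(status["Component42"])
latest = status["Component42"][latest_ts]
```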
The supercolumns are dropped in periodically by system A, which is using Hector.
System B uses a lightweight perl/Thrift client to reduce proce
Sorry for the extra post. This version has confusing parts removed and
better formatting.
It sounds like you are getting a handle on it, but maybe in a round-about way.
Here are some ways I like to conceptualize Cassandra. Maybe they can help.
Either the grid analogy or the maps-of-maps analogy
It sounds like you are getting a handle on it, but maybe in a round-about way.
Here are some ways I like to conceptualize Cassandra. Maybe they can
shorten your walk.
Either the grid analogy or the maps-of-maps analogy can apply, as they
both map conceptually to the way that we use a column fami
If I may ask, why the need for frequent topology changes?
On Fri, Jun 4, 2010 at 1:21 PM, Benjamin Black wrote:
> On Fri, Jun 4, 2010 at 11:14 AM, Philip Stanhope wrote:
>> I guess I'm thick ...
>>
>> What would be the right choice? Our data demands have already been proven to
>> scale beyond
Insert "if you want to use long values for keys and column names"
above paragraph 2. I forgot that part.
On Wed, Jun 2, 2010 at 1:29 PM, Jonathan Shook wrote:
> If you want to do range queries on the keys, you can use OPP to do this:
> (example using UTF-8 lexicographic keys, w
at if
>> the events come in bursts, so within a day there are millions of events, but
>> they all come within microseconds of each other a few times a day? How do
>> you find the events that happened on a particular day if you can't store
>> them all in one row?
>
Either OPP by key, or within a row by column name. I'd suggest the latter.
If you have structured data to stick under a column (named by the
timestamp), then you can serialize and unserialize it yourself, or you
can use a supercolumn. It's effectively the same thing. Cassandra
only provides the su
Can you clarify what you mean by 'random between nodes' ?
On Wed, Jun 2, 2010 at 8:15 AM, David Boxenhorn wrote:
> I see. But we could make this work if the random partitioner was random only
> between nodes, but was still ordered within each node. (Or if there were
> another partitioner that did
There is no easy answer to this. The requirements vary widely even
within a particular "type" of application.
If you have a list of specific requirements for a given application,
it is easier to say whether it is a good fit.
If you need a schema marshaling system, then you will have to build it
in
Also, what specifically do you mean by 'slow'? Which measurements
are you looking at? What are the baseline constraints of your test
system?
2010/6/1 史英杰 :
> Hi, it would be better if we knew which Consistency Level you chose,
> and what the schema of the test data is.
>
> On June 1, 2010, in the afternoon at 4:
Depending on the key, the request would have been proxied to the first
or second node.
The CLI uses a consistency level of "ONE", meaning that only a single
node's data would have been considered when you get().
Also, the responsible nodes for a given key are mapped accordingly at
request time, and
The example is a little confusing.
.. but ..
1) "sharding"
You can square the capacity by having a 2-level map.
CF1->row->value->CF2->row->value
This means finding some natural subgrouping or hash that provides a
good distribution.
2) "hashing"
You can also use some additional key hashing to sp
I wrote some Iterable<*> methods to do this for column families that
share key structure with OPP.
It is on the hector examples page. Caveat emptor.
It does iterative chunking of the working set for each column family,
so that you can set the nominal transfer size when you construct the
Iterator/I
I don't think that queries on a key range are valid unless you are using OPP.
As far as hashing the key for OPP goes, I take it to be the same as not
using OPP. It's really a matter of where it gets done, but it has much
the same effect.
(I think)
Jonathan
On Wed, May 26, 2010 at 12:51 PM, Peter H
Writes only have to write to the journal before returning. Reads have
to read potentially from several sources, including binary searches of
things that may or may not be cached anywhere. The journal writes do
not involve much random disk IO, while the read activity does.
On Tue, May 25, 2010 at
It would be helpful to know the replication factor and consistency
levels of your reads and writes.
2010/5/23 史英杰 :
> Thanks for your reply!
> //Were all of those 20 nodes running real hardware (i.e. NOT VMs)?
> Yes, there are 20 real servers running in the cluster, and one Cassandra
> instance
Every system has its limits. When you say to imagine there are
billions of users without providing any other real data, it limits the
discussion strictly to the hypothetical (and hyperbolic, usually).
The only reasonable answer we could provide would be about the types
of limitations we know about
ture.
>
> Anybody want to tell me I'm wrong?
>
> BTW, Bill, I think we've corresponded before, here:
> http://www.dehora.net/journal/2004/04/whats_in_a_name.html
>
> On Fri, May 14, 2010 at 2:23 AM, Bill de hOra wrote:
>>
>> A SlicePredicate/SliceRange ca
get_slice
see: http://wiki.apache.org/cassandra/API under get_slice and SlicePredicate
On Thu, May 13, 2010 at 9:45 AM, Bill de hOra wrote:
> get_count returns the number of columns, not the names of those columns? I
> should have been specific, by "list the columns", I meant "list the column
>
You can choose to have keys ordered by using an
OrderPreservingPartitioner with the trade-off that key ranges can get
denser on certain nodes than others.
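A small Python sketch of why the two approaches differ. It is only an illustration: the keys are invented, and sorting by an MD5 digest stands in for a hash-based partitioner's placement rather than reproducing Cassandra's actual token arithmetic.

```python
import hashlib

keys = ["user:0001", "user:0002", "user:0003", "user:0004"]


def token(key):
    # Hash-based placement: position is derived from a digest of the key,
    # so it has no relationship to the key's lexical order.
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Order-preserving placement keeps keys in their natural order, so a
# contiguous key range is also contiguous in storage.
order_preserving = sorted(keys)
key_range = [k for k in order_preserving if "user:0002" <= k <= "user:0003"]

# Hash-based placement spreads keys evenly but scatters neighbouring keys.
hashed = sorted(keys, key=token)
```

Keys with a common prefix stay adjacent under the order-preserving layout, which is also why a popular prefix can make one node denser than the others.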
On Wed, May 12, 2010 at 7:48 PM, philip andrew wrote:
>
> Hi,
> From my understanding, Cassandra entities are indexed on only one key, so
> this
Although, if replication factor spans all nodes, then the disparity in
row allocation should be a non-issue when using
OrderPreservingPartitioner.
On Wed, May 12, 2010 at 6:42 PM, Vijay wrote:
> If you use Random partitioner, You will NOT get RowKey's sorted. (Columns
> are sorted always).
> Answ
RAID may be less valuable to you here. More useful to you would be to
split the storage according to
http://wiki.apache.org/cassandra/CassandraHardware
When Cassandra is accessing effectively random parts of a large data
store, expect it to be constantly hitting certain "always hot" parts
of files
This is one of the sticking points with the key concatenation
argument. You can't simply access subpartitions of data along an
aggregate name using a concatenated key unless you can efficiently
address a range of the keys according to a property of a subset. I'm
hoping this will bear out with more
Agreed
On Mon, May 10, 2010 at 12:01 PM, Mike Malone wrote:
> On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook wrote:
>>
>> I have to disagree about the naming of things. The name of something
>> isn't just a literal identifier. It affects the way people think about
I have to disagree about the naming of things. The name of something
isn't just a literal identifier. It affects the way people think about
it. For new users, the whole naming thing has been a persistent
barrier.
As for your suggestions, I'm all for simplifying or generalizing the
"how it works" p
I'm not sure this is much of an improvement. It does illustrate,
however, the desire to couch the concepts in terms that each is
already comfortable with. Nearly every set of terms which come from an
existing system will have baggage which doesn't map appropriately. Not
that the "sparse multidimens
Dallas
On Thu, May 6, 2010 at 4:28 PM, Jonathan Ellis wrote:
> We're planning that now. Where would you like to see one?
>
> On Thu, May 6, 2010 at 2:40 PM, S Ahmed wrote:
>> Do you have rough ideas when you would be doing the next one? Maybe in 1 or
>> 2 months or much later?
>>
>>
>> On Tue,
e
timestamp for tightly grouped operations, which may lead to unexpected
behavior. I've submitted a request to simplify this.
On Wed, May 5, 2010 at 5:03 PM, Jonathan Shook wrote:
> When I try to replace a set of columns, like this:
>
> 1) remove all columns under a CF/row
> 2) batch
When I try to replace a set of columns, like this:
1) remove all columns under a CF/row
2) batch insert columns into the same CF/row
.. the columns cease to exist.
Is this expected?
This is just across 2 nodes with Replication Factor 2 and Consistency
Level QUORUM.
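One way this could happen is if the remove and the inserts carry the same timestamp. The toy Python model below is a sketch under a stated assumption: that the highest timestamp wins and that a delete beats an insert when the timestamps tie. The function and the tuple layout are hypothetical and are not Cassandra's reconciliation code.

```python
def resolve(ops):
    """Pick the surviving value for one column.

    ops is a list of (timestamp, kind, value) tuples, where kind is
    "insert" or "delete". Returns the surviving value, or None if deleted.
    """
    # Assumption of this sketch: highest timestamp wins, and on a tie the
    # delete wins (True sorts after False in the tuple comparison).
    winner = max(ops, key=lambda op: (op[0], op[1] == "delete"))
    return winner[2] if winner[1] == "insert" else None


same_tick = resolve([(100, "delete", None), (100, "insert", "v1")])
next_tick = resolve([(100, "delete", None), (101, "insert", "v1")])
```

Under that assumption, giving the batch insert a timestamp strictly greater than the remove is enough for the new columns to survive.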
Ah! Thank you.
Explained better here:
http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency
On Tue, May 4, 2010 at 8:38 PM, Robert Coli wrote:
> On 5/4/10 7:16 AM, Jonathan Shook wrote:
>
>> I may be wrong here. Someone please correc
pler and I am just stupid
> I retried with clean data and commit log directories and everything works
> well.
>
> I must have missed something (maybe when I upgraded from 0.5.1 to 0.6)
> but anyway, I am just in test.
>
>
> On Tue, May 4, 2010 at 8:47 AM, Jonathan Sh
I think you may have found the "eventually" in eventually consistent. With a
replication factor of 1, you are allowing the client thread to continue to
the read on node 2 before the write is replicated to node 2. Try setting your
replication factor higher for different results.
Jonathan
On Tue, May 4, 2010 a
I am only speaking to your second question.
It may be helpful to think of modeling your storage layout in terms of
* lists
* sets
* hash maps
... and certain combinations of these.
Since there are no schema-defined relations, your relations may appear
implicit between different views or "copies"
en a BAR has dynamically growing numbers of fields
> (subcolumns) that you get in trouble with that model.
>
> On Tue, Apr 27, 2010 at 4:24 PM, Jonathan Shook wrote:
> > I'm trying to model a one-to-many set of data in which both sides of the
> > relation may grow arbitraril
I'm trying to model a one-to-many set of data in which both sides of the
relation may grow arbitrarily large.
There are arbitrarily many FOOs. For each FOO, there are arbitrarily many
BARs.
Both types are modeled as an object, containing multiple fields (columns) in
the application.
Given a key-add
The allocation of memory may have failed depending on the available virtual
memory, whether or not the memory would have been subsequently accessed by
the process. Some systems do the work of allocating physical pages only
when they are accessed for the first time. I'm not sure if yours is one of