question about cassandra.in.sh

2011-08-24 Thread Koert Kuipers
i have an existing cassandra instance on my machine. it came with brisk and
lives in /usr/share/brisk/cassandra. it also created
/usr/share/cassandra/cassandra.in.sh

now i wanted to run another instance of cassandra (i needed a 0.7 version
for compatibility reasons), so i downloaded it from apache cassandra website
and installed it in /usr/share/cassandra-0.7

my problem is that the scripts for my cassandra 0.7 instance don't work
properly. the problem lies in the code snippet below. when i run the
scripts they source /usr/share/cassandra/cassandra.in.sh, which has the
wrong settings (it now loads all the jars from
/usr/share/brisk/cassandra/lib). i know i can fix it by setting
CASSANDRA_INCLUDE but i think that's not a very nice solution.

why was the decision made that the central "cassandra.in.sh" should have
higher priority than the local one? doesn't that break local installs?
wouldn't it make more sense if scripts assumed they were in SOMEDIR/bin and
then tried to load cassandra.in.sh from SOMEDIR first with the highest
priority?

best, koert


code snippet:
if [ "x$CASSANDRA_INCLUDE" = "x" ]; then
# Locations (in order) to use when searching for an include file.
for include in /usr/share/cassandra/cassandra.in.sh \
   /usr/local/share/cassandra/cassandra.in.sh \
   /opt/cassandra/cassandra.in.sh \
   ~/.cassandra.in.sh \
   `dirname $0`/cassandra.in.sh; do
if [ -r $include ]; then
. $include
break
fi
done


Re: question about cassandra.in.sh

2011-08-25 Thread Koert Kuipers
hey eric, the one thing i do not agree with is that it is the element of least
surprise. i would argue that the default behavior for *nix applications is
that they find out what their home directory is and operate relative to
that. something like:

script_dir="$(dirname "$(readlink -f ${BASH_SOURCE[0]})")"
home_dir=${script_dir%/bin}

or production-quality code from hadoop-config.sh, which is sourced by the
main hadoop script:
this="${BASH_SOURCE-$0}"
bin=$(cd -P -- "$(dirname -- "$this")" && pwd -P)
script="$(basename -- "$this")"
this="$bin/$script"
# the root of the Hadoop installation
if [ -z "$HADOOP_HOME" ]; then
  export HADOOP_HOME=`dirname "$this"`/..
fi

i find setting a variable in your shell like CASSANDRA_INCLUDE to be error-prone.
at some point i will forget what i set it to and then i am accidentally using
the wrong application. once applications are aware of their home dir all i
have to do is "ln -s /usr/lib/cassandra-0.7/bin/nodetool
/usr/sbin/nodetool-0.7" and then i can use it without risk of confusion.

best, koert

On Wed, Aug 24, 2011 at 9:48 PM, Eric Evans  wrote:

> On Wed, Aug 24, 2011 at 1:28 PM, Koert Kuipers  wrote:
> > my problem is that the scripts for my cassandra 0.7 instance don't work
> > properly. the problem lies in the code snippets below. when i run the
> > scripts they source /usr/share/cassandra/cassandra.in.sh, which has the
> > wrong settings (it now loads all the jars from
> > /usr/share/brisk/cassandra/lib). i know i can fix it by setting
> > CASSANDRA_INCLUDE but i think that's not a very nice solution.
> >
> > why was the decision made that the central "cassandra.in.sh" should
> have
> > higher priority than the local one? doesn't that break local installs?
>
> It was considered the element of least surprise.  If it exists in
> /usr/share/cassandra then Cassandra's been "installed", and in the
> absence of any other data, that's probably what should be used.  If
> it's a local copy *and* there's a copy installed in
> /usr/share/cassandra, it's probably the owner of the local copy that
> needs to know what they are doing and intervene with
> CASSANDRA_INCLUDE.
>
> > wouldn't it make more sense if scripts assumed they were in SOMEDIR/bin
> and
> > then tried to load cassandra.in.sh from SOMEDIR first with the highest
> > priority?
>
> I don't think so, but then I was the one that reasoned out the current
> search order, so YMMV. :)
>
> > code snippet:
> > if [ "x$CASSANDRA_INCLUDE" = "x" ]; then
> > # Locations (in order) to use when searching for an include file.
> > for include in /usr/share/cassandra/cassandra.in.sh \
> >/usr/local/share/cassandra/cassandra.in.sh \
> >/opt/cassandra/cassandra.in.sh \
> >~/.cassandra.in.sh \
> >`dirname $0`/cassandra.in.sh; do
> > if [ -r $include ]; then
> > . $include
> > break
> > fi
> > done
> >
>
> --
> Eric Evans
> Acunu | http://www.acunu.com | @acunu
>


Re: Storing (python) objects

2011-09-23 Thread Koert Kuipers
i would advise not to use a language-specific storage format; you might
regret it later on if you want to add an application to your system that is
written in anything other than python. i mean python is great, but it is not
necessarily the right tool for every job.

look at thrift/protobuf/avro/bson/json.
i would use a serialization format with an IDL.
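
as a rough illustration of the difference (the event dict below is just made up
for the example):

import pickle
import json

event = {"ticker": "IBM", "prices": {"2011-09-22 14:00": 140.72}}

# language-specific: only python can read these bytes back
value_pickle = pickle.dumps(event)

# language-neutral: any client (java, python, ...) can parse this, and an
# IDL-based format (thrift/protobuf/avro) adds a checked schema on top
value_json = json.dumps(event)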

On Fri, Sep 23, 2011 at 5:07 AM, David Allsopp  wrote:

> We have done exactly as you describe (nested dicts etc) - works fine as
> long as you are happy to read the whole lump of data, i.e. don't need to
> read at a finer granularity. This approach can also save a lot of storage
> space as you don't have the overhead of many small columns.
>
> Some folks also write JSON, which would be a bit more language-independent
> of course.
>
>
> On 22 September 2011 19:28, Ian Danforth  wrote:
>
>> All,
>>
>>  I find myself considering storing serialized python dicts in Cassandra.
>> I'd like to store fairly complex, nested dicts, and it's just easier to do
>> this rather than work out a lot of super columns / columns etc.
>>
>>  Do others find themselves storing serialized data structures in Cassandra
>> or is this generally a sign of doing something wrong?
>>
>>  Thanks in advance!
>>
>> Ian
>>
>
>


how to do a get_range_slices where all keys start with same string

2011-01-11 Thread Koert Kuipers
I would like to do a get_range_slices for all keys (which are strings) that 
start with the same substring x (for example "com.google"). How do I do that?
start_key = x and end_key = x doesn't seem to do the job...
thanks koert



RE: how to do a get_range_slices where all keys start with same string

2011-01-11 Thread Koert Kuipers
Ok, I see get_range_slices is really only useful for paging with RP...

So if I were using OPP (which I am not) and I wanted all keys starting with 
"com.google", what should my start_key and end_key be?

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Tuesday, January 11, 2011 9:02 PM
To: user
Subject: Re: how to do a get_range_slices where all keys start with same string

http://wiki.apache.org/cassandra/FAQ#range_rp

also, start==end==x means "give me back exactly row x, if it exists."
IF you were using OPP you'd need end=y.

On Tue, Jan 11, 2011 at 7:45 PM, Koert Kuipers
 wrote:
> I would like to do a get_range_slices for all keys (which are strings) that
> start with the same substring x (for example "com.google"). How do I do
> that?
>
> start_key = x and end_key = x doesn't seem to do the job...
>
> thanks koert
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com



deletion

2010-10-14 Thread Koert Kuipers
Hello All,

I am testing Cassandra 0.7 with the Avro api on a single machine as a financial 
time series server, so my setup looks something like this:
keyspace = timeseries, column family = tickdata, key = ticker, super column = 
field (price, volume, high, low), column = timestamp.

So a single value, say a price of 140.72 for IBM today at 14:00 would be stored 
as
tickdata["IBM"]["price"]["2010-10-14 14:00"] = 140.72 (well of course 
everything needs to be encoded properly but you get the point).

My subcomparator type is TimeUUIDType so that I can do queries over time 
ranges. Inserting and querying all work reasonably well so far.

But sometimes I need to wipe out all the data for a whole day. To be more 
precise: I need to delete the stored values for all keys (tickers) and all 
super-columns (fields) for a given time period (condition on column). How would 
I go about doing that? First a multiget_slice and then a remove command for 
each value? Or am I missing an easier way?

Is slice deletion within batch_mutate still scheduled to be implemented?

Thanks for your help,
Koert



RE: deletion

2010-10-14 Thread Koert Kuipers
Aaron, Thanks for your response.

I use a custom UUID generator so that the second part is randomly generated (no 
MAC address). I actually want this to be random since I could potentially have 
multiple values for the same ticker, measure and time and I do not want to 
overwrite.

I didn't realize that supercolumns had that limitation. ticker:measure keys 
indeed seem to make sense. That's a relatively easy switch.

I could indeed add the day to the field (so ticker:measure:day) to enable easy 
deletion of days. However this doesn't feel very clean. I would prefer to keep 
using columns for time and use a slice for deletion. However last time I tried 
this I got an error (something about slice deletion not yet being supported 
with batch_mutate). CASSANDRA-494 seems to indicate this is still in the works 
but I am not sure if it actually is.

Thanks again. Koert


From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: October 14 2010 15:45
To: user@cassandra.apache.org
Cc: 'user@cassandra.apache.org'
Subject: Re: deletion

I would recommend using epoch time for your timestamp and comparing as 
LongType. The version 1 UUID includes the MAC of the machine that generated it, 
so two different machines will create different UUIDs for the same time. They 
are meant to be unique, after all: 
http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Version_1_.28MAC_address.29

You may also want to adjust your model; see the discussion on supercolumn 
limitations here: http://wiki.apache.org/cassandra/CassandraLimitations . Your 
current model is going to create very big super columns, which will degrade in 
performance over time. Perhaps use a standard CF and use "ticker:measure" as 
the row key; then you can add 2 billion (i think) columns on there, one for each 
time. You may still want to break the rows up further depending on your use 
case, e.g. ticker:measure:day, and then perhaps pull back the entire row to get 
every value for the day or delete the entire day easily.

For your deletion issue, batch_mutate is your friend. The Deletion struct lets 
you delete:
- a row, by excluding the predicate and super_column
- a super_column by including super_column and not predicate
- a column

Some of the things that were not implemented were fixed in 0.6.4 i think. 
Anyway they all work AFAIK.
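
as a rough illustration of the super_column case above, a minimal sketch using
the thrift-generated python bindings for 0.7 (note the thread itself uses the
avro interface, and the module paths below are an assumption about how the
bindings were generated):

import time
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra          # thrift-generated module; path may differ
from cassandra.ttypes import Deletion, Mutation, ConsistencyLevel

socket = TSocket.TSocket("localhost", 9160)
transport = TTransport.TFramedTransport(socket)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()
client.set_keyspace("timeseries")

# drop the entire "price" super column for row "IBM": a Deletion with
# super_column set and no predicate removes the whole super column
deletion = Deletion(timestamp=int(time.time() * 1000000), super_column="price")
client.batch_mutate({"IBM": {"tickdata": [Mutation(deletion=deletion)]}},
                    ConsistencyLevel.QUORUM)
transport.close()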

Hope that helps.
Aaron


On 15 Oct, 2010,at 07:55 AM, Koert Kuipers  
wrote:
Hello All,

I am testing Cassandra 0.7 with the Avro api on a single machine as a financial 
time series server, so my setup looks something like this:
keyspace = timeseries, column family = tickdata, key = ticker, super column = 
field (price, volume, high, low), column = timestamp.

So a single value, say a price of 140.72 for IBM today at 14:00 would be stored 
as
tickdata["IBM"]["price"]["2010-10-14 14:00"] = 140.72 (well of course 
everything needs to be encoded properly but you get the point).

My subcomparator type is TimeUUIDType so that I can do queries over time 
ranges. Inserting and querying all work reasonably well so far.

But sometimes I need to wipe out all the data for a whole day. To be more 
precise: I need to delete the stored values for all keys (tickers) and all 
super-columns (fields) for a given time period (condition on column). How would 
I go about doing that? First a multiget_slice and then a remove command for 
each value? Or am I missing an easier way?

Is slice deletion within batch_mutate still scheduled to be implemented?

Thanks for your help,
Koert



java.lang.OutOfMemoryError: Map failed

2010-10-27 Thread Koert Kuipers
While bootstrapping a new node, the existing node that is supposed to provide 
the data throws an error, and the bootstrapping hangs. The log from the 
existing node is below. Both nodes have little memory (only 2 GB; they are 
windows machines). I used the default configuration (Cassandra 0.7). Any 
suggestions on how to fix this? Should I just add memory? Thanks, Koert

INFO [STREAM_STAGE:1] 2010-10-27 11:53:09,905 StreamOut.java (line 127) Beginning transfer process to /192.168.162.102 - 62825437862633 for ranges (124804735337540159479107746638263794797,47070309318543332246917226414989217721]
 INFO [STREAM_STAGE:1] 2010-10-27 11:53:09,905 StreamOut.java (line 101) Flushing memtables for timeseries...
 INFO [STREAM_STAGE:1] 2010-10-27 11:53:09,905 StreamOut.java (line 205) Stream context metadata [C:\Devel\cassandra\data\timeseries\tickdata-e-82-Data.db/[(0,645809447), (1630778211,2136523711)], C:\Devel\cassandra\data\timeseries\tickdata-e-83-Data.db/[(0,51509)]], 2 sstables.
 INFO [STREAM_STAGE:1] 2010-10-27 11:53:09,905 StreamOut.java (line 179) Streaming file C:\Devel\cassandra\data\timeseries\tickdata-e-82-Data.db/[(0,645809447), (1630778211,2136523711)] to /192.168.162.102
ERROR [MESSAGE-STREAMING-POOL:3] 2010-10-27 11:53:10,124 DebuggableThreadPoolExecutor.java (line 102) Error in ThreadPoolExecutor
java.lang.RuntimeException: java.io.IOException: Map failed
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
        at sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:447)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:520)
        at org.apache.cassandra.net.FileStreamTask.stream(FileStreamTask.java:96)
        at org.apache.cassandra.net.FileStreamTask.runMayThrow(FileStreamTask.java:61)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        ... 3 more
Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:755)
        ... 8 more
ERROR [MESSAGE-STREAMING-POOL:3] 2010-10-27 11:53:10,124 CassandraDaemon.java (line 75) Fatal exception in thread Thread[MESSAGE-STREAMING-POOL:3,5,main]
java.lang.RuntimeException: java.io.IOException: Map failed
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
        at sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:447)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:520)
        at org.apache.cassandra.net.FileStreamTask.stream(FileStreamTask.java:96)
        at org.apache.cassandra.net.FileStreamTask.runMayThrow(FileStreamTask.java:61)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        ... 3 more
Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:755)
        ... 8 more


RE: cassandra + avro | python client vs java client

2010-10-27 Thread Koert Kuipers
It does not have a C extension, as far as I know.

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Wednesday, October 27, 2010 5:01 PM
To: user
Subject: Re: cassandra + avro | python client vs java client

Does Avro have a Python C extension yet?

If not, 10x is right in line with how much faster I would expect Java
to be than pure Python.

On Wed, Oct 27, 2010 at 11:59 AM, Koert Kuipers
 wrote:
> Hey all,
>
> I have Cassandra 0.7 (nightly build from halfway September) running on one
> test machine with the avro interface. The node holds about 16mm values
> across 10k keys.
>
> As a simple test I ran 2 test queries from a client, one query where I ask
> for all columns for 100 keys and one query where I ask for all columns for one
> key (which I know to have a lot of columns). I am not using any buffering
> for columns. I ran the tests multiple times to make sure file caching on the
> server wouldn't mess up the comparison.
>
> Using a java client the results are:
>
> *** test1 ***
> running test get_range_slices
> 2.672 seconds.
> 100 keys
> 81849 total columns
>
> *** test2 ***
> running test multiget_slice
> 1.0 seconds.
> 1 keys
> 36626 total columns
>
> That's pretty impressive to me. I also later confirmed that with multiple
> nodes the query across multiple keys is much faster. Also using a client pool
> would probably speed it up more too.
>
> Then I ran a python client. The results are:
>
> *** test1 ***
> client:rpc get_range_slices
> client:rpc call took 30.6 seconds
> 100 keys
> 81849 total columns
>
> *** test2 ***
> client:rpc multiget_slice
> client:rpc call took 13.9 seconds
> 1 keys
> 36626 total columns
>
> So the python client took 11.4 times as long with the first query and 13.9
> times as long with the second query. That is a big difference! I suspect the
> avro deserialization is causing the slowdown (since the rpc call consists of
> contacting the server, retrieving the results and deserializing them). Has
> anyone seen a similar performance difference? This would mean that for a
> production system python avro is not acceptable to me at the moment.
>
> Both clients use only the avro library.
>
> Best, Koert



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com



encoding of values in cassandra

2010-11-10 Thread Koert Kuipers
Cassandra keys and values are just bytes. My values range from simple doubles 
to complex objects so I need to serialize them with something like avro, thrift 
or protobuf.

Since I am working in a test environment and cassandra is moving to avro I 
decided to use the avro protocol to communicate with cassandra (from python 
and java). So naturally I would also like to encode my values with avro (why 
have 2 serialization frameworks around?). However, avro needs to save the schema 
with the serialized values. This is considerable overhead (even if I just save 
pointers to schemas or something like that with the serialized values). It 
also seems complicated compared to thrift or protobuf, where one can just store 
values.

Did anyone find a neat solution to this? Or should I just use avro for 
communication and something like protobuf for value serialization?

Best, Koert
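
for what it's worth, one common way around the per-value overhead is to keep the
schemas out of band and tag each stored value with a small schema id. a rough
python sketch (the registry and the one-byte prefix are just an illustrative
convention, not anything avro or cassandra provide, and the avro calls assume the
1.x python package):

import io
import struct
import avro.schema
from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder

# schemas live in code (or a config file), keyed by a small numeric id
SCHEMAS = {
    1: avro.schema.parse('{"type": "record", "name": "Price", "fields": ['
                         '{"name": "value", "type": "double"}]}'),
}

def encode_value(schema_id, datum):
    # one prefix byte identifies the writer schema; the avro bytes themselves
    # carry no schema at all
    buf = io.BytesIO()
    buf.write(struct.pack("B", schema_id))
    DatumWriter(SCHEMAS[schema_id]).write(datum, BinaryEncoder(buf))
    return buf.getvalue()

def decode_value(data):
    schema_id = struct.unpack("B", data[:1])[0]
    schema = SCHEMAS[schema_id]
    return DatumReader(schema, schema).read(BinaryDecoder(io.BytesIO(data[1:])))

# encode_value(1, {"value": 140.72}) -> a handful of avro bytes plus the 1-byte prefix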