Re: Caching is a full row?

2010-04-14 Thread Sylvain Lebresne
Yes, it will put the whole row in cache even if you read only a handful
of columns.
In particular, with the row cache, every time you read a row the full row
will be read on a cache miss. It may therefore hurt your reads badly in some
scenarios (typically with big rows) instead of helping them. Enable row cache
wisely :)
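
(For reference: in 0.6 both caches are configured per column family in
storage-conf.xml. A minimal sketch, with attribute names as in the default
0.6 configuration and purely illustrative values - "Users" is a made-up
column family:

  <ColumnFamily Name="Users"
                CompareWith="BytesType"
                KeysCached="200000"
                RowsCached="1000"/>

RowsCached caches entire rows as described above; KeysCached only caches key
positions in the sstables and is much cheaper per entry.)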

--
Sylvain

On Wed, Apr 14, 2010 at 3:06 AM, Paul Prescod  wrote:
> On Tue, Apr 13, 2010 at 5:26 PM, Rob Coli  wrote:
>> On 4/13/10 5:04 PM, Paul Prescod wrote:
>>>
>>> Am I correct in my understanding that the unit of caching (and
>>> fetching from disk?) is a full row?
>>
>> Cassandra has both a Key and a Row cache. Unfortunately there appears to be
>> no current wiki doc describing them. If you are looking into the topic, wiki
>> updates are always appreciated. :)
>
> Yeah, sorry I knew that but I was asking specifically about if there
> was any caching within a row that is more granular than the whole row.
> It looks to me like it doesn't just cache the columns you asked for,
> but all columns in the row. This obviously has some interesting
> implications for keys that are large "indexes" in a single row.
>
>  Paul Prescod
>


History values

2010-04-14 Thread Yésica Rey
I am new to using Cassandra. From the documentation I have read, I
understand that, as in other non-document databases, when the value of a
key-value tuple is updated, the new value is stored with a different
timestamp, without the old value being entirely lost.
I wonder how I can restore the historic values that a particular field has
had.

Greetings and thanks


Re: History values

2010-04-14 Thread Benjamin Black
Values with newer timestamps completely replace the old values.  There
is no way to access historic values.

On Wed, Apr 14, 2010 at 12:34 AM, Yésica Rey  wrote:
> I am new to using cassandra. In the documentation I have read, understand,
> that as in other non-documentary databases, to update the value of a
> key-value tuple, this new value is stored with a timestamp different but
> without entirely losing the old value.
> I wonder, as I can restore the historic values that have had a particular
> field.
> Greetings and thanks
>


Re: History values

2010-04-14 Thread Sylvain Lebresne
> I am new to using cassandra. In the documentation I have read, understand,
> that as in other non-documentary databases, to update the value of a
> key-value tuple, this new value is stored with a timestamp different but
> without entirely losing the old value.
> I wonder, as I can restore the historic values that have had a particular
> field.

You can't. Upon update, the old value is lost.
From a technical standpoint, it is true that this old value is not
deleted (from disk) right away, but it is deleted eventually by
compaction (and you don't really control when the compactions occur).

--
Sylvain


Re: History values

2010-04-14 Thread Yésica Rey

Ok, thank you very much for your reply.
I have another question that may seem stupid... does Cassandra have a
graphical console, such as MySQL has for SQL databases?


Regards!


Re: History values

2010-04-14 Thread Bertil Chapuis
I'm also new to Cassandra, and on the same question I asked myself whether
using super columns with one key per version would be feasible. Are there
limitations to this use case (or better practices)?

Thank you and best regards,

Bertil Chapuis

On 14 April 2010 09:45, Sylvain Lebresne  wrote:

> > I am new to using cassandra. In the documentation I have read,
> understand,
> > that as in other non-documentary databases, to update the value of a
> > key-value tuple, this new value is stored with a timestamp different but
> > without entirely losing the old value.
> > I wonder, as I can restore the historic values that have had a particular
> > field.
>
> You can't. Upon update, the old value is lost.
> From a technical standpoint, it is true that this old value is not
> deleted (from disk)
> right away, but it is deleted eventually by compaction (and you don't
> really control
> when the compactions occur).
>
> --
> Sylvain
>


Re: New User: OSX vs. Debian on Cassandra 0.5.0 with Thrift

2010-04-14 Thread Zhiguo Zhang
Hi,

sorry I can't help you, but could you please tell me how you got the
charts in the attachment?
Thanks.

Mike

On Wed, Apr 14, 2010 at 6:38 AM, Heath Oderman  wrote:

> Hi,
>
> I wrote a few days ago and got a few good suggestions.  I'm still seeing
> dramatic differences between Cassandra 0.5.0 on OSX vs. Debian Linux.
>
> I've tried on Debian with the Sun JRE and the Open JDK with nearly
> identical results. I've tried a mix of hardware.
>
> Attached are some graphs I've produced of my results which show that in
> OSX, Cassandra takes longer with a greater load but is wicked fast
> (expected).
>
> In the SunJDK or Open JDK on Debian I get amazingly consistent time taken
> to do the writes, regardless of the load and the times are always
> ridiculously high.  It's insanely slow.
>
> I genuinely believe that I must be doing something very wrong in my Debian
> setups, but they are all vanilla installs, both 64 bit and 32 bit machines,
> 64bit and 32 bit installs.  Cassandra packs taken from
> http://www.apache.org/dist/cassandra/debian.
>
> I am using Thrift, and I'm using a c# client because that's how I intend to
> actually use Cassandra and it seems pretty sensible.
>
> An example of what I'm seeing is:
>
> 5 Threads Each writing 100,000 Simple Entries
> OSX: 1 min 16 seconds ~ 6515 Entries / second
> Debian: 1 hour 15 seconds ~ 138 Records / second
>
> 15 Threads Each writing 100,000 Simple Entries
> OSX: 2min 30 seconds seconds writing ~10,000 Entries / second
> Debian: 1 hour 1.5 minutes ~406 Entries / second
>
> 20 Threads Each Writing 100,000 Simple Entries
> OSX: 3min 19 seconds ~ 10,050 Entries / second
> Debian: 1 hour 20 seconds ~ 492 Entries / second
>
> If anyone has any suggestions or pointers I'd be glad to hear them.
> Thanks,
> Stu
>
> Attached:
> 1. CassLoadTesting.ods (all my results and graphs in OpenOffice format
> downloaded from Google Docs)
> 2. OSX Records per Second - a graph of how many entries get written per
> second for 10,000 & 100,000 entries as thread count is increased in OSX.
> 3. Open JDK Records per Second - the same graph but of Open JDK on Debian
> 4. Open JDK Total Time By Thread - the total time taken from test start to
> finish (all threads completed) to write 10,000 & 100,000 entries as thread
> count is increased in Debian with Open JDK
> 5. OSX Total time by Thread - same as 4, but for OSX.
>
>
>


Re: History values

2010-04-14 Thread Zhiguo Zhang
I think it is still too young; you have to wait, or write the "graphical
console" yourself. At least, I haven't found one so far.

On Wed, Apr 14, 2010 at 10:04 AM, Bertil Chapuis  wrote:

> I'm also new to cassandra and about the same question I asked me if using
> super columns with one key per version was feasible. Is there limitations to
> this use case (or better practices)?
>
> Thank you and best regards,
>
> Bertil Chapuis
>
> On 14 April 2010 09:45, Sylvain Lebresne  wrote:
>
>> > I am new to using cassandra. In the documentation I have read,
>> understand,
>> > that as in other non-documentary databases, to update the value of a
>> > key-value tuple, this new value is stored with a timestamp different but
>> > without entirely losing the old value.
>> > I wonder, as I can restore the historic values that have had a
>> particular
>> > field.
>>
>> You can't. Upon update, the old value is lost.
>> From a technical standpoint, it is true that this old value is not
>> deleted (from disk)
>> right away, but it is deleted eventually by compaction (and you don't
>> really control
>> when the compactions occur).
>>
>> --
>> Sylvain
>>
>
>


Lucandra or some way to query

2010-04-14 Thread Jesus Ibanez
Hello.

I need to know how to search in Cassandra. I could save the data in
different ways so that I can then retrieve it, for example like this:

get keyspace.users['123']
=> (column=name, value=John, timestamp=xx)

get keyspace.searchByName['John']
=> (column=userID, value=123, timestamp=xx)
=> (column=userID, value=456, timestamp=xx)
=> (column=userID, value=789, timestamp=xx)

This works, but it is very hard to maintain in the future, and the amount of
data increases exponentially. You can do it for some data, but if I have to
do it for each property I need to query on, I have doubts about whether this
is a good idea. But if you think it can work, I will do it.
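
A rough sketch of what maintaining such a hand-rolled index could look like
against the 0.6 Thrift Java API. The keyspace ("Keyspace1") and the column
family names ("Users", "UsersByName") are made-up placeholders, and the calls
are the plain generated Thrift client as I understand it, so treat this as
illustrative only:

  import org.apache.cassandra.thrift.Cassandra;
  import org.apache.cassandra.thrift.ColumnPath;
  import org.apache.cassandra.thrift.ConsistencyLevel;
  import org.apache.thrift.protocol.TBinaryProtocol;
  import org.apache.thrift.transport.TSocket;

  public class ManualIndexSketch {
      public static void main(String[] args) throws Exception {
          TSocket socket = new TSocket("localhost", 9160);
          Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
          socket.open();

          String userId = "123";
          String name = "John";
          long ts = System.currentTimeMillis() * 1000; // microsecond-style timestamp

          // Primary row: Users['123']['name'] = 'John'
          ColumnPath namePath = new ColumnPath("Users");
          namePath.setColumn("name".getBytes("UTF-8"));
          client.insert("Keyspace1", userId, namePath, name.getBytes("UTF-8"),
                        ts, ConsistencyLevel.QUORUM);

          // Index row: UsersByName['John']['123'] = ''  (column name = user id)
          ColumnPath indexPath = new ColumnPath("UsersByName");
          indexPath.setColumn(userId.getBytes("UTF-8"));
          client.insert("Keyspace1", name, indexPath, new byte[0],
                        ts, ConsistencyLevel.QUORUM);

          socket.close();
      }
  }

The cost is exactly the one described above: every write touches both rows,
and the application has to keep them in sync (there is no atomicity across
the two inserts).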

I read about Lucandra and it seems interesting, but I couldn't run the
examples. I don't know if it is a good idea to use it and, to be honest, I
don't know where to start.

I need to query a lot on my website, so what do you Cassandra users,
developers and testers recommend?
Option 1 - insert data in all different ways I need in order to be able to
query?
Option 2 - implement Lucandra? Can you link me to a blog or an article that
guides me on how to implement Lucandra?
Option 3 - switch to an SQL database? (I hope not).

Thanks in advance!

Jesus.


Re: History values

2010-04-14 Thread aXqd
On Wed, Apr 14, 2010 at 5:13 PM, Zhiguo Zhang  wrote:
> I think it is still to young, and have to wait or write your self the
> "graphical console", at least, I don't find any until now.

Frankly speaking, I'm OK with being without a GUI... but I am really
disappointed by those so-called 'documents'.
I would really prefer to have more documentation in real 'English', written
in a more tutorial style.
I hope I can write some texts once I have managed to understand the current
ones.

>
> On Wed, Apr 14, 2010 at 10:04 AM, Bertil Chapuis  wrote:
>>
>> I'm also new to cassandra and about the same question I asked me if using
>> super columns with one key per version was feasible. Is there limitations to
>> this use case (or better practices)?
>> Thank you and best regards,
>> Bertil Chapuis
>> On 14 April 2010 09:45, Sylvain Lebresne  wrote:
>>>
>>> > I am new to using cassandra. In the documentation I have read,
>>> > understand,
>>> > that as in other non-documentary databases, to update the value of a
>>> > key-value tuple, this new value is stored with a timestamp different
>>> > but
>>> > without entirely losing the old value.
>>> > I wonder, as I can restore the historic values that have had a
>>> > particular
>>> > field.
>>>
>>> You can't. Upon update, the old value is lost.
>>> From a technical standpoint, it is true that this old value is not
>>> deleted (from disk)
>>> right away, but it is deleted eventually by compaction (and you don't
>>> really control
>>> when the compactions occur).
>>>
>>> --
>>> Sylvain
>>
>
>


server crash - how to invertigate

2010-04-14 Thread Ran Tavory
I'm running a 0.6.0 cluster with four nodes and one of them just crashed.

The logs all seem normal and I haven't seen anything special in the jmx
counters before the crash.

I have one client writing and reading using 10 threads and using 3 different
column families: KvAds, KvImpressions and KvUsers

The client got a few UnavailableException, TimedOutException and
TTransportException errors but was able to complete the read/write operations
by failing over to another available host. I can't tell if the exceptions
came from the crashed host or from other hosts in the ring.

Any hints on how to investigate this are greatly appreciated. So far I'm
lost...

Here's a snippet from the log just before it went down. It doesn't seem to
have anything special in it, everything is INFO level.

The only thing that seems a bit strange is that last message: Compacting [].
This message usually comes with things inside the [], such as Compacting
[org.apache.cassandra.io.SSTableReader(path='/outbrain/cassdata/data/system/LocationInfo-1-Data.db'),...]
but this time it was just empty.
However, this is not the only place in the log where I see an empty
Compacting []. There are other places and they didn't end up in a crash, so
I don't know if it's related.

here's the log:
 INFO [ROW-MUTATION-STAGE:6] 2010-04-14 05:55:07,014 ColumnFamilyStore.java
(line 357) KvImpressions has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/outbrain/cassdata/commitlog/CommitLog-1271238432773.log',
position=68606651)
 INFO [ROW-MUTATION-STAGE:6] 2010-04-14 05:55:07,015 ColumnFamilyStore.java
(line 609) Enqueuing flush of Memtable(KvImpressions)@258729366
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:55:07,015 Memtable.java (line 148)
Writing Memtable(KvImpressions)@258729366
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:55:10,130 Memtable.java (line 162)
Completed flushing
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-24-Data.db
 INFO [COMMIT-LOG-WRITER] 2010-04-14 05:55:10,154 CommitLog.java (line 407)
Discarding obsolete commit
log:CommitLogSegment(/outbrain/cassdata/commitlog/CommitLog-1271238049425.log)
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,415
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-16-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,440
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-8-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,454
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-10-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,526
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-5-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,585
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-11-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,602
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-11-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,614
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-9-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,682
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-21-Data.db
 INFO [COMMIT-LOG-WRITER] 2010-04-14 05:55:52,254 CommitLogSegment.java
(line 50) Creating new commitlog segment
/outbrain/cassdata/commitlog/CommitLog-1271238952254.log
 INFO [ROW-MUTATION-STAGE:16] 2010-04-14 05:56:25,347 ColumnFamilyStore.java
(line 357) KvImpressions has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/outbrain/cassdata/commitlog/CommitLog-1271238952254.log',
position=47568158)
 INFO [ROW-MUTATION-STAGE:16] 2010-04-14 05:56:25,348 ColumnFamilyStore.java
(line 609) Enqueuing flush of Memtable(KvImpressions)@1955587316
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:56:25,348 Memtable.java (line 148)
Writing Memtable(KvImpressions)@1955587316
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:56:30,572 Memtable.java (line 162)
Completed flushing
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-25-Data.db
 INFO [COMMIT-LOG-WRITER] 2010-04-14 05:57:26,790 CommitLogSegment.java
(line 50) Creating new commitlog segment
/outbrain/cassdata/commitlog/CommitLog-1271239046790.log
 INFO [ROW-MUTATION-STAGE:7] 2010-04-14 05:57:59,513 ColumnFamilyStore.java
(line 357) KvImpressions has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/outbrain/cassdata/commitlog/CommitLog-1271239046790.log',
position=24265615)
 INFO [ROW-MUTATION-STAGE:7] 2010-04-14 05:57:59,513 ColumnFamilyStore.java
(line 609) Enqueuing flush of Memtable(KvImpressions)@1617250066
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:57:59,513 Memtable.java (line 148)
Writing Memtable(KvImpressions)@1617250066
 INFO [FLUSH-WRITER-POOL:1] 201

Re: RE : Re: RE : Re: Two dimensional matrices

2010-04-14 Thread Philippe
> > I'm confused: don't range queries such as the ones we've been
> > discussing require using an orderedpartitioner?
>
> Alright, so distribution depends on your choice of token.
>
Ah yes, I get it now: with a naive OrderPreservingPartitioner, the key is
assigned to the node whose token is numerically closest, and that is where
the "master" replica is located. Yes?

Now let's assume I am using super columns as {X} and columns as {timeFrame}.
In time each row will grow very large, because X can (very sparsely) go up to
2^28.
i) Does Cassandra load all columns every time it reads a row? Same question
for a super column.
ii) Similarly, does it cache all columns in memory?

Now for some orders of magnitude: let's say a row is about 20KB, the cluster
is running smoothly on low-end servers, and there are millions of rows per
node.
i) If I were to only issue gets on the key, what order of magnitude can I
expect to reach: 10/s, 100/s, 1000/s or 10,000/s?
ii) If I were to issue a slice on just the keys, does Cassandra optimize the
gets, or does it run every get on the server and then concatenate the results
to send to the client?
iii) Is slicing on the columns going to improve the time to get the data on
the server side, or does it just cut down on network traffic?

Thanks
Philippe


Re: Starting Cassandra Fauna

2010-04-14 Thread Jonathan Ellis
there are two "installing on centos" articles linked on
http://wiki.apache.org/cassandra/ArticlesAndPresentations

On Wed, Apr 14, 2010 at 1:28 AM, Nirmala Agadgar  wrote:
> Hi,
>
> Can anyone please list steps to install and run cassandra in centos.
> It can help me to follow and check where i missed and run correctly.
> Also, if i wanted to insert some data programmatically, where i need to do
> place the code in Fauna.Can anyone help me on this?
>
> On Mon, Apr 12, 2010 at 10:36 PM, Ryan King  wrote:
>>
>> I'm guessing you missed the ant ivy-retrieve step.
>>
>> We're planning on releasing a new gem today that should fix this issue.
>>
>> -ryan
>>
>> On Mon, Apr 12, 2010 at 3:30 AM, Nirmala Agadgar 
>> wrote:
>> > Hi,
>> >
>> > Yes, used only master.
>> > i downloaded  the tar file and placed in cassandra folder and run again
>> > cassandra_helper cassandra
>> > now i am getting
>> > Error: Exception thrown by the agent : java.net.MalformedURLException:
>> > Local
>> > host name
>> > when set hostname to localhost or 127.0.0.1
>> >  i get Exception in thread "main" java.lang.NoClassDefFoundError:
>> > org/apache/log4j/Logger
>> >     at
>> >
>> > org.apache.cassandra.thrift.CassandraDaemon.(CassandraDaemon.java:55)
>> > how to solve this?
>> > Can anyone tell steps to run cassandra or config to done?
>> >
>> > -
>> > Nirmala
>> >
>> >
>> > On Sat, Apr 10, 2010 at 10:48 PM, Jeff Hodges 
>> > wrote:
>> >>
>> >> Did you try master? We fixed this around the 7th, but haven't made a
>> >> release yet.
>> >> --
>> >> Jeff
>> >>
>> >> On Sat, Apr 10, 2010 at 10:10 AM, Nirmala Agadgar
>> >> 
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I tried to dig in problem and found
>> >> > 1) DIST_URL is pointed to
>> >> >
>> >> >
>> >> > http://apache.osuosl.org/incubator/cassandra/0.6.0/apache-cassandra-0.6.0-beta2-bin.tar.gz
>> >> > and it has no resource in it.( in Rakefile of  Cassandra Gem)
>> >> > DIST_URL =
>> >> >
>> >> >
>> >> > "http://apache.osuosl.org/incubator/cassandra/0.6.0/apache-cassandra-0.6.0-beta2-bin.tar.gz";
>> >> >
>> >> > 2) It does not executes after
>> >> >   sh "tar xzf #{DIST_FILE}"
>> >> >
>> >> > Can anyone help on this problem?
>> >> > Where the tar file should be downloaded?
>> >> >
>> >> >
>> >> > On Fri, Apr 9, 2010 at 3:28 AM, Jeff Hodges 
>> >> > wrote:
>> >> >>
>> >> >> While I wasn't able to reproduce the error, we did have another pop
>> >> >> up. I think I may have actually fixed your problem the other day.
>> >> >> Pull
>> >> >> the latest master from fauna/cassandra and you should be good to go.
>> >> >> --
>> >> >> Jeff
>> >> >>
>> >> >> On Thu, Apr 8, 2010 at 10:51 AM, Ryan King  wrote:
>> >> >> > Yeah, this is a known issue, we're working on it today.
>> >> >> >
>> >> >> > -ryan
>> >> >> >
>> >> >> > On Thu, Apr 8, 2010 at 10:31 AM, Jonathan Ellis
>> >> >> > 
>> >> >> > wrote:
>> >> >> >> Sounds like it's worth reporting on the github project then.
>> >> >> >>
>> >> >> >> On Thu, Apr 8, 2010 at 11:53 AM, Paul Prescod 
>> >> >> >> wrote:
>> >> >> >>> On Thu, Apr 8, 2010 at 9:49 AM, Jonathan Ellis
>> >> >> >>> 
>> >> >> >>> wrote:
>> >> >>  cassandra_helper does a bunch of magic to set things up.  looks
>> >> >>  like
>> >> >>  the "extract a private copy of cassandra 0.6 beta2" part of the
>> >> >>  magic
>> >> >>  is failing.  you'll probably need to manually attempt the
>> >> >>  un-tar
>> >> >>  to
>> >> >>  figure out why it is bailing.
>> >> >> >>>
>> >> >> >>> Yes, I had the same problem. I didn't dig into it, but perhaps
>> >> >> >>> all
>> >> >> >>> users have this problem now.
>> >> >> >>>
>> >> >> >>>  Paul Prescod
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>


Re: History values

2010-04-14 Thread Jonathan Ellis
The closest is http://github.com/driftx/chiton

On Wed, Apr 14, 2010 at 2:57 AM, Yésica Rey  wrote:
> Ok, thank you very much for your reply.
> I have another question may seem stupid ... Cassandra has a graphical
> console, such as mysql for SQL databases?
>
> Regards!
>


Time-series data model

2010-04-14 Thread Jean-Pierre Bergamin
Hello everyone

We are currently evaluating a new DB system (replacing MySQL) to store
massive amounts of time-series data. The data are various metrics from
various network and IT devices and systems. Metrics could be, for example,
CPU usage of the server "xy" in percent, memory usage of server "xy" in MB,
ping response time of server "foo" in milliseconds, network traffic of router
"bar" in MB/s, and so on. Different metrics can be collected for different
devices at different intervals.

The metrics are stored together with a timestamp. The queries we want to
perform are:
 * The last value of a specific metric of a device
 * The values of a specific metric of a device between two timestamps t1 and
t2

I stumbled across this blog post which describes a very similar setup with
Cassandra:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
This post gave me confidence that what we want is definitely doable with
Cassandra.

But since I'm just digging into columns and super-columns and their
families, I still have some problems understanding everything.

Our data model could look like this, in JSON-ish notation:
{
  "my_server_1": {
    "cpu_usage": {
      {ts: 1271248215, value: 87 },
      {ts: 1271248220, value: 34 },
      {ts: 1271248225, value: 23 },
      {ts: 1271248230, value: 49 }
    }
    "ping_response": {
      {ts: 1271248201, value: 0.345 },
      {ts: 1271248211, value: 0.423 },
      {ts: 1271248221, value: 0.311 },
      {ts: 1271248232, value: 0.582 }
    }
  }

  "my_server_2": {
    "cpu_usage": {
      {ts: 1271248215, value: 23 },
      ...
    }
    "disk_usage": {
      {ts: 1271243451, value: 123445 },
      ...
    }
  }

  "my_router_1": {
    "bytes_in": {
      {ts: 1271243451, value: 2452346 },
      ...
    }
    "bytes_out": {
      {ts: 1271243451, value: 13468 },
      ...
    }
    "errors": {
      {ts: 1271243451, value: 24 },
      ...
    }
  }
}

What I don't get is how to create the two-level hierarchy [device][metric].

Am I right that the devices would be kept in a super column family? The
ordering of those is not important.

But the metrics per device are also a super column, where the columns would
be the metric values ({ts: 1271243451, value: 24 }), aren't they?

So I'd need a super column in a super column... Hm.
My brain is definitely RDBMS-damaged and I don't see through columns and
super-columns yet. :-)

How could this be modeled in Cassandra?


Thank you very much
James




Re: Time-series data model

2010-04-14 Thread Zhiguo Zhang
First of all, I am a newbie with NoSQL. I'll try to write down my opinions
as a reference:

If I were you, I would use 2 column families:

1. CF, key is the device
2. CF, key is a timeuuid

What do you think about that?

Mike


On Wed, Apr 14, 2010 at 3:02 PM, Jean-Pierre Bergamin wrote:

> Hello everyone
>
> We are currently evaluating a new DB system (replacing MySQL) to store
> massive amounts of time-series data. The data are various metrics from
> various network and IT devices and systems. Metrics i.e. could be CPU usage
> of the server "xy" in percent, memory usage of server "xy" in MB, ping
> response time of server "foo" in milliseconds, network traffic of router
> "bar" in MB/s and so on. Different metrics can be collected for different
> devices in different intervals.
>
> The metrics are stored together with a timestamp. The queries we want to
> perform are:
>  * The last value of a specific metric of a device
>  * The values of a specific metric of a device between two timestamps t1
> and
> t2
>
> I stumbled across this blog post which describes a very similar setup with
> Cassandra:
> https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
> This post gave me confidence that what we want is definitively doable with
> Cassandra.
>
> But since I'm just digging into columns and super-columns and their
> families, I still have some problems understanding everything.
>
> Our data model could look in json'isch notation like this:
> {
> "my_server_1": {
>"cpu_usage": {
>{ts: 1271248215, value: 87 },
>{ts: 1271248220, value: 34 },
>{ts: 1271248225, value: 23 },
>{ts: 1271248230, value: 49 }
>}
>"ping_response": {
>{ts: 1271248201, value: 0.345 },
>{ts: 1271248211, value: 0.423 },
>{ts: 1271248221, value: 0.311 },
>{ts: 1271248232, value: 0.582 }
>}
> }
>
> "my_server_2": {
>"cpu_usage": {
>{ts: 1271248215, value: 23 },
>...
>}
>"disk_usage": {
>{ts: 1271243451, value: 123445 },
>...
>}
> }
>
> "my_router_1": {
>"bytes_in": {
>{ts: 1271243451, value: 2452346 },
>...
>}
>"bytes_out": {
>{ts: 1271243451, value: 13468 },
>...
>}
>"errors": {
>{ts: 1271243451, value: 24 },
>...
>}
> }
> }
>
> What I don't get is how to created the two level hierarchy
> [device][metric].
>
> Am I right that the devices would be kept in a super column family? The
> ordering of those is not important.
>
> But the metrics per device are also a super column, where the columns would
> be the metric values ({ts: 1271243451, value: 24 }), isn't it?
>
> So I'd need a super column in a super column... Hm.
> My brain is definitively RDBMS-damaged and I don't see through columns and
> super-columns yet. :-)
>
> How could this be modeled in Cassandra?
>
>
> Thank you very much
> James
>
>
>


Re: Time-series data model

2010-04-14 Thread Ted Zlatanov
On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin"  
wrote: 

JB> The metrics are stored together with a timestamp. The queries we want to
JB> perform are:
JB>  * The last value of a specific metric of a device
JB>  * The values of a specific metric of a device between two timestamps t1 and
JB> t2

Make your key "devicename-metricname-MMDD-HHMM" (with whatever time
sharding makes sense to you; I use UTC by-hours and by-day in my
environment).  Then your supercolumn is the collection time as a
LongType and your columns inside the supercolumn can express the metric
in detail (collector agent, detailed breakdown, etc.).

If you want your clients to discover the available metrics, you may need
to keep an external index.  But from your spec that doesn't seem necessary.

Ted
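
To make that layout concrete, a rough sketch of the "values of metric X of
device Y between t1 and t2" read against the 0.6 Thrift Java API. The
keyspace ("Keyspace1"), the super column family name ("Metrics") and the
exact shard suffix in the key are assumptions here, not part of the setup
described above:

  import java.nio.ByteBuffer;
  import java.util.List;
  import org.apache.cassandra.thrift.*;
  import org.apache.thrift.protocol.TBinaryProtocol;
  import org.apache.thrift.transport.TSocket;

  public class MetricSliceSketch {
      // One row per device + metric + time shard, as suggested above.
      static String rowKey(String device, String metric, String shard) {
          return device + "-" + metric + "-" + shard;
      }

      // LongType super column names are 8-byte big-endian longs.
      static byte[] longBytes(long v) {
          return ByteBuffer.allocate(8).putLong(v).array();
      }

      public static void main(String[] args) throws Exception {
          TSocket socket = new TSocket("localhost", 9160);
          Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
          socket.open();

          long t1 = 1271248215L, t2 = 1271248230L;   // the [t1, t2] window
          SlicePredicate window = new SlicePredicate();
          window.setSlice_range(new SliceRange(longBytes(t1), longBytes(t2), false, 1000));

          List<ColumnOrSuperColumn> result = client.get_slice(
                  "Keyspace1",
                  rowKey("my_server_1", "cpu_usage", "20100414-05"),
                  new ColumnParent("Metrics"),
                  window,
                  ConsistencyLevel.ONE);

          for (ColumnOrSuperColumn cosc : result) {
              long collectedAt = ByteBuffer.wrap(cosc.getSuper_column().getName()).getLong();
              System.out.println(collectedAt + ": "
                      + cosc.getSuper_column().getColumnsSize() + " detail columns");
          }
          socket.close();
      }
  }

The "last value of a metric" query is the same call with empty start/finish,
reversed=true and count=1.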



Re: Reading thousands of columns

2010-04-14 Thread Gautam Singaraju
Yes, I find that get_range_slices takes an incredibly long time to return
the results.
---
Gautam



On Tue, Apr 13, 2010 at 2:00 PM, James Golick  wrote:
> Hi All,
> I'm seeing about 35-50ms to read 1000 columns from a CF using
> get_range_slices. The columns are TimeUUIDType with empty values.
> The row cache is enabled and I'm running the query 500 times in a row, so I
> can only assume the row is cached.
> Is that about what's expected or am I doing something wrong? (It's from java
> this time, so it's not ruby thrift being slow).
> - James


Re: Time-series data model

2010-04-14 Thread alex kamil
James,

i'm a big fan of Cassandra, but have you looked at
http://en.wikipedia.org/wiki/RRDtool
is is natively built for this type of problem

Alex

On Wed, Apr 14, 2010 at 9:02 AM, Jean-Pierre Bergamin wrote:

> Hello everyone
>
> We are currently evaluating a new DB system (replacing MySQL) to store
> massive amounts of time-series data. The data are various metrics from
> various network and IT devices and systems. Metrics i.e. could be CPU usage
> of the server "xy" in percent, memory usage of server "xy" in MB, ping
> response time of server "foo" in milliseconds, network traffic of router
> "bar" in MB/s and so on. Different metrics can be collected for different
> devices in different intervals.
>
> The metrics are stored together with a timestamp. The queries we want to
> perform are:
>  * The last value of a specific metric of a device
>  * The values of a specific metric of a device between two timestamps t1
> and
> t2
>
> I stumbled across this blog post which describes a very similar setup with
> Cassandra:
> https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
> This post gave me confidence that what we want is definitively doable with
> Cassandra.
>
> But since I'm just digging into columns and super-columns and their
> families, I still have some problems understanding everything.
>
> Our data model could look in json'isch notation like this:
> {
> "my_server_1": {
>"cpu_usage": {
>{ts: 1271248215, value: 87 },
>{ts: 1271248220, value: 34 },
>{ts: 1271248225, value: 23 },
>{ts: 1271248230, value: 49 }
>}
>"ping_response": {
>{ts: 1271248201, value: 0.345 },
>{ts: 1271248211, value: 0.423 },
>{ts: 1271248221, value: 0.311 },
>{ts: 1271248232, value: 0.582 }
>}
> }
>
> "my_server_2": {
>"cpu_usage": {
>{ts: 1271248215, value: 23 },
>...
>}
>"disk_usage": {
>{ts: 1271243451, value: 123445 },
>...
>}
> }
>
> "my_router_1": {
>"bytes_in": {
>{ts: 1271243451, value: 2452346 },
>...
>}
>"bytes_out": {
>{ts: 1271243451, value: 13468 },
>...
>}
>"errors": {
>{ts: 1271243451, value: 24 },
>...
>}
> }
> }
>
> What I don't get is how to created the two level hierarchy
> [device][metric].
>
> Am I right that the devices would be kept in a super column family? The
> ordering of those is not important.
>
> But the metrics per device are also a super column, where the columns would
> be the metric values ({ts: 1271243451, value: 24 }), isn't it?
>
> So I'd need a super column in a super column... Hm.
> My brain is definitively RDBMS-damaged and I don't see through columns and
> super-columns yet. :-)
>
> How could this be modeled in Cassandra?
>
>
> Thank you very much
> James
>
>
>


Re: Reading thousands of columns

2010-04-14 Thread Jonathan Ellis
35-50ms for how many rows of 1000 columns each?

get_range_slices does not use the row cache, for the same reason that
oracle doesn't cache tuples from sequential scans -- blowing away
1000s of rows worth of recently used rows queried by key, for a swath
of rows from the scan, is the wrong call more often than it is the
right one.
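
To make the distinction concrete, a quick sketch against the 0.6 Thrift API
(the keyspace and column family names below are placeholders; only the access
pattern matters):

  import java.util.List;
  import org.apache.cassandra.thrift.*;

  class CacheVsScanSketch {
      // Reads up to 1000 columns of a single row by key; this path can be
      // served from the row cache.
      static List<ColumnOrSuperColumn> byKey(Cassandra.Client client, String key)
              throws Exception {
          SlicePredicate first1000 = new SlicePredicate();
          first1000.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));
          return client.get_slice("Keyspace1", key, new ColumnParent("Events"),
                                  first1000, ConsistencyLevel.ONE);
      }

      // Scans up to 100 rows; per the above, this path does not go through
      // the row cache.
      static List<KeySlice> byRange(Cassandra.Client client) throws Exception {
          SlicePredicate first1000 = new SlicePredicate();
          first1000.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));
          KeyRange everything = new KeyRange();
          everything.setStart_key("");
          everything.setEnd_key("");
          everything.setCount(100);
          return client.get_range_slices("Keyspace1", new ColumnParent("Events"),
                                         first1000, everything, ConsistencyLevel.ONE);
      }
  }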

On Tue, Apr 13, 2010 at 1:00 PM, James Golick  wrote:
> Hi All,
> I'm seeing about 35-50ms to read 1000 columns from a CF using
> get_range_slices. The columns are TimeUUIDType with empty values.
> The row cache is enabled and I'm running the query 500 times in a row, so I
> can only assume the row is cached.
> Is that about what's expected or am I doing something wrong? (It's from java
> this time, so it's not ruby thrift being slow).
> - James


Re: [RELEASE] 0.6.0

2010-04-14 Thread Ted Zlatanov
On Tue, 13 Apr 2010 15:54:39 -0500 Eric Evans  wrote: 

EE> I leaned into it. An updated package has been uploaded to the Cassandra
EE> repo (see: http://wiki.apache.org/cassandra/DebianPackaging).

Thank you for providing the release to the repository.

Can it support a non-root user through /etc/default/cassandra?  I've
been patching the init script myself but was hoping this would be
standard.

Thanks
Ted



KeysCached and sstable

2010-04-14 Thread Paul Prescod
The inline docs say:

   ~ The optional KeysCached attribute specifies
   ~ the number of keys per sstable whose locations we keep in
   ~ memory in "mostly LRU" order.

There are a few confusing bits in that sentence.

 1. Why is "keys per sstable" rather than "keys per column family". If
I have 7 SSTable files and I set KeysCached to 1, will I have
7 keys cached? If so, why? What is the logical relationship here?

 2. What makes the algorithm "mostly LRU" rather than just LRU?

 3. Is it accurate to say that the goal of the Key Cache is to avoid
looking through a bunch of SSTables' Bloom Filters? (how big do the
bloom filters grow to... too much to be cached themselves?)

I'd like to document the detail.

 Paul Prescod


Re: History values

2010-04-14 Thread Mike Gallamore

Hear, hear on documentation.

For example, there are Thrift examples in Python and Java. That is great, but
I've never coded in either (and am limited to Perl or C at work, because we
have 5 years' worth of code and experience with other modules provided for
those languages). So I'm stuck with whatever documentation the author of the
module I chose to use provides. That isn't always great, especially, in my
view, for Perl. Often you get 10 lines of example code and, if you are lucky
(and often you are not), a listing of the methods the module provides and
what they do. It seems that in Perl people often expect you to look through
the source to see if there is a method you should call to do something, which
I think is unreasonable (how do I know whether at start-up I'm supposed to
use new, connect, auto-connect etc. with no comments and no examples?). I'm
open to helping out with documentation, but my problem is that my learning
process is slow because there is little documentation, and once I figure
something out it was by trial and error, so I don't even know if the way I do
it is the right way - just that it works. Not ideal.

On 04/14/2010 03:09 AM, aXqd wrote:
> On Wed, Apr 14, 2010 at 5:13 PM, Zhiguo Zhang  wrote:
>> I think it is still to young, and have to wait or write your self the
>> "graphical console", at least, I don't find any until now.
>
> Frankly speaking, I'm OK to be without GUI...But I am really
> disappointed by those so-called 'documents'.
> I really prefer to have some more documents in real 'English' and in a
> more tutorial way.
> Hope I can write some texts after I managed to understand the current ones.
>
>> On Wed, Apr 14, 2010 at 10:04 AM, Bertil Chapuis  wrote:
>>> I'm also new to cassandra and about the same question I asked me if using
>>> super columns with one key per version was feasible. Is there limitations to
>>> this use case (or better practices)?
>>> Thank you and best regards,
>>> Bertil Chapuis
>>> On 14 April 2010 09:45, Sylvain Lebresne  wrote:
>>>> > I am new to using cassandra. In the documentation I have read,
>>>> > understand,
>>>> > that as in other non-documentary databases, to update the value of a
>>>> > key-value tuple, this new value is stored with a timestamp different
>>>> > but
>>>> > without entirely losing the old value.
>>>> > I wonder, as I can restore the historic values that have had a
>>>> > particular
>>>> > field.
>>>>
>>>> You can't. Upon update, the old value is lost.
>>>> From a technical standpoint, it is true that this old value is not
>>>> deleted (from disk)
>>>> right away, but it is deleted eventually by compaction (and you don't
>>>> really control
>>>> when the compactions occur).
>>>>
>>>> --
>>>> Sylvain




Re: Reading thousands of columns

2010-04-14 Thread James Golick
Right - that make sense. I'm only fetching one row. I'll give it a try with
get_slice().

Thanks,

-James

On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis  wrote:

> 35-50ms for how many rows of 1000 columns each?
>
> get_range_slices does not use the row cache, for the same reason that
> oracle doesn't cache tuples from sequential scans -- blowing away
> 1000s of rows worth of recently used rows queried by key, for a swath
> of rows from the scan, is the wrong call more often than it is the
> right one.
>
> On Tue, Apr 13, 2010 at 1:00 PM, James Golick 
> wrote:
> > Hi All,
> > I'm seeing about 35-50ms to read 1000 columns from a CF using
> > get_range_slices. The columns are TimeUUIDType with empty values.
> > The row cache is enabled and I'm running the query 500 times in a row, so
> I
> > can only assume the row is cached.
> > Is that about what's expected or am I doing something wrong? (It's from
> java
> > this time, so it's not ruby thrift being slow).
> > - James
>


Re: Reading thousands of columns

2010-04-14 Thread James Golick
That helped a little. But, it's still quite slow. Now, it's around 20-35ms
on average, sometimes as high as 70ms.

On Wed, Apr 14, 2010 at 8:50 AM, James Golick  wrote:

> Right - that make sense. I'm only fetching one row. I'll give it a try with
> get_slice().
>
> Thanks,
>
> -James
>
>
> On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis  wrote:
>
>> 35-50ms for how many rows of 1000 columns each?
>>
>> get_range_slices does not use the row cache, for the same reason that
>> oracle doesn't cache tuples from sequential scans -- blowing away
>> 1000s of rows worth of recently used rows queried by key, for a swath
>> of rows from the scan, is the wrong call more often than it is the
>> right one.
>>
>> On Tue, Apr 13, 2010 at 1:00 PM, James Golick 
>> wrote:
>> > Hi All,
>> > I'm seeing about 35-50ms to read 1000 columns from a CF using
>> > get_range_slices. The columns are TimeUUIDType with empty values.
>> > The row cache is enabled and I'm running the query 500 times in a row,
>> so I
>> > can only assume the row is cached.
>> > Is that about what's expected or am I doing something wrong? (It's from
>> java
>> > this time, so it's not ruby thrift being slow).
>> > - James
>>
>
>


Re: Lucandra or some way to query

2010-04-14 Thread Eric Evans
On Wed, 2010-04-14 at 06:45 -0300, Jesus Ibanez wrote:
> Option 1 - insert data in all different ways I need in order to be
> able to query?

Rolling your own indexes is fairly common with Cassandra.

> Option 2 - implement Lucandra? Can you link me to a blog or an article
> that guides me on how to implement Lucandra?

I would recommend you explore this route a little further. I've never
used Lucandra so I can't be of help, but the author is active. Have you
tried submitting an issue on the github project page?

> Option 3 - switch to an SQL database? (I hope not). 

If your requirements can be met with an SQL database, then sure, why
not?

-- 
Eric Evans
eev...@rackspace.com



Re: [RELEASE] 0.6.0

2010-04-14 Thread Eric Evans
On Wed, 2010-04-14 at 10:16 -0500, Ted Zlatanov wrote:
> Can it support a non-root user through /etc/default/cassandra?  I've
> been patching the init script myself but was hoping this would be
> standard. 

It's the first item on debian/TODO, but, you know, patches welcome and
all that.

-- 
Eric Evans
eev...@rackspace.com



Re: Reading thousands of columns

2010-04-14 Thread Mike Malone
On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis  wrote:

> 35-50ms for how many rows of 1000 columns each?
>
> get_range_slices does not use the row cache, for the same reason that
> oracle doesn't cache tuples from sequential scans -- blowing away
> 1000s of rows worth of recently used rows queried by key, for a swath
> of rows from the scan, is the wrong call more often than it is the
> right one.


Couldn't you cache a list of keys that were returned for the key range, then
cache individual rows separately or not at all?

By "blowing away rows queried by key" I'm guessing you mean "pushing them
out of the LRU cache," not explicitly blowing them away? Either way I'm not
entirely convinced. In my experience I've had pretty good success caching
items that were pulled out via more complicated join / range type queries.
If your system is doing lots of range queries, and not a lot of lookups by
key, you'd obviously see a performance win from caching the range queries.
Maybe range scan caching could be turned on separately?

Mike


Re: History values

2010-04-14 Thread Paul Prescod
If you want to use Cassandra, you should probably store each
historical value as a new column in the row.
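
For example (a sketch only; the column family name and comparator choice are
made up): with a column family compared as LongType, one row per field and
one column per version, the history is just a slice of the row:

  UserFieldHistory['user123:email']
  => (column=1271200000, value=old-address, timestamp=...)
  => (column=1271248215, value=new-address, timestamp=...)

A reversed slice with count=1 then returns only the latest version, and a
slice between two column names returns the history for that time window.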

On Wed, Apr 14, 2010 at 12:34 AM, Yésica Rey  wrote:
> I am new to using cassandra. In the documentation I have read, understand,
> that as in other non-documentary databases, to update the value of a
> key-value tuple, this new value is stored with a timestamp different but
> without entirely losing the old value.
> I wonder, as I can restore the historic values that have had a particular
> field.
> Greetings and thanks
>


Re: Reading thousands of columns

2010-04-14 Thread Paul Prescod
On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone  wrote:
> ...
>
> Couldn't you cache a list of keys that were returned for the key range, then
> cache individual rows separately or not at all?
> By "blowing away rows queried by key" I'm guessing you mean "pushing them
> out of the LRU cache," not explicitly blowing them away? Either way I'm not
> entirely convinced. In my experience I've had pretty good success caching
> items that were pulled out via more complicated join / range type queries.
> If your system is doing lots of range quereis, and not a lot of lookups by
> key, you'd obviously see a performance win from caching the range queries.
> Maybe range scan caching could be turned on separately?

I agree with you that the caches should be separate, if you're going
to cache ranges. You could imagine a single query (perhaps entered
interactively) replacing the entire row cache holding all of the data
for the system's interactive users. For example, a summary page of who
was most active over the last month could evict the profile
information for the actual users who are using the system at that
moment.

 Paul Prescod


Re: [RELEASE] 0.6.0

2010-04-14 Thread Ted Zlatanov
On Wed, 14 Apr 2010 12:23:19 -0500 Eric Evans  wrote: 

EE> On Wed, 2010-04-14 at 10:16 -0500, Ted Zlatanov wrote:
>> Can it support a non-root user through /etc/default/cassandra?  I've
>> been patching the init script myself but was hoping this would be
>> standard. 

EE> It's the first item on debian/TODO, but, you know, patches welcome and
EE> all that.

The appended patch has been sufficient for me.  I have to override the
PIDFILE too, but that's a system issue.  So my /etc/default/cassandra,
for example, is:

JAVA_HOME="/usr/lib/jvm/java-6-sun"
USER=cassandra
PIDFILE=/var/tmp/$NAME.pid

Ted

--- debian/init 2010-04-14 12:57:30.0 -0500
+++ /etc/init.d/cassandra   2010-04-14 13:00:25.0 -0500
@@ -21,6 +21,7 @@
 JSVC=/usr/bin/jsvc
 JVM_MAX_MEM="1G"
 JVM_START_MEM="128M"
+USER=root
 
 [ -e /usr/share/cassandra/apache-cassandra.jar ] || exit 0
 [ -e /etc/cassandra/storage-conf.xml ] || exit 0
@@ -75,6 +76,7 @@
 is_running && return 1
 
 $JSVC \
+-user $USER \
 -home $JAVA_HOME \
 -pidfile $PIDFILE \
 -errfile "&1" \



Re: Reading thousands of columns

2010-04-14 Thread James Golick
Just for the record, I am able to repeat this locally.

I'm seeing around 150ms to read 1000 columns from a row that has 3000 in it.
If I enable the rowcache, that goes down to about 90ms. According to my
profile, 90% of the time is being spent waiting for cassandra to respond, so
it's not thrift.

On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod  wrote:

> On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone  wrote:
> > ...
> >
> > Couldn't you cache a list of keys that were returned for the key range,
> then
> > cache individual rows separately or not at all?
> > By "blowing away rows queried by key" I'm guessing you mean "pushing them
> > out of the LRU cache," not explicitly blowing them away? Either way I'm
> not
> > entirely convinced. In my experience I've had pretty good success caching
> > items that were pulled out via more complicated join / range type
> queries.
> > If your system is doing lots of range quereis, and not a lot of lookups
> by
> > key, you'd obviously see a performance win from caching the range
> queries.
> > Maybe range scan caching could be turned on separately?
>
> I agree with you that the caches should be separate, if you're going
> to cache ranges. You could imagine a single query (perhaps entered
> interactively) would replace the entire row caching all of the data
> for the systems' interactive users. For example, a summary page of who
> is most over the last month active could replace the profile
> information for the actual users who are using the system at that
> moment.
>
>  Paul Prescod
>


Re: Reading thousands of columns

2010-04-14 Thread Avinash Lakshman
How large are the values? How much data on disk?

On Wednesday, April 14, 2010, James Golick  wrote:
> Just for the record, I am able to repeat this locally.
> I'm seeing around 150ms to read 1000 columns from a row that has 3000 in it. 
> If I enable the rowcache, that goes down to about 90ms. According to my 
> profile, 90% of the time is being spent waiting for cassandra to respond, so 
> it's not thrift.
>
> On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod  wrote:
>
> On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone  wrote:
>> ...
>>
>> Couldn't you cache a list of keys that were returned for the key range, then
>> cache individual rows separately or not at all?
>> By "blowing away rows queried by key" I'm guessing you mean "pushing them
>> out of the LRU cache," not explicitly blowing them away? Either way I'm not
>> entirely convinced. In my experience I've had pretty good success caching
>> items that were pulled out via more complicated join / range type queries.
>> If your system is doing lots of range quereis, and not a lot of lookups by
>> key, you'd obviously see a performance win from caching the range queries.
>> Maybe range scan caching could be turned on separately?
>
> I agree with you that the caches should be separate, if you're going
> to cache ranges. You could imagine a single query (perhaps entered
> interactively) would replace the entire row caching all of the data
> for the systems' interactive users. For example, a summary page of who
> is most over the last month active could replace the profile
> information for the actual users who are using the system at that
> moment.
>
>  Paul Prescod
>
>
>


Re: Reading thousands of columns

2010-04-14 Thread James Golick
The values are empty. It's 3000 UUIDs.

On Wed, Apr 14, 2010 at 12:40 PM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> How large are the values? How much data on disk?
>
> On Wednesday, April 14, 2010, James Golick  wrote:
> > Just for the record, I am able to repeat this locally.
> > I'm seeing around 150ms to read 1000 columns from a row that has 3000 in
> it. If I enable the rowcache, that goes down to about 90ms. According to my
> profile, 90% of the time is being spent waiting for cassandra to respond, so
> it's not thrift.
> >
> > On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod 
> wrote:
> >
> > On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone 
> wrote:
> >> ...
> >>
> >> Couldn't you cache a list of keys that were returned for the key range,
> then
> >> cache individual rows separately or not at all?
> >> By "blowing away rows queried by key" I'm guessing you mean "pushing
> them
> >> out of the LRU cache," not explicitly blowing them away? Either way I'm
> not
> >> entirely convinced. In my experience I've had pretty good success
> caching
> >> items that were pulled out via more complicated join / range type
> queries.
> >> If your system is doing lots of range quereis, and not a lot of lookups
> by
> >> key, you'd obviously see a performance win from caching the range
> queries.
> >> Maybe range scan caching could be turned on separately?
> >
> > I agree with you that the caches should be separate, if you're going
> > to cache ranges. You could imagine a single query (perhaps entered
> > interactively) would replace the entire row caching all of the data
> > for the systems' interactive users. For example, a summary page of who
> > is most over the last month active could replace the profile
> > information for the actual users who are using the system at that
> > moment.
> >
> >  Paul Prescod
> >
> >
> >
>


Re: Lucandra or some way to query

2010-04-14 Thread Jesus Ibanez
I will explore Lucandra a little more, and if I can't get it to work today, I
will go for Option 2.
Using SQL will not be efficient in the future if my website grows.

Thanks for your answer, Eric!
Thenks for your answer Eric!

Jesús.


2010/4/14 Eric Evans 

> On Wed, 2010-04-14 at 06:45 -0300, Jesus Ibanez wrote:
> > Option 1 - insert data in all different ways I need in order to be
> > able to query?
>
> Rolling your own indexes is fairly common with Cassandra.
>
> > Option 2 - implement Lucandra? Can you link me to a blog or an article
> > that guides me on how to implement Lucandra?
>
> I would recommend you explore this route a little further. I've never
> used Lucandra so I can't be of help, but the author is active. Have you
> tried submitting an issue on the github project page?
>
> > Option 3 - switch to an SQL database? (I hope not).
>
> If your requirements can be met with an SQL database, then sure, why
> not?
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: Lucandra or some way to query

2010-04-14 Thread Jake Luciani
Hi,

What doesn't work with lucandra exactly?  Feel free to msg me.

-Jake

On Wed, Apr 14, 2010 at 9:30 PM, Jesus Ibanez  wrote:

> I will explore Lucandra a little more and if I can't get it to work today,
> I will go for Option 2.
> Using SQL will not be efficient in the future, if my website grows.
>
> Thenks for your answer Eric!
>
> Jesús.
>
>
> 2010/4/14 Eric Evans 
>
> On Wed, 2010-04-14 at 06:45 -0300, Jesus Ibanez wrote:
>> > Option 1 - insert data in all different ways I need in order to be
>> > able to query?
>>
>> Rolling your own indexes is fairly common with Cassandra.
>>
>> > Option 2 - implement Lucandra? Can you link me to a blog or an article
>> > that guides me on how to implement Lucandra?
>>
>> I would recommend you explore this route a little further. I've never
>> used Lucandra so I can't be of help, but the author is active. Have you
>> tried submitting an issue on the github project page?
>>
>> > Option 3 - switch to an SQL database? (I hope not).
>>
>> If your requirements can be met with an SQL database, then sure, why
>> not?
>>
>> --
>> Eric Evans
>> eev...@rackspace.com
>>
>>
>


Did 0.6 break sstable2json? or am I missing something?

2010-04-14 Thread Chris Beaumont
I enjoy very much being able to quickly get a peek at my data once it is
stored, and so far sstable2json has been a great help...

I just completed switching from 0.5.1 to 0.6, and here is what I am getting now:
$ sstable2json Standard2-1-Index.db 
Exception in thread "main" java.lang.NullPointerException
at java.util.Arrays$ArrayList.(Arrays.java:3357)
at java.util.Arrays.asList(Arrays.java:3343)
at 
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:255)
at 
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:299)
at 
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:323)
at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:367)

On the other end, sstablekeys seems to still be doing the right thing:
$ sstablekeys Standard2-1-Index.db 
newone

This is on a Centos5 install with
$ java -version
java version "1.6.0_19"

Can anyone reproduce this? or am I missing something?

TIA

Chris.



  



Re: Did 0.6 break sstable2json? or am I missing something?

2010-04-14 Thread Brandon Williams
On Wed, Apr 14, 2010 at 11:53 AM, Chris Beaumont wrote:

> I enjoy very much being able to quickly get a peak at my data once stored,
> and so
> far sstable2json was a great help...
>
> I just completed switching from 0.5.1 to 0.6, and here is what I am getting
> now:
> $ sstable2json Standard2-1-Index.db
> Exception in thread "main" java.lang.NullPointerException
>

This was a regression introduced as part of
https://issues.apache.org/jira/browse/CASSANDRA-843 and is fixed in
https://issues.apache.org/jira/browse/CASSANDRA-934

Until 0.6.1 is released, you can work around the NPE by passing -x "" to
sstable2json.

-Brandon


Re: KeysCached and sstable

2010-04-14 Thread Jonathan Ellis
On Wed, Apr 14, 2010 at 10:23 AM, Paul Prescod  wrote:
> The inline docs say:
>
>       ~ The optional KeysCached attribute specifies
>       ~ the number of keys per sstable whose locations we keep in
>       ~ memory in "mostly LRU" order.
>
> There are a few confusing bits in that sentence.
>
>  1. Why is "keys per sstable" rather than "keys per column family". If
> I have 7 SSTable files and I set KeysCached to 1, will I have
> 7 keys cached? If so, why? What is the logical relationship here?

This is out of date, it's per CF now.

>  2. What makes the algorithm "mostly LRU" rather than just LRU?

it's called second-chance eviction, discussed at
http://code.google.com/p/concurrentlinkedhashmap/wiki/ProductionVersion

>  3. Is it accurate the say that the goal of the Key Cache is to avoid
> looking through a bunch off SSTable's Bloom Filters?

No.  it's to avoid deserializing the rows from the sstables and merging them.

-Jonathan


Re: Lucandra or some way to query

2010-04-14 Thread Zhuguo Shi
I think Lucandra is really a great idea, but since it needs
order-preserving-partitioner, does that mean there may be some 'hot-spot'
during searching?


Is that possible to write a file system over Cassandra?

2010-04-14 Thread Zhuguo Shi
Hi,

Cassandra has a good distributed model: decentralized, auto-partition,
auto-recovery. I am evaluating writing a file system over Cassandra
(like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
Cassandra is good for such a use case.

Regards


Re: Lucandra or some way to query

2010-04-14 Thread HubertChang

If you worked with Lucandra in a dedicated, search-purposed cluster, you
could balance the data very well with some effort.
>>I think Lucandra is really a great idea, but since it needs
order-preserving-partitioner, does that mean >>there may be some 'hot-spot'
during searching?
-- 
View this message in context: 
http://n2.nabble.com/Lucandra-or-some-way-to-query-tp4900727p4905149.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Tatu Saloranta
On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi  wrote:
> Hi,
> Cassandra has a good distributed model: decentralized, auto-partition,
> auto-recovery. I am evaluating about writing a file system over Cassandra
> (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
> Cassandra is good at such use case?

It sort of depends on what you are looking for. For use cases where
something like S3 is good, yes, except with one difference:
Cassandra is more geared towards lots of small files, whereas S3 is
more geared towards a moderate number of (possibly large) files.

So I think it can definitely be a good use case, and I may use
Cassandra for this myself in future. Having range queries allows
implementing directory/path structures (list keys using path as
prefix). And you can split storage such that metadata could live in
OPP partition, raw data in RP.

-+ Tatu +-
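
A small sketch of the "list keys using path as prefix" idea against the 0.6
Thrift API, assuming the order-preserving partitioner; the keyspace/CF names
and the crude '~' upper bound on the prefix are simplifications of my own:

  import java.util.List;
  import org.apache.cassandra.thrift.*;
  import org.apache.thrift.protocol.TBinaryProtocol;
  import org.apache.thrift.transport.TSocket;

  public class ListDirectorySketch {
      public static void main(String[] args) throws Exception {
          TSocket socket = new TSocket("localhost", 9160);
          Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
          socket.open();

          String prefix = "/photos/2010/";          // directory being listed
          KeyRange dir = new KeyRange();
          dir.setStart_key(prefix);
          dir.setEnd_key(prefix + "~");              // crude "everything under prefix" bound
          dir.setCount(1000);

          // We only need the keys here, so ask for at most one column per row.
          SlicePredicate namesOnly = new SlicePredicate();
          namesOnly.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1));

          List<KeySlice> entries = client.get_range_slices(
                  "Keyspace1", new ColumnParent("Files"), namesOnly, dir,
                  ConsistencyLevel.ONE);
          for (KeySlice entry : entries) {
              System.out.println(entry.getKey());    // full path of each entry
          }
          socket.close();
      }
  }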


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Ken Sandney
Large files can be split into small blocks, and the block size can be
tuned. It may increase the complexity of writing such a file system, but it
can then be general purpose (not only for relatively small files).

On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta wrote:

> On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi  wrote:
> > Hi,
> > Cassandra has a good distributed model: decentralized, auto-partition,
> > auto-recovery. I am evaluating about writing a file system over Cassandra
> > (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
> > Cassandra is good at such use case?
>
> It sort of depends on what you are looking for. From use case for
> which something like S3 is good, yes, except with one difference:
> Cassandra is more geared towards lots of small files, whereas S3 is
> more geared towards moderate number of files (possibly large).
>
> So I think it can definitely be a good use case, and I may use
> Cassandra for this myself in future. Having range queries allows
> implementing directory/path structures (list keys using path as
> prefix). And you can split storage such that metadata could live in
> OPP partition, raw data in RP.
>
> -+ Tatu +-
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Miguel Verde
On Wed, Apr 14, 2010 at 9:15 PM, Ken Sandney  wrote:

> Large files can be split into small blocks, and the size of block can be
> tuned. It may increase the complexity of writing such a file system, but can
> be for general purpose (not only for relative small files)


 Right, this is the path that MongoDB has taken with GridFS:
http://www.mongodb.org/display/DOCS/GridFS+Specification

I don't have any use for such a filesystem, but if I were to design one I
would probably mostly follow Tatu's suggestions:


>  On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta wrote:
>>
>> So I think it can definitely be a good use case, and I may use
>> Cassandra for this myself in future. Having range queries allows
>> implementing directory/path structures (list keys using path as
>> prefix). And you can split storage such that metadata could live in
>> OPP partition, raw data in RP.
>
>
but using OPP for all data, using prefixed metadata, and UUID_chunk# for
keys in the chunk CF.


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Avinash Lakshman
Exactly. You can split a file into blocks of any size and you can actually
distribute the metadata across a large set of machines. You wouldn't have
the issue of having small files in this approach. The issue may be the
eventual consistency - not sure that is a paradigm that would be acceptable
for a file system. But that is a discussion for another time/day.

Avinash

On Wed, Apr 14, 2010 at 7:15 PM, Ken Sandney  wrote:

> Large files can be split into small blocks, and the size of block can be
> tuned. It may increase the complexity of writing such a file system, but can
> be for general purpose (not only for relative small files)
>
>
> On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta wrote:
>
>> On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi  wrote:
>> > Hi,
>> > Cassandra has a good distributed model: decentralized, auto-partition,
>> > auto-recovery. I am evaluating about writing a file system over
>> Cassandra
>> > (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
>> > Cassandra is good at such use case?
>>
>> It sort of depends on what you are looking for. From use case for
>> which something like S3 is good, yes, except with one difference:
>> Cassandra is more geared towards lots of small files, whereas S3 is
>> more geared towards moderate number of files (possibly large).
>>
>> So I think it can definitely be a good use case, and I may use
>> Cassandra for this myself in future. Having range queries allows
>> implementing directory/path structures (list keys using path as
>> prefix). And you can split storage such that metadata could live in
>> OPP partition, raw data in RP.
>>
>> -+ Tatu +-
>>
>
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Avinash Lakshman
OPP is not required here. You would be better off using a Random partitioner
because you want to get a random distribution of the metadata.

Avinash

On Wed, Apr 14, 2010 at 7:25 PM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> Exactly. You can split a file into blocks of any size and you can actually
> distribute the metadata across a large set of machines. You wouldn't have
> the issue of having small files in this approach. The issue maybe the
> eventual consistency - not sure that is a paradigm that would be acceptable
> for a file system. But that is a discussion for another time/day.
>
> Avinash
>
> On Wed, Apr 14, 2010 at 7:15 PM, Ken Sandney  wrote:
>
>> Large files can be split into small blocks, and the size of block can be
>> tuned. It may increase the complexity of writing such a file system, but can
>> be for general purpose (not only for relative small files)
>>
>>
>> On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta wrote:
>>
>>> On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi  wrote:
>>> > Hi,
>>> > Cassandra has a good distributed model: decentralized, auto-partition,
>>> > auto-recovery. I am evaluating about writing a file system over
>>> Cassandra
>>> > (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
>>> > Cassandra is good at such use case?
>>>
>>> It sort of depends on what you are looking for. From use case for
>>> which something like S3 is good, yes, except with one difference:
>>> Cassandra is more geared towards lots of small files, whereas S3 is
>>> more geared towards moderate number of files (possibly large).
>>>
>>> So I think it can definitely be a good use case, and I may use
>>> Cassandra for this myself in future. Having range queries allows
>>> implementing directory/path structures (list keys using path as
>>> prefix). And you can split storage such that metadata could live in
>>> OPP partition, raw data in RP.
>>>
>>> -+ Tatu +-
>>>
>>
>>
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Miguel Verde
On Wed, Apr 14, 2010 at 9:26 PM, Avinash Lakshman <
avinash.laksh...@gmail.com> wrote:

> OPP is not required here. You would be better off using a Random
> partitioner because you want to get a random distribution of the metadata.


Not required, certainly.  However, it strikes me that 1 cluster is better
than 2, and most consumers of a filesystem would expect to be able to get an
ordered listing or tree of the metadata, which is easy using the OPP row key
pattern listed previously.  You could still do this with the Random
partitioner by using column names in rows to describe the structure, but the
current compaction limitations could be an issue if a branch becomes too
large, and you'd still have a root row hotspot (at least in the schema which
comes to mind).
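
For what it's worth, a minimal sketch of that Random-partitioner alternative
(one row per directory, one column per entry), with invented keyspace and
column family names ("FS", "Directories") and the 0.6-era Thrift API:

import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class ListDirectoryRow {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // A listing is a single get_slice on the directory's row, so no ordered
        // partitioner is needed; the row itself keeps its columns sorted.
        SlicePredicate all = new SlicePredicate();
        all.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));

        List<ColumnOrSuperColumn> entries = client.get_slice("FS", "/photos/2010",
            new ColumnParent("Directories"), all, ConsistencyLevel.ONE);
        for (ColumnOrSuperColumn entry : entries) {
            System.out.println(new String(entry.getColumn().getName(), "UTF-8"));
        }
        socket.close();
    }
}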


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Jeff Zhang
We are currently doing such things, and we are still at an early stage.

Currently we only plan to store small files. For large files, splitting into
small blocks is really one of our options.
You can check it out here: http://code.google.com/p/cassandra-fs/

Documentation for this project is lacking for now, but we still welcome any
feedback and contribution.



On Wed, Apr 14, 2010 at 7:32 PM, Miguel Verde wrote:

> On Wed, Apr 14, 2010 at 9:26 PM, Avinash Lakshman <
> avinash.laksh...@gmail.com> wrote:
>
>> OPP is not required here. You would be better off using a Random
>> partitioner because you want to get a random distribution of the metadata.
>
>
> Not required, certainly.  However, it strikes me that 1 cluster is better
> than 2, and most consumers of a filesystem would expect to be able to get an
> ordered listing or tree of the metadata which is easy using the OPP row key
> pattern listed previously.  You could still do this with the Random
> partitioner using column names in rows to describe the structure but the
> current compaction limitations could be an issue if a branch becomes too
> large, and you'd still have a root row hotspot (at least in the schema which
> comes to mind).
>



-- 
Best Regards

Jeff Zhang


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread HubertChang

Note: there are GlusterFS, Ceph, Btrfs and Lustre. There is also DRBD.
-- 
View this message in context: 
http://n2.nabble.com/Is-that-possible-to-write-a-file-system-over-Cassandra-tp4905111p4905312.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Jonathan Ellis
You forked Cassandra 0.5 for that?

That's... a strange way to do it.

On Wed, Apr 14, 2010 at 9:36 PM, Jeff Zhang  wrote:
> We are currently doing such things, and now we are still at the start stage.
> Currently we only plan to store small files. For large files, splitting to
> small blocks is really one of our options.
> You can check out from here http://code.google.com/p/cassandra-fs/
>
> Document for this project is lack now, but still welcome any feedback and
> contribution.
>
>
>
> On Wed, Apr 14, 2010 at 7:32 PM, Miguel Verde 
> wrote:
>>
>> On Wed, Apr 14, 2010 at 9:26 PM, Avinash Lakshman
>>  wrote:
>>>
>>> OPP is not required here. You would be better off using a Random
>>> partitioner because you want to get a random distribution of the metadata.
>>
>>
>> Not required, certainly.  However, it strikes me that 1 cluster is better
>> than 2, and most consumers of a filesystem would expect to be able to get an
>> ordered listing or tree of the metadata which is easy using the OPP row key
>> pattern listed previously.  You could still do this with the Random
>> partitioner using column names in rows to describe the structure but the
>> current compaction limitations could be an issue if a branch becomes too
>> large, and you'd still have a root row hotspot (at least in the schema which
>> comes to mind).
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Ken Sandney
A FUSE-based FS may be better, I guess.

On Thu, Apr 15, 2010 at 11:50 AM, Jonathan Ellis  wrote:

> You forked Cassandra 0.5 for that?
>
> That's... a strange way to do it.
>
> On Wed, Apr 14, 2010 at 9:36 PM, Jeff Zhang  wrote:
> > We are currently doing such things, and now we are still at the start
> stage.
> > Currently we only plan to store small files. For large files, splitting
> to
> > small blocks is really one of our options.
> > You can check out from here http://code.google.com/p/cassandra-fs/
> >
> > Document for this project is lack now, but still welcome any feedback and
> > contribution.
> >
> >
> >
> > On Wed, Apr 14, 2010 at 7:32 PM, Miguel Verde 
> > wrote:
> >>
> >> On Wed, Apr 14, 2010 at 9:26 PM, Avinash Lakshman
> >>  wrote:
> >>>
> >>> OPP is not required here. You would be better off using a Random
> >>> partitioner because you want to get a random distribution of the
> metadata.
> >>
> >>
> >> Not required, certainly.  However, it strikes me that 1 cluster is
> better
> >> than 2, and most consumers of a filesystem would expect to be able to
> get an
> >> ordered listing or tree of the metadata which is easy using the OPP row
> key
> >> pattern listed previously.  You could still do this with the Random
> >> partitioner using column names in rows to describe the structure but the
> >> current compaction limitations could be an issue if a branch becomes too
> >> large, and you'd still have a root row hotspot (at least in the schema
> which
> >> comes to mind).
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
> >
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Michael Greene
On Wed, Apr 14, 2010 at 11:01 PM, Ken Sandney  wrote:

>  a fuse based FS maybe better I guess


This has been done, for better or worse, by jdarcy of http://pl.atyp.us/:
http://github.com/jdarcy/CassFS


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Ken Sandney
I tried CassFS, but it's not stable yet; it may be a good prototype to start from.

On Thu, Apr 15, 2010 at 12:15 PM, Michael Greene
wrote:

> On Wed, Apr 14, 2010 at 11:01 PM, Ken Sandney  wrote:
>
>>  a fuse based FS maybe better I guess
>
>
> This has been done, for better or worse, by jdarcy of http://pl.atyp.us/:
> http://github.com/jdarcy/CassFS
>


Re: Starting Cassandra Fauna

2010-04-14 Thread Nirmala Agadgar
Hi,

I want to insert data into Cassandra programmatically in a loop.
Also, I'm a newbie to the Linux world and GitHub. I started to work on Linux
for the sole reason of implementing Cassandra, and have been digging into
Cassandra for the last week. How do I insert data into Cassandra and test it?
Can anyone help me out on this?

-
Nimala


Re: Starting Cassandra Fauna

2010-04-14 Thread richard yao
try this
https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP



On Thu, Apr 15, 2010 at 12:23 PM, Nirmala Agadgar wrote:

> Hi,
>
> I want to insert data into Cassandra programmatically in a loop.
> Also  i'm a newbie to Linux world and Github. Started to work on Linux  for
> only reason to implement Cassandra.Digging Cassandra for last on week.How to
> insert data in cassandra and test it?
> Can anyone help me out on this?
>
> -
> Nimala
>


Re: Starting Cassandra Fauna

2010-04-14 Thread Paul Prescod
There is a tutorial here:

 * http://www.sodeso.nl/?p=80

This page includes data inserts:

 * http://www.sodeso.nl/?p=251

Like:

c.setColumn(new Column("email".getBytes("utf-8"),
    "ronald (at) sodeso.nl".getBytes("utf-8"), timestamp));
columns.add(c);

The Sample code is attached to that blog post.
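
Since the question was specifically about inserting in a loop, here is a
minimal, self-contained sketch along the same lines against the 0.6-era Thrift
API. "Keyspace1" and "Standard1" are the names from the default
storage-conf.xml, and the keys and values are made up, so adjust to your schema:

import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class LoopInsert {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Write the same column ("email") for a range of row keys.
        ColumnPath path = new ColumnPath("Standard1");
        path.setColumn("email".getBytes("UTF-8"));

        for (int i = 0; i < 100; i++) {
            String key = "user" + i;
            byte[] value = ("user" + i + "@example.com").getBytes("UTF-8");
            client.insert("Keyspace1", key, path, value,
                          System.currentTimeMillis(), ConsistencyLevel.ONE);
        }
        socket.close();
    }
}

You can then verify the rows with cassandra-cli (get Keyspace1.Standard1['user0'])
or by reading them back through the same client.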

On Wed, Apr 14, 2010 at 9:23 PM, Nirmala Agadgar  wrote:
> Hi,
>
> I want to insert data into Cassandra programmatically in a loop.
> Also  i'm a newbie to Linux world and Github. Started to work on Linux  for
> only reason to implement Cassandra.Digging Cassandra for last on week.How to
> insert data in cassandra and test it?
> Can anyone help me out on this?
>
> -
> Nimala
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Tatu Saloranta
On Wed, Apr 14, 2010 at 7:26 PM, Avinash Lakshman
 wrote:
> OPP is not required here. You would be better off using a Random partitioner
> because you want to get a random distribution of the metadata.

Not for splitting, but for the actual file system hierarchy it would. How
else would you traverse the hierarchy? (list sub-directories, files)

As to splitting files, yes, it can be done, but I personally think that
would be asking for trouble because of the lack of atomicity for operations.
The exception being if the only operations would ever be appends.

-+ Tatu +-


Re: Starting Cassandra Fauna

2010-04-14 Thread Nirmala Agadgar
Hi,

I'm using the Ruby client as of now. Can you give details for the Ruby
client? Also, if possible, the Java client.
Thanks for reply.

-
Nirmala

On Thu, Apr 15, 2010 at 10:02 AM, richard yao wrote:

> try this
> https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP
>
>
>
>
> On Thu, Apr 15, 2010 at 12:23 PM, Nirmala Agadgar wrote:
>
>> Hi,
>>
>> I want to insert data into Cassandra programmatically in a loop.
>> Also  i'm a newbie to Linux world and Github. Started to work on Linux
>> for only reason to implement Cassandra.Digging Cassandra for last on
>> week.How to insert data in cassandra and test it?
>> Can anyone help me out on this?
>>
>> -
>> Nimala
>>
>
>


TException: Error: TSocket: timed out reading 1024 bytes from 10.1.1.27:9160

2010-04-14 Thread richard yao
I am trying out Cassandra, and I use PHP to access Cassandra via the Thrift
API.
I got an error like this:
TException:  Error: TSocket: timed out reading 1024 bytes from
10.1.1.27:9160
What's wrong?
Thanks.


Re: Lucandra or some way to query

2010-04-14 Thread Jake Luciani
Lucandra spreads the data randomly by index + field combination so you do
get "some" distribution for free. Otherwise you can use "nodetool
loadbalance" to alter the token ring to alleviate hotspots.

On Thu, Apr 15, 2010 at 2:04 AM, HubertChang  wrote:

>
> If you worked with Lucandra in a dedicated searching-purposed cluster, you
> could balanced the data very well with some effort.
> >>I think Lucandra is really a great idea, but since it needs
> order-preserving-partitioner, does that mean >>there may be some 'hot-spot'
> during searching?
> --
> View this message in context:
> http://n2.nabble.com/Lucandra-or-some-way-to-query-tp4900727p4905149.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


SuperColumns

2010-04-14 Thread Christian Torres
I'm defining a ColumnFamily (table) of type Super. Is it possible to have a
SuperColumn inside another SuperColumn, or can SuperColumns only have normal
columns?

-- 
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


Re: SuperColumns

2010-04-14 Thread Vijay
Yes, a super column can only have columns in it.
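
For illustration, a small self-contained sketch of the nesting (row key ->
super column -> sub-columns, and nothing deeper), using the 0.6-era Thrift API
with the "Keyspace1"/"Super1" names from the sample storage-conf.xml; the row
key and column contents are made up:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class SuperColumnExample {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        long ts = System.currentTimeMillis();
        // The sub-columns: this is the only level below a super column.
        List<Column> subColumns = Arrays.asList(
            new Column("email".getBytes("UTF-8"), "chris@example.com".getBytes("UTF-8"), ts),
            new Column("phone".getBytes("UTF-8"), "555-0100".getBytes("UTF-8"), ts));

        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setSuper_column(new SuperColumn("contact-1".getBytes("UTF-8"), subColumns));

        // Row "christian" -> super column "contact-1" -> plain columns email/phone.
        Map<String, List<ColumnOrSuperColumn>> cfmap =
            Collections.singletonMap("Super1", Collections.singletonList(cosc));
        client.batch_insert("Keyspace1", "christian", cfmap, ConsistencyLevel.ONE);

        socket.close();
    }
}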

Regards,




On Wed, Apr 14, 2010 at 10:28 PM, Christian Torres wrote:

> I'm defining a ColumnFamily (Table) type Super, It's posible to have a
> SuperColumn inside another SuperColumn or SuperColumns can only have normal
> columns?
>
> --
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming
>


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Vijay
What I did for one of our projects was similar: use a super column to store
file and directory metadata, and use another row (keyed by UUID) to store the
directory contents (files and subdirectories). We used UUIDs instead of paths
because there will be renames or moves, and we store the small files in
Cassandra.

We used an internally developed filesystem to store the big files which are
larger than x bytes. Locking is done using ZooKeeper and queuing by ZeroMQ.
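
A rough sketch of how I read that layout (very much my own guess at the
details, with invented names like "FS" and "DirContents" and a made-up root
UUID, against the 0.6-era Thrift API): each directory's contents row maps
entry names to child UUIDs, so a rename only rewrites one entry column:

import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class UuidDirectoryLookup {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Step 1: in the parent's contents row (keyed by the parent's UUID),
        // look up the entry named "photos" to get the child directory's UUID.
        String parentUuid = "f47ac10b-58cc-4372-a567-0e02b2c3d479";
        ColumnPath entryCol = new ColumnPath("DirContents");
        entryCol.setColumn("photos".getBytes("UTF-8"));
        ColumnOrSuperColumn entry = client.get("FS", parentUuid, entryCol,
                                               ConsistencyLevel.ONE);
        String childUuid = new String(entry.getColumn().getValue(), "UTF-8");

        // Step 2: list the child directory by slicing its own contents row.
        SlicePredicate all = new SlicePredicate();
        all.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));
        List<ColumnOrSuperColumn> children = client.get_slice("FS", childUuid,
            new ColumnParent("DirContents"), all, ConsistencyLevel.ONE);
        for (ColumnOrSuperColumn c : children) {
            System.out.println(new String(c.getColumn().getName(), "UTF-8"));
        }
        socket.close();
    }
}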

Regards,




On Wed, Apr 14, 2010 at 9:39 PM, Tatu Saloranta wrote:

> On Wed, Apr 14, 2010 at 7:26 PM, Avinash Lakshman
>  wrote:
> > OPP is not required here. You would be better off using a Random
> partitioner
> > because you want to get a random distribution of the metadata.
>
> Not for splitting, but for actual file system hierarchy it would. How
> else would you traverse hierarchy? (list sub-directiories, files)
>
> As to splitting files, yes, can be done, but I personally think that
> would be asking for trouble because of lack atomicity for operations.
> Exception being if only operations ever would be append.
>
> -+ Tatu +-
>