Re: compaction behaviour

2011-04-03 Thread Zhu Han
best regards,
Zhu Han



On Sun, Apr 3, 2011 at 9:21 AM, Anurag Gujral wrote:

> Hi All,
> I have loaded data into cassandra using batch processing. The
> response times for reads are in the range of 0.8 ms, but I am using SSDs, so
> I expect the read times to be even faster.
>

Does your working set fit in memory? If so, an SSD is not helpful for
reducing the latency, because reads are mostly served from the OS page cache.


> Every time I run compaction the latency numbers reduce to 0.3 to 0.4 ms.
> Is there a way I can run compaction once with some parameters
> so that I can get the same numbers, 0.3 to 0.4 ms, for reads?
> Please note that I am not loading the data again.
>
> Thanks
> Anurag
>


Re: urgent

2011-04-03 Thread aaron morton
Is this still a problem ? Are you getting errors on the server ?

It should be choosing the directory with the most space.  

btw, the recommended approach is to use a single large volume/directory for the 
data. 
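
For illustration, that means pointing data_file_directories in cassandra.yaml at
one big volume rather than several small ones (the paths below are just
placeholders):

    data_file_directories:
        - /var/lib/cassandra/data    # one large striped/LVM volume

instead of listing /data1, /data2 and /data3 separately.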

Aaron

On 2 Apr 2011, at 01:56, Anurag Gujral wrote:

> Hi All,
>   I have set up a cassandra cluster with three data directories, but
> cassandra is using only one of them and that disk is out of space.
> Why is cassandra not using all three data directories?
> 
> Plz Suggest.
> 
> Thanks
> Anurag



Re: Endless minor compactions after heavy inserts

2011-04-03 Thread aaron morton
With only one data file your reads would use the least amount of IO to find the 
data. 

Most people have multiple nodes and probably fewer disks, so each node may have 
a TB or two of data. How much capacity do your 10 disks give ? Will you be 
running multiple nodes in production ?

Aaron


 
On 2 Apr 2011, at 12:45, Sheng Chen wrote:

> Thank you very much.
> 
> The major compaction will merge everything into one big file, which would be 
> very large.
> Is there any way to control the number or size of files created by major 
> compaction?
> Or, is there a recommended number or size of files for cassandra to handle?
> 
> Thanks. I see the trigger of my minor compaction is OperationsInMillions. It 
> is a number of operations in total, which I thought was per second.
> 
> Cheers,
> Sheng
> 
> 
> 2011/4/1 aaron morton 
> If you are doing some sort of bulk load you can disable minor compactions by 
> setting the min_compaction_threshold and max_compaction_threshold to 0 . Then 
> once your insert is complete run a major compaction via nodetool before 
> turning the minor compaction back on.
> 
> You can also reduce the compaction threads priority, see 
> compaction_thread_priority in the yaml file.
> 
> The memtable will be flushed when either the MB or ops throughput is 
> triggered. If you are seeing a lot of memtables smaller than the MB threshold 
> then the ops threshold has probably been triggered. Look for a log message at 
> INFO level starting with "Enqueuing flush of Memtable" that will tell you how 
> many bytes and ops the memtable had when it was flushed. Try increasing 
> the ops threshold and see what happens.
> 
> Your change in the compaction threshold may not have an effect because 
> the compaction process was already running.
> 
> AFAIK the best way to get the most out of your 10 disks will be to use a 
> dedicated mirror for the commit log and a stripe set for the data.
> 
> Hope that helps.
> Aaron
> 
> On 1 Apr 2011, at 14:52, Sheng Chen wrote:
> 
> > I've got a single node of cassandra 0.7.4, and I used the java stress tool 
> > to insert about 100 million records.
> > The inserts took about 6 hours (45k inserts/sec) but the following minor 
> > compactions last for 2 days and the pending compaction jobs are still 
> > increasing.
> >
> > From jconsole I can read the MemtableThroughputInMB=1499, 
> > MemtableOperationsInMillions=7.0
> > But in my data directory, I got hundreds of 438MB data files, which should 
> > be the cause of the minor compactions.
> >
> > I tried to set the compaction threshold by nodetool, but it didn't seem to take 
> > effect (no change in pending compaction tasks).
> > After restarting the node, my setting is lost.
> >
> > I want to distribute the read load in my disks (10 disks in xfs, LVM), so I 
> > don't want to do a major compaction.
> > So, what can I do to keep the sstable file in a reasonable size, or to make 
> > the minor compactions faster?
> >
> > Thank you in advance.
> > Sheng
> >
> 
> 
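
As a rough sketch of the bulk-load sequence described in the quoted advice above
(nodetool syntax is from memory for 0.7.x, so double-check it against nodetool
help; Keyspace1/Standard1 are placeholder names and 4/32 are the default
thresholds):

    # disable minor compactions on the CF before the bulk load
    nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 0 0

    # ... run the bulk insert ...

    # one-off major compaction, then restore the defaults
    nodetool -h localhost compact Keyspace1 Standard1
    nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 4 32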



Re: Ditching Cassandra

2011-04-03 Thread Nico Guba
On 3/30/2011 1:11 AM, Gregori Schmidt wrote:
>
> * You need to have official client libraries and they need to be
>   programmer friendly.  Yes, I know there are nice people
>   maintaining a plethora of different libraries, but you need to
>   man up and face reality:  the chaos that is the Cassandra client
>   space is a horrible mess.
>

I wouldn't call it a horrible mess, but the learning curve for newcomers
can be quite steep.  That said, having a common, portable spec to
program to (i.e. Thrift) is a very good idea.

> * It is buggy and the solution seems to be to just go to the next
>   release.  And the next.  And the next.  Which would be okay if
>   you could upgrade all the time, but what to do once you hit
>   production?
>

It's a fair concern, but bugs are fixed in either upstream or minor
releases (or even *shudder* maintenance patches...).  That may be a
small price to pay when you consider the headaches of scaling out other
systems (and don't think for a moment that this will be problem-free).

> I would recommend that everyone interested in improving Cassandra take
> the day off,  download MongoDB and
> read https://github.com/karlseguin/the-little-mongodb-book . Then,
> while you are downloading, unpacking, looking at what was in the JAR,
> reading the book and pawing through the examples: _pay attention_ to
> the neatness, the effortlessness, the ease with which you can use 
> MongoDB.  Then spend the rest of the day implementing something on top
> of it to gain some hacking experience.

Arguably, the documentation is neat.  The scaling solution, however, is
not.  And that's the biggest headache.  Mongo should learn from
cassandra in this respect.

> No, really.  Do it.  This is important.  You need to connect with the
> user and you need to understand what you ought to be aspiring to.

Took your advice, done it.  You may be looking for an RDBMS replacement --
and mongo may be a good solution there, but the master/slave replication
setup puts me off.  That's a big no-no for us and probably for a lot of
people on this list.

Considering the grief of master/slave replication for scalability (oh,
how we have been bitten by this one over the years), I strongly applaud
a project like cassandra for stepping up to the challenge and propelling
the free software community into 21st-century scalability!

Yes, the documentation could be better (it always can be); yes, the
Cassandra book by O'Reilly has a HUGE amount of duplication (read:
unnecessary code / bad programming practice).  But the constructive thing
to do here is to:

1 - CONTRIBUTE to the documentation (I was unhappy with the Exim and
Windowmaker docs a looong time ago and my efforts did not go in vain)

2 - Direct your flames about the book to ora.com or amazon, where this sort
of feedback goes to the right channels, but don't blame the cassandra
project for the shortcomings of the book.  That's someone else's problem ;)

3 - enjoy MongoDB.  Let us know how it scales.  Every project can learn
from each other.

Happy Hacking!

-- 
=NPG=



Re: urgent

2011-04-03 Thread Anurag Gujral
Now it is using all three disks. I want to understand why the recommended
approach is to use
one single large volume/directory and not multiple ones; can you please
explain in detail?
I am using SSDs, and using three small ones is cheaper than using one large one.
Please suggest.
Thanks
Anurag

On Sun, Apr 3, 2011 at 7:31 AM, aaron morton wrote:

> Is this still a problem ? Are you getting errors on the server ?
>
> It should be choosing the directory with the most space.
>
> btw, the recommended approach is to use a single large volume/directory for
> the data.
>
> Aaron
>
> On 2 Apr 2011, at 01:56, Anurag Gujral wrote:
>
> > Hi All,
> >   I have set up a cassandra cluster with three data directories, but
> > cassandra is using only one of them and that disk is out of space.
> > Why is cassandra not using all three data directories?
> >
> > Plz Suggest.
> >
> > Thanks
> > Anurag
>
>


Re: Endless minor compactions after heavy inserts

2011-04-03 Thread Sheng Chen
I think if I can keep a single sstable file at a proper size, the hot
data/index files may be able to fit into memory, at least on some occasions.

In my use case, I want to use cassandra for storage of a large amount of log
data.
There will be multiple nodes, and each node has 10*2TB disks to hold as much
data as possible, ideally 20TB (about 100 billion rows) in one node.
Read operations will be much less frequent than writes. A read latency within
1 second is acceptable.

Is it possible? Do you have advice on this design?
Thank you.

Sheng



2011/4/3 aaron morton 

> With only one data file your reads would use the least amount of IO to find
> the data.
>
> Most people have multiple nodes and probably fewer disks, so each node may
> have a TB or two of data. How much capacity do your 10 disks give ? Will you
> be running multiple nodes in production ?
>
> Aaron
>
>
>
> On 2 Apr 2011, at 12:45, Sheng Chen wrote:
>
> Thank you very much.
>
> The major compaction will merge everything into one big file, which would
> be very large.
> Is there any way to control the number or size of files created by major
> compaction?
> Or, is there a recommended number or size of files for cassandra to handle?
>
> Thanks. I see the trigger of my minor compaction is OperationsInMillions.
> It is a number of operations in total, which I thought was per second.
>
> Cheers,
> Sheng
>
>
> 2011/4/1 aaron morton 
>
>> If you are doing some sort of bulk load you can disable minor compactions
>> by setting the min_compaction_threshold and max_compaction_threshold to 0 .
>> Then once your insert is complete run a major compaction via nodetool before
>> turning the minor compaction back on.
>>
>> You can also reduce the compaction threads priority, see
>> compaction_thread_priority in the yaml file.
>>
>> The memtable will be flushed when either the MB or ops throughput is
>> triggered. If you are seeing a lot of memtables smaller than the MB
>> threshold then the ops threshold has probably been triggered. Look for a log
>> message at INFO level starting with "Enqueuing flush of Memtable" that will
>> tell you how many bytes and ops the memtable had when it was flushed. Try
>> increasing the ops threshold and see what happens.
>>
>> Your change in the compaction threshold may not have an effect
>> because the compaction process was already running.
>>
>> AFAIK the best way to get the most out of your 10 disks will be to use a
>> dedicated mirror for the commit log and a stripe set for the data.
>>
>> Hope that helps.
>> Aaron
>>
>> On 1 Apr 2011, at 14:52, Sheng Chen wrote:
>>
>> > I've got a single node of cassandra 0.7.4, and I used the java stress
>> tool to insert about 100 million records.
>> > The inserts took about 6 hours (45k inserts/sec) but the following minor
>> compactions last for 2 days and the pending compaction jobs are still
>> increasing.
>> >
>> > From jconsole I can read the MemtableThroughputInMB=1499,
>> MemtableOperationsInMillions=7.0
>> > But in my data directory, I got hundreds of 438MB data files, which
>> should be the cause of the minor compactions.
>> >
>> > I tried to set the compaction threshold by nodetool, but it didn't seem to
>> > take effect (no change in pending compaction tasks).
>> > After restarting the node, my setting is lost.
>> >
>> > I want to distribute the read load in my disks (10 disks in xfs, LVM),
>> so I don't want to do a major compaction.
>> > So, what can I do to keep the sstable file in a reasonable size, or to
>> make the minor compactions faster?
>> >
>> > Thank you in advance.
>> > Sheng
>> >
>>
>>
>
>


Re: Bizarre side-effect of increasing read concurrency

2011-04-03 Thread Peter Schuller
> My Xmx and Xms are both 7.5GB. However, I never see the heap usage
> reach past 5.5. Think it is still a good idea to increase the heap?

Not necessarily. I thought you had a max heap of 5.5 GB, in which case a
live set of 4 GB after a completed CMS pass seemed pretty high.  It seems
more reasonable if the max heap is 7.5 GB.
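
For reference, on 0.7 the heap is pinned in conf/cassandra-env.sh, which turns
these two variables into -Xms/-Xmx/-Xmn; the numbers below are illustrative
(the 7500M matches the heap in this thread), not a recommendation:

    MAX_HEAP_SIZE="7500M"
    HEAP_NEWSIZE="800M"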

-- 
/ Peter Schuller


Re: Endless minor compactions after heavy inserts

2011-04-03 Thread Edward Capriolo
On Sun, Apr 3, 2011 at 1:46 PM, Sheng Chen  wrote:
> I think if I can keep a single sstable file at a proper size, the hot
> data/index files may be able to fit into memory, at least on some occasions.
>
> In my use case, I want to use cassandra for storage of a large amount of log
> data.
> There will be multiple nodes, and each node has 10*2TB disks to hold as much
> data as possible, ideally 20TB (about 100 billion rows) in one node.
> Read operations will be much less frequent than writes. A read latency within
> 1 second is acceptable.
>
> Is it possible? Do you have advice on this design?
> Thank you.
>
> Sheng
>
>
>
> 2011/4/3 aaron morton 
>>
>> With only one data file your reads would use the least amount of IO to
>> find the data.
>> Most people have multiple nodes and probably fewer disks, so each node may
>> have a TB or two of data. How much capacity do your 10 disks give ? Will you
>> be running multiple nodes in production ?
>> Aaron
>>
>>
>> On 2 Apr 2011, at 12:45, Sheng Chen wrote:
>>
>> Thank you very much.
>> The major compaction will merge everything into one big file, which would
>> be very large.
>> Is there any way to control the number or size of files created by major
>> compaction?
>> Or, is there a recommended number or size of files for cassandra to
>> handle?
>> Thanks. I see the trigger of my minor compaction is OperationsInMillions.
>> It is a number of operations in total, which I thought was per second.
>> Cheers,
>> Sheng
>>
>> 2011/4/1 aaron morton 
>>>
>>> If you are doing some sort of bulk load you can disable minor compactions
>>> by setting the min_compaction_threshold and max_compaction_threshold to 0 .
>>> Then once your insert is complete run a major compaction via nodetool before
>>> turning the minor compaction back on.
>>>
>>> You can also reduce the compaction threads priority, see
>>> compaction_thread_priority in the yaml file.
>>>
>>> The memtable will be flushed when either the MB or ops throughput is
>>> triggered. If you are seeing a lot of memtables smaller than the MB
>>> threshold then the ops threshold has probably been triggered. Look for a log
>>> message at INFO level starting with "Enqueuing flush of Memtable" that will
>>> tell you how many bytes and ops the memtable had when it was flushed. Try
>>> increasing the ops threshold and see what happens.
>>>
>>> Your change in the compaction threshold may not have an effect
>>> because the compaction process was already running.
>>>
>>> AFAIK the best way to get the most out of your 10 disks will be to use a
>>> dedicated mirror for the commit log and a stripe set for the data.
>>>
>>> Hope that helps.
>>> Aaron
>>>
>>> On 1 Apr 2011, at 14:52, Sheng Chen wrote:
>>>
>>> > I've got a single node of cassandra 0.7.4, and I used the java stress
>>> > tool to insert about 100 million records.
>>> > The inserts took about 6 hours (45k inserts/sec) but the following
>>> > minor compactions last for 2 days and the pending compaction jobs are 
>>> > still
>>> > increasing.
>>> >
>>> > From jconsole I can read the MemtableThroughputInMB=1499,
>>> > MemtableOperationsInMillions=7.0
>>> > But in my data directory, I got hundreds of 438MB data files, which
>>> > should be the cause of the minor compactions.
>>> >
>>> > I tried to set the compaction threshold by nodetool, but it didn't seem to
>>> > take effect (no change in pending compaction tasks).
>>> > After restarting the node, my setting is lost.
>>> >
>>> > I want to distribute the read load in my disks (10 disks in xfs, LVM),
>>> > so I don't want to do a major compaction.
>>> > So, what can I do to keep the sstable file in a reasonable size, or to
>>> > make the minor compactions faster?
>>> >
>>> > Thank you in advance.
>>> > Sheng
>>> >
>>>
>>
>>
>
>

Consider implications of
http://wiki.apache.org/cassandra/LargeDataSetConsiderations


Re: urgent

2011-04-03 Thread shimi
How did you solve it?

On Sun, Apr 3, 2011 at 7:32 PM, Anurag Gujral wrote:

> Now it is using all three disks. I want to understand why the recommended
> approach is to use
> one single large volume/directory and not multiple ones; can you please
> explain in detail?
> I am using SSDs, and using three small ones is cheaper than using one large
> one.
> Please suggest.
> Thanks
> Anurag
>
>
> On Sun, Apr 3, 2011 at 7:31 AM, aaron morton wrote:
>
>> Is this still a problem ? Are you getting errors on the server ?
>>
>> It should be choosing the directory with the most space.
>>
>> btw, the recommended approach is to use a single large volume/directory
>> for the data.
>>
>> Aaron
>>
>> On 2 Apr 2011, at 01:56, Anurag Gujral wrote:
>>
>> > Hi All,
>> >   I have set up a cassandra cluster with three data directories, but
>> > cassandra is using only one of them and that disk is out of space.
>> > Why is cassandra not using all three data directories?
>> >
>> > Plz Suggest.
>> >
>> > Thanks
>> > Anurag
>>
>>
>


change row cache size in cassandra

2011-04-03 Thread Anurag Gujral
Hi All,
 How can I change the row cache size in cassandra? I could not find
any documentation on this.
Thanks
Anurag


Secondary Indexes

2011-04-03 Thread Drew Kutcharian
Hi Everyone,

I posted the following email a couple of days ago and I didn't get any 
responses. Makes me wonder, does anyone on this list know/use Secondary 
Indexes? They seem to me like a pretty big feature and it's a bit disappointing 
to not be able to get any documentation on it.

The only thing I could find on the Wiki was the end of 
http://wiki.apache.org/cassandra/StorageConfiguration and that was pointing to 
the non-existing page http://wiki.apache.org/cassandra/SecondaryIndexes . In 
addition, I checked the JIRA CASSANDRA-749 and there's a lot of back and forth, 
and I couldn't really figure out what the conclusion was. What gives?

I think the Cassandra committers are doing a heck of a job adding all these 
cool functionalities but the documenting side doesn't really keep up. Jonathan 
Ellis's blog post on Secondary Indexes only scratches the surface of the topic, 
and if you consider that the whole point of using Cassandra is scalability, 
there isn't a single mention of how Secondary Indexes scale!!! (This same thing 
applies to Counters too)

I'm not trying to be a complainer, but as someone new to this community, I hope 
you guys take my comments as productive criticism.

Thanks,

Drew


[ORIGINAL POST]

I just read Jonathan Ellis' great post on Secondary Indexes 
(http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes) and 
I was wondering where I can find a bit more info on them. I would like to know:

1) Are there any limitations besides the hash properties (no between queries)? 
Like size or memory, etc?

2) Are they distributed? If so, how does that work? How are they stored on 
the nodes?

3) When you write a new row, when/how does the index get updated? What I would 
like to know is the atomicity of the operation, is the "index write" part of 
the "row write"?

4) Is there a difference between creating a secondary index vs creating an 
"index" CF manually such as "users_by_country"? 



NullPointerException with 0.7.4

2011-04-03 Thread Donal Zang

Hi,

I'm doing a stress test, and cassandra crashed with this Exception:
ERROR [MutationStage:9] 2011-04-03 21:11:50,152 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
java.lang.NullPointerException
    at org.apache.cassandra.io.sstable.IndexSummary$KeyPosition.compareTo(IndexSummary.java:100)
    at org.apache.cassandra.io.sstable.IndexSummary$KeyPosition.compareTo(IndexSummary.java:87)
    at java.util.Collections.indexedBinarySearch(Collections.java:232)
    at java.util.Collections.binarySearch(Collections.java:218)
    at org.apache.cassandra.io.sstable.SSTableReader.getIndexScanPosition(SSTableReader.java:333)
    at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:459)
    at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:563)
    at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:61)
    at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:58)
    at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1353)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1245)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1173)
    at org.apache.cassandra.db.Table.readCurrentIndexedColumns(Table.java:459)
    at org.apache.cassandra.db.Table.apply(Table.java:394)
    at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:76)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)

--

Donal Zang
CERN PH-ADP-DDM 40-3-D16
CH-1211 Geneve 23
donal.z...@cern.ch
+41 22 76 71268





Re: compaction behaviour

2011-04-03 Thread Anurag Gujral
Hi Zhu,
 I did not get that. SSDs have a read latency of 0.1 ms. Since there
is only one data file,
I would expect the read of any key to take 0.1 ms; maybe I am missing
something,
please explain.
Thanks
Anurag

On Sun, Apr 3, 2011 at 5:01 AM, Zhu Han  wrote:

>
> best regards,
> Zhu Han
>
>
>
> On Sun, Apr 3, 2011 at 9:21 AM, Anurag Gujral wrote:
>
>> Hi All,
>> I have loaded data into cassandra using batch processing. The
>> response times for reads are in the range of 0.8 ms, but I am using SSDs, so
>> I expect the read times to be even faster.
>>
>
> Does your working set fit in memory? If so, an SSD is not helpful for
> reducing the latency, because reads are mostly served from the OS page cache.
>
>
>> Every time I run compaction the latency numbers reduce to 0.3 to 0.4 ms.
>> Is there a way I can run compaction once with some parameters
>> so that I can get the same numbers, 0.3 to 0.4 ms, for reads?
>> Please note that I am not loading the data again.
>>
>> Thanks
>> Anurag
>>
>
>


Re: change row cache size in cassandra

2011-04-03 Thread Anurag Gujral
Hi All,
I looked at nodetool; there is an option to change the cache
sizes.
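
For example, something along these lines (syntax from memory for 0.7, so check
nodetool help; Keyspace1/Standard1 and the numbers are placeholders):

    nodetool -h localhost setcachecapacity Keyspace1 Standard1 200000 100000

where the last two arguments are the key cache and row cache capacities. To make
the change survive a restart it has to go into the column family definition
(keys_cached / rows_cached), e.g. via the CLI's "update column family".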
Thanks
Anurag

On Sun, Apr 3, 2011 at 12:25 PM, Anurag Gujral wrote:

> Hi All,
>  How can I change the row cache size in cassandra? I could not find
> any documentation on this.
> Thanks
> Anurag
>


Re: Secondary Indexes

2011-04-03 Thread Tyler Hobbs
I'm not familiar with some of the details, but I'll try to answer your
questions in general.  Secondary indexes are implemented as a slightly
special separate column family with the indexed value serving as the key;
most of the properties of secondary indexes follow from that.

On Sun, Apr 3, 2011 at 2:28 PM, Drew Kutcharian  wrote:

> Hi Everyone,
>
> I posted the following email a couple of days ago and I didn't get any
> responses. Makes me wonder, does anyone on this list know/use Secondary
> Indexes? They seem to me like a pretty big feature and it's a bit
> disappointing to not be able to get any documentation on it.
>
> The only thing I could find on the Wiki was the end of
> http://wiki.apache.org/cassandra/StorageConfiguration and that was
> pointing to the non-existing page
> http://wiki.apache.org/cassandra/SecondaryIndexes . In addition, I checked
> the JIRA CASSANDRA-749 and there's a lot of back and forth, and I couldn't
> really figure out what the conclusion was. What gives?
>
> I think the Cassandra committers are doing a heck of a job adding all these
> cool functionalities but the documenting side doesn't really keep
> up. Jonathan Ellis's blog post on Secondary Indexes only scratches the
> surface of the topic, and if you consider that the whole point of using
> Cassandra is scalability, there isn't a single mention of how Secondary
> Indexes scale!!! (This same thing applies to Counters too)
>
> I'm not trying to be a complainer, but as someone new to this community, I
> hope you guys take my comments as productive criticism.
>
> Thanks,
>
> Drew
>
>
> [ORIGINAL POST]
>
> I just read Jonathan Ellis' great post on Secondary Indexes
> (http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes)
> and I was wondering where I can find a bit more info on them. I would
> like to know:
>
> 1) Are there any limitations besides the hash properties (no between
> queries)? Like size or memory, etc?
>

No.


> 2) Are they distributed? If so, how does that work? How are they stored
> on the nodes?
>

Each node only indexes data that it holds locally.


> 3) When you write a new row, when/how does the index get updated? What I
> would like to know is the atomicity of the operation, is the "index write"
> part of the "row write"?
>

The row and index updates are one atomic operation.


> 4) Is there a difference between creating a secondary index vs creating an
> "index" CF manually such as "users_by_country"?
>
>

Yes.  First, when creating your own index, a node may index data held by
another node.  Second, updates to the index and data are not atomic.

Your feedback is certainly helpful and hopefully we can get some of these
details into the documentation!
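
To make the query side concrete, here is a rough raw-Thrift sketch against a
KEYS-indexed column on 0.7 (the keyspace, column family and column names are
made up for illustration, and most client libraries wrap this for you):

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class IndexQuerySketch
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder cluster details; assumes CF "Users" has a KEYS index on "country".
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("Keyspace1");

        // WHERE country = 'US' -- at least one EQ expression on an indexed column is required.
        IndexExpression expr = new IndexExpression(
                ByteBuffer.wrap("country".getBytes("UTF-8")),
                IndexOperator.EQ,
                ByteBuffer.wrap("US".getBytes("UTF-8")));
        IndexClause clause = new IndexClause();
        clause.addToExpressions(expr);
        clause.setStart_key(ByteBuffer.wrap(new byte[0]));  // start from the beginning
        clause.setCount(100);                               // page size in rows

        // Return up to 100 columns of each matching row.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 100));

        List<KeySlice> rows = client.get_indexed_slices(
                new ColumnParent("Users"), clause, predicate, ConsistencyLevel.ONE);
        System.out.println("matched " + rows.size() + " rows");

        transport.close();
    }
}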

-- 
Tyler Hobbs
Software Engineer, DataStax 
Maintainer of the pycassa  Cassandra
Python client library


Re: Secondary Indexes

2011-04-03 Thread Drew Kutcharian
Thanks Tyler. Can you update the wiki with these answers so they are stored 
there for others to see too?

On Apr 3, 2011, at 12:51 PM, Tyler Hobbs  wrote:

> I'm not familiar with some of the details, but I'll try to answer your 
> questions in general.  Secondary indexes are implemented as a slightly 
> special separate column family with the indexed value serving as the key; 
> most of the properties of secondary indexes follow from that.
> 
> On Sun, Apr 3, 2011 at 2:28 PM, Drew Kutcharian  wrote:
> Hi Everyone,
> 
> I posted the following email a couple of days ago and I didn't get any 
> responses. Makes me wonder, does anyone on this list know/use Secondary 
> Indexes? They seem to me like a pretty big feature and it's a bit 
> disappointing to not be able to get any documentation on it.
> 
> The only thing I could find on the Wiki was the end of 
> http://wiki.apache.org/cassandra/StorageConfiguration and that was pointing 
> to the non-existing page http://wiki.apache.org/cassandra/SecondaryIndexes . 
> In addition, I checked the JIRA CASSANDRA-749 and there's a lot of back and 
> forth, and I couldn't really figure out what the conclusion was. What gives?
> 
> I think the Cassandra committers are doing a heck of a job adding all these 
> cool functionalities but the documenting side doesn't really keep up. 
> Jonathan Ellis's blog post on Secondary Indexes only scratches the surface of 
> the topic, and if you consider that the whole point of using Cassandra is 
> scalability, there isn't a single mention of how Secondary Indexes scale!!! 
> (This same thing applies to Counters too)
> 
> I'm not trying to be a complainer, but as someone new to this community, I 
> hope you guys take my comments as productive criticism.
> 
> Thanks,
> 
> Drew
> 
> 
> [ORIGINAL POST]
> 
> I just read Jonathan Ellis' great post on Secondary Indexes 
> (http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes) 
> and I was wondering where I can find a bit more info on them. I would like to 
> know:
> 
> 1) Are there any limitations besides the hash properties (no between queries)? 
> Like size or memory, etc? 
>  
> No.
>  
> 
> 2) Are they distributed? If so, how does that work? How are they stored on 
> the nodes?
> 
> Each node only indexes data that it holds locally.
>  
> 
> 3) When you write a new row, when/how does the index get updated? What I 
> would like to know is the atomicity of the operation, is the "index write" 
> part of the "row write"?
> 
> The row and index updates are one atomic operation.
>  
> 
> 4) Is there a difference between creating a secondary index vs creating an 
> "index" CF manually such as "users_by_country"? 
> 
> 
> Yes.  First, when creating your own index, a node may index data held by 
> another node.  Second, updates to the index and data are not atomic.
> 
> Your feedback is certainly helpful and hopefully we can get some of these 
> details into the documentation!
> 
> -- 
> Tyler Hobbs
> Software Engineer, DataStax
> Maintainer of the pycassa Cassandra Python client library
> 


Re: Secondary Indexes

2011-04-03 Thread Joe Stump

On Apr 3, 2011, at 2:22 PM, Drew Kutcharian wrote:

> Thanks Tyler. Can you update the wiki with these answers so they are stored 
> there for others to see too?

Dude, it's a wiki. 

Re: Secondary Indexes

2011-04-03 Thread Drew Kutcharian
Yea I know, I just didn't know anyone can update it.


On Apr 3, 2011, at 1:26 PM, Joe Stump wrote:

> 
> On Apr 3, 2011, at 2:22 PM, Drew Kutcharian wrote:
> 
>> Thanks Tyler. Can you update the wiki with these answers so they are stored 
>> there for others to see too?
> 
> Dude, it's a wiki.



Re: Secondary Indexes

2011-04-03 Thread Drew Kutcharian
I just added a new page to the wiki: 
http://wiki.apache.org/cassandra/SecondaryIndexes


On Apr 3, 2011, at 7:37 PM, Drew Kutcharian wrote:

> Yea I know, I just didn't know anyone can update it.
> 
> 
> On Apr 3, 2011, at 1:26 PM, Joe Stump wrote:
> 
>> 
>> On Apr 3, 2011, at 2:22 PM, Drew Kutcharian wrote:
>> 
>>> Thanks Tyler. Can you update the wiki with these answers so they are stored 
>>> there for others to see too?
>> 
>> Dude, it's a wiki.
> 



Re: Embedding Cassandra in Java code w/o using ports

2011-04-03 Thread Kirk Peterson
Not sure, but I've been playing with running cassandra in the same JVM as an
HTTP server for a pet project of mine, using a similar technique to the one
found in the Solandra project. It does use ports on localhost,
but hopefully it gives you an idea of embedding cassandra (no clue if it's a
good idea or not yet, still playing with it myself).
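
The embedded part itself is only a few lines; roughly this (going from memory of
the 0.7 tree, so treat the class/method names as something to verify, and the
config path is a placeholder):

import org.apache.cassandra.service.EmbeddedCassandraService;

public class EmbeddedSketch
{
    public static void main(String[] args) throws Exception
    {
        // Point Cassandra at its yaml before anything touches DatabaseDescriptor.
        System.setProperty("cassandra.config", "file:conf/cassandra.yaml");

        // Starts the storage service and the Thrift server inside this JVM.
        // It still binds the rpc/storage ports from the yaml, so clients talk
        // to it over localhost:9160 like a normal node.
        EmbeddedCassandraService cassandra = new EmbeddedCassandraService();
        cassandra.start();

        // ... bring up the HTTP server and connect to localhost with your client here ...
    }
}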

cheers,
-kirk

https://github.com/tjake/Solandra

On Fri, Apr 1, 2011 at 9:07 PM, Bob Futrelle  wrote:

> Connecting via CLI to local host with a port number has never been
> successful for me in Snow Leopard.  No amount of reading suggestions and
> varying the approach has worked.  So I'm going to talk to Cassandra via its
> API, from Java.
>
> But I noticed that in some code samples that call the API from Java, ports
> are also in play.  In using Derby in Java I've never had to designate any
> ports.  Is such a  strategy available with Cassandra?
>
>  - Bob Futrelle
>Northeastern U.
>
>


-- 
⑆gmail.com⑆necrobious⑈