Re: Performance regression tests

2010-05-13 Thread Johan Oskarsson
Hey Todd,

Thanks for the pointer, it looks like a good first step.
I'll check it out once I get to spend some time on this again.
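
A minimal sketch of the properties-file route, assuming the Plot plugin reads each data point from a `YVALUE` key in a properties file (check the plugin page linked in Todd's mail below before relying on that). The class name, file name and sample value are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PlotPropertiesWriter {
    // Assumption: the plugin picks up the plotted number from a "YVALUE" key.
    static String render(double value) {
        return "YVALUE=" + value + "\n";
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical measurement, e.g. mean read latency in ms from one benchmark run.
        double meanReadLatencyMs = 12.5;
        // One file per plotted series; the build job would point the plot at this file.
        Files.write(Paths.get("read-latency.properties"),
                    render(meanReadLatencyMs).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Each build would overwrite the file with the latest number, so the benchmark only needs to emit a single value per series.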

I see you've been spending time on HBase. Let me know if you guys want to join
forces on this project.

/Johan

On 12 May 2010, at 20:10, Todd Lipcon wrote:

> Hey Johan,
> 
> A Hudson plugin would be great. A short term solution, though, would
> be to simply use the existing support for Hudson graphing from
> properties files: http://wiki.hudson-ci.org/display/HUDSON/Plot+Plugin
> 
> At a previous job we used this plugin to plot web page response
> times, and it served its purpose well.
> 
> -Todd
> 
> On Wed, May 12, 2010 at 1:13 AM, Johan Oskarsson wrote:
>> I've started looking into this issue. My current thinking is as follows.
>> 
>> Add support for Cassandra in Whirr: 
>> http://wiki.apache.org/incubator/WhirrProposal
>> This would allow us to start a short lived Cassandra cluster on one of the 
>> cloud services (EC2/Rackspace etc) for testing.
>> Real hardware would of course be better, but this is a good starting point.
>> 
>> For running the actual tests I have been looking at YCSB: 
>> http://github.com/brianfrankcooper/YCSB
>> I've added support for Cassandra trunk as of last week and am now working, 
>> off and on, on adding a measurements export function so we can get the 
>> results as a JSON file. It's fairly straightforward.
>> 
>> The best way to expose these results as graphs etc and raise an error if 
>> they are unexpected would be a plugin to Hudson. That way all our test 
>> results are in one place.
>> Other projects such as HBase might be interested in contributing to a 
>> Hudson-YCSB plugin. This would probably be best done as a separate project, 
>> on GitHub for example.
>> 
>> If we want further results on how performance is affected by failures we 
>> could run with
>> http://github.com/toddlipcon/gremlins
>> or
>> https://issues.apache.org/jira/browse/CASSANDRA-561
>> 
>> 
>> Thoughts?
>> 
>> /Johan
>> 
>> On 11 May 2010, at 20:38, Kushal Pisavadia wrote:
>> 
>>> Hi,
>>> 
>>> Due to conflicting schedules, I was unable to take part in the GSoC this
>>> year. However, I'm still very interested in helping out the community for
>>> this specific case.
>>> 
>>> Rather than just coding off a solution that would suit my own needs, I'm
>>> here asking for some help.
>>> 
>>> What short-term goals do you have in mind? What long-term goals do you have
>>> in mind?
>>> 
>>> I've had a look at the respective ticket —
>>> https://issues.apache.org/jira/browse/CASSANDRA-875 — but rather than just
>>> refactor the py_stress utility I'd like to make something that fulfils
>>> whatever needs that the current utility fails to meet.
>>> 
>>> I'm also curious about how you'd like me to commit/expose my code.
>>> Originally I was thinking of creating a separate git repo, specific to this
>>> utility, but have no issues working from a fork on Github either.
>>> 
>>> Kind Regards,
>>> 
>>> Kushal Pisavadia
>> 
>> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera



cassandra index implementation

2010-05-13 Thread Boris Shulman
I see that the following code is used in order to create an index:

    for (Iterator it = columns.iterator(); it.hasNext();)
    {
        column = it.next();
        if (firstColumn == null)
        {
            firstColumn = column;
            startPosition = endPosition;
        }
        endPosition += column.serializedSize();
        /* if we hit the column index size that we have to index
           after, go ahead and index it. */
        if (endPosition - startPosition >= DatabaseDescriptor.getColumnIndexSize())
        {
            IndexHelper.IndexInfo cIndexInfo =
                new IndexHelper.IndexInfo(firstColumn.name(), column.name(),
                                          startPosition, endPosition - startPosition);
            indexList.add(cIndexInfo);
            indexSizeInBytes += cIndexInfo.serializedSize();
            firstColumn = null;
        }
    }

According to this code, the name of the first column is stored numerous
times, which can be very expensive in large rows. I think that a
better implementation would be to have an index header that in turn
contains the name of the first column.
Are there any plans to change this implementation? Is anyone aware of
an open issue for it?


Re: cassandra index implementation

2010-05-13 Thread Jonathan Ellis
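It's a different firstColumn each time it's stored, if you look carefully:
it is reset to null after every index entry is added, so the next column
read becomes the first column of the next index block.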
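
To make that concrete, here is a small self-contained sketch that mirrors the quoted loop. `Column`, `IndexInfo` and `buildIndex` are simplified, hypothetical stand-ins, not the real Cassandra classes, and the threshold is passed in rather than read from `DatabaseDescriptor`. Like the quoted fragment, it does not emit an entry for a trailing partial block.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnIndexSketch {
    /** Hypothetical stand-in for a column: a name and its serialized size in bytes. */
    static final class Column {
        final String name;
        final int size;
        Column(String name, int size) { this.name = name; this.size = size; }
    }

    /** Hypothetical stand-in for IndexHelper.IndexInfo. */
    static final class IndexInfo {
        final String firstName, lastName;
        final long offset, width;
        IndexInfo(String firstName, String lastName, long offset, long width) {
            this.firstName = firstName; this.lastName = lastName;
            this.offset = offset; this.width = width;
        }
        @Override public String toString() {
            return "[" + firstName + ".." + lastName + "] offset=" + offset + " width=" + width;
        }
    }

    static List<IndexInfo> buildIndex(List<Column> columns, int columnIndexSize) {
        List<IndexInfo> indexList = new ArrayList<>();
        long startPosition = 0, endPosition = 0;
        Column firstColumn = null;
        for (Column column : columns) {
            if (firstColumn == null) {
                // First column of a new index block.
                firstColumn = column;
                startPosition = endPosition;
            }
            endPosition += column.size;
            if (endPosition - startPosition >= columnIndexSize) {
                indexList.add(new IndexInfo(firstColumn.name, column.name,
                                            startPosition, endPosition - startPosition));
                // Reset, so the next block records its own first column.
                firstColumn = null;
            }
        }
        return indexList;
    }

    public static void main(String[] args) {
        List<Column> columns = new ArrayList<>();
        for (char c = 'a'; c <= 'f'; c++) {
            columns.add(new Column(String.valueOf(c), 10));
        }
        // Six 10-byte columns with a 20-byte threshold give blocks a..b, c..d and e..f.
        for (IndexInfo info : buildIndex(columns, 20)) {
            System.out.println(info);
        }
    }
}
```

Each entry carries the first and last column name of its own block, which is what lets a reader locate the block that may contain a given column name.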

On Thu, May 13, 2010 at 6:22 AM, Boris Shulman wrote:
> I see that the following code is used in order to create an index:
>
>     for (Iterator it = columns.iterator(); it.hasNext();)
>     {
>         column = it.next();
>         if (firstColumn == null)
>         {
>             firstColumn = column;
>             startPosition = endPosition;
>         }
>         endPosition += column.serializedSize();
>         /* if we hit the column index size that we have to index
>            after, go ahead and index it. */
>         if (endPosition - startPosition >= DatabaseDescriptor.getColumnIndexSize())
>         {
>             IndexHelper.IndexInfo cIndexInfo =
>                 new IndexHelper.IndexInfo(firstColumn.name(), column.name(),
>                                           startPosition, endPosition - startPosition);
>             indexList.add(cIndexInfo);
>             indexSizeInBytes += cIndexInfo.serializedSize();
>             firstColumn = null;
>         }
>     }
>
> According to this code, the name of the first column is stored numerous
> times, which can be very expensive in large rows. I think that a
> better implementation would be to have an index header that in turn
> contains the name of the first column.
> Are there any plans to change this implementation? Is anyone aware of
> an open issue for it?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com