Re: newer Cassandra + Hadoop = TimedOutException()

2012-03-08 Thread Patrik Modesto
I changed rpc_endpoints to endpoints and now the splits are computed
correctly, so it's a bug in the Cassandra-to-Hadoop interface. I suspect
it has something to do with the wide rows (tens of thousands of columns)
we have, because the unpatched getSubSplits() works fine with the small
test data we use for development.

Regards,
P.
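The fallback described above (use endpoints where rpc_endpoints is unusable) can be sketched in plain Python. The real change is in Java, in ColumnFamilyInputFormat.getSubSplits(); everything below, names included, is an illustrative stand-in, not the actual patch:

```python
def pick_endpoints(range_endpoints, range_rpc_endpoints):
    """Prefer the rpc_endpoints of a token range, but fall back to the
    gossip endpoints when an rpc address is unusable (nodes bound to
    0.0.0.0 advertise exactly that, which is unconnectable)."""
    chosen = []
    for ep, rpc in zip(range_endpoints, range_rpc_endpoints or []):
        chosen.append(ep if rpc in (None, "0.0.0.0") else rpc)
    # no rpc_endpoints at all: use the gossip endpoints unchanged
    return chosen or list(range_endpoints)

# a node listening on 0.0.0.0 falls back to its gossip address
assert pick_endpoints(["10.0.0.1", "10.0.0.2"],
                      ["0.0.0.0", "192.0.2.5"]) == ["10.0.0.1", "192.0.2.5"]
assert pick_endpoints(["10.0.0.1"], None) == ["10.0.0.1"]
```

Replacing the loop wholesale with `range.endpoints`, as suggested below in the thread, is the blunt version of the same idea.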


On Wed, Mar 7, 2012 at 11:02, Florent Lefillâtre  wrote:
> If you want to try a test: in the CFIF.getSubSplits(String, String, TokenRange,
> Configuration) method, replace the loop over 'range.rpc_endpoints' with the same
> loop over 'range.endpoints'.
> This method splits the token range of each node with the describe_splits
> method, but I think there is something wrong when the Cassandra connection is
> created on host '0.0.0.0'.
>
>
>
>
> On 7 Mar 2012 at 09:07, Patrik Modesto wrote:
>
>> You're right, I wasn't looking at the right logs. Unfortunately I'd need
>> to restart the Hadoop tasktracker with loglevel DEBUG and that is not
>> possible at the moment. Pity it happens only in production with terabytes
>> of data, not in the test...
>>
>> Regards,
>> P.
>>
>> On Tue, Mar 6, 2012 at 14:31, Florent Lefillâtre 
>> wrote:
>> > CFRR.getProgress() is called by the child mapper tasks on each TaskTracker
>> > node, so the log should appear in
>> > ${hadoop_log_dir}/attempt_201202081707_0001_m_00_0/syslog (or something
>> > like that) on the TaskTrackers, not in the client job logs.
>> > Are you sure you're looking at the right log file? I ask because in your
>> > first mail you linked the client job log.
>> > And maybe you can log the size of each split in CFIF.
>> >
>> >
>> >
>> >
>> > On 6 Mar 2012 at 13:09, Patrik Modesto wrote:
>> >
>> >> I've added a debug message to CFRR.getProgress() and I can't find it in
>> >> the debug output. Seems like getProgress() hasn't been called at all.
>> >>
>> >> Regards,
>> >> P.
>> >>
>> >> On Tue, Mar 6, 2012 at 09:49, Jeremy Hanna 
>> >> wrote:
>> >> > you may be running into this -
>> >> > https://issues.apache.org/jira/browse/CASSANDRA-3942 - I'm not sure
>> >> > if it
>> >> > really affects the execution of the job itself though.
>> >> >
>> >> > On Mar 6, 2012, at 2:32 AM, Patrik Modesto wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I was recently trying the Hadoop job with cassandra-all 0.8.10 again,
>> >> >> and the timeouts I get are not because Cassandra can't handle the
>> >> >> requests. I've noticed there are several tasks that show progress of
>> >> >> several thousand percent. Seems like they are looping over their range
>> >> >> of keys. I've run the job with debug enabled and the ranges look OK,
>> >> >> see http://pastebin.com/stVsFzLM
>> >> >>
>> >> >> Another difference between cassandra-all 0.8.7 and 0.8.10 is the
>> >> >> number of mappers the job creates:
>> >> >> 0.8.7: 4680
>> >> >> 0.8.10: 595
>> >> >>
>> >> >> Task       Complete
>> >> >> task_201202281457_2027_m_41       9076.81%
>> >> >> task_201202281457_2027_m_73       9639.04%
>> >> >> task_201202281457_2027_m_000105       10538.60%
>> >> >> task_201202281457_2027_m_000108       9364.17%
>> >> >>
>> >> >> None of this happens with cassandra-all 0.8.7.
>> >> >>
>> >> >> Regards,
>> >> >> P.
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Feb 28, 2012 at 12:29, Patrik Modesto
>> >> >>  wrote:
>> >> >>> I'll alter these settings and will let you know.
>> >> >>>
>> >> >>> Regards,
>> >> >>> P.
>> >> >>>
>> >> >>> On Tue, Feb 28, 2012 at 09:23, aaron morton
>> >> >>> 
>> >> >>> wrote:
>> >>  Have you tried lowering the batch size and increasing the timeout?
>> >>  Even just to get it to work.
>> >> 
>> >>  If you get a TimedOutException it means CL number of servers did not
>> >>  respond in time.
>> >> 
>> >>  Cheers
>> >> 
>> >>  -
>> >>  Aaron Morton
>> >>  Freelance Developer
>> >>  @aaronmorton
>> >>  http://www.thelastpickle.com
>> >> 
>> >>  On 28/02/2012, at 8:18 PM, Patrik Modesto wrote:
>> >> 
>> >>  Hi aaron,
>> >> 
>> >>  this is our current settings:
>> >> 
>> >>      <property>
>> >>          <name>cassandra.range.batch.size</name>
>> >>          <value>1024</value>
>> >>      </property>
>> >> 
>> >>      <property>
>> >>          <name>cassandra.input.split.size</name>
>> >>          <value>16384</value>
>> >>      </property>
>> >> 
>> >>  rpc_timeout_in_ms: 3
>> >> 
>> >>  Regards,
>> >>  P.
>> >> 
>> >>  On Mon, Feb 27, 2012 at 21:54, aaron morton
>> >>  
>> >>  wrote:
>> >> 
>> >>  What settings do you have for cassandra.range.batch.size and
>> >>  rpc_timeout_in_ms? Have you tried reducing the first and/or increasing
>> >>  the second?
>> >> 
>> >> 
>> >>  Cheers
>> >> 
>> >> 
>> >>  -
>> >> 
>> >>  Aaron Morton
>> >> 
>> >>  Freelance Developer
>> >> 
>> >>  @aaronmorton
>> >

Re: Large SliceRanges: Reading all results in to memory vs. reading smaller result sub-sets at a time?

2012-03-08 Thread aaron morton
It is better to get a sensible amount. Moving a few MBs is OK (see 
thrift_framed_transport_size_in_mb in cassandra.yaml). 

Long-running queries can reduce the overall query throughput. They also churn 
memory on both the server and the client. 

Run some tests on your data, see how long it takes to iterate over all the 
columns using different slice sizes. More is not always better. 

Cheers
 
 
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/03/2012, at 11:56 AM, Kevin wrote:

> When dealing with large SliceRanges, is it better to read all the results into 
> memory (by setting “count” to the largest value possible), or is it better to 
> divide the query into smaller SliceRange queries? Large in this case being 
> on the order of millions of rows.
>  
> There’s a footnote concerning SliceRanges on the main Apache Cassandra 
> project site that reads:
>  
> “…Thrift will materialize the whole result into memory before returning it to 
> the client, so be aware that you may be better served by iterating through 
> slices by passing the last value of one call in as the start of the next 
> instead of increasing count arbitrarily large.”
>  
> … but it doesn’t delve into the reasons why going about things that way is 
> better.
>  
> Can someone shed some light on this? And would the same logic apply to large 
> KeyRanges?
>  
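The iteration the quoted footnote recommends (pass the last column name of one call in as the start of the next) can be sketched in plain Python against a toy in-memory "row". This is a stand-in for illustration only; `get_slice`, `fake_get_slice`, and the column layout are invented, not pycassa's or Thrift's API:

```python
from bisect import bisect_left

def paged_slice(get_slice, start="", count=1000):
    """Iterate over all columns of a wide row in pages of `count`,
    restarting each call at the last column name of the previous page."""
    while True:
        page = get_slice(start=start, count=count)
        if start:
            page = page[1:]  # the boundary column was already yielded
        if not page:
            return
        for col in page:
            yield col
        start = page[-1]

# toy stand-in for a Thrift get_slice call on one row of 10,000 columns
columns = ["col%05d" % i for i in range(10000)]

def fake_get_slice(start="", count=1000):
    i = bisect_left(columns, start) if start else 0
    return columns[i:i + count]

assert list(paged_slice(fake_get_slice, count=997)) == columns
```

With pycassa, the same idea maps onto successive `get` calls that move `column_start` forward while keeping `column_count` modest; each page bounds how much Thrift materializes at once.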



LeveledCompaction and/or SnappyCompressor causing memory pressure during repair

2012-03-08 Thread Thomas van Neerijnen
Hi all

Running Cassandra 1.0.7, I recently changed a few read-heavy column
families from SizeTieredCompactionStrategy to LeveledCompactionStrategy and
added SnappyCompressor, all with defaults, so 5MB sstable files and, if
memory serves me correctly, a 64KB chunk size for compression.
The results were amazingly good: my data size halved, and my heap usage and
performance stabilised nicely, until it came time to run a repair.

When a repair isn't running I'm seeing a saw-toothed pattern on my heap
graphs, with CMS clearing out about 1.5GB each GC run. The CMS GC appears as
a sudden vertical drop on the old-gen usage graph. In addition to what I
consider healthy-looking heap usage, my ParNew and CMS collections are
running far quicker than before I made the changes.

However, when I run a repair, my CMS usage graph no longer shows sudden
drops but rather gradual slopes, and only manages to clear around 300MB each
GC. This seems to occur on 2 other nodes in my cluster around the same
time; I assume this is because they're the replicas (we use 3 replicas).
ParNew collections look about the same on my graphs with or without repair
running, so no trouble there as far as I can tell.
The symptom of the memory pressure during repair is that either the node
running the repair or one of the two replicas tends to perform badly, with
the read stage backing up into the thousands at times.
If I run a repair on more than one or two nodes at the same time (it's a 7
node cluster) the memory pressure is so bad that half the cluster ends up
OOMing, and this happened during off-peak when it's doing about half the
reads we handle during peak, so it's not particularly loaded.

The question I'm asking is has anyone run into this behaviour before, and
if so how was it dealt with?

Once I have nursed the cluster through the repair it's currently running, I
will turn off compression on one of my larger CFs to see if it makes a
difference; I'll send the results of that test tomorrow.


Re: Node joining / unknown

2012-03-08 Thread R. Verlangen
It seemed that one of the other nodes had trouble with a compaction task.
The C node was waiting for that.

It's now streaming all its data into place.

Thank you all for your time!

2012/3/7 

> Just run "nodetool compactionstats" on the other nodes.
>
>
> -Original Message-
> From: "R. Verlangen" 
> To: user@cassandra.apache.org
> Sent: Wed, 07 Mar 2012 23:09
> Subject: Re: Node joining / unknown
>
> @Brandon: Thank you for the information. I'll do that next time.
>
> @Igor: Any ways to find out whether that is the current state? And if so,
> how to solve it?
>
> 2012/3/7 
>
>> Maybe it wait for verification compaction on other node?
>>
>>
>>
>>
>>
>> -Original Message-
>> From: "R. Verlangen" 
>> To: user@cassandra.apache.org
>> Sent: Wed, 07 Mar 2012 22:15
>> Subject: Re: Node joining / unknown
>>
>> At this moment the node has joined the ring (after a restart; I tried that
>> before, but now it finally worked).
>>
>> When I try to run repair on the new node, the log says (the new node is
>> NODE C):
>>
>> INFO [AntiEntropyStage:1] 2012-03-07 21:12:06,453 AntiEntropyService.java
>> (line 190) [repair #cfcc12b0-6891-11e1--70a329caccff] Received merkle
>> tree for StorageMeta from NODE A
>>  INFO [AntiEntropyStage:1] 2012-03-07 21:12:06,643
>> AntiEntropyService.java (line 190) [repair
>> #cfcc12b0-6891-11e1--70a329caccff] Received merkle tree for StorageMeta
>> from NODE B
>>
>> And then it doesn't do anything anymore. I tried it a couple more times;
>> it's just not starting.
>>
>> Results from netstats on NODE C:
>>
>> Mode: NORMAL
>> Not sending any streams.
>> Not receiving any streams.
>> Pool NameActive   Pending  Completed
>> Commandsn/a 0  5
>> Responses   n/a93   4296
>>
>>
>> Any suggestions?
>>
>> Thank you!
>>
>> 2012/3/7 aaron morton 
>>
>>> - When I try to remove the token, it says: Exception in thread "main"
>>> java.lang.UnsupportedOperationException: Token not found.
>>>
>>> I am assuming you ran nodetool removetoken on a node other than the
>>> joining node? What did nodetool ring look like on that machine?
>>>
>>> Take a look at nodetool netstats on the joining node to see if streaming
>>> has failed. If it's dead then…
>>>
>>> 1) Try restarting the joining node and run nodetool repair on it
>>> immediately. Note: I am assuming QUORUM CL, otherwise things may get
>>> inconsistent.
>>> or
>>> 2) Stop the node. Try to remove the token again from another node.
>>> Note that removing a token will stream data around the place as well.
>>>
>>> Cheers
>>>
>>>   -
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 7/03/2012, at 9:11 PM, R. Verlangen wrote:
>>>
>>> Hi there,
>>>
>>> I'm currently in a really weird situation.
>>> - Nodetool ring says node X is joining (this has already been going on
>>> for 12 hours, with no activity)
>>> - When I try to remove the token, it says: Exception in thread "main"
>>> java.lang.UnsupportedOperationException: Token not found.
>>> - Removetoken status = No token removals in process.
>>>
>>> How to get that node out of my cluster?
>>>
>>> With kind regards,
>>> Robin Verlangen
>>>
>>>
>>>
>>
>


offline compaction

2012-03-08 Thread Feng Qu
Hello, is there a way to take one node out of the ring and run a major 
compaction? 
 
Feng Qu

Multi DC on EC2 with no VPC

2012-03-08 Thread Todd Nine
Hi all,
  I've recently upgraded a test cluster from 0.8.x to 1.0.8 for testing multi 
data center communications.  I have the following configuration file on 3 nodes 
in a single data center.

https://gist.github.com/4671e4ae562a47f96ed2

However, when I run nodetool on any of these nodes, they recognize the others 
are up, but no data is available.


nodetool -h localhost ring
Address DC  RackStatus State   LoadOwns
Token   
   
113427455640312821154458202477256070485 
10.172.106.192  UNKNOWN-DC  UNKNOWN-RACKDown   Normal  ?   33.33%  
0   
50.18.3.222 us-west 1c  Up Normal  3.95 GB 33.33%  
56713727820156410577229101238628035242  
10.172.134.239  UNKNOWN-DC  UNKNOWN-RACKDown   Normal  ?   33.33%  
113427455640312821154458202477256070485  

If I remove the broadcast_address from the yaml, everything works fine, but I 
can't communicate across data centers. Any ideas why I'm getting this error?  
The listen_address is the private IP, and the broadcast_address is the public 
IP.

Thanks,
Todd
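For reference, one configuration shape known to work for EC2 multi-region without VPC in the 1.0 line pairs broadcast_address with the Ec2MultiRegionSnitch and public-IP seeds. This is a hedged sketch with placeholder addresses; it is an assumption about a working shape, not the contents of the linked gist:

```yaml
# Illustrative cassandra.yaml fragment for EC2 multi-region (1.0.x era).
# All addresses are placeholders; snitch and seed choices are assumptions.
listen_address: 10.0.0.1          # this node's private IP
broadcast_address: 203.0.113.1    # this node's public IP
endpoint_snitch: Ec2MultiRegionSnitch   # derives DC/rack from EC2 metadata
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "203.0.113.1"  # seeds should be public IPs
```

A node that gossips on a public address other nodes cannot resolve to a DC/rack will show up as UNKNOWN-DC/UNKNOWN-RACK, so snitch choice and security-group reachability are the first things worth checking.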

Single column read latency

2012-03-08 Thread A J
Hello,
In a CF with valueless columns and an integer column-name type, I am
seeing latency on the order of 80-90ms to retrieve a single column from a
row containing 50K columns. It is just a single-node DB on a single box.
Another row with 20K columns in the same CF still has around 30ms latency
to get to a single column.

Example in pycassa, for a row with 50K columns:
t1 = time.time()
cf1.get(5011, columns=[90006111])
t2 = time.time() - t1
print int(t2*1000), 'ms'

gives 82 ms

Any idea what could be causing the latency to be so high? That is, even
after ensuring that the row cache is large enough to contain all the rows
and that all the rows are pre-fetched.

Thanks.
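One thing worth ruling out before digging deeper: a single timed call, as in the snippet above, is dominated by noise (GC pauses, cold caches). A steadier measurement takes the median over many calls. Plain Python sketch; `call` stands in for the pycassa lambda, which is only shown in a comment:

```python
import time
import statistics

def measure_latency_ms(call, n=200):
    """Invoke call() n times and return the median latency in milliseconds.
    The median over many calls smooths out one-off pauses that can make a
    single timing an order of magnitude too high."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# against pycassa this would look like (hypothetical, mirroring the mail):
#   measure_latency_ms(lambda: cf1.get(5011, columns=[90006111]))
print("%.3f ms" % measure_latency_ms(lambda: sum(range(1000))))
```

If the median stays at 80-90ms, the cost is systematic (wide-row deserialization on read, for instance) rather than measurement noise.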


Re: offline compaction

2012-03-08 Thread Edward Capriolo
On Thu, Mar 8, 2012 at 1:43 PM, Feng Qu  wrote:
> Hello, is there a way to take one node out of the ring and run a major
> compaction?
>
> Feng Qu

http://www.jointhegrid.com/highperfcassandra/?p=187

Cheers


Re: offline compaction

2012-03-08 Thread Karl Hiramoto

On 03/08/12 21:40, Edward Capriolo wrote:

On Thu, Mar 8, 2012 at 1:43 PM, Feng Qu  wrote:

Hello, is there a way to take one node out of the ring and run a major
compaction?

Feng Qu

http://www.jointhegrid.com/highperfcassandra/?p=187



What are the drawbacks of disabling thrift and gossip? So you take a node 
offline to do a repair/compaction, but then, assuming writes are coming into 
your cluster at a steady rate, you have missing writes that need to be 
repaired on the node that was offline? Is that what would happen?


--
Karl


Re: offline compaction

2012-03-08 Thread Mike Panchenko
Yes, that is what would happen; some anti-entropy mechanism would have to
perform the replication after the fact (hinted handoff, read repair, manual
repair, etc.).

For most things, it's better to rely on the dynamic endpoint snitch and
some sort of dynamic load balancing from the client (see:
https://github.com/rantav/hector/blob/master/core/src/main/java/me/prettyprint/cassandra/connection/DynamicLoadBalancingPolicy.java)
to automatically route around busy nodes.

I don't remember when the referenced nodetool command was added; for older
versions, a similar effect can be achieved using iptables or similar tools.

On Thu, Mar 8, 2012 at 12:56 PM, Karl Hiramoto  wrote:

> On 03/08/12 21:40, Edward Capriolo wrote:
>
>> On Thu, Mar 8, 2012 at 1:43 PM, Feng Qu  wrote:
>>
>>> Hello, is there a way to take one node out of the ring and run a major
>>> compaction?
>>>
>>> Feng Qu
>>>
>> http://www.jointhegrid.com/highperfcassandra/?p=187
>>
>>
> What are the drawbacks of disabling thrift and gossip? So you take a node
> offline to do a repair/compaction, but then, assuming writes are coming into
> your cluster at a steady rate, you have missing writes that need to be
> repaired on the node that was offline? Is that what would happen?
>
> --
> Karl
>