Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Jonathan Ellis
I still don't see the hole in the following reasoning:

- Input splits are 64k by default.  At this size, map processing time
dominates job creation.
- Therefore, if job creation time dominates, you have a toy data set
(< 64K * 256 vnodes = 16 MB)

Adding complexity to our inputformat to improve performance for this
niche does not sound like a good idea to me.

On Thu, Mar 28, 2013 at 8:40 AM, cem  wrote:
> Hi Alicia,
>
> Cassandra's input format creates as many mappers as there are vnodes. It is
> a known issue. You need to lower the number of vnodes :(
>
> I have a simple solution for that and am ready to write a patch. Should I
> create a ticket for it? I don't know the procedure.
>
>  Regards,
> Cem
>
>
> On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong  wrote:
>>
>> Hi All,
>>
>> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for vnodes.
>>
>> When I execute a M/R job, the console showed HUNDREDS of Map tasks.
>>
>> May I know, is this normal with vnodes? If yes, it has slowed the M/R
>> job's completion.
>>
>>
>> Thanks
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced


Lost data after expanding cluster c* 1.2.3-1

2013-03-29 Thread Kais Ahmed
Hi all,

I followed this tutorial for expanding a 4-node c* cluster (production) and
added 3 new nodes.

Datacenter: eu-west
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns   Host ID                               Rack
UN  10.34.142.xxx  10.79 GB  256     15.4%  4e2e26b8-aa38-428c-a8f5-e86c13eb4442  1b
UN  10.32.49.xxx   1.48 MB   256     13.7%  e86f67b6-d7cb-4b47-b090-3824a5887145  1b
UN  10.33.206.xxx  2.19 MB   256     11.9%  92af17c3-954a-4511-bc90-29a9657623e4  1b
UN  10.32.27.xxx   1.95 MB   256     14.9%  862e6b39-b380-40b4-9d61-d83cb8dacf9e  1b
UN  10.34.139.xxx  11.67 GB  256     15.5%  0324e394-b65f-46c8-acb4-1e1f87600a2c  1b
UN  10.34.147.xxx  11.18 GB  256     13.9%  cfc09822-5446-4565-a5f0-d25c917e2ce8  1b
UN  10.33.193.xxx  10.83 GB  256     14.7%  59f440db-cd2d-4041-aab4-fc8e9518c954  1b

The data were not streamed to the new nodes.

Can anyone help me? Our web site is down.

Thanks a lot,


Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
Every map reduce task typically has a minimum Xmx of 256MB of memory. See
mapred.child.java.opts...
So if you have a 10-node cluster with 256 vnodes... you will need to spawn
2,560 map tasks to complete a job.
And a 10-node hadoop cluster with 5 map slots per node... you have 50 map
slots.

Wouldn't it be better if the input format spawned 10 map tasks instead of
2,560?
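
For context, a minimal sketch of the knob that already exists, assuming the
1.2-era org.apache.cassandra.hadoop.ConfigHelper API (names and defaults may
differ by version). The split size is counted in rows; raising it yields
fewer, larger splits per token range, but the input format still emits at
least one split per vnode range, which is cem's point above:

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class FewerSplitsExample
    {
        public static void main(String[] args)
        {
            Configuration conf = new Configuration();
            // Splits are sized in rows, not bytes (64k rows by default).
            // 16x the default means roughly 16x fewer map tasks per range.
            ConfigHelper.setInputSplitSize(conf, 16 * 64 * 1024);
        }
    }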


On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis  wrote:

> I still don't see the hole in the following reasoning:
>
> - Input splits are 64k by default.  At this size, map processing time
> dominates job creation.
> - Therefore, if job creation time dominates, you have a toy data set
> (< 64K * 256 vnodes = 16 MB)
>
> Adding complexity to our inputformat to improve performance for this
> niche does not sound like a good idea to me.
>
> On Thu, Mar 28, 2013 at 8:40 AM, cem  wrote:
> > Hi Alicia,
> >
> > Cassandra's input format creates as many mappers as there are vnodes. It
> > is a known issue. You need to lower the number of vnodes :(
> >
> > I have a simple solution for that and am ready to write a patch. Should I
> > create a ticket for it? I don't know the procedure.
> >
> >  Regards,
> > Cem
> >
> >
> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong 
> wrote:
> >>
> >> Hi All,
> >>
> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
> vnodes.
> >>
> >> When I execute a M/R job, the console showed HUNDREDS of Map tasks.
> >>
> >> May I know, is this normal with vnodes? If yes, it has slowed the M/R
> >> job's completion.
> >>
> >>
> >> Thanks
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
>


Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
This is the second person on the list who has mentioned that hadoop
performance has tanked after switching to vnodes.


On Fri, Mar 29, 2013 at 10:42 AM, Edward Capriolo wrote:

> Every map reduce task typically has a minimum Xmx of 256MB of memory. See
> mapred.child.java.opts...
> So if you have a 10-node cluster with 256 vnodes... you will need to spawn
> 2,560 map tasks to complete a job.
> And a 10-node hadoop cluster with 5 map slots per node... you have 50 map
> slots.
>
> Wouldn't it be better if the input format spawned 10 map tasks instead of
> 2,560?
>
>
> On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis wrote:
>
>> I still don't see the hole in the following reasoning:
>>
>> - Input splits are 64k by default.  At this size, map processing time
>> dominates job creation.
>> - Therefore, if job creation time dominates, you have a toy data set
>> (< 64K * 256 vnodes = 16 MB)
>>
>> Adding complexity to our inputformat to improve performance for this
>> niche does not sound like a good idea to me.
>>
>> On Thu, Mar 28, 2013 at 8:40 AM, cem  wrote:
>> > Hi Alicia,
>> >
>> > Cassandra's input format creates as many mappers as there are vnodes. It
>> > is a known issue. You need to lower the number of vnodes :(
>> >
>> > I have a simple solution for that and am ready to write a patch. Should I
>> > create a ticket for it? I don't know the procedure.
>> >
>> >  Regards,
>> > Cem
>> >
>> >
>> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong 
>> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
>> vnodes.
>> >>
>> >> When I execute a M/R job, the console showed HUNDREDS of Map tasks.
>> >>
>> >> May I know, is this normal with vnodes? If yes, it has slowed the M/R
>> >> job's completion.
>> >>
>> >>
>> >> Thanks
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>>
>
>


CQL queries timing out (and had worked)

2013-03-29 Thread David McNelis
I'm running 1.2.3 and have both CQL3 tables and old-school-style CFs in my
cluster.

I'd had a large insert job running the last several days, which just ended;
it had been inserting using CQL3 insert statements into a CQL3 table.

Now, I see no compactions going on in my cluster, but for some reason any
CQL3 query I try to execute (insert or select, through cqlsh or an external
library) times out with an rpc_timeout.

If I use cassandra-cli, I can do "list tablename limit 10" and immediately
get my 10 rows back.

However, if I do "select * from tablename limit 10" I get the rpc timeout
error.  Same table, same server.  It doesn't seem to matter if I'm hitting a
CQL3-defined table or an older-style one.

Load on the nodes is relatively low at the moment.

Any suggestions short of restarting nodes?  This is a pretty major issue
for us right now.


Re: CQL queries timing out (and had worked)

2013-03-29 Thread David McNelis
Appears that restarting a node makes CQL available on that node again, but
only that node.

Looks like I'll be doing a rolling restart.


On Fri, Mar 29, 2013 at 10:26 AM, David McNelis  wrote:

> I'm running 1.2.3 and have both CQL3 tables and old-school-style CFs in my
> cluster.
>
> I'd had a large insert job running the last several days, which just ended;
> it had been inserting using CQL3 insert statements into a CQL3 table.
>
> Now, I see no compactions going on in my cluster, but for some reason any
> CQL3 query I try to execute (insert or select, through cqlsh or an external
> library) times out with an rpc_timeout.
>
> If I use cassandra-cli, I can do "list tablename limit 10" and immediately
> get my 10 rows back.
>
> However, if I do "select * from tablename limit 10" I get the rpc timeout
> error.  Same table, same server.  It doesn't seem to matter if I'm hitting
> a CQL3-defined table or an older-style one.
>
> Load on the nodes is relatively low at the moment.
>
> Any suggestions short of restarting nodes?  This is a pretty major issue
> for us right now.
>


Cassandra/MapReduce ‘Data Locality’

2013-03-29 Thread Alicia Leong
Hi All,

The CfSplit highlighted in RED is in d2t0053g.

But why is it being submitted to d2t0051g and not d2t0053g?

Is this normal? Does this matter? In this case there is no longer ‘Data
Locality’, correct?



 I’m using hadoop-1.1.2 & apache-cassandra-1.2.3.

TokenRange (1) >> 127605887595351923798765477786913079296 => 0

TokenRange (2) >> 85070591730234615865843651857942052864 =>
127605887595351923798765477786913079296

TokenRange (3) >> 42535295865117307932921825928971026432 =>
85070591730234615865843651857942052864

TokenRange (4) >> 0 => 42535295865117307932921825928971026432

ColumnFamilySplit((127605887595351923798765477786913079296, '-1] @[d2t0050g])

ColumnFamilySplit((-1, '0] @[d2t0050g])

ColumnFamilySplit((85070591730234615865843651857942052864,
'127605887595351923798765477786913079296] @[d2t0053g])

ColumnFamilySplit((42535295865117307932921825928971026432,
'85070591730234615865843651857942052864] @[d2t0052g])

ColumnFamilySplit((0, '42535295865117307932921825928971026432] @[d2t0051g])



RF1
---

d2t0050g

KeyRange(start_token:127605887595351923798765477786913079296, end_token:-1,
count:4096)

d2t0051g

KeyRange(start_token:85070591730234615865843651857942052864,
end_token:127605887595351923798765477786913079296, count:4096)

Rowkey:3; columnvalue=Critics Choice Awards from
ColumnFamilySplit((85070591730234615865843651857942052864,
'127605887595351923798765477786913079296] @[d2t0053g])

KeyRange(start_token:117356732921465116845890410746976120467,
end_token:127605887595351923798765477786913079296, count:4096)

KeyRange(start_token:0, end_token:42535295865117307932921825928971026432,
count:4096)

Rowkey:1; columnvalue=Academy Awards from ColumnFamilySplit((0,
'42535295865117307932921825928971026432] @[d2t0051g])

Rowkey:2; columnvalue=Golden Globe Awards from ColumnFamilySplit((0,
'42535295865117307932921825928971026432] @[d2t0051g])

KeyRange(start_token:19847720572362509985402305765727304993,
end_token:42535295865117307932921825928971026432, count:4096)

d2t0052g

KeyRange(start_token:42535295865117307932921825928971026432,
end_token:85070591730234615865843651857942052864, count:4096)

KeyRange(start_token:-1, end_token:0, count:4096)

d2t0053g

Nil




Thanks in advance.


Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Jonathan Ellis
My point is that if you have over 16MB of data per node, you're going
to get thousands of map tasks (that is: hundreds per node) with or
without vnodes.

On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo  wrote:
> Every map reduce task typically has a minimum Xmx of 256MB of memory. See
> mapred.child.java.opts...
> So if you have a 10-node cluster with 256 vnodes... you will need to spawn
> 2,560 map tasks to complete a job.
> And a 10-node hadoop cluster with 5 map slots per node... you have 50 map
> slots.
>
> Wouldn't it be better if the input format spawned 10 map tasks instead of
> 2,560?
>
>
> On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis  wrote:
>>
>> I still don't see the hole in the following reasoning:
>>
>> - Input splits are 64k by default.  At this size, map processing time
>> dominates job creation.
>> - Therefore, if job creation time dominates, you have a toy data set
>> (< 64K * 256 vnodes = 16 MB)
>>
>> Adding complexity to our inputformat to improve performance for this
>> niche does not sound like a good idea to me.
>>
>> On Thu, Mar 28, 2013 at 8:40 AM, cem  wrote:
>> > Hi Alicia,
>> >
>> > Cassandra's input format creates as many mappers as there are vnodes. It
>> > is a known issue. You need to lower the number of vnodes :(
>> >
>> > I have a simple solution for that and am ready to write a patch. Should I
>> > create a ticket for it? I don't know the procedure.
>> >
>> >  Regards,
>> > Cem
>> >
>> >
>> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong 
>> > wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
>> >> vnodes.
>> >>
>> >> When I execute a M/R job, the console showed HUNDREDS of Map tasks.
>> >>
>> >> May I know, is this normal with vnodes? If yes, it has slowed the M/R
>> >> job's completion.
>> >>
>> >>
>> >> Thanks
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced


Re: CQL queries timing out (and had worked)

2013-03-29 Thread David McNelis
The final reason for the problem:

We'd had one node's rpc_server_type config changed from sync to hsha...

So that mismatch can apparently break RPC across the cluster.

It would be nice if there were a good way to set that in a single spot for
the cluster, or to handle the mismatch differently.  Otherwise, if you wanted
to change from sync to hsha in a cluster you'd have to entirely restart the
cluster (not a big deal), but CQL would apparently not work at all until all
of your nodes had been restarted.
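
For reference, assuming the setting involved is rpc_server_type in
cassandra.yaml (which is where sync vs. hsha is chosen), the relevant excerpt
looks like this, and it has to agree on every node:

    # cassandra.yaml, one copy per node; keep the value identical cluster-wide
    # sync: one thread per Thrift connection
    # hsha: half synchronous / half asynchronous; fewer threads, more connections
    rpc_server_type: hsha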


On Fri, Mar 29, 2013 at 10:35 AM, David McNelis  wrote:

> Appears that restarting a node makes CQL available on that node again, but
> only that node.
>
> Looks like I'll be doing a rolling restart.
>
>
> On Fri, Mar 29, 2013 at 10:26 AM, David McNelis wrote:
>
>> I'm running 1.2.3 and have both CQL3 tables and old-school-style CFs in
>> my cluster.
>>
>> I'd had a large insert job running the last several days, which just ended;
>> it had been inserting using CQL3 insert statements into a CQL3 table.
>>
>> Now, I see no compactions going on in my cluster, but for some reason any
>> CQL3 query I try to execute (insert or select, through cqlsh or an
>> external library) times out with an rpc_timeout.
>>
>> If I use cassandra-cli, I can do "list tablename limit 10" and
>> immediately get my 10 rows back.
>>
>> However, if I do "select * from tablename limit 10" I get the rpc timeout
>> error.  Same table, same server.  It doesn't seem to matter if I'm hitting
>> a CQL3-defined table or an older-style one.
>>
>> Load on the nodes is relatively low at the moment.
>>
>> Any suggestions short of restarting nodes?  This is a pretty major issue
>> for us right now.
>>
>
>


Re: MultiInput/MultiGet CF in MapReduce

2013-03-29 Thread Alicia Leong
Hi All

I’m thinking of doing it this way:

1) get_slice(YYYYMMDDHH) from the Index Table.

2) With the returned list of ROWKEYs,

3) pass it to multiget_slice(keys …).

But my question is: how do I ensure ‘Data Locality’?
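
For concreteness, a minimal sketch of that three-step flow against the
1.2-era Thrift API (the address, keyspace and column family names are
placeholders; error handling omitted):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class IndexThenMultiget
    {
        public static void main(String[] args) throws Exception
        {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            client.set_keyspace("MyKeyspace");

            // "Give me every column" predicate, capped at 1000 columns.
            SlicePredicate all = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1000));

            // 1) get_slice(YYYYMMDDHH) on the index row.
            ByteBuffer indexKey = ByteBuffer.wrap("2013030114".getBytes("UTF-8"));
            List<ColumnOrSuperColumn> indexCols = client.get_slice(
                indexKey, new ColumnParent("IndexCF"), all, ConsistencyLevel.ONE);

            // 2) The column *names* of the index row are the data row keys.
            List<ByteBuffer> rowKeys = new ArrayList<ByteBuffer>();
            for (ColumnOrSuperColumn cosc : indexCols)
                rowKeys.add(cosc.getColumn().bufferForName());

            // 3) multiget_slice(keys...) against the data table.
            Map<ByteBuffer, List<ColumnOrSuperColumn>> rows = client.multiget_slice(
                rowKeys, new ColumnParent("DataCF"), all, ConsistencyLevel.ONE);
            System.out.println("fetched " + rows.size() + " rows");

            transport.close();
        }
    }

Note this shows only the flow; it does nothing by itself to route each key to
a local replica, which is the data-locality question above.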


On Tue, Mar 19, 2013 at 3:33 PM, aaron morton wrote:

> I would be looking at Hive or Pig, rather than writing the MapReduce.
>
> There is an example in the source cassandra distribution, or you can look
> at DataStax Enterprise to start playing with Hive.
>
> Typically with hadoop queries you want to query a lot of data; if you are
> only querying a few rows, consider writing the code in your favourite
> language.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/03/2013, at 1:29 PM, Alicia Leong  wrote:
>
> Hi All
>
> I have 2 tables
>
> Data Table
> -
> RowKey: 1
> => (column=name, value=apple)
> RowKey: 2
> => (column=name, value=orange)
> RowKey: 3
> => (column=name, value=banana)
> RowKey: 4
> => (column=name, value=mango)
>
>
> Index Table (YYYYMMDDHH)
> 
> RowKey: 2013030114
> => (column=1, value=)
> => (column=2, value=)
> => (column=3, value=)
> RowKey: 2013030115
> => (column=4, value=)
>
>
> I would like to know, how to implement below in MapReduce
> 1) first query the Index Table by RowKey: 2013030114
> 2) then pass the Index Table column names  (1,2,3) to query the Data Table
>
> Thanks in advance.
>
>
>


Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
Yes, but my point is that with 50 map slots you can only be processing 50 at
once. So it will take 2,560/50, about 52 "waves" of mappers, to complete the
job.


On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis  wrote:

> My point is that if you have over 16MB of data per node, you're going
> to get thousands of map tasks (that is: hundreds per node) with or
> without vnodes.
>
> On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo 
> wrote:
> > Every map reduce task typically has a minimum Xmx of 256MB of memory. See
> > mapred.child.java.opts...
> > So if you have a 10-node cluster with 256 vnodes... you will need to spawn
> > 2,560 map tasks to complete a job.
> > And a 10-node hadoop cluster with 5 map slots per node... you have 50 map
> > slots.
> >
> > Wouldn't it be better if the input format spawned 10 map tasks instead of
> > 2,560?
> >
> >
> > On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis 
> wrote:
> >>
> >> I still don't see the hole in the following reasoning:
> >>
> >> - Input splits are 64k by default.  At this size, map processing time
> >> dominates job creation.
> >> - Therefore, if job creation time dominates, you have a toy data set
> >> (< 64K * 256 vnodes = 16 MB)
> >>
> >> Adding complexity to our inputformat to improve performance for this
> >> niche does not sound like a good idea to me.
> >>
> >> On Thu, Mar 28, 2013 at 8:40 AM, cem  wrote:
> >> > Hi Alicia,
> >> >
> >> > Cassandra's input format creates as many mappers as there are vnodes.
> >> > It is a known issue. You need to lower the number of vnodes :(
> >> >
> >> > I have a simple solution for that and am ready to write a patch. Should
> >> > I create a ticket for it? I don't know the procedure.
> >> >
> >> >  Regards,
> >> > Cem
> >> >
> >> >
> >> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong 
> >> > wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
> >> >> vnodes.
> >> >>
> >> >> When I execute a M/R job, the console showed HUNDREDS of Map tasks.
> >> >>
> >> >> May I know, is this normal with vnodes? If yes, it has slowed the M/R
> >> >> job's completion.
> >> >>
> >> >>
> >> >> Thanks
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder, http://www.datastax.com
> >> @spyced
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
>


Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
It should be easy to control the number of map tasks.
http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you
might run into a directory with 10,000 small files, and you do not want
10,000 map tasks. This is what the combine input formats (e.g. Hadoop's
CombineFileInputFormat) do: they help you control the number of map tasks a
job will generate. For example, imagine I have a multi-tenant cluster. If a
job kicks up 10,000 map tasks, all those tasks can starve out other jobs.
Being able to say "I only want 4 map tasks per c* node regardless of the
number of vnodes" would be a meaningful and useful feature.
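
A rough sketch of that grouping step (a hypothetical helper, not an existing
Cassandra or Hadoop class): bucket the vnode splits by the node that reports
them, which is what a combine-style input format would build on. A complete
version would also need a composite InputSplit whose RecordReader iterates
the wrapped sub-splits:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.mapreduce.InputSplit;

    public final class SplitGrouper
    {
        /** Bucket vnode splits by their first reported location (the C* node). */
        public static Map<String, List<InputSplit>> groupByNode(List<InputSplit> splits)
            throws IOException, InterruptedException
        {
            Map<String, List<InputSplit>> byNode = new HashMap<String, List<InputSplit>>();
            for (InputSplit split : splits)
            {
                String[] locations = split.getLocations();
                String node = locations.length > 0 ? locations[0] : "unknown";
                List<InputSplit> bucket = byNode.get(node);
                if (bucket == null)
                {
                    bucket = new ArrayList<InputSplit>();
                    byNode.put(node, bucket);
                }
                bucket.add(split);
            }
            // "4 map tasks per c* node" would then mean chopping each bucket
            // into 4 composite splits of bucket.size()/4 vnode ranges each.
            return byNode;
        }
    }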


On Fri, Mar 29, 2013 at 2:17 PM, Edward Capriolo wrote:

> Yes, but my point is that with 50 map slots you can only be processing 50
> at once. So it will take 2,560/50, about 52 "waves" of mappers, to complete
> the job.
>
>
> On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis wrote:
>
>> My point is that if you have over 16MB of data per node, you're going
>> to get thousands of map tasks (that is: hundreds per node) with or
>> without vnodes.
>>
>> On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo 
>> wrote:
>> > Every map reduce task typically has a minimum Xmx of 256MB of memory. See
>> > mapred.child.java.opts...
>> > So if you have a 10-node cluster with 256 vnodes... you will need to spawn
>> > 2,560 map tasks to complete a job.
>> > And a 10-node hadoop cluster with 5 map slots per node... you have 50 map
>> > slots.
>> >
>> > Wouldn't it be better if the input format spawned 10 map tasks instead of
>> > 2,560?
>> >
>> >
>> > On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis 
>> wrote:
>> >>
>> >> I still don't see the hole in the following reasoning:
>> >>
>> >> - Input splits are 64k by default.  At this size, map processing time
>> >> dominates job creation.
>> >> - Therefore, if job creation time dominates, you have a toy data set
>> >> (< 64K * 256 vnodes = 16 MB)
>> >>
>> >> Adding complexity to our inputformat to improve performance for this
>> >> niche does not sound like a good idea to me.
>> >>
>> >> On Thu, Mar 28, 2013 at 8:40 AM, cem  wrote:
>> >> > Hi Alicia,
>> >> >
>> >> > Cassandra's input format creates as many mappers as there are vnodes.
>> >> > It is a known issue. You need to lower the number of vnodes :(
>> >> >
>> >> > I have a simple solution for that and am ready to write a patch.
>> >> > Should I create a ticket for it? I don't know the procedure.
>> >> >
>> >> >  Regards,
>> >> > Cem
>> >> >
>> >> >
>> >> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong 
>> >> > wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
>> >> >> vnodes.
>> >> >>
>> >> >> When I execute a M/R job, the console showed HUNDREDS of Map tasks.
>> >> >>
>> >> >> May I know, is this normal with vnodes? If yes, it has slowed the M/R
>> >> >> job's completion.
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder, http://www.datastax.com
>> >> @spyced
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>>
>
>


Re: Insert v/s Update performance

2013-03-29 Thread Jay Svc
Hi Aaron,

Thank you for your input. I have been monitoring my GC activity; looking at
my heap, it shows a pretty linear pattern, without any spikes.

When I look at CPU, it shows higher utilization during writes alone. I also
expect heavy read traffic.

When I tried the compaction_throughput_* parameter, I observed that a higher
number here in my case gets better CPU utilization and keeps pending
compactions pretty low. How does this parameter work? I have 3 nodes with
2-core CPUs, and I have heavy writes.

So, usually, for a high-*update* and high-read situation, which parameters
should we consider for tuning?

Thanks,
Jay
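
For reference, the knobs discussed above live in cassandra.yaml; the values
below are illustrative, not recommendations:

    # cassandra.yaml, compaction tuning
    # Caps total compaction I/O across the node; 0 disables throttling.
    compaction_throughput_mb_per_sec: 16
    # How many compactions may run in parallel (in 1.2 it defaults to the
    # smaller of the core count and the number of data directories).
    concurrent_compactors: 2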





On Wed, Mar 27, 2013 at 9:55 PM, aaron morton wrote:

> * Check for GC activity in the logs
> * check the volume the commit log is on to see if it's over-utilised.
> * check if the dropped messages correlate to compaction, look at the
> compaction_* settings in yaml and consider reducing the throughput.
>
> Like Dean says, if you have existing data it will result in more
> compaction. You may be able to get a lot of writes through in a clean new
> cluster, but it also has to work when compaction and repair are running.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 27/03/2013, at 1:43 PM, Jay Svc  wrote:
>
> Thanks Dean again!
>
> My use case is a high number of reads and writes; of those, I am just
> focusing on writes now. I thought LCS is suitable for my situation. I tried
> the same with STCS and the results are the same.
>
> I ran nodetool tpstats and the MutationStage pending count is very high. At
> the same time, the SSTable count and pending compactions are high too
> during my updates.
>
> Please find the snapshot of my syslog.
>
> INFO [ScheduledTasks:1] 2013-03-26 15:05:48,560 StatusLogger.java (line
> 116) OpsCenter.rollups86400              0,0
> INFO [FlushWriter:55] 2013-03-26 15:05:48,608 Memtable.java (line 264)
> Writing Memtable-InventoryPrice@1051586614(11438914/129587272
> serialized/live bytes, 404320 ops)
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,561 MessagingService.java
> (line 658) 2701 MUTATION messages dropped in last 5000ms
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,562 StatusLogger.java (line
> 57) Pool Name                    Active   Pending   Blocked
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,563 StatusLogger.java (line
> 72) ReadStage 0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,568 StatusLogger.java (line
> 72) RequestResponseStage  0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,627 StatusLogger.java (line
> 72) ReadRepairStage   0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,627 StatusLogger.java (line
> 72) MutationStage32 19967 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,628 StatusLogger.java (line
> 72) ReplicateOnWriteStage 0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,628 StatusLogger.java (line
> 72) GossipStage   0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,628 StatusLogger.java (line
> 72) AntiEntropyStage  0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,629 StatusLogger.java (line
> 72) MigrationStage0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,629 StatusLogger.java (line
> 72) StreamStage   0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,629 StatusLogger.java (line
> 72) MemtablePostFlusher   1 1 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,673 StatusLogger.java (line
> 72) FlushWriter   1 1 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,673 StatusLogger.java (line
> 72) MiscStage 0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,673 StatusLogger.java (line
> 72) commitlog_archiver0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,674 StatusLogger.java (line
> 72) InternalResponseStage 0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,674 StatusLogger.java (line
> 72) HintedHandoff 0 0 0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,674 StatusLogger.java (line
> 77) CompactionManager 127
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,675 StatusLogger.java (line
> 89) MessagingService                n/a      0,22
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,724 StatusLogger.java (line
> 99) Cache Type          Size       Capacity       KeysToSave   Provider
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,725 StatusLogger.java (line
> 100) KeyCache           142315     2118997        all
>  INFO [ScheduledTasks:1] 

Re: MultiInput/MultiGet CF in MapReduce

2013-03-29 Thread Edward Capriolo
You can use the output of describe_ring along with partitioner information
to determine which nodes data lives on.
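
For illustration, a small sketch against the 1.2-era Thrift API (address and
keyspace are placeholders): describe_ring returns each token range together
with the replica endpoints that own it:

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.TokenRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class RingInspector
    {
        public static void main(String[] args) throws Exception
        {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

            // One TokenRange per (v)node range: (start_token, end_token]
            // plus the endpoints (replicas) that own it.
            for (TokenRange range : client.describe_ring("MyKeyspace"))
                System.out.printf("(%s, %s] -> %s%n",
                                  range.start_token, range.end_token, range.endpoints);

            transport.close();
        }
    }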


On Fri, Mar 29, 2013 at 12:33 PM, Alicia Leong  wrote:

> Hi All
>
> I’m thinking of doing it this way:
>
> 1) get_slice(YYYYMMDDHH) from the Index Table.
>
> 2) With the returned list of ROWKEYs,
>
> 3) pass it to multiget_slice(keys …).
>
> But my question is: how do I ensure ‘Data Locality’?
>
>
> On Tue, Mar 19, 2013 at 3:33 PM, aaron morton wrote:
>
>> I would be looking at Hive or Pig, rather than writing the MapReduce.
>>
>> There is an example in the source cassandra distribution, or you can look
>> at DataStax Enterprise to start playing with Hive.
>>
>> Typically with hadoop queries you want to query a lot of data; if you are
>> only querying a few rows, consider writing the code in your favourite
>> language.
>>
>> Cheers
>>
>>-
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 18/03/2013, at 1:29 PM, Alicia Leong  wrote:
>>
>> Hi All
>>
>> I have 2 tables
>>
>> Data Table
>> -
>> RowKey: 1
>> => (column=name, value=apple)
>> RowKey: 2
>> => (column=name, value=orange)
>> RowKey: 3
>> => (column=name, value=banana)
>> RowKey: 4
>> => (column=name, value=mango)
>>
>>
>> Index Table (YYYYMMDDHH)
>> 
>> RowKey: 2013030114
>> => (column=1, value=)
>> => (column=2, value=)
>> => (column=3, value=)
>> RowKey: 2013030115
>> => (column=4, value=)
>>
>>
>> I would like to know, how to implement below in MapReduce
>> 1) first query the Index Table by RowKey: 2013030114
>> 2) then pass the Index Table column names  (1,2,3) to query the Data
>> Table
>>
>> Thanks in advance.
>>
>>
>>
>


Re: MultiInput/MultiGet CF in MapReduce

2013-03-29 Thread Alicia Leong
This is the current flow for ColumnFamilyInputFormat. Please correct me if
I'm wrong:

1) In ColumnFamilyInputFormat, get all nodes' token ranges using
client.describe_ring
2) Get CfSplits using client.describe_splits_ex with each token range
3) Create a new ColumnFamilySplit with the start range, end range and
endpoint
4) In ColumnFamilyRecordReader, query client.get_range_slices with the start
range & end range of the ColumnFamilySplit at that endpoint (data node)


Suppose I use client.get_slice(key), and my rowkey is '20130314' from the
Index Table.
Q1) How do I know which token range & endpoint rowkey '20130314' falls in?
Q2) Even if I manage to find out the token range & endpoint: is there a
Thrift API available to which I can pass (ByteBuffer key, KeyRange range)?
Something like a merge of client.get_slice & client.get_range_slices.


Thanks
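
On Q1, one hedged approach, assuming the RandomPartitioner (which the token
values earlier in this digest suggest): compute the key's token client-side
and test it against each (start_token, end_token] range from describe_ring,
remembering the one range that wraps around the ring:

    import java.math.BigInteger;
    import java.nio.ByteBuffer;
    import org.apache.cassandra.dht.RandomPartitioner;

    public class KeyLocator
    {
        /** True if token falls in (start, end], allowing for the wrapping range. */
        static boolean inRange(BigInteger token, BigInteger start, BigInteger end)
        {
            if (start.compareTo(end) < 0)
                return token.compareTo(start) > 0 && token.compareTo(end) <= 0;
            // The wrapping range back to the start of the ring.
            return token.compareTo(start) > 0 || token.compareTo(end) <= 0;
        }

        public static void main(String[] args) throws Exception
        {
            ByteBuffer key = ByteBuffer.wrap("20130314".getBytes("UTF-8"));
            BigInteger token = new RandomPartitioner().getToken(key).token;
            // Compare this token against each TokenRange from describe_ring
            // to find the owning range, then read that range's endpoints.
            System.out.println("token for key = " + token);
        }
    }

On Q2, as far as I know there is no Thrift call that merges the two;
get_range_slices already takes both a KeyRange and a SlicePredicate, which is
the closest existing method.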



On Sat, Mar 30, 2013 at 7:53 AM, Edward Capriolo wrote:

> You can use the output of describe_ring along with partitioner information
> to determine which nodes data lives on.
>
>
> On Fri, Mar 29, 2013 at 12:33 PM, Alicia Leong wrote:
>
>> Hi All
>>
>> I’m thinking of doing it this way:
>>
>> 1) get_slice(YYYYMMDDHH) from the Index Table.
>>
>> 2) With the returned list of ROWKEYs,
>>
>> 3) pass it to multiget_slice(keys …).
>>
>> But my question is: how do I ensure ‘Data Locality’?
>>
>>
>> On Tue, Mar 19, 2013 at 3:33 PM, aaron morton wrote:
>>
>>> I would be looking at Hive or Pig, rather than writing the MapReduce.
>>>
>>> There is an example in the source cassandra distribution, or you can
>>> look at DataStax Enterprise to start playing with Hive.
>>>
>>> Typically with hadoop queries you want to query a lot of data; if you
>>> are only querying a few rows, consider writing the code in your favourite
>>> language.
>>>
>>> Cheers
>>>
>>>-
>>> Aaron Morton
>>> Freelance Cassandra Consultant
>>> New Zealand
>>>
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 18/03/2013, at 1:29 PM, Alicia Leong  wrote:
>>>
>>> Hi All
>>>
>>> I have 2 tables
>>>
>>> Data Table
>>> -
>>> RowKey: 1
>>> => (column=name, value=apple)
>>> RowKey: 2
>>> => (column=name, value=orange)
>>> RowKey: 3
>>> => (column=name, value=banana)
>>> RowKey: 4
>>> => (column=name, value=mango)
>>>
>>>
>>> Index Table (YYYYMMDDHH)
>>> 
>>> RowKey: 2013030114
>>> => (column=1, value=)
>>> => (column=2, value=)
>>> => (column=3, value=)
>>> RowKey: 2013030115
>>> => (column=4, value=)
>>>
>>>
>>> I would like to know, how to implement below in MapReduce
>>> 1) first query the Index Table by RowKey: 2013030114
>>> 2) then pass the Index Table column names  (1,2,3) to query the Data
>>> Table
>>>
>>> Thanks in advance.
>>>
>>>
>>>
>>
>