Re: Vnodes - HUNDRED of MapReduce jobs
I still don't see the hole in the following reasoning:

- Input splits are 64k by default. At this size, map processing time dominates job creation.
- Therefore, if job creation time dominates, you have a toy data set (< 64K * 256 vnodes = 16 MB).

Adding complexity to our input format to improve performance for this niche does not sound like a good idea to me.

On Thu, Mar 28, 2013 at 8:40 AM, cem wrote:
> Hi Alicia,
>
> The Cassandra input format creates as many mappers as there are vnodes. It is a
> known issue. You need to lower the number of vnodes :(
>
> I have a simple solution for that and am ready to write a patch. Should I
> create a ticket for it? I don't know the procedure.
>
> Regards,
> Cem
>
> On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong wrote:
>> Hi All,
>>
>> I have 3 nodes of Cassandra 1.2.3 and edited cassandra.yaml for vnodes.
>>
>> When I execute a M/R job, the console shows HUNDREDS of map tasks.
>>
>> Is this normal with vnodes? If yes, it has slowed the M/R job's completion.
>>
>> Thanks

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
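Jonathan's arithmetic can be checked directly; a minimal sketch of the bound he states, assuming one split per vnode at the default split size:

```python
# Jonathan's bound: with 256 vnodes and the default 64K split size,
# job-creation overhead can only dominate map time if the whole data
# set is smaller than 64K * 256 = 16 MB -- i.e. a toy data set.
split_size = 64 * 1024   # default input split size
vnodes = 256             # default num_tokens in cassandra.yaml

toy_threshold = split_size * vnodes
print(toy_threshold // (1024 * 1024), "MB")  # -> 16 MB
```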
Lost data after expanding cluster c* 1.2.3-1
Hi all,

I followed this tutorial to expand a 4-node C* cluster (production) and added 3 new nodes.

Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns   Host ID                               Rack
UN  10.34.142.xxx  10.79 GB  256     15.4%  4e2e26b8-aa38-428c-a8f5-e86c13eb4442  1b
UN  10.32.49.xxx   1.48 MB   256     13.7%  e86f67b6-d7cb-4b47-b090-3824a5887145  1b
UN  10.33.206.xxx  2.19 MB   256     11.9%  92af17c3-954a-4511-bc90-29a9657623e4  1b
UN  10.32.27.xxx   1.95 MB   256     14.9%  862e6b39-b380-40b4-9d61-d83cb8dacf9e  1b
UN  10.34.139.xxx  11.67 GB  256     15.5%  0324e394-b65f-46c8-acb4-1e1f87600a2c  1b
UN  10.34.147.xxx  11.18 GB  256     13.9%  cfc09822-5446-4565-a5f0-d25c917e2ce8  1b
UN  10.33.193.xxx  10.83 GB  256     14.7%  59f440db-cd2d-4041-aab4-fc8e9518c954  1b

The data were not streamed to the new nodes. Can anyone help me? Our web site is down.

Thanks a lot,
Re: Vnodes - HUNDRED of MapReduce jobs
Every map/reduce task typically has a minimum Xmx of 256 MB of memory (see mapred.child.java.opts). So if you have a 10-node cluster with 256 vnodes each, you will need to spawn 2,560 map tasks to complete a job. And on a 10-node Hadoop cluster with 5 map slots per node, you have 50 map slots.

Wouldn't it be better if the input format spawned 10 map tasks instead of 2,560?

On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis wrote:
> I still don't see the hole in the following reasoning:
> [...]
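Edward's figures work out as follows; a quick sketch (the node and slot counts are his hypothetical 10-node cluster, not measured values):

```python
import math

nodes = 10
vnodes_per_node = 256
map_slots_per_node = 5

map_tasks = nodes * vnodes_per_node    # one map task per vnode
slots = nodes * map_slots_per_node     # concurrent capacity
waves = math.ceil(map_tasks / slots)   # scheduling "waves" to drain the queue

print(map_tasks, slots, waves)  # -> 2560 50 52
```

So even though only 50 tasks run at once, the job still pays per-task startup cost 2,560 times.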
Re: Vnodes - HUNDRED of MapReduce jobs
This is the second person on-list who has mentioned that Hadoop performance has tanked after switching to vnodes.

On Fri, Mar 29, 2013 at 10:42 AM, Edward Capriolo wrote:
> Every map reduce task typically has a minimum Xmx of 256MB memory. See
> mapred.child.java.opts...
> [...]
CQL queries timing out (and had worked)
I'm running 1.2.3 and have both CQL3 tables and old-school-style CFs in my cluster.

I'd had a large insert job running for the last several days, which just ended; it had been inserting into a CQL3 table using CQL3 INSERT statements.

Now, I show no compactions going on in my cluster, but for some reason any CQL3 query I try to execute (insert or select, through cqlsh or through an external library) times out with an rpc_timeout.

If I use cassandra-cli, I can do "list tablename limit 10" and immediately get my 10 rows back. However, if I do "select * from tablename limit 10" I get the rpc_timeout error. Same table, same server. It doesn't seem to matter whether I'm hitting a CQL3-defined table or an older-style one.

Load on the nodes is relatively low at the moment.

Any suggestions short of restarting nodes? This is a pretty major issue for us right now.
Re: CQL queries timing out (and had worked)
Appears that restarting a node makes CQL available on that node again, but only that node. Looks like I'll be doing a rolling restart.

On Fri, Mar 29, 2013 at 10:26 AM, David McNelis wrote:
> I'm running 1.2.3 and have both CQL3 tabels and old school style CFs in my
> cluster.
> [...]
Cassandra/MapReduce ‘Data Locality’
Hi All,

The ColumnFamilySplit for token range (85070591730234615865843651857942052864, 127605887595351923798765477786913079296] is located on d2t0053g, but it is being submitted to d2t0051g, not d2t0053g. Is this normal? Does this matter? In this case there is no longer 'Data Locality', correct?

I'm using hadoop-1.1.2 & apache-cassandra-1.2.3.

TokenRange (1) >> 127605887595351923798765477786913079296 => 0
TokenRange (2) >> 85070591730234615865843651857942052864 => 127605887595351923798765477786913079296
TokenRange (3) >> 42535295865117307932921825928971026432 => 85070591730234615865843651857942052864
TokenRange (4) >> 0 => 42535295865117307932921825928971026432

ColumnFamilySplit((127605887595351923798765477786913079296, '-1] @[d2t0050g])
ColumnFamilySplit((-1, '0] @[d2t0050g])
ColumnFamilySplit((85070591730234615865843651857942052864, '127605887595351923798765477786913079296] @[d2t0053g])
ColumnFamilySplit((42535295865117307932921825928971026432, '85070591730234615865843651857942052864] @[d2t0052g])
ColumnFamilySplit((0, '42535295865117307932921825928971026432] @[d2t0051g])

RF=1
---
d2t0050g
KeyRange(start_token:127605887595351923798765477786913079296, end_token:-1, count:4096)

d2t0051g
KeyRange(start_token:85070591730234615865843651857942052864, end_token:127605887595351923798765477786913079296, count:4096)
Rowkey:3; columnvalue=Critics Choice Awards from ColumnFamilySplit((85070591730234615865843651857942052864, '127605887595351923798765477786913079296] @[d2t0053g])
KeyRange(start_token:117356732921465116845890410746976120467, end_token:127605887595351923798765477786913079296, count:4096)
KeyRange(start_token:0, end_token:42535295865117307932921825928971026432, count:4096)
Rowkey:1; columnvalue=Academy Awards from ColumnFamilySplit((0, '42535295865117307932921825928971026432] @[d2t0051g])
Rowkey:2; columnvalue=Golden Globe Awards from ColumnFamilySplit((0, '42535295865117307932921825928971026432] @[d2t0051g])
KeyRange(start_token:19847720572362509985402305765727304993, end_token:42535295865117307932921825928971026432, count:4096)

d2t0052g
KeyRange(start_token:42535295865117307932921825928971026432, end_token:85070591730234615865843651857942052864, count:4096)
KeyRange(start_token:-1, end_token:0, count:4096)

d2t0053g
Nil

Thanks in advance.
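The locality question above can be replayed from the ring data itself: Rowkey 3's token, 117356732921465116845890410746976120467, falls inside the range owned by d2t0053g, yet the log shows it being read via d2t0051g. A sketch of the lookup (treating ranges as half-open (start, end], with wraparound for the range ending at the ring minimum):

```python
# Token ranges and their endpoints, copied from the message above.
RING = [
    (127605887595351923798765477786913079296, 0, "d2t0050g"),  # wrapping range
    (85070591730234615865843651857942052864,
     127605887595351923798765477786913079296, "d2t0053g"),
    (42535295865117307932921825928971026432,
     85070591730234615865843651857942052864, "d2t0052g"),
    (0, 42535295865117307932921825928971026432, "d2t0051g"),
]

def owner(token):
    """Return the endpoint whose (start, end] range contains token."""
    for start, end, endpoint in RING:
        if start < end:
            if start < token <= end:
                return endpoint
        else:  # range wraps past the maximum token
            if token > start or token <= end:
                return endpoint

# Token of Rowkey 3 ("Critics Choice Awards") from the log above:
print(owner(117356732921465116845890410746976120467))  # -> d2t0053g
```

So by the ring's own bookkeeping the split belongs on d2t0053g, which is exactly why submitting it to d2t0051g loses locality.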
Re: Vnodes - HUNDRED of MapReduce jobs
My point is that if you have over 16 MB of data per node, you're going to get thousands of map tasks (that is, hundreds per node) with or without vnodes.

On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo wrote:
> Every map reduce task typically has a minimum Xmx of 256MB memory. See
> mapred.child.java.opts...
> [...]

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
Re: CQL queries timing out (and had worked)
Final reason for the problem: we'd had one node's config for the RPC server type changed from sync to hsha. So that mismatch can break RPC across the cluster, apparently.

It would be nice if there were a good way to set that in a single spot for the cluster, or to handle the mismatch differently. Otherwise, if you wanted to change from sync to hsha in a cluster, you'd have to restart the entire cluster (not a big deal), but CQL would apparently not work at all until all of your nodes had been restarted.

On Fri, Mar 29, 2013 at 10:35 AM, David McNelis wrote:
> Appears that restarting a node makes CQL available on that node again, but
> only that node.
> [...]
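For reference, the setting in question lives in cassandra.yaml and must match on every node; a sketch of the relevant fragment (the value shown is one of the two options discussed here, not a recommendation):

```yaml
# cassandra.yaml -- keep identical across the cluster; a sync/hsha
# mismatch between nodes can break CQL as described in this thread.
rpc_server_type: hsha   # or: sync (one thread per Thrift connection)
```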
Re: MultiInput/MultiGet CF in MapReduce
Hi All,

I'm thinking to do it this way:

1) get_slice ( MMDDHH ) from the Index Table.
2) With the returned list of ROWKEYs,
3) pass it to multiget_slice ( keys ... ).

But my question is: how to ensure 'Data Locality'?

On Tue, Mar 19, 2013 at 3:33 PM, aaron morton wrote:
> I would be looking at Hive or Pig, rather than writing the MapReduce.
>
> There is an example in the source cassandra distribution, or you can look
> at DataStax Enterprise to start playing with Hive.
>
> Typically with hadoop queries you want to query a lot of data; if you are
> only querying a few rows, consider writing the code in your favourite
> language.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/03/2013, at 1:29 PM, Alicia Leong wrote:
>
> Hi All
>
> I have 2 tables
>
> Data Table
> -
> RowKey: 1
> => (column=name, value=apple)
> RowKey: 2
> => (column=name, value=orange)
> RowKey: 3
> => (column=name, value=banana)
> RowKey: 4
> => (column=name, value=mango)
>
> Index Table (MMDDHH)
>
> RowKey: 2013030114
> => (column=1, value=)
> => (column=2, value=)
> => (column=3, value=)
> RowKey: 2013030115
> => (column=4, value=)
>
> I would like to know how to implement the below in MapReduce:
> 1) first query the Index Table by RowKey: 2013030114
> 2) then pass the Index Table column names (1,2,3) to query the Data Table
>
> Thanks in advance.
Re: Vnodes - HUNDRED of MapReduce jobs
Yes, but my point is that with 50 map slots you can only be processing 50 at once. So it will take 2,560/50 "waves" of mappers to complete the job.

On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis wrote:
> My point is that if you have over 16MB of data per node, you're going
> to get thousands of map tasks (that is: hundreds per node) with or
> without vnodes.
> [...]
Re: Vnodes - HUNDRED of MapReduce jobs
It should be easy to control the number of map tasks; see http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you might run into a directory with 10,000 small files, and you do not want 10,000 map tasks. This is what the CombineFileInputFormats do: they help you control the number of map tasks a job will generate.

For example, imagine I have a multi-tenant cluster. If a job kicks up 10,000 map tasks, all those tasks can starve out other jobs. Being able to say "I only want 4 map tasks per C* node regardless of the number of vnodes" would be a meaningful and useful feature.

On Fri, Mar 29, 2013 at 2:17 PM, Edward Capriolo wrote:
> Yes but my point, is with 50 map slots you can only be processing 50 at
> once. So it will take 1000/50 "waves" of mappers to complete the job.
> [...]
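The combine idea Edward describes can be sketched outside of Hadoop: group the vnode splits by endpoint and merge them until each node contributes at most N map tasks. This is an illustration of the proposal, not the actual ColumnFamilyInputFormat code, and the split representation is made up:

```python
from collections import defaultdict

def combine_splits(splits, max_tasks_per_node):
    """splits: list of (start_token, end_token, endpoint) vnode ranges.
    Returns combined splits as (list_of_ranges, endpoint) tuples, with at
    most max_tasks_per_node combined splits per endpoint."""
    by_node = defaultdict(list)
    for start, end, endpoint in splits:
        by_node[endpoint].append((start, end))

    combined = []
    for endpoint, ranges in by_node.items():
        # Deal the vnode ranges round-robin into at most N buckets;
        # each bucket becomes one map task covering several token ranges.
        buckets = [[] for _ in range(min(max_tasks_per_node, len(ranges)))]
        for i, token_range in enumerate(ranges):
            buckets[i % len(buckets)].append(token_range)
        combined.extend((bucket, endpoint) for bucket in buckets)
    return combined

# 256 vnode splits on one node collapse to 4 map tasks:
splits = [(i, i + 1, "node-a") for i in range(256)]
print(len(combine_splits(splits, 4)))  # -> 4
```

Each combined split still targets a single endpoint, so locality is preserved while the per-task JVM startup cost drops by a factor of 64.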
Re: Insert v/s Update performance
Hi Aaron,

Thank you for your input. I have been monitoring my GC activity, and looking at my heap it shows pretty linear activity, without any spikes. When I look at CPU it shows higher utilization during writes alone. I also expect heavy read traffic.

When I tried the compaction_throughput_* parameter, I observed that a higher number in my case gets better CPU utilization and keeps pending compactions pretty low. How does this parameter work?

I have 3 nodes, 2 cores per CPU, and a high write rate. So for a high-update and high-read situation, which parameters should we consider for tuning?

Thanks,
Jay

On Wed, Mar 27, 2013 at 9:55 PM, aaron morton wrote:
> * Check for GC activity in the logs
> * Check the volume the commit log is on to see if it's over-utilised.
> * Check if the dropped messages correlate to compaction; look at the
> compaction_* settings in yaml and consider reducing the throughput.
>
> Like Dean says, if you have existing data it will result in more
> compaction. You may be able to get a lot of writes through in a clean new
> cluster, but it also has to work when compaction and repair are running.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 27/03/2013, at 1:43 PM, Jay Svc wrote:
>
> Thanks Dean again!
>
> My use case is a high number of reads and writes, and out of that I am just
> focusing on writes now. I thought LCS was suitable for my situation. I
> tried the same on STCS and the results are the same.
>
> I ran nodetool tpstats and MutationStage pending is very high. At the
> same time the SSTable count and pending compactions are high too during my
> updates.
>
> Please find a snapshot of my syslog:
>
> INFO [ScheduledTasks:1] 2013-03-26 15:05:48,560 StatusLogger.java (line 116) OpsCenter.rollups86400  0,0
> INFO [FlushWriter:55] 2013-03-26 15:05:48,608 Memtable.java (line 264) Writing Memtable-InventoryPrice@1051586614(11438914/129587272 serialized/live bytes, 404320 ops)
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,561 MessagingService.java (line 658) 2701 MUTATION messages dropped in last 5000ms
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,562 StatusLogger.java (line 57) Pool Name                  Active   Pending   Blocked
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,563 StatusLogger.java (line 72) ReadStage                       0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,568 StatusLogger.java (line 72) RequestResponseStage            0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,627 StatusLogger.java (line 72) ReadRepairStage                 0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,627 StatusLogger.java (line 72) MutationStage                  32     19967         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,628 StatusLogger.java (line 72) ReplicateOnWriteStage           0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,628 StatusLogger.java (line 72) GossipStage                     0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,628 StatusLogger.java (line 72) AntiEntropyStage                0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,629 StatusLogger.java (line 72) MigrationStage                  0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,629 StatusLogger.java (line 72) StreamStage                     0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,629 StatusLogger.java (line 72) MemtablePostFlusher             1         1         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,673 StatusLogger.java (line 72) FlushWriter                     1         1         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,673 StatusLogger.java (line 72) MiscStage                       0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,673 StatusLogger.java (line 72) commitlog_archiver              0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,674 StatusLogger.java (line 72) InternalResponseStage           0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,674 StatusLogger.java (line 72) HintedHandoff                   0         0         0
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,674 StatusLogger.java (line 77) CompactionManager             127
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,675 StatusLogger.java (line 89) MessagingService             n/a       0,22
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,724 StatusLogger.java (line 99) Cache Type        Size        Capacity        KeysToSave Provider
> INFO [ScheduledTasks:1] 2013-03-26 15:05:53,725 StatusLogger.java (line 100) KeyCache        142315         2118997               all
> INFO [ScheduledTasks:1]
Re: MultiInput/MultiGet CF in MapReduce
You can use the output of describe_ring along with partitioner information to determine which nodes data lives on.

On Fri, Mar 29, 2013 at 12:33 PM, Alicia Leong wrote:
> Hi All
>
> I'm thinking to do in this way.
>
> 1) get_slice ( MMDDHH ) from Index Table.
> 2) With the returned list of ROWKEYs
> 3) Pass it to multiget_slice ( keys ... )
>
> But my questions is how to ensure 'Data Locality' ??
> [...]
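A rough sketch of Edward's suggestion, assuming RandomPartitioner (whose token is the absolute value of the key's signed 128-bit MD5) and ring data shaped like describe_ring output; the helper names here are made up for illustration:

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    """Mimic RandomPartitioner: abs value of the signed 128-bit MD5 of the key."""
    digest = int(hashlib.md5(key).hexdigest(), 16)
    if digest >= 1 << 127:        # reinterpret as a signed 128-bit integer
        digest -= 1 << 128
    return abs(digest)

def replicas_for_key(key: bytes, ring):
    """ring: list of (start_token, end_token, [endpoints]) tuples, shaped
    like describe_ring output. Returns the endpoints owning the key."""
    token = random_partitioner_token(key)
    for start, end, endpoints in ring:
        in_range = (start < token <= end) if start < end \
                   else (token > start or token <= end)  # wrapping range
        if in_range:
            return endpoints
    return []

# Toy two-node ring splitting the token space in half:
ring = [(0, 1 << 126, ["10.0.0.1"]), (1 << 126, 0, ["10.0.0.2"])]
print(replicas_for_key(b"20130314", ring))
```

With the owning endpoints known, a client can send the multiget_slice for those keys to a node that actually holds them, preserving locality.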
Re: MultiInput/MultiGet CF in MapReduce
This is the current flow for ColumnFamilyInputFormat. Please correct me if I'm wrong:

1) In ColumnFamilyInputFormat, get all nodes' token ranges using client.describe_ring.
2) Get CfSplits using client.describe_splits_ex with the token range.
3) Create a new ColumnFamilySplit with the start range, end range, and endpoint.
4) In ColumnFamilyRecordReader, query client.get_range_slices with the start range and end range of the ColumnFamilySplit at the endpoint (data node).

If I were to use client.get_slice(key): my row key is '20130314' from the Index Table.

Q1) How do I know which token range and endpoint row key '20130314' is in? Even if I manage to find out the token range and endpoint, is there an available Thrift API where I can pass (ByteBuffer key, KeyRange range)? Like a merge of client.get_slice and client.get_range_slices.

Thanks

On Sat, Mar 30, 2013 at 7:53 AM, Edward Capriolo wrote:
> You can use the output of describe_ring along with partitioner information
> to determine which nodes data lives on.
> [...]