This is the current flow for ColumnFamilyInputFormat. Please correct me if
I'm wrong:
1) In ColumnFamilyInputFormat, get all nodes' token ranges using
*client.describe_ring*
2) Get CfSplits using *client.describe_splits_ex* with each token range
3) new ColumnFamilySplit with start range, end range a
You can use the output of describe_ring along with partitioner information
to determine which nodes data lives on.
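Roughly, that flow in code (a sketch only; it assumes an already-connected
Thrift Cassandra.Client with the keyspace set, and the ColumnFamilySplit
constructor arguments shown are illustrative, not the exact signature):

    // assumed imports: org.apache.cassandra.thrift.*,
    //                  org.apache.cassandra.hadoop.ColumnFamilySplit, java.util.*
    List<TokenRange> ranges = client.describe_ring("MyKeyspace");
    List<ColumnFamilySplit> splits = new ArrayList<ColumnFamilySplit>();
    for (TokenRange range : ranges) {
        // sub-split each node's token range into ~64k-row chunks
        List<CfSplit> subSplits = client.describe_splits_ex(
                "MyColumnFamily", range.start_token, range.end_token, 65536);
        for (CfSplit sub : subSplits) {
            // range.endpoints are the replicas holding this range (data locality)
            splits.add(new ColumnFamilySplit(sub.start_token, sub.end_token,
                    sub.row_count, range.endpoints.toArray(new String[0])));
        }
    }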
On Fri, Mar 29, 2013 at 12:33 PM, Alicia Leong wrote:
> Hi All
>
> I’m thinking to do in this way.
>
> 1) get_slice ( MMDDHH ) from Index Table.
>
> 2) With th
Hi Aaron,
Thank you for your input. I have been monitoring my GC activity and,
looking at my heap, it shows a pretty linear pattern, without any spikes.
When I look at CPU, it shows higher utilization during writes alone. I
also expect heavy read traffic.
When I tried compaction_throughput
It should be easy to control the number of map tasks:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you
might run into a directory with 10,000 small files and you do not want
10,000 map tasks. This is what the CombineFileInputFormats do: they help you
control the number of map
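For the Cassandra input format specifically, the knob is the input split
size (rows per split); a minimal sketch, assuming a standard Hadoop Job
setup (the job name and the value passed are illustrative):

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job(new Configuration(), "example-job");
    // Bigger splits mean fewer splits, and therefore fewer map tasks.
    // The default is 65536 rows per split.
    ConfigHelper.setInputSplitSize(job.getConfiguration(), 1024 * 1024);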
Yes, but my point is that with 50 map slots you can only be processing 50 at
once. So it will take 1000/50 "waves" of mappers to complete the job.
On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis wrote:
> My point is that if you have over 16MB of data per node, you're going
> to get thousands of map
Hi All
I’m thinking to do in this way.
1) get_slice ( MMDDHH ) from Index Table.
2) With the returned list of ROWKEYs
3) Pass them to multiget_slice ( keys … )
But my question is: how do I ensure 'Data Locality'?
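Roughly what I have in mind for steps 1–3 (a sketch only; the column family
names, the index row layout, and the connected Thrift Cassandra.Client are
assumptions):

    // assumed imports: org.apache.cassandra.thrift.*,
    //                  org.apache.cassandra.utils.ByteBufferUtil,
    //                  java.nio.ByteBuffer, java.util.*
    ColumnParent indexCf = new ColumnParent("IndexTable");
    ColumnParent dataCf  = new ColumnParent("DataTable");
    SlicePredicate all = new SlicePredicate().setSlice_range(new SliceRange(
            ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER,
            false, Integer.MAX_VALUE));

    // 1) get_slice on the MMDDHH index row returns the row keys for that hour
    List<ColumnOrSuperColumn> indexCols = client.get_slice(
            ByteBufferUtil.bytes("2013032912"), indexCf, all, ConsistencyLevel.ONE);

    // 2) collect the returned row keys
    List<ByteBuffer> rowKeys = new ArrayList<ByteBuffer>();
    for (ColumnOrSuperColumn c : indexCols)
        rowKeys.add(ByteBuffer.wrap(c.getColumn().getName()));

    // 3) pass them to multiget_slice to fetch the data rows
    Map<ByteBuffer, List<ColumnOrSuperColumn>> rows =
            client.multiget_slice(rowKeys, dataCf, all, ConsistencyLevel.ONE);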
On Tue, Mar 19, 2013 at 3:33 PM, aaron morton wrot
Final reason for problem:
We'd had one node's config for rpc type changed from sync to hsha...
So that mismatch can break rpc across the cluster, apparently.
It would be nice if there was a good way to set that in a single spot for
the cluster or handle the mismatch differently. Otherwise, if y
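For reference, the setting in question is rpc_server_type in cassandra.yaml,
and it needs to match on every node (values shown are the two we had mixed):

    # cassandra.yaml -- should be identical across the cluster
    rpc_server_type: sync    # the one mismatched node had: hsha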
My point is that if you have over 16MB of data per node, you're going
to get thousands of map tasks (that is: hundreds per node) with or
without vnodes.
On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo wrote:
> Every map reduce task typically has a minimum Xmx of 256MB memory. See
> mapred.child.
Hi All,
The CfSplit highlighted in RED is in *d2t0053g*.
But why is it being submitted to *d2t0051g* and not *d2t0053g*?
Is this normal? Does it matter? In this case there is no longer 'Data
Locality', correct?
I’m using hadoop-1.1.2 & apache-cassandra-1.2.3.
TokenRange (1) >> 1276058875953519237
It appears that restarting a node makes CQL available on that node again, but
only on that node.
Looks like I'll be doing a rolling restart.
On Fri, Mar 29, 2013 at 10:26 AM, David McNelis wrote:
> I'm running 1.2.3 and have both CQL3 tables and old-school style CFs in my
> cluster.
>
> I'd had a la
I'm running 1.2.3 and have both CQL3 tables and old-school style CFs in my
cluster.
I'd had a large insert job running for the last several days which just
ended; it had been inserting using CQL3 insert statements into a CQL3
table.
Now, I show no compactions going on in my cluster but for some reas
This is the second person on the list who has mentioned that Hadoop
performance has tanked after switching to vnodes.
On Fri, Mar 29, 2013 at 10:42 AM, Edward Capriolo wrote:
> Every map reduce task typically has a minimum Xmx of 256MB memory. See
> mapred.child.java.opts...
> So if you have a 10 no
Every map reduce task typically has a minimum Xmx of 256MB memory. See
mapred.child.java.opts...
So if you have a 10-node cluster with 256 vnodes... you will need to spawn
2,560 map tasks to complete a job.
And a 10-node hadoop cluster with 5 map slots per node... you have 50 map
slots.
Wouldn't it
Hi all,
I followed this tutorial for expanding a 4-node C* cluster (production) and
adding 3 new nodes.
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load  Tokens  Owns  Host ID  Rack
UN  10.34.142.xxx
I still don't see the hole in the following reasoning:
- Input splits are 64k by default. At this size, map processing time
dominates job creation.
- Therefore, if job creation time dominates, you have a toy data set
(< 64K * 256 vnodes = 16 MB)
Adding complexity to our inputformat to improve pe