In 0.94, we have AggregateImplementation, an endpoint coprocessor, which implements getRowNum(); a full example is in AggregationClient.java.
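A minimal sketch of the client side, assuming a table "myTable" with a column family "cf", and assuming AggregateImplementation has been loaded on that table (it is not loaded by default):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EndpointRowCount {
        public static void main(String[] args) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            AggregationClient aggregationClient = new AggregationClient(conf);
            // The scan must name exactly one column family; "myTable" and
            // "cf" are placeholders for your own table and family.
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf"));
            // Each region server counts its own rows inside the
            // AggregateImplementation endpoint, so no MapReduce job is
            // launched and only per-region counts cross the network.
            long rowCount = aggregationClient.rowCount(
                Bytes.toBytes("myTable"), new LongColumnInterpreter(), scan);
            System.out.println("row count: " + rowCount);
        }
    }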
Cheers

On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[email protected]> wrote:

> From your numbers below you have about 26k regions, thus each region is
> about 545 TB / 26k ≈ 21 GB. Good.
>
> How many mappers are you running?
> And just to rule out the obvious, the M/R is running on the cluster and
> not locally, right? (It will default to a local runner when it cannot
> use the M/R cluster.)
>
> Some back-of-the-envelope calculations tell me that, assuming 1 GbE
> network cards, the best you can expect for 110 machines to map through
> this data is about 11h (so way faster than what you see):
> 545 TB / (110 * 1/8 GB/s) = 545,000 GB / 13.75 GB/s ~ 40,000 s ~ 11 h.
>
> We should really add a rowcounting coprocessor to HBase and allow using
> it via M/R.
>
> -- Lars
>
>
> ________________________________
> From: James Birchfield <[email protected]>
> To: [email protected]
> Sent: Friday, September 20, 2013 5:09 PM
> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>
> I did not implement accurate timing, but the current table being counted
> has been running for about 10 hours, and the log is estimating the map
> portion at 10%:
>
> 2013-09-20 23:40:24,099 INFO [main] Job : map 10% reduce 0%
>
> So a loooong time. Like I mentioned, we have billions, if not trillions,
> of rows potentially.
>
> Thanks for the feedback on the approaches I mentioned. I was not sure if
> they would have any effect overall.
>
> I will look further into coprocessors.
>
> Thanks!
> Birch
>
> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]>
> wrote:
>
> > How long does it take for the RowCounter job on the largest table to
> > finish on your cluster?
> >
> > Just curious.
> >
> > On your options:
> >
> > 1. Probably not worth it - you may overload your cluster.
> > 2. Not sure this one differs from 1. It looks the same to me, but more
> > complex.
> > 3. The same as 1 and 2.
> >
> > Counting rows can be done efficiently if you sacrifice some accuracy:
> >
> > http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
> >
> > Yeah, you will need coprocessors for that.
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: [email protected]
> >
> > ________________________________________
> > From: James Birchfield [[email protected]]
> > Sent: Friday, September 20, 2013 3:50 PM
> > To: [email protected]
> > Subject: Re: HBase Table Row Count Optimization - A Solicitation For
> > Help
> >
> > Hadoop 2.0.0-cdh4.3.1
> > HBase 0.94.6-cdh4.3.1
> > 110 servers, 0 dead, 238.2364 average load
> >
> > Some other info, not sure if it helps or not:
> >
> > Configured Capacity: 1295277834158080 (1.15 PB)
> > Present Capacity: 1224692609430678 (1.09 PB)
> > DFS Remaining: 624376503857152 (567.87 TB)
> > DFS Used: 600316105573526 (545.98 TB)
> > DFS Used%: 49.02%
> > Under replicated blocks: 0
> > Blocks with corrupt replicas: 1
> > Missing blocks: 0
> >
> > It is hitting a production cluster, but I am not really sure how to
> > calculate the load placed on the cluster.
> >
> > On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
> >
> >> How many nodes do you have in your cluster?
> >>
> >> When counting rows, what other load would be placed on the cluster?
> >>
> >> What is the HBase version you're currently using / planning to use?
> >> Thanks
> >>
> >> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
> >> [email protected]> wrote:
> >>
> >>> After reading the documentation and scouring the mailing list
> >>> archives, I understand there is no real support for fast row
> >>> counting in HBase unless you build some sort of tracking logic into
> >>> your code. In our case, we do not have such logic, and we have
> >>> massive amounts of data already persisted. I am running into the
> >>> issue of very long execution of the RowCounter MapReduce job against
> >>> very large tables (multi-billion rows for many of them is our
> >>> estimate). I understand why this issue exists and am slowly
> >>> accepting it, but I am hoping I can solicit some possible ideas to
> >>> help speed things up a little.
> >>>
> >>> My current task is to provide total row counts on about 600 tables,
> >>> some extremely large, some not so much. Currently, I have a process
> >>> that executes the MapReduce job in process like so:
> >>>
> >>>     Job job = RowCounter.createSubmittableJob(
> >>>         ConfigManager.getConfiguration(), new String[] { tableName });
> >>>     boolean waitForCompletion = job.waitForCompletion(true);
> >>>     Counters counters = job.getCounters();
> >>>     Counter rowCounter =
> >>>         counters.findCounter(hbaseadminconnection.Counters.ROWS);
> >>>     return rowCounter.getValue();
> >>>
> >>> At the moment, each MapReduce job is executed in serial order, so we
> >>> count one table at a time. For the current implementation of this
> >>> whole process, as it stands right now, my rough timing calculations
> >>> indicate that fully counting all the rows of these 600 tables will
> >>> take anywhere between 11 and 22 days. This is not what I consider a
> >>> desirable timeframe.
> >>>
> >>> I have considered three alternative approaches to speed things up.
> >>>
> >>> First, since the application is not heavily CPU bound, I could use a
> >>> ThreadPool and execute multiple MapReduce jobs at the same time,
> >>> looking at different tables (see the sketch after this thread). I
> >>> have never done this, so I am unsure if this would cause any
> >>> unanticipated side effects.
> >>>
> >>> Second, I could distribute the processes. I could find as many
> >>> machines as can successfully talk to the desired cluster, give each
> >>> a subset of tables to work on, and then combine the results post
> >>> process.
> >>>
> >>> Third, I could combine both of the above approaches and run a
> >>> distributed set of multithreaded processes to execute the MapReduce
> >>> jobs in parallel.
> >>>
> >>> Although it seems to have been asked and answered many times, I will
> >>> ask once again. Without the need to change our current
> >>> configurations or restart the clusters, is there a faster approach
> >>> to obtaining row counts? FYI, my cache size for the Scan is set to
> >>> 1000. I have experimented with different numbers, but nothing made a
> >>> noticeable difference. Any advice or feedback would be greatly
> >>> appreciated!
> >>>
> >>> Thanks,
> >>> Birch
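Regarding the first approach above (a ThreadPool submitting several RowCounter jobs at once), here is a minimal sketch. The pool size, the table list, and the counter group/name strings are assumptions for illustration; in particular, the counter lookup follows the enum RowCounter's mapper uses in 0.94, so verify it against your version:

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCount {
        public static void main(String[] args) throws Exception {
            final Configuration conf = HBaseConfiguration.create();
            List<String> tableNames = Arrays.asList(args); // tables to count
            // Each worker thread only submits a job and waits; the heavy
            // lifting (and the load) stays on the M/R cluster.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            Map<String, Future<Long>> results =
                new LinkedHashMap<String, Future<Long>>();
            for (final String tableName : tableNames) {
                results.put(tableName, pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        Job job = RowCounter.createSubmittableJob(
                            new Configuration(conf),
                            new String[] { tableName });
                        if (!job.waitForCompletion(true)) {
                            throw new IllegalStateException(
                                "RowCounter failed for " + tableName);
                        }
                        // Group/name of the ROWS counter as emitted by
                        // RowCounter's mapper enum in 0.94; adjust if your
                        // version differs.
                        return job.getCounters().findCounter(
                            "org.apache.hadoop.hbase.mapreduce."
                                + "RowCounter$RowCounterMapper$Counters",
                            "ROWS").getValue();
                    }
                }));
            }
            for (Map.Entry<String, Future<Long>> entry : results.entrySet()) {
                System.out.println(entry.getKey() + ": "
                    + entry.getValue().get());
            }
            pool.shutdown();
        }
    }

With four concurrent jobs the cluster sees roughly 4x the scan load of a single RowCounter run, so the pool size is the knob for balancing wall-clock time against impact on production traffic.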
