Re: Poor data locality of MR job

2012-08-01 Thread Adrien Mogenet
Did you pre-split your table or did you let the balancer assign regions to regionservers for you? Did your regionserver(s) fail? On Thu, Aug 2, 2012 at 8:31 AM, Bryan Keller wrote: > I have an 8-node cluster and a table that is pretty well balanced with on > average 36 regions/node. When I run a...
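
For reference, pre-splitting happens at table creation time. A minimal sketch, assuming a hypothetical table "mytable" with one family "d" and row keys whose first byte is roughly uniform (the split points are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("d"));

    // Seven split points -> eight regions spread out at creation time,
    // one per node on an 8-node cluster, instead of relying on the
    // balancer to move regions around afterwards.
    byte[][] splits = new byte[7][];
    for (int i = 0; i < splits.length; i++) {
      splits[i] = new byte[] { (byte) ((i + 1) * 32) };
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}
```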

Poor data locality of MR job

2012-08-01 Thread Bryan Keller
I have an 8-node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a MapReduce job on the cluster against this table, the data locality of the mappers is poor, e.g. 100 rack-local mappers and only 188 data-local mappers. I would expect nearly all of the...
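
One way to investigate: TableInputFormat assigns each split to the host currently serving that region, so dumping the region-to-server map and comparing it against where the map tasks actually ran shows where locality was lost. A sketch against the 0.94-era client, with a hypothetical table name:

```java
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HTable;

public class RegionLocalityCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // Each map task's preferred host is the region's current location;
    // compare this list with the hosts the tasks actually ran on.
    NavigableMap<HRegionInfo, ServerName> locations = table.getRegionLocations();
    for (Map.Entry<HRegionInfo, ServerName> e : locations.entrySet()) {
      System.out.println(e.getKey().getRegionNameAsString()
          + " -> " + e.getValue().getHostname());
    }
    table.close();
  }
}
```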

Region balancing question

2012-08-01 Thread Bryan Keller
I have a table on a 4-node test cluster. I also have some other tables on the cluster. The table in question has a total of 12 regions. I noticed that 1 node has 6 regions, another has zero, and the remaining two nodes have the expected 3 regions. I'm a little confused about how this can happen. The...

Re: Where to run Thrift

2012-08-01 Thread lars hofhansl
As I said, I have not used this myself... so take this with a grain of salt :) I imagine the advantage would be no additional servers/processes that would need to be monitored and managed, as well as a (slight) reduction in overall resource consumption. On the downside, any resource leak in the...

Re: Where to run Thrift

2012-08-01 Thread Shrijeet Paliwal
Lars, thanks for the pointer, it's indeed an interesting approach. Two follow-up questions: 1. The author states "Rather than a separate process, it can be *advantageous* in some situations for each RegionServer to embed their own ThriftServer"; do you happen to have insights into what those...

RE: Retrieve Put timestamp

2012-08-01 Thread Ramkrishna.S.Vasudevan
+1. Anyway, all mutations extend OperationWithAttributes as well. Regards Ram > -Original Message- > From: Anoop Sam John [mailto:anoo...@huawei.com] > Sent: Thursday, August 02, 2012 10:13 AM > To: user@hbase.apache.org > Subject: RE: Retrieve Put timestamp > > Currently in Append there is...

RE: Retrieve Put timestamp

2012-08-01 Thread Anoop Sam John
Currently in Append there is a setter to specify whether to return the result or not. Could we use a similar approach for Put? The returned TS might only be needed in specific use cases. Maybe, in a generic way, we could return the attributes of the Mutation? Then anything the client needs back can be...
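
For reference, a minimal sketch of the Append setter Anoop refers to, with hypothetical table/family/qualifier names: the server-assigned timestamp comes back in the returned Result, which is exactly what the thread would like Put to support as well.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Append;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AppendTimestampExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    Append append = new Append(Bytes.toBytes("row1"));
    append.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("suffix"));
    append.setReturnResults(true);  // ask the server to send the new cell back

    Result result = table.append(append);
    for (KeyValue kv : result.raw()) {
      // the returned KeyValue carries the server-assigned timestamp
      System.out.println("server timestamp: " + kv.getTimestamp());
    }
    table.close();
  }
}
```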

Re: Where to run Thrift

2012-08-01 Thread lars hofhansl
There is a little-documented feature that Jonathan Gray added a while back: running a Thrift server as a thread inside each region server. This is enabled by setting hbase.regionserver.export.thrift to true in your configuration. While I have not personally tried it, it looks like a fairly...
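
For reference, the flag would go into hbase-site.xml on each region server; a configuration sketch of just that property (untested, as Lars notes):

```xml
<!-- hbase-site.xml on each region server: the flag Lars mentions -->
<property>
  <name>hbase.regionserver.export.thrift</name>
  <value>true</value>
</property>
```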

Re: Filter with State

2012-08-01 Thread lars hofhansl
The Filter is initialized per region as part of a RegionScannerImpl. So as long as all the rows you are interested in are co-located in the same region, you can keep that state in the Filter instance. You can use a custom RegionSplitPolicy to control (to some extent at least) how the rows are co-located...
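
To make that concrete, here is a sketch of a stateful filter under the 0.94 Writable-based filter API, with hypothetical family/qualifier names: it drops a row whose d:q value repeats the previous row's value, keeping that state in the per-region filter instance Lars describes. Like any custom filter, the class would have to be deployed on the region servers' classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Stateful filter sketch: skips a row when its d:q value equals the d:q value
 * of the previously accepted row. The state lives in the filter instance,
 * which exists per region scanner, so this only works while the related rows
 * sit in the same region.
 */
public class DedupeConsecutiveFilter extends FilterBase {
  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final byte[] QUALIFIER = Bytes.toBytes("q");

  private byte[] lastValue;          // carried across rows within the region
  private boolean filterCurrentRow;  // decision for the row being scanned

  @Override
  public void reset() {
    // called between rows: clear the per-row flag but keep lastValue
    filterCurrentRow = false;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    if (Bytes.equals(kv.getFamily(), FAMILY)
        && Bytes.equals(kv.getQualifier(), QUALIFIER)) {
      byte[] value = kv.getValue();
      if (lastValue != null && Arrays.equals(lastValue, value)) {
        filterCurrentRow = true;
      } else {
        lastValue = value;
      }
    }
    return ReturnCode.INCLUDE;
  }

  @Override
  public boolean filterRow() {
    return filterCurrentRow;
  }

  // 0.94 filters are Writables; nothing to serialize for this sketch.
  public void write(DataOutput out) throws IOException {
  }

  public void readFields(DataInput in) throws IOException {
  }
}
```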

Re: Region server failure question

2012-08-01 Thread Mohit Anchlia
On Wed, Aug 1, 2012 at 12:52 PM, Mohammad Tariq wrote: > Hello Mohit, > > If the replication factor is set to some value > 1, then the data is > still present on some other node (perhaps within the same rack or a > different one). And, as far as this post is concerned, it tells us > about Write Ah...

Re: Filter with State

2012-08-01 Thread Jerry Lam
Hi Lars, I understand that it is more difficult to carry state across regions/servers, but how about within a single region? Knowing that the rows in a single region have dependencies, can we have a filter with state? If a filter doesn't provide this ability, is there another mechanism in HBase to offer this...

Re: Filter with State

2012-08-01 Thread lars hofhansl
The issue here is that different rows can be located in different regions, or even on different region servers, so no local state will carry across all rows. - Original Message - From: Jerry Lam To: "user@hbase.apache.org" Cc: "user@hbase.apache.org" Sent: Wednesday, August 1, 2012 5:48 PM...

Re: Filter with State

2012-08-01 Thread Jerry Lam
Hi St.Ack: The schema cannot be changed to a single row. The API docs say "Do not rely on filters carrying state across rows; it's not reliable in current hbase as we have no handlers in place for when regions split, close or server crashes." If we manage region splitting ourselves, so the split is...

hbase and disaster recovery

2012-08-01 Thread Paul Mackles
In case anyone is interested in hbase and disaster recovery, here is a writeup I just posted: http://bruteforcedata.blogspot.com/2012/08/hbase-disaster-recovery-and-whisky.html Feedback appreciated. Thanks, Paul

Re: Filter with State

2012-08-01 Thread Stack
On Wed, Aug 1, 2012 at 10:44 PM, Jerry Lam wrote: > Hi HBase guru: > > From Lars George's talk, he mentions that a filter has no state. What if I need > to scan rows in which the decision to filter one row or not is based on the > previous row's column values? Any idea how one can implement this type...

Re: Retrieve Put timestamp

2012-08-01 Thread Stack
On Wed, Aug 1, 2012 at 7:12 PM, Wei Tan wrote: > We have a similar requirement and here is the solution we have in mind: > add a coprocessor; in prePut() get the current ms and set it on the put --- > the current implementation gets the current ms and sets it in put() > return the ms generated in prePut() to...

Not connecting to HBase using Java from web application

2012-08-01 Thread Jilani Shaik
Hi, we have a Linux instance for HBase. I am trying to connect to HBase using Java; when I tried using a simple program it connects to HBase and does the operations like create and get the table rows, etc. But when I use the same code in my application packaged as an EAR file and deployed, it's not...

Re: Query a version of a column efficiently

2012-08-01 Thread Jerry Lam
Thanks Suraj. I looked at the code, but it looks like the logic is not self-contained, particularly in the way HBase searches for a specific version using a TimeRange. Best Regards, Jerry On Mon, Jul 30, 2012 at 12:53 PM, Suraj Varma wrote: > You may need to set up your Eclipse workspace...
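
For context, the client-side calls involved in a version lookup look roughly like this (hypothetical table, column, and timestamp; setTimeStamp() is effectively a one-millisecond TimeRange):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionLookupExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    long version = 1343779200000L;  // hypothetical timestamp of interest

    // Exact version of one column
    Get exact = new Get(Bytes.toBytes("row1"));
    exact.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"));
    exact.setTimeStamp(version);
    Result one = table.get(exact);

    // All versions up to and including that timestamp: [0, version + 1)
    Get ranged = new Get(Bytes.toBytes("row1"));
    ranged.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"));
    ranged.setTimeRange(0, version + 1);
    ranged.setMaxVersions(10);
    Result upTo = table.get(ranged);

    System.out.println(one.isEmpty() + " / " + upTo.size());
    table.close();
  }
}
```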

Re: Region server failure question

2012-08-01 Thread Mohammad Tariq
Hello Mohit, if the replication factor is set to some value > 1, then the data is still present on some other node (perhaps within the same rack or a different one). And, as far as this post is concerned, it tells us about Write Ahead Logs, i.e. data that is still not written onto the disk. This is...

Re: Retrieve Put timestamp

2012-08-01 Thread Wei Tan
We have a similar requirement and here is the solution we have in mind: add a coprocessor and, in prePut(), get the current ms and set it on the put --- the current implementation gets the current ms and sets it in put(); return the ms generated in prePut() to the client. For now put() does not return any value. We...
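
A sketch of the first half of that idea: stamping the Put with the region server's clock in a RegionObserver (0.94-style API). Returning the generated timestamp to the client is the part that would still need a new API, since, as Wei notes, put() currently returns nothing.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

/**
 * Stamps every cell of an incoming Put with the region server's clock in
 * prePut(), so the mutation carries a known, server-generated timestamp.
 * Deploy like any RegionObserver (e.g. as a table coprocessor attribute).
 */
public class ServerTimestampObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    long now = System.currentTimeMillis();
    for (Map.Entry<byte[], List<KeyValue>> entry : put.getFamilyMap().entrySet()) {
      List<KeyValue> kvs = entry.getValue();
      for (int i = 0; i < kvs.size(); i++) {
        KeyValue kv = kvs.get(i);
        // rebuild the cell with the server timestamp instead of LATEST_TIMESTAMP
        kvs.set(i, new KeyValue(kv.getRow(), kv.getFamily(), kv.getQualifier(),
            now, kv.getValue()));
      }
    }
  }
}
```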

Re: sync on writes

2012-08-01 Thread Mohit Anchlia
On Wed, Aug 1, 2012 at 9:29 AM, lars hofhansl wrote: > "sync" is a fluffy term in HDFS. HDFS has hsync and hflush. > hflush forces all current changes at a DFSClient to all replica nodes (but > not to disk). > > Until HDFS-744, hsync would be identical to hflush. After HDFS-744, hsync > can be used...

Re: Retrieve Put timestamp

2012-08-01 Thread lars hofhansl
There is no HBase API for this. However, it could be useful in some scenarios, so maybe we could add an API for it. It's not entirely trivial, though. From: Pablo Musa To: "user@hbase.apache.org" Sent: Monday, July 30, 2012 3:13 PM Subject: Retrieve Put timestamp...

Re: sync on writes

2012-08-01 Thread lars hofhansl
"sync" is a fluffy term in HDFS. HDFS has hsync and hflush. hflush forces all current changes at a DFSClient to all replica nodes (but not to disk). Until HDFS-744 hsync would be identical to hflush. After HDFS-744 hsync can be used to force data to disk at the replicas. When HBase refers to "

Re: sync on writes

2012-08-01 Thread Jerry Lam
I believe you are talking about enabling the dfs.support.append feature? I benchmarked the difference (disabled/enabled) previously and didn't find much difference. It would be great if someone else could confirm this. Best Regards, Jerry On Wednesday, August 1, 2012, Alex Baranau wrote: > I believe...

Re: Multiple CF and versions

2012-08-01 Thread Alex Baranau
These questions have been raised many times on this mailing list and in other sources (blogs, etc.). You can find them with a little effort. Alex Baranau -- Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr On Wed, Aug 1, 2012 at 1:33 AM, Mohammad Tariq wrote: > Hello Mohit, >...

Re: Parallel scans

2012-08-01 Thread Alex Baranau
> Is there a way to execute multiple scans in parallel like get? I guess the question is whether we can (and whether it makes sense to) execute multiple scans in parallel, e.g. in multiple threads inside the client. The answer is yes, you can do it and it makes sense: HBase is likely to be able to process much more...
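
A minimal sketch of that client-side approach, with a hypothetical table and hypothetical split points: each thread scans its own disjoint key range through its own HTable instance, since HTable is not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelScanExample {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    // hypothetical split points dividing the key space into three ranges
    final byte[][] bounds = { Bytes.toBytes("a"), Bytes.toBytes("h"),
                              Bytes.toBytes("p"), Bytes.toBytes("z") };

    ExecutorService pool = Executors.newFixedThreadPool(bounds.length - 1);
    List<Future<Long>> counts = new ArrayList<Future<Long>>();

    for (int i = 0; i < bounds.length - 1; i++) {
      final byte[] start = bounds[i];
      final byte[] stop = bounds[i + 1];
      counts.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          // each thread opens its own HTable; HTable is not thread-safe
          HTable table = new HTable(conf, "mytable");
          Scan scan = new Scan(start, stop);
          scan.setCaching(500);
          ResultScanner scanner = table.getScanner(scan);
          long n = 0;
          while (scanner.next() != null) {
            n++;  // real code would process the Result here
          }
          scanner.close();
          table.close();
          return n;
        }
      }));
    }

    long total = 0;
    for (Future<Long> f : counts) {
      total += f.get();
    }
    pool.shutdown();
    System.out.println("rows: " + total);
  }
}
```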

Re: sync on writes

2012-08-01 Thread Alex Baranau
I believe this is not the *default* but the *current* implementation of sync(). I.e. (please correct me if I'm wrong) the n-way write approach is not available yet. You might be confusing it with the fact that, by default, sync() is called on every edit; you can change that by using "deferred log flushing".
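
The deferred-log-flush switch Alex mentions is set on the table descriptor; a sketch with hypothetical table and family names (the trade-off being that a region server crash can lose the edits buffered since the last periodic flush):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DeferredLogFlushExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("d"));
    // WAL edits for this table are flushed to HDFS periodically instead of
    // being synced on every edit; edits buffered since the last flush can be
    // lost if the region server crashes.
    desc.setDeferredLogFlush(true);

    admin.createTable(desc);
    admin.close();
  }
}
```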

Re: How to query by rowKey-infix

2012-08-01 Thread Michael Segel
Actually, with coprocessors you can create a secondary index in short order. Then your cost is going to be two fetches. Trying to do a partial table scan will be more expensive. On Jul 31, 2012, at 12:41 PM, Matt Corgan wrote: > When deciding between a table scan vs a secondary index, you should try...
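
A sketch of the kind of coprocessor-maintained index Michael means, with hypothetical table/family/qualifier names: a RegionObserver mirrors writes of an indexed column into a small index table, after which a lookup costs the two fetches he mentions (one Get on the index, one Get on the data table).

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Mirrors writes to the (hypothetical) d:email column of the data table into
 * an "email_idx" table keyed by the indexed value, with the primary row key
 * stored in the cell. Real deployments would batch these writes and handle
 * deletes/updates of the indexed value as well.
 */
public class EmailIndexObserver extends BaseRegionObserver {
  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final byte[] QUALIFIER = Bytes.toBytes("email");
  private static final byte[] INDEX_TABLE = Bytes.toBytes("email_idx");

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    List<KeyValue> kvs = put.get(FAMILY, QUALIFIER);
    if (kvs == null || kvs.isEmpty()) {
      return;  // this put does not touch the indexed column
    }
    HTableInterface index = ctx.getEnvironment().getTable(INDEX_TABLE);
    try {
      for (KeyValue kv : kvs) {
        // index row key = indexed value; cell value = primary row key
        Put indexPut = new Put(kv.getValue());
        indexPut.add(FAMILY, Bytes.toBytes("row"), kv.getRow());
        index.put(indexPut);
      }
    } finally {
      index.close();
    }
  }
}
```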

Re: How to query by rowKey-infix

2012-08-01 Thread Christian Schäfer
Thanks Matt & Jerry for your replies. The data for each row is small (some hundred Bytes). So, I will try the parallel table scan at first as you suggested... Before organizing that by myself, wouldn't it be a better idea to create a map reduce job for that? I'm not so keen on implementing seco

Re: Where to run Thrift

2012-08-01 Thread Trung Pham
Running the Thrift server on the client is more ideal: you get to cut out one network hop. On Tue, Jul 31, 2012 at 2:22 PM, Stack wrote: > On Tue, Jul 31, 2012 at 12:32 PM, Eric wrote: > > I'm currently running Thrift on all region server nodes. The reasoning is > > that you can run jobs on this cluster...

Re: Null row key

2012-08-01 Thread Igal Shilman
Hi, if the row for your key is not present, then get() will return an empty Result (a Result with no KeyValues in it); you should call result.isEmpty() first. Igal. On Wed, Aug 1, 2012 at 3:20 AM, Mohit Anchlia wrote: > Not sure how, but I am getting one null row per 9 writes when I do a GET in >...
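
A minimal sketch of the check Igal suggests, with hypothetical table and column names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EmptyResultCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    Result result = table.get(new Get(Bytes.toBytes("missing-row")));
    if (result.isEmpty()) {
      // no such row: get() returns an empty Result, not null
      System.out.println("row not found");
    } else {
      System.out.println("value: "
          + Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("q"))));
    }
    table.close();
  }
}
```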