Our cluster recently had some issues related to network outages*.

When all the dust settled, HBase eventually "healed" itself, and almost 
everything is back to working well, with a couple of exceptions.

In particular, we have one table where almost every (Phoenix) query times out, 
which was never the case before. At around 400 million rows, it's very small 
compared to most of our other tables.

I have tried with a raw JDBC connection in Java code as well as with Aqua Data 
Studio, both of which usually work fine.

The specific failure is that after 15 minutes (the set timeout), I get a 
one-line error that says: “Error 1102 (XCL02): Cannot get all table regions”

When I look at the GUI tools (like http://<my 
server>:16010/master-status#storeStats) it shows '1' under "offline regions" 
for that table (it has 33 total regions). Almost all the other tables show '0'.

Can anyone help me troubleshoot this?

Are there Phoenix tables I can clear out that may be confused?

This isn’t an issue with the schema or skew or anything. The same table with 
the same data was lightning fast before these HBase issues.

I know there is a CLI tool for fixing HBase issues. I'm wondering whether that 
"offline region" is the cause of these timeouts.

If not, how can I figure it out?

Thanks!


* FWIW, what happened was that DNS stopped working for a while, so HBase 
started referring to all the region servers by IP address, which somewhat 
worked, until the region servers restarted. Then they were hosed until a bit of 
manual intervention.
