Hi,
Indeed, Hive is not able to push predicates down through an HBase table, and neither can Impala.
Broadly speaking, if you need to query your HBase table through a field other than the rowkey, you have two options: A) encode as much information as possible into the rowkey and use the rowkey as your predicate, or B) use a different storage system, or write HBase coprocessors, to maintain a secondary index. A sketch of option A follows.
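To illustrate option A, here is a rough, untested sketch from the pyspark shell (Spark 1.6, where sqlContext is already a HiveContext). The table, column family, and rowkey layout are all invented; the point is to map the filterable field into the :key column, so the predicate is expressed against the rowkey and any layer that talks to HBase directly can serve it as a range scan (see also the note at the very bottom):

# Hive table over an existing HBase table, with the rowkey exposed as a column.
# Invented rowkey layout: '<yyyyMMddHHmm>_<session_id>'.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sessions_by_date (
      rowkey  STRING,
      payload STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:payload')
    TBLPROPERTIES ('hbase.table.name' = 'sessions_by_date')
""")

# With the date encoded at the front of the rowkey, a time filter becomes a
# range condition on the key instead of a filter on an ordinary data column.
sqlContext.sql("""
    SELECT count(*)
    FROM sessions_by_date
    WHERE rowkey >= '201601280000' AND rowkey < '201601290000'
""").show()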
> On 28 Jan 2016, at 12:56, Maciej Bryński <mac...@brynski.pl> wrote:
>
> Ted,
> You're right.
> hbase-site.xml resolved problems 2 and 3, but...
>
> Problem 4)
> Spark doesn't push down predicates for HiveTableScan, which means that every query is a full scan.
>
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#144L])
> +- TungstenExchange SinglePartition, None
>    +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#147L])
>       +- Project
>          +- Filter (added_date#141L >= 201601280000)
>             +- HiveTableScan [added_date#141L], MetastoreRelation dwh_diagnostics, sessions_hbase, None
>
> Is there any magic option to make this work?
>
> Regards,
> Maciek
>
> 2016-01-28 10:25 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>> For the last two problems, hbase-site.xml seems not to be on the classpath.
>>
>> Once hbase-site.xml is put on the classpath, you should be able to make progress.
>>
>> Cheers
>>
>>> On Jan 28, 2016, at 1:14 AM, Maciej Bryński <mac...@brynski.pl> wrote:
>>>
>>> Hi,
>>> I'm trying to run an SQL query on a Hive table which is stored in HBase.
>>> I'm using:
>>> - Spark 1.6.0
>>> - HDP 2.2
>>> - Hive 0.14.0
>>> - HBase 0.98.4
>>>
>>> I managed to configure a working classpath, but I have the following problems:
>>>
>>> 1) I have a UDF defined in the Hive Metastore (FUNCS table).
>>> Spark cannot use it.
>>>
>>> File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
>>> : org.apache.spark.sql.AnalysisException: undefined function dwh.str_to_map_int_str; line 55 pos 30
>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
>>> at scala.Option.getOrElse(Option.scala:120)
>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
>>> at scala.util.Try.getOrElse(Try.scala:77)
>>> at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
>>> at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:573)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:570)
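On problem 1: as far as I know, Spark 1.6 does not look up permanent UDFs registered in the Hive metastore (the FUNCS table); only functions registered in the current session resolve, hence the AnalysisException. Assuming the UDF jar is already on your classpath, a workaround sketch (the class name is a placeholder for whatever dwh.str_to_map_int_str actually points to, and the query then has to call the function unqualified):

# Re-register the metastore UDF as a session-scoped temporary function.
# 'com.example.udf.StrToMapIntStr' is a placeholder class name.
sqlContext.sql("""
    CREATE TEMPORARY FUNCTION str_to_map_int_str
    AS 'com.example.udf.StrToMapIntStr'
""")
# Subsequent queries should now resolve str_to_map_int_str (without the dwh. prefix).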
>>> 2) When I'm using SQL without this function, Spark tries to connect to ZooKeeper on localhost.
>>> I made a tunnel from localhost to one of the ZooKeeper servers, but that's not a real solution.
>>>
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:host.name=j4.jupyter1
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.version=1.8.0_66
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.vendor=Oracle Corporation
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-8-oracle/jre
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.class.path=/opt/spark/lib/mysql-connector-java-5.1.35-bin.jar:/opt/spark/lib/dwh-hbase-connector.jar:/opt/spark/lib/hive-hbase-handler-1.2.1.spark.jar:/opt/spark/lib/hbase-server.jar:/opt/spark/lib/hbase-common.jar:/opt/spark/lib/dwh-commons.jar:/opt/spark/lib/guava.jar:/opt/spark/lib/hbase-client.jar:/opt/spark/lib/hbase-protocol.jar:/opt/spark/lib/htrace-core.jar:/opt/spark/conf/:/opt/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.compiler=<NA>
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.name=Linux
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.arch=amd64
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.version=3.13.0-24-generic
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.name=mbrynski
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.home=/home/mbrynski
>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.dir=/home/mbrynski
>>> 16/01/28 10:09:18 INFO ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x36079f06, quorum=localhost:2181, baseZNode=/hbase
>>> 16/01/28 10:09:18 INFO RecoverableZooKeeper: Process identifier=hconnection-0x36079f06 connecting to ZooKeeper ensemble=localhost:2181
>>> 16/01/28 10:09:18 INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
>>> 16/01/28 10:09:18 INFO ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
>>> 16/01/28 10:09:18 INFO ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x15254709ed3c8e1, negotiated timeout = 40000
>>> 16/01/28 10:09:18 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
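On problems 2 and 3: as Ted already pointed out, connectString=localhost:2181 is the giveaway that hbase-site.xml is not being read, so the client falls back to the default quorum; the tunnel only hides that, and the NPE below is a typical downstream symptom. A quick way to check from pyspark what the JVM actually sees (HBaseConfiguration is available, judging by your java.class.path):

# If this prints 'localhost', hbase-site.xml is not on the classpath and the
# HBase client is running on its built-in defaults.
hbase_conf = sc._jvm.org.apache.hadoop.hbase.HBaseConfiguration.create()
print(hbase_conf.get("hbase.zookeeper.quorum"))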
>>> 3) After making the tunnel, I'm getting an NPE.
>>>
>>> Caused by: java.lang.NullPointerException
>>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:269)
>>> at org.apache.hadoop.hbase.zookeeper.MetaRegionTracker.blockUntilAvailable(MetaRegionTracker.java:241)
>>> at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:62)
>>> at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1203)
>>> at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1164)
>>> at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)
>>> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)
>>> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)
>>> at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)
>>> ... 91 more
>>>
>>> Do you have any ideas on how to resolve these problems?
>>>
>>> Regards,
>>> --
>>> Maciek Bryński
>
>
>
> --
> Maciek Bryński
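Finally, on problem 4: I'm not aware of any magic option; in Spark 1.6 the HiveTableScan over the HBase storage handler reads the whole table and Spark applies the filter afterwards, exactly as your plan shows. If you adopt option A and encode the date into the rowkey, you can bypass the Hive layer and hand the key range to HBase yourself. An untested sketch along the lines of Spark's own hbase_inputformat.py example (it needs the spark-examples jar for the two converter classes; the HBase table name and key layout are placeholders, and the quorum is picked up from hbase-site.xml now that it is on your classpath):

# Server-side range scan over date-prefixed rowkeys, bypassing HiveTableScan.
conf = {
    "hbase.mapreduce.inputtable": "sessions_hbase",    # placeholder HBase table name
    "hbase.mapreduce.scan.row.start": "201601280000",  # assumes date-prefixed rowkeys
    "hbase.mapreduce.scan.row.stop": "201601290000",
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
print(rdd.count())  # only the requested key range is scanned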