Thank you, Gen. The changes to HBaseConverters.scala now appear to return all column qualifiers, as follows:
(u'row1', {u'qualifier': u'a', u'timestamp': u'1438716994027', u'value': u'value1', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row1'})
(u'row1', {u'qualifier': u'b', u'timestamp': u'1438717004248', u'value': u'value2', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row1'})
(u'row2', {u'qualifier': u'', u'timestamp': u'1438717014529', u'value': u'value3', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row2'})
(u'row3', {u'qualifier': u'', u'timestamp': u'1438717022756', u'value': u'value4', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row3'})

Just to be clear, you wrote "Spark update these two scripts recently." Which two scripts were you referring to?

On Friday, August 7, 2015 7:59 PM, gen tang <gen.tan...@gmail.com> wrote:

Hi,

In fact, PySpark uses org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/examples/pythonconverters/) to transform HBase Result objects into Python strings. Spark updated these two scripts recently; however, the updates are not included in the official release of Spark, so you are trying to use the new Python script with an old jar. You can clone the newest Spark code from GitHub and build the examples jar yourself; then you will get the correct result.

Cheers,
Gen

On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless <eric.bl...@yahoo.com.invalid> wrote:

I'm having some difficulty getting the desired results from the Spark Python example hbase_inputformat.py. I'm running with CDH 5.4, HBase version 1.0.0, Spark v1.3.0, and Python version 2.6.6.

I followed the example to create a test HBase table. Here's the data from the table I created:

hbase(main):001:0> scan 'dev_wx_test'
ROW                COLUMN+CELL
 row1              column=f1:a, timestamp=1438716994027, value=value1
 row1              column=f1:b, timestamp=1438717004248, value=value2
 row2              column=f1:, timestamp=1438717014529, value=value3
 row3              column=f1:, timestamp=1438717022756, value=value4
3 row(s) in 0.2620 seconds

When either of these statements is included -

hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n"))

or

hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).countByValue().items()

- we only get the following printed; (row1, value2) is not printed:

((u'row1', u'value1'), 1)
((u'row2', u'value3'), 1)
((u'row3', u'value4'), 1)

These look like similar results to the following post I found - http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-td18613.html#a18650 - but it appears the pythonconverter HBaseResultToStringConverter has been updated since then. And this problem will be resolved too.
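For reference, here is a small standalone sketch of how the converter output is meant to be consumed once the updated jar is in place. The raw string below is hypothetical, modeled on the per-cell dicts shown at the top of this thread, and assumes the updated HBaseResultToStringConverter emits one JSON document per cell, newline-joined per row:

import json

# Hypothetical raw value for row1 as the updated converter would hand it
# back: two cells, each its own JSON document, joined by "\n".
raw = ('{"qualifier": "a", "timestamp": "1438716994027", "value": "value1", '
       '"columnFamily": "f1", "type": "Put", "row": "row1"}\n'
       '{"qualifier": "b", "timestamp": "1438717004248", "value": "value2", '
       '"columnFamily": "f1", "type": "Put", "row": "row1"}')

# This mirrors flatMapValues(lambda v: v.split("\n")).mapValues(json.loads):
# split yields one string per cell, json.loads turns each into a dict.
cells = [json.loads(cell) for cell in raw.split("\n")]
print([(c["qualifier"], c["value"]) for c in cells])
# prints: [(u'a', u'value1'), (u'b', u'value2')]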
When the statement

hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)

is included, the result is:

ValueError: No JSON object could be decoded

**************************************************************************************

Here is more info on this from the log:

Traceback (most recent call last):
  File "hbase_inputformat.py", line 87, in <module>
    output = hbase_rdd.collect()
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py", line 701, in collect
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, stluhdpddev27.monsanto.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py", line 101, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py", line 1807, in <lambda>
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
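Incidentally, the failure is easy to reproduce outside Spark. Judging by the earlier output, the old jar's converter hands back bare cell values such as value1, which are not valid JSON documents - a minimal standalone check (not taken from our job):

import json

try:
    json.loads('value1')   # a bare cell value, as the old converter returns it
except ValueError as e:
    print(e)               # prints: No JSON object could be decoded

# The same call succeeds on a proper JSON document:
print(json.loads('{"row": "row1", "value": "value1"}'))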
Any suggestions would be most welcome.

****************************************************************

Below is the code we're running. We did add a few things to the original example in our attempts to get it working.

from __future__ import print_function

import sys
import json
import os.path

from pyspark import SparkContext
from pyspark.conf import SparkConf

os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/"
conf = (SparkConf().setMaster('local').setAppName('a'))

if __name__ == "__main__":
    # Require at least <host> and <table>; the <znode> argument is optional.
    if len(sys.argv) < 3:
        print("""
        Usage: hbase_inputformat <host> <table>

        Run with example jar:
        ./bin/spark-submit --driver-class-path /path/to/example/jar \
        /path/to/examples/hbase_inputformat.py <host> <table> [<znode>]
        Assumes you have some data in HBase already, running on <host>, in <table>;
        optionally, you can specify the parent znode for your HBase cluster - <znode>
        """, file=sys.stderr)
        exit(-1)

    host = sys.argv[1]
    table = sys.argv[2]
    sc = SparkContext(appName="HBaseInputFormat")

    # Other options for configuring scan behavior are available. More information at
    # https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormat.java
    conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
    if len(sys.argv) > 3:
        conf = {"hbase.zookeeper.quorum": host,
                "zookeeper.znode.parent": sys.argv[3],
                "hbase.mapreduce.inputtable": table}
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)
    hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)
    # hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n"))
    # hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).countByValue().items()

    output = hbase_rdd.collect()
    # output = hbase_rdd
    for (k, v) in output:
        print((k, v))

    sc.stop()
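If it helps, once each value decodes to a dict, the collected pairs can be regrouped by row key on the driver. This is a small post-processing sketch, not part of the original example; it assumes the output list produced by the collect() above:

from collections import defaultdict

# Regroup the (row key, cell dict) pairs so each row's cells sit together,
# e.g. rows[u'row1'] == [dict for f1:a, dict for f1:b].
rows = defaultdict(list)
for k, v in output:
    rows[k].append(v)
for row_key in sorted(rows):
    print((row_key, [c['value'] for c in rows[row_key]]))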