Yes, we use org.apache.hbase.connectors.spark:hbase-spark:1.0.0.7.2.16.0-287
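
For reference, a minimal sketch of how these coordinates might be declared, assuming an sbt build (the thread does not say which build tool is used). The resolver name and the `Test` scope on `hbase-testing-util` are assumptions for illustration; the versions are the ones mentioned in this thread.

```scala
// build.sbt fragment (assumption: sbt; adapt to Maven/Gradle as needed)

// Assumption: CDP-versioned artifacts (suffix 7.2.16.0-287) are resolved
// from Cloudera's public repository.
resolvers += "cloudera-repos" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  // Spark HBase Connector, as named in the reply above
  "org.apache.hbase.connectors.spark" % "hbase-spark" % "1.0.0.7.2.16.0-287",
  // Mini-cluster test utilities from the original question (test scope is an assumption)
  "org.apache.hbase" % "hbase-testing-util" % "2.4.6.7.2.16.0-287" % Test
)
```
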

On Wed, Oct 30, 2024 at 15:30, Gurunandan <gurunandan....@gmail.com> wrote:

> Hi Evelina,
> Do you use Spark HBase Connector (hbase-spark) as part of the unit-test
> setup?
>
> regards,
> Guru
>
> On Wed, Oct 30, 2024 at 5:35 PM Evelina Dumitrescu
> <evelina.dumitrescu....@gmail.com> wrote:
> >
> > Hello,
> >
> > TLDR; The question is also asked here:
> > https://stackoverflow.com/questions/79139516/incompatible-configuration-used-between-spark-and-hbasetestingutility
> >
> > We are using the MiniDFSCluster and MiniHBaseCluster from
> > HBaseTestingUtility to run unit tests for our Spark jobs.
> > The Spark configuration that we use is:
> >
> > conf.set("spark.sql.catalogImplementation", "hive")
> >   .set("spark.sql.warehouse.dir", getWarehousePath)
> >   .set("javax.jdo.option.ConnectionURL", s"jdbc:derby:;databaseName=$getMetastorePath;create=true")
> >   .set("shark.test.data.path", dataFilePath)
> >   .set("hive.exec.dynamic.partition.mode", "nonstrict")
> >   .set("spark.kryo.registrator", "CustomKryoRegistrar")
> >   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> >   .registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))
> >
> > For the MiniDFSCluster and MiniHBaseCluster we use the default
> > HBaseTestingUtility configuration.
> > The release versions that we use are:
> > - hbase-testing-util Cloudera CDP 2.4.6.7.2.16.0-287
> > - Spark 2.11
> >
> > In our unit tests, when we try to run a Spark job that reads Hive data,
> > we get the following exception:
> >
> > ```
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): java.lang.UnsupportedOperationException: Byte-buffer read unsupported by org.apache.hadoop.fs.BufferedFSInputStream
> >     at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:158)
> >     at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:154)
> >     at org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
> >     at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
> >     at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:546)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:516)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:510)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:459)
> >     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
> >     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
> >     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
> >     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> >     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)
> >     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:270)
> >     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:262)
> >     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
> >     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
> >     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> >     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
> >     at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
> >     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> >     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
> >     at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
> >     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:456)
> >     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)
> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:462)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >     at java.lang.Thread.run(Thread.java:750)
> >
> > Driver stacktrace:
> >     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1935)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1923)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1922)
> >     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> >     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1922)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:953)
> >     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:953)
> >     at scala.Option.foreach(Option.scala:257)
> >     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:953)
> >     ...
> > Cause: java.lang.UnsupportedOperationException: Byte-buffer read unsupported by org.apache.hadoop.fs.BufferedFSInputStream
> >     at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:158)
> >     at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:154)
> >     at org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
> >     at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
> >     at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:546)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:516)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:510)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:459)
> >     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
> > ```
> >
> > Is there an incompatible configuration used between Spark,
> > MiniDFSCluster and MiniHBaseCluster?