The trigger for this was the process of developing code to access BigQuery data from PyCharm on premises, so that advanced analytics and graphics can be done locally.
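For reference, the read path from PyCharm ends up looking roughly like this once the setup described below is in place. This is a minimal sketch rather than my exact code: the project, dataset and table names are placeholders, and it assumes GCS credentials are already configured (e.g. via GOOGLE_APPLICATION_CREDENTIALS).

    from pyspark.sql import SparkSession

    # Local SparkSession; assumes the two connector jars described
    # below are already in $SPARK_HOME/jars
    spark = SparkSession.builder \
        .appName("BigQueryRead") \
        .getOrCreate()

    # Read a BigQuery table into a DataFrame (placeholder names)
    df = spark.read \
        .format("bigquery") \
        .option("table", "my_project.my_dataset.my_table") \
        .load()

    df.show(10)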
Writes are an issue, as BigQuery buffers the data in temporary storage on a GCS bucket before pushing it into the BigQuery database. One option is to use Dataproc clusters for the write-intensive activities ($$$) and thereafter do the reads on-premises (Linux) or locally (assuming you have a powerful enough Windows box). The issue was more with writes.

To make this work, believe it or not, is a bit of an art, as you need to find the correct version of Spark plus the correct versions of the BigQuery JAR files that work in tandem. Anyhow, reads and writes to BigQuery work with spark-3.0.1-bin-hadoop3.2 and the following two JAR files:

-rwxr--r-- 1 hduser hadoop 33943429 Jan 12 23:30 spark-bigquery-latest_2.12.jar
-rwxr--r-- 1 hduser hadoop 17663298 Jan 13 19:20 gcs-connector-hadoop3-2.2.0-shaded.jar
lrwxrwxrwx 1 hduser hadoop       38 Jan 13 19:22 gcs-connector.jar -> gcs-connector-hadoop3-2.2.0-shaded.jar

For me the option that worked *was to put these two jar files in the directory $SPARK_HOME/jars*. Adding them to spark.driver.extraClassPath in $SPARK_HOME/conf/spark-defaults.conf did not work, and using spark-submit from the PyCharm terminal with --jars added other issues. So in short, I put these two files in $SPARK_HOME/jars and it worked.

I am not sure this is ideal, but one advantage of this layout is that you can create a single container jar file, spark-libs.jar:

    jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

and put it under an HDFS directory so all nodes of the cluster can access it. You then need to point to it in $SPARK_HOME/conf/spark-defaults.conf:

    spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar

If anyone has any suggestions please let me know. Thanks
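PS. For anyone trying the same thing, here is roughly what the write path looks like with this setup, continuing from the read sketch above. Again a minimal sketch, not my exact code: the bucket, project, dataset and table names are placeholders. The temporaryGcsBucket option names the bucket where the connector stages the data before loading it into BigQuery, which is the buffering I mentioned at the start.

    # Write the DataFrame back to BigQuery; the connector stages the
    # rows in the named GCS bucket first, then loads them into the
    # target table (all names below are placeholders)
    df.write \
        .format("bigquery") \
        .option("table", "my_project.my_dataset.my_output_table") \
        .option("temporaryGcsBucket", "my-staging-bucket") \
        .mode("append") \
        .save()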