Hi, I'm moving my infrastructure from 1.5.2 to 1.6.0 and experiencing
serious issue. I successfully updated spark thrift server from 1.5.2 to
1.6.0. But I have standalone application, which worked fine with 1.5.2 but
failing on 1.6.0 with:
*NestedThrowables:*
*java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory*
* at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)*
* at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)*
* at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)*
Inside this application I work with hive table, which have data in json
format.
When I add
<dependency>
<groupId>org.datanucleus</groupId>
<artifactId>datanucleus-core</artifactId>
<version>4.0.0-release</version>
</dependency>
<dependency>
<groupId>org.datanucleus</groupId>
<artifactId>datanucleus-api-jdo</artifactId>
<version>4.0.0-release</version>
</dependency>
<dependency>
<groupId>org.datanucleus</groupId>
<artifactId>datanucleus-rdbms</artifactId>
<version>3.2.9</version>
</dependency>
I'm getting:
*Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence
process has been specified to use a ClassLoaderResolver of name
"datanucleus" yet this has not been found by the DataNucleus plugin
mechanism. Please check your CLASSPATH and plugin specification.*
* at
org.datanucleus.AbstractNucleusContext.<init>(AbstractNucleusContext.java:102)*
* at
org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:162)*
I have CDH 5.5. I build spark with
*./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.0
-Phive -DskipTests*
Than I publish fat jar locally:
*mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file
-Dfile=./spark-assembly.jar -DgroupId=org.spark-project
-DartifactId=my-spark-assembly -Dversion=1.6.0-SNAPSHOT -Dpackaging=jar*
Than I include dependency on this fat jar:
<dependency>
<groupId>org.spark-project</groupId>
<artifactId>my-spark-assembly</artifactId>
<version>1.6.0-SNAPSHOT</version>
</dependency>
Than I build my application with assembly plugin:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<artifactSet>
<includes>
<include>*:*</include>
</includes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/services/org.apache.hadoop.fs.FileSystem</resource>
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
<resource>log4j.properties</resource>
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer"/>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ApacheNoticeResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
Configuration of assembly plugin is copy-past from spark assembly pom.
This workflow worked for 1.5.2 and broke for 1.6.0. If I have not good
approach of creating this standalone application, please recommend
other approach, but spark-submit does not work for me - it hard for me
to connect it to Oozie.
Any suggestion would be appreciated - I'm stuck.
My spark config:
lazy val sparkConf = new SparkConf()
.setMaster("yarn-client")
.setAppName(appName)
.set("spark.yarn.queue", "jenkins")
.set("spark.executor.memory", "10g")
.set("spark.yarn.executor.memoryOverhead", "2000")
.set("spark.executor.cores", "3")
.set("spark.driver.memory", "4g")
.set("spark.shuffle.io.numConnectionsPerPeer", "5")
.set("spark.sql.autoBroadcastJoinThreshold", "200483647")
.set("spark.network.timeout", "1000s")
.set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=2g")
.set("spark.driver.maxResultSize", "2g")
.set("spark.rpc.lookupTimeout", "1000s")
.set("spark.sql.hive.convertMetastoreParquet", "false")
.set("spark.kryoserializer.buffer.max", "200m")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.yarn.driver.memoryOverhead", "1000")
.set("spark.dynamicAllocation.enabled", "true")
.set("spark.shuffle.service.enabled", "true")
.set("spark.dynamicAllocation.minExecutors", "1")
.set("spark.dynamicAllocation.maxExecutors", "20")
.set("spark.dynamicAllocation.executorIdleTimeout", "60s")
.set("spark.sql.tungsten.enabled", "false")
.set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "100s")
.setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath()))
--
*Sincerely yoursEgor Pakhomov*