RE: Unable to Read/Write Avro RDD on cluster.

2015-03-05 Thread java8964
You can give Spark-Avro a try. It works great for our project.
https://github.com/databricks/spark-avro
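For example, a minimal sketch of reading Avro with it from spark-shell (spark-avro's early-2015 API, the same avroFile() call used later in this digest; the HDFS path and table name are placeholders):

import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._

val sqlContext = new SQLContext(sc)
// avroFile() infers the schema from the Avro files and returns a queryable SchemaRDD
val events = sqlContext.avroFile("hdfs://namenode:9000/path/to/avro")
events.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)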

> From: deepuj...@gmail.com
> Date: Thu, 5 Mar 2015 10:27:04 +0530
> Subject: Fwd: Unable to Read/Write Avro RDD on cluster.
> To: dev@spark.apache.org
> 
> I am trying to read an Avro RDD, transform it, and write it back out.
> It runs fine locally, but when I run it on the cluster I see issues
> with Avro.
> 
> 
> export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
> export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf"
> export HADOOP_CONF_DIR=/apache/hadoop/conf
> export YARN_CONF_DIR=/apache/hadoop/conf
> export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
> export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
> export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar
> export SPARK_LIBRARY_PATH="/apache/hadoop/lib/native"
> export YARN_CONF_DIR=/apache/hadoop/conf/
> 
> cd $SPARK_HOME
> 
> ./bin/spark-submit --master yarn-cluster --jars
> /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar
> --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores
> 1  --queue hdmi-spark --class com.company.ep.poc.spark.reporting.SparkApp
> /home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession
> subcommand=successevents
> outputdir=/user/dvasthimal/epdatasets/successdetail
> 
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> 15/03/04 03:20:29 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to rm2
> 15/03/04 03:20:30 INFO yarn.Client: Got Cluster metric info from
> ApplicationsManager (ASM), number of NodeManagers: 2221
> 15/03/04 03:20:30 INFO yarn.Client: Queue info ... queueName: hdmi-spark,
> queueCurrentCapacity: 0.7162806, queueMaxCapacity: 0.08,
>   queueApplicationCount = 7, queueChildQueueCount = 0
> 15/03/04 03:20:30 INFO yarn.Client: Max mem capabililty of a single
> resource in this cluster 16384
> 15/03/04 03:20:30 INFO yarn.Client: Preparing Local resources
> 15/03/04 03:20:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/03/04 03:20:30 WARN hdfs.BlockReaderLocal: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
> 
> 
> 15/03/04 03:20:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token
> 7780745 for dvasthimal on 10.115.206.112:8020
> 15/03/04 03:20:46 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar to hdfs://
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark_reporting-1.0-SNAPSHOT.jar
> 15/03/04 03:20:47 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
> to hdfs://
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark-assembly-1.0.2-hadoop2.4.1.jar
> 15/03/04 03:20:52 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar to hdfs://
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-mapred-1.7.7-hadoop2.jar
> 15/03/04 03:20:52 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/avro-1.7.7.jar to hdfs://
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-1.7.7.jar
> 15/03/04 03:20:54 INFO yarn.Client: Setting up the launch environment
> 15/03/04 03:20:54 INFO yarn.Client: Setting up container launch context
> 15/03/04 03:20:54 INFO yarn.Client: Command for starting the Spark
> ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx4096m,
> -Djava.io.tmpdir=$PWD/tmp,
> -Dspark.app.name=\"com.company.ep.poc.spark.reporting.SparkApp\",
>  -Dlog4j.configuration=log4j-spark-container.properties,
> org.apache.spark.deploy.yarn.ApplicationMaster, --class,
> com.company.ep.poc.spark.reporting.SparkApp, --jar ,
> file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar,  --args
>  'startDate=2015-02-16'  --args  'endDate=2015-02-16'  --args
>  'epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession'  --args
>  'subcommand=successevents'  --args
>  'outputdir=/user/dvasthimal/epdatasets/successdetail' , --executor-memory,
> 2048, --executor-cores, 1, --num-executors , 3, 1>, /stdout, 2>,
> /stderr)
> 15/03/04 03:20:54 INFO yarn.Client: Submitting application to ASM
> 15/03/04 03:20:54 INFO impl.YarnClientImpl: Submitted application
> application_1425075571333_61948
> 15/03/04 03:20:56 INFO y

RE: IntelliJ Runtime error

2015-04-03 Thread java8964
You have to change most of the dependencies in the spark-examples module from
"provided" to "compile" so you can run the example in IntelliJ.
Yong

> Date: Fri, 3 Apr 2015 09:22:13 -0700
> From: eng.sara.must...@gmail.com
> To: dev@spark.apache.org
> Subject: IntelliJ Runtime error
> 
> Hi,
> 
> I have built Spark 1.3.0 successfully in IntelliJ IDEA 14, but when I try to
> run the SparkPi example under the examples module I face this error:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/SparkConf
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:27)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 7 more
> 
> Could anyone help me please?

Spark 1.2.2 build problem with Hive 0.12, bringing in wrong version of avro-mapred

2015-08-12 Thread java8964
Hi,

This email is sent to both the dev and user lists; I just want to see if someone
familiar with the Spark/Maven build procedure can provide any help.
I am building Spark 1.2.2 with the following command:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0
The resulting spark-assembly-1.2.2-hadoop2.2.0.jar contains avro and avro-ipc at
version 1.7.6, but avro-mapred at version 1.7.1, which causes a weird runtime
exception when I try to read an Avro file in Spark 1.2.2, as follows:
NullPointerException
  at java.io.StringReader.<init>(StringReader.java:50)
  at org.apache.avro.Schema$Parser.parse(Schema.java:943)
  at org.apache.avro.Schema.parse(Schema.java:992)
  at org.apache.avro.mapred.AvroJob.getInputSchema(AvroJob.java:65)
  at org.apache.avro.mapred.AvroRecordReader.<init>(AvroRecordReader.java:43)
  at org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:52)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
So I ran the following command, which shows that avro-mapred 1.7.1 is brought
in by the Hive 0.12 profile:

mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0 dependency:tree -Dverbose -Dincludes=org.apache.avro
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Hive 1.2.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.2
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.2.2:compile
[INFO] |  \- org.apache.hadoop:hadoop-client:jar:2.2.0:compile (version managed from 1.0.4)
[INFO] |     \- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO] |        \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] +- org.spark-project.hive:hive-serde:jar:0.12.0-protobuf-2.5:compile
[INFO] |  +- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.1:compile
[INFO] |     \- (org.apache.avro:avro-ipc:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] +- org.apache.avro:avro:jar:1.7.6:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]    +- org.apache.avro:avro-ipc:jar:1.7.6:compile
[INFO]    |  \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO]    \- org.apache.avro:avro-ipc:jar:tests:1.7.6:compile
[INFO]       \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO]
In this case I could manually patch the final jar, replacing the avro-mapred
1.7.1 classes with the 1.7.6 ones, but I wonder if there is another solution, as
that approach is very error-prone.
Also, from the output above I can see that the avro-mapred:jar:hadoop2:1.7.6
dependency is there, but it looks like it is being omitted. I am not sure why
Maven chose the lower version, as I am not a Maven guru.
My question: in this situation, is there an easy way to build with avro-mapred
1.7.6 instead of 1.7.1?
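One alternative that comes to mind is this untested sketch (assuming, per the tree above, that the 1.7.1 copy enters only through spark-hive's hive-serde dependency): add an exclusion in sql/hive/pom.xml so only the dependency-managed avro-mapred:hadoop2:1.7.6 lands in the assembly.

<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-serde</artifactId>
  <!-- untested: drop the transitive avro-mapred 1.7.1 so the managed
       avro-mapred:hadoop2:1.7.6 is the only copy packaged -->
  <exclusions>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>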
Thanks
Yong  

Spark Job Hangs on our production cluster

2015-08-17 Thread java8964
I am comparing the Spark logs line by line between the hanging case (big
dataset) and the non-hanging case (small dataset).
In the hanging case, Spark's log looks identical to the non-hanging case while
reading the first block of data from HDFS.
But after that, starting from line 438 of spark-hang.log, I only see log output
generated by the Worker, like the following, for the next 10 minutes:
15/08/14 14:24:19 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
15/08/14 14:24:19 DEBUG Worker: [actor] handled message (0.121965 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
...
15/08/14 14:33:04 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
15/08/14 14:33:04 DEBUG Worker: [actor] handled message (0.136146 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
After almost 10 minutes of this I have to kill the job, as I know it will hang forever.
But in the good log (spark-finished.log), starting from line 361, Spark starts
to read the 2nd split of data, and I can see all the debug messages from
BlockReaderLocal and BlockManager.
Comparing the logs of these 2 cases: in the good case, from line 478, I can see
this message:

15/08/14 14:37:09 DEBUG BlockReaderLocal: putting FileInputStream for ..
But in the hanging case, when reading the 2nd split, I don't see this message
any more (it appeared for the 1st split). I believe the message should show up
here, because the 2nd split's block also exists on this Spark node; just before
it, I can see the following debug message:
15/08/14 14:24:11 DEBUG BlockReaderLocal: Created BlockReaderLocal for file /services/contact2/data/contacts/20150814004805-part-r-2.avro block BP-834217708-10.20.95.130-1438701195738:blk_1074484553_1099531839081 in datanode 10.20.95.146:50010
15/08/14 14:24:11 DEBUG Project: Creating MutableProj: WrappedArray(), inputSchema: ArrayBuffer(account_id#0L, contact_id#1, sequence_id#2, state#3, name#4, kind#5, prefix_name#6, first_name#7, middle_name#8, company_name#9, job_title#10, source_name#11, source_details#12, provider_name#13, provider_details#14, created_at#15L, create_source#16, updated_at#17L, update_source#18, accessed_at#19L, deleted_at#20L, delta#21, birthday_day#22, birthday_month#23, anniversary#24L, contact_fields#25, related_contacts#26, contact_channels#27, contact_notes#28, contact_service_addresses#29, contact_street_addresses#30), codegen:false
This log is generated on node 10.20.95.146, where Spark created a
"BlockReaderLocal" to read the data from the local node.
Now my question is: can someone give me any idea why "DEBUG BlockReaderLocal:
putting FileInputStream for" doesn't show up any more in this case?
I attached the log files again in this email, and really hope I can get some 
help from this list.
Thanks
Yong
From: java8...@hotmail.com
To: u...@spark.apache.org
Subject: RE: Spark Job Hangs on our production cluster
Date: Fri, 14 Aug 2015 15:14:10 -0400




I still want to check whether anyone can provide any help with Spark 1.2.2
hanging on our production cluster when reading big HDFS data (7800 avro
blocks), while it works fine for small data (769 avro blocks).
I enabled debug level in the Spark log4j configuration and attached the log
files, in case they help with troubleshooting this case.
Summary of our cluster:
IBM BigInsight V3.0.0.2 (running with Hadoop 2.2.0 + Hive 0.12)
42 data nodes, each one running an HDFS data node process + task tracker + Spark worker
One master, running the HDFS name node + Spark master
Another master node, running the 2nd name node + JobTracker
I ran 2 test cases, using a very simple spark-shell script to read 2 folders:
one with big data (1T of avro files), the other with small data (160G of avro
files).
The Avro schemas of the 2 folders are different, but I don't think that makes
any difference here.
The test script is like the following:

import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val testdata = sqlContext.avroFile("hdfs://namenode:9000/bigdata_folder")
// vs sqlContext.avroFile("hdfs://namenode:9000/smalldata_folder")
testdata.registerTempTable("testdata")
testdata.count()
Both cases are kicked off with the same command:

/opt/spark/bin/spark-shell --jars /opt/ibm/cclib/spark-avro.jar --conf spark.ui.port=4042 --executor-memory 24G --total-executor-cores 42 --conf spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000
When the script points to the small data folder, Spark finishes very fast; each
task scanning an HDFS block finishes within 30 seconds or less.
When the script points to the big data folder, most of the nodes finish scanning
the first HDFS block within 2 mins (longer than case 1), then the scann