Hi List,

I have just started using Spark and am trying to create a DataFrame from an
Avro file stored in Amazon S3. I am using the *Spark-Avro* library
<https://github.com/databricks/spark-avro> for this. The code I'm using is
shown below. Nothing fancy, just the basic prototype shown on the Spark-Avro
GitHub page.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

// Map the s3 scheme to the native S3 filesystem and supply the s3n credentials.
Configuration config = sc.hadoopConfiguration();
config.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
config.set("fs.s3n.awsAccessKeyId", "***********************");
config.set("fs.s3n.awsSecretAccessKey", "*********************");

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.load("s3n://bucket-name/episodes.avro", "com.databricks.spark.avro");
// DataFrame df = sqlContext.load("/Users/miqbal1/Downloads/episodes.avro", "com.databricks.spark.avro");

df.show();
df.printSchema();
df.select("title").show();
System.out.println("DONE");

But this code throws the following:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/05/13 19:05:53 INFO SparkContext: Running Spark version 1.3.1
2015-05-13 19:05:53.715 java[7041:87063] Unable to load realm mapping info from SCDynamicStore
15/05/13 19:05:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/13 19:05:53 INFO SecurityManager: Changing view acls to: miqbal1
15/05/13 19:05:53 INFO SecurityManager: Changing modify acls to: miqbal1
15/05/13 19:05:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(miqbal1); users with modify permissions: Set(miqbal1)
15/05/13 19:05:54 INFO Slf4jLogger: Slf4jLogger started
15/05/13 19:05:54 INFO Remoting: Starting remoting
15/05/13 19:05:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.28.210.56:51245]
15/05/13 19:05:54 INFO Utils: Successfully started service 'sparkDriver' on port 51245.
15/05/13 19:05:54 INFO SparkEnv: Registering MapOutputTracker
15/05/13 19:05:54 INFO SparkEnv: Registering BlockManagerMaster
15/05/13 19:05:54 INFO DiskBlockManager: Created local directory at /var/folders/n3/d0ghj1ln2zl0kpd8zkz4zf04mdm1y2/T/spark-e63c64f4-129c-49be-a2d6-88c6676f5ef7/blockmgr-2ae7fabb-c8df-4a49-b599-9d32401cf7ba
15/05/13 19:05:54 INFO MemoryStore: MemoryStore started with capacity 66.9 MB
15/05/13 19:05:54 INFO HttpFileServer: HTTP File server directory is /var/folders/n3/d0ghj1ln2zl0kpd8zkz4zf04mdm1y2/T/spark-5bf192d7-f08e-489f-93a4-fc47300e8388/httpd-888e7695-1499-49cc-89ba-8258fc761e70
15/05/13 19:05:54 INFO HttpServer: Starting HTTP Server
15/05/13 19:05:54 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/13 19:05:54 INFO AbstractConnector: Started SocketConnector@0.0.0.0:51246
15/05/13 19:05:54 INFO Utils: Successfully started service 'HTTP file server' on port 51246.
15/05/13 19:05:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/13 19:05:54 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/13 19:05:54 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/13 19:05:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/13 19:05:54 INFO SparkUI: Started SparkUI at http://172.28.210.56:4040
15/05/13 19:05:55 INFO Executor: Starting executor ID <driver> on host localhost
15/05/13 19:05:55 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@172.28.210.56:51245/user/HeartbeatReceiver
15/05/13 19:05:55 INFO NettyBlockTransferService: Server created on 51247
15/05/13 19:05:55 INFO BlockManagerMaster: Trying to register BlockManager
15/05/13 19:05:55 INFO BlockManagerMasterActor: Registering block manager localhost:51247 with 66.9 MB RAM, BlockManagerId(<driver>, localhost, 51247)
15/05/13 19:05:55 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
    at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:105)
    at com.databricks.spark.avro.AvroRelation.<init>(AvroRelation.scala:60)
    at com.databricks.spark.avro.DefaultSource.createRelation(DefaultSource.scala:41)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:673)
    at org.myorg.dataframe.S3DataFrame.main(S3DataFrame.java:25)

But if I run the same code on a file stored on my local machine, it works
fine (see the commented line in the code above). So the problem seems to be
related to my S3 access, and it persists even after giving *Open/Download*
permissions to everyone.
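
One thing I notice while re-reading my code: I map the *s3* scheme
(fs.s3.impl) to NativeS3FileSystem, but the URL I load uses the *s3n*
scheme. I don't know whether that mismatch matters, but for concreteness
this is the variant I have in mind to try next. It reuses the sc and
sqlContext from the snippet above; fs.s3n.impl is the scheme-matching
property as I understand it from the Hadoop configuration docs, and the key
values are redacted placeholders:

// Variant sketch: configure the s3n scheme itself so it matches the
// s3n:// URL being loaded. fs.s3n.impl and the two credential properties
// are standard Hadoop configuration keys; the values here are placeholders.
Configuration config = sc.hadoopConfiguration();
config.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
config.set("fs.s3n.awsAccessKeyId", "***********************");
config.set("fs.s3n.awsSecretAccessKey", "*********************");

// I have also seen examples that embed the credentials in the URL itself,
// e.g. "s3n://ACCESS_KEY:SECRET_KEY@bucket-name/episodes.avro".
DataFrame df = sqlContext.load("s3n://bucket-name/episodes.avro", "com.databricks.spark.avro");
df.show();

Would that be the right direction, or is the scheme mapping unrelated to the
NPE?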

I would really appreciate it if someone could guide me through this. Am I
missing something or doing something wrong in my code? Please pardon my
ignorance, as I am completely new to Spark.

Thank you so much for your valuable time.

Tariq, Mohammad
about.me/mti
