Hi List,

I have just started using Spark and am trying to create a DataFrame from an Avro file stored in Amazon S3. I am using the Spark-Avro library <https://github.com/databricks/spark-avro> for this. The code I'm using is shown below. Nothing fancy, just the basic prototype from the Spark-Avro GitHub page.
SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
Configuration config = sc.hadoopConfiguration();
config.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
config.set("fs.s3n.awsAccessKeyId", "***********************");
config.set("fs.s3n.awsSecretAccessKey", "*********************");
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.load("s3n://bucket-name/episodes.avro", "com.databricks.spark.avro");
// DataFrame df = sqlContext.load("/Users/miqbal1/Downloads/episodes.avro", "com.databricks.spark.avro");
df.show();
df.printSchema();
df.select("title").show();
System.out.println("DONE");

But this code throws:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/05/13 19:05:53 INFO SparkContext: Running Spark version 1.3.1
2015-05-13 19:05:53.715 java[7041:87063] Unable to load realm mapping info from SCDynamicStore
15/05/13 19:05:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/13 19:05:53 INFO SecurityManager: Changing view acls to: miqbal1
15/05/13 19:05:53 INFO SecurityManager: Changing modify acls to: miqbal1
15/05/13 19:05:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(miqbal1); users with modify permissions: Set(miqbal1)
15/05/13 19:05:54 INFO Slf4jLogger: Slf4jLogger started
15/05/13 19:05:54 INFO Remoting: Starting remoting
15/05/13 19:05:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.28.210.56:51245]
15/05/13 19:05:54 INFO Utils: Successfully started service 'sparkDriver' on port 51245.
15/05/13 19:05:54 INFO SparkEnv: Registering MapOutputTracker
15/05/13 19:05:54 INFO SparkEnv: Registering BlockManagerMaster
15/05/13 19:05:54 INFO DiskBlockManager: Created local directory at /var/folders/n3/d0ghj1ln2zl0kpd8zkz4zf04mdm1y2/T/spark-e63c64f4-129c-49be-a2d6-88c6676f5ef7/blockmgr-2ae7fabb-c8df-4a49-b599-9d32401cf7ba
15/05/13 19:05:54 INFO MemoryStore: MemoryStore started with capacity 66.9 MB
15/05/13 19:05:54 INFO HttpFileServer: HTTP File server directory is /var/folders/n3/d0ghj1ln2zl0kpd8zkz4zf04mdm1y2/T/spark-5bf192d7-f08e-489f-93a4-fc47300e8388/httpd-888e7695-1499-49cc-89ba-8258fc761e70
15/05/13 19:05:54 INFO HttpServer: Starting HTTP Server
15/05/13 19:05:54 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/13 19:05:54 INFO AbstractConnector: Started SocketConnector@0.0.0.0:51246
15/05/13 19:05:54 INFO Utils: Successfully started service 'HTTP file server' on port 51246.
15/05/13 19:05:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/13 19:05:54 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/13 19:05:54 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/13 19:05:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/13 19:05:54 INFO SparkUI: Started SparkUI at http://172.28.210.56:4040
15/05/13 19:05:55 INFO Executor: Starting executor ID <driver> on host localhost
15/05/13 19:05:55 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@172.28.210.56:51245/user/HeartbeatReceiver
15/05/13 19:05:55 INFO NettyBlockTransferService: Server created on 51247
15/05/13 19:05:55 INFO BlockManagerMaster: Trying to register BlockManager
15/05/13 19:05:55 INFO BlockManagerMasterActor: Registering block manager localhost:51247 with 66.9 MB RAM, BlockManagerId(<driver>, localhost, 51247)
15/05/13 19:05:55 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
    at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:105)
    at com.databricks.spark.avro.AvroRelation.<init>(AvroRelation.scala:60)
    at com.databricks.spark.avro.DefaultSource.createRelation(DefaultSource.scala:41)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:673)
    at org.myorg.dataframe.S3DataFrame.main(S3DataFrame.java:25)

If I run the same code against a file stored on my local machine, it works fine (see the commented line in the code above). So the problem seems to be related to my S3 object, and it persists even after giving Open/Download permissions on the object to everyone.

I would really appreciate it if someone could guide me through this. Am I missing or doing something wrong in my code? Please pardon my ignorance, as I am completely new to Spark.

Thank you so much for your valuable time.

Tariq, Mohammad
about.me/mti
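P.S. In case it helps anyone reproduce this, below is a small self-contained variant I was planning to try next, with the credentials embedded directly in the s3n URI instead of being set on the Hadoop configuration. The keys are placeholders and the class name is just something I made up for the test; I am not sure whether this form will behave any differently, so please treat it as a guess on my part rather than a known workaround.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class S3DataFrameUriCreds {
    public static void main(String[] args) {
        // Same setup as before, but without touching the Hadoop configuration.
        SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Credentials embedded in the URI itself (ACCESS_KEY/SECRET_KEY are
        // placeholders, not real keys). A secret key containing '/' would
        // need to be URL-escaped here.
        DataFrame df = sqlContext.load(
                "s3n://ACCESS_KEY:SECRET_KEY@bucket-name/episodes.avro",
                "com.databricks.spark.avro");

        df.show();
        df.printSchema();
        sc.stop();
    }
}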