In general you should be able to read full directories of files as a single RDD/SchemaRDD. For documentation I'd suggest the programming guides:
http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/sql-programming-guide.html

For Avro in particular, I have been working on a library for Spark SQL. It's very early code, but you can find it here:
https://github.com/databricks/spark-avro

Bug reports welcome!

Michael

On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <[email protected]> wrote:

> Hi,
>
> I am new to Spark. I have begun reading to understand Spark's RDD
> files as well as SparkSQL. My question is more about how to build out the RDD
> files and best practices. I have data that is broken down by hour into
> files on HDFS in Avro format. Do I need to create a separate RDD for each
> file? Or, using SparkSQL, a separate SchemaRDD?
>
> I want to be able to pull, let's say, an entire day of data into Spark and
> run some analytics on it. Then possibly a week, a month, etc.
>
> If there is documentation on this procedure or best practices for building
> RDDs, please point me at them.
>
> Thanks for your time,
> Sam
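
To make the "full directories as a single RDD" point concrete, here is a rough sketch of how the input paths for a day or a week could be built. The directory layout, the BASE path, and the helper names day_glob and range_paths are all assumptions made up for illustration; nothing in this thread says how the hourly files are actually organized.

```python
from datetime import date, timedelta

# Hypothetical layout (an assumption, not from the thread): one directory
# per day holding that day's hourly Avro files, e.g.
#   hdfs:///data/events/2014/11/19/*.avro
BASE = "hdfs:///data/events"


def day_glob(base, day):
    """Return a glob that matches every hourly file for a single day."""
    return "{}/{:%Y/%m/%d}/*.avro".format(base, day)


def range_paths(base, start, num_days):
    """Return comma-separated globs for num_days consecutive days."""
    days = (start + timedelta(days=i) for i in range(num_days))
    return ",".join(day_glob(base, d) for d in days)


one_day = range_paths(BASE, date(2014, 11, 19), 1)
one_week = range_paths(BASE, date(2014, 11, 13), 7)

print(one_day)
print(one_week)
```

For text input, SparkContext.textFile accepts globs and comma-separated paths like these, so sc.textFile(one_week) yields one RDD over all matching files rather than one RDD per file. Whether a particular Avro loader accepts the same kind of path string is something to check against that library's own documentation.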
