Hi,
I am new to Spark. I have begun reading to understand Spark's RDDs
as well as SparkSQL. My question is about how to build RDDs
and the best practices for doing so. I have data that is broken down by hour
into files on HDFS in Avro format. Do I need to create a separate RDD for each
file, or, if using SparkSQL, a separate SchemaRDD?
I want to be able to pull, let's say, an entire day of data into Spark and run
some analytics on it, then possibly a week, a month, etc.
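To make the question concrete, here is a rough Python sketch of the path
handling I have in mind. The directory layout (<base>/YYYY/MM/DD/HH.avro), the
base path, and the helper names are all made up for illustration and are not
my actual setup. My understanding, which may be wrong, is that Hadoop-based
input paths in Spark can be given as a comma-separated list, so a single RDD
might cover many hourly files. The Avro loading itself is left out.

```python
from datetime import date, timedelta

# Hypothetical layout: <base>/YYYY/MM/DD/HH.avro, one file per hour.
BASE = "hdfs:///data/events"


def hourly_paths(day, base=BASE):
    """Return the 24 hourly file paths for one calendar day."""
    return [
        "%s/%04d/%02d/%02d/%02d.avro" % (base, day.year, day.month, day.day, hour)
        for hour in range(24)
    ]


def range_paths(start, num_days, base=BASE):
    """Return the hourly paths for num_days consecutive days from start."""
    paths = []
    for offset in range(num_days):
        paths.extend(hourly_paths(start + timedelta(days=offset), base))
    return paths


def as_input_string(paths):
    """Join paths with commas, the form I assume a Spark loader would accept."""
    return ",".join(paths)
```

A week would then be range_paths(start, 7), and the joined string is what I
would expect to pass as the input path when creating one RDD.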
If there is documentation on this procedure or best practices for building
RDDs, please point me to it.
Thanks for your time,
Sam