I would use just textFile unless you actually need a guarantee that you will be seeing a whole file at a time (textFile splits on newlines).

RDDs are immutable, so you cannot add data to them. You can, however, union two RDDs, returning a new RDD that contains all the data.
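For example, a rough sketch in PySpark (untested, with made-up HDFS paths standing in for your hourly layout):

from pyspark import SparkContext

sc = SparkContext(appName="daily-import")

# textFile accepts glob patterns, so a single call can read every
# hourly file for a day; each element of the RDD is one line of text.
day = sc.textFile("hdfs:///data/2014/10/01/*/*")

# wholeTextFiles yields (filename, entire file contents) pairs instead,
# for when you really do need to see a whole file at a time.
pairs = sc.wholeTextFiles("hdfs:///data/2014/10/01/*")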
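And if you have already loaded hours separately, union is how you combine them (again just a sketch; the variable names are made up):

hour00 = sc.textFile("hdfs:///data/2014/10/01/00/*")
hour01 = sc.textFile("hdfs:///data/2014/10/01/01/*")

# union returns a new RDD backed by both inputs; neither input changes.
both = hour00.union(hour01)

# SparkContext.union does the same for a whole list of RDDs.
day = sc.union([hour00, hour01])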
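For the count(*) query from your example below, once the records are parsed into an RDD of dicts (the Avro deserialization itself is what the spark-avro library quoted below is for), the Spark SQL pattern would look roughly like this, continuing from the day RDD above:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

def parse_record(line):
    # Placeholder: real code would deserialize an Avro record here and
    # return one dict per record.
    return {"a": int(line.split("\t")[0])}

records = day.map(parse_record)

# inferSchema turns an RDD of dicts into a SchemaRDD, which can be
# queried with SQL once it is registered as a table.
schema_rdd = sqlContext.inferSchema(records)
schema_rdd.registerTempTable("day")

count = sqlContext.sql("SELECT COUNT(*) FROM day WHERE a = 2").collect()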
On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint <sam.fl...@magnetic.com> wrote:

> Michael,
>     Thanks for your help. I found wholeTextFiles(), which I can use to
> import all files in a directory. I believe that would work if all the
> files existed in the same directory, but currently the files come in by
> the hour, in a layout somewhat like ../2014/10/01/00/filename, with one
> file per hour.
>
> Do I create an RDD and add to it? Is that possible? My example query
> would be select count(*) from (entire day RDD) where a=2. "a" would exist
> in all files multiple times with multiple values.
>
> I don't see anywhere in the documentation how to import a file, create an
> RDD, then import another file into that RDD, kind of like in MySQL, where
> you create a table, import data, then import more data. This may be my
> ignorance because I am not that familiar with Spark, but I would expect
> to import data into a single RDD before performing analytics on it.
>
> Thank you for your time and help on this.
>
> P.S. I am using Python, if that makes a difference.
>
> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> In general you should be able to read full directories of files as a
>> single RDD/SchemaRDD. For documentation I'd suggest the programming
>> guides:
>>
>> http://spark.apache.org/docs/latest/quick-start.html
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> For Avro in particular, I have been working on a library for Spark SQL.
>> It's very early code, but you can find it here:
>> https://github.com/databricks/spark-avro
>>
>> Bug reports welcome!
>>
>> Michael
>>
>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <sam.fl...@magnetic.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am new to Spark, and I have begun reading up on Spark's RDDs as well
>>> as Spark SQL. My question is more about how to build out the RDDs and
>>> best practices. I have data that is broken down by hour into files on
>>> HDFS in Avro format. Do I need to create a separate RDD for each file?
>>> Or, using Spark SQL, a separate SchemaRDD?
>>>
>>> I want to be able to pull, let's say, an entire day of data into Spark
>>> and run some analytics on it. Then possibly a week, a month, etc.
>>>
>>> If there is documentation on this procedure or best practices for
>>> building RDDs, please point me to it.
>>>
>>> Thanks for your time,
>>> Sam
>>
>
> --
>
> MAGNE+IC
>
> Sam Flint | Lead Developer, Data Analytics