So you are saying that to query an entire day of data, I would need to create one RDD for every hour and then union them into a single RDD. Once I have that one RDD, I would be able to query for a=2 across the entire day. Please correct me if I am wrong.
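Something like the following is what I am picturing (a rough sketch in PySpark; the HDFS path layout and the field name "a" are just the examples from my earlier mail, and I am pretending each line parses as JSON since I have not wired up the Avro reading yet):

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="daily-union-sketch")

    # One RDD per hour; paths follow the .../YYYY/MM/DD/HH/ layout from below.
    hourly = [sc.textFile("hdfs:///data/2014/10/01/%02d/*" % h)
              for h in range(24)]

    # Union the 24 hourly RDDs into a single RDD covering the whole day.
    day = sc.union(hourly)

    # Placeholder parse step: our real files are Avro, so this json.loads
    # would be replaced by whatever deserializes one record per line.
    records = day.map(json.loads)

    # Equivalent of: select count(*) from (entire day RDD) where a=2
    print(records.filter(lambda r: r.get("a") == 2).count())

Or, since textFile accepts glob patterns, I suppose a single call like sc.textFile("hdfs:///data/2014/10/01/*/*") would cover the whole day without the explicit union.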
Thanks

On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust <mich...@databricks.com> wrote:

> I would use just textFile unless you actually need a guarantee that you
> will be seeing a whole file at a time (textFile splits on newlines).
>
> RDDs are immutable, so you cannot add data to them. You can, however, union
> two RDDs, returning a new RDD that contains all the data.
>
> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint <sam.fl...@magnetic.com> wrote:
>
>> Michael,
>>     Thanks for your help. I found wholeTextFiles(), which I can use to
>> import all the files in a directory. I believe that would cover the case
>> where all the files exist in the same directory. Currently the files
>> arrive by the hour in a layout somewhat like ../2014/10/01/00/filename,
>> with one file per hour.
>>
>> Do I create an RDD and add to it? Is that possible? My example query
>> would be select count(*) from (entire day RDD) where a=2. "a" would exist
>> in all files multiple times with multiple values.
>>
>> I don't see anywhere in the documentation how to import a file, create an
>> RDD, and then import another file into that RDD -- kind of like in MySQL,
>> where you create a table, import data, then import more data. This may be
>> my ignorance because I am not that familiar with Spark, but I would expect
>> to import data into a single RDD before performing analytics on it.
>>
>> Thank you for your time and help on this.
>>
>> P.S. I am using Python, if that makes a difference.
>>
>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> In general you should be able to read full directories of files as a
>>> single RDD/SchemaRDD. For documentation I'd suggest the programming
>>> guides:
>>>
>>> http://spark.apache.org/docs/latest/quick-start.html
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>>
>>> For Avro in particular, I have been working on a library for Spark SQL.
>>> It's very early code, but you can find it here:
>>> https://github.com/databricks/spark-avro
>>>
>>> Bug reports welcome!
>>>
>>> Michael
>>>
>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <sam.fl...@magnetic.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Spark and have begun reading up on RDDs as well as
>>>> Spark SQL. My question is more about how to build out RDDs and
>>>> best practices. I have data broken down by hour into files on HDFS
>>>> in Avro format. Do I need to create a separate RDD for each file, or,
>>>> using Spark SQL, a separate SchemaRDD?
>>>>
>>>> I want to be able to pull, let's say, an entire day of data into Spark
>>>> and run some analytics on it. Then possibly a week, a month, etc.
>>>>
>>>> If there is documentation on this procedure, or best practices for
>>>> building RDDs, please point me to them.
>>>>
>>>> Thanks for your time,
>>>> Sam
>>>>
>>>
>>
>> --
>>
>> *MAGNE**+**I**C*
>>
>> *Sam Flint* | *Lead Developer, Data Analytics*
>>
>

--

*MAGNE**+**I**C*

*Sam Flint* | *Lead Developer, Data Analytics*
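P.S. For once the Avro piece is sorted out, I imagine the whole day could be read and queried in one shot with the spark-avro package Michael linked, along these lines (an untested sketch: it assumes a newer DataFrame reader API than the Spark we are running, spark-avro available on the classpath, and the path and field name are again just my examples):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="avro-day-sketch")
    sqlContext = SQLContext(sc)

    # Read every hourly Avro file for the day in one call; the loader takes
    # glob paths, so no explicit union is needed. Requires spark-avro on the
    # classpath (e.g. via spark-submit --jars or --packages).
    day = (sqlContext.read.format("com.databricks.spark.avro")
           .load("hdfs:///data/2014/10/01/*/*.avro"))

    # Equivalent of: select count(*) from day where a = 2
    day.registerTempTable("day")
    print(sqlContext.sql("SELECT COUNT(*) FROM day WHERE a = 2").collect())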