I believe functions like sc.textFile will also accept paths with globs, for example "/data/*/", which would read all the matching directories into a single RDD. Under the covers I think it is just using Hadoop's FileInputFormat, in case you want to google for the full list of supported glob syntax.
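
Here is a rough sketch of what that could look like from Python for the hourly layout described in the thread (../2014/10/01/00/filename). The helper names day_glob and count_matching, the "/data" base path, and the predicate are made up for illustration. It also assumes plain text files with one record per line, so it does not cover Avro, and that `sc` is an existing SparkContext.

```python
def day_glob(base, year, month, day):
    """Build a glob matching every file in every hourly directory of one day.

    Hypothetical helper: assumes a base/YYYY/MM/DD/HH/filename layout.
    """
    return "{0}/{1:04d}/{2:02d}/{3:02d}/*/*".format(
        base.rstrip("/"), year, month, day)


def count_matching(sc, path_glob, predicate):
    """Read everything matched by path_glob into one RDD and count the
    lines for which predicate(line) is true.

    `sc` is expected to be a SparkContext; the glob is passed straight
    to textFile, which hands it on to Hadoop for expansion.
    """
    lines = sc.textFile(path_glob)
    return lines.filter(predicate).count()
```

Usage would be something like count_matching(sc, day_glob("/data", 2014, 10, 1), is_a_equal_2), where is_a_equal_2 is whatever function parses a line and checks a=2. Widening the glob (for example replacing the day with "*") should cover a month the same way. The alternative discussed in the thread, building one RDD per hour and combining them with union, should give the same result with more bookkeeping.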

On Thu, Nov 20, 2014 at 7:27 AM, Sam Flint <[email protected]> wrote:

> So you are saying to query an entire day of data I would need to create
> one RDD for every hour and then union them into one RDD. After I have the
> one RDD I would be able to query for a=2 throughout the entire day.
> Please correct me if I am wrong.
>
> Thanks
>
> On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust <[email protected]>
> wrote:
>
>> I would use just textFile unless you actually need a guarantee that you
>> will be seeing a whole file at time (textFile splits on new lines).
>>
>> RDDs are immutable, so you cannot add data to them. You can however
>> union two RDDs, returning a new RDD that contains all the data.
>>
>> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint <[email protected]>
>> wrote:
>>
>>> Michael,
>>> Thanks for your help. I found a wholeTextFiles() that I can use to
>>> import all files in a directory. I believe this would be the case if all
>>> the files existed in the same directory. Currently the files come in by
>>> the hour and are in a format somewhat like this ../2014/10/01/00/filename
>>> and there is one file per hour.
>>>
>>> Do I create an RDD and add to it? Is that possible? My example query
>>> would be select count(*) from (entire day RDD) where a=2. "a" would exist
>>> in all files multiple times with multiple values.
>>>
>>> I don't see in any documentation how to import a file create an RDD then
>>> import another file into that RDD. kinda like in mysql when you create a
>>> table import data then import more data. This may be my ignorance because
>>> I am not that familiar with spark, but I would expect to import data into a
>>> single RDD before performing analytics on it.
>>>
>>> Thank you for your time and help on this.
>>>
>>> P.S. I am using python if that makes a difference.
>>>
>>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust <
>>> [email protected]> wrote:
>>>
>>>> In general you should be able to read full directories of files as a
>>>> single RDD/SchemaRDD. For documentation I'd suggest the programming
>>>> guides:
>>>>
>>>> http://spark.apache.org/docs/latest/quick-start.html
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>>>
>>>> For Avro in particular, I have been working on a library for Spark
>>>> SQL. Its very early code, but you can find it here:
>>>> https://github.com/databricks/spark-avro
>>>>
>>>> Bug reports welcome!
>>>>
>>>> Michael
>>>>
>>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am new to spark. I have began to read to understand sparks RDD
>>>>> files as well as SparkSQL. My question is more on how to build out the
>>>>> RDD
>>>>> files and best practices. I have data that is broken down by hour into
>>>>> files on HDFS in avro format. Do I need to create a separate RDD for
>>>>> each
>>>>> file? or using SparkSQL a separate SchemaRDD?
>>>>>
>>>>> I want to be able to pull lets say an entire day of data into spark
>>>>> and run some analytics on it. Then possibly a week, a month, etc.
>>>>>
>>>>> If there is documentation on this procedure or best practives for
>>>>> building RDD's please point me at them.
>>>>>
>>>>> Thanks for your time,
>>>>> Sam
>>>>>
>>>
>>> --
>>>
>>> *MAGNE**+**I**C*
>>>
>>> *Sam Flint* | *Lead Developer, Data Analytics*
>>>
>
> --
>
> *MAGNE**+**I**C*
>
> *Sam Flint* | *Lead Developer, Data Analytics*
