I would use just textFile unless you actually need a guarantee that you
will be seeing a whole file at a time (textFile splits on newlines).
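
For example (a rough, untested sketch: the app name and paths are made
up, with the glob modeled on the hourly ../2014/10/01/00/filename layout
you described), textFile accepts Hadoop glob patterns, so a single call
can pull a whole day of hourly files into one RDD:

    from pyspark import SparkContext

    sc = SparkContext(appName="daily-analytics")  # app name is arbitrary

    # textFile accepts comma-separated paths and Hadoop glob patterns,
    # so one call can cover every hourly file for a day
    # (path is illustrative only):
    day = sc.textFile("hdfs:///data/2014/10/01/*/*")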

RDDs are immutable, so you cannot add data to them.  You can, however,
union two RDDs, returning a new RDD that contains all the data.
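
A minimal sketch of that union-then-query pattern, reusing the sc from
above and assuming, purely for illustration, that the records are JSON
lines with a field named "a" (your real files are Avro, so you would
decode them with an Avro reader instead; see the spark-avro pointer
below):

    import json

    hour_00 = sc.textFile("hdfs:///data/2014/10/01/00/*")
    hour_01 = sc.textFile("hdfs:///data/2014/10/01/01/*")

    # union() modifies neither input; it returns a new RDD spanning both:
    both = hour_00.union(hour_01)

    # Your "select count(*) from (entire day RDD) where a=2", written as
    # plain RDD operations (json.loads is a stand-in for real decoding):
    count = (both.map(json.loads)
                 .filter(lambda rec: rec.get("a") == 2)
                 .count())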

On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint <sam.fl...@magnetic.com> wrote:

> Michael,
>     Thanks for your help.  I found wholeTextFiles(), which I can use to
> import all the files in a directory.  I believe that would work if all
> the files lived in the same directory, but currently the files arrive by
> the hour, in a layout somewhat like ../2014/10/01/00/filename, with one
> file per hour.
>
> Do I create an RDD and add to it?  Is that possible?  My example query
> would be select count(*) from (entire day RDD) where a=2.  "a" would
> appear in every file multiple times, with multiple values.
>
> I don't see anywhere in the documentation how to import a file, create
> an RDD, and then import another file into that same RDD, kind of like in
> MySQL where you create a table, import data, and then import more data.
> This may just be my ignorance, since I am not that familiar with Spark,
> but I would expect to import the data into a single RDD before
> performing analytics on it.
>
> Thank you for your time and help on this.
>
>
> P.S. I am using Python, if that makes a difference.
>
> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> In general, you should be able to read full directories of files as a
>> single RDD/SchemaRDD.  For documentation, I'd suggest the programming
>> guides:
>>
>> http://spark.apache.org/docs/latest/quick-start.html
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> For Avro in particular, I have been working on a library for Spark SQL.
>> It's very early code, but you can find it here:
>> https://github.com/databricks/spark-avro
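>>
>> Once the jar is on your classpath, one way to reach it from Python is
>> through the SQL data sources API.  A very rough, untested sketch (the
>> exact syntax depends on your Spark and spark-avro versions, and the
>> table name and path are made up):
>>
>>     from pyspark.sql import SQLContext
>>
>>     sqlContext = SQLContext(sc)
>>
>>     # Register the Avro files as a temporary table, then query it:
>>     sqlContext.sql("""
>>         CREATE TEMPORARY TABLE day
>>         USING com.databricks.spark.avro
>>         OPTIONS (path "hdfs:///data/2014/10/01/*/*")
>>     """)
>>     counts = sqlContext.sql("SELECT COUNT(*) FROM day WHERE a = 2")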
>>
>> Bug reports welcome!
>>
>> Michael
>>
>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <sam.fl...@magnetic.com>
>> wrote:
>>
>>> Hi,
>>>
>>>     I am new to Spark.  I have begun reading up on Spark's RDDs as
>>> well as Spark SQL.  My question is more about how to build out the
>>> RDDs and best practices.  I have data that is broken down by hour into
>>> files on HDFS in Avro format.  Do I need to create a separate RDD for
>>> each file?  Or, using Spark SQL, a separate SchemaRDD?
>>>
>>> I want to be able to pull, let's say, an entire day of data into Spark
>>> and run some analytics on it, then possibly a week, a month, etc.
>>>
>>>
>>> If there is documentation on this procedure, or best practices for
>>> building RDDs, please point me to it.
>>>
>>> Thanks for your time,
>>>    Sam
>>>
>>>
>>>
>>>
>>
>
>
> --
>
> MAGNE+IC
>
> Sam Flint | Lead Developer, Data Analytics
>
>
>
