So you are saying that to query an entire day of data, I would need to create
one RDD for every hour and then union them into one RDD.  Once I have that
single RDD, I would be able to query for a=2 across the entire day.   Please
correct me if I am wrong.
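
Concretely, here is the kind of thing I have in mind, as a rough PySpark
sketch (the paths are made up, and since our real data is Avro the JSON
parsing below is only a placeholder):

    import json

    from pyspark import SparkContext

    sc = SparkContext(appName="DayQuery")

    # One RDD per hourly file, then union them all into a single day RDD.
    hourly = [sc.textFile("hdfs:///data/2014/10/01/%02d/filename" % h)
              for h in range(24)]
    day = sc.union(hourly)

    # Count the records where a=2, pretending each record is a JSON line
    # (the real Avro decoding step would look different).
    count = day.filter(lambda line: json.loads(line).get("a") == 2).count()
    print(count)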

Thanks

On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> I would use just textFile unless you actually need a guarantee that you
> will be seeing a whole file at a time (textFile splits on newlines).
>
> RDDs are immutable, so you cannot add data to them.  You can, however,
> union two RDDs, which returns a new RDD containing all the data.
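>
> A minimal sketch of the union approach (hypothetical paths and names):
>
>     rdd_a = sc.textFile("hdfs:///data/2014/10/01/00/filename")
>     rdd_b = sc.textFile("hdfs:///data/2014/10/01/01/filename")
>
>     # union() returns a new RDD; rdd_a and rdd_b are unchanged.
>     combined = rdd_a.union(rdd_b)
>
> For many RDDs, sc.union(list_of_rdds) avoids chaining union() calls.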
>
> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint <sam.fl...@magnetic.com> wrote:
>
>> Michael,
>>     Thanks for your help.   I found wholeTextFiles(), which I can use to
>> import all files in a directory.  I believe that would work if all the
>> files existed in the same directory.  Currently, though, the files come in
>> by the hour, one file per hour, in a layout somewhat like
>> ../2014/10/01/00/filename.
>>
>> Do I create an RDD and add to it? Is that possible?  My example query
>> would be select count(*) from (entire day RDD) where a=2.  "a" would appear
>> in all of the files multiple times, with multiple values.
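>>
>> Roughly what I am imagining, as a sketch (the names here are made up, and
>> I am assuming the day's records can be mapped to rows with an integer
>> field a):
>>
>>     from pyspark.sql import SQLContext, Row
>>
>>     sqlContext = SQLContext(sc)
>>
>>     # day_rdd is the hypothetical RDD holding one day of parsed records.
>>     rows = day_rdd.map(lambda rec: Row(a=int(rec["a"])))
>>     schema_rdd = sqlContext.inferSchema(rows)
>>     schema_rdd.registerTempTable("day")
>>
>>     count = sqlContext.sql("SELECT COUNT(*) FROM day WHERE a = 2")
>>     print(count.collect())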
>>
>> I don't see anywhere in the documentation how to import a file, create an
>> RDD, and then import another file into that RDD, kind of like in MySQL,
>> where you create a table, import data, and then import more data.  This may
>> be my ignorance, because I am not that familiar with Spark, but I would
>> expect to import data into a single RDD before performing analytics on it.
>>
>> Thank you for your time and help on this.
>>
>>
>> P.S. I am using python if that makes a difference.
>>
>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>>
>>> In general, you should be able to read full directories of files as a
>>> single RDD/SchemaRDD.  For documentation, I'd suggest the programming
>>> guides:
>>>
>>> http://spark.apache.org/docs/latest/quick-start.html
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html
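>>>
>>> For example, something like this should work in PySpark (the paths are
>>> hypothetical, based on the layout described earlier in the thread):
>>>
>>>     # All 24 hourly files for one day, read as a single RDD via a glob:
>>>     day = sc.textFile("hdfs:///data/2014/10/01/*/filename")
>>>
>>> textFile also accepts a comma-separated list of paths, so you can mix
>>> and match locations if the layout is irregular.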
>>>
>>> For Avro in particular, I have been working on a library for Spark SQL.
>>> It's very early code, but you can find it here:
>>> https://github.com/databricks/spark-avro
>>>
>>> Bug reports welcome!
>>>
>>> Michael
>>>
>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <sam.fl...@magnetic.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>     I am new to Spark.  I have begun reading up on Spark's RDDs as well
>>>> as Spark SQL.  My question is more about how to build out the RDDs and
>>>> about best practices.  I have data broken down by hour into files on HDFS
>>>> in Avro format.  Do I need to create a separate RDD for each file?  Or,
>>>> using Spark SQL, a separate SchemaRDD for each?
>>>>
>>>> I want to be able to pull, let's say, an entire day of data into Spark
>>>> and run some analytics on it, then possibly a week, a month, etc.
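>>>>
>>>> Ideally I could write something like the following (pseudocode with
>>>> made-up paths; I don't know yet whether Spark supports such patterns):
>>>>
>>>>     day   = sc.textFile("hdfs:///data/2014/10/01/*")      # one day
>>>>     week  = sc.textFile("hdfs:///data/2014/10/0[1-7]/*")  # one week
>>>>     month = sc.textFile("hdfs:///data/2014/10/*/*")       # one month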
>>>>
>>>>
>>>> If there is documentation on this procedure, or on best practices for
>>>> building RDDs, please point me to it.
>>>>
>>>> Thanks for your time,
>>>>    Sam
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> MAGNE+IC
>>
>> Sam Flint | Lead Developer, Data Analytics
>>
>>
>>
>


-- 

MAGNE+IC

Sam Flint | Lead Developer, Data Analytics
