According to the Spark SQL documentation, this project does indeed allow Python to be used for reading and writing tables, i.e. data that is not necessarily in a text format.
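For the record, here is roughly what I expect the PySpark side to look like. This is only a minimal sketch: the table name and columns are made up, and the exact entry point may be spelled hql() rather than sql() on a Spark 1.0 build with Hive support.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-read-example")
    hive_ctx = HiveContext(sc)

    # Read rows from an existing Hive table (stored via any Hive SerDe,
    # not just text). "some_table" is a hypothetical table name.
    rows = hive_ctx.sql("SELECT key, value FROM some_table")

    # Hand the result over to "classical" PySpark operations.
    for row in rows.take(5):
        print(row)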
Thanks a lot!

Bertrand Dechoux

On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux <decho...@gmail.com> wrote:

> Thanks for the JIRA reference. I really need to look at Spark SQL.
>
> Am I right to understand that, thanks to Spark SQL, Hive data can be read (and
> it does not need to be in a text format) and then 'classical' Spark can work
> on this extraction?
>
> It seems logical, but I haven't worked with Spark SQL as of now.
>
> Does it also imply that the reverse is true? That I can write data as Hive data
> with Spark SQL using results from an arbitrary (Python) Spark application?
>
> Bertrand Dechoux
>
>
> On Thu, Apr 17, 2014 at 7:23 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Yes, this JIRA would enable that. The Hive support also handles HDFS.
>>
>> Matei
>>
>> On Apr 16, 2014, at 9:55 PM, Jesvin Jose <frank.einst...@gmail.com> wrote:
>>
>> When this is implemented, can you load/save an RDD of pickled objects to
>> HDFS?
>>
>>
>> On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>>> Hi Bertrand,
>>>
>>> We should probably add a SparkContext.pickleFile and an
>>> RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately
>>> this is not in yet, but there is an issue up to track it:
>>> https://issues.apache.org/jira/browse/SPARK-1161.
>>>
>>> In 1.0, one feature we do have now is the ability to load binary data
>>> from Hive using Spark SQL's Python API. Later we will also be able to save
>>> to Hive.
>>>
>>> Matei
>>>
>>> On Apr 16, 2014, at 4:27 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I have browsed the online documentation and it is stated that PySpark
>>> only reads text files as sources. Is that still the case?
>>> >
>>> > From what I understand, the RDD can after this first step be any
>>> serialized Python structure, provided the class definitions are properly
>>> distributed.
>>> >
>>> > Is it not possible to read back those RDDs? That is, create a flow that
>>> parses everything and then, e.g. the next week, start from the binary,
>>> structured data?
>>> >
>>> > Technically, what is the difficulty? I would assume the code reading a
>>> binary Python RDD and the code reading a binary Python file to be quite
>>> similar. Where can I learn more about this subject?
>>> >
>>> > Thanks in advance
>>> >
>>> > Bertrand
>>>
>>
>>
>> --
>> We don't beat the reaper by living longer. We beat the reaper by living
>> well and living fully. The reaper will come for all of us. Question is,
>> what do we do between the time we are born and the time he shows up? -Randy
>> Pausch
>>
>>
>
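PS: For reference, a sketch of how the SparkContext.pickleFile / RDD.saveAsPickleFile API proposed in SPARK-1161 (quoted above) might be used once it lands. The method names follow the proposal in the thread and the HDFS path is made up; none of this works on current releases yet.

    from pyspark import SparkContext

    sc = SparkContext(appName="pickle-file-example")

    # Any picklable Python objects, e.g. records produced by an earlier parsing job.
    records = sc.parallelize([{"id": i, "payload": [i, i * 2]} for i in range(100)])

    # Proposed in SPARK-1161: persist the RDD as pickled objects on HDFS ...
    records.saveAsPickleFile("hdfs:///tmp/records.pickle")

    # ... and read it back later (e.g. the next week) without re-parsing the raw text.
    restored = sc.pickleFile("hdfs:///tmp/records.pickle")
    print(restored.count())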