I'm not sure what you mean by "it could be hard to serialize complex
operations". Can you clarify?

Regardless, I think the question is: do you want to parallelize this on
multiple machines or just one?
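
If the worry is that the per-symbol fetch logic won't ship cleanly to the
executors, keeping it in a plain serializable object with no driver-side state
usually works, and the same code covers both cases: the master setting is what
decides whether the fetches run on one box or fan out over a cluster. A rough
sketch (the fetch body is only a stub, and the names are illustrative):

import org.apache.spark.sql.SparkSession

// Serializable so nothing referenced from the task closures trips up
// Java serialization.
object Sp500Fetch extends Serializable {
  case class Symbol(symbol: String, sector: String)
  case class Tick(symbol: String, sector: String, open: Double, close: Double)

  // Stub for the real Yahoo call. Build any HTTP client inside this method,
  // on the executor, rather than capturing one created on the driver.
  def pullTicks(s: Symbol): Seq[Tick] = Seq.empty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("sp500-fetch")
      .master("local[*]") // all cores of one machine; point this at a cluster
      .getOrCreate()      // manager (or use spark-submit --master) to fan out

    val symbols = Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy"))
    val ticks = spark.sparkContext
      .parallelize(symbols, 8)    // roughly 8 concurrent fetch tasks
      .flatMap(s => pullTicks(s)) // each task calls the API for its symbols

    ticks.collect().foreach(println)
    spark.stop()
  }
}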

On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2...@gmail.com> wrote:

> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However,
> it could be hard to serialize a complex operation for Spark to execute in
> parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.
>
> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> ** You do NOT need dataframes, I mean.....
>>
>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Couple of suggestions:
>>>
>>> 1. Do not use Dataset; use DataFrame in this scenario. There is no
>>> benefit from Dataset features here. With a DataFrame, you can write an
>>> arbitrary UDF that does what you want to do.
>>> 2. In fact, you do not even need DataFrames here. You would be better off
>>> with an RDD: just create an RDD of symbols and use map to do the
>>> processing (see the sketch below).
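>>>
>>> Roughly like this, e.g. in spark-shell (assuming pullSymbolFromYahoo can be
>>> written to return a plain Seq[Tick] rather than a Dataset, which also avoids
>>> the nested-Dataset encoder error):
>>>
>>> case class Symbol(symbol: String, sector: String)
>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>
>>> // Assumed shape: one symbol in, zero or more ticks out, no Dataset inside.
>>> def pullSymbolFromYahoo(symbol: String, sector: String): Seq[Tick] = Seq.empty
>>>
>>> import spark.implicits._
>>>
>>> val symbols = Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy"))
>>> // If you already have symbolDs: Dataset[Symbol], symbolDs.rdd gives the RDD.
>>> val symbolRdd = spark.sparkContext.parallelize(symbols)
>>>
>>> // Each task calls the API for its share of symbols; flatMap flattens the Seqs.
>>> val ticks = symbolRdd.flatMap(s => pullSymbolFromYahoo(s.symbol, s.sector))
>>>
>>> ticks.toDF().show()   // only needed if you want DataFrame/SQL ops afterwards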
>>>
>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com>
>>> wrote:
>>>
>>>> Do you only want to use Scala? Otherwise, I think you should be able to
>>>> accomplish this with pyspark and pandas' table-reading functionality.
>>>>
>>>> Thank you,
>>>>
>>>> Irving Duran
>>>>
>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a use case:
>>>>
>>>> I want to download S&P 500 stock data from the Yahoo API in parallel using
>>>> Spark. I have all the stock symbols as a Dataset. Then I used the code below
>>>> to call the Yahoo API for each symbol:
>>>>
>>>>
>>>>
>>>> case class Symbol(symbol: String, sector: String)
>>>>
>>>> case class Tick(symbol: String, sector: String, open: Double, close:
>>>> Double)
>>>>
>>>>
>>>> // symbolDs is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>>>
>>>>
>>>>     symbolDs.map { k =>
>>>>
>>>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>>>
>>>>     }
>>>>
>>>>
>>>> This statement fails to compile with the following error:
>>>>
>>>>
>>>> Unable to find encoder for type stored in a Dataset.  Primitive types
>>>> (Int, String, etc) and Product types (case classes) are supported by
>>>> importing spark.implicits._  Support for serializing other types will
>>>> be added in future releases.
>>>>
>>>>
>>>> My questions are:
>>>>
>>>>
>>>> 1. As you can see, this scenario is not traditional dataset handling
>>>> such as count, SQL queries, etc. Instead, it is more like a UDF that
>>>> applies an arbitrary operation to each record. Is Spark a good fit for
>>>> such a scenario?
>>>>
>>>>
>>>> 2. Regarding the compilation error, is there a fix? I did not find a
>>>> satisfactory solution online.
>>>>
>>>>
>>>> Thanks for the help!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
