I'm not sure what you mean by "it could be hard to serialize complex operations" — could you clarify?
Regardless, I think the question is: do you want to parallelize this on multiple machines or just one?

On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2...@gmail.com> wrote:

> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However,
> it could be hard to serialize complex operations for Spark to execute in
> parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.
>
> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> ** You do NOT need dataframes, I mean.....
>>
>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> A couple of suggestions:
>>>
>>> 1. Do not use Dataset; use Dataframe in this scenario. There is no
>>> benefit from Dataset features here. Using Dataframe, you can write an
>>> arbitrary UDF which can do what you want to do.
>>> 2. In fact, you do not need dataframes here. You would be better off
>>> with an RDD: just create an RDD of symbols and use map to do the
>>> processing.
>>>
>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com>
>>> wrote:
>>>
>>>> Do you only want to use Scala? Because otherwise, I think that with
>>>> pyspark and pandas read_table you should be able to accomplish what
>>>> you want.
>>>>
>>>> Thank you,
>>>>
>>>> Irving Duran
>>>>
>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a use case:
>>>>
>>>> I want to download S&P 500 stock data from the Yahoo API in parallel
>>>> using Spark. I have all the stock symbols in a Dataset. Then I used
>>>> the code below to call the Yahoo API for each symbol:
>>>>
>>>> case class Symbol(symbol: String, sector: String)
>>>>
>>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>>
>>>> // symbolDs is a Dataset[Symbol]; pullSymbolFromYahoo returns a Dataset[Tick]
>>>>
>>>> symbolDs.map { k =>
>>>>   pullSymbolFromYahoo(k.symbol, k.sector)
>>>> }
>>>>
>>>> This statement does not compile:
>>>>
>>>> Unable to find encoder for type stored in a Dataset. Primitive types
>>>> (Int, String, etc) and Product types (case classes) are supported by
>>>> importing spark.implicits._ Support for serializing other types will
>>>> be added in future releases.
>>>>
>>>> My questions are:
>>>>
>>>> 1. As you can see, this scenario is not traditional dataset handling
>>>> such as count, SQL query, etc. Instead, it is more like a UDF that
>>>> applies an arbitrary operation to each record. Is Spark good at
>>>> handling such a scenario?
>>>>
>>>> 2. Regarding the compilation error, is there any fix? I did not find a
>>>> satisfactory solution online.
>>>>
>>>> Thanks for the help!
>>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>> --
>> Best Regards,
>> Ayan Guha
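
For reference, a minimal sketch of the flatMap/encoder route, assuming Spark 2.x; pullTicksFromYahoo is a hypothetical stand-in for the real HTTP call and only returns dummy data here. The compile error above comes from pullSymbolFromYahoo returning a Dataset[Tick]: mapping over symbolDs would then produce a Dataset[Dataset[Tick]], and Spark has no encoder for Dataset values. If the fetch function returns plain case-class instances instead, flatMap keeps the result as a Dataset[Tick], for which an encoder is available via spark.implicits._.

import org.apache.spark.sql.SparkSession

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

object YahooPull {

  // Runs on the executors; returns plain case classes, not a Dataset.
  // Hypothetical stand-in: a real version would call the Yahoo endpoint
  // and parse the response into Tick rows.
  def pullTicksFromYahoo(symbol: String, sector: String): Seq[Tick] =
    Seq(Tick(symbol, sector, 0.0, 0.0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("yahoo-pull").getOrCreate()
    import spark.implicits._  // provides the Encoder for the Tick case class

    val symbolDs = Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy")).toDS()

    // flatMap over the symbols; each call yields Seq[Tick], so the result
    // is Dataset[Tick] and the case-class encoder applies.
    val ticks = symbolDs.flatMap(s => pullTicksFromYahoo(s.symbol, s.sector))

    ticks.show()
    spark.stop()
  }
}

The same idea works on the RDD side, as suggested above: symbolDs.rdd.flatMap(...) (or parallelizing the symbol list directly) avoids encoders entirely, at the cost of losing the Dataset/Dataframe API for the result.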