Thank you, that works.
*Sincerely yours,*
*Raymond*
On Tue, Jun 19, 2018 at 4:36 PM, Nicolas Paris wrote:
> Hi Raymond
>
> Spark works well on a single machine too, since it benefits from multiple
> cores.
> The csv parser is based on univocity, and you might use the
> "spark.read.csv" syntax instead of the RDD API; from my experience,
> this will do better than any other csv parser.
Hi Raymond
Spark works well on a single machine too, since it benefits from multiple
cores.
The csv parser is based on univocity, and you might use the
"spark.read.csv" syntax instead of the RDD API; from my experience,
this will do better than any other csv parser.
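For readers following along, here is a minimal sketch of the DataFrame-based read Nicolas is recommending; the header/inferSchema options are ordinary spark.read settings chosen for illustration, not something he specified:

    import org.apache.spark.sql.SparkSession

    // Local-mode session; "local[*]" uses all cores on the single machine.
    val spark = SparkSession.builder()
      .appName("UserBehaviorCsv")
      .master("local[*]")
      .getOrCreate()

    // spark.read.csv goes through the built-in, univocity-backed csv source.
    val df = spark.read
      .option("header", "false")      // assuming no header row, per the sample line in the thread
      .option("inferSchema", "true")  // one extra pass over the data to guess column types
      .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")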
2018-06-19 16:43 GMT+02:00 Raymond Xie:
Thank you Matteo, Aakash and Georg:
I am attempting to get some stats first; the data looks like:
1,4152983,2355072,pv,1511871096
I would like to find out the count by key of (UserID, Behavior Type):
val bh_count =
sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv").map(_.split(","))
Single machine? Any other framework will perform better than Spark
On Tue, 19 Jun 2018 at 09:40, Aakash Basu wrote:
> Georg, just asking, can Pandas handle such a big dataset? And if that data is
> further passed into any of the sklearn modules?
>
> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler wrote:
Georg, just asking, can Pandas handle such a big dataset? And if that data is
further passed into any of the sklearn modules?
On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler wrote:
> use pandas or dask
>
> If you do want to use spark, store the dataset as parquet / orc, and then
> continue to perform analytical queries on that dataset.
use pandas or dask
If you do want to use spark, store the dataset as parquet / orc, and then
continue to perform analytical queries on that dataset.
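As a rough sketch of the csv-to-parquet route Georg mentions; the output path and the grouping columns in the final example query are illustrative assumptions, not part of his message:

    // spark is the SparkSession (available as `spark` in spark-shell).
    // One-time conversion: read the csv once and persist it as parquet.
    spark.read
      .option("inferSchema", "true")
      .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
      .write.mode("overwrite")
      .parquet("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.parquet")

    // Later analytical queries read the columnar parquet copy instead of re-parsing the csv.
    val behaviors = spark.read.parquet("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.parquet")
    behaviors.groupBy("_c0", "_c3").count().show(10)   // _c0.._c4 are Spark's default names for a header-less csv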
Raymond Xie wrote on Tue, 19 June 2018 at 04:29:
> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows); my environment
> is 20GB ssd ha