optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
Hi all, I'm a beginner with Spark, and I'm wondering if someone could provide guidance on the following 2 questions I have. Background: I have a data set growing by 6 TB p.a. I plan to use Spark to read in all the data, manipulate it and build a predictive model on it (say GBM). I plan to store th

Re: optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
> On 11 Apr 2017, at 11:07, Zeming Yu wrote: > Hi all, > I'm a beginner with spark, and I'm wondering if someone could provide guidance on the following 2 questions I have. > Background: I have a data se

how to add new column using regular expression within pyspark dataframe

2017-04-17 Thread Zeming Yu
I've got a dataframe with a column looking like this: display(flight.select("duration").show()) ++ |duration| ++ | 15h10m| | 17h0m| | 21h25m| | 14h30m| | 24h50m| | 26h10m| | 14h30m| | 23h5m| | 21h30m| | 11h50m| | 16h10m| | 15h15m| | 21h25m| | 14h25m| | 14h40m| |
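(A possible approach, not from the thread: the hour and minute parts of strings like "15h10m" can be pulled out with regexp_extract and combined into total minutes. The column names below are assumed.)

from pyspark.sql import functions as F

# Hypothetical sketch: extract the digits before "h" and before "m",
# cast them to int, and combine into a single minutes column.
flight = (flight
          .withColumn("duration_h", F.regexp_extract("duration", r"(\d+)h", 1))
          .withColumn("duration_m", F.regexp_extract("duration", r"(\d+)m", 1))
          .withColumn("duration_minutes",
                      F.col("duration_h").cast("int") * 60
                      + F.col("duration_m").cast("int")))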

Re: how to add new column using regular expression within pyspark dataframe

2017-04-20 Thread Zeming Yu
wal.wordpress.com/2015/10/02/spark-custom-udf-example/ > On Mon, Apr 17, 2017 at 8:25 PM, Zeming Yu wrote: >> I've got a dataframe with a column looking like this: >> display(flight.select("duration").show()) >> ++

Re: how to add new column using regular expression within pyspark dataframe

2017-04-22 Thread Zeming Yu
lit(flight.duration,'h').getItem(0)) > > > Thank you, > *Pushkar Gujar* > > > On Thu, Apr 20, 2017 at 4:35 AM, Zeming Yu wrote: > >> Any examples? >> >> On 20 Apr. 2017 3:44 pm, "颜发才(Yan Facai)" wrote: >> >>> How about u

udf that handles null values

2017-04-24 Thread Zeming Yu
hi all, I tried to write a UDF that handles null values: def getMinutes(hString, minString): if (hString != None) & (minString != None): return int(hString) * 60 + int(minString[:-1]) else: return None flight2 = (flight2.withColumn("duration_minutes", udfGetMinutes("duration_h", "duratio
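(A minimal null-safe sketch, not from the thread: in Python the check should be "is not None" with "and" rather than "!=" with the bitwise "&", and the UDF needs an explicit return type. Column names are assumed.)

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def get_minutes(h_string, min_string):
    # Return None when either input is null so Spark writes a null into the column.
    if h_string is not None and min_string is not None:
        return int(h_string) * 60 + int(min_string[:-1])
    return None

udf_get_minutes = F.udf(get_minutes, IntegerType())

flight2 = flight2.withColumn("duration_minutes",
                             udf_get_minutes("duration_h", "duration_m"))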

pyspark vector

2017-04-24 Thread Zeming Yu
Hi all, Beginner question: what does the 3 mean in the (3,[0,1,2],[1.0,1.0,1.0])? https://spark.apache.org/docs/2.1.0/ml-features.html id | texts | vector |-|--- 0 | Array("a", "b", "c")| (3,[0,1,2],[1.0,1.
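(For reference: that is Spark's sparse-vector notation (size, indices, values), so the 3 is the length of the vector. A small illustration:)

from pyspark.ml.linalg import SparseVector

# Length-3 vector with value 1.0 at positions 0, 1 and 2 --
# the same thing as the dense vector [1.0, 1.0, 1.0].
v = SparseVector(3, [0, 1, 2], [1.0, 1.0, 1.0])
print(v.toArray())   # [ 1.  1.  1.]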

one hot encode a column of vector

2017-04-24 Thread Zeming Yu
How do I one-hot encode a column of arrays? e.g. ['TG', 'CA'] FYI here's my code for one-hot encoding normal categorical columns. How do I make it work for a column of arrays? from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer indexers = [StringIndexer(inputCol=co
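(One possible approach, not from the thread: treat the array column as a bag of tokens and let CountVectorizer build a 0/1 vector per row with binary=True. The column names here are made up.)

from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="stop_codes", outputCol="stop_codes_vec", binary=True)
model = cv.fit(df)                 # learns the vocabulary, e.g. ['TG', 'CA', ...]
encoded = model.transform(df)      # adds a sparse 0/1 vector column per row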

Re: udf that handles null values

2017-04-24 Thread Zeming Yu
issue today at stackoverflow. > http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728 > Thank you, > *Pushkar Gujar* > On Mon, Apr 24, 2017 at 8:22 PM, Zeming Yu wro

how to find the nearest holiday

2017-04-25 Thread Zeming Yu
I have a column of dates (date type) and I'm just trying to find the nearest holiday to each date. Does anyone have any idea what went wrong below? start_date_test = flight3.select("start_date").distinct() start_date_test.show() holidays = ['2017-09-01', '2017-10-01'] +--+ |start_date| +--+
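(A sketch of one way to do this, not from the thread: cast each holiday literal to a date, take the absolute datediff against start_date, and keep the smallest with least().)

from pyspark.sql import functions as F

holidays = ['2017-09-01', '2017-10-01']
gaps = [F.abs(F.datediff(F.col("start_date"), F.lit(h).cast("date")))
        for h in holidays]

# least() needs at least two columns, which holds here with two holidays.
start_date_test.withColumn("days_to_nearest_holiday", F.least(*gaps)).show()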

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
er string to Date type when compare. > > Yu Wenpei. > > > - Original message - > From: Zeming Yu > To: user > Cc: > Subject: how to find the nearest holiday > Date: Tue, Apr 25, 2017 3:39 PM > > I have a column of dates (date type), just trying to find t

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
How could I access the first element of the holiday column? I tried the following code, but it doesn't work: start_date_test2.withColumn("diff", datediff(start_date_test2.start_date, start_date_test2.holiday[0])).show() On Tue, Apr 25, 2017 at 10:20 PM, Zeming Yu wrote: >
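(If holiday is an array column, its first element is reachable with getItem(0) or bracket indexing, and datediff comes from pyspark.sql.functions. A sketch, assuming that column type:)

from pyspark.sql.functions import datediff

start_date_test2.withColumn(
    "diff",
    datediff(start_date_test2.start_date, start_date_test2.holiday.getItem(0))
).show()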

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
ndex.html > Thank you, > *Pushkar Gujar* > On Tue, Apr 25, 2017 at 8:50 AM, Zeming Yu wrote: >> How could I access the first element of the holiday column? >> I tried the following code, but it doesn't work: >> start_date_test2.with

parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Hi, We're building a parquet based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct? Thanks, Zeming

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
anke wrote: > Depends on your queries, the data structure etc. generally flat is better, > but if your query filter is on the highest level then you may have better > performance with a nested structure, but it really depends > > > On 30. Apr 2017, at 10:19, Zeming Yu wrote: >

Re: Recommended cluster parameters

2017-04-30 Thread Zeming Yu
I've got a similar question. Would you be able to provide a rough guide (even a range is fine) on the number of nodes, cores, and total amount of RAM required? Do you want to store 1 TB, 1 PB or far more? - say 6 TB of data in parquet format on s3 Do you want to just read that data, retrieve

examples of dealing with nested parquet/ dataframe file

2017-04-30 Thread Zeming Yu
Hi, I'm still trying to decide whether to store my data as a deeply nested or a flat parquet file. The main reason for storing the nested file is that it keeps the data in its raw format, with no information loss. I have two questions: 1. Is it always necessary to flatten a nested dataframe for the purpose of b
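(On flattening itself, a minimal sketch with made-up struct and field names: nested struct fields can be selected with dotted paths, and array fields expanded with explode().)

from pyspark.sql import functions as F

flat = nested_df.select(
    F.col("event.id").alias("event_id"),        # struct field -> flat column
    F.col("event.properties.b").alias("b"),     # deeper nesting, same idea
    F.explode("event.items").alias("item"))     # one output row per array element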

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
ave to say. > Would it be more efficient if a relational database with the right index > (code field in the above case) to perform more efficiently (with spark that > uses predicate push-down)? > Hope this helps. > > Thanks, > Muthu > > On Sun, Apr 30, 2017 at 1:45 AM, Zemi

Spark books

2017-05-03 Thread Zeming Yu
I'm trying to decide whether to buy the books Learning Spark, Spark for Machine Learning, etc., or wait for new editions covering the newer concepts like DataFrames and Datasets. Anyone got any suggestions?

take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
Say I have the following dataframe with two numeric columns A and B. What's the best way to add a column showing the difference between the two? +-+--+ |A| B| +-+--+ |786.31999|786.12| | 786.12|

Re: take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
OK. I've worked it out. df.withColumn('diff', col('A')-col('B')) On Sun, May 7, 2017 at 11:49 AM, Zeming Yu wrote: > Say I have the following dataframe with two numeric columns A and B, > what's the best way to add a column sh
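(For completeness, a runnable version of that one-liner; col comes from pyspark.sql.functions:)

from pyspark.sql.functions import col

# Subtracting two numeric columns gives the element-wise difference.
df = df.withColumn('diff', col('A') - col('B'))
df.show()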

how to check whether spill over to hard drive happened or not

2017-05-06 Thread Zeming Yu
hi, I'm running pyspark on my local PC in standalone mode. After a pyspark window function on a dataframe, I ran a groupby query on the dataframe. The groupby query turns out to be very slow (10+ minutes on a small data set). I then cached the dataframe and re-ran the same query. The quer
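(Not from the thread: spill is easiest to see in the Spark UI at http://localhost:4040 -- the stage detail page shows shuffle spill (memory / disk) metrics when spilling happened. From code you can at least confirm the cache took hold:)

df.cache()
df.count()               # an action is needed to actually materialise the cache
print(df.storageLevel)   # confirms the DataFrame is marked for memory/disk storage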

how to set up h2o sparkling water on jupyter notebook on a windows machine

2017-05-08 Thread Zeming Yu
Hi, I'm a newbie, so please bear with me. *I'm using a Windows 10 machine. I installed Spark here:* C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7 *I also installed H2O Sparkling Water here:* C:\sparkling-water-2.1.1 *I use this code on the command line to launch a Jupyter notebook for pysp
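(Not from the thread, and untested on Windows: a common way to get pyspark into Jupyter is to set PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook before running bin\pyspark. Inside the notebook, Sparkling Water's Python package is typically initialised roughly like this -- assuming pysparkling 2.1.x is on the path:)

from pyspark.sql import SparkSession
from pysparkling import H2OContext   # shipped with sparkling-water-2.1.x (assumed)

spark = SparkSession.builder.master("local[*]").appName("sparkling").getOrCreate()
hc = H2OContext.getOrCreate(spark)   # starts the H2O cluster alongside Spark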

what does this error mean?

2017-05-13 Thread Zeming Yu
My code runs error-free on my local PC. I just tried running the same code on an Ubuntu machine on EC2 and got the error below. Any idea where to start in terms of debugging? ---Py4JError Tracebac

Re: what does this error mean?

2017-05-13 Thread Zeming Yu
"server ({0}:{1})".format(self.address, self.port)969 logger.exception(msg)--> 970 raise Py4JNetworkError(msg, e) 971 972 def close(self, reset=False): Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:34166) O

using pandas and pyspark to run ETL job - always failing after about 40 minutes

2017-05-26 Thread Zeming Yu
Hi, I tried running the ETL job a few times. It always fails after 40 minutes or so. When I relaunch jupyter and rerun the job, it runs without error. Then it fails again after some time. Just wondering if anyone else has encountered this before? Here's the error message: ---

examples for flattening dataframes using pyspark

2017-05-27 Thread Zeming Yu
Hi, I need to flatten a nested dataframe and I'm following this example: https://docs.databricks.com/spark/latest/spark-sql/complex-types.html Just wondering: 1. how can I test for the existence of an item before retrieving it? Say, test whether "b" exists before adding it to my flat dataframe event

Re: examples for flattening dataframes using pyspark

2017-05-27 Thread Zeming Yu
Sorry, sent the incomplete email by mistake. Here's the full email: > Hi, > I need to flatten a nested dataframe and I'm following this example: > https://docs.databricks.com/spark/latest/spark-sql/complex-types.html > Just wondering: > 1. how can I test for the existence of an item before ret
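(On question 1, one way, not from the thread, is to test the schema rather than the data before selecting a nested field. Struct and field names below are made up:)

def has_field(df, struct_col, field_name):
    # Look the struct column up in the schema and check its child field names.
    struct_type = df.schema[struct_col].dataType
    return field_name in [f.name for f in struct_type.fields]

if has_field(events_df, "event", "b"):
    flat = events_df.select("event.a", "event.b")
else:
    flat = events_df.select("event.a")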

[no subject]

2020-04-28 Thread Zeming Yu
Unsubscribe Get Outlook for Android

unsubscribe

2020-04-29 Thread Zeming Yu
unsubscribe Get Outlook for Android

Unsubscribe

2020-05-05 Thread Zeming Yu
Unsubscribe Get Outlook for Android