Re: Scala closure exceeds ByteArrayOutputStream limit (~2gb)

2017-08-22 Thread Mungeol Heo
Hello, Joel. Have you solved the problem, i.e. Java's 32-bit limit on array sizes? Thanks. On Wed, Jan 27, 2016 at 2:36 AM, Joel Keller wrote: > Hello, > > I am running RandomForest from mllib on a data set which has very high > dimensional data (~50k dimensions). > > I get the following sta

Re: JSON lib works differently in spark-shell and IDE like intellij

2017-04-05 Thread Mungeol Heo
, Apr 5, 2017 at 6:52 PM, Mungeol Heo wrote: > Hello, > > I am using "minidev", which is a JSON lib, to remove duplicated keys in > a JSON object. > > > minidev > > > > net.minidev > json-smart &

JSON lib works differently in spark-shell and IDE like intellij

2017-04-05 Thread Mungeol Heo
Hello, I am using "minidev", a JSON lib, to remove duplicated keys in a JSON object. Maven dependency: groupId net.minidev, artifactId json-smart, version 2.3. Test code: import net.minidev.json.parser.JSONParser val badJson = "{\"keyA\":
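
A minimal spark-shell sketch of parsing a JSON string with a duplicated key via json-smart's JSONParser; since the badJson value is cut off in the archive, the input string here is an assumption:

import net.minidev.json.JSONObject
import net.minidev.json.parser.JSONParser

// Hypothetical input; the original badJson value is truncated above.
val badJson = "{\"keyA\": 1, \"keyA\": 2, \"keyB\": 3}"

// DEFAULT_PERMISSIVE_MODE tolerates lenient input such as duplicated keys.
val parser = new JSONParser(JSONParser.DEFAULT_PERMISSIVE_MODE)
val obj = parser.parse(badJson).asInstanceOf[JSONObject]

// JSONObject is a java.util.Map, so serializing it back out can only emit
// each key once; which value survives depends on the json-smart version
// actually on the classpath, which is likely why spark-shell and IntelliJ
// behave differently here.
println(obj.toJSONString)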

Re: Need help for RDD/DF transformation.

2017-03-30 Thread Mungeol Heo
,4,5] > > ? > > On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo wrote: >> >> Hello Yong, >> >> First of all, thank you for your attention. >> Note that the values of elements, which have values at RDD/DF 1, in the >> same list will always be the same. >> Therefo

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
the desired result for > > > RDD/DF 1 > > 1, a > 3, c > 5, b > > RDD/DF 2 > > [1, 2, 3] > [4, 5] > > > Yong > > > From: Mungeol Heo > Sent: Wednesday, March 29, 2017 5:37 AM > To: user@spark.apache.org &

Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello, Suppose I have two RDDs or data frames like those below. RDD/DF 1: 1, a / 3, a / 5, b. RDD/DF 2: [1, 2, 3] / [4, 5]. I need to create a new RDD/DF like the following from RDD/DF 1 and 2: 1, a / 2, a / 3, a / 4, b / 5, b. Is there an efficient way to do this? Any help will be great. Thank you.
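
A minimal RDD-based sketch of one way to do this, assuming (as clarified later in the thread) that every element of a list that appears in RDD/DF 1 carries the same value; run in spark-shell with an existing SparkContext sc:

// RDD/DF 1 and RDD/DF 2 from the example above.
val pairs = sc.parallelize(Seq((1, "a"), (3, "a"), (5, "b")))
val lists = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5)))

// Explode each list into (element, listId) pairs so elements can be joined on.
val exploded = lists.zipWithIndex.flatMap { case (xs, id) => xs.map(x => (x, id)) }

// Determine each list's value from the elements whose value is already known.
val listValue = exploded.join(pairs).map { case (_, (id, v)) => (id, v) }.distinct

// Propagate the list's value to every element of that list.
val result = exploded.map { case (x, id) => (id, x) }
  .join(listValue)
  .map { case (_, (x, v)) => (x, v) }

result.collect.sorted.foreach(println)   // (1,a) (2,a) (3,a) (4,b) (5,b)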

How to clean the accumulator and broadcast from the driver manually?

2016-10-21 Thread Mungeol Heo
Hello, As mentioned in the title, I want to know whether it is possible to clean accumulators/broadcasts from the driver manually, since the driver's memory keeps increasing. Someone says the unpersist method removes them from both memory and disk on each executor node, but they stay on the dri
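
For broadcast variables there are explicit driver-side APIs; for accumulators there is, as far as I know, no explicit destroy call, so the sketch below (against an existing SparkContext sc) only covers the broadcast case:

val bc = sc.broadcast(Map(1 -> "a", 2 -> "b"))

// ... jobs that use bc.value ...

// unpersist() drops the copies cached on the executors but keeps the
// driver-side copy, so the variable can still be re-broadcast later.
bc.unpersist(blocking = true)

// destroy() also releases the driver-side data and metadata; the broadcast
// cannot be used afterwards.
bc.destroy()

// Accumulators are cleaned up by the ContextCleaner once all references to
// them are dropped and garbage collected, so removing the reference is the
// usual way to let the driver reclaim that memory.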

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mungeol Heo
nt > or explicit storage), then there can be substantial I/O activity. > > > > > > > > From: Xi Shen > Date: Monday, October 17, 2016 at 2:54 AM > To: Divya Gehlot , Mungeol Heo > > Cc: "user @spark" > Subject: Re: Is spark a right tool for updati

Is spark a right tool for updating a dataframe repeatedly

2016-10-16 Thread Mungeol Heo
Hello, everyone. As mentioned in the title, I wonder whether Spark is the right tool for updating a data frame repeatedly until there is no more data to update. For example: while (there was an update) { update data frame A }. If it is the right tool, then what is the best practice for this ki
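
Since DataFrames are immutable, an "update" in Spark is really a transformation that yields a new frame; a minimal sketch of that idiom (table and column names are hypothetical, assuming a sqlContext in spark-shell):

import org.apache.spark.sql.functions.when

val df = sqlContext.table("person")               // hypothetical table

// "Update" rows by deriving a new column value; this returns a new
// DataFrame rather than modifying df in place.
val updated = df.withColumn("status",
  when(df("age") >= 18, "adult").otherwise(df("status")))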

[1.6.0] Skipped stages keep increasing and causes OOM finally

2016-10-13 Thread Mungeol Heo
Hello, My task is updating a dataframe in a while loop until there is no more data to update. The Spark SQL I used is like the following: val hc = sqlContext hc.sql("use person") var temp_pair = hc.sql(""" select ROW_NUMBER() OVER (ORDER B
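
One pattern often suggested for this kind of loop is to materialize each iteration and rebuild the DataFrame from its RDD, so the Catalyst plan/lineage does not keep growing across iterations (Spark 1.6 has no DataFrame.checkpoint). A rough sketch reusing the hc = sqlContext handle from the snippet above, with the actual update logic left as a placeholder:

import org.apache.spark.sql.DataFrame

// Placeholder for the real per-iteration update; returns the new frame and
// the number of rows it changed. Entirely hypothetical.
def updateOnce(df: DataFrame): (DataFrame, Long) = (df, 0L)

var current: DataFrame = hc.table("some_table")   // hypothetical starting table
var changed = 1L

while (changed > 0) {
  val (next, n) = updateOnce(current)
  changed = n

  // Materialize the result and rebuild the frame from its cached RDD so the
  // next iteration does not carry the whole plan of all previous iterations.
  val frozen = hc.createDataFrame(next.rdd, next.schema)
  frozen.cache()
  frozen.count()
  current.unpersist()
  current = frozen
}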

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try turning yarn.scheduler.capacity.resource-calculator on, then check again. On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Using the dominant resource calculator instead of the default resource calculator will > get the expected vcores you wanted. Basically, by default YARN does not > honor CPU c
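
"Turning it on" here means pointing the property at the DominantResourceCalculator in capacity-scheduler.xml and refreshing or restarting the ResourceManager; a sketch of the entry, assuming the capacity scheduler is in use:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>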

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try turning "yarn.scheduler.capacity.resource-calculator" on. On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Using the dominant resource calculator instead of the default resource calculator will > get the expected vcores you wanted. Basically, by default YARN does not > honor CPU cores as a resource,

How to improve the performance for writing a data frame to a JDBC database?

2016-07-08 Thread Mungeol Heo
Hello, I am trying to write a data frame to a JDBC database, such as SQL Server, using Spark 1.6.0. The problem is that "write.jdbc(url, table, connectionProperties)" is too slow. Is there any way to improve the performance/speed? E.g., options like partitionColumn, lowerBound, upperBound, numPartitions w
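
As far as I know, partitionColumn/lowerBound/upperBound/numPartitions apply to JDBC reads rather than writes; on the write side each partition of the DataFrame opens its own JDBC connection, so repartitioning before the write is the main knob. A rough sketch (URL, table names and partition count are hypothetical):

import java.util.Properties

val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb"   // hypothetical
val props = new Properties()
props.setProperty("user", "db_user")
props.setProperty("password", "db_password")

val df = sqlContext.table("source_table")                    // hypothetical source

// More partitions -> more concurrent inserts, up to what the target
// database can handle.
df.repartition(16).write.jdbc(url, "target_table", props)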

Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
the case > you're seeing. A population of N=1 still has a standard deviation of > course (which is 0). > > On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo wrote: >> I know stddev_samp and stddev_pop give different values, because they >> have different definitions. Wha

Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
erty which may arise >> from relying on this email's technical content is explicitly disclaimed. The >> author will in no case be liable for any monetary damages arising from such >> loss, damage or destruction. >> >> >> >> >> On 7 July 2016 at 09

stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
Hello, As mentioned in the title, the stddev_samp function gives NaN while stddev_pop gives a numeric value on the same data. The stddev_samp function will give a numeric value if I cast it to decimal, e.g. cast(stddev_samp(column_name) as decimal(16,3)). Is it a bug? Thanks - mungeol -
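
A small spark-shell sketch reproducing what the thread describes: with only one row, the sample standard deviation divides by n - 1 = 0 and comes back as NaN, while the population version is 0; the cast is the workaround mentioned above (column and table names are made up):

val df = sqlContext.createDataFrame(Seq((1, 10.0))).toDF("id", "value")
df.registerTempTable("t")

sqlContext.sql("""
  select stddev_samp(value),                           -- NaN for a single row
         stddev_pop(value),                            -- 0.0
         cast(stddev_samp(value) as decimal(16,3))     -- the cast workaround from the question
  from t
""").show()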