Re: RDD Location

2016-12-30 Thread Sun Rui
…executing this function. But if I move the code to other places, like the main() function, it runs well. What is the reason for it? Thanks, Fei. On Fri, Dec 30, 2016 at 2:38 AM, Sun Rui <sunrise_...@163.com> wrote: Maybe you can create…

Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override getPreferredLocations() to implement the logic of dynamically changing the locations. On Dec 30, 2016, at 12:06, Fei Hu wrote: Dear all, is there any way to change the host location for a certain partition of an RDD? …
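
A minimal sketch of that approach (illustrative only, not code from this thread; the hostsByPartition map is an assumed placeholder for whatever source drives the dynamic locations):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Pass-through RDD whose preferred locations come from a caller-supplied
    // map of partition index -> host names.
    class LocationAwareRDD[T: ClassTag](
        prev: RDD[T],
        hostsByPartition: Map[Int, Seq[String]])
      extends RDD[T](prev) {

      // Reuse the parent's partitions and computation unchanged.
      override protected def getPartitions: Array[Partition] = prev.partitions
      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        prev.iterator(split, context)

      // The scheduler consults this when assigning tasks to executors.
      override def getPreferredLocations(split: Partition): Seq[String] =
        hostsByPartition.getOrElse(split.index, Nil)
    }

Note that preferred locations are only a scheduling hint; the scheduler may still place a task elsewhere if the preferred hosts are busy.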

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Sun Rui
Hi, could you give more information about your Spark environment? Cluster manager, Spark version, whether dynamic allocation is used, and so on. Generally, executors delete the temporary directories holding shuffle files on exit because JVM shutdown hooks are registered, unless the executors are brutally killed. …
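
For reference, a minimal sketch of the JVM shutdown-hook pattern mentioned above (illustrative only, not Spark's actual cleanup code); a hook registered this way runs on normal JVM exit or SIGTERM, but is skipped when the process is killed with SIGKILL, which is why brutally killed executors can leave shuffle directories behind:

    import java.io.File

    object TempDirCleanup {
      // Delete children first, then the directory/file itself.
      private def deleteRecursively(f: File): Unit = {
        Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
        f.delete()
      }

      // Registered hooks run on normal JVM shutdown, not on SIGKILL.
      def registerCleanup(dir: File): Unit = {
        Runtime.getRuntime.addShutdownHook(new Thread {
          override def run(): Unit = deleteRecursively(dir)
        })
      }
    }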

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
…18:51, Maciej Szymkiewicz wrote: Thank you for your prompt response and great examples, Sun Rui, but I am still confused about one thing. Do you see any particular reason not to merge subsequent limits? The following case: (limit n (map f (limit m ds))) could…

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
Based on your code, here is a simpler test case on Spark 2.0:

    case class my(x: Int)
    val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
    val df1 = spark.createDataFrame(rdd)
    val df2 = df1.limit(1)
    df1.map { r => r.getAs[Int](0) }.first
    df2.map { r => r.getAs[Int](0) }.first // Much slower

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-11 Thread Sun Rui
-1. https://issues.apache.org/jira/browse/SPARK-16379 On Jul 6, 2016, at 19:28, Maciej Bryński wrote: -1 https://issues.apache.org/jira/browse/SPARK-16379

Re: spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Sun Rui
Maybe it is related to the "parquet-provided" profile? Remove the "parquet-provided" profile when building the distribution, or add the Parquet jar to the classpath when running Spark. On Jul 8, 2016, at 09:25, kevin wrote: parquet-provided…

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. For the PySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors…

Re: Windows Rstudio to Linux spakR

2016-06-01 Thread Sun Rui
Selvam, first, deploy the Spark distribution on your Windows machine; it must be the same Spark version as in your Linux cluster. Second, follow the instructions at https://github.com/apache/spark/tree/master/R#using-sparkr-from-rstudio. Specify the Spark master URL for your Linux Spark cluster…

Re:

2016-05-22 Thread Sun Rui
No permission is required. Just send your PR :) On May 22, 2016, at 20:04, 成强 wrote: spark-15429

Re: spark on kubernetes

2016-05-22 Thread Sun Rui
If it is possible to rewrite URLs in outbound responses in Knox or another reverse proxy, would that solve your issue? On May 22, 2016, at 14:55, Gurvinder Singh wrote: On 05/22/2016 08:32 AM, Reynold Xin wrote: Kubernetes itself already has facilities for an HTTP proxy, doesn't it? Yea…

Re: spark on kubernetes

2016-05-21 Thread Sun Rui
I think a “reverse proxy” is beneficial for monitoring a cluster in a secure way. This feature is desired not only for Spark standalone, but also for Spark on YARN, and for projects other than Spark. Maybe Apache Knox can help you; I am not sure how Knox can integrate with Spark. On May 22, 2016, at…

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Kai, you can simply ignore this test failure until it is fixed. On May 20, 2016, at 12:54, Sun Rui wrote: Yes, I also hit this issue. It is likely related to recent R versions. Could you help submit a JIRA issue? I will take a look at it. On May 20, 2016, a…

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
I guess this issue is related to permissions. It seems I used `sudo ./R/run-tests.sh` and it worked sometimes. Without permission, maybe we couldn't access the /tmp directory. However, the SparkR unit testing is brittle. Could someone give any hints on how to solve this?

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
…13.910 Thread-1 INFO ShutdownHookManager: Shutdown hook called
1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting directory /private/var/folders/xy/qc35m0y55vq83dsqzg066_c4gn/T/spark-dfafdddc-fd25-4eb4-bb1d-5659151c8231 …

Re: SparkR dataframe error

2016-05-18 Thread Sun Rui
On Wed, May 18, 2016 at 5:27 PM, Sun Rui <sunrise_...@163.com> wrote: It's wrong behaviour that head(df) outputs no rows. Could you send a screenshot displaying the whole error message? On May 19, 2016, at 08:12, Gayathri Murali <gayathri.m.sof...@gmai…

Re: SparkR dataframe error

2016-05-18 Thread Sun Rui
It's wrong behaviour that head(df) outputs no rows. Could you send a screenshot displaying the whole error message? On May 19, 2016, at 08:12, Gayathri Murali wrote: I am trying to run a basic example in the interactive R shell and run into the following error. Also note that head(df) does not…

RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Sun, Rui
…internally Dataset[Row(value: Row)]. From: Reynold Xin [r...@databricks.com] Sent: Friday, February 26, 2016 3:55 PM To: Sun, Rui Cc: Koert Kuipers; dev@spark.apache.org Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0 The join and joinWith are just two different join semantics, and is not…

RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Sun, Rui
Vote for option 2. Source compatibility and binary compatibility are very important from the user's perspective. It's unfair to Java developers that they don't have the DataFrame abstraction. As you said, sometimes it is more natural to think in terms of DataFrame. I am wondering if conceptually there is sl…

RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Sun, Rui
This should be solved by your pending PR https://github.com/apache/spark/pull/10480, right? From: Felix Cheung [felixcheun...@hotmail.com] Sent: Sunday, February 7, 2016 8:50 PM To: Sun, Rui; Andrew Holway; dev@spark.apache.org Subject: RE: Fwd: Writing to jdbc database from SparkR

RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Sun, Rui
DataFrameWriter.jdbc() does not work? From: Felix Cheung [felixcheun...@hotmail.com] Sent: Sunday, February 7, 2016 9:54 AM To: Andrew Holway; dev@spark.apache.org Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2) Unfortunately I couldn't find a simple workaround. It seems to…
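
For context, a minimal Scala sketch of the DataFrameWriter.jdbc() call in question (the JDBC URL, table name, and credentials below are placeholders); the matching JDBC driver jar must be on the classpath for it to work:

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SQLContext}

    def writeToJdbc(sqlContext: SQLContext): Unit = {
      val df = sqlContext.range(0, 10)           // small example DataFrame
      val props = new Properties()
      props.setProperty("user", "spark")         // placeholder credentials
      props.setProperty("password", "secret")
      df.write
        .mode(SaveMode.Append)
        .jdbc("jdbc:postgresql://dbhost:5432/testdb", "test_table", props)
    }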

RE: Specifying Scala types when calling methods from SparkR

2015-12-10 Thread Sun, Rui
...@alteryx.com] Sent: Friday, December 11, 2015 2:47 AM To: Sun, Rui; shiva...@eecs.berkeley.edu Cc: dev@spark.apache.org Subject: RE: Specifying Scala types when calling methods from SparkR Hi Sun Rui, I've had some luck simply using "objectFile" when saving from SparkR directly. The problem is that if you…

RE: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Sun, Rui
Hi, just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod. You can take the objectFile() in context.R as an example. Since the SparkContext created in SparkR is actually a JavaSparkContext, there is no need to pass the implicit ClassTag. -----Original Message----- From: S…
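
For illustration, a hedged Scala sketch of the JVM-side difference behind that advice (not the SparkR code itself): the Scala SparkContext.objectFile needs an implicit ClassTag, while the JavaSparkContext overload does not, which is why calling it by name from callJMethod requires no extra type argument:

    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.rdd.RDD

    // Scala API: the element type is carried by an implicit ClassTag.
    def loadViaScala(sc: SparkContext, path: String): RDD[PipelineModel] =
      sc.objectFile[PipelineModel](path)

    // Java-friendly API: no ClassTag parameter, matching how SparkR
    // invokes JVM methods by name on a JavaSparkContext.
    def loadViaJava(jsc: JavaSparkContext, path: String): JavaRDD[PipelineModel] =
      jsc.objectFile[PipelineModel](path)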

RE: SparkR package path

2015-09-24 Thread Sun, Rui
…not a small change to spark-submit. Also, additional network traffic overhead would be incurred. I can't see any compelling demand for this. From: Hossein [fal...@gmail.com] Sent: Friday, September 25, 2015 5:09 AM To: shiva...@eecs.berkeley.edu Cc: Sun, Rui; dev@spark.apache.org; Dan Putler

RE: SparkR package path

2015-09-24 Thread Sun, Rui
…AM To: Sun, Rui Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org Subject: Re: SparkR package path Requiring users to download the entire Spark distribution to connect to a remote cluster (which is already running Spark) seems like overkill. Even for most Spark users who download the Spark source, it…

RE: SparkR package path

2015-09-23 Thread Sun, Rui
…documentation at https://github.com/apache/spark/tree/master/R From: Hossein [fal...@gmail.com] Sent: Thursday, September 24, 2015 1:42 AM To: shiva...@eecs.berkeley.edu Cc: Sun, Rui; dev@spark.apache.org Subject: Re: SparkR package path Yes, I think exposing SparkR on CRAN can significantly…

RE: SparkR package path

2015-09-21 Thread Sun, Rui
Hossein, is there any strong reason to download and install the SparkR source package separately from the Spark distribution? An R user can simply download the Spark distribution, which contains the SparkR source and binary packages, and directly use sparkR. There is no need to install the SparkR package at all. From: Hosse…

[SparkR] is toDF() necessary

2015-05-08 Thread Sun, Rui
toDF() is defined to convert an RDD to a DataFrame, but it is just a very thin wrapper around createDataFrame() that helps the caller avoid passing in the SQLContext. Since Scala/PySpark does not have toDF(), and we'd better keep the API as narrow and simple as possible: is toDF() really necessary? Could we elim…