Re: Spark Implicit Functions

2015-10-17 Thread YiZhi Liu
Cool! Happy to help improve: there's a typo, 'avaragingFunction' :) Thanks for sharing.

2015-10-17 5:55 GMT+08:00 Reynold Xin:
> Thanks for sharing, Bill.
>
> On Fri, Oct 16, 2015 at 2:06 PM, Bill Bejeck wrote:
>>
>> All,
>>
>> I just did a post on adding groupByKeyToList and groupByKeyUnique u
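The post itself isn't quoted in full here, but as a sketch of how such enrichments are usually written, here is a minimal implicit class on pair RDDs. The signatures are assumptions (in particular, returning Set for groupByKeyUnique), not necessarily what the post implements:

    // Hypothetical reconstruction of the enrichment pattern; names follow the thread.
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object PairRDDExtras {
      implicit class RichPairRDD[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) {
        // groupByKey, materializing each group as a List
        def groupByKeyToList: RDD[(K, List[V])] =
          rdd.groupByKey().mapValues(_.toList)

        // groupByKey, keeping only the distinct values per key
        def groupByKeyUnique: RDD[(K, Set[V])] =
          rdd.groupByKey().mapValues(_.toSet)
      }
    }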

Re: Gradient Descent with large model size

2015-10-17 Thread Joseph Bradley
The decrease in running time from N=6 to N=7 makes some sense to me; that should be when tree aggregation kicks in. I'd call it an improvement to run in the same ~13 sec while increasing from N=6 to N=9. "Does this mean that for 5 nodes with treeAggregate of depth 1 it will take 5*3.1 ~ 15.5 seconds?" -->
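For readers following along, a minimal illustration of the treeAggregate depth parameter under discussion, summing toy per-partition gradient arrays (the dimension and partition count are made up):

    // depth = 2 is treeAggregate's default; larger depths add intermediate
    // combine levels between the executors and the driver.
    val dim = 4  // stand-in for a large model dimension
    val grads = sc.parallelize(1 to 64, 64).map(_ => Array.fill(dim)(1.0))
    val summed = grads.treeAggregate(new Array[Double](dim))(
      seqOp = (acc, g) => { for (i <- 0 until dim) acc(i) += g(i); acc },
      combOp = (a, b) => { for (i <- 0 until dim) a(i) += b(i); a },
      depth = 2)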

flaky test "map stage submission with multiple shared stages and failures"

2015-10-17 Thread Reynold Xin
I just saw this happening:

[info] - map stage submission with multiple shared stages and failures *** FAILED *** (566 milliseconds)
[info]   java.lang.IndexOutOfBoundsException: 2
[info]   at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
[info]   at scala.collection.

Re: Gradient Descent with large model size

2015-10-17 Thread Evan Sparks
Yes, remember that your bandwidth is the maximum number of bytes per second that can be shipped to the driver. So if you've got 5 blocks that size, then it looks like you're basically saturating the network. Aggregation trees help for many partitions/nodes and butterfly mixing can help use all
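To make the quoted arithmetic explicit: if one node's partial result takes ~3.1 s to reach the driver and the transfers serialize on the driver's inbound link, the per-block times simply add up. The numbers below are the ones from the question, purely illustrative:

    val perBlockSec = 3.1               // assumed time to ship one partial result
    val nodes = 5
    val totalSec = nodes * perBlockSec  // ~15.5 s when the driver link is saturated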

Re: PMML export for LinearRegressionModel

2015-10-17 Thread Joseph Bradley
Thanks for bringing this up! We need to add PMML export methods to the spark.ml API. I just made a JIRA for tracking that: https://issues.apache.org/jira/browse/SPARK-11171

Joseph

On Thu, Oct 15, 2015 at 2:58 AM, Fazlan Nazeem wrote:
> Ok, it turns out I was using the wrong LinearRegressionMod
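For anyone hitting the same confusion: the spark.mllib LinearRegressionModel (unlike its spark.ml counterpart, hence SPARK-11171) already mixes in PMMLExportable. A sketch of the export that works today, with toy data and an illustrative output path:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
      LabeledPoint(2.0, Vectors.dense(2.0, 1.0))))
    val model = LinearRegressionWithSGD.train(training, 100)
    model.toPMML("/tmp/lr.pmml")  // toPMML() with no args returns the PMML as a String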

Re: Build spark 1.5.1 branch fails

2015-10-17 Thread Chester Chen
I was using JDK 1.7, and the Maven version is the same as in the pom file.

᚛ |(v1.5.1)|$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Using build/sbt still fails the same way with -Denforcer.skip, with mv

Re: Build spark 1.5.1 branch fails

2015-10-17 Thread Ted Yu
Have you set MAVEN_OPTS with the following?

-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m

Cheers

On Sat, Oct 17, 2015 at 2:35 PM, Chester Chen wrote:
> I was using jdk 1.7 and maven version is the same as pom file.
>
> ᚛ |(v1.5.1)|$ java -version
> java version "1.7.0_51"
> Java(T
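For reference, these are the values from the "Building Spark" docs; a typical invocation, assuming a bash-like shell:

    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
    build/mvn -DskipTests clean package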

Re: MLlib Contribution

2015-10-17 Thread Joseph Bradley
Hi, it'd be great to share your implementation with the community. I'd recommend:

(1) Share it immediately by creating a Spark package: http://spark-packages.org/
You can use this helper package to create your own: http://spark-packages.org/package/databricks/sbt-spark-package
After you create a
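As a rough sketch of what the sbt-spark-package setup looks like (the plugin version, package name, and components below are placeholders; check the plugin's README for current values):

    // project/plugins.sbt
    addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.3")  // version is illustrative

    // build.sbt
    spName := "yourname/your-mllib-extension"  // the Spark package name
    sparkVersion := "1.5.1"                    // Spark version to build against
    sparkComponents += "mllib"                 // adds spark-mllib as a provided dependency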

Re: Build spark 1.5.1 branch fails

2015-10-17 Thread Chester Chen
Yes, I have tried MAVEN_OPTS with:

-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
-Xmx4g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
-Xmx2g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=512m

None of them worked; all failed with the same error.

Thanks

On Sat, Oct 17, 2015

Streaming and storing to Google Cloud Storage or S3

2015-10-17 Thread vonnagy
I have a streaming application that reads from Kafka (direct stream) and then writes Parquet files. It is a pretty simple app that gets a Kafka direct stream (8 partitions), calls `stream.foreachRDD`, and then stores to Parquet using a DataFrame. Batch intervals are set to 10 seconds. During
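A minimal sketch of the pattern described above, using Spark 1.5-era APIs; the broker, topic, and output path are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("stream-to-parquet")
    val ssc = new StreamingContext(conf, Seconds(10))  // 10-second batches

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // Convert each batch to a DataFrame and append it as Parquet.
      rdd.map(_._2).toDF("value")
        .write.mode("append").parquet("gs://my-bucket/events")
    }

    ssc.start()
    ssc.awaitTermination()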

Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-17 Thread Robert Dodier
Nicholas Chammas wrote:
> The funny thing is that Spark seems to accept this only if the value of
> SPARK_MASTER_IP is a DNS name and not an IP address.
>
> When I provide an IP address, I get errors in the log when starting the
> master:
>
> 15/10/15 01:47:31 ERROR NettyTransport: failed to bind
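For context, the setting under discussion normally lives in conf/spark-env.sh; hypothetical values for the two cases being compared:

    # conf/spark-env.sh (values are illustrative)
    SPARK_MASTER_IP=master.example.com   # DNS name: accepted
    # SPARK_MASTER_IP=203.0.113.10       # raw IP: reportedly triggers the bind error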

Checkpointing RDD calls the job twice?

2015-10-17 Thread jatinganhotra
Hi, I noticed that when you checkpoint a given RDD, the action is performed twice; I can see 2 jobs being executed in the Spark UI. Example:

val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => lin
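This is expected behavior: checkpointing runs as a separate job after the first action completes, and it recomputes the RDD's lineage unless the RDD is persisted, which is why the RDD API docs recommend caching before checkpointing. A sketch continuing the example above:

    // Persist before checkpointing so the checkpoint job reads the cached
    // partitions instead of recomputing from /data/pagecounts.
    sc.setCheckpointDir("/checkpoints")
    val logData = sc.textFile("/data/pagecounts", 2)
    logData.cache()
    logData.checkpoint()
    logData.count()  // still two jobs in the UI, but no recomputation of the source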