Spark spilling location

2014-09-18 Thread Tom Hubregtsen
Hi all,

Just one line of context, since last post mentioned this would help:
I'm currently writing my master's thesis (Computer Engineering) on storage
and memory in both Spark and Hadoop.

Right now I'm trying to analyze the spilling behavior of Spark, and I do not
see what I expect. Therefore, I want to be sure that I am looking at the
correct location.

If I set spark.local.dir and SPARK_LOCAL_DIRS to, for instance, ~/temp
instead of /tmp, will this be the location where all data will be spilled
to? I assume it is, based on the description of spark.local.dir at
https://spark.apache.org/docs/latest/configuration.html:
"Directory to use for "scratch" space in Spark, including map output files
and RDDs that get stored on disk."

Thanks!







Re: Spark spilling location

2014-09-18 Thread Patrick Wendell
Yes - I believe we use the local dirs for spilling as well.
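
For reference, a minimal pyspark sketch of how one might set this and check it
(the ~/temp path, the memory-fraction value, and the job itself are purely
illustrative; in cluster deployments SPARK_LOCAL_DIRS or the cluster manager's
own settings may override spark.local.dir):

    # Minimal sketch (assuming local mode and that ~/temp exists); values are illustrative.
    import os
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("spill-check")
            .set("spark.local.dir", os.path.expanduser("~/temp"))   # scratch / spill directory
            .set("spark.shuffle.memoryFraction", "0.05"))           # small budget to encourage spilling
    sc = SparkContext(conf=conf)

    # A shuffle-heavy job that is likely to spill under a small memory budget.
    pairs = sc.parallelize(range(2 * 10**6)).map(lambda x: (x % 1000, x))
    pairs.groupByKey().count()

    # While the job runs, spark-local-* / blockmgr-* subdirectories should
    # appear under ~/temp rather than /tmp if the setting took effect.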

On Thu, Sep 18, 2014 at 7:57 AM, Tom Hubregtsen  wrote:
> Hi all,
>
> Just one line of context, since last post mentioned this would help:
> I'm currently writing my master's thesis (Computer Engineering) on storage
> and memory in both Spark and Hadoop.
>
> Right now I'm trying to analyze the spilling behavior of Spark, and I do not
> see what I expect. Therefore, I want to be sure that I am looking at the
> correct location.
>
> If I set spark.local.dir and SPARK_LOCAL_DIRS to, for instance, ~/temp
> instead of /tmp, will this be the location where all data will be spilled
> to? I assume it is, based on the description of spark.local.dir at
> https://spark.apache.org/docs/latest/configuration.html:
> "Directory to use for "scratch" space in Spark, including map output files
> and RDDs that get stored on disk."
>
> Thanks!
>
>
>




Re: Spark authenticate enablement

2014-09-18 Thread Andrew Or
2014-09-16 22:32 GMT-07:00 Jun Feng Liu :

> I see. Thank you, it works for me. It looks confusing to have two ways
> to expose configuration, though.
>

I agree. We're working on it. :)
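
For anyone following along, here is a minimal sketch of the application-side
half (the app name and secret value are purely illustrative; the standalone
daemons get the same settings through the env-var route described in the
quoted message below):

    # Application-side configuration sketch; the secret value is illustrative only.
    # The standalone daemons must be given the same settings, e.g. in spark-env.sh:
    #   export SPARK_MASTER_OPTS="-Dspark.authenticate=true -Dspark.authenticate.secret=<secret>"
    #   export SPARK_WORKER_OPTS="-Dspark.authenticate=true -Dspark.authenticate.secret=<secret>"
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("auth-example")
            .set("spark.authenticate", "true")
            .set("spark.authenticate.secret", "my-shared-secret"))  # must match the daemons
    sc = SparkContext(conf=conf)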






>  Best Regards
>
> *Jun Feng Liu*
> IBM China Systems & Technology Laboratory in Beijing
>
> Phone: 86-10-82452683
> E-mail: liuj...@cn.ibm.com
>
> BLD 28, ZGC Software Park
> No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
> China
>
>
> *Andrew Or* wrote on 2014/09/17 02:06:
>
> To: Tom Graves
> cc: Jun Feng Liu/China/IBM@IBMCN, dev@spark.apache.org
> Subject: Re: Spark authenticate enablement
>
> Hi Jun,
>
> You can still set the authentication variables through `spark-env.sh`, by
> exporting SPARK_MASTER_OPTS, SPARK_WORKER_OPTS, SPARK_HISTORY_OPTS, etc., to
> include "-Dspark.auth.{...}". There is an open pull request that allows
> these processes to also read from spark-defaults.conf, but this is not
> merged into master yet.
>
> Andrew
>
> 2014-09-15 6:44 GMT-07:00 Tom Graves :
>
> > Spark authentication does work in standalone mode (at least it did; I
> > haven't tested it in a while). The same shared secret has to be set on
> > all the daemons (master and workers) and then also in the configs of any
> > applications submitted. Since everyone shares the same secret, it's by no
> > means ideal or strong authentication.
> >
> > Tom
> >
> >
> > On Thursday, September 11, 2014 4:17 AM, Jun Feng Liu <
> liuj...@cn.ibm.com>
> > wrote:
> >
> >
> >
> > Hi there,
> >
> > I am trying to enable authentication on Spark in standalone mode. It
> > seems that only SparkSubmit loads the properties from
> > spark-defaults.conf; org.apache.spark.deploy.master.Master does not
> > really load the default settings from spark-defaults.conf.
> >
> > Does this mean that Spark authentication only works for, e.g., the YARN
> > mode? Or have I missed something about standalone mode?
> >
> > Best Regards
> >
> > Jun Feng Liu
> > IBM China Systems & Technology Laboratory in Beijing
> >
> > 
> >
> >   Phone: 86-10-82452683
> > E-mail:liuj...@cn.ibm.com
> >
> > BLD 28,ZGC Software Park
> > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> > China
> >
>
>


Gaussian Mixture Model clustering

2014-09-18 Thread Meethu Mathew

Hi all,

We have come up with an initial distributed implementation of Gaussian
Mixture Model in pyspark, where the parameters are estimated using the
Expectation-Maximization (EM) algorithm. Our current implementation considers
a diagonal covariance matrix for each component.
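
For readers unfamiliar with the algorithm, below is a minimal single-machine
numpy sketch of the EM updates for a diagonal-covariance GMM. It only
illustrates the underlying math, not our distributed pyspark code; the
initialization, iteration count, and small variance floor are arbitrary
choices.

    # Single-machine numpy sketch of EM for a GMM with diagonal covariances.
    # Illustration of the algorithm only, not the distributed implementation.
    import numpy as np

    def em_gmm_diag(X, k, n_iter=100, seed=0):
        """X: (n, d) data matrix, k: number of components."""
        rng = np.random.RandomState(seed)
        n, d = X.shape
        means = X[rng.choice(n, k, replace=False)]          # initialize means from random points
        variances = np.tile(X.var(axis=0), (k, 1)) + 1e-6   # per-dimension variances
        weights = np.full(k, 1.0 / k)                       # uniform mixing weights

        for _ in range(n_iter):
            # E-step: log p(x_i, z_i = j) for every point i and component j
            log_prob = np.empty((n, k))
            for j in range(k):
                maha = ((X - means[j]) ** 2 / variances[j]).sum(axis=1)
                log_prob[:, j] = (np.log(weights[j])
                                  - 0.5 * np.sum(np.log(2 * np.pi * variances[j]))
                                  - 0.5 * maha)
            log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
            resp = np.exp(log_prob - log_norm)               # responsibilities; rows sum to 1

            # M-step: re-estimate weights, means, and diagonal variances
            nk = resp.sum(axis=0) + 1e-10
            weights = nk / n
            means = resp.T.dot(X) / nk[:, None]
            variances = resp.T.dot(X ** 2) / nk[:, None] - means ** 2 + 1e-6

        return weights, means, variances

In the distributed setting the E-step is the data-parallel part; only the
per-component sufficient statistics need to be aggregated across partitions
before the M-step.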
We did an initial benchmark study on a 2-node Spark standalone cluster,
where each node has 8 cores and 8 GB RAM; the Spark version used
is 1.0.0. We also evaluated the Python version of k-means available in Spark
on the same datasets. Below are the results from this benchmark study.
The reported stats are averages over 10 runs. Tests were done on multiple
datasets with varying numbers of features and instances.



                           Gaussian mixture model             Kmeans (Python)
Instances    Dimensions    Avg/iteration  100 iterations      Avg/iteration  100 iterations
0.7 million  13            7 s            12 min              13 s           26 min
1.8 million  11            17 s           29 min              33 s           53 min
10 million   16            1.6 min        2.7 hr              1.2 min        2 hr

We are interested in contributing this implementation as a patch to
Spark. Does MLlib accept Python implementations? If not, can we
contribute it to the pyspark component?
I have created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the
ticket assigned to myself?


Please review and suggest how to take this forward.



--

Regards,


*Meethu Mathew*

*Engineer*

*Flytxt*

F: +91 471.2700202

www.flytxt.com | Visit our blog | Follow us | Connect on LinkedIn




Re: Gaussian Mixture Model clustering

2014-09-18 Thread Meethu Mathew

Hi all,
Please find attached the image of the benchmark results. The table in the
previous mail got messed up. Thanks.




On Friday 19 September 2014 10:55 AM, Meethu Mathew wrote:

Hi all,

We have come up with an initial distributed implementation of Gaussian
Mixture Model in pyspark, where the parameters are estimated using the
Expectation-Maximization (EM) algorithm. Our current implementation considers
a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2-node Spark standalone cluster,
where each node has 8 cores and 8 GB RAM; the Spark version used
is 1.0.0. We also evaluated the Python version of k-means available in Spark
on the same datasets. Below are the results from this benchmark study.
The reported stats are averages over 10 runs. Tests were done on multiple
datasets with varying numbers of features and instances.


                           Gaussian mixture model             Kmeans (Python)
Instances    Dimensions    Avg/iteration  100 iterations      Avg/iteration  100 iterations
0.7 million  13            7 s            12 min              13 s           26 min
1.8 million  11            17 s           29 min              33 s           53 min
10 million   16            1.6 min        2.7 hr              1.2 min        2 hr


We are interested in contributing this implementation as a patch to
Spark. Does MLlib accept Python implementations? If not, can we
contribute it to the pyspark component?
I have created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the
ticket assigned to myself?

Please review and suggest how to take this forward.





--

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

Skype: meethu.mathew7

 F: +91 471.2700202

www.flytxt.com | Visit our blog | Follow us | Connect on LinkedIn


