Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-12 Thread Robin Cjc
Some properties can be set via System.setProperty(). -Dfile.encoding, for example, can be set as ("file.encoding", "utf-8"). But some cannot, like the heap size, because by that point it is too late to set it. And in the new 0.9 version, I think you can use SparkConf.set() instead of System.setProperty(). It se
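
A minimal sketch of the two approaches (Spark 0.9 API; the master and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Works: an ordinary JVM property, set before anything reads it.
    System.setProperty("file.encoding", "utf-8")

    // Does NOT work for heap size: -Xmx is fixed at JVM startup,
    // so System.setProperty() is too late for it.

    // 0.9 style: Spark settings via SparkConf instead of system properties.
    val conf = new SparkConf()
      .setMaster("local[2]")              // placeholder
      .setAppName("ConfExample")          // placeholder
      .set("spark.executor.memory", "2g") // read by Spark at executor launch
    val sc = new SparkContext(conf)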

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Sean Owen
Ah, thank you. I had actually forgotten about this, and it is indeed probably a difference. This is from the other paper I cited: http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf It's the "WR" in "ALS-WR": weighted regularization. I supp
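
For reference, the weighted-regularization objective from that paper, as I transcribe it (notation may differ slightly from the paper):

    \min_{U,V} \sum_{(i,j) \in \Omega} (r_{ij} - u_i^\top v_j)^2
             + \lambda \Big( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{v_j} \|v_j\|^2 \Big)

where n_{u_i} is the number of ratings by user i and n_{v_j} the number of ratings on item j, i.e. each factor's penalty is scaled by how many observations touch it.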

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Michael Allman
Hi Sean, Digging deeper, I've found another difference between Oryx's implementation and Spark's. Why do you adjust lambda here? https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java#L491 Cheers, Michael

Re: TriangleCount & Shortest Path under Spark

2014-03-12 Thread moxiecui
Well, you could run the driver in several ways: from IDEA, with sbt run, in the spark-shell, or even with the run-example script, so long as you give the driver the right args. Here is an example using the run-example script: # in spark-0.9.0 home dir, the first arg is your master, the second is
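
The exact args in the truncated command above aren't preserved, but an illustrative invocation, borrowed from the SparkPi example quoted elsewhere in this digest (local[2] is just a placeholder master), would be:

    # from the spark-0.9.0 home dir
    ./bin/run-example org.apache.spark.examples.SparkPi local[2]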

Local Standalone Application and shuffle spills

2014-03-12 Thread Fabrizio Milo aka misto
Hello everyone, I have a question about shuffle spills. According to the introduction to the AMPLab Spark internals, each task's output can be saved to disk for 'redundancy'. If I set spark.shuffle.spill to false, would this behavior be eliminated, so that it never spills to disk? Thank yo
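
For what it's worth, the property would be set like this (0.9-era sketch). Note that, as I understand it, spark.shuffle.spill only controls whether in-memory aggregation maps spill when they outgrow their memory budget; shuffle map outputs themselves are still written to local disk regardless:

    import org.apache.spark.SparkConf

    // Disable spilling of in-memory aggregation maps during shuffles.
    val conf = new SparkConf().set("spark.shuffle.spill", "false")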

Re: Lzo + Protobuf

2014-03-12 Thread Vipul Pandey
Extending this discussion further: is anyone able to write out LZO-compressed protobuf to HDFS (using Elephant Bird, or any other way)? I have an RDD that I want written out as it is, but I'm unable to figure out a direct way of doing that. I can convert it to a "PairRDD" or Rdd of "Key" and
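
A rough, untested sketch of what this could look like with Elephant Bird. MyProto is a hypothetical protobuf message class, protoRdd is assumed to be an RDD[MyProto], and the Elephant Bird class names are from memory and may differ across versions:

    import org.apache.hadoop.io.NullWritable
    import org.apache.spark.SparkContext._
    import com.twitter.elephantbird.mapreduce.io.ProtobufWritable
    import com.twitter.elephantbird.mapreduce.output.LzoProtobufBlockOutputFormat

    // Wrap each message in a ProtobufWritable, pair it with a null key,
    // then write through the new-API Hadoop output format.
    val pairs = protoRdd.map { msg =>
      val writable = ProtobufWritable.newInstance(classOf[MyProto])
      writable.set(msg)
      (NullWritable.get, writable)
    }
    pairs.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/protos",                           // hypothetical path
      classOf[NullWritable],
      classOf[ProtobufWritable[MyProto]],
      classOf[LzoProtobufBlockOutputFormat[MyProto]])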

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Michael Allman
Thank you everyone for your feedback. It's been very helpful, and though I still haven't found the cause of the difference between Spark and Oryx, I feel I'm making progress. Xiangrui asked me to create a ticket for this issue. The reason I didn't do this originally is because it's not clear to me

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Ryan Compton
Have not upgraded yet... On Wed, Mar 12, 2014 at 3:06 PM, Aureliano Buendia wrote: > Thanks, Ryan. Was your problem solved in spark 0.9? > > > On Wed, Mar 12, 2014 at 9:59 PM, Ryan Compton > wrote: >> >> In 0.8 I had problems broadcasting variables around that size, for >> more info see here: >>

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Aureliano Buendia
Thanks, Ryan. Was your problem solved in spark 0.9? On Wed, Mar 12, 2014 at 9:59 PM, Ryan Compton wrote: > In 0.8 I had problems broadcasting variables around that size, for > more info see here: > > https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Ryan Compton
In 0.8 I had problems broadcasting variables around that size, for more info see here: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E On Wed, Mar 12, 2014 at 2:12 PM, Matei Zaharia wrote: > You sh

Unable to start a standalone app in Scala on a Spark cluster

2014-03-12 Thread Jaonary Rabarisoa
Hi all, I'm trying to play with a Spark cluster with one master and one worker on my laptop. With my setup, I'm able to run the examples that come with Spark as expected. For example, SparkPi: ./bin/run-example org.apache.spark.examples.SparkPi spark://luigi:7077 works without error. But whe
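
For comparison, a minimal standalone driver of that era would look roughly like this (sparkHome and jar paths are placeholders); a common culprit is not passing the application jar, so workers can't load your classes:

    import org.apache.spark.SparkContext

    object MyApp {
      def main(args: Array[String]) {
        val sc = new SparkContext(
          "spark://luigi:7077",                // master, as in the SparkPi run above
          "MyApp",
          "/opt/spark-0.9.0",                  // sparkHome, placeholder
          Seq("target/scala-2.10/myapp.jar"))  // your assembly jar, placeholder
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }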

Re: Changing number of workers for benchmarking purposes

2014-03-12 Thread DB Tsai
One related question: is there any way to automatically determine the optimal # of workers in YARN based on the data size and available resources, without explicitly specifying it when the job is launched? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -

Re: Changing number of workers for benchmarking purposes

2014-03-12 Thread Patrick Wendell
Hey Pierre, Currently, modifying the "slaves" file is the best way to do this, because in general we expect that users will want to launch workers on any slave. I think you could hack something together pretty easily to allow this. For instance, if you modify the line in slaves.sh from this: for
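
Presumably something along these lines (a hypothetical sketch only; the real loop in sbin/slaves.sh differs in detail, and HOSTLIST/NUM_SLAVES here are stand-ins):

    # Start workers on just the first $NUM_SLAVES hosts of the slaves file.
    for slave in $(head -n "${NUM_SLAVES:-99999}" "$HOSTLIST"); do
      ssh "$slave" "$@" &
    done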

Re: spark config params conventions

2014-03-12 Thread Aaron Davidson
One solution for Typesafe Config is to write "spark.speculation" = true. Typesafe Config will then recognize the quoted key as a single string rather than a path, so the name will actually be "\"spark.speculation\"", and you need to handle this contingency when passing the config options to Spark (stripping the quotes from
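
A sketch of the quote-stripping this describes, assuming the spark.* keys were quoted in the HOCON file as above:

    import com.typesafe.config.ConfigFactory
    import scala.collection.JavaConverters._

    val config = ConfigFactory.load()
    for (entry <- config.entrySet.asScala) {
      // "\"spark.speculation\"" -> "spark.speculation"
      val key = entry.getKey.replace("\"", "")
      if (key.startsWith("spark.")) {
        System.setProperty(key, entry.getValue.unwrapped.toString)
      }
    }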

Re: building Spark docs

2014-03-12 Thread Patrick Wendell
Diana, I'm forwarding this to the dev list since it might be useful there as well. On Wed, Mar 12, 2014 at 11:39 AM, Diana Carroll wrote: > Hi all. I needed to build the Spark docs. The basic instructions to do > this are in spark/docs/README.md but it took me quite a bit of playing > around to

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Matei Zaharia
You should try Torrent for this one, it will be faster. It’s still experimental but I believe it works pretty well and it just needs more testing to become the default. Matei On Mar 12, 2014, at 1:12 PM, Aureliano Buendia wrote: > Is TorrentBroadcastFactory out of beta? IS it preferred over
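
Switching factories is a one-line setting (0.9-era property name, from memory):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.broadcast.factory",
      "org.apache.spark.broadcast.TorrentBroadcastFactory")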

Re: test

2014-03-12 Thread Nicholas Chammas
Yes, we can see your email. On Wed, Mar 12, 2014 at 4:21 PM, Yishu Lin wrote: > Why can't I receive the email I sent myself? ... Can anybody else see it? > Please let me know. Sorry for the spam ...

Re: NLP with Spark

2014-03-12 Thread Andrei
In my experience, the choice of tools for NLP mostly depends on the concrete task. For example, for named entity recognition (NER) there's a nice Java library called GATE [1]. It allows you to annotate your text with special marks (e.g. part-of-speech tags, "time", "name", etc.) and write regex-like rules

test

2014-03-12 Thread Yishu Lin
Why can't I receive the email I sent myself? ... Can anybody else see it? Please let me know. Sorry for the spam ...

Re: pyspark crash on mesos

2014-03-12 Thread kayousterhout
FYI I filed a JIRA for the scheduler issue here: https://spark-project.atlassian.net/browse/SPARK-1235 Fixing the scheduler issue won't fix the underlying python communication issue, but will keep the job from hanging. -- View this message in context: http://apache-spark-user-list.1001560.n3.n

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Aureliano Buendia
Is TorrentBroadcastFactory out of beta? Is it preferred over HttpBroadcastFactory for large broadcasts? What are the benefits of HttpBroadcastFactory as the default factory? On Wed, Mar 12, 2014 at 7:09 PM, Stephen Boesch wrote: > Hi Josh, > So then 2^31 (about 2.1 billion) elements * 2^3 bytes (the size of a doubl

Re: NLP with Spark

2014-03-12 Thread Brian O'Neill
Please let us know how you make out. We have NLP requirements on the horizon. I've used NLTK before, but never on Spark. I'd love to hear if that works out for you. -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive • Ki

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Stephen Boesch
Hi Josh, So then 2^31 (about 2.1 billion) elements * 2^3 bytes (the size of a double) = 2^34 bytes = 16 GB would be the max array byte length with Doubles? 2014-03-12 11:30 GMT-07:00 Josh Marcus : > Aureliano, > > Just to answer your second question (unrelated to Spark), arrays in java > and scala can't be larger than the maximum v

building Spark docs

2014-03-12 Thread Diana Carroll
Hi all. I needed to build the Spark docs. The basic instructions to do this are in spark/docs/README.md but it took me quite a bit of playing around to actually get it working on my system. In case this is useful to anyone else, thought I'd post. This is what I did to build the docs on a CentOS
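
The core steps, as I understand the docs/README.md of that era (gem names and flags may vary by version and distro):

    gem install jekyll
    cd docs
    jekyll build    # generated site lands in docs/_site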

Re: NLP with Spark

2014-03-12 Thread Mayur Rustagi
Would love to know if somebody has tried this. The only possible problem I can foresee is non-serializable libraries; otherwise there's no reason it should not work. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Mar 12, 2014 at 11:1
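
If a library does turn out to be non-serializable, one common workaround is to construct it on the workers inside mapPartitions rather than shipping it from the driver. Sketch below; HypotheticalTagger is a made-up stand-in for a real NLP class, and docs is assumed to be an RDD[String]:

    class HypotheticalTagger {            // stand-in for a real NLP library class
      def tag(doc: String): String = doc  // a real library would do actual tagging
    }

    // One tagger per partition, built on the executor; nothing crosses the wire.
    val tagged = docs.mapPartitions { iter =>
      val tagger = new HypotheticalTagger()
      iter.map(tagger.tag)
    }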

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Josh Marcus
Aureliano, Just to answer your second question (unrelated to Spark), arrays in Java and Scala can't be larger than the maximum value of an Integer (Integer.MAX_VALUE), which means that arrays are limited to about 2.1 billion elements. --j On Wed, Mar 12, 2014 at 1:08 PM, Aureliano Buendia wrot

NLP with Spark

2014-03-12 Thread shankark
(apologies if this was sent out multiple times before) We are about to start a large-scale text-processing research project and are debating between two alternatives for our cluster -- Spark and Hadoop. I've researched possibilities of using NLTK with Hadoop and see that there's some precedent ( h

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Brad Miller
If you're using pyspark, beware that there are some known issues associated with large broadcast variables. https://spark-project.atlassian.net/browse/SPARK-1065 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/browser -Brad On Wed, Mar 12, 2014 at 10:15 AM, Guillaume Pitel < gui

Re: TriangleCount & Shortest Path under Spark

2014-03-12 Thread yxzhao
Thanks very much, Cui. Could you give me more detail about how to run the driver? I tried several times but they all failed. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TriangleCount-Shortest-Path-under-Spark-tp2439p2608.html Sent from the Apache Spark User Li

Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-12 Thread Linlin
Thank you! I have set the JVM option in sbt for local mode, and that works! I'm not sure how to specify it through System.setProperty(); is this a JVM command-line option only? Thank you for your help! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-JAVA-OP
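
For reference, the sbt side of this might look like the following (standard sbt keys). Heap size is a JVM launch flag, which is exactly why System.setProperty() cannot set it once the JVM is running:

    // build.sbt: fork a fresh JVM for `run` so the options actually apply
    fork := true
    javaOptions += "-Xmx4g"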

Changing number of workers for benchmarking purposes

2014-03-12 Thread Pierre Borckmans
Hi there! I was performing some tests for benchmarking purposes, among other things to observe how performance evolves with the number of workers. In that context, I was wondering if there is any easy way to choose the number of workers to be used in standalone mode, without having

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Guillaume Pitel
From my experience, it shouldn't be a problem since 0.8.1 (before that, the Akka frame size was a limit). I've broadcast arrays of at most 1.4 GB so far. Keep in mind that it will be stored in spark.local.dir, so you must have room on the disk. Guillau

Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Aureliano Buendia
Hi, I asked a similar question a while ago and didn't get any answers. I'd like to share a 10 GB double array between 50 to 100 workers. The physical memory of the workers is over 40 GB, so it can fit in each worker's memory. The reason I'm sharing this array is that a cartesian operation is applied to this arra
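
For context, the broadcast pattern under discussion looks like this (sketch; the array here is a small stand-in for the real 10 GB one, and rdd is assumed to be an RDD[Int]):

    val bigArray = new Array[Double](1 << 20)  // stand-in for the ~10 GB array
    val bc = sc.broadcast(bigArray)
    // Each task reads the node-local copy instead of re-shipping the array.
    val result = rdd.map(i => bc.value(i % bigArray.length))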

Re: spark config params conventions

2014-03-12 Thread Mark Hamstra
That's the whole reason why some of the intended configuration changes were backed out just before the 0.9.0 release. It's a well-known issue, even if a completely satisfactory solution isn't as well-known, and it's probably something we should do another iteration on. On Wed, Mar 12, 2014 at 9:

spark config params conventions

2014-03-12 Thread Koert Kuipers
I am reading the Spark configuration params from another configuration object (Typesafe Config) before setting them as system properties. I noticed Typesafe Config has trouble with settings like: spark.speculation=true spark.speculation.interval=0.5 The issue seems to be that if spark.speculation

Re: [re-cont] map and flatMap

2014-03-12 Thread Pascal Voitot Dev
On Wed, Mar 12, 2014 at 3:06 PM, andy petrella wrote: > Folks, > > I just want to point something out... > I haven't had time yet to sort it out and think it through enough to give a valuable, > strict explanation, even though, intuitively, I feel there is a lot to them > ===> need spark people or time to move fo

[re-cont] map and flatMap

2014-03-12 Thread andy petrella
Folks, I just want to point something out... I haven't had time yet to sort it out and think it through enough to give a valuable, strict explanation, even though, intuitively, I feel there is a lot to them ===> need spark people or time to move forward. But here is the thing regarding *flatMap*. Actually, it lo

Re: What is the difference between map and flatMap

2014-03-12 Thread Bertrand Dechoux
In a single phrase: if you understand what map() does and what a flatten() might do, then flatMap() is like a map() followed by a flatten(). As said previously, the concepts themselves are not Spark-specific. Bertrand On Wed, Mar 12, 2014 at 1:19 PM, Xuefeng Wu wrote: > It is the same c
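
A two-line illustration of that phrase:

    val words = sc.parallelize(Seq("to be", "or not"))
    words.map(_.split(" ")).collect()      // Array(Array(to, be), Array(or, not))
    words.flatMap(_.split(" ")).collect()  // Array(to, be, or, not)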

Re: how to config worker HA

2014-03-12 Thread qingyang li
In addition: on this site, https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#hadoop-datasets , I found that an RDD can be stored using a different storage level, and I also found StorageLevel's attribute MEMORY_ONLY_2: same as the levels above, but replicate each pa
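
Using it is a one-liner (sketch; rdd is any RDD):

    import org.apache.spark.storage.StorageLevel

    // Cache each partition on two nodes so losing a worker doesn't lose the data.
    rdd.persist(StorageLevel.MEMORY_ONLY_2)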

Re: Pyspark Memory Woes

2014-03-12 Thread Aaron Olson
Hi Sandy, We are, yes. I strongly suspect we're not partitioning our data properly, but maybe 1.5G is simply too small for our workload. I'll bump the executor memory and see if we get better results. It seems we should be setting it to (SPARK_WORKER_MEMORY + pyspark memory) / # of concurrent app

Re: What is the difference between map and flatMap

2014-03-12 Thread Xuefeng Wu
It is the same concept as in other FP APIs; you can learn it from Scala collections' map and flatMap: http://www.brunton-spall.co.uk/post/2011/12/02/map-map-and-flatmap-in-scala/ Spark doc: map(func): Return a new distributed dataset formed by passing each element of the source through a functi

Re: What is the difference between map and flatMap

2014-03-12 Thread Ngoc Dao
> Can someone explain to me the difference between map and flatMap Similarity: Both transform collection A to collection B. Difference: * map: One element in A is transformed to one element. One -> one. Size of B = size of A. * flatMap: One element in A is transformed to 0 or more elements, then

What is the difference between map and flatMap

2014-03-12 Thread goi cto
Hi, Can someone explain to me the difference between map and flatMap and what is a good use case for each? -- Eran | CTO

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Sean Owen
On Wed, Mar 12, 2014 at 7:36 AM, Nick Pentreath wrote: > @Sean, would it be a good idea to look at changing the regularization in > Spark's ALS to alpha * lambda? What is the thinking behind this? If I > recall, the Mahout version added something like (# ratings * lambda) as > regularization in ea

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Sebastian Schelter
The Mahout implementation is just a straightforward port of the paper; no changes have been made. On 03/12/2014 08:36 AM, Nick Pentreath wrote: It would be helpful to know what parameter inputs you are using. If the regularization schemes are different (by a factor of alpha, which can often b

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Nick Pentreath
It would be helpful to know what parameter inputs you are using. If the regularization schemes are different (by a factor of alpha, which can often be quite high) this will mean that the same parameter settings could give very different results. A higher lambda would be required with Spark's versi