Re: Spark Streaming not processing file with particular number of entries

2014-06-05 Thread praveshjain1991
Hi, I am using Spark-1.0.0 over a 3-node cluster with 1 master and 2 slaves. I am trying to run the LR algorithm over Spark Streaming. package org.apache.spark.examples.streaming; import java.io.BufferedReader; import java.io.BufferedWriter; import java.io.FileWriter; import jav

Spark Streaming NetworkReceiver problems

2014-06-05 Thread zzzzzqf12345
hi, here is the problem description: I wrote a custom NetworkReceiver to receive image data from a camera, and I confirmed that all the data is received correctly. 1) When data is received, only the NetworkReceiver node runs at full speed while the other nodes stay idle; my spark cluster has 6 nodes. 2) And every image

KryoException: Unable to find class

2014-06-05 Thread Justin Yip
Hello, I have been using Externalizer from Chill as a serialization wrapper. It appears to me that Spark has some classloader conflict with Chill. I have the following (simplified) program: import java.io._ import com.twitter.chill.Externalizer class X(val i: Int) { override

Re: Twitter feed options?

2014-06-05 Thread Jeremy Lee
Nope, sorry, never mind! I looked at the source, and it was pretty obvious that it didn't implement that yet, so I've ripped the classes out and am mutating them into new receivers right now... ... starting to get the hang of this. On Fri, Jun 6, 2014 at 1:07 PM, Jeremy Lee wrote: > > Me aga

Re: spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
Nice explanation... Thanks! On Thu, Jun 5, 2014 at 5:50 PM, Sandy Ryza wrote: > Hi Xu, > > As crazy as it might sound, this all makes sense. > > There are a few different quantities at play here: > * the heap size of the executor (controlled by --executor-memory) > * the amount of memory spark

RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
If some tasks have no locality preference, they will also show up as PROCESS_LOCAL; I think we probably need to name it NO_PREFER to make it clearer. Not sure if this is your case. Best Regards, Raymond Liu From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan Chung Se

Twitter feed options?

2014-06-05 Thread Jeremy Lee
Me again. Things have been going well, actually. I've got my build chain sorted, and 1.0.0 and streaming are working reliably. I managed to turn off the INFO messages by messing with every log4j properties file on the system. :-) One thing I would like to try now is some natural language processing on

Re: Setting executor memory when using spark-shell

2014-06-05 Thread hassan
just use -Dspark.executor.memory= -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-executor-memory-when-using-spark-shell-tp7082p7103.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Sean, your patch fixes the issue, thank you so much! (This is the second time within one week that I've run into network libraries not shutting down threads properly; I'm really glad your code fixes the issue.) I saw your pull request is closed, but not merged yet. Can I do anything to get your fix into

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
Hi Roger, You should be able to sort within partitions using the rdd.mapPartitions() method, and that shouldn't require holding all data in memory at once. It does require holding the entire partition in memory though. Do you need the partition to never be held in memory all at once? As far as

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
I think it would be very handy to be able to specify that you want sorting during a partitioning stage. On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover wrote: > Hi Aaron, > > When you say that sorting is being worked on, can you elaborate a little > more please? > > In particular, I want to sort the

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
Hi Aaron, When you say that sorting is being worked on, can you elaborate a little more please? In particular, I want to sort the items within each partition (not globally) without necessarily bringing them all into memory at once. Thanks, Roger On Sat, May 31, 2014 at 11:10 PM, Aaron Davidso

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
Additionally, I've encountered a confusing situation where the locality level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache the data. I wonder if some implicit caching happens even without the user specifying anything. On Thu, Jun 5, 2014 at 3:50 PM, Sung Hwan Chung wrote: >

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
Thanks Andrew. Is there a chance that, even with full caching, modes other than PROCESS_LOCAL will be used? E.g., let's say an executor will try to perform tasks although the data are cached on a different executor. What I'd like to do is to prevent such a scenario entirely. I'd like to kno

Re: Spark streaming on load run - How to increase single node capacity?

2014-06-05 Thread RodrigoB
Hi Wayne, Thanks for the reply. I did raise the thread max before posting, based on your previous comment on another post, using ulimit -n 2048. That seemed to have helped with the out-of-memory issue. I'm curious if this is standard procedure for scaling a spark node's resources vertically or is it just

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Andrew Ash
The locality is how close the data is to the code that's processing it. PROCESS_LOCAL means data is in the same JVM as the code that's running, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so is a little slower beca
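A knob related to the rest of this thread (not mentioned in this particular reply) is the scheduler's locality wait, which controls how long a pending task waits for a slot at its preferred locality level before being launched at a less local one. A minimal sketch, with the wait value purely a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("locality-tuning")           // hypothetical app name
      .set("spark.locality.wait", "60000")     // ms to wait for a preferred slot before falling back (placeholder value)
    val sc = new SparkContext(conf)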

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
On a related note, I'd also minimize any kind of executor movement. I.e., once an executor is spawned and data cached in the executor, I want that executor to live all the way till the job is finished, or the machine fails in a fatal manner. What would be the best way to ensure that this is the ca

Re: Join : Giving incorrect result

2014-06-05 Thread Andrew Ash
Hi Ajay, Can you please try running the same code with spark.shuffle.spill=false and see if the numbers turn out correctly? That parameter controls whether or not the buggy code that Matei fixed in ExternalAppendOnlyMap is used. FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I t
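A minimal sketch of trying that flag, assuming the rest of the job code is unchanged (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val sparkConf = new SparkConf()
      .setAppName("join-spill-check")          // hypothetical app name
      .set("spark.shuffle.spill", "false")     // bypass the spill-to-disk path while testing the join
    val sc = new SparkContext(sparkConf)
    // ... run the same join and compare the counts ...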

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
Oh, my apologies, that was for 1.0. For Spark 0.9 I did it like this: MASTER=spark://mymaster:7077 SPARK_MEM=8g ./bin/spark-shell -c $CORES_ACROSS_CLUSTER The downside of this though is that SPARK_MEM also sets the driver's JVM to be 8g, rather than just the executors. I think this is the reason f

When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When these happen things get extremely slow. Does this mean that the executor got terminated and restarted? Is there a way to prevent this from happening (ba

Re: Join : Giving incorrect result

2014-06-05 Thread Matei Zaharia
Hey Ajay, thanks for reporting this. There was indeed a bug, specifically in the way join tasks spill to disk (which happened when you had more concurrent tasks competing for memory). I’ve posted a patch for it here: https://github.com/apache/spark/pull/986. Feel free to try that if you’d like;

Re: spark worker and yarn memory

2014-06-05 Thread Sandy Ryza
Hi Xu, As crazy as it might sound, this all makes sense. There are a few different quantities at play here: * the heap size of the executor (controlled by --executor-memory) * the amount of memory Spark requests from YARN (the heap size plus 384 MB to account for fixed memory costs outside of the
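With the 8192 MB container limit from the original question, that breakdown works out roughly as follows (YARN may additionally round requests up to its allocation increment):

    --executor-memory 8G  ->  8192 MB heap + 384 MB overhead = 8576 MB requested  >  8192 MB limit  (container refused)
    --executor-memory 7G  ->  7168 MB heap + 384 MB overhead = 7552 MB requested  <= 8192 MB limit  (container granted)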

Re: implicit ALS dataSet

2014-06-05 Thread Sean Owen
On Thu, Jun 5, 2014 at 10:38 PM, redocpot wrote: > can be simplified by taking advantage of its algebraic structure, so > negative observations are not needed. This is what I think at the first time > I read the paper. Correct, a big part of the reason that is efficient is because of sparsity of

Re: implicit ALS dataSet

2014-06-05 Thread redocpot
Thank you for your quick reply. As far as I know, the update does not require negative observations, because the update rule Xu = (YtCuY + λI)^-1 Yt Cu P(u) can be simplified by taking advantage of its algebraic structure, so negative observations are not needed. This is what I think at the firs

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Oleg Proudnikov
Thank you, Andrew, I am using Spark 0.9.1 and tried your approach like this: bin/spark-shell --driver-java-options "-Dspark.executor.memory=$MEMORY_PER_EXECUTOR" I get bad option: '--driver-java-options' There must be something different in my setup. Any ideas? Thank you again, Oleg On 5

Spark Streaming, download a s3 file to run a script shell on it

2014-06-05 Thread Gianluca Privitera
Hi, I've got a weird question but maybe someone has already dealt with it. My Spark Streaming application needs to - download a file from an S3 bucket, - run a script with the file as input, - create a DStream from this script output. I've already got the second part done with the rdd.pipe() API th

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
Hi Oleg, I set the size of my executors on a standalone cluster when using the shell like this: ./bin/spark-shell --master $MASTER --total-executor-cores $CORES_ACROSS_CLUSTER --driver-java-options "-Dspark.executor.memory=$MEMORY_PER_EXECUTOR" It doesn't seem particularly clean, but it works.

Setting executor memory when using spark-shell

2014-06-05 Thread Oleg Proudnikov
Hi All, Please help me set Executor JVM memory size. I am using Spark shell and it appears that the executors are started with a predefined JVM heap of 512m as soon as Spark shell starts. How can I change this setting? I tried setting SPARK_EXECUTOR_MEMORY before launching Spark shell: export SPA
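Putting the replies above together, a minimal sketch for Spark 1.0 on a standalone cluster (master URL, core count and memory value are placeholders):

    export MASTER=spark://mymaster:7077
    ./bin/spark-shell --master $MASTER \
      --total-executor-cores 16 \
      --driver-java-options "-Dspark.executor.memory=2g"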

Examples

2014-06-05 Thread Tim Kellogg
Hi, I’m still having trouble running the CassandraTest example from the Spark-1.0.0 binary package. I’ve made a Stackoverflow question for it so you can get some street cred for helping me :) http://stackoverflow.com/q/24069039/503826 Thanks! Tim Kellogg Sr. Software Engineer, Protocols 2leme

creating new ami image for spark ec2 commands

2014-06-05 Thread Matt Work Coarr
How would I go about creating a new AMI image that I can use with the spark ec2 commands? I can't seem to find any documentation. I'm looking for a list of steps that I'd need to perform to make an Amazon Linux image ready to be used by the spark ec2 tools. I've been reading through the spark 1.0

Re: Loading Python libraries into Spark

2014-06-05 Thread Andrei
In my answer I assumed you run your program with the "pyspark" command (e.g. "pyspark mymainscript.py"; pyspark should be on your path). In this case the workflow is as follows: 1. You create a SparkConf object that simply contains your app's options. 2. You create a SparkContext, which initializes your appli

Re: SQLContext and HiveContext Query Performance

2014-06-05 Thread Michael Armbrust
For a dataset as small as this one you could probably reduce the number of shuffle partitions. This will be possible once https://github.com/apache/spark/pull/956 is merged. On Thu, Jun 5, 2014 at 11:31 AM, ssb61 wrote: > Any inputs to reduce the time duration for mapPartitions at > Exchange.s

Seattle Spark Meetup: Machine Learning Streams with Spark 1.0

2014-06-05 Thread Denny Lee
If you’re in the Seattle area on 6/24, come join us at the Madrona Ventures building in downtown Seattle for the session: Machine Learning Streams with Spark 1.0. For more information, please check out our meetup event: http://www.meetup.com/Seattle-Spark-Meetup/events/187375042/ Enjoy! Denn

Re: SQLContext and HiveContext Query Performance

2014-06-05 Thread ssb61
Any inputs to reduce the time duration for mapPartitions at Exchange.scala:44 from 13 s? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-and-HiveContext-Query-Performance-tp6948p7075.html Sent from the Apache Spark User List mailing list archive a

Re: reuse hadoop code in Spark

2014-06-05 Thread Matei Zaharia
Use RDD.mapPartitions to go over all the items in a partition with one Mapper object. It will look something like this: rdd.mapPartitions(iterator => val mapper = new myown.Mapper() mapper.configure(conf) val output = // {{create an OutputCollector that stores stuff in an ArrayBuffer}} f
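One way the truncated sketch above might be filled in, assuming an old-style org.apache.hadoop.mapred.Mapper whose input is (LongWritable, Text) and output is (Text, IntWritable); the real types of myown.Mapper are not shown in the thread:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, OutputCollector, Reporter}

    val conf = new JobConf()                        // configuration for the existing mapper

    // rdd: RDD[(LongWritable, Text)], e.g. obtained via sc.hadoopFile(...)
    val mapped = rdd.mapPartitions { iterator =>
      val mapper = new myown.Mapper()               // the already-written Hadoop Mapper
      mapper.configure(conf)
      val buffer = new ArrayBuffer[(Text, IntWritable)]
      // OutputCollector that simply appends every emitted pair to the buffer
      val collector = new OutputCollector[Text, IntWritable] {
        override def collect(key: Text, value: IntWritable): Unit = buffer += ((key, value))
      }
      iterator.foreach { case (k, v) => mapper.map(k, v, collector, Reporter.NULL) }
      buffer.iterator
    }

Collecting into a buffer assumes the mapper's output for one partition fits in memory, which is the trade-off of this approach.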

Re: Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi Andrei, Thank you for your help! Just to make sure I understand, when I run this command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged into the master node and have my python files somewhere, is that correct? -- View this message in context: http://apache-spark-user-li

Re: Unable to run a Standalone job([NOT FOUND ] org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v201005082020)

2014-06-05 Thread Sean Owen
Hm, I am not sure what to make of that. It seems like something else: http://stackoverflow.com/questions/9889674/sbt-jetty-and-servlet-3-0 A glance at this suggests that these artifacts have a different custom packaging type "orbit". I had not seen that before. I think Maven figures it out since

Re: implicit ALS dataSet

2014-06-05 Thread Sean Owen
The paper definitely does not suggest that you should include every user-item pair in the input. The input is by nature extremely sparse, so literally filling in all the 0s in the input would create overwhelmingly large input. No, there is no need to do it and it would be terrible for performance.

Re: compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
Thanks.. it works now. -Simon On Thu, Jun 5, 2014 at 10:47 AM, Nick Pentreath wrote: > Have you set the persistence level of the RDD to MEMORY_ONLY_SER ( > http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? > If you're calling cache, the default persistence level is M

Re: Loading Python libraries into Spark

2014-06-05 Thread Andrei
For third party libraries the simplest way is to use Puppet [1] or Chef [2] or any similar automation tool to install packages (either from PIP [2] or from the distribution's repository). It's easy because if you manage your cluster's software you are most probably already using one of these automation

implicit ALS dataSet

2014-06-05 Thread redocpot
Hi, According to the paper on which MLlib's ALS is based, the model should take all user-item preferences as an input, including those which are not related to any input observation (zero preference). My question is: With all positive observations in hand (similar to explicit feedback data set),
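A minimal sketch of feeding only the observed (positive) preferences to MLlib's implicit ALS, in line with the replies above; the input path and the rank/iterations/lambda/alpha values are all placeholders:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // One record per *observed* user-item interaction; unobserved pairs are simply absent.
    val observations = sc.textFile("hdfs:///path/to/interactions")   // hypothetical input
      .map(_.split(","))
      .map { case Array(user, item, count) => Rating(user.toInt, item.toInt, count.toDouble) }

    // trainImplicit(ratings, rank, iterations, lambda, alpha)
    val model = ALS.trainImplicit(observations, 10, 20, 0.01, 40.0)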

Re: Native library can not be loaded when using Mllib PCA

2014-06-05 Thread Xiangrui Meng
For standalone and yarn mode, you need to install native libraries on all nodes. The best solution is installing them to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 . If your matrix is sparse, the native libraries cannot help because they are for dense linear algebra. You can create RDD of

Re: reuse hadoop code in Spark

2014-06-05 Thread Wei Tan
Thanks Matei. Using your pointers I can import data from HDFS. What I want to do now is something like this in Spark: --- import myown.mapper rdd.map (mapper.map) --- The reason why I want this: myown.mapper is a Java class I already developed. I used

Re: Unable to run a Standalone job([NOT FOUND ] org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v201005082020)

2014-06-05 Thread Shrikar archak
Hi Prabeesh/Sean, I tried both the steps you guys mentioned; it looks like it's still not able to resolve it. [warn] [NOT FOUND ] org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645!javax.transaction.orbit (131ms) [warn] public: tried [warn] http://repo1.maven.org/maven2/org/eclipse/jetty/o

Scala By the Bay Developer Conference and Training Registration

2014-06-05 Thread Alexy Khrabrov
Scala by the Bay registration and training is now open! We are assembling a great two-day program for Scala By the Bay www.scalabythebay.org -- the yearly SF Scala developer conference. This year the conference itself is on August 8-9 in Fort Mason, near the Golden Gate bridge, with the Scala t

Re: compress in-memory cache?

2014-06-05 Thread Nick Pentreath
Have you set the persistence level of the RDD to MEMORY_ONLY_SER ( http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? If you're calling cache, the default persistence level is MEMORY_ONLY so that setting will have no impact. On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Che
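A minimal sketch of combining the two settings being discussed (app name and input path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("cache-compression")          // hypothetical app name
      .set("spark.rdd.compress", "true")        // only takes effect for *serialized* storage levels
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///big/dataset")   // hypothetical input path
    // cache() defaults to MEMORY_ONLY, which ignores spark.rdd.compress;
    // MEMORY_ONLY_SER stores serialized (and hence compressible) blocks.
    data.persist(StorageLevel.MEMORY_ONLY_SER)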

compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
I have a working set larger than available memory, thus I am hoping to turn on rdd compression so that I can store more in-memory. Strangely it made no difference. The number of cached partitions, fraction cached, and size in memory remain the same. Any ideas? I confirmed that rdd compression wasn

Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi, I am new to Spark (and almost-new in python!). How can I download and install a Python library in my cluster so I can just import it later? Any help would be much appreciated. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-li

spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
I am slightly confused about the "--executor-memory" setting. My yarn cluster has a maximum container memory of 8192MB. When I specify "--executor-memory 8G" in my spark-shell, no container can be started at all. It only works when I lower the executor memory to 7G. But then, on yarn, I see 2 cont

Re: Spark Streaming not processing file with particular number of entries

2014-06-05 Thread praveshjain1991
The same issue persists in spark-1.0.0 as well (was using 0.9.1 earlier). Any suggestions are welcomed. -- Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-processing-file-with-particular-number-of-entries-tp6694p7056.html Sent fr

Re: Serialization problem in Spark

2014-06-05 Thread Vibhor Banga
Any inputs on this will be helpful. Thanks, -Vibhor On Thu, Jun 5, 2014 at 3:41 PM, Vibhor Banga wrote: > Hi, > > I am trying to do something like following in Spark: > > JavaPairRDD<byte[], MyObject> eventRDD = hBaseRDD.map(new > PairFunction<Tuple2<ImmutableBytesWritable, Result>, byte[], MyObject>() { @Override > public

Re: Better line number hints for logging?

2014-06-05 Thread Daniel Darabos
On Wed, Jun 4, 2014 at 10:39 PM, Matei Zaharia wrote: > That’s a good idea too, maybe we can change CallSiteInfo to do that. > I've filed an issue: https://issues.apache.org/jira/browse/SPARK-2035 Matei > > On Jun 4, 2014, at 8:44 AM, Daniel Darabos < > daniel.dara...@lynxanalytics.com> wrote:

Re: How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Sean, thanks for your link! I will try this ASAP! On Thu, Jun 5, 2014 at 6:49 PM, Sean Owen wrote: > However I do seem to be able to shut down everything cleanly and > terminate my (Java-based) program. I just call > StreamingContext.stop(true, true). I don't know why it's different. I think th
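For reference, the call quoted from Sean's message looks roughly like this in a driver program (the shutdown-hook wrapping, app name and batch interval are assumptions, not part of the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("kafka-shutdown-demo")   // hypothetical app name
    val ssc = new StreamingContext(sparkConf, Seconds(2))               // batch interval is a placeholder
    // ... set up the Kafka input stream and the processing graph here ...
    ssc.start()

    // Stop both the StreamingContext and the underlying SparkContext,
    // letting in-flight batches finish (stopSparkContext = true, stopGracefully = true).
    sys.addShutdownHook { ssc.stop(true, true) }
    ssc.awaitTermination()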

Re: Problem with serialization and deserialization

2014-06-05 Thread Stefan van Wouw
Dear Aneesh, Your particular use case of using Swing GUI components with Spark is a bit unclear to me. Assuming that you want Spark to operate on a tree object, you could use an implementation of the TreeModel ( http://docs.oracle.com/javase/8/docs/api/javax/swing/tree/DefaultTreeModel.html

Problem with serialization and deserialization

2014-06-05 Thread ANEESH .V.V
hi, I have a JTree. I want to serialize it using sc.saveAsObjectFile("path"). I could save it to some location. The real problem is that when I deserialize it back using sc.objectFile(), I am not getting the JTree back. Can anyone please help me with this? Thanks
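Stefan's reply above suggests working with a serializable tree model rather than the Swing component; a minimal sketch of the save/load round trip with a plain serializable structure (the case class and path are purely illustrative):

    // A simple serializable tree node, standing in for the JTree's data.
    case class Node(label: String, children: Seq[Node])

    val tree = Node("root", Seq(Node("left", Nil), Node("right", Nil)))

    sc.parallelize(Seq(tree)).saveAsObjectFile("hdfs:///tmp/tree")   // hypothetical path
    val restored = sc.objectFile[Node]("hdfs:///tmp/tree").first()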

Serialization problem in Spark

2014-06-05 Thread Vibhor Banga
Hi, I am trying to do something like the following in Spark: JavaPairRDD<byte[], MyObject> eventRDD = hBaseRDD.map(new PairFunction<Tuple2<ImmutableBytesWritable, Result>, byte[], MyObject>() { @Override public Tuple2<byte[], MyObject> call(Tuple2<ImmutableBytesWritable, Result> immutableBytesWritableResultTuple2) throws Exception { return new Tuple2(immutableBytes

Re: Spark not working with mesos

2014-06-05 Thread praveshjain1991
Hi Ajatix. Yes, HADOOP_HOME is set on the nodes and I did update the bash config. As I said, adding MESOS_HADOOP_HOME did not work. But what is causing the original error: "java.lang.Error: java.io.IOException: failure to login"? -- Thanks -- View this message in context: http://apache-spa

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng, Thanks a lot. That solved my problem. Thanks again for the quick response and solution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7047.html Sent from the Apache Spark User List mailing list

Re: How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Sean Owen
Yes, I noted the same two issues -- there is an Executor that is never closed down, and the ConsumerConnector is never shut down. I foolishly tacked on a change to this effect on a different PR (https://github.com/apache/spark/pull/926/files#diff-bf41e92a42a1bdb3bc1662fd9fc50fe2L38) Maybe I can just

Spark Kafka streaming - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-06-05 Thread Gaurav Dasgupta
Hi, I have written my own custom Spark streaming code which connects to Kafka server and fetch data. I have tested the code on local mode and it is working fine. But when I am executing the same code on YARN mode, I am getting KafkaReceiver class not found exception. I am providing the Spark Kafka

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Cheng Lian
Hmm… my bad. The reason for the first exception is that the Iterator class is not serializable, since my snippet tries to return something like RDD[(String, Iterator[(Double, Double)])]. As for the second one, the for expression returns an iterator rather than a collection, so you need to traverse the it

How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Hi, I am trying to use Spark Streaming with Kafka, which works like a charm -- except for shutdown. When I run my program with "sbt run-main", sbt will never exit, because there are two non-daemon threads left that don't die. I created a minimal example at

Native library can not be loaded when using Mllib PCA

2014-06-05 Thread yangliuyu
Hi, We're using MLlib (1.0.0 release version) on a k-means clustering problem. We want to reduce the matrix column size before sending the points to the k-means solver. It works on my mac with the local mode: spark-test-run-assembly-1.0.jar contains my application code, com.github.fommil, netlib code an
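For the dense case, a minimal sketch of the column-reduction step in front of k-means (the component count, k, iteration count and the input RDD are placeholders; as Xiangrui's reply above notes, the native BLAS/LAPACK path only applies to dense data):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // points: RDD[Vector] with the original high-dimensional features (assumed to exist)
    val mat = new RowMatrix(points)
    val pc = mat.computePrincipalComponents(50)   // keep the top 50 components (placeholder)
    val reduced = mat.multiply(pc).rows           // rows projected onto the principal components
    val model = KMeans.train(reduced, 10, 20)     // k and iteration count are placeholders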

Re: ClassCastException when using saveAsTextFile

2014-06-05 Thread Anwar Rizal
Hi Niko, I execute the script in 0.9/CDH5 using spark-shell , and it does not generate ClassCastException. Which version are you using and can you give more stack trace ? Cheers, a. On Tue, Mar 25, 2014 at 7:55 PM, Niko Stahl wrote: > Ok, so I've been able to narrow down the problem to this

Re: Spark not working with mesos

2014-06-05 Thread ajatix
I do assume that you've added HADOOP_HOME to your environment variables. Otherwise, you could fill in the actual path of Hadoop on your cluster. Also, did you update the bash config? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-working-with-mesos-tp6806

Re: Error related to serialisation in spark streaming

2014-06-05 Thread nilmish
Thanks a lot for your reply. I can see the Kryo serialiser in the UI. I have one more query: I wanted to know the meaning of the following log message when running a spark streaming job: [spark-akka.actor.default-dispatcher-18] INFO org.apache.spark.streaming.scheduler.JobScheduler - Total dela

Re: Unable to run a Standalone job

2014-06-05 Thread prabeesh k
Try the sbt clean command before building the app, or delete the .ivy2 and .sbt folders (not a good method). Then try to rebuild the project. On Thu, Jun 5, 2014 at 11:45 AM, Sean Owen wrote: > I think this is SPARK-1949 again: https://github.com/apache/spark/pull/906 > I think this change fixed this is

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-05 Thread Jeremy Lee
I shan't be far. I'm committed now. Spark and I are going to have a very interesting future together, but hopefully future messages will be about the algorithms and modules, and less "how do I run make?". I suspect doing this at the exact moment of the 0.9 -> 1.0.0 transition hasn't helped me. (I

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-05 Thread prabeesh k
Hi Jeremy, if you are using *addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4")* in "project/plugin.sbt", you also need to edit "project/project/build.scala" with the same version (0.11.4), like: import sbt._ object Plugins extends Build { lazy val root = Project("root", file(".")

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful. It sounds like you're trying to determine the home location of a user, or something similar. If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map a
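One common shape for that kind of aggregation (a generic illustration, not necessarily what the truncated reply goes on to describe; field names and grid resolution are assumptions):

    import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

    // records: RDD[(String, (Double, Double))] of (userId, (lat, lon))
    val homeGuess = records
      .map { case (user, (lat, lon)) =>
        // Snap coordinates to a coarse grid cell so nearby points aggregate together.
        ((user, (math.round(lat * 100) / 100.0, math.round(lon * 100) / 100.0)), 1)
      }
      .reduceByKey(_ + _)                                       // visits per (user, cell)
      .map { case ((user, cell), count) => (user, (cell, count)) }
      .reduceByKey { (a, b) => if (a._2 >= b._2) a else b }     // most visited cell per user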

Re: Re: mismatched hdfs protocol

2014-06-05 Thread bluejoe2008
OK, I see. I imported the wrong jar files, which only work well with the default Hadoop version. 2014-06-05 bluejoe2008 From: prabeesh k Date: 2014-06-05 16:14 To: user Subject: Re: Re: mismatched hdfs protocol If you are not setting the Spark hadoop version, Spark built using default hadoop version("1.0.

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng, Sorry again. In this method, I see that the values for a <- positions.iterator and b <- positions.iterator always remain the same. I tried to do b <- positions.iterator.next, and it throws an error: value filter is not a member of (Double, Double). Is there something I

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-05 Thread Nick Pentreath
Great - well we do hope we hear from you, since the user list is for interesting success stories and anecdotes, as well as blog posts etc too :) On Thu, Jun 5, 2014 at 9:40 AM, Jeremy Lee wrote: > Oh. Yes of course. *facepalm* > > I'm sure I typed that at first, but at some point my fingers dec

Re: Re: mismatched hdfs protocol

2014-06-05 Thread prabeesh k
If you are not setting the Spark Hadoop version, Spark is built using the default Hadoop version ("1.0.4"). Before importing the Spark-1.0.0 libraries, build Spark using the *SPARK_HADOOP_VERSION=2.4.0 sbt/sbt assembly* command. On Thu, Jun 5, 2014 at 12:28 PM, bluejoe2008 wrote: > thank you! > > i am deve

Re: Can't seem to link "external/twitter" classes from my own app

2014-06-05 Thread Jeremy Lee
Oh. Yes of course. *facepalm* I'm sure I typed that at first, but at some point my fingers decided to grammar-check me. Stupid fingers. I wonder what "sbt assemble" does? (apart from error) It certainly takes a while to do it. Thanks for the maven offer, but I'm not scheduled to learn that until

Re: Join : Giving incorrect result

2014-06-05 Thread Ajay Srivastava
Sorry for replying late. It was night here. Lian/Matei, Here is the code snippet: sparkConf.set("spark.executor.memory", "10g") sparkConf.set("spark.cores.max", "5") val sc = new SparkContext(sparkConf) val accId2LocRDD = sc.textFile("hdfs://bbr-dev178:9000/data/subDbSp

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng, Thank you for your response. When I tried your solution, .mapValues { positions => for { a <- positions.iterator; b <- positions.iterator; if lessThan(a, b) && distance(a, b) < 100 } yield { (a, b) } }, I got the result
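For reference, here is the snippet under discussion written out on its own, with stand-in implementations of lessThan and distance (both bodies are assumptions; the thread does not show them):

    import org.apache.spark.SparkContext._   // pair-RDD functions such as mapValues

    def lessThan(a: (Double, Double), b: (Double, Double)): Boolean =
      a._1 < b._1 || (a._1 == b._1 && a._2 < b._2)        // assumed ordering on coordinate pairs

    def distance(a: (Double, Double), b: (Double, Double)): Double =
      math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))   // assumed Euclidean distance

    // grouped: RDD[(String, Seq[(Double, Double)])], e.g. the result of a groupByKey
    val closePairs = grouped.mapValues { positions =>
      (for {
        a <- positions.iterator
        b <- positions.iterator
        if lessThan(a, b) && distance(a, b) < 100
      } yield (a, b)).toList     // materialize, since an Iterator itself is not serializable
    }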