preservesPartitioning

2014-07-17 Thread Kamal Banga
Hi All, The function *mapPartitions* in RDD.scala takes a boolean parameter *preservesPartitioning*. It seems if that parameter is passed as *false*, the passed function f will operate on the data only

Re: preservesPartitioning

2014-07-17 Thread Matei Zaharia
Hi Kamal, This is not what preservesPartitioning does -- actually what it means is that if the RDD has a Partitioner set (which means it's an RDD of key-value pairs and the keys are grouped in a known way, e.g. hashed or range-partitioned), your map function is not changing the partition of k
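A minimal sketch of the behavior described above, assuming a spark-shell session with sc available (keys, values, and partition count are made up): the mapping function keeps each key unchanged, so it is safe to tell Spark the existing partitioner still applies.

```scala
import org.apache.spark.HashPartitioner

// A pair RDD hash-partitioned by key (toy data).
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))

// The function below does not touch the keys, so declaring
// preservesPartitioning = true keeps the HashPartitioner attached.
val bumped = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)

// Partitioner retained: a later join/reduceByKey on the same keys can avoid a shuffle.
assert(bumped.partitioner == pairs.partitioner)
```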

Re: Read all the columns from a file in spark sql

2014-07-17 Thread Brad Miller
Hi Pandees, You may also be helped by looking into the ability to read and write Parquet files which is available in the present release. Parquet files allow you to store columnar data in HDFS. At present, Spark "infers" the schema from the Parquet file. In pyspark, some of the methods you'd be

Re: Simple record matching using Spark SQL

2014-07-17 Thread Sarath Chandra
Hi Michael, Soumya, Can you please check and let me know what is the issue? what am I missing? Let me know if you need any logs to analyze. ~Sarath On Wed, Jul 16, 2014 at 8:24 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Hi Michael, > > Tried it. It's correctly print

Re: Simple record matching using Spark SQL

2014-07-17 Thread Sonal Goyal
Hi Sarath, Are you explicitly stopping the context? sc.stop() Best Regards, Sonal Nube Technologies On Thu, Jul 17, 2014 at 12:51 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Hi Michael, Soumya, >

Re: Simple record matching using Spark SQL

2014-07-17 Thread Sarath Chandra
No Sonal, I'm not doing any explicit call to stop context. If you see my previous post to Michael, the commented portion of the code is my requirement. When I run this over standalone spark cluster, the execution keeps running with no output or error. After waiting for several minutes I'm killing

Re: Kmeans

2014-07-17 Thread Xiangrui Meng
Yes, both run in parallel. Random is a baseline implementation of initialization, which may ignore small clusters. k-means++ improves random initialization by adding weights to points far away from the current candidates. You can view k-means|| as a more scalable version of K-means++. We don't provid
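A short sketch of choosing between the two initialization modes via the Spark 1.0-era MLlib API (input path, k, and iteration counts are placeholders; sc assumed):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse space-separated doubles into MLlib vectors.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// KMeans.train(data, k, maxIterations, runs, initializationMode)
val randomInit   = KMeans.train(points, 10, 20, 1, "random")     // baseline initialization
val parallelInit = KMeans.train(points, 10, 20, 1, "k-means||")  // scalable k-means++
```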

Re: Error: No space left on device

2014-07-17 Thread Xiangrui Meng
Set N be the total number of cores on the cluster or less. sc.textFile doesn't always give you that number, depends on the block size. For MovieLens, I think the default behavior should be 2~3 partitions. You need to call repartition to ensure the right number of partitions. Which EC2 instance typ

Pysparkshell are not listing in the web UI while running

2014-07-17 Thread MEETHU MATHEW
 Hi all, I just upgraded to spark 1.0.1. In spark 1.0.0 when I start Ipython notebook using the following command, it used to come in the running applications tab in master:8080 web UI. IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark But now when I run it, it's not getting listed

class after join

2014-07-17 Thread Luis Guerra
Hi all, I am a newbie Spark user with many doubts, so sorry if this is a "silly" question. I am dealing with tabular data formatted as text files, so when I first load the data, my code is like this: case class data_class( V1: String, V2: String, V3: String, V4: String, V5: String

Re: MLLib - Regularized logistic regression in python

2014-07-17 Thread Xiangrui Meng
1) This is a miss, unfortunately ... We will add support for regularization and intercept in the coming v1.1. (JIRA: https://issues.apache.org/jira/browse/SPARK-2550) 2) It has overflow problems in Python but not in Scala. We can stabilize the computation by ensuring exp only takes a negative value

Re: class after join

2014-07-17 Thread Sean Owen
If what you have is a large number of named strings, why not use a Map[String,String] to represent them? If you're approaching a class with >22 String fields anyway, it probably makes more sense. You lose a bit of compile-time checking, but gain flexibility. Also, merging two Maps to make a new on
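A rough sketch of the Map-per-row alternative Sean suggests (column names, paths, and the join key are made up; sc assumed):

```scala
// Column names for the tabular text files (hypothetical).
val columns = Array("V1", "V2", "V3", "V4", "V5")

def toRow(line: String): Map[String, String] =
  columns.zip(line.split('\t')).toMap

// Key each row by its first field and join the two datasets.
val left  = sc.textFile("hdfs:///data/a.txt").map(toRow).map(r => (r("V1"), r))
val right = sc.textFile("hdfs:///data/b.txt").map(toRow).map(r => (r("V1"), r))

// Merging the two Maps produces the combined "row" without defining a new case class.
val joined = left.join(right).mapValues { case (a, b) => a ++ b }
```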

Re: Kyro deserialisation error

2014-07-17 Thread Tathagata Das
Seems like there is some sort of stream corruption, causing Kryo read to read a weird class name from the stream (the name "arl Fridtjof Rode" in the exception cannot be a class!). Not sure how to debug this. @Patrick: Any idea? On Wed, Jul 16, 2014 at 10:14 PM, Hao Wang wrote: > I am not sur

Re: Kyro deserialisation error

2014-07-17 Thread Sean Owen
Not sure if this helps, but it does seem to be part of a name in a Wikipedia article, and Wikipedia is the data set. So something is reading this class name from the data. http://en.wikipedia.org/wiki/Carl_Fridtjof_Rode On Thu, Jul 17, 2014 at 9:40 AM, Tathagata Das wrote: > Seems like there is

Re: class after join

2014-07-17 Thread Luis Guerra
Thank you for your fast reply. We are considering this Map[String, String] solution, but there are some details that we do not control yet. What would happen if we have different data types for different fields? Also, with this solution, we have to repeat the field names for every "row" that we ha

Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
I am trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values) which is on HDFS (raw txt file is 28 MB) I initially tried the following: val data3 = sc.textFile("hdfs://...inputData.txt") val parsedData3 = data3.map( _.split('\t')

Re: Pysparkshell are not listing in the web UI while running

2014-07-17 Thread Akhil Das
Hi Neethu, Your application is running on local mode and that's the reason why you are not seeing the driver app in the 8080 webUI. You can pass the Master IP to your pyspark and get it running in cluster mode. eg: IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark --master spark://ma

Bad Digest error while doing aws s3 put

2014-07-17 Thread lmk
Hi, I am getting the following error while trying save a large dataset to s3 using the saveAsHadoopFile command with apache spark-1.0. org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/spark_test%2Fsmaato_one_day_phase_2%2Fsmaato_2014_05_17%2F_temporary

Apache kafka + spark + Parquet

2014-07-17 Thread Mahebub Sayyed
Hi All, Currently we are reading (multiple) topics from Apache kafka and storing that in HBase (multiple tables) using twitter storm (1 tuple stores in 4 different tables). but we are facing some performance issue with HBase. so we are replacing *HBase* with *Parquet* file and *storm* with *Apache

Spark scheduling with Capacity scheduler

2014-07-17 Thread Konstantin Kudryavtsev
Hi all, I'm using HDP 2.0, YARN. I'm running both MapReduce and Spark jobs on this cluster, is it possible somehow use Capacity scheduler for Spark jobs management as well as MR jobs? I mean, I'm able to send MR job to specific queue, may I do the same with Spark job? thank you in advance Thank y

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
1. You can put in multiple kafka topics in the same Kafka input stream. See the example KafkaWordCount . However they will all be read thr

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g and repartition your data to match number of CPU cores such that the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage the data. Make sure there are enough memory for caching. -Xiangrui On Thu, Jul 17, 2014 at 1

Re: Kyro deserialisation error

2014-07-17 Thread Hao Wang
Hi, all Yes, it's a name of Wikipedia article. I am running WikipediaPageRank example of Spark Bagels. I am wondering whether there is any relation to buffer size of Kyro. The page rank can be successfully finished, sometimes not because this kind of Kyro exception happens too many times, which b

Re: Pysparkshell are not listing in the web UI while running

2014-07-17 Thread MEETHU MATHEW
Hi Akhil, That fixed the problem...Thanks   Thanks & Regards, Meethu M On Thursday, 17 July 2014 2:26 PM, Akhil Das wrote: Hi Neethu, Your application is running on local mode and that's the reason why you are not seeing the driver app in the 8080 webUI. You can pass the Master IP to yo

GraphX Pragel implementation

2014-07-17 Thread Arun Kumar
Hi I am trying to implement the belief propagation algorithm in GraphX using the pregel API. def pregel[A](initialMsg: A, maxIter: Int = Int.MaxValue, activeDir: EdgeDirection = EdgeDirection.Out)(vprog: (VertexId, VD, A) => VD, sendMsg

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Mahebub Sayyed
Hi, To migrate data from *HBase* to *Parquet* we used the following query through *Impala*: INSERT INTO table PARQUET_HASHTAGS( key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year ) partition(

Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
Hi Xiangrui, Yes I am using Spark v0.9 and am not running it in local mode. I did the memory setting using "export SPARK_MEM=4G" before starting the Spark instance. Also previously, I was starting it with -c 1 but changed it to -c 12 since it is a 12 core machine. It did bring down the time tak

Re: Getting in local shell

2014-07-17 Thread newbee88
Could someone please help me resolve "This post has NOT been accepted by the mailing list yet." issue. I registered and subscribed to the mailing list many days ago but my post is still in unaccepted state. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Get

Re: Spark Streaming timing considerations

2014-07-17 Thread Laeeq Ahmed
Hi TD, I have been able to filter the first WindowedRDD, but I am not sure how to make a generic filter. The larger window is 8 seconds and want to fetch 4 second based on application-time-stamp. I have seen an earlier post which suggest timeStampBasedwindow but I am not sure how to make timest

Seattle Spark Meetup: Evan Chan's Interactive OLAP Queries with Spark and Cassandra

2014-07-17 Thread Denny Lee
We had a great Seattle Spark Meetup session with Evan Chan presenting his  Interactive OLAP Queries with Spark and Cassandra.  You can find his awesome presentation at: http://www.slideshare.net/EvanChan2/2014-07olapcassspark. Enjoy!

Equivalent functions for NVL() and CASE expressions in Spark SQL

2014-07-17 Thread pandees waran
Do we have any equivalent scala functions available for NVL() and CASE expressions to use in spark sql?

Re: Simple record matching using Spark SQL

2014-07-17 Thread Sarath Chandra
Added below 2 lines just before the sql query line - ... file1_schema.count; file2_schema.count; ... and it started working. But I couldn't get the reason. Can someone please explain me? What was happening earlier and what is happening with addition of these 2 lines? ~Sarath On Thu, Jul

Re: Simple record matching using Spark SQL

2014-07-17 Thread Michael Armbrust
What version are you running? Could you provide a jstack of the driver and executor when it is hanging? On Thu, Jul 17, 2014 at 10:55 AM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Added below 2 lines just before the sql query line - > *...* > *file1_schema.count;* > *fi

Re: class after join

2014-07-17 Thread Michael Armbrust
If you intern the string it will be more efficient, but still significantly more expensive than the class based approach. ** VERY EXPERIMENTAL ** We are working with EPFL on a lightweight syntax for naming the results of spark transformations in scala (and are going to make it interoperate with SQ

Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi guys, sure you have similar use case and want to know how you deal with that. In our application, we want to check the previous state of some keys and compare with their current state. AFAIK, Spark Streaming does not have key-value access. So current what I am doing is storing the previous and

Re: Spark scheduling with Capacity scheduler

2014-07-17 Thread Matei Zaharia
It's possible using the --queue argument of spark-submit. Unfortunately this is not documented on http://spark.apache.org/docs/latest/running-on-yarn.html but it appears if you just type spark-submit --help or spark-submit with no arguments. Matei On Jul 17, 2014, at 2:33 AM, Konstantin Kudrya

Re: Spark scheduling with Capacity scheduler

2014-07-17 Thread Derek Schoettle
unsubscribe From: Matei Zaharia To: user@spark.apache.org Date: 07/17/2014 12:41 PM Subject:Re: Spark scheduling with Capacity scheduler It's possible using the --queue argument of spark-submit. Unfortunately this is not documented on http://spark.apache.org/docs/latest/run

Error while running example/scala application using spark-submit

2014-07-17 Thread ShanxT
Hi, I am receiving below error while submitting any spark example or scala application. Really appreciate any help. spark version = 1.0.0 hadoop version = 2.4.0 Windows/Standalone mode 14/07/17 22:13:19 INFO TaskSchedulerImpl: Cancelling stage 0 Exception in thread "main" org.apache.spark.SparkE

Need help on Spark UDF (Join) Performance tuning .

2014-07-17 Thread S Malligarjunan
Hello Experts, I am facing performance problem when I use the UDF function call. Please help me to tune the query. Please find the details below shark> select count(*) from table1; OK 151096 Time taken: 7.242 seconds shark> select count(*) from table2;  OK 938 Time taken: 1.273 seconds Without

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr wrote: > Thanks Marcelo, I'm not seeing anything in the logs that clearly explains > what's causing this to break. > > One interesting point that we just discovered is that if we run the driver > and the slave (worker) on the same host it runs, but

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Please try val parsedData3 = data3.repartition(12).map(_.split("\t")).map(_.toDouble).cache() and check the storage and driver/executor memory in the WebUI. Make sure the data is fully cached. -Xiangrui On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan wrote: > Hi Xiangrui, > > Yes I a
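A hedged reading of the suggestion above (the quoted line appears to be missing a nested map; data3 is the RDD[String] from the earlier message in this thread):

```scala
// Parse each tab-separated line into Array[Double], spread the data over
// 12 partitions to match the cores, and cache it before clustering.
val parsedData3 = data3
  .repartition(12)
  .map(_.split("\t").map(_.toDouble))
  .cache()
```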

Re: Spark Streaming Json file groupby function

2014-07-17 Thread srinivas
Hi TD, Thanks for the solutions for my previous post... I am running into another issue... I am getting data from a json file and I am trying to parse it and map it to a record given below val jsonf =lines.map(JSON.parseFull(_)).map(_.get.asInstanceOf[scala.collection.immutable.Map[Any

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread Sean Owen
I imagine the issue is ultimately combination of Windows and (stock?) Apache Hadoop. I know that in the past, operations like 'chmod' didn't work on Windows since it assumed the existence of POSIX binaries. That should be long since patched up for 2.4.x but there may be a gotcha here that others ca

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by "accessing" keys? 2. Yeah, that is hard to do without the ability to do point lookups into an RDD, which we don't support yet. You could try embedding the

Re: Spark Streaming Json file groupby function

2014-07-17 Thread Tathagata Das
This is a basic scala problem. You cannot apply toInt to Any. Try doing toString.toInt For such scala issues, I recommend trying it out in the Scala shell. For example, you could have tried this out as the following. [tdas @ Xion streaming] scala Welcome to Scala version 2.10.3 (Java HotSpot(TM)
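A tiny illustration of the Any-to-Int issue (the map and key are made up):

```scala
// JSON.parseFull-style results come back as Map[String, Any].
val parsed: Map[String, Any] = Map("count" -> "42")

// parsed("count").toInt                      // does not compile: toInt is not a member of Any
val count = parsed("count").toString.toInt    // works: 42
```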

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread ShanxT
Thanks Sean, 1) Yes, I am trying to run locally without Hadoop. 2) I also see the error in the provided link while launching spark-shell but post launch I am able to execute same code I have in the sample application. Read any local file and perform some reduction operation. But not through submit

Re: Spark Streaming timing considerations

2014-07-17 Thread Tathagata Das
You have to define what is the range of records that needs to be filtered out in every windowed RDD, right? For example, when the DStream.window has data from times 0 - 8 seconds by DStream time, you only want to filter out data that falls into say 4 - 8 seconds by application time. This latter i

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread Stephen Boesch
Hi Sean RE: Windows and hadoop 2.4.x HortonWorks - all the hype aside - only supports Windows Server 2008/2012. So this general concept of "supporting Windows" is bunk. Given that - and since the vast majority of Windows users do not happen to have Windows Server on their laptop - do you have an

Re: GraphX Pragel implementation

2014-07-17 Thread Ankur Dave
If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store the incoming messages into the tuple, allowing sendMsg to access them. For example, if
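A rough sketch of the tuple-VD idea (the update rule is a placeholder, not real belief propagation; the input graph and edge weights are assumed):

```scala
import org.apache.spark.graphx._

// Keep the last received message next to the vertex value so that sendMsg
// can read it on the following superstep.
def run(graph: Graph[Double, Double], maxIter: Int): Graph[(Double, Double), Double] = {
  // (vertex value, last received message)
  val withState = graph.mapVertices((_, value) => (value, 0.0))

  withState.pregel(0.0, maxIter)(
    // vprog: update the value and remember the incoming message
    (id, attr, msg) => (attr._1 + msg, msg),
    // sendMsg: the previously received message is available as srcAttr._2
    triplet => Iterator((triplet.dstId, triplet.srcAttr._2 * triplet.attr)),
    // mergeMsg
    _ + _)
}
```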

Custom Metrics Sink

2014-07-17 Thread jjaffe
What is the preferred way of adding a custom metrics sink to Spark? I noticed that the Sink Trait has been private since April, so I cannot simply extend Sink in an outside package, but I would like to avoid having to create a custom build of Spark. Is this possible? -- View this message in cont

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread Sean Owen
I am probably the wrong person to ask as I never use Hadoop on Windows. But from looking at the code just now it is clearly trying to accommodate Windows shell commands. Yes I would not be surprised if it still needs Cygwin. A slightly broader point is that ideally it doesn't matter whether Hadoop

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, ("a" -> 1, "b" -> 2, "c" -> 3) 2. In the following 2 second interval, I only receive "c" and "d" and thei

Re: Equivalent functions for NVL() and CASE expressions in Spark SQL

2014-07-17 Thread Zongheng Yang
Hi Pandees, Spark SQL introduced support for CASE expressions just recently and it is available in 1.0.1. As for NVL(), I don't think we support it yet, and if you are interested a pull request will be much appreciated! Thanks, Zongheng On Thu, Jul 17, 2014 at 7:26 AM, pandees waran wrote: > Do
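A quick sketch of a CASE expression against a registered table (table and column names are made up; sqlContext is an assumed existing SQLContext):

```scala
val labelled = sqlContext.sql(
  "SELECT name, CASE WHEN age > 18 THEN 'adult' ELSE 'minor' END AS bucket FROM people")
labelled.collect().foreach(println)
```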

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-17 Thread hsy...@gmail.com
Thanks Tathagata, so can I say RDD size (from the stream) is window size, and the overlap between 2 adjacent RDDs is sliding size. But I still don't understand what batch size is, why do we need this since data processing is RDD by RDD right? And does spark chop the data into RDDs at the very beg

Re: Error: No space left on device

2014-07-17 Thread Chris DuBois
Hi Xiangrui, Thanks. I have taken your advice and set all 5 of my slaves to be c3.4xlarge. In this case /mnt and /mnt2 have plenty of space by default. I now do sc.textFile(blah).repartition(N).map(...).cache() with N=80 and spark.executor.memory to be 20gb and --driver-memory 20g. So far things s

Re: using multiple dstreams together (spark streaming)

2014-07-17 Thread Walrus theCat
Thanks! On Wed, Jul 16, 2014 at 6:34 PM, Tathagata Das wrote: > Have you taken a look at DStream.transformWith( ... ) . That allows you > apply arbitrary transformation between RDDs (of the same timestamp) of two > different streams. > > So you can do something like this. > > 2s-window-stream.t

Include permalinks in mail footer

2014-07-17 Thread Nick Chammas
Can we modify the mailing list to include permalinks to the thread in the footer of every email? Or at least of the initial email in a thread? I often find myself wanting to reference one thread from another, or from a JIRA issue. Right now I have to google the thread subject and find the link tha

An abstraction over Spark

2014-07-17 Thread Andrea Esposito
Hi all, I'd like to announce my MSc thesis, which concerns an abstraction developed on top of Spark. More precisely, the computations are graph computations like Pregel/Bagel and the most recent GraphX. Starting from a "graph as a network" view, a protocols stack-based view is conceived.

Re: Supported SQL syntax in Spark SQL

2014-07-17 Thread Nicholas Chammas
FYI: I've created SPARK-2560 to track creating SQL reference docs for Spark SQL. On Mon, Jul 14, 2014 at 2:06 PM, Michael Armbrust wrote: > You can find the parser here: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/

replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Eric Friedman
I used to use SPARK_LIBRARY_PATH to specify the location of native libs for lzo compression when using spark 0.9.0. The references to that environment variable have disappeared from the docs for spark 1.0.1 and it's not clear how to specify the location for lzo. Any guidance?

Re: Error: No space left on device

2014-07-17 Thread Bill Jay
Hi, I also have some issues with repartition. In my program, I consume data from Kafka. After I consume data, I use repartition(N). However, although I set N to be 120, there are around 18 executors allocated for my reduce stage. I am not sure how the repartition command works to ensure the paral

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Matt Work Coarr
Thanks Marcelo! This is a huge help!! Looking at the executor logs (in a vanilla spark install, I'm finding them in $SPARK_HOME/work/*)... It launches the executor, but it looks like the CoarseGrainedExecutorBackend is having trouble talking to the driver (exactly what you said!!!). Do you know

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Zongheng Yang
One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here: http://spark.apache.org/docs/latest/configuration.html On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman wrote: > I used to use SPARK_LIBRARY_PATH to specify the

unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi guys, need some help in this problem. In our use case, we need to continuously insert values into the database. So our approach is to create the jdbc object in the main method and then do the inserting operation in the DStream foreachRDD operation. Is this approach reasonable? Then the problem

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Marcelo Vanzin
Could you share some code (or pseudo-code)? Sounds like you're instantiating the JDBC connection in the driver, and using it inside a closure that would be run in a remote executor. That means that the connection object would need to be serializable. If that sounds like what you're doing, it won't

Re: Include permalinks in mail footer

2014-07-17 Thread Matei Zaharia
Good question.. I'll ask INFRA because I haven't seen other Apache mailing lists provide this. It would indeed be helpful. Matei On Jul 17, 2014, at 12:59 PM, Nick Chammas wrote: > Can we modify the mailing list to include permalinks to the thread in the > footer of every email? Or at least o

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
The updateFunction given in updateStateByKey should be called on ALL the keys that are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by "show all the existing states"? You have access to the latest state RDD by doing stateStream
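A minimal sketch of the behavior described (pairs is an assumed DStream[(String, Int)], e.g. word counts): the update function runs for every key already in the state, even when newValues is empty for that key.

```scala
val stateStream = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)
}

// Access the latest state RDD on every batch, e.g. to inspect all existing states.
stateStream.foreachRDD { rdd =>
  println("keys currently in state: " + rdd.count())
}
```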

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Tathagata Das
And if Marcelo's guess is correct, then the right way to do this would be to lazily / dynamically create the jdbc connection server as a singleton in the workers/executors and use that. Something like this. dstream.foreachRDD(rdd => { rdd.foreachPartition((iterator: Iterator[...]) => {
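A sketch of the lazy-singleton idea (the JDBC URL, credentials, and INSERT statement are placeholders; dstream is the stream from the surrounding discussion): the connection is created at most once per executor JVM and reused across batches.

```scala
import java.sql.{Connection, DriverManager}

object ConnectionHolder {
  // Initialized lazily, once per executor JVM, the first time it is touched.
  lazy val conn: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "pass")
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val stmt = ConnectionHolder.conn.createStatement()
    records.foreach { r =>
      stmt.executeUpdate(s"INSERT INTO events VALUES ('$r')")  // placeholder insert
    }
    stmt.close()
  }
}
```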

Re: Spark Streaming timestamps

2014-07-17 Thread Bill Jay
Hi Tathagata, Thanks for your answer. Please see my further question below: On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das wrote: > Answers inline. > > > On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay > wrote: > >> Hi all, >> >> I am currently using Spark Streaming to conduct a real-time data >> a

how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Chen Song
I am using spark 0.9.0 and I am able to submit job to YARN, https://spark.apache.org/docs/0.9.0/running-on-yarn.html. I am trying to turn on gc logging on executors but could not find a way to set extra Java opts for workers. I tried to set spark.executor.extraJavaOptions but that did not work.

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Chen Song
Thanks Luis and Tobias. On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer wrote: > Hi, > > On Wed, Jul 2, 2014 at 1:57 AM, Chen Song wrote: >> >> * Is there a way to control how far Kafka Dstream can read on >> topic-partition (via offset for example). By setting this to a small >> number, it w

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post val transformedStream = kafkaStream.map ... // whatever transformation you want to do transformedStream.foreachRDD((rdd: RDD[...], time: Time) => { // save the rdd to parquet file, using time as the file
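A sketch of the save step, using the Spark 1.0 SQL API (Record, the output path, and the assumption that transformedStream carries (key, value) string pairs are all placeholders):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time

case class Record(key: String, value: String)

transformedStream.foreachRDD { (rdd, time: Time) =>
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.createSchemaRDD          // implicit RDD[case class] -> SchemaRDD
  val records = rdd.map { case (k, v) => Record(k, v) }
  // One Parquet directory per batch, named by the batch time.
  records.saveAsParquetFile("hdfs:///tweets/batch-" + time.milliseconds)
}
```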

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
Hi Matt, I'm not very familiar with setup on ec2; the closest I can point you at is to look at the "launch_cluster" in ec2/spark_ec2.py, where the ports seem to be configured. On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr wrote: > Thanks Marcelo! This is a huge help!! > > Looking at the exe

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi Marcelo and TD, Thank you for the help. If I use TD's approach, it works and there is no exception. Only drawback is that it will create many connections to the DB, which I was trying to avoid. Here is a snapshot of my code. Mark as red for the important code. What I was thinking is that, if

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Michael Armbrust
We don't have support for partitioned parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406 On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das wrote: > val kafkaStream = KafkaUtils.createStream(... ) // see the example in my > previous post > > val transformedStream =

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Sean Owen
On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang wrote: > Thank you for the help. If I use TD's approache, it works and there is no > exception. Only drawback is that it will create many connections to the DB, > which I was trying to avoid. Connection-like objects aren't data that can be serialized. Wh

Re: Retrieve dataset of Big Data Benchmark

2014-07-17 Thread Tom
Hi Burak, I tried running it through the Spark shell, but I still ended with the same error message as in Hadoop: "java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAcce

Re: Error: No space left on device

2014-07-17 Thread Chris DuBois
Things were going smoothly until I hit the following: py4j.protocol.Py4JJavaError: An error occurred while calling o1282.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED Any ideas why this might occur? This is while running A

Error with spark-submit

2014-07-17 Thread ranjanp
Hi, I am new to Spark and trying out with a stand-alone, 3-node (1 master, 2 workers) cluster. From the Web UI at the master, I see that the workers are registered. But when I try running the SparkPi example from the master node, I get the following message and then an exception. 14/07/17 01:20:36 IN

Large scale ranked recommendation

2014-07-17 Thread m3.sharma
Hi, I am trying to develop a recommender system for about 1 million users and 10 thousand items. Currently it's a simple regression based model where for every user, item pair in dataset we generate some features and learn model from it. Till training and evaluation everything is fine the bottlene

Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
Thanks all! (And thanks Matei for the developer link!) I was able to build using maven[1] but `./sbt/sbt assembly` results in build errors. (Not familiar enough with the build to know why; in the past sbt worked for me and maven did not). I was able to run the master version of pyspark, which wa

Error with spark-submit (formatting corrected)

2014-07-17 Thread ranjanp
Hi, I am new to Spark and trying out with a stand-alone, 3-node (1 master, 2 workers) cluster. From the Web UI at the master, I see that the workers are registered. But when I try running the SparkPi example from the master node, I get the following message and then an exception. 14/07/17 01:

Re: Large scale ranked recommendation

2014-07-17 Thread m3.sharma
We are using RegressionModels that comes with *mllib* package in SPARK. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10103.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Large scale ranked recommendation

2014-07-17 Thread Shuo Xiang
Hi, Are you suggesting that taking simple vector dot products or sigmoid function on 10K * 1M data takes 5hrs? On Thu, Jul 17, 2014 at 3:59 PM, m3.sharma wrote: > We are using RegressionModels that comes with *mllib* package in SPARK. > > > > -- > View this message in context: > http://apache

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there are always a single executor working on this job. Even I use repartition, it seems that there is still a single executor. Does anyone has an idea how to add parallelism to this job? On Thu, Jul 17, 2014 at 2:06 PM, Chen

Re: Large scale ranked recommendation

2014-07-17 Thread m3.sharma
Yes, thats what prediction should be doing, taking dot products or sigmoid function for each user,item pair. For 1 million users and 10 K items data there are 10 billion pairs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendati

Spark Streaming

2014-07-17 Thread Guangle Fan
Hi, All When I run spark streaming, in one of the flatMap stage, I want to access database. Code looks like : stream.flatMap( new FlatMapFunction { call () { //access database cluster } } ) Since I don't want to create database connection every time call() was called, where is

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi Sean, Thank you. I see your point. What I was thinking is that, do computation in a distributed fashion and do the storing from a single place. But you are right, having multiple DB connections actually is fine. Thanks for answering my questions. That helps me understand the system. Cheers,

Hive From Spark

2014-07-17 Thread JiajiaJing
Hello Spark Users, I am new to Spark SQL and now trying to first get the HiveFromSpark example working. However, I got the following error when running HiveFromSpark.scala program. May I get some help on this please? ERROR MESSAGE: org.apache.thrift.TApplicationException: Invalid method name: '

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Koert Kuipers
but be aware that spark-defaults.conf is only used if you use spark-submit On Jul 17, 2014 4:29 PM, "Zongheng Yang" wrote: > One way is to set this in your conf/spark-defaults.conf: > > spark.executor.extraLibraryPath /path/to/native/lib > > The key is documented here: > http://spark.apache.org/d

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi TD, Thank you. Yes, it behaves as you described. Sorry for missing this point. Then my only concern is on the performance side - since Spark Streaming operates on all the keys every time a new batch comes, I think it is fine when the state size is small. When the state size becomes big, say, a

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Andrew Or
Hi Matt, The security group shouldn't be an issue; the ports listed in `spark_ec2.py` are only for communication with the outside world. How did you launch your application? I notice you did not launch your driver from your Master node. What happens if you did? Another thing is that there seems t

Re: Include permalinks in mail footer

2014-07-17 Thread Tobias Pfeiffer
> > On Jul 17, 2014, at 12:59 PM, Nick Chammas > wrote: > > I often find myself wanting to reference one thread from another, or from > a JIRA issue. Right now I have to google the thread subject and find the > link that way. > > +1

Re: Spark Streaming timestamps

2014-07-17 Thread Tathagata Das
The RDD parameter in foreachRDD contains raw/transformed data from the last batch. So when foreachRDD is called with the time parameter as 5:02:01 and batch size is 1 minute, then the rdd will contain data based on the data received between 5:02:00 and 5:02:01. If you want to do custom interva

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
Bill, are you saying, after repartition(400), you have 400 partitions on one host and the other hosts receive nothing of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay wrote: > I also have an issue consuming from Kafka. When I consume from Kafka, > there are always a single execut

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Tathagata Das
Can you check in the environment tab of Spark web ui to see whether this configuration parameter is in effect? TD On Thu, Jul 17, 2014 at 2:05 PM, Chen Song wrote: > I am using spark 0.9.0 and I am able to submit job to YARN, > https://spark.apache.org/docs/0.9.0/running-on-yarn.html. > > I am

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple kafka streams to partition your topics across them, which will run multiple receivers or multiple executors. This is covered in the Spark streaming guide. And for th
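A sketch of the multiple-receiver approach (ZooKeeper quorum, consumer group, and topic name are placeholders; ssc is an assumed StreamingContext): each stream gets its own receiver, and the union is processed as one DStream.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val numReceivers = 4
val topics = Map("mytopic" -> 1)   // topic -> threads per receiver

val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", topics,
    StorageLevel.MEMORY_AND_DISK_SER)
}
val unifiedStream = ssc.union(kafkaStreams)
```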

Re: Error with spark-submit (formatting corrected)

2014-07-17 Thread Andrew Or
Hi ranjanp, If you go to the master UI (masterIP:8080), what does the first line say? Verify that this is the same as what you expect. Another thing is that --master in spark submit overwrites whatever you set MASTER to, so the environment variable won't actually take effect. Another obvious thing

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Andrew Or
Hi Chen, spark.executor.extraJavaOptions is introduced in Spark 1.0, not in Spark 0.9. You need to export SPARK_JAVA_OPTS=" -Dspark.config1=value1 -Dspark.config2=value2" in conf/spark-env.sh. Let me know if that works. Andrew 2014-07-17 18:15 GMT-07:00 Tathagata Das : > Can you check in the

Re: jar changed on src filesystem

2014-07-17 Thread Andrew Or
Hi Jian, In yarn-cluster mode, Spark submit automatically uploads the assembly jar to a distributed cache that all executor containers read from, so there is no need to manually copy the assembly jar to all nodes (or pass it through --jars). It seems there are two versions of the same jar in your
