Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-28 Thread trackissue121
I had already tested the query in the Hive CLI and it works fine. The same query shows an error in Spark SQL. On May 29, 2015 4:14 AM, ayan guha wrote: > > Probably a naive question: can you try the same in hive CLI and see if your > SQL is working? Looks like hive thing to me as spark is faithfully delegatin

Re: Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Ashish Mukherjee
Hi Mohit, Thanks for your reply. If my use case is purely querying read-only data (no transaction scenarios), at what scale is one of them a better option than the other? I am aware that for a scale that can be supported on a single node, VoltDB is a better choice. However, when the scale grows to

Registering Custom metrics [Spark-Streaming-monitoring]

2015-05-28 Thread Snehal Nagmote
Hello All, I am using spark streaming 1.3. I want to capture a few custom metrics based on accumulators; I followed an approach somewhat similar to this: val instrumentation = new SparkInstrumentation("example.metrics") * val numReqs = sc.accumulator(0L) * instrumentation.source.registerDailyAcc
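As a rough, hedged sketch of the general idea (SparkInstrumentation is the poster's own helper; the registry wiring and names below are assumptions, not the poster's code), an accumulator can be exposed to Spark's metrics system through a Dropwizard/Codahale gauge:

    import com.codahale.metrics.{Gauge, MetricRegistry}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // Long AccumulatorParam implicit on older 1.x releases

    object RequestMetrics {
      // Register a gauge that reports the accumulator's current value whenever metrics are polled.
      // `registry` is whatever MetricRegistry your custom metrics Source exposes.
      def register(sc: SparkContext, registry: MetricRegistry) = {
        val numReqs = sc.accumulator(0L)   // incremented inside transformations/actions
        registry.register(MetricRegistry.name("example.metrics", "numReqs"), new Gauge[Long] {
          override def getValue: Long = numReqs.value
        })
        numReqs
      }
    }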

Re: How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Nan Xiao
Hi Somnath, Is there a step-by-step guide to using Eclipse to develop Spark applications? I think many people need one. Thanks! Best Regards Nan Xiao On Thu, May 28, 2015 at 3:15 PM, Somnath Pandeya wrote: > Try scala eclipse plugin to eclipsify spark project and import spark as > ecl

Re: Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Mohit Jaggi
I have used VoltDB and Spark. The use cases for the two are quite different. VoltDB is intended for transactions and also supports queries on the same (custom to VoltDB) store. Spark (SQL) is NOT suitable for transactions; it is designed for querying immutable data (which may exist in several diff

Twitter Streaming HTTP 401 Error

2015-05-28 Thread tracynj
I am working on the Databricks Reference applications, porting them to my company's platform, and extending them to emit RDF. I have already gotten them working with the extension on EC2, and have the Log Analyzer application working on our platform. But the Twitter Language Classifier application

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Which would imply that if there was a load manager type of service, it could signal to the driver(s) that they need to acquiesce, i.e. process what's at hand and terminate. Then bring up a new machine, then restart the driver(s)... Same deal with removing machines from the cluster. Send a signal

Re: Batch aggregation by sliding window + join

2015-05-28 Thread ayan guha
Which version of spark? In 1.4, window queries will show up for these kinds of scenarios. One thing I can suggest is to keep daily aggregates materialised, partitioned by key and sorted by the key-day combination using the repartitionAndSortWithinPartitions method. It allows you to use a custom partitioner and custom sorter. Be
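To make the suggestion concrete, a hedged sketch; the key layout ((entityKey, day) pairs), the types and the partition count are assumptions rather than the poster's actual schema:

    import org.apache.spark.{HashPartitioner, Partitioner}
    import org.apache.spark.rdd.RDD

    // Daily counters keyed by (entityKey, day): partition by entityKey only, but keep each
    // partition sorted by (entityKey, day) so consecutive days for a key sit next to each other.
    def materialiseDaily(daily: RDD[((String, Int), Long)]): RDD[((String, Int), Long)] = {
      val byEntity = new Partitioner {
        private val delegate = new HashPartitioner(100)   // partition count is an assumption
        override def numPartitions: Int = delegate.numPartitions
        override def getPartition(key: Any): Int =
          delegate.getPartition(key.asInstanceOf[(String, Int)]._1)
      }
      daily.repartitionAndSortWithinPartitions(byEntity)  // uses the implicit tuple Ordering on the key
    }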

spark mlib variance analysis

2015-05-28 Thread rafac
I have a simple problem: I have the mean number of people at one place by hour (time-series like), and now I want to know if the weather conditions have an impact on that mean number. I would do it with variance analysis like ANOVA in SPSS, or by analysing the resulting regression model summary. How is it possib

Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-28 Thread ayan guha
Probably a naive question: can you try the same in hive CLI and see if your SQL is working? Looks like hive thing to me as spark is faithfully delegating the query to hive. On 29 May 2015 03:22, "Abhishek Tripathi" wrote: > Hi , > I'm using CDH5.4.0 quick start VM and tried to build Spark with H

UDF accessing hive struct array fails with buffer underflow from kryo

2015-05-28 Thread yluo
Hi all, I'm using Spark 1.3.1 with Hive 0.13.1. When running a UDF accessing a hive struct array the query fails with: Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow. Serialization trace: fieldName (org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector$M

spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-28 Thread roy
hi, Suddenly spark jobs started failing with the following error: Exception in thread "main" java.io.FileNotFoundException: /user/spark/applicationHistory/application_1432824195832_1275.inprogress (No such file or directory) full trace here [21:50:04 x...@hadoop-client01.dev:~]$ spark-submit --clas
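If that path is the Spark event log / history directory (it matches the CDH default for spark.eventLog.dir), the settings to check would look roughly like the following in spark-defaults.conf; the values are assumptions, not the poster's actual configuration:

    spark.eventLog.enabled   true
    spark.eventLog.dir       hdfs:///user/spark/applicationHistory

The directory also has to exist on HDFS and be writable by the submitting user, e.g. hdfs dfs -mkdir -p /user/spark/applicationHistory.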

Adding an indexed column

2015-05-28 Thread Cesar Flores
Assuming that I have the following data frame:

flag | price
-----|-----------------
   1 | 47.808764653746
   1 | 47.808764653746
   1 | 31.9869279512204
   1 | 47.7907893713564
   1 | 16.7599200038239
   1 | 16.7599200038239
   1 | 20.3916014172137

How can I create a data frame with an extra indexed column a
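One hedged way to do this with the 1.3-era API (before a built-in monotonically increasing id function existed) is to go through the RDD with zipWithIndex and rebuild the DataFrame; the names below are illustrative:

    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Append a 0-based row index to every row, then rebuild the DataFrame with an extended schema.
    def withRowIndex(df: DataFrame, sqlContext: SQLContext): DataFrame = {
      val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
        Row.fromSeq(row.toSeq :+ idx)
      }
      val schema = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
      sqlContext.createDataFrame(indexed, schema)
    }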

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Cody Koeninger
I'm not sure that points 1 and 2 really apply to the kafka direct stream. There are no receivers, and you know at the driver how big each of your batches is. On Thu, May 28, 2015 at 2:21 PM, Andrew Or wrote: > Hi all, > > As the author of the dynamic allocation feature I can offer a few insights
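For reference, a minimal hedged sketch of the direct (receiver-less) Kafka stream introduced in Spark 1.3; the broker list, topic name and batch interval are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("direct-kafka-example")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))   // no receivers; each batch's size is known at the driver
    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()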

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-28 Thread Sanjay Subramanian
OK guys, finally figured out how to get it running. I have detailed the steps I took; perhaps it's clear to all you folks, to me it was not :-) Our Hadoop development environment - 3 node development hadoop cluster - Current version CDH 5.3.3 - Hive 0.13.1 - Spark 1.2.0 (standal
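The heart of the setup reduces to something like the following hedged sketch, assuming Hive's hive-site.xml has been made visible to Spark (e.g. copied into the Spark conf directory); the table name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // With hive-site.xml on the classpath, HiveContext talks to the existing Hive metastore,
    // so tables already defined in Hive (with data files in HDFS) can be queried directly.
    val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-example"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT count(*) FROM some_existing_table").collect().foreach(println)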

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Andrew. From speaking with customers, this is one of the most pressing issues for them (burning hot, to be precise), especially in a SaaS type of environment and especially with commodity hardware at play. Understandably, folks don't want to pay for more hardware usage than necessary and

spark submit debugging

2015-05-28 Thread boci
Hi! I have a little problem... If I start my spark application as a java app (locally) it works like a charm, but if I start it in a hadoop cluster (tried spark-submit --master local[5] and --master yarn-client), it's not working. No error, no exception; the job runs periodically but nothing happen

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
Thanks Sandy- I was digging through the code in the deploy.yarn.Client and literally found that property right before I saw your reply. I'm on 1.2.x right now which doesn't have the property. I guess I need to update sooner rather than later. On Thu, May 28, 2015 at 3:56 PM, Sandy Ryza wrote: >

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Sandy Ryza
Hi Corey, As of this PR https://github.com/apache/spark/pull/5297/files, this can be controlled with spark.yarn.submit.waitAppCompletion. -Sandy On Thu, May 28, 2015 at 11:48 AM, Corey Nolet wrote: > I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm > noticing the jvm t
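For reference, once on a release that includes that change, the property would be passed at submit time; a hedged example (the application class and jar are placeholders):

    spark-submit --master yarn-cluster \
      --conf spark.yarn.submit.waitAppCompletion=false \
      --class com.example.MyApp my-app.jar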

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Andrew Or
Hi all, As the author of the dynamic allocation feature I can offer a few insights here. Gerard's explanation was both correct and concise: dynamic allocation is not intended to be used in Spark streaming at the moment (1.4 or before). This is because of two things: (1) Number of receivers is ne
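For batch workloads where it is supported, the relevant settings boil down to something like the following sketch (values are illustrative; the external shuffle service must also be running on the cluster nodes):

    spark.dynamicAllocation.enabled          true
    spark.shuffle.service.enabled            true
    spark.dynamicAllocation.minExecutors     2
    spark.dynamicAllocation.maxExecutors     20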

RE: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Evo Eftimov
I don’t think the number of CPU cores controls the “number of parallel tasks”. The number of tasks corresponds first and foremost to the number of (DStream) RDD partitions. The Spark documentation doesn’t mention what is meant by “Task” in terms of standard multithreading terminology, i.e. a T

Add Custom Aggregate Column to Spark DataFrame

2015-05-28 Thread calstad
I have a Spark DataFrame that looks like:

| id | value | bin |
|----|-------|-----|
| 1  | 3.4   | 2   |
| 2  | 2.6   | 1   |
| 3  | 1.8   | 1   |
| 4  | 9.6   | 2   |

I have a function `f` that takes an array of values and returns a number. I want to add a column to the
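A hedged sketch of one way to do this with the 1.3-era API, where user-defined aggregation functions on DataFrames are not yet available: group the values per bin on the RDD side, apply f, and join the result back on bin. The column types below are assumptions about the example data:

    import org.apache.spark.sql.{DataFrame, SQLContext}

    def addAggregateColumn(df: DataFrame, sqlContext: SQLContext,
                           f: Seq[Double] => Double): DataFrame = {
      import sqlContext.implicits._
      val perBin = df.select("bin", "value").rdd
        .map(r => (r.getLong(0), r.getDouble(1)))   // assumes bin is a long, value a double
        .groupByKey()
        .mapValues(vs => f(vs.toSeq))
        .toDF("bin", "f_of_values")
      df.join(perBin, df("bin") === perBin("bin"))  // project away the duplicate bin column as needed
    }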

yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm noticing the jvm that fires up to allocate the resources, etc... is not going away after the application master and executors have been allocated. Instead, it just sits there printing 1 second status updates to the console. I

Re: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Mulugeta Mammo
Thanks for the valuable information. The blog states: "The cores property controls the number of concurrent tasks an executor can run. --executor-cores 5 means that each executor can run a maximum of five tasks at the same time. " So, I guess the max number of executor-cores I can assign is the C

spark sql lateral view unresolved attribute exception

2015-05-28 Thread weoccc
Hi, It seems a LATERAL VIEW explode column named 'some_col' can't be resolved when expressed in a subquery. Any idea why? SELECT `fc_clickq`.`some_col` FROM ( SELECT * FROM fc_clickq LATERAL VIEW explode(`overlap`) ltr_table_3 AS `some_col`) fc_clickq ; org.apache.spark.sql.catalyst.errors.package$

Re: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Ruslan Dautkhanov
It's not only about cores. Keep in mind spark.executor.cores also affects the available memory for each task: From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ The memory available to each task is (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffl
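A worked example with assumed values (in the 1.x line the defaults were spark.shuffle.memoryFraction = 0.2 and spark.shuffle.safetyFraction = 0.8):

    spark.executor.memory = 8g, spark.executor.cores = 4
    shuffle memory per task ≈ (8 GB * 0.2 * 0.8) / 4 ≈ 0.32 GB (roughly 330 MB)

so doubling executor-cores halves the shuffle memory each concurrent task can use.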

Re: PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Davies Liu
Could you try to comment out some lines in `extract_sift_features_opencv` to find which line cause the crash? If the bytes came from sequenceFile() is broken, it's easy to crash a C library in Python (OpenCV). On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga wrote: > Hi sparkers, > > I am working

Fwd: [Streaming] Configure executor logging on Mesos

2015-05-28 Thread Tim Chen
6:22.958067 26890 exec.cpp:206] Executor registered on slave > 20150528-063307-780930314-5050-8152-S5 > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > > So

Hyperthreading

2015-05-28 Thread Mulugeta Mammo
Hi guys, Does the SPARK_EXECUTOR_CORES assume Hyper threading? For example, if I have 4 cores with 2 threads per core, should the SPARK_EXECUTOR_CORES be 4*2 = 8 or just 4? Thanks,

Batch aggregation by sliding window + join

2015-05-28 Thread igor.berman
Hi, I have a daily batch job that computes a daily aggregate of several counters represented by some object. After the daily aggregation is done, I want to compute blocks of multi-day aggregations (3, 7, 30 days, etc.). To do so I need to add the new daily aggregation to the current block and then subtract from the current bloc

Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-28 Thread Abhishek Tripathi
Hi, I'm using the CDH5.4.0 quick start VM and tried to build Spark with Hive compatibility so that I can run Spark SQL and access temp tables remotely. I used the below command to build Spark; the build was successful, but when I try to access Hive data from Spark SQL, I get an error. Thanks, Abhi --
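For context, a typical Hive-enabled build for a Spark 1.3 / CDH 5.4 combination might look like the line below; the profiles and hadoop.version shown are assumptions, not necessarily the poster's exact command:

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0-cdh5.4.0 \
        -Phive -Phive-thriftserver -DskipTests clean package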

PySpark with OpenCV causes python worker to crash

2015-05-28 Thread Sam Stoelinga
Hi sparkers, I am working on a PySpark application which uses the OpenCV library. It runs fine when running the code locally but when I try to run it on Spark on the same Machine it crashes the worker. The code can be found here: https://gist.github.com/samos123/885f9fe87c8fa5abf78f This is the

Loading CSV to DataFrame and saving it into Parquet for speedup

2015-05-28 Thread M Rez
I am using Spark-CSV to load 50GB of around 10,000 CSV files into a couple of unified DataFrames. Since this process is slow I have written this snippet: targetList.foreach { target => // this is using sqlContext.load by getting list of files then loading them according to schema files
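A hedged sketch of the convert-once pattern being described (paths and options are placeholders): load the CSVs with spark-csv, persist them as Parquet, and point later jobs at the Parquet copy:

    import org.apache.spark.sql.SQLContext

    def csvToParquet(sqlContext: SQLContext): Unit = {
      val df = sqlContext.load(
        "com.databricks.spark.csv",
        Map("path" -> "hdfs:///data/csv/target1/*.csv", "header" -> "true"))
      df.saveAsParquetFile("hdfs:///data/parquet/target1")
    }
    // later jobs: val fast = sqlContext.parquetFile("hdfs:///data/parquet/target1")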

Re: debug jsonRDD problem?

2015-05-28 Thread Michael Stone
On Wed, May 27, 2015 at 02:06:16PM -0700, Ted Yu wrote: Looks like the exception was caused by resolved.get(prefix ++ a) returning None: a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true) There are three occurrences of resolved.get() in createSchema() - None should

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK – it will be your insurance policy against sys crashes due to memory leaks. Until there is free RAM, spark streaming (spark) will NOT resort to disk – and of course resorting to disk from time to time (ie when there is no

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Ji ZHANG
Hi, Unfortunately, they're still growing, both driver and executors. I ran the same job in local mode and everything was fine. On Thu, May 28, 2015 at 5:26 PM, Akhil Das wrote: > Can you replace your counting part with this? > > logs.filter(_.s_id > 0).foreachRDD(rdd => logger.info(rdd.count()))

Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Ashish Mukherjee
Hello, I was wondering if there is any documented comparison of SparkSQL with MemSQL/VoltDB kind of in-memory SQL databases. MemSQL etc. also allow queries to be run in a clustered environment. What is the major differentiation? Regards, Ashish

Best practice to update a MongoDB document from Sparks

2015-05-28 Thread nibiau
Hello, I'm evaluating Spark/Spark Streaming. I use Spark Streaming to receive messages from a Kafka topic. As soon as I have a JavaReceiverInputDStream, I have to process each message; for each one I have to search in MongoDB to find whether a document exists. If I find the document I have to update

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-28 Thread Andrew Otto
> val sqlContext = new HiveContext(sc) > val schemaRdd = sqlContext.sql("some complex SQL") It mostly works, but have been having issues with tables that contains a large amount of data: https://issues.apache.org/jira/browse/SPARK-6910 > On

Re: Dataframe Partitioning

2015-05-28 Thread Silvio Fiorito
That’s due to the config setting spark.sql.shuffle.partitions which defaults to 200 From: Masf Date: Thursday, May 28, 2015 at 10:02 AM To: "user@spark.apache.org" Subject: Dataframe Partitioning Hi. I have 2 dataframe with 1 and 12 partitions respectively. When I
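If 200 post-join partitions are too many, the setting can be changed on the SQLContext before running the join; a small hedged example (the value 12 is illustrative):

    sqlContext.setConf("spark.sql.shuffle.partitions", "12")
    val joined = df1.join(df2, df1("id") === df2("id"), "inner")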

Dataframe Partitioning

2015-05-28 Thread Masf
Hi. I have 2 dataframes with 1 and 12 partitions respectively. When I do an inner join between these dataframes, the result contains 200 partitions. *Why?* df1.join(df2, df1("id") === df2("id"), "Inner") => returns 200 partitions Thanks!!! -- Regards. Miguel Ángel

[Streaming] Configure executor logging on Mesos

2015-05-28 Thread Gerard Maas
n the spark assembly: I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave 20150528-063307-780930314-5050-8152-S5 Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properti

Re: SPARK STREAMING PROBLEM

2015-05-28 Thread Sourav Chandra
The problem lies in the way you are doing the processing. After the g.foreach(x => {println(x); println("")}), are you calling ssc.start? Until now all you have done is set up the computation steps, but spark has not started any real processing. So when you do g.foreach, what it iterat

Soft distinct on data frames.

2015-05-28 Thread Jan-Paul Bultmann
Hey, Is there a way to do a distinct operation on each partition only? My program generates quite a few duplicate tuples and it would be nice to remove some of these as an optimisation without having to reshuffle the data. I’ve also noticed that plans generated with a unique transformation have
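A hedged sketch of a partition-local distinct at the RDD level; note that equal tuples living in different partitions survive, and each partition's distinct set is built in memory:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Drop duplicates within each partition only, without shuffling the data.
    def partitionLocalDistinct[T: ClassTag](rdd: RDD[T]): RDD[T] =
      rdd.mapPartitions(iter => iter.toSet.iterator, preservesPartitioning = true)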

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Evo, good points. On the dynamic resource allocation, I'm surmising this only works within a particular cluster setup. So it improves the usage of current cluster resources but it doesn't make the cluster itself elastic. At least, that's my understanding. Memory + disk would be good and hopefull

Spark streaming with kafka

2015-05-28 Thread boci
Hi guys, I'm using spark streaming with kafka... On my local machine (started as a java application without using spark-submit) it works, connecting to kafka and doing the job (*). I tried to put it into a spark docker container (hadoop 2.6, spark 1.3.1, tried spark-submit with local[5] and yarn-client too) but I'm ou

Fwd: SPARK STREAMING PROBLEM

2015-05-28 Thread Animesh Baranawal
I also started the streaming context by running ssc.start(), but still, apart from logs, nothing of g gets printed. -- Forwarded message -- From: Animesh Baranawal Date: Thu, May 28, 2015 at 6:57 PM Subject: SPARK STREAMING PROBLEM To: user@spark.apache.org Hi, I am trying to extr

Re: SPARK STREAMING PROBLEM

2015-05-28 Thread Sourav Chandra
You must start the StreamingContext by calling ssc.start() On Thu, May 28, 2015 at 6:57 PM, Animesh Baranawal < animeshbarana...@gmail.com> wrote: > Hi, > > I am trying to extract the filenames from which a Dstream is generated by > parsing the toDebugString method on RDD > I am implementing the

SPARK STREAMING PROBLEM

2015-05-28 Thread Animesh Baranawal
Hi, I am trying to extract the filenames from which a Dstream is generated by parsing the toDebugString method on RDD I am implementing the following code in spark-shell: import org.apache.spark.streaming.{StreamingContext, Seconds} val ssc = new StreamingContext(sc,Seconds(10)) val lines = ssc.t
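For completeness, a minimal hedged sketch of the intended spark-shell flow (the input directory is a placeholder); the point raised in the replies is that nothing runs until ssc.start() is called:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("hdfs:///tmp/streaming-input")
    lines.foreachRDD(rdd => println(rdd.toDebugString))   // e.g. inspect the lineage, as in the question
    ssc.start()              // without this, the graph is only defined, never executed
    ssc.awaitTermination()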

FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
You can also try Dynamic Resource Allocation https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation Also re the Feedback Loop for automatic message consumption rate adjustment – there is a “dumb” solution option – simply set the storage policy for the DStrea

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
You can always spin up new boxes in the background and bring them into the cluster fold when fully operational, and time that with the job relaunch and param change. Kafka offsets are managed automatically for you by the kafka clients, which keep them in ZooKeeper, so don't worry about that as long as you shut

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by adding/removing

RE: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
@DG; The key metrics should be - Scheduling delay – its ideal state is to remain constant over time and ideally be less than the time of the microbatch window - The average job processing time should remain less than the micro-batch window - Number of Lost Jobs –

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thank you, Gerard. We're looking at the receiver-less setup with Kafka Spark streaming so I'm not sure how to apply your comments to that case (not that we have to use receiver-less but it seems to offer some advantages over the receiver-based). As far as "the number of Kafka receivers is fixed f

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Gerard Maas
Hi, tl;dr At the moment (with a BIG disclaimer *) elastic scaling of spark streaming processes is not supported. *Longer version.* I assume that you are talking about Spark Streaming, as the discussion is about handling Kafka streaming data. Then you have two things to consider: the Streaming re

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
Can you replace your counting part with this? logs.filter(_.s_id > 0).foreachRDD(rdd => logger.info(rdd.count())) Thanks Best Regards On Thu, May 28, 2015 at 1:02 PM, Ji ZHANG wrote: > Hi, > > I wrote a simple test job, it only does very basic operations. for example: > > val lines = Kaf

Re: why does "com.esotericsoftware.kryo.KryoException: java.u til.ConcurrentModificationException" happen?

2015-05-28 Thread Akhil Das
Can you paste the piece of code at least? Not sure, but it seems you are reading/writing an object at the same time. You can try disabling kryo and that might give you a proper exception stack. Thanks Best Regards On Thu, May 28, 2015 at 2:45 PM, randylu wrote: > begs for your help > > > > -- >

Re: FetchFailedException and MetadataFetchFailedException

2015-05-28 Thread Rok Roskar
yes I've had errors with too many open files before, but this doesn't seem to be the case here. Hmm, you're right in that these errors are different from what I initially stated -- I think what I assumed was that the failure to write caused the worker to crash, which in turn resulted in a fail

Re: why does "com.esotericsoftware.kryo.KryoException: java.u til.ConcurrentModificationException" happen?

2015-05-28 Thread randylu
begs for your help -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-does-com-esotericsoftware-kryo-KryoException-java-u-til-ConcurrentModificationException-happen-tp23067p23068.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

why does "com.esotericsoftware.kryo.KryoException: java.u til.ConcurrentModificationException" happen?

2015-05-28 Thread randylu
My program runs for 500 iterations, but almost always fails at about 150 iterations. It's hard to explain the details of my program, but I think my program is ok, for it runs successfully sometimes. *I just want to know in which situations this exception will happen*. The detailed error information is

Re: Get all servers in security group in bash(ec2)

2015-05-28 Thread Akhil Das
You can use python boto library for that, in fact spark-ec2 script uses it underneath. Here's the call spark-ec2 is making to get all machines under a given security group. Thanks Best Regards On Thu, May 28, 2015 at 2:22 PM, niz

Get all servers in security group in bash(ec2)

2015-05-28 Thread nizang
hi, Is there any way in bash (from an ec2 spark server) to list all the servers in my security group (or better - in a given security group)? I tried using: wget -q -O - http://instance-data/latest/meta-data/security-groups security_group_xxx but now, I want all the servers in security group secu

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
1. Up to you; you can add either the internal ip or the external ip, it won't be a problem as long as they are in the same network. 2. If you only want to start a particular slave, then you can run: sbin/start-slave.sh Thanks Best Regards On Thu, May 28, 2015 at 1:52 PM, Nizan Grauer wrote:

Spark Cassandra

2015-05-28 Thread lucas
Hello, I am trying to save data from spark to Cassandra. So I have a ScalaESRDD (because I take data from elasticsearch) that contains a lot of key/values like this : (AU16r4o_kbhIuSky3zFO , Map(@timestamp -> 2015-05-21T21:35:54.035Z, timestamp -> 2015-05-21 23:35:54,035, loglevel -> INFO, th
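A hedged sketch with the DataStax spark-cassandra-connector, assuming esRdd is the (id, Map) RDD from the post; the keyspace, table and column names are placeholders:

    import com.datastax.spark.connector._

    // Flatten each (docId, fields) pair into the columns of the target table, then save.
    val rows = esRdd.map { case (docId, fields) =>
      (docId,
       fields.get("loglevel").map(_.toString).orNull,
       fields.get("@timestamp").map(_.toString).orNull)
    }
    rows.saveToCassandra("logs_ks", "log_events", SomeColumns("id", "loglevel", "event_time"))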

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Nizan Grauer
hi, thanks for your answer! I have a few more questions: 1) the file /root/spark/conf/slaves has the full DNS names of the servers (ec2-52-26-7-137.us-west-2.compute.amazonaws.com); did you add the internal IPs there? 2) You call start-all. Isn't it too aggressive? Let's say I have 20 slaves up, and I want

DataFrame nested sctructure selection limit

2015-05-28 Thread Eugene Morozov
Hi! I have a json file with some data. I’m able to create a DataFrame out of it, and the schema for the particular part of it I’m interested in looks like the following:

val json: DataFrame = sqlc.load("entities_with_address2.json", "json")

root
 |-- attributes: struct (nullable = true)
 |    |-- Address2
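For the nested part, a hedged sketch using the 1.3-era API (the file and field names follow the post): struct fields can be addressed with dotted column names:

    val json = sqlc.load("entities_with_address2.json", "json")
    val addresses = json.select("attributes.Address2")
    addresses.printSchema()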

Re: Recommended Scala version

2015-05-28 Thread Tathagata Das
Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming out soon) with Scala 2.11 and report issues. TD On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers wrote: > we are still running into issues with spark-shell not working on 2.11, but > we are running on somewhat older master so

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Ji ZHANG
Hi, I wrote a simple test job, it only does very basic operations. for example: val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1)).map(_._2) val logs = lines.flatMap { line => try { Some(parse(line).extract[Impression]) } catch { case _:

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
I do it this way: - Launch a new instance by clicking on the slave instance and choosing "launch more like this" - Once it's launched, ssh into it and add the master's public key to .ssh/authorized_keys - Add the slave's internal IP to the master's conf/slaves file - Do sbin/start-all.sh and it will sho

RE: How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Somnath Pandeya
Try scala eclipse plugin to eclipsify spark project and import spark as eclipse project -Somnath -Original Message- From: Nan Xiao [mailto:xiaonan830...@gmail.com] Sent: Thursday, May 28, 2015 12:32 PM To: user@spark.apache.org Subject: How to use Eclipse on Windows to build Spark envir

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
Hi Zhang, Could you paste your code in a gist? Not sure what you are doing inside the code to fill up memory. Thanks Best Regards On Thu, May 28, 2015 at 10:08 AM, Ji ZHANG wrote: > Hi, > > Yes, I'm using createStream, but the storageLevel param is by default > MEMORY_AND_DISK_SER_2. Besides,

How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Nan Xiao
Hi all, I want to use Eclipse on Windows to build a Spark environment, but I find the reference page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup) doesn't contain any guide about Eclipse. Could anyone give tutorials or links about how to use

Adding slaves on spark standalone on ec2

2015-05-28 Thread nizang
hi, I'm working on a spark standalone system on ec2, and I'm having problems resizing the cluster (meaning - adding or removing slaves). In the basic ec2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html), there's only a script for launching the cluster, not for adding slaves to it. On the