Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread ayan guha
Hi, so to be clear, do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the Python thread module can't be used - or have you already given it a try? On 18 May 2015 16:39, "MEETHU MATHEW" wrote: > Hi Akhil, > > T

How to debug spark in IntelliJ Idea

2015-05-18 Thread Yi.Zhang
Hi all, Currently, I wrote some code lines to access the spark master, which was deployed in standalone mode. I wanted to set a breakpoint for the spark master, which was running in a different process. I am wondering whether I need to attach to the process in IntelliJ, so that when AppClient sent the message to r

Fwd: Spark and Flink

2015-05-18 Thread Pa Rö
Hi, if I add your dependency I get over 100 errors, so now I have changed the version number: com.fasterxml.jackson.module jackson-module-scala_2.10 2.4.4 com.google.guava guava now the pom is fin

AccessControlException hive table created from spark shell

2015-05-18 Thread patcharee
Hi, I found a permission problem when I created a hive table from hive context in spark shell version 1.2.1. Then I tried to update this table, but got AccessControlException because this table is owned by hive, not my account. From the hive context, hiveContext.sql("create table orc_table(k

k-means core function for temporal geo data

2015-05-18 Thread Pa Rö
Hello, I want to cluster geo data (lat, long, timestamp) with k-means. Now I am searching for a good core (distance) function, but I cannot find a good paper or other sources for that. At the moment I multiply the time and the space distance: public static double dis(GeoData input1, GeoData input2) { double timeDis = Math
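
A minimal Scala sketch of one possible combined distance, using a weighted sum of a haversine spatial term and a temporal term; the weight alpha, the time unit, and the GeoData case class are illustrative assumptions, not the poster's actual code:

    import scala.math._

    case class GeoData(lat: Double, lon: Double, timestamp: Long)

    // Haversine distance in kilometres between two points.
    def spatialDist(a: GeoData, b: GeoData): Double = {
      val r = 6371.0
      val dLat = toRadians(b.lat - a.lat)
      val dLon = toRadians(b.lon - a.lon)
      val h = pow(sin(dLat / 2), 2) +
        cos(toRadians(a.lat)) * cos(toRadians(b.lat)) * pow(sin(dLon / 2), 2)
      2 * r * asin(sqrt(h))
    }

    // Weighted sum instead of a product, so one term cannot zero out the other.
    // alpha is a hypothetical tuning weight; both terms should be normalised to
    // comparable scales before combining.
    def dist(a: GeoData, b: GeoData, alpha: Double = 0.5): Double = {
      val timeDist = abs(a.timestamp - b.timestamp).toDouble / 3600000.0 // hours
      alpha * spatialDist(a, b) + (1 - alpha) * timeDist
    }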

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-18 Thread Steve Loughran
On 16 May 2015, at 04:39, Anton Brazhnyk wrote: For me it wouldn’t help I guess, because those newer classes would still be loaded by a different classloader. What did work for me with 1.3.1 – removing those classes from Spark’s jar completely, so they get

NullPointerException when accessing broadcast variable in DStream

2015-05-18 Thread hotienvu
Hi, I'm trying to use broadcast variables in my Spark streaming program. val conf = new SparkConf().setMaster(SPARK_MASTER).setAppName(APPLICATION_NAME) val ssc = new StreamingContext(conf, Seconds(1)) val LIMIT = ssc.sparkContext.broadcast(5L) println(LIMIT.value) // this prints 5 val lines
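
A commonly suggested workaround sketch in Scala, assuming the NullPointerException appears because the broadcast handle created on the driver is captured in a closure that outlives it (for example after checkpoint recovery); the object name and the 5L value are illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Lazily (re)create the broadcast instead of capturing the instance that was
    // created before recovery; the singleton survives closure serialization.
    object LimitBroadcast {
      @volatile private var instance: Broadcast[Long] = _
      def getInstance(sc: SparkContext): Broadcast[Long] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(5L)
            }
          }
        }
        instance
      }
    }

    // Inside the streaming job (lines is the DStream from the original question):
    // lines.foreachRDD { rdd =>
    //   val limit = LimitBroadcast.getInstance(rdd.sparkContext)
    //   rdd.filter(_.length <= limit.value).foreach(println)
    // }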

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread MEETHU MATHEW
Hi, I think you can't supply an initial set of centroids to kmeans. Thanks & Regards, Meethu M On Friday, 15 May 2015 12:37 AM, Suman Somasundar wrote: Hi, I want to run a definite number of iterations in Kmeans. There is a command line argument to set maxIterations, but even if I

Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am using spark-sql to read a CSV file and write it as parquet file. I am building the schema using the following code. String schemaString = "a b c"; List fields = new ArrayList(); MetadataBuilder mb = new MetadataBuilder(); mb.putBoolean("nullable", tr
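
For reference, a small Scala sketch of building a programmatic schema and making sure every Row carries as many values as the schema has fields, which is usually what this particular error is about; the column names, input path, and padding logic are assumptions, not the poster's code, and sc/sqlContext are taken from spark-shell:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val lines = sc.textFile("data.csv")  // hypothetical input path

    // Programmatic schema with three nullable string fields.
    val schema = StructType(Seq(
      StructField("a", StringType, nullable = true),
      StructField("b", StringType, nullable = true),
      StructField("c", StringType, nullable = true)))

    // Every Row must carry exactly schema.length values; pad short CSV lines,
    // otherwise the Parquet writer reports that the row has fewer fields than
    // the schema ("Trying to write more fields than contained in row").
    val rows = lines.map(_.split(",", -1).padTo(schema.length, null: String))
      .map(parts => Row.fromSeq(parts.take(schema.length)))

    val df = sqlContext.createDataFrame(rows, schema)
    df.saveAsParquetFile("output.parquet")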

Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario when your sy

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
You can use spark.streaming.receiver.maxRate (default: not set) - "Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit
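
A minimal Scala sketch of setting that rate limit when building the streaming context; the app name, the 100 records/second cap, and the 1-second batch interval are just example values:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Cap each receiver at 100 records per second; 0 or a negative value means no limit.
    val conf = new SparkConf()
      .setAppName("RateLimitedApp")
      .set("spark.streaming.receiver.maxRate", "100")
    val ssc = new StreamingContext(conf, Seconds(1))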

Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread ayan guha
Hi Give a try with dtaFrame.fillna function to fill up missing column Best Ayan On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan < ananda.muru...@honeywell.com> wrote: > Hi, > > > > I am using spark-sql to read a CSV file and write it as parquet file. I am > building the sche

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
And if you want to genuinely “reduce the latency” (still within the boundaries of the micro-batch) THEN you need to design and finely tune the Parallel Programming / Execution Model of your application. The objective/metric here is: a) Consume all data within your selected micro-batch wi

Re: Spark and Flink

2015-05-18 Thread Robert Metzger
Hi, I would really recommend you to put your Flink and Spark dependencies into different maven modules. Having them both in the same project will be very hard, if not impossible. Both projects depend on similar projects with slightly different versions. I would suggest a maven module structure lik

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
We fix the rate at which the receivers should consume at any given point of time. Also we have a back-pressuring mechanism attached to the receivers so they won't simply crash in the "unceremonious way" Evo mentioned. Mesos has some sort of auto-scaling (read it somewhere), maybe you can look into

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
Who are “we” and what is the mysterious “back-pressuring mechanism”, and is it part of the Spark distribution (are you talking about an implementation of the custom feedback loop mentioned in my previous emails below)? Asking these because I can assure you that, at least as of Spark Streaming 1.2.0,

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
we = Sigmoid back-pressuring mechanism = Stopping the receiver from receiving more messages when it's about to exhaust the worker memory. Here's a similar kind of proposal if you haven't seen it already. Thanks Best Regards On Mon, May 18, 2015 at

Working with slides. How do I know how many times a RDD has been processed?

2015-05-18 Thread Guillermo Ortiz
Hi, I have two streaming RDDs, RDD1 and RDD2, and want to cogroup them. The data don't arrive at the same time and sometimes they could come with some delay. When I get all the data I want to insert it into MongoDB. For example, imagine that I get: RDD1 --> T 0 RDD2 --> T 0.5 I do a cogroup between them but I couldn't st

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
Ooow – that is essentially the custom feedback loop mentioned in my previous emails in generic Architecture Terms, and what you have done is only one of the possible implementations, moreover based on Zookeeper – there are other possible designs not using things like Zookeeper at all and hence ach

Processing multiple columns in parallel

2015-05-18 Thread Laeeq Ahmed
Hi, Consider I have a tab-delimited text file with 10 columns. Each column is a set of text. I would like to do a word count for each column. In Scala, I would do the following RDD transformation and action:  val data = sc.textFile("hdfs://namenode/data.txt")  for(i <- 0 until 9){     data.map

Re: Forbidden : Error Code: 403

2015-05-18 Thread Mohammad Tariq
Tried almost all the options, but it did not work. So, I ended up creating a new IAM user and the keys of this user are working fine. I am not getting Forbidden(403) exception now, but my program seems to be running infinitely. It's not throwing any exception, but continues to run continuously with

Spark streaming over a rest API

2015-05-18 Thread juandasgandaras
Hello, I would like to use spark streaming over a REST API to get information over time and with different parameters in the REST query. I was thinking of using Apache Kafka but I don't have any experience with this and I would like to have some advice about it. Thanks. Best regards, Jua

pass configuration parameters to PySpark job

2015-05-18 Thread Oleg Ruchovets
Hi, I am looking for a way to pass configuration parameters to a spark job. In general I have a quite simple PySpark job. def process_model(k, vc): do something sc = SparkContext(appName="TAD") lines = sc.textFile(input_job_files) result = lines.map(doSplit)

parsedData option

2015-05-18 Thread Ricardo Goncalves da Silva
Hi Team, My dataset has the following format: CELLPHONE,KL_1,KL_2,KL_3,KL_4,KL_5 1120100114,-5.3244062521117e-003,-4.10825709805041e-003,-1.7816995027779e-002,-4.21462029980323e-003,-1.6200555039e-002 i.e., a reader in the first column and the data separated by comas. To load this data I’m u

Re: Processing multiple columns in parallel

2015-05-18 Thread ayan guha
My first thought would be creating 10 RDDs and running your word count on each of them... I think the Spark scheduler is going to resolve the dependencies in parallel and launch 10 jobs. Best Ayan On 18 May 2015 23:41, "Laeeq Ahmed" wrote: > Hi, > > Consider I have a tab delimited text file with 10 columns. Eac

RE: Processing multiple columns in parallel

2015-05-18 Thread Needham, Guy
How about making the range in the for loop parallelised? The driver will then kick off the word counts independently. Regards, Guy Needham | Data Discovery Virgin Media | Technology and Transformation | Data Bartley Wood Business Park, Hook, Hampshire RG27 9UP D 01256 75 3362 I welcome VSRE ema
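
A rough Scala sketch combining the two suggestions above - one derived RDD per column and a parallelised driver-side loop so the ten jobs are submitted concurrently; the paths, the tab/space splitting, and the fixed column count are illustrative assumptions:

    val data = sc.textFile("hdfs://namenode/data.txt")
    data.cache()  // the same input is reused by all ten jobs

    // .par makes the driver submit the ten actions concurrently; the scheduler
    // can then run the jobs in parallel if the cluster has spare capacity.
    (0 until 10).par.foreach { i =>
      val counts = data.map(_.split("\t")(i))       // take column i
        .flatMap(_.split(" "))                      // split the column text into words
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile(s"hdfs://namenode/wordcount_col_$i")
    }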

RE: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
Thanks for the heads up mate. On 18 May 2015 19:08, "Evo Eftimov" wrote: > Ooow – that is essentially the custom feedback loop mentioned in my > previous emails in generic Architecture Terms and what you have done is > only one of the possible implementations moreover based on Zookeeper – > there

Re: number of executors

2015-05-18 Thread Sandy Ryza
*All On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza wrote: > Hi Xiaohe, > > The all Spark options must go before the jar or they won't take effect. > > -Sandy > > On Sun, May 17, 2015 at 8:59 AM, xiaohe lan > wrote: > >> Sorry, them both are assigned task actually. >> >> Aggregated Metrics by Exec

Re: number of executors

2015-05-18 Thread Sandy Ryza
Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan wrote: > Sorry, them both are assigned task actually. > > Aggregated Metrics by Executor > Executor IDAddressTask TimeTotal TasksFailed TasksSucceeded TasksInpu

Re: InferredSchema Example in Spark-SQL

2015-05-18 Thread Rajdeep Dua
Thanks will try this. On Sun, May 17, 2015 at 8:07 PM, Ram Sriharsha wrote: > you are missing sqlContext.implicits._ > > On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua > wrote: > >> Here are my imports >> >> *import* org.apache.spark.SparkContext >> >> *import* org.apache.spark.SparkContext._ >>

py-files (and others?) not properly set up in cluster-mode Spark Yarn job?

2015-05-18 Thread Shay Rojansky
I'm having issues with submitting a Spark Yarn job in cluster mode when the cluster filesystem is file:///. It seems that additional resources (--py-files) are simply being skipped and not being added into the PYTHONPATH. The same issue may also exist for --jars, --files, etc. We use a simple NFS

Re: number of executors

2015-05-18 Thread edward cui
I actually have the same problem, but I am not sure whether it is a Spark problem or a YARN problem. I set up a five-node cluster on AWS EMR and started the YARN daemon on the master (the node manager is not started on the master by default; I don't want to waste any resources since I have to pay). An

Re: number of executors

2015-05-18 Thread edward cui
Oh BTW, it's Spark 1.3.1 on Hadoop 2.4, AMI 3.6. Sorry for leaving out this information. Appreciate any help! Ed 2015-05-18 12:53 GMT-04:00 edward cui : > I actually have the same problem, but I am not sure whether it is a spark > problem or a Yarn problem. > > I set up a five nodes cluste

Re: Reading Real Time Data only from Kafka

2015-05-18 Thread Akhil Das
I have played a bit with the directStream Kafka API. Good work, Cody. These are my findings, and can you also clarify a few things for me (see below)? -> When "auto.offset.reset" -> "smallest" and you have 60GB of messages in Kafka, it takes forever as it reads the whole 60GB at once. "largest" will
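
For reference, a minimal Scala sketch of the direct (receiver-less) API under discussion, with "auto.offset.reset" set explicitly; the broker list, topic name, batch interval, and app name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DirectKafkaExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      // "smallest" replays everything still retained in Kafka;
      // "largest" starts from new messages only.
      "auto.offset.reset" -> "largest")
    val topics = Set("mytopic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()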

Re: Spark streaming over a rest API

2015-05-18 Thread Akhil Das
Why not use Spark Streaming to do the computation, dump the result somewhere (a DB perhaps), and take it from there? Thanks Best Regards On Mon, May 18, 2015 at 7:51 PM, juandasgandaras wrote: > Hello, > > I would like to use spark streaming over a REST api to get information > along > the ti

org.apache.spark.shuffle.FetchFailedException :: Migration from Spark 1.2 to 1.3

2015-05-18 Thread zia_kayani
Hi, I'm getting this exception after shifting my code from Spark 1.2 to Spark 1.3 15/05/18 18:22:39 WARN TaskSetManager: Lost task 0.0 in stage 1.6 (TID 84, cloud8-server): FetchFailed(BlockManagerId(1, cloud4-server, 7337), shuffleId=0, mapId=9, reduceId=1, message= org.apache.spark.shuffle.Fetch

Spark groupByKey, does it always create at least 1 partition per key?

2015-05-18 Thread tomboyle
I am currently using spark streaming. During my batch processing I must groupByKey. Afterwards I call foreachRDD & foreachPartition & write to an external datastore. My only concern is whether this is future-proof. I know groupByKey by default uses the hashPartitioner. I have printed out the int

Re: py-files (and others?) not properly set up in cluster-mode Spark Yarn job?

2015-05-18 Thread Marcelo Vanzin
Hi Shay, Yeah, that seems to be a bug; it doesn't seem to be related to the default FS nor compareFs either - I can reproduce this with HDFS when copying files from the local fs too. In yarn-client mode things seem to work. Could you file a bug to track this? If you don't have a jira account I ca

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
My pleasure, young man. I will even go beyond the so-called "heads up" and send you a solution design for a feedback loop preventing spark streaming app clogging and resource depletion and featuring machine-learning-based self-tuning, AND which is not zookeeper based and hence offers lower latency

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-18 Thread Olivier Girardot
PR is opened : https://github.com/apache/spark/pull/6237 Le ven. 15 mai 2015 à 17:55, Olivier Girardot a écrit : > yes, please do and send me the link. > @rxin I have trouble building master, but the code is done... > > > Le ven. 15 mai 2015 à 01:27, Haopu Wang a écrit : > >> Thank you, should

Re: pass configuration parameters to PySpark job

2015-05-18 Thread Davies Liu
In PySpark, it serializes the functions/closures together with the global values they use. For example, global_param = 111 def my_map(x): return x + global_param rdd.map(my_map) - Davies On Mon, May 18, 2015 at 7:26 AM, Oleg Ruchovets wrote: > Hi , >I am looking a way to pass configuration

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread Davies Liu
SparkContext can be used in multiple threads (Spark streaming works with multiple threads), for example: import threading import time def show(x): time.sleep(1) print x def job(): sc.parallelize(range(100)).foreach(show) threading.Thread(target=job).start() On Mon, May 18, 2015

Re: parallelism on binary file

2015-05-18 Thread Imran Rashid
You can use sc.hadoopFile (or any of the variants) to do what you want. They even let you reuse your existing HadoopInputFormats. You should be able to mimic your old use with MR just fine. sc.textFile is just a convenience method which sits on top. imran On Fri, May 8, 2015 at 12:03 PM, tog w

Re: Spark groupByKey, does it always create at least 1 partition per key?

2015-05-18 Thread Tathagata Das
By definition, all the values of a key will be only in one partition. This is some of the oldest API in Spark and will continue to work as it is now. On Mon, May 18, 2015 at 10:38 AM, tomboyle wrote: > I am currently using spark streaming. During my batch processing I must > groupByKey. Afterwar
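
A small Scala sketch illustrating that guarantee: after groupByKey, all values for a given key arrive together inside a single partition, so a per-key write from foreachPartition is safe; the sample data and the println stand-in for the external-datastore write are assumptions:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), numSlices = 4)

    pairs.groupByKey().foreachPartition { iter =>
      // Each key's complete value list shows up in exactly one partition.
      iter.foreach { case (key, values) =>
        println(s"$key -> ${values.mkString(",")}")  // replace with the external-datastore write
      }
    }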

Re: Error communicating with MapOutputTracker

2015-05-18 Thread Imran Rashid
On Fri, May 15, 2015 at 5:09 PM, Thomas Gerber wrote: > Now, we noticed that we get java heap OOM exceptions on the output tracker > when we have too many tasks. I wonder: > 1. where does the map output tracker live? The driver? The master (when > those are not the same)? > 2. how can we increase

Re: applications are still in progress?

2015-05-18 Thread Imran Rashid
Most likely, you never call sc.stop(). Note that in 1.4, this will happen for you automatically in a shutdown hook, taken care of by https://issues.apache.org/jira/browse/SPARK-3090 On Wed, May 13, 2015 at 8:04 AM, Yifan LI wrote: > Hi, > > I have some applications finished(but actually failed

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is corrupted

2015-05-18 Thread Imran Rashid
Looks like this exception is after many more failures have occurred. It is already on attempt 6 for stage 7 -- I'd try to find out why attempt 0 failed. This particular exception is probably a result of corruption that can happen when stages are retried, that I'm working on addressing in https://

Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Bill Jay
Hi all, I am reading the docs of the receiver-based Kafka consumer. The last parameter of KafkaUtils.createStream is the per-topic number of Kafka partitions to consume. My question is, does the number of partitions for a topic in this parameter need to match the number of partitions in Kafka? For example

Re: Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Saisai Shao
Hi Bill, You don't need to match the number of threads to the number of partitions in the specific topic. For example, if you have 3 partitions in topic1 but you only set 2 threads, ideally 1 thread will receive 2 partitions and the other thread the remaining partition; it depends on the scheduling
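
A sketch in Scala of the receiver-based call being discussed - the Map value is the number of consumer threads per topic, which, as Saisai says, does not have to equal the topic's partition count; the ZooKeeper quorum, group id, topic name, and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("ReceiverKafkaExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // topic1 might have 3 Kafka partitions while we ask for only 2 consumer
    // threads; one thread then simply reads from two of the partitions.
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("topic1" -> 2))
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()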

Re: Spark on Yarn : Map outputs lifetime ?

2015-05-18 Thread Imran Rashid
Neither of those two. Instead, the shuffle data is cleaned up when the stage they are from get GC'ed by the jvm. that is, when you are no longer holding any references to anything which points to the old stages, and there is an appropriate gc event. The data is not cleaned up right after the sta

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
Hi Suman, For maxIterations, are you using the DenseKMeans.scala example code? (I'm guessing yes since you mention the command line.) If so, then you should be able to specify maxIterations via an extra parameter like "--numIterations 50" (note the example uses "numIterations" in the current mas
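
If you call MLlib directly rather than going through the DenseKMeans example, the iteration cap is just a setter on the estimator; a minimal Scala sketch, where the input path, the space-separated format, and k = 10 are illustrative:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Stop after at most 50 iterations (the algorithm may converge earlier).
    val model = new KMeans()
      .setK(10)
      .setMaxIterations(50)
      .run(data)

    println(model.clusterCenters.mkString("\n"))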

Re: Broadcast variables can be rebroadcast?

2015-05-18 Thread Imran Rashid
Rather than "updating" the broadcast variable, can't you simply create a new one? When the old one can be gc'ed in your program, it will also get gc'ed from spark's cache (and all executors). I think this will make your code *slightly* more complicated, as you need to add in another layer of indi
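
A rough Scala sketch of the pattern Imran describes - drop the old broadcast and create a fresh one instead of mutating it; the lookup-loading function, the refresh trigger, and the data shape are hypothetical:

    import org.apache.spark.broadcast.Broadcast

    // Hypothetical loader for the lookup data that needs periodic refreshing.
    def loadLookup(): Map[String, Int] = Map("a" -> 1, "b" -> 2)

    var lookup: Broadcast[Map[String, Int]] = sc.broadcast(loadLookup())

    def refreshLookup(): Unit = {
      lookup.unpersist(blocking = false)   // let executors drop the stale copy
      lookup = sc.broadcast(loadLookup())  // new broadcast, new id; old one can be GC'ed
    }

    // Closures should read through the current handle, e.g.
    // rdd.map(x => lookup.value.getOrElse(x, 0))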

Re: spark log field clarification

2015-05-18 Thread Imran Rashid
depends what you mean by "output data". Do you mean: * the data that is sent back to the driver? that is "result size" * the shuffle output? that is in "Shuffle Write Metrics" * the data written to a hadoop output format? that is in "Output Metrics" On Thu, May 14, 2015 at 2:22 PM, yanwei wro

Re: Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-18 Thread Joseph Bradley
Hi Justin, It sounds like you're on the right track. The best way to write a custom Evaluator will probably be to modify an existing Evaluator as you described. It's best if you don't remove the other code, which handles parameter set/get and schema validation. Joseph On Sun, May 17, 2015 at 10

Re: FetchFailedException and MetadataFetchFailedException

2015-05-18 Thread Imran Rashid
Hi, can you take a look at the logs and see what the first error you are getting is? Its possible that the file doesn't exist when that error is produced, but it shows up later -- I've seen similar things happen, but only after there have already been some errors. But, if you see that in the ver

Re: LogisticRegressionWithLBFGS with large feature set

2015-05-18 Thread Imran Rashid
I'm not super familiar with this part of the code, but from taking a quick look: a) the code creates a MultivariateOnlineSummarizer, which stores 7 doubles per feature (mean, max, min, etc. etc.) b) The limit is on the result size from *all* tasks, not from one task. You start with 3072 tasks c) t

TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I am trying to print a basic twitter stream and receiving the following error: 15/05/18 22:03:14 INFO Executor: Fetching http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with timestamp 1432000973058 15/05/18 22:03:14 INFO Utils: Fetching http://192.168.56.1:49752/jars/twitter4j-me

Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I'm not 100% sure that is causing a problem, though. The stream still starts, but is giving blank output. I checked the environment variables in the ui and it is running local[*], so there should be no bottleneck there. On Mon, May 18, 2015 at 10:08 PM, Justin Pihony wrote: > I am trying to prin

RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi, Thanks for the response. But I could not see a fillna function in the DataFrame class. Is it available in some specific version of Spark SQL? This is what I have in my pom.xml: org.apache.spark spark-sql_2.10 1.3.1
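
In the Scala API the equivalent is reached through df.na (DataFrameNaFunctions) rather than a fillna method on DataFrame itself; a small sketch, assuming a Spark 1.3.1 build where the na functions are available and where df is the DataFrame read from the CSV file (the column names and output path are placeholders):

    // Replace nulls in string columns with an empty string.
    val filled = df.na.fill("")

    // Or per column, e.g. for numeric columns:
    val filled2 = df.na.fill(Map("a" -> 0, "b" -> 0))

    filled.saveAsParquetFile("out.parquet")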

Re: number of executors

2015-05-18 Thread xiaohe lan
Hi Sandy, Thanks for your information. Yes, spark-submit --master yarn --num-executors 5 --executor-cores 4 target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp is working awesomely. Is there any documentation pointing to this? Thanks, Xiaohe On Tue, May 19, 2015 at 12:07 AM,

Re: number of executors

2015-05-18 Thread Sandy Ryza
Awesome! It's documented here: https://spark.apache.org/docs/latest/submitting-applications.html -Sandy On Mon, May 18, 2015 at 8:03 PM, xiaohe lan wrote: > Hi Sandy, > > Thanks for your information. Yes, spark-submit --master yarn > --num-executors 5 --executor-cores 4 > target/scala-2.10/sim

Re: number of executors

2015-05-18 Thread xiaohe lan
Yeah, I read that page before, but it does not mention the options should come before the application jar. Actually, if I put the --class option before the application jar, I will get ClassNotFoundException. Anyway, thanks again Sandy. On Tue, May 19, 2015 at 11:06 AM, Sandy Ryza wrote: > Awes

Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I think I found the answer -> http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html Do I have no way of running this in Windows locally? On Mon, May 18, 2015 at 10:44 PM, Justin Pihony wrote: > I'm not 100% sure that i

Spark Streaming graceful shutdown in Spark 1.4

2015-05-18 Thread Dibyendu Bhattacharya
Hi, Just figured out that if I want to perform graceful shutdown of Spark Streaming 1.4 (from master), Runtime.getRuntime().addShutdownHook no longer works. As in Spark 1.4 there is Utils.addShutdownHook defined for Spark Core, which gets called anyway, which leads to graceful shutdown fro

Re: --jars works in "yarn-client" but not "yarn-cluster" mode, why?

2015-05-18 Thread Fengyun RAO
Thanks, Marcelo! Below is the full log, SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/p

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-18 Thread Tathagata Das
If you wanted to stop it gracefully, then why are you not calling ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't matter whether the shutdown hook was called or not. TD On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi,
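
A minimal Scala sketch of the explicit graceful stop TD is pointing at; the app name, batch interval, and the comment about where the shutdown trigger comes from are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("GracefulShutdownExample"), Seconds(1))
    // ... define streams and output operations here ...
    ssc.start()

    // Wherever your own shutdown trigger fires (signal handler, marker file,
    // admin message, ...): finish processing the data already received, then
    // tear down the SparkContext as well.
    ssc.stop(stopSparkContext = true, stopGracefully = true)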

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-18 Thread Tathagata Das
If you don't want the fileStream to start before a certain event has happened, why not start the StreamingContext after that event? TD On Sun, May 17, 2015 at 7:51 PM, Haopu Wang wrote: > I want to use file stream as input. And I look at SparkStreaming > document again, it's saying file strea