Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
I'm not caching the data. With "each iteration" I mean each 128 MB that an executor has to process. The code is pretty simple. final Conversor c = new Conversor(null, null, null, longFields, typeFields); SparkConf conf = new SparkConf().setAppName("Simple Application"); JavaSparkContext sc = new Ja

Re: Tableau beta connector

2015-02-05 Thread Ashutosh Trivedi (MT2013030)
okay. So the queries tableau will run on the persisted data will be through SPARK SQL to improve performance and to take advantage of SPARK SQL. Thanks again Denny From: Denny Lee Sent: Thursday, February 5, 2015 1:27 PM To: Ashutosh Trivedi (MT2013030); İsmail

Re: pyspark - gzip output compression

2015-02-05 Thread Kane Kim
I'm getting "SequenceFile doesn't work with GzipCodec without native-hadoop code!" Where do I get those libs and where do I put them in Spark? Also, can I save a plain text file (like saveAsTextFile) as gzip? Thanks. On Wed, Feb 4, 2015 at 11:10 PM, Kane Kim wrote: > How to save RDD with gzip compres

Errors in the workers machines

2015-02-05 Thread Spico Florin
Hello! I received the following errors in the workerLog.log files: ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660] -> [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with [akka.tcp://sparkExecutor@stream4:47929]] [ akka.remote.EndpointAssociationE

Re: How many stages in my application?

2015-02-05 Thread Joe Wass
Thanks Akhil and Mark. I can of course count events (assuming I can deduce the shuffle boundaries), but like I said the program isn't simple and I'd have to do this manually every time I change the code. So I'd rather find a way of doing this automatically if possible. On 4 February 2015 at 19:41, M

Re: Errors in the workers machines

2015-02-05 Thread Arush Kharbanda
1. For what reasons is Spark using the above ports? What internal component is triggering them? - Akka (guessing from the error log) is used to schedule tasks and to notify executors; the ports used are random by default. 2. How can I get rid of these errors? - Probably the ports are not open on you

Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
RDD#toDebugString will help. On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass wrote: > Thanks Akhil and Mark. I can of course count events (assuming I can deduce > the shuffle boundaries), but like I said the program isn't simple and I'd > have to do this manually every time I change the code. So I rath
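
A minimal sketch of Mark's suggestion, assuming a hypothetical word-count pipeline; each shuffle dependency shown in the printed lineage marks a stage boundary:

    val counts = sc.textFile("hdfs:///logs")   // hypothetical input path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                      // shuffle => new stage
    println(counts.toDebugString)              // prints the RDD lineage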

Re: pyspark - gzip output compression

2015-02-05 Thread Sean Owen
No, you can compress SequenceFile with gzip. If you are reading outside Hadoop then SequenceFile may not be a great choice. You can use the gzip codec with TextOutputFormat if you need to. On Feb 5, 2015 8:28 AM, "Kane Kim" wrote: > I'm getting SequenceFile doesn't work with GzipCodec without nat
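
A minimal sketch of Sean's suggestion, assuming plain-text input and hypothetical paths; saveAsTextFile accepts a compression codec class:

    import org.apache.hadoop.io.compress.GzipCodec

    val lines = sc.textFile("hdfs:///input/data.txt")
    lines.saveAsTextFile("hdfs:///output/data-gz", classOf[GzipCodec])  // gzip-compressed text output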

Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
And the Job page of the web UI will give you an idea of stages completed out of the total number of stages for the job. That same information is also available as JSON. Statically determining how many stages a job logically comprises is one thing, but dynamically determining how many stages remai

Pyspark Hbase scan.

2015-02-05 Thread Castberg, René Christian
Hi, I am trying to do an HBase scan and read it into a Spark RDD using pyspark. I have successfully written data to HBase from pyspark, and been able to read a full table from HBase using the python example code. Unfortunately I am unable to find any example code for doing an HBase scan and rea

Re: how to specify hive connection options for HiveContext

2015-02-05 Thread Arush Kharbanda
Hi, are you trying to run a Spark job from inside Eclipse, and do you want the job to access Hive configuration options in order to access Hive tables? Thanks Arush On Tue, Feb 3, 2015 at 7:24 AM, guxiaobo1982 wrote: > Hi, > > I know two options, one for spark_submit, the other one for spark-shell, > but ho

Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi, in fact, this pull request https://github.com/apache/spark/pull/3920 is for doing an HBase scan. However, it is not merged yet. You can also take a look at the example code at http://spark-packages.org/package/20 which uses Scala and Python to read data from HBase. Hope this can be helpful. Cheers Gen

Spark SQL - Not able to create schema RDD for nested Directory for specific directory names

2015-02-05 Thread Nishant Patel
Hi, I got strange behavior. When I am creating a schema RDD for a nested directory, sometimes it works and sometimes it does not. My question is whether nested directories are supported or not. My code is as below. val fileLocation = "hdfs://localhost:9000/apps/hive/warehouse/hl7" val parquetRDD = sq

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread kundan kumar
The problem got resolved after removing all the configuration files from all the slave nodes. Earlier we were running in the standalone mode and that led to duplicating the configuration on all the slaves. Once that was done it ran as expected in cluster mode. Although performance is not up to the

Error KafkaStream

2015-02-05 Thread Eduardo Costa Alfaia
Hi Guys, I’m getting this error in KafkaWordCount; TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234): java.lang.ClassCastException: [B cannot be cast to java.lang.String

Re: Shuffle write increases in spark 1.2

2015-02-05 Thread Anubhav Srivastav
Hi Kevin, We seem to be facing the same problem as well. Were you able to find anything after that? The ticket does not seem to have progressed anywhere. Regards, Anubhav On 5 January 2015 at 10:37, 정재부 wrote: > Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081 > > > > -

Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Spico Florin
Hello! I'm using spark-csv_2.10 with Java from the Maven repository (groupId com.databricks, artifactId spark-csv_2.10, version 0.1.1). I would like to use Spark SQL to filter out my data. I'm using the following code: JavaSchemaRDD cars = new JavaCsvParser().withUseHeader(true).csvFile( sqlContext, logFile); cars.registerAsTabl

Resources not uploaded when submitting job in yarn-client mode

2015-02-05 Thread weberste
Hi, I am trying to submit a job from a Windows system to a YARN cluster running on Linux (the HDP2.2 sandbox). I have copied the relevant Hadoop directories as well as the yarn-site.xml and mapred-site.xml to the Windows file system. Further, I have added winutils.exe to $HADOOP_HOME/bin. I can t

Re: Zipping RDDs of equal size not possible

2015-02-05 Thread Niklas Wilcke
Hi Xiangrui, I'm sorry. I didn't recognize your mail. What I did is a workaround only working for my special case. It does not scale and only works for small data sets but that is fine for me so far. Kind Regards, Niklas def securlyZipRdds[A, B: ClassTag](rdd1: RDD[A], rdd2: RDD[B]): RDD[(A, B

how to debug this kind of error, e.g. "lost executor"?

2015-02-05 Thread Yifan LI
Hi, I am running a heavy memory/cpu overhead graphx application, I think the memory is sufficient and set RDDs’ StorageLevel using MEMORY_AND_DISK. But I found there were some tasks failed due to following errors: java.io.FileNotFoundException: /data/spark/local/spark-local-20150205151711-9700

How to design a long live spark application

2015-02-05 Thread Shuai Zheng
Hi All, I want to develop a server-side application: the user submits a request -> the server runs a Spark application and returns the result (this might take a few seconds). So I want to host the server to keep a long-lived context; I don’t know whether this is reasonable or not. Basically I try to have a g

Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Shuai Zheng
Hi All, It might sound weird, but I think Spark is perfect to be used as a multi-threading library in some cases. The local mode will naturally spin up multiple threads when required. Because it is more restricted, there is less chance of having a potential bug in the code (because it is more data oriented,

streaming joining multiple streams

2015-02-05 Thread Zilvinas Saltys
The challenge I have is this. There are two streams of data where an event might look like this in stream1: (time, hashkey, foo1) and in stream2: (time, hashkey, foo2). The result after joining should be (time, hashkey, foo1, foo2). The join happens on hashkey and the time difference can be ~30 mins

Re: Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Jerry Lam
Hi Florin, I might be wrong but timestamp looks like a keyword in SQL that the engine gets confused with. If it is a column name of your table, you might want to change it. ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) I'm constantly working with CSV files with spark. H

Re: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Sean Owen
Do you mean disable the web UI? spark.ui.enabled=false Sure, it's useful with master = local[*] too. On Thu, Feb 5, 2015 at 9:30 AM, Shuai Zheng wrote: > Hi All, > > > > It might sounds weird, but I think spark is perfect to be used as a > multi-threading library in some cases. The local mode wi
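
A minimal sketch combining both points, assuming a Scala driver (the app name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("local-multithreading")   // hypothetical
      .setMaster("local[*]")                // use all local cores
      .set("spark.ui.enabled", "false")     // disable the web UI, as suggested
    val sc = new SparkContext(conf)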

ZeroMQ and pyspark.streaming

2015-02-05 Thread Sasha Kacanski
Does pyspark support ZeroMQ? I see that Java does, but I am not sure about Python. Regards -- Aleksandar Kacanski

Re: Error KafkaStream

2015-02-05 Thread Sean Owen
Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same issue? On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia wrote: > Hi Guys, > I’m getting this error in KafkaWordCount; > > TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234): > java.lang.ClassCastExcep

Re: Error KafkaStream

2015-02-05 Thread Eduardo Costa Alfaia
I don’t think so Sean. > On Feb 5, 2015, at 16:57, Sean Owen wrote: > > Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same > issue? > > On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia > wrote: >> Hi Guys, >> I’m getting this error in KafkaWordCount; >> >> TaskSetMa

RE: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Shuai Zheng
Nice. I just tried it and it works. Thanks very much! And I notice the following in the log: 15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@NY02913D.global.local:8162] 15/02/05 11:19:10 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://s

Re: how to debug this kind of error, e.g. "lost executor"?

2015-02-05 Thread Yifan LI
Does anyone have an idea where I can find the detailed log of that lost executor (why it was lost)? Thanks in advance! > On 05 Feb 2015, at 16:14, Yifan LI wrote: > > Hi, > > I am running a heavy memory/cpu overhead graphx application, I think the > memory is sufficient and set RDDs’ StorageLe

Re: How to design a long live spark application

2015-02-05 Thread Boromir Widas
You can check out https://github.com/spark-jobserver/spark-jobserver - this allows several users to upload their jars and run jobs with a REST interface. However, if all users are using the same functionality, you can write a simple spray server which will act as the driver and hosts the spark con

Re: How to design a long live spark application

2015-02-05 Thread Charles Feduke
If you want to design something like the Spark shell, have a look at: http://zeppelin-project.org/ It's open source and may already do what you need. If not, its source code will be helpful in answering the questions you have about how to integrate with long-running jobs. On Thu Feb 05 2015 at 11

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke wrote: > If you want to design something like Spark shell have a look at: > > http://zeppelin-project.

Re: how to debug this kind of error, e.g. "lost executor"?

2015-02-05 Thread Ankur Srivastava
Li, I cannot tell you the reason for this exception, but I have seen these kinds of errors when using the HASH based shuffle manager (which is the default until v1.2). Try the SORT shuffle manager. Hopefully that will help. Thanks Ankur Anyone has idea on where I can find the detailed log of that lost execu
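
A minimal sketch of switching the shuffle manager as Ankur suggests, assuming the setting is applied on the driver's SparkConf:

    val conf = new SparkConf()
      .setAppName("graphx-job")               // hypothetical
      .set("spark.shuffle.manager", "sort")   // use the sort-based shuffle instead of hash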

Re: word2vec: how to save an mllib model and reload it?

2015-02-05 Thread Carsten Schnober
As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in our Hadoop cluster and here's my issue with classic Java serialization of the model: I don't have SSH access to the cluster master node. Here's my code for computing the model: val input = sc.textFile("README.md").

Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
While a spark-submit job is setting up, the yarnAppState goes into Running mode, then I get a flurry of typical looking INFO-level messages such as INFO BlockManagerMasterActor: ... INFO YarnClientSchedulerBackend: Registered executor: ... Then, spark-submit quits without any error message and I

Re: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Arush Kharbanda
You can use Akka, which is the underlying multithreading library Spark uses. On Thu, Feb 5, 2015 at 9:56 PM, Shuai Zheng wrote: > Nice. I just try and it works. Thanks very much! > > And I notice there is below in the log: > > 15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addres

Re: Spark job ends abruptly during setup without error message

2015-02-05 Thread Arush Kharbanda
Are you submitting the job from your local machine or on the driver machine? Have you set YARN_CONF_DIR? On Thu, Feb 5, 2015 at 10:43 PM, Arun Luthra wrote: > While a spark-submit job is setting up, the yarnAppState goes into Running > mode, then I get a flurry of typical looking INFO-level me

Re: Tableau beta connector

2015-02-05 Thread Denny Lee
Could you clarify what you mean by "build another Spark and work through Spark Submit"? If you are referring to utilizing Spark and Thrift, you could start the Spark service and then have your spark-shell, spark-submit, and/or thrift service aim at the master you have started. On Thu Feb 05

Shuffle Dependency Casting error

2015-02-05 Thread aanilpala
Hi, I am working on a text mining project and I want to use the NaiveBayesClassifier of MLlib to classify some stream items. So, I have two Spark contexts, one of which is a streaming context. Everything looks fine if I comment out the train and predict methods; it works, although it obviously doesn't do w

Re: Shuffle Dependency Casting error

2015-02-05 Thread VISHNU SUBRAMANIAN
Hi, Could you share the code snippet. Thanks, Vishnu On Thu, Feb 5, 2015 at 11:22 PM, aanilpala wrote: > Hi, I am working on a text mining project and I want to use > NaiveBayesClassifier of MLlib to classify some stream items. So, I have two > Spark contexts one of which is a streaming contex

Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
Any idea why, if I use more containers, I get a lot of stops because of GC? 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz : > I'm not caching the data. with "each iteration I mean,, each 128mb > that a executor has to process. > > The code is pretty simple. > > final Conversor c = new Conversor(null, nul

Number of goals to win championship

2015-02-05 Thread jvuillermet
I want to find the minimum number of goals for a player that likely allows their team to win the championship. My data (goals, win/lose): 25 1; 5 0; 10 1; 20 0. After some reading and courses, I think I need a Logistic Regression model for this data. I create my LabeledPoint with those data (1/
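
A minimal sketch of the modelling step described above, using only the four sample rows from the message (label = win/lose, single feature = goals); the SGD settings are illustrative:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(25.0)),
      LabeledPoint(0.0, Vectors.dense(5.0)),
      LabeledPoint(1.0, Vectors.dense(10.0)),
      LabeledPoint(0.0, Vectors.dense(20.0))
    ))
    val model = LogisticRegressionWithSGD.train(points, 100)   // 100 iterations
    println(model.predict(Vectors.dense(15.0)))                // predicted class for 15 goals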

Shuffle read/write issue in spark 1.2

2015-02-05 Thread Praveen Garg
Hi, While moving from spark 1.1 to spark 1.2, we are facing an issue where Shuffle read/write has been increased significantly. We also tried running the job by rolling back to spark 1.1 configuration where we set spark.shuffle.manager to hash and spark.shuffle.blockTransferService to nio. It d

Re: How to design a long live spark application

2015-02-05 Thread Chip Senkbeil
Hi, You can also check out the Spark Kernel project: https://github.com/ibm-et/spark-kernel It can plug into the upcoming IPython 3.0 notebook (providing a Scala/Spark language interface) and provides an API to submit code snippets (like the Spark Shell) and get results directly back, rather than

get null pointer exception newAPIHadoopRDD.map()

2015-02-05 Thread oxpeople
I modified the code based on CassandraCQLTest to get the area code count by time zone. I got an error on creating the new mapped RDD. Any help is appreciated. Thanks. ... val arecodeRdd = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[CqlPagingInputFormat], classOf[java.util.Map[Str

Re: get null pointer exception newAPIHadoopRDD.map()

2015-02-05 Thread Ted Yu
Is it possible that value.get("area_code") or value.get("time_zone") returned null? On Thu, Feb 5, 2015 at 10:58 AM, oxpeople wrote: > I modified the code Base on CassandraCQLTest. to get the area code count > base on time zone. I got error on create new map Rdd. Any helping is > appreciate

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread Kostas Sakellis
Kundan, So I think your configuration here is incorrect. We need to adjust memory and #executors. So for your case you have: cluster setup of 5 nodes, 16 GB RAM, 8 cores each. The number of executors should be the total number of nodes in your cluster - in your case 5. As for --num-executor-cores it should
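
One possible translation of this sizing advice into configuration properties; only the executor count comes from the thread, while the cores and memory values below are illustrative assumptions for a 5-node, 16 GB, 8-core cluster:

    val conf = new SparkConf()
      .set("spark.executor.instances", "5")   // one executor per node, per the advice above
      .set("spark.executor.cores", "8")       // illustrative
      .set("spark.executor.memory", "12g")    // illustrative; leave headroom for OS/YARN overhead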

Re: LeaseExpiredException while writing schemardd to hdfs

2015-02-05 Thread Petar Zecevic
Why don't you just map the RDD's rows to lines and then call saveAsTextFile()? On 3.2.2015. 11:15, Hafiz Mujadid wrote: I want to write a whole SchemaRDD to a single file in HDFS but am facing the following exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException
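
A minimal sketch of Petar's suggestion, assuming schemaRdd is the SchemaRDD in question and the output path is hypothetical; coalesce(1) yields the single output file the original poster asked for:

    val asLines = schemaRdd.map(row => row.mkString("\t"))   // one tab-separated line per Row
    asLines.coalesce(1).saveAsTextFile("hdfs:///output/result")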

MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]

2015-02-05 Thread danilopds
Hi, I'm learning Spark and testing the Spark MLlib library with algorithm K-means. So, I created a file "height-weight.txt" like this: 65.0 220.0 73.0 160.0 59.0 110.0 61.0 120.0 ... And the code (executed in spark-shell): import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.ml

My first experience with Spark

2015-02-05 Thread java8964
I am evaluating Spark for our production usage. Our production cluster is Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment running with Hadoop. What I have in mind is to test a very complex Hive query, which joins between 6 tables, lots of nested structure with explo

Re: MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]

2015-02-05 Thread danilopds
I solve the question with this code: import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt").map { line => Vectors.dense(line.split(' ').map(_.toDouble)) }.cache() val cluster = KMeans.trai

Re: Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
I'm submitting this on a cluster, with my usual setting of, export YARN_CONF_DIR=/etc/hadoop/conf It is working again after a small change to the code so I will see if I can reproduce the error (later today). On Thu, Feb 5, 2015 at 9:17 AM, Arush Kharbanda wrote: > Are you submitting the job fr

word2vec more distributed

2015-02-05 Thread Alex Minnaar
I was wondering if there was any chance of getting a more distributed word2vec implementation. I seem to be running out of memory from big local data structures such as val syn1Global = new Array[Float](vocabSize * vectorSize) Is there any chance of getting a version where these are all pu

RE: How to design a long live spark application

2015-02-05 Thread Shuai Zheng
This example helps a lot. But I am thinking about the case below: assume I have a SparkContext as a global variable. Then if I use multiple threads to access/use it, will it mess things up? For example, my code: public static List> run(JavaSparkContext sparkContext, Map> cache, Properties

K-Means final cluster centers

2015-02-05 Thread SK
Hi, I am trying to get the final cluster centers after running the KMeans algorithm in MLlib in order to characterize the clusters. But the KMeansModel does not have any public method to retrieve this info. There appears to be only a private method called clusterCentersWithNorm. I guess I could c

Re: K-Means final cluster centers

2015-02-05 Thread Suneel Marthi
There's a kMeansModel.clusterCenters() available if you are looking to get the centers from KMeansModel. From: SK To: user@spark.apache.org Sent: Thursday, February 5, 2015 5:35 PM Subject: K-Means final cluster centers Hi, I am trying to get the final cluster centers after running th

Re: How to design a long live spark application

2015-02-05 Thread Eugen Cepoi
Yes you can submit multiple actions from different threads to the same SparkContext. It is safe. Indeed what you want to achieve is quite common. Expose some operations over a SparkContext through HTTP. I have used spray for this and it just worked fine. At bootstrap of your web app, start a spark
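
A minimal sketch of submitting actions from several threads against one shared SparkContext sc, using Scala futures (the data and operations are hypothetical):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    val rdd = sc.parallelize(1 to 1000000).cache()
    val jobs = Seq(
      Future { rdd.map(_ * 2).sum() },            // action submitted from thread 1
      Future { rdd.filter(_ % 2 == 0).count() }   // action submitted from thread 2
    )
    jobs.foreach(j => println(Await.result(j, 10.minutes)))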

Re: K-Means final cluster centers

2015-02-05 Thread Frank Austin Nothaft
Unless I misunderstood your question, you’re looking for the val clusterCenters in http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel, no? Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Feb 5, 2015, at 2:
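
A minimal sketch of reading the centers off a trained model, assuming data is an RDD[Vector] of the points being clustered (k and the iteration count are illustrative):

    import org.apache.spark.mllib.clustering.KMeans

    val model = KMeans.train(data, 3, 20)   // k = 3, 20 iterations
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"cluster $i: $center")
    }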

RE: My first experience with Spark

2015-02-05 Thread java8964
Finally I gave up after too many failed retries. From the log on the worker side, it looks like it failed with a JVM OOM, as below: 15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Driver Heartbeater,5,main]java.lang.OutOfMemoryError: Java heap s

How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread YaoPau
I have a file "badFullIPs.csv" of bad IP addresses used for filtering. In yarn-client mode, I simply read it off the edge node, transform it, and then broadcast it: val badIPs = fromFile(edgeDir + "badfullIPs.csv") val badIPsLines = badIPs.getLines val badIpSet = badIPsLines.toS

Re: My first experience with Spark

2015-02-05 Thread Deborah Siegel
Hi Yong, Have you tried increasing your level of parallelism? How many tasks are you getting in failing stage? 2-3 tasks per CPU core is recommended, though maybe you need more for your shuffle operation? You can configure spark.default.parallelism, or pass in a level of parallelism as second par
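
Two ways of raising the level of parallelism Deborah mentions (the value and the pairs RDD are illustrative):

    // globally, before creating the SparkContext
    val conf = new SparkConf().set("spark.default.parallelism", "400")
    // or per shuffle operation, as the second argument
    val counts = pairs.reduceByKey(_ + _, 400)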

one is the default value for intercepts in GeneralizedLinearAlgorithm

2015-02-05 Thread jamborta
hi all, I have been going through the GeneralizedLinearAlgorithm to understand how intercepts are handled in regression. Just noticed that the initial setting for the intercept is set to one (whereas the initial setting for the rest of the coefficients is set to zero) using the same piece of code

no option to add intercepts for StreamingLinearAlgorithm

2015-02-05 Thread jamborta
hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand that run method in the underlying GeneralizedLinearModel does not take intercept as a parameter either. Any reason for that? thanks, -- View this message in contex

Re: Is there a way to access Hive UDFs in a HiveContext?

2015-02-05 Thread jamborta
Hi, My guess is that Spark is not picking up the jar where the function is stored. You might have to add it to sparkcontext or the classpath manually. You can also register the function hc.registerFunction("myfunct", myfunct) then use it in the query. -- View this message in context: http:/
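
A minimal sketch of the registerFunction route on a HiveContext hc (Spark 1.2-era API; the UDF and table name are hypothetical):

    hc.registerFunction("myfunct", (s: String) => s.toUpperCase)
    val result = hc.sql("SELECT myfunct(name) FROM people")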

StreamingContext getOrCreate with queueStream

2015-02-05 Thread pnpritchard
I am trying to use the StreamingContext "getOrCreate" method in my app. I started by running the example (RecoverableNetworkWordCount), which worked as ex

Best tools for visualizing Spark Streaming data?

2015-02-05 Thread Su She
Hello Everyone, I wanted to hear the community's thoughts on what open-source tools have been used to visualize data from Spark/Spark Streaming. I've taken a look at Zeppelin, but had some trouble working with it. A couple of questions: 1) I've looked at a couple of blog posts and it seems like spar

Re: maven doesn't build dependencies with Scala 2.11

2015-02-05 Thread Ted Yu
Now that Kafka 0.8.2.0 has been released, adding external/kafka module works. FYI On Sun, Jan 18, 2015 at 7:36 PM, Ted Yu wrote: > bq. there was no 2.11 Kafka available > > That's right. Adding external/kafka module resulted in: > > [ERROR] Failed to execute goal on project spark-streaming-kafk

NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-05 Thread aanilpala
I have the following code: SparkConf conf = new SparkConf().setAppName("streamer").setMaster("local[2]"); conf.set("spark.driver.allowMultipleContexts", "true"); JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(batch_interval));

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-05 Thread Cheng Lian
Hi Jenny, You may try to use --files $SPARK_HOME/conf/hive-site.xml --driver-class-path hive-site.xml when submitting your application. The problem is that when running in cluster mode, the driver is actually running in a random container directory on a random executor node. By using --fil

Re: StreamingContext getOrCreate with queueStream

2015-02-05 Thread Tathagata Das
I don't think your screenshots came through in the email. Nonetheless, queueStream will not work with getOrCreate. It's mainly for testing (by generating your own RDDs) and not really useful for production usage (where you really need checkpoint-based recovery). TD On Thu, Feb 5, 2015 at 4:12

spark driver behind firewall

2015-02-05 Thread Kane Kim
I submit spark job from machine behind firewall, I can't open any incoming connections to that box, does driver absolutely need to accept incoming connections? Is there any workaround for that case? Thanks.

Re: Can't access remote Hive table from spark

2015-02-05 Thread Cheng Lian
Please note that Spark 1.2.0 only supports Hive 0.13.1 or 0.12.0; none of the other versions are supported. Best, Cheng On 1/25/15 12:18 AM, guxiaobo1982 wrote: Hi, I built and started a single node standalone Spark 1.2.0 cluster along with a single node Hive 0.14.0 instance installed by Amba

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread Sandy Ryza
Hi Jon, You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there. -Sandy On Thu, Feb 5, 2015 at 3:18 PM, YaoPau wrote: > I have a file "badFullIPs.csv" of bad IP addresses used for filtering. In > yarn-client mode, I simply read it off
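
A minimal sketch of Sandy's suggestion, assuming the file has been copied to a hypothetical HDFS path and that logs and extractIp stand in for the poster's actual RDD and parsing logic:

    val badIpSet = sc.textFile("hdfs:///data/badfullIPs.csv").map(_.trim).collect().toSet
    val badIpsBc = sc.broadcast(badIpSet)   // shipped once to each executor
    val clean = logs.filter(line => !badIpsBc.value.contains(extractIp(line)))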

RE: Error KafkaStream

2015-02-05 Thread Shao, Saisai
Hi, I think you should change the `DefaultDecoder` of your type parameter into `StringDecoder`, seems you want to decode the message into String. `DefaultDecoder` is to return Array[Byte], not String, so here class casting will meet error. Thanks Jerry -Original Message- From: Eduardo

Re: Parquet compression codecs not applied

2015-02-05 Thread Cheng Lian
Hi Ayoub, The doc page isn’t wrong, but it’s indeed confusing. spark.sql.parquet.compression.codec is used when you’re writing Parquet files with something like data.saveAsParquetFile(...). However, you are using Hive DDL in the example code. All Hive DDLs and commands like SET are directl
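
For contrast, a minimal sketch of the case where the setting does apply, i.e. writing Parquet directly from Spark SQL (paths are hypothetical):

    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    val data = sqlContext.parquetFile("hdfs:///input/events.parquet")
    data.saveAsParquetFile("hdfs:///output/events-gzip.parquet")   // written with the gzip codec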

RE: My first experience with Spark

2015-02-05 Thread java8964
Hi Deb, From what I searched online, changing parallelism is one option. But the failed stage already had 200 tasks, which is quite a lot for a single 24-core box. I know querying that amount of data on one box is kind of over the top, but I do want to know how to configure it to use less memory, even if it could mean u

Get filename in Spark Streaming

2015-02-05 Thread Subacini B
Hi All, We have filenames with a timestamp, say ABC_1421893256000.txt, and the timestamp needs to be extracted from the file name for further processing. Is there a way to get the input file name picked up by a spark streaming job? Thanks in advance Subacini

Requested array size exceeds VM limit Error

2015-02-05 Thread Muttineni, Vinay
Hi, I have a 170 GB tab-delimited data set which I am converting into the RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be used for training a GBT model. I got the Size exceeds Integer.MAX_VALUE error which I fixed by repartitioning the data set to 1000 partitions

Re: Error KafkaStream

2015-02-05 Thread Eduardo Costa Alfaia
Hi Shao, When I changed to StringDecoder I got this compile error: [error] /sata_disk/workspace/spark-1.2.0/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:78: not found: type StringDecoder [error] KafkaUtils.createStream[String, String,

RE: Error KafkaStream

2015-02-05 Thread Shao, Saisai
Did you include Kafka jars? This StringDecoder is under kafka/serializer, You can refer to the unit test KafkaStreamSuite in Spark to see how to use this API. Thanks Jerry From: Eduardo Costa Alfaia [mailto:e.costaalf...@unibs.it] Sent: Friday, February 6, 2015 9:44 AM To: Shao, Saisai Cc: Sean

Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Michael Albert
Greetings! Again, thanks to all who have given suggestions. I am still trying to diagnose a problem where I have processes that run for one or several hours but intermittently stall or hang. By "stall" I mean that there is no CPU usage on the workers or the driver, nor network activity, nor do I s

Re: Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Michael Albert
My apologies for following up my own post, but I thought this might be of interest. I terminated the java process corresponding to the executor which had opened the stderr file mentioned below (kill ). Then my spark job completed without error (it was actually almost finished). Now I am completely co

spark on ec2

2015-02-05 Thread Kane Kim
Hi, I'm trying to change setting as described here: http://spark.apache.org/docs/1.2.0/ec2-scripts.html export SPARK_WORKER_CORES=6 Then I ran ~/spark-ec2/copy-dir /root/spark/conf to distribute to slaves, but without any effect. Do I have to restart workers? How to do that with spark-ec2? Than

Re: Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Xuefeng Wu
What does the jstack dump show? Yours, Xuefeng Wu 吴雪峰 敬上 > On Feb 6, 2015, at 10:20 AM, Michael Albert > wrote: > > My apologies for following up my own post, but I thought this might be of > interest. > > I terminated the java process corresponding to executor which had opened the > stderr fi

Re: spark on ec2

2015-02-05 Thread Charles Feduke
I don't see anything that says you must explicitly restart them to load the new settings, but usually there is some sort of signal trapped [or brute force full restart] to get a configuration reload for most daemons. I'd take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on yo

Re: spark on ec2

2015-02-05 Thread Kane Kim
Oh yeah, they picked up changes after restart, thanks! On Thu, Feb 5, 2015 at 8:13 PM, Charles Feduke wrote: > I don't see anything that says you must explicitly restart them to load > the new settings, but usually there is some sort of signal trapped [or > brute force full restart] to get a con

Reg Job Server

2015-02-05 Thread Deep Pradhan
Hi, Can Spark Job Server be used for profiling Spark jobs?

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan wrote: > Hi, > Can Spark Job Server be used for profiling Spark jobs? >

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis wrote: > Which Spark Job server are you talking about? > > On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan > wrote: > >> Hi, >> Can Spark Job Server be used for profiling Spark

Re: spark driver behind firewall

2015-02-05 Thread Kostas Sakellis
Yes, the driver has to be able to accept incoming connections. All the executors connect back to the driver sending heartbeats, map status, metrics. It is critical and I don't know of a way around it. You could look into using something like the https://github.com/spark-jobserver/spark-jobserver th

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
When you say profiling, what are you trying to figure out? Why your spark job is slow? Gatling seems to be a load generating framework so I'm not sure how you'd use it (i've never used it before). Spark runs on the JVM so you can use any JVM profiling tools like YourKit. Kostas On Thu, Feb 5, 201

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
Yes, I want to know the reason the job is slow. I will look at YourKit. Can you point me to some tutorial on how to use it? Thank You On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis wrote: > When you say profiling, what are you trying to figure out? Why your spark > job is slow

Re: How many stages in my application?

2015-02-05 Thread Kostas Sakellis
Yes, there is no way right now to know how many stages a job will generate automatically. Like Mark said, RDD#toDebugString will give you some info about the RDD DAG and from that you can determine based on the dependency types (Wide vs. narrow) if there is a stage boundary. On Thu, Feb 5, 2015 at

RE: Error KafkaStream

2015-02-05 Thread jishnu.prathap
Hi, If your message is a string you will have to change the Encoder and Decoder to StringEncoder and StringDecoder. If your message is byte[] you can use DefaultEncoder & Decoder. Also don’t forget to add import statements depending on your encoder and decoder. import kafka.ser
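
A minimal sketch of the decoder change discussed in this thread, assuming ssc is the StreamingContext and that the Kafka parameters and topic map are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("zookeeper.connect" -> "localhost:2181", "group.id" -> "kafka-wordcount")
    val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("test" -> 1), StorageLevel.MEMORY_AND_DISK_SER).map(_._2)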

Re: Reg Job Server

2015-02-05 Thread Mark Hamstra
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan wrote: > Yes, I want to know, the reason about the job being slow. > I will look at YourKit. > Can you redirect me to that, some tutorial in how to use? > > T

Re: Whether standalone spark support kerberos?

2015-02-05 Thread Kostas Sakellis
Standalone mode does not support talking to a kerberized HDFS. If you want to talk to a kerberized (secure) HDFS cluster i suggest you use Spark on Yarn. On Wed, Feb 4, 2015 at 2:29 AM, Jander g wrote: > Hope someone helps me. Thanks. > > On Wed, Feb 4, 2015 at 6:14 PM, Jander g wrote: > >> We

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I have a single node Spark standalone cluster. Will this also work for my cluster? Thank You On Fri, Feb 6, 2015 at 11:02 AM, Mark Hamstra wrote: > > https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit > > On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan >

Re: RE: Can't access remote Hive table from spark

2015-02-05 Thread Skanda
Hi, My spark-env.sh has the following entries with respect to classpath: export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hive/lib/*:/etc/hive/conf/ -Skanda On Sun, Feb 1, 2015 at 11:45 AM, guxiaobo1982 wrote: > Hi Skanda, > > How do set up your SPARK_CLASSPATH? > > I add the following line t

Re: how to debug this kind of error, e.g. "lost executor"?

2015-02-05 Thread Xuefeng Wu
Could you find the shuffle files? Or were the files deleted by other processes? Yours, Xuefeng Wu 吴雪峰 敬上 > On Feb 5, 2015, at 11:14 PM, Yifan LI wrote: > > Hi, > > I am running a heavy memory/cpu overhead graphx application, I think the > memory is sufficient and set RDDs’ StorageLevel using MEM
