Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Night Wolf
Thanks Yin, that seems to work with the shell. But in a compiled application run with spark-submit it still fails with the same exception. On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai wrote: > Can you put the following setting in spark-defaults.conf and try again? > > spark.sql.hive.metastore.sharedPref

Re: ALS Rating Object

2015-06-03 Thread Yasemin Kaya
Hi Joseph, I thought about converting the IDs, but there would be a birthday problem: the probability of a hash collision matters to me because of the number of users. I don't know how I can modify ALS to use Integer. yasemin 2015-06-04 2:28

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Can you do count() before fit to force materialize the RDD? I think something before fit is slow. On Wednesday, June 3, 2015, Piero Cinquegrana wrote: > The fit part is very slow, transform not at all. > > The number of partitions was 210 vs number of executors 80. > > Spark 1.4 sounds great

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
The fit part is very slow, transform not at all. The number of partitions was 210 vs number of executors 80. Spark 1.4 sounds great but as my company is using Qubole we are dependent upon them to upgrade from version 1.3.1. Until that happens, can you think of any other reasons as to why it cou

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the # of executors you have. In Spark 1.4's new ML pipeline a
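
A minimal sketch of the advice in this thread: repartition well above the executor count, cache and count() to force materialization, then fit. The helper name, the x3 multiplier, and withMean = false (appropriate for sparse vectors) are illustrative assumptions rather than anything prescribed here.

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def standardize(data: RDD[LabeledPoint], numExecutors: Int) = {
      val features = data.map(_.features)
        .repartition(numExecutors * 3)   // e.g. 240 partitions for 80 executors
        .cache()
      features.count()                   // materialize the RDD before fit, per the advice above
      val scaler = new StandardScaler(withMean = false, withStd = true).fit(features)
      features.map(v => scaler.transform(v))
    }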

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Yin Huai
Can you put the following setting in spark-defaults.conf and try again? spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni https://issues.apache.org/jira/browse/SPAR
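
For a compiled application run with spark-submit (the follow-up case above), spark-defaults.conf or --conf on the spark-submit command line is the documented route. As a sketch, the same prefixes can also be set programmatically on the SparkConf before the HiveContext is created; whether that is honoured identically to spark-defaults.conf is an assumption to verify:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf()
      .setAppName("hive-native-libs")
      .set("spark.sql.hive.metastore.sharedPrefixes",
           "com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc," +
           "com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)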

Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Night Wolf
Hi all, Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode with Hive support. *Build command;* ./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4 -Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided -Dhadoop.version=2.5.1-mapr-1501 -Dyarn.version=2.5.1-mapr

Re: Make HTTP requests from within Spark

2015-06-03 Thread Pat McDonough
Try something like the following. Create a function to make the HTTP call, e.g. using org.apache.commons.httpclient.HttpClient as in below. def getUrlAsString(url: String): String = { val client = new org.apache.http.impl.client.DefaultHttpClient() val request = new org.apache.http.client.met
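
A hedged completion of the helper sketched above, assuming the Apache HttpComponents client is on the classpath (error handling, timeouts and retries omitted):

    import org.apache.http.client.methods.HttpGet
    import org.apache.http.impl.client.DefaultHttpClient
    import org.apache.http.util.EntityUtils

    def getUrlAsString(url: String): String = {
      val client = new DefaultHttpClient()
      try {
        val response = client.execute(new HttpGet(url))
        EntityUtils.toString(response.getEntity)   // read the response body as a String
      } finally {
        client.getConnectionManager.shutdown()     // release the connection pool
      }
    }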

Re: Make HTTP requests from within Spark

2015-06-03 Thread William Briggs
Hi Kaspar, This is definitely doable, but in my opinion, it's important to remember that, at its core, Spark is based around a functional programming paradigm - you're taking input sets of data and, by applying various transformations, you end up with a dataset that represents your "answer". Witho

TreeReduce Functionality in Spark

2015-06-03 Thread raggy
I am trying to understand what the treeReduce function for an RDD does, and how it is different from the normal reduce function. My current understanding is that treeReduce tries to split up the reduce into multiple steps. We do a partial reduce on different nodes, and then a final reduce is done t
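
For illustration, a small sketch contrasting the two calls. It assumes an existing SparkContext sc, and that on Spark versions where treeReduce is not yet on RDD itself it is reached through mllib's RDDFunctions implicits:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val nums = sc.parallelize(1 to 1000000, 200)

    // Plain reduce: every partition's partial result is sent straight to the driver.
    val total1 = nums.reduce(_ + _)

    // treeReduce: partial results are combined on executors in `depth` levels of
    // intermediate aggregation before the final value reaches the driver.
    val total2 = nums.treeReduce(_ + _, depth = 2)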

Re: Spark Cluster Benchmarking Frameworks

2015-06-03 Thread Zhen Jia
Hi Jonathan, Maybe you can try BigDataBench. http://prof.ict.ac.cn/BigDataBench/ . It provides lots of workloads, including both Hadoop and Spark based workloads. Zhen Jia hodgesz wrote > Hi Spark Experts, > > I am curious what people are using to benchmar

Re: Example Page Java Function2

2015-06-03 Thread linkstar350 .
Thank you. I confirmed the page. 2015-06-04 1:35 GMT+09:00 Sean Owen : > Yes, I think you're right. Since this is a change to the ASF hosted > site, I can make this change to the .md / .html directly rather than > go through the usual PR. > > On Wed, Jun 3, 2015 at 6:23 PM, linkstar350 . > wrote

importerror using external library with pyspark

2015-06-03 Thread AlexG
I have libskylark installed on both machines in my two node cluster in the same locations, and checked that the following code, which calls libskylark, works on both nodes with 'pyspark rfmtest.py': import re import numpy import skylark.ml.kernels import random import os from pyspark import Spark

Adding new Spark workers on AWS EC2 - access error

2015-06-03 Thread barmaley
I have an existing, operating Spark cluster that was launched with the spark-ec2 script. I'm trying to add a new slave by following these instructions: stop the cluster; on the AWS console, "launch more like this" on one of the slaves; start the cluster. Although the new instance is added to the same security gro

RE: Make HTTP requests from within Spark

2015-06-03 Thread Mohammed Guller
The short answer is yes. How you do it depends on a number of factors. Assuming you want to build an RDD from the responses and then analyze the responses using Spark core (not Spark Streaming), here is one simple way to do it: 1) Implement a class or function that connects to a web service and
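
A minimal sketch of that pattern: parallelize the request inputs, build one HTTP client per partition, and collect the responses into an RDD. The URL list, partition count and choice of the HttpComponents client are illustrative assumptions, and sc is assumed to exist:

    val urls = Seq("http://example.com/a", "http://example.com/b")

    val responses = sc.parallelize(urls, numSlices = 8).mapPartitions { iter =>
      // One client per partition rather than one per request.
      val client = new org.apache.http.impl.client.DefaultHttpClient()
      val results = iter.map { url =>
        val resp = client.execute(new org.apache.http.client.methods.HttpGet(url))
        (url, org.apache.http.util.EntityUtils.toString(resp.getEntity))
      }.toList                                   // force evaluation before shutdown
      client.getConnectionManager.shutdown()
      results.iterator
    }

    responses.cache()   // analyse with ordinary Spark transformations from here on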

How to pass system properties in spark ?

2015-06-03 Thread Ashwin Shankar
Hi, I'm trying to use property substitution in my log4j.properties so that I can choose where to write Spark logs at runtime. The problem is that a system property passed to spark-shell doesn't seem to be getting propagated to log4j. *Here is log4j.properties (partial) with a parameter 'spark.log.path

Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-03 Thread Yin Huai
Hi Doug, Actually, sqlContext.table does not support database name in both Spark 1.3 and Spark 1.4. We will support it in future version. Thanks, Yin On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog wrote: > Hi, > > sqlContext.table(“db.tbl”) isn’t working for me, I get a > NoSuchTableException.
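
Until that lands, the workaround Doug mentions in his original post (further down in this digest) is to qualify the table inside a SQL statement; the USE-database variant is an additional assumption about the Hive dialect:

    // Works in 1.3/1.4: qualify the table inside SQL instead of sqlContext.table.
    val df = sqlContext.sql("SELECT * FROM db.tbl")

    // Alternative sketch: switch the current database first (HiveContext only).
    sqlContext.sql("USE db")
    val df2 = sqlContext.table("tbl")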

Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
Hello User group, I have an RDD of LabeledPoint composed of sparse vectors as shown below. In the next step, I am standardizing the columns with the Standard Scaler. The data has 2450 columns and ~110M rows. It took 1.5hrs to complete the standardization with 10 nodes and 80 executors. The s

RandomForest - subsamplingRate parameter

2015-06-03 Thread Andrew Leverentz
When training a RandomForest model, the Strategy class (in mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping to use this to cut down on processing time for large datasets (more than 2MM rows and 9K predictors), but I've found that the runtime stays approximately cons
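
For reference, a sketch of where the parameter sits, assuming the older mllib.tree API is in use; the concrete values and the trainingData RDD are illustrative, and the trainClassifier overload taking a Strategy is assumed to be available in this version:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.Gini

    val strategy = new Strategy(
      algo = Algo.Classification,
      impurity = Gini,
      maxDepth = 10,
      numClasses = 2,
      subsamplingRate = 0.1)   // train each tree on roughly 10% of the rows

    // trainingData: RDD[LabeledPoint], assumed to exist
    val model = RandomForest.trainClassifier(
      trainingData, strategy, numTrees = 100, featureSubsetStrategy = "sqrt", seed = 42)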

[ANNOUNCE] YARN support in Spark EC2

2015-06-03 Thread Shivaram Venkataraman
Hi all We recently merged support for launching YARN clusters using Spark EC2 scripts as a part of https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass in hadoop-major-version as "yarn" to the spark-ec2 script and this will setup Hadoop 2.4 HDFS, YARN and Spark built for YARN

Re: ALS Rating Object

2015-06-03 Thread Joseph Bradley
Hi Yasemin, If you can convert your user IDs to Integers in pre-processing (if you have < a couple billion users), that would work. Otherwise... In Spark 1.3: You may need to modify ALS to use Long instead of Int. In Spark 1.4: spark.ml.recommendation.ALS (in the Pipeline API) exposes ALS.train a
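
A hedged sketch of the pre-processing route for Spark 1.3: build a collision-free String-to-Int mapping with zipWithUniqueId rather than hashing (hashing is what raises the birthday-problem concern above). The data, names and ALS parameters are illustrative, and sc is assumed:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // Raw ratings keyed by the original, too-large string user IDs.
    val raw: RDD[(String, Int, Double)] =
      sc.parallelize(Seq(("30011397223227125563254", 101, 4.0),
                         ("30011397223227125563255", 102, 2.0)))

    // Give each distinct user a unique Long id: no hashing, so no collisions.
    // toInt below stays safe while the generated ids remain under Int.MaxValue.
    val userIdMap: RDD[(String, Long)] = raw.map(_._1).distinct().zipWithUniqueId()

    val ratings: RDD[Rating] = raw
      .map { case (user, item, score) => (user, (item, score)) }
      .join(userIdMap)
      .map { case (_, ((item, score), intId)) => Rating(intId.toInt, item, score) }

    val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda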

Re: data localisation in spark

2015-06-03 Thread Sandy Ryza
Tasks are scheduled on executors based on data locality. Things work as you would expect in the example you brought up. Through dynamic allocation, the number of executors can change throughout the life time of an application. 10 executors (or 5 executors with 2 cores each) are not needed for a

Re: Python Image Library and Spark

2015-06-03 Thread ayan guha
Try with large number of partition in parallelize. On 4 Jun 2015 06:28, "Justin Spargur" wrote: > Hi all, > > I'm playing around with manipulating images via Python and want to > utilize Spark for scalability. That said, I'm just learing Spark and my > Python is a bit rusty (been doing PHP c

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-03 Thread Marcelo Vanzin
(bcc: user@spark, cc: cdh-user@cloudera) If you're using CDH, Spark SQL is currently unsupported and mostly untested. I'd recommend not trying to use it in CDH. You could try an upstream version of Spark instead. On Wed, Jun 3, 2015 at 1:39 PM, Don Drake wrote: > As part of upgrading a cluster from

Problem reading Parquet from 1.2 to 1.3

2015-06-03 Thread Don Drake
As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that Spark is behaving differently when reading Parquet directories that contain a .metadata directory. It seems that in spark 1.2.x, it would just ignore the .metadata directory, but now that I'm using Spark 1.3, reading these f

Python Image Library and Spark

2015-06-03 Thread Justin Spargur
Hi all, I'm playing around with manipulating images via Python and want to utilize Spark for scalability. That said, I'm just learning Spark and my Python is a bit rusty (been doing PHP coding for the last few years). I think I have most of the process figured out. However, the script fails on

Re: Managing spark processes via supervisord

2015-06-03 Thread Igor Berman
Assuming you are talking about a standalone cluster: imho, with workers you won't get any problems and it's straightforward, since they are usually foreground processes. With the master it's a bit more complicated: ./sbin/start-master.sh goes to the background, which is not good for supervisor, but anyway I think i

Managing spark processes via supervisord

2015-06-03 Thread Mike Trienis
Hi All, I am curious to know if anyone has successfully deployed a Spark cluster using supervisord? - http://supervisord.org/ Currently I am using the cluster launch scripts, which are working great; however, every time I reboot my VM or development environment I need to re-launch the cluste

Re: Spark Client

2015-06-03 Thread pavan kumar Kolamuri
Thanks Akhil, Richard, Oleg for your quick response. @Oleg we have actually tried the same thing, but unfortunately when we throw an exception the Akka framework catches all exceptions, thinks the job failed, and reruns the Spark jobs infinitely. Since in OneForOneStrategy in Akka, the max no. of re

Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-03 Thread Doug Balog
Hi, sqlContext.table(“db.tbl”) isn’t working for me, I get a NoSuchTableException. But I can access the table via sqlContext.sql(“select * from db.tbl”) So I know it has the table info from the metastore. Anyone else see this ? I’ll keep digging. I compiled via make-distribution -Pyarn

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys on a specific node and sends new tuples with that key to that. Matei > On Jun 3, 2015, at 6:31 AM, allonsy wrote: > > Hi everybody, >
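
Two small illustrations of this, assuming an existing StreamingContext ssc and SparkContext sc; the key extraction and partition counts are placeholders:

    import org.apache.spark.HashPartitioner

    // Streaming: with byKey operations, records sharing a key are processed together,
    // so per-key state stays on one node (the word acts as the grouping "field").
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Plain RDDs: an explicit HashPartitioner pins the key-to-partition assignment.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioned = pairs.partitionBy(new HashPartitioner(16))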

Re: Behavior of the spark.streaming.kafka.maxRatePerPartition config param?

2015-06-03 Thread Cody Koeninger
The default of 0 means no limit. Each batch will grab as much as is available, ie a range of offsets spanning from the end of the previous batch to the highest available offsets on the leader. If you set spark.streaming.kafka.maxRatePerPartition > 0, the number you set is the maximum number of me
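
A small sketch of setting the limit on the SparkConf; the rate, batch interval and app name are illustrative. The cap is per second per partition, so with these numbers 1000 records/sec/partition x 4 Kafka partitions x 10-second batches bounds each batch at roughly 40,000 records:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-rate-limited")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // 0 (the default) = no limit
    val ssc = new StreamingContext(conf, Seconds(10))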

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
I think what we'd want to do is track the ingestion rate in the consumer(s) via Spark's aggregation functions and such. If we're at a critical level (load too high / load too low) then we issue a request into our Provisioning Component to add/remove machines. Once it comes back with an "OK", each c

Re: Broadcast variables can be rebroadcast?

2015-06-03 Thread NB
I am pasting some of the exchanges I had on this topic via the mailing list directly so it may help someone else too. (Don't know why those responses don't show up here). --- Thanks Imran. It does help clarify. I believe I had it right all along then but was confus

StreamingListener, anyone?

2015-06-03 Thread dgoldenberg
Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming listener, like so: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(params.getBatchDurationMillis())); jssc.addStreamingListener(new JobListener(jssc)); where JobListe
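
A minimal Scala sketch of such a listener (the snippet above uses the Java API, and the JobListener constructor argument is omitted here); the printed fields are illustrative:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class JobListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"Batch ${info.batchTime}: processing ${info.processingDelay.getOrElse(-1L)} ms, " +
                s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms")
      }
    }

    // jssc.addStreamingListener(new JobListener)   (same registration call as in the snippet above)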

Re: Example Page Java Function2

2015-06-03 Thread Sean Owen
Yes, I think you're right. Since this is a change to the ASF hosted site, I can make this change to the .md / .html directly rather than go through the usual PR. On Wed, Jun 3, 2015 at 6:23 PM, linkstar350 . wrote: > Hi, I'm Taira. > > I notice that this example page may be a mistake. > > https:/

Example Page Java Function2

2015-06-03 Thread linkstar350 .
Hi, I'm Taira. I notice that this example page may be a mistake. https://spark.apache.org/examples.html Word Count (Java) JavaRDD textFile = spark.textFile("hdfs://..."); JavaRDD words = textFile.flatMap(new FlatMapFunction() { public Iterable call(String s) { ret

Re: Spark 1.4.0 build Error on Windows

2015-06-03 Thread pawan kumar
I got the same error message when using Maven 3.3. On Jun 3, 2015 8:58 AM, "Ted Yu" wrote: > I used the same command on Linux but didn't reproduce the error. > > Can you include the -X switch on your command line? > > Also consider upgrading Maven to 3.3.x > > Cheers > > On Wed, Jun 3, 2015 at 2:36

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
If we have a hand-off between the older consumer and the newer consumer, I wonder if we need to manually manage the offsets in Kafka so as not to miss some messages as the hand-off is happening. Or if we let the new consumer run for a bit then let the old consumer know the 'new guy is in town' the

Re: Spark 1.4.0 build Error on Windows

2015-06-03 Thread Ted Yu
I used the same command on Linux but didn't reproduce the error. Can you include -X switch on your command line ? Also consider upgrading maven to 3.3.x Cheers On Wed, Jun 3, 2015 at 2:36 AM, Daniel Emaasit wrote: > I run into errors while trying to build Spark from the 1.4 release > branch:

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Great. "You should monitor vital performance / job clogging stats of the Spark Streaming Runtime not “kafka topics” -- anything specific you were thinking of? On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov wrote: > Makes sense especially if you have a cloud with “infinite” resources / > nodes whi

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
Makes sense especially if you have a cloud with “infinite” resources / nodes which allows you to double, triple etc in the background/parallel the resources of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 mo

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Evo, One of the ideas is to shadow the current cluster. This way there's no extra latency incurred due to shutting down of the consumers. If two sets of consumers are running, potentially processing the same data, that is OK. We phase out the older cluster and gradually flip over to the new one, i

Re: Spark Client

2015-06-03 Thread Oleg Zhurakousky
I am not sure why Spark is relying on System.exit, hopefully someone will be able to provide a technical justification for it (very curious to hear it), but for your use case you can easily trap System.exit call before JVM exit with a simple implementation of SecurityManager and try/catch. Here are
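
A minimal sketch of that trap (not Spark-specific): a SecurityManager whose checkExit throws, turning System.exit into a catchable SecurityException. Class and message names are illustrative:

    import java.security.Permission

    class NoExitSecurityManager extends SecurityManager {
      override def checkExit(status: Int): Unit =
        throw new SecurityException(s"Trapped System.exit($status)")
      override def checkPermission(perm: Permission): Unit = ()                  // allow everything else
      override def checkPermission(perm: Permission, context: Object): Unit = ()
    }

    System.setSecurityManager(new NoExitSecurityManager)
    try {
      // invoke the Spark client code that may call System.exit here
    } catch {
      case _: SecurityException => ()   // exit was trapped; the long-running JVM keeps going
    } finally {
      System.setSecurityManager(null)   // restore the default behaviour
    }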

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
You should monitor vital performance / job clogging stats of the Spark Streaming Runtime not “kafka topics” You should be able to bring new worker nodes online and make them contact and register with the Master without bringing down the Master (or any of the currently running worker nodes)

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Would it be possible to implement Spark autoscaling somewhat along these lines? -- 1. If we sense that a new machine is needed, by watching the data load in Kafka topic(s), then 2. Provision a new machine via a Provisioner interface (e.g. talk to AWS and get a machine); 3. Create a "shadow"/mirror

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
So Evo, option b is to "singleton" the Param, as in your modified snippet, i.e. instantiate it once per RDD. But if I understand correctly, the a) option is broadcast, meaning instantiation happens in the Driver once, before any transformations and actions, correct? That's where my serialization cost

Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-03 Thread lonikar
When Spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or whether it has the row-wise structure of a regular Spark RDD. The section Spark SQL and DataFrames

Re: Spark Client

2015-06-03 Thread Richard Marscher
I think the short answer to the question is, no, there is no alternate API that will not use the System.exit calls. You can craft a workaround like is being suggested in this thread. For comparison, we are doing programmatic submission of applications in a long-running client application. To get ar

Re: Application is "always" in process when I check out logs of completed application

2015-06-03 Thread Richard Marscher
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed in 1.3.1 and up. https://issues.apache.org/jira/browse/SPARK-6036 The ordering that the event logs are moved from in-progress to complete is coded to be after the Master tries to build the history page for the logs. The onl

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When Spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or whether it has the row-wise structure of a regular Spark RDD. The section Spark SQL and DataFrames

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Dmitry was concerned about the “serialization cost”, NOT the “memory footprint” – hence option a) is still viable, since a Broadcast is performed only ONCE for the lifetime of the Driver instance. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, June 3, 2015 2:44 PM To: Evo Eftimov Cc: dg

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Ted Yu
Considering memory footprint of param as mentioned by Dmitry, option b seems better. Cheers On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov wrote: > Hmmm a spark streaming app code doesn't execute in the linear fashion > assumed in your previous code snippet - to achieve your objectives you > shoul

Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread allonsy
Hi everybody, is there anything in Spark that shares the philosophy of Storm's field grouping? I'd like to manage data partitioning across the workers by sending tuples sharing the same key to the very same worker in the cluster, but I did not find any method to do that. Suggestions? :)

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Hmmm, a Spark Streaming app's code doesn't execute in the linear fashion assumed in your previous code snippet - to achieve your objectives you should do something like the following. In terms of your second objective (saving the initialization and serialization of the params) you can: a) broadcast
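
For illustration, a sketch of the two options with a stand-in Param class; dstream (a DStream[String]) and ssc are assumed to exist, and the real object's initialization cost is what the thread is about:

    // Illustrative stand-in for the expensive-to-initialize object.
    class Param extends Serializable { def lookup(s: String): String = s }

    // Option a): broadcast one instance from the driver, once per driver lifetime.
    val paramBc = ssc.sparkContext.broadcast(new Param)
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val param = paramBc.value                    // deserialized once per executor
        records.foreach(r => println(param.lookup(r)))
      }
    }

    // Option b): instantiate once per RDD/batch inside foreachRDD.
    dstream.foreachRDD { rdd =>
      val param = new Param                          // one instance per batch, shipped with the closure
      rdd.foreachPartition { records =>
        records.foreach(r => println(param.lookup(r)))
      }
    }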

Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread dgoldenberg
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically the takeaway is that all objects passed into the code processing RDD's must be serializable. So if I've got a few objects that I'd rather initialize once and deinitialize once outside of the logic processing the RDD's, I'd

Re: in GraphX,program with Pregel runs slower and slower after several iterations

2015-06-03 Thread Cheuk Lam
I think you're exactly right. I once had 100 iterations in a single Pregel call, and got into the lineage problem right there. I had to modify the Pregel function and checkpoint both the graph and the newVerts RDD there to cut off the lineage. If you draw out the dependency graph among the g, th

ALS Rating Object

2015-06-03 Thread Yasemin Kaya
Hi, I want to use Spark's ALS in my project. I have user IDs like 30011397223227125563254, but the Rating object that ALS wants takes an Integer as the user ID, so the ID field does not fit into a 32-bit Integer. How can I solve that? Thanks. Best, yasemin -- hiç ender hiç

run spark submit on cloudera cluster

2015-06-03 Thread Pa Rö
Hi, I want to run my Spark app on a cluster; I use the Cloudera Live single-node VM. How must I build the job for the spark-submit script? And must I upload spark-submit to HDFS? best regards paul

Re: Spark Client

2015-06-03 Thread Akhil Das
Did you try this? Create an sbt project like: // Create your context val sconf = new SparkConf().setAppName("Sigmoid").setMaster("spark://sigmoid:7077") val sc = new SparkContext(sconf) // Do some computations sc.parallelize(1 to 1).take(10).foreach(println) //Now return the exit stat

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Saisai Shao
I think you could check the YARN NodeManager log or other Spark executor logs to see the details. What you listed above in the exception stacks is just the phenomenon, not the cause. Normally there are a few situations that will lead to executor loss: 1. Killed by YARN because the memory limit was exceeded,

Re: Spark 1.4.0 build Error on Windows

2015-06-03 Thread Daniel Emaasit
I ran into errors while trying to build Spark from the 1.4 release branch: https://github.com/apache/spark/tree/branch-1.4. Any help will be much appreciated. Here is the log file from my Windows 8.1 PC. (F.Y.I., I installed all the dependencies like Java 7 and Maven 3.2.5 and set the environment varia

Error: Building Spark 1.4.0 from Github-1.4 release branch

2015-06-03 Thread Emaasit
I ran into errors while trying to build Spark from the 1.4 release branch: https://github.com/apache/spark/tree/branch-1.4. Any help will be much appreciated. Here is the log file. (F.Y.I., I installed all the dependencies like Java 7 and Maven 3.2.5) C:\Program Files\Apache Software Foundat

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
Hi again, Below is the log from executor FetchFailed(BlockManagerId(4, compute-10-0.local, 38594), shuffleId=0, mapId=117, reduceId=117, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to compute-10-0.local/10.10.255.241:38594 at org.apache.spark.shuffle.hash.Blo

Re: DataFrames coming in SparkR in Apache Spark 1.4.0

2015-06-03 Thread Emaasit
You can build Spark from the 1.4 release branch yourself: https://github.com/apache/spark/tree/branch-1.4 - Daniel Emaasit, Ph.D. Research Assistant Transportation Research Center (TRC) University of Nevada, Las Vegas Las Vegas, NV 89154-4015 Cell: 615-649-2489 www.danielemaasit.com

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread ๏̯͡๏
Sean, I am going with your answer. So there is no equivalent of partition. On Wed, Jun 3, 2015 at 12:34 PM, Jeff Zhang wrote: > I check the RDD#randSplit, it is much more like multiple one-to-one > transformation rather than a one-to-multiple transformation. > > I write one sample code as follow

Re: Spark Client

2015-06-03 Thread pavan kumar Kolamuri
Hi Akhil, sorry, I may not be conveying the question properly. Actually we are looking to launch a Spark job from a long-running workflow manager, which invokes the Spark client via SparkSubmit. Unfortunately the client, upon successful completion of the application, exits with a System.exit(0) or System.

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
Which version of spark? Looks like you are hitting this one https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Wed, Jun 3, 2015 at 1:06 PM, patcharee wrote: > This is log I can get> > > 15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (2/3) > for 4 outst

Re: Spark 1.4.0-rc3: Actor not found

2015-06-03 Thread Anders Arpteg
Tried on some other data sources as well, and it actually works for some parquet sources. Potentially some specific problems with that first parquet source that I tried with, and not a Spark 1.4 problem. I'll get back with more info if I find any new information. Thanks, Anders On Tue, Jun 2, 201

Make HTTP requests from within Spark

2015-06-03 Thread kasparfischer
Hi everybody, I'm new to Spark, apologies if my question is very basic. I need to send millions of requests to a web service and analyse and store the responses in an RDD. I can easily express the analysing part using Spark's filter/map/etc. primitives, but I don't know how to make the reque

Re: Scripting with groovy

2015-06-03 Thread Akhil Das
I think when you do an ssc.stop it will stop your entire application, and by "update a transformation function" do you mean modifying the driver program? In that case, even if you checkpoint your application, it won't be able to recover from its previous state. A simpler approach would be to add certain

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
This is log I can get> 15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (2/3) for 4 outstanding blocks after 5000 ms 15/06/02 16:37:36 INFO client.TransportClientFactory: Found inactive connection to compute-10-3.local/10.10.255.238:33671, creating a new one. 15/06/02 16:37:3

Re: Spark Client

2015-06-03 Thread Akhil Das
Run it as a standalone application. Create an sbt project and do sbt run? Thanks Best Regards On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri < pavan.kolam...@gmail.com> wrote: > Hi guys , i am new to spark . I am using sparksubmit to submit spark jobs. > But for my use case i don't want i

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
You need to look into your executor/worker logs to see whats going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee wrote: > Hi, > > What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? > How can I fix it? > > Best, > Patcharee > >

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Jeff Zhang
node down or container preempted ? You need to check the executor log / node manager log for more info. On Wed, Jun 3, 2015 at 2:31 PM, patcharee wrote: > Hi, > > What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? > How can I fix it? > > Best, > Patcharee > > -

MetaException(message:java.security.AccessControlException: Permission denied

2015-06-03 Thread patcharee
Hi, I was running a Spark job to insert overwrite a Hive table and got Permission denied. My question is: why did the Spark job do the insert as user 'hive', and not as me, the user who ran the job? How can I fix the problem? val hiveContext = new HiveContext(sc) import hiveContext.implicits._ hiveContext.sql

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
I checked RDD#randomSplit; it is much more like multiple one-to-one transformations rather than a one-to-multiple transformation. I wrote some sample code as follows; it would generate 3 stages. Although we can use cache here to make it better, if Spark could support multiple outputs, only 2 stages