Continuous errors trying to start spark-shell

2015-07-03 Thread Mohamed Lrhazi
Hello, I am trying to just start spark-shell... it starts, the prompt appears, then a never-ending (literally) stream of these log lines proceeds. What is it trying to do? Why is it failing? To start it I do: $ docker run -it ncssm/spark-base /spark/bin/spark-shell --master spark:// devzero.c

Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I don't think you can expect any ordering guarantee except for the records within one partition. On Jul 4, 2015 7:43 AM, "khaledh" wrote: > I'm writing a Spark Streaming application that uses RabbitMQ to consume > events. One feature of RabbitMQ that I intend to make use of is bulk ack of > messages, i.e. n

Re: How to timeout a task?

2015-07-03 Thread William Ferrell
Ted, Thanks very much for your reply. It took me almost a week but I have finally had a chance to implement what you noted and it appears to be working locally. However, when I launch this onto a cluster on EC2 -- this doesn't work reliably. To expand, I think the issue is that some of the code w

Re: SparkR and Spark Mlib

2015-07-03 Thread ayan guha
No. SparkR is a language binding for Spark; MLlib is the machine learning project on top of Spark core. On 4 Jul 2015 12:23, "praveen S" wrote: > Hi, > Is sparkR and spark Mlib same? >

SparkR and Spark Mlib

2015-07-03 Thread praveen S
Hi, Are SparkR and Spark MLlib the same?

Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch. Before I commit to doing so, I'd

Re: Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Burak Yavuz
How many partitions do you have? It might be that one partition is too large and there is an integer overflow. Could you double your number of partitions? Burak On Fri, Jul 3, 2015 at 4:41 AM, Danny wrote: > hi, > > i want to run a multiclass classification with 390 classes on120k label > points(
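A minimal sketch of that suggestion, assuming the training data is an RDD[LabeledPoint] named points (the name and the doubling factor are illustrative): keeping each partition small avoids the overflow that appears once a partition's serialized size passes the 2 GB limit.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def train(points: RDD[LabeledPoint], numClasses: Int) = {
      // Double the partition count so no single partition grows past the
      // 2 GB serialized-block limit behind "sizeInBytes was negative".
      val repartitioned = points.repartition(points.partitions.length * 2).cache()
      new LogisticRegressionWithLBFGS()
        .setNumClasses(numClasses)
        .run(repartitioned)
    }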

Re: Optimizations

2015-07-03 Thread Marius Danciu
Thanks for your feedback. Yes, I am aware of the stage design, and Silvio, what you are describing is essentially a map-side join, which is not applicable when both RDDs are quite large. It appears that in rdd.join(...).mapToPair(f), f is piggybacked inside the join stage (right in the reducers, I believe)

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Shivaram Venkataraman
You need to add -Psparkr to build the SparkR code. Shivaram On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das wrote: > Did you try: > > build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package > > > > Thanks > Best Regards > > On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com <1106944.

Experience with centralised logging for Spark?

2015-07-03 Thread Edward Sargisson
Hi all, I'm wondering if anybody has any experience with centralised logging for Spark - or has even felt that there was a need for this, given the WebUI. At my organization we use Log4j2 and Flume as the front end of our centralised logging system. I was looking into modifying Spark to use that syst

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-8817 On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers wrote: > i see the relaxation to allow duplicate field names was done on purpose, > since some data sources can have dupes due to case insensitive resolution. > > apparently the issue is now dealt wit

Re: Spark SQL groupby timestamp

2015-07-03 Thread sim
@bastien, in those situations, I prefer to use Unix timestamps (millisecond or second granularity) because you can apply math operations to them easily. If you don't have a Unix timestamp, you can use unix_timestamp() from Hive SQL to get one with second granularity. Then grouping by hour beco
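A rough sketch of that approach, assuming a HiveContext and a registered table named events with a timestamp column ts (both names hypothetical): truncating the Unix timestamp to the hour turns the GROUP BY into plain integer math.

    // unix_timestamp() is available as a Hive UDF via HiveContext.
    val hourly = hiveContext.sql("""
      SELECT floor(unix_timestamp(ts) / 3600) * 3600 AS hour_start,
             count(*) AS cnt
      FROM events
      GROUP BY floor(unix_timestamp(ts) / 3600) * 3600
    """)
    hourly.show()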

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread sim
@bipin, in my case the error happens immediately in a fresh shell in 1.4.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html Sent from the Apache Spark User List mailing list archive at Na

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
i see the relaxation to allow duplicate field names was done on purpose, since some data sources can have dupes due to case insensitive resolution. apparently the issue is now dealt with during query analysis. although this might work for sql it does not seem a good thing for DataFrame to me. it

Re: Optimizations

2015-07-03 Thread Silvio Fiorito
One thing you could do is a broadcast join. You take your smaller RDD and save it as a broadcast variable. Then run a map operation to perform the join and whatever else you need to do. This will remove a shuffle stage, but you will still have to collect the smaller RDD and broadcast it. All depends
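A rough sketch of the pattern being described, with hypothetical pair RDDs small and large: collect the small side to the driver, broadcast it as a map, and do the lookup inside a map operation so the large RDD is never shuffled.

    // small: RDD[(K, V)] that fits in driver memory; large: RDD[(K, W)]
    val smallMap = sc.broadcast(small.collectAsMap())

    val joined = large.mapPartitions { iter =>
      val lookup = smallMap.value            // read the broadcast once per partition
      iter.flatMap { case (k, w) =>
        lookup.get(k).map(v => (k, (w, v)))  // inner-join semantics
      }
    }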

SparkSQL cache table with multiple replicas

2015-07-03 Thread David Sabater Dinter
Hi all, Do you know if there is an option to specify how many replicas we want when caching a table in memory in the SparkSQL Thrift server? I have not seen any option so far, but I assumed there is one, as the Storage section of the UI shows that there is 1 x replica of your Dataframe/Ta
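For the DataFrame API (outside the Thrift server's CACHE TABLE), a replication factor can be requested through the storage level; a hedged sketch, assuming a DataFrame named df:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_ONLY_2 keeps two in-memory replicas of each cached partition.
    df.persist(StorageLevel.MEMORY_ONLY_2)
    df.count()  // force materialization so the replicas appear in the Storage tab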

Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
This is the basic design of Spark: it runs all actions in different stages... Not sure you can achieve what you are looking for. On Jul 3, 2015 12:43 PM, "Marius Danciu" wrote: > Hi all, > > If I have something like: > > rdd.join(...).mapPartitionToPair(...) > > It looks like mapPartitionToPair

Re: Streaming: updating broadcast variables

2015-07-03 Thread Raghavendra Pandey
You cannot update a broadcast variable; the change won't be reflected on the workers. On Jul 3, 2015 12:18 PM, "James Cole" wrote: > Hi all, > > I'm filtering a DStream using a function. I need to be able to change this > function while the application is running (I'm polling a service to see if > a use
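One common workaround, sketched under the assumption that the current predicate can be fetched cheaply on the driver and that stream is the DStream being filtered (fetchCurrentPredicate below is hypothetical): resolve the filter inside transform, which runs on the driver once per batch, so each batch picks up the latest version.

    // fetchCurrentPredicate() stands in for whatever driver-side call returns
    // the latest filter function (e.g. built from the polled service).
    val filtered = stream.transform { rdd =>
      val predicate = fetchCurrentPredicate()  // evaluated on the driver each batch
      rdd.filter(predicate)
    }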

Re: Filter on Grouped Data

2015-07-03 Thread Raghavendra Pandey
Why don't you apply the filter first and then group the data and run the aggregations? On Jul 3, 2015 1:29 PM, "Megha Sridhar- Cynepia" wrote: > Hi, > > > I have a Spark DataFrame object, which when trimmed, looks like, > > > > FromTo SubjectMessage-ID > karen@xyz
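A minimal illustration of that ordering, assuming df is the posted DataFrame and using its column names (the filter value and the aggregate are made up): pushing the filter ahead of the groupBy means only the surviving rows are shuffled.

    import org.apache.spark.sql.functions.count

    val result = df
      .filter(df("Subject") === "SEC Inquiry")       // filter first...
      .groupBy(df("From"))                           // ...then group
      .agg(count(df("Message-ID")).as("messages"))   // ...and aggregate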

Spark-csv into labeled points with null values

2015-07-03 Thread Saif.A.Ellafi
Hello all, I am learning Scala Spark and going through some applications with data I have. Please allow me to ask a couple of questions: spark-csv: The data I have isn't malformed, but there are empty values in some rows, properly comma-separated and not caught by "DROPMALFORMED" mode. These val

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Ted Yu
Alternatively, setting spark.driver.extraClassPath should work. Cheers On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran wrote: > >> On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv < >> daniel.ha...@veracity-group.com> wrote: >> >>> Hi, >>> I'm trying to start the thrift-server and passing it azure's
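A hedged sketch of that setting (the jar paths are placeholders); note that for the driver itself this usually has to be supplied before the JVM starts, e.g. in spark-defaults.conf or via --conf on spark-submit, rather than from inside the application:

    import org.apache.spark.SparkConf

    // Prepend the Azure storage jars to the driver classpath.
    val conf = new SparkConf()
      .set("spark.driver.extraClassPath",
           "/path/to/hadoop-azure.jar:/path/to/azure-storage.jar")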

Float type coercion on SparkR with hiveContext

2015-07-03 Thread Evgeny Sinelnikov
Hello, I've run into trouble with float type coercion in SparkR with hiveContext. > result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") > show(result) DataFrame[offset:float, percentage:float] > head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : canno

ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

2015-07-03 Thread Kostas Kougios
I have this problem with a job. A random executor gets this ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM, almost always at the same point in the processing of the data. I am processing 1 million files with sc.wholeText. At around the 600,000th file, a container receives thi

Re: Kryo fails to serialise output

2015-07-03 Thread Will Briggs
Kryo serialization is used internally by Spark for spilling or shuffling intermediate results, not for writing out an RDD as an action. Look at Sandy Ryza's examples for some hints on how to do this: https://github.com/sryza/simplesparkavroapp Regards, Will On July 3, 2015, at 2:45 AM, Dominik
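For reference, this is roughly how Kryo is wired in for Spark's internal serialization (shuffle, spilling, cached blocks), as opposed to the on-disk format written by an output action; MyRecord is a stand-in class:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, payload: String)  // stand-in for the user's type

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration affects only Spark-internal serialization; it does not
      // change what saveAsTextFile / saveAsObjectFile write out.
      .registerKryoClasses(Array(classOf[MyRecord]))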

Re: Spark Streaming broadcast to all keys

2015-07-03 Thread Silvio Fiorito
updateStateByKey will run for all keys, whether they have new data in a batch or not, so you should still be able to use it. On 7/3/15, 7:34 AM, "micvog" wrote: >UpdateStateByKey is useful but what if I want to perform an operation to all >existing keys (not only the ones in this RDD). > >Word
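A minimal word-count-flavoured sketch of that behaviour, assuming a DStream[String] named words: the update function runs each batch for every tracked key, even when newValues is empty, so a global decrement is possible (the decay rule here just mirrors the original question; updateStateByKey also requires checkpointing to be enabled).

    // Invoked once per batch for every key ever seen, not just keys in the batch.
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) => {
      val total = state.getOrElse(0) + newValues.sum
      val decayed = total - 1                   // decrease every known word by 1
      if (decayed > 0) Some(decayed) else None  // drop the key once it hits zero
    }

    val counts = words.map(w => (w, 1)).updateStateByKey(updateFunc)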

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code, or at least understand what transformations you’re using. Also, you have 15 nodes but are not using all of them, which means you may be losing data locality. You can see this in the Spark job UI if any jobs do not have node- or process-local tasks. From: diplomatic Guru D

Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I'm getting different execution times, which is expe

Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Danny
Hi, I want to run a multiclass classification with 390 classes on 120k labeled points (tf-idf vectors), but I get the following exception. If I reduce the number of classes to ~20, everything works fine. How can I fix this? I use the LogisticRegressionWithLBFGS class for my classification on a 8 Nod

Spark Streaming broadcast to all keys

2015-07-03 Thread micvog
UpdateStateByKey is useful, but what if I want to perform an operation on all existing keys (not only the ones in this RDD)? Word count, for example - is there a way to decrease *all* words seen so far by 1? I was thinking of keeping a static class per node with the count information and issuing a

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I have a hunch I want to share: I feel that data is not being deallocated in memory (at least not like in 1.3). Once it goes in memory, it just stays there. Spark SQL works fine; the same query, when run in a new shell, won't throw that error, but when run in a shell which has been used for other queries

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
I have shown two scenarios below: // setup spark context val user = readLine("username: ") val pass = System.console.readPassword("password: ") <- null pointer exception here and // setup spark context val user = readLine("username: ") val console = System.console <- null pointer exception
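One defensive sketch for the driver side: check System.console for null and fall back to a plain (unmasked) readLine instead of hitting the NullPointerException.

    // setup spark context ...
    val user = readLine("username: ")

    // System.console() is null when stdin is not attached to a terminal, which
    // is common under spark-submit, so fall back to an unmasked read.
    val pass: Array[Char] = Option(System.console()) match {
      case Some(console) => console.readPassword("password: ")
      case None          => readLine("password: ").toCharArray
    }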

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I will second this. I very rarely used to get out-of-memory errors in 1.3. Now I get these errors all the time. I feel that I could work on 1.3 spark-shell for long periods of time without spark throwing that error, whereas in 1.4 the shell needs to be restarted or gets killed frequently. -- Vie

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing. Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker wrote: > In the driver when running spark-submit with --master yarn-client > > On Fri, Jul 3, 2015 at 10:23 AM Akhil Das > wrote: > >> Where does it returns null? Within the driver or in

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Steve Loughran
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv wrote: Hi, I'm trying to start the thrift-server and passing it Azure's blob storage jars, but I'm failing on: Caused by: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.Fil

Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
Hi, I need to join with multiple conditions. Can anyone tell me how to specify that? For example, this is what I am trying to do: val Lead_all = Leads. | join(Utm_Master, Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") == Utm_Master.columns("LeadSource","Utm_Source","Utm_
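For the DataFrame API, multiple join conditions are normally expressed as one Column built with === and &&, rather than a .columns(...) call; a hedged sketch reusing the names from the question (assuming Leads is the left DataFrame):

    val Lead_all = Leads.join(Utm_Master,
      Leads("LeadSource")   === Utm_Master("LeadSource")  &&
      Leads("Utm_Source")   === Utm_Master("Utm_Source")  &&
      Leads("Utm_Medium")   === Utm_Master("Utm_Medium")  &&
      Leads("Utm_Campaign") === Utm_Master("Utm_Campaign"),
      "inner")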

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
In the driver when running spark-submit with --master yarn-client On Fri, Jul 3, 2015 at 10:23 AM Akhil Das wrote: > Where does it returns null? Within the driver or in the executor? I just > tried System.console.readPassword in spark-shell and it worked. > > Thanks > Best Regards > > On Fri, Ju

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it return null? In the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker wrote: > Hi, > > We have an application that requires a username/password to be entered > from

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com <1106944...@qq.com> wrote: > Hi all, >Anyone build spark 1.4 source code for sparkR with maven/sbt, what's > comand ? using

Accessing the console from spark

2015-07-03 Thread Jem Tucker
Hi, We have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console.readPassword; however, when running with Spark, System.console returns null. Any ideas on how to get the console from Spark? Thanks, Jem

build spark 1.4 source code for sparkR with maven

2015-07-03 Thread 1106944...@qq.com
Hi all, Has anyone built the Spark 1.4 source code for SparkR with Maven/SBT? What's the command? To use SparkR you must build version 1.4 from source. Thank you 1106944...@qq.com

Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Danny Linden
Hi, I want to run a multiclass classification with 390 classes on 120k labeled points (tf-idf vectors), but I get the following exception. If I reduce the number of classes to ~20, everything works fine. How can I fix this? I use the LogisticRegressionWithLBFGS class for my classification on a 8 N

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread ayan guha
HiveContext should be a superset of SQLContext, so you should be able to perform all your tasks. Are you facing any problems with HiveContext? On 3 Jul 2015 17:33, "Daniel Haviv" wrote: > Thanks > I was looking for a less hack-ish way :) > > Daniel > > On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das >

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
The main reason is Spark's startup time and the need to configure a component I don't really need (without configs the HiveContext takes more time to load). Thanks, Daniel > On 3 July 2015, at 11:13, Robin East wrote: > > As Akhil mentioned there isn’t AFAIK any kind of initialisation to sto

Filter on Grouped Data

2015-07-03 Thread Megha Sridhar- Cynepia
Hi, I have a Spark DataFrame object, which, when trimmed, looks like: From  To  Subject  Message-ID  karen@xyz.com ['vance.me...@enron.com', SEC Inquiry <19952575.1075858> 'jeannie.mandel...@enron.com', 'mary.cl...

[spark1.4] sparkContext.stop causes exception on Mesos

2015-07-03 Thread Ayoub
Hello Spark developers, After upgrading to Spark 1.4 on Mesos 0.22.1, existing code started to throw this exception when calling sparkContext.stop: (SparkListenerBus) [ERROR - org.apache.spark.Logging$class.logError(Logging.scala:96)] Listener EventLoggingListener threw an exception java.lang.r

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
Thanks I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das wrote: > With binary i think it might not be possible, although if you can download > the sources and then build it then you can remove this function >

Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a JIRA; not sure if this PR (SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers wrote: > i am surprised this is all

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With the binary I think it might not be possible, although if you download the sources and build them, then you can remove this function which initializes the SQLContext. Tha

Optimizations

2015-07-03 Thread Marius Danciu
Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to the mapPartitionToPair fu