Streaming: updating broadcast variables

2015-07-02 Thread James Cole
Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to see if a user has changed their filtering). The filter function is a transformation and runs on the workers, so that's where the updates need to go
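A minimal sketch of one common pattern for this, assuming the DStream is called events and fetchFilter() stands in for the service poll (both names are made up here): keep the current filter in a driver-side variable and read it inside transform, whose body is re-evaluated on the driver for every batch, so the closure shipped to the workers always carries the latest value.

    import org.apache.spark.streaming.dstream.DStream

    @volatile var currentFilter: Set[String] = fetchFilter()   // fetchFilter() is hypothetical: polls the user's service

    val filtered: DStream[String] = events.transform { rdd =>
      val f = currentFilter                      // read once per batch, on the driver
      rdd.filter(record => f.exists(record.contains))
    }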

Kryo fails to serialise output

2015-07-02 Thread Dominik Hübner
I have a rather simple avro schema to serialize Tweets (message, username, timestamp). Kryo and twitter chill are used to do so. For my dev environment the Spark context is configured as below: val conf: SparkConf = new SparkConf() conf.setAppName("kryo_test") conf.setMaster("local[4]") conf.set(
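For reference, a sketch of this sort of Kryo setup; Tweet stands in for the Avro-generated class and the registration call is the standard SparkConf API, not the exact configuration from the thread.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo_test")
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the classes Kryo will serialize; a chill-avro or custom
      // registrator can alternatively be plugged in via spark.kryo.registrator.
      .registerKryoClasses(Array(classOf[Tweet]))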

Re: Solving Systems of Linear Equations Using Spark?

2015-07-02 Thread jamaica
I wrote a very simple 1-D Laplace equation Spark solver. For details, see here. If there is anyone interested in this, please let me know. Anyway, hope this can help someone.

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Suraj Shetiya
Hi Salih, Thanks for the links :) This seems very promising to me. When do you think this would be available in the Spark codeline? Thanks, Suraj On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop wrote: > Hi Suraj, > It seems your requirement is Record Linkage/Entity Resolution. > https://en.wikip

Spark MLlib 1.4.0 - logistic regression with SGD model accuracy is different in local mode and cluster mode

2015-07-02 Thread Nirmal Fernando
Hi All, I'm facing quite a strange case: after migrating to Spark 1.4.0, I'm seeing Spark MLlib produce different results when run in local mode and in cluster mode. Is there any possibility of that happening? (I feel this is an issue in my environment, but just wanted to get it confirmed.) Thanks.

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-07-02 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 9:52 AM, Tobias Pfeiffer wrote: > Hi, > > On Thu, Jan 29, 2015 at 1:54 AM, YaoPau wrote: >> >> My thinking is to maintain state in an RDD and update it and persist it >> with >> each 2-second pass, but this also seems like it could get messy. Any >> thoughts or examp

Spark Thriftserver exec insert sql got error on Hadoop federation

2015-07-02 Thread Xiaoyu Wang
Hi all! My sql case is: insert overwrite table test1 select * from test; At the end of the job I got a move-file error. I see that hive-0.13.1's support for viewfs is not good until hive-1.1.0+. How do I upgrade the Hive version for Spark? Or how do I fix the bug in "org.spark-project.hive"? My version: Spark versio

Re: All masters are unresponsive issue

2015-07-02 Thread luohui20001
Hi there, I checked the source code and found that in org.apache.spark.deploy.client.AppClient there is a parameter (line 52): val REGISTRATION_TIMEOUT = 20.seconds, val REGISTRATION_RETRIES = 3. As far as I know, if I want to increase the number of retries, must I modify this value and rebuild the entire

Aggregating the same column multiple times

2015-07-02 Thread sim
What is the rationale for not allowing the same column in a GroupedData to be aggregated more than once using agg, especially when the method signature def agg(aggExpr: (String, String), aggExprs: (String, String)*) allows passing something like agg("x" -> "sum", "x" -> "avg")?
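The usual workaround is the Column-based agg overload, which has no such restriction; a sketch, with a made-up DataFrame df and grouping key "k":

    import org.apache.spark.sql.functions.{avg, sum}

    // Aggregate the same column twice in one pass via Column expressions.
    val result = df.groupBy("k").agg(sum("x"), avg("x"))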

import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row Cheers & enjoy the long weekend

duplicate names in sql allowed?

2015-07-02 Thread Koert Kuipers
i am surprised this is allowed... scala> sqlContext.sql("select name as boo, score as boo from candidates").schema res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true)) should StructType check for duplicate field names?

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Simeon Simeonov
Same error with the new code: import org.apache.spark.sql.hive.HiveContext val ctx = sqlContext.asInstanceOf[HiveContext] import ctx.implicits._ val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-0.gz") df.registerTempTable("training") val dfCount =

Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
Thanks Sandy et al, I will try that. I like that I can choose the minRegisteredResourcesRatio. On Wed, Jun 24, 2015 at 11:04 AM, Sandy Ryza wrote: > Hi Arun, > > You can achieve this by > setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really > high number and spark.scheduler.m
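A sketch of the two settings mentioned in the reply, applied through SparkConf (the values are placeholders, and the exact time format accepted depends on the Spark version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Wait up to an hour for executors to register before scheduling tasks...
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "3600s")
      // ...or start earlier once 100% of the requested resources have registered.
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")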

Re: sliding

2015-07-02 Thread tog
Understood. Thanks for your great help Cheers Guillaume On 2 July 2015 at 23:23, Feynman Liang wrote: > Consider an example dataset [a, b, c, d, e, f] > > After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)] > > After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d,

Re: sliding

2015-07-02 Thread Feynman Liang
Consider an example dataset [a, b, c, d, e, f] After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)] After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d, e, f), 3)] After filter: [((a,b,c), 0), ((d, e, f), 3)], which is what I'm assuming you want (non-overlapping bu
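Putting that recipe together as a sketch, assuming events is an RDD[Double] (the sliding implicit comes from MLlib's RDDFunctions):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val buckets = events.sliding(3)                       // overlapping windows of 3
      .zipWithIndex
      .filter { case (_, i) => i % 3 == 0 }               // keep every third window -> non-overlapping
      .map { case (w, _) => w.sum / w.length }            // e.g. average each bucket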

configuring max sum of cores and memory in cluster through command line

2015-07-02 Thread Alexander Waldin
Hi, I'd like to specify the total sum of cores / memory as command line arguments with spark-submit. That is, I'd like to set yarn.nodemanager.resource.memory-mb and the yarn.nodemanager.resource.cpu-vcores parameters as described in this blog

Re: sliding

2015-07-02 Thread tog
Well it did reduce the length of my series of events. I will have to dig into what it did actually ;-) I would assume that it took one out of 3 values, is that correct? Would it be possible to control a bit more how the value assigned to the bucket is computed, for example take the first element, the min

Re: where is the source code for org.apache.spark.launcher.Main?

2015-07-02 Thread Shiyao Ma
After clicking through the github spark repo, it is clearly here: https://github.com/apache/spark/tree/master/launcher/src/main/java/org/apache/spark/launcher My intellij project sidebar was fully expanded and I was lost in another folder. Problem solved.

where is the source code for org.apache.spark.launcher.Main?

2015-07-02 Thread Shiyao Ma
Hi, It seems to me spark launches a process to read spark-defaults.conf and then launches another process to do the app stuff. The code here should confirm it: https://github.com/apache/spark/blob/master/bin/spark-class#L76 "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"

Re: sliding

2015-07-02 Thread Feynman Liang
How about: events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0) That would group the RDD into adjacent buckets of size 3. On Thu, Jul 2, 2015 at 2:33 PM, tog wrote: > Was complaining about the Seq ... > > Moved it to > val eventsfiltered = events.sliding(3).map(s => Event(s(0).time, > (s(0).x

Re: sliding

2015-07-02 Thread tog
Was complaining about the Seq ... Moved it to val eventsfiltered = events.sliding(3).map(s => Event(s(0).time, (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0)) and that is working. Anyway this is not what I wanted to do, my goal was more to implement bucketing to shorten the

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim, Seems you already set the PermGen size to 256m, right? I notice that in your shell you created a HiveContext (it further increases the memory consumption on PermGen). But spark shell has already created a HiveContext for you (sqlContext). You can use asInstanceOf to access HiveContext

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Salih Oztop
Hi Suraj, It seems your requirement is Record Linkage/Entity Resolution. https://en.wikipedia.org/wiki/Record_linkage http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf A presentation from Spark Summit using GraphX: https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-reso

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
Ya, I think it's a limitation too. I looked at the source code (SparkConf.scala and ExecutorRunnable.scala): both -Xms and -Xmx are set to the same value, which is spark.executor.memory. Thanks On Thu, Jul 2, 2015 at 1:18 PM, Todd Nist wrote: > Yes, that does appear to be the case. The documentation is ver

Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-07-02 Thread Davies Liu
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl wrote: > In pyspark, when I convert from rdds to dataframes it looks like the rdd is > being materialized/collected/repartitioned before it's converted to a > dataframe. It's not true. When converting an RDD to a dataframe, it only takes a few rows to inf

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
Yes, that does appear to be the case. The documentation is very clear about the heap settings and that they can not be used with spark.executor.extraJavaOptions: spark.executor.extraJavaOptions (none) A string of extra JVM options to pass to executors. For instance, GC settings or other logging. *No

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
thanks but my use case requires I specify different start and max heap sizes. Looks like spark sets the start and max sizes to the same value. On Thu, Jul 2, 2015 at 1:08 PM, Todd Nist wrote: > You should use: > > spark.executor.memory > > from the docs

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
You should use: spark.executor.memory from the docs: spark.executor.memory (512m) Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). -Todd On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta Mammo
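A minimal sketch of the setting in question (the 2g value is arbitrary); as noted in the thread, Spark applies this value as both -Xms and -Xmx for the executor JVM, so separate start and max sizes cannot be configured this way:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.executor.memory", "2g")
    // equivalently on the command line: spark-submit --executor-memory 2g ...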

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim, Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you add --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" to the command you used to launch the Spark shell? This will increase the PermGen size f

Re: Grouping runs of elements in a RDD

2015-07-02 Thread RJ Nowling
Thanks, Mohit. It sounds like we're on the same page -- I used a similar approach. On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi wrote: > if you are joining successive lines together based on a predicate, then > you are doing a "flatMap" not an "aggregate". you are on the right track > with a mu

1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread sim
A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1 and fails with a series of out-of-memory errors in 1.4.0. This gist includes the code and the full output from the 1.3.1 and 1.4.0 runs, including the command line

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
tried that one and it throws an error - extraJavaOptions is not allowed to alter memory settings, use spark.executor.memory instead. On Thu, Jul 2, 2015 at 12:21 PM, Benjamin Fradet wrote: > Hi, > > You can set those parameters through the > > spark.executor.extraJavaOptions > > Which is documented

Re: .NET on Apache Spark?

2015-07-02 Thread pedro
You might try using .pipe() and installing your .NET program as a binary across the cluster (or using addFile). It's not ideal to pipe things in/out along with the overhead, but it would work. I don't know much about IronPython, but perhaps changing the default python by changing your path might wo
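A sketch of the pipe() idea (the executable name and input path are made up); the program must exist and be executable on every worker, e.g. shipped with SparkContext.addFile:

    // Each partition's elements are written to the process's stdin, one per line,
    // and its stdout lines become the elements of the resulting RDD.
    val output = sc.textFile("hdfs:///input/data.txt").pipe("./mydotnetapp")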

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Benjamin Fradet
Hi, You can set those parameters through spark.executor.extraJavaOptions, which is documented in the configuration guide: spark.apache.org/docs/latest/configuration.html On 2 Jul 2015 9:06 pm, "Mulugeta Mammo" wrote: > Hi, > > I'm running Spark 1.4.0, I want to specify the start and max siz

Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
Hi, I'm running Spark 1.4.0 and I want to specify the start and max sizes (-Xms and -Xmx) of the JVM heap for my executors. I tried: executor.cores.memory="-Xms1g -Xms8g" but it doesn't work. How do I specify them? Appreciate your help. Thanks,

Re: Check for null in PySpark DataFrame

2015-07-02 Thread Pedro Rodriguez
Thanks for the tip. Any idea why the intuitive answer doesn't work (!= None)? I inspected the Row columns and they do indeed have a None value. I would suspect that somehow Python's None is translated to something in the jvm which doesn't equal null? I might check out the source code for a better
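For reference, the Column API offers an explicit null test that sidesteps the != None comparison; shown here as a Scala sketch with a made-up DataFrame df and column name (PySpark's Column exposes the same isNotNull method):

    // Keep only rows where "someCol" is not null.
    val nonNull = df.filter(df("someCol").isNotNull)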

Re: Grouping runs of elements in a RDD

2015-07-02 Thread Mohit Jaggi
if you are joining successive lines together based on a predicate, then you are doing a "flatMap" not an "aggregate". you are on the right track with a multi-pass solution. i had the same challenge when i needed a sliding window over an RDD(see below). [ i had suggested that the sliding window API

Re: sliding

2015-07-02 Thread Feynman Liang
What's the error you are getting? On Thu, Jul 2, 2015 at 9:37 AM, tog wrote: > Hi > > Sorry for this scala/spark newbie question. I am creating RDD which > represent large time series this way: > val data = sc.textFile("somefile.csv") > > case class Event( > time: Double, > x:

Re: KMeans questions

2015-07-02 Thread Feynman Liang
SPARK-7879 seems to address your use case (running KMeans on a dataframe and having the results added as an additional column) On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman wrote: > In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans

Re: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
What I'm doing in the RDD is parsing a text file and sending things to the external system.. I guess that it does that immediately when the action (count) is triggered instead of being a two step process. So I guess I should have parsing logic + sending to external system inside the foreach (with

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-02 Thread Raghavendra Pandey
You can collect the dataframe as an array and then create a map out of it... On Jul 2, 2015 9:23 AM, wrote: > Any example how can i return a Hashmap from data frame ? > > Thanks , > Ashish > > On Jul 1, 2015, at 11:34 PM, Holden Karau wrote: > > Collecting it as a regular (Java/scala/Python) map. You

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
Heh, an action, or materialization, means that it will trigger the computation over the RDD. A transformation like map means that it will create the transformation chain that must be applied on the data, but it is actually not executed. It is executed only when an action is triggered over that RDD

Re: Spark SQL and Streaming - How to execute JDBC Query only once

2015-07-02 Thread Raghavendra Pandey
This will not work, i.e. using a data frame inside a map function. Although you can try to create the df separately and cache it... Then you can join your event stream with this df. On Jul 2, 2015 6:11 PM, "Ashish Soni" wrote: > Hi All , > > I have and Stream of Event coming in and i want to fetch some ad
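A sketch of that suggestion (the URL, table and column names are illustrative only): load the lookup table once through the JDBC data source, cache it, and join each batch against it rather than querying the database per record.

    // Load the user table once and cache it across batches.
    val users = sqlContext.read.format("jdbc")
      .options(Map("url" -> "jdbc:oracle:thin:@dbhost:1521/SVC", "dbtable" -> "users"))
      .load()
      .cache()

    // Enrich an incoming batch (already converted to a DataFrame, here eventsDF) by login name.
    val enriched = eventsDF.join(users, "loginName")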

Re: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
Foreach is listed as an action[1]. I guess an *action* just means that it forces materialization of the RDD. I just noticed much faster executions with map although I don't like the map approach. I'll look at it with new eyes if foreach is the way to go. [1] – https://spark.apache.org/docs/latest

Re: map vs foreach for sending data to external system

2015-07-02 Thread Silvio Fiorito
foreach absolutely runs on the executors. For sending data to an external system you should likely use foreachPartition in order to batch the output. Also if you want to limit the parallelism of the output action then you can use coalesce. What makes you think foreach is running on the driver?
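A sketch of the foreachPartition pattern suggested above; createClient and sendBatch are stand-ins for whatever client the external system needs:

    myRdd.foreachPartition { records =>
      val client = createClient()          // one connection per partition, opened on the executor
      try {
        records.grouped(100).foreach(batch => sendBatch(client, batch))   // batch the writes
      } finally {
        client.close()
      }
    }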

sliding

2015-07-02 Thread tog
Hi Sorry for this scala/spark newbie question. I am creating RDD which represent large time series this way: val data = sc.textFile("somefile.csv") case class Event( time: Double, x: Double, vztot: Double ) val events = data.filter(s => !s.startsWith("GMT")).map{s
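A hedged reconstruction of the kind of parsing the message describes (the exact column layout is an assumption; the file name and case class come from the message):

    case class Event(time: Double, x: Double, vztot: Double)

    val data = sc.textFile("somefile.csv")
    val events = data
      .filter(s => !s.startsWith("GMT"))        // drop the header line
      .map { s =>
        val f = s.split(",")
        Event(f(0).toDouble, f(1).toDouble, f(2).toDouble)
      }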

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
Hi, I'm using 1.4. It's indeed a typo in the email itself. Thanks, Daniel On Thu, Jul 2, 2015 at 6:06 PM, Ted Yu wrote: > Which Spark release are you using ? > > bq. yarn--jars > > I guess the above was just a typo in your email (missing space). > > Cheers > > On Thu, Jul 2, 2015 at 7:38 AM, Da

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
*"The thing is that foreach forces materialization of the RDD and it seems to be executed on the driver program"* What makes you think that? No, foreach is run in the executors (distributed) and not in the driver. 2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues < alex.jose.rodrig...@gmail.com>: >

Fwd: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
Hi Spark devs, I'm coding a spark job and at a certain point in execution I need to send some data present in an RDD to an external system. val myRdd = myRdd.foreach { record => sendToWhtv(record) } The thing is that foreach forces materialization of the RDD and it seems to be executed o

Re: wholeTextFiles("/x/*/*.txt") runs single threaded

2015-07-02 Thread Kostas Kougios
In SparkUI I can see it creating 2 stages. I tried wholeTextFiles().repartition(32) but same threading results.

wholeTextFiles("/x/*/*.txt") runs single threaded

2015-07-02 Thread Kostas Kougios
Hi, I have a cluster of 4 machines and I call sc.wholeTextFiles("/x/*/*.txt"). Folder x contains subfolders and each subfolder contains thousands of files, with a total of ~1 million matching the path expression. My spark task starts processing the files but single threaded. I can see that in the sparkUI,
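One knob worth noting (a sketch; 64 is an arbitrary value): wholeTextFiles accepts a minPartitions hint, which can yield more tasks than the default, though the effective parallelism still depends on how the input format computes splits for ~1 million small files.

    val docs = sc.wholeTextFiles("/x/*/*.txt", minPartitions = 64)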

binaryFiles() for 1 million files, too much memory required

2015-07-02 Thread Kostas Kougios
Once again I am trying to read a directory tree using binaryFiles. My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e. ROOTDIR/1 ROOTDIR/2 ROOTDIR/.. ROOTDIR/100. A total of 1 million files split into 100 subdirs. Using binaryFiles requires too much memory on the

Re: making dataframe for different types using spark-csv

2015-07-02 Thread Hafiz Mujadid
Thanks On Thu, Jul 2, 2015 at 5:40 PM, Kohler, Curt E (ELS-STL) < c.koh...@elsevier.com> wrote: > You should be able to do something like this (assuming an input file > formatted as: String, IntVal, LongVal) > > > import org.apache.spark.sql.types._ > > val recSchema = StructType(List(StructF

Re: Spark driver hangs on start of job

2015-07-02 Thread Richard Marscher
Ah I see, glad that simple patch works for your problem. That seems to be a different underlying problem than we have been experiencing. In our case, the executors are failing properly, it's just that none of the new ones will ever escape experiencing the same exact issue. So we start a death spiral

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Ted Yu
Which Spark release are you using ? bq. yarn--jars I guess the above was just a typo in your email (missing space). Cheers On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm trying to start the thrift-server and passing it azure's blob storage >

Dataframe in single partition after sorting?

2015-07-02 Thread Cesar Flores
I am sorting a data frame using something like: val sortedDF = df.orderBy(df("score").desc) The sorting is really fast. The issue I have is that after sorting, the resulting data frame sortedDF appears to be in a single partition, which is a problem because when I try to execute another operation

thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.creat

Array fields in dataframe.write.jdbc

2015-07-02 Thread Anand Nalya
Hi, I'm using spark 1.4. I have an array field in my data frame and when I try to write this dataframe to postgres, I'm getting the following exception: Exception in thread "main" java.lang.IllegalArgumentException: Can't translate null value for field StructField(filter,ArrayType(StringType,fa

override/update options in Dataframe/JdbcRdd

2015-07-02 Thread manohar
Hi, what are the options in the DataFrame/JdbcRdd save/saveAsTable API? Is there any option to override/update a particular column in the table, based on some ID column, instead of overriding the whole table? SaveMode Append is there but it won't help us update the record; it will append/add a new row to the

Re: making dataframe for different types using spark-csv

2015-07-02 Thread Kohler, Curt E (ELS-STL)
You should be able to do something like this (assuming an input file formatted as: String, IntVal, LongVal) import org.apache.spark.sql.types._ val recSchema = StructType(List(StructField("strVal", StringType, false), StructField("intVal", IntegerType,
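A sketch of how such an explicit schema is typically plugged into the spark-csv data source; the file path, the third (long) field, and the format/option calls are assumptions, not taken from the thread:

    import org.apache.spark.sql.types._

    val recSchema = StructType(List(
      StructField("strVal", StringType, false),
      StructField("intVal", IntegerType, false),
      StructField("longVal", LongType, false)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .schema(recSchema)        // skip type inference, use the declared types
      .load("data.csv")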

Starting Spark without automatically starting HiveContext

2015-07-02 Thread Daniel Haviv
Hi, I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start the spark-shell it always starts with a HiveContext. How can I disable the HiveContext from being initialized automatically? Thanks, Daniel

Spark SQL and Streaming - How to execute JDBC Query only once

2015-07-02 Thread Ashish Soni
Hi All, I have a stream of events coming in and I want to fetch some additional data from the database based on the values in the incoming data. For example, below is the data coming in: loginName, email address, city. Now for each login name I need to go to the Oracle database and get the userId from the

Re: Meets class not found error in spark console with newly hive context

2015-07-02 Thread shenyan zhen
In case it helps: I got around it temporarily by saving and resetting the context class loader around creating the HiveContext. On Jul 2, 2015 4:36 AM, "Terry Hole" wrote: > Found this a bug in spark 1.4.0: SPARK-8368 > > > Thanks! > Terry > > On Thu,

Re: BroadcastHashJoin when RDD is not cached

2015-07-02 Thread Srikanth
Good to know this will be in next release. Thanks. On Wed, Jul 1, 2015 at 3:13 PM, Michael Armbrust wrote: > We don't know that the table is small unless you cache it. In Spark 1.5 > you'll be able to give us a hint though ( > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/
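For reference, the hint referred to (available from Spark 1.5) looks roughly like this sketch; largeDF and smallDF are placeholder names:

    import org.apache.spark.sql.functions.broadcast

    // Mark the small side explicitly so the planner picks a broadcast join
    // without requiring the table to be cached first.
    val joined = largeDF.join(broadcast(smallDF), "id")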

RE: Making Unpersist Lazy

2015-07-02 Thread Ganelin, Ilya
You may pass an optional parameter (blocking = false) to make it lazy. Thank you, Ilya Ganelin -Original Message- From: Jem Tucker [jem.tuc...@gmail.com] Sent: Thursday, July 02, 2015 04:06 AM Eastern Standard Time To: Akhil Das Cc: user Subject: Re: Makin
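The call in question, as a one-line sketch:

    rdd.unpersist(blocking = false)   // returns immediately instead of waiting for blocks to be dropped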

RE: .NET on Apache Spark?

2015-07-02 Thread Silvio Fiorito
Since Spark runs on the JVM, no there isn't support for .Net. You should take a look at Dryad and Naiad instead. https://github.com/MicrosoftResearch/ From: Zwits Sent: ‎7/‎2/‎2015 4:33 AM To: user@spark.apache.org

Re: .NET on Apache Spark?

2015-07-02 Thread Daniel Darabos
Indeed Spark does not have .NET bindings. On Thu, Jul 2, 2015 at 10:33 AM, Zwits wrote: > I'm currently looking into a way to run a program/code (DAG) written in > .NET > on a cluster using Spark. However I ran into problems concerning the coding > language, Spark has no .NET API. > I tried look

Re: getting WARN ReliableDeliverySupervisor

2015-07-02 Thread xiaohe lan
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue. I saw https://issues.apache.org/jira/browse/SPARK-6388, but it does not appear to be the problem here. On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan wrote: > Hi Expert, > > Hadoop version: 2.4 > Spark version: 1.3.1 > > I am running the SparkPi example appl

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Suraj Shetiya
Hi Michael, Thanks for the quick response. This sounds like something that would work. However, rethinking the problem statement and the various other use cases, which are growing, there are more such scenarios where one could have columns with structured and unstructured data embedded (json or xml or

EventLoggingListener threw an exception when sparkContext.stop

2015-07-02 Thread Ayoub
Hello, I recently upgraded to spark 1.4 on Mesos 0.22.1 and now existing code throws the exception below. The interesting part is that this problem happens only when the spark.eventLog.enabled flag is set to true. So I guess it has something to do with the spark history server. The problem happens

All masters are unresponsive issue

2015-07-02 Thread luohui20001
Hi there: I got a problem: "Application has been killed. Reason: All masters are unresponsive! Giving up." I checked the network I/O and found sometimes it is really high when running my app. Pls refer to the attached pic for more info. I also checked http://databricks.gitbooks.io/databrick

"insert overwrite table phonesall" in spark-sql resulted in java.io.StreamCorruptedException

2015-07-02 Thread John Jay
My spark-sql command: spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf spark.driver.cores=20 --conf spark.cores.max=20 --conf spark.executor.memory=2g --conf spark.driver.memory=2g --conf spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf spark.eventLog.

Re: Spark driver hangs on start of job

2015-07-02 Thread Sjoerd Mulder
Hi Richard, I have actually applied the following fix to our 1.4.0 version and it seems to resolve the zombies :) https://github.com/apache/spark/pull/7077/files Sjoerd 2015-06-26 20:08 GMT+02:00 Richard Marscher : > Hi, > > we are on 1.3.1 right now so in case there are differences in the Sp

Re: Meets class not found error in spark console with newly hive context

2015-07-02 Thread Terry Hole
Found this a bug in spark 1.4.0: SPARK-8368 Thanks! Terry On Thu, Jul 2, 2015 at 1:20 PM, Terry Hole wrote: > All, > > I am using spark console 1.4.0 to do some tests, when a create a newly > HiveContext (Line 18 in the code) in my test functio

.NET on Apache Spark?

2015-07-02 Thread Zwits
I'm currently looking into a way to run a program/code (DAG) written in .NET on a cluster using Spark. However, I ran into problems concerning the coding language: Spark has no .NET API. I tried looking into IronPython because Spark does have a Python API, but I couldn't find a way to use this. Is

Re: Making Unpersist Lazy

2015-07-02 Thread Jem Tucker
Hi, After running some tests it appears the unpersist is called as soon as it is reached, so any tasks using this rdd later on will have to re calculate it. This is fine for simple programs but when an rdd is created within a function and its reference is then lost but children of it continue to b

Re: DataFrame Find/Filter Based on Input - Inside Map function

2015-07-02 Thread ayan guha
You can keep a joined dataset cached and filter that joined df with your filter condition On 2 Jul 2015 15:01, "Mailing List" wrote: > I need to pass the value of the filter dynamically like where id= > and that someVal exist in another RDD. > > How can I do this across JavaRDD and DataFrame ? >

Re: Convert CSV lines to List of Objects

2015-07-02 Thread Akhil Das
Have a look at sc.wholeTextFiles; you can use it to read the whole csv contents into the value, then split it on \n, add the lines to a list and return it. *sc.wholeTextFiles:* Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported

Re: Making Unpersist Lazy

2015-07-02 Thread Akhil Das
rdd's which are no longer required will be removed from memory by spark itself (which you can consider as lazy?). Thanks Best Regards On Wed, Jul 1, 2015 at 7:48 PM, Jem Tucker wrote: > Hi, > > The current behavior of rdd.unpersist() appears to not be lazily executed > and therefore must be pla