Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
I am invoking it from the Java application by creating the SparkContext. On Tue, Jun 23, 2015 at 12:17 PM, Tathagata Das wrote: > How are you adding that to the classpath? Through spark-submit or > otherwise? > > On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri > wrote: > >> Yes I have the pro

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Tathagata Das
How are you adding that to the classpath? Through spark-submit or otherwise? On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri wrote: > Yes I have the producer in the class path. And I am using in standalone > mode. > > Sent from my iPhone > > On 23-Jun-2015, at 3:31 am, Tathagata Das wrote: >

Re: Spark standalone cluster - resource management

2015-06-22 Thread nizang
to give a bit more data on what I'm trying to get - I have many tasks I want to run in parallel, so I want each task to take a small part of the cluster (-> only a limited part of my 20 cores in the cluster). I have important tasks that I want to get 10 cores, and I have small tasks that I want

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Tathagata Das
Try adding the provided scope to org.apache.spark:spark-core_2.10:1.4.0 and org.apache.spark:spark-streaming_2.10:1.4.0 (both marked *provided*). This prevents these artifacts from being included in the assemb

Spark standalone cluster - resource management

2015-06-22 Thread nizang
hi, I'm running spark standalone cluster with 5 slaves, each has 4 cores. When I run job with the following configuration: /root/spark/bin/spark-submit -v --total-executor-cores 20 --executor-memory 22g --executor-cores 4 --class com.windward.spark.apps.MyApp --name dev-app --properties-fil

MLLIB - Storing the Trained Model

2015-06-22 Thread samsudhin
Hi All, I was trying to store a trained model to the local hard disk. I am able to save it using the save() function, but while trying to retrieve the stored model using the load() function I end up with the following error. Kindly help me on this. scala> val sameModel = RandomForestModel.load(sc,"/home/

Re: Velox Model Server

2015-06-22 Thread Nick Pentreath
How large are your models? Spark job server does allow synchronous job execution and with a "warm" long-lived context it will be quite fast - but still in the order of a second or a few seconds usually (depending on model size - for very large models possibly quite a lot more than that).

RE: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Cheng, Hao
Yes, it should be with HiveContext, not SQLContext. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday, June 23, 2015 2:51 AM To: smazumder Cc: user Subject: Re: Support for Windowing and Analytics functions in Spark SQL 1.4 supports it On 23 Jun 2015 02:59, "Sourav Mazumder" mailto:s

RE: Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Cheng, Hao
It’s actually not that tricky. SPARK_WORKER_CORES is the max task thread pool size of the executor, which is the same as saying “one executor with 32 cores and the executor could execute 32 tasks simultaneously”. Spark doesn’t care about how many real physical CPUs/cores you have (the OS does), so use

Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-22 Thread dgoldenberg
Is there any way to retrieve the time of each message's arrival into a Kafka topic, when streaming in Spark, whether with receiver-based or direct streaming? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-way-to-retrieve-time-of-message-arrival

Re: Confusion matrix for binary classification

2015-06-22 Thread CD Athuraliya
Hi Burak, Thanks for the response. I am using Spark version 1.3.0 through Java API. Regards, CD On Tue, Jun 23, 2015 at 5:11 AM, Burak Yavuz wrote: > Hi, > > In Spark 1.4, you may use DataFrame.stat.crosstab to generate the > confusion matrix. This would be very simple if you are using the ML

Re: Serializer not switching

2015-06-22 Thread Josh Rosen
My hunch is that you changed spark.serializer to Kryo but left spark.closureSerializer unmodified, so it's still using Java for closure serialization. Kryo doesn't really work as a closure serializer but there's an open pull request to fix this: https://github.com/apache/spark/pull/6361 On Mon, J
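
For reference, a minimal Scala sketch of the data-serializer side of this, assuming driver/spark-shell code; MyRecord is a placeholder class, and spark.closureSerializer is a separate setting that this does not touch:

    import org.apache.spark.{SparkConf, SparkContext}

    case class MyRecord(id: Long, name: String)   // placeholder class to register with Kryo

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // data serializer only
      .registerKryoClasses(Array(classOf[MyRecord]))                           // keeps Kryo output compact

    val sc = new SparkContext(conf)
    // Closure serialization is still governed by spark.closureSerializer, as Josh notes.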

mutable vs. pure functional implementation - StatCounter

2015-06-22 Thread mzeltser
Using StatCounter as an example, I'd like to understand if "pure" functional implementation would be more or less beneficial for "accumulating" structures used inside RDD.map StatCounter.merge is updating mutable class variables and returning reference to same object. This is clearly a non-functi

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, For this case, you could simply do sc.parallelize([rdd.first()]).saveAsTextFile(“hdfs:///my_file”) using pyspark or sc.parallelize(Array(rdd.first())).saveAsTextFile(“hdfs:///my_file”) using Scala Chris > On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote: > > Hi Chris, > Thanks f

Programming with java on spark

2015-06-22 Thread 付雅丹
Hello, everyone! I'm new to Spark. I have already written programs in Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I want to move my code to Spark using the Java language. The first problem I encountered is how to transform a big txt file in local storage to an RDD, which is compat
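
A minimal Scala sketch of the usual entry points for this, with placeholder paths; a custom InputFormat written against the new Hadoop API would replace TextInputFormat below, and the Java API offers the analogous JavaSparkContext.textFile and newAPIHadoopFile methods:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("input-format-example"))

    // Simplest route for plain text: one String per line.
    val lines = sc.textFile("file:///data/big.txt")

    // Route through a Hadoop InputFormat (swap TextInputFormat for your own class).
    val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/big.txt")
    val asStrings = records.map { case (_, text) => text.toString }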

Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Rui Li
Hi, I was running a WordCount application on Spark, and the machine I used has 4 physical cores. However, in spark-env.sh file, I set SPARK_WORKER_CORES = 32. The web UI says it launched one executor with 32 cores and the executor could execute 32 tasks simultaneously. Does spark create 32 vCores

Re: Storing an action result in HDFS

2015-06-22 Thread ddpisfun
Hi Chris, Thanks for the quick reply and the welcome. I am trying to read a file from hdfs and then write back just the first line to hdfs. I am calling first() on the RDD to get the first line. Sent from my iPhone > On Jun 22, 2015, at 7:42 PM, Chris Gore wrote: > > Hi Ravi, > > Welcome, y

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, Welcome, you probably want RDD.saveAsTextFile(“hdfs:///my_file”) Chris > On Jun 22, 2015, at 5:28 PM, ravi tella wrote: > > > Hello All, > I am new to Spark. I have a very basic question.How do I write the output of > an action on a RDD to HDFS? > > Thanks in advance for the help.

Fwd: Storing an action result in HDFS

2015-06-22 Thread ravi tella
Hello All, I am new to Spark. I have a very basic question. How do I write the output of an action on an RDD to HDFS? Thanks in advance for the help. Cheers, Ravi

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
Yes I have the producer in the class path. And I am using in standalone mode. Sent from my iPhone > On 23-Jun-2015, at 3:31 am, Tathagata Das wrote: > > Do you have Kafka producer in your classpath? If so how are adding that > library? Are you running on YARN, or Mesos or Standalone or local.

Re: Compute Median in Spark Dataframe

2015-06-22 Thread Deenar Toraskar
Many thanks, will look into this. I don't particularly want to reuse the custom Hive UDAF I have; I would prefer writing a new one if that is cleaner. I am just using the JVM. On 5 June 2015 at 00:03, Holden Karau wrote: > My current example doesn't use a Hive UDAF, but you would do something > p

Re: Confusion matrix for binary classification

2015-06-22 Thread Burak Yavuz
Hi, In Spark 1.4, you may use DataFrame.stat.crosstab to generate the confusion matrix. This would be very simple if you are using the ML Pipelines Api, and are working with DataFrames. Best, Burak On Mon, Jun 22, 2015 at 4:21 AM, CD Athuraliya wrote: > Hi, > > I am looking for a way to get co
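
A minimal Scala sketch of the crosstab suggestion on toy data, assuming Spark 1.4; the column names are placeholders for whatever your pipeline produces:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // label = ground truth, prediction = model output
    val predictions = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0))
      .toDF("label", "prediction")

    // Rows are labels, columns are predicted values: a 2x2 confusion matrix here.
    val confusion = predictions.stat.crosstab("label", "prediction")
    confusion.show()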

Re: Why can't I allocate more than 4 executors with 2 machines on YARN?

2015-06-22 Thread Saiph Kappa
OK, I figured out this. The maximum number of containers YARN can create per node is based on the total available RAM and the maximum allocation per container ( yarn.scheduler.maximum-allocation-mb ). The default is 8192; setting to a lower value allowed me to create more containers per node. On

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
You’re right of course, I’m sorry. I was typing before thinking about what you actually asked! On second thought, what is the ultimate outcome you want the sequence of pages for? Do they need to actually all be grouped? Could you instead partition by user id then use a mapPartitions

which mllib algorithm for large multi-class classification?

2015-06-22 Thread Danny
hi, I am unfortunately not very familiar with the whole MLlib stuff, so I would appreciate a little help: which multi-class classification algorithm should I use if I want to classify texts (100-1000 words each) into categories? The number of categories is between 100-500 and the number of training documen

Re: s3 - Can't make directory for path

2015-06-22 Thread Danny
hi, have you tried "s3://ww-sandbox/name_of_path/" instead of "s3://ww-sandbox/name_of_path", or have you tried adding your file extension with a placeholder (*) like: "s3://ww-sandbox/name_of_path/*.gz" or "s3://ww-sandbox/name_of_path/*.csv" depending on your files? If it does not work pls tes

New Spark Meetup group in Munich

2015-06-22 Thread Danny Linden
Hi everyone, I want to announce that we have created a Spark Meetup Group in Munich (Germany). We are currently planning our first event, which will take place in July. There we will show Spark basics to reach a lot of people who are new to this framework. In the following events we will go deeper i

Re: workaround for groupByKey

2015-06-22 Thread Jianguo Li
Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in Learning Spark: *We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Tathagata Das
Do you have the Kafka producer in your classpath? If so, how are you adding that library? Are you running on YARN, Mesos, Standalone, or local? These details will be very useful. On Mon, Jun 22, 2015 at 8:34 AM, Murthy Chelankuri wrote: > I am using spark streaming. what i am trying to do is sending

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
Silvio, suppose my RDD is (K-1, v1,v2,v3,v4). If I want to do simple addition I can use reduceByKey or aggregateByKey. What if my processing needs to check all the items in the value list each time? The above two operations do not get all the values, they just get two pairs (v1, v2), you do some proc

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
You can use aggregateByKey as one option: val input: RDD[(Int, String)] = ... val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b) From: Jianguo Li Date: Monday, June 22, 2015 at 5:12 PM To: "user@spark.apache.org" Subject: wo

Re: Why can't I allocate more than 4 executors with 2 machines on YARN?

2015-06-22 Thread ๏̯͡๏
1) Can you try with yarn-cluster 2) Does your queue have enough capacity On Mon, Jun 22, 2015 at 11:10 AM, Saiph Kappa wrote: > Hi, > > I am running a simple spark streaming application on hadoop 2.7.0/YARN > (master: yarn-client) cluster with 2 different machines (12GB RAM with 8 > CPU cores ea

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
There is reduceByKey that works on (K, V). You need to accumulate partial results and proceed. Does your computation allow that? On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li wrote: > Hi, > > I am processing an RDD of key-value pairs. The key is an user_id, and the > value is an website url the us

Re: Submitting Spark Applications using Spark Submit

2015-06-22 Thread Andrew Or
Did you restart your master / workers? On the master node, run `sbin/stop-all.sh` followed by `sbin/start-all.sh` 2015-06-20 17:59 GMT-07:00 Raghav Shankar : > Hey Andrew, > > I tried the following approach: I modified my Spark build on my local > machine. I downloaded the Spark 1.4.0 src co

Re: Velox Model Server

2015-06-22 Thread Debasish Das
Models that I am looking for are mostly factorization based models (which includes both recommendation and topic modeling use-cases). For recommendation models, I need a combination of Spark SQL and ml model prediction api...I think spark job server is what I am looking for and it has fast http res

spark on yarn failing silently

2015-06-22 Thread roy
Hi, suddenly our spark job on yarn started failing silently without showing any error, following is the trace in verbose mode Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default pr

workaround for groupByKey

2015-06-22 Thread Jianguo Li
Hi, I am processing an RDD of key-value pairs. The key is an user_id, and the value is an website url the user has ever visited. Since I need to know all the urls each user has visited, I am tempted to call the groupByKey on this RDD. However, since there could be millions of users and urls, the
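
A minimal Scala sketch of the aggregateByKey workaround discussed in the replies, on toy (user_id, url) pairs; note that it still materializes the full URL set for each user, so a single very hot key stays expensive:

    import org.apache.spark.rdd.RDD

    val visits: RDD[(String, String)] = sc.parallelize(Seq(
      ("user1", "a.com"), ("user1", "b.com"), ("user2", "a.com")))

    val urlsPerUser = visits.aggregateByKey(Set.empty[String])(
      (urls, url) => urls + url,   // fold one URL into the partition-local set
      (a, b) => a ++ b)            // merge sets from different partitions

    urlsPerUser.collect().foreach(println)   // (user1,Set(a.com, b.com)), (user2,Set(a.com))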

Spark job fails silently

2015-06-22 Thread roy
Hi, Our spark job on yarn suddenly started failing silently without showing any error following is the trace. Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default property: spark.exec

Re: PySpark on YARN "port out of range"

2015-06-22 Thread Andrew Or
Unfortunately there is not a great way to do it without modifying Spark to print more things it reads from the stream. 2015-06-20 23:10 GMT-07:00 John Meehan : > Yes it seems to be consistently "port out of range:1315905645”. Is there > any way to see what the python process is actually outputti

Re: SQL vs. DataFrame API

2015-06-22 Thread Davies Liu
Right now, we cannot figure out which column you referenced in `select` if there are multiple columns with the same name in the joined DataFrame (for example, two `value`). A workaround could be: numbers2 = numbers.select(df.name, df.value.alias('other')) rows = numbers.join(numbers2,

Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James, Maybe it's the DISTINCT causing the issue. I rewrote the query as follows. Maybe this one can finish faster. select sum(cnt) as uses, count(id) as users from ( select count(*) cnt, cast(id as string) as id from usage_events where from_unixtime(cast(timestamp_mill

External Jar file with SparkR

2015-06-22 Thread mtn111
I have been unsuccessful with incorporating an external Jar into a SparkR program. Does anyone know how to do this successfully? JarTest.java = package com.myco; public class JarTest { public static double myStaticMethod() { return 5.515; } } = JarTe

Re: Help optimising Spark SQL query

2015-06-22 Thread Jörn Franke
Generally (not only Spark SQL specific) you should not cast in the WHERE part of a SQL query. It is also not necessary in your case. Getting rid of casts in the whole query will also be beneficial. On Mon, 22 Jun 2015 at 17:29, James Aley wrote: > Hello, > > A colleague of mine ran the follow

Re: SQL vs. DataFrame API

2015-06-22 Thread Ignacio Blasco
Sorry, thought it was Scala/Spark. On 22/6/2015 9:49 PM, "Bob Corsaro" wrote: > That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a > query here and not actually doing an equality operation. > > On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco > wrote: > >> Probably you s

Re: SQL vs. DataFrame API

2015-06-22 Thread Bob Corsaro
That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a query here and not actually doing an equality operation. On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco wrote: > Probably you should use === instead of == and !== instead of != > Can anyone explain why the dataframe API do

Re: SQL vs. DataFrame API

2015-06-22 Thread Ignacio Blasco
Probably you should use === instead of == and !== instead of != Can anyone explain why the dataframe API doesn't work as I expect it to here? It seems like the column identifiers are getting confused. https://gist.github.com/dokipen/4b324a7365ae87b7b0e5
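
For the Scala API, a small sketch of that suggestion on a toy DataFrame (in pyspark the plain == and != operators are already overloaded on Column, which is why the gist behaves differently):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = Seq(("alice", 21), ("bob", 30)).toDF("name", "age")

    people.filter(people("age") === 21).show()   // === builds a Column equality expression
    people.filter(people("age") !== 21).show()   // !== is the negated form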

SQL vs. DataFrame API

2015-06-22 Thread Bob Corsaro
Can anyone explain why the dataframe API doesn't work as I expect it to here? It seems like the column identifiers are getting confused. https://gist.github.com/dokipen/4b324a7365ae87b7b0e5

Re: Help optimising Spark SQL query

2015-06-22 Thread Ntale Lukama
Have you tested this on a smaller set to verify that the query is correct? On Mon, Jun 22, 2015 at 2:59 PM, ayan guha wrote: > You may also want to change count(*) to specific column. > On 23 Jun 2015 01:29, "James Aley" wrote: > >> Hello, >> >> A colleague of mine ran the following Spark SQL que

Re: Help optimising Spark SQL query

2015-06-22 Thread ayan guha
You may also want to change count(*) to specific column. On 23 Jun 2015 01:29, "James Aley" wrote: > Hello, > > A colleague of mine ran the following Spark SQL query: > > select > count(*) as uses, > count (distinct cast(id as string)) as users > from usage_events > where > from_unixtime(ca

Re: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread ayan guha
1.4 supports it On 23 Jun 2015 02:59, "Sourav Mazumder" wrote: > Hi, > > Though the documentation does not explicitly mention support for Windowing > and Analytics function in Spark SQL, looks like it is not supported. > > I tried running a query like Select Lead(, 1) over (Partition > By order

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread anshu shukla
Thanks for the reply!! Yes, either it should write on any machine of the cluster, or can you please help me with how to do this? Previously I was writing using collect(), so some of my tuples are missing while writing. //previous logic that was just creating the file on master -

Re: Web UI vs History Server Bugs

2015-06-22 Thread Steve Loughran
well, I'm afraid you've reached the limits of my knowledge ... hopefully someone else can answer On 22 Jun 2015, at 16:37, Jonathon Cai wrote: No, what I'm seeing is that while the cluster is running, I can't see the app info after the app is completed. That is t

Why can't I allocate more than 4 executors with 2 machines on YARN?

2015-06-22 Thread Saiph Kappa
Hi, I am running a simple spark streaming application on hadoop 2.7.0/YARN (master: yarn-client) cluster with 2 different machines (12GB RAM with 8 CPU cores each). I am launching my application like this: ~/myapp$ ~/my-spark/bin/spark-submit --class App --master yarn-client --driver-memory 4g -

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread Richard Marscher
Is spoutLog just a non-Spark file writer? If you run that in the map call on a cluster, it's going to be writing to the filesystem of the executor it's being run on. I'm not sure if that's what you intended. On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla wrote: > Running perfectly in local system bu

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread pawan kumar
Hi Matthew, you could add the dependencies yourself by using the %dep command in zeppelin ( https://zeppelin.incubator.apache.org/docs/interpreter/spark.html). I have not tried with zeppelin but have used spark-notebook and got Cassandra connector w

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread anshu shukla
Running perfectly in local mode but not writing to file in cluster mode. Any suggestions please... //msgid is a long counter JavaDStream newinputStream = inputStream.map(new Function() { @Override public String call(String v1) throws Exception { String s1 = msgId + "@" + v1; System.

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Nipun Arora
Hi Tathagata, I am attaching a snapshot of my pom.xml. It would help immensely if I could include max and min values in my mapper phase. The question is still open at: http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 I see that there is a

Multiple executors writing file using java filewriter

2015-06-22 Thread anshu shukla
Can we not write some data to a txt file in parallel, with multiple executors running in parallel? -- Thanks & Regards, Anshu Shukla

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread shahid ashraf
hi Folks, I am a newbie to the Spark world; this seems very interesting work as well as discussion. I have the same sort of use case. I usually use MySQL for blocking queries for Record Linkage; as of now the data has grown very much and it's not scaling. I want to store all my data on HDFS and expose it via s

Re: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-06-22 Thread Olivier Girardot
Hi, I can't get this to work using CDH 5.4, Spark 1.4.0 in yarn-cluster mode. @andrew did you manage to get it to work with the latest version? On Tue, 21 Apr 2015 at 00:02, Andrew Lee wrote: > Hi Marcelo, > > Exactly what I need to track, thanks for the JIRA pointer. > > > > Date: Mon, 20 Apr

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion then report back. On 22 June 2015 at 12:37, Matthew Johnson wrote: > Hi James, > > > > What version of Spark are you using? In Spark 1.2.2 I had an iss

Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Sourav Mazumder
Hi, Though the documentation does not explicitly mention support for Windowing and Analytics functions in Spark SQL, it looks like they are not supported. I tried running a query like Select Lead(, 1) over (Partition By order by ) from and I got an error saying that this feature is unsupported. I tried

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Matthew Johnson
Hi Pawan, Looking at the changes for that git pull request, it looks like it just pulls in the dependency (and transitives) for “spark-cassandra-connector”. Since I am having to build Zeppelin myself anyway, would it be ok to just add this myself for the connector for 1.4.0 (as found here http:/

RE: Help optimising Spark SQL query

2015-06-22 Thread Matthew Johnson
Hi James, What version of Spark are you using? In Spark 1.2.2 I had an issue where Spark would report a job as complete but I couldn’t find my results anywhere – I just assumed it was me doing something wrong as I am still quite new to Spark. However, since upgrading to 1.4.0 I have not seen thi

Re: Help optimising Spark SQL query

2015-06-22 Thread Lior Chaga
Hi James, There are a few configurations that you can try: https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options >From my experience, the codegen really boost things up. Just run sqlContext.sql("spark.sql.codegen=true") before you execute your query. But keep
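
A hedged sketch of turning codegen on before the query; setConf is the programmatic form and the SET statement the SQL form (the exact syntax here is an assumption to check against your Spark 1.3/1.4 docs):

    // Programmatic form:
    sqlContext.setConf("spark.sql.codegen", "true")

    // Equivalent SQL form:
    sqlContext.sql("SET spark.sql.codegen=true")

    val result = sqlContext.sql("SELECT count(*) FROM usage_events")   // the query from this thread
    result.show()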

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread pawan kumar
Hi, Zeppelin has a cassandra-spark-connector built into the build. I have not tried it yet may be you could let us know. https://github.com/apache/incubator-zeppelin/pull/79 To build a Zeppelin version with the *Datastax Spark/Cassandra connector

Re: Does HiveContext connect to HiveServer2?

2015-06-22 Thread nitinkak001
Hey, I have exactly this question. Did you get an answer to it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200p23431.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-22 Thread Nitin kak
Any response to this guys? On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak wrote: > Any other suggestions guys? > > On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak wrote: > >> With Sentry, only hive user has the permission for read/write/execute on >> the subdirectories of warehouse. All the users get tr

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Silvio Fiorito
Yes, just put the Cassandra connector on the Spark classpath and set the connector config properties in the interpreter settings. From: Mohammed Guller Date: Monday, June 22, 2015 at 11:56 AM To: Matthew Johnson, shahid ashraf Cc: "user@spark.apache.org" Subject: RE:

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Mohammed Guller
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure, but it should not be difficult. Mohammed From: Matthew Johnson [mailto:matt.john...@algomi.com] Sent: Monday, June 22, 2015 2:15 AM To: Mohammed Guller; shahid ashraf Cc: user@spark.apache.org Subject: RE: Code review

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-22 Thread scar scar
Sorry I was on vacation for a few days. Yes, it is on. This is what I have in the logs: 15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x14dd82e22f70ef1, likely server has closed socket, closing socket connection and attempting reconnect 15/06/22 10:44:00 I

Yarn application ID for Spark job on Yarn

2015-06-22 Thread roy
Hi, Is there a way to get the YARN application ID inside a Spark application, when running a Spark job on YARN? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-application-ID-for-Spark-job-on-Yarn-tp23429.html Sent from the Apache Spark User List ma
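
One option, assuming a reasonably recent 1.x release: the SparkContext exposes the application ID, which on YARN is the YARN application ID.

    // Inside the driver, after the SparkContext is created:
    val appId = sc.applicationId    // e.g. "application_1435000000000_0042" when running on YARN
    println(s"Running as YARN application $appId")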

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a callback.

Re: Web UI vs History Server Bugs

2015-06-22 Thread Jonathon Cai
No, what I'm seeing is that while the cluster is running, I can't see the app info after the app is completed. That is to say, when I click on the application name on master:8080, no info is shown. However, when I examine the same file on the History Server, the application information opens fine.

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
I am using spark streaming. what i am trying to do is sending few messages to some kafka topic. where its failing. java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.jav

Help optimising Spark SQL query

2015-06-22 Thread James Aley
Hello, A colleague of mine ran the following Spark SQL query: select count(*) as uses, count (distinct cast(id as string)) as users from usage_events where from_unixtime(cast(timestamp_millis/1000 as bigint)) between '2015-06-09' and '2015-06-16' The table contains billions of rows, but to

Re: Task Serialization Error on DataFrame.foreachPartition

2015-06-22 Thread Ted Yu
private HTable table; You should declare the table variable within the apply() method. BTW which HBase release are you using? I see you implement caching yourself. You can make use of the following HTable method: public void setWriteBufferSize(long writeBufferSize) throws IOExcep
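
A sketch of the pattern Ted describes: construct the non-serializable HTable inside foreachPartition so it is created on the executor rather than shipped from the driver. Table and column names are placeholders, and the HBase calls follow the older (pre-1.0) client API, so treat them as assumptions to adapt to your release:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    df.foreachPartition { rows =>
      // Created per partition, on the executor; never serialized with the closure.
      val table = new HTable(HBaseConfiguration.create(), "my_table")
      table.setAutoFlush(false)                    // buffer puts client-side
      table.setWriteBufferSize(4 * 1024 * 1024)    // the method Ted points to

      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(row.getString(1)))
        table.put(put)
      }
      table.close()   // flushes any buffered puts
    }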

Calling rdd() on a DataFrame causes stage boundary

2015-06-22 Thread Alex Nastetsky
When I call rdd() on a DataFrame, it ends the current stage and starts a new one that just maps the DataFrame to rdd and nothing else. It doesn't seem to do a shuffle (which is good and expected), but then why is there a separate stage? I also thought that stages only end when there's a s

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
It's a generated set of shell commands to run (a highly optimized numerical computation written in C), which is created from a set of user-provided parameters. The snippet above is: task_outfiles_to_cmds = OrderedDict(run_sieving.leftover_tasks) task_outfiles_to_cmds.update(generate_sieving_t

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Yes. Thanks Best Regards On Mon, Jun 22, 2015 at 8:33 PM, Murthy Chelankuri wrote: > I have more than one jar. can we set sc.addJar multiple times with each > dependent jar ? > > On Mon, Jun 22, 2015 at 8:30 PM, Akhil Das > wrote: > >> Try sc.addJar instead of setJars >> >> Thanks >> Best Rega

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
Where does "task_batches" come from? On 22 Jun 2015 4:48 pm, "Shaanan Cohney" wrote: > Thanks, > > I've updated my code to use updateStateByKey but am still getting these > errors when I resume from a checkpoint. > > One thought of mine was that I used sc.parallelize to generate the RDDs > for th

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Try sc.addJar instead of setJars Thanks Best Regards On Mon, Jun 22, 2015 at 8:24 PM, Murthy Chelankuri wrote: > I have been using the spark from the last 6 months with the version 1.2.0. > > I am trying to migrate to the 1.3.0 but the same problem i have written is > not wokring. > > Its givin

jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
I have been using Spark for the last 6 months with version 1.2.0. I am trying to migrate to 1.3.0 but the same code I have written is not working. It's giving a class not found error when I try to load some dependent jars from the main program. This used to work in 1.2.0 when set all
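
For context, a minimal sketch of the two ways of shipping dependent jars that this thread discusses, with placeholder paths; the replies above suggest addJar, which can be called once per jar:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master:7077")
      .setJars(Seq("/path/to/app.jar", "/path/to/kafka-producer.jar"))   // set when building the conf

    val sc = new SparkContext(conf)
    sc.addJar("/path/to/another-dependency.jar")                         // or add them one at a time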

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Thanks, I've updated my code to use updateStateByKey but am still getting these errors when I resume from a checkpoint. One thought of mine was that I used sc.parallelize to generate the RDDs for the queue, but perhaps on resume, it doesn't recreate the context needed? -- Shaanan Cohney PhD S

Re: Registering custom metrics

2015-06-22 Thread Silvio Fiorito
Sorry, replied to Gerard’s question vs yours. See here: Yes, you have to implement your own custom Metrics Source using the Coda Hale library. See here for some examples: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala https://gith

Re: Registering custom metrics

2015-06-22 Thread Silvio Fiorito
Hi Gerard, Yes, you have to implement your own custom Metrics Source using the Coda Hale library. See here for some examples: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala https://github.com/apache/spark/blob/master/core/src/main/

Re: Registering custom metrics

2015-06-22 Thread dgoldenberg
Hi Gerard, Have there been any responses? Any insights as to what you ended up doing to enable custom metrics? I'm thinking of implementing a custom metrics sink, not sure how doable that is yet... Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re

Re: Custom Metrics Sink

2015-06-22 Thread dgoldenberg
Hi, I was wondering if there've been any responses to this? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Metrics-Sink-tp10068p23425.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread ayan guha
I have a basic question: how does Spark assign partitions to an executor? Does it respect data locality? Does this behaviour depend on the cluster manager, i.e. YARN vs standalone? On 22 Jun 2015 22:45, "Akhil Das" wrote: > Option 1 should be fine, Option 2 would bound a lot on network as the data > increase in t

Re: understanding on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-22 Thread Fang, Mike
Hi Das, Thanks for your reply. Somehow I missed it.. I am using Spark 1.3. The data source is from Kafka. Yeah, not sure why the delay is 0. I'll run against 1.4 and give a screenshot. Thanks, Mike From: Akhil Das Date: Thursday, June 18, 2015 at 6:05 PM To: M

Re: Using Accumulators in Streaming

2015-06-22 Thread Michal Čizmazia
I stumbled upon zipWithUniqueId/zipWithIndex. Is this what you are looking for? https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDDLike.html#zipWithUniqueId() On 22 June 2015 at 06:16, Michal Čizmazia wrote: > If I am not mistaken, one way to see the accumulators is
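
A tiny sketch of zipWithUniqueId, which tags each element with a unique Long without a shuffle (the IDs are unique but not consecutive; zipWithIndex gives consecutive indices at the cost of an extra job):

    val data = sc.parallelize(Seq("a", "b", "c"))
    val withIds = data.zipWithUniqueId()    // RDD[(String, Long)]
    withIds.collect().foreach(println)      // exact IDs depend on how the data is partitioned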

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Yong Feng
Thanks Akhil, I will have a try and then get back to you. Yong On Mon, Jun 22, 2015 at 8:25 AM, Akhil Das wrote: > Like this? > > val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], > classOf[LongWritable], > classOf[Text]) > > > Thanks > Best Regards > > On Mon, Jun 22, 2015 at 5:4

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
Option 1 should be fine, Option 2 would be bound a lot by the network as the data increases over time. Thanks Best Regards On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni wrote: > Hi All , > > What is the Best Way to install and Spark Cluster along side with Hadoop > Cluster , Any recommendation for below

Re: Serializer not switching

2015-06-22 Thread Sean Barzilay
My program is written in Scala. I am creating a jar and submitting it using spark-submit. My code is on a computer in an internal network with no internet so I can't send it. On Mon, Jun 22, 2015, 3:19 PM Akhil Das wrote: > How are you submitting the application? Could you paste the code that y

Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Ashish Soni
Hi All, What is the best way to install a Spark cluster alongside a Hadoop cluster? Any recommendation for the below deployment topology will be a great help. *Also, is it necessary to put the Spark Worker on DataNodes, since when it reads a block from HDFS it will be local to the Server / Worker, or

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng wrote: > Thanks a lot, Akhil > > I saw this mail thread before, but still do not understand how to use > XmlInputFo

Re: Serializer not switching

2015-06-22 Thread Akhil Das
How are you submitting the application? Could you paste the code that you are running? Thanks Best Regards On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay wrote: > I am trying to run a function on every line of a parquet file. The > function is in an object. When I run the program, I get an exce

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Yong Feng
Thanks a lot, Akhil. I saw this mail thread before, but still do not understand how to use the XmlInputFormat of Mahout in Spark Streaming (I am not a Spark Streaming expert yet ;-)). Can you show me some sample code for explanation? Thanks in advance, Yong On Mon, Jun 22, 2015 at 6:44 AM, Akhil Das w

Serializer not switching

2015-06-22 Thread Sean Barzilay
I am trying to run a function on every line of a parquet file. The function is in an object. When I run the program, I get an exception that the object is not serializable. I read around the internet and found that I should use Kryo Serializer. I changed the setting in the spark conf and registered

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
I would suggest you have a look at the updateStateByKey transformation in the Spark Streaming programming guide which should fit your needs better than your update_state function. On 22 Jun 2015 1:03 pm, "Shaanan Cohney" wrote: > Counts is a list (counts = []) in the driver, used to collect the r
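
A minimal Scala sketch of updateStateByKey keeping a running count per key (the thread's code is Python, but the shape is the same; the source and paths below are placeholders). It needs a checkpoint directory, which is also what lets the state survive a driver restart:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/myapp")          // required for stateful transformations

    val words = ssc.socketTextStream("localhost", 9999)  // placeholder source
    val pairs = words.map(word => (word, 1L))

    val runningCounts = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    }

    runningCounts.print()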
