Re: Spark on Mesos / Executor Memory

2015-10-14 Thread Bharath Ravi Kumar
(Reviving this thread since I ran into similar issues...) I'm running two spark jobs (in mesos fine grained mode), each belonging to a different mesos role, say low and high. The low:high mesos weights are 1:10. On expected lines, I see that the low priority job occupies cluster resources to the m

Re: Spark 1.5 Streaming and Kinesis

2015-10-14 Thread Jean-Baptiste Onofré
Thanks for the update Phil. I'm preparing an environment to reproduce it. I'll keep you posted. Thanks again, Regards JB On 10/15/2015 08:36 AM, Phil Kallos wrote: Not a dumb question, but yes I updated all of the library references to 1.5, including (even tried 1.5.1). // Versions.spark set el

Re: Spark 1.5 Streaming and Kinesis

2015-10-14 Thread Phil Kallos
Not a dumb question, but yes I updated all of the library references to 1.5, including (even tried 1.5.1). // Versions.spark set elsewhere to "1.5.0" "org.apache.spark" %% "spark-streaming-kinesis-asl" % Versions.spark % "provided" I am experiencing the issue in my own spark project, but also wh
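For context, a mismatch like this often comes from mixing Spark artifact versions; a minimal build.sbt sketch (assuming sbt, with illustrative version values) that keeps every Spark dependency aligned might look like:

    // Hypothetical build.sbt fragment: keep all Spark artifacts on one version
    val sparkVersion = "1.5.1"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
      // kinesis-asl is not bundled in the standard Spark assembly, so it ships with the application
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % sparkVersion
    )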

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Shixiong Zhu
Thanks for reporting it Terry. I submitted a PR to fix it: https://github.com/apache/spark/pull/9132 Best Regards, Shixiong Zhu 2015-10-15 2:39 GMT+08:00 Reynold Xin : > +dev list > > On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > >> All, >> >> Does anyone meet memory leak issue with spark

RE: Running in cluster mode causes native library linking to fail

2015-10-14 Thread prajod.vettiyattil
Forwarding to the group, in case someone else has the same error. Just found out that I did not reply to the group in my original reply. From: Prajod S Vettiyattil (WT01 - BAS) Sent: 15 October 2015 11:45 To: 'Bernardo Vecchia Stein' Subject: RE: Running in cluster mode causes native library lin

Re: Spark 1.5 Streaming and Kinesis

2015-10-14 Thread Jean-Baptiste Onofré
By correct, I mean: the map declaration looks good to me, so the ClassCastException is weird ;) I'm trying to reproduce the issue in order to investigate. Regards JB On 10/15/2015 08:03 AM, Jean-Baptiste Onofré wrote: Hi Phil, KinesisReceiver is part of extra. Just a dumb question: did you u

Re: Spark 1.5 Streaming and Kinesis

2015-10-14 Thread Jean-Baptiste Onofré
Hi Phil, KinesisReceiver is part of extra. Just a dumb question: did you update all, including the Spark Kinesis extra containing the KinesisReceiver ? I checked on tag v1.5.0, and at line 175 of the KinesisReceiver, we see: blockIdToSeqNumRanges.clear() which is a: private val blockIdToSeq

Spark 1.5 Streaming and Kinesis

2015-10-14 Thread Phil Kallos
Hi, We are trying to migrate from Spark 1.4 to Spark 1.5 for our Kinesis streaming applications, to take advantage of the new Kinesis checkpointing improvements in 1.5. However after upgrading, we are consistently seeing the following error: java.lang.ClassCastException: scala.collection.mutable.H

Re: How to compile Spark with customized Hadoop?

2015-10-14 Thread Dogtail L
Hi, When I publish my version of Hadoop, it is installed in: /HOME_DIRECTORY/.m2/repository/org/apache/hadoop, but when I compile Spark, it will fetch Hadoop libraries from https://repo1.maven.org/maven2/org/apache/hadoop. How can I let Spark fetch Hadoop libraries from my local M2 cache? Great th

Re: Spark Master Dying saying TimeoutException

2015-10-14 Thread Kartik Mathur
Retrying what? I want to know why it died, and what can I do to prevent it? On Wed, Oct 14, 2015 at 5:20 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > I fixed these timeout errors by retrying... > On Oct 15, 2015 3:41 AM, "Kartik Mathur" wrote: > >> Hi, >> >> I have some nig

Sensitivity analysis using Spark MLlib

2015-10-14 Thread Sourav Mazumder
Is there any algorithm implemented in Spark MLlib which supports parameter sensitivity analysis? After the model is created using a training data set, the model should be able to tell, among the various features used, which are the ones most important (from the perspective of their contribution t

dataframes and numPartitions

2015-10-14 Thread Alex Nastetsky
A lot of RDD methods take a numPartitions parameter that lets you specify the number of partitions in the result. For example, groupByKey. The DataFrame counterparts don't have a numPartitions parameter, e.g. groupBy only takes a bunch of Columns as params. I understand that the DataFrame API is
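For DataFrames the shuffle parallelism is instead governed by the spark.sql.shuffle.partitions setting, and the result can be repartitioned explicitly afterwards; a rough sketch (column names and values are illustrative):

    // Sketch: controlling result partitioning without a numPartitions argument
    sqlContext.setConf("spark.sql.shuffle.partitions", "48") // used by DataFrame groupBy/join shuffles
    val counts = df.groupBy("userId").count()                // result has 48 partitions
    val narrowed = counts.repartition(8)                     // or repartition the result explicitly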

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread SLiZn Liu
Thanks, Michael and java8964! Does Hive Context also provide a UDF for combining existing lists into a flattened (not nested) list? (list -> list of lists -[flatten]-> list). On Thu, Oct 15, 2015 at 1:16 AM Michael Armbrust wrote: > Thats correct. It is a Hive UDAF. > > On Wed, Oct 14, 2015 at 6:45

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-14 Thread Lan Jiang
Thank you, Akhil. Actually the problem was solved last week and I did not have time to report back. The error was caused by YARN killing the container because the executors used more off-heap memory than they were assigned. There was nothing in the executor log, but the AM log clearly states this is the
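For reference, the usual knob for this failure mode is the YARN memory overhead; a hedged sketch (values are illustrative):

    // Sketch: leave extra off-heap headroom per executor so YARN does not kill the container
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.yarn.executor.memoryOverhead", "1024") // in MB, on top of the executor heap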

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Raghavendra Pandey
Looks like you are facing an IPv6 issue. Can you try turning the preferIPv4 property on? On Oct 15, 2015 2:10 AM, "Steve Loughran" wrote: > > On 14 Oct 2015, at 20:56, Marco Mistroni wrote: > > > 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to a > loopback/non-r > eachable address: fe80:0

Re: Spark Master Dying saying TimeoutException

2015-10-14 Thread Raghavendra Pandey
I fixed these timeout errors by retrying... On Oct 15, 2015 3:41 AM, "Kartik Mathur" wrote: > Hi, > > I have some nightly jobs which runs every night but dies sometimes because > of unresponsive master , spark master logs says - > > Not seeing much else there , what could possible cause an except

Re: Application not found in Spark historyserver in yarn-client mode

2015-10-14 Thread Ted Yu
Which Spark release are you using ? Thanks On Wed, Oct 14, 2015 at 4:20 PM, Anfernee Xu wrote: > Hi, > > Here's the problem I'm facing, I have a standalone java application which > is periodically submit Spark jobs to my yarn cluster, btw I'm not using > 'spark-submit' or 'org.apache.spark.laun

Application not found in Spark historyserver in yarn-client mode

2015-10-14 Thread Anfernee Xu
Hi, Here's the problem I'm facing: I have a standalone Java application which periodically submits Spark jobs to my YARN cluster; btw, I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These jobs are successful and I can see them on the YARN RM web UI, but when I want to f
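Applications submitted outside spark-submit only show up in the Spark history server if event logging is enabled on the SparkConf used to create the context; a minimal sketch, assuming an HDFS log directory that matches the history server's configuration (names and paths are illustrative):

    // Sketch: enable event logging so finished applications appear in the Spark history server
    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("periodic-yarn-job")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-history")  // must match spark.history.fs.logDirectory
    val sc = new SparkContext(conf)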

Spark Master Dying saying TimeoutException

2015-10-14 Thread Kartik Mathur
Hi, I have some nightly jobs which run every night but die sometimes because of an unresponsive master; the Spark master log says - Not seeing much else there, what could possibly cause an exception like this? *Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out aft

Re: Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
It looks like that reconstruction of SparkContext from checkpoint data is trying to look for the jar file of previous failed runs. It can not find the jar files as our jar files are on local machines and were cleaned up after each failed run. -- View this message in context: http://apac

Re: Spark 1.5.1 ClassNotFoundException in cluster mode.

2015-10-14 Thread Dean Wampler
There is a Datastax Spark connector library jar file that you probably have on your CLASSPATH locally, but not on the cluster. If you know where it is, you could either install it on each node in some location on their CLASSPATHs or when you submit the job, pass the jar file using the "--jars" opti

Re: IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
Specifically, something like this should probably do the trick: def checkHost(host: String, message: String = "") { assert(!HostAndPort.fromString(host).hasPort, message) } def checkHostPort(hostPort: String, message: String = "") { assert(HostAndPort.fromString(hostPort).hasPort, m

IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
It looks like Spark 1.5.1 does not work with IPv6. When adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the driver fails with: 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext. java.lang.AssertionError: assertion failed: Expected hostname at scala.Predef$.a

Re: PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread Michael Armbrust
I forgot to add. You might also try running: SET spark.sql.hive.metastorePartitionPruning=true On Wed, Oct 14, 2015 at 2:23 PM, Michael Armbrust wrote: > No link to the original stack overflow so I can up my reputation? :) > > This is likely not a difference between HiveContext/SQLContext, but

Re: PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread Michael Armbrust
No link to the original stack overflow so I can up my reputation? :) This is likely not a difference between HiveContext/SQLContext, but instead a difference between a table where the metadata is coming from the HiveMetastore vs the SparkSQL Data Source API. I would guess that if you create the t

Spark 1.5.1 ClassNotFoundException in cluster mode.

2015-10-14 Thread Renato Perini
Hello. I have developed a Spark job using a jersey client (1.9 included with Spark) to make some service calls during data computations. Data is read and written on an Apache Cassandra 2.2.1 database. When I run the job in local mode, everything works nicely. But when I execute my job in cluste

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
Here you go: https://github.com/databricks/spark-avro/issues/92 Thanks. On Wed, Oct 14, 2015 at 4:41 PM, Josh Rosen wrote: > Can you report this as an issue at > https://github.com/databricks/spark-avro/issues so that it's easier to > track? Thanks! > > On Wed, Oct 14, 2015 at 1:38 PM, Alex Nas

Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
Hi, I am trying to set spark streaming checkpoint to s3, here is what I did basically val checkpoint = "s3://myBucket/checkpoint" val ssc = StreamingContext.getOrCreate(checkpointDir, () => getStreamingContext(sparkJobName,
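For reference, the usual shape of this pattern sets the checkpoint directory inside the factory function passed to getOrCreate; a sketch with illustrative names:

    // Sketch of the checkpoint/getOrCreate pattern with an S3 checkpoint directory
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "s3://myBucket/checkpoint"
    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("myStreamingJob"), Seconds(30))
      ssc.checkpoint(checkpointDir)   // only used on the first run; later runs recover from it
      ssc
    }
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)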

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Josh Rosen
Can you report this as an issue at https://github.com/databricks/spark-avro/issues so that it's easier to track? Thanks! On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I save my dataframe to avro with spark-avro 1.0.0 and it looks like this > (using avr

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Steve Loughran
On 14 Oct 2015, at 20:56, Marco Mistroni mailto:mmistr...@gmail.com>> wrote: 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to a loopback/non-r eachable address: fe80:0:0:0:c5ed:a66d:9d95:5caa%wlan2, but we couldn't find any external IP address! java.lang.RuntimeException: java.l

PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread charles.drotar
I have duplicated my submission to stack overflow below since it is exactly the same question I would like to post here as well. Please don't judge me too harshly for my laziness *The questions I

spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this (using avro-tools tojson): {"field1":"value1","field2":976200} {"field1":"value2","field2":976200} {"field1":"value3","field2":614100} But when I use spark-avro 2.0.1, it looks like this: {"field1":{"string":"value1"},"fiel

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Tathagata Das
When a job gets aborted, it means that the internal tasks were retried a number of times before the system gave up. You can control the number of retries (see Spark's configuration page). The job by default does not get resubmitted. You could try getting the logs of the failed executor, to see what c
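The retry limit referred to here is spark.task.maxFailures; a one-line sketch of raising it (the value is illustrative):

    // Sketch: allow more task attempts (default 4) before a stage, and hence the job, is aborted
    val conf = new org.apache.spark.SparkConf().set("spark.task.maxFailures", "8")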

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
Thanks! Indeed not a given. I'm not sure we have the time to wait for nodes within a streaming interval. I'll explore some alternatives. If I stumble on something reasonable I'll report back. -kr, Gerard. On Wed, Oct 14, 2015 at 9:57 PM, Cody Koeninger wrote: > What I'm saying is that it's no

spark-shell :javap fails with complaint about JAVA_HOME, but it is set correctly

2015-10-14 Thread Robert Dodier
Hi, I am working with Spark 1.5.1 (official release), with Oracle Java8, on Ubuntu 14.04. echo $JAVA_HOME says "/usr/lib/jvm/java-8-oracle". I'd like to use :javap in spark-shell, but I get an error message: scala> :javap java.lang.Object Failed: Could not load javap tool. Check that JAVA_HOME i

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Bernardo Vecchia Stein
Hi Renato, I am using a single master and a single worker node, both in the same machine, to simplify everything. I have tested with System.loadLibrary() as well (setting all the necessary paths) and get the same error. Just double checked everything and the parameters are fine. Bernardo On 14 O

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Cody Koeninger
What I'm saying is that it's not a given with spark, even in receiver-based mode, because as soon as you lose an executor you'll have a rebalance. Spark's model in general isn't a good fit for pinning work to specific nodes. If you really want to try and fake this, you can override getPreferredLo
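As a rough illustration of that approach, a custom RDD wrapper can expose preferred locations; a sketch with hypothetical names, and note this is only a scheduling hint, not a guarantee:

    // Sketch: wrap a parent RDD and suggest a host for each partition
    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    class PinnedRDD[T: ClassTag](parent: RDD[T], hostFor: Int => String) extends RDD[T](parent) {
      override protected def getPartitions: Array[Partition] = parent.partitions
      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        parent.iterator(split, context)
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        Seq(hostFor(split.index))   // a hint only; the scheduler may still place the task elsewhere
    }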

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Hi Bernardo, So is this in distributed mode? or single node? Maybe fix the issue with a single node first ;) You are right that Spark finds the library but not the *.so file. I also use System.load() with LD_LIBRARY_PATH set, and I am able to execute without issues. Maybe you'd like to double chec

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Marco Mistroni
Thanks Steve, I followed the instructions; Spark is started and I can see the web UI. However, after launching spark-shell I am getting another exception. Is this preventing me from actually using Spark? Kind regards, Marco 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to a loopback/non-r each

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Sorry Bernardo, I just double checked. I use: System.loadLibrary(); Could you also try that? Renato M. 2015-10-14 21:51 GMT+02:00 Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com>: > Hi Bernardo, > > So is this in distributed mode? or single node? Maybe fix the issue with a > single
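For clarity, the two calls being compared differ only in how the library is located; a small sketch with illustrative names:

    // Sketch: System.load takes an absolute path; System.loadLibrary resolves a bare name
    object NativeLoadExample {
      def main(args: Array[String]): Unit = {
        // use one or the other, not both:
        System.load("/opt/native/libmylib.so")   // absolute path; ignores java.library.path / LD_LIBRARY_PATH
        System.loadLibrary("mylib")              // bare name; resolved via java.library.path / LD_LIBRARY_PATH
      }
    }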

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
Hi Cody, I think that I misused the term 'data locality'. I think I should better call it "node affinity" instead, as this is what I would like to have: For as long as an executor is available, I would like to have the same kafka partition processed by the same node in order to take advantage of

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Bernardo Vecchia Stein
Hi Renato, I have done that as well, but so far no luck. I believe spark is finding the library correctly, otherwise the error message would be "no libraryname found" or something like that. The problem seems to be something else, and I'm not sure how to find it. Thanks, Bernardo On 14 October 2

Re: TTL for saveAsObjectFile()

2015-10-14 Thread Calvin Jia
Hi Antonio, I don't think Spark provides a way to pass down params with saveAsObjectFile. One way could be to pass a default TTL in the configuration, but the approach doesn't make much sense since TTL is not necessarily uniform. Baidu will be talking about their use of TTL in Tachyon with Spark

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
You can also try setting the env variable LD_LIBRARY_PATH to point where your compiled libraries are. Renato M. 2015-10-14 21:07 GMT+02:00 Bernardo Vecchia Stein : > Hi Deenar, > > Yes, the native library is installed on all machines of the cluster. I > tried a simpler approach by just using S

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Bernardo Vecchia Stein
Hi Deenar, Yes, the native library is installed on all machines of the cluster. I tried a simpler approach by just using System.load() and passing the exact path of the library, and things still won't work (I get exactly the same error and message). Any ideas of what might be failing? Thank you,

If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-14 Thread Reynold Xin
Can you reply to this email and provide us with reasons why you disable it? Thanks.

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a") as there is no such operation exposed on dataframes. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354
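Put together, a sketch of that pattern (table and column names are illustrative) looks roughly like:

    // Sketch: pre-cluster by the join/group key through HiveContext SQL, then cache the result
    df.registerTempTable("events")
    val clustered = sqlContext.sql("SELECT * FROM events CLUSTER BY user_id")
    clustered.cache()
    clustered.registerTempTable("events_clustered")   // later joins/group-bys reuse the cached layout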

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Reynold Xin
+dev list On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > All, > > Does anyone meet memory leak issue with spark streaming and spark sql in > spark 1.5.1? I can see the memory is increasing all the time when running > this simple sample: > > val sc = new SparkContext(conf) >

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
Thanks Michael for your input. By 1) do you mean: - Caching the partitioned_rdd - Caching the partitioned_df - *Or* just caching unpartitioned_df without the need of creating the partitioned_rdd variable? Can you expand a little bit more on 2)? Thanks! On Wed, Oct 14, 2015 at 12:11

Strange spark problems among different versions

2015-10-14 Thread xia zhao
Hi. I am trying to run the SparkPi example on the cluster; some strange errors happen and I do not know what causes them. When I am using hadoop-2.6 and spark-1.5.1-bin-hadoop2.6, the error log is below: 118 10/01/01 11:59:14 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.refl

Re: stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread Nicholas Chammas
The Spark 1.4 release notes say that Python 3 is supported. The 1.4 docs are incorrect, and the 1.5 programming guide has been updated to indicate Python 3 support. On Wed, Oct 14, 2015 at 7:06 AM shoira.mukhsin...@bnpparibasfortis.com <

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Spark Newbie
Is it slowing things down or blocking progress. >> I didn't see slowing of processing, but I do see jobs aborted consecutively for a period of 18 batches (5 minute batch intervals). So I am worried about what happened to the records that these jobs were processing. Also, one more thing to mention i

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Spark Newbie
I ran 2 different spark 1.5 clusters that have been running for more than a day now. I do see jobs getting aborted due to task retry's maxing out (default 4) due to ConnectionException. It seems like the executors die and get restarted and I was unable to find the root cause (same app code and conf

Re: Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Marcelo Vanzin
On Wed, Oct 14, 2015 at 10:29 AM, Florian Kaspar wrote: > so it is possible to simply copy the YARN configuration from the remote > cluster to the local machine (assuming, the local machine can resolve the > YARN host etc.) and just letting Spark do the rest? > Yes, that should be all. -- Marc
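A minimal sketch of that setup, assuming the remote cluster's Hadoop/YARN config files have been copied to a local directory (paths and names are illustrative):

    // Sketch: point the JVM at the copied configs before it starts (e.g. in the launch script):
    //   export HADOOP_CONF_DIR=/opt/remote-cluster-conf
    //   export YARN_CONF_DIR=/opt/remote-cluster-conf
    // then create the context with the yarn-client master
    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf().setAppName("remote-yarn-app").setMaster("yarn-client")
    val sc = new SparkContext(conf)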

Re: Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Florian Kaspar
Thank you, Marcelo, so it is possible to simply copy the YARN configuration from the remote cluster to the local machine (assuming, the local machine can resolve the YARN host etc.) and just letting Spark do the rest? This would actually be great! Our "local" machine is just another virtual ma

Re: Reusing Spark Functions

2015-10-14 Thread Michael Armbrust
Unless it's a broadcast variable, a new copy will be deserialized for every task. On Wed, Oct 14, 2015 at 10:18 AM, Starch, Michael D (398M) < michael.d.sta...@jpl.nasa.gov> wrote: > All, > > Is a Function object in Spark reused on a given executor, or is sent and > deserialized with each new task
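A rough sketch of the broadcast alternative, with hypothetical names for the expensive setup:

    // Sketch: build the expensive object once on the driver and ship one copy per executor
    val table: Map[String, Int] = buildExpensiveLookupTable()        // hypothetical setup function
    val tableBc = sc.broadcast(table)
    val enriched = keysRdd.map(k => tableBc.value.getOrElse(k, 0))   // tasks read the executor-local copy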

Re: SPARK SQL Error

2015-10-14 Thread pnpritchard
I think the stack trace is quite informative. Assuming line 10 of CsvDataSource is "val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1),"header"->"true"))", then the "args(1)" call is throwing an ArrayIndexOutOfBoundsException. The reason for this is because you aren't passi

Reusing Spark Functions

2015-10-14 Thread Starch, Michael D (398M)
All, Is a Function object in Spark reused on a given executor, or is it sent and deserialized with each new task? On my project, we have functions that incur a very large setup cost but could then be called many times. Currently, I am using object deserialization to run this intensive setup, I

Re: thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Michael Armbrust
Yes, call startWithContext from the spark shell: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56 On Wed, Oct 14, 2015 at 7:10 AM, wrote: > Hi, > > Is it possible to load a spark-shell, in which we
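A minimal sketch of that, run inside spark-shell (the source path and table name are illustrative):

    // Sketch: register an in-memory temp table and expose it over JDBC/ODBC via the thrift server
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.read.json("hdfs:///data/events.json")
    df.registerTempTable("events")
    HiveThriftServer2.startWithContext(hiveContext)   // beeline can now query the "events" temp table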

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Michael Armbrust
Thats correct. It is a Hive UDAF. On Wed, Oct 14, 2015 at 6:45 AM, java8964 wrote: > My guess is the same as UDAF of (collect_set) in Hive. > > > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) > > Yong > > -
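For reference, a hedged sketch of using that UDAF through HiveContext SQL (table and column names are illustrative):

    // Sketch: group values into a collection per key via the Hive collect_set UDAF (requires HiveContext)
    df.registerTempTable("pairs")
    val grouped = sqlContext.sql(
      "SELECT key, collect_set(value) AS vals FROM pairs GROUP BY key")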

Re: Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Marcelo Vanzin
On Wed, Oct 14, 2015 at 10:01 AM, Florian Kaspar wrote: > we are working on a project running on Spark. Currently we connect to a > remote Spark-Cluster in Standalone mode to obtain the SparkContext using > > new JavaSparkContext(new > SparkConf().setAppName("").setMaster("spark://:7077")); > C

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
This won't help, for two reasons: 1) It's all still just creating lineage since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD-level partitioning since it's mostly a black box. 1) could be fixed by addin

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Adrian Tanase
You are correct, of course. Gave up on sbt for spark long ago, I never managed to get it working while mvn works great. Sent from my iPhone On 14 Oct 2015, at 16:52, Ted Yu mailto:yuzhih...@gmail.com>> wrote: Adrian: Likely you were using maven. Jakob's report was with sbt. Cheers On Tue, O

Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Florian Kaspar
Hey everyone, we are working on a project running on Spark. Currently we connect to a remote Spark-Cluster in Standalone mode to obtain the SparkContext using new JavaSparkContext(new SparkConf().setAppName("").setMaster("spark://:7077")); Currently, we try to connect to a remote (!) YARN cl

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Jakob Odersky
[Repost to mailing list] Hey, Sorry about the typo, I of course meant hadoop-2.6, not 2.11. I suspect something bad happened with my Ivy cache, since when reverting back to scala 2.10, I got a very strange IllegalStateException (something something IvyNode, I can't remember the details). Kilking t

Re: How to calculate percentile of a column of DataFrame?

2015-10-14 Thread Umesh Kacha
Hi Ted, thanks much for your help. So the fix is in JIRA 10671 and it is supposed to be released in Spark 1.6.0, right? Until 1.6.0 is released I won't be able to invoke callUdf using a string and percentile_approx with lit as an argument, right? On Oct 14, 2015 03:26, "Ted Yu" wrote: > I modified DataFrameSuite,
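For reference, the intended call shape (per that JIRA) would look roughly like the sketch below; the column name is illustrative and this is not verified against a released build:

    // Sketch: calling the Hive percentile_approx UDAF with a literal fraction
    import org.apache.spark.sql.functions.{callUDF, col, lit}
    val p95 = df.select(callUDF("percentile_approx", col("latency_ms"), lit(0.95)))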

Get *document*-topic distribution from PySpark LDA model?

2015-10-14 Thread moustachio
Hi! I already have a StackOverflow question on this (see here ), but haven't received any responses, so I thought I'd try here! Long story short, I'm working in PySpark and have successfully gene

RE: Node afinity for Kafka-Direct Stream

2015-10-14 Thread prajod.vettiyattil
Hi, Another point is that in the receiver-based approach, all the data from Kafka first goes to the worker where the receiver runs: https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md Also, if you create one stream (which is the normal case) and you have many worker instances,

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-14 Thread Jonathan Kelly
Ah, yes, it will use private IPs, so you may need to update your FoxyProxy settings to include the private IPs in the regex as well as the public IPs. Also, yes, for completed applications you may use the Spark History Server on port 18080. The YARN ProxyServer will automatically redirect to the S

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
My current version of Spark is 1.3.0 and my question is the following: I have large data frames where the main field is a user id. I need to do many group-bys and joins using that field. Will the performance increase if, before doing any group by or join operation, I first convert to an RDD to partitio

Dynamic partitioning pruning

2015-10-14 Thread Younes Naguib
Hi, This feature was added in Hive 1.3. https://issues.apache.org/jira/browse/HIVE-9152 Any idea when this would be in Spark? Or is it already? Any work around in spark 1.5.1? Thanks, Younes

Re: Why is my spark executor is terminated?

2015-10-14 Thread Jean-Baptiste Onofré
Hi Ningjun I just wanted to check that the master didn't "kick out" the worker, as the "Disassociated" can come from the master. Here it looks like the worker killed the executor before shutting down itself. What's the Spark version ? Regards JB On 10/14/2015 04:42 PM, Wang, Ningjun (LNG-

RE: Why is my spark executor is terminated?

2015-10-14 Thread Wang, Ningjun (LNG-NPV)
I checked master log before and did not find anything wrong. Unfortunately I have lost the master log now. So you think master log will tell you why executor is down? Regards, Ningjun Wang -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Tuesday, October

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Cody Koeninger
Assumptions about locality in spark are not very reliable, regardless of what consumer you use. Even if you have locality preferences, and locality wait turned up really high, you still have to account for losing executors. On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas wrote: > Thanks Saisai, Mi

NullPointerException when adding to accumulator

2015-10-14 Thread Sela, Amit
I'm running a simple streaming application that reads from Kafka, maps the events and prints them and I'm trying to use accumulators to count the number of mapped records. While this works in standalone(IDE), when submitting to YARN I get NullPointerException on accumulator.add(1) or accumulato

thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Saif.A.Ellafi
Hi, Is it possible to load a spark-shell, in which we do any number of operations in a dataframe, then register it as a temporary table and get to see it through thriftserver? ps. or even better, submit a full job and store the dataframe in thriftserver in-memory before the job completes. I ha

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Ted Yu
Adrian: Likely you were using maven. Jakob's report was with sbt. Cheers On Tue, Oct 13, 2015 at 10:05 PM, Adrian Tanase wrote: > Do you mean hadoop-2.4 or 2.6? not sure if this is the issue but I'm also > compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it works. > > -adrian > >

RE: Spark DataFrame GroupBy into List

2015-10-14 Thread java8964
My guess is the same as UDAF of (collect_set) in Hive. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) Yong From: sliznmail...@gmail.com Date: Wed, 14 Oct 2015 02:45:48 + Subject: Re: Spark DataFrame GroupBy into List To: m

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
Thanks Saisai, Mishra. Indeed, that hint will only work in a case where the Spark executor is co-located with the Kafka broker. I think the answer to my question as stated is that there's no guarantee of where the task will execute, as it will depend on the scheduler and the cluster resources available

Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-14 Thread Ted Yu
This might be related : http://search-hadoop.com/m/q3RTta8AxS1UjMSI&subj=Cannot+get+spark+streaming_2+10+1+5+0+pom+from+the+maven+repository > On Oct 12, 2015, at 11:30 PM, Akhil Das wrote: > > You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to the > dependencies list. >

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
Yes, that is correct. When you import the K,V classes, make sure you import it from the hadoop.io package. import org.apache.hadoop.io.BytesWritable; import org.apache.hadoop.io.NullWritable; Thanks Best Regards On Wed, Oct 14, 2015 at 6:26 PM, Chandra Mohan, Ananda Vel Murugan < ananda.muru...
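A hedged Scala sketch of wiring this together, assuming the custom WholeFileInputFormat from earlier in the thread emits NullWritable keys and BytesWritable values (the directory and parser are illustrative):

    // Sketch: stream whole files from an HDFS directory through a custom input format
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    val files = ssc.fileStream[NullWritable, BytesWritable, WholeFileInputFormat]("hdfs:///incoming")
    val parsed = files.map { case (_, bytes) => parseProprietaryFile(bytes.getBytes) }   // hypothetical parser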

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
This preferred locality is a hint to spark to schedule Kafka tasks on the preferred nodes, if Kafka and Spark are two separate cluster, obviously this locality hint takes no effect, and spark will schedule tasks following node-local -> rack-local -> any pattern, like any other spark tasks. On Wed,

RE: spark streaming filestream API

2015-10-14 Thread Chandra Mohan, Ananda Vel Murugan
Hi, Thanks for your response. My input format is the one I have created to handle the files as a whole i.e. WholeFileInputFormat I wrote one based on this example https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3 In thi

Re: writing to hive

2015-10-14 Thread Ted Yu
Can you show your query ? Thanks > On Oct 13, 2015, at 12:29 AM, Hafiz Mujadid wrote: > > hi! > > I am following this > > > tutorial to read and write from hive. But i am facing following exception > when i run

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
You could check the code of KafkaRDD: the locality (host) is obtained from Kafka's partition and set in KafkaRDD; this will be a hint for Spark to schedule the task on the preferred location. override def getPreferredLocations(thePart: Partition): Seq[String] = { val part = thePart.asInstanceOf[KafkaRDDPart

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Rishitesh Mishra
Hi Gerard, I am also trying to understand the same issue. From whatever code I have seen, it looks like once the Kafka RDD is constructed, the execution of that RDD is up to the task scheduler, and it can schedule the partitions based on the load on the nodes. There is a preferred node specified in the Kafka RDD. But ASF

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
Key and Value are the ones that you are using with your InputFormat. Eg: JavaReceiverInputDStream lines = jssc.fileStream("/sigmoid", LongWritable.class, Text.class, TextInputFormat.class); TextInputFormat uses the LongWritable as Key and Text as Value classes. If your data is plain CSV or text

Re: Changing application log level in standalone cluster

2015-10-14 Thread Akhil Das
You should be able to do that from your application. In the beginning of the application, just add:

    import org.apache.log4j.Logger
    import org.apache.log4j.Level

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

That will switch off the logs. Thanks Best Rega

Re: how to use SharedSparkContext

2015-10-14 Thread Fengdong Yu
oh, Yes. Thanks much. > On Oct 14, 2015, at 18:47, Akhil Das wrote: > > com.holdenkarau.spark.testing

spark streaming filestream API

2015-10-14 Thread Chandra Mohan, Ananda Vel Murugan
Hi All, I have a directory in HDFS which I want to monitor, and whenever there is a new file in it, I want to parse that file and load the contents into a Hive table. The file format is proprietary and I have Java parsers for parsing it. I am building a Spark Streaming application for this workflow. Fo

Re: java.io.InvalidClassException using spark1.4.1 for Terasort

2015-10-14 Thread Sonal Goyal
This is probably a versioning issue, are you sure your code is compiling and running against the same versions? On Oct 14, 2015 2:19 PM, "Shreeharsha G Neelakantachar" < shreeharsh...@in.ibm.com> wrote: > Hi, > I have Terasort being executed on spark1.4.1 with hadoop 2.7 for a > datasize of

Re: spark sql OOM

2015-10-14 Thread Andy Zhao
I increased executor memory from 6g to 10g, but it still failed and reported the same error. Because of my company's security policy, I cannot write the SQL out. But I'm sure that this error occurred in the compute method of HadoopRDD, and this error happened in one of the executors. -- View this me

Fwd: Partition Column in JDBCRDD or Datasource API

2015-10-14 Thread satish chandra j
Hi All, Please give me some input on the Partition Column to be used in the DataSource API or JDBCRDD to define the lowerBound and upperBound values which would be used to define the number of partitions; the issue is that my source table does not have a numeric column which is sequential and unique such that proper par
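One possible alternative when no sequential numeric column exists is the predicates overload of the JDBC reader; a hedged sketch with illustrative connection values:

    // Sketch (Spark 1.4+): one partition per predicate, no numeric lowerBound/upperBound needed
    import java.util.Properties
    val props = new Properties()
    props.setProperty("user", "dbuser")   // illustrative connection property
    val predicates = Array("region = 'NA'", "region = 'EU'", "region = 'APAC'")
    val df = sqlContext.read.jdbc("jdbc:postgresql://dbhost/mydb", "orders", predicates, props)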

stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread shoira.mukhsin...@bnpparibasfortis.com
Dear Spark Community, The official documentation of Spark 1.4.1 mentions that Spark runs on Python 2.6+ http://spark.apache.org/docs/1.4.1/ It is not clear if by "Python 2.6+" do you also mean Python 3.4 or not. There is a resolved issue on this point which makes me believe that it does run on

Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread SLiZn Liu
Yes it went wrong when processing a large file only. I removed transformations on the DF, and it worked just fine. But doing a simple filter operation on the DF became the last straw that broke the camel's back. That's confusing. On Wed, Oct 14, 2015 at 2:11 PM Deenar Toraskar wrote: > Hi > > Why

Re: how to use SharedSparkContext

2015-10-14 Thread Akhil Das
Did a quick search and found the following, I haven't tested it myself. Add the following to your build.sbt libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.5.0_1.4.0_1.4.1_0.1.2" Create a class extending com.holdenkarau.spark.testing.SharedSparkContext And you should

Re: Cannot connect to standalone spark cluster

2015-10-14 Thread Akhil Das
Open a spark-shell by: MASTER=Ellens-MacBook-Pro.local:7077 bin/spark-shell And if its able to connect, then check your java projects build file and make sure you are having the proper spark version. Thanks Best Regards On Sat, Oct 10, 2015 at 3:07 AM, ekraffmiller wrote: > Hi, > I'm trying t

Re: spark sql OOM

2015-10-14 Thread cherrywayb...@gmail.com
Hi, please increase your memory. cherrywayb...@gmail.com From: Andy Zhao Date: 2015-10-14 17:40 To: user Subject: spark sql OOM Hi guys, I'm testing SparkSQL 1.5.1, and I use hadoop-2.5.0-cdh5.3.2. One SQL query which ran successfully using Hive failed when I ran it using SparkSQL. I got the f

Re: spark sql OOM

2015-10-14 Thread Fengdong Yu
Can you search the mail archive before asking the question? At least search for how to ask the question. Nobody can give you an answer if you don't paste your SQL or SparkSQL code. > On Oct 14, 2015, at 17:40, Andy Zhao wrote: > > Hi guys, > > I'm testing sparkSql 1.5.1, and I use hadoop-2.5.0-cd

Re: HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Xiaoyu Wang
I create a jira and pull request for this issue. https://issues.apache.org/jira/browse/SPARK-11100 在 2015年10月13日 16:36, Xiaoyu Wang 写道: I have the same issue. I think spark thrift server is not suport HA with zookeeper now. 在 2015年09月01日 18:10, sreeramvenkat 写道: Hi, I am trying to setup dyn
