Re: RE: Test case for the spark sql catalyst

2015-08-24 Thread Todd
Thanks Chenghao! At 2015-08-25 13:06:40, "Cheng, Hao" wrote: Yes, check the source code under:https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01 PM To:user@spark.

Re: Spark

2015-08-24 Thread Sonal Goyal
Sorry am I missing something? There is a method sortBy on both RDD and PairRDD. def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length

Re: How can I save the RDD result as Orcfile with spark1.3?

2015-08-24 Thread dong.yajun
We plan to upgrade our spark cluster to 1.4, and I just have a test in local mode which reference here: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ but an exception caused when running the example, the stack trace as below: *Exception in thread "main" java.lang.NoSuchFiel

Re: build spark 1.4.1 with JDK 1.6

2015-08-24 Thread Sean Owen
-cdh-user This suggests that Maven is still using Java 6. I think this is indeed controlled by JAVA_HOME. Use 'mvn -X ...' to see a lot more about what is being used and why. I still suspect JAVA_HOME is not visible to the Maven process. Or maybe you have JRE 7 installed but not JDK 7 and it's som

Re: org.apache.spark.shuffle.FetchFailedException

2015-08-24 Thread kundan kumar
I have set spark.sql.shuffle.partitions=1000 then also its failing. On Tue, Aug 25, 2015 at 11:36 AM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Did you try increasing sql partitions? > > On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar > wrote: > >> I am running this query on a

Re: org.apache.spark.shuffle.FetchFailedException

2015-08-24 Thread Raghavendra Pandey
Did you try increasing sql partitions? On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar wrote: > I am running this query on a data size of 4 billion rows and > getting org.apache.spark.shuffle.FetchFailedException error. > > select adid,position,userid,price > from ( > select adid,position,userid,

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this case. Best Regards, Shixiong Zhu 2015-08-25 14:01 GMT+08:00 Cheng, Hao : > O, Sorry, I miss reading your reply! > > > > I know the minimum tasks will be 2 for scanning, but Jeff is talking about > 2 jobs, not 2 tasks. > >

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
O, Sorry, I miss reading your reply! I know the minimum tasks will be 2 for scanning, but Jeff is talking about 2 jobs, not 2 tasks. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 1:29 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re: DataFrame#sh

How to access Spark UI through AWS

2015-08-24 Thread Justin Pihony
I am using the steps from this article to get spark up and running on EMR through yarn. Once up and running I ssh in and cd to the spark bin and run spark-shell --master yarn. Once this spins up I can see that the UI is started

Re: Spark

2015-08-24 Thread Sonal Goyal
I think you could try sorting the endPointsCount and then doing a take. This should be a distributed process and only the result would get returned to the driver. Best Regards, Sonal Founder, Nube Technologies Check out Reifier at Spark Summit 2015
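
A minimal sketch of the suggested approach, assuming endPointsCount is an RDD of (endpoint, count) pairs (the name and element type come from the thread and are not shown here):

    // sort by count across the cluster, then pull only the top N back to the driver
    val top10 = endPointsCount.sortBy(_._2, ascending = false).take(10)
    top10.foreach(println)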

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Hao, I can reproduce it using the master branch. I'm curious why you cannot reproduce it. Did you check if the input HadoopRDD did have two partitions? My test code is val df = sqlContext.read.json("examples/src/main/resources/people.json") df.show() Best Regards, Shixiong Zhu 2015-08-25 13:01

RE: Test case for the spark sql catalyst

2015-08-24 Thread Cheng, Hao
Yes, check the source code under: https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01 PM To: user@spark.apache.org Subject: Test case for the spark sql catalyst Hi, Are there

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
Hi Jeff, which version are you using? I couldn’t reproduce the 2 spark jobs in the `df.show()` with latest code, we did refactor the code for json data source recently, not sure you’re running an earlier version of it. And a known issue is Spark SQL will try to re-list the files every time when

Test case for the spark sql catalyst

2015-08-24 Thread Todd
Hi, Are there test cases for the spark sql catalyst, such as testing the rules of transforming unsolved query plan? Thanks!

Re: Drop table and Hive warehouse

2015-08-24 Thread Kevin Jung
Thanks, Michael. I discovered it myself. Finally, it was not a bug from Spark. I have two HDFS cluster and Hive uses hive.metastore.warehouse.dir + fs.defaultFS(HDFS1) for saving internal tables and also reference a default database URI(HDFS2) in "DBS" table from metastore. It may not be a probl

Spark

2015-08-24 Thread Spark Enthusiast
I was running a Spark Job to crunch a 9GB apache log file When I saw the following error: 15/08/25 04:25:16 WARN scheduler.TaskSetManager: Lost task 99.0 in stage 37.0 (TID 4115, ip-10-150-137-100.ap-southeast-1.compute.internal): ExecutorLostFailure (executor 29 lost)15/08/25 04:25:16 INFO s

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Because defaultMinPartitions is 2 (See https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/core/src/main/scala/org/apache/spark/SparkContext.scala#L2057 ), your input "people.json" will be split to 2 partitions. At first, `take` will start a job for the first partition. H
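
A rough way to observe the behaviour described above, assuming the stock people.json example file (df.rdd.partitions.length is only used here to confirm the partition count):

    val df = sqlContext.read.json("examples/src/main/resources/people.json")
    println(df.rdd.partitions.length)  // 2, because defaultMinPartitions is 2
    // take()/show() first runs a job over the first partition; if that does not
    // yield enough rows, a second job covers the remaining partition(s).
    df.show()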

Re: MLlib Prefixspan implementation

2015-08-24 Thread Feynman Liang
CCing the mailing list again. It's currently not on the radar. Do you have a use case for it? I can bring it up during 1.6 roadmap planning tomorrow. On Mon, Aug 24, 2015 at 8:28 PM, alexis GILLAIN wrote: > Hi, > > I just realized the article I mentioned is cited in the jira and not in > the co

Re: How to list all dataframes and RDDs available in current session?

2015-08-24 Thread Dhaval Gmail
Okay but "how?" thats what I am trying to figure out 😀? Any command you would suggest? "Sent from my iPhone, plaese excuse any typos :)" > On Aug 21, 2015, at 11:45 PM, Raghavendra Pandey > wrote: > > You get the list of all the persistet rdd using spark context... >> On Aug 21, 2015 12:06 A

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Cody Koeninger
I'd start off by trying to simplify that closure - you don't need the transform step, or currOffsetRanges to be scoped outside of it. Just do everything in foreachRDD. LIkewise, it looks like zkClient is also scoped outside of the closure passed to foreachRDD i.e. you have zkClient = new ZkClie
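
A hedged sketch of "do everything in foreachRDD" for a Kafka direct stream; the ZooKeeper connection string and the actual update logic are placeholders:

    import org.apache.spark.streaming.kafka.HasOffsetRanges
    import org.I0Itec.zkclient.ZkClient

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      // create the ZkClient here, on the driver and per batch, instead of
      // capturing one defined outside the closure (which is what gets serialized)
      val zkClient = new ZkClient("zkhost:2181")   // placeholder connection string
      offsetRanges.foreach { or =>
        // write or.topic / or.partition / or.untilOffset to ZK for monitoring
      }
      zkClient.close()
    }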

Re: build spark 1.4.1 with JDK 1.6

2015-08-24 Thread Eric Friedman
I'm trying to build Spark 1.4 with Java 7 and despite having that as my JAVA_HOME, I get [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-launcher_2.10 --- [INFO] Using zinc server for incremental compilation [info] Compiling 8 Java sources to /Users/eric/spark/spark/lau

Re: Protobuf error when streaming from Kafka

2015-08-24 Thread Ted Yu
Can you show the complete stack trace ? Which Spark / Kafka release are you using ? Thanks On Mon, Aug 24, 2015 at 4:58 PM, Cassa L wrote: > Hi, > I am storing messages in Kafka using protobuf and reading them into > Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got > "java.lang.U

What does Attribute and AttributeReference mean in Spark SQL

2015-08-24 Thread Todd
There are many such kind of case class or concept such as Attribute/AttributeReference/Expression in Spark SQL I would ask what Attribute/AttributeReference/Expression mean, given a sql query like select a,b from c, it a, b are two Attributes? a + b is an expression? Looks I misunderstand it b

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Jeff Zhang
Hi Cheng, I know that sqlContext.read will trigger one spark job to infer the schema. What I mean is DataFrame#show cost 2 spark jobs. So overall it would cost 3 jobs. Here's the command I use: >> val df = sqlContext.read.json("file:///Users/hadoop/github/spark/examples/src/main/resources/people

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
I get the same error even when I set the SPARK_CLASSPATH: export SPARK_CLASSPATH=/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.1.jar:/.m2/repository/ch/qos/logback/logback-core/1.1.2/logback-core-1.1.2.jar And I run the job like this: /spark-1.4.1-bin-hadoop2.4/bin/spark-

Protobuf error when streaming from Kafka

2015-08-24 Thread Cassa L
Hi, I am storing messages in Kafka using protobuf and reading them into Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got "java.lang.UnsupportedOperationException" for older messages. However, even for new messages I get the same error. Spark does convert it though. I see my messages.

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
I assumed that's the case because of the error I got and the documentation which says: "Extra classpath entries to append to the classpath of the driver." This is where I stand now: org.apache.spark spark-core_2.10 1.4.1

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
On Mon, Aug 24, 2015 at 3:58 PM, Utkarsh Sengar wrote: > That didn't work since "extraClassPath" flag was still appending the jars at > the end, so its still picking the slf4j jar provided by spark. Out of curiosity, how did you verify this? The "extraClassPath" options are supposed to prepend en

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
That didn't work since "extraClassPath" flag was still appending the jars at the end, so its still picking the slf4j jar provided by spark. Although I found this flag: --conf "spark.executor.userClassPathFirst=true" (http://spark.apache.org/docs/latest/configuration.html) and tried this: ➜ simspa

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread suchenzang
When updating the ZK offset in the driver (within foreachRDD), there is somehow a serialization exception getting thrown: 15/08/24 15:45:40 ERROR JobScheduler: Error in job generator java.io.NotSerializableException: org.I0Itec.zkclient.ZkClient at java.io.ObjectOutputStream.writeObject0(O

Re: DataFrame/JDBC very slow performance

2015-08-24 Thread Michael Armbrust
> > Much appreciated! I am not comparing with "select count(*)" for > performance, but it was one simple thing I tried to check the performance > :). I think it now makes sense since Spark tries to extract all records > before doing the count. I thought having an aggregated function query > submitt

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
Hi Utkarsh, A quick look at slf4j's source shows it loads the first "StaticLoggerBinder" in your classpath. How are you adding the logback jar file to spark-submit? If you use "spark.driver.extraClassPath" and "spark.executor.extraClassPath" to add the jar, it should take precedence over the log4

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
Hi Marcelo, When I add this exclusion rule to my pom: org.apache.spark spark-core_2.10 1.4.1 org.slf4j slf4j-log4j12 The SparkRunner class work

Re: Too many files/dirs in hdfs

2015-08-24 Thread Mohit Anchlia
Any help would be appreciated On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia wrote: > My question was how to do this in Hadoop? Could somebody point me to some > examples? > > On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY > wrote: > >> Of course, Java or Scala can do that: >> 1) Create a Fi

Where is Redgate's HDFS explorer?

2015-08-24 Thread Dino Fancellu
http://hortonworks.com/blog/windows-explorer-experience-hdfs/ Seemed to exist, now no sign. Anything similar to tie HDFS into windows explorer? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html Sent fro

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
Hi Utkarsh, Unfortunately that's not going to be easy. Since Spark bundles all dependent classes into a single fat jar file, to remove that dependency you'd need to modify Spark's assembly jar (potentially in all your nodes). Doing that per-job is even trickier, because you'd probably need some ki

Running spark shell on mesos with zookeeper on spark 1.3.1

2015-08-24 Thread kohlisimranjit
I have setup up apache mesos using mesosphere on Cent OS 6 with Java 8.I have 3 slaves which total to 3 cores and 8 gb ram. I have set no firewalls. I am trying to run the following lines of code to test whether the setup is working: val data = 1 to 1 val distData = sc.parallelize(data) dis

Strange ClassNotFoundException in spark-shell

2015-08-24 Thread Jan Algermissen
Hi, I am using spark 1.4 M1 with the Cassandra Connector and run into a strange error when using the spark shell. This works: sc.cassandraTable("events", "bid_events").select("bid","type").take(10).foreach(println) But as soon as I put a map() in there (or filter): sc.cassandraTable("events

Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
Continuing this discussion: http://apache-spark-user-list.1001560.n3.nabble.com/same-log4j-slf4j-error-in-spark-9-1-td5592.html I am getting this error when I use logback-classic. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:.m2/repository/ch/qos/logback/l

Re: Spark ec2 lunch problem

2015-08-24 Thread Andrew Or
Hey Garry, Have you verified that your particular VPC and subnet are open to the world? In particular, have you verified the route table attached to your VPC / subnet contains an internet gateway open to the public? I've run into this issue myself recently and that was the problem for me. -Andre

`show tables like 'tmp*';` does not work in Spark 1.3.0+

2015-08-24 Thread dugdun
Hi guys and gals, I have a Spark 1.2.0 instance running that I connect to via the thrift interface using beeline. On this instance I can send a command like `show tables like 'tmp*';` and I get a list of all tables that start with `tmp`. When testing this same command out on a server that is runni

ExternalSorter: Thread *** spilling in-memory map of 352.6 MB to disk (38 times so far)

2015-08-24 Thread d...@lumity.com
Hello, I'm trying to run a spark 1.5 job with: ./spark-shell --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 -Xms16g -Xmx48g -Xss128m" I get lots of error messages like : 15/08/24 20:24:33 INFO ExternalSorter: Thread 172 spilling in-memory map of

Re: Array Out OF Bound Exception

2015-08-24 Thread Michael Armbrust
This top line here indicates that the exception is being thrown from your code (i.e. code written in the console). at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:40) Check to make sure that you are properly handling data that has

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Susan Zhang
Thanks Cody (forgot to reply-all earlier, apologies)! One more question for the list: I'm now seeing a java.lang.ClassNotFoundException for kafka.OffsetRange upon relaunching the streaming job after a previous run (via spark-submit) 15/08/24 13:07:11 INFO CheckpointReader: Attempting to load ch

Re: Performance - Python streaming v/s Scala streaming

2015-08-24 Thread Tathagata Das
The Scala version of the Kafka integration is something that we have been working on for a while, and is likely to be more optimized than the Python one. The Python one definitely requires passing the data back and forth between the JVM and the Python VM and decoding the raw bytes to Python strings (probably less ef

Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Dino Fancellu
Changing the ip to the guest IP address just never connects. The VM has port tunnelling, and it passes through all the main ports, 8020 included to the host VM. You can tell that it was talking to the guest VM before, simply because it said when file not found Error is: Exception in thread "mai

Run Spark job from within iPython+Spark?

2015-08-24 Thread YaoPau
I set up iPython Notebook to work with the pyspark shell, and now I'd like use %run to basically 'spark-submit' another Python Spark file, and leave the objects accessible within the Notebook. I tried this, but got a "ValueError: Cannot run multiple SparkContexts at once" error. I then tried taki

Re: spark and scala-2.11

2015-08-24 Thread Lanny Ripple
We're going to be upgrading from spark 1.0.2 and using hadoop-1.2.1 so need to build by hand. (Yes, I know. Use hadoop-2.x but standard resource constraints apply.) I want to build against scala-2.11 and publish to our artifact repository but finding build/spark-2.10.4 and tracing down what build

Re: spark and scala-2.11

2015-08-24 Thread Jonathan Coveney
I've used the instructions and it worked fine. Can you post exactly what you're doing, and what it fails with? Or are you just trying to understand how it works? 2015-08-24 15:48 GMT-04:00 Lanny Ripple : > Hello, > > The instructions for building spark against scala-2.11 indicate using > -Dspark

Re: spark and scala-2.11

2015-08-24 Thread Sean Owen
The property "scala-2.11" triggers the profile "scala-2.11" -- and additionally disables the scala-2.10 profile, so that's the way to do it. But yes, you also need to run the script before-hand to set up the build for Scala 2.11 as well. On Mon, Aug 24, 2015 at 8:48 PM, Lanny Ripple wrote: > Hell

Re: rdd count is throwing null pointer exception

2015-08-24 Thread Akhil Das
Move your count operation outside the foreach and use a broadcast to access it inside the foreach. On Aug 17, 2015 10:34 AM, "Priya Ch" wrote: > Looks like because of Spark-5063 > RDD transformations and actions can only be invoked by the driver, not > inside of other transformations; for example
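
A small sketch of the suggestion, with rdd standing in for whatever RDD the original code was counting:

    val total = rdd.count()              // action runs on the driver, outside any closure
    val totalBc = sc.broadcast(total)    // ship the value to the executors
    rdd.foreach { x =>
      // reference totalBc.value instead of calling rdd.count() inside the closure,
      // which would invoke an RDD action from within another action (SPARK-5063)
      if (totalBc.value > 0) { /* use x */ }
    }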

spark and scala-2.11

2015-08-24 Thread Lanny Ripple
Hello, The instructions for building spark against scala-2.11 indicate using -Dspark-2.11. When I look in the pom.xml I find a profile named 'spark-2.11' but nothing that would indicate I should set a property. The sbt build seems to need the -Dscala-2.11 property set. Finally build/mvn does a

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
Follow the directions here: http://spark.apache.org/community.html On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott wrote: > Can you please remove me from this distribution list? > > > > (Filling up my inbox too fast) > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* Mon

Re: Meaning of local[2]

2015-08-24 Thread Akhil Das
Just to add you can also look into SPARK_WORKER_INSTANCES configuration in the spark-env.sh file. On Aug 17, 2015 3:44 AM, "Daniel Darabos" wrote: > Hi Praveen, > > On Mon, Aug 17, 2015 at 12:34 PM, praveen S wrote: > >> What does this mean in .setMaster("local[2]") >> > Local mode (executor in

Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Roberto Congiu
When you launch your HDP guest VM, most likely it gets launched with NAT and an address on a private network (192.168.x.x) so on your windows host you should use that address (you can find out using ifconfig on the guest OS). I usually add an entry to my /etc/hosts for VMs that I use oftenif yo
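
If the guest's NameNode is reachable from the host (directly or via a hosts-file entry), the local job can address it explicitly; the hostname and port below are placeholders for whatever the VM actually exposes:

    // e.g. after adding "192.168.x.x  sandbox.hortonworks.com" to the host's hosts file
    val words = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/people.txt")
    words.count()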

Array Out OF Bound Exception

2015-08-24 Thread SAHA, DEBOBROTA
Hi , I am using SPARK 1.4 and I am getting an array out of bound Exception when I am trying to read from a registered table in SPARK. For example If I have 3 different text files with the content as below: Scenario 1: A1|B1|C1 A2|B2|C2 Scenario 2: A1| |C1 A2| |C2 Scenario 3: A1| B1| A2| B2|

History server is not receiving any event

2015-08-24 Thread b.bhavesh
Hi, I am working on streaming application. I tried to configure history server to persist the events of application in hadoop file system (hdfs). However, it is not logging any events. I am running Apache Spark 1.4.1 (pyspark) under Ubuntu 14.04 with three nodes. Here is my configuration: File -

Re: Job is Failing automatically

2015-08-24 Thread Akhil Das
You are hitting a NPE. Put a try catch around and see what's going on. On Aug 11, 2015 2:09 PM, "Nikhil Gs" wrote: > Hello Team, > > I am facing an error which I have pasted below. My job is failing when I > am copying my data files into flume spool directory. Most of the time the > job is gettin

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
No, starting with Spark 1.5 we should by default only be reading the footers on the executor side (that is unless schema merging has been explicitly turned on). On Mon, Aug 24, 2015 at 12:20 PM, Jerrick Hoang wrote: > @Michael: would listStatus calls read the actual parquet footers within > the

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Jerrick Hoang
@Michael: would listStatus calls read the actual parquet footers within the folders? On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott wrote: > Can you please remove me from this distribution list? > > > > (Filling up my inbox too fast) > > > > *From:* Michael Armbrust [mailto:mich...@databricks.

Local Spark talking to remote HDFS?

2015-08-24 Thread Dino Fancellu
I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM. If I go into the guest spark-shell and refer to the file thus, it works fine val words=sc.textFile("hdfs:///tmp/people.txt") words.count However if I try to access it from a local Spark app on my Windows host, it doesn't wo

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Sereday, Scott
Can you please remove me from this distribution list? (Filling up my inbox too fast) From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Monday, August 24, 2015 2:13 PM To: Philip Weaver Cc: Jerrick Hoang ; Raghavendra Pandey ; User ; Cheng, Hao Subject: Re: Spark Sql behaves strange

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
The laziness is hard to deal with in these situations. I would suggest trying to handle expected cases ("FileNotFound", etc.) using other methods before even starting a Spark job. If you really want to try/catch a specific portion of a Spark job, one way is to just follow it with an action. You can ev

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
I think we are mostly bottlenecked at this point by how fast we can make listStatus calls to discover the folders. That said, we are happy to accept suggestions or PRs to make this faster. Perhaps you can describe how your home grown partitioning works? On Sun, Aug 23, 2015 at 7:38 PM, Philip We

Re: Kafka Spark Partition Mapping

2015-08-24 Thread Syed, Nehal (Contractor)
Dear Cody, Thanks for your response, I am trying to do decoration which means when a message comes from Kafka (partitioned by key) in to the Spark I want to add more fields/data to it. How Does normally people do it in Spark? If it were you how would you decorate message without hitting database

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Roberto Coluccio
Hi Burak, thanks for your answer. I have a "new MyResultFunction()(sparkContext, inputPath).collect" in the unit test (so to evaluate the actual result), and there I can observe and catch the exception. Even considering Spark's laziness, shouldn't I catch the exception while occurring in the try..

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
textFile is a lazy operation. It doesn't evaluate until you call an action on it, such as .count(). Therefore, you won't catch the exception there. Best, Burak On Mon, Aug 24, 2015 at 9:09 AM, Roberto Coluccio < roberto.coluc...@gmail.com> wrote: > Hello folks, > > I'm experiencing an unexpected
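
A minimal illustration of the laziness, assuming inputPath may point at a missing file; the exception type shown is the one Hadoop's input format typically throws for a non-existent path:

    try {
      val rdd = sc.textFile(inputPath)   // lazy: nothing is read yet, no exception here
      rdd.count()                        // the action is where a missing path surfaces
    } catch {
      case e: org.apache.hadoop.mapred.InvalidInputException =>
        println(s"input problem: ${e.getMessage}")
    }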

Re: [Spark Streaming on Mesos (good practices)]

2015-08-24 Thread Aram Mkrtchyan
Here is the answer to my question if somebody needs it Running Spark in Standalone mode or coarse-grained Mesos mode leads to better task launch times than the fine-grained Mesos mode. The resource is http://spark.apache.org/docs/latest/streaming-programming-guide.html On Mon, Aug 24, 2015 at

Re: Kafka Spark Partition Mapping

2015-08-24 Thread Cody Koeninger
If your cache doesn't change during operation, you can just create it once then broadcast it to all workers. Otherwise, use redis / memcache / whatever. On Mon, Aug 24, 2015 at 12:21 PM, Syed, Nehal (Contractor) < nehal_s...@cable.comcast.com> wrote: > Dear Cody, > Thanks for your response, I am

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Cody Koeninger
It doesn't matter if shuffling occurs. Just update ZK from the driver, inside the foreachRDD, after all your dynamodb updates are done. Since you're just doing it for monitoring purposes, that should be fine. On Mon, Aug 24, 2015 at 12:11 PM, suchenzang wrote: > Forgot to include the PR I was

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread suchenzang
Forgot to include the PR I was referencing: https://github.com/apache/spark/pull/4805/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Direct-Streaming-With-ZK-Updates-tp24423p24424.html Sent from the Apache Spark User List mailing list archive at Nabb

Spark Direct Streaming With ZK Updates

2015-08-24 Thread suchenzang
Hello, I'm planning on adding a listener to update Zookeeper (for monitoring purposes) when batch completes. What would be a consistent manner to index the offsets for a given batch? In the PR above, it seems like batchTime was used, but is there a way to create this batchTime -> offsets in the st

Re: Determinant of Matrix

2015-08-24 Thread Alex Gittens
It sounds like you've already computed the covariance matrix. You can convert it to a breeze matrix then use breeze.linalg.det : val determinant = breeze.linalg.det( mat.toBreeze.asInstanceOf[breeze.linalg.DenseMatrix[Double]] ) On Mon, Aug 24, 2015 at 4:10 AM, Naveen wrote: > Hi, > > Is ther
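
A variant of the same idea that avoids the private toBreeze helper, assuming cov is the mllib.linalg.Matrix holding the covariance (toArray returns the values in column-major order, which matches Breeze's DenseMatrix constructor):

    import breeze.linalg.{DenseMatrix, det}

    val breezeCov = new DenseMatrix(cov.numRows, cov.numCols, cov.toArray)
    val determinant = det(breezeCov)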

Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
I found this solution: https://stackoverflow.com/questions/3390484/python-hashlib-md5-differs-between-linux-windows Does anybody see a reason why I shouldn't put in a PR to make this change? FROM with open(tgz_file_path) as tar: TO with open(tgz_file_path, "rb") as tar: On Mon, Aug 24, 2015 at

Re: How to evaluate custom UDF over window

2015-08-24 Thread Yin Huai
For now, user-defined window function is not supported. We will add it in future. On Mon, Aug 24, 2015 at 6:26 AM, xander92 wrote: > The ultimate aim of my program is to be able to wrap an arbitrary Scala > function (mostly will be statistics / customized rolling window metrics) in > a UDF and e

Unable to catch SparkContext methods exceptions

2015-08-24 Thread Roberto Coluccio
Hello folks, I'm experiencing an unexpected behaviour, that suggests me thinking about my missing notions on how Spark works. Let's say I have a Spark driver that invokes a function like: - in myDriver - val sparkContext = new SparkContext(mySparkConf) val inputPath = "file://home/myUser

Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
Additional info...If I use an online md5sum check then it matches...So, it's either windows or python (using 2.7.10) On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony wrote: > When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now > seen this on two different machines. I am run

Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two different machines. I am running on windows, but I would imagine that shouldn't affect the md5. Is this a boto problem, python problem, spark problem? -- View this message in context: http://apache-spark

Re: Drop table and Hive warehouse

2015-08-24 Thread Michael Armbrust
Thats not the expected behavior. What version of Spark? On Mon, Aug 24, 2015 at 1:32 AM, Kevin Jung wrote: > When I store DataFrame as table with command "saveAsTable" and then > execute "DROP TABLE" in SparkSQL, it doesn't actually delete files in hive > warehouse. > The table disappears from

DataFrame/JDBC very slow performance

2015-08-24 Thread Dhaval Patel
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). When I tried with BIG table (5B records) then no results returned upon completion of query. I am using Spark 1.4.1. and is setup on a very powerful machine(2 cpu, 24 cores,

Re: How to set environment of worker applications

2015-08-24 Thread Raghavendra Pandey
System properties and environment variables are two different things.. One can use spark.executor.extraJavaOptions to pass system properties and spark-env.sh to pass environment variables. -raghav On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat wrote: > That's surprising. Passing the environme
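
A small sketch of reading both kinds of values from inside a job; my.prop and MY_ENV are made-up names used only for illustration:

    // system property, e.g. submitted with
    //   --conf "spark.executor.extraJavaOptions=-Dmy.prop=foo"
    val prop = sys.props.get("my.prop")
    // environment variable, e.g. set in conf/spark-env.sh with: export MY_ENV=bar
    val env = sys.env.get("MY_ENV")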

RE: Loading already existing tables in spark shell

2015-08-24 Thread Cheng, Hao
And be sure the hive-site.xml is under the classpath or under the path of $SPARK_HOME/conf Hao From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] Sent: Monday, August 24, 2015 8:57 PM To: user Subject: Re: Loading already existing tables in spark shell Hi Jeetendra, I faced this

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
The first job is to infer the json schema, and the second one is what you mean of the query. You can provide the schema while loading the json file, like below: sqlContext.read.schema(xxx).json(“…”)? Hao From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Monday, August 24, 2015 6:20 PM To: user@sp
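
A concrete version of that suggestion, assuming the stock people.json example with its age/name fields:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("age", LongType),
      StructField("name", StringType)))
    // with an explicit schema there is no separate job to infer it
    val df = sqlContext.read.schema(schema).json("examples/src/main/resources/people.json")
    df.show()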

Re: Joining using mulitimap or array

2015-08-24 Thread Hemant Bhanawat
In your example, a.attributes.name is a list and is not a string . Run this to find it out : a.select($"a.attributes.name").show() On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov wrote: > Hi, guys > I'm confused about joining columns in SparkSQL and need your advice. > I want to join 2 datasets o

How to evaluate custom UDF over window

2015-08-24 Thread xander92
The ultimate aim of my program is to be able to wrap an arbitrary Scala function (mostly will be statistics / customized rolling window metrics) in a UDF and evaluate them on DataFrames using the window functionality. So my main question is how do I express that a UDF takes a Frame of rows from a

Re: Joining using mulitimap or array

2015-08-24 Thread Ilya Karpov
Thanks, but I think this is not a case of multiple spark contexts (nevertheless I tried your suggestion - it didn't work). The problem is joining two datasets using an array item's value: attribute.value in my case. Does anyone have ideas? > On 24 Aug 2015, at 15:01, satish chandra j > wrote: > >

Re: Spark ec2 lunch problem

2015-08-24 Thread Robin East
spark-ec2 is the way to go however you may need to debug connectivity issues. For example do you know that the servers were correctly setup in AWS and can you access each node using ssh? If no then you need to work out why (it’s not a spark issue). If yes then you will need to work out why ssh v

Re: Loading already existing tables in spark shell

2015-08-24 Thread Ishwardeep Singh
Hi Jeetendra, I faced this issue. I did not specify the database where this table exists. Please set the database by using "use " command before executing the query. Regards, Ishwardeep From: Jeetendra Gangele Sent: Monday, August 24, 2015 5:47 PM To: user
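
A small sketch of the fix, with mydb standing in for whichever database actually holds the table:

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.sql("USE mydb")   // placeholder database name
    sqlContext.sql("SELECT count(*) FROM event_impressions").show()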

Difficulties developing a Specs2 matcher for Spark Streaming

2015-08-24 Thread Juan Rodríguez Hortalá
Hi, I've had some troubles developing a Specs2 matcher that checks that a predicate holds for all the elements of an RDD, and using it for testing a simple Spark Streaming program. I've finally been able to get a code that works, you can see it in https://gist.github.com/juanrh/dffd060e3a371676b83

RE: Spark ec2 lunch problem

2015-08-24 Thread Garry Chen
So what is the best way to deploy spark cluster in EC2 environment any suggestions? Garry From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Friday, August 21, 2015 4:27 PM To: Garry Chen Cc: user@spark.apache.org Subject: Re: Spark ec2 lunch problem It may happen that the version of s

Re: Memory allocation error with Spark 1.5, HashJoinCompatibilitySuite

2015-08-24 Thread Adam Roberts
Hi, I'm regularly hitting "Unable to acquire memory" problems only when trying to use overflow pages when running the full set of Spark tests across different platforms. The machines I'm using all have well over 10 GB of RAM and I'm running without any changes to the pom.xml file. Standard 3 GB Jav

Performance - Python streaming v/s Scala streaming

2015-08-24 Thread utk.pat
I am new to SPARK streaming. I was running the "kafka_wordcount" example with a local KAFKA and SPARK instance. It was very easy to set this up and get going :)I tried running both SCALA and Python versions of the word count example. Python versions seems to be extremely slow. Sometimes it has dela

Loading already existing tables in spark shell

2015-08-24 Thread Jeetendra Gangele
Hi All I have few tables in hive and I wanted to run query against them with spark as execution engine. Can I direct;y load these tables in spark shell and run query? I tried with 1.val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) 2.qlContext.sql("FROM event_impressions select count

Re: Joining using mulitimap or array

2015-08-24 Thread satish chandra j
Hi, If you join logic is correct, it seems to be a similar issue which i faced recently Can you try by *SparkContext(conf).set("spark.driver.allowMultipleContexts","true")* Regards, Satish Chandra On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov wrote: > Hi, guys > I'm confused about joining colum

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-24 Thread satish chandra j
HI All, Please find fix info for users who are following the mail chain of this issue and the respective solution below: *reduceByKey: Non working snippet* import org.apache.spark.Context import org.apache.spark.Context._ import org.apache.spark.SparkConf val conf = new SparkConf() val sc = new
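
For reference, a self-contained reduceByKey example that does run when submitted as a standalone app (the sample data is made up, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("reduceByKeyExample")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val summed = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)
    summed.collect().foreach(println)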

Determinant of Matrix

2015-08-24 Thread Naveen
Hi, Is there any function to find the determinant of a mllib.linalg.Matrix (a covariance matrix) using Spark? Regards, Naveen - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-

DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Jeff Zhang
It's weird to me that the simple show function will cost 2 spark jobs. DataFrame#explain shows it is a very simple operation, not sure why need 2 jobs. == Parsed Logical Plan == Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json] == Analyz

Joining using mulitimap or array

2015-08-24 Thread Ilya Karpov
Hi, guys I'm confused about joining columns in SparkSQL and need your advice. I want to join 2 datasets of profiles. Each profile has name and array of attributes(age, gender, email etc). There can be mutliple instances of attribute with the same name, e.g. profile has 2 emails - so 2 attributes

Re: Memory-efficient successive calls to repartition()

2015-08-24 Thread alexis GILLAIN
Hi Aurelien, The first code should create a new RDD in memory at each iteration (check the webui). The second code will unpersist the RDD but that's not the main problem. I think you have trouble due to long lineage as .cache() keep track of lineage for recovery. You should have a look at checkpo
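
A rough sketch of periodic checkpointing in an iterative loop; initialRdd, step and the interval of 10 are placeholders:

    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory
    var current = initialRdd
    for (i <- 1 to numIterations) {
      current = step(current).cache()
      if (i % 10 == 0) current.checkpoint()         // truncate the lineage periodically
      current.count()                               // materialize before the next iteration
    }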

Drop table and Hive warehouse

2015-08-24 Thread Kevin Jung
When I store DataFrame as table with command "saveAsTable" and then execute "DROP TABLE" in SparkSQL, it doesn't actually delete files in hive warehouse. The table disappears from a table list but the data files are still alive. Because of this, I can't saveAsTable with a same name before dropping
