Re: aliasing aggregate columns?

2015-04-17 Thread elliott cordo
FYI.. the problem is that the column names Spark generates cannot be referenced within SQL or DataFrame operations (i.e. "SUM(cool_cnt#725)").. any idea how to alias these final aggregate columns? The syntax below doesn't make sense, but this is what I'd ideally want to do: .agg({"cool_cnt":"
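A minimal sketch (in Scala; the grouping column name is made up) of aliasing the aggregate at the point where it is defined so the result can be referenced later; the PySpark counterpart is Column.alias, as noted later in this thread:

    import org.apache.spark.sql.functions._

    // hypothetical DataFrame "reviews" with columns "business_id" and "cool_cnt"
    val agged = reviews
      .groupBy("business_id")
      .agg(sum("cool_cnt").as("cool_cnt_sum"))  // alias the generated aggregate column

    agged.select("cool_cnt_sum").show()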

Re: Distinct is very slow

2015-04-17 Thread Akhil Das
How many tasks are you seeing in your mapToPair stage? Is it 7000? Then I suggest giving a number similar/close to 7000 in your .distinct call. What is happening in your case is that you are repartitioning your data to a smaller number (32), which puts a lot of load on processing, I believe
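A minimal sketch of what that looks like (the count is illustrative and should roughly match the parallelism of the preceding stage):

    // keep the parallelism close to the preceding stage instead of collapsing to 32 partitions
    val uniques = rdd.distinct(7000)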

Re: SparkR: Server IPC version 9 cannot communicate with client version 4

2015-04-17 Thread Akhil Das
There's a version incompatibility between your hadoop jars. You need to make sure you build your spark with Hadoop 2.5.0-cdh5.3.1 version. Thanks Best Regards On Fri, Apr 17, 2015 at 5:17 AM, lalasriza . wrote: > Dear everyone, > > right now I am working with SparkR on cluster. The following ar
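For reference, a sketch of the kind of build invocation meant here (profile and flags follow the Spark build documentation of that time; adjust to your distribution):

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -DskipTests clean package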

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
FWIW, this is an essential feature to our use of Spark, and I'm surprised it's not advertised clearly as a limitation in the documentation. All I've found about running Spark 1.3 on 2.11 is here:http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211Also, I'm experiencing som

Re: Spark 1.3 saveAsTextFile with codec gives error - works with Spark 1.2

2015-04-17 Thread Akhil Das
Not sure if this will help, but try clearing your jar cache (for sbt ~/.ivy2 and for maven ~/.m2) directories. Thanks Best Regards On Wed, Apr 15, 2015 at 9:33 PM, Manoj Samel wrote: > Env - Spark 1.3 Hadoop 2.3, Kerberos > > xx.saveAsTextFile(path, codec) gives following trace. Same works with

Re: Task result in Spark Worker Node

2015-04-17 Thread Raghav Shankar
Hey Imran, Thanks for the great explanation! This cleared up a lot of things for me. I am actually trying to utilize some of the features within Spark for a system I am developing. I am currently working on developing a subsystem that can be integrated within Spark and other Big Data solution

Tuple join

2015-04-17 Thread Flavio Pompermaier
Hi to all, I have 2 RDDs D1 and D2 like: D1: A,p1,a A,p2,a2 A,p3,X B,p3,Y B,p1,b1 D2: X,s,V X,r,2 Y,j,k I'd like to have a unique RDD D3 (Tuple4) like A,X,a1,a2 B,Y,b1,null, basically filling in when D1.f2==D2.f0. Is that possible and how? Could you show me a simple snippet? Thanks in advance
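The exact shape of D3 is not fully clear from the message, but the stated condition D1.f2 == D2.f0 with null-filling maps naturally onto a leftOuterJoin; a minimal sketch (field handling is illustrative):

    // d1, d2: RDD[(String, String, String)]
    val d1ByLink = d1.map { case (a, p, x) => (x, (a, p)) }   // key D1 by its third field
    val d2ByKey  = d2.map { case (x, s, v) => (x, (s, v)) }   // key D2 by its first field

    // keeps every D1 row; the D2 side is None (null) when there is no matching key
    val joined = d1ByLink.leftOuterJoin(d2ByKey)
      .map { case (x, ((a, p), d2Opt)) => (a, x, p, d2Opt.orNull) }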

Spark Directed Acyclic Graph / Jobs

2015-04-17 Thread James King
Is there a good resource that explains how Spark jobs gets broken down to tasks and executions. I just need to get a better understanding of this. Regards j

Re: Spark on Windows

2015-04-17 Thread Sree V
Spark 'master' branch (i.e. v1.4.0) builds successfully on Windows 8.1, Intel i7 64-bit, with Oracle JDK 8u45, with Maven opts but without the flag "-XX:ReservedCodeCacheSize=1g". Takes about 33 minutes. Thanking you. With Regards Sree  On Thursday, April 16, 2015 9:07 PM, Arun Lists wrote:

SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-04-17 Thread Jaonary Rabarisoa
Dear all, Here is an example of code to reproduce the issue I mentioned in a previous mail about saving an UserDefinedType into a parquet file. The problem here is that the code works when I run it inside intellij idea but fails when I create the assembly jar and run it with spark-submit. I use th

RE: Spark Directed Acyclic Graph / Jobs

2015-04-17 Thread Shao, Saisai
I think this paper will be a good resource (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf), also the paper of Dryad is also a good one. Thanks Jerry From: James King [mailto:jakwebin...@gmail.com] Sent: Friday, April 17, 2015 3:26 PM To: user Subject: Spark Directed Acyclic Grap

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Krist Rastislav
Hello, thank You for Your answer – I am creating the DataFrames manually using org.apache.spark.sql.SQLContext#createDataFrame. RDD is my custom implementation encapsulating invocation of a remote REST-based web service, and the schema is created programmatically from metadata (obtained from the same

Re: Task result in Spark Worker Node

2015-04-17 Thread Raghav Shankar
My apologies, I had pasted the wrong exception trace in the previous email. Here is the actual exception that I am receiving. Exception in thread "main" java.lang.NullPointerException at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:154) at org.a

Re: Random pairs / RDD order

2015-04-17 Thread Aurélien Bellet
Hi Sean, Thanks a lot for your reply. The problem is that I need to sample random *independent* pairs. If I draw two samples and build all n*(n-1) pairs then there is a lot of dependency. My current solution is also not satisfying because some pairs (the closest ones in a partition) have a mu

Path issue in running spark

2015-04-17 Thread mas
A very basic but strange problem: on running the master I am getting the following error. My Java path is correct; however, the spark-class file fails because the string "bin/java" is duplicated. Can anybody explain why this happens? Error: /bin/spark-class: line 190: exec: /u

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Wang, Daoyuan
Normally I use like the following in scala: >case class datetest(x: Int, y: java.sql.Date) >val dt = sc.parallelize(1 to 3).map(p => datetest(p, new >java.sql.Date(p*1000*60*60*24))) >sqlContext.createDataFrame(dt).registerTempTable("t1") >sql("select * from t1").collect.foreach(println) If you

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
Doesn't this reduce to "Scala isn't compatible with itself across maintenance releases"? Meaning, if this were "fixed" then Scala 2.11.{x < 6} would have similar failures. It's not not-ready; it's just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the unofficial support to at least make

Re: Spark Directed Acyclic Graph / Jobs

2015-04-17 Thread James King
Thanks Jerry, The other paper you refer to is maybe this one? http://research.microsoft.com/pubs/63785/eurosys07.pdf Regards j On Fri, Apr 17, 2015 at 9:45 AM, Shao, Saisai wrote: > I think this paper will be a good resource ( > https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf), also

Re: Actor not found

2015-04-17 Thread Shixiong Zhu
Forgot this one: I cannot find any issue about creating OutputCommitCoordinator. The order of creating OutputCommitCoordinator looks right. Best Regards, Shixiong(Ryan) Zhu 2015-04-17 16:57 GMT+08:00 Shixiong Zhu : > I just checked the codes about creating OutputCommitCoordinator. Could you > rep

RE: How to do dispatching in Streaming?

2015-04-17 Thread Evo Eftimov
Good use of analogies :) Yep, friction (or entropy in general) exists in everything – but hey, by adding and doing "more work" at the same time (aka more powerful rockets) some people have overcome the friction of the air and even got as far as the moon and beyond. It is all about the botto

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Krist Rastislav
Hi, this is OK if there is a fixed structure being used as the schema in the DataFrame. But in my case, the schema used by the DataFrame is built dynamically from metadata provided by a REST service (and is thus unknown at compile time), so I have no case class to be used as the schema. Thanks, R.Krist From:

Re: Actor not found

2015-04-17 Thread Shixiong Zhu
I just checked the codes about creating OutputCommitCoordinator. Could you reproduce this issue? If so, could you provide details about how to reproduce it? Best Regards, Shixiong(Ryan) Zhu 2015-04-16 13:27 GMT+08:00 Canoe : > 13119 Exception in thread "main" akka.actor.ActorNotFound: Actor not

Some questions on Multiple Streams

2015-04-17 Thread Laeeq Ahmed
Hi, I am working with multiple Kafka streams (23 streams) and currently I am processing them separately. I receive one stream from each topic. I have the following questions. 1. The Spark Streaming guide suggests unioning these streams. Is it possible to get statistics of each stream even after t
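A minimal sketch of the union approach that still keeps a per-stream handle on the data (the tagging scheme is illustrative):

    // kafkaStreams: Seq[DStream[(String, String)]], one per topic
    val tagged = kafkaStreams.zipWithIndex.map { case (stream, i) =>
      stream.map { case (_, v) => (s"topic-$i", v) }   // tag each record with its source stream
    }
    val unified = ssc.union(tagged)
    // per-stream statistics remain possible after the union, e.g. record counts per batch
    val countsPerStream = unified.map { case (tag, _) => (tag, 1L) }.reduceByKey(_ + _)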

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-17 Thread Evo Eftimov
And btw if you suspect this is a "YARN issue" you can always launch and use Spark in Standalone Mode, which uses its own embedded cluster resource manager - this is possible even when Spark has been deployed on CDH under YARN by the pre-canned install scripts of CDH. To achieve that: 1.

Re: How to do dispatching in Streaming?

2015-04-17 Thread Gerard Maas
Evo, In Spark there's a fixed scheduling cost for each task, so more tasks mean an increased bottom line for the same amount of work being done. The number of tasks per batch interval should relate to the CPU resources available for the job following the same 'rule of thumbs' than for Spark, being

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Krist Rastislav
Hello again, steps to reproduce the same problem in JdbcRDD: - create a table containing a Date field in your favourite DBMS, I used PostgreSQL: CREATE TABLE spark_test ( pk_spark_test integer NOT NULL, text character varying(25), date1 date, CONSTRAINT pk PRIMARY KEY (pk_spark_test) ) WITH

Re: Custom partioner

2015-04-17 Thread Archit Thakur
I don't think you can change it to 4 bytes without any custom compilation. To make the same key go to the same node, you'll have to repartition the data, which is shuffling anyway. Unless your raw data is such that the same key is on the same node, you'll have to shuffle at least once to make same key on same no

Executor memory in web UI

2015-04-17 Thread podioss
Hi, I am a bit confused with the executor-memory option. I am running applications with the Standalone cluster manager with 8 workers with 4gb memory and 2 cores each, and when I submit my application with spark-submit I use --executor-memory 1g. In the web UI, in the completed applications table, I see t

Re: Executor memory in web UI

2015-04-17 Thread Sean Owen
This is the fraction available for caching, which is 60% * 90% * total by default. On Fri, Apr 17, 2015 at 11:30 AM, podioss wrote: > Hi, > i am a bit confused with the executor-memory option. I am running > applications with Standalone cluster manager with 8 workers with 4gb memory > and 2 cores
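For a 1 GB executor under the default settings (spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9), that works out to roughly 0.6 × 0.9 ≈ 54% of the JVM heap available for caching, i.e. a bit over 500 MB; the exact figure shown depends on the heap size the JVM actually reports.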

Addition of new Metrics for killed executors.

2015-04-17 Thread Archit Thakur
Hi, We are planning to add new metrics in Spark for the executors that got killed during the execution. Was just curious why this info is not already present. Is there some reason for not adding it? Any ideas are welcome. Thanks and Regards, Archit Thakur.

Re: Spark on Windows

2015-04-17 Thread Arun Lists
Thanks, Sree! Are you able to run your applications using spark-submit? Even after we were able to build successfully, we ran into problems with running the spark-submit script. If everything worked correctly for you, we can hope that things will be smoother when 1.4.0 is made generally available.

ClassCastException while caching a query

2015-04-17 Thread Tash Chainar
Hi all, Spark 1.2.1. I have a Cassandra column family and doing the following SchemaRDD s = cassandraSQLContext.sql("select user.id as user_id from user"); // user.id is UUID in table definition s.registerTempTable( "my_user" ); s.cache(); // throws following exception // tried the cassandraSQLC

SparkStreaming 1.3.0 fileNotFound Exception while using WAL & Checkpoints

2015-04-17 Thread Akhil Das
Hi With SparkStreaming on 1.3.0 version when I'm using WAL and checkpoints, sometimes, I'm hitting fileNotFound exceptions. Here's the complete stacktrace: https://gist.github.com/akhld/126b945f7fef408a525e The application simply reads data from Kafka and does a simple wordcount over it. Batch d

RE: Streaming problems running 24x7

2015-04-17 Thread González Salgado , Miquel
Hi Akhil, Thank you for your response, I think it is not because of the processing time, in fact the delay is under 1 second, while the batch interval is 10 seconds… The data volume is low (10 lines / second) By the way, I have seen some results changing to this call of Kafkautils: KafkaUtils.c

RE: Streaming problems running 24x7

2015-04-17 Thread González Salgado , Miquel
Hi, Thank you for your response, I think it is not because of the processing speed, in fact the delay is under 1 second, while the batch interval is 10 seconds... The data volume is low (10 lines / second) Changing to local[8] was worsening the problem (cpu increase more quickly) By the way, I

Re: aliasing aggregate columns?

2015-04-17 Thread elliott cordo
Ps.. forgot to mention this syntax works... but then you lose your group by fields (which is honestly pretty weird, I'm not sure if this is as designed or a bug?) >>> t2 = reviews.groupBy("stars").agg(count("stars").alias("count")) >>> t2 *DataFrame[count: bigint]* On Thu, Apr 16, 2015 at 9:32

Streaming Linear Regression problem

2015-04-17 Thread barisak
Hi, I wrote this code just to train the streaming linear regression, but I get a "no data found" warning, so the weights are not updated. Is there any solution for this? Thanks import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressi
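A minimal sketch of the training setup, assuming a DStream[LabeledPoint] is actually arriving; if the input batches are empty the model is never updated, which would match the behaviour described:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    val numFeatures = 3  // illustrative
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
      .setStepSize(0.1)
      .setNumIterations(10)

    // trainingData: DStream[LabeledPoint] parsed from the input source
    model.trainOn(trainingData)
    ssc.start()
    ssc.awaitTermination()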

Re: Joined RDD

2015-04-17 Thread Archit Thakur
Ajay, This is true. When we call join again on two RDDs, rather than computing the whole pipeline again, it reads the map output of the map phase of an RDD (which it usually gets from the shuffle manager). If you see the code: override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[

Re: Joined RDD

2015-04-17 Thread Archit Thakur
map phase of join* On Fri, Apr 17, 2015 at 5:28 PM, Archit Thakur wrote: > Ajay, > > This is true. When we call join again on two RDD's.Rather than computing > the whole pipe again, It reads the map output of the map phase of an > RDD(which it usually gets from shuffle manager). > > If you see t

Re: Custom partioner

2015-04-17 Thread Jeetendra Gangele
Hi Archit, Thanks for the reply. How can I do the custom compilation to reduce it to 4 bytes? I want to make it 4 bytes in any case; can you please guide me? I am applying flatMapValues in each step after zipWithIndex, so it should be on the same node, right? Why is it shuffling? Also I am running with very few rec

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Krist Rastislav
So finally, org.apache.spark.sql.catalyst.ScalaReflection#convertToCatalyst was the method I was looking for (this is the way how it is being done with case classes at least, so it should be good for me too ;-)) My problem is thus solved... Someone should put that method also in JdbcRDD to make

Re: RDD collect hangs on large input data

2015-04-17 Thread Zsolt Tóth
Thanks for your answer Imran. I haven't tried your suggestions yet, but setting spark.shuffle.blockTransferService=nio solved my issue. There is a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6962. Zsolt 2015-04-14 21:57 GMT+02:00 Imran Rashid : > is it possible that when you switc
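For reference, that setting can go in spark-defaults.conf or be passed on the command line, e.g.:

    spark-submit --conf spark.shuffle.blockTransferService=nio ...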

Re: Custom partioner

2015-04-17 Thread Archit Thakur
By custom installation, I meant change the code and build it. I have not done the complete impact analysis, just had a look on the code. When you say, same key goes to same node, It would need shuffling unless the raw data you are reading is present that way. On Apr 17, 2015 6:30 PM, "Jeetendra Ga

Running into several problems with Data Frames

2015-04-17 Thread Darin McBeath
I decided to play around with DataFrames this morning but I'm running into quite a few issues. I'm assuming that I must be doing something wrong so would appreciate some advice. First, I create my Data Frame. import sqlContext.implicits._ case class Entity(InternalId: Long, EntityId: Long, Ent

Re: Custom partioner

2015-04-17 Thread Jeetendra Gangele
OK, is there a way I can use hash partitioning so that I can improve the performance? On 17 April 2015 at 19:33, Archit Thakur wrote: > By custom installation, I meant change the code and build it. I have not > done the complete impact analysis, just had a look on the code. > > When you say, s
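A minimal sketch of pre-partitioning a pair RDD with a HashPartitioner so that subsequent key-based operations reuse that partitioning instead of shuffling again (key/value types are illustrative):

    import org.apache.spark.HashPartitioner

    // pairs: RDD[(Int, String)]
    val partitioned = pairs.partitionBy(new HashPartitioner(16)).cache()  // one shuffle happens here
    // operations that use the same partitioner, e.g. reduceByKey, avoid a further shuffle
    val counts = partitioned.mapValues(_ => 1L).reduceByKey(_ + _)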

history-server does't read logs which are on FS

2015-04-17 Thread Serega Sheypak
Hi, I started the history-server. Here is the UI output: - *Event log directory:* file:/var/log/spark/applicationHistory/ No completed applications found! Did you specify the correct logging directory? Please verify your setting of spark.history.fs.logDirectory and whether you have the permissions to a

Re: Distinct is very slow

2015-04-17 Thread Jeetendra Gangele
I have given 3000 tasks to mapToPair; now it's taking so much memory and shuffling and wasting time there. Here are the stats when I run with very small data: for almost all the data it's shuffling, not sure what is happening here, any idea? - *Total task time across all tasks: *11.0 h - *Shuffle

Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread ARose
So I'm trying to store the results of a query into a DataFrame, but I get the following exception thrown: Exception in thread "main" java.lang.RuntimeException: [1.71] failure: ``*'' expected but `select' found SELECT DISTINCT OutSwitchID FROM wtbECRTemp WHERE OutSwtichID NOT IN (SELECT SwitchID

local directories for spark running on yarn

2015-04-17 Thread shenyanls
According to the documentation: The local directories used by Spark executors will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. (https://spark.apache.org/docs/1.2.1/running-on-yarn.html) I

Re: regarding ZipWithIndex

2015-04-17 Thread Jeetendra Gangele
Hi Ted Will you be able to guide me for any sample code for achieving this? On 17 April 2015 at 02:30, Jeetendra Gangele wrote: > Can you please guide me how can I extend RDD and convert into this way you > are suggesting. > > On 16 April 2015 at 23:46, Jeetendra Gangele wrote: > >> I type T i

Re: regarding ZipWithIndex

2015-04-17 Thread Ted Yu
I have some assignments on hand at the moment. Will try to come up with sample code after I clear the assignments. FYI On Thu, Apr 16, 2015 at 2:00 PM, Jeetendra Gangele wrote: > Can you please guide me how can I extend RDD and convert into this way you > are suggesting. > > On 16 April 2015 a

How to persist RDD return from partitionBy() to disk?

2015-04-17 Thread Wang, Ningjun (LNG-NPV)
I have a huge RDD[Document] with millions of items. I partitioned it using HashPartitioner and saved it as an object file. But when I load the object file back into an RDD, I lose the HashPartitioner. How do I preserve the partitions when loading the object file? Here is the code val docVectors : RDD[D

Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please refer to SPARK-4226. BTW, Spark 1.3 binds to Hive 0.13.1 by default. On Fri, Apr 17, 2015 at 09:18 ARose wrote: > So I'm trying to store the results of a query into a DataFrame, but I get > the > following exception thrown

Re: Distinct is very slow

2015-04-17 Thread Jeetendra Gangele
I am saying to partition with something like partitionBy(new HashPartitioner(16)); will this not work? On 17 April 2015 at 21:28, Jeetendra Gangele wrote: > I have given 3000 task to mapToPair now its taking so much memory and > shuffling and wasting time there. Here is the stats when I run with very >

When are TaskCompletionListeners called?

2015-04-17 Thread Akshat Aranya
Hi, I'm trying to figure out when TaskCompletionListeners are called -- are they called at the end of the RDD's compute() method, or after the iteration through the iterator of the compute() method is completed. To put it another way, is this OK: class DatabaseRDD[T] extends RDD[T] { def comp

Spark hanging after main method completes

2015-04-17 Thread apropion
I recently started using Spark version 1.3.0 in standalone mode (with Scala 2.10.3), and I'm running into an odd problem. I'm loading data from a file using sc.textFile, doing some conversion of the data, and then clustering it. When I do this with a small file (10 lines, 9 KB), it works fine, and

Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
Thanks everyone for the reply. Looks like foreachRDD + filtering is the way to go. I'll have 4 independent Spark streaming applications so the overhead seems acceptable. Jianshi On Fri, Apr 17, 2015 at 5:17 PM, Evo Eftimov wrote: > Good use of analogies J > > > > Yep friction (or entropy in g

Metrics Servlet on spark 1.2

2015-04-17 Thread Udit Mehta
Hi, I am unable to access the metrics servlet on Spark 1.2. I tried to access it from the app master UI on port 4040 but I don't see any metrics there. Is it a known issue with Spark 1.2 or am I doing something wrong? Also how do I publish my own metrics and view them on this servlet? Thanks, Udit

Re: When are TaskCompletionListeners called?

2015-04-17 Thread Imran Rashid
It's the latter -- after Spark gets to the end of the iterator (or if it hits an exception). So your example is good; that is exactly what it is intended for. On Fri, Apr 17, 2015 at 12:23 PM, Akshat Aranya wrote: > Hi, > > I'm trying to figure out when TaskCompletionListeners are called -- are >
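For reference, a minimal sketch of registering such a listener inside compute(), e.g. to close a per-task database connection; the connection helpers here are hypothetical:

    // inside the DatabaseRDD[T] sketch from the question (extends RDD[T])
    override def compute(split: Partition, context: TaskContext): Iterator[T] = {
      val conn = openConnection()               // hypothetical helper
      context.addTaskCompletionListener { _ =>
        conn.close()                            // runs once the task finishes, normally or with an exception
      }
      runQuery(conn, split)                     // hypothetical helper returning an Iterator[T]
    }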

Re: Task result in Spark Worker Node

2015-04-17 Thread Imran Rashid
Hard to say for sure, but tasks get serialized along with a partition, which might be the info that you are missing. I don't know what you are trying to build, so I don't mean to be too discouraging, but it sounds like you are trying to do some unusual hybrid that will be very tough. It might mak

Re: Spark Code to read RCFiles

2015-04-17 Thread gle
Hi, I'm new to Spark and am working on a proof of concept. I'm using Spark 1.3.0 and running in local mode. I can read and parse an RCFile using Spark however the performance is not as good as I hoped. I'm testing using ~800k rows and it is taking about 30 mins to process. Is there a better wa

Re: Random pairs / RDD order

2015-04-17 Thread Imran Rashid
if you can store the entire sample for one partition in memory, I think you just want: val sample1 = rdd.sample(true,0.01,42).mapPartitions(scala.util.Random.shuffle) val sample2 = rdd.sample(true,0.01,43) .mapPartitions(scala.util.Random.shuffle) ... On Fri, Apr 17, 2015 at 3:05 AM, Aurélien
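A slightly expanded sketch of that idea, materializing each partition so it can be shuffled (mapPartitions expects an Iterator-to-Iterator function); pairing the two samples afterwards, e.g. with zip, additionally requires equal element counts per partition:

    val sample1 = rdd.sample(true, 0.01, 42)
      .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)  // shuffle within each partition
    val sample2 = rdd.sample(true, 0.01, 43)
      .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)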

Re: history-server does't read logs which are on FS

2015-04-17 Thread Imran Rashid
are you calling sc.stop() at the end of your applications? The history server only displays completed applications, but if you don't call sc.stop(), it doesn't know that those applications have been stopped. Note that in spark 1.3, the history server can also display running applications (includi
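A minimal sketch of the shutdown pattern being described:

    val sc = new SparkContext(conf)
    try {
      // application logic
    } finally {
      sc.stop()  // records the application end, so the history server lists the app as completed
    }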

Re: How to persist RDD return from partitionBy() to disk?

2015-04-17 Thread Imran Rashid
https://issues.apache.org/jira/browse/SPARK-1061 note the proposed fix isn't to have spark automatically know about the partitioner when it reloads the data, but at least to make it *possible* for it to be done at the application level. On Fri, Apr 17, 2015 at 11:35 AM, Wang, Ningjun (LNG-NPV) <
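Until something along those lines lands, a minimal workaround sketch is to re-apply the partitioner after loading; this costs one extra shuffle but restores co-partitioning for later joins (key/value types and the path are illustrative):

    import org.apache.spark.HashPartitioner

    // docVectors: RDD[(String, DocVector)] -- illustrative types
    docVectors.partitionBy(new HashPartitioner(8)).saveAsObjectFile("/tmp/docVectors.obj")

    val reloaded = sc.objectFile[(String, DocVector)]("/tmp/docVectors.obj")
      .partitionBy(new HashPartitioner(8))  // the partitioner is not stored in the file, so re-apply it
      .cache()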

Need Costom RDD

2015-04-17 Thread Jeetendra Gangele
Hi All, I have an RDD which I then convert with zipWithIndex; the index is a Long and takes 8 bytes. Is there any way to make it an Integer? There is no API available with an Int index. How can I create a custom RDD so that it takes only 4 bytes for the index part? Also, why is the API designed in such a way that
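Short of writing a custom RDD, a minimal sketch of narrowing the index after the fact (valid only while the total element count fits in an Int):

    // rdd: RDD[T]
    val indexed = rdd.zipWithIndex().map { case (value, idx) =>
      (idx.toInt, value)  // Long index narrowed to Int
    }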

Re: Spark hanging after main method completes

2015-04-17 Thread apropion
I was using sbt, and I found that I actually had specified Spark 0.9.1 there. Once I upgraded my sbt config file to use 1.3.0, and Scala to 2.10.4, the problem went away. Michael -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hanging-after-main-metho
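For reference, a sketch of the kind of build.sbt entries involved (versions as described above):

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"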

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses all kinds of compression techniques such as dictionary encoding and run-length encoding, which would result in the size difference when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei

Re: dataframe can not find fields after loading from hive

2015-04-17 Thread Reynold Xin
This is strange. cc the dev list since it might be a bug. On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores wrote: > Never mind. I found the solution: > > val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, > hiveLoadedDataFrame.schema) > > which translate to convert the data frame to r

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
I followed the steps described above and I still get this error: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher I am trying to build spark 1.3 on hdp 2.2. I built spark from source using: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
You don’t need to put any yarn assembly in hdfs. The spark assembly jar will include everything. It looks like your package does not include yarn module, although I didn’t find anything wrong in your mvn command. Can you check whether the ExecutorLauncher class is in your jar file or not? BTW:

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Thanks. Would that distribution work for hdp 2.2? On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang wrote: > You don’t need to put any yarn assembly in hdfs. The spark assembly jar > will include everything. It looks like your package does not include yarn > module, although I didn’t find anything wr

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
I am just trying to launch a spark shell and not do anything fancy. I got the binary distribution from apache and put the spark assembly on hdfs. I then specified the yarn.jars option in spark defaults to point to the assembly in hdfs. I still got the same error, so I thought I had to build it for hdp.

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
H... I don't follow. The 2.11.x series is supposed to be binary compatible against user code. Anyway, I was building Spark against 2.11.2 and still saw the problems with the REPL. I've created a bug report: https://issues.apache.org/jira/browse/SPARK-6989

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
You probably want to first try the basic configuration to see whether it works, instead of setting SPARK_JAR pointing to the hdfs location. This error is caused by not finding ExecutorLauncher in class path, and not HDP specific, I think. Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
Hi Udit, By the way, do you mind sharing the whole log trace? Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit Mehta wrote: I am just trying to launch a spark shell and not do anything fancy. I got the binary distribution from apache and put the spark assembl

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Hi, This is the log trace: https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47 On the yarn RM UI, I see : Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher The command I run is: bin/spark-shell --master yarn-client The spark defaults I use is: spark.y

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
Besides the hdp.version in spark-defaults.conf, I think you probably forgot to put the file java-opts under $SPARK_HOME/conf with the following contents. [root@c6402 conf]# pwd /usr/hdp/current/spark-client/conf [root@c6402 conf]# ls fairscheduler.xml.template java-opts log4j.properties.temp
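For reference, that java-opts file typically contains a single JVM option; the version string below is an example and must match your HDP install:

    -Dhdp.version=2.2.0.0-2041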

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Thanks Zhang, that solved the error. This is probably not documented anywhere so I missed it. Thanks again, Udit On Fri, Apr 17, 2015 at 3:24 PM, Zhan Zhang wrote: > Besides the hdp.version in spark-defaults.conf, I think you probably > forget to put the file* java-opts* under $SPARK_HOME/conf

How to avoid “Invalid checkpoint directory” error in apache Spark?

2015-04-17 Thread Peng Cheng
I'm using Amazon EMR + S3 as my spark cluster infrastructure. When I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating by checkpointing is mandatory, each checkpoint has 320 partitions). The job stops halfway, resulting an exception: (On driver) org.apache.s

Announcing Spark 1.3.1 and 1.2.2

2015-04-17 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases. We recommend all users on the 1.3 and 1.2 Spark branches upgrade to these releases, which contain several important bug fixes. Download Spark 1.3.1 or 1.2.2: http://spark.apache.org/downloads.html Release notes: 1.3.1:

Can't get SparkListener to work

2015-04-17 Thread Praveen Balaji
I'm trying to create a simple SparkListener to get notified of error on executors. I do not get any call backs on my SparkListener. Here some simple code I'm executing in spark-shell. But I still don't get any callbacks on my listener. Am I doing something wrong? Thanks for any clue you can send m

Re: Can't get SparkListener to work

2015-04-17 Thread Imran Rashid
when you start the spark-shell, its already too late to get the ApplicationStart event. Try listening for StageCompleted or JobEnd instead. On Fri, Apr 17, 2015 at 5:54 PM, Praveen Balaji < secondorderpolynom...@gmail.com> wrote: > I'm trying to create a simple SparkListener to get notified of e

Verifying multiple workers are being used

2015-04-17 Thread mj1200
When running in standalone cluster mode, how can I verify that more than just one worker is being utilized? I can see that multiple workers are being started up, from the files in $SCALA_HOME/logs, but I don't see any difference in execution time when I specify 1 worker versus 4, which surprises

Re: Can't get SparkListener to work

2015-04-17 Thread Praveen Balaji
Thanks for the response, Imran. I probably chose the wrong methods for this email. I implemented all methods of SparkListener and the only callback I get is onExecutorMetricsUpdate. Here's the complete code: == import org.apache.spark.scheduler._ sc.addSparkListener(new SparkListener()
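A minimal sketch of the kind of listener Imran suggests, overriding callbacks that do fire for jobs started from the shell (the callback bodies are illustrative):

    import org.apache.spark.scheduler._

    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        println(s"Stage ${stage.stageInfo.stageId} completed with ${stage.stageInfo.numTasks} tasks")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
    })

    sc.parallelize(1 to 100).count()  // run a job so the callbacks have something to report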

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Wang, Daoyuan
Thank you for the explanation! I’ll check what can be done here. From: Krist Rastislav [mailto:rkr...@vub.sk] Sent: Friday, April 17, 2015 9:03 PM To: Wang, Daoyuan; Michael Armbrust Cc: user Subject: RE: ClassCastException processing date fields using spark SQL since 1.3.0 So finally, org.apach

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
You are running on 2.11.6, right? of course, it seems like that should all work, but it doesn't work for you. My point is that the shell you are saying doesn't work is Scala's 2.11.2 shell -- with some light modification. It's possible that the delta is the problem. I can't entirely make out wheth

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
I actually just saw your comment on SPARK-6989 before this message. So I'll copy to the mailing list: I'm not sure I understand what you mean about running on 2.11.6. I'm just running the spark-shell command. It in turn is running java -cp /opt/spark/conf:/opt/spark/lib/spark-assembly-1.3.2

Re: Spark Code to read RCFiles

2015-04-17 Thread Pramod Biligiri
Hi, I remember seeing a similar performance problem with Apache Shark last year when compared to Hive, though that was in a company specific port of the code. Unfortunately I no longer have access to that code. The problem then was reflection based class creation in the critical path of reading eac

Re: Actor not found

2015-04-17 Thread Zhihang Fan
Hi, Shixiong: Actually, I know nothing about this exception. I submitted a job that would read about 2.5T data and it threw this exception. And also I try to submit some jobs that can run successfully before this submission, it also failed with the same exception. Hope this will help you to do