problem with spark thrift server

2015-04-23 Thread guoqing0...@yahoo.com.hk
Hi, I have a question about the Spark thrift server. I deployed Spark on YARN and found that if the Spark driver is disabled, the Spark application crashes on YARN. I'd appreciate any suggestions and ideas. Thank you!

Re: Spark RDD Lifecycle: whether RDD will be reclaimed out of scope

2015-04-23 Thread Prannoy
Hi, Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. On Thu, Apr 23, 2015 at 9:28 AM, Jeffery [via Apache Spark User List] < ml-node+s1001560n22618...@n3.nabble.com> wrote: > Hi, Dear Spark Users/Devs: > > In a method
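A minimal sketch of the explicit route, assuming sc is an existing SparkContext and the input path is hypothetical:

    val data = sc.textFile("hdfs:///some/input").cache()  // pin the RDD in memory
    data.count()      // first action materializes the cached partitions
    data.unpersist()  // release the cached blocks now instead of waiting for eviction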

spark yarn-cluster job failing in batch processing

2015-04-23 Thread sachin Singh
Hi All, I am trying to execute batch processing in yarn-cluster mode, i.e. I have many SQL insert queries; based on the argument provided it will fetch the queries, create a context and schema RDD, and insert into Hive tables. Please note: in standalone mode it's working, and in cluster mode working is I

Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable { private int id; private String name; private double salary; } I apply a schema to an RDD and register the table.

Pipeline in pyspark

2015-04-23 Thread Suraj Shetiya
Hi, I have come across ways of building pipelines of input/transform and output with Java (Google Dataflow/Spark etc). I also understand that Spark itself provides ways for creating a pipeline within MLlib for ML transforms (primarily fit). Both of the above are available in the Java/Scala enviro

Re: Problem with using Spark ML

2015-04-23 Thread Staffan
So I got the tip of trying to reduce the step size, and that finally gave some more decent results. I had hoped for the default params to give at least OK results and thought that the problem must be somewhere else in the code. Problem solved!

Re: Pipeline in pyspark

2015-04-23 Thread ayan guha
I do not think you can share data across spark contexts. So as long as you can pass it around you should be good. On 23 Apr 2015 17:12, "Suraj Shetiya" wrote: > Hi, > > I have come across ways of building pipeline of input/transform and output > pipelines with Java (Google Dataflow/Spark etc). I

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi, Hive table creation needs an extra step from 1.3 onwards. You can follow this template: df.registerTempTable(tableName) hc.sql(s"create table $tableName as select * from $tableName") This will save the table in Hive with the given tableName. Regards, Madhukara Phatak http://datamantra

A Spark Group by is running forever

2015-04-23 Thread ๏̯͡๏
I have a groupBy query after a map-side join & leftOuterJoin. And this query is running for more than 2 hours. Tasks (from the web UI): Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Shuffle Read Size / Records | Write Time | Shuffle Write Size / Records | Errors; first row: 0 | 36 | 0 | RUNNING | PROCESS_LOCAL | 17

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi Michael, Here is the jira issue and PR for the same. Please have a look. Regards, Madhukara Phatak http://datamantra.io/ On Thu, Apr 23, 2015 at 1:22 PM, madhu phatak wrote: > Hi, > Hive table

Re: Convert DStream to DataFrame

2015-04-23 Thread Sergio Jiménez Barrio
Thank you very much, Tathagata! On Wednesday, April 22, 2015, Tathagata Das wrote: > Aaah, that. That is probably a limitation of the SQLContext (cc'ing Yin > for more information). > > > On Wed, Apr 22, 2015 at 7:07 AM, Sergio Jiménez Barrio < > drarse.a...@gmail.com > > wrote: > >> Sorr

Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, "Nikolay Tikhonov" wrote: > Hi, > I have a Spark SQL performance issue. My code contains a simple JavaBean: > > public class Person implements Externalizable { > private int id; >

Re: Custom paritioning of DSTream

2015-04-23 Thread davidkl
Hello Evo, Ranjitiyer, I am also looking for the same thing. Using foreach is not useful for me, as processing the RDD as a whole won't be distributed across workers and that would kill performance in my application :-/ Let me know if you find a solution for this. Regards

Re: Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
> why are you caching both the RDD and the table? I am trying to cache all the data to avoid bad performance on the first query. Is that right? > Which stage of the job is slow? The query is run many times on one sqlContext, and each query execution takes 1 second. 2015-04-23 11:33 GMT+03:00 ayan guha : > Quick q

RE: Custom paritioning of DSTream

2015-04-23 Thread Evo Eftimov
You can use "transform", which yields the RDDs of the DStream; on each of those RDDs you can then apply partitionBy. transform also returns another DStream, while foreach doesn't. Btw, what do you mean re "foreach killing the performance by not distributing the workload"? Every function (provided
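A minimal sketch of that approach, assuming stream is a DStream of key/value pairs; the partition count is an assumption:

    import org.apache.spark.HashPartitioner

    val repartitioned = stream.transform { rdd =>
      rdd.partitionBy(new HashPartitioner(16))  // hash keys into 16 partitions per batch RDD
    }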

Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
Hi, can you share your Web UI screenshot depicting your task-level breakup? I can see many things that can be improved. 1. JavaRDD rdds = ...; rdds.cache(); -> this caching is not needed as you are not reading the RDD for any action. 2. Instead of collecting as a list, if you can save as a text file, it would be b

Re: problem with spark thrift server

2015-04-23 Thread Arush Kharbanda
Hi, what do you mean by "disable the driver"? What are you trying to achieve? Thanks Arush On Thu, Apr 23, 2015 at 12:29 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi, > I have a question about the Spark thrift server. I deployed Spark on YARN > and found that if the Spark driver

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to download and add this. Cheers On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer wrote: > Afternoon all, > > I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: > > `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipT

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
Hi TD, Some observations: 1. If I submit the application using the spark-submit tool with *client as deploy mode* it works fine with a single master and worker (driver, master and worker are running on the same machine) 2. If I submit the application using the spark-submit tool with client as deploy mode it *c

Re: Error in creating spark RDD

2015-04-23 Thread madhvi
On Thursday 23 April 2015 12:22 PM, Akhil Das wrote: Here's a complete scala example https://github.com/bbux-proteus/spark-accumulo-examples/blob/1dace96a115f29c44325903195c8135edf828c86/src/main/scala/org/bbux/spark/AccumuloMetadataCount.scala Thanks Best Regards On Thu, Apr 23, 2015 at 12:19

Re: Re: HiveContext setConf seems not stable

2015-04-23 Thread guoqing0...@yahoo.com.hk
Hi all, my understanding of this problem is that SQLConf will be overwritten by the Hive config in the initialization phase, when setConf(key: String, value: String) is called for the first time, as in the code snippets below, so it is correct later on. I'm not sure whether this is right; any points are welcome

Is a higher-res or vector version of Spark logo available?

2015-04-23 Thread Enno Shioji
My employer (adform.com) would like to use the Spark logo in a recruitment event (to indicate that we are using Spark in our company). I looked in the Spark repo (https://github.com/apache/spark/tree/master/docs/img) but couldn't find a vector format. Is a higher-res or vector format version avail

Re: Building Spark : Adding new DataType in Catalyst

2015-04-23 Thread zia_kayani
I've already tried UDT in Spark 1.2 and 1.3, but I encountered a Kryo serialization exception on joining, as tracked here; I've talked to Michael Armbrust about the exception, and he said "

ML regression - spark context dies without error

2015-04-23 Thread jamborta
Hi all, I have been testing Spark ML algorithms with bigger dataset, and ran into some problems with linear regression: It seems the executors stop without any apparent reason: 15/04/22 20:15:05 INFO BlockManagerInfo: Added rdd_12492_80 in memory on backend-node:48037 (size: 28.5 MB, free: 2.8 G

Contributors, read me! Updated Contributing to Spark wiki

2015-04-23 Thread Sean Owen
Following several discussions about how to improve the contribution process in Spark, I've overhauled the guide to contributing. Anyone who is going to contribute needs to read it, as it has more formal guidance about the process: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+S

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-23 Thread Christian S. Perone
All these warnings come from ALS iterations, from flatMap and also from aggregate, for instance the origin of the state where the flatMap is showing these warnings (w/ Spark 1.3.0, they are also shown in Spark 1.3.1): org.apache.spark.rdd.RDD.flatMap(RDD.scala:296) org.apache.spark.ml.recommendati

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
ok yes, I think I have narrowed it down to being a problem with driver memory settings. It looks like the application master/driver is not being launched with the settings specified: For the driver process on the main node I see "-XX:MaxPermSize=128m -Xms512m -Xmx512m" as options used to start the
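If it helps: on YARN the driver JVM is launched before any spark.driver.memory set in application code can take effect, so the setting usually has to be passed at submit time. A sketch, where the value and app name are assumptions:

    spark-submit --master yarn-client --driver-memory 4g your-app.py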

Is there a way to get the list of all jobs?

2015-04-23 Thread mkestemont
Hello, I am currently trying to monitor the progression of jobs. I created a class extending SparkListener, added a jobProgressListener to my sparkContext, and overrode the methods onTaskStart, onTaskEnd, onJobStart and onJobEnd, which leads to good results. Then, I would also like to monitor th

Re: How to debug Spark on Yarn?

2015-04-23 Thread Ted Yu
For step 2, you can pipe application log to a file instead of copy-pasting. Cheers > On Apr 22, 2015, at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > > I submit a spark app to YARN and i get these messages > > > > 15/04/22 22:45:04 INFO yarn.Client: Application report for > application_1429087638744

Streaming Kmeans usage in java

2015-04-23 Thread Jeetendra Gangele
Hello everyone, do we have a sample example of how to use streaming k-means clustering with Java? I have seen some example usage in Scala. Can anybody point me to a Java example? regards jeetendra

How to start Thrift JDBC server as part of standalone spark application?

2015-04-23 Thread Vladimir Grigor
Hello, I would like to export RDD/DataFrames via JDBC SQL interface from the standalone application for currently stable Spark v1.3.1. I found one way of doing it but it requires the use of @DeveloperAPI method HiveThriftServer2.startWithContext(sqlContext) Is there a better, production level ap
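For reference, a minimal sketch of the @DeveloperApi route mentioned above, assuming sc is an existing SparkContext and myDataFrame is a DataFrame assumed to exist:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    myDataFrame.registerTempTable("exported")        // hypothetical table name
    HiveThriftServer2.startWithContext(hiveContext)  // JDBC clients can now query "exported"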

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-23 Thread Sujee Maniyam
Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop 2.4 I tried this on 1.3.1-hadoop26 > sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") > val f = sc.textFile("s3n://bucket/file") > f.count No it can't find the im

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-23 Thread Ted Yu
NativeS3FileSystem class is in hadoop-aws jar. Looks like it was not on classpath. Cheers On Thu, Apr 23, 2015 at 7:30 AM, Sujee Maniyam wrote: > Thanks all... > > btw, s3n load works without any issues with spark-1.3.1-bulit-for-hadoop > 2.4 > > I tried this on 1.3.1-hadoop26 > > sc.hadoopCo
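One hedged way to put the class on the classpath for a Hadoop 2.6 build; the jar paths and versions are assumptions (aws-java-sdk is the usual companion dependency of hadoop-aws):

    spark-shell --jars /path/to/hadoop-aws-2.6.0.jar,/path/to/aws-java-sdk-1.7.4.jar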

Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Dean Wampler
I strongly recommend spawning a new process for the Spark jobs. Much cleaner separation. Your driver program won't be clobbered if the Spark job dies, etc. It can even watch for failures and restart. In the Scala standard library, the sys.process package has classes for constructing and interopera

dynamicAllocation & spark-shell

2015-04-23 Thread Michael Stone
If I enable dynamicAllocation and then use spark-shell or pyspark, things start out working as expected: running simple commands causes new executors to start and complete tasks. If the shell is left idle for a while, executors start getting killed off: 15/04/23 10:52:43 INFO cluster.YarnClien
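The idle kill is governed by the executor idle timeout. A sketch of the 1.3-era keys in spark-defaults.conf; the values shown are assumptions:

    spark.dynamicAllocation.enabled true
    spark.dynamicAllocation.minExecutors 2
    # seconds an executor may sit idle before being released
    spark.dynamicAllocation.executorIdleTimeout 600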

Spark + Hue

2015-04-23 Thread MrAsanjar .
Hi all, is there any good documentation on how to integrate Spark with Hue 3.7.x? Is the only way to install the Spark Job Server? Thanks in advance for your help

Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Anshul Singhle
Hi firemonk9, What you're doing looks interesting. Can you share some more details? Are you running the same spark context for each job, or are you running a separate spark context for each job? Does your system need sharing of RDDs across multiple jobs? If yes, how do you implement that? Also wh

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
Here it is. How do I access a broadcastVar in a function that's in another module (process_stuff.py below): Thanks, Vadim main.py --- from pyspark import SparkContext, SparkConf from pyspark.streaming import StreamingContext from pyspark.sql import SQLContext from process_stuff import myfunc

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Mohammed Omer
Hm, no I don't have that in my path. However, someone on the spark-csv project advised that since I could not get another package/example to work, that this might be a Spark / Yarn issue: https://github.com/databricks/spark-csv/issues/54 Thoughts? I'll open a ticket later this afternoon if the di

RE: Map Question

2015-04-23 Thread Ganelin, Ilya
You need to expose that variable the same way you'd expose any other variable in Python that you wanted to see across modules. As long as you share a spark context all will work as expected. http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable Sent with Good (w

Re: A Spark Group by is running forever

2015-04-23 Thread ๏̯͡๏
I have seen multiple blogs stating to use reduceByKey instead of groupByKey. Could someone please help me in converting the code below to use reduceByKey? Code (some Spark processing ...): val viEventsWithListingsJoinSpsLevelMetric: org.apache.spark.rdd.RDD[(com.ebay.ep.poc.spark.reporting.
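For illustration, a generic sketch of the conversion, using a hypothetical pairs: RDD[(K, Int)] rather than the exact types above:

    // groupByKey ships every value across the network before summing:
    val grouped = pairs.groupByKey().mapValues(_.sum)
    // reduceByKey combines values map-side first, shuffling far less data:
    val reduced = pairs.reduceByKey(_ + _)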

Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Using Spark streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-x’ files. When reading the nano-batch files and doing a distributed calculation my tasks run only on the machine where it was launched. I’m launching in “yarn-client” mode. The r

[Spark Streaming] Help with updateStateByKey()

2015-04-23 Thread allonsy
Hi everybody, I think I could use some help with the /updateStateByKey()/ JAVA method in Spark Streaming. *Context:* I have a /JavaReceiverInputDStream/ of /DataUpdate/ objects (du), where /DataUpdate/ mainly has 2 fields of interest (in my case), namely du.personId (an Integer) and du.cell.hashCode() (Int

Slower performance when bigger memory?

2015-04-23 Thread Shuai Zheng
Hi All, I am running some benchmarks on an r3.8xlarge instance. I have a cluster with one master (no executor on it) and one slave (r3.8xlarge). My job has 1000 tasks in stage 0. The r3.8xlarge has 244G memory and 32 cores. If I create 4 executors, each with 8 cores + 50G memory, each task will

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
Thanks Ilya. I am having trouble doing that. Can you give me an example? On Thu, Apr 23, 2015 at 12:06 PM, Ganelin, Ilya wrote: > You need to expose that variable the same way you'd expose any other > variable in Python that you wanted to see across modules. As long as you > share a spark con

Re: Slower performance when bigger memory?

2015-04-23 Thread Dean Wampler
JVMs often have significant GC overhead with heaps bigger than 64GB. You might try your experiments with configurations below this threshold. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe
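For example (all numbers are assumptions), splitting an r3.8xlarge into several sub-64GB executors rather than one huge heap:

    spark-submit --master yarn-client \
      --num-executors 4 --executor-cores 8 --executor-memory 48g \
      your-app.jar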

Re: Tasks run only on one machine

2015-04-23 Thread Jeetendra Gangele
Will you be able to paste code here? On 23 April 2015 at 22:21, Pat Ferrel wrote: > Using Spark streaming to create a large volume of small nano-batch input > files, ~4k per file, thousands of 'part-x' files. When reading the > nano-batch files and doing a distributed calculation my tasks r

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Sure var columns = mc.textFile(source).map { line => line.split(delimiter) } Here “source” is a comma delimited list of files or directories. Both the textFile and .map tasks happen only on the machine they were launched from. Later other distributed operations happen but I suspect if I can
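If the input really does arrive in too few splits, one hedged workaround is to force a wider spread before the heavy work; the partition counts here are assumptions:

    val columns = mc.textFile(source, 16)   // ask for at least 16 input splits
      .repartition(16)                      // shuffle the data across the cluster's workers
      .map { line => line.split(delimiter) }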

Re: Tasks run only on one machine

2015-04-23 Thread Sean Owen
Where are the file splits? meaning is it possible they were also (only) available on one node and that was also your driver? On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel wrote: > Sure > > var columns = mc.textFile(source).map { line => line.split(delimiter) } > > Here “source” is a comma delim

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Physically? Not sure; they were written using the nano-batch RDDs in a streaming job that is in a separate driver. The job is a Kafka consumer. Would that affect all derived RDDs? If so, is there something I can do to mix it up, or does Spark know best about execution speed here? On Apr 23, 201

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
They are in HDFS so available on all workers On Apr 23, 2015, at 10:29 AM, Pat Ferrel wrote: Physically? Not sure; they were written using the nano-batch RDDs in a streaming job that is in a separate driver. The job is a Kafka consumer. Would that affect all derived RDDs? If so is there somet

Re: dynamicAllocation & spark-shell

2015-04-23 Thread Cheolsoo Park
Hi, >> Attempted to request a negative number of executor(s) -663 from the cluster manager. Please specify a positive number! This is a bug in dynamic allocation. Here is the jira- https://issues.apache.org/jira/browse/SPARK-6954 Thanks! Cheolsoo On Thu, Apr 23, 2015 at 7:57 AM, Michael Stone

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Argh, I looked and there really isn’t that much data yet. There will be thousands but starting small. I bet this is just a total data size not requiring all workers thing—sorry, nevermind. On Apr 23, 2015, at 10:30 AM, Pat Ferrel wrote: They are in HDFS so available on all workers On Apr 23

Re: Slower performance when bigger memory?

2015-04-23 Thread Ted Yu
Shuai: Please take a look at: http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-whats-new-in-java-8/ > On Apr 23, 2015, at 10:18 AM, Dean Wampler wrote: > > JVM's often have significant GC overhead with heaps bigger than 64GB. You > might try your experiments

Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Shuai Zheng
Hi All, I use 1.3.1. When I have two DFs and join them on a key with the same name, afterwards I can't get the common key by name. Basically: select * from t1 inner join t2 on t1.col1 = t2.col1 And I am using DataFrame purely, with Spark SQLContext, not HiveContext. DataFrame df3 = df1.join(df2

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread N B
Thanks for the response, Conor. I tried with those settings and for a while it seemed like it was cleaning up shuffle files after itself. However, after exactly 5 hours later it started throwing exceptions and eventually stopped working again. A sample stack trace is below. What is curious about 5

Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Yin Huai
Hi Shuai, You can use "as" to create a table alias. For example, df1.as("df1"). Then you can use $"df1.col" to refer it. Thanks, Yin On Thu, Apr 23, 2015 at 11:14 AM, Shuai Zheng wrote: > Hi All, > > > > I use 1.3.1 > > > > When I have two DF and join them on a same name key, after that, I ca
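A short sketch of that pattern; the column names are assumptions:

    import org.apache.spark.sql.functions.col

    val joined = df1.as("a").join(df2.as("b"), col("a.col1") === col("b.col1"))
    joined.select(col("a.col1"), col("b.col2")).show()  // qualified names resolve the ambiguity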

RE: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Shuai Zheng
Got it. Thanks! :) From: Yin Huai [mailto:yh...@databricks.com] Sent: Thursday, April 23, 2015 2:35 PM To: Shuai Zheng Cc: user Subject: Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key Hi Shuai, You can use "as" to create a table alias. For

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread Tathagata Das
What was the state of your streaming application? Was it falling behind with a large increasing scheduling delay? TD On Thu, Apr 23, 2015 at 11:31 AM, N B wrote: > Thanks for the response, Conor. I tried with those settings and for a > while it seemed like it was cleaning up shuffle files after

Non-Deterministic Graph Building

2015-04-23 Thread hokiegeek2
Hi Everyone, I am running into a really weird problem that only one other person has reported to the best of my knowledge (and the thread never yielded a resolution). I build a GraphX Graph from an input EdgeRDD and VertexRDD via the Graph(VertexRDD,EdgeRDD) constructor. When I execute Graph.trip

Question regarding join with multiple columns with pyspark

2015-04-23 Thread Ali Bajwa
Hi experts, Sorry if this is a n00b question or has already been answered... Am trying to use the data frames API in python to join 2 dataframes with more than 1 column. The example I've seen in the documentation only shows a single column - so I tried this: Example code import pandas a

Getting error running MLlib example with new cluster

2015-04-23 Thread Su She
I had asked this question before, but wanted to ask again as I think it is related to my pom file or project setup. I have been trying on/off for the past month to try to run this MLlib example:

Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-23 Thread dusts66
I am trying to figure out python library management. So my question is: where do third-party Python libraries (e.g. numpy, scipy, etc.) need to exist if I'm running a spark job via 'spark-submit' against my cluster in 'yarn client' mode? Do the libraries need to only exist on the client (i.e. the serv

Getting error running MLlib example with new cluster

2015-04-23 Thread Su She
Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom file or project setup. Really appreciate the help! I have been trying on/off for the past month to try to run this MLlib example: https://git

gridsearch - python

2015-04-23 Thread Pagliari, Roberto
Can anybody point me to an example, if available, about gridsearch with python? Thank you,

Re: problem writing to s3

2015-04-23 Thread Daniel Mahler
Hi Akhil I can confirm that the problem goes away when jsonRaw and jsonClean are in different s3 buckets. thanks Daniel On Thu, Apr 23, 2015 at 1:27 AM, Akhil Das wrote: > Can you try writing to a different S3 bucket and confirm that? > > Thanks > Best Regards > > On Thu, Apr 23, 2015 at 12:11

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Hao Ren
Should I repost this to the dev list?

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Koert Kuipers
Because CompactBuffer is considered an implementation detail. It is also not public for the same reason. On Thu, Apr 23, 2015 at 6:46 PM, Hao Ren wrote: > Should I repost this to the dev list?

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Dean Wampler
I wasn't involved in this decision ("I just make the fries"), but CompactBuffer is designed for relatively small data sets that at least fit in memory. It's more or less an Array. In principle, returning an iterator could hide the actual data structure that might be needed to hold a much bigger dat

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Corey Nolet
If you return an iterable, you are not tying the API to a CompactBuffer. Someday, the data could be fetched lazily and the API would not have to change. On Apr 23, 2015 6:59 PM, "Dean Wampler" wrote: > I wasn't involved in this decision ("I just make the fries"), but > CompactBuffer is designed fo

Understanding Spark/MLlib failures

2015-04-23 Thread aleverentz
[My apologies if this is a re-post. I wasn't subscribed the first time I sent this message, and I'm hoping this second message will get through.] I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks. In a fit of blind optimism, I decided to try running MLlib’s Principal Componen

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Burak Yavuz
Hi Andrew, I observed similar behavior under high GC pressure, when running ALS. What happened to me was that, there would be very long Full GC pauses (over 600 seconds at times). These would prevent the executors from sending heartbeats to the driver. Then the driver would think that the executor

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Reza Zadeh
Hi Andrew, The .principalComponents feature of RowMatrix is currently constrained to tall and skinny matrices. Your matrix is barely above the skinny requirement (10k columns), though the number of rows is fine. What are you looking to do with the principal components? If unnormalized PCA is OK f
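For unnormalized PCA on a tall-and-skinny matrix, a minimal sketch, assuming rows: RDD[Vector] already exists and k = 10 is an assumption:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)                // rows: RDD[Vector], assumed to exist
    val pc = mat.computePrincipalComponents(10)  // local n x 10 matrix of components
    val projected = mat.multiply(pc)             // project the rows onto the components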

Re: gridsearch - python

2015-04-23 Thread Punyashloka Biswal
https://issues.apache.org/jira/browse/SPARK-7022. Punya On Thu, Apr 23, 2015 at 5:47 PM Pagliari, Roberto wrote: > Can anybody point me to an example, if available, about gridsearch with > python? > > > > Thank you, > > >

Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Night Wolf
Hi guys, Having a problem building a DataFrame in Spark SQL from a JDBC data source when running with --master yarn-client and adding the JDBC driver JAR with --jars. If I run with a local[*] master all works fine. ./bin/spark-shell --jars /tmp/libs/mysql-jdbc.jar --master yarn-client sqlContext.lo

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Marcelo Vanzin
You'd have to use spark.{driver,executor}.extraClassPath to modify the system class loader. But that also means you have to manually distribute the jar to the nodes in your cluster, into a common location. On Thu, Apr 23, 2015 at 6:38 PM, Night Wolf wrote: > Hi guys, > > Having a problem build a
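A sketch of that, reusing the mysql-jdbc.jar from above; the local path is an assumption and the jar must exist at that path on every node:

    spark-shell --master yarn-client \
      --conf spark.driver.extraClassPath=/opt/jars/mysql-jdbc.jar \
      --conf spark.executor.extraClassPath=/opt/jars/mysql-jdbc.jar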

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Night Wolf
Thanks Marcelo, can this be a path on HDFS? On Fri, Apr 24, 2015 at 11:52 AM, Marcelo Vanzin wrote: > You'd have to use spark.{driver,executor}.extraClassPath to modify the > system class loader. But that also means you have to manually > distribute the jar to the nodes in your cluster, into a c

Re: Spark SQL - Setting YARN Classpath for primordial class loader

2015-04-23 Thread Marcelo Vanzin
No, those have to be local paths. On Thu, Apr 23, 2015 at 6:53 PM, Night Wolf wrote: > Thanks Marcelo, can this be a path on HDFS? > > On Fri, Apr 24, 2015 at 11:52 AM, Marcelo Vanzin > wrote: >> >> You'd have to use spark.{driver,executor}.extraClassPath to modify the >> system class loader. Bu

Re: Re: problem with spark thrift server

2015-04-23 Thread guoqing0...@yahoo.com.hk
Thanks for your reply. I would like to use the Spark Thriftserver as a JDBC SQL interface, with the Spark application running on YARN, but the application was FINISHED when the Thriftserver crashed, and all the cached tables were lost. Thriftserver start command: start-thriftserver.sh --master yarn --exec

spark 1.3.0 strange log message

2015-04-23 Thread Henry Hung
Dear All, When using spark 1.3.0 spark-submit and directing out and err to a log file, I saw some strange lines inside that look like this: [Stage 0:>(0 + 2) / 120] [Stage 0:>(2 + 2)

Re: spark 1.3.0 strange log message

2015-04-23 Thread Terry Hole
Use this in spark conf: spark.ui.showConsoleProgress=false Best Regards, On Fri, Apr 24, 2015 at 11:23 AM, Henry Hung wrote: > Dear All, > > > > When using spark 1.3.0 spark-submit with directing out and err to a log > file, I saw some strange lines inside that looks like this: > > [Stage 0:>
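For example, passed at submit time (the application jar name is a placeholder):

    spark-submit --conf spark.ui.showConsoleProgress=false your-app.jar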

Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread guoqing0...@yahoo.com.hk
Does Spark 1.3.1 support building with Scala 2.8? And can it be integrated with kafka_2.8.0-0.8.0 if built with Scala 2.10? Thanks.

Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread madhu phatak
Hi, AFAIK it's only built with 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it work. Regards, Madhukara Phatak http://datamantra.io/ On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Is the Spark-1.3.1 support build with scala 2.
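If it helps, a hedged sbt sketch of pulling in the matching Kafka integration module:

    // %% resolves to spark-streaming-kafka_2.10 on a Scala 2.10 build
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.3.1"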

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
*bump* On Thu, Apr 23, 2015 at 3:46 PM, Sourav Chandra < sourav.chan...@livestream.com> wrote: > Hi TD, > > Some observations: > > 1. If I submit the application using spark-submit tool with *client as > deploy mode* it works fine with single master and worker (driver, master > and worker are run

Re: Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread guoqing0...@yahoo.com.hk
Thank you very much for your suggestion. Regards, From: madhu phatak Date: 2015-04-24 13:06 To: guoqing0...@yahoo.com.hk CC: user Subject: Re: Is the Spark-1.3.1 support build with scala 2.8 ? Hi, AFAIK it's only build with 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it work

RE: gridsearch - python

2015-04-23 Thread Pagliari, Roberto
I know grid search with cross-validation is not supported. However, I was wondering if there is something available for the time being. Thanks, From: Punyashloka Biswal [mailto:punya.bis...@gmail.com] Sent: Thursday, April 23, 2015 9:06 PM To: Pagliari, Roberto; user@spark.apache.org Subject:

RE: Re: problem with spark thrift server

2015-04-23 Thread Cheng, Hao
Hi, can you describe a little bit how the ThriftServer crashed, or steps to reproduce that? It’s probably a bug of ThriftServer. Thanks, From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk] Sent: Friday, April 24, 2015 9:55 AM To: Arush Kharbanda Cc: user Subject: Re: Re: problem wit

Re: JAVA_HOME problem

2015-04-23 Thread sourabh chaki
I am also facing the same problem with spark 1.3.0 and yarn-client and yarn-cluster mode. Launching yarn container failed and this is the error in stderr: Container: container_1429709079342_65869_01_01

Re: problem writing to s3

2015-04-23 Thread Akhil Das
You should probably open a JIRA issue with this, I think. Thanks Best Regards On Fri, Apr 24, 2015 at 3:27 AM, Daniel Mahler wrote: > Hi Akhil > > I can confirm that the problem goes away when jsonRaw and jsonClean are in > different s3 buckets. > > thanks > Daniel > > On Thu, Apr 23, 2015 at 1:

Re: JAVA_HOME problem

2015-04-23 Thread Akhil Das
Isn't this related to this https://issues.apache.org/jira/browse/SPARK-6681 Thanks Best Regards On Fri, Apr 24, 2015 at 11:40 AM, sourabh chaki wrote: > I am also facing the same problem with spark 1.3.0 and yarn-client and > yarn-cluster mode. Launching yarn container failed and this is the er