Re: EC2 cluster created by spark using old HDFS 1.0

2015-03-22 Thread Akhil Das
That's a Hadoop version incompatibility issue; you need to make sure everything runs on the same version. Thanks Best Regards On Sat, Mar 21, 2015 at 1:24 AM, morfious902002 wrote: > Hi, > I created a cluster using spark-ec2 script. But it installs HDFS version > 1.0. I would like to use this c

Re: Buffering for Socket streams

2015-03-22 Thread Akhil Das
You can try playing with spark.streaming.blockInterval so that it won't consume a lot of data; the default value is 200ms. Thanks Best Regards On Fri, Mar 20, 2015 at 8:49 PM, jamborta wrote: > Hi all, > > We are designing a workflow where we try to stream local files to a Socket > streamer, that wou
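
A minimal sketch of setting that property for a receiver-based socket stream; the app name, host, port, batch interval, and block interval are illustrative:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("SocketStreamExample")
    .set("spark.streaming.blockInterval", "1000") // default is 200 (ms); larger blocks mean fewer blocks/tasks per batch
  val ssc = new StreamingContext(conf, Seconds(10))
  val lines = ssc.socketTextStream("localhost", 9999)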

Re: How to handle under-performing nodes in the cluster

2015-03-22 Thread Akhil Das
It seems that node isn't being allocated enough tasks; try increasing your level of parallelism or doing a manual repartition so that every node gets an even share of tasks to operate on. Thanks Best Regards On Fri, Mar 20, 2015 at 8:05 PM, Yiannis Gkoufas wrote: > Hi all, > > I have 6 nodes in the clus
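
A minimal sketch of both suggestions; the RDD name and partition count are illustrative (a common rule of thumb is 2-4 partitions per core in the cluster):

  // full shuffle that evens out partition sizes across the cluster
  val balanced = skewedRdd.repartition(48)

  // or raise the default parallelism used by shuffle operations when building the context:
  // new SparkConf().set("spark.default.parallelism", "48")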

Re: How to check that a dataset is sorted after it has been written out?

2015-03-22 Thread Akhil Das
One approach would be to repartition the whole dataset into 1 partition (a costly operation, but it will give you a single file). Also, you could try using zipWithIndex before writing it out. Thanks Best Regards On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert < m_albert...@yahoo.com.invalid> wrote: > Greet
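
A minimal sketch of the zipWithIndex idea, assuming sortedKeys: RDD[Long] holds the sort key read back from the written output (the name and key type are illustrative):

  val indexed  = sortedKeys.zipWithIndex().map { case (k, i) => (i, k) }   // (position, key)
  val previous = indexed.map { case (i, k) => (i + 1, k) }                 // key at position - 1
  val violations = indexed.join(previous)
    .filter { case (_, (current, prev)) => current < prev }
    .count()                                                               // 0 means globally sorted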

Re: Spark streaming alerting

2015-03-22 Thread Akhil Das
What do you mean you can't send it directly from the Spark workers? Here's a simple approach you could take: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors: " + rdd.count())) And the alert() function could be anything
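
A slightly expanded sketch of the same pattern, assuming an existing StreamingContext named ssc; the alert() body here just logs, but it could post to email, Kafka, a webhook, etc.:

  import org.apache.log4j.Logger

  def alert(msg: String): Unit = Logger.getLogger("alerts").warn(msg)  // stand-in implementation

  val data = ssc.textFileStream("sigmoid/")
  data.filter(_.contains("ERROR")).foreachRDD { rdd =>
    val errors = rdd.count()                     // the count runs on the workers
    if (errors > 0) alert("Errors: " + errors)   // alert() runs on the driver
  }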

Re: SocketTimeout only when launching lots of executors

2015-03-22 Thread Akhil Das
It seems your driver is getting flooded by that many executors and hence it times out. There are some configuration options like spark.akka.timeout etc. that you could try playing with. More information is available here: http://spark.apache.org/docs/latest/configuration.html Thanks Best
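
A minimal sketch of raising those timeouts when building the context; these are Spark 1.x Akka-era settings and the values are illustrative:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.akka.timeout", "300")      // seconds
    .set("spark.akka.askTimeout", "300")   // seconds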

Re: How to use DataFrame with MySQL

2015-03-22 Thread gavin zhang
OK, I found what the problem was: it couldn't work with mysql-connector-5.0.8. I updated the connector version to 5.1.34 and it worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-tp22178p22182.html Sent from the Apache Spar

Re: Spark sql thrift server slower than hive

2015-03-22 Thread Arush Kharbanda
A basic change needed by Spark is setting the executor memory, which defaults to 512MB. On Mon, Mar 23, 2015 at 10:16 AM, Denny Lee wrote: > How are you running your spark instance out of curiosity? Via YARN or > standalone mode? When connecting Spark thriftserver to the Spark servic
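
A minimal sketch of raising it; the 4g figure is illustrative and should be sized to the workload and the machines:

  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.executor.memory", "4g")
  // or, equivalently, on the command line: spark-submit --executor-memory 4g ...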

Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote: > so finally i can resort to: > rdd.saveAsObjectFile(...) > sc.objectFile(...) > but that seems like a rather broken abstraction. > > This seems like a fine solution to me.

Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
How are you running your spark instance out of curiosity? Via YARN or standalone mode? When connecting Spark thriftserver to the Spark service, have you allocated enough memory and CPU when executing with spark? On Sun, Mar 22, 2015 at 3:39 AM fanooos wrote: > We have cloudera CDH 5.3 installe

SocketTimeout only when launching lots of executors

2015-03-22 Thread Tianshuo Deng
Hi, Spark users. When running a Spark application with lots of executors (300+), I see the following failures: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at j

Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Ted Yu
I thought of form #1, but it looks like when there are many fields, form #2 is cleaner. Cheers On Sun, Mar 22, 2015 at 8:14 PM, Cheng Lian wrote: > You need either > > .map { row => > (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...) > } > > or > > .map { case Row(f0: Floa

Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Cheng Lian
You need either .map { row => (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...) } or .map { case Row(f0: Float, f1: Float, ...) => (f0, f1) } On 3/23/15 9:08 AM, Minnow Noir wrote: I'm following some online tutorial written in Python and trying to convert a Spark SQL tabl
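
A self-contained sketch of both options, assuming a SQLContext named sqlContext and a hypothetical "products" table whose first two columns are Floats (the table and column names are illustrative):

  import org.apache.spark.sql.Row

  val products = sqlContext.table("products")

  // option 1: positional access with casts
  val rdd1 = products.map(row => (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float]))

  // option 2: pattern matching on Row
  val rdd2 = products.map { case Row(price: Float, weight: Float) => (price, weight) }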

Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Minnow Noir
I'm following some online tutorial written in Python and trying to convert a Spark SQL table object to an RDD in Scala. The Spark SQL just loads a simple table from a CSV file. The tutorial says to convert the table to an RDD. The Python is products_rdd = sqlContext.table("products").map(lambda

spark disk-to-disk

2015-03-22 Thread Koert Kuipers
i would like to use spark for some algorithms where i make no attempt to work in memory, so read from hdfs and write to hdfs for every step. of course i would like every step to only be evaluated once. and i have no need for spark's RDD lineage info, since i persist to reliable storage. the troubl

Re: Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread Michael Armbrust
Please open a JIRA; we added the info to Row that will allow this to happen, but we still need to provide the methods you are asking for. I'll add that this does work today in Python (i.e. row.columnName). On Sun, Mar 22, 2015 at 12:40 AM, amghost wrote: > I would like to retrieve column value from S

Re: DataFrame saveAsTable - partitioned tables

2015-03-22 Thread Michael Armbrust
Not yet. This is on the roadmap for Spark 1.4. On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar wrote: > Hi > > I wanted to store DataFrames as partitioned Hive tables. Is there a way to > do this via the saveAsTable call. The set of options does not seem to be > documented. > > def > saveAsTa

Re: DataFrame saveAsTable - partitioned tables

2015-03-22 Thread Michael Armbrust
Note you can use HiveQL syntax for creating dynamically partitioned tables though. On Sun, Mar 22, 2015 at 1:29 PM, Michael Armbrust wrote: > Not yet. This is on the roadmap for Spark 1.4. > > On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar > wrote: > >> Hi >> >> I wanted to store DataFrames
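
A minimal sketch of that HiveQL route, assuming a HiveContext bound to sqlContext and hypothetical table and column names:

  sqlContext.sql("SET hive.exec.dynamic.partition = true")
  sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
  sqlContext.sql("CREATE TABLE events_partitioned (id INT, payload STRING) PARTITIONED BY (dt STRING)")
  sqlContext.sql("INSERT OVERWRITE TABLE events_partitioned PARTITION (dt) SELECT id, payload, dt FROM events_staging")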

Re: How to use DataFrame with MySQL

2015-03-22 Thread Michael Armbrust
Can you try adding "driver" -> "com.mysql.jdbc.Driver"? This causes the class to get loaded both locally and on the workers so that it can register with JDBC. On Sun, Mar 22, 2015 at 7:32 AM, gavin zhang wrote: > OK, I have known that I could use jdbc connector to create DataFrame with > this comm
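
A minimal sketch of the original load() call with that extra key added; the URL, credentials, and table name are taken from the question and purely illustrative:

  val jdbcDF = sqlContext.load("jdbc", Map(
    "url"     -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
    "driver"  -> "com.mysql.jdbc.Driver",
    "dbtable" -> "video"))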

Re: join two DataFrames, same column name

2015-03-22 Thread Michael Armbrust
You can include * and a column alias in the same select clause var df1 = sqlContext.sql("select *, column_id AS table1_id from table1") I'm also hoping to resolve SPARK-6376 before Spark 1.3.1 which will let you do something like: var df1 = sqlCo
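
A minimal sketch of aliasing the shared column and then joining, assuming two hypothetical tables that both have a column_id column:

  val df1 = sqlContext.sql("SELECT *, column_id AS table1_id FROM table1")
  val df2 = sqlContext.sql("SELECT * FROM table2")
  val joined = df1.join(df2, df1("table1_id") === df2("column_id"))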

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Burak Yavuz
Did you build Spark with: -Pnetlib-lgpl? Ref: https://spark.apache.org/docs/latest/mllib-guide.html Burak On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu wrote: > How about pointing LD_LIBRARY_PATH to native lib folder ? > > You need Spark 1.2.0 or higher for the above to work. See SPARK-1719 > > Chee

Re: lower&upperBound not working/spark 1.3

2015-03-22 Thread Ted Yu
I went over JDBCRelation#columnPartition() but didn't find an obvious clue (you can add more logging to confirm that the partitions were generated correctly). Looks like the issue may be somewhere else. Cheers On Sun, Mar 22, 2015 at 12:47 PM, Marek Wiewiorka wrote: > ...I even tried setting uppe

Re: lower&upperBound not working/spark 1.3

2015-03-22 Thread Marek Wiewiorka
...I even tried setting upper/lower bounds to the same value like 1 or 10 with the same result. cs_id is a column with cardinality ~5*10^6, so this is not the case here. Regards, Marek 2015-03-22 20:30 GMT+01:00 Ted Yu : > From javadoc of JDBCRelation#columnPartition(): >* Given a partition

Re: lower&upperBound not working/spark 1.3

2015-03-22 Thread Ted Yu
From the javadoc of JDBCRelation#columnPartition(): * Given a partitioning schematic (a column of integral type, a number of * partitions, and upper and lower bounds on the column's value), generate In your example, 1 and 1 are the values used as bounds on the cs_id column. Looks like all the values in th

Re: Load balancing

2015-03-22 Thread Mohit Anchlia
Posting my question again :) Thanks for the pointer. Looking at the description below from the site, it looks like in Spark the block size is not fixed; it's determined by the block interval, and in fact for the same batch you could have different block sizes. Did I get it right? - Another para

Re: How Does aggregate work

2015-03-22 Thread Ted Yu
I assume spark.default.parallelism is 4 in the VM Ashish was using. Cheers

lower&upperBound not working/spark 1.3

2015-03-22 Thread Marek Wiewiorka
Hi All - I'm trying to use the new SQLContext API for populating a DataFrame from a JDBC data source, like this: val jdbcDF = sqlContext.jdbc(url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table = "se_staging.exp_table3" ,columnName="cs_id",lowerBound=1 ,upperBound = 1, numPart
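
For reference, a minimal sketch of the Spark 1.3 partitioned-JDBC call; the bounds and partition count are illustrative and should roughly cover the range of cs_id:

  val jdbcDF = sqlContext.jdbc(
    url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111",
    table = "se_staging.exp_table3",
    columnName = "cs_id",
    lowerBound = 1L,
    upperBound = 5000000L,   // roughly the column's cardinality mentioned later in the thread
    numPartitions = 12)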

How to check that a dataset is sorted after it has been written out? [repost]

2015-03-22 Thread Michael Albert
Greetings! [My apologies for this repost; I'm not certain that the first message made it to the list.] I sorted a dataset in Spark and then wrote it out in avro/parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first "partit

Re: Error while installing Spark 1.3.0 on local machine

2015-03-22 Thread Dean Wampler
Is there any particular reason you're not just downloading a build from http://spark.apache.org/downloads.html ? Even if you aren't using Hadoop, any of those builds will work. If you want to build from source, the Maven build is more reliable. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Editio

Re: can distinct transform applied on DStream?

2015-03-22 Thread Dean Wampler
aDstream.transform(_.distinct()) will only make the elements of each individual RDD in the DStream distinct, not the whole DStream globally. Is that what you're seeing? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe
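
A minimal sketch contrasting the two scopes; the window and slide durations are illustrative, and even the windowed version only de-duplicates within each window, not across the whole stream:

  import org.apache.spark.streaming.{Minutes, Seconds}

  val perBatchDistinct = aDstream.transform(_.distinct())
  val windowedDistinct = aDstream.window(Minutes(10), Seconds(10)).transform(_.distinct())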

Re: How Does aggregate work

2015-03-22 Thread Dean Wampler
2 is added every time the combine function (the final partition aggregator) is called. The result of summing the elements across partitions is 9, of course. If you force a single partition (using spark-shell in local mode): scala> val data = sc.parallelize(List(2,3,4),1) scala> data.aggregate(0)((x,y) => x+y,(x,y) => 2+x
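
A worked trace of how 17 comes out, assuming the combine op was (x, y) => 2 + x + y and spark.default.parallelism was 4 (as suggested elsewhere in the thread):

  val data = sc.parallelize(List(2, 3, 4))   // with 4 partitions: [], [2], [3], [4]
  data.aggregate(0)((x, y) => x + y, (x, y) => 2 + x + y)
  // per-partition sums: 0, 2, 3, 4  (total 9)
  // the combine op folds them into the zero value, adding 2 on each of the 4 merges:
  // comb(0,0)=2, comb(2,2)=6, comb(6,3)=11, comb(11,4)=17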

How Does aggregate work

2015-03-22 Thread ashish.usoni
Hi, I am not able to understand how the aggregate function works. Can someone please explain how the result below came about? I am running Spark using the Cloudera VM. The result below is 17, but I am not able to find out how it is calculated. val data = sc.parallelize(List(2,3,4)) data.aggregate(0)((x,y) =

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Ted Yu
How about pointing LD_LIBRARY_PATH to native lib folder ? You need Spark 1.2.0 or higher for the above to work. See SPARK-1719 Cheers On Sun, Mar 22, 2015 at 4:02 AM, Xi Shen wrote: > Hi Ted, > > I have tried to invoke the command from both cygwin environment and > powershell environment. I st

How to use DataFrame with MySQL

2015-03-22 Thread gavin zhang
OK, I know that I can use the JDBC connector to create a DataFrame with this command: val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456", "dbtable" -> "video")) But I got this error: java.sql.SQLException: No suitable driver fo

Re: ArrayIndexOutOfBoundsException in ALS.trainImplicit

2015-03-22 Thread Sabarish Sasidharan
My bad. This was an outofmemory disguised as something else. Regards Sab On Sun, Mar 22, 2015 at 1:53 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > I am consistently running into this ArrayIndexOutOfBoundsException issue > when using trainImplicit. I have tried changing the

Re: Reducing Spark's logging verbosity

2015-03-22 Thread Emre Sevinc
Hello Edmon, Does the following help? http://stackoverflow.com/questions/27248997/how-to-suppress-spark-logging-in-unit-tests/2736#2736 -- Emre Sevinç http://www.bigindustries.be On Mar 22, 2015 1:44 AM, "Edmon Begoli" wrote: > Hi, > Does anyone have concrete recommendations how to red
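
If the goal is just to quiet Spark's own loggers, a minimal sketch that is often used in test setup (adjusting conf/log4j.properties achieves the same thing); the logger names and levels are illustrative:

  import org.apache.log4j.{Level, Logger}

  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("akka").setLevel(Level.WARN)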

Re: converting DStream[String] into RDD[String] in spark streaming

2015-03-22 Thread Sean Owen
On Sun, Mar 22, 2015 at 8:43 AM, deenar.toraskar wrote: > 1) if there are no sliding window calls in this streaming context, will > there just one file written per interval? As many files as there are partitions will be written in each interval. > 2) if there is a sliding window call in the same

Re: How to set Spark executor memory?

2015-03-22 Thread Xi Shen
OK, I actually got the answer days ago from StackOverflow, but I did not check it :( When running in "local" mode, to set the executor memory: - when using spark-submit, use "--driver-memory" - when running as a Java application, e.g. executing from an IDE, set the "-Xmx" JVM option Thanks, David

Re: How to do nested foreach with RDD

2015-03-22 Thread Xi Shen
Hi Reza, Yes, I just found RDD.cartesian(). Very useful. Thanks, David On Sun, Mar 22, 2015 at 5:08 PM Reza Zadeh wrote: > You can do this with the 'cartesian' product method on RDD. For example: > > val rdd1 = ... > val rdd2 = ... > > val combinations = rdd1.cartesian(rdd2).filter{ case (a,b
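
A tiny end-to-end sketch of the cartesian pattern; the inputs and the filter predicate are illustrative:

  val rdd1 = sc.parallelize(1 to 3)
  val rdd2 = sc.parallelize(Seq("a", "b"))
  val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a % 2 == 1 }
  combinations.collect()   // Array((1,a), (1,b), (3,a), (3,b))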

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Xi Shen
Hi Ted, I have tried to invoke the command from both cygwin environment and powershell environment. I still get the messages: 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implem

Spark sql thrift server slower than hive

2015-03-22 Thread fanooos
We have Cloudera CDH 5.3 installed on one machine. We are trying to use the Spark SQL thrift server to execute some analysis queries against a Hive table. Without any changes in the configuration, we run the following query on both Hive and the Spark SQL thrift server: select * from tableName; The time

Re: Load balancing

2015-03-22 Thread Jeffrey Jedele
Hi Mohit, please make sure you use the "Reply to all" button and include the mailing list, otherwise only I will get your message ;) Regarding your question: Yes, that's also my understanding. You can partition streaming RDDs only by time intervals, not by size. So depending on your incoming rate,

Re: converting DStream[String] into RDD[String] in spark streaming

2015-03-22 Thread deenar.toraskar
Sean, DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for each interval: def saveAsTextFiles(prefix: String, suffix: String = "") { val saveFunc = (rdd: RDD[T], time: Time) => { val file = rddToFileName(prefix, suffix, time) rdd.saveAsTextFile(file) }

Re: Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread Yanbo Liang
If you use the latest version Spark 1.3, you can use the DataFrame API like val results = sqlContext.sql("SELECT name FROM people") results.select("name").show() 2015-03-22 15:40 GMT+08:00 amghost : > I would like to retrieve column value from Spark SQL query result. But > currently it seems tha

Re: 'nested' RDD problem, advise needed

2015-03-22 Thread Victor Tso-Guillen
Something like this? (2 to alphabetLength toList).map(shift => Object.myFunction(inputRDD, shift).map(v => shift -> v).foldLeft(Object.myFunction(inputRDD, 1).map(v => 1 -> v))(_ union _) which is an RDD[(Int, Char)] The problem is that you can't play with RDDs inside of RDDs. The recursive structur

Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread amghost
I would like to retrieve column values from a Spark SQL query result. But currently it seems that Spark SQL only supports retrieving by index: val results = sqlContext.sql("SELECT name FROM people") results.map(t => "Name: " + t(0)).collect().foreach(println) I think it will be much more convenient

DataFrame saveAsTable - partitioned tables

2015-03-22 Thread deenar.toraskar
Hi, I wanted to store DataFrames as partitioned Hive tables. Is there a way to do this via the saveAsTable call? The set of options does not seem to be documented. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit (Scala-specific) Creates a tabl