Re: FetchFailedException during shuffle

2015-03-27 Thread Akhil Das
What operation are you doing? I'm assuming you have enabled RDD compression and you have an empty stream that it tries to uncompress (as seen from the exceptions). Thanks Best Regards On Fri, Mar 27, 2015 at 7:15 AM, Chen Song wrote: > Using spark 1.3.0 on cdh5.1.0, I was running a fetch
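A hedged aside for readers: one way to test the compression hypothesis is to rerun the job with RDD compression explicitly disabled (spark.rdd.compress is the standard setting; the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.rdd.compress defaults to false; setting it explicitly rules out an
    // enabled-compression / empty-stream interaction during shuffle fetches
    val conf = new SparkConf()
      .setAppName("fetch-failed-repro")
      .set("spark.rdd.compress", "false")
    val sc = new SparkContext(conf)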

Re: Serialization Problem in Spark Program

2015-03-27 Thread Akhil Das
Awesome. Thanks Best Regards On Fri, Mar 27, 2015 at 7:26 AM, donhoff_h <165612...@qq.com> wrote: > Hi, Akhil > > Yes, that's where the problem lies. Thanks very much for pointing out my mistake. > > -- Original -- > *From: * "Akhil Das"; > *Send time:* Thursday, Mar 26,

Re: RDD Exception Handling

2015-03-27 Thread Akhil Das
Like this?

    val krdd = testrdd.map(x => {
      try {
        var key = ""
        val tmp_tocks = x.split(sep1)(0)
        (key, x)
      } catch {
        case e: Exception =>
          println("Exception!! => " + e + "|||KS1 " + x)
          (null, x)
      }
    })

Thanks Best Regards

Re: Error in creating log directory

2015-03-27 Thread Akhil Das
You need to set proper permissions for the directory :/user/spark/applicationHistory in your local file system. Thanks Best Regards On Fri, Mar 27, 2015 at 2:18 AM, pzilaro wrote: > I get the following error message when I start the pyspark shell. > The config has the following settings- > # spark

saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
Hi, The behaviour is the same for me in Scala and Python, so posting here in Python. When I use DataFrame.saveAsTable with the path option, I expect an external Hive table to be created at the specified path. Specifically, when I call: >>> df.saveAsTable(..., path="/tmp/test") I expect an exter

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
Since Hive and Spark SQL internally use HDFS and the Hive metastore, the only thing you want to change is the processing engine. You can try to bring your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml (ensure that the hive-site.xml captures the metastore connection details). It's a hack, i havnt tr

Re: Parallel actions from driver

2015-03-27 Thread Aram Mkrtchyan
Thanks Sean, It works with Scala's parallel collections. On Thu, Mar 26, 2015 at 11:35 PM, Sean Owen wrote: > You can do this much more simply, I think, with Scala's parallel > collections (try .par). There's nothing wrong with doing this, no. > > Here, something is getting caught in your closu
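For readers, a minimal sketch of the .par approach Sean describes (rdd1, rdd2, rdd3 are stand-ins for RDDs already built in your driver):

    // each action runs in its own driver thread; Spark schedules the jobs concurrently
    val counts = Seq(rdd1, rdd2, rdd3).par.map(rdd => rdd.count()).seq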

Re: Parallel actions from driver

2015-03-27 Thread Harut Martirosyan
This is exactly my case also, it worked, thanks Sean. On 26 March 2015 at 23:35, Sean Owen wrote: > You can do this much more simply, I think, with Scala's parallel > collections (try .par). There's nothing wrong with doing this, no. > > Here, something is getting caught in your closure, maybe >

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Ondrej Smola
It happens only when StorageLevel is used with replication (StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2); StorageLevel.MEMORY_ONLY and StorageLevel.MEMORY_AND_DISK work, so the problem must clearly be somewhere between Mesos and Spark. From console I see that spark is trying to replicate

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Yanbo Liang
"saveAsTable" will use the default data source configured by spark.sql.sources.default. def saveAsTable(tableName: String): Unit = { saveAsTable(tableName, SaveMode.ErrorIfExists) } It can not set "path" if I understand correct. 2015-03-27 15:45 GMT+08:00 Tom Walwyn : > Hi, > > The behavi

failed to launch workers on spark

2015-03-27 Thread mas
Hi all! I am trying to install spark on my standalone machine. I am able to run the master, but when I try to run the slaves it gives me the following error. Any help in this regard will be highly appreciated. _ localhost: failed to launch org

Re: failed to launch workers on spark

2015-03-27 Thread Noorul Islam K M
mas writes: > Hi all! > I am trying to install spark on my standalone machine. I am able to run the > master but when i try to run the slaves it gives me following error. Any > help in this regard will highly be appreciated. > _ > local

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
We can set a path; refer to the unit tests. For example:

    df.saveAsTable("savedJsonTable", "org.apache.spark.sql.json", "append", path=tmpPath)

Investigating some more, I found that the table is being created at the specifie
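For reference, a sketch of the equivalent call against the Scala 1.3 API (table name and path are illustrative):

    import org.apache.spark.sql.SaveMode

    // the source + mode + options overload of saveAsTable
    df.saveAsTable("savedJsonTable", "org.apache.spark.sql.json", SaveMode.Append,
      Map("path" -> "/tmp/test"))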

Error in Delete Table

2015-03-27 Thread Masf
Hi. In HiveContext, when I put this statement "DROP TABLE IF EXISTS TestTable" If TestTable doesn't exist, spark returns an error: ERROR Hive: NoSuchObjectException(message:default.TestTable table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
Another follow-up: saveAsTable works as expected when running on hadoop cluster with Hive installed. It's just locally that I'm getting this strange behaviour. Any ideas why this is happening? Kind Regards. Tom On 27 March 2015 at 11:29, Tom Walwyn wrote: > We can set a path, refer to the unit

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread ๏̯͡๏
I did copy hive-conf.xml from the Hive installation into spark-home/conf. It does have all the metastore connection details: host, username, password, driver and others. Snippet == javax.jdo.option.ConnectionURL jdbc:mysql://host.vip.company.com:3306/HDB javax.jdo.option.ConnectionD

Re: Column not found in schema when querying partitioned table

2015-03-27 Thread ๏̯͡๏
Hello Jon, Are you able to connect to the existing Hive and read tables created in Hive? Regards, deepak On Thu, Mar 26, 2015 at 4:16 PM, Jon Chase wrote: > I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 > > On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase wrote: > >> Spark 1.3.0, P

Re: Error while querying hive table from spark shell

2015-03-27 Thread ๏̯͡๏
Did you resolve this? I am facing the same error. On Wed, Feb 11, 2015 at 1:02 PM, Arush Kharbanda wrote: > Seems that the HDFS path for the table doesn't contain any file/data. > > Does the metastore contain the right path for the HDFS data? > > You can find the HDFS path in TBLS in your metastore.

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
It seems Spark SQL accesses some more columns apart from those created by Hive. You can always recreate the tables; you would need to execute the table creation scripts, but it would be good to avoid recreation. On Fri, Mar 27, 2015 at 3:20 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I did copy hive-conf.xml form H

Spark streaming

2015-03-27 Thread jamborta
Hi all, We have a workflow that pulls in data from csv files; the original setup of the workflow was to parse the data as it comes in (turn it into an array), then store it. This resulted in out-of-memory errors with larger files (as a result of increased GC?). It turns out if the data gets stor

Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Spark 1.3.0 Two issues: a) I'm unable to get a "lateral view explode" query to work on an array type b) I'm unable to save an array type to a Parquet file I keep running into this: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq Here's a stack trace from the explo
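For readers, a minimal sketch of the kind of setup behind the report (hypothetical schema; assumes a HiveContext for the LATERAL VIEW syntax):

    // an array-typed column; the [I in the ClassCastException is a Java int[]
    case class Rec(id: Int, nums: Array[Int])
    val df = hiveContext.createDataFrame(Seq(Rec(1, Array(1, 2, 3))))
    df.registerTempTable("recs")

    // reported to throw: [I cannot be cast to scala.collection.Seq
    hiveContext.sql("SELECT id, v FROM recs LATERAL VIEW explode(nums) t AS v").collect()

    // the companion failure reported for Parquet output
    df.saveAsParquetFile("/tmp/recs.parquet")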

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Cheng Lian
Thanks for the information. Verified that the _common_metadata and _metadata files are missing in this case when using Hadoop 1.0.4. Would you mind opening a JIRA for this? Cheng On 3/27/15 2:40 PM, Pei-Lun Lee wrote: I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng

Decrease In Performance due to Auto Increase of Partitions in Spark

2015-03-27 Thread sayantini
In our application we load our historical data into 40 partitioned RDDs (no. of available cores x 2), and we have not implemented any custom partitioner. After applying transformations on these RDDs, intermediate RDDs are created which have more than 40 partitions, and sometimes partitions are

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
This should be a bug in Explode.eval(), which always assumes the underlying SQL array is represented by a Scala Seq. Would you mind opening a JIRA ticket for this? Thanks! Cheng On 3/27/15 7:00 PM, Jon Chase wrote: Spark 1.3.0 Two issues: a) I'm unable to get a "lateral view explode" qu

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
https://issues.apache.org/jira/browse/SPARK-6570 I also left in the call to saveAsParquetFile(), as it produced a similar exception (though there was no use of explode there). On Fri, Mar 27, 2015 at 7:20 AM, Cheng Lian wrote: > This should be a bug in the Explode.eval(), which always assumes

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Ondrej Smola
More info: when using *spark.mesos.coarse* everything works as expected. I think this must be a bug in the Spark-Mesos integration. 2015-03-27 9:23 GMT+01:00 Ondrej Smola : > It happens only when StorageLevel is used with 1 replica ( StorageLevel. > MEMORY_ONLY_2,StorageLevel.MEMORY_AND_DISK_2) , St

Spark SQL and DataSources API roadmap

2015-03-27 Thread Ashish Mukherjee
Hello, Is there any published community roadmap for SparkSQL and the DataSources API? Regards, Ashish

Checking Data Integrity in Spark

2015-03-27 Thread Sathish Kumaran Vairavelu
Hello, I want to check if there is any way to verify the data integrity of the data files. The use case is to perform data integrity checks on large files with 100+ columns and reject records (write them to another file) that do not meet criteria (such as NOT NULL, date format, etc). Since there are lot of c

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
Forgot to mention: would you mind also providing the full stack trace of the exception thrown in the saveAsParquetFile call? Thanks! Cheng On 3/27/15 7:35 PM, Jon Chase wrote: https://issues.apache.org/jira/browse/SPARK-6570 I also left in the call to saveAsParquetFile(), as it produced

Re: Decrease In Performance due to Auto Increase of Partitions in Spark

2015-03-27 Thread Akhil Das
Each RDD is composed of multiple blocks known as partitions. When you apply a transformation over it, it can grow in size depending on the operation (as the number of objects/references increases), and that is probably why you are seeing an increased number of partitions. I don't think increased
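If the larger partition count itself is the slowdown, a hedged sketch of pinning the count back down after the wide transformations (intermediate stands in for whatever RDD your pipeline produces):

    // coalesce avoids a shuffle; use repartition(40) instead if you need a full rebalance
    val pinned = intermediate.coalesce(40)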

saving schemaRDD to cassandra

2015-03-27 Thread Hafiz Mujadid
Hi experts! I would like to know: is there any way to store a schemaRDD to Cassandra? If yes, how do I store it into an existing Cassandra column family, and into a new column family? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saving-schemaRDD-to-cassandra-tp22
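One common route (not shown in this thread) is the DataStax spark-cassandra-connector; a hedged sketch, assuming that connector is on the classpath and the keyspace/table names stand in for an existing column family:

    import com.datastax.spark.connector._

    // column names in SomeColumns must match the Cassandra table's columns
    schemaRDD.map(row => (row.getInt(0), row.getString(1)))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "name"))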

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread ๏̯͡๏
I can recreate tables, but what about the data? It looks like this is an obvious feature that Spark SQL must have. People will want to transform tons of data stored in HDFS through Hive from Spark SQL. The Spark programming guide suggests it's possible: Spark SQL also supports reading and writing data

Re: Hive Table not from from Spark SQL

2015-03-27 Thread ๏̯͡๏
I tried the following 1) ./bin/spark-submit -v --master yarn-cluster --driver-class-path /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/ha

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Done. I also updated the name on the ticket to include both issues. "Spark SQL arrays: "explode()" fails and cannot save array type to Parquet" https://issues.apache.org/jira/browse/SPARK-6570 On Fri, Mar 27, 2015 at 8:14 AM, Cheng Lian wrote: > Forgot to mention that, would you mind to also

Re: Checking Data Integrity in Spark

2015-03-27 Thread Arush Kharbanda
It's not possible to configure Spark to do checks based on XMLs. You would need to write jobs to do the validations you need, as sketched below. On Fri, Mar 27, 2015 at 5:13 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello, > > I want to check if there is any way to check the data integrity
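A hedged sketch of such a validation job (paths, the column count, and the date-column index are all hypothetical):

    val raw = sc.textFile("hdfs:///input/data.csv")
    val datePattern = """\d{4}-\d{2}-\d{2}""".r.pattern

    // a record passes only if it has the expected width, no empty columns,
    // and column 3 parses as a date
    def valid(cols: Array[String]): Boolean =
      cols.length == 100 &&
        cols.forall(_.nonEmpty) &&
        datePattern.matcher(cols(3)).matches()

    val parsed = raw.map(_.split(",", -1)).cache()
    parsed.filter(valid).map(_.mkString(",")).saveAsTextFile("hdfs:///output/accepted")
    parsed.filter(c => !valid(c)).map(_.mkString(",")).saveAsTextFile("hdfs:///output/rejected")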

Re: Spark streaming

2015-03-27 Thread DW @ Gmail
Show us the code. This shouldn't happen for the simple process you described Sent from my rotary phone. > On Mar 27, 2015, at 5:47 AM, jamborta wrote: > > Hi all, > > We have a workflow that pulls in data from csv files, then originally setup > up of the workflow was to parse the data as it

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
Thanks for the detailed information! On 3/27/15 9:16 PM, Jon Chase wrote: Done. I also updated the name on the ticket to include both issues. "Spark SQL arrays: "explode()" fails and cannot save array type to Parquet" https://issues.apache.org/jira/browse/SPARK-6570 On Fri, Mar 27, 2015 at

Re: Spark streaming

2015-03-27 Thread Ted Yu
jamborta : Please also describe the format of your csv files. Cheers On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail wrote: > Show us the code. This shouldn't happen for the simple process you > described > > Sent from my rotary phone. > > > > On Mar 27, 2015, at 5:47 AM, jamborta wrote: > > > > H

RE: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-27 Thread Bozeman, Christopher
Ankur, The JavaKinesisWordCountASLYARN example is no longer valid; it was added just to the EMR build back in 1.1.0 to demonstrate Spark Streaming with Kinesis on YARN. Just follow the stock example, JavaKinesisWordCountASL, as it is better form anyway, given it is best not to hard-code the mas

Re: Spark streaming

2015-03-27 Thread Tamas Jambor
It is just a comma separated file, about 10 columns wide which we append with a unique id and a few additional values. On Fri, Mar 27, 2015 at 2:43 PM, Ted Yu wrote: > jamborta : > Please also describe the format of your csv files. > > Cheers > > On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail wrot

RDD collect hangs on large input data

2015-03-27 Thread Zsolt Tóth
Hi, I have a simple Spark application: it creates an input rdd with sc.textFile, and it calls flatMapToPair, reduceByKey and map on it. The output rdd is small, a few MBs. Then I call collect() on the output. If the text file is ~50GB, it finishes in a few minutes. However, if it's larger (~100GB

Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
Hi Kelvin, Thank you. That works for me. I wrote my own joins that produced Scala collections, instead of using rdd.join. Regards, Yang On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote: > Hi, I used union() before and yes it may be slow sometimes. I _guess_ your > varia

Re: Hive Table not from from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that the Hive metastore you can connect to via Hive is a MySQL database? And to also confirm: when you're running spark-shell and doing a "show tables" statement, you're getting the same error? On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏

Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Peter Mac
I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run the following error occurs: [root@nde-dev8-template python]# /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath Trac

Re: Spark streaming

2015-03-27 Thread Tamas Jambor
Seems the problem was that we have an actor that picks up the stream (as a receiver) and sends it off to another one that does the actual streaming; if the message is a string it works OK, if it is an array (or list) it just dies. Not sure why, as I cannot see any difference in terms of overhead betwe

JettyUtils.createServletHandler Method not Found?

2015-03-27 Thread kmader
I have a very strange error in Spark 1.3 where at runtime in the org.apache.spark.ui.JettyUtils object the method createServletHandler is not found Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.ui.JettyUtils$.createServletHandler(Ljava/lang/String;Ljavax/servlet/http/Htt

Re: JettyUtils.createServletHandler Method not Found?

2015-03-27 Thread Ted Yu
JettyUtils is marked with: private[spark] object JettyUtils extends Logging { FYI On Fri, Mar 27, 2015 at 9:50 AM, kmader wrote: > I have a very strange error in Spark 1.3 where at runtime in the > org.apache.spark.ui.JettyUtils object the method createServletHandler is > not > found > > Except

"Could not compute split, block not found" in Spark Streaming Simple Application

2015-03-27 Thread Saiph Kappa
Hi, I am just running this simple example with machineA: 1 master + 1 worker, machineB: 1 worker

    val ssc = new StreamingContext(sparkConf, Duration(1000))
    val rawStreams = (1 to numStreams).map(_ =>
      ssc.rawSocketStream[String](host, port, StorageLevel.MEMORY_ONLY_SER)).toArray
    val uni

Re: Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Davies Liu
This will be fixed in https://github.com/apache/spark/pull/5230/files On Fri, Mar 27, 2015 at 9:13 AM, Peter Mac wrote: > I downloaded spark version spark-1.3.0-bin-hadoop2.4. > > When the python version of sql.py is run the following error occurs: > > [root@nde-dev8-template python]# > /root/spa

RE: Hive Table not from from Spark SQL

2015-03-27 Thread Cheng, Hao
1) It seems only in #2 the hive-site.xml was loaded correctly (it knows the MySQL driver stuff, right?); #1 & #3 didn't load the correct hive-site.xml and actually tried to run in the default configuration (an empty database / metastore created).
2) In yarn cluster, the driver probabl

How to avoid the repartitioning in graph construction

2015-03-27 Thread Yifan LI
Hi, Now I have 10 edge data files in my HDFS directory, e.g. edges_part00, edges_part01, …, edges_part09 format: srcId tarId (They make a good partitioning of that whole graph, so I never expect any change (re-partitioning operations) on them during graph building). I am thinking of how to
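A hedged sketch of one way to build the graph from those files without an explicit repartition step (path and edge attribute are illustrative; whether GraphX re-shuffles internally depends on the chosen partition strategy):

    import org.apache.spark.graphx._

    // each HDFS file/block split becomes one RDD partition on load
    val edges = sc.textFile("hdfs:///graph/edges_part*").map { line =>
      val fields = line.split("\\s+")
      Edge(fields(0).toLong, fields(1).toLong, 1)
    }
    val graph = Graph.fromEdges(edges, defaultValue = 1)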

Re: "Could not compute split, block not found" in Spark Streaming Simple Application

2015-03-27 Thread Tathagata Das
If it is deterministically reproducible, could you generate full DEBUG level logs from the driver and the workers and give them to me? Basically I want to trace through what is happening to the block that is not being found. And can you tell me what cluster manager you are using? Spark Standalone, Meso

[Dataframe] Problem with insertIntoJDBC and existing database

2015-03-27 Thread Pierre Bailly-Ferry
Hello, I'm trying to develop with the new DataFrame API, but I'm running into an error. I have an existing MySQL database and I want to insert rows. I create a DataFrame from an RDD, then use the "insertIntoJDBC" function. It appears that DataFrames reorder the data inside them. As a result, I g
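A hedged workaround sketch for readers: select the columns in the exact order of the target table before inserting (column names and the JDBC URL are hypothetical):

    // force a deterministic column order matching the MySQL table definition
    val ordered = df.select("id", "name", "value")
    ordered.insertIntoJDBC("jdbc:mysql://host:3306/db", "my_table", false)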

Re: WordCount example

2015-03-27 Thread Mohit Anchlia
I checked the ports using netstat and don't see any connections established on that port. Logs show only this: 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID app-20150327135048-0002 Spark ui shows: Running Ap

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Tathagata Das
Does it fail with just Spark jobs (using storage levels) on non-coarse mode? TD On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola wrote: > More info > > when using *spark.mesos.coarse* everything works as expected. I think > this must be a bug in spark-mesos integration. > > > 2015-03-27 9:23 GMT+0

Re: Using ORC input for mllib algorithms

2015-03-27 Thread Xiangrui Meng
This is a PR in review to support ORC via the SQL data source API: https://github.com/apache/spark/pull/3753. You can try pulling that PR and help test it. -Xiangrui On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth wrote: > Hi, > > I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.clas

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Xiangrui Meng
Hi Martin, Could you attach the code snippet and the stack trace? The default implementation of some methods uses reflection, which may be the cause. Best, Xiangrui On Wed, Mar 25, 2015 at 3:18 PM, wrote: > Thanks Peter, > > I ended up doing something similar. I however consider both the appro

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-27 Thread Xiangrui Meng
This sounds like a bug ... Did you try a different lambda? It would be great if you could share your dataset or reproduce this issue on a public dataset. Thanks! -Xiangrui On Thu, Mar 26, 2015 at 7:56 AM, Ravi Mody wrote: > After upgrading to 1.3.0, ALS.trainImplicit() has been returning vastly

Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Manoj Samel
While looking into an issue, I noticed that the source displayed on the GitHub site does not match the downloaded tar for 1.3. Thoughts?

Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Eran Medan
Remember that article that went viral on HN? (Where a guy showed how GraphX / Giraph / GraphLab / Spark have worse performance on a 128-node cluster than on a single-threaded machine; if not, here is the article: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html) Well as you may recall

Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences? On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel wrote: > While looking into a issue, I noticed that the source displayed on Github > site does not matches the downloaded tar for 1.3 >

Re: Can't access file in spark, but can in hadoop

2015-03-27 Thread Johnson, Dale
Yes, I could recompile the hdfs client with more logging, but I don’t have the day or two to spare right this week. One more thing about this: the cluster is Hortonworks 2.1.3 [.0]. They seem to claim support for Spark on Hortonworks 2.2. Dale. From: Ted Yu mailto:yuzhih...@gmail.com>

spark streaming driver hang

2015-03-27 Thread Chen Song
I ran a spark streaming job. 100 executors 30G heap per executor 4 cores per executor The version I used is 1.3.0-cdh5.1.0. The job is reading from a directory on HDFS (with files incoming continuously) and does some join on the data. I set batch interval to be 15 minutes and the job worked fine

Re: spark streaming driver hang

2015-03-27 Thread Tathagata Das
Do you have the logs of the driver? Does that give any exceptions? TD On Fri, Mar 27, 2015 at 12:24 PM, Chen Song wrote: > I ran a spark streaming job. > > 100 executors > 30G heap per executor > 4 cores per executor > > The version I used is 1.3.0-cdh5.1.0. > > The job is reading from a direct

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Michael Armbrust
Are you running on yarn?
- If you are running in yarn-client mode, set HADOOP_CONF_DIR to /etc/hive/conf/ (or the directory where your hive-site.xml is located).
- If you are running in yarn-cluster mode, the easiest thing to do is to add --files=/etc/hive/conf/hive-site.xml (or the path for your

RDD resiliency -- does it keep state?

2015-03-27 Thread Michal Klos
Hi Spark group, We haven't been able to find clear descriptions of how Spark handles the resiliency of RDDs in relationship to executing actions with side-effects. If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect for each element in the RDD. If a partition goes down --

Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
If you invoke this, you will get at-least-once semantics on failure. For instance, if a machine dies in the middle of executing the foreach for a single partition, that will be re-executed on another machine. It could even fully complete on one machine, but the machine dies immediately before repor
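Given those at-least-once semantics, the usual advice is to make the side-effect idempotent; a hedged sketch with a hypothetical external-store client:

    rdd.foreachPartition { records =>
      val store = ExternalStore.connect() // hypothetical client API
      records.foreach { rec =>
        // upsert keyed on a deterministic id, so a re-executed partition
        // overwrites its earlier writes instead of duplicating them
        store.upsert(rec.id, rec)
      }
      store.close()
    }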

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Jörn Franke
Hello, Well, all problems you want to solve with technology need to have a good justification for a certain technology. So the first thing is to ask which technology fits my current and future problems. This is also what the article says. Unfortunately, it only provides a vague answer wh

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Ondrej Smola
Yes, only when using fine grained mode and replication (StorageLevel.MEMORY_ONLY_2 etc). 2015-03-27 19:06 GMT+01:00 Tathagata Das : > Does it fail with just Spark jobs (using storage levels) on non-coarse > mode? > > TD > > On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola > wrote: > >> More info >>

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Joseph Bradley
Can you try specifying the number of partitions when you load the data to equal the number of executors? If your ETL changes the number of partitions, you can also repartition before calling KMeans. On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen wrote: > Hi, > > I have a large data set, and I expect
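A hedged sketch of that suggestion (the partition count, path, and k are illustrative):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val numExecutors = 16 // hypothetical: match your cluster
    val data = sc.textFile("hdfs:///kmeans/input", minPartitions = numExecutors)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()
    // or data.repartition(numExecutors) if upstream ETL changed the partitioning
    val model = KMeans.train(data, k = 100, maxIterations = 20)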

[spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-27 Thread Eran Medan
Hi everyone, I had a lot of questions today, sorry if I'm spamming the list, but I thought it's better than posting all questions in one thread. Let me know if I should throttle my posts ;) Here is my question: When I try to have a case class that has Any in it (e.g. I have a property map and va

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Joseph Bradley
Hi Martin, In the short term: Would you be able to work with a different type other than Vector? If so, then you can override the *Predictor* class's "*protected def featuresDataType: DataType"* with a DataFrame type which fits your purpose. If you need Vector, then you might have to do a hack l

Understanding Spark Memory distribution

2015-03-27 Thread Ankur Srivastava
Hi All, I am running a spark cluster on EC2 instances of type: m3.2xlarge. I have given 26gb of memory with all 8 cores to my executors. I can see that in the logs too: *15/03/27 21:31:06 INFO AppClient$ClientActor: Executor added: app-20150327213106-/0 on worker-20150327212934-10.x.y.z-40128

2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hello, I am using the Spark shell in Scala on the localhost. I am using sc.textFile to read a directory. The directory looks like this (generated by another Spark script):

    part-0
    part-1
    _SUCCESS

The part-0 has four short lines of text while part-1 has two short lines of text. Th

Re: HQL function Rollup and Cube

2015-03-27 Thread Chang Lim
Yes, it works for me. Make sure the Spark machine can access the hive machine. On Thu, Mar 26, 2015 at 6:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Did you manage to connect to Hive metastore from Spark SQL. I copied hive > conf file into Spark conf folder but when i run show tables, or do select * > from d

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Tathagata Das
Seems like a bug, could you file a JIRA? @Tim: Patrick said you take a look at Mesos related issues. Could you take a look at this. Thanks! TD On Fri, Mar 27, 2015 at 1:25 PM, Ondrej Smola wrote: > Yes, only when using fine grained mode and replication > (StorageLevel.MEMORY_ONLY_2 > etc). >

Re: How to specify the port for AM Actor ...

2015-03-27 Thread Manoj Samel
I looked at the 1.3.0 code and figured out where this can be added. In org.apache.spark.deploy.yarn ApplicationMaster.scala:282 is

    actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 0,
      conf = sparkConf, securityManager = securityMgr)._1

If I change it to below, th

Re: 2 input paths generate 3 partitions

2015-03-27 Thread Zhan Zhang
Hi Rares, The number of partitions is controlled by the HDFS input format, and one file may have multiple partitions if it consists of multiple blocks. In your case, I think there is one file with 2 splits. Thanks. Zhan Zhang On Mar 27, 2015, at 3:12 PM, Rares Vernica mailto:rvern...@gmail.com>> wro

Re: Can't access file in spark, but can in hadoop

2015-03-27 Thread Zhan Zhang
Probably a Guava version conflict issue. What Spark version did you use, and which Hadoop version was it compiled against? Thanks. Zhan Zhang On Mar 27, 2015, at 12:13 PM, Johnson, Dale mailto:daljohn...@ebay.com>> wrote: Yes, I could recompile the hdfs client with more logging, but I don’t have th

RE: 2 input paths generate 3 partitions

2015-03-27 Thread java8964
The files sound too small to be 2 blocks in HDFS. Did you set the defaultParallelism to be 3 in your spark? Yong Subject: Re: 2 input paths generate 3 partitions From: zzh...@hortonworks.com To: rvern...@gmail.com CC: user@spark.apache.org Date: Fri, 27 Mar 2015 23:15:38 + Hi Rares, T

Streaming anomaly detection using ARIMA

2015-03-27 Thread Corey Nolet
I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform a light anomaly detection. The time series data is going to be bucketed to different time units (several minutes within several hours, several hours within several days, several days within several

Setting a custom loss function for GradientDescent

2015-03-27 Thread shmoanne
I am working with the mllib.optimization.GradientDescent class and I'm confused about how to set a custom loss function with setGradient? For instance, if I wanted my loss function to be x^2 how would I go about setting it using setGradient? -- View this message in context: http://apache-spar
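There is no direct way to hand GradientDescent a loss function; in MLlib the loss and its gradient travel together in a Gradient subclass, which is what setGradient takes. A hedged sketch using plain squared error, loss = (w.x - y)^2 with gradient 2(w.x - y)x, as the custom function (the class body is an illustration, not MLlib code):

    import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
    import org.apache.spark.mllib.optimization.Gradient

    class SquaredLossGradient extends Gradient {
      private def dot(x: Vector, y: Vector): Double = {
        val xa = x.toArray; val ya = y.toArray
        var s = 0.0; var i = 0
        while (i < xa.length) { s += xa(i) * ya(i); i += 1 }
        s
      }

      // returns (gradient, loss) for a single example
      override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
        val diff = dot(weights, data) - label
        (Vectors.dense(data.toArray.map(_ * 2.0 * diff)), diff * diff)
      }

      // accumulating variant: add this example's gradient into cumGradient
      override def compute(data: Vector, label: Double, weights: Vector,
                           cumGradient: Vector): Double = {
        val (grad, loss) = compute(data, label, weights)
        // assumes a dense accumulator, which is what the optimizer passes
        val cum = cumGradient.asInstanceOf[DenseVector].values
        val g = grad.toArray
        var i = 0
        while (i < cum.length) { cum(i) += g(i); i += 1 }
        loss
      }
    }

    // then, on your optimizer instance:
    // optimizer.setGradient(new SquaredLossGradient)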

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
JIRA ticket created at: https://issues.apache.org/jira/browse/SPARK-6581 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian wrote: > Thanks for the information. Verified that the _common_metadata and > _metadata file are missing in this case when using Hadoop 1.0.4. Would you > min

Re: 2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hi, I am not using HDFS, I am using the local file system. Moreover, I did not modify the defaultParallelism. The Spark instance is the default one started by Spark Shell. Thanks! Rares On Fri, Mar 27, 2015 at 4:48 PM, java8964 wrote: > The files sound too small to be 2 blocks in HDFS. > > Di

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
Yes, I have done repartition. I tried to repartition to the number of cores in my cluster. Not helping... I tried to repartition to the number of centroids (k value). Not helping... On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley wrote: > Can you try specifying the number of partitions when you

unable to read avro file

2015-03-27 Thread Joanne Contact
Hi, I am following the instructions on this website: http://www.infoobjects.com/spark-with-avro/ I installed the spark-avro library from https://github.com/databricks/spark-avro on a machine which only has the Hive gateway client role on a Hadoop cluster. Somehow I got an error on reading the avro file. scal

Re: unable to read avro file

2015-03-27 Thread Joanne Contact
Never mind, I found my Spark is still 1.2 but the avro library requires 1.3. Will try again. On Fri, Mar 27, 2015 at 9:38 PM, Joanne Contact wrote: > Hi I am following the instruction on this website. > http://www.infoobjects.com/spark-with-avro/ > > I installed the sparkavro libary on https:// > g
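For readers hitting the same thing: the databricks spark-avro package targets the Spark 1.3 data source API; a hedged sketch of its basic usage once on 1.3 (the path is illustrative):

    // load through the data source API; requires Spark 1.3+ with spark-avro on the classpath
    val df = sqlContext.load("/data/episodes.avro", "com.databricks.spark.avro")
    df.printSchema()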

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Sean Owen
(I bet the Spark implementation could be improved. I bet GraphX could be optimized.) Not sure about this one, but "in core" benchmarks often start by assuming that the data is local. In the real world, data is unlikely to be. The benchmark has to include the cost of bringing all the data to the lo

Re: Understanding Spark Memory distribution

2015-03-27 Thread Ankur Srivastava
I have increased the "spark.storage.memoryFraction" to 0.4 but I still get OOM errors on Spark Executor nodes 15/03/27 23:19:51 INFO BlockManagerMaster: Updated info of block broadcast_5_piece10 15/03/27 23:19:51 INFO TorrentBroadcast: Reading broadcast variable 5 took 2704 ms 15/03/27 23:19:52

rdd.toDF().saveAsParquetFile("tachyon://host:19998/test")

2015-03-27 Thread sud_self
spark version is 1.3.0 with tachyon-0.6.1 QUESTION DESCRIPTION: rdd.saveAsObjectFile("tachyon://host:19998/test") and rdd.saveAsTextFile("tachyon://host:19998/test") succeed, but rdd.toDF().saveAsParquetFile("tachyon://host:19998/test") fails. ERROR MESSAGE: java.lang.IllegalArgumen

Re: rdd.toDF().saveAsParquetFile("tachyon://host:19998/test")

2015-03-27 Thread Yin Huai
You are hitting https://issues.apache.org/jira/browse/SPARK-6330. It has been fixed in 1.3.1, which will be released soon. On Fri, Mar 27, 2015 at 10:42 PM, sud_self <852677...@qq.com> wrote: > spark version is 1.3.0 with tanhyon-0.6.1 > > QUESTION DESCRIPTION: rdd.saveAsObjectFile("tachyon://hos

Re: Understanding Spark Memory distribution

2015-03-27 Thread Wisely Chen
Hi, in broadcast, Spark will collect the whole 3GB object onto the master node and broadcast it to each slave. It is a very common situation that the master node doesn't have enough memory. What are your master node settings? Wisely Chen Ankur Srivastava wrote on Saturday, March 28, 2015: > I have increased the "spark