Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread Mani
Hi, I am a graduate student at Virginia Tech (USA) pursuing my Masters in Computer Science. I’ve been researching parallel and distributed databases and their performance for running range queries involving simple joins and group by on large datasets. As part of my research, I tried e

Re: How to debug Spark on Yarn?

2015-04-27 Thread Zoltán Zvara
You can check container logs from the RM web UI or, when log aggregation is enabled, with the yarn command. There are other, less convenient options. On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Spark 1.3 > > 1. View stderr/stdout from executor from Web UI: when the job is running i > fi

Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is: it depends :) The fact that query runtime increases indicates more shuffle. You may want to construct RDDs based on the keys you use. You may want to specify what kind of nodes you are using and how many executors you are using. You may also want to play around with executor memory alloc
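A minimal Scala sketch of the "construct RDDs based on the keys you use" idea; the dataset names, the id field, and the partition count are placeholders, not from the original thread:

    import org.apache.spark.HashPartitioner

    // Hash-partition both sides by the join/group-by key up front so the
    // subsequent join does not reshuffle the full datasets.
    val partitioner = new HashPartitioner(200)
    val left  = leftRows.map(r => (r.id, r)).partitionBy(partitioner).cache()
    val right = rightRows.map(r => (r.id, r)).partitionBy(partitioner).cache()
    val joined = left.join(right)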

Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
1) Application container logs from the RM web UI never load in the browser. I eventually have to kill the browser. 2) /apache/hadoop/bin/yarn logs -applicationId application_1429087638744_151059 | less emits logs only after the application has completed. Are there no better ways to see the logs as they a

A problem of using spark streaming to capture network packets

2015-04-27 Thread Hai Shan Wu
Hi Everyone, We use pcap4j to capture network packets and then use Spark Streaming to analyze the captured packets. However, we ran into a strange problem. If we run our application on Spark locally (for example, spark-submit --master local[2]), the program runs successfully. If we run our applicati

How to distribute Spark computation recipes

2015-04-27 Thread Olivier Girardot
Hi everyone, I know that any RDD is tied to its SparkContext and the associated variables (broadcasts, accumulators), but I'm looking for a way to serialize/deserialize full RDD computations. @rxin Spark SQL is, in a way, already doing this, but the parsers are private[sql]; is there any way to

Spark + Mesos + HDFS resource split

2015-04-27 Thread Ankur Chauhan
Hi, I am building a Mesos cluster for the purpose of running Spark workloads (in addition to other frameworks). I am under the impression that it is preferable/recommended to run the HDFS datanode process and the Spark slave on the same physical node (o

Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Noorul Islam K M
Are you using DSE Spark? If so, are you pointing spark-jobserver to use DSE Spark? Thanks and Regards, Noorul. Anand writes: > *I am new to the Spark world and Job Server > > My Code :* > > package spark.jobserver > > import java.nio.ByteBuffer > > import scala.collection.JavaConversions._ > import

Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Anand
I was able to fix the issues by providing right version of cassandra-all and thrift libraries -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-Connection-Issue-with-Spark-jobserver-tp22587p22664.html Sent from the Apache Spark User List mailing li

Re: Spark streaming action running the same work in parallel

2015-04-27 Thread ColinMc
I was able to get it working. Instead of using customers.flatMap to return alerts, I had to use the following: customers.foreachRDD(new Function>, Void>() { @Override public Void call(final JavaPairRDD> rdd) throws Exception { rdd.foreachPartition(new VoidFu
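A rough Scala sketch of the same foreachRDD/foreachPartition pattern; the alert-sending helpers are purely illustrative:

    // Side effects such as sending alerts go into foreachRDD rather than
    // being returned from flatMap; foreachPartition lets each partition
    // reuse a single client/connection.
    customers.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = createAlertClient()                  // hypothetical helper
        records.foreach(r => client.send(buildAlert(r)))  // buildAlert is illustrative
        client.close()
      }
    }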

spark-defaults.conf

2015-04-27 Thread James King
I renamed spark-defaults.conf.template to spark-defaults.conf and invoked spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh But I still get failed to launch org.apache.spark.deploy.worker.Worker: --properties-file FILE Path to a custom Spark properties file. Defaul

ReduceByKey and sorting within partitions

2015-04-27 Thread Marco
Hi, after reducing by key I'm trying to get data ordered among partitions (like RangePartitioner) and within partitions (like sortByKey or repartitionAndSortWithinPartitions), pushing the sorting down to the shuffle machinery of the reduce phase. I think, but maybe I'm wrong, that the correct

Disable partition discovery

2015-04-27 Thread Cosmin Cătălin Sanda
How can one disable *Partition discovery* in *Spark 1.3.0* when using *sqlContext.parquetFile*? Alternatively, is there a way to load *.parquet* files without *Partition discovery*? Cosmin

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
FYI, Parquet schema output:

    message pig_schema {
      optional binary cust_id (UTF8);
      optional int32 part_num;
      optional group ip_list (LIST) {
        repeated group ip_t {
          optional binary ip (UTF8);
        }
      }
      optional group vid_list (LIST) {
        repeated group vid_t {
          optional binary

Spark 1.2.1: How to convert SchemaRDD to CassandraRDD?

2015-04-27 Thread Tash Chainar
Hi all, following the import com.datastax.spark.connector.SelectableColumnRef; import com.datastax.spark.connector.japi.CassandraJavaUtil; import org.apache.spark.sql.SchemaRDD; import static com.datastax.spark.connector.util.JavaApiHelper.toScalaSeq; import scala.collection.Seq; SchemaRDD schema

Re: spark-defaults.conf

2015-04-27 Thread Zoltán Zvara
You should distribute your configuration file to workers and set the appropriate environment variables, like HADOOP_HOME, SPARK_HOME, HADOOP_CONF_DIR, SPARK_CONF_DIR. On Mon, Apr 27, 2015 at 12:56 PM James King wrote: > I renamed spark-defaults.conf.template to spark-defaults.conf > and invoked

Re: spark-defaults.conf

2015-04-27 Thread James King
Thanks. I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile, but when I start the worker like this: spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh I still get: failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf.

Re: SQL UDF returning object of case class; regression from 1.2.0

2015-04-27 Thread Ophir Cohen
A short update: eventually we manually upgraded to 1.3.1 and the problem was fixed. On Apr 26, 2015 2:26 PM, "Ophir Cohen" wrote: > I happened to hit the following issue that prevents me from using UDFs > with case classes: https://issues.apache.org/jira/browse/SPARK-6054. > > The issue already fixed

Bigints in pyspark

2015-04-27 Thread jamborta
Hi all, I have just come across a problem where I have a table with a few bigint columns. It seems that if I read that table into a dataframe and then collect it in pyspark, the bigints are stored as integers in Python. (The problem is that if I write it back to another table, I detect the hive type prog

Re: ReduceByKey and sorting within partitions

2015-04-27 Thread Saisai Shao
Hi Marco, As far as I know, the current combineByKey() does not expose an argument for setting keyOrdering on the ShuffledRDD. Since ShuffledRDD is package private, if you can get at it through reflection or some other way, the keyOrdering you set will be pushed down to the shuffle. If you
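If reflection on the package-private ShuffledRDD is not an option, a common fallback (at the cost of one extra shuffle) is to reduce first and then sort with a RangePartitioner. A minimal sketch, assuming a key type with an Ordering:

    import org.apache.spark.RangePartitioner

    val reduced = pairs.reduceByKey(_ + _)   // pairs: RDD[(K, V)], K has an Ordering
    // RangePartitioner orders data *among* partitions;
    // repartitionAndSortWithinPartitions sorts *within* each partition.
    val sorted = reduced.repartitionAndSortWithinPartitions(
      new RangePartitioner(reduced.partitions.length, reduced))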

Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all: I use the function updateStateByKey in Spark Streaming. I need to store the state for one minute, so I set "spark.cleaner.ttl" to 120 (the batch duration is 2 seconds), but it throws an exception: Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist

Re: Exception in using updateStateByKey

2015-04-27 Thread Ted Yu
Which hadoop release are you using ? Can you check hdfs audit log to see who / when deleted spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443 ? Cheers On Mon, Apr 27, 2015 at 6:21 AM, Sea <261810...@qq.com> wrote: > Hi, all: > I use function updateStateByKey in Spark Streaming,

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
I set it to 240, and it happens again when 240 seconds is reached. -- Original message -- From: "261810726" <261810...@qq.com> Date: 2015-04-27 (Mon) 10:24 To: "Ted Yu" Subject: Re: Exception in using updateStateByKey Yes, I can make it larger, but

RE: Slower performance when bigger memory?

2015-04-27 Thread Shuai Zheng
Thanks. May I know your configuration for more/smaller executors on r3.8xlarge, and how much memory you eventually decided to give a single executor without impacting performance (for example: 64g?). From: Sven Krasser [mailto:kras...@gmail.com] Sent: Friday, April 24, 2015 1:59 PM

Re: directory loader in windows

2015-04-27 Thread Steve Loughran
This is a Hadoop-side stack trace. It looks like the code is trying to get the filesystem permissions by running %HADOOP_HOME%\bin\WINUTILS.EXE ls -F and something is triggering a null pointer exception. There isn't any HADOOP JIRA with this specific stack trace in it, so it's not a known/fixe

data locality in spark

2015-04-27 Thread Grandl Robert
Hi guys, I am running some SQL queries, but all my tasks are reported as either NODE_LOCAL or PROCESS_LOCAL. In the Hadoop world, the reduce tasks are RACK or NON_RACK LOCAL because they have to aggregate data from multiple hosts. However, in Spark even the aggregation stages are reported a

Driver ID from spark-submit

2015-04-27 Thread Rares Vernica
Hello, I am trying to use the default Spark cluster manager in a production environment. I will be submitting jobs with spark-submit. I wonder if the following is possible: 1. Get the Driver ID from spark-submit. We will use this ID to keep track of the job and kill it if necessary. 2. Whether i

Spark JDBC data source API issue with mysql

2015-04-27 Thread madhu phatak
Hi, I have been trying out the Spark data source API with JDBC. The following is the code to get a DataFrame: Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)"))) By looking at the test cases, I found that the query has to be inside parentheses, otherwise it's treated as a table
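For reference, a minimal sketch of that load call; dbUrl and the query are placeholders, and most databases (MySQL included) also want an alias on the derived table:

    // Wrapping the query in parentheses makes the JDBC source treat it as a
    // subquery rather than a table name.
    val query = "select id, name from people where age > 21"
    val df = sqlContext.load(
      "org.apache.spark.sql.jdbc",
      Map("url" -> dbUrl, "dbtable" -> s"($query) AS t"))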

Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread ๏̯͡๏
I downloaded the 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors and the direct link. Each time I untar I get the below error: spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty error message) tar: Error exit delayed from previous errors Is it broken? -- Deepak

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
Maybe I found the solution: do not set 'spark.cleaner.ttl', just use the 'remember' function on StreamingContext to set the rememberDuration. -- Original message -- From: "Ted Yu" Date: 2015-04-27 (Mon) 10:20 To: "Sea" <261810...@qq.com> Subject: Re: Ex
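A minimal sketch of the 'remember' approach, assuming the 2-second batch interval from the original post:

    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Keep generated RDDs around for at least two minutes instead of
    // relying on spark.cleaner.ttl.
    ssc.remember(Minutes(2))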

RE: ReduceByKey and sorting within partitions

2015-04-27 Thread Ganelin, Ilya
Marco - why do you want data sorted both within and across partitions? If you need to take an ordered sequence across all your data you need to either aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an ordered index to your data that matches the order it was stored on

RE: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Ganelin, Ilya
What command are you using to untar? Are you running out of disk space? Sent with Good (www.good.com) -Original Message- From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com] Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time To: user Subject: Spark 1.3.1 Hadoo

Re: Timeout Error

2015-04-27 Thread Deepak Gopalakrishnan
Hello All, I dug a little deeper and found this error : 15/04/27 16:05:39 WARN TransportChannelHandler: Exception in connection from /10.1.0.90:40590 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.

Re: What is difference btw reduce & fold?

2015-04-27 Thread keegan
Hi Q, fold and reduce both aggregate over a collection by applying an operation you specify; the major difference is the starting point of the aggregation. For fold(), you have to specify the starting value, and for reduce() the starting value is the first (or possibly an arbitrary) element in
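A small illustration of the difference; the expected results for this toy RDD are shown as comments:

    val nums = sc.parallelize(Seq(1, 2, 3, 4))

    // reduce: no explicit starting value; the elements themselves seed it.
    val sum1 = nums.reduce(_ + _)      // 10

    // fold: you supply the zero value. It should be the identity of the
    // operation, since it is applied once per partition and once more when
    // merging the partial results.
    val sum2 = nums.fold(0)(_ + _)     // 10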

Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Sean Owen
Works fine for me. Make sure you're not downloading the HTML redirector page and thinking it's the archive. On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors > and direct link. Each time i untar i get below error >

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-27 Thread Xiangrui Meng
Could you try different ranks and see whether the task size changes? We do use YtY in the closure, which should work the same as broadcast. If that is the case, it should be safe to ignore this warning. -Xiangrui On Thu, Apr 23, 2015 at 4:52 AM, Christian S. Perone wrote: > All these warnings com

Re: StandardScaler failing with OOM errors in PySpark

2015-04-27 Thread Xiangrui Meng
You might need to specify driver memory in spark-submit instead of passing JVM options. spark-submit is designed to handle different deployments correctly. -Xiangrui On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar wrote: > ok yes, I think I have narrowed it down to being a problem with driver > memor
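For example, a hedged sketch with the sizes and script name as placeholders:

    # Set driver memory through spark-submit rather than raw JVM options so it
    # is applied correctly for the chosen deploy mode.
    spark-submit --driver-memory 8g --executor-memory 4g my_script.py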

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Xiangrui Meng
How did you run the example app? Did you use spark-submit? -Xiangrui On Thu, Apr 23, 2015 at 2:27 PM, Su She wrote: > Sorry, accidentally sent the last email before finishing. > > I had asked this question before, but wanted to ask again as I think > it is now related to my pom file or project se

spark sql LEFT OUTER JOIN java.lang.ClassCastException

2015-04-27 Thread kiran mavatoor
Hi There, I am using a Spark SQL left outer join query. The SQL query is: scala> val test = sqlContext.sql("SELECT e.departmentID FROM employee e LEFT OUTER JOIN department d ON d.departmentId = e.departmentId").toDF() In Spark 1.3.1 it works fine, but the latest pull gives the below error 1

Streaming app with windowing and persistence

2015-04-27 Thread Alexander Krasheninnikov
Hello, everyone. I am developing a streaming application that works with window functions - each window creates a table and performs some SQL operations on the extracted data. I ran into this problem: when using window operations and checkpointing, the application does not start the next time. Here is the code: ---

Re: Spark on Mesos

2015-04-27 Thread Stephen Carman
So I installed Spark 1.3.1 built with hadoop2.6 on each of the slaves; I just basically got the pre-built from the Spark website… I placed those compiled Spark installs on each slave at /opt/spark. My spark properties seem to be getting picked up on my side fine…

Re: gridsearch - python

2015-04-27 Thread Xiangrui Meng
We will try to make them available in 1.4, which is coming soon. -Xiangrui On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto wrote: > I know grid search with cross validation is not supported. However, I was > wondering if there is something availalable for the time being. > > > > Thanks, > > >

does randomSplit return a copy or a reference to the original rdd? [Python]

2015-04-27 Thread Pagliari, Roberto
Suppose I have something like the code below:

    for idx in xrange(0, 10):
        train_test_split = training.randomSplit(weights=[0.75, 0.25])
        train_cv = train_test_split[0]
        test_cv = train_test_split[1]
        # scale train_cv and test_cv by scaling train

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Su She
Hello Xiangrui, I am using this spark-submit command (as I do for all other jobs): /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit --class MLlib --master local[2] --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',') /home/ec2-user/sparkApps/lea

Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi, Could you suggest what is the best way to do "group by x order by y" in Spark? When I try to perform it with Spark SQL I get the following error (Spark 1.3): val results = sqlContext.sql("select * from sample group by id order by time") org.apache.spark.sql.AnalysisException: expression 'tim

Re: Group by order by

2015-04-27 Thread Richard Marscher
Hi, that error seems to indicate the basic query is not properly expressed. If you group by just ID, then that means it would need to aggregate all the time values into one value per ID, so you can't sort by it. Thus it tries to suggest an aggregate function for time so you can have 1 value per ID

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi Richard, There are several values of time per id. Is there a way to perform group by id and sort by time in Spark? Best regards, Alexander From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Monday, April 27, 2015 12:20 PM To: Ulanov, Alexander Cc: user@spark.apache.org Subject: R

Understanding Spark's caching

2015-04-27 Thread Eran Medan
Hi Everyone! I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something:

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.saveAsTextFile("...")
    rdd3.sa

bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-04-27 Thread Pagliari, Roberto
With the Python APIs, the available arguments I got (using inspect module) are the following: ['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights', 'regParam', 'regType', 'intercept'] numClasses is not available. Can someone comment on this? Thanks,

Re: Group by order by

2015-04-27 Thread Richard Marscher
It's not related to Spark, but the concept of what you are trying to do with the data. Grouping by ID means consolidating data for each ID down to 1 row per ID. You can sort by time after this point yes, but you would need to either take each ID and time value pair OR do some aggregate operation on

[SQL][1.3.1][JAVA] UDF in java cause Task not serializable

2015-04-27 Thread Shuai Zheng
Hi All, Basically I try to define a simple UDF and use it in the query, but it gives me "Task not serializable" public void test() { RiskGroupModelDefinition model = registeredRiskGroupMap.get(this.modelId); RiskGroupModelDefinition edm = this.createEdm(
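A frequent cause of "Task not serializable" with UDFs is an anonymous inner class or lambda that captures `this`, dragging the whole non-serializable enclosing object into the closure. A minimal Scala sketch of a serialization-friendly shape, with the UDF body purely illustrative:

    // Define the UDF as a standalone function value so the closure captures
    // only what it needs, not the enclosing class.
    val toUpper = (s: String) => if (s == null) null else s.toUpperCase
    sqlContext.udf.register("toUpper", toUpper)
    val result = sqlContext.sql("SELECT toUpper(name) FROM people")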

[SQL][1.3.1][JAVA]Use UDF in DataFrame join

2015-04-27 Thread Shuai Zheng
Hi All, I want to ask how to use a UDF with the join function on DataFrame. It always gives me a "cannot resolve the column name" error. Can anyone give me an example of how to run this in Java? My code is like: edmData.join(yb_lookup, edmData.col("year(YEARBUILT)").equalTo(
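The error suggests that col("year(YEARBUILT)") is being read as a literal column name rather than a function call. A hedged Scala sketch of one way to express this, where the year-extraction logic and the yb_lookup column name are assumptions:

    import org.apache.spark.sql.functions.udf

    // Apply the UDF as a column expression inside the join condition instead
    // of embedding the function call in the column name.
    val yearOf = udf((yearBuilt: String) => yearBuilt.take(4))
    val joined = edmData.join(yb_lookup,
      yearOf(edmData("YEARBUILT")) === yb_lookup("year"))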

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Thanks, it should be “select id, time, min(x1), min(x2), … from data group by id, time order by time” (“min” or another aggregate function to pick the other fields). Forgot to mention that (id, time) is my primary key and I took it for granted that it worked in my MySQL example. Best regards, Alexander
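Putting the thread together, a minimal sketch of that query in Spark SQL, assuming the table is registered as "data":

    // Group by both id and time, aggregate the remaining columns, then sort.
    val results = sqlContext.sql(
      """SELECT id, time, MIN(x1) AS x1, MIN(x2) AS x2
        |FROM data
        |GROUP BY id, time
        |ORDER BY time""".stripMargin)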

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-27 Thread TimMalt
Hi, and what can I do when I am on Windows? It does not allow me to set the hostname to some IP Thanks, Tim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22674.html Sent fro

Change ivy cache for spark on Windows

2015-04-27 Thread mj
Hi, I'm having trouble using the --packages option for spark-shell.cmd - I have to use Windows at work and have been issued a username with a space in it that means when I use the --packages option it fails with this message: "Exception in thread "main" java.net.URISyntaxException: Illegal charac

Fwd: Change ivy cache for spark on Windows

2015-04-27 Thread Burak Yavuz
+user -- Forwarded message -- From: Burak Yavuz Date: Mon, Apr 27, 2015 at 1:59 PM Subject: Re: Change ivy cache for spark on Windows To: mj Hi, In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set: `spark.jars.ivy \your\path` Best, Burak On Mon, Apr 27, 201
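For example, a hedged sketch where the path is a placeholder; picking a directory without spaces avoids the URI error above:

    # SPARK_HOME\conf\spark-defaults.conf
    spark.jars.ivy    C:\ivy-cache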

Re: How to distribute Spark computation recipes

2015-04-27 Thread Reynold Xin
The code itself is the "recipe", no? On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > I know that any RDD is related to its SparkContext and the associated > variables (broadcast, accumulators), but I'm looking for a way to > ser

Re: hive-thriftserver maven artifact

2015-04-27 Thread Ted Yu
This is available for 1.3.1: http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 FYI On Mon, Feb 16, 2015 at 7:24 AM, Marco wrote: > Ok, so will it be only available for the next version (1.30)? > > 2015-02-16 15:24 GMT+01:00 Ted Yu : > >> I searched for 'spark-hive-

Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
Michael, There is only one schema: both versions have 200 string columns in one file. On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov wrote: > Now this is very important: > > > > “Normal RDDs” refers to “batch RDDs”. However the default in-memory > Serialization of RDDs which are part of DSTream is

Automatic Cache in SparkSQL

2015-04-27 Thread Wenlei Xie
Hi, I am trying to answer a simple query with SparkSQL over a Parquet file. When executing the query several times, the first run takes about 2s while the later runs take <0.1s. By looking at the log file it seems the later runs don't load the data from disk. However, I didn't enable an

Re: Powered By Spark

2015-04-27 Thread Justin
Hi, Would you mind adding our company to the Powered By Spark page? organization name: atp URL: https://atp.io a list of which Spark components you are using: SparkSQL, MLLib, Databricks Cloud and a short description of your use case: Predictive models and learning algorithms to improve the relev

Re: Super slow caching in 1.3?

2015-04-27 Thread Wenlei Xie
I face a similar issue in Spark 1.2. Caching the schema RDD takes about 50s for 400MB of data. The schema is similar to the TPC-H LineItem. Here is the code I used for the cache. I am wondering if there is any setting missing? Thank you so much! lineitemSchemaRDD.registerTempTable("lineitem"); sqlCont

Scalability of group by

2015-04-27 Thread Ulanov, Alexander
Hi, I am running a group by on a dataset of 2B rows of RDD[Row[id, time, value]] in Spark 1.3 as follows: "select id, time, first(value) from data group by id, time" My cluster is 8 nodes with 16GB RAM and one worker per node. Each executor is allocated 5GB of memory. However, all executors ar

Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi, can you test on a smaller dataset to identify whether it is a cluster issue or a scaling issue in Spark? On 28 Apr 2015 11:30, "Ulanov, Alexander" wrote: > Hi, > > > > I am running a group by on a dataset of 2B of RDD[Row [id, time, value]] > in Spark 1.3 as follows: > > “select id, time, first(value)

Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
Spark keeps job data in memory by default for the kind of performance gains you are seeing. Additionally, depending on your query, Spark runs stages, and at any point in time Spark's code behind the scenes may issue an explicit cache. If you hit any such scenario you will find those cached objects in the UI under Storage

RE: Scalability of group by

2015-04-27 Thread Ulanov, Alexander
It works on a smaller dataset of 100 rows. I could probably find the size at which it fails using binary search; however, it would not help me because I need to work with 2B rows. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Monday, April 27, 2015 6:58 PM To: Ulanov, Alexander Cc: user@spark.a

Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread bit1...@163.com
Hi, I am frequently asked why Spark is also much faster than Hadoop MapReduce on disk (without the use of the memory cache). I have no convincing answer for this question; could you guys elaborate on this? Thanks!

java.lang.UnsupportedOperationException: empty collection

2015-04-27 Thread xweb
I am running the following code on Spark 1.3.0. It is from https://spark.apache.org/docs/1.3.0/ml-guide.html On running val model1 = lr.fit(training.toDF) I get java.lang.UnsupportedOperationException: empty collection. What could be the reason? import org.apache.spark.{SparkConf, Sp

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Ilya Ganelin
I believe the typical answer is that Spark is actually a bit slower. On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com wrote: > Hi, > > I am frequently asked why spark is also much faster than Hadoop MapReduce > on disk (without the use of memory cache). I have no convencing answer for > this quest

Re: Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread bit1...@163.com
Is it? I learned elsewhere that Spark is 5~10 times faster than Hadoop MapReduce. bit1...@163.com From: Ilya Ganelin Date: 2015-04-28 10:55 To: bit1...@163.com; user Subject: Re: Why Spark is much faster than Hadoop MapReduce even on disk I believe the typical answer is that Spar

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks   From: "bit1...@163.com" To: user Sent: Monday, April 27, 2015 8:33 PM Subject: Why Spark is much faster than Hadoop MapReduce even on disk

1.3.1: Persisting RDD in parquet - "Conflicting partition column names"

2015-04-27 Thread sranga
Hi I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in the 1.2 version. The version that it is failing to work is 1.3.1. Any help is appreciated. Caused by: java.lang.AssertionError: assertion failed: Conflicting partition

Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread lokeshkumar
Hi Forum, I am facing the below compile error when using the fileStream method of the JavaStreamingContext class. I have copied the code from the JavaAPISuite.java test class of the Spark test code. Please help me find a solution for this.

Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we built Puppet manifests so we could do the automation - it's a bit of work to set up, but well worth the effort. On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler wrote: > It's mostly manual. You could try automating with something like Chef, of > course, but there's

New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake
https://issues.apache.org/jira/browse/SPARK-7182 Can anyone suggest a workaround for the above issue? Thanks. -Don -- Donald Drake Drake Consulting http://www.drakeconsulting.com/ http://www.MailLaunder.com/

Spark Job fails with 6 executors and succeeds with 8 ?

2015-04-27 Thread ๏̯͡๏
I have this Spark app and it fails when I run it with 6 executors but succeeds with 8. Any suggestions? Command: ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop

Serialization error

2015-04-27 Thread madhvi
Hi, while connecting to Accumulo through Spark by making a Spark RDD I am getting the following error: object not serializable (class: org.apache.accumulo.core.data.Key) This is due to the 'Key' class of Accumulo, which does not implement the Serializable interface. How can it be solved and Accumulo

Re: Unable to work with foreachrdd

2015-04-27 Thread drarse
Can you share the code? Maybe it is the foreachRDD body. :) On Tuesday, April 28, 2015, CH.KMVPRASAD [via Apache Spark User List] < ml-node+s1001560n22681...@n3.nabble.com> wrote: > When i run spark streaming application print method is printing result it > is f9, but i used foreachrdd on that

Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread lokeshkumar
Hi Forum, I am facing the below compile error when using the fileStream method of the JavaStreamingContext class. I have copied the code from the JavaAPISuite.java test class of the Spark test code. The error message is

Re: JAVA_HOME problem

2015-04-27 Thread sourabh chaki
I was able to solve this problem by hard-coding JAVA_HOME inside the org.apache.spark.deploy.yarn.Client.scala class:

    val commands = prefixEnv ++ Seq(
      // was: YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java", "-server"
      "/usr/java/jdk1.7.0_51/bin/java", "-server")

Somehow {

Re: Spark Streaming: JavaDStream compute method NPE

2015-04-27 Thread Himanshu Mehra
Hi Puneith, Please provide the code if you may. It will be helpful. Thank you, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-JavaDStream-compute-method-NPE-tp22676p22684.html Sent from the Apache Spark User List mailing list archive at Na

Re: Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread Akhil Das
How about: JavaPairDStream<LongWritable, Text> input = jssc.fileStream(inputDirectory, LongWritable.class, Text.class, TextInputFormat.class); See the complete example over here