Re: No suitable driver found for jdbc:mysql://

2015-07-22 Thread Rishi Yadav
try setting --driver-class-path On Wed, Jul 22, 2015 at 3:45 PM, roni wrote: > Hi All, > I have a cluster with spark 1.4. > I am trying to save data to mysql but getting error > > Exception in thread "main" java.sql.SQLException: No suitable driver found > for jdbc:mysql://<>.rds.amazonaws.com:
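A minimal sketch of that suggestion (the connector jar path, version, and class name are illustrative, not from the thread):

  spark-submit \
    --driver-class-path /path/to/mysql-connector-java-5.1.34.jar \
    --jars /path/to/mysql-connector-java-5.1.34.jar \
    --class com.example.SaveToMySQL \
    my-app.jar

Passing the same jar with --jars as well keeps it available on the executors, not only the driver.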

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Rishi Yadav
Can you explain which transformation is failing? Here's a simple example: http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/ On Thu, Jul 23, 2015 at 5:37 AM, wrote: > I tried with a RDD[DenseVector] but RDDs are not transformable, so T+ > RDD[DenseVector] not >: RDD[
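For reference, a small self-contained sketch of computing a correlation matrix from an RDD[Vector] with MLlib (the data values are made up):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.mllib.stat.Statistics
  import org.apache.spark.rdd.RDD

  val data: RDD[Vector] = sc.parallelize(Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)))

  // Pearson correlation matrix across the columns of the vectors
  val corrMatrix = Statistics.corr(data, "pearson")
  println(corrMatrix)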

Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Rishi Yadav
Why are you expecting the footprint of the DataFrame to be lower when it contains more information (RDD + schema)? On Sat, Aug 15, 2015 at 6:35 PM, Todd wrote: > Hi, > With following code snippet, I cached the raw RDD(which is already in > memory, but just for illustration) and its DataFrame. > I though

Re: Re: Can't understand the size of raw RDD and its DataFrame

2015-08-16 Thread Rishi Yadav
would think that it would take less space, looks my understanding is > run?? > > > > > > At 2015-08-16 12:34:31, "Rishi Yadav" wrote: > > why are you expecting footprint of dataframe to be lower when it contains > more information ( RDD + Schema) > > O

Re: Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-16 Thread Rishi Yadav
Try --jars rather than --class to submit the jar. On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch wrote: > The NoClassDefFoundException differs from ClassNotFoundException : it > indicates an error while initializing that class: but the class is found in > the classpath. Please provide the full st
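A hedged example of the launch command (jar names and the main class are illustrative): --class still names your application's entry point, while the HBase dependency jars are shipped with --jars.

  spark-submit \
    --class com.example.HBaseApp \
    --jars /path/to/hbase-common.jar,/path/to/hbase-client.jar \
    my-app.jar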

Re: Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread Rishi Yadav
Can you tell us more about your environment? I understand you are running it on a single machine, but is a firewall enabled? On Sun, Aug 16, 2015 at 5:47 AM, t4ng0 wrote: > Hi > > I am new to spark and trying to run standalone application using > spark-submit. Whatever i could understood, from logs is

Re: Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Rishi Yadav
Can you provide some more details: 1. How many partitions does the RDD have? 2. How big is the cluster? On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote: > Dear all, > > I want to equally divide a RDD partition into two partitions. That means, > the first half of elements in the partition will create a new

RangePartitioner

2015-01-20 Thread Rishi Yadav
I am joining two tables as below; the program stalls at the log line below and never proceeds. What might be the issue, and what is a possible solution? >>> INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79 Table 1 has 450 columns, Table 2 has 100 columns. Both tables have a few million r

Re: Define size partitions

2015-01-30 Thread Rishi Yadav
If you are only concerned about partitions being too big, you can specify the number of partitions as an additional parameter while loading files from HDFS. On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser wrote: > You can also use your InputFormat/RecordReader in Spark, e.g. using > newAPIHadoopFile. See her
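A minimal sketch of that parameter (the path and partition count are illustrative):

  // Ask for at least 100 partitions when reading from HDFS,
  // which keeps individual partitions smaller
  val lines = sc.textFile("hdfs:///data/input", 100)
  println(lines.partitions.size)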

Stepsize with Linear Regression

2015-02-10 Thread Rishi Yadav
Are there any rules of thumb for how to set the step size with gradient descent? I am using it for Linear Regression, but I am sure it applies to gradient descent in general. At present I am deriving a number which fits the training data set response variable values most closely. I am sure there is a better way to
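For context, a sketch of what trying several step sizes might look like with LinearRegressionWithSGD (the data, iteration count, and candidate step sizes are made up; this illustrates the search, not a recommended tuning recipe):

  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
  import org.apache.spark.mllib.linalg.Vectors

  val training = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(1.0)),
    LabeledPoint(2.0, Vectors.dense(2.0)),
    LabeledPoint(3.0, Vectors.dense(3.0)))).cache()

  // Try a few step sizes and compare training MSE
  for (step <- Seq(0.001, 0.01, 0.1, 1.0)) {
    val model = LinearRegressionWithSGD.train(training, 100, step)
    val mse = training.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
    println(s"stepSize=$step MSE=$mse")
  }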

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Rishi Yadav
Programmatically specifying a schema needs import org.apache.spark.sql.types._ for StructType and StructField to resolve. On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen wrote: > Yes I think this was already just fixed by: > > https://github.com/apache/spark/pull/4977 > > a ".toDF()" is missing > >
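A minimal sketch of the programmatic-schema path (Spark 1.3-era API; the field names and data are illustrative, and an existing sc and sqlContext are assumed):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)))

  val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
  val df = sqlContext.createDataFrame(rows, schema)
  df.printSchema()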

Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data? On Sun, Mar 15, 2015 at 8:51 PM, Rohit U wrote: > Hi, > > I am trying to run LogisticRegressionWithSGD on RDD of LabeledPoints > loaded using loadLibSVMFile: > > val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, > "s3n://logistic-regression/epsilon_norma

Re: How to use DataFrame with MySQL

2015-03-23 Thread Rishi Yadav
For me, it only works if I set --driver-class-path to the MySQL library. On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang wrote: > OK,I found what the problem is: It couldn't work with > mysql-connector-5.0.8. > I updated the connector version to 5.1.34 and it worked. > > > > -- > View this message

Re: Spark Streaming Twitter Example Error

2014-08-21 Thread Rishi Yadav
Please add the following three libraries to your classpath: spark-streaming-twitter_2.10-1.0.0.jar twitter4j-core-3.0.3.jar twitter4j-stream-3.0.3.jar On Thu, Aug 21, 2014 at 1:09 PM, danilopds wrote: > Hi! > > I'm beginning with the development in Spark Streaming.. And I'm learning > with the e
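One way to put those jars on the classpath at launch time (the main class and jar locations are illustrative; adjust versions to match your build):

  spark-submit \
    --jars spark-streaming-twitter_2.10-1.0.0.jar,twitter4j-core-3.0.3.jar,twitter4j-stream-3.0.3.jar \
    --class com.example.TwitterStreamingApp \
    my-app.jar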

Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using "hdfs dfs -getmerge ..." On Friday, October 17, 2014, Sean Owen wrote: > You can save to a local file. What are you trying and what doesn't work? > > You can output one file by repartitioning to 1 partition but this is > probably not a good ide
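A hedged sketch of the two steps (paths are illustrative):

  // Step 1, in Spark: write the RDD to HDFS (one part file per partition)
  rdd.saveAsTextFile("hdfs:///tmp/output")

  Step 2, from the shell: merge the part files into one local file
  hdfs dfs -getmerge /tmp/output /local/path/output.txt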

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Rishi Yadav
Hi Tridib, I changed SQLContext to HiveContext and it started working. These are the steps I used. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val person = sqlContext.jsonFile("json/person.json") person.printSchema() person.registerTempTable("person") val address = sqlContext.jsonF
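A self-contained sketch of that sequence through to the join (the address file path, column names, and join keys are assumptions, since the original message is truncated):

  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val person = sqlContext.jsonFile("json/person.json")
  person.registerTempTable("person")
  val address = sqlContext.jsonFile("json/address.json")
  address.registerTempTable("address")

  // Cache both tables, then join them with SQL
  sqlContext.cacheTable("person")
  sqlContext.cacheTable("address")
  val joined = sqlContext.sql(
    "SELECT p.name, a.city FROM person p JOIN address a ON p.id = a.person_id")
  joined.collect().foreach(println)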

Re: Bug in Accumulators...

2014-10-25 Thread Rishi Yadav
works fine. Spark 1.1.0 on REPL On Sat, Oct 25, 2014 at 1:41 PM, octavian.ganea wrote: > There is for sure a bug in the Accumulators code. > > More specifically, the following code works well as expected: > > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("EL LBP SP

Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
did you create SQLContext? On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary wrote: > I have same requirement of passing list of values to in clause, when i am > trying to do > > i am getting below error > > scala> val longList = Seq[Expression]("a", "b") > :11: error: type mismatch; > found :

Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
A simple scala> val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m) should work. Replace "mmdd" with the format fechau3m is actually in. If you want to do it at the case class level: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) //HiveContext always a good idea import s
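A sketch of doing the parse while mapping raw lines to a case class (the file path, field positions, and "yyyyMMdd" pattern are assumptions; substitute the real format of fechau3m):

  import java.text.SimpleDateFormat

  case class Record(id: String, fechau3m: java.util.Date)

  val fmt = "yyyyMMdd" // replace with the actual format of fechau3m
  val records = sc.textFile("s3n://bucket/path/data.csv").map { line =>
    val fields = line.split(",")
    // parse the date column while building the case class
    Record(fields(0), new SimpleDateFormat(fmt).parse(fields(1)))
  }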

Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, you can always specify a minimum number of partitions, and that would force some parallelism (assuming you have enough cores). On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa wrote: > What if the window is of 5 seconds, and the file takes longer than 5 > seconds to be completely scanned? It will

Re: join 2 tables

2014-11-12 Thread Rishi Yadav
please use join syntax. On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos < franco.barrien...@exalitica.com> wrote: > I have 2 tables in a hive context, and I want to select one field of each > table where id’s of each table are equal. For example, > > > > *val tmp2=sqlContext.sql("select a.ult_
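A hedged example of join syntax in the SQL string (table and column names are illustrative, since the original query is truncated):

  // Select one field from each table where the ids match
  val tmp2 = sqlContext.sql(
    "SELECT a.field1, b.field2 FROM table_a a JOIN table_b b ON a.id = b.id")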

Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in HDFS, you are reading it as textFile, and each file is smaller than the block size, my understanding is that you would always get one partition per file. On Thursday, November 13, 2014, Daniel Siegmann wrote: > Would it make sense to read each file in as a separate RDD? This way you > woul
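A quick way to check this (the directory path is illustrative):

  val rdd = sc.textFile("hdfs:///data/small-files/*")
  // With many files each smaller than an HDFS block,
  // this typically prints one partition per file
  println(rdd.partitions.size)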

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
How about using the fluent style of Scala programming? On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini wrote: > Let's say I have to apply a complex sequence of operations to a certain > RDD. > In order to make code more modular/readable, I would typically have > something like this: > > object myO
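A small sketch of what the fluent (chained) style could look like next to the intermediate-val style (the operations themselves are illustrative):

  // Intermediate-val style
  val words  = sc.textFile("hdfs:///data/input").flatMap(_.split(" "))
  val pairs  = words.map(w => (w, 1))
  val counts = pairs.reduceByKey(_ + _)

  // Fluent style: chain the same operations
  val counts2 = sc.textFile("hdfs:///data/input")
    .flatMap(_.split(" "))
    .map(w => (w, 1))
    .reduceByKey(_ + _)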

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread Rishi Yadav
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in replacement. On Monday, November 24, 2014, riginos wrote: > OK thank you very much for that! > On 23 Nov 2014 21:49, "Denny Lee [via Apache Spark User List]" <[hidden > email]

Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
You can try this (Scala version; you can convert it to Python): val set = initial.groupBy(x => if (x == something) "key1" else "key2") This would do one pass over the original data. On Fri, Nov 28, 2014 at 8:21 AM, mrm wrote: > Hi, > > My question is: > > I have multiple filter operations where I split my i
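A slightly expanded sketch of the same idea (the input data and predicate are illustrative): the single groupBy pass routes every element to a key, and each group can then be pulled out separately.

  val initial = sc.parallelize(1 to 10)
  val something = 5

  // One pass over the data: every element lands under "key1" or "key2"
  val grouped = initial.groupBy(x => if (x == something) "key1" else "key2")

  val matches = grouped.filter(_._1 == "key1").flatMap(_._2)
  val rest    = grouped.filter(_._1 == "key2").flatMap(_._2)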

Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset? On Thursday, November 27, 2014, Praveen Sripati wrote: > Hi, > > When I run the below program, I see two files in the HDFS because the > number of partitions in 2. But, one of the file is empty. Why is it so? Is > the work not distributed equally to all the tasks?

Re: Cached RDD

2014-12-30 Thread Rishi Yadav
Without caching, the lineage is recomputed for each action. So assuming rdd2 and rdd3 result in separate actions, the answer is yes. On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet wrote: > If I have 2 RDDs which depend on the same RDD like the following: > > val rdd1 = ... > > val rdd2 = rdd1.groupBy()... > > val rdd3
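A sketch of where the cache would go (the transformations are illustrative, not the ones from the question):

  val rdd1 = sc.textFile("hdfs:///data/input").cache() // cache the shared parent

  val rdd2 = rdd1.map(_.length)
  val rdd3 = rdd1.filter(_.nonEmpty)

  // Two separate actions; without the cache() above,
  // rdd1's lineage would be recomputed for each of them
  rdd2.count()
  rdd3.count()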

Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-01 Thread Rishi Yadav
Hi Aniket, The optional number-of-partitions value is there to increase the number of partitions, not to reduce it below the default. On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I am trying to read a file into a single partition but it seems like > sparkContext.textFil
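For illustration (the path is made up): minPartitions can raise the partition count, and coalescing afterwards is one way to get down to a single partition if that is really wanted.

  val rdd = sc.textFile("hdfs:///data/input", 8)   // at least 8 partitions
  println(rdd.partitions.size)

  val single = rdd.coalesce(1)                     // explicitly reduce to 1
  println(single.partitions.size)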

Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin, Say A has 10 ids, so you are pulling data from B's data source only for these 10 ids? What if you load A and B as separate SchemaRDDs and then do a join? Spark will optimize the path anyway when the action is fired. On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin wrote: > Hi, All > > > > Supp

Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
You can also access the SparkConf using sc.getConf in the Spark shell, though for the StreamingContext you can directly refer to sc as Akhil suggested. On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das wrote: > In the shell you could do: > > val ssc = new StreamingContext(sc, Seconds(1)) > > as sc is the SparkContext
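In the shell, that might look like the following (the batch interval is illustrative):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Reuse the shell's SparkContext and inspect its configuration
  println(sc.getConf.toDebugString)

  val ssc = new StreamingContext(sc, Seconds(1))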

Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs do not get replicated by default; the underlying data does if it's in HDFS. On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek wrote: > Hi, > > I want to find the time taken for replicating an rdd in spark cluster > along with the computation time on the replicated rdd. > > Can some

Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a PairRDD, taking the field you are going to aggregate on as the key. On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh wrote: > Hi, > I have a csv file having fields as a,b,c . > I want to do aggregation(sum,average..) based on any field(a,b or c)
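A small sketch of that approach for a CSV with fields a,b,c, aggregating c grouped by a (the file path, field positions, and parsing are assumptions; a Scala sketch, though the same shape works with JavaPairRDD via mapToPair):

  // Key each parsed line by the grouping field (a), with the numeric field (c) as the value
  val pairs = sc.textFile("hdfs:///data/input.csv").map { line =>
    val Array(a, b, c) = line.split(",")
    (a, c.toDouble)
  }

  val sums = pairs.reduceByKey(_ + _)                       // sum of c per a
  val avgs = pairs.mapValues(v => (v, 1L))
    .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
    .mapValues { case (s, n) => s / n }                     // average of c per a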