Saving intermediate results in mapPartitions

2016-03-18 Thread Krishna
Hi, I've a situation where the number of elements output by each partition from mapPartitions don't fit into the RAM even with the lowest number of rows in the partition (there is a hard lower limit on this value). What's the best way to address this problem? During the mapPartition phase, is ther

Re: Dataframe fails for large resultsize

2016-04-29 Thread Krishna
I recently encountered similar network related errors and was able to fix it by applying the ethtool updates described here [ https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085] On Friday, April 29, 2016, Buntu Dev wrote: > Just to provide more details, I have 200 blocks (parq

destroyPythonWorker job in PySpark

2016-06-23 Thread Krishna
Hi, I am running a PySpark app with 1000's of cores (partitions is a small multiple of # of cores) and the overall application performance is fine. However, I noticed that, at the end of the job, PySpark initiates job clean-up procedures and as part of this procedure, PySpark executes a job shown

Window range in Spark

2016-01-26 Thread Krishna
Hi, We receive bursts of data with sequential ids and I would like to find the range for each burst-window. What's the best way to find the "window" ranges in Spark? Input --- 1 2 3 4 6 7 8 100 101 102 500 700 701 702 703 704 Output (window start, window end) ---
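
One way to compute such burst windows (a sketch, not from the original thread; assumes the spark-shell `sc` and that ids within a burst are unique): sort the ids, pair each with its position, and group on the difference `id - position`, which stays constant within a run of consecutive ids.

    val ids = sc.parallelize(Seq(1L, 2L, 3L, 4L, 6L, 7L, 8L, 100L, 101L, 102L, 500L, 700L, 701L, 702L, 703L, 704L))
    val windows = ids.sortBy(identity).zipWithIndex()
      .map { case (id, pos) => (id - pos, id) }    // consecutive ids share the same difference
      .groupByKey()
      .map { case (_, run) => (run.min, run.max) } // (window start, window end)
    windows.collect().sorted.foreach(println)
    // (1,4), (6,8), (100,102), (500,500), (700,704)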

Maintain state outside rdd

2016-01-27 Thread Krishna
Hi, I've a scenario where I need to maintain state that is local to a worker that can change during map operation. What's the best way to handle this?

    incr = 0
    def row_index():
        global incr
        incr += 1
        return incr
    out_rdd = inp_rdd.map(lambda x: row_index()).collect()

"out_rdd" i

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
at 6:25 PM, Ted Yu wrote: > Have you looked at this method ? > >* Zips this RDD with its element indices. The ordering is first based > on the partition index > ... > def zipWithIndex(): RDD[(T, Long)] = withScope { > > On Wed, Jan 27, 2016 at 6:03 PM, Krishna wrote: > &
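
A minimal sketch of the zipWithIndex approach Ted points to (assumes the spark-shell `sc`; the RDD contents are made up):

    val inpRdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    // Indices are assigned by partition index first, then by position within each partition,
    // so no mutable per-worker counter is needed.
    val indexed = inpRdd.zipWithIndex()             // RDD[(String, Long)]
    indexed.collect().foreach { case (value, idx) => println(s"$idx -> $value") }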

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
oring state in external > NoSQL store such as hbase ? > > On Wed, Jan 27, 2016 at 6:37 PM, Krishna wrote: > >> Thanks; What I'm looking for is a way to see changes to the state of some >> variable during map(..) phase. >> I simplified the scenario in my example by m

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
> > http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka > , > it also tells you how to define your own for custom data types. > > On Wed, Jan 27, 2016 at 7:22 PM, Krishna > wrote: > > mapPartitions(...) seems like a good candidate, since
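
A small sketch of the accumulator route discussed above (Spark 1.x API, spark-shell `sc` assumed); accumulator values are only dependable when read on the driver after an action has run:

    val processed = sc.accumulator(0)               // Spark 2.x would use sc.longAccumulator
    val out = sc.parallelize(1 to 100).map { x =>
      processed += 1                                // side-effecting update, visible to the driver
      x * 2
    }
    out.count()                                     // accumulators update only when an action runs
    println(s"rows processed: ${processed.value}")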

Merge rows into csv

2015-12-08 Thread Krishna
Hi, what is the most efficient way to perform a group-by operation in Spark and merge rows into csv? Here is the current RDD:

    ID STATE
    1  TX
    1  NY
    1  FL
    2  CA
    2  OH

This is the required output: --
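
A sketch of one way to do the merge (not from the thread; spark-shell `sc` assumed). reduceByKey concatenates per key on the map side, which avoids shipping whole groups the way groupByKey does; note the order of the merged values is not guaranteed.

    val rows = sc.parallelize(Seq((1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")))
    val merged = rows.reduceByKey((a, b) => a + "," + b)         // 1 -> "TX,NY,FL", 2 -> "CA,OH"
    merged.map { case (id, states) => s"$id,$states" }.collect().foreach(println)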

JdbcRDD

2014-11-18 Thread Krishna
Hi, Are there any examples of using JdbcRDD in java available? It's not clear what the last argument is in this example ( https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ): sc = new SparkContext("local", "test") val rdd = new JdbcRDD( sc, ()
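
For reference, the trailing argument in that constructor is the row-mapping function (ResultSet => T). A Scala sketch with made-up connection details (spark-shell `sc` assumed; later releases also added a Java-friendly JdbcRDD.create, if memory serves):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:h2:mem:testdb"),   // hypothetical JDBC URL
      "SELECT ID, NAME FROM PEOPLE WHERE ID >= ? AND ID <= ?",   // must carry the two bind parameters
      1L, 100L, 3,                                               // lower bound, upper bound, partitions
      (rs: ResultSet) => (rs.getInt(1), rs.getString(2))         // the last argument: maps one ResultSet row
    )
    println(rdd.count())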

Re: JdbcRDD

2014-11-18 Thread Krishna
Thanks Kidong. I'll try your approach. On Tue, Nov 18, 2014 at 4:22 PM, mykidong wrote: > I had also same problem to use JdbcRDD in java. > For me, I have written a class in scala to get JdbcRDD, and I call this > instance from java. > > for instance, JdbcRDDWrapper.scala like this: > > ... > >

Tuning spark job to make count faster.

2021-04-05 Thread Krishna Chakka
Hi, I am working on a spark job. It takes 10 mins just for the count() function. Question is: how can I make it faster? From the above image, what I understood is that 4001 tasks are running in parallel. Total tasks are 76,553. Here are the parameters that I am using for

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip it with the points RDD to get approx. same as LP. Cheers On Thu, May 21, 2015 at 6:19 PM, anneywarlord wrote: > Hello, > > New to Spark. I wanted to know if it is possible to use a Labeled Point RDD > in org.apache.spark.mllib.clustering.KMeans. After I cluster my d
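
A sketch of the predict-then-zip idea (MLlib RDD API, spark-shell `sc` assumed, toy data):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
    val model = KMeans.train(points, 2, 20)           // k = 2, maxIterations = 20
    val labeled = model.predict(points).zip(points)   // (clusterId, point), roughly a LabeledPoint shape
    labeled.collect().foreach(println)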

Writing data to hbase using Sparkstreaming

2015-06-12 Thread Vamshi Krishna
Hi I am trying to write data that is produced from kafka commandline producer for some topic. I am facing problem and unable to proceed. Below is my code which I am creating a jar and running through spark-submit on spark-shell. Am I doing wrong inside foreachRDD() ? What is wrong with SparkKafkaD

Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro wrote: > 1.4 and I did set the second parameter. The DSL works fine but trying
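
The "column with value 2" experiment would look roughly like this in Scala (a sketch, spark-shell `sqlContext` assumed; column names are made up):

    import org.apache.spark.sql.functions.{pow, lit}

    val df = sqlContext.createDataFrame(Seq((1, 3.0), (2, 4.0))).toDF("id", "x")
    df.select(df("id"), pow(df("x"), lit(2)).alias("x_squared")).show()   // the pow(col, col) form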

Re: making dataframe for different types using spark-csv

2015-07-01 Thread Krishna Sankar
- use .cast("...").alias('...') after the DataFrame is read. - sql.functions.udf for any domain-specific conversions. Cheers On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid wrote: > Hi experts! > > > I am using spark-csv to load csv data into dataframe. By default it makes > type of each
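
A sketch of the cast/alias step (spark-csv 1.x style, spark-shell `sqlContext` assumed; the path and column names are made up):

    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/people.csv")                        // everything is read as strings by default
    val typed = raw.select(
      raw("name"),
      raw("age").cast("int").alias("age"),
      raw("salary").cast("double").alias("salary"))
    typed.printSchema()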

import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row Cheers & enjoy the long weekend

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna < leonida.gianfa...@gmail.com> wrote: > Thanks a lot oubrik, > > I got your point, my consideration is that sum() should be already a > built-in function for iterators in python. > Anyway I trie

Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale, and data volume. 1. If you can store the data in a single server or a small cluster of db servers (say mysql) then hdfs/Spark might be overkill 2. If you can run the computation/process the data on a single machine

Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, I am new to this field; I want to implement a new ML algo using Spark MLlib. What is the procedure? -- Regards, Ram Krishna KT

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, How to add a new ML algo in Spark MLlib? On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna wrote: > Hi All, > > I am new to this this field, I want to implement new ML algo using Spark > MLlib. What is the procedure. > > -- > Regards, > Ram Krishna KT > > >

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
maller first project building new functionality in > Spark as a good starting point rather than adding a new algorithm right > away, since you learn a lot in the process of making your first > contribution. > > On Friday, June 10, 2016, Ram Krishna wrote: > >> Hi All, >>

Re: RBM in mllib

2016-06-14 Thread Krishna Kalyan
Hi Robert, According to the JIRA, the resolution is "Won't Fix". The pull request was closed as it did not merge cleanly with master. (https://github.com/apache/spark/pull/3222) On Tue, Jun 14, 2016 at 4:23 PM, Roberto Pagliari wrote: > Is RBM being developed? > > This one is marked as resolved,

Error Running SparkPi.scala Example

2016-06-15 Thread Krishna Kalyan
andler not found - continuing with a stub. Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not found - continuing with a stub. Warning:scalac: Class com.google.common.collect.ImmutableMap not found - continuing with a stub. /Users/krishna/Experiment/spark/external/flume-sink/src/

how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Hi, I'm using Spark 1.4.1 (HDP 2.3.2). As per the spark-csv documentation (https://github.com/databricks/spark-csv), I see that we can write to a csv file in compressed form using the 'codec' option. But, didn't see the support for 'codec' option to read a csv file. Is there a way to read a compr

Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Thanks. It works. On Thu, Jun 16, 2016 at 5:32 PM Hyukjin Kwon wrote: > It will 'auto-detect' the compression codec by the file extension and then > will decompress and read it correctly. > > Thanks! > > 2016-06-16 20:27 GMT+09:00 Vamsi Krishna : > >> Hi, &
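
In code, that auto-detection means a plain load of the .gz path is enough (a sketch, spark-shell `sqlContext` assumed, hypothetical path):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///data/input.csv.gz")   // decompressed transparently based on the file extension
    df.show(5)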

Re: Error Running SparkPi.scala Example

2016-06-17 Thread Krishna Kalyan
;, "SPARK_SUBMIT" -> "true", "spark.driver.cores" -> "5", "spark.ui.enabled" -> "false", "spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass", "spark.jars" -> "file:

spark streaming - how to purge old data files in data directory

2016-06-18 Thread Vamsi Krishna
Hi, I'm on HDP 2.3.2 cluster (Spark 1.4.1). I have a spark streaming app which uses 'textFileStream' to stream simple CSV files and process. I see the old data files that are processed are left in the data directory. What is the right way to purge the old data files in data directory on HDFS? Tha

Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all, Just wanted to thank all for the dataset API - most of the times we see only bugs in these lists ;o). - Putting some context, this weekend I was updating the SQL chapters of my book - it had all the ugliness of SchemaRDD, registerTempTable, take(10).foreach(println) and take

Unsubscribe

2016-06-19 Thread Ram Krishna
Hi Sir, Please unsubscribe me -- Regards, Ram Krishna KT

How to write the DataFrame results back to HDFS with other than \n as record separator

2016-06-24 Thread Radha krishna
) using Java. Can anyone suggest? Note: I need to use something other than \n because my data contains \n as part of the column value. Thanks & Regards Radha krishna

How to write the DataFrame results back to HDFS with other than \n as record separator

2016-06-28 Thread Radha krishna
o hdfs with the same line separator (RS[\u001e]) Thanks & Regards Radha krishna

Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
Hi All, Please check below for the code, input, and output. I think the output is not correct; am I missing anything? Please guide. Code public class Test { private static JavaSparkContext jsc = null; private static SQLContext sqlContext = null; private static Configuration hadoopConf = null; pu

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
> sqlContext.createDataFrame(empRDD, Emp.class);
> empDF.registerTempTable("EMP");
> sqlContext.sql("SELECT * FROM EMP e LEFT OUTER JOIN DEPT d ON e.deptid = d.deptid").show();
> //empDF.join(deptDF,empDF.col("deptid").equalTo(deptDF.col("deptid")),"leftouter").show();;
> }
> catch(Exception e){
> System.out.println(e);
> }
> }
> public static Emp getInstance(String[] parts, Emp emp) throws ParseException {
> emp.setId(parts[0]);
> emp.setName(parts[1]);
> emp.setDeptid(parts[2]);
> return emp;
> }
> public static Dept getInstanceDept(String[] parts, Dept dept) throws ParseException {
> dept.setDeptid(parts[0]);
> dept.setDeptname(parts[1]);
> return dept;
> }
> }
>
> Input
> Emp
> 1001 aba 10
> 1002 abs 20
> 1003 abd 10
> 1004 abf 30
> 1005 abg 10
> 1006 abh 20
> 1007 abj 10
> 1008 abk 30
> 1009 abl 20
> 1010 abq 10
>
> Dept
> 10 dev
> 20 Test
> 30 IT
>
> Output
> +------+-----+----+------+--------+
> |deptid|   id|name|deptid|deptname|
> +------+-----+----+------+--------+
> |    10| 1001| aba|    10|     dev|
> |    10| 1003| abd|    10|     dev|
> |    10| 1005| abg|    10|     dev|
> |    10| 1007| abj|    10|     dev|
> |    10| 1010| abq|    10|     dev|
> |    20| 1002| abs| null |  null  |
> |    20| 1006| abh| null |  null  |
> |    20| 1009| abl| null |  null  |
> |    30| 1004| abf| null |  null  |
> |    30| 1008| abk| null |  null  |
> +------+-----+----+------+--------+
>
> --
> Best Regards,
> Ayan Guha

--
Thanks & Regards Radha krishna

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
Hi Mich, Here I have given just sample data; I have some GBs of files in HDFS and am performing left outer joins on those files, and the final result I am going to store in a Vertica database table. There are no duplicate columns in the target table, but for the non-matching rows' columns I want to insert "

IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
AS| | 16|    | | 13|  UK| | 14|  US| | 20|  As| | 15|  IN| | 19|  IR| | 11|  PK| +---+----+

I am expecting the below one; any idea how to apply IS NOT NULL?

    +---+----+
    |_c0|code|
    +---+----+
    | 18|  AS|
    | 13|  UK|
    | 14|  US|
    | 20|  As|
    | 15|  IN|
    | 19|  IR|
    | 11|  PK|
    +---+----+

Thanks & Regards Radha krishna

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
OK, thank you. How do I achieve the requirement? On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen wrote: > It doesn't look like you have a NULL field, You have a string-value > field with an empty string. > > On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna wrote: > > Hi All,IS NOT

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
I want to apply null comparison to a column in sqlcontext.sql, is there any way to achieve this? On Jul 10, 2016 8:55 PM, "Radha krishna" wrote: > Ok thank you, how to achieve the requirement. > > On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen wrote: > >> It doesn'
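
Since the column carries empty strings rather than NULLs, the filter has to check both conditions (a sketch; the table and column names are assumptions, and trim() can be added if the values may contain whitespace):

    val result = sqlContext.sql(
      "SELECT _c0, code FROM mytable WHERE code IS NOT NULL AND code != ''")
    result.show()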

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue. VG, One workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, probably detect the NaN and recommend most popular on some window. HTH. Cheers On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
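
The workaround in code looks roughly like this (Spark ML; the fitted model and test frame names are assumptions):

    import org.apache.spark.ml.evaluation.RegressionEvaluator

    val predictions = alsModel.transform(testDF)            // alsModel: fitted ALSModel, testDF: held-out ratings
    val cleaned = predictions.na.drop(Seq("prediction"))    // drop rows where ALS produced NaN
    val rmse = new RegressionEvaluator()
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")
      .evaluate(cleaned)
    println(s"RMSE without NaN rows: $rmse")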

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Krishna Sankar
This intrigued me as well. - Just for sure, I downloaded the 1.6.2 code and recompiled. - spark-shell and pyspark both show 1.6.2 as expected. Cheers On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Another possible explanation is that by accide

Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
cified by query) value. I've experimented with using (abusing) Spark Streaming, by streaming queries and running these against the cached RDD. However, as I say I don't think that this is an intended use-case of Streaming. Cheers, Krishna

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
more on the use case? It looks a little bit > like an abuse of Spark in general . Interactive queries that are not > suitable for in-memory batch processing might be better supported by ignite > that has in-memory indexes, concept of hot, warm, cold data etc. or hive on > tez+ll

Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Krishna Rao
Hi all, Is there a method for reading from s3 without having to hard-code keys? The only 2 ways I've found both require this: 1. Set conf in code e.g.: sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "") sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "") 2. Set keys in URL, e.g.:
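
One hedged option is to pull the keys from the shell environment at runtime instead of the source (the Hadoop property names below are the s3n ones); on EC2, IAM instance roles with the s3a connector can avoid keys entirely, configuration permitting. Sketch, spark-shell `sc` assumed and bucket name hypothetical:

    // Keys come from the environment, so nothing sensitive lives in the code or the URL.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val lines = sc.textFile("s3n://some-bucket/some-prefix/")
    println(lines.count())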

streaming application redundant dag stage execution/performance/caching

2016-02-16 Thread krishna ramachandran
re is no dstream level "unpersist" setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. Still seeing out of memory errors Krishna

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread krishna ramachandran
en see out of memory error regards Krishna On Thu, Feb 18, 2016 at 4:54 AM, Ted Yu wrote: > bq. streamingContext.remember("duration") did not help > > Can you give a bit more detail on the above ? > Did you mean the job encountered OOME later on ? > > Which Spark re

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
least until convergence) But am seeing same centers always for the entire duration - ran the app for several hours with a custom receiver. Yes I am using the latestModel to predict using "labeled" test data. But also like to know where my centers are regards Krishna On Fri, Feb 19, 201

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
, 6706.05424139] and monitor. please let know if I missed something Krishna On Fri, Feb 19, 2016 at 10:59 AM, Bryan Cutler wrote: > Can you share more of your code to reproduce this issue? The model should > be updated with each batch, but can't tell what is happening from what y

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
Also the cluster centroid I get in streaming mode (some with negative values) do not make sense - if I use the same data and run in batch KMeans.train(sc.parallelize(parsedData), numClusters, numIterations) cluster centers are what you would expect. Krishna On Fri, Feb 19, 2016 at 12:49 PM

Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Krishna Sankar
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works. Cheers On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu wrote: > Krishna: > If you want to query ORC files, see the following JIRA: > > [SPARK-10623] [SQL] Fixes ORC predicate push-down > > which is in the 1.5.1

Re: Spark MLib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark: b) It also gives the ability to wrangle with data in R, that is in the Spark eco system c) Coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale, dat

HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Krishna Sankar
Guys, - We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The current wisdom is that it will support the 1.4.x train (which is good, need DataFrame et al). - What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on HDP 2.3 ? Or will Spark 1.5.x support be in HD

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Looks like the test-dataset has different sizes for X & Y. Possible steps: 1. What is the test-data-size ? - If it is 15,909, check the prediction variable vector - it is now 29,471, should be 15,909 - If you expect it to be 29,471, then the X Matrix is not right.

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
does the data split and the datasets where they are allocated to. Cheers On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar wrote: > Hi, > Looks like the test-dataset has different sizes for X & Y. Possible > steps: > >1. What is the test-data-size ? > - If i

Contributing to PySpark

2016-10-18 Thread Krishna Kalyan
Hello, I am a masters student. Could someone please let me know how to set up my dev environment to contribute to PySpark? Questions I had were: a) Should I use IntelliJ IDEA or PyCharm? b) How do I test my changes? Regards, Krishna


Unsubscribe

2016-12-16 Thread krishna ramachandran
Unsubscribe

Unable to build spark documentation

2017-01-11 Thread Krishna Kalyan
://gist.github.com/krishnakalyan3/08f00f49a943e43600cbc6b21f307228 Could someone please advise on how to go about resolving this error? Regards, Krishna

Structured Streaming on Kubernetes

2018-04-13 Thread Krishna Kalyan
is a relatively new addition to spark, I was wondering if structured streaming is stable in production. We were also evaluating Apache Beam with Flink. Regards, Krishna

Re: Structured Streaming on Kubernetes

2018-04-16 Thread Krishna Kalyan
. >>> >>> >>> >>> However, I’m unaware of any specific use of streaming with the Spark on >>> Kubernetes integration right now. Would be curious to get feedback on the >>> failover behavior right now. >>> >>> >>> >

unsubscribe

2020-01-17 Thread vijay krishna

unsubscribe

2020-04-24 Thread vijay krishna

Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen, Scala 2.11 worked fine for me. Did the dev change and then compile. Not using in production, but I go back and forth between 2.10 & 2.11. Cheers On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman < stephen.haber...@gmail.com> wrote: > Hey, > > I recently compiled Spark master against

Re: randomSplit instead of a huge map & reduce ?

2015-02-21 Thread Krishna Sankar
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair being the key) would work - looks like a "mapReduce with combiners" problem. I think reduceByKey would use combiners while aggregateByKey wouldn't. - Could we optimize this further by using combineByKey directly

Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
1. The RSME varies a little bit between the versions. 2. Partitioned the training,validation,test set like so: - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8) - test = ratin

Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
what is a good number for a recommendation engine ? Cheers On Tue, Feb 24, 2015 at 1:03 AM, Guillaume Charhon < guilla...@databerries.com> wrote: > I am using Spark 1.2.1. > > Thank you Krishna, I am getting almost the same results as you so it must > be an error in the tut

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size, computation & storage requirements ... : - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine. Probably 5-10 machines. - Don't go for the most exotic machines, otoh don't go for cheapest ones either. - Find a sweet spot with your ve

Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Krishna Sankar
Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers On Fri, Mar 20, 2015 at 3:45 PM, cong yue wrote: > Hello : > > I tried ipython notebook with the following command in my enviroment. > > PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVE

Re: column expression in left outer join for DataFrame

2015-03-24 Thread S Krishna
Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter( df("event") === 0) . select("country", "cnt") val df_2 = df.filter( df("event") === 3) . select("country", "cnt

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
. select("country", "cnt").as("b") > val both = df_2.join(df_1, $"a.country" === $"b.country"), "left_outer") > > > > On Tue, Mar 24, 2015 at 11:57 PM, S Krishna wrote: > >> Hi, >> >> Thanks for your
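
Putting the suggestion together (a sketch; assumes spark-shell with `import sqlContext.implicits._` and a DataFrame `df` with event, country, and cnt columns):

    val df1 = df.filter(df("event") === 0).select("country", "cnt").as("a")
    val df2 = df.filter(df("event") === 3).select("country", "cnt").as("b")
    // Aliasing both sides lets the join condition name each country column unambiguously.
    val both = df2.join(df1, $"b.country" === $"a.country", "left_outer")
    both.show()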

Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work. Interesting in more than one ways - including training, benchmarking, testing new releases et al. One quick question - do you plan to make it available as an S3 bucket ? Cheers On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle wrote: > Dear Spark users, > > I would l

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to download and add this. Cheers On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer wrote: > Afternoon all, > > I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: > > `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipT

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
One reason could be that the keys are in a different region. Need to create the keys in us-east-1-North Virginia. Cheers On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer wrote: > Hi, > > I am trying to launch an EC2 cluster from spark using the following > command: > > ./spark-ec2 -k HackerPa

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
curity group rules must specify protocols > explicitly.7ff92687-b95a-4a39-94cb-e2d00a6928fd > > This sounds like it could have to do with the access settings of the > security group, but I don't know how to change. Any advice would be much > appreciated! > > Sam > > ---

Re: Spark Usecase

2014-06-04 Thread Krishna Sankar
Shahab, Interesting question. Couple of points (based on the information from your e-mail) 1. One can support the use case in Spark as a set of transformations on a WIP RDD over a span of time and the final transformation outputting to a processed RDD - Spark streaming would be a

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Krishna Sankar
Project->Properties->Java Build Path->Add External Jars Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar Cheers On Sun, Jun 8, 2014 at 8:06 AM, Carter wrote: > Hi All, > > I just downloaded the Scala IDE for Eclipse. After I created a Spark > project > and clicked "Run

Re: problem starting the history server on EC2

2014-06-10 Thread Krishna Sankar
Yep, it gives tons of errors. I was able to make it work with sudo. Looks like ownership issue. Cheers On Tue, Jun 10, 2014 at 6:29 PM, zhen wrote: > I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I > do not seem to be able to start the history server on the master n

Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: Given a csv file like:

    d1,c1,a1
    d1,c1,a2
    d1,c2,a1
    d1,c1,a1
    d2,c1,a3
    d2,c2,a1
    d3,c1,a1
    d3,c3,a1
    d3,c2,a1
    d3,c3,a2

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5): val pairs = new PairRDDFunctions() works fine locally. Now I can do groupByKey et al. I'm not sure if it is scalable for millions of records & memory efficient. Cheers On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar wrote: > Hi, >Would appreciat

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
And got the first cut: val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size)) gives the total & unique. The question : is it scalable & efficient ? Would appreciate insights. Cheers On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar wrote: > Answ

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Krishna Sankar
Ian, Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a wrapper around the com.clearspring.analytics.stream.cardinality.HyperLogLogPlus. Cheers On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell wrote: > Depending on your requirements when doing hourly metrics calculating >
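
A sketch of the HLL-backed counting (spark-shell `sc` assumed, toy pairs): total rows per key via reduceByKey, approximate unique values per key via countApproxDistinctByKey, which avoids building a full per-key Set in memory.

    val pairs = sc.parallelize(Seq(("d1,c1", "a1"), ("d1,c1", "a2"), ("d1,c1", "a1"), ("d2,c1", "a3")))
    val totals  = pairs.mapValues(_ => 1L).reduceByKey(_ + _)    // exact total per key
    val uniques = pairs.countApproxDistinctByKey(0.05)           // ~5% relative error
    totals.join(uniques).collect().foreach(println)              // (key, (total, approx uniques))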

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
Mahesh, - One direction could be : create a parquet schema, convert & save the records to hdfs. - This might help https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala Cheers On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc

Re: question about setting SPARK_CLASSPATH IN spark_env.sh

2014-06-17 Thread Krishna Sankar
Santhosh, All the nodes should have access to the jar file and so the classpath should be on all the nodes. Usually it is better to rsync the conf directory to all nodes rather than editing them separately. Cheers On Tue, Jun 17, 2014 at 9:26 PM, santhoma wrote: > Hi, > > This is about spar

Re: Spark Processing Large Data Stuck

2014-06-21 Thread Krishna Sankar
Hi, - I have seen similar behavior before. As far as I can tell, the root cause is the out of memory error - verified this by monitoring the memory. - I had a 30 GB file and was running on a single machine with 16GB. So I knew it would fail. - But instead of raising an exce

Re: Interconnect benchmarking

2014-06-29 Thread Krishna Sankar
- After loading large RDDs that are > 60-70% of the total memory, (k,v) operations like finding uniques/distinct, GroupByKey and SetOperations would be network bound. - A multi-stage Map-Reduce DAG should be a good test. When we tried this for Hadoop, we used examples from Genomics.

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin, 1. You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the nodes would have hdfs & YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should

Re: Spark Installation

2014-07-07 Thread Krishna Sankar
Couldn't find any reference of CDH in pom.xml - profiles or the hadoop.version. I am also wondering how the cdh compatible artifact was compiled. Cheers On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S wrote: > Hi All, > > Does anyone know what the command line arguments to mvn are to generate > the

Re: Understanding how to install in HDP

2014-07-09 Thread Krishna Sankar
Abel, I rsync the spark-1.0.1 directory to all the nodes. Then whenever the configuration changes, rsync the conf directory. Cheers On Wed, Jul 9, 2014 at 2:06 PM, Abel Coronado Iruegas < acoronadoirue...@gmail.com> wrote: > Hi everybody > > We have hortonworks cluster with many nodes, we wa

Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Krishna Sankar
Nick, AFAIK, you can compile with yarn=true and still run spark in stand alone cluster mode. Cheers On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis wrote: > Hello, > > I am currently learning Apache Spark and I want to see how it integrates > with an existing Hadoop Cluster. > > My cu

Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers On Tue, Jul 8, 2014 at 6:24 PM, Robert James wrote: > I have a Spark app which runs well on local master. I'm now ready to > put it on a cluster. What needs to be ins

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
One vector to check is the HBase libraries in the --jars as in : spark-submit --class --master --jars hbase-client-0.98.3-hadoop2.jar,commons-csv-1.0-SNAPSHOT.jar,hbase-common-0.98.3-hadoop2.jar,hbase-hadoop2-compat-0.98.3-hadoop2.jar,hbase-it-0.98.3-hadoop2.jar,hbase-protocol-0.98.3-hadoop2.jar,

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
not if you have other configurations that might have conflicted with >>> those. >>> >>> Could you try the following, remove anything that is spark specific >>> leaving only hbase related codes. uber jar it and run it just like any >>> other simple java

Re: Out of any idea

2014-07-19 Thread Krishna Sankar
Probably you have - if not, try a very simple app in the docker container and make sure it works. Sometimes resource contention/allocation can get in the way. This happened to me in the YARN container. Also try single worker thread. Cheers On Sat, Jul 19, 2014 at 2:39 PM, boci wrote: > Hi guys

Re: Spark as a application library vs infra

2014-07-27 Thread Krishna Sankar
- IMHO, #2 is preferred as it could work in any environment (Mesos, Standalone et al). While Spark needs HDFS (for any decent distributed system) YARN is not required at all - Mesos is a lot better. - Also managing the app with appropriate bootstrap/deployment framework is more flexi

Re: Spark webUI - application details page

2014-08-29 Thread Sudha Krishna
I specified as follows: spark.eventLog.dir /mapr/spark_io We use mapr fs for sharing files. I did not provide an ip address or port number - just the directory name on the shared filesystem. On Aug 29, 2014 8:28 AM, "Brad Miller" wrote: > How did you specify the HDFS path? When i put > > spark

Re: mllib performance on mesos cluster

2014-09-24 Thread Sudha Krishna
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster from 17 min to around 6 min. The scheduler delay per task reduced from 40 ms to around 10 ms. thanks On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng wrote: > 1) MLlib 1.1 should be faster than 1.0 in general. What's th

MLlib 1.2 New & Interesting Features

2014-09-27 Thread Krishna Sankar
Guys, - Need help in terms of the interesting features coming up in MLlib 1.2. - I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con - "The Hitchhiker's Guide to Machine Learning with Python & Apache Spark"[2] - At minimum, it would be good to take the last 30 mi

Re: MLlib 1.2 New & Interesting Features

2014-09-29 Thread Krishna Sankar
Thanks Xiangrui. Appreciate the insights. I have uploaded the initial version of my presentation at http://goo.gl/1nBD8N Cheers On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng wrote: > Hi Krishna, > > Some planned features for MLlib 1.2 can be found via Spark JIRA: > http://bi

MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Guys, Obviously I am doing something wrong. May be 4 points are too small a dataset. Can you help me to figure out why the following doesn't work ? a) This works : data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ]

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
be 0.1 or 0.01? > > Best, > Burak > > - Original Message - > From: "Krishna Sankar" > To: user@spark.apache.org > Sent: Wednesday, October 1, 2014 12:43:20 PM > Subject: MLlib Linear Regression Mismatch > > Guys, >Obviously I am doing some
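
For the record, the fix being discussed amounts to passing a smaller SGD step size than the default of 1.0 (a sketch, MLlib RDD API, spark-shell `sc` assumed):

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0)),
      LabeledPoint(10.0, Vectors.dense(10.0)),
      LabeledPoint(20.0, Vectors.dense(20.0)),
      LabeledPoint(30.0, Vectors.dense(30.0))))
    // With the default stepSize of 1.0 the weights can diverge on unscaled data like this;
    // a smaller step keeps them from blowing up, though convergence still depends on
    // the iteration count and on feature scaling.
    val model = LinearRegressionWithSGD.train(data, 100, 0.01)   // numIterations, stepSize
    println(model.predict(Vectors.dense(40.0)))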

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi, I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia. This step only adds the support for Hadoop, Yarn, Hive et al in the spark executable. No need to run if one is not using them. Cheers On Thu, Oct 2, 2014 at 12:29 PM, danilopds wrote: > Hi tsingfu, > > I want to see me

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers P.S: What are you folks planning next ? O
