Base metrics for Spark Benchmarking.

2015-04-16 Thread Bijay Pathak
Hello, We wanted to tune Spark running on a YARN cluster. The Spark History Server UI shows lots of parameters like: - GC time - Task Duration - Shuffle R/W - Shuffle Spill (Memory/Disk) - Serialization Time (Task/Result) - Scheduler Delay Among the above metrics, which are th

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Wang, Daoyuan
Can you tell us how you created the dataframe? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 17, 2015 2:52 AM To: rkrist Cc: user Subject: Re: ClassCastException processing date fields using spark SQL since 1.3.0 Filed: https://issues.apache.org/jira/browse/SPARK-

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Wang, Daoyuan
The conversion between date and int should be handled automatically by implicit conversion. So we are accepting date types externally, and they are represented as integers internally. From: Wang, Daoyuan Sent: Friday, April 17, 2015 11:00 AM To: 'Michael Armbrust'; rkrist Cc: user Subject: RE: ClassCastEx

Re: Spark SQL query key/value in Map

2015-04-16 Thread JC Francisco
Ah yeah, didn't notice that difference. Thanks! It worked. On Fri, Apr 17, 2015 at 4:27 AM, Yin Huai wrote: > For Map type column, fields['driver'] is the syntax to retrieve the map > value (in the schema, you can see "fields: map"). The syntax of > fields.driver is used for struct type. > > On

SparkR: Server IPC version 9 cannot communicate with client version 4

2015-04-16 Thread lalasriza .
Dear everyone, right now I am working with SparkR on cluster. The following are the package versions installed on the cluster: 1) Hadoop and Yarn: Hadoop 2.5.0-cdh5.3.1 Subversion http://github.com/cloudera/hadoop -r 4cda8416c73034b59cc8baafbe3666b074472846 Compiled by jenkins on 2015-01-28T0

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Ankur Dave
I'm the primary author of IndexedRDD. To answer your questions: 1. Operations on an IndexedRDD partition can only be performed from a task operating on that partition, since doing otherwise would require decentralized coordination between workers, which is difficult in Spark. If you want to perfor

Python 3 support for PySpark has been merged into master

2015-04-16 Thread Josh Rosen
Hi everyone, We just merged Python 3 support for PySpark into Spark's master branch (which will become Spark 1.4.0). This means that PySpark now supports Python 2.6+, PyPy 2.5+, and Python 3.4+. To run with Python 3, download and build Spark from the master branch then configure the PYSPARK_PYTH

mapPartitions() in Java 8

2015-04-16 Thread samirissa00
Hi, how do I convert this script to Java 8 with lambdas? My problem is that the function getactivations() returns a scala.collection.Iterator to mapPartitions(), which needs a java.util.Iterator ... Thanks! === // Step 1 - Stub code to copy into Spark S

Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal, spark.sql.shuffle.partitions is the property to control the number of tasks after shuffle (to generate t2, there is a shuffle for the aggregations specified by groupBy and agg.) You can use sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or sqlContext.sql("set spark.sql.sh
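A minimal sketch of both ways to set the property described above (the SalesJan2009 table is taken from the thread; the column names are only illustrative):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Either through the conf API...
    hiveContext.setConf("spark.sql.shuffle.partitions", "50")
    // ...or through a SQL statement
    hiveContext.sql("SET spark.sql.shuffle.partitions=50")
    // Aggregations after this point run with 50 post-shuffle tasks
    val t2 = hiveContext.sql("SELECT region, count(*) AS cnt FROM SalesJan2009 GROUP BY region")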

Re: Spark on Windows

2015-04-16 Thread Stephen Boesch
The Hadoop "support" from Hortonworks only *actually* works with Windows Server - well, at least as of Spark Summit last year, and AFAIK that has not changed since. 2015-04-16 15:18 GMT-07:00 Dean Wampler : > If you're running Hadoop, too, now that Hortonworks supports Spark, you > might be able

Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
(Indeed, though the OP said it was a requirement that the pairs are drawn from the same partition.) On Thu, Apr 16, 2015 at 11:14 PM, Guillaume Pitel wrote: > Hi Aurelien, > > Sean's solution is nice, but maybe not completely order-free, since pairs > will come from the same partition. > > The ea

Re: Spark on Windows

2015-04-16 Thread Dean Wampler
If you're running Hadoop, too, now that Hortonworks supports Spark, you might be able to use their distribution. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

Re: Random pairs / RDD order

2015-04-16 Thread Guillaume Pitel
Hi Aurelien, Sean's solution is nice, but maybe not completely order-free, since pairs will come from the same partition. The easiest / fastest way to do it in my opinion is to use a random key instead of a zipWithIndex. Of course you'll not be able to ensure uniqueness of each element of t
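A rough sketch (not from the thread) of the random-key idea, assuming an existing rdd and an arbitrary bucket count; as noted above, uniqueness within a pair is not guaranteed:

    import scala.util.Random

    val numBuckets = 1000
    // Key each element by a random bucket, then join the keyed RDD with itself
    // so that elements sharing a bucket form candidate pairs (an element may
    // end up paired with itself).
    val keyed = rdd.map(x => (Random.nextInt(numBuckets), x))
    val randomPairs = keyed.join(keyed).values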

dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Neal Yin
I have some trouble controlling the number of Spark tasks for a stage. This is on the latest Spark 1.3.x source code build. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) sc.getConf.get("spark.default.parallelism") -> set to 10 val t1 = hiveContext.sql("FROM SalesJan2009 select * ") val

RE: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Thanks, but we need a firm statement, preferably from somebody from the Spark vendor Databricks, including an answer to the specific question posed by me and an assessment/confirmation of whether this is a production-ready / quality library which can be used for general-purpose RDDs, not just inside the

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Koert Kuipers
I believe it is a generalization of some classes inside GraphX, where there was/is a need to keep stuff indexed for random access within each RDD partition. On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov wrote: > Can somebody from Databricks shed more light on this IndexedRDD library > > https://

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Here is the list of my dependencies : *libraryDependencies ++= Seq(* * "org.apache.spark" %% "spark-core" % sparkVersion % "provided", "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion, "org.iq80.leveldb" % "leveldb" % "0.7", "com.

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Richard Marscher
If it fails with sbt-assembly but not without it, then there's always the likelihood of a classpath issue. What dependencies are you rolling up into your assembly jar? On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa wrote: > Any ideas ? > > On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Yes, simply look for partitionBy in the javadoc for e.g. JavaPairRDD. From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Thursday, April 16, 2015 9:57 PM To: Evo Eftimov Cc: Wang, Ningjun (LNG-NPV); user Subject: Re: How to join RDD keyValuePairs efficiently Does this same functiona

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Can you please guide me on how I can extend RDD and convert it in the way you are suggesting? On 16 April 2015 at 23:46, Jeetendra Gangele wrote: > In type T I already have Object ... I have an RDD and then I am > calling zipWithIndex on this RDD and getting an RDD; on this I am > running mapToPair and con

AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Can somebody from Databricks shed more light on this IndexedRDD library https://github.com/amplab/spark-indexedrdd It seems to come from AMP Labs and most of the Databricks guys are from there. What is especially interesting is whether the Point Lookup (and the other primitives) can work fr

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Jeetendra Gangele
Does this same functionality exist with Java? On 17 April 2015 at 02:23, Evo Eftimov wrote: > You can use > > def partitionBy(partitioner: Partitioner): RDD[(K, V)] > Return a copy of the RDD partitioned using the specified partitioner > > The https://github.com/amplab/spark-indexedrdd stuff lo

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-16 Thread N B
Hi Guillaume, Interesting that you brought up Shuffle. In fact we are experiencing this issue of shuffle files being left behind and not being cleaned up. Since this is a Spark streaming application, it is expected to stay up indefinitely, so shuffle files being left is a big problem right now. Si

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
You can use def partitionBy(partitioner: Partitioner): RDD[(K, V)] Return a copy of the RDD partitioned using the specified partitioner The https://github.com/amplab/spark-indexedrdd stuff looks pretty cool and is something which adds valuable functionality to spark e.g. the point lookups PRO

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Any ideas? On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa wrote: > Dear all, > > Here is an issue that drives me mad. I wrote a UserDefinedType in order to be > able to store a custom type in a parquet file. In my code I just create a > DataFrame with my custom data type and write it into a pa

When querying ElasticSearch, score is 0

2015-04-16 Thread Andrejs Abele
Hi, I have data in my ElasticSearch server. When I query it using the REST interface, I get results and a score for each result, but when I run the same query in Spark using the ElasticSearch API, I get results and metadata, but the score is shown as 0 for each record. My configuration is ... val conf = new

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Evo > partition the large doc RDD based on the hash function on the key i.e. the docid What API do I use to do this? By the way, loading the entire dataset into memory causes an OutOfMemory problem because it is too large (I only have one machine with 16GB and 4 cores). I found something called

Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For Map type column, fields['driver'] is the syntax to retrieve the map value (in the schema, you can see "fields: map"). The syntax of fields.driver is used for struct type. On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco wrote: > Hi, > > I'm new with both Cassandra and Spark and am experimentin
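A small illustration of the two syntaxes in Scala (the table and column names here are hypothetical):

    // "events" has a map column `fields` and a struct column `meta`
    val byMapKey = sqlContext.sql("SELECT fields['driver'] FROM events")  // map lookup: bracket syntax
    val byField  = sqlContext.sql("SELECT meta.driver FROM events")       // struct access: dot syntax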

Re: dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
Never mind. I found the solution: val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema) which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problems. Thanks, Cesar Flores On

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Akhil, any thoughts on this? On 16 April 2015 at 23:07, Jeetendra Gangele wrote: > No, I did not try the partitioning; below is the full code > > public static void matchAndMerge(JavaRDD > matchRdd,JavaSparkContext jsc) throws IOException{ > long start = System.currentTimeMillis(); > JavaPai

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Oh, just noticed that I missed "attach"... Yeah, your scripts will be helpful. Thanks! On Thu, Apr 16, 2015 at 12:03 PM, Arush Kharbanda < ar...@sigmoidanalytics.com> wrote: > Yes, i am able to reproduce the problem. Do you need the scripts to create > the tables? > > On Thu, Apr 16, 2015 at 10:5

Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
Empty folders generally mean that you just need to increase the window intervals; i.e. Spark Streaming saveAsTextFiles will save folders for each interval regardless. On Wed, Apr 15, 2015 at 5:03 AM, Shushant Arora wrote: > It's printing on the console but on HDFS all folders are still empty . > >

MLlib - Naive Bayes Problem

2015-04-16 Thread riginos
I have a big dataset of categories of cars and descriptions of cars. I want to give a description of a car and have the program classify the category of that car, so I decided to use multinomial naive Bayes. I created a unique id for each word and replaced my whole category/description data. //My

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Looks like that message would be triggered if spark.dynamicAllocation.initialExecutors was not set, or 0, if I read this right. Yeah, that might have to be positive. This requires you set initial executors to 1 if you want 0 min executors. Hm, maybe that shouldn't be an error condition in the args

Re: Spark on Windows

2015-04-16 Thread Arun Lists
Thanks, Matei! We'll try that and let you know if it works. You are correct in inferring that some of the problems we had were with dependencies. We also had problems with the spark-submit scripts. I will get the details from the engineer who worked on the Windows builds and provide them to you.

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 12:16:13PM -0700, Marcelo Vanzin wrote: I think Michael is referring to this: """ Exception in thread "main" java.lang.IllegalArgumentException: You must specify at least 1 executor! Usage: org.apache.spark.deploy.yarn.Client [options] """ Yes, sorry, there were too man

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Marcelo Vanzin
I think Michael is referring to this: """ Exception in thread "main" java.lang.IllegalArgumentException: You must specify at least 1 executor! Usage: org.apache.spark.deploy.yarn.Client [options] """ spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecut

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 08:10:54PM +0100, Sean Owen wrote: Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. How can 0 be a fine minimum if it's rejected? Changing the value is easy enough, but in general it's nice for defau

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. On Thu, Apr 16, 2015 at 8:09 PM, Michael Stone wrote: > On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: >> >> IIRC that was fixed already in 1.3 >> >> >> https://gith

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: IIRC that was fixed already in 1.3 https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b From that commit: + private val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0) ... + if (max

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Yes, i am able to reproduce the problem. Do you need the scripts to create the tables? On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai wrote: > Can your code that can reproduce the problem? > > On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda < > ar...@sigmoidanalytics.com> wrote: > >> Hi >> >> As pe

Re: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Michael Armbrust
Filed: https://issues.apache.org/jira/browse/SPARK-6967 Shouldn't they be null? > Statistics are only used to eliminate partitions that can't possibly hold matching values. So while you are right this might result in a false positive, that will not result in a wrong answer.

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
IIRC that was fixed already in 1.3 https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b On Thu, Apr 16, 2015 at 7:41 PM, Michael Stone wrote: > The default for spark.dynamicAllocation.minExecutors is 0, but that value > causes a runtime error and a message that the min

Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
Use mapPartitions, and then take two random samples of the elements in the partition, and return an iterator over all pairs of them? Should be pretty simple assuming your sample size n is smallish since you're returning ~n^2 pairs. On Thu, Apr 16, 2015 at 7:00 PM, abellet wrote: > Hi everyone, >
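A minimal sketch of this approach, assuming an existing rdd and a small per-partition sample size n (both are assumptions, not from the thread):

    import scala.util.Random

    val n = 10
    val pairs = rdd.mapPartitions { iter =>
      // materialize the partition, draw two small random samples, emit all cross pairs
      val elems = iter.toSeq
      val sampleA = Random.shuffle(elems).take(n)
      val sampleB = Random.shuffle(elems).take(n)
      (for (a <- sampleA; b <- sampleB) yield (a, b)).iterator
    }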

spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
The default for spark.dynamicAllocation.minExecutors is 0, but that value causes a runtime error and a message that the minimum is 1. Perhaps the default should be changed to 1? Mike Stone - To unsubscribe, e-mail: user-unsubs

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
At the distinct level I will have 7000 times more elements in my RDD. So should I repartition? Because its parent will definitely have fewer partitions. How do I see the number of partitions through Java code? On 16 April 2015 at 23:07, Jeetendra Gangele wrote: > No, I did not try the partitioning; below is

Re: StackOverflowError from KafkaReceiver when rate limiting used

2015-04-16 Thread Sean Owen
Yeah, this really shouldn't be recursive. It can't be optimized since it's not a final/private method. I think you're welcome to try a PR to un-recursivize it. On Thu, Apr 16, 2015 at 7:31 PM, Jeff Nadler wrote: > > I've got a Kafka topic on which lots of data has built up, and a streaming > app

StackOverflowError from KafkaReceiver when rate limiting used

2015-04-16 Thread Jeff Nadler
I've got a Kafka topic on which lots of data has built up, and a streaming app with a rate limit. During maintenance for example records will build up on Kafka and we'll burn them off on restart. The rate limit keeps the job stable while burning off the backlog. Sometimes on the first or second i

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Well normal RDDs can also be serialized if you select that type of Memory Persistence …. Ok thanks, so just to confirm: IF a “normal” RDD is not going to be persisted in-memory as Serialized objects (which would mean it has to be persisted as “actual/hydrated” objects) THEN there are no

Re: Super slow caching in 1.3?

2015-04-16 Thread Michael Armbrust
Here are the types that we specialize; other types will be much slower. This is only for Spark SQL; normal RDDs do not serialize data that is cached. I'll also note that until yesterday we were missing FloatType https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sq

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
In type T I already have Object ... I have an RDD and then I am calling zipWithIndex on this RDD and getting an RDD; on this I am running mapToPair and converting into an RDD so that I can use it later for other operations like lookup and join. On 16 April 2015 at 23:42, Ted Yu wrote: > The Long in RDD[(T,

Re: regarding ZipWithIndex

2015-04-16 Thread Ted Yu
The Long in RDD[(T, Long)] is a type parameter. You can create an RDD with Integer as the first type parameter. Cheers On Thu, Apr 16, 2015 at 11:07 AM, Jeetendra Gangele wrote: > Hi Ted. > This works for me. But Long here takes 8 bytes. Can I reduce it to 4 > bytes? It's just an index and I fee
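If the Int range is known to be sufficient, one option (a sketch, not from the thread) is simply to narrow the Long index after the fact:

    // zipWithIndex always yields a Long index; narrow it only if the RDD is
    // known to have fewer than Int.MaxValue elements.
    val indexedInt = rdd.zipWithIndex().map { case (v, idx) => (v, idx.toInt) }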

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Hi Ted. This works for me. But Long here takes 8 bytes. Can I reduce it to 4 bytes? It's just an index and I feel 4 bytes is more than enough. Is there any method which takes Integer or similar for the index? On 13 April 2015 at 01:59, Ted Yu wrote: > bq. will return something like JavaPairRDD

Random pairs / RDD order

2015-04-16 Thread abellet
Hi everyone, I have a large RDD and I am trying to create an RDD of a random sample of pairs of elements from this RDD. The elements composing a pair should come from the same partition for efficiency. The idea I've come up with is to take two random samples and then use zipPartitions to pair each

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Copy should be doable but I'm not sure how to specify a prefix for the directory while keeping the filename (i.e. part-0) fixed in the copy command. > On Apr 16, 2015, at 1:51 PM, Sean Owen wrote: > > Just copy the files? it shouldn't matter that much where they are as > you can find them easil

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Also, to juggle the multithreading model of both Spark and HDFS even further, you can publish the data from Spark first to a message broker, e.g. Kafka, from where a predetermined number (from 1 to infinity) of parallel consumers will retrieve and store it in HDFS in one or more finely controlled

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
Just copy the files? It shouldn't matter that much where they are, as you can find them easily. Or consider somehow sending the batches of data straight into Redshift? No idea how that is done, but I imagine it's doable. On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy wrote: > Thanks Sean. I want

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Evo for your detailed explanation. > On Apr 16, 2015, at 1:38 PM, Evo Eftimov wrote: > > The reason for this is as follows: > > 1. You are saving data on HDFS > 2. HDFS as a cluster/server side Service has a Single Writer / Multiple > Reader multithreading model > 3.

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei > On Apr 16, 2015, at 9:23 AM, Arun Lists wrote: > > We

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them wherever you want - use foreachPartition and then foreach. -Original Message- From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:39 PM To: Sean Owen Cc: user@spark.apache.or
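A minimal sketch of the foreachPartition-then-foreach pattern on a streaming source (dstream is assumed; println stands in for the real Redshift/S3 sink):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // open one connection/writer per partition here (e.g. a Redshift or S3 client);
        // println is only a stand-in for the real sink in this sketch
        partition.foreach(record => println(record))
      }
    }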

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed to work. Batches ar

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
The reason for this is as follows: 1. You are saving data on HDFS 2. HDFS as a cluster/server side Service has a Single Writer / Multiple Reader multithreading model 3. Hence each thread of execution in Spark has to write to a separate file in HDFS 4. Moreover the

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Sean. I want to load each batch into Redshift. What's the best/most efficient way to do that? Vadim > On Apr 16, 2015, at 1:35 PM, Sean Owen wrote: > > You can't, since that's how it's designed to work. Batches are saved > in different "files", which are really directories containing >

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
No I did not tried the partitioning below is the full code public static void matchAndMerge(JavaRDD matchRdd,JavaSparkContext jsc) throws IOException{ long start = System.currentTimeMillis(); JavaPairRDD RddForMarch =matchRdd.zipWithIndex().mapToPair(new PairFunction, Long, MatcherReleventDat

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
You can't, since that's how it's designed to work. Batches are saved in different "files", which are really directories containing partitions, as is common in Hadoop. You can move them later, or just read them where they are. On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy wrote: > I am using S
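For reference, a sketch of how the per-batch directories come about in Spark Streaming (lines is an assumed DStream of strings, and the output path is an assumption):

    // Each batch interval is written to a directory named <prefix>-<batch time>.<suffix>,
    // containing part-* files and a _SUCCESS marker, as described above.
    lines.saveAsTextFiles("s3://my-bucket/output/batch", "txt")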

saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
I am using Spark Streaming where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing 2 objects, "_SUCCESS" and "part-0." How do I output each batch into a common directory? Thanks, Vadim ᐧ

Re: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Sean Owen
I don't think there's anything specific to CDH that you need to know, other than it ought to set things up sanely for you. Sandy did a couple posts about tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ http://blog.cloudera.com/blog/2015/03/how-to-tune-your-

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Essentially, to change the performance yield of a software cluster infrastructure platform like Spark, you play with different permutations of: - Number of CPU cores used by Spark Executors on every cluster node - Amount of RAM allocated for each executor How disks and net

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Thanks Evo. Yes, my concern is only regarding the infrastructure configurations. Basically, configuring YARN (Node Manager) + Spark is a must and the default settings never work. And what really happens is that we make changes as and when an issue is faced because of one of the numerous default configurat

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can your code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda wrote: > Hi > > As per JIRA this issue is resolved, but i am still facing this issue. > > SPARK-2734 - DROP TABLE should also uncache table > > > -- > > [image: Sigmoid Analytics]

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael what exactly do you mean by "flattened" version/structure here e.g.: 1. An Object with only primitive data types as attributes 2. An Object with no more than one level of other Objects as attributes 3. An Array/List of primitive types 4. An Array/List of Objects This question is in ge

Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
Hi Michael, Good question! We checked 1.2 and found that it is also slow caching the same flat parquet file. Caching other file formats of the same data was faster by up to a factor of ~2. Note that the parquet file was created in Impala but the other formats were written by Spark SQL. Cheers,

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Well, there are a number of performance tuning guidelines in dedicated sections of the Spark documentation - have you read and applied them? Secondly, any performance problem within a distributed cluster environment has two aspects: 1. Infrastructure 2. App Algorithms You s

General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Hi, Is there a document/link that describes the general configuration settings to achieve maximum Spark performance while running on CDH5? In our environment, we made a lot of changes (and are still doing so) to get decent performance; otherwise our 6-node dev cluster with default configurations lags

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can plug in native Hadoop input formats with Spark's sc.newAPIHadoopFile etc., which takes in the input format. Thanks Best Regards On Thu, Apr 16, 2015 at 10:15 PM, Shushant Arora wrote: > Is it for spark? > > On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das > wrote: > >> You can simply overr
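A minimal sketch of wiring an input format in through the new Hadoop API (TextInputFormat stands in for the custom format; the path is an assumption):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val records = sc.newAPIHadoopFile(
      "hdfs:///data/input",
      classOf[TextInputFormat],   // a custom InputFormat subclass would go here
      classOf[LongWritable],
      classOf[Text])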

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Can you paste your complete code? Did you try repartitioning/increasing the level of parallelism to speed up the processing, since you have 16 cores? I'm assuming your 400k records aren't bigger than a 10G dataset. Thanks Best Regards On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele wrote: > I

Custom partioner

2015-04-16 Thread Jeetendra Gangele
Hi All, I have an RDD which has 1 million keys and each key is repeated for around 7000 values, so in total there will be around 1M*7K records in the RDD. Each key is created from zipWithIndex, so keys run from 0 to M-1. The problem with zipWithIndex is it takes a Long for the key, which is 8 bytes. Can I redu

Re: custom input format in spark

2015-04-16 Thread Shushant Arora
Is it for spark? On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das wrote: > You can simply override the isSplitable method in your custom inputformat > class and make it return false. > > Here's a sample code snippet: > > > http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-h

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can simply override the isSplitable method in your custom inputformat class and make it return false. Here's a sample code snippet: http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop#answers-header Thanks Best Regards On Thu, Apr 16, 2015 at 4:18 PM, Shush
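A bare-bones sketch of that override (the class name is made up for illustration):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Prevent input files from being split; each file becomes a single split/partition.
    class WholeFileTextInputFormat extends TextInputFormat {
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    }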

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
I already checked and GC is taking 1 sec for each task. Is this too much? If yes, how do I avoid it? On 16 April 2015 at 21:58, Akhil Das wrote: > Open the driver UI and see which stage is taking time; you can look at > whether it's adding any GC time etc. > > Thanks > Best Regards > > On Thu, Apr 16

Re: MLLib SVMWithSGD : java.lang.OutOfMemoryError: Java heap space

2015-04-16 Thread Akhil Das
Try increasing your driver memory. Thanks Best Regards On Thu, Apr 16, 2015 at 6:09 PM, sarath wrote: > Hi, > > I'm trying to train an SVM on KDD2010 dataset (available from libsvm). But > I'm getting "java.lang.OutOfMemoryError: Java heap space" error. The > dataset > is really sparse and have

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Open the driver UI and see which stage is taking time; you can look at whether it's adding any GC time etc. Thanks Best Regards On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele wrote: > Hi All, I have the code below where distinct is running for a long time. > > blockingRdd is the combination of and

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Akhil Das
You could try repartitioning your RDD using a custom partitioner (HashPartitioner etc) and caching the dataset into memory to speedup the joins. Thanks Best Regards On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > I have an RDD that contains milli

Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Hi All, I have the code below where distinct is running for a long time. blockingRdd is the combination of and it will have 400K records JavaPairRDD completeDataToprocess=blockingRdd.flatMapValues( new Function>(){ @Override public Iterable call(String v1) throws Exception { return ckdao.getSingelkey

Spark on Windows

2015-04-16 Thread Arun Lists
We run Spark on Mac and Linux but also need to run it on Windows 8.1 and Windows Server. We ran into problems with the Scala 2.10 binary bundle for Spark 1.3.0 but managed to get it working. However, on Mac/Linux, we are on Scala 2.11.6 (we built Spark from the sources). On Windows, however despit

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Ningjun, to speed up your current design you can do the following: 1. Partition the large doc RDD based on the hash function on the key i.e. the docid 2. Persist the large dataset in memory to be available for subsequent queries without reloading and repartitioning for every search query 3. Parti
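A minimal sketch of steps 1-2 in Scala (docRdd and the partition count are assumptions; Document is the domain class mentioned in the thread):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // docRdd: RDD[(String, Document)] keyed by docid
    val partitioned = docRdd
      .partitionBy(new HashPartitioner(100))
      .persist(StorageLevel.MEMORY_ONLY)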

dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
I have a data frame in which I load data from a hive table. And my issue is that the data frame is missing the columns that I need to query. For example: val newdataset = dataset.where(dataset("label") === 1) gives me an error like the following: ERROR yarn.ApplicationMaster: User class threw e

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Sean Owen
This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set. On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) wrote: > Does anybody have a solution for this? > > > > > > From: Wang
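A hedged sketch of that suggestion, assuming an RDD of (docId, doc) pairs and a driver-side set of ids (broadcast so it isn't shipped with every task):

    val ids: Set[String] = Set("doc-1", "doc-42")   // assumed example ids
    val idsBc = sc.broadcast(ids)
    val matched = docRdd.filter { case (docId, _) => idsBc.value.contains(docId) }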

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I n

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
Well you can have a two level index structure, still without any need for physical cluster node awareness Level 1 Index is the previously described partitioned [K,V] RDD – this gets you to the value (RDD element) you need on the respective cluster node Level 2 Index – it will be built and

Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
Thanks a lot for the reply. Indeed it is useful, but to be more precise, I have 3D data and want to index it using an octree. Thus I aim to build a two-level indexing mechanism, i.e. first at the global level I want to partition and send the data to the nodes, then at the node level I again want to use an octree to

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
Well you can use a [Key, Value] RDD and partition it based on hash function on the Key and even a specific number of partitions (and hence cluster nodes). This will a) index the data, b) divide it and send it to multiple nodes. Re your last requirement - in a cluster programming environment/fram

Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
I want to use Spark functions/APIs to do this task. My basic purpose is to index the data and divide and send it to multiple nodes. Then at the time of access I want to reach the right node and data partition. I don't have any clue how to do this. Thanks, On Thu, Apr 16, 2015 at 5:13 PM, Evo Ef

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
How do you intend to "fetch the required data" - from within Spark or using an app / code / module outside Spark -Original Message- From: mas [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:08 PM To: user@spark.apache.org Subject: Data partitioning and node tracking in Spa

Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread mas
I have a big data file and I aim to create an index on the data. I want to partition the data based on a user-defined function in Spark-GraphX (Scala). Further, I want to keep track of the node to which a particular data partition is sent and on which it is being processed, so I can fetch the required data by accessing the

Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Dear all, Here is an issue that drives me mad. I wrote a UserDefinedType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write it into a parquet file. When I run my code directly inside IDEA everything works like a charm

[ThriftServer] Urgent -- very slow Metastore query from Spark

2015-04-16 Thread Yana Kadiyska
Hi Sparkers, hoping for insight here: running a simple describe mytable here where mytable is a partitioned Hive table. Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 ​ Whereas Hive over the same

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use the classpath.first or perhaps copy the jar to the slaves could that actually do the trick? The latter isn't really all that efficient but just curious if that could do the trick. On Thu, Apr 16, 2015 at 7:14 AM ARose wrote: > I take it back. My so

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread ARose
I take it back. My solution only works when you set the master to "local". I get the same error when I try to run it on the cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22525.html Sent from the Ap

[SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Hi As per JIRA this issue is resolved, but i am still facing this issue. SPARK-2734 - DROP TABLE should also uncache table -- [image: Sigmoid Analytics] *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.c
