Re: Spark SQL configurations

2015-03-26 Thread Akhil Das
If you can share the stack trace, then we can give you proper guidelines. For running on YARN, everything is described here: https://spark.apache.org/docs/latest/running-on-yarn.html Thanks Best Regards On Fri, Mar 27, 2015 at 8:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Hello, > Can someone share me the li

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian wrote: > Hm, which version of Hadoop are you using? Actually there should also be > a _metadata file together with _common_metadata. I was using Hadoop 2.4.1 > btw. I'm not sure whether Hadoop version matters here,

Can spark sql read existing tables created in hive

2015-03-26 Thread ๏̯͡๏
I have a few tables that are created in Hive. I want to transform the data stored in these Hive tables using Spark SQL. Is this even possible? So far I have seen that I can create new tables using the Spark SQL dialect. However, when I run show tables or do desc hive_table it says table not found. I am now

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian
Hm, which version of Hadoop are you using? Actually there should also be a _metadata file together with _common_metadata. I was using Hadoop 2.4.1 btw. I'm not sure whether Hadoop version matters here, but I did observe cases where Spark behaves differently because of semantic differences of th

Add partition support in saveAsParquet

2015-03-26 Thread Jianshi Huang
Hi, does anyone have a similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: def saveAsParquet(path: String, partitionColumns: Seq[String]) -- Jianshi Huang LinkedIn: ji

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
Ok. I modified as per your suggestions export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar export HADOOP_CONF_DIR=/apache/hadoop/conf cd $SPARK_HOME ./bin/spark-sql -v --driver-class-path /apache/hadoop/sha

Re: SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
I have to use .lines.toArray.toSeq. A little tricky. Xi Shen about.me/davidshen On Fri, Mar 27, 2015 at 4:41 PM, Xi Shen wrote: > Hi, > > I want to load my data in this way: > > sc.wholeText

SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
Hi, I want to load my data in this way: sc.wholeTextFiles(opt.input) map { x => (x._1, x._2.lines.filter(!_.isEmpty).toSeq) } But I got java.io.NotSerializableException: scala.collection.Iterator$$anon$13 But if I use "x._2.split('\n')", I can get the expected result. I want to know what's wr
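A minimal sketch of the two working variants from this thread (the materialized-lines fix from the reply above, and the split('\n') fallback); `opt.input` is the path from the original snippet:

    val rdd = sc.wholeTextFiles(opt.input).map { case (path, content) =>
      // content.lines is an Iterator, which is not serializable; force it into a strict collection
      (path, content.lines.filter(_.nonEmpty).toArray.toSeq)
    }

    // equivalent, avoiding the iterator entirely
    val rdd2 = sc.wholeTextFiles(opt.input).map { case (path, content) =>
      (path, content.split('\n').filter(_.nonEmpty).toSeq)
    }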

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
In that case you can directly use count-min-sketch from algebird...they work fine with Spark aggregateBy, but I have found the Java BPQ from Spark much faster than, say, the algebird Heap data structure... On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden wrote: > You could also consider using a count

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Charles Hayden
You could also consider using a count-min data structure such as in https://github.com/laserson/dsq to get approximate quantiles, then use whatever values you want to filter the original sequence. From: Debasish Das Sent: Thursday, March 26, 2015 9:45 PM To:

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
The idea is to use a heap and get topK elements from every partition...then use aggregateBy and for combOp do a merge routine from mergeSort...basically get 100 items from partition 1, 100 items from partition 2, merge them so that you get sorted 200 items and take 100...for merge you can use heap as w
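A rough sketch of that shape in Scala, assuming an RDD[(segment, score)] and that the X percent has already been turned into a per-segment k (e.g. from countByKey); a real implementation would use a bounded heap such as the BPQ mentioned above, a trimmed sorted Seq just keeps the sketch short:

    import org.apache.spark.rdd.RDD

    // top-k scores per segment via aggregateByKey
    def topKPerSegment(scores: RDD[(String, Double)], k: Int): RDD[(String, Seq[Double])] =
      scores.aggregateByKey(Seq.empty[Double])(
        (acc, s)     => (acc :+ s).sortBy(-_).take(k),    // fold each score into the running top-k
        (acc1, acc2) => (acc1 ++ acc2).sortBy(-_).take(k) // combOp: merge two partitions' top-k lists
      )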

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi Debasish, Thanks for your suggestions. In-memory version is quite useful. I do not quite understand how you can use aggregateBy to get 10% top K elements. Can you please give an example? Thanks, Aung On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das wrote: > You can do it in-memory as wellg

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
Hey Deepak, It seems that your hive-site.xml says your Hive metastore setup is using MySQL. If that's not the case, you need to adjust your hive-site.xml configurations. As for the version of MySQL driver, it should match the MySQL server. Cheng On 3/27/15 11:07 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I d

Re: Cross-compatibility of YARN shuffle service

2015-03-26 Thread Sandy Ryza
Hi Matt, I'm not sure whether we have documented compatibility guidelines here. However, a strong goal is to keep the external shuffle service compatible so that many versions of Spark can run against the same shuffle service. -Sandy On Wed, Mar 25, 2015 at 6:44 PM, Matt Cheah wrote: > Hi ever

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well...get 10% topK elements from each partition and use merge from any sort algorithm like timsort...basically aggregateBy. Your version uses shuffle but this version is 0 shuffle...assuming your data set is cached you will be using in-memory allReduce through treeAggre

Re: What is best way to run spark job in "yarn-cluster" mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Noorul Islam K M
Sandy Ryza writes: > Creating a SparkContext and setting master as yarn-cluster unfortunately > will not work. > > SPARK-4924 added APIs for doing this in Spark, but won't be included until > 1.4. > > -Sandy > Did you look into something like [1]? With that you can make rest API call from your j

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
Hi Cheng, on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces: peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx total 32 -rwxrwxrwx 1 peilunlee staff 0 Mar 27 11:29 _SUCCESS* -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-

How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi all, I have a distribution represented as an RDD of tuples, in rows of (segment, score) For each segment, I want to discard tuples with top X percent scores. This seems hard to do in Spark RDD. A naive algorithm would be - 1) Sort RDD by segment & score (descending) 2) Within each segment, nu

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
Yang Chen writes: > Hi Noorul, > > Thank you for your suggestion. I tried that, but ran out of memory. I did > some search and found some suggestions > that we should try to avoid rdd.union( > http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what are you using? The error you are seeing is common when the correct driver that allows Spark to connect to the Hive metastore isn't there. As well, I noticed that you're using SPARK_CLAS

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
I do not use MySQL; I want to read Hive tables from Spark SQL and transform them in Spark SQL. Why do I need a MySQL driver? If I still need it, which version should I use? Assuming I need it, I downloaded the latest version of it from http://mvnrepository.com/artifact/mysql/mysql-connector-java/5

k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi, I have a large data set, and I expect to get 5000 clusters. I load the raw data, convert them into DenseVector, then I repartition and cache; finally I give the RDD[Vector] to KMeans.train(). Now the job is running, and data are loaded. But according to the Spark UI, all data are loaded
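For what it's worth, a sketch of the repartition-then-cache pattern that usually spreads the k-means work across executors (the path and parsing are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val vectors = sc.textFile("hdfs:///path/to/data")                // placeholder input
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))   // placeholder parsing
      .repartition(sc.defaultParallelism)                            // spread blocks over all executor cores
      .cache()                                                       // k-means makes many passes

    val model = KMeans.train(vectors, 5000, 500)                     // k = 5000, maxIterations = 500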

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
As the exception suggests, you don't have MySQL JDBC driver on your classpath. On 3/27/15 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I am unable to run spark-sql form command line. I attempted the following 1) export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export SPARK_JAR=$SPARK
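A minimal sketch of the usual fix (the jar path and connector version are placeholders; as noted elsewhere in this thread, the driver version should match the MySQL server):

    ./bin/spark-sql --driver-class-path /path/to/mysql-connector-java-<version>-bin.jar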

Spark SQL configurations

2015-03-26 Thread ๏̯͡๏
Hello, can someone share the list of commands (including export statements) that you use to run Spark SQL over a YARN cluster? I am unable to get it running on my YARN cluster and am running into exceptions. I understand I need to share the specific exception. This is more like I want to know if I have

spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
I am unable to run spark-sql from the command line. I attempted the following: 1) export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-E

Re: HQL function Rollup and Cube

2015-03-26 Thread ๏̯͡๏
Did you manage to connect to the Hive metastore from Spark SQL? I copied the hive conf file into the Spark conf folder, but when I run show tables, or do select * from dw_bid (dw_bid is stored in Hive), it says table not found. On Thu, Mar 26, 2015 at 11:43 PM, Chang Lim wrote: > Solved. In IDE, project se

Re: Missing an output location for shuffle. : (

2015-03-26 Thread 李铖
Here is the worker track. 15/03/26 16:05:47 INFO Worker: Asked to kill executor app-20150326160534-0005/1 15/03/26 16:05:47 INFO ExecutorRunner: Runner thread for executor app-20150326160534-0005/1 interrupted 15/03/26 16:05:47 INFO ExecutorRunner: Killing process! 15/03/26 16:05:47 ERROR FileAppe

Re: WordCount example

2015-03-26 Thread Saisai Shao
Hi, Did you run the word count example in Spark local mode or another mode? In local mode you have to set local[n], where n >= 2. For other modes, make sure the number of available cores is larger than 1, because the receiver inside Spark Streaming runs as a long-running task, which will occupy at least one core. Be
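For reference, a minimal sketch of the network word count with a master string that satisfies that constraint (host and port are the usual example values, not taken from this thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: one core for the socket receiver, at least one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)   // assumes `nc -lk 9999` is running
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()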

Re: Missing an output location for shuffle. : (

2015-03-26 Thread 李铖
Here is the work track (a flattened table; the columns appear to be IP, MAC, counters and uptime, but the counter column boundaries were lost in extraction):
172.100.8.26  00:01:40:01:32:d1  10          20 h 1 min 43 s
172.100.8.33  00:01:40:01:34:a6  2001521520  20 h 1 min 49 s
172.100.8.37  00:01:40:01:34:a8  10016800    20 h 1 min 52 s
172.100.8.58  00:01:40:01:34:f5  1008400     20 h 1 min 53 s
172.100.8.79  00:01:40:01:2a:24  20          20 h 1 min 53 s
172.100.8.81  00:01:40:01:1c:0c  100168840   20 h 1 min 53 s

FetchFailedException during shuffle

2015-03-26 Thread Chen Song
Using spark 1.3.0 on cdh5.1.0, I was running a fetch failed exception. I searched in this email list but not found anything like this reported. What could be the reason for the error? org.apache.spark.shuffle.FetchFailedException: [EMPTY_INPUT] Cannot decompress empty stream at org.apach

Re: What is best way to run spark job in "yarn-cluster" mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Sandy Ryza
Creating a SparkContext and setting master as yarn-cluster unfortunately will not work. SPARK-4924 added APIs for doing this in Spark, but won't be included until 1.4. -Sandy On Tue, Mar 17, 2015 at 3:19 AM, Akhil Das wrote: > Create SparkContext set master as yarn-cluster then run it as a sta

Re: shuffle write size

2015-03-26 Thread Chen Song
Can anyone shed some light on this? On Tue, Mar 17, 2015 at 5:23 PM, Chen Song wrote: > I have a map reduce job that reads from three logs and joins them on some > key column. The underlying data is protobuf messages in sequence > files. Between mappers and reducers, the underlying raw byte arra

Difference behaviour of DateType in SparkSQL between 1.2 and 1.3

2015-03-26 Thread Wush Wu
Dear all, I am trying to upgrade Spark from 1.2 to 1.3 and switch the existing API for creating a SchemaRDD over to DataFrame. After testing, I noticed that the following behavior has changed: ``` import java.sql.Date import com.bridgewell.SparkTestUtils import org.apache.spark.rdd.RDD import org.apach

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak, After I added .repartition(sc.defaultParallelism), I can see from the log the partition number is set to 32. But in the Spark UI, it seems all the data are loaded onto one executor. Previously they were loaded onto 4 executors. Any idea? Thanks, David On Fri, Mar 27, 2015 at 11:01 A

Re: WordCount example

2015-03-26 Thread Mohit Anchlia
What's the best way to troubleshoot inside spark to see why Spark is not connecting to nc on port ? I don't see any errors either. On Thu, Mar 26, 2015 at 2:38 PM, Mohit Anchlia wrote: > I am trying to run the word count example but for some reason it's not > working as expected. I start "nc

Re: Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
in log I found this 2015-03-26 19:42:09,531 WARN org.eclipse.jetty.servlet.ServletHandler: Error for /history/application_1425934191900_87572 org.spark-project.guava.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: GC overhead limit exceeded at org.spark-project.guava.co

FakeClassTag in Java API

2015-03-26 Thread kmader
The Java API uses FakeClassTag for all of the implicit class tags fed to RDDs during creation, mapping, etc. I am working on a more generic Scala library where I won't always have the type information beforehand. Is it safe / accepted practice to use FakeClassTag in these situations as well? It was

Re: foreachRDD execution

2015-03-26 Thread Tathagata Das
Yes, that is the correct understanding. There are undocumented parameters that allow that, but I do not recommend using those :) TD On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez < langel.gro...@gmail.com> wrote: > I have a simple and probably dumb question about foreachRDD. > > We

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
How do I get the number of cores that I specified at the command line? I want to use "spark.default.parallelism". I have 4 executors, each has 8 cores. According to https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior, the "spark.default.parallelism" value will be 4 * 8 = 32...

Re: Can't access file in spark, but can in hadoop

2015-03-26 Thread Ted Yu
Looks like the following assertion failed: Preconditions.checkState(storageIDsCount == locs.size()); locs is List Can you enhance the assertion to log more information ? Cheers On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson wrote: > There seems to be a special kind of "corrupted according

RE: Date and decimal datatype not working

2015-03-26 Thread BASAK, ANANDA
Thanks all. I am installing Spark 1.3 now. Thought that I should better sync with the daily evolution of this new technology. So once I install that, I will try to use the Spark-CSV library. Regards Ananda From: Dean Wampler [mailto:deanwamp...@gmail.com] Sent: Wednesday, March 25, 2015 1:17 PM
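For reference, a sketch of reading a CSV through the spark-csv package on Spark 1.3 (the package coordinates, option names, and file name are illustrative; check the spark-csv README for your version):

    // spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "data.csv", "header" -> "true", "delimiter" -> "|"))
    df.printSchema()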

Re: Spark History Server : jobs link doesn't open

2015-03-26 Thread Marcelo Vanzin
bcc: user@, cc: cdh-user@ I recommend using CDH's mailing list whenever you have a problem with CDH. That being said, you haven't provided enough info to debug the problem. Since you're using CM, you can easily go look at the History Server's logs and see what the underlying error is. On Thu, M

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Xi Shen
It is brought in by another dependency, so you do not need to specify it explicitly...I think this is what Ted meant. On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia wrote: > +spark-dev > > Yes, the dependencies are there. I guess my question is how come the build > is succeeding in the mainline th

Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2. The jobs link on the Spark History Server doesn't open and shows the following message: HTTP ERROR: 500 Problem accessing /history/application_1425934191900_87572. Reason: Server Error -- *Powered by Jetty:/

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
The code is very simple. val data = sc.textFile("very/large/text/file") map { l => // turn each line into dense vector Vectors.dense(...) } // the resulting data set is about 40k vectors KMeans.train(data, k=5000, maxIterations=500) I just killed my application. In the log I found this: 15/0

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Oh, the job I talked about has run for more than 11 hrs without a result...it doesn't make sense. On Fri, Mar 27, 2015 at 9:48 AM Xi Shen wrote: > Hi Burak, > > My iteration count is set to 500, but I think it should also stop once the > centroids converge, right? > > My Spark is 1.2.0, working in Windows

Recreating the Mesos/Spark paper's experiments

2015-03-26 Thread hbogert
Hi all, For my master thesis I will be characterising performance of two-level schedulers like Mesos and after reading the paper: https://www.cs.berkeley.edu/~alig/papers/mesos.pdf where Spark is also introduced I am wondering how some experiments and results came about. If this is not the plac

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
will do! I've got to clear with my boss what I can post and in what manner, but I'll definitely do what I can to put some working code out into the world so the next person who runs into this brick wall can benefit from all this :-D DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak, My iteration count is set to 500, but I think it should also stop once the centroids converge, right? My Spark is 1.2.0, working in Windows 64 bit. My data set is about 40k vectors, each vector has about 300 features, all normalised. All worker nodes have sufficient memory and disk space. Thanks

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
+spark-dev Yes, the dependencies are there. I guess my question is how come the build is succeeding in the mainline then, without adding these dependencies? On Thu, Mar 26, 2015 at 3:44 PM, Ted Yu wrote: > Looking at output from dependency:tree, servlet-api is brought in by the > following: > >

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Ted Yu
Looking at output from dependency:tree, servlet-api is brought in by the following: [INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile [INFO] | +- org.antlr:antlr:jar:3.2:compile [INFO] | +- com.googlecode.json-simple:json-simple:jar:1.1:compile [INFO] | +- org.yaml:snakeyaml:jar:1.

Re: K Means cluster with spark

2015-03-26 Thread Xi Shen
Hi Sandeep, I followed the DenseKMeans example which comes with the spark package. My total vectors are about 40k, and my k=500. All my code are written in Scala. Thanks, David On Fri, 27 Mar 2015 05:51 sandeep vura wrote: > Hi Shen, > > I am also working on k means clustering with spark. May

Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
Hi, We are trying to build spark 1.2 from source (tip of the branch-1.2 at the moment). I tried to build spark using the following command: mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package I encountered various missing class definition except

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks all for the quick response. Thanks. Zhan Zhang On Mar 26, 2015, at 3:14 PM, Patrick Wendell wrote: > I think we have a version of mapPartitions that allows you to tell > Spark the partitioning is preserved: > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639 We could also add a map function that does the same. Or you can just write your map using an iter
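A small sketch of both routes against a throwaway pair RDD (data is purely illustrative):

    // a pair RDD that already has a HashPartitioner from reduceByKey
    val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i)).reduceByKey(_ + _)

    // 1) mapValues only touches values, so the partitioner is kept
    val viaMapValues = pairs.mapValues(_ + 1)

    // 2) mapPartitions lets you assert that your function does not change the keys
    val viaMapPartitions = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v + 1) },
      preservesPartitioning = true)

    assert(viaMapValues.partitioner == pairs.partitioner)
    assert(viaMapPartitions.partitioner == pairs.partitioner)
    // a plain map drops the partitioner, which is what forces the second shuffle
    assert(pairs.map { case (k, v) => (k, v + 1) }.partitioner.isEmpty)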

Can't access file in spark, but can in hadoop

2015-03-26 Thread Dale Johnson
There seems to be a special kind of "corrupted according to Spark" state of file in HDFS. I have isolated a set of files (maybe 1% of all files I need to work with) which are producing the following stack dump when I try to sc.textFile() open them. When I try to open directories, most large direc

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
This is just a deficiency of the api, imo. I agree: mapValues could definitely be a function (K, V)=>V1. The option isn't set by the function, it's on the RDD. So you could look at the code and do this. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread andy petrella
That's purely awesome! Don't hesitate to contribute your notebook back to the spark notebook repo, even rough, I'll help clean it up if needed. The vagrant is also appealing 😆 Congrats! On Thu, Mar 26, 2015 at 22:22, David Holiday wrote: > w0t! that did it! t/y so much! > > I'm

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks Jonathan. You are right about rewriting the example. I mean providing such an option to developers so that it is controllable. The example may seem silly, and I don't know the use cases. But for example, if I also want to operate on both the key and value part to generate some new value with

Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
Hi, I used union() before and yes it may be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If yes, I tried the approach below to operate on one RDD only during the whole computation (Yes, I also saw that too many RDD hurt performance). Cha

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
I believe if you do the following: sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString (8) MapPartitionsRDD[34] at reduceByKey at :23 [] | MapPartitionsRDD[33] at mapValues at :23 [] | ShuffledRDD[32] at reduceByKey at :

RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Hi Folks, Does anybody know the reason for not allowing preservesPartitioning in RDD.map? Am I missing something here? The following example involves two shuffles. I think if preservesPartitioning were allowed, we could avoid the second one, right? val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,1

WordCount example

2015-03-26 Thread Mohit Anchlia
I am trying to run the word count example but for some reason it's not working as expected. I start the "nc" server on port and then submit the Spark job to the cluster. The Spark job gets submitted successfully, but I never see any connection from Spark getting established. I also tried to type words

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Mark, That's true, but in neither way can I combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra wrote: > RDD#union is not the same thing as SparkContext#union > > On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen wrote: > >> Hi Noorul, >> >> Tha

Re: Combining Many RDDs

2015-03-26 Thread Mark Hamstra
RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen wrote: > Hi Noorul, > > Thank you for your suggestion. I tried that, but ran out of memory. I did > some search and found some suggestions > that we should try to avoid rdd.union( > http://stackoverf
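A sketch of the difference (sizes are illustrative):

    import org.apache.spark.rdd.RDD

    val parts: Seq[RDD[Int]] = (1 to 100).map(i => sc.parallelize(Seq(i)))

    // RDD#union chained pairwise: 99 nested UnionRDDs, so the lineage (and driver bookkeeping)
    // grows with every step
    val chained = parts.reduce(_ union _)

    // SparkContext#union: one flat UnionRDD over all 100 parents
    val flat = sc.union(parts)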

Re: Fuzzy GroupBy

2015-03-26 Thread Sean Owen
The grouping is determined by the POJO's equals() method. You can also call groupBy() to group by some function of the POJOs. For example if you're grouping Doubles into nearly-equal bunches, you could group by their .intValue() On Thu, Mar 26, 2015 at 8:47 PM, Mihran Shahinian wrote: > I would l
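A tiny sketch of the groupBy-on-a-derived-key idea for doubles (the bucket width is an assumption you would tune; values straddling a bucket boundary still land in different groups):

    val values = sc.parallelize(Seq(1.01, 1.02, 3.9, 4.1, 4.2))
    val bucketWidth = 0.5                                         // placeholder notion of "similar"
    val grouped = values.groupBy(v => math.floor(v / bucketWidth).toLong)
    // grouped: RDD[(Long, Iterable[Double])], one entry per bucket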

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Noorul, Thank you for your suggestion. I tried that, but ran out of memory. I did some search and found some suggestions that we should try to avoid rdd.union( http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark ). I will try t

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
w0t! that did it! t/y so much! I'm going to put together a pastebin or something that has all the code put together so if anyone else runs into this issue they will have some working code to help them figure out what's going on. DAVID HOLIDAY Software Engineer 760 607 3300 | Off

Re: Spark SQL queries hang forever

2015-03-26 Thread Michael Armbrust
Is it possible to jstack the executors and see where they are hanging? On Thu, Mar 26, 2015 at 2:02 PM, Jon Chase wrote: > Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), > executor memory 20GB, driver memory 10GB > > I'm using Spark SQL, mainly via spark-shell, to query
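For reference, one way to grab those dumps on a worker node (a sketch; the process name is the standard Spark executor class, the output file is arbitrary):

    # find the executor JVM(s) on the node, then dump their threads
    jps -l | grep CoarseGrainedExecutorBackend
    jstack <pid> > executor-threads.txt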

Spark SQL queries hang forever

2015-03-26 Thread Jon Chase
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread out over roughly 2,000 Parquet files and my queries frequently hang. Simple queries like "select count(*) from

Error in creating log directory

2015-03-26 Thread pzilaro
I get the following error message when I start pyspark shell. The config has the following settings- # spark.masterspark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory # spark.serializerorg.apache.spark.serializer.KryoSerial

Fuzzy GroupBy

2015-03-26 Thread Mihran Shahinian
I would like to group records, but instead of grouping on exact key I want to be able to compute the similarity of keys on my own. Is there a recommended way of doing this? here is my starting point final JavaRDD< pojo > records = spark.parallelize(getListofPojos()).cache(); class pojo { String

RE: EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out

2015-03-26 Thread Adrian Mocanu
I also get stack overflow every now and then without having any recursive calls: java.lang.StackOverflowError: null at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1479) ~[na:1.7.0_75] at java.io.ObjectOutputStream.writeOrdinaryObject(Object

Re: Parallel actions from driver

2015-03-26 Thread Sean Owen
You can do this much more simply, I think, with Scala's parallel collections (try .par). There's nothing wrong with doing this, no. Here, something is getting caught in your closure, maybe unintentionally, that's not serializable. It's not directly related to the parallelism. On Thu, Mar 26, 2015
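A sketch of the .par route (the RDD and output paths are placeholders standing in for the DataFrame saves in the original code):

    // placeholder data standing in for the `events` RDD in the original code
    val events = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

    // each element gets its own driver-side thread, so the three save jobs are submitted concurrently
    Seq(1, 2, 3).par.foreach { t =>
      events.filter(_._1 == t).map(_._2).saveAsTextFile(s"/tmp/events-$t")   // placeholder output path
    }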

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Corey Nolet
Spark uses a SerializableWritable [1] to java serialize writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular java serialization. Though it is wrapping the actual AccumuloInputFormat (another example of something you may

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Nick Pentreath
I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) => (extractScalaType(key), extractScalaType(value)) } Where 'extractScalaType' converts the Key or Value to a standard Sca
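A sketch of that conversion in Scala; `accumuloConf` is assumed to be a Hadoop Configuration already carrying the ZooKeeper/connector/table settings shown elsewhere in this thread:

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}

    val rdd = sc.newAPIHadoopRDD(accumuloConf, classOf[AccumuloInputFormat],
                                 classOf[Key], classOf[Value])
      .map { case (key, value) =>
        // convert the non-serializable Hadoop writables into plain Strings right away
        (key.getRow.toString, value.toString)
      }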

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Russ Weeks
Hi, David, This is the code that I use to create a JavaPairRDD from an Accumulo table: JavaSparkContext sc = new JavaSparkContext(conf); Job hadoopJob = Job.getInstance(conf,"TestSparkJob"); job.setInputFormatClass(AccumuloInputFormat.class); AccumuloInputFormat.setZooKeeperInstance(job, conf

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-26 Thread Chang Lim
After this line: val sc = new SparkContext(conf) you need to create a SQLContext and add: import sqlContext.implicits._ //this is used to implicitly convert an RDD to a DataFrame. Hope this helps -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-DataFrame-for-using-
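For reference, a sketch of the full Spark 1.3 pattern (the case class and input file are placeholders for whatever feeds the ml ALS example):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._                      // brings in rdd.toDF()

    case class Rating(user: Int, item: Int, rating: Float)

    val ratingsDF = sc.textFile("ratings.csv")         // placeholder input
      .map(_.split(','))
      .map(a => Rating(a(0).toInt, a(1).toInt, a(2).toFloat))
      .toDF()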

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
Progress! I was able to figure out why the 'input INFO not set' error was occurring. The eagle-eyed among you will no doubt see the following code is missing a closing ')': AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password") as I'm doing this in spark-notebook,

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
Could you narrow down to a step which cause the OOM, something like: log2= self.sqlContext.jsonFile(path) log2.count() ... out.count() ... On Thu, Mar 26, 2015 at 10:34 AM, Eduardo Cusa wrote: > the last try was without log2.cache() and still getting out of memory > > I using the following conf,

Re: HQL function Rollup and Cube

2015-03-26 Thread Chang Lim
Solved. In the IDE, the project settings were missing the dependent lib jars (jar files under spark-xx/lib). When these jars are not set, I got a class-not-found error about datanucleus classes (compared to an out-of-memory error in the Spark shell). In the context of the Spark shell, these dependent jars need to b

EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out

2015-03-26 Thread Adrian Mocanu
Hi, I need help fixing a timeout exception thrown from ElasticSearch Hadoop. The ES cluster is up all the time. I use ElasticSearch Hadoop to read data from ES into RDDs. I get a collection of these RDDs, which I traverse (with foreachRDD) and create more RDDs from each RDD in the collection.

Re: python : Out of memory: Kill process

2015-03-26 Thread Eduardo Cusa
the last try was without log2.cache() and still getting out of memory I using the following conf, maybe help: conf = (SparkConf() .setAppName("LoadS3") .set("spark.executor.memory", "13g") .set("spark.driver.memory", "13g") .set("spark.driver.maxResultS

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the preaggregation of data using hyperloglog in combination with Spark is atscale (http://atscale.com/). It builds the aggregations and makes use of the speed of SparkSQL - all within the context of a model that is accessible by Tableau or Qlik. On Thu

Re: python : Out of memory: Kill process

2015-03-26 Thread Davies Liu
Could you try to remove the line `log2.cache()` ? On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa wrote: > I running on ec2 : > > 1 Master : 4 CPU 15 GB RAM (2 GB swap) > > 2 Slaves 4 CPU 15 GB RAM > > > the uncompressed dataset size is 15 GB > > > > > On Thu, Mar 26, 2015 at 10:41 AM, Eduardo C

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
sparkx writes: > Hi, > > I have a Spark job and a dataset of 0.5 Million items. Each item performs > some sort of computation (joining a shared external dataset, if that does > matter) and produces an RDD containing 20-500 result items. Now I would like > to combine all these RDDs and perform a n

Re: python : Out of memory: Kill process

2015-03-26 Thread Eduardo Cusa
I running on ec2 : 1 Master : 4 CPU 15 GB RAM (2 GB swap) 2 Slaves 4 CPU 15 GB RAM the uncompressed dataset size is 15 GB On Thu, Mar 26, 2015 at 10:41 AM, Eduardo Cusa < eduardo.c...@usmediaconsulting.com> wrote: > Hi Davies, I upgrade to 1.3.0 and still getting Out of Memory. > > I ran

Re: What his the ideal method to interact with Spark Cluster from a Cloud App?

2015-03-26 Thread Noorul Islam K M
Today I found one answer from a this thread [1] which seems to be worth exploring. Michael, if you are reading this, it will be helpful if you could share more about your spark deployment in production. Thanks and Regards Noorul [1] http://apache-spark-user-list.1001560.n3.nabble.com/How-do-yo

Re: HQL function Rollup and Cube

2015-03-26 Thread Chang Lim
Clarification on how the HQL was invoked: hiveContext.sql("select a, b, count(*) from t group by a, b with rollup") Thanks, Chang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HQL-function-Rollup-and-Cube-tp22241p22244.html Sent from the Apache Spark U

Combining Many RDDs

2015-03-26 Thread sparkx
Hi, I have a Spark job and a dataset of 0.5 Million items. Each item performs some sort of computation (joining a shared external dataset, if that does matter) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs and perform a next job. What I have found o

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Mark Hamstra
You can also always take the more extreme approach of using SparkContext#runJob (or submitJob) to write a custom Action that does what you want in one pass. Usually that's not worth the extra effort. On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen wrote: > To avoid computing twice you need to persis

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Sean Owen
To avoid computing twice you need to persist the RDD but that need not be in memory. You can persist to disk with persist(). On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" < ningjun.w...@lexisnexis.com> wrote: > I have a rdd that is expensive to compute. I want to save it as object > file and
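A sketch of that, reusing the snippet from the question further down:

    import org.apache.spark.storage.StorageLevel

    // `someFile` / `expensiveCalculation` as in the question quoted below
    val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
    rdd.persist(StorageLevel.DISK_ONLY)          // spill to local disk instead of keeping it in memory

    rdd.saveAsObjectFile("/tmp/expensive-rdd")   // first action computes the RDD once (placeholder path)
    println(rdd.count())                         // second action re-reads the persisted blocks
    rdd.unpersist()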

DataFrame GroupBy

2015-03-26 Thread gtanguy
Hello everybody, I am trying to do a simple groupBy : *Code:* val df = hiveContext.sql("SELECT * FROM table1") df .printSchema() df .groupBy("customer_id").count().show(5) *Stacktrace* : root |-- customer_id: string (nullable = true) |-- rank: string (nullable = true) |-- reco_material_id:

HQL function Rollup and Cube

2015-03-26 Thread Chang Lim
Has anyone been able to use Hive 0.13 ROLLUP and CUBE functions in Spark 1.3's Hive Context? According to https://issues.apache.org/jira/browse/SPARK-2663, this has been resolved in Spark 1.3. I created an in-memory temp table (t) and tried to execute a ROLLUP(and CUBE) function: select a,

Re: Spark-core and guava

2015-03-26 Thread Stevo Slavić
Thanks for heads up Sean! On Mar 26, 2015 1:30 PM, "Sean Owen" wrote: > This is a long and complicated story. In short, Spark shades Guava 14 > except for a few classes that were accidentally used in a public API > (Optional and a few more it depends on). So "provided" is more of a > Maven workar

How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Wang, Ningjun (LNG-NPV)
I have a rdd that is expensive to compute. I want to save it as object file and also print the count. How can I avoid double computation of the RDD? val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) val count = rdd.count() // this force computation of the rdd println(count

Parallel actions from driver

2015-03-26 Thread Aram Mkrtchyan
Hi. I'm trying to trigger DataFrame's save method in parallel from my driver. For that purpose I use an ExecutorService and Futures; here's my code: val futures = Seq(1, 2, 3).map( t => pool.submit( new Runnable { override def run(): Unit = { val commons = events.filter(_._1 == t).map(_._2.common)

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
As I wrote previously - indexing is not your only choice; you can preaggregate data during load or, depending on your needs, think about other data structures, such as graphs, hyperloglog, bloom filters etc. (a challenge to integrate into standard BI tools). On Mar 26, 2015 at 13:34, "kundan kum

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
Hi Nick, Unfortunately the Accumulo docs are woefully inadequate, and in some places, flat wrong. I'm not sure if this is a case where the docs are 'flat wrong', or if there's some wrinkle with spark-notebook in the mix that's messing everything up. I've been working with some people on stack ove

Re: Hive Table not from from Spark SQL

2015-03-26 Thread ๏̯͡๏
Stack Trace: 15/03/26 08:25:42 INFO ql.Driver: OK 15/03/26 08:25:42 INFO log.PerfLogger: 15/03/26 08:25:42 INFO log.PerfLogger: 15/03/26 08:25:42 INFO log.PerfLogger: 15/03/26 08:25:42 INFO metastore.HiveMetaStore: 0: get_tables: db=default pat=.* 15/03/26 08:25:42 INFO HiveMetaStore.audit: ugi
