Need equallyWeightedPartitioner Algorithm

2014-06-03 Thread Joe L
I need to partition my data into equally weighted partitions: suppose I have 20GB of data and I want 4 partitions, each holding 5GB of the data. Thanks
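A minimal PySpark sketch of one way to approximate this (not from the thread), assuming records are roughly uniform in size so that balancing by record count approximates balancing by bytes; the HDFS path is hypothetical and sc is the pyspark shell's SparkContext:

num_parts = 4
rdd = sc.textFile("hdfs://namenode:9000/path/to/20gb-input")   # hypothetical path

# Spread records evenly over num_parts by hashing each record to a partition id,
# then let partitionBy place them; each partition ends up with roughly 1/4 of the records.
balanced = (rdd.map(lambda rec: (hash(rec) % num_parts, rec))
               .partitionBy(num_parts)
               .values())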

classnotfound error due to groupByKey

2014-07-04 Thread Joe L
Hi, when I run the following piece of code, it throws a ClassNotFoundException. Any suggestion would be appreciated. I wanted to group an RDD by key: val t = rdd.groupByKey() Error message: java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$ Thanks

GC problem while filtering large data

2014-12-16 Thread Joe L
Hi, I am trying to filter a large table with 3 columns. Spark SQL might be a good choice, but I want to do it without SQL. The goal is to filter the big table with multiple clauses. I filtered the big table 3 times; the first filtering takes about 50 seconds, but the second and third filter transformations took about
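A sketch (not from the thread) of caching the parsed table so that later filters read from memory instead of re-scanning HDFS; the path, delimiter, and predicates are hypothetical, and sc is the pyspark shell's SparkContext:

rows = sc.textFile("hdfs://namenode:9000/bigtable") \
         .map(lambda line: line.split('\t'))        # 3 columns per row
rows.cache()                                        # filled the first time an action runs over it

first  = rows.filter(lambda r: r[0] == 'x')         # the first action here pays the HDFS read and fills the cache
second = rows.filter(lambda r: r[1] != 'y')         # subsequent filters are served from the cached partitions
third  = rows.filter(lambda r: r[2] == 'z')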

what is the difference between persist() and cache()?

2014-04-13 Thread Joe L
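A short illustration of the relationship the subject line asks about, assuming the pyspark shell (sc provided); the paths are hypothetical:

from pyspark import StorageLevel

rdd1 = sc.textFile("hdfs://namenode:9000/data-a")
rdd1.cache()                                    # shorthand for persist() with the default MEMORY_ONLY level

rdd2 = sc.textFile("hdfs://namenode:9000/data-b")
rdd2.persist(StorageLevel.MEMORY_AND_DISK)      # persist() additionally lets you choose the storage level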

how to use a single filter instead of multiple filters

2014-04-13 Thread Joe L
Hi, I have multiple filters as shown below. Should I use a single combined filter instead of them? Can these filters degrade Spark's performance?
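A small sketch comparing the two forms (the predicates are made-up placeholders, sc is the shell's SparkContext). Both are narrow transformations that Spark pipelines into the same stage, so the combined form mainly saves per-record function-call overhead:

nums = sc.parallelize(range(1000))

chained  = nums.filter(lambda x: x > 10) \
               .filter(lambda x: x < 500) \
               .filter(lambda x: x % 2 == 0)

combined = nums.filter(lambda x: x > 10 and x < 500 and x % 2 == 0)

# chained.count() == combined.count(); either way the data is scanned once per action.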

how to count maps without shuffling too much data?

2014-04-13 Thread Joe L

How to set spark worker memory size?

2014-04-13 Thread Joe L
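For reference, the usual knobs in a 0.9-era standalone cluster (the sizes below are hypothetical examples):

# In conf/spark-env.sh on each worker node:
#   export SPARK_WORKER_MEMORY=8g           # total memory the worker may hand out to executors
#
# Per application, the executor size is requested through a config property:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)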

how to count maps within a node?

2014-04-13 Thread Joe L
Hi, I want to count maps within a node and return the result to the driver without too much shuffling. I think I can improve my performance by doing so.
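A minimal sketch (not from the thread) of counting records per partition locally with mapPartitions, so only one small integer per partition travels back to the driver and no shuffle is involved; assumes the pyspark shell:

rdd = sc.parallelize(range(100000), 8)             # stand-in data, 8 partitions

per_partition = rdd.mapPartitions(lambda it: [sum(1 for _ in it)])
counts = per_partition.collect()                   # one int per partition
total = sum(counts)                                # same value as rdd.count()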

Proper caching method

2014-04-14 Thread Joe L
Hi, I am trying to cache 2GB of data and implement the following procedure. In order to cache it I did as follows. Is it necessary to cache rdd2 since rdd1 is already cached? rdd1 = textFile("hdfs...").cache() rdd2 = rdd1.filter(userDefinedFunc1).cache() rdd3 = rdd1.filter(userDefinedFunc2).c
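One way to reason about it, sketched below with hypothetical predicates and path (sc from the pyspark shell): caching rdd1 means the filters re-run from memory rather than HDFS; caching rdd2 as well only pays off if rdd2 itself feeds several actions.

user_defined_func1 = lambda line: "foo" in line    # hypothetical stand-ins for the real predicates
user_defined_func2 = lambda line: "bar" in line

rdd1 = sc.textFile("hdfs://namenode:9000/data").cache()   # shared input, reused by both filters
rdd2 = rdd1.filter(user_defined_func1)             # add .cache() only if rdd2 is used by multiple actions
rdd3 = rdd1.filter(user_defined_func2)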

shuffle vs performance

2014-04-14 Thread Joe L
I was wondering whether partitioning RDDs into fewer partitions could help Spark performance and reduce shuffling. Is it true?

groupByKey returns a single partition in a RDD?

2014-04-15 Thread Joe L
I want to apply the following transformations to 60GB of data on 7 nodes with 10GB of memory each. I am wondering whether the groupByKey() function returns an RDD with a single partition for each key. If so, what will happen if the size of the partition doesn't fit into that particular node? rdd = sc.textFil
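A small check of what groupByKey actually returns (not from the thread, pyspark shell assumed): it produces the requested number of partitions, each holding whole groups for several keys, rather than one partition per key.

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1)])
grouped = pairs.groupByKey(2)                      # ask for 2 partitions

sizes = grouped.glom().map(len).collect()          # number of key groups that landed in each partition
# len(sizes) == 2 (the requested partitions); the values sum to 3, one group per distinct key.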

what is the difference between element and partition?

2014-04-15 Thread Joe L
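A one-line view of the distinction in the subject: elements are the individual records, and partitions are the chunks those records are grouped into, with one task scheduled per partition (pyspark shell assumed):

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)        # 6 elements spread over 3 partitions
chunks = rdd.glom().collect()                      # e.g. [[1, 2], [3, 4], [5, 6]]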

groupByKey(None) returns partitions according to the keys?

2014-04-15 Thread Joe L
I was wondering if groupByKey returns 2 partitions in the example below. >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) >>> sorted(x.groupByKey().collect()) [('a', [1, 1]), ('b', [1])]

Could I improve Spark performance partitioning elements in a RDD?

2014-04-15 Thread Joe L

what is a partition? how it works?

2014-04-16 Thread Joe L
I want to know the following: what is a partition? How does it work? How is it different from a Hadoop partition? For example: >>> sc.parallelize([1,2,3,4]).map(lambda x: (x,x)).partitionBy(2).glom().collect() [[(2,2), (4,4)], [(1,1), (3,3)]] From this we get 2 partitions, but what does that mean?

choose the number of partition according to the number of nodes

2014-04-16 Thread Joe L
Is it true that it is better to choose the number of partitions according to the number of nodes in the cluster? partitionBy(numNodes)
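For what it's worth, the Spark tuning guide of that era suggests a few tasks per CPU core rather than one partition per node; a sketch with hypothetical cluster sizes (sc from the pyspark shell):

num_nodes = 7
cores_per_node = 8
num_partitions = num_nodes * cores_per_node * 2    # roughly 2 tasks per core

pairs = sc.parallelize(range(10000)).map(lambda x: (x, x)).partitionBy(num_partitions)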

Re: choose the number of partition according to the number of nodes

2014-04-16 Thread Joe L
Thank you Nicholas

join with inputs co-partitioned?

2014-04-17 Thread Joe L
I am trying to implement joining with co-partitioned inputs. As described in the documentation, we can avoid shuffling by partitioning elements with the same hash code onto the same machine. >>> links = sc.parallelize([('a','b'),('a','c'),('b','c'),('c','a')]).groupByKey(3) >>> links.glom().co
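A sketch of the co-partitioning idea (not from the thread, pyspark shell assumed): both sides are hash-partitioned into the same number of partitions and cached, so the subsequent join can match keys without re-shuffling both inputs; the exact shuffle behaviour differs somewhat between the Scala and Python APIs.

num_parts = 3
links = sc.parallelize([('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')]) \
          .partitionBy(num_parts).cache()
ranks = sc.parallelize([('a', 1.0), ('b', 1.0), ('c', 1.0)]) \
          .partitionBy(num_parts).cache()

joined = links.join(ranks, num_parts)              # keys already live in matching partitions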

how to split one big RDD (about 100G) into several small ones?

2014-04-18 Thread Joe L
I want to split a single big RDD into smaller RDDs without reading too much from disk (HDFS). What is the best way to do that? This is my current code: subclass_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subClassOf']).map(lambda (s, p, o): (s, o)) subproperty_pairs = s
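A common way to do this (a sketch, not from the thread) is to cache the parent RDD once so every derived subset filters from memory instead of re-reading HDFS; the PROPERTIES values, the second predicate, and the path below are hypothetical, and sc comes from the pyspark shell.

PROPERTIES = {'subClassOf': 'rdfs:subClassOf',            # hypothetical values
              'subPropertyOf': 'rdfs:subPropertyOf'}

schema_triples = sc.textFile("hdfs://namenode:9000/schema") \
                   .map(lambda line: tuple(line.split('\t')))   # (s, p, o) triples
schema_triples.cache()                                    # parent is read from HDFS only once

subclass_pairs    = schema_triples.filter(lambda t: t[1] == PROPERTIES['subClassOf']) \
                                  .map(lambda t: (t[0], t[2]))
subproperty_pairs = schema_triples.filter(lambda t: t[1] == PROPERTIES['subPropertyOf']) \
                                  .map(lambda t: (t[0], t[2]))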

efficient joining

2014-04-19 Thread Joe L
What is an efficient way to join two RDDs? Joining is taking too long to perform.
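One technique worth knowing here (a sketch, not from the thread, with stand-in data): when one side of the join is small enough to fit in memory, broadcasting it and doing a map-side join avoids shuffling the large side at all.

small = dict(sc.parallelize([('a', 1), ('b', 2)]).collect())   # small lookup side, collected to the driver
small_b = sc.broadcast(small)

big = sc.parallelize([('a', 'x'), ('b', 'y'), ('c', 'z')])     # stand-in for the large RDD
joined = big.map(lambda kv: (kv[0], (kv[1], small_b.value.get(kv[0])))) \
            .filter(lambda kv: kv[1][1] is not None)           # drop keys missing from the small side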

evaluate spark

2014-04-20 Thread Joe L
I want to evaluate Spark performance by measuring the running time of transformation operations such as map and join. To do so, is it enough to materialize them with just a count action? Because, as far as I know, transformations are lazy operations and don't do any computation until we call an action on them, but when I
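A minimal timing sketch (not from the thread, pyspark shell assumed): because transformations are lazy, the map is only measured once an action such as count() forces it, so the measured time covers map and count together.

import time

rdd = sc.parallelize(range(1000000))
mapped = rdd.map(lambda x: (x, x))                 # nothing has executed yet

start = time.time()
n = mapped.count()                                 # forces the map to run
elapsed = time.time() - start                      # wall-clock time for map + count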

Spark is slow

2014-04-21 Thread Joe L
It is claimed that Spark is 10x or 100x faster than MapReduce and Hive, but since I started using it I haven't seen any faster performance. It is taking 2 minutes to run map and join tasks over just 2GB of data, whereas Hive was taking just a few seconds to join 2 tables over the same data. And,

Re: Spark is slow

2014-04-21 Thread Joe L
g1 = pairs1.groupByKey().count() pairs1 = pairs1.groupByKey(g1).cache() g2 = triples.groupByKey().count() pairs2 = pairs2.groupByKey(g2) pairs = pairs2.join(pairs1) Hi, I want to implement hash-partitioned joining as shown above, but somehow it is taking very long to perform. As I understand,
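For comparison, hash-partitioned joins are usually set up with partitionBy on both sides using the same small, fixed partition count, rather than calling groupByKey with the number of groups; a sketch with stand-in RDDs and a hypothetical partition count (pyspark shell assumed):

pairs1 = sc.parallelize([('a', 1), ('b', 2)])      # stand-ins for the RDDs in the message
pairs2 = sc.parallelize([('a', 'x'), ('c', 'y')])

num_parts = 4                                      # hypothetical; typically a few per CPU core
part1 = pairs1.partitionBy(num_parts).cache()
part2 = pairs2.partitionBy(num_parts).cache()

joined = part2.join(part1, num_parts)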

help me

2014-04-22 Thread Joe L
I got the following performance; is it normal for Spark to behave like this? Sometimes Spark switches from PROCESS_LOCAL into NODE_LOCAL mode and becomes 10x faster. I am very confused. scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000") scala> f.count() Long = 137805557 took 130.80966161
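As an aside, the PROCESS_LOCAL/NODE_LOCAL switching mentioned here is governed by the scheduler's locality wait; a hypothetical 0.9-era setting (value in milliseconds), shown as a Python sketch:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.locality.wait", "3000")   # how long to hold a task waiting for a more local slot
sc = SparkContext(conf=conf)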

help

2014-04-23 Thread Joe L
Hi, I found the major problem with my Spark cluster but don't know why it happens. First, I was testing Spark by running applications. It was spending about 20 seconds just to count 10 million strings/items (2GB) on a cluster with 8 nodes (8 cores per node). As we know, that is a very bad

read file from hdfs

2014-04-25 Thread Joe L
I have just two questions. sc.textFile("hdfs://host:port/user/matei/whatever.txt") Is host the master node? What port should we use?
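For reference: the host and port in the URI refer to the HDFS NameNode (the fs.default.name / fs.defaultFS value from core-site.xml), not the Spark master; the hostname and port below are hypothetical examples (sc from the pyspark shell).

lines = sc.textFile("hdfs://namenode:9000/user/matei/whatever.txt")   # NameNode host and RPC port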

strange error

2014-04-25 Thread Joe L
[error] 14/04/25 23:09:57 INFO slf4j.Slf4jLogger: Slf4jLogger started [error] 14/04/25 23:09:57 INFO Remoting: Starting remoting [error] 14/04/25 23:09:58 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@cm03:5] [error] 14/04/25 23:09:58 INFO Remoting: Remoting now lis

help

2014-04-25 Thread Joe L
I need someone's help, please. I am getting the following error. [error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run program "/home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh" (in directory

Re: help

2014-04-25 Thread Joe L
Hi, thank you for your reply, but I could not find it. It says no such file or directory.

help

2014-04-27 Thread Joe L
I am getting this error; please help me fix it. 14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run program "/home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh" (in directory "."): error=13,

spark running examples error

2014-04-27 Thread Joe L
I ran ./bin/run-example org.apache.spark.examples.SparkPi spark://MASTERIP:7077 but I am getting the following error; it seems the master is not connecting to the slave nodes. Any suggestions?

getting an error

2014-04-28 Thread Joe L
Hi, while I was testing an example, I encountered a problem running Scala on the cluster. I searched for it on Google but couldn't solve it, and I posted about it on the Spark mailing list, but that didn't help me solve the problem either. The problem is that I could run Spark successfully in local mode,

RE: help

2014-04-28 Thread Joe L
Yes, here it is. I set it up like this:
export STANDALONE_SPARK_MASTER_HOST=`hostname`
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=10.2
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRA

ClassNotFoundException

2014-05-01 Thread Joe L
Hi, I am getting the following error. How can I fix this problem? Joe 14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1) 14/05/02 03:51:48 INFO TaskSetManager: Loss was due to java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4 [duplicate

Re: java.lang.ClassNotFoundException

2014-05-01 Thread Joe L
Hi, you should include the jar file of your project, for example: conf.setJars(Seq("yourjarfilepath.jar")) Joe On Friday, May 2, 2014 7:39 AM, proofmoore [via Apache Spark User List] wrote: Hello. I followed the "A Standalone App in Java" part of the tutorial  https://spark.apache.org/docs/0.8.1/quick-sta

Re: ClassNotFoundException

2014-05-01 Thread Joe L
Please help me

facebook data mining with Spark

2014-05-19 Thread Joe L
Is there any way to get Facebook data into Spark and filter its content?
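Spark has no built-in Facebook connector; a common route (a sketch, not from the thread) is to export the data first, for example as JSON-lines files, and then load and filter it in Spark. The path and field names below are hypothetical, and sc comes from the pyspark shell.

import json

posts = sc.textFile("hdfs://namenode:9000/facebook/posts.json").map(json.loads)
matched = posts.filter(lambda post: "spark" in post.get("message", "").lower())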

Map failed [duplicate 1] error

2014-05-27 Thread Joe L
Hi, I am getting the following error but I don't understand what the problem is. 14/05/27 17:44:29 INFO TaskSetManager: Loss was due to java.io.IOException: Map failed [duplicate 15] 14/05/27 17:44:30 INFO TaskSetManager: Starting task 47.0:43 as TID 60281 on executor 0: cm07 (PROCESS_LOCAL) 14/