Re: Kafka client - specify offsets?

2014-06-15 Thread Tobias Pfeiffer
Hi, there are apparently helpers to tell you the offsets, but I have no idea how to pass that to the Kafka stream consumer. I am interested in that as well.

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Vivek YS
The more fundamental question is why groupByKey doesn't return RDD[(K, RDD[V])] instead of RDD[(K, Iterable[V])]. I wrote something like this (yet to test, and I am not sure if this is even correct); I would appreciate any suggestions/comments: def groupByKeyWithRDD(partitioner: Partitioner): RDD[(K, RDD
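When the downstream computation only needs an aggregate per key, a fold-style combine sidesteps the question entirely by never materializing the per-key value lists that make groupByKey blow the heap. A plain-Python illustration of the difference (not Spark API; the data and names are illustrative):

```python
data = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# groupByKey-style: materializes every value per key (OOM risk for hot keys)
grouped = {}
for k, v in data:
    grouped.setdefault(k, []).append(v)

# foldByKey-style: keeps one running value per key, constant memory per key
folded = {}
for k, v in data:
    folded[k] = folded.get(k, 0) + v

print(grouped["a"], folded["a"])  # [1, 3, 4] 8
```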

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Krishna Sankar
Ian, Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a wrapper around the com.clearspring.analytics.stream.cardinality.HyperLogLogPlus. Cheers On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell wrote: > Depending on your requirements when doing hourly metrics calculating >
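The idea behind countApproxDistinctByKey is easiest to see against the exact version it replaces: exact per-key distinct counting keeps a full set per key, so memory grows with cardinality, whereas an HLL sketch caps each key's state at a small fixed-size register array in exchange for a bounded relative error. A toy Python sketch of the exact version (illustrative hour/user data, not Spark API):

```python
# Exact distinct-users-per-hour: one set per key, memory grows with the
# number of distinct values. HLL would replace each set with fixed-size
# registers at the cost of a small counting error.
events = [("h1", "u1"), ("h1", "u2"), ("h1", "u1"), ("h2", "u3")]

seen = {}
for hour, user in events:
    seen.setdefault(hour, set()).add(user)

distinct = {hour: len(users) for hour, users in seen.items()}
print(distinct)  # {'h1': 2, 'h2': 1}
```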

Re: long GC pause during file.cache()

2014-06-15 Thread Aaron Davidson
Note also that Java does not work well with very large JVMs due to this exact issue. There are two commonly used workarounds: 1) Spawn multiple (smaller) executors on the same machine. This can be done by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone mode[1]). 2) Use Tachyon
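For workaround (1), the worker count and per-worker memory go in spark-env.sh on each machine; the values below are hypothetical and should be sized so each executor's heap stays small enough for tolerable GC pauses:

```shell
# spark-env.sh (standalone mode) - hypothetical sizing:
# four smaller workers per machine instead of one huge-heap JVM
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_MEMORY=8g
```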

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Ian O'Connell
Depending on your requirements, when doing hourly metrics calculating distinct cardinality, a much more scalable method would be to use a hyper-log-log data structure. A Scala impl people have used with Spark would be https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/

pyspark serializer can't handle functions?

2014-06-15 Thread madeleine
It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sam
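The failure is easy to reproduce with the stdlib pickler alone, no Spark needed: pickle serializes functions by qualified name, and a lambda's name cannot be looked up at unpickling time, which is why dill (which serializes function bytecode instead) keeps coming up as the proposed fix. A minimal sketch:

```python
import pickle

# A list of lambdas, as in the original report. The stdlib pickler
# serializes functions by reference (module + qualified name), and a
# lambda's "<lambda>" name cannot be resolved, so pickling fails.
funcs = [lambda x: x + 1, lambda x: x * 2]

try:
    pickle.dumps(funcs)
    picklable = True
except Exception:
    picklable = False

print(picklable)  # False: the default serializer rejects the lambdas
```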

Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
Yes, I think in the spark-env.sh.template, it is listed in the comments (didn’t check….) Best, -- Nan Zhu On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote: > Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? > > > > On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu (mailto:zhunanm

Re: long GC pause during file.cache()

2014-06-15 Thread Surendranauth Hiraman
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu wrote: > SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you > don’t mind the WARNING in the logs > > you can set spark.executor.extraJavaOpts in your SparkConf obj > > Best, > > -- > Nan Zhu > >

Akka listens to hostname while user may spark-submit with master in IP url

2014-06-15 Thread Hao Wang
Hi, all. In Spark, "spark.driver.host" defaults to the driver's hostname, so the Akka actor system will listen on a URL like akka.tcp://hostname:port. However, when a user runs an application via spark-submit, they may set "--master spark://192.168.1.12:7077". Then, the *AppClient* in *S
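One commonly suggested workaround (an assumption on my part, not a confirmed fix for this report) is to pin the driver's advertised address explicitly, so Akka listens on the same IP that the master URL uses:

```
# spark-defaults.conf (or via SparkConf) - hypothetical IP matching the
# --master URL from the report above
spark.driver.host  192.168.1.12
```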

Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-15 Thread Manish Amde
Hi Suraj, I don't see any logs from MLlib. You might need to explicitly set the logging to DEBUG for MLlib. Adding this line to log4j.properties might fix the problem: log4j.logger.org.apache.spark.mllib.tree=DEBUG Also, please let me know whether you encounter similar problems with the Spark maste
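For context, the suggested line goes into conf/log4j.properties alongside the root-logger setting, so only the decision-tree package becomes verbose (the root category shown is illustrative):

```
# conf/log4j.properties
log4j.rootCategory=INFO, console
log4j.logger.org.apache.spark.mllib.tree=DEBUG
```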

Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind the WARNING in the logs. You can set spark.executor.extraJavaOptions in your SparkConf obj. Best, -- Nan Zhu On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote: > Hi, Wei > > You may try to set JVM opts in spark-
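As a sketch of the non-deprecated route (the full property name in the Spark 1.x configuration docs is spark.executor.extraJavaOptions), the options can also live in spark-defaults.conf; the GC flags shown are illustrative, not a recommendation:

```
# spark-defaults.conf - illustrative GC flags for the executors
spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails
```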

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Andrew Ash
Good point Sean. I've filed a ticket to document the equals() / hashCode() requirements for custom keys in the Spark documentation, as this has come up a few times on the user@ list. https://issues.apache.org/jira/browse/SPARK-2148 On Sun, Jun 15, 2014 at 12:11 PM, Sean Owen wrote: > In Java

Re: long GC pause during file.cache()

2014-06-15 Thread Hao Wang
Hi, Wei You may try to set JVM opts in *spark-env.sh* as follows to prevent or mitigate GC pauses: export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m" There are more options you could add; please just Google :) Regards, Wang Hao(王灏) CloudTeam | S

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Sean Owen
In Java at large, you must always implement hashCode() when you implement equals(). This is not specific to Spark. This is to maintain the contract that two equals() instances have the same hash code, and that's not the case for your class now. This causes weird things to happen wherever the hash c
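The same contract exists in Python's data model (__eq__/__hash__), and Python 3 even half-enforces it: defining __eq__ without __hash__ makes instances unhashable. A sketch of a correctly hashable key class, with field names borrowed from the FlowId class discussed in this thread:

```python
# A grouping key that honors the equals/hashCode contract: equal
# instances must produce equal hashes, or hash-based grouping breaks.
class FlowId:
    def __init__(self, dcx_id, trx_id, msg_type):
        self.dcx_id, self.trx_id, self.msg_type = dcx_id, trx_id, msg_type

    def __eq__(self, other):
        return (isinstance(other, FlowId)
                and (self.dcx_id, self.trx_id, self.msg_type)
                    == (other.dcx_id, other.trx_id, other.msg_type))

    def __hash__(self):
        # consistent with __eq__: hash the same tuple of fields
        return hash((self.dcx_id, self.trx_id, self.msg_type))

counts = {}
for fid in [FlowId("a", "1", "REQ"), FlowId("a", "1", "REQ")]:
    counts[fid] = counts.get(fid, 0) + 1

print(list(counts.values()))  # [2] - equal keys collapse into one bucket
```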

Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Gaurav Jain
I have a simple Java class as follows, that I want to use as a key while applying groupByKey or reduceByKey functions: private static class FlowId { public String dcxId; public String trxId; public String msgType; pub

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Surendranauth Hiraman
Vivek, If the foldByKey solution doesn't work for you, my team uses RDD.persist(DISK_ONLY) to avoid OOM errors. It's slower, of course, and requires tuning other config parameters. It can also be a problem if you do not have enough disk space, meaning that you have to unpersist at the right point