Re: Kafka client - specify offsets?

2014-06-15 Thread Tobias Pfeiffer
Hi, there are apparently helpers to tell you the offsets, but I have no idea how to pass that to the Kafka stream consumer. I am interested in that as well.

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Vivek YS
The more fundamental question is why groupByKey doesn't return RDD[(K, RDD[V])] instead of RDD[(K, Iterable[V])]. I wrote something like this (yet to test, and I am not sure if this is even correct); I would appreciate any suggestions/comments: def groupByKeyWithRDD(partitioner: Partitioner): RDD[(K, RDD
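When the downstream computation only needs an aggregate per key, a fold-style combine sidesteps the question entirely by never materializing the per-key value lists that make groupByKey blow the heap. A plain-Python illustration of the difference (not Spark API; the data and names are illustrative):

```python
data = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

# groupByKey-style: materializes every value per key (OOM risk for hot keys)
grouped = {}
for k, v in data:
    grouped.setdefault(k, []).append(v)

# foldByKey-style: keeps one running value per key, constant memory per key
folded = {}
for k, v in data:
    folded[k] = folded.get(k, 0) + v

print(grouped["a"], folded["a"])  # [1, 3, 4] 8
```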

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Krishna Sankar
Ian, Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a wrapper around the com.clearspring.analytics.stream.cardinality.HyperLogLogPlus. Cheers On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell wrote: > Depending on your requirements when doing hourly metrics calculating >
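The idea behind countApproxDistinctByKey is easiest to see against the exact version it replaces: exact per-key distinct counting keeps a full set per key, so memory grows with cardinality, whereas an HLL sketch caps each key's state at a small fixed-size register array in exchange for a bounded relative error. A toy Python sketch of the exact version (illustrative hour/user data, not Spark API):

```python
# Exact distinct-users-per-hour: one set per key, memory grows with the
# number of distinct values. HLL would replace each set with fixed-size
# registers at the cost of a small counting error.
events = [("h1", "u1"), ("h1", "u2"), ("h1", "u1"), ("h2", "u3")]

seen = {}
for hour, user in events:
    seen.setdefault(hour, set()).add(user)

distinct = {hour: len(users) for hour, users in seen.items()}
print(distinct)  # {'h1': 2, 'h2': 1}
```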

Re: long GC pause during file.cache()

2014-06-15 Thread Aaron Davidson
Note also that Java does not work well with very large JVMs due to this exact issue. There are two commonly used workarounds: 1) Spawn multiple (smaller) executors on the same machine. This can be done by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone mode[1]). 2) Use Tachyon
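For workaround (1), the worker count and per-worker memory go in spark-env.sh on each machine; the values below are hypothetical and should be sized so each executor's heap stays small enough for tolerable GC pauses:

```shell
# spark-env.sh (standalone mode) - hypothetical sizing:
# four smaller workers per machine instead of one huge-heap JVM
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_MEMORY=8g
```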

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Ian O'Connell
Depending on your requirements, when doing hourly metrics calculating distinct cardinality, a much more scalable method would be to use a hyper-log-log data structure. A Scala impl people have used with Spark would be https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/

pyspark serializer can't handle functions?

2014-06-15 Thread madeleine
It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sam
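The failure is easy to reproduce with the stdlib pickler alone, no Spark needed: pickle serializes functions by qualified name, and a lambda's name cannot be looked up at unpickling time, which is why dill (which serializes function bytecode instead) keeps coming up as the proposed fix. A minimal sketch:

```python
import pickle

# A list of lambdas, as in the original report. The stdlib pickler
# serializes functions by reference (module + qualified name), and a
# lambda's "<lambda>" name cannot be resolved, so pickling fails.
funcs = [lambda x: x + 1, lambda x: x * 2]

try:
    pickle.dumps(funcs)
    picklable = True
except Exception:
    picklable = False

print(picklable)  # False: the default serializer rejects the lambdas
```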

Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
Yes, I think in the spark-env.sh.template, it is listed in the comments (didn’t check….) Best, -- Nan Zhu On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote: > Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? > > > > On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu (mailto:zhunanm

Re: long GC pause during file.cache()

2014-06-15 Thread Surendranauth Hiraman
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu wrote: > SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you > don’t mind the WARNING in the logs > > you can set spark.executor.extraJavaOpts in your SparkConf obj > > Best, > > -- > Nan Zhu > >

Akka listens to hostname while user may spark-submit with master in IP url

2014-06-15 Thread Hao Wang
Hi, all. In Spark, "spark.driver.host" defaults to the driver's hostname, so the Akka actor system will listen on a URL like akka.tcp://hostname:port. However, when a user runs an application via spark-submit, they may set "--master spark://192.168.1.12:7077". Then, the *AppClient* in *S
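One commonly suggested workaround (an assumption on my part, not a confirmed fix for this report) is to pin the driver's advertised address explicitly, so Akka listens on the same IP that the master URL uses:

```
# spark-defaults.conf (or via SparkConf) - hypothetical IP matching the
# --master URL from the report above
spark.driver.host  192.168.1.12
```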

Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-15 Thread Manish Amde
Hi Suraj, I don't see any logs from MLlib. You might need to explicitly set the logging to DEBUG for MLlib. Adding this line to log4j.properties might fix the problem: log4j.logger.org.apache.spark.mllib.tree=DEBUG Also, please let me know whether you encounter similar problems with the Spark maste
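For context, the suggested line goes into conf/log4j.properties alongside the root-logger setting, so only the decision-tree package becomes verbose (the root category shown is illustrative):

```
# conf/log4j.properties
log4j.rootCategory=INFO, console
log4j.logger.org.apache.spark.mllib.tree=DEBUG
```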

Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind the WARNING in the logs. You can set spark.executor.extraJavaOptions in your SparkConf obj. Best, -- Nan Zhu On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote: > Hi, Wei > > You may try to set JVM opts in spark-
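As a sketch of the non-deprecated route (the full property name in the Spark 1.x configuration docs is spark.executor.extraJavaOptions), the options can also live in spark-defaults.conf; the GC flags shown are illustrative, not a recommendation:

```
# spark-defaults.conf - illustrative GC flags for the executors
spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails
```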

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Andrew Ash
Good point Sean. I've filed a ticket to document the equals() / hashCode() requirements for custom keys in the Spark documentation, as this has come up a few times on the user@ list. https://issues.apache.org/jira/browse/SPARK-2148 On Sun, Jun 15, 2014 at 12:11 PM, Sean Owen wrote: > In Java

Re: long GC pause during file.cache()

2014-06-15 Thread Hao Wang
Hi, Wei You may try to set JVM opts in *spark-env.sh* as follows to prevent or mitigate GC pauses: export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m" There are more options you could add; please just Google :) Regards, Wang Hao(王灏) CloudTeam | S

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Sean Owen
In Java at large, you must always implement hashCode() when you implement equals(). This is not specific to Spark. This is to maintain the contract that two equals() instances have the same hash code, and that's not the case for your class now. This causes weird things to happen wherever the hash c
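The same contract exists in Python's data model (__eq__/__hash__), and Python 3 even half-enforces it: defining __eq__ without __hash__ makes instances unhashable. A sketch of a correctly hashable key class, with field names borrowed from the FlowId class discussed in this thread:

```python
# A grouping key that honors the equals/hashCode contract: equal
# instances must produce equal hashes, or hash-based grouping breaks.
class FlowId:
    def __init__(self, dcx_id, trx_id, msg_type):
        self.dcx_id, self.trx_id, self.msg_type = dcx_id, trx_id, msg_type

    def __eq__(self, other):
        return (isinstance(other, FlowId)
                and (self.dcx_id, self.trx_id, self.msg_type)
                    == (other.dcx_id, other.trx_id, other.msg_type))

    def __hash__(self):
        # consistent with __eq__: hash the same tuple of fields
        return hash((self.dcx_id, self.trx_id, self.msg_type))

counts = {}
for fid in [FlowId("a", "1", "REQ"), FlowId("a", "1", "REQ")]:
    counts[fid] = counts.get(fid, 0) + 1

print(list(counts.values()))  # [2] - equal keys collapse into one bucket
```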

Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Gaurav Jain
I have a simple Java class as follows, that I want to use as a key while applying groupByKey or reduceByKey functions: private static class FlowId { public String dcxId; public String trxId; public String msgType; pub

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Surendranauth Hiraman
Vivek, If the foldByKey solution doesn't work for you, my team uses RDD.persist(DISK_ONLY) to avoid OOM errors. It's slower, of course, and requires tuning other config parameters. It can also be a problem if you do not have enough disk space, meaning that you have to unpersist at the right point