Re: IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
ng(hostPort).hasPort, message) } On Wed, Oct 14, 2015 at 2:40 PM, Thomas Dudziak wrote: > It looks like Spark 1.5.1 does not work with IPv6. When > adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the > driver fails with: > > 15/10/14 14:36:01 ERROR SparkConte
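The fragment quoted above appears to be a host:port check built on Guava's HostAndPort parser rather than Spark's colon-based assertion. A minimal sketch of the difference, assuming Guava's com.google.common.net.HostAndPort is available (illustrative only, not the actual Spark patch):

    import com.google.common.net.HostAndPort

    // Colon-based check, similar in spirit to the assertion that fails here:
    // an IPv6 literal such as 2001:db8::1 contains ':' and gets rejected as a hostname.
    def checkHostNaive(host: String, message: String = ""): Unit =
      assert(host.indexOf(':') == -1, message)

    // IPv6-aware alternative: HostAndPort understands bracketed IPv6 literals
    // such as [2001:db8::1]:7077 as well as bare IPv6 addresses.
    def checkHostPortAware(hostPort: String, message: String = ""): Unit =
      assert(HostAndPort.fromString(hostPort).hasPort, message)

    // checkHostNaive("2001:db8::1")            // AssertionError on a dual-stack host
    // checkHostPortAware("[2001:db8::1]:7077") // passes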

IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
It looks like Spark 1.5.1 does not work with IPv6. When adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the driver fails with: 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext. java.lang.AssertionError: assertion failed: Expected hostname at scala.Predef$.a

Yahoo's Caffe-on-Spark project

2015-09-29 Thread Thomas Dudziak
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop I would be curious to learn what the Spark developers' plans are in this area (NNs, GPUs) and what they think of integration with existing NN frameworks like Caffe or Torch. cheers, Tom

Accumulator with non-java-serializable value ?

2015-09-09 Thread Thomas Dudziak
I want to use t-digest with foreachPartition and accumulators (essentially, create a t-digest per partition and add that to the accumulator, leveraging the fact that t-digests can be added to each other). I can make t-digests kryo-serializable easily, but making them java-serializable is not very easy. Now, when
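A rough sketch of the pattern described here: one t-digest per partition, merged into a single result through an accumulator. The TDigest factory and merge calls are assumptions about the com.tdunning t-digest library and may differ by version, and the open question from the post still applies, since the accumulator value travels back to the driver through Java serialization:

    import org.apache.spark.{AccumulatorParam, SparkContext}
    import org.apache.spark.rdd.RDD
    import com.tdunning.math.stats.TDigest

    // Assumed t-digest API: createDigest(compression), add(value), centroids().
    object TDigestAccumulatorParam extends AccumulatorParam[TDigest] {
      override def zero(initial: TDigest): TDigest = TDigest.createDigest(100.0)
      override def addInPlace(d1: TDigest, d2: TDigest): TDigest = {
        // Merge d2 into d1 centroid by centroid (assumed merge strategy).
        val it = d2.centroids().iterator()
        while (it.hasNext) {
          val c = it.next()
          d1.add(c.mean(), c.count())
        }
        d1
      }
    }

    def quantileDigest(sc: SparkContext, values: RDD[Double]): TDigest = {
      val acc = sc.accumulator(TDigest.createDigest(100.0))(TDigestAccumulatorParam)
      values.foreachPartition { iter =>
        val local = TDigest.createDigest(100.0) // one digest per partition
        iter.foreach(v => local.add(v))
        acc += local                            // merged via addInPlace
      }
      acc.value
    }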

Re: How to avoid shuffle errors for a large join ?

2015-09-01 Thread Thomas Dudziak
aking it slower. SMJ performance is probably 5x - 1000x better in > 1.5 for your case. > > > On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak wrote: > >> I'm getting errors like "Removing executor with no recent heartbeats" & >> "Missing an output lo

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
imilar problems to this (reduce side failures for large joins (25bn > rows with 9bn)), and found the answer was to raise > spark.sql.shuffle.partitions beyond 1000. In my case, 16k partitions worked for > me, but your tables look a little denser, so you may want to go even higher. > > On Thu,
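For reference, the setting mentioned above is a one-line change on an existing SQLContext or HiveContext; the 16000 mirrors the reply and is not a universal recommendation:

    // Assuming an existing SQLContext or HiveContext named sqlContext.
    // More shuffle partitions mean smaller reduce tasks, which makes
    // reduce-side memory pressure and fetch failures less likely.
    sqlContext.setConf("spark.sql.shuffle.partitions", "16000")

    // Equivalent SQL form:
    // sqlContext.sql("SET spark.sql.shuffle.partitions=16000")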

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
he answer was to raise > spark.sql.shuffle.partitions beyond 1000. In my case, 16k partitions worked for > me, but your tables look a little denser, so you may want to go even higher. > > On Thu, Aug 27, 2015 at 6:04 PM Thomas Dudziak wrote: > >> I'm getting err

How to avoid shuffle errors for a large join ?

2015-08-27 Thread Thomas Dudziak
I'm getting errors like "Removing executor with no recent heartbeats" & "Missing an output location for shuffle" errors for a large SparkSql join (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to configure the job to avoid them. The initial stage completes fine with some 30k tasks
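The two errors point at executor heartbeat timeouts and lost shuffle output. Besides raising spark.sql.shuffle.partitions (the fix that came out of this thread), these are settings commonly tuned for this class of failure; the values below are illustrative assumptions, not something confirmed in the thread:

    import org.apache.spark.SparkConf

    // Illustrative values only; tune for the cluster at hand.
    val conf = new SparkConf()
      .set("spark.network.timeout", "600s")     // umbrella timeout behind heartbeat-based executor removal
      .set("spark.shuffle.io.maxRetries", "10") // retry failed shuffle block fetches more often
      .set("spark.shuffle.io.retryWait", "30s") // and wait longer between retries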

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
: > > Have you tried tablesample? You'll find the exact syntax in the > documentation, but it does exactly what you want > > On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak wrote: > >> Sorry, I meant without reading from all splits. This is a single >> partition in the tab
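A sketch of the tablesample suggestion issued through a HiveContext; the table name is a placeholder, and whether Spark's HiveQL parser accepts this form in 1.x (and whether it truly avoids reading every split for your input format) is worth verifying:

    // Assuming a HiveContext named sqlContext and a Hive table my_table.
    // Block sampling: read roughly 1 percent of the table's data blocks.
    val sampled = sqlContext.sql(
      "SELECT * FROM my_table TABLESAMPLE(1 PERCENT) s")

    // Bucket sampling on a random key is an alternative form:
    //   SELECT * FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s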

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak wrote: > I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from > and I don't particularly care which rows. Doing a LIMIT un

Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT unfortunately results in two stages where the first stage reads the whole table, and the second then performs the limit with a single worker, which is not very efficien
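One way to avoid touching every split when the particular rows don't matter is to keep only a subset of partitions before doing any further work. This is not from the thread, just a sketch of the idea; the table name and the number of partitions to keep are made up:

    // Assuming a HiveContext named sqlContext. Tasks are still scheduled for
    // the dropped partitions, but their iterators are never consumed, so
    // almost nothing is read from them.
    val df = sqlContext.table("my_table")
    val keep = 50
    val subset = df.rdd.mapPartitionsWithIndex { (i, iter) =>
      if (i < keep) iter else Iterator.empty
    }
    val sample = sqlContext.createDataFrame(subset, df.schema)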

Exception when using CLUSTER BY or ORDER BY

2015-05-19 Thread Thomas Dudziak
Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext (Spark 1.3.1 on Mesos, fine-grained mode). Is this a known problem or should I file a JIRA for it ? org.apache.spark.SparkException: Can only zip RDDs with same

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
grained scheduler, there is a spark.cores.max config setting that > will limit the total # of cores it grabs. This was there in earlier > versions too. > > Matei > > > On May 19, 2015, at 12:39 PM, Thomas Dudziak wrote: > > > > I read the other day that there will b
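For reference, a sketch of the existing knob mentioned in the reply, in coarse-grained Mesos mode (the value is illustrative):

    import org.apache.spark.SparkConf

    // Coarse-grained Mesos mode with an upper bound on the total cores the
    // job may hold across the cluster, which in turn caps concurrent tasks.
    val conf = new SparkConf()
      .set("spark.mesos.coarse", "true")
      .set("spark.cores.max", "64")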

Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I read the other day that there will be a fair number of improvements in 1.4 for Mesos. Could I ask for one more (if it isn't already in there): a configurable limit for the number of tasks for jobs run on Mesos ? This would be a very simple yet effective way to prevent a job dominating the cluster

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
I've just been through this exact case with shaded guava in our Mesos setup and that is how it behaves there (with Spark 1.3.1). cheers, Tom On Fri, May 15, 2015 at 12:04 PM, Marcelo Vanzin wrote: > On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak wrote: > >> Actually t

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
Actually the extraClassPath settings put the extra jars at the end of the classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them at the front. cheers, Tom On Fri, May 15, 2015 at 11:54 AM, Marcelo Vanzin wrote: > Ah, I see. yeah, it sucks that Spark has to expose Optional (
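For reference, the knobs involved in the ordering problem described above; the jar path is a placeholder, and the userClassPathFirst flags are experimental in the 1.3 line and an assumption here rather than something the thread endorsed (the eventual answer in this thread was to shade Guava in the application instead):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Appended to the classpath, so Spark's own Guava classes still win:
      .set("spark.driver.extraClassPath", "/opt/libs/guava-14.0.1.jar")
      .set("spark.executor.extraClassPath", "/opt/libs/guava-14.0.1.jar")
      // Experimental: prefer the application's classes over Spark's.
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")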

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
This is still a problem in 1.3. Optional is both used in several shaded classes within Guava (e.g. the Immutable* classes) and itself uses shaded classes (e.g. AbstractIterator). This causes problems in application code. The only reliable way we've found around this is to shade Guava ourselves for