:
spark.mesos.coarse true
Or, from this page
<http://spark.apache.org/docs/latest/running-on-mesos.html>, set the
property in a SparkConf object used to construct the SparkContext:
conf.set("spark.mesos.coarse", "true")
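For example, a minimal sketch (the app name and Mesos master URL below are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyApp")                        // hypothetical app name
  .setMaster("mesos://zk://host:2181/mesos")  // hypothetical master URL
  .set("spark.mesos.coarse", "true")
val sc = new SparkContext(conf)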
dean
Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the
Spark project has published maven artifacts that are compiled with 2.11 and
2.10, although the downloads at http://spark.apache.org/downloads.html are
still all for 2.10.
R? Are you running Spark
in Mesos, but accessing data in MapR-FS?
Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
HTH,
dean
Here's our home page: http://www.meetup.com/Chicago-Spark-Users/
Thanks,
Dean
Is myRDD outside a DStream? If so, are you persisting it on each batch
iteration? It should be checkpointed frequently, too.
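For example, a rough sketch, assuming myRDD is regenerated each batch and the checkpoint directory is a placeholder:
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")  // hypothetical path
myRDD.persist()     // or persist(StorageLevel.MEMORY_AND_DISK)
myRDD.checkpoint()  // keeps the lineage from growing without bound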
le, but you don't need to call collect
on toUpdate before using foreach(println). If the RDD is huge, you
definitely don't want to do that.
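Something like this, assuming toUpdate is an RDD (note that with foreach the printing happens on the executors, not in the driver):
toUpdate.foreach(println)           // runs on the executors
toUpdate.take(10).foreach(println)  // or print a small sample in the driver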
Hope this helps.
dean
There are ClassTags for Array, List, and Map, as well as for Int, etc. that
you might have inside those collections. What do you mean by sql? Could you
post more of your code?
ry time. Put sc.stop() at the end of the page.
dean
You can call the collect() method to return a collection, but be careful.
If your data is too big to fit in the driver's memory, it will crash.
More specifically, you could have TBs of data across thousands of
partitions for a single RDD. If you call collect(), BOOM!
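A small sketch of safer alternatives, assuming rdd is one of these large RDDs:
val preview = rdd.take(20)                        // brings back only 20 records
val sampled = rdd.sample(false, 0.001).collect()  // or collect a small sample
// val all = rdd.collect()                        // risky: the whole RDD lands in driver memory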
ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The
Java 7 method returns a Set. Are you running Java 7? What happens if you
run Java 8?
t was compiled with Java 6 (see
https://en.wikipedia.org/wiki/Java_class_file). So, it doesn't appear to be
a Spark build issue.
dean
ClassNotFoundException usually means one of a few problems:
1. Your app assembly is missing the jar files with those classes.
2. You mixed jar files from incompatible versions in your assembly.
3. You built with one version of Spark and deployed to another.
It works on Mesos, too. I'm not sure about Standalone mode.
for 10 seconds?
HTH,
dean
Michael Armbrust's talk at Spark
Summit East nicely made this point.
http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming
dean
it of your example,
(r: ResultSet) => (r.getInt("col1"),r.getInt("col2")...r.getInt("col37")
)
could add nested () to group elements and keep the outer number of elements
<= 22.
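A minimal sketch of the nesting idea, using a few hypothetical columns (col1..col4) for brevity:
import java.sql.ResultSet
val extract = (r: ResultSet) =>
  ((r.getInt("col1"), r.getInt("col2")),  // first group
   (r.getInt("col3"), r.getInt("col4")))  // second group, and so on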
dean
Try putting the following in the SBT build file, e.g., ./build.sbt:
// Fork when running, so you can ...
fork := true
// ... add a JVM option to use when forking a JVM for 'run'
javaOptions += "-Dlog4j.configuration=file:/"
dean
-jars"
option. Note that the latter may not be an ideal solution if it has other
dependencies that also need to be passed.
dean
Check out StreamingContext.queueStream (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
)
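A minimal test sketch, assuming ssc is your StreamingContext:
import scala.collection.mutable
import org.apache.spark.rdd.RDD
val queue = mutable.Queue.empty[RDD[Int]]
val stream = ssc.queueStream(queue)
stream.print()
queue += ssc.sparkContext.makeRDD(1 to 100)  // push "batches" into the queue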
dean
Dynamic allocation doesn't work yet with Spark Streaming in any cluster
scenario. There was a previous thread on this topic which discusses the
issues that need to be resolved.
-separated input data:
case class Address(street: String, city: String)
case class User (name: String, address: Address)
sc.textFile("/path/to/stuff").
  map { line =>
    val array = line.split(",") // assume: "name,street,city"
    User(array(0), Address(array(1), array(2)))
  }.toDF()
scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
cated.
You can look at the logic in the Spark code base, RDD.scala (first method
calls the take method) and SparkContext.scala (runJob method, which take
calls).
However, the exceptions definitely look like bugs to me. There must be some
empty partitions.
dean
If you mean retaining data from past jobs, try running the history server,
documented here:
http://spark.apache.org/docs/latest/monitoring.html
hadoop-2.6 is supported (look for "profile" XML in the pom.xml file).
For Hive, add "-Phive -Phive-thriftserver" (see
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
for more details).
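For example, an illustrative command line, assuming you're building a 1.x source tree:
mvn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package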
dean
uld be a lot of network overhead. In some situations, a high
performance file system appliance, e.g., NAS, could suffice.
My $0.02,
dean
terval, if that works for your every 10-min. need.
confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
for full details.
dean
on to combine all the input streams into one, then have one
foreachRDD to write output?
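A rough sketch, assuming stream1 and stream2 are DStreams with the same element type:
val combined = stream1.union(stream2)  // or ssc.union(Seq(stream1, stream2))
combined.foreachRDD { rdd =>
  // write each batch once here, to whatever sink you're using
}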
driver is running, then you can use the Spark web UI on that machine to see
what the Spark job is doing.
dean
You can certainly start jobs without Chronos, but to automatically restart
finished jobs or to run jobs at specific times or periods, you'll want
something like Chronos.
dean
figuration of Zookeeper if you need master failover.
Hence, you don't see it often in production scenarios.
The Spark page on cluster deployments has more details:
http://spark.apache.org/docs/latest/cluster-overview.html
dean
;.
It looks like either the master service isn't running or isn't reachable
over your network. Is hadoopm0 publicly routable? Is port 7077 blocked? As
a test, can you telnet to it?
telnet hadoopm0 7077
r 1st integer, then do the filtering and final averaging
"downstream" if you can, i.e., where you actually need the final value. If
you need it on every batch iteration, then you'll have to do a reduce per
iteration.
If you don't mind using SBT with your Scala instead of Maven, you can see
the example I created here: https://github.com/deanwampler/spark-workshop
It can be loaded into Eclipse or IntelliJ
Did you use an absolute path in $path_to_file? I just tried this with
spark-shell v1.4.1 and it worked for me. If the URL is wrong, you should
see an error message from log4j that it can't find the file. For Windows, it
would be something like file:/c:/path/to/file, I believe.
Typesafe (http://typesafe.com). We provide commercial support for Spark on
Mesos and Mesosphere DCOS. We contribute to Spark's Mesos integration and
Spark Streaming enhancements.
That's the correct URL. Recent change? The last time I looked, earlier this
week, it still had the obsolete artifactory URL for URL1 ;)
Also, Spark on Mesos supports cluster mode:
http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode
Following Hadoop conventions, Spark won't overwrite an existing directory.
You need to provide a unique output path every time you run the program, or
delete or rename the target directory before you run the job.
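One common workaround is to generate a unique output path per run; a sketch, with a hypothetical base path:
val output = s"/path/to/output-${System.currentTimeMillis}"  // unique per run
rdd.saveAsTextFile(output)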
dean
. Is that really
a mandatory requirement for this problem?
HTH,
dean
Add the other Cassandra dependencies (dse.jar,
spark-cassandra-connect-java_2.10) to your --jars argument on the command
line.
org.apache.spark.streaming.twitter.TwitterInputDStream is a small class.
You could write your own that lets you change the filters at run time. Then
provide a mechanism in your app, like periodic polling of a database table
or file for the list of filters.
So, just before running the job, run the HDFS command at a shell
prompt: "hdfs dfs -ls hdfs://172.31.42.10:54310/./weblogReadResult".
Does it say the path doesn't exist?
isn't in the --jars list. I assume
that's where HelloWorld is found. Confusing, yes it is...
dean
It should work fine. I have an example script here:
https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala
(Spark 1.4.X)
What does "I am failing to do so" mean?
src/main/scala/sparkworkshop/InvertedIndex5b.scala>
for a taste of how concise it makes code!
4. Type inference: this is where Scala really shows its utility. It means a lot
less code to write, but you still get hints about the types of what you just wrote!
My $0.02.
dean
d routable inside the cluster. Recall that EC2 instances
have both public and private host names & IP addresses.
Also, is the port number correct for HDFS in the cluster?
dean
Here's a demonstration video from @noootsab himself (creator of Spark
Notebook) showing live charting in Spark Notebook. It's one reason I prefer
it over the other options.
https://twitter.com/noootsab/status/638489244160401408
files overall.
You are creating a HiveContext, then using the sql method instead of hql.
Is that deliberate?
The code doesn't work if you replace HiveContext with SQLContext. Lots of
exceptions are thrown, but I don't have time to investigate now.
dean
lements in the list.
If p_i is the ith partition, the final sum appears to be:
(2 + ... (2 + (2 + (2 + 0 + p_1) + p_2) + p_3) ...)
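A small illustration of that pattern (my sketch, not the original code): with fold, the "zero" value is added once per partition and once more when the per-partition results are combined on the driver.
val rdd = sc.parallelize(1 to 6, 3)  // 3 partitions
rdd.fold(2)(_ + _)                   // 21 (data) + 2*3 (per partition) + 2 (combine) = 29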
aDstream.transform(_.distinct()) will only make the elements of each RDD
in the DStream distinct, not for the whole DStream globally. Is that what
you're seeing?
Any particular reason you're not just downloading a build from
http://spark.apache.org/downloads.html? Even if you aren't using Hadoop, any
of those builds will work.
If you want to build from source, the Maven build is more reliable.
dean
In 1.2 it's a member of SchemaRDD and it becomes available on RDD (through
the "type class" mechanism) when you add a SQLContext, like so.
val sqlContext = new SQLContext(sc)
import sqlContext._
In 1.3, the method has moved to the new DataFrame type.
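For example, a minimal 1.3 sketch (the column names are hypothetical):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")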
You can use the coalesce method to reduce the number of partitions. You can
reduce to one if the data is not too big. Then write the output.
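A minimal sketch, assuming the data comfortably fits in one partition and the path is yours to choose:
rdd.coalesce(1).saveAsTextFile("/path/to/single-output")  // hypothetical path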
will be needed. Actually only
one would be enough, but the default number of partitions will be used. I
believe 8 is the default for Mesos. For local mode ("local[*]"), it's the
number of cores. You can also set the property "spark.default.parallelism".
HTH,
Dean
have
the same name as the corresponding SQL keyword.
HTH,
Dean
alternative is to
avoid using Scalaz in any closures passed to Spark methods, but that's
probably not what you want.
dean
ons needed to satisfy the limit. In this case, it will
trivially stop at the first.
Both spark-submit and spark-shell have a --jars option for passing
additional jars to the cluster. They will be added to the appropriate
classpaths.
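For example, an illustrative command where the class, master, and jar paths are all placeholders:
spark-submit --class com.example.Main --master spark://master:7077 --jars /path/to/dep1.jar,/path/to/dep2.jar /path/to/app.jar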
ll only need to grab the first element
(being careful that the partition isn't empty!) and then determine which of
those first lines has the header info.
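A sketch of that idea, assuming lines is an RDD[String]:
val firstLines = lines.mapPartitionsWithIndex { (i, iter) =>
  if (iter.hasNext) Iterator((i, iter.next())) else Iterator.empty
}.collect()  // tiny result: at most one line per partition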
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before
1.3, Spark SQL was considered experimental and API changes were allowed.
So, versions of H2O and ADA compatible with 1.2.X might not work with 1.3.
dean
,
but that shouldn't be this issue.
:51849/),
Path(/user/MapOutputTracker)]
It's trying to connect to an Akka actor on itself, using the loopback
address.
Try changing SPARK_LOCAL_IP to the publicly routable IP address.
dean
is such a common problem, I usually define a "parse" method
that converts input text to the desired schema. It catches parse exceptions
like this and reports the bad line at least. If you can return a default
long in this case, say 0, that makes it easier to return something.
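A rough sketch of such a helper, assuming comma-separated lines where the second field should be a long:
def parse(line: String): (String, Long) = {
  val fields = line.split(",")
  try {
    (fields(0), fields(1).trim.toLong)
  } catch {
    case e: Exception =>
      println(s"Bad line: $line")  // or log it properly
      (line, 0L)                   // default keeps the pipeline going
  }
}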
dean
Yes, that's the problem. The RDD class exists in both binary jar files, but
the signatures probably don't match. The bottom line, as always for tools
like this, is that you can't mix versions.
rerun
them as the same user. Or look at what the EC2 scripts do.
Is it possible "tbBER" is empty? If so, it shouldn't fail like this, of
course.
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection
in the Spark SQL library. Make sure it's in the classpath and the version
is correct, too.
I have a "self-study" workshop here:
https://github.com/deanwampler/spark-workshop
dean
ost its definition?
Calling take(1) to grab the first element should also work, even if
the RDD is empty. (It will return an empty array in that case, rather than
throwing an exception.)
dean
from spark.apache.org. Even the Hadoop builds there will work
okay, as they don't actually attempt to run Hadoop commands.
Use JavaSparkContext.parallelize.
http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)
Are you allocating 1 core per input stream plus additional cores for the
rest of the processing? Each input stream receiver requires a dedicated core.
So, if you have two input streams, you'll need "local[3]" at least.
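For example, a minimal sketch for two receiver-based streams in local mode (the app name is a placeholder):
val conf = new SparkConf().setMaster("local[3]").setAppName("TwoStreams")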
You're welcome. Two limitations to know about:
1. I haven't updated it to 1.3
2. It uses Scala for all examples (my bias ;), so less useful if you don't
want to use Scala.
eeds your ability to
process it. What's your streaming batch window size?
See also here for ideas:
http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#performance-tuning
ase expression to the left of the => pattern matches
on the input tuples.
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.
https://github.com/scodec/scodec
Without the rest of your code, it's hard to know what might be
unserializable.
Use the MVN build instead. From the README in the git repo (
https://github.com/apache/spark)
mvn -DskipTests clean package
The runtime attempts to serialize everything required by records, and also
any lambdas/closures you use. Small, simple types are less likely to run
into this problem.
it holds a
database connection, same problem.
You can't suppress the warning because it's actually an error. The
VoidFunction can't be serialized to send it over the cluster's network.
dean
VendorRecords and fields pulled from HBase.
Whatever you can do to make this work like "table scans" and joins will
probably be most efficient.
dean
> On 7 April 2015 at 03:33, Dean Wampler wrote:
>> The "log" instance won't be serializable, be
foreach() runs in parallel across the cluster, like map, flatMap, etc.
You'll only run into problems if you call collect(), which brings the
entire RDD into memory in the driver program.
d that you switch to Java 8 and Lambdas, or go all the way
to Scala, so all that noisy code shrinks down to simpler expressions.
You'll be surprised how helpful that is for comprehending your code and
reasoning about it.
dean
with the first element and the size of the
second (each CompactBuffer). An alternative pattern match syntax would be:
scala> val i2 = i1.map { case (key, buffer) => (key, buffer.size) }
This should work as long as none of the CompactBuffers are too large, which
could happen for extremely
That appears to work, with a few changes to get the types correct:
input.distinct().combineByKey(
  (s: String) => 1,
  (agg: Int, s: String) => agg + 1,
  (agg1: Int, agg2: Int) => agg1 + agg2)
dean
If you're running Hadoop, too, now that Hortonworks supports Spark, you
might be able to use their distribution.
Without the rest of your code it's hard to make sense of errors. Why do you
need to use reflection?
Make sure you use the same Scala version throughout; 2.10.4 is
recommended. That's still the official version for Spark, even though
provisional support for 2.11 exists.
g and interoperating with external processes. Perhaps Java has
something similar these days?
dean
JVM's often have significant GC overhead with heaps bigger than 64GB. You
might try your experiments with configurations below this threshold.
dean
o hold a much
bigger data set, if necessary.
HOWEVER, it actually returns a CompactBuffer.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L444
The convention for standalone cluster is to use Zookeeper to manage master
failover.
http://spark.apache.org/docs/latest/spark-standalone.html
It's mostly manual. You could try automating with something like Chef, of
course, but there's nothing already available in terms of automation.
dean
s();
dean
Are the tasks on the slaves also running as root? If not, that might
explain the problem.
dean
I would use the "ps" command on each machine while the job is running to
confirm that every process involved is running as root.