As Akhil says, Ubuntu is a good choice if you're starting from near scratch.
Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind that Cloudera is a for-profit corporation, so they also
sell
You could also try setting your `nofile` value in /etc/security/limits.conf
for `soft` to some ridiculously high value if you haven't done so already.
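For example, a line like this in limits.conf raises the soft limit for all
users (the number is just an arbitrarily large illustration, not a tuned
recommendation):

*    soft    nofile    1000000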
On Fri, Apr 3, 2015 at 2:09 AM Akhil Das wrote:
> Did you try these?
>
> - Disable shuffle : spark.shuffle.spill=false
> - Enable log rotation:
>
ns on spark end so
> that the connections can be reused across jobs
>
>
> On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke
> wrote:
>
>> How long does each executor keep the connection open for? How many
>> connections does each executor open?
>>
>> Are you
How long does each executor keep the connection open for? How many
connections does each executor open?
Are you certain that connection pooling is a performant and suitable
solution? Are you running out of resources on the database server and
cannot tolerate each executor having a single connection?
Assuming you are on Linux, what is your /etc/security/limits.conf set for
nofile/soft (number of open file handles)?
On Fri, Mar 20, 2015 at 3:29 PM Shuai Zheng wrote:
> Hi All,
>
>
>
> I try to run a simple sort by on 1.2.1. And it always give me below two
> errors:
>
>
>
> 1, 15/03/20 17:48:29
Scala is the language used to write Spark so there's never a situation in
which features introduced in a newer version of Spark cannot be taken
advantage of if you write your code in Scala. (This is mostly true of Java,
but it may take a little more legwork if a Java-friendly adapter isn't
available.)
What I found from a quick search of the Spark source code (from my local
snapshot on January 25, 2015):
// Interval between each check for event log updates
  private val UPDATE_INTERVAL_MS = conf.getInt("spark.history.fs.updateInterval",
    conf.getInt("spark.history.updateInterval", 10)) * 1000
pr
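Going by that snippet the value is interpreted as seconds (it gets multiplied
by 1000 into milliseconds). So, for example, to have the history server poll
every 30 seconds you could set it via SPARK_HISTORY_OPTS when starting the
daemon (-Dspark.history.fs.updateInterval=30), or in spark-defaults.conf if
your version reads it:

spark.history.fs.updateInterval  30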
This should help you understand the cost of running a Spark cluster for a
short period of time:
http://www.ec2instances.info/
If you run an instance for even 1 second of a single hour you are charged
for that complete hour. So before you shut down your miniature cluster, make
sure you really are done with it.
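As a purely hypothetical worked example: 10 worker nodes billed at $0.10/hour
come to $1.00 for any partial hour, and 10 x $0.10 x 24 = $24.00 if the
cluster is left running for a day. Check the real rates for your instance
type on the page above.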
ga/schemavalidator.properties
>>
>> I couldn't understand why I couldn't get to the value of "propertiesFile"
>> by using standard System.getProperty method. (I can use new
>> SparkConf().get("spark.driver.extraJavaOptions") and manually pars
ueries directly to the
> MySQL database. Since in theory I only have to do this once, I'm not
> sure there's much to be gained in moving the data from MySQL to Spark
> first.
>
> I have yet to find any non-trivial examples of ETL logic on the web ...
> it seems like it'
I cannot comment on the correctness of the Python code. I will assume your
caper_kv is keyed on something that uniquely identifies all the rows that
make up the person's record, so your group by key makes sense, as does the
map. (I will also assume all of the rows that comprise a single person's
reco
I haven't actually tried mixing non-Spark settings into the Spark
properties. Instead I package my properties into the jar and use the
Typesafe Config[1] library (v1.2.1), along with the Scala-specific Ficus[2]
library, to get at my properties:
Properties file: src/main/resources/integration.conf (below).
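Something along these lines (the keys and values here are made-up
placeholders rather than my real settings):

# src/main/resources/integration.conf
integration {
  db.url  = "jdbc:postgresql://localhost/test"
  db.user = "spark"
}

Then in Scala, plain Typesafe Config is enough to read it (Ficus just adds
nicer Scala types on top):

import com.typesafe.config.ConfigFactory

// loads integration.conf from the classpath (inside the assembled jar)
val config = ConfigFactory.load("integration")
val dbUrl  = config.getString("integration.db.url")
val dbUser = config.getString("integration.db.user")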
I was having a similar problem to this trying to use the Scala Jackson
module yesterday. I tried setting `spark.files.userClassPathFirst` to true
but I was still having problems due to the older version of Jackson that
Spark has a dependency on. (I think it's an old org.codehaus version.)
I ended u
bug report after someone
from the dev team chimes in on this issue.
On Wed Feb 11 2015 at 2:20:34 PM Charles Feduke
wrote:
> Take a look at this:
>
> http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
>
> Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf
that route,
since that's the performance advantage Spark has over vanilla Hadoop.
On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein wrote:
> Thanks for the info. The file system in use is a Lustre file system.
>
> Best,
> Tassilo
>
> On Wed, Feb 11, 2015 at 12:15 PM, Charles Fed
If you use mapPartitions to iterate the lookup_tables, does that improve
performance?
This link is to Spark docs 1.1 because both latest and 1.2 for Python give
me a 404:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
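In Scala terms the pattern looks roughly like this (a toy sketch with made-up
data; the same idea applies in PySpark):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("map-partitions-example").setMaster("local[*]"))
val ids = sc.parallelize(1 to 1000)

val resolved = ids.mapPartitions { iter =>
  // build the (expensive) lookup table once per partition instead of once per element
  val lookupTable: Map[Int, String] = (1 to 1000).map(i => i -> s"name-$i").toMap
  iter.map(id => lookupTable.getOrElse(id, "unknown"))
}
resolved.count()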
On Wed Feb 11 2015 at 1:48:42 PM rok
A central location, such as NFS?
If they are temporary for the purpose of further job processing you'll want
to keep them local to the node in the cluster, i.e., in /tmp. If they are
centralized you won't be able to take advantage of data locality and the
central file store will become a bottleneck.
Did you restart the slaves so they would read the settings? You don't need
to start/stop the EC2 cluster, just the slaves. From the master node:
$SPARK_HOME/sbin/stop-slaves.sh
$SPARK_HOME/sbin/start-slaves.sh
($SPARK_HOME is probably /root/spark)
On Fri Feb 06 2015 at 10:31:18 AM Joe Wass wrot
Good questions, some of which I'd like to know the answer to.
>> Is it okay to update a NoSQL DB with aggregated counts per batch
>> interval or is it generally stored in hdfs?
This depends on how you are going to use the aggregate data.
1. Is there a lot of data? If so, and you are going to use t
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a
merged CSV (instead of the various part-n files). You might find the
following links useful:
- This article is about combining the part files and outputting a header as
the first line in the merged results:
http://jav
I don't see anything that says you must explicitly restart them to load the
new settings, but usually there is some sort of signal trapped [or brute
force full restart] to get a configuration reload for most daemons. I'd
take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on
yo
If you want to design something like the Spark shell, have a look at:
http://zeppelin-project.org/
It's open source and may already do what you need. If not, its source code
will be helpful in answering the questions you have about how to integrate
with long-running jobs.
On Thu Feb 05 2015 at 11
In case anyone needs to merge all of their part-n files (small result
sets only) into a single *.csv file, or needs to generically flatten case
classes, tuples, etc., into comma-separated values:
http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
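One way to do the merge step itself is Hadoop's FileUtil.copyMerge; a rough
sketch (the paths are made up, and this assumes the Hadoop 2.x API):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)

// concatenate the part-n files under the source directory into a single file
FileUtil.copyMerge(
  fs, new Path("hdfs:///out/results"),      // directory containing part-n files
  fs, new Path("hdfs:///out/results.csv"),  // merged destination file
  false,                                    // don't delete the source directory
  hadoopConf, null)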
On Tue Feb 03 2015 at 8:23:59 AM k
Are you using the default Java object serialization, or have you tried Kryo
yet? If you haven't tried Kryo please do and let me know how much it
impacts the serialization size. (I know it's more efficient; I'm curious to
know how much more efficient, and I'm being lazy - I don't have ~6K 500MB
files
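For reference, switching to Kryo is just configuration (the case class below
is a made-up stand-in for whatever you actually serialize):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, payload: String)  // hypothetical example class

val conf = new SparkConf()
  .setAppName("kryo-example")
  // replace the default Java serialization with Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registering classes keeps Kryo from writing full class names into every record
  .registerKryoClasses(Array(classOf[MyRecord]))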
You'll still need to:
import org.apache.spark.SparkContext._
Importing org.apache.spark._ does _not_ recurse into sub-objects or
sub-packages; it only brings in whatever is at the level of the package or
object imported.
SparkContext._ has some implicits, one of them for adding groupByKey to an
RDD of key/value pairs.
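A tiny example that only compiles with that import in scope on Spark 1.x:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions & co. into scope

val sc = new SparkContext(
  new SparkConf().setAppName("group-by-key-example").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// groupByKey comes from PairRDDFunctions via the implicit conversion above
val grouped = pairs.groupByKey()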
Define "not working". Not compiling? If so you need:
import org.apache.spark.SparkContext._
On Fri Jan 30 2015 at 3:21:45 PM Amit Behera wrote:
> hi all,
>
> my sbt file is like this:
>
> name := "Spark"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.s
bash, but zsh handles tilde expansion the same as
> bash.
>
> Nick
>
>
> On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke
> wrote:
>
>> It was only hanging when I specified the path with ~ I never tried
>> relative.
>>
>> Hanging on the waiting fo
hell will expand the ~/ to
>> the absolute path before sending it to spark-ec2. (i.e. tilde expansion.)
>>
>> Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t
>> matter, since we fixed that for Spark 1.2.0
>> <https://issues.apache.org/jira/browse/SPA
rServer.java:141)
>
>
> Maybe it is about Hadoop 2.4.0, but I think this is what is included in
> the binary download of Spark. I've also tried it with Spark 1.2.0 binary
> (pre-built for Hadoop 2.4 and later).
>
> Or maybe I'm totally wrong, and the problem / fix is
I have been trying to work around a similar problem with my Typesafe config
*.conf files seemingly not appearing on the executors. (Though now that I
think about it, it's not because the files are absent from the JAR, but because
the -Dconf.resource system property I pass to the master obviously
d
istryFactory.class
>
> (probably because I'm using a self-contained JAR).
>
> In other words, I'm still stuck.
>
> --
> Emre
>
>
> On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke
> wrote:
>
>> I deal with problems like this so often across Java
I deal with problems like this so often across Java applications with large
dependency trees. Add the shell function at the following link to your
shell on the machine where your Spark Streaming is installed:
https://gist.github.com/cfeduke/fe63b12ab07f87e76b38
Then run in the directory where you
Absolute path means no ~; also verify that you have the path to the file
correct. For some reason the Python code does not validate that the file
exists and will hang (this is the same reason why ~ hangs).
On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick wrote:
> Try using an absolute path to the
You should look at using Mesos. This should abstract away the individual
hosts into a pool of resources and make the different physical
specifications manageable.
I haven't tried configuring Spark Standalone mode to have different specs
on different machines, but based on spark-env.sh.template it should be
possible.
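From memory (so double-check the template that ships with your version), the
per-worker settings look like this:

# SPARK_WORKER_CORES, to set the number of cores to use on this machine
# SPARK_WORKER_MEMORY, to set how much total memory workers have to give
#   executors (e.g. 1000m, 2g)
# SPARK_WORKER_INSTANCES, to set the number of worker processes per node

spark-env.sh is evaluated on each machine, so each host can export different
values for these.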
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts.
I downloaded Spark 1.2, prebuilt for Hadoop 2.4 and later.
What parameters are you using when you execute spark-ec2?
I am launching in the us-west-1 region (ami-7a320f3f), which may explain
things.
On Mon Jan 26 2015
to appropriate sub-ranges.)
Because of the sub-range bucketing and cluster distribution you shouldn't
run into OOM errors, assuming you provision sufficient worker nodes in the
cluster.
On Sun Jan 25 2015 at 9:39:56 AM Charles Feduke
wrote:
> I'm facing a similar problem ex
I'm facing a similar problem except my data is already pre-sharded in
PostgreSQL.
I'm going to attempt to solve it like this:
- Submit the shard names (database names) across the Spark cluster as a
text file and partition it so workers get 0 or more - hopefully 1 - shard
name. In this case you co
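A rough sketch of that first step (the connection details, table, and path
are all hypothetical):

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("shard-pull").setMaster("local[*]"))

// one shard (database) name per line; ask for enough partitions that each worker sees ~1 shard
val shardNames = sc.textFile("hdfs:///config/shards.txt", 12)

val rows = shardNames.flatMap { shard =>
  // hypothetical JDBC URL; each task pulls its shard's rows directly from PostgreSQL
  val conn = DriverManager.getConnection(s"jdbc:postgresql://db-host/$shard", "user", "password")
  try {
    val rs = conn.createStatement().executeQuery("SELECT id, payload FROM events")
    val buffer = scala.collection.mutable.ArrayBuffer.empty[(Long, String)]
    while (rs.next()) buffer += ((rs.getLong("id"), rs.getString("payload")))
    buffer  // buffered in memory only because this is a sketch; a real job would page or stream
  } finally conn.close()
}
rows.count()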
I think you want to instead use `.saveAsSequenceFile` to save an RDD to
someplace like HDFS or NFS if you are attempting to interoperate with
another system, such as Hadoop. `.persist` is for keeping the contents of
an RDD around so future uses of that particular RDD don't need to
recalculate its contents.
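A minimal sketch of the difference (the paths are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // needed on Spark 1.x for saveAsSequenceFile on pair RDDs
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("persist-vs-save").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// writes the RDD out as a Hadoop SequenceFile that other systems (e.g. Hadoop jobs) can read later
pairs.saveAsSequenceFile("hdfs:///data/pairs-seq")

// only caches the RDD inside this application so later actions on it skip recomputation
pairs.persist(StorageLevel.MEMORY_ONLY)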
I'm trying to figure out the best approach to getting sharded data from
PostgreSQL into Spark.
Our production PGSQL cluster has 12 shards with TiB of data on each shard.
(I won't be accessing all of the data on a shard at once, but I don't think
it's feasible to use Sqoop to copy tables whose data