Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Charles Feduke
As Akhil says, Ubuntu is a good choice if you're starting from near scratch. Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and other big data tools, so you can get a cluster running with very little effort. Keep in mind Cloudera is a for-profit corporation, so they also sell

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Charles Feduke
You could also try setting your `nofile` value in /etc/security/limits.conf for `soft` to some ridiculously high value if you haven't done so already. On Fri, Apr 3, 2015 at 2:09 AM Akhil Das wrote: > Did you try these? > > - Disable shuffle : spark.shuffle.spill=false > - Enable log rotation: >
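
For reference, a sketch of what that limits.conf change might look like (values are illustrative; size them to your workload, and note the soft limit cannot exceed the hard limit):

    # /etc/security/limits.conf - illustrative open-file limits
    *    soft    nofile    100000
    *    hard    nofile    100000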

Re: Connection pooling in spark jobs

2015-04-02 Thread Charles Feduke
ns on spark end so > that the connections can be reused across jobs > > > On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke > wrote: > >> How long does each executor keep the connection open for? How many >> connections does each executor open? >> >> Are you

Re: Connection pooling in spark jobs

2015-04-02 Thread Charles Feduke
How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection
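
For context, the usual compromise is one small pool per executor JVM rather than per task. A minimal Scala sketch, assuming Commons DBCP2 on the classpath; the URL and write logic are placeholders:

    import org.apache.commons.dbcp2.BasicDataSource
    import org.apache.spark.rdd.RDD

    // One lazily initialized pool per executor JVM (Scala objects are per-JVM singletons).
    object ConnectionPool {
      lazy val dataSource: BasicDataSource = {
        val ds = new BasicDataSource()
        ds.setUrl("jdbc:postgresql://db-host/mydb") // placeholder URL
        ds.setMaxTotal(4) // cap connections opened by each executor
        ds
      }
    }

    def writeOut(rdd: RDD[String]): Unit =
      rdd.foreachPartition { rows =>
        val conn = ConnectionPool.dataSource.getConnection
        try rows.foreach { row =>
          // INSERT the row using conn (placeholder)
          ()
        }
        finally conn.close() // returns the connection to the pool
      }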

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Charles Feduke
Assuming you are on Linux, what is your /etc/security/limits.conf set for nofile/soft (number of open file handles)? On Fri, Mar 20, 2015 at 3:29 PM Shuai Zheng wrote: > Hi All, > > > > I try to run a simple sort by on 1.2.1. And it always give me below two > errors: > > > > 1, 15/03/20 17:48:29
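
A quick sanity check on each worker is the shell's own limit, run as the same user Spark runs as:

    ulimit -n    # prints the current open-files soft limit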

Re: Writing Spark Streaming Programs

2015-03-19 Thread Charles Feduke
Scala is the language Spark itself is written in, so there's never a situation in which features introduced in a newer version of Spark cannot be taken advantage of if you write your code in Scala. (This is mostly true of Java as well, though it may take a little more legwork if a Java-friendly adapter isn't available

Re: Spark History server default conf values

2015-03-10 Thread Charles Feduke
What I found from a quick search of the Spark source code (from my local snapshot on January 25, 2015):

    // Interval between each check for event log updates
    private val UPDATE_INTERVAL_MS = conf.getInt("spark.history.fs.updateInterval",
      conf.getInt("spark.history.updateInterval", 10)) * 1000
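
Given that snippet, the interval is read in seconds and converted to milliseconds. One hedged way to override it for the history server, using the standard SPARK_HISTORY_OPTS hook (value illustrative):

    # conf/spark-env.sh for the history server - check every 30s (illustrative)
    export SPARK_HISTORY_OPTS="-Dspark.history.fs.updateInterval=30"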

Re: Spark on EC2

2015-02-24 Thread Charles Feduke
This should help you understand the cost of running a Spark cluster for a short period of time: http://www.ec2instances.info/ If you run an instance for even 1 second of a single hour you are charged for that complete hour. So before you shut down your miniature cluster make sure you really are d

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Charles Feduke
ga/schemavalidator.properties >> >> I couldn't understand why I couldn't get to the value of "propertiesFile" >> by using standard System.getProperty method. (I can use new >> SparkConf().get("spark.driver.extraJavaOptions") and manually pars

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
ueries directly to the > MySQL database. Since in theory I only have to do this once, I'm not > sure there's much to be gained in moving the data from MySQL to Spark > first. > > I have yet to find any non-trivial examples of ETL logic on the web ... > it seems like it'

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
I cannot comment on the correctness of Python code. I will assume your caper_kv is keyed on something that uniquely identifies all the rows that make up the person's record, so your groupByKey makes sense, as does the map. (I will also assume all of the rows that comprise a single person's record

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Charles Feduke
I haven't actually tried mixing non-Spark settings into the Spark properties. Instead I package my properties into the jar and use the Typesafe Config[1] - v1.2.1 - library (along with Ficus[2] - Scala specific) to get at my properties: Properties file: src/main/resources/integration.conf (below
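
A minimal sketch of that pattern (key names and values are made up for illustration):

    import com.typesafe.config.ConfigFactory

    // src/main/resources/integration.conf might contain (hypothetical):
    //   integration {
    //     db.url = "jdbc:postgresql://localhost/test"
    //   }
    val config = ConfigFactory.load("integration") // loads integration.conf from the classpath
    val dbUrl = config.getString("integration.db.url")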

Re: exception with json4s render

2015-02-11 Thread Charles Feduke
I was having a similar problem to this trying to use the Scala Jackson module yesterday. I tried setting `spark.files.userClassPathFirst` to true but I was still having problems due to the older version of Jackson that Spark has a dependency on. (I think it's an old org.codehaus version.) I ended u
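
For what it's worth, one common way out of this class of conflict - not necessarily what was done here - is to shade the offending package with a recent sbt-assembly, along these lines:

    // build.sbt (sbt-assembly) - illustrative shading of the conflicting package
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.codehaus.jackson.**" -> "shaded.jackson.@1").inAll
    )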

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
bug report after someone from the dev team chimes in on this issue. On Wed Feb 11 2015 at 2:20:34 PM Charles Feduke wrote: > Take a look at this: > > http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre > > Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
that route, since that's the performance advantage Spark has over vanilla Hadoop. On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein wrote: > Thanks for the info. The file system in use is a Lustre file system. > > Best, > Tassilo > > On Wed, Feb 11, 2015 at 12:15 PM, Charles Fed

Re: iteratively modifying an RDD

2015-02-11 Thread Charles Feduke
If you use mapPartitions to iterate the lookup_tables does that improve the performance? This link is to Spark docs 1.1 because both latest and 1.2 for Python give me a 404: http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions On Wed Feb 11 2015 at 1:48:42 PM rok
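
A spark-shell-style sketch of the idea, in Scala rather than Python; the lookup table contents are stand-ins for an expensive load:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))
    def buildLookupTable(): Map[String, String] =
      Map("a" -> "alpha", "b" -> "beta") // placeholder for an expensive load
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    val result = rdd.mapPartitions { iter =>
      val lookup = buildLookupTable() // built once per partition, not once per element
      iter.map(x => lookup.getOrElse(x, x))
    }
    result.collect().foreach(println)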

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
A central location, such as NFS? If they are temporary for the purpose of further job processing, you'll want to keep them local to the node in the cluster, i.e., in /tmp. If they are centralized, you won't be able to take advantage of data locality and the central file store will become a bottleneck

Re: How do I set spark.local.dirs?

2015-02-06 Thread Charles Feduke
Did you restart the slaves so they would read the settings? You don't need to start/stop the EC2 cluster, just the slaves. From the master node:

    $SPARK_HOME/sbin/stop-slaves.sh
    $SPARK_HOME/sbin/start-slaves.sh

($SPARK_HOME is probably /root/spark) On Fri Feb 06 2015 at 10:31:18 AM Joe Wass wrote:
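
For reference, the setting this thread is about typically lives in conf/spark-env.sh on each worker (path below is illustrative) and only takes effect after that restart:

    # conf/spark-env.sh on each worker - illustrative scratch location
    export SPARK_LOCAL_DIRS=/mnt/spark-scratch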

Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Charles Feduke
Good questions, some of which I'd like to know the answer to.

>> Is it okay to update a NoSQL DB with aggregated counts per batch interval or is it generally stored in hdfs?

This depends on how you are going to use the aggregate data.

1. Is there a lot of data? If so, and you are going to use t
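
If you do write aggregates out per batch, the usual shape is foreachRDD plus a per-partition sink. A hedged Scala sketch; `saveCount` stands in for whatever NoSQL upsert you'd actually use:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical sink: upsert one aggregate into the store of your choice.
    def saveCount(key: String, count: Long): Unit = ()

    def persistCounts(counts: DStream[(String, Long)]): Unit =
      counts.foreachRDD { rdd =>
        rdd.foreachPartition { iter =>
          // in practice, open one connection per partition here
          iter.foreach { case (k, c) => saveCount(k, c) }
        }
      }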

Re: Parsing CSV files in Spark

2015-02-06 Thread Charles Feduke
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a merged CSV (instead of the various part-n files). You might find the following links useful: - This article is about combining the part files and outputting a header as the first line in the merged results: http://jav

Re: spark on ec2

2015-02-05 Thread Charles Feduke
I don't see anything that says you must explicitly restart them to load the new settings, but usually there is some sort of signal trapped [or brute force full restart] to get a configuration reload for most daemons. I'd take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on yo

Re: How to design a long live spark application

2015-02-05 Thread Charles Feduke
If you want to design something like the Spark shell, have a look at: http://zeppelin-project.org/ It's open source and may already do what you need. If not, its source code will be helpful in answering your questions about how to integrate with long-running jobs. On Thu Feb 05 2015 at 11

Re: Writing RDD to a csv file

2015-02-03 Thread Charles Feduke
In case anyone needs to merge all of their part-n files (small result set only) into a single *.csv file or needs to generically flatten case classes, tuples, etc., into comma separated values: http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/ On Tue Feb 03 2015 at 8:23:59 AM k
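
As a tiny illustration of the flattening idea (not the article's code; assumes an existing SparkContext `sc` and a small result set, since coalesce(1) funnels everything through one partition):

    case class Person(name: String, age: Int)

    val people = sc.parallelize(Seq(Person("ann", 34), Person("bob", 28)))
    people
      .map(p => s"${p.name},${p.age}") // flatten the case class to a CSV line
      .coalesce(1)                     // one partition => a single part file
      .saveAsTextFile("/tmp/people-csv")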

Re: Serialized task result size exceeded

2015-01-30 Thread Charles Feduke
Are you using the default Java object serialization, or have you tried Kryo yet? If you haven't tried Kryo, please do and let me know how much it impacts the serialized size. (I know it's more efficient; I'm curious to know how much more efficient, and I'm being lazy - I don't have ~6K 500MB files
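
For anyone trying it, switching the serializer is a one-line config change; registering classes is optional but shrinks the output further. `MyRecord` is a made-up example class:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, value: String) // illustrative payload type

    val conf = new SparkConf()
      .setAppName("kryo-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord])) // optional, reduces serialized size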

Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
You'll still need to:

    import org.apache.spark.SparkContext._

Importing org.apache.spark._ does _not_ recurse into sub-objects or sub-packages; it only brings in whatever is at the level of the package or object imported. SparkContext._ has some implicits, one of them for adding groupByKey to an
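
A minimal sketch of why that import matters in Spark 1.x (names illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // brings the pair-RDD implicits into scope

    val sc = new SparkContext("local[*]", "groupByKey-example")
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val grouped = pairs.groupByKey() // compiles only with the implicits in scope
    grouped.collect().foreach(println)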

Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
Define "not working". Not compiling? If so you need: import org.apache.spark.SparkContext._ On Fri Jan 30 2015 at 3:21:45 PM Amit Behera wrote: > hi all, > > my sbt file is like this: > > name := "Spark" > > version := "1.0" > > scalaVersion := "2.10.4" > > libraryDependencies += "org.apache.s

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
bash, but zsh handles tilde expansion the same as > bash. > > Nick > ​ > > On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke > wrote: > >> It was only hanging when I specified the path with ~ I never tried >> relative. >> >> Hanging on the waiting fo

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
hell will expand the ~/ to >> the absolute path before sending it to spark-ec2. (i.e. tilde expansion.) >> >> Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t >> matter, since we fixed that for Spark 1.2.0 >> <https://issues.apache.org/jira/browse/SPA

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
rServer.java:141) > > > Maybe it is about Hadoop 2.4.0, but I think this is what is included in > the binary download of Spark. I've also tried it with Spark 1.2.0 binary > (pre-built for Hadoop 2.4 and later). > > Or maybe I'm totally wrong, and the problem / fix is

Re: Spark and S3 server side encryption

2015-01-28 Thread Charles Feduke
I have been trying to work around a similar problem with my Typesafe config *.conf files seemingly not appearing on the executors. (Though now that I think about it, it's not because the files are absent in the JAR, but because the -Dconf.resource system property I pass to the master obviously d

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
istryFactory.class > > (probably because I'm using a self-contained JAR). > > In other words, I'm still stuck. > > -- > Emre > > > On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke > wrote: > >> I deal with problems like this so often across Java

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
I deal with problems like this so often across Java applications with large dependency trees. Add the shell function at the following link to your shell on the machine where your Spark Streaming is installed: https://gist.github.com/cfeduke/fe63b12ab07f87e76b38 Then run in the directory where you

Re: spark 1.2 ec2 launch script hang

2015-01-27 Thread Charles Feduke
Absolute path means no ~; also verify that the path to the file is correct. For some reason the Python code does not validate that the file exists and will hang (this is the same reason why ~ hangs). On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick wrote: > Try using an absolute path to the
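
An illustrative invocation with an absolute key path (key names, path, and cluster name are placeholders):

    ./spark-ec2 -k my-keypair -i /home/me/keys/my-keypair.pem \
      --region=us-west-1 launch my-cluster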

Re: HW imbalance

2015-01-26 Thread Charles Feduke
You should look at using Mesos. This should abstract away the individual hosts into a pool of resources and make the different physical specifications manageable. I haven't tried configuring Spark Standalone mode to have different specs on different machines but based on spark-env.sh.template: #
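
The spark-env.sh variables the template excerpt presumably refers to; values below are illustrative and would be set per machine:

    # conf/spark-env.sh - size each worker to its own hardware
    export SPARK_WORKER_CORES=16
    export SPARK_WORKER_MEMORY=48g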

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Charles Feduke
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
to appropriate sub-ranges.) Because of the sub-range bucketing and cluster distribution you shouldn't run into OOM errors, assuming you provision sufficient worker nodes in the cluster. On Sun Jan 25 2015 at 9:39:56 AM Charles Feduke wrote: > I'm facing a similar problem ex

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
I'm facing a similar problem except my data is already pre-sharded in PostgreSQL. I'm going to attempt to solve it like this:

- Submit the shard names (database names) across the Spark cluster as a text file and partition it so workers get 0 or more - hopefully 1 - shard name. In this case you co
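
A rough Scala sketch of that plan; `fetchRowsFromShard` is a hypothetical JDBC read of one shard database, and `sc` an existing SparkContext:

    // Stand-in for a JDBC query against the named shard database.
    def fetchRowsFromShard(shard: String): Seq[String] = Seq.empty

    val shardNames = sc.parallelize(Seq("shard01", "shard02"), 2) // ideally one per worker
    val rows = shardNames.flatMap(shard => fetchRowsFromShard(shard))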

Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Charles Feduke
I think you want to instead use `.saveAsSequenceFile` to save an RDD to someplace like HDFS or NFS if you are attempting to interoperate with another system, such as Hadoop. `.persist` is for keeping the contents of an RDD around so future uses of that particular RDD don't need to recalculate its contents
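
A short contrast of the two calls (illustrative; assumes an existing SparkContext `sc` and the Spark 1.x implicits):

    import org.apache.spark.SparkContext._ // pair-RDD and sequence-file implicits
    import org.apache.spark.storage.StorageLevel

    val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
    pairs.persist(StorageLevel.DISK_ONLY)         // app-private blocks in the local dirs
    pairs.saveAsSequenceFile("hdfs:///tmp/pairs") // durable copy other systems can read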

JDBC sharded solution

2015-01-24 Thread Charles Feduke
I'm trying to figure out the best approach to getting sharded data from PostgreSQL into Spark. Our production PGSQL cluster has 12 shards with TiB of data on each shard. (I won't be accessing all of the data on a shard at once, but I don't think it's feasible to use Sqoop to copy tables whose data