Cleanup after Spark SQL job with window aggregation takes a long time

2016-08-29 Thread Jestin Ma
After a Spark SQL job that appends a few columns using window aggregation functions, then performs a join and some data massaging, I find that the cleanup after the job finishes saving the result data to disk takes as long as, if not longer than, the job itself. I currently am performing window aggregation on a

Re: Converting DataFrame's int column to Double

2016-08-25 Thread Jestin Ma
How about this: df.withColumn("doubles", col("ints").cast("double")).drop("ints") On Thu, Aug 25, 2016 at 2:09 PM, Marco Mistroni wrote: > hi all > i might be stuck in old code, but this is what i am doing to convert a > DF int column to Double > > val intToDoubleFunc:(Int => Double) = lbl =>
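The cast-based approach from the reply can be sketched as a self-contained script (the DataFrame and column names here are illustrative, not from the original thread):

```scala
// Spark 2.x. A session is built explicitly so the snippet is
// self-contained; in spark-shell, `spark` already exists.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("cast").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("ints")

// Prefer cast over a UDF: Catalyst can optimize the cast directly,
// and there is no boxing through a Scala function.
val doubles = df.withColumn("doubles", col("ints").cast("double")).drop("ints")
```

The same idea extends to any numeric conversion by changing the type name passed to `cast`.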

Caching broadcasted DataFrames?

2016-08-25 Thread Jestin Ma
I have a DataFrame d1 that I would like to join with two separate DataFrames. Since d1 is small enough, I broadcast it. What I understand about cache vs broadcast is that cache leads to each executor storing the partitions it's assigned in memory (cluster-wide in-memory). Broadcast leads to each no
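A minimal sketch of the pattern being described, with hypothetical toy DataFrames; `broadcast` is the Spark SQL join hint, which ships a full copy of the small side to every executor instead of shuffling it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.master("local[*]").appName("bcast").getOrCreate()
import spark.implicits._

val d1   = Seq((1, "a"), (2, "b")).toDF("id", "v")        // small side
val big1 = Seq((1, 10), (2, 20), (3, 30)).toDF("id", "x")
val big2 = Seq((1, 100), (3, 300)).toDF("id", "y")

// Hint the optimizer to broadcast d1 in both joins. The hint applies
// per query plan, so mark d1 each time it participates in a join.
val j1 = big1.join(broadcast(d1), "id")
val j2 = big2.join(broadcast(d1), "id")
```

Note that `.cache()` and `broadcast(...)` are orthogonal: caching keeps d1's partitions in executor memory across jobs, while the broadcast hint changes the physical join strategy for one plan.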

Re:

2016-08-14 Thread Jestin Ma
>> > doing a union of the results? It is possible that with this technique, >> that >> > the join which only contains skewed data would be filtered enough to >> allow >> > broadcasting of one side. >> > >> > On Sat, Aug 13, 2016 at 11:15 PM, Jesti

Re:

2016-08-14 Thread Jestin Ma

[no subject]

2016-08-13 Thread Jestin Ma
Hi, I'm currently trying to perform an outer join between two DataFrames/Sets, one ~150 GB and one ~50 GB, on a column, id. df1.id is skewed in that there are many 0's, the rest being unique IDs. df2.id is not skewed. If I filter df1.id != 0, then the join works well. If I don't, then the
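One common workaround for key skew of this shape, sketched below on toy data, is exactly the filter-and-union technique mentioned in the replies: join the non-skewed keys normally, join the skewed key separately (where one side is now small enough to broadcast), and union the results:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("skew").getOrCreate()
import spark.implicits._

val df1 = Seq((0, "a"), (0, "b"), (1, "c"), (2, "d")).toDF("id", "v1") // id 0 is skewed
val df2 = Seq((0, "x"), (1, "y"), (3, "z")).toDF("id", "v2")

// Join the non-skewed keys normally...
val normal = df1.filter($"id" =!= 0).join(df2.filter($"id" =!= 0), Seq("id"), "outer")

// ...and the skewed key on its own; filtering down to one key often
// shrinks a side enough for a broadcast join. Then union both results.
val skewed = df1.filter($"id" === 0).join(df2.filter($"id" === 0), Seq("id"), "outer")
val result = normal.union(skewed)
```

The union is equivalent to the direct outer join because the two joins partition the key space; the gain is that neither sub-join has one key dominating a single task.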

DataFrameWriter saving DataFrame timestamp in weird format

2016-08-11 Thread Jestin Ma
When I load in a timestamp column and try to save it immediately without any transformations, the output time is unix time with padded 0's until there are 16 values. For example, loading in a time of August 3, 2016, 00:36:25 GMT, which is 1470184585 in UNIX time, saves as 147018458500. When I
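While tracking down the writer's default rendering, one workaround is to serialize the timestamp explicitly before saving, e.g. as epoch seconds or a formatted string (column names here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder.master("local[*]").appName("ts").getOrCreate()
import spark.implicits._

val df = Seq("2016-08-03 00:36:25").toDF("raw")
  .select(col("raw").cast("timestamp").as("ts"))

// Choose the on-disk representation yourself rather than relying on
// the writer's default timestamp rendering: epoch seconds as a long,
// or an explicitly formatted string.
val out = df.select(
  col("ts").cast("long").as("epoch_seconds"),
  date_format(col("ts"), "yyyy-MM-dd HH:mm:ss").as("ts_str"))
```

Casting a timestamp to `long` yields seconds since the epoch; if the saved value looks like the epoch seconds with extra trailing digits, it is worth checking whether the writer is emitting a sub-second representation (milliseconds or finer) rather than seconds.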

Changing Spark configuration midway through application.

2016-08-10 Thread Jestin Ma
If I run an application, for example with 3 joins: [join 1] [join 2] [join 3] [final join and save to disk] Could I change Spark properties in between each join? [join 1] [change properties] [join 2] [change properties] ... Or would I have to create a separate application with different proper
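As a partial answer sketch: in Spark 2.0, SQL runtime options can be changed between jobs through `spark.conf`, but core cluster properties are fixed once the SparkContext starts, so those would indeed need a separate application:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("conf").getOrCreate()

// SQL runtime options can be changed between jobs within one application...
spark.conf.set("spark.sql.shuffle.partitions", "400")
val p1 = spark.conf.get("spark.sql.shuffle.partitions")   // affects join 1

spark.conf.set("spark.sql.shuffle.partitions", "50")
val p2 = spark.conf.get("spark.sql.shuffle.partitions")   // affects join 2

// ...but core settings such as spark.executor.memory or executor core
// counts are read once at SparkContext startup and cannot be changed
// mid-application.
```

So the answer depends on which properties are involved: shuffle-partition-style SQL configs, yes; resource-allocation configs, no.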

Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Jestin Ma
If we want to use versions of Spark beyond the official 2.0.0 release, specifically on Maven + Java, what steps should we take to upgrade? I can't find the newer versions on Maven Central. Thank you! Jestin
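For context, released Spark versions are consumed with the usual dependency block below (artifact and version shown as an example of the pattern); versions such as 2.0.1/2.1.0 only appear on Maven Central once they are actually released, while pre-release builds live in Apache staging or snapshot repositories, not Central:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
```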

Using Kryo for DataFrames (Dataset)?

2016-08-07 Thread Jestin Ma
When using DataFrames (Dataset), there's no option for an Encoder. Does that mean DataFrames (since it builds on top of an RDD) uses Java serialization? Does using Kryo make sense as an optimization here? If not, what's the difference between Java/Kryo serialization, Tungsten, and Encoders? Thank
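A sketch of the distinction being asked about: Datasets of case classes use generated Encoders that write Tungsten's compact binary row format (no Java serialization), while `Encoders.kryo` exists as a fallback for types without a built-in encoder:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder.master("local[*]").appName("enc").getOrCreate()
import spark.implicits._

// Case classes get a generated ExpressionEncoder producing Tungsten's
// binary row format -- neither Java nor Kryo serialization is involved.
case class Point(x: Double, y: Double)
val ds = Seq(Point(1, 2), Point(3, 4)).toDS()

// For a class with no built-in encoder, Kryo can serve as a fallback;
// the rows are then stored as opaque binary blobs, so columnar layout
// and expression pushdown are lost.
class Opaque(val n: Int) extends Serializable
implicit val opaqueEnc: Encoder[Opaque] = Encoders.kryo[Opaque]
val ds2 = spark.createDataset(Seq(new Opaque(1), new Opaque(2)))
```

So Kryo is generally only worth reaching for when a type has no native encoder; for case classes and primitives, the built-in encoders are already the optimized path.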

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-02 Thread Jestin Ma
n't see why running on YARN is more important. Hope this makes my question clearer. On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski wrote: > On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma > wrote: > > Hi Nikolay, I'm looking at data locality improvements for Spark, and I >

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Jestin Ma
way, what can minimize transporting temporary > information between worker-nodes. > > Try google in this way "Data locality in Hadoop" > > > 2016-08-01 4:41 GMT+03:00 Jestin Ma : > >> It seems that the number of tasks being this large do not matter. Each >> task w

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Jestin Ma
> > On Jul 29, 2016, at 9:02 AM, Jestin Ma wrote: > > I am processing ~2 TB of hdfs data using DataFrames. The size of a task is > equal to the block size specified by hdfs, which happens to be 128 MB, > leading to about 15000 tasks. > > I'm using 5 worker nodes wit

Tuning level of Parallelism: Increase or decrease?

2016-07-29 Thread Jestin Ma
I am processing ~2 TB of hdfs data using DataFrames. The size of a task is equal to the block size specified by hdfs, which happens to be 128 MB, leading to about 15000 tasks. I'm using 5 worker nodes with 16 cores each and ~25 GB RAM. I'm performing groupBy, count, and an outer-join with another
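For the DataFrame shuffle stages (groupBy, join), the relevant knob is `spark.sql.shuffle.partitions`; the map-side task count comes from the 128 MB HDFS block size, but the post-shuffle parallelism is configurable. A sketch, with the partition count chosen by the common 2-3-tasks-per-core heuristic for 5 nodes x 16 cores = 80 cores (the heuristic, not the snippet, is the assumption here):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("par").getOrCreate()
import spark.implicits._

// ~2-3 tasks per core on 80 cores suggests roughly 160-240 shuffle
// partitions, instead of the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "240")

val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "v")
val counts = df.groupBy("id").count()     // shuffle stage uses 240 partitions
val n = counts.rdd.getNumPartitions
```

Whether to go higher or lower then depends on task duration: many very short tasks point to scheduling overhead (decrease), while long tasks that spill point to too-large partitions (increase).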

Spark Standalone Cluster: Having a master and worker on the same node

2016-07-27 Thread Jestin Ma
Hi, I'm doing performance testing and currently have 1 master node and 4 worker nodes and am submitting in client mode from a 6th cluster node. I know we can have a master and worker on the same node. Speaking in terms of performance and practicality, is it possible/suggested to have another worki

Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Jestin Ma
I know that SparkSession is replacing the SQL and HiveContexts, but what about SparkConf and SparkContext? Are those still relevant in our programs? Thank you! Jestin
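A short sketch of how the pieces relate in 2.0: SparkSession subsumes SQLContext/HiveContext, but SparkConf and SparkContext still exist underneath it and remain reachable when the RDD API or low-level configuration is needed:

```scala
import org.apache.spark.sql.SparkSession

// The builder sets SparkConf entries; getOrCreate starts (or reuses)
// the underlying SparkContext.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("session")
  .config("spark.ui.enabled", "false")
  .getOrCreate()

val sc   = spark.sparkContext        // the underlying SparkContext
val conf = sc.getConf                // the underlying SparkConf
val rdd  = sc.parallelize(1 to 3)    // RDD API still goes through sc
```

So in 2.0 code the session is usually the only object created explicitly, with the context and conf obtained from it when required.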

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
Also, sorry for the repeated updates, I checked my master UI and it says no drivers are running but 1 application is running. On Tue, Jul 26, 2016 at 6:49 AM, Jestin Ma wrote: > I did netstat -apn | grep 4040 on machine 6, and I see > > tcp0

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
I did netstat -apn | grep 4040 on machine 6, and I see tcp 0 0 :::4040 :::* LISTEN 30597/java What does this mean? On Tue, Jul 26, 2016 at 6:47 AM, Jestin Ma wrote: > I do not deploy using cluster mode and I don't use EC2. > > I just read

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
out where the driver runs and use the machine's IP. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Jul 26,

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
heck in master node by using netstat -apn | grep 4040 > > > > > On Jul 26, 2016, at 8:21 AM, Jestin Ma > wrote: > > > > Hello, when running spark jobs, I can access the master UI (port 8080 > one) no problem. However, I'm confused as to how to access the

Spark Web UI port 4040 not working

2016-07-25 Thread Jestin Ma
Hello, when running spark jobs, I can access the master UI (port 8080 one) no problem. However, I'm confused as to how to access the web UI to see jobs/tasks/stages/etc. I can access the master UI at http://:8080. But port 4040 gives me a "connection cannot be reached" error. Is the web UI http:// with

Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Jestin Ma
Hello, Right now I'm using DataFrames to perform a df1.groupBy(key).count() on one DataFrame and join with another, df2. The first, df1, is very large (many gigabytes) compared to df2 (250 Mb). Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each. It is taking me abou
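One sketch of a common shape for this workload: aggregate the large side first so the join input shrinks, then broadcast the small side. Note that 250 MB is well above the default 10 MB `spark.sql.autoBroadcastJoinThreshold`, so the broadcast must be hinted explicitly (as below) or the threshold raised; the toy data and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.master("local[*]").appName("agg").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "a"), (1, "b"), (2, "c")).toDF("key", "v")  // large side
val df2 = Seq((1, "meta1"), (2, "meta2")).toDF("key", "meta") // small side

// groupBy/count shrinks df1 to one row per key; broadcasting df2 then
// avoids shuffling that grouped output for the join.
val counts = df1.groupBy("key").count()
val joined = counts.join(broadcast(df2), "key")
```

If the broadcast of a 250 MB table strains executor memory, the alternative is a regular shuffle join with a tuned `spark.sql.shuffle.partitions`.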

Can Spark Dataframes preserve order when joining?

2016-06-29 Thread Jestin Ma
If it’s not too much trouble, could I get some pointers/help on this? (see link) http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order -also, as a side question, do Datafra
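For what the linked question asks: a shuffle join gives no row-ordering guarantee, since partitioning and exchange reorder data, so the usual remedy is to re-impose the order explicitly after the join. A minimal sketch with toy data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("order").getOrCreate()
import spark.implicits._

val left  = Seq((3, "c"), (1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "r")

// The join result's row order is unspecified; any required ordering
// must be re-established explicitly afterwards.
val joined = left.join(right, "id").orderBy("id")
val ids = joined.select("id").as[Int].collect()
```

If the order matters only per output file rather than globally, `sortWithinPartitions` avoids the cost of a total sort.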