Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I removed export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" and it works, THANK YOU!! Regards Arthur On 23 Oct, 2014, at 1:00 pm, Shao, Saisai wrote: > Seems you just add snappy library into your classpath: > > export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT

Re: About "Memory usage" in the Spark UI

2014-10-22 Thread Patrick Wendell
It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang wrote: > Hi, please take a look at the attached screen-shot. I wonder what the > "Memory Used" column means. > > > > I give 2GB me
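A minimal sketch of how that storage memory gets populated, assuming a Spark shell with a SparkContext sc and a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    // Nothing shows up under "Memory Used" until an RDD is cached and materialized.
    val lines = sc.textFile("hdfs:///some/path")   // hypothetical path
    lines.persist(StorageLevel.MEMORY_ONLY)        // equivalent to .cache()
    lines.count()                                  // the action forces the blocks to be stored
    // The "Memory Used" column now reports the space taken by these RDD blocks.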

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Patrick Wendell
Hey Ryan, I've found that filing issues with the Scala/Typesafe JIRA is pretty helpful if the issue can be fully reproduced, and even sometimes helpful if it can't. You can file bugs here: https://issues.scala-lang.org/secure/Dashboard.jspa The Spark SQL code in particular is typically the sourc

Re: how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Mohit Jaggi
I think you can give a list of jars - not just one - to spark-submit, so build only the one whose source code has changed. On Wed, Oct 22, 2014 at 10:29 PM, Yang wrote: > during tests, I often modify my code a little bit and want to see the > result. > but spark-submit requires the full fat-jar,

Re: spark 1.1.0/yarn hang

2014-10-22 Thread Tian Zhang
We have narrowed this hanging issue down to the calliope package that we used to create an RDD from reading a cassandra table. The calliope native RDD interface seems to hang, so I have decided to switch to the calliope cql3 RDD interface. -- View this message in context: http://apache-spark-user-l

how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Yang
during tests, I often modify my code a little bit and want to see the result. but spark-submit requires the full fat-jar, which takes quite a lot of time to build. I just need to run in --master local mode. is there a way to run it without rebuilding the fat jar? thanks Yang

About "Memory usage" in the Spark UI

2014-10-22 Thread Haopu Wang
Hi, please take a look at the attached screen-shot. I wonder what the "Memory Used" column means. I give 2GB memory to the driver process and 12GB memory to the executor process. Thank you!

scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Ryan Williams
I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here is a full gist of an innocuous test command (mvn test -Dsuites='*Kafka

RE: Spark Hive Snappy Error

2014-10-22 Thread Shao, Saisai
Seems you just add snappy library into your classpath: export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" But Spark itself depends on snappy-0.2.jar. Is there any possibility that this problem is caused by a different version of snappy? Thanks Jerry From: arthur.hk.c...@gma

Re: Spark as key/value store?

2014-10-22 Thread Hajime Takase
Thanks! On Thu, Oct 23, 2014 at 10:56 AM, Akshat Aranya wrote: > Yes, that is a downside of Spark's design in general. The only way to > share data across consumers of data is by having a separate entity that > owns the Spark context. That's the idea behind Ooyala's job server. The > driver is s

Workaround for SPARK-1931 not compiling

2014-10-22 Thread Arpit Kumar
Hi all, I am new to spark/graphx and am trying to use partitioning strategies in graphx on spark 1.0.0. The workaround I saw on the main page does not seem to compile. The code I added was def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val numP
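For reference, a completed sketch of that workaround; the imports and the ClassTag bound are additions that are often needed for the pair-RDD implicits to resolve, though they may or may not address the exact compile error seen here:

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.graphx.{Edge, PartitionStrategy}

    // Repartition an edge RDD according to a PartitionStrategy, without Graph.partitionBy
    def partitionBy[ED: ClassTag](edges: RDD[Edge[ED]],
                                  partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = {
      val numPartitions = edges.partitions.size
      edges
        .map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e))
        .partitionBy(new HashPartitioner(numPartitions))
        .mapPartitions(_.map(_._2), preservesPartitioning = true)
    }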

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi May I know where to configure Spark to load libhadoop.so? Regards Arthur On 23 Oct, 2014, at 11:31 am, arthur.hk.c...@gmail.com wrote: > Hi, > > Please find the attached file. > > > > > my spark-default.xml > # Default system properties included when running spark-submit. > # This is

hive timestamp column always returns null

2014-10-22 Thread tridib
Hello Experts, I created a table using the spark-sql CLI. No Hive is installed. I am using spark 1.1.0. create table date_test(my_date timestamp) row format delimited fields terminated by ' ' lines terminated by '\n' LOCATION '/user/hive/date_test'; The data file has the following data: 2014-12-11 00:00:

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, Please find the attached file.

SparkSQL , best way to divide data into partitions?

2014-10-22 Thread raymond
Hi, I have a json file that can be loaded by sqlContext.jsonFile into a table, but this table is not partitioned. I then wish to transform this table into a partitioned table, say on the field "date" etc. What would be the best approach to do this? It seems in hive this is usually done

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Aaron Davidson
Another wild guess, if your data is stored in S3, you might be running into an issue where the default jets3t properties limits the number of parallel S3 connections to 4. Consider increasing the max-thread-counts from here: http://www.jets3t.org/toolkit/configuration.html. On Tue, Oct 21, 2014 at

Re: version mismatch issue with spark breeze vector

2014-10-22 Thread Holden Karau
Hi Yang, It looks like your build file is a different version than the version of Spark you are running against. I'd try building against the same version of spark as you are running your application against (1.1.0). Also what is your assembly/shading configuration for your build? Cheers, Holden
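A sketch of what such a build definition might look like so the application is compiled against the same Spark version it runs on; the names and versions here are illustrative, not the poster's actual build:

    // build.sbt (hypothetical)
    name := "breeze-test"
    scalaVersion := "2.10.4"
    // Match the cluster's Spark version and mark it "provided" so the assembly
    // does not bundle a second, conflicting copy of Spark/Breeze.
    libraryDependencies += "org.apache.spark" %% "spark-core"  % "1.1.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided"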

version mismatch issue with spark breeze vector

2014-10-22 Thread Yang
I'm trying to submit a simple test code through spark-submit. first portion of the code works fine, but some calls to breeze vector library fails: 14/10/22 17:36:02 INFO CacheManager: Partition rdd_1_0 not found, computing it 14/10/22 17:36:02 ERROR Executor: Exception in task 0.0 in stage 0.0 (TI

Spark: Order by Failed, java.lang.NullPointerException

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I got java.lang.NullPointerException. Please help! sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 10").collect().foreach(println); 2014-10-23 08:20:12,024 INFO [sparkDriver-akka.actor.default-dispatcher-3

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I have managed to resolve it; it was caused by a wrong setting. Please ignore this. Regards Arthur On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com wrote: > > 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up >

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Andy Davidson
On a related note, how are you submitting your job? I have a simple streaming proof of concept and noticed that everything runs on my master. I wonder if I do not have enough load for spark to push tasks to the slaves. Thanks Andy From: Daniel Mahler Date: Monday, October 20, 2014 at 5:22 P

Re: Shuffle issues in the current master

2014-10-22 Thread Aaron Davidson
You may be running into this issue: https://issues.apache.org/jira/browse/SPARK-4019 You could check by having 2000 or fewer reduce partitions. On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai wrote: > PS, sorry for spamming the mailing list. Based my knowledge, both > spark.shuffle.spill.compress and
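A quick way to test that hypothesis is to cap the reduce-side partition count explicitly; a sketch, where pairs stands in for the shuffled pair RDD:

    // Keeping numPartitions at or below 2000 avoids the code path affected by SPARK-4019.
    // Useful as a diagnostic only, not as a long-term fix.
    val counts = pairs.reduceByKey(_ + _, 2000)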

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
I modified the pom files in my private repo to use those parameters as default to solve the problem. But after the deployment, I found the installed version is not the customized version, but an official one. Could anyone give a hint on how spark-ec2 works with Spark from private repos? --

Re: Solving linear equations

2014-10-22 Thread Debasish Das
Hi Martin, This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;] and x is what you want to find. B is 0 in this case. For MLlib this is normally the label; basically create a LabeledPoint where the label is always 0. Use MLlib's linear regression and solve the following problem
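A rough sketch of that suggestion in Scala; the matrix rows below are made up for illustration:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Each row of A becomes the feature vector of a LabeledPoint with label 0,
    // so the regression searches for an x with A*x close to 0.
    val rows = Seq(
      Vectors.dense(2.0, 1.0, 3.0),
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(0.0, 1.0, 0.0))
    val data = sc.parallelize(rows.map(v => LabeledPoint(0.0, v)))
    val model = LinearRegressionWithSGD.train(data, 100)   // 100 iterations
    println(model.weights)                                  // candidate solution x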

Solving linear equations

2014-10-22 Thread Martin Enzinger
Hi, I'm wondering how to use MLlib for solving equation systems following this pattern: 2*x1 + x2 + 3*x3 + ... + xn = 0; x1 + 0*x2 + 3*x3 + ... + xn = 0; ...; 0*x1 + x2 + 0*x3 + ... + xn = 0. I definitely still have some reading to do to really understand the direct solving techn

Re: SchemaRDD Convert

2014-10-22 Thread Yin Huai
The implicit conversion function mentioned by Hao is createSchemaRDD in SQLContext/HiveContext. You can import it by doing val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Or new org.apache.spark.sql.hive.HiveContext(sc) for HiveContext import sqlContext.createSchemaRDD On Wed, Oct 2
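A small end-to-end sketch of that import, assuming Spark 1.1 and a trivial case class:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // enables RDD[Person] => SchemaRDD

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    people.registerTempTable("people")  // the implicit conversion kicks in here
    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)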

Re: Setting only master heap

2014-10-22 Thread Sameer Farooqui
Hi Keith, Would be helpful if you could post the error message. Are you running Spark in Standalone mode or with YARN? In general, the Spark Master is only used for scheduling and it should be fine with the default setting of 512 MB RAM. Is it actually the Spark Driver's memory that you intende

Re: Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Michael Armbrust
The JDBC server is what you are looking for: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Wed, Oct 22, 2014 at 11:10 AM, Sadhan Sood wrote: > We want to run multiple instances of spark sql cli on our yarn cluster. > Each instance of the cli is

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
Sorry, there is not; you can try cloning it from GitHub and building it from scratch, see [1] [1] https://github.com/apache/spark Davies On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier wrote: > Can’t install that on our cluster, but I can try locally. Is there a > pre-built binary available? > > On 22

Re: Does SQLSpark support Hive built in functions?

2014-10-22 Thread Michael Armbrust
Yes, when using a HiveContext. On Wed, Oct 22, 2014 at 2:20 PM, shahab wrote: > Hi, > > I just wonder if SparkSQL supports Hive built-in functions (e.g. > from_unixtime) or any of the functions pointed out here : ( > https://cwiki.apache.org/confluence/display/Hive/Tutorial) > > best, > > /Shaha
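For example, something along these lines should resolve the Hive UDF through a HiveContext (the table name and columns are hypothetical):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // from_unixtime and the other built-ins are looked up via Hive's function registry
    hiveContext.sql("SELECT from_unixtime(ts), value FROM events LIMIT 10")
      .collect().foreach(println)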

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen wrote: > No, there's no such thing as an RDD of RDDs in Spark. > Here though, why not just operate on an RDD of Lists? or a List of RDDs? > Usually one of these two is the right approach whenever you feel > inclined to operate on an RDD of RDDs.

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar wrote: >> That's not something you might want to do usually. In general, a >> SparkContext maps to a user application > > My question was basically this. In this page in the official doc, under > "Scheduling within an application" section, it talks a

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu wrote: > In the master, you can easily profile you job, find the bottlenecks, > see https://github.com/apache/spark/pull/2556 > > Could you try it and show the s

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas wrote: > On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT > wrote: > > > > Wild guess maybe, but do you decode the json records in Python ? it could be > much slower as the default lib is quite slow. > > > Oh yea

Does SQLSpark support Hive built in functions?

2014-10-22 Thread shahab
Hi, I just wonder if SparkSQL supports Hive built-in functions (e.g. from_unixtime) or any of the functions pointed out here : ( https://cwiki.apache.org/confluence/display/Hive/Tutorial) best, /Shahab

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Thanks Marcelo, that was helpful ! I had some follow up questions : That's not something you might want to do usually. In general, a > SparkContext maps to a user application My question was basically this. In this page in the official doc

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I just tried sample PI calculation on Spark Cluster, after returning the Pi result, it shows ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m37,35662) not found ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://m33:7077 --e

Re: Rdd of Rdds

2014-10-22 Thread Sonal Goyal
Another approach could be to create artificial keys for each RDD and convert to PairRDDs. So your first RDD becomes a JavaPairRDD rdd1 with values 1,"1"; 1,"2" and so on. The second RDD becomes rdd2 with values 2,"a"; 2,"b"; 2,"c". You can union the two RDDs, groupByKey, countByKey etc. and maybe achieve what yo
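In Scala that idea looks roughly like this, using the toy values from above:

    // Tag each RDD with an artificial key, then treat them as one pair RDD
    val rdd1 = sc.parallelize(Seq("1", "2")).map(v => (1, v))
    val rdd2 = sc.parallelize(Seq("a", "b", "c")).map(v => (2, v))

    val combined = rdd1.union(rdd2)
    combined.groupByKey().collect().foreach(println)   // roughly (1, [1, 2]) and (2, [a, b, c])
    println(combined.countByKey())                     // Map(1 -> 2, 2 -> 3)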

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
PS, sorry for spamming the mailing list. Based on my knowledge, both spark.shuffle.spill.compress and spark.shuffle.compress default to true, so in theory, we should not run into this issue if we don't change any setting. Is there any other bug we run into? Thanks. Sincerely, DB Tsai --

Setting only master heap

2014-10-22 Thread Keith Simmons
We've been getting some OOMs from the spark master since upgrading to Spark 1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the worker heap, which as far as I know is fine. Is there any setting which *only* increases the master heap size? Keith

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
Or can it be solved by setting both of the following settings to true for now? spark.shuffle.spill.compress true spark.shuffle.compress true Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
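For completeness, those properties can also be set programmatically; a minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-compress-test")               // hypothetical app name
      .set("spark.shuffle.compress", "true")             // both already default to true
      .set("spark.shuffle.spill.compress", "true")
    val sc = new SparkContext(conf)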

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
It seems that this issue should be addressed by https://github.com/apache/spark/pull/2890 ? Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Oct 22, 2014 at 11:54 AM, DB Ts

streaming join sliding windows

2014-10-22 Thread Josh J
Hi, How can I join neighbor sliding windows in spark streaming? Thanks, Josh

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
Hi Ashwin, Let me try to answer to the best of my knowledge. On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar wrote: > Here are my questions : > 1. Sharing spark context : How exactly multiple users can share the cluster > using same spark > context ? That's not something you might want to

Shuffle issues in the current master

2014-10-22 Thread DB Tsai
Hi all, With SPARK-3948, the exception in Snappy PARSING_ERROR is gone, but I've got another exception now. I've no clue about what's going on; has anyone run into a similar issue? Thanks. This is the configuration I use. spark.rdd.compress true spark.shuffle.consolidateFiles true spark.shuffle.manage

Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Hi Spark devs/users, One of the things we are investigating here at Netflix is if Spark would suit us for our ETL needs, and one of the requirements is multi-tenancy. I did read the official doc and the book, but I'm still not clear on certain t

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
Thanks Daniil! if I use --spark-git-repo, is there a way to specify the mvn command line parameters? like following mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package -- View this mess

Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Sadhan Sood
We want to run multiple instances of spark sql cli on our yarn cluster. Each instance of the cli is to be used by a different user. This looks non-optimal if each user brings up a different cli given how spark works on yarn by running executor processes (and hence consuming resources) on worker nod

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT wrote: Wild guess maybe, but do you decode the json records in Python ? it could > be much slower as the default lib is quite slow. > Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7 may be enough performance improvement

Re: Spark Streaming Applications

2014-10-22 Thread Sameer Farooqui
Hi Saiph, Patrick McFadin and Helena Edelson from DataStax taught a tutorial at NYC Strata last week where they created a prototype Spark Streaming + Kafka application for time series data. You can see the code here: https://github.com/killrweather/killrweather On Tue, Oct 21, 2014 at 4:33 PM,

Re: spark sql query optimization , and decision tree building

2014-10-22 Thread sanath kumar
Thank you very much, two more small questions: 1) val output = sqlContext.sql("SELECT * From people"). My output has 128 columns and a single row. How do I find which column has the maximum value in a single row using Scala? 2) As each row has 128 columns, how do I print each row into a text wh
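For question 1, a sketch that assumes all 128 columns are numeric doubles (output is the SchemaRDD above):

    val row = output.first()   // the single Row
    val (maxValue, maxIndex) =
      (0 until row.length).map(i => (row.getDouble(i), i)).maxBy(_._1)
    println("column " + maxIndex + " has the maximum value " + maxValue)

    // For question 2, each Row can be flattened to text, e.g.:
    // output.map(_.mkString("\t")).saveAsTextFile("/tmp/people_out")   // hypothetical path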

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
In the master, you can easily profile your job and find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try it and show the stats? Davies On Wed, Oct 22, 2014 at 7:51 AM, Marius Soutier wrote: > It’s an AWS cluster that is rather small at the moment, 4 worker nodes @ 28 > G

Re: create a Row Matrix

2014-10-22 Thread Yana Kadiyska
This works for me import org.apache.spark.mllib.linalg._ import org.apache.spark.mllib.linalg.distributed.RowMatrix val v1=Vectors.dense(Array(1d,2d)) val v2=Vectors.dense(Array(3d,4d)) val rows=sc.parallelize(List(v1,v2)) val mat=new RowMatrix(rows) val svd: SingularValueDecomposition[RowMat
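Completing that snippet, roughly (k = 2 is arbitrary here):

    import org.apache.spark.mllib.linalg._
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val v1 = Vectors.dense(Array(1d, 2d))
    val v2 = Vectors.dense(Array(3d, 4d))
    val rows = sc.parallelize(List(v1, v2))
    val mat = new RowMatrix(rows)
    val svd: SingularValueDecomposition[RowMatrix, Matrix] =
      mat.computeSVD(2, computeU = true)
    println(svd.s)   // singular values
    println(svd.V)   // right singular vectors as a local Matrix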

Re: saveasSequenceFile with codec and compression type

2014-10-22 Thread Holden Karau
Hi gpatcham, If you want to save as a sequence file with a custom compression type you can use saveAsHadoopFile along with setting "mapred.output.compression.type" on the JobConf. If you want to keep using saveAsSequenceFile, since the syntax is much nicer, you could also set that property on t
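A sketch of the saveAsHadoopFile variant, assuming an RDD[(Text, Text)] named rdd; the output path and codec are illustrative:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    // BLOCK compression; RECORD or NONE are the other options
    jobConf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)

    rdd.saveAsHadoopFile("/out/sequence",   // hypothetical output path
      classOf[Text], classOf[Text],
      classOf[SequenceFileOutputFormat[Text, Text]],
      jobConf, Some(classOf[GzipCodec]))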

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread Daniil Osipov
You can use --spark-version argument to spark-ec2 to specify a GIT hash corresponding to the version you want to checkout. If you made changes that are not in the master repository, you can use --spark-git-repo to specify the git repository to pull down spark from, which contains the specified comm

Re: Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-22 Thread Holden Karau
Hi Michael Campbell, Are you deploying against yarn or standalone mode? In yarn try setting the shell variables SPARK_EXECUTOR_MEMORY=2G in standalone try and set SPARK_WORKER_MEMORY=2G. Cheers, Holden :) On Thu, Oct 16, 2014 at 2:22 PM, Michael Campbell < michael.campb...@gmail.com> wrote: >

Re: Spark as key/value store?

2014-10-22 Thread Akshat Aranya
Spark, in general, is good for iterating through an entire dataset again and again. All operations are expressed in terms of iteration through all the records of at least one partition. You may want to look at IndexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365) that aims to improve poi

Re: Python vs Scala performance

2014-10-22 Thread Eustache DIEMERT
Wild guess maybe, but do you decode the json records in Python ? it could be much slower as the default lib is quite slow. If so try ujson [1] - a C implementation that is at least an order of magnitude faster. HTH [1] https://pypi.python.org/pypi/ujson 2014-10-22 16:51 GMT+02:00 Marius Soutier

Re: Rdd of Rdds

2014-10-22 Thread Sean Owen
No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an RDD of RDDs. On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini wrote: > Hello, >
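A small sketch of the first suggestion, one RDD whose elements are Lists:

    val rddOfLists = sc.parallelize(Seq(List("1", "2"), List("a", "b", "c")))

    // Operate per inner list...
    rddOfLists.map(_.size).collect().foreach(println)   // 2, 3

    // ...or flatten when the nesting isn't actually needed
    val flat = rddOfLists.flatMap(identity)
    println(flat.count())                               // 5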

Rdd of Rdds

2014-10-22 Thread Tomer Benyamini
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a "foreach" on an RDD of RDDs. Here's a java example: public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testapp"); sparkConf.setM

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
It’s an AWS cluster that is rather small at the moment, 4 worker nodes @ 28 GB RAM and 4 cores, but fast enough for the currently 40 Gigs a day. Data is on HDFS in EBS volumes. Each file is a Gzip-compress collection of JSON objects, each one between 115-120 MB to be near the HDFS block size. O

Spark as key/value store?

2014-10-22 Thread Hajime Takase
Hi, Is it possible to use Spark as a clustered key/value store (say, like redis-cluster or hazelcast)? Will it outperform them in write/read or other operations? My main urge is to use the same RDD from several different SparkContexts without saving to disk or using the spark-job server, but I'm curious if someone

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12") sc = SparkContext(appName=’app_name', conf = conf) but still taking as much time On 22.10.2014, at 14:17, Nicholas Chammas wrote: > Total guess without knowing anything about you

Re:

2014-10-22 Thread Ted Yu
See first section of http://spark.apache.org/community On Wed, Oct 22, 2014 at 7:42 AM, Margusja wrote: > unsubscribe > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...

Re: confused about memory usage in spark

2014-10-22 Thread Sean Owen
One thing to remember is that Strings are composed of chars in Java, which take 2 bytes each. The encoding of the text on disk on S3 is probably something like UTF-8, which takes much closer to 1 byte per character for English text. This might explain the factor of ~2 difference. On Wed, Oct 22, 2

[no subject]

2014-10-22 Thread Margusja
unsubscribe - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: confused about memory usage in spark

2014-10-22 Thread Akhil Das
You can enable rdd compression (*spark.rdd.compress*) also you can use MEMORY_ONLY_SER ( *sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER* *)* ) to reduce the rdd size in memory. Thanks Best Regards On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath
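Spelled out, that suggestion looks roughly like the following:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // spark.rdd.compress only applies to serialized storage levels such as MEMORY_ONLY_SER
    val conf = new SparkConf().setAppName("compressed-cache").set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val rdd = sc.sequenceFile[String, String]("s3n://somebucket/part-0")
      .persist(StorageLevel.MEMORY_ONLY_SER)
    println(rdd.count())   // materializes the serialized, compressed blocks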

confused about memory usage in spark

2014-10-22 Thread Darin McBeath
I have a PairRDD of type which I persist to S3 (using the following code). JavaPairRDD aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes());aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class, SequenceFileOutputFormat.class); class ConvertToWritableTypes implements PairFunct

Re: Python vs Scala performance

2014-10-22 Thread Arian Pasquali
Interesting thread Marius, Btw, I'm curious about your cluster size. How small is it in terms of RAM and cores? Arian 2014-10-22 13:17 GMT+01:00 Nicholas Chammas : > Total guess without knowing anything about your code: Do either of these > two notes from the 1.1.0 release notes >

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, FYI, I use snappy-java-1.0.4.1.jar Regards Arthur On 22 Oct, 2014, at 8:59 pm, Shao, Saisai wrote: > Thanks a lot, I will try to reproduce this in my local settings and dig into > the details, thanks for your information. > > > BR > Jerry > > From: arthur.hk.c...@gmail.com [mailto:

RE: Spark Hive Snappy Error

2014-10-22 Thread Shao, Saisai
Thanks a lot, I will try to reproduce this in my local settings and dig into the details, thanks for your information. BR Jerry From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] Sent: Wednesday, October 22, 2014 8:35 PM To: Shao, Saisai Cc: arthur.hk.c...@gmail.com; user Subject:

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
PS: Just to clarify my statement: >>Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. With "feared RDD operations on the driver" I meant to contrast an rdd action like rdd.collect that wou

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, Yes, I can always reproduce the issue: > about you workload, Spark configuration, JDK version and OS version? I ran SparkPI 1000 java -version java version "1.7.0_67" Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) cat /e

RE: spark sql query optimization , and decision tree building

2014-10-22 Thread Cheng, Hao
The “output” variable is actually a SchemaRDD, it provides lots of DSL API, see http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD 1) How to save result values of a query into a list ? [CH:] val list: Array[Row] = output.collect, however get 1M records into a

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
Total guess without knowing anything about your code: Do either of these two notes from the 1.1.0 release notes affect things at all? - PySpark now performs external spilling during aggregations. Old behavior can be restored by set

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
Thanks Matt, Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. My question still remains: Is it better to foreachRDD early in the process or do as much Dstream transformations before going
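To make the comparison concrete, a sketch of the two styles; lines is a DStream[String] and parse/isValid are hypothetical helpers:

    // Style 1: keep transforming the DStream and only drop to RDDs at the output step
    val parsed = lines.map(parse).filter(isValid)
    parsed.foreachRDD(rdd => rdd.foreachPartition(_.foreach(println)))

    // Style 2: drop into foreachRDD early and apply the same logic per RDD
    lines.foreachRDD { rdd =>
      rdd.map(parse).filter(isValid).foreachPartition(_.foreach(println))
    }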

RE: SchemaRDD Convert

2014-10-22 Thread Cheng, Hao
You needn't do anything, the implicit conversion should do this for you. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L103 https://github.com/apache/spark/blob/2ac40da3f9fa6d45a59bb45b41606f1931ac5e81/sql/catalyst/src/main/scala/org/apac

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not that... On 22.10.2014, at 13:02, Nicholas Chammas wrote: > What version of Spark are you running? Some recent changes to how PySpark > works relative to Scala Spark may explain things. > > PySpark should not be that mu

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
What version of Spark are you running? Some recent changes to how PySpark works relative to Scala Spark may explain things. PySpark should not be that much slower, not by a stretch. On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab wrote:

RE: Python vs Scala performance

2014-10-22 Thread Ashic Mahtab
I'm no expert, but looked into how the python bits work a while back (was trying to assess what it would take to add F# support). It seems python hosts a jvm inside of it, and talks to "scala spark" in that jvm. The python server bit "translates" the python calls to those in the jvm. The python

Python vs Scala performance

2014-10-22 Thread Marius Soutier
Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word count-like Scala jobs that essentially pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes to complete

SchemaRDD Convert

2014-10-22 Thread Dai, Kevin
Hi, ALL I have a RDD of case class T and T contains several primitive types and a Map. How can I convert this to a SchemaRDD? Best Regards, Kevin.

Re: rdd.checkpoint() : driver-side lineage??

2014-10-22 Thread harsh2005_7
Hi, I am no expert, but my best guess is that it's a 'closure' problem. Spark internally creates a closure over all the variables outside the map operation's scope that are used inside it, and it does a serialization check for the map task. Since class scala.util.Random is not serializable it th
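A common way around that, as a sketch, is to create the non-serializable object inside the task rather than capturing it from the driver (data is a hypothetical RDD[Double]):

    // Problematic: rng lives on the driver and gets pulled into the task closure
    // val rng = new scala.util.Random(42)
    // data.map(x => x + rng.nextDouble())

    // Works: each partition builds its own Random, so nothing non-serializable is shipped
    val result = data.mapPartitions { iter =>
      val rng = new scala.util.Random(42)
      iter.map(x => x + rng.nextDouble())
    }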

Re: Spark - HiveContext - Unstructured Json

2014-10-22 Thread Harivardan Jayaraman
For me inference is not an issue as compared to persistence. Imagine a Streaming application where the input is JSON whose format can vary from row to row and whose format I cannot pre-determine. I can use `sqlContext.jsonRDD` , but once I have the `SchemaRDD`, there is no way for me to update the