Re: How to group multiple row data ?

2015-04-30 Thread Bipin Nag
OK, consider the case where there are multiple event triggers for a given customer/vendor/product, like 1,1,2,2,3, arranged in the order of *event* *occurrence* (time stamp). So the output should be two groups, (1,2) and (1,2,3). The doublet would be the first occurrences of 1,2 and the triplet the later occurrences

Re: How to add a column to a spark RDD with many columns?

2015-04-30 Thread ayan guha
Do you have an RDD or a DataFrame? RDDs are kind of tuples. You can add a new column with a map. RDDs are immutable, so you will get another RDD. On 1 May 2015 14:59, "Carter" wrote: > Hi all, > > I have a RDD with *MANY *columns (e.g., *hundreds*), how do I add one more > column at the end of this
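A minimal sketch of the map-based approach (values are illustrative and the spark-shell's sc is assumed):

    val rdd = sc.parallelize(Seq((123, 523, 534), (536, 98, 1623)))
    // RDDs are immutable: map returns a new RDD with the extra column appended.
    val withExtra = rdd.map { case (a, b, c) => (a, b, c, a + b + c) }
    // With hundreds of columns it is easier to keep each row as an Array and append with :+ .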

Spark SQL ThriftServer Impersonation Support

2015-04-30 Thread Night Wolf
Hi guys, Trying to use the SparkSQL Thriftserver with the Hive metastore. It seems that Hive meta impersonation works fine (when running Hive tasks). However, when spinning up the SparkSQL Thrift server, impersonation doesn't seem to work... What settings do I need to enable impersonation? I've copied the sa

Error when saving as parquet to S3

2015-04-30 Thread Cosmin Cătălin Sanda
After repartitioning a DataFrame in Spark 1.3.0 I get a .parquet exception when saving to Amazon's S3. The data that I try to write is 10G. logsForDate .repartition(10) .saveAsParquetFile(destination) // <-- Exception here The exception I receive is: java.io.IOException: The file being wr

How to add a column to a spark RDD with many columns?

2015-04-30 Thread Carter
Hi all, I have a RDD with *MANY *columns (e.g., *hundreds*), how do I add one more column at the end of this RDD? For example, if my RDD is like below: 123, 523, 534, ..., 893 536, 98, 1623, ..., 98472 537, 89, 83640, ..., 9265 7297, 98364, 9, ..., 735 .. 29, 94, 956,

Re: Expert advice needed. (POC is at crossroads)

2015-04-30 Thread ๏̯͡๏
1) As I am limited to 12G and was doing a broadcast join (collect data and then publish), it was throwing OOM. The data size was 25G and the limit was 12G, hence I reverted back to a regular join. 2) I started using repartitioning; I started with 100 and am now trying 200. At the beginning it looked promisi

Re: casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Michael Armbrust
This looks like a bug. Mind opening a JIRA? On Thu, Apr 30, 2015 at 3:49 PM, Justin Yip wrote: > After some trial and error, using DataType solves the problem: > > df.withColumn("millis", $"eventTime".cast( > org.apache.spark.sql.types.LongType) * 1000) > > Justin > > On Thu, Apr 30, 2015 at 3:

Re: [SPAM] Customized Aggregation Query on Spark SQL

2015-04-30 Thread Wenlei Xie
Hi Zhan, How would this be achieved? Should the data be partitioned by name in this case? Thank you! Best, Wenlei On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang wrote: > One optimization is to reduce the shuffle by first aggregate locally > (only keep the max for each name), and then reduceByKe

Re: Enabling Event Log

2015-04-30 Thread Shixiong Zhu
"spark.history.fs.logDirectory" is for the history server. For Spark applications, they should use "spark.eventLog.dir". Since you commented out "spark.eventLog.dir", it will be "/tmp/spark-events". And this folder does not exits. Best Regards, Shixiong Zhu 2015-04-29 23:22 GMT-07:00 James King :

Re: How to install spark in spark on yarn mode

2015-04-30 Thread Shixiong Zhu
You don't need to install Spark. Just download or build a Spark package that matches your Yarn version. And ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. See instructions here: http://spark.apache.o

Re: Expert advice needed. (POC is at crossroads)

2015-04-30 Thread Sandy Ryza
Hi Deepak, I wrote a couple posts with a bunch of different information about how to tune Spark jobs. The second one might be helpful with how to think about tuning the number of partitions and resources? What kind of OOMEs are you hitting? http://blog.cloudera.com/blog/2015/03/how-to-tune-your

RE: Expert advice needed. (POC is at crossroads)

2015-04-30 Thread java8964
Really not an expert here, but try the following ideas: 1) I assume you are using YARN; then this blog is very good about resource tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ 2) If 12G is a hard limit in this case, then you have no option but to lower yo

Re: [SPAM] Customized Aggregation Query on Spark SQL

2015-04-30 Thread Zhan Zhang
One optimization is to reduce the shuffle by first aggregating locally (only keep the max for each name), and then reduceByKey. Thanks. Zhan Zhang On Apr 24, 2015, at 10:03 PM, ayan guha <guha.a...@gmail.com> wrote: Here you go t = [["A",10,"A10"],["A",20,"A20"],["A",30,"A30"],["B"
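A minimal sketch of the suggestion, using rows like the sample quoted above (the spark-shell's sc is assumed). reduceByKey already combines map-side, which gives the local pre-aggregation:

    val rows = sc.parallelize(Seq(("A", 10, "A10"), ("A", 20, "A20"), ("A", 30, "A30"), ("B", 5, "B5")))
    val maxPerName = rows
      .map { case (name, value, payload) => (name, (value, payload)) }
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b) // keeps only the max-valued row per name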

how to pass configuration properties from driver to executor?

2015-04-30 Thread Tian Zhang
Hi, We have a scenario as below and would like your suggestion. We have an app.conf file with propX=A as the default, built into the fat jar file that is provided to spark-submit. We have an env.conf file with propX=B that we would like spark-submit to take as input to overwrite the default and populate to both

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread ayan guha
And if I may ask, how long does it take in the HBase CLI? I would not expect Spark to improve the performance of HBase. At best Spark will push down the filter to HBase. So I would try to optimise any additional overhead like bringing data into Spark. On 1 May 2015 00:56, "Ted Yu" wrote: > bq. a single quer

Re: DataFrame filter referencing error

2015-04-30 Thread ayan guha
I think you need to put new in single quotes. My guess is the query showing up in the DB is like ...where status=new or ...where status="new" In either case MySQL assumes new is a column. What you need is the form below ...where status='new' You need to provide your quotes accordingly. Easiest way wo
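A minimal sketch of both forms, assuming a DataFrame df with a status column as in the thread:

    // String-expression form: quote the literal so it is not parsed as a column reference.
    val pending = df.filter("status = 'new'")
    // Column-expression form, which sidesteps SQL quoting entirely.
    val pending2 = df.filter(df("status") === "new")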

Spark Streaming Kafka Avro NPE on deserialization of payload

2015-04-30 Thread Todd Nist
I’m very perplexed with the following. I have a set of Avro-generated objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming job follows the receiver-based approach. I am encountering the below error when I attempt to deserialize the payload: 15/04/30 17:49:25 INFO MapOutputT

Re: casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
After some trial and error, using DataType solves the problem: df.withColumn("millis", $"eventTime".cast( org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip wrote: > Hello, > > I was able to cast a timestamp into long using > df.withColumn("millis",
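A minimal sketch of the working form from this thread (df and its eventTime column are assumed):

    import org.apache.spark.sql.types.LongType

    // Casting via the DataType object instead of the "long" string alias avoids the 1.3.1 parser issue.
    val withMillis = df.withColumn("millis", df("eventTime").cast(LongType) * 1000)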

casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
Hello, I was able to cast a timestamp into long using df.withColumn("millis", $"eventTime".cast("long") * 1000) in spark 1.3.0. However, this statement returns a failure with spark 1.3.1. I got the following exception: Exception in thread "main" org.apache.spark.sql.types.DataTypeException: Unsu

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Corey Nolet
A tad off topic, but could still be relevant. Accumulo's design is a tad different in the realm of being able to shard and perform set intersections/unions server-side (through seeks). I've got an adapter for Spark SQL on top of a document store implementation in Accumulo that accepts the push-dow

Re: DataFrame filter referencing error

2015-04-30 Thread Burak Yavuz
Is "new" a reserved word for MySQL? On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella < francesco.bigare...@gmail.com> wrote: > Do you know how I can check that? I googled a bit but couldn't find a > clear explanation about it. I also tried to use explain() but it doesn't > really help. > I st

Re: DataFrame filter referencing error

2015-04-30 Thread Francesco Bigarella
Do you know how I can check that? I googled a bit but couldn't find a clear explanation about it. I also tried to use explain() but it doesn't really help. I still find it unusual that I have this issue only for the equality operator but not for the others. Thank you, F On Wed, Apr 29, 2015 at 3:03

sap hana database load using jdbcRDD

2015-04-30 Thread Hafiz Mujadid
Hi! Can we load a HANA database table using Spark's JdbcRDD? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sap-hana-database-laod-using-jdbcRDD-tp22726.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
I terminated the old job and now start a new one. Currently, the Spark streaming job has been running for 2 hours and when I use lsof, I do not see many files related to the Spark job. BTW, I am running Spark streaming using local[2] mode. The batch is 5 seconds and it has around 50 lines to read

Re: JavaRDD> flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Sean Owen
You fundamentally want (half of) the Cartesian product so I don't think it gets a lot faster to form this. You could implement this on cogroup directly and maybe avoid forming the tuples you will filter out. I'd think more about whether you really need to do this thing, or whether there is anythin
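A minimal sketch of the half-Cartesian approach described above, assuming a keyed RDD and the spark-shell's sc (keys and values are illustrative):

    val keyed = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    // Form all ordered pairs, then keep the lexicographically increasing half so each
    // unordered pair appears exactly once.
    val pairs = keyed.cartesian(keyed).filter { case ((k1, _), (k2, _)) => k1 < k2 }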

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Cody Koeninger
Did you use lsof to see what files were opened during the job? On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay wrote: > The data ingestion is in outermost portion in foreachRDD block. Although > now I close the statement of jdbc, the same exception happened again. It > seems it is not related to the d

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
The data ingestion is in outermost portion in foreachRDD block. Although now I close the statement of jdbc, the same exception happened again. It seems it is not related to the data ingestion part. On Wed, Apr 29, 2015 at 8:35 PM, Cody Koeninger wrote: > Use lsof to see what files are actually b

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
Full Exception *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at VISummaryDataProvider.scala:37) failed in 884.087 s* *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap at VISummaryDataProvider.scala:37, took 1093.418249 s* 15/04/30 09:59:49 ERROR yarn

Re: JavaRDD> flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Dan DeCapria, CivicScience
Thought about it some more, and simplified the problem space for discussions: Given: JavaPairRDD c1; // c1.count() == 8000. Goal: JavaPairRDD,Tuple2> c2; // all lexicographical pairs Where: all lexicographic permutations on c1 :: (c1_i._1().compareTo(c1_j._1()) < 0) -> new Tuple2,Tuple2>(c1_i, c1

Re: HBase HTable constructor hangs

2015-04-30 Thread Ted Yu
Thanks for your confirmation. On Thu, Apr 30, 2015 at 10:17 AM, Tridib Samanta wrote: > You are right. After I moved from HBase 0.98.1 to 1.0.0 this problem got > solved. Thanks all! > > -- > Date: Wed, 29 Apr 2015 06:58:59 -0700 > Subject: Re: HBase HTable construct

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
My Spark Job is failing and i see == 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskr

RE: HBase HTable constructor hangs

2015-04-30 Thread Tridib Samanta
You are right. After I moved from HBase 0.98.1 to 1.0.0 this problem got solved. Thanks all! Date: Wed, 29 Apr 2015 06:58:59 -0700 Subject: Re: HBase HTable constructor hangs From: yuzhih...@gmail.com To: tridib.sama...@live.com CC: d...@ocirs.com; user@spark.apache.org Can you verify whether t

Re: Timeout Error

2015-04-30 Thread ๏̯͡๏
I am facing the same issue; do you have any solution? On Mon, Apr 27, 2015 at 9:43 PM, Deepak Gopalakrishnan wrote: > Hello All, > > I dug a little deeper and found this error : > > 15/04/27 16:05:39 WARN TransportChannelHandler: Exception in connection from > /10.1.0.90:40590 > java.io.IOExceptio

Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not threadsafe. I would recommend using HiveQL. On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wrote: > actually this is a sql parse exception, are you sure your sql is right? > > Sent from my iPhone > > > On 30 Apr 2015, at 18:50, "Haopu Wang" wrote: > > > > Hi, in a test
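A minimal sketch of the suggested switch to HiveQL (the spark-shell's sc is assumed and src is a hypothetical table):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // HiveContext parses queries with HiveQL by default, avoiding the SQLParser.
    val counts = hiveContext.sql("SELECT key, count(*) FROM src GROUP BY key")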

Re: spark with standalone HBase

2015-04-30 Thread Akshat Aranya
Looking at your classpath, it looks like you've compiled Spark yourself. Depending on which version of Hadoop you've compiled against (looks like it's Hadoop 2.2 in your case), Spark will have its own version of protobuf. You should try by making sure both your HBase and Spark are compiled against

JavaRDD> flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Dan DeCapria, CivicScience
I am trying to generate all (N-1)(N)/2 lexicographical 2-tuples from a glom() JavaPairRDD>. The construction of these initial Tuple2's JavaPairRDD space is well formed from case classes I provide it (AQ, AQV, AQQ, CT) and is performant; minimized code: SparkConf conf = new SparkConf() .se

Error when saving as parquet to S3

2015-04-30 Thread cosmincatalin
After repartitioning a DataFrame in Spark 1.3.0 I get a .parquet exception when saving to Amazon's S3. The data that I try to write is 10G. logsForDate .repartition(10) .saveAsParquetFile(destination) // <-- Exception here The exception I receive is: java.io.IOException: The file being wr

Re: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread shahab
Thanks Alex, but 482MB was just an example size, and I am looking for a generic approach to doing this without broadcasting. Any ideas? best, /Shahab On Thu, Apr 30, 2015 at 4:21 PM, Alex wrote: > 482 MB should be small enough to be distributed as a set of broadcast > variables. Then you can use loca

RE: External Application Run Status

2015-04-30 Thread Nastooh Avessta (navesta)
Thanks. Sounds reasonable, perhaps blocking on local fifo or a socket would do. Cheers, Nastooh Avessta ENGINEER.SOFTWARE ENGINEERING nave...@cisco.com Phone: +1 604 647 1527 Cisco Systems Limited 595 Burrard Street, Suite 2123

Re: SparkStream saveAsTextFiles()

2015-04-30 Thread anavidad
JavaDStream.saveAsTextFiles does not exist in the Spark Java API. If you want to persist every RDD in your JavaDStream you should do something like this: words.foreachRDD(new Function2, Time, Void>() { @Override public Void call(JavaRDD rddWords, Time arg1) throws Exception { rddWords.

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Ted Yu
bq. a single query on one filter criteria Can you tell us more about your filter ? How selective is it ? Which hbase release are you using ? Cheers On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale < siddharth.ub...@syncoms.com> wrote: > Hi, > > > > I want to use Spark as Query engine on HBase

real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Siddharth Ubale
Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark version 1.3, and followed the steps below on an HBase table with around 3.5 lakh (350,000) rows: 1. Mapped the Dataframe to the Hbase table. RDDCustomers maps to the hbase table which is used to create the Dataf

RE: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread Alex
482 MB should be small enough to be distributed as a set of broadcast variables. Then you can use local features of spark to process. -Original Message- From: "shahab" Sent: 4/30/2015 9:42 AM To: "user@spark.apache.org" Subject: is there anyway to enforce Spark to cache data in all w
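A minimal sketch of the broadcast approach (the RDD below is a stand-in for the Cassandra-backed data, and it is assumed the ~482 MB fits in driver memory):

    val rows = sc.parallelize(1 to 1000000)          // stand-in for the Cassandra-backed RDD
    val broadcastRows = sc.broadcast(rows.collect()) // collected once on the driver, shipped to every worker
    // Executors read broadcastRows.value inside map/filter closures instead of relying on cached partitions.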

Re: spark with standalone HBase

2015-04-30 Thread Ted Yu
The error indicates incompatible protobuf versions. Please take a look at 4.1.1 under http://hbase.apache.org/book.html#basic.prerequisites Cheers On Thu, Apr 30, 2015 at 3:49 AM, Saurabh Gupta wrote: > Now able to solve the issue by setting > > SparkConf sconf = *new* SparkConf().setAppName(“

Error when saving as parquet to S3 from Spark

2015-04-30 Thread Cosmin Cătălin Sanda
After repartitioning a *DataFrame* in *Spark 1.3.0* I get a *.parquet* exception when saving to *Amazon's S3*. The data that I try to write is 10G. logsForDate .repartition(10) .saveAsParquetFile(destination) // <-- Exception here The exception I receive is: java.io.IOException: The file

RE: How can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread Wang, Ningjun (LNG-NPV)
I think this will be slow because you have to do a group by and then a join (my table has 7 million records). I am looking for something like reduceByKey(), e.g. rdd.reduceByKey( (a, b) => if (a.timeStamp > b.timeStamp) a else b ) Does DataFrame have a similar thing? Of course I can convert

[Runing Spark Applications with Chronos or Marathon]

2015-04-30 Thread Aram Mkrtchyan
Hi, We want to have Marathon starting and monitoring Chronos, so that when a Chronos-based Spark job fails, Marathon automatically restarts it in the scope of Chronos. Will this approach work if we start Spark jobs as shell scripts from Chronos or Marathon?

is there anyway to enforce Spark to cache data in all worker nodes (almost equally) ?

2015-04-30 Thread shahab
Hi, I load data from Cassandra into Spark. The entire data is around 482 MB, and it is cached as TempTables in 7 tables. How can I enforce Spark to cache data in both worker nodes, not only in ONE worker (as in my case)? I am using spark "2.1.1" with spark-connector "1.2.0-rc3". I have small

Expert advice needed. (POC is at crossroads)

2015-04-30 Thread ๏̯͡๏
I am at a crossroads now and expert advice will help me decide what the next course of the project is going to be. Background: At our company we process tons of data to help build an experimentation platform. We fire more than 300 M/R jobs over petabytes of data; it takes 24 hours and does lots of joins. Its si

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
Did not work. Same problem. On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das wrote: > You could try increasing your heap space explicitly. like export > _JAVA_OPTIONS="-Xmx10g", its not the correct approach but try. > > Thanks > Best Regards > > On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wro

Re: Compute pairwise distance

2015-04-30 Thread Driesprong, Fokko
Thank you guys for the input. Ayan, I am not sure how this can be done using reduceByKey; as far as I can see (but I am not so advanced with Spark), this requires a groupByKey, which can be very costly. What would be nice is to transform the dataset which contains all the vectors like: val localData
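One possible shape for this, sketched purely as an illustration (not taken from the thread): pairwise Euclidean distances via cartesian plus a filter, which avoids groupByKey. The ids and vector values below are made up, and the spark-shell's sc is assumed.

    val vectors = sc.parallelize(Seq((1L, Array(0.0, 1.0)), (2L, Array(3.0, 4.0)), (3L, Array(1.0, 1.0))))
    val distances = vectors.cartesian(vectors)
      .filter { case ((i, _), (j, _)) => i < j }       // each unordered pair once
      .map { case ((i, v1), (j, v2)) =>
        val d = math.sqrt(v1.zip(v2).map { case (a, b) => (a - b) * (a - b) }.sum)
        ((i, j), d)
      }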

Re: Is SQLContext thread-safe?

2015-04-30 Thread Wangfei (X)
Actually this is a SQL parse exception; are you sure your SQL is right? Sent from my iPhone > On 30 Apr 2015, at 18:50, "Haopu Wang" wrote: > > Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a > same SQLContext instance, but below exception is thrown, so it looks > like SQLContext is NOT t

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a same SQLContext instance, but below exception is thrown, so it looks like SQLContext is NOT thread safe? I think this is not the desired behavior. == java.lang.RuntimeException: [1.1] failure: ``insert'' expected but iden

Re: spark with standalone HBase

2015-04-30 Thread Saurabh Gupta
Now able to solve the issue by setting SparkConf sconf = *new* SparkConf().setAppName("App").setMaster("local") and conf.set("zookeeper.znode.parent", "/hbase-unsecure") Standalone hbase has a table 'test' hbase(main):001:0> scan 'test' ROW COLUMN+CELL row1

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Kyle Lin
Hi all Producing more data into Kafka is not effective in my situation, because the speed of reading Kafka is consistent. I will adopt Saisai's suggestion to add more receivers. Kyle 2015-04-30 14:49 GMT+08:00 Saisai Shao : > From the chart you pasted, I guess you only have one receiver with

Re: Spark pre-built for Hadoop 2.6

2015-04-30 Thread Sean Owen
Yes, there is now such a profile, though it is essentially redundant and doesn't configure things differently from 2.4, besides the Hadoop version of course, which is why it hadn't existed before (the 2.4 profile covers 2.4+). People just kept filing bugs to add it, but the docs are correct: you don't a

Spark pre-built for Hadoop 2.6

2015-04-30 Thread Christophe Préaud
Hi, I can see that there is now a pre-built Spark package for hadoop-2.6: http://apache.mirrors.ovh.net/ftp.apache.org/dist/spark/spark-1.3.1/ Does this mean that there is now a hadoop-2.6 profile, because it does not appear in the building-spark page: http://spark.apache.org/docs/latest/buildi

Re: local directories for spark running on yarn

2015-04-30 Thread Christophe Préaud
No, you should read: if spark.local.dir is specified, spark.local.dir will be ignored. This has been reworded (hopefully for the best) in 1.3.1: https://spark.apache.org/docs/1.3.1/running-on-yarn.html Christophe. On 17/04/2015 18:18, shenyanls wrote: > According to the documentation: > > The l

Re: How to install spark in spark on yarn mode

2015-04-30 Thread madhvi
Hi, you have to specify the worker nodes of the Spark cluster at the time of configuring the cluster. Thanks Madhvi On Thursday 30 April 2015 01:30 PM, xiaohe lan wrote: Hi Madhvi, If I only install spark on one node, and use spark-submit to run an application, which are the Worker no

Re: Performance advantage by loading data from local node over S3.

2015-04-30 Thread Akhil Das
If the data is too huge and is in S3, that'll be a lot of network traffic. Instead, if the data is available in HDFS (with proper replication available), then it will be faster, as most of the time data will be available as PROCESS_LOCAL/NODE_LOCAL to the executor. Thanks Best Regards On Wed, Apr

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-30 Thread Akhil Das
Have a look at KafkaRDD Thanks Best Regards On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg wrote: > Hi, > > I'm wondering about the use-case where you're not doing continuous, > incremental streaming o

Re: Starting httpd: http: Syntax error on line 154

2015-04-30 Thread Nelson Batalha
Same here (using vanilla spark-ec2), I believe this started when I moved from 1.3.0 to 1.3.1. Only thing unusual with the setup is a few extra security parameters on EC2 but was the same as in 1.3.0. Did you solve this problem? On Thu, Apr 2, 2015 at 8:02 AM, Ganon Pierce wrote: > Now I am recei

SparkContext.getCallSite is in the top of profiler by memory allocation

2015-04-30 Thread Igor Petrov
Hello, we send a lot of small jobs to Spark (up to 500 in a second). When profiling I see Throwable.getStackTrace() at the top of the memory profiler, which is caused by SparkContext.getCallSite - this is memory consuming. We use the Java API. I tried to call SparkContext.setCallSite("-") before submitti

Best strategy for Pandas -> Spark

2015-04-30 Thread Olivier Girardot
Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging everything for the final computation) - some taking on average 1 minute to complete and some taking more than 30 minutes. What would be f

Re: External Application Run Status

2015-04-30 Thread Akhil Das
One way you could try would be, Inside the map, you can have a synchronized thread and you can block the map till the thread finishes up processing. Thanks Best Regards On Wed, Apr 29, 2015 at 9:38 AM, Nastooh Avessta (navesta) < nave...@cisco.com> wrote: > Hi > > In a multi-node setup, I am in

Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Anshul Singhle
Hi Akhil, I discovered the reason for this problem. There was an issue with my deployment (Google Cloud Platform). My spark master was on a different "region" than the slaves. This resulted in huge scheduler delays. Thanks, Anshul On Thu, Apr 30, 2015 at 1:39 PM, Akhil Das wrote: > Does this s

Re: default number of reducers

2015-04-30 Thread Akhil Das
This is the Spark mailing list :/ Yes, you can configure the following in mapred-site.xml for that: mapred.tasktracker.map.tasks.maximum = 4 Thanks Best Regards On Tue, Apr 28, 2015 at 11:00 PM, Shushant Arora wrote: > In Normal MR job can I configure ( cluster wide) default number

Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Akhil Das
Does this speed up? val rdd = sc.parallelize(1 to 100, 30) rdd.count Thanks Best Regards On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle wrote: > Hi, > > I'm running the following code in my cluster (standalone mode) via spark > shell - > > val rdd = sc.parallelize(1 to 100) > rdd.count

Re: How to run customized Spark on EC2?

2015-04-30 Thread Akhil Das
This is how I used to do it: - Login to the ec2 cluster (master) - Make changes to Spark, and build it. - Stop the old installation of spark (sbin/stop-all.sh) - Copy old installation conf/* to modified version's conf/ - Rsync modified version to all slaves - do sbin/start-all.sh from the modi

Re: How to install spark in spark on yarn mode

2015-04-30 Thread xiaohe lan
Hi Madhvi, If I only install spark on one node, and use spark-submit to run an application, which are the Worker nodes? And where are the executors? Thanks, Xiaohe On Thu, Apr 30, 2015 at 12:52 PM, madhvi wrote: > Hi, > Follow the instructions to install on the following link: > http://mbonac

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread Akhil Das
You could try increasing your heap space explicitly, like export _JAVA_OPTIONS="-Xmx10g"; it's not the correct approach, but try it. Thanks Best Regards On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have a SparkApp that runs completes in 45 mins for 5 files (5*750MB > size) and it takes

Re: 1.3.1: Persisting RDD in parquet - "Conflicting partition column names"

2015-04-30 Thread Cheng Lian
Did you have a directory layout like this? base/ | data-file-2.parquet | batch_id=1/ | | data-file-1.parquet Cheng On 4/28/15 11:20 AM, sranga wrote: Hi I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in the 1.2 ve

Re: spark with standalone HBase

2015-04-30 Thread Saurabh Gupta
I am using hbase-0.94.8. On Wed, Apr 29, 2015 at 11:56 PM, Ted Yu wrote: > Can you enable HBase DEBUG logging in log4j.properties so that we can have > more clue ? > > What hbase release are you using ? > > Cheers > > On Wed, Apr 29, 2015 at 4:27 AM, Saurabh Gupta > wrote: > >> Hi, >> >> I am

Re: How to run self-build spark on EC2?

2015-04-30 Thread Akhil Das
You can replace your clusters(on master and workers) assembly jar with your custom build assembly jar. Thanks Best Regards On Tue, Apr 28, 2015 at 9:45 PM, Bo Fu wrote: > Hi all, > > I have an issue. I added some timestamps in Spark source code and built it > using: > > mvn package -DskipTests

RE: How can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread ayan guha
1. Do a group by and get the max. In your example: select id, Max(DT) from t group by id. Name this j. 2. Join t and j on id and DT=mxdt. This is how we used to query an RDBMS before window functions showed up. As I understand from SQL, group by allows you to do sum(), average(), max(), min(). But how do I select
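A minimal DataFrame sketch of steps 1 and 2 above (column names id/DT follow the example; the spark-shell's sqlContext is assumed):

    import org.apache.spark.sql.functions.max
    import sqlContext.implicits._

    val t = Seq((1, 10L), (1, 20L), (2, 5L)).toDF("id", "DT")
    val j = t.groupBy("id").agg(max("DT").as("mxdt"))                     // step 1: max(DT) per id
    val latest = t.join(j, t("id") === j("id") && t("DT") === j("mxdt"))  // step 2: join back on id and DT = mxdt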

Creating StructType with DataFrame.withColumn

2015-04-30 Thread Justin Yip
Hello, I would like to add a StructType to DataFrame. What would be the best way to do it? Not sure if it is possible using withColumn. A possible way is to convert the dataframe into a RDD[Row], add the struct and then convert it back to dataframe. But that seems an overkill. I guess I may have