Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread Saisai Shao
I think you have to use "604800" instead of "7 * 24 * 3600"; obviously SparkConf will not do the multiplication for you. The exception is quite obvious: "Caused by: java.lang.NumberFormatException: For input string: "3 * 24 * 3600"" 2015-06-16 14:52 GMT+08:00 : > Hi guys: > >I added a pa
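For reference, a minimal sketch of setting the TTL as a plain number of seconds, assuming a standalone worker configured through spark-env.sh:

  # spark-env.sh on each worker
  SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"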

Re: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Himanshu Mehra
Hi Shreesh, You can definitely decide how many partitions your data should break into by passing a 'minPartitions' argument in the method sc.textFile("input/path", minPartitions) and a 'numSlices' arg in the method sc.parallelize(localCollection, numSlices). In fact there is always an option to specif
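A quick sketch of the two parameters mentioned above (paths and counts are made up):

  val textRdd  = sc.textFile("input/path", 8)      // minPartitions: at least 8 partitions
  val localRdd = sc.parallelize(1 to 100000, 8)    // numSlices: exactly 8 partitions
  println(textRdd.partitions.length)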

Re: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Akhil Das
Each receiver will run on 1 core. So if your network is not the bottleneck then to test the consumption speed of the receivers you can simply do a *dstream.count.print* to see how many records it can receive. (Also it will be available in the Streaming tab of the driver UI). If you spawn 10 receive
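A minimal sketch of that throughput check, assuming a StreamingContext `ssc`; the socket source here is only a stand-in for the real MQ receiver:

  val stream = ssc.socketTextStream("some-host", 9999)   // hypothetical source
  stream.count().print()                                  // prints the record count of each batch
  ssc.start()
  ssc.awaitTermination()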

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
Good question. With fileStream or textFileStream, basically it will only take in the files whose timestamp is > the current timestamp

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X wrote: > Thanks very much, Akhil. > > That solved my problem. > > Best, > Rex > > > > On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das > wrote:

Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-16 Thread Nathan McCarthy
Hi all, Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no problems with Spark 1.3. When trying to save a data frame with 569610608 rows. dfc.write.format("parquet").save("/data/map_parquet_file") We get random results between runs. Caching the data frame in memor

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs? Thanks Best Regards On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden wrote: > My Mesos cluster has 1.5 CPU and 17GB free. If I set: > > conf.set("spark.mesos.coarse", "true"); > conf.set("spark.cores.max", "1"); > > in the SparkConf object

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Akhil Das
What's in your executor's (that .tgz file) conf/spark-defaults.conf file? Thanks Best Regards On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden wrote: > I'm loading these settings from a properties file: > spark.executor.memory=256M > spark.cores.max=1 > spark.shuffle.consolidateFiles=true > spark.task.c

Spark+hive bucketing

2015-06-16 Thread Marcin Szymaniuk
The Spark SQL documentation states: "Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet." What exactly does that mean? - that writing to a bucketed table won't respect this feature and data will be written in a non-bucketed manner? -

Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread luohui20001
Thanks Saisai, I should try more times. I thought it would be calculated automatically as the default. Thanks & Best regards! San.Luo - Original Message - From: Saisai Shao To: 罗辉 Cc: user Subject: Re: Spark Configuration of spark.worker.cleanup.appDataTtl Date: 2015

SparkR 1.4.0: read.df() function fails

2015-06-16 Thread esten
Hi, In SparkR shell, I invoke: > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", > header="false") I have tried various filetypes (csv, txt), all fail. RESPONSE: "ERROR RBackendHandler: load on 1 failed" BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFr

HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
Hi, I am using spark 1.4 and HiveContext to append data into a partitioned hive table. I found that the data insert into the table is correct, but the partition(folder) created is totally wrong. Below is my code snippet>> ---

Re: HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
I found if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16. juni 2015 11:14, patcharee wrote: Hi, I am using spark 1.4 and HiveContext to append data into a partitioned hive table. I found that the data insert into the tab
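A sketch of that workaround (column names are hypothetical; the point is that the partition columns come last in both the schema and each Row):

  import org.apache.spark.sql.types._
  val schema = StructType(Seq(
    StructField("value", DoubleType),    // data columns first
    StructField("year",  IntegerType),   // partition columns last, matching the Row field order
    StructField("month", IntegerType)))
  val df = hiveContext.createDataFrame(rowRdd, schema)
  df.write.partitionBy("year", "month").mode("append").saveAsTable("my_partitioned_table")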

Re: Limit Spark Shuffle Disk Usage

2015-06-16 Thread Himanshu Mehra
Hi Al M, You should try providing more main memory to the shuffle process and it might reduce the spill to disk. The default configuration for the shuffle memory fraction is 20% of the safe memory, which means 16% of the overall heap memory. So when we set executor memory, only a small fraction of it is used in th
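For illustration, a sketch of the Spark 1.x properties involved (the values are examples, not recommendations):

  val conf = new SparkConf()
    .set("spark.shuffle.memoryFraction", "0.4")   // default is 0.2
    .set("spark.storage.memoryFraction", "0.4")   // shrink the storage share to compensate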

Re: Spark standalone mode and kerberized cluster

2015-06-16 Thread Steve Loughran
On 15 Jun 2015, at 15:43, Borja Garrido Bear wrote: I tried running the job in a standalone cluster and I'm getting this: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authentic

cassandra with jdbcRDD

2015-06-16 Thread Hafiz Mujadid
hi all! is there a way to connect cassandra with jdbcRDD ?

Re: ALS predictALL not completing

2015-06-16 Thread Nick Pentreath
Which version of Spark are you using? On Tue, Jun 16, 2015 at 6:20 AM, afarahat wrote: > Hello; > I have a data set of about 80 Million users and 12,000 items (very sparse > ). > I can get the training part working no problem. (model has 20 factors), > However, when i try using Predict all for 8

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Gary Ogden
There isn't a conf/spark-defaults.conf file in the .tgz. There's a template file, but we didn't think we'd need one. I assumed using the defaults and anything we wanted to override would be in the properties file we load via --properties-file, as well as command line parms (--master etc). On 16

how to maintain the offset for spark streaming if HDFS is the source

2015-06-16 Thread Manohar753
Hi All, In my use case an HDFS file is the source for a Spark Stream. The job will process the data line by line, but how will it make sure to maintain the offset line number (data already processed) while restarting / pushing new code? Team, can you please reply on this: is there any configuration in Spark? Than

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Gary Ogden
On the master node, I see this printed over and over in the mesos-master.WARNING log file: W0615 06:06:51.211262 8672 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative Here's what I see in

Spark History Server pointing to S3

2015-06-16 Thread Gianluca Privitera
On the Spark website it's stated in the View After the Fact section (https://spark.apache.org/docs/latest/monitoring.html) that you can point the start-history-server.sh script to a directory in order to view the Web UI using the logs as the data source. Is it possible to point that script to S3? Maybe

Re: Spark History Server pointing to S3

2015-06-16 Thread Akhil Das
Not quite sure, but try pointing spark.history.fs.logDirectory to your S3 location. Thanks Best Regards On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera < gianluca.privite...@studio.unibo.it> wrote: > In Spark website it’s stated in the View After the Fact section ( > https://spark.apache.org/docs/
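A sketch of what that could look like, assuming the S3 credentials and filesystem jars are already available to the history server:

  # spark-defaults.conf read by start-history-server.sh
  spark.history.fs.logDirectory  s3n://my-bucket/spark-event-logs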

Re: how to maintain the offset for spark streaming if HDFS is the source

2015-06-16 Thread Akhil Das
With Spark Streaming, when you use fileStream or textFileStream it will always pick up the files from the directory whose timestamp is > the current timestamp, and if you have checkpointing enabled then it would start from the last read timestamp. So you may not need to maintain the line number. Tha
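A minimal sketch of a file-based stream with checkpointing enabled (paths and batch interval are made up):

  import org.apache.spark.streaming.{Seconds, StreamingContext}
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint("hdfs:///user/me/checkpoints")
  val lines = ssc.textFileStream("hdfs:///user/me/input")   // picks up files newer than the current time
  lines.foreachRDD(rdd => println(rdd.count()))
  ssc.start()
  ssc.awaitTermination()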

Re: Spark standalone mode and kerberized cluster

2015-06-16 Thread Borja Garrido Bear
Thank you for the answer, it doesn't seem to work either (I've not logged into the machine as the spark user, but run kinit inside the spark-env script), and I also tried inside the job. I've noticed when I run pyspark that the kerberos token is used for something, but this same behavior is not present w

RE: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Chaudhary, Umesh
Thanks Akhil for taking this point, I am also talking about the MQ bottleneck. I currently have 5 receivers with an unreliable Websphere MQ receiver implementation. Is there any proven way to convert this implementation to a reliable one? Regards, Umesh Chaudhary From: Akhil Das [mailto:ak...

stop streaming context of job failure

2015-06-16 Thread Krot Viacheslav
Hi all, Is there a way to stop the streaming context when some batch processing fails? I want to set a reasonable retries count, say 10, and if it still fails, stop the context completely. Is that possible?

RE: stop streaming context of job failure

2015-06-16 Thread Evo Eftimov
https://spark.apache.org/docs/latest/monitoring.html Also subscribe to the various Listeners for the various metrics types, e.g. job stats/statuses; this will allow you (in the driver) to decide when to stop the context gracefully (the listening and stopping can be done from a completely separa
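One possible sketch of that idea: count failed Spark jobs from the driver and stop the StreamingContext once a threshold is hit. The threshold, and stopping from a separate thread, are choices made here, not requirements; `ssc` is assumed to be the StreamingContext.

  import java.util.concurrent.atomic.AtomicInteger
  import org.apache.spark.scheduler.{JobFailed, SparkListener, SparkListenerJobEnd}

  val failures = new AtomicInteger(0)
  ssc.sparkContext.addSparkListener(new SparkListener {
    override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobEnd.jobResult match {
      case JobFailed(_) if failures.incrementAndGet() >= 10 =>
        // stop outside the listener thread so the event bus is not blocked
        new Thread(new Runnable {
          def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = false)
        }).start()
      case _ =>
    }
  })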

Re: ALS predictALL not completing

2015-06-16 Thread Ayman Farahat
This is 1.3.1 Ayman Farahat  --  View my research on my SSRN Author page:  http://ssrn.com/author=1594571  From: Nick Pentreath To: "user@spark.apache.org" Sent: Tuesday, June 16, 2015 4:23 AM Subject: Re: ALS predictALL not completing

The problem when share data inside Dstream

2015-06-16 Thread Shuai Zhang
Hello guys, I faced a problem where I cannot pass my data inside an RDD partition when trying to develop a Spark Streaming feature. I'm a newcomer to Spark; could you please give me any suggestions on this problem? The figure in the attachment is the code I used in my program. After I run my c

Re: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread shreesh
I realize that there are a lot of ways to configure my application in Spark. The part that is not clear is how I decide, for example, into how many partitions I should divide my data, how much RAM I should have, or how many workers one should initialize.

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Jon Walton
On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust wrote: > 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use >> 1.3.1 but SPARK-6967 prohibits me from doing so.Now that 1.4 is >> available, would any of the JOIN enhancements help this situation? >> > > I would try Spa

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Koert Kuipers
a skew join (where the dominant key is spread across multiple executors) is pretty standard in other frameworks, see for example in scalding: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala this would be a great addition to sp

RE: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Evo Eftimov
Best is by measuring and recording how the performance of your solution scales as the workload scales, recording as in "data point recording", and then you can do some time-series stat analysis and visualizations. For example you can start with a single box with e.g. 8 CPU cores. Use e.g. 1 or

Re: Spark History Server pointing to S3

2015-06-16 Thread Gianluca Privitera
It gives me an exception with org.apache.spark.deploy.history.FsHistoryProvider, a problem with the file system. I can reproduce the exception if you want. It works perfectly if I give a local path; I tested it on version 1.3.0. Gianluca On 16 Jun 2015, at 15:08, Akhil Das

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-16 Thread Sanjay Subramanian
Hi Josh It was great meeting u in person at the spark-summit SFO yesterday. Thanks for discussing potential solutions to the problem. I verified that 2 hive gateway nodes had not been configured correctly. My bad. I added hive-site.xml to the spark Conf directories for these 2 additional hive gat

Unit Testing Spark Transformations/Actions

2015-06-16 Thread Mark Tse
Hi there, I am looking to use Mockito to mock out some functionality while unit testing a Spark application. I currently have code that happily runs on a cluster, but fails when I try to run unit tests against it, throwing a "SparkException": org.apache.spark.SparkException: Job aborted due to
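One common pattern, sketched below: run the transformations against a local-mode SparkContext in the test, and reserve Mockito for the external collaborators (anything mocked and captured in a closure still has to be serializable):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
  val sc = new SparkContext(conf)
  try {
    val result = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()
    assert(result.sameElements(Array(2, 4, 6)))
  } finally {
    sc.stop()
  }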

Re: Creating RDD from Iterable from groupByKey results

2015-06-16 Thread nir
I updated the code sample so people can understand better what my inputs and outputs are.

HDFS not supported by databricks cloud :-(

2015-06-16 Thread Sanjay Subramanian
hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is not supported by Databricks cloud. My speed bottleneck is to transfer ~1TB of snapshot HDFS data (250+ external hive tables) to S3 :-( I want to use databricks cloud but this to me is a starting disabler. The ha

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Shivaram Venkataraman
The error you are running into is that the input file does not exist -- You can see it from the following line "Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json" Thanks Shivaram On Tue, Jun 16, 2015 at 1:55 AM, esten wrote: > Hi, > In SparkR shell, I invoke: > >

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Guru Medasani
Hi Esten, Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file path you specified is local? mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", The error below shows that the input path you specified does not exist on the cluster. Pointing to the righ

Spark on EMR

2015-06-16 Thread kamatsuoka
Spark is now officially supported on Amazon Elastic Map Reduce: http://aws.amazon.com/elasticmapreduce/details/spark/

Pyspark Dense Matrix Multiply : One of them can fit in Memory

2015-06-16 Thread afarahat
Hello, I would like to multiply two matrices, C = A * B, where A is m x k and B is k x l, with k and l much smaller than m so that B can easily fit in memory. Any ideas or suggestions how to do that in Pyspark? Thanks Ayman
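One common approach, sketched here in Scala (the same broadcast idea works from PySpark with sc.broadcast and numpy): broadcast the small matrix B and compute C row by row. The names aRows and b are hypothetical; A is assumed to be an RDD of (rowIndex, rowValues) pairs and B a local k x l array.

  val bB = sc.broadcast(b)                       // b: Array[Array[Double]], indexed b(i)(j)
  val cRows = aRows.mapValues { aRow =>          // aRows: RDD[(Long, Array[Double])]
    val bLocal = bB.value
    Array.tabulate(bLocal(0).length) { j =>      // one output row of C = row(A) * B
      var s = 0.0; var i = 0
      while (i < aRow.length) { s += aRow(i) * bLocal(i)(j); i += 1 }
      s
    }
  }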

Re: HDFS not supported by databricks cloud :-(

2015-06-16 Thread Simon Elliston Ball
You could consider using Zeppelin and spark on yarn as an alternative. http://zeppelin.incubator.apache.org/ Simon > On 16 Jun 2015, at 17:58, Sanjay Subramanian > wrote: > > hey guys > > After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is > not supported by Databr

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated! -Rex On Sun, Jun 14, 2015 at 10:05 AM, Rex X wrote: > For clustering analysis, we need a way to measure distances. > > When the data contains different levels of measurement - > *binary / categor

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Sujit Pal
Hi Rexx, In general (i.e. not Spark specific), it's best to convert categorical data to 1-hot encoding rather than integers; that way the algorithm doesn't use the ordering implicit in the integer representation. -sujit On Tue, Jun 16, 2015 at 1:17 PM, Rex X wrote: > Is it necessary to convert
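For illustration, a sketch of 1-hot encoding a single categorical column by hand into sparse vectors; the RDD and column name are hypothetical, and storing only the non-zero entry keeps the encoded size manageable:

  import org.apache.spark.mllib.linalg.Vectors

  val categories = records.map(_.color).distinct().collect()
  val index = categories.zipWithIndex.toMap
  val encoded = records.map { r =>
    Vectors.sparse(categories.length, Array(index(r.color)), Array(1.0))
  }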

spark-sql CLI options does not work --master yarn --deploy-mode client

2015-06-16 Thread Sanjay Subramanian
hey guys  I have CDH 5.3.3 with Spark 1.2.0 (on Yarn) This does not work /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql  --deploy-mode client --master yarn --driver-memory 1g -e "select j.person_id, p.first_name, p.last_name, count(*) from (select person_id from cdr.cdr_mjp_joborder where pers

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Hi Sujit, That's a good point. But 1-hot encoding will make our data grow from terabytes to petabytes, because we have tens of categorical attributes, and some of them contain thousands of categorical values. Is there any way to strike a good balance between data size and the right representation of cat

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
+cc user@spark.apache.org Reply inline. On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA) wrote: > Hi DB, > > Thank you for the reply. That explains a lot. > > I however had a few points regarding this:- > > 1. Just to help with the debate of not regularizing the b parameter. A > sta

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread nsalian
Hello, Is the json file in HDFS or local? "/home/esten/ami/usaf.json" is this an HDFS path? Suggestions: 1) Specify "file:/home/esten/ami/usaf.json" 2) Or move the usaf.json file into HDFS since the application is looking for the file in HDFS. Please let me know if that helps. Thank you. --

Suggestions for Posting on the User Mailing List

2015-06-16 Thread nsalian
As discussed during the meetup, the following information should help while creating a topic on the User mailing list. 1) Version of Spark and Hadoop should be included to help reproduce the issue or understand if the issue is a version limitation 2) Explanation about the scenario in as much deta

Re: Spark on EMR

2015-06-16 Thread ayan guha
That's great news. Can I assume spark on EMR supports kinesis to hbase pipeline? On 17 Jun 2015 05:29, "kamatsuoka" wrote: > Spark is now officially supported on Amazon Elastic Map Reduce: > http://aws.amazon.com/elasticmapreduce/details/spark/ > > > > -- > View this message in context: > http://

What happens when a streaming consumer job is killed then restarted?

2015-06-16 Thread dgoldenberg
I'd like to understand better what happens when a streaming consumer job (with direct streaming, but also with receiver-based streaming) is killed/terminated/crashes. Assuming it was processing a batch of RDD data, what happens when the job is restarted? How much state is maintained within Spark'

What is Spark's data retention policy?

2015-06-16 Thread dgoldenberg
What is Spark's data retention policy? As in, the jobs that are sent from the master to the worker nodes, how long do they persist on those nodes? What about the RDD data, how is that cleaned up? Are all RDD's cleaned up at GC time unless they've been .persist()'ed or .cache()'ed?

Custom Spark metrics?

2015-06-16 Thread dgoldenberg
I'm looking at the doc here: https://spark.apache.org/docs/latest/monitoring.html. Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and emit those? Can a custom metrics sink be defined? And, can such a sink collect some metrics, execute some metrics handling logic, then i

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Michael Armbrust
> > this would be a great addition to spark, and ideally it belongs in spark > core not sql. > I agree with the fact that this would be a great addition, but we would likely want a specialized SQL implementation for performance reasons.

Re: cassandra with jdbcRDD

2015-06-16 Thread Michael Armbrust
I would suggest looking at https://github.com/datastax/spark-cassandra-connector On Tue, Jun 16, 2015 at 4:01 AM, Hafiz Mujadid wrote: > hi all! > > > is there a way to connect cassandra with jdbcRDD ? > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabb
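A minimal sketch with the DataStax connector (keyspace, table and host are placeholders):

  import com.datastax.spark.connector._
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("cassandra-read")
    .set("spark.cassandra.connection.host", "cassandra-host")
  val sc = new SparkContext(conf)
  val rows = sc.cassandraTable("my_keyspace", "my_table")
  println(rows.first())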

Re: How to use DataFrame with MySQL

2015-06-16 Thread matthewrj
I just ran into this too. Thanks for the tip!

Submitting Spark Applications using Spark Submit

2015-06-16 Thread raggy
I am trying to submit a spark application using the command line. I used the spark-submit command for doing so. I initially set up my Spark application on Eclipse and have been making changes on there. I recently obtained my own version of the Spark source code and added a new method to RDD.scala. I

Unable to use more than 1 executor for spark streaming application with YARN

2015-06-16 Thread Saiph Kappa
Hi, I am running a simple spark streaming application on hadoop 2.7.0/YARN (master: yarn-client) with 2 executors in different machines. However, while the app is running, I can see on the app web UI (tab executors) that only 1 executor keeps completing tasks over time, the other executor only wor

Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Mohammad Tariq
I would really appreciate if someone could help me with this. On Monday, June 15, 2015, Mohammad Tariq wrote: > Hello list, > > The method *insertIntoJDBC(url: String, table: String, overwrite: > Boolean)* provided by Spark DataFrame allows us to copy a DataFrame into > a JDBC DB table. Similar

ClassNotFound exception from closure

2015-06-16 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue -- I have a ClassNotFound exception from a closure?! My code looks like this: val jRdd1 = table.map(cassRow=>{ val lst = List(cassRow.get[Option[Any]](0),cassRow.get[Option[Any]](1)) Row.fromSeq(lst) }) println(s"This one worked .

Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Yana Kadiyska
When all else fails look at the source ;) Looks like createJDBCTable is deprecated, but otherwise goes to the same implementation as insertIntoJDBC... https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala You can also look at DataFrameWriter in t
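For reference, a sketch of both routes in Spark 1.4 (URL, table and credentials are placeholders):

  import java.util.Properties

  // older API, appends when overwrite = false
  df.insertIntoJDBC("jdbc:mysql://host/db?user=u&password=p", "my_table", false)

  // DataFrameWriter introduced in 1.4
  val props = new Properties()
  props.setProperty("user", "u")
  props.setProperty("password", "p")
  df.write.mode("append").jdbc("jdbc:mysql://host/db", "my_table", props)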

Spark or Storm

2015-06-16 Thread asoni . learn
Hi All, I am evaluating Spark (Spark Streaming) vs Storm and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help on this will be appreciated. Thanks, Ashish

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
In general, you should avoid making direct changes to the Spark source code. If you are using Scala, you can seamlessly blend your own methods on top of the base RDDs using implicit conversions. Regards, Will On June 16, 2015, at 7:53 PM, raggy wrote: I am trying to submit a spark application
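A sketch of that pattern, adding a custom method to any RDD without touching Spark's source (the method itself is made up):

  import org.apache.spark.rdd.RDD

  object RDDExtensions {
    implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
      def countDistinct(): Long = rdd.distinct().count()
    }
  }

  // usage: import RDDExtensions._ and then sc.textFile("path").countDistinct()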

Re: Spark or Storm

2015-06-16 Thread Will Briggs
The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API, rather than with the raw Bolt API. Even then, there are signific

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Raghav Shankar
I made the change so that I could implement top() using treeReduce(). A member on here suggested I make the change in RDD.scala to accomplish that. Also, this is for a research project, and not for commercial use. So, any advice on how I can get the spark submit to use my custom built jars wou

questions on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-16 Thread Fang, Mike
Hi, I have a spark streaming program running for ~ 25hrs. When I check the Streaming UI tab. I found the "Waiting batches" is 144. But the "scheduling delay" is 0. I am a bit confused. If the "waiting batches" is 144, that means many batches are waiting in the queue to be processed? If this is

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
If this is research-only, and you don't want to have to worry about updating the jars installed by default on the cluster, you can add your custom Spark jar using the "spark.driver.extraLibraryPath" configuration property when running spark-submit, and then use the experimental " spark.driver.us

Re: Not getting event logs >= spark 1.3.1

2015-06-16 Thread Tsai Li Ming
Forgot to mention this is on standalone mode. Is my configuration wrong? Thanks, Liming On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming wrote: > Hi, > > I have this in my spark-defaults.conf (same for hdfs): > spark.eventLog.enabled true > spark.eventLog.dir file:/tmp/spark-e

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Raghav Shankar
The documentation says spark.driver.userClassPathFirst can only be used in cluster mode. Does this mean I have to set the --deploy-mode option for spark-submit to cluster? Or can I still use the default client? My understanding is that even in the default deploy mode, spark still uses the slave

Re: number of partitions in join: Spark documentation misleading!

2015-06-16 Thread Davies Liu
Please file a JIRA for it. On Mon, Jun 15, 2015 at 8:00 AM, mrm wrote: > Hi all, > > I was looking for an explanation on the number of partitions for a joined > rdd. > > The documentation of Spark 1.3.1. says that: > "For distributed shuffle operations like reduceByKey and join, the largest > num

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Yanbo Liang
If you run Spark on YARN, the simplest way is to replace the $SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run your application. The spark-submit script will upload this jar to the YARN cluster automatically and then you can run your application as usual. It does not care about w

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar, For "standardization", we can disable it effectively by using different regularization on each component. Thus, we're solving the same problem but having better rate of convergence. This is one of the features I will implement. Sincerely, DB Tsai

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Raghav Shankar
To clarify, I am using the spark standalone cluster. On Tuesday, June 16, 2015, Yanbo Liang wrote: > If you run Spark on YARN, the simplest way is replace the > $SPARK_HOME/lib/spark-.jar with your own version spark jar file and run > your application. > The spark-submit script will upload t

Re: Spark or Storm

2015-06-16 Thread ayan guha
I have a similar scenario where we need to bring data from kinesis to hbase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO. All env is on aws. Hbase is on a long running EMR and kinesis

Incorrect ACL checking for partitioned table in Spark SQL-1.4

2015-06-16 Thread Karthik Subramanian
*Problem Statement:* While doing a query on a partitioned table using Spark SQL (Version 1.4.0), an access denied exception is observed on the partition the user doesn't belong to (the user permission is controlled using HDFS ACLs). The same works correctly in Hive. *Usecase:* /To address Multitenancy/

Re: Spark or Storm

2015-06-16 Thread Spark Enthusiast
I have a use-case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: send events -> enrich

Re: Spark or Storm

2015-06-16 Thread Sateesh Kavuri
Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark streaming? During each phase of the data processing, the transformed data is stored to the database and this transformed data should the

Interpreting what gets printed as one submits spark application

2015-06-16 Thread shreesh
I am fairly new to spark. I configured 3 machines(2 slaves) on a standalone cluster. I just wanted to know what exactly is the meaning of: [Stage 0:==>(25 + 4) / 500] This gets printed to the terminal when I submit my app. I understand tha

Read/write metrics for jobs which use S3

2015-06-16 Thread Abhishek Modi
I mostly use Amazon S3 for reading input data and writing output data for my spark jobs. I want to know the numbers of bytes read & written by my job from S3. In hadoop, there are FileSystemCounters for this, is there something similar in spark ? If there is, can you please guide me on how to use

Re: Spark or Storm

2015-06-16 Thread Sabarish Sasidharan
Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic like Storm does. Regards Sab Probably overloading the
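To make that concrete, a sketch where the per-event logic that would live in a bolt's execute() sits in map/filter, and side effects happen at the end of the pipeline (parse and saveToDatabase are hypothetical functions):

  val events = ssc.socketTextStream("some-host", 9999)
  val enriched = events.map(parse).filter(_.isValid)
  enriched.foreachRDD { rdd =>
    rdd.foreachPartition(saveToDatabase)
  }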

Re: Spark or Storm

2015-06-16 Thread Enno Shioji
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There is also no exactly-once semantics. (updateStateByKey can achieve t

Shuffle produces one huge partition

2015-06-16 Thread Al M
I have 2 RDDs I want to Join. We will call them RDD A and RDD B. RDD A has 1 billion rows; RDD B has 100k rows. I want to join them on a single key. 95% of the rows in RDD A have the same key to join with RDD B. Before I can join the two RDDs, I must map them to tuples where the first element

Kerberos authentication exception when spark access hbase with yarn-cluster mode on a kerberos yarn Cluster

2015-06-16 Thread 马元文
Hi all, I have a question about Spark accessing HBase in yarn-cluster mode on a Kerberos YARN cluster. Is distributing the keytab to each NodeManager the only way to enable Spark to access HBase? It seems that Spark doesn't provide a delegation token like an MR job does, am I right?