Re: LDA prediction on new document

2015-05-22 Thread Dani Qiu
thanks, Ken but I am planning to use spark LDA in production. I cannot wait for the future release. At least, provide some workaround solution. PS : in SPARK-5567 , mentioned "This will require inference but should be able to use the same code,

Re: Reply: How to use spark to access HBase with Security enabled

2015-05-22 Thread Ted Yu
Can you post the modified code? Thanks > On May 21, 2015, at 11:11 PM, donhoff_h <165612...@qq.com> wrote: > > Hi, > > Thanks very much for the reply. I have tried the "SecurityUtil". I can see > from the log that this statement executed successfully, but I still can not pass > the au

Re: DataFrame Column Alias problem

2015-05-22 Thread SLiZn Liu
Despite the odd usage, it does the trick, thanks Reynold! On Fri, May 22, 2015 at 2:47 PM Reynold Xin wrote: > In 1.4 it actually shows col1 by default. > > In 1.3, you can add "col1" to the output, i.e. > > df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() > > > On Thu, May 21, 20
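For reference, a minimal sketch of that 1.3 workaround with the implicits import spelled out (the DataFrame and column names are illustrative):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._   // enables the $"col" syntax

    // Spark 1.3: re-add the grouping column to the aggregation output explicitly.
    df.groupBy($"col1")
      .agg($"col1", count($"col1").as("c"))
      .show()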

Spark Memory management

2015-05-22 Thread swaranga
Experts, this is an academic question. Since Spark runs on the JVM, how is it able to do things like offloading RDDs from memory to disk when the data cannot fit into memory? How are the calculations performed? Does it use the methods available in the java.lang.Runtime class to get free/available m

Reply: Reply: How to use spark to access HBase with Security enabled

2015-05-22 Thread donhoff_h
Hi, My modified code is listed below; it just adds the SecurityUtil API. I don't know which propertyKeys I should use, so I made 2 of my own propertyKeys to find the keytab and principal. object TestHBaseRead2 { def main(args: Array[String]) { val conf = new SparkConf() val sc = new SparkCont
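For context, a minimal sketch of pointing SecurityUtil.login at custom property keys (the key names, keytab path and principal below are illustrative, not the poster's actual values):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.SecurityUtil

    val hConf = new Configuration()
    // Hypothetical property keys -- any names work as long as set() and login() agree on them.
    hConf.set("my.hbase.keytab.file", "/etc/security/keytabs/spark.service.keytab")
    hConf.set("my.hbase.kerberos.principal", "spark/host.example.com@EXAMPLE.COM")
    // Logs the current process in from the keytab referenced by the two keys above.
    SecurityUtil.login(hConf, "my.hbase.keytab.file", "my.hbase.kerberos.principal")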

Re: Spark Memory management

2015-05-22 Thread Akhil Das
You can look at the logic for offloading data from memory by looking at the ensureFreeSpace call, and at dropFromMemory

Re: Spark Memory management

2015-05-22 Thread ??????
in spark src this class org.apache.spark.deploy.worker.WorkerArguments def inferDefaultCores(): Int = { Runtime.getRuntime.availableProcessors() } def inferDefaultMemory(): Int = { val ibmVendor = System.getProperty("java.vendor").contains("IBM") var totalMb = 0 try { val bean = Man
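For anyone curious, a small sketch of the JVM calls involved (the IBM-specific branch in the Spark source is omitted; getTotalPhysicalMemorySize is the com.sun.management extension used on HotSpot JVMs):

    import java.lang.management.ManagementFactory
    import com.sun.management.OperatingSystemMXBean

    val rt = Runtime.getRuntime
    println(s"cores=${rt.availableProcessors()} jvmMax=${rt.maxMemory()} jvmFree=${rt.freeMemory()}")

    // Total physical RAM on the box, as used by inferDefaultMemory() on non-IBM JVMs.
    val bean = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]
    println(s"physicalMemoryMb=${bean.getTotalPhysicalMemorySize / (1024 * 1024)}")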

MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread SparknewUser
I am new to MLlib and to Spark (I use Scala). I'm trying to understand how LogisticRegressionWithLBFGS and LogisticRegressionWithSGD work. I usually use R to do logistic regressions but now I do it on Spark to be able to analyze Big Data. The model only returns weights and intercept. My problem is

Spark with cassandra

2015-05-22 Thread lucas
Hello, I have an issue. I would like to save some data to Cassandra using Spark. First I loaded data from Elasticsearch into Spark and obtained this: org.elasticsearch.spark.rdd.ScalaEsRDD, which contains this kind of information (AU1rN9uN4PGB4YTCSXr7,Map(@timestamp -> 2015-05-19T08:08

Re: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Gautam Bajaj
This is just a friendly ping, just to remind you of my query. Also, is there a possible explanation/example on the usage of AsyncRDDActions in Java? On Thu, May 21, 2015 at 7:18 PM, Gautam Bajaj wrote: > I am receiving data at UDP port 8060 and doing processing on it using Spark > and storing t

Re: Official Docker container for Spark

2015-05-22 Thread ??????
The Spark source has a Dockerfile; you can use this. -- Original -- From: "tridib";; Date: Fri, May 22, 2015 03:25 AM To: "user"; Subject: Official Docker container for Spark Hi, I am using spark 1.2.0. Can you suggest docker containers which can be deployed

Spark Streaming and Drools

2015-05-22 Thread Antonio Giambanco
Hi All, I'm deploying an architecture that uses Flume to send log information to a sink. Spark Streaming reads from this sink (pull strategy) and processes all this information; during this process I would like to do some event processing. . . for example: Log appender writes information about all

Re: Official Docker container for Spark

2015-05-22 Thread Ritesh Kumar Singh
Use this: sequenceiq/docker Here's a link to their github repo: docker-spark They have repos for other big data tools too which are again really nice. It's being maintained properly by their devs and

Partitioning of Dataframes

2015-05-22 Thread Karlson
Hi, is there any way to control how Dataframes are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI. With PairRDDs you can control how the data is partitioned across nodes with partitionBy. There is no such method on Dataframes however. Ca

Re: Issues with constants in Spark HiveQL queries

2015-05-22 Thread Skanda
Hi All, I'm facing the same problem with Spark 1.3.0 from cloudera cdh 5.4.x. Any luck solving the issue? Exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Unsupported language features in query: select * from everest_marts_test.hive_ql_test where daily_partition=2015

Re: Issues with constants in Spark HiveQL queries

2015-05-22 Thread Skanda
Hi I was using the wrong version of the spark-hive jar. I downloaded the right version of the jar from the cloudera repo and it works now. Thanks, Skanda On Fri, May 22, 2015 at 2:36 PM, Skanda wrote: > Hi All, > > I'm facing the same problem with Spark 1.3.0 from cloudera cdh 5.4.x. Any > lu

Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Rok Roskar
on the worker/container that fails, the "file not found" is the first error -- the output below is from the yarn log. There were some python worker crashes for another job/stage earlier (see the warning at 18:36) but I expect those to be unrelated to this file not found error.

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
You can deploy and invoke Drools as a Singleton on every Spark Worker Node / Executor / Worker JVM You can invoke it from e.g. map, filter etc and use the result from the Rule to make decision how to transform/filter an event/message From: Antonio Giambanco [mailto:antogia...@gmail.com]
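A rough sketch of that singleton-per-executor idea, assuming the Drools 6.x KIE API and a rules file already on the executor classpath (object and variable names are illustrative, not from the original thread):

    import org.kie.api.KieServices
    import org.kie.api.runtime.KieSession

    // Initialised lazily, once per executor JVM, the first time a task touches it.
    object DroolsHolder {
      lazy val session: KieSession =
        KieServices.Factory.get().getKieClasspathContainer.newKieSession()
    }

    // Invoked from an ordinary transformation, e.g.:
    // stream.map { event =>
    //   DroolsHolder.session.insert(event)
    //   DroolsHolder.session.fireAllRules()
    //   event
    // }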

Re: Spark Streaming and Drools

2015-05-22 Thread Antonio Giambanco
Thanks a lot Evo, do you know where I can find some examples? Have a great one A G 2015-05-22 12:00 GMT+02:00 Evo Eftimov : > You can deploy and invoke Drools as a Singleton on every Spark Worker Node > / Executor / Worker JVM > > > > You can invoke it from e.g. map, filter etc and use the resu

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
I am not aware of existing examples but you can always “ask” Google. Basically, from a Spark Streaming perspective, Drools is a third-party software library; you would invoke it in the same way as any other third-party software library from the Tasks (maps, filters etc) within your DAG job

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
The only “tricky” bit would be when you want to manage/update the Rule Base in your Drools Engines already running as Singletons in Executor JVMs on Worker Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already loaded in Drools is not a problem. From: Evo Eftimov [

Re: Spark Streaming and Drools

2015-05-22 Thread Dibyendu Bhattacharya
Hi, Some time back I played with distributed rule processing by integrating Drools with HBase co-processors and invoking rules on any incoming data: https://github.com/dibbhatt/hbase-rule-engine You can get some idea of how to use Drools rules if you look at this RegionObserverCoprocessor: https://g

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
OR you can run Drools in a Central Server Mode, i.e. as a common/shared service, but that would slow down your Spark Streaming job due to the remote network call which will have to be made for every single message From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, May 22, 2015

DataFrame groupBy vs RDD groupBy

2015-05-22 Thread gtanguy
Hello everybody, I have two questions in one. I upgraded from Spark 1.1 to 1.3 and some parts of my code using groupBy became really slow. *1/ *Why is the groupBy of an RDD really slow in comparison to the groupBy of a DataFrame? // DataFrame : running in a few seconds val result = table.groupBy("co

Re: Partitioning of Dataframes

2015-05-22 Thread ayan guha
DataFrame is an abstraction over RDD, so you should be able to do df.rdd.partitionBy. However, as far as I know, equijoins already optimize partitioning. You may want to look at explain plans more carefully and materialise interim joins. On 22 May 2015 19:03, "Karlson" wrote: > Hi, > > is there any

Re: Reply: Reply: How to use spark to access HBase with Security enabled

2015-05-22 Thread Ted Yu
Can you share the exception(s) you encountered ? Thanks > On May 22, 2015, at 12:33 AM, donhoff_h <165612...@qq.com> wrote: > > Hi, > > My modified code is listed below, just add the SecurityUtil API. I don't > know which propertyKeys I should use, so I make 2 my own propertyKeys to find >

Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Guillermo Ortiz
Hi, I'm trying to connect to two Kafka topics from Spark with DirectStream but I get an error. I don't know if there's any limitation on doing this, because when I access just one topic everything is fine. *val ssc = new StreamingContext(sparkConf, Seconds(5))* *val kafkaParams =

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a Dataframe again? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to me, though, and I'm not sure if the DF will be aware of its partitioning. On 2015-05-22 12:55,

Re: Partitioning of Dataframes

2015-05-22 Thread Silvio Fiorito
This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, "Karlson" wrote: >Hi, > >wouldn't df.rdd.partitionBy() return a new RDD that I would then need to >make into a Dataframe again? Maybe like this: >df.rdd.partitionBy(...).toDF(schema=df.schema). That lo
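In other words, from 1.4 onwards a DataFrame can be repartitioned without dropping to the RDD; a minimal sketch assuming the 1.4 DataFrame.repartition API (partition count is arbitrary):

    // Hash-repartitions the rows into 200 partitions and returns a new DataFrame.
    val repartitioned = df.repartition(200)
    println(repartitioned.rdd.partitions.length)   // 200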

Parallel parameter tuning: distributed execution of MLlib algorithms

2015-05-22 Thread Hugo Ferreira
Hi, I am currently experimenting with linear regression (SGD) (Spark + MLlib, ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I do this (for now) by an exhaustive grid search of the step size and the number of iterations. Currently I am on a dual core that acts as a
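For what it's worth, a minimal sketch of such an exhaustive grid in Scala (trainingRDD, validationRDD and the evaluate() helper are placeholders, not part of the original post):

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    val stepSizes = Seq(0.01, 0.1, 1.0)
    val numIters  = Seq(50, 100, 200)

    // Models are trained one after another; each train() call itself runs on the cluster.
    val results = for (s <- stepSizes; n <- numIters) yield {
      val model = LinearRegressionWithSGD.train(trainingRDD, n, s)
      ((s, n), evaluate(model, validationRDD))   // evaluate() is a hypothetical scoring function
    }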

LDA prediction on new document

2015-05-22 Thread Charles Earl
Dani, Folding in I believe refers to setting up your Gibbs sampler (or other model) with the learned word and document topic proportions as computed by Spark. You might look at https://lists.cs.princeton.edu/pipermail/topic-models/2014-May/002763.html Where Jones suggests summing across columns

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Alright, that doesn't seem to have made it into the Python API yet. On 2015-05-22 15:12, Silvio Fiorito wrote: This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, "Karlson" wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then n

partitioning after extracting from a hive table?

2015-05-22 Thread Cesar Flores
I have a table in a Hive database partitioned by date. I notice that when I query this table using HiveContext, the created data frame has a specific number of partitions. Does this partitioning correspond to my original table partitioning in Hive? Thanks -- Cesar Flores

Re: Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Cody Koeninger
I just verified that the following code works on 1.3.0 : val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic1) val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic2)
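For comparison, a sketch of the two usual ways to consume more than one topic with the 1.3 direct API (topic names are placeholders; ssc, kafkaParams, stream1 and stream2 are as in the messages above):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Option 1: one direct stream over both topics.
    val combined = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topic1", "topic2"))

    // Option 2: two separate direct streams unioned into one DStream.
    val unioned = stream1.union(stream2)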

Re: How to use spark to access HBase with Security enabled

2015-05-22 Thread Frank Staszak
You might also enable debug in: hadoop-env.sh # Extra Java runtime options. Empty by default. export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}" and check that the principals are the same on the NameNode and DataNode. and you can confir

Re: Partitioning of Dataframes

2015-05-22 Thread Ted Yu
Looking at python/pyspark/sql/dataframe.py : @since(1.4) def coalesce(self, numPartitions): @since(1.3) def repartition(self, numPartitions): Would the above methods serve the purpose ? Cheers On Fri, May 22, 2015 at 6:57 AM, Karlson wrote: > Alright, that doesn't seem to hav

Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Shay Seng
Hi. I have a job that takes ~50min with Spark 0.9.3 and ~1.8hrs on Spark 1.3.1 on the same cluster. The only code difference between the two code bases is to fix the Seq -> Iter changes that happened in the Spark 1.x series. Are there any other changes in the defaults from spark 0.9.3 -> 1.3.1 th

Help reading Spark UI tea leaves..

2015-05-22 Thread Shay Seng
Hi. I have an RDD that I use repeatedly through many iterations of an algorithm. To prevent recomputation, I persist the RDD (and incidentally I also persist and checkpoint its parents) val consCostConstraintMap = consCost.join(constraintMap).map { case (cid, (costs,(mid1,_,mid2,_,_))) => {
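A short sketch of the persist-and-checkpoint pattern the post describes (the storage level and checkpoint directory are chosen here only as examples):

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("/tmp/checkpoints")                       // required before checkpoint()
    consCostConstraintMap.persist(StorageLevel.MEMORY_AND_DISK)   // reuse across iterations
    consCostConstraintMap.checkpoint()                            // truncates the lineage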

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Josh Rosen
I don't think that 0.9.3 has been released, so I'm assuming that you're running on branch-0.9. There's been over 4000 commits between 0.9.3 and 1.3.1, so I'm afraid that this question doesn't have a concise answer: https://github.com/apache/spark/compare/branch-0.9...v1.3.1 To narrow down the pot

How to share a (spring) singleton service with Spark?

2015-05-22 Thread Tristan107
I have a small Spark "launcher" app able to instantiate a service via a Spring xml application context and then "broadcast" it in order to make it available on remote nodes. I suppose when a Spring service is instantiated, all dependencies are instantiated and injected at the same time, so broadc

Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
This ticket improved the RDD API, but it could be even more discoverable if made available via the API directly. I assume this was originally an omission that now needs to be kept for backwards compatibility, but would any of the repo owners be ope

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread Xin Liu
Thank you guys for the prompt help. I ended up building spark master and verified what DB has suggested. val lr = (new MlLogisticRegression) .setFitIntercept(true) .setMaxIter(35) val model = lr.fit(sqlContext.createDataFrame(training)) val scoreAndLabels = model.transfor

Re: partitioning after extracting from a hive table?

2015-05-22 Thread ayan guha
I guess not. Spark partitions correspond to number of splits. On 23 May 2015 00:02, "Cesar Flores" wrote: > > I have a table in a Hive database partitioning by date. I notice that when > I query this table using HiveContext the created data frame has an specific > number of partitions. > > > Do t

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Reynold Xin
I'm not sure if it is possible to overload the map function twice, once for just KV pairs, and another for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony wrote: > This ticket improved > the RDD API, but it could be even mor

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
The (crude) proof of concept seems to work: class RDD[V](value: List[V]){ def doStuff = println("I'm doing stuff") } object RDD{ implicit def toPair[V](x:RDD[V]) = new PairRDD(List((1,2))) } class PairRDD[K,V](value: List[(K,V)]) extends RDD (value){ def doPairs = println("I'm using pairs"
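A self-contained toy version of that pattern, for anyone who wants to compile it (the names are deliberately not Spark's real classes):

    import scala.language.implicitConversions

    class MyRDD[V](val value: List[V]) {
      def doStuff(): Unit = println("doing stuff")
    }

    class MyPairRDD[K, V](value: List[(K, V)]) extends MyRDD[(K, V)](value) {
      def doPairs(): Unit = println("using pairs")
    }

    object MyRDD {
      // Makes the pair-only methods appear on MyRDD[(K, V)] via an implicit view.
      implicit def toPair[K, V](x: MyRDD[(K, V)]): MyPairRDD[K, V] = new MyPairRDD(x.value)
    }

    // new MyRDD(List(1 -> "a")).doPairs()   // compiles thanks to the implicit conversion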

Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Imran Rashid
hmm, sorry I think that disproves my theory. Nothing else is immediately coming to mind. It's possible there is more info in the logs from the driver; couldn't hurt to send those (though I don't have high hopes of finding anything that way). Off chance this could be from too many open files or som

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread DB Tsai
Great to see the result comparable to R in new ML implementation. Since majority of users will still use the old mllib api, we plan to call the ML implementation from MLlib to handle the intercept correctly with regularization. JIRA is created. https://issues.apache.org/jira/browse/SPARK-7780 Sin

HiveContext fails when querying large external Parquet tables

2015-05-22 Thread Andrew Otto
Hi all, (This email was easier to write in markdown, so I’ve created a gist with its contents here: https://gist.github.com/ottomata/f91ea76cece97444e269 . I’ll paste the markdown content in the email body here too.) --- We’ve recently up

Re: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Tathagata Das
Something does not make sense. Receivers (currently) do not get blocked (unless a rate limit has been set) due to processing load. The receiver will continue to receive data and store it in memory until it is processed. So I am still not sure how the data loss is happening. Unless you are sendi

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread DB Tsai
In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML pipeline framework. Model selection can be achieved through a high lambda, resulting in lots of zeros in the coefficients. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On F
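A hedged sketch of what that looks like with the 1.4 ML API (parameter values are arbitrary; trainingDF is assumed to be a DataFrame with the usual "label" and "features" columns):

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setElasticNetParam(1.0)   // 1.0 = pure L1; L1 drives weak coefficients to exactly zero
      .setRegParam(0.1)          // the "lambda"; larger values give sparser models
      .setMaxIter(100)
      .setFitIntercept(true)

    val model = lr.fit(trainingDF)
    // The fitted vector (named weights in 1.4, coefficients in later releases) shows which
    // explanatory variables survived the L1 penalty.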

Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data from Kinesis at 15 second intervals using two streams (i.e. receivers). The job simply grabs the latest batch and pushes it to MongoDB. I believe that the problem is that all tasks are execu

RE: HiveContext fails when querying large external Parquet tables

2015-05-22 Thread yana
There is an open Jira on Spark not pushing predicates to the metastore. I have a large dataset with many partitions but doing anything with it is very slow... But I am surprised Spark 1.2 worked for you: it has this problem... Original message From: Andrew Otto Date:05/22/2015 3:5

Re: HiveContext fails when querying large external Parquet tables

2015-05-22 Thread Andrew Otto
What is also strange is that this seems to work on external JSON data, but not Parquet. I’ll try to do more verification of that next week. > On May 22, 2015, at 16:24, yana wrote: > > There is an open Jira on Spark not pushing predicates to metastore. I have a > large dataset with many part

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
If the message consumption rate is higher than the rate at which ALL data for a micro batch (i.e. the next RDD produced for your stream) can be processed, the following happens – let's say that e.g. your micro batch time is 3 sec: 1. Based on your message streaming and consumption rate, you ge

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
… and measure 4 is to implement a custom Feedback Loop, e.g. to monitor the amount of free RAM and the number of queued jobs and automatically decrease the message consumption rate of the Receiver until the number of clogged RDDs and Jobs subsides (again here you artificially decrease your perfor

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
I guess each receiver occupies an executor. So there was only one executor available for processing the job. On Fri, May 22, 2015 at 1:24 PM, Mike Trienis wrote: > Hi All, > > I have cluster of four nodes (three workers and one master, with one core > each) which consumes data from Kinesis at 15

spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-22 Thread Wang, Ningjun (LNG-NPV)
I used spark standalone cluster on Windows 2008. I kept on getting the following error when trying to save an RDD to a windows shared folder rdd.saveAsObjectFile("file:///T:/lab4-win02/IndexRoot01/tobacco-07/myrdd.obj") 15/05/22 16:49:05 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 1

Re: spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-22 Thread Ted Yu
The stack trace is related to hdfs. Can you tell us which hadoop release you are using ? Is this a secure cluster ? Thanks On Fri, May 22, 2015 at 1:55 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > I used spark standalone cluster on Windows 2008. I kept on getting the >

Re: DataFrame groupBy vs RDD groupBy

2015-05-22 Thread Michael Armbrust
DataFrames have a lot more information about the data, so there is a whole class of optimizations that are possible there that we cannot do in RDDs. This is why we are focusing a lot of effort on this part of the project. In Spark 1.4 you can accomplish what you want using the new window function f
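For reference, a hedged sketch of the 1.4 window-function syntax (column names are illustrative; the function was called rowNumber in 1.4 and renamed row_number in later releases):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Number rows within each col1 group without shuffling whole groups to one task.
    val w = Window.partitionBy("col1").orderBy(col("col2").desc)
    val ranked = table.withColumn("rn", rowNumber().over(w))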

Re: Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Tathagata Das
Can you show us the rest of the program? When are you starting or stopping the context? Is the exception occurring right after start or stop? What about log4j logs, what do they say? On Fri, May 22, 2015 at 7:12 AM, Cody Koeninger wrote: > I just verified that the following code works on 1.3.0 :

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread Joseph Bradley
If you want to select specific variable combinations by hand, then you will need to modify the dataset before passing it to the ML algorithm. The DataFrame API should make that easy to do. If you want to have an ML algorithm select variables automatically, then I would recommend using L1 regulari

spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Todd Nist
I'm using the spark-cassandra-connector from DataStax in a spark streaming job launched from my own driver. It is connecting to a standalone cluster on my local box which has two workers running. This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT. I have added the following entry to
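Not the poster's exact entry (which is cut off above), but a sketch of how this setting is usually supplied, with an illustrative jar path; note the jar has to exist at that path on every worker:

    import org.apache.spark.SparkConf

    // Equivalent to this line in conf/spark-defaults.conf (path is illustrative):
    //   spark.executor.extraClassPath  /opt/jars/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
    val conf = new SparkConf()
      .set("spark.executor.extraClassPath",
           "/opt/jars/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar")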

Application on standalone cluster never changes state to be stopped

2015-05-22 Thread Edward Sargisson
Hi, Environment: Spark standalone cluster running with a master and a worker on a small Vagrant VM. The Jetty Webapp on the same node calls the spark-submit script to start the job. From the contents of the stdout I can see that it's running successfully. However, the spark-submit process never see

Migrate Relational to Distributed

2015-05-22 Thread Brant Seibert
Hi, The healthcare industry can do wonderful things with Apache Spark. But, there is already a very large base of data and applications firmly rooted in the relational paradigm and they are resistant to change - stuck on Oracle. ** QUESTION 1 - Migrate legacy relational data (plus new transact

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Evo Eftimov
A receiver occupies a CPU core; an executor is simply a JVM instance and as such it can be granted any number of cores and any amount of RAM. So check how many cores you have per executor. Sent from Samsung Mobile Original message From: Mike Trienis Date:2015/05/22 21:51 (GMT+00:00) To:

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking calll

2015-05-22 Thread Tathagata Das
Hey Aniket, I just checked in the fix in Spark master and branch-1.4. Could you download Spark and test it out? On Thu, May 21, 2015 at 1:43 AM, Tathagata Das wrote: > Thanks for the JIRA. I will look into this issue. > > TD > > On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar < > aniket.bhat

SparkSQL failing while writing into S3 for 'insert into table'

2015-05-22 Thread ogoh
Hello, I am using spark 1.3 & Hive 0.13.1 in AWS. From Spark-SQL, when running a Hive query to export the Hive query result into AWS S3, it failed with the following message: == org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths: s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_459

Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Yana Kadiyska
Todd, I don't have any answers for you...other than the file is actually named spark-defaults.conf (not sure if you made a typo in the email or misnamed the file...). Do any other options from that file get read? I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3.0-SNAPS

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread tyronecai
may because of snappy-java, https://issues.apache.org/jira/browse/SPARK-5081 On May 23, 2015, at 1:23 AM, Josh Rosen wrote: > I don't think that 0.9.3 has been released, so I'm assuming that you're > running on branch-0.9. > > There's been over 4000 commits between 0.9.3 and 1.3.1, so I'm afr

Dynamic Allocation with Spark Streaming

2015-05-22 Thread Saiph Kappa
Hi, 1. Dynamic allocation is currently only supported with YARN, correct? 2. In spark streaming, it is possible to change the number of executors while an application is running? If so, can the allocation be controlled by the application, instead of using any already defined automatic policy? Tha

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Ted Yu
For #1, the answer is yes. For #2, See TD's comments on SPARK-7661 Cheers On Fri, May 22, 2015 at 6:58 PM, Saiph Kappa wrote: > Hi, > > 1. Dynamic allocation is currently only supported with YARN, correct? > > 2. In spark streaming, it is possible to change the number of executors > while an

SparkSQL query plan to Stage wise breakdown

2015-05-22 Thread Pramod Biligiri
Hi, Is there an easy way to see how a SparkSQL query plan maps to different stages of the generated Spark job? The WebUI is entirely in terms of RDD stages and I'm having a hard time mapping it back to my query. Pramod

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Saiph Kappa
Sorry, but I can't see on TD's comments how to allocate executors on demand. It seems to me that he's talking about resources within an executor, mapping shards to cores. I want to be able to decommission executors/workers/machines. On Sat, May 23, 2015 at 3:31 AM, Ted Yu wrote: > For #1, the an

Re: Bigints in pyspark

2015-05-22 Thread Davies Liu
Could you show the schema and confirm that the columns are LongType? df.printSchema() On Mon, Apr 27, 2015 at 5:44 AM, jamborta wrote: > hi all, > > I have just come across a problem where I have a table that has a few bigint > columns, it seems if I read that table into a dataframe then collect it

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Saiph Kappa
Or should I shutdown the streaming context gracefully and then start it again with a different number of executors? On Sat, May 23, 2015 at 4:00 AM, Saiph Kappa wrote: > Sorry, but I can't see on TD's comments how to allocate executors on > demand. It seems to me that he's talking about resource
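For the record, the graceful variant is a one-liner (assuming ssc is the running StreamingContext):

    // Finish processing the batches already received, then stop both contexts.
    ssc.stop(stopSparkContext = true, stopGracefully = true)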

RE: Question about Serialization in Storage Level

2015-05-22 Thread Jiang, Zhipeng
Hi Todd, Howard, Thanks for your reply; I might not have presented my question clearly. What I mean is: if I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager will cache the RDD in the MemoryStore. The RDD will be migrated to the DiskStore when it cannot fit in memory. I think this migration does r

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Ted Yu
That should do. Cheers On Fri, May 22, 2015 at 8:28 PM, Saiph Kappa wrote: > Or should I shutdown the streaming context gracefully and then start it > again with a different number of executors? > > On Sat, May 23, 2015 at 4:00 AM, Saiph Kappa > wrote: > >> Sorry, but I can't see on TD's comme