Re: dataframe slow down with tungsten turn on

2015-11-04 Thread Rick Moritz
Something to check (just in case): Are you getting identical results each time? On Wed, Nov 4, 2015 at 8:54 AM, gen tang wrote: > Hi sparkers, > > I am using dataframe to do some large ETL jobs. > More precisely, I create dataframe from HIVE table and do some operations. > And then I save it as

Dynamic (de)allocation with Spark Streaming

2015-11-04 Thread Wojciech Pituła
Hi, I have some doubts about dynamic resource allocation with Spark Streaming. If Spark has allocated 5 executors for me, it will dispatch every batch's tasks across all of them equally. So if batchSize < spark.dynamicAllocation.executorIdleTimeout, then Spark will never free any executor. Moreove

Re: Rule Engine for Spark

2015-11-04 Thread Stefano Baghino
Hi LCassa, unfortunately I don't have actual experience on this matter, however for a similar use case I briefly evaluated Decision (at the time literally called Streaming CEP Engine) and it looked interesting. I hope it may help. On Wed, Nov 4, 2015 at 1:42 AM,

Codegen In Shuffle

2015-11-04 Thread 牛兆捷
Dear all: The Tungsten project has mentioned that it applies code generation to speed up the conversion of data from the in-memory binary format to the wire protocol for shuffle. Where can I find the related implementation in the Spark code base? -- Regards, Zhaojie

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data from in-memory binary f

Re: PMML version in MLLib

2015-11-04 Thread Stefano Baghino
Hi Fazian, I actually had a problem with an invalid PMML produced by Spark 1.5.1 due to the missing "version" attribute in the "PMML" tag. Is this your case too? I've briefly checked the PMML standard and that attribute is required, so this may be an issue that should be addressed. I'll happily loo

Re: dataframe slow down with tungsten turn on

2015-11-04 Thread gen tang
Yes, the same code, the same result. In fact, the code has been running for more than one month. Before 1.5.0 the performance was quite the same, so I suspect that it is caused by Tungsten. Gen On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz wrote: > Something to check (just in case): > Are you getting i

spark filter function

2015-11-04 Thread Zhiliang Zhu
Hi All, I would like to filter some elements in a given RDD, keeping only the ones needed, so that the row count of the result RDD is smaller. I then selected the filter function; however, from my tests, the filter function will only accept Boolean type, that is to say, only a JavaRDD will be returned by filter.
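For reference, the predicate passed to filter must return Boolean, and the result is an RDD of the same element type with fewer elements (JavaRDD<T>.filter in the Java API behaves the same way). A minimal Scala sketch; the data and predicate are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("filter-example").setMaster("local[2]"))
    val rows = sc.parallelize(Seq("a,1", "b,2", "c,3"))
    // The predicate returns Boolean, but the result is still an RDD[String],
    // just with fewer rows.
    val kept = rows.filter(line => line.split(",")(1).toInt >= 2)
    kept.collect().foreach(println)   // prints: b,2  c,3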

Running Apache Spark 1.5.1 on console2

2015-11-04 Thread Hitoshi Ozawa
I have Spark 1.5.1 running directly on Windows 7 but would like to run it on console2. I have JAVA_HOME, SCALA_HOME, and SPARK_HOME set up and have verified Java and Scala are working properly (did a -version and was able to run programs). However, when I try to use Spark using "spark-shell" it return th

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Hi Stefano, Although the intention of my question wasn't what you expected, what you say makes sense. The standard[1] for PMML 4.1 specifies that "For PMML 4.1 the attribute version must have the value 4.1." I'm not sure whether that means that other PMML versions do not need that attribute to be

Re: PMML version in MLLib

2015-11-04 Thread Stefano Baghino
I used KNIME, which internally uses the org.dmg.pmml library. On Wed, Nov 4, 2015 at 9:45 AM, Fazlan Nazeem wrote: > Hi Stefano, > > Although the intention for my question wasn't as you expected, what you > say makes sense. The standard[1] for PMML 4.1 specifies that "*For PMML > 4.1 the attribu

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
I just went through all the specifications, and they expect the version attribute. This should be addressed very soon, because if the PMML model cannot be used without the version attribute, there is no point in generating one without it. On Wed, Nov 4, 2015 at 2:17 PM, Stefano Baghino < stefano.bagh...@r

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Zhiliang Zhu
Hi Debasish Das, Firstly I must show my deep appreciation for your kind help. Yes, my issue is a typical LP problem, as follows: Objective function: f(x1, x2, ..., xn) = a1 * x1 + a2 * x2 + ... + an * xn (n would be some number bigger than 100). There are only 4 constraint functions, x1 + x2

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Rex Xiong
Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my old table definition in pure Hive, the execution plan will do a parquet scan across all partitions, and it runs very slow. Looks like the execution pla

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Zhiliang Zhu
Hi Debasish Das, I found there is a lot of very useful information for me in your kind reply. However, I am sorry that I still could not exactly catch every word you said. I just know Spark MLlib uses Breeze as its underlying package; however, I have not practised with it and do not know how to di

Re: [Spark MLlib] about linear regression issue

2015-11-04 Thread Zhiliang Zhu
Hi DB Tsai, Firstly I must show my deep appreciation for your kind help. Did you mean that currently there is no way for users to deal with constraints like all weights >= 0 in Spark, though Spark also has LBFGS ... Moreover, I did not know whether Spark SVD will help some for that i

Avoid RDD.saveAsTextFile() generating empty part-* and .crc files

2015-11-04 Thread amit tewari
Dear Spark Users, I have an RDD.saveAsTextFile() statement which is generating many empty part-* and .crc files. I understand that the empty part-* files are due to the number of partitions, but I would still prefer not to generate either empty part-* or .crc files. How can I achieve this? Thanks
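One common workaround, not from the thread itself, is to reduce the partition count before saving so that no partition is empty; the .crc siblings come from Hadoop's checksum filesystem (typically seen when writing to a local filesystem) rather than from Spark. A sketch, assuming a small, known partition count is acceptable; the path is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("save-example"))
    // 50 partitions, most of which end up empty after the filter.
    val data = sc.parallelize(1 to 100, 50).filter(_ % 10 == 0)
    // Shrink to a few non-empty partitions before writing so that no
    // empty part-* files are produced.
    data.coalesce(4).saveAsTextFile("hdfs:///tmp/filtered-output")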

Prevent possible out of memory when using read/union

2015-11-04 Thread Alexander Lenz
Hi colleagues, In Hadoop I have a lot of folders containing small files. Therefore I am reading the content of all folders, unioning the small files and writing the unioned data into a single folder containing one file. Afterwards I delete the small files and the corresponding folders. I see two possib

Re: Rule Engine for Spark

2015-11-04 Thread Adrian Tanase
Another way to do it is to extract your filters as SQL code and load it in a transform, which allows you to change the filters at runtime. Inside the transform you could apply the filters by going RDD -> DF -> SQL -> RDD (see the sketch below). Lastly, depending on how complex your filters are, you could skip SQL an
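A minimal sketch of that round trip inside transform, assuming a toy Event stream read from a socket; the source, case class, table name, and filter are all illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Event(user: String, value: Int)

    val conf = new SparkConf().setAppName("sql-filter-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val events = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => Event(a(0), a(1).toInt))

    // RDD -> DataFrame -> SQL -> RDD, once per batch; the SQL string could
    // be reloaded from an external source to change filters at runtime.
    val filtered = events.transform { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF().registerTempTable("events")
      sqlContext.sql("SELECT user, value FROM events WHERE value > 10")
        .rdd
        .map(row => Event(row.getString(0), row.getInt(1)))
    }
    filtered.print()

    ssc.start()
    ssc.awaitTermination()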

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
[adding dev] On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem wrote: > I just went through all specifications, and they expect the version > attribute. This should be addressed very soon because if we cannot use the > PMML model without the version attribute, there is no use of generating one > wit

Re: PMML version in MLLib

2015-11-04 Thread Sean Owen
I'm pretty sure that attribute is required. I am not sure what PMML version the code has been written for but would assume 4.2.1. Feel free to open a PR to add this version to all the output. On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem wrote: > [adding dev] > > On Wed, Nov 4, 2015 at 2:27 PM,

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Thanks Owen. Will do it On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen wrote: > I'm pretty sure that attribute is required. I am not sure what PMML > version the code has been written for but would assume 4.2.1. Feel > free to open a PR to add this version to all the output. > > On Wed, Nov 4, 2015 a

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
If some of the operations required involve shuffling and partitioning, it might mean that the data set is skewed to specific partitions which will create hot spotting on certain executors. -adrian From: Khaled Ammar Date: Tuesday, November 3, 2015 at 11:43 PM To: "user@spark.apache.org

Custom application.conf on spark executor nodes?

2015-11-04 Thread Adrian Tanase
Hi guys, I’m trying to deploy a custom application.conf that contains various app specific entries and config overrides for akka and spray. The file is successfully loaded by the driver by submitting with --driver-class-path, at least for the application specific config. I haven’t yet managed

SparkSQL JDBC to PostGIS

2015-11-04 Thread Mustafa Elbehery
Hi Folks, I am trying to connect from the Spark shell to a PostGIS database. Simply put, PostGIS is a spatial extension for PostgreSQL that supports geometry types. Although the JDBC connection from Spark works well with PostgreSQL, it does not with a database on the same server, which supports t

Re: Parsing a large XML file using Spark

2015-11-04 Thread Jin
I recently worked a bit on datasources and Parquet in Spark, and someone asked me to make an XML datasource plugin, so I did this: https://github.com/HyukjinKwon/spark-xml It tries to get rid of the in-line format, just like the JSON datasource in Spark. Although I didn't add a CI tool for this yet,

Re: SparkSQL JDBC to PostGIS

2015-11-04 Thread Stefano Baghino
Hi Mustafa, are you trying to run geospatial queries on the PostGIS DB with SparkSQL? Correct me if I'm wrong, but I think SparkSQL itself would need to support the geospatial extensions in order for this to work. On Wed, Nov 4, 2015 at 1:46 PM, Mustafa Elbehery wrote: > Hi Folks, > > I am tryi

Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Priya Ch
Hello All, How do we use multiple SparkContexts when executing multiple test suites of Spark code? Can someone throw some light on this?

Looking for the method executors uses to write to HDFS

2015-11-04 Thread Tóth Zoltán
Hi, I'd like to write a parquet file from the driver. I could use the HDFS API but I am worried that it won't work on a secure cluster. I assume that the method the executors use to write to HDFS takes care of managing Hadoop security. However, I can't find the place where HDFS write happens in th

Re: Codegen In Shuffle

2015-11-04 Thread 牛兆捷
I see. Thanks very much. 2015-11-04 16:25 GMT+08:00 Reynold Xin : > GenerateUnsafeProjection -- projects any internal row data structure > directly into bytes (UnsafeRow). > > > On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > >> Dear all: >> >> Tungsten project has mentioned that they are applying

Distributing Python code packaged as tar balls

2015-11-04 Thread Praveen Chundi
Hi, Pyspark/spark-submit offers a --py-files option to distribute Python code for execution. Currently (version 1.5) only zip files seem to be supported; I have tried distributing tarballs unsuccessfully. Is it worth adding support for tarballs? Best regards, Praveen Chundi ---

Re: Why some executors are lazy?

2015-11-04 Thread Khaled Ammar
Thank you Adrian, The dataset is indeed skewed. My concern was that some executors do not participate in the computation at all. I understand that executors finish tasks sequentially; therefore, using more executors allows for better parallelism. I managed to force all executors to participate by incr

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu Date: Tuesday, November 3, 2015 at 3:32 AM To: Thúy Hằng Lê Cc: Adrian Tanase, "user@spark.apache.org" Subject: Re: Spark Streaming dat

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
Your solution is an easy, pragmatic one, however there are many factors involved – and it’s not guaranteed to work for other data sets. It depends on: * The distribution of data on the keys. Random session ids will naturally distribute better than “customer names” - you can mitigate this wi

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Ted Yu
Are you trying to speed up tests where each test suite uses single SparkContext ? You may want to read: https://issues.apache.org/jira/browse/SPARK-2243 Cheers On Wed, Nov 4, 2015 at 4:59 AM, Priya Ch wrote: > Hello All, > > How to use multiple Spark Context in executing multiple test suite

Problem using BlockMatrix.add

2015-11-04 Thread Kareem Sorathia
Hi, I'm attempting to use the distributed matrix data structure BlockMatrix (Spark 1.5.0, scala) and having some issues when attempting to add two block matrices together (error attached below). I'm constructing the two matrices by creating a collection of MatrixEntry's, putting that into Coordin

sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-11-04 Thread tstewart
(apologies if this re-posts, having challenges with the various web front ends to this mailing list) I have the following script in a file named test.R: library(SparkR) sc <- sparkR.init(master="yarn-client") sqlContext <- sparkRSQL.init(sc) df <- createDataFrame(sqlContext, faithful) showDF(df)

Spark 1.5.1 Dynamic Resource Allocation

2015-11-04 Thread tstewart
(apologies if this re-posts, having challenges with the various web front ends to this mailing list) I am running the following command on a Hadoop cluster to launch Spark shell with DRA: spark-shell --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spa

Cassandra via SparkSQL/Hive JDBC

2015-11-04 Thread Bryan Jeffrey
Hello. I have been working to add SparkSQL HDFS support to our application. We're able to process streaming data, append to a persistent Hive table, and have that table available via JDBC/ODBC. Now we're looking to access data in Cassandra via SparkSQL. In reading a number of previous posts, it

Spark driver, Docker, and Mesos

2015-11-04 Thread PHELIPOT, REMY
Hello, I’m trying to run a Spark shell (or a Spark notebook) connected to a Mesos master inside a Docker container. My goal is to give access to the Mesos cluster to several users at the same time. It seems it is not possible to get it working with the container started in bridge mode, is it? Is it

DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hello folks, Recently I have noticed unexpectedly big network traffic between the driver program and a worker node. During debugging I have figured out that it is caused by the following block of code (Java): DataFrame etpvRecords = context.sql("SOME SQL query here"); Mapper m = new Mapp

Re: Spark driver, Docker, and Mesos

2015-11-04 Thread Timothy Chen
Hi Remy, Yes, with the Docker bridge network it's not possible yet; with host networking it should work. I was planning to create a ticket and possibly work on that in the future, as there are some changes needed on the Spark side. Tim > On Nov 4, 2015, at 8:24 AM, PHELIPOT, REMY wrote: > > Hello, > > I’m

Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Nisrina Luthfiyati
Hi all, I'm running some Spark jobs in Java on top of YARN by submitting one application jar that starts multiple jobs. My question is, if I'm setting some resource configurations, either when submitting the app or in spark-defaults.conf, would these configs apply to each job or the entire applicat

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Marcelo Vanzin
Resources belong to the application, not each job, so the latter. On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting one > application jar that starts multiple jobs. > My question is, if I'm setting some resourc

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Sandy Ryza
Hi Nisrina, The resources you specify are shared by all jobs that run inside the application. -Sandy On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < nisrina.luthfiy...@gmail.com> wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting one > application jar tha

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
I noticed that toJavaRDD causes a computation on the DataFrame, so is it considered an action, even though logically it's a transformation? On Nov 4, 2015 6:51 PM, "Aliaksei Tsyvunchyk" wrote: > Hello folks, > > Recently I have noticed unexpectedly big network traffic between Driver > Program and

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hello Romi, Do you mean that in my particular case I’m causing computation on the DataFrame, or is it the regular behavior of DataFrame.toJavaRDD? If it’s the regular behavior, do you know which approach could be used to perform map/reduce on a DataFrame without causing it to load all data to the driver program

Re: Prevent possible out of memory when using read/union

2015-11-04 Thread Sujit Pal
Hi Alexander, You may want to try the wholeTextFiles() method of SparkContext. Using that you could just do something like this: sc.wholeTextFiles("hdfs://input_dir").saveAsSequenceFile("hdfs://output_dir") wholeTextFiles returns an RDD of (filename, content) pairs. http://spark.apache.or
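For the original scenario of writing one file per output folder, a sketch that adds coalesce(1) so the result is written through a single task (paths are illustrative, and a single task can be slow for large inputs):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("union-small-files"))
    // Each element is a (fileName, fileContent) pair; coalesce(1) forces a
    // single output part file.
    sc.wholeTextFiles("hdfs:///data/input_dir")
      .coalesce(1)
      .saveAsSequenceFile("hdfs:///data/output_dir")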

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
In my program I move between RDD and DataFrame several times. I know that the entire data of the DF doesn't go into the driver because it wouldn't fit there. But calling toJavaRDD does cause computation. Check the number of partitions you have on the DF and RDD... On Nov 4, 2015 7:54 PM, "Aliaksei
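A small Scala sketch of that check; the toy DataFrame below stands in for the query result from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("partition-check").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000, 8).map(i => (i, i * 2)).toDF("a", "b")
    // df.rdd (df.toJavaRDD() from Java) exposes the underlying RDD without
    // collecting it to the driver, although the conversion may still trigger
    // evaluation of the plan.
    println(s"partitions: ${df.rdd.partitions.length}")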

SPARK_SSH_FOREGROUND format

2015-11-04 Thread Kayode Odeyemi
From http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts : "If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker." What does "serially provide a password for each worker

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hi Romi, Thanks for pointing this out. I am quite new to Spark and not sure how it will help when I check the number of partitions in the DF and RDD, so if you can give me some explanation it would be really helpful. A link to documentation will also help. > On Nov 4, 2015, at 1:05 PM, Romi Kuntsman wrote: > >

Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Kayode Odeyemi
Hi, I can't seem to understand why all created executors always fail. I have a Spark standalone cluster setup made up of 2 workers and 1 master. My spark-env looks like this: SPARK_MASTER_IP=192.168.2.11 SPARK_LOCAL_IP=192.168.2.11 SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4" SPARK_WORKER_C

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Nisrina Luthfiyati
Got it. Thanks! On Nov 5, 2015 12:32 AM, "Sandy Ryza" wrote: > Hi Nisrina, > > The resources you specify are shared by all jobs that run inside the > application. > > -Sandy > > On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < > nisrina.luthfiy...@gmail.com> wrote: > >> Hi all, >> >> I'm runn

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Ted Yu
Have you tried using -Dspark.master=local ? Cheers On Wed, Nov 4, 2015 at 10:47 AM, Kayode Odeyemi wrote: > Hi, > > I can't seem to understand why all created executors always fail. > > I have a Spark standalone cluster setup make up of 2 workers and 1 master. > My spark-env looks like this: >

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Kayode Odeyemi
Thanks Ted. Where would you suggest I add that? I'm creating a SparkContext from a Spark app. My conf setup looks like this: conf.setMaster("spark://192.168.2.11:7077") conf.set("spark.logConf", "true") conf.set("spark.akka.logLifecycleEvents", "true") conf.set("spark.executor.memory", "5g") On

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Bryan Jeffrey
Priya, If you're trying to get unit tests running local spark contexts, you can just set up your spark context with 'spark.driver.allowMultipleContexts' set to true. Example: def create(seconds : Int, appName : String): StreamingContext = { val master = "local[*]" val conf = new SparkConf().
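A completed version of that helper, as a sketch based on the settings Bryan mentions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def create(seconds: Int, appName: String): StreamingContext = {
      val master = "local[*]"
      val conf = new SparkConf()
        .setMaster(master)
        .setAppName(appName)
        // Lets a test suite create its own context even if another one exists.
        .set("spark.driver.allowMultipleContexts", "true")
      new StreamingContext(conf, Seconds(seconds))
    }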

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Ted Yu
Something like this: conf.setMaster("local[3]") On Wed, Nov 4, 2015 at 11:08 AM, Kayode Odeyemi wrote: > Thanks Ted. > > Where would you suggest I add that? I'm creating a SparkContext from a > Spark app. My conf setup looks like this: > > conf.setMaster("spark://192.168.2.11:7077") > conf.s

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Kayode Odeyemi
I've tried that once. No job was executed on the workers. That is, the workers weren't used. What I want to achieve is to have the SparkContext use a remote spark standalone master at 192.168.2.11 (this is where I started the master with ./start-master.sh and all the slaves with ./start-slaves.sh)

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-04 Thread tstewart
https://issues.apache.org/jira/browse/SPARK-10790 Changed to add minExecutors < initialExecutors < maxExecutors and that works. spark-shell --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=2 --conf spark.dynamicAlloc

[Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space

2015-11-04 Thread Shuai Zheng
Hi All, I have a program which actually runs a somewhat complex business operation (a join) in Spark, and I get the exception below. I am running on Spark 1.5 with the parameters: spark-submit --deploy-mode client --executor-cores=24 --driver-memory=2G --executor-memory=45G --class . Some other setup:

Re: kerberos question

2015-11-04 Thread Chen Song
After a bit more investigation, I found that it could be related to impersonation on a kerberized cluster. Our job is started with the following command: /usr/lib/spark/bin/spark-submit --master yarn-client --principal [principle] --keytab [keytab] --proxy-user [proxied_user] ... In application m

Re: kerberos question

2015-11-04 Thread Ted Yu
2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !! Could it be related to HDFS-7931 ? On Wed, Nov 4, 2015 at 12:30

PairRDD from SQL

2015-11-04 Thread pratik khadloya
Hello, Is it possible to have a pair RDD from the below SQL query. The pair being ((item_id, flight_id), metric1) item_id, flight_id are part of group by. SELECT item_id, flight_id, SUM(metric1) AS metric1 FROM mytable GROUP BY item_id, flight_id Thanks, Pratik

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Debasish Das
Yeah for this you can use breeze quadratic minimizer...that's integrated with spark in one of my spark pr. You have quadratic objective with equality which is primal and your proximal is positivity that we already support. I have not given an API for linear objective but that should be simple to a

RE: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-11-04 Thread Shuai Zheng
And an update: this ONLY happens in Spark 1.5. I tried to run it under Spark 1.4 and 1.4.1, and there is no issue (the program was developed under Spark 1.4 last time, and I just re-tested it; it works). So this proves that there is no issue with the logic or the data; it is caused by the new version of Sp

Re: PairRDD from SQL

2015-11-04 Thread Stéphane Verlet
sqlContext.sql().map(row=> ((row.getString(0), row.getString(1)),row.getInt(2))) On Wed, Nov 4, 2015 at 1:44 PM, pratik khadloya wrote: > Hello, > > Is it possible to have a pair RDD from the below SQL query. > The pair being ((item_id, flight_id), metric1) > > item_id, flight_id are part of gr
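If relying on column positions feels brittle, a variant using getAs by field name, as a sketch; the key columns are assumed to be strings, and SUM over an integral column comes back as a Long, so adjust the types to match metric1's actual schema:

    val pairs = sqlContext
      .sql("SELECT item_id, flight_id, SUM(metric1) AS metric1 FROM mytable GROUP BY item_id, flight_id")
      .rdd
      .map { row =>
        ((row.getAs[String]("item_id"), row.getAs[String]("flight_id")),
         row.getAs[Long]("metric1"))
      }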

Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread swetha
Hi, What is an efficient approach to save an RDD as a file in HDFS and retrieve it back? I was deciding between Avro, Parquet and SequenceFileFormat. We currently use SequenceFileFormat for one of our use cases. Any example on how to store and retrieve an RDD in an Avro and Parquet file forma

Futures timed out after [120 seconds].

2015-11-04 Thread Kayode Odeyemi
Hi, I'm running a Spark standalone cluster in cluster mode (1 master, 2 workers). Everything has failed, including spark-submit, with errors such as "Caused by: java.lang.ClassNotFoundException: com.migration.App$$anonfun$upsert$1". Now, I've reverted back to submitting jobs through Scala apps. Any ideas

ExecutorId in JAVA_OPTS

2015-11-04 Thread surbhi.mungre
I was trying to profile some Spark jobs and I want to collect Java Flight Recorder (JFR) files from each executor. I am running my job on a YARN cluster with several nodes, so I cannot manually collect a JFR file for each run. MR provides a way to name JFR files generated by each task with the taskId. I

Memory are not used according to setting

2015-11-04 Thread William Li
Hi All - I have a four-worker-node cluster, each node with 8GB memory. When I submit a job, the driver node takes 1GB of memory, and each worker node only allocates one executor, which also takes just 1GB of memory. The setting of the job has: sparkConf .setExecutorEnv("spark.driver.memory", "6g") .setExecutorEn

Protobuff 3.0 for Spark

2015-11-04 Thread Cassa L
Hi, Does Spark support protobuf 3.0? I used protobuf 2.5 with Spark 1.4 built for HDP 2.3. Given that protobuf has compatibility issues, I want to know if Spark supports protobuf 3.0. LCassa

Re: Rule Engine for Spark

2015-11-04 Thread Cassa L
Thanks for the reply. How about Drools? Does it work with Spark? LCassa On Wed, Nov 4, 2015 at 3:02 AM, Adrian Tanase wrote: > Another way to do it is to extract your filters as SQL code and load it in > a transform – which allows you to change the filters at runtime. > > Inside the transform you

Re: Protobuff 3.0 for Spark

2015-11-04 Thread Lan Jiang
I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to do the following: 1. Set the parameters below: spark.executor.userClassPathFirst=true spark.driver.userClassPathFirst=true 2. Include proto
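One way to set those flags programmatically, as a sketch; they can equally go in spark-defaults.conf or be passed with --conf on the spark-submit command line:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("protobuf3-app")
      // Prefer the application's protobuf 3 jars over the cluster's protobuf 2.5.
      .set("spark.executor.userClassPathFirst", "true")
      .set("spark.driver.userClassPathFirst", "true")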

RE: Rule Engine for Spark

2015-11-04 Thread Cheng, Hao
Or try Streaming SQL, which is a simple layer on top of Spark Streaming. ☺ https://github.com/Intel-bigdata/spark-streamingsql From: Cassa L [mailto:lcas...@gmail.com] Sent: Thursday, November 5, 2015 8:09 AM To: Adrian Tanase Cc: Stefano Baghino; user Subject: Re: Rule Engine for Spark Tha

Re: Memory are not used according to setting

2015-11-04 Thread Shixiong Zhu
You should use `SparkConf.set` rather than `SparkConf.setExecutorEnv`. For driver configurations, you need to set them before starting your application; you can use the `--conf` argument when running `spark-submit`. Best Regards, Shixiong Zhu 2015-11-04 15:55 GMT-08:00 William Li : > Hi All –
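A sketch of the corrected setup, reusing the values from the original question (whether 6g actually fits depends on the 8GB workers):

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName("memory-example")
      // Executor memory can be set programmatically before the context starts.
      .set("spark.executor.memory", "6g")
    // Driver memory must be set before the driver JVM starts, e.g.:
    //   spark-submit --conf spark.driver.memory=6g ...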

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my old table definitio

How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha
Hi, How do I unpersist a DStream in Spark Streaming? I know that we can persist using dStream.persist() or dStream.cache(), but I don't see any method to unpersist. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream-in-S

how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Andy Davidson
Hi, I just set up a Spark cluster on AWS EC2. In the past I have done a lot of work using RStudio on my local machine. bin/sparkR looks interesting; however, it looks like you just get an R command-line interpreter. Does anyone have any experience using something like RStudio or RStudio serv

Question about Spark shuffle read size

2015-11-04 Thread Dogtail L
Hi all, When I run WordCount using Spark, I find that when I set "spark.default.parallelism" to different numbers, the Shuffle Write size and Shuffle Read size change as well (I read these numbers from the history server's web UI). Is it because the shuffle write size also includes some metadata size

Re: how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Shivaram Venkataraman
RStudio should already be setup if you launch an EC2 cluster using spark-ec2. See http://blog.godatadriven.com/sparkr-just-got-better.html for details. Shivaram On Wed, Nov 4, 2015 at 5:11 PM, Andy Davidson wrote: > Hi > > I just set up a spark cluster on AWS ec2 cluster. In the past I have don

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Zhiliang Zhu
Dear Debasish Das, Thanks very much for your kind reply. I am very sorry, but could you clarify a little more about the places, since I could not find them. On Thursday, November 5, 2015 5:50 AM, Debasish Das wrote: Yeah for this you can use breeze quadratic minimizer...that's i

Spark reading from S3 getting very slow

2015-11-04 Thread Younes Naguib
Hi all, I'm reading large text files from S3, with sizes between 30GB and 40GB. Every stage runs in 8-9s, except the last 32, which jump to 1-2 min for some reason! Here is my sample code: val myDF = sc.textFile(input_file).map{ x => val p = x.split("\t", -1) new (

Re: Rule Engine for Spark

2015-11-04 Thread Cassa L
OK, let me try it. Thanks, LCassa On Wed, Nov 4, 2015 at 4:44 PM, Cheng, Hao wrote: > Or try Streaming SQL? Which is a simple layer on top of the Spark > Streaming. ☺ > > > > https://github.com/Intel-bigdata/spark-streamingsql > > > > > > *From:* Cassa L [mailto:lcas...@gmail.com] > *Sent:* Thu

Re: Rule Engine for Spark

2015-11-04 Thread Daniel Mahler
I am not familiar with any rule engines on Spark Streaming or even plain Spark. Conceptually, the closest things I am aware of are Datomic and Bloom-lang. Neither of them is Spark-based, but they implement Datalog-like languages over distributed stores. - http://www.datomic.com/ - http://bloom-lan

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Priya Ch
I already tried setting spark.driver.allowMultipleContexts to true, but it was not successful. The problem is we have different test suites which of course run in parallel. How do we stop the SparkContext after each test suite and start it in the next test suite, or is there any way to share the SparkContext ac

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha kasireddy
Other than setting the following: sparkConf.set("spark.streaming.unpersist", "true") sparkConf.set("spark.cleaner.ttl", "7200s") On Wed, Nov 4, 2015 at 5:03 PM, swetha wrote: > Hi, > > How to unpersist a DStream in Spark Streaming? I know that we can persist > using dStream.persist() or dStrea

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread Stefano Baghino
What scenario would you like to optimize for? If you have something more specific regarding your use case, the mailing list can surely provide you with some very good advice. If you just want to save an RDD as Avro you can use a module from Databricks (the README on GitHub
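For the Parquet route, not mentioned in the reply but available out of the box in 1.5, a sketch of a round trip through the DataFrame reader/writer; the path and schema are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(id: Long, value: String)

    val sc = new SparkContext(new SparkConf().setAppName("parquet-roundtrip"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
    // Write the RDD out as Parquet via a DataFrame...
    rdd.toDF().write.parquet("hdfs:///tmp/records.parquet")
    // ...and read it back, dropping down to an RDD[Record] again.
    val restored = sqlContext.read.parquet("hdfs:///tmp/records.parquet")
      .rdd
      .map(r => Record(r.getAs[Long]("id"), r.getAs[String]("value")))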

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Yashwanth Kumar
Hi, DStreams (Discretized Streams) are made up of multiple RDDs. You can unpersist each RDD by accessing the individual RDDs using dstream.foreachRDD { rdd => rdd.unpersist() } (see the sketch below). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream-in-Sp
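A minimal sketch of that pattern; the source and batch interval are illustrative, and note that with spark.streaming.unpersist left at its default of true, Spark Streaming already unpersists old batch RDDs on its own:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("unpersist-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.persist()

    lines.foreachRDD { rdd =>
      println(s"batch size: ${rdd.count()}")  // use the cached batch RDD
      rdd.unpersist()                         // then drop its cached blocks
    }

    ssc.start()
    ssc.awaitTermination()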

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Saisai Shao
Hi Swetha, Would you mind elaborating on your usage scenario for DStream unpersisting? From my understanding: 1. Spark Streaming will automatically unpersist outdated data (you already mentioned the configurations). 2. If the streaming job is started, I think you may lose control of the job,