Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Aakash Basu
Hi, I have a dataset where a few rows of column F, as shown below, contain line breaks in the CSV file. [image: Inline image 1] When Spark reads it, each break comes through as a completely new line, as shown below. [image: Inline image 2] I want my PySpark 2.1.0 to read it by forcefully avoiding the line breaks

[SS] How to know what events were late in a streaming batch?

2017-09-03 Thread Jacek Laskowski
Hi, I've asked this question on SO [1], but hope to catch more attention by posting here. I'd like to know how many events were late in a streaming batch in Structured Streaming. Is there a way to know the number or (better) exactly which events were late? Thanks for any help you may offer! [1] htt
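There is no definitive answer in this digest; as a minimal sketch (not the thread's resolution), one way to observe lateness per micro-batch is to define a watermark and then inspect the query's progress report, whose eventTime block shows the batch's min/max event time next to the current watermark. The rate source below is only a stand-in for the real event stream, and the numRowsDroppedByWatermark field is an assumption that only holds on much newer Spark releases than 2.x:

# Sketch only: watch per-batch event-time stats against the watermark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("late-events-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")                      # toy source; stands in for the real event stream
          .option("rowsPerSecond", 10)
          .load()
          .withWatermark("timestamp", "10 seconds"))

counts = events.groupBy(window("timestamp", "5 seconds")).count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())

# After a few batches, inspect the most recent progress report:
progress = query.lastProgress
if progress:
    print(progress.get("eventTime"))          # min/max/avg event time plus the current watermark
    for op in progress.get("stateOperators", []):
        # reported only by newer Spark versions (assumption, roughly 3.1+)
        print(op.get("numRowsDroppedByWatermark"))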

Re: Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Riccardo Ferrari
Hi Aakash, What I see in the picture seems correct. Spark (pyspark) is reading your F2 cell as multi-line text. Where are the nulls you're referring to? You might find pyspark.sql.functions.regexp_replace
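A minimal sketch of the regexp_replace route Riccardo mentions, assuming the affected column is literally named "F" and that collapsing the embedded newlines to a space is acceptable; the column name, file path and replacement string are placeholders, not details from the thread. Note that the multiLine CSV option, which keeps quoted fields containing newlines inside a single record, only appeared in later Spark releases (2.2+, if I recall correctly), so on 2.1.0 a post-read cleanup like this is a common workaround:

# Sketch only: strip embedded newlines from a string column after reading the CSV.
# The column name "F", the file path and the replacement " " are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.appName("csv-linebreak-sketch").getOrCreate()

df = spark.read.csv("data.csv", header=True)
cleaned = df.withColumn("F", regexp_replace(col("F"), "[\\r\\n]+", " "))
cleaned.show(truncate=False)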

Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-03 Thread prtimsina
Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this: A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel. 1) Validate each algorithm using 10-fold cross-validation. B) Feed the output of step A) into a second layer
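A minimal sketch of one way to do this, not taken from the thread: wrap each estimator in a 10-fold CrossValidator and submit the fits concurrently from separate driver threads against the same SparkSession. The dataset path, the particular estimators and the thread-pool size are assumptions for illustration:

# Sketch only: fit several classifiers in parallel, each validated with 10-fold CV.
# The threads merely submit the fits concurrently; the actual work runs on the cluster.
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("parallel-ml-sketch").getOrCreate()
train = spark.read.format("libsvm").load("data.libsvm")   # placeholder dataset

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

def cross_validate(estimator):
    cv = CrossValidator(estimator=estimator,
                        estimatorParamMaps=ParamGridBuilder().build(),
                        evaluator=evaluator,
                        numFolds=10)
    return cv.fit(train)

estimators = [NaiveBayes(), RandomForestClassifier()]
with ThreadPoolExecutor(max_workers=len(estimators)) as pool:
    models = list(pool.map(cross_validate, estimators))

# The fitted models' predictions could then be stacked as input to a second-layer learner.

Later Spark releases also added a parallelism parameter on CrossValidator itself (2.3+, if I recall correctly) for evaluating folds concurrently.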

Port to open for submitting Spark on Yarn application

2017-09-03 Thread Satoshi Yamada
Hi, In the case where we run Spark on YARN in client mode, there is a firewall around the Hadoop cluster, and the client node is outside the firewall, I think I have to open some ports that the Application Master uses. I think the port is specified by "spark.yarn.am.port", as the documentation says. https://spark.apache.org/docs/la

java heap space

2017-09-03 Thread KhajaAsmath Mohammed
Hi, I am getting a java.lang.OutOfMemoryError: Java heap space error whenever I run the Spark SQL job. I came to the conclusion that the issue is caused by the number of files Spark has to read. I am reading 37 partitions, and each partition has around 2000 files with a file size of more than 128 MB; 37*2000 files from

Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Saisai Shao
I think spark.yarn.am.port is not used any more, so you don't need to consider it. If you're running Spark on YARN, the YARN RM port for submitting applications should also be reachable through the firewall, as well as the HDFS port for uploading resources. Also, on the Spark side, executors will be connect
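As a hedged sketch of the client-mode half of this: the driver-side ports that executors connect back to can be pinned to fixed values so the firewall only needs a small, known range open. The port numbers below are placeholders, not recommendations from the thread:

# Sketch only: fix the driver-facing ports used by executors connecting back.
# Port numbers are placeholders; the YARN RM and HDFS ports mentioned above are
# cluster-side settings and are not shown here.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-client-ports-sketch")
         .config("spark.driver.port", "40000")        # executor -> driver RPC
         .config("spark.blockManager.port", "40010")  # block manager traffic
         .config("spark.port.maxRetries", "16")       # consecutive ports Spark may try
         .getOrCreate())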

Re: java heap space

2017-09-03 Thread 周康
Maybe you can repartition? 2017-09-04 9:25 GMT+08:00 KhajaAsmath Mohammed : > Hi, > > I am getting a java.lang.OutOfMemoryError: Java heap space error whenever I > run the spark sql job. > > I came to the conclusion that the issue is caused by the number of files > Spark has to read. > > I am reading 37 partitio
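A minimal sketch of the repartition suggestion, with placeholders for the path, format, query and partition count (the heap sizes themselves are best raised through spark-submit's --driver-memory / --executor-memory, since they cannot be changed after the JVM has started):

# Sketch only: consolidate the very large number of input files (37 partitions x ~2000 files)
# before running the SQL. Path, format, partition count and query are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heap-space-sketch").getOrCreate()

df = spark.read.parquet("/path/to/table")   # placeholder path and format
df = df.repartition(400)                    # placeholder partition count
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) FROM t").show()  # placeholder query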

Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Satoshi Yamada
Jerry, Thanks for your comment. On Mon, Sep 4, 2017 at 10:43 AM, Saisai Shao wrote: > I think spark.yarn.am.port is not used any more, so you don't need to > consider this. > > If you're running Spark on YARN, I think some YARN RM port to submit > applications should also be reachable via firew

Re: Spark GroupBy Save to different files

2017-09-03 Thread Pralabh Kumar
Hi Arun, rdd1.groupBy(_.city).map(s => (s._1, s._2.toList.toString())).toDF("city", "data").write.partitionBy("city").csv("/data") should work for you. Regards, Pralabh On Sat, Sep 2, 2017 at 7:58 AM, Ryan wrote: > you may try foreachPartition > > On Fri, Sep 1, 2017 at 10:54 PM, asethia wrote
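For completeness, a hedged PySpark equivalent of Pralabh's Scala snippet, using DataFrameWriter.partitionBy to write one sub-directory per city (input path, output path and column name are placeholders):

# Sketch only: write one output sub-directory per distinct "city" value.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitionby-sketch").getOrCreate()

df = spark.read.csv("/input/records.csv", header=True)   # placeholder input
df.write.partitionBy("city").mode("overwrite").csv("/data")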

sparkR 3rd-party library

2017-09-03 Thread patcharee
Hi, I am using spark.lapply to execute an existing R script in standalone mode. The script calls the function 'rbga' from a third-party library, 'genalg'. This rbga function works fine in the sparkR environment when I call it directly, but when I apply it through spark.lapply I get the error could not find function