Clearing usercache on EMR [pyspark]

2018-08-01 Thread Shuporno Choudhury
Hi everyone, I am running Spark jobs on EMR (using pyspark). I noticed that after running jobs, the size of the usercache (basically the filecache folder) keeps increasing (with directory names 1,2,3,4,5,...). Directory location: /mnt/yarn/usercache/hadoop/filecache/ Is there a way to
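
One way to bound that cache (a sketch only, not EMR-specific advice) is YARN's localizer cache settings in yarn-site.xml: the NodeManager's cleanup service periodically evicts localized files that are no longer referenced by running containers once the cache exceeds the target size. Values below are examples, not recommendations.

    <!-- yarn-site.xml: cap the local filecache and let the NodeManager clean it up. -->
    <property>
      <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
      <value>5120</value>   <!-- example value; default is 10240 (10 GB) -->
    </property>
    <property>
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>300000</value> <!-- example value; default is 600000 (10 min) -->
    </property>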

How to make Yarn dynamically allocate resources for Spark

2018-08-01 Thread Anton Puzanov
Hi everyone, I have a cluster managed with Yarn that runs Spark jobs; the components were installed using Ambari (2.6.3.0-235). I have 6 hosts, each with 6 cores, and I use the Fair scheduler. I want Yarn to automatically add/remove executor cores, but no matter what I do it doesn't work. Relevant Spark confi
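
For reference, a minimal sketch of the settings dynamic allocation usually needs (assuming the YARN spark_shuffle aux service is configured on the NodeManagers; min/max values below are made up for a 6x6-core cluster). Note that Spark's dynamic allocation adds and removes whole executors rather than resizing an executor's cores.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-allocation-sketch")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.shuffle.service.enabled", "true")  # external shuffle service must run on each node
             .config("spark.dynamicAllocation.minExecutors", "1")
             .config("spark.dynamicAllocation.maxExecutors", "12")
             .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
             .getOrCreate())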

How to use window method with direct kafka streaming?

2018-08-01 Thread fat.wei
Hi everyone, I have the following scenario, and I tried to use the window method with direct kafka streaming. The program runs, but not correctly! 1. The data is stored in kafka. 2. Every single item of the data has its primary key. 3. Every single item of the data will be split into multipe
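
For context, a minimal sketch of windowing over a direct Kafka stream with the old Kafka 0.8 DStream API (topic, broker and checkpoint path below are made up). The window and slide durations must be multiples of the batch interval.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # Kafka 0.8 direct API (Spark 2.x)

    sc = SparkContext(appName="kafka-window-sketch")
    ssc = StreamingContext(sc, 10)                  # 10s batch interval
    ssc.checkpoint("/tmp/kafka-window-checkpoint")  # hypothetical path; recommended for windowed ops

    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker1:9092"})  # hypothetical topic/broker

    # 60s windows sliding every 20s, over the message values.
    windowed = stream.map(lambda kv: kv[1]).window(60, 20)
    windowed.count().pprint()

    ssc.start()
    ssc.awaitTermination()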

Data quality measurement for streaming data with apache spark

2018-08-01 Thread Uttam
Hello, I have a very general question about Apache Spark. I want to know if it is possible (and where to start, if so) to implement a data quality measurement prototype for streaming data using Apache Spark. Let's say I want to work on Timeliness or Completeness as data quality metrics, is si
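
One possible starting point (a sketch only, with a made-up socket source and a naive notion of Completeness as the share of non-empty records seen so far) is a streaming aggregation in Structured Streaming:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("dq-sketch").getOrCreate()

    # Hypothetical source; any streaming source yields the same pattern.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    # Completeness: non-empty records as a fraction of all records.
    quality = lines.agg(
        F.count("*").alias("total"),
        (F.count(F.when(F.trim("value") != "", True)) / F.count("*")).alias("completeness"))

    # Aggregations without watermark require complete output mode.
    query = (quality.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()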

Re: How to add a new source to an existing structured streaming application, like a kafka source

2018-08-01 Thread Robb Greathouse
How to unsubscribe? On Mon, Jul 30, 2018 at 3:13 AM 杨浩 wrote: > How to add a new source to an existing structured streaming application, like a kafka source -- Robb Greathouse, Middleware BU, 505-507-4906

Re: How to add a new source to an existing structured streaming application, like a kafka source

2018-08-01 Thread David Rosenstrauch
On 08/01/2018 12:36 PM, Robb Greathouse wrote: > How to unsubscribe? List-Unsubscribe: - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-01 Thread Nirav Patel
Hi Peay, Have you found a better solution yet? I am having the same issue. The following says it works with Spark 2.1 onward, but only when you use sqlContext and not Dataframe: https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a Thanks, Nirav On Mon, Oct 2, 2017 at 4:37 AM,

Overwrite only specific partition with hive dynamic partitioning

2018-08-01 Thread Nirav Patel
Hi, I have a Hive partitioned table created using sparkSession. I would like to insert/overwrite Dataframe data into a specific set of partitions without losing any other partition. In each run I have to update a set of partitions, not just one. E.g. I have a dataframe with bid=1, bid=2, bid=3 in first time
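
For what it's worth, a hedged sketch of the Hive dynamic-partitioning route (table name and data are made up, and insertInto matches columns by position, partition column last): with nonstrict dynamic partition mode, an overwrite insert into a Hive table should replace only the partitions present in the dataframe.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partition-sketch")
             .config("hive.exec.dynamic.partition", "true")
             .config("hive.exec.dynamic.partition.mode", "nonstrict")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical data for a table partitioned by bid.
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["payload", "bid"])

    # Overwrites only partitions bid=1 and bid=2; other bid partitions remain.
    df.write.mode("overwrite").insertInto("mydb.events_by_bid")  # hypothetical table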

RE: Split a row into multiple rows Java

2018-08-01 Thread nookala
Pivot seems to do the opposite of what I want, converting rows to columns. I was able to get this done in python, but would like to do this in Java: idfNew = idf.rdd.flatMap(lambda row: [(row.Name, row.Id, row.Date, "0100", row["0100"]), (row.Name, row.Id, row.Date, "0200", row["0200"]), (row.Name, row.Id, row

Re: Split a row into multiple rows Java

2018-08-01 Thread Anton Puzanov
You can always use array+explode; I don't know if it's the most elegant/optimal solution (would be happy to hear from the experts). Code example: // create data Dataset<Row> test = spark.createDataFrame(Arrays.asList(new InternalData("bob", "b1", 1, 2, 3), new InternalData("alive", "c1", 3, 4, 6),
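
To make the suggestion concrete, a sketch of array+explode in PySpark (the thread wants Java, but the same functions exist on the Java Dataset API; column names are guessed from the python snippet above): pack the per-slot values into an array of structs, then explode so each struct becomes its own row.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("explode-sketch").getOrCreate()
    idf = spark.createDataFrame(
        [("bob", "b1", "2018-08-01", 10, 20)],
        ["Name", "Id", "Date", "0100", "0200"])  # hypothetical schema

    exploded = (idf.select(
        "Name", "Id", "Date",
        F.explode(F.array(
            F.struct(F.lit("0100").alias("slot"), F.col("0100").alias("value")),
            F.struct(F.lit("0200").alias("slot"), F.col("0200").alias("value")))).alias("s"))
        .select("Name", "Id", "Date", "s.slot", "s.value"))
    exploded.show()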

Spark Memory Requirement

2018-08-01 Thread msbreuer
Many threads talk about memory requirements, and most often the answer is to add more memory to Spark. My understanding of Spark is that it is a scalable analytics engine, able to utilize the assigned resources and to calculate the correct answer. So assigning cores and memory may speed up a task. I am usi
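
For anyone landing here, these are the knobs that "add more memory" usually means (values below are arbitrary examples, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("memory-sketch")
             .config("spark.executor.memory", "4g")            # JVM heap per executor
             .config("spark.executor.memoryOverhead", "512m")  # off-heap overhead (spark.yarn.executor.memoryOverhead before 2.3)
             .config("spark.memory.fraction", "0.6")           # heap share for execution + storage
             .getOrCreate())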

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-01 Thread Koert Kuipers
This works for dataframes with Spark 2.3 by changing a global setting, and will be configurable per write in 2.4. See: https://issues.apache.org/jira/browse/SPARK-20236 https://issues.apache.org/jira/browse/SPARK-24860 On Wed, Aug 1, 2018 at 3:11 PM, Nirav Patel wrote: > Hi Peay, > > Have you fin
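
Concretely, a sketch of the Spark 2.3 global setting from SPARK-20236 (path, column names and data below are placeholders): with partitionOverwriteMode set to dynamic, an overwrite only replaces the partitions the dataframe actually contains.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dynamic-overwrite-sketch").getOrCreate()
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")  # Spark 2.3+ global setting

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["bid", "payload"])  # hypothetical data

    # Only partitions bid=1 and bid=2 are rewritten; bid=3 from an earlier run survives.
    (df.write.mode("overwrite")
       .partitionBy("bid")
       .parquet("/data/events_by_bid"))  # hypothetical path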

unsubscribe

2018-08-01 Thread Eco Super
Hi User, unsubscribe me