Spark Yarn Java Out Of Memory on Complex Query Execution Plan

2017-09-11 Thread Nimmi Cv
Exception in thread "main" java.lang.OutOfMemoryError at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:1
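
This stack trace usually points at the query's logical plan string growing enormous (for example through a long loop of unions or self-joins), not at the data itself. A minimal sketch of the usual mitigation, assuming that is the cause here (the loop and nextBatch() are hypothetical): Dataset.checkpoint() materializes the data and truncates the lineage, keeping the plan small.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    spark.sparkContext().setCheckpointDir("hdfs:///tmp/checkpoints"); // placeholder path
    Dataset<Row> df = firstBatch;
    for (int i = 1; i < 100; i++) {
        df = df.union(nextBatch(i));   // plan grows every iteration
        if (i % 10 == 0) {
            df = df.checkpoint();      // materialize and cut the lineage
        }
    }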

Unable to save an RDD on S3 with SSE-KMS encryption

2017-09-11 Thread Vikash Pareek
I am trying to save an RDD on S3 with server-side encryption using a KMS key (SSE-KMS), but I am getting the following exception: *Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 695E32175EBA568A, AWS Error Code:
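
A hedged config sketch for SSE-KMS via the s3a connector (assumes Hadoop 2.8+; the exact key property name varies across Hadoop versions, and the bucket, path, and key ARN below are placeholders):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
        // recent Hadoop releases; older ones spell it fs.s3a.server-side-encryption-key
        .set("spark.hadoop.fs.s3a.server-side-encryption.key",
             "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID");
    // ... build the context from conf, then:
    rdd.saveAsTextFile("s3a://my-bucket/path/");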

Re: Multiple vcores per container when running Spark applications in Yarn cluster mode

2017-09-11 Thread Gourav Sengupta
Saisai, thanks a ton :) Regards, Gourav On Mon, Sep 11, 2017 at 11:36 PM, Xiaoye Sun wrote: > Hi Jerry, > > This solves my problem. 🙏 thanks > > On Sun, Sep 10, 2017 at 8:19 PM Saisai Shao wrote: > >> I guess you're using Capacity Scheduler with DefaultResourceCalculator, >> which doesn't c

Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-11 Thread Gourav Sengupta
Hi Matthew, I have read close to 3 TB of data in Parquet format without any issues in EMR. A few questions: 1. What is the EMR version that you are using? 2. How many partitions do you have? 3. How many fields do you have in the table? Are you reading all of them? 4. Is there a way that you can j

Re: Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
I can actually compile the following code with any one of these jars, but none of them seem to print the messages to the console; however, when I use kafka-console-consumer with the same hello topic I can see messages. When I run my Spark code it just hangs here forever, even when I continue producing mes

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
Which version of spark? On Mon, Sep 11, 2017 at 8:27 PM, 张万新 wrote: > Thanks for reply, but using this method I got an exception: > > "Exception in thread "main" > org.apache.spark.sql.streaming.StreamingQueryException: > nondeterministic expressions are only allowed in > > Project, Filter, Agg

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread 张万新
Thanks for the reply, but using this method I got an exception: "Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window" Can you give more advice? Michael Armbrust wrote on Tue, Sep 12, 2017, 4:48 AM

unable to read from Kafka (very strange)

2017-09-11 Thread kant kodali
Hi All, I started using Spark 2.2.0 very recently and now I can't even get the JSON data from Kafka out to the console. I have no clue what's happening; this was working for me on 2.1.1. Here is my code: StreamingQuery query = sparkSession.readStream() .format("kafka") .o
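
A minimal runnable sketch of the console path in Java (broker and topic are placeholders); the query only produces output once it is started and the main thread waits on it:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.streaming.StreamingQuery;

    Dataset<Row> df = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "hello")
        .load();

    StreamingQuery query = df.selectExpr("CAST(value AS STRING)")
        .writeStream()
        .format("console")
        .outputMode("append")
        .option("truncate", false)
        .start();

    query.awaitTermination(); // throws StreamingQueryException; declare or catch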

Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
Hi All, Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0? kafka-clients-0.10.0.1.jar spark-streaming-kafka-0-10_2.11-2.2.0.jar 1) Above two are the only Kafka related jars or am I missing something? 2) What is the difference between the above two jars? 3) If I have th
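
The artifact split matters here: spark-streaming-kafka-0-10_2.11 is the DStream (classic streaming) integration, while Structured Streaming's Kafka source ships separately in spark-sql-kafka-0-10_2.11, which pulls in kafka-clients transitively. A dependency sketch for 2.2.0:

    spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 ...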

Re: Multiple vcores per container when running Spark applications in Yarn cluster mode

2017-09-11 Thread Xiaoye Sun
Hi Jerry, This solves my problem. 🙏 thanks On Sun, Sep 10, 2017 at 8:19 PM Saisai Shao wrote: > I guess you're using Capacity Scheduler with DefaultResourceCalculator, > which doesn't count cpu cores into resource calculation, this "1" you saw > is actually meaningless. If you want to also calc
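
The switch Saisai describes goes in capacity-scheduler.xml (a config sketch; restart the ResourceManager after changing it):

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>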

Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-11 Thread Matthew Anthony
any other feedback on this? On 9/8/17 11:00 AM, Neil Jonkers wrote: Can you provide a code sample please? On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote: Hi all - since upgrading to 2.2.0, we've noticed a significant increase in read.parquet(

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
I agree, Java tends to be pretty verbose unfortunately. You can check the "alternate" approach that should be more compact and readable. Should be something like: df.select(to_json(struct(col("*"))).alias("value")) Of course to_json, struct and col are from the https://spark.apache.org/docs/2.2.0/ap

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-11 Thread Michael Armbrust
The following will convert the whole row to JSON. import org.apache.spark.sql.functions.* df.select(to_json(struct(col("*")))) On Sat, Sep 9, 2017 at 6:27 PM, kant kodali wrote: > Thanks Ryan! In this case, I will have Dataset so is there a way to > convert Row to Json string? > > Thanks > > On
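
The complete expression in Java, for reference (a sketch; assumes a plain, non-streaming Dataset<Row> named df):

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    Dataset<String> json = df
        .select(to_json(struct(col("*"))).alias("value"))
        .as(Encoders.STRING());   // one JSON document per row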

Re: Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread Michael Armbrust
Checkpoints record what has been processed for a specific query, and as such only need to be defined when writing (which is how you "start" a query). You can use the DataFrame created with readStream to start multiple queries, so it wouldn't really make sense to have a single checkpoint there. On
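
A sketch of that shape (paths, broker, and topic are placeholders): one streaming DataFrame feeding two independently checkpointed queries.

    Dataset<Row> source = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "hello")
        .load();

    StreamingQuery toParquet = source.writeStream()
        .format("parquet")
        .option("path", "/data/out")
        .option("checkpointLocation", "/checkpoints/toParquet")  // per-query
        .start();

    StreamingQuery toConsole = source.writeStream()
        .format("console")
        .option("checkpointLocation", "/checkpoints/toConsole")  // per-query
        .start();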

Efficient Spark-Submit planning

2017-09-11 Thread Aakash Basu
Hi, Can someone please clarify how we should effectively calculate the parameters to be passed with spark-submit? Parameters as in - Cores, NumExecutors, DriverMemory, etc. Is there any generic calculation which can be done over most kinds of clusters with different sizes from
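
One commonly cited heuristic (not a universal rule, and the cluster shape below is hypothetical): leave a core and some memory per node for the OS and Hadoop daemons, cap executors at ~5 cores to keep HDFS client throughput healthy, budget ~7% of executor memory for YARN overhead, and reserve one executor's worth of resources for the application master.

    # hypothetical cluster: 10 nodes, 16 cores and 64G RAM per node
    # per node: 15 usable cores, ~63G usable memory
    # 5 cores/executor -> 3 executors/node -> 30 total, minus 1 for the AM
    # 63G / 3 = 21G, minus ~7% overhead -> ~19G per executor
    spark-submit --num-executors 29 --executor-cores 5 \
        --executor-memory 19G --driver-memory 4G ...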

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
import org.apache.spark.sql.functions._ df.withColumn("window", window(current_timestamp(), "15 minutes")) On Mon, Sep 11, 2017 at 3:03 AM, 张万新 wrote: > Hi, > > In structured streaming how can I add a column to a dataset with current > system time aligned with 15 minutes? > > Thanks. >
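
The same thing in Java, for completeness (a sketch): current_timestamp() bucketed into 15-minute windows. Note that in a streaming query this expression can hit the "nondeterministic expressions" error 张万新 reports in this thread.

    import static org.apache.spark.sql.functions.*;

    Dataset<Row> withWindow =
        df.withColumn("window", window(current_timestamp(), "15 minutes"));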

Re: How to convert Row to JSON in Java?

2017-09-11 Thread kant kodali
getValuesMap is not very Java friendly. I have to do something like this String[] fieldNames = row.schema().fieldNames(); Seq fieldNamesSeq = JavaConverters.asScalaIteratorConverter(Arrays.asList(fieldNames).iterator()).asScala().toSeq(); String json = row.getValuesMap(fieldNamesSeq).toString();
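
If the goal is just "row as JSON string" without touching Scala collections, Dataset.toJSON() is the Java-friendly route (a sketch; df is assumed to be a non-streaming Dataset<Row>):

    Dataset<String> json = df.toJSON();  // one JSON string per row
    json.show(false);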

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Eduardo D'Avila
Burak, thanks for the resources. I was thinking that the trigger interval and the sliding window were the same thing, but now I am confused. I didn't know there was a .trigger() method, since the official Programming Guide

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Burak Yavuz
Hi Eduardo, What you have written out is to output counts "as fast as possible" for windows of 5 minute length and with a sliding window of 1 minute. So for a record at 10:13, you would get that record included in the count for 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:17, 10:13-10:18. Plea
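
A sketch separating the two knobs, using the Spark 2.2 API (the events Dataset, its timestamp column, and the console sink are placeholders): the window length and slide belong to the aggregation, while the trigger only controls how often results are emitted.

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.streaming.Trigger;

    Dataset<Row> counts = events
        .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
        .count();

    StreamingQuery query = counts.writeStream()
        .outputMode("complete")
        .format("console")
        .trigger(Trigger.ProcessingTime("1 minute")) // emit once per minute
        .start();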

[Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Eduardo D'Avila
Hi, I'm trying to use Spark 2.1.1 structured streaming to *count the number of records* from Kafka *for each time window* with the code in this GitHub gist. I expected that, *once each minute* (the slide duration), it would *outp

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-11 Thread Cody Koeninger
If you want an "easy" but not particularly performant way to do it, each org.apache.kafka.clients.consumer.ConsumerRecord has a topic. The topic is going to be the same for the entire partition as long as you haven't shuffled, hence the examples on how to deal with it at a partition level. On Fri
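
A sketch of the per-record route with the spark-streaming-kafka-0-10 API (jssc, kafkaParams, the topic names, and handle() are stand-ins); within a partition the topic stays constant until a shuffle, which is why the partition-level examples work:

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("topicA", "topicB"), kafkaParams));

    stream.foreachRDD(rdd ->
        rdd.foreachPartition(records -> {
            while (records.hasNext()) {
                ConsumerRecord<String, String> r = records.next();
                handle(r.topic(), r.value());  // handle() is hypothetical
            }
        }));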

Re:

2017-09-11 Thread Gourav Sengupta
Hi, Why SPARK 1.5? Let me guess, are you using JAVA as well? The world has moved on mate. Regards, Gourav On Fri, Sep 8, 2017 at 8:36 AM, PICARD Damien wrote: > Hi ! > > > > I’m facing a Classloader problem using Spark 1.5.1 > > > > I use javax.validation and hibernate validation annotations

Re: CSV write to S3 failing silently with partial completion

2017-09-11 Thread Gourav Sengupta
Hi, Can you please let me know the following: 1. Why are you using JAVA? 2. The way you are creating the SPARK cluster 3. The way you are initiating SPARK session or context 4. Are you able to query the data that is written to S3 using a SPARK dataframe and validate that the number of rows in the

Bayesian network with Saprk

2017-09-11 Thread Md. Rezaul Karim
Hi All, I am planning to use a Bayesian network to integrate and infer the links between miRNA and proteins based on their expression. Is there any implementation in Spark for the Bayesian network so that I can adapt to feed my data? Regards, _ *Md. Rezaul Kari

Re: Quick Start Guide Syntax Error (Python)

2017-09-11 Thread larister
Hi, I just ran into this as well. I actually fixed it by using `alias` instead. Did you submit a PR? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Why do checkpoints work the way they do?

2017-09-11 Thread Dmitry Naumenko
+1 for me for this question. If there are any constraints on restoring a checkpoint for Structured Streaming, they should be documented. 2017-08-31 9:20 GMT+03:00 张万新 : > So is there any documents demonstrating in what condition can my > application recover from the same checkpoint and in what conditi

How does spark work?

2017-09-11 Thread 陈卓
Hi, I'm a newbie. My Spark cluster has 5 machines with 16G of memory each, so 80G in total, but my data may be more than 900G. The source may be HDFS or MongoDB. I want to know how Spark can put this 900G of data into cluster memory when I have only 80G of total memory space. How does Spark work? Thank

[SS]How to add a column with custom system time?

2017-09-11 Thread 张万新
Hi, In structured streaming how can I add a column to a dataset with current system time aligned with 15 minutes? Thanks.

Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread kant kodali
Hi All, I was wondering if we need to checkpoint both read and write streams when reading from Kafka and inserting into a target store? for example sparkSession.readStream().option("checkpointLocation", "hdfsPath").load() vs dataSet.writeStream().option("checkpointLocation", "hdfsPath") Thank

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
Hi Ayan, yup that works very well; however, I believe Kant's other mail "Queries with streaming sources must be executed with writeStream.start()" is adding more context. I think he is trying to leverage structured streaming, and applying the RDD conversion to a streaming dataset is breaking the strea