Re: Spark Dataframe returning null columns when schema is specified

2017-09-07 Thread Praneeth Gayam
What is the desired behaviour when a field is null for only a few records? You cannot avoid nulls in this case. But if all rows are guaranteed to be uniform (either all-null or all-non-null), you can *take* the first row of the DF and *drop* the columns with null fields. On Fri, Sep 8, 2017 at 12:
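The first-row approach above can be sketched as follows; a minimal sketch assuming rows really are uniform, where `non_null_columns` mirrors what you would compute from `df.first().asDict()` (all column names here are illustrative):

```python
def non_null_columns(first_row: dict) -> list:
    """Return the names of columns whose value is not None in the given row."""
    return [c for c, v in first_row.items() if v is not None]

# With a real DataFrame (assuming uniform rows), the equivalent would be:
#   first = df.first().asDict()
#   df_clean = df.select(*non_null_columns(first))

row = {"id": 1, "name": "a", "optional": None}
print(non_null_columns(row))  # ['id', 'name']
```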

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Praneeth Gayam
With a file stream you will have to deal with the following: 1. The file(s) must not be changed once created, so if the files are being continuously appended, the new data will not be read. Refer 2. The

Spark ML DAG Pipelines

2017-09-07 Thread Srikanth Sampath
Hi Spark Experts, Can someone point me to some examples of non-linear (DAG) ML pipelines? That would be of great help. Thanks much in advance -Srikanth

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Sunita Arvind
Thanks for your response, Michael. Will try it out. Regards, Sunita On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust wrote: > If you use structured streaming and the file sink, you can have a > subsequent stream read using the file source. This will maintain exactly > once processing even if ther
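The file-sink-to-file-source chaining Michael describes could look roughly like this in PySpark; a hedged sketch, assuming an active SparkSession and an illustrative one-column schema — every path, directory, and name below is a placeholder:

```python
def chain_streams(spark, in_dir, mid_dir, out_dir, chk1, chk2):
    """Sketch: stage 1 writes to a file sink; stage 2 reads that directory
    back as a file source, preserving exactly-once semantics across jobs."""
    from pyspark.sql.types import StructType, StructField, StringType
    schema = StructType([StructField("value", StringType())])

    # Stage 1: read raw JSON files, write results to an intermediate directory.
    stage1 = (spark.readStream.schema(schema).json(in_dir)
              .writeStream.format("parquet")
              .option("path", mid_dir)
              .option("checkpointLocation", chk1)
              .start())

    # Stage 2: a separate query uses stage 1's sink as its own file source.
    stage2 = (spark.readStream.schema(schema).parquet(mid_dir)
              .writeStream.format("parquet")
              .option("path", out_dir)
              .option("checkpointLocation", chk2)
              .start())
    return stage1, stage2
```

Each stage needs its own checkpoint location, since they are independent streaming queries.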

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Denis Magda
Hello Anjaneya, Marco, Honestly, I’m not aware if any video broadcasting or recording is planned. Could you go to the meetup page [1] and raise the question there? Just in case, here you can find a list of all upcoming Ignite related events [2]. Probably some of them will be in close proximi

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Marco Mistroni
Hi, will there be a podcast to view afterwards for remote EMEA users? Kr On Sep 7, 2017 12:15 AM, "Denis Magda" wrote: > Folks, > > Those who are craving for mind food this weekend come over the meetup - > Santa Clara, Sept 9, 9.30 AM: > https://www.meetup.com/datariders/events/242523245/?a=soc

Spark Dataframe returning null columns when schema is specified

2017-09-07 Thread ravi6c2
Hi All, I have a problem where the Spark DataFrame has null columns for attributes that are not present in the JSON. A clear explanation is provided below: *Use case:* Convert the JSON object into a dataframe for further usage. *Case - 1:* Without specifying the schema for JSON: records.
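The null-filling behaviour described here can be illustrated without Spark; a minimal sketch of the semantics, where `apply_schema` mimics what `spark.read.schema(my_schema).json(path)` does for fields the schema declares but the JSON omits (all field names are illustrative):

```python
import json

def apply_schema(record: dict, fields: list) -> dict:
    """Mimic Spark's behaviour when a schema is specified: every schema field
    appears in the result, and fields missing from the JSON become None."""
    return {f: record.get(f) for f in fields}

rec = json.loads('{"name": "alice", "age": 30}')
print(apply_schema(rec, ["name", "age", "email"]))
# {'name': 'alice', 'age': 30, 'email': None}
```

This is expected behaviour, not data loss: a declared field with no matching JSON attribute is null by design.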

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Mcclintic, Abbi
Thanks all – a couple of notes below. Generally all our partitions are of equal size (i.e. on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with repartitioning and without – in this example we are repartitioning to 10 but we also see the proble

Spark UI to use Marathon assigned port

2017-09-07 Thread Sunil Kalyanpur
Hello all, I am running PySpark Job (v2.0.2) with checkpoint enabled in Mesos cluster and am using Marathon for orchestration. When the job is restarted using Marathon, Spark UI is not getting started at the port specified by Marathon. Instead, it is picking port from the checkpoint. Is there a
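One workaround, sketched below under the assumption that Marathon exposes its first allocated port through the `PORT0` environment variable, is to set `spark.ui.port` explicitly when (re)building the context, so the stale value restored from the checkpoint does not win:

```python
import os

def build_conf():
    """Sketch: read the Marathon-assigned port at startup and pin the Spark UI
    to it, rather than the port recorded in the checkpoint. PORT0 is the env
    var Marathon sets for the first port it allocates to the task."""
    from pyspark import SparkConf
    conf = SparkConf()
    conf.set("spark.ui.port", os.environ.get("PORT0", "4040"))
    return conf
```

This is a sketch, not a definitive fix: some properties restored from a streaming checkpoint can still override values set at construction time, so verify the effective port in the driver logs after restart.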

Re: graphframe out of memory

2017-09-07 Thread Lukas Bradley
Did you also increase the size of the heap of the Java app that is starting Spark? https://alvinalexander.com/blog/post/java/java-xmx-xms-memory-heap-size-control On Thu, Sep 7, 2017 at 12:16 PM, Imran Rajjad wrote: > I am getting Out of Memory error while running connectedComponents job on > g
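Worth spelling out for the embedded case: when Spark runs in local mode inside a standalone Java application, the driver heap *is* the application's own JVM heap, so it has to be raised with `-Xmx` at launch; setting memory through `SparkConf` after the JVM is already up has no effect. An illustrative launch command (the jar and class names are placeholders):

```
java -Xmx8g -cp my-graph-app.jar com.example.GraphJob
```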

graphframe out of memory

2017-09-07 Thread Imran Rajjad
I am getting an Out of Memory error while running a connectedComponents job on a graph with around 12000 vertices and 134600 edges. I am running Spark in embedded mode in a standalone Java application and have tried to increase the memory, but it seems that it's not taking any effect: sparkConf = new SparkC

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Patrick Alwell
Sounds like an S3 bug. Can you replicate locally with HDFS? Try using the S3A protocol too; there is a jar you can leverage like so: spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py EMR can sometimes be buggy. :/ You could also try le

RE: CSV write to S3 failing silently with partial completion

2017-09-07 Thread JG Perrin
Are you assuming that all partitions are of equal size? Did you try with more partitions (like repartitioning)? Does the error always happen with the last (or smaller) file? If you are sending to redshift, why not use the JDBC driver? -Original Message- From: abbim [mailto:ab...@amazon.c
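The JDBC route suggested above might look like this in PySpark; a sketch, not a definitive implementation — it assumes the Redshift JDBC driver jar is on the classpath, and every URL, table name, and credential below is a placeholder:

```python
def write_to_redshift(df, jdbc_url, table, user, password):
    """Write a DataFrame to Redshift over JDBC instead of staging CSV on S3.
    A failed write raises an error rather than silently leaving partial output."""
    (df.write
       .format("jdbc")
       .option("url", jdbc_url)      # e.g. jdbc:redshift://host:5439/db
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("append")
       .save())
```

For large volumes the S3-staging path is usually faster than row-by-row JDBC inserts, so this trades throughput for the write-failure visibility discussed in this thread.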

Pyspark UDF causing ExecutorLostFailure

2017-09-07 Thread nicktgr15
Hi, I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as follows: from pyspark.sql.functions import col, udf from pyspark.sql.types import StringType path = 's3://some/parquet/dir/myfile.parquet' df = spark.read.load(path) def _test_udf(useragent): return useragent.upp
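The preview cuts off mid-definition; below is a completed, null-safe sketch of such a UDF. The plain function can be tested on its own, and the commented lines show how it would be wired into the DataFrame (the column name `useragent` is taken from the snippet; everything else is illustrative). Note that Spark passes `None` for null cells, and calling `.upper()` on `None` raises inside the executor — one common source of Python-UDF task failures:

```python
def to_upper(useragent):
    """Null-safe UDF body: return the upper-cased user agent, or None."""
    return useragent.upper() if useragent is not None else None

# Wiring it into the DataFrame (assuming an active SparkSession):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import StringType
#   df = df.withColumn("ua_upper", udf(to_upper, StringType())(col("useragent")))

print(to_upper("Mozilla/5.0"))  # MOZILLA/5.0
print(to_upper(None))           # None
```

An ExecutorLostFailure is often memory-related rather than a UDF bug, since Python UDFs run worker processes alongside the JVM; if the null-safe version still fails, check the executor memory overhead settings.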