[Structured Streaming] How to compute the difference between two rows of a streaming dataframe?

2017-09-29 Thread 张万新
Hi, I want to compute the difference between two rows in a streaming dataframe, is there a feasible API? Maybe some function like the window function *lag* in a normal dataframe, but it seems that this function is unavailable for streaming dataframes. Thanks.
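
For reference, a minimal PySpark sketch of the lag window function on a non-streaming DataFrame (column names here are made up); as the poster notes, this kind of window function is not available on streaming DataFrames at this Spark version:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("lag-example").getOrCreate()
    df = spark.createDataFrame([("a", 1, 10), ("a", 2, 15), ("a", 3, 12)],
                               ["key", "ts", "value"])

    # lag(value) over a per-key window ordered by timestamp
    w = Window.partitionBy("key").orderBy("ts")
    df.withColumn("diff", F.col("value") - F.lag("value", 1).over(w)).show()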

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread vaquar khan
If you're running in cluster mode you need to copy the file across all the nodes or onto the same shared file system. 1) Put it into a distributed filesystem such as HDFS, or via (s)ftp. 2) You have to transfer/sftp the file onto the worker nodes before running the Spark job, and then you have to put as an a
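
A minimal sketch of option 2 above (the path is hypothetical): once the file sits at the same local path on the driver and on every worker, it can be read with a file:// URI; otherwise an hdfs:// path is the safer choice:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("local-file-on-all-nodes").getOrCreate()

    # Works only if /data/input.csv has been copied to the SAME path on the
    # driver and on every worker node; otherwise use an hdfs:// path instead
    df = spark.read.csv("file:///data/input.csv", header=True)
    df.show(5)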

Re: Crash in Unit Tests

2017-09-29 Thread Eduardo Mello
I had this problem at my work. I solved it by increasing the Unix ulimit, because Spark is trying to open too many files. On Sep 29, 2017 5:05 PM, "Anthony Thomas" wrote: > Hi Spark Users, > > I recently compiled Spark 2.2.0 from source on an EC2 m4.2xlarge instance > (8 cores, 32G RAM) ru

Crash in Unit Tests

2017-09-29 Thread Anthony Thomas
Hi Spark Users, I recently compiled Spark 2.2.0 from source on an EC2 m4.2xlarge instance (8 cores, 32G RAM) running Ubuntu 14.04. I'm using Oracle Java 1.8. I compiled using the command: export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m" ./build/mvn -DskipTests -Pnetlib-lgpl clean package

RE: HDFS or NFS as a cache?

2017-09-29 Thread JG Perrin
You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. From: Alexander Czech [mailto:alexander.cz...@googlemail.com] Sent: Friday, September 29, 2017 8:15 AM To: user@spark.apache.org Subject: HDFS or NFS as a cache? I have a
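
A minimal sketch of that pattern (path and data are made up): collect the result to the driver and write it out with plain Python, so no HDFS is needed; this only makes sense when the collected result is small enough to fit in driver memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("collect-to-driver").getOrCreate()
    df = spark.range(100).toDF("id")

    rows = df.collect()                      # results come back to the driver
    with open("/tmp/result.csv", "w") as f:  # plain local file on the driver machine
        for r in rows:
            f.write(str(r["id"]) + "\n")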

RE: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread JG Perrin
On a test system, you can also use something like Owncloud/Nextcloud/Dropbox to ensure that the files are synchronized. Would not do it for TB of data ;) ... -Original Message- From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: Friday, September 29, 2017 5:14 AM To: Gaurav1809 Cc: us

Re: More instances = slower Spark job

2017-09-29 Thread Vadim Semenov
Hi Jeroen, > However, am I correct in assuming that all the filtering will be then performed on the driver (since the .gz files are not splittable), albeit in several threads? Filtering will not happen on the driver, it'll happen on executors, since `spark.read.json(…).filter(…).write(…)` is a se
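
For concreteness, a sketch of the pipeline being discussed (bucket names and the filter condition are made up); the filter is evaluated on the executors as each partition is read, not on the driver:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("json-filter").getOrCreate()

    (spark.read.json("s3a://my-bucket/input/*.json.gz")   # .gz files: one partition per file
          .filter(F.col("status") == "ok")                # runs on the executors
          .write.mode("overwrite")
          .parquet("s3a://my-bucket/output/"))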

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Imran Rajjad
Try Tachyon... it's less fuss. On Fri, 29 Sep 2017 at 8:32 PM lucas.g...@gmail.com wrote: > We use S3, there are caveats and issues with that but it can be made to > work. > > If interested let me know and I'll show you our workarounds. I wouldn't > do it naively though, there's lots of potential

Needed some best practices to integrate Spark with HBase

2017-09-29 Thread Debabrata Ghosh
Dear All, Greetings! I need some best practices for integrating Spark with HBase. Would you be able to point me to some useful resources / URLs at your convenience, please? Thanks, Debu

Re: More instances = slower Spark job

2017-09-29 Thread Gourav Sengupta
I think that the best option is to see whether the DataFrame option for reading JSON files works or not. On Fri, Sep 29, 2017 at 3:53 PM, Alexander Czech < alexander.cz...@googlemail.com> wrote: > Does each gzip file look like this: > > {json1} > {json2} > {json3} > > meaning that each line is a s

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread lucas.g...@gmail.com
We use S3; there are caveats and issues with that, but it can be made to work. If interested, let me know and I'll show you our workarounds. I wouldn't do it naively though, there are lots of potential problems. If you already have HDFS use that, otherwise all things told it's probably less effort t

Re: HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
Yes, I have identified the rename as the problem, that is why I think the extra bandwidth of the larger instances might not help. Also there is a consistency issue with S3 because of how the rename works, so I would probably lose data. On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov wrote: > How

Re: More instances = slower Spark job

2017-09-29 Thread Alexander Czech
Does each gzip file look like this: {json1} {json2} {json3} meaning that each line is a separate JSON object? I process a similar large file batch and what I do is this: input.txt # each line in input.txt represents a path to a gzip file, each containing one JSON object per line my_rdd = sc.par
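
A hedged sketch of the approach described above (file layout and per-line handling are assumptions): distribute the list of gzip paths and let each task read and parse its own file:

    import gzip
    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gzip-json-batch").getOrCreate()
    sc = spark.sparkContext

    # input.txt: one gzip-file path per line (hypothetical layout from the post)
    paths = [line.strip() for line in open("input.txt")]

    def parse_file(path):
        # Runs on an executor: read one gzip file and yield one dict per JSON line.
        # Assumes the path is reachable from every worker (e.g. a shared filesystem).
        with gzip.open(path, "rt") as f:
            for line in f:
                yield json.loads(line)

    records = sc.parallelize(paths, len(paths)).flatMap(parse_file)
    print(records.count())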

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2017-09-29 Thread Vadim Semenov
As an alternative: checkpoint the dataframe, collect the days, then delete the corresponding directories using Hadoop FileUtils, then write the dataframe. On Fri, Sep 29, 2017 at 10:31 AM, peay wrote: > Hello, > > I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet") > to write
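
A hedged sketch of that workaround from PySpark (paths are hypothetical); it reaches the Hadoop FileSystem API through the JVM gateway rather than Java's FileUtils directly:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("overwrite-partitions").getOrCreate()

    df = spark.read.parquet("staging.parquet")   # the new month of data
    df = df.cache()                              # or df.checkpoint() to materialize it first

    # Collect the affected days, then delete just those partition directories
    days = [r["day"] for r in df.select("day").distinct().collect()]

    jvm = spark._jvm
    conf = spark._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
    for day in days:
        path = jvm.org.apache.hadoop.fs.Path("dataset.parquet/day={}".format(day))
        if fs.exists(path):
            fs.delete(path, True)                # recursive delete of one partition

    # Append only the new partitions
    df.write.partitionBy("day").mode("append").parquet("dataset.parquet")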

Re: HDFS or NFS as a cache?

2017-09-29 Thread Vadim Semenov
How many files do you produce? I believe it spends a lot of time on renaming the files because of the output committer. Also, instead of 5x c3.2xlarge try 2x c3.8xlarge, because they have 10GbE and you can get good throughput to S3. On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech < alex
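
If the commit-phase renames are the bottleneck, one commonly used mitigation (a hedged sketch, not a complete fix for S3's consistency caveats) is the version-2 FileOutputCommitter algorithm, which moves task output straight into the destination instead of going through a second, job-level rename pass:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-committer")
             # Version 2 commits task output directly to the final location,
             # skipping the job-level rename pass.
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())

    spark.range(1000).write.mode("overwrite").parquet("s3a://my-bucket/parquet/")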

Saving dataframes with partitionBy: append partitions, overwrite within each

2017-09-29 Thread peay
Hello, I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet") to write a dataset while splitting by day. I would like to run a Spark job to process, e.g., a month: dataset.parquet/day=2017-01-01/... ... and then run another Spark job to add another month using the same
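
For concreteness, a minimal sketch of the write pattern being described (column and path names follow the post); the open question is how to make a later run replace only the days it touches rather than append duplicates or overwrite the whole dataset:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

    january = spark.createDataFrame(
        [("2017-01-01", 1), ("2017-01-02", 2)], ["day", "value"])

    # First run: creates dataset.parquet/day=2017-01-01/, day=2017-01-02/, ...
    january.write.partitionBy("day").mode("append").save("dataset.parquet")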

Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Yana Kadiyska
One thing to note, if you are using Mesos, is that the version of Mesos changed from 0.21 to 1.0.0. So taking a newer Spark might push you into larger infrastructure upgrades On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D wrote: > Hello All, > > Currently our Batch ETL Jobs are in Spark 1.6.

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Alexander Czech
Yes, you need to store the file at a location where it is equally retrievable ("same path") for the master and all nodes in the cluster. A simple solution (apart from HDFS) that does not scale too well but might be OK with only 3 nodes like in your configuration is a network accessible storage (a

HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet files to S3. But the S3 performance, for various reasons, is bad when I access S3 through the parquet write method: df.write.parquet('s3a://bucket/parquet') Now I want to set up a small cache for the parquet output. One o
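
A hedged sketch of the "HDFS as a cache" idea (paths and the copy step are assumptions): write the Parquet output to HDFS first, where commit renames are cheap, then push the finished files to S3 in one bulk pass, e.g. with Hadoop distcp:

    import subprocess

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-cache").getOrCreate()

    df = spark.range(1000).toDF("id")

    # 1) Write to HDFS, where renames during the commit phase are cheap
    df.write.mode("overwrite").parquet("hdfs:///tmp/parquet_cache/")

    # 2) Copy the finished output to S3 in bulk (assumes distcp and S3
    #    credentials are configured on the cluster)
    subprocess.check_call([
        "hadoop", "distcp",
        "hdfs:///tmp/parquet_cache/",
        "s3a://my-bucket/parquet/",
    ])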

Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Gokula Krishnan D
Do you see any changes or improvements in the *Core API* in Spark 2.x when compared with Spark 1.6.0? Thanks & Regards, Gokula Krishnan (Gokul) On Mon, Sep 25, 2017 at 1:32 PM, Gokula Krishnan D wrote: > Thanks for the reply. Forgot to mention that our Batch ETL Jobs are in > Core-Spark

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Arun Rai
Or you can try mounting that drive on all nodes. On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke wrote: > You should use a distributed filesystem such as HDFS. If you want to use > the local filesystem then you have to copy each file to each node. > > > On 29. Sep 2017, at 12:05, Gaurav1809 wrote: >

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Jörn Franke
You should use a distributed filesystem such as HDFS. If you want to use the local filesystem then you have to copy each file to each node. > On 29. Sep 2017, at 12:05, Gaurav1809 wrote: > > Hi All, > > I have multi node architecture of (1 master,2 workers) Spark cluster, the > job runs to rea

[Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Gaurav1809
Hi All, I have a multi-node Spark cluster (1 master, 2 workers); the job reads CSV file data and it works fine when run in local mode (local[*]). However, when the same job is run in cluster mode (spark://HOST:PORT), it is not able to read it. I want to know how to reference t

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Sathishkumar Manimoorthy
Place it in HDFS and give the reference path in your code. Thanks, Sathish On Fri, Sep 29, 2017 at 3:31 PM, Gaurav1809 wrote: > Hi All, > > I have multi node architecture of (1 master,2 workers) Spark cluster, the > job runs to read CSV file data and it works fine when run on local mode > (Loca
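
A minimal sketch of what that looks like in practice (the HDFS path is hypothetical): upload the CSV once, then point the job at the hdfs:// URI so the master and both workers resolve the same location:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-from-hdfs").getOrCreate()

    # e.g. first: hdfs dfs -put data.csv /data/data.csv
    df = spark.read.csv("hdfs:///data/data.csv", header=True, inferSchema=True)
    df.show(5)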

Re: Replicating a row n times

2017-09-29 Thread Weichen Xu
I suggest you use `monotonicallyIncreasingId`, which is highly efficient. But note that the IDs it generates will not be consecutive. On Fri, Sep 29, 2017 at 3:21 PM, Kanagha Kumar wrote: > Thanks for the response. > I can use either row_number() or monotonicallyIncreasingId to generate > uniqueI
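
A minimal PySpark sketch (the column name is arbitrary); note, as above, that the generated IDs are unique and increasing but not consecutive:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("unique-ids").getOrCreate()

    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

    # IDs are unique per row but not consecutive (they encode the partition id)
    df.withColumn("uid", F.monotonically_increasing_id()).show()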

Structured Streaming and Hive

2017-09-29 Thread HanPan
Hi guys, I'm new to Spark Structured Streaming. I'm using 2.1.0 and my scenario is reading a specific topic from Kafka, doing some data mining tasks, then saving the result dataset to Hive. While writing data to Hive, it seems like this is not supported yet, and I tried this: It run
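
One workaround sometimes used on Spark 2.1 (a hedged sketch; the topic, broker, paths and schema handling are assumptions, and a direct Hive sink is indeed not supported at this version) is to stream the result into Parquet files and expose that directory to Hive as an external table:

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("kafka-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    # Requires the spark-sql-kafka-0-10 connector package on the classpath
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "my_topic")
              .load()
              .select(F.col("value").cast("string").alias("payload")))

    # The file sink is supported in 2.1; Hive can then read the directory as an
    # external table (CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION ...)
    query = (stream.writeStream
             .format("parquet")
             .option("path", "hdfs:///warehouse/my_topic_parquet")
             .option("checkpointLocation", "hdfs:///checkpoints/my_topic")
             .outputMode("append")
             .start())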

Re: Applying a Java script to many files: Java API or also Python API?

2017-09-29 Thread Weichen Xu
Although Python can launch a subprocess to run Java code, in PySpark the processing code which needs to run in parallel on the cluster has to be written in Python. For example, in PySpark: def f(x): ... rdd.map(f) # The function `f` must be pure Python code. If you try to launch a subprocess to
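
A minimal sketch of the point being made (function and data are made up): the function passed to rdd.map is pickled and executed by Python worker processes on the executors, so it has to be plain Python:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pure-python-map").getOrCreate()
    sc = spark.sparkContext

    def f(x):
        # Pure Python: this body is shipped to and run inside the executors'
        # Python worker processes, not in the JVM.
        return x * x

    print(sc.parallelize(range(10)).map(f).collect())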

Re: Replicating a row n times

2017-09-29 Thread Kanagha Kumar
Thanks for the response. I can use either row_number() or monotonicallyIncreasingId to generate unique IDs, as in https://hadoopist.wordpress.com/2016/05/24/generate-unique-ids-for-each-rows-in-a-spark-dataframe/ I'm looking for a Java example that uses that to replicate a single row n times by appendi
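
The question asks for Java, but a hedged PySpark sketch of one common approach (the DataFrame and n are made up) shows the idea, which translates directly to the Java API: attach an array of n literals and explode it, which yields n copies of each row:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("replicate-rows").getOrCreate()

    df = spark.createDataFrame([("a", 1)], ["key", "value"])
    n = 3

    # explode() over an n-element array duplicates the row n times;
    # the exploded column doubles as a 0..n-1 copy index
    replicated = df.withColumn("copy", F.explode(F.array(*[F.lit(i) for i in range(n)])))
    replicated.show()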