Time series data

2018-05-23 Thread amin mohebbi
Could you please help me understand the performance we would get from using Spark with any NoSQL store or TSDB? We receive 1 mil meters x 288 readings = 288 mil rows (approx. 360 GB per day) – therefore, we will end up with 10's or 100's of TBs of data, and I feel that NoSQL will be much quicker than

Write data from Hbase using Spark Failing with NPE

2018-05-23 Thread Alchemist
I am using Spark to write data to HBase. I can read data just fine, but the write is failing with the following exception. I found a similar issue that got resolved by adding *site.xml and the HBase JARs, but it is not working for me. JavaPairRDD tablePuts = hBaseRDD.mapToPair(new PairFunction, Immuta
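
For reference, a minimal sketch of the usual TableOutputFormat write path from Spark in Scala. The table, column family, and RDD shape here are assumptions, not the poster's code; an hbase-site.xml missing from the classpath or an unset output table can surface as an NPE inside TableOutputFormat.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    // Picks up hbase-site.xml from the classpath; the output table must be
    // set explicitly or TableOutputFormat has nothing to write to.
    val conf = HBaseConfiguration.create()
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // rdd is assumed to be an RDD[(String, String)] of (rowKey, value) pairs.
    val puts = rdd.map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable, put)
    }
    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)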

Re: help in copying data from one azure subscription to another azure subscription

2018-05-23 Thread Pushkar.Gujar
What are you using for storing data in those subscriptions, Data Lake or Blobs? There is Azure Data Factory already available that can copy between these cloud storages without having to go through Spark. Thank you, Pushkar Gujar On Mon, May 21, 2018 at 8:59 AM, amit kumar singh wrote: > HI

PySpark API on top of Apache Arrow

2018-05-23 Thread Corey Nolet
Please forgive me if this question has been asked already. I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1TB of r

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread ayan guha
Curious question: what is the reason for using Spark here? Why not simple SQL-based ETL? On Thu, May 24, 2018 at 5:09 AM, Ajay wrote: > Do you worry about Spark overloading the SQL server? We have had this > issue in the past where all Spark slaves tend to send lots of data at once > to SQL and

Re: Alternative for numpy in Spark Mlib

2018-05-23 Thread Suzen, Mehmet
You can use Breeze, which is part of the Spark distribution: https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra Check out the modules under import breeze._ On 23 May 2018 at 07:04, umargeek wrote: > Hi Folks, > > I am planning to rewrite one of my Python modules written for entropy > cal
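
As a concrete example of the Breeze route, here is a minimal Shannon-entropy calculation; the probability vector is illustrative, and the operators are those of the Breeze version bundled with Spark 2.x.

    import breeze.linalg.{DenseVector, sum}
    import breeze.numerics.log

    val p = DenseVector(0.5, 0.25, 0.25)          // a discrete probability distribution
    val entropyNats = -sum(p *:* log(p))          // element-wise p * ln(p), summed
    val entropyBits = entropyNats / math.log(2.0) // convert nats to bits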

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Ajay
Do you worry about Spark overloading the SQL server? We have had this issue in the past where all Spark slaves tend to send lots of data at once to SQL, and that slows down the latency of the rest of the system. We overcame this by using Sqoop and running it in a controlled environment. On Wed, Ma
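
If staying in Spark, one way to bound the write pressure is to cap parallelism and batch size on the JDBC writer. A hedged sketch; the URL, table, and credentials are placeholders:

    import org.apache.spark.sql.SaveMode

    // coalesce(4) caps the number of concurrent JDBC connections; batchsize
    // controls how many rows go over the wire per round trip.
    df.coalesce(4)
      .write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.target_table")
      .option("user", "etl_user")
      .option("password", "...")
      .option("batchsize", "10000")
      .save()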

Re: Submit many spark applications

2018-05-23 Thread Marcelo Vanzin
On Wed, May 23, 2018 at 12:04 PM, raksja wrote: > So InProcessLauncher wouldn't use the native memory, so will it overload the > mem of the parent process? It will still use "native memory" (since the parent process will still use memory), just less of it. But yes, it will use more memory in the parent
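
For context, a minimal sketch of launching through InProcessLauncher (added in Spark 2.3, hence not available on the 2.1.0 mentioned below); the class name, master, and settings are placeholders:

    import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle, SparkLauncher}

    // Runs the application inside the current JVM instead of forking a
    // spark-submit child process per job; the main class must already be
    // on the parent's classpath.
    val handle: SparkAppHandle = new InProcessLauncher()
      .setAppResource(SparkLauncher.NO_RESOURCE)
      .setMainClass("com.example.MyApp")
      .setMaster("yarn")
      .setConf("spark.executor.memory", "4g")
      .startApplication()

    // handle.getState can then be polled, or a SparkAppHandle.Listener
    // attached, instead of tracking a child process.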

CMR: An open-source Data acquisition API for Spark is available

2018-05-23 Thread Thomas Fuller
Hi Folks, Today I've released my open-source CMR API, which is used to acquire data from several data providers directly in Spark. Currently the CMR API offers integration with the following: - Federal Reserve Bank of St. Louis - World Bank - TreasuryDirect.gov - OpenFIGI.com Of note: - The

Re: Submit many spark applications

2018-05-23 Thread raksja
Hi Marcelo, I'm facing the same issue when making spark-submits from an EC2 instance and hitting the native memory limit sooner. We have #1, but we are still on Spark 2.1.0 and couldn't try #2. So InProcessLauncher wouldn't use the native memory, so will it overload the memory of the parent process? Is there

Re: Spark driver pod garbage collection

2018-05-23 Thread Anirudh Ramanathan
There's a flag on the controller manager that is in charge of the retention policy for terminated or completed pods. https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#options --terminated-pod-gc-threshold int32 Default: 12500 Number of terminated pods that

Cannot make Spark to honour the spark.jars.ivySettings config

2018-05-23 Thread Bruno Aranda
Hi, I am trying to use my own Ivy settings file. For that, I am submitting to Spark using a command such as the following to test: spark-shell --packages some-org:some-artifact:102 --conf spark.jars.ivySettings=/home/hadoop/ivysettings.xml The idea is to be able to get the artifact from a private

Spark driver pod garbage collection

2018-05-23 Thread purna pradeep
Hello, Currently I observe that dead pods are not getting garbage collected (i.e., Spark driver pods which have completed execution), so pods could sit in the namespace for weeks. This makes listing, parsing, and reading pods slower, as well as leaving junk sitting on the cluster. I believe minim

Re: [structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-23 Thread Sushil Kotnala
You can use .option("auto.offset.reset", "earliest") while reading from Kafka. With this, the new stream will read from the first offset present for the topic. On Wed, May 23, 2018 at 11:32 AM, karthikjay wrote: > Chris, > > Thank you for responding. I get it. > > But, if I am using a console sink w
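
A minimal sketch of the read side, assuming a SparkSession named spark; the broker and topic are placeholders. Note that the option the Structured Streaming Kafka source documents for this is "startingOffsets" (auto.offset.reset is the plain Kafka consumer setting):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "earliest") // read the topic from the beginning
      .load()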

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Super, just giving a high-level idea of what I want to do. I have one source schema, which is MS SQL Server 2008, and the target is also MS SQL Server 2008. Currently there is a C#-based ETL application which does extract, transform, and load into a customer-specific schema, including indexing etc. Thanks On Wed, M

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread kedarsdixit
Yes. Regards, Kedar Dixit

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Thank you Kedar Dixit, Silvio Fiorito. Just one question: even if it's not an Azure cloud MS SQL Server, it should support MS SQL Server installed on a local machine, right? Thank you. On Wed, May 23, 2018 at 6:18 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Try this https://d

Re: Adding jars

2018-05-23 Thread kedarsdixit
This can help us solve the immediate issue; ideally, however, one should submit the jars at the beginning of the job.

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Silvio Fiorito
Try this: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector From: Chetan Khatri Date: Wednesday, May 23, 2018 at 7:47 AM To: user Subject: Bulk / Fast Read and Write with MSSQL Server and Spark All, I am looking for an approach to do bulk read / write with MSSQL Se
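
From the linked docs, usage looks roughly like the following; the connection values are placeholders, and the connector (azure-sqldb-spark) targets SQL Server as well as Azure SQL DB:

    import com.microsoft.azure.sqldb.spark.config.Config
    import com.microsoft.azure.sqldb.spark.connect._

    val config = Config(Map(
      "url"               -> "myserver.database.windows.net",
      "databaseName"      -> "mydb",
      "dbTable"           -> "dbo.target_table",
      "user"              -> "etl_user",
      "password"          -> "...",
      "bulkCopyBatchSize" -> "10000"
    ))

    df.bulkCopyToSqlDB(config) // bulk-insert path; df.write.sqlDB(config) for plain writes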

Re: Adding jars

2018-05-23 Thread Sushil Kotnala
The purpose of a broadcast variable is different. @Malveeka, could you please explain your use case and issue? If the fat/uber jar does not include the required dependent jars, then the Spark job will fail at the start itself. What is your scenario in which you want to add new jars? Also, what do you mean by

Re: Adding jars

2018-05-23 Thread kedarsdixit
In the case of already running jobs, you can make use of broadcast variables, which will broadcast the jars to the workers; if you want to change them on the fly, you can rebroadcast. You can explore broadcast variables a bit more to make use of them. Regards, Kedar Dixit Data Science at Persistent Systems Ltd.

Re: Adding jars

2018-05-23 Thread Sushil Kotnala
Hi, with spark-submit we can start a new Spark job, but it will not add new jar files to an already running job. ~Sushil On Wed, May 23, 2018, 17:28 kedarsdixit wrote: > Hi, > > You can add dependencies in spark-submit as below: > > ./bin/spark-submit \ > --class \ > --master \ > --deploy
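
For completeness: if the goal is to ship an extra jar to the executors of an application that is already running, SparkContext.addJar is the relevant API. A one-line sketch with a placeholder path; it affects tasks submitted after the call and does not alter the driver's already-loaded classpath:

    spark.sparkContext.addJar("/path/to/extra.jar")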

Re: Spark is not evenly distributing data

2018-05-23 Thread kedarsdixit
Hi, Can you elaborate more here? We don't understand the issue in detail. Regards, Kedar Dixit Data Science at Persistent Systems Ltd.

Re: Adding jars

2018-05-23 Thread kedarsdixit
Hi, You can add dependencies in spark-submit as below:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --jars <comma-separated-list-of-jars> \
  ... # other options
  <application-jar> \
  [application-arguments]

Hope this helps. Regards, Kedar Dixit Data Science at Persistent Systems Ltd

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread kedarsdixit
Hi, I came across this a while ago; check if it is helpful. Regards, ~Kedar Dixit Data Science @ Persistent Systems Ltd.

Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
All, I am looking for an approach to do bulk read/write with MSSQL Server and Apache Spark 2.2; please let me know if there is any library/driver for the same. Thank you. Chetan

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread Jungtaek Lim
The issue looks to be fixed in https://issues.apache.org/jira/browse/SPARK-23670, and 2.3.1 will likely include the fix. -Jungtaek Lim (HeartSaVioR) On Wed, May 23, 2018 at 7:12 PM, weand wrote: > Thanks for the clarification. So it really seems to be a Spark UI OOM issue. > > After setting: > --conf spark.sql

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread weand
Thanks for the clarification. So it really seems to be a Spark UI OOM issue. After setting: --conf spark.sql.ui.retainedExecutions=10 --conf spark.worker.ui.retainedExecutors=10 --conf spark.worker.ui.retainedDrivers=10 --conf spark.ui.retainedJobs=10 --conf spark.ui.retainedStages=10
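
The driver-side limits above can equally be set when building the session (a hedged sketch; the spark.worker.ui.* settings apply to standalone worker daemons rather than to an application):

    import org.apache.spark.sql.SparkSession

    // Cap UI retention up front so the driver's UI state stays bounded.
    val spark = SparkSession.builder()
      .appName("streaming-job")
      .config("spark.sql.ui.retainedExecutions", "10")
      .config("spark.ui.retainedJobs", "10")
      .config("spark.ui.retainedStages", "10")
      .getOrCreate()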

Recall: spark sql in-clause problem

2018-05-23 Thread Shiva Prashanth Vallabhaneni
Shiva Prashanth Vallabhaneni would like to recall the message, "spark sql in-clause problem".