Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) uses the local timezone: either the Java default zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone`, or the default DataFrameWriter zone `timeZone` (when writing to file). You say you are in PST, whic
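A minimal spark-shell sketch of the behaviour Enrico and Sean describe (values and zone names are illustrative; timestamp_seconds needs Spark 3.1+): the stored instant never changes, only its rendering by show().

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("tz-demo").getOrCreate()
    import spark.implicits._

    // A timestamp column stores an instant (microseconds since the epoch, in UTC);
    // only its string rendering in show() depends on the session time zone.
    val df = Seq(1686182400L).toDF("epoch")                  // 2023-06-08 00:00:00 UTC
      .select(timestamp_seconds($"epoch").as("ts"))          // Spark 3.1+

    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df.show(false)                                           // 2023-06-08 00:00:00

    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    df.show(false)                                           // 2023-06-07 17:00:00 -- same instant

    df.select($"ts".cast("long")).show()                     // 1686182400 either way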

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : > https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-co

RE: Apache Spark Operator for Kubernetes?

2022-10-28 Thread Jim Halfpenny
Hi Clayton, I’m not aware of an official Apache operator, but I can recommend taking a look at the one we’ve created at Stackable. https://github.com/stackabletech/spark-k8s-operator It’s actively maintained and we’d be happy to receive feedback if you have feature requests. Kind regards, Jim

Re: Apache Spark Operator for Kubernetes?

2022-10-14 Thread Artemis User
If you have the hardware resources, it isn't difficult to set up Spark in a kubernetes cluster.  The online doc describes everything you would need (https://spark.apache.org/docs/latest/running-on-kubernetes.html). You're right, both AWS EMR and Google's environment aren't flexible and not che

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-18 Thread Sean Owen
These kinds of static analyses have limited value to send around. It's not clear whether any of the CVEs actually affect Spark's usage of the library. Jackson -- generally, yes, could theoretically affect Spark apps. I can't really read this output, but it seems like the affected versions are generally

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-15 Thread Sean Owen
I think these are readily answerable if you look at the text of the CVEs and Spark 3.0.3 release. https://nvd.nist.gov/vuln/detail/CVE-2019-17531 concerns Jackson Databind up to 2.9.10, but you can see that 3.0.3 uses 2.10.0 https://nvd.nist.gov/vuln/detail/CVE-2020-9480 affects Spark 2.x, not 3.x

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-15 Thread Rajesh Krishnamurthy
Hi Sean, I am looking to fix vulnerabilities such as these in the 3.0.X branch: 1) CVE-2019-17531 2) CVE-2020-9480 3) CVE-2019-0204 Rajesh Krishnamurthy | Enterprise Architect T: +1 510-833-7189 | M: +1 925-917-9208 http://www.perforce.com Visit us on: Twitter

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-14 Thread Sean Owen
What vulnerabilities are you referring to? I'm not aware of any critical outstanding issues, but not sure what you have in mind either. See https://spark.apache.org/versioning-policy.html - 3.0.x is EOL about now, which doesn't mean there can't be another release, but would not generally expect one

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-11 Thread Sean Owen
3.0.x is about EOL now, and I hadn't heard anyone come forward to push a final maintenance release. Is there a specific issue you're concerned about? On Fri, Feb 11, 2022 at 4:24 PM Rajesh Krishnamurthy < rkrishnamur...@perforce.com> wrote: > Hi there, > > We are just wondering if there are any

Re: Apache Spark 3.2.0 | Pyspark | Pycharm Setup

2021-11-17 Thread Mich Talebzadeh
yep the latest pyspark is 3.2. you can easily install it from available packages [image: image.png] my Linkedin profile *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any

Re: Apache Spark 3.2.0 | Pyspark | Pycharm Setup

2021-11-17 Thread Khalid Mammadov
Hi Anil You don't need to download and install Spark. It's enough to add pyspark to PyCharm as a package for your environment and start developing and testing locally. The thing is, PySpark includes a local Spark that is installed as part of pip install. When it comes to your particular issue, I beli

Re: Apache Spark 3.2.0 | Pyspark | Pycharm Setup

2021-11-17 Thread Gourav Sengupta
Hi Anil, I generally create an anaconda environment, and then install pyspark in it, and then configure the interpreter to point to that particular environment. Never faced an issue with my approach. Regards, Gourav Sengupta On Wed, Nov 17, 2021 at 7:39 AM Anil Kulkarni wrote: > Hi Spark comm

Re: apache-spark-how to run distributed spark jobs in Apache-spark-standalone-cluster

2021-11-10 Thread Dinakar Chennubotla
Hi Sean Owen, Thank you very much for your quick response. Could you please help us with the below? Coming to the point: yes, I did check the link you sent, but I did not get full clarity on what I am searching for. Our prod problem statement is: 1. Can we launch distri

Re: apache-spark-how to run distributed spark jobs in Apache-spark-standalone-cluster

2021-11-08 Thread Sean Owen
Yes, did you check the docs? https://spark.apache.org/docs/latest/spark-standalone.html On Mon, Nov 8, 2021 at 6:40 AM Dinakar Chennubotla wrote: > Hi All,i am dinakar and I am admin,i have question, > Question: " is it possible to run distributed spark jobs in Apache spark > standalone cluster"

Re: apache-spark

2021-10-14 Thread Mich Talebzadeh
Also have you tried to see what is going on within k8s driver? DRIVER_POD_NAME=`kubectl get pods -n $NAMESPACE |grep driver|awk '{print $1}'` kubectl describe pod $DRIVER_POD_NAME -n $NAMESPACE kubectl logs $DRIVER_POD_NAME -n $NAMESPACE view my Linkedin profile

Re: apache-spark

2021-10-14 Thread Mich Talebzadeh
Hi, Airflow is nothing but a new version of cron on linux with dag dependency. What operator in airflow are you using to submit your spark-submit for example BashOperator? Can you actually run the command outside of airflow by submitting spark-submit to K8s cluster? Is that GKE cluster or somethi

Re: [apache-spark][Spark SQL][Debug] Maven Spark build fails while compiling spark-hive-thriftserver_2.12 for Hadoop 2.10.1

2021-09-17 Thread Sean Owen
I don't think that has ever shown up in the CI/CD builds, and I can't recall someone reporting this. What did you change? It may be some local env issue On Fri, Sep 17, 2021 at 7:09 AM Enrico Minardi wrote: > > Hello, > > > the Maven build of Apache Spark 3.1.2 for user-provided Hadoop 2.10.1 with

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-06-14 Thread Daniel de Oliveira Mantovani
Did you include Apache Spark dependencies in your build? If you did, you should remove them. If you are using sbt, all Spark dependencies should be marked as "provided". On Wed, Jun 2, 2021 at 10:11 AM Kanchan Kauthale < kanchankauthal...@gmail.com> wrote: > Hello Sean, > > Please find below the stack tra
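A minimal build.sbt sketch of the "provided" scoping Daniel describes; the version numbers are illustrative for this thread (Spark 2.4.8, Scala 2.11).

    // build.sbt -- marking Spark as "provided" so the cluster's Spark (and its Jetty)
    // is used at runtime instead of a second copy bundled into the application jar.
    name := "my-spark-app"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
    )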

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-06-02 Thread Kanchan Kauthale
Hello Sean, Please find below the stack trace- java.lang.NoClassDefFoundError: Could not initialize class org.spark.project.jetty.servlet.ServletContextHandler at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:143) at org.apache.spark.ui.JettyUtils$.createServletHandler(J

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-05-27 Thread Sean Owen
Despite the name, the error doesn't mean the class isn't found but could not be initialized. What's the rest of the error? I don't believe any testing has ever encountered this error, so it's likely something to do with your environment, but I don't know what. On Thu, May 27, 2021 at 7:32 AM Kanch

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-05-27 Thread Kanchan Kauthale
Hello, I could see the Jetty version has been updated to 9.4.35 from 9.4.28 in JIRA: https://issues.apache.org/jira/browse/SPARK-33831 Does it have something to do with it? Thank you Kanchan On Thu, May 27, 2021 at 5:16 PM Kanchan Kauthale < kanchankauthal...@gmail.com> wrote: > Hello, > > We have

RE: Apache Spark

2021-01-26 Thread Синий Андрей
: +79103801534 e-mail: avs...@mts.ru Nizhny Novgorod, 168A Gagarin Ave., office P8, 3, 310 From: Maxim Gekk Sent: Tuesday, January 26, 2021 10:08 PM To: Lalwani, Jayesh Cc: Синий Андрей ; user@spark.apache.org Subject: Re: Apache

Re: Apache Spark

2021-01-26 Thread Maxim Gekk
Hi Андрей, You can write to https://databricks.com/company/contact . Probably, we can offer something to you. For instance, Databricks has OEM program which might be interesting to you: https://partners.databricks.com/prm/English/c/Overview Maxim Gekk Software Engineer Databricks, Inc. On Tue

Re: Apache Spark

2021-01-26 Thread Lalwani, Jayesh
All of the major cloud vendors have some sort of Spark offering. They provide support if you build in their cloud. From: Синий Андрей Date: Tuesday, January 26, 2021 at 7:52 AM To: "user@spark.apache.org" Subject: [EXTERNAL] Apache Spark CAUTION: This email originated from outside of the orga

Re: Apache Spark

2021-01-26 Thread Ivan Petrov
Hello Andrey, you can try to reach Beeline beeline.ru, they use Databricks as far as I know. On Tue, 26 Jan 2021 at 15:01, Sean Owen : > To clarify: Apache projects and the ASF do not provide paid support. > However there are many vendors who provide distributions of Apache Spark > who will provi

Re: Apache Spark

2021-01-26 Thread Sean Owen
To clarify: Apache projects and the ASF do not provide paid support. However there are many vendors who provide distributions of Apache Spark who will provide technical support - not nearly just Databricks but Cloudera, etc. There are also plenty of consultancies and individuals who can provide pro

Re: Apache Spark

2021-01-26 Thread Gourav Sengupta
Hi, why do you want to buy paid SPARK? Regards, Gourav On Tue, Jan 26, 2021 at 1:22 PM Pasha Finkelshteyn < pavel.finkelsht...@gmail.com> wrote: > Hi Andrey, > > It looks like you may contact Databricks for that. > Also it would be easier for non-russian spaekers to respond you if your > name w

Re: Apache Spark

2021-01-26 Thread Pasha Finkelshteyn
Hi Andrey, It looks like you may contact Databricks for that. Also, it would be easier for non-Russian speakers to respond to you if your name were written in English. On 21/01/26 12:41PM, Синий Андрей wrote: > Hello! > > We plan to use Apache Spark software in our organization, can I purchase p

Re: Apache Spark Connector for SQL Server and Azure SQL

2020-10-26 Thread Artemis User
The best option certainly would be to recompile the Spark Connector for MS SQL server using the Spark 3.0.1/Scala 2.12 dependencies, and just fix the compiler errors as you go. The code is open source on github (https://github.com/microsoft/sql-spark-connector).  Looks like this connector is us

Re: Apache Spark Connector for SQL Server and Azure SQL

2020-10-26 Thread ayan guha
I would suggest asking Microsoft and Databricks; this forum is for Apache Spark. If you are interested, please drop me a note separately, as I'm keen to understand the issue since we use the same setup Ayan On Mon, 26 Oct 2020 at 11:53 pm, wrote: > Hi, > > > > In a project where I work with Databricks,

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Dongjoon Hyun
Thank you so much for your feedback, Koert. Yes, SPARK-20202 was created in April 2017 and targeted for 3.1.0 since Nov 2019. However, I believe Apache Spark 3.1.0 (Hadoop 3.2/Hive 2.3 distribution) will work with old Hadoop 2.x clusters if you isolated the classpath via SPARK-31960. SPARK-31960

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Koert Kuipers
it seems to me with SPARK-20202 we are no longer planning to support hadoop2 + hive 1.2. is that correct? so basically spark 3.1 will no longer run on say CDH 5.x or HDP2.x with hive? my use case is building spark 3.1 and launching on these existing clusters that are not managed by me. e.g. i do

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
As pointed out by Dongjoon, the 2nd half of December is the holiday season in most countries. If we do the code freeze in mid November and release the first RC in mid December. I am afraid the community will not be active to verify the release candidates during the holiday season. Normally, the RC

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
> > Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0. I think we made a change in release cadence since Spark 2.3. See the commit: https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa Thus, Spark 3.1 might just follow the release cadence of Spark 2.

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
For Xiao's comment, I want to point out that Apache Spark 3.1.0 is different from 2.3 or 2.4. Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0. - Apache Spark 2.0.0 was released on July 26, 2016. - Apache Spark 2.1.0 was released on December 28, 2016. Bests, Dongjoon. On Sun, Oct

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
Thank you all. BTW, Xiao and Mridul, I'm wondering what date you have in your mind specifically. Usually, `Christmas and New Year season` doesn't give us much additional time. If you think so, could you make a PR for Apache Spark website according to your expectation? https://spark.apache.org/v

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Mridul Muralidharan
+1 on pushing the branch cut for increased dev time to match previous releases. Regards, Mridul On Sat, Oct 3, 2020 at 10:22 PM Xiao Li wrote: > Thank you for your updates. > > Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of > the 3.1 branch cut, the feature development

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Xiao Li
Thank you for your updates. Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of the 3.1 branch cut, the feature development time window is less than 5 months. This is shorter than what we did in Spark 2.3 and 2.4 releases. Below are three highly desirable feature work I am wa

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction -> I believe we dropped R 3.5 and below at branch 2.4 as well. On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, wrote: > Hi, All. > > As of today, master branch (Apache Spark 3.1.0) resolved > 852+ JIRA issues and 606+ issues are 3.1.0-only patches. >

Re: Apache Spark Bogotá Meetup

2020-09-30 Thread Miguel Angel Díaz Rodríguez
Cool here is my PR 🤞 https://github.com/apache/spark-website/pull/291 On Wed, 30 Sep 2020 at 07:34, Sean Owen wrote: > Sure, we just ask people to open a pull request against > https://github.com/apache/spark-website to update the page and we can > merge it. > > On Wed, Sep 30, 2020 at 7:30 AM

Re: Apache Spark Bogotá Meetup

2020-09-30 Thread Sean Owen
Sure, we just ask people to open a pull request against https://github.com/apache/spark-website to update the page and we can merge it. On Wed, Sep 30, 2020 at 7:30 AM Miguel Angel Díaz Rodríguez < madiaz...@gmail.com> wrote: > Hello > > I am Co-organizer of Apache Spark Bogotá Meetup from Colomb

Re: Apache Spark- Help with email library

2020-07-27 Thread Suat Toksöz
Why am I not able to send my question to the Spark email list? Thanks On Mon, Jul 27, 2020 at 10:31 AM tianlangstudio wrote: > I use SimpleJavaEmail http://www.simplejavamail.org/#/features for Send > email and parse email file. It is awesome and may help you. > >

Re: apache-spark mongodb dataframe issue

2020-06-26 Thread Mannat Singh
Hi Jeff Thanks for confirming the same. I have also thought about reading every MongoDB document separately along with their schemas and then comparing them to the schemas of all the documents in the collection. For our huge database this is a horrible horrible approach as you have already mention

Re: apache-spark mongodb dataframe issue

2020-06-23 Thread Jeff Evans
As far as I know, in general, there isn't a way to distinguish explicit null values from missing ones. (Someone please correct me if I'm wrong, since I would love to be able to do this for my own reasons). If you really must do it, and don't care about performance at all (since it will be horribl

Re: [apache-spark]-spark-shuffle

2020-05-24 Thread vijay.bvp
How a Spark job reads data sources depends on the underlying source system and the job configuration (number of executors and cores per executor). https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets About Shuffle operations. https://spark.apache.org/docs/latest/rdd-p

Re: apache-spark Structured Stateful Streaming with window / SPARK-21641

2019-10-16 Thread Jungtaek Lim
First of all, I guess you've asked for using both "arbitrary stateful operation" and "native support of windowing". (Even if you don't deal with state directly, whenever you use stateful operations like streaming aggregation or stream-stream join, you use state.) In short, there's no native support o

Re: RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-24 Thread Bipul kumar
Thank you for the clarification. Respectfully, Bipul PUBLIC KEY 97F0 2E08 7DE7 D538 BDFA B708 86D8 BE27 8196 D466 ** Please excuse brevity and typos. ** On Mon, Jun 24, 2019 at 4:24 AM Mark Bidewell wrote: > Note that we selected Spark 2.

Re: RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-23 Thread Mark Bidewell
Note that we selected Spark 2.2.2 because we were trying to align with DSE Search 6. A new version might have fewer issues. On Sun, Jun 23, 2019 at 10:56 AM Bipul kumar wrote: > Hi Mark, > > Thanks for your wonderful suggestion. > I look forward to try that version. > > Respectfully, >

Re: RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-23 Thread Bipul kumar
Hi Mark, Thanks for your wonderful suggestion. I look forward to try that version. Respectfully, Bipul PUBLIC KEY 97F0 2E08 7DE7 D538 BDFA B708 86D8 BE27 8196 D466 ** Please excuse brevity and typos. ** On Sun, Jun 23, 2019 at 8:06 PM Ma

Re: RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-23 Thread Mark Bidewell
I have done a setup with Hadoop 2.9.2 and Spark 2.2.2. Apache Zeppelin is fine but some of our internally developed apps need work on dependencies On Sun, Jun 23, 2019, 07:50 Bipul kumar wrote: > Hello People ! > > I am new to Apache Spark , and just started learning it. > Few questions i have in

RE - Apache Spark compatibility with Hadoop 2.9.2

2019-06-23 Thread Bipul kumar
Hello People! I am new to Apache Spark, and just started learning it. A few questions I have in mind which I am asking here: 1. Is there any compatibility concern with Apache Spark while using Hadoop? Let's say I am running Hadoop 2.9.2; which Apache Spark should I use? 2. As mentioned, I am

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread gpatcham
When I run spark.read.orc("hdfs://test").filter("conv_date = 20181025").count with "spark.sql.orc.filterPushdown=true" I see below in executors logs. Predicate push down is happening 18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL conv_date) leaf-1 = (EQUALS conv_d

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke
A lot of small files is very inefficient in itself, and predicate pushdown will not help you much there unless you merge them into one large file (one large file can be processed much more efficiently). How did you validate that predicate pushdown did not work on Hive? Your Hive version is also ver

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
spark version 2.2.0 Hive version 1.1.0 There are a lot of small files Spark code : "spark.sql.orc.enabled": "true", "spark.sql.orc.filterPushdown": "true" val logs = spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date > 20181003") Hive: "spark.sql.orc.enabled": "true", "spark.s
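A hedged sketch of the two levers discussed in this thread: enabling ORC filter pushdown and compacting many small files into fewer large ones (paths, schema, and partition count are illustrative).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("orc-pushdown")
      .config("spark.sql.orc.filterPushdown", "true")        // push predicates into the ORC reader
      .getOrCreate()

    val schema = StructType(Seq(
      StructField("date", IntegerType),
      StructField("conv_date", IntegerType)))

    // With many small files, per-file open/footer overhead dominates and pushdown helps little.
    val logs = spark.read.schema(schema)
      .orc("hdfs://test/date=201810")
      .filter("date > 20181003")

    // One-off compaction into fewer, larger ORC files usually helps more:
    spark.read.schema(schema).orc("hdfs://test/date=201810")
      .repartition(16)
      .write.mode("overwrite")
      .orc("hdfs://test_compacted/date=201810")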

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause significant delay in progressing - try to merge as much as possible into one file. Can you please share full source code in Hive and Spark as well as the versions you are using? > Am 31.10.2018 um 18:23 schrieb gpatcham : > > > > When rea

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread amihay gonen
If you are using the Kafka direct connect API it might be committing offsets back to Kafka itself On Thu, 7 Jun 2018 at 4:10, licl wrote: > I met the same issue and I have try to delete the checkpoint dir before the > job , > > But spark seems can read the correct offset even though after the
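For reference, the DStream direct API pattern amihay is referring to looks roughly like this (a sketch adapted from the Kafka integration guide pattern; broker, topic, and group id are illustrative): offsets are committed back to Kafka itself, so the consumer group can resume even without Spark's checkpoint metadata.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-commit"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)  // store offsets in Kafka
    }

    ssc.start()
    ssc.awaitTermination()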

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread licl
I met the same issue and I have tried to delete the checkpoint dir before the job, But Spark seems to be able to read the correct offset even after the checkpoint dir is deleted; I don't know how Spark does this without the checkpoint's metadata. -- Sent from: http://apache-spark-user-list.1001560.n3.

Re: Apache Spark Installation error

2018-05-31 Thread Irving Duran
You probably want to recognize "spark-shell" as a command in your environment. Maybe try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell" Have you tried "./spark-shell" in the current path to see if it works? Thank You, Irving Duran On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:

Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread Tathagata Das
Structured Streaming AUTOMATICALLY saves the offsets in a checkpoint directory that you provide. And when you start the query again with the same directory it will just pick up where it left off. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failur
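A minimal sketch of what TD describes: with the same checkpointLocation, a restarted query resumes from the offsets recorded there rather than from startingOffsets (broker, topic, and paths are illustrative).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-checkpoint").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")   // only consulted the very first time the query runs
      .load()

    val query = input.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/data/checkpoints/events")  // offsets are tracked here
      .start()

    query.awaitTermination()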

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay: I am using spark-shell because I am still prototyping the steps involved. Regarding executors - I have 280 executors and the UI only shows a few straggler tasks on each trigger.  The UI does not show too much time spent on GC. I suspect the delay is because of getting data from kafka. The num

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp
Instead of spark-shell, have you tried running it as a job? How many executors and cores? Can you share the RDD graph and event timeline from the UI, and did you find which of the tasks is taking more time and whether there was any GC? Please look at the UI if you haven't already; it can provide a lot of information -

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
Thanks Richard.  I am hoping that the Spark team will, at some point, provide more detailed documentation. On Sunday, February 11, 2018 2:17 AM, Richard Qiao wrote: Can't find a good source for documents, but the source code “org.apache.spark.sql.execution.streaming.ProgressReporter” is help

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
Can't find a good source for documents, but the source code “org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec This explains why
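A small helper, as a sketch, for inspecting those fields on a running query (StreamingQueryProgress exposes them directly):

    import org.apache.spark.sql.streaming.StreamingQuery

    def printRates(query: StreamingQuery): Unit = {
      val p = query.lastProgress                 // latest StreamingQueryProgress; null before the first batch
      if (p != null) {
        println(s"numInputRows           = ${p.numInputRows}")
        println(s"inputRowsPerSecond     = ${p.inputRowsPerSecond}")      // numRecords / inputTimeSec
        println(s"processedRowsPerSecond = ${p.processedRowsPerSecond}")  // numRecords / processingTimeSec
      }
    }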

Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state in structured streaming. Thanks On Thursday, February 8, 2018 2:58 PM, M Singh wrote: Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update the state perio

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
Hi Jacek: Thanks for your response. I am just trying to understand the fundamentals of watermarking and how it behaves in aggregation vs non-aggregation scenarios. On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski wrote: Hi, What would you expect? The data is simply dropped as

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi, What would you expect? The data is simply dropped as that's the purpose of watermarking it. That's my understanding at least. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how watermark works in cases where event time is earlier than processing time stamp. On Friday, February 2, 2018 8:47 AM, M Singh wrote: Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is proc

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-05 Thread M Singh
Hi TD: Just wondering if you have any insight for me or need more info. Thanks On Thursday, February 1, 2018 7:43 AM, M Singh wrote: Hi TD: Here is the udpated code with explain and full stack trace. Please let me know what could be the issue and what to look for in the explain output.

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is processing time. Vishnu - Spark documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) does indicate that it can dedup using watermark.  So I believe th

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-01 Thread M Singh
Hi TD: Here is the updated code with explain and full stack trace. Please let me know what could be the issue and what to look for in the explain output. Updated code: import scala.collection.immutable import org.apache.spark.sql.functions._ import org.joda.time._ import org.apache.spark.sql._ import

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread Tathagata Das
Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true)` and show us the plan output? On Wed, Jan 31, 2018 at 3:35 PM, M Singh wrote: > Hi Folks: > > I have to add a column to a structured *streaming* dataframe but when I > do that (using select or wit

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans, Watermark in Spark is used to decide when to clear the state, so if the event is delayed beyond the point when the state is cleared by Spark, then it will be ignored. I recently wrote a blog post on this : http://vishnuviswanath.com/spark_structured_streaming.html#watermark Yes, this State is ap
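A minimal sketch of the behaviour Vishnu describes, using an illustrative socket source: state for a window is kept until the watermark (max event time seen minus the delay) passes the window's end, and events arriving later than that are dropped.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("watermark-demo").getOrCreate()
    import spark.implicits._

    // Illustrative source: one "yyyy-MM-dd HH:mm:ss" timestamp string per line on a local socket.
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999).load()
      .select(to_timestamp($"value").as("eventTime"))

    val counts = events
      .withWatermark("eventTime", "10 minutes")            // tolerate up to 10 minutes of lateness
      .groupBy(window($"eventTime", "5 minutes"))
      .count()

    counts.writeStream.outputMode("update").format("console").start().awaitTermination()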

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread Jacek Laskowski
Hi, I'm curious how you would do the requirement "by a certain amount of time" without a watermark? How would you know what's current and compute the lag? Let's forget about watermark for a moment and see if it pops up as an inevitable feature :) "I am trying to filter out records which are laggi

Re: Apache Spark - Custom structured streaming data source

2018-01-26 Thread M Singh
Thanks TD.  When is 2.3 scheduled for release?   On Thursday, January 25, 2018 11:32 PM, Tathagata Das wrote: Hello Mans, The streaming DataSource APIs are still evolving and are not public yet. Hence there is no official documentation. In fact, there is a new DataSourceV2 API (in

Re: Apache Spark - Custom structured streaming data source

2018-01-25 Thread Tathagata Das
Hello Mans, The streaming DataSource APIs are still evolving and are not public yet. Hence there is no official documentation. In fact, there is a new DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this point of time, it's hard to make any concrete suggestion. You can take a

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
Hi Jacek: The javadoc mentions that we can only consume data from the data frame in the addBatch method.  So, if I would like to save the data to a new sink then I believe that I will need to collect the data and then save it.  This is the reason I am asking about how to control the size of the

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread Jacek Laskowski
Hi, > If the data is very large then a collect may result in OOM. That's a general case even in any part of Spark, incl. Spark Structured Streaming. Why would you collect in addBatch? It's on the driver side and as anything on the driver, it's a single JVM (and usually not fault tolerant) > Do y

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
Thanks Tathagata for your answer. The reason I was asking about controlling data size is that the javadoc indicates you can use foreach or collect on the dataframe.  If the data is very large then a collect may result in OOM. From your answer it appears that the only way to control the size (in 2

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread Tathagata Das
1. It is all the result data in that trigger. Note that it takes a DataFrame which is a purely logical representation of data and has no association with partitions, etc. which are physical representations. 2. If you want to limit the amount of data that is processed in a trigger, then you should
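The usual way to bound how much data one trigger (and therefore one addBatch call) sees is a source-side rate limit; a sketch with the file and Kafka sources follows (the option names are real Spark options, while paths, topics, and the schema are illustrative).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("trigger-rate-limit").getOrCreate()
    val jsonSchema = StructType(Seq(StructField("id", StringType), StructField("value", DoubleType)))

    val fromFiles = spark.readStream
      .format("json")
      .schema(jsonSchema)
      .option("maxFilesPerTrigger", 10)          // file source: at most 10 new files per micro-batch
      .load("/data/incoming")

    val fromKafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 100000L)   // Kafka source: at most 100k offsets per micro-batch
      .load()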

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
Thanks Eyal - it appears that these are the same patterns used for Spark DStreams. On Wednesday, December 27, 2017 1:15 AM, Eyal Zituny wrote: Hi, if you're interested in stopping your Spark application externally, you will probably need a way to communicate with the Spark driver (wh

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-27 Thread Eyal Zituny
Hi, if you're interested in stopping your Spark application externally, you will probably need a way to communicate with the Spark driver (which starts and holds a ref to the Spark context). This can be done by adding some code to the driver app, for example: - you can expose a REST API that st
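One simple variant of Eyal's driver-side pattern, sketched without the REST server: the driver polls for an externally created marker file and stops the query when it appears (the path and polling interval are illustrative; a REST endpoint in the driver works the same way).

    import java.nio.file.{Files, Paths}
    import org.apache.spark.sql.streaming.StreamingQuery

    def awaitStopSignal(query: StreamingQuery, markerPath: String): Unit = {
      while (query.isActive) {
        if (Files.exists(Paths.get(markerPath))) {
          query.stop()                 // stops the query's execution threads
        } else {
          Thread.sleep(5000)           // poll every 5 seconds
        }
      }
    }

    // Usage in the driver, after starting the query:
    //   awaitStopSignal(query, "/tmp/stop-my-streaming-job")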

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-26 Thread M Singh
Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh! Here I'm using query.stop() Em 25 de dez de 2017 19:19, "M Singh" escreveu: Hi:

Re: Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread Diogo Munaro Vieira
Window function requires a timestamp column because you will apply a function for each window (like an aggregation). You still can use UDF for customized tasks Em 25 de dez de 2017 20:15, "M Singh" escreveu: > Hi: > I would like to use window function on a DataSet stream (Spark 2.2.0) > The wind
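A small illustration of the point: window() buckets rows by a timestamp column, so the input must carry one (the data below is made up).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("window-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("2017-12-25 10:01:00", 3),
      ("2017-12-25 10:04:00", 5),
      ("2017-12-25 10:12:00", 2)
    ).toDF("ts", "value").withColumn("ts", to_timestamp($"ts"))   // window() needs a timestamp column

    df.groupBy(window($"ts", "10 minutes"))
      .agg(sum($"value").as("total"))
      .show(false)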

Re: Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread Diogo Munaro Vieira
Can you please post here your code? Em 25 de dez de 2017 19:24, "M Singh" escreveu: > Hi: > > I am using spark structured streaming (v 2.2.0) to read data from files. I > have configured checkpoint location. On stopping and restarting the > application, it looks like it is reading the previously

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread Diogo Munaro Vieira
Hi M Singh! Here I'm using query.stop() Em 25 de dez de 2017 19:19, "M Singh" escreveu: > Hi: > Are there any patterns/recommendations for gracefully stopping a > structured streaming application ? > Thanks > > >

Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Scott Reynolds
The train method is on the Companion Object https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans$ here is a decent resource on Companion Object usage: https://docs.scala-lang.org/tour/singleton-objects.html On Wed, Dec 13, 2017 at 9:16 AM Michael Segel
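In code, calling the companion object's train method looks like this (a minimal sketch with made-up points):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val sc = SparkContext.getOrCreate()
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // train is defined on the KMeans companion object, so no `new` is needed.
    val model = KMeans.train(points, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println)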

Re: Apache Spark-Subtract two datasets

2017-10-13 Thread Nathan Kronenfeld
I think you want a join of type "left_anti"... See below log scala> import spark.implicits._ import spark.implicits._ scala> case class Foo (a: String, b: Int) defined class Foo scala> case class Bar (a: String, d: Double) defined class Bar scala> var fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("
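Continuing Nathan's spark-shell sketch, the anti-join itself would look like this:

    case class Foo(a: String, b: Int)
    case class Bar(a: String, d: Double)

    val fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("c", 3)).toDS()
    val barDs = Seq(Bar("a", 1.0), Bar("c", 3.0)).toDS()

    // left_anti keeps the rows of fooDs whose key "a" has no match in barDs,
    // i.e. a Dataset-friendly "subtract by key" without converting to RDDs.
    fooDs.join(barDs, Seq("a"), "left_anti").show()
    // +---+---+
    // |  a|  b|
    // +---+---+
    // |  b|  2|
    // +---+---+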

Re: Apache Spark-Subtract two datasets

2017-10-12 Thread Imran Rajjad
if the datasets hold objects of different classes, then you will have to convert both of them to RDDs and then rename the columns before you call rdd1.subtract(rdd2) On Thu, Oct 12, 2017 at 10:16 PM, Shashikant Kulkarni < shashikant.kulka...@gmail.com> wrote: > Hello, > > I have 2 datasets, Datas

Re: Apache Spark - MLLib challenges

2017-09-23 Thread vaquar khan
MLlib is the old RDD-based API; since Apache Spark 2 it is recommended to use the Dataset-based APIs to get good performance, and ML was introduced. ML contains a new API built around Dataset and ML Pipelines; mllib is slowly being deprecated (this has already happened in the case of linear regression). MLlib currently enter

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Koert Kuipers
our main challenge has been the lack of support for missing values generally On Sat, Sep 23, 2017 at 3:41 AM, Irfan Kabli wrote: > Dear All, > > We are looking to position MLLib in our organisation for machine learning > tasks and are keen to understand if their are any challenges that you might

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Aseem Bansal
This is something I wrote specifically for the challenges that we faced when taking spark ml models to production http://www.tothenew.com/blog/when-you-take-your-machine-learning-models-to-production-for-real-time-predictions/ On Sat, Sep 23, 2017 at 1:33 PM, Jörn Franke wrote: > As far as I kno

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Jörn Franke
As far as I know there is currently no encryption in-memory in Spark. There are some research projects to create secure enclaves in-memory based on Intel sgx, but there is still a lot to do in terms of performance and security objectives. The more interesting question is why would you need this f

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Bryan Cutler
much. It is great help, I will try spark-sklearn. >> Prem >> *From: *Yanbo Liang >> *Date: *Tuesday, September 5, 2017 at 10:40 AM >> *To: *Patrick McCarthy >> *Cc: *"Timsina,

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
*From: *Yanbo Liang > *Date: *Tuesday, September 5, 2017 at 10:40 AM > *To: *Patrick McCarthy > *Cc: *"Timsina, Prem" , "user@spark.apache.org" < > user@spark.apache.org> > *Subject: *Re: Apache Spark: Parallelizatio

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Timsina, Prem
ay, September 5, 2017 at 10:40 AM To: Patrick McCarthy Cc: "Timsina, Prem" , "user@spark.apache.org" Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm Hi Prem, How large is your dataset? Can it be fitted in a single node? If no, Spark ML

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
Hi Prem, How large is your dataset? Can it be fitted on a single node? If not, Spark MLlib provides CrossValidation, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation If
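A minimal sketch of the cross-validation approach Yanbo mentions (it assumes a DataFrame named `training` with "features" and "label" columns; the grid, fold count, and parallelism are illustrative, and setParallelism needs Spark 2.3+):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()
    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
      .setParallelism(4)            // Spark 2.3+: fit up to 4 models at a time

    val cvModel = cv.fit(training)  // `training`: DataFrame with "features" and "label"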

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071 On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote: > Is there a way to parallelize multiple ML algorithms in Spark. My use case > is something like this: > > A) Run multiple machine learning alg

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-30 Thread Karin Valisova
Looks like the parallelization into an RDD was the right move I was omitting: JavaRDD<Row> jsonRDD = new JavaSparkContext(sparkSession.sparkContext()).parallelize(results); then I created a schema as List<StructField> fields = new ArrayList<>(); fields.add(DataTypes.createStructField("column_name1", DataTypes.String

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-28 Thread Richard Xin
Maybe you could try something like that: SparkSession sparkSession = SparkSession.builder().appName("Rows2DataSet").master("local").getOrCreate(); List<Row> results = new LinkedList<>(); JavaRDD<Row> jsonRDD =

Re: Apache Spark MLIB

2017-02-24 Thread Jon Gregg
Here's a high level overview of Spark's ML Pipelines around when it came out: https://www.youtube.com/watch?v=OednhGRp938. But reading your description, you might be able to build a basic version of this without ML. Spark has broadcast variables
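For the broadcast-variable idea Jon mentions, a small sketch (the lookup map and data are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("broadcast-demo").getOrCreate()
    val sc = spark.sparkContext

    // Ship a small lookup table to every executor once, instead of once per task.
    val rates = Map("US" -> 1.0, "EU" -> 1.1, "UK" -> 1.3)
    val ratesBc = sc.broadcast(rates)

    val amounts = sc.parallelize(Seq(("US", 100.0), ("EU", 50.0), ("UK", 20.0)))
    val converted = amounts.map { case (region, amt) => (region, amt * ratesBc.value(region)) }
    converted.collect().foreach(println)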
