Re: [DISCUSS] SPIP: XML data source support

2023-07-19 Thread Burak Yavuz
+1 on adding to Spark. Community involvement will make the XML reader better. Best, Burak On Wed, Jul 19, 2023 at 3:25 AM Martin Andersson wrote: > Alright, makes sense to add it then. > -- > *From:* Hyukjin Kwon > *Sent:* Wednesday, July 19, 2023 11:01 > *To:* Mart

Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-05 Thread Burak Yavuz
I'm also a +1 on the newer APIs. We had a lot of learnings from using flatMapGroupsWithState and I believe that we can make the APIs a lot easier to use. On Wed, Nov 29, 2023 at 6:43 PM Anish Shrigondekar wrote: > Hi dev, > > Addressed the comments that Jungtaek had on the doc. Bumping the threa

Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-10 Thread Burak Yavuz
+1. Excited to see more stateful workloads with Structured Streaming! Best, Burak On Wed, Jan 10, 2024 at 8:21 AM Praveen Gattu wrote: > +1. This brings Structured Streaming a good solution for customers wanting > to build stateful stream processing applications. > > On Wed, Jan 10, 2024 at 7:

Re: Static partitioning in partitionBy()

2019-05-07 Thread Burak Yavuz
It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity. On Tue, May 7, 2019, 6:36 AM Shubham Chaurasia wrote:
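A minimal sketch of the two approaches mentioned above, written against the PySpark `DataFrameWriter` API. The helper names (`partition_dir`, `overwrite_partition_delta`, `overwrite_partition_files`) are made up for illustration, the predicate assumes a string-typed partition column, and the Delta variant assumes the Delta Lake package is on the classpath and that `df` only contains rows matching the predicate.

```python
def partition_dir(table_path, column, value):
    """Build a Hive-style partition directory such as tablePath/c=c1."""
    return f"{table_path.rstrip('/')}/{column}={value}"


def overwrite_partition_delta(df, table_path, column, value):
    """Replace only the rows matching the predicate, atomically (Delta Lake)."""
    # Assumption: the partition value is a string, so it is quoted in the predicate.
    predicate = f"{column} = '{value}'"
    (df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(table_path))
    return predicate


def overwrite_partition_files(df, table_path, column, value):
    """Write straight into one partition directory (plain Parquet, not atomic)."""
    target = partition_dir(table_path, column, value)
    # The partition value lives in the directory name, so drop the column from the data.
    df.drop(column).write.mode("overwrite").parquet(target)
    return target
```

With a real DataFrame this would be called as `overwrite_partition_delta(df, "/data/events", "c", "c1")`. The second helper gives up atomicity: readers may see a half-written directory while the job runs.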

Re: Issues with Delta Lake on 3.0.0 preview + preview 2

2019-12-30 Thread Burak Yavuz
I can't imagine any Spark data source using Spark internals compiled on Spark 2.4 working on 3.0 out of the box. There are many breaking changes. I'll try to get a *dev* branch for 3.0 soon (mid Jan). Best, Burak On Mon, Dec 30, 2019, 8:53 AM Jean-Georges Perrin wrote: > Hi there, > > Trying to

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Burak Yavuz
+1 On Mon, Mar 9, 2020 at 4:55 PM Reynold Xin wrote: > +1 > > > > On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge wrote: > >> +1 (non-binding) >> >> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer wrote: >> >>> +1 (non-binding) >>> >>> I am disappointed however that this only mentions API and not >>>

Re: [DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Burak Yavuz
Hey Russell, Great catch on the documentation. It seems out of date. I honestly am against having different DataSources having different default SaveModes. Users will have no clue if a DataSource implementation is V1 or V2. It seems weird that the default value can change for something that I have

Re: [DISCUSS] "complete" streaming output mode

2020-05-20 Thread Burak Yavuz
Oh wow. I never thought this would be up for debate. I use complete mode VERY frequently for all my dashboarding use cases. Here are some of my thoughts: > 1. It destroys the purpose of watermark and forces Spark to maintain all of state rows, growing incrementally. It only works when all keys are

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Burak Yavuz
+1 Best, Burak On Tue, Jun 9, 2020 at 1:48 PM Shixiong(Ryan) Zhu wrote: > +1 (binding) > > Best Regards, > Ryan > > > On Tue, Jun 9, 2020 at 4:24 AM Wenchen Fan wrote: > >> +1 (binding) >> >> On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao wrote: >> >>> +1 (non-binding) >>> >>> >>> >>> -- >>> Sen

Re: SPIP: Catalog API for view metadata

2020-08-13 Thread Burak Yavuz
My high level comment here is that as a naive person, I would expect a View to be a special form of Table that SupportsRead but doesn't SupportWrite. loadTable in the TableCatalog API should load both tables and views. This way you avoid multiple RPCs to a catalog or data source or metastore, and y

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-04 Thread Burak Yavuz
+1 On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan wrote: > +1 > > On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu > wrote: > >> +1. >> >> On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia >> wrote: >> >>> +1 from me too. >>> >>> Matei >>> >>> > On Nov 3, 2017, at 4:59 PM, Wenchen Fan wrote: >>> > >>> >

Re: Reload some static data during struct streaming

2017-11-13 Thread Burak Yavuz
I think if you don't cache the jdbc table, then it should auto-refresh. On Mon, Nov 13, 2017 at 1:21 PM, spark receiver wrote: > Hi > > I’m using struct streaming(spark 2.2) to receive Kafka msg ,it works > great. The thing is I need to join the Kafka message with a relative static > table stor

Re: queryable state & streaming

2017-12-08 Thread Burak Yavuz
Hi Stavros, Queryable state is definitely on the roadmap! We will revamp the StateStore API a bit, and a queryable StateStore is definitely one of the things we are thinking about during that revamp. Best, Burak On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" wrote: > Just to re-phrase my questi

Re: Welcoming some new committers

2018-03-04 Thread Burak Yavuz
Congrats all! Well deserved. On Sat, Mar 3, 2018 at 4:10 AM, Marco Gaido wrote: > Congratulations to you all! > > On 3 Mar 2018 8:30 a.m., "Liang-Chi Hsieh" wrote: > >> >> Congrats to everyone! >> >> >> Kazuaki Ishizaki wrote >> > Congratulations to everyone! >> > >> > Kazuaki Ishizaki >> > >>

Re: Structured Streaming with Watermark

2018-10-18 Thread Burak Yavuz
Hi Sandeep, Watermarks are used in aggregation queries to ensure correctness and clean up state. They don't allow you to drop records in map-only scenarios, which you have in your example. If you would do a test of `groupBy().count()` then you will see that the count doesn't increase with the last
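The difference between aggregation and map-only queries can be illustrated with a toy model in plain Python. This is a simplified sketch, not Spark's implementation: the function `run_batches`, the integer event times, and the rule "drop a record once its window end is at or below the watermark" are assumptions chosen to mirror the behaviour described above.

```python
from collections import defaultdict


def run_batches(batches, delay, window, aggregate=True):
    """Toy event-time watermark model.

    batches: list of micro-batches, each a list of integer event times.
    Returns (result, watermark), where result is the per-window counts for an
    aggregation query, or the list of emitted records for a map-only query.
    """
    watermark = 0
    max_event_time = 0
    counts = defaultdict(int)  # window start -> count (aggregation state)
    passed = []                # records emitted by a map-only query
    for batch in batches:
        for t in batch:
            if not aggregate:
                passed.append(t)  # map-only: the watermark filters nothing
                continue
            window_start = (t // window) * window
            if window_start + window <= watermark:
                continue  # window already finalized, late record is dropped
            counts[window_start] += 1
        # The watermark only advances between batches in this model.
        if batch:
            max_event_time = max(max_event_time, max(batch))
        watermark = max(watermark, max_event_time - delay)
    return (dict(counts) if aggregate else passed), watermark
```

For `batches = [[1, 2, 12], [25], [3]]` with `delay=5` and `window=10`, the late record `3` is not counted by the aggregation but still passes through the map-only path.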

Re: [SS] FlatMapGroupsWithStateExec with no commitTimeMs metric?

2018-11-25 Thread Burak Yavuz
Probably just oversight. Anyone is welcome to add it :) On Sun, Nov 25, 2018 at 8:55 AM Jacek Laskowski wrote: > Hi, > > Why is FlatMapGroupsWithStateExec not measuring the time taken on state > commit [1](like StreamingDeduplicateExec [2] and StreamingGlobalLimitExec > [3])? Is this on purpose?

Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread Burak Yavuz
Congrats Jose! On Tue, Jan 29, 2019 at 10:50 AM Xiao Li wrote: > Congratulations! > > Xiao > > Shixiong Zhu 于2019年1月29日周二 上午10:48写道: > >> Hi all, >> >> The Apache Spark PMC recently added Jose Torres as a committer on the >> project. Jose has been a major contributor to Structured Streaming. Pl

Re: spark-packages with maven

2016-07-15 Thread Burak Yavuz
Hi Ismael and Jacek, If you use Maven for building your applications, you may use the spark-package command line tool ( https://github.com/databricks/spark-package-cmd-tool) to perform packaging. It requires you to build your jar using maven first, and then does all the extra magic that Spark Pack

Re: Remove / update version in spark-packages.org

2016-07-26 Thread Burak Yavuz
Hi, It's bad practice to change jars for the same version and is prohibited in Spark Packages. Please bump your version number and make a new release. Best regards, Burak On Tue, Jul 26, 2016 at 3:51 AM, Julio Antonio Soto de Vicente < ju...@esbet.es> wrote: > Hi all, > > Maybe I am missing som

Re: Spark SQL JSON Column Support

2016-09-28 Thread Burak Yavuz
I would really love something like this! It would be great if it doesn't throw away corrupt_records like the Data Source. On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande wrote: > We are currently pulling out the JSON columns, passing them through > read.json, and then joining them back onto the i

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Burak Yavuz
+1 On Sep 29, 2016 4:33 PM, "Kyle Kelley" wrote: > +1 > > On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai wrote: > >> +1 >> >> On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende >> wrote: >> >>> +1 (non-binding) >>> >>> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin >>> wrote: >>> Please vote on r

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Burak Yavuz
Hi Maciej, I believe it would be useful to either fix the documentation or fix the implementation. I'll leave it to the community to comment on. The code right now disallows intervals provided in months and years, because they are not a "consistently" fixed amount of time. A month can be 28, 29, 3
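The "not a consistently fixed amount of time" point is easy to check with the standard library. The sketch below is only an illustration: `FIXED_UNITS` and `interval_seconds` are hypothetical names, not part of Spark, and the set of accepted units is an assumption.

```python
import calendar

# Units with a fixed length in seconds (illustrative list, not Spark's).
FIXED_UNITS = {
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 604800,
}


def month_lengths(year):
    """Number of days in each month of the given year."""
    return [calendar.monthrange(year, month)[1] for month in range(1, 13)]


def interval_seconds(amount, unit):
    """Convert an interval to seconds, rejecting variable-length units."""
    if unit not in FIXED_UNITS:
        raise ValueError(f"'{unit}' is not a fixed amount of time")
    return amount * FIXED_UNITS[unit]
```

Because a month spans 28 to 31 days, a window defined as "1 month" has no single duration, which is why such a helper can only accept units like hours or days.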

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Burak Yavuz
Thank you very much everyone! Hoping to help out the community as much as I can! Best, Burak On Tue, Jan 24, 2017 at 2:29 PM, Jacek Laskowski wrote: > Wow! At long last. Congrats Burak and Holden! > > p.s. I was a bit worried that the process of accepting new committers > is equally hard as pas

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Burak Yavuz
Congrats Takuya! On Mon, Feb 13, 2017 at 2:17 PM, Dilip Biswal wrote: > Congratulations, Takuya! > > Regards, > Dilip Biswal > Tel: 408-463-4980 <(408)%20463-4980> > dbis...@us.ibm.com > > > > - Original message - > From: Takeshi Yamamuro > To: dev > Cc: > Subject: Re: welcoming Takuya

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-20 Thread Burak Yavuz
Hi, We plan to add a more comprehensive local linear algebra package for MLlib 1.4. This local linear algebra package can then easily be extended to BlockMatrix to support the same operations in a distributed fashion. You may find the JIRA to track this here: SPARK-6442

Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Burak Yavuz
This is awesome! I can write the apps for it, to make the Web UI more functional! On Wed, Apr 1, 2015 at 12:37 AM, Tathagata Das wrote: > This is a significant effort that Reynold has undertaken, and I am super > glad to see that it's finally taking a concrete form. Would love to see > what the

Re: Fwd: [jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-14 Thread Burak Yavuz
Hi Sean and fellow devs, I also wanted to chime in and remind people of . Just because the work of someone doesn't fit into the broader scope of things, devs should be encouraged to showcase their hard work in Spark Packages. We have been working hard to make it easier for devs to share their work

Re: CSV Support in SparkR

2015-06-02 Thread Burak Yavuz
Hi, cc'ing Shivaram here, because he worked on this yesterday. If I'm not mistaken, you can use the following workflow: ```./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3``` and then ```df <- read.df(sqlContext, "/data", "csv", header = "true")``` Best, Burak On Tue, Jun 2, 2015 a

Re: Ivy support in Spark vs. sbt

2015-06-04 Thread Burak Yavuz
Hi Marcelo, This is interesting. Could you please send me links to any failing builds if you see that problem? For now you can set a conf: `spark.jars.ivy` to use a path other than `~/.ivy2` for Spark. Thanks, Burak On Thu, Jun 4, 2015 at 4:29 AM, Sean Owen wrote: > I've definitely seen the "

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-04 Thread Burak Yavuz
+1 Tested on Mac OS X Burak On Thu, Jun 4, 2015 at 6:35 PM, Calvin Jia wrote: > +1 > > Tested with input from Tachyon and persist off heap. > > On Thu, Jun 4, 2015 at 6:26 PM, Timothy Chen wrote: > >> +1 >> >> Been testing cluster mode and client mode with mesos with 6 nodes cluster. >> >> Ev

Re: unsafe/compile error

2015-06-21 Thread Burak Yavuz
You need to build an assembly jar for the cluster tests to pass. You may use 'sbt assembly/assembly'. Best, Burak On Jun 21, 2015 3:43 AM, "acidghost" wrote: > After an sbt update the tests run. But all the "cluster" ones fail on "task > size should be small in both training and prediction" > >

Re: unsafe/compile error

2015-06-21 Thread Burak Yavuz
In addition, if you want to run a single suite, you may use: mllib/testOnly $SUITE_NAME with sbt. On Jun 21, 2015 10:32 AM, "Burak Yavuz" wrote: > You need to build an assembly jar for the cluster tests to pass. You may > use 'sbt assembly/assembly'. > > Best,

Re: [GraphX] Graph 500 graph generator

2015-06-24 Thread Burak Yavuz
Hi Ryan, If you can get past the paperwork, I'm sure this can make a great Spark Package (http://spark-packages.org). People then can use it for benchmarking purposes, and I'm sure people will be looking for graph generators! Best, Burak On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan wrote: >

Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Burak Yavuz
+1 nonbinding. On Thu, Jul 9, 2015 at 7:38 AM, Sean Owen wrote: > +1 nonbinding. All previous RC issues appear resolved. All tests pass > with the "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver" invocation. > Signatures et al are OK. > > On Thu, Jul 9, 2015 at 6:55 AM, Patrick Wendell > wrote:

Re: BlockMatrix multiplication

2015-07-14 Thread Burak Yavuz
Hi Alexander, From your example code, using the GridPartitioner, you will have 1 column, and 5 rows. When you perform an A^T^A multiplication, you will generate a separate GridPartitioner with 5 columns and 5 rows. Therefore you are observing a huge shuffle. If you would generate a diagonal-block

Re: BlockMatrix multiplication

2015-07-15 Thread Burak Yavuz
> > bm.validate() > > val t = System.nanoTime() > > // multiply matrix with itself > > val aa = bm.multiply(bm) > > aa.validate() > > println(rows + "x" + columns + ", block:" + blockSize + "\t" + > (System.nanoTime() - t) / 1e9) > >

Re: BlockMatrix multiplication

2015-07-17 Thread Burak Yavuz
mit a JIRA Issue related to the problem of block matrix > shuffling given the blocks co-location? > > > > Best regards, Alexander > > > > *From:* Burak Yavuz [mailto:brk...@gmail.com] > *Sent:* Wednesday, July 15, 2015 3:29 PM > > *To:* Ulanov, Alexander > *Cc:* Rakesh Ch

Re: FrequentItems in spark-sql-execution-stat

2015-08-01 Thread Burak Yavuz
Hi Yucheng, Thanks for pointing out the issue. You are correct, in the case that the final map is completely empty after the merge, we do need to add the final element to the map, with the correct count (decrement the count with the max count that was already in the map). I'll submit a fix for it.
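The edge case can be sketched with a weighted Misra-Gries style counter in plain Python. This is an illustration under assumptions, not the code in Spark's `FrequentItems`: the class below and its decrement rule are a generic reconstruction, chosen so that an incoming element is still recorded with its reduced count when the merge empties the map.

```python
class FreqItemCounter:
    """Bounded frequent-items counter (weighted Misra-Gries style sketch)."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self.size = size
        self.counts = {}

    def add(self, key, count=1):
        if key in self.counts:
            self.counts[key] += count
        elif len(self.counts) < self.size:
            self.counts[key] = count
        else:
            # Map is full: decrement everything by the largest amount that
            # the smallest entry and the new element can both absorb.
            decrement = min(count, min(self.counts.values()))
            self.counts = {
                k: v - decrement for k, v in self.counts.items() if v > decrement
            }
            remainder = count - decrement
            if remainder > 0:
                # Even if the map is now empty, keep the new element
                # with its decremented count instead of losing it.
                self.counts[key] = remainder
        return self

    def merge(self, other):
        """Fold another counter's entries into this one."""
        for key, count in other.counts.items():
            self.add(key, count)
        return self
```

With `size=2`, adding `a` and `b` once each and then `c` with count 5 empties the map during the decrement and leaves only `c` with count 4.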

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Burak Yavuz
+1. Tested complex R package support (Scala + R code), BLAS and DataFrame fixes good. Burak On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman wrote: > Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile), > Standalone without any problems. Re-tested dynamic allocation specifically. > > "L

Re: Export BLAS module on Spark MLlib

2015-11-30 Thread Burak Yavuz
Or you could also use reflection like in this Spark Package: https://github.com/brkyvz/lazy-linalg/blob/master/src/main/scala/com/brkyvz/spark/linalg/BLASUtils.scala Best, Burak On Mon, Nov 30, 2015 at 12:48 PM, DB Tsai wrote: > The workaround is have your code in the same package, or write som

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Burak Yavuz
+1 tested SparkSQL and Streaming on some production sized workloads On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra wrote: > +1 > > On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust > wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 1.6.0! >> >> The vote is

Re: Spark not able to fetch events from Amazon Kinesis

2016-01-30 Thread Burak Yavuz
Hi Yash, I've run into multiple problems due to version incompatibilities, either due to protobuf or jackson. That may be your culprit. The problem is that all failures in the Kinesis Client Library are silent and therefore don't show up in the logs. It's very hard to debug those buggers. Best, Burak On

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-08 Thread Burak Yavuz
+1 On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: > +1 > > 2016-03-08 10:59 GMT-08:00 Yin Huai : > >> +1 >> >> On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin wrote: >> >>> +1 (binding) >>> >>> >>> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov >>> wrote: >>> +1 Spark ODBC server

Re: 15 new MLlib algorithms

2014-07-09 Thread Burak Yavuz
Hi, The roadmap for the 1.1 release and MLLib includes algorithms such as: Non-negative matrix factorization, Sparse SVD, Multiclass decision tree, Random Forests (?) and optimizers such as: ADMM, Accelerated gradient methods also a statistical toolbox that includes: descriptive statistics, sa

Re: Hello All

2014-08-05 Thread Burak Yavuz
Hi Guru, Take a look at: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark It has all the information you need on how to contribute to Spark. Also take a look at: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-28 Thread Burak Yavuz
+1. Tested MLlib algorithms on Amazon EC2, algorithms show speed-ups between 1.5-5x compared to the 1.0.2 release. - Original Message - From: "Patrick Wendell" To: dev@spark.apache.org Sent: Thursday, August 28, 2014 8:32:11 PM Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC2) I'll k

Re: [mllib] State of Multi-Model training

2014-09-16 Thread Burak Yavuz
Hi Kyle, I'm actively working on it now. It's pretty close to completion, I'm just trying to figure out bottlenecks and optimize as much as possible. As Phase 1, I implemented multi model training on Gradient Descent. Instead of performing Vector-Vector operations on rows (examples) and weights,

Re: [mllib] State of Multi-Model training

2014-09-16 Thread Burak Yavuz
ny feedback from you and the rest of the community! Best, Burak - Original Message - From: "Kyle Ellrott" To: "Burak Yavuz" Cc: dev@spark.apache.org Sent: Tuesday, September 16, 2014 9:41:45 PM Subject: Re: [mllib] State of Multi-Model training I'd be intereste

Re: [mllib] State of Multi-Model training

2014-09-17 Thread Burak Yavuz
I believe it will be in the main repo. Burak - Original Message - From: "Kyle Ellrott" To: "Burak Yavuz" Cc: dev@spark.apache.org Sent: Wednesday, September 17, 2014 9:48:54 AM Subject: Re: [mllib] State of Multi-Model training This sounds like a pretty major re

Re: [DISCUSS] SPIP: Declarative Pipelines

2025-04-09 Thread Burak Yavuz
+1 On Wed, Apr 9, 2025 at 4:33 PM Szehon Ho wrote: > +1 really excited to finally see Materialized View finally make its way to > Spark, as many other ecosystem projects (Trino, Starrocks, soon Iceberg) > already supporting it. > > Thanks > Szehon > > On Wed, Apr 9, 2025 at 2:33 AM Martin Grund