Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Maciej
.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>. > > * There are many important features missing that are very common in data science.

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Maciej
ase vote on the SPIP for the next 72 hours: > > [ ] +1: Accept the proposal as an official SPIP > [ ] +0 > [ ] -1: I don’t think this is a good idea because … > > -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.codementor.io/@zero323 PGP: A30CEF0C31A501EC OpenPGP_signature Description: OpenPGP digital signature

Re: Time to start publishing Spark Docker Images?

2021-08-16 Thread Maciej
> > -- > > Twitter: https://twitter.com/holdenkarau > > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 > > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.codementor.io/@zero323 PGP: A30CEF0C31A501EC

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Maciej
The author will in no case be liable for any monetary > damages arising from such loss, damage or destruction. > > On Mon, 16 Aug 2021 at 18:46, Maciej <mszymkiew...@gmail.com> wrote: > > I have a few concerns regarding PySpark and SparkR im

Nabble archive is down

2021-08-17 Thread Maciej
e ASF archives? -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.codementor.io/@zero323 PGP: A30CEF0C31A501EC OpenPGP_signature Description: OpenPGP digital signature

[MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Maciej
Hi All, Just wondering ‒ would it make sense to add .github/FUNDING.yml with custom link pointing to one (or both) of these: * https://www.apache.org/foundation/sponsorship.html * https://www.apache.org/foundation/contributing.html -- Best regards, Maciej Szymkiewicz Web: https://zero323

Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Maciej
ct > > On Wed, Dec 15, 2021, 8:34 AM Maciej wrote: > > Hi All, > > Just wondering ‒ would it make sense to add > .github/FUNDING.yml with custom link pointing to one (or both) > of these: > > * https://www.apache.org/found

[R] SparkR on conda-forge

2021-12-19 Thread Maciej
Hi everyone, FYI ‒ thanks to good folks from conda-forge we have now these: * https://github.com/conda-forge/r-sparkr-feedstock * https://anaconda.org/conda-forge/r-sparkr -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC OpenPGP_signature Description

Re: PySpark Dynamic DataFrame for easier inheritance

2021-12-29 Thread Maciej
mn` and `select` methods and > the expected output. > > I'm sharing this here in case you feel like this approach can be > useful for anyone else. In our case it greatly sped up the > development of abstraction layers and allowed us to write cleaner > code. One of the advantages is that it would simply be a "plugin" > over pyspark, that does not modify anyhow already existing code or > application interfaces. > > If you think that this can be helpful, I can write a PR as a more > refined proof of concept. > > Thanks! > > Pablo > -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC OpenPGP_signature Description: OpenPGP digital signature

Re: PySpark Dynamic DataFrame for easier inheritance

2021-12-29 Thread Maciej
On 12/29/21 16:18, Pablo Alcain wrote: > Hey Maciej! Thanks for your answer and the comments :) > > On Wed, Dec 29, 2021 at 3:06 PM Maciej <mszymkiew...@gmail.com> wrote: > > This seems like a lot of trouble for not so common use case that has > vi

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Maciej
impact compatibility should be worked on > immediately. Everything else please > retarget to an appropriate release. > == But my bug isn't > fixed? == In order to > make timely releases, we will typically > not hold the release unless the bug in > question is a regression from the > previous release. That being said, if > there is something which is a regression > that has not been correctly targeted > please ping me or a committer to help > target the issue. > > -- > Bjørn Jørgensen > Vestre Aspehaug 4, 6010 Ålesund > Norge > > +47 480 94 297 > -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC

Re: [How To] run test suites for specific module

2022-01-24 Thread Maciej
file in a specific module. > > I would really appreciate any suggestion or comment. > > > Best regards, > > Fangjia Shen > > Purdue University > > > -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC OpenPGP_signature Description: OpenPGP digital signature

Re: Apache Spark 3.3 Release

2022-03-06 Thread Maciej
here are any remaining works for Spark > 3.3, and switch to QA mode, cut a branch and keep everything on track. I > would like to volunteer to help drive this process. > > Best regards, > Max Gekk -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC OpenPGP_signature Description: OpenPGP digital signature

Re: Apache Spark 3.3 Release

2022-04-29 Thread Maciej
wrote: > >> > >> >> > >> > >> >> Let me clarify my above suggestion. Maybe > we can wait 3 more days to collect the list of > actively developed PRs th

Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-17 Thread Maciej
" in > many places when we: import pyspark.pandas as ps. > This is similar to "Structured Streaming" in JIRA, and "SS" in PR title. > > I think it'd be easier to track the changes here with that. > Currently it's a bit difficult to

Re: Welcoming three new PMC members

2022-08-10 Thread Maciej
> > >>>>> The Spark PMC > > > > > > > > > > > > -- > > > Takuya UESHIN > > > > > > > ---

Re: Welcome Xinrong Meng as a Spark committer

2022-08-10 Thread Maciej
--- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC

Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

2022-08-14 Thread Maciej
--- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC

Re: Is it possible to specify explicitly map() key/value types?

2022-08-27 Thread Maciej
Is it possible to create a map by specifying the key-value type explicitly? So far, I came up with a workaround using map('', '') to initialise the map for string key-value and using map_filter() to exclude/remove the redundant map('', '') key-value item:

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Maciej
engliang - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC

Re: Syndicate Apache Spark Twitter to Mastodon?

2022-12-11 Thread Maciej
au> -- Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC

Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Maciej
y Process A. How can I achieve that? I've tried  pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but it will create a new spark context. -- Best regards, Maciej Szymkiewicz Web: h

Re: How can I get the same spark context in two different python processes

2022-12-13 Thread Maciej
r Py4j for that matter) so don't do it unless you fully understand the implications (including, but not limited to, risk of leaking the token). Use this approach at your own risk. On 12/13/22 03:52, Kevin Su wrote: Maciej, Thanks for the reply. Could you share an example to achi

Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Maciej
so the release can happen twice every year regardless of the actual release date. I believe it both makes the release cadence predictable, and relaxes the burden about making releases. WDYT? -- Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501

Re: Slack for Spark Community: Merging various threads

2023-04-06 Thread Maciej
here? I'm glad to work with whomever to help manage the various aspects of Slack (code of conduct, linen.dev and search/archive process, invite management, etc.). HTH! Denny -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC

Re: Slack for Spark Community: Merging various threads

2023-04-06 Thread Maciej
-- Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 4/6/23 17:13, Denny Lee wrote: Thanks Dongjoon, but I don't think this is misleading insofar that this is not a /self-service process/ but an invite process which admittedly I did not state explicitly i

Re: Slack for Spark Community: Merging various threads

2023-04-08 Thread Maciej
and all of us. -- Maciej On 4/7/23 21:02, Bjørn Jørgensen wrote: Yes, I have done some search for slack alternatives <https://itsfoss.com/open-source-slack-alternative/> I feel that we should do some search, to find if there can be a better solution than slack. For what I have found, th

Re: [CONNECT] New Clients for Go and Rust

2023-05-19 Thread Maciej
survived while none is particularly active, as far as I'm aware.  Taking responsibility for more clients, without being sure that we have resources to maintain them and there is enough community around them to make such effort worthwhile, doesn't seem like a good idea. -- Be

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-26 Thread Maciej
Weren't some of these functions provided only for compatibility  and intentionally left out of the language APIs? -- Best regards, Maciej On 5/25/23 23:21, Hyukjin Kwon wrote: I don't think it'd be a release blocker .. I think we can implement them across multiple releases.

Re: [CONNECT] New Clients for Go and Rust

2023-05-26 Thread Maciej
onnect is, it is not exactly a replacement for many existing deployments. Furthermore, it doesn't make extending Spark much easier and the current ecosystem is, subjectively speaking, a bit brittle. -- Best regards, Maciej On 5/26/23 07:26, Martin Grund wrote: Thanks everyone for your fe

Re: [CONNECT] New Clients for Go and Rust

2023-06-01 Thread Maciej
tending Spark functionality while using Spark Connect, effectively limiting the target audience for any 3rd party library. > Martin > > > On Fri, May 26, 2023 at 5:39 PM Maciej > wrote: > > It might be a good idea to have a discussion about how new connect > cli

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Maciej
extensible or customizable sources, in case there is such a need. -- Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 6/20/23 05:19, Hyukjin Kwon wrote: Actually I support this idea in a way that Python developers don't have to learn Scala to write their own s

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-20 Thread Maciej
then, -1 for the following reasons: - Relevant ASF policy seems to say this is fine, as argued at https://lists.apache.org/thread/p15tc772j9qwyvn852sh8ksmzrol9cof - There is no argument any of this has caused a problem for the community anyway; there is just nothing to

Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Maciej
+1 -- Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 6/21/23 17:35, Holden Karau wrote: A small request, it’s pride weekend in San Francisco where some of the core developers are and right before one of the larger spark related conferences so more folks

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-24 Thread Maciej
sources through 3rd party FDWs? Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 6/20/23 16:23, Wenchen Fan wrote: In an ideal world, every data source you want to connect to already has a Spark data source implementation (either v1 or v2), then this Python API is

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Maciej
experience in terms of reliability and execution cost. Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 6/24/23 23:42, Martin Grund wrote: Hey, I would like to express my strong support for Python Data Sources even though they might not be immediately as powerful as

Re: [VOTE][SPIP] Python Data Source API

2023-07-06 Thread Maciej
+0 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 7/6/23 17:41, Xiao Li wrote: +1 Xiao On Wed, 5 Jul 2023 at 17:28, Hyukjin Kwon wrote: +1. See https://youtu.be/yj7XlTB1Jvc?t=604 :-). On Thu, 6 Jul 2023 at 09:15, Allison Wang wrote: Hi all

Re: [DISCUSS] SPIP: XML data source support

2023-07-19 Thread Maciej
That's a great idea, as long as we can keep additional dependencies under control. Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 7/19/23 18:22, Franco Patano wrote: +1 Many people have struggled with incorporating this separate library into their

Re: [VOTE] SPIP: XML data source support

2023-07-29 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 7/29/23 11:28, Mich Talebzadeh wrote: +1 for me. Though Databricks did a good job releasing the code. GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames <https://github.

Re: LLM script for error message improvement

2023-08-03 Thread Maciej
eptable within the project. Ideally, with an official opinion from the ASF as the copyright owner. WDYT All? Shall we start a separate discussion? Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 8/3/23 18:33, Haejoon Lee wrote: Additional information: Please c

Re: LLM script for error message improvement

2023-08-04 Thread Maciej
he tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by: ”.' and consider adjusting PR template / merge tool accordingly. Best regards, Maciej Szymkiewicz Web:https://

Re: [VOTE] Updating documentation hosted for EOL and maintenance releases

2023-09-26 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 9/26/23 17:12, Michel Miotto Barbosa wrote: +1 A disposição | At your disposal Michel Miotto Barbosa https://www.linkedin.com/in/michelmiottobarbosa/ mmiottobarb...@gmail.com +55 11 984 342 347 On Tue

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
oing so I've noticed that all the methods that aren't in the pyi file > are *unable to be called from other python files*. I was unaware of > this effect of the pyi files. As soon as you create the files, all the > methods are shielded from external access. Feels like going b

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
in the discussion Pandas didn't type check and had no clear timeline for advertising annotations. -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.codementor.io/@zero323 PGP: A30CEF0C31A501EC signature.asc Description: OpenPGP digital signature

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
se, just my excitement to see this happen. Any action > points that we can define and that I can help on? I'm fine with taking > the route that Hyukjin suggests :) > > Cheers, Fokko > > On Thu, 27 Aug 2020 at 18:45, Maciej <mszymkiew...@gmail.com> wrote:

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
In short term there are also some upstream changes that haven't been reflected in stubs master... On 8/27/20 10:24 PM, Driesprong, Fokko wrote: > . Any action points that we can define and that I can help on? I'm > fine with taking the route that Hyukjin suggests :) > -- Bes

[DISCUSS][R] Adding magrittr as a dependency for SparkR

2020-09-30 Thread Maciej
just anecdotal evidence, most of the SparkR applications I've seen out there, already use magrittr. Non-goals: * Supporting non-standard evaluation. Thanks in advance for your input. -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero3

Broken rlang installation on AppVeyor

2020-10-08 Thread Maciej
ween rlang 0.4.7 and 0.4.8 (previous run with 0.4.7 https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/35630069), but is there any reason why we seem to default to i386 (https://github.com/apache/spark/blob/c5f6af9f17498bb0ec393c16616f2d99e5d3ee3d/dev/appveyor-install-dependencies.ps1#L22) for R i

Re: Broken rlang installation on AppVeyor

2020-10-09 Thread Maciej
". Can you > open a PR to change? > > On Fri, 9 Oct 2020 at 4:36 AM, Maciej <mszymkiew...@gmail.com> wrote: > > Hi Everyone, > > I've been digging into AppVeyor test failures for > https://github.com/apache/spark/pull/29978 > > >

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Maciej
on, and R > functions, without requiring maintainers to write tests for each > language's version of the functions. Would that address the > maintenance burden? With R we don't really test most of the functions beyond the simple "callability". One the complex

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Maciej
function, whereas in SQL/Python they are just one. There's a > limit on how many functions we can add, and it also makes it > difficult to browse through the docs when there are a lot of > functions. > > > > On Thu, Jan 28, 2021 at

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 4/15/24 8:16 PM, Rui Wang wrote: +1, non-binding. Thanks Dongjoon to drive this! -Rui On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng wrote: +1 Thank you @Dongjoon Hyun <mailto:dongjoo

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web:https://zero323.net PGP: A30CEF0C31A501EC On 4/25/24 6:21 PM, Reynold Xin wrote: +1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: +1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: FYI, there is a proposal to drop

Re: Introduce FORMAT clause to CAST with SQL:2016 datetime patterns

2019-03-20 Thread Maciej Szymkiewicz
nd Spark to take a look at my > proposal and share their opinion from their own component's perspective. If > we get on the same page I'll eventually open Jiras to cover this > improvement for each mentioned systems. > > Cheers, > Gabor > > > > -- Regards, Maciej

Is SPARK-9961 is still relevant?

2019-10-05 Thread Maciej Szymkiewicz
lanned defaultEvaluator was the primary reason to use such annotation there. -- Best regards, Maciej

[DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-24 Thread Maciej Szymkiewicz
well? -- Best regards, Maciej

Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-30 Thread Maciej Szymkiewicz
rt next January > (https://spark.apache.org/versioning-policy.html), > I'm +1 for the deprecation (Python < 3.6) > at Apache Spark 3.0.0. > > It's just a deprec

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Maciej Szymkiewicz
ndard) > > We should still add PostgreSQL features that Spark doesn't have, or > Spark's behavior violates SQL standard. But for others, let's just > update the answer files of PostgreSQL tests. > > Any comments are welcome! > > Thanks, > Wenchen -- Best regards, Maciej

Re: Apache Spark Docker image repository

2020-02-06 Thread Maciej Szymkiewicz
>       (This can be used in GitHub Action Jobs and Jenkins K8s > Integration Tests to speed up jobs and to have more stabler > environments) > > > > Bests, > > Dongjoon. > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.

Re: Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-18 Thread Maciej Szymkiewicz
; treated as private. > > Is this intentional?  If so, what's the rationale?  If not, then it > feels like a bug and DataFrame should have some form of public access > back to the context/session.  I'm happy to log the bug but thought I > would ask here first.  Thanks! --

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
to the main > repository? > > > > -- > Sent from: > http://apache-spark-developers-list.1001551.n3.nabble.com/ > > > ---

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > <mailto:dev-unsubscr...@spark.apache.org> > > > > -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, H

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
e. -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.codementor.io/@zero323 PGP: A30CEF0C31A501EC signature.asc Description: OpenPGP digital signature

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
sync right? > Do you provide different stubs for different versions of Python? I had to > look up the literals: https://www.python.org/dev/peps/pep-0586/ > I think it is more about portability between Spark versions > > > Cheers, Fokko > > Op wo 22 jul. 2020 om 09:40 schr

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
n for separate git repo? >> >> >> From: Hyukjin Kwon >> Sent: Monday, August 3, 2020 1:58:55 AM >> To: Maciej Szymkiewicz >> Cc: Driesprong, Fokko ; Holden Karau >> ; Spark Dev List >> Subject: Re: [PySpark] Revisiting PySpark type ann

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
/pyspark-stubs/graphs/contributors) and at least some use cases (https://stackoverflow.com/q/40163106/). So, subjectively speaking, it seems we're already beyond POC. -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
why I asked. > > >   > ---- > *From:* Maciej Szymkiewicz > *Sent:* Tuesday, August 4, 2020 12:59 PM > *To:* Sean Owen > *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; > Spark Dev List > *Subject:* Re: [PySpark] Revisiting PySpark type annotat

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given popularity of related SO questions: - https://stackoverflow.com/q/41670103/1560062 - https://stackoverflow.com/q/42465568/1560062 - https://stackoverflow.com/q/41670103/1560062 it is probably more "nobody thought about asking", than "it is not used often". On Wed, 22 Aug 2018 at
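
For readers unfamiliar with the operation under discussion, here is a minimal plain-Python sketch of what an unpivot does. It is only an illustration of the reshaping, not Spark code; the function name `unpivot` and the column names are made up for the example.

```python
from typing import Any, Dict, Iterable, Iterator, List


def unpivot(
    rows: Iterable[Dict[str, Any]],
    id_cols: List[str],
    value_cols: List[str],
    var_name: str = "variable",
    value_name: str = "value",
) -> Iterator[Dict[str, Any]]:
    """Turn each wide row into one long row per value column.

    Hypothetical helper for illustration only; not a Spark API.
    """
    for row in rows:
        ids = {c: row[c] for c in id_cols}
        for col in value_cols:
            # One output row per (input row, value column) pair.
            yield {**ids, var_name: col, value_name: row[col]}


wide = [
    {"id": 1, "q1": 10, "q2": 20},
    {"id": 2, "q1": 30, "q2": 40},
]
long_rows = list(unpivot(wide, id_cols=["id"], value_cols=["q1", "q2"]))
```

Each input row with two value columns becomes two output rows, so the two wide rows above yield four long rows.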

Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Maciej Szymkiewicz
Hi Imran, On Wed, 29 Aug 2018 at 22:26, Imran Rashid wrote: > Hi Li, > > yes that makes perfect sense. That more-or-less is the same as my view, > though I framed it differently. I guess in that case, I'm really asking: > > Can pyspark changes please be accompanied by more unit tests, and not

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options - Use stub files and limit type hint support to Python 3 only. Python 3 users benefit from type hints, Python 2 users don't, but no core functionality is affected. This is the approach I've used with https://git

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
arry on, but do we want to take that baggage into Apache Spark 3.x > era? The next time you may drop it would be only 4.0 release because > of breaking change. > > -- > ,,,^..^,,, > On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz > wrote: > > > > There is no need to d

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by default (with the exception of __init__). There is a :special-members: option which could be passed to, for example, autoclass. On Tue, 23 Oct 2018 at 21:32, Sean Owen wrote: > (& and | are both logical and bitwise operators in Jav
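
To make the point about dunder methods concrete, here is a small sketch of how boolean column operators are typically implemented in Python. The `Column` class below is a hypothetical stand-in written for this example, not the PySpark class; it only shows that `&`, `|` and `~` map to `__and__`, `__or__` and `__invert__`, and that these methods can carry docstrings that a documentation tool has to be told to pick up.

```python
class Column:
    """Hypothetical stand-in for a column expression; not the PySpark class."""

    def __init__(self, expr: str) -> None:
        self.expr = expr

    def __and__(self, other: "Column") -> "Column":
        """Logical AND of two column expressions."""
        return Column(f"({self.expr} AND {other.expr})")

    def __or__(self, other: "Column") -> "Column":
        """Logical OR of two column expressions."""
        return Column(f"({self.expr} OR {other.expr})")

    def __invert__(self) -> "Column":
        """Logical NOT of a column expression."""
        return Column(f"(NOT {self.expr})")


# `~` binds tighter than `|`, and the parentheses force `&` to apply first.
cond = (Column("a > 1") & Column("b < 2")) | ~Column("c")

# The docstrings exist on the dunder methods even if a doc tool skips them.
documented = [
    name
    for name in ("__and__", "__or__", "__invert__")
    if getattr(Column, name).__doc__
]
```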

[PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Maciej Szymkiewicz
ance. -- Best, Maciej

Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
ini >>>> Data Engineer >>>> mobile: +98 912 468 1859 >>>> site: www.moein.xyz >>>> email: moein...@gmail.com >> -- Regards, Maciej

Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
Hi, Did anyone measure performance of Spark 2.0 vs Spark 1.6 ? I did some tests on a parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes 2x slower. I tested the following queries: 1) select count(*) where id > some_id In this query we have PPD and performance i

Re: Spark 2.0 Performance drop

2016-06-29 Thread Maciej Bryński
2016-06-29 23:22 GMT+02:00 Michael Allman : > I'm sorry I don't have any concrete advice for you, but I hope this helps > shed some light on the current support in Spark for projection pushdown. > > Michael Michael, Thanks for the answer. This resolves one of my questions. Which Spark version you

Re: Spark 2.0 Performance drop

2016-06-30 Thread Maciej Bryński
> have them. > > Cheers, > > Michael > >> On Jun 29, 2016, at 2:39 PM, Maciej Bryński wrote: >> >> 2016-06-29 23:22 GMT+02:00 Michael Allman : >>> I'm sorry I don't have any concrete advice for you, but I hope this helps >>> shed some l

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Maciej Bryński
-1 https://issues.apache.org/jira/browse/SPARK-16379 https://issues.apache.org/jira/browse/SPARK-16371 2016-07-06 7:35 GMT+02:00 Reynold Xin : > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes > i

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Maciej Bryński
@Sean Owen, As we're not planning to implement DataSets in Python do you plan to revert this Jira ? https://issues.apache.org/jira/browse/SPARK-13594 2016-07-19 10:07 GMT+02:00 Sean Owen : > I think unfortunately at least this one is gonna block: > https://issues.apache.org/jira/browse/SPARK-1662

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Maciej Bryński
@Reynold Xin, How will this work with Hive support? Will SparkSession.sqlContext return a HiveContext? 2016-07-19 0:26 GMT+02:00 Reynold Xin : > Good idea. > > https://github.com/apache/spark/pull/14252 > > > > On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust > wrote: >> >> + dev, reynold >> >> Yeah

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Maciej Bryński
@Michael, I answered in Jira and can repeat it here. I think that my problem is unrelated to Hive, because I'm using the read.parquet method. I also attached some VisualVM snapshots to SPARK-16321 (I think I should merge both issues). And code profiling suggests a bottleneck when reading the parquet file. I wo

Re: Spark jdbc update SaveMode

2016-07-22 Thread Maciej Bryński
2016-07-22 23:05 GMT+02:00 Ramon Rosa da Silva : > Hi Folks, > > What do you think about allow update SaveMode from > DataFrame.write.mode(“update”)? > > Now Spark just has jdbc insert. I'm working on a patch that creates a new mode - 'upsert'. In MySQL it will use the 'REPLACE INTO' command. M. ---
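
As a rough illustration of the upsert semantics mentioned above, the sketch below runs `REPLACE INTO` against an in-memory SQLite database. SQLite is used here only because it ships with Python and also accepts `REPLACE INTO`; it is an assumption standing in for MySQL, the table and column names are made up, and this is not the patch itself.

```python
import sqlite3

# In-memory database standing in for the JDBC target (assumption: the
# thread talks about MySQL; SQLite is used only to keep this runnable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'old')")

# Rows to upsert: id 1 already exists, id 2 is new.
rows = [(1, "new"), (2, "added")]

# REPLACE INTO deletes any row with a conflicting key, then inserts.
conn.executemany("REPLACE INTO users (id, name) VALUES (?, ?)", rows)
conn.commit()

result = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
conn.close()
```

Note that `REPLACE INTO` is a delete followed by an insert, so any column not listed in the statement is reset to its default rather than preserved.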

What happens in Dataset limit followed by rdd

2016-08-01 Thread Maciej Szymkiewicz
Hi everyone, This doesn't look like something expected, does it? http://stackoverflow.com/q/38710018/1560062 A quick glance at the UI suggests that there is a shuffle involved and the input for first is ShuffledRowRDD. -- Best regards, Maciej Szymkiewicz

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Maciej Szymkiewicz
owever, in the second case, the optimisation in the CollectLimitExec > does not help, because the previous limit operation involves a shuffle > operation. All partitions will be computed, and running LocalLimit(1) > on each partition to get 1 row, and then all partitions are shuffled >

Re: What happens in Dataset limit followed by rdd

2016-08-03 Thread Maciej Szymkiewicz
mply pushes down across mapping functions, > because the number of rows may change across functions. for example, > flatMap() > > It seems that limit can be pushed across map() which won’t change the > number of rows. Maybe this is a room for Spark optimisation. > >> On Aug 2, 20
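
The argument about map() versus flatMap() can be checked with a small plain-Python sketch. It uses iterators rather than Spark, and the helper names `double` and `expand` are invented for the example: taking the first n elements commutes with a one-to-one mapping, but not with a mapping that can change the number of rows.

```python
from itertools import chain, islice

data = [1, 2, 3, 4]
n = 3


def double(x: int) -> int:
    """One output per input, like map()."""
    return x * 2


def expand(x: int) -> list:
    """Emits x copies of x, so the row count changes, like flatMap()."""
    return [x] * x


# map(): limiting before or after the mapping gives the same rows.
map_then_limit = list(islice(map(double, data), n))
limit_then_map = list(map(double, islice(data, n)))

# flatMap(): limiting the input first is not equivalent to limiting the output.
flat_then_limit = list(islice(chain.from_iterable(map(expand, data)), n))
limit_then_flat = list(chain.from_iterable(map(expand, islice(data, n))))
```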

Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Hi, I have some operations on a DataFrame / Dataset. How can I see the source code for whole stage codegen? Is there any API for this? Or maybe I should configure log4j in a specific way? Regards, -- Maciek Bryński

Re: Spark SQL and Kryo registration

2016-08-05 Thread Maciej Bryński
Hi Olivier, Did you check the performance of Kryo? I have observed that Kryo is slightly slower than the Java serializer. Regards, Maciek 2016-08-04 17:41 GMT+02:00 Amit Sela : > It should. Codegen uses the SparkConf in SparkEnv when instantiating a new > Serializer. > > On Thu, Aug 4, 2016 at 6:14

Re: Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
inal class GeneratedIterator extends > org.apache.spark.sql.execution.BufferedRowIterator > { > > /* 006 */ private Object[] references; > > /* 007 */ private org.apache.spark.sql.execution.metric.SQLMetric > range_numOutputRows; > > /* 008 */ private boolean range_initRang

Re: GraphFrames 0.2.0 released

2016-08-24 Thread Maciej Bryński
Hi, Do you plan to add a tag for this release on GitHub? https://github.com/graphframes/graphframes/releases Regards, Maciek 2016-08-17 3:18 GMT+02:00 Jacek Laskowski : > Hi Tim, > > AWESOME. Thanks a lot for releasing it. That makes me even more eager > to see it in Spark's codebase (and replaci

Tree for SQL Query

2016-08-24 Thread Maciej Bryński
Hi, I read this article: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html And I have a question. Is it possible to get or print the tree for a SQL query? Something like this: Add(Attribute(x), Add(Literal(1), Literal(2))) Regards, -- Maciek Bryński --
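As a rough illustration of the tree shape asked about above, here is a minimal, self-contained Scala sketch. The classes Literal, Attribute and Add and the object ExprTreeSketch are hypothetical stand-ins written for this example; they are not Spark's Catalyst classes, and fold is only a toy version of constant folding.

```scala
// Hypothetical mini expression tree. The names mirror the example in the
// question, but these are NOT Spark's Catalyst classes.
sealed trait Expr
final case class Literal(value: Int) extends Expr
final case class Attribute(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

object ExprTreeSketch {
  // Toy constant folding: rewrite the tree bottom-up and collapse any
  // Add whose two children are both literals into a single Literal.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(tree)       // case classes print in constructor form
    println(fold(tree)) // the literal subtree is folded into one node
  }
}
```

Case classes give the printable constructor form for free, which is why a tree built this way can be inspected with a plain println.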

Re: Tree for SQL Query

2016-08-24 Thread Maciej Bryński
016-08-24 22:39 GMT+02:00 Reynold Xin : > It's basically the output of the explain command. > > > On Wed, Aug 24, 2016 at 12:31 PM, Maciej Bryński wrote: >> >> Hi, >> I read this article: >> >> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sql

Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej Bryński
2016-08-27 15:27 GMT+02:00 Julien Dumazert : > df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _) I think reduce and sum have very different performance. Did you try sql.functions.sum? Or, if you want to benchmark access to the Row object, then the count() function would be a better idea. Regards,

Cache'ing performance

2016-08-27 Thread Maciej Bryński
Hi, I ran a benchmark of the cache function today. *RDD* sc.parallelize(0 until Int.MaxValue).cache().count() *Datasets* spark.range(Int.MaxValue).cache().count() For me, Datasets were 2 times slower. Results (3 nodes, 20 cores and 48GB RAM each) *RDD - 6s* *Datasets - 13.5 s* Is that expected be

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Maciej Bryński
ow, it seems that it got much slower from 1.6 to > 2.0. I guess, it's because of the fact that Dataframe is now Dataset[Row], > and thus uses the same encoding/decoding mechanism as for any other case > class. > > Best regards, > > Julien > > Le 27 août 2016 à 22:32, M

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Maciej Bryński
+1 At last :) 2016-09-26 19:56 GMT+02:00 Sameer Agarwal : > +1 (non-binding) > > On Mon, Sep 26, 2016 at 9:54 AM, Davies Liu wrote: > >> +1 (non-binding) >> >> On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley >> wrote: >> > +1 >> > >> > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee >> wrote: >> >>

java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
xception: key not found: a while Java serializer works just fine: scala> val sc = new SparkContext(new SparkConf().setAppName("bar").set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")) scala> sc.parallelize(Seq(aMap)).map(_("a")).first res9: Long = 0 -- Best regards, Maciej
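One plausible way to see how a default value can get lost is sketched below in plain Scala, without Spark or Kryo. The definition of aMap is an assumption (the original definition is cut off in the snippet above), and copying the map entry by entry only stands in for a serializer that treats the value as a generic Map; it is not a claim about what Kryo actually does. DefaultValueSketch and lookup are hypothetical names.

```scala
object DefaultValueSketch {
  // Returns the looked-up value, or the exception message when the key
  // is missing and the map has no default to fall back on.
  def lookup(m: Map[String, Long], key: String): Either[String, Long] =
    try Right(m(key))
    catch { case e: NoSuchElementException => Left(e.getMessage) }

  def main(args: Array[String]): Unit = {
    // Assumed definition: an empty map whose lookups fall back to 0.
    val aMap = Map[String, Long]().withDefaultValue(0L)

    // Assumption: rebuilding from the entries alone models a serializer
    // that only sees key/value pairs. The default is not an entry, so
    // the rebuilt map no longer carries it.
    val rebuilt = Map(aMap.toSeq: _*)

    println(lookup(aMap, "a"))    // default still applies
    println(lookup(rebuilt, "a")) // default is gone, lookup fails
  }
}
```

The default lives in a wrapper around the underlying map rather than in its entries, so any copy made from the entries alone drops it.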

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Maciej Bryński
+1 2016-09-30 7:01 GMT+02:00 vaquar khan : > +1 (non-binding) > Regards, > Vaquar khan > > On 29 Sep 2016 23:00, "Denny Lee" wrote: > >> +1 (non-binding) >> >> On Thu, Sep 29, 2016 at 9:43 PM Jeff Zhang wrote: >> >>> +1 >>> >>> On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz wrote: >>> +1 >

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-30 Thread Maciej Szymkiewicz
r a custom >> serializer that handles this case. Or work around it in your client >> code. I know there have been other issues with Kryo and Map because, >> for example, sometimes a Map in an application is actually some >> non-serializable wrapper view. >> >> O

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
You have to remember that the Stack Overflow crowd (like me) is highly opinionated, so many questions that would be just fine on the mailing list will be quickly downvoted and / or closed as off-topic. Just saying... -- Best, Maciej On 11/07/2016 04:03 AM, Reynold Xin wrote: > OK I've

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
bstantially underestimated how opinionated people can be on > mailing lists too :) > > On Sunday, November 6, 2016, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > You have to remember that Stack Overflow crowd (like me) is highly > opinionated, so many
