>>> >>> >> - JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-44167
>>> >>> >> - SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> >>> >>
>>> >>> >>
>>> >>> >> Please vote on the SPIP for the next 72 hours:
>>> >>> >>
>>> >>> >> [ ] +1: Accept the proposal as an official SPIP
>>> >>> >> [ ] +0
>>> >>> >> [ ] -1: I don’t think this is a good idea because …
>>> >>> >>
>>> >>> >>
>>> >>> >> Thank you!
>>> >>> >>
>>> >>> >> Liang-Chi Hsieh
>>> >>> >>
>>> >>> >>
--
Ryan Blue
Tabular
>>> However, taking the law of diminishing returns into account, I would not advise that
>>> either. You can of course use gzip for compression, which may be more
>>> suitable for your needs.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solu
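For context on the compression advice above, a minimal Scala sketch (assumes an existing SparkSession `spark` and DataFrame `df`; the output path is hypothetical):

    // Sketch only: choose gzip for this write; spark.sql.parquet.compression.codec
    // would set the session-wide default instead.
    df.write
      .option("compression", "gzip")
      .parquet("/tmp/gzip-output")   // hypothetical path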
V2 atm? We have a use
> case where we need Parquet V2, as one of our components uses Parquet V2.
>
> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue wrote:
>
>> Hi Prem,
>>
>> Parquet v1 is the default because v2 has not been finalized and adopted
>> by th
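For anyone who still needs to experiment with the v2 writer despite the caveat above, a hedged sketch (parquet-mr reads the writer version from the Hadoop conf key parquet.writer.version; verify the behavior against your Spark and parquet-mr versions):

    // Sketch only: ask parquet-mr for the V2 writer via its Hadoop conf key.
    // PARQUET_1_0 is the default; PARQUET_2_0 enables the v2 data pages.
    spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "PARQUET_2_0")
    df.write.parquet("/tmp/parquet-v2-out")   // hypothetical path and DataFrame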
> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>>> that, as with any advice: "one test result is worth one-thousand
>>>>>> expert opinions" (Wernher von Braun
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>
>>>>>>
>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Team,
>>>>>>> May I know how to check which version of Parquet is supported by
>>>>>>> parquet-mr 1.2.1?
>>>>>>>
>>>>>>> Which version of parquet-mr supports Parquet format version 2 (V2)?
>>>>>>>
>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>> May I get the release notes where the supported Parquet versions are mentioned?
>>>>>>>
>>>>>>
--
Ryan Blue
Tabular
netes operator, making it a part of the Apache Flink project (
>>> https://github.com/apache/flink-kubernetes-operator). This move has
>>> gained wide industry adoption and contributions from the community. In a
>>> mere year, the Flink operator has garnered more than 600 stars and has
>>> attracted contributions from over 80 contributors. This showcases the level
>>> of community interest and collaborative momentum that can be achieved in
>>> similar scenarios.
>>> More details can be found in the SPIP doc: Spark Kubernetes Operator
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>>
>>> Thanks,
>>> --
>>> *Zhou JIANG*
>>>
>>>
>>>
--
Ryan Blue
Tabular
hint system [
> https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html
> or sql("select 1").hint("foo").show()] aren't visible from the
> TableCatalog/Table/ScanBuilder.
>
> I guess I could set a config parameter but I'd rather do this on a
> per-query basis. Any tips?
>
> Thanks!
>
> -0xe1a
>
--
Ryan Blue
Tabular
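One per-query workaround that does reach a DSv2 source is reader options, since they are handed to the connector as a CaseInsensitiveStringMap; a minimal sketch (connector name and option key are hypothetical):

    // Sketch only: options set on the DataFrameReader travel with this query
    // and surface in the source's ScanBuilder, unlike hints.
    val df = spark.read
      .format("com.example.myconnector")   // hypothetical connector
      .option("snapshot-id", "12345")      // hypothetical per-query option
      .load()
    df.show()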
schema metadata that are
> enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an
> area that interests me as it seems that over half my problems at the day
> job are because of dodgy data.
>
> Regards,
>
> Phillip
>
>
--
Ryan Blue
Tabular
to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP until Feb. 9th (Wednesday).
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
--
Ryan Blue
Tabular
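For readers skimming the vote, a purely hypothetical Scala sketch of the shape such an interface could take (the SPIP doc above is authoritative):

    import java.util
    import org.apache.spark.sql.connector.catalog.Identifier
    import org.apache.spark.sql.types.StructType

    // Hypothetical shapes only; not the actual proposed API.
    trait View {
      def name: String
      def sql: String            // the view's defining query text
      def schema: StructType
    }

    trait ViewCatalog {
      def loadView(ident: Identifier): View
      def createView(ident: Identifier, sql: String, schema: StructType,
          properties: util.Map[String, String]): View
      def alterView(ident: Identifier, properties: util.Map[String, String]): View
      def dropView(ident: Identifier): Boolean
    }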
Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
>>> Volcano/Alternative Schedulers Proposal
>>> > - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>>> Schedulers Proposal
>>> > - JIRA: SPARK-36057
>>> >
>>
? We can extract options from runtime session
>> configurations, e.g., via SessionConfigSupport.
>>
>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Side note about time travel: There is a PR
>>> <https:
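As a rough sketch of the SessionConfigSupport suggestion above (package names per the DSv2 connector API; the provider name and option key are hypothetical):

    import java.util
    import org.apache.spark.sql.connector.catalog.{SessionConfigSupport, Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class DemoProvider extends TableProvider with SessionConfigSupport {
      // session confs named spark.datasource.demo.<key> arrive in `options` as <key>
      override def keyPrefix(): String = "demo"

      override def inferSchema(options: CaseInsensitiveStringMap): StructType = {
        // e.g. spark.conf.set("spark.datasource.demo.region", "us-east-1")
        val region = options.getOrDefault("region", "unknown") // hypothetical option
        new StructType().add("region", "string")
      }

      override def getTable(schema: StructType, partitioning: Array[Transform],
          properties: util.Map[String, String]): Table =
        throw new UnsupportedOperationException("sketch only")
    }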
> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer
> wrote:
>
>> I think since we probably will end up using this same syntax on write,
>> this makes a lot of sense. Unless there
>
>> Hi dev,
>>
>> We are discussing Support for Dynamic Table Options in Spark SQL (
>> https://github.com/apache/spark/pull/34072). It is currently unclear whether
>> the syntax makes sense, and I would like to know if there is other feedback
>> or opinion on it.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>
--
Ryan Blue
Tabular
rg/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
>>>>> >
>>>>> > - JIRA: SPARK-35801 <
>>>>> https://issues.apache.org/jira/browse/SPARK-35801>
>>>>> > - PR for handling DELETE statements:
>>>>> > <https://github.com/apache/spark/pull/33008>
>>>>> >
>>>>> > - Design doc
>>>>> > <
>>>>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>>>> >
>>>>> >
>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>> >
>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>> > [ ] +0
>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>> >
>>>>> > -
>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >
>>>>>
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
--
Ryan Blue
Tabular
>> > > >
>> > > > [ ] +1: Accept the proposal as an official SPIP
>> > > > [ ] +0
>> > > > [ ] -1: I don’t think this is a good idea because …
--
Ryan Blue
Tabular
hash function. Or we can
> clearly define the bucket hash function of the builtin `BucketTransform` in
> the doc.
>
> On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue wrote:
>
>> Two v2 sources may return different bucket IDs for the same value, and
>> this breaks the phase 1 s
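A small self-contained illustration of the point above: the same value lands in different buckets when two sources hash differently, so the bucket hash function has to be pinned down (or reported) for the buckets to line up.

    // Illustration only (not from the thread): two hash choices, two buckets.
    val value = "spark"
    val numBuckets = 16
    val javaHashBucket   = Math.floorMod(value.hashCode, numBuckets)
    val murmurLikeBucket = Math.floorMod(scala.util.hashing.MurmurHash3.stringHash(value), numBuckets)
    // javaHashBucket and murmurLikeBucket are generally not equal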
pache.org/jira/browse/SPARK-19256> has
>>> details).
>>>
>>>
>>>
>>> 1. Would aggregate work automatically after the SPIP?
>>>
>>>
>>>
>>> Another major benefit of having bucketed tables is to avoid shuffle
>>>
;>>>
>>>>> +1 for this SPIP.
>>>>>
>>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao
>>>>> wrote:
>>>>>
>>>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>>>
tribution properties
> reported by data sources and eliminate shuffle whenever possible.
> >
> > Design doc:
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao
>
--
Ryan Blue
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
> update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
--
Ryan Blue
Software Engineer
Netflix
This SPIP is adopted with the following +1 votes and no -1 or +0 votes:
Holden Karau*
John Zhuge
Chao Sun
Dongjoon Hyun*
Russell Spitzer
DB Tsai*
Wenchen Fan*
Kent Yao
Huaxin Gao
Liang-Chi Hsieh
Jungtaek Lim
Hyukjin Kwon*
Gengliang Wang
kordex
Takeshi Yamamuro
Ryan Blue
* = binding
On Mon, Mar
>> > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao <
>> >>>>
>> >>>> > huaxin.gao11@
>> >>>>
>> >>>> > > wrote:
>> >>>> >
>> >>>> >> +1 (non-binding)
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>
> --
> ---
> Takeshi Yamamuro
>
--
Ryan Blue
Software Engineer
Netflix
.82w8qxfl2uwl
Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll do
a final update of the PR and we can merge the API.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …
--
Ryan Blue
the "magical methods", then we can have a single
>>> ScalarFunction interface which has the row-parameter API (with a
>>> default implementation to fail) and documents to describe the "magical
>>> methods" (which can be done later).
>>>
>
throw new UnsupportedOperationException();
> + }
>
> By providing the default implementation, it will not technically *force
> users to implement it*.
> And we can properly document the expected usage.
> What do you think?
>
> Bests,
> Dongjoon.
>
only be Object and cause boxing issues.
>
> I agree that Object[] is worse than InternalRow. But I can't think of
> real use cases that will force the individual-parameters approach to use
> Object instead of concrete types.
>
>
> On Tue, Mar 2, 2021 at 3:36 AM Ryan
safety guarantees only if you need just one set of types for each number of
arguments and are using the non-codegen path. Since varargs is one of the
primary reasons to use this API, I don’t think that it is a good idea
to use Object[] instead of InternalRow.
--
Ryan Blue
Software Engineer
Netflix
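For concreteness, a hedged Scala sketch of the two call styles being debated, roughly as they later surfaced in the connector functions API (the function name and logic here are illustrative only):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
    import org.apache.spark.sql.types.{DataType, IntegerType}

    class IntAdd extends ScalarFunction[Int] {
      override def name(): String = "int_add"
      override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
      override def resultType(): DataType = IntegerType

      // "magic" individual-parameters method that the analyzer/codegen can bind to
      def invoke(a: Int, b: Int): Int = a + b

      // row-parameter fallback used when no magic method is found
      override def produceResult(input: InternalRow): Int =
        input.getInt(0) + input.getInt(1)
    }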
> merge two Arrays (of generic types) to a Map.
>>
>> Also, to address Wenchen's InternalRow question, can we create a number
>> of Function classes, each corresponding to a given number of input
>> parameters (e.g., ScalarFunction1, ScalarFunction2, etc.)?
>>
>
individual-parameters version or the row-parameter version?
>
> To move forward, how about we implement the function loading and binding
> first? Then we can have PRs for both the individual-parameters (I can take
> it) and row-parameter approaches, if we still can't reach a consensus at
conclude this thread
>> and have at least one implementation in the `master` branch this month
>> (February).
>> If you need more time (one month or longer), why don't we have Ryan's
>> suggestion in the `master` branch first and benchmark with your PR later
>>
>
> This proposal looks very interesting. Would future goals for this
> functionality include both support for aggregation functions, as well
> as support for processing ColumnarBatches (instead of Row/InternalRow)?
>
> Thanks
> Andrew
>
> On Mon, Feb 15, 2021 at 12:44 PM Ryan Bl
>>> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>> >>> and
>>> >>> extensible. I generally think Wenchen's proposal is easier for a
>>> user to
>>> >>> work with in the common case, but has greater
n
>>>>> the community for a long period of time. I especially appreciate how the
>>>>> design is focused on a minimal useful component, with future optimizations
>>>>> considered from a point of view of making sure it's flexible, but actual
>
String instead,
then the UDF wouldn’t work. What then? Does Spark detect that the wrong
type was used? It would need to or else it would be difficult for a UDF
developer to tell what is wrong. And this is a runtime issue so it is
caught late.
--
Ryan Blue
. The UDF will report input data types and result data type, so the
> analyzer can check if the call method is valid via reflection, and we
> still have query-compile-time type safety. It also simplifies development
> as we can just use the Invoke expression to invoke UDFs.
>
> On Tue
API:
https://github.com/apache/spark/pull/24559/files
Let's discuss the proposal here rather than on that PR, to get better
visibility. Also, please take the time to read the proposal first. That
really helps clear up misconceptions.
--
Ryan Blue
the following change to the community.
>>>
>>> SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax
>>>
>>> This is merged today and now Spark's `CREATE TABLE` is using Spark's
>>> default data sources ins
months of time range + LGTM
>> from an SS contributor is enough to go ahead?
>>
>> https://github.com/apache/spark/pull/27649
>> https://github.com/apache/spark/pull/28363
>>
>> These are under 100 lines of changes each, and not invasive.
>>
>
--
Ryan Blue
Software Engineer
Netflix
efore feature freeze for Spark
>> 3.1 is happening? It was submitted 1.5 years ago and is still struggling to
>> get in; waiting for Spark 3.2 (another half a year) doesn't make sense to me.
>>
>> In addition, is there a way to unblock me to work for meaningful features
>
Hyun wrote on Thu, Nov 19, 2020 at 4:02 PM:
>>>>>>
>>>>>>> Thank you for your volunteering!
>>>>>>>
>>>>>>> Since the previous branch cuts were always a soft code freeze, which
>>>>>>> still allowed committers to merge to the new branches for a while, I
>>>>>>> believe 1 December will be better for stabilization.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I think we haven’t yet decided the exact branch cut, code freeze,
>>>>>>>> and release manager.
>>>>>>>>
>>>>>>>> As we planned in https://spark.apache.org/versioning-policy.html
>>>>>>>>
>>>>>>>> Early Dec 2020 Code freeze. Release branch cut
>>>>>>>>
>>>>>>>> Code freeze and branch cutting is coming.
>>>>>>>>
>>>>>>>> Therefore, we should finish if there are any remaining works for
>>>>>>>> Spark 3.1, and
>>>>>>>> switch to QA mode soon.
>>>>>>>> I think it’s time to start keeping it on track, and I would like to
>>>>>>>> volunteer to help drive this process.
>>>>>>>>
>>>>>>>> I am currently thinking 4th Dec as the branch-cut date.
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>> Thanks all.
>>>>>>>>
>>>>>>>>
--
Ryan Blue
Software Engineer
Netflix
ost an update shortly.
>>>
>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan wrote:
>>>
>>>> Any updates here? I agree that a new View API is better, but we need a
>>>> solution to avoid performance regression. We need to elaborate on the cache
>
to disable the
>> check with the new config. In the PR there is currently no objection, but a
>> suggestion to hear more voices. Please let me know if you have any
>> thoughts.
>>
>> Thanks.
>> Liang-Chi Hsieh
>>
>>
--
Ryan Blue
Software Engineer
Netflix
the error log
>> message at least.
>>
>> Would like to hear the voices.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>
--
Ryan Blue
Software Engineer
Netflix
On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue wrote:
> I don’t think Spark ever claims to be 100% Hive compatible.
>
> By accepting the EXTERNAL keyword in some circumstances, Spark is
> providing compatibility with Hive DDL. Yes, there are places where it
> breaks. The question
be compatible
>
> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
> diverged from Hive intentionally in several places, where we think the Hive
> behavior was not reasonable and we shouldn't follow it.
>
> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue wrote:
s not
>> file-based.
>>
>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table. Hive gives a warning when you create managed tables with a
>> custom location, which means this behavior is not recommended. Shall we
>> "infer&
xternal
>>>>>> catalog (Hive) - so replacing the default session catalog with a custom one and
>>>>>> trying to use it as if it were an external catalog doesn't work, which
>>>>>> defeats the purpose of replacing the default session catalog.
>
suggestion of leaving it up to the catalogs on how to handle this makes
>> sense.
>>
>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue
>> wrote:
>>
>>> I would summarize both the problem and the current state differently.
>>>
>>> Currently, Spar
pproach seems to
> disallow default catalog with custom one. Am I missing something?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
--
Ryan Blue
Software Engineer
Netflix
that it only makes sense for file sources, as the table
> directory can be managed. I'm not sure how to interpret EXTERNAL in
> catalogs like JDBC, Cassandra, etc.
>
> For more details, please refer to the long discussion in
> https://github.com/apache/spark/pull/28026
>
> Thanks,
> Wenchen
>
--
Ryan Blue
Software Engineer
Netflix
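A hedged illustration of the two cases under discussion (assumes a Hive-enabled SparkSession; table names and paths are hypothetical):

    // EXTERNAL with an explicit LOCATION, via the Hive syntax path:
    spark.sql("CREATE EXTERNAL TABLE ext_t (id INT) STORED AS PARQUET LOCATION '/tmp/ext_t'")
    // A custom LOCATION without EXTERNAL, which Spark currently also treats as external:
    spark.sql("CREATE TABLE loc_t (id INT) USING parquet LOCATION '/tmp/loc_t'")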
>> > while (valueIndex < this.currentCount) {
> >> >   // values are bit packed 8 at a time, so reading bitWidth will always work
> >> >   ByteBuffer buffer = in.slice(bitWidth);
> >> >   this.packer.unpack8Values(buffer, buffer.position(), this.currentBuffer, valueIndex);
> >> >   valueIndex += 8;
> >> > }
> >> >
> >> > Per my profiling, about 30% of the time in readNextGroup() is spent in
> >> > slice; why can't we call slice outside the loop?
>
--
Ryan Blue
val child = View(
>   desc = metadata,
>   output = metadata.schema.toAttributes,
>   child = parser.parsePlan(viewText))
>
> So the validation (here) or cache (in DESCRIBE) is nice to have, but not
> "required" or "should be frozen". Thanks Ryan and Burak for pointing that
> out in the SPIP. I will add a new paragraph accordingly.
>
--
Ryan Blue
Software Engineer
Netflix
ckends, I am proposing a new view catalog API to load,
>>>> create, alter, and drop views.
>>>>
>>>> Document:
>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>
>>>> As part of a project to support common views across query engines like
>>>> Spark and Presto, my team used the view catalog API in our Spark
>>>> implementation. The project has been in production for over three months.
>>>>
>>>> Thanks,
>>>> John Zhuge
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> John Zhuge
>
--
Ryan Blue
Software Engineer
Netflix
--
Ryan Blue
Software Engineer
Netflix
ence, I usually have to spend at least 15-20
> minutes explaining that a worker will not actually do work, and the master
> won't run their application.
>
> Thanks Holden for doing all the legwork on this!
>
--
Ryan Blue
Software Engineer
Netflix
by its nature is
>>>> not backward compatible. I think we'd all like to have as smooth an upgrade
>>>> experience to Spark 3 as possible, and I believe that having a Spark 2
>>>> release some of the new functionality while continuing to support the older
>
Jun 9, 2020 at 6:15 PM Dr. Kent Yao wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>>
>>>>>
--
Ryan Blue
Software Engineer
Netflix
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>>
>>> C
e accordingly but this seems to no longer be the
> case. Was this intentional? I feel like if we could
> have the default be based on the Source then upgrading code from DSV1 ->
> DSV2 would be much easier for users.
>
> I'm currently testing this on RC2
>
>
> Any thoughts?
>
> Thanks for your time as usual,
> Russ
>
--
Ryan Blue
Software Engineer
Netflix
undation/voting.html> to commit without waiting
for a review.
On Wed, May 20, 2020 at 10:00 AM Ryan Blue wrote:
> Why was https://github.com/apache/spark/pull/28523 merged with a -1? We
> discussed this months ago and concluded that it was a bad idea to introduce
> a new v2 API that
--
Ryan Blue
Software Engineer
Netflix
>>
>>> I would really appreciate that. I'm probably going to just write a
>>> planner rule for now that matches up my table schema with the query output
>>> if they are valid, and fails analysis otherwise. This approach is how I got
>>> metadata col
esent during an insert as well as those
> which are not required.
>
> Please let me know if i've misread this,
>
> Thanks for your time again,
> Russ
>
--
Ryan Blue
Software Engineer
Netflix
pend quite a bit of time to resolve conflicts and fix tests.
>>>
>>> I don't see why it's still a problem if a feature is disabled and hidden
>>> from end-users (it's undocumented, the config is internal). The related
>>> code will be replaced i
nabled is disabled, we
>>>> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
>>>> with this problem for years to come.
>>>>
>>>> On Mon, May 11, 2020 at 1:06 AM JackyLee wrote:
>>>>
>>>>> +1. Agree with Xiao Li and Jungtaek Lim.
>>>>>
>>>>> This seems to be controversial, and can not be done in a short time.
>>>>> It is
>>>>> necessary to choose option 1 to unblock Spark 3.0 and support it in
>>>>> 3.1.
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
, you should check every time whether there
>>>> is a variant of that API specifically for Java. But yes, it gives you
>>>> Java/Scala-friendly instances.
>>>>
>>>> For 4., having one API that returns a Java instance lets you use it from
>>>> both the Scala and Java sides, although it makes you call asScala
>>>> specifically on the Scala side. But you don’t
>>>> have to check whether there’s a variant of this API, and it gives you a
>>>> consistent API usage across languages.
>>>>
>>>> Also, note that calling Java from Scala is legitimate but the opposite
>>>> case is not, to the best of my knowledge.
>>>> In addition, you should have a method that returns a Java instance for
>>>> PySpark or SparkR to support.
>>>>
>>>>
>>>> *Proposal:*
>>>>
>>>> I would like to have general guidance on this that the Spark devs
>>>> agree upon: take approach 4. If not possible, do 3. Avoid 1 at almost all
>>>> cost.
>>>>
>>>> Note that this isn't a hard requirement but *general guidance*;
>>>> therefore, the decision might be up to
>>>> the specific context. For example, when there are strong arguments
>>>> for a separate Java-specific API, that’s fine.
>>>> Of course, we won’t change the existing methods, given Michael’s rubric
>>>> added before. I am talking about new
>>>> methods in unreleased branches.
>>>>
>>>> Any concern or opinion on this?
>>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
ld remain the same as before.
>
> I think this functionality will be useful as DSv2 continues to evolve,
> please let me know your thoughts.
>
> Thanks
> Andrew
>
--
Ryan Blue
Software Engineer
Netflix
--
Ryan Blue
Software Engineer
Netflix
n't appear together with PARTITIONED
>>>>> BY transformList.
>>>>>
>>>>
>>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>>> TABLE syntax, we can also update Catalog.createTable() to support
>>>> creating partitioned tables
>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
--
Ryan Blue
Software Engineer
Netflix
rk/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
>>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>>>> order. But the user-facing behavior looks fine.
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in
>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>>>>> Then CREATE EXTERNAL TABLE would create a Hive table, as in 2.4.
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi devs,
>>>>>>>>
>>>>>>>> I'd like to initiate a discussion and hear voices on resolving the
>>>>>>>> ambiguous parser rules between the two "create table"s brought in by
>>>>>>>> SPARK-30098 [1].
>>>>>>>>
>>>>>>>> Previously, the "create table" parser rules were clearly distinguished
>>>>>>>> via "USING provider", which was very intuitive and deterministic: a DDL
>>>>>>>> query creates a "Hive" table unless "USING provider" is specified
>>>>>>>> (please refer to the parser rule in branch-2.4 [2]).
>>>>>>>>
>>>>>>>> After SPARK-30098, the "create table" parser rules became ambiguous
>>>>>>>> (please refer to the parser rule in branch-3.0 [3]) - the only factors
>>>>>>>> differentiating the two rules are "ROW FORMAT" and "STORED AS", which are
>>>>>>>> both defined as "optional". Now it relies on the "order" of parser rules,
>>>>>>>> which end users have no way to reason about, and is very unintuitive.
>>>>>>>>
>>>>>>>> Furthermore, the undocumented rule for EXTERNAL (added in the first rule
>>>>>>>> to provide a better message) brought more confusion (I've described the
>>>>>>>> broken existing query in SPARK-30436 [4]).
>>>>>>>>
>>>>>>>> Personally I'd like to see the two rules be mutually exclusive, instead of
>>>>>>>> trying to document the difference and asking end users to be careful about
>>>>>>>> their queries. I see two ways to make the rules mutually exclusive:
>>>>>>>>
>>>>>>>> 1. Add an identifier to the create-Hive-table rule, like `CREATE ...
>>>>>>>> "HIVE" TABLE ...`.
>>>>>>>>
>>>>>>>> pros. This is the simplest way to distinguish between the two rules.
>>>>>>>> cons. This would require end users to change their queries if they
>>>>>>>> intend to create a Hive table. (Given that we will also provide a legacy
>>>>>>>> option, I feel this is acceptable.)
>>>>>>>>
>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>
>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>> cons. Less intuitive, because these clauses have been optional and would
>>>>>>>> now become mandatory to fall into the second rule.
>>>>>>>>
>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> syntax
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>> 2.
>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 3.
>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>
>>>>>>>>
--
Ryan Blue
Software Engineer
Netflix
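A hedged illustration of the ambiguity described above, as discussed for Spark 3.0 (table names are hypothetical; the Hive form needs Hive support enabled):

    // Neither USING, ROW FORMAT, nor STORED AS: only parser-rule order decides.
    spark.sql("CREATE TABLE t1 (id INT)")                    // falls into the native rule
    // Disambiguated forms:
    spark.sql("CREATE TABLE t2 (id INT) USING parquet")      // native syntax
    spark.sql("CREATE TABLE t3 (id INT) STORED AS PARQUET")  // Hive syntax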
st.
>
> On Fri, 17 Jan 2020 at 10:33 PM, Ryan Blue wrote:
>
>> We've implemented these metrics in the RDD (for input metrics) and in the
>> v2 DataWritingSparkTask. That approach gives you the same metrics in the
>> stage views that you get with v1 sources, regardl
support metrics.
>>
>> So it will be easy to collect the metrics if FilePartitionReaderFactory
>> implements ReportMetrics.
>>
>> Please let me know your views, or whether we should have a new solution or
>> design.
>>
>
--
Ryan Blue
Software Engineer
Netflix
s are going to be
>>> allowed together (e.g., `concat(years(col) + days(col))`);
>>> however, it looks impossible to extend with the current design. It just
>>> directly maps transformName to an implementation class
>>> and passes the arguments along:
>>>
>>> transform
>>> ...
>>> | transformName=identifier
>>> '(' argument+=transformArgument (',' argument+=transformArgument)*
>>> ')' #applyTransform
>>> ;
>>>
>>> It looks like regular expressions are supported; however, they are not.
>>> - If we should support them, the design has to take that into account.
>>> - If we should not support them, a different syntax might have to be used
>>> instead.
>>>
>>> *Limited Compatibility Management*
>>> The name can be arbitrary. For instance, if "transform" is supported on the
>>> Spark side, the name is preempted by Spark.
>>> If a data source already supported such a name, it becomes incompatible.
>>>
>>>
>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
Hi everyone, here are my notes for the DSv2 sync last week. Sorry they’re
late! Feel free to add more details or corrections. Thanks!
rb
*Attendees*:
Ryan Blue
John Zhuge
Dongjoon Hyun
Joseph Torres
Kevin Yu
Russell Spitzer
Terry Kim
Wenchen Fan
Hyukjin Kwon
Jacky Lee
*Topics*:
- Relation
ng something.
>>>
>>> What do you think? It would bring a backward-incompatible change, but
>>> given that the interface is marked as Evolving and we're making
>>> backward-incompatible changes in Spark 3.0, I feel it may not matter.
>>>
>>> Would love to hear your thoughts.
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
Actually, my conflict was cancelled so I'll send out the usual invite for
Wednesday. Sorry for the noise.
On Sun, Dec 8, 2019 at 3:15 PM Ryan Blue wrote:
> Hi everyone,
>
> I have a conflict with the normal DSv2 sync time this Wednesday and I'd
> like to attend to talk a
Hi everyone,
I have a conflict with the normal DSv2 sync time this Wednesday and I'd
like to attend to talk about the TableProvider API.
Would it work for everyone to have the sync at 6PM PST on Tuesday, 10
December instead? I could also make it at the normal time on Thursday.
Thanks,
--
d be
>> minimal since this applies only when there are temp views and tables with
>> the same name.
>>
>> Any feedback will be appreciated.
>>
>> I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon
>> Hyun for guidance and suggestion.
>>
>> Regards,
>> Terry
>>
>>
>> <https://issues.apache.org/jira/browse/SPARK-29900>
>>
>
--
Ryan Blue
Software Engineer
Netflix
://mail-archives.apache.org/mod_mbox/parquet-dev/201911.mbox/%3c8357699c-9295-4eb0-a39e-b3538d717...@gmail.com%3E>
> ).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
>
lows
>> arbitrary metadata/framing data to be wrapped around individual objects
>> cheaply. Right now, that’s only possible at the stream level. (There are
>> hacks around this, but this would enable more idiomatic use in efficient
>> shuffle implementations.)
>>
>>
>> Have serializers indicate whether they are deterministic. This provides
>> much of the value of a shuffle service because it means that reducers do
>> not need to spill to disk when reading/merging/combining inputs--the data
>> can be grouped by the service, even without the service understanding data
>> types or byte representations. Alternative (less preferable since it would
>> break Java serialization, for example): require all serializers to be
>> deterministic.
>>
>>
>>
>> --
>>
>> - Ben
>>
>
--
Ryan Blue
Software Engineer
Netflix
at, it's quite
> expensive to deserialize all the various metadata, so I was holding the
> deserialized version in the DataSourceReader, but if Spark is repeatedly
> constructing new ones, then that doesn't help. If this is the expected
> behavior, how should I handle this as a consumer of the API?
>
> Thanks!
> Andrew
>
--
Ryan Blue
Software Engineer
Netflix
*Attendees*:
Ryan Blue
Terry Kim
Wenchen Fan
Jose Torres
Jacky Lee
Gengliang Wang
*Topics*:
- DROP NAMESPACE cascade behavior
- 3.0 tasks
- TableProvider API changes
- V1 and V2 table resolution rules
- Separate logical and physical write (for streaming)
- Bucketing support
Hi everyone,
I can't make it to the DSv2 sync tomorrow, so let's skip it. If anyone
would prefer to have one and is willing to take notes, I can send out the
invite. Just let me know, otherwise let's consider it cancelled.
Thanks,
rb
--
Ryan Blue
Software Engineer
Netflix
Here are my notes from last week's DSv2 sync.
*Attendees*:
Ryan Blue
Terry Kim
Wenchen Fan
*Topics*:
- SchemaPruning only supports Parquet and ORC?
- Out of order optimizer rules
- 3.0 work
- Rename session catalog to spark_catalog
- Finish TableProvider update to
rules are originally for Dataset
>>>> encoder. As far as I know, no mainstream DBMS is using this policy by
>>>> default.
>>>>
>>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>>> and V2 in Spark 3.0.
>>>>
>>>> This vote is open until Friday (Oct. 11).
>>>>
>>>> [ ] +1: Accept the proposal
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>>
>>>> Thank you!
>>>>
>>>> Gengliang
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
>
--
Ryan Blue
Software Engineer
Netflix
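For reference, a hedged sketch of how the policy in this vote is switched (the conf is spark.sql.storeAssignmentPolicy; the table name below is hypothetical):

    // Sketch only: choose LEGACY, STRICT, or ANSI for store assignment on INSERT.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    // Under ANSI, unreasonable conversions (e.g. a string into an int column)
    // are rejected instead of being silently written.
    spark.sql("INSERT INTO target_table VALUES ('not a number')")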
ues.apache.org/jira/browse/HIVE-9152). Henry R's description
>> was also correct.
>>
>>
>>
>>
>>
>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue
>> wrote:
>>
>>> Where can I find a design doc for dynamic partition pruning tha
ite sure. Seems to me it's better
> to run it before join reorder.
>
> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> I have been working on a PR that moves filter and projection pushdown
>> into the optimizer for DSv2, instead of
addressed.
rb
--
Ryan Blue
Software Engineer
Netflix
view has
> an advantage here (assuming we provide Maven artifacts as well as an official
> announcement), as it sets the expectation that there are a bunch of changes,
> given it's a new major version. It also provides plenty of time to try
> adopting it before the version is officially
>
>> I would personally love to see us provide a gentle migration path to
>> Spark 3, especially if much of the work is already going to happen anyway.
>>
>> Maybe giving it a different name (e.g. something like
>> Spark-2-to-3-transitional) would make it clearer about i
to try the DSv2 API and build DSv2 data
>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>> maintaining two major versions. There’s not that much
ine ... as
> suggested by others in the thread, DSv2 would be one of the main reasons
> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
> Why not abandon 3.0 entirely and backport all the features to 2.x?
>
>
>
> On Sat, Sep 21, 2019 at
.
On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin wrote:
> How would you not make incompatible changes in 3.x? As discussed the
> InternalRow API is not stable and needs to change.
>
> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue wrote:
>
>> > Making downstream to diverge thei
rate to Spark
> 3.0 if they are prepared to migrate to new DSv2.
>
> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun
> wrote:
>
>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>> I believe we follow Semantic Versioning (
>> https://spark.apach
licy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue
> wrote:
>
>> I don’t think we need to gate a 3.0 release on making a more stable
>> version of InternalRow
>
this way, you might as well argue we should make the
> entire catalyst package public to be pragmatic and not allow any changes.
>
>
>
>
> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue wrote:
>
>> When you created the PR to make InternalRow public
>>
>> This i
temporarily. You can't just make a bunch of internal APIs
> tightly coupled with other internal pieces public and stable and call it a
> day, just because they happen to satisfy some use cases temporarily, assuming
> the rest of Spark doesn't change.
>
>
>
> On Fri, Sep 20, 2