Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-09 Thread Herman van Hovell
+1

On Mon, Nov 9, 2020 at 2:06 AM Takeshi Yamamuro wrote:

> +1
>
> On Thu, Nov 5, 2020 at 3:41 AM Xinyi Yu  wrote:
>
>> Hi all,
>>
>> We had a discussion of the SPIP: Standardize Spark Exception Messages at
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Standardize-Spark-Exception-Messages-td30341.html
>> The SPIP document is at
>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
>> We would like to hold a vote on this for 72 hours.
>>
>> Please vote before November 7th at noon:
>>
>> [ ] +1: Accept this SPIP proposal
>> [ ] -1: Do not agree to standardize Spark exception messages, because ...
>>
>>
>> Thanks for your time and feedback!
>>
>> --
>> Xinyi
>>
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: -Phadoop-provided still includes hadoop jars

2020-11-09 Thread Steve Loughran
On Mon, 12 Oct 2020 at 19:06, Sean Owen  wrote:

> I don't have a good answer, Steve may know more, but from looking at
> dependency:tree, it looks mostly like it's hadoop-common that's at issue.
> Without -Phive it remains 'provided' in the assembly/ module, but -Phive
> causes it to come back in. Either there's some good reason for that, or,
> maybe we need to explicitly manage the scope of hadoop-common along with
> everything else Hadoop, even though Spark doesn't reference it directly.
>
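
(For reference, a way to reproduce the check Sean describes; the module path
and flags below are illustrative assumptions, not a documented recipe:)

    # Print the assembly module's Hadoop dependencies and their scopes,
    # with and without -Phive, to see where hadoop-common drops out of
    # the 'provided' scope.
    ./build/mvn -pl assembly -am -Phadoop-provided dependency:tree \
      -Dincludes=org.apache.hadoop
    ./build/mvn -pl assembly -am -Phadoop-provided -Phive dependency:tree \
      -Dincludes=org.apache.hadoop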

Sorry, I missed this.

Yes, they should be scoped so that -Phadoop-provided leaves them out. Open a
JIRA and point me at it, and I'll do my best.

The artifacts should just go into the hadoop-provided scope, shouldn't they?

> On Mon, Oct 12, 2020 at 12:38 PM Kimahriman  wrote:
>
>> When I try to build a distribution with either -Phive or -Phadoop-cloud
>> along with -Phadoop-provided, I still end up with hadoop jars in the
>> distribution.
>>
>> Specifically, with -Phive and -Phadoop-provided, you end up with
>> hadoop-annotations, hadoop-auth, and hadoop-common included in the Spark
>> jars, and with -Phadoop-cloud and -Phadoop-provided, you end up with
>> hadoop-annotations as well as the hadoop-{aws,azure,openstack} jars. Is
>> this supposed to be the case, or is there something I'm doing wrong? I
>> just want the spark-hive and spark-hadoop-cloud jars without the hadoop
>> dependencies, and right now I just have to delete the hadoop jars after
>> the fact.
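
(A minimal way to reproduce what is described above, assuming the standard
distribution script; the flags and output paths are illustrative:)

    # Build a distribution with Hive support but Hadoop marked as provided,
    # then list which hadoop-* jars still end up in the package.
    ./dev/make-distribution.sh --name hadoop-provided-test --tgz \
      -Phive -Phadoop-provided
    ls dist/jars/ | grep '^hadoop-'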


Adding uuid support

2020-11-09 Thread Denise Mauldin
Hello,

When I use PySpark to save to a PostgreSQL database, I run into an error
where uuid insert statements are not constructed properly. There are a lot
of questions on Stack Overflow about the same issue, for example:

https://stackoverflow.com/questions/64671739/pyspark-nullable-uuid-type-uuid-but-expression-is-of-type-character-varying

I would like to add support for saving uuids to PostgreSQL in PySpark.

How do I identify what is causing this error? Is this something that needs
to be fixed in the PySpark code, the Apache Spark code, or the PostgreSQL
JDBC driver? Does anyone have advice on how I should approach fixing this
issue?

Thanks,
Denise
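
(For readers hitting the same error: a commonly suggested workaround, sketched
below, is to let the PostgreSQL server cast the string on insert by adding
stringtype=unspecified to the JDBC URL. The database, table, and column names
are made up for illustration, and this only works around the issue rather than
adding true uuid support to Spark.)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A DataFrame whose "id" column holds UUIDs as strings; the target
    # Postgres table declares "id" with the uuid type.
    df = spark.createDataFrame(
        [("123e4567-e89b-12d3-a456-426614174000", "alice")],
        ["id", "name"],
    )

    # stringtype=unspecified makes the JDBC driver send string parameters
    # as untyped literals, so the server casts them to uuid on insert.
    (df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/mydb?stringtype=unspecified")
        .option("dbtable", "users")
        .option("user", "postgres")
        .option("password", "secret")
        .mode("append")
        .save())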


Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-09 Thread Allison Wang
Thanks everyone for voting! With 11 +1s and no -1s, this vote passes.

+1s:
Mridul Muralidharan
Angers Zhu
Chandni Singh
Eve Liao
Matei Zaharia
Kalyan
Wenchen Fan
Gengliang Wang
Xiao Li
Takeshi Yamamuro
Herman van Hovell

Thanks,
Allison





Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-09 Thread Reynold Xin
Exciting! I look forward to this.

(And a late +1 vote that probably won't be counted)

On Mon, Nov 9, 2020 at 2:37 PM, Allison Wang <allison.w...@databricks.com> wrote:

> Thanks everyone for voting! With 11 +1s and no -1s, this vote passes.
>
> +1s:
> Mridul Muralidharan
> Angers Zhu
> Chandni Singh
> Eve Liao
> Matei Zaharia
> Kalyan
> Wenchen Fan
> Gengliang Wang
> Xiao Li
> Takeshi Yamamuro
> Herman van Hovell
>
> Thanks,
> Allison



Re: SPIP: Catalog API for view metadata

2020-11-09 Thread Wenchen Fan
Moving the discussion back to this thread. The current question is how to
avoid extra RPC calls for catalogs that support both tables and views. There
are several options:
1. Ignore it, as the extra RPC calls are cheap compared to query execution.
2. Have a per-session cache for loaded tables/views.
3. Have a per-query cache for loaded tables/views.
4. Add a new trait, TableViewCatalog.

I think it's important to avoid performance regressions with new APIs. RPC
calls can be significant for short queries, and we may also double the RPC
traffic, which is bad for the metastore service. Normally I would not
recommend caching, as cache invalidation is a hard problem. Personally I
prefer option 4, as it only affects catalogs that support both tables and
views, and it fits the Hive catalog very well.
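
A rough sketch of one way option 4 could look; the names and shape below are
hypothetical and not the SPIP's actual API:

    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableCatalog;

    /** Opaque placeholder for the View interface proposed in the SPIP. */
    interface View { }

    /** Hypothetical catalog that resolves a name to a table or a view in one RPC. */
    interface TableViewCatalog extends TableCatalog {

      /** Result of a combined lookup; exactly one of the two fields is non-null. */
      final class TableOrView {
        public final Table table;
        public final View view;
        public TableOrView(Table table, View view) { this.table = table; this.view = view; }
      }

      /** A single lookup, avoiding a second metastore round trip. */
      TableOrView loadTableOrView(Identifier ident);
    }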

On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:

> The SPIP has been updated. Please review.
>
> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>
>> Wenchen, sorry for the delay, I will post an update shortly.
>>
>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:
>>
>>> Any updates here? I agree that a new View API is better, but we need a
>>> solution to avoid performance regression. We need to elaborate on the cache
>>> idea.
>>>
>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>>
 I think it is a good idea to keep tables and views separate.

 The two main arguments I’ve heard for combining lookup into a single
 function are the ones brought up in this thread. First, an identifier in a
 catalog must be either a view or a table and should not collide. Second, a
 single lookup is more likely to require a single RPC. I think the RPC
 concern is well addressed by caching, which we already do in the Spark
 catalog, so I’ll primarily focus on the first.
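
(A minimal sketch of what such caching amounts to for the per-query case; the
class and method names are illustrative, not Spark's actual implementation:)

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableCatalog;

    /** Memoizes catalog lookups for the duration of a single query's analysis. */
    final class PerQueryTableCache {
      private final TableCatalog catalog;
      private final Map<Identifier, Table> cache = new HashMap<>();

      PerQueryTableCache(TableCatalog catalog) {
        this.catalog = catalog;
      }

      /** Each identifier costs at most one RPC; repeated lookups hit the cache. */
      Table load(Identifier ident) throws NoSuchTableException {
        Table cached = cache.get(ident);
        if (cached != null) {
          return cached;
        }
        Table loaded = catalog.loadTable(ident);  // the single metastore call
        cache.put(ident, loaded);
        return loaded;
      }
    }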

 Table/view name collision is unlikely to be a problem. Metastores that
 support both today store them in a single namespace, so this is not a
 concern for even a naive implementation that talks to the Hive MetaStore. I
 know that a new metastore catalog could choose to implement both
 ViewCatalog and TableCatalog and store the two sets separately, but that
 would be a very strange choice: if the metastore itself has different
 namespaces for tables and views, then it makes much more sense to expose
 them through separate catalogs because Spark will always prefer one over
 the other.

 In a similar line of reasoning, catalogs that expose both views and
 tables are much more rare than catalogs that only expose one. For example,
 v2 catalogs for JDBC and Cassandra expose data through the Table interface
 and implementing ViewCatalog would make little sense. Exposing new data
 sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
 likely to be the same. Say I have a way to convert Pig statements or some
 other representation into a SQL view. It would make little sense to combine
 that with some other TableCatalog.

 I also don’t think there is benefit from an API perspective to justify
 combining the Table and View interfaces. The two share only schema and
 properties, and are handled very differently internally — a View’s SQL
 query is parsed and substituted into the plan, while a Table is wrapped in
 a relation that eventually becomes a Scan node using SupportsRead. A view’s
 SQL also needs additional context to be resolved correctly: the current
 catalog and namespace from the time the view was created.
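
(A sketch of the per-view metadata this implies a ViewCatalog would return; the
field names below are assumptions drawn from this thread, not the SPIP's final
API:)

    import java.util.Map;
    import org.apache.spark.sql.types.StructType;

    /** Hypothetical shape of a loaded view's metadata. */
    interface View {
      String name();
      String sql();                  // view text, parsed and substituted into the plan
      String currentCatalog();       // catalog that was current when the view was created
      String[] currentNamespace();   // namespace that was current when the view was created
      StructType schema();           // expected output schema of the view
      Map<String, String> properties();
    }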

 Query planning is distinct between tables and views, so Spark doesn’t
 benefit from combining them. I think it has actually caused problems that
 both were resolved by the same method in v1: the resolution rule grew
 extremely complicated trying to look up a reference just once because it
 had to parse a view plan and resolve relations within it using the view’s
 context (current database). In contrast, John’s new view substitution rules
 are cleaner and can stay within the substitution batch.

 People implementing views would also not benefit from combining the two
 interfaces:

- There is little overlap between View and Table, only schema and
properties
- Most catalogs won’t implement both interfaces, so returning a
ViewOrTable is more difficult for implementations
- TableCatalog assumes that ViewCatalog will be added separately
like John proposes, so we would have to break or replace that API

 I understand the initial appeal of combining TableCatalog and
 ViewCatalog since it is done that way in the existing interfaces. But I
 think that Hive chose to do that mostly because the two were already
 stored together, and not because it made sense for users of the API, or
 any other implementer of the API.