[ANNOUNCE] Apache Roadshow Chicago, Call for Presentations

2019-01-15 Thread Trevor Grant
Hello Devs!


You're receiving this email because you are subscribed to one or more
Apache developer email lists.

I’m writing to let you know about an exciting event coming to the Chicago
area: The Apache Roadshow Chicago.  It will be held May 13th and 14th at
three bars in the Logan Square neighborhood (Revolution Brewing, The
Native, and the Radler).

There will be six tracks:

   - Apache in Adtech: Tell us how Apache works in your advertising stack
   - Apache in Fintech: Tell us how Apache works in your finance/insurance business
   - Apache in Startups: Tell us how you’re using Apache in your startup
   - Diversity in Apache: How do we increase and encourage diversity in Apache and tech fields overall?
   - Made in Chicago: Apache related things made by people in Chicago that don’t fall into other buckets
   - Project Shark Tank: Do you want more developers or users for your Apache project? Come here and pitch it!


This is an exciting chance to learn how Apache projects are used in production
around Chicago, how business users decide to adopt Apache projects, which new
projects are looking for help from developers like you, and how and why to
increase diversity in tech and IT.

If you have use cases of Apache products in Adtech, Fintech, or Startups; if
you represent a minority working in tech and have perspectives to share; if you
live in the Chicagoland area and want to highlight work you’ve done on an
Apache project; or if you want to get other people excited to come work on your
project, then please submit a proposal before the CFP deadline on February 15th!

Tickets to the Apache Roadshow Chicago are $100; speakers will get a
complimentary ticket.

We’re looking forward to reading your submissions and seeing you there on
May 13-14!

Sincerely,

Trevor Grant

https://www.apachecon.com/chiroadshow19/cfp.html

https://www.apachecon.com/chiroadshow19/register.html


[DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Yuming Wang
Dear Spark Developers and Users,



Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 to
solve some critical issues, such as supporting Hadoop 3.x and fixing several
ORC and Parquet problems. Here is the list:

*Hive issues*:

[SPARK-26332][HIVE-10790] Spark sql write orc table on viewFS throws exception

[SPARK-25193][HIVE-12505] insert overwrite doesn't throw exception when drop old data fails

[SPARK-26437][HIVE-13083] Decimal data becomes bigint to query, unable to query

[SPARK-25919][HIVE-11771] Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

[SPARK-12014][HIVE-11100] Spark SQL query containing semicolon is broken in Beeline



*Spark issues*:

[SPARK-23534] Spark run on Hadoop 3.0.0

[SPARK-20202] Remove references to org.spark-project.hive

[SPARK-18673] Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

[SPARK-24766] CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column stats in parquet





Since the code for the *hive-thriftserver* module has changed too much for
this upgrade, I split the work into two PRs for easier review.

The first PR does not contain the hive-thriftserver changes; please ignore the
failing hive-thriftserver tests there.

The second PR contains the complete changes.



I have created a Spark distribution for Apache Hadoop 2.7; you can download it
via Google Drive or Baidu Pan.

Please help review and test. Thanks.


SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all,

I want to re-send the previous SPIP on introducing a DataFrame-based graph
component to collect more feedback. It supports property graphs, Cypher
graph queries, and graph algorithms built on top of the DataFrame API. If
you are a GraphX user or your workload is essentially graph queries, please
help review and check how it fits into your use cases. Your feedback would
be greatly appreciated!

# Links to SPIP and design sketch:

* Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
* Google Doc:
https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
* Jira issue for a first design sketch:
https://issues.apache.org/jira/browse/SPARK-26028
* Google Doc:
https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing

# Sample code:

~~~
val graph = ...

// query
val result = graph.cypher("""
  MATCH (p:Person)-[r:STUDY_AT]->(u:University)
  RETURN p.name, r.since, u.name
""")

// algorithms
val ranks = graph.pageRank.run()
~~~
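
To give a feel for the data model, here is a small sketch of the node and
relationship DataFrames such a graph might be built from. The column
conventions and the commented-out `createGraph` call are illustrative
assumptions, not the finalized API; see the design doc above for the actual
proposal. A SparkSession named `spark` is assumed to be in scope.

~~~
import spark.implicits._

// Hypothetical node and relationship DataFrames (column names are assumptions).
val persons = Seq(
  (0L, "Alice", 1984),
  (1L, "Bob", 1990)
).toDF("id", "name", "birthYear")

val universities = Seq(
  (10L, "UC Berkeley")
).toDF("id", "name")

val studyAt = Seq(
  (0L, 10L, 2004),  // Alice studies at university 10 since 2004
  (1L, 10L, 2008)
).toDF("src", "dst", "since")

// Hypothetical construction call -- the real factory method is defined in the
// SPIP / design doc linked above, so it is only sketched here:
// val graph = cypherSession.createGraph(
//   nodes = Map("Person" -> persons, "University" -> universities),
//   relationships = Map("STUDY_AT" -> studyAt))
~~~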

Best,
Xiangrui


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the
dependence on Hive. Currently, most Spark users are not using Hive. The
changes look risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA:
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:

> Dear Spark Developers and Users,
>
>
>
> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
>  to 2.3.4
>  to
> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
> Parquet issues. This is the list:
>
> *Hive issues*:
>
> [SPARK-26332 ][HIVE-10790]
> Spark sql write orc table on viewFS throws exception
>
> [SPARK-25193 ][HIVE-12505]
> insert overwrite doesn't throw exception when drop old data fails
>
> [SPARK-26437 ][HIVE-13083]
> Decimal data becomes bigint to query, unable to query
>
> [SPARK-25919 ][HIVE-11771]
> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
> table is Partitioned
>
> [SPARK-12014 ][HIVE-11100]
> Spark SQL query containing semicolon is broken in Beeline
>
>
>
> *Spark issues*:
>
> [SPARK-23534 ] Spark
> run on Hadoop 3.0.0
>
> [SPARK-20202 ] Remove
> references to org.spark-project.hive
>
> [SPARK-18673 ]
> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>
> [SPARK-24766 ]
> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
> stats in parquet
>
>
>
>
>
> Since the code for the *hive-thriftserver* module has changed too much
> for this upgrade, I split it into two PRs for easy review.
>
> The first PR  does not
> contain the changes of hive-thriftserver. Please ignore the failed test in
> hive-thriftserver.
>
> The second PR  is complete
> changes.
>
>
>
> I have created a Spark distribution for Apache Hadoop 2.7, you might
> download it via Google Drive
>  or Baidu
> Pan .
>
> Please help review and test. Thanks.
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
How do we know that most Spark users are not using Hive? I wouldn't be
surprised either way, but I do want to make sure we aren't making decisions
based on any one person's (or one company's) experience about what "most"
Spark users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:

> Hi, Yuming,
>
> Thank you for your contributions! The community aims at reducing the
> dependence on Hive. Currently, most of Spark users are not using Hive. The
> changes looks risky to me.
>
> To support Hadoop 3.x, we just need to resolve this JIRA:
> https://issues.apache.org/jira/browse/HIVE-16391
>
> Cheers,
>
> Xiao
>
> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>
>> Dear Spark Developers and Users,
>>
>>
>>
>> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
>>  to 2.3.4
>>  to
>> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
>> Parquet issues. This is the list:
>>
>> *Hive issues*:
>>
>> [SPARK-26332 ][HIVE-10790]
>> Spark sql write orc table on viewFS throws exception
>>
>> [SPARK-25193 ][HIVE-12505]
>> insert overwrite doesn't throw exception when drop old data fails
>>
>> [SPARK-26437 ][HIVE-13083]
>> Decimal data becomes bigint to query, unable to query
>>
>> [SPARK-25919 ][HIVE-11771]
>> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
>> table is Partitioned
>>
>> [SPARK-12014 ][HIVE-11100]
>> Spark SQL query containing semicolon is broken in Beeline
>>
>>
>>
>> *Spark issues*:
>>
>> [SPARK-23534 ] Spark
>> run on Hadoop 3.0.0
>>
>> [SPARK-20202 ] Remove
>> references to org.spark-project.hive
>>
>> [SPARK-18673 ]
>> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>>
>> [SPARK-24766 ]
>> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
>> stats in parquet
>>
>>
>>
>>
>>
>> Since the code for the *hive-thriftserver* module has changed too much
>> for this upgrade, I split it into two PRs for easy review.
>>
>> The first PR  does not
>> contain the changes of hive-thriftserver. Please ignore the failed test in
>> hive-thriftserver.
>>
>> The second PR  is complete
>> changes.
>>
>>
>>
>> I have created a Spark distribution for Apache Hadoop 2.7, you might
>> download it via Google Drive
>>  or Baidu
>> Pan .
>>
>> Please help review and test. Thanks.
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
Resolving https://issues.apache.org/jira/browse/HIVE-16391 means keeping Spark
on Hive 1.2?

I’m not sure that reduces the dependency on Hive - Hive is still there, and it’s
a very old Hive. IMO the longer we stay on it, the greater the risk. (And
it’s been years.)

Looking at the two PRs, they don’t seem very drastic to me, except for the thrift
server. Is there another, better approach to the thrift server?



From: Xiao Li 
Sent: Tuesday, January 15, 2019 9:44 AM
To: Yuming Wang
Cc: dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most of Spark users are not using Hive. The changes looks 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com> wrote on Tue, Jan 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive from 
1.2.1-spark2 to 
2.3.4 to solve 
some critical issues, such as support Hadoop 3.x, solve some ORC and Parquet 
issues. This is the list:
Hive issues:
[SPARK-26332][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534] Spark run on 
Hadoop 3.0.0
[SPARK-20202] Remove 
references to org.spark-project.hive
[SPARK-18673] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR is complete changes.

I have created a Spark distribution for Apache Hadoop 2.7, you might download 
it via Google 
Drive or 
Baidu Pan.
Please help review and test. Thanks.


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
And we are super 100% dependent on Hive...



From: Ryan Blue 
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li <gatorsm...@gmail.com> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most of Spark users are not using Hive. The changes looks 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com> wrote on Tue, Jan 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive 
from 1.2.1-spark2 
to 2.3.4 to 
solve some critical issues, such as support Hadoop 3.x, solve some ORC and 
Parquet issues. This is the list:
Hive issues:
[SPARK-26332][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534] Spark run on 
Hadoop 3.0.0
[SPARK-20202] Remove 
references to org.spark-project.hive
[SPARK-18673] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR is complete changes.

I have created a Spark distribution for Apache Hadoop 2.7, you might download 
it via Google 
Drive or 
Baidu Pan.
Please help review and test. Thanks.


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
Unless it's going away entirely, and I don't think it is, we at least
have to do this to get off the fork of Hive that's being used now.
I do think we want to keep Hive from getting into the core though --
see comments on the PR.

On Tue, Jan 15, 2019 at 11:44 AM Xiao Li  wrote:
>
> Hi, Yuming,
>
> Thank you for your contributions! The community aims at reducing the 
> dependence on Hive. Currently, most of Spark users are not using Hive. The 
> changes looks risky to me.
>
> To support Hadoop 3.x, we just need to resolve this JIRA: 
> https://issues.apache.org/jira/browse/HIVE-16391
>
> Cheers,
>
> Xiao
>
> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>>
>> Dear Spark Developers and Users,
>>
>>
>>
>> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 
>> to solve some critical issues, such as support Hadoop 3.x, solve some ORC 
>> and Parquet issues. This is the list:
>>
>> Hive issues:
>>
>> [SPARK-26332][HIVE-10790] Spark sql write orc table on viewFS throws 
>> exception
>>
>> [SPARK-25193][HIVE-12505] insert overwrite doesn't throw exception when drop 
>> old data fails
>>
>> [SPARK-26437][HIVE-13083] Decimal data becomes bigint to query, unable to 
>> query
>>
>> [SPARK-25919][HIVE-11771] Date value corrupts when tables are 
>> "ParquetHiveSerDe" formatted and target table is Partitioned
>>
>> [SPARK-12014][HIVE-11100] Spark SQL query containing semicolon is broken in 
>> Beeline
>>
>>
>>
>> Spark issues:
>>
>> [SPARK-23534] Spark run on Hadoop 3.0.0
>>
>> [SPARK-20202] Remove references to org.spark-project.hive
>>
>> [SPARK-18673] Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop 
>> version
>>
>> [SPARK-24766] CreateHiveTableAsSelect and InsertIntoHiveDir won't generate 
>> decimal column stats in parquet
>>
>>
>>
>>
>>
>> Since the code for the hive-thriftserver module has changed too much for 
>> this upgrade, I split it into two PRs for easy review.
>>
>> The first PR does not contain the changes of hive-thriftserver. Please 
>> ignore the failed test in hive-thriftserver.
>>
>> The second PR is complete changes.
>>
>>
>>
>> I have created a Spark distribution for Apache Hadoop 2.7, you might 
>> download it via Google Drive or Baidu Pan.
>>
>> Please help review and test. Thanks.




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Let me take my words back. To read/write a table, Spark users do not use
the Hive execution JARs unless they explicitly create Hive serde tables.
Actually, I want to understand the motivation and use cases: why do your
usage scenarios need Hive serde tables instead of our Spark native tables?

BTW, we are still using the Hive metastore as our metadata store. Based on my
understanding, this does not require upgrading the Hive execution JAR; users
can upgrade their metastore to a newer version of Hive independently.
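
To make the distinction concrete, here is a minimal sketch (not from the
original thread) contrasting a Spark-native data source table with an explicit
Hive serde table. Table names are made up, and a SparkSession with Hive support
is assumed.

~~~
import org.apache.spark.sql.SparkSession

// Assumed: a SparkSession with Hive support enabled.
val spark = SparkSession.builder()
  .appName("serde-vs-native-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Spark-native data source table: read and written by Spark's own Parquet
// code path, no Hive execution classes involved.
spark.sql("CREATE TABLE native_tbl (id INT, name STRING) USING parquet")

// Explicit Hive serde table: declared with Hive DDL, so reads and writes go
// through the Hive serde classes (and therefore the Hive execution JARs).
spark.sql("CREATE TABLE hive_serde_tbl (id INT, name STRING) STORED AS SEQUENCEFILE")
~~~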

Felix Cheung wrote on Tue, Jan 15, 2019 at 9:56 AM:

> And we are super 100% dependent on Hive...
>
>
> --
> *From:* Ryan Blue 
> *Sent:* Tuesday, January 15, 2019 9:53 AM
> *To:* Xiao Li
> *Cc:* Yuming Wang; dev
> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> How do we know that most Spark users are not using Hive? I wouldn't be
> surprised either way, but I do want to make sure we aren't making decisions
> based on any one person's (or one company's) experience about what "most"
> Spark users do.
>
> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>
>> Hi, Yuming,
>>
>> Thank you for your contributions! The community aims at reducing the
>> dependence on Hive. Currently, most of Spark users are not using Hive. The
>> changes looks risky to me.
>>
>> To support Hadoop 3.x, we just need to resolve this JIRA:
>> https://issues.apache.org/jira/browse/HIVE-16391
>>
>> Cheers,
>>
>> Xiao
>>
>> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>>
>>> Dear Spark Developers and Users,
>>>
>>>
>>>
>>> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
>>>  to 2.3.4
>>>  to
>>> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
>>> Parquet issues. This is the list:
>>>
>>> *Hive issues*:
>>>
>>> [SPARK-26332 
>>> ][HIVE-10790]
>>> Spark sql write orc table on viewFS throws exception
>>>
>>> [SPARK-25193 
>>> ][HIVE-12505]
>>> insert overwrite doesn't throw exception when drop old data fails
>>>
>>> [SPARK-26437 
>>> ][HIVE-13083]
>>> Decimal data becomes bigint to query, unable to query
>>>
>>> [SPARK-25919 
>>> ][HIVE-11771]
>>> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
>>> table is Partitioned
>>>
>>> [SPARK-12014 
>>> ][HIVE-11100]
>>> Spark SQL query containing semicolon is broken in Beeline
>>>
>>>
>>>
>>> *Spark issues*:
>>>
>>> [SPARK-23534 ] Spark
>>> run on Hadoop 3.0.0
>>>
>>> [SPARK-20202 ]
>>> Remove references to org.spark-project.hive
>>>
>>> [SPARK-18673 ]
>>> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>>>
>>> [SPARK-24766 ]
>>> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
>>> stats in parquet
>>>
>>>
>>>
>>>
>>>
>>> Since the code for the *hive-thriftserver* module has changed too much
>>> for this upgrade, I split it into two PRs for easy review.
>>>
>>> The first PR  does not
>>> contain the changes of hive-thriftserver. Please ignore the failed test in
>>> hive-thriftserver.
>>>
>>> The second PR  is complete
>>> changes.
>>>
>>>
>>>
>>> I have created a Spark distribution for Apache Hadoop 2.7, you might
>>> download it via Google Drive
>>>  or 
>>> Baidu
>>> Pan .
>>>
>>> Please help review and test. Thanks.
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
One common case we have is a custom input format (see the sketch below).

In any case, even when the Hive metastore is protocol compatible, we should still
upgrade or replace the forked hive jar, as Sean says, from an ASF release
process standpoint -- unless there is a plan for removing Hive integration (all
of it) from the Spark core project.
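
As an illustration of the custom input format case, here is a hedged sketch.
The input format class `com.example.CustomEventInputFormat` is a placeholder
invented for this example; the serde and output format are standard Hive
classes, and `spark` is assumed to be a Hive-enabled SparkSession.

~~~
// A Hive serde table backed by a user-supplied InputFormat. Reading a table
// like this requires the Hive code path, which is one reason the hive-exec
// dependency matters in practice.
spark.sql("""
  CREATE TABLE events (payload STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  STORED AS
    INPUTFORMAT 'com.example.CustomEventInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
""")
~~~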



From: Xiao Li 
Sent: Tuesday, January 15, 2019 10:03 AM
To: Felix Cheung
Cc: rb...@netflix.com; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Let me take my words back. To read/write a table, Spark users do not use the 
Hive execution JARs, unless they explicitly create the Hive serde tables. 
Actually, I want to understand the motivation and use cases why your usage 
scenarios need to create Hive serde tables instead of our Spark native tables?

BTW, we are still using Hive metastore as our metadata store. This does not 
require the Hive execution JAR upgrade, based on my understanding. Users can 
upgrade it to the newer version of Hive metastore.

Felix Cheung <felixcheun...@hotmail.com> wrote on Tue, Jan 15, 2019 at 9:56 AM:
And we are super 100% dependent on Hive...



From: Ryan Blue 
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li <gatorsm...@gmail.com> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most of Spark users are not using Hive. The changes looks 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com> wrote on Tue, Jan 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive 
from 1.2.1-spark2 to 2.3.4 to 
solve some critical issues, such as support Hadoop 3.x, solve some ORC and 
Parquet issues. This is the list:
Hive issues:
[SPARK-26332][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534] Spark run on 
Hadoop 3.0.0
[SPARK-20202] Remove 
references to org.spark-project.hive
[SPARK-18673] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR is complete changes.

I have created a Spark distribution for Apache Hadoop 2.7, you might download 
it via Google Drive or Baidu Pan.
Please help review and test. Thanks.


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Marcelo Vanzin
The metastore interactions in Spark are currently based on APIs that
live in the Hive exec jar, so Spark cannot work with Hadoop 3 until the
exec jar is upgraded.

It could be possible to re-implement those interactions based solely
on the metastore client Hive publishes; but that would be a lot of
work IIRC.

I can't comment on how many people use Hive serde tables (although I
do know they are used, just not how extensively), but that's not the
only reason why Spark currently requires the hive-exec jar.
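
For readers curious what "based solely on the metastore client" might look
like, here is a rough sketch using the standalone client published in the
hive-metastore artifact. The thrift URI and table name are placeholders, and
this is not how Spark is wired today; it only illustrates the alternative
mentioned above.

~~~
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

// Talk to the metastore through the published client API only, without
// touching the org.apache.hadoop.hive.ql.* classes that live in hive-exec.
val conf = new HiveConf()
conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083") // placeholder URI

val client = new HiveMetaStoreClient(conf)
val table = client.getTable("default", "my_table") // pure metastore call
println(table.getSd.getLocation)                   // e.g. inspect the table's storage location
client.close()
~~~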

On Tue, Jan 15, 2019 at 10:03 AM Xiao Li  wrote:
>
> Let me take my words back. To read/write a table, Spark users do not use the 
> Hive execution JARs, unless they explicitly create the Hive serde tables. 
> Actually, I want to understand the motivation and use cases why your usage 
> scenarios need to create Hive serde tables instead of our Spark native tables?
>
> BTW, we are still using Hive metastore as our metadata store. This does not 
> require the Hive execution JAR upgrade, based on my understanding. Users can 
> upgrade it to the newer version of Hive metastore.
>
> Felix Cheung wrote on Tue, Jan 15, 2019 at 9:56 AM:
>>
>> And we are super 100% dependent on Hive...
>>
>>
>> 
>> From: Ryan Blue 
>> Sent: Tuesday, January 15, 2019 9:53 AM
>> To: Xiao Li
>> Cc: Yuming Wang; dev
>> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> How do we know that most Spark users are not using Hive? I wouldn't be 
>> surprised either way, but I do want to make sure we aren't making decisions 
>> based on any one person's (or one company's) experience about what "most" 
>> Spark users do.
>>
>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>>>
>>> Hi, Yuming,
>>>
>>> Thank you for your contributions! The community aims at reducing the 
>>> dependence on Hive. Currently, most of Spark users are not using Hive. The 
>>> changes looks risky to me.
>>>
>>> To support Hadoop 3.x, we just need to resolve this JIRA: 
>>> https://issues.apache.org/jira/browse/HIVE-16391
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:

 Dear Spark Developers and Users,



 Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 
 to solve some critical issues, such as support Hadoop 3.x, solve some ORC 
 and Parquet issues. This is the list:

 Hive issues:

 [SPARK-26332][HIVE-10790] Spark sql write orc table on viewFS throws 
 exception

 [SPARK-25193][HIVE-12505] insert overwrite doesn't throw exception when 
 drop old data fails

 [SPARK-26437][HIVE-13083] Decimal data becomes bigint to query, unable to 
 query

 [SPARK-25919][HIVE-11771] Date value corrupts when tables are 
 "ParquetHiveSerDe" formatted and target table is Partitioned

 [SPARK-12014][HIVE-11100] Spark SQL query containing semicolon is broken 
 in Beeline



 Spark issues:

 [SPARK-23534] Spark run on Hadoop 3.0.0

 [SPARK-20202] Remove references to org.spark-project.hive

 [SPARK-18673] Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop 
 version

 [SPARK-24766] CreateHiveTableAsSelect and InsertIntoHiveDir won't generate 
 decimal column stats in parquet





 Since the code for the hive-thriftserver module has changed too much for 
 this upgrade, I split it into two PRs for easy review.

 The first PR does not contain the changes of hive-thriftserver. Please 
 ignore the failed test in hive-thriftserver.

 The second PR is complete changes.



 I have created a Spark distribution for Apache Hadoop 2.7, you might 
 download it via Google Drive or Baidu Pan.

 Please help review and test. Thanks.
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix



-- 
Marcelo




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
Xiao, thanks for clarifying.

There are a few use cases for metastore tables. Felix mentions a good one,
custom metastore tables. There are also common formats that Spark doesn't
support natively. Spark has CSV support, but the behavior is different from
Hive's delimited format. Hive also supports Sequence file tables. We have a
few of those that are old.

The other use case that comes to mind is mixed-format tables. I don't think
Spark supports a different format per partition without going through the
Hive read path. We use this feature to convert old tables to Parquet by
simply writing new partitions in Parquet format. Without this, it would be
a much more painful migration process. Only the jobs that read older
partitions need to go through Hive, so converting to a HadoopFs table
usually works.
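
To make the mixed-format scenario concrete, here is a small sketch (the table
name and partition values are made up; the per-partition format switch itself
is Hive DDL issued against the metastore, not something run through Spark's
parser, and `spark` is an assumed Hive-enabled SparkSession):

~~~
// Mixed-format table sketch: older partitions keep the table's original format
// (say SequenceFile), while newly written partitions are switched to Parquet.
// The per-partition format change is plain Hive DDL, e.g. run via the Hive CLI
// or beeline:
//
//   ALTER TABLE logs PARTITION (ds = '2019-01-15') SET FILEFORMAT PARQUET;
//
// Spark reads the table through the Hive path while mixed formats remain; once
// only Parquet partitions are left, the table can be treated as a plain
// Parquet/HadoopFs table, as described above.
val recent = spark.table("logs").where("ds >= '2019-01-15'")
recent.show()
~~~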

rb

On Tue, Jan 15, 2019 at 10:07 AM Felix Cheung 
wrote:

> One common case we have is a custom input format.
>
> In any case, even when Hive metatstore is protocol compatible we should
> still upgrade or replace the hive jar from a fork, as Sean says, from a ASF
> release process standpoint. Unless there is a plan for removing hive
> integration (all of it) from the spark core project..
>
>
> --
> *From:* Xiao Li 
> *Sent:* Tuesday, January 15, 2019 10:03 AM
> *To:* Felix Cheung
> *Cc:* rb...@netflix.com; Yuming Wang; dev
> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Let me take my words back. To read/write a table, Spark users do not use
> the Hive execution JARs, unless they explicitly create the Hive serde
> tables. Actually, I want to understand the motivation and use cases why
> your usage scenarios need to create Hive serde tables instead of our Spark
> native tables?
>
> BTW, we are still using Hive metastore as our metadata store. This does
> not require the Hive execution JAR upgrade, based on my understanding.
> Users can upgrade it to the newer version of Hive metastore.
>
> Felix Cheung wrote on Tue, Jan 15, 2019 at 9:56 AM:
>
>> And we are super 100% dependent on Hive...
>>
>>
>> --
>> *From:* Ryan Blue 
>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>> *To:* Xiao Li
>> *Cc:* Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> How do we know that most Spark users are not using Hive? I wouldn't be
>> surprised either way, but I do want to make sure we aren't making decisions
>> based on any one person's (or one company's) experience about what "most"
>> Spark users do.
>>
>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>>
>>> Hi, Yuming,
>>>
>>> Thank you for your contributions! The community aims at reducing the
>>> dependence on Hive. Currently, most of Spark users are not using Hive. The
>>> changes looks risky to me.
>>>
>>> To support Hadoop 3.x, we just need to resolve this JIRA:
>>> https://issues.apache.org/jira/browse/HIVE-16391
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>>>
 Dear Spark Developers and Users,



 Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 to
 solve some critical issues, such as support Hadoop 3.x, solve some ORC and
 Parquet issues. This is the list:

 *Hive issues*:

 [SPARK-26332 
 ][HIVE-10790]
 Spark sql write orc table on viewFS throws exception

 [SPARK-25193 
 ][HIVE-12505]
 insert overwrite doesn't throw exception when drop old data fails

 [SPARK-26437 
 ][HIVE-13083]
 Decimal data becomes bigint to query, unable to query

 [SPARK-25919 
 ][HIVE-11771]
 Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
 table is Partitioned

 [SPARK-12014 
 ][HIVE-11100]
 Spark SQL query containing semicolon is broken in Beeline



 *Spark issues*:

 [SPARK-23534 ]
 Spark run on Hadoop 3.0.0

 [SPARK-20202 ]
 Remove references to org.spark-project.hive

 [SPARK-18673 ]
 Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

 [SPARK-24766 ]
 CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
 stats in parquet





 Since the code for the *hive-thriftserver* module has changed too much
 for this upgrade, I split it into two PRs for easy review.

 The first PR 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Since Spark 2.0, we have been trying to move all the Hive-specific logic
to a separate package and make Hive a data source like the other
built-in data sources. You might have seen a lot of refactoring PRs toward this
goal. Hive will certainly remain an important data source that Spark supports.

Now, upgrading the Hive execution JAR touches a lot of code and changes
many dependencies. Any PR like this looks very risky to me. Both quality
and stability are my major concerns. This could impact the adoption rate of
our upcoming Spark 3.0 release, which will contain many important features.
I doubt whether upgrading the Hive execution JAR is really needed.


Ryan Blue wrote on Tue, Jan 15, 2019 at 10:15 AM:

> Xiao, thanks for clarifying.
>
> There are a few use cases for metastore tables. Felix mentions a good one,
> custom metastore tables. There are also common formats that Spark doesn't
> support natively. Spark has CSV support, but the behavior is different from
> Hive's delimited format. Hive also supports Sequence file tables. We have a
> few of those that are old.
>
> The other use case that comes to mind is mixed-format tables. I don't
> think Spark supports a different format per partition without going through
> the Hive read path. We use this feature to convert old tables to Parquet by
> simply writing new partitions in Parquet format. Without this, it would be
> a much more painful migration process. Only the jobs that read older
> partitions need to go through Hive, so converting to a HadoopFs table
> usually works.
>
> rb
>
> On Tue, Jan 15, 2019 at 10:07 AM Felix Cheung 
> wrote:
>
>> One common case we have is a custom input format.
>>
>> In any case, even when Hive metatstore is protocol compatible we should
>> still upgrade or replace the hive jar from a fork, as Sean says, from a ASF
>> release process standpoint. Unless there is a plan for removing hive
>> integration (all of it) from the spark core project..
>>
>>
>> --
>> *From:* Xiao Li 
>> *Sent:* Tuesday, January 15, 2019 10:03 AM
>> *To:* Felix Cheung
>> *Cc:* rb...@netflix.com; Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> Let me take my words back. To read/write a table, Spark users do not use
>> the Hive execution JARs, unless they explicitly create the Hive serde
>> tables. Actually, I want to understand the motivation and use cases why
>> your usage scenarios need to create Hive serde tables instead of our Spark
>> native tables?
>>
>> BTW, we are still using Hive metastore as our metadata store. This does
>> not require the Hive execution JAR upgrade, based on my understanding.
>> Users can upgrade it to the newer version of Hive metastore.
>>
>> Felix Cheung wrote on Tue, Jan 15, 2019 at 9:56 AM:
>>
>>> And we are super 100% dependent on Hive...
>>>
>>>
>>> --
>>> *From:* Ryan Blue 
>>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>>> *To:* Xiao Li
>>> *Cc:* Yuming Wang; dev
>>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>>
>>> How do we know that most Spark users are not using Hive? I wouldn't be
>>> surprised either way, but I do want to make sure we aren't making decisions
>>> based on any one person's (or one company's) experience about what "most"
>>> Spark users do.
>>>
>>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>>>
 Hi, Yuming,

 Thank you for your contributions! The community aims at reducing the
 dependence on Hive. Currently, most of Spark users are not using Hive. The
 changes looks risky to me.

 To support Hadoop 3.x, we just need to resolve this JIRA:
 https://issues.apache.org/jira/browse/HIVE-16391

 Cheers,

 Xiao

 Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:

> Dear Spark Developers and Users,
>
>
>
> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 to
> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
> Parquet issues. This is the list:
>
> *Hive issues*:
>
> [SPARK-26332 
> ][HIVE-10790]
> Spark sql write orc table on viewFS throws exception
>
> [SPARK-25193 
> ][HIVE-12505]
> insert overwrite doesn't throw exception when drop old data fails
>
> [SPARK-26437 
> ][HIVE-13083]
> Decimal data becomes bigint to query, unable to query
>
> [SPARK-25919 
> ][HIVE-11771]
> Date value corrupts when tables are "ParquetHiveSerDe" formatted and 
> target
> table is Partitioned
>
> [SPARK-12014 
> ][HIVE-11100]
> Spark SQL query contain

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
If https://issues.apache.org/jira/browse/HIVE-16391 can be resolved, we do
not need to keep our fork of Hive.

Sean Owen wrote on Tue, Jan 15, 2019 at 10:44 AM:

> It's almost certainly needed just to get off the fork of Hive we're
> not supposed to have. Yes it's going to impact dependencies, so would
> need to happen at Spark 3.
> Separately, its usage could be reduced or removed -- this I don't know
> much about. But it doesn't really make it harder or easier.
>
> On Tue, Jan 15, 2019 at 12:40 PM Xiao Li  wrote:
> >
> > Since Spark 2.0, we have been trying to move all the Hive-specific
> logics to a separate package and make Hive as a data source like the other
> built-in data sources. You might see a lot of refactoring PRs for this
> goal. Hive will be still an important data source Spark supports for sure.
> >
> > Now, the upgrade of Hive execution JAR touches so many code and changes
> many dependencies. Any PR like this looks very risky to me. Both quality
> and stability are my major concern. This could impact the adoption rate of
> our upcoming Spark 3.0 release, which will contain many important features.
> I doubt whether upgrading the Hive execution JAR is really needed?
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
It's almost certainly needed just to get off the fork of Hive we're
not supposed to have. Yes it's going to impact dependencies, so would
need to happen at Spark 3.
Separately, its usage could be reduced or removed -- this I don't know
much about. But it doesn't really make it harder or easier.

On Tue, Jan 15, 2019 at 12:40 PM Xiao Li  wrote:
>
> Since Spark 2.0, we have been trying to move all the Hive-specific logics to 
> a separate package and make Hive as a data source like the other built-in 
> data sources. You might see a lot of refactoring PRs for this goal. Hive will 
> be still an important data source Spark supports for sure.
>
> Now, the upgrade of Hive execution JAR touches so many code and changes many 
> dependencies. Any PR like this looks very risky to me. Both quality and 
> stability are my major concern. This could impact the adoption rate of our 
> upcoming Spark 3.0 release, which will contain many important features. I 
> doubt whether upgrading the Hive execution JAR is really needed?




Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats! Great work, Dongjoon.



Dongjoon Hyun wrote on Tue, Jan 15, 2019 at 3:47 PM:

> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
> maintenance branch of Spark. We strongly recommend all 2.2.x users to
> upgrade to this stable release.
>
> To download Spark 2.2.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-2-3.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.
>


-- 
Best Regards

Jeff Zhang


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Hyukjin Kwon
Resolving HIVE-16391 means Hive would release a 1.2.x version that contains the
fixes from our Hive fork (correct me if I am mistaken).

To be honest, and as a personal opinion, that basically asks Hive to take care
of Spark's dependency. Hive is moving ahead to 3.1.x, and no one would use a
newer 1.2.x release; by analogy, Spark doesn't make 1.6.x releases anymore.

Frankly, my impression is that this is our mistake to fix. Since the Spark
community is big enough, I think we should try to fix it ourselves first.
I am not saying upgrading is the only way through this, but I think we should
at least try first and see what comes next.

Yes, it does sound riskier to upgrade on our side, but I think it's worth
checking and trying to see whether it's possible. Upgrading the dependency is a
more standard approach than using the fork or asking the Hive side to release
another 1.2.x.

If we fail to upgrade it for critical or unavoidable reasons, yes, we could
find an alternative, but that basically means we're going to stay on 1.2.x for
a long time (say, until Spark 4.0.0?).

I know this happens to be a sensitive topic, but to be honest with myself, I
think we should give it a try.


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Marcelo Vanzin
+1 to that. HIVE-16391 by itself means we're giving up things like
Hadoop 3, and we're also putting the burden on the Hive folks to fix a
problem that we created.

The current PR is basically a Spark-side fix for that bug. It does
mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
it's really the right path to take here.

On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
>
> Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes of 
> our Hive fork (correct me if I am mistaken).
>
> Just to be honest by myself and as a personal opinion, that basically says 
> Hive to take care of Spark's dependency.
> Hive looks going ahead for 3.1.x and no one would use the newer release of 
> 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for instance,
>
> Frankly, my impression was that it's, honestly, our mistake to fix. Since 
> Spark community is big enough, I was thinking we should try to fix it by 
> ourselves first.
> I am not saying upgrading is the only way to get through this but I think we 
> should at least try first, and see what's next.
>
> It does, yes, sound more risky to upgrade it in our side but I think it's 
> worth to check and try it and see if it's possible.
> I think this is a standard approach to upgrade the dependency than using the 
> fork or letting Hive side to release another 1.2.x.
>
> If we fail to upgrade it for critical or inevitable reasons somehow, yes, we 
> could find an alternative but that basically means
> we're going to stay in 1.2.x for, at least, a long time (say .. until Spark 
> 4.0.0?).
>
> I know somehow it happened to be sensitive but to be just literally honest to 
> myself, I think we should make a try.
>


-- 
Marcelo
