I agree with Seth and Aljoscha and think that is the right way to go. We already provide uber jars for Kafka and Elasticsearch for an out-of-the-box experience; you can see the download links on this page [1]. Users can easily download the connectors and versions they need and drop them into the SQL CLI lib directory. The uber jars contain all the required dependencies and may be shaded. This way, users can skip building an uber jar themselves. Hive is indeed a "connector" too, and should follow the same approach.
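For readers following along, the drop-in workflow above amounts to roughly the following shell sketch. The jar name, version, and paths are illustrative placeholders, not exact artifacts; the real download links are on the page in [1].

```shell
# Sketch of the uber-jar workflow described above. The jar name/version and
# FLINK_HOME path are illustrative; take the real download link from [1].
FLINK_HOME="${FLINK_HOME:-/opt/flink}"
JAR="flink-sql-connector-kafka_2.11-1.9.1.jar"

# 1. Download the connector uber jar (URL is a placeholder).
# curl -LO "https://example.org/path/to/$JAR"

# 2. Drop it into the SQL CLI lib directory; the uber jar already bundles
#    (and shades) all transitive dependencies, so no local build is needed.
cp "$JAR" "$FLINK_HOME/lib/"

# 3. Restart the SQL client so it picks up the jars from lib/.
"$FLINK_HOME/bin/sql-client.sh" embedded
```

No Maven build, no dependency list to assemble by hand; the shading inside the uber jar is what keeps its bundled dependencies from clashing with the user's classpath.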
Best,
Jark

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies

On Sat, 14 Dec 2019 at 03:03, Aljoscha Krettek <aljos...@apache.org> wrote:

> I was going to suggest the same thing as Seth. So yes, I'm against having
> Flink distributions that contain Hive, but I'm in favor of convenience
> downloads like the ones we have for Hadoop.
>
> Best,
> Aljoscha
>
> > On 13. Dec 2019, at 18:04, Seth Wiesman <sjwies...@gmail.com> wrote:
> >
> > I'm also -1 on separate builds.
> >
> > What about publishing convenience jars that contain the dependencies for
> > each version? For example, there could be a flink-hive-1.2.1-uber.jar that
> > users could just add to their lib folder, containing all the necessary
> > dependencies to connect to that Hive version.
> >
> > On Fri, Dec 13, 2019 at 8:50 AM Robert Metzger <rmetz...@apache.org> wrote:
> >
> >> I'm generally not opposed to convenience binaries if a huge number of
> >> people would benefit from them and the overhead for the Flink project is
> >> low. I have not seen a huge demand for such binaries yet (neither for the
> >> Flink + Hive integration). Looking at Apache Spark, they also offer
> >> convenience binaries for Hadoop only.
> >>
> >> Maybe we could provide a "Docker Playground" for Flink + Hive in the
> >> documentation (and the flink-playgrounds.git repo)?
> >> (similar to
> >> https://ci.apache.org/projects/flink/flink-docs-master/getting-started/docker-playgrounds/flink-operations-playground.html
> >> )
> >>
> >> On Fri, Dec 13, 2019 at 3:04 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>
> >>> -1
> >>>
> >>> We shouldn't need to deploy additional binaries to have a feature be
> >>> remotely usable.
> >>> This usually points to something else being done incorrectly.
> >>>
> >>> If it is indeed such a hassle to set up Hive on Flink, then my conclusion
> >>> would be that either
> >>> a) the documentation needs to be improved,
> >>> b) the architecture needs to be improved,
> >>> or, if all else fails, c) we provide a utility script to make setup easier.
> >>>
> >>> We spent a lot of time reducing the number of binaries in the Hadoop
> >>> days, and also went to extra lengths to prevent a separate Java 11 binary,
> >>> and I see no reason why Hive should get special treatment on this matter.
> >>>
> >>> Regards,
> >>> Chesnay
> >>>
> >>> On 13/12/2019 09:44, Bowen Li wrote:
> >>>> Hi all,
> >>>>
> >>>> I want to propose having a couple of separate Flink distributions with
> >>>> Hive dependencies on specific Hive versions (2.3.4 and 1.2.1). The
> >>>> distributions would be provided to users on the Flink download page [1].
> >>>>
> >>>> A few reasons to do this:
> >>>>
> >>>> 1) Flink-Hive integration is important to many Flink and Hive users in
> >>>> two dimensions:
> >>>>    a) for Flink metadata: HiveCatalog is the only persistent catalog for
> >>>> managing Flink tables. With Flink 1.10 supporting more DDL, the
> >>>> persistent catalog will play an even more critical role in users' workflows.
> >>>>    b) for Flink data: the Hive data connector (source/sink) helps both
> >>>> Flink and Hive users unlock new use cases in streaming,
> >>>> near-realtime/realtime data warehousing, backfill, etc.
> >>>>
> >>>> 2) currently users have to go through a *really* tedious process to get
> >>>> started, because it requires lots of extra jars (see [2]) that are absent
> >>>> from Flink's lean distribution. We've had many users from the public
> >>>> mailing list, private email, and DingTalk groups who got frustrated
> >>>> spending lots of time figuring out the jars themselves.
> >>>> They would rather have a more "right out of the box" quickstart
> >>>> experience, and play with the catalog and source/sink without hassle.
> >>>>
> >>>> 3) it's easier for users to replace those Hive dependencies with their
> >>>> own Hive versions - just swap in the jars for the right versions, with
> >>>> no need to dig through the docs.
> >>>>
> >>>> * Hive 2.3.4 and 1.2.1 are two versions that represent a large share of
> >>>> the user base out there, and that's why we use them as examples for the
> >>>> dependencies in [1], even though we now support almost all Hive
> >>>> versions [3].
> >>>>
> >>>> I want to hear what the community thinks about this, and how to achieve
> >>>> it if we believe that's the way to go.
> >>>>
> >>>> Cheers,
> >>>> Bowen
> >>>>
> >>>> [1] https://flink.apache.org/downloads.html
> >>>> [2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> >>>> [3] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions
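As a concrete illustration of the tedious process Bowen's point 2 complains about, today's manual setup amounts to hunting down several jars by hand and copying each into Flink's lib/ directory, roughly like the sketch below. The jar names and versions are illustrative examples only; the authoritative list lives in [2].

```shell
# Illustrative sketch of today's manual Hive setup: each dependency jar from
# the list in [2] has to be located and copied into Flink's lib/ directory.
# Jar names/versions below are examples, not the authoritative list.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"
DOWNLOAD_DIR="${DOWNLOAD_DIR:-./downloads}"
HIVE_DEPS=(
  "flink-connector-hive_2.11-1.10.0.jar"
  "hive-exec-2.3.4.jar"
)
for jar in "${HIVE_DEPS[@]}"; do
  cp "$DOWNLOAD_DIR/$jar" "$FLINK_HOME/lib/"
done
```

A pre-bundled convenience jar along the lines Seth suggests would collapse this whole loop into a single download-and-copy step.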