Hi Flink users and devs,

We want to get your feedback on integrating Flink with Hive.

Background: At Flink Forward Beijing last December, the community
announced an effort to integrate Flink and Hive. At the Feb 21 Seattle
Flink Meetup <https://www.meetup.com/seattle-flink/events/258723322/>, we
presented Integrating Flink with Hive
<https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-xuefu-zhang-and-bowen-li-seattle-flink-meetup-feb-2019>
with a live demo to the local community and got a great response. As of
mid-March, we have internally finished building Flink's brand-new catalog
infrastructure, metadata integration with Hive, and the most common cases
of Flink reading from and writing to Hive, and we will now start
submitting design docs/FLIPs and contributing the code back to the
community. The reason for doing it internally first is to make sure our
proposed solutions are fully validated and tested, to gain hands-on
experience, and to not miss anything in the design. You are very welcome
to join this effort, from design/code review to development and testing.

*The most important thing we believe you, our Flink users/devs, can do
RIGHT NOW is to share your Hive use cases and give us feedback on this
project. As we go deeper into specific areas of the integration, your
feedback and suggestions will help us refine our backlog and prioritize
our work, so you can get the features you want sooner!* For example, if
most users mainly read Hive data, we can prioritize tuning read
performance over implementing write capability.

A quick overview of what we've finished building internally and are ready
to contribute back to the community:

   - Flink/Hive Metadata Integration
      - A unified, pluggable catalog infrastructure that manages
      meta-objects, including catalogs, databases, tables, views,
      functions, partitions, and table/partition stats
      - Three catalog implementations: an in-memory catalog, HiveCatalog
      for embracing the Hive ecosystem, and GenericHiveMetastoreCatalog
      for persisting Flink's streaming/batch metadata in the Hive
      metastore
      - Hierarchical metadata references of the form
      <catalog_name>.<database_name>.<metaobject_name> in SQL and the
      Table API (see the sketch after this list)
      - A unified function catalog based on the new catalog
      infrastructure, with support for Hive simple UDFs
   - Flink/Hive Data Integration
      - A Hive data connector that reads partitioned and non-partitioned
      Hive tables, with support for partition pruning, Hive simple and
      complex data types, and basic writes
   - A more powerful SQL Client, fully integrated with the above features
   and with more Hive-compatible SQL syntax, for a better end-to-end SQL
   experience
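
To make this concrete, here is a minimal sketch of what the end-to-end
experience could look like in the SQL Client, assuming a HiveCatalog has
already been registered under the name myhive. The catalog, database, and
table names are hypothetical, and the exact syntax is subject to the
upcoming design docs/FLIPs:

    -- Fully qualified reference: <catalog_name>.<database_name>.<metaobject_name>.
    -- Filtering on the partition column dt lets the connector prune partitions.
    SELECT region, SUM(amount) AS total
    FROM myhive.sales_db.orders
    WHERE dt = '2019-03-01'
    GROUP BY region;

    -- Basic write back to a Hive table through the same connector.
    INSERT INTO myhive.sales_db.daily_totals
    SELECT region, SUM(amount)
    FROM myhive.sales_db.orders
    GROUP BY region;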

*Given the above info, we want to learn from you: How do you use Hive
currently? How can we solve your pain points? What features do you expect
from the Flink-Hive integration? Details like:*

   - *Which Hive version are you using? Do you plan to upgrade Hive?*
   - *Are you planning to switch your Hive engine? What timeline are you
   looking at? What capabilities would Flink need before you'd consider
   using it with Hive?*
   - *What's your motivation to try Flink-Hive? Maintaining only one data
   processing system across your teams for simplicity and
   maintainability? Better performance from Flink than from Hive itself?*
   - *What are your Hive use cases? How large is your Hive data? Do you
   mainly read, or both read and write?*
   - *How many Hive user-defined functions do you have? Are they mostly
   UDF, GenericUDF, UDTF, or UDAF?*
   - *Any other questions or suggestions you have, or simply how you feel
   about the project*

Again, your input will be really valuable to us, and we hope that, with
all of us working together, the project can benefit our end users. Please
feel free to reply either to this thread or just to me. I'm also working
on a questionnaire to better gather your feedback; watch the mailing list
in the next couple of days.

Thanks,
Bowen
