using the Hive SQL parser in Spark

Reynold Xin Fri, 18 Dec 2015 11:55:40 -0800

Hi Hive devs,

I would like to share with you Spark's plan for its SQL parser going
forward. As you may (or may not) know, Spark SQL has had two parsers so far:


- a very simple one based on Scala's parser combinators; and
- one that depends on Hive's parser.

The Scala parser combinator one was written quickly so we could parse SQL
queries even when the Hive dependency is off. However, it suffers from some
major problems, the most important of which are (1) really bad error
messages and (2) no warning when grammar rules conflict.

We really like the Hive parser. Spark's Hive-based parser calls into Hive
itself and translates the generated AST into Spark's logical plans. However,
because the grammar definition was not in Spark, we could not introduce new
grammar rules or fix bugs when needed.

These two parsers have been a major source of confusion for Spark users,
because depending on which mode Spark SQL is running in, you get subtle
differences in grammar. It has been our intention to replace both of them
with a built-in parser.

We have looked into various options, and it looks like the best option is
to copy the ANTLR grammar file from Hive into Spark. Because the grammar
file is tightly coupled with Hive's semantic analysis, we need to refactor
some code to use it, so what we copy will end up being the .g file plus
some coupled code. We already have a prototype that somewhat works. I expect
we will get this done in early 2016.


We have also looked into creating an independent library for the SQL parser
that both Hive and Spark share. However, we eventually decided that this
approach wouldn't make much sense, because it is a lot of work for both
Hive and Spark to refactor existing code to introduce an external parser.
From Hive's perspective this does not provide any immediate benefits. From
Spark's perspective, we iterate very quickly, so having to depend on an
external component would also slow down our development. We also have some
requirements that simply don't apply in other projects (e.g. being able to
parse DataFrame expressions).


Thanks a lot for developing this parser, and we will try our best to
contribute back as we fix bugs. I will also make sure we have the proper
acknowledgment when we do this.

Cheers.

- Reynold
