[ https://issues.apache.org/jira/browse/HIVE-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314309#comment-16314309 ]
ASF GitHub Bot commented on HIVE-17580: --------------------------------------- GitHub user vihangk1 opened a pull request: https://github.com/apache/hive/pull/287 HIVE-17580 : Remove standalone-metastore's dependency with serdes Removing the dependency on serdes for the metastore requires a series of changes. I have created multiple commits which hopefully would be easier to review. Each major commit has a descriptive commit message to give a high level idea of what the change is doing. There are still some bits which need to be completed but it would be good to a review. Overview of all the changes done: 1. Creates a new module called serde-api under storage-api like discussed. Although I think we can keep it separate as well. 2. Moved List, Map, Struct, Constant, Primitive, Union ObjectInspectors to serde-api 3. Moved PrimitiveTypeInfo, PrimitiveTypeEntry and TypeInfo to serde-api. 4. Moved TypeInfoParser, TypeInfoFactory to serde-api 5. Added a new class which reading avro storage schema by copying the code from AvroSerde and AvroSerdeUtils. The parsing is done such that String value is first converted into TypeInfos and then into FieldSchemas bypassing the need for ObjectInspectors. In theory we could get rid of TypeInfos as well but that path was getting too difficult with lot of duplicate code between Hive and metastore. 6. Introduces a default storage schema reader. I noticed that most of the serdes use the same logic to parse the metadata information. This code should be refactored to a common place instead of having many copies (one in standalone hms and another set in multiple serdes) 7. Moved HiveChar, HiveVarchar, HiveCharWritable, HiveVarcharWritable to storage-api. I noticed that HiveDecimal is already in storage-api. It probably makes sense to move the other primitive types (timestamp, interval etc)to storage-api as well but it requires storage-api to be upgraded to Java 8. 8. Adds a basic test for the schema reader. I plan to add more tests as this code is reviewed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vihangk1/hive vihangk1_HIVE-17580 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/hive/pull/287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #287 ---- commit bbfb7dc44904db74a840167c02b07f50a6010b69 Author: Vihang Karajgaonkar <vihang@...> Date: 2017-11-09T00:52:39Z HIVE-17580 : Remove dependency of get_fields_with_environment_context API to serde commit d54879845eff10c19bc17bda9e09dda16f6fa295 Author: Vihang Karajgaonkar <vihang@...> Date: 2017-11-29T00:54:23Z Moved List, Map, Struct OI to storage-api commit a12d6c7ba3de598c6b6f75da1bd4efcac43036b1 Author: Vihang Karajgaonkar <vihang@...> Date: 2017-11-29T04:19:39Z Moved ConstantObjectInspector PrimitiveObjectInspector and UnionObjectInspector commit 13fb832fc2d51958e75d5e609f6781f87449aed8 Author: Vihang Karajgaonkar <vihang@...> Date: 2017-12-28T01:25:59Z Moved PrimitiveTypeInfo to serde-api In order to move PrimitiveTypeInfo we need to move the PrimitiveTypeEntry as well. PrimitiveTypeEntry depends on PrimitiveObjectInspectorUtils which cannot be pulled into serde-api. Hence the static final maps are moved to PrimitiveEntry and we provide static access methods to these maps along with the register method to add the key value pairs in the maps commit 9cbc789fd3f4ced7ce66a7313c451b75a154976f Author: Vihang Karajgaonkar <vihang@...> Date: 2017-12-28T20:51:39Z Moved the other TypeInfos to serde-api In order to move the other TypeInfo classes to serde-api we need to move the serdeConstants.java as well. This is a thrift generated class. This commit copies the serde.thrift instead of moving. The only reason I did not move it is in case of backwards compatibility reasons (in case someone is using the thrift file location to do something). If it is okay to move serde.thrift from serde module to serde-api we can delete it from serde module in a separate change. The other concern is there are some TypeInfo classes which do some validation like VarCharTypeInfo, DecimalTypeInfo. The validating methods use the actual type implementation like HiveChar, HiveDecimal etc to ensure that the params are under the correct limits. This creates a problem since we cannot bring in the type implementations as well to serde-api. Currently, I have marked these as TODO and commented them out. commit 23aa899a90d17648e560d39602a3bd29bf53661e Author: Vihang Karajgaonkar <vihang@...> Date: 2017-12-29T22:14:06Z Moved TypeInfoParser and TypeInfoFactory to serde-api In order to use TypeInfoParser in standalone metastore, we should move it to serde-api There is a problem in moving the TypeInfoParser to serde-api which is there are some validation util methods which validated the max lengths of varchar, char or decimal typeinfo's precision and scale. These values are originally defined in HiveChar, HiveVarChar and HiveDecimal which cannot be moved to serde-api. In order to fix this this change moved these constants to serde.thrift so that it can be accessed from serdeConstants instead of HiveChar or HiveVarChar (not sure if this is the right thing to do, but there doesn't seem any good option here) commit 0b29dc43e78eaa33e0cb672fb9cdabfb5b02d534 Author: Vihang Karajgaonkar <vihang@...> Date: 2018-01-03T19:45:32Z Add avro storeage schema reader This commit adds a AvroStorageSchemaReader which reads the Avro schema files both for external schema and regular avro tables. Most of the util methods are in AvroSchemaUtils class which has methods copied from AvroSerDeUtils. Some of the needed classes like SchemaResolutionProblem, InstanceCache, SchemaToTypeInfo, TypeInfoToSchema are also copied from Hive. The constants defined in AvroSerde are copied in AvroSerdeConstants. The class AvroFieldSchemaGenerator converts the AvroSchema into List of FieldSchema which is returned by the AvroStorageSchemaReader commit 5f2b174cd4e038e9871348dd4cce5389f0f75776 Author: Vihang Karajgaonkar <vihang@...> Date: 2018-01-04T01:02:40Z Introduce default storage schema reader This change introduces a default storage schema reader which copies the common code from serdes initialization method and uses it to parse the column name, type and comments from the table properties. For custom storage schema reades like Avro we will have to add more schema readers as and when required commit 2ecd4ffd1b65c29371cb4b384c43445c7c661161 Author: Vihang Karajgaonkar <vihang@...> Date: 2018-01-04T18:33:02Z Moved the schema reader classes to a separate package commit 6f734eb0c5a9472b239bdc91da56ec0efe7ceef8 Author: Vihang Karajgaonkar <vihang@...> Date: 2018-01-04T19:18:03Z Integrates the avro schema reader into the DefaultStorageaSchemaReader commit d93ce33559036020facd12ec5ea258223eb05c4c Author: Vihang Karajgaonkar <vihang@...> Date: 2018-01-05T01:40:36Z Moved HiveChar, HiveVarChar, HiveCharWritable, HiveVarcharWritable to storage-api The PrimitiveEntry caches (static maps) which keep a mapping between the typename to the classes needs to be populated before using in standalone-metastore. This is done in StorageSchemaUtils static initialization block. However, this initialization needs access to HiveChar, HiveVarchar and Timestamp related classes from serde. I see that HiveDecimal class is in storage-api. HiveChar, HiveVarChar and their base classes are comparitively light weight and can be moved to storage-api. However, Timestamp related primitive types like TimestampWritable, TimestampLocalTZWritable, TimestampTZ, HiveIntervalYearMonth, HiveIntervalYearMonthWritable, HiveIntervalDayTimeWritable classes cannot be moved in a straight-forward way because they have dependency with Java 8 and some util methods and classes in LazyBinaryUtils. I have marked this as a TODO currently. As far as I can tell TypeInfoParser does not need these classes while parsing type string to deserialize the types. Note: HiveIntervalDayTime is already in storage-api. Not sure why the other time related types are not in storage-api in the first place. commit 9bfdb9beb48088712db82b31d8bc24651dc0d474 Author: Vihang Karajgaonkar <vihang@...> Date: 2018-01-05T02:38:28Z Added a test for getFields method in standalone-metastore Adds a test case. Also fixes a bug in AvroFieldSchemaGenerator ---- > Remove dependency of get_fields_with_environment_context API to serde > --------------------------------------------------------------------- > > Key: HIVE-17580 > URL: https://issues.apache.org/jira/browse/HIVE-17580 > Project: Hive > Issue Type: Sub-task > Components: Standalone Metastore > Reporter: Vihang Karajgaonkar > Assignee: Vihang Karajgaonkar > Labels: pull-request-available > > {{get_fields_with_environment_context}} metastore API uses {{Deserializer}} > class to access the fields metadata for the cases where it is stored along > with the data files (avro tables). The problem is Deserializer classes is > defined in hive-serde module and in order to make metastore independent of > Hive we will have to remove this dependency (atleast we should change it to > runtime dependency instead of compile time). > The other option is investigate if we can use SearchArgument to provide this > functionality. -- This message was sent by Atlassian JIRA (v6.4.14#64029)