[ 
https://issues.apache.org/jira/browse/HIVE-5687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961469#comment-13961469
 ] 

Owen O'Malley commented on HIVE-5687:
-------------------------------------

Initial comments
* Please create a package-info.java (or package.html) for the entire package 
that has the text from the design document, but without the example.
* I believe the API will go through some iterations and use before it becomes 
stable. We should warn users that it will likely evolve in future versions of 
Hive and won't necessarily be backwards compatible. The package-info.java is 
probably the best place to place the warning.
* The current API requires users to implement a RecordWriter wrapper for each 
SerDe they want to use. In Hive 0.14, I think we need to revisit this and 
switch to just requiring a serde class name and a string to string map of serde 
properties. This way, any of a user's current SerDes can be used to parse the 
byte[] and there can be a generic method for constructing the object using 
reflection.
* The code shouldn't depend on OrcOutputFormat, but instead find the 
OutputFormat of the table/partition and use that. The streaming code should 
only require that it implement AcidOutputFormat.
* The RecordWriter should be passed the HiveConf rather than create it. It will 
make it easier to do unit tests.
* The StreamingIntegrationTester needs to print the exception's getMessage to 
stderr if the options don't parse correctly. Otherwise, the user doesn't get 
any clue as to which parameter they forgot.
* I don't see how the column reordering can be invoked. The SerDe is using the 
table properties from the table in the MetaStore to define the columns it 
returns, so the two should always be the same. My suggestion is to remove all 
of the column reordering code.
* If you don't remove the column reordering code, you should deserialize and then 
reorder the columns rather than the current strategy of deserialize, reorder, 
serialize, and deserialize.
* Revert the change that adds startMetaStore. It isn't called and thus 
shouldn't be added.
* The method writeImpl(byte[]) doesn't add any value and should just be inlined.
* Why do you use DDL to create partitions rather than the MetaStoreClient API 
that you use everywhere else?
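
The serde-class-name suggestion above could be sketched roughly as follows. 
The RecordParser interface, DelimitedParser class, and property names here 
are illustrative stand-ins, not Hive's actual SerDe API; the point is only 
that a class name plus a string-to-string property map is enough to build 
any parser generically via reflection, with no per-SerDe wrapper:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class SerDeFactory {
    // Illustrative stand-in for a SerDe-style interface; Hive's real
    // interface has a different shape.
    public interface RecordParser {
        void initialize(Map<String, String> properties);
        Object parse(byte[] record);
    }

    // Example parser: splits a UTF-8 line on a configurable delimiter.
    public static class DelimitedParser implements RecordParser {
        private String delimiter = ",";
        public void initialize(Map<String, String> properties) {
            delimiter = properties.getOrDefault("field.delim", ",");
        }
        public Object parse(byte[] record) {
            return new String(record, StandardCharsets.UTF_8)
                    .split(Pattern.quote(delimiter));
        }
    }

    // Generic construction from a class name plus a string-to-string
    // map, as suggested above: the factory never needs to know the
    // concrete parser class at compile time.
    public static RecordParser create(String className,
                                      Map<String, String> props) {
        try {
            RecordParser parser = (RecordParser) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            parser.initialize(props);
            return parser;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(
                "Cannot construct parser " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("field.delim", "|");
        RecordParser p = create("SerDeFactory$DelimitedParser", props);
        String[] fields =
            (String[]) p.parse("a|b|c".getBytes(StandardCharsets.UTF_8));
        System.out.println(fields.length);  // 3
    }
}
```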

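On the reordering point, if that code is kept, reordering the already 
deserialized objects directly avoids the extra serialize/deserialize round 
trip. A minimal sketch, with a hypothetical index mapping (the column names 
and types are made up for illustration):

```java
import java.util.Arrays;

public class ColumnReorder {
    // Reorder a deserialized row directly, instead of the current
    // deserialize -> reorder -> serialize -> deserialize strategy.
    // mapping[i] is the index in `row` of the value that belongs in
    // output position i.
    public static Object[] reorder(Object[] row, int[] mapping) {
        Object[] out = new Object[mapping.length];
        for (int i = 0; i < mapping.length; i++) {
            out[i] = row[mapping[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical example: the writer supplies (name, id, ts) but
        // the table schema is (id, name, ts).
        Object[] row = {"alice", 42, 1000L};
        int[] mapping = {1, 0, 2};
        System.out.println(Arrays.toString(reorder(row, mapping)));
        // [42, alice, 1000]
    }
}
```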
Some style guidelines:
* Please split lines that are longer than 100 characters; it is even better 
if they are under 80.
* Ensure your if statements have a space before the parenthesis.
* Remove the commented out code.
* Please remove the private uncalled functions (e.g., 
HiveEndPoint.newConnection(String proxyUser, ...) )
* You've defined a lot of exceptions in this API. Is the user expected to 
handle each exception separately? Your throws clauses don't list the exact 
exceptions that are thrown. You'd be better off with different exceptions only 
when the user is expected to be able to handle a specific error. Otherwise, you 
might as well use StreamingException with a descriptive error message for 
everything.
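
The single-exception approach suggested above might look like the following 
sketch (the class and method names here are illustrative, not the patch's 
actual API): one checked StreamingException carrying a descriptive message, 
thrown for every failure mode the caller cannot handle programmatically.

```java
public class StreamingExceptionDemo {
    // One checked exception with a descriptive message, instead of a
    // separate exception class per failure mode the caller cannot act on.
    public static class StreamingException extends Exception {
        public StreamingException(String message) {
            super(message);
        }
        public StreamingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical operation: every internal failure is wrapped with
    // enough context for the caller to log or report it.
    static void commitBatch(int batchId) throws StreamingException {
        if (batchId < 0) {
            throw new StreamingException("Cannot commit batch " + batchId
                + ": batch id must be non-negative");
        }
    }

    public static void main(String[] args) {
        try {
            commitBatch(-1);
        } catch (StreamingException e) {
            // As the StreamingIntegrationTester comment above suggests,
            // surface the message so the user knows what went wrong.
            System.err.println(e.getMessage());
        }
    }
}
```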


> Streaming support in Hive
> -------------------------
>
>                 Key: HIVE-5687
>                 URL: https://issues.apache.org/jira/browse/HIVE-5687
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>              Labels: ACID, Streaming
>             Fix For: 0.13.0
>
>         Attachments: 5687-api-spec4.pdf, 5687-draft-api-spec.pdf, 
> 5687-draft-api-spec2.pdf, 5687-draft-api-spec3.pdf, 
> HIVE-5687-unit-test-fix.patch, HIVE-5687.patch, HIVE-5687.v2.patch, 
> HIVE-5687.v3.patch, HIVE-5687.v4.patch, HIVE-5687.v5.patch, Hive Streaming 
> Ingest API for v3 patch.pdf, Hive Streaming Ingest API for v4 patch.pdf
>
>
> Implement support for Streaming data into HIVE.
> - Provide a client streaming API 
> - Transaction support: Clients should be able to periodically commit a batch 
> of records atomically
> - Immediate visibility: Records should be immediately visible to queries on 
> commit
> - Should not overload HDFS with too many small files
> Use Cases:
>  - Streaming logs into HIVE via Flume
>  - Streaming results of computations from Storm



--
This message was sent by Atlassian JIRA
(v6.2#6252)
