Re: Review Request 50896: HIVE-14404: Allow delimiterfordsv to use multiple-character delimiters

Marta Kuczora Thu, 11 Aug 2016 00:44:07 -0700


> On Aug. 9, 2016, 5:35 p.m., Szehon Ho wrote:
> > I'm ambivilent, I would rather have pursued the change to make it in 
> > superCSV to be better in long run.  But I do see it might not move very 
> > fast (did you guys try contacting them?).   The patch itself looks mostly 
> > fine though.
> > 
> > My only question is, does it need to be a 2nd version of the format?  That 
> > is, is there anything that is actually backward incompatibie other than 
> > adding a new flag?  Thanks.


No, I didn’t contacted the SuperCSV team. I asked around what would be the 
better way to go and I mostly got the answer to fix the issue in Hive, so I 
went in that direction. But I can try to contact them if fixing the issue there 
would be preferable.

Actually the issue can be fixed without introducing the 2nd dsv output format 
as well. In this case I saw two possible solutions:
- Separate the logic for the dsv output format from the csv2 and tsv2 formats 
and for dsv format always use the new logic which supports the string 
delimiter. In this case the Super CSV library wouldn’t be used for the dsv 
format.

- Change the logic in the SeparatedValuesOutputFormat class to use the SuperCSV 
library just like now if the delimiter is a single character and the use the 
new logic if the delimiter is a string.  This solution would leave the 
single-character case unchanged, and would support the multi-character 
delimiter for the same dsv format, but the code in the 
SeparatedValuesOutputFormat class would not be that nice.

I was thinking about which solution to choose. The reason why I introduced the 
new format is that I didn’t wan’t to change the logic for the single-character 
delimiter case, because I think using Super CSV makes the code a lot cleaner. 
But I also wanted to the keep the code clean, so I chose to separate the 
single-, and multi-character delimiter cases and introduce a new format.
If it is preferable to have only one dsv format, I can change the patch easily.


- Marta


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50896/#review145227
-----------------------------------------------------------


On Aug. 8, 2016, 3:13 p.m., Marta Kuczora wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50896/
> -----------------------------------------------------------
> 
> (Updated Aug. 8, 2016, 3:13 p.m.)
> 
> 
> Review request for hive, Naveen Gangam, Sergio Pena, Szehon Ho, and Xuefu 
> Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Introduced a new outputformat (dsv2) which supports multiple characters as 
> delimiter.
> For generating the dsv, csv2 and tsv2 outputformats, the Super CSV library is 
> used. This library doesn’t support multiple characters as delimiter. Since 
> the same logic is used for generating csv2, tsv2 and dsv outputformats, I 
> decided not to change this logic, rather introduce a new outputformat (dsv2) 
> which supports multiple characters as delimiter. 
> The new dsv2 outputformat has the same escaping logic as the dsv outputformat 
> if the quoting is not disabled.
> Extended the TestBeeLineWithArgs tests with new test steps which are using 
> multiple characters as delimiter.
> 
> Main changes in the code:
>  - Changed the SeparatedValuesOutputFormat class to be an abstract class and 
> created two new child classes to separate the logic for single-character and 
> multi-character delimiters: SingleCharSeparatedValuesOutputFormat and 
> MultiCharSeparatedValuesOutputFormat
> 
>  - Kept the methods which are used by both children in the 
> SeparatedValuesOutputFormat and moved the methods specific to the 
> single-character case to the SingleCharSeparatedValuesOutputFormat class.
> 
>  - Didn’t change the logic which was in the SeparatedValuesOutputFormat, only 
> moved some parts to the child class.
> 
>  - Implemented the value escaping and concatenation with the delimiter string 
> in the MultiCharSeparatedValuesOutputFormat.
> 
> 
> Diffs
> -----
> 
>   beeline/src/java/org/apache/hive/beeline/BeeLine.java e0fa032 
>   beeline/src/java/org/apache/hive/beeline/BeeLineOpts.java e6e24b1 
>   
> beeline/src/java/org/apache/hive/beeline/MultiCharSeparatedValuesOutputFormat.java
>  PRE-CREATION 
>   beeline/src/java/org/apache/hive/beeline/SeparatedValuesOutputFormat.java 
> 66d9fd0 
>   
> beeline/src/java/org/apache/hive/beeline/SingleCharSeparatedValuesOutputFormat.java
>  PRE-CREATION 
>   
> itests/hive-unit/src/test/java/org/apache/hive/beeline/TestBeeLineWithArgs.java
>  892c733 
> 
> Diff: https://reviews.apache.org/r/50896/diff/
> 
> 
> Testing
> -------
> 
> - Tested manually in BeeLine.
> - Extended the TestBeeLineWithArgs tests with new test steps which are using 
> multiple characters as delimiter.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>

Re: Review Request 50896: HIVE-14404: Allow delimiterfordsv to use multiple-character delimiters

Reply via email to