[ https://issues.apache.org/jira/browse/HIVE-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203979#comment-14203979 ]
ronan stokes commented on HIVE-8763:
------------------------------------

To avoid any performance issues, the SERDE modifications will not support embedded record delimiters in quoted strings. For example, if the source data uses newline (UTF-8 0x0a) as the record delimiter, the modifications will not do anything specific to handle a newline inside a quoted string, nor will they disallow it.

As handling of embedded record delimiters requires changes to the underlying input format, I am not proposing to handle embedded record delimiters with these modifications.

> Support for use of enclosed quotes in LazySimpleSerde
> -----------------------------------------------------
>
>                 Key: HIVE-8763
>                 URL: https://issues.apache.org/jira/browse/HIVE-8763
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0, 0.13.1
>         Environment: many - verified on Centos / Redhat with CDH
>            Reporter: ronan stokes
>
> Currently the LazySimpleSerde does not support the use of quotes around
> delimited fields to allow separators within a quoted field. This means
> having to use alternatives for many common use cases for CSV style data.
> Key scenarios that do not work include:
> (3 column row for int, string, float delimited by ',')
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 100, "3.5 "" hard drive, quantity 10", 2650.30
> 100,"3.5 "" hard drive, quantity 10",2650.30
> To address this, I have implemented a number of fixes in the
> deserialization stage of a copy of the LazySimpleSerde.
> For serialization, the code is unchanged, with the relevant embedded
> characters being escaped.
> Assuming a row with 3 fields - SKU ID, description, price, delimited by ','
> 1) allow use of enclosed quotes around a string field
> For example
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 2) support escaping of quotes within a field to allow use of an embedded quote
> For example
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 3) support for old style CSV embedded quotes
> For example
> 100,"3.5 "" hard drive, quantity 10",2650.30
> 4) support for skipping of leading spaces in a field
> For example (note space between first ',' and opening quote)
> 100, "3.5 "" hard drive, quantity 10", 2650.30
> In each case, with the changes, these are evaluated as though the delimiters
> and embedded quotes were escaped:
> e.g.
> 100, 3.5 \" hard drive\, quantity 10, 2650.30
> All of these are enabled or disabled using serde properties for the quote
> char, whether enclosed quotes are supported, and whether double embedded
> quotes are treated as a single quote (of the same char type)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
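
The four deserialization rules in the description (enclosing quotes, backslash-escaped quotes, doubled quotes, and skipped leading spaces) can be sketched as a small standalone splitter. This is only an illustration and not the patch or the LazySimpleSerde code: the class name QuotedFieldSplitter and its split method are hypothetical, it returns unescaped field values rather than the escaped form shown in the description, and it assumes leading spaces are skipped for every field. Like the proposed modifications, it works on a single record and does nothing about record delimiters embedded in quoted strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical class, not part of Hive. */
class QuotedFieldSplitter {

    /** Splits one record into unescaped field values. */
    static List<String> split(String line, char delim, char quote, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        boolean fieldStart = true; // true until the first significant char of a field
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                // Rule 2: an escaped character is taken literally.
                cur.append(line.charAt(i + 1));
                fieldStart = false;
                i += 2;
                continue;
            }
            if (inQuotes) {
                if (c != quote) {
                    cur.append(c); // delimiters inside quotes are plain data (rule 1)
                } else if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    // Rule 3: a doubled quote inside a quoted field is one quote.
                    cur.append(quote);
                    i += 2;
                    continue;
                } else {
                    inQuotes = false; // closing quote
                }
            } else if (fieldStart && c == ' ') {
                // Rule 4 (assumed to apply to every field): skip leading spaces.
            } else if (fieldStart && c == quote) {
                inQuotes = true; // Rule 1: opening quote
                fieldStart = false;
            } else if (c == delim) {
                fields.add(cur.toString());
                cur.setLength(0);
                fieldStart = true;
            } else {
                cur.append(c);
                fieldStart = false;
            }
            i++;
        }
        fields.add(cur.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        String[] rows = {
            "100,\"3.5 inch hard drive, quantity 10\",2650.30",
            "100,\"3.5 \\\" hard drive, quantity 10\",2650.30",
            "100,\"3.5 \"\" hard drive, quantity 10\",2650.30",
            "100, \"3.5 \"\" hard drive, quantity 10\", 2650.30"
        };
        for (String row : rows) {
            System.out.println(split(row, ',', '"', '\\'));
        }
    }
}
```

Each of the four sample rows from the description yields three fields under these assumptions; the real SerDe would additionally consult its serde properties to decide which of the rules are enabled.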