The translate UDF does take care of non-ASCII characters. It uses
code points instead of characters.
Here is the unit test to demonstrate that:
https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/udf_translate.q#L36
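
For readers who want a feel for what "code points instead of characters" means in practice, here is a minimal sketch in Python, not Hive's Java. The function name translate_by_code_point is hypothetical, and the rules it follows (first mapping wins, source characters with no counterpart are deleted) are an assumption modeled on PostgreSQL-style translate, not taken from the Hive source.

```python
def translate_by_code_point(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for a code-point-based translate.

    Each code point in `source` is mapped to the code point at the same
    position in `target`. Assumption: code points in `source` that have no
    counterpart in `target` are deleted, and the first mapping for a
    repeated source code point wins.
    """
    mapping = {}
    for index, char in enumerate(source):
        code_point = ord(char)
        if code_point in mapping:
            continue  # keep the first mapping for a repeated code point
        # None tells str.translate to delete the code point
        mapping[code_point] = target[index] if index < len(target) else None
    return text.translate(mapping)


if __name__ == "__main__":
    # Non-ASCII characters inside the Basic Multilingual Plane
    print(translate_by_code_point("na\u00efve caf\u00e9", "\u00ef\u00e9", "ie"))
    # A supplementary character (U+1F600) is handled as one unit
    print(translate_by_code_point("a\U0001F600b", "\U0001F600", "x"))
```

Python strings are sequences of code points, so a supplementary character such as U+1F600 is mapped as a single unit here rather than as two surrogate halves.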
But you guys are right. It doesn't solve Tom's original p
Anyone know if translate takes ranges, like some implementations? e.g.,
translate ('[a-z]', '[A-Z]')
Of course, that probably doesn't work for non-ASCII characters.
On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall wrote:
> Thanks Dean,
>
> I don't think translate would work as the set of things to rem
Thanks Dean,
I don't think translate would work as the set of things to remove is massive.
Yeah, it's a one-off cleanup job while exporting to try Redshift on our
datasets.
My guess is it's something about the way Hive handles strings? I tried
"\\ufffd" as the replacement str but no joy either.
Chee
I think this should work, but you might investigate using the translate
function instead. I suspect it will provide much better performance than
using regexps. Also, are you planning to do this once to create your final
tables? If so, the performance overhead won't matter much.
dean
On Fri, Mar 1
I would like to remove Unicode characters that are outside the Basic
Multilingual Plane [1].
I thought
select regexp_replace(some_column,"[^\\u-\\u]","\ufffd") from
my_table
would work, but while the regexp does work, the replacement str does not (I
can paste in the literal �, which you may or may
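
To make the intended transformation concrete, here is a minimal sketch in Python of replacing every code point outside the Basic Multilingual Plane with U+FFFD. It runs outside Hive and says nothing about how Hive parses escapes in string literals; the names NON_BMP, REPLACEMENT and replace_non_bmp are hypothetical, chosen only for this illustration.

```python
import re

# Any single code point above U+FFFF, i.e. outside the Basic Multilingual Plane.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

# U+FFFD REPLACEMENT CHARACTER
REPLACEMENT = "\ufffd"


def replace_non_bmp(text: str) -> str:
    """Replace each code point outside the BMP with U+FFFD."""
    return NON_BMP.sub(REPLACEMENT, text)


if __name__ == "__main__":
    sample = "ok \U0001F600 \u4e2d"
    # Show the code points of the cleaned string for easy inspection
    print([hex(ord(char)) for char in replace_non_bmp(sample)])
```

Characters inside the BMP, including non-ASCII ones such as U+4E2D, pass through unchanged; only supplementary characters are swapped for the replacement character.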