Re: regexp_replace with unicode chars

2013-03-01 Thread Mark Grover
The translate UDF does take care of non-ASCII characters. It uses code points instead of characters. Here is the unit test to demonstrate that: https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/udf_translate.q#L36 But you guys are right. It doesn't solve Tom's original p
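As a rough illustration of what "code points instead of characters" means for a translate-style function, here is a minimal Python sketch. It is not Hive's implementation: the function name `translate_by_code_point` is hypothetical, and the rules that the first mapping wins and that unmatched `from_chars` are deleted are assumptions borrowed from how SQL `translate` functions commonly behave.

```python
def translate_by_code_point(text: str, from_chars: str, to_chars: str) -> str:
    """Map each code point in from_chars to the one at the same index in to_chars.

    Assumptions (not confirmed for Hive): the first mapping for a repeated
    source character wins, and source characters with no counterpart in
    to_chars are deleted.
    """
    table = {}
    for index, ch in enumerate(from_chars):
        code_point = ord(ch)
        if code_point in table:
            continue  # keep the first mapping for a repeated source character
        # None tells str.translate to drop the character entirely
        table[code_point] = to_chars[index] if index < len(to_chars) else None
    # Python 3 strings are sequences of code points, so characters outside
    # the Basic Multilingual Plane are handled as single units here.
    return text.translate(table)


if __name__ == "__main__":
    print(translate_by_code_point("h\u00e9llo", "\u00e9", "e"))
```

Because the mapping is keyed on whole code points, a supplementary character such as U+1F600 is replaced as one unit rather than as two surrogate halves.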

Re: regexp_replace with unicode chars

2013-03-01 Thread Dean Wampler
Does anyone know if translate takes ranges, like some implementations do? E.g., translate('[a-z]', '[A-Z]'). Of course, that probably doesn't work for non-ASCII characters. On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall wrote: > Thanks Dean, > > I dont think translate would work as the set of things to rem

Re: regexp_replace with unicode chars

2013-03-01 Thread Tom Hall
Thanks Dean, I don't think translate would work, as the set of things to remove is massive. Yeah, it's a one-off cleanup job while exporting to try Redshift on our datasets. My guess is it's something about the way Hive handles strings? Tried "\\ufffd" as the replacement str but no joy either. Chee

Re: regexp_replace with unicode chars

2013-03-01 Thread Dean Wampler
I think this should work, but you might investigate using the translate function instead. I suspect it will provide much better performance than using regexps. Also, are you planning to do this once to create your final tables? If so, the performance overhead won't matter much. dean On Fri, Mar 1

regexp_replace with unicode chars

2013-03-01 Thread Tom Hall
I would like to remove Unicode chars that are outside the Basic Multilingual Plane [1]. I thought select regexp_replace(some_column,"[^\\u-\\u]","\ufffd") from my_table would work, but while the regexp does work, the replacement str does not (I can paste in the literal �, which you may or may
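For reference, the intended transformation can be sketched outside Hive. The character range in the pattern above appears to have been lost in the archive, so this Python sketch assumes the goal is to replace every code point above U+FFFF with U+FFFD. The name `replace_non_bmp` is hypothetical, and the sketch says nothing about how Hive parses escape sequences in string literals, which is the open question in this thread.

```python
import re

# Any code point above U+FFFF lies outside the Basic Multilingual Plane.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def replace_non_bmp(text: str, replacement: str = "\ufffd") -> str:
    """Replace each non-BMP code point with the replacement string."""
    return NON_BMP.sub(replacement, text)


if __name__ == "__main__":
    sample = "ok \U0001F600 done"
    # Print code points so the result is visible even without font support
    print([hex(ord(ch)) for ch in replace_non_bmp(sample)])
```

Characters inside the BMP, including accented Latin letters and CJK text, pass through unchanged; only supplementary-plane characters are substituted.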