You might be better off using the CSV loader in this case. 
https://github.com/databricks/spark-csv 

Input:
[csingh ~]$ hadoop fs -cat test.csv
360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

and here is a quick and dirty way to resolve your issue:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .load("test.csv")
--> df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: string, C3: string, C4: string]

df.first()
--> res0: org.apache.spark.sql.Row = [360,10/02/2014,?2,500.00,?0.00,?2,500.00]

val a = df.map(x => (
  x.getInt(0),
  x.getString(1),
  x.getString(2).replace("?", "").replace(",", ""),
  x.getString(3).replace("?", ""),
  x.getString(4).replace("?", "").replace(",", "")))
--> a: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[17] at map at <console>:21

a.collect()
--> res1: Array[(Int, String, String, String, String)] = Array((360,10/02/2014,2500.00,0.00,2500.00))
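
If you are on Spark 1.5 or later, another option is the DataFrame function regexp_replace, which maps directly onto the Hive REGEXP_REPLACE you started from. A rough sketch against the inferred columns above (the C0..C4 names are spark-csv's defaults, so adjust if yours differ):

import org.apache.spark.sql.functions.regexp_replace

// keep only digits and the decimal point in the currency columns,
// mirroring Hive's REGEXP_REPLACE(col, '[^\\d\\.]', '')
val cleaned = df
  .withColumn("C2", regexp_replace(df("C2"), "[^\\d.]", ""))
  .withColumn("C3", regexp_replace(df("C3"), "[^\\d.]", ""))
  .withColumn("C4", regexp_replace(df("C4"), "[^\\d.]", ""))

cleaned.first()
--> should come back as [360,10/02/2014,2500.00,0.00,2500.00]

This keeps everything in the DataFrame API instead of dropping down to an RDD of tuples.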

> On Feb 19, 2016, at 9:06 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Ok
>  
> I have created a one liner csv file as follows:
>  
> cat testme.csv
> 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>  
> I use the following in Spark to split it
>  
> csv=sc.textFile("/data/incoming/testme.csv")
> csv.map(_.split(",")).first
> res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2, 
> 500.00")
>  
> That comes back with an array
>  
> Now all I want is to get rid of the "?" and "," in the above. The problem is 
> that I have a currency field, "?2,500.00", which contains an additional "," 
> that messes things up.
>  
> replaceAll() does not work
>  
> Any other alternatives?
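
(Likely the catch here: String.replaceAll treats its first argument as a regex, and ? is a regex quantifier, so a call like "?2,500.00".replaceAll("?", "") throws a PatternSyntaxException rather than quietly doing nothing. Assuming that is what happened, either escape the metacharacter or use the literal replace:

"?2,500.00".replaceAll("\\?", "")       // escaped regex: 2,500.00
"?2,500.00".replace("?", "")            // literal replacement: 2,500.00
"?2,500.00".replaceAll("[^\\d.]", "")   // strip all but digits and dot: 2500.00

The last form is the same pattern as the Hive REGEXP_REPLACE quoted further down.)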
>  
> Thanks,
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
>  
> From: Andrew Ehrlich [mailto:and...@aehrlich.com] 
> Sent: 19 February 2016 01:22
> To: Mich Talebzadeh <m...@peridale.co.uk>
> Cc: User <user@spark.apache.org>
> Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark
>  
> Use the Scala method .split(",") to split the string into a collection of 
> strings, and try using .replaceAll() on the field with the "?" to remove it.
>  
> On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk 
> <mailto:m...@peridale.co.uk>> wrote:
>> Hi,
>> 
>> What is the equivalent of this Hive statement in Spark
>> 
>>  
>> 
>> select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
>> +------------+----------+--+
>> |    _c0     |   _c1    |
>> +------------+----------+--+
>> | ?2,500.00  | 2500.00  |
>> +------------+----------+--+
>> 
>> Basically I want to get rid of "?" and "," in the csv file
>> 
>>  
>> 
>> The full csv line is
>> 
>>  
>> 
>> scala> csv2.first
>> res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>> 
>> I want to transform that string into 5 columns and use "," as the split
>> 
>> Thanks,
>> 
>> Dr Mich Talebzadeh
