You might be better off using the CSV loader in this case: https://github.com/databricks/spark-csv
Input:

    [csingh ~]$ hadoop fs -cat test.csv
    360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

and here is a quick and dirty way to resolve your issue:

    scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("test.csv")
    df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: string, C3: string, C4: string]

    scala> df.first()
    res0: org.apache.spark.sql.Row = [360,10/02/2014,?2,500.00,?0.00,?2,500.00]

    scala> val a = df.map(x => (x.getInt(0), x.getString(1),
         |   x.getString(2).replace("?", "").replace(",", ""),
         |   x.getString(3).replace("?", ""),
         |   x.getString(4).replace("?", "").replace(",", "")))
    a: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[17] at map at <console>:21

    scala> a.collect()
    res1: Array[(Int, String, String, String, String)] = Array((360,10/02/2014,2500.00,0.00,2500.00))

Two more sketches, a direct regexp_replace equivalent of the Hive statement and a quote-aware split, follow after the quoted thread below.

> On Feb 19, 2016, at 9:06 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Ok
>
> I have created a one-liner csv file as follows:
>
> cat testme.csv
> 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>
> I use the following in Spark to split it:
>
> val csv = sc.textFile("/data/incoming/testme.csv")
> csv.map(_.split(",")).first
> res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2, 500.00")
>
> That comes back with an array.
>
> Now all I want is to get rid of "?" and "," in the above. The problem is that I have a
> currency field, "?2,500.00", which contains an additional "," that messes things up.
>
> replaceAll() does not work.
>
> Any other alternatives?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> From: Andrew Ehrlich [mailto:and...@aehrlich.com]
> Sent: 19 February 2016 01:22
> To: Mich Talebzadeh <m...@peridale.co.uk>
> Cc: User <user@spark.apache.org>
> Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark
>
> Use the Scala method .split(",") to split the string into a collection of strings,
> and try using .replaceAll() on the field with the "?" to remove it.
>
> On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>> Hi,
>>
>> What is the equivalent of this Hive statement in Spark?
>>
>> select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
>> +------------+----------+--+
>> |    _c0     |   _c1    |
>> +------------+----------+--+
>> | ?2,500.00  | 2500.00  |
>> +------------+----------+--+
>>
>> Basically I want to get rid of "?"
>> and "," in the csv file.
>>
>> The full csv line is:
>>
>> scala> csv2.first
>> res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>>
>> I want to transform that string into 5 columns, using "," as the split character.
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
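
To come back to the Hive question quoted above: Spark SQL ships the same function, regexp_replace, in org.apache.spark.sql.functions (available from Spark 1.5 on), so the Hive one-liner carries over almost directly. The following is a minimal, untested sketch against the df loaded with spark-csv at the top of this message, using the inferred column names C0 to C4:

    import org.apache.spark.sql.functions.regexp_replace

    // Analogue of Hive's REGEXP_REPLACE(col, '[^\\d\\.]', ''):
    // keep only digits and the decimal point in the currency columns.
    val cleaned = df.select(
      df("C0"), df("C1"),
      regexp_replace(df("C2"), "[^\\d.]", "").as("C2"),
      regexp_replace(df("C3"), "[^\\d.]", "").as("C3"),
      regexp_replace(df("C4"), "[^\\d.]", "").as("C4"))

    cleaned.first()
    // expected: [360,10/02/2014,2500.00,0.00,2500.00]

This keeps everything in the DataFrame API, so there is no need to drop down to an RDD of tuples at all.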
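
And on the split(",") problem: the naive split breaks because the extra "," sits inside a quoted field. A regex that only splits on commas outside double quotes, combined with replaceAll (note that replaceAll returns a new string, since Java strings are immutable, so its result must be captured), should handle it. Again an untested sketch, reusing the testme.csv path from the quoted thread:

    val csv = sc.textFile("/data/incoming/testme.csv")

    // Split only on commas followed by an even number of '"' characters,
    // i.e. commas outside quoted fields; then strip the quotes, the "?"
    // and the thousands separators from every field.
    val parsed = csv.map { line =>
      line.split(""",(?=(?:[^"]*"[^"]*")*[^"]*$)""")
          .map(_.replaceAll("[\"?,]", ""))
    }

    parsed.first()
    // expected: Array(360, 10/02/2014, 2500.00, 0.00, 2500.00)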