Hey buddy,
Thanks a TON! Issue resolved. Thanks again, Aakash. On 30-Oct-2017 11:44 PM, "Hondros, Constantine (ELS-AMS)" < c.hond...@elsevier.com> wrote: > You should just use regexp_replace to remove all the leading number > information (assuming it ends with a full-stop, and catering for the > possibility of a capital letter). > > > > This is untested, but it shoud do the trick based on your examples so far: > > > > df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”, > “”)) > > > > > > *From:* Aakash Basu [mailto:aakash.spark....@gmail.com] > *Sent:* 30 October 2017 18:53 > *To:* user > *Subject:* Split column with dynamic data > > > > **** External email: use caution **** > > > > Hi all, > > > > I've a requirement to split a column and fetch only the description where > I have numbers appended before that for some rows whereas other rows have > only the description - > > > > Eg - (Description is the column header) > > > > *Description* > > Inventory Tree > > Products > > 1. AT&T Services > > 2. Accessories > > 4. Miscellaneous > > 5. Service Center Items > > 3. 3rd Party Services > > Integrated Service > > 6. Demo Devices > > IT Department > > Merchandising > > 2A. Impulse > > 2. Handsets > > 3. Strategic Products > > 1. Opportunities > > 5. Features > > 6. Rate Plans > > 7. Other AT&T Incentives > > 8. Wired > > 4. Rate Plan Tier Change > > Integration SKUs (March 2016) > > 9. Financing SKUs > > 1. Smartphone > > > > > From the above, I only want the words of description and remove the > numbers. But since they're abrupt, it is difficult to use a single logic to > have it done. > > > > Trying with the following options - > > > > 1) Using split by *fullstop*, together - > > > > *split_col = split(CategoryInp_DF['description'], '.')* > > > > *CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', > split_col.getItem(0))* > > *CategoryInp_DF= CategoryInp_DF.withColumn('description', > split_col.getItem(1))* > > > > > > Result -> But both the output columns come as nulls. > > > > 2) Using split by *fullstop and space*, together - > > *split_col = split(CategoryInp_DF['description'], '. ')* > > > > *CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', > split_col.getItem(0))* > > *CategoryInp_DF= CategoryInp_DF.withColumn('description', > split_col.getItem(1))* > > > > Result -> This is perfectly working for the numbered rows, but is > unnecessarily splitting the non-numbered rows too. > > > > Any help would be greatly appreciated. > > > > > > Thanks, > > Aakash. > > ------------------------------ > > Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The > Netherlands > <https://maps.google.com/?q=Radarweg+29,+1043+NX+Amsterdam,+The+Netherlands&entry=gmail&source=g>, > Registration No. 33156677, Registered in The Netherlands. >