You should just use regexp_replace to remove all the leading number information (assuming it ends with a full-stop, and catering for the possibility of a capital letter).
This is untested, but it shoud do the trick based on your examples so far: df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”, “”)) From: Aakash Basu [mailto:aakash.spark....@gmail.com] Sent: 30 October 2017 18:53 To: user Subject: Split column with dynamic data *** External email: use caution *** Hi all, I've a requirement to split a column and fetch only the description where I have numbers appended before that for some rows whereas other rows have only the description - Eg - (Description is the column header) Description Inventory Tree Products 1. AT&T Services 2. Accessories 4. Miscellaneous 5. Service Center Items 3. 3rd Party Services Integrated Service 6. Demo Devices IT Department Merchandising 2A. Impulse 2. Handsets 3. Strategic Products 1. Opportunities 5. Features 6. Rate Plans 7. Other AT&T Incentives 8. Wired 4. Rate Plan Tier Change Integration SKUs (March 2016) 9. Financing SKUs 1. Smartphone From the above, I only want the words of description and remove the numbers. But since they're abrupt, it is difficult to use a single logic to have it done. Trying with the following options - 1) Using split by fullstop, together - split_col = split(CategoryInp_DF['description'], '.') CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0)) CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1)) Result -> But both the output columns come as nulls. 2) Using split by fullstop and space, together - split_col = split(CategoryInp_DF['description'], '. ') CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0)) CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1)) Result -> This is perfectly working for the numbered rows, but is unnecessarily splitting the non-numbered rows too. Any help would be greatly appreciated. Thanks, Aakash. ________________________________ Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.