You should just use regexp_replace to remove all the leading number information 
(assuming it ends with a full-stop, and catering for the possibility of a 
capital letter).

This is untested, but it shoud do the trick based on your examples so far:

df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”, “”))


From: Aakash Basu [mailto:aakash.spark....@gmail.com]
Sent: 30 October 2017 18:53
To: user
Subject: Split column with dynamic data


*** External email: use caution ***


Hi all,

I've a requirement to split a column and fetch only the description where I 
have numbers appended before that for some rows whereas other rows have only 
the description -

Eg - (Description is the column header)

Description

Inventory Tree

Products

1. AT&T Services

2. Accessories

4. Miscellaneous

5. Service Center Items

3. 3rd Party Services

Integrated Service

6. Demo Devices

IT Department

Merchandising

2A. Impulse

2. Handsets

3. Strategic Products

1. Opportunities

5. Features

6. Rate Plans

7. Other AT&T Incentives

8. Wired

4. Rate Plan Tier Change

Integration SKUs (March 2016)

9. Financing SKUs

1. Smartphone



From the above, I only want the words of description and remove the numbers. 
But since they're abrupt, it is difficult to use a single logic to have it done.

Trying with the following options -

1) Using split by fullstop, together -

split_col = split(CategoryInp_DF['description'], '.')

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))
CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))


Result -> But both the output columns come as nulls.

2) Using split by fullstop and space, together -
split_col = split(CategoryInp_DF['description'], '. ')

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))
CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

Result -> This is perfectly working for the numbered rows, but is unnecessarily 
splitting the non-numbered rows too.

Any help would be greatly appreciated.


Thanks,
Aakash.

________________________________

Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The 
Netherlands, Registration No. 33156677, Registered in The Netherlands.

Reply via email to