On Thursday, 3 August 2017 01:05:57 UTC+10, Daiyue Weng wrote:
> Hi, I am trying to remove extra quotes from a large set of strings (a
> list of strings), so each original string looks like,
>
> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
>
> I'd like to remove the start and end quotes and the extra pairs of quotes
> on each string value, so the result will look like,
>
> "str_value1","str_value2","str_value3",1,"str_value4"
>
> and then join each string by a new line.
>
> I have tried the following code,
>
>     for line in str_lines[1:]:
>         strip_start_end_quotes = line[1:-1]
>         splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>         str_lines[str_lines.index(line)] = splited_line_rem_quotes
>
>     for_pandas_new_headers_str = '\n'.join(str_lines)
>
> but it is really slow (running for ages) if the list contains over 1
> million string lines. I am thinking about a fast way to do that.
>
> I also tried to multiprocess this task with
>
>     import multiprocessing
>
>     def preprocess_data_str_line(data_str_lines):
>         """
>         :param data_str_lines:
>         :return:
>         """
>         for line in data_str_lines:
>             strip_start_end_quotes = line[1:-1]
>             splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>             data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes
>
>         return data_str_lines
>
>     def multi_process_prepcocess_data_str(data_str_lines):
>         """
>         :param data_str_lines:
>         :return:
>         """
>         # if cpu load < 25% and 4GB of ram free use 3 cores
>         # if cpu load < 70% and 4GB of ram free use 2 cores
>         cores_to_use = how_many_core()
>
>         data_str_blocks = slice_list(data_str_lines, cores_to_use)
>
>         for block in data_str_blocks:
>             # spawn a process for each data string block assigned to a cpu core
>             p = multiprocessing.Process(target=preprocess_data_str_line,
>                                         args=(block,))
>             p.start()
>
> but I don't know how to concatenate the results back into the list so that
> I can join the strings in the list by new lines.
>
> So, ideally, I am thinking about using multiprocessing + a fast function to
> preprocess each line to speed up the whole process.
>
> cheers
Hi Daiyue,

My first thought is to use split/join to solve this problem, but you would
need to decide what to do with any non-strings in your 1,000,000 element
list. You also need to be sure that the pipe character | is in none of
your strings.

    split_on_dbl_dbl_quote = '|'.join(original_list).split('""')
    undoubled_lines = '"'.join(split_on_dbl_dbl_quote).split('|')
    cleaned_lines = [line[1:-1] for line in undoubled_lines]

You need to be sure of your data: rejoining on a single quote collapses
every doubled quote into one, and the final [1:-1] relies on every line
starting and ending with exactly one outer quote. This runs in under a
second for a million strings, but note that '|'.join() requires *all*
elements to be strings, so any non-strings would have to be converted
first (and would come back as strings).

As to multiprocessing: I would be looking at a well-optimised
single-thread solution like split/join before I consider MP. If you can
fit the problem to a split/join it'll be much simpler and more "pythonic".
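By the way, the reason your original loop runs for ages is most likely
str_lines.index(line): it rescans the list from the beginning on every
iteration, which makes the whole loop quadratic (and with duplicate lines
it would update the wrong element). A plain list comprehension sidesteps
both problems; a minimal sketch, reusing the names from your post:

    cleaned = [line[1:-1].replace('""', '"') for line in str_lines[1:]]
    for_pandas_new_headers_str = '\n'.join(cleaned)

For a million short lines that should also finish in roughly a second,
without any multiprocessing.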
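And to show the split/join version end-to-end on your sample data (this
assumes every element is a string shaped like your example: one outer
quote, doubled quotes around each value, and no pipe characters):

    original_list = [
        '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""',
        '"""str_value5"",""str_value6"",2,""str_value7"""',
    ]
    as_one_string = '|'.join(original_list)           # one big string, one pass
    undoubled = '"'.join(as_one_string.split('""'))   # same effect as .replace('""', '"')
    cleaned = [line[1:-1] for line in undoubled.split('|')]
    print('\n'.join(cleaned))
    # "str_value1","str_value2","str_value3",1,"str_value4"
    # "str_value5","str_value6",2,"str_value7"

Cheers,
Nick
--
https://mail.python.org/mailman/listinfo/python-list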