Digest of replies to the filenames question, including the Python code I now use to answer it (sorry)

Cheerio,

Graeme

-----------------------

The majority are along the lines of

some_prefix_numbers.suffix

e.g. foo_bar_1_0001.cbf. However, there was a suggestion that there are
cases where the numbers start from 1, 2, 3 rather than 001, 002, 003
etc., so the file names do not have a consistent length. Since neither
Mosflm nor XDS can cope with these, I will not worry about handling
them in the regexp :o)

Sometimes there's a shortage of underscores (perhaps with lab sources)
e.g. image0001.img. I'm also aware of prefix.numbers e.g. prefix.0001
etc. What had clobbered my existing expressions was
prefix_1.8A_001.img etc., i.e. having additional number.something
expressions in there.
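
The earlier expressions are not shown in this message, so purely as an
illustration, assume a hypothetical forward pattern with a lazy prefix
(the name `naive` and the pattern itself are assumptions, not the
original code). A sketch of how such a pattern copes with the simple
names but splits prefix_1.8A_001.img at the resolution rather than at
the frame number:

```python
import re

# Hypothetical forward pattern (an assumption, not the original expression):
# lazy prefix, underscore, digits, dot, then everything else as extension.
naive = re.compile(r'(.*?)_([0-9]+)\.(.*)')

for name in ['foo_bar_001.img', 'prefix_1.8A_001.img']:
    prefix, digits, exten = naive.match(name).groups()
    print('%s -> prefix=%s digits=%s exten=%s' % (name, prefix, digits, exten))

# foo_bar_001.img -> prefix=foo_bar digits=001 exten=img
# prefix_1.8A_001.img -> prefix=prefix digits=1 exten=8A_001.img
```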

Solution:

Assume that the frame number is the *last* numerical value in the file
name, allowing for cases where the file name extension includes
numbers. Turns out working on the reversed file name makes things much
easier:

# N.B. these are reversed patterns...

patterns = [r'([0-9]{2,12})\.(.*)',
            r'(.*)\.([0-9]{2,12})_(.*)',
            r'(.*)\.([0-9]{2,12})(.*)']

joiners = ['.', '_', '']

# Python code follows

import re

compiled_patterns = [re.compile(pattern) for pattern in patterns]

def template_regex(filename):
    '''Try a bunch of templates to work out the most sensible. N.B. assumes that
    the image index will be the last digits found in the file name.'''

    # The patterns are written for the reversed file name.
    rfilename = filename[::-1]

    for j, cp in enumerate(compiled_patterns):
        match = cp.match(rfilename)
        if not match:
            continue
        groups = match.groups()

        if len(groups) == 3:
            exten = '.' + groups[0][::-1]
            digits = groups[1][::-1]
            prefix = groups[2][::-1] + joiners[j]
        else:
            exten = ''
            digits = groups[0][::-1]
            prefix = groups[1][::-1] + joiners[j]

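        template = prefix + '#' * len(digits) + exten
        return template, int(digits)

    # No pattern matched, e.g. fewer than two digits in the image number.
    raise ValueError('no image number found in %r' % filename)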

def work_template_regex():
    questions_answers = {
        'foo_bar_001.img':'foo_bar_###.img',
        'foo_bar001.img':'foo_bar###.img',
        'foo_bar_1.8A_001.img':'foo_bar_1.8A_###.img',
        'foo_bar.001':'foo_bar.###',
        'foo_bar_001.img1000':'foo_bar_###.img1000',
        'foo_bar_00001.img':'foo_bar_#####.img'
        }

    for filename in questions_answers:
        answer = template_regex(filename)
        assert answer[0] == questions_answers[filename]
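
One caveat that may be worth adding to the test cases: the first group
in the reversed patterns is a greedy (.*), so if the prefix contains
something like 12.8A (two or more digits before the dot), those digits
appear to win over the real frame number. For foo_12.8A_001.img the
patterns as posted seem to give foo_##.8A_001.img, 12. A minimal sketch
of one possible variant, assuming a lazy (.*?) first group is
acceptable; the names lazy_patterns and template_regex_lazy are
hypothetical and not part of the original code:

```python
import re

# Reversed patterns as in the message, but with a lazy first group (.*?)
# so the digits nearest the end of the file name are preferred.
# This variant is an assumption for illustration, not the original code.
lazy_patterns = [re.compile(p) for p in (
    r'([0-9]{2,12})\.(.*)',
    r'(.*?)\.([0-9]{2,12})_(.*)',
    r'(.*?)\.([0-9]{2,12})(.*)',
)]
joiners = ['.', '_', '']


def template_regex_lazy(filename):
    '''Return (template, image number) using the lazy reversed patterns.'''
    rfilename = filename[::-1]
    for joiner, cp in zip(joiners, lazy_patterns):
        match = cp.match(rfilename)
        if not match:
            continue
        # Un-reverse every captured group before assembling the template.
        groups = [g[::-1] for g in match.groups()]
        if len(groups) == 3:
            exten, digits, prefix = '.' + groups[0], groups[1], groups[2]
        else:
            exten, digits, prefix = '', groups[0], groups[1]
        return prefix + joiner + '#' * len(digits) + exten, int(digits)
    raise ValueError('no image number found in %r' % filename)


print(template_regex_lazy('foo_12.8A_001.img'))
```

The six names in work_template_regex should give the same templates
with this variant, since in each of them the first dot of the reversed
name is already followed by the frame number.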


On 30 April 2012 09:19, Graeme Winter <graeme.win...@gmail.com> wrote:
> Hi Folks,
>
> Following some bug reports I spent a few minutes over the weekend wrangling 
> with regular expressions to digest image file names - the dismantling of e.g. 
> foo_bar_001.img to foo_bar_###.img, 1 etc. I think now that the scheme I have 
> should work for everything, however what I could really do with is a proper 
> list of test cases.
>
> So (foolishly he asks) please could people email me *off list* with example 
> image names if they don't fall into the following structure:
>
> prefix_numbers.extension i.e. foo_bar_001.img
> prefix.numbers i.e. foo_bar.001
>
> I'll send back a digest of any responses I get. Ideally if you could indicate 
> where the images come from when you do this I'd be obliged.
>
> Best wishes,
>
> Graeme
