Larry Bates wrote:
> Steve Holden wrote:
>> Larry Bates wrote:
>>> Steve Holden wrote:
>>>> Larry Bates wrote:
>>>>> I have a project that I wanted to solicit some advice
>>>>> on from this group. I have millions of pages of scanned
>>>>> documents with each page in an individual .JPG file.
>>>>> When the documents were scanned the people that did
>>>>> the scanning put a colored (hot pink) separator page
>>>>> between the individual documents. I was wondering if
>>>>> there was any way to utilize PIL to scan through the
>>>>> individual files, look at some small section on the
>>>>> page, and determine if it is a separator page by
>>>>> somehow comparing the color to the separator page
>>>>> color? I realize that this would be some sort of
>>>>> percentage match where 100% would be a perfect match
>>>>> and any number lower would indicate that it was less
>>>>> likely that it was a coverpage.
>>>>>
>>>>> Thanks in advance for any thoughts or advice.
>>>>>
>>>> I suspect the easiest way would be to select a few small patches of each
>>>> image and average the color values of the pixels, then normalize to hue
>>>> rather than RGB.
>>>>
>>>> Close enough to the hue you want (and you could include saturation and
>>>> intensity too, if you felt like it) across several areas of the page
>>>> would be a hit for a separator.
>>>>
>>>> regards
>>>>  Steve
>>> Steve,
>>>
>>> I'm completely lost on how to proceed. I don't know how to average color
>>> values, normalize to hue... Any guidance you could give would be greatly
>>> appreciated.
>>>
>>> Thanks in advance,
>>> Larry
>> I'd like to help but I don't have any sample code to hand. Maybe someone
>> who does could give you more of a clue. Let's hope so, anyway ...
>>
>> regards
>>  Steve
>
> I think I've come up with something that will work. I use PIL
> Image.getcolors() to get colors and take the top 10 colors of my
> background page. I then calculate the average of the R, G, B
> components. That becomes my reference. Then I read a page and
> make the same calculation. I then calculate the absolute value
> of the difference of R, G, B of the two values. Summing those
> together gives something like the average difference between
> the two average colors (at least that is what I think it does).
> This seems to give me small numbers when the pages are the same
> and large numbers when they are different. It isn't super fast
> but it is working.
>
> Thanks for pushing me in the right direction.
>
> -Larry
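
For anyone following along, here is a minimal sketch of the calculation
described above. It is not Larry's actual code: the function names are
made up for illustration, the average over the top colors is assumed to
be unweighted, and hue_degrees() is only one possible reading of the
earlier "normalize to hue" suggestion.

```python
import colorsys


def mean_of_top_colors(colors, n=10):
    """Average the n most frequent colors.

    `colors` is a list of (count, (r, g, b)) tuples, the shape that
    PIL's Image.getcolors() returns for an RGB image.
    """
    top = sorted(colors, key=lambda c: c[0], reverse=True)[:n]
    if not top:
        raise ValueError("no colors to average")
    # Unweighted mean of each channel (an assumption; weighting by
    # pixel count would be a reasonable alternative).
    return tuple(sum(rgb[i] for _, rgb in top) / len(top) for i in range(3))


def page_color(img, n=10):
    """Average of the top-n colors of a PIL image (hypothetical helper)."""
    rgb = img.convert("RGB")
    # maxcolors must be at least the number of distinct colors,
    # otherwise getcolors() returns None.
    colors = rgb.getcolors(maxcolors=rgb.width * rgb.height)
    return mean_of_top_colors(colors, n)


def color_difference(a, b):
    """Sum of absolute per-channel differences (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hue_degrees(rgb):
    """Hue of a 0-255 RGB color in degrees, for a hue-based comparison."""
    r, g, b = (v / 255.0 for v in rgb)
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0
```

To use it, open the known separator page with PIL's Image.open(), pass
the image to page_color() to get the reference, do the same for each
scanned page, and treat a small color_difference() as a hit. The cutoff
between "small" and "large" is not given in the thread and would have to
be found by trying it on real pages.
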
Well done! Thanks for letting me know that the basic approach was correct.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden