On Wed, Jan 12, 2011 at 4:37 AM, David Bovill <da...@vaudevillecourt.tv> wrote:
> If it quacks like a duck it is a duck.
>
> So I have some data in a variable that I want to display. I can use "is an
> array/number/date" - but for other types of data I'm wondering... xml should
> be easy, but harder would be to distinguish long text files from binary.
> Any ideas for hacks to distinguish:
>
>    1. images
>    2. sounds
>    3. video
>    4. binary blob
>    5. text
>    6. rtftext
>    7. utf8
>

This is a pretty well solved problem (except for the "array" part, which is
an LC-specific data type/format). Wish I had some references for you at the
moment, but here are some things to keep in mind:

- First, use your OS when possible. Images, sounds, video, and often text
  are already identified for you via the registry on Windows or the 4-byte
  type code on Mac (e.g. 'TEXT').

- Next, determine text vs. binary. This is usually done by grabbing the
  first N bytes (where N is ~1000) and looking for any that are < 10 or
  > 127 (tab is 9, so you may want to let that one through). If you find
  any, it's binary - or unicode.

- For binary data, the next question is image vs. video vs. unicode. Image
  and video are pretty simple. You don't need to understand every form of
  image or video, just a handful that will cover 99% of all images/videos
  out there. And they all - very politely - have a nice header you can
  examine. For example, looking at PNG:

  http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header

  From there, you can see that the first 4 bytes of a PNG file are 0x89,
  0x50, 0x4E, and 0x47 (where 50, 4E, and 47 are the ASCII letters 'PNG').
  Almost every single image and video format you'll care about will have
  something very similar you can use. This is a great site you can
  reference:

  http://www.wotsit.org/

  If you don't find a header that you understand, then you are looking at
  either a straight binary lump/blob or a multi-byte text file (unicode).
  Remember that while UTF8 is not ASCII, it's designed to be
  indistinguishable from ASCII whenever the text only uses ASCII
  characters, which is most of the time.
  I don't have any advice to give you here on how to determine if the file
  is unicode text or not... as I understand it this is really a difficult
  problem to solve. I'm sure Google can help, though. ;-)

- At this point you've determined that the file is "text" in nature and you
  are trying to figure out specifically whether it's RTF, XML, INI,
  whatever. This gets a little more tricky, as people often skip the
  optional headers that could be there (e.g. <?xml ...?>, <!DOCTYPE ...>,
  ...) and you are left with either taking your best guess or going off the
  file extension.

- RTF does have a header of sorts: a well-formed RTF file begins with
  "{\rtf" followed by a version number (normally 1). If you might only have
  a fragment, scan it and look for "{\" followed by some known RTF "tags".

- XML/HTML/*ML is a matter of scanning for some known tags (like <BODY>,
  <HTML>) that you know should be there near the top or - in the case of
  XML - checking for namespaces in the tag names.

Hope this helps!

Jeff M.
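
The steps above can be sketched roughly as follows. This is Python rather
than LiveCode, and it is only a minimal sketch under stated assumptions, not
a definitive implementation: the function names (sniff, sniff_text,
sniff_binary), the category labels, the 1000-byte default sample size, and
the very short signature table are all made up for illustration. It lets
tab, LF, form feed, and CR through as text, treats data that decodes
strictly as UTF-8 as "utf8", and only knows a handful of headers.

```python
# Illustrative sketch only: names, labels, and thresholds are assumptions.

TEXT_CONTROL = {9, 10, 12, 13}  # tab, LF, form feed, CR

# (signature, label) pairs for a handful of common formats.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image"),  # PNG
    (b"\xff\xd8\xff", "image"),       # JPEG
    (b"GIF87a", "image"),             # GIF
    (b"GIF89a", "image"),             # GIF
    (b"ID3", "sound"),                # MP3 that starts with an ID3v2 tag
]

# RIFF containers carry a 4-byte form type at offset 8.
RIFF_FORMS = {b"WAVE": "sound", b"AVI ": "video", b"WEBP": "image"}


def sniff_binary(data: bytes) -> str:
    """Look for a known header; fall back to a plain binary blob."""
    for signature, label in MAGIC:
        if data.startswith(signature):
            return label
    if data[:4] == b"RIFF":
        return RIFF_FORMS.get(data[8:12], "binary")
    if data[4:8] == b"ftyp":
        # MP4/MOV family; such files can also be audio-only.
        return "video"
    return "binary"


def sniff_text(sample: bytes) -> str:
    """Guess the flavour of data already believed to be ASCII text."""
    head = sample.lstrip()
    if head.startswith(b"{\\rtf"):
        return "rtf"
    lower = head.lower()
    if lower.startswith(b"<?xml"):
        return "xml"
    if lower.startswith(b"<!doctype html") or b"<html" in lower or b"<body" in lower:
        return "html"
    return "text"


def sniff(data: bytes, n: int = 1000) -> str:
    """Classify data by inspecting the first n bytes."""
    sample = data[:n]
    if not sample:
        return "empty"
    # Control bytes other than ordinary whitespace suggest binary data.
    if any(b < 32 and b not in TEXT_CONTROL for b in sample):
        return sniff_binary(data)
    # Nothing above 127: plain ASCII, so look for a text subtype.
    if all(b < 128 for b in sample):
        return sniff_text(sample)
    # High bytes but no control bytes: UTF-8 if it decodes, else binary.
    try:
        data.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return sniff_binary(data)
```

Calling sniff() on the raw bytes of the variable returns one of the labels
above. Text in other multi-byte or 8-bit encodings (UTF-16, Latin-1, ...)
falls through to "binary" here, which is exactly the hard part mentioned
above, and XML without its optional <?xml ...?> header is reported as plain
"text".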