This is an apology for forgetting to use a title for my previous post,
which was about my plan to develop a backup to S3 where the file path/name
will be replaced by just the 32-character MD5.

In case you deleted the post thinking it was spam, I attach the body of the
post in a text file. All comments would be most welcome, preferably as
replies to the previous post if you still have it.

Russell
I am fed up with the current consumer cloud backup packages, having been
through Jungledisk, Wuala and, most recently, SpiderOak. I don't regard them
as reliable: they are often slow, sometimes buggy, and inspire no confidence.
So I am considering rolling my own by uploading to S3. I plan to use either
s3tools or the Amazon AWS CLI for upload, download and listing, but not the
sync facility. I have some questions, but first let me outline my ideas.

I plan to upload each file under a name equal to its MD5 only, with no
filename or path. I will write some code to build an index of files and their
MD5s plus a few status items. There will be a few programs to do the
uploading, downloading, etc., maintaining the index. The index will be a
simple CSV file, one line per file, so it is easily edited with a
spreadsheet. To find a file or directory, I will use the spreadsheet's
searching and sorting facilities. I've used spreadsheets a fair bit and might
even write the odd macro, although probably not. Each update of the index
will be date-stamped and stored in S3, along with a separate set of the
backup programs (with proper filenames, not MD5 pseudo-filenames). So the S3
console will show three "folders": files, index, software.
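A minimal sketch of how one index entry might be generated; the column order (md5,size,mtime,status,path) and the status flag values are my own assumptions, not a fixed format:

```shell
#!/bin/bash
# Append one index entry per file: md5,size,mtime,status,path
# (column order and the "P" = pending status flag are illustrative)
index_file="backup-index.csv"

add_to_index() {
    local path="$1"
    local md5 size mtime
    md5=$(md5sum "$path" | awk '{print $1}')   # 32-char hex MD5
    size=$(stat -c %s "$path")                 # size in bytes
    mtime=$(stat -c %Y "$path")                # last-modified epoch time
    printf '%s,%s,%s,P,"%s"\n' "$md5" "$size" "$mtime" "$path" >> "$index_file"
}

# Example: index every regular file under a directory
# find ~/docs -type f -print0 | while IFS= read -r -d '' f; do add_to_index "$f"; done
```

Keeping the MD5 as the first column makes the later dedup and verification lookups a simple `grep "^$md5,"`.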

The objectives of this are:
1. One click backup refresh from a link on the desktop.
2. Use simple text files containing a list of the folders to back up and some 
regex file/folder exclusions.
3. Keep historic versions by not deleting files from the backup (but versioning 
won't need to be turned on in s3).
4. Keep one copy of each file even if there are multiple copies in my system.
5. Don't reupload when filenames or locations are changed (I often find myself 
moving and reorganising files).
6. Optionally elect not to back up certain files or folders (or delete files 
already backed up) by altering a status flag in the relevant index entry.
7. Ability to run a verification of the backup by downloading each file and 
validating its MD5, perhaps as a low-priority background job. I calculate 
that validating around 200GB should cost around $18. Once a file is 
validated, the job will set a status flag in the index for all matching MD5s.
8. Avoid dependence on external tools as much as possible. If I use s3tools 
or some other tool to upload files, the changes needed if that software 
changes should, hopefully, be minimal.
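Objectives 4 and 5 fall out naturally from MD5-keyed names: before uploading, check whether the MD5 is already in the index. A sketch under the same assumed index layout, with the actual transfer stubbed out (a real run would replace `upload_object` with something like `s3cmd put "$1" s3://my-bucket/files/$2`, where the bucket name is hypothetical):

```shell
#!/bin/bash
# Upload a file only if its content (by MD5) is not already backed up.
index_file="backup-index.csv"

upload_object() {
    # stub standing in for the real s3cmd / aws cli call
    echo "would upload $1 as $2"
}

backup_file() {
    local path="$1"
    local md5 size mtime
    md5=$(md5sum "$path" | awk '{print $1}')
    if grep -q "^$md5," "$index_file" 2>/dev/null; then
        echo "skip $path (content already backed up)"
    else
        upload_object "$path" "$md5"
        size=$(stat -c %s "$path")
        mtime=$(stat -c %Y "$path")
        printf '%s,%s,%s,U,"%s"\n' "$md5" "$size" "$mtime" "$path" >> "$index_file"
    fi
}
```

Because the lookup is purely on the MD5, renaming or moving a file (objective 5) and duplicate copies (objective 4) both hit the "skip" branch with no re-upload.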

My questions are:
1. Where does Amazon get its MD5 from? Is it calculated locally on my PC and 
sent in some headers? If Amazon calculates it at their end from the file it 
has on its servers, then the verification is OK; but otherwise, how do I know 
their copy of the file is valid?
2. How easy is it to find out how to use Amazon's AWS CLI in Linux? I have 
tried out s3cmd and it seems easy to use, but at first glance the AWS CLI looks 
pretty complex.
3. I plan to use Bash and a little sed / awk in Linux. I've already written 
some code to create and manipulate this index as a trial. I don't 
particularly like Bash as such, but it does the job. Alternatively, I could 
use this project to learn some other language such as Python, but I'm not 
keen to do this unless it confers particular advantages. Any opinions would 
be welcome (leaning perhaps to a C-like language if possible).
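On the verification in objective 7, the check itself is simple because the object key is the expected MD5: download the object, recompute the MD5 locally, and compare against the key. A sketch with the download stubbed out (a real run would use something like `s3cmd get` or `aws s3 cp`; the flag convention is again an assumption):

```shell
#!/bin/bash
# Verify one backed-up object: fetch it, recompute the MD5 locally,
# and compare against the object key (which *is* the expected MD5).

fetch_object() {
    # stub: "download" from a local directory instead of S3
    cp "objects/$1" "$2"
}

verify_object() {
    local expected_md5="$1" tmp actual_md5
    tmp=$(mktemp)
    fetch_object "$expected_md5" "$tmp"
    actual_md5=$(md5sum "$tmp" | awk '{print $1}')
    rm -f "$tmp"
    if [ "$actual_md5" = "$expected_md5" ]; then
        echo "OK $expected_md5"       # a real run would set a verified flag in the index
    else
        echo "CORRUPT $expected_md5"
    fi
}
```

One object key verifies every index entry sharing that MD5, which is what lets a single download set the status flag for all matching index lines.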

Any other observations would be welcome (including whether I'm sane).
_______________________________________________
S3tools-general mailing list
S3tools-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/s3tools-general