This is an apology for forgetting to give my previous post a title. The post
was about my plan to develop a backup to S3 where the file path/name will be
replaced by just the 32-character MD5.
In case you deleted the post thinking it was spam I attach the body of the
post in a text file. All comments would be most welcome, preferably as
replies to the previous post if you still have it.
Russell
I am fed up with the current consumer cloud backup packages, having been
through Jungledisk, Wuala and recently SpiderOak. I don't regard them as
reliable; they are often slow, sometimes buggy, and give me no confidence. So I
am considering rolling my own by uploading to S3. I plan to use either s3tools or
considering rolling my own by uploading to S3. I plan to use either s3tools or
Amazon AWS CLI for upload, download and lists but not the sync facility. I have
some questions, but first let me outline my ideas.
I plan to upload files with name equal to the MD5 only, no filename or path. I
will write some code to make an index of files and their MD5s plus a few status
items. There will be a few programs to do the uploading, downloading, etc.
maintaining the index. The index will be in the form of a simple CSV file, one
line per file, so easily edited with a spreadsheet. To find a file or
directory, I will use the searching / sorting facilities of the spreadsheet.
I've used spreadsheets a fair bit and might even write the odd macro, although
probably not. Each update of the index will be date-stamped and stored in S3,
together with a separate set of the backup programs (with proper filenames, not
MD5 pseudo-filenames). So the S3 console will show three "folders": files,
index, software.
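To make the scheme concrete, here is a minimal sketch of the upload-and-index
step. The bucket name, the index column layout (md5, path, date, status) and
the function name are all my assumptions, not settled design; S3CMD can be
pointed at "echo" for a dry run.

```shell
# Sketch: record one file in the index and upload it under its MD5 name.
# BUCKET and the index columns (md5,path,date,status) are assumptions.
S3CMD="${S3CMD:-s3cmd}"
BUCKET="s3://mybackup"
INDEX="${INDEX:-index.csv}"

backup_one() {
    file="$1"
    md5=$(md5sum "$file" | awk '{print $1}')
    # Upload only if this MD5 is not already indexed (objective 4:
    # one stored copy per unique content).
    if ! grep -q "^$md5," "$INDEX" 2>/dev/null; then
        "$S3CMD" put "$file" "$BUCKET/files/$md5"
    fi
    # One CSV line per file, easily edited in a spreadsheet.
    printf '%s,"%s",%s,ok\n' "$md5" "$file" "$(date +%Y%m%d%H%M%S)" >> "$INDEX"
}
```

A second file with identical content would get its own index line but no second
upload, which is what makes renames and moves free.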
The objectives of this are:
1. One click backup refresh from a link on the desktop.
2. Use simple text files containing a list of the folders to back up and some
regex file/folder exclusions.
3. Keep historic versions by not deleting files from the backup (but versioning
won't need to be turned on in S3).
4. Keep one copy of each file even if there are multiple copies in my system.
5. Don't reupload when filenames or locations are changed (I often find myself
moving and reorganising files).
6. Optionally elect not to back up certain files or folders (or delete files
already backed up) by altering a status flag in the relevant index entry.
7. Ability to run a verification of the backup by downloading each file and
validating its MD5, perhaps as a low-priority background job. I calculate that
validating around 200 GB should cost around $18. Once validated, a status flag
will be set in the index for all matching MD5s.
8. Avoid dependence on external tools as much as possible. If I use s3tools or
some other service to upload files, the changes needed if that software changes
should hopefully be minimal.
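For objective 2, the control files could be as simple as one directory per line
and one regex per line. A sketch, where the names folders.txt and exclude.txt
are my placeholders:

```shell
# Sketch for objective 2: enumerate candidate files from a folder list,
# dropping anything that matches an exclusion regex.
# folders.txt: one directory per line; exclude.txt: one regex per line.
list_candidates() {
    while IFS= read -r dir; do
        find "$dir" -type f
    done < folders.txt | grep -v -E -f exclude.txt
}
```

The one-click desktop link would then just pipe this list into the upload step.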
My questions are:
1. Where does Amazon get its MD5 from? Is it calculated locally on my PC and
sent in some headers? If Amazon calculates it at their end from the file it has
on its servers, then the verification is OK; but otherwise, how do I know their
copy of the file is valid?
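(For what it's worth, my understanding is that for single-part uploads the ETag
S3 reports is normally the object's MD5, though not for multipart uploads, so a
cheap first check without downloading might be to compare the local MD5 against
it. A sketch; the exact "MD5 sum" label in s3cmd's `info` output is an
assumption about its format:)

```shell
# Sketch: compare a local file's MD5 against the MD5/ETag S3 reports.
# Valid for single-part uploads only; multipart ETags are not plain MD5s.
check_etag() {
    file="$1"; key="$2"
    local_md5=$(md5sum "$file" | awk '{print $1}')
    remote_md5=$(s3cmd info "$key" | awk '/MD5 sum:/ {print $3}')
    [ "$local_md5" = "$remote_md5" ] && echo "match" || echo "MISMATCH"
}
```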
2. How easy is it to find out how to use Amazon's AWS CLI in Linux? I have
tried out s3cmd and it seems easy to use, but at first glance the AWS CLI looks
pretty complex.
3. I plan to use Bash and a little sed/awk in Linux. I've already written some
code to create and manipulate this index as a trial. I don't particularly like
Bash as such, but it does the job. Alternatively, I could perhaps use this project
to learn some other language such as Python, but I'm not particularly keen to
do this unless it confers particular advantages. Any opinions would be welcome
(leaning perhaps to a C-like language if possible).
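(As a sample of the sed/awk-level code I have in mind: the lookup a spreadsheet
search would do can be one awk call. The column layout md5,path,status is only
an illustration:)

```shell
# Sketch: find the stored MD5 (column 1) for a given path (column 2)
# in a CSV index whose path column is double-quoted.
lookup_md5() {
    awk -F, -v p="\"$1\"" '$2 == p {print $1}' index.csv
}
```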
Any other observations would be welcome (including whether I'm sane).
_______________________________________________
S3tools-general mailing list
S3tools-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/s3tools-general