I am fed up with the current consumer cloud backup packages, having been
through Jungledisk, Wuala and recently Spideroak. I don't regard them as
reliable: they are often slow, sometimes buggy, and inspire no confidence.
So I am considering rolling my own by uploading to S3. I plan to use either
s3tools or the AWS CLI for uploads, downloads and listings, but not the
sync facility. I have some questions, but first let me outline my ideas.
I plan to upload each file under a name equal to its MD5 hash only, with no
filename or path. I will write some code to build an index of files and
their MD5s, plus a few status items. There will be a few programs to do the
uploading, downloading and so on, each maintaining the index. The index
will be a simple CSV file, one line per file, so it is easily edited in a
spreadsheet. To find a file or directory, I will use the spreadsheet's
searching and sorting facilities. I've used spreadsheets a fair bit and
might even write the odd macro, although probably not. Each update of the
index will be date-stamped and stored in S3, as will a separate copy of the
backup programs themselves (with proper filenames, not MD5
pseudo-filenames). So the S3 console will show three "folders": files,
index, software.
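To make this concrete, here is a minimal sketch of the upload pass as I
currently imagine it. The bucket name, file locations and CSV layout
(md5,path,size,date,status) are placeholders I've made up for illustration,
and it assumes the folder and exclusion list files already exist:

    #!/bin/bash
    # Sketch only: names and CSV layout are placeholders.
    BUCKET="s3://my-backup-bucket"
    INDEX="$HOME/backup/index.csv"      # columns: md5,path,size,date,status
    FOLDERS="$HOME/backup/folders.txt"  # one directory to back up per line
    EXCLUDE="$HOME/backup/exclude.txt"  # one exclusion regex per line
    touch "$INDEX"

    while read -r dir; do
        find "$dir" -type f | grep -vEf "$EXCLUDE" | while read -r f; do
            md5=$(md5sum "$f" | cut -d' ' -f1)
            # Upload only if this content is not already in the backup,
            # so renames, moves and duplicate copies cost nothing extra.
            if ! grep -q "^$md5," "$INDEX"; then
                s3cmd put "$f" "$BUCKET/files/$md5"
            fi
            # One index line per file; duplicates share the same S3 object.
            # (A real version would skip paths already indexed and would
            # escape commas in filenames.)
            echo "$md5,$f,$(stat -c%s "$f"),$(date -I),backed-up" >> "$INDEX"
        done
    done < "$FOLDERS"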
The objectives of this are:
1. One-click backup refresh from a link on the desktop.
2. Use simple text files containing a list of the folders to back up and
some regex file/folder exclusions (as in the sketch above).
3. Keep historic versions by not deleting files from the backup (but
versioning won't need to be turned on in S3).
4. Keep one copy of each file even if there are multiple copies in my
system.
5. Don't re-upload when filenames or locations are changed (I often find
myself moving and reorganising files).
6. Optionally elect not to back up certain files or folders (or to delete
files already backed up) by altering a status flag in the relevant index
entry.
7. Ability to run a verification of the backup by downloading each file
and validating its MD5, perhaps as a low-priority background job; I
calculate that validating around 200GB should cost around $18. Once a file
is validated, the verifier will set a status flag in the index for all
matching MD5s (see the sketch after this list).
8. Avoid dependence on external tools as much as possible. If I use
s3tools or some other service to upload files, then hopefully the changes I
would need to make if that software changes will be minimal.
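For objective 7, the arithmetic seems to hold: at roughly $0.09/GB for data
transfer out of S3, 200GB comes to about $18. The verification pass itself
could be something like this sketch (same made-up bucket and index names as
above):

    # Verification sketch: download each unique MD5, recompute, compare,
    # then flag every matching index row as verified.
    BUCKET="s3://my-backup-bucket"
    INDEX="$HOME/backup/index.csv"
    TMP=$(mktemp)

    cut -d, -f1 "$INDEX" | sort -u | while read -r md5; do
        s3cmd get --force "$BUCKET/files/$md5" "$TMP"
        if [ "$(md5sum "$TMP" | cut -d' ' -f1)" = "$md5" ]; then
            # Set the status field (5) on all index rows with this MD5.
            awk -F, -v OFS=, -v m="$md5" '$1 == m { $5 = "verified" } 1' \
                "$INDEX" > "$INDEX.tmp" && mv "$INDEX.tmp" "$INDEX"
        else
            echo "MISMATCH: $md5" >&2
        fi
    done
    rm -f "$TMP"

Run under nice/ionice, it could tick along as the low-priority background
job I have in mind.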
My questions are:
1. Where does Amazon get its MD5 from? Is it calculated locally on my PC
and sent in some headers? If Amazon calculates it at their end from the
file it has on its servers, then the verification is OK, but otherwise how
do I know their copy of the file is valid?
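For instance, s3cmd reports an "MD5 sum" for each object (which I believe
is really the S3 ETag), and I can compare it with a locally computed value,
but that only helps if the reported value is computed server-side. Something
like (made-up object name):

    md5sum myfile
    s3cmd info s3://my-backup-bucket/files/<md5>   # shows an "MD5 sum" line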
2. How easy is it to find out how to use Amazon's AWS CLI in Linux? I
have tried out s3cmd and it seems easy to use, but at first glance the AWS
CLI looks pretty complex.
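For comparison, the basic operations I need look similar on the surface in
both tools (bucket and key names made up):

    # s3cmd
    s3cmd put myfile s3://my-backup-bucket/files/<md5>
    s3cmd get s3://my-backup-bucket/files/<md5> myfile
    s3cmd ls s3://my-backup-bucket/files/

    # AWS CLI
    aws s3 cp myfile s3://my-backup-bucket/files/<md5>
    aws s3 cp s3://my-backup-bucket/files/<md5> myfile
    aws s3 ls s3://my-backup-bucket/files/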
3. I plan to use Bash and a little sed/awk in Linux; I've already written
some trial code to create and manipulate the index. I don't particularly
like Bash as such, but it does the job. Alternatively, I could perhaps use
this project to learn another language such as Python, but I'm not
particularly keen to do that unless it confers particular advantages. Any
opinions would be welcome (leaning perhaps towards a C-like language if
possible).
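To give a flavour of the index manipulation I mean, setting the status flag
for objective 6 is a one-liner in awk (column layout and path here are
illustrative):

    # Mark a given path as excluded; field 2 is the path, field 5 the status.
    awk -F, -v OFS=, -v p="/home/russell/notes/scratch.txt" \
        '$2 == p { $5 = "exclude" } 1' index.csv > index.tmp && \
        mv index.tmp index.csv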
Any other observations would be welcome (including whether I'm sane).
Russell