As I've mentioned before, I'm using s3cmd for Fedora Infrastructure to sync
~1TB and >1M files out to S3 in each region as mirrors for Fedora instances
running in EC2. Most of the feature enhancements I've written thus far
have been in support of this use case.
I am still having a couple of significant problems that I expect will require
some "better thinking" to resolve.
1) Hitting MemoryError when trying to sync this many files on 32-bit
Python. We keep dicts holding a lot of data about the local and remote
object lists, and at >900k objects we run out of address space. Yes,
running a 64-bit OS and Python with 12-16GB of RAM would resolve this,
but that seems like overkill to me.
2) It takes >24 hours just to do an "incremental sync" of the directories
that have changed locally, mostly because we're doing S3 directory listings
on the whole blessed tree first. As I've got a good sense of what files
may have changed within subtrees, I'd like to be able to cache the remote
directory listings and use them between runs, updating the cache when we
upload or delete content. That alone could save 20+ hours.
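To make that concrete, here's a rough sketch of the kind of persistent
listing cache I have in mind (not s3cmd code; ListingCache and
fetch_listing are just made-up names): remote listings get stored in a
local SQLite file keyed by prefix, reused on the next run, and invalidated
whenever we upload or delete under that prefix.

import json
import sqlite3
import time

class ListingCache(object):
    """Illustrative only: caches S3 directory listings between runs."""
    def __init__(self, path="s3cmd-listing-cache.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS listings "
                        "(prefix TEXT PRIMARY KEY, fetched REAL, objects TEXT)")

    def get(self, prefix, fetch_listing):
        # fetch_listing(prefix) would be whatever actually LISTs the bucket.
        row = self.db.execute("SELECT objects FROM listings WHERE prefix = ?",
                              (prefix,)).fetchone()
        if row is not None:
            return json.loads(row[0])      # cache hit: no S3 round trips at all
        objects = fetch_listing(prefix)    # cache miss: list this prefix once
        self.db.execute("INSERT OR REPLACE INTO listings VALUES (?, ?, ?)",
                        (prefix, time.time(), json.dumps(objects)))
        self.db.commit()
        return objects

    def invalidate(self, prefix):
        # Call this after any PUT or DELETE under the prefix so the next
        # run re-lists only the subtrees we actually touched.
        self.db.execute("DELETE FROM listings WHERE prefix = ?", (prefix,))
        self.db.commit()

Whether SQLite is the right backend is an open question; the point is just
that listings would survive between runs instead of being rebuilt from
scratch every time.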
I think both problems could be solved by shifting away from using in-memory
dicts for everything to an on-disk (or in-memory, where persistence isn't
needed for some use cases) key/value store with a dict-like interface.
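For example (again only a sketch, not a patch), the standard-library shelve
module already provides a dict-like object backed by a disk file, so code
that currently does remote_list[key] = attrs could keep the same interface
while the data lives outside the 32-bit address space and persists between
runs. build_remote_list below is a made-up name:

import shelve

def build_remote_list(objects, path="remote-list.shelve"):
    """objects: iterable of (key, attrs) pairs from an S3 listing."""
    store = shelve.open(path)   # behaves like a dict, but backed by a DB file
    for key, attrs in objects:
        store[key] = attrs      # e.g. {'size': ..., 'md5': ..., 'mtime': ...}
    return store

# Lookups work exactly as with the current in-memory dicts:
# remote = build_remote_list(listing)
# if 'pub/fedora/linux/releases/foo.rpm' in remote:
#     print(remote['pub/fedora/linux/releases/foo.rpm']['md5'])
# remote.close()

Something like gdbm or SQLite would work just as well here; the key
property is that nothing forces the whole object list to fit in RAM at
once.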