Re: [sage-devel] Re: profiling Sage startup

Robert Bradshaw Tue, 01 Mar 2011 10:48:24 -0800

On Tue, Mar 1, 2011 at 4:32 AM, Johan S. R. Nielsen
<j.s.r.niel...@mat.dtu.dk> wrote:
> On Mar 1, 10:13 am, Robert Bradshaw <rober...@math.washington.edu>
> wrote:
>> On Tue, Mar 1, 2011 at 12:48 AM, Johan S. R. Nielsen
>> <j.s.r.niel...@mat.dtu.dk> wrote:
>> > On Feb 23, 11:03 pm, Jason Grout <jason-s...@creativetrax.com> wrote:
>> >> On 2/23/11 3:56 PM, Robert Bradshaw wrote:
>>
>> >> > On Wed, Feb 23, 2011 at 1:47 PM, Jason Grout
>> >> > <jason-s...@creativetrax.com>  wrote:
>> >> >> On 2/23/11 3:06 PM, Robert Bradshaw wrote:
>>
>> >> >>> On Wed, Feb 23, 2011 at 11:34 AM, William Stein
>> >> >>> <wst...@gmail.com> wrote:
>>
>> >> >>>> On Wed, Feb 23, 2011 at 10:57 AM, Jason Grout
>> >> >>>> <jason-s...@creativetrax.com>    wrote:
>>
>> >> >>>>> On 2/23/11 12:28 PM, William Stein wrote:
>>
>> >> >>>>>> At lunch yesterday Robert Bradshaw made the interesting suggestion
>> >> >>>>>> to read the docs for importlib
>> >> >>>>>> (http://docs.python.org/dev/library/importlib.html) and write a
>> >> >>>>>> customized import hook, so that every time a module is imported
>> >> >>>>>> during Sage startup, the import is done from a single big
>> >> >>>>>> in-memory zip file instead of through the filesystem.  If this can
>> >> >>>>>> be made to work, it would be a huge win for slow filesystems.  The
>> >> >>>>>> basic problem is that some filesystems are fast but have huge
>> >> >>>>>> *latency*.
>>
>> >> >>>>> Is it a big win primarily because the zip file contents can be read
>> >> >>>>> in and cached by us?  I'm just trying to understand it better.
>>
>> >> >>>> Which would you rather do on a high latency filesystem:
>>
>> >> >>>>   (1) Read/stat 20,000 little files, or
>> >> >>>>   (2) Read exactly one 40MB file.
>>
>> >> >>>>>   Is this the same idea as Jar files in java?
>>
>> >> >>>> I don't know.
>>
>> >> >>> Yep. In that case the "high latency file system" was a webserver.
>>
>> >> >>>>> You mean like http://docs.python.org/library/zipimport.html?
>>
>> >> >>>> Cool.
>>
>> >> >>> Note that this should just involve putting the zip file first in the
>> >> >>> python path.
>>
>> >> >>>> I don't know for a fact that Robert Bradshaw's suggestion will be a
>> >> >>>> big win, since nobody has tried this yet.  But I'm optimistic.  The
>> >> >>>> idea would be to make a zip archive of
>> >> >>>> $SAGE_ROOT/local/lib/python/site-packages (say), and do *all* imports
>> >> >>>> using that massive zip archive.
>>
>> >> >>> I'm optimistic too. This would, of course, make more sense for
>> >> >>> system-wide installs than development versions, but the former are
>> >> >>> more likely to be on a non-local filesystem anyways.
>>
>> >> >> Sounds like it is time for a trial!
>>
>> >> >> I created a directory of 2000 .py files and an __init__.py file to
>> >> >> make it a module:
>>
>> >> >> for i in range(2000):
>> >> >>     with open('importtest/test_%s.py'%i,'w') as f:
>> >> >>         f.write("VALUE=%s\n"%i)
>> >> >> with open('importtest/__init__.py','w') as f:
>> >> >>     f.write(' ')
>>
>> >> >> Then I imported each of these so that .pyc files were created.
>>
>> >> >> for i in range(2000):
>> >> >>     exec 'import importtest.test_%s'%i
>>
>> >> >> Okay, then I copied the directory and zipped it up (in the shell now):
>>
>> >> >> $ cp -r importtest zipimporttest
>> >> >> $ zip -r tmp.zip zipimporttest
>> >> >> $ rm -rf zipimporttest
>>
>> >> >> One nice side effect is that the zip file is less than 1 MB, while
>> >> >> the directory of Python files is around 16 MB.
>>
>> >> >> Now for the test.  Here are my two scripts.  One imports each module
>> >> >> in the directory and adds up the VALUE in each module:
>>
>> >> >> % cat mytest.py
>> >> >> s=0
>> >> >> for i in range(2000):
>> >> >>     exec 'import importtest.test_%s as tt'%i
>> >> >>     s+=tt.VALUE
>> >> >> print s
>>
>> >> >> The other first adds the zip to the front of sys.path and then does
>> >> >> the same imports and summing, but using the zipped module:
>>
>> >> >> % cat mytestzip.py
>> >> >> import sys
>> >> >> sys.path.insert(0,'./tmp.zip')
>> >> >> s=0
>> >> >> for i in range(2000):
>> >> >>     exec 'import zipimporttest.test_%s as tt'%i
>> >> >>     s+=tt.VALUE
>> >> >> print s
>>
>> >> >> And now for the timings:
>>
>> >> >> % time sage -python mytest.py
>> >> >> Detected SAGE64 flag
>> >> >> Building Sage on OS X in 64-bit mode
>> >> >> 1999000
>> >> >> sage -python mytest.py  0.26s user 1.47s system 75% cpu 2.282 total
>>
>> >> >> % time sage -python mytestzip.py
>> >> >> Detected SAGE64 flag
>> >> >> Building Sage on OS X in 64-bit mode
>> >> >> 1999000
>> >> >> sage -python mytestzip.py  0.21s user 0.11s system 99% cpu 0.327 total
>>
>> >> >> It looks like the zip is a clear winner in this case.  And this is
>> >> >> with the directory presumably in the FS cache.
>>
>> >> > Cool. Given the CPU was pegged at 99%, have you tried using an
>> >> > uncompressed zip file? It'd have more data to read, but less to do
>> >> > with it once it's read.
>>
>> >> In my case, using zip -0 (no compression) gives:
>>
>> >> % time sage -python mytestzip.py
>> >> Detected SAGE64 flag
>> >> Building Sage on OS X in 64-bit mode
>> >> 1999000
>> >> sage -python mytestzip.py  0.20s user 0.10s system 99% cpu 0.309 total
>>
>> >> So just a slight savings.
>>
>> >> Jason
>>
>> > I had an orthogonal thought, though I'm not sure it's completely
>> > possible. Instead of actually loading the real functions/classes etc.,
>> > couldn't we fast-load (or generate) stub versions of all these, which
>> > when called would load and replace themselves with the real version
>> > and then run it? I'm not completely sure it's possible with Python,
>> > but Python is pretty flexible so perhaps there is a way; in
>> > particular, I don't know how Python supports reflection for adding new
>> > functions to the namespace dynamically. Also, the doc-strings and
>> > search*-functions would somehow have to be taken into account.
>> > If it's possible, as far as I can see, the user would not notice this
>> > (except for a minute overhead the first time a function was called),
>> > and only the very small fraction of used modules would be loaded each
>> > session. Furthermore, because the stub-functions were in the
>> > namespace, tab-completion would still work.
>> > The stub-versions could either come from auto-generated python-files
>> > from when compiling Sage and loaded by the usual module-loader, or
>> > perhaps by some Python-function which used a compile-time-generated
>> > listing of all functions/classes etc. to create these wrapper-
>> > functions at run-time and add them to the namespace.
>>
>> See lazy-import. Doing this for everything may incur significant
>> delays the first time a function is called (rather than before the
>> prompt) and there are issues with Sage being fragile about the order
>> in which some modules are imported, but yes, it's possible and
>> largely implemented.
>>
>> - Robert
>
> Nice! I wasn't aware of this module. When you get a good idea,
> there's a good chance that someone else thought of it before ;-) I
> like the fact that one can dynamically hack into an object's
> namespace :-D However, lazy_import seems not to be used much (only 2-3
> places) currently in the Sage startup (or did I grep wrongly?). Was it
> never the intention or is it due to the overhead?


No, it's because it barely got into Sage (well, the new and improved
version at least).

> Also, I didn't proofread the entire lazy_import code, but the
> implementation seems to differ from my idea in a significant way: the
> LazyImport object keeps wrapping the imported objects. My thought was
> that the first time the imported object was accessed, it would replace
> itself with the original in the global namespace, and then forward the
> call. This way, there will be zero overhead in all later calls.

It does if the original namespace is available. See the _get_object method.
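
To illustrate the self-replacing behaviour, here is a minimal sketch in
plain Python. It is not the lazy_import code from Sage: the LazyProxy
class, its constructor arguments, and the use of math.sqrt as the target
are all assumptions made for the example. Only the method name
_get_object is borrowed from the real module. The proxy imports its
target on first use and, if the name in the given namespace still points
at the proxy, rebinds that name to the real object so later lookups
bypass the proxy entirely.

```python
import importlib


class LazyProxy:
    """Hypothetical stand-in that imports its target on first use."""

    def __init__(self, module_name, attr_name, namespace):
        self._module_name = module_name
        self._attr_name = attr_name
        self._namespace = namespace  # e.g. globals() of the importing module
        self._obj = None

    def _get_object(self):
        if self._obj is None:
            module = importlib.import_module(self._module_name)
            self._obj = getattr(module, self._attr_name)
            # Rebind the name to the real object, but only if it still
            # refers to this proxy (the user may have reassigned it).
            if self._namespace.get(self._attr_name) is self:
                self._namespace[self._attr_name] = self._obj
        return self._obj

    def __call__(self, *args, **kwargs):
        # Forward calls to the real object, importing it if necessary.
        return self._get_object()(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached for attributes the proxy itself does not define.
        return getattr(self._get_object(), name)


# A plain dict stands in for a module's global namespace.
namespace = {}
namespace["sqrt"] = LazyProxy("math", "sqrt", namespace)

first = namespace["sqrt"](16.0)  # triggers the import, then forwards the call
```

After the first call, namespace["sqrt"] is the real function, so there
is no lasting overhead for code that looks the name up again. Code that
grabbed a reference to the proxy before the first call still goes
through the proxy's forwarding methods.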

> I agree that it probably shouldn't be done for central functions and
> modules, but if the lasting (after first call) overhead could be
> completely removed, wouldn't it be a good idea to apply it to more or
> less _all_ satellite-modules?

As long as we don't get to the point that the first call takes a large
amount of time, yes. There's also the (unimplemented) idea of actively
loading lazily imported objects in a background thread during idle
time.
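
A rough sketch of the background-loading idea, under heavy assumptions:
the DEFERRED list and the preload function are made up for the example,
standard-library modules stand in for Sage modules, and nothing here
detects idle time. It only shows a daemon thread working through a list
of deferred imports while the main thread stays free.

```python
import importlib
import threading

# Hypothetical list of modules whose import was deferred at startup.
DEFERRED = ["json", "decimal", "fractions"]


def preload(module_names, loaded):
    """Import each module in turn and record its name once it is loaded."""
    for name in module_names:
        importlib.import_module(name)
        loaded.append(name)


loaded = []
worker = threading.Thread(target=preload, args=(DEFERRED, loaded), daemon=True)
worker.start()

# ... the interactive prompt would be served here ...

worker.join()  # only so the example finishes deterministically
```

Since imports are serialized by Python's import machinery and
pure-Python work holds the GIL, a thread like this would mainly help
when the time is spent waiting on the filesystem, which is the
high-latency case discussed earlier in the thread.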

- Robert

-- 
To post to this group, send an email to sage-devel@googlegroups.com
To unsubscribe from this group, send an email to 
sage-devel+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/sage-devel
URL: http://www.sagemath.org
