Re: Confused about two development utils [EXT]

Sandhya Fri, 25 Dec 2020 23:23:22 -0800

unsubscribe.

On Fri, Dec 25, 2020 at 10:30 PM André Warnier (tomcat/perl) <[email protected]>
wrote:


> Hello James.
> Bravo and many thanks for this excellent overview of your activities. Of
> course the setup
> (in your previous message) and the activities are very impressive by
> themselves.
> But in addition, even though your message is not in itself a perl advocacy
> message, I feel
> that it would have its right place in some perl/mod_perl advocacy forum,
> because it
> touches on some general ideas which are valid /also/ for perl and mod_perl.
> It was very refreshing to read for once a clear exposé of why it is still
> important
> nowadays to think before programming, to program efficiently, and to
> choose the right tool
> for the job at hand (be it perl, mod_perl, or any other) without the kind
> of off-the-cuff
> general a-priori assumptions which tend to plague these discussions.
>
> And even though our own (commercial) activities and setups do not have
> anything even close
> to the scope which you describe, I would like to say that the same basic
> principles which
> you mention in your exposé are just as valid when you scale-down as when
> you scale-up.
> ("--you can’t just throw memory, CPUs, power at a problem – you have to
> think – how can I do what I need to do with the least resources..")
> Even when you think of a single server, or a single server rack, at any
> one period in time
> there is always a practical limit as to how much memory or CPUs you can
> fit in a given
> server, or how many servers you can fit in a rack, or how many additional
> Gb of bandwidth
> you can allocate per server, beyond which there is a sudden "quantum jump"
> as to how
> practical and cost-effective a whole project becomes.
> In that sense, I particularly enjoyed your examples of the database and of
> the additional
> power line.
>
>
> On 24.12.2020 02:38, James Smith wrote:
> > We don’t use perl for everything. Yes, we use it for web data, and yes, we still use it as the
> > glue language in a lot of cases; the most complex stuff is done with C (not even C++, as
> > that is too slow). Others on site use Python, Java, Rust, Go, PHP, along with looking at
> > using GPUs in cases where code can be highly parallelised.
> >
> > It is not just one application – but many, many applications… All with a
> common goal of
> > understanding the human genome, and using it to assist in developing new
> understanding and
> > techniques which can advance health care.
> >
> > We are a very large sequencing centre (one of the largest in the world) – what I was
> > pointing out is that you can’t just throw memory, CPUs, power at a problem – you have to
> > think “how can I do what I need to do with the least resources?” rather than “what
> > resources can I throw at the problem?”
> >
> > Currently we are acting as the central repository for all COVID-19
> sequencing in the UK,
> > along with one of the largest “wet” labs sequencing data for it – and
> that is half the
> > sequenced samples in the whole world. The UK is sequencing more COVID-19
> genomes a day
> > than most other countries have sequenced since the start of the pandemic
> in Feb/Mar. This
> > has led to us discovering a new, more transmissible version of the virus, and in which part
> > of the country the different strains are present – no other country in the world has the
> > information, technology or infrastructure in place to achieve this.
> >
> > But this is just a small part of the genomic sequencing we are looking
> at – we work on:
> > * other pathogens – e.g. Plasmodium (Malaria);
> > * cancer genomes (and how effective drugs are);
> > * are a major part of the Human Cell Atlas which is looking at how the
> expression of genes
> > (in the simplest terms which ones are switched on and switched off) are
> different in
> > different tissues;
> > * sequencing the genomes of other animals to understand their evolution;
> > * and looking at some other species in detail, to see what we can learn
> from them when
> > they have defective genes;
> >
> > Although all these are currently scaled back so that we can work relentlessly to help
> > the medical teams and other researchers get on top of COVID-19.
> >
> > What is interesting is that many of the developers we have on campus
> (well all wfh at the
> > moment) are all (relatively) old as we learnt to develop code on
> machines with limited CPU
> > and limited memory – so that things had to be efficient, had to be
> compact…. And that is
> > as important now as it was 20 or 30 years ago – the data we handle is
> going up faster than
> > Moore’s Law! Many of us have pride in doing things as efficiently as
> possible.
> >
> > It took around 10 years to sequence and assemble the first human genome
> {well we are still
> > tinkering with it and filling in the gaps} – now at the institute we can
> sequence and
> > assemble around 400 human genomes in a day – to the same quality!
> >
> > So most of our issues are due to the scale of the problems we face – e.g. the human genome
> > has 3 billion base-pairs (A, C, G, Ts), so normal solutions don’t scale to that. Once,
> > many years ago, we looked at setting up an Oracle database where there was at least 1 row
> > for every base pair – recording all variants (think of them as spelling mistakes, for
> > example a T rather than an A, or an extra letter inserted or deleted) for that base pair…
> > The schema was set up – and then they realised it would take 12 months to load the data
> > which we had then (which is probably less than a millionth of what we have now)!
> >
> > Moving compute off site is difficult, as transferring the volume of data we have would
> > itself be a problem – you can’t easily move all the data to the compute – so you have to bring
> > the compute to the data.
> >
> > The site I worked on before I became a more general developer was doing
> that – and the
> > code that was written 12-15 years ago is actually still going strong –
> it has seen a few
> > changes over the years – many displays have had to be redeveloped as the
> scale of the data
> > has got so big that even the summary pages we produced 10 years ago have
> to be summarised
> > because they are so large.
> >
> > *From:*Mithun Bhattacharya <[email protected]>
> > *Sent:* 24 December 2020 00:06
> > *To:* mod_perl list <[email protected]>
> > *Subject:* Re: Confused about two development utils [EXT]
> >
> > James would you be able to share more info about your setup ?
> >
> > 1. What exactly is your application doing which requires so much memory
> and CPU - is it
> > something like gene splicing (no i don't know much about it beyond
> Jurassic Park :D )
> >
> > 2. Do you feel Perl was the best choice for whatever you are doing and
> if yes then why ?
> > How much of your stuff is using mod_perl considering you mentioned not
> much is web related ?
> >
> > 3. What are the challenges you are currently facing with your
> implementation ?
> >
> > On Wed, Dec 23, 2020 at 6:58 AM James Smith <[email protected]>
> > wrote:
> >
> >     Oh but memory is a problem – but not if you have just a small
> cluster of machines!
> >
> >     Our boxes are larger than that – but they all run virtual machines {only a small
> >     proportion web related} – machines/memory would rapidly become a problem in our data
> >     centre - we run VMWARE [995 hosts] and openstack [10,000s of hosts] + a selection of
> >     large memory machines {measured in TBs of memory per machine}.
> >
> >     We would be looking at somewhere between 0.5 PB and 1 PB of memory – it is not just the
> >     price of buying that amount of memory (for many machines we need the fastest memory
> >     money can buy for the workload), but we would need a lot more CPUs than we currently
> >     have, as we would need a larger number of machines to have 64GB virtual machines {we
> >     would get 2 VMs per host}. We currently have approx. 1-2000 CPUs running our hardware
> >     (last time I had a figure) – it would probably need to go to approximately 5-10,000!
> >     It is not just the initial outlay but the environmental and financial cost of running
> >     that number of machines, and finding space to run them without putting the cooling
> >     costs through the roof!! That is without considering what additional constraints on
> >     storage having the extra machines may have (at the last count a year ago we had over
> >     30 PBytes of storage on site – and a large amount of offsite backup).
> >
> >     We would also stretch the amount of power we can get from the national grid to power
> >     it all - we currently have 3 feeds from different parts of the national grid (we are
> >     fortunately in a position where this is possible) and the dedicated link we would need
> >     to add more power would be at least 50 miles long!
> >
> >     So - managing cores/memory is vitally important to us – moving to
> the cloud is an
> >     option we are looking at – but that is more than 4 times the price
> of our onsite
> >     set-up (with substantial discounts from AWS) and would require an
> upgrade of our
> >     existing link to the internet – which is currently 40Gbit of data (I
> think).
> >
> >     Currently we are analysing very large amounts of data directly
> linked to the current
> >     major world problem – this is why the UK is currently being isolated
> as we have
> >     discovered and can track a new strain, in near real time – other
> countries have no
> >     ability to do this – we in a day can and do handle, sequence and
> analyse more samples
> >     than the whole of France has sequenced since February. We probably
> don’t have more of
> >     the new variant strain than in other areas of the world – it is just
> that we know we
> >     have because of the amount of sequencing and analysis that we in the
> UK have done.
> >
> >     *From:*Matthias Peng <[email protected]>
> >     *Sent:* 23 December 2020 12:02
> >     *To:* mod_perl list <[email protected]>
> >     *Subject:* Re: Confused about two development utils [EXT]
> >
> >     Today memory is not a serious problem; each of our servers has 64GB of memory.
> >
> >
> >         Forgot to add - so our FCGI servers need a lot (and I mean a
> lot) more memory than
> >         the mod_perl servers to serve the same level of content (just in
> case memory blows
> >         up with FCGI backends)
> >
> >         -----Original Message-----
> >         From: James Smith <[email protected]>
> >         Sent: 23 December 2020 11:34
> >         To: André Warnier (tomcat/perl) <[email protected]>;
> >         [email protected]
> >         Subject: RE: Confused about two development utils [EXT]
> >
> >
> >          > This costs memory, and all the more since many perl modules
> are not
> >         thread-safe, so if you use them in your code, at this moment the
> only safe way to
> >         do it is to use the Apache httpd prefork model. This means that
> each Apache httpd
> >         child process has its own copy of the perl interpreter, which
> means that the
> >         memory used by this embedded perl interpreter has to be counted
> n times (as many
> >         times as there are Apache httpd child processes running at any
> one time).
> >
> >         This isn’t quite true - if you load modules before the process
> forks then they can
> >         cleverly share the same parts of memory. It is useful to be able
> to "pre-load"
> >         core functionality which is used across all functions {this is
> the case in Linux
> >         anyway}. It also speeds up child process generation as the
> modules are already in
> >         memory and converted to byte code.
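
As an illustration of the pre-loading described above, a minimal startup.pl sketch might look like the following. It assumes mod_perl 2 with the prefork MPM; the file path in the comment and the choice of modules are placeholders, not taken from the thread. Anything compiled here is loaded once in the Apache parent, so forked children can share those memory pages (copy-on-write) until they write to them.

```perl
# startup.pl: a sketch, assuming mod_perl 2 with the prefork MPM.
# Pulled in once by the Apache parent process, for example with
#   PerlRequire /path/to/startup.pl      (path is a placeholder)
# in httpd.conf, so everything below is compiled before any child is forked.
use strict;
use warnings;

# Illustrative core modules only; a real site would list the modules
# its own handlers use on most requests.
# The empty import lists () load the code without exporting anything.
use POSIX ();
use Data::Dumper ();
use Storable ();

1;  # a file loaded via require must return a true value
```

Modules that are only loaded later, inside a child, are private to that child and are counted once per process.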
> >
> >         One of the great advantages of mod_perl is Apache2::SizeLimit
> which can blow away
> >         large child processes - and then if needed create new ones. This
> is not the case
> >         with some of the FCGI solutions as the individual processes can
> grow if there is a
> >         memory leak or a request that retrieves a large amount of
> content (even if not
> >         served), but perl can't give the memory back. So FCGI processes
> only get bigger
> >         and bigger and eventually blow up memory (or hit swap first)
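
For the Apache2::SizeLimit behaviour mentioned above, a configuration sketch could look like this. It assumes mod_perl 2 with Apache2::SizeLimit installed; the thresholds are made-up numbers for illustration (sizes are in KB) and would need tuning for a real workload.

```perl
# startup.pl fragment: a sketch, assuming mod_perl 2 and Apache2::SizeLimit.
# All thresholds below are illustrative values in KB, not recommendations.
use strict;
use warnings;
use Apache2::SizeLimit ();

# Retire a child once its total size exceeds roughly 500 MB ...
Apache2::SizeLimit->set_max_process_size(500_000);
# ... or once its unshared (private) memory exceeds roughly 300 MB.
Apache2::SizeLimit->set_max_unshared_size(300_000);
# Measure the process size on every 5th request rather than on each one.
Apache2::SizeLimit->set_check_interval(5);

# The check itself is enabled from httpd.conf with:
#   PerlCleanupHandler Apache2::SizeLimit

1;
```

A child that exceeds a limit exits after finishing its current request, and Apache starts a replacement as needed.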
> >
> >
> >
> >
> >
> >         --
> >           The Wellcome Sanger Institute is operated by Genome Research Limited, a charity
> >         registered in England with number 1021457 and a company registered in England
> >         with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
>
>
