If I have been using mod_perl for development, does that mean I drive a
Maserati? :)


> unsubscribe.
>
> On Fri, Dec 25, 2020 at 10:30 PM André Warnier (tomcat/perl)
> <a...@ice-sa.com> wrote:
>
>> Hello James.
>> Bravo and many thanks for this excellent overview of your activities. Of
>> course the setup
>> (in your previous message) and the activities are very impressive by
>> themselves.
>> But in addition, even though your message is not in itself a perl
>> advocacy message, I feel
>> that it would have its right place in some perl/mod_perl advocacy forum,
>> because it
>> touches on some general ideas which are valid /also/ for perl and mod_perl.
>> It was very refreshing to read for once a clear exposé of why it is still
>> important
>> nowadays to think before programming, to program efficiently, and to
>> choose the right tool
>> for the job at hand (be it perl, mod_perl, or any other) without the kind
>> of off-the-cuff
>> a-priori generalisations which tend to plague these discussions.
>>
>> And even though our own (commercial) activities and setups do not have
>> anything even close
>> to the scope which you describe, I would like to say that the same basic
>> principles which
>> you mention in your exposé are just as valid when you scale down as when
>> you scale up.
>> ("--you can’t just throw memory, CPUs, power at a problem – you have to
>> think – how can I do what I need to do with the least resources..")
>> Even when you think of a single server, or a single server rack, at any
>> one period in time
>> there is always a practical limit as to how much memory or CPUs you can
>> fit in a given
>> server, or how many servers you can fit in a rack, or how many additional
>> Gb of bandwidth
>> you can allocate per server, beyond which there is a sudden "quantum
>> jump" as to how
>> practical and cost-effective a whole project becomes.
>> In that sense, I particularly enjoyed your examples of the database and of
>> the additional
>> power line.
>>
>>
>> On 24.12.2020 02:38, James Smith wrote:
>> > We don’t use perl for everything. Yes, we use it for web data, and yes,
>> > we still use it as the glue language in a lot of cases, but the most
>> > complex stuff is done with C (not even C++, as that is too slow). Others
>> > on site use Python, Java, Rust, Go and PHP, along with looking at using
>> > GPUs in cases where code can be highly parallelised.
>> >
>> > It is not just one application – but many, many applications… All with
>> a common goal of
>> > understanding the human genome, and using it to assist in developing
>> new understanding and
>> > techniques which can advance health care.
>> >
>> > We are a very large sequencing centre (one of the largest in the world)
>> – what I was
>> > pointing out is that you can’t just throw memory, CPUs, power at a
>> problem – you have to
>> > think – how can I do what I need to do with the least resources. Rather
>> than what
>> > resources can I throw at the problem.
>> >
>> > Currently we are acting as the central repository for all COVID-19
>> sequencing in the UK,
>> > along with one of the largest “wet” labs sequencing data for it – and
>> that is half the
>> > sequenced samples in the whole world. The UK is sequencing more
>> COVID-19 genomes a day
>> > than most other countries have sequenced since the start of the
>> pandemic in Feb/Mar. This
>> > has led to us discovering a new, more transmissible version of the
>> > virus, and to knowing in which parts of the country the different
>> > strains are present – no other country in the world has the
>> > information, technology or infrastructure in place to achieve this.
>> >
>> > But this is just a small part of the genomic sequencing we are looking
>> at – we work on:
>> > * other pathogens – e.g. Plasmodium (Malaria);
>> > * cancer genomes (and how effective drugs are);
>> > * are a major part of the Human Cell Atlas, which is looking at how the
>> > expression of genes (in the simplest terms, which ones are switched on
>> > and which are switched off) differs in different tissues;
>> > * sequencing the genomes of other animals to understand their evolution;
>> > * and looking at some other species in detail, to see what we can learn
>> from them when
>> > they have defective genes;
>> >
>> > Although all of these are currently scaled back so that we can work
>> > relentlessly to help the medical teams and other researchers get on top
>> > of COVID-19.
>> >
>> > What is interesting is that many of the developers we have on campus
>> > (well, all WFH at the moment) are (relatively) old, as we learnt to
>> > develop code on machines with limited CPU and limited memory – so
>> > things had to be efficient, had to be compact…. And that is
>> > as important now as it was 20 or 30 years ago – the data we handle is
>> going up faster than
>> > Moore’s Law! Many of us have pride in doing things as efficiently as
>> possible.
>> >
>> > It took around 10 years to sequence and assemble the first human genome
>> {well we are still
>> > tinkering with it and filling in the gaps} – now at the institute we
>> can sequence and
>> > assemble around 400 human genomes in a day – to the same quality!
>> >
>> > So most of our issues are due to the scale of the problems we face –
>> e.g. the human genome
>> > has 3 billion base-pairs (A, C, G, Ts), so normal solutions don’t scale
>> > to that. Once, many years ago, we looked at setting up an Oracle
>> > database with at least 1 row for every base pair – recording all
>> > variants (think of them as spelling mistakes, for example a T rather
>> > than an A, or an extra letter inserted or deleted) for that base pair…
>> > The schema was set up – and then they realised it would take 12 months
>> > to load the data which we had then (which is probably less than a
>> > millionth of what we have now)!
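
As a rough, purely illustrative back-of-envelope sketch in Perl (the
rows-per-base and insert-rate figures below are assumptions, not numbers from
this thread), it is easy to see how a row-per-base-pair design hits that wall:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumed figures, for illustration only.
    my $base_pairs      = 3_000_000_000;  # ~3 billion positions in the human genome
    my $rows_per_base   = 10;             # hypothetical average variant rows per position
    my $inserts_per_sec = 1_000;          # hypothetical sustained bulk-load rate

    my $rows = $base_pairs * $rows_per_base;
    my $days = $rows / $inserts_per_sec / 86_400;

    printf "%.0f rows => roughly %.0f days (about %.0f months) to load\n",
        $rows, $days, $days / 30;
    # => 30000000000 rows => roughly 347 days (about 12 months) to load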
>> >
>> > Moving compute off site is a problem, as transferring the volume of data
>> > we hold would itself be a problem – you can’t easily move all the data
>> > to the compute – so you have to bring the compute to the data.
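
To put a hedged number on that (using the ~30 PBytes of on-site storage and
the ~40Gbit internet link mentioned elsewhere in this thread, and ignoring
protocol overhead entirely):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $storage_bytes = 30e15;   # ~30 PB of on-site storage (figure from this thread)
    my $link_bps      = 40e9;    # ~40 Gbit/s internet link (figure from this thread)

    my $days = ( $storage_bytes * 8 ) / $link_bps / 86_400;
    printf "~%.0f days to copy everything out, even at full link saturation\n", $days;
    # => ~69 days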
>> >
>> > The site I worked on before I became a more general developer was doing
>> that – and the
>> > code that was written 12-15 years ago is actually still going strong –
>> it has seen a few
>> > changes over the years – many displays have had to be redeveloped as the
>> scale of the data
>> > has got so big that even the summary pages we produced 10 years ago
>> have to be summarised
>> > because they are so large.
>> >
>> > *From:* Mithun Bhattacharya <mit...@gmail.com>
>> > *Sent:* 24 December 2020 00:06
>> > *To:* mod_perl list <modperl@perl.apache.org>
>> > *Subject:* Re: Confused about two development utils [EXT]
>> >
>> > James would you be able to share more info about your setup?
>> >
>> > 1. What exactly is your application doing which requires so much memory
>> > and CPU - is it something like gene splicing (no, I don't know much
>> > about it beyond Jurassic Park :D )
>> >
>> > 2. Do you feel Perl was the best choice for whatever you are doing, and
>> > if yes then why? How much of your stuff is using mod_perl, considering
>> > you mentioned not much is web related?
>> >
>> > 3. What are the challenges you are currently facing with your
>> > implementation?
>> >
>> > On Wed, Dec 23, 2020 at 6:58 AM James Smith <j...@sanger.ac.uk> wrote:
>> >
>> >     Oh, but memory is a problem – though perhaps not if you have just a
>> >     small cluster of machines!
>> >
>> >     Our boxes are larger than that – but they all run virtual machines
>> >     {only a small proportion web related} – machines/memory would rapidly
>> >     become a problem in our data centre - we run VMware [995 hosts] and
>> >     OpenStack [10,000s of hosts] + a selection of large-memory machines
>> >     {measured in TBs of memory per machine}.
>> >
>> >     We would be looking at somewhere between 0.5 PB and 1 PB of memory –
>> >     and it is not just the price of buying that amount of memory - for
>> >     many machines we need the fastest memory money can buy for the
>> >     workload. We would also need a lot more CPUs than we currently have,
>> >     as we would need a larger number of machines to host 64GB virtual
>> >     machines {we would get 2 VMs per host}. We currently have approx.
>> >     1,000-2,000 CPUs running our hardware (last time I had a figure) – it
>> >     would probably need to go to approximately 5-10,000!
>> >     It is not just the initial outlay but the environmental and
>> financial cost of running
>> >     that number of machines, and finding space to run them without
>> putting the cooling
>> >     costs through the roof!! That is without considering what additional
>> >     constraints on storage having the extra machines may have (at the
>> >     last count a year ago we had over 30 PBytes of storage on site – and
>> >     a large amount of offsite backup).
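
A minimal sketch of the arithmetic behind those figures, using only the
numbers quoted in this message (0.5 PB of RAM at the low end, 64GB virtual
machines, 2 VMs per host); it is an illustration, not a capacity plan:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $total_mem_gb = 0.5 * 1024 * 1024;   # 0.5 PB of RAM expressed in GB
    my $vm_mem_gb    = 64;                  # one 64GB virtual machine
    my $vms_per_host = 2;                   # 2 VMs per physical host

    my $vms   = $total_mem_gb / $vm_mem_gb;
    my $hosts = $vms / $vms_per_host;

    printf "%d VMs on %d physical hosts just to reach 0.5 PB of memory\n",
        $vms, $hosts;
    # => 8192 VMs on 4096 physical hosts just to reach 0.5 PB of memory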
>> >
>> >     We would also stretch the amount of power we can get from the
>> national grid to power
>> >     it all - we currently have 3 feeds from different parts of the
>> >     national grid (we are fortunately in a position where this is
>> >     possible) and the dedicated
>> link we would need
>> >     to add more power would be at least 50 miles long!
>> >
>> >     So - managing cores/memory is vitally important to us – moving to
>> the cloud is an
>> >     option we are looking at – but that is more than 4 times the price
>> of our onsite
>> >     set-up (with substantial discounts from AWS) and would require an
>> upgrade of our
>> >     existing link to the internet – which is currently 40Gbit (I think).
>> >
>> >     Currently we are analysing very large amounts of data directly
>> >     linked to the current
>> >     major world problem – this is why the UK is currently being
>> isolated as we have
>> >     discovered and can track a new strain, in near real time – other
>> countries have no
>> >     ability to do this – we in a day can and do handle, sequence and
>> analyse more samples
>> >     than the whole of France has sequenced since February. We probably
>> don’t have more of
>> >     the new variant strain than in other areas of the world – it is
>> >     just that we know we have it because of the amount of sequencing and
>> >     analysis that we in the UK have done.
>> >
>> >     *From:* Matthias Peng <pengmatth...@gmail.com>
>> >     *Sent:* 23 December 2020 12:02
>> >     *To:* mod_perl list <modperl@perl.apache.org>
>> >     *Subject:* Re: Confused about two development utils [EXT]
>> >
>> >     Today memory is not a serious problem; each of our servers has 64GB
>> >     of memory.
>> >
>> >
>> >         Forgot to add - so our FCGI servers need a lot (and I mean a
>> lot) more memory than
>> >         the mod_perl servers to serve the same level of content (just
>> in case memory blows
>> >         up with FCGI backends)
>> >
>> >         -----Original Message-----
>> >         From: James Smith <j...@sanger.ac.uk>
>> >         Sent: 23 December 2020 11:34
>> >         To: André Warnier (tomcat/perl) <a...@ice-sa.com>;
>> >         modperl@perl.apache.org
>> >         Subject: RE: Confused about two development utils [EXT]
>> >
>> >
>> >          > This costs memory, and all the more since many perl modules
>> are not
>> >         thread-safe, so if you use them in your code, at this moment
>> the only safe way to
>> >         do it is to use the Apache httpd prefork model. This means that
>> each Apache httpd
>> >         child process has its own copy of the perl interpreter, which
>> means that the
>> >         memory used by this embedded perl interpreter has to be counted
>> n times (as many
>> >         times as there are Apache httpd child processes running at any
>> one time).
>> >
>> >         This isn’t quite true - if you load modules before the process
>> >         forks then the child processes can cleverly share the same pages
>> >         of memory (copy-on-write). It is useful to be able to "pre-load"
>> >         core functionality which is used across all functions {this is
>> >         the case on Linux anyway}. It also speeds up child process
>> >         creation as the modules are already in memory and converted to
>> >         byte code.
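
A minimal sketch of that pre-loading, assuming a typical mod_perl 2 prefork
setup; the application module names and paths here are made up for
illustration:

    # startup.pl - pulled into the parent httpd before it forks, e.g. via
    #   PerlRequire /etc/httpd/conf/startup.pl
    # in httpd.conf.
    use strict;
    use warnings;

    use lib '/opt/myapp/lib';   # hypothetical application library path

    # Heavy, widely used modules: compiled once in the parent process, so
    # the pages holding them are shared (copy-on-write) by every child.
    use DBI ();
    use DateTime ();
    use MyApp::Core ();         # hypothetical in-house module

    1;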
>> >
>> >         One of the great advantages of mod_perl is Apache2::SizeLimit,
>> >         which can blow away large child processes - and Apache then
>> >         creates new ones if needed. This is not the case with some of the
>> >         FCGI solutions, as the individual processes can grow if there is
>> >         a memory leak or a request that retrieves a large amount of
>> >         content (even if it is not served), but perl can't give the
>> >         memory back. So FCGI processes only get bigger and bigger and
>> >         eventually blow up memory (or hit swap first).
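
A minimal sketch of wiring that up, assuming mod_perl 2 with the prefork MPM;
the size thresholds are made-up numbers to be tuned for the actual workload:

    # In startup.pl - sizes are in KB:
    use Apache2::SizeLimit ();

    Apache2::SizeLimit->set_max_process_size(300_000);   # kill children over ~300MB total
    Apache2::SizeLimit->set_max_unshared_size(200_000);  # ...or over ~200MB unshared
    Apache2::SizeLimit->set_min_shared_size(20_000);     # ...or sharing less than ~20MB

    # And in httpd.conf, run the check at the end of each request:
    #   PerlCleanupHandler Apache2::SizeLimit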
>> >
>> >
>> > -- The Wellcome Sanger Institute is operated by Genome Research Limited,
>> > a charity registered in England with number 1021457 and a company
>> > registered in England with number 2742969, whose registered office is
>> > 215 Euston Road, London, NW1 2BE.
>>
>>
