I wrote:
>I calculated roughly that encoding a 2-hour video could be parallelized by a
>factor of perhaps 20 trillion, using pipelining and divide-and-conquer

On Tue, Oct 20, 2009 at 03:16:22AM +0100, matt wrote:
> I know you are using video / audio encoding as an example and there are
> probably datasets that make sense but in this case, what use is it?

I was using it to work out the *maximum* extent to which a common task can be
parallelized.  20-trillion-fold is the answer I came up with.  Someone was
talking about Amdahl's Law and saying that having large numbers of processors
is not much use because Amdahl's Law limits their utilization.  I disagree.

In reality 10,000 processing units might be a more sensible number to have than
20 trillion.  If you have ever done H.264 video encoding on a PC you will know
that it is very slow; even normal MPEG encoding is barely faster than real time
on a 1 GHz PC.  Few people like having to wait 2 hours for a task to complete.

This whole argument / discussion has come out of nowhere, since it appears
Ken's original comment was criticising the normal sort of multi-core systems;
he is more in favour of other approaches such as FPGAs.  I fully agree with
that.

> You can't watch 2 hours of video per second and you can't write it to disk
> fast enough to empty the pipeline.

If I had a computer with 20 trillion processing units, I would have superior
storage media and IO systems to go with it.  And note that the system I
described could encode 2 BILLION hours of video per second, not 2 hours per
second.

> You've got to feed in 2 hours of source material - 820Gb per stream, how?

I suppose some sort of parallel bus of wires or optic fibres.  If I have
massively parallel processing, I would want massively parallel IO to go with
it: something like "read data starting from here" -> "here it is, streaming
one megabit in parallel down the bus at 1 GHz over 1 million channels".

> Once you have your uncompressed stream, MPEG-2 encoding requires seeking
> through the time dimension with keyframes every n frames and out of order
> macro blocks, so we have to wait for n frames to be composited.  For the best
> quality the datarate is unconstrained on the first processing run and then
> macro blocks best-fitted and re-ordered on the second to match the desired
> output datarate, but again, this is n frames at a time.
> 
> Amdahl is punching you in the face every time you say "see, it's easy".

I'm no expert on video encoding, but it seems to me you are assuming I would
approach it in the conventional, stupid serial way.  With massively parallel
processing one could "seek" through the time dimension simply by comparing
data from all time offsets at once, in parallel.

Can you give one example of a slow task that you think cannot benefit much
from parallel processing?  Video is an extremely obvious example of one that
certainly benefits from just about as much parallel processing as you can
throw at it, so I'm surprised you would argue about it.  Probably my "20
trillion" upset you or something; it seems you didn't get my point.

It might have been better to consider a simpler example, such as frequency
analysis of audio data to perform pitch correction (for out-of-tune singers).
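To make the frequency-analysis step concrete, here is a minimal sketch of
detecting the dominant pitch of one audio window with an FFT.  The 440 Hz
test tone and the sample rate are assumptions for illustration; the point is
that each short window is independent, so many windows can be analysed in
parallel:

```python
import numpy as np

# One second of a synthetic, in-tune A4 (440 Hz) at CD sample rate.
# Both values are illustrative assumptions, not from the original post.
RATE = 44100                          # samples per second
t = np.arange(RATE) / RATE            # one second of time axis
tone = np.sin(2 * np.pi * 440 * t)

# Magnitude spectrum of the window, and the frequency of each FFT bin.
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / RATE)

# The detected pitch is the bin with the most energy.
pitch = freqs[np.argmax(spectrum)]
```

A pitch corrector would then resample or phase-shift each window towards the
nearest correct note; since the windows do not depend on each other, that
work also parallelizes trivially.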

I can write a simple shell script using ffmpeg to do H.264 video encoding
which would take advantage of perhaps 720 "cores" to encode a two-hour video
in 10-second chunks with barely any Amdahl effects, running the encoding over
a LAN.  A server should be able to pipe the whole 800 MB input - I am assuming
it is already encoded in Xvid or something - over the network in about 10
seconds on a gigabit (or faster) network.  Each participating computer will
receive the chunk of data it needs.  The encoding would take perhaps 30
seconds for the 10 seconds of video on each of 720 1 GHz computers, and
another 10 seconds to pipe the data back to the server.  Concatenating the
video should take very little time, although perhaps the MP4 format is not
the best for that; I'm not sure.
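As a dry run of that script, the following sketch generates (but does not
execute) the command each worker would run.  The "workerN" hostnames, the
filenames, and the exact ffmpeg flags are illustrative assumptions:

```python
# Two hours of source video, split into 10-second chunks, one per worker.
DURATION = 2 * 60 * 60   # seconds of source video
CHUNK = 10               # seconds of video per worker

commands = []
for i, start in enumerate(range(0, DURATION, CHUNK)):
    # Each worker seeks to its own offset and encodes its slice to H.264.
    # Hostname, input file, and output name are hypothetical.
    commands.append(
        f"ssh worker{i} ffmpeg -ss {start} -t {CHUNK}"
        f" -i source.avi -c:v libx264 chunk{i:03d}.mp4"
    )

print(len(commands))   # one command per 10-second chunk: 720 in total
```

Because every chunk is an independent encode, the only coordination needed
is shipping the chunks out and concatenating the results.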

The entire operation takes 50 seconds as opposed to 6 hours (21,600 seconds).
With my 721 computers I achieve a 432-times speedup.  Amdahl is not sucking up
much there, only a little for transferring data around.  And since each
computer could be doing something else while waiting for its chunk of data to
arrive, the total actual utilization can be over 99%.  People do this stuff
every day.  Have you heard of a render farm?
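The arithmetic behind those numbers, as a back-of-the-envelope check:

```python
# Timings from the 720-worker example above.
serial_time = 6 * 60 * 60   # 21600 s to encode the whole video on one machine
send = 10                   # s to pipe the source out over the LAN
encode = 30                 # s to encode each 10-second chunk
receive = 10                # s to pipe the encoded chunks back

# End-to-end wall-clock time and the resulting speedup.
parallel_time = send + encode + receive   # 50 s
speedup = serial_time / parallel_time     # 432x on 721 machines
```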

This applies to all Amdahl arguments - if part of the system is idle due to
serial constraints in the algorithm, it could likely be working on something
else.  Perhaps you have a couple of videos to recode?  Then you can achieve
close to 100% utilization.  The time taken for a single task may be limited by
the method or the hardware, but a batch of several tasks can be completed
close to N times faster if you have N processors/computers.
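A sketch of that batch argument, under the assumption that a worker can
receive the next chunk while encoding the current one, so only the first
transfer in and the last transfer back are not overlapped with useful work
(timings taken from the 720-worker example above):

```python
def batch_wall_time(k, transfer_in=10, encode=30, transfer_out=10):
    # First transfer is exposed; after that, each video in the batch adds
    # only its encode time, because transfers overlap with encoding.
    return transfer_in + k * encode + transfer_out

k = 100                        # videos in the batch (illustrative)
wall = batch_wall_time(k)      # 3020 s of wall-clock time for 100 videos
utilization = (k * 30) / wall  # fraction of time spent actually encoding
```

With a batch of 100 videos the workers spend over 99% of the wall-clock time
encoding, which is the near-100% utilization claimed above.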

I'm not sure why I'm wasting time writing about this, it's obvious anyway.

Sam
