Hello all,

I am in a bind with Apache's multi process limit. Let me explain what I am
doing. There's this website which has career details of all the football
players since the beginning of professional football. They have a simple web
form which allows you to look at a player's profile by entering his name or
his 7 digit numeric id number (on that website).

One of my clients wants a list of all the players with a certain "flag" in
their profile. So I created an automatic form submission and HTML parsing
script to get details of all the players with that "flag" in their profile.
Without going into too much detail, after applying a few
pattern rules to the id number, the number of possible id numbers comes to
about 1 million (instead of 10^7: each of the 7 positions can hold any of
the 10 digits 0-9, so the total number of combinations is
10*10*10*10*10*10*10).

Therefore, to completely automate this process I wrote a script which would
generate an id number, submit the form with that id number, and parse the
resulting HTML profile for the "flag". If the script finds a hit on the
flag, it stores all the fields of that player in a database. This script is
working absolutely fine but the speed I was getting was about one check per
second which means that I would have to leave the script running for about
11 days (to process all of about 1 million checks).
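
As a rough check of that estimate, here is a minimal sketch in Python (used purely for illustration; `estimated_days` is a hypothetical helper, and the rate of one check per second is the figure quoted above):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimated_days(total_checks, checks_per_second=1.0):
    """Return the wall-clock days needed to run all checks at a steady rate."""
    seconds = total_checks / checks_per_second
    return seconds / SECONDS_PER_DAY

if __name__ == "__main__":
    # About 1 million candidate ids at roughly one check per second.
    print(round(estimated_days(1_000_000), 1))  # 11.6
```

So a single sequential run lands between 11 and 12 days, assuming the rate stays constant.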

So I came up with the idea of dividing the checks into ten parts, and I
created a separate script for each part. The first script checks the first
100 thousand combinations, the second checks the next 100 thousand
combinations, and so on.
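
The split described above could be sketched like this in Python (an illustration only, not the actual scripts; `split_ranges` is a hypothetical helper, and the numbers are positions in the list of candidate ids, not the ids themselves, since the pattern rules are not spelled out here):

```python
def split_ranges(total, parts):
    """Split range(total) into `parts` contiguous half-open (start, stop) chunks."""
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one more item when total is not divisible.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

# Ten chunks of 100 thousand candidates each, one per script.
chunks = split_ranges(1_000_000, 10)
```

Each script would then work through only its own `(start, stop)` slice, so no candidate is checked twice and none is skipped.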

*The problem is that I am able to get only two of these scripts running at
the same time.* So it would still take me at least 5 days to get all the
results. The rest of the scripts just sit there in the server's backlog.
This is definitely due to a limit on how many processes Apache handles at
once. The server I am using to run the scripts and the target web server
both run Apache2. I am sure it's not a problem with the receiving server; it
has to be my Apache web server, which is running the scripts. I have tried
mpm_winnt <http://httpd.apache.org/docs/2.0/mod/mpm_winnt.html> (on a
Windows server) as well as the
prefork <http://httpd.apache.org/docs/2.0/mod/prefork.html> and
worker <http://httpd.apache.org/docs/2.0/mod/worker.html> modules (on a
Linux server) without any luck. Have any of you ever faced the same
situation?

Please help me out here, guys.

Best,
Tony Miller

PS: For those concerned about the legitimacy of this work, rest assured,
this is absolutely legit. There's nothing in the website's use policy which
restricts somebody from doing this. Moreover, my client hired me to do this
only because the website owners were not able to hand over the data he
required. They gave the stupid reason that they cannot provide the data
because they don't have a system in place which would allow them to
restrict a search in that way!
