Before we dive into performance issues, there is something very important to understand. It applies to any webserver, not only Apache. All these efforts are made to make the user's web browsing experience swift. Among the web site usability factors, speed is one of the most crucial. What is the correct speed measurement? Since the user is the one who interacts with the web site, the speed measurement is the time that passes from the moment the user follows a link or presses a submit button until the resulting page is rendered by her browser. If we trace a data packet from the moment it leaves the user's machine (request sent) until the reply arrives, the packet travels through many entities on its way. It has to make its way through the network, passing many interconnection nodes; before it enters the target machine it might go through proxy (accelerator) servers; then it is served by your server; and finally it has to make the whole way back. A webserver is only one of the elements the packet sees on its way. You could work hard to fine tune your webserver for the best performance, but a slow NIC (Network Interface Card) or a slow network connection from your server might defeat it all. That's why it's important to think big and to be aware of possible bottlenecks between the server and the web. Of course there is nothing you can do if the user has a slow connection on her side.
Moreover, you might tune your scripts and webserver to process incoming requests ultra fast, so that you need only a small number of server processes, but then find out that the server processes are busy waiting for slow clients to complete the download. You will see more examples in this chapter.
My point is that a web service is like a car: if one of its parts or mechanisms is broken the car will not drive smoothly, and it can even stop dead if pushed further without first fixing it.
(META: Only partial analysis. Please submit more points. Many points are scattered around the document and should be gathered here, to represent the whole picture. It also should be merged with the above item!)
You need to analyze all of the problem's dimensions. There are several things that need to be considered:
* How long does it take to process each request?
* How many requests can you process simultaneously?
* How many simultaneous requests are you planning to get?
The first one is probably the easiest to optimize. Follow the performance optimization tips in the guide and other docs, and let a professional perl (mod_perl) programmer work through your code and improve it.
The second one is a function of RAM. How much RAM is in the box, how many boxes do you have, and how much RAM does each mod_perl process take? Multiply the first two and divide by the third. Ask yourself whether switching to another, possibly just as inefficient, language will actually cost more than throwing another Ultra 2 into the rack. Also ask yourself whether switching to another language will even help. In some applications a huge chunk of memory is needed, e.g. to link in the Oracle runtime libraries, so you would pay this price even if you switched from Perl to C.
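The arithmetic above can be sketched in a few lines of Perl. The figures used here (512MB per box, two boxes, 10MB per process) are purely hypothetical, and the sketch ignores both memory sharing and the memory that the OS and other services need, so treat the result as a rough upper bound only.

```perl
use strict;
use warnings;

# Hypothetical figures for illustration only; measure your own.
my $ram_per_box_mb  = 512;   # RAM in each box
my $boxes           = 2;     # number of boxes
my $process_size_mb = 10;    # size of one mod_perl process

# Multiply the first two and divide by the third.
sub max_processes {
    my ($ram_mb, $n_boxes, $proc_mb) = @_;
    return int($ram_mb * $n_boxes / $proc_mb);
}

print "Simultaneous requests (rough upper bound): ",
    max_processes($ram_per_box_mb, $boxes, $process_size_mb), "\n";
```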
The last one is important. You need to have a realistic answer. Are you really expecting 8 million hits per day? What is the expected peak load, and what kind of response time do you need to guarantee? Remember that these numbers might change drastically when you apply code changes and your site becomes more popular. Remember that when you get a very high hit rate, the requirements won't grow linearly but exponentially!
A very important point is the sharing of memory. If your OS supports this (and most sane systems do), you might save more memory by sharing it between child processes. This is only possible when you preload code at server startup. However, during a child process' life its memory pages become unshared, and there is no way to make perl allocate memory so that (dynamic) variables land on different memory pages than constants. That's why the copy-on-write effect (explained in a moment) will hit almost at random. If you are pre-loading many modules, you might be able to balance the memory that stays shared against the time for an occasional fork by tuning MaxRequestsPerChild to a point where a child restarts before too much becomes unshared. The right value of MaxRequestsPerChild is very specific to your scenario. You should do some measurements to see whether this really makes a difference and what a reasonable number might be. Each time a child reaches this upper limit and restarts, it releases its unshared copies, and the new child inherits pages that stay shared until it scribbles on them.
It is very important to understand that your goal is not to have
MaxRequestsPerChild to be 10000. Having a child serving 300 requests on precompiled code is
already a huge speedup, so if it is 100 or 10000 it does not really matter
if it saves you the RAM by sharing. Do not forget that if you preload most
of your code at the server startup, the fork to spawn a new child will be
very very fast, because it inherits most of the preloaded code and the perl
interpreter from the parent process. But then, during the life of the child, its memory pages (which aren't really its own yet; it uses the parent's pages) get dirty (originally inherited and shared variables get updated or modified) and copy-on-write happens, which reduces the number of shared memory pages and thus enlarges the memory demands. Killing the child and spawning a new one gets back the pristine shared memory from the parent process.
The conclusion is that MaxRequestsPerChild should not be too big, otherwise you lose the benefits of the memory sharing.
See Choosing MaxRequestsPerChild for more about tuning the MaxRequestsPerChild parameter.
You've probably noticed that the word shared is being repeated many times in many things related to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it. See the Formula and the numbers.
How much shared memory do you have? You can see it either by using the memory utilities that come with your system, or by using the GTop module:
use GTop ();
print "Shared memory of the current process: ",
    GTop->new->proc_mem($$)->share,"\n";
print "Total shared memory: ",
    GTop->new->mem->share,"\n";
When you watch the output of the top utility, don't confuse the RSS (or RES) column with the SHARE column. RES is RESident memory: the size of the pages currently held in physical memory.
Use the PerlRequire and PerlModule directives to load commonly used modules such as CGI.pm, DBI, etc., when the server is started. On most systems, server children will be able to share the code space used by these modules. Just add the following directives to httpd.conf:
PerlModule CGI
PerlModule DBI
An even better approach is to create a separate startup file (where you code in plain perl) and put in it things like:
use DBI;
use Carp;
Then you require() this startup file with the help of the PerlRequire directive in httpd.conf, placing it before the rest of the mod_perl configuration directives:
PerlRequire /path/to/start-up.pl
CGI.pm is a special case. Ordinarily CGI.pm autoloads most of its functions on an as-needed basis. This speeds up the
loading time by deferring the compilation phase. However, if you are using
mod_perl, FastCGI or another system that uses a persistent Perl
interpreter, you will want to precompile the methods at initialization
time. To accomplish this, call the package function compile()
like this:
use CGI ();
CGI->compile(':all');
The arguments to compile() are a list of method names or sets, and are identical to those accepted by
the use() and import()
operators. Note that in most cases you will want to replace ':all'
with tag names you really use in your code, since generally only a subset
of subs is actually being used.
You can also preload the Registry scripts. See Preload Registry Scripts.
(META: while the numbers and conclusions are mostly correct, need to rewrite the whole benchmark section using the GTop library to report the shared memory which is very important and will improve the benchmarks)
(META: Add the memory size tests when the server was compiled with EVERYTHING=1 and without it, does loading everything imposes a big change in the memory footprint? Probably the suggestion would be as follows: For a development server use EVERYTHING=1, while for a production if your server is pretty busy and/or low on memory and every bit is on account, only the required parts should be built in. BTW, remember that apache comes with many modules that are being built by default, and you might not need those!)
I have conducted a few tests to benchmark the memory usage when some modules are preloaded. The first set of tests checks the memory use with Library Perl Module preload (only CGI.pm). The second set checks the compile method of CGI.pm. The third test checks the benefit of preloading several Library Perl Modules (to see more memory saved) and also the effect of precompiling the Registry scripts with Apache::RegistryLoader.
1. In the first test, the following script was used:
use strict;
use CGI ();
my $q = new CGI;
print $q->header;
print $q->start_html,$q->p("Hello");
Server restarted
Before the CGI.pm preload: (No other modules preloaded)
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   87004  0.0  0.0 1060 1524   -    A 16:51:14 0:00 httpd
httpd 240864  0.0  0.0 1304 1784   -    A 16:51:13 0:00 httpd
After running a script which uses CGI's methods (no imports):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root  188068  0.0  0.0 1052 1524   -    A 17:04:16 0:00 httpd
httpd  86952  0.0  1.0 2520 3052   -    A 17:04:16 0:00 httpd
Observation: child httpd has grown by 1268K
Server restarted
After the CGI.pm preload:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root  240796  0.0  0.0 1456 1552   -    A 16:55:30 0:00 httpd
httpd  86944  0.0  0.0 1688 1800   -    A 16:55:30 0:00 httpd
After running a script which uses CGI's methods (no imports):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   86872  0.0  0.0 1448 1552   -    A 17:02:56 0:00 httpd
httpd 187996  0.0  1.0 2808 2968   -    A 17:02:56 0:00 httpd
Observation: child httpd has grown by 1168K, 100K less than without preload - good!
Server restarted
After CGI.pm preloaded and compiled with CGI->compile(':all');
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   86980  0.0  0.0 2836 1524   -    A 17:05:27 0:00 httpd
httpd 188104  0.0  0.0 3064 1768   -    A 17:05:27 0:00 httpd
After running a script which uses CGI's methods (no imports):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   86980  0.0  0.0 2828 1524   -    A 17:05:27 0:00 httpd
httpd 188104  0.0  1.0 4188 2940   -    A 17:05:27 0:00 httpd
Observation: child httpd has grown by 1172K. No change! So how does CGI->compile(':all') help? I think it's because we never use all of the methods CGI provides, so in real use it's faster. So you might want to compile only the tags you are about to use; then you will benefit for sure.
2. I ran a second test to find out. The script was:
use strict;
use CGI qw(:all);
print header,start_html,p("Hello");
Server restarted
After CGI.pm was preloaded and NOT compiled with CGI->compile(':all'):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   17268  0.0  0.0 1456 1552   -    A 18:02:49 0:00 httpd
httpd  86904  0.0  0.0 1688 1800   -    A 18:02:49 0:00 httpd
After running a script which imports symbols (all of them):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   17268  0.0  0.0 1448 1552   -    A 18:02:49 0:00 httpd
httpd  86904  0.0  1.0 2952 3112   -    A 18:02:49 0:00 httpd
Observation: child httpd has grown by 1264K (this is the SZ column; RSS has grown by 1312K)
Server restarted
After CGI.pm was preloaded and compiled with CGI->compile(':all'):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   86812  0.0  0.0 2836 1524   -    A 17:59:52 0:00 httpd
httpd  99104  0.0  0.0 3064 1768   -    A 17:59:52 0:00 httpd
After running a script which imports symbols (all of them):
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   86812  0.0  0.0 2832 1436   -    A 17:59:52 0:00 httpd
httpd  99104  0.0  1.0 4884 3636   -    A 17:59:52 0:00 httpd
Observation: child httpd has grown by 1868K. Why? Isn't CGI->compile(':all') supposed to make the children share the compiled code with the parent? It does work as advertised, but if you look at the code, we have called only three of CGI.pm's methods. Just saying use CGI qw(:all) doesn't mean we compile all the available methods; we just import their names. So this test is actually misleading. Execute compile() only on the methods you are actually using and then you will see the difference.
3. The third script:
use strict;
use CGI;
use Data::Dumper;
use Storable;
[and many lines of code, lots of globals - so the code is huge!]
Server restarted
Nothing preloaded at startup:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   90962  0.0  0.0 1060 1524   -    A 17:16:45 0:00 httpd
httpd  86870  0.0  0.0 1304 1784   -    A 17:16:45 0:00 httpd
Script using CGI (methods), Storable, Data::Dumper called:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   90962  0.0  0.0 1064 1436   -    A 17:16:45 0:00 httpd
httpd  86870  0.0  1.0 4024 4548   -    A 17:16:45 0:00 httpd
Observation: child httpd has grown by 2764K
Server restarted
Preloaded CGI (compiled), Storable, Data::Dumper at startup:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   26792  0.0  0.0 3120 1528   -    A 17:19:21 0:00 httpd
httpd  91052  0.0  0.0 3340 1764   -    A 17:19:21 0:00 httpd
Script using CGI (methods), Storable, Data::Dumper called:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   26792  0.0  0.0 3124 1440   -    A 17:19:21 0:00 httpd
httpd  91052  0.0  1.0 6568 5040   -    A 17:19:21 0:00 httpd
Observation: child httpd has grown by 3276K. Ouch: 512K more!!!
The reason is that when you precompile all of the methods at startup, there are many of them and they take a big chunk of memory. If you don't use the compile() method, only the functions that are actually used will be compiled. Yes, it will slightly slow down the first response of each process, but the actual memory usage will be lower. BTW, if you write in the script:
use CGI qw(:all);
Only the symbols of all functions are imported. While they take some space, it's smaller than the space that the compiled code of these functions might occupy.
Server restarted
All the above modules + the above script PreCompiled with
Apache::RegistryLoader at startup:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   43224  0.0  0.0 3256 1528   -    A 17:23:12 0:00 httpd
httpd  26844  0.0  0.0 3488 1776   -    A 17:23:12 0:00 httpd
Script using CGI (methods), Storable, Data::Dumper called:
USER     PID %CPU %MEM   SZ  RSS TTY STAT    STIME TIME COMMAND
root   43224  0.0  0.0 3252 1440   -    A 17:23:12 0:00 httpd
httpd  26844  0.0  1.0 6748 5092   -    A 17:23:12 0:00 httpd
Observation: child httpd has grown even more, by 3316K! This does not look good!
Summary:
1. Preloading Library Perl Modules gave good results everywhere.
2. CGI.pm's compile() method seems to use even more memory. That's because we never use all of the methods CGI provides. compile() only the tags that you are going to use, and you will save both the overhead of the first call to each method that has not been called yet, and memory, since the compiled code will be shared across all the children.
3. Apache::RegistryLoader might make scripts load faster on the first request after the child has just started, but the memory usage is worse!!! See the numbers for yourself.
HW/SW used : The server is apache 1.3.2, mod_perl 1.16 running on AIX 4.1.5 RS6000 1G RAM.
Apache::RegistryLoader compiles Apache::Registry scripts at server startup. It can be a good idea to preload the scripts you are going to use as well, so that the code is shared among the children.
Here is an example of the use of this technique. This code is included in a PerlRequire'd file, and walks the directory tree under which all registry scripts are
installed. For each .pl file encountered, it calls the Apache::RegistryLoader::handler() method to preload the script in the parent server (before pre-forking the
child processes):
use File::Find 'finddepth';
use Apache::RegistryLoader ();
{
    my $perl_dir = "perl/";
    my $rl = Apache::RegistryLoader->new;
    finddepth(sub {
        return unless /\.pl$/;
        my $url = "/$File::Find::dir/$_";
        print "pre-loading $url\n";
        my $status = $rl->handler($url);
        unless($status == 200) {
            warn "pre-load of `$url' failed, status=$status\n";
        }
    }, $perl_dir);
}
Note that we didn't use the second argument to handler() here, as module's manpage suggests. To make the loader smarter about the
uri->filename translation, you might need to provide a trans()
function to translate the uri to filename. URI to filename translation
normally doesn't happen until HTTP request time, so the module is forced to
roll its own translation. If filename is omitted and a trans() routine was not defined, the loader will try using the URI relative to ServerRoot.
You have to check whether this makes any improvement for you, though. I did some testing [ Preload Perl modules - Real Numbers ], and it seems that it takes more memory than when the scripts are called from the child. This is only a first impression and needs better investigation. If you aren't concerned about the few script invocations that will take some time to respond while they load the code, you might not need it at all!
See also BEGIN blocks
It's always a good idea to stay away from global variables when possible.
Some variables must be global so Perl can see them, such as a module's @ISA or $VERSION variables (or fully qualified
@MyModule::ISA). In common practice, a combination of strict and
vars pragmas keeps modules clean and reduces a bit of noise. However, vars pragma also creates aliases as the Exporter
does, which eat up more memory. When possible, try to use fully qualified
names instead of use vars. Example:
package MyPackage;
use strict;
@MyPackage::ISA = qw(...);
$MyPackage::VERSION = "1.00";
vs.
package MyPackage;
use strict;
use vars qw(@ISA $VERSION);
@ISA = qw(...);
$VERSION = "1.00";
Also see Using global variables and sharing them
When possible, avoid importing a module's functions into your name space.
The aliases which are created can take up quite a bit of space. Try to use
method interfaces and fully qualified
Package::function or $Package::variable like names instead. For benchmarks see Object Methods Calls Versus Function Calls.
Note: method interfaces are a little bit slower than function calls. You can use the Benchmark module to profile your specific code.
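As a minimal sketch of such a comparison, the following uses the core Benchmark module to time a plain function call against the equivalent class method call. The My::Greeter package and the iteration count are hypothetical and exist only for this illustration; the numbers you get depend entirely on your perl and hardware, so substitute your own code.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical package, used only for this comparison.
package My::Greeter;
sub greet { return "Hello" }

package main;

my %tests = (
    # fully qualified function call
    function => sub { My::Greeter::greet() },
    # class method call (goes through method resolution)
    method   => sub { My::Greeter->greet() },
);

# Run each variant a fixed number of times and print the timings.
timethese(200_000, \%tests);
```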
PerlSetupEnv Off is another optimization you might consider.
mod_perl fiddles with the environment to make it appear as if the script were being
called under the CGI protocol. For example, the
$ENV{QUERY_STRING} environment variable is initialized with the contents of Apache::args(), and $ENV{SERVER_NAME} is filled in from the value returned by Apache::server_hostname().
But %ENV population is expensive. Those who have moved to the Perl Apache API no longer need this extra %ENV population, and can gain by turning it Off. By default it is On.
Note that you can still set environment variables. For example, when you use the following configuration:
<Location /perl>
  PerlSetupEnv Off
  PerlSetEnv TEST hi
  SetHandler perl-script
  PerlHandler Apache::RegistryNG->handler
  Options +ExecCGI
</Location>
A script containing a print Data::Dumper::Dumper(\%ENV) line prints:
$VAR1 = {
'GATEWAY_INTERFACE' => 'CGI-Perl/1.1',
'MOD_PERL' => 'mod_perl/1.21_01-dev',
'PATH' => '/usr/lib/perl5/5.00503:... snipped ...',
'TEST' => 'hi'
};
A proxy gives you a great performance increase in most cases. It is discussed in the Adding a Proxy Server in http Accelerator Mode section.
(META: complete the full description)
HTML::Mason is a system that makes use of components to build final html pages.
HTML::Mason can really improve the performance of your service and diminish the load on the system when most of the output is generated dynamically, but each final page can be separated into different components, and those components cached.
So if you have a page consisting of five components, each generated by an SQL query, but for four of the components the query is the same for every user, you don't have to rerun these queries again and again. Only the fifth component, which is generated by a unique query every time, will not use the cache.
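The idea can be illustrated with a plain-Perl sketch. This is not HTML::Mason's API: the component names, the run_query() stand-in for an SQL query and the in-memory hash cache are all assumptions made up for the illustration.

```perl
use strict;
use warnings;

my %cache;             # component output, keyed by component name
my $queries_run = 0;   # counts how often the "expensive" query runs

# Hypothetical stand-in for an expensive SQL query.
sub run_query {
    my ($name) = @_;
    $queries_run++;
    return "<div>$name</div>";
}

# Return a component's output, reusing the cached copy when allowed.
sub component {
    my ($name, $cacheable) = @_;
    return run_query($name) unless $cacheable;
    $cache{$name} = run_query($name) unless exists $cache{$name};
    return $cache{$name};
}

# Four components are the same for every user, the fifth is unique.
sub build_page {
    return join "", (map { component($_, 1) } qw(header menu news footer)),
                    component("personal", 0);
}

build_page() for 1 .. 3;
print "Queries run for 3 pages: $queries_run\n";
```

Without the cache, three page views would run fifteen queries; here the four cacheable components are queried once each and only the unique component is queried on every view.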
If your mod_perl server's httpd.conf includes the following directives:
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
you incur a real performance penalty, since after completing the processing of each request, the process will wait up to KeepAliveTimeout seconds for another request on the same connection before closing it, and will not serve other requests during that time. You will need many more processes on a server with high traffic.
Chances are that you don't want this feature enabled. Set it Off with:
KeepAlive Off
The other two directives don't matter anymore.
You might want to consider enabling this option if the client's browser needs to fetch more than one object from your server at once (for a single HTML page). In this situation you save the connection overhead for all requests but the first one.
For example, if you have a page with 10 ad banners, which is not uncommon today, your server will work more effectively if a single process serves them all during a single connection. Your client will get a slightly slower response, since the banners will be fetched one at a time rather than all together, as would happen if each IMG tag opened a separate connection.
There are definite advantages to keep-alive from a TCP perspective, since fresh connections incur not only the three-way TCP handshake but are also penalised by slow-start. So while turning it off may help the memory usage on the server, it will disadvantage the client from a network speed perspective.
You have probably followed the advice of sending all the static object requests to a plain Apache server. Since most pages include more than one unique static image, you had better keep the default setting of the non-mod_perl server, which has the KeepAlive directive On. Reducing the timeout by a few seconds is probably a good idea too.
One option I suppose would be for the proxy/accelerator to keep the connection open to the client but make individual connections to the server, read the response, buffer it for sending to the client and close the server connection (making new connections to the server as required by the client requests obviously).
If some particular script's main functionality is uploading or downloading big files, you probably want it to be executed on a plain Apache server under mod_cgi, provided of course that the script requires none of the functionality the mod_perl server provides, such as custom authentication handlers.
You don't want to tie up your precious mod_perl backend server children doing something as long and dumb as transferring a file.
Also, the user won't really see any important performance benefits from mod_perl anyway, since the upload may take up to several minutes, and the overhead saved by mod_perl is typically under one second.
Generally you should not fork from your mod_perl scripts, since when you do, you are forking the entire Apache web server, lock, stock and barrel. Not only is your perl code being duplicated, but so are mod_ssl, mod_rewrite, mod_log_config, mod_proxy, mod_speling or whatever modules you have used in your server, all the core routines and so on.
A much wiser approach would be to spawn a sub-process, hand it the information it needs to do the task, and have it detach (close x3 + setsid()). This is wise only if the parent that spawns this process continues immediately and does not wait for the sub-process to complete. This approach is suitable when you want to trigger a long-running process through the web interface, like processing some data, sending email to thousands of subscribed users, etc. Otherwise, you should convert the code into a module, and call its functions or methods from the CGI script.
Just making a system() call defeats the whole idea behind mod_perl, since the perl interpreter and modules have to be loaded again for this external program to run.
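Basically, you would do:
use FreezeThaw ();
# @data holds all the data to pass to the other process
my $params = FreezeThaw::freeze(@data);
# list form, so that no shell interprets the frozen string
system("program.pl", $params);
and in program.pl :
use POSIX qw(setsid);
use FreezeThaw ();
my @params = FreezeThaw::thaw(shift @ARGV);
# check that @params is ok
close STDIN;
close STDOUT;
close STDERR;
# you might need to reopen the STDERR
# open STDERR, ">/dev/null";
# fork, so that the system() call in the caller returns right away
defined(my $pid = fork) or exit 1;
exit 0 if $pid;
setsid(); # to detach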
At this point, program.pl is running in the ``background'' while the system() returns and permits Apache to get on with life.
This has obvious problems, not the least of which is that @params must not be bigger than whatever your system's limit on argument length is (the operating system's limit, and the shell's if one is involved).
Also, the communication is only one way.
However, you might be trying to do the ``wrong thing''. If what you want is to send information to the browser and then do some post-processing, look into PerlCleanupHandler.
If you are interested in the deeper details, this is what actually happens when you fork and make a system call, like
system("echo Hi"),CORE::exit(0) unless fork();
which might be more familiar in this form:
if (fork) {
    # do nothing
} else {
    system("echo Hi");
    CORE::exit(0);
}
What happens is that fork() gives you 2 execution paths and
the child gets virtual memory sharing a copy of the program text (read
only) and sharing a copy of the data space copy-on-write (remember why you
pre-load modules in mod_perl?). In the above code a parent will immediately
continue with the code that comes up after the fork, while the forked
process will execute system("echo Hi") and then terminate itself.
Notice that I use CORE::exit and not exit, which would be automatically overridden by Apache::exit if used in conjunction with Apache::Registry and friends.
The only work is setting up the page tables for the virtual memory and the second process goes on its separate way.
Next, Perl will find /bin/echo along the search path, and invoke it directly. Perl system()
is *not* system(3) [C-library]. Only when the command has shell meta-chars does Perl invoke a
real shell. That's a *very* nice optimization.
Only if you do:
system "sh -c 'echo foo'"
is your command actually parsed by a shell, so you exec() a copy of /bin/sh. Since one is almost certainly already running somewhere, the system will notice that (via the disk inode reference) and replace your virtual memory page table with one pointing at the already-loaded program code plus your own data space. Then the shell parses the passed command.
Since it is echo, the shell will execute it as a built-in in the latter example, or as /bin/echo in the former, and be done. But this is only an example. You aren't calling system("echo Hi") in your mod_perl scripts, right? Most real things (heavy programs executed as a subprocess) would involve repeating the process to load the specified command or script, and might involve some actual demand paging from the program file if you execute new code.
The only place you see real overhead from this scheme is when the parent
process is huge (unfortunately like mod_perl...) and the page table becomes
large as a side effect. The whole point of mod_perl is to avoid having to
fork() / exec() something on every hit, though.
Perl can do just about anything by itself. However, you probably won't get
in trouble until you hit about 30 forks/sec on a so-so pentium.
Now let's get to the gory details of forking. Normally, every process has a parent. Many processes are children of the init process, whose PID is 1. When you fork a process you must wait() or waitpid() for it to finish. If you don't wait for it, it becomes a zombie.
A zombie is a process that has terminated but whose exit status has not yet been collected by its parent. When the child quits, it reports the termination to its parent. If no one wait()s to collect the exit status of the child, it gets ``confused'' and becomes a ghost process that can be seen, but not killed. It goes away only when you stop the httpd process that spawned it! (Generally the top and ps utilities display these processes with the <defunct> tag, and you will see the zombies counter reported by top increase.) These zombie processes can take up system resources and are generally undesirable.
So the proper fork is:
print "Content-type: text/plain\n\n";
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    waitpid($kid,0);
    print "Parent has finished\n";
} else {
    # do something
    CORE::exit(0);
}
But in most cases the only reason you would want to fork is when you need to spawn a process that will take a long time to complete. So if the server child that spawns this process has to wait for it to finish, you have gained nothing. You can neither wait for its completion, nor continue without waiting, because then you get yet another zombie process.
The simplest solution is to ignore your dead children (this doesn't work everywhere, however) (META: do you know where? tell me!!! It does work with linux!):
$SIG{CHLD} = 'IGNORE';
When you set the CHLD signal handler to 'IGNORE', the system reaps terminated children automatically, so they never become zombies.
Note that you cannot localize this setting with local(). If you do, it does not have the desired effect. (META: anyone to explain why? It doesn't work...)
The other thing that you must do is to close the file handles connected to the client socket that were opened by the parent process (STDIN and STDOUT) and inherited by the child, so the parent will be able to complete the request and free itself for serving other requests. You may need to close and reopen the STDERR filehandle (it is opened to append to the error_log file, as inherited from the parent, so chances are that you want to leave it untouched).
So now the code would look like:
print "Content-type: text/plain\n\n";
$SIG{CHLD} = 'IGNORE';
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # the parent does not wait for the kid
    print "Parent has finished\n";
} else {
    close STDIN;
    close STDOUT;
    close STDERR;
    # do something long lasting
    CORE::exit(0);
}
Another more portable, but slightly more expensive, solution is to use a double fork approach.
print "Content-type: text/plain\n\n";
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    waitpid($kid,0);
} else {
    defined (my $grandkid = fork) or die "Kid cannot fork: $!\n";
    if ($grandkid) {
        CORE::exit(0);
    } else {
        # code here
        close STDIN;
        close STDOUT;
        close STDERR;
        # do something long lasting
        CORE::exit(0);
    }
}
Grandkid becomes a "child of init" (parent process ID is 1).
Note that the last two solutions do allow you to know the exit status of the process, but in our case we don't want to.
One more solution is to use a different SIGCHLD handler:
use POSIX 'WNOHANG';
$SIG{CHLD} = sub { while( waitpid(-1,WNOHANG)>0 ) {} };
This is useful when you fork() more than one process. The handler could call wait() as well, but for a variety of reasons involving the handling of stopped processes and the rare event in which two children exit at nearly the same moment, the best technique is to call waitpid() in a tight loop with a first argument of -1 and a second argument of WNOHANG. Together these arguments tell waitpid() to reap the next child that's available, and prevent the call from blocking if there happens to be no child ready for reaping. The handler will loop until waitpid() returns a negative number or zero, indicating that no more reapable children remain.
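The handler can be tried outside of mod_perl with a small standalone script. This is only a sketch: the number of children, the $reaped counter and the polling loop at the end are assumptions added to make the effect visible, and a real child would do some work before exiting.

```perl
use strict;
use warnings;
use POSIX 'WNOHANG';

my $reaped = 0;

# Reap every child that has already exited, without blocking.
$SIG{CHLD} = sub {
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped++;
    }
};

my $children = 3;    # hypothetical number of children
for my $n (1 .. $children) {
    defined(my $kid = fork) or die "Cannot fork: $!\n";
    next if $kid;    # parent: go on forking
    # child: do the real work here, then leave
    CORE::exit(0);
}

# Demo only: give the handler a few seconds to collect everybody.
for (1 .. 10) {
    last if $reaped >= $children;
    sleep 1;         # returns early when a signal arrives
}
print "Reaped $reaped of $children children\n";
```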
You will probably want to open your own log file in the spawned process and log some info so you know what has happened there, at least while debugging your code.
Check also Apache::SubProcess for a better system and exec
implementations for mod_perl (use CPAN!). META: some docs regarding this
module?
Scripts under mod_perl can very easily leak memory! Global variables stay around indefinitely; lexical variables (declared with my()) are destroyed when they go out of scope, provided there are no references to them from outside of that scope.
Perl doesn't return the memory it acquired from the kernel. It does reuse it though!
First example demonstrates reading in a whole file:
open IN, $file or die $!;
local $/ = undef; # will read the whole file in
$content = <IN>;
close IN;
If your file is 5Mb, the child that serves the script will grow by about that size. Now if you have 20 children and all of them serve this CGI, together they will consume an additional 20*5Mb = 100Mb of RAM! If that's the case, try to process the file in another way, if possible of course. Try to process a line at a time and print it back to the file. (If you need to modify the file itself, use a temporary file. When finished, overwrite the source file, and make sure to provide a locking mechanism!)
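The line-at-a-time approach with a temporary file could be sketched as follows. This is only an illustration under a few assumptions: rewrite_by_line() and its callback are hypothetical names, the lock is a plain advisory flock(), and the temporary file is created in the same directory as the source so that rename() stays on one filesystem.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# Rewrite $file one line at a time, so only the current line is in memory.
# $transform is a hypothetical callback: it gets a line, returns the new line.
sub rewrite_by_line {
    my ($file, $transform) = @_;

    open my $in, '+<', $file or die "Cannot open $file: $!";
    flock($in, LOCK_EX)      or die "Cannot lock $file: $!";   # advisory lock

    # The temporary file lives next to the source file.
    my ($out, $tmp) = tempfile(DIR => dirname($file));

    while (my $line = <$in>) {
        print {$out} $transform->($line);
    }
    close $out or die "Cannot close $tmp: $!";

    # Overwrite the source file with the processed copy.
    rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
    close $in;    # releases the lock
}

# Demo on a throwaway file (removed automatically at exit).
my ($fh, $demo) = tempfile(UNLINK => 1);
print {$fh} "foo one\nfoo two\n";
close $fh;

rewrite_by_line($demo, sub { my $line = shift; $line =~ s/foo/bar/; return $line });
```

The process grows only by the size of the longest line, no matter how big the file is.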
Second example demonstrates copying variables between functions (passing variables by
value). Let's use the example above, assuming we have no choice but to read
the whole file before any data processing takes place. Now imagine you have some
process() subroutine that processes the data and returns it. What happens if you
pass
$content by value? You have just made another 5Mb copy and the child has grown by
another 5Mb (watch your swap space!). Multiply it again by a factor
of 20 and you have 200Mb of wasted RAM. The memory will be reused later, but it's
still a waste! Whenever you think a variable can grow bigger than a few Kb, pass
it by reference!
Once I wrote a script that passed the contents of a little flat-file database by value to a function that processed it. It worked and it was fast, but with time the database grew bigger, so passing it by value became too expensive. I had to decide whether to buy more memory or to rewrite the code. Obviously, adding more memory would be merely a temporary solution. So it's better to plan ahead and pass variables by reference if a variable you are going to pass might become bigger than you expect at the time of coding. There are a few approaches you can use to pass and use variables passed by reference. For example:
my $content = qq{foobarfoobar};
process(\$content);
sub process{
my $r_var = shift;
$$r_var =~ s/foo/bar/gs;
# nothing returned - the variable $content outside has been
# already modified
}
@{$var_lr} -- dereferences an array
%{$var_hr} -- dereferences a hash
For more info see perldoc perlref.
Another approach is to use the @_ array directly. Working on the elements of @_ does the job of passing by reference!
process($content);
sub process{
$_[0] =~ s/foo/bar/gs;
# nothing returned - the variable $content outside has been
# already modified
}
From perldoc perlsub:
The array @_ is a local array, but its elements are aliases for
the actual scalar parameters. In particular, if an element
$_[0] is updated, the corresponding argument is updated (or an
error occurs if it is not possible to update)...
Be careful when you write this kind of subroutine, since it can confuse a
potential user. It's not obvious that a call like
process($content); modifies the passed variable -- programmers (who are the users of your
library in this case) are used to subs that either modify variables passed
by reference or return the processed variable (e.g. $content=process($content);).
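To compare the three calling styles side by side, here is a small sketch. The sub names by_value(), by_ref() and by_alias() are made up for this comparison; only the first one copies the data.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this comparison.
sub by_value { my $copy = shift; $copy =~ s/foo/bar/g; return $copy }  # works on a copy
sub by_ref   { my $r = shift;    $$r   =~ s/foo/bar/g; return }        # edits the caller's data
sub by_alias { $_[0] =~ s/foo/bar/g; return }                         # edits it through @_

my $original = "foobarfoobar";

my $val = $original;
my $returned = by_value($val);   # $val is left untouched, the result is a new string

my $ref = $original;
by_ref(\$ref);                   # $ref itself is modified

my $ali = $original;
by_alias($ali);                  # $ali itself is modified, though the call looks innocent

print "by value:     $val -> $returned\n";
print "by reference: $ref\n";
print "through \@_:   $ali\n";
```

The by_ref() call makes the side effect visible at the call site (the backslash), which is why I would prefer it over the @_ trick in a library.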
Third example demonstrates working with databases. If you do some DB processing, you will often need to read lots of records into your program and then print them to the browser after they are formatted. (I won't even mention the horrible case where programmers read in the whole DB and then use perl to process it!!! Use a relational DB and let the SQL do the job, so you get only the records you need!!!)
We will use DBI for this (assume that we are already connected to the DB) (refer to perldoc DBI for a complete manual of the DBI
module):
$sth->execute;
while (@row_ary = $sth->fetchrow_array) {
  # <do DB accumulation into some variable>
}
# <print the output using the data returned from the DB>
In the example above the httpd process will grow by the size of the variables that have been allocated for the records that matched the query. (Again, remember to multiply it by the number of children your server runs!)
A better approach is to not accumulate the records, but rather print them
as they are fetched from the DB. Moreover, we will use the
bind_columns() and $sth->fetchrow_arrayref() (aliased to
$sth->fetch()) methods, to fetch the data in the fastest possible way. The example below
prints an HTML TABLE with the matched data; the only memory that is being used
is the @cols array that holds the temporary row values:
my @select_fields = qw(a b c);
# create a list of cols values
my @cols = ();
@cols[0..$#select_fields] = ();
$sth = $dbh->prepare($do_sql);
$sth->execute;
# Bind perl variables to columns.
$sth->bind_columns(undef,\(@cols));
print "<TABLE>";
while($sth->fetch) {
  print "<TR>",
        map("<TD>$_</TD>", @cols),
        "</TR>";
}
print "</TABLE>";
Note: the above method doesn't allow you to know how many records have been
matched. The workaround is to run an identical query before the code above,
where you use 'SELECT count(*) ...' instead of 'SELECT *
...' to get the number of matched records. It should be much faster, since you
can remove any ORDER BY and similar clauses.
For those who think that $sth->rows will do the job, here is the quote from the DBI manpage:
rows();
$rv = $sth->rows;
Returns the number of rows affected by the last database altering command, or -1 if not known or not available. Generally you can only rely on a row count after a do or non-select execute (for some specific operations like update and delete) or after fetching all the rows of a select statement.
For select statements it is generally not possible to know how many rows will be returned except by fetching them all. Some drivers will return the number of rows the application has fetched so far but others may return -1 until all rows have been fetched. So use of the rows method with select statements is not recommended.
As a bonus, I wanted to write a single sub that flexibly processes any query, accepting: conditions, call-back closure sub, select fields and restrictions.
# Usage:
# $o->dump(\%conditions,\&callback_closure,\@select_fields,@restrictions);
#
sub dump{
my $self = shift;
my %param = %{+shift}; # dereference hash
my $rsub = shift;
my @select_fields = @{+shift}; # dereference list
my @restrict = @_; # whatever is left is the restriction list
# create a list of cols values
my @cols = ();
@cols[0..$#select_fields] = ();
my $do_sql = '';
my @where = ();
# make a @where list
for (keys %param) { push @where, "$_=" . $dbh->quote($param{$_}) if $param{$_} }
# prepare the sql statement
$do_sql = "SELECT ";
$do_sql .= join(" ", @restrict) if @restrict;# append the restriction list
$do_sql .= " " .join(",", @select_fields) ; # append the select list
$do_sql .= " FROM $DBConfig{TABLE} "; # from table
# we will not add the WHERE clause if @where is empty
$do_sql .= " WHERE " . join " AND ", @where if @where;
print "SQL: $do_sql \n" if $debug;
$dbh->{RaiseError} = 1; # do this, or check every call for errors
$sth = $dbh->prepare($do_sql);
$sth->execute;
# Bind perl variables to columns.
$sth->bind_columns(undef,\(@cols));
while($sth->fetch) {
&$rsub(@cols);
}
# print the tail or "no records found" message
# according to the previous calls
&$rsub();
} # end of sub dump
Now a callback closure sub can do lots of things. We need a closure to know what stage we are in: header, body or tail. For example, we want a callback closure for formatting the rows to print:
my $rsub = eval {
# make a copy of @fields list, since it might go
# out of scope when this closure will be called
my @fields = @fields;
my @query_fields = qw(user dir tool act); # no date field!!!
my $header = 0;
my $tail = 0;
my $counter = 0;
my %cols = (); # columns name=> value hash
# Closure with the following behavior:
# 1. Header's code will be executed on the first call only and
# if @_ was set
# 2. Row's printing code will be executed on every call with @_ set
# 3. Tail's code will be executed only if Header's code was
# printed and @_ isn't set
# 4. "No record found" code will be executed if Header's code
# wasn't executed
sub {
# Header
if (@_ and !$header){
print "<TABLE>\n";
print $q->Tr(map{ $q->td($_) } @fields );
$header = 1;
}
# Body
if (@_) {
print $q->Tr(map{$q->td($_)} @_ );
$counter++;
return;
}
# Tail, will be printed only at the end
if ($header and !($tail or @_)){
print "</TABLE>\n $counter records found";
$tail = 1;
return;
}
# No record found
unless ($header){
print $q->p($q->center($q->b("No record was found!\n")));
}
} # end of sub {}
}; # end of my $rsub = eval {
You might also want to check Limiting the size of the processes and Limiting the resources used by httpd children.
Newer Perl versions also have build-time options to reduce runtime memory consumption. These options might shrink the size of your httpd by about 150k (quite a big number if you remember to multiply it by the number of children you use).
-DTWO_POT_OPTIMIZE macro improves allocations of data with size close to a power of two; but
this works for big allocations (starting with 16K by default). Such
allocations are typical for big hashes and special-purpose scripts,
especially image processing.
Perl memory allocation is by bucket, with sizes close to powers of two.
Because of this, malloc overhead may be big, especially for data of size
exactly a power of two. If PACK_MALLOC is defined, perl uses a slightly different algorithm for small allocations
(up to 64 bytes long), which makes it possible to have overhead down to 1
byte for allocations which are powers of two (and appear quite often).
Expected memory savings (with 8-byte alignment in alignbytes) are about 20% for typical Perl usage. Expected slowdown due to additional
malloc overhead is in fractions of a percent (hard to measure, because of
the effect of saved memory on speed).
You will find these and other memory improvement details in
perl5004delta.pod.
Important: both options are On by default in perl versions 5.005 and higher.
Under Apache::Registry the requested CGI script is always being
stat()'ed to check whether it was modified. It adds very little overhead, but
if you are into squeezing all the juice out of the server, you might want to
save this call. If you do -- take a look at the
Apache::RegistryBB module.
When you do a stat() or one of its variations (-M -- modification time,
-A -- last access time, -C -- inode-change time, and others), the information is cached, so if you
need to make an additional check on the same file, save the overhead of
another system call and use the
_ filehandle instead. For example, when testing for existence and read
permissions you might use:
my $filename = "./test";
# three stat() calls: -e, -r and -M
print "OK\n" if -e $filename and -r $filename;
my $mod_time = (-M $filename) * 24 * 60 * 60;
print "$filename was modified $mod_time seconds ago\n";
or the more efficient version (two stat() syscalls saved):
my $filename = "./test";
# only one stat() call
print "OK\n" if -e $filename and -r _;
my $mod_time = (-M _) * 24 * 60 * 60;
print "$filename was modified $mod_time seconds ago\n";
Remember that with mod_perl you might get negative times when you use
-M and similar file tests. -M returns the difference between the file's modification time and the start
time of the script that performs the check. Because the $^T variable is not reset on each script invocation, and is equal to the
time the process was forked, you might want to perform:
$^T = time();
at the beginning of your scripts to get the regular perl script behaviour of file tests.
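The effect is easy to reproduce outside of mod_perl. The sketch below fakes a long-lived process by moving $^T one hour into the past (the one hour figure and the temporary file are assumptions made for the demo), and then shows what resetting $^T does to the -M test:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Pretend this process was forked an hour ago, as a long-lived
# mod_perl child would have been.
$^T = time() - 3600;

# A file created "now" is newer than $^T ...
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "x";
close $fh;

my $stale_age = -M $file;   # ... so its age in days comes out negative

$^T = time();               # what a script would do at the top of each request
my $fresh_age = -M $file;   # now zero or slightly positive

printf "before reset: %.5f days, after reset: %.5f days\n",
    $stale_age, $fresh_age;
```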
As you know, Apache::Registry caches the scripts based on their URI. The same script can
be reached by different URIs, for example if you have used a symbolic link
like:
% ln -s /home/httpd/perl/news/news.pl /home/httpd/perl/news.pl
Now the script can be reached through both the /perl/news/news.pl and /perl/news.pl
URIs. It doesn't really matter until you advertise the two URIs and users
reach the same script through both of them. The moment this happens, you will
get the same script cached twice!
To detect it, use the /perl-status handler to see all the compiled scripts and their packages. In our example, when requesting http://localhost/perl-status?rgysubs you would see:
Apache::ROOT::perl::news::news_2epl
Apache::ROOT::perl::news_2epl
after both URIs have been requested from the same child process. To make the debugging easier, run the server in single-process mode.
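Why does one file end up as two cache entries? Because the package name is derived from the URI, not from the file. The sketch below is a simplified imitation of such a naming scheme, written only to reproduce the two names shown above; uri_to_package() is a hypothetical helper and not the module's actual code.

```perl
use strict;
use warnings;

# Simplified sketch of deriving a package name from a URI.
# Hypothetical helper, not Apache::Registry's own implementation.
sub uri_to_package {
    my $uri = shift;
    # Every character that is not valid in an identifier (other than "/")
    # becomes _XX, where XX is its hex code: "." turns into "_2e".
    $uri =~ s/([^A-Za-z0-9_\/])/sprintf("_%02x", ord $1)/eg;
    # Each run of slashes becomes a package separator.
    $uri =~ s{/+}{::}g;
    return "Apache::ROOT$uri";
}

# Two URIs that point at the same file on disk.
my @uris = ('/perl/news/news.pl', '/perl/news.pl');

my %cache;
$cache{ uri_to_package($_) } = "compiled code for $_" for @uris;

print "$_\n" for sort keys %cache;
print scalar(keys %cache), " cache entries for one script\n";
```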
Apache::SizeLimit allows you to kill off Apache httpd processes if they grow too large. See
perldoc Apache::SizeLimit for more details.
By using this module, you should be able to discontinue using the Apache
configuration directive MaxRequestsPerChild, although for some folks, using both in combination does the job.
Apache::Resource uses the BSD::Resource module, which uses the C function setrlimit() to set limits on system resources such as memory and cpu usage.
To configure use:
PerlModule Apache::Resource
# set child memory limit in megabytes
# (default is 64 Meg)
PerlSetEnv PERL_RLIMIT_DATA 32:48
# set child CPU limit in seconds
# (default is 360 seconds)
PerlSetEnv PERL_RLIMIT_CPU 120
PerlChildInitHandler Apache::Resource
If you configure Apache::Status, it will let you review the resources set this way.
The following limit values are in megabytes: DATA, RSS,
STACK, FSIZE, CORE, MEMLOCK; all others are treated as their natural unit. Prepend PERL_RLIMIT_ for each one you want to use. Refer to setrlimit man page on your OS for other possible resources.
If the value of the variable is of the form S:H, S is treated as the soft limit, and H is the hard limit. If it is just a single number, it is used for both soft
and hard limits.
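The S:H rule can be illustrated with a few lines of Perl. This is only a sketch of the rule as described above, not the module's own code: parse_limit() is a hypothetical helper, and I assume that one megabyte means 1024*1024 bytes.

```perl
use strict;
use warnings;

# Illustration of the S:H rule only; a hypothetical helper.
sub parse_limit {
    my ($value, $in_megabytes) = @_;
    my ($soft, $hard) = split /:/, $value, 2;
    $hard = $soft unless defined $hard;           # single number: soft == hard
    my $unit = $in_megabytes ? 1024 * 1024 : 1;   # assumed size of a megabyte
    return ($soft * $unit, $hard * $unit);
}

# The values from the configuration example above.
my ($data_soft, $data_hard) = parse_limit('32:48', 1);  # DATA is given in megabytes
my ($cpu_soft,  $cpu_hard)  = parse_limit('120',   0);  # CPU is given in seconds

print "DATA: soft $data_soft bytes, hard $data_hard bytes\n";
print "CPU:  soft $cpu_soft secs, hard $cpu_hard secs\n";
```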
To debug add:
<Perl>
$Apache::Resource::Debug = 1;
require Apache::Resource;
</Perl>
PerlChildInitHandler Apache::Resource
and look in the error_log to see what it's doing.
Refer to perldoc Apache::Resource and man 2 setrlimit for more info.
A limitation of using pattern matching to identify robots is that it only catches the robots that you know about, and only those that identify themselves by name. A few devious robots masquerade as users by using user agent strings that identify themselves as conventional browsers. To catch such robots, you'll have to be more sophisticated.
Apache::SpeedLimit comes to your aid, see:
http://www.modperl.com/chapters/ch6.html#Blocking_Greedy_Clients
How much faster is mod_perl than mod_cgi (aka plain perl/CGI)? There are
many ways to benchmark the two. I'll present a few examples and numbers
below. Check out the benchmark directory of the mod_perl distribution for more examples.
If you are going to write your own benchmarking utility -- use the
Benchmark module for heavy scripts and the Time::HiRes module for very fast scripts (faster than 1 sec) where you need better time
precision.
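For the Benchmark module, a minimal sketch could look like this. The workload is a made-up stand-in for your own code, and the iteration count of 200 is arbitrary; timeit() and timestr() are the module's standard functions.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# A stand-in for the code under test (hypothetical workload).
my $checksum = 0;
my $work = sub { $checksum += $_ for 1 .. 1000 };

# Run the code 200 times and collect the timings.
my $t = timeit(200, $work);

print "200 iterations took: ", timestr($t), "\n";
```

With a workload this light the reported times will be close to zero, which is the situation where Time::HiRes becomes the better tool.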
There is no need to write a special benchmark though. If you want to
impress your boss or colleagues, just take some heavy CGI script you have
(e.g. a script that crunches some data and prints the results to STDOUT),
open 2 xterms and call the same script in mod_perl mode in one xterm and in
mod_cgi mode in the other. You can use lwp-get
from LWP package to emulate the web agent (browser). (benchmark
directory of mod_perl distribution includes such an example)
See also 2 tools for benchmarking: ApacheBench and crashme test
Perrin Harkins writes on benchmarks or comparisons, official or unofficial:
I have used some of the platforms you mentioned and researched others. What I can tell you for sure, is that no commercially available system offers the depth, power, and ease of use that mod_perl has. Either they don't let you access the web server internals, or they make you use less productive languages than Perl, sometimes forcing you into restrictive and confusing APIs and/or GUI development environments. None of them offer the level of support available from simply posting a message to this list, at any price.
As for performance, beyond doing several important things (code-caching, pre-forking/threading, and persistent database connections) there isn't much these tools can do, and it's mostly in your hands as the developer to see that the things which really take the time (like database queries) are optimized.
The downside of all this is that most manager types seem to be unable to believe that web development software available for free could be better than the stuff that cost $25,000 per CPU. This appears to be the major reason most of the web tools companies are still in business. They send a bunch of suits to give PowerPoint presentations and hand out glossy literature to your boss, and you end up with an expensive disaster and an approaching deadline.
But I'm not bitter or anything...
Jonathan Peterson adds:
Most of the major solutions have something that they do better than the others, and each of them has faults. Microsoft's ASP has a very nice objects model, and has IMO the best data access object (better than DBI to use - but less portable) It has the worst scripting language. PHP has many of the advantages of Perl-based solutions, but is less complicated for developers. Netscape's Livewire has a good object model too, and provides good server-side Java integration - if you want to leverage Java skills, it's good. Also, it has a compiled scripting language - which is great if you aren't selling your clients the source code (and a pain otherwise).
mod_perl's advantage is that it is the most powerful. It offers the greatest degree of control with one of the more powerful languages. It also offers the greatest granularity. You can use an embedding module (eg eperl) from one place, a session module (Session) from another, and your data access module from yet another.
I think the Apache::ASP module looks very promising. It has very easy to use and adequately powerful state maintenance, a good embedding system, and a sensible object model (that emulates the Microsoft ASP one). It doesn't replicate MS's ADO for data access, but DBI is fine for that.
I have always found that the developers available make the greatest impact on the decision. If you have a team with no Perl experience, and a small or medium task, using something like PHP, or Microsoft ASP, makes more sense than driving your staff into the vertical learning curve they'll need to use mod_perl.
For very large jobs, it may be worth finding the best technical solution, and then recruiting the team with the necessary skills.
Here are the numbers from Michael Parker's mod_perl presentation at Perl Conference (Aug, 98) (Sorry there used to be links here to the source, but they went dead one day, so I removed them). The script is a standard hits counter, but it logs the counts into the mysql relational DataBase:
Benchmark: timing 100 iterations of cgi, perl... [rate 1:28]
cgi: 56 secs ( 0.33 usr 0.28 sys = 0.61 cpu)
perl: 2 secs ( 0.31 usr 0.27 sys = 0.58 cpu)
Benchmark: timing 1000 iterations of cgi,perl... [rate 1:21]
cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu)
perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)
Benchmark: timing 10000 iterations of cgi, perl [rate 1:21]
cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu)
perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu)
We don't know what server configuration was used for these tests, but I guess the numbers speak for themselves.
The source code of the script was available at (http://www.realtime.net/~parkerm/perl/conf98/sld006.htm ) - it's a dead link - if you know its new location, please let me know....
As noted before, for very fast scripts you will have to use the
Time::HiRes module; its usage is similar to Benchmark's.
use Time::HiRes qw(gettimeofday tv_interval);
my $start_time = [ gettimeofday ];
&sub_that_takes_a_teeny_bit_of_time();
my $end_time = [ gettimeofday ];
my $elapsed = tv_interval($start_time,$end_time);
print "the sub took $elapsed secs.";
See also crashme test.
At http://perl.apache.org/dist/contrib/
you will find
Apache::Timeit package which does PerlHandler's Benchmarking.
It's very important to configure the
MinSpareServers, MaxSpareServers, StartServers,
MaxClients, and MaxRequestsPerChild parameters correctly. There are no values that suit every site, and the values of these variables are very
important: if they are too ``low'' you will under-use the system's capabilities,
and if they are too ``high'' chances are that the server will bring the machine to its
knees.
All the above parameters should be specified on the basis of the resources
you have. With a plain apache server, it's no big deal if you run
many servers (within reason, of course), since the processes are about 1Mb each
and don't eat a lot of your RAM. Generally the numbers are even smaller
if memory sharing is taking place. The situation is different with
mod_perl. I have seen mod_perl processes of 20Mb and more. Now if you have MaxClients set to 50: 50x20Mb = 1Gb -- do you have 1Gb of RAM? Probably not. So how do
you tune these parameters? Generally by trying different combinations and
benchmarking the server. Again, mod_perl processes can be much smaller
if sharing is in place.
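The memory arithmetic is worth doing before any benchmarking. Here is a back-of-the-envelope sketch; apart from the 20Mb process size and MaxClients of 50 taken from the text, every figure (total RAM, the amount reserved for the OS) is an assumption, and memory sharing is ignored, so the result is a pessimistic estimate.

```perl
use strict;
use warnings;

# All figures are assumptions for the sake of the arithmetic.
my $total_ram_mb = 1024;  # RAM installed in the machine
my $reserved_mb  = 224;   # left for the OS and other services (hypothetical)
my $process_mb   = 20;    # size of one mod_perl child, as in the text

# RAM needed if all 50 children are alive at once.
my $needed_for_50 = 50 * $process_mb;

# The biggest MaxClients that fits into the remaining RAM.
my $max_clients = int(($total_ram_mb - $reserved_mb) / $process_mb);

print "MaxClients 50 needs ${needed_for_50}Mb\n";
print "RAM allows at most $max_clients children\n";
```

Use the number only as a starting point for the benchmarks; the real process sizes have to be measured on your own server.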
Before you start this task you should be armed with a proper weapon. You
need a crashme utility, which will load your server with the mod_perl scripts you possess. It
must be able to emulate a multiuser environment, with
multiple clients calling the mod_perl scripts on
your server simultaneously. While there are commercial solutions, you can
get away with free ones which do the same job. You can use the
ApacheBench ab utility that comes with the apache distribution, a crashme script which uses
LWP::Parallel::UserAgent, or httperf (see Download page).
Another important issue is to make sure to run the testing client (load generator) on a system that is more powerful than the system being tested. After all, we are trying to simulate Internet users, where many users are trying to reach your service at once -- since the number of concurrent users can be quite large, your testing machine must be very powerful and capable of generating a heavy load. Of course you should not run the clients and the server on the same machine. If you do, your test results will be incorrect, since the clients will eat CPU and memory that should be dedicated to the server, and vice versa.
See also 2 tools for benchmarking: ApacheBench and crashme test
ab is a tool for benchmarking your Apache HTTP server. It is designed to give you an impression of how much performance your current Apache installation can give. In particular, it shows you how many requests per second your Apache server is capable of serving. The ab tool comes bundled with the apache source distribution (and it's free :).
Let's try it. We will simulate 10 users concurrently requesting a very
light script at www.nowhere.com:81/test/test.pl. Each ``user'' makes 10 requests.
% ./ab -n 100 -c 10 www.nowhere.com:81/test/test.pl
The results are:
Concurrency Level: 10
Time taken for tests: 0.715 seconds
Complete requests: 100
Failed requests: 0
Non-2xx responses: 100
Total transferred: 60700 bytes
HTML transferred: 31900 bytes
Requests per second: 139.86
Transfer rate: 84.90 kb/s received
Connection Times (ms)
min avg max
Connect: 0 0 3
Processing: 13 67 71
Total: 13 67 74
The only numbers we really care about are:
Complete requests: 100
Failed requests: 0
Requests per second: 139.86
Let's raise the load of requests to 100 x 10 (10 users, each makes 100 requests):
% ./ab -n 1000 -c 10 www.nowhere.com:81/perl/access/access.cgi
Concurrency Level: 10
Complete requests: 1000
Failed requests: 0
Requests per second: 139.76
As expected nothing changes -- we have the same 10 concurrent users. Now let's raise the number of concurrent users to 50:
% ./ab -n 1000 -c 50 www.nowhere.com:81/perl/access/access.cgi
Complete requests: 1000
Failed requests: 0
Requests per second: 133.01
We see that the server is capable of serving 50 concurrent users at an
amazing 133 req/sec! Let's find the upper boundary. Using -n 10000
-c 1000 failed to get results (Broken Pipe?). Using -n 10000 -c
500 gave 94.82 req/sec. The server's performance went down under the high
load.
The above tests were performed with the following configuration:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 50
MaxRequestsPerChild 1500
Now let's kill a child after a single request, we will use the following configuration:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 100
MaxRequestsPerChild 1
Simulate 50 users each generating a total of 20 requests:
% ./ab -n 1000 -c 50 www.nowhere.com:81/perl/access/access.cgi
The benchmark timed out with the above configuration.... I watched the
output of ps as I ran it; the parent process just wasn't capable of respawning the
killed children at that rate... When I raised
MaxRequestsPerChild to 10 I got 8.34 req/sec -- very bad (about 16 times slower!) (You can't
benchmark the importance of
MinSpareServers, MaxSpareServers and StartServers with this kind of test).
Now let's try to return MaxRequestsPerChild to 1500, but to lower the
MaxClients to 10 and run the same test:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 10
MaxRequestsPerChild 1500
I've got 27.12 req/sec, which is better but still 4-5 times slower (133
with MaxClients of 50)
Summary: I have tested a few combinations of server configuration variables (MinSpareServers MaxSpareServers StartServers
MaxClients MaxRequestsPerChild). The results are as follows:
MinSpareServers, MaxSpareServers and StartServers are only important for user response times (sometimes a user will have to
wait a bit).
The important parameters are MaxClients and
MaxRequestsPerChild. MaxClients should not be so big that it abuses your machine's memory resources,
and not so small that users are forced to wait for children to
become free to serve them. MaxRequestsPerChild should be as big as possible, to take full benefit of mod_perl, but
watch your server at the beginning to make sure your scripts are not
leaking memory, thereby causing your server (and your service) to die very
fast.
Also it is important to understand that we didn't test the response times in the tests above, but the ability of the server to respond under a heavy load of requests. If the script that was used to test was heavier, the numbers would be different but the conclusions are very similar.
The benchmarks were run with:
HW: RS6000, 1Gb RAM
SW: AIX 4.1.5 . mod_perl 1.16, apache 1.3.3
Machine running only mysql, httpd docs and mod_perl servers.
Machine was _completely_ unloaded during the benchmarking.
After each server restart when I did changes to the server's configurations, I made sure the scripts were preloaded by fetching a script at least once by every child.
It is important to notice that none of the requests timed out, even if a request was kept in the server's queue for more than 1 minute! (That is the way ab works, which is OK for testing purposes but would be unacceptable in the real world -- users will not wait more than 5-10 secs for a request to complete, and the client (browser) will time out in a few minutes.)
Now let's take a look at some real code whose execution time is more than a few millisecs. We will do real testing and collect the data in tables for easier viewing.
I will use the following abbreviations:
NR   = Total Number of Requests
NC   = Concurrency
MC   = MaxClients
MRPC = MaxRequestsPerChild
RPS  = Requests per second
Running a mod_perl script with lots of mysql queries (the script under test is mysqld-bound) (http://www.nowhere.com:81/perl/access/access.cgi?do_sub=query_form), with configuration:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
MaxRequestsPerChild 5000
gives us:
NR NC RPS comment
------------------------------------------------
10 10 3.33 # not a reliable statistics
100 10 3.94
1000 10 4.62
1000 50 4.09
Conclusions: Here I wanted to show that when the application is slow -- not due to perl loading, code compilation and execution, but because it is bound to some external operation, like the mysqld querying that made the bottleneck here -- it almost does not matter what load we place on the server. The RPS (Requests per second) is almost the same (given that all the requests have been served; you have the ability to queue the clients, but be aware that anything that goes into the queue means a waiting client, and a client (browser) that might time out!)
Now we will benchmark the same script without using mysql (perl-bound code only) (http://www.nowhere.com:81/perl/access/access.cgi); it's the same script, but it just returns an HTML form without making any SQL queries.
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
MaxRequestsPerChild 5000
NR NC RPS comment
------------------------------------------------
10 10 26.95 # not a reliable statistics
100 10 30.88
1000 10 29.31
1000 50 28.01
1000 100 29.74
10000 200 24.92
100000 400 24.95
Conclusions: This time the script we executed was pure perl (not bound to I/O or
mysql), so we see that the server serves the requests much faster. You can
see that the Requests Per Second (RPS) figure is almost the same for any load, but goes lower when the number of
concurrent clients goes beyond MaxClients. With 25 RPS, a load of 400 concurrent clients will
be served in 16 secs. But to be more realistic and assume a maximum
concurrency of 100, with 30 RPS, the clients will be served in about 3.3 secs,
which is pretty good for a highly loaded server.
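The waiting times quoted here are just the number of concurrent clients divided by the RPS. As a tiny sketch of that arithmetic (batch_seconds() is a hypothetical helper, and the model is deliberately crude: it assumes the measured rate stays constant while the whole batch is served):

```perl
use strict;
use warnings;

# Rough time to serve one full batch of concurrent clients:
# number of concurrent clients divided by requests per second.
sub batch_seconds {
    my ($concurrency, $rps) = @_;
    return $concurrency / $rps;
}

# The two cases discussed in the text: [concurrency, RPS].
for my $case ([400, 25], [100, 30]) {
    my ($nc, $rps) = @$case;
    printf "%3d clients at %2d RPS: %.1f secs\n", $nc, $rps, batch_seconds($nc, $rps);
}
```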
Now we will use the server for its full capacity, by keeping all
MaxClients alive all the time and having a big
MaxRequestsPerChild, so no server will be killed during the benchmarking.
MinSpareServers 50
MaxSpareServers 50
StartServers 50
MaxClients 50
MaxRequestsPerChild 5000
NR NC RPS comment
------------------------------------------------
100 10 32.05
1000 10 33.14
1000 50 33.17
1000 100 31.72
10000 200 31.60
Conclusion: In this scenario there is no overhead involving the parent server loading new children, all the servers are available, and the only bottleneck is contention for the CPU.
Now we will try to change the MaxClients and to watch the results: Let's reduce MC to 10.
MinSpareServers 8
MaxSpareServers 10
StartServers 10
MaxClients 10
MaxRequestsPerChild 5000
NR NC RPS comment
------------------------------------------------
10 10 23.87 # not a reliable statistics
100 10 32.64
1000 10 32.82
1000 50 30.43
1000 100 25.68
1000 500 26.95
2000 500 32.53
Conclusions: Very little difference! Almost no change! 10 servers were able to serve
with almost the same throughput as 50 servers. Why? My guess is that it's because
of CPU throttling. It seems that the 10 servers were serving requests 5 times
faster than the 50 servers in the test above. In the case
above each child received its CPU time slice 5 times less frequently. So
having a big value for
MaxClients doesn't mean that the performance will be better. You have just seen the
numbers!
Now we will start to drastically reduce the MaxRequestsPerChild:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
NR NC MRPC RPS comment
------------------------------------------------
100 10 10 5.77
100 10 5 3.32
1000 50 20 8.92
1000 50 10 5.47
1000 50 5 2.83
1000 100 10 6.51
Conclusions: When we drastically reduce the MaxRequestsPerChild, the performance starts to become closer to the plain mod_cgi. Just for
comparison with mod_cgi, here are the numbers of this run with mod_cgi:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
NR NC RPS comment
------------------------------------------------
100 10 1.12
1000 50 1.14
1000 100 1.13
Conclusion: mod_cgi is much slower :) In the test with NR/NC of 100/10, the RPS was 1.12 under mod_cgi and about 32 under mod_perl, which is roughly 30 times faster!!! In the first test each client waited about 100 secs to be served. In the second and third, 1000 secs!
httperf is a utility written by David Mosberger. Just like ApacheBench, it measures the performance of the webserver.
A sample command line is shown below:
httperf --server hostname --port 80 --uri /test.html \
        --rate 150 --num-conn 27000 --num-call 1 --timeout 5
This command causes httperf to use the web server on the host with IP name hostname, running at port 80. The web page being retrieved is /test.html and, in this simple test, the same page is retrieved repeatedly. The rate at which requests are issued is 150 per second. The test involves initiating a total of 27,000 TCP connections and on each connection one HTTP call is performed (a call consists of sending a request and receiving a reply).
The timeout option defines the number of seconds that the client is willing to wait to hear back from the server. If this timeout expires, the tool considers the corresponding call to have failed. Note that with a total of 27,000 connections and a rate of 150 per second, the total test duration will be approximately 180 seconds (27,000/150), independent of what load the server can actually sustain. And here is a result that one might get:
Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s
Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
Connection time [ms]: connect 0.3
Request rate: 148.3 req/s (6.7 ms/req)
Request size [B]: 72.0
Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
Reply time [ms]: response 4.6 transfer 0.0
Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0
CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
Net I/O: 190.9 KB/s (1.6*10^6 bps)
Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0
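The arithmetic behind these figures is easy to check. Here is a minimal sketch in Perl; the two helper subroutines are my own names for illustration and are not part of httperf. It recomputes the expected test duration from the command line options and the share of failed calls from the Errors line of the report above.

```perl
use strict;
use warnings;

# Expected duration of a fixed-rate test: total connections divided by rate.
sub expected_duration {
    my ($num_conn, $rate) = @_;
    die "rate must be positive\n" unless $rate > 0;
    return $num_conn / $rate;
}

# Share of connections that ended in an error, in percent.
sub failure_percent {
    my ($errors, $num_conn) = @_;
    return 100 * $errors / $num_conn;
}

# --num-conn 27000 --rate 150, and "Errors: total 299" from the report
printf "expected duration: %.0f s\n", expected_duration(27_000, 150);
printf "failed calls:      %.2f%%\n", failure_percent(299, 27_000);
```

Note that the 299 client timeouts account exactly for the gap between the 27000 connections and the 26701 replies in the report.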
This is another crashme suite, originally written by Michael Schilli and located at http://www.linux-magazin.de/ausgabe.1998.08/Pounder/pounder.html. I made a few modifications (mostly adding my() declarations). I also allowed it to accept more than one URL to test, since sometimes you want to test the overall performance and not just one script.
The tool provides the same results as ab above, but it also allows you to set the timeout value, so requests will fail if not served within the timeout period. You also get Latency (secs/Request) and Throughput (Requests/sec) numbers. It can give you a better picture and make a complete simulation of your favorite Netscape browser :).
While running these two benchmarking suites I noticed that ab gave me results 2.5-3.0 times better. Both suites ran on the same machine, under the same load and with the same parameters, but the implementations are different.
Sample output:
URL(s):          http://www.nowhere.com:81/perl/access/access.cgi
Total Requests:  100
Parallel Agents: 10
Succeeded:       100 (100.00%)
Errors:          NONE
Total Time:      9.39 secs
Throughput:      10.65 Requests/sec
Latency:         0.85 secs/Request
And the code:
#!/usr/apps/bin/perl -w
use LWP::Parallel::UserAgent;
use Time::HiRes qw(gettimeofday tv_interval);
use strict;
###
# Configuration
###
my $nof_parallel_connections = 10;
my $nof_requests_total = 100;
my $timeout = 10;
my @urls = (
'http://www.nowhere.com:81/perl/faq_manager/faq_manager.pl',
'http://www.nowhere.com:81/perl/access/access.cgi',
);
##################################################
# Derived Class for latency timing
##################################################
package MyParallelAgent;
@MyParallelAgent::ISA = qw(LWP::Parallel::UserAgent);
use strict;
###
# Is called when connection is opened
###
sub on_connect {
my ($self, $request, $response, $entry) = @_;
$self->{__start_times}->{$entry} = [Time::HiRes::gettimeofday];
}
###
# Is called when connection is closed
###
sub on_return {
my ($self, $request, $response, $entry) = @_;
my $start = $self->{__start_times}->{$entry};
$self->{__latency_total} += Time::HiRes::tv_interval($start);
}
sub on_failure {
on_return(@_); # Same procedure
}
###
# Access function for new instance var
###
sub get_latency_total {
return shift->{__latency_total};
}
##################################################
package main;
##################################################
###
# Init parallel user agent
###
my $ua = MyParallelAgent->new();
$ua->agent("pounder/1.0");
$ua->max_req($nof_parallel_connections);
$ua->redirect(0); # No redirects
###
# Register all requests
###
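# Register exactly $nof_requests_total requests, cycling through @urls,
# so that the statistics below divide by the real number of requests
foreach my $i (0 .. $nof_requests_total - 1) {
    my $request = HTTP::Request->new('GET', $urls[$i % @urls]);
    $ua->register($request);
}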
###
# Launch processes and check time
###
my $start_time = [gettimeofday];
my $results = $ua->wait($timeout);
my $total_time = tv_interval($start_time);
###
# Requests all done, check results
###
my $succeeded = 0;
my %errors = ();
foreach my $entry (values %$results) {
my $response = $entry->response();
if($response->is_success()) {
$succeeded++; # Another satisfied customer
} else {
# Error, save the message
$response->message("TIMEOUT") unless $response->code();
$errors{$response->message}++;
}
}
###
# Format errors if any from %errors
###
my $errors = join(',', map "$_ ($errors{$_})", keys %errors);
$errors = "NONE" unless $errors;
###
# Format results
###
#@urls = map {($_,".")} @urls;
my @P = (
"URL(s)" => join("\n\t\t ", @urls),
"Total Requests" => "$nof_requests_total",
"Parallel Agents" => $nof_parallel_connections,
"Succeeded" => sprintf("$succeeded (%.2f%%)\n",
$succeeded * 100 / $nof_requests_total),
"Errors" => $errors,
"Total Time" => sprintf("%.2f secs\n", $total_time),
"Throughput" => sprintf("%.2f Requests/sec\n",
$nof_requests_total / $total_time),
"Latency" => sprintf("%.2f secs/Request",
($ua->get_latency_total() || 0) /
$nof_requests_total),
);
my ($left, $right);
###
# Print out statistics
###
format STDOUT =
@<<<<<<<<<<<<<<< @*
"$left:", $right
.
while(($left, $right) = splice(@P, 0, 2)) {
write;
}
The MaxClients directive sets the limit on the number of simultaneous requests that can be supported; no more than this number of child server processes will be created. To configure more than 256 clients, you must edit the HARD_SERVER_LIMIT entry in httpd.h and recompile. In our case we want this variable to be as small as possible, because this way we can bound the resources used by the server children. Since we can restrict each child's process size (see Limiting the size of the processes), the calculation of MaxClients is pretty straightforward:
             Total RAM Dedicated to the Webserver
MaxClients = ------------------------------------
                   MAX child's process size
So if I have 400MB left for the webserver to run with, I can set MaxClients to 40 if I know that each child is bounded to 10MB of memory (e.g. with Apache::SizeLimit).
You will certainly wonder what happens to your server if there are more than MaxClients concurrent users at some moment. This situation is signaled by the following warning message in the error_log file:
[Sun Jan 24 12:05:32 1999] [error] server reached MaxClients setting, consider raising the MaxClients setting
There is no problem -- any connection attempts over the MaxClients limit will normally be queued, up to a number based on the ListenBacklog directive. Once a child process is freed at the end of a different request, the queued connection will be served.
It is reported as an error because clients are being put in the queue rather than getting served at once, even though they do not get an error response. The condition can be allowed to persist to balance available system resources and response time, but sooner or later you will need to get more RAM so you can start more children. The best approach is to avoid reaching this condition at all; if you reach it often, you should start to worry about it.
It's important to understand how much real memory a child occupies. Your children can share memory between them (when the OS supports that and you take action to allow the sharing to happen -- see Preload Perl modules at server startup). If this is the case, chances are that your MaxClients can be even higher. But it seems that it's not so simple to calculate the absolute number. (If you come up with a solution please let us know!) If the shared memory stayed the same size throughout the child's life, we could derive a much better formula:
             Total_RAM - Max_Process_Size + Shared_RAM_per_Child * MaxClients
MaxClients = ----------------------------------------------------------------
                                    Max_Process_Size
which is:
                   Total_RAM - Max_Process_Size
MaxClients = ---------------------------------------
             Max_Process_Size - Shared_RAM_per_Child
Let's roll some calculations:
Total_RAM            = 500MB
Max_Process_Size     = 10MB
Shared_RAM_per_Child = 4MB

             500 - 10
MaxClients = -------- = 81
              10 - 4

With no sharing in place:

             500
MaxClients = --- = 50
              10

With sharing in place you can have about 60% more servers without purchasing more RAM. If you improve the sharing level and keep it, let's say:
Total_RAM            = 500MB
Max_Process_Size     = 10MB
Shared_RAM_per_Child = 8MB

             500 - 10
MaxClients = -------- = 245
              10 - 8

390% more servers!!! You get the point :)
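The calculations above can be put into a few lines of Perl. This is only a sketch: the subroutine names are mine, all sizes are in MB, and it assumes (as the formula does) that the shared part of each child stays the same size for the child's whole life, which is rarely true in practice.

```perl
use strict;
use warnings;

# MaxClients when children share nothing: RAM divided by child size.
sub max_clients_unshared {
    my ($total_ram, $max_process_size) = @_;
    return int($total_ram / $max_process_size);
}

# MaxClients when every child keeps $shared MB shared, using the formula
# from the text:
#   (Total_RAM - Max_Process_Size) / (Max_Process_Size - Shared_RAM_per_Child)
sub max_clients_shared {
    my ($total_ram, $max_process_size, $shared) = @_;
    my $unshared = $max_process_size - $shared;
    die "shared size must be smaller than the process size\n"
        unless $unshared > 0;
    return int(($total_ram - $max_process_size) / $unshared);
}

printf "no sharing: %d\n", max_clients_unshared(500, 10);
printf "4MB shared: %d\n", max_clients_shared(500, 10, 4);
printf "8MB shared: %d\n", max_clients_shared(500, 10, 8);
```

The results are rounded down, since a fraction of a child is of no use. Treat the outcome as an upper bound to start testing from, not as a value to paste into httpd.conf.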
The MaxRequestsPerChild directive sets the limit on the number of requests that an individual child
server process will handle. After
MaxRequestsPerChild requests, the child process will die. If
MaxRequestsPerChild is 0, then the process will live forever.
Setting MaxRequestsPerChild to a non-zero limit has two beneficial effects: it limits the damage done by memory leaks and helps reduce the number of processes when the server load reduces.
The first reason is the most crucial for mod_perl, since sloppy programming will cause a child process to consume more memory after each request. If left unbounded, then after a certain number of requests the children will use up all the available memory and leave the server to die from memory starvation. Note that sometimes standard system libraries leak memory too, especially on OSes with bad memory management (e.g. Solaris 2.5 on x86 arch). If this is your case you can set MaxRequestsPerChild to a small number, which will allow the system to reclaim the memory that a greedy child process consumed when it exits after MaxRequestsPerChild requests. But beware -- if you set this number too low, you will lose a fraction of the speed bonus you receive with mod_perl. Consider using Apache::PerlRun if this is the case. Also, setting MaxSpareServers to a number close to MaxClients will improve the response time (but your parent process will be busy respawning new children all the time!)
Another approach is to use Apache::SizeLimit (See Limiting the size of the processes). By using this module, you should be able to discontinue using the
MaxRequestsPerChild, although for some folks, using both in combination does the job.
See also Preload Perl modules at server startup and Sharing Memory.
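To get a feeling for how MaxRequestsPerChild bounds a leaking child, here is a rough model in Perl. It assumes a constant leak per request, which real code seldom has, and every figure in it is hypothetical; the subroutine names are mine as well.

```perl
use strict;
use warnings;

# Worst-case size (in KB) a leaking child reaches just before it exits,
# assuming a constant leak per request. All figures are hypothetical.
sub worst_case_child_kb {
    my ($base_kb, $leak_kb_per_request, $max_requests_per_child) = @_;
    return $base_kb + $leak_kb_per_request * $max_requests_per_child;
}

# Largest MaxRequestsPerChild that keeps a child under $limit_kb.
# Returns 0 when there is no leak, which Apache reads as "never exit".
sub max_requests_for_limit {
    my ($base_kb, $leak_kb_per_request, $limit_kb) = @_;
    return 0 if $leak_kb_per_request <= 0;
    return int(($limit_kb - $base_kb) / $leak_kb_per_request);
}

# A 6000KB child leaking 2KB per request, with a 10000KB ceiling
printf "size after 5000 requests: %dKB\n", worst_case_child_kb(6000, 2, 5000);
printf "requests until 10000KB:   %d\n",   max_requests_for_limit(6000, 2, 10_000);
```

In this made-up case a setting of 5000 would let the child grow well past the ceiling, while 2000 keeps it within bounds. Measure the real growth of your own children before picking a number.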
With mod_perl enabled, it might take as much as 30 seconds from the time
you start the server until it is ready to serve incoming requests. This
delay depends on the OS, the number of preloaded modules and the process
load of the machine. So it's best to set
StartServers and MinSpareServers to high numbers, so that if you get a high load just after the server has
been restarted, the fresh servers will be ready to serve requests
immediately. With mod_perl, it's usually a good idea to raise all three variables (StartServers, MinSpareServers and MaxSpareServers) higher than normal. In order to maximize the benefits of mod_perl, you don't want to kill servers when they are idle; rather you want them to stay up and available to handle new requests immediately. I think an ideal configuration is to set MinSpareServers and MaxSpareServers to similar values, maybe even the same. Having MaxSpareServers close to MaxClients will completely use all of your resources (if MaxClients has been chosen to take full advantage of the resources), but it will make sure that at any given moment your system will be capable of responding to requests with the maximum speed (given that the number of concurrent requests is not higher than MaxClients).
Let's try some numbers. For a heavily loaded web site and a dedicated machine I would think of (note that 400MB is just an example):
Available to webserver RAM:   400MB
Child's memory size bounded:  10MB
MaxClients:                   400/10 = 40 (larger with mem sharing)
StartServers:                 20
MinSpareServers:              20
MaxSpareServers:              35
However if I want to use the server for many other tasks, but make it capable of handling a high load, I'd think of:
Available to webserver RAM:   400MB
Child's memory size bounded:  10MB
MaxClients:                   400/10 = 40
StartServers:                 5
MinSpareServers:              5
MaxSpareServers:              10
(These numbers are taken off the top of my head and should not be used as a rule, but rather as examples to show you some possible scenarios. Use this information wisely!)
OK, we've run various benchmarks -- let's summarize the conclusions:
If your scripts are clean and don't leak memory, set MaxRequestsPerChild to a number as large as possible (10000?). If you use Apache::SizeLimit, you can set this parameter to 0 (equal to infinity). You will want this parameter to be smaller if your code becomes unshared over the process' life.
If you keep a small number of servers active most of the time, keep StartServers low, especially if MaxSpareServers is low, as it will kill the just-loaded servers before they are utilized at all (if there is no load). If your service is heavily loaded, make StartServers close to MaxClients (and keep MaxSpareServers equal to MaxClients as well).
If your server performs other work besides web serving, make MinSpareServers low so the memory of unused children will be freed when there is no big load. If your server's load varies (you get loads in bursts) and you want fast response for all clients at any time, you will want to make it high, so that new children will be spawned in advance and be waiting to handle bursts of requests.
For MaxSpareServers the logic is the same as for MinSpareServers: low if you need the machine for other tasks, high if it's a dedicated web host and you want a minimal response delay.
Do not set MaxClients too low, so you don't get into a situation where clients are waiting for the server to start serving them (they might wait, but not for too long). Do not set it too high either: if you get a high load and all requests are immediately granted and served, your CPU will have a hard time keeping up, and if the child's size * number of running children is larger than the total available RAM, your server will start swapping (which will slow down everything, which in turn will make things even slower, until eventually your machine dies). It's important that you take pains to ensure that swapping does not normally happen. Swap space is an emergency pool, not a resource to be used on a consistent basis. If you are low on memory and you badly need it, buy it; memory is amazingly cheap these days.
But based on the test I conducted above, even if you have plenty of memory like I have (1GB), increasing MaxClients sometimes will give you no speedup. The more clients are running, the more CPU time will be required and the fewer CPU time slices each process will receive. The response latency (the time to respond to a request) will grow, so you won't see the expected improvement. The best approach is to find the minimum requirement for your kind of service and the maximum capability of your machine. Then start at the minimum and test like I did, successively raising this parameter until you find the point on the latency and/or throughput curve where the improvement becomes smaller. Stop there and use it. Of course when you use these parameters on a production server, you will have the ability to tune them more precisely, since then you will see the real numbers. Also don't forget that if you add more scripts, or just modify the running ones, the parameters will most probably need to be recalculated, since the processes will grow in size as you compile in more code.
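The "raise it until the improvement becomes smaller" procedure can be written down as a tiny selection routine. The following Perl sketch is just one way to do it: the name pick_max_clients, the 5% threshold and the measurements are all made up for illustration, and you would feed it the numbers from your own benchmark runs.

```perl
use strict;
use warnings;

# Given [MaxClients, requests-per-second] pairs sorted by MaxClients,
# return the smallest setting after which raising MaxClients gains less
# than $min_gain (a fraction, e.g. 0.05 for 5%). The threshold is an
# assumption; pick whatever you consider a worthwhile improvement.
sub pick_max_clients {
    my ($min_gain, @measurements) = @_;
    for my $i (0 .. $#measurements - 1) {
        my ($clients, $rps) = @{ $measurements[$i] };
        my $next_rps        = $measurements[$i + 1][1];
        return $clients if $next_rps < $rps * (1 + $min_gain);
    }
    return $measurements[-1][0];    # still improving at the largest setting
}

# Made-up measurements, NOT results from the tests above
my @runs = ([5, 18.0], [10, 32.8], [20, 33.0], [50, 31.7]);
printf "suggested MaxClients: %d\n", pick_max_clients(0.05, @runs);
```

With these figures the step from 5 to 10 children pays off, while the step from 10 to 20 gains less than 5%, so the routine settles on 10.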
Another popular use of mod_perl is to take advantage of its ability to maintain persistent open database connections. The basic approach is as follows:
# Apache::Registry script
# -------------------------
use strict;
use vars qw($dbh);
$dbh ||= SomeDbPackage->connect(...);
Since $dbh is a global variable for the child, once the child has opened the connection it will use it over and over again, unless you perform disconnect().
Be careful to use different names for the handles if you open connections to different databases!
Apache::DBI allows you to make a persistent database connection. With this module enabled, every connect() request to the plain DBI module will be forwarded to the Apache::DBI module. It checks whether a database handle from a previous connect() request has already been opened, and whether this handle is still valid, using the ping method. If both conditions are fulfilled it just returns the database handle. If there is no appropriate database handle or if the ping method fails, a new connection is established and the handle is stored for later re-use. There is no need to delete the disconnect() statements from your code. They will not do a thing, as the Apache::DBI module overrides the disconnect() method with a NOP. On the child's exit there is no explicit disconnect; the child dies and so does the database connection. You may leave the use DBI; statement inside the scripts as well.
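To make the caching logic concrete, here is a toy model of it in plain Perl. It is not Apache::DBI's source code: FakeHandle and ConnectionCache are hypothetical packages invented for this illustration, and no real database is involved. The sketch only shows the two checks described above, "is there a handle for these connect arguments?" and "does it still answer ping?".

```perl
use strict;
use warnings;

# --- A stand-in for a database handle (hypothetical, for illustration) ---
package FakeHandle;
sub new  { my ($class) = @_; return bless { alive => 1 }, $class }
sub ping { return $_[0]{alive} }    # true while the connection is usable
sub drop { $_[0]{alive} = 0 }       # simulate a server-side timeout

# --- The caching idea: one handle per distinct set of connect arguments ---
package ConnectionCache;
my %cache;          # connect arguments => handle
our $opened = 0;    # how many "real" connections were made

sub connect {
    my ($class, @args) = @_;
    my $key = join "\0", @args;
    my $dbh = $cache{$key};
    return $dbh if $dbh && $dbh->ping;   # reuse a handle that is still valid
    $opened++;                           # otherwise open and remember a new one
    return $cache{$key} = FakeHandle->new;
}

package main;

my $first  = ConnectionCache->connect('dbi:mysql:test', 'login', 'passwd');
my $second = ConnectionCache->connect('dbi:mysql:test', 'login', 'passwd');
print "connections opened: $ConnectionCache::opened\n";  # handle was reused

$second->drop;                                            # connection times out
my $third = ConnectionCache->connect('dbi:mysql:test', 'login', 'passwd');
print "connections opened: $ConnectionCache::opened\n";  # a fresh one was made
```

Because the cache is keyed on the connect arguments, two scripts that connect with different users or attributes end up with separate handles, each kept open for the life of the child.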
The usage is simple -- add to httpd.conf:
PerlModule Apache::DBI
It is important to load this module before any other DBI, DBD::* and ApacheDBI* modules!
db.pl
------------
use DBI;
use strict;
my $dbh = DBI->connect( 'DBI:mysql:database', 'user', 'password',
                        { AutoCommit => 0 }  # DBI attribute names are case-sensitive
                      ) || die $DBI::errstr;
...rest of the program
If you use DBI for DB connections, and you use Apache::DBI to make them persistent, you can also preopen a connection to the DB for each child with the connect_on_init() method, thus saving the connection overhead on the very first request of every child.
use Apache::DBI ();
Apache::DBI->connect_on_init("DBI:mysql:test",
"login",
"passwd",
{
RaiseError => 1,
PrintError => 0,
AutoCommit => 1,
}
);
This is a simple way to have Apache children establish connections on server startup. This call should be in a startup file require()d by PerlRequire or inside a <Perl> section. It will establish a connection in each child process when that child is started. See the Apache::DBI manpage for the requirements of this method.
You can also benefit from persistent connections by replacing prepare() with prepare_cached(). That way you will always be sure that you have a good statement handle and you will get some caching benefit. The downside is that you are going to pay for DBI to parse your SQL and do a cache lookup every time you call prepare_cached().
Be warned that some databases don't support caches of prepared plans (e.g. PostgreSQL and Sybase). Though with Sybase you could open multiple connections to achieve the same result (at the risk of getting deadlocks, depending on what you are trying to do!)
Another problem is with timeouts: some databases disconnect the client after a certain time of inactivity. This problem is known as the morning bug. The ping() method ensures that this will not happen. Some DBD drivers don't have this method; check the Apache::DBI manpage to see how to write a ping() method.
Another approach is to change the client's connection timeout. For mysql users, starting from mysql-3.22.x you can set a wait_timeout option at mysqld server startup to change the default value. Setting it to 36 hours would probably fix the timeout problem.
A common web application architecture is one or more application servers
which handle requests from client browsers by consulting one or more
database servers and performing a transform on the data. When an
application must consult the database on every request, the interaction
with the database server becomes the central performance issue. Spending a
bit of time optimizing your database access can result in significant
application performance improvements. In this analysis, a system using
Apache, mod_perl, DBI, and Oracle will be considered. The application server uses Apache and
mod_perl to service client requests, and DBI to communicate with a remote Oracle database.
In the course of servicing a typical client request, the application server must retrieve some data from the database and execute a stored procedure. There are several steps that need to be done to complete the request:
1: Connect to the database server
2: Prepare a SQL SELECT statement
3: Execute the SELECT statement
4: Retrieve the results of the SELECT statement
5: Release the SELECT statement handle
6: Prepare a PL/SQL stored procedure call
7: Execute the stored procedure
8: Release the stored procedure statement handle
9: Commit or rollback
10: Disconnect from the database server
In this document, an application will be described which achieves maximum performance by eliminating some of the steps above and optimizing others.
A naive implementation would perform steps 1 through 10 from above on every request. A portion of the source code might look like this:
# ...
my $dbh = DBI->connect('dbi:Oracle:host', 'user', 'pass')
|| die $DBI::errstr;
my $baz = $r->param('baz');
eval {
my $sth = $dbh->prepare(qq{
SELECT foo
FROM bar
WHERE baz = $baz
});
$sth->execute;
while (my @row = $sth->fetchrow_array) {
# do HTML stuff
}
$sth->finish;
my $sph = $dbh->prepare(qq{
BEGIN
my_procedure(
arg_in => $baz
);
END;
});
$sph->execute;
$sph->finish;
$dbh->commit;
};
if ($@) {
$dbh->rollback;
}
$dbh->disconnect;
# ...
In practice, such an implementation would have hideous performance problems. The majority of the execution time of this program would likely be spent connecting to the database. An examination shows that step 1 consists of many smaller steps:
1: Connect to the database server
1a: Build client-side data structures for an Oracle connection
1b: Look up the server's alias in a file
1c: Look up the server's hostname
1d: Build a socket to the server
1e: Build server-side data structures for this connection
The naive implementation waits for all of these steps to happen, and then
throws away the database connection when it is done! This is obviously
wasteful, and easily rectified. The best solution is to hoist the database
connection step out of the per-request lifecycle so that more than one
request can use the same database connection. This can be done by
connecting to the database server once, and then not disconnecting until
the Apache child process exits. The
Apache::DBI module does this transparently and automatically with little effort on the
part of the programmer.
Apache::DBI intercepts calls to DBI's connect and disconnect methods and replaces them with its own. Apache::DBI caches database connections when they are first opened, and it ignores disconnect commands. When an application tries to connect to the same database, Apache::DBI returns a cached connection, thus saving the significant time penalty of repeatedly connecting to the database. You will find a full treatment of Apache::DBI at Persistent DB Connections.
When Apache::DBI is in use, none of the code in the example needs to change. The code is
upgraded from naive to respectable with the use of a simple module! The
first and biggest database performance problem is quickly dispensed with.
Most database servers, including Oracle, utilize a cache to improve the performance of recently seen queries. The cache is keyed on the SQL statement. If a statement is identical to a previously seen statement, the execution plan for the previous statement is reused. This can be a considerable improvement over building a new statement execution plan.
Our respectable implementation from the last section is not making use of this caching ability. It is preparing the statement:
SELECT foo FROM bar WHERE baz = $baz
The problem is that $baz is being read from an HTML form, and is therefore likely to change on every
request. When the database server sees this statement, it is going to look
like:
SELECT foo FROM bar WHERE baz = 1
and on the next request, the SQL will be:
SELECT foo FROM bar WHERE baz = 42
Since the statements are different, the database server will not be able to reuse its execution plan, and will proceed to make another one. This defeats the purpose of the SQL statement cache.
The application server needs to make sure that SQL statements which are the same look the same. The way to achieve this is to use placeholders and bound parameters. The placeholder is a blank in the SQL statement, which tells the database server that the value will be filled in later. The bound parameter is the value which is inserted into the blank before the statement is executed.
With placeholders, the SQL statement looks like:
SELECT foo FROM bar WHERE baz = :baz
Regardless of whether baz is 1 or 42, the SQL always looks the same, and the database server can
reuse its cached execution plan for this statement. This technique has
eliminated the execution plan generation penalty from the per-request
runtime. The potential performance improvement from this optimization could
range from modest to very significant.
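The effect on a cache keyed on the SQL text can be shown without a database. The following Perl sketch is a toy model only: plans_needed is a name I made up, and a real server cache is far more involved. It merely counts how many distinct statement texts a series of requests produces with and without placeholders.

```perl
use strict;
use warnings;

# A toy model of a server-side statement cache keyed on the SQL text.
# It only counts how many distinct statements would need a new plan.
sub plans_needed {
    my @statements = @_;
    my %plan_cache;
    $plan_cache{$_}++ for @statements;
    return scalar keys %plan_cache;
}

# Values of baz arriving with five consecutive requests (made up)
my @values = (1, 42, 7, 42, 99);

# Interpolating the value produces a different statement for each value
my @interpolated = map { "SELECT foo FROM bar WHERE baz = $_" } @values;

# With a placeholder the text never changes; the value travels separately
my @placeholder = ('SELECT foo FROM bar WHERE baz = :baz') x @values;

printf "interpolated: %d plans\n", plans_needed(@interpolated);
printf "placeholder:  %d plan\n",  plans_needed(@placeholder);
```

Only the repeated value 42 gets any reuse in the interpolated case, while the placeholder version needs a single plan however many different values arrive.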
Here is the updated code fragment which employs this optimization:
# ...
my $dbh = DBI->connect('dbi:Oracle:host', 'user', 'pass')
|| die $DBI::errstr;
my $baz = $r->param('baz');
eval {
my $sth = $dbh->prepare(qq{
SELECT foo
FROM bar
WHERE baz = :baz
});
$sth->bind_param(':baz', $baz);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
# do HTML stuff
}
$sth->finish;
my $sph = $dbh->prepare(qq{
BEGIN
my_procedure(
arg_in => :baz
);
END;
});
$sph->bind_param(':baz', $baz);
$sph->execute;
$sph->finish;
$dbh->commit;
};
if ($@) {
$dbh->rollback;
}
# ...
The example program has certainly come a long way and the performance is
now probably much better than that of the first revision. However, there is
still more speed that can be wrung out of this server architecture. The
last bottleneck is in SQL statement parsing. Every time DBI's prepare() method is called, DBI parses the SQL command looking for placeholder strings, and does some
housekeeping work. Worse, a context has to be built on the client and
server sides of the connection which the database will use to refer to the
statement. These things take time, and by eliminating these steps the time
can be saved.
To get rid of the statement handle construction and statement parsing
penalties, we could use DBI's prepare_cached() method. This method compares the SQL
statement to others that have already been executed. If there is a match,
the cached statement handle is returned. But the application server is
still spending time calling an object method (very expensive in Perl), and
doing a hash lookup. Both of these steps are unnecessary, since the SQL is
very likely to be static and known at compile time. The smart programmer
can take advantage of these two attributes to gain better database
performance. In this example, the database statements will be prepared
immediately after the connection to the database is made, and they will be
cached in package scalars to eliminate the method call.
What is needed is a routine that will connect to the database and prepare
the statements. Since the statements are dependent upon the connection, the
integrity of the connection needs to be checked before using the
statements, and a reconnection should be attempted if needed. Since the
routine presented here does everything that
Apache::DBI does, it does not use Apache::DBI and therefore has the added benefit of eliminating a cache lookup on the
connection.
Here is an example of such a package:
package My::DB;
use strict;
use DBI;
sub connect {
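    if (defined $My::DB::conn) {
        # ping() returns false when the connection is gone and does not
        # necessarily die, so test its return value, not just $@
        return $My::DB::conn if eval { $My::DB::conn->ping };
    }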
$My::DB::conn = DBI->connect(
'dbi:Oracle:server', 'user', 'pass', {
PrintError => 1,
RaiseError => 1,
AutoCommit => 0
}
) || die $DBI::errstr; #Assume application handles this
$My::DB::select = $My::DB::conn->prepare(q{
SELECT foo
FROM bar
WHERE baz = :baz
});
$My::DB::procedure = $My::DB::conn->prepare(q{
BEGIN
my_procedure(
arg_in => :baz
);
END;
});
return $My::DB::conn;
}
1;
Now the example program needs to be modified to use this package.
# ...
my $dbh = My::DB->connect;
my $baz = $r->param('baz');
eval {
my $sth = $My::DB::select;
$sth->bind_param(':baz', $baz);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
# do HTML stuff
}
my $sph = $My::DB::procedure;
$sph->bind_param(':baz', $baz);
$sph->execute;
$dbh->commit;
};
if ($@) {
$dbh->rollback;
}
# ...
Notice that several improvements have been made. Since the statement
handles have a longer life than the request, there is no need for each
request to prepare the statement, and no need to call the statement
handle's finish method. Since Apache::DBI and the prepare_cached() method are not used, no cache lookups
are needed.
The number of steps needed to service the request in the example system has been reduced significantly. In addition, the hidden cost of building and tearing down statement handles and of creating query execution plans is removed. Compare the new sequence with the original:
1: Check connection to database
2: Bind parameter to SQL SELECT statement
3: Execute SELECT statement
4: Fetch rows
5: Bind parameters to PL/SQL stored procedure
6: Execute PL/SQL stored procedure
7: Commit or rollback
It is probably possible to optimize this example even further, but I have not tried. It is very likely that the time could be better spent improving your database indexing scheme or web server buffering and load balancing. If there are any suggestions for further optimization of the application-database interaction, please mail them to me at jwb@cp.net.
Jeffrey Baker, 4 October 1999
As you know, local $|=1; disables the buffering of the currently selected file handle (default is STDOUT). When you set it, ap_rflush() is called after each print(), unbuffering Apache's IO.
If you are using a _bad_ style in generating output, which consists of multiple print() calls, or you just have too many of them, you will experience a degradation in performance. The severity depends on the number of calls you make.
Many old CGIs were written in the style of:
print "<BODY BGCOLOR=\"black\" TEXT=\"white\">";
print "<H1>";
print "Hello";
print "</H1>";
print "<A HREF=\"foo.html\"> foo </A>";
print "</BODY>";
which reveals the following drawbacks: multiple print() calls (performance degradation with $|=1), and backslashism, which makes the code less readable and makes it more difficult to format the HTML so that it is easily readable as the CGI's output. The code below solves them all:
print qq{
<BODY BGCOLOR="black" TEXT="white">
<H1>
Hello
</H1>
<A HREF="foo.html"> foo </A>
</BODY>
};
I guess you see the difference. Be careful though, when printing a
<HTML> tag. The correct way is:
print qq{<HTML>
<HEAD></HEAD>
<BODY>
};
If you try the following:
print qq{
<HTML>
<HEAD></HEAD>
<BODY>
};
Some older browsers might not accept the output as HTML, but rather print it as plain text, since they expect the first characters after the headers and the empty line to be <HTML>, and not spaces and/or an additional newline followed by <HTML>. Even if it works with your browser, it might not work for others.
Now let's go back to the $|=1 topic. I still disable buffering, for two reasons: I use few print() calls by printing out multiline HTML and not a line per print(), and I want my users to see the output immediately. So if I am about to produce the results of a DB query, which might take some time to complete, I want users to get some titles ahead. This improves the usability of my site. Ask yourself which you like better: getting the output a bit slower, but steadily from the moment you've pressed the Submit button, or having to watch the ``falling stars'' for a while and then receiving the whole output at once, even if a few millisecs faster (if the client (browser) did not time out till then)?
An even better solution is to keep buffering enabled, and to use the Perl
API rflush() call to flush the buffers when wanted. This way you can aggregate in the
buffer the top of the page you are going to send to the user, and flush it a
moment before you start some lengthy operation, like a DB query. So
you kill two birds with one stone: you show some of the data to the user
immediately, so the user feels that something is actually happening, and
you suffer almost none of the performance hit caused by disabled buffering.
use CGI ();
my $r = shift;
my $q = new CGI;
print $q->header('text/html');
print $q->start_html;
print $q->p("Searching...Please wait");
$r->rflush;
# imitate a lengthy operation
for (1..5) {
sleep 1;
}
print $q->p("Done!");
Conclusion: Do not blindly follow suggestions, but think what is best for you in every given case.
One of the important issues in improving performance is the reduction of memory usage: the less memory each server uses, the more server processes you can start, and thus the better the performance (which, from the user's point of view, means the response speed).
See Global vs Fully Qualified Variables
Profiling helps you to determine which subroutines, or just snippets of code, take the longest to execute and which subroutines are called most often. Those are probably the ones you will want to optimize to make the code more efficient.
Let's write some code to mess with:
META: build a hash and sort it by value, key... then rewrite the comparison subroutine to use the Schwartzian transform... and more
Think about some more web oriented examples...!
map {push @list, int rand(100)} (1..1000);
sub mysort {
map ...
}
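Until the example above is completed, here is a minimal sketch of the kind of code the note has in mind: a hash sorted by value, first with an ordinary comparison subroutine and then with a Schwartzian transform. The %score hash, its keys and the by_value() subroutine are hypothetical names invented for this illustration, and this is not necessarily the code the author planned to profile:

```perl
use strict;
use warnings;

# Hypothetical data set: 1000 keys with random integer values.
my %score;
$score{"key$_"} = int rand 100 for 1 .. 1000;

# Straightforward version: the hash is consulted inside every comparison.
sub by_value { $score{$a} <=> $score{$b} or $a cmp $b }
my @plain = sort by_value keys %score;

# Schwartzian transform: pair each key with its sort value once,
# sort the pairs, then strip the values off again.
my @transformed =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
    map  { [ $_, $score{$_} ] }
    keys %score;

print "both versions produce the same order\n"
    if "@plain" eq "@transformed";
```

Running each version under the profiler should show where the time goes; the transform pays off mainly when computing the sort value is expensive.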
META: remove all the diagnostics section below it's irrelevant here. (just reuse the explanations)
In the diagnostics pragma section, I showed that leaving it in production code is a bad idea, as it
significantly slows down execution. We verified that with the
Benchmark module. Now let's see how to use a profiler to find which subroutines diagnostics spends most of its time in; once such a subroutine is spotted, it could be a good idea to
rewrite that specific code to make it more efficient. We won't optimize
the code here, as that is out of the scope of this document, and since this is a
core Perl module, chances are that it's already optimized.
If you wonder where the time goes, Devel::DProf can help us. Let's use this code:
diagnostics.pl
--------------
use diagnostics;
test_code();
sub test_code{
for my $i (1..10) {
my $j = $i**2;
}
$a = "Hi";
$b = "Bye";
if ($a == $b) {
$c = $a;
}
}
Run it with the profiler enabled, and then create the profiling statistics with the help of dprofpp:
% perl -d:DProf diagnostics.pl
% dprofpp
Total Elapsed Time = 0.993458 Seconds
User+System Time = 0.933458 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
81.5 0.761 0.932 1 0.7610 0.9319 main::BEGIN
12.8 0.120 0.101 3161 0.0000 0.0000 diagnostics::unescape
6.43 0.060 0.060 2 0.0300 0.0300 diagnostics::BEGIN
2.14 0.020 0.020 3 0.0067 0.0067 diagnostics::transmo
1.07 0.010 0.010 2 0.0050 0.0050 Config::FETCH
0.00 0.000 -0.000 2 0.0000 - Exporter::import
0.00 0.000 -0.000 2 0.0000 - Exporter::export
0.00 0.000 -0.000 1 0.0000 - Config::BEGIN
0.00 0.000 -0.000 1 0.0000 - diagnostics::import
0.00 0.000 0.020 3 0.0000 0.0066 diagnostics::warn_trap
0.00 0.000 0.020 3 0.0000 0.0066 diagnostics::splainthis
0.00 0.000 -0.000 1 0.0000 - Config::TIEHASH
0.00 0.000 -0.000 3 0.0000 - diagnostics::shorten
0.00 0.000 -0.000 3 0.0000 - diagnostics::autodescribe
0.00 0.000 0.010 1 0.0000 0.0099 main::test_code
It's not easy to see who is responsible for this enormous overhead, even
though main::BEGIN seems to run most of the time. To get the whole picture we must see the subroutine
call tree, which shows us who calls whom, so we run:
% dprofpp -T
and the output is:
main::BEGIN
diagnostics::BEGIN
Exporter::import
Exporter::export
diagnostics::BEGIN
Config::BEGIN
Config::TIEHASH
Exporter::import
Exporter::export
Config::FETCH
Config::FETCH
diagnostics::unescape
.....................
3159 times [diagnostics::unescape] snipped
.....................
diagnostics::unescape
diagnostics::import
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
main::test_code
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
So we see that 2 executions of diagnostics::BEGIN and 3161 of
diagnostics::unescape are responsible for most of the running overhead.
META: but we see that it might be run only once in mod_perl, so the numbers are better right? check it!
If we comment out the diagnostics module, we get:
Total Elapsed Time = 0.079974 Seconds
User+System Time = 0.059974 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
0.00 0.000 -0.000 1 0.0000 - main::test_code
It is possible to profile code running under mod_perl with the
Devel::DProf module, available on CPAN. However, you must have Apache version 1.3b3 or
higher and the PerlChildExitHandler enabled (during the httpd build process). When the server is started,
Devel::DProf installs an END block to write the tmon.out
file. This block will be called at server shutdown. Here is how to
start and stop a server with the profiler enabled:
% setenv PERL5OPT -d:DProf
% httpd -X -d `pwd` &
... make some requests to the server here ...
% kill `cat logs/httpd.pid`
% unsetenv PERL5OPT
% dprofpp
The Devel::DProf package is a Perl code profiler. It collects information on the
execution time of a Perl script and of the subs in that script (remember
that print() and map() are just like any other subroutines you write, but they come bundled
with Perl!)
Another approach is to use Apache::DProf, which hooks
Devel::DProf into mod_perl. The Apache::DProf module will run a
Devel::DProf profiler inside each child server and write the
tmon.out file in the directory $ServerRoot/logs/dprof/$$ when the child is shut down (where $$ is the process ID of the child). All it takes is to add this to httpd.conf:
PerlModule Apache::DProf
Remember that any PerlHandler that was pulled in before
Apache::DProf in httpd.conf or startup.pl will not have its code debugging info inserted. To run dprofpp, chdir to
$ServerRoot/logs/dprof/$$ and run:
% dprofpp
Which approach is more efficient: OOP methods or function calls? For
example, CGI.pm allows you to work in both modes.
use CGI;
my $q = new CGI;
$q->param('x',5);
my $x = $q->param('x');
versus
use CGI qw(:standard);
param('x',5);
my $x = param('x');
As usual, let's benchmark and compare:
meth_vs_func.pl
---------------
use Benchmark;
use CGI qw(:standard);
$CGI::NO_DEBUG = 1;
my $q = new CGI;
my $x;
timethese
(20000,
{
'Method' => sub {$q->param('x',5); $x = $q->param('x'); },
'Function' => sub {param('x',5); $x = param('x');},
});
The benchmark is written in such a way that all the initializations are done at the beginning, so that we time only the calls themselves. Let's run it:
% ./meth_vs_func.pl
Function: 29 wallclock secs (25.19 usr + 0.13 sys = 25.32 CPU)
Method: 28 wallclock secs (22.94 usr + 0.10 sys = 23.04 CPU)
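What we are looking at are the total CPU times and not the wallclock seconds, since the load on the system may have differed between the two tests, which makes the wallclock numbers the wrong ones to base our conclusions on.
In general a method call costs a little more than a plain function call. The
difference lies in the time it takes to resolve the method: Perl has to look at the
object, find the package it is blessed into and then find the actual method.
Note, however, that the run shown above does not demonstrate this: there the function interface used about 10% more CPU time than the method interface (25.32 versus 23.04 CPU seconds). CGI.pm lets the same routine be called either as a function or as a method, and the extra work of telling the two cases apart on every call can mask the small difference in call overhead.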
If you maintain the data object in a package's global variable, like
CGI.pm does, you also save a little more time since you don't have to pass it to
the function: one parameter less to pass, fewer stack operations, less time
to get to the guts of the function.
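To look at the call overhead in isolation, without CGI.pm's own internals getting in the way, one can benchmark a trivial routine through both interfaces. The following is only a sketch: the My::Counter package and its set_m() and set_f() routines are hypothetical, written for this illustration, and the numbers will vary with the Perl version and the machine:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical package, used only for this illustration.
package My::Counter;

# Package-level data, as used by the function interface.
my $default = { x => 0 };

sub new { my $class = shift; return bless { x => 0 }, $class }

# Method interface: the object arrives as the first argument.
sub set_m { my ($self, $v) = @_; $self->{x} = $v; return $self->{x} }

# Function interface: works on the package-level data, no object is passed.
sub set_f { my ($v) = @_; $default->{x} = $v; return $default->{x} }

package main;

my $obj = My::Counter->new;

# Print a comparison chart of the two call styles.
cmpthese(1_000_000, {
    Method   => sub { $obj->set_m(5) },
    Function => sub { My::Counter::set_f(5) },
});
```

Both routines do the same work, so whatever difference the chart shows comes from the method lookup and the extra argument.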
But this little overhead is insignificant for most of us, relative to the benefits the object oriented approach gives when we have a big project to take care of. With big projects it's much easier to use.
In addition, there is a real memory hit when you import all of the functions into your process's memory. This can significantly enlarge memory requirements, particularly when there are many child processes.
Aside from namespace pollution, when a script imports symbols from a
module, its size grows by the space allocated for those
symbols. The more you import (e.g. qw(:standard) vs
qw(:all)) the more memory will be used. Let's say the overhead
is of size X. Now take the number of scripts in which you use the function
interface, let's call it Y. Finally let's say that you have Z
processes.
You will need X*Y*Z of additional memory. Taking X=10k, Y=10, Z=30, we get 10k*10*30 = 3MB! Now you understand the difference.
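The estimate is easy to replay for your own numbers. A minimal sketch, where extra_memory_kb() is a hypothetical helper name and the arguments are the X, Y and Z defined above (X given in kilobytes):

```perl
use strict;
use warnings;

# Extra memory in KB: per-script import overhead (X, in KB) times
# the number of scripts (Y) times the number of server processes (Z).
sub extra_memory_kb {
    my ($x_kb, $scripts, $processes) = @_;
    return $x_kb * $scripts * $processes;
}

# The figures used in the text: X=10k, Y=10, Z=30.
printf "%d KB of additional memory\n", extra_memory_kb(10, 10, 30);
```

This is only a rough upper bound, since it assumes every process has loaded every script.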
Let's benchmark the CGI.pm using GTop.pm. First with no exporting at all.
use GTop ();
use CGI ();
print GTop->new->proc_mem($$)->size;
1,949,696
Now exporting a few dozen symbols:
use GTop ();
use CGI qw(:standard);
print GTop->new->proc_mem($$)->size;
1,966,080
And finally exporting all the symbols (about 130):
use GTop ();
use CGI qw(:all);
print GTop->new->proc_mem($$)->size;
1,970,176
Results:
import symbols   size(bytes)   delta(bytes) relative to ()
----------------------------------------------------------
()               1949696       0
qw(:standard)    1966080       16384
qw(:all)         1970176       20480
So in my example above X=20k => 20k*10*30 = 6MB. You will need 6MB more
when importing all of CGI.pm's symbols than when importing none.
And generally you run more scripts and more processes, and probably import more symbols from the additional modules that you deploy.
But, as noted above, in the general case a plain function call is faster than a method call, because of the time it takes to resolve the method through the object.
If you are aiming at better performance, you will have to
face the fact that typing the fully qualified My::Module::my_method instead of importing the symbol might save you a good chunk of memory, provided that the subroutine does not have to be called
with a reference to an object; and even then the object can be passed explicitly as an argument.
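The idea can be tried with any module that exports functions. The sketch below uses the core POSIX module purely as an illustration (it is not one of the modules discussed above): an empty import list loads the module without exporting anything, and the function is then called by its fully qualified name.

```perl
use strict;
use warnings;

# Empty import list: the module is loaded, but nothing is exported
# into the caller's namespace.
use POSIX ();

# Call the function by its fully qualified name instead.
my $n = POSIX::floor(3.7);
print "floor: $n\n";

# No floor() symbol was created in package main.
print "floor() was not imported\n" unless defined &main::floor;
```

The price is a little more typing at each call site.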
I strongly endorse Apache::Request (libapreq), the Generic Apache Request Library. Its guts are all written in C, giving it a significant memory and
performance benefit. It has all the functionality CGI.pm has, except the HTML generation functions.
See Apache::GzipChain - compress HTML (or anything) in the OutputChain
mergemem is an experimental utility for Linux, which looks *very* interesting for us
mod_perl users:
http://mondoshawan.ml.org/mergemem/
It looks like it could be run periodically on your server to find and merge duplicate pages. There are caveats: it would halt your httpds during the merge (it appears to be very fast, but still ...).
This software comes with a utility called memcmp to tell you how much you might save.
If you have tried this utility, please let us know what you think about it! Thanks
Written by Stas Bekman.
Last Modified at 12/18/1999
Use of the Camel for Perl is a trademark of O'Reilly & Associates, and is used by permission.