UNIX Unleashed, Internet Edition
- 22 -
Monitoring Web Server Activity
by Mike Starkenburg
Many people consider server activity to be the true sign of a successful Web site.
The more hits you have, the more popular your Web site must be, right? In fact, that's
not strictly true, and in the following sections, I explain how the data in your
server logs can help you build a better site. In this chapter we'll go over the following:
- The HTTP access log
- The referrer and user_agent logs
- The error log
- Basic and advanced log analysis
- Factors in log accuracy
- Analysis tools
Access Logs
The primary method for monitoring Web server activity is analyzing the Web
server's access logs. The access log records each HTTP request to the server, including
both GET and POST method requests. The access log records successes
and failures, and includes a status code for each request. Some servers log "extended"
data including browser type and referring site. This data may be in separate logs
or stored in the main access log itself.
This data is generally kept in a /logs subdirectory of your server directory.
The file is often called access_log, and it can be large--about 1MB per
10,000 entries. The specific directory and name vary depending on your server, and
are configurable in the httpd.conf file.
These requests, or hits as they are commonly called, are the basic metric
of all Web server usage.
Uses for Access Log Data
In many organizations, log data is under-utilized or ignored completely. Often,
the only person with access to the logging data (and the only person who can interpret
the reports) is the Webmaster. In fact, the log data is a gold mine of information
for the entire company if properly analyzed and distributed.
Content Programming
One classic use of access logs is to assist in determining which content on a
Web site is most effective. By examining the frequency of hits to particular pages,
you, as a content developer, can judge the relative popularity of distinct content
areas.
Most analysis programs provide lists of the "top ten" and "bottom
ten" pages on a site, ranked by total hits. By examining this kind of report,
a Web content developer can find out which types of content users are finding helpful
or entertaining.
Web sites can have over 50 percent of their hits just to the index page, which
isn't much help in determining content effectiveness. Where the user goes next,
however, is perhaps one of the most useful pieces of data available from the access
logs. Some analysis programs (you explore a few later in the chapter) allow you to
examine the most common user "paths" through the site.
CAUTION: Note that for programming and
advertising purposes, access logs cannot be considered a completely accurate source.
In the "Log Accuracy" section later in this chapter, we discuss factors
that cause overstatement and understatement of access logs.
Scaling and Load Determination
Using access logs is a quick method of determining overall server load. By benchmarking
your system initially and then analyzing the changes in traffic periodically, you
can anticipate the need to increase your system capacity.
Each hit in an access log contains the total transfer size (in bytes) for
that request. By adding the transfer sizes of each hit, you can get an aggregate
bandwidth per period of time. This number can be a fairly good indicator of total
load over time.
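For a quick estimate, a one-line awk script can total that field. This is a sketch assuming common log format, where the byte count is the last field on each line; entries logged as - coerce to zero in awk's numeric context:
# Sum the transfer-size field (the last field in common log format).
awk '{ bytes += $NF } END { print bytes " bytes served" }' access_log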
Of course, the best scaling tests separately track system metrics such as CPU
usage, disk access, and network interface capacity. (See Unix Unleashed, System
Administrator's Edition for a more detailed discussion of this kind of monitoring.)
Analyzing access logs, however, is an easy way to get a quick snapshot of the load.
Advertising
Advertising is becoming one of the primary business models supporting Internet
sites. Advertising is generally sold in thousands of impressions, where an
impression is one hit on the ad graphic. Accurate tracking of this information
has a direct effect on revenue.
Because Web logs are not 100 percent accurate, businesses that are dependent on
ad revenue should consider using an ad management system such as NetGravity or Accipiter.
These systems manage ad inventory, reliably count impressions, and also count clickthroughs,
which are measures of ad effectiveness.
In cases in which ads are used in non-critical applications, access logs may be
useful. They may be used to judge the effectiveness of different ads in the same
space. Finally, you can use access log analysis to find new pages that may be appropriate
for ads.
Access Log Format
Although each server can have a different access log format, most popular servers
use the common log format. Common log format is used in most servers derived
from the NCSA httpd server, including Netscape and Apache.
If your server does not use common log format by default, don't fret. Some servers
can be configured to use common log format, and some analyzers process several different
log formats. If all else fails, you can write a pre-parser that converts your logs
to common log format.
A common log format entry looks like the following:
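www14.proxy.aol.com - - [02/May/1997:13:40:18 -0500] "GET /index.html HTTP/1.0" 200 3292
The fields, in order, are the remote hostname (or IP address), the identd identity and authenticated username (both usually just -), the date, time, and timezone offset, the request line itself, the result status code, and the number of bytes transferred. (The hostname and values shown here are illustrative.)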
Result Codes
Every attempted request is logged in the access log, but not all of them are successful.
The following common result codes can help you troubleshoot problems on your site:
Code    Meaning
2XX     Success.
200     OK. If your system is working correctly, this code is the most common one found in the log. It signifies that the request was completed without incident.
201     Created. Successful POST command.
202     Accepted. Processing request accepted.
203     Partial information. Returned information may be cached or private.
204     No response. Script succeeded but did not return a visible result.
3XX     Redirection.
301     Moved. Newer browsers should automatically link to the new reference. The response contains a new address for the requested page.
302     Found. Used to indicate that a different URL should be loaded. Often used by CGI scripts to redirect the user to the results of the script.
304     Not modified. A client can request a page "if-modified-since" a certain time. If the object has not been modified, the server responds with a 304, and the locally cached version of the object can be used.
4XX     Client error.
400     Bad request. Bad syntax in request.
401     Unauthorized. Proper authentication required to retrieve object.
402     Payment required. Proper "charge-to" header required to retrieve object.
403     Forbidden. No authentication possible. This code sometimes indicates problems with file permissions on the UNIX file system.
404     Not found. No document matches the URL requested.
5XX     Server error.
500     Internal error.
501     Not implemented.
502     Timed out.
Extended Logs
In addition to logging the basic access information in the common log format,
some servers log additional information included in the HTTP headers. Check your
server software's documentation to determine whether you have this capability. Note
that many servers have this capability but have it disabled by default. A simple
change to the httpd.conf file may enable extended logging.
In some server software, extended information is logged as fields tacked on the
end of each entry in a common log format file. Other servers maintain separate files
for the additional information. The two most common types of extended logs are the
referrer log and the user_agent log.
Referrer
Two important questions not answered by the standard access logs are
- From where are people coming to my site?
- How do people navigate through my site?
To answer these questions, look to your referrer log. This data is often ignored
by Webmasters, but it can provide a great deal of useful information.
Referrer data is generated by the client that is connecting to your site and is
passed in the HTTP headers for each connection. A referrer log entry contains two
pieces of data, as in the following example:
http://www.aol.com/credits.html -> /resume.html
The first URL represents the last page the user requested. The second represents
the filename on your server that the user is currently requesting. In this case,
the person who requested my resume was most recently looking at the aol.com credits
page. When referrer data frequently contains a given Web site, it is likely that
a Webmaster has linked to your site.
NOTE: If a site shows up only a few times
in your referrer log, that information doesn't necessarily indicate that a link exists
from that site to yours. In the preceding example, the user might have been looking
at the aol.com page last but manually typed in the URL for my resume. The
browser still sends the AOL page as the referrer information because it was the last
page the user requested. I can assure you that no link connects http://www.aol.com
to my resume.
You can get the data you need out of your referrer log in several ways. Many of
the tools I describe in the "Analysis Tools" section of this chapter process
your referrer log for you as they process your access logs.
If you specifically want to work with the referrer log, check out RefStats 1.1.1
by Jerry Franz. RefStats is a Perl script that counts and lists referring pages in
a clean and organized manner. You can find the script and sample output at
http://www.netimages.com/~snowhare/utilities/refstats.html
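If all you need is a quick tally of your most common referring pages, a short pipeline is enough. This sketch assumes the two-field format shown previously and that your server names the file referer_log:
# Count and rank referring URLs; the top entries are your likeliest inbound links.
awk '{ print $1 }' referer_log | sort | uniq -c | sort -rn | head -10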
User-Agent
When Webmasters design Web sites, they are often faced with a difficult question:
Which browser will we develop for? Each browser handles HTML differently, and each
supports different scripting languages and accessory programs.
In most cases, you should build your site for the browser most frequently used
by your audience. One way to decide which browser to support is to watch industry-wide
browser market share reports. For one example, try the following site:
http://www.webtrends.com/products/webtrend/REPORTS/industry/browser/apr97/report.htm
A more accurate method is to examine "user-agent" logs. Most servers
log the type of browser used for each request in a file called agent_log.
The agent information is passed in HTTP headers, like the referrer data.
There is no formal standard for user-agent strings, but they generally consist
of a browser name, a slash, a version number, and additional information in parentheses.
Now take a look at some common agents:
Mozilla/2.02 (Win16; I)
The preceding is the classic user-agent string: It denotes a user with a Netscape
browser on a Windows 16-bit platform. Mozilla is Netscape's internal pet name for
its browser.
Here's another example:
Mozilla/2.0 (compatible; MSIE 3.01; AK; Windows 95)
Now, the preceding string looks like Netscape, but it is actually Microsoft's
Internet Explorer 3.01 masquerading as Netscape. Microsoft created this agent to
take advantage of early Web sites that delivered two versions of content: one for
Netscape users with all the bells and whistles, and a plain one for everyone else.
Now consider this example:
Mozilla/2.0 (Compatible; AOL-IWENG 3.1; Win16)
Here's another imposter. This time, it's the AOL proprietary browser. AOL's browser
began life as InternetWorks by BookLink, hence the IWENG name.
The following is yet another example:
Mozilla/3.01 (Macintosh; I; PPC) via proxy gateway CERN-HTTPD/3.0 libwww/2.17
This one is really Netscape 3.01 on a PowerPC Mac. What's interesting about this
agent is that the user was behind a Web proxy. The proxy tacked its name onto the
actual agent string.
Again, many of the analysis programs discussed in this chapter process user_agent
logs as well. If you want a quick way to process just the user_agent file,
check out Chuck Musciano's nifty little sed scripts at
http://members.aol.com/htmlguru/agent_log.html
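If you'd rather roll your own, you can get a rough browser breakdown by keying on the product token before the first slash. This sketch assumes the file is named agent_log:
# Tally user-agent strings by product name (Mozilla, Lynx, and so on).
cut -d/ -f1 agent_log | sort | uniq -c | sort -rn
Remember that, as the examples above show, this lumps Internet Explorer and the AOL browser in with Netscape under Mozilla; a follow-up grep -c 'MSIE' agent_log counts the imposters separately.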
Error Logs
The second type of standard Web server activity log is the error log. The error
log records server events, including startup and shutdown messages. The error log
also records extended debugging information for each unsuccessful access request.
This data is generally kept in the /logs subdirectory with the access_log.
The file is often called error_log. The specific directory and name vary
depending on your server, and are configurable in the httpd.conf file.
Most events recorded in the error log are not critical. Depending on your server
and configuration, your server may log events like the following:
[02/May/1997:12:11:00 -0500] Error: Cannot access file
/usr/people/www/pages/artfile.html. File does not exist.
This message simply means that the requested file could not be found on the disk.
The problem could be a bad link, improper permissions settings, or a user could be
requesting outdated content.
Some entries in the error log can be useful in debugging CGI scripts. Some servers
log anything written by a script to stderr as an error event. By watching
your error logs, you can identify failing scripts. Some of the common errors that
indicate script failures include
- Attempt to invoke directory as script.
- File does not exist.
- Invalid CGI ref.
- Malformed header from script.
- Script not found or unable to stat.
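You can watch for these failures without reading the entire log by filtering for the messages above. This is only a sketch; the exact message text varies from server to server:
# Show the most recent script-failure events from the error log.
egrep 'Malformed header|Invalid CGI|unable to stat' error_log | tail -20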
Basic Analysis
The simplest measure of your server activity is to execute the following command:
wc -l access_log
This command returns a single number that represents the total accesses to your
server since the log was created. Unfortunately, this number includes many accesses
you might not want to count, including errors and redirects. It also doesn't give
you much useful information.
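One quick refinement is to count only successful requests. In common log format, the status code is the next-to-last field on each line, so a small awk filter (a sketch assuming that format) excludes errors and redirects from the count:
# Count only requests that returned a 2XX success code.
awk '$(NF-1) ~ /^2/' access_log | wc -l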
By judicious use of sed, grep, shell scripting, and pipes, you
can create much more interesting output. For example, if you were tracking hits
to a certain advertisement graphic, you could use the following:
grep ad1.gif access_log | wc -l
By issuing ever more complex commands, you can begin to gather really useful information
about usage on your site. However, these scripts are time-consuming to write, execute slowly,
and have to be revised every time you want to extract a different statistic. Unless
you have a specific statistic you need to gather in a certain format, you will probably
be better off using one of the many analysis programs on the market. You examine
a few of them later in this chapter.
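Still, as one last example before you reach for a packaged tool, the following pipeline approximates a "top ten pages" report. It is a sketch assuming common log format, where the requested path is the seventh field:
# Rank pages by total hits and show the ten most requested.
awk '{ print $7 }' access_log | sort | uniq -c | sort -rn | head -10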
General Statistics
Figure 22.1 shows the general statistics derived from my access log by my favorite
analysis program, Analog. I talk at more length about Analog in the "Analysis
Tools" section of this chapter. Other tools may give slightly different output,
but Analog produces a good variety of basic statistics and is easy to use.
Figure 22.1.
General statistics.
The general statistics section gives a good snapshot of traffic on your server.
As you can see in the figure, the analysis program has summarized several categories
of requests, including
- Successful requests
- Successful requests for pages
- Failed requests
- Redirected requests
You might also get average hits per day, total unique hosts or files, and an analysis
of the total bytes served.
NOTE: If you plan to use your log analysis
for advertising or content programming, be sure you know the difference between hits
and impressions. Hits represent all the accesses on your server, whereas impressions
represent only the accesses to a specific piece of information or advertisement.
Most people count impressions by counting only actual hits to the HTML page containing
the content or graphics.
By watching for changes in this information, you can see when you are having unusually
high numbers of errors, and you can watch the growth of your traffic overall. Of
course, taking this snapshot and comparing the numbers manually every day gets tiresome,
so most analysis tools allow some kind of periodic reports.
Periodic Reporting
Analysis tools provide a variety of reports that count usage over a specific period
of time. Most of these reports count total hits per period, although the more advanced
tools allow you to run reports on specific files or groups of files. Each of the
periodic reports has a specific use.
- Monthly report: Figure 22.2 shows a monthly report for my Web site for five months.
Monthly reports are good for long-term trend analysis and planning. Also, seasonal
businesses may see patterns in the monthly reports: Retail Web sites can expect a
big bump during Christmas, and educational sites will have a drop during the summer
months.
Figure 22.2.
Monthly report.
- Daily report: Figure 22.3 shows the daily summary for the same five-month period.
This report can show trends over the week. Most sites show either a midweek or weekend
focus, depending on content. Some analysis programs allow you to break this report
out by date as well as day, so you can see the trends across several weeks.
Figure 22.3.
Daily report.
- Hourly report: Figure 22.4 shows my hourly summary. This report is most useful
for determining the daily peak. Heavy Web use generally begins at 6 p.m., grows to
an 11 p.m. peak, and then continues heavily until 1 a.m. A lunchtime usage spike
also occurs as workers surf the Net on their lunch hours. This report is crucial
to scaling your site.
Figure 22.4.
Hourly report.
Some analysis programs also allow you to run reports for specific periods of time
in whatever units you may need.
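You can also approximate a period report straight from the raw log. This sketch counts requests per hour of day, relying on the fact that the hour is the text between the first and second colons of each common log format entry:
# Tally hits by hour from the [dd/Mon/yyyy:HH:MM:SS ...] timestamps.
cut -d: -f2 access_log | sort | uniq -c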
Demographic Reporting
Before you get excited, be informed: In most cases, you cannot get personal demographics
information from your Web logs. You can't get users' age, sex, or income level without
explicitly asking.
If your friends in marketing would like real demographics on the average Web user,
check out the CommerceNet/Nielsen Internet user demographics survey.
What your logs can give you is a rough sense of where your visitors connect from:
- Domain report: Figure 22.5 shows the domain report, which groups requests by top-level
domain. It tells you approximately which countries your visitors are in and whether
they come from commercial, educational, or other institutions.
Figure 22.5.
Domain report.
- Host report: A related report is the "host report." It shows specifically
which hosts sent you access requests. In some cases, you might recognize hostnames
belonging to friends (or competitors). Most sites have several hosts with strange
numerical hostnames; these are often dynamically assigned IP addresses from ISPs.
Also, look for names like *.proxy.aol.com, which indicates users coming through
a proxy system, and spider6.srv.pgh.lycos.com, which indicates a Web crawler
from a major search engine.
TIP: Many Web servers give you the option either
to log the user's IP address or to look up the actual hostname at the time of access.
Many analysis programs perform a lookup for you as they analyze the logs. The choice
is yours, and the trade-off is speed: Either you have a small delay with every hit
as the server does the lookup or a big delay in processing as the analysis program
looks up every single address.
Page Reporting
One of the most interesting questions you can ask of your logs is this: What do
people look at most on my Web site? Figures 22.6 and 22.7 show the reports that answer
this question.
- Directory report: If you organize your content correctly, the directory report
can give you a quick overview of the popular sections of your Web site. Another good
use for this report is to separate out image hits; you can store all the images in
a single directory, and the remaining hits will reflect only content impressions.
Figure 22.6.
Directory report.
- Request report: Possibly the most useful report you can generate, the request
report shows hits to individual pages. Note that some pages are miscounted by this
report. For example, the root directory / redirects to index.html.
To get an accurate count for this page, you need to add the two counts together.
Figure 22.7.
Request report.
Advanced Analysis
The basic reports I've talked about merely summarize the access logs in different
ways. Some more advanced analysis methods look for patterns in the log entries. Two
useful patterns that the access log entries can produce are user sessions and session
paths.
Sessioning
Some advanced analysis programs allow you to try to distinguish unique visits
to your site. These programs usually define a session as a series of requests from
a specific IP address within a certain period of time. After a session is defined,
the program can give you additional information about the session. Over time, you
can gather aggregate information that may be useful for marketing and planning, including
- Total unique visitors per period of time
- Average total session duration
- Average pages visited per session
Sessioning is not an exact science. If multiple users come from the same IP address
during the same period, those hits can't be used for sessioning. Therefore, users
from online services that use Web proxies (including AOL and Prodigy) can't be tracked
with sessioning. Also, dynamic IP addresses that are frequently reassigned can't
be reliably tracked by sessioning. Despite these weaknesses, you may still be able
to gain some interesting information from an analysis program that allows sessioning.
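To see roughly how sessioning works, consider this simplified sketch. It assumes the access log has been pre-processed into host and epoch-seconds pairs sorted by time (the host_times file here is hypothetical), and it starts a new session whenever a host has been idle for more than 30 minutes:
# Count one session per host, plus another each time that host
# goes quiet for more than 1800 seconds (30 minutes).
awk '{ if (!($1 in last) || $2 - last[$1] > 1800) sessions++
       last[$1] = $2 }
     END { print sessions " sessions" }' host_times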
Pathing
If you can identify a specific user session, you can follow that user's path from
page to page as he or she navigates through your Web site. Advanced analysis programs
look at each session and find the most frequently followed paths. Experienced Webmasters
use this data to determine the most popular entry pages, the most popular exit pages,
and the most common navigation paths.
Log Accuracy
Your Web server logs provide a wealth of useful data to Webmasters, marketers,
and advertisers. Unfortunately, the raw log by itself is not a reliable source for
accurate counts of your site usage. A number of factors can cause the output of reports
run on your raw logs to be significantly overstated or understated.
If your log understates usage, it can quickly cause measurable damage to your
bottom line. What if you run an advertising-supported Web site, and your ad impressions
are 10 percent off? What if you have carefully scaled your Web site to perform
well under peak load, as forecast by your raw logs, only to find that you are under-built
by up to 25 percent? In the following sections, I describe some causes of these inaccuracies
and ways to mitigate those risks.
Adjusting for Caching
The biggest problem that affects your log accuracy is content caching. If a piece
of content is cached, it is served to the user from a store, either locally
on the user's hard drive or from an ISP's proxy system. When content is served from
a cache, often no request is made to your server, so you never see any entry in your
logs.
In most cases, caching is a good thing: It improves the user experience,
lessens the load on the Net, and even saves you money in hardware and network costs.
You might want to optimize your site to take advantage of caching, but before you
do, you should consider the effects that caching will have on your log files. In
fact, in only a few cases will you want to consider defeating a caching system:
- Advertising: If you run an ad-supported Web site, every ad impression is directly
related to revenue. Although you might be able to defeat caching of your ads in some
cases, your best bet is to employ an actual ad server that handles rotating the ads.
These ad servers usually have built-in capability to defeat some caching systems.
- Dynamic content: Most content is updated only every few hours or days. For this
kind of content, you probably don't need to worry about caching. A few hours of updating
should not affect the timeliness of your data. But if your data truly updates every
few minutes (for example, stock quotes), you might want to defeat caching. Note that
you want to defeat the cache only for the HTML page itself; let the graphics be cached.
- Secure data: You may be handling sensitive data that you do not want saved in
a proxy or local cache. Note that most proxy systems and browsers do not cache SSL-encrypted
data at all, so if you're using this form of security, you are already taken care
of.
If you want to take advantage of your user's proxy and local cache, you should
try to determine what percentage of your hits are understated because of the cache.
You can then use this figure as a rule of thumb for future analysis.
Local Caching
Most Web browsers keep a local cache of content and serve out of that cache whenever
possible. Some browsers send a special kind of request called a "get-if-modified-since"
that, in effect, asks the server whether the document has been updated. If the server
finds that the document has been updated, it returns the new document. If it finds
that the document is the same, it returns a status code 304. Status code
304 tells the browser to serve the document out of the cache.
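Your access log can tell you how much of your traffic is being answered this way. This sketch, again assuming common log format, reports what share of all requests ended in a 304:
# The status code is the next-to-last field; report the 304 share.
awk '$(NF-1) == 304 { n++ }
     END { if (NR) printf "%d of %d requests (%.1f%%) were 304s\n", n, NR, 100*n/NR }' access_log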
TIP: According to FIND/SVP, as much as
one third of all Web traffic originates with America Online users. Depending on your
audience, a significant proportion of your traffic might be coming from behind AOL's
caching system and through its proprietary browser. For the inside scoop on how to
best program for that environment, check out AOL's Web site at
http://webmaster.info.aol.com
The site contains details on AOL's browsers, proxy system, and other useful stuff.
Some browsers support methods of defeating the cache on a page-by-page basis.
You should use these methods sparingly; caching is your friend! By inserting the
following HTTP headers, you might be able to defeat caching for the pages that follow
them:
HTTP 1.0 header: Pragma: no-cache
HTTP 1.0 header: Expires: Thu, 01 Dec 1997 16:00:00 GMT
HTTP 1.0 header: Expires: now
HTTP 1.1 header: Cache-Control: no-cache
HTTP 1.1 header: Cache-Control: no-store
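How you send these headers depends on your server; for script-generated pages, the script itself can emit them. The following is a minimal CGI sketch (the stock-quote page is hypothetical) that marks its own output as uncacheable:
#!/bin/sh
# A minimal CGI script whose output should not be cached.
echo "Content-type: text/html"
echo "Pragma: no-cache"
echo "Expires: now"
echo ""
echo "<HTML><BODY>Quotes generated at `date`</BODY></HTML>"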
Proxy Caching
Many corporations and large ISPs, including America Online, use a caching proxy
for their members' Web access. Besides the normal security role of a proxy, these
servers keep a copy of some content closer to the members. This way, these ISPs can
provide faster Web service and significantly ease the load on the Internet.
Proxy caches can be configured to keep content for a certain length of time or
until the file reaches a certain age. If you want to ensure that your content is
not cached, you can try several things. First, many caching proxies follow the instructions
of the expires and cache-control headers listed in the preceding section. In addition,
some proxies do not cache any requests that contain cgi-bin or a question
mark because these characters usually denote dynamic, script-generated pages.
CAUTION: Each ISP has different "rules"
for what is cached and for how long. Some follow all the rules outlined previously,
and some follow none. To make things worse, some ISPs occasionally change their
caching rules. If you're concerned about your content being held in a proxy cache,
you should periodically test to see if your content is cached by that ISP.
Analysis Tools
As you saw earlier in the chapter, you can analyze your logs manually using a
wide variety of text manipulation tools. This kind of analysis gets tedious, however,
and is hard to maintain. To get the most useful data from your web server logs, you
will probably want to invest the time and money to choose, install, and use a web
server analysis tool.
Choosing an Analysis Tool
There are literally hundreds of analysis tools on the market, ranging from simple
freeware Perl scripts to complicated database-driven applications. Because the market
is so new, it's easy to become confused about exactly which features you need for
your application. Before you select an analysis tool, be sure you know:
- What kind of analysis you intend to perform.
- What format you prefer for the output.
- How much enterprise support you need from the tool.
- What platform you intend to use.
- How large your log files will be.
Type of Analysis
The most important question to ask yourself when evaluating analysis programs
is "Exactly what information am I looking for?" If you are only looking
for basic access analysis, such as hits over a specific period of time, or basic
web demographics, then almost any analysis program will suffice.
As your needs become more sophisticated, you'll need to make sure your package
will support advanced analysis features. Generally, advanced features such as pathing
and sessioning are only available in commercial packages costing hundreds of dollars.
Output Quality
Analysis programs vary widely in the overall attractiveness of their report output.
Almost all programs create HTML files as the primary output format, and many create
graphs and tables within those pages. This kind of output is generally acceptable
for your own analysis, but falls short for some business applications.
If you intend to distribute your web log reports to clients, partners, or investors,
consider using a more advanced package that offers better page layout. Many commercial
packages will provide output in document formats (for example, Microsoft Word) with
embedded color tables and graphs.
Enterprise Support
Most analysis programs are designed for the single server website. They expect
to read only one log file, and build relative links from only one home page. If your
website spans more than one server, or you manage several different websites, you
may want to consider getting an advanced analysis package.
Analysis programs which have "enterprise support" can handle multiple
log files, and can build reports which represent multiple websites. They allow you
to group websites to present consolidated data across several servers. This kind
of support, unfortunately, is mostly only found in the most expensive packages.
Platform
Not all analysis programs are available for all UNIX versions, and many are available
only for Windows NT. If you are going to be running your analysis on the same machine
as your web server, you need to ensure that your analysis program is compatible with
your UNIX version.
You don't necessarily have to run your analysis program on the same machine as
your webserver. In fact, it may be desirable to have a different machine dedicated
to this task. Log analysis can have a heavy impact on your machine performance, in
both CPU utilization and disk usage. If you are going to have a machine specifically
for log analysis, then you can get the hardware to support the software which has
the features you like.
Speed
As your access logs quickly grow to several megabytes in size, analysis speed
becomes an issue. Check to see how fast your analysis program claims to run against
larger files: Most vendors will give you a metric measured in "megabytes processed
per minute."
Log processing speed does not always grow linearly: As your logs get bigger, some
analysis programs will get progressively slower. Before you invest in an expensive
processing program, test the performance on some real logs--and be aware that some
of the fastest programs are freeware.
Popular Tools
Prices for analysis programs vary widely, but they tend to fall into one of three
categories: freeware tools, single-site commercial products, and enterprise commercial
packages.
Shareware/Freeware Analysis Tools
The quickest way to get into web analysis is to download one of the very capable
pieces of freeware on the market. These programs can quickly digest your access logs
and give you very usable information immediately. In addition, source is often available
for you to add your own special touches. They often lack some of the advanced features
of the commercial tools, but try these out before you spend hundreds (or thousands)
of dollars on another tool:
- getstats: Available on several UNIX platforms, getstats is the fastest
of all the analysers. It is also the hardest to configure and only generates basic
analysis. But if you have a large log file and only need a quick snapshot of your
usage, try this one out.
(http://web.eit.com/goodies/software/getstats/)
- http-analyze: Almost as fast as getstats, and with much nicer
output features. The latest version of this program does 3D VRML output, and handles
extended logs such as user_agent and referrer.
(http://www.netstore.de/Supply/http-analyze/)
- analog: My personal favorite shareware analyser,
this program is available for UNIX, Windows, Mac, and (gasp) VMS. Besides being pretty
fast, analog handles extended logs and is extremely configurable.
(http://www.statslab.cam.ac.uk/~sret1/analog/)
TIP: An extremely interesting writeup
on the comparative performance of several freeware tools (complete with links to
the homepage of each tool) is available at:
www.uu.se/software/getstats/performance.html
Commercial Analysis Tools
Most serious business applications will eventually require a commercial analysis
tool. Besides being more robust and feature rich, these products include upgrades
and technical support that most MIS departments need. Prices on these packages can
range from $295 to $5,000 and higher, depending on your installation. Many of the
products are available for a free trial download on their website so you can try
before you buy.
- Accrue Insight: This unique software is the most expensive of all these
packages, but it works differently than all the others. Instead of analyzing the
logs themselves, it sits on the network and measures traffic between clients and
your server.
(http://www.accrue.com)
- Microsoft Site Server: Formerly Interse Market Focus, this package has a SQL
server backend and provides one of the most robust feature sets on the market. This
comes at a price, however.
(http://www.backoffice.microsoft.com)
- Whirl: A newcomer to the market, this package is optimized to support multi-server
enterprises. The system creates data sets in Microsoft Excel which can then be manipulated
for optimal reporting.
(http://www.interlogue.com)
- Web Trends: The leading single server/intranet solution, this Windows package
has an easy-to-use UI for its extensive features. The latest version of this software
has a report caching technology that allows quick repeat runs of large logs. It can
also be scheduled to run periodically as a Windows NT service.
(http://www.webtrends.com)
- Net.analysis: This product is a single server analyser which provides extensive
real-time or batch mode site activity reports.
(http://www.netgen.com)
Summary
In this chapter, you learned about tracking Web server usage. This data, which
is primarily stored in the access and error logs, provides information that helps
you scale, program, and advertise on your Web site.
The access log tracks each attempted request and provides you with the bulk of your
server activity information. The extended logs help you track which browsers were
most used to access your site and which sites passed the most traffic to you.
Basic analysis includes counting the entries in the access log in a number of
different ways. The simplest statistics you can gather are summaries of different
types of accesses, including successes and failures. Looking at traffic over time,
in hourly, daily, and monthly reports, is also useful. Finally, the logs provide
you with limited "demographic" information about your visitors, such as
which country they are in and whether they are from commercial or educational institutions.
Advanced analysis involves looking for patterns in the accesses. Sessioning is
the process of identifying unique visits and determining the duration and character
of the visit. Pathing is looking for the most common navigational paths users took
during their visit.
Unfortunately, the access logs are not necessarily reliable sources of data. Several
factors can affect your log's accuracy, most importantly caching. Local caching and
proxy caching can both cause your log numbers to be understated.
Finally, you learned about several tools that are available to assist you in analyzing
your server activity. Many tools are freely available over the Net, whereas others
are commercial products that include support and upgrades. Some companies download,
audit, and process your logs for you for a monthly fee.
©Copyright,
Macmillan Computer Publishing. All rights reserved.