Web Server Logs and Your Information

This article was seemingly written for Windows users, and I’m barely on my PC these days (really only when the guys and I are playing WoW together downstairs in their office), but it’s a good, informative read about web server logs, your information, and what they are all about.

What follows is the story itself, a copy/paste for my own archives and for those days in the future when the above link no longer works:

Web Server Logs and Your Information – …linking technology and productivity for Windows users:

Wednesday, 11 May 2005

I’m not sure when it started, but there is an increase in people asking about data collection and privacy. People are clearly concerned about the use of data in general and not just our site. I think this is good, but it also highlights that we haven’t done the best job in telling people what information is captured in a web server log and how it can be used. I can’t answer for all sites, but I can for ours. The answers may surprise you.

Many of you know that I’m concerned about privacy, so I was startled to get a request from a reader to remove data. However, the more I dug into the issue, the more apparent it became that people don’t know what is collected. And as we all know, fear takes on a life of its own. I hope that this article will address those concerns.

How the process starts

The process begins when your computer requests a web page from our website. Most sites maintain a web server log that records various transactions. A transaction is generally defined as getting a resource such as a web page, picture, file and so on.

The server log is at the discretion of the website and can be configured to capture various fields. Most websites maintain a server log as it contains very useful information such as traffic patterns and errors. In our case, the web server uses the Apache HTTP Server’s combined log format.

As you move through our website, multiple lines are appended to this daily log in chronological order. The reason there are multiple lines is that a web page consists of many resources such as images, text, style sheets and so on. In other words, the web page you’re viewing may appear as one item to you, but from the web server’s view there might have been a dozen requests to display the page. Each request becomes its own line item in the web server log, which is why these logs are very large.

Show me my info!

While web logs contain lots of data, I can’t say they are fun reading. The data can be useful but you need a parsing and analysis tool to make sense of the information. Below, you’ll see one item request from the raw server log that I’ve parsed to make reading easier. I’ve also numbered the data elements. In the web server log, this information appears as one long line.


(1) 65.192.81.64

(2) -

(3) -

(4) [11/May/2005:10:37:04 -0700]

(5) "GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1"

(6) 200

(7) 7538

(8) "http://www.google.com/search?q=outlook+signature+&hl=en&lr=&start=20&sa=N"

(9) "Mozilla/5.0

(10) (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)

(11) Gecko/20050414

(12) Firefox/1.0.3"
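To show how the numbered pieces fit back into one long line, here is a minimal Python sketch that splits a combined-format entry into named fields. The line is reassembled from the numbered elements above, and the field names (ip, ident, user and so on) are my own labels, not Apache's. The pattern is a simplification: it assumes no escaped quotes inside the quoted fields.

```python
import re

# Simplified pattern for one line in Apache's combined log format.
# Assumption: quoted fields contain no escaped double quotes.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The sample entry, reassembled from the numbered elements in the article.
line = (
    '65.192.81.64 - - [11/May/2005:10:37:04 -0700] '
    '"GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1" 200 7538 '
    '"http://www.google.com/search?q=outlook+signature+&hl=en&lr=&start=20&sa=N" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) '
    'Gecko/20050414 Firefox/1.0.3"'
)


def parse_line(text):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED.match(text)
    return match.groupdict() if match else None


fields = parse_line(line)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Every value comes back as a string, so anything numeric (status, size) still has to be converted before it can be counted or summed.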



(1) IP address

The first data item is the IP address of the client making the request. The client could be your computer, firewall, proxy, PDA and so on. For many people, the IP address is dynamic, meaning that it shows as 65.192.81.64 on May 11, but it might be different the next time you visit. Or, in the case of some firewalls, all the computers behind the firewall may use the same IP address.

There is additional information that can be inferred from an IP address concerning location since there is a methodology for assigning the numbers. For example, internet service providers (ISPs) or large companies may be assigned blocks of IP addresses. If you want to see how your IP is translated, go to http://www.showmyip.com/. This site provides some interesting information about your IP address.

People should also know the information isn’t always precise. A common example is the number of people who show as being near Vienna, Virginia because they’re AOL users. Rest assured your ISP would not provide your address and contact information to us unless they received a court order requesting the data.

In our case, we do a country lookup and domain name lookup. The information is aggregated and we can see how many users are from a specific country or domain. And in the interest of disclosure, I sometimes have to go to a site like http://www.answers.com/ to find out where a country like Seychelles is located.
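The idea of assigned blocks and aggregated lookups can be sketched in a few lines of Python. This is only an illustration, not the tool the site uses: the lookup table is hypothetical, the address ranges are the ones reserved for documentation, and the country labels are invented.

```python
import ipaddress
from collections import Counter

# Hypothetical lookup table. These are reserved documentation ranges and the
# country labels are invented for illustration, not real assignments.
BLOCKS = {
    ipaddress.ip_network("192.0.2.0/24"): "Country A",
    ipaddress.ip_network("198.51.100.0/24"): "Country B",
}


def country_for(ip_text):
    """Return the label of the first block containing the address, else 'unknown'."""
    address = ipaddress.ip_address(ip_text)
    for network, country in BLOCKS.items():
        if address in network:
            return country
    return "unknown"


# Made-up visitor addresses standing in for item 1 of several log lines.
visitor_ips = ["192.0.2.10", "192.0.2.99", "198.51.100.7", "203.0.113.5"]

# Aggregate: only the totals per country are kept, not the individual addresses.
by_country = Counter(country_for(ip) for ip in visitor_ips)
print(by_country)
```

A real lookup would rely on a maintained database of address assignments, which is one reason the results are not always precise.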

(2) Identity Check

At first, I thought the displayed hyphen was a delimiter, but it actually means data is not available. The field holds the identity of the client as reported by identd on the client machine. The name was a little worrisome until I read the Apache documentation that states, “This information is highly unreliable and should almost never be used except on tightly controlled internal networks. Apache httpd will not even attempt to determine this information unless IdentityCheck is set to On.” This rather reminds me of a relationship I was in. I turned the boyfriend off too.

(3) UserID

Again, the field shows as a hyphen since we didn’t collect any data. This field might show data if the article being requested was password protected and we required authentication. We do use this field for internal use to access test areas.

(4) When did the server finish the request

This is the time the server finished processing your request. The -0700 indicates our server’s clock is 7 hours behind GMT.
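For anyone who wants to work with item 4 programmatically, here is a small Python sketch that turns the bracketed timestamp into a timezone-aware value. It assumes an English month abbreviation, which is what the log uses.

```python
from datetime import datetime, timezone

raw = "[11/May/2005:10:37:04 -0700]"

# %z understands offsets such as -0700; %b expects an English month
# abbreviation under the default C locale.
stamp = datetime.strptime(raw.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

print(stamp.utcoffset())               # the server's offset from GMT
print(stamp.astimezone(timezone.utc))  # the same instant expressed in GMT/UTC
```

Keeping the offset attached makes it possible to compare entries from servers in different time zones.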

(5) What can I get you?

This line indicates what you requested. In this instance, the reader requested the article on creating Outlook signatures. The HTTP/1.1 indicates what protocol was used.

(6) Result Code

This number indicates the status code the server sent back to your computer. If everything worked, you get your request. Otherwise, you might see one of our infamous “Oops…we’re sorry” pages (aka 404 errors). In this case, the 200 indicates the request succeeded and the page was sent.
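As a quick illustration of how item 6 can be read, the Python sketch below looks up the standard phrase for a status code and sorts it into a rough category. The category names are my own wording, and the helper only handles codes that Python's standard library knows about.

```python
from http import HTTPStatus


def describe(code):
    """Return the standard phrase and a rough category for an HTTP status code.

    Raises ValueError for codes the standard library does not define.
    """
    phrase = HTTPStatus(code).phrase
    if code < 200:
        category = "informational"
    elif code < 300:
        category = "success"
    elif code < 400:
        category = "redirect"
    elif code < 500:
        category = "client error"
    else:
        category = "server error"
    return phrase, category


for code in (200, 404, 500):
    print(code, describe(code))
```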

(7) Size

This figure indicates the size of the object returned. In this case it was the size of the article, or 7538 bytes.

(8) Who sent you?

One advantage to the combined log format is it shows who referred you to our site. In the example above, the reader did a search on google.com, with the interface set to English (hl=en), for “Outlook signature”. Your browser passes this information along as the URL of the page you came from, whether that was a search engine or a link on another website.

We should mention that this referrer information isn’t based on any marketing or partnering agreements with search engines or sites. If this type of information concerns you, there are software programs that will strip this information.
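To make it concrete how search terms end up in item 8, here is a short Python sketch that pulls the referring site and the search phrase out of the sample referrer. It assumes the search engine puts the phrase in a parameter named q, as Google does in this example; other sites may use a different name.

```python
from urllib.parse import urlsplit, parse_qs

referrer = "http://www.google.com/search?q=outlook+signature+&hl=en&lr=&start=20&sa=N"

parts = urlsplit(referrer)
query = parse_qs(parts.query)  # '+' is decoded to a space; empty values are dropped

site = parts.netloc
# Assumption: the search phrase lives in the 'q' parameter.
terms = query.get("q", [""])[0].strip()

print(site)
print(terms)
```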

(9-12) Browser Information

Items 9-12 are sent by your browser and indicate which version you’re using and your operating system. In the example above, the client was using Windows NT 5.1 (better known as Windows XP) with the US English (en-US) version 1.0.3 of Firefox.
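Here is a rough Python sketch of how items 9-12 can be picked apart. It treats the string as Name/Version tokens plus one parenthesised comment, which fits this sample; real browser strings vary a lot, so take it as an illustration rather than a general parser.

```python
import re

agent = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) "
    "Gecko/20050414 Firefox/1.0.3"
)

# Product tokens look like Name/Version.
products = dict(re.findall(r"(\w+)/([\w.]+)", agent))

# The parenthesised comment holds platform details separated by semicolons.
comment = re.search(r"\(([^)]*)\)", agent)
platform = [part.strip() for part in comment.group(1).split(";")] if comment else []

print(products)
print(platform)
```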

What are you doing with my data?

The next question is whether we use all this data and why. The short answer is that we use some of the data, but not all. While web server logs collect a lot of information, that doesn’t mean it’s accurate or meaningful. We’re primarily concerned with trends and what items we might need to change.

The other important item is that to leverage the log information, webmasters need another application that can parse, sort, filter, aggregate and do lookups. We do use a third party package to help answer the following types of questions:

Are there any pages that are broken that we need to fix?

We can determine there is a problem from looking at items 5 and 6. This is an important issue since a broken or slow web page is a terrible user experience.
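A minimal Python sketch of that check follows. The sample rows are hypothetical (apart from the article request) and stand in for items 5 and 6 after parsing; the idea is simply to count failed requests per path.

```python
from collections import Counter

# Hypothetical (request line, status) pairs standing in for items 5 and 6.
entries = [
    ("GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1", 200),
    ("GET /old/page.html HTTP/1.1", 404),
    ("GET /old/page.html HTTP/1.1", 404),
    ("GET /images/missing.gif HTTP/1.1", 404),
    ("GET /search HTTP/1.1", 500),
]


def failing_paths(rows):
    """Count requests per path whose status code is 400 or higher."""
    counts = Counter()
    for request, status in rows:
        parts = request.split()
        if len(parts) != 3:
            continue  # skip malformed request lines
        if status >= 400:
            counts[parts[1]] += 1
    return counts


broken = failing_paths(entries)
for path, hits in broken.most_common():
    print(hits, path)
```

Sorting by count puts the pages that fail most often at the top, which is usually where to start fixing.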

What are people reading?

OK, no one should ever be shocked that any webmaster wants to know this information. After all, if you’re not reading the content, we don’t have a business. It only makes sense that we want to know which articles are the most read and which are the least read.
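Because one page view produces many log lines, a readership count has to ignore the images and style sheets first. The Python sketch below shows one way to do that; the list of asset extensions and all but the first path are assumptions made up for the example.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: requests with these extensions are page furniture, not articles.
ASSET_SUFFIXES = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}

# Requested paths; only the first one comes from the article, the rest are made up.
paths = [
    "/mos/Email/Outlook/Creating_Outlook_Signatures/",
    "/images/logo.gif",
    "/css/site.css",
    "/mos/Email/Outlook/Creating_Outlook_Signatures/",
    "/mos/Other/Example_Article/",
]


def is_page(path):
    """Treat anything without a known asset extension as a page."""
    return PurePosixPath(path).suffix.lower() not in ASSET_SUFFIXES


views = Counter(path for path in paths if is_page(path))
ranked = views.most_common()
most_read, least_read = ranked[0], ranked[-1]

print("most read:", most_read)
print("least read:", least_read)
```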

Hey, are you new to these parts?

As with any business, you like to get new customers and keep the regulars. This is the type of information you can get after accumulating enough daily log files. Even then, the info isn’t precise because so many people have dynamic IPs in which case they appear new to the web server. One way we could circumvent this problem is to force people to register, but we don’t.
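As a sketch of how new and returning visitors might be told apart from accumulated daily logs, the Python below compares each day's IP addresses with those seen on earlier days. The dates and addresses are made up, and the weakness described above applies: a returning reader with a new dynamic IP is counted as new.

```python
# Hypothetical sets of client IPs seen on each day (documentation addresses).
daily_ips = {
    "2005-05-10": {"192.0.2.10", "192.0.2.11"},
    "2005-05-11": {"192.0.2.10", "198.51.100.7"},
}


def new_and_returning(days):
    """For each day, count IPs never seen before and IPs seen on an earlier day."""
    seen = set()
    report = {}
    for day in sorted(days):
        today = days[day]
        report[day] = {
            "new": len(today - seen),
            "returning": len(today & seen),
        }
        seen |= today
    return report


report = new_and_returning(daily_ips)
for day, counts in report.items():
    print(day, counts)
```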

How did you find us?

As you might expect, item 8 can help us in this regard. We look at the referrer information as it indicates where someone posted information about our site or articles. This gives us an opportunity to read what was said on another website and post our comments if needed.

Just because a referrer is listed doesn’t mean we can, or want to, go back to the referring site. In one case last year, we saw a huge number of referrers from a private adult oriented group on a major portal. As much as we were curious as to why all these people were referencing one of our articles, we didn’t pursue this one. We would first need to register with this site and secondly the content, including their Privacy Policy, was in Portuguese. Hey, even we draw the line.

The biggest concern people usually have is seeing their search terms included in a log entry. I can understand this, as I never knew this happened until I looked at a web server log. The search terms are useful as it gives us an idea of what type of information people need. These keywords have also helped us with language differences where as a US based author I might use one term, but someone from Europe might use a term or phrase I might not know. Yes, I’m still trying to figure out what the Brits mean by a “punter”.

What browsers are people using?

We use item 12 to answer this question. The reason we’re interested is that different browsers handle the web code in different ways. While the differences may be subtle, there are times where we have abandoned some features because we couldn’t get them to work correctly with a specific browser.

I suppose if we had ample time and budget we would be more proactive with this information. For example, we might offer a reminder to people using older browsers to upgrade as they may be at risk.

The other reason we look at this info is there are certain bots that are designed to harvest email addresses or images from websites. Since we don’t have forums, we don’t have to worry about this too much but we still block these agents when appropriate.
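Blocking by user agent can be as simple as a substring check, as in this Python sketch. The bot names in the list are invented for illustration and do not refer to real software; a production setup would more likely do this in the web server configuration.

```python
# Hypothetical signatures; these names are made up for the example.
BLOCKED_SUBSTRINGS = ("emailharvester", "imagegrabber")


def is_blocked(agent):
    """Return True if the user-agent string contains a blocked signature."""
    lowered = agent.lower()
    return any(signature in lowered for signature in BLOCKED_SUBSTRINGS)


browser = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) "
    "Gecko/20050414 Firefox/1.0.3"
)
print(is_blocked(browser))
print(is_blocked("EmailHarvester/2.0"))
```

Since the user-agent string is supplied by the client, a determined harvester can fake it, so this only stops the unsophisticated ones.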

You downloaded how much data?

Many people have the notion that the web is free. Well, this is true if you don’t have a website. The truth is that websites have data costs in terms of storage or bandwidth transmission. This is typically set by a contract with a hosting service. If you exceed a contractual term, you pay.

In the majority of cases bandwidth isn’t an issue. We’re more than happy to provide content to people and have released articles using the Creative Commons license. After all, the intent of this website is to help people. However, we do draw the line when it becomes apparent people are copying huge chunks of our site for their economic gain.
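Spotting someone who is copying huge chunks of a site comes down to adding up item 7 per client. Here is a minimal Python sketch with made-up rows; a hyphen in the size field means no body was sent, so it counts as zero.

```python
from collections import defaultdict

# Hypothetical (client IP, size) pairs standing in for items 1 and 7.
rows = [
    ("192.0.2.10", "7538"),
    ("192.0.2.10", "-"),
    ("198.51.100.7", "1200000"),
    ("198.51.100.7", "2300000"),
]


def bytes_by_client(entries):
    """Sum the bytes sent to each client; '-' means nothing was sent."""
    totals = defaultdict(int)
    for ip, size in entries:
        totals[ip] += 0 if size == "-" else int(size)
    return dict(totals)


totals = bytes_by_client(rows)
for ip, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(ip, total)
```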

Cookies anyone?

Perhaps the most common question is whether we have cookies. Yes, but they are low-carb. Sorry, that was my rant on the recent fad diet craze. The web server does set a session cookie for www.timeatlas.com when you visit the site. The contents of the cookie vary for each user and help members with features such as recalling their user name and password. The cookie is set to expire in 24 hours.
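For the curious, this is roughly what a cookie with a 24-hour lifetime looks like when a server sends it. The Python sketch below is only an illustration: the cookie name and value are hypothetical, and it says nothing about how the site's own software builds its cookie.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session_id"] = "abc123"                 # hypothetical name and value
cookie["session_id"]["max-age"] = 24 * 60 * 60  # lifetime in seconds (24 hours)
cookie["session_id"]["path"] = "/"

# The header line a server would send along with the page.
header = cookie.output()
print(header)
```

After the lifetime runs out, the browser stops sending the cookie back and the site treats you as a fresh visitor.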

Bottom line

I suspect the above information answered some questions, but raised others. Certainly, I can answer items regarding this site, but can’t speak for other sites. The brilliance of the web is how it is interconnected, but it comes with risks. The downside is some sites do install spyware or combine server log information with other databases, which can reveal more information about you than you might be aware of. The best defense is to be vigilant about spyware and always read End User License Agreements (EULA) and Privacy Policies.

