The first step in attacking a website is to understand it — to map out its topology and technology. The first step towards protecting a site is to know what others know about you — which essentially means carrying out the same assessment.
Hackers and pen-testers have a number of sophisticated tools at their disposal. But here we’ll be looking at how you can ‘footprint’ a site — gather essential information about it — using only the most basic software typically bundled with operating systems. And these are techniques that are unlikely to trigger an IDS or raise a sysadmin’s suspicions.
Netcraft
Your first stop is not the site itself but Netcraft.com. Typical information provided includes: the IP address; the registered name and real-world address of the domain owner; the domain’s nameserver; reverse DNS for the domain, revealing who is hosting the site (plus contact email); uptime stats for the server, including when it was last rebooted; when the site was first seen; OS and web server (with version and module details); and more.
Netcraft.com can be a one-stop shop for basic footprinting. But it doesn’t always provide this much information, so let’s take a look at the site itself.
Browsing the site
If the site is based on off-the-shelf software you may find public displays of not only the software it’s using but the precise version, and also possibly the operating system. Searching pages for the phrases “powered by” and “running on” produces results far more often than it should.
As you browse the site, make a note of URLs. These may reveal interesting sub-directories and parameters. You should carefully analyse anything after the question mark (the query string) to see if it offers opportunities for SQL injection or for passing carefully crafted parameters to scripts. In particular, anything that looks like a filename, path, database field name or query might reveal exploitable flaws.
The URLs will usually tell you which scripting platform the site is using, through the script filename extensions. Common extensions like .php, .asp, .jsp and so on are easy to spot. However, some CMS software may disguise this through URL rewriting: for example, MODx allows you to add any extension you like to a page’s permalink address, so pages may appear to be static HTML. In that case, a little more digging is required.
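As a rough illustration of the note-taking described above, the following Python sketch splits a URL into its path, a guess at the scripting platform and its query parameters. The extension table is only illustrative (it covers the three extensions named above) and the sample URL is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl
import posixpath

# Illustrative mapping only; extend it with whatever extensions you encounter.
PLATFORM_HINTS = {
    ".php": "PHP",
    ".asp": "ASP",
    ".jsp": "JSP",
}

def summarise_url(url):
    """Return the path, a platform guess and the query parameters of a URL."""
    parts = urlsplit(url)
    extension = posixpath.splitext(parts.path)[1].lower()
    return {
        "path": parts.path,
        "platform": PLATFORM_HINTS.get(extension, "unknown"),
        "params": dict(parse_qsl(parts.query, keep_blank_values=True)),
    }

if __name__ == "__main__":
    # Hypothetical URL noted while browsing.
    print(summarise_url("http://www.example.com/shop/item.php?id=42&file=report.pdf"))
```

A result of "unknown" for a page that is clearly dynamic is itself a hint that URL rewriting may be in play.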
Contact pages may offer names, phone numbers, locations and email addresses — including the details of admin or support people — that you can exploit for social engineering.
HTML source
Take a look at the source code. Comments, either inserted manually or generated by the CMS, may provide clues as to what’s going on behind the scenes. You may even find contact details for the web admin or developer.
Pay close attention to all links and image tags. They will help you map the filesystem structure and may reveal the existence of directories and files you wouldn’t otherwise know about.
Forms generally feed into scripts and/or databases, and the ‘action’ attribute gives you the destination of the script itself. Careless programming might also have left hackable hidden fields. Entering fake data (perhaps with a throwaway email address that you’ve just created at Hotmail) may reveal some detail about how the script works — for example, how well it sanitises HTML or SQL code that you embed in the fields.
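Reading source by eye works for a page or two; for more, a small script can pull out the same items. This is a minimal sketch using Python’s standard html.parser module. The class name, the sample markup and the ‘ExampleCMS’ comment are all made up for illustration.

```python
from html.parser import HTMLParser

class SourceScanner(HTMLParser):
    """Collect comments, link/image targets, form actions and hidden fields."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.paths = []
        self.form_actions = []
        self.hidden_fields = []

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.paths.append(attr["href"])
        elif tag in ("img", "script") and attr.get("src"):
            self.paths.append(attr["src"])
        elif tag == "form":
            self.form_actions.append(attr.get("action") or "")
        elif tag == "input" and (attr.get("type") or "").lower() == "hidden":
            self.hidden_fields.append((attr.get("name"), attr.get("value")))

# Hypothetical page source, standing in for a saved copy of a real page.
SAMPLE_HTML = """
<!-- generated by ExampleCMS 1.2 -->
<a href="/docs/manual.pdf">Manual</a>
<img src="/assets/img/logo.png">
<form action="/cgi-bin/contact.pl" method="post">
  <input type="hidden" name="recipient" value="admin@example.com">
</form>
"""

if __name__ == "__main__":
    scanner = SourceScanner()
    scanner.feed(SAMPLE_HTML)
    print(scanner.comments, scanner.paths, scanner.form_actions, scanner.hidden_fields)
```

The paths it collects are a starting point for mapping the directory structure; the hidden fields show what the form passes to its script behind the scenes.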
It might be worth perusing any JavaScript, too. Anyone incompetent enough to put hackable client-side code on their site shouldn’t be in this business. However, the heavy use of Ajax these days might lead you to server-side scripts designed to feed back information as XML or JSON data. It’s easy to monitor the information sent back with a tool like the Firebug add-on for Firefox. Maybe the server is broadcasting more information than it needs to.
Cookies
Examining any cookies set by the server might tell you something about what software it’s running or how it’s behaving. For example, if you see cookies called CP, CFID and CFTOKEN you can be pretty sure the site is running ColdFusion. If you spot WEBTRENDS_ID, you know it’s using WebTrends to track visitor behaviour.
You may also see session and other cookies that identify any scripting platforms — eg, PHPSESSID tells you the site’s running on PHP and is probably using session variables. Firefox has some useful add-ons that allow you to view, and even edit, cookies.
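The cookie-name clues above are easy to turn into a lookup table. The sketch below assumes you have already collected the raw Set-Cookie header values (for example from a browser add-on); the table holds only names mentioned in this article and the function name is our own.

```python
# Cookie names from the text, mapped to what they suggest.
COOKIE_HINTS = {
    "CFID": "ColdFusion",
    "CFTOKEN": "ColdFusion",
    "WEBTRENDS_ID": "WebTrends",
    "PHPSESSID": "PHP sessions",
}

def identify_from_cookies(set_cookie_values):
    """Return the set of technologies hinted at by Set-Cookie header values."""
    found = set()
    for value in set_cookie_values:
        # The cookie name is everything before the first '='.
        name = value.split("=", 1)[0].strip().upper()
        if name in COOKIE_HINTS:
            found.add(COOKIE_HINTS[name])
    return found

if __name__ == "__main__":
    # Hypothetical header values.
    print(identify_from_cookies(["PHPSESSID=abc123; path=/", "CFID=1001; path=/"]))
```

A match is a hint rather than proof, since cookie names can be changed in most platforms’ configuration.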
Interesting files
In the interests of search engine optimisation, many sites now produce sitemaps — an XML format listing of all the publicly accessible pages. While sensitive locations and files should never appear in the sitemap, it’s possible that a misconfiguration or misunderstanding by the webmaster may reveal files that simply shouldn’t be there. In any case, it’s a quick way of building up a map of the basic filesystem structure.
Just add ‘sitemap.xml’ or ‘sitemap.xml.gz’ (because they are often gzipped) to the base URL of the site, eg:
http://www.example.com/sitemap.xml
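Once you have downloaded the sitemap, listing its URLs takes a few lines. This sketch assumes the file follows the standard sitemaps.org format and that you pass in its raw bytes; it gunzips them first if they start with the gzip signature. The sample document is invented.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org-format files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(raw_bytes):
    """Return the <loc> URLs from sitemap bytes, gunzipping if necessary."""
    if raw_bytes[:2] == b"\x1f\x8b":  # gzip magic number
        raw_bytes = gzip.decompress(raw_bytes)
    root = ET.fromstring(raw_bytes)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

# Hypothetical sitemap content.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/products/list.php</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```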
Most sites also have a robots.txt file, used to tell search engines, spiders and crawlers of all kinds which files they should and shouldn’t index. This saves some bandwidth and keeps files of no interest out of search results. It also tells you which files and folders the webmaster does not want you to look at, and it is these disallowed entries that interest us most. Viewing http://www.example.com/robots.txt we might see something like:
user-agent: *
Disallow: /assets/
Disallow: /cgi-bin/
Disallow: /secure/
The ‘*’ wildcard means that the instructions apply to all bots. It’s followed (in this instance) by a list of directories that bots should not visit. To visit one of these yourself, just append the disallowed path to the base URL of the site, eg:
http://www.example.com/secure/
If the directory is properly protected, you should just see the index page for that directory. But you could also try adding filenames to the end of the URL to see what happens. Be warned, however, that canny webmasters sometimes add a location like that to the robots.txt file to test for bad bots (or hackers) who ignore the instructions. The directory is likely to contain a honeypot that will log the IP of any visitors.
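Turning the Disallow lines of a robots.txt file into full URLs can be sketched as follows. The function only builds the list and does not request anything, so the honeypot risk only arises when you choose to visit an entry. The function name is our own.

```python
from urllib.parse import urljoin

def disallowed_urls(base_url, robots_text):
    """Return full URLs for every Disallow path in a robots.txt body."""
    urls = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            urls.append(urljoin(base_url, value.strip()))
    return urls

# The robots.txt example from the text.
SAMPLE_ROBOTS = """user-agent: *
Disallow: /assets/
Disallow: /cgi-bin/
Disallow: /secure/
"""

if __name__ == "__main__":
    for url in disallowed_urls("http://www.example.com/", SAMPLE_ROBOTS):
        print(url)
```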
Sites running PHP may also have a publicly viewable php.ini file — eg:
http://www.example.com/php.ini
Aside from confirming that the site uses PHP, this may tell you whether safe mode is enabled or disabled (sites with it disabled have suffered vulnerabilities in the past), as well as offering up file path info, what extensions are in use and so on.
Headers
The headers sent with the web page have a few things to tell us. Perhaps the simplest way of seeing headers is from the command line using telnet, specifying the hostname (not the full URL) and port, eg:
telnet www.example.com 80
Once the connection is open, you can type a request line, optionally followed by one or more headers, and finish with two carriage returns (that is, a blank line). To get some basic headers, use:
HEAD / HTTP/1.0
Here’s a typical response:
HTTP/1.1 200 OK
Date: Tue, 10 Mar 2009 10:27:49 GMT
Server: Apache/2.2.11 (Unix) mod_ssl/2.2.11 OpenSSL/0.9.8i DAV/2 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635
Accept-Ranges: bytes
Connection: close
Content-Type: text/html
Connection closed by foreign host.
So we know which web server and version is being used, plus some basic details about its configuration, such as the SSL module, the OpenSSL version and the FrontPage extensions. Server strings often advertise a PHP version as well, though this doesn’t confirm that the site itself employs PHP.
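If you’d rather script the telnet session, the exchange is simple enough to reproduce with a plain socket. This is a minimal sketch, not a robust HTTP client: it assumes an unencrypted server on port 80, and the function names are our own. The sample response is a shortened copy of the one shown above.

```python
import socket

def fetch_head(host, port=80, timeout=10):
    """Send 'HEAD / HTTP/1.0' over a plain socket and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("iso-8859-1")

def parse_headers(raw):
    """Split a raw response into (status line, {lowercased name: value})."""
    lines = raw.replace("\r\n", "\n").split("\n")
    headers = {}
    for line in lines[1:]:
        if not line.strip():  # blank line ends the header section
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0].strip(), headers

def server_components(headers):
    """Map each product/version token in the Server header to its version."""
    tokens = headers.get("server", "").split()
    return dict(token.split("/", 1) for token in tokens if "/" in token)

SAMPLE_RESPONSE = "\r\n".join([
    "HTTP/1.1 200 OK",
    "Date: Tue, 10 Mar 2009 10:27:49 GMT",
    "Server: Apache/2.2.11 (Unix) mod_ssl/2.2.11 OpenSSL/0.9.8i DAV/2",
    "Connection: close",
    "",
    "",
])

if __name__ == "__main__":
    status, headers = parse_headers(SAMPLE_RESPONSE)
    print(status, server_components(headers))
```

To query a live server, pass the output of fetch_head("www.example.com") to parse_headers instead of the sample. Note that the dictionary keeps only the last value of any repeated header.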
Other options
Replacing HEAD with OPTIONS may return headers detailing which HTTP methods the server accepts: Apache provides an ‘Allow:’ header while IIS uses ‘Public:’. You’d expect to see HEAD, TRACE, POST, OPTIONS and GET. In the unlikely event you see options like RMDIR, MKDIR, MOVE, DELETE you know there may be opportunities for misbehaving.
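Checking the advertised methods against the list you’d expect can be sketched like this. The fetch function uses Python’s standard http.client module, which sends an HTTP/1.1 request with a Host: header rather than the bare HTTP/1.0 line you would type into telnet; the set of ‘expected’ methods is simply the one given above, and the function names are our own.

```python
import http.client

# Methods the text treats as routine; anything else deserves a closer look.
EXPECTED_METHODS = {"HEAD", "TRACE", "POST", "OPTIONS", "GET"}

def fetch_options(host, port=80, timeout=10):
    """Send 'OPTIONS /' and return the response headers with lowercased names."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("OPTIONS", "/")
        response = conn.getresponse()
        return {name.lower(): value for name, value in response.getheaders()}
    finally:
        conn.close()

def unexpected_methods(headers):
    """Return methods advertised in Allow: or Public: beyond the expected set."""
    advertised = set()
    for name in ("allow", "public"):
        for method in headers.get(name, "").split(","):
            if method.strip():
                advertised.add(method.strip().upper())
    return sorted(advertised - EXPECTED_METHODS)

if __name__ == "__main__":
    # Hypothetical header values, as fetch_options() might return them.
    print(unexpected_methods({"public": "OPTIONS, GET, DELETE, MOVE"}))
```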
It’s common now for the site you’re examining to be hosted on a server concealed behind a proxy. Many organisations and hosting services use reverse proxying as a means of defence and of running multiple sites through one connection. If the proxy is passing through traffic using standard HTTP methods, such as GET and POST, you can still connect through to the server you’re querying by adding the Host: header. If our querying of www.example.com had turned up a 302 (redirect) or 403 (forbidden) response, rather than 200 OK, we could have tried the following:
HEAD / HTTP/1.0
Host: www.example.com
Other environments use the HTTP CONNECT method for proxying. In this case, you’ll need to use CONNECT instead of HEAD and you also have to supply the port number. For example:
CONNECT www.example.com:80 HTTP/1.0
It’s sometimes possible to connect to mail servers this way, using the appropriate subdomain name, such as ‘mail.example.com’ and port 25. Spammers have been known to use the POST method in a similar fashion to send fake mail via accessible servers. Even if the attempt fails using these methods, with a 405 (method not allowed) response, you may still be treated to the headers you’re after.
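The requests in this section differ only in the method, the target and whether a Host: header is present, so a scripted version needs just one small helper to build them. This is a sketch with a function name of our own choosing; it builds the bytes to send and nothing more.

```python
def build_request(method, target, host=None, version="HTTP/1.0"):
    """Return the bytes of a minimal HTTP request, optionally with a Host: header."""
    lines = ["{} {} {}".format(method, target, version)]
    if host:
        lines.append("Host: {}".format(host))
    # A blank line (two CRLFs in a row) ends the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

if __name__ == "__main__":
    print(build_request("HEAD", "/", host="www.example.com"))
    print(build_request("CONNECT", "www.example.com:80"))
```

The trailing blank line is the scripted equivalent of pressing return twice in telnet.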
Full page
To get the full headers for a page, use GET instead of HEAD with the page filename. For example:
GET /index.html HTTP/1.0
Note the leading slash. You may also need to use the Host: header for reverse-proxied sites.
Here’s an example of a returned header that nicely confirms what environment is in use:
X-Powered-By: PHP/5.2.8
If you see a 3xx series response (most often 302) this means redirection is being used. This might be proxying, but it’s also used by phishers and other miscreants to disguise the true location of the site and get around blacklists and spam filters. The headers should include at least one starting with ‘Location:’ and the URL of the domain or page to which you’re being redirected, sometimes with an IP address.
Headers may even reveal the internal IP address of the server within its local network — look for a header starting ‘Content-location:’. This may make it possible to employ reverse DNS querying techniques to map the local network behind a firewall.
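Spotting a leaked internal address by eye is easy to miss, so here is a sketch that scans header values for private IPv4 addresses using Python’s standard ipaddress module. It assumes the headers are already in a dictionary of name to value; the function name and sample values are hypothetical.

```python
import ipaddress
import re

# Four dot-separated groups of one to three digits.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def private_addresses(headers):
    """Return {header name: [private IPv4 addresses found in its value]}."""
    leaks = {}
    for name, value in headers.items():
        found = []
        for candidate in IPV4_PATTERN.findall(value):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looked like an address but is not valid
            if address.is_private:
                found.append(candidate)
        if found:
            leaks[name] = found
    return leaks

if __name__ == "__main__":
    # Hypothetical headers.
    print(private_addresses({
        "content-location": "http://192.168.0.12/index.html",
        "server": "Apache/2.2.11",
    }))
```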
You can also view headers using browser plug-ins (such as Firebug) or tools such as wget (with the -S option) or curl (with the -v option).
Valuable picture
These are basic techniques. There’s a lot more you can achieve with wget, curl, whois, dig and, of course, the power tools of network analysis like Nmap. And Google hacking (as covered in our previous issue) may reveal actual vulnerabilities. But you can use these simple techniques with the most basic tools to build a fast and valuable picture of any target.
Resources
'Network Security Assessment' by Chris McNab (O'Reilly, Second Edition 2007): this has invaluable details about HTTP method vulnerabilities as well as more advanced footprinting techniques.







