Designing URIs

ASP.NET is essentially an advanced request-processing framework. Naturally, the URI is the most important part of any request (or should be). URIs should be well designed, and should represent the requested content accurately and succinctly. Unfortunately, they are frequently misused, which causes browsers, users, and search engines no end of trouble.
Some misuse URIs by making them too generic; some sites have everything on the home page. Flash, AJAX, and frames are the biggest culprits here, as they are capable of making big changes to the current content of the page without affecting the address bar. Users of this type of site are frustrated because if they bookmark a buried page in the site, the bookmark only records the address of the home page. The back button also betrays them - it doesn't undo their last action anymore, but plops them completely off the site. Search engines dislike these sites because either (1) they can't access buried content due to its form (JavaScript or Flash) or (2) they can access it, but all keywords are diluted by the massive amount of content available on one page.
Although you can't change the main portion of the URI without reloading the page, you can modify the fragment to your heart's content with JavaScript. Adobe has released a BrowserManager class that makes "deep linking" easier, and vastly improves the user's experience on all-Flash sites.
Some developers take the misuse to the opposite extreme. They feel that the address bar is the perfect place to store all variables, interface state data, and user preferences. They, too, cause problems for both users and search engines. Users bookmarking or e-mailing such links often find that they no longer work after their session has expired, or after a change was made on the site. The length and complexity of these URIs also make them hard to understand, and many users depend on the address bar to understand where they are located on the site. URLs longer than 80 characters are also a pain to e-mail: many e-mail clients will break the URL in half, making it unusable. Search engines find this type of URI confusing, because they see (and rank) each unique URI as a separate page, and dilute ranking accordingly.

So, you ask, what makes a good URI?

  • It should be as short as possible. Don't sacrifice consistency or obviousness, but be brief.
  • Organize and name things logically. ASP.NET isn't always helpful in keeping a clean structure, so I highly recommend that you use a URL rewriting module. URIs should be 'hackable'.
  • URIs should be deterministic.
    • No two URIs should ever display the same page.
    • The same URI should always display the same content.
  • The query string should only contain data that AFFECTS THE QUERY. If it doesn't describe the content, it doesn't belong. Keep the query string for queries, please.
  • The URI path should not rely on cryptic or numerical identifiers. If it does, it should also provide a human-readable title. It's really nice to be able to look at a URL and guess what it contains - especially when you have a long list of them. As a bonus, search engines absolutely love URIs that match keywords.
    • Tip: Don't try to spam URLs with keywords. Density algorithms are applied here, also. As with page titles, pick one keyword and stick with it.
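The "deterministic" and "query string" rules above can be enforced mechanically. The following is a minimal Python sketch (not ASP.NET code) that reduces a URI to one canonical form by dropping query parameters that don't describe the content and sorting the rest. The parameter names in ALLOWED_PARAMS, and the sessionid and theme parameters in the usage note, are hypothetical and only serve the illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: the only parameters that change what the page shows.
ALLOWED_PARAMS = {"page", "sort"}


def canonical_url(url, allowed=ALLOWED_PARAMS):
    """Return one canonical spelling of a URI.

    Parameters outside the whitelist are dropped, the remaining ones are
    sorted, and the fragment is removed (it is never sent to the server).
    """
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    pairs.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))
```

With this sketch, /products?sort=price&sessionid=abc123&page=2 and /products?page=2&sort=price both reduce to /products?page=2&sort=price, so a site could redirect any non-canonical spelling to the canonical one.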
Further reading (by Tim Berners-Lee): http://www.w3.org/Provider/Style/URI.
I strongly suggest that all ASP.NET projects use some kind of URL rewriting library, such as UrlRewritingNet. Even if you only need a single rewrite, I still recommend using a library instead of trying to do it yourself. UrlRewritingNet helps you overcome the bugs in the framework seamlessly.

Bad examples:

  • /Default.aspx?tabid=3
  • /Products/ShowProduct.aspx?prodid=4982
  • /showblog.aspx?articleid=98

Better examples:

  • /Default.aspx?tabid=3&title=ContactUs
  • /Products/ShowProduct.aspx?id=4982&product=Nokia_Wall_Adapter_12V
  • /showblog.aspx?articleid=98&title=Why_you_should_never_concatenate_SQL_commands

Even better:

  • /contact/
  • /products/4982_Nokia_Wall_Adapter_12v
  • /blog/98_Why_you_should_never_concatenate_SQL_commands
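
Paths in the "even better" style can be generated from data you already have. Here is a rough Python sketch, assuming each item has a numeric ID and a title; the function names slugify, friendly_path, and extract_id are made up for this example. The ID is what the server would look up, and the title is only there for humans and search engines.

```python
import re


def slugify(title):
    """Collapse every run of non-alphanumeric characters into one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")


def friendly_path(section, item_id, title):
    """Build a path such as /products/4982_Nokia_Wall_Adapter_12v."""
    return "/{0}/{1}_{2}".format(section, item_id, slugify(title))


def extract_id(path):
    """Recover the numeric ID from the last path segment; the title is ignored."""
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    match = re.match(r"(\d+)(?:_|$)", last_segment)
    if match is None:
        raise ValueError("no numeric ID in {0!r}".format(path))
    return int(match.group(1))
```

Because only the ID is used for the lookup, a renamed article would still resolve under its old link; to keep "no two URIs should display the same page" true, the handler would then redirect to the current title.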

WWW

The famous "www" prefix is actually pointless. You can still have ftp, mail, and smtp subdomains without forcing your website to use www. The www convention came into being because servers were typically named after their role, and HTTP was just starting out. Since web browsers only speak HTTP, you should really point your second-level domain (example.com) directly to your web server. Realize that some search engines will index www.example.com and example.com separately, since they are different locations. To prevent SSL cert and cross-domain flakiness in Flash, you should standardize on one or the other. You can force this by checking for www in Global.asax, and redirecting to the "fixed" version of the requested URI. Note that Response.Redirect() sends a 302 (temporary) redirect by default; for a permanent host name change, send a 301 status with a Location header instead.
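
The host check itself is only a few lines. This is a framework-neutral Python sketch of the decision (it is not Global.asax code, and the function name www_redirect_target is invented here): given the requested URI, it returns the www-less address to redirect to, or None when no redirect is needed. It assumes you standardized on the bare domain and ignores any user:password part of the URI.

```python
from urllib.parse import urlsplit, urlunsplit


def www_redirect_target(url):
    """Return the URL without its "www." prefix, or None if it has none."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lower-cased
    if not host.startswith("www."):
        return None
    bare_host = host[len("www."):]
    # Keep an explicit port if the original URL had one.
    netloc = bare_host if parts.port is None else "{0}:{1}".format(bare_host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

A request handler would call this once per request and, when the result is not None, answer with status 301 and the result in the Location header.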

URIs in the HTTP protocol

Let's look at how a URI is sent to the server using HTTP. Below is a basic GET request. The first line consists of the HTTP method, followed by a root-relative path (including any query string), then the protocol version. The subsequent lines contain the header collection, in the form of simple name-colon-value pairs. The two parts of the URI here are the path and query (/blog?page=2), and the Host header (youngfoundations.org). We know that the scheme is probably "http" since we are communicating using the HTTP protocol. IIS tells us which port the request arrived on, so between the pieces we can reconstruct the original URI somewhat accurately.
Note: there are LOTS of schemes out there that use the HTTP protocol, like firefoxurl://, etc.
Note: The Host header is important, since some servers host dozens of domains, and this allows IIS to forward the request to the appropriate application in shared hosting situations. Multiple domains (hostnames) can be pointed to a single application.
The path and the query are divided by the first question mark.

GET Request

GET /blog?page=2 HTTP/1.1[CRLF]
Host: youngfoundations.org[CRLF]
Connection: close[CRLF]
Accept-Encoding: gzip[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]
Accept-Language: en-us,en;q=0.5[CRLF]
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7[CRLF]
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.5; .NET CLR 2.0.50727) Gecko/20070713 Firefox/2.0.0.5 Web-Sniffer/1.0.24[CRLF]
Referer: http://web-sniffer.net/[CRLF]
The client can send content with any request, although it is typically sent with the POST method. The header collection is separated from the request body by the character sequence [CRLF][CRLF] (two line breaks in a row, i.e. a blank line). The content in the request body is described by the Content-Type and Content-Length HTTP headers.
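
To make the pieces concrete, here is a small Python sketch that takes the raw bytes of a request, splits headers from body at the blank line, and rebuilds the original URI from the request line and the Host header. It is a simplified illustration, not how IIS or ASP.NET parse requests; the function name parse_request and the scheme and port arguments (which stand in for what the server knows about the connection) are assumptions made for this example.

```python
def parse_request(raw, scheme="http", port=80):
    """Split a raw HTTP request and reconstruct the URI that was requested."""
    # The header block ends at the first blank line ([CRLF][CRLF]).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # First line: method, root-relative path (plus query), protocol version.
    method, target, version = lines[0].split(" ", 2)

    # Remaining lines: name-colon-value pairs. Header names are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # The path and the query are divided by the first question mark.
    path, _, query = target.partition("?")

    # Only mention the port when it is not the default for the scheme.
    host = headers.get("host", "")
    is_default_port = (scheme, port) in (("http", 80), ("https", 443))
    authority = host if is_default_port or ":" in host else "{0}:{1}".format(host, port)

    return {
        "method": method,
        "version": version,
        "path": path,
        "query": query,
        "headers": headers,
        "body": body,
        "uri": "{0}://{1}{2}".format(scheme, authority, target),
    }
```

Feeding it a GET for /blog?page=2 with the header Host: youngfoundations.org yields the URI http://youngfoundations.org/blog?page=2. For a POST, the length of the returned body should match the Content-Length header.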

POST Request

POST /blog HTTP/1.1[CRLF]
Host: youngfoundations.org[CRLF]
Connection: close[CRLF]
Accept-Encoding: gzip[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]
Accept-Language: en-us,en;q=0.5[CRLF]
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7[CRLF]
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.5; .NET CLR 2.0.50727) Gecko/20070713 Firefox/2.0.0.5 Web-Sniffer/1.0.24[CRLF]
Referer: http://web-sniffer.net/[CRLF]
Content-type: text/html; charset=utf-8[CRLF]
Content-length: 19[CRLF]
[CRLF]
Sample content body

Response

The HTTP response generated by your ASP.NET application looks slightly different than the request that prompted it. The general format remains, but the first line is now [HTTP version] [Status code] [Status code description]. HTTP status codes are very important, but are beyond the scope of this article.
HTTP/1.1 301 Moved Permanently[CRLF]
Connection: close[CRLF]
Date: Fri, 03 Aug 2007 00:36:57 GMT[CRLF]
Server: Microsoft-IIS/6.0[CRLF]
X-Powered-By: ASP.NET[CRLF]
Location: http://www.microsoft.com[CRLF]
Content-Length: 31[CRLF]
Content-Type: text/html[CRLF]
Set-Cookie: ASPSESSIONIDSSSBDQAT=PIJAGJDBFFLFAALAJDCGBAMI; path=/[CRLF]
Cache-control: private[CRLF]
[CRLF]
[Content-body]
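The first line of a response can be taken apart the same way as a request line. A tiny Python sketch, with the function name parse_status_line chosen just for this example:

```python
def parse_status_line(line):
    """Split "HTTP/1.1 301 Moved Permanently" into version, code, description."""
    parts = line.strip().split(" ", 2)
    version = parts[0]
    code = int(parts[1])
    # The description is optional, so tolerate its absence.
    description = parts[2] if len(parts) > 2 else ""
    return version, code, description


def is_redirect(code):
    """Status codes in the 3xx range tell the client to look elsewhere."""
    return 300 <= code < 400
```

For a redirect such as the 301 shown here, the client then reads the Location header to find out where to go next.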
Important note: If you have multiple domains pointing to one website, make sure they are all 301 redirected to precisely one host name. Otherwise you will sabotage your search engine placement by (1) diluting your page rank, and (2) being penalized for duplicate content.

URIs versus URLs

The term URL (Uniform Resource Locator) has been considered obsolete by the W3C for a long time. In its place stands the URI (the Uniform Resource Identifier). Strictly speaking, a URL must provide all of the information required to locate and retrieve a resource, while a URI is only required to identify it in relation to the current context. Thus, a URL is a URI that "in addition to identifying a resource, [provides] a means of locating the resource by describing its primary access mechanism (e.g., its network 'location')". Most people aren't aware of the difference, and use the terms interchangeably.
For example, the following URI is also a URL:
  • http://www.mysite.com:54321/folder/virtualfolder/default.aspx?param1=thisisatest&param2=test2
However, these are not:
  • ../css/shared.css [URI relative to the location of the parent document]
  • /images/banner.jpg [URI relative to the current network location (usually termed 'absolute')]
  • Logo.gif [URI relative to the location of the parent document.]
  • #requirements [URI fragment relative to current document.]
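
To see how a client turns each of these relative URIs into a full URL, you can resolve them against the address of the parent document. The sketch below uses Python's standard urljoin and assumes, purely for illustration, that the parent document is the example URL above.

```python
from urllib.parse import urljoin

# Assumed location of the parent document (the example URL from this section).
BASE = ("http://www.mysite.com:54321/folder/virtualfolder/default.aspx"
        "?param1=thisisatest&param2=test2")


def resolve(reference, base=BASE):
    """Resolve a relative URI against the parent document's URL."""
    return urljoin(base, reference)


if __name__ == "__main__":
    for reference in ("../css/shared.css", "/images/banner.jpg",
                      "Logo.gif", "#requirements"):
        print(reference, "->", resolve(reference))
```

Note that ../css/shared.css and Logo.gif depend on the folder of the parent document, /images/banner.jpg only on its scheme, host, and port, and #requirements leaves everything but the fragment unchanged.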

Fragments

Fragments describe a section, place, or entity in the current document. In HTML, they usually refer to a certain anchor tag (by name or ID). The window is usually scrolled to the location of the anchor tag. Fragments are never sent to the server computer, and only function as a display instruction to the client. If a fragment isn't understood, it is ignored. Fragments are pretty much free-form. If the current document is http://mysite.com/home.html and a link to http://mysite.com/home.html#part3 is clicked, the browser (or user-agent) is not supposed to ask the server for http://mysite.com/home.html again, but older clients may. Relative fragments like #part3 are handled more reliably.
Now let us dissect the following URL:
  • http://www.mysite.com:54321/folder/virtualfolder/default.aspx?param1=thisisatest&param2=test2
The parts are:
  • http - The scheme (protocol). The protocol determines how the client should talk to the server (basically the language, or grammar).
  • www.mysite.com - The computer the resource is located on (DNS, WINS, or IP address).
  • :54321 - The port number to communicate with on the computer.
Instead of trying to sort out incoming packets and route them to the right application on the server computer, ports are used. Certain default ports are assumed for some protocols. HTTP requests are sent to port 80 by default. HTTPS requests are sent to port 443, and FTP requests are sent to port 21. If an application is not listening on that port (or the request packets are blocked by a firewall), no response will be given. Additional sorting is sometimes performed, as in the case of WCF (.NET 3.0) port sharing, or when multiple sites are hosted on a single server.
When an HTTP request is sent to a server, it is accompanied by the original hostname from the address bar. An unlimited number of DNS (Domain Name System) addresses can point to a single computer, which is convenient for web hosting providers. IIS (Internet Information Services) can be configured to look at this host header, and forward the request to whichever site is configured to receive requests for that particular hostname (DNS address). For information about DNS, read http://en.wikipedia.org/wiki/Domain_name_system.
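
The same dissection can be done in code. This Python sketch uses the standard urlsplit and urldefrag functions; the helper names effective_port and url_sent_to_server, and the DEFAULT_PORTS table (filled in from the defaults mentioned above), are assumptions made for the illustration.

```python
from urllib.parse import urldefrag, urlsplit

# Default ports as listed in the text above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def effective_port(url):
    """Return the explicit port, or the scheme's default when none is given."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


def url_sent_to_server(url):
    """Drop the fragment, which only the client ever sees."""
    return urldefrag(url).url


if __name__ == "__main__":
    example = ("http://www.mysite.com:54321/folder/virtualfolder/default.aspx"
               "?param1=thisisatest&param2=test2#part3")
    parts = urlsplit(example)
    print("scheme:  ", parts.scheme)
    print("host:    ", parts.hostname)
    print("port:    ", effective_port(example))
    print("path:    ", parts.path)
    print("query:   ", parts.query)
    print("fragment:", parts.fragment)
```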

Super-simplified view of DNS

DNS addresses are hierarchical, and levels (domains) are separated by a period. Domains progress from most specific to least specific. For example, in resolving www.mysite.com, the following steps would be taken:
  • Ask computer 'COM' where computer 'MYSITE' is at (what its IP address is).
  • Ask computer MYSITE where computer 'WWW' is at.
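
The order of those questions falls straight out of the name itself. As a toy Python illustration of the hierarchy only (it performs no network lookups, and the function name resolution_order is invented here), the labels can be read from right to left:

```python
def resolution_order(hostname):
    """List the domains consulted, from least specific to most specific."""
    labels = hostname.rstrip(".").lower().split(".")
    # Walk from the rightmost label (the TLD) toward the full name.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]
```

For www.mysite.com this gives com, then mysite.com, then www.mysite.com.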
DNS is used for a whole lot more than just web browsing, so the company at mysite.com might have a whole bunch of computers, such as ftp.mysite.com, mail.mysite.com, pop.mysite.com, telnet.mysite.com, as well as www.mysite.com. WWW usually points to the web server for the company. Please note, however, that the WWW part is completely unnecessary, and is just a commonly followed convention.
Note: In www.mysite.com, "com" is a TLD (top-level domain), and "mysite" is an SLD (second-level domain). SLDs usually cost a registration fee, as the poor owner of the "COM" computer has tremendous bandwidth bills. Third-level domains can be freely created if the parent SLD is under your control.
One of the most critical steps in designing a web site is choosing your URI structure for the site. Clean, friendly URIs make visitors more comfortable and help them keep track of where they are on the site. Short URLs don't get wrapped as badly, are easier to type, and just look nicer.

About Nathanael

Nathanael Jones is a software engineer, husband, consultant, and computer linguist with unreasonably high expectations of inanimate objects. He refines .NET, ruby, and javascript libraries full-time at Imazen, but can often be found on stack overflow or participating in W3C community groups.
