Dynamically Generating Sitemaps Using 404 Errors
Posted on 7/22/2007
With Microsoft Internet Information Services (IIS), the page you designate to handle HTTP 404 (Not Found) errors on a website doesn’t have to return an HTTP 404 error at all. It can build dynamic content and return it with an HTTP 200 (OK) result instead. This is helpful when you want to build a sitemap.xml file to improve how search engines index your website. In this article, I’ll show you how I did this for the gotnet.biz site.
Backgrounder
There are two ways that a 404 error handler page can get invoked when using IIS with ASP.NET. For the page types registered for ASP.NET, e.g. ASPX, ASMX, etc., the <customErrors> element in the <system.web> section of your web.config file determines what page will be invoked when different kinds of errors occur. For 404 errors, ASP.NET switches to the handler page by sending the client an HTTP 302 (Found) redirect. This is unacceptable when you want a clean, transparent transfer to the handler page without any knowledge on the part of the client, because the client sees the redirect and then requests the handler page itself. However, when IIS handles a 404 error instead of ASP.NET, it does something akin to a Server.Transfer call under the hood: the handler page runs on the server and its output is returned for the original URL. This is exactly what we need to implement our dynamically generated sitemap.xml file.
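For reference, the ASP.NET-level registration described above looks roughly like the following web.config fragment. This is a minimal sketch: the page name Error404NotFound.aspx is the one used later in this article, and the mode setting is an assumption you should adjust for your own site.

```xml
<configuration>
  <system.web>
    <!-- ASP.NET-level handling: a 404 for an ASP.NET-mapped page type
         is redirected (HTTP 302) to the page named here -->
    <customErrors mode="On">
      <error statusCode="404" redirect="~/Error404NotFound.aspx" />
    </customErrors>
  </system.web>
</configuration>
```

Because this path goes through a redirect, it is not the one that makes the sitemap trick work. It only covers real "not found" errors for ASP.NET page types.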
Sitemaps used by Google and others depend on a simple XML schema, which you can find here (http://www.sitemaps.org/protocol.php). If you’re like me, the best way to understand such a simple schema is to look at a real, live sitemap. Load the sitemap.xml for the gotnet.biz site (http://www.gotnet.biz/sitemap.xml) into your web browser. It's very easy to understand, don't you think? Sitemaps are a good complement to the robots.txt file on your site because they let you specify what should be indexed by the search engine, whereas robots.txt specifies what should not be indexed.
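If you can't load the live file, here is a small illustrative sitemap in the same format. The www.example.com host and the SomeArticle page name are hypothetical placeholders, not the actual gotnet.biz content. The element names and namespace come from the sitemaps.org protocol.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- a fixed page that changes often and gets top priority -->
  <url>
    <loc>http://www.example.com/MyBlog.aspx</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <!-- an article page; priority is omitted so the default applies -->
  <url>
    <loc>http://www.example.com/SomeArticle.ashx</loc>
    <lastmod>2007-07-22</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

Only the loc element is required inside each url element. The lastmod, changefreq and priority elements are optional hints to the search engine.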
The 404 Handler Page Code
Of course, the key to being able to dynamically generate a sitemap.xml file using a 404 error handler is that the sitemap.xml file must not exist, physically, on your site. Start by creating a new ASPX page that will do the work instead. Remember that this page is probably going to do double duty by generating your sitemap.xml file and by handling real "not found" problems. So it should be styled in a way that matches your site design. Inside the new page, which we’ll call Error404NotFound.aspx, we need a helper method:
using System;
using System.Globalization;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    // the standard schema namespace and change frequencies
    // for sitemaps defined at http://www.sitemaps.org/protocol.php
    private static readonly string xmlns =
        "http://www.sitemaps.org/schemas/sitemap/0.9";
    private enum freq { always, hourly, daily, weekly, monthly, yearly, never }

    // add a url node with standard priority to the urlset at the
    // root of the specified XML document
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq )
    {
        // sanity checks
        if (reqUrl == null || doc == null || loc == null)
            return;
        // call the overload with standard priority (no priority node)
        AddUrlNodeToUrlSet( reqUrl, doc, loc, lastmod, changefreq, null );
    }

    // add a url node with variable priority to the urlset at the
    // root of the specified XML document
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq, float? priority )
    {
        // sanity checks
        if (reqUrl == null || doc == null || loc == null)
            return;
        // create the child url element
        XmlNode urlNode = doc.CreateElement( "url", xmlns );
        // format the URL based on the site settings:
        // SCHEME + AUTHORITY + VIRTUAL PATH + FILENAME
        // (no manual entity escaping is needed: assigning InnerText
        // below escapes &, < and > when the document is serialized,
        // so escaping by hand would produce double-escaped output)
        string url = String.Format( "{0}://{1}{2}", reqUrl.Scheme,
            reqUrl.Authority, VirtualPathUtility.ToAbsolute(
            String.Format( "~/{0}", loc ) ) );
        // set up the loc node containing the URL and add it
        XmlNode newNode = doc.CreateElement( "loc", xmlns );
        newNode.InnerText = url;
        urlNode.AppendChild( newNode );
        // set up the lastmod node (if it should exist) and add it
        if (lastmod != null)
        {
            newNode = doc.CreateElement( "lastmod", xmlns );
            newNode.InnerText = lastmod.Value.ToString( "yyyy-MM-dd" );
            urlNode.AppendChild( newNode );
        }
        // set up the changefreq node (if it should exist) and add it
        if (changefreq != null)
        {
            newNode = doc.CreateElement( "changefreq", xmlns );
            newNode.InnerText = changefreq.Value.ToString();
            urlNode.AppendChild( newNode );
        }
        // set up the priority node (if it should exist) and add it;
        // out-of-range values fall back to the default of 0.5 and the
        // invariant culture guarantees a "." decimal separator
        if (priority != null)
        {
            newNode = doc.CreateElement( "priority", xmlns );
            newNode.InnerText =
                (priority.Value < 0.0f || priority.Value > 1.0f)
                ? "0.5"
                : priority.Value.ToString( "0.0",
                    CultureInfo.InvariantCulture );
            urlNode.AppendChild( newNode );
        }
        // add the new url node to the urlset node
        doc.DocumentElement.AppendChild( urlNode );
    }
}
The AddUrlNodeToUrlSet method defined above will be used during Page_Load to construct the sitemap.xml file. It simply adds one <url> node for a page on your site that you want to reference in the sitemap. Now let’s look at the Page_Load method:
using System;
using System.Data.OleDb;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    protected void Page_Load( object sender, EventArgs e )
    {
        string QS = Request.ServerVariables["QUERY_STRING"];
        // was it the sitemap.xml file that was not found?
        if (QS != null && QS.EndsWith( "sitemap.xml",
            StringComparison.OrdinalIgnoreCase ))
        {
            // build the sitemap.xml file dynamically: add the fixed URLs
            // and all of the articles from the database, set the MIME
            // type to text/xml and stream the file back to the search
            // engine bot
            XmlDocument doc = new XmlDocument();
            doc.LoadXml( String.Format( "<?xml version=\"1.0\" encoding" +
                "=\"UTF-8\"?><urlset xmlns=\"{0}\"></urlset>", xmlns ) );
            // add the fixed blog URL for this site with top priority
            AddUrlNodeToUrlSet( Request.Url, doc, "MyBlog.aspx", null,
                freq.daily, 1.0f );
            // NOTE: add more fixed urls as necessary for your site
            // this could be done programmatically or better still by
            // dependency injection
            // now query the database and add the virtual URLs for this site
            string connectionString =
                "NOTE: set this to suit the needs of your content database";
            string query = "SELECT PAGE_NAME, POSTING_DATE FROM BLOGDB " +
                "ORDER BY POSTING_DATE";
            // the using blocks dispose the connection, command and
            // reader even if the query throws
            using (OleDbConnection conn = new OleDbConnection( connectionString ))
            using (OleDbCommand cmd = new OleDbCommand( query, conn ))
            {
                conn.Open();
                using (OleDbDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        object page_name = rdr[0];
                        object posting_date = rdr[1];
                        if (page_name != null && !(page_name is DBNull))
                        {
                            // "as" yields null when POSTING_DATE is DBNull,
                            // where a direct cast would throw
                            AddUrlNodeToUrlSet( Request.Url, doc, String.Format(
                                "{0}.ashx", page_name.ToString().Trim() ),
                                posting_date as DateTime?, freq.monthly );
                        }
                    }
                }
            }
            // IMPORTANT - trace has to be disabled or the XML returned will
            // not be valid because the div tag inserted by the tracing code
            // will look like a second root XML node which is invalid
            Page.TraceEnabled = false;
            // IMPORTANT - you must clear the response in case handlers
            // upstream inserted anything into the buffered output already
            Response.Clear();
            // IMPORTANT - set the status to 200 OK, not the 404 Not Found
            // that this page would normally return
            Response.Status = "200 OK";
            // IMPORTANT - set the MIME type to XML
            Response.ContentType = "text/xml";
            // write the whole XML document and end the request; nothing
            // after Response.End() runs for this request
            Response.Write( doc.OuterXml );
            Response.End();
        }
        // not the sitemap.xml file so set the standard 404 error code
        Response.Status = "404 Not Found";
    }
}
When Page_Load starts, it checks the QUERY_STRING server variable to see if the sitemap.xml file was the missing one that caused the transfer to happen. IIS passes the originally requested URL to the handler there, typically in a form like 404;http://www.gotnet.biz:80/sitemap.xml, which is why the code tests how the string ends. If it was the sitemap, the code starts a new XML document and adds fixed and virtual <url> nodes using the AddUrlNodeToUrlSet method shown above. Which page names you include in your sitemap depends entirely on your site’s content, so you’ll have to make most of your adjustments to my sample in that area. The end of the Page_Load method has some interesting code. There are four key things that have to happen at this point.
- You must disable page tracing if it’s turned on. If you don’t, ASP.NET appends a <div> element to the end of the document, making your XML appear as though it has two root nodes, which will invalidate it.
- You must clear the Response object in case some other code has already buffered some content. You want just the XML of the sitemap in the output.
- You may need to set the HTTP status code to 200 to make sure that the client sees the result of its request as successful.
- You must set the MIME type of the Response to text/xml because that’s what the search engine bots expect for the document you’re returning.
Conclusion
Finally, all the Page_Load method has to do is send the OuterXml of the XmlDocument it built to the client with a Write operation. Make sure that you register your new page as the 404 handler with IIS and you’re ready to go. You can also register the same page as an error handler with ASP.NET via the web.config file as discussed above. Just be aware that when ASP.NET handles a 404 error, it will redirect the client to the page you specify, so if you’re depending on a clean transfer to the error handler, you probably aren’t going to get exactly what you want. For the sitemap.xml file though, the approach shown above is very clean because of the way IIS (not ASP.NET) handles missing files. Open the sitemap.xml file for gotnet.biz again in Internet Explorer using the Fiddler2 Web Debugging Proxy (http://www.fiddler2.com/fiddler2/) and use the Session Inspector. You’ll see just how clean this code makes the would-be 404 error for that missing file appear to the search engine bots.
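How you register the page with IIS depends on the IIS version. On IIS 6 it is done through the Custom Errors settings for the site in IIS Manager. On IIS 7 or later, an equivalent registration can be sketched in web.config as shown below. This is an assumption-laden sketch, not the configuration used on gotnet.biz: it assumes the handler page sits at the site root and that your server allows the httpErrors section to be set at the site level.

```xml
<configuration>
  <system.webServer>
    <httpErrors errorMode="Custom">
      <!-- drop the inherited 404 entry, then point 404s at our page -->
      <remove statusCode="404" subStatusCode="-1" />
      <!-- ExecuteURL runs the page on the server for the original
           request instead of redirecting the client -->
      <error statusCode="404" path="/Error404NotFound.aspx"
             responseMode="ExecuteURL" />
    </httpErrors>
  </system.webServer>
</configuration>
```

The important choice here is ExecuteURL rather than Redirect, since a redirect would reintroduce the same HTTP 302 round trip that makes the ASP.NET-level handler unsuitable for the sitemap.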