Enhanced usability with automatically generated site maps

This site, like many others, does not actually provide a full sitemap [1]. Categories and search are no real replacement for a sitemap anyway. Evan Jones has come up with an automatic sitemap generator [2]. His work was prompted by last year's Google Programming Contest [3].

He aimed to build a solution that can generate a sitemap for any site available on the web. Evan describes his algorithm as follows:

  • Iterate over the set of web pages, gathering the following information:
    URL, title, and which URLs it links to.
  • Iterate over the set of links, copying any links that exist in both directions
    to a new database.
  • Iterate over the new set of bi-directional links, copying the metadata
    for the URLs that are referenced.
  • Iterate over this new set of URLs and generate the set of links belonging
    to each site:
    • If the URL has a site assigned to it, continue to the next URL.
    • Otherwise, collect all the URLs belonging to this site by recursively
      adding all the linked URLs to this site.
    • Sort the set of URLs in the site. The root of the site is the first
      URL in the sorted list.
    • Recurse for each URL, starting at the root URL, generating the hierarchical
      structure:
      • Get the list of URLs that this URL links to.
      • For each of those URLs, if they can be found in the sorted URL list,
        add them as children of this URL and then remove them from the list.
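
To make the steps above more concrete, here is a minimal Python sketch of how they might fit together. It is not Evan's code. The function names (build_sitemaps, build_tree) and the input format (a dict mapping each URL to its title and outgoing links) are assumptions made for illustration, and the first step, crawling the pages, is taken as already done. The sketch also assumes that "the URLs this URL links to" means the bi-directional links only, and that all children of a page are claimed before recursing into them. The quoted description leaves both points open.

```python
from collections import defaultdict


def build_sitemaps(pages):
    """Build one sitemap tree per site.

    `pages` is a hypothetical input format: a dict mapping
    URL -> (title, list of URLs that the page links to).
    """
    # Keep only links that exist in both directions.
    neighbours = defaultdict(set)
    for url, (_title, links) in pages.items():
        for target in links:
            if target != url and target in pages and url in pages[target][1]:
                neighbours[url].add(target)
                neighbours[target].add(url)

    assigned = set()   # URLs that already belong to a site
    sitemaps = []
    for url in sorted(neighbours):
        if url in assigned:
            continue   # this URL already has a site, go to the next one

        # Collect every URL reachable over bi-directional links.
        # An explicit stack stands in for the recursion in the description.
        members = set()
        stack = [url]
        while stack:
            current = stack.pop()
            if current in members:
                continue
            members.add(current)
            stack.extend(neighbours[current] - members)
        assigned |= members

        # The root of the site is the first URL in sorted order.
        root = min(members)
        remaining = members - {root}
        sitemaps.append(build_tree(root, pages, neighbours, remaining))
    return sitemaps


def build_tree(url, pages, neighbours, remaining):
    """Attach still-unclaimed linked URLs as children, then recurse."""
    children = [u for u in sorted(neighbours[url]) if u in remaining]
    for child in children:
        remaining.discard(child)   # each URL appears only once in the tree
    return {
        "url": url,
        "title": pages[url][0],
        "children": [build_tree(c, pages, neighbours, remaining)
                     for c in children],
    }
```

Each returned tree is a nested dict with "url", "title" and "children" keys. Pages without any bi-directional link do not show up in any sitemap, which follows from the second step of the description.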

This may look like a brute-force method, but it is still a way to create a sitemap at all, even if only to get an initial overview of the structure of a given website.

[1] http://www.useit.com/alertbox/20020106.html
[2] http://www.eng.uwaterloo.ca/~ejones/software/google-contest-2002/
[3] http://catalogs.google.com/programming-contest/
