How do we search sites that no longer exist?

There’s a lot of material in the Wayback Machine at Archive.org: snapshots of websites from days gone by, full of material that may not be online now but that we would like to see if we could.

But how? If we know the site’s address, we can go to its archived copy and look through it.

What we need is a Google for the Wayback Machine, and this does not seem to exist.

Anyone know of anything?

6 thoughts on “How do we search sites that no longer exist?”

  1. Hi Roger,

    I agree that it’s unfortunate that Wayback has no search feature. In fairness to the Internet Archive, though, it is a non-profit organization and therefore, unlike Google, cannot afford to maintain the large data processing centers required to implement a real-time search function for a data set as massive as the one Wayback has accumulated. Hopefully, as technology improves and the per-unit cost of data storage and retrieval continues to shrink, this will eventually be remedied. I’m just grateful for what the good folks at the Internet Archive have already provided to all of us (free of charge!).

    But here are some ideas that may be useful in certain situations.

    (1) Back in the 90s and early 2000s, at least a few search engines had directory pages containing lists of websites organized by category. Here’s what Yahoo’s directory looked like (back in October 1996):
    http://web.archive.org/web/19961017235908/http://www2.yahoo.com/

    For example, by going to this directory and taking the route (Society and Culture) >> (Religion) >> (Christianity), I was able to locate an interesting looking archive of a university webpage about Gregory of Nyssa: (http://web.archive.org/web/19961128143639/http://www.ucc.uconn.edu/~das93006/nyssa.html).

    Perhaps also check the Wayback archives of some of the other popular search engines of the late 90s (e.g. Lycos, Infoseek, AltaVista). They may have had similar directory lists, which point to other interesting websites that no longer exist.

    (2) If you have some idea of how a particular website is structured, you can use a “*” in your URL requests to the Wayback Machine to find all archived webpages for that website (or website subsection).

    For example, if I were to type:
    http://roger-pearse.com/*

    into the Wayback Machine, it would reveal each and every page that Wayback has ever archived from your website. But if I wanted to narrow my search down to only the weblog portion of your site, I would instead type http://roger-pearse.com/weblog/*. In this case, it still declares that “1,151 URLs have been captured for this domain”, but for some sites, a similar procedure can really help to narrow the search down to a manageable set of results. Oh, and sometimes certain deeply embedded portions of an archived website don’t seem to be readily accessible from the archived copy of the site’s main entry page, so this “wildcard search” is necessary to track them down.
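The wildcard lookup can also be written down as a small helper. This is only a sketch: the function names are made up for illustration, the `/web/*/` listing path is the form the Wayback Machine site appears to use for wildcard results, and the second helper assumes the Internet Archive's CDX query endpoint with its `matchType`, `collapse` and `limit` parameters. Nothing is fetched here; the code only builds the addresses.

```python
from urllib.parse import urlencode

# Assumed base address of the Wayback Machine.
WAYBACK = "https://web.archive.org"


def wildcard_listing_url(prefix: str) -> str:
    """Build the browser address that lists every capture under `prefix`.

    A trailing "*" on the input is optional; exactly one is emitted.
    """
    return f"{WAYBACK}/web/*/{prefix.rstrip('*')}*"


def cdx_prefix_query(prefix: str, limit: int = 50) -> str:
    """Build a machine-readable prefix query (assumed CDX endpoint).

    `collapse=urlkey` asks for one row per distinct URL rather than
    one row per capture.
    """
    params = {
        "url": prefix.rstrip("*"),
        "matchType": "prefix",
        "collapse": "urlkey",
        "limit": limit,
    }
    return f"{WAYBACK}/cdx/search/cdx?" + urlencode(params)


if __name__ == "__main__":
    # Whole site, then only the weblog section.
    print(wildcard_listing_url("http://roger-pearse.com/*"))
    print(wildcard_listing_url("http://roger-pearse.com/weblog/*"))
    print(cdx_prefix_query("http://roger-pearse.com/weblog/"))
```

Narrowing the prefix (for instance from the whole site to `/weblog/`) is the only lever available, so the helpers take the prefix as their sole required argument.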

    Unfortunately, none of this is as simple or as effective as a basic Google-type search would be, but it’s certainly better than nothing.

    God Bless, Roger!

  2. Many thanks indeed for these ideas. Your feelings about Archive.org are mine. Isn’t it *scary* that this is all there is, by way of archive?

    Your suggestions are good ones. I don’t think they will help me in this case, but in others they might.

    It’s like being back at the start of the web, where there was no effective way to find things; you just had lists of urls.

    God bless you, Michael.

  3. It’s unfortunate that they disallow Google from indexing their site; otherwise you could just use Google, with the search ‘“roger pearse” site:web.archive.org’.

    But archive.org has set a robots.txt that prevents search engines from indexing its content:

    # robots.txt web.archive.org 2011-02-01

    User-agent: *
    Disallow: /web/1
    Disallow: /web/2
    Disallow: /web/http:
    Crawl-delay: 3
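To see what those rules mean for a crawler, they can be fed to the robots.txt parser in Python's standard library. This is a rough sketch: the rules are copied from the 2011 file quoted above and may not match what archive.org serves today, "Googlebot" is used only as an example user-agent name, and `load_rules` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted in the comment above (dated 2011-02-01).
ROBOTS_TXT = """\
# robots.txt web.archive.org 2011-02-01

User-agent: *
Disallow: /web/1
Disallow: /web/2
Disallow: /web/http:
Crawl-delay: 3
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)

    # An archived snapshot: its path starts with /web/1, so it matches
    # the first Disallow rule.
    snapshot = "http://web.archive.org/web/19961017235908/http://www2.yahoo.com/"
    print("snapshot allowed:", rules.can_fetch("Googlebot", snapshot))

    # The front page is not covered by any Disallow rule.
    print("front page allowed:", rules.can_fetch("Googlebot", "http://web.archive.org/"))

    # Seconds a polite crawler is asked to wait between requests.
    print("crawl delay:", rules.crawl_delay("Googlebot"))
```

Because snapshot timestamps begin with a year, the `/web/1` and `/web/2` prefixes between them cover every capture from the 1990s and 2000s onward, which is why a `site:web.archive.org` search finds no archived pages.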
