Interesting SHDH project
Posted: March 19th, 2006 | Author: vinnie | Filed under: Uncategorized
I was checking out the Wikipedia article on robots.txt. robots.txt is the file you put on your web server to let search engines, such as Google, know which files they shouldn’t crawl and index (that is, make a copy of). For example, if I didn’t want Google to make a copy of my blog posts, I could filter out Google’s crawler with a robots.txt file. Some sites do this to protect copyrighted material; others do it to keep somebody from grabbing a historical copy of their content and using it against them.
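To make that concrete, here is a small sketch of how such a rule behaves, using Python’s standard urllib.robotparser module. The robots.txt contents, the example.com URLs, and the helper name allowed are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Google's crawler out of a blog's posts.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /posts/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Googlebot is shut out of /posts/ but not the rest of the site.
    print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/posts/hello"))
    print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/about"))
    # A crawler with no matching rule is not restricted by this file.
    print(allowed(ROBOTS_TXT, "OtherBot", "http://example.com/posts/hello"))
```

Keep in mind that robots.txt is only a request: well-behaved crawlers honor it, but nothing enforces it.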
Wikipedia had an interesting link to the White House’s robots.txt. For instance, the White House is currently blocking http://www.whitehouse.gov/visit/philippines/text from being indexed. That particular entry doesn’t mean anything controversial, but it gave me an idea.
My idea is a way to pick up controversial stories before anybody realizes they are controversial. Simply write a script that parses a site’s robots.txt file each morning. When you see a new link added, grab that link’s cached (historical) copy from Google or www.internetarchive.org, and that could be your next hot story!
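Here is a rough sketch of what that script could look like in Python. Everything in it is an assumption for illustration: the function names are invented, the sample robots.txt snapshots are made up, and the lookup link relies on the Wayback Machine’s web.archive.org/web/ URL prefix rather than anything mentioned in this post. Fetching the file and saving yesterday’s copy (for example with urllib.request and a text file) is left out to keep it short.

```python
def disallowed_paths(robots_txt: str) -> set[str]:
    """Collect every non-empty Disallow path from robots.txt text."""
    paths = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths


def newly_blocked(yesterday: str, today: str) -> list[str]:
    """Paths that are disallowed today but were not disallowed yesterday."""
    return sorted(disallowed_paths(today) - disallowed_paths(yesterday))


def archive_lookup_url(site: str, path: str) -> str:
    """Build a Wayback Machine lookup link (assumed URL pattern)."""
    return "https://web.archive.org/web/" + site.rstrip("/") + path


if __name__ == "__main__":
    # Made-up snapshots standing in for two mornings' downloads.
    old = "User-agent: *\nDisallow: /cgi-bin/\n"
    new = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /visit/philippines/text\n"
    for path in newly_blocked(old, new):
        print(archive_lookup_url("http://www.whitehouse.gov", path))
```

Run from a daily cron job, each printed link is a candidate page to check for an archived copy.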