Democratic Underground

An interesting thing that I found while searching the Web

This topic is archived.
Home » Discuss » Archives » General Discussion (Through 2005)
 
ck4829 Donating Member (1000+ posts) Mon Jun-20-05 08:37 AM
Original message
An interesting thing that I found while searching the Web
Edited on Mon Jun-20-05 08:40 AM by ck4829
http://web.archive.org/web/*/www.whitehouse.gov/robots.txt

(You have to copy and paste the URL, or if that doesn't work, you can go to archive.org and search for www.whitehouse.gov/robots.txt.)

This is the WH's robots.txt file as archived over the last 5 years. The entries in it tell a Search Engine Spider not to visit certain pages.

One should note that the list is very small (only stops 1 item from being searched) until about September 2, 2001.

loveable liberal Donating Member (1000+ posts) Mon Jun-20-05 08:40 AM
Response to Original message
1. what does that mean?

ck4829 Donating Member (1000+ posts) Mon Jun-20-05 08:42 AM
Response to Reply #1
2. I don't know, I too would like an answer

MountainLaurel Donating Member (1000+ posts) Mon Jun-20-05 08:44 AM
Response to Reply #1
3. I think
What the list shows are the Web pages sitting on the Whitehouse.gov server that the public is blocked from viewing.

jojo54 Donating Member (1000+ posts) Mon Jun-20-05 08:47 AM
Response to Reply #3
6. I agree
A few days ago, someone had posted that they couldn't get on www.freewayblogger.com. A message came up that it was "distasteful", or something like that.

RW propaganda at work again.

helderheid Donating Member (1000+ posts) Mon Jun-20-05 08:46 AM
Response to Reply #1
5. these are search queries
not allowed to be used to find the whitehouse.gov site

Dragonfli Donating Member (1000+ posts) Mon Jun-20-05 10:39 AM
Response to Reply #1
9. A Quick Explanation
A quick explanation of "The Robots Exclusion Standard":
http://www.searchengineworld.com/robots/robots_tutorial.htm


Robots.txt Tutorial
Search engines will look in your root domain for a special file named "robots.txt" (http://www.mydomain.com/robots.txt). The file tells the robot (spider) which files it may spider (download). This system is called, The Robots Exclusion Standard.
The format for the robots.txt file is special. It consists of records. Each record consists of two fields : a User-agent line and one or more Disallow: lines. The format is:....


Having that file with exclusions is not in and of itself unusual. It would be more unusual not to have a robots.txt file.
What I find odd is the number of entries.
Please note also that only "polite bots" will follow those guidelines.

I will say this: that robots.txt file will be very effective at hiding the excluded files from Google and other search engines.
Other than that, it does pretty much nothing about access to the files themselves.


I hope this helps you understand what that file is.
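
For anyone who wants to see the record format in action, here is a minimal Python sketch that splits robots.txt text into records (User-agent lines followed by Disallow lines). The function name parse_records and the sample text are made up for illustration, one of the sample paths is invented, and real parsers handle more cases than this one does.

```python
def parse_records(text):
    """Split robots.txt text into a list of (user_agents, disallowed_paths) records."""
    records = []
    agents, paths = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if paths:  # a User-agent line after Disallow lines starts a new record
                records.append((agents, paths))
                agents, paths = [], []
            agents.append(value)
        elif field == "disallow" and agents:
            paths.append(value)
    if agents:
        records.append((agents, paths))
    return records


# Illustrative sample, not a copy of any real robots.txt file.
SAMPLE = """\
User-agent: *
Disallow: /thanksgiving/iraq
Disallow: /hypothetical/text   # made-up path, for illustration only
"""

for agents, paths in parse_records(SAMPLE):
    print(agents, "->", len(paths), "disallowed paths")
```

Counting the paths in each record is a quick way to compare how long the list is from one archived copy to the next.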

hootinholler Donating Member (1000+ posts) Mon Jun-20-05 11:30 AM
Response to Reply #9
13. As a person with a bit of search engine experience...
A robots.txt file will stop 'polite' search engines. Its intent is to offer guidance to spiders (programs that 'crawl' links in web-space) and to limit the bandwidth they consume, mostly through a "nothing to see here" strategy. As most spiders are polite, this generally works fairly well, but if you set your spider to ignore the file, it will.
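
To make the polite versus impolite distinction concrete, here is a rough Python sketch using the standard library's urllib.robotparser. The may_fetch function, the ExampleBot agent name and the one-rule robots.txt text are assumptions for illustration; this is not how any particular search engine is implemented.

```python
from urllib.robotparser import RobotFileParser

# Illustrative one-rule file, not a copy of any real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thanksgiving/iraq
"""


def may_fetch(url, robots_txt, agent="ExampleBot", polite=True):
    """Return True if the spider would go ahead and download `url`."""
    if not polite:
        return True  # an impolite spider never consults robots.txt
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked = "http://www.whitehouse.gov/thanksgiving/iraq"
print(may_fetch(blocked, ROBOTS_TXT))                # polite spider skips the page
print(may_fetch(blocked, ROBOTS_TXT, polite=False))  # impolite spider fetches it anyway
```

Nothing on the server side enforces the rule; the decision is made entirely inside the spider.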

Essentially it (the file, and its changes over time) boils down to a list of things the W.H. doesn't want easily available by search. This can be circumvented by another site putting the links on a page with the appropriate keywords in the link text.

A link's link text is the bit in between the anchor open tag <a> and its close </a>. This is how Google bombs are created, similar to: True Patriot. Put that link on a lot of websites and see what happens; the current Google result is: here. URF, I just went to the top site, hope we can change it!

Creating the association between the key text and the masked (hidden) URL outside of the environment controlled by the robots.txt has the net effect of backfilling the spider's 'territory' when you submit the page to the spider's controller.
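
Here is a small Python sketch of how a spider might pick up that association, pulling each link's URL and link text out of a page with the standard html.parser module. The LinkTextCollector class and the sample page are hypothetical, and a real indexer does far more than this.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:  # only keep text that sits inside a link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up page on some other site that links to the target URL.
PAGE = '<p>See <a href="http://www.example.com/some/page">key phrase here</a> for more.</p>'

collector = LinkTextCollector()
collector.feed(PAGE)
print(collector.links)  # [('http://www.example.com/some/page', 'key phrase here')]
```

The pairs come from the linking page, so the target site's robots.txt has no say over them.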

Just a little background,
-Hoot

helderheid Donating Member (1000+ posts) Mon Jun-20-05 08:45 AM
Response to Original message
4. interesting
Disallow: /thanksgiving/iraq


crikkett Donating Member (1000+ posts) Mon Jun-20-05 11:01 AM
Response to Reply #4
11. ...just looks like housecleaning to me
Checking the 'disallow' rule you mentioned:
www.whitehouse.gov/thanksgiving/iraq returns a 404; this makes me think that the content & structure of the site were changed & the webmaster wanted to clean up the indexes. I don't see anything dastardly.

:tinfoilhat:

Just Me Donating Member (1000+ posts) Mon Jun-20-05 08:48 AM
Response to Original message
7. Um, a means of controlling the flow of information?
*shit* :scared:

hootinholler Donating Member (1000+ posts) Mon Jun-20-05 11:53 AM
Response to Reply #7
14. Well it could be used that way, but,
Edited on Mon Jun-20-05 11:54 AM by hootinholler
In this case it appears non-sinister. The pages listed are low-bandwidth support versions. I haven't checked all of them, but the pages I checked have graphical versions that are indexed (spidered).

-Hoot

Tsiyu Donating Member (1000+ posts) Mon Jun-20-05 08:53 AM
Response to Original message
8. Trippy
who is behind the curtain? :-(

hemp_not_war Donating Member (52 posts) Mon Jun-20-05 10:51 AM
Response to Original message
10. that is strange
Why don't they disallow everything? The way they do it makes it seem they allow caching of certain material and not other material. There is a lot of stuff about 9/11 and Iraq that they don't want Google or MSN caching for us to find when they change their story later. I might be willing to help write something to cache government sites and check for data being changed later. I have heard about the case after 9/11 where Andrews Air Force Base removed a bunch of material showing how prepared they always are, because they needed to change the story. But they got caught because of Google's cache, so they are probably wise to that by now.
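
A rough sketch of the "check for data being changed later" part, in Python: keep a fingerprint (hash) of each page per visit and compare two visits. The function names, URLs and page contents below are all made up for illustration, and the fetching and storing of pages is left out entirely.

```python
import hashlib


def fingerprint(page_bytes):
    """Return a SHA-256 hex digest identifying this exact page content."""
    return hashlib.sha256(page_bytes).hexdigest()


def changed_pages(old_snapshot, new_snapshot):
    """Return the URLs present in both snapshots whose content differs, sorted."""
    return sorted(
        url
        for url in old_snapshot.keys() & new_snapshot.keys()
        if old_snapshot[url] != new_snapshot[url]
    )


# Hypothetical snapshots: URL -> fingerprint of the page as fetched on two dates.
monday = {
    "http://www.example.com/a": fingerprint(b"original statement"),
    "http://www.example.com/b": fingerprint(b"unchanged page"),
}
friday = {
    "http://www.example.com/a": fingerprint(b"revised statement"),
    "http://www.example.com/b": fingerprint(b"unchanged page"),
}

print(changed_pages(monday, friday))  # ['http://www.example.com/a']
```

A hash only says that something changed; to see what changed you would have to keep the full cached copies too.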

paula777 Donating Member (1000+ posts) Mon Jun-20-05 11:20 AM
Response to Original message
12. Weird
Disallow: /president/sept112003/photoessay/text
Disallow: /president/september11/iraq
Disallow: /president/september11/text

Also weird is that the CNN coverage on the web archive STARTS at 4pm. Where are the archived pages from that morning?