
robots.txt

Posted: Tue Mar 08, 2005 8:32 am
by roid

why is this file here?

it blocks the internet archive's crawler, which prevents people from seeing any thread on the DBB that was last contributed to before Dec 31 2003.

Posted: Tue Mar 08, 2005 9:42 am
by roid
would you be ok with allowing JUST the internet archiver access (but no other bots)? i think a robots.txt written like this would do it:


User-agent: * 
Disallow: /

User-agent: ia_archiver 
Disallow: 
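
A quick way to sanity-check a rule set like the one above is to run it through the robots.txt parser in Python's standard library. This is only a sketch: the forum URL is a made-up placeholder, and a real crawler may interpret the file differently from `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# The proposed rules, copied from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""


def can_fetch(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical thread URL, used only for illustration.
THREAD_URL = "http://forum.example/viewtopic.php?t=1"

print("ia_archiver:", can_fetch("ia_archiver", THREAD_URL))
print("Googlebot:  ", can_fetch("Googlebot", THREAD_URL))
```

An empty `Disallow:` line means "nothing is disallowed", so the `ia_archiver` group lets that crawler in while the `*` group keeps every other compliant bot out.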

Posted: Tue Mar 08, 2005 3:46 pm
by Skyalmian
You should have never, ever purged all those posts... So much information was lost...

Let the IA in. It's the most you can do to make up for the terrible decision that was made.

Posted: Tue Mar 08, 2005 5:53 pm
by Skyalmian
Xciter wrote: We never purged any old posts...
Purge, prune, and delete all mean the same thing. If you didn't delete all threads prior to 2004, then where did they go? Did you make a full backup of the board prior to nuking nearly everything?
Xciter wrote: It prevents bots looking for information from using guest accounts which open 50 to 70 threads up at a single time bringing the board to a crawl
Not all bots, just the friendly ones. There are hundreds, if not thousands, of malevolent search bots (the RIAA has one, called Cyveillance) that completely ignore the file and go wherever they please, and these too use bandwidth. In recent days a growing number of server operators have said that GoogleBot reads the file to find the specific folders that are supposed to be excluded from its search, then heads into those folders and archives them anyway, which is putting it on a growing number of people's malevolent-bot lists. For bots like these, you'll need to add code to the server configuration file or use an .htaccess file to ban them. This is just a small list of such bots:
"Alexibot"
"asterias"
"autoemailspider"
"b2w 0.1"
"BackWeb"
"BackDoorBot 1.0"
"Black Hole"
"BlackWidow"
"BlowFish 1.0"
"CherryPicker 1.0"
"CherryPickerSE 1.0"
"CherryPickerElite 1.0"
"ChinaClaw"
"Collector"
"Copier"
"Crescent"
"Crescent Internet ToolPak HTTP OLE Control v.1.0"
"Custo"
"DISCo"
"DISCo Pump"
"DISCo Pump 3.1"
"Download Demon"
"Download Wonder"
"Downloader"
"Drip"
"eCatch"
"EirGrabber"
"EmailCollector"
"EmailCollector 1.0"
"EmailSiphon"
"EmailWolf"
"EmailWolf 1.00"
"Express WebPictures"
"ExtractorPro"
"EyeNetIE"
"FileHound"
"Flaming AttackBot"
"FlashGet"
"GetRight"
"GetSmart"
"GetWeb!"
"Go!Zilla"
"Go-Ahead-Got-It"
"gotit"
"Grabber"
"GrabNet"
"Grafula"
"Harvest 1.5"
"HMView"
"HTTrack"
"Image Stripper"
"Image Sucker"
"InterGET"
"Internet Ninja"
"Iria"
"JetCar"
"JOC Web Spider"
"JOC"
"JustView"
"larbin"
"lftp"
"LeechFTP"
"likse"
"Magnet"
"Mag-Net"
"Mass Downloader"
"Memo"
"MIDown tool"
"Mirror"
"Mister PiX"
"Navroad"
"NearSite"
"NetAnts"
"NetSpider"
"Net Vampire"
"NetZIP"
"NICErsPRO"
"Ninja"
"Octopus"
"Offline Explorer"
"Offline Navigator"
"PageGrabber"
"Papa Foto"
"pavuk"
"pcBrowser"
"Pump"
"RealDownload"
"Reaper"
"Recorder"
"ReGet"
"Siphon"
"SiteSnagger"
"SmartDownload"
"Snake"
"SpaceBison"
"Sucker"
"SuperBot"
"SuperHTTP"
"Surfbot"
"tAkeOut"
"Teleport"
"Teleport Pro"
"Teleport Pro/1.29.1718"
"Teleport Pro/1.29.1632"
"Teleport Pro/1.29.1590"
"Teleport Pro/1.29.1616"
"Vacuum"
"VoidEYE"
"WebAuto"
"WebBandit"
"WebBandit 2.1"
"WebBandit 3.50"
"Webbandit 4.00.0"
"WebCapture 2.0"
"WebCopier v.2.2"
"WebCopier v3.2a"
"WebCopier"
"WebEMailExtractor 1.0B"
"WebFetch"
"WebGo IS"
"Web Image Collector"
"Web Sucker"
"WebLeacher"
"WebReaper"
"WebSauger"
"Website"
"Website eXtractor"
"Website Quester"
"Webster"
"WebStripper"
"WebWhacker"
"WebZIP"
"WebZip/4.0"
"WebZIP/4.21"
"WebZIP/5.0"
"Wget"
"Wget/1.5.3"
"Wget/1.6"
"Whacker"
"Widow"
"WWW-Collector-E"
"WWWOFFLE"
"Xaldon"
"Xaldon/WebSpider"
How to get rid of them -- first link at top. There are three multi-page threads on the subject at Webmaster World.
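
For illustration, the kind of User-Agent check that such a ban relies on can be sketched in Python. The short blocklist is taken from the names above; the case-insensitive substring match and the function name `is_blocked` are assumptions made for this example, not how any particular server does it. On the board itself the rule would live in the server configuration or an .htaccess file.

```python
# A few names from the list above; extend as needed.
BLOCKED_AGENTS = (
    "BlackWidow",
    "EmailSiphon",
    "HTTrack",
    "Teleport Pro",
    "WebZIP",
    "Wget",
)


def is_blocked(user_agent: str, blocked=BLOCKED_AGENTS) -> bool:
    """Return True if any blocked name appears in the User-Agent string.

    Matching is a case-insensitive substring test (an assumption for
    this sketch), so "WebZIP" also catches "WebZip/4.0".
    """
    ua = user_agent.lower()
    return any(name.lower() in ua for name in blocked)


for ua in ("Wget/1.6", "WebZip/4.0", "ia_archiver"):
    print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

Keep in mind that the User-Agent header is whatever the client chooses to send, so this only stops bots that identify themselves honestly, and very short names (like "JOC" or "Memo") can match legitimate agents by accident.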

Posted: Tue Mar 08, 2005 6:12 pm
by Skyalmian
Xciter wrote: When we converted to the new board we only saved / converted 1 month of old threads, it has nothing to do with this robots.txt file...
Internet Archiver, lost forum posts, and robots.txt are all related to one another: you can't get to lost forum posts with Internet Archiver if robots.txt blocks it. That's why roid made this thread.

Edit: MD-2389 tells me you did keep the UBB install. But why are all of the HTML pages broken? They're all empty. What happened to them?

Posted: Tue Mar 08, 2005 9:29 pm
by Topher
The UBB install has been removed as far as I know to make room for the new board and to avoid having to update both boards. If we left the old UBB code in we would be vulnerable to exploits in that code. As far as the HTML posts go, HTML takes up a lot more room than the phpBB posts in the SQL database, so they were nixed.

You can't get to lost forum posts if they don't exist anymore on this server, so allowing the robot in would be a moot point; if it's gone, then it would have to have been indexed before the robots.txt was in place anyway.

Roid has a good point, I'd be up for letting that one in.

Posted: Tue Mar 08, 2005 10:57 pm
by roid
the robots.txt wasn't always there, because i can remember in the past looking at a years-old version of the DBB using the archive.org site. so all the posts were once archived into their database.

they say that once you put a blanket disallowing robots.txt file on your site, they don't just stop indexing your site, they also AUTOMATICALLY DELETE ALL PAST ARCHIVES OF IT.
a horrifying thought, but i'm not convinced they do that. why? because during the time when verisign was redirecting every "no page at this location" lookup, effectively making every nonexistent site point to their site (you'd all remember those shenanigans i'm sure), they had a disallowing robots.txt on the main root verisign server. therefore all nonexistent sites were now considered by the archive.org bot to be part of verisign, and since verisign had that robots.txt, all nonexistent sites could no longer be accessed via archive.org and all archives of them were supposed to have been automatically deleted by their software.

however, once verisign stopped doing that, all the nonexistent sites that archive.org had previously archived (the ones that were unavailable and supposedly automatically deleted during the verisign scandal) were back as good as new on their servers. they never actually deleted them.

Posted: Wed Mar 09, 2005 11:33 pm
by roid
i was right, they didn't delete anything from their servers!!!

check it out
sweet sweet DBB archives all the way back to 1999

thanks a bunch guys.

:D:D:D

Posted: Thu Mar 10, 2005 12:41 am
by roid
well, almost complete.
this may explain:
http://www.archive.org/about/faqs.php#T ... ck_Machine

How did I end up on the live version of a site? or I clicked on X date, but now I am on Y date, how is that possible?

Not every date for every site archived is 100% complete. When you are surfing an incomplete archived site the Wayback Machine will grab the closest available date to the one you are in for the links that are missing. In the event that we do not have the link archived at all, the Wayback Machine will look for the link on the live web and grab it if available. Pay attention to the date code embedded in the archived url. This is the list of numbers in the middle; it translates as yyyymmddhhmmss. For example in this url http://web.archive.org/web/200002291233 ... yahoo.com/ the date the site was crawled was Feb 29, 2000 at 12:33 and 40 seconds.
the DBB archive is full of many such cases it seems. some pages are completely gone.
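
To see which capture date a given archived page actually came from, the yyyymmddhhmmss date code described in the FAQ can be pulled out of the URL. A minimal Python sketch follows; the URL is a hypothetical placeholder built to match the date in the FAQ's example (it is not the truncated URL quoted above), and the `/web/<14 digits>/` pattern is an assumption based on that description.

```python
import re
from datetime import datetime

# Assumes the date code appears as /web/<14 digits>/ in the archived URL.
WAYBACK_RE = re.compile(r"/web/(\d{14})/")


def capture_time(archived_url: str) -> datetime:
    """Extract the yyyymmddhhmmss date code and return it as a datetime."""
    match = WAYBACK_RE.search(archived_url)
    if match is None:
        raise ValueError("no 14-digit date code found in URL")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


# Hypothetical URL for illustration only.
url = "http://web.archive.org/web/20000229123340/http://example.com/"
print(capture_time(url).strftime("%b %d, %Y at %H:%M:%S"))
```

Comparing this value across the links on an archived page shows where the Wayback Machine has jumped to a different capture date than the one you started on.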