
Identifying User Agents to Block Web Crawlers and Prevent Content Scraping

Date: June 14, 2023

Editor: Anonymous

Ever since I started running websites, crawlers that automatically scrape our content in bulk have been a problem, and defending against scraping is a long-term task. Five years ago I wrote about one approach: "Blocking IP Addresses and URLs in Apache to Prevent Scraping" (《Apache中设置屏蔽IP地址和URL网址来禁止采集》). In addition, you can identify and block some scrapers by their User Agent. An example of the Apache configuration:
RewriteCond %{HTTP_USER_AGENT} ^(.*)(DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)(.*)$
RewriteRule .* - [F,L]
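Note: for RewriteCond/RewriteRule lines like these to take effect in an .htaccess file, mod_rewrite must be available and the rewrite engine enabled. A minimal self-contained sketch (the User Agent names are just the examples from the rule above):
<IfModule mod_rewrite.c>
RewriteEngine On
# Return 403 Forbidden to matching User Agents
RewriteCond %{HTTP_USER_AGENT} (HTTrack|YisouSpider|SemrushBot) [NC]
RewriteRule .* - [F,L]
</IfModule>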
  Code to block requests with an empty User Agent:
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
  Code to block requests where both the Referer and the User Agent are empty:
RewriteCond %{HTTP_REFERER} ^$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^$ [NC]
RewriteRule .* - [F] 
  Below is a reference list of User Agent keywords that identify common scraping tools and bots which can be blocked (a combined rule sketch follows the list):
DTS Agent
HttpClient
Owlin
Kazehakase
Creative AutoUpdate
HTTrack
YisouSpider
baiduboxapp
Python-urllib
python-requests
SemrushBot
SearchmetricsBot
MegaIndex
Scrapy
EMail Exractor
007ac9
ltx71
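A sketch of how several of these keywords can be combined into one rule, joining them with | in a single RewriteCond (spaces escaped as \s, [NC] makes the match case-insensitive); extend or trim the alternation according to your own logs:
RewriteCond %{HTTP_USER_AGENT} (DTS\sAgent|HttpClient|HTTrack|Python-urllib|python-requests|Scrapy|MegaIndex|ltx71) [NC]
RewriteRule .* - [F,L]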
  Others that can also be considered for blocking:
Mail.RU_Bot: http://go.mail.ru/help/robots
Feedly
ZumBot
Pcore-HTTP
Daum
your-server
Mobile/12A4345d
PhantomJS/2.1.1
archive.org_bot
AcooBrowser
Go-http-client
Jakarta Commons-HttpClient
Apache-HttpClient
BDCbot
ECCP
Nutch
cr4nk
MJ12bot
MOT-MPx220
Y!OASIS/TEST
libwww-perl
  Signatures of mainstream search engines that should generally not be blocked:
Google
Baidu
Yahoo
Slurp
yandex
YandexBot
MSN
  Some common browser or generic strings should also not be blocked lightly (see the exemption sketch after this list):
FireFox
Apple
PC
Chrome
Microsoft
Android
Mail
Windows
Mozilla
Safari
Macintosh
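When using broad patterns (such as the bot|crawl|spider rule shown near the end of this article), a negated condition can exempt legitimate engines first. A sketch, using only the keywords listed above as the exemptions:
# Exempt mainstream search engines, then block generic bot/crawl/spider User Agents
RewriteCond %{HTTP_USER_AGENT} !(Google|Baidu|Yahoo|Slurp|yandex|MSN) [NC]
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
RewriteRule .* - [F,L]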
  Sometimes scrapers set their own custom User Agent strings; these can also be blocked after analyzing your logs, for example:
RewriteCond %{HTTP_USER_AGENT} ^(.*)(\'Mozilla\/5\.0|\'Mozilla\'|\'Moz\'|\'Mozil\'|\'(.+)\'|Mobile\/13G34|Chrome\/53\.0\.2785\.143)(.*)$
RewriteRule .* - [F,L]
  Or combine HTTP_USER_AGENT with other factors for joint detection and blocking, for example:
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{HTTP_USER_AGENT} ^(.*)(Firefox\/44\.0|Safari\/537\.36)(.*)$
RewriteCond %{REQUEST_URI} ^(.*)\/comment\/reply\/(.*)$
RewriteRule .* - [F,L]
  The rules above handle a case of repeated POST comment submissions, blocking requests that match those characteristics.
  Here is some other code found online, listed for reference:
RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms) [NC]
RewriteRule ^(.*)$ - [F]
  Besides editing the .htaccess file, the same blocking can be implemented by modifying the httpd.conf configuration file:
DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
        Order allow,deny
        Allow from all
        Deny from env=BADBOT
</Directory>
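Note that Order/Allow/Deny is the Apache 2.2 access-control syntax. On Apache 2.4 (mod_authz_core) the equivalent would look roughly like this sketch, with the same SetEnvIfNoCase pattern as above (shortened here to a few example names):
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(HTTrack|SemrushBot|MJ12bot)" BADBOT
    <RequireAll>
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>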
After making this change, Apache needs to be restarted (or reloaded) for it to take effect. Below are blocking signatures listed by others:
FeedDemon             content scraping
BOT/0.1 (BOT for JCE) SQL injection
CrawlDaddy            SQL injection
Java                  content scraping
Jullo                 content scraping
Feedly                content scraping
UniversalFeedParser   content scraping
ApacheBench           CC attack tool
Swiftbot              useless crawler
YandexBot             useless crawler
AhrefsBot             useless crawler
YisouSpider           useless crawler (acquired by UC's Shenma Search; this spider can be allowed!)
MJ12bot               useless crawler
ZmEu phpmyadmin       vulnerability scanning
WinHttp               scraping / CC attacks
EasouSpider           useless crawler
HttpClient            TCP attacks
Microsoft URL Control scanning
YYSpider              useless crawler
jaunty                WordPress brute-force scanner
oBot                  useless crawler
Python-urllib         content scraping
Indy Library          scanning
FlightDeckReports Bot useless crawler
Linguee Bot           useless crawler
 Further additions:
WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot
  And also (a usage sketch follows this list):
Aboundex
80legs
^Java
^Cogentbot
^Alexibot
^asterias
^attach
^BackDoorBot
^BackWeb
Bandit
^BatchFTP
^Bigfoot
^Black.Hole
^BlackWidow
^BlowFish
^BotALot
Buddy
^BuiltBotTough
^Bullseye
^BunnySlippers
^Cegbfeieh
^CheeseBot
^CherryPicker
^ChinaClaw
Collector
Copier
^CopyRightCheck
^cosmos
^Crescent
^Custo
^AIBOT
^DISCo
^DIIbot
^DittoSpyder
^Download\ Demon
^Download\ Devil
^Download\ Wonder
^dragonfly
^Drip
^eCatch
^EasyDL
^ebingbong
^EirGrabber
^EmailCollector
^EmailSiphon
^EmailWolf
^EroCrawler
^Exabot
^Express\ WebPictures
Extractor
^EyeNetIE
^Foobot
^flunky
^FrontPage
^Go-Ahead-Got-It
^gotit
^GrabNet
^Grafula
^Harvest
^hloader
^HMView
^HTTrack
^humanlinks
^IlseBot
^Image\ Stripper
^Image\ Sucker
Indy\ Library
^InfoNaviRobot
^InfoTekies
^Intelliseek
^InterGET
^Internet\ Ninja
^Iria
^Jakarta
^JennyBot
^JetCar
^JOC
^JustView
^Jyxobot
^Kenjin.Spider
^Keyword.Density
^larbin
^LexiBot
^lftp
^libWeb/clsHTTP
^likse
^LinkextractorPro
^LinkScan/8.1a.Unix
^LNSpiderguy
^LinkWalker
^lwp-trivial
^LWP::Simple
^Magnet
^Mag-Net
^MarkWatch
^Mass\ Downloader
^Mata.Hari
^Memo
^Microsoft.URL
^Microsoft\ URL\ Control
^MIDown\ tool
^MIIxpc
^Mirror
^Missigua\ Locator
^Mister\ PiX
^moget
^Mozilla/3.Mozilla/2.01
^Mozilla.*NEWT
^NAMEPROTECT
^Navroad
^NearSite
^NetAnts
^Netcraft
^NetMechanic
^NetSpider
^Net\ Vampire
^NetZIP
^NextGenSearchBot
^NG
^NICErsPRO
^niki-bot
^NimbleCrawler
^Ninja
^NPbot
^Octopus
^Offline\ Explorer
^Offline\ Navigator
^Openfind
^OutfoxBot
^PageGrabber
^Papa\ Foto
^pavuk
^pcBrowser
^PHP\ version\ tracker
^Pockey
^ProPowerBot/2.14
^ProWebWalker
^psbot
^Pump
^QueryN.Metasearch
^RealDownload
Reaper
Recorder
^ReGet
^RepoMonkey
^RMA
Siphon
^SiteSnagger
^SlySearch
^SmartDownload
^Snake
^Snapbot
^Snoopy
^sogou
^SpaceBison
^SpankBot
^spanner
^Sqworm
Stripper
Sucker
^SuperBot
^SuperHTTP
^Surfbot
^suzuran
^Szukacz/1.4
^tAkeOut
^Teleport
^Telesoft
^TurnitinBot/1.5
^The.Intraformant
^TheNomad
^TightTwatBot
^Titan
^True_Robot
^turingos
^TurnitinBot
^URLy.Warning
^Vacuum
^VCI
^VoidEYE
^Web\ Image\ Collector
^Web\ Sucker
^WebAuto
^WebBandit
^Webclipping.com
^WebCopier
^WebEMailExtrac.*
^WebEnhancer
^WebFetch
^WebGo\ IS
^Web.Image.Collector
^WebLeacher
^WebmasterWorldForumBot
^WebReaper
^WebSauger
^Website\ eXtractor
^Website\ Quester
^Webster
^WebStripper
^WebWhacker
^WebZIP
Whacker
^Widow
^WISENutbot
^WWWOFFLE
^WWW-Collector-E
^Xaldon
^Xenu
^Zeus
ZmEu
^Zyborg
Acunetix
FHscan
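These entries are regular-expression fragments (^ anchors the match to the start of the User Agent string). They are typically applied either as a chain of RewriteCond lines joined with [NC,OR], or fed to SetEnvIfNoCase as in the httpd.conf example above. A minimal sketch using a few entries from the list:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC]
RewriteRule .* - [F,L]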
Code for a temporary block (returning a 503 error) rather than a permanent block:
RewriteCond %{HTTP_USER_AGENT} ^.*(bot|crawl|spider).*$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [R=503,L]
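Search engines generally treat a 503 as a temporary condition and retry later, which is why this is gentler than a 403. If you also want to hint when crawlers may come back, mod_headers can add a Retry-After header; the conditional form below is an assumption that requires Apache 2.4.10+ (expr= support on the Header directive), so verify it on your version:
# Assumption: needs mod_headers and Apache 2.4.10+ for the expr= condition
Header always set Retry-After "3600" "expr=%{REQUEST_STATUS} == 503"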