Linux入门教程:Nginx支持HTTPS并且支持反爬虫,运维日志


自己写了若干爬虫, 但是自己的网站也有人爬, 呵呵, 这里介绍一种Nginx反爬.我在阿里云只开放80端口, 所有一般端口都通过Nginx进行反向代理. 通过Nginx, 我们还可以拦截大部分爬虫.

然后我们再给自己的网站加上HTTPS支持.

Nginx安装

我的系统如下:

jinhan@jinhan-chen-110:~/book/Obiwan/bin$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:    16.04
Codename:   xenial

安装(如果有apache服务器, 建议卸载了, 或者改Nginx的默认端口):

sudo apt-get install nginx

此时已经开启了80端口, 并且配置处在etc/nginx

lsof -i:80

cd /etc/nginx

Nginx服务一般配置

将配置放于conf.d/*

PHP配置(可忽视)

server{
    listen 80;
    server_name php.lenggirl.com;
    charset utf-8;
    access_log /data/logs/nginx/www.lenggirl.com.log;
    #error_log /data/logs/nginx/www.lenggirl.com.err;

    location / {
            root   /data/www/php/blog;
        index index.html index.php;
        #访问路径的文件不存在则重写URL转交给ThinkPHP处理
        if (!-e $request_filename) {
            rewrite  ^/(.*)$  /index.php/$1  last;
            break;
        }
    }

    ## Images and static content is treated different
    location ~* ^.+.(jpg|jpeg|gif|css|png|js|ico|xml)$ {

        access_log        off;
        expires           30d;
        root /data/www/php/blog;
     }

    location ~\.php/?.*$ {
    root        /data/www/php/blog;
        fastcgi_pass   127.0.0.1:9000;
        fastcgi_index  index.php;
        #加载Nginx默认"服务器环境变量"配置
        include        fastcgi.conf;

        #设置PATH_INFO并改写SCRIPT_FILENAME,SCRIPT_NAME服务器环境变量
        set $fastcgi_script_name2 $fastcgi_script_name;
        if ($fastcgi_script_name ~ "^(.+\.php)(/.+)$") {
            set $fastcgi_script_name2 $1;
            set $path_info $2;
        }
        fastcgi_param   PATH_INFO $path_info;
        fastcgi_param   SCRIPT_FILENAME   $document_root$fastcgi_script_name2;
        fastcgi_param   SCRIPT_NAME   $fastcgi_script_name2;        
    }
}

Go配置

通过server_name, 用域名访问, 全部会到80端口, 根据域名会转发到8080

域名请A记录到该机器IP地址.

vim /etc/nginx/conf.d/www.lenggirl.com.conf

server{
    listen 80;
    # 本地测试时可以将域名改为: 127.0.0.1
    server_name www.lenggirl.com;
    charset utf-8;
    access_log /root/logs/nginx/www.lenggirl.com.log;
    #error_log /data/logs/nginx/www.lenggirl.com.err;
    location / {
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    proxy_pass http://localhost:8080;

    # 这个就是反爬虫文件了
    include /etc/nginx/anti_spider.conf;
    }   
}

日志文件要先建立:

sudo mkdir -p /root/logs/nginx

查看配置是否无误, 并重启:

sudo nginx -t
sudo service nginx restart
sudo nginx -s reload

访问127.0.0.1会发现502错误, 因为8080端口我们没开! 此时访问localhost会发现, 这时Nginx欢迎页面出来了, 这是默认80端口页面!

反爬虫配置

增加反爬虫配额文件:

sudo vim /etc/nginx/anti_spider.conf

#禁止Scrapy等工具的抓取  
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {  
     return 403;  
}  

#禁止指定UA及UA为空的访问  
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) {  
     return 403;               
}  

#禁止非GET|HEAD|POST方式的抓取  
if ($request_method !~ ^(GET|HEAD|POST)$) {  
    return 403;  
}  

#屏蔽单个IP的命令是
#deny 123.45.6.7
#封整个段即从123.0.0.1到123.255.255.254的命令
#deny 123.0.0.0/8
#封IP段即从123.45.0.1到123.45.255.254的命令
#deny 124.45.0.0/16
#封IP段即从123.45.6.1到123.45.6.254的命令是
#deny 123.45.6.0/24

# 以下IP皆为流氓
deny 58.95.66.0/24;

在网站配置server段中都插入include /etc/nginx/anti_spider.conf, 见上文. 你可以在默认的80端口配置上加上此句:sudo vim sites-available/default

重启:

sudo nginx -s reload

爬虫UA常见:

FeedDemon             内容采集  
BOT/0.1 (BOT for JCE) sql注入  
CrawlDaddy            sql注入  
Java                  内容采集  
Jullo                 内容采集  
Feedly                内容采集  
UniversalFeedParser   内容采集  
ApacheBench           cc攻击器  
Swiftbot              无用爬虫  
YandexBot             无用爬虫  
AhrefsBot             无用爬虫  
YisouSpider           无用爬虫(已被UC神马搜索收购,此蜘蛛可以放开!)  
jikeSpider            无用爬虫  
MJ12bot               无用爬虫  
ZmEu phpmyadmin       漏洞扫描  
WinHttp               采集cc攻击  
EasouSpider           无用爬虫  
HttpClient            tcp攻击  
Microsoft URL Control 扫描  
YYSpider              无用爬虫  
jaunty                wordpress爆破扫描器  
oBot                  无用爬虫  
Python-urllib         内容采集  
Indy Library          扫描  
FlightDeckReports Bot 无用爬虫  
Linguee Bot           无用爬虫  

使用curl -A 模拟抓取即可,比如:

# -A表示User-Agent
# -X表示方法: POST/GET
# -I表示只显示响应头部
curl -X GET -I -A 'YYSpider' localhost

HTTP/1.1 403 Forbidden
Server: nginx/1.10.3 (Ubuntu)
Date: Fri, 08 Dec 2017 10:07:15 GMT
Content-Type: text/html
Content-Length: 178
Connection: keep-alive

模拟UA为空的抓取:

curl -I -A ' ' localhost

模拟百度蜘蛛的抓取:

curl -I -A 'Baiduspider' localhost

支持https

见: https://certbot.eff.org/#ubuntuxenial-other

首先执行:

apt-get update
apt-get install software-properties-common
add-apt-repository ppa:certbot/certbot
apt-get update
apt-get install python-certbot-nginx 

然后执行这个按说明操作:

certbot --authenticator standalone --installer nginx --pre-hook "nginx -s stop" --post-hook "nginx"

之前/etc/nginx/conf.d/www.lenggirl.com.conf文件变更如下:

server{
        listen 80;
        # 本地测试时可以将域名改为: 127.0.0.1
        server_name www.lenggirl.com;
        charset utf-8;
        access_log /root/logs/nginx/www.lenggirl.com.log;
        #error_log /data/logs/nginx/www.lenggirl.com.err;
        location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        proxy_pass http://localhost:4000;

        # 这个就是反爬虫文件了
        # include /etc/nginx/anti_spider.conf;
        #
}

    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/www.lenggirl.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/www.lenggirl.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
    if ($scheme != "https") {
        return 301 https://$host$request_uri;
    } # managed by Certbot

}

相关内容