A Complete Guide to Installing and Deploying Nagios

Shared by LinuxBoy on 2019-03-29 06:03:04

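Overview: our production environment has 12 machines: 2 LVS (active/standby), 2 nginx, 2 tomcat, 1 back-office server (nginx_tomcat), 3 mysql (master + standby + off-site disaster recovery), 1 image server, and 2 memcached.
As the layout shows, the site is built for high availability, with a primary and a standby at every tier. Page views are modest, so concurrency and raw performance are not critical, but security and stability matter a lot. Log analysis, WAF configuration, and vulnerability scanning are already in place. What is still missing is monitoring, and after some deliberation I decided to use Nagios for it.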

PS: a former colleague used Zabbix, which felt like overkill for my dozen or so machines.

Following the material available online, I ran into pitfalls in several places, so I am posting the steps I worked out as a reference.

I. Installing Nagios

1. Install the dependencies

#rpm -q gcc glibc glibc-common gd gd-devel xinetd openssl-devel
#yum install -y gcc glibc glibc-common gd gd-devel xinetd openssl-devel
#yum -y install httpd php mysql-devel php-mysql

2. Add the user and groups
#groupadd nagcmd
#useradd -G nagcmd nagios
#passwd nagios
#usermod -a -G nagcmd apache
3. Compile and install

#tar zxvf nagios-3.4.3.tar.gz
#cd nagios
#./configure --sysconfdir=/etc/nagios --with-command-group=nagcmd --enable-event-broker
#make all
#make install
#make install-init
#make install-commandmode
#make install-config

Create the Nagios web configuration file in the httpd configuration directory (conf.d):
#make install-webconf

Create a user for logging in to the Nagios web interface (the password used here is hopelessly weak; change it once the setup is finished)
#htpasswd -c /etc/nagios/htpasswd.users nagiosadmin
#Password: nagios

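After the steps above, httpd must be restarted:
#service httpd restart

If the restart reports "Could not reliably determine the server's fully qualified domain name", edit the httpd configuration:
#vi /etc/httpd/conf/httpd.conf
and add: ServerName localhost:80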
4. Install nagios-plugins
#tar zxvf nagios-plugins-1.4.13.tar.gz
#cd nagios-plugins-1.4.13
Note: the group here is nagios, not nagcmd
#./configure --with-nagios-user=nagios --with-nagios-group=nagios
#make all
#make install

5. Configure and start Nagios
(1) Enable start at boot
#chkconfig --add nagios
#chkconfig --level 35 nagios on
#chkconfig --list nagios

(2) Check the configuration files for syntax errors
#/usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg

(3) Start Nagios
#service nagios restart

(4) Configure SELinux (it blocks the CGI scripts)
#getenforce
#setenforce 0
#vi /etc/sysconfig/selinux ->SELINUX=disabled
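
Steps (2) and (3) can be chained so that a broken configuration never reaches a restart. The following is only a sketch: safe_restart and the three variables are names made up for illustration, and the default paths are the ones used in this article.

```shell
#!/bin/bash
# Hypothetical helper: restart Nagios only when the configuration check passes.
# The defaults match the paths used in this article; override them if yours differ.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
RESTART_CMD="${RESTART_CMD:-service nagios restart}"

safe_restart() {
    # "nagios -v <cfg>" exits non-zero when the configuration has errors
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        $RESTART_CMD
    else
        echo "Configuration check failed; not restarting" >&2
        return 1
    fi
}

# Usage (as root): safe_restart
```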

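II. Configuring Nagios

Only a brief description is given here; the concrete configuration is shown later
cgi.cfg
Configuration file that controls CGI access; if a new CGI configuration file is added, it has to be registered here
nagios.cfg
The main Nagios configuration file
resource.cfg
Variable definition file, also called the resource file. Variables such as $USER1$ are defined here so that other configuration files can reference them; in effect they are global variables
objects
A directory holding many configuration file templates that define Nagios objects
objects/commands.cfg
Command definitions, which other configuration files can reference
objects/contacts.cfg
Contacts and contact groups
objects/localhost.cfg
Monitoring of the local host
objects/printer.cfg
A template for monitoring printers; not enabled by default
objects/switch.cfg
A template for monitoring routers; not enabled by default
objects/templates.cfg
Host and service templates that other configuration files can reference
objects/timeperiods.cfg
The time periods during which Nagios monitors
objects/windows.cfg
A template for monitoring Windows hosts; not enabled by default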
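III. Installing NRPE (client side)

Note: NRPE (Nagios Remote Plugin Executor) is a daemon that runs check commands on remote servers. The Nagios server uses it to trigger the check commands on the remote host and receives their results back. Its overhead is much lower than that of SSH-based checks, and it does not need a system account on the remote host. Both NRPE and the Nagios plugins must be installed on the client.
1. Install the plugins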
#useradd -s /sbin/nologin nagios
#yum grouplist
#yum -y groupinstall "Development Tools" "Development Libraries"
#tar zxvf nagios-plugins-1.4.13.tar.gz
#cd nagios-plugins-1.4.13
#./configure --with-nagios-user=nagios --with-nagios-group=nagios
#make all
#make install

2. Install NRPE
#tar zxvf nrpe-2.15.tar.gz
#cd nrpe-2.15
#./configure --with-nrpe-user=nagios --with-nrpe-group=nagios --with-nagios-user=nagios --with-nagios-group=nagios --enable-command-args --enable-ssl
#make all
#make install-plugin
#make install-daemon
#make install-daemon-config

3. Configure NRPE
# vi /usr/local/nagios/etc/nrpe.cfg
log_facility=daemon
pid_file=/var/run/nrpe.pid
server_port=5666
#set to the IP of this host
server_address=192.168.1.101
nrpe_user=nagios
nrpe_group=nagios
#set to the IP of the Nagios server
allowed_hosts=192.168.1.100
command_timeout=60
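
A common reason for refused NRPE connections is that the Nagios server is missing from allowed_hosts. As a rough sketch, the check below reads the setting back from nrpe.cfg; nrpe_allows is a hypothetical helper name, and it only understands plain comma-separated IP addresses, not network ranges.

```shell
#!/bin/bash
# Hypothetical helper: succeed if IP "$1" is listed in allowed_hosts of the
# nrpe.cfg given as "$2" (defaults to the path used in this article).
# Assumption: allowed_hosts holds plain comma-separated addresses.
nrpe_allows() {
    local ip="$1" cfg="${2:-/usr/local/nagios/etc/nrpe.cfg}"
    grep -E '^[[:space:]]*allowed_hosts[[:space:]]*=' "$cfg" \
        | tail -n 1 \
        | cut -d= -f2 \
        | tr ', ' '\n\n' \
        | grep -Fxq "$ip"
}

# Usage: nrpe_allows 192.168.1.100 && echo "server is allowed"
```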

4. Start NRPE
# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
To make NRPE easier to start, the following can be saved as the /etc/init.d/nrped script
#!/bin/bash
NRPE=/usr/local/nagios/bin/nrpe
NRPECONF=/usr/local/nagios/etc/nrpe.cfg
case "$1" in
start)
echo -n "Starting NRPE daemon...."
$NRPE -c $NRPECONF -d
echo "done.."
;;
stop)
echo -n "Stopping NRPE daemon...."
pkill -u nagios nrpe
echo "done.."
;;
restart)
$0 stop
sleep 1
$0 start
;;
*)
echo "Usage: $0 start|stop|restart"
esac
exit 0

5. Configuration examples
vi /usr/local/nagios/etc/nrpe.cfg

command[check_users]=/usr/local/nagios/libexec/check_users -w 10 -c 20
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sd1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda6
command[check_sd2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda3
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 200 -c 400
command[check_cpu]=/usr/local/nagios/libexec/check_cpu.sh -w 50 -c 80
command[check_mem]=/usr/local/nagios/libexec/check_mem.sh -w 50 -c 80
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%
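
The names inside command[...] are what the Nagios server later passes to check_nrpe with -c, so it helps to list them when writing the service definitions. A small sketch (list_nrpe_commands is a made-up name; the default path is the one used above):

```shell
#!/bin/bash
# Hypothetical helper: print the command names defined in an nrpe.cfg,
# one per line. Commented-out definitions are skipped because the
# pattern only matches lines that start with "command[".
list_nrpe_commands() {
    local cfg="${1:-/usr/local/nagios/etc/nrpe.cfg}"
    sed -n 's/^[[:space:]]*command\[\([^]]*\)].*/\1/p' "$cfg"
}

# Usage: list_nrpe_commands /usr/local/nagios/etc/nrpe.cfg
```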

IV. NRPE on the server side

1. Install NRPE
#tar zxvf nrpe-2.15.tar.gz
#cd nrpe-2.15
#./configure --with-nrpe-user=nagios --with-nrpe-group=nagios --with-nagios-user=nagios --with-nagios-group=nagios --enable-command-args --enable-ssl --with-mysql
#make all
#make install-plugin
2. Define how remote hosts and services are monitored
Remote Linux hosts are monitored through NRPE with the check_nrpe plugin, whose syntax is:
check_nrpe -H <host> [-n] [-u] [-p <port>] [-t <timeout>] [-c <command>] [-a <arglist...>]
Example:

define command{
command_name check_swap_nrpe
command_line $USER1$/check_nrpe -H "$HOSTADDRESS$" -c "check_swap"
}

To keep the definition generic, the name of the remote command can be passed in as an argument ($ARG1$), for example:
#cd /etc/nagios/objects/
#vi commands.cfg (add the following)
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

Here is a newly added configuration:
define host{
use linux-server
host_name linhost
alias My Linux Host
address 192.168.1.101
}
define service{
use generic-service
host_name linhost
service_description CHECK USERS
check_command check_nrpe!check_users
}
define service{
use generic-service
host_name linhost
service_description Load
check_command check_nrpe!check_load
}
define service{
use generic-service
host_name linhost
service_description SDA1
check_command check_nrpe!check_sd1
}
define service{
use generic-service
host_name linhost
service_description SDA2
check_command check_nrpe!check_sd2
}
define service{
use generic-service
host_name linhost
service_description Zombie
check_command check_nrpe!check_zombie_procs
}
define service{
use generic-service
host_name linhost
service_description total procs
check_command check_nrpe!check_total_procs
}
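
The service blocks above differ only in the command name, so they can also be generated. This is just a sketch under that assumption: gen_services is a hypothetical helper, and it uses the command name as the service_description instead of the hand-written labels above.

```shell
#!/bin/bash
# Hypothetical generator: print one "define service" block per NRPE command.
# $1 is the host_name, the remaining arguments are NRPE command names.
gen_services() {
    local host="$1" cmd
    shift
    for cmd in "$@"; do
        cat <<EOF
define service{
use generic-service
host_name $host
service_description $cmd
check_command check_nrpe!$cmd
}
EOF
    done
}

# Usage: gen_services linhost check_users check_load >> linhost.cfg
```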

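V. Adding monitoring scripts
For things like CPU, memory, and LVS you have to write your own scripts. It is actually easy; just pay attention to two points: control the input (arguments and so on) and format the output. As long as the output follows the format Nagios recognizes, it will work.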
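
The two points can be sketched as a minimal plugin body. Nagios reads the first line of output (optionally followed by "|" and performance data) and the exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. check_value and its -v option are invented for illustration; a real plugin would measure the value itself.

```shell
#!/bin/bash
# Sketch of a plugin body. check_value is a hypothetical name; -v stands in
# for whatever measurement a real plugin would collect on its own.
check_value() {
    local OPTIND=1 opt value warning critical
    local usage="UNKNOWN - usage: -v <value> -w <warning> -c <critical>"
    # Control the input: accept only the options we know about
    while getopts "v:w:c:" opt; do
        case $opt in
            v) value=$OPTARG ;;
            w) warning=$OPTARG ;;
            c) critical=$OPTARG ;;
            *) echo "$usage"; return 3 ;;
        esac
    done
    if [ -z "$value" ] || [ -z "$warning" ] || [ -z "$critical" ]; then
        echo "$usage"
        return 3
    fi
    # Format the output: "STATE - text | label=value;warn;crit"
    if [ "$value" -ge "$critical" ]; then
        echo "CRITICAL - value=$value | value=$value;$warning;$critical"
        return 2
    elif [ "$value" -ge "$warning" ]; then
        echo "WARNING - value=$value | value=$value;$warning;$critical"
        return 1
    fi
    echo "OK - value=$value | value=$value;$warning;$critical"
    return 0
}

# In a standalone plugin file the last lines would be:
#   check_value "$@"
#   exit $?
```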
1. CPU monitoring
vi check_cpu.sh

#!/bin/bash
# Filename: check_cpu.sh
procinfo=`which procinfo 2>/dev/null`
sar=`which sar 2>/dev/null`
function help {
echo -e "\n\tThis plugin shows the % of used CPU, using either procinfo or sar (whichever is available)\n\n\t$0:\n\t\t-c <integer>\tIf the % of used CPU is above <integer>, returns CRITICAL state\n\t\t-w <integer>\tIf the % of used CPU is below CRITICAL and above <integer>, returns WARNING state\n"
# 3 is the UNKNOWN state for Nagios plugins
exit 3
}
# Getting parameters:
while getopts "w:c:h" OPT; do
case $OPT in
"w") warning=$OPTARG;;
"c") critical=$OPTARG;;
"h") help;;
esac
done
# Checking parameters:
if [ -z "$warning" ] || [ -z "$critical" ]; then
echo "ERROR: You must specify warning and critical levels"
help
fi
if [[ "$warning" -ge "$critical" ]]; then
echo "ERROR: critical level must be higher than warning level"
help
fi
# Picking whichever tool exists (procinfo first) and doing the actual check:
if [ -n "$procinfo" ]; then
idle=`$procinfo | grep idle | cut -d% -f1 | awk '{print $NF}' | cut -d. -f1`
elif [ -n "$sar" ]; then
idle=`$sar | tail -1 | awk '{print $8}' | cut -d. -f1`
else
echo "ERROR: You must have either procinfo or sar installed in order to run this plugin"
exit 3
fi
used=`expr 100 - $idle`
# Comparing the result and setting the correct level:
if [[ $used -ge $critical ]]; then
msg="CRITICAL"
status=2
elif [[ $used -ge $warning ]]; then
msg="WARNING"
status=1
else
msg="OK"
status=0
fi
# Printing the results:
echo "$msg - CPU used=$used% idle=$idle% | 'CPU Usage'=$used%;$warning;$critical;"
# Bye!
exit $status

Change the owner and group and add execute permission; the same steps apply to the other scripts
#chown nagios:nagios check_cpu.sh
#chmod +x check_cpu.sh
#./check_cpu.sh -w 60 -c 80

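[Problem] The script relies on the sar command, which may not be installed on the system
Solution:
#yum -y install sysstat
The first run may still fail because the file that stores the records does not exist yet; create it (it is named after the current day of the month): sar -o 16
The monitored hosts need the same setup (omitted)
[Note] Add a crontab entry so that the file holding the CPU records is generated every day
#crontab -e (remember to check that the cron service is running)
1 0 * * * /usr/lib64/sa/sa1
2. Memory monitoring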

vi check_mem.sh

#!/bin/bash
#DESC: OS mem check
#Author:James
function help {
echo -e "\n\tThis plugin shows the % of used MEM, using free (whichever is available)\n\n\t$0:\n\t\t-c <integer>\tIf the % of used MEM is above <integer>, returns CRITICAL state\n\t\t-w <integer>\tIf the % of used MEM is below CRITICAL and above <integer>, returns WARNING state\n"
# 3 is the UNKNOWN state for Nagios plugins
exit 3
}
while getopts "w:c:h" OPT; do
case $OPT in
"w") warning=$OPTARG;;
"c") critical=$OPTARG;;
"h") help;;
esac
done
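# Parse the "Mem:" line of free. This assumes the older procps layout
# (total used free shared buffers cached); newer versions print different columns.
set `free|head -2|tail -1`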
MEMTOTAL=$2
MEMUSED=$3
MEMFREE=$4
MEMBUFFERS=$6
MEMCACHED=$7
REALMEMUSED=`echo $MEMUSED - $MEMBUFFERS - $MEMCACHED | bc`
USEPCT=`echo "scale=3; $REALMEMUSED / $MEMTOTAL * 100" |bc -l`
REALMEMUSEDmb=`echo "($REALMEMUSED)/1024" | bc`
MEMTOTALMB=`echo "($MEMTOTAL)/1024"|bc`
if [ `echo "$USEPCT > $critical" |bc` == 1 ];then
echo "MEM CRITICAL - Memory usage = ${USEPCT}%,MEMTOTAL=${MEMTOTALMB}MB,RealUsed=${REALMEMUSEDmb}MB |Used=${USEPCT}%;$warning;$critical"
exit 2
elif [ `echo "$USEPCT > $warning" |bc` == 1 ];then
echo "MEM WARNING - Memory usage = ${USEPCT}%,MEMTOTAL=${MEMTOTALMB}MB,RealUsed=${REALMEMUSEDmb}MB |Used=${USEPCT}%;$warning;$critical"
exit 1
elif [ `echo "$USEPCT <= $warning" |bc` == 1 ];then
echo "MEM OK - Memory usage = ${USEPCT}%,MEMTOTAL=${MEMTOTALMB}MB,RealUsed=${REALMEMUSEDmb}MB |Used=${USEPCT}%;$warning;$critical"
exit 0
else
echo "MEM ERROR - Unable to determine memory usage"
exit 3
fi
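
Because the column layout of free differs between versions, reading /proc/meminfo directly is a possible alternative. This is a different technique from the script above, shown only as a sketch: mem_used_pct is a hypothetical name, and it applies the same formula (total - free - buffers - cached) and prints a truncated integer percentage.

```shell
#!/bin/bash
# Hypothetical helper: print the percentage of memory in use, not counting
# buffers and cache, from a meminfo-style file (default: /proc/meminfo).
mem_used_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END {
            # Without MemTotal nothing sensible can be reported
            if (total <= 0) exit 1
            printf "%d\n", (total - free - buffers - cached) * 100 / total
        }' "$meminfo"
}

# Usage: USEPCT=$(mem_used_pct) || { echo "MEM UNKNOWN"; exit 3; }
```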

3. LVS monitoring
vi check_lvs.sh

#!/bin/bash
USAGE_Method="$(basename $0)[-h|--hostname] <Free ip or hostname> [-w|--warning] <Free integer> [-c|--critical] <Free integer>"
USAGE_Value="warning value must be smaller than critical value: `basename $0` $*"
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
if [ $# -lt 4 ];then
echo "Usage:$USAGE_Method"
exit $STATE_UNKNOWN
fi
while [ $# -gt 0 ];
do
case "$1" in
-w|--warning)
shift
warning=$1
;;
-c|--critical)
shift
critical=$1
;;
esac
shift
done
if [[ $warning -ge $critical ]];then
echo "$USAGE_Value"
echo "Usage: $USAGE_Method"
exit $STATE_UNKNOWN
fi
ACT_COUNT=0
Inactive_count=0
stat1=`sudo ipvsadm | grep http | grep Route|wc -l`
if [ $stat1 -ne 0 ];then
for NUM in `sudo ipvsadm | grep http | grep Route | awk '{print $5}'`
do
ACT_COUNT=$(($ACT_COUNT+ $NUM))
done
for NUM in `sudo ipvsadm | grep http | grep Route | awk '{print $6}'`
do
Inactive_count=$(($Inactive_count+ $NUM))
done
else
echo " stat1:$stat1, lvs critical,lvs is down now."
exit $STATE_CRITICAL
fi
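
The listing above stops after summing the connection counters and never compares them with the thresholds. One possible way to finish the check is sketched below. It is an assumption that the thresholds are meant for the total number of active connections, and lvs_status is a made-up name. The function reads ipvsadm output on standard input, so it can be tried without an LVS host.

```shell
#!/bin/bash
# Hypothetical completion: sum ActiveConn ($5) and InActConn ($6) over the
# real-server lines of "ipvsadm" output read from stdin, then compare the
# active total with the warning ($1) and critical ($2) thresholds.
# Assumption: the thresholds apply to active connections.
lvs_status() {
    local warning="$1" critical="$2" sums n active inactive state rc
    sums=$(awk '/http/ && /Route/ { n++; a += $5; i += $6 }
                END { print n + 0, a + 0, i + 0 }')
    read -r n active inactive <<< "$sums"
    if [ "$n" -eq 0 ]; then
        echo "CRITICAL - no http real servers listed, LVS may be down"
        return 2
    fi
    if [ "$active" -ge "$critical" ]; then
        state="CRITICAL"; rc=2
    elif [ "$active" -ge "$warning" ]; then
        state="WARNING"; rc=1
    else
        state="OK"; rc=0
    fi
    echo "$state - active=$active inactive=$inactive | active=$active;$warning;$critical inactive=$inactive"
    return $rc
}

# Usage: sudo ipvsadm | lvs_status "$warning" "$critical"; exit $?
```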

4. MySQL monitoring

On the MySQL server to be monitored, create a database dedicated to Nagios

mysql> create database nagdb default CHARSET=utf8;
mysql> grant select on nagdb.* to 'nagios'@'192.168.1.100';
mysql> update mysql.user set Password = PASSWORD('nagios') where user='nagios';
mysql> flush privileges;

#/usr/local/nagios/libexec/check_mysql -H 192.168.1.101 -u nagios -d nagdb -p nagios -w 10 -c 30

5. memcached monitoring
This uses a plugin written in Perl that needs quite a few dependency packages, which makes it rather painful to set up.

(1) Install the modules

#yum -y install perl-Carp-Clan perl-Cache-Memcached perl-Nagios-Plugin

-- If they cannot be installed
#wget http://dag.wieers.com/rpm/packages/rpmforge-release/rpmforge-release-0.5.2-2.rf.src.rpm
#rpm -ivh rpmforge-release-0.5.2-2.rf.src.rpm
#yum -y install perl-Nagios-Plugin.noarch perl-Carp-Clan.noarch perl-Cache-Memcached.noarch

-- If perl-Nagios-Plugin still cannot be installed
wget http://packages.sw.be/perl-Nagios-Plugin/perl-Nagios-Plugin-0.33-1.el5.rf.noarch.rpm
rpm -ivh perl-Nagios-Plugin-0.33-1.el5.rf.noarch.rpm --force --nodeps

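(2) Install the plugin
Download Nagios-Plugins-Memcached-0.02.tar.gz and install it (there are many dependencies; note where the .pm files end up)
#tar xzvf Nagios-Plugins-Memcached-0.02.tar.gz
#cd Nagios-Plugins-Memcached-0.02
#yum -y install perl-CPAN
# perl Makefile.PL

-- This prints some prompts asking you to choose; pick what you like or just press Enter all the way through
# make

-- At this point it downloads some things needed at runtime
# make install

-- By default check_memcached is installed into a bin directory (/usr/bin, or /usr/local/bin as in the command below)
-- That is fine; just copy it into the Nagios libexec directory
#cp /usr/local/bin/check_memcached /usr/local/nagios/libexec/
#chown nagios:nagios /usr/local/nagios/libexec/check_memcached

Add the following entries to commands.cfg. (Here check_memcached is not installed on the memcached server; the copy on the Nagios server connects directly to port 11211 of the memcached server. You could also install it on the memcached server and fetch the values through check_nrpe.)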
define command {
command_name check_memcached_11211
command_line $USER1$/check_memcached -H 192.168.1.101:11211 --size-warning 80 --size-critical 90
}
The definition above monitors the memory usage ratio of memcached
define command {
command_name memcached_response_11211
command_line /usr/local/bin/check_memcached -H 192.168.1.101 -w 300 -c 500
}
This one checks whether memcached still responds
define command {
command_name check_memcached_hit
command_line /usr/local/bin/check_memcached -H 192.168.1.101 --hit-warning 10 --hit-critical 5
}
./check_memcached -H 192.168.108.96 -w 300 -c 500
