Fixing a small bug in the check_mk source code


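In production we use icinga together with check_mk as our monitoring system.
Today, when I ran cmk -II to update a host's inventory, the command failed with the following error no matter which host was given as the argument: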

Removing unimplemented check /
Removing unimplemented check oom_adj_for_cron
Removing unimplemented check oom_adj_for_sshd
Traceback (most recent call last):
  File "/usr/share/check_mk/modules/check_mk.py", line 5801, in <module>
    remove_autochecks_of(host, checknames)
  File "/usr/share/check_mk/modules/check_mk.py", line 2907, in remove_autochecks_of
    if splitted[3] not in check_info:
IndexError: list index out of range

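A long search online turned up nothing helpful, so I tried debugging the source at the location named in the traceback:
I edited /usr/share/check_mk/modules/check_mk.py and added 'print splitted' to print the list whose index went out of range, namely splitted.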

       for fn in glob.glob(autochecksdir + "/*.mk"):
           lines = []
           count = 0
           for line in file(fn):
               # hostname and check type can be quoted with ' or with "
               double_quoted = line.replace("'", '"').lstrip()
               if double_quoted.startswith('("'):
                   count += 1
                   splitted = double_quoted.split('"')
                   print splitted  # debug: show the list that is about to be indexed
                   if splitted[1] != hostname or (checktypes != None and splitted[3] not in checktypes):
                       if splitted[3] not in check_info:
                           sys.stderr.write('Removing unimplemented check %s\n' % splitted[3])
                           continue
                       lines.append(line)
                   else:
                       removed += 1
           if len(lines) == 0:
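
Printing every list is noisy on a large installation. A narrower variant of the same debugging step is sketched below as a standalone function, not as an edit to check_mk.py: it reports only the line whose split result is too short to index. The name field_or_report and its parameters are hypothetical and are not part of check_mk.

```python
import sys


def field_or_report(line, index=3, out=sys.stderr):
    """Return piece number `index` of `line` after splitting on quotes.

    Uses the same normalisation as the excerpt above (' becomes ").
    If the piece does not exist, write a diagnostic to `out` and
    return None instead of raising IndexError.
    """
    double_quoted = line.replace("'", '"').lstrip()
    splitted = double_quoted.split('"')
    try:
        return splitted[index]
    except IndexError:
        out.write("cannot index field %d in %r (pieces: %r)\n"
                  % (index, line, splitted))
        return None
```

Calling it on each line of an autochecks file would leave well-formed lines silent and print one diagnostic per malformed line.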

Then I ran cmk -II again and saw the following:

...
("iad1-server5", job, 'oom_adj_for_sshd', None)
Removing unimplemented check oom_adj_for_sshd

("iad1-server5", kernel.util, None, kernel_util_default_levels)
Traceback (most recent call last):
  File "/usr/share/check_mk/modules/check_mk.py", line 5803, in <module>
    remove_autochecks_of(host, checknames)
  File "/usr/share/check_mk/modules/check_mk.py", line 2909, in remove_autochecks_of
    if splitted[3] not in check_info:

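As the output shows, the line
("iad1-server5", kernel.util, None, kernel_util_default_levels)
cannot be split on quotation marks into a list with more than 3 elements, because only the hostname is quoted. Accessing splitted[3] therefore fails with 'IndexError: list index out of range'.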
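
The behaviour is easy to reproduce outside check_mk. The minimal sketch below applies the same quote normalisation and split to the two lines from the output above. split_on_quotes is a hypothetical helper name, not a check_mk function.

```python
def split_on_quotes(line):
    """Mimic the parsing step: turn ' into " and split on "."""
    double_quoted = line.replace("'", '"').lstrip()
    return double_quoted.split('"')


# The two autochecks lines shown in the debug output.
ok_line = """("iad1-server5", job, 'oom_adj_for_sshd', None)"""
bad_line = """("iad1-server5", kernel.util, None, kernel_util_default_levels)"""

ok_fields = split_on_quotes(ok_line)
bad_fields = split_on_quotes(bad_line)

# Two quoted strings give 5 pieces, so index 3 exists.
print("%d pieces, index 3 = %s" % (len(ok_fields), ok_fields[3]))
# One quoted string gives only 3 pieces (indexes 0 to 2).
print("%d pieces" % len(bad_fields))
```

The first line contains two quoted strings, so index 3 holds 'oom_adj_for_sshd'. The second contains only the quoted hostname, so the list ends at index 2.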

So I added a simple check: the 'Removing unimplemented check' logic only runs when the list has more than 3 elements.
# vim /usr/share/check_mk/modules/check_mk.py

       for fn in glob.glob(autochecksdir + "/*.mk"):
           lines = []
           count = 0
           for line in file(fn):
               # hostname and check type can be quoted with ' or with "
               double_quoted = line.replace("'", '"').lstrip()
               if double_quoted.startswith('("'):
                   count += 1
                   splitted = double_quoted.split('"')
                   # Guard: a line with only one quoted field splits into just 3 pieces, so splitted[3] does not exist.
                   if len(splitted) > 3:
                       if splitted[1] != hostname or (checktypes != None and splitted[3] not in checktypes):
                           if splitted[3] not in check_info:
                               sys.stderr.write('Removing unimplemented check %s\n' % splitted[3])
                               continue
                           lines.append(line)
                       else:
                           removed += 1
           if len(lines) == 0:
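
To make the effect of the guard easier to test in isolation, here is a minimal standalone sketch of the same filtering logic. It is not the check_mk implementation: the function name filter_autocheck_lines, its parameters and the host names in the example are all hypothetical, and it works on a list of strings instead of reading files.

```python
def filter_autocheck_lines(lines, hostname, checktypes, check_info):
    """Return (kept_lines, removed_count), following the guarded logic above.

    Lines that start with (" but split into 3 pieces or fewer are
    skipped: they are neither kept nor counted as removed.
    """
    kept = []
    removed = 0
    for line in lines:
        double_quoted = line.replace("'", '"').lstrip()
        if not double_quoted.startswith('("'):
            continue  # brackets, blank lines, comments
        splitted = double_quoted.split('"')
        if len(splitted) <= 3:
            continue  # the guard: splitted[3] would raise IndexError
        if splitted[1] != hostname or (
                checktypes is not None and splitted[3] not in checktypes):
            if splitted[3] not in check_info:
                continue  # unimplemented check, drop it
            kept.append(line)
        else:
            removed += 1
    return kept, removed


# Hypothetical sample data, for illustration only.
sample = [
    "[\n",
    "  ('web1', 'df', '/', {}),\n",
    "  ('web2', 'df', '/', {}),\n",
    "  ('web2', 'gone', None, None),\n",
    "  ('web2', kernel.util, None, None),\n",
    "]\n",
]
kept, removed = filter_autocheck_lines(sample, "web1", None, {"df": None})
```

In this sketch, as in the patched excerpt, a line that fails the length check is silently left out of lines. Whether such lines should be preserved instead is a design choice the patch does not address.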

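Then I ran 'cmk -II' and saw many 'Removing unimplemented check' messages. On the second run they no longer appeared, presumably because the stale entries that matched the condition had already been removed.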

# cmk -II iad1-server1

...
Removing unimplemented check /
Removing unimplemented check oom_adj_for_cron
Removing unimplemented check oom_adj_for_sshd
Removing unimplemented check crond
Removing unimplemented check sshd
Removing unimplemented check xinetd
cpu.loads         1 new checks
df                2 new checks
kernel.util       1 new checks
lnx_if            1 new checks
local             5 new checks
mem.used          1 new checks
mrpe              4 new checks
postfix_mailq     1 new checks
ps                5 new checks
tcp_conn_stats    1 new checks
uptime            1 new checks

# cmk -II iad1-server1

cpu.loads         1 new checks
df                2 new checks
kernel.util       1 new checks
lnx_if            1 new checks
local             5 new checks
mem.used          1 new checks
mrpe              4 new checks
postfix_mailq     1 new checks
ps                5 new checks
tcp_conn_stats    1 new checks
uptime            1 new checks
