Skip to content

Commit

Permalink
immediately abort chef-client if maintenance mode is set
Browse files Browse the repository at this point in the history
The reasoning is explained in the comments.  I'm not sure why I ever
thought it was a good idea to allow a chef-client run to proceed if
maintenance mode is set.  It's not even as if that approach could ever
restore an ill node to full health, because we were deliberately
checking whether the maintenance mode was set prior to the chef-client
run, and if so, leaving it set.  This way, if a node is left in
maintenance mode, it will be discovered sooner, resulting in the cloud
operator being alerted to a degraded cluster sooner.
  • Loading branch information
Adam Spiers committed Sep 4, 2015
1 parent ab982ad commit 2cc0c35
Show file tree
Hide file tree
Showing 2 changed files with 24 additions and 31 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -19,23 +19,25 @@ class Handler < Chef::Handler

class StartHandler < Handler
def report
# This is informational only, and gives us a fraction more
# information in /var/log/chef/client.log and in the default
# attributes (until next run) for debugging purposes.
# However, it will only take effect after the handler has been
# installed in /etc/chef/client.rb *and* chef-client daemon
# has subsequently been restarted; the
# reload_chef_client_config hack doesn't work with
# start_handlers since it reloads the config too late, after
# the start handlers have already been triggered.
start_mode = record_maintenance_mode_before_this_chef_run
Chef::Log.info("Pacemaker maintenance mode currently %s" %
[start_mode ? "on" : "off"])

if maintenance_mode_set_via_this_chef_run?
# Sanity check: this should never happen because we're using
# default attributes which get wiped for each chef-client run.
raise "BUG: Pacemaker maintenance mode was already set at the start of this run! (pid #$$)"
# Check we're not in maintenance mode. This could happen for two
# reasons:
#
# 1. A previous chef-client run failed, so we shouldn't
# risk compounding problems by trying again until the
# root cause is addressed.
#
# 2. Someone/something other than Chef set the node into
# maintenance mode. That should be rare, but when it
# happens, we shouldn't interfere.
#
# So in both cases, we should abort the run immediately with a
# helpful message.
if maintenance_mode?
raise \
"Pacemaker maintenance mode was already set on " +
"#{node.hostname}; aborting! Please diagnose why this was the " +
"case, fix the root cause, and then unset maintenance mode via " +
"HAWK or by running 'crm node ready' on the node."
end
end
end
Expand All @@ -51,14 +53,15 @@ def report
Chef::Log.info("Taking node out of Pacemaker maintenance mode")
system("crm --wait node ready")
else
# This shouldn't happen, and suggests that one of the recipes
# is interfering in a way it shouldn't.
# This shouldn't happen, and suggests that something is
# interfering in a way it shouldn't.
raise "Something took node out of maintenance mode during run!"
end
else
if maintenance_mode?
Chef::Log.warn("Node placed in Pacemaker maintenance mode but " +
"not by Chef::Provider::PacemakerService; leaving as is")
# This shouldn't happen, and suggests that something is
# interfering in a way it shouldn't.
raise "Something put node into maintenance mode during run!"
else
Chef::Log.debug("Node not in Pacemaker maintenance mode")
end
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,16 +23,6 @@ def maintenance_mode?
!! (%x(crm_attribute -G -N #{node.hostname} -n maintenance -q) =~ /^on$/)
end

def record_maintenance_mode_before_this_chef_run
# Via Chef::Pacemaker::StartHandler we track whether anything
# has put the node into Pacemaker maintenance mode prior to this
# chef-client run. This may come in handy during debugging.
#
# We use a default attribute so that it will get reset at the
# beginning of each chef-client run.
node.default[:pacemaker][:maintenance_mode][$$][:at_start] = maintenance_mode?
end

def set_maintenance_mode_via_this_chef_run
# We track whether anything has put the node into Pacemaker
# maintenance mode during this chef-client run, so we know
Expand Down

0 comments on commit 2cc0c35

Please sign in to comment.