Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pickup latest goal state on tenant certificate rotation + Avoid infin… #3273

Merged
merged 5 commits into from
Dec 13, 2024

Conversation

narrieta
Copy link
Member

@narrieta narrieta commented Dec 9, 2024

Fixes for a couple of bugs on the goal state that surfaced on recent incidents.

First, if there is a problem with the tenant certificate, UpdateHandler._initialize_goal_state() can go into an infinite loop. Second, during rotation of the tenant certificate, the Agent needs to pick up the latest extensions goal state.

For both issues, comments in the code provide more details.

The PR also includes a fix for a bug logging in UpdateHandler._try_update_goal_state().

…ite loop when the tenant certificate is missing
@@ -61,14 +61,6 @@ class GoalStateProperties(object):
All = RoleConfig | HostingEnv | SharedConfig | ExtensionsGoalState | Certificates | RemoteAccessInfo


class GoalStateInconsistentError(ProtocolError):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking the tenant cert at the GoalState level and raising an exception when there is a mismatch, now the callers of GoalState that are actually interested in this situation do the check and handle the error. The caller, in this case the main loop of the extension handler, can do a better job handling and reporting the error.

message = u"Fetching the goal state is still failing: {0}".format(textutil.format_exception(e))
logger.error(message)
add_event(op=WALAEventOperation.FetchGoalState, is_success=False, message=message, log_event=False)
# Report one single periodic error every 6 hours
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while working on these fixes, I noticed that these messages were being logged every 6 secs, rather than every 6 hours as was intended.

@@ -381,20 +380,6 @@ def http_get_handler(url, *_, **__):
self.assertEqual(initial_timestamp, goal_state.extensions_goal_state.created_on_timestamp, "The timestamp of the updated goal state is incorrect")
self.assertTrue(goal_state.extensions_goal_state.is_outdated, "The updated goal state should be marked as outdated")

def test_it_should_raise_when_the_tenant_certificate_is_missing(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed anymore

@@ -444,45 +429,6 @@ def test_it_should_download_certs_on_a_new_fabric_goal_state(self):
goal_state.update()
self.assertTrue(os.path.isfile(crt_path))

def test_it_should_refresh_the_goal_state_when_it_is_inconsistent(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to test_update

@@ -506,39 +452,6 @@ def http_get_handler(url, *_, **__):
self.assertEqual(0, len(goal_state.certs.summary), "Cert list should be empty")
self.assertEqual(1, http_get_handler.certificate_requests, "There should have been exactly 1 requests for the goal state certificates")

def test_it_should_refresh_the_goal_state_when_it_is_fails_to_decrypt_cert(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test was not needed to begin with

maddieford
maddieford previously approved these changes Dec 9, 2024
if vm_settings is not None:
most_recent = vm_settings if vm_settings.created_on_timestamp > extensions_config.created_on_timestamp else extensions_config
else:
most_recent = extensions_config
elif goal_state_updated:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this elif condition the same as the first if condition?

Copy link
Member Author

@narrieta narrieta Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; with this change the elif is not needed (in fact, it is dead code.) Removing it.

@@ -491,45 +496,93 @@ def _check_threads_running(self, all_thread_handlers):
logger.warn("{0} thread died, restarting".format(thread_handler.get_thread_name()))
thread_handler.start()

errors = 0

def _try_update_goal_state(self, protocol):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only hit this logic when updating the goal state. Do we also need it when we initialize the goal state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, _initialize_goal_state also calls this method

@@ -491,45 +496,93 @@ def _check_threads_running(self, all_thread_handlers):
logger.warn("{0} thread died, restarting".format(thread_handler.get_thread_name()))
thread_handler.start()

errors = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks; left-over debugging code. Removing it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if change pushed, it's still there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks; updated

azurelinuxagent/ga/update.py Show resolved Hide resolved
@@ -491,45 +496,93 @@ def _check_threads_running(self, all_thread_handlers):
logger.warn("{0} thread died, restarting".format(thread_handler.get_thread_name()))
thread_handler.start()

errors = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if change pushed, it's still there

azurelinuxagent/ga/update.py Show resolved Hide resolved
@narrieta narrieta merged commit 50d37f8 into Azure:develop Dec 13, 2024
9 of 11 checks passed
@narrieta narrieta deleted the goal-state branch December 18, 2024 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants