[DOCS-10646] HA Agent #28928

Open: wants to merge 12 commits into base: master

Conversation

aliciascott
Contributor

What does this PR do? What is the motivation?

Merge instructions

Merge readiness:

  • Ready for merge

For Datadog employees:
Merge queue is enabled in this repo. Your branch name MUST follow the <name>/<description> convention and include the forward slash (/). Without this format, your pull request will not pass in CI, the GitLab pipeline will not run, and you won't get a branch preview. Getting a branch preview makes it easier for us to check any issues with your PR, such as broken links.

If your branch doesn't follow this format, rename it or create a new branch and PR.

To have your PR automatically merged after it receives the required reviews, add the following PR comment:

/merge

Additional notes

Verified

This commit was signed with the committer’s verified signature.
prestwich James Prestwich
@aliciascott added the WORK IN PROGRESS label ("No review needed, it's a wip ;)") on Apr 23, 2025
@aliciascott requested a review from a team as a code owner on April 23, 2025 17:05
@github-actions bot added the Architecture label ("Everything related to the Doc backend") on Apr 23, 2025

@github-actions bot added the Images label ("Images are added/removed with this PR") on Apr 23, 2025
aliciascott and others added 3 commits April 24, 2025 09:51

@github-actions bot added the Guide label ("Content impacting a guide") and removed the Architecture label ("Everything related to the Doc backend") on May 1, 2025
@aliciascott changed the title from "[DOCS-10646] initial commit HA Agent setup" to "[DOCS-10646] HA Agent" on May 2, 2025
Contributor

@estherk15 estherk15 left a comment

Looks great! Noticed a few things:

  • Recommend consistency when referencing the preferred active Agent. (Preferred active Agent, preferred active Agent, Preferred Active Agent)
  • Left a suggestion to remove nested bullets, but if it changes the intended message, feel free to ignore.


### Installation

1. Install two Agents on like hosts (one on each host). The following setup is for hosts with similar capabilities (CPU, RAM, and networking) and configurations (including `datadog.yaml` and integration settings).
Contributor

Suggested change
1. Install two Agents on like hosts (one on each host). The following setup is for hosts with similar capabilities (CPU, RAM, and networking) and configurations (including `datadog.yaml` and integration settings).
1. Install the Datadog Agent on two similar hosts (one on each host). The following setup is for hosts with similar capabilities (CPU, RAM, and networking) and configurations (including `datadog.yaml` and integration settings).

Contributor

For clarity, it's one Agent per host, right?


1. Install two Agents on like hosts (one on each host). The following setup is for hosts with similar capabilities (CPU, RAM, and networking) and configurations (including `datadog.yaml` and integration settings).

2. For both Agents, on each host, configure your `datadog.yaml` with the following settings:
Contributor

Suggested change
2. For both Agents, on each host, configure your `datadog.yaml` with the following settings:
2. On each host, configure your `datadog.yaml` with the following settings:

Comment on lines +62 to +63
For example, to set up the SNMP integration, install it on both Agents using the [SNMP Metrics][1] setup guide.
**Note**: Both [individual device monitoring][10] and [Autodiscovery][11] methods are supported for the SNMP integration.
Contributor

Not sure if you meant for this to be on a new line:

Suggested change
For example, to set up the SNMP integration, install it on both Agents using the [SNMP Metrics][1] setup guide.
**Note**: Both [individual device monitoring][10] and [Autodiscovery][11] methods are supported for the SNMP integration.
For example, to set up the SNMP integration, install it on both Agents using the [SNMP Metrics][1] setup guide. <br>
**Note**: Both [individual device monitoring][10] and [Autodiscovery][11] methods are supported for the SNMP integration.

For example, to set up the SNMP integration, install it on both Agents using the [SNMP Metrics][1] setup guide.
**Note**: Both [individual device monitoring][10] and [Autodiscovery][11] methods are supported for the SNMP integration.

After configured, the two Agents function as an HA pair:
Contributor

Suggested change
After configured, the two Agents function as an HA pair:
After the Agents are configured, they function as an HA pair:


2. Search for your previously configured Agents using tags or hostname, for example, `config_id:<CONFIG-NAME>`.

{{< img src="/integrations/guide/high_availability/fleet-view-agents.png" alt="Fleet Automation View Agents" style="width:100%;" >}}
Contributor

Recommend aligning images under their numbered list items. If possible, remove the Preview labels with pawparazzi.

Comment on lines +85 to +86
1. Test that failover works by shutting down the Agent or host that is Active.
2. The standby Agent should start monitoring the configured integration(s) after 1-3 minutes.
Contributor

Suggested change
1. Test that failover works by shutting down the Agent or host that is Active.
2. The standby Agent should start monitoring the configured integration(s) after 1-3 minutes.
1. Test failover by shutting down the active Agent or its host.
2. The standby Agent should start monitoring the configured integration(s) after 1-3 minutes.
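
The failover check in this suggestion could be scripted as a simple poll. The following is a minimal Python sketch, not part of the PR: `wait_until` and `standby_is_active` are hypothetical names, and how you detect that the standby Agent has taken over (for example, by checking its state in Fleet Automation) is left as an assumption. The 180-second default mirrors the upper end of the 1-3 minute window quoted above.

```python
import time


def wait_until(predicate, timeout_s=180.0, interval_s=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate succeeded in time, otherwise False.
    clock and sleep are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def standby_is_active():
    """Hypothetical placeholder: replace with your own check of the standby Agent."""
    raise NotImplementedError("supply a real check for your environment")


# Example (after shutting down the active Agent or its host):
# took_over = wait_until(standby_is_active, timeout_s=180, interval_s=10)
```

Polling against a deadline, rather than sleeping for a fixed three minutes, lets the check finish as soon as the standby Agent takes over.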

Comment on lines +92 to +103
**If no Preferred active Agent is defined**:

- The active Agent is initially chosen randomly.
- Active Agent switching is minimized to avoid unnecessary failover:
- If the primary Agent is active and it shuts down or crashes, the secondary Agent takes over as the new active Agent.
- When the primary Agent recovers, the secondary Agent remains active.

**If a Preferred active Agent is defined**:

- The preferred active Agent takes priority:
- If the primary Agent is the preferred active Agent and is active, a failover occurs if the primary Agent shuts down or crashes, making the secondary Agent active.
- When the primary Agent recovers, it automatically resumes the active role, and the secondary Agent returns to standby.
Contributor

I tend towards one level of bulleted lists, but hopefully this still conveys the same points.

Suggested change
**If no Preferred active Agent is defined**:
- The active Agent is initially chosen randomly.
- Active Agent switching is minimized to avoid unnecessary failover:
- If the primary Agent is active and it shuts down or crashes, the secondary Agent takes over as the new active Agent.
- When the primary Agent recovers, the secondary Agent remains active.
**If a Preferred active Agent is defined**:
- The preferred active Agent takes priority:
- If the primary Agent is the preferred active Agent and is active, a failover occurs if the primary Agent shuts down or crashes, making the secondary Agent active.
- When the primary Agent recovers, it automatically resumes the active role, and the secondary Agent returns to standby.
**Without a preferred active Agent**
- The active Agent is initially selected at random.
- Failover occurs only when the current active Agent shuts down or crashes.
- When the failed Agent recovers, it does not automatically reclaim the active role.
**With a preferred active Agent**
- The preferred Agent always takes priority when available.
- If it fails, the standby Agent becomes active.
- When the preferred Agent recovers, it automatically resumes the active role, and the standby Agent returns to standby.
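
The two behaviors described in this suggestion can be modeled as a small state machine. The sketch below is a toy Python model under stated assumptions, not Datadog's implementation: the `HAPair` class, its methods, and the Agent names are hypothetical. The initial active Agent is fixed to the first one in the list instead of being chosen at random, only to keep the example deterministic.

```python
class HAPair:
    """Toy model of an HA Agent pair (hypothetical; not a Datadog API)."""

    def __init__(self, agents, preferred=None):
        self.agents = list(agents)
        self.preferred = preferred      # preferred active Agent, if one is defined
        self.up = set(self.agents)      # Agents currently running
        # Assumption for determinism: without a preference, start with the first
        # Agent (the documented behavior is a random initial choice).
        self.active = preferred if preferred is not None else self.agents[0]

    def fail(self, agent):
        """Mark an Agent as shut down or crashed, then re-evaluate the active role."""
        self.up.discard(agent)
        self._elect()

    def recover(self, agent):
        """Mark an Agent as running again, then re-evaluate the active role."""
        self.up.add(agent)
        self._elect()

    def _elect(self):
        if self.preferred in self.up:
            # The preferred Agent takes priority whenever it is available.
            self.active = self.preferred
        elif self.active not in self.up:
            # Fail over only when the current active Agent is down.
            self.active = next(iter(sorted(self.up)), None)
        # Otherwise keep the current active Agent to avoid unnecessary switching.


# Without a preferred active Agent: the recovered Agent does not reclaim the role.
pair = HAPair(["agent-a", "agent-b"])
pair.fail("agent-a")
pair.recover("agent-a")
print(pair.active)  # agent-b

# With a preferred active Agent: it resumes the active role after recovering.
pair = HAPair(["agent-a", "agent-b"], preferred="agent-a")
pair.fail("agent-a")
pair.recover("agent-a")
print(pair.active)  # agent-a
```

The only difference between the two modes is the first branch of `_elect`: with no preference defined, the model never switches away from a healthy active Agent.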


### Why does my Agent have an `unknown` HA Agent state?

- Remote Configuration may not be set up correctly. Review the [prerequisites](#prerequisites) and [Remote Configuration setup][12] documentation for more information.
Contributor

Suggested change
- Remote Configuration may not be set up correctly. Review the [prerequisites](#prerequisites) and [Remote Configuration setup][12] documentation for more information.
- Remote Configuration may not be set up correctly. For more information, review the [prerequisites](#prerequisites) and [Remote Configuration setup][12] documentation.

@@ -41,20 +41,22 @@ Navigate to the [Agent installation page][1], and install the [Datadog Agent][2]

{{< img src="network_device_monitoring/getting_started/ndm_install_agent.png" alt="The Agent configuration page, highlighting the Ubuntu installation." style="width:100%;" >}}

## Setup

#### High Availability
Contributor

Why does this section jump from H2 to H4?


You can configure active and standby Agents to function as an HA pair in NDM. If the active Agent goes down, the standby Agent takes over within 90 seconds, becoming the new active Agent. Additionally, you can designate a preferred active Agent, allowing NDM to automatically revert to it once it becomes available again. This feature allows for proactive Agent switching ahead of scheduled maintenance.

## Setup
Reference [High Availability support of the Datadog Agent][20] for more information.
Contributor

Suggested change
Reference [High Availability support of the Datadog Agent][20] for more information.
For more information, see [High Availability support of the Datadog Agent][20].

Labels
Guide ("Content impacting a guide"), Images ("Images are added/removed with this PR"), okr11, WORK IN PROGRESS ("No review needed, it's a wip ;)")
2 participants