HashiCorp Vault role

Every vault command or API call mentioned below assumes you have previously exported the VAULT_ADDR and VAULT_TOKEN environment variables:

export VAULT_ADDR="..."
export VAULT_TOKEN="..."

or logged in via another auth method like LDAP:

vault token lookup 1>/dev/null || vault login -method=ldap username="..."
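As a sanity check before running any of the commands below, you can make sure VAULT_ADDR has a value (the address in this sketch is a placeholder, adjust it to your cluster):

```shell
# Default VAULT_ADDR when it is not already exported; ":-" keeps any
# value that was set in the environment beforehand.
export VAULT_ADDR="${VAULT_ADDR:-https://secret-management-staging.cosium.com:8200}"
echo "talking to Vault at $VAULT_ADDR"
```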

Variable reference

Mandatory variables

Variable Description Example value
hashicorpvault_version pinned hashicorp vault apt package version 1.11.2-1
hashicorpvault_cluster_name name of the cluster, must match the ansible group name in the case of a cluster secret-management-staging

Default simple variables

Variable Description Default value
hashicorpvault_tls_enable Enable TLS. If enabled, certificates will be pulled from the URL specified with hashicorpvault_tls_remote_cert. True
hashicorpvault_tls_remote_cert URL to pull the certificates from https://pub-auth-certificate.cosium.com
hashicorpvault_listen_address Specifies the address to bind to for listening 127.0.0.1
hashicorpvault_backup Enable backups. Local only if the hashicorpvault_backup_sftp dict is not defined. True
hashicorpvault_backup_sftp Define this dict (keys hashicorpvault_backup_sftp.server and hashicorpvault_backup_sftp.port) to enable remote backups. Undefined
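For reference, a hypothetical group_vars file combining these variables for a staging cluster could look like this (the SFTP server name is illustrative):

```yaml
# file: group_vars/secret-management-staging.yml (hypothetical path)
hashicorpvault_version: "1.11.2-1"
hashicorpvault_cluster_name: "secret-management-staging"
hashicorpvault_tls_enable: true
hashicorpvault_tls_remote_cert: "https://pub-auth-certificate.cosium.com"
hashicorpvault_listen_address: "0.0.0.0"
hashicorpvault_backup: true
# define this dict only if remote backups are wanted
hashicorpvault_backup_sftp:
  server: "backup.cosium.com"
  port: 22
```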

Deployment scenarios

non-HA with raft integrated storage backend

This is the simplest case. Launch this role, initialize, unseal Vault and you are good to go.

Note that the only way to guarantee consistent snapshots is to use raft snapshots; a backup solution will be implemented in a future PR.

HA with raft storage backend

Install procedure

  1. all nodes must have their DNS records set
  2. all nodes must have their certificates ready to be pulled from hashicorpvault_tls_remote_cert
  3. a reverse proxy must act as a load balancer in front of the cluster; you can use the haproxy role with a configuration like:
      - name: "secret-management-staging"
        raw_config: |
          option httpchk GET /v1/sys/health
          http-check expect status 200
          default-server check check-ssl verify none
        server:
          - name: "secret-management-staging-1"
            fqdn: "secret-management-staging-1.cosium.com"
            port: "8200"
          - name: "secret-management-staging-2"
            fqdn: "secret-management-staging-2.cosium.com"
            port: "8200"
          - name: "secret-management-staging-3"
            fqdn: "secret-management-staging-3.cosium.com"
            port: "8200"
    
  4. an inventory group named after hashicorpvault_cluster_name, with all nodes defined:
    [secret-management-staging]
    secret-management-staging-1 ansible_host=secret-management-staging-1.cosium.com
    secret-management-staging-2 ansible_host=secret-management-staging-2.cosium.com
    secret-management-staging-3 ansible_host=secret-management-staging-3.cosium.com
    
  5. Launch the role on all nodes to install and start Vault
  6. Initialize one node with vault operator init and unseal it with vault operator unseal. The unseal keys are valid for the whole cluster; this node becomes the leader
  7. Unseal the other nodes with the unseal keys of the leader. When an uninitialized Vault server starts up it will attempt to join each potential leader that is defined, retrying until successful.
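Steps 6 and 7 above can be sketched as a small shell helper (a sketch only: it assumes root SSH access to the nodes and the vault CLI installed on each; the function name is illustrative):

```shell
# Unseal a node by submitting each unseal key in turn over SSH.
# "vault operator init" on the first node prints the unseal keys and the
# initial root token: store them somewhere safe before unsealing anything.
unseal_node() {
  local node="$1"; shift
  for key in "$@"; do
    ssh "root@${node}" vault operator unseal "$key"
  done
}

# Usage sketch:
#   ssh root@secret-management-staging-1 vault operator init
#   unseal_node secret-management-staging-1 "$KEY1" "$KEY2" "$KEY3"
#   unseal_node secret-management-staging-2 "$KEY1" "$KEY2" "$KEY3"
#   unseal_node secret-management-staging-3 "$KEY1" "$KEY2" "$KEY3"
```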

Full operations example and useful commands

In this example, a raft election occurred once the second node was unsealed, so it's just a matter of luck that the first node is the leader; you could have this instead:

root@secret-management-staging-1:~ # vault operator raft list-peers  
Node                           Address            State       Voter  
----                           -------            -----       -----  
secret-management-staging-1    10.12.1.8:8201     follower    true  
secret-management-staging-3    10.12.1.10:8201    follower    true  
secret-management-staging-2    10.12.1.9:8201     leader      true

To get complete insight into the cluster status, use the API (this endpoint was introduced in version 1.10, which was not installed at the time of testing, so the output below is a sample from the documentation):

root@secret-management-staging-1:~ # curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request GET $VAULT_ADDR/v1/sys/ha-status
{
  "Nodes": [
    {
      "hostname": "node1",
      "api_address": "http://10.0.0.2:8200",
      "cluster_address": "https://10.0.0.2:8201",
      "active_node": true,
      "last_echo": null
    },
    {
      "hostname": "node2",
      "api_address": "http://10.0.0.3:8200",
      "cluster_address": "https://10.0.0.3:8201",
      "active_node": false,
      "last_echo": "2021-11-29T10:29:09.202235-05:00"
    },
    {
      "hostname": "node3",
      "api_address": "http://10.0.0.4:8200",
      "cluster_address": "https://10.0.0.4:8201",
      "active_node": false,
      "last_echo": "2021-11-29T10:29:07.402548-05:00"
    }
  ]
}
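Given output like the above, the active node can be picked out programmatically; a sketch using only POSIX tools on a condensed inline sample of the same JSON:

```shell
# Find the hostname of the active node in /v1/sys/ha-status output.
# Splitting on "{" puts each node object on its own line; keep the line
# where active_node is true and extract its hostname field.
ha_status='{"Nodes":[{"hostname":"node1","active_node":true},{"hostname":"node2","active_node":false}]}'
printf '%s' "$ha_status" \
  | tr '{' '\n' \
  | grep '"active_node": *true' \
  | sed -n 's/.*"hostname": *"\([^"]*\)".*/\1/p'   # prints: node1
```

With jq installed, `jq -r '.Nodes[] | select(.active_node) | .hostname'` achieves the same in one step.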

Upgrade procedure

non-HA installation

  1. Increment hashicorpvault_version to the desired version
  2. Run the role
  3. Unseal the vault

HA installation

The first step is to increment the hashicorpvault_version variable to the desired version for the group, then:

On each standby node

  1. Run the role on the standby node (use ansible-playbook option -l, --limit)
  2. Unseal the standby node
  3. Verify vault status shows correct Version, and HA Mode is standby
  4. Review the node's logs to ensure successful startup and unseal

At this point all standby nodes will be updated and ready to take over. The update will not be complete until one of the updated standby nodes takes over active duty. To do this:

On the active node

  1. Run the role on the remaining (active) node (use ansible-playbook option -l, --limit)
  2. Unseal the node
  3. Verify vault status shows correct Version and HA Mode is standby
  4. Review the node's logs to ensure successful startup and unseal

Internal update tasks will happen after one of the updated standby nodes takes over active duty.
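The version/HA-mode check in step 3 can also be done remotely through the unauthenticated health endpoint; a sketch assuming jq is installed (the function name is illustrative):

```shell
# Print "<version> <mode>" for a node. /v1/sys/health requires no token;
# standbyok=true makes standby nodes answer 200 instead of 429.
check_node() {
  curl -s "https://$1:8200/v1/sys/health?standbyok=true" \
    | jq -r '"\(.version) \(if .standby then "standby" else "active" end)"'
}

# Usage sketch:
#   check_node secret-management-staging-2
```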

Backup and restore procedure

Snapshot

The leader's raft storage is the source of truth for the cluster, so always take the snapshot from the leader.

  1. Connect to a cluster node
  2. Check that the current node is the leader using the API:
    root@secret-management-staging-1:~ # curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request GET $VAULT_ADDR/v1/sys/leader | jq .is_self
    true
    
    or with
    root@secret-management-staging-1:~ # vault operator raft list-peers
    Node                           Address            State       Voter
    ----                           -------            -----       -----
    secret-management-staging-1    10.12.1.8:8201     leader      true
    secret-management-staging-3    10.12.1.10:8201    follower    true
    secret-management-staging-2    10.12.1.9:8201     follower    true
    
  3. Perform the snapshot from the leader with:
    root@secret-management-staging-1:~ # vault operator raft snapshot save /tmp/test.snap
    
    or from anywhere using the API, making sure to query the leader:
    root@secret-management-staging-2:~ # export LEADER_ADDR="https://secret-management-staging-1.cosium.com:8200"
    root@secret-management-staging-2:~ # curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request GET $LEADER_ADDR/v1/sys/storage/raft/snapshot > test.snap
    

It is pointless to compress the snapshot (with e.g. zstd) as the data is encrypted.

Restore

From a snapshot of the same cluster

Copy your vault raft snapshot file onto the leader node and run the below command, replacing the filename with that of your snapshot file.

vault operator raft snapshot restore test.snap

or from anywhere with:

export LEADER_ADDR="https://secret-management-staging-1.cosium.com:8200"
curl -s --header "X-Vault-Token: $VAULT_TOKEN" --data-binary @test.snap --request POST $LEADER_ADDR/v1/sys/storage/raft/snapshot

From a snapshot of a different cluster onto a new cluster

This procedure assumes that keyholders are available with access to the unseal keys for each cluster, and that you have access to tokens with sufficient privileges on the origin cluster.

  1. Install the new vault cluster with the same version as the source cluster.
  2. Connect to every node except one and stop vault:
    systemctl stop vault.service
    
  3. Initialise and unseal the remaining node, then log in with the new root token that was generated during its initialisation. Note that these credentials are temporary - the original/source unseal keys will be needed following the restore.
  4. Restore the snapshot on the node you just unsealed. Note, the -force option is required here since the keys will not be consistent with the snapshot data as you will be restoring a snapshot from a different cluster:
    vault operator raft snapshot restore -force test.snap
    
    or with:
    curl -s --header "X-Vault-Token: $VAULT_TOKEN" --data-binary @test.snap --request POST $VAULT_ADDR/v1/sys/storage/raft/snapshot-force
    
  5. Unseal each node with the keys of the source cluster

From a snapshot of a different cluster onto an existing cluster

This procedure assumes that keyholders are available with access to the unseal keys for each cluster, and that you have access to tokens with sufficient privileges on both clusters. It is useful when bringing a staging cluster up with data from a production cluster, for example to test an upgrade.

  1. Ensure the source and destination clusters are on the same vault version
  2. Connect to each node, stop vault and remove everything under /opt/vault/data/raft with:
    systemctl stop vault.service
    rm -rf /opt/vault/data/raft/*
    
  3. Start vault on one node, initialise and unseal it, then log in with the new root token that was generated during its initialisation. Note that these credentials are temporary - the original/source unseal keys will be needed following the restore.
  4. Restore the snapshot on the node you just unsealed. Note, the -force option is required here since the keys will not be consistent with the snapshot data as you will be restoring a snapshot from a different cluster:
    vault operator raft snapshot restore -force test.snap
    
    or with:
    curl -s --header "X-Vault-Token: $VAULT_TOKEN" --data-binary @test.snap --request POST $VAULT_ADDR/v1/sys/storage/raft/snapshot-force
    
  5. Unseal each node with the keys of the source cluster

Automated backups

This role enables automated backups of the raft storage if hashicorpvault_backup is set to true. For automated backups to be effective, the following manual steps are necessary:

  1. Create a "snapshot" policy:
    vault policy write snapshot snapshot_policy.hcl
    
    with snapshot_policy.hcl being:
    # file: snapshot_policy.hcl
    path "/sys/storage/raft/snapshot"
    {
      capabilities = ["read"]
    }
    
  2. Then enable approle auth method, create a snapshot role with the policy "snapshot" and generate a secret for this role:
    vault auth enable approle
    vault write auth/approle/role/snapshot token_policies="snapshot"
    vault read auth/approle/role/snapshot/role-id
    vault write -f auth/approle/role/snapshot/secret-id
    
  3. Copy your role and secret ids and permanently set them as environment variables in /root/.bash_profile as VAULT_ROLE_ID and VAULT_SECRET_ID on each node:
    # file: /root/.bash_profile
    export VAULT_ROLE_ID="..."
    export VAULT_SECRET_ID="..."
    

To learn more about the AppRole auth method and why it was chosen, see the Vault docs.
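With the role and secret ids in place, a backup run boils down to an AppRole login followed by a snapshot download; a minimal sketch of such a job (the function name and destination path are illustrative, jq is assumed):

```shell
# Log in with AppRole to obtain a short-lived client token, then stream
# a raft snapshot from this node into a dated file.
backup_snapshot() {
  local token
  token=$(curl -s --request POST \
    --data "{\"role_id\": \"$VAULT_ROLE_ID\", \"secret_id\": \"$VAULT_SECRET_ID\"}" \
    "$VAULT_ADDR/v1/auth/approle/login" | jq -r '.auth.client_token')
  curl -s --header "X-Vault-Token: $token" \
    "$VAULT_ADDR/v1/sys/storage/raft/snapshot" \
    > "/var/backups/vault-$(date +%F).snap"
}
```

As noted in the snapshot section, the snapshot request must reach the leader node.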

Logging

By default, HashiCorp Vault does not enable audit logging. It can only be enabled via the CLI or API once Vault is started and unsealed. Execute the following command to enable audit logging to syslog on the leader node:

vault audit enable syslog tag="vault" local="true"

Explanation: syslog is the audit device type; tag="vault" sets the syslog tag under which audit entries are written, and local="true" marks the audit device as local to this cluster. Since audit device configuration is stored in Vault itself, the command only needs to be run once per cluster.

Outage recovery

Just in case, here are some useful links in case of a cluster outage (lost quorum, ...):