HashiCorp Vault role

Every vault command or API call mentioned below assumes you have previously exported the VAULT_ADDR and VAULT_TOKEN environment variables:

export VAULT_ADDR="..."
export VAULT_TOKEN="..."

or logged in via another auth method like LDAP:

vault token lookup 1>/dev/null || vault login -method=ldap username="..."
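As a sanity check before running any of the commands below, you can make sure VAULT_ADDR has a value (the address in this sketch is a placeholder, adjust it to your cluster):

```shell
# Default VAULT_ADDR when it is not already exported; ":-" keeps any
# value that was set in the environment beforehand.
export VAULT_ADDR="${VAULT_ADDR:-https://secret-management-staging.cosium.com:8200}"
echo "talking to Vault at $VAULT_ADDR"
```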

Variable reference

Mandatory variables

Variable Description Example value
hashicorpvault_version pinned hashicorp vault apt package version 1.11.2-1
hashicorpvault_cluster_name name of the cluster, must match the ansible group name in the case of a cluster secret-management-staging

Default simple variables

Variable Description Default value
hashicorpvault_tls_enable Enable TLS. If enabled, certificates will be pulled from the URL specified with hashicorpvault_tls_remote_cert. True
hashicorpvault_tls_remote_cert URL to pull the certificates from https://pub-auth-certificate.cosium.com
hashicorpvault_listen_address Specifies the address to bind to for listening 127.0.0.1
hashicorpvault_backup Enable backups. Local only if the hashicorpvault_backup_sftp dict is not defined. True
hashicorpvault_backup_sftp Define this dict (keys hashicorpvault_backup_sftp.server and hashicorpvault_backup_sftp.port) to enable remote backups. Undefined
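For reference, a hypothetical group_vars file combining these variables for a staging cluster could look like this (the SFTP server name is illustrative):

```yaml
# file: group_vars/secret-management-staging.yml (hypothetical path)
hashicorpvault_version: "1.11.2-1"
hashicorpvault_cluster_name: "secret-management-staging"
hashicorpvault_tls_enable: true
hashicorpvault_tls_remote_cert: "https://pub-auth-certificate.cosium.com"
hashicorpvault_listen_address: "0.0.0.0"
hashicorpvault_backup: true
# define this dict only if remote backups are wanted
hashicorpvault_backup_sftp:
  server: "backup.cosium.com"
  port: 22
```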

Deployment scenarios

non-HA with raft integrated storage backend

This is the simplest case. Launch this role, initialize, unseal Vault and you are good to go.

Note that the only way to guarantee consistent snapshots is to use raft snapshots; a backup solution will be implemented in a future PR.

HA with raft storage backend

Install procedure

  1. all nodes must have their DNS records set
  2. all nodes must have their certificates ready to be pulled from hashicorpvault_tls_remote_cert
  3. a reverse proxy must act as a load balancer in front of the cluster; you can use the haproxy role with a configuration like:
      - name: "secret-management-staging"
        raw_config: |
          option httpchk GET /v1/sys/health
          http-check expect status 200
          default-server check check-ssl verify none
        server:
          - name: "secret-management-staging-1"
            fqdn: "secret-management-staging-1.cosium.com"
            port: "8200"
          - name: "secret-management-staging-2"
            fqdn: "secret-management-staging-2.cosium.com"
            port: "8200"
          - name: "secret-management-staging-3"
            fqdn: "secret-management-staging-3.cosium.com"
            port: "8200"
    
  4. an inventory group named after hashicorpvault_cluster_name, with all nodes defined:
    [secret-management-staging]
    secret-management-staging-1 ansible_host=secret-management-staging-1.cosium.com
    secret-management-staging-2 ansible_host=secret-management-staging-2.cosium.com
    secret-management-staging-3 ansible_host=secret-management-staging-3.cosium.com
    
  5. Launch the role on all nodes to install and start Vault
  6. Initialize one node with vault operator init and unseal it with vault operator unseal. The unseal keys are valid for the whole cluster; this node becomes the leader
  7. Unseal the other nodes with the unseal keys of the leader. When an uninitialized Vault server starts up it will attempt to join each potential leader that is defined, retrying until successful.
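Steps 6 and 7 above can be sketched as a small shell helper (a sketch only: it assumes root SSH access to the nodes and the vault CLI installed on each; the function name is illustrative):

```shell
# Unseal a node by submitting each unseal key in turn over SSH.
# "vault operator init" on the first node prints the unseal keys and the
# initial root token: store them somewhere safe before unsealing anything.
unseal_node() {
  local node="$1"; shift
  for key in "$@"; do
    ssh "root@${node}" vault operator unseal "$key"
  done
}

# Usage sketch:
#   ssh root@secret-management-staging-1 vault operator init
#   unseal_node secret-management-staging-1 "$KEY1" "$KEY2" "$KEY3"
#   unseal_node secret-management-staging-2 "$KEY1" "$KEY2" "$KEY3"
#   unseal_node secret-management-staging-3 "$KEY1" "$KEY2" "$KEY3"
```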

Full operations example and useful commands

In this example, a raft election occurred once the second node was unsealed, so it's just a matter of luck that the first node is the leader; you could have this instead:

root@secret-management-staging-1:~ # vault operator raft list-peers  
Node                           Address            State       Voter  
----                           -------            -----       -----  
secret-management-staging-1    10.12.1.8:8201     follower    true  
secret-management-staging-3    10.12.1.10:8201    follower    true  
secret-management-staging-2    10.12.1.9:8201     leader      true

To get complete insight into the cluster status, use the API (this endpoint was introduced in version 1.10, which was not installed at the time of testing, so the output below is a sample from the documentation):

root@secret-management-staging-1:~ # curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request GET $VAULT_ADDR/v1/sys/ha-status
{
  "Nodes": [
    {
      "hostname": "node1",
      "api_address": "http://10.0.0.2:8200",
      "cluster_address": "https://10.0.0.2:8201",
      "active_node": true,
      "last_echo": null
    },
    {
      "hostname": "node2",
      "api_address": "http://10.0.0.3:8200",
      "cluster_address": "https://10.0.0.3:8201",
      "active_node": false,
      "last_echo": "2021-11-29T10:29:09.202235-05:00"
    },
    {
      "hostname": "node3",
      "api_address": "http://10.0.0.4:8200",
      "cluster_address": "https://10.0.0.4:8201",
      "active_node": false,
      "last_echo": "2021-11-29T10:29:07.402548-05:00"
    }
  ]
}
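Given output like the above, the active node can be picked out programmatically; a sketch using only POSIX tools on a condensed inline sample of the same JSON:

```shell
# Find the hostname of the active node in /v1/sys/ha-status output.
# Splitting on "{" puts each node object on its own line; keep the line
# where active_node is true and extract its hostname field.
ha_status='{"Nodes":[{"hostname":"node1","active_node":true},{"hostname":"node2","active_node":false}]}'
printf '%s' "$ha_status" \
  | tr '{' '\n' \
  | grep '"active_node": *true' \
  | sed -n 's/.*"hostname": *"\([^"]*\)".*/\1/p'   # prints: node1
```

With jq installed, `jq -r '.Nodes[] | select(.active_node) | .hostname'` achieves the same in one step.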

Upgrade procedure

non-HA installation

  1. Increment hashicorpvault_version to the desired version
  2. Run the role
  3. Unseal the vault

HA installation

The first step is to increment the hashicorpvault_version variable to the desired version for the group, then:

On each standby node

  1. Run the role on the standby node (use ansible-playbook option -l, --limit)
  2. Unseal the standby node
  3. Verify vault status shows correct Version, and HA Mode is standby
  4. Review the node's logs to ensure successful startup and unseal

At this point all standby nodes will be updated and ready to take over. The update will not be complete until one of the updated standby nodes takes over active duty. To do this:

On the active node

  1. Run the role on the remaining (active) node (use ansible-playbook option -l, --limit)
  2. Unseal the node
  3. Verify vault status shows correct Version and HA Mode is standby
  4. Review the node's logs to ensure successful startup and unseal

Internal update tasks will happen after one of the updated standby nodes takes over active duty.
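The version/HA-mode check in step 3 can also be done remotely through the unauthenticated health endpoint; a sketch assuming jq is installed (the function name is illustrative):

```shell
# Print "<version> <mode>" for a node. /v1/sys/health requires no token;
# standbyok=true makes standby nodes answer 200 instead of 429.
check_node() {
  curl -s "https://$1:8200/v1/sys/health?standbyok=true" \
    | jq -r '"\(.version) \(if .standby then "standby" else "active" end)"'
}

# Usage sketch:
#   check_node secret-management-staging-2
```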

Backup and restore procedure

Snapshot

The leader's raft storage is the source of truth for the cluster, so always take the snapshot from the leader.

  1. Connect to a cluster node
  2. Check that the current node is the leader using the API:
    root@secret-management-staging-1:~ # curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request GET $VAULT_ADDR/v1/sys/leader | jq .is_self
    true
    
    or with
    root@secret-management-staging-1:~ # vault operator raft list-peers
    Node                           Address            State       Voter
    ----                           -------            -----       -----
    secret-management-staging-1    10.12.1.8:8201     leader      true
    secret-management-staging-3    10.12.1.10:8201    follower    true
    secret-management-staging-2    10.12.1.9:8201     follower    true
    
  3. Perform the snapshot from the leader with:
    root@secret-management-staging-1:~ # vault operator raft snapshot save /tmp/test.snap
    
    or from anywhere using the API, making sure to query the leader:
    root@secret-management-staging-2:~ # export LEADER_ADDR="https://secret-management-staging-1.cosium.com:8200"
    root@secret-management-staging-2:~ # curl -s --header "X-Vault-Token: $VAULT_TOKEN" --request GET $LEADER_ADDR/v1/sys/storage/raft/snapshot > test.snap
    

It is pointless to compress the snapshot (with e.g. zstd) as the data is encrypted.

Restore

From a snapshot of the same cluster

Copy your vault raft snapshot file onto the leader node and run the below command, replacing the filename with that of your snapshot file.

vault operator raft snapshot restore test.snap

or from anywhere with:

export LEADER_ADDR="https://secret-management-staging-1.cosium.com:8200"
curl -s --header "X-Vault-Token: $VAULT_TOKEN" --data-binary @test.snap --request POST $LEADER_ADDR/v1/sys/storage/raft/snapshot

From a snapshot of a different cluster onto a new cluster

This procedure assumes that keyholders are available with access to the unseal keys for each cluster, and that you have access to tokens with sufficient privileges on the origin cluster.

  1. Install the new vault cluster with the same version as the source cluster.
  2. Connect to every node except one and stop vault:
    systemctl stop vault.service
    
  3. Initialise and unseal the remaining node, then log in with the new root token that was generated during its initialisation. Note that these credentials are temporary - the original/source unseal keys will be needed following the restore.
  4. Restore the snapshot on the node you just unsealed. Note, the -force option is required here since the keys will not be consistent with the snapshot data as you will be restoring a snapshot from a different cluster:
    vault operator raft snapshot restore -force test.snap
    
    or with:
    curl -s --header "X-Vault-Token: $VAULT_TOKEN" --data-binary @test.snap --request POST $VAULT_ADDR/v1/sys/storage/raft/snapshot-force
    
  5. Unseal each node with the keys of the source cluster

From a snapshot of a different cluster onto an existing cluster

This procedure assumes that keyholders are available with access to the unseal keys for each cluster, and that you have access to tokens with sufficient privileges on both clusters. It is useful when bringing a staging cluster up with data from a production cluster, for example to test an upgrade.

  1. Ensure the source and destination clusters are on the same vault version
  2. Connect to each node, stop vault and remove everything under /opt/vault/data/raft with:
    systemctl stop vault.service
    rm -rf /opt/vault/data/raft/*
    
  3. Start vault on one node, initialise and unseal it, then log in with the new root token that was generated during its initialisation. Note that these credentials are temporary - the original/source unseal keys will be needed following the restore.
  4. Restore the snapshot on the node you just unsealed. Note, the -force option is required here since the keys will not be consistent with the snapshot data as you will be restoring a snapshot from a different cluster:
    vault operator raft snapshot restore -force test.snap
    
    or with:
    curl -s --header "X-Vault-Token: $VAULT_TOKEN" --data-binary @test.snap --request POST $VAULT_ADDR/v1/sys/storage/raft/snapshot-force
    
  5. Unseal each node with the keys of the source cluster

Automated backups

This role enables automated backups of the raft storage if hashicorpvault_backup is set to true. For automated backups to be effective, the following manual steps are necessary:

  1. Create a "snapshot" policy:
    vault policy write snapshot snapshot_policy.hcl
    
    with snapshot_policy.hcl being:
    # file: snapshot_policy.hcl
    path "/sys/storage/raft/snapshot"
    {
      capabilities = ["read"]
    }
    
  2. Then enable approle auth method, create a snapshot role with the policy "snapshot" and generate a secret for this role:
    vault auth enable approle
    vault write auth/approle/role/snapshot token_policies="snapshot"
    vault read auth/approle/role/snapshot/role-id
    vault write -f auth/approle/role/snapshot/secret-id
    
  3. Copy your role and secret ids and permanently set them as environment variables in /root/.bash_profile as VAULT_ROLE_ID and VAULT_SECRET_ID on each node:
    # file: /root/.bash_profile
    export VAULT_ROLE_ID="..."
    export VAULT_SECRET_ID="..."
    

To learn more about the AppRole auth method and why it was chosen, see the Vault docs.
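With the role and secret ids in place, a backup run boils down to an AppRole login followed by a snapshot download; a minimal sketch of such a job (the function name and destination path are illustrative, jq is assumed):

```shell
# Log in with AppRole to obtain a short-lived client token, then stream
# a raft snapshot from this node into a dated file.
backup_snapshot() {
  local token
  token=$(curl -s --request POST \
    --data "{\"role_id\": \"$VAULT_ROLE_ID\", \"secret_id\": \"$VAULT_SECRET_ID\"}" \
    "$VAULT_ADDR/v1/auth/approle/login" | jq -r '.auth.client_token')
  curl -s --header "X-Vault-Token: $token" \
    "$VAULT_ADDR/v1/sys/storage/raft/snapshot" \
    > "/var/backups/vault-$(date +%F).snap"
}
```

As noted in the snapshot section, the snapshot request must reach the leader node.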

Logging

By default, HashiCorp Vault does not enable audit logging. It can only be enabled via the CLI or API once Vault is started and unsealed. Execute the following command to enable audit logging to syslog on the leader node:

vault audit enable syslog tag="vault" local="true"

Explanation: syslog is the audit device type; tag="vault" sets the syslog tag under which audit entries are written, and local="true" marks the audit device as local to this cluster. Since audit device configuration is stored in Vault itself, the command only needs to be run once per cluster.

Outage recovery

Just in case, here are some useful links in case of a cluster outage (lost quorum, ...):