Restoring a service from a backup

Location: https://jitsi.enough.community/EngagedWallsRespectCurrently


Bonjour,

This is an online disaster recovery :volcano: exercise to practice restoring a service from a backup. It will happen on a shared tmux session and anyone willing to attend is welcome. However, only one person will get to actually type the commands. If you’re interested, please register your SSH public key :old_key: at https://lab.enough.community/ and let me know your account name.

The session is likely to last around two hours, but most of that time will be spent waiting for provisioning to complete. Plan for something easily interruptible to keep you busy in the meantime: there won’t be someone talking the whole time to entertain the audience :roll_eyes:

An Enough instance will be prepared for the occasion at https://the.re, with a Mattermost instance.

Disaster recovery

  • https://chat.the.re user loic password EafZuGup/Owg1
    after
  • ssh to the virtual machine @ GRA7 with ssh debian@51.91.136.84
  • connect to the shared console with tmux attach
  • read all the instructions before running any command
  • get the name of the chat volume snapshot to restore using the snapshot list command (enough --domain the.re openstack -- volume snapshot list)
  • follow the instructions to restore using three commands instead of just one
    • the domain is the.re (instead of example.com in the documentation)
    • the service is chat (instead of cloud in the documentation)
    • the test domain is testchat.d.enough.community (instead of test.d.enough.community in the documentation)
    • the volume is 2020-09-17-chat-volume (instead of 2020-04-12-cloud-volume in the documentation)
  • https://chat.testchat.d.enough.community user loic password EafZuGup/Owg1
    before
  • useful commands to observe the content of the regions
    • enough --domain the.re openstack -- server list
    • enough --domain testchat.d.enough.community openstack -- server list
    • enough --domain testchat.d.enough.community openstack -- volume list
    • enough --domain the.re openstack -- volume snapshot list
  • because of a bug in the chat role, it is necessary to do the following after restoring the service:
    • enough --domain testchat.d.enough.community ssh chat-host
    • cd /opt/enough/enough.community/chat-host/mattermost
    • docker-compose -f docker-compose-infrastructure.yml down
    • docker-compose -f docker-compose-infrastructure.yml up -d
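
The observation commands listed above can be wrapped in a small loop so that the source and test regions are inspected in one go. This is only a sketch: it uses the same `enough --domain ... openstack --` invocations as the list above, while the `observe` function and the `DRY_RUN` switch are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: list servers, volumes and volume snapshots for each given domain.
# DRY_RUN is a hypothetical switch for this sketch: when set to 1 the
# commands are printed instead of being run.
observe() {
    for domain in "$@"; do
        for what in "server list" "volume list" "volume snapshot list"; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "enough --domain $domain openstack -- $what"
            else
                echo "== $domain: $what"
                # $what is left unquoted on purpose so that it splits
                # into separate arguments for the openstack client
                enough --domain "$domain" openstack -- $what
            fi
        done
    done
}
```

For example, `observe the.re testchat.d.enough.community` shows both regions before and after each restore step.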

@nqb which time / date would work best for you in September ? I’ll check with the other participant. If there is no trivial solution I’ll launch a date poll.

Hi,

If possible, on a Friday afternoon, starting at 2:00pm CEST.

Do you have an estimated duration?

It takes about two hours but requires only occasional interaction, so you should have something else, easily interruptible, to keep you busy. Probably 30 minutes max to get started, explanations included. After that, each of the three operations takes about 10 minutes of work followed by about 15 minutes of waiting for commands to complete. That’s a pessimistic scenario.

I added a date poll in the topic. The other participant will be back from vacation September 3rd and I’ll invite him here.

The date is set to September 18th, 2pm - 6pm. Looking forward to working on this with you both @gm & @nqb :stuck_out_tongue:

The shared virtual machine was prepared.

Preparation

  • Three OpenStack regions @ OVH
    • GRA7 for the VM running tmux
    • SBG5 for the Enough instance
    • DE1 for restoring the backup
  • Manually create a virtual machine @ GRA7
  • ssh to the virtual machine @ GRA7
  • Set up the DNS
cat > ~/.enough/the.re/inventory/host_vars/bind-host/zone.yml <<EOF
---
bind_zone_records: |
     imap 1800 IN CNAME access.mail.gandi.net.
     pop 1800 IN CNAME access.mail.gandi.net.
     smtp 1800 IN CNAME relay.mail.gandi.net.

     @ 1800 IN MX 50 fb.mail.gandi.net.
     @ 1800 IN MX 10 spool.mail.gandi.net.
EOF
cat > ~/.enough/the.re/inventory/group_vars/all/dhcp.yml <<EOF
---
bind_server_ip_for_clients: "{{ hostvars[groups['bind-service-group'][0]]['ansible_facts'][network_primary_interface]['ipv4']['address'] }}"
EOF
  • enough --domain the.re service create bind
  • update the.re glue record @ gandi with the IP of bind-host
  • wait for DNS propagation, otherwise the following steps will fail because obtaining the letsencrypt certificate fails

  • enough --domain the.re service create --host bind-host backup

  • Set up the chat:
cat > ~/.enough/the.re/inventory/host_vars/chat-host.yml <<EOF
---
openstack_volumes:
  - name: chat-volume
    size: 5
encrypted_device_mount_point: /opt
EOF
  • enough --domain the.re service create chat
  • enough --domain the.re ssh chat-host
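
The "wait for propagation" step above can be automated with a polling loop. The sketch below is an assumption, not part of the procedure that was actually used: `wait_for_record` and the `LOOKUP` variable are hypothetical names, and the record to check (name, type, expected value) depends on how the glue record was registered.

```shell
#!/bin/sh
# Sketch: poll DNS until a record returns the expected value.
# LOOKUP is a hypothetical override so the loop can be exercised without
# network access; by default it calls dig.
: "${LOOKUP:=dig +short}"

wait_for_record() {
    name=$1            # record to query (which name to use is an assumption)
    type=$2            # record type, e.g. A or NS
    expected=$3        # value that signals propagation, e.g. the bind-host IP
    attempts=${4:-60}  # how many times to try
    delay=${5:-30}     # seconds between tries
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # succeed as soon as one line of the answer matches exactly
        if $LOOKUP "$name" "$type" | grep -qxF "$expected"; then
            echo "propagated after $i retries"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "gave up after $attempts attempts" >&2
    return 1
}
```

A call could look like `wait_for_record ns1.the.re A 203.0.113.10`, where both the host name and the address are placeholders to replace with the actual name server and the IP of bind-host.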

@nqb @gm you should be able to ssh debian@51.91.136.84, your ssh keys are installed.

Presentation notes:

  • How long does it take ?
    • Presentation & questions: 15 min
    • Launch the first operation: 1 min human time, 15-30 min real time
    • Launch the second operation: 10 min human time, 30-60 min real time
    • Launch the third operation: 10 min
    • Wrap up: 5 min
  • What are we doing ? Recovering, from a backup, a chat service running on the the.re domain (which is dedicated to this session) into a test environment, to verify the sanity of the backup.
  • How is the backup done ?
    • Each VM has an encrypted volume attached
    • The volume is snapshotted daily (highlight the difference between a copy and a snapshot, the concept of consistency groups)
  • The backup is generic: it does not depend on the service, and all services can be restored this way.
  • Enough is based on:
  • Three steps to restore a service in a test environment
    • Create a volume in the target region and copy the content of the snapshot into it
    • Create the service (and all services it depends on, that is bind, icinga and backup) in the target region
    • Substitute the volume of the service that was just created in the target region with the volume of the service being restored
  • Verify all is as it should be
  • Note on services that are inside a VPN