Self hosting Enough, data resilience and Ceph


#1

Hello,

When running on an OpenStack cluster, Enough is protected from data loss by any of the following:

  • a 3+ replica volume
  • daily backups stored in 3+ replica images, which limit data loss to 24 hours
  • hosts 100% reproducible from the Ansible repository

When self-hosting Enough, there is no replicated storage infrastructure. It is the responsibility of the Enough administrator to make daily backups and store them in a place that is unlikely to burn down at the same time as the hardware running Enough.

For administrators who do not have that discipline, it would be nice to rely on a Ceph cluster (or something else, if there is an easier path). It could go like this:

  • Assume Enough is deployed on an Intel NUC with 8 GB RAM + 250 GB SSD (costs ~250 euros).
  • Buy three instead of one (costs ~750 euros).
  • Make a Ceph cluster where each node runs an OSD from a directory + a MON, over a tinc interface.
  • Create a 3-replica pool or a 2+1 erasure-coded pool, and use it to store ~/.enough, which contains all data that cannot be rebuilt from the Ansible repository.
  • Each machine is placed in a different location
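
To put rough numbers on the two pool options above, here is a minimal Python sketch of the capacity arithmetic. It assumes the whole 250 GB SSD of each NUC is given to the OSD, which is an upper bound since the operating system and Enough itself also live on that disk. The function names are made up for illustration.

```python
def replicated_usable_gb(raw_gb, replicas):
    """Usable capacity when every object is stored `replicas` times."""
    return raw_gb / replicas


def erasure_coded_usable_gb(raw_gb, k, m):
    """Usable capacity with k data chunks and m coding chunks per object."""
    return raw_gb * k / (k + m)


NODES = 3
DISK_GB = 250  # assumption: the full SSD is available to the OSD (upper bound)
raw_gb = NODES * DISK_GB

replica3 = replicated_usable_gb(raw_gb, replicas=3)
ec21 = erasure_coded_usable_gb(raw_gb, k=2, m=1)

print(f"raw capacity: {raw_gb} GB")
print(f"3 replicas: {replica3:.0f} GB usable")
print(f"2+1 erasure coding: {ec21:.0f} GB usable")
```

The trade-off is that with 3 replicas every node holds a full copy of the data, while with 2+1 erasure coding no single node does: the data has to be reconstructed from the two surviving nodes, and only one node can be lost.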

In a disaster recovery scenario where the machine running Enough stops working, one of the two remaining machines can be used as a replacement: each of them has a copy of the data.

Although the devil is in the details and it seems too good to be true, I’m confident it actually works and can be made simple to set up and maintain. I spent a significant amount of time thinking about a similar setup back when I was working full time on Ceph, over two years ago. But maybe there are now smarter ways to do the same, using software that did not exist two years ago?

Ideas?


#2

A few years ago a healthy Ceph cluster had nodes in Berlin (DE), Toulouse (FR) and Paris (FR) and a bandwidth varying from 1 MB/s to 10 MB/s. One of the nodes did not have a public IP but it did not matter, as long as the tinc mesh had at least one node with a public IP.

This requirement (at least one public IP) is an annoying limitation. It also means the resilience of the tinc mesh depends on the number of public IPs it contains: the fewer public IPs, the more fragile the mesh.
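
As a back-of-the-envelope illustration of that fragility, here is a toy Python model. It assumes each node with a public IP is unreachable independently with the same probability, and that the mesh cannot be joined when all of them are unreachable at once. Both the independence assumption and the 5% figure are made up for the example; this is not a measurement of tinc.

```python
def mesh_unjoinable_probability(public_nodes, p_down):
    """Probability that every node with a public IP is down at the same time.

    Toy model: failures are assumed independent and equally likely.
    """
    if public_nodes < 0:
        raise ValueError("public_nodes must be >= 0")
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be between 0 and 1")
    return p_down ** public_nodes


P_DOWN = 0.05  # hypothetical: each public node is unreachable 5% of the time

for n in range(0, 4):
    p = mesh_unjoinable_probability(n, P_DOWN)
    print(f"{n} public IP(s): mesh cannot be joined {p:.4%} of the time")
```

Under these assumptions each extra public IP divides the risk by twenty, and with zero public IPs the mesh can never be joined.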

I wonder if it would be possible to combine Tor + tinc to maintain a mesh that is both very resilient (Tor) and reasonably fast (tinc). I did not know about Tor a few years ago and it did not occur to me that it could be used as a convenient discovery mechanism to heal or maintain clusters. It is slow but it is available everywhere, and it makes no difference whether a machine has a public IP or not.