Testing another OpenStack provider

Hello!

Intro

In order to contribute to the Enough project, I need to run OpenStack-based tests. The Enough project provides me with OVH credentials to run these tests.

Why?

I wanted to be able to use another OpenStack provider because the OpenStack Orchestration API provided by OVH isn’t reliable. For example, the following errors were encountered several times:

  • INTERNALERROR> [packages-host]: UPDATE_IN_PROGRESS  Stack UPDATE started
    INTERNALERROR> [packages-host]: UPDATE_FAILED  Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
    INTERNALERROR> <class 'oslo_db.exception.DBError'> (HTTP 500) (Request-ID: req-55b4b67a-10d9-4c26-8af9-0226bb78d8b3)
    
  • openstack.py 34 INFO ERROR: Property error: : resources.instance.properties.flavor: : Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
    openstack.py 34 INFO <class 'sqlalchemy.exc.DBAPIError'> (HTTP 500) (Request-ID: req-5acadfdb-03b0-4c93-8e5f-85b039b2775c)
    

Besides, the OpenStack Orchestration API provided by OVH is quite slow (see the performance sections below).

Related changes

In order to be able to use an OpenStack provider other than OVH, I proposed some patches. This work is almost done: I just created the last merge request.

Which provider?

The current code base requires:

  • an available Orchestration API (Heat, at least heat_template_version: 2016-10-14/newton)
  • a public IPv4 address per virtual machine (a direct IP, not a floating IP)
  • 15 virtual machines to deploy every available service, but only 5 virtual machines to run the tests
  • available flavors:
    • the flavors must come with an attached root disk by default (not an explicit block storage volume)
    • most of the virtual machines use 2 GB of RAM; some hosts/tests require 4 GB or 8 GB
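As a sketch, a minimal Heat template satisfying these requirements could look like the following; the image, flavor, and network names are hypothetical and provider-specific:

```yaml
heat_template_version: 2016-10-14

description: Minimal sketch of a stack matching the requirements above

resources:
  sample-host:
    type: OS::Nova::Server
    properties:
      name: sample-host
      image: "Debian 10"            # hypothetical image name
      flavor: "s2.medium.2"         # hypothetical flavor with a default root disk
      networks:
        - network: "public-network" # hypothetical network providing a direct public IPv4
```

The key points are the flavor carrying its own root disk (no OS::Cinder::Volume resource needed) and the network attaching a direct public IPv4 (no OS::Neutron::FloatingIP resource).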

Ideally, the following characteristics are also available:

  • a private network with an IP per virtual machine
  • on-demand billing: per second or per minute
  • a Debian stable image
  • a location/region on the same continent as the user
  • involvement of the provider in OpenStack development

OpenStack provider list (as of November 2020)

| Provider | Heat API | Public IPv4 | Default disk | Billing | Europe region available | OpenStack commits |
|----------|----------|-------------|--------------|---------|-------------------------|-------------------|
| AURO | :x: | :x: (floating IP) | :ballot_box_with_check: | hour | :x: | |
| Catalystcloud.nz | :ballot_box_with_check: | :x: (floating IP) | :ballot_box_with_check: | hour | :x: | 1576 |
| Citycloud | :ballot_box_with_check: | :x: (floating IP) | :ballot_box_with_check: | minute | :ballot_box_with_check: | 379 |
| Dreamhost | :x: | | | hour | :x: | |
| Elastx (details) | :ballot_box_with_check: | :x: (floating IP) | :ballot_box_with_check: | minute | :ballot_box_with_check: | |
| Fuga | :ballot_box_with_check: | :ballot_box_with_check: | :ballot_box_with_check: | minute | :ballot_box_with_check: | |
| Irideos | | :x: (floating IP) | | | :ballot_box_with_check: | |
| iweb | :ballot_box_with_check: | | | hour | :ballot_box_with_check: | 179 |
| Limestone networks | :ballot_box_with_check: | :x: (IPv6 only) | :ballot_box_with_check: | minute | :x: | 484 |
| Open Telekom Cloud (details) | :ballot_box_with_check: | :x: (floating IP) | :ballot_box_with_check: | seconds | :ballot_box_with_check: | 90 |
| Orange (details) | :ballot_box_with_check: | :x: (floating IP) | :ballot_box_with_check: | minute | :ballot_box_with_check: | 1242 |
| OVH (details) | :ballot_box_with_check: | :ballot_box_with_check: | :ballot_box_with_check: | minute | :ballot_box_with_check: | 465 |
| Vexxhost (details) | :ballot_box_with_check: | :ballot_box_with_check: | :x: | hour | :x: | 1759 |
| Zetta | :ballot_box_with_check: | :x: (floating IP) | :x: | hour | :ballot_box_with_check: | 5 |

Irideos: I contacted them in order to get more information but they didn’t answer.

Blank fields mean that I don’t know the value; if you have the information, let me know!

My own manual tests

Besides OVH, I tested two providers: Open Telekom Cloud and Fuga.

Open Telekom Cloud

This provider uses FusionSphere (Huawei’s commercial OpenStack release). The web interface is a bit disappointing compared to the Horizon web interface: it doesn’t provide a link to an openrc.sh or clouds.yaml file.
I didn’t encounter any technical issue with this provider: no stuck stack, no weird/unexpected API error.
By default, the quotas are: 10 elastic cloud instances, 40 vCPUs, 163 GB of RAM, 50 disks, 12 TB of disk space, 10 virtual networks, and only 3 floating IPs.

Costs

The standard price model for the elastic instances is based on per-second consumption. I paid 3.19€ for 151 hours of an s3.medium.2 instance (2 GB RAM, 1 vCPU, 20 GB SATA) at 0.021€/hour, 1.66€ for 79 hours of an s2.medium.2 instance (2 GB RAM, 1 vCPU, 20 GB SATA) at 0.021€/hour, and 0.87€ for 19 hours of an s2.medium.4 instance (4 GB RAM, 1 vCPU, 20 GB SATA) at 0.046€/hour.
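The hourly rates above can be double-checked from the invoiced totals (a quick sanity check, assuming the amounts and durations are exact):

```python
# Sanity check: derive the hourly rate from each invoiced total.
invoices = [
    ("s3.medium.2", 3.19, 151),  # flavor, total (€), hours used
    ("s2.medium.2", 1.66, 79),
    ("s2.medium.4", 0.87, 19),
]

for flavor, total, hours in invoices:
    print(f"{flavor}: {total / hours:.3f}€/hour")
# → s3.medium.2: 0.021€/hour
# → s2.medium.2: 0.021€/hour
# → s2.medium.4: 0.046€/hour
```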

Performance

On average (12 runs), tests/run-tests.sh tox -e icinga -- --enough-no-tests --enough-no-destroy playbooks/icinga/tests takes 9 minutes 30 seconds. Using the OVH provider, the same command takes 15 minutes 34 seconds (8 runs).

Setback

This provider doesn’t support direct IPs, which are a requirement for Enough: that’s why I switched to Fuga.

clouds.yaml
---
clouds:
  production:
    auth:
      auth_url: "https://iam.eu-de.otc.t-systems.com/v3"
      project_name: "eu-de" # optional
      tenant_name: "eu-de" # optional
      user_domain_name: "OTC-EU-DE-000000000010000@@@@@"
      project_id: "8fd12e514a3a4ab5bb9565a67b9a6b03"
      username: "######## OTC-EU-DE-000000000010000@@@@@"
      password: "XXXXX"
    identity_api_version: 3
    interface: public
    endpoint_type: public
    volume_api_version: 2
    image_api_version: 2

Fuga

By default, the quotas are: 20 instances, 50 vCPUs, 20 GB of RAM, 10 volumes, and 12 TB of disk space.

Costs

I paid 0.6€ for 27 hours of an s3.small instance (2 GB RAM, 1 vCPU, 50 GB SSD) at 0.02232€/hour.

Performance

On average (8 runs), tests/run-tests.sh tox -e icinga -- --enough-no-tests --enough-no-destroy playbooks/icinga/tests takes 8 minutes 34 seconds. Using the OVH provider, the same command takes 15 minutes 34 seconds (8 runs).
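From the averages reported above, Fuga runs this command almost twice as fast as OVH (a rough ratio, assuming the figures are wall-clock means over the stated runs):

```python
# Convert the reported average durations to seconds and compare them.
def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds

ovh = to_seconds(15, 34)   # OVH average over 8 runs
fuga = to_seconds(8, 34)   # Fuga average over 8 runs

print(f"Fuga runs the tests {ovh / fuga:.2f}x faster")  # → 1.82x
```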

Setbacks

I contacted the support three times:

  • At the beginning, I wasn’t able to create a stack using the provided credentials (the team credentials must be used in order to use the Heat API). I contacted Fuga support through their web interface at 11pm (UTC); they provided a workaround the next day at 8:42am.
  • Twice, in order to delete stacks stuck in DELETE_FAILED state. The first time, I had to follow up: they didn’t acknowledge the issue at first, and it was resolved after I pushed. The second time, I contacted them on a Saturday at 6:30pm; they fixed the issue the following Monday around 1pm.

Heat API errors

The following errors have been encountered (each at least once, but not many more times):

  • INTERNALERROR> 2020-10-12 14:42:02Z [website-host]: UPDATE_IN_PROGRESS  Stack UPDATE started
    INTERNALERROR> 2020-10-12 14:42:11Z [website-host]: UPDATE_FAILED  (sqlalchemy.exc.ResourceClosedError) This Connection is closed (Background on this error at: http://sqlalche.me/e/dbapi)
    
  • INTERNALERROR> 2020-10-15 22:44:41Z [website-host]: UPDATE_IN_PROGRESS  Stack UPDATE started
    INTERNALERROR> 2020-10-15 22:44:45Z [website-host]: UPDATE_FAILED  (pymysql.err.OperationalError) (1213, u'Deadlock: wsrep aborted transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
    
clouds.yml
clouds:
  production:
    auth:
      auth_url: "https://identity.api.ams.fuga.cloud:443/v3"
      user_id: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      password: "YYYYYYYYYYYYYYYYYYYYYYYYY"
      user_domain_id: "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"
      project_domain_id: "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"
      project_id: "WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW"
    region_name: "ams"
    interface: "public"
    identity_api_version: 3

These tests don’t allow me to recommend Fuga for production services yet: more tests would be required. But the service provided is good enough to run tests that require the Heat Orchestration API.

Conclusion

I hope this work will ease contributions from the community.
Don’t hesitate to share your feedback about OpenStack providers.
Since we will soon be able to use a more reliable provider, my next goal is to enable OpenStack integration tests in the CI.

Thanks Misc for the proofreading!


I’m truly impressed by the amount of work and also very happy about the outcome. It is particularly interesting to note that since running tests on OVH takes twice as much time as running tests on Fuga, the cost of running tests on Fuga won’t be higher than running tests on OVH.

  • OVH
    • base price for a 2GB instance is 0.008€/hour
    • tests require X hours to run
    • tests have to be re-run because of OVH failures 25% of the time
    • the normalized hourly price for testing on OVH is therefore 1h * 0.008€ * 1.25 failrate == 0.010€
  • Fuga
    • base price for a 2GB instance is 0.022€/hour
    • tests require X/2 hours to run
    • tests have to be re-run because of Fuga failures 2% of the time
    • the normalized hourly price for testing on Fuga is therefore 1/2h * 0.022€ * 1.02 failrate == 0.011€
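The back-of-the-envelope comparison above can be written down as follows (the X hours cancel out, so the result is a normalized hourly price):

```python
# Normalized price = relative duration * base hourly price * failure overhead.
def normalized_price(relative_duration, hourly_price, failure_rate):
    return relative_duration * hourly_price * (1 + failure_rate)

ovh = normalized_price(1.0, 0.008, 0.25)   # tests take X hours, 25% re-runs
fuga = normalized_price(0.5, 0.022, 0.02)  # tests take X/2 hours, 2% re-runs

print(f"OVH: {ovh:.3f}€, Fuga: {fuga:.3f}€")  # → OVH: 0.010€, Fuga: 0.011€
```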

Of course this is not an accurate number because:

  • The high failure rate of OVH makes it impossible to run the CI: it would not cost a lot of money, but it would require frequent human interventions, which are much more expensive than the OpenStack operating costs.
  • The cost is not a simple function of the test duration because, in the case of OVH, a large part of the delay occurs while the instances are being created, i.e. before billing even starts.

However, it shows that Fuga’s better reliability has a significant impact on the effective cost. The unreliable OVH control plane makes it a lot more expensive than it looks. And when the control plane is too unreliable, it is no longer a question of cost: it is simply impossible to use. You could submit a talk at the next OpenStack summit on that topic!

I agree: production is much more about how stable the resources (instances, network, storage) are in the long run, when the control plane is not used at all, or used very rarely, since Enough only needs the control plane for instance creation, backups, and disaster recovery. In that regard, OVH has proved very stable over the years, with dozens of instances created over two years ago and still running after being rebooted multiple times for kernel updates.

A follow-up on my usage of Fuga.

Some notes:

  • There is a maximum of 5 OpenStack projects per Fuga account. Each project has its own credentials; the credentials from one project don’t allow listing the resources of the other projects.
  • The Fuga Cloud Service Level Agreement doesn’t cover the OpenStack Orchestration service.

Stacks stuck

Since my previous post (18 days ago), I have contacted the Fuga support by email three times:

  1. Friday 13 November at 5pm (UTC+1): I encountered network issues on two virtual machines: random/intermittent connection failures from/to the internet. Both virtual machines were hosted on the same hypervisor; a workaround was to specify another availability zone in the Heat template. The Fuga support answered the following Monday at 10:51am:

    We did have some networking issues where not all new connections could be established. The issue was resolved. We’re currently looking for a permanent fix.

  2. Tuesday 17 November at 10am: I got stacks stuck in DELETE_COMPLETE state in 3 projects. Fuga support deleted these stacks and answered the next day at 2:43pm. They mentioned that when this issue occurs, there is nothing I can do to fix it myself.
  3. Friday 27 November at 3:51pm: I got stuck stacks in each of my OpenStack projects (there is a maximum of 5 projects). The support hasn’t answered yet.

API outage

Besides the “stacks stuck” issue, the Fuga service was unavailable 3 times. Each time, the dashboard was down and every OpenStack API call (not only Orchestration API calls) returned an HTTP 500 error:

  • 26 November: about 10 minutes around 11am
  • 27 November: at least from ~1am to ~4am
  • 28 November: at least from ~2am to ~4am

These outages aren’t mentioned at all on the Fuga status page.
