openstack deployment

Why deploy OpenStack? I don’t know if I can pinpoint what really motivated me to set off on this adventure. A private cloud always sounded appealing to me - maybe because it sounded rare and exotic, or maybe because I heard it was not straightforward to deploy and I wanted a challenge. Either way, I learned A LOT from this deployment and, possibly more importantly, gained even more patience.

This post will document the steps I have taken and the struggles I encountered (primarily struggles) while deploying my own private cloud with OpenStack. If this post seems all over the place, that is because I added to it as I worked on the project, adapting my plans according to how the deployment was going. I will do my best to make it more readable as I go, but no promises that it will make any semblance of sense to anybody but me.

My apologies in advance.

homelab.jpg

I planned to deploy OpenStack on older hardware - which of course isn’t the greatest idea, since issues will more than likely come with it. However, since this is a homelab deployment and I don’t plan on anyone but myself using it, I will try to work through the pain of my poor decisions.

HARDWARE SPECS:

Here are the hardware specs of the machines that I deployed OpenStack on. I initially wanted to deploy everything on bare metal, but issues with the deployment (you’ll see later) made me bite the bullet and use other options.

Server #1 - Dell C6100 (four nodes; specs listed per node)

  • CPU: 2 x Intel Xeon X5660, 6 core 2.80 GHz
  • RAM: 12 x 8GB Samsung DDR3 SDRAM 1333MHz (P/N M393B1K70DH0-CH9)
  • STORAGE: 3 x 500GB Crucial MX500 2.5" SATA III SSD (P/N CT500MX500SSD1)
  • OS: Rocky Linux 9.3

Server #2 - Dell T320

  • CPU: Intel Xeon E5-2470, 8 core 2.30 GHz
  • RAM: 2 x 32GB A-Tech ECC DDR3L SDRAM 1333MHz (P/N AM144277), 4 x 4GB Hynix ECC DDR3 SDRAM 1333MHz (P/N HMT151R7BFR4C-H9)
  • STORAGE: 4 x 6TB Seagate Enterprise 7200RPM 6Gb/s 3.5" SAS HDD (P/N ST6000NM0034), 4 x 4TB Seagate Constellation 7200RPM 6Gb/s 3.5" SAS HDD (P/N ST4000NM23)
  • RAID: H710P RAID Controller, RAID5 configuration
  • OS: Rocky Linux 9.3

Server #3 - ASUS ROG G750JM

  • CPU: Intel Core i7-4710HQ, 4 core 2.50 GHz
  • GPU: NVIDIA GeForce GTX860M
  • RAM: 4 x 8GB Timetec Non-ECC DDR3L RAM 1600MHz (P/N 76TT16NUSL2R8-8GK2)
  • STORAGE: 1TB Seagate Momentus 5400RPM 2.5" SATA III HDD (P/N ST1000LM024 HN-M101MBB), 1TB Samsung 860 Pro 2.5" SATA III SSD (P/N MZ-76P1T0BW)
  • OS: Proxmox Virtual Environment 8.1.3

DELL C6100 PRE-DEPLOYMENT

I bought this baby off eBay for (relatively) cheap; it wasn’t in the greatest condition, but that’s to be expected for the price. It’s loud, power hungry, missing a handle, and dented all over the place, BUT it’s a four node server - perfect for an OpenStack multinode deployment. Since this server had probably not been used for a decade at this point, I had to do some preliminary updates to it. First of all, the BIOS was wildly out of date, so I decided to start there.

First, update each node to the latest BIOS:

  1. Create a USB drive that boots to FreeDOS using Rufus: open Rufus, select the USB drive, select “FreeDOS” from the “Boot Selection” dropdown (double check that you have the right device selected), and hit the “START” button
  2. Download the PEC6100BIOS018100.exe file from Dell’s website, run it, and unzip the contents to a place you will remember - then copy the unzipped contents onto the USB drive you formatted with Rufus
  3. Once the node boots to FreeDOS, navigate into the folder and run the “FBIOS.bat” file to flash the latest BIOS and wait
        cd PEC6100BIOS018100
        FBIOS.bat        
  4. Wash, rinse, repeat for each node

I found this was the easiest way to flash the BIOS without having to install an OS.

DELL T320 PRE-DEPLOYMENT

Prior to this, the T320 was running VMware ESXi 7.0 - my first foray into virtualization. I have another server, a Dell R620, running Proxmox, so this would be a good exercise in transitioning from one hypervisor to another. The process was actually easier than I expected, and was more time consuming than anything else. In ESXi, power off the VM, download its .vmdk, and convert the file to .qcow2 with the command:

qemu-img convert -p -f vmdk -O qcow2 xyz.vmdk xyz.qcow2
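
If the converted disk is headed for a Proxmox host (like my R620), it can then be attached to a VM with qm importdisk - a rough example, where the VM ID and storage name are placeholders rather than my actual values:

    # "100" is a placeholder VM ID, "local-lvm" a placeholder storage name
    $ qm importdisk 100 xyz.qcow2 local-lvm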

ASUS ROG G750JM PRE-DEPLOYMENT

I had not initially planned on using this machine, but since I had issues with MariaDB, I decided to add it to the mix. This was the very first laptop I ever bought, and I used it throughout my high school and college career. Yes, I somehow packed this behemoth into a backpack and walked around with it - I’m still not quite sure why. The battery life on this thing is abysmal, it’s heavy, and the keyboard no longer works. However, its hardware isn’t awful and I have fond memories of the thing, being my first computer and all; I couldn’t just throw it out. Throw Proxmox on it, it’ll be fine. It actually performs quite well running Proxmox, and I’ve had no issues with it thus far.

INITIAL HARDWARE ISSUES

The C6100 I bought was used off eBay and had dents in the front chassis. This was particularly evident in the drive bay for Node 4, where I had to hammer the chassis to even get all of the drives to fit. I bought a lot of 8GB DDR3 to max out the RAM in the server; however, the cheapest option I could find was a lot of mixed brands - again, probably not the best choice, but it is a homelab, so I have nobody to blame but myself for wanting to save some money. The following is a list of issues and bandaid solutions I put in place:

ISSUE - When attempting to install RHEL, the install process would take much longer than desired. I initially thought it was due to the mixed hard drives that I had used, but no matter the configuration of drives, I was still running into the same issue. I decided to take out all of the drives on this node and attempt to run through the install process with a single drive slot populated at a time. After doing this, it turned out that the top-most drive bay was in fact the issue.

“SOLUTION” - Node 4 has one less 500GB drive due to the long install times I encountered; the top drive bay has a blank caddy instead of a hard drive. I SHOULD take apart the server and hammer out the chassis to really determine whether the physical connection was the issue, but since 500GB isn’t a significant amount of storage, and I wanted to start the installation process after a week of off and on troubleshooting anyway, I thought it was fine to proceed with one less drive on this node.
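
If I ever do tear it apart, a reasonable first check for a flaky drive slot would be the negotiated SATA link speed and the kernel log - a rough sketch, assuming smartmontools is installed and /dev/sda is the suspect drive:

    # Negotiated link speed vs. what the drive supports, plus overall health
    $ sudo smartctl -a /dev/sda | grep -i "sata version"
    $ sudo smartctl -H /dev/sda
    # Look for ATA link resets or bus errors that point at a bad connection
    $ sudo dmesg | grep -iE "ata[0-9]+|link|reset"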

ISSUE - The Kingston 8GB RAM sticks were underclocking to 800 MHz instead of the desired 1333 MHz. No matter the configuration - even without mixing brands, or with only a single stick installed - the Kingston RAM still underclocked; more than likely this RAM is unsupported on this server.

“SOLUTION” - I instead used the 4GB Hynix RAM that came preinstalled in the C6100; since Node 4 was already having the hard drive issue, I decided to have this node also be the one with the least amount of RAM. If I had to do this again, I would shop around a bit more for a lot of RAM that was NOT mixed brand (and most likely not Kingston branded), but this was a good lesson that not all RAM is compatible, even if it seems like the correct type.
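
For what it’s worth, a quick way to confirm what speed the DIMMs are actually running at (assuming dmidecode is available) is:

    # Shows the rated speed and the configured speed for each populated DIMM slot
    $ sudo dmidecode -t memory | grep -i "speed"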

DEPLOYING OPENSTACK MANUALLY:

IMG_20230504_062846_994.jpg

Before I had much of a plan, I thought about deploying OpenStack manually. This was not the best idea for a multinode deployment, considering I would have to replicate MULTIPLE steps on MULTIPLE nodes and hope that I didn’t make any mistakes along the way. I do think that the manual installation process gets you more intimately involved with the different components of OpenStack, and it is a great way to learn the moving parts and pieces. If I had decided to do an all-in-one deployment, I would definitely consider this deployment method further; however, I can’t trust myself to not make any mistakes four times in a row.

Yet in my ignorance, I trudged on.

Install each node with RHEL 8

Because Red Hat provides free licenses for up to a certain number of nodes, and since I’m not sure if Rocky Linux or Alma Linux is the best CentOS replacement yet, I decided to go with RHEL. RHEL 9 seemed too new, and older, hopefully more stable, versions of OpenStack provide instructions for RHEL 8, so I decided to go that route. Yoga was a couple of iterations behind the most recent version of OpenStack, and I thought it wasn’t an awful idea to deploy a couple of versions back to ensure that the version I was deploying was stable. Plus, if I got OpenStack going, I could practice upgrading versions to Zed and maybe even Antelope.

But of course, I ran into more problems:

ISSUE - During the RHEL 8 install from the DVD USB (instead of the minimal image, which uses the RHEL CDN), the installer hangs at “Preparing transaction from installation source”.

“SOLUTION” - Redo the installation process, and before you hit the “Begin Installation” button, disconnect any ethernet cables from the node; the installation will then proceed as desired.

Uhh, is there something wrong with my networking? Or is it something else? I have this connected to a managed switch, but I have yet to provision it to do anything exotic. No VLANs, no trunk ports, no LAGs. It’s essentially a dumb switch, so why is this not working? Ah well, slap on another bandaid.

ISSUE - Following the Yoga install guide on the OpenStack website, I ran into issues when attempting to deploy the Glance service. Every time I tried to upload an image, the Keystone middleware component would give me an error about the token I was using not being authorized. I read through the guide multiple times, double/triple/quadruple-checked my config files for both Keystone and Glance, made sure I was using the correct admin openrc environment configurations, and yet could not find anything that stood out to me.
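
For anyone hitting the same wall, my sanity checks at this point looked roughly like this (the cirros image file name is just an example):

    $ . admin-openrc
    # Confirm Keystone will actually issue a token for the admin credentials
    $ openstack token issue
    # Then retry the upload that kept failing with the unauthorized-token error
    $ openstack image create --disk-format qcow2 --container-format bare \
        --file cirros-0.5.2-x86_64-disk.img --public cirros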

The fact that I was running into issues SO EARLY into the deployment was a bit disheartening. Having just started on the journey, I decided that other deployment methods might suit a beginner better.

Alternative deployment methods

RDO (the RHEL OpenStack repository) supports different deployment methods, such as TripleO and Packstack. TripleO requires different components: an undercloud, overcloud compute nodes, and an overcloud controller. The undercloud is hardware intensive, especially requiring SAS or SATA SSD drives; since my nodes had spinning SATA drives, I thought it was a good idea to try a different deployment method. If needed, I can try this deployment method again with SSDs, but 2.5" to 3.5" drive adapters would be needed, as well as 12 SSDs.

TripleO deployment requires IPMI connectivity; the IPMI interfaces on the C6100 are largely out of date (and require outdated Java environments to view the interface). The current release of TripleO requires RHEL 9, so since I have to start over with a new OS anyway, I might as well go with Rocky 9 instead.

OpenStack-Ansible seems like a good option, but I have little to no experience with Ansible. It seems more involved on the Linux networking side of things, so if I want to gain more experience with that, OpenStack-Ansible seems like a great option. I’ll keep this option in my back pocket in case I can’t find something a bit more my speed, but it seems like a solid deployment method for the hardware that I have.

Kolla Ansible deploys OpenStack via containers, and seems to be the most “hands-off” approach to deployment. Just alter some config files, point the playbooks to the correct hosts, and off you go. Despite its name, little to no experience with Ansible is required, and it would give me some experience working with containers. This seems like a good way to practice an “enterprise” deployment method, as well as an introduction to Ansible without getting my hands too dirty.

Eager to get a win, I went with what I thought would be the easiest option.

KOLLA ANSIBLE DEPLOYMENT:

I attempted the Kolla Ansible deployment method with OpenStack Zed. The hope here was that Zed is a little more time tested than Antelope and shouldn’t have as many open issues. If I ran into issues with Zed as well, I could try Yoga (although I found out that I would have to use CentOS to try Yoga, as there is no official Rocky implementation… Zed it is). I installed my nodes with Rocky 9 and planned my deployment:

Components:

    C6100 Node 1 - Controller01, Network01, Compute01, Storage01
    C6100 Node 2 - Compute02, Monitoring01
    C6100 Node 3 - Controller02, Storage02, Compute03
    C6100 Node 4 - Controller03, Network02, Monitoring02
    T320 - Block Storage, Storage03

Each service would have HA backups across multiple nodes, except for block storage. Not really sure if I should keep ALL my storage on only the T320 either, but I’ll try this and see how it goes.

Kolla Ansible multinode syntax:

NODE ansible_ssh_user=USER ansible_become=True ansible_private_key_file=~/.ssh/id_rsa
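
Mapping the layout above onto the multinode inventory groups ends up looking roughly like this - the hostnames are placeholders for however the nodes actually resolve, and the connection variables only need to be set where a host first appears:

    [control]
    node1 ansible_ssh_user=harthan ansible_become=True ansible_private_key_file=~/.ssh/id_rsa
    node3 ansible_ssh_user=harthan ansible_become=True ansible_private_key_file=~/.ssh/id_rsa
    node4 ansible_ssh_user=harthan ansible_become=True ansible_private_key_file=~/.ssh/id_rsa

    [network]
    node1
    node4

    [compute]
    node1
    node2 ansible_ssh_user=harthan ansible_become=True ansible_private_key_file=~/.ssh/id_rsa
    node3

    [monitoring]
    node2
    node4

    [storage]
    node1
    node3
    t320 ansible_ssh_user=harthan ansible_become=True ansible_private_key_file=~/.ssh/id_rsa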

Things to do for Ansible:

Having not done much with Ansible before, I did some research and found that the user running Ansible needs access to each machine. The user should be able to run “become” commands (sudo) without having to enter a password on my end, because that would get tiring and messy. To do so:

  1. Ensure that the user can run sudo commands without entering a password:
        # visudo

        ~
        # Passwordless sudo for user, required for Ansible
        USER ALL=(ALL) NOPASSWD:ALL
        ~
  2. Configure networking:
        $ nmtui

        Ensure that one network interface has IPv4 configuration disabled, 
        while still being active, and the other has a manual IPv4 address.
  3. Ensure that the deployment host can reach each node using ssh:
        $ rm -f /home/harthan/.ssh/known_hosts
        $ rm -f /home/harthan/.ssh/id_rsa
        $ rm -f /home/harthan/.ssh/id_rsa.pub

        $ ssh-keygen -t rsa

        $ ssh-copy-id harthan@node1
        $ ssh-copy-id harthan@node2
        ...
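
For reference, the relevant globals.yml settings and the deploy commands looked roughly like the following - the VIP address and interface names here are placeholders rather than my exact values (the interface that is active but has no IPv4 address is the one neutron_external_interface should point at):

    # /etc/kolla/globals.yml (excerpt)
    kolla_base_distro: "rocky"
    openstack_release: "zed"
    kolla_internal_vip_address: "192.168.1.250"
    network_interface: "eno1"
    neutron_external_interface: "eno2"

    # Run from the deployment host against the multinode inventory
    $ kolla-ansible -i ./multinode bootstrap-servers
    $ kolla-ansible -i ./multinode prechecks
    $ kolla-ansible -i ./multinode deploy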

The playbooks ran successfully, but for whatever reason, OpenStack would not operate correctly:

ISSUE - A 413 Request Entity Too Large error appeared when trying to upload images via the Horizon UI; after rebooting the nodes, I could not log in to the Horizon interface at all, getting a 504 Gateway Timeout error. The MariaDB docker containers would also randomly restart on each controller node, which is a possible cause of the 504 error. I asked a coworker, and he said it could possibly be an issue with MariaDB not starting fast enough?

“POSSIBLE SOLUTIONS” -

  1. Get SSDs to replace the HDDs on the C6100
  2. Try deploying an external MariaDB instance

Looking at the status of the docker containers, MariaDB was stuck in a constant boot loop. Not looking too hot there. After doing more research into this at a much later date, the issue was more than likely a permissions problem with the MariaDB docker container, but I regrettably had not considered that at the time.
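
If I were debugging it today, I would start with something like this on one of the controller nodes (assuming the container and data volume are both named mariadb, as Kolla names them):

    # Is the container restarting, and what is it complaining about?
    $ sudo docker ps -a --filter name=mariadb
    $ sudo docker logs --tail 50 mariadb
    # Check ownership of the data volume the container mounts
    $ sudo docker volume inspect mariadb
    $ sudo ls -ln /var/lib/docker/volumes/mariadb/_data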

Back to the drawing board.

HOW I SHOULD HAVE INITIALLY SOLVED THE HARDWARE ISSUES

I decided that I needed to step back and address the “easy” issues - the ones I should have fixed in the first place - before I got too involved with the other problems I was running into. I did the following:

IMG_20230529_120733_940.jpg

  1. Get SSDs and 2.5" to 3.5" drive adapters, so that I can try to see if faster drives will help with the MariaDB timeout issues
  2. Get more 8GB RAM of the same brand so that each node has the same amount of RAM at the same clock rate
  3. Hammer out the chassis that holds the drives for Node 4, so each node can have the same amount of storage

After patiently waiting for everything to come in and getting everything set up and good to go, I decided to try deploying OpenStack using Kolla Ansible once again.

I fired up the T520 that I was using to run the Kolla Ansible playbooks one more time AND… the SSDs didn’t help with the MariaDB issue. Well then.

At this point, I felt I had two options:

  1. Try a different installation method (TripleO, OpenStack-Ansible, etc.)
  2. Try this deployment method again with an external MariaDB server

Feeling stubborn more than anything, I decided to go with option #2. After all, what’s another bandaid at this point?

REVISITING KOLLA ANSIBLE

First, I decided to try deploying OpenStack with an external MariaDB server, since the OpenStack-Ansible multinode deployment seemed a bit out of my league. I will have to revisit the AiO deployment method for OpenStack-Ansible to get a better idea of the architecture, and then look into its multinode deployment if this doesn’t work. I created a Rocky 9 VM on Proxmox, on my old trusty G750JM, to serve as my external MariaDB server.

Components:

    C6100 Node 1 - Controller01, Network01, Compute01, Storage01
    C6100 Node 2 - Compute02, Monitoring01
    C6100 Node 3 - Controller02, Storage02, Compute03
    C6100 Node 4 - Controller03, Network02, Monitoring02
    T320 - Block Storage, Storage03
    G750JM - External MariaDB

To get everything set up correctly, I did the following:

  1. Grant access to a user from a remote system by creating the user and granting it the respective privileges:
    $ mysql -u root -p

    MariaDB [(none)]> CREATE USER 'root'@'host.domain' IDENTIFIED BY 'password';
    MariaDB [(none)]> GRANT ALL on *.* to 'root'@'host.domain' IDENTIFIED BY 
    'password' WITH GRANT OPTION;

    MariaDB [(none)]> FLUSH PRIVILEGES;
    MariaDB [(none)]> EXIT;
  2. Set the database password in venv/share/kolla-ansible/etc-examples/passwords.yml and then issue:
    $ kolla-genpwd
  3. Rename the NIC on the MariaDB VM so that the Ansible playbook does not complain:
    $ sudo mkdir /etc/systemd/network
    $ sudo vim /etc/systemd/network/70-custom-ifnames.link

    ~ [Match]
    ~ MACAddress="ENTER MAC ADDRESS HERE"
    ~ [Link]
    ~ Name="eno1"
    :wq

    $ sudo chmod 0644 /etc/systemd/network/70-custom-ifnames.link
    $ sudo reboot

After this is done, check the output of “ip addr” to ensure that the name of the interface has been changed, and run “nmtui” to ensure that the correct network device has the correct configuration.

After all of this was done, I tried once again and, to my extreme relief, logged into my working OpenStack instance.

IMG_20230511_062604_301.jpg

MISCELLANEOUS WOES AND THINGS OF NOTE

After poking around for a week or so, I ran into so many weird things that I decided to make quick notes of them so that I could reference them later.

No charge Neutron

Neutron deployment “Too many connections” error fix:

    # vim /etc/my.cnf

    ~ [mysqld]
    ~ max_connections=5000
    ~ max_allowed_packet=256M
    ~ wait_timeout=180
    ~ interactive_timeout=200
    :wq

    # systemctl restart mariadb
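
To confirm the new limits actually took effect after the restart, a quick check along these lines works:

    $ mysql -u root -p -e "SHOW VARIABLES LIKE 'max_connections';"
    $ mysql -u root -p -e "SHOW STATUS LIKE 'Threads_connected';"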

Reaching my quota

Show OpenStack quotas:

    $ openstack quota show --default

    -or-

    $ openstack quota show PROJECT_NAME

Change OpenStack quotas for a project:

    $ openstack quota set --QUOTA_NAME QUOTA_VALUE PROJECT_NAME

    ex:

    $ openstack quota set --instances 50 admin

Note: A quota value of “-1” means an unlimited amount.

The Cinder conundrum

I planned to deploy Cinder on the T320, since it has the most drive capacity as well as a RAID 5 setup. NFS didn’t work at first - every time I tried to create a volume, Cinder would complain that it could not find a weighted backend.

This was user error on the initial deployment - properly configuring the NFS share so that Cinder could actually mount it solved the issue; a sketch of the relevant configuration follows.
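
For reference, a minimal sketch of what the Cinder NFS backend configuration looks like under Kolla Ansible - the address and export path here are placeholders, not my actual T320 export:

    # /etc/kolla/config/nfs_shares on the deployment host
    192.168.1.20:/kolla_nfs

    # /etc/kolla/globals.yml (excerpt)
    enable_cinder: "yes"
    enable_cinder_backend_nfs: "yes"

    # Sanity check that the export is actually visible from the storage hosts
    $ showmount -e 192.168.1.20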

openstack_instances.png

Were all of these things quirks of OpenStack, or did I do something wrong along the way? Looking at the bandaids all the way down, I think it was probably user error, but I was so unsure of everything at that point that I really didn’t want to break anything further. I decided OpenStack probably isn’t for me right now.

THE END?

All in all, this was a fun experiment that was definitely worth the time (and $$$) I put into it. However, I always felt like my OpenStack instance was constantly on the verge of breaking, and whenever I tried to add more services, such as Magnum, I could never quite get the service to work. Keeping the C6100 on for a month raised my electricity bill by $30 alone - probably not the most sustainable option, considering this was just an experiment and all. I would like to come back to this at some point, with more knowledge of networking and systems, possibly trying a different deployment method that doesn’t rely on containers. I think I would benefit from putting more time into research and planning, and the issues that I ran into probably shouldn’t have happened. In the end, this was a good experience that I learned a lot from. I may not have answered all the questions that I had, but it was definitely a good exercise in troubleshooting. I’ll have my private cloud one of these days.