1&1 Mail & Media GmbH: geo-redundant Kubernetes hardware cluster created in just six months
Creating a modern microservices platform within an organically developed infrastructure is no easy task, either in theory or in practice. That such a project can not only succeed, but can also turn the division involved into a kind of role model is demonstrated by a joint project undertaken by 1&1 Mail & Media GmbH and inovex in which a 1&1 infrastructure was deployed on modern Kubernetes clusters in just a few months.
We want to build a state-of-the-art container platform for the Portal sector, one with which we can eventually replace our existing virtualization infrastructure.
Simone Hoefer
Head of Portal Platform Services, 1&1 Mail & Media GmbH
1&1 Mail & Media GmbH’s Portal division is responsible for the infrastructure underlying the company’s end-user webmail services. These include approximately 38 million email accounts located in three geo-redundant data centres.
The infrastructure was previously based on an enterprise-wide VMware cluster deployed and managed by the company’s central IT department and run on dedicated (bare metal) hardware operated by the business unit’s sysadmin teams. Because the VM packages were supplied with fixed resource sizes (CPU, memory, storage, etc.) regardless of actual demand, selective scaling was either impossible or wasted whatever resources went unused. In addition, virtual machines could only be deployed semi-automatically, a process which made automatic scaling impossible and took up a great deal of employee time.
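The cost of fixed package sizes can be illustrated with a little arithmetic. The following is a minimal sketch only: the function name and the figures (a demand of 9 CPU cores, packages of 8 cores) are hypothetical and do not come from the 1&1 environment.

```python
import math


def fixed_package_waste(demand_cores: int, package_cores: int) -> tuple[int, int]:
    """Return (packages needed, idle cores) when capacity is sold only in fixed-size packages."""
    # Demand has to be rounded up to a whole number of packages.
    packages = math.ceil(demand_cores / package_cores)
    # Everything provisioned beyond the actual demand sits idle.
    idle = packages * package_cores - demand_cores
    return packages, idle


# Hypothetical figures: a service needs 9 cores, VMs come in 8-core packages.
packages, idle = fixed_package_waste(demand_cores=9, package_cores=8)
print(packages, idle)  # prints "2 7": two packages, seven cores unused
```

With fine-grained allocation, the same service would simply request the 9 cores it needs.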
The aim was therefore clear: the new infrastructure should be automatable to ensure rapid scaling and flexible enough to scale only those resources actually required. This would relieve the burden on the IT department currently responsible for providing resources and allow direct access by developers.
With these requirements in mind, then, a state-of-the-art solution architecture was outlined in May 2017 during a week-long joint workshop involving 1&1 and inovex. After discussing and evaluating possible alternatives based on the requirements, the group decided to use a container architecture with Kubernetes as an orchestration tool. inovex had successfully completed similar projects in the past and possessed a correspondingly large store of technical knowledge and experience. The significance of the new infrastructure within the company and potential starting points were also identified at this early stage.
In order to ensure the smoothest possible start, a “divide and conquer” strategy was devised, with stateless backend microservices being selected as the starting point. As stateful microservices involve additional technological challenges, applications with storage connections would not be tackled until the second phase.
In addition to the solution architecture, the underlying technical constraints were also clarified upfront. These included determining the compatibility of 1&1’s hardware with the CoreOS Container Linux operating system and identifying required interactions with other internal technical teams and suppliers. Four weeks later, the following solution sketch was developed:
As is customary with microservices platforms, the project was staffed along agile lines with a cross-functional team, in this case comprising inovex and 1&1 staff from a variety of teams and departments. The goal was to develop a proof of concept which could be deployed to production six months later. The plan was initially to run it in parallel with the existing VMs, allowing for risk-free testing of the container infrastructure and its handling of potential load spikes. Not only did the architecture clearly favour an agile approach, but Scrum also suited the corporate perspective: iterative development requires regular comparison of progress with objectives, while small successes in the form of completed sprints helped to convince management that the project was on the right track.
The project’s pioneering role quickly became apparent. The transparent development, open reviews and daily meetings, as well as the active solicitation of input from employees, contributed to the acceptance of the new architecture within the company. This allowed those employees who would end up managing the infrastructure to play an active part in its development.
In particular, the handling of containers involved something of an initial learning curve. Unlike virtual machines, malfunctioning containers are replaced rather than repaired, because they can be redeployed quickly and reproducibly (the "pets vs. cattle" principle). In the spirit of the aforementioned automation, GoCD deployment pipelines were also used. Over time, templates for these were written to facilitate the work of those developers new to the project and to save them having to get to grips with the nuts and bolts of deployment.
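The "replace rather than repair" idea can be sketched as a small reconciliation loop. This is an illustrative model only, not Kubernetes code and not the project's implementation: `Container`, `start_container` and `reconcile` are hypothetical names, and the image name `mail-backend` is invented for the example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Container:
    name: str
    healthy: bool = True


_ids = count(1)  # simple counter so every new container gets a unique name


def start_container(image: str) -> Container:
    """Stand-in for starting a fresh container from an immutable image."""
    return Container(name=f"{image}-{next(_ids)}")


def reconcile(running: list[Container], image: str, desired: int) -> list[Container]:
    """Discard unhealthy containers and start fresh ones until `desired` are running."""
    # No repair attempt: unhealthy containers are simply dropped.
    survivors = [c for c in running if c.healthy][:desired]
    while len(survivors) < desired:
        survivors.append(start_container(image))
    return survivors


pool = [start_container("mail-backend") for _ in range(3)]
pool[1].healthy = False  # simulate one malfunctioning container
pool = reconcile(pool, "mail-backend", desired=3)
print([c.name for c in pool])
# prints ['mail-backend-1', 'mail-backend-3', 'mail-backend-4']
```

The broken container is never inspected or fixed; it disappears and a fresh, identically built one takes its place.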
Modelling using infrastructure as code enables the automatic deployment of new clusters with out-of-the-box environments. This eliminates the need for manual provisioning or operating system setup and allows specialist departments to focus entirely on the development of their respective applications.
Deliberately using methods such as pair programming and regular code reviews also enabled the 1&1 employees to familiarise themselves with the new technologies, while the physical proximity of the development team allowed for quick questions.
To ensure that everything went smoothly, only a small section of the live environment was initially replaced by the new Kubernetes infrastructure, while the rest was kept running in the existing environment (blue-green deployment). Essentially, this means that the virtual machines were not immediately taken offline; instead, they ran in parallel and thus remained available as fallbacks during the rollout phase. Last but not least, internal technical and support channels were on hand to answer questions and maintain confidence in the project.
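The parallel operation described above might look roughly like the following sketch: a configurable share of requests goes to the new platform, and the legacy environment answers whenever the new one fails. It is a simplified model under stated assumptions, not 1&1's actual routing. `make_router`, `new_share` and the two backend functions are hypothetical names, and the random weighted split is merely one possible way to send only part of the traffic to the new side.

```python
import random
from typing import Callable

Handler = Callable[[str], str]


def make_router(new: Handler, old: Handler, new_share: float, rng: random.Random) -> Handler:
    """Send roughly `new_share` of requests to `new`; use `old` otherwise or on failure."""

    def route(request: str) -> str:
        if rng.random() < new_share:
            try:
                return new(request)
            except Exception:
                # Broad catch for illustration only: any failure on the new
                # platform falls through to the legacy environment.
                pass
        return old(request)

    return route


def vm_backend(request: str) -> str:
    return f"vm:{request}"


def k8s_backend(request: str) -> str:
    return f"k8s:{request}"


# Hypothetical starting point: about 10 percent of traffic on the new platform.
route = make_router(k8s_backend, vm_backend, new_share=0.1, rng=random.Random(0))
answers = [route(f"req-{i}") for i in range(1000)]
print(sum(a.startswith("k8s:") for a in answers), "of 1000 requests served by the new platform")
```

Raising `new_share` step by step shifts traffic gradually, and setting it to 1.0 corresponds to switching the fallback systems off.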
A side note: The fallback systems were deactivated on a trial basis with no discernible effect on operations. The new microservices architecture had passed the litmus test.
Despite the tight timeframe, the proof of concept was completed as planned six months after kick-off, in the third quarter of 2017. This allowed the transition to the next phase to take place during the fourth quarter. From December onwards, the project was to be operated exclusively by 1&1 employees, so two months were allowed for the transfer of ownership. During this handover period, it was necessary to ensure that the operations team could autonomously perform updates, diagnose problems, log events, and handle alerts.
Does automation cost jobs?
In automation projects, there is often a fear that jobs will disappear when pipelines take over tasks previously performed by employees. This is, in fact, not necessarily the case, as is evidenced by the joint project carried out by 1&1 and inovex. In this project, automation is used to perform repetitive and frustrating tasks, thereby reducing development time. At the same time, however, it increases the complexity of the system and creates new dependencies. The project therefore requires increased cooperation between the software development and operations departments and fosters the establishment of a DevOps culture.
In small, rapid increments, a microservices architecture was established using Kubernetes to facilitate the automated deployment of resources. The agile approach not only helped with the technological development, but also made the project more transparent and introduced 1&1 employees gradually to the new technologies. Over time, an internal community has emerged to maintain and drive the project.
Since its successful handover, the project has continued to advance. It currently comprises over 800 servers, geo-redundantly distributed across clusters in three locations. These are being further developed by a team of seven to eight members who provide the infrastructure for 80 administrators and over 300 developers. More flexible administration has also made resource use more efficient: by about 20 percent in the reference case (monitoring of the spam scanner). Although there have been occasional outages since the beginning of the beta phase, some involving an entire cluster, the system design has meant that users remained completely unaffected. Since the completion of the beta phase, the system has remained failure-free.
This success has not gone unnoticed within the company. Today, more than 100 services are running on the new infrastructure, which is constantly being further scaled.