December 18, 2018

Improved Service for EAP via Major Infrastructure Upgrade

The team at ECConnect embrace a philosophy of continuous improvement and are always looking for ways to provide a superior product for our customers with enhanced service, reliability and transparency. Over the past 12 months we have been improving the underlying infrastructure and processes for our flagship product - EAP. This article will highlight some of the changes that have been made and how these provide clients with greater performance and monitoring of the platform - for a seamless and reliable service.

The infrastructure changes over the last 12 months have centred on new virtualised hardware, improved processes, and new tools to assist with the deployment of updates or changes to the underlying infrastructure, and for improved monitoring systems.

This article will provide background details, followed by information relating to the changes and the various tools which have been implemented leading to a better service for our clients. The changes and tools broadly fall into the following infrastructure areas:

  • Multi-node Environment,
  • Monitoring,
  • Infrastructure as Code (IaC), and
  • Scheduling.

Background

EAP is ECConnect's flagship Telecommunication Business Support System (BSS) which is provided as a fully hosted and supported SaaS platform. ECConnect manage both the infrastructure and the numerous applications our telecommunication clients use every day to run their business. Since its introduction in 2015, the EAP infrastructure has always operated in a multi-node environment, where every system is distributed among several redundant servers placed on different hardware. The general concept is illustrated below:

Multi-Node Environment

The EAP Service will remain available provided there is one load balancer with one application node (i.e. Application Virtual Machine (VM) node). This design increases service availability, but also increases system performance due to horizontal scaling - which means more machines (VM nodes) can be added into the pool of resources if required.
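The availability rule described above can be expressed in a few lines. This is only an illustrative sketch: the `Node` class, the node names and the `service_available` function are hypothetical and are not part of EAP.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Node:
    """A load balancer or application VM node (hypothetical model)."""
    name: str
    healthy: bool


def service_available(load_balancers: Iterable[Node], app_nodes: Iterable[Node]) -> bool:
    """The service is up while at least one load balancer AND one application node are healthy."""
    return any(n.healthy for n in load_balancers) and any(n.healthy for n in app_nodes)


if __name__ == "__main__":
    lbs = [Node("lb1", True), Node("lb2", False)]
    apps = [Node("app1", False), Node("app2", True)]
    print(service_available(lbs, apps))
```

Adding another healthy entry to `apps` models horizontal scaling: the rule stays the same, but more nodes can fail before the service is affected.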

Over the years ECConnect have investigated and researched several new tools and processes that could help us to manage and monitor this multi-node environment, so that a quicker and more reliable product and experience could be provided. Some of the tools that have been introduced and improved over the last 12 months are MySQL Galera Cluster, Grafana, Chef, Centreon, Kibana, EAP Messenger and Rundeck. These will now be outlined in more detail and the benefits they provide to the EAP SaaS offering will be highlighted.

Galera (Multi-Node Database)

Initially, improving the reliability and performance of the database service in the multi-node environment was the most important aspect of this infrastructure upgrade project. Therefore, MySQL Galera Cluster was implemented. This new database solution has been set up as a 3-node MySQL Galera Cluster with load balancing. These database nodes are also built with 'Chef' (which is discussed in further detail below).

Active-standby load balancers have been configured to distribute database requests among the three nodes, rather than just one, which has produced improved service availability and performance. Also, reliability is greater, because if any database instance fails, the load balancer will detect the problem and the node will be isolated from the production environment. Impact to the system is therefore minimal as the remaining nodes will continue to process requests and share the load.
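To illustrate the behaviour described above, here is a minimal sketch of round-robin request distribution with a failed node being isolated. It is a simplified model, not the load balancer software used in the EAP environment; the `DatabaseBalancer` class and the node names are assumptions made for the example.

```python
from itertools import cycle


class DatabaseBalancer:
    """Round-robin over database nodes, skipping any node marked as down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        # Called when a health check fails: the node is isolated from the pool.
        self.healthy.discard(node)

    def mark_up(self, node):
        # Called when a previously failed node passes its health check again.
        if node in self.nodes:
            self.healthy.add(node)

    def next_node(self):
        if not self.healthy:
            raise RuntimeError("no healthy database nodes")
        # At most one full pass over the ring is needed to find a healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy database nodes")


if __name__ == "__main__":
    lb = DatabaseBalancer(["db1", "db2", "db3"])
    lb.mark_down("db2")
    print([lb.next_node() for _ in range(4)])
```

With one node marked down, requests continue to alternate between the two remaining nodes, which is the "share the load" behaviour the paragraph describes.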

Chef (IaC)

We turned to Chef, and the Infrastructure as Code (IaC) methodology, so that our system administration team could manage and provision the infrastructure through software, rather than using a manual process to configure the discrete hardware devices and operating systems. The idea is that long-term deployment is now provided via a more automated process.

All server configurations are now in Chef Cookbooks and Recipes. The overall Cookbook contains many different Recipes which represent a certain part of a system's configuration, for example, web server installation, monitoring configuration, user creation, etc. Every server is assigned a Role and this Role defines which Cookbooks and Recipes will be applied during the system build.
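The relationship between Roles and Recipes can be pictured as a simple lookup. Chef itself is configured in Ruby, so the Python below is only a conceptual sketch; the role names and recipe names are hypothetical and do not reflect our actual Cookbooks. The `recipe[cookbook::recipe]` strings follow the form Chef uses for run-list entries.

```python
# Hypothetical mapping of server roles to the recipes applied during a build.
ROLES = {
    "web": ["base::users", "monitoring::agent", "webserver::install"],
    "database": ["base::users", "monitoring::agent", "galera::node"],
}


def run_list_for(role: str) -> list:
    """Return the ordered run-list entries for a role."""
    return [f"recipe[{name}]" for name in ROLES[role]]


if __name__ == "__main__":
    for entry in run_list_for("database"):
        print(entry)
```

Note how the hypothetical `base::users` and `monitoring::agent` recipes are shared by both roles, which is the kind of reuse that storing recipes in version control makes straightforward.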

As these Cookbooks and Recipe configurations are now stored in a version control repository, they can easily be reused for many different server roles. This provides us with a dynamic and flexible tool that allows for a more scalable and reliable infrastructure:

  • With Chef Cookbooks and Recipes, the environment (for example, a new server) can be created within minutes instead of days and this process can be regularly repeated, so that scaling up is quicker, more efficient, and more accurate.
  • Chef provides intrinsic documentation of our infrastructure, plus it allows us to quickly and safely make infrastructure modifications. All the changes in the system configuration can be tracked, so there is a complete history of all changes that have been made. As a result, there is less opportunity for human error and complete transparency into the system, leading to a more reliable and robust system for clients, with greater fault-tolerance.

Grafana (Monitoring)

Grafana Dashboard

Grafana is a monitoring tool which provides a view of collected metrics on a dashboard, displaying system or business related metrics visually, e.g. as graphs. Some examples are as follows:

  • Database Dashboard - the number of queries a client's EAP account is making to the database, the size of the database etc. can be visualised.
  • Payments dashboard - for example, the different response timeframes from different payment gateways can be seen.
  • Usage file processing - the speed and performance of file processing for end-user usage records and any usage files which failed to process can be visualised.
  • Web service and web page load times - the response time of web services and web page load times is monitored, to see how well the system is performing.

Grafana has been very helpful for investigation and systems analysis. If any issues occur, the part which is behaving abnormally or failing can be seen and identified easily so that it can be fixed. This also allows for incidents to be resolved pro-actively, leading to increased uptime, and greater stability and reliability in the environment.

Grafana also allows us to monitor the performance of the servers and databases, and by looking at these graphs, system requirements can be predicted and scaling up can be planned and carried out. For example, if there was a sudden influx of end-users for one of our telecommunication clients, monitoring via Grafana would highlight this and deployment of another server and database could be planned and implemented. This is where Chef, described above, comes in.
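As a rough illustration of how a trend in a graph can be turned into a scaling estimate, the sketch below fits a straight line to daily utilisation readings and estimates how many days remain before a threshold is reached. The function name, the sample values and the 80% threshold are assumptions for the example; this is not a Grafana feature or our actual planning method.

```python
def days_until_threshold(samples, threshold):
    """Estimate days until a least-squares trend line reaches `threshold`.

    `samples` are daily readings, oldest first. Returns None if the trend
    is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Measure from the most recent sample (day n - 1); never report a negative value.
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    cpu_percent = [50, 52, 54, 56, 58]  # hypothetical daily averages
    print(days_until_threshold(cpu_percent, threshold=80))
```

A straight-line fit is the simplest possible model; a sudden influx of end-users would show up as a steeper slope and a shorter estimate.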

Centreon (Monitoring)

Moving more into monitoring, when EAP was first introduced as a SaaS, one of the first tools put in place was Centreon. This is our main internal monitoring system, with over 1200 probes which monitor our infrastructure and business related processes.

Within the system there is a role called the 'on-call' person. The 'on-call' person is an internal ECConnect staff member (that varies from week to week), who receives notifications 24/7 about any incidents identified by Centreon. Notifications include documentation explaining how various procedures work and how to resolve or escalate issues.

A few examples of notifications that are sent from Centreon are:

  • Lack of messages sent to end users
  • Clients remaining to be billed after daily billing has finished
  • Service/s not barred or throttled when expected
  • Clients were not emailed their invoice
  • Due date payments processing is incomplete
  • High response time
  • High error rate
  • Plus many typical hardware related checks, such as:
    • Hard disk utilisation
    • Memory usage
    • Load average
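Checks like those listed above typically compare a measured value against warning and critical thresholds. Centreon can run Nagios-compatible check plugins, which conventionally report status as 0 (OK), 1 (WARNING) or 2 (CRITICAL). The sketch below follows that convention for a disk utilisation check; the function name and the 80%/90% thresholds are hypothetical and are not taken from our probe configuration.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_disk_utilisation(used_percent, warn=80.0, crit=90.0):
    """Return a (status_code, message) pair for a disk utilisation reading."""
    if used_percent >= crit:
        return CRITICAL, f"CRITICAL - disk {used_percent:.0f}% used"
    if used_percent >= warn:
        return WARNING, f"WARNING - disk {used_percent:.0f}% used"
    return OK, f"OK - disk {used_percent:.0f}% used"


if __name__ == "__main__":
    status, message = check_disk_utilisation(85.0)
    print(status, message)
```

Business-related probes, such as "clients remaining to be billed", can follow the same shape: measure a count, compare it with a threshold, and return a status and a message.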

From investigating and responding to issues over the years, a solid knowledge base has been built, so that if an issue occurs in the future, the ECConnect team member can find how it was previously resolved, leading to greatly improved response and resolution of issues. The most recent and common incidents can easily be seen, so effort is focused effectively to resolve these issues and system adjustments are made in order to avoid these issues occurring in the future.

In the last 12 months the notification ability within Centreon has been improved. We now have the ability to send SMS and/or email, or alert via a 'text to voice' system to the 'on-call' person. This means issues are more easily visible to our team, and rectified in a more timely manner. The metrics from Centreon are also now fed into Grafana for better analysis.

Through the improvement of notification ability, development of the knowledge base and integration with Grafana, there has been a steady decrease in the number of incidents over time, which leads to a more stable, reliable and robust service for our clients.

Kibana (Monitoring & Visualisation)

Kibana is another analysis and monitoring tool that has been introduced, which allows for visualisation of large amounts of data for trends or anomalies. Logs from different systems are collected, such as data from EAP or the various load balancers within the environments, and these are analysed via Kibana to discover the expected (i.e. the system is healthy) and to uncover the unexpected (i.e. issues that need to be addressed).

This tool works differently from Grafana, which collects numerical data related to some processes and plots it as a graph or another visual tool. Kibana, by contrast, quickly processes high volumes of textual data and displays such data in order to highlight particular inconsistencies within it. Therefore, we can quickly filter through large amounts of data to find out if, and where, there have been any issues.
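The idea of filtering textual log data to find where issues occurred can be shown with a small example. This is not how Kibana works internally; it is a conceptual sketch, and the log line format (`source LEVEL message`) and the source names are assumptions.

```python
import re
from collections import Counter

# Hypothetical log format: "<source> <LEVEL> <message>"
LINE = re.compile(r"^(?P<source>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def error_counts(lines):
    """Count ERROR lines per source so that problem areas stand out."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("source")] += 1
    return counts


if __name__ == "__main__":
    logs = [
        "lb1 INFO request served",
        "eap-app2 ERROR payment gateway timeout",
        "eap-app2 ERROR payment gateway timeout",
        "lb2 ERROR backend unreachable",
    ]
    print(error_counts(logs).most_common())
```

Grouping by source answers the "where" question; grouping the same lines by message text or by time window would answer "what" and "when".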

The major benefit of a tool like this is that many messages or logs can be processed and analysed in a very short period of time, providing greater visibility into the infrastructure, which in turn gives us the opportunity to keep it running optimally.

EAP Messenger (Monitoring & Communication)

We have also developed EAP Messenger, which is a dedicated notification tool allowing us to communicate with clients more efficiently, and to provide more visibility regarding scheduled upgrades and any alerts relating to problems.

At the simple click of a button, any ECConnect support person can quickly notify all designated client representatives of any service problems, which also includes a link to further detailed information explaining the issue. Clients can manage who will receive such notifications.

The status of the various EAP modules is listed, along with an explanation of any issues, including the severity of the issue and the approximate timeframe to resolve it.

EAP Messenger provides the ideal method to communicate any issues or scheduled maintenance and provide follow-up notification once the issue has been resolved, giving clients greater visibility and transparency around supply of service.

Rundeck (Scheduling)

The final piece of the puzzle for ECConnect has been the introduction of Rundeck as a web based scheduling system. Behind the EAP application there are numerous processes that need to run frequently, or at a certain time of the day or week. Some of these are business related, such as credit card expiry reminder scripts, automated billing scripts, scheduled or recurring payment processes, etc., and some are system related, for example database archiving, resetting log files, and code deployment jobs.

When configuring or managing a job, Rundeck enables us to enter a summary, the steps that are taken for execution and the frequency at which the job will run. We can also specify what happens if the job fails, for example, the ECConnect development team is notified so action can be taken. Further to this, these jobs can be integrated with Centreon monitoring probes, which will validate job execution, and notify our 'on-call' person if there have been any issues with the execution of a job.
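The elements of a job described above (a summary, ordered steps, a schedule and a failure action) can be modelled as follows. This is a conceptual sketch only and does not use Rundeck's own job definition format or API; the `Job` class, the job name and the cron-style schedule string are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    """A scheduled job: steps run in order, and a handler is called on failure."""
    name: str
    schedule: str  # hypothetical cron-style expression, stored for illustration only
    steps: List[Callable[[], None]]
    on_failure: Callable[[str, Exception], None]

    def run(self) -> bool:
        for step in self.steps:
            try:
                step()
            except Exception as exc:
                # Stop at the first failing step and notify, e.g. the development team.
                self.on_failure(self.name, exc)
                return False
        return True


if __name__ == "__main__":
    def archive_database():
        raise RuntimeError("archive failed")

    def notify(job_name, exc):
        print(f"job {job_name} failed: {exc}")

    job = Job("db-archive", "0 2 * * *", [archive_database], notify)
    print("success" if job.run() else "failure")
```

The boolean returned by `run` is the kind of result a monitoring probe could validate before alerting the 'on-call' person.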

Rundeck also gives us views and reports, highlighting when various jobs were executed, how long they took and whether they were successful. This is very helpful for investigation, but the biggest benefit is improved visibility and the ability for our whole team (both developers and systems administrators) to access, modify or verify jobs. It is a great company-wide tool that allows us all to manage these jobs, so that the person with the most knowledge in a particular area can attend to any issues. This leads to greater efficiency and transparency on our part, and ultimately a more reliable and robust service for our clients.

Summary

It's been a great journey over the last 12 months deploying these new tools and we are already receiving comments from many of our clients about improved service. The team will continue to investigate improvements and explore new tools and processes so that ECConnect can provide the most trusted and reliable telecommunications management product on the Australian market.

We can streamline your telecommunications business! Just give us a call on 1300 322 666 to discuss your business requirements and organise a demo.