July 9, 2018
Multi-Node Locking to Protect Data Integrity
ECConnect's EAP Platform runs many parallel instances (nodes) to provide stability and high performance. For high volume processes, such as Billing and Call Rating, multiple nodes are used to increase throughput of data and enable the distribution of load without affecting other aspects of the system.
EAP's core architecture allows for certain processes to be locked, so that data can be isolated while it is worked on. However, that feature only works on a node-by-node basis. If two (or more) nodes attempt to process the same data at the same time, they will not see each other's locks.
This can result in double-handling of data or processes, potentially causing problems with data integrity.
The main point of multiple nodes is redundancy and fault tolerance, which means that the nodes should ideally be as independent as possible. Direct communication between nodes is not an ideal way to check whether a process is locked. If the number of nodes increases to match load, or there are network issues, simply checking whether a process is locked could become prohibitively slow.
The data itself cannot be trusted to inform a lock (for example, with a status flag), since the processes that update the data may do so in bulk: before the UPDATE can complete, a parallel SELECT may be performed on a different node and see the old flag.
We needed a way to store and track process locks across independent nodes without invoking the database and without cycling through all available nodes.
The solution turned out to be very simple and somewhat old-school.
Lockfiles have been used in local systems for decades to facilitate inter-process communication. For a distributed multi-node system it wouldn't seem a likely solution, but one of the key aspects of EAP's distributed systems is a shared filesystem.
Using a lockfile on a shared filesystem, we were able to take advantage of the filesystem's own file-state information (if the lockfile already exists, take it as a given that the process is locked; if it doesn't, we can acquire it, or try again shortly) and communicate between nodes without directly accessing any node.
Stored in the lockfile is a timestamp recording when the lock was requested, which allows for a time-to-live (TTL) on locks. The TTL can be set on a case-by-case basis: if we know a process is likely to be long-lived, we can give it a longer TTL than something that should only take a short time.
This solution is very thin. It's almost a mini-microservice, or a process registry. For any given dataset or process type, we store only a key and a timestamp. When a process is finished, the lock is removed from the file.
This can be as fine- or coarse-grained as required, from top-level processes to individual records.
TTL allows for failure recovery, without manual intervention, in the event that a locked process times out or dies for any reason.
Because TTL is derived from the stored timestamp and an implementation-driven period, rather than being hard-coded into the lockfile, it can be overridden by variable parameters, or by processes that have higher priority over the same lock-key.
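That override works because the lockfile holds only the request timestamp; each reader applies its own TTL policy. A minimal, self-contained sketch of the idea (function and variable names are hypothetical):

```python
import time

def lock_expired(requested_at: float, ttl_seconds: float) -> bool:
    """Judge a lock against a caller-supplied TTL.

    The lockfile records only `requested_at`, so a high-priority process
    can apply a shorter TTL and treat the same lock as expired sooner
    than a routine process would.
    """
    return time.time() - requested_at > ttl_seconds

# Example: the same lock, judged under two different policies.
requested = time.time() - 120  # lock requested two minutes ago
routine_view = lock_expired(requested, ttl_seconds=600)  # False: still live
urgent_view = lock_expired(requested, ttl_seconds=60)    # True: reap and take over
```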