Clock Blog
How to keep data synchronised using Node JS
Here at Clock we recently came across the problem, on one of our latest Node JS projects, of trying to keep the same data stored on two different systems in sync with each other.
Now I thought this would be quite a common problem, and that a simple Google search would reveal plenty of design patterns and code snippets to help solve it. I was wrong. After a couple of hours scouring the web I had found nothing which looked even vaguely like a solution.
So, after discussing the problem with our CTO, we spent an hour working out a solution, and this gist is our first draft.
This solution makes one fairly large assumption about your data: every item has some sort of unique identifier that distinguishes it from all of the other items. It also relies fairly heavily on the difference and intersection functions available in Underscore.JS to work its magic.
The way it works is as follows:
First, we get all of our existing data and extract all of the unique IDs into an array. We will refer to this as A. We then do the same for our new data, which we will refer to as B.
To begin with, we want to identify all items which need to be removed. Once we have A and B we can use the Underscore.JS difference function to work out C which is all of the items within our old collection of data that are no longer present within our new collection.
The next step is to work out which items we may need to update. To do this we can use the Underscore.JS intersection function on A and B to determine D which is all of the items that are present in both collections of data.
The last step in determining our sets of data is to work out which items do not exist within our original collection and therefore need creating. To do this we again use the Underscore.JS difference function, but this time on B and D. This gives us E, which is all of the items that exist in the new collection but not in the old one. Because D only contains IDs that are also in A, this gives the same result as taking the difference of B and A.
These three calculations can be expressed as:
C = A - B
D = A ∩ B
E = B - D
Where C is the IDs of all the items that need to be removed, D is the IDs of all the items that may or may not need updating and E is the IDs of all the items that need creating.
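As a minimal sketch of these three calculations, the snippet below uses two small plain-JavaScript helpers that stand in for the Underscore.JS difference and intersection functions, so that it runs without any dependencies. The sample data, the `id` property and the helper names are assumptions made for illustration; they are not taken from the gist.

```javascript
// Plain-JavaScript stand-ins for Underscore's difference and intersection.
function difference(a, b) {
  const exclude = new Set(b);
  return a.filter((id) => !exclude.has(id));
}

function intersection(a, b) {
  const include = new Set(b);
  return a.filter((id) => include.has(id));
}

// Hypothetical sample data: every item carries a unique `id`.
const existingItems = [
  { id: 1, name: 'a' },
  { id: 2, name: 'b' },
  { id: 3, name: 'c' },
];
const newItems = [
  { id: 2, name: 'b' },
  { id: 3, name: 'changed' },
  { id: 4, name: 'd' },
];

// A and B are the arrays of unique IDs described above.
const A = existingItems.map((item) => item.id);
const B = newItems.map((item) => item.id);

const C = difference(A, B);   // IDs to remove
const D = intersection(A, B); // IDs that may need updating
const E = difference(B, D);   // IDs to create

console.log(C, D, E); // [ 1 ] [ 2, 3 ] [ 4 ]
```

With Underscore.JS installed, the two helpers could be swapped for `_.difference` and `_.intersection`, which take the same arrays as arguments.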
The last step in this whole process is to actually update our original data. To do this we just loop through each of our three sets and perform the necessary action. This is very simple for our creation and deletion sets, however, for the update set we are still unsure as to which items have actually changed and are in need of being updated. To determine this, whilst looping through the set we pass our existing data item and our new data item off to a comparator function which will then decide whether or not to do the actual update.
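The loop described above might look something like the following sketch. It is synchronous for brevity and is not the API of the published module: `applyChanges`, the `store` object with its `remove`, `create` and `update` methods, and the `hasChanged` comparator are all hypothetical names chosen for this example. The comparator is assumed to return true when an item needs updating.

```javascript
// Hypothetical helper: applies the three ID sets to a data store.
function applyChanges(existingItems, newItems, sets, hasChanged, store) {
  // Index both collections by ID so each lookup is cheap.
  const existingById = new Map(existingItems.map((item) => [item.id, item]));
  const newById = new Map(newItems.map((item) => [item.id, item]));

  // Deletions and creations need no further checks.
  sets.toRemove.forEach((id) => store.remove(id));
  sets.toCreate.forEach((id) => store.create(newById.get(id)));

  // For the shared IDs, let the comparator decide whether to write.
  const updated = [];
  sets.toCheck.forEach((id) => {
    const newItem = newById.get(id);
    if (hasChanged(existingById.get(id), newItem)) {
      store.update(newItem);
      updated.push(id);
    }
  });
  return updated;
}

// Hypothetical sample data and an in-memory stand-in for a real data store.
const existing = [
  { id: 1, name: 'a' },
  { id: 2, name: 'b' },
  { id: 3, name: 'c' },
];
const incoming = [
  { id: 2, name: 'b' },
  { id: 3, name: 'changed' },
  { id: 4, name: 'd' },
];
const records = new Map(existing.map((item) => [item.id, item]));
const store = {
  remove: (id) => records.delete(id),
  create: (item) => records.set(item.id, item),
  update: (item) => records.set(item.id, item),
};

const updatedIds = applyChanges(
  existing,
  incoming,
  { toRemove: [1], toCheck: [2, 3], toCreate: [4] },
  (oldItem, newItem) => oldItem.name !== newItem.name,
  store
);

console.log(updatedIds); // [ 3 ]
```

Only item 3 is written as an update here, because the comparator reports item 2 as unchanged. That skipped write is where the saving in store operations comes from.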
So that is the theory behind our data synchronisation algorithm.
To put this into practice, I took the draft that we came up with, cleaned it up and made it asynchronous. It is now available on my GitHub page and through NPM.
You can install via NPM using the following command:
npm install data-sync
For more information and detailed instructions on how to use it, take a look at my GitHub Page:
https://github.com/aduncan88/data-sync
What we have developed is far from ideal and was done to overcome a specific problem that we were having on one of our projects. It has been designed from the beginning to be as generic as possible, with the core principle of reducing the number of reads and writes to and from your data store (a database, for example). It does have one disadvantage, however: it holds both the old and new collections of data in memory in their entirety. If you are dealing with very large collections, you may therefore run out of memory.