Wednesday, July 12, 2017

Role of Commit Log in Databases and Distributed Systems

Before jumping to the topic, let's start with a few questions -
  • What if a debit was done successfully from one account, but the DB crashed before the credit to the second account? Quite possible, right? How do relational DBs handle such crashes and ensure that when the DB comes back up the data is in a consistent state? (In other words, how do DBs achieve atomicity and durability?)
  • How do databases like Oracle, MongoDB, and PostgreSQL keep replicas in sync with the master?
  • In a distributed database, how do two components agree on a given update order? Suppose two changes arrive in a different order due to network issues, latency, the asynchronous nature of the system, or some other cause; how can the system know what the update order should be? In Kafka this is known as the ordering guarantee.
  • How does a process remain consistent across different nodes? Or, how does a particular update get consistently applied to different replicas?
  • How do you force multiple nodes in a distributed system to do the same thing? Or, how do you ensure that a process is deterministic?

The answer to all the above questions is - the log!
In the database and systems world it is called a write-ahead log, commit log, or journal (similar to application logs, but used only for programmatic access). Jay Kreps, in his book I Heart Logs, defines it as an append-only sequence of records ordered by time.

[Image: an append-only log, shown as a row of records appended left to right]

Each record (a rectangle in the image above) gets appended to the log from left to right. Each entry is assigned a unique sequential log entry number which acts as its unique key. The records are ordered relative to time, with the leftmost having occurred earliest. So the log records what happened and when, which is quite handy in distributed data systems.
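The append-only structure described above can be sketched in a few lines of Python (a purely illustrative toy; the `Log` class and its method names are hypothetical, not any real database's API):

```python
class Log:
    """A minimal append-only log: records can only be added at the end,
    each gets the next sequential entry number, and reads come back
    in the order the records were written."""

    def __init__(self):
        self.entries = []  # list of (entry_number, record), oldest first

    def append(self, record):
        """Append a record and return its unique sequential entry number."""
        entry_number = len(self.entries)
        self.entries.append((entry_number, record))
        return entry_number

    def read_from(self, entry_number):
        """Read every record at or after entry_number, in log order."""
        return self.entries[entry_number:]


log = Log()
log.append("debit account A by 100")   # entry 0
log.append("credit account B by 100")  # entry 1
print(log.read_from(0))                # full history, in order
```

The key property is that existing entries are never modified in place; readers only ever see new records appended after the ones they have already processed.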

Common usages of the commit log - data integration, enterprise architecture, real-time data processing, and data system design.

How does the commit log help?

Relational databases, NoSQL databases, and distributed systems write the event/update information to a commit log first. The system logs all relevant details before actually applying the changes. Writing to the log file is an atomic, non-distributed operation, so the record is persisted immediately, and the log then serves as the single source of authority for applying those changes to the system. So even if the system crashes or faults, once it recovers it checks the commit log and applies the pending changes.
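This "log first, apply second" recovery idea can be sketched as follows (a simplified model, not how any particular database implements it; the variable and function names are made up for illustration). Here the debit is logged and applied, but the system "crashes" after logging the credit and before applying it, and `recover()` replays the pending entry:

```python
wal = []           # durable write-ahead log: (entry_no, key, value)
store = {}         # the actual data
applied_upto = -1  # highest log entry already applied to the store

def apply_entry(entry_no):
    """Apply one logged change to the actual data."""
    global applied_upto
    _, key, value = wal[entry_no]
    store[key] = value
    applied_upto = entry_no

def write(key, value):
    """Normal path: append to the log first, then change the data."""
    entry_no = len(wal)
    wal.append((entry_no, key, value))  # atomic append, persisted first
    apply_entry(entry_no)

def recover():
    """After a crash: replay entries logged but never applied."""
    for entry_no in range(applied_upto + 1, len(wal)):
        apply_entry(entry_no)

# Transfer 100 between accounts that start at 1000 each.
write("A", 900)                      # debit: logged and applied
wal.append((len(wal), "B", 1100))    # credit: logged... then crash,
                                     # so it is never applied
recover()                            # replays the pending credit
print(store)                         # both sides of the transfer present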

Similarly, the sequence of events/changes that happens on the primary data node is exactly what is needed to keep a remote replica database in sync. The replica node/DB can apply the changes recorded in the commit log to its own local data to stay in sync with the master.
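A minimal sketch of that replication loop (again hypothetical names, not a real replication protocol): the replica remembers how far into the primary's log it has read, and applies any newer entries, in order, to its own local copy:

```python
primary_log = []    # the primary appends its (key, value) changes here
replica = {}        # the replica's local copy of the data
replica_offset = 0  # next log entry the replica has yet to apply

def sync_replica():
    """Apply every primary log entry the replica has not seen yet."""
    global replica_offset
    for key, value in primary_log[replica_offset:]:
        replica[key] = value
    replica_offset = len(primary_log)

# Changes happen on the primary...
primary_log.append(("A", 900))
primary_log.append(("B", 1100))

# ...and the replica catches up by replaying the log, in order.
sync_replica()
print(replica)
```

Because every replica applies the same entries in the same log order, all replicas converge to the same state, which is the deterministic-process property the questions at the top asked about.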

The problem of data integration can also be solved through commit logs: take all of an organisation's data and put it in a centralised log for processing. That's what specialised systems like Kafka do.
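The centralised-log idea can be sketched like this (in the spirit of Kafka but purely illustrative; a real Kafka deployment has topics, partitions, and brokers that this toy omits). Producers append events, and each consumer reads from its own offset, so independent systems process the same ordered stream at their own pace:

```python
central_log = []  # the shared, ordered stream of events
offsets = {}      # consumer name -> next log entry it should read

def produce(event):
    """A producer appends an event to the central log."""
    central_log.append(event)

def consume(consumer):
    """A consumer reads everything it hasn't seen and advances its offset."""
    offset = offsets.get(consumer, 0)
    events = central_log[offset:]
    offsets[consumer] = len(central_log)
    return events

produce("user_signed_up")
produce("order_placed")
print(consume("search-indexer"))  # both events, in order
print(consume("search-indexer"))  # empty: already caught up
produce("payment_received")
print(consume("warehouse"))       # all three events, from the beginning
```

Note that a slow or newly added consumer (like "warehouse" above) simply starts from an earlier offset; the log itself is the single ordered record that every downstream system integrates against.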


Happy logging!
