Thursday, December 21, 2017

PUT vs POST for Modifying a Resource

Debate around PUT vs POST for modifying a resource is quite common; I have had my share of it as well. The debate is not unnecessary, as the difference is quite subtle. One simple line of defence used by many people is: if the update is IDEMPOTENT, use PUT; otherwise use POST. This explanation is correct to a good extent, provided we clearly understand whether a request is truly idempotent or not.

Also, a lot of the content available online adds to the confusion. So, I tried to see what the originators of the REST architectural style themselves say. This post might again be opinionated, or might have missed a few important aspects; I have tried to be as objective as possible. Feel free to post your comments/opinions :)

Updating a Resource

For our understanding, let's assume we are dealing with an account resource which has three attributes: firstName, lastName and status.

Updating the status field:
Advocates of PUT consider the below request to be IDEMPOTENT.

PUT /account/a-123 HTTP/1.1
{
   "status":"disabled"
}

The reality is that the above request is NOT idempotent, as it's updating a partial document. To make it idempotent you need to send all the attributes. So that line of defence is not perfect.

PUT /account/a-123 HTTP/1.1
{
   "firstName":"abc",
   "lastName":"rai",
   "status":"disabled"
}


The article below tells very clearly that if you want to use PUT to update a resource, you must send all the attributes of the resource, whereas you can use POST for either a partial or a full update.

So, you can use POST for either a full or a partial update (at least until PATCH support becomes universal).

What the originator of REST style says

The master himself suggests that we can use POST if we are modifying part of the resource.


My Final Recommendation

Prefer POST if you are doing a partial update of the resource.
If you are doing a full update of the resource and it's IDEMPOTENT, use PUT; otherwise use POST.
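
To make the recommendation concrete, here is a rough sketch of both styles using plain HttpURLConnection; the class name, base URL and payloads are illustrative and assume the account resource discussed above:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AccountClient {

    // Illustrative base URL; replace with your service endpoint.
    private static final String BASE = "http://localhost:8080";

    // Full update: PUT sends the complete representation of the resource.
    static int putFullUpdate(String accountId) throws Exception {
        String body = "{\"firstName\":\"abc\",\"lastName\":\"rai\",\"status\":\"disabled\"}";
        return send("PUT", "/account/" + accountId, body);
    }

    // Partial update: POST sends only the attributes that change.
    static int postPartialUpdate(String accountId) throws Exception {
        String body = "{\"status\":\"disabled\"}";
        return send("POST", "/account/" + accountId, body);
    }

    private static int send(String method, String path, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(BASE + path).openConnection();
        conn.setRequestMethod(method);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}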

Thursday, December 14, 2017

Build Tools in Java World

Build tools in the Java (or JVM) ecosystem have evolved over a period of time. Each successive build tool has tried to solve some of the pains of the previous one. But before going further into the tools, let's start with the basic features of standard build tools.

Dependency Management
Each project requires some external libraries for the build to be successful. These incoming files/jars/libraries are called the dependencies of the project. Managing dependencies in a centralized manner is a de-facto feature of modern build tools. The output artifact of the project also gets published and is then managed by dependency management. Apache Ivy and Maven are the two most popular tools which support dependency management.

Build By Convention
The build script needs to be as simple and compact as possible. Imagine specifying each and every action which needs to be performed during the build (compile all files from the src directory, copy them to the output directory, create the jar file, and so on); this would make the script huge, and managing and evolving it would become a daunting task. So, modern build tools use conventions: by default (and it can be configured as well) the tool knows that source files are in the src directory. This minimizes the number of lines in the build file, which makes build scripts easier to write and manage.
Any standard build tool should have the above two as de-facto features. Below is a list of tools which have them:


  • ANT + IVY (Apache Ivy for dependency management)
  • MAVEN
  • GRADLE

I have listed only the most popular build tools above. ANT doesn't have dependency management by default, but the other two have native support for it. The Java world is basically divided between MAVEN and GRADLE, so I have focused on these two tools below.


Maven vs Gradle

  • MAVEN uses XML to write the build script, whereas GRADLE uses a DSL based on Groovy (one of the JVM languages). GRADLE build scripts tend to be shorter and cleaner than Maven build scripts.
  • A GRADLE build script is written in Groovy (and can also be extended using Java). This definitely gives more flexibility to customize the build process; Groovy is a real programming language (unlike XML). Also, GRADLE doesn't force you to always use convention; it can be overridden.
  • GRADLE has first-class support for multi-project builds, whereas multi-project builds in MAVEN are broken. GRADLE supports dependency management natively using the Apache open source project Ivy (an excellent dependency management tool). Dependency management in GRADLE is better than in MAVEN.
  • MAVEN is quite a popular tool, so it has a wide community and the Java community has been using it for a while; GRADLE, on the other hand, is quite new, so there will be a learning curve for developers.
  • Both are plugin based, and GRADLE being newer, finding a plugin might be more difficult for it. But adoption of GRADLE is growing at a good pace, and Google supports GRADLE for Android. Integration of GRADLE with servers, IDEs and CI tools is not as extensive as that of MAVEN (as of now).


Conclusion

Most of the cons of GRADLE are mainly because it's the new kid on the block. Other than that, everything else about GRADLE looks quite impressive. It scores better on both core features, i.e. dependency management and build by convention. IMO, configuring the build through a programming language is going to be more seamless once we overcome the initial learning curve.
Also, considering we are going down the microservices path, we will have the option and flexibility to experiment with build tools as well (along with language/framework).

References

https://github.com/tkruse/build-bench





Tuesday, December 5, 2017

How AWS structures its Infrastructure

This post talks about how AWS structures its global infrastructure. 


AWS' most basic infrastructure building block is the data center. A single data center houses several thousand servers. AWS core applications are deployed in an N+1 configuration to ensure smooth functioning in the event of a data center failure.

AWS data centers are organized into Availability Zones (AZs). One DC can only be part of one AZ. Each AZ is designed as an independent failure zone for fault isolation. AZs are interconnected with high-speed private links.

Two or more AZs form a Region. As of now (Dec '17) AWS has 16 regions across the globe. Communication between regions uses public infrastructure (i.e. the internet), so use appropriate encryption methods for sensitive data. Data stored in a specific region is not automatically replicated to other regions.

AWS also has 60+ global Edge Locations. Edge locations help lower latency and improve performance for end users; they are used by services like Route 53 and CloudFront.



Guidelines for designing

  • Design your system to survive temporary or prolonged failure of an Availability Zone. This brings resiliency to your system in case of natural disasters or system failures.
  • AWS recommends replicating across AZs for resiliency.
  • When you put data in a specific region, it's your job to move it to other regions if required.
  • AWS products and services are rolled out by region, so you may not see a given service available in your region.
  • Choose your region appropriately to reduce latency for your end users.

Saturday, October 21, 2017

How Kafka Achieves Durability

Durability is a guarantee that once the Kafka broker confirms that data is written, it will be permanent. Databases implement this by writing data to non-volatile storage. Kafka doesn't follow the DB approach!

Short Answer
The short answer is that Kafka doesn't rely on physical storage (i.e. the file system) as the criterion for a message write being complete. It relies on the replicas.

Long Answer
When a message arrives at the broker, it first gets written to the in-memory copy of the leader replica. The broker then has the following things to do before considering the write successful (assume the replication factor is > 1):

1. Persist the message in the file system of the partition leader.
2. Replicate the message to all ISRs (in-sync replicas).

In an ideal scenario, both of the above are important and should happen irrespective of order. But the real question is: when does Kafka consider the message write complete? To answer this, let's try to answer the question below.

If a consumer asks for message 4, which just got persisted on the leader, will the leader return the data? The answer is NO!


It's interesting to note that not all data that exists on the leader is available for clients to read. Clients can read only those messages that were written to all in-sync replicas. The leader knows which messages were replicated to which replica, so until a message is replicated it will not be returned to the client; an attempt to read such messages will result in an empty response.

So, now it's obvious: just writing the message to the leader (including persisting it to the file system) is hardly of any use on its own. Kafka considers a message written only once it has been replicated to all in-sync replicas.
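
On the producer side, this behaviour ties into the acks setting: with acks=all, the send is acknowledged only after all in-sync replicas have the message. Here is a minimal sketch using the Kafka Java client; the broker address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader acknowledges only after the full set of in-sync replicas has the write
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key-4", "message 4"),
                    (metadata, exception) -> {
                        if (exception == null) {
                            System.out.println("Committed at offset " + metadata.offset());
                        } else {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}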


~Happy replication!




Saturday, October 7, 2017

My favourite fizz-buzz problem for Senior Programmers

In this post, I will discuss one of my favourite fizz-buzz problems for senior programmers/engineers.

Find the Kth largest element from a list of 1 million integers.

Or

Find the Kth largest element at any given point in time from a stream of integers, where the count is not known.


This problem is interesting as it has multiple approaches and it checks the fundamentals of algorithms and data structures. Quite often, candidates start by asking questions like: is the list sorted? Or can I sort the list? In that case, I check which sorting algorithm the candidate proposes. This gives me an opportunity to start a conversation around the complexity of the approach (particularly time complexity). Most candidates are quick to point out algorithms (like Quick Sort or Merge Sort) which take O(NlogN) to sort a list. This is the right time to ask why you need to sort the complete array/list if you just need to find the 100th or kth largest/smallest element. Now the conversation usually goes in one of two directions -
  1. Candidates sometimes suggest that sorting is the quicker way to solve this problem, missing the complexity aspect altogether. If someone doesn't even realize that sorting is not the right way to handle this problem, that's a red flag for me.
  2. At times candidates acknowledge the inefficiency of the sorting approach and then start looking for a better one. I suggest that candidates think out loud, which gives me insight into their thought process and how they are approaching it. When I see them not moving ahead, I nudge them: can you optimize the Quick Sort approach? Is there any way to cut the problem size in half in every iteration? Can you use divide and conquer to improve on the O(NlogN) complexity?
This problem can be solved with Quick Select as well as with a heap data structure (the heap approach is sketched below). It also has a brute-force approach (run a loop k times; in each iteration find the maximum number lower than the previous one).
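
For reference, here is a minimal sketch of the heap approach - a min-heap of size k maintained over the input, which works for both the list and the stream variant:

import java.util.PriorityQueue;

public class KthLargest {

    // Returns the kth largest element seen so far; keeps only k elements in memory,
    // so it works for a stream as well as a fixed list. O(N log k) time, O(k) space.
    public static int kthLargest(int[] stream, int k) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>(); // smallest of the top-k sits on top
        for (int value : stream) {
            if (minHeap.size() < k) {
                minHeap.offer(value);
            } else if (value > minHeap.peek()) {
                minHeap.poll();      // evict the smallest of the current top-k
                minHeap.offer(value);
            }
        }
        return minHeap.peek();
    }

    public static void main(String[] args) {
        int[] numbers = {7, 1, 42, 13, 9, 27, 5};
        System.out.println(kthLargest(numbers, 3)); // prints 13
    }
}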


If the candidate doesn't make much progress, I try to simplify the problem: find the 3rd or 2nd largest element. I have seen some senior programmers failing to solve this trivial version as well. That is a clear reject sign for me.

Also, sometimes I don't even ask the candidate to code. I use this problem just to get an idea, and skip the coding part if I see a programmer sitting right across from me :)

-Happy problem solving !



Identifying the Right Node in Couchbase

This post covers how Couchbase achieves data partitioning and replication.

doc = {"key1":"value1", ...} ; doc-id = id

Steps:

  • A hash is calculated based on the key (or document id).
  • The hash returns a value in the range [0, 1023], both inclusive. This is known as the partition id.
    • This number maps the document to one of the vBuckets.
  • The next task is to map the vBucket to a physical node. This is decided by the vBucket map.
    • The map tells which node is the primary for the document and which nodes are the backups. The vBucket map has 1024 entries, one for each vBucket, and each entry is an array: the first value is the primary node and the rest are replica nodes.
  • The server list stores the list of live nodes. So, based on the index from the vBucket map, we get the physical node's IP address (see the sketch below).
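
A highly simplified Java sketch of this lookup is below. The real client libraries use a specific CRC32-based hashing scheme and fetch the vBucket map and server list from the cluster; the hash, map and class here are purely illustrative:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class VBucketLocator {

    private static final int NUM_VBUCKETS = 1024;

    // Illustrative vBucket map: 1024 entries, each an array of server-list indexes
    // where entry[0] is the primary node and the rest are replicas.
    private final int[][] vBucketMap;
    private final String[] serverList;

    public VBucketLocator(int[][] vBucketMap, String[] serverList) {
        this.vBucketMap = vBucketMap;
        this.serverList = serverList;
    }

    // Step 1: hash the document id into a partition (vBucket) id in [0, 1023].
    static int vBucketId(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_VBUCKETS); // simplified; SDKs use a CRC32 variant
    }

    // Steps 2-3: look up the primary node for that vBucket in the map, then in the server list.
    public String primaryNodeFor(String docId) {
        int vb = vBucketId(docId);
        int primaryIndex = vBucketMap[vb][0];
        return serverList[primaryIndex];
    }
}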



Tuesday, August 15, 2017

Couchbase Primary vs Secondary Indexes

Couchbase supports a key-value as well as a JSON-based data model. In the key-value model you don't care about the type of the value. In the JSON model you have the ability to query individual attributes using N1QL queries.


Key-Value Model 

Without Index
A key-value store is schemaless: the object gets mapped to a given key (just like a HashMap or Dictionary). Couchbase is more like a distributed HashMap. The value could be any supported data type (JSON, CSV, or BLOB). You perform any operation using the key or document id. In this case, Couchbase looks up the value corresponding to a given document id; in simple terms, it's just like a key lookup in a HashMap. An index doesn't play any role here.


bucket.get(docId);


With Index
Now what if you want the number of documents in your bucket?

SELECT COUNT(*) FROM `bucket-name`

The above query is going to do a full bucket scan (similar to a full table scan in the SQL world). In the SQL world, an index on the primary key gets created by default, so you can easily perform the above operation. In Couchbase that's not the case: you will have to create an explicit index to run the above query or any other ad-hoc query. So, if you want an index on the key or document id, you create a primary index.

CREATE PRIMARY INDEX index_name ON `bucket-name`


JSON Model (Secondary Indexes)

If you want complete control over your data and queries, the JSON model is going to be your choice. With the above approaches you can't ask for, say, all the objects which have a certain attribute value.

In the JSON-based model, we can query through a SQL-like expressive language named N1QL (pronounced "nickel"). This is a much more flexible model: we can look for documents through the keys contained inside the JSON. Obviously, to optimise lookup/search we can create indexes on those attributes. These indexes are called secondary indexes, or more precisely Global Secondary Indexes (GSI).

CREATE INDEX type_index ON `bucket-name`(type) USING GSI
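
For completeness, here is a hedged sketch of how a key lookup and an N1QL query look from the Java SDK (2.x); the cluster address, bucket name and attribute value are placeholders:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;

public class IndexExample {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("localhost");   // placeholder host
        Bucket bucket = cluster.openBucket("bucket-name");

        // Key-value lookup: no index involved, just a key lookup on the right node.
        JsonDocument doc = bucket.get("doc-id-123");

        // N1QL query on an attribute: can be served by the secondary (GSI) index on `type`.
        N1qlQueryResult result =
                bucket.query(N1qlQuery.simple("SELECT * FROM `bucket-name` WHERE type = 'account'"));
        result.forEach(row -> System.out.println(row));

        cluster.disconnect();
    }
}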



Primary vs Secondary Indexes


  • Primary indexes index all the keys in a given bucket. They are used when a secondary index cannot satisfy a query and a full bucket scan is required.
  • Secondary indexes can index a subset of the items in a given bucket. They are used to make queries targeting a specific subset of fields more efficient.


--- happy learning !

Friday, August 11, 2017

RAM sizing Data Node of Couchbase Cluster

This post talks about finding out how much RAM your Couchbase cluster needs for holding your data (in RAM)!


RAM Calculator 

RAM is one of the most crucial resources to size correctly. Cached documents allow reads to be served at low latency and high throughput. Please note that this does not incorporate the RAM requirements of the host/VM OS and other applications running alongside Couchbase.

Enter the fields below to estimate RAM.

Sample Document (key and value)
This is required because the document content length as well as the ID length impact RAM. Be mindful of the size aspect when deciding your key generation strategy.

Number of Replicas
Couchbase only supports up to 3 replicas, so enter 1, 2 or 3.

% of Data you want in RAM
For best throughput you need to have all your documents in RAM, i.e. 100%. This way any request will be served from RAM and there will be no disk IO. In this field enter only the value, like 80 or 100.

Number of Documents
The number of documents in the cluster. When your application is starting from scratch, begin with a number that reflects the expected load, then evaluate it regularly and adjust your RAM quota if required. So you can start with, say, 10,000 or 1,000,000 documents.

Type of Storage (SSD or HDD)
If the storage is SSD the overhead is 25%, otherwise it's 30%. SSD brings better disk throughput and latency, and helps performance when not all data is in RAM.

Couchbase Version (< 2.1, or 2.1 and higher)
The metadata size per document is 56 bytes for version 2.1 and higher, and 64 bytes for lower versions.

High Water Mark (%)
If you want to use the default value, enter 85. If the amount of RAM used by documents reaches the high water mark (upper threshold), both primary and replica documents are ejected until memory usage reaches the low water mark (lower threshold).


Based on the RAM requirement for the cluster, you can plan how many nodes are required. Another important aspect in deciding the number of data nodes is how you expect your system to behave if 1, 2 or more nodes go down at the same time. In this link, I have discussed the Replication Factor and how it affects your system. So, take your call wisely!
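
The arithmetic behind the calculator roughly follows the formula from the Couchbase sizing documentation. Below is a small Java sketch of that calculation; treat the constants and the formula as an approximation and verify against the official docs for your version:

public class CouchbaseRamSizer {

    // Rough sketch of the Couchbase sizing formula; verify against the official docs.
    public static double clusterRamQuotaBytes(long documentsNum,
                                              int idSizeBytes,
                                              int valueSizeBytes,
                                              int numberOfReplicas,
                                              double workingSetPercentage, // e.g. 1.0 for 100%
                                              boolean ssd,
                                              double highWaterMark,        // e.g. 0.85
                                              int metadataPerDocument) {   // 56 for 2.1+, 64 otherwise
        int copies = 1 + numberOfReplicas;
        double totalMetadata = documentsNum * (double) (metadataPerDocument + idSizeBytes) * copies;
        double totalDataset  = documentsNum * (double) valueSizeBytes * copies;
        double workingSet    = totalDataset * workingSetPercentage;
        double headroom      = ssd ? 0.25 : 0.30;                 // SSD 25%, HDD 30%
        return (totalMetadata + workingSet) * (1 + headroom) / highWaterMark;
    }

    public static void main(String[] args) {
        // Example: 1M documents, 36-byte keys, 1KB values, 1 replica, 100% in RAM, SSD, default HWM.
        double bytes = clusterRamQuotaBytes(1_000_000, 36, 1024, 1, 1.0, true, 0.85, 56);
        System.out.printf("Cluster RAM quota needed: %.2f GB%n", bytes / (1024 * 1024 * 1024));
    }
}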

The value gets calculated as explained in the Couchbase link, here.
The reference for calculating document size is here.

--- happy sizing :)

What's so special about Java 8 stream API

Java 8 has added functional programming constructs, and one of the major additions in terms of API is streams.

A mechanical analogy is a car-manufacturing line, where a stream of cars is queued between processing stations. Each station takes a car, does some modification/operation, and then passes it to the next station for further processing.


The main benefit of the stream API is that in Java 8 you can program at a higher level of abstraction: you transform a stream of one type into a stream of another type, rather than processing one item at a time (using a for loop or iterator). With this, Java 8 can run a pipeline of stream operations on several CPU cores, each working on a different chunk of the input. This way you get parallelism almost for free, instead of doing the hard work with threads and locks.

Streams focus on partitioning the data rather than coordinating access to it.



A collection is mostly about storing and accessing data, whereas a stream is mostly about describing a computation on data.
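
A small example of this pipeline style (the data and pipeline are made up purely for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamDemo {
    public static void main(String[] args) {
        List<String> cars = Arrays.asList("sedan", "suv", "hatchback", "suv", "coupe");

        // Describe the computation as a pipeline: filter, transform, collect.
        List<String> shouted = cars.stream()
                .filter(c -> c.length() > 3)        // keep only some elements
                .map(String::toUpperCase)           // transform each element
                .distinct()
                .collect(Collectors.toList());

        System.out.println(shouted);                // [SEDAN, HATCHBACK, COUPE]

        // Switching to parallelStream() lets the same pipeline run across CPU cores.
        long count = cars.parallelStream().filter(c -> c.startsWith("s")).count();
        System.out.println(count);                  // 3
    }
}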




Wednesday, August 9, 2017

Replication Factor in Couchbase

One of the core requirements for distributed DBs is to be as highly available as possible. What this literally means is that even if one or more nodes go down, the DB should keep functioning (on its own or with minimum intervention). This is possible only if there are backup copies of the data.

The replication factor controls the number of replicas (backup copies) of an item/data/document stored in the DB. The general rule is to have a replica for each node which can fail in the cluster.

Let's check how one of the famous NoSQL distributed DBs handles the replication factor.

Couchbase


The default replication factor in Couchbase is 1 (if replication is enabled). The drop-down field (shown below) has only 3 values, i.e. 1, 2 and 3. Practically, it doesn't make sense to have a replication factor of more than 3, no matter how large your cluster is.

So even if you have only one node and enable replicas, that node will hold two copies of the same data (one original and one backup). Once you add more nodes to the cluster, originals and replicas get re-distributed automatically.


Recommendation:

Number of nodes <= 5: RF = 1
5 < Number of nodes <= 10: RF = 2
Number of nodes > 10: RF = 3

The number of nodes mentioned above counts only data nodes if you are using Multi-Dimensional Scaling (MDS). If you are not using MDS, the above rule should still hold good.

In the event of failure we can fail over (manually or automatically) to replicas. 
  • In a 5-node cluster with 1 replica: if one node goes down, the cluster can fail it over. But before the failed node is back up, what if another node goes down? You are out of luck; you will have to add another node to the cluster.
  • After a node goes down and has been failed over, try to replace that node ASAP and perform a rebalance. Rebalance recreates the replica copies if there are enough nodes available.


References

Saturday, July 29, 2017

Understanding AWS' IAM Service

In the AWS world, everything is a service; in fact, more technically, a web service. Even for security there is one: IAM (Identity and Access Management).


IAM background 

Let's assume you manage a team (in a big company, or you are a startup) and you decide to embrace your favourite cloud platform, AWS. To start with, you create an account on AWS. This is the root account or root user (as it's called in the Linux world). You want your team members to get access to the AWS console and its different services.

Do you want to give them the same access as you have? Definitely NOT!

Giving full administrative access to all users would affect the security of your systems and critical data. Root access might also affect your monthly bills: what if a user starts a bunch of powerful EC2 instances when you only wanted to use the S3 service of AWS? That's where the concepts of users, groups, roles and policies come into the picture in AWS, and this is all achieved using the IAM service.

Amazon follows a shared security model: it is responsible for securing the platform, but as a customer you need to secure your data and access to the services.

The root account gets created when you first set up your AWS account. It has complete admin access.

What is IAM ?

IAM is the authentication and authorisation service of AWS. IAM allows you to control who can access AWS resources, and in what ways. As an administrator, it gives you centralised control of your AWS account and enables you to manage users and their level of access to the AWS console.

IAM, being a core service, has global scope (it is not specific to a region). This means your user accounts and roles will be available all across the world.


IAM Page on AWS Console

Sign in to the AWS console through your root account. At the top left of the UI, click on Services, and then from the list of services click IAM (it comes under Security, Identity and Compliance). This takes you to the IAM page of the console.

At the very top it shows a sign-in link which has a numeric account number in the URL. I customised the URL for easy readability by replacing the account number with geekrai (this blog's name). A screenshot of my IAM page is attached below.



IAM Components

The above image shows the IAM page in the AWS console. It shows that there are 0 users (the root user is not counted as a user), 0 groups and 0 roles. Let's explore all the IAM components in detail.

Users - End users of the services.
Click on Create individual IAM users to configure users for this account. Through this you can add as many users as you want to this account. By default, new users have no permissions when they get created. There are two types of access for new users:
  1. Programmatic Access: AWS enables an access key ID and secret access key for accessing AWS programmatically (AWS API, CLI, SDK etc.).
  2. AWS Management Console Access: this allows your users to sign in to the AWS console. Users need a password to sign in.
You can choose either or both of the above access types for a user.
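
As an illustration of programmatic access, here is a hedged sketch using the AWS SDK for Java (v1); the access key, secret and region are placeholders, and the call will succeed only if the user's policy allows it:

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class IamProgrammaticAccess {
    public static void main(String[] args) {
        // Access key id and secret generated for the IAM user (placeholders).
        BasicAWSCredentials credentials =
                new BasicAWSCredentials("AKIA...EXAMPLE", "secret-access-key");

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .build();

        // Works only if the user's group/policy allows s3:ListAllMyBuckets.
        s3.listBuckets().forEach(b -> System.out.println(b.getName()));
    }
}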

Groups - A collection of users under one set of permissions.
Once a user is created, it should (ideally) be part of a group like developer, administrator etc. I created a group named developer and added a user named siddheshwar to the group. This enables you to attach policies to the group.


Policies (Policy Documents) - A policy is a document which defines one or more permissions and gets attached to a group or user.
A policy document is written as key-value pairs in JSON format. The AWS console provides a list of predefined policies; you just need to select the one which is apt for your case.

AmazonEC2FullAccess

Provides full access to Amazon EC2 via the AWS Management Console.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "ec2:*",
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "elasticloadbalancing:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "autoscaling:*",
            "Resource": "*"
        }
    ]
}

AdministratorAccess

Provides full access to AWS services and resources
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

Administrator access is practically the same as root access. Please note that these policies can also be attached directly to a user; it doesn't always have to be through a group.


Roles - Roles control responsibilities which get assigned to AWS resources.
An IAM role is similar to a user, but instead of being associated with a person it can be assigned to an application or service as well. Remember, in AWS everything is a service.


How roles can help -
  • Enable identity federation: allow users to log in to the AWS console through Gmail, Amazon, OpenID etc.
  • Enable access between your AWS account and a 3rd-party AWS account.
  • Allow an EC2 instance to call AWS services on your behalf.

Below is the screenshot of a role page.



More details about Roles, here


IAM Best Practices

Multi-Factor Authentication (MFA) for the Root Account: the root account is the ID and password which you used to sign up for AWS. The root account gives you unlimited access to AWS, which is why its security is so important; AWS recommends setting up MFA. Once you set up MFA, you will have to provide an MFA code as well while signing in.

Reference- https://aws.amazon.com/iam/details/mfa/

Set Password Policy: it's good practice to set password policies, such as which characters are mandatory in a password, expiry time, or rotation policy.

Set Billing Alarm: you can set a threshold on your monthly bill; if that threshold is crossed, AWS will send you an e-mail. This feature is not directly related to IAM; Amazon's CloudWatch service helps in monitoring billing.


---
happy learning !


Sunday, July 23, 2017

Kafka - All that's Important

This post is all about KAFKA! By the end of this post you should have an idea about its design philosophy, components, and architecture.

Kafka is written in Scala and doesn't follow JMS standards.

What is Apache Kafka ?

Kafka is a distributed streaming platform which is highly scalable, fault-tolerant and efficient (it provides high throughput and low latency). Just like other messaging platforms it allows you to publish and subscribe to streams of records/messages, but as a platform it offers much more. An important distinction worth mentioning: it's NOT a queue implementation, but it can be used as a queue. It can also be used to pipe data between two systems, something which has traditionally been done through ETL systems.


What is Stream Processing ?

Let's understand what stream processing is before we delve deeper. Jay Kreps, who implemented Kafka along with other team members while working at LinkedIn, explains stream processing here. Let's cover the programming paradigms -
  1. Request/Response - send ONE request and wait for ONE response (a typical HTTP or REST call).
  2. Batch - send ALL inputs; the batch job does the data crunching/processing and then returns ALL the output in one go.
  3. Stream Processing - the program has control in this model; it takes (a bunch of) inputs and produces SOME output. SOME here depends on the program - it can return ALL the output, or ONE, or anything in between. So this is basically a generalisation of the two extreme models above.
Stream processing is generally async. Stream processing has also been popularised by lambdas and frameworks like Rx, where it is confined to a single process. But in the case of Kafka, stream processing is distributed and really large!

What is Event Stream ?

Events are actions which generate data!

Databases store the current state of data, which has been reached through a sequence (or stream) of events. Visualising data as a stream of events might not be very obvious, especially if you have grown up seeing data stored as rows in databases.

Let's take the example of your bank balance: your current balance is the result of all the credit and debit events which have occurred in the past. Events are the business story which resulted in the current state of the data. Similarly, the current price of a stock is due to all the buy and sell events which have happened since the day it got listed on a bourse; in the retail domain, data can be seen as a stream of orders, sales and price adjustments. Google earns billions of dollars by capturing click events and impression events.

Now companies want to record all user events on their sites; this helps in profiling customers better and offering more customised services. This has led to tons of data being generated. So when we hear the term big data, it basically means capturing all those events which most companies were ignoring earlier.


Kafka's Core Data Structure

There are quite a few similarities between a stream of records and an application log: both order their entries by time. At its core, Kafka uses something similar to record the stream of events - the commit log. I have discussed commit logs separately, here.

Kafka provides a commit log of updates, as shown in the image below. Data sources (or producers) publish streams of events or records which get stored in the commit log, and then subscribers or consumers (like a DB, cache, or HTTP service) can read those events. These consumers are independent of each other and have their own reference points from which to read records.

https://www.confluent.io/blog/stream-data-platform-1/
Commit logs can be partitioned (sharded) across a cluster of nodes, and they are also replicated to achieve fault tolerance.

Read about Zero Copy here: https://www.ibm.com/developerworks/library/j-zerocopy/
How Kafka storage internally works: https://thehoard.blog/how-kafkas-storage-internals-work-3a29b02e026


Key Concepts and Terminologies

Let's cover important components and aspects of Kafka:


Message
A message is a record or piece of information which gets persisted in Kafka for processing. It's the fundamental unit of data in the Kafka world. Kafka stores messages in binary format.


Topics
Messages in Kafka are categorised under topics. A topic is like a database table. Messages are always published to and subscribed from a given topic (by name). For each topic, Kafka maintains a structured commit log with one or more partitions.


Partition
A topic can have multiple partitions, as shown in the figure above. Kafka appends new messages at the end of a partition. Each message in a partition is assigned a unique identifier known as the offset. Writes to a partition are sequential (from left to right), but writes across different partitions can happen in parallel, as each partition may be on a different box/node. The offset uniquely identifies a message within a given partition. (In the picture above, the current offset where the next message will be written in partition 0 is 12.)

Ordering of records is guaranteed only within a partition of a given topic. Partitions allow Kafka to go beyond the limitations of a single server: a single topic can be scaled horizontally across multiple servers. At the same time, each individual partition must fit on a single host.

Each partition has one server which acts as the leader and zero or more servers which act as followers. The leader handles all read and write requests for that partition and the followers replicate it. If the leader fails, one of the followers gets chosen as the new leader. Each server acts as a leader for some partitions and a follower for others, so that load is balanced.


Producers
As the name suggests, producers post messages to topics. The producer is responsible for assigning a message to a partition within a given topic. The producer connects to any of the alive nodes and requests metadata about the leaders of the topic's partitions. This allows the producer to send the message directly to the partition's lead broker.


Consumers
Consumers subscribe to one or more topics and read messages for further processing. It's the consumer's job to keep track of which messages have been read, using the offset. Consumers can re-read past messages or can jump ahead to future messages as well. This is possible because Kafka retains all messages for a pre-configured retention period.


Consumer Group
Kafka scales the consumption by grouping consumers and distributing partitions among them. 
https://cdn2.hubspot.net/hubfs/540072/New_Consumer_figure_1.png

The diagram above shows a topic with 3 partitions and a consumer group with 2 members. Each partition of the topic is assigned to only one member in the group. The group coordination protocol is built into Kafka itself (earlier it was managed through ZooKeeper). For each group, one broker is selected as the group coordinator. Its main job is to control partition assignment when there is any change (addition/deletion) in the membership of the group.

https://kafka.apache.org/0110/images/consumer-groups.png
How a consumer group helps (a minimal consumer sketch follows after this list):
  • If all consumer instances have the same consumer group, then records get load-balanced over the consumer instances.
  • If the consumer instances all have different consumer groups, then each record is broadcast to all the consumer processes.
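
The consumer side of this is configured with group.id in the Java client. Here is a minimal, hedged sketch; the broker address, group id and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-consumer-group");          // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            while (true) {
                // The group coordinator assigns this instance a subset of the topic's partitions.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}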

Brokers
Kafka is a distributed system, so topics can be spread across different nodes in a cluster. These individual nodes or servers are known as brokers. A broker's job is to manage persistence and replication of messages. Brokers scale well as they are not responsible for tracking offsets or messages for individual consumers. There is also no issue arising from the rate at which consumers consume messages. A single broker can easily handle thousands of partitions and millions of messages.

Within a cluster of brokers, one will be elected as cluster controller. This happens automatically among live brokers.

A partition is owned by a single broker of the cluster, and that broker acts as the partition leader. To achieve replication, a partition is assigned to multiple brokers. This provides redundancy of the partition, and a replica will be used as leader if the primary one fails.

Leader
Node or Broker responsible for all the reads and writes of the given partition. 

Replication Factor
Replication factor controls the number of replica copies. If a topic is un-replicated then replication factor will be 1. 

If you want to design the system such that f failures are fine, then you need to have 2f+1 replicas.
So, for tolerating 1 node failure, the replication factor should be set to 3.

Architecture

The diagram below shows all the important components of Kafka and their relationships.



Role of Zookeeper

ZooKeeper is an open source, high-performance coordination service for distributed applications. In distributed systems, ZooKeeper helps with configuration management, consensus building, coordination and locks (Hadoop also uses ZooKeeper). It acts as a middle man among all nodes and helps with different coordination activities; it's the source of truth.

In Kafka, it's mainly used to track the status of cluster nodes and also to keep track of topics, partitions, messages etc. So, before starting the Kafka server, you should start ZooKeeper.


Kafka Vs Messaging Systems

Kafka can be used as a traditional messaging system (broker) like ActiveMQ, RabbitMQ, or TIBCO. Traditional messaging systems have two models, queuing and publish-subscribe, and each of these models has its own strengths and weaknesses -

Queuing - In this model a pool of consumers reads from the server and each record goes to one of them. This allows you to divide up the processing over multiple consumers and helps scale processing. But queues are not multi-subscriber: once one subscriber reads the data, it's gone.

Publish-Subscribe - This model allows you to broadcast data to multiple subscribers. But this approach doesn't scale processing, as every message goes to every consumer/subscriber.

Kafka uses the consumer group model, so every topic has both of these properties (queue and publish-subscribe). It can scale processing and it's also multi-subscriber. A consumer group allows you to divide up processing over the different consumers of the group (just like the queue model), and just like the traditional pub-sub model it allows you to broadcast messages to multiple consumer groups.

Ordering is not guaranteed in traditional systems: if records need to be delivered to consumers asynchronously, they may reach consumers out of order. Kafka does better by creating multiple partitions for the same topic, with each partition consumed by exactly one consumer in the group.


CLI Commands

Here is a link which covers Kafka CLI commands.



-------
References: