geekRai

Important Configuration Parameters of Kafka Producers

2019-12-12T10:43:00.001-08:00

The list of Kafka configurations for Producer is quite large. But, the good news is that you are not forced to configure all of them; Kafka provides a default for most of them. This may work for most of the cases. But, if you are particular about the performance, reliability, throughput, latency then it's worth revisiting them and customizing as per your specific need.

This post, I will cover some of the important configurations.

Kafka Reference - here.

compression.type

Default value = none. (i.e. No compression).

Available values = none, gzip, snappy, lz4

This is the algorithm that will be used by the producer (sitting in your application) to compress data before sending them to the brokers. If multiple messages are getting batched together before sending then this configuration improves performance. Enabling compression will reduce network utilization and storage. Snappy (invented by Google) provides decent compression ratios with low CPU overhead. Gzip, typically provides a better compression ratio but uses more CPU. So if network bandwidth is limited choose Gzip else go for Snappy.

batch.size

Default value = 16384 (i.e. 16K bytes)

Kafka Producer batches messages for each partition before sending them to the specific partition. This parameter controls the amount of memory (in bytes) which will be used for each batch. Kafka producer uses batch size and the timeout (linger.ms) to decide when to send. The producer will try to accumulate as many messages are possible (<= batch.size) and then send all of them in one go. If the batch size is very small, the Producer will be sending messages more frequently (0 value will disable batching). A larger batch size may waste some memory as the allocated memory might not get fully utilized.

linger.ms

Default value = 0

This value allows the Producer to group together records/messages before they get sent to the broker. This is the amount of time in milliseconds for which the producer will wait for accumulating messages in a batch. If this value is not set (default), then the producer will send messages as and when they arrive. Latency will be minimum for the default value. Setting this value to say, 5 will increase the latency but at the same time, it will also increase throughput (as you can send more messages in one go, so less overhead per message). If there is no load then setting it to 5 will increase latency by up to 5 ms.

acks

Default value = 1

This controls the number of acknowledgments the producer requires the leader to have received before considering a request complete. This affects the durability of the message.

acks=0

The message is considered to be written successfully to Kafka if the producer managed to send it over then network.

Resolving ClassNotFoundException in Java

2019-12-12T10:39:00.000-08:00

ClassNotFoundException is a checked exception and subclass of java.lang.Exception. Recall that, checked exceptions need to be handled either by providing try/catch or by throwing it to the caller. You get this exception when JVM is unable to load a class from the classpath. So, troubleshooting this exception requires an understanding of how classes get loaded and the significance of classpath.

Root Cause of java.lang.ClassNotFoundException

Java doc of java.lang.ClassNotFoundException puts it quite clearly. It gets thrown in below cases when the class definition is not found-

Load class using forname method of class Class
Load class using findSystemClass method of ClassLoader
Load using loadClass method of ClassLoader

References:
http://javaeesupportpatterns.blogspot.in/2012/11/javalangclassnotfoundexception-how-to.html

Distributed Data System Patterns

2019-12-12T10:26:00.002-08:00

I have published the post on Medium, here.

Performance Parameters for a System

2019-12-02T08:27:00.001-08:00

Performance is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used.

Depending on the context, this may involve achieving one or more of the following:

Short response time/low latency for a given piece of work

High throughput (rate of processing work)

Low utilization of computing resource(s)

Response Time / Latency

The time between a client sending a request and receiving the response. The response time is what the client sees which includes the service time of the request and network & queuing delay.

Even if you make the same request time and again you will see a varying response time on every try. In practice, service or application handling a variety of request, the response time can vary a lot. One obvious reason is, a request for a user having a lot of data will be slower than another user which doesn't have much data. Other reasons could be - random additional latency, loss of a network packet during TCP transmission, GC pause, page fault forcing read from disk, other mechanical or network faults. That's why we need to think of response time not as a single number but as a distribution of values.

If 95th percentile (p95) response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.

low latency - achieving a short response time - is the most interesting aspect of performance, because it has a strong connection with physical (rather than financial) limitations.

In a distributed system, there is a minimum latency that cannot be overcome: the speed of light limits how fast information can travel, and hardware components have a minimum latency cost incurred per operation (think RAM and hard drives but also CPUs).

Throughput

The number of requests or records which can be processed per second, or the total time it takes to run a job on a dataset of a certain size.

There are tradeoffs involved in optimizing for any of these outcomes. For example, a system may achieve higher throughput by processing larger batches of work thereby reducing operational overhead. The tradeoff would be longer response times for individual pieces of work due to batching.

Resource Utilization

We want the optimal usage of the hardware resources which includes CPU, RAM, Network bandwidth. Or, in other words, do more with fewer resources. This will help in the scaling of the system.

Good practices for Accessing Couchbase Programatically

2019-07-20T07:38:00.000-07:00

Couchbase is one of the most popular distributed, fault-tolerant, and highly performant Document-oriented as well as a key-value store. Like any other database (relational or NoSQL), Couchbase provides language-specific Client SDK to access DB and perform CRUD operations. You can also access Couchbase through the command-line tool, cbc or from the Couchbase web console.

This post will focus on some of the good practices for accessing Couchbase. I will be using Java SDK for this post; the concepts are applicable for any language though.

Initialize the Connection

All operations performed against Couchbase (cluster) is through Bucket instance. From the relational world perspective, Bucket is the database instance. And to create bucket instance, we need to have the instance of the Cluster.

The important point to be noted is that we should only create ONE connection to the Couchbase cluster and ONE connection to each bucket, and then statically reference those connections for use across the application. Reusing the connections will ensure that underlying resources are utilized to the fullest.


// connects to cluster running on localhost

Cluster cluster = CouchbaseCluster.create();   




// Connects cluster on 10.0.0.1 and if it fails then tries 10.0.0.2
Custer cluster = CouchbaseCluster.create("10.0.0.1", "10.0.0.2");

Now, the Cluster instance is created, we can create Bucket instance to complete the initialization.


// Opens the default bucket

Bucket bucket = cluster.openBucket();   







// Opens connections to demo bucket
Bucket bucket = cluster.openBucket("demo");   


// Opens connections to SECURED demo bucket
Bucket bucket = cluster.openBucket("demo", "p@ssword");

Tips#

It's good practice to pass at least two IP addresses in the create method. At the time of this call if the first host is down (for some reason); 2nd host will be tried. These IPs will be used only during initialization. If there is only one host and it's down, then you are out of luck and bucket instance will not be created!

References:
https://blog.couchbase.com/10-things-developers-should-know-about-couchbase/

Thoughts on GC friendly programmig

2019-07-20T07:29:00.002-07:00

Garbage Collections in Java gets triggered automatically to reclaim some of the occupied memory by freeing up objects. Hotspot VM divides the Heap into different memory segments to optimize the garbage collection cycle. It mainly separates objects into two segments - young generation and old generation.

Objects get initially created into young gen. Young gen is quite small and thus minor garbage collection runs on it. If objects survive the minor GC; then they get moved to the old gen. So it's better to use short-lived and immutable objects than long-lived mutable objects.

Minor GC is quite fast (as its runs on smaller memory segment) and hence it's less disruptive. The ideal scenario will be that GC never compacts old gen. So if full GC can be avoided you will achieve the best performance.

So a lot depends on how you have configured your heap memory and another important factor is do you code keeping in mind these aspects.

http://www.ibm.com/developerworks/library/j-leaks/
http://stackoverflow.com/questions/6470651/creating-a-memory-leak-with-java/6471947#6471947

Graph DFS traversal

2019-07-20T07:24:00.000-07:00

This post, I will be focusing on Depth-First traversal.

The difference between BFS and DFS traversal lies in the order in which vertices are explored. And this mainly comes due to the data structure used to do traversal. BFS uses queue whereas DFS uses a stack data structure to perform traversal. Below diagram illustrates the order of traversal.

DFS Progress

DFS Implementation

DFS implementation uses UndirectedGraph.java from the BFS post. Below class has two implementations of DFS traversals (recursive and iterative). Method, traverse(..) is stack based implementation whereas traverseRecur(..) is recursive implementation.

package graph.algo;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/**
 * Depth-First Traversal of a graph represented as adjacency list
 * 
 * @author Siddheshwar
 * 
 */
public class DFS<V> {
 private Graph<V> graph; // graph on which DFS will be performed
 private Deque<V> stack; // Deque used to implement stack
 private Set<V> visited; // stores set of visited nodes

 public DFS(Graph<V> graph) {
  this.graph = graph;
  stack = new ArrayDeque<>();
  visited = new LinkedHashSet<>(); // to maintain the insertion order
 }

 /**
  * Iterative/stack based DFS implementation
  * 
  * @param source
  */
 public void traverse(V source) {
  Objects.requireNonNull(source, "source is manadatory!");
  if (this.graph == null || this.graph.isEmpty()) {
   throw new IllegalStateException(
     "Valid graph object is required !!!");
  }

  stack.push(source);
  this.markAsVisited(source);
  boolean pop = false;
  V stackTopVertex;
  System.out.print("  " + source);

  while (!stack.isEmpty()) {

   if (pop == true)
    stackTopVertex = stack.pop();
   else
    stackTopVertex = stack.peek();

   List<V> neighbors = graph.getAdjacentVertices(stackTopVertex);

   if (!neighbors.isEmpty()
     && hasUnvisitedNeighbor(neighbors, visited)) {
    for (V a : neighbors) {
     if (!this.isVertexVisited(a)) {
      System.out.print("  " + a);
      visited.add(a);
      stack.push(a);
      // break from loop if an unvisited neighbor is found
      break;
     }
    }
   } else {
    // if all neighbors are visited
    pop = true;
   }
  }
 }

 /**
  * Recursive implementation of DFS
  * 
  * @param source
  */
 public void traverseRecur(V source) {
  Objects.requireNonNull(source, "source is manadatory!");

  this.markAsVisited(source);
  System.out.print(" " + source);

  // get neighbors in sorted manner
  List<V> neighbors = this.graph.getAdjacentVertices(source);

  for (V n : neighbors) {
   if (!this.isVertexVisited(n)) {
    traverseRecur(n);
   }
  }
 }

 /**
  * checks if any of the neighbor is unvisited
  * 
  * @param neighbors
  * @param visited
  * @return
  */
 private boolean hasUnvisitedNeighbor(List<V> neighbors, Set<V> visited) {
  for (V i : neighbors) {
   if (!visited.contains(i))
    return true;
  }
  return false;
 }

 /**
  * Returns true if vertex is already visited
  * 
  * @param i
  * @return
  */
 private boolean isVertexVisited(V i) {
  return this.visited.contains(i);
 }

 /**
  * Mark a vertex visited
  * 
  * @param i
  */
 private void markAsVisited(V i) {
  this.visited.add(i);
 }

 // test method
 public static void main(String[] args) {
  Graph<Integer> graph = new Graph<>();
  graph.addEdge(1, 2);
  graph.addEdge(1, 5);
  graph.addEdge(1, 6);
  graph.addEdge(2, 3);
  graph.addEdge(2, 5);
  graph.addEdge(3, 4);
  graph.addEdge(5, 4);

  // for undirected graph
  graph.addEdge(2, 1);
  graph.addEdge(5, 1);
  graph.addEdge(6, 1);
  graph.addEdge(3, 2);
  graph.addEdge(5, 2);
  graph.addEdge(4, 3);
  graph.addEdge(4, 5);

  System.out.print("DFS -->");
  DFS<Integer> dfs = new DFS<>(graph);

  /**
   * stack based DFS traversal
   */
  dfs.traverse(1);

  /**
   * Recursive DFS traversal
   */
  // dfs.traverseRecur(1);

  // validation
  /**
   * after traversal; stack should be empty. And visited should have
   * vertices in DFS order
   */
  System.out.println("\nIs Stack is empty :"
    + (dfs.stack.isEmpty() ? "yes" : "no"));
  System.out.println("visited :" + dfs.visited);

 }

}

Output:
DFS --> 1 2 3 4 5 6
Is Stack is empty :yes
visited :[1, 2, 3, 4, 5, 6]

Time Complexity

Assuming above implementation of Graph i.e. Adjacency List, |V| is number of vertices and |E| is number of edges.

So complexity is O(|V|+|E|)

--
happy learning !!!

Scalability - Getting hang of Seconds

2018-08-20T22:57:00.000-07:00

This post list some of the most important numbers which are important for engineers to do back of envelope calculations. This post, I have focussed on seconds. If you hear someone telling that his service gets 1 million hits a day; don't get bogged down with the numbers, it just means he gets 10 requests per second.

Seconds

# Seconds in a day = 86400 (=24*60*60)

= 0.85*10^5

= 10^5

= 0.1 million

# Seconds in a month = 2592000 (=30*86400)

= 2.5*10^6

= 2.5 millions

If an online site gets 10 million hits per day then it means on an average it gets 100 requests/sec.

If an online site gets 10 million hits per month then it means on an average it gets 4 requests/sec.

# Seconds in a year = 31104000 (=12*259200)

= 31 millions

= Pie * 10^7

If we treat a year as 365.25 days then also, # seconds in a year would be 3,155,7600 which would approximate to 31 million.

# Seconds in a century = 3,155,760,000 seconds (considering 1 year = 365.25 days)

= 3.15 billions

Nanocentury is 1 billionth of a century. So, a nanocentry = 3.15 seconds.

i.e. Pie seconds are there in a nano century. This is also known as Duff's Rule.

Designing REST URI for supporting multiple content type

2018-03-25T11:57:00.000-07:00

Resources can be represented in multiple formats - JSON, XML, Atom, Binary formats like png, text and even proprietary formats. If client request a resource the REST service transfers the state of a resource (and not the resource itself) in the appropriate format.

Assume that you are designing RESTful interface for providing metadata for Cars and your service gets consumed by many clients, some traditionals as well as few startups. So each one of them have their own requirements to provide response in the given format. Let's see what are available options-

Approach 1: One URI per representation

http://www.myservice.com/cars
http://www.myservice.com/cars/xml

The first URI is default representation of the resource and second one returns the response in xml format. Both URI are different so there will be different handlers (end point) and hence the response can be easily returned in appropriate format.

Approach 2: Use Parameter of URI

http://www.myservice.com/cars?format=xml

This approach is easy to read and understand.

Approach 3: Single URI for all representation

This approach comes from the fact that if client is essentially asking for the same resource then why do we need different URIs. Remember, REST uses HTTP; can we leverage HTTP ACCEPT header to get different representation of the same resource. This is process of selecting the best representation for a given resource- termed as Content Negotiation.

Content Types

HTTP uses Internet media types (originally known as MIME types) in the content-type and accept header fields. Internet media types are divided into 5 top level categories: text, image, audio, video and application. And then these types are further divided into several subtypes:

text/xml : default content type for text message
text/html : commonly used type used in browsers
text/xml, application/xml: Format used for xml exchanges
image/gif, image/jpeg, image/png: image types
application/json: language independent light weight data-interchange text format

GET /cars/

Accept: application/json

So respect the HTTP headers and everything works out.

This approach could be bit code intensive for some frameworks like Django as you need to dig into headers and decode what clients wants. But most of the Java frameworks handle it though annotation.

Final Note

No matter which approach you use. It would be great if you stick to one across the services. Prefer to be consistent even if that leads to not being very right!

References

https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm#sec_5_2_1_2

http://www.restapitutorial.com/lessons/restquicktips.html

http://www.informit.com/articles/article.aspx?p=1566460

PUT vs POST for Modifying a Resource

2017-12-21T21:15:00.001-08:00

Debate around PUT vs POST for resource update is quite common; I have had my share as well. Debate is NOT un-necessary as the difference is very subtle. One simple line of defence by many people is that if the update is IDEMPOTENT then we should use PUT else we can use POST. This explanation is correct to a good extent; provided we clearly understand if a request is truly Idempotent or not.

Also, lot of content is available online which causes confusion. So, I tried to see what the originators of REST architectural style themselves say. This post might again be opinionated, or have missed few important aspects. I have tried to be as objective as possible. Feel free to post your comments/openions :)

Updating a Resource

For our understanding, let's take a case that we are dealing with an account resource which has three attributes: firstName, lastName and status.

Updating Status field:

Advocates of PUT consider below request to be IDEMPOTENT.

HTTP 1.1 PUT /account/a-123

{

"status":"disabled"

}

Reality is that, above request is NOT idempotent as it's updating a partial document. To make it idempotent you need to send all the attributes. So that line of defence is NOT perfect.

HTTP 1.1 PUT /account/a-123

{

"firstName":"abc",

"lastName":"rai",

"status":"disabled"

}

Below article tells very clearly that, if you want to use PUT to update a resource, you must send all attributes of the resource where as you can use POST for either partial or full update.

https://stormpath.com/blog/put-or-post

So, you can use POST for either full or partial update (until PATCH support becomes universal).

What the originator of REST style says

The master himself suggest that we can use POST if you are modifying part of the resource.

http://roy.gbiv.com/untangled/2009/it-is-okay-to-use-post

My Final Recommendation

Prefer POST if you are doing partial update of the resource.
If you doing full update of the resource and it's IDEMPOTENT, use PUT else use POST.

Build Tools in Java World

2017-12-14T21:58:00.001-08:00

Build tools in Java (or JVM ecosystem) have evolved over period of time. Each successive build tool has tried to solve some of the pains of the previous one. But before going further down on tools, let’s start with basic features of standard build tools.

Dependency Management

Each project requires some external libraries for build to be successful. So these incoming files/jars/libraries are called as dependencies of the project. Managing dependencies in a centralized manner is de-facto feature of modern build tools. Output artifact of the project also gets published and then managed by dependency management. Apache IVY and Maven are two most popular tools which support dependency management.

Build By Convention

Build script needs to be as simple and compact as possible. Imagine specifying each and every action which needs to be performed during build (compile all files from src directory, copy them to dir file, create jar file and so on); this will definitely make the script huge and hence managing and evolving it becomes a daunting task. So, modern build tools uses convention like by default (or can be configured as well) it knows that source files are in src directory. This minimizes number of lines in the build file and hence it becomes easier to write and manage build scripts.

So any standard build tool should have above two as de-facto. Below are list of tools which have these features.

ANT + IVY (Apache IVY for dependency management)

· MAVEN

· GRADLE

I have listed only most popular build tools above. ANT by default doesn’t have dependency management but other two have native support for dependency management. Java world is basically divided between MAVEN and GRADLE. So, I have focused below on these two tools.

Maven vs Gradle

MAVEN uses XML to write build script where as GRADLE uses a DSL language based on Groovy (one of the JVM language). GRADLE build script tends to be shorter and cleaner compared to maven build script.
GRADLE build script is written in Groovy (and can also be extended using Java). This definitely gives more flexibility to customize the build process. Groovy is a real programming language (unlike XML). Also, GRADLE doesn’t force to always use convention, it can be overridden.
GRADLE has first class support for multi-project build whereas multi-project build of MAVEN is broken. GRADLE support dependency management natively using Apache open source project IVY (is an excellent dependency management tool). Dependency management of GRADLE is better than MAVEN.
MAVEN is quite popular tool so it has wide community and Java community have been using it for a while; GRADLE on the other hand is quite new so there will be learning curve for developers.
Both are plugin based (and GRADLE being a newer); finding plugin might be difficult for GRADLE. But adoption of GRADLE is growing at good pace, Google supports GRADE for Android. Integration of GRADLE with servers, IDEs and CI tools is not as much as that of MAVEN (as of now).

CONCLUSION

Most of the cons for GRADLE are mainly because it’s a new kid in the block. Other than this, rest all looks quite impressive about GRADLE. It scores better on both core features i.e. Dependency Management and Build by Convention. IMO, configuring build through a programming language is going to be more seamless once we overcome the initial learning curve.

Also, considering we are going down the microservices path, so we will have option and flexibility to experiment with build tool as well (along with language/framework).

References

https://technologyconversations.com/2014/06/18/build-tools/

https://github.com/tkruse/build-bench

How AWS structures its Infrastructure

2017-12-05T09:19:00.002-08:00

This post talks about how AWS structures its global infrastructure.

AWS' most basic infrastructure is Data Center. A single Data Center houses several thousand servers. AWS core applications are deployed in N+1 configuration to ensure smooth functioning in the event of a data center failure.

AWS data centers are organized into Availability Zones. One DC can only be part of one AZ. Each AZ is designed as an independent failure zone for fault isolation. Two AZs are interconnected with high-speed private links.

Two or more AZs form a Region. As of now (dec '17) AWS has 16 regions across the globe. Communication among regions use public infrastructure (i.e. internet), therefore use appropriate encryption methods to encrypt sensitive data. Data stored in a specific region is not replicated across other regions automatically.

AWS also has 60+ global Edge Locations. Edge locations help lower latency and improve performance for end users. Helpful for services like Route 53 and Cloud Front.

Guidlines for designing

Design your system to survive temporary or prolonged failure of an Availability Zone. This brings resiliency to your system in case of natural disasters or system failures.
AWS recommends replicating across AZs for resiliency.
When you put data in a specific region, it's your job to move it to other regions if you require.
AWS products and services are available by region so you may not see a service available in your region.
Choose your region appropriately to reduce latency for your end-users.

How Kafka Achieves Durability

2017-10-21T10:32:00.000-07:00

Durability is a guarantee that, once the Kafka broker confirms that the data is written, it will be permanent. Databases implement it by storing it in non-volatile storage. Kafka doesn't follow the DB approach!

Short Answer

Short answer is that, Kafka doesn't rely on the physical storage (i.e. file system) as the criteria that a message write is complete. It relies on the replicas.

Long Answer

When the message arrives to the broker, it first writes it to the in-memory copy of leader replica. Now it has following things to do before considering the write successful.
Assume that, replication factor > 1.

1. Persist the message in the file system of the partition leader.

2. Replicate the message to the all ISRs (in-sync replicas).

In ideal scenario, both above are important and should be done irrespective of order. But, the real question is, when does Kafka considers that the message write is complete? To answer this, let's try to answer below question-

If a consumer asks for a message 4 which just go persisted on the leader, will the leader return the data? And the answer is NO!

It's interesting to note that, not all data that exists on the leader is available for clients to read. Clients can read only those messages that were written to in-sync replicas. The replica leader knows which messages were replicated to which replica, so until it's replicated it will not be returned to the client. Attempt to read those messages will result in empty response.

So, now it's obvious, just writing the message to leader (including persisting to the file system) is hardly of any use. Kafka considers a message written only if it's replicated to all in-sync replicas.

~Happy replication!

My favourite fiz-buzz problem for Senior Programmers

2017-10-07T06:36:00.002-07:00

This post, I will be discussing one of my favourite fiz-buzz problems for senior programmers/engineers.

Find Kth largest element from a list of 1 million integers.

Or

Find Kth largest element at any given point of time from a stream of integers, count is not known.

This problem is interesting as it has multiple approaches to solve and it checks the fundamentals of algorithms and data structure. Quite often, candidate start with asking questions like is the list sorted? Or can I sort the list ? In such case, I go and check on which sorting algorithm the candidate proposes. This gives me an opportunity to start conversation around complexity of the approach (particularly, time complexity). Most of the candidates are quick to point out algorithms (like Quick Sort , Merge Sort) which take O(NlogN) for sorting a list. This is right time to point out that why do you need to sort the complete array/list if you just need to find out 100th or kth largest/smallest element. Now the conversation usually go in either of the direction -

Candidate sometime suggest that, sorting is more quicker way to solve this problem - missing altogether the complexity aspect. If someone doesn't even realize that sorting is not the right way to handle this problem, then it kind of red signal for me.
At times candidates acknowledge the in-efficiency of sorting approach and then start looking for better approach. I suggest, candidates to think out loud which will give me insight about their thought process and how are they approaching it. When I see them not moving ahead; I suggest them on optimizing Quick sort approach ? Is there any way to cut down the problem size in half in every iteration ? Can you use divide and concur to improve on your O(NlogN) complexity ?

This problem can be solved by Quick Select as well as using Heap data structure. This problem also has a brute force approach (i.e. run loop for k time; in each iteration find the maximum number lower than the last one).

If the candidate doesn't make much progress then I try to simplify the problem by saying - find 3rd or 2nd largest element. I have seen some of the senior programmers failing to solve this trivial version as well. This is clear Reject sign for me.

Also, sometime I don't even ask candidate to code. I use this problem to just get an idea and skip the coding part if i see a programmer sitting right across me :)

-Happy problem solving !

Identifying Right Node in Couchbase

2017-10-07T06:22:00.000-07:00

This post covers - how Couchbase achieves data partitioning and Replication.

doc = "{"key1":"value1".....}" ; doc-id = id

Steps:

Based on key (or document id) the hash gets calculated.
Hash returns a value in the range [0,1023] both inclusive. This is known as partition id.

number basically maps the document to one of the vBuckets.

Next task is to map the vBucket to a physical node. This gets decided by vBucket Map.

This maps tells which is the primary node for the document and which all are the backup nodes. vBucket Map will have 1024 entries, one of each vBucket. And each entry also an array. The first value is for primary node and rest all are replica nodes.

Server list stores list of live nodes. So based on the index of the vBucketMap, we get to know the physical node IP address.

Couchbase Primary vs Secondary Indexes

2017-08-15T03:27:00.000-07:00

Couchbase supports key-value as well as JSON based data model. In Key-value model you don't care about the type of value. In JSON model you have ability to perform queries on the individual attributes using N1QL queries.

Key-Value Model

Without Index
Key-value store is schema less where the object gets mapped to a given key (Just like a HashMap or Dictionary). Couchbase is more like a distributed HashMap. The value could be any supported data type (JSON, CSV, or BLOB). You perform any operation using the key or Document Id. In this case, Couchbase looks up the value corresponding to a given document id. In simple terms, it's just like a key lookup in a HashMap. Index doesn't play any role here.

Query: bucket.get(docId);

With Index

Now what if you want number of documents in your bucket ?

Query: SELECT COUNT(*) FROM `bucket-name`

Above query is going to do full Bucket scan (similar to full table scan in SQL world). In SQL world, index on primary key gets created by default so you can easily perform above operation. But, in Couchbase, that's not the case. You will have to create explicit index to perform above query or any other ad-hoc query. So, if you want to create an index on the the key or document id, we can create primary index.

Query: Create PRIMARY INDEX index_name on `bucket-name`

JSON Model (Secondary Indexes)

If you want to complete control on your data and queries, Json model is going to be your choice. In above approaches you can't say like give me all the objects which has certain attribute value.

In JSON based model, we can query through a SQL like expressive language named as N1QL(pronounced as nickel). This is much more flexible model, we can look for a document(s) through the keys contained inside JSON. Obviously, to optimise lookup/search we can create index on those attributes. These indexes are named as secondary indexes or more precisely Global Seconday Indexes.

Query: CREATE INDEX type_index ON `bucket-name`(type) USING GSI

Primary vs Secondary Indexes

Primary indexes index all the keys in a given bucket and are used when a secondary index cannot be used to satisfy a query and a full bucket scan is required.
Secondary indexes can index a subset of the items in a given bucket and are used to make queries targeting a specific subset of fields more efficiently.

--- happy learning !

RAM sizing Data Node of Couchbase Cluster

2017-08-11T22:16:00.000-07:00

This post talks about finding out how much RAM does your Couchbase cluster needs for holding your Data (in RAM)!

RAM Calculator

RAM is one of the most crucial areas to size correctly. Cached documents allow the reads to be served at low latency and high throughput. Please note that, this doesn't not incorporate RAM requirement from the host/VM OS and other applications running along with Couchbase.

Enter below fields to estimate RAM -

Sample Document (key) (Value)

This is required as document content length as well as ID length impacts RAM. Be mindful of the size aspect when deciding your key generation strategy.

# Replicas

Couchbase only supports upto 3 replicas. So enter either - 1, 2 or 3.

% Of Data you want to be in RAM %

For best throughput you need to have all your documents in RAM i.e. 100% . This way any request will be served from RAM and there will be no IO. In the field please enter only the value like 80, 100 etc.

# Documents

Number of documents in the cluster. When your application is starting from scratch then you can start with a number depending on the load of the application and then you need to evaluate it regularly and adjust your RAM quota if required. So, you can start with say 10000 or 1000000 documents.

Type of Storage SSD HDD

If storage is SSD then overhead % is 25 else it's 30%. SSD will bring better performance in disk throughput and latency. SSD storage will help improved performance if all data is not in the RAM.

Couchbase Version < 2.1 2.1 or higher

Size of meta data for 2.1 and higher versions is 56 bytes but for lower versions it's 64.

High Water Mark %

If you want to use default value enter 85.

If the amount of RAM used by documents reaches high water mark (upper threshold), both primary and replica documents are ejected until the memory usage reaches low Water Mark (lower threshold).

Based on the RAM requirement for the cluster, you can plan how many nodes are required. Another important aspect in deciding number of data nodes is how you expect your system to behave if 1, 2 or more nodes go down at the same time. This link, I have discussed about Replication factor and how it affects your system performance. So, take your call wisely!

The value got calculated as explained in the Couchbase link, here.
Reference for calculating document size is, here.

--- happy sizing :)

What's so special about Java 8 stream API

2017-08-11T22:04:00.002-07:00

Java 8 has added functional programming and one of the major addition in terms of API is, stream.

A mechanical analogy is car-manufacturing line where a stream of cars is queued between processing stations. Each take a car, does some modification/operation and then pass it to next station for further processing.

Main benefit of stream API is that, now in Java (8) you can program at higher level of abstraction. So you can transform stream of one type to stream of other type rather than processing each item at a time (using for loop or iterator). With this Java 8 can run a pipeline of stream operations on several CPU cores on different components of the input. This way you are getting parallelism almost free instead of hard work using threads and locks.

Stream focuses on partitioning the data rather than coordinating access to it.

Vs

Collection is mostly about storing and accessing the data, whereas stream is mostly about describing computation on data.

Replication Factor in Couchbase

2017-08-09T22:21:00.003-07:00

One of the core requirement for Distributed DBs is to be as High Availability as possible. What this literally means is that, even if node/nodes go down the DB should function (on its own or with minimum intervention). This is possible only if there are backup copies of the data.

Replication factor controls number of replicas or backup of an item/data/document stored in a DB. The general rule is to have replica for each node which can fail in the cluster.

Let's check how one of the famous NoSQL distributed Db handles Replication Factor.

Couchbase

Default replication factor is 1 in Couchbase (if it's enabled). Drop down field (as shown below) has only 3 values i.e. 1, 2 and 3. Practically, it doesn't make sense to have replication factor more than 3 no matter how large your cluster is.

So even if you have only one node and enable replicas then in the same node there will be two copies of the same data (one original and one backup). Once you add more nodes to the cluster original and replicas will get re-distributed automatically.

Recommendation:

Number of Nodes <= 5 - RF = 1

5 <= Number Of Nodes <= 10 - RF =2

Number of Nodes > 10 - RF = 3

Number of nodes mentioned above is only for data nodes if you are using Multi Dimensional Scaling. If you are not using MDS then also above rule should hold good.

In the event of failure we can fail over (manually or automatically) to replicas.

In a 5 node cluster with 1 replica. If one node goes down cluster can fail it over. Now before the the failed node is up, what if another node goes down ? You are out of luck. You will have to add another node to the cluster.
After a node goes down and it's failed over try to replace that node ASAP and perform rebalance. Rebalance creates the replica copies if there are enough nodes available.

References

https://developer.couchbase.com/documentation/server/4.5/clustersetup/create-bucket.html
https://developer.couchbase.com/documentation/server/3.x/admin/Concepts/bp-sizingGuidelines.html

Understanding AWS' IAM Service

2017-07-29T06:44:00.000-07:00

In AWS world, everything is a service; in fact more technically a web service. Even for security there is one, IAM (Identity Access Management).

IAM background

Let's assume, you manage a team (in a big company or you are a startup) and you decide to embrace your favourite cloud platform, AWS. To start with, you create an account on AWS. This is root account or root user (as called in Linux world). You want your team members to get access to AWS console and it's different services.

Do you want to give them the same access as you have? Definitely NOT!

Full administrative access to all users will affect security of your systems and critical data. Root access might affect your monthly bills - what if a user starts bunch of powerful EC2 instances, although you wanted to use only S3 services of AWS. That's where the concept of users, groups, role, policies comes into picture in AWS and this all gets achieved using IAM service.

Amazon follows a shared security model - this means it's responsible for securing platform but as as customer you need to secure your data and access to a service.

Root Account gets created when you first setup your AWS account. It has complete admin access.

What is IAM ?

IAM is authentication and authorisation service of AWS. IAM allows your to control who can access AWS resources, how they can access and in what ways. As an Administrator, it gives you centralised control of your AWS account and enables you to manage users and their level of access to the AWS console.

IAM, being a core service has global scope (not specific to a region). This means your user accounts, roles will be available all across the world.

IAM Page on AWS Console

Sign-in through your root account to the AWS console. On the left top of UI, click on services and then from the list of services click IAM (comes under Security, Identity and Compliance). This takes to the IAM page of the console.

At the very top it gives sign-in link which has numeric account number in the URL. I customised the url for easy readability by replacing account number with geekrai (this blog name). Attaching below screenshot of my IAM page.

IAM Components

Above image shows the IAM page in the AWS console. It shows that there are 0 users (root user is not counted as a user), 0 groups and 0 roles. We will explore in detail all IAM components -

Users - End users of the services.

Click on the Create individual IAM users to configure users for this account. Through this you can add as many users as you want to this account. By default new users have no permission when they get created. There are two types of accesses for new users.

Programatic Access: AWS enables an access key ID and secret access key for accessing AWS programatically (AWS API, CLI, SDK etc).
AWS Management Console Access: This allows your users to sign-in to AWS console. Users need a password to sign-in.

You can choose either or both of above access types for a user.

Groups - A collection of user under one set of permission.

Once a user is created, (ideally) it should be part of a group like developer, administrator etc. I created a group named as developer and added user with name siddheshwar to the group. This enables you to add policies to the group.

Policies (Policy Document) - It's a document which defines one or more permissions which gets attached to the group or user.

Policy document is a key value pair in JSON format. AWS console provides list of all possible policies, you just need to select the one which is apt for your case.

AmazonEC2FullAccess

Provides full access to Amazon EC2 via the AWS Management Console.

{

"Version": "2012-10-17",

"Statement": [

{

"Action": "ec2:*",

"Effect": "Allow",

"Resource": "*"

{

"Effect": "Allow",

"Action": "elasticloadbalancing:*",

"Resource": "*"

{

"Effect": "Allow",

"Action": "cloudwatch:*",

"Resource": "*"

{

"Effect": "Allow",

"Action": "autoscaling:*",

"Resource": "*"

}

]

}

AdministratorAccess

Provides full access to AWS services and resources

{

"Version": "2012-10-17",

"Statement": [

{

"Effect": "Allow",

"Action": "*",

"Resource": "*"

}

]

}

Administrator access is the same as root access. Please note that these policies can be added directly to the user, does't have always to be through Group.

Roles - Roles control responsibilities which get assigned to AWS resources.

IAM roles is similar to user, but instead of being associated with a person it can be assigned to an application and service as well. Remember, in AWS everything is a service.

How roles can help -

Enable Identity federation. Allow users to log-in to AWS console through gmail, Amazon, OpenId etc.
Enable access between your AWS account and 3rd party AWS account.
Allow EC2 instance to call AWS services on your behalf.

Below is the screenshot of a role page.

More details about Roles, here.

IAM Best Practices

Multi Factor Authentication (MFA) For Root Account: Root account is the id password which you used to sign in to the AWS. Root account gives you unlimited access to AWS, and that's why security is quite important and AWS recommends to set up MFA. Once you set MFA, you will have to provide a MFA code as well while signing in.

Reference- https://aws.amazon.com/iam/details/mfa/

Set Password Policy: It's a good practice to set password polices- like what all characters are mandatory in password, expiry time or rotation policy.

Set Billing Alarm: You can set a threshold level on your monthly bills; if that level crosses then AWS will send an e-mail. This feature is not directly related to IAM. Amazon's cloud watch service helps in monitoring the billing.

---
happy learning !

Kafka - All that's Important

2017-07-23T02:57:00.000-07:00

This post is all about KAFKA! By the end of this post you should have idea about its design philosophy, components, and architecture.

Kafka is written in Scala and doesn't follow JMS standards.

What is Apache Kafka ?

Kafka is a distributed streaming platform which is highly scalable, fault-tolerant and efficient (provides high throughput and low latency). Just like other messaging platforms it allows you to publish and subscribe stream of records/messages..but as a platform it offer much more. An important distinction worth mentioning is -it's NOT a Queue implementation but it can be used as Queue. It can be used to pipe data flow between two systems something which has traditionally been done through ETL systems.

What is Stream Processing ?

Let's understand what is stream processing before we delve deeper. Jay Kreps, who implemented Kafka along with other members while working at LinkedIn explains about stream processing, here. Let's cover programming paradigms -

Request/Response- Send ONE request and wait for ONE response (a typical http or REST call).
Batch- Send ALL inputs, batch job does data crunching/processing and then returns ALL output in one go.
Stream Processing- Program has control in this model, it takes (bunch of) inputs and produces SOME output. SOME here depends on the program - it can return ALL output or ONE or it can do everything in-between. So, this is basically generalisation of above two extreme models.

Stream processing is generally async. Stream processing has also been popularised by Lambdas and frameworks like Rx - where stream processing is confined to a process. But, in case of Kafka- stream processing is distributed and really large!

What is Event Stream ?

Events are actions which generate data!

Databases store the current state of data, which has been reached by sequence or stream of events. Visualising data as stream of events might not be very obvious, especially if you have grown seeing data being stored as rows in databases.

Let's take example of your bank balance - your current bank balance is result of all credit and debit events which have occurred in the past. Events are business story which resulted in the current state of data. Similarly, current price of a stock is due to all the buy and sell events which have happened from the day it got listed on a bourses; in retail domain data can be realised as- stream of orders, sells and price adjustments. Google earns billions of dollers by capturing click events and impression events.

Now companies want to records all user events on their sites - this helps in better profiling the customers and offering more customised services. This has led to tons of data being generated. So, when we hear about the term big data - it basically means capturing all these events which most of the companies were ignoring earlier.

Kafka's Core Data Structure

There are quite similarities between stream of records and application logs. Both order the entries with time. At the core, Kafka uses something similar to record all the stream of events - Commit Log. I have discussed separately about commit logs, here.

Kafka provides commit log of updates as shown in below image. Data Sources ( or Producers) can publish stream of events or records which gets stored in commit log and then subscribers or consumers (like DB, Cache, http service) can read those events. These consumers are independent of each other and have their own reference points to read records.

https://www.confluent.io/blog/stream-data-platform-1/

Commit logs can be partitioned or shared across cluster of nodes and they are also replicated to achieve fault-tolerance.

Read about Zero Copy here: https://www.ibm.com/developerworks/library/j-zerocopy/
How Kafka storage internally works: https://thehoard.blog/how-kafkas-storage-internals-work-3a29b02e026

Key Concepts and Terminologies

Let's cover important components and aspects of Kafka:

Message

Message is record or information which gets persisted in Kafka for processing. It's a fundamental unit of data in Kafka world. Kafka stores the message in binary format.

Topics

Messages in Kafka are categorised under topic. Topic is like a database table. Messages are always published and subscribed from a given topic (name). For each topic, Kafka maintains a structured commit log with one or more partitions.

Partition

A topic can have multiple partition as shown in above figure. Kafka appends new message at the end of a partition. Each message in a topic is assigned a unique identifier known as offset. Write to a partition are sequential (from left to write) but write across different partitions can be done in parallel as each partition may be in a different box/node. Offset uniquely identifies a message in a given partition. Current offset where message is going to be written in partition 0 is 12 (in above pic).

Ordering of record is guaranteed only across a partition of the given topic. Partition allows Kafka to go beyond the limitation of a single server. This means single topic can be scalled horizontally across multiple servers. But at the same time each individual partition must fit in a host.

Each partition has one server which acts as leader and zero or more servers which acts as follower. The leader handles all reads and writes requests for that partition and follower replicate. If the leader fails, one of the follower will get chosen as leader. Each server acts a leader for some partition and follower for others so that load is balanced.

Producers

As the name suggests, Producers post messages to topics. Producer is responsible for assigning messages to a partition in a given topic. Producer connects to any of the alive nodes and requests metadata about the leaders for the partition of a topic. This allows the producer to put the message directly to the lead broker of the partition.

Consumers

Consumers subscribe to one or more topics and read messages for further processing. It's consumers job to keep track of which message have been read using offset. Consumers can re-read past messages or can jump to future messages as well. This is possible because Kafka retains all messages for a given time (which is pre configured).

Consumer Group

Kafka scales the consumption by grouping consumers and distributing partitions among them.

https://cdn2.hubspot.net/hubfs/540072/New_Consumer_figure_1.png

Above diagram shows a topic with 3 partitions and a consumer group with 2 members. Each partition of topic is assigned to only one member in the group. Group coordination protocol is built into Kafka itself (earlier it was managed through zookeeper). For each group one broker is selected as group co-ordinator. It's main job is to control partition assignment when there is any change (addition/deletion) in membership of the group.

https://kafka.apache.org/0110/images/consumer-groups.png

How Consumer Group helps:

If all consumer instances have same consumer group, then records will get load balanced over consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all consumer processes.

Brokers

Kafka is a distributed system, so topics could be spread across different nodes in a cluster. These individual nodes or servers are known as brokers. Brokers job is to manage persistence and replication of messages. Brokers scale well as they are not responsible for tracking offset or messages for individual consumers. Also there is no issue due to rate with which consumer consumes messages. A single broker can easily handle thousands of partitions and millions of messages.

Within a cluster of brokers, one will be elected as cluster controller. This happens automatically among live brokers.

A partition is owned by a single broker of the cluster and that broker acts as partition leader. To achieve replication, a partition will be assigned to multiple brokers. This provides redundancy of the portion and will be used as leader if the primary one fails.

Leader

Node or Broker responsible for all the reads and writes of the given partition.

Replication Factor

Replication factor controls the number of replica copies. If a topic is un-replicated then replication factor will be 1.

If you want to design in such a way that f failures is fine then need to have 2f+1 replica.

So, for 1 node failure; replication factor should be set to 3.

Architecture

Below diagram shows all important components of Kafka and their relationship-

Role of Zookeeper

Zookeeper is an open source, high performance coordination service for distributed applications. In distributed systems Zookeeper helps in configuration management, consensus building, coordination and locks (Hadoop also uses Zookeeper). It acts as middle man among all nodes and helps in different co-ordination activities - it's source of truth.

In Kafka, it's mainly used to track status of cluster nodes and also to keep track of topics, partitions, messages etc. So, before starting Kafka server, you should start Zookeeper.

Kafka Vs Messaging Systems

Kafka can be used as a traditional messaging systems (or Brokers) like ActiveMQ, RabitMQ, Tibco. Traditional messaging systems have two models - queuing and publish-subscribe. Each of these models have their own strengths and weaknesses -

Queuing- In this model pool of consumers can read from server and each record goes to one of them. This allows to divide up the processing over multiple consumers and helps in scale processing. But, queues are not multi-subscriber- once one subscriber reads data it's gone.

Publish-Subscribe- This model allows to broadcast data to multiple subscribers. But, this approach doesn't scale up as every message goes to every consumer/subscriber.

Kafka uses Consumer Group model; so every topic has both these properties (queue and publish-subscribe). It can scale processing and it's also multi-subscriber. Consumer group allows to divide up processing over different consumers of the group (just like queue model). And just like traditional pub-sub model it allows you to broadcast messages to multiple consumer groups.

Order is not guaranteed in tradition systems, if records needs to be delivered to consumers asynchronously they may reach to consumers out of order. Kafka does better by creating multiple partitions for the same topics and each partition is consumed by exactly one consumer in the group.

CLI Commands

Here, goes a link which covers Kafka CLI commands.

https://howtoprogram.xyz/2016/07/08/apache-kafka-command-line-interface/

-------

References:

https://kafka.apache.org/documentation/

https://www.confluent.io/blog/stream-data-platform-1/

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

https://thenewstack.io/apache-kafka-primer/
http://thesecretlivesofdata.com

Role of Commit Log in Databases and Distributed Systems

2017-07-12T01:52:00.001-07:00

Before jumping to the topic, let's start with few questions -

What if debit was done successfully from one account but before credit to second account, DB crashed! Quite possible, right? How relational DBs handle such crashes and ensure that when it comes up the data is in consistent state? (in other words how DBs achieve atomicity and durability?)
How databases like Oracle, MongoDB, PostgreSQL keeps the replica component in synch with the master ?
In a distributed database - how two components of a distributed system agree on a given update order ? Assume that two changes which were sent arrive in different order due to network issues, latency, asynchronous nature or some other issue, then how the system can know what should be update order. In Kafka it's known as Order guarantee.
How a process remains consistent across different nodes ? Or, how a particular update gets consistently applied to different replicas ?
How you force multiple nodes in a distributed system to do the same stuff ? Or, how you enforce deterministic process is deterministic?

Answer to all above questions is - Log!
In database and systems world it is called as write-ahead log or commit log or journal (similar to application logs but used only for programatic access). Jay Kreps in his book I Love Logs defines it as - a append-only sequence of records ordered by time.

Each record (as rectangle in above image) gets appended to the log from left to right. Each entry is assigned a unique sequential log entry number which acts as unique key. The records are relatively ordered with time - leftmost having occurred at the earliest. So, log will help in recording what happened and when - quite handy in distributed data systems.

Common Usage of Commit log- data integration, enterprise architecture, real-time data processing, and data system design.

How Commit log helps?

Relational databases, NoSQL databases, and distributed systems write out the event/update information in a commit log first. It will log all relevant details in the log before actually applying those changes to the actual systems. Writing to log file is an atomic and non-distributed operation so it will get persisted immediately and then it gets used as single source of authority for applying those changes to the systems. So, even if system sees a crash or fault; once it recovers it will check the commit log and apply the pending changes.

Similarly, sequence of events/changes which happens on primary data nodes is exactly what is needed to keep the remote replica database in sync. The slave to replicate node/DB can apply those changes recorded in commit log to their own local data source to stay in sync with master.

The problem of data integration can also be solved through commit logs. Take out all the organisations data and put it in a centralised log for processing. That's what the specialised systems like Kafka does.

happy logging !

- - -

References:

https://www.confluent.io/blog/stream-data-platform-1/
https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382/ref=sr_1_1?ie=UTF8&qid=1499836980&sr=8-1&keywords=i+love+logs
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Securing Communication between Data Centre and Cloud

2017-06-11T22:44:00.002-07:00

In the early days of my career, I used to wonder why we connect to office network using the crypto card. If you have no clue or some clue about what the hell is VPN (just like me :D) ; I would recommend this link which covers the fundamentals of VPN.

Let's start with definition of VPN gateway-

VPN Gateway

A VPN gateway is a type of networking device that connects two or more devices or networks together in a VPN infrastructure. It is designed to create connection or communication between two or more remote sites, networks or devices and/or connect multiple VPNs together. Ref

From the Perspective of Cloud

Companies are gradually moving (their systems) to cloud, so there is need of secure connectivity between Data Centre and Cloud hosted applications. Cloud has become a logical extension of the corporate datacenter (this is referred to as hybrid datacenter).

The cloud hosted application should be able to securely talk to in-premise data or application. This is where the VPN gateway comes into play by securing one site to another site. VPN builds a secure tunnel between two remote sites.

Below diagram shows VPC inside AWS and GCP. You can think of VPC (Virtual Private Cloud) as a cloud inside cloud; or a logical datacenter inside AWS (or GCP - Google Cloud Platform).

Traffic traveling between the two networks is encrypted by originator's VPN gateway, then it gets decrypted by the receiver's VPN gateway.

Count number of different bits in two Numbers

2017-06-11T04:08:00.000-07:00

Problem:
Given two numbers, find how many bits are different in two numbers.
Or, another way to look at problem is - Determine number of bits required to convert num_1 to num_2.

num_1 = 1
num_2 = 0
Number of different bits = 1

num_1 = 11111
num_2 = 01110
Number of different bits = 2

Solution:
Basically we need to find at each position if the value of bit in two number is same or different. If they are different then increase the counter and do the same for all subsequent bits.

It might not be very obvious from the problem but there is a bit operator which exactly finds out how different two inputs are. Let's apply XOR operator and see how it behaves:

1 ^ 0 = 1
0 ^ 0 = 0
0 ^ 1 = 1
1 ^ 1 = 0

So notice that, when bits are same output is always 0. And when both bits are different then output is 1.

11111
^ 01110
-------------
10001

So after taking XOR, we just need to count the number of 1's in the result.

Java Implementation

public static int countNumberOfDifferentBits(int a, int b){
int xor = a ^ b;
int count = 0;
for(int i= xor; i!=0;){
count += i & 1;
i = i >> 1;
}
}

Measuring Execution Time of a Method in Java

2017-06-11T04:05:00.000-07:00

Old fashioned Way

System.currentTimeMillis()

Accuracy is only in milli seconds, so if you are timing a method which is quite small then you might not get good results.

List<Integer> input = getInputList();

long t1 = System.currentTimeMillis();

Collections.sort(input);

long t2 = System.currentTimeMillis();

System.out.println("Time Taken ="+ (t2-t1) + " in milli seconds");

Using Nano seconds

System.nanoTime()

Preferred approach (compared to first one). But do keep in mind that not all systems will provide accuracy in nano time.

List<Integer> input = getInputList();

long t1 = System.nanoTime();

Collections.sort(input);

long t2 = System.nanoTime();

System.out.println("Time Taken ="+ (t2-t1) + " in nano seconds");

Java 8

List<Integer> input = getInputList();

Instant start = Instant.now();

Collections.sort(input);

Instant end = Instant.now();

System.out.println("Time Taken ="+ Duration.between(start, end) + " in nano seconds");

Guava

Stopwatch stopwatch = new Stopwatch().start();

Collections.sort(input);

stopwatch.stop();

geekRai

Important Configuration Parameters of Kafka Producers

compression.type

batch.size

linger.ms

acks

Resolving ClassNotFoundException in Java

Root Cause of java.lang.ClassNotFoundException

Distributed Data System Patterns

Performance Parameters for a System

Response Time / Latency

Throughput

Resource Utilization

Good practices for Accessing Couchbase Programatically

Initialize the Connection

Thoughts on GC friendly programmig

Graph DFS traversal

DFS Implementation

Time Complexity

Scalability - Getting hang of Seconds

Seconds

Designing REST URI for supporting multiple content type

Approach 1: One URI per representation

Approach 2: Use Parameter of URI

Approach 3: Single URI for all representation

Final Note

PUT vs POST for Modifying a Resource

Updating a Resource

What the originator of REST style says

My Final Recommendation

Build Tools in Java World

Maven vs Gradle

CONCLUSION

How AWS structures its Infrastructure

Guidlines for designing

How Kafka Achieves Durability

My favourite fiz-buzz problem for Senior Programmers

Find Kth largest element from a list of 1 million integers.

Or

Find Kth largest element at any given point of time from a stream of integers, count is not known.

Identifying Right Node in Couchbase

Steps:

Couchbase Primary vs Secondary Indexes

Key-Value Model

JSON Model (Secondary Indexes)

Primary vs Secondary Indexes

RAM sizing Data Node of Couchbase Cluster

RAM Calculator

What's so special about Java 8 stream API

Replication Factor in Couchbase

Couchbase

Understanding AWS' IAM Service

IAM background

What is IAM ?

IAM Page on AWS Console

IAM Components

AmazonEC2FullAccess

Provides full access to Amazon EC2 via the AWS Management Console.

AdministratorAccess

IAM Best Practices

Kafka - All that's Important

What is Apache Kafka ?

What is Stream Processing ?

What is Event Stream ?

Kafka's Core Data Structure

Key Concepts and Terminologies

Let's cover important components and aspects of Kafka:

Architecture

Below diagram shows all important components of Kafka and their relationship-

Kafka Vs Messaging Systems

CLI Commands

CLI Commands

Role of Commit Log in Databases and Distributed Systems

How Commit log helps?

Securing Communication between Data Centre and Cloud

VPN Gateway

From the Perspective of Cloud

Count number of different bits in two Numbers

Measuring Execution Time of a Method in Java

Guava