Tuesday, August 15, 2017

Couchbase Primary vs Secondary Indexes

Couchbase supports both a key-value and a JSON-based data model. In the key-value model, you don't care about the type of the value. In the JSON model, you can query individual attributes using N1QL.


Key-Value Model 

Without Index
A key-value store is schemaless: an object is mapped to a given key (just like a HashMap or Dictionary). Couchbase is more like a distributed HashMap. The value can be any supported data type (JSON, CSV, or BLOB). You perform every operation using the key, or document ID, and Couchbase looks up the value corresponding to that ID. In simple terms, it's just like a key lookup in a HashMap. An index doesn't play any role here.


Query: bucket.get(docId);
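
The HashMap analogy can be made concrete with a small in-memory sketch. This is not the Couchbase SDK: the class `KeyValueSketch` and its `upsert`/`get` methods are hypothetical names used only to show that a lookup by document ID needs no index and does not care what the value contains.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// In-memory stand-in for a bucket (an analogy, NOT the Couchbase SDK).
class KeyValueSketch {
    // Document ID -> opaque value; the store never inspects the value.
    private final Map<String, String> store = new HashMap<>();

    void upsert(String docId, String value) {
        store.put(docId, value);
    }

    // Direct lookup by key; no index or scan is involved.
    Optional<String> get(String docId) {
        return Optional.ofNullable(store.get(docId));
    }

    public static void main(String[] args) {
        KeyValueSketch bucket = new KeyValueSketch();
        bucket.upsert("user::1", "{\"type\":\"user\",\"name\":\"Asha\"}");
        bucket.upsert("img::7", "opaque-binary-placeholder");
        System.out.println(bucket.get("user::1").isPresent());
        System.out.println(bucket.get("missing").isPresent());
    }
}
```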


With Index
Now, what if you want the number of documents in your bucket?

Query: SELECT COUNT(*) FROM `bucket-name`

The above query performs a full bucket scan (similar to a full table scan in the SQL world). In the SQL world, an index on the primary key is created by default, so you can easily run such a query. In Couchbase, that's not the case: you have to create an index explicitly to run the above query or any other ad-hoc query. So, if you want an index on the key, or document ID, you can create a primary index.

Query: CREATE PRIMARY INDEX index_name ON `bucket-name`


JSON Model (Secondary Indexes)

If you want complete control over your data and queries, the JSON model is going to be your choice. With the approaches above, you can't ask for all the objects that have a certain attribute value.

In the JSON-based model, we can query through an expressive SQL-like language named N1QL (pronounced "nickel"). This is a much more flexible model: we can look for documents through the keys contained inside the JSON. To optimise lookup/search, we can create indexes on those attributes. These indexes are called secondary indexes or, more precisely, Global Secondary Indexes (GSI).

Query: CREATE INDEX type_index ON `bucket-name`(type) USING GSI



Primary vs Secondary Indexes


  • Primary indexes index all the keys in a given bucket and are used when a secondary index cannot be used to satisfy a query and a full bucket scan is required. 
  • Secondary indexes can index a subset of the items in a given bucket and are used to make queries targeting a specific subset of fields more efficient.
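
The difference between the two kinds of index can be sketched with plain Java collections. This is only an in-memory analogy, not how Couchbase implements indexing: the class `IndexSketch` and its methods are hypothetical. The primary index holds every document ID, while the secondary index maps values of one attribute (`type`, as in the CREATE INDEX example) to the IDs of matching documents.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Analogy only: plain collections standing in for primary and secondary indexes.
class IndexSketch {
    private final Map<String, Map<String, String>> docs = new HashMap<>();
    // "Primary index": every document ID in the bucket.
    private final Set<String> primaryIndex = new TreeSet<>();
    // "Secondary index" on the `type` attribute: value -> matching document IDs.
    private final Map<String, Set<String>> typeIndex = new HashMap<>();

    void upsert(String docId, Map<String, String> doc) {
        Map<String, String> old = docs.put(docId, doc);
        primaryIndex.add(docId);
        // Drop the stale secondary entry if the document previously had a type.
        if (old != null && old.get("type") != null) {
            Set<String> ids = typeIndex.get(old.get("type"));
            if (ids != null) ids.remove(docId);
        }
        String type = doc.get("type");
        if (type != null) {
            typeIndex.computeIfAbsent(type, t -> new TreeSet<>()).add(docId);
        }
    }

    // Like SELECT COUNT(*): answered from the primary index over all keys.
    int countAll() {
        return primaryIndex.size();
    }

    // Like WHERE type = ...: answered from the secondary index, touching only matches.
    Set<String> idsWithType(String type) {
        return typeIndex.getOrDefault(type, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        IndexSketch bucket = new IndexSketch();
        bucket.upsert("u1", Collections.singletonMap("type", "user"));
        bucket.upsert("o1", Collections.singletonMap("type", "order"));
        System.out.println(bucket.countAll());
        System.out.println(bucket.idsWithType("user"));
    }
}
```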


--- happy learning !

Friday, August 11, 2017

RAM sizing Data Node of Couchbase Cluster

This post talks about finding out how much RAM your Couchbase cluster needs to hold your data (in RAM)!


RAM Calculator 

RAM is one of the most crucial areas to size correctly. Cached documents allow reads to be served at low latency and high throughput. Please note that this estimate does not include the RAM required by the host/VM OS and other applications running alongside Couchbase.

Enter the fields below to estimate RAM:

Sample Document (key and value)
This is required because both the document content length and the ID length affect RAM usage. Be mindful of size when deciding your key generation strategy.


# Replicas
Couchbase supports up to 3 replicas, so enter 1, 2 or 3.


% of Data You Want in RAM
For best throughput, all your documents need to be in RAM, i.e. 100%. This way every request is served from RAM and there is no disk IO. In the field, enter only the number, like 80 or 100.


# Documents
Number of documents in the cluster. When your application is starting from scratch, start with a number that reflects the expected load, then re-evaluate it regularly and adjust your RAM quota if required. So, you can start with, say, 10000 or 1000000 documents.


Type of Storage (SSD or HDD)
If the storage is SSD the overhead is 25%, otherwise it's 30%. SSD brings better disk throughput and latency, which improves performance when not all data is in RAM.


Couchbase Version (< 2.1, or 2.1 or higher)
The size of metadata is 56 bytes for version 2.1 and higher, and 64 bytes for lower versions.


High Water Mark (%)
To use the default value, enter 85.
If the amount of RAM used by documents reaches the high water mark (upper threshold), both primary and replica documents are ejected until memory usage drops to the low water mark (lower threshold).
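
The arithmetic behind these fields can be sketched as follows. This assumes the calculator applies the formula from Couchbase's sizing guidelines (metadata plus working set, scaled by the overhead and divided by the high water mark); the class `RamSizingSketch` and its parameter names are hypothetical, and the constants are the ones described in the fields above.

```java
// A minimal sketch of the RAM estimate, assuming the Couchbase sizing-guideline formula.
class RamSizingSketch {
    static double requiredRamBytes(long documents, int idSizeBytes, int valueSizeBytes,
                                   int replicas, double percentInRam, boolean ssd,
                                   boolean version21OrHigher, double highWaterMarkPercent) {
        int copies = 1 + replicas;                        // active copy plus replicas
        int metadataPerDoc = version21OrHigher ? 56 : 64; // bytes, per the version field
        double overhead = ssd ? 0.25 : 0.30;              // per the storage-type field

        // Metadata (including the key) for every copy is assumed to stay in RAM.
        double totalMetadata = (double) documents * (metadataPerDoc + idSizeBytes) * copies;
        // Only the chosen percentage of document values needs to be cached.
        double totalDataset = (double) documents * valueSizeBytes * copies;
        double workingSet = totalDataset * percentInRam / 100.0;

        return (totalMetadata + workingSet) * (1 + overhead) / (highWaterMarkPercent / 100.0);
    }

    public static void main(String[] args) {
        // 1,000,000 documents, 100-byte IDs, 10,000-byte values, 1 replica,
        // 20% of data in RAM, HDD, version 2.1 or higher, high water mark 85%.
        double bytes = requiredRamBytes(1_000_000L, 100, 10_000, 1, 20, false, true, 85);
        System.out.printf("Estimated cluster RAM: %.2f GB%n", bytes / 1e9);
    }
}
```

With the inputs in `main`, the estimate comes to roughly 6.6 GB for the whole cluster, before adding what the OS and other applications need.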

                                                          

Based on the RAM requirement for the cluster, you can plan how many nodes are required. Another important aspect in deciding the number of data nodes is how you expect your system to behave if 1, 2 or more nodes go down at the same time. In the post "Replication Factor in Couchbase", I have discussed replication factor and how it affects your system's performance. So, make your call wisely!

The value is calculated as explained in the Couchbase link, here.
The reference for calculating document size is here.

--- happy sizing :)

What's so special about Java 8 stream API

Java 8 added functional programming features, and one of the major API additions is the stream.

A mechanical analogy is a car-manufacturing line, where a stream of cars is queued between processing stations. Each station takes a car, performs some modification or operation, and then passes it to the next station for further processing.


The main benefit of the stream API is that in Java 8 you can program at a higher level of abstraction: you transform a stream of one type into a stream of another type rather than processing one item at a time (using a for loop or iterator). Because of this, Java 8 can run a pipeline of stream operations on several CPU cores, each working on a different part of the input. This way you get parallelism almost for free, instead of doing the hard work with threads and locks.
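
As a small illustration of that shift, the sketch below turns a stream of strings into a stream of integers, first with an explicit loop and then with a pipeline, and finally runs a similar pipeline in parallel. The class name `StreamPipelineSketch` and the sample words are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamPipelineSketch {
    // Item-at-a-time processing with an explicit loop.
    static List<Integer> lengthsWithLoop(List<String> words) {
        List<Integer> lengths = new ArrayList<>();
        for (String word : words) {
            lengths.add(word.length());
        }
        return lengths;
    }

    // The same transformation described as a pipeline: Stream<String> -> Stream<Integer>.
    static List<Integer> lengthsWithStream(List<String> words) {
        return words.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    // Switching to parallelStream() lets the runtime split the input across cores;
    // no threads or locks appear in the code.
    static int totalLengthParallel(List<String> words) {
        return words.parallelStream()
                .mapToInt(String::length)
                .sum();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("car", "engine", "wheel");
        System.out.println(lengthsWithLoop(words));
        System.out.println(lengthsWithStream(words));
        System.out.println(totalLengthParallel(words));
    }
}
```

Parallel execution only pays off for large inputs or expensive operations; for a three-element list it is shown purely to illustrate how little the code changes.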

Stream focuses on partitioning the data rather than coordinating access to it. 

Stream vs Collection


A collection is mostly about storing and accessing data, whereas a stream is mostly about describing computation on data.
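
One way to see the distinction is that a stream pipeline is only a description until a terminal operation runs it. In the sketch below (class and variable names are illustrative), the list holds the data, while the stream built from it examines no element until `collect` is called.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamVsCollectionSketch {
    // Returns {elements examined before collect, elements examined after, result size}.
    static int[] demo() {
        // The collection stores the data.
        List<Integer> stored = Arrays.asList(1, 2, 3, 4);
        AtomicInteger examined = new AtomicInteger();

        // The stream only describes a computation: keep even numbers, then square them.
        Stream<Integer> evenSquares = stored.stream()
                .filter(n -> {
                    examined.incrementAndGet(); // counts how many elements were looked at
                    return n % 2 == 0;
                })
                .map(n -> n * n);

        int before = examined.get(); // nothing has run yet
        List<Integer> result = evenSquares.collect(Collectors.toList()); // runs the pipeline
        return new int[] { before, examined.get(), result.size() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```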




Wednesday, August 9, 2017

Replication Factor in Couchbase

One of the core requirements for distributed DBs is to be as highly available as possible. What this means is that even if one or more nodes go down, the DB should keep functioning (on its own or with minimal intervention). This is possible only if there are backup copies of the data.

Replication factor controls the number of replicas, or backup copies, of an item/data/document stored in a DB. The general rule is to have one replica for each node that can fail in the cluster.

Let's check how one of the famous NoSQL distributed DBs handles replication factor.

Couchbase


The default replication factor in Couchbase is 1 (if replication is enabled). The drop-down field (as shown below) has only 3 values, i.e. 1, 2 and 3. Practically, it doesn't make sense to have a replication factor of more than 3, no matter how large your cluster is.

Note that with only one node, enabling replicas does not give you a second copy: Couchbase does not place a replica on the same node as the original, so replica copies exist only once there are enough nodes. Once you add more nodes to the cluster and rebalance, originals and replicas get distributed automatically.


Recommendation:

Number of Nodes <= 5 - RF = 1
5 < Number of Nodes <= 10 - RF = 2
Number of Nodes > 10 - RF = 3
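
The recommendation above can be written as a small helper. The class and method names are hypothetical, and treating a cluster of exactly 5 nodes as RF = 1 is an assumption about where the boundary falls.

```java
// A sketch of the rule of thumb above; not part of any Couchbase API.
class ReplicationFactorSketch {
    static int recommendedReplicas(int dataNodes) {
        if (dataNodes < 1) {
            throw new IllegalArgumentException("a cluster needs at least one data node");
        }
        if (dataNodes <= 5) {
            return 1; // assumption: exactly 5 nodes falls in the RF = 1 bracket
        }
        if (dataNodes <= 10) {
            return 2;
        }
        return 3; // Couchbase supports at most 3 replicas
    }

    public static void main(String[] args) {
        for (int nodes : new int[] { 3, 5, 6, 10, 11, 40 }) {
            System.out.println(nodes + " data nodes -> RF " + recommendedReplicas(nodes));
        }
    }
}
```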

The number of nodes mentioned above counts only data nodes if you are using Multi-Dimensional Scaling (MDS). If you are not using MDS, the above rule should still hold.

In the event of failure we can fail over (manually or automatically) to replicas. 
  • In a 5-node cluster with 1 replica, if one node goes down the cluster can fail it over. Now, before the failed node is back up, what if another node goes down? You are out of luck. You will have to add another node to the cluster.
  • After a node goes down and is failed over, try to replace that node ASAP and perform a rebalance. Rebalance creates the replica copies if there are enough nodes available.


References