geekRai: November 2016

Saturday, November 19, 2016

How Couchbase identifies Node for Data Access

This post talks about how Couchbase identifies where exactly the document is stored to facilitate quick read/update.

Couchbase uses a term Bucket which is equivalent to the term Database in the relational world to logically partition all documents across the cluster (or data nodes). Being a distributed DB, it tries to evenly distribute or partition (or shard) data into virtual buckets known as vBuckets. Each vbucket owns a subset of the keys or document id (and of course corresponding data as well). Documents get mapped to vBucket by applying the hash function on the key. Once the vBucket is identified there is a separate lookup table to know which node hosts the vBucket. The thing which maps different virtual buckets to nodes is known as vBucket map. (Note: Cluster Map contains the mapping of which service belong to which node at any given point of time)

Steps Involved (as shown in the diagram):

Hash(key) to get vBucket identifier (vb-k) which hosts/owns Key.
Looking up vBucket map tells vb-k is owned by node or server n-t
The request is sent directly to the primary server node, n-t to fetch the document.

Both hashing function as well number of vBucket is configurable. By default, Couchbase automatically divides each bucket into 1024 active vBuckets and 1024 replica buckets (per replica). When there is only one node, all vBuckets reside on that node.

What if a new node gets added to the cluster?

When the number of nodes scales up or down the information stored in vBuckets are re-distributed among the available nodes and then the corresponding vBucket map is also updated. This entire process is known as rebalancing. Rebalancing doesn't happen automatically; as an administrator/developer you need to trigger it either from UI or through CLI.

What if the primary node is down?

All read/update request by default go to the primary node. So, if a primary node fails for some reason, Couchbase takes off that node from the cluster (if configured to do so) and promotes the replica to become the primary node. So you can fail-over to replica node manually or automatically. You can address the issue with the node, fix it and add it back to the cluster by performing rebalancing. Below table shows a sample mapping.


+------------+---------+---------+
| vbucket id | active  | replica |

|     0      | node A  | node B  |
+------------+---------+---------+
|     1      | node B  | node C  |

+------------+---------+---------+
|     2      | node C  | node D  |

|     3      | node D  | node A  |

References:

http://docs.couchbase.com/admin/admin/Concepts/concept-vBucket.html

http://docs.couchbase.com/admin/admin/Tasks/tasks-rebalance.html

http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/Technical-Whitepaper-Couchbase-Server-vBuckets.pdf

http://dustin.sallings.org/2010/06/29/memcached-vbuckets.html

https://medium.com/@sent0hil/consistent-hashing-a-guide-go-implementation-fe3421ac3e8f

---

happy learning !!!

Wednesday, November 16, 2016

Why Multi-Dimensional Scaling in Couchbase

Couchbase has been supporting horizontal scaling in a monolithic fashion since its inception. You keep adding more nodes to the cluster to scale and improve performance (all nodes being exactly the same). This single dimension scaling works to a great extent as all services - Query, Index, and Data scale at the same rate. But, they all are unique and have specific resource requirement.

Let's profile these services in detail and their specific hardware requirements to drive home the point - why MDS is required!

This feature got added in Couchbase 4.0.

Query Service primarily executes Couchbase native queries, N1QL(similar to SQL, pronounced as nickel - leverages the flexibility of JSON and power of SQL). The query engine parses the query, generates execution plan and then executes the query in collaboration with index service and data service. The faster queries are executed, the better the performance.

Faster query processing requires more CPU or fast processor (and less memory & HDD). More cores will help in processing queries in parallel.

Reference on - Query Data with n1ql

Index Service performs indexing with Global Secondary Indexes (GSI - similar to B+tree used commonly in relational DBs). Index is a data structure which provides quick and efficient means to access data. Index service creates and maintains secondary indexes and also performs an index scan for N1QL queries. GSI/indexes are global across the cluster and are defined using CREATE INDEX statement in N1QL.

Index service is disk intensive so Optimized storage / SSD will help in boosting performance. It needs a basic processor and less RAM/memory. As an administrator, you can configure GSI with either the standard GSI storage, which uses ForestDB underneath (since version 4.0), for indexes that cannot fit in memory or can pick the memory optimized GSI for faster in-memory indexing and queries.

Data Service is central for Couchbase as data is the reason for any DB. It stores all data and handles all fetch and update requests on data. Data service is also responsible for creating and managing MapReduce views. Active documents that exist only on the disk take much longer to access, which creates a bottleneck for both reading and writing data. Couchbase tries to keep as much data as possible in memory.
Data refers to (document) keys, metadata, and the working set or the actual document. Couchbase relies on extensive caching to achieve high throughput and low read/write latency. In a perfect world, all data will be sitting in memory.

Data Service: Managed Cache (based on Memcached) + Storage Engine + View Query Engine

Memory and the speed of storage device affect performance (IO operations are queued by the server so faster storage helps to drain the queue faster).

Why MDS

So, each type of service has it's own resource constraints. Couchbase introduced multi-dimensional scaling in version 4.0 so that these services can be independently optimized and assigned the kind of hardware which will help them excel. One size fits all is not going to work (especially when you are looking for higher throughput i.e. sub-milliseconds response times). For example, storing data and executing queries on the same node will cause CPU contention. Similarly, storing data and indexes on the same node will cause disk IO contention.

http://blog.couchbase.com/introducing-multi-dimensional-scaling

Through MDS, we can separate, isolate and scale these three services independent of each other which will improve resource utilization as well as performance.

http://developer.couchbase.com/documentation/server/4.5/concepts/distributed-data-management.html

References
http://www.couchbase.com/multi-dimensional-scalability-overview
http://developer.couchbase.com/documentation/server/4.5/concepts/distributed-data-management.html

---
happy learning !!!

Friday, November 11, 2016

IP Address in Private Range

Ah, again; I forgot the range of private IP address. So, no more cursing of my memory. Now...instead of googling I will search on my blog :D

Below are permissible private IP ranges:

10.0.0.0 - 10.255.255.255
172.16.0.0 - 172.31.255.255
192.168.0.0 - 192.168.255.255

These IP address are used for computers inside a network which needs to access inside resources. Routers inside the private network can route traffic between these private addresses. However, if they want to access resource in outside world (like internet) these network entities have to have a public address in order for response to reach to them. This is where NATing is used.

Representing IP address in CIDR format

The number following the forward slash (/) is the prefix length, the number of shared initial bits, counting from the most significant bit of the address.

Thus, a /20 block is a CIDR block with an unspecified 20-bit prefix.

In below example, 10.0 is the network address and last two segments are device addresses.

Reference: http://whatismyipaddress.com/nat

geekRai