The combination of hardware nodes and Nutanix's software makes up the Nutanix Distributed File System (NDFS). At the heart of each cluster node is the Nutanix Controller Virtual Machine. This virtual machine is hypervisor-specific (Nutanix offers different versions tuned for vSphere, Hyper-V, or KVM), and it handles all communication between server nodes and all of the services running as part of NDFS. In other words, the Controller VM both manages the cluster and serves as the central data store for the hypervisor and its guest VMs.
Figure 1: The Nutanix Virtual Computing Platform architecture
Figure 1 above shows the interconnections between some of the key software pieces in the Controller VM. Like node, disk, and network failures, controller failures are detected automatically. NDFS handles controller outages by redirecting I/Os to other Controller VMs in the cluster.
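The redirection described above can be pictured with a small sketch. This is an illustration only, not Nutanix code: the ControllerVM class, the healthy flag, and the pick_controller function are hypothetical names invented here, and the rule "prefer the local controller, otherwise use the first healthy peer" is an assumption about how such a fallback might look.

```python
from dataclasses import dataclass


@dataclass
class ControllerVM:
    """Hypothetical stand-in for a Controller VM and its health state."""
    name: str
    healthy: bool = True


def pick_controller(local, peers):
    """Return the local controller if it is up, otherwise the first healthy peer."""
    if local.healthy:
        return local
    for peer in peers:
        if peer.healthy:
            return peer
    raise RuntimeError("no healthy Controller VM available")


cvm_a = ControllerVM("cvm-a")
cvm_b = ControllerVM("cvm-b")
cvm_c = ControllerVM("cvm-c")

# Simulate an outage of the local controller; I/O is redirected to a peer.
cvm_a.healthy = False
target = pick_controller(cvm_a, [cvm_b, cvm_c])
print(target.name)
```

A real implementation would also have to detect the failure and move I/O back once the local controller recovers; the sketch covers only the selection step.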
At the center is the Curator, a MapReduce-based cluster management application that handles the distribution of tasks (disk balancing, proactive scrubbing, and so on) throughout the cluster. It's controlled by an elected Curator Master, which serves as the task and job delegation manager.
Stargate is the primary data I/O manager. It communicates using NFS, iSCSI, or SMB and handles all the storage requests from the hypervisor. Medusa is a distributed metadata store based on Apache Cassandra that utilizes the Paxos algorithm to enforce strict consistency across all nodes.
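Paxos-style protocols rest on majority agreement: a value counts as chosen only once more than half of the nodes have accepted it, so any two majorities share at least one node. The sketch below shows only that quorum arithmetic, not the Paxos protocol itself and not Medusa's implementation; the function names are made up for illustration.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def is_chosen(acks, cluster_size):
    """True if enough nodes accepted the update for it to count as chosen."""
    return acks >= majority(cluster_size)


# Illustrative cluster sizes only.
for size in (3, 4, 5):
    print(size, "nodes need", majority(size), "acceptances")
```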
Prism is the management gateway for configuring and monitoring the entire Nutanix cluster. It elects a leader in a similar fashion to the other components. Access to the management system is available via an HTML5-based Web interface, a console-like CLI, and a REST-based API.
Zeus is a cluster configuration manager based on Apache ZooKeeper. Its leader node receives and forwards all requests for configuration changes. Should the leader fail, the Zeus services running on the other nodes will elect a new one.
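One common way to elect a leader on top of ZooKeeper is for each participant to hold a sequence number, with the lowest live number winning. Whether Zeus uses exactly this recipe is not stated here, so treat the following as a sketch of the general idea; the node names, sequence numbers, and the elect_leader function are hypothetical.

```python
def elect_leader(live_nodes):
    """Pick the live node with the lowest sequence number.

    live_nodes maps a node name to the sequence number it was assigned
    when it joined the election.
    """
    if not live_nodes:
        raise ValueError("no live nodes to elect from")
    return min(live_nodes, key=live_nodes.get)


live = {"node-1": 3, "node-2": 1, "node-3": 2}
leader = elect_leader(live)

# Simulate the leader failing: it drops out and the rest elect a new one.
del live[leader]
new_leader = elect_leader(live)
print(leader, "->", new_leader)
```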
Other components include Chronos for job and task scheduling, Cerebro for handling replication and disaster recovery, and Pithos for managing virtual disk configuration data.
All writes to disk are synchronously replicated before being acknowledged, to guard against disk or node failures. The majority of disk write operations funnel through the SSD-based OpLog, a log of disk operations that in effect serves as a very fast persistent store for writes. For read operations, there's a Content Cache located in local memory and on the SSD. If a specific disk fragment can't be found in the Content Cache, it will be located and retrieved from disk.
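The write and read paths above can be sketched in a few lines. This is a toy model under stated assumptions, not the NDFS implementation: the Node class and the replicated_write, drain_oplog, and read functions are hypothetical, in-memory lists and dictionaries stand in for the SSD and disk tiers, and the step that drains the log to disk is an assumption added so the read path has something to find.

```python
class Node:
    """Hypothetical node with a write log, a read cache, and a disk tier."""

    def __init__(self, name):
        self.name = name
        self.oplog = []   # stands in for the SSD-based OpLog
        self.cache = {}   # stands in for the Content Cache
        self.disk = {}    # stands in for the slower disk tier


def replicated_write(local, replicas, key, data):
    """Log the write locally and on every replica, then acknowledge."""
    for node in [local] + replicas:
        node.oplog.append((key, data))
    return True  # acknowledged only after every copy is logged


def drain_oplog(node):
    """Assumed background step: apply logged writes to the disk tier."""
    for key, data in node.oplog:
        node.disk[key] = data
    node.oplog.clear()


def read(node, key):
    """Serve from the cache when possible; otherwise fetch from disk and cache it."""
    if key in node.cache:
        return node.cache[key], "cache"
    data = node.disk[key]
    node.cache[key] = data
    return data, "disk"


a, b = Node("a"), Node("b")
acked = replicated_write(a, [b], "blk1", b"hello")
drain_oplog(a)
first = read(a, "blk1")    # cache miss, served from disk
second = read(a, "blk1")   # cache hit
print(acked, first[1], second[1])
```

The point of the sketch is the ordering: the acknowledgment comes after every copy is logged, and a read touches disk only on a cache miss.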
A virtual machine running on a node uses the resources of that node exclusively, although disk write operations get distributed across the cluster. Guest VMs see the local Controller VM as the central data store for virtual disks; as VMs migrate from node to node, the I/O moves from one Controller VM to another. Thus as VMware's Distributed Resource Scheduler or Microsoft's System Center tools distribute the VM load across the cluster, the storage load is balanced across the Controller VMs. All internode communication takes place over a 10Gb Ethernet network, which means you'll need a 10GbE switch to connect the nodes together.