24/7 Solutions - From fault-tolerant systems to fully redundant networks.
Practically all serious server equipment implements some form of fault tolerance, that is, the ability to stay
operational and preserve data when power fails or a component breaks. Modern servers and disk arrays can use redundant
power supplies and fans, duplicated controllers and processors, and mirroring technologies, both virtual and physical,
for memory (RAM and persistent storage). These measures provide reliability at the level of 99%, which corresponds to
3.65 days of downtime per year. That is quite a good figure. But the acceptable downtime of information services depends
on the specific business and is different for each enterprise. It ranges from several minutes to several hours, and the
agreed figure is usually fixed in a Service Level Agreement (SLA). For sectors such as banking and finance,
telecommunications, industry, or scientific research, an SLA that allows 3.65 days of downtime per year is not sufficient.
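The relation between an availability percentage and yearly downtime is simple arithmetic, and a short sketch makes it easy to check other levels. The function name and the choice of a 365-day year are assumptions made for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year, which is what the 3.65-day figure implies


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, that an availability level allows."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(level)
        print(f"{level}%: {hours / 24:.2f} days, or {hours * 60:.0f} minutes, per year")
```

For 99% the function gives 87.6 hours, which is the 3.65 days quoted above.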
Therefore, solutions that can be classified as fully redundant networks are used more and more often. Unlike
fault-tolerant systems, they stay operational not only when individual components of a system are damaged, but also
after multiple failures and even after the failure of an entire subsystem (node).
Fully redundant networks are built on the same principles as fault-tolerant systems. The difference is that designers
work not with hardware components inside a piece of equipment, but with whole servers, data stores, and
telecommunication equipment. As a result, solutions built on these concepts can guarantee practically any required
level of availability of information systems.
One of the most effective ways to build a fully redundant network is to move the main data store out of the central
element of the computing system. As a rule, Storage Area Networks (SAN) are used for this purpose. In a SAN the storage
devices (usually monolithic disk arrays) are joined in their own network, isolated from the LAN. Such a network can be
spread over an area of several kilometers, which further increases the reliability of the system by reducing the threat
from building destruction, earthquakes, floods, and other natural disasters.
SAN technology is the most suitable for building fully redundant networks. A SAN scales well because it is implemented
as a dedicated network: storage systems can be added freely without reconfiguring the applications they serve. However,
Storage Area Networks have a number of drawbacks. Because a SAN works on the principle of a point-to-point connection
between a storage server and its disks, the network loses its integrity when that server fails. Redundant communication
channels, which are usually used in fully redundant networks, help to prevent this situation. But this alone is not
enough to reach satisfactory SLA figures. Laying network cable and expanding and servicing a storage area network
require additional investment, and funds must also be allocated to keep the system running. Besides, the SAN storage
itself is exposed to a wide variety of threats. It is no accident that backup systems are becoming increasingly
important today.
Within the overall data storage system, backup is a service subsystem and a mandatory component for providing high
availability. It makes it possible to restore information services even when data have been damaged. A centralized
backup system reduces the total cost of ownership of the IT infrastructure through optimal use of equipment and lower
administration costs. Such a system has a multilevel architecture that includes:
- A backup management server (it can also act as a data copy server);
- One or more data copy servers, to which the backup storage devices are connected;
- Client computers with backup software agents installed on them;
- The backup system administrator's console.
In this scheme the system administrator maintains the list of backup clients, recording devices, and storage media, and
also draws up the backup schedule. This information is kept in a special database stored on the backup management
server. According to the schedule, or on a command from the operator, the management server instructs the software
agent to start copying data according to the chosen policy. The agent begins collecting the data to be backed up and
transfers them to the copy server specified by the management server. That server saves the received data on the backup
storage device connected to it. Information about the process is recorded in the management server's database so that
the data can be found quickly if they need to be restored.
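The flow just described (management server, agent, copy server, catalog) can be sketched in a few lines. This is a simplified in-memory model, not the interface of any real backup product: the class names, the dictionary standing in for a backup device, and the policy expressed as a set of paths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CopyServer:
    """Saves received data on the backup device attached to it."""
    name: str
    device: dict = field(default_factory=dict)  # hypothetical stand-in for tape or disk

    def store(self, client: str, data: bytes) -> str:
        key = f"{client}#{len(self.device)}"
        self.device[key] = data
        return key


@dataclass
class Agent:
    """Software agent installed on a client computer."""
    client: str
    files: dict  # path -> content

    def run_backup(self, policy: set, target: CopyServer) -> str:
        # Collect only the data the policy selects, then ship it to the copy server.
        selected = {path: body for path, body in self.files.items() if path in policy}
        payload = repr(sorted(selected.items())).encode()
        return target.store(self.client, payload)


@dataclass
class ManagementServer:
    """Starts backups and records where each copy went."""
    catalog: list = field(default_factory=list)

    def start_backup(self, agent: Agent, policy: set, target: CopyServer) -> str:
        key = agent.run_backup(policy, target)
        self.catalog.append({"client": agent.client, "server": target.name, "key": key})
        return key

    def locate(self, client: str) -> list:
        """Look up catalog entries so that data can be found for a restore."""
        return [entry for entry in self.catalog if entry["client"] == client]
```

A management server would call start_backup on a schedule or on an operator's command; locate plays the role of the catalog lookup that precedes a restore.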
To keep the saved data consistent, they must not change while they are being collected and copied. Therefore, before
the procedure begins, the application on the client computer should complete all transactions, flush the contents of
its cache to disk, and suspend its work. These actions are initiated on a command from the software agent.
The backup system is a service system, and the load it places on computing resources does no useful work, so it is
desirable to reduce that load. The problem splits into two parts: shortening the so-called "backup window" (the time
during which the client computer performs the backup) and reducing the corresponding data traffic on the corporate LAN.
Building the backup system into the storage system makes it possible to shorten the window through integration with the
point-in-time (PIT) copy facilities implemented in modern disk arrays: a snapshot of the data is taken almost instantly,
the backup is then made from this snapshot, and the server continues working. To reduce the load on the local network,
the LAN-free backup and Serverless backup technologies offered by storage area networks will help, which is one more
confirmation of how well this technology suits fully redundant networks.
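The idea of shortening the backup window with a point-in-time copy can be illustrated as follows: the application is paused only long enough to flush its cache and take the snapshot, and the slow copy then reads from the snapshot. This is a minimal sketch; the Application class and the use of copy.deepcopy as a stand-in for a disk array's PIT facility are assumptions, and a real array would create the snapshot without copying the data.

```python
import copy
from contextlib import contextmanager


class Application:
    """Hypothetical client application with a write cache."""

    def __init__(self):
        self.data = {}    # contents "on disk"
        self.cache = {}   # writes not yet flushed
        self.suspended = False

    def write(self, key, value):
        if self.suspended:
            raise RuntimeError("application is suspended")
        self.cache[key] = value

    def flush(self):
        self.data.update(self.cache)
        self.cache.clear()


@contextmanager
def quiesced(app):
    """Flush the cache and suspend the application, resuming it on exit."""
    app.flush()
    app.suspended = True
    try:
        yield app
    finally:
        app.suspended = False


def snapshot_backup(app):
    """Back up from a snapshot so the application is paused only briefly."""
    with quiesced(app):
        snapshot = copy.deepcopy(app.data)  # stand-in for the array's PIT copy
    # The application is running again here; the backup reads the frozen snapshot.
    return dict(snapshot)
```

Writes made after the snapshot do not affect the backup, which is what keeps the copy consistent while the server continues working.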
If the enterprise has a backup data center, or plans to build one, the backup system must be integrated with it. Moving
to the use of a backup center changes the data protection and storage policy and the operating conditions, and is
frequently accompanied by modernization of the existing backup system. In particular, the computing resources of the
backup center make it possible to carry out the mandatory restore testing of backup copies, offloading the computing
resources of the main center and simplifying the whole procedure. Duplicates of backup copies can also be stored in the
backup center instead of in a remote off-site storage facility.
Building clusters
When, besides high fault tolerance, strict requirements are placed on the availability and performance of the computing
system, clusters are the most relevant solution. Cluster solutions cost much more than a backup server or a network
segment dedicated to network storage. But only with their help is it possible to raise uptime to 99.99%, which
corresponds to less than an hour (about 53 minutes) of system downtime per year.
Before moving on to typical clustering schemes, let us specify what a cluster is. In practice the term "cluster" has
many definitions. Some manufacturers count NUMA (and similar) systems, massively parallel systems, and sometimes even
Symmetric Multiprocessing (SMP) systems as clusters. Moreover, some manufacturers regard fault tolerance as paramount,
others scalability, others manageability, and still others maximum performance. The simplest definition of a cluster is
based on a hardware feature of its implementation and was formulated by Digital Equipment Corporation: a cluster is a
kind of parallel or distributed system that consists of several interconnected computers and is used as a single,
unified computing resource.
Each node of a cluster runs its own copy of the operating system (we use FreeBSD). Systems such as SMMP, by contrast,
have one shared copy of the OS, and this is an exception to the rule. A cluster node can be either a uniprocessor or a
multiprocessor computer, and within a single DMMPS-cluster the computers can have different configurations. Cluster
nodes are joined to each other by ordinary network connections. The intra-cluster connections let nodes interact with
each other independently of the external network environment. Over the intra-cluster channels nodes not only exchange
data but also monitor each other's health.
The main goals of a cluster oriented toward maximum reliability are a high level of availability (in other words,
readiness), a high degree of scalability, and easier administration compared with an isolated set of computers or
servers.
Clusters should tolerate single failures of components (both hardware and software). In general, when any node fails,
its network services or applications are automatically transferred to other nodes. When the failed node is restored,
the applications can be transferred back to it.
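Failover and the optional transfer back can be modeled with a small bookkeeping class. This is a sketch under simplifying assumptions: the Cluster class, its method names, and the rule of moving services to the first surviving node are hypothetical, and real cluster software would detect failures itself through the health monitoring on the intra-cluster channels.

```python
class Cluster:
    """Tracks which node currently runs each service."""

    def __init__(self, nodes):
        self.alive = {node: True for node in nodes}
        self.home = {}       # service -> node it normally runs on
        self.placement = {}  # service -> node it runs on right now

    def add_service(self, service, node):
        self.home[service] = node
        self.placement[service] = node

    def node_failed(self, node):
        """Move every service off the failed node to a surviving one."""
        self.alive[node] = False
        survivors = [n for n, up in self.alive.items() if up]
        if not survivors:
            raise RuntimeError("no surviving nodes to take over")
        for service, current in self.placement.items():
            if current == node:
                self.placement[service] = survivors[0]

    def node_restored(self, node, fail_back=True):
        """Mark the node healthy and optionally return its services to it."""
        self.alive[node] = True
        if fail_back:
            for service, home in self.home.items():
                if home == node:
                    self.placement[service] = node
```

Keeping the home node separate from the current placement is what makes the transfer back possible once the failed node has been repaired.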
Clusters are classified by reliability not according to the nodes' access to main memory (as with high-performance
clusters), but according to their access to input/output devices and, first of all, disks.
Maximum-reliability clusters are used wherever the cost of possible downtime exceeds the cost of building a cluster
system:
- Billing systems;
- Bank operations;
- Electronic commerce;
- Business operations.
Like backup systems, cluster systems can be built with shared disks or without them. The shared disk cluster concept
means that any node has transparent access to any file system in the common disk space. In addition to the shared disk
subsystem, the cluster nodes also have local disks, but in this case they are used mainly for booting the OS on the
node. Such a cluster must have a special subsystem called the Distributed Lock Manager, which eliminates conflicts when
different cluster nodes write to files simultaneously.
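The role of a lock manager can be shown with a toy model in which a node must hold the write lock on a file before writing to it. This single-process sketch only illustrates the bookkeeping; the class and method names are assumptions, and a real Distributed Lock Manager coordinates the same decisions between nodes over the cluster network.

```python
class LockManager:
    """Grants at most one node the write lock on each file."""

    def __init__(self):
        self._owners = {}  # file path -> node holding the write lock

    def acquire(self, node, path):
        """Return True if the node now holds the lock, False if another node does."""
        owner = self._owners.get(path)
        if owner is None or owner == node:
            self._owners[path] = node
            return True
        return False

    def release(self, node, path):
        if self._owners.get(path) != node:
            raise RuntimeError(f"{node} does not hold the lock on {path}")
        del self._owners[path]

    def release_all(self, node):
        """Drop every lock held by a node, for example after that node fails."""
        for path in [p for p, owner in self._owners.items() if owner == node]:
            del self._owners[path]
```

A second node that asks for a lock already held gets False back and has to wait or retry, which is how simultaneous writes to the same file are kept from conflicting.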
Shared nothing clusters have no common input/output devices. The absence of shared disks is meant at the logical, not
the physical, level: the disk subsystem can be connected to all nodes at once. If the disk subsystem holds several file
systems (or logical/physical disks), then at any given moment access to a particular file system is granted to only one
node, while access to another file system is granted to another node. This administrative hierarchy of system resources
gives the greatest possible protection against the failure of any component that contributes to the cluster's
reliability. When one cluster server fails, control passes immediately to another, and the damaged server is
disconnected.
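The logical ownership rule of a shared nothing cluster can be sketched as a table that maps each file system to exactly one node, with ownership reassigned when a node fails. The class name, the method names, and the explicit choice of a takeover node are assumptions made for this illustration.

```python
class SharedNothingCluster:
    """Each file system is accessible to exactly one node at a time."""

    def __init__(self, nodes, ownership):
        self.online = set(nodes)
        self.ownership = dict(ownership)  # file system -> owning node

    def can_access(self, node, filesystem):
        return node in self.online and self.ownership.get(filesystem) == node

    def fail(self, node, takeover):
        """Disconnect a failed node and hand its file systems to another node."""
        if takeover == node or takeover not in self.online:
            raise ValueError("takeover node must be a different online node")
        self.online.discard(node)
        for filesystem, owner in self.ownership.items():
            if owner == node:
                self.ownership[filesystem] = takeover
```

Although every node may be cabled to the same disk subsystem, can_access returns True for only one node per file system, which is the sense in which nothing is shared.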