Generated from ASCE 3.5 on 9.10.2007 at 16:10:04
Author: Ilkka Norros, Pirkko Kuusela
Version: version available for IPLU-II management committee
Description: IPLU-II project case study, Spring - September 2007
This is the concept of dependability in the traditional sense of reliability analysis: the system is divided into components, and a structure function combines the component states into the state of the system. Components can in turn be studied as systems.
The reliability claims for the Funet network are:
- The network has high technical and structural reliability.
- The risks related to outsourcing are recognized and minimized.
The network is well maintained.
The risks associated with outsourcing are noticed and minimized.
Ficora Regulation 41 on technical documentation is fulfilled.
Control measures that are in principle possible in an IP network are available for the operator and actions can be taken promptly when needed.
All aspects of network invulnerability are recognized and adequately managed.
The network relies on the inherent robustness of basic IP protocols.
The structure function in the sense of reliability analysis needs to be defined, in principle, for each task one by one using the network topology and routing rules. Some examples are:
- each node works, the network is connected and it has connections to the outside world (i.e., it is connected to Ficix and NorduNet).
- one component is down, but the rest of the network is connected and connected to the outside world.
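As a minimal sketch of such a structure function for a single task, the Python snippet below evaluates whether a customer site can reach Ficix. The component names and the two path sets are hypothetical and are not taken from the Funet topology; a real structure function would be derived from the actual topology and routing rules.

```python
# Hypothetical minimal path sets for the task "site A reaches Ficix".
# Each path set lists the components that must all work for that path.
PATH_SETS = [
    {"siteA", "r1", "link_r1_ficix"},
    {"siteA", "r2", "link_r2_ficix"},
]

def structure_function(path_sets, state):
    """Return 1 if at least one path has all of its components up, else 0.

    state maps each component name to True (working) or False (failed).
    """
    return int(any(all(state[c] for c in path) for path in path_sets))

def all_up(path_sets):
    """Build a state in which every component mentioned in the paths works."""
    return {c: True for path in path_sets for c in path}

state = all_up(PATH_SETS)
state["r1"] = False  # one component down; the second path is still intact
print(structure_function(PATH_SETS, state))
```

Writing the function in terms of path sets keeps it task-specific: a different task (for example, reaching NorduNet) would get its own list of paths.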
Structure claims used in this study are:
- Requirements set in Ficora (Finnish Communications Regulatory Authority) regulation 27 are fulfilled.
The protocol software used is reliable.
Software bugs found will be fixed promptly or can be managed in some other ways.
The network is dimensioned with the objective that it should never be congested, even after the loss of a single core link or router.
On primary connections, link loads are at most 50%; the desired load in a normal situation is 30-40%.
Secondary connections have the same capacities as the primary connections.
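One way to read the dimensioning rule is as simple failover arithmetic: if a link fails and all of its traffic moves to a link of equal capacity, the surviving link carries the sum of the two loads. The sketch below works through that case. It assumes that the whole load shifts to a single surviving link, which is an illustrative simplification and not a statement about Funet routing.

```python
def failover_utilisation(load_failed, load_surviving,
                         cap_failed=1.0, cap_surviving=1.0):
    """Utilisation of the surviving link after it takes over all traffic.

    Loads are fractions of each link's own capacity. Capacities are in the
    same arbitrary unit and equal by default, as for primary and secondary
    connections of the same capacity.
    """
    carried = load_failed * cap_failed + load_surviving * cap_surviving
    return carried / cap_surviving

# Loads of 30 %, 40 % and 50 % on both links before the failure.
for load in (0.3, 0.4, 0.5):
    print(load, failover_utilisation(load, load))
```

Under this assumption a 50% ceiling is the largest load for which the surviving link is not driven above its capacity, and the desired 30-40% leaves additional headroom.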
Requirements in Ficora regulations 13 and 50 on dimensioning of the network are fulfilled.
Problems in network operation can be noticed and fixed fast enough and without degradation of the customer quality.
Requirements in Ficora Regulation 50 on network operation are fulfilled.
The network has implemented control measures against overload and can overcome overload situations.
Requirements in Ficora regulations 13 and 50 on network load are fulfilled.
The traffic can be filtered effectively when needed.
Requirements in Ficora regulation 13 are fulfilled.
The physical security of the network is high.
Requirements in Ficora regulation 48 on physical protection of network are fulfilled.
Requirements in Ficora regulation 43 on electrical protection of network are fulfilled.
Information security is high, and it can be updated effectively when a security hole is detected.
Announcements on data security and protection are monitored continuously and actions are taken promptly when needed.
Requirements of Ficora regulation 13 on Internet information security and functionality are fulfilled.
Requirements of Ficora regulation 47 on information security are fulfilled.
Routers are physically very reliable.
However, no numerical criteria for router reliability have been set.
The reliability of transmission links has been secured adequately: more than one operator is used, repair and maintenance of links are prompt, and the SLAs contain requirements for repair and maintenance quality.
The transmission link operator is reliable. It has proper monitoring of links, good practices in operations, and its staff is experienced. Problems in transmission links are noticed and repaired fast enough.
Good practices are followed in router software updates.
The functionality of new software versions is verified before taking them into use.
The network is constantly monitored and, in case of a problem, corrective actions are started promptly.
Adequate tools are implemented for network monitoring and possible problems in the network can be detected soon enough.
Requirements in Ficora regulations on monitoring are met.
Networking faults can be localized effectively and actions to correct the faults are started promptly.
Requirements in Ficora regulations on actions and practices are met.
Operating staff is adequate and no single person is critical for the maintainability of the network.
The operating staff has sufficient qualifications for operating the network and for implementing and developing new functionalities.
The staff size is large enough to operate the network and improve practices.
The workload of each employee is adequate and human errors are at their minimum level.
The network is divided into domains that implement OSPF routing. One of them is the six-node core; the others connect to that core over BGP. After link or node losses, routing is restored automatically.
The TCP protocol itself has a feedback mechanism and can adapt to the level of congestion in the network.
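The feedback principle can be illustrated with a toy additive-increase, multiplicative-decrease (AIMD) loop. This is a deliberately simplified model and not an implementation of any real TCP variant; the function name, the capacity value and the increase and decrease parameters are assumptions chosen for illustration.

```python
def aimd(capacity, rounds, cwnd=1.0, increase=1.0, decrease=0.5):
    """Simulate a sender's window over a number of round trips.

    While the window fits within the (hypothetical) path capacity the sender
    grows it additively; when it exceeds the capacity, which stands in for a
    loss signal, the sender cuts it multiplicatively.
    """
    history = []
    for _ in range(rounds):
        if cwnd > capacity:
            cwnd = max(1.0, cwnd * decrease)  # back off on congestion
        else:
            cwnd += increase                  # probe for more bandwidth
        history.append(cwnd)
    return history

print(aimd(capacity=10, rounds=30))
```

The window oscillates just below and around the capacity instead of growing without bound, which is the adaptation to congestion that the claim refers to.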
The amount of real-time traffic, where TCP is usually not used (and could not provide the quality needed) is not significant.
The IP protocol itself is robust and flexible.
Routers and link capacities are updated according to traffic growth.
The upgrade frequency has slowed down somewhat in recent years. Traffic in Funet increases at a slower rate than that of other Finnish operators. Funet traffic is also considered highly predictable.
Juniper releases new router software versions 3-4 times per year. New versions are installed only when they include significant extensions or fix security holes or other serious software errors. Updates are installed about once a year.
CSC has no separate router for testing, but Juniper can do testing for them. New versions then go directly into production use.
Funet seldom has the very latest software version in use, unless it is required for information security.
The operating staff at CSC has gained its expertise through long working experience. The most recently recruited member of the operating team has four years of experience, the second most recent six years.
The Funet operating team has five members and a team leader. Additionally, the neighboring team also has expertise and experience for duty service, resulting in a team of 8 persons for monitoring and operating the network.
The size of the team is considered sufficient for running the network and dealing with normal malfunctioning situations, but not for much additional development work.
No overload situations caused by proper customer traffic are expected, and no measures against them are pre-planned.
The core network does not provide QoS, as it is intended to operate in an uncongested state. In a pilot trial, a "less than best effort" service has been offered, and this class is the first to be limited when congestion occurs.
For overloads caused by malicious traffic attacks, the routers have implemented rate limits that are not active in a normal operating situation. If a rate limit is exceeded, an automatic alarm is sent to the network monitoring operator, who may take action to protect the network, for example by filtering the traffic.
CSC reacts to DoS-situations only when they hinder the normal traffic and a customer asks for actions.
Automatic alarms are generated for DoS attacks.
Active traffic filtering is done only during DoS attacks, and only if regular traffic would otherwise be harmed.
Typically, ports are filtered rather than particular target or source addresses.
CSC has DoS-monitoring in the network. Ill-behaving traffic can be filtered.
Interview of Janne Kanner and Juha Oinonen (CSC) by Eija Myötyri, Ilkka Norros and Pertti Raatikainen (VTT) on May 4, 2006.
Interview of Pekka Savola (CSC) by Ilkka Karanta, Ilkka Norros and Pertti Raatikainen (VTT) on February 28, 2007.
Interview of Kaisa Haapala and Pekka Savola (CSC) by Ilkka Karanta, Pirkko Kuusela and Ilkka Norros (VTT) on September 6, 2007.
CSC sends 5 consecutive pings (1 s interval) at regular intervals to customers' sites and to the six core routers. If none of the pings is answered, the site is marked down. A separate quality metric keeps track of ping batches that were only partially answered. Currently the interval between ping batches is about 60 seconds.
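The classification rule for a ping batch can be sketched as follows. The function and the label strings are hypothetical; only the rule itself (five pings, down when none is answered, partial answers tracked separately) comes from the description above.

```python
def classify_batch(replies):
    """Classify one batch of pings.

    replies is a sequence of booleans, one per ping (True = answered).
    Returns "down" when no ping was answered, "partial" when only some
    were, and "up" when every ping was answered.
    """
    answered = sum(1 for r in replies if r)
    if answered == 0:
        return "down"
    if answered < len(replies):
        return "partial"
    return "up"

# One batch of five pings in which the third ping got no reply.
print(classify_batch([True, True, False, True, True]))
```

Keeping "partial" separate from "down" mirrors the separate quality metric: a single lost ping does not count as an outage.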
The data is publicly available from http://im.funet.fi. The pings are also the primary interface for network monitoring.
For this study, a script was written to read all observed downtimes between August 1, 2000 and July 31, 2007.
The busy hour data of May 31, 2007 was used as material in an analysis made with Mathematica (notebook funet-mathgraph.nb). A crude traffic matrix estimate allowed the consideration of what-if questions. The dimensioning was found sufficient also in link and node failure situations.
Functions for an analysis of minimal cutsets, taking into account both link and node losses, were written in Mathematica (notebook funet-mathgraph.nb). The considered objective function required that the network is connected internally and with both Ficix and Nordunet.
The network is protected for loss of any one link, and loss of any one node leaves the remaining network connected. There are two independent connections both with Ficix and with Nordunet.
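The original analysis was done in Mathematica and is not reproduced here. As a rough Python sketch of the same idea, the code below enumerates minimal cut sets by brute force for a small topology. The topology is hypothetical (a four-router ring with each exchange point attached to two routers) and is not the Funet core; the function names are likewise assumptions.

```python
from itertools import combinations

# Hypothetical topology, NOT the real Funet core.
CORE = ["r1", "r2", "r3", "r4"]
CORE_LINKS = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r1")]
ACCESS_LINKS = [("r1", "Ficix"), ("r3", "Ficix"),
                ("r2", "NorduNet"), ("r4", "NorduNet")]
PEERS = ["Ficix", "NorduNet"]

def network_ok(failed):
    """True if the surviving routers are connected and reach every peer."""
    up = {n for n in CORE if n not in failed}
    if not up:
        return False
    adj = {n: set() for n in up}
    for a, b in CORE_LINKS:
        if (a, b) not in failed and a in up and b in up:
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = set(), [next(iter(up))]
    while stack:  # depth-first search over working core links
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    if seen != up:
        return False
    return all(
        any(p == peer and r in up and (r, p) not in failed
            for r, p in ACCESS_LINKS)
        for peer in PEERS
    )

def minimal_cut_sets(max_size):
    """Enumerate minimal sets of failed routers/links that break the network."""
    components = CORE + CORE_LINKS + ACCESS_LINKS
    cuts = []
    for size in range(1, max_size + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            if any(cut <= candidate for cut in cuts):
                continue  # contains a smaller cut set, so not minimal
            if not network_ok(candidate):
                cuts.append(candidate)
    return cuts

print(minimal_cut_sets(1))       # single failures that break the network
print(len(minimal_cut_sets(2)))  # number of minimal cut sets up to size two
```

Brute-force enumeration is adequate for a core of this size; for larger graphs a dedicated cut-set algorithm would be preferable.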
Only ping data was available for this analysis.
Analysis of core router ping downtime data was made with Mathematica (notebook funet-mathgraph.nb). The availability of core routers was found to be better than 0.999.
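The availability figure can be computed from downtime intervals as in the following sketch. The observation period matches the one used in the study, but the downtime record in the example is invented, and the functions are an illustration rather than the original Mathematica code. Intervals are assumed not to overlap.

```python
from datetime import datetime, timedelta

def availability(downtimes, start, end):
    """Fraction of [start, end) during which the router was not marked down.

    downtimes is a list of (down_from, down_until) datetime pairs that are
    assumed not to overlap; intervals are clipped to the observation period.
    """
    down = sum(
        (min(e, end) - max(s, start) for s, e in downtimes
         if e > start and s < end),
        timedelta(),
    )
    return 1 - down / (end - start)

def downtime_budget(start, end, target):
    """Total downtime allowed over the period for a target availability."""
    return (end - start) * (1 - target)

START = datetime(2000, 8, 1)
END = datetime(2007, 8, 1)  # end of July 31, 2007

# Invented example: a single 12-hour outage during the whole period.
example = [(datetime(2003, 1, 1, 0, 0), datetime(2003, 1, 1, 12, 0))]
print(availability(example, START, END))
print(downtime_budget(START, END, 0.999))
```

Over this period of 2556 days, an availability of 0.999 corresponds to a total downtime budget of roughly 61 hours.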
No analysis of link downtimes was made in this study. There is some data in text form (ptp-pinger) that could be used for this purpose, giving information on problems in core links. All links in the core network are pinged at 5-minute intervals to detect broken fibers.
Transmission link providers are required to monitor their own links.
CSC has computers that monitor the performance of the network, and the tools from im.funet.fi are also used. CSC has also built some alarm systems of its own, which send an e-mail message to the person monitoring the network.
Human network monitoring is done on weekdays from 8:30 until 16:00. Outside those hours, monitoring is done by Otaverkko, which is activated when an alarm is sent from the network.
The person monitoring the network watches a screen showing a network view. This person inspects the syslog files and checks the multicast traffic and general error statistics. These error statistics come from routers and include information on link errors, the number of broken packets received, the number of dropped packets, etc. From this information the monitoring person can often detect, for example, a link that is likely to malfunction soon. Problems can often be fixed well before customers notice any degradation in service quality.
At this point, error statistics are not used for generating automatic alarms. An experienced person can detect malfunctioning links from the ping data, which is also used for generating automatic alarms.
Detection of Denial of Service (DoS) activity in the routers will generate automatic alarms.
Automatic alarms have been in use for several years, and they are considered helpful and to be working well.
Link traffic of the core network is shown in real time on the public web page http://www.csc.fi/funet/status/tools/wm.
However, only pictorial information (load curves) is available over this interface.
The power supply is assured adequately.
All requirements in Ficora Regulation 30 on power supply are met.
The power unit is the most frequently failing component of a router.
All Funet core routers have double power units.
Power supply is secured by UPSes for some routers but not for all.
The UPSes used provide power for a router for approximately 30-60 minutes, depending on the traffic and the condition of the UPS.
In addition, some sites have backup means of providing electricity if the primary source of electricity goes down.
Funet has Juniper M10, M20 and T320 routers, in addition to some Cisco routers. These are widely used, so information on potential problems is available from other users.
Four routers (csc0, shh3, helsinki0 and helsinki3) have double routing processors.
The SLAs between CSC and the transmission link providers TeliaSonera and Corenet are not public. However, some of their features were disclosed:
- The SLA requires that links securing each other must be physically independent.
- It also requires that the transmission is via genuine end-to-end connections (e.g., fiber, SDH or Ethernet) - virtual transmission links, e.g. MPLS, are not accepted.
The router software is provided by the router manufacturer (Juniper).
Errors in software appear rather frequently, but they are seldom problematic for actual IP traffic transport.
Usually problems can be temporarily fixed with appropriate workarounds in configurations.
Error-related communication with the provider flows well in both directions.
The link cards of links securing each other are placed in different modules of the router.
Links securing each other are physically separated in the building and leave it in different ducts. Further separation of links falls under the responsibility of the transmission provider.
Purely technical problems are usually easy to detect and repair.
Link operators maintain their own links, and the SLAs contain conditions on the quality of the maintenance.
CSC has created and implemented a system that prevents malfunctioning configurations.
CSC does the software updates by logging in to the routers remotely. During a normal software update a router is down only while it is rebooted.
CSC has bought the maintenance and repair service for its hardware from Sonera.
Routine maintenance work is carried out neither during regular working hours (i.e., between 8 am and 4 pm) nor in the middle of the night.
The machine room in the CSC building is well protected. The rest of the hardware resides at customer locations, i.e., in machine rooms of universities and institutions. One should keep in mind that a university environment is more open and uses more short-term employees (students).
Funet routers are well protected from DoS attacks. Information security problems have the highest action priority: either configurations are changed or software is updated.
Good practices are used in encryption and password management matters.
Availability is the proportion of time during which the network can be used. The report The Dependability of an IP network - how to asses it? of the IPLU project further distinguishes between the availability of pure IP connectivity and the availability of high-quality connectivity.
Concrete availability claims for the Funet network are:
- The availability of the network is sufficiently high.
- The failure rate of a connection is sufficiently small.
- The duration of the unavailability of a connection is sufficiently small.
- All availability conditions in SLAs are fulfilled.
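The relation between the first three claims can be made explicit with the standard steady-state availability formula for a repairable component. The symbols are assumptions introduced here for illustration: $\lambda$ is the failure rate of a connection while it is up (so that $\mathrm{MTTF} = 1/\lambda$), and $d$ is the mean duration of an outage (the MTTR).

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
  = \frac{1/\lambda}{1/\lambda + d}
  = \frac{1}{1 + \lambda d}
```

Under this model, high availability requires the product $\lambda d$ to be small, so a low failure rate and short outages both contribute to it.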
Funet has not set its availability aims quantitatively.
The SLAs between Funet and transport providers were not available in this study.
The task of Funet is to provide a network for research and education in Finland. It is a service provided by CSC. Funet is a network community whose members are organizations and users (universities and institutions). CSC coordinates the entity technically and administratively.
The basic functionality of an IP network is to transfer a datagram from one network node to another network node. The transfer speed (bit/s) is not a logical part of this basic service. However, speed is an essential attribute of real-time services - their functioning ends abruptly when the speed sinks below some threshold. Although network users are seldom conscious of this distinction, one should distinguish between two aspects of the service offered by a network like Funet:
1) pure IP connectivity
2) high-speed IP connectivity
Note that an IP network can be in a state in which requirement 1 is fulfilled for each node pair but requirement 2 is fulfilled very poorly.
Funet offers pure IP-connectivity without explicit Quality of Service (except for offering a "less than best effort" traffic class). The quality of service is provided primarily by the dimensioning of the network.
Funet does not use MPLS. IPv6 is offered, but omitted in this study.
The dependability of the Funet network is divided in this study into six aspects: availability, reliability, maintainability, controllability, invulnerability and robustness of protocols. As regards the conceptual framework, we refer to the IPLU project's final report The Dependability of an IP network - how to asses it?.
The requirements to be met are presented as claims and subclaims. A "claim" like "Availability" is to be read as "The network has sufficiently high availability", etc.
A claim is supported by an argument, which is based on technical data, analysis, or other evidence, such as information gathered in interviews.
The aim is to present and evaluate all aspects of Funet's dependability in one document and in a manner that makes it easy to see the task as a whole. This case study contains technical analysis for the availability and reliability of the Funet network; in the other parts the information is gathered only from interviews at CSC.