Funet Dependability Case

Generated from ASCE 3.5 on 9.10.2007 at 16:10:04

Author: Ilkka Norros, Pirkko Kuusela

Version: version available for IPLU-II management committee

Description: IPLU-II project case study, Spring - September 2007


Claim N861136 RELIABILITY

This is the concept of dependability in the traditional sense of reliability analysis: the system is divided into components, and a structure function combines the component states into the state of the system. Components can in turn be studied as systems.

The reliability claims for the Funet network are:
- The network has high technical and structural reliability.
- The risks related to outsourcing are recognized and minimized.

Claim N9545710 MAINTAINABILITY

The network is well maintained.

The risks associated with outsourcing are recognized and minimized.

Ficora Regulation 41 on technical documentation is fulfilled.

 

Claim N3827891 CONTROLLABILITY

Control measures that are in principle possible in an IP network are available to the operator, and actions can be taken promptly when needed.

Claim N4411433 INVULNERABILITY

All aspects of network invulnerability are recognized and adequately managed.

 

Claim N1438977 ROBUSTNESS OF PROTOCOLS

The network relies on the inherent robustness of basic IP protocols.

Claim N6417608 Structure

The structure function in the sense of reliability analysis needs to be defined, in principle, for each task one by one, using the network topology and routing rules. Some examples are:
- each node works, the network is connected and it has connections to the outside world (i.e., it is connected to Ficix and NorduNet).
- one component is down, but the rest of the network is connected and connected to the outside world. 
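
As an illustration, the first example above can be written as a structure function in Python. This is a minimal sketch, not the analysis used in the study: the four-node topology, the node names and the function names are hypothetical. Only the rule is taken from the text (the surviving nodes are connected to each other, and the network is connected to both Ficix and NorduNet).

```python
from collections import deque

# Hypothetical topology for illustration only; not the real Funet core.
LINKS = [
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"),
    ("a", "FICIX"), ("c", "FICIX"),
    ("b", "NORDUNET"), ("d", "NORDUNET"),
]
CORE = {"a", "b", "c", "d"}
EXTERNAL = {"FICIX", "NORDUNET"}


def reachable(start, nodes, links):
    """Return the set of nodes reachable from start, using only links
    whose two endpoints both belong to nodes."""
    adj = {n: set() for n in nodes}
    for u, v in links:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def structure_function(failed_nodes=frozenset(), failed_links=frozenset()):
    """Return 1 if the surviving core is connected and reaches both
    exchanges, otherwise 0."""
    core_up = CORE - set(failed_nodes)
    if not core_up:
        return 0
    failed = {frozenset(link) for link in failed_links}
    links_up = [link for link in LINKS if frozenset(link) not in failed]
    # Internal connectivity: only links between surviving core nodes count.
    if reachable(next(iter(core_up)), core_up, links_up) != core_up:
        return 0
    # External connectivity: each exchange needs a surviving link to the core.
    for ext in EXTERNAL:
        if not any({u, v} & core_up and ext in (u, v) for u, v in links_up):
            return 0
    return 1
```

With no failures the function returns 1. In this toy topology it also returns 1 when any single node is down, which corresponds to the second example.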

Structure claims used in this study are:
- Requirements set in Ficora (Finnish Communications Regulatory Authority) regulation 27 are fulfilled.

Claim N6361490 Components

 

Claim N8546720 Protocol software

The protocol software used is reliable.

Software bugs that are found are fixed promptly or can be managed in some other way.

 

 

Claim N7957837 Dimensioning

The network is dimensioned with the objective that it should never be congested, even after the loss of a single core link or router.

On a primary connection, link loads are at most 50%; the desired load in a normal situation is 30-40%.

Secondary connections have the same capacities as the primary connections.
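
One way to read the 50% rule is as worst-case arithmetic. The sketch below assumes, as a simplification that the source does not state, that all traffic of a failed link is rerouted onto one surviving link of equal capacity that already carries its own load. The function names and the example loads are hypothetical.

```python
def load_after_failover(surviving_load, failed_load, capacity_ratio=1.0):
    """Utilisation of a surviving link after it absorbs a failed link's traffic.

    Loads are fractions of link capacity (0.5 means 50%). capacity_ratio is
    the failed link's capacity divided by the surviving link's capacity;
    1.0 corresponds to links of equal capacity.
    """
    return surviving_load + failed_load * capacity_ratio


def survives_single_failure(loads, limit=1.0):
    """Check every ordered pair (surviving link, failed link) under the
    worst-case rerouting assumption described above."""
    return all(
        load_after_failover(loads[i], loads[j]) <= limit
        for i in range(len(loads))
        for j in range(len(loads))
        if i != j
    )
```

Under this assumption, two links that are each loaded to at most 50% can back each other up without exceeding full utilisation, and the desired 30-40% range leaves additional headroom.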

Requirements in Ficora regulations 13 and 50 on dimensioning of the network are fulfilled.

Claim N7164668 Operation

Problems in network operation can be noticed and fixed fast enough, without degrading the quality experienced by customers.

Requirements in Ficora Regulation 50 on network operation are fulfilled.

Claim N7546896 Overload control

The network has implemented control measures against overload and can overcome overload situations.

Requirements in Ficora regulations 13 and 50 on network load are fulfilled.

 

Claim N7200282 Traffic filtering abilities

The traffic can be filtered effectively when needed.

Requirements in Ficora regulation 13 are fulfilled.

Claim N5889048 Physical security

The physical security of the network is high.

Requirements in Ficora regulation 48 on physical protection of network are fulfilled.

Requirements in Ficora regulation 43 on electrical protection of network are fulfilled.

Claim N3876247 Information security

Information security is high, and it can be updated effectively when a security hole is detected.

Announcements on data security and protection are monitored continuously and actions are taken promptly when needed.

Requirements of Ficora regulation 13 on Internet information security and functionality are fulfilled.

Requirements of Ficora regulation 47 on information security are fulfilled.

Claim N7078015 Routers

Routers are physically very reliable.

However, no numerical criteria for router reliability have been set.

 

Claim N3263361 Transmission links

The reliability of transmission links has been secured adequately: more than one operator is used, repair and maintenance of links are prompt, and the SLAs contain requirements for repair and maintenance quality.

The transmission link operator is reliable. It has proper monitoring of links, good practices in operations, and its staff is experienced. Problems in transmission links are noticed and repaired fast enough.

Claim N6354747 Software updates

Good practices are followed in router software updates.

The functionality of new software versions is verified before taking them into use.

 

 

Claim Monitoring

The network is constantly monitored and, in case of a problem, corrective actions are started promptly.

Adequate tools are implemented for network monitoring and possible problems in the network can be detected soon enough.

Requirements in Ficora regulations on monitoring are met.

Claim N3857944 Operation and maintenance practices

Networking faults can be localized effectively and actions to correct the faults are started promptly.

Requirements in Ficora regulations on actions and practices are met.

Claim N6729984 Operating personnel

Operating staff is adequate and no single person is critical for the maintainability of the network.

 

Claim N7727573 Staff expertise

The operating staff has sufficient qualifications for operating the network and for implementing and developing new functionalities.

Claim N9267768 Staff size

The staff size is large enough to operate the network and improve practices.

The workload of each employee is reasonable, and human errors are kept to a minimum.

Argument N3908473 Robustness of routing

The network is divided into domains that implement OSPF routing. One of them is the six-node core; the others connect to that core over BGP. After link and node losses, routing is restored automatically.

Argument N848095 Robustness of traffic control

TCP itself has a feedback mechanism and can adapt to the level of congestion in the network.

The amount of real-time traffic, where TCP is usually not used (and could not provide the quality needed), is not significant.

The IP protocol itself is robust and flexible.

Evidence N3176963 Literature on IP networking

 

Argument N9971738 Capacity update practices

Routers and link capacities are updated according to traffic growth.

The upgrade frequency has slowed down somewhat in recent years. Traffic in FUNET increases at a slower rate than that of other Finnish operators. FUNET traffic is also considered quite predictable.

Argument N5947383 Software update practices

Juniper releases new router software versions 3-4 times per year. New versions are installed only when they include significant extensions or fix security holes or other serious software errors. Updates are installed about once a year.

CSC has no separate router for testing, but Juniper can do testing for them. New versions then go directly into production use.

Funet seldom has the very latest version of the software in use, unless it is required for information security.

Argument N2982082 Experience of staff

The operating staff at CSC has gained its expertise through long working experience. The most recent member of the operating team has four years of experience, the second most recent six years.

Argument N3024546 Sufficiency of human resources

The Funet operating team has five members and a team leader. Additionally, the neighboring team also has the expertise and experience for duty service, resulting in a team of 8 persons for monitoring and operating the network.

The size of the team is considered sufficient for running the network and dealing with normal malfunctioning situations, but not for much additional development work.

Argument N3867723 Only DoS attack needs immediate action

No overload situations caused by proper customer traffic are expected, and no measures against them are pre-planned.

The core network does not provide QoS, as it is intended to operate in an uncongested state. In a pilot trial, a "less than best effort" service has been offered; this traffic is the first to be limited when congestion occurs.

For overloads caused by malicious traffic attacks, the routers implement rate limits that are not active in a normal operating situation. If a rate limit is exceeded, an automatic alarm is sent to the network monitoring operator, who may take action to protect the network, for example by filtering the traffic.

CSC reacts to DoS situations only when they hinder normal traffic and a customer asks for action.
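
The alarm logic described above (a threshold that is only exceeded under abnormal traffic, with an alarm sent to a human operator rather than automatic filtering) can be sketched as follows. This is an illustrative sketch only: the traffic class names, the thresholds and the message format are hypothetical and are not taken from the Funet router configuration.

```python
from dataclasses import dataclass


@dataclass
class RateLimit:
    name: str     # hypothetical traffic class label
    max_pps: int  # hypothetical threshold in packets per second


def check_rates(limits, observed_pps):
    """Return one alarm message for every traffic class whose observed
    rate exceeds its limit. Nothing is filtered here: the decision to
    act is left to the operator who receives the alarm."""
    alarms = []
    for limit in limits:
        rate = observed_pps.get(limit.name, 0)
        if rate > limit.max_pps:
            alarms.append(
                f"ALARM {limit.name}: {rate} pps exceeds limit {limit.max_pps} pps"
            )
    return alarms
```

The function only reports; this mirrors the practice described in the text, where the monitoring operator decides whether to filter the traffic.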

Argument N1706427 Traffic filtering practices

Automatic alarms are raised for DoS attacks.

Active traffic filtering is done only during DoS attacks, and only if regular traffic would otherwise be harmed.

Typically, ports are filtered rather than particular target or source addresses.

 

Argument N4536576 Security measures protecting Funet's customers

CSC has DoS-monitoring in the network. Ill-behaving traffic can be filtered.

Evidence N9994472 IPLU interviews

Interview of Janne Kanner and Juha Oinonen (CSC) by Eija Myötyri, Ilkka Norros and Pertti Raatikainen (VTT) on May 4, 2006.

Interview of Pekka Savola (CSC) by Ilkka Karanta, Ilkka Norros and Pertti Raatikainen (VTT) on February 28, 2007.

Interview of Kaisa Haapala and Pekka Savola (CSC) by Ilkka Karanta, Pirkko Kuusela and Ilkka Norros (VTT) on September 6, 2007.

 

Evidence N5041425 Ping downtime data 2000-2007

CSC sends 5 consecutive pings (1s interval) at regular intervals to customers' sites and to the six core routers.  If there is no response to any of the pings, the site is marked down.  A separate quality metric keeps track of ping batches which were partially answered. Currently the interval between ping batches is about 60 seconds.
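
The batch rule described above can be sketched in Python. This is a hedged illustration, not the script used by CSC or in this study: the function names, the labels "up", "degraded" and "down", and the fixed 60-second spacing are assumptions made for the example.

```python
def classify_batch(replies):
    """Classify one batch of ping results (True means a reply was received).

    'down' if no ping in the batch was answered, 'degraded' if only some
    were answered (the separate quality metric), 'up' if all were answered.
    """
    answered = sum(replies)
    if answered == 0:
        return "down"
    if answered < len(replies):
        return "degraded"
    return "up"


def downtimes(batches, interval_s=60):
    """Turn a time-ordered list of batches into (start, end) downtime
    intervals in seconds, assuming batches are interval_s seconds apart."""
    intervals, start = [], None
    for i, replies in enumerate(batches):
        t = i * interval_s
        if classify_batch(replies) == "down":
            if start is None:
                start = t  # first batch of a new outage
        elif start is not None:
            intervals.append((start, t))  # outage ended at this batch
            start = None
    if start is not None:
        # Outage still ongoing at the end of the data.
        intervals.append((start, len(batches) * interval_s))
    return intervals
```

With batches about 60 seconds apart, the time resolution of such downtime data is of the same order, so outages shorter than the batch interval may not be recorded.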

The data is publicly available from http://im.funet.fi. The pings are also the primary interface for network monitoring.

For this study, a script was written to read all observed downtimes between August 1, 2000 and July 31, 2007.

 

Argument N2664541 Analysis of link traffic data

The busy hour data of May 31, 2007 was used as material in an analysis made with Mathematica (notebook funet-mathgraph.nb). A crude traffic matrix estimate made it possible to consider what-if questions. The dimensioning was also found sufficient in link and node failure situations.

Argument N150661 Cutset analysis of network topology

Functions for an analysis of minimal cutsets, taking into account both link and node losses, were written in Mathematica (notebook funet-mathgraph.nb). The considered objective function required that the network is connected internally and with both Ficix and Nordunet.

The network is protected against the loss of any one link, and the loss of any one node leaves the remaining network connected. There are two independent connections to both Ficix and Nordunet.
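
The study's cutset functions were written in Mathematica and are not reproduced here. The Python sketch below only illustrates the idea of enumerating minimal cutsets over both nodes and links by brute force. The four-router topology and all names are hypothetical; the objective (core connected internally and to both Ficix and Nordunet) follows the description above.

```python
from itertools import combinations

# Hypothetical topology for illustration only; not the real Funet core.
# The external exchanges are always listed as the second endpoint of a link.
CORE = ["r1", "r2", "r3", "r4"]
LINKS = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r1"),
         ("r1", "FICIX"), ("r3", "FICIX"),
         ("r2", "NORDUNET"), ("r4", "NORDUNET")]


def works(failed):
    """True if the surviving core nodes are connected to each other and
    the core reaches both exchanges. failed may contain nodes and links."""
    up = [n for n in CORE if n not in failed]
    if not up:
        return False
    links = [l for l in LINKS
             if l not in failed and l[0] not in failed and l[1] not in failed]
    # Depth-first search over links between surviving core nodes.
    seen, stack = {up[0]}, [up[0]]
    while stack:
        node = stack.pop()
        for u, v in links:
            if node in (u, v):
                other = v if node == u else u
                if other in up and other not in seen:
                    seen.add(other)
                    stack.append(other)
    if seen != set(up):
        return False
    ends = {v for u, v in links if u in up}  # far ends of surviving links
    return {"FICIX", "NORDUNET"} <= ends


def minimal_cutsets(max_size=2):
    """Enumerate minimal sets of components (nodes or links) whose loss
    breaks the objective, in order of increasing size."""
    components = CORE + LINKS
    cuts = []
    for size in range(1, max_size + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # Skip supersets of cutsets already found: they are not minimal.
            if not works(candidate) and not any(c <= candidate for c in cuts):
                cuts.append(candidate)
    return cuts
```

In this toy topology no single component is a cutset, which is the same kind of property the analysis reports for the Funet core. Brute-force enumeration grows quickly with the cutset size, so it is only practical for small networks and small values of max_size.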

Argument N4059870 Analysis of downtime statistics

Only ping data was available for this analysis.

Analysis of core router ping downtime data was made with Mathematica (notebook funet-mathgraph.nb). The availability of core routers was found to be better than 0.999.
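
For orientation, an availability of 0.999 corresponds to at most about 8.8 hours of downtime per year (0.001 x 8760 h). The way an availability figure follows from a list of observed downtimes can be sketched as below. This is not the Mathematica analysis itself: the function name is hypothetical, and the sketch assumes that the downtime intervals do not overlap.

```python
from datetime import datetime


def availability(downtimes, start, end):
    """Fraction of the observation window [start, end) during which the
    router was up, given non-overlapping (down_start, down_end) pairs."""
    window = (end - start).total_seconds()
    down = 0.0
    for d_start, d_end in downtimes:
        # Clip each downtime to the observation window.
        lo, hi = max(d_start, start), min(d_end, end)
        if hi > lo:
            down += (hi - lo).total_seconds()
    return 1.0 - down / window
```

A usage example with the study's observation window and a made-up 12-hour outage:

```python
start, end = datetime(2000, 8, 1), datetime(2007, 8, 1)
outage = [(datetime(2003, 1, 1, 0, 0), datetime(2003, 1, 1, 12, 0))]  # hypothetical
print(availability(outage, start, end))
```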

No analysis of link downtimes was made in this study. Some data in text form (ptp-pinger), giving information on problems in core links, could be used for this purpose. All links in the core network are pinged at 5-minute intervals to detect broken fibers.

 

Argument N5891538 Monitoring practices

Transmission link providers are required to monitor their own links.

CSC has computers that monitor the performance of the network, and the tools from im.funet.fi are also used. CSC has also built some alarm systems of its own, which send an e-mail message to the person monitoring the network.

Human network monitoring is done on weekdays from 8:30 until 16:00. Outside these hours, monitoring is handled by Otaverkko, which is activated when an alarm is sent from the network.

The person monitoring the network watches a screen showing a network view. He inspects the sys-log-files, checks the multicast traffic and general error statistics. These error statistics come from routers and include information on link errors, number of broken packets received, number of dropped packets etc. From this information the monitoring person can often detect, for example, a link that is likely to malfunction soon. Problems can often be fixed well before customers notice any degradation in service quality.

At this point, error statistics are not used for generating automatic alarms. An experienced person can detect malfunctioning links from the ping data, which is used in generating automatic alarms.

Detection of Denial of Service (DoS) activity in the routers will generate automatic alarms.

Automatic alarms have been in use for several years, and they are considered helpful and working well.

Evidence N4192710 Funet core network graph

Evidence N4190878 Link traffic data 2000-2007

Link traffic of the core network is shown in real time on the public web page http://www.csc.fi/funet/status/tools/wm.

However, only pictorial information (load curves) is available over this interface.

Claim N5259058 Power supply

The power supply is assured adequately.

All requirements in Ficora Regulation 30 on power supply are met.

Argument N607569 Measures to ensure power for routers

The power unit is the most frequently failing component of a router.

All Funet core routers have double power units.

Power supply is secured by UPSes for some routers but not for all.

The UPSes used provide power for a router for approximately 30-60 minutes, depending on the traffic and the condition of the UPS.

Some sites additionally have means to provide electricity if the primary source of electricity goes down.

Argument N6238171 The reliability of routers

Funet has Juniper M10, M20 and T320 routers, in addition to some Cisco routers. These are widely used, so information on potential problems is available from other users.

Four routers (csc0, shh3, helsinki0 and helsinki3) have double routing processors.

Argument N5416529 Service Level Agreements

The SLAs between CSC and the transmission link providers TeliaSonera and Corenet are not public. Some features were disclosed, however:

- The SLA requires that links securing each other must be physically independent.

- It also requires that the transmission is via genuine end-to-end connections (e.g., fiber, SDH or Ethernet) - virtual transmission links, e.g. MPLS, are not accepted.

Argument N7375082 Fixing errors in router software

The router software is provided by the router manufacturer (Juniper).

Errors in software appear rather frequently, but they are seldom problematic for actual IP traffic transport.

Usually problems can be temporarily fixed with appropriate workarounds in configurations.

Error-related communication with the provider flows well in both directions.

 

Argument N167310 Separation of links

The link cards of links securing each other are placed in different modules of the router.

Links securing each other are physically separated in the building and leave it in different ducts. Further separation of links falls under the responsibility of the transmission provider.

Purely technical problems are usually easy to detect and repair.

Argument N168521 Practices of operation

Link operators maintain their own links and SLAs contain conditions on the quality of the maintenance.

CSC has created and implemented a system that prevents malfunctioning configurations.

CSC does the software updates by logging in to the routers remotely. During a normal software update a router is down only when it is rebooted.

CSC has bought the maintenance and repair service for its hardware from Sonera.

Routine maintenance work is not carried out during regular working hours (i.e., between 8 am and 4 pm), nor in the middle of the night.

Argument N4817665 Protection of machine rooms

The machine room in the CSC building is well protected. The rest of the hardware resides at customer locations, i.e., in machine rooms of universities and institutions. One should keep in mind that a university environment is more open and employs more short-term staff (students).

Argument N8457889 Information security measures protecting Funet

Funet routers are well protected from DoS attacks. Information security problems have the highest action priority: either configurations are changed or software is updated.

Good practices are used in encryption and password management matters.

Claim N3082898 AVAILABILITY

Availability is the proportion of time during which the network can be used. The IPLU project report "The Dependability of an IP network - how to assess it?" further distinguishes between the availability of pure IP connectivity and the availability of high-quality connectivity.

Concrete availability claims for the Funet network are:
- The availability of the network is sufficiently high.
- The failure rate of a connection is sufficiently small.
- The duration of the unavailability of a connection is sufficiently small.
- All availability conditions in SLAs are fulfilled.

Funet has not set quantitative availability targets.

The SLAs between Funet and transport providers were not available in this study.

 

Claim N2114003 Dependability of Funet

The task of Funet is to provide a network for research and education in Finland. It is a service provided by CSC. Funet is a network community whose members are organizations and users (universities and institutions). CSC coordinates the entity technically and administratively.

The basic functionality of an IP network is to transfer a datagram from one network node to another network node. The transfer speed (bit/s) is not a logical part of this basic service. However, speed is an essential attribute of real-time services: their functioning ends abruptly when the speed drops below some threshold. Although network users are seldom conscious of this distinction, one should distinguish between two aspects of the service offered by a network like Funet:
1) IP-connectivity
2) high-speed IP connectivity
Note that an IP network can be in such a state that requirement 1 is fulfilled for each node pair while requirement 2 is fulfilled very poorly.

Funet offers pure IP-connectivity without explicit Quality of Service (except for offering a "less than best effort" traffic class).  The quality of service is provided primarily by the dimensioning of the network.

Funet does not use MPLS. IPv6 is offered but is not covered in this study.

___________________________________________________________________________________________________


The dependability of the Funet network is divided in this study into six different aspects of dependability: availability, reliability, maintainability, controllability, invulnerability and robustness of protocols. As regards the conceptual framework, we refer to the IPLU project's final report "The Dependability of an IP network - how to assess it?".

The requirements to be met are presented as claims and subclaims. A "claim" like "Availability" is to be read as "The network has sufficiently high availability", etc.

A claim is supported by an argument, which is based on technical data, analysis, or other evidence, such as information gathered in interviews.

The aim is to present and evaluate all aspects of Funet's dependability in one document and in a manner that makes the whole picture easy to see. This case study contains technical analysis of the availability and reliability of the Funet network; for the other aspects, the information was gathered only from interviews at CSC.