Planning for Resiliency

This is my second post on IT strategy, following the previous post on “Seeking Value”. This time I am looking at the resiliency of systems and infrastructure, particularly when things inevitably do go wrong.

Resiliency: Keeping it all working

I recently heard Mark Steed speaking at the EdTech Conversations event in London where he referred to his approach to the use of Educational Technology at JESS in Dubai.

In his speech, he talked about a “no excuses” approach to the systems and infrastructure on which educational technology solutions rely. His view was that if the foundations on which EdTech use is built are not solid, and if things such as Wi-Fi or the wider network don’t work or are intermittent, then users of educational technology, be they students or teachers, will simply turn off and seek non-technology solutions. Winning them back after reliability issues is extremely difficult, if not near impossible. Building strong technology foundations, a resilient infrastructure, is therefore key. Planning for when things might go wrong is a must.

As with most things, building resiliency isn’t simple. In a world of infinite resources we would simply double up (2N), or even double up and add spares. In the case of our internet provider, we would require two separate, diversely routed fibres so that, in the event one fibre was damaged, we would be able to run off the second. We might then have a third redundant backup solution, possibly with lower capacity, and again diversely routed. All of this sounds good and minimises potential downtime from fibre damage on the incoming internet service. However, it comes at a cost, first in the financial cost of the additional lines and then in additional hardware and support costs. We don’t live in a world of infinite resources, and therefore decisions need to be taken as to how much resiliency we build in. This is where the usual risk assessment and management processes must kick in.
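To make the cost-versus-downtime trade-off concrete, the short sketch below estimates the combined availability of one or more internet lines. It assumes the lines fail independently, which is roughly what diverse routing aims for, and the 99.5% availability figure is purely illustrative rather than taken from any real provider.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(link_availabilities):
    """Probability that at least one link is up, assuming independent failures."""
    p_all_down = 1.0
    for availability in link_availabilities:
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down


def expected_downtime_hours(availability):
    """Expected hours per year with no working link."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative figure only: each line is assumed to be up 99.5% of the time.
single = combined_availability([0.995])
dual = combined_availability([0.995, 0.995])

print(f"One line:  {expected_downtime_hours(single):.1f} hours down per year")
print(f"Two lines: {expected_downtime_hours(dual):.2f} hours down per year")
```

Under these assumptions a single line gives roughly 43.8 hours of expected downtime a year, while two independent lines bring that down to well under an hour. Whether that reduction justifies the cost of the second line is exactly the kind of decision the risk assessment needs to inform.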

Let’s consider the key pieces of infrastructure which may exist and issues around each:

  • Internet Service Provision, Firewalls and Core Switches

As we use more and more cloud services, school internet provision becomes critically important. When looking at internet service provision, firewalls and core switches, the two main areas I would focus on are doubling up where finances allow, and carefully examining the service level agreement (SLA) along with any penalties proposed for where service levels are not met. In the case of firewalls and core switches, cold spares with a lower specification may also be an option to minimise cost while allowing for quick recovery in the event of any issue. When looking at providers’ SLAs, consider their support offering for when things go wrong: is it next-business-day on-site support or return to base, for example, and how long is their anticipated recovery period?

  • Edge Switches and Wi-Fi

In the case of edge switches and Wi-Fi access points, we are likely to have large numbers of units, especially on larger sites. I would suggest that heat mapping is key at the outset of a Wi-Fi deployment, to make sure Wi-Fi will work across the site. For resiliency when things go wrong, my view is that an N+1 approach works best. This involves holding a spare, or a quantity of spares, based on the total number of units in use and the level of risk which is deemed acceptable. A high level of risk acceptance means fewer spares, whereas a low level of risk acceptance may lead to a greater number of spares.
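One way to turn "number of units plus acceptable risk" into a spares count is sketched below. This is only an illustration and not a standard formula for schools: it assumes failures follow a Poisson distribution, and the annual failure rate, replacement lead time and risk figures are all hypothetical inputs you would need to estimate for your own site.

```python
import math


def spares_needed(units, annual_failure_rate, lead_time_days, acceptable_risk):
    """Smallest spares count (minimum 1, in keeping with N+1) such that the
    chance of running out before replacements arrive is at most acceptable_risk.

    Assumes independent failures modelled as a Poisson process.
    """
    if not 0.0 < acceptable_risk < 1.0:
        raise ValueError("acceptable_risk must be between 0 and 1 (exclusive)")

    # Expected number of failures while waiting for a replacement to arrive.
    expected = units * annual_failure_rate * lead_time_days / 365.0

    spares = 0
    term = math.exp(-expected)   # P(exactly 0 failures)
    cumulative = term            # P(failures <= spares)
    while cumulative < 1.0 - acceptable_risk:
        spares += 1
        term *= expected / spares
        cumulative += term
    return max(spares, 1)


# Hypothetical site: 120 access points, 3% fail per year, 14-day replacement time.
print(spares_needed(120, 0.03, 14, acceptable_risk=0.10))   # higher risk acceptance
print(spares_needed(120, 0.03, 14, acceptable_risk=0.001))  # lower risk acceptance
```

With these made-up figures, accepting a 10% chance of running short needs fewer spares than accepting only a 0.1% chance, which matches the relationship described above.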

  • Cabling / Routing

Cables break, and various small animals love to chew on them given half a chance.

As a result, it is important to examine your overall network layout for weak points where a single failure might affect large areas or large numbers of users within the school. Where possible, plan for redundant routes so that any single failure can be quickly resolved by using an alternative route, thereby minimising downtime while you wait for repairs.
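The weak-point check described above can be sketched in a few lines of code. The example below removes each cable run in turn and reports which locations would lose their path back to the core. The building names and links are invented for illustration, and a real network would also need to account for things like spanning tree configuration and shared ducting.

```python
from collections import deque


def reachable(links, start):
    """Return the set of locations reachable from start over the given links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, core):
    """Map each link to the locations cut off from the core if that link fails."""
    all_nodes = reachable(links, core)
    report = {}
    for i, link in enumerate(links):
        remaining = links[:i] + links[i + 1:]
        cut_off = all_nodes - reachable(remaining, core)
        if cut_off:
            report[link] = sorted(cut_off)
    return report


# Hypothetical campus layout: each tuple is one cable run between two locations.
links = [
    ("core", "science-block"),
    ("core", "library"),
    ("science-block", "library"),
    ("library", "sports-hall"),
]

for link, lost in single_points_of_failure(links, "core").items():
    print(f"If {link[0]} <-> {link[1]} fails, these lose connectivity: {lost}")
```

In this made-up layout the science block and library have two routes to the core, so only the single run to the sports hall shows up as a weak point.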

  • School Management Solutions (SMS) / Management Information Systems (MIS)

I include the school’s MIS given its criticality in relation to parental contact information, student registration, etc. As such, it is important to consider how it is backed up and how recovery would be undertaken. It is also important to test these processes. I have conducted tests in the past which showed that the recovery process did not perform as expected. Had I not tested, the first I would have known about the difficulties would have been when I needed to recover the MIS for real, which is the last time you want things not to go as planned.

 

I note that the above is not a comprehensive list. I might have included classroom display technology, Mobile Device Management (MDM), Network Access Control (NAC), CCTV, access control and a whole range of other solutions which may exist; however, in the interest of keeping this post brief and to the point, I have left these off.

For me, the key in relation to resiliency is a risk-based assessment of your systems and infrastructure.

We need to know the risks and their impact on the school. Armed with this information we can prioritise our available resources towards the aspects of our infrastructure where the greatest level of resiliency is required. The other key consideration is transparency and ensuring school leaders are aware of the risks which exist, where the available resources have been prioritised and where decisions have been taken not to deploy resources, plus the reasons why.
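As a rough illustration of how such a prioritisation might be recorded, the sketch below scores each risk as likelihood multiplied by impact and sorts the register so that the highest scores come first. The 1 to 5 scales, the example entries and their scores are all assumptions made up for this example, not figures from my own school's assessment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (minor) to 5 (school cannot operate)
    decision: str = ""  # what was decided and why, for transparency with leaders

    @property
    def score(self):
        return self.likelihood * self.impact


def prioritise(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Hypothetical register entries for illustration only.
register = [
    Risk("Single access point failure", 4, 2, "Hold N+1 spares"),
    Risk("Internet line failure", 3, 5, "Second diversely routed line"),
    Risk("Core switch failure", 2, 5, "Cold spare, lower specification"),
    Risk("MIS restore fails", 2, 4, "Test recovery regularly"),
]

for risk in prioritise(register):
    print(f"{risk.score:>2}  {risk.name}: {risk.decision}")
```

Keeping the decision alongside each score, including any decision not to spend, gives school leaders a simple record of where resources went and why.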

My concern with resiliency is that it is often something which people don’t worry about until things go wrong. Then come the difficult discussions as to why preventative measures or recovery plans hadn’t been put in place. It is better to consider resiliency regularly and ensure that the state of play, including the risks, is made clear to all. At my school, we approach this as part of an annual IT risk assessment process which includes risks related to resiliency. If you don’t have a risk assessment which includes a discussion of resiliency, it would be my strong advice to create one.