(in no particular order)
By: Sherri Douville, CEO & Board Member, Medigram Inc. & Eric Svetcov, CTO/CSO at Medigram, Inc. together with numerous technical advisors.
There are myriad ways that an application can fail. A product defect can be introduced with no process in place to catch it, a threat can exploit a known or unknown vulnerability, or an “infallible” system administrator can simply trip over a power cord. Failures can be exacerbated by poor design, inadequate or missing architecture, strange implementation decisions, and a lack of technical depth compounded by a complete lack of training.
We need to recognize what generally causes outages:
- One cause of large service outages is software defects introduced into production. Much of the time, when you hear things like “this company’s systems were down for 4 hours today,” the problem was some sort of software issue. Software issues cause far more outages than hardware failures, networking failures, DoS attacks, or anything else. Sometimes the outage is the result of a software upgrade gone bad, inadequate testing, or a fundamental misunderstanding of the requirements. Software problems have many causes. Good software design and testing are the front line of a truly resilient service.
- Other times it’s the result of a database problem (including misconfiguration, product defects, and lack of experience with the technology). Databases are the bane of many technologists’ existence. Frequently these database-related outages happen because a database was corrupted or bad data was being written to it. Companies literally take down their systems to prevent further damage to the data in their databases.
- Human error…enough said.
- Connectivity issues cause a lot of the outages we experience, especially with wireless and mobile computing. DNS failures and DoS attacks on a DNS provider are other forms of connectivity problems. It seems like every year there’s at least one big DNS outage.
- The other major cause of network outages is fiber cuts that result in large geographic areas having either no connectivity or greatly degraded capability. You need to design your solution to withstand these types of problems.
- Hardware failures would be next in the list of things that cause outages. Design failures that result in production solutions with a single point of failure occur much more often than any of us would like.
- Localized catastrophes (fires, earthquakes, etc.) may destroy a site or critical infrastructure near a site. These are rarer, but when they do cause an outage they underscore a lack of planning to build solutions that can withstand such a catastrophic event.
- Developers tend to leverage a plethora of third-party tools for modern software systems. Not only can this introduce software licensing issues, but some developers uncritically include code snippets or libraries that introduce both functional and security weaknesses, which can manifest as outages or confidentiality defects. Some introduced code can even create dependencies on external systems or services, which adds both availability and confidentiality risks.
- Mobile brings its own set of additional issues. For example, a mobile operator may decide to block the ports used by a specific app, and even connecting to public wireless can be problematic when the provider of the “free” access attempts to limit your “outrageous usage” by blocking ports that they don’t like (e.g. POP and IMAP are often blocked). With mobile, just the act of moving from one network to another can cause a temporary service outage on the device, even when both networks are available (or appear available).
- Or any number of pedestrian mobile issues could be at play such as phone storage limitations causing performance problems: https://www.androidpolice.com/2019/10/03/clear-app-cache-data-android/
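The transient failures described in the mobile and connectivity items above (a network handoff, a briefly unreachable external service) are commonly handled by retrying with exponential backoff. The following is a minimal sketch of that general technique, not a description of any particular product's implementation. `TransientNetworkError` and `call_with_retries` are hypothetical names introduced for illustration, and the delay values are arbitrary assumptions.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error type standing in for a dropped or switched connection."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying transient failures with exponential backoff.

    Returns the operation's result, or re-raises the last error once
    max_attempts have been used. `sleep` and `rng` are injectable so the
    behavior can be tested without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait a random fraction of the ceiling so that many
            # devices reconnecting at once do not retry in lockstep.
            sleep(delay * rng())
```

A caller would wrap only operations that are safe to repeat, for example `call_with_retries(fetch_messages)`, where `fetch_messages` is an assumed function that raises `TransientNetworkError` on a dropped connection. Retrying does not help when a port is blocked outright; in that case the request fails on every attempt and the error is surfaced to the caller.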
Some straightforward solutions to the above issues: implement good process for introducing new solutions (leverage leading IT frameworks to make sure you are doing the right thing); practice world-class change management (yes, everyone should have “world class” change management); plan (in mobile computing it is truly important to plan and measure, then plan and measure again, and to implement only once you know you are doing the right thing); and manage your vulnerabilities and defects, both your own and, through a strong patch management program, those of third parties. As with any technology solution, it is critical to build in a reasonable amount of resilience: based on the solution, build in the appropriate level of infrastructure redundancy across servers, connectivity/networking, and storage.
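To make "the appropriate level of redundancy" more concrete, here is a back-of-the-envelope sketch of standard availability arithmetic. The function names are hypothetical, the 99% figure is purely illustrative, and the calculation assumes component failures are independent. That assumption often does not hold in practice: a shared power feed, or the same software defect on every server, breaks it.

```python
HOURS_PER_YEAR = 24 * 365


def redundant_availability(component_availability, copies):
    """Availability of `copies` interchangeable components in parallel.

    The group is down only when every copy is down at the same time
    (assumes independent failures).
    """
    return 1 - (1 - component_availability) ** copies


def chained_availability(*availabilities):
    """Availability of components in series: all of them must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a single server that is up 99% of the time is down for roughly 87.6 hours a year, while two such servers in parallel are down together for under an hour. `chained_availability` shows the other side: a redundant server tier placed behind a single non-redundant database is still limited by that database, which is the single point of failure described above.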
By: Sherri Douville CEO & Board Member & Eric Svetcov CTO/CSO at Medigram https://www.medigram.com/