Hi Adam! ITM
As usual, keep it Anon please!
In one of my previous jobs I worked for LloydsTSB in their main data
centre here in Gitmo-east. I can't comment on NatWest's particular
setup, but the story about them screwing up their systems with an
upgrade sounds highly bogative. I literally laughed out loud when I
caught the story on the news, because the idea is simply preposterous.
You have to remember that these are systems that are handling millions
of transactions an hour. In the banking industry you don't count outages
in terms of downtime. When the shit hits the fan, you have a director
standing over you, telling you how many millions of pounds you're losing
an hour. Banks will do anything to prevent this from happening.
As a result, all of the banking systems I've worked with are what's
known as "gold star" systems. This means that not only do you have a
system in a datacentre that writes all data live to disk (and is then
de-staged to tape), and whose individual components are all duplicated
to remain resilient to failure, but the data is also being written in
realtime to another datacentre elsewhere in the country. The two sites
are connected by huge-bandwidth, dedicated fibre links (point to point,
not part of any other communications network), so that in the event of
the datacentre itself "disappearing", an exact duplicate system that
remains live in the other datacentre can take over within milliseconds.
Put simply, they're virtually impossible to take down.
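If you want to picture the setup, here's a toy sketch in Python. The site names and classes are purely illustrative, mine and not anyone's real system (the real thing is mainframes and dedicated replication hardware, not a script), but the principle is the same: a transaction only counts once both sites have it, so the live duplicate can take over the instant the primary vanishes:

```python
# Toy sketch of a "gold star" setup: every write goes synchronously to
# both datacentres, and the secondary is promoted the moment the
# primary "disappears". Illustrative only.

class Datacentre:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.ledger = []          # stands in for live disk + tape de-staging

    def write(self, txn):
        if not self.alive:
            raise IOError(f"{self.name} is down")
        self.ledger.append(txn)

class GoldStarSystem:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def process(self, txn):
        # Synchronous replication: the transaction goes to BOTH sites,
        # so losing a whole datacentre never loses committed data.
        try:
            self.primary.write(txn)
        except IOError:
            # Primary site gone: promote the live duplicate and carry on.
            self.primary, self.secondary = self.secondary, self.primary
            self.primary.write(txn)
            return
        self.secondary.write(txn)

bank = GoldStarSystem(Datacentre("London"), Datacentre("Leeds"))
bank.process("debit #1")
bank.primary.alive = False    # the whole London datacentre goes dark...
bank.process("debit #2")      # ...and processing carries on from Leeds
```

The point of the sketch is the `except` branch: failover isn't a recovery procedure someone runs, it's baked into the write path itself.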
In the specific instance of an upgrade, they would apply the new code to
the secondary, idle system, then test it thoroughly (and I mean
thoroughly: these guys leave nothing to chance, and their testing
routines are incredibly detailed) before switching processing to the
newly installed system. Once the transactions were seen to be processing
live without incident (typically for as long as a week), they would then
apply the upgrade to the remaining system.
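That routine is essentially what the web crowd now calls a blue/green deployment. Here's a toy sketch of the idea in Python; the system objects and the `run_tests` hook are purely illustrative assumptions of mine, not anything a real bank runs:

```python
# Toy sketch of the upgrade routine described above: patch the idle
# system, test it thoroughly, switch live traffic over, watch it for
# a bake period, and only then patch the other box. Illustrative only.
from types import SimpleNamespace

def apply_upgrade(active, idle, new_version, run_tests):
    # 1. Apply the new code to the idle system and test it first.
    idle.version = new_version
    if not run_tests(idle):
        raise RuntimeError("new code failed testing; live system untouched")
    # 2. Switch processing to the newly installed system.
    active, idle = idle, active
    # 3. After watching live transactions without incident (typically
    #    up to a week), apply the upgrade to the remaining system.
    idle.version = new_version
    return active, idle

sys_a = SimpleNamespace(name="A", version="1.0")
sys_b = SimpleNamespace(name="B", version="1.0")
active, idle = apply_upgrade(sys_a, sys_b, "2.0", run_tests=lambda s: True)
```

Note that if testing fails, the exception fires before the switch, so the live system is never touched. That ordering is the whole safety argument.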
With this setup and upgrade method, the chance of ever having an error
that stops you being able to process transactions and keep the profits
rolling in is virtually nonexistent. The only real risk is the primary
system failing while the secondary is being upgraded, and even that
would not take long to resolve. These companies have support contracts
in place worth billions a year, with engineers from those support firms
onsite 24/7 and direct pipelines back to the host companies' warehouses
for spare parts. If that had happened, the chances are it would have
been back up and running within a few hours.
Another reason I doubt that this was an upgrade issue is the timing. The
outage happened midway through the day on a Tuesday. Typically you
perform upgrades in your most off-peak time slots. I have never seen an
upgrade that would be allowed to occur during daylight hours, much less
partway through the week. An upgrade would have been scheduled for a
slot somewhere around 4am on a Saturday, when the market is closed,
cashpoints aren't likely to be used, and the risk of losing any
transactions (and therefore receiving bad publicity) in the event of
communications upset is almost non-existent.
I simply can't believe that this is due to that level of incompetence.
Not when that amount of money is at stake. My initial reaction was that
it was something done to prevent a run on the bank. This could have
something to do with an exploit in the cardless payment system that went
live on the 13th June, or it could simply be a distraction from the
increasing accusations of 'systemic institutionalised fraud', the
alleged involvement in one of the UK's biggest Ponzi schemes, and
allegations of 'racially and religiously profiling Muslims'.
Having said that, we have to consider all angles, and that includes
the Conspiracy angle! Taking into account the wider set of outages
recently I have to wonder if it's also part of something larger:
1st June: Facebook outage (cause unconfirmed)
7th June: Gmail outage (cause unconfirmed)
11th June: Berkeley Campus power outage (cause unconfirmed)
14th June: Amazon Web Services power outage (cause unconfirmed)
14th June: Samsung production lines in Seoul power outage (cause unconfirmed)
19th June: NatWest outage (software upgrade)
20th June: Southwest Airlines power outage (cause unconfirmed)
20th June: Petrochemical complex in Singapore power outage (cause unconfirmed)
22nd June: Twitter outage (cause unconfirmed)
29th June: iCloud, iTunes and iMessage outage (cause unconfirmed)
All of these companies would be expected to have planned around such a
failure to stay resilient. Odd how not one of them is able to get to the
bottom of what actually caused their outage. You'd think they'd be a bit
more concerned about finding the root cause and reporting it as fixed,
wouldn't you? _No-one_ in the lamestream media is taking a step back and
looking at all of these together to see if there's a pattern.
Anyway, that's just 2 cents from an ex-banking sysadmin.
Keep up the Sterling work! (pun intended)