
Paul, ex-banking sysadmin: analysis of NatWest

By Adam Curry. Posted Wednesday, June 27, 2012 at 7:18 PM.

Hi Adam! ITM

As usual, keep it Anon please!

In one of my previous jobs I worked for LloydsTSB in their main data centre here in Gitmo-east. I can't comment on NatWest's particular setup, but the story about them screwing up their systems with an upgrade sounds highly bogative. I literally laughed out loud when I caught the story on the news, because the idea is simply preposterous.

You have to remember that these are systems handling millions of transactions an hour. In the banking industry you don't count outages in terms of downtime; you count them in money lost. When the shit hits the fan, you have a director standing over you telling you how many millions of pounds you're losing an hour. Banks will do anything to prevent this from happening.

As a result, all of the banking systems I've worked with are what's known as "gold star" systems. That means you have a system in a datacentre that writes all data live to disk (and is then de-staged to tape), and whose individual components are all duplicated to remain resilient to failure. On top of that, the data is also written in realtime to another datacentre elsewhere in the country, connected by huge-bandwidth, dedicated fibre links (point to point, not part of any other communications network), so that in the event of the datacentre itself "disappearing", an exact duplicate system that remains live in the other datacentre can take over within milliseconds. The result is a setup that's virtually impossible to take down.
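
To picture what that means in practice, here's a rough Python sketch of the write-to-both-sites-before-confirming pattern described above. The class names, the in-memory "journal" and the failover logic are purely illustrative assumptions on my part, not anybody's actual banking software.

```python
# Illustrative sketch only: a transaction is journalled at two datacentres
# before it is ever confirmed, so either site can take over on its own.

class SiteUnavailable(Exception):
    """Raised when a datacentre cannot confirm a write."""

class DataCentre:
    def __init__(self, name):
        self.name = name
        self.journal = []   # stands in for the live disk (later de-staged to tape)
        self.healthy = True

    def write(self, txn):
        if not self.healthy:
            raise SiteUnavailable(self.name)
        self.journal.append(txn)

class GoldStarSystem:
    """Two live sites; a transaction is only confirmed once both hold it."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def failover(self):
        # The other site already holds every confirmed transaction,
        # so it can take over almost immediately.
        self.primary, self.secondary = self.secondary, self.primary

    def process(self, txn):
        try:
            self.primary.write(txn)
        except SiteUnavailable:
            self.failover()             # the primary "disappeared": promote its twin
            self.primary.write(txn)
        try:
            self.secondary.write(txn)   # realtime replication to the second site
        except SiteUnavailable:
            # Running on a single site is an alarm condition, not an outage.
            print(f"WARNING: {self.secondary.name} is not replicating")
        return "confirmed"

if __name__ == "__main__":
    system = GoldStarSystem(DataCentre("dc-north"), DataCentre("dc-south"))
    print(system.process({"id": 1, "amount": 100}))   # journalled at both sites
    system.primary.healthy = False                    # simulate losing a whole site
    print(system.process({"id": 2, "amount": 250}))   # the surviving site carries on
```

The point of the exercise is that losing an entire building only costs you a failover, never the confirmed transactions.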

In the specific instance of an upgrade, they would apply the new code to the secondary, idle system, test it thoroughly (and I mean thoroughly: these guys leave nothing to chance and their testing routines are incredibly detailed) and then switch processing to the newly installed system. Once the transactions were seen to be processing live without incident (typically for as long as a week), they would then apply the upgrade to the remaining system.
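
Sketched out, that sequence is basically what people now call a blue/green rollout. The System class, the function names and the week-long "soak" check below are my own stand-ins for the example, not the bank's actual change tooling.

```python
# Illustrative sketch of the staged upgrade described above:
# upgrade the idle side, test it, switch traffic, soak, then do the other side.

class System:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.live = False

    def install(self, version):
        self.version = version

    def passes_tests(self):
        # Stand-in for the "incredibly detailed" offline test routines.
        return True

def switch_traffic(from_sys, to_sys):
    from_sys.live, to_sys.live = False, True

def staged_upgrade(active, idle, new_version, soak_ok):
    """Upgrade one side at a time; the live service is never the test bed."""
    idle.install(new_version)           # 1. apply the new code to the idle system only
    if not idle.passes_tests():         # 2. test it thoroughly before it sees traffic
        raise RuntimeError("abort: the live system was never touched")
    switch_traffic(active, idle)        # 3. switch processing to the upgraded system
    if not soak_ok():                   # 4. watch live transactions (typically ~a week)
        switch_traffic(idle, active)    #    anything looks off: switch straight back
        return
    active.install(new_version)         # 5. only then upgrade the remaining system

if __name__ == "__main__":
    north, south = System("dc-north", "v1"), System("dc-south", "v1")
    north.live = True
    staged_upgrade(north, south, "v2", soak_ok=lambda: True)
    for s in (north, south):
        print(s.name, s.version, "live" if s.live else "standby")
```

Either the new code proves itself on the idle side first, or the change never gets near live transactions, which is exactly why a botched upgrade taking the whole service out for days is so hard to believe.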

With this setup and upgrade method, the chances of ever having an error that stops you being able to process transactions and keep the profits rolling in are virtually non-existent. The only real exposure is the primary system failing while the secondary was being upgraded, but even that would not have taken long to resolve. The banks have support contracts in place worth billions a year, and have engineers from those support companies onsite 24/7, with direct pipelines back to the host companies' warehouses for spare parts. If this had happened, the chances are it would have been back up and running within a few hours.

Another reason I doubt that this was an upgrade issue is the timing. The outage happened midway through the day on a Tuesday. Typically you perform upgrades in your most off-peak time slots. I have never seen an upgrade that would be allowed to occur during daylight hours, much less partway through the week. An upgrade would have been scheduled for a slot somewhere around 4am on a Saturday, when the market is closed, cashpoints aren't likely to be used, and the risk of losing any transactions (and therefore receiving bad publicity) in the event of a communications upset is almost non-existent.

I simply can't believe that this is due to that level of incompetence. Not when that amount of money is at stake. My initial reaction was that it was something done to prevent a run on the bank. This could have something to do with an exploit in the cardless payment system that went live on the 13th June, or it could simply be a distraction from the increasing accusations of 'systemic institutionalised fraud', alleged involvement in one of the UK's biggest Ponzi schemes, and allegations of 'racially and religiously profiling Muslims' that have been growing recently.

Having said that, we have to consider all angles, and that includes the Conspiracy angle! Taking into account the wider set of outages recently, I have to wonder if it's also part of something larger:

1st June: Facebook outage (cause unconfirmed)
7th June: Gmail outage (cause unconfirmed)
11th June: Berkeley Campus power outage (cause unconfirmed)
14th June: Amazon Web Services power outage (cause unconfirmed)
14th June: Samsung production lines in Seoul power outage (cause unconfirmed)
19th June: NatWest outage (software upgrade)
20th June: Southwest Airlines power outage (cause unconfirmed)
20th June: Petrochemical complex in Singapore power outage (cause unconfirmed)
22nd June: Twitter outage (cause unconfirmed)
29th June: iCloud, iTunes and iMessage outage (cause unconfirmed)

All of these companies would be expected to have planned around such a failure to stay resilient. Odd how not one of them is able to get to the bottom of what actually caused their outage. You'd think they'd be a bit more concerned about finding the root cause and reporting it as fixed, wouldn't you? _No-one_ in the lamestream media is taking a step back and looking at all of these together to see if there's a pattern.

Anyway, that's just 2 cents from an ex-banking sysadmin.

Keep up the Sterling work! (pun intended)

Paul
