Challenge 11: Design and defend a critical operation

The challenge: Choose one critical operation. Implement complete security controls for that one operation end-to-end. Make it both secure and usable.

Choose your operation

Pick one:

Option 1: Reactor startup

  • Complex multi-step procedure

  • Safety-critical

  • Requires coordination between multiple systems

  • Takes 30-60 minutes

  • Errors can be dangerous

Option 2: Turbine emergency stop

  • Must be fast (seconds matter)

  • Safety-critical

  • Can’t have delays

  • But must prevent unauthorised stops

  • Balance security and speed

Option 3: Safety system bypass

  • Extremely dangerous if abused

  • Legitimate need during maintenance

  • Must be temporary and monitored

  • Requires multiple approvals

  • Automatic revert

Design comprehensive controls

Pre-operation:

  • Who can initiate?

  • What permissions required?

  • Any approvals needed?

  • Preconditions (system state checks)?

Authentication:

  • Single person or dual authorisation?

  • What role is required?

  • Certificate-based? Password? MFA?
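
One way to make the dual-authorisation question concrete is a check that two distinct, authenticated people between them hold the required roles. The sketch below is a minimal illustration only: `Session`, `check_dual_auth`, and the role names are hypothetical and do not come from any existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical authenticated session (not an existing API)."""
    username: str
    role: str
    authenticated: bool = True


def check_dual_auth(first: Session, second: Session,
                    required_roles: frozenset[str]) -> bool:
    """Return True only if two different authenticated users cover the required roles."""
    if not (first.authenticated and second.authenticated):
        return False
    # The two-person rule is meaningless if one person supplies both sessions.
    if first.username == second.username:
        return False
    return required_roles <= {first.role, second.role}
```

With `required_roles = frozenset({"supervisor", "engineer"})`, two supervisors are rejected just like a single user, because the engineer role is not covered.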

Authorisation:

  • What permissions grant access?

  • Time-based (only during maintenance windows)?

  • Location-based (only from control room)?
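
The time-based and location-based questions can be combined with the permission check in a single decision function. This is a sketch under assumed policy values: the 22:00 to 04:00 window, the `control_room` location name, and the function names are all illustrative.

```python
from datetime import datetime, time

# Assumed policy values for illustration only.
MAINTENANCE_START = time(22, 0)
MAINTENANCE_END = time(4, 0)      # window wraps past midnight
ALLOWED_LOCATIONS = {"control_room"}


def in_window(now: datetime, start: time = MAINTENANCE_START,
              end: time = MAINTENANCE_END) -> bool:
    """True if `now` falls inside the window, including windows that wrap midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def is_authorised(permissions: set[str], required: str,
                  now: datetime, location: str) -> bool:
    """All three conditions must hold: permission, time window, location."""
    return (required in permissions
            and in_window(now)
            and location in ALLOWED_LOCATIONS)
```

The wrap-around branch matters in practice: a naive `start <= t < end` comparison would reject every request for an overnight window.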

Initiation:

  • How is operation triggered?

  • Any confirmation required?

  • Any wait period (cooling-off)?
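
A confirmation step and a cooling-off period can be modelled as a two-step trigger: the request is recorded first, and execution is allowed only after the wait and an exact typed confirmation. The class and field names below are hypothetical, and timestamps are plain seconds so the logic can be tested without a real clock.

```python
from dataclasses import dataclass


@dataclass
class PendingOperation:
    """Hypothetical two-step trigger: request first, confirm after a wait period."""
    name: str
    requested_at: float          # seconds, from any monotonic clock
    cooling_off_seconds: float
    confirmation_phrase: str

    def can_execute(self, now: float, typed_phrase: str) -> bool:
        """Allow execution only after the wait period and an exact typed confirmation."""
        waited = now - self.requested_at >= self.cooling_off_seconds
        confirmed = typed_phrase == self.confirmation_phrase
        return waited and confirmed
```

For an operation such as the turbine emergency stop, where seconds matter, `cooling_off_seconds` would presumably be `0.0` and the confirmation reduced to a single action.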

During operation:

  • Monitoring and logging

  • Progress tracking

  • Anomaly detection

  • Ability to abort?

  • Who can abort?

Safety interlocks:

  • What safety checks during operation?

  • Automatic abort conditions?

  • Override procedures?
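
Automatic abort conditions can be expressed as a table of limits that is checked on every monitoring cycle. The parameter names and numeric limits below are invented for illustration; real values would come from the plant's safety analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Allowed range for one measured parameter (values are illustrative)."""
    low: float
    high: float


# Assumed limits; real values come from the plant's safety analysis.
LIMITS = {
    "temperature_c": Limit(0.0, 350.0),
    "pressure_bar": Limit(1.0, 160.0),
}


def abort_reasons(readings: dict[str, float],
                  limits: dict[str, Limit] = LIMITS) -> list[str]:
    """Return every reason to abort; an empty list means the operation may continue."""
    reasons = []
    for name, limit in limits.items():
        value = readings.get(name)
        if value is None:
            # A missing reading is treated as unsafe rather than ignored.
            reasons.append(f"{name}: no reading")
        elif not (limit.low <= value <= limit.high):
            reasons.append(f"{name}: {value} outside [{limit.low}, {limit.high}]")
    return reasons
```

Treating a missing reading as an abort reason is a deliberate choice here: a failed sensor should not look the same as a healthy one.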

Completion:

  • Success criteria

  • Validation checks

  • Automatic revert (for bypass operations)

  • Notification

Post-operation:

  • Logging and audit trail

  • Who did what when?

  • Success or failure?

  • Any anomalies detected?
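
An audit trail only answers "who did what when" if entries cannot be quietly edited afterwards. One common technique is to chain entries by hash, sketched below with the standard `hashlib` and `json` modules. The field names and the helpers `append_entry` and `verify_chain` are assumptions made for this example.

```python
import hashlib
import json


def append_entry(log: list[dict], who: str, action: str,
                 outcome: str, timestamp: str) -> dict:
    """Append an audit entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"who": who, "action": action, "outcome": outcome,
            "timestamp": timestamp, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    entry = {**body, "hash": digest}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

This detects edits and reordering but not truncation of the newest entries, so the latest hash would still need to be stored somewhere the operator cannot modify.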

Emergency scenarios:

  • What if authentication fails?

  • What if safety interlock triggers?

  • What if operation hangs?

  • Break-glass procedures?
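
A break-glass procedure trades prevention for detection: the override is always granted, but it is never silent and always triggers a review. The sketch below shows that idea only. The `BreakGlass` class, its fields, and the rule that the reviewer must be a different person are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class BreakGlass:
    """Hypothetical emergency override: always allowed, never silent."""
    events: list[dict] = field(default_factory=list)

    def invoke(self, who: str, timestamp: str, reason: str = "") -> dict:
        """Record the override and flag it for mandatory review."""
        event = {
            "who": who,
            "timestamp": timestamp,
            # A missing reason does not block the override; it is made visible instead.
            "reason": reason.strip() or "NOT PROVIDED",
            "review_required": True,
        }
        self.events.append(event)
        return event

    def pending_reviews(self) -> list[dict]:
        """Overrides that nobody has justified and signed off yet."""
        return [e for e in self.events if e["review_required"]]

    def close_review(self, event: dict, reviewer: str) -> None:
        """Close a review; the person who broke the glass cannot approve it."""
        if reviewer == event["who"]:
            raise ValueError("override must be reviewed by someone else")
        event["review_required"] = False
        event["reviewed_by"] = reviewer
```

Not blocking on an empty reason keeps the override fast in a real emergency, while the pending-review list makes the operator justify it later.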

Implement and test

Normal operation testing:

  • Authorised user performs operation

  • Everything works smoothly

  • Logging captures all steps

  • Operation completes successfully

Authorisation testing:

  • Unauthorised user attempts operation - blocked

  • Wrong role attempts operation - blocked

  • Dual auth with only one person - blocked
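
The three blocked cases above lend themselves to a table-driven test. The following sketch uses a stand-in `attempt` function and an invented role table so that it runs on its own. In a real project the cases would call the actual authorisation code instead. Each entry in `roles` is assumed to represent one distinct, authenticated person.

```python
# Hypothetical role-to-permission table for the cases below.
ROLE_PERMISSIONS = {
    "supervisor": {"safety_bypass", "reactor_startup"},
    "engineer": {"safety_bypass"},
    "operator": {"reactor_startup"},
    "viewer": set(),
}


def attempt(operation: str, roles: list[str], dual_auth: bool) -> str:
    """Return "allowed" or the reason the attempt was blocked."""
    if not roles:
        return "blocked: unauthenticated"
    if any(operation not in ROLE_PERMISSIONS.get(r, set()) for r in roles):
        return "blocked: wrong role"
    if dual_auth and len(roles) < 2:
        return "blocked: second person required"
    return "allowed"


# (operation, roles present, dual auth required, expected result)
CASES = [
    ("safety_bypass", [], True, "blocked: unauthenticated"),
    ("safety_bypass", ["viewer"], True, "blocked: wrong role"),
    ("safety_bypass", ["supervisor"], True, "blocked: second person required"),
    ("safety_bypass", ["supervisor", "engineer"], True, "allowed"),
]


def run_cases() -> list[str]:
    """Return a description of every case whose result differs from the expectation."""
    failures = []
    for operation, roles, dual, expected in CASES:
        actual = attempt(operation, roles, dual)
        if actual != expected:
            failures.append(f"{operation} {roles}: got {actual!r}, expected {expected!r}")
    return failures
```

Including the one "allowed" case alongside the blocked ones guards against a check that passes simply because it blocks everything.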

Safety testing:

  • Trigger safety interlock during operation

  • Operation should abort safely

  • System returns to safe state

Failure testing:

  • Authentication server down during operation

  • What happens?

  • Can a new operation start?

  • Can an operation already in progress complete?
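
For the authentication-outage case, one possible policy is to fail closed for new operations while letting an operation already in progress continue for a bounded grace period. The sketch below illustrates that policy. It is not a recommendation, and `AuthUnavailable`, the `verify_session` callable, and the 300-second grace value are assumptions.

```python
class AuthUnavailable(Exception):
    """Raised by the (hypothetical) auth client when the server cannot be reached."""


def may_start(verify_session) -> bool:
    """Fail closed: a new operation never starts without a fresh verification."""
    try:
        return bool(verify_session())
    except AuthUnavailable:
        return False


def may_continue(verify_session, last_verified_at: float, now: float,
                 grace_seconds: float = 300.0) -> bool:
    """An operation already in progress continues for a bounded grace period.

    Aborting a half-finished procedure can itself be unsafe, so an outage
    falls back to the last successful verification, but only briefly.
    """
    try:
        return bool(verify_session())
    except AuthUnavailable:
        return now - last_verified_at <= grace_seconds
```

An explicit rejection from a reachable server still stops the operation immediately. Only an unreachable server triggers the grace period.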

Emergency testing:

  • Real emergency requiring immediate action

  • Can you bypass procedures?

  • Is it audited?

  • Can you justify it later?

Usability testing:

  • How long does the secure operation take vs unsecured?

  • Is the delay acceptable?

  • Do operators find it reasonable?

  • Or will they work around it?

What you can learn

Security vs safety:

  • Sometimes they conflict

  • Security can delay safety responses

  • Need emergency overrides

  • But overrides can be abused

  • No perfect answer

Usability vs security:

  • Most secure: lock it down completely

  • Most usable: no controls

  • Reality: somewhere in between

  • Finding balance requires iteration

Operational realities:

  • Procedures look good on paper

  • Reality is messier

  • Emergencies don’t follow procedures

  • Edge cases multiply

  • Need flexibility

Defence in depth for operations:

  • Authentication (who)

  • Authorisation (permission)

  • Dual authorisation (two-person rule)

  • Safety interlocks (prevent physical danger)

  • Monitoring (detect anomalies)

  • Logging (audit trail)

  • Emergency procedures (break-glass)

Where to start

# Choose your operation
# (Recommended: reactor startup or safety bypass)

# Map the operation:
# 1. What are the steps?
# 2. What systems are involved?
# 3. What can go wrong?
# 4. What are the risks?

# Design security controls:
# For each phase (pre, during, post):
# - What checks?
# - What approvals?
# - What monitoring?
# - What logging?

# Consider emergency scenarios:
# - Authentication failure
# - Safety interlock triggers
# - Operation hangs
# - Real emergency requiring immediate action

# Implement incrementally:
# - Start with authentication
# - Add authorisation
# - Add dual auth if needed
# - Add monitoring
# - Add logging
# - Test each addition

# Test thoroughly:
# - Normal operation
# - Unauthorised attempts
# - Failure scenarios
# - Emergency scenarios
# - Usability (is it practical?)

Example: Safety system bypass

Chosen operation: Bypass reactor safety interlock during maintenance

Why it’s critical:

  • Allows maintenance while reactor is hot

  • Removes safety protection

  • Dangerous if abused or forgotten

  • Must be temporary and monitored

Pre-operation controls:

  • Dual authorisation required (supervisor + engineer)

  • Justification required (text field: why are you bypassing?)

  • Maintenance window validation (only allowed during scheduled maintenance)

  • Safety system status check (ensure other interlocks still active)

  • Automatic expiry configured (1 hour default, max 4 hours)

During operation controls:

  • Alarm displayed on all HMIs: “SAFETY BYPASS ACTIVE”

  • Monitoring for any safety parameter violations

  • Logging all operations performed during bypass

  • Ability to abort maintenance and restore safety

  • Countdown timer showing time until automatic revert

Post-operation controls:

  • Automatic revert after time limit

  • Manual restore option (before time limit)

  • Validation that safety system restored

  • Test that safety interlock is functional

  • Audit log entry with: who, when, duration, justification, what was done

  • Report to safety officer

Emergency procedures:

  • If safety parameter exceeds threshold during bypass, automatic revert

  • If reactor enters unsafe state, forced shutdown

  • Emergency button overrides bypass immediately
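
The expiry, manual restore, and emergency rules listed above can be modelled as a small state object that a monitoring loop updates on every cycle. This is a sketch under stated assumptions: the `Bypass` class, the reason strings, and the priority order (emergency button, then parameter violation, then expiry) are illustrative choices.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DURATION_S = 4 * 60 * 60  # 4-hour maximum from the pre-operation controls


@dataclass
class Bypass:
    """Hypothetical model of one active bypass; times are seconds from a monotonic clock."""
    started_at: float
    duration_s: float
    restored_reason: Optional[str] = None

    def __post_init__(self) -> None:
        if not 0 < self.duration_s <= MAX_DURATION_S:
            raise ValueError("duration must be positive and at most 4 hours")

    @property
    def active(self) -> bool:
        """True until the interlock has been restored for any reason."""
        return self.restored_reason is None

    def remaining(self, now: float) -> float:
        """Seconds left on the countdown timer (never negative)."""
        return max(0.0, self.started_at + self.duration_s - now)

    def restore(self, reason: str) -> None:
        """Record the first reason the interlock was restored; later calls are ignored."""
        if self.restored_reason is None:
            self.restored_reason = reason

    def tick(self, now: float, parameter_violation: bool, emergency_button: bool) -> None:
        """Apply the revert rules in priority order on every monitoring cycle."""
        if emergency_button:
            self.restore("emergency_button")
        elif parameter_violation:
            self.restore("parameter_violation")
        elif self.remaining(now) == 0.0:
            self.restore("expired")
```

A manual restore before the time limit is simply `bypass.restore("manual")`, and `remaining(now)` is the value a countdown timer on the HMI would display.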

Implementation:

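# Simplified pseudocode: auth, reactor, hmi, PermissionType, AlarmPriority and
# the log_* / get_user helpers are assumed to be provided elsewhere in the project.
MAX_BYPASS_MINUTES = 240      # Max 4 hours
DEFAULT_BYPASS_MINUTES = 60   # 1 hour default

async def request_safety_bypass(user1_session, user2_session, justification,
                                duration_minutes=DEFAULT_BYPASS_MINUTES):
    # Dual authorisation check
    if not await auth.authorize_with_dual_auth(
        user1_session, user2_session,
        PermissionType.SAFETY_BYPASS, "reactor_1"
    ):
        log_security("Safety bypass denied - insufficient authorisation")
        return False

    # Justification check (an empty or whitespace-only justification is rejected)
    if not justification or not justification.strip():
        log_security("Safety bypass denied - justification missing")
        return False

    # Maintenance window check
    if not in_maintenance_window():
        log_security("Safety bypass denied - not in maintenance window")
        return False

    # Duration limit check (must be positive and at most the maximum)
    if not 0 < duration_minutes <= MAX_BYPASS_MINUTES:
        log_security("Safety bypass denied - duration outside allowed range")
        return False

    # Record who requested the bypass, why, and for how long
    await log_audit(
        "Safety bypass requested",
        user1=get_user(user1_session),
        user2=get_user(user2_session),
        justification=justification,
        duration=duration_minutes
    )

    # Activate bypass (the interlock reverts automatically after duration_minutes)
    await reactor.bypass_safety_interlock("temperature_high", duration_minutes)

    # Start monitoring
    await start_bypass_monitoring("reactor_1", duration_minutes)

    # Display alarm on all HMIs
    await hmi.show_alarm("SAFETY BYPASS ACTIVE", AlarmPriority.HIGH)

    return True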

Testing results:

  • ✓ Dual auth required (single user attempt blocked)

  • ✓ Justification required (empty justification rejected)

  • ✓ Maintenance window enforced (attempt during production blocked)

  • ✓ Duration limited (5-hour request rejected)

  • ✓ Automatic revert after time limit

  • ✓ Emergency revert on safety parameter violation

  • ✓ Complete audit trail

Trade-offs accepted:

  • Adds ~2 minutes to bypass procedure (dual auth, justification)

  • Acceptable for maintenance operations (not emergencies)

  • Bypass stays active until manual restore or timer expiry (can’t auto-detect “maintenance complete”)

  • False alarms possible (parameter violations during normal maintenance)

Residual risks:

  • Two colluding insiders can still abuse bypass

  • Mitigation: Audit review, pattern detection

  • Operator fatigue could lead to an active bypass being forgotten

  • Mitigation: Countdown timer, alarm, automatic revert