Challenge 11: Design and defend a critical operation

The challenge: Choose one critical operation. Implement complete security controls for that one operation end-to-end. Make it both secure and usable.

Choose your operation

Pick one:

Option 1: Reactor startup

Complex multi-step procedure
Safety-critical
Requires coordination between multiple systems
Takes 30-60 minutes
Errors can be dangerous

Option 2: Turbine emergency stop

Must be fast (seconds matter)
Safety-critical
Can’t have delays
But must prevent unauthorised stops
Balance security and speed

Option 3: Safety system bypass

Extremely dangerous if abused
Legitimate need during maintenance
Must be temporary and monitored
Require multiple approvals
Automatic revert

Design comprehensive controls

Pre-operation:

Who can initiate?
What permissions required?
Any approvals needed?
Preconditions (system state checks)?
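One way to make the precondition question concrete is to write each check as a named predicate over a snapshot of plant state, and refuse to start unless every check passes. This is a minimal sketch; the `PlantState` fields (`coolant_flow_ok`, `other_interlocks_active`, `control_rods_inserted`) are hypothetical placeholders, not a real plant interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PlantState:
    # Hypothetical snapshot fields; a real system would read these from the control system.
    coolant_flow_ok: bool
    other_interlocks_active: bool
    control_rods_inserted: bool


# Each precondition is a (name, predicate) pair so failures can be reported by name.
PRECONDITIONS: list[tuple[str, Callable[[PlantState], bool]]] = [
    ("coolant flow normal", lambda s: s.coolant_flow_ok),
    ("other interlocks active", lambda s: s.other_interlocks_active),
    ("control rods inserted", lambda s: s.control_rods_inserted),
]


def failed_preconditions(state: PlantState) -> list[str]:
    """Return the name of every precondition that does not hold (empty list means OK)."""
    return [name for name, check in PRECONDITIONS if not check(state)]
```

Reporting every failed check, rather than stopping at the first, lets the operator fix all blocking conditions in one pass.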

Authentication:

Single person or dual authorisation?
What role is required?
Certificate-based? Password? MFA?
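If you choose dual authorisation, the core rule is that two different people, both strongly authenticated, must together fill two distinct roles. A minimal sketch, assuming a hypothetical `Session` record produced by whatever authentication method (certificate, password plus MFA) you pick:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    # Hypothetical session record; the fields are illustrative assumptions.
    user_id: str
    roles: frozenset[str]
    mfa_verified: bool


def dual_authorised(first: Session, second: Session, role_a: str, role_b: str) -> bool:
    """Two-person rule: two different MFA-verified users who fill role_a and role_b between them."""
    if first.user_id == second.user_id:
        return False  # the same person cannot approve twice
    if not (first.mfa_verified and second.mfa_verified):
        return False
    # Either person may fill either role, but both roles must be filled by different people.
    return (role_a in first.roles and role_b in second.roles) or (
        role_b in first.roles and role_a in second.roles
    )
```

For the bypass example later in this challenge, `role_a` and `role_b` would be the supervisor and engineer roles.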

Authorisation:

What permissions grant access?
Time-based (only during maintenance windows)?
Location-based (only from control room)?
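Time-based and location-based conditions can be checked together as part of the authorisation decision. A minimal sketch, assuming a list of scheduled `MaintenanceWindow`s and a hypothetical allow-list of control-room console IDs; in practice the console ID is only as trustworthy as the mechanism that asserts it (for example a device certificate).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime  # exclusive


# Hypothetical allow-list of consoles physically located in the control room.
CONTROL_ROOM_CONSOLES = {"CR-01", "CR-02"}


def context_allows(now: datetime, console_id: str,
                   windows: list[MaintenanceWindow]) -> tuple[bool, str]:
    """Return (allowed, reason); both the location and the time condition must hold."""
    if console_id not in CONTROL_ROOM_CONSOLES:
        return False, f"console {console_id} is not in the control room"
    if not any(w.start <= now < w.end for w in windows):
        return False, "not inside a scheduled maintenance window"
    return True, "ok"
```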

Initiation:

How is operation triggered?
Any confirmation required?
Any wait period (cooling-off)?
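A typed confirmation combined with a cooling-off period turns initiation into two deliberate steps. A minimal sketch; `PendingRequest`, the confirmation phrase and the 2-minute `COOLING_OFF` value are all illustrative assumptions. A delay like this suits a planned bypass, not an emergency stop, where seconds matter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingRequest:
    # Hypothetical request created when an operator first initiates the operation.
    requested_at: datetime
    confirmation_phrase: str
    confirmed: bool = False


COOLING_OFF = timedelta(minutes=2)  # assumed value; tune per operation


def confirm(req: PendingRequest, typed: str, now: datetime) -> str:
    """Accept the typed confirmation only after the cooling-off period has elapsed."""
    elapsed = now - req.requested_at
    if elapsed < COOLING_OFF:
        remaining = COOLING_OFF - elapsed
        return f"wait {int(remaining.total_seconds())}s before confirming"
    if typed != req.confirmation_phrase:
        return "confirmation text does not match"
    req.confirmed = True
    return "confirmed"
```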

During operation:

Monitoring and logging
Progress tracking
Anomaly detection
Ability to abort?
Who can abort?

Safety interlocks:

What safety checks during operation?
Automatic abort conditions?
Override procedures?

Completion:

Success criteria
Validation checks
Automatic revert (for bypass operations)
Notification

Post-operation:

Logging and audit trail
Who did what when?
Success or failure?
Any anomalies detected?
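An audit entry should answer who, what, when, outcome and anomalies. The sketch below also chains entries by hash so that editing or deleting an earlier entry is detectable; hash chaining is one possible tamper-evidence technique, not something this challenge prescribes. It is an in-memory illustration only: a real log would go to write-once storage, and the latest hash would be anchored somewhere an insider cannot rewrite, otherwise the whole chain could be recomputed.

```python
import hashlib
import json
from datetime import datetime


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only audit log; each entry records the hash of the previous entry."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, who: str, what: str, outcome: str,
               anomalies: list[str], when: datetime) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"who": who, "what": what, "when": when.isoformat(),
                "outcome": outcome, "anomalies": list(anomalies), "prev": prev_hash}
        body["hash"] = _digest(body)
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute every hash; any edited, removed or reordered entry breaks the chain."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```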

Emergency scenarios:

What if authentication fails?
What if safety interlock triggers?
What if operation hangs?
Break-glass procedures?
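A break-glass path skips the normal approvals, but it should never skip recording and review. A minimal sketch with hypothetical `BreakGlass` and `BreakGlassEvent` types: the caller performs the emergency action only after `invoke` returns, so the record always exists first, and every event stays in a review queue until someone signs it off.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class BreakGlassEvent:
    user_id: str
    reason: str
    at: datetime
    reviewed: bool = False


@dataclass
class BreakGlass:
    """Emergency path: no approvals required, but a reason and a later review are."""
    events: list[BreakGlassEvent] = field(default_factory=list)

    def invoke(self, user_id: str, reason: str, at: datetime) -> BreakGlassEvent:
        if not reason.strip():
            raise ValueError("break-glass requires a stated reason")
        event = BreakGlassEvent(user_id, reason.strip(), at)
        self.events.append(event)  # recorded before the caller takes any action
        return event

    def pending_review(self) -> list[BreakGlassEvent]:
        """Events nobody has yet justified after the fact."""
        return [e for e in self.events if not e.reviewed]
```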

Implement and test

Normal operation testing:

Authorised user performs operation
Everything works smoothly
Logging captures all steps
Operation completes successfully

Authorisation testing:

Unauthorised user attempts operation - blocked
Wrong role attempts operation - blocked
Dual auth with only one person - blocked

Safety testing:

Trigger safety interlock during operation
Operation should abort safely
System returns to safe state

Failure testing:

Authentication server down during operation
What happens?
Can operation proceed?
Can operation complete?
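One possible answer to "authentication server down" is to decide by direction: actions that move the plant toward a safe state (abort, restore an interlock) proceed and are logged for review, while actions that remove protection (start a bypass) fail closed. That way an outage can never trap the plant in a bypassed state. A minimal sketch of that policy; `Direction`, `AuthUnavailable` and the `check_auth` callable are hypothetical.

```python
from enum import Enum
from typing import Callable


class Direction(Enum):
    TOWARD_SAFETY = "toward_safety"        # e.g. abort, restore interlock
    AWAY_FROM_SAFETY = "away_from_safety"  # e.g. start bypass, start reactor


class AuthUnavailable(Exception):
    """Raised by the (hypothetical) authentication client when the server is unreachable."""


def may_proceed(direction: Direction, check_auth: Callable[[], bool]) -> tuple[bool, str]:
    """Decide what happens when authentication itself cannot be performed."""
    try:
        authorised = check_auth()
    except AuthUnavailable:
        if direction is Direction.TOWARD_SAFETY:
            return True, "auth unavailable: allowed because action restores safety (log for review)"
        return False, "auth unavailable: denied because action removes protection"
    # When the server answers, its decision stands in both directions.
    return authorised, "authorised" if authorised else "denied"
```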

Emergency testing:

Real emergency requiring immediate action
Can you bypass procedures?
Is it audited?
Can you justify it later?

Usability testing:

How long does the secured operation take compared with the unsecured one?
Is the delay acceptable?
Do operators find it reasonable?
Or will they work around it?

What you can learn

Security vs safety:

Sometimes they conflict
Security can delay safety responses
Need emergency overrides
But overrides can be abused
No perfect answer

Usability vs security:

Most secure: lock it down completely
Most usable: no controls
Reality: somewhere in between
Finding balance requires iteration

Operational realities:

Procedures look good on paper
Reality is messier
Emergencies don’t follow procedures
Edge cases multiply
Need flexibility

Defence in depth for operations:

Authentication (who)
Authorisation (permission)
Dual authorisation (two-person rule)
Safety interlocks (prevent physical danger)
Monitoring (detect anomalies)
Logging (audit trail)
Emergency procedures (break-glass)
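The gatekeeping layers in this list can be expressed as independent checks that a request must pass one after another. A minimal sketch; the request fields and the four `LAYERS` are hypothetical simplifications. Monitoring, logging and break-glass are omitted because they observe or bypass the pipeline rather than gate it. This version deliberately evaluates every layer instead of stopping at the first failure, so the audit log shows all controls that would have blocked the request; a stricter design might short-circuit after authentication fails.

```python
from typing import Callable, Optional

# A layer inspects a request and returns None if it passes, or a reason if it blocks.
Layer = Callable[[dict], Optional[str]]


def run_layers(request: dict, layers: list[tuple[str, Layer]]) -> tuple[bool, list[str]]:
    """Evaluate every layer and report all failures, not just the first."""
    failures = []
    for name, layer in layers:
        reason = layer(request)
        if reason is not None:
            failures.append(f"{name}: {reason}")
    return (not failures), failures


# Hypothetical layers mirroring the list above; each is deliberately simple.
LAYERS: list[tuple[str, Layer]] = [
    ("authentication", lambda r: None if r.get("authenticated") else "not authenticated"),
    ("authorisation", lambda r: None if "SAFETY_BYPASS" in r.get("permissions", ()) else "missing permission"),
    ("dual authorisation", lambda r: None if len(set(r.get("approvers", ()))) >= 2 else "needs two distinct approvers"),
    ("safety interlocks", lambda r: None if r.get("other_interlocks_active") else "other interlocks not active"),
]
```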

Where to start

# Choose your operation
# (Recommend reactor startup or safety bypass)

# Map the operation:
# 1. What are the steps?
# 2. What systems are involved?
# 3. What can go wrong?
# 4. What are the risks?

# Design security controls:
# For each phase (pre, during, post):
# - What checks?
# - What approvals?
# - What monitoring?
# - What logging?

# Consider emergency scenarios:
# - Authentication failure
# - Safety interlock triggers
# - Operation hangs
# - Real emergency requiring immediate action

# Implement incrementally:
# - Start with authentication
# - Add authorisation
# - Add dual auth if needed
# - Add monitoring
# - Add logging
# - Test each addition

# Test thoroughly:
# - Normal operation
# - Unauthorised attempts
# - Failure scenarios
# - Emergency scenarios
# - Usability (is it practical?)

Example: Safety system bypass

Chosen operation: Bypass reactor safety interlock during maintenance

Why it’s critical:

Allows maintenance while reactor is hot
Removes safety protection
Dangerous if abused or forgotten
Must be temporary and monitored

Pre-operation controls:

Dual authorisation required (supervisor + engineer)
Justification required (text field: why are you bypassing?)
Maintenance window validation (only allowed during scheduled maintenance)
Safety system status check (ensure other interlocks still active)
Automatic expiry configured (1 hour default, max 4 hours)

During operation controls:

Alarm displayed on all HMIs: “SAFETY BYPASS ACTIVE”
Monitoring for any safety parameter violations
Logging all operations performed during bypass
Ability to abort maintenance and restore safety
Countdown timer showing time until automatic revert
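The countdown shown on the HMIs only needs the expiry time and the current time. A minimal sketch; the `countdown_text` function and the banner wording are illustrative assumptions. Rounding up means the display never shows 00:00:00 while the bypass is still active.

```python
import math
from datetime import datetime, timedelta


def countdown_text(expires_at: datetime, now: datetime) -> str:
    """Banner text for the HMI, counting down to the automatic revert."""
    remaining = max(expires_at - now, timedelta(0))  # never show a negative time
    total = math.ceil(remaining.total_seconds())     # round up partial seconds
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"SAFETY BYPASS ACTIVE - auto-revert in {hours:02d}:{minutes:02d}:{seconds:02d}"
```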

Post-operation controls:

Automatic revert after time limit
Manual restore option (before time limit)
Validation that safety system restored
Test that safety interlock is functional
Audit log entry with: who, when, duration, justification, what was done
Report to safety officer
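Automatic revert and manual restore reduce to one rule: the bypass is active only until it is restored by hand or its expiry passes, whichever comes first. A minimal sketch using the 1-hour default and 4-hour maximum from the pre-operation controls; the `Bypass` class is hypothetical, and the validation and interlock test after restore are not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_DURATION = timedelta(hours=1)
MAX_DURATION = timedelta(hours=4)


@dataclass
class Bypass:
    started_at: datetime
    duration: timedelta = DEFAULT_DURATION
    restored_manually: bool = False

    def __post_init__(self) -> None:
        # Reject durations the policy does not allow, at creation time.
        if not (timedelta(0) < self.duration <= MAX_DURATION):
            raise ValueError("duration must be positive and at most 4 hours")

    def expires_at(self) -> datetime:
        return self.started_at + self.duration

    def restore(self) -> None:
        """Manual restore before the time limit."""
        self.restored_manually = True

    def is_active(self, now: datetime) -> bool:
        """Active until manually restored or the expiry passes, whichever comes first."""
        return not self.restored_manually and now < self.expires_at()
```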

Emergency procedures:

If safety parameter exceeds threshold during bypass, automatic revert
If reactor enters unsafe state, forced shutdown
Emergency button overrides bypass immediately
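The three emergency rules above can be collapsed into one decision function in which the most severe applicable response wins. A minimal sketch; the `Action` names and the severity ordering are assumptions, and the forced-shutdown path is assumed to restore interlocks as part of the shutdown. None of these paths asks for authentication, because each only moves the plant toward a safer state.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue bypass"
    REVERT_BYPASS = "revert bypass (restore interlock)"
    FORCED_SHUTDOWN = "forced shutdown"


def emergency_action(parameter_exceeded: bool, reactor_unsafe: bool,
                     emergency_button: bool) -> Action:
    """Map the emergency conditions to a single response, most severe first."""
    if reactor_unsafe:
        return Action.FORCED_SHUTDOWN
    if parameter_exceeded or emergency_button:
        return Action.REVERT_BYPASS
    return Action.CONTINUE
```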

Implementation:

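# Simplified pseudocode: auth, reactor, hmi, PermissionType, AlarmPriority and the
# log_*/in_maintenance_window/get_user/start_bypass_monitoring helpers are assumed
# to be provided by the surrounding control system.
DEFAULT_BYPASS_MINUTES = 60   # 1 hour default
MAX_BYPASS_MINUTES = 240      # 4 hour maximum

async def request_safety_bypass(user1_session, user2_session, justification,
                                duration_minutes=DEFAULT_BYPASS_MINUTES):
    # Dual authorisation check (two distinct users, both holding SAFETY_BYPASS)
    if not await auth.authorize_with_dual_auth(
        user1_session, user2_session,
        PermissionType.SAFETY_BYPASS, "reactor_1"
    ):
        log_security("Safety bypass denied - insufficient authorisation")
        return False

    # Justification check: an empty or whitespace-only reason is rejected
    if not justification or not justification.strip():
        log_security("Safety bypass denied - no justification given")
        return False

    # Maintenance window check
    if not in_maintenance_window():
        log_security("Safety bypass denied - not in maintenance window")
        return False

    # Duration limit check: must be positive and at most 4 hours
    if not 0 < duration_minutes <= MAX_BYPASS_MINUTES:
        log_security("Safety bypass denied - duration outside allowed range")
        return False

    # Safety system status check: every other interlock must still be active
    # (other_interlocks_active is a hypothetical helper)
    if not await reactor.other_interlocks_active(except_interlock="temperature_high"):
        log_security("Safety bypass denied - other interlocks not all active")
        return False

    # Record the request and justification before anything changes
    await log_audit(
        "Safety bypass requested",
        user1=get_user(user1_session),
        user2=get_user(user2_session),
        justification=justification.strip(),
        duration=duration_minutes
    )

    # Activate bypass (assumed to schedule the automatic revert at expiry)
    await reactor.bypass_safety_interlock("temperature_high", duration_minutes)

    # Start monitoring: reverts the bypass early on any safety parameter violation
    await start_bypass_monitoring("reactor_1", duration_minutes)

    # Display alarm on all HMIs
    await hmi.show_alarm("SAFETY BYPASS ACTIVE", AlarmPriority.HIGH)

    return True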

Testing results:

✓ Dual auth required (single user attempt blocked)
✓ Justification required (empty justification rejected)
✓ Maintenance window enforced (attempt during production blocked)
✓ Duration limited (5-hour request rejected)
✓ Automatic revert after time limit
✓ Emergency revert on safety parameter violation
✓ Complete audit trail

Trade-offs accepted:

Adds ~2 minutes to bypass procedure (dual auth, justification)
Acceptable for maintenance operations (not emergencies)
Manual restore required when maintenance finishes early (the system can’t auto-detect “maintenance complete”, so otherwise the bypass stays active until the timer expires)
False alarms possible (parameter violations during normal maintenance)

Residual risks:

Two colluding insiders can still abuse bypass
Mitigation: Audit review, pattern detection
Operator fatigue could lead to an active bypass being forgotten once maintenance is done
Mitigation: Countdown timer, alarm, automatic revert