Prismo: Pre-Mortem Intelligence for Shifting Left on Reliability with AI


How to fail “on paper” rather than in production — and why that mindset matters


A Personal Story: How Prismo Shaped My MVP

Before diving into the technical details, let me share why I built Prismo.

I was working on a new product launch at Microsoft, and leadership asked for an MVP estimate. Like many engineers, I had a rough idea of what needed to be built, but I lacked clarity on what could go wrong and where to invest our limited time.

Using an early version of Prismo, I ran a pre-mortem analysis on our proposed architecture. Within 20 minutes, the tool identified 47 potential failure modes I hadn’t considered. More importantly, it helped me prioritize what actually mattered for the MVP.

Instead of building everything, I focused on the top 5 critical risks:

  • Secret rotation automation (RPN: 245)
  • Multi-region failover (RPN: 210)
  • Rate limiting and backpressure (RPN: 189)
  • Backup verification pipeline (RPN: 156)
  • Centralized logging and alerting (RPN: 152)

These five items became our MVP scope. We shipped on time, with the confidence that we’d addressed the highest-impact risks first. The product hasn’t had a major incident in production since launch.

That experience taught me: pre-mortems aren’t just risk identification — they’re prioritization tools. Prismo helped me fail on paper, so I didn’t have to fail in production.

Now, let me show you how it works.


Why Pre-Mortem? The Shift Left Mindset

As engineers, we systematically and repeatedly look at why our systems fail and how we can make them better. Pre-mortems are one technique to help us make better choices.

Continuous improvement doesn’t just have to be about fixing known bugs — it can be about actively thinking of what can go wrong before it happens. As we feel the pressure to add new features and expand our services, it’s important that our culture continues to change from reactive to proactive.

We as SRE teams can help. Embrace the fail whale!

What is Shift Left?

The concept is simple: move quality activities earlier in the development lifecycle.

QUALITY DEVELOPMENT LIFECYCLE

    [PREMORTEM] --> [HEARTBEAT] --> [RETROSPECTIVE] --> [POSTMORTEM]
        |              |                 |                   |
     Proactive      Monitor           Improve            Reactive
        |                                                   |
        v                                                   v
   Fail on PAPER                                    Fail in PRODUCTION

Traditional teams spend most of their energy on the right side — reacting to incidents, writing postmortems, and firefighting. Shifting left means investing more in the proactive side.

Why Shift Left Matters

Pre-mortems are part of a growing desire to shift left in the quality development lifecycle:

| Benefit | Description |
| --- | --- |
| Fail on paper, not production | Identify issues before they impact customers |
| Find surprises early | Discover risks before they become incidents |
| Define mitigations proactively | Have playbooks ready when things go wrong |
| Raise awareness | Make the whole team conscious of system risks |
| Encourage improvement over quick fixes | Build lasting solutions, not band-aids |
| Free up time | Expose technical debt before it becomes urgent |

Research supports this approach. Work on prospective hindsight, the technique behind pre-mortems, suggests that imagining a failure has already happened helps teams name more of the reasons it could occur, including risks that would otherwise have come as a surprise.


From Reactive to Proactive: A Cultural Shift

Moving from reactive to proactive is as much a cultural change as a technical one.

The Engineering Mindset Shift

| Reactive (Old Way) | Proactive (New Way) |
| --- | --- |
| Fire-fighting | Prevent problems |
| Post-incident learning | Pre-incident planning |
| Hope it works | Data-driven decisions |
| Fix when broken | Fix before it breaks |
| Respond to alerts | Anticipate failures |

The goal is simple: embrace the fail whale. Learn from failures before they happen in production.


Introducing Prismo: AI-Powered Pre-Mortem Analysis

Manual pre-mortem analysis has served us well, but it has limitations. That’s why I built Prismo — an AI-powered tool that refracts your architecture into a spectrum of risks.

The Prism Metaphor

Just like a prism breaks white light into a visible spectrum, Prismo breaks your architecture into a spectrum of risks — revealing hidden failure modes, categorizing them by type, and making the invisible visible.

    YOUR ARCHITECTURE (white light)
            |
            v
        [PRISMO] <- AI Analysis
            |
            v
    SPECTRUM OF RISKS
    (Separated, categorized, prioritized)

The Problem with Manual Pre-Mortems

Traditional pre-mortem sessions look like this:

  • Schedule 2-hour brainstorming meetings
  • Gather engineers in a room (or Zoom)
  • Play “assessment poker” with sticky notes
  • Manually calculate RPN scores
  • Transfer everything to Excel spreadsheets
  • Repeat quarterly (if you remember)

This process works, but it’s slow, subjective, and hard to maintain.

Prismo: A Better Approach

Prismo takes your architecture description and automatically:

  1. Refracts your architecture into failure modes
  2. Categorizes risks by type (like a spectrum)
  3. Calculates objective Severity, Occurrence, and Detection (SOD) scores
  4. Generates tactical and strategic mitigations
  5. Populates the FMEA (Failure Mode and Effects Analysis) worksheet

AI vs Manual: A Direct Comparison

Here’s how Prismo compares to traditional manual pre-mortem analysis:

Time and Effort

| Metric | Manual Approach | Prismo |
| --- | --- | --- |
| Time to complete | 4 weeks | 20 minutes |
| Meetings required | 6-8 sessions | 0 (async) |
| Engineers involved | 5-10 per session | 1 (to review) |
| Update frequency | Quarterly | Continuous |

Quality and Coverage

| Metric | Manual Approach | Prismo |
| --- | --- | --- |
| Risks identified | 10-15 average | 30-50 average |
| Scoring consistency | Varies by team | Standardized |
| Coverage | Point-in-time | Always current |
| Bias | Anchoring, groupthink | Objective |
| Historical learning | Limited | Pattern database |

Output Quality

| Metric | Manual Approach | Prismo |
| --- | --- | --- |
| Documentation | Static Excel | Dynamic, queryable |
| Mitigation suggestions | Team-dependent | Best practices library |
| Priority ranking | Subjective | RPN-based, consistent |
| Tracking | Manual follow-up | Integrated workflow |

The Math

Manual Process:
- 4 weeks elapsed time
- 40+ person-hours invested
- 60% risk coverage (estimated)
- Quarterly refresh cycle

Prismo:
- 20 minutes elapsed time
- 2 person-hours (review + refinement)  
- 85%+ risk coverage
- Continuous monitoring
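
As a quick check on the figures above, the relative savings can be worked out directly. This is only a sketch: it assumes "4 weeks elapsed" means calendar time and takes "40+ person-hours" at its lower bound of 40. The helper name `percent_reduction` is illustrative and not part of Prismo.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # calendar minutes (assumption about "elapsed time")


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100 * (before - after) / before


elapsed = percent_reduction(4 * MINUTES_PER_WEEK, 20)  # 4 weeks vs 20 minutes
effort = percent_reduction(40, 2)                      # 40 vs 2 person-hours
coverage_gain = 85 - 60                                # percentage points

print(f"Elapsed time reduced by {elapsed:.2f}%")
print(f"Person-hours reduced by {effort:.0f}%")
print(f"Risk coverage up by {coverage_gain} percentage points")
```

Under those assumptions the elapsed-time saving rounds to well over 99%, and the effort saving comes to 95%.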

A Simple Example: Library Management System

Let’s see how Prismo works with a straightforward application — a Library Management System.

The Architecture

                    LIBRARY MANAGEMENT SYSTEM

                       +-------------+
                       |   WEB APP   |
                       |  (Frontend) |
                       +------+------+
                              |
                              | HTTPS
                              v
                       +-------------+
                       |  REST API   |
                       |  (Backend)  |
                       +------+------+
                              |
           +------------------+------------------+
           |                  |                  |
           v                  v                  v
    +------------+     +------------+     +-------------+
    |  KEY VAULT |     |   REDIS    |     | DOCUMENT DB |
    |  (Secrets) |     |   CACHE    |     |  (Storage)  |
    +------------+     +------------+     +-------------+
    
    - API Keys         - Session Data     - Books Catalog
    - DB Credentials   - Search Cache     - User Records  
    - Certificates     - Book Inventory   - Borrow History

The Flow: Architecture to FMEA

STEP 1: INPUT
Engineer provides architecture description

    "Library app with Web Frontend, REST API, Redis Cache,
     Document DB for storage, and Key Vault for secrets.
     Users can search books, borrow items, manage accounts."

                              |
                              v

STEP 2: PRISMO AI ANALYSIS

    +--------------------------------------------------+
    |                    PRISMO                        |
    |                                                  |
    |   [REFRACT]  -->  [CATEGORIZE]  -->  [SCORE]    |
    |    Architecture      by Type         S x O x D  |
    |    into Risks     (Like spectrum)               |
    |                                                  |
    +--------------------------------------------------+

                              |
                              v

STEP 3: AUTO-POPULATED FMEA

    Complete worksheet with risks, scores, and mitigations
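
To make the three steps concrete, here is a toy sketch of the same flow. It is not Prismo's implementation: the AI step is replaced by a hand-written lookup table, and every name (`KNOWN_RISKS`, `Row`, `build_worksheet`) is hypothetical. The failure modes, categories, and scores are copied from the library example in this post.

```python
from typing import NamedTuple


class Row(NamedTuple):
    component: str
    failure_mode: str
    category: str
    rpn: int


# Hypothetical stand-in for the AI analysis: component -> (mode, category, (S, O, D))
KNOWN_RISKS = {
    "Key Vault": ("Secret expiration not monitored", "Authentication", (9, 4, 7)),
    "Document DB": ("Single region deployment", "Blast Radius / SPOF", (8, 3, 8)),
    "Redis Cache": ("Cache invalidation failure", "Data Management", (5, 6, 5)),
    "REST API": ("No rate limiting", "Scalability", (7, 4, 4)),
}


def build_worksheet(description: str) -> list[Row]:
    """Step 1 takes the description, step 2 matches and scores, step 3 returns rows."""
    text = description.lower()
    rows = []
    for component, (mode, category, (s, o, d)) in KNOWN_RISKS.items():
        if component.lower() in text:  # "refract": find components that are mentioned
            rows.append(Row(component, mode, category, s * o * d))  # categorize + score
    return sorted(rows, key=lambda row: row.rpn, reverse=True)


description = (
    "Library app with Web Frontend, REST API, Redis Cache, "
    "Document DB for storage, and Key Vault for secrets."
)
worksheet = build_worksheet(description)
for row in worksheet:
    print(f"{row.rpn:>3}  {row.component:<12} {row.failure_mode}")
```

A keyword lookup obviously cannot discover risks it was not told about. The point is only to show the shape of the data moving from description to ranked worksheet.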

Prismo Output: FMEA Worksheet

The AI analyzes the library system and produces this FMEA:

Identified Risks

| ID | Failure Point | Failure Mode | Effect | S | O | D | RPN | Priority |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LIB-001 | Key Vault | Secret expiration not monitored | API cannot connect to DB, total outage | 9 | 4 | 7 | 252 | Critical |
| LIB-002 | Document DB | Single region deployment | Complete data loss if region fails | 8 | 3 | 8 | 192 | Medium |
| LIB-003 | Redis Cache | Cache invalidation failure | Stale book availability shown | 5 | 6 | 5 | 150 | Medium |
| LIB-004 | REST API | No rate limiting | API overwhelmed during peak times | 7 | 4 | 4 | 112 | Medium |
| LIB-005 | Web App | No health checks | Failed deployments not detected | 6 | 3 | 6 | 108 | Medium |
| LIB-006 | Document DB | No backup verification | Data loss if corruption occurs | 9 | 2 | 9 | 162 | Medium |
| LIB-007 | All Components | No centralized logging | Slow incident detection, extended outage | 6 | 5 | 7 | 210 | Critical |
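
If you want to work with a worksheet like this in code, one possible representation is a small record type with the RPN derived from its three scores. This is a sketch only: the `Risk` class and its field names are my own illustration, not Prismo's data model. The rows are the seven from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    risk_id: str
    failure_point: str
    failure_mode: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        # Derived rather than stored, so it cannot drift from the S/O/D scores
        return self.severity * self.occurrence * self.detection


WORKSHEET = [
    Risk("LIB-001", "Key Vault", "Secret expiration not monitored", 9, 4, 7),
    Risk("LIB-002", "Document DB", "Single region deployment", 8, 3, 8),
    Risk("LIB-003", "Redis Cache", "Cache invalidation failure", 5, 6, 5),
    Risk("LIB-004", "REST API", "No rate limiting", 7, 4, 4),
    Risk("LIB-005", "Web App", "No health checks", 6, 3, 6),
    Risk("LIB-006", "Document DB", "No backup verification", 9, 2, 9),
    Risk("LIB-007", "All Components", "No centralized logging", 6, 5, 7),
]

# Highest RPN first: this is the order in which to review the worksheet
ranked = sorted(WORKSHEET, key=lambda risk: risk.rpn, reverse=True)
for risk in ranked:
    print(f"{risk.risk_id}  RPN {risk.rpn:>3}  {risk.failure_mode}")
```

Sorting by the derived RPN puts LIB-001 and LIB-007 at the top, matching the two Critical rows.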

Risk Categories Detected

| Category | Count | Examples |
| --- | --- | --- |
| Authentication | 1 | Secret expiration |
| Blast Radius / SPOF | 1 | Single region deployment |
| Monitoring & Detection | 2 | No health checks, no logging |
| Data Management | 2 | Backup verification, cache invalidation |
| Scalability | 1 | No rate limiting |

Prismo separates risks into a visible spectrum — just like white light through a prism.
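
The category counts can be reproduced with a simple tally. In this sketch the mapping from risk ID to category is inferred from the Examples column above, so treat it as an assumption rather than Prismo's actual output format.

```python
from collections import Counter

# Risk ID -> category, inferred from the "Examples" column (assumption)
CATEGORY_BY_RISK = {
    "LIB-001": "Authentication",
    "LIB-002": "Blast Radius / SPOF",
    "LIB-003": "Data Management",
    "LIB-004": "Scalability",
    "LIB-005": "Monitoring & Detection",
    "LIB-006": "Data Management",
    "LIB-007": "Monitoring & Detection",
}

counts = Counter(CATEGORY_BY_RISK.values())
for category, count in counts.most_common():
    print(f"{count}  {category}")
```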


Generated Mitigations

Prismo doesn’t just identify risks — it suggests what to do about them.

LIB-001: Key Vault Secret Expiration (RPN: 252)

Tactical (Do Now):

  • Set calendar reminders for secret expiry dates
  • Document manual rotation procedure
  • Create break-glass emergency access procedure

Strategic (Plan):

  • Implement automated secret rotation
  • Add expiry monitoring alerts (30/14/7 days before)
  • Migrate to managed identities where possible
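
The 30/14/7-day expiry alerts could be driven by a check along these lines. This is a minimal sketch using only the standard library. The function name `alert_threshold` and the idea of passing the expiry date in directly are assumptions, and a real setup would read expiry dates from your secret store.

```python
from datetime import date
from typing import Optional

ALERT_THRESHOLDS_DAYS = (30, 14, 7)


def alert_threshold(expires_on: date, today: date) -> Optional[int]:
    """Return the tightest threshold (in days) already reached, or None if none is."""
    days_left = (expires_on - today).days
    reached = [t for t in ALERT_THRESHOLDS_DAYS if days_left <= t]
    # An already-expired secret has negative days_left and reports the 7-day level
    return min(reached) if reached else None


expiry = date(2025, 3, 1)
print(alert_threshold(expiry, date(2025, 1, 1)))   # 59 days left: no alert yet
print(alert_threshold(expiry, date(2025, 2, 20)))  # 9 days left: 14-day alert
print(alert_threshold(expiry, date(2025, 2, 25)))  # 4 days left: 7-day alert
```

Returning the tightest threshold lets the caller escalate as the date approaches, for example from a ticket at 30 days to a page at 7.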

Expected Impact:

  • Detection improves: 7 to 2
  • New RPN: 9 x 4 x 2 = 72 (71% reduction)

LIB-007: No Centralized Logging (RPN: 210)

Tactical (Do Now):

  • Enable basic logging on all components
  • Create manual daily log review checklist
  • Set up email alerts for critical errors

Strategic (Plan):

  • Implement Application Insights across all services
  • Create unified dashboard for system health
  • Configure automated alerting for anomalies

Expected Impact:

  • Detection improves: 7 to 2
  • Occurrence improves: 5 to 3
  • New RPN: 6 x 3 x 2 = 36 (83% reduction)
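
The two Expected Impact calculations above follow the same arithmetic: recompute the RPN with the improved scores and compare it to the original. A small sketch, where `rpn_change` is an illustrative helper name:

```python
from math import prod


def rpn_change(before, after):
    """Return (old RPN, new RPN, percent reduction) for two (S, O, D) triples."""
    old, new = prod(before), prod(after)
    return old, new, 100 * (old - new) / old


# (S, O, D) before and after mitigation, taken from the two examples above
for risk_id, before, after in [
    ("LIB-001", (9, 4, 7), (9, 4, 2)),
    ("LIB-007", (6, 5, 7), (6, 3, 2)),
]:
    old, new, pct = rpn_change(before, after)
    print(f"{risk_id}: {old} -> {new} ({pct:.0f}% reduction)")
# LIB-001: 252 -> 72 (71% reduction)
# LIB-007: 210 -> 36 (83% reduction)
```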

Understanding the RPN Score

The Risk Priority Number determines where to focus your efforts.

The Formula

RPN = Severity x Occurrence x Detection

Where:
- Severity (S): How bad is the impact? (1-10, where 10 is catastrophic)
- Occurrence (O): How often does it happen? (1-10, where 10 is constant)
- Detection (D): Can we catch it before customers notice? (1-10, where 10 means we cannot)
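
In code, the formula is a one-line product. The sketch below adds a range check on each score. The function name `rpn` and the choice to raise `ValueError` on out-of-range input are my own and are not taken from Prismo.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: the product of three scores, each from 1 to 10."""
    scores = {"severity": severity, "occurrence": occurrence, "detection": detection}
    for name, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


# LIB-001 from the library example: S=9, O=4, D=7
print(rpn(9, 4, 7))  # 252
```

Because the scores multiply, the result ranges from 1 to 1000, and lowering any single score (better detection, for instance) reduces the RPN proportionally.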

Rating Scales

Severity (Business Impact)

| Score | Meaning |
| --- | --- |
| 10 | Catastrophic - complete service failure |
| 7-9 | Major - significant customer impact |
| 4-6 | Moderate - degraded experience |
| 1-3 | Minor - barely noticeable |

Occurrence (Frequency)

| Score | Meaning |
| --- | --- |
| 10 | Constant - happens daily |
| 7-9 | Frequent - weekly |
| 4-6 | Occasional - monthly |
| 1-3 | Rare - annually or never |

Detection (Monitoring)

| Score | Meaning |
| --- | --- |
| 10 | No detection - customers report it |
| 7-9 | Poor - usually miss it |
| 4-6 | Moderate - sometimes catch it |
| 1-3 | Good - almost always catch it first |

Priority Actions

| RPN Range | Priority | Action |
| --- | --- | --- |
| 200-1000 | Critical | Fix this sprint, escalate to leadership |
| 100-199 | Medium | Plan within quarter |
| 50-99 | Low | Add to backlog |
| 1-49 | Minimal | Document and monitor |
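
The bands above translate into a small lookup. This sketch hard-codes the thresholds from the table, and `priority_for` is an illustrative name.

```python
def priority_for(rpn: int) -> tuple[str, str]:
    """Map an RPN (1-1000) to its (priority, action) pair from the table above."""
    if not 1 <= rpn <= 1000:
        raise ValueError(f"RPN must be between 1 and 1000, got {rpn}")
    if rpn >= 200:
        return ("Critical", "Fix this sprint, escalate to leadership")
    if rpn >= 100:
        return ("Medium", "Plan within quarter")
    if rpn >= 50:
        return ("Low", "Add to backlog")
    return ("Minimal", "Document and monitor")


for value in (252, 192, 75, 36):
    priority, action = priority_for(value)
    print(f"RPN {value:>3}: {priority} - {action}")
```

Checking from the highest band downward keeps each threshold in one place, so adjusting a cut-off means changing a single number.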

The Risk Heatmap

After analysis, visualize risks on a criticality heatmap:

RISK PRIORITY MATRIX

                         PROBABILITY
             Unlikely      Likely       Certain
           +-----------+-----------+-----------+
           |           |           |           |
   HIGH    |  LIB-002  |  LIB-007  |           |
           |  LIB-006  |  LIB-001  |           |
           |   [192]   |   [210]   |           |
           |   [162]   |   [252]   |           |
           +-----------+-----------+-----------+
SEVERITY   |           |           |           |
   MEDIUM  |  LIB-005  |  LIB-003  |           |
           |   [108]   |  LIB-004  |           |
           |           |   [150]   |           |
           |           |   [112]   |           |
           +-----------+-----------+-----------+
           |           |           |           |
   LOW     |           |           |           |
           |           |           |           |
           |           |           |           |
           +-----------+-----------+-----------+

Priority Legend:
  RED (Critical):    RPN 200-1000 - Fix this sprint
  ORANGE (Medium):   RPN 100-199  - Plan within quarter  
  GREEN (Low):       RPN 50-99    - Add to backlog

Focus on the upper-right cells first: high-severity, high-probability risks are your top priority.

Action Items by Priority:

  1. LIB-001 (RPN 252) - Secret expiration monitoring - CRITICAL
  2. LIB-007 (RPN 210) - Centralized logging - CRITICAL
  3. LIB-002 (RPN 192) - Multi-region deployment
  4. LIB-006 (RPN 162) - Backup verification

Prismo Architecture

Here’s how the platform works:

                    PRISMO PLATFORM

    +-------------+
    |  ENGINEER   |
    |  provides   |
    | architecture|
    +------+------+
           |
           v
    +------+------+
    |   WEB APP   |
    | (Input UI)  |
    +------+------+
           |
           v
    +------+------+
    |  REST API   |
    +------+------+
           |
    +------+--+---------+-----------+
    |         |         |           |
    v         v         v           v
+-------+ +-------+ +--------+ +----------+
|  KEY  | | REDIS | |   AI   | | DOCUMENT |
| VAULT | | CACHE | | ENGINE | |    DB    |
+-------+ +-------+ +--------+ +----------+
                        |
                        v
               +--------+--------+
               | FMEA WORKSHEET  |
               | (Auto-populated)|
               +-----------------+

Components

| Component | Purpose |
| --- | --- |
| Web App | Input architecture, view results |
| REST API | Process requests, orchestrate analysis |
| Key Vault | Store API keys and secrets securely |
| Redis Cache | Cache analysis results for performance |
| AI Engine | Refract architecture into risk spectrum |
| Document DB | Store risks, analyses, and historical patterns |

Best Practices: Lessons Learned

From running pre-mortems across multiple teams, here’s what works.

Do This

  • Identify stakeholders — deeply understand their problems before starting
  • Align with leadership — get buy-in on goals and expected outcomes
  • Measure first — if you can’t measure it, invest in measuring before mitigating
  • Collaborate — you can’t move mountains alone, build your network
  • Celebrate wins — create awareness of quality milestones

Avoid This

  • Being a purist — quality is about limiting exposure, not eliminating all issues
  • Operating in silos — branch out to partner teams, share and reuse
  • Ignoring failure management — expect failures and optimize for resiliency
  • Ignoring postmortems — pre-mortems and postmortems complement each other
  • Losing momentum — engage in periodic checkpoints, track goals to closure

Summary: Why This Matters

The shift from reactive to proactive reliability engineering comes down to a simple choice:

| Instead of… | Try… |
| --- | --- |
| Fail in production | Fail on paper |
| Find surprises after they hit | Find surprises before they impact customers |
| Scramble during incidents | Have mitigations ready |
| Blind spots | Full awareness of risks |
| Quick fixes | Lasting improvements |
| Constant firefighting | Proactive debt reduction |

Prismo makes this shift practical by automating the time-consuming parts of pre-mortem analysis while maintaining (and improving) quality.

The Numbers

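| Metric | Manual | Prismo | Improvement |
| --- | --- | --- | --- |
| Time | 4 weeks | 20 minutes | 99% faster |
| Risks found | 10-15 | 30-50 | 3x more |
| Risk coverage (estimated) | ~60% | ~85% | ~40% better |
| Refresh rate | Quarterly | Continuous | Always current |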

The Prismo Promise

Just like a prism reveals the hidden spectrum in white light, Prismo reveals the hidden risks in your architecture.

Simple. Elegant. Reliable.


Getting Started

Interested in Prismo or AI-powered pre-mortem analysis?

The future of reliability engineering is proactive. Let’s build it together.


Have questions or feedback? Reach out on LinkedIn or open an issue on GitHub.