Skip to content

Quick facts

Phase: Iterate | Version: 2.0.0 | Category: reflection | License: Apache-2.0

Try it: /lessons-log "Your context here"

Lessons Log

A lessons log entry captures significant learning from projects, incidents, or experiences in a format that's useful to future teams who weren't there. Unlike retrospectives (which focus on team improvement), lessons logs focus on organizational knowledge that transcends individual teams—patterns, anti-patterns, and hard-won wisdom.

When to Use

  • After completing a significant project or initiative
  • Following a major incident, outage, or failure
  • When you realize something important that others should know
  • After discovering a pattern that keeps recurring
  • When experienced team members leave (capture their knowledge)
  • During post-mortems to preserve learnings

How to Use

Use the /lessons-log slash command:

/lessons-log "Your context here"

Or reference the skill file directly: skills/iterate-lessons-log/SKILL.md

Instructions

When asked to create a lessons log entry, follow these steps:

  1. Choose a Descriptive Title Write a title that someone searching for this topic would find. Include keywords that describe the situation and the learning. Avoid generic titles like "Project X lessons."

  2. Provide Context Explain the situation fully enough that someone who wasn't there can understand it. Include the project, timeline, team, and any relevant constraints. Future readers need this context to assess applicability.

  3. Describe What Happened Write a factual account of what occurred. Be specific about actions taken, decisions made, and outcomes observed. Avoid blame—focus on events and systems.

  4. Extract the Lesson Articulate what you learned clearly. The lesson should be actionable—something others can apply. Distinguish between what you observed and your interpretation of why it matters.

  5. Formulate Recommendations Provide specific guidance for future teams facing similar situations. What should they do? What should they avoid? What questions should they ask?

  6. Define Applicability Help readers know when this lesson applies. What situations trigger relevance? What context makes it more or less applicable?

  7. Add Tags for Searchability Include keywords and categories that will help future searchers find this entry. Think about what someone would search for when facing a similar situation.

Output Template

[Descriptive Title of the Lesson]

Metadata

Attribute Value
Entry ID [LL-YYYY-NNN]
Date [When the learning occurred]
Author [Who captured this lesson]
Project/Initiative [What this relates to]
Team [Team involved]
Lesson Type [Success Pattern / Failure Pattern / Process / Technical / Strategy]

Summary

[Brief summary of what happened and what was learned—someone should be able to decide if this is relevant from this paragraph alone]


Context

Background

[Description of the project, initiative, or situation]

Timeline

Date Event
[Date] [What happened]
[Date] [What happened]
[Date] [What happened]

Team and Stakeholders

  • [Who was involved]
  • [Who was affected]

Constraints

  • [Constraint 1]
  • [Constraint 2]

What Happened

Sequence of Events

  1. [Event 1]
  2. [Event 2]
  3. [Event 3]
  4. [Event 4]

Key Decisions Made

Decision Rationale Outcome
[Decision 1] [Why we decided this] [What resulted]
[Decision 2] [Why we decided this] [What resulted]

Outcome

[Description of the outcome—good, bad, or mixed]


The Lesson

What We Learned

Primary Lesson: [Clear statement of the core learning]

Supporting Observations: - [Observation 1] - [Observation 2] - [Observation 3]

Why This Matters

[Explanation of the importance and implications]

Root Cause Analysis

  • [Root cause 1]
  • [Root cause 2]

Recommendations

Do This

  1. [Specific recommendation]
  2. [Specific recommendation]
  3. [Specific recommendation]

Avoid This

  1. [What to avoid and why]
  2. [What to avoid and why]

Questions to Ask

  • [Question 1]?
  • [Question 2]?
  • [Question 3]?

Applicability

When This Applies

  • [Situation 1]
  • [Situation 2]
  • [Situation 3]

When This May Not Apply

  • [Exception 1]
  • [Exception 2]
  • [Related situation]
  • [Related situation]

Supporting Evidence

Metrics/Data

Metric Before After
[Metric] [Value] [Value]

Quotes

"[Quote]" — [Attribution]

Artifacts

  • [Link to related document]
  • [Link to related document]

Tags and Categories

Primary Category: [Infrastructure / Process / Communication / Technical / Strategy / People]

Tags: [#tag1] [#tag2] [#tag3] [#tag4]

Related Lessons: [LL-XXXX-XXX], [LL-XXXX-XXX]


Review and Updates

Date Reviewer Update
[Date] [Name] Initial entry
[Date] [Name] [Update description]

This lesson was captured to help future teams avoid repeating our mistakes and build on our successes.

Example Output

Load Testing with Production Traffic Replay Prevents Scale Failures

Load Testing with Production Traffic Replay Prevents Scale Failures

Metadata

Attribute Value
Entry ID LL-2026-001
Date November 29, 2025 (Black Friday)
Author David Park, Engineering Manager
Project/Initiative Black Friday 2025 Preparedness
Team Platform Engineering
Lesson Type Failure Pattern

Summary

Our payment service experienced a 47-minute outage during Black Friday due to database connection pool exhaustion under 10x normal load. Despite running load tests, we didn't catch this because our synthetic load tests didn't replicate real traffic patterns—specifically the bursty, concentrated nature of flash sale traffic. Production traffic replay would have caught this.


Context

Background

Black Friday is our highest traffic day, typically 8-10x normal volume. The platform team conducted load testing in October, simulating 10x traffic with good results. We also completed a capacity review and provisioned additional infrastructure. Despite this preparation, our payment service failed at 9:03 AM EST when traffic spiked.

Timeline

Date Event
Oct 15 Load testing complete—10x capacity validated
Oct 20 Capacity review approved additional database instances
Nov 1 Infrastructure scaling deployed
Nov 29 09:00 Black Friday sale begins
Nov 29 09:03 Payment service latency spikes
Nov 29 09:07 First customer complaints
Nov 29 09:15 Incident declared, war room assembled
Nov 29 09:35 Root cause identified (connection pool exhaustion)
Nov 29 09:50 Connection pool resized, service recovering
Nov 29 10:15 Full service restoration

Team and Stakeholders

  • Platform Engineering team (incident response)
  • Payment team (service owners)
  • Customer Support (user communication)
  • Executive team (business impact)
  • Affected: ~12,000 customers couldn't complete purchases during outage

Constraints

  • Black Friday is immovable—no option to delay or reschedule
  • Payment service is critical path—no graceful degradation possible
  • Real traffic replay wasn't available (privacy concerns, infrastructure gap)

What Happened

Sequence of Events

  1. Sale launch (09:00): Traffic increased as expected. Systems initially handling load well.

  2. Flash sale spike (09:02): A limited-quantity flash sale created concentrated traffic—5,000 users hitting checkout simultaneously rather than distributed load.

  3. Connection pool saturation (09:03): Payment service database connections exhausted. Each flash sale request held connections longer due to inventory checks under contention.

  4. Cascade begins (09:05): New requests queue, timeouts trigger retries, retries amplify load. Database CPU hits 100%.

  5. Service degradation (09:07): Payment API latency exceeds 30 seconds. Frontend shows errors. Customers start contacting support.

  6. Incident response (09:15): War room assembled. Initially suspected DDoS attack due to traffic pattern.

  7. Root cause identified (09:35): Database team identified connection pool exhaustion. Pool sized for 10x sustained load, not 10x burst load with long-held connections.

  8. Remediation (09:50): Connection pool increased from 100 to 500, connection timeout reduced. Service begins recovering.

  9. Resolution (10:15): Service fully restored. Estimated $340,000 in lost revenue during outage.

Key Decisions Made

Decision Rationale Outcome
Load test at 10x sustained Match expected average load Didn't catch burst patterns
Use synthetic traffic Easier to generate, more controllable Didn't match real user behavior
Size connection pool for average Based on sustained throughput Failed under burst
No production traffic replay Privacy concerns, infrastructure cost Missed realistic patterns

Outcome

47-minute service outage during our highest-revenue hour. Estimated $340,000 direct revenue loss, unknown brand damage. Customer trust impacted—we received 400+ complaint tickets in the following week referencing the Black Friday experience.


The Lesson

What We Learned

Primary Lesson: Synthetic load tests with evenly distributed traffic don't catch burst-pattern failures. Production traffic replay—using anonymized real traffic patterns—reveals timing, distribution, and correlation patterns that synthetic tests miss.

Supporting Observations: - Our load test sent 100 requests/second evenly distributed. Real traffic came in bursts of 5,000 in 10 seconds. - Synthetic tests didn't simulate correlated requests (many users hitting the same flash sale item). - Connection pool math was correct for throughput but wrong for concurrency under burst. - The failure mode was predictable in retrospect—database contention under burst is a known pattern.

Why This Matters

Load testing is expensive and time-consuming. If we're going to invest in it, we need it to actually predict production behavior. Synthetic tests gave us false confidence. Production traffic replay is the only way to truly simulate real-world patterns, especially for events with unusual traffic characteristics.

Root Cause Analysis

  • Immediate cause: Connection pool exhaustion under burst load
  • Contributing cause: Load test didn't simulate burst patterns
  • Root cause: No mechanism to replay real traffic patterns in test environments
  • Systemic cause: Load testing treated as checkbox, not continuous practice

Recommendations

Do This

  1. Implement production traffic replay capability. Capture and anonymize real traffic patterns. Replay them in staging environments before major events. Investment pays off on the first prevented outage.

  2. Test for burst patterns specifically. Design load tests that include concentrated spikes, not just sustained throughput. Model flash sale dynamics, viral traffic patterns, and marketing campaign spikes.

  3. Size for concurrency, not just throughput. When capacity planning, consider maximum concurrent connections, not just requests per second. These require different calculations.

  4. Do chaos engineering for resource exhaustion. Regularly test what happens when connection pools, thread pools, and memory approach limits. Don't wait for production to discover failure modes.

Avoid This

  1. Don't assume synthetic load tests prove production readiness. Synthetic tests prove a specific pattern works—they don't prove all patterns work. Know what you haven't tested.

  2. Don't treat load testing as a one-time checkbox. Traffic patterns change. Run load tests regularly, especially before major events or after significant changes.

  3. Don't dismiss production traffic replay due to privacy concerns. Traffic patterns can be anonymized and replayed without exposing user data. The technique is well-established.

Questions to Ask

  • What are the busiest patterns in our real traffic, not just the average?
  • Have we tested what happens when N requests hit the same resource simultaneously?
  • What's our failure mode when [resource] is exhausted? Do we fail gracefully?
  • When was the last time we load tested with realistic patterns?

Applicability

When This Applies

  • Preparing for known high-traffic events (sales, launches, campaigns)
  • Operating services with shared resources (connection pools, caches)
  • Running systems where traffic is bursty, not steady
  • After scaling capacity based on throughput calculations

When This May Not Apply

  • Services with naturally even traffic distribution
  • Systems designed for graceful degradation (non-critical paths)
  • Very early-stage products where traffic is low and predictable
  • Marketing campaign launches with concentrated traffic spikes
  • Product launches with viral potential
  • Any "thundering herd" scenario (cache expiration, failover recovery)
  • Services downstream of rate-limiters that can "dam break"

Supporting Evidence

Metrics/Data

Metric Load Test Black Friday Reality
Peak RPS 1,000 1,200
Peak concurrent connections 80 450+
Request distribution Even over time 80% in first 30 seconds of flash sale
Connection hold time 50ms average 800ms under contention

Quotes

"The load test results looked great. We hit 10x traffic with no issues. What we didn't realize is that we were testing a fantasy version of 10x traffic." — Sarah Chen, Platform Engineering

"In retrospect, of course flash sales create burst patterns. We just didn't think to simulate it." — Mike Johnson, Payment Service Owner

Artifacts

  • Black Friday Incident Postmortem (internal doc)
  • Load Testing Report - October 2025 (internal doc)
  • Production Traffic Replay RFC (internal doc)

Tags and Categories

Primary Category: Infrastructure

Tags: #load-testing #scale-failure #connection-pools #traffic-patterns #black-friday #incident #database #capacity-planning

Related Lessons: LL-2024-008 (Cache thundering herd), LL-2025-003 (Database connection limits)


Review and Updates

Date Reviewer Update
2025-12-05 David Park Initial entry from postmortem
2026-01-14 Sarah Chen Added production traffic replay RFC link

This lesson was captured to help future teams avoid repeating our mistakes and build on our successes.

Real-World Examples

See this skill applied to three different product contexts:

Storevine (B2B): Storevine B2B ecommerce platform — lesson learned from the Campaigns guided first-campaign flow A/B test

Prompt:

/lessons-log

Lesson from: Campaigns guided first-campaign flow A/B test
Experiment ran: April 28 – June 27, 2026 (60 days from GA)
Decision: Ship guided flow as default for all non-adopter merchants

Primary finding:
- Control: 34.2% opened flow, 39.2% of openers clicked Send [fictional]
- Treatment: 35.8% opened flow, 88.5% of openers clicked Send [fictional]
- Flow entry rate was nearly IDENTICAL between variants
- All of the treatment advantage was in completion rate, not flow entry

What surprised us: We expected the guided UX structure (progress
indicator, step sequencing, explicit prompts) to be the mechanism.
Funnel analysis shows the treatment advantage is concentrated at the
template and subject line step — content pre-population, not UX
scaffolding, is doing the work.

Lesson type: Success Pattern / Strategy
Need: structured lessons log entry for the team knowledge base.

Output:

Content Quality, Not UX Scaffolding, Is the Non-Adopter Activation Mechanism

Brainshelf (Consumer): Brainshelf consumer PKM app — timezone handling lesson from the Resurface Sprint 8 launch

Prompt:

/lessons-log

lesson from: timezone bug at resurface launch
what happened: ~340 UK/EU users got their digest at 7:30 UTC instead
of 7:30 local because the backend didn't recognize "Europe/Kyiv" and
silently fell back to UTC.

detected: feb 18, 2026 — alex noticed the send log timestamps were
wrong for EU users.
fixed: feb 18, same day — updated IANA database, added explicit
fallback logging, patched within 2 hours.

lesson type: failure pattern / technical
the real lesson isn't the specific bug — it's that we had ZERO
integration tests for timezone-dependent behavior. the unit tests
mocked the timezone, so they never caught the gap between the device
timezone string and the server's IANA database.

Output:

Mocked Timezones Hide Real-World Failures: Always Integration-Test Time-Dependent Features Against the Live IANA Database

Workbench (Enterprise): Workbench enterprise collaboration platform: Blueprints skip-and-submit enforcement lesson

Prompt:

/lessons-log

I need to document a lesson learned from the Blueprints launch. Here's the context:

**Title:** Required-Section Enforcement Caused Skip-and-Submit Behavior

**What happened:**
- Required-section enforcement successfully reduced the empty-section submission rate from 37% to 7% [fictional]
- But post-launch analysis revealed that approximately 4% of submitted Blueprints contain placeholder text in required sections [fictional] -- authors typing "TBD," "See above," "Will update," or pasting Lorem Ipsum to clear the gate
- Combined, 11% of Blueprints reaching approval still have quality issues [fictional] (7% empty + 4% placeholder), vs. our 10% target
- This was flagged as a risk in the PRD (Risk row 1: "Required sections cause author frustration and workarounds") but we underestimated the speed of adoption
- The root cause: we enforced completeness at the wrong point. Blocking submit punishes the author but doesn't help the approver. Approvers still have to manually check for placeholder text.

**The lesson:**
- Enforce quality at the approval gate, not the authoring gate
- Shift from "block the author from submitting" to "give the approver a completeness score"
- The submit gate should remain (it catches genuine omissions) but add a visible completeness/quality signal to the approval view

**Team:** Rachel V., Karen L., Leo M.
**Timeline:** Identified May 10, 2026 (Leo M.'s 2-week post-launch analysis); documented May 14

Please generate the full lessons-log entry.

Output:

Required-Section Enforcement Caused Skip-and-Submit: Enforce at the Approval Gate, Not the Authoring Gate

Quality Checklist

Before finalizing, verify:

  • Title is descriptive and searchable
  • Context is complete enough for someone who wasn't there
  • Lesson is clearly articulated and actionable
  • Recommendations are specific, not vague
  • Entry stands alone (doesn't require external context)
  • Tags enable future discovery

Output Format

Use the template in references/TEMPLATE.md to structure the output.