AI Dev Team Playbook

Coding standards and AI workflows
for teams that ship fast and stay safe.

7 chapters. Built from what we're learning in the field. A living document for teams that want to move quickly with AI without burning down the codebase.

Chapter 01

Git Workflow & Branching Strategy

A consistent branching model keeps AI-assisted commits from becoming spaghetti.

Branching Commits CI
Key Points
  • main is sacred — always deployable. Protected with required PR reviews, status checks, no force pushes.
  • Use trunk-based development for startups: short-lived branches (hours to 2–3 days), merge frequently to main
  • Branch naming: feature/TICKET-123-short-description, fix/TICKET-123-description, chore/description
  • Commit messages follow Conventional Commits: feat:, fix:, refactor:, test:, docs:, chore:
  • PR size target: under 400 lines ideal; 800+ requires justification and possible breakdown
  • Use feature flags for large features instead of long-lived branches — merge early, activate per-environment

Protect main From Day One

Your main branch should always be deployable. Set up branch protection rules before you write a line of code:

Settings → Branches → Add branch protection rule for "main":
  ✓ Require a pull request before merging
  ✓ Require at least 1 approval
  ✓ Require status checks to pass before merging
  ✓ Require branches to be up to date before merging
  ✗ Allow force pushes (OFF)
  ✗ Allow deletions (OFF)
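If you prefer to script this instead of clicking through Settings, the same rules can be applied via GitHub's branch protection REST endpoint. A sketch using only the standard library; owner, repo, and token handling are yours to fill in:

```python
import json
import urllib.request

def protection_payload() -> dict:
    """Branch protection rules matching the checklist above."""
    return {
        "required_status_checks": {"strict": True, "contexts": []},
        "enforce_admins": True,
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        "restrictions": None,
        "allow_force_pushes": False,
        "allow_deletions": False,
    }

def protect_main(owner: str, repo: str, token: str) -> None:
    """PUT the rules to the repo's main branch protection endpoint."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/branches/main/protection",
        data=json.dumps(protection_payload()).encode(),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    urllib.request.urlopen(req)
```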

Trunk-Based vs. GitFlow

For most startups, use Trunk-Based Development. Everyone branches off main, works in small increments, and merges back quickly. Feature flags handle unfinished features.

                   Trunk-Based                   GitFlow
Best for           Startups, fast teams, CI/CD   Enterprise, long release cycles
Branch lifespan    Hours to 2–3 days             Days to weeks
Merge conflicts    Rare                          Frequent
Junior-friendly    Yes (simpler mental model)    Can be confusing
AI tooling fit     Excellent                     Okay

Branching Structure & Naming

main
├── feature/TICKET-123-add-video-scoring
├── feature/TICKET-124-dashboard-filters
├── fix/TICKET-125-null-pointer-on-upload
└── chore/update-dependencies

Naming rules: lowercase, hyphenated, include ticket number when one exists, 3–5 word descriptions. Delete branches after merging.
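These rules are mechanical enough to enforce automatically. A minimal sketch of a branch-name check, usable in a pre-push hook or CI step (the prefixes and ticket format are the ones above; adjust to taste):

```python
import re

# <prefix>/<optional TICKET-123->short-lowercase-hyphenated-description
BRANCH_RE = re.compile(r"^(feature|fix|chore)/([A-Z]+-\d+-)?[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_branch_name(name: str) -> bool:
    """Return True if a branch name follows the team convention."""
    return bool(BRANCH_RE.fullmatch(name))
```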

Feature Flags for AI-Heavy Teams

When Copilot helps a junior write a bigger feature than expected, resist keeping the branch alive for days. Use feature flags instead — merge incomplete features safely and activate per-environment:

# config/features.py
import os

FEATURE_FLAGS = {
    "new_scoring_algorithm": os.getenv("FF_NEW_SCORING", "false").lower() == "true",
    "enhanced_dashboard": os.getenv("FF_DASHBOARD", "false").lower() == "true",
}

def is_enabled(name: str) -> bool:
    return FEATURE_FLAGS.get(name, False)

# usage
if is_enabled("new_scoring_algorithm"):
    result = new_algorithm.score(video)
else:
    result = legacy_algorithm.score(video)

Conventional Commit Messages

Use the Conventional Commits format. Good commit messages are your changelog, debugging breadcrumbs, and code review history in one.

<type>(<scope>): <short description>

[optional body]

[optional footer: TICKET-123, BREAKING CHANGE, etc.]

Good examples:

feat(scoring): add confidence threshold filter to video evaluation

fix(auth): handle null token on session expiry — closes TICKET-98

refactor(api): extract video upload logic into VideoService class

test(scoring): add edge case tests for zero-frame video input

Reject these in review: fix stuff, WIP, updates, asdf, Copilot suggestion

💡
AI Tip: After staging changes, run git diff --cached and paste into Copilot Chat: "Write a conventional commit message for these changes." This habit alone cleans up git history significantly.
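The format is also easy to backstop with a commit-msg hook. A rough sketch that checks the subject line against the types above (the 72-character cap is a common convention, not a Conventional Commits requirement):

```python
import re

TYPES = "feat|fix|refactor|test|docs|chore"
# type, optional (scope), optional ! for breaking change, ": ", description
SUBJECT_RE = re.compile(rf"^({TYPES})(\([\w-]+\))?!?: .+$")

def is_conventional(message: str) -> bool:
    """Check the first line of a commit message against the convention."""
    subject = message.splitlines()[0] if message else ""
    return len(subject) <= 72 and bool(SUBJECT_RE.fullmatch(subject))
```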

PR Size Guidelines

Large PRs are a code quality problem. AI tools make them larger (generating code is cheap). Set a soft limit and hold it:

  • Under 400 lines changed: Ideal
  • 400–800 lines: Okay with a good description
  • 800+ lines: Requires justification. Break it up if possible.
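A soft limit is easier to hold when it's visible. One way (a sketch, not a prescribed tool) is to total the diff in CI from `git diff --numstat origin/main...HEAD` output:

```python
def total_lines_changed(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added == "-":  # binary files report "-", skip them
            continue
        total += int(added) + int(deleted)
    return total

def size_verdict(total: int) -> str:
    """Map a line count onto the guideline tiers above."""
    if total < 400:
        return "ideal"
    if total <= 800:
        return "okay with a good description"
    return "needs justification"
```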
Chapter 02

Code Review Process

AI writes fast. Humans still need to review carefully — maybe more carefully than ever.

PRs Review Quality
Key Points
  • The Golden Rule: Every line that ships was approved by a human who understood it — doubly true for AI-generated code
  • AI does automated checks first (lint, format, tests, security scan); humans review second with that context
  • Review labels are defined and enforced: Blocker ❌, Suggestion 💡, Nit, Question ❓, FYI
  • SLA: standard PRs reviewed within 1 business day; hotfixes within 2 hours (ping reviewer directly)
  • Safety-critical PRs require a senior reviewer plus a second approval — no exceptions
  • First 90 days: junior PRs reviewed by senior. This is investment, not distrust.

PR Template

Create .github/pull_request_template.md in your repo so every PR answers the same questions:

## Summary
<!-- What does this PR do? One to three sentences. Be specific. -->

## Changes Made
- 
- 

## How to Test
1. 
2. 

## Screenshots / Demo
<!-- If UI changes, include a screenshot or Loom link -->

## Checklist

### General
- [ ] I have tested this locally
- [ ] I have written or updated tests for this change
- [ ] I have updated documentation if needed
- [ ] The PR is under 400 lines (or I've justified why it's larger)
- [ ] Branch is up to date with main

### Code Quality
- [ ] No hardcoded secrets, credentials, or API keys
- [ ] No debug logs, console.log, or print statements left in
- [ ] No TODO comments that should be tickets
- [ ] Variable and function names are clear without needing a comment
- [ ] Error cases are handled, not ignored

### AI-Generated Code
- [ ] I can explain every section of this code (including AI-assisted sections)
- [ ] I have reviewed AI suggestions critically and not merged blindly
- [ ] Edge cases have been considered, not just the happy path
- [ ] AI-generated logic has been verified against business requirements

### Safety-Critical (if applicable)
- [ ] This change has no effect on safety-critical paths, OR:
- [ ] I have tagged this PR safety-critical and requested senior review
- [ ] I have added/updated integration tests for affected safety logic
- [ ] I have documented the safety implications in this description

AI Review + Human Review: The Right Sequence

AI Review (automated — happens first): linting & formatting, static analysis, security scanning (Dependabot/Snyk), test coverage, GitHub Copilot PR review.

Human Review (happens second, with AI context):

  • Business logic correctness — does it do what the spec says?
  • Architecture and design decisions
  • Edge case coverage — "what if X is null?" "what if the API errors?"
  • Safety implications
  • Team knowledge transfer — does the reviewer understand what's being built?

The reviewer's job is not to re-run the linter or count lines. The reviewer's job is to understand the change completely before approving.

Review Response Labels

Define what comments mean on your team. Left undefined, this creates friction.

Label            Meaning
Blocker / ❌     Must be fixed before merge
Suggestion / 💡  Take it or leave it — but acknowledge it
Nit              Tiny style thing, won't block merge
Question / ❓    Not blocking — just wants to understand
FYI              For awareness, no action needed
💡 Suggestion: You could extract this into a helper function — it's used in three places now.

❌ Blocker: This will crash if `user` is None. Need a null check here.

Nit: Prefer `is_valid` over `valid` for boolean variable names.

Review SLA

  • Standard PR: Review within 1 business day
  • Hotfix / unblocking: Review within 2 hours (ping the reviewer directly)
  • Large PR (800+ lines): Reviewer may request a walkthrough call

Reviews that sit for more than a day create merge conflicts and kill developer momentum. Make this a cultural priority.

Pair Review for Juniors

For the first 90 days, require junior PRs to be reviewed by a senior. The fastest way to level up a junior is immediate, specific feedback on their actual code. After 90 days, adjust based on demonstrated judgment — some are ready to review peers at 60 days, some need more time. Be honest.

Chapter 03

Coding Standards

Consistent style is table stakes. These are the rules the AI is expected to follow too.

Style Linting Formatting
Key Points
  • Standards are enforceable; best practices are aspirational. For junior-heavy teams, standards win.
  • Python: Black (formatter, no config debates), Ruff (linter), mypy --strict (type checking) — all enforced in CI
  • Type hints required on all new code. Never use any in TypeScript — use unknown and narrow it.
  • Naming is enforced: reject data, result, stuff, temp, val, obj in review — name things after what they are
  • All public functions require Google-style docstrings: description, Args, Returns, Raises, Example
  • Architecture Decision Records (ADRs) in /docs/decisions/ for every significant technical choice

Python Standards

Formatter: Black — no configuration debates. Line length: 88 characters. Done.

Linter: Ruff — replaces Flake8, isort, and pyupgrade as separate tools, and it's much faster.

# pyproject.toml
[tool.ruff]
line-length = 88

[tool.ruff.lint]
select = ["E", "F", "W", "I", "N", "UP", "S"]
ignore = ["E501"]  # Black handles line length

[tool.ruff.lint.per-file-ignores]
"tests/*" = ["S101"]  # Allow assert in tests

Type hints: Required for all new code. Enable static analysis to catch bugs before runtime.

# BAD — no type hints
def calculate_score(video, weights):
    return sum(weights[k] * video[k] for k in weights)

# GOOD — explicit types with full docstring
def calculate_score(video: VideoAnalysis, weights: dict[str, float]) -> float:
    """Calculate weighted score for a video based on analysis results.

    Args:
        video: The analyzed video object containing scored dimensions.
        weights: A mapping of dimension names to their weight multipliers.

    Returns:
        A float between 0.0 and 1.0 representing the weighted overall score.

    Raises:
        AttributeError: If a weight references a dimension not present on the video analysis.
    """
    return sum(weights[k] * getattr(video, k) for k in weights)

Run mypy --strict src/ in CI. Catch type errors before they reach production.

Naming Conventions

Clear naming is the single highest-leverage thing you can do for code quality. AI-generated code is often generic. Push back on generic names in review.

Context         Convention                       Example
Variables       snake_case                       video_score, user_id
Functions       snake_case, verb-first           calculate_score(), get_user()
Classes         PascalCase                       VideoAnalysis, ScoringEngine
Constants       SCREAMING_SNAKE_CASE             MAX_RETRY_COUNT, DEFAULT_THRESHOLD
Booleans        is_/has_/can_/should_ prefix     is_valid, has_error, can_submit
Private         Leading underscore               _internal_helper()
Files/modules   snake_case                       video_scoring.py, user_auth.py

Names to reject in review: data, info, stuff, temp, result, val, obj, thing, manager, handler, processor — these signal unclear responsibilities. Copilot often suggests result as a return variable. Rename it to what it is: scored_video, validated_user, parsed_config.

TypeScript Standards

Formatter: Prettier. Linter: ESLint with @typescript-eslint. Types: strict mode, always.

// tsconfig.json
{
  "compilerOptions": {
    "strict": true,
    "noImplicitAny": true,
    "strictNullChecks": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true
  }
}

Never use any. If Copilot suggests it, reject it. If the type is genuinely unclear, use unknown and narrow it. any bypasses the entire type system.

Architecture Decision Records (ADRs)

When the team makes a significant technical decision, document it in /docs/decisions/. These are invaluable for onboarding, debugging six months later, and explaining choices to investors.

# ADR-001: Use PostgreSQL for primary data store

**Status:** Accepted
**Date:** 2026-01-15

**Context:**
Our data has strong relational structure (users → evaluations → videos → scores).
We need ACID guarantees for evaluation records.

**Decision:**
PostgreSQL, hosted on AWS RDS. ORM: SQLAlchemy.

**Consequences:**
- Schema migrations required (handled by Alembic)
- Better query performance for complex joins
- Team needs to learn SQL if they haven't yet
Chapter 04

AI-in-the-Loop Workflow

When to use AI, when not to, and how to work with it without losing your brain.

Prompting Tooling Oversight
Key Points
  • The 70/30 Rule: AI handles ~70% (boilerplate, tests, docs, formatting). Humans own the 30% requiring judgment, domain expertise, and architecture.
  • Green/Yellow/Red light system: know which work is AI-free, AI-assisted, and human-primary
  • The CARE framework for reviewing AI code: Correctness, Appropriateness, Robustness, Explainability
  • Write intent first (comments/docstring), let AI fill in implementation — this produces dramatically better output
  • Never accept a multi-line Copilot suggestion without reading it. Use partial acceptance.
  • The "Explain It Back" Rule: developer must explain every line of AI-generated code before it merges

The 70/30 Rule

The most important mental model for AI-augmented development:

AI handles ~70%: boilerplate (CRUD, API endpoints, form validation), unit tests for well-defined functions, docstrings, error handling patterns, type annotations, refactoring repetitive code, data format conversions, SQL for standard operations, regex, commit messages, PR descriptions.

Humans own the 30%: architecture decisions, business logic (what the system should do), edge case identification, safety-critical path review, security/auth/authorization code, any code whose failure has real-world consequences, technical debt trade-offs, subtle context-dependent bugs.

The goal isn't to minimize AI use. It's to be intentional about it.

When to Use AI: The Traffic Light System

🟢 Green light — use AI freely:

  • Writing a new CRUD endpoint you've written dozens of times before
  • Generating unit tests for a function you just wrote
  • Converting data structures between formats
  • Writing docstrings for a function you just finished
  • Generating boilerplate (Pydantic models, SQLAlchemy models, pytest fixtures)
  • Asking "what's the idiomatic Python way to do X?"

🟡 Yellow light — use AI, but review carefully:

  • Implementing business logic for the first time
  • Code that touches the database schema
  • Authentication and session handling
  • Background jobs and async processing
  • Code using third-party APIs

🔴 Red light — do not use AI as primary author:

  • Security-critical logic (permissions, access control, encryption)
  • Core safety-critical algorithms (see Chapter 6)
  • Database migrations
  • Code that processes PII or sensitive data
  • Anything you don't understand well enough to explain line-by-line

Red light doesn't mean "never use AI here" — use it for reference, explaining concepts, checking your work. But the primary author must be a human who owns the logic.

The CARE Framework for Reviewing AI Code

C — Correctness: Does it actually do what it's supposed to? AI writes code that looks correct. Run it. Test edge cases. Don't just read it.

# Copilot might generate this — looks fine at first glance:
def get_user_by_email(email: str) -> User:
    return db.query(User).filter(User.email == email).first()

# What happens if email is None?
# What if there are two users with the same email?
# What does the caller do with None if no user is found?
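One way to answer those questions is to make the None cases part of the signature. A sketch, with an in-memory `_USERS` list standing in for the real `db.query(User)` call:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    id: int
    email: str

_USERS = [User(1, "a@example.com")]  # stand-in for the database

def get_user_by_email(email: Optional[str]) -> Optional[User]:
    """Return the user for this email, or None if not found.

    A None or empty email short-circuits instead of querying, and
    duplicate emails raise rather than silently returning one of them.
    The Optional return type forces callers to handle the miss.
    """
    if not email:
        return None
    matches = [u for u in _USERS if u.email == email.lower()]
    if len(matches) > 1:
        raise LookupError(f"multiple users share email {email!r}")
    return matches[0] if matches else None
```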

A — Appropriateness: Is this the right approach for your system? AI doesn't know your architecture, performance requirements, or team conventions.

R — Robustness: AI tends to generate the happy path. Ask: what if the input is null? What if the network call fails? What if the list is empty?

E — Explainability: Can the developer explain every line? If not, the code does not merge. "Copilot wrote it" is not a valid explanation. This is non-negotiable.

Copilot Best Practices

1. Write intent first, let AI fill in implementation:

def calculate_driver_score(evaluation: DriverEvaluation) -> DriverScore:
    """
    Calculate final driver score from an evaluation.

    Weight the dimensions by importance:
    - Safety behaviors: 40%
    - Defensive driving: 30%
    - Vehicle control: 20%
    - Route efficiency: 10%

    Return a DriverScore with overall score and per-dimension breakdown.
    """
    # Copilot will now generate much better code because context is clear

2. Use Copilot Chat for understanding, not just generation:

  • /explain — "Explain what this function does line by line"
  • /tests — "Write unit tests including edge cases"
  • /fix — "Here's the error I'm getting — what's wrong?"
  • "What are the failure modes for this code?"
  • "What security issues could this code have?"

3. Ask for tests in the same session. After Copilot helps write a function, immediately: "Write pytest tests for this function, including tests for null inputs and edge cases." Generate tests while the context is warm.

The "Explain It Back" Rule

Before any AI-generated code merges, the developer must be able to:

  • Explain what the code does in plain English
  • Identify the inputs, outputs, and side effects
  • Describe what happens in at least three edge cases
  • Explain why this approach was chosen over alternatives

If they can't do this, they need to spend more time understanding the code. This is a teachable moment, not a punishment.

Chapter 05

Onboarding New Developers

Getting someone up to speed in an AI-native team is different. Here's how we do it.

Onboarding Culture Ramp-up
Key Points
  • Week 1 is about understanding — the codebase, the domain, the workflow. No feature tickets on day one.
  • Assign every new hire a "code buddy" — someone to ask "dumb questions" without embarrassment
  • First contributions are tiered: docs/tests → guided bug fixes → independent features (Weeks 2–4)
  • AI-off exercises in Week 1–2: write one feature without Copilot to build baseline judgment
  • Code reviews are the primary source of technical mentorship — explain why, not just what's wrong
  • Measure: tickets closed, PR quality, bug introduction rate — not hours logged or lines of code

Week 1: Foundation Before Code

The worst thing you can do with a new junior is throw them at tickets on day one. They'll copy-paste code they don't understand, and Copilot will help them do it faster. Week 1 is about building the map.

Day 1–2: Environment & Culture

Day 3–5: Read First, Write Second

Buddy system: Assign every new hire a code buddy — a more experienced person they can ask questions without feeling embarrassed. Junior developers will sit on a question for an hour rather than ask a "dumb question." The buddy system removes that barrier.

Weeks 2–4: Tiered First Contributions

Structure first tickets deliberately. Don't give them the hardest or most important ticket just because Copilot makes it look tractable.

Tier 1 (Week 2): Zero-stakes contributions

  • Documentation improvements
  • Adding missing docstrings to existing functions
  • Writing tests for untested code
  • Fixing linting warnings

These contributions are real, valuable, and safe. They also force the new hire to read and understand existing code.

Tier 2 (Week 3): Guided feature work

  • Small, isolated bug fixes
  • Adding a new API endpoint that mirrors an existing one
  • Extending an existing model with a new field

First PR gets a thorough, teaching-focused review. The goal isn't just to catch problems — it's to explain why something is done differently and build judgment.

Tier 3 (Week 4): Independent feature work

  • Small standalone features with full PR review process
  • Must complete the full PR checklist independently
  • Reviewer asks them to explain one AI-generated section

The AI Ramp-Up Path

New junior devs are often already comfortable with Copilot. The goal isn't to teach them to use it — it's to teach them judgment about when and how.

Week 1–2: AI-off exercises. Write one feature without Copilot. Turn it off. See if they can reason through the problem. This isn't hazing — it's building baseline understanding. You can't evaluate AI suggestions if you don't know what the code should look like without AI.

Week 3: AI-on, commentary required. When they use Copilot, they annotate: "Copilot suggested this. I changed X because Y. I verified it handles Z edge case."

Week 4+: Full workflow. AI tools on. But the first review of every Copilot-heavy PR includes the "Explain It Back" exercise from Chapter 4.

Code Review as Teaching

For junior developers, code reviews are the primary source of technical mentorship. The extra 30 seconds to explain why pays back in compounding improvement.

❌ Blocker: This will throw a `KeyError` if `video_id` isn't in the dict.

Instead, use `dict.get(key, default)` which returns None (or your default)
if the key doesn't exist. This is the idiomatic Python pattern for optional dict access.

dict.get(video_id) returns None if video_id isn't present.
dict.get(video_id, []) returns an empty list as default.

vs. the bad version: ❌ Blocker: This will break. — which teaches nothing.

Measuring Progress

Don't measure junior developers by hours logged or lines of code. Those metrics are meaningless with AI tools — Copilot inflates line count trivially.

Measure instead: tickets closed per sprint, PR quality improvement over time, bug introduction rate, independence (fewer clarifying questions over time), code review quality (catching real issues when reviewing others).

Review these metrics monthly in 1:1s. Developers who are improving but struggling deserve support. Developers who plateau at a low level despite support need an honest conversation.

Chapter 06

Safety-Critical Considerations

⚠️ Required Reading

Where AI assistance must be limited, logged, and reviewed with extra care.

Security Safety Compliance
Key Points
  • Safety-critical = code where a bug could harm a person, produce incorrect consequential outcomes, or fail silently
  • Three-tier classification: 🟢 Standard, 🟡 Sensitive, 🔴 Safety-Critical — label PRs and make it visible
  • Safety-critical PRs: 2 reviewers minimum, senior required, explicit edge cases documented, integration tests mandatory
  • AI must NOT be primary author of scoring algorithms, access control logic, or any pass/fail decision on a person
  • Observability is a requirement — log algorithm version, config version, and inputs for every safety-critical evaluation
  • Automate evidence gathering, not the decision — for high-stakes outcomes, human judgment must be explicitly in the chain
⚠️
This chapter is required reading for all developers before they are granted AI tool access. The standards here apply on top of Chapter 3's coding standards for safety-critical paths specifically.

What "Safety-Critical" Means

In this playbook, safety-critical means code where a bug could:

  • Cause harm to a person (physical, financial, reputational)
  • Produce an incorrect outcome in a consequential decision
  • Fail silently in a way that affects real-world results

This includes: driver evaluation and scoring systems, medical record handling, financial calculations and reporting, educational assessment systems, access control for sensitive data, automated decisions that affect people's opportunities or outcomes.

The Three-Tier Classification System

Label PRs with GitHub labels. Make classification visible.

  • 🟢 Standard: low or no real-world impact if wrong. Standard process. Examples: UI changes, admin dashboard, internal tooling, logging improvements.
  • 🟡 Sensitive: could affect UX or data integrity. Careful review, no extra gates. Examples: API endpoints that read/write user data, email notifications, data exports.
  • 🔴 Safety-Critical: a bug could cause incorrect outcomes affecting people. Elevated review process required. Examples: scoring algorithms, evaluation logic, access control, any code producing a consequential decision.

Extra Gates for Safety-Critical PRs

Standard PR process (Chapter 2) plus all of the following:

  • Senior/founder review required. Safety-critical PRs must be approved by a senior developer or technical founder. A junior can write safety-critical code — they cannot be the sole reviewer of it.
  • Two-reviewer minimum. At least two people must approve. A second set of eyes is not overhead — it's the minimum viable safety check for consequential code.
  • Explicit edge case documentation. The PR description must include an "Edge Cases Considered" section.
  • Integration tests required. Unit tests are necessary but not sufficient. You need tests that test the full pipeline.
  • Staging validation. Safety-critical changes must pass in staging with real or realistic data before production deploy.

Edge Cases Considered — example format:

## Edge Cases Considered

- **Zero-frame video:** Returns `EvaluationError` with code `INSUFFICIENT_FOOTAGE`
- **Null driver ID:** Raises `ValidationError` before scoring begins
- **Score weights don't sum to 1.0:** Normalized automatically; logged as warning
- **All dimensions score 0:** Valid result, returned as-is (not treated as error)
- **Video duration < 30 seconds:** Returns `EvaluationError` with code `VIDEO_TOO_SHORT`

Integration test example:

class TestScoringPipelineSafetyEdgeCases:
    """Integration tests for safety-critical scoring pipeline paths."""

    def test_short_video_returns_error_not_zero_score(self, db_session):
        """Ensure short videos are rejected, not scored as zero."""
        video = create_test_video(duration_seconds=15)
        result = scoring_pipeline.run(video.id, ScoringConfig.default())

        assert result.status == EvaluationStatus.ERROR
        assert result.error_code == "VIDEO_TOO_SHORT"
        assert result.overall_score is None  # Not zero — None. Important distinction.

    def test_score_is_deterministic(self, db_session):
        """Same video, same config must always produce same score."""
        video = create_test_video()
        config = ScoringConfig.default()

        result_1 = scoring_pipeline.run(video.id, config)
        result_2 = scoring_pipeline.run(video.id, config)

        assert result_1.overall_score == result_2.overall_score
        assert result_1.dimension_scores == result_2.dimension_scores

AI Usage Rules for Safety-Critical Code

AI must NOT be the primary author of: scoring or evaluation algorithms, access control logic, data validation that affects downstream safety decisions, threshold or weight configuration parsing, any code that produces a binary pass/fail on a person.

AI may assist with: writing tests for safety-critical code (after a human writes the logic), boilerplate wrappers around safety-critical core logic, documentation and comments, explaining how existing code works (for review purposes), identifying potential edge cases: "What could go wrong with this function?"

The key review question for safety-critical AI code: "If this code produced an incorrect result, how would we know? And how quickly?" If the answer is "we'd know when a person complains" — that's not good enough.

Observability as a Safety Requirement

For safety-critical systems, logging and monitoring aren't nice-to-haves. They're requirements. You must be able to answer: "What score did driver #1234 get on January 15th, and what config version was used?" — instantly.

import structlog
from datetime import datetime, timezone

logger = structlog.get_logger(__name__)

def evaluate_driver(video_id: str, config: ScoringConfig) -> EvaluationResult:
    log = logger.bind(
        video_id=video_id,
        config_version=config.version,
        evaluation_timestamp=datetime.now(timezone.utc).isoformat()
    )

    log.info("evaluation.started")

    try:
        result = _run_scoring_pipeline(video_id, config)
        log.info(
            "evaluation.completed",
            overall_score=result.overall_score,
            dimension_count=len(result.dimensions),
            flags_raised=len(result.safety_flags),
        )
        return result

    except InsufficientVideoError as e:
        log.warning("evaluation.failed.insufficient_video", error=str(e))
        raise

    except Exception as e:
        log.error("evaluation.failed.unexpected", error=str(e), exc_info=True)
        raise

Version everything that affects outcomes: scoring algorithm version, config file version, model weights, any threshold that affects a decision. When something is questioned, you need to know exactly what was running when.
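Concretely, version stamping can be as simple as carrying the versions on every persisted result. A sketch; the field names and version strings are illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvaluationRecord:
    """Everything needed to answer 'what was running when?'."""
    video_id: str
    overall_score: float
    algorithm_version: str   # e.g. "scoring-2.3.1"
    config_version: str      # e.g. "weights-2026-01-15"
    evaluated_at: str        # ISO-8601 UTC timestamp

# usage: persist asdict(record) alongside the score itself
record = EvaluationRecord(
    video_id="vid-1234",
    overall_score=0.87,
    algorithm_version="scoring-2.3.1",
    config_version="weights-2026-01-15",
    evaluated_at="2026-01-15T10:00:00+00:00",
)
```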

Human-in-the-Loop for High-Stakes Decisions

For decisions with significant real-world consequences (employment, certification, legal, medical): automate the evidence gathering — not the decision.

The system produces: a score, a breakdown, flagged behaviors, a recommendation. A human produces: the decision, with the system's output as input. This isn't just an ethical position — it's a liability position.
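In code, keeping the human explicitly in the chain can look like separating the recommendation type from the decision type, so nothing downstream can consume a score as if it were an outcome. A sketch; field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemRecommendation:
    """What the system produces: evidence, never the final outcome."""
    overall_score: float
    flagged_behaviors: tuple
    recommendation: str        # e.g. "pass", "needs_review"

@dataclass(frozen=True)
class HumanDecision:
    """The decision always carries a named human and a rationale."""
    outcome: str
    decided_by: str
    rationale: str
    evidence: SystemRecommendation
```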

Chapter 07

Tools & Setup Checklist

Every tool on the approved list, and how to get your environment running in under 4 hours.

Setup Tools Checklist
Key Points
  • VS Code with specific extensions — commit .vscode/settings.json to the repo so everyone uses the same editor settings
  • Python: pyenv for version management, Black + Ruff + mypy enforced in CI
  • Disable Copilot for YAML files — config file suggestions are a common source of subtle misconfiguration
  • CI pipeline: lint → format check → mypy --strict → pytest with 80% coverage minimum, before any merge
  • GitHub repo minimum: branch protection + PR template + Dependabot + security scanning + required labels
  • Copilot Business: disable "suggestions matching public code" to reduce IP risk and avoid inheriting bugs from public repos

VS Code Setup

# Install VS Code extensions
code --install-extension github.copilot
code --install-extension github.copilot-chat
code --install-extension ms-python.python
code --install-extension ms-python.vscode-pylance
code --install-extension charliermarsh.ruff
code --install-extension ms-python.black-formatter
code --install-extension eamodio.gitlens
code --install-extension streetsidesoftware.code-spell-checker
code --install-extension usernamehw.errorlens

Commit this file to the repo so all developers share the same settings:

// .vscode/settings.json
{
  "editor.formatOnSave": true,
  "editor.codeActionsOnSave": {
    "source.fixAll.ruff": "explicit",
    "source.organizeImports.ruff": "explicit"
  },
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter"
  },
  "python.analysis.typeCheckingMode": "strict",
  "editor.rulers": [88],
  "editor.tabSize": 4,
  "files.trimTrailingWhitespace": true,
  "files.insertFinalNewline": true,
  "github.copilot.enable": {
    "*": true,
    "markdown": true,
    "yaml": false
  }
}
💡
Note: Copilot is disabled for YAML. Suggestions for config files (CI/CD, infrastructure) can introduce subtle misconfiguration. Always write config by hand.

Python Environment Setup

# Install Python version manager (pyenv)
curl https://pyenv.run | bash

# Install target Python version
pyenv install 3.12.3
pyenv local 3.12.3

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dev dependencies
pip install -r requirements-dev.txt

requirements-dev.txt template:

# Core
-r requirements.txt

# Testing
pytest>=7.4
pytest-cov>=4.1
pytest-asyncio>=0.21
factory-boy>=3.3

# Type checking
mypy>=1.6

# Linting & formatting
ruff>=0.1
black>=23.0

# Development utilities
ipython>=8.0
python-dotenv>=1.0
structlog>=23.0

Git Configuration

git config --global user.name "First Last"
git config --global user.email "name@company.com"
git config --global pull.rebase true
git config --global fetch.prune true
git config --global init.defaultBranch main
git config --global diff.colorMoved zebra

GitHub Repository Setup Checklist

CI/CD Pipeline

A basic but complete GitHub Actions pipeline. CI must pass before a PR can merge — enforce this via branch protection.

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  quality:
    name: Code Quality
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements-dev.txt

      - name: Lint with Ruff
        run: ruff check .

      - name: Check formatting with Black
        run: black --check .

      - name: Type check with mypy
        run: mypy --strict src/

      - name: Run tests with coverage
        run: |
          pytest tests/ \
            --cov=src \
            --cov-report=xml \
            --cov-report=term-missing \
            --cov-fail-under=80

      - name: Upload coverage report
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

Security Scanning

# Dependabot config — .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: pip
    directory: "/"
    schedule:
      interval: weekly
    open-pull-requests-limit: 5
    labels:
      - "dependencies"

# Bandit security scan — add to CI
- name: Run Bandit security scan
  run: |
    pip install bandit
    bandit -r src/ -ll  # Report medium and high severity

- name: Check for secrets in code
  uses: trufflesecurity/trufflehog@main
  with:
    path: ./
    base: ${{ github.event.repository.default_branch }}
    head: HEAD

Copilot for Teams

Copilot Business (vs. Individual) gives you: centralized license management, policy controls, audit logs, IP indemnification.

Recommended organization settings:
  ✓ Allow Copilot for all members
  ✓ Enable Copilot Chat in IDE
  ✓ Enable Copilot in GitHub.com (PR descriptions, code explanations)
  ✗ Disable: suggestions matching public code (reduces IP risk)

Disabling "suggestions matching public code" prevents Copilot from suggesting verbatim code from public repositories, reducing the risk of inheriting bugs or insecure patterns from public code.

The Setup Checklist

Use this as a day-one guide. Run it even if you think you already have everything — version mismatches bite.
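A quick way to catch a broken environment on day one is a tool check script. A minimal sketch; extend `REQUIRED_TOOLS` to match your stack:

```python
import shutil

REQUIRED_TOOLS = ["git", "python3", "black", "ruff", "mypy", "pytest"]

def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of required tools not found on PATH."""
    return [tool for tool in required if which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit(f"Missing tools: {', '.join(missing)}")
    print("All required tools found.")
```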

This playbook is a living document.

We update it as we learn. Subscribe to the newsletter and we'll let you know when chapters are revised, new tools are added to the approved list, or a real incident changes how we work.

Free. Weekly. No sales pitch. Unsubscribe anytime.

What Subscribers Get
Weekly updates from Obed Industries — what we're building, testing, and learning.
Playbook updates when chapters change
New AI tool reviews as we test them
Incident post-mortems (anonymized) from real AI dev work
Early access to new guides and frameworks
Full PDF download of the playbook (always the latest version)

We don't sell your data. Unsubscribe any time.