Software engineering spans far beyond any single language or framework. Whether you are preparing for interviews, planning your learning path, or reviewing gaps on a team, it helps to have a software engineer knowledge framework: a language-agnostic map of principles, technologies, patterns, and practices that define the craft.
This framework organizes essential knowledge into 30 categories—from version control and data structures to distributed systems and AI integration. Within each category, subtopics progress from basic fundamentals to advanced depth.
Use it as a reference, not a memorization checklist. The goal is to see how areas connect and where to invest next.
Version Control (Git)
The practice of tracking and managing changes to code over time. Git is the de facto standard; understanding it deeply prevents lost work, bad merges, and slow collaboration.
- Core concepts — A repository stores the full history of a project. A commit is a snapshot of changes with a message explaining why. The working tree is what you see on disk; the index (staging area) is what will go into the next commit.
- Branching & merging — Branches allow parallel streams of work without affecting the main codebase. Merging combines branches back together; understanding fast-forward vs three-way merges prevents surprises.
- Resolving conflicts — When two branches change the same lines, Git cannot auto-merge and requires manual resolution. Understanding conflict markers and using a diff tool correctly is an everyday skill.
- Remote operations (fetch, pull, push) — fetch downloads changes without applying them; pull fetches and merges; push sends local commits to the remote. Understanding the difference prevents accidental overwrites.
- Stash — Temporarily shelves uncommitted changes so you can switch context without committing half-finished work.
- Rebasing — Replays commits from one branch on top of another, producing a cleaner, linear history. Avoid rebasing branches that other people have already pulled, because rebasing rewrites history.
- Tags — Immutable pointers to specific commits, used to mark release versions. Annotated tags include a message and are the standard for versioned releases.
- Git hooks — Scripts that run automatically at specific points in the Git workflow (pre-commit, pre-push). Used to enforce code style, run tests, or prevent committing secrets.
- Reflog — A local log of every movement of HEAD, including commits no longer reachable from any branch. The primary recovery tool when commits appear to be lost.
- Bisect — A built-in binary search tool that locates which commit introduced a bug by checking out midpoints between a known good and bad commit. Dramatically reduces time to find regressions.
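The binary search behind bisect can be sketched in a few lines. This is an illustration of the idea only, not how Git implements it: `first_bad_commit`, the commit ids, and the `is_bad` check are hypothetical, and the sketch assumes a linear history where every commit after the first bad one is also bad.

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and badness never reverts once introduced.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]


# Hypothetical history in which the bug first appears in commit "f6".
history = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]
broken_from = history.index("f6")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_from)
print(culprit)  # f6
```

With eight commits the check runs three times instead of up to seven, which is where the time saving on long histories comes from.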
Operating System Fundamentals
The layer between hardware and software that every program runs on. Engineers who understand OS basics debug production issues faster and write more efficient code.
- Processes vs threads — A process is an isolated running program with its own memory; a thread is a lighter unit of execution that shares memory within a process. Understanding the difference is foundational for concurrency and debugging.
- Environment variables — Key-value pairs available to a running process, used to configure applications without hardcoding values. The standard way to pass secrets and config to containers and servers.
- File system basics — How files and directories are organised, accessed, and permissions are controlled (read, write, execute). Understanding paths, inodes, and file descriptors prevents common bugs.
- Shell & terminal proficiency — Using the command line to navigate, search, edit files, manage processes, and pipe commands together. An essential daily skill for any engineer working with servers or CI/CD.
- Signals — Notifications sent to a process by the OS or another process (e.g. SIGTERM to request graceful shutdown, SIGKILL to force stop). Handling them correctly prevents data corruption during deployments.
- Standard streams (stdin, stdout, stderr) — Every process has three default I/O streams. Understanding them is key for piping commands, capturing output, and diagnosing issues in scripts and containers.
- Memory management — Stack memory is automatically managed and fast; heap memory is dynamically allocated and requires explicit or garbage-collected management. Memory leaks occur when heap memory is allocated but never freed.
- System calls — The interface between user-space programs and the OS kernel. Operations like reading files, opening sockets, and spawning processes all go through system calls. Useful context for debugging low-level issues.
- File descriptors & I/O — Every open file, socket, or pipe is represented as a file descriptor. Understanding limits (ulimit), descriptor leaks, and non-blocking I/O is important for high-throughput server applications.
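As a small illustration of standard streams and exit codes, the sketch below starts a child process, feeds its stdin, and captures stdout and stderr separately. The child script is a made-up example; only the Python standard library is used.

```python
import subprocess
import sys

# A tiny child program: reads stdin, writes the result to stdout,
# writes a diagnostic to stderr, and exits with a non-zero status.
child_code = (
    "import sys\n"
    "data = sys.stdin.read()\n"
    "print(data.upper(), end='')\n"
    "print('warning: demo', file=sys.stderr)\n"
    "sys.exit(3)\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="hello\n",      # becomes the child's stdin
    capture_output=True,  # keep stdout and stderr in separate buffers
    text=True,
)

print("stdout:", result.stdout.strip())
print("stderr:", result.stderr.strip())
print("exit code:", result.returncode)
```

Keeping diagnostics on stderr means the data on stdout can be piped to another command without being polluted by log lines.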
Data Structures & Algorithms
Foundational building blocks for organising data and solving problems efficiently. Understanding these helps you choose the right tool and reason about performance trade-offs in any language or system.
Complexity Analysis
- Big O notation — A way to describe how an algorithm's time or memory usage grows relative to its input size. Allows comparing algorithms independently of hardware or language.
- Time complexity — How the number of operations scales with input size. Common classes: O(1) constant, O(log n) logarithmic, O(n) linear, O(n log n), O(n²) quadratic.
- Space complexity — How much memory an algorithm needs relative to input size. Often a trade-off against time complexity.
Basic Data Structures
- Array — A fixed-size, ordered collection of elements stored contiguously in memory. O(1) access by index; expensive to insert or delete in the middle.
- Dynamic array (list) — An array that resizes automatically when full. The default sequential collection in most languages; amortised O(1) append.
- Stack — A last-in, first-out (LIFO) structure. Supports push and pop in O(1). Used in call stacks, undo systems, and expression parsing.
- Queue / FIFO — A first-in, first-out structure. Supports enqueue and dequeue in O(1). Used in task scheduling, BFS, and message buffering.
- Hash map / hash table — Stores key-value pairs with O(1) average lookup, insert, and delete. The most commonly used data structure in application development.
- Set — A collection of unique values with O(1) average membership test. Used for deduplication and fast existence checks.
- Linked list — A sequence of nodes where each holds a value and a pointer to the next. O(1) insert/delete at a known position; O(n) access by index.
- Deque (double-ended queue) — Supports efficient insert and remove from both ends. Useful for sliding window problems and implementing both stacks and queues.
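A minimal sketch of the stack, queue, and set behaviours described above, using Python built-ins. The values are arbitrary examples.

```python
from collections import deque

# Stack (LIFO): a list gives O(1) amortised push and pop at the end.
stack = []
for item in ("a", "b", "c"):
    stack.append(item)
popped = [stack.pop() for _ in range(3)]

# Queue (FIFO): a deque gives O(1) append at one end and pop at the other.
queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)
dequeued = [queue.popleft() for _ in range(3)]

# Set: deduplication and O(1) average membership tests.
seen = set(["x", "y", "x"])

print(popped)    # ['c', 'b', 'a']
print(dequeued)  # ['a', 'b', 'c']
print("x" in seen, len(seen))  # True 2
```

A plain list would also work as a queue, but removing from the front of a list is O(n), which is why a deque is the usual choice.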
Trees
- Binary tree — A hierarchical structure where each node has at most two children. The foundation for many other tree types.
- Binary search tree (BST) — Left children are smaller, right children are larger than the parent. O(log n) average search; degrades to O(n) if unbalanced.
- Heap — Always keeps the smallest (min-heap) or largest (max-heap) element at the root. O(1) access to the extremal element, O(log n) insert/remove. Used in priority queues.
- Balanced trees (AVL, Red-Black) — Self-balancing BSTs that guarantee O(log n) operations. Used internally by most sorted map/set implementations.
- Trie (prefix tree) — Each node represents a character; paths spell out strings. O(m) lookup where m is string length. Used in autocomplete and IP routing.
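To make the trie concrete, here is a minimal sketch that supports insertion and prefix lookup, the operation behind autocomplete. The class and method names are illustrative choices, not a standard API.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def starts_with(self, prefix):
        """Return all stored words that begin with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie()
for word in ("car", "card", "care", "dog"):
    trie.insert(word)
print(trie.starts_with("car"))  # ['car', 'card', 'care']
```

Walking down to the prefix node costs O(m) for a prefix of length m; collecting the matches then depends on how many words share that prefix.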
Graphs
- Graph representation — Adjacency list (memory-efficient for sparse graphs) or adjacency matrix (fast edge lookup for dense graphs).
- Directed vs undirected / Weighted graphs — Directed edges model dependencies; weighted edges carry a cost. Used in shortest-path problems, routing, and recommendations.
- BFS (Breadth-First Search) — Explores level by level using a queue. Finds the shortest path in unweighted graphs.
- DFS (Depth-First Search) — Explores as far as possible before backtracking. Used in cycle detection, topological sorting, and maze solving.
- Topological sort — Orders nodes in a DAG so every edge goes from earlier to later. Used in build systems, task scheduling, and dependency resolution.
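Topological sorting can be sketched with Kahn's algorithm, which repeatedly emits nodes that have no remaining incoming edges. The build graph below is a hypothetical example, and the adjacency-list format (node to list of dependents) is an assumption of this sketch.

```python
from collections import deque


def topological_order(graph):
    """Order the nodes of a DAG so every edge goes from earlier to later.

    graph maps each node to the nodes that must come after it.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in graph.get(node, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(indegree):
        raise ValueError("graph has a cycle")
    return order


build = {
    "compile": ["test", "package"],
    "test": ["deploy"],
    "package": ["deploy"],
    "deploy": [],
}
print(topological_order(build))  # ['compile', 'test', 'package', 'deploy']
```

The cycle check falls out for free: if some nodes never reach an in-degree of zero, they are part of a cycle and no valid order exists.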
Sorting & Searching
- Binary search — Finds a target in a sorted array in O(log n) by repeatedly halving the search space.
- Merge sort — Divide-and-conquer with guaranteed O(n log n). Stable; preferred when order of equal elements matters.
- Quick sort — O(n log n) average, in-place. The most common general-purpose sort; degrades to O(n²) without good pivot selection.
- Heap sort — O(n log n) with O(1) extra space. Less common in practice but important conceptually.
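Binary search is short enough to show in full. This is a minimal iterative sketch that assumes the input is already sorted in ascending order; Python's standard `bisect` module covers the same ground for production use.

```python
def binary_search(items, target):
    """Return the index of target in sorted items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1  # target can only be in the right half
        else:
            hi = mid - 1  # target can only be in the left half
    return -1


values = [1, 3, 5, 7, 9, 11]
print(binary_search(values, 7))  # 3
print(binary_search(values, 4))  # -1
```

Each iteration halves the remaining range, which is where the O(log n) bound comes from.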
Design Principles
Rules of thumb that guide how to structure and write code so it remains readable, maintainable, and easy to change. Ignoring them tends to produce systems that are hard to test, extend, or hand off to others.
- KISS (Keep It Simple, Stupid) — Prefer the simplest solution that works. Complexity should only be introduced when simpler alternatives genuinely fall short.
- DRY (Don't Repeat Yourself) — Every piece of knowledge should have a single, authoritative representation in the codebase. Duplication leads to bugs when one copy is updated and another is not.
- YAGNI (You Aren't Gonna Need It) — Don't build features or abstractions until they are actually needed. Speculative code adds complexity without delivering value.
- SOLID — Five object-oriented design principles (Single responsibility, Open/closed, Liskov substitution, Interface segregation, Dependency inversion) that keep classes focused and loosely coupled.
- GRASP — General Responsibility Assignment Software Patterns: guidelines for assigning responsibilities to classes (Information Expert, Creator, Controller, Low Coupling, High Cohesion, Polymorphism, Pure Fabrication, Indirection, Protected Variations) so that designs remain maintainable and flexible.
- Separation of concerns — Different responsibilities (e.g. data access, business logic, presentation) should live in separate parts of the system so each can evolve independently.
- Dependency injection (DI) & IoC — Passing dependencies into a component from the outside rather than having it create them internally. Makes code testable and loosely coupled; the basis of most modern frameworks.
- Law of Demeter — A module should only talk to its immediate collaborators, not reach through them to access other objects. Reduces hidden coupling between unrelated parts of the code.
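Dependency injection is easiest to see in a small example. In this sketch the `Greeter` receives its clock from the outside, so a test can pass a fixed time instead of the real one. All class names here are hypothetical.

```python
from datetime import datetime


class SystemClock:
    """Production dependency: reads the real time."""

    def now(self):
        return datetime.now()


class FixedClock:
    """Test double: always returns the moment it was given."""

    def __init__(self, moment):
        self._moment = moment

    def now(self):
        return self._moment


class Greeter:
    def __init__(self, clock):
        # The dependency is passed in, not created inside the class.
        self._clock = clock

    def greeting(self):
        if self._clock.now().hour < 12:
            return "Good morning"
        return "Good afternoon"


morning = Greeter(FixedClock(datetime(2024, 1, 1, 9, 0)))
afternoon = Greeter(FixedClock(datetime(2024, 1, 1, 15, 0)))
print(morning.greeting())    # Good morning
print(afternoon.greeting())  # Good afternoon
```

In production the same class would be built as `Greeter(SystemClock())`. Had `Greeter` called `datetime.now()` directly, its output would depend on when the test happened to run.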
Design Patterns
Proven, reusable solutions to common software design problems. Knowing them gives you a shared vocabulary and prevents reinventing the wheel.
Creational
Patterns that control how objects are created, making instantiation more flexible and decoupled.
- Singleton — Ensures a class has only one instance with a global access point. Useful for shared resources like configuration or connection pools.
- Factory method — Defines an interface for creating an object but lets subclasses decide which class to instantiate.
- Abstract factory — Creates families of related objects without specifying their concrete classes.
- Builder — Constructs a complex object step by step. Useful when an object has many optional parameters.
- Prototype — Creates new objects by cloning an existing one. Useful when object creation is expensive.
Structural
Patterns that describe how to compose classes and objects into larger structures.
- Adapter — Converts one interface into another that a client expects. Useful when integrating with third-party libraries.
- Decorator — Dynamically adds behaviour to an object without modifying its class. Common for logging, caching, or validation wrappers.
- Proxy — Provides a surrogate for another object to control access. Used for lazy loading, access control, or logging.
- Facade — Provides a simplified interface over a complex subsystem. Reduces coupling for callers who don't need full complexity.
- Composite — Lets you treat individual objects and groups uniformly. Classic example: a file system where files and folders share the same interface.
Behavioral
Patterns that deal with communication and responsibility between objects.
- Strategy — Defines a family of algorithms, encapsulates each, and makes them interchangeable at runtime.
- Observer — Lets objects subscribe to and be notified of events from another object. Foundation of event systems and reactive programming.
- Command — Encapsulates a request as an object, allowing it to be queued, logged, or undone.
- State — Allows an object to change behaviour when its internal state changes. Replaces large conditional blocks.
- Mediator — Centralises communication between components so they don't reference each other directly.
- Chain of responsibility — Passes a request along a chain of handlers until one processes it. Common in middleware pipelines.
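As one example from this group, the Strategy pattern can be sketched with plain functions as the interchangeable algorithms. The pricing rules and the `Checkout` class are made up for illustration; amounts are in cents to avoid floating-point rounding.

```python
def no_discount(total_cents):
    return total_cents


def ten_percent_off(total_cents):
    return total_cents * 90 // 100


class Checkout:
    def __init__(self, pricing):
        # The pricing strategy is chosen at runtime and can be swapped.
        self.pricing = pricing

    def total(self, prices_cents):
        return self.pricing(sum(prices_cents))


basket = [1000, 2000]
print(Checkout(no_discount).total(basket))      # 3000
print(Checkout(ten_percent_off).total(basket))  # 2700
```

Adding a new pricing rule means writing one more function; `Checkout` itself does not change.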
Architectural
Higher-level patterns that shape the overall structure and data flow of a system.
- Strangler fig — Gradually replaces a legacy system by routing traffic to new components piece by piece.
- Onion Architecture — Concentric layers with the domain model at the centre; dependencies always point inward. Keeps business logic independent of frameworks and infrastructure.
- CQRS (Command Query Responsibility Segregation) — Separates read and write models so each can be optimised independently.
- Event sourcing — Stores the full history of state changes as events rather than just current state. Enables audit logs and time-travel debugging.
- Saga — Manages long-running transactions across services using local transactions and compensating actions on failure.
- Outbox pattern — Ensures a database write and a message publish happen atomically by writing to an outbox table first. Prevents lost events on crash.
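The outbox pattern can be sketched with an in-memory SQLite database. The table names, the `place_order` and `relay_once` functions, and the event payload are assumptions made for this illustration; a real relay would publish to a message broker, not append to a list.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)


def place_order(conn, item):
    # One transaction: the order row and the outbox row commit together
    # or not at all.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        payload = json.dumps({"type": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    return order_id


def relay_once(conn, publish):
    """Publish pending outbox rows, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


order_id = place_order(conn, "book")
sent = []
relay_once(conn, sent.append)
print(sent)
```

If the relay crashes after publishing but before marking the row, the event is sent again on the next run. Delivery is therefore at-least-once, and consumers should handle duplicates.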
Networking Fundamentals
The underlying concepts that govern how computers communicate. Engineers who understand these can diagnose connectivity issues, reason about latency, and design more resilient systems.
- DNS (Domain Name System) — Translates domain names into IP addresses. Understanding resolution, TTLs, and caching helps diagnose slow lookups, misconfigured environments, and propagation delays.
- Ports & sockets — A port is a logical endpoint on a machine (e.g. 443 for HTTPS, 5432 for PostgreSQL). A socket is one endpoint of a connection, identified by an IP address and a port. Knowing well-known ports and inspecting open connections is a basic debugging skill.
- Latency vs bandwidth — Latency is the time for a packet to travel from source to destination; bandwidth is the volume transferable per unit of time. Many performance problems are latency-bound, not bandwidth-bound.
- TCP/IP — The foundational protocol suite of the internet. TCP provides reliable, ordered delivery via connection management and retransmission; IP handles addressing and routing packets.
- UDP — A connectionless protocol that sends packets without guaranteeing delivery or order. Faster than TCP; used where speed matters more than reliability (video streaming, gaming, DNS).
- OSI model — A seven-layer framework (Physical → Data Link → Network → Transport → Session → Presentation → Application) for reasoning about where in the stack a network problem is occurring.
- Proxies & reverse proxies — A forward proxy sits between client and internet; a reverse proxy sits in front of servers for load balancing, TLS termination, and caching.
- Firewalls & network security groups — Rules controlling which traffic is allowed in and out based on IP, port, and protocol. Understanding them is essential for diagnosing unreachable services in cloud environments.
- NAT (Network Address Translation) — Maps private IP addresses to a public one, allowing many devices to share a single public IP. Relevant for understanding why services behind NAT are not directly reachable.
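To connect the NAT entry to something runnable, the sketch below uses Python's standard `ipaddress` module to tell private addresses from public ones. The helper name and the sample addresses are illustrative.

```python
import ipaddress


def is_private_address(address):
    """True for addresses in private ranges, which are not publicly routable."""
    return ipaddress.ip_address(address).is_private


for address in ("192.168.1.10", "10.0.0.5", "8.8.8.8"):
    kind = "private" if is_private_address(address) else "public"
    print(address, kind)
```

A service listening on one of the private addresses is reachable from the internet only if something in front of it, such as a NAT rule or a reverse proxy, forwards traffic to it.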
HTTP & Web Protocols
The rules that govern how data moves between clients and servers on the web. Most software engineers interact with these daily, even if indirectly.
- HTTP/1.1, HTTP/2, HTTP/3 — Successive versions of the web's core transfer protocol. HTTP/2 adds multiplexing of many requests over one connection; HTTP/3 runs over QUIC (built on UDP) instead of TCP, which shortens connection setup and avoids TCP head-of-line blocking.
- HTTPS / TLS / SSL — HTTPS is HTTP over an encrypted TLS connection. Ensures data in transit cannot be read or tampered with by a third party.
- Methods (GET, POST, PUT, PATCH, DELETE) — Verbs that indicate the intended action on a resource. Using the correct method is fundamental to building predictable APIs.
- Status codes — Three-digit response indicators (2xx success, 3xx redirect, 4xx client error, 5xx server error). Choosing the right code makes APIs easier to consume and debug.
- Headers (default & custom) — Key-value metadata used for content negotiation, caching, authentication, and custom application concerns.
- Cookies & sessions — Cookies are small data stored in the browser and sent with every request to the same domain. Sessions use cookies to maintain stateful interactions with a server.
- AJAX / Fetch — Making HTTP requests from JavaScript without reloading the page. The Fetch API is the modern standard replacing XMLHttpRequest.
- CORS (Cross-Origin Resource Sharing) — A browser security mechanism controlling which domains can make requests to a given API. Misconfiguration is a common source of both bugs and vulnerabilities.
- WebSockets — A persistent, bidirectional communication channel over a single TCP connection. Used for real-time features like chat and live dashboards.
- File upload & download — Handling binary data over HTTP using multipart/form-data for uploads and appropriate content-type headers for downloads.
- SSE (Server-Sent Events) — Pushes a stream of updates from server to client over a standard HTTP connection. Simpler than WebSockets for one-directional real-time data. See Server-Sent Events (SSE) and EventSource for a practical guide.
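The status code classes mentioned above map directly onto the first digit of the code. A minimal sketch using Python's standard `http.HTTPStatus` enum follows; the helper names are illustrative.

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Classify a status code by its first digit."""
    return CLASSES.get(code // 100, "unknown")


def describe(code):
    # HTTPStatus(code) raises ValueError for codes it does not know.
    return f"{code} {HTTPStatus(code).phrase}: {status_class(code)}"


print(describe(201))  # 201 Created: success
print(describe(404))  # 404 Not Found: client error
print(describe(503))  # 503 Service Unavailable: server error
```

Clients can often branch on the class alone, for example retrying on 5xx but not on 4xx.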
API Design & Data Formats
How to design and expose interfaces that other systems and developers can consume reliably, and which data formats to use for communication.
- REST — An architectural style for designing APIs around resources and standard HTTP methods. When followed consistently, REST APIs are predictable and easy to consume.
- JSON — The de facto standard format for data exchange in modern web APIs. Lightweight, human-readable, and natively supported in most languages.
- XML / XSL — XML is a flexible markup format used in legacy integrations, configuration, and document exchange. XSL transforms or styles XML documents.
- OpenAPI / Swagger — A standard for describing REST APIs in a machine-readable format. Enables automatic client generation, documentation, and contract testing.
- Versioning strategies — How to evolve an API without breaking existing clients. Common approaches: URL versioning (/v1/), header versioning, backwards-compatible changes.
- Pagination, filtering & sorting — Strategies for returning large datasets in manageable chunks. Cursor-based pagination is more reliable than offset for large or frequently changing datasets.
- GraphQL — A query language for APIs letting clients request exactly the data they need. Reduces over-fetching and under-fetching but adds server-side complexity.
- SOAP / WSDL — An older, XML-based protocol for web services, still common in enterprise and financial systems. WSDL describes the service contract in a machine-readable format.
- gRPC — A high-performance RPC framework using HTTP/2 and Protocol Buffers. Preferred for internal service-to-service communication where performance and strong typing matter.
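Cursor-based pagination can be sketched over an in-memory list. Here the cursor is simply the id of the last record the client saw; the function name, record shape, and page size are assumptions of this example.

```python
def page_after(records, cursor, limit):
    """Return (page, next_cursor) for records sorted by ascending id.

    cursor is the id of the last record already seen, or None for the
    first page. next_cursor is None when there are no more records.
    """
    remaining = [r for r in records if cursor is None or r["id"] > cursor]
    page = remaining[:limit]
    next_cursor = page[-1]["id"] if len(remaining) > limit else None
    return page, next_cursor


records = [{"id": i, "name": f"item-{i}"} for i in range(1, 6)]

cursor = None
pages = []
while True:
    page, cursor = page_after(records, cursor, limit=2)
    pages.append([r["id"] for r in page])
    if cursor is None:
        break
print(pages)  # [[1, 2], [3, 4], [5]]
```

Because each page is defined relative to the last id seen, rows inserted or deleted earlier in the dataset do not shift later pages, which is the weakness of offset-based paging. In a database the filter would be a `WHERE id > ?` clause backed by an index.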
Databases
How data is stored, queried, and kept consistent. Databases are at the core of most systems, and the wrong choices here affect performance and reliability for years.
- Relational vs non-relational — Relational databases (PostgreSQL, MySQL) store data in structured tables with schemas; non-relational (MongoDB, Cassandra, Redis) trade strict structure for flexibility, scale, or speed.
- ACID / BASE — ACID (Atomicity, Consistency, Isolation, Durability) guarantees strong correctness for relational databases. BASE (Basically Available, Soft state, Eventually consistent) describes the trade-offs of many distributed databases.
- Joins — Combining rows from two or more tables based on a related column. Understanding join types (INNER, LEFT, RIGHT, FULL) and their performance implications is fundamental.
- Indexes — Data structures that speed up lookups by avoiding full table scans. Over-indexing slows writes; missing indexes slow reads.
- Transactions — A group of operations that either all succeed or all fail together, keeping the database consistent. Understanding when and how to use them prevents partial updates and data corruption.
- Isolation levels — Define how much of other concurrent transactions' changes a transaction can see. Higher isolation prevents anomalies such as dirty reads, non-repeatable reads, and phantom reads, but reduces concurrency.
- Connection pooling — Reusing a pool of open database connections instead of opening a new one per request. Critical for performance and resource management under load.
- Database migrations — Versioned, incremental schema changes applied in a controlled, repeatable way. Essential for safely evolving a schema alongside application code.
- Execution plan / query optimization — The database's internal plan for executing a query. Inspecting it reveals why a query is slow and how to fix it.
- Sharding & replication — Sharding splits data across multiple nodes for horizontal scale; replication copies data to multiple nodes for availability and read performance.
- Read replicas — Copies of the primary database that serve read traffic, reducing load on the primary and improving read throughput.
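The all-or-nothing behaviour of transactions can be demonstrated with SQLite from the Python standard library. The `accounts` table and the `transfer` function are hypothetical; the credit is deliberately applied before the debit so that a failed debit shows the credit being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; nothing was applied.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)
print(ok, rejected, balances(conn))
```

After the rejected transfer, Bob's balance does not include the 500 that was briefly credited, because the whole transaction was rolled back.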
Application Security
Protecting systems from attackers and accidental misuse. Security is not a feature added at the end — it must be considered throughout design and development.
- OWASP Top 10 — The ten most critical web application security risks. A baseline checklist every engineer should know.
- XSS (Cross-Site Scripting) — Malicious scripts injected into a web page viewed by other users. Prevented by sanitising and escaping user input before rendering in HTML.
- CSRF (Cross-Site Request Forgery) — Tricks a logged-in user's browser into making an unwanted request. Mitigated with CSRF tokens or checking the Origin header.
- SQL injection — Malicious SQL inserted via user input, potentially exposing or destroying the database. Prevented by parameterised queries or prepared statements.
- Authentication vs authorization — Authentication verifies who you are; authorization determines what you're allowed to do. Confusing the two leads to significant security gaps.
- Password hashing & salting — Passwords must never be stored in plain text. Hashing with a deliberately slow algorithm (bcrypt, Argon2) makes brute-force guessing expensive, and a unique salt per user defeats precomputed lookup tables.
- Bearer tokens / JWT — Token-based authentication where the client presents a signed token with each request. JWTs encode claims directly, avoiding a database lookup on every request.
- OAuth 2.0 / OIDC — OAuth 2.0 is a delegated authorisation framework. OIDC builds on it to add identity verification (who the user is).
- Symmetric & asymmetric encryption — Symmetric uses the same key to encrypt and decrypt (fast, suited for large data). Asymmetric uses a public/private key pair for key exchange and signatures.
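- Hashing algorithms — One-way functions producing a fixed-size digest (e.g. SHA-256). Used for integrity checks and digital signatures. MD5 and SHA-1 are broken for security purposes, and password storage needs a deliberately slow function such as bcrypt or Argon2 rather than a fast general-purpose hash.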
- Secrets management — Storing and distributing credentials, API keys, and certificates securely. Tools like HashiCorp Vault and cloud secret managers keep secrets out of source code.
- Threat modeling / STRIDE — A structured process for identifying what could go wrong before a system is built. STRIDE categorises threats as Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, and Elevation of privilege.
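Salted password hashing can be sketched with the Python standard library alone. Note the substitution: bcrypt and Argon2 need third-party packages, so this example uses PBKDF2-HMAC-SHA256 from `hashlib` instead. The iteration count is an illustrative value, not a recommendation.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # illustrative; tune to current guidance and hardware


def hash_password(password):
    """Return (salt, digest) using a fresh random salt per password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, expected_digest)


salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))    # False
```

Both the salt and the digest are stored; the salt is not secret. Its job is to make identical passwords produce different digests.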
Testing
Verifying that software behaves correctly and continues to do so as it changes. Testing is what separates confident releases from fingers-crossed deployments.
- Unit testing — Testing individual functions or classes in isolation. Fast to run, easy to pinpoint failures, and the foundation of any test suite.
- Integration testing — Testing how multiple components work together, including real databases or queues. Catches issues that unit tests miss at the boundaries between parts.
- Functional testing — Validating that the system behaves correctly from the user's perspective, end to end.
- Automation testing — Running tests automatically using tools rather than manually. Includes UI automation (Playwright) and API testing; essential for maintaining quality at speed.
- Test pyramid — Many unit tests, fewer integration tests, even fewer end-to-end tests. An inverted pyramid (many E2E tests) is slow and brittle.
- TDD / BDD — Test-driven development writes tests before code to guide design. Behaviour-driven development (BDD) writes tests in human-readable Given/When/Then language to align developers and stakeholders.
- Contract testing — Verifies that a service's API matches the expectations of the services that consume it. Prevents integration failures caused by unannounced API changes.
- Performance testing — Validates that the system meets its non-functional performance requirements under expected and peak load. Includes load testing, stress testing, and spike testing.
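A unit test in its simplest form looks like the sketch below, written with Python's built-in `unittest`. The `slugify` function is a made-up unit under test.

```python
import unittest


def slugify(title):
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())


class SlugifyTests(unittest.TestCase):
    def test_lowercases_and_joins_words(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  many   spaces "), "many-spaces")


# Run the tests programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

Each test covers one behaviour and needs no database, network, or file system, which is what keeps unit tests fast and failures easy to pinpoint.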
Code Quality & Review
Practices and tooling to keep code readable, safe, and maintainable. Quality degrades gradually and invisibly without deliberate effort.
- Code review practices — Peers read and comment on code before it is merged. Catches bugs, spreads knowledge, and enforces standards — only effective when done thoughtfully, not as a rubber stamp.
- Static analysis — Automated tools that analyse source code without running it to find bugs, code smells, and security vulnerabilities. Integrates into CI pipelines for continuous feedback.
- Refactoring techniques — Restructuring existing code to improve its design without changing external behaviour. Includes Extract Method, Rename, and Replace Conditional with Polymorphism.
- Technical debt management — Tracking and addressing shortcuts and poor design decisions that make future work harder. Left unmanaged, it compounds and eventually paralyses a team. See What is technical debt? for how to recognise and manage it.
- Dynamic analysis — Analysis performed on running code to detect memory leaks, runtime errors, or unexpected behaviour that static analysis cannot find.
- Dependency scanning — Automatically checks third-party libraries for known vulnerabilities (CVEs) and licence violations. Essential for managing supply chain risk.
SDLC & Engineering Process
The end-to-end lifecycle of building and running software. Good process reduces waste, miscommunication, and rework.
Agile / Scrum — Delivers working software in short cycles (sprints) with regular feedback loops. Scrum defines roles (Product Owner, Scrum Master) and ceremonies (sprint planning, retrospectives). For when to use Scrum vs Kanban vs waterfall, see Agile, Scrum, and Kanban — when to use what.
Kanban — A flow-based approach that visualises work on a board and limits work in progress to prevent bottlenecks. Well-suited for continuous, unpredictable workloads.
Waterfall — A sequential process where each phase completes before the next begins. Predictable for well-defined projects but slow to adapt to change.
Functional requirements — Define what the system must do: specific behaviours, features, and interactions from the user's perspective. The direct answer to "what does this system do?"
Non-functional requirements (NFRs) — Define how well the system must perform: quality attributes that cut across features. Common NFRs:
- Performance — Response time, throughput, and latency targets under expected and peak load.
- Scalability — Ability to handle growing load by adding resources vertically (bigger machines) or horizontally (more machines).
- Availability — Percentage of time the system is operational (e.g. 99.9% allows about 8.76 hours of downtime per year).
- Reliability — The system's ability to function correctly and consistently over time, including graceful failure handling.
- Security — Protection against unauthorised access, breaches, and attacks.
- Maintainability — How easily the codebase can be understood, modified, and extended by current and future engineers.
- Testability — How easily the system can be verified through tests. Driven by design decisions like separation of concerns and dependency injection.
- Observability — The degree to which internal state can be inferred from external outputs (logs, metrics, traces). Pair this with delivery metrics — see Development metrics — what to measure and why for DORA and team health signals.
- Portability — Ability to run in different environments (OS, cloud providers, on-premise) with minimal changes.
- Compliance & auditability — Meeting legal or regulatory obligations (GDPR, HIPAA, SOC 2) and proving it through audit trails.
- Disaster recovery (RTO / RPO) — RTO is maximum acceptable downtime; RPO is maximum acceptable data loss. Together they define the recovery strategy.
- Capacity — Maximum volume of data, users, or transactions the system must support, now and in the future.
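The availability figure in the list above comes from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the hours in a year. A short sketch, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Maximum yearly downtime allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.0f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten: roughly 87.6 hours at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.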
Requirements refinement — Clarifying and agreeing on what needs to be built before development starts. Vague requirements are a leading cause of rework and missed expectations.
Estimation techniques — Forecasting how long work will take (story points, T-shirt sizing, three-point estimation). The goal is to communicate confidence ranges, not exact dates.
Documentation — Written records of decisions, APIs, architecture, and processes. Good documentation multiplies team effectiveness; outdated documentation actively misleads.
Localization / i18n — Designing software to support multiple languages and regional formats (dates, currencies, text direction).
Troubleshooting methodologies — Structured approaches to diagnosing production issues: hypothesis-driven debugging, bisecting, rubber duck debugging, and systematic log analysis.
Background jobs & scheduling — Cron jobs, delayed queues, retry logic, and dead-letter queues for processing work outside the request/response cycle. Almost every non-trivial system has async background work.
Planning & prioritization — Deciding what to build next based on value, risk, and effort. Good prioritisation ensures the team always works on what matters most.
Post-mortems / Root cause analysis (RCA) — Blameless reviews after an incident to understand what happened, why, and what prevents recurrence. A mature engineering culture treats these as learning opportunities.
Architecture Decision Records (ADRs) — Lightweight documentation of significant architectural decisions, capturing context, options considered, and reasoning. Prevents future engineers from repeating past mistakes or blindly reversing decisions.
Soft Skills & Engineering Behaviours
The non-technical competencies that distinguish great engineers. Technical knowledge alone does not make a great engineer.
- Written communication — Writing clear, concise tickets, pull request descriptions, and messages. The ability to communicate context and intent in writing is critical in async and distributed teams.
- Asking for help effectively — Knowing when to ask, how to frame the question with enough context, and what you've already tried. Saves time for everyone and accelerates learning.
- Giving and receiving feedback — Providing constructive, specific, and timely feedback on code and behaviour. Receiving feedback without defensiveness and acting on it.
- Estimation and expectation setting — Communicating realistic timelines, surfacing risks early, and updating stakeholders when things change. Reliability and transparency build trust.
- Mentoring & knowledge sharing — Helping less experienced engineers grow through pairing, code review, and documentation. Multiplies team output and is a core expectation at senior level.
- Technical leadership — Guiding architectural decisions, setting technical direction, and being accountable for the outcomes without necessarily being the one who implements them.
- Stakeholder communication — Translating technical concepts for non-technical audiences, influencing decisions, and managing expectations across product, design, and business.
- Incident leadership — Coordinating the response to a production incident: keeping calm, delegating clearly, communicating status, and driving the team toward resolution.
Concurrency & Distributed Systems
How to safely run multiple operations at the same time and coordinate work across machines. One of the hardest areas in engineering — subtle bugs here can be rare and catastrophic.
- async / await — Allows a single thread to handle many concurrent operations by pausing and resuming while waiting for I/O. Avoids blocking threads on network or disk operations.
- Multithreading — Running multiple threads within a single process to perform work in parallel. Requires careful management of shared state to avoid race conditions and corrupted data.
- Sync primitives (lock, mutex, semaphore) — Low-level tools that control access to shared resources across threads. Misuse leads to deadlocks or race conditions.
- Race conditions — Bugs that occur when the outcome depends on the unpredictable ordering of concurrent operations. Hard to reproduce; require careful design or locking to prevent.
- Optimistic & pessimistic locking — Optimistic locking assumes conflicts are rare and detects them on write; pessimistic locking prevents conflicts by locking data upfront.
- Idempotency — An operation is idempotent if performing it multiple times produces the same result as once. Critical for safely retrying failed operations in distributed systems.
- Distributed transactions — Transactions spanning multiple services or databases. Require patterns like Saga or 2-phase commit; much harder to implement correctly than local transactions.
- 2-phase commit (2PC) — Coordinates a transaction across multiple nodes by first asking all participants if they can commit, then instructing them to do so. Strong consistency but slow and can block on failure.
- Eventual consistency — Given no new updates, all replicas will eventually converge to the same value. Common in distributed databases that prioritise availability over immediate consistency.
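Two of the ideas above, optimistic locking and idempotency, are easier to see in code. The sketch below is an in-memory illustration under stated assumptions: `VersionedRecord`, `VersionConflict`, and `PaymentService` are hypothetical names, and a real datastore would have to perform the version check and the write atomically (for example with a conditional update).

```python
from typing import Dict


class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedRecord:
    """Optimistic locking: no lock is held; conflicts are detected on write."""

    def __init__(self, value: str) -> None:
        self.value = value
        self.version = 0

    def update(self, new_value: str, expected_version: int) -> None:
        # Reject the write if another writer got there first.
        if expected_version != self.version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.value = new_value
        self.version += 1


class PaymentService:
    """Idempotency: a repeated request key returns the stored result."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._results: Dict[str, int] = {}  # idempotency key -> result

    def charge(self, idempotency_key: str, amount: int) -> int:
        if idempotency_key in self._results:
            # A retry of an already-processed request: do not charge again.
            return self._results[idempotency_key]
        self.balance -= amount
        self._results[idempotency_key] = self.balance
        return self.balance
```

A caller that hits `VersionConflict` would typically re-read the record and retry; a caller that times out on `charge` can resend the same key without risking a double charge.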
Containers & Orchestration
Packaging applications and their dependencies into portable units, and managing how those units run at scale.
- Docker vs VM — Containers share the host OS kernel and typically start in seconds or less; VMs virtualise hardware and run a full guest OS, so they are heavier but more strongly isolated. Containers are now a standard unit of deployment.
- Dockerfile & layers / caching — A Dockerfile defines how an image is built step by step. Each instruction creates a layer; placing rarely changing instructions before frequently changing ones maximises cache reuse, and keeping each layer lean minimises image size.
- Docker Compose — Defines and runs multi-container applications with a single YAML file. Manages dependencies, networking, and volumes between services locally.
- Docker commands — Core CLI operations: build, run, exec, logs, ps, and flags -p (port), -e (env var), -v (volume), -d (detached).
- Docker volumes & networking — Volumes persist data beyond a container's lifetime. Docker networking lets containers communicate using service names instead of IP addresses. For a hands-on walkthrough, see Run an MCP server in Docker.
- Passing arguments to containers — Environment variables (-e, --env-file) configure application behaviour at runtime without modifying the image; build args (--build-arg) parameterise the image at build time.
- Container registries — Repositories for storing and distributing Docker images (Docker Hub, AWS ECR, GitHub Container Registry). A standard part of container-based CI/CD pipelines.
- Kubernetes basics — Automates deployment, scaling, and healing of containerised applications. Core concepts: Pods, Deployments, Services, ConfigMaps, Namespaces, and health probes.
- Helm charts — Package manager for Kubernetes that bundles all manifests needed to deploy an application. Makes it easy to version, share, and configure deployments across environments.
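Why instruction order matters for layer caching can be illustrated with a toy model. This is not how Docker is implemented internally; it only assumes that a layer can be reused when its own instruction (plus a stand-in digest of any copied files) and every layer before it are unchanged. `layer_cache_keys`, `reusable_layers`, and the `digest=...` markers are hypothetical.

```python
import hashlib
from typing import List


def layer_cache_keys(instructions: List[str]) -> List[str]:
    """Toy model: a layer's key depends on its instruction and its parent's key."""
    keys: List[str] = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old: List[str], new: List[str]) -> int:
    """Count the leading layers a rebuild could take from cache."""
    count = 0
    for old_key, new_key in zip(layer_cache_keys(old), layer_cache_keys(new)):
        if old_key != new_key:
            break  # Every layer after the first change must be rebuilt.
        count += 1
    return count


# Dependencies copied and installed before the frequently changing source.
deps_first_v1 = [
    "FROM python:3.12-slim",
    "COPY requirements.txt digest=r1",
    "RUN pip install -r requirements.txt",
    "COPY src digest=s1",
]
deps_first_v2 = deps_first_v1[:3] + ["COPY src digest=s2"]

# Everything copied up front, so any source edit invalidates the install step.
copy_all_v1 = [
    "FROM python:3.12-slim",
    "COPY . digest=a1",
    "RUN pip install -r requirements.txt",
]
copy_all_v2 = [copy_all_v1[0], "COPY . digest=a2", copy_all_v1[2]]
```

In this model a source-only change reuses three layers in the first ordering but only the base layer in the second, which is the reason dependency installation is usually placed before copying application code.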
CI/CD & Delivery
Automating the path from code commit to production to make releases faster, safer, and repeatable.
- CI pipelines — Automated pipelines that run on every commit to build, test, and check quality. Catches problems early before they reach production.
- Branching strategies (GitFlow, trunk-based) — GitFlow uses long-lived main and develop branches plus dedicated feature, release, and hotfix branches; trunk-based development favours short-lived branches and frequent merges to main.
- Package managers (npm, NuGet, pip) — Manage third-party dependencies, their versions, and transitive dependencies. Understanding lock files and dependency resolution prevents version conflicts.
- CD & deployment strategies — Continuous delivery keeps every passing build in a shippable state, with release as a manual decision. Continuous deployment goes one step further and releases each passing build automatically, without human approval.
- Blue-green & canary deployments — Blue-green switches traffic between two identical environments for zero-downtime releases. Canary gradually shifts a percentage of traffic to the new version to catch issues early.
- Feature flags — Toggles enabling or disabling features at runtime without deploying new code. Decouples deployment from release and enables safe rollout to subsets of users.
- Artifact management — Repositories storing build outputs (JARs, Docker images, npm packages) with versioning and access control. Ensures reproducible builds from a trusted source.
- Release & rollback — Promoting a build to production and reverting to a previous version if something goes wrong. A well-defined rollback plan is essential for safe deployments.
- GitOps — Infrastructure and application state declared in Git; automated tooling continuously reconciles the live system to match. Changes to production are made via pull requests, not manual commands.
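The "safe rollout to subsets of users" mentioned under feature flags is often done by hashing a stable identifier into a bucket. The following is a minimal sketch of that idea, assuming string user IDs; `is_enabled`, the flag name, and the 100-bucket scheme are illustrative choices, not any particular flag service's API.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag.

    The same user always lands in the same bucket, so raising the percentage
    only ever adds users; nobody flips back and forth between requests.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Including the flag name in the hash means different flags select different subsets of users, so the same people are not always the first to receive every change.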
Cloud & Infrastructure
Running and managing software on remote infrastructure. Cloud is now the default deployment environment for most systems.
- Cloud providers (AWS / Azure / GCP) — The three major platforms offering compute, storage, databases, networking, and managed services. Concepts transfer across all three.
- IaaS / PaaS / SaaS / FaaS — Levels of abstraction: IaaS gives raw VMs; PaaS manages the runtime; SaaS is fully managed software; FaaS runs individual functions on demand.
- Object storage (S3-like) — Flat storage for any file or blob, accessed via HTTP. The standard way to store images, backups, logs, and static assets in the cloud.
- Cloud networking (VPC, subnets) — A Virtual Private Cloud is a logically isolated network section. Subnets divide it into public (internet-facing) and private (internal) zones.
- Serverless — Running code without managing servers; the cloud provider handles scaling. Cost-effective for irregular workloads but has cold-start latency and execution time limits.
- Cost awareness — Understanding how cloud resources are priced and how usage translates to spend. Unmonitored costs can grow rapidly; tagging and budgets are basic hygiene.
- Infrastructure as code (Terraform) — Defining and provisioning cloud infrastructure through code rather than manual UI steps. Enables version control, repeatability, and automated environment management.
Observability
Understanding what is happening inside a running system through its outputs. You cannot fix what you cannot see. For measuring delivery and reliability outcomes, see Development metrics — what to measure and why.
- Structured logging — Writing logs as machine-parseable key-value pairs (JSON) rather than plain strings. Makes logs searchable and correlatable across many services.
- Metrics & dashboards — Numeric measurements over time (request rate, error rate, latency, CPU). Dashboards give a real-time health view of the system.
- Alerting & on-call — Automated notifications triggered when metrics breach thresholds. Effective alerting pages the right person quickly without crying wolf.
- SLO / SLA / SLI — SLIs are measurements; SLOs are internal targets; SLAs are contractual commitments to customers. Understanding the difference drives better reliability decisions.
- Distributed tracing — Tracks a single request as it flows through multiple services, recording timing and context at each step. Essential for diagnosing latency and failures in microservice architectures.
- OpenTelemetry — A vendor-neutral, open standard (with accompanying SDKs) for producing logs, metrics, and traces in a unified format. Widely adopted as an alternative to proprietary vendor SDKs.
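Structured logging, as described above, can be sketched with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under a `context` key are assumptions made for this example; logging libraries and platforms each have their own conventions.

```python
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the entry.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

A call such as `log.info("order placed", extra={"context": {"order_id": "o-1", "trace_id": "t-9"}})` then produces one JSON object per line, which a log platform can filter by `order_id` or join across services by `trace_id`.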
Architecture & System Design
How you organise and connect the major components of a system. Good architecture decisions made early save enormous time later; poor ones compound into technical debt.
- Monolith vs microservices — A monolith is one deployable unit; microservices split functionality into independently deployable services. Each has trade-offs in complexity, scalability, and operational overhead.
- CAP theorem — When a network partition occurs, a distributed system must choose between consistency and availability; it cannot guarantee both. Because partitions cannot be ruled out in practice, this sets realistic expectations for distributed databases.
- Caching strategies (CDN, Redis) — Storing frequently accessed data closer to the consumer to reduce latency and database load. Different layers (browser, CDN, application, database) suit different use cases.
- Load balancing — Distributes incoming traffic across multiple instances to improve availability and throughput. Strategies include round-robin, least-connections, and consistent hashing.
- API gateway — A single entry point for all client requests handling routing, authentication, rate limiting, and protocol translation.
- Rate limiting — Caps the number of requests a client can make in a given time window. Protects services from overload and abuse.
- Event-driven architecture — Components communicate by producing and consuming events asynchronously. Improves decoupling and scalability but adds complexity around ordering and delivery guarantees.
- Message queues (Kafka, RabbitMQ) — Middleware that buffers messages between producers and consumers, allowing them to work at different speeds and recover from failures independently.
- Service mesh — Infrastructure layer (e.g. Istio, Linkerd) that handles service-to-service communication, retries, observability, and mTLS without changing application code.
- Circuit breaker — Stops calling a failing service after a threshold of errors and returns a fallback. Prevents cascading failures in distributed systems.
- Backpressure — A mechanism where a downstream component signals upstream to slow down when it cannot keep up. Essential for preventing memory exhaustion in streaming systems.
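The circuit breaker described above can be sketched as a small state machine. This is a simplified illustration, assuming consecutive-failure counting and a fixed cooldown; `CircuitBreaker`, `failure_threshold`, `reset_after`, and the injectable `clock` are hypothetical names, and production libraries usually add richer half-open handling and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: T) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback  # Open: do not touch the failing service.
            self.opened_at = None  # Cooldown elapsed: let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Trip (or re-trip) the breaker.
            return fallback
        self.failures = 0  # Success closes the circuit again.
        return result
```

While the breaker is open the downstream service receives no traffic at all, which gives it room to recover instead of being hit by retries from every caller.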
Domain-Driven Design (DDD)
An approach to software design that centres the model on the business domain, using a shared language between engineers and domain experts. It provides the conceptual foundation for many architectural patterns like CQRS, Event Sourcing, and Saga.
- Ubiquitous language — A shared vocabulary used consistently by both engineers and domain experts in code, documentation, and conversation. Eliminates translation errors between business intent and technical implementation.
- Bounded context — An explicit boundary within which a particular domain model applies. Different contexts can model the same concept differently without conflict, as long as boundaries are clear.
- Entities & value objects — An entity has a unique identity that persists over time (e.g. a User); a value object is defined entirely by its attributes and has no identity (e.g. a Money amount).
- Aggregates — A cluster of domain objects treated as a single unit for data changes, with one root entity controlling access. Defines the consistency boundary for a transaction.
- Domain events — Immutable records of something significant that happened in the domain (e.g. OrderPlaced). The bridge between DDD and event-driven architecture.
- Repositories & domain services — Repositories abstract data access for aggregates; domain services encapsulate logic that doesn't naturally belong to a single entity or value object.
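The entity versus value object distinction above maps naturally onto equality rules. A minimal sketch, reusing the `User` and `Money` examples from the list; the field names and the choice to store amounts in minor units are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by its attributes."""

    amount: int  # minor units, e.g. cents (an assumption for this sketch)
    currency: str

    def add(self, other: "Money") -> "Money":
        if other.currency != self.currency:
            raise ValueError("currency mismatch")
        return Money(self.amount + other.amount, self.currency)


@dataclass(eq=False)
class User:
    """Entity: identity is the id, not the current attribute values."""

    user_id: str
    email: str

    def __eq__(self, other: object) -> bool:
        return isinstance(other, User) and other.user_id == self.user_id

    def __hash__(self) -> int:
        return hash(self.user_id)
```

Two `Money(500, "EUR")` instances are interchangeable, whereas a `User` remains the same user after changing their email address.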
Performance Profiling
Measuring and improving how efficiently a system uses CPU, memory, I/O, and network. The discipline of finding real bottlenecks rather than guessing at them.
- Profiling basics — A profiler samples or instruments a running program to show where time and memory are being spent. Always profile before optimising — intuition about bottlenecks is frequently wrong.
- CPU profiling — Identifies which functions or code paths consume the most processor time. Flame graphs are the standard visualisation for quickly spotting hot paths.
- Memory profiling — Tracks heap allocations over time to detect memory leaks and excessive allocation rates that pressure the garbage collector.
- I/O profiling — Measures disk and network I/O patterns to find slow queries, chatty APIs, or unnecessary round trips. Often the actual bottleneck in data-heavy services.
- Benchmarking — Running controlled, repeatable experiments to measure the performance of a specific piece of code. Essential for validating that an optimisation actually improves things and doesn't regress later.
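As a concrete starting point for "profile before optimising", here is a minimal sketch using Python's built-in `cProfile` and `pstats` modules. The `slow_sum` function is a deliberately naive, hypothetical workload, and `profile_report` is an illustrative helper name.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive loop so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_report(n: int = 200_000, top: int = 5) -> str:
    """Profile slow_sum and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(n)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Printing `profile_report()` lists the functions that consumed the most time, which is the evidence to look at before deciding what to optimise.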
AI / LLM Integration
Building software that leverages large language models and AI services as components. This has moved from a specialisation to a mainstream engineering skill. Start with What is an MCP server? and What is an LLM context window? if you are new to AI-assisted development.
- Prompt engineering — Crafting inputs to a language model to reliably produce useful outputs. Techniques include few-shot examples, chain-of-thought prompting, and system instructions. Spec-driven development and vibe coding sit at opposite ends of how much structure you impose before the model writes code.
- Context engineering — Designing what enters the model's working context on each call: system instructions, tool schemas, retrieved documents, and conversation history. Includes prioritisation, compaction, and summarisation so the model sees the right facts within token limits. Distinct from prompt engineering (how you phrase a single turn); central to agents, coding assistants, and RAG pipelines.
- AI APIs & SDK usage — Integrating with AI services (OpenAI, Anthropic, etc.) via REST APIs or SDKs. Includes managing context windows, token limits, streaming responses, and rate limits.
- Retrieval-Augmented Generation (RAG) — Relevant documents are retrieved from a knowledge base and injected into the model's context before generating a response. Reduces hallucination and grounds answers in up-to-date data.
- Embeddings & similarity search — Numerical representations of text where semantically similar items are close together in vector space. The mechanism that makes semantic search and RAG possible.
- Vector databases (pgvector, Pinecone, Qdrant) — Databases optimised for storing and querying high-dimensional embeddings using similarity search. The storage layer in most RAG architectures.
- Evaluating model outputs — Systematically measuring whether an AI system produces correct, safe, and useful responses. Requires defining metrics and a mix of automated and human review. How MCP can accelerate business growth covers practical integration patterns beyond the prototype stage.
- Hallucination mitigation — Techniques to reduce confident but incorrect model outputs, including grounding with retrieval, output validation, and asking models to cite sources.
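The retrieval step behind embeddings, similarity search, and RAG comes down to ranking vectors by closeness. The sketch below uses cosine similarity over hand-written three-dimensional vectors; real embeddings come from a model and have hundreds or thousands of dimensions, and `top_k` with its in-memory dictionary stands in for what a vector database does at scale.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k documents whose vectors are most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors written by hand for illustration, not produced by a model.
toy_index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-keys": [0.0, 0.1, 0.9],
}
```

In a RAG pipeline, the text of the top-ranked documents would then be placed into the model's context alongside the user's question.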
Supply Chain Security
Protecting the software build process itself from compromise. High-profile incidents (the SolarWinds build compromise, the Log4Shell vulnerability in a widely used logging library) showed that the tools and dependencies behind a product are targets too, not just the product's own code.
- Software Bill of Materials (SBOM) — A machine-readable list of every component and dependency in a software artifact, with version and licence information. Increasingly required by regulators and enterprise customers.
- Dependency provenance — Verifying that a dependency actually comes from where it claims to. Protects against typosquatting, dependency confusion attacks, and compromised registries.
- SLSA framework — Supply-chain Levels for Software Artifacts: a maturity model for build integrity, from basic source control practices to fully hermetic, verified builds.
- Signed artifacts — Cryptographic signatures that prove a build artifact was produced by a specific, trusted pipeline and has not been tampered with.
- Third-party package verification — Checking packages haven't been tampered with after publication, using lock files, hash verification, and signing tools such as Sigstore's Cosign.
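Hash verification, the simplest of the checks above, compares a downloaded artifact's digest with a value pinned in advance (for example in a lock file). A minimal sketch using only the standard library; `verify_artifact` and the byte strings in the usage note are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True only if the artifact matches the pinned digest."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_hex(data), expected_sha256.lower())
```

If `pinned = sha256_hex(original_bytes)` was recorded when the dependency was first vetted, `verify_artifact(downloaded_bytes, pinned)` fails as soon as a single byte differs. A hash proves the content is unchanged; it does not by itself prove who produced it, which is what signatures add.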
Zero-Trust & Identity
A security model that assumes no user, device, or service should be trusted by default — even inside the corporate network.
- Never-trust-always-verify — Verify identity and permissions on every request regardless of network location. Eliminates the assumption that internal traffic is inherently safe.
- Workload identity (SPIFFE / SPIRE) — Assigns cryptographic identities to services and workloads (not just humans). Enables services to authenticate each other without static API keys.
- Short-lived credentials — Credentials that expire in minutes or hours, reducing blast radius if leaked. Automated rotation replaces static, long-lived keys.
- Mutual TLS (mTLS) — Both client and server present certificates to authenticate each other over TLS. Standard in service meshes for securing service-to-service communication.
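To make the short-lived credentials idea concrete, here is a minimal sketch of a token that carries its own expiry and is rejected once that time passes. It uses an HMAC over a shared secret purely for illustration; `issue_token`, `verify_token`, and the token layout are hypothetical, and real deployments would normally rely on an established format and issuer (such as the workload identity systems listed above) rather than a hand-rolled scheme.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_token(
    subject: str, secret: bytes, ttl_seconds: int, now: Optional[float] = None
) -> str:
    """Create a signed token that expires ttl_seconds after issuance."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": subject, "exp": issued + ttl_seconds}).encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(
    token: str, secret: bytes, now: Optional[float] = None
) -> Optional[str]:
    """Return the subject if the signature is valid and the token is unexpired."""
    current = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # Malformed token.
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # Tampered with, or signed with a different secret.
    claims = json.loads(payload)
    if current >= claims["exp"]:
        return None  # Expired: a leaked token stops working on its own.
    return claims["sub"]
```

With a five-minute lifetime, a token leaked in a log file is only useful to an attacker for those five minutes, which is the blast-radius reduction the list item describes.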
Platform Engineering
Building internal tools and workflows that enable development teams to deploy, operate, and observe their software independently, without depending on a central ops team.
- Internal developer platforms (IDPs) — Self-service platforms that abstract infrastructure complexity. Developers provision environments and deploy services through a curated interface.
- Golden paths — Opinionated, pre-built templates for common engineering tasks (new service, observability setup, deployment). Reduces cognitive load and enforces standards without mandating them.
- Developer portals (Backstage) — A centralised hub for discovering services, documentation, ownership, and tooling across an organisation.
Accessibility (a11y)
Designing and building software usable by people with disabilities. In many jurisdictions this is a legal requirement, not just a best practice.
- WCAG guidelines — Web Content Accessibility Guidelines define conformance levels (A, AA, AAA). AA is the standard target, covering contrast ratios, keyboard access, and alternative text.
- ARIA roles & semantic HTML — ARIA attributes communicate role, state, and properties to assistive technologies. Correct semantic HTML elements often provide this automatically.
- Keyboard navigation — All interactive elements must be reachable and operable using only a keyboard. Required for users with motor impairments.
- Screen reader compatibility — Testing that screen readers (NVDA, VoiceOver, JAWS) can navigate and announce content correctly. Requires correct labelling of images, forms, and dynamic content updates.
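The contrast ratios mentioned under WCAG are computed from the relative luminance of the two colours. The sketch below follows the formula in the WCAG 2.x definitions and checks the 4.5:1 threshold that level AA sets for normal-size text; the function names are illustrative, and colours are assumed to be 8-bit sRGB tuples.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """Ratio from 1 (identical colours) up to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa_normal_text(foreground: RGB, background: RGB) -> bool:
    """Level AA asks for at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5
```

A check like this can run in a design-token pipeline or a unit test so that low-contrast colour pairs are caught before they reach users.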
FinOps & Green Engineering
Managing and optimising the financial and environmental costs of running software in the cloud.
- Cloud cost attribution & rightsizing — Tagging resources to track spend by team or service. Rightsizing means choosing the correct instance type rather than over-provisioning.
- Carbon-aware workload scheduling — Shifting batch workloads to times or regions where the grid has more renewable energy. Tools like the Carbon Aware SDK make this programmable.
- Sustainability metrics — Measuring the carbon footprint of software systems. Cloud providers now offer carbon dashboards; some organisations include these in engineering OKRs.
Edge Computing
Running code on servers geographically close to the user rather than in a central data centre. Reduces latency and enables data residency compliance.
- Edge runtimes (Cloudflare Workers, Fastly, Deno Deploy) — Execute JavaScript or WASM on a globally distributed network of locations, with cold-start times in milliseconds.
- Latency trade-offs — Moving compute to the edge reduces round-trip time but limits available resources and services. Not every workload benefits.
- Data residency — Legal or contractual requirements that data stays within a specific geography. Edge deployments can be configured to process requests only in compliant regions.
WebAssembly (WASM)
A binary instruction format that runs at near-native speed in browsers and increasingly on servers. Allows code written in Rust, C++, or Go to run in environments previously limited to JavaScript.
- WASM in the browser — Enables performance-critical workloads (video processing, game engines, codecs) to run client-side without plugins.
- Server-side WASM (WASI) — Extends WASM to run on servers and edge runtimes. Offers a sandboxed, portable alternative to containers for some workloads.
How to use this framework
This map is intentionally broad. No engineer needs every advanced topic on day one — and no team needs every category at the same depth.
A practical approach:
- Audit your basics — Work through Basic topics in Git, OS fundamentals, HTTP, databases, security, and testing. Gaps here show up in almost every project.
- Follow your stack — A backend engineer might go deeper on databases, APIs, and distributed systems; a frontend engineer on HTTP, accessibility, and performance profiling.
- Grow into advanced areas — Advanced topics like CQRS, Kubernetes, or RAG become relevant when the problem domain demands them, not because they appear on a checklist.
- Revisit periodically — The field moves fast (AI integration, supply chain security, platform engineering). Re-scan the map once or twice a year to spot emerging gaps.
For related reading on process and delivery, see Agile, Scrum, and Kanban — when to use what and Development metrics — what to measure and why. For AI integration depth, start with What is an MCP server? and What is an LLM context window?.