Why Pull Requests Should Be Reviewed the Moment They're Submitted: Four Classical Lenses and the AI-Era Twist
The question of when to review a pull request looks, on the surface, like a trivial scheduling problem — another instance of the familiar tension between focused work and incoming interrupts. In practice, however, it sits at the intersection of queueing theory, information economics, control theory, and — since roughly 2024 — a new set of dynamics introduced by LLM-based coding agents. Getting the answer right matters more than it seems, and getting it wrong has consequences that compound over time.
The claim I want to defend is simple: when a teammate submits a pull request, it should be reviewed immediately — on the order of minutes, not hours. This sounds contrarian; the reviewer, after all, pays a context-switch cost, and the author seems like they should be patient. The argument of this essay is that the reviewer’s cost is genuinely smaller than it appears, the author’s cost is genuinely larger, and the gap between them widens further under the conditions that define modern AI-assisted development. I work through the argument via four largely independent theoretical lenses, quantify it with a discrete-event simulation, and then revisit the analysis once AI agents enter the picture — which, it turns out, strengthens the conclusion rather than weakens it.
The four classical lenses
The starting point is an asymmetry between the two roles in a review. The author holds complete context at the moment of submission: every rejected alternative, every edge case they considered, the precise reason for each naming decision. Much of this context never makes it into the commit message or the PR description; it lives in the author’s working memory. Parnin and Rugaber analyzed roughly 10,000 recorded programming sessions across 86 developers and reported that only about 10% of interrupted sessions resumed coding activity within one minute of returning to the task, and only 7% began editing without first navigating elsewhere in the codebase 1. The plain reading is that programmers lose a substantial portion of their task context quickly after disengaging, and reconstructing it involves non-trivial navigation and re-reading. Whatever the precise functional form, the author’s context cost grows monotonically with the time between submission and review.
The reviewer’s position is structurally different. Switching from current work to a pending review incurs a one-time attention cost — winding down, reading the diff, forming an assessment, and winding back up. This cost is real, and reviewers are right to resist frivolous interruptions, but it does not grow with how long the PR has been sitting. It is a constant per-switch price, not a time-dependent one.
The consequence is that a policy which defers reviews to protect the reviewer’s focus is optimizing the smaller of two costs. In aggregate, the team pays more — not less — by letting authors wait, because the author’s cost curve is a rising function of wait time while the reviewer’s curve is roughly flat. This is the first and most fundamental lens.
The second lens is queueing theory. Little's Law, L = λW, relates the average number of items in a queue (L) to the arrival rate (λ) and the average wait time (W) 2. Batching reviews directly inflates the in-queue count because the wait grows. The more interesting consequence appears in the Kingman approximation for the mean wait time of a G/G/1 queue 3:

W ≈ (ρ / (1 − ρ)) × ((C_a² + C_s²) / 2) × t_s

Here ρ is utilization, C_a is the coefficient of variation of inter-arrival times, C_s is the coefficient of variation of service times, and t_s is the mean service time. A batch policy — “I’ll review everything at 5pm” — dramatically inflates C_a, because from the queue’s perspective arrivals are zero during the day and then a burst at batch time. Combined with high utilization (ρ → 1), wait time does not degrade linearly; it blows up. Reinertsen made this argument forcefully in the context of product development: queues are the dominant invisible cost in most development organizations, and batching is the main cause of queue formation 4. The practical inversion is counterintuitive: the busier the team, the more a “review when I get a chance” policy breaks down, because “busy” means ρ is closer to 1, and that is exactly the regime where Kingman’s formula is supralinear.
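The blow-up is easy to see numerically. The sketch below plugs assumed values (a 20-minute mean review service time, C_s = 1; neither is a measurement) into Kingman’s approximation and compares steady, Poisson-like arrivals with bursty, batch-shaped ones:

```python
# Kingman's G/G/1 approximation for mean wait in queue:
#   W ≈ (ρ / (1 − ρ)) × ((C_a² + C_s²) / 2) × t_s
# Parameter values are illustrative assumptions, not measurements.

def kingman_wait(rho, ca, cs, service_time):
    """Approximate mean wait, in the same units as service_time."""
    return (rho / (1.0 - rho)) * ((ca**2 + cs**2) / 2.0) * service_time

SERVICE = 20.0  # assumed mean minutes of review work per PR
for rho in (0.5, 0.8, 0.9, 0.95):
    steady = kingman_wait(rho, ca=1.0, cs=1.0, service_time=SERVICE)  # Poisson-like
    bursty = kingman_wait(rho, ca=3.0, cs=1.0, service_time=SERVICE)  # batch-shaped
    print(f"ρ={rho:.2f}  steady ≈ {steady:6.1f} min   bursty ≈ {bursty:6.1f} min")
```

Note the two levers multiply: raising ρ from 0.5 to 0.9 alone is a 9× increase in wait, and burstiness multiplies that again.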
The third lens is the dependency graph across pull requests. Empirical studies of pull-based development on GitHub consistently find that time-to-first-response is the single strongest determinant of overall PR latency. Zhang and colleagues analyzed pull requests across many projects and reported that, when a PR contains comments, time-to-first-response alone explains most of the variance in total lifetime 5. Yu and colleagues, studying Travis-CI-integrated projects, found that integrator availability and early reviewer engagement dominate the latency model, with early @mention assignment showing a small but consistent negative effect on closing time 6. These findings map cleanly onto the theoretical picture: once the first review is fast, subsequent activity (discussion, revisions, CI, merge) is substantially compressed; once it’s slow, the rest follows.
Beyond individual PR latency, there is a structural cost in letting work-in-progress accumulate. For n open PRs touching related code, the number of potential interference pairs is n(n − 1)/2, so expected merge-conflict cost scales quadratically with WIP. This is a mathematical consequence of open branches coexisting, not an empirical claim, but it is the reason Reinertsen treats WIP constraints — borrowed from lean manufacturing and, in his account, from congestion control in the Internet — as a dominant lever in product development flow 4.
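The scaling is easy to make concrete. In the toy calculation below, the per-pair conflict probability is an arbitrary illustrative number, not an empirical one:

```python
# Pairwise interference among n coexisting open branches: n(n − 1) / 2 pairs.

def interference_pairs(n_open: int) -> int:
    """Number of distinct PR pairs that could conflict."""
    return n_open * (n_open - 1) // 2

P_CONFLICT = 0.05  # assumed probability that a given pair touches the same code

for n in (2, 5, 10, 20):
    pairs = interference_pairs(n)
    print(f"{n:2d} open PRs → {pairs:3d} pairs, ~{P_CONFLICT * pairs:.1f} expected conflicts")
```

Doubling WIP roughly quadruples the pair count, which is the whole point of keeping the open-PR queue short.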
The fourth lens is feedback control. A well-known principle from control engineering is that delayed feedback loops do not permit small corrections; they force either large corrections or none at all. Applied to code review, this explains a familiar pattern: feedback that arrives within minutes lets the author make surgical changes while the problem is still live; feedback that arrives the next day forces a choice between expensive rework and rationalizing the concern away. The loss is not just the extra work but the resolution of the correction — late feedback systematically selects for coarser interventions. Reinertsen connects this directly to the economics of variability: fast feedback is valuable because it enables discrimination between alternatives before the cost of change becomes prohibitive 4.
These four observations are largely independent. Any one of them would support the conclusion on its own; together, they make the case overdetermined.
Quantifying the classical world
Theory is cleaner with numbers, so the simulator below makes the argument concrete. It runs a Monte Carlo discrete-event simulation across three policies — immediate, hourly batch, and daily batch — with Poisson arrivals, exponential author-context decay with time-constant τ, and switch costs amortized within clusters. The parameter values are modeling choices, not empirical measurements, and the point of the simulator is to let you see how the comparison behaves across a plausible range.
Parameters
Monte Carlo 30 runs × 8-hour workday / Poisson arrivals / context rebuild cost C(w) = 30 × (1 − exp(−w / τ))
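A minimal version of the simulator can be sketched as follows. Everything numeric here — switch cost, per-PR review time, arrival rate — is an illustrative assumption in the spirit of the parameters above, and batching is approximated by clustering arrivals into fixed review windows; this is a sketch, not the exact simulator behind the figures quoted in this section:

```python
# Monte Carlo sketch: daily team cost under three review policies.
# All numeric values are illustrative modeling assumptions.
import math
import random

TAU = 60.0             # author context decay time-constant, minutes (assumed)
SWITCH_COST = 5.0      # reviewer cost per context switch, minutes (assumed)
REVIEW_TIME = 10.0     # review work per PR, minutes (assumed)
DAY = 8 * 60           # 8-hour workday, minutes
ARRIVAL_RATE = 1 / 45  # Poisson rate: one PR every ~45 minutes (assumed)

def context_cost(wait):
    """Author context-rebuild cost C(w) = 30 × (1 − exp(−w / τ))."""
    return 30.0 * (1.0 - math.exp(-wait / TAU))

def simulate_day(batch_interval, rng):
    """Total cost for one day; batch_interval=0 means immediate review."""
    arrivals, t = [], 0.0
    while True:                       # Poisson arrivals over the workday
        t += rng.expovariate(ARRIVAL_RATE)
        if t >= DAY:
            break
        arrivals.append(t)
    review = len(arrivals) * REVIEW_TIME
    if batch_interval == 0:           # immediate: one switch per PR, no decay
        return review + SWITCH_COST * len(arrivals)
    # batching: one switch per non-empty window, decay until window closes
    windows = {math.ceil(a / batch_interval) for a in arrivals}
    rebuild = sum(context_cost(math.ceil(a / batch_interval) * batch_interval - a)
                  for a in arrivals)
    return review + SWITCH_COST * len(windows) + rebuild

rng = random.Random(0)
for name, interval in [("immediate", 0), ("hourly", 60), ("daily", DAY)]:
    avg = sum(simulate_day(interval, rng) for _ in range(30)) / 30
    print(f"{name:9s} avg daily cost ≈ {avg:6.1f} min")
```

The structure mirrors the argument: review work is policy-independent, switch cost trades off mildly in batching’s favor, and the context-rebuild term is what separates the policies.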
At the defaults, immediate review is cheapest. The breakdown is the interesting part: review time is identical across policies (same PRs, same per-PR work), switch cost is slightly higher under immediate review because clusters don’t amortize as well, but author-context reconstruction is dramatically higher under daily batching and dominates the total. This is the theoretical story made visible.
The sensitivity analyses reinforce it. Lowering switch cost toward zero — realistic for modern developer environments with instant notifications and one-click review access — makes the gap overwhelming. Shrinking τ below 60 minutes, corresponding to shallower task commitment, widens the gap further. Increasing team size and PR rate brings in the Kingman effect: utilization rises, and batching degrades nonlinearly. For external calibration, Sadowski and colleagues reported a median end-to-end review latency of under 4 hours at Google 7, while Rigby and Bird reported approvals taking on the order of 15–18 hours in several Microsoft subprojects 8. The spread across organizations is large, and the simulator suggests why: small policy changes around first-response timing compound through multiple mechanisms.
The AI era changes the character of review
The preceding argument predates the widespread deployment of LLM-based coding agents. By 2025, however, tools such as Claude Code, Cursor, and GitHub Copilot Agent have shifted both sides of the equation, and it is tempting to assume AI reduces the urgency of human review. The argument below is that the opposite holds.
Quantitatively, four parameters move simultaneously. AI assistance raises author output — the arrival rate λ. Review-assist tools, including deployed systems such as Google’s AutoCommenter 9 and the CodeRabbit class of third-party tools, reduce per-PR review time. The magnitude is context-dependent: Collante and colleagues analyzed 25,473 GitHub pull requests and reported that GPT-assisted PRs had substantially shorter review duration and shorter waiting time before acceptance than non-assisted ones in their sample 10, though selection effects (GPT is often reached for on simpler tasks — code optimization, documentation, small bug fixes) make the headline numbers hard to apply uniformly. Field studies find that participants prefer AI-led reviews for large or unfamiliar PRs but remain conditional on code-base familiarity and trust calibration 11. The arrival-rate increase is generally larger than the review-time decrease, so utilization ρ rises rather than falls. AI-assisted authors also tend to produce arrivals in bursts, inflating C_a. Both effects push the Kingman wait time up, not down. Separately, the author’s context-decay time-constant τ plausibly shrinks when a PR was partially generated by an agent: the author’s mental model is shallower because some of the work was externalized to the model, and reconstructing the original intent later involves recovering both what the author meant and what the agent proposed.
The qualitative shift is more important than the quantitative one. Bacchelli and Bird, in their canonical study of modern code review at Microsoft, found that understanding — of the code and the change — is the key aspect of code review, and that finding defects is less central than practitioners expect 12. Reviews, in their account, do work that automated tools do not: knowledge transfer, team awareness, surfacing alternative solutions. Current AI review tools are good at the complementary task — surface-level correctness, style, and localized bugs — which suggests a natural division of labor but also a clear residual. As automation absorbs more of the former, the portion of the review that remains for humans is increasingly the portion Bacchelli and Bird identified as central: verifying that the change achieves the author’s actual intent.
Intent, however, is not externalized. The prompt given to the agent captures part of it; the author’s working-memory model captures the rest. Isik, Çaglar, and Tüzün, at FORGE 2025, manually labeled a statistically significant sample of pull requests from the Transformers repository and evaluated four modern LLMs (Llama-3.1-70B, Llama-3.1-405B, GPT-4o, and GPT-4o-mini) on detecting inconsistencies between the associated issue description and the submitted PR changes. Their conclusion was that the tested LLMs showed limited performance on this task — merged PRs can still contain missing or tangled changes that LLMs fail to flag, and the automation they evaluated was not yet at the level where it could replace human judgement on alignment 13. A complementary large-scale empirical study of 16 popular AI-based code review GitHub Actions, across roughly 22,000 review comments in 178 repositories, found that effectiveness varies widely and that comments with concrete code snippets and narrow scope are the most likely to result in code changes — the tools’ signal-to-noise ratio is uneven 14. The practical implication is pointed: intent verification remains a human task, at least for now, and that task is specifically the one that degrades with wait time, because the author’s private model of “what I meant” decays while the code is inert on the page. Correctness can be checked asynchronously; intent alignment, in the sense Bacchelli and Bird describe, realistically cannot.
There are also two newer failure modes worth naming, though they are more speculative than the above. The first is what I’ll call agent-authored PR backlog: when review is slow, agent-assisted PRs accumulate unmerged, and authors respond by generating more (the marginal cost of asking an agent to draft one is low). This turns the queueing problem into a positive-feedback loop and accelerates the WIP quadratic. The second is training-signal contamination: teams that use their own merged code as a retrieval or fine-tuning source for coding agents are, in effect, training on whatever gets merged — and if review is slow and rubber-stamped, the training signal degrades. Both effects are plausible and consistent with the theoretical framework, but are not yet well-studied empirically, so I flag them as conjectures rather than established results.
The net conclusion is that AI assistance strengthens the case for immediate review rather than weakening it: the quantitative parameters all move in the wrong direction for batching, and the qualitative residual work left to humans is precisely the wait-time-sensitive kind.
Quantifying the AI era
The extended simulator below adds three AI-specific parameters: the fraction of PRs that are AI-assisted, the acceleration factor by which those authors produce PRs faster, and the intent-misalignment rate that produces rework. AI-assisted PRs internally have τ halved and per-PR review time multiplied by 0.6; rework cost follows R(w) = 15 + 45 × (1 − exp(−w / 30)), a saturating form intended to encode the time-sensitivity of intent alignment. Again, the numeric choices are modeling assumptions; the simulator’s value is in letting you sweep the parameter space and observe the shape of the comparison.
Baseline
AI effects
Monte Carlo 30 runs × 8h / AI PRs: τ × 0.5, review time × 0.6 / rework: 15 + 45 × (1 − exp(−w / 30))
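The AI-era extension can be sketched per-PR rather than as a full event simulation; the policy comparison below draws waits uniformly over the batch window, and every constant (60% AI share, 20% misalignment, the cost curves) is a modeling assumption carried over from the parameters above:

```python
# Sketch of the AI-era cost model: AI-assisted PRs get τ × 0.5 and
# review time × 0.6; misaligned PRs add rework R(w) = 15 + 45 × (1 − exp(−w / 30)).
# All constants are illustrative modeling assumptions.
import math
import random

def rework_cost(wait):
    """Intent-alignment rework, saturating in wait time (minutes)."""
    return 15.0 + 45.0 * (1.0 - math.exp(-wait / 30.0))

def pr_cost(wait, ai_assisted, misaligned, tau=60.0, review=10.0):
    """Per-PR cost: review work + author context rebuild + optional rework."""
    if ai_assisted:
        tau *= 0.5       # shallower author model for agent-drafted code
        review *= 0.6    # AI review assist shortens per-PR review work
    cost = review + 30.0 * (1.0 - math.exp(-wait / tau))
    if misaligned:
        cost += rework_cost(wait)
    return cost

rng = random.Random(0)
for policy, window in [("immediate", 0.0), ("daily", 480.0)]:
    n, total = 10_000, 0.0
    for _ in range(n):
        wait = rng.uniform(0.0, window) if window else 0.0
        ai = rng.random() < 0.6            # assumed 60% AI share
        bad = ai and rng.random() < 0.2    # assumed 20% intent misalignment
        total += pr_cost(wait, ai, bad)
    print(f"{policy:9s} mean per-PR cost ≈ {total / n:5.1f} min")
```

Raising the AI share or the misalignment rate in this sketch moves only the non-immediate policies, which is the monotonic widening described below.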
At default AI-era settings (60% AI share, 2.5× speedup, 20% intent-misalignment rate), the gap between immediate and daily review is larger than in the classical case, and a fourth cost stack — intent-alignment rework — emerges visibly under daily batching. Sweeping the AI share from 0% to 100% shows the gap widening monotonically with AI adoption; sweeping the misalignment rate upward shows rework cost growing superlinearly under any non-immediate policy. The simulator does not prove the argument; the theory and the empirical studies do. What it provides is a concrete sense of how differently the policies score even under conservative parameter assumptions.
Practical guidance
The theoretical and empirical conclusions translate into a short list of concrete practices.
First, treat a PR review request as a first-class interrupt with an explicit response target. Teams vary, but a 10–30 minute target for starting the first review during working hours is consistent with the evidence on the compounding cost of delay. This is not about individual heroism; it’s about making the policy legible to the team and removing the ambiguity that leads to drift.
Second, make review requests maximally visible. GitHub-to-chat notification wiring, CODEOWNERS-based automatic assignment, and review-queue dashboards serve the same purpose: reducing the chance that a review request is effectively invisible. Yu and colleagues showed that early assignment correlates with shorter latency 6; mechanically, visibility is the same effect.
Third, make the human/AI split explicit. Assign AI reviewers — AutoCommenter-class tools 9, CodeRabbit, Claude Code’s review workflows, and the broader class of AI-review GitHub Actions 14 — responsibility for correctness, style, and surface-level bug-catching, and reserve human review for intent alignment, architectural fit, and implications for adjacent work. This division reflects the observed LLM weakness at intent verification 13 and the central role of understanding in human review 12.
Fourth, when PR first-response time drifts beyond the target, treat it as a structural signal rather than a motivational one. Reviewer availability, ownership boundaries, or team composition is probably misaligned with the flow the team is trying to sustain. Fixing the SLA is downstream of fixing those.
The broader claim behind all of this is that “protecting focus by deferring review” is, in most engineering organizations, an individual-optimum solution that actively harms the team’s aggregate throughput. The asymmetry of context cost, the superlinear behavior of queues under variance, the quadratic growth of WIP conflicts, the resolution cost of delayed feedback, and — in the AI era — the time-sensitive nature of intent verification all point in the same direction. The next time you consider which PR to look at first, the simpler rule wins: the one that just came in.
Footnotes
1. Parnin, C., & Rugaber, S. (2011). Resumption strategies for interrupted programming tasks. Software Quality Journal, 19(1), 5–34. doi.org/10.1007/s11219-010-9104-9
2. Little, J. D. C. (1961). A proof for the queuing formula: L = λW. Operations Research, 9(3), 383–387. doi.org/10.1287/opre.9.3.383
3. Kingman, J. F. C. (1961). The single server queue in heavy traffic. Mathematical Proceedings of the Cambridge Philosophical Society, 57(4), 902–904. doi.org/10.1017/S0305004100036094
4. Reinertsen, D. G. (2009). The Principles of Product Development Flow: Second Generation Lean Product Development. Celeritas Publishing.
5. Zhang, X., Yu, Y., Wang, T., Rastogi, A., & Wang, H. (2022). Pull request latency explained: an empirical overview. Empirical Software Engineering, 27(6), 126. doi.org/10.1007/s10664-022-10143-4
6. Yu, Y., Wang, H., Filkov, V., Devanbu, P., & Vasilescu, B. (2015). Wait for it: determinants of pull request evaluation latency on GitHub. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR) (pp. 367–371). IEEE.
7. Sadowski, C., Söderberg, E., Church, L., Sipko, M., & Bacchelli, A. (2018). Modern code review: a case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 181–190). doi.org/10.1145/3183519.3183525
8. Rigby, P. C., & Bird, C. (2013). Convergent contemporary software peer review practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE) (pp. 202–212). doi.org/10.1145/2491411.2491444
9. Vijayvergiya, M., Salawa, M., Budiselić, I., Zheng, D., Lamblin, P., Ivanković, M., Carin, J., Lewko, M., Andonov, J., Petrović, G., Tarlow, D., Maniatis, P., & Just, R. (2024). AI-Assisted Assessment of Coding Practices in Modern Code Review. arXiv:2405.13565
10. Collante, A., Abedu, S., Khatoonabadi, S., Abdellatif, A., Alor, E., & Shihab, E. (2025). The Impact of Large Language Models (LLMs) on Code Review Process. arXiv:2508.11034
11. Li, Z., et al. (2025). Rethinking Code Review Workflows with LLM Assistance: An Empirical Study. arXiv:2505.16339
12. Bacchelli, A., & Bird, C. (2013). Expectations, outcomes, and challenges of modern code review. In Proceedings of the 35th International Conference on Software Engineering (ICSE) (pp. 712–721). doi.org/10.1109/ICSE.2013.6606617
13. Isik, A. T., Çaglar, H. K., & Tüzün, E. (2025). Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests. In Proceedings of the 2nd IEEE/ACM International Conference on AI Foundation Models and Software Engineering (FORGE 2025).
14. Sun, K., Kuang, H., Baltes, S., Zhou, X., Zhang, H., Ma, X., Rong, G., Shao, D., & Treude, C. (2025). Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions. arXiv:2508.18771