OtherShrezzing 1 days ago [-]
>Code review is a fantastic mechanism for catching bugs and sharing knowledge
"Sharing knowledge" is one of the first phrases in the article, and highlighted as a key benefit of code review. But the loss to human-capital from this process is never examined in the post.
> Trivial reviews (typo fixes, small doc changes) cost 20 cents on average
They did around 25,000 of these runs (about 20% of total). So CF spent $5k in the period making language models run through PRs which were <10 lines long. I get that CF engineers are paid well, but the labour cost of having an intern/entry level engineer spend ~30-60s looking through these is likely close to $0.20, and that engineer builds some human-capital while they're at it.
alain94040 1 days ago [-]
> the labour cost of having an intern/entry level engineer spend ~30-60s looking through these is likely close to $0.20
Did you do the math? Your estimate feels way off. First, I doubt an intern would process one PR in 30s. Maybe 2-3 minutes, to read 10 lines carefully looking for typos and indentation mistakes. We pay interns close to $100K these days (in a company like Cloudflare), so that's ~80c/minute. My estimate is therefore closer to $1.60 per PR. About 10X.
You are correct that there is a residual value with the intern, over time they would start learning (a little bit) about the code base.
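A rough sketch of the arithmetic behind this estimate. The 2,080-hour work year (52 weeks of 40 hours) is an assumption; the comment does not say how the ~80c/minute figure was derived.

```python
# Back-of-the-envelope labour cost of a human pass over a trivial PR.
# HOURS_PER_YEAR is an assumption (52 weeks x 40 hours), not a figure from the thread.
HOURS_PER_YEAR = 52 * 40


def cost_per_minute(annual_salary: float) -> float:
    """Convert an annual salary into dollars per working minute."""
    return annual_salary / HOURS_PER_YEAR / 60


def cost_per_review(annual_salary: float, minutes: float) -> float:
    """Labour cost of spending `minutes` on a single review."""
    return cost_per_minute(annual_salary) * minutes


print(f"per minute: ${cost_per_minute(100_000):.2f}")
for minutes in (0.5, 1, 2, 3):
    print(f"{minutes} min per PR: ${cost_per_review(100_000, minutes):.2f}")
```

Under that assumption, 2 minutes comes to roughly $1.60 and 3 minutes to roughly $2.40, so "about 10X" the $0.20 figure sits in the middle of the 2-3 minute range.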
MeetingsBrowser 1 days ago [-]
$100k is probably at the top end. Even at Cloudflare, interns are more likely to be making $25-40/hour according to levels.fyi, which works out to $52-83k.
onlyrealcuzzo 1 days ago [-]
Well, AI costs are definitely going to go down at least 90% in the next ~18 months for the same quality of output (and probably 90% again in the 24 months after).
Are you sure it's going to make sense to pay someone to do that moving forward?
I don't think it's worth it now, by the way.
It's definitely not going to be worth it in the near future.
Can we even blink for $0.002? What happens when the next 90% increase in efficiency happens??
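For reference, the $0.002 figure appears to be the $0.20 trivial-review cost after two successive 90% drops. A small sketch of that compounding; the 90% rate is this comment's prediction, not an established number.

```python
def cost_after_drops(cost: float, drop: float, times: int) -> float:
    """Cost after applying the same fractional price drop `times` times."""
    return cost * (1 - drop) ** times


# 0.20 is the per-review cost quoted from the article; 0.90 is the predicted drop.
for drops in range(3):
    print(f"after {drops} drop(s) of 90%: ${cost_after_drops(0.20, 0.90, drops):.3f}")
```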
FuckButtons 1 days ago [-]
> Well, AI costs are definitely going to go down at least 90% in the next ~18 months for the same quality of output (and probably 90% again in the 24 months after)
As far as I can see, token costs have been steadily increasing over the past few months, so I’m not sure that buying the hype that another 90% cost reduction is just around the corner is warranted.
wetoastfood 21 hours ago [-]
Doesn’t seem like token costs, specifically, are increasing.
Opus cut its token pricing by 66% 6 months ago and it had previously been that higher price consistently for a year and a half (since that model launch).
GPT’s latest model is harder to track since it’s not named, but its pricing is in line with its history.
Not to mention what’s happening with other models like DeepSeek, GLM, and Kimi.
It seems to me the bigger change in costs is based on token appetite. People are discovering agentic capabilities are stronger than they used to be and use cases have broadened because of that. They’ll eventually discover too that these alternative models offer 95% of the intelligence at 20% of the price.
selicos 17 hours ago [-]
On local models that cost power (post initial hardware cost), makes sense. My work is building this out and I think it's solid. But until we can use our own hardware and local models the long term cost is a big question mark.
Zopieux 1 days ago [-]
Is the price reduction in the room with us right now?
fg137 1 days ago [-]
Would like to know where that 90% number comes from, and if it matches historical trend.
LLMs are so comically inefficient compared to the human brain that it is pretty easy to imagine this trend continuing for several more 90% drops.
If LeCun's JEPA or GRAM turn out to be a thing, we could see a 3-4 order of magnitude drop in a single release cycle / generation.
Keep in mind that performance per watt on the hardware side - at the same time - is still doubling every ~24 months - and this doesn't factor that in.
dingnuts 1 days ago [-]
[dead]
notawhitemale 24 hours ago [-]
[dead]
Dumbledumb 1 days ago [-]
This blog post is full of small inconsistencies that make it read like a low quality SEO piece.
> We also extract a shared context file (shared-mr-context.txt) from the coordinator's prompt and write it to disk. Sub-reviewers read this file instead of having the full MR context duplicated in each of their prompts. This was a deliberate decision, as duplicating even a moderately-sized MR context across seven concurrent reviewers would multiply our token costs by 7x.
No, it would not: the subagent's prompt is not 100% of its token usage, and the "shared-mr-context.txt" file that is then read does not have a size of zero compared to the context it was created from.
> You don't need seven concurrent AI agents burning Opus-tier tokens to review a one-line typo fix in a README.
Yeah, well you wouldn't have anyways. Earlier in the post it says that Opus is "exclusively for the Review Coordinator".
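A toy token count may make the first objection concrete. All sizes below are hypothetical, and the model assumes every sub-reviewer reads the whole shared file and that nothing is discounted by caching; it is a sketch of the argument, not a measurement of Cloudflare's system.

```python
def input_tokens(context: int, other: int, reviewers: int, read_overhead: int = 0) -> int:
    """Total input tokens when every reviewer ingests `context` plus `other` tokens."""
    return reviewers * (context + other + read_overhead)


# Hypothetical sizes, chosen only to illustrate the shape of the argument.
CONTEXT = 8_000   # MR context (diff, metadata)
OTHER = 12_000    # base prompt, tool output, files the reviewer opens on its own
REVIEWERS = 7

# Variant 1: the MR context is pasted into each sub-reviewer's prompt.
duplicated = input_tokens(CONTEXT, OTHER, REVIEWERS)
# Variant 2: each sub-reviewer reads the shared file; its contents still enter the
# reviewer's context, plus a small (assumed) overhead for the file-read tool call.
shared_file = input_tokens(CONTEXT, OTHER, REVIEWERS, read_overhead=50)
# Floor: a reviewer that never sees the MR context at all.
no_context = input_tokens(0, OTHER, REVIEWERS)

print(duplicated, shared_file, no_context)
print(f"duplicated vs. no-context floor: {duplicated / no_context:.2f}x")
```

With these made-up sizes the two variants cost about the same, and even against the no-context floor the duplicated variant is about 1.67x, not 7x.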
Dumbledumb 1 days ago [-]
Also this seems plain wrong. Input token caching has no idea whether you @include the file or copy the contents into the prompt. That is handled entirely by opencode and, all else being equal, has no bearing on the cacheability of a trace.
> Our cache hit rate sits at 85.7%, which saves us an estimated five figures compared to what we would pay at full input token pricing. This is partially thanks to the shared context file optimisation — sub-reviewers reading from a cached context file rather than each getting their own copy of the MR metadata, but also by using the exact same base prompts across all runs, across all merge requests.
fg137 1 days ago [-]
Looks like they need some sort of agent to review their blog posts as well.
thih9 1 days ago [-]
> Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents.
I’d prefer to have that happen as some sort of pre commit hook, before a merge request is sent. The feedback loop might be a bit faster and the process might produce less noise this way.
derwiki 1 days ago [-]
My company has the AI review agents, and you can run them locally, but practically it’s easier to just open a merge request to have CI run the agents. Especially if you’re juggling a bunch of merge requests.
rhgraysonii 1 days ago [-]
Valid, but you lose the lived history that comes with the audit log of it being actual review back and forth and CI runs, vs. lost to a developer's machine and only a relic in the commit log. I can see both sides, though.
Zanfa 1 days ago [-]
Can you elaborate about the practical value of having the history of back and forth, in a PR or even in the commit log? In my 20ish years of experience, I can’t recall a single instance where I’ve solved something thanks to having this work-in-progress state persisted in the repo history.
It’s exclusively been the other way around: having a smaller number of larger squashed commits (post merge) is what has made the project more maintainable.
SpicyLemonZest 1 days ago [-]
It's not about having it in the commit history. I've seen a few cases where the back and forth revealed that the AI reviewer was offering bad advice (and a few others where I suspect bad local AI advice is why people keep sending me the same category of mistake).
cush 1 days ago [-]
People usually squash merge anyways
krzyk 1 days ago [-]
Actually not, it is a debate similar to rebase vs. merge.
E.g. I don't squash; I prefer to see the full history, not a redacted one.
krzyk 1 days ago [-]
It is easier to view code review results in a tool and not in a text during commit.
There is no universal standard which IDEs support for code review results (there is SARIF, but it is not supported that widely). A review result on a web page with comments from humans, is valuable.
NiloCK 1 days ago [-]
Like it or not, the "merge request" (e.g., open a PR) is the Schelling point of relevant information. I expect that "at scale" here refers to the size of software projects, and not only code velocity. Software projects of large enough size have CI configurations that typically don't run in full on each dev machine.
Cthulhu_ 1 days ago [-]
I have mixed feelings, but it boils down to how long it takes and / or cost.
Pre-commit hooks should be fast, as it's something you'd do (normally) a few dozen times a year. I don't believe sending a review job to a remote agent is fast, nor will waiting for a review to finish a commit be good for anyone.
CI on the other hand can be slower and runs async, it's fire and forget so you can switch tasks.
If noise is an issue, one possible solution is to create a merge request, have the tools review it, make the fixes, rewrite history like you did it perfectly the first time ("fix" commits are noise), then create a new one for human review.
nullbio 1 days ago [-]
Few dozen times a year?
trollbridge 1 days ago [-]
You must have some pretty monster commits.
mock-possum 1 days ago [-]
Pre commit only happens on your machine though - you lose the ability to have a shared review surface where you can tag others on your team to specifically prompt discussion or verification on issues that touch their domain. When an agent points out a potential security issue with how my work ties into infrastructure, I want to just be able to tag our infra team and ask “hey is this something to worry about?” The agent, myself, and the other team member have now all contributed to a threaded discussion that is easily referencable in the future.
esafak 1 days ago [-]
You can and should run your reviews beforehand but the same reviews should run in CI too (just like with commit hooks) because reviews are nondeterministic and for verification (even if they were deterministic).
jellyfishbeaver 5 hours ago [-]
How do you all handle code review for projects that use specific frameworks or libraries? I write a lot of PySpark and the "flavour" of code is sort of different than traditional Python. AI code reviewers tend to nitpick conventions and common patterns in these libraries, so I find it not very helpful.
appplication 5 hours ago [-]
Also write a lot of PySpark, and the best I can say is to let the repo become its own style guide; you can enforce it on review with “make sure code is consistent with patterns and style in this module”, which seems to work well enough.
joshuamoyers 1 days ago [-]
we’ve been struggling with review throughput. this actually seems worthwhile to build at this point though i remain fairly skeptical of workflows that are agent-only, at a point it seems like the only practical solution.
we are finding lots of value in self review. it's the “imagine you are doing a synchronous paired review with someone - anything that is difficult to explain, has a code smell, doesn't fit the architecture of the system around you, write a comment.” then at the end, agents do a good job of looping over PR comments.
the second thing would be a guided, educational code review tool - there are a few attempts at this, but nothing that has a good enough interface to actually stick. organize hunks by semantic importance, spend some tokens exploring the surrounding systems, showing how new code, public apis and data model flow with the existing design, and allow a human to traverse larger PRs more quickly.
thank you to cloudflare for publishing this.
ramoz 1 days ago [-]
I do think Cloudflare probably institutes a similar manual review process as well. I have a handful of fairly vocal and supportive engineers I stay in contact with around https://plannotator.ai (there is an integrated code review surface that creates a feedback loop with your local agent).
> agents do a good job of looping over PR comments
This is the easy part. Most harnesses enable some sort of integration now, so you can actually create a smooth local experience around this as well - better code before it ships to more costly review or bloats PR threads.
> guided, educational code review tool
This is a bit tougher, and I find the main harness chat tends to work best. I learn better when I'm more engaged and aware of what I'm asking. It's easy to stick a code tour type of thing on a screen. It's hard to really nail the right attention and learning mechanism around it.
azuanrb 6 hours ago [-]
I’ve built something similar internally, but under the hood it’s mostly codex exec + Git worktrees. The main advantage over diff-only review is that it can walk the entire codebase, trace dependencies, and understand architectural or cross-system impacts instead of only looking at the changed files. The tradeoff is that it’s noticeably more expensive to run. I'm still experimenting on it but I quite like this approach so far.
rzmmm 1 days ago [-]
> The entire system also runs locally.
I think approaches like this don't need to run other than locally. Maybe integrated as pre-push hook. The system is nondeterministic, so it's at odds with the purpose of CI.
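A minimal sketch of what an advisory pre-push hook along these lines could look like. The `review-agent` command and its flags are hypothetical placeholders for whatever starts a local review harness; the time budget and the "never fail the push" policy are assumptions as well.

```python
#!/usr/bin/env python3
"""Advisory pre-push hook sketch: run a reviewer within a time budget, never block."""
import subprocess
import sys

# Hypothetical command and flags; substitute your own local review harness.
REVIEW_COMMAND = ["review-agent", "--diff", "origin/main...HEAD"]
TIME_BUDGET_SECONDS = 60


def run_advisory_review(command: list[str], timeout: float) -> str:
    """Run `command` and return its output, or a short note if it could not finish."""
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"review skipped: exceeded {timeout}s budget"
    except FileNotFoundError:
        return f"review skipped: {command[0]!r} not found"
    return done.stdout.strip() or done.stderr.strip() or "review produced no output"


def main() -> int:
    """Print the review to stderr and always report success to git."""
    print(run_advisory_review(REVIEW_COMMAND, TIME_BUDGET_SECONDS), file=sys.stderr)
    return 0  # advisory only: a nondeterministic review should not fail the push
```

Saved as an executable `.git/hooks/pre-push` that ends with `sys.exit(main())`, this would show the review output without ever rejecting a push, since git only aborts when the hook exits non-zero.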
proofofcontempt 1 days ago [-]
I'm not sure the people integrating it into CI process understand what CI is.
Scea91 1 days ago [-]
Same can be said about human review if the argument is non-determinism.
proofofcontempt 1 days ago [-]
Human review is about learning and there's an implied social contract in that someone is giving you their time to make you better. It isn't strictly necessary, but replacing it with AI shows a fundamental misunderstanding of why it is part of the process.
fhd2 1 days ago [-]
I'd argue it's pretty much like monitoring, which certainly benefits from multiple people seeing the same stats and alerts. I agree it's at odds with CI/CD and should probably not block anything, like deterministic checks commonly do.
krzyk 1 days ago [-]
It is starting the review during CI (CI just triggers the review), not blocking merges like failed build or lint failures.
rzmmm 1 days ago [-]
[dead]
new_account_101 1 days ago [-]
[dead]
plmpsu 1 days ago [-]
I built a more naive version for our team using Copilot and GitHub actions and it works quite well (wish I had metrics too). The team loves it.
The ROI here is so high that I don't mind using the strongest model available for the actual code review. I don't trust Sonnet and such. Just let Opus or GPT 5.5 do the whole thing and pay a bit more for less complexity.
krzyk 1 days ago [-]
I did similarly with copilot.
I have about 15 or so subagents doing reviews from different perspectives (or providing some additional value, like finding agents.md files, doing confidence ranking, describing images attached to the PR, that get validated later on with Jira issue description).
I have used it since about November, and it reached large-scale popularity in my company in April - all of that on 300 premium requests (because they allowed starting subagents, and there was no limit on how long a single request can last) - so it would cost something like $5000 and $8000 for April and May if it was API pricing. I had a similar cost per review (about $0.90) with Opus 4.6 and help from Sonnet and Haiku for simpler tasks. It did about 4000 reviews during the last 2 months.
And starting in June, it will be dead because it will be API pricing and for $30 (or $19 since September) it will do just a few reviews.
A fun project.
neebz 1 days ago [-]
do you also have separate prompts for each domain (security, architecture, etc.)?
would love to look into it if any part of it is open source
throwaway613746 1 days ago [-]
[dead]
afro88 24 hours ago [-]
> When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools and found that a lot of these tools worked pretty well, and a lot of them even offered a good amount of customisation and configurability! Unfortunately, though, the one recurring theme that kept coming up was that they just didn’t offer enough flexibility and customisation for an organisation the size of Cloudflare.
Most people I know had the experience that signal to noise was way off, regardless of scale. So it was a burden rather than a help. Code review by AI ended up being a skill before creating the PR so the dev owning the PR addressed everything before the team got bogged down with it in review
suika 1 days ago [-]
As a solo dev or rather nowadays more so only a decision maker / agent overseer, I came to enjoy letting my agents develop against a Gerrit repository / workflow. Dev agent pushes a CL, review agent picks it up (not just the diff, but the full repo), runs tests/reviews/review-subagents and concludes by posting a review as well as a vote. This goes back and forth with new patch sets / replies to the threads. Eventually the CL gets a +2 or whatever and I have the final call to manually submit it.
It is way slower compared to just pushing through development with one agent doing everything yolo against a normal repository, but it seems to me that the additional time is well spent (no, I don't have fancy graphs or similar analysis to prove this other than my gut feeling after looking at recent development results).
bob1029 1 days ago [-]
> One of the operational headaches we didn’t predict was that large, advanced models like Claude Opus 4.7 or GPT-5.4 can sometimes spend quite a while thinking through a problem, and to our users this can make it look exactly like a hung job.
I had the same problem in my recursive agent harness. It would always come back, but it could sometimes take up to 10 minutes depending. I fixed this by adding a required "purpose" argument to every tool and call/return event. As the recursive evaluation proceeds, every single thing that happens streams incremental purpose text to the user's browser (also using the magic of JSONL for this). The incremental progress events contain the purpose and a detail section (tool arg JSON) that the user can expand/collapse.
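One way the "required purpose" idea could look in code. This is a guess at the shape, not the commenter's harness: the `call_tool` wrapper, the event field names, and the stand-in `count_lines` tool are all hypothetical.

```python
import json
import sys
from typing import Any, Callable, TextIO


def call_tool(
    tool: Callable[..., Any],
    *,
    purpose: str,
    args: dict[str, Any],
    stream: TextIO = sys.stdout,
) -> Any:
    """Run `tool(**args)`, writing one JSONL progress event before and one after."""
    if not purpose.strip():
        raise ValueError("every tool call needs a non-empty purpose")

    def emit(event: str, detail: dict[str, Any]) -> None:
        # One JSON object per line, so a client can render events as they arrive.
        record = {"event": event, "purpose": purpose, "detail": detail}
        stream.write(json.dumps(record) + "\n")
        stream.flush()

    emit("call", {"tool": tool.__name__, "args": args})
    result = tool(**args)
    emit("return", {"tool": tool.__name__, "result": repr(result)})
    return result


def count_lines(text: str) -> int:
    """Stand-in tool used only for this example."""
    return len(text.splitlines())


call_tool(count_lines, purpose="Measure the size of the diff", args={"text": "a\nb\nc"})
```

Because `purpose` is keyword-only and validated, a caller cannot run a tool without also giving the user something readable to look at while a long-running job is still working.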
derwiki 1 days ago [-]
Nice trick! I am doing something similar but passing those incremental updates to Haiku for a short user-friendly message.
jmakov 1 days ago [-]
Every iteration something can be found. How many times do you iterate e.g. on performance - use optimized struct, oh, you can change the architecture etc.? At that point one can just have a while loop for the agents to make changes until no comments left.
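The loop described here could be sketched as below. The `review` and `fix` callables stand in for agent calls and are hypothetical, as is the toy reviewer; the `max_rounds` cap is an added assumption, since a reviewer that always finds something would otherwise never terminate.

```python
from typing import Callable


def review_until_clean(
    review: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], str],
    code: str,
    max_rounds: int = 5,
) -> tuple[str, list[str], int]:
    """Alternate review and fix until no comments remain or max_rounds is reached."""
    comments = review(code)
    rounds = 0
    while comments and rounds < max_rounds:
        code = fix(code, comments)
        comments = review(code)
        rounds += 1
    return code, comments, rounds


def toy_review(code: str) -> list[str]:
    """Toy reviewer: flags TODO markers and trailing whitespace."""
    comments = []
    if "TODO" in code:
        comments.append("remove TODO")
    if any(line != line.rstrip() for line in code.splitlines()):
        comments.append("strip trailing whitespace")
    return comments


def toy_fix(code: str, comments: list[str]) -> str:
    """Toy fixer: addresses only the first comment per round."""
    if comments[0] == "remove TODO":
        return code.replace("TODO", "")
    return "\n".join(line.rstrip() for line in code.splitlines())


final_code, remaining, rounds = review_until_clean(toy_review, toy_fix, "x = 1  # TODO \ny = 2")
print(repr(final_code), remaining, rounds)
```

Returning the leftover comments together with the round count lets the caller tell "clean" apart from "gave up after hitting the cap".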
etothet 1 days ago [-]
What’s the over/under on when Cloudflare will acquire OpenCode (and keep it open source)?
34qa123 1 days ago [-]
The suits have taken over Cloudflare. All buzzwords are on the bingo card: Using Bun, modeling agent roles after management, graphs, you name it.
They apparently think they need to cash in on AI by serving models and at the same time blocking scrapers. So they need to fuel the hype by pretending to use it.
This shows how the US economy is fundamentally broken: companies that provide a useful service (in theory, if you discount SSL MITM and turnstile gatekeeping) struggle, quasi-religious scams like OpenAI and Anthropic get funded by mentally ill Boomers and Gen-Xers.
merrvk 1 days ago [-]
PR reviews were never the bottleneck
nine_k 1 days ago [-]
Where I used to sit, they very often were.
krzyk 1 days ago [-]
Well, they are where I am. But LLM reviews don't solve that. They just add another perspective, which catches different issues.
merb 24 hours ago [-]
it would be cool if they would open source that. it would prob be helpful
faangguyindia 1 days ago [-]
what's best workflow for solo devs?
pramodbiligiri 1 days ago [-]
Based on the section about "Specialised agents" (https://blog.cloudflare.com/ai-code-review/#specialised-agen...), I'd say create a bunch of review prompts and run them against the code? The rest of the blog post seems to be the engineering around it: for scale, cost, team size etc.
criley2 1 days ago [-]
You can do basically the same thing as cloudflare except as a skill you run in your local harness. If you're going through the motions with PRs and are familiar with actions, you can have it run in a github action instead. But this is basically just a skill. The Claude code review skill is a simple version of exactly this.
See Chart 13 here: https://www.rdworldonline.com/ais-great-compression-20-chart...
See here: https://epoch.ai/data-insights/llm-inference-price-trends