Before Flowtriq engaged Lorikeet Security for a manual penetration test of their workflow automation platform, they did something an increasing number of well-run engineering teams are doing in 2026 — they ran a thorough AI-assisted secure code review against the entire application using Claude. They took the output seriously, fixed what came back, and only then asked us in.
That AI pass was not theatre. It identified, prioritized, and helped Flowtriq's engineers close a meaningful set of real, code-level vulnerabilities that would otherwise have ended up in our final report. Reflected and stored cross-site scripting in places where the templating layer had been bypassed. SQL injection in legacy query construction that pre-dated their migration to an ORM. A server-side template injection vector in an internal admin page. Weak hashing primitives still resident in an older service. None of these were trivial. The AI audit measurably reduced the application's attack surface before we ran our first request through Burp.
And then we still found five additional findings — two High, one Medium, two Low — across categories the AI was structurally unable to see. None of them were exotic. Every one of them was exploitable in production. This is a short case study about where AI security review ends, where active testing begins, and why the two reinforce each other rather than compete.
About this case study. Flowtriq is a real Lorikeet Security client. Specific finding titles, asset names, payloads, and exploitation paths have been omitted or generalized at their request. The shape of the engagement, the categories of findings, and the relative outcomes of the AI audit and the manual pentest are reported as-is, with the client's permission.
The Engagement
Flowtriq builds a multi-tenant SaaS platform: workflow automation tooling for mid-market operations teams. Their codebase is mature, a long-lived monolith (originally single-tenant) that has gradually been carved into a small set of services as their product surface has grown. They have a competent, security-conscious engineering team. They take threat modeling seriously and run static analysis in CI. Their CTO had been an early advocate for using AI assistants in development, and he was equally early to apply the same tooling to security work.
By the time they brought us in, their internal AI audit had been running for roughly three weeks. Engineers had worked through the prioritized output, opened pull requests, written regression tests, and merged fixes. They asked us to come in and validate: to run an external, manual penetration test against the same application, in the same staging environment the AI-driven review had just hardened, and to report on what was left.
This is the most useful possible posture for a client to bring to a pentest. We were not sweeping up obvious gaps. We were operating against an application that had already been actively defended.
What the AI Audit Got Right
It is worth being concrete about what the AI pass closed before we arrived, because the easy story to tell is that AI security review does not work, and the easy story is wrong.
An AI-assisted secure code review, done well, is exceptionally good at exactly the class of vulnerability that has dominated the OWASP Top 10 for two decades. It reads the entire codebase. It does not get bored. It is not biased toward the modules a human reviewer happens to be familiar with. It can hold a hundred files in mind at once and notice a single function that takes user input, fans it out into a SQL string, and sends it to the database without a parameterized query. It can spot a templating call where a developer reached for the unsafe variant when the safe one would have done. It can flag legacy crypto in a service that nobody on the current team remembers writing.
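To make that pattern concrete, here is a minimal illustration of the query-construction flaw described above. It is not Flowtriq's code; the table, column, and function names are invented, and the safe variant simply sketches what the refactor recommended by such a review typically looks like.

```python
"""Illustrative only: the unsafe variant builds the SQL string from user
input, the safe one hands the input to the driver as a bound parameter.
Table and column names are made up for this sketch."""
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, email: str):
    # User input is fanned out into the SQL string itself: injectable.
    return conn.execute(f"SELECT id FROM users WHERE email = '{email}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, email: str):
    # Parameterized query: the driver keeps data and SQL separate.
    return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchall()
```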
Flowtriq's AI audit found and closed all of these. Their engineers did the patient, valuable work of acting on the report — refactoring query construction, adding test coverage, ripping out the legacy hash, and shipping the fixes through their normal review process. By the time we ran our first authenticated session against the staging environment, the obvious code-level surface had genuinely been cleaned up.
That is real defensive value. It is also exactly the value AI code review is best positioned to deliver.
What the AI Audit Could Not See
The five findings that remained when we finished our active testing did not cluster in source. They clustered in the running system. None of them were code bugs in the traditional sense. All of them were exploitable.
The categories, in the same order they appeared in our findings summary, looked like this.
| # | Category | Severity | Where it lived |
|---|---|---|---|
| 1 | Session Management | High | Sensitive endpoint without enforced request budgets |
| 2 | Session Management | High | Anti-forgery token validation observable only at runtime |
| 3 | Cryptography | Medium | Outdated transport-layer protocol still negotiable on the listener |
| 4 | Information Disclosure | Low | Operational artifacts left in the public document root |
| 5 | Security Misconfiguration | Low | Incomplete browser-side response headers across paths |
Look at the rightmost column. Every one of those findings lived somewhere an AI reading the source tree could not look.
The session management findings required exercising the endpoints
The two High-severity findings both sat in session management — and both were invisible from source. The first was a sensitive endpoint that, on paper, called the right validation helpers and produced the right responses, but which had no enforced ceiling on how many times a single client could hit it inside a short window. You cannot see that from reading code. You can only see it by hitting the endpoint at speed, watching what happens, and noticing that nothing happens — no throttling, no challenge, no progressive backoff.
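To make that concrete, here is a minimal sketch of the kind of burst probe a tester runs against a sensitive endpoint. The URL, payload, and request count are hypothetical placeholders, not Flowtriq's actual endpoint; the point is that the absence of throttling only shows up when you exercise the endpoint at speed and count what comes back.

```python
"""Minimal sketch of a rate-limit probe against a hypothetical endpoint."""
import time
import requests

TARGET = "https://staging.example.com/api/sensitive-action"  # placeholder
ATTEMPTS = 100

def probe_rate_limit(url: str, attempts: int) -> None:
    session = requests.Session()
    statuses: dict[int, int] = {}
    start = time.monotonic()
    for _ in range(attempts):
        resp = session.post(url, json={"probe": True}, timeout=5)
        statuses[resp.status_code] = statuses.get(resp.status_code, 0) + 1
    elapsed = time.monotonic() - start
    print(f"{attempts} requests in {elapsed:.1f}s -> {statuses}")
    # A guarded endpoint should start throttling (429s, challenges, delays)
    # well before the burst completes; a flat run of 200s is the finding.
    if 429 not in statuses:
        print("No throttling observed inside the burst window.")

if __name__ == "__main__":
    probe_rate_limit(TARGET, ATTEMPTS)
```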
The second was an anti-forgery token validation behavior that looked correct in source. The token was generated, attached, and checked. What was not visible in source was the exact set of conditions under which the runtime considered a missing or malformed token to be acceptable rather than rejecting the request. Surfacing that required a manual tester replaying requests with the token absent, with the token altered, with the token from a different session, and observing — under each variation — what the server actually did. The bug was in the gap between "the validation function exists and is called" and "the validation function rejects every variation that it should." AI source review confirms the first. Only an active tester can confirm the second.
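A minimal sketch of that token-mutation pass might look like the following, with a hypothetical endpoint, header name, and token values standing in for the real ones.

```python
"""Sketch of anti-forgery token mutation testing. The endpoint, header
name, and token strings are placeholders; the technique is replaying the
same state-changing request under several token conditions and recording
what the server actually does with each one."""
import requests

TARGET = "https://staging.example.com/api/update-profile"  # placeholder
VALID_TOKEN = "token-from-a-live-session"                  # placeholder

variations = {
    "valid token":              {"X-CSRF-Token": VALID_TOKEN},
    "token absent":             {},
    "token altered":            {"X-CSRF-Token": VALID_TOKEN[:-4] + "0000"},
    "token from other session": {"X-CSRF-Token": "token-minted-elsewhere"},
}

for label, headers in variations.items():
    resp = requests.post(TARGET, headers=headers, json={"name": "probe"}, timeout=5)
    # Anything other than a rejection for the three broken variations is a finding.
    print(f"{label:28s} -> {resp.status_code}")
```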
The cryptography finding required probing the deployed listener
The Medium-severity transport-layer finding was about which protocol versions and cipher suites the production listener actually negotiated when a client offered them. The codebase did not configure this. The application server inherited it from a system-level TLS profile, which was set in an Ansible role, which was last touched eighteen months ago, which still permitted a protocol version that current standards consider deprecated. There is nothing about this that would be visible to an AI reading application source. It is a property of the deployed listener, surfaced only by speaking TLS at it.
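A sketch of that kind of probe, with a placeholder host, looks like this. Nothing in it reads application source; it simply asks the deployed listener to negotiate each TLS version in turn and records which ones it accepts.

```python
"""Sketch of a transport-layer probe against a placeholder host."""
import socket
import ssl

HOST, PORT = "staging.example.com", 443  # placeholder target

def negotiates(version: ssl.TLSVersion) -> bool:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # only the handshake matters here
    ctx.minimum_version = version
    ctx.maximum_version = version
    try:
        with socket.create_connection((HOST, PORT), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST):
                return True
    except (ssl.SSLError, OSError):
        return False

for version in (ssl.TLSVersion.TLSv1, ssl.TLSVersion.TLSv1_1,
                ssl.TLSVersion.TLSv1_2, ssl.TLSVersion.TLSv1_3):
    # Acceptance of a deprecated version on a production listener is the finding.
    print(f"{version.name}: {'accepted' if negotiates(version) else 'refused'}")
```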
The information disclosure finding lived on the filesystem, not in the code
The Low-severity disclosure finding was a pair of operational artifacts that an engineer had placed in the public document root during an incident weeks earlier and forgotten to clean up. They were not referenced from anywhere in the application. They were not committed to the repository. They were just files, sitting where the web server would happily serve them to anyone who guessed the path. An AI auditing the codebase has no way to know they exist. A manual tester scanning the document root for common artifact patterns finds them in the first few minutes.
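A minimal sketch of that sweep, with an invented base URL and an illustrative path list, looks like this. The file names are generic examples of the pattern, not the artifacts found at Flowtriq.

```python
"""Sketch of a document-root sweep for forgotten operational artifacts."""
import requests

BASE = "https://staging.example.com"  # placeholder
CANDIDATE_PATHS = [
    "/dump.sql", "/backup.tar.gz", "/debug.log",
    "/.env", "/config.bak", "/error.log.1",
]

for path in CANDIDATE_PATHS:
    resp = requests.get(BASE + path, timeout=5, allow_redirects=False)
    if resp.status_code == 200:
        # Anything that answers 200 here is content nobody meant to publish.
        print(f"exposed: {path} ({len(resp.content)} bytes)")
```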
The security headers finding required inspecting actual responses
The remaining Low — incomplete browser-side response headers — is the kind of finding that looks small in isolation and matters a great deal in aggregate. Several response paths were missing headers that would have hardened the browser-side surface against a cluster of post-exploitation techniques. Some headers were present on the main application but absent on a subdomain. Some were present in production but absent in a staging-style edge case that ended up routable. None of this was visible from source — the headers came from a reverse proxy configuration, with conditional logic that no one had reviewed end-to-end in some time.
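A sketch of that inspection, with placeholder hostnames and a representative header list, looks like this. The values it reports come from the reverse proxy at runtime, which is exactly why source review cannot see them.

```python
"""Sketch of checking hardening headers across several routable surfaces."""
import requests

SURFACES = [
    "https://app.example.com/",      # main application (placeholder)
    "https://admin.example.com/",    # subdomain (placeholder)
    "https://staging.example.com/",  # edge case that ended up routable
]
EXPECTED = [
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
]

for url in SURFACES:
    resp = requests.get(url, timeout=5)
    missing = [h for h in EXPECTED if h not in resp.headers]
    print(f"{url}: missing {missing or 'nothing'}")
```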
Why the Categories Tell the Story
Pull back from the specifics. The five findings the AI did not catch fell into four categories — session management, transport cryptography, information disclosure, security misconfiguration — and they share a single property: none of them are properties of the source code. They are properties of the running system, the deployed infrastructure, the file layout on disk, the response headers from the reverse proxy, and the behavior of validation logic under conditions that only manifest at runtime.
An AI doing secure code review can read every line of the codebase. It cannot:
- Send a hundred requests per second at an endpoint and watch what the server does
- Hold a session token across requests, mutate it deliberately, and observe how the validator reacts to each mutation
- Speak TLS to the production listener and enumerate which protocol versions it agrees to
- List the contents of the public document root and notice files that should not be there
- Inspect HTTP response headers across paths, subdomains, and edge-case routing
- Reason about behavior produced by configuration that lives outside the repository — reverse proxy rules, Ansible roles, Terraform, container base images, kernel settings
This is not a criticism of AI code review. This is a description of its surface area. AI code review is bounded by source. Active penetration testing is bounded by what is reachable on the wire. The two surfaces overlap meaningfully, but neither contains the other.
The compounding effect. Because Flowtriq closed the obvious code-level surface before we arrived, the engagement hours that would have gone to documenting and reproducing those issues went instead to active runtime testing — which is exactly where the residual risk lived. They paid for both the AI audit and the manual pentest, and got more total coverage at the same total cost than they would have from either approach alone. This is the right way to budget for security review in 2026.
The Outcome
Flowtriq's engineering team triaged all five findings within forty-eight hours of the report being delivered. The two High-severity issues were patched first — the rate-limit gap remediated with a token-bucket guard at the edge, the anti-forgery validation tightened so that every failure mode rejected the request rather than logging and continuing. The TLS profile on the production load balancer was updated to drop the deprecated protocol. The forgotten artifacts were removed from the document root and added to a deploy-time scanner that will catch the same pattern next time. The header gaps were closed at the reverse proxy with a centralized configuration that applies uniformly across paths and subdomains.
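For readers who want the remediation concrete, here is an illustrative token-bucket limiter. It is not Flowtriq's edge implementation, which sits in front of the application rather than inside it, and the capacity and refill rate are arbitrary; it simply shows the shape of the guard that closed the rate-limit gap.

```python
"""Illustrative token-bucket limiter; parameters are arbitrary."""
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should answer 429 or challenge here

# One bucket per client key (IP, session, or account) enforces the ceiling
# the original endpoint lacked.
```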
We re-tested two weeks later. Every finding closed. No regressions. No new findings introduced by the fixes.
"I used Lorikeet for a PTaaS pentest and briefly tried out their ASM tool, which was amazing. I appreciate the fast tests and the accuracy of the findings. We came in thinking our AI audit had probably caught most of what mattered, and the report made us realize it had caught most of what mattered in the source tree — the runtime and infrastructure were a whole second surface area we hadn't actually tested. Their team was super helpful, everything ran through a modern interface, and the white-glove touch is impressive. A 10/10 experience."
— Jacob M., Founder, Flowtriq (verified G2 review, 4/23/2026)

The Lessons That Generalize
Flowtriq's engagement is a small case, not a large one — five findings in a single application, against a single client, in a single quarter. But the pattern it surfaces is generalizable, and we are seeing it across our 2026 client base.
AI code review is real defensive infrastructure. It catches what it is good at catching, at scale, faster than any human review could. Teams that deploy it well are shipping more secure code, sooner, than teams that do not. This is not a fad and it is not a marketing line. The XSS, SQL injection, template injection, and weak-crypto findings the Flowtriq AI audit closed were findings that, three years ago, would have shown up in the manual pentest report. They will show up less often from here forward. That is good news.
The categories that remain are not the categories AI is best at. Session management edge cases, runtime TLS posture, file-system hygiene on production servers, and the configuration of every system that sits between your application and the wire — all of these continue to require active probing. The arrival of AI in the secure development cycle has not made these categories smaller; if anything, by closing the noisier source-level findings, it has made the runtime findings more visible.
Budget for both, in that order. The most efficient security cycle we are observing in well-run 2026 engineering organisations is: continuous AI-assisted code review during development, followed by periodic manual penetration testing against the deployed system. The AI pass acts as a force multiplier on the pentest — it strips out the code-level findings so the human testers can spend their hours where humans are still uniquely effective. Both stages are necessary. Neither is sufficient.
Flowtriq did this well, and they got a stronger security posture out of it than they would have gotten from either stage alone. That is the case study.
Already Run an AI Security Audit? Validate It With a Manual Pentest.
If your engineering team has done the work of running a Claude-driven, Cursor-driven, or Copilot-driven secure code review on your application, you have already closed the easiest part of the surface. The runtime, infrastructure, and configuration findings that remain are exactly what Lorikeet Security's manual penetration tests are built to surface. Get in touch and we will scope an engagement against the application your AI audit has already hardened.