Before Flowtriq engaged Lorikeet Security for a manual penetration test of their workflow automation platform, they did something an increasing number of well-run engineering teams are doing in 2026 — they ran a thorough AI-assisted secure code review against the entire application using Claude. They took the output seriously, fixed what came back, and only then asked us in.
That AI pass was not theatre. It identified, prioritized, and helped Flowtriq's engineers close a meaningful set of real, code-level vulnerabilities that would otherwise have ended up in our final report. Reflected and stored cross-site scripting in places where the templating layer had been bypassed. SQL injection in legacy query construction that pre-dated their migration to an ORM. A server-side template injection vector in an internal admin page. Weak hashing primitives still resident in an older service. None of these were trivial. The AI audit measurably reduced the application's attack surface before we ran our first request through Burp.
And then we still found five additional findings — two High, one Medium, two Low — across categories the AI was structurally unable to see. None of them were exotic. Every one of them was exploitable in production. This is a short case study about where AI security review ends, where active testing begins, and why the two reinforce each other rather than compete.
About this case study. Flowtriq is a real Lorikeet Security client. Specific finding titles, asset names, payloads, and exploitation paths have been omitted or generalized at their request. The shape of the engagement, the categories of findings, and the relative outcomes of the AI audit and the manual pentest are reported as-is, with the client's permission.
The Engagement
Flowtriq builds a multi-tenant SaaS platform: workflow automation tooling for mid-market operations teams. Their codebase is mature, a long-lived monolith (originally single-tenant) that has gradually been carved into a small set of services as their product surface has grown. They have a competent, security-conscious engineering team. They take threat modeling seriously and run static analysis in CI. Their CTO had been an early advocate for using AI assistants in development, and he was equally early to apply the same tooling to security work.
By the time they brought us in, their internal AI audit had been running for roughly three weeks. Engineers had worked through the prioritized output, opened pull requests, written regression tests, and merged fixes. They asked us to come in and validate: to run an external, manual penetration test against the same application, in the same staging environment the AI-driven review had just hardened, and to report on what was left.
This is the most useful possible posture for a client to bring to a pentest. We were not sweeping up obvious gaps. We were operating against an application that had already been actively defended.
What the AI Audit Got Right
It is worth being concrete about what the AI pass closed before we arrived, because the easy story to tell is that AI security review does not work, and the easy story is wrong.
An AI-assisted secure code review, done well, is exceptionally good at exactly the class of vulnerability that has dominated the OWASP Top 10 for two decades. It reads the entire codebase. It does not get bored. It is not biased toward the modules a human reviewer happens to be familiar with. It can hold a hundred files in mind at once and notice a single function that takes user input, fans it out into a SQL string, and sends it to the database without a parameterized query. It can spot a templating call where a developer reached for the unsafe variant when the safe one would have done. It can flag legacy crypto in a service that nobody on the current team remembers writing.
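To make that pattern concrete, here is a minimal illustration of the query-construction flaw described above. It is not Flowtriq's code; the table, column, and function names are invented, and the safe variant simply sketches what the refactor recommended by such a review typically looks like.

```python
"""Illustrative only: the unsafe variant builds the SQL string from user
input, the safe one hands the input to the driver as a bound parameter.
Table and column names are made up for this sketch."""
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, email: str):
    # User input is fanned out into the SQL string itself: injectable.
    return conn.execute(f"SELECT id FROM users WHERE email = '{email}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, email: str):
    # Parameterized query: the driver keeps data and SQL separate.
    return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchall()
```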
Flowtriq's AI audit found and closed all of these. Their engineers did the patient, valuable work of acting on the report — refactoring query construction, adding test coverage, ripping out the legacy hash, and shipping the fixes through their normal review process. By the time we ran our first authenticated session against the staging environment, the obvious code-level surface had genuinely been cleaned up.
That is real defensive value. It is also exactly the value AI code review is best positioned to deliver.
What the AI Audit Could Not See
The five findings that remained when we finished our active testing did not cluster in source. They clustered in the running system. None of them were code bugs in the traditional sense. All of them were exploitable.
The categories, in the same order they appeared in our findings summary, looked like this.
| # | Category | Severity | Where it lived |
|---|---|---|---|
| 1 | Session Management | High | Sensitive endpoint without enforced request budgets |
| 2 | Session Management | High | Anti-forgery token validation observable only at runtime |
| 3 | Cryptography | Medium | Outdated transport-layer protocol still negotiable on the listener |
| 4 | Information Disclosure | Low | Operational artifacts left in the public document root |
| 5 | Security Misconfiguration | Low | Incomplete browser-side response headers across paths |
Look at the rightmost column. Every one of those findings lived somewhere an AI reading the source tree could not look.
The session management findings required exercising the endpoints
The two High-severity findings both sat in session management — and both were invisible from source. The first was a sensitive endpoint that, on paper, called the right validation helpers and produced the right responses, but which had no enforced ceiling on how many times a single client could hit it inside a short window. You cannot see that from reading code. You can only see it by hitting the endpoint at speed, watching what happens, and noticing that nothing happens — no throttling, no challenge, no progressive backoff.
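To make that concrete, here is a minimal sketch of the kind of burst probe a tester runs against a sensitive endpoint. The URL, payload, and request count are hypothetical placeholders, not Flowtriq's actual endpoint; the point is that the absence of throttling only shows up when you exercise the endpoint at speed and count what comes back.

```python
"""Minimal sketch of a rate-limit probe against a hypothetical endpoint."""
import time
import requests

TARGET = "https://staging.example.com/api/sensitive-action"  # placeholder
ATTEMPTS = 100

def probe_rate_limit(url: str, attempts: int) -> None:
    session = requests.Session()
    statuses: dict[int, int] = {}
    start = time.monotonic()
    for _ in range(attempts):
        resp = session.post(url, json={"probe": True}, timeout=5)
        statuses[resp.status_code] = statuses.get(resp.status_code, 0) + 1
    elapsed = time.monotonic() - start
    print(f"{attempts} requests in {elapsed:.1f}s -> {statuses}")
    # A guarded endpoint should start throttling (429s, challenges, delays)
    # well before the burst completes; a flat run of 200s is the finding.
    if 429 not in statuses:
        print("No throttling observed inside the burst window.")

if __name__ == "__main__":
    probe_rate_limit(TARGET, ATTEMPTS)
```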
The second was an anti-forgery token validation behavior that looked correct in source. The token was generated, attached, and checked. What was not visible in source was the exact set of conditions under which the runtime considered a missing or malformed token to be acceptable rather than rejecting the request. Surfacing that required a manual tester replaying requests with the token absent, with the token altered, with the token from a different session, and observing — under each variation — what the server actually did. The bug was in the gap between "the validation function exists and is called" and "the validation function rejects every variation that it should." AI source review confirms the first. Only an active tester can confirm the second.
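A minimal sketch of that token-mutation pass might look like the following, with a hypothetical endpoint, header name, and token values standing in for the real ones.

```python
"""Sketch of anti-forgery token mutation testing. The endpoint, header
name, and token strings are placeholders; the technique is replaying the
same state-changing request under several token conditions and recording
what the server actually does with each one."""
import requests

TARGET = "https://staging.example.com/api/update-profile"  # placeholder
VALID_TOKEN = "token-from-a-live-session"                  # placeholder

variations = {
    "valid token":              {"X-CSRF-Token": VALID_TOKEN},
    "token absent":             {},
    "token altered":            {"X-CSRF-Token": VALID_TOKEN[:-4] + "0000"},
    "token from other session": {"X-CSRF-Token": "token-minted-elsewhere"},
}

for label, headers in variations.items():
    resp = requests.post(TARGET, headers=headers, json={"name": "probe"}, timeout=5)
    # Anything other than a rejection for the three broken variations is a finding.
    print(f"{label:28s} -> {resp.status_code}")
```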
The cryptography finding required probing the deployed listener
The Medium-severity transport-layer finding was about which protocol versions and cipher suites the production listener actually negotiated when a client offered them. The codebase did not configure this. The application server inherited it from a system-level TLS profile, which was set in an Ansible role, which was last touched eighteen months ago, which still permitted a protocol version that current standards consider deprecated. There is nothing about this that would be visible to an AI reading application source. It is a property of the deployed listener, surfaced only by speaking TLS at it.
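A sketch of that kind of probe, with a placeholder host, looks like this. Nothing in it reads application source; it simply asks the deployed listener to negotiate each TLS version in turn and records which ones it accepts.

```python
"""Sketch of a transport-layer probe against a placeholder host."""
import socket
import ssl

HOST, PORT = "staging.example.com", 443  # placeholder target

def negotiates(version: ssl.TLSVersion) -> bool:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # only the handshake matters here
    ctx.minimum_version = version
    ctx.maximum_version = version
    try:
        with socket.create_connection((HOST, PORT), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST):
                return True
    except (ssl.SSLError, OSError):
        return False

for version in (ssl.TLSVersion.TLSv1, ssl.TLSVersion.TLSv1_1,
                ssl.TLSVersion.TLSv1_2, ssl.TLSVersion.TLSv1_3):
    # Acceptance of a deprecated version on a production listener is the finding.
    print(f"{version.name}: {'accepted' if negotiates(version) else 'refused'}")
```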
The information disclosure finding lived on the filesystem, not in the code
The Low-severity disclosure finding was a pair of operational artifacts that an engineer had placed in the public document root during an incident weeks earlier and forgotten to clean up. They were not referenced from anywhere in the application. They were not committed to the repository. They were just files, sitting where the web server would happily serve them to anyone who guessed the path. An AI auditing the codebase has no way to know they exist. A manual tester scanning the document root for common artifact patterns finds them in the first few minutes.
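A minimal sketch of that sweep, with an invented base URL and an illustrative path list, looks like this. The file names are generic examples of the pattern, not the artifacts found at Flowtriq.

```python
"""Sketch of a document-root sweep for forgotten operational artifacts."""
import requests

BASE = "https://staging.example.com"  # placeholder
CANDIDATE_PATHS = [
    "/dump.sql", "/backup.tar.gz", "/debug.log",
    "/.env", "/config.bak", "/error.log.1",
]

for path in CANDIDATE_PATHS:
    resp = requests.get(BASE + path, timeout=5, allow_redirects=False)
    if resp.status_code == 200:
        # Anything that answers 200 here is content nobody meant to publish.
        print(f"exposed: {path} ({len(resp.content)} bytes)")
```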
The security headers finding required inspecting actual responses
The remaining Low — incomplete browser-side response headers — is the kind of finding that looks small in isolation and matters a great deal in aggregate. Several response paths were missing headers that would have hardened the browser-side surface against a cluster of post-exploitation techniques. Some headers were present on the main application but absent on a subdomain. Some were present in production but absent in a staging-style edge case that ended up routable. None of this was visible from source — the headers came from a reverse proxy configuration, with conditional logic that no one had reviewed end-to-end in some time.
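A sketch of that inspection, with placeholder hostnames and a representative header list, looks like this. The values it reports come from the reverse proxy at runtime, which is exactly why source review cannot see them.

```python
"""Sketch of checking hardening headers across several routable surfaces."""
import requests

SURFACES = [
    "https://app.example.com/",      # main application (placeholder)
    "https://admin.example.com/",    # subdomain (placeholder)
    "https://staging.example.com/",  # edge case that ended up routable
]
EXPECTED = [
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
]

for url in SURFACES:
    resp = requests.get(url, timeout=5)
    missing = [h for h in EXPECTED if h not in resp.headers]
    print(f"{url}: missing {missing or 'nothing'}")
```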
Why the Categories Tell the Story
Pull back from the specifics. The five findings the AI did not catch fell into four categories — session management, transport cryptography, information disclosure, security misconfiguration — and they share a single property: none of them are properties of the source code. They are properties of the running system, the deployed infrastructure, the file layout on disk, the response headers from the reverse proxy, and the behavior of validation logic under conditions that only manifest at runtime.
An AI doing secure code review can read every line of the codebase. It cannot:
- Send a hundred requests per second at an endpoint and watch what the server does
- Hold a session token across requests, mutate it deliberately, and observe how the validator reacts to each mutation
- Speak TLS to the production listener and enumerate which protocol versions it agrees to
- List the contents of the public document root and notice files that should not be there
- Inspect HTTP response headers across paths, subdomains, and edge-case routing
- Reason about behavior produced by configuration that lives outside the repository — reverse proxy rules, Ansible roles, Terraform, container base images, kernel settings
This is not a criticism of AI code review. This is a description of its surface area. AI code review is bounded by source. Active penetration testing is bounded by what is reachable on the wire. The two surfaces overlap meaningfully, but neither contains the other.
The compounding effect. Because Flowtriq closed the obvious code-level surface before we arrived, the engagement hours that would have gone to documenting and reproducing those issues went instead to active runtime testing — which is exactly where the residual risk lived. They paid for both the AI audit and the manual pentest, and got more total coverage at the same total cost than they would have from either approach alone. This is the right way to budget for security review in 2026.
The Outcome
Flowtriq's engineering team triaged all five findings within forty-eight hours of the report being delivered. The two High-severity issues were patched first — the rate-limit gap remediated with a token-bucket guard at the edge, the anti-forgery validation tightened so that every failure mode rejected the request rather than logging and continuing. The TLS profile on the production load balancer was updated to drop the deprecated protocol. The forgotten artifacts were removed from the document root and added to a deploy-time scanner that will catch the same pattern next time. The header gaps were closed at the reverse proxy with a centralized configuration that applies uniformly across paths and subdomains.
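For readers who want the remediation concrete, here is an illustrative token-bucket limiter. It is not Flowtriq's edge implementation, which sits in front of the application rather than inside it, and the capacity and refill rate are arbitrary; it simply shows the shape of the guard that closed the rate-limit gap.

```python
"""Illustrative token-bucket limiter; parameters are arbitrary."""
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should answer 429 or challenge here

# One bucket per client key (IP, session, or account) enforces the ceiling
# the original endpoint lacked.
```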
We re-tested two weeks later. Every finding closed. No regressions. No new findings introduced by the fixes.
"I used Lorikeet for a PTaaS pentest and briefly tried out their ASM tool, which was amazing. I appreciate the fast tests and the accuracy of the findings. We came in thinking our AI audit had probably caught most of what mattered, and the report made us realize it had caught most of what mattered in the source tree — the runtime and infrastructure were a whole second surface area we hadn't actually tested. Their team was super helpful, everything ran through a modern interface, and the white-glove touch is impressive. A 10/10 experience."
— Jacob M., Founder, Flowtriq (verified G2 review, 4/23/2026)

The Lessons That Generalize
Flowtriq's engagement is a small case, not a large one — five findings in a single application, against a single client, in a single quarter. But the pattern it surfaces is generalizable, and we are seeing it across our 2026 client base.
AI code review is real defensive infrastructure. It catches what it is good at catching, at scale, faster than any human review could. Teams that deploy it well are shipping more secure code, sooner, than teams that do not. This is not a fad and it is not a marketing line. The XSS, SQL injection, template injection, and weak-crypto findings the Flowtriq AI audit closed were findings that, three years ago, would have shown up in the manual pentest report. They will show up less often from here forward. That is good news.
The categories that remain are not the categories AI is best at. Session management edge cases, runtime TLS posture, file-system hygiene on production servers, and the configuration of every system that sits between your application and the wire — all of these continue to require active probing. The arrival of AI in the secure development cycle has not made these categories smaller; if anything, by closing the noisier source-level findings, it has made the runtime findings more visible.
Budget for both, in that order. The most efficient security cycle we are observing in well-run 2026 engineering organisations is: continuous AI-assisted code review during development, followed by periodic manual penetration testing against the deployed system. The AI pass acts as a force multiplier on the pentest — it strips out the code-level findings so the human testers can spend their hours where humans are still uniquely effective. Both stages are necessary. Neither is sufficient.
Flowtriq did this well, and they got a stronger security posture out of it than they would have gotten from either stage alone. That is the case study.
Already Run an AI Security Audit? Validate It With a Manual Pentest.
If your engineering team has done the work of running a Claude-driven, Cursor-driven, or Copilot-driven secure code review on your application, you have already closed the easiest part of the surface. The runtime, infrastructure, and configuration findings that remain are exactly what Lorikeet Security's manual penetration tests are built to surface. Get in touch and we will scope an engagement against the application your AI audit has already hardened.