AI code review tools promise to catch bugs and security issues automatically, right inside your pull request workflow. GitHub Copilot now does code review. Amazon CodeGuru analyzes code for defects and security issues. Dozens of startups, from Korbit AI and CodeRabbit to Qodo and Sourcery, offer AI-powered review bots that comment on every PR with suggestions, warnings, and fixes.
The pitch is compelling: automated security review on every commit, at machine speed, for a fraction of the cost of a human reviewer. But how good are these tools at finding security vulnerabilities specifically? Not style issues. Not refactoring opportunities. Not documentation gaps. Actual exploitable vulnerabilities, the kind that lead to data breaches, account takeovers, and compliance failures.
We tested the major AI code review tools against real vulnerability patterns from our penetration testing and secure code review engagements. We fed them code containing IDOR flaws, broken access controls, injection vectors, JWT implementation mistakes, race conditions, and business logic bypasses. The results were instructive and, for anyone relying solely on these tools for security, concerning.
What AI Code Review Tools Actually Do
Before evaluating individual tools, it helps to understand what AI code review tools are doing under the hood. Most fall into one of two categories: pattern-matching tools that use trained models to identify known-bad code patterns, and generative AI tools that use large language models to "understand" code and provide natural-language feedback.
The pattern-matching tools are essentially evolved static analysis. They look for specific code constructs that correlate with vulnerabilities: string concatenation in SQL queries, missing output encoding, hardcoded credential strings, and known-insecure function calls. They are good at what traditional SAST has always been good at, just with lower false positive rates thanks to machine learning.
The generative AI tools are doing something different. They read your code the way an LLM reads text, generating a contextual understanding of what the code does and then producing review comments based on that understanding. This gives them the ability to comment on code quality, suggest refactors, and explain logic, but it does not give them the ability to reason about security in the way a human security engineer does.
Where AI code review excels
- Style and consistency issues: Naming conventions, formatting, dead code, unused imports
- Simple bugs: Off-by-one errors, null reference potential, type mismatches
- Documentation: Missing docstrings, outdated comments, unclear function signatures
- Basic security patterns: Hardcoded secrets, obvious SQL concatenation, known-insecure functions
- Dependency issues: Known CVEs in imported packages, outdated libraries
Where AI code review struggles
- Business logic flaws: The code does what it says, but what it says violates business rules
- Authorization patterns: Determining whether access control is correct requires understanding the permission model, not just whether a check exists
- Race conditions: Timing-dependent vulnerabilities require understanding concurrent execution paths
- Context-dependent vulnerabilities: SSRF, where the danger depends on what is reachable on the internal network, and BOLA, where it depends on what each user should be able to access
- Cryptographic misuse: Using AES correctly but with a static IV, or HMAC with a key derived from user input
- Cross-service data flows: Vulnerabilities that span multiple microservices or APIs
This distinction matters. The vulnerabilities that AI tools catch reliably are the same ones that traditional SAST tools and even linters have been catching for years. The vulnerabilities that AI tools miss are the ones that actually get companies breached.
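To make the distinction concrete, here is a minimal sketch of the BOLA/IDOR class described above. The handler names and the in-memory store are hypothetical, invented for illustration; the point is that the vulnerable version is syntactically clean, so pattern-matchers have nothing to flag.

```python
# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    101: {"owner_id": 1, "amount": 250},
    102: {"owner_id": 2, "amount": 990},
}

def get_invoice_vulnerable(requesting_user_id, invoice_id):
    """BOLA/IDOR: returns any invoice by ID without checking ownership.
    No insecure function, no string concatenation -- nothing for a
    pattern-matcher to flag, yet any user can read any invoice."""
    return INVOICES.get(invoice_id)

def get_invoice_fixed(requesting_user_id, invoice_id):
    """Fixed: the handler verifies the requester owns the resource."""
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["owner_id"] != requesting_user_id:
        return None  # a real web framework would return 403 or 404 here
    return invoice
```

Deciding that the ownership check belongs here at all requires knowing the application's permission model, which is exactly the context these tools do not have.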
GitHub Copilot Code Review
GitHub Copilot's code review capability, integrated directly into pull requests, is the most widely accessible AI review tool. It leverages the same underlying models that power Copilot's code generation, now applied to reviewing diffs and suggesting changes.
What Copilot catches
Copilot's review is reasonably effective at flagging basic injection patterns where user input flows directly into SQL queries or shell commands through obvious string concatenation. It catches obvious credential exposure, including API keys, passwords, and tokens that appear as string literals in source code. It also identifies some dependency issues when combined with GitHub's Dependabot integration, flagging PRs that introduce packages with known CVEs.
In our testing, Copilot consistently flagged hardcoded JWT secrets, basic SQL injection via template literals, and a few instances of missing input sanitization on user-facing endpoints. These are real issues worth catching, and catching them on every PR before merge has genuine value.
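The injection pattern Copilot flagged reliably looks like the sketch below (our test cases were JavaScript template literals; this is an equivalent Python version using sqlite3, with hypothetical table and function names).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")

def find_user_vulnerable(name):
    # User input interpolated straight into SQL: the textbook pattern
    # that AI review tools catch consistently.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Parameterized query: the driver handles quoting, so the same
    # payload is treated as data, not SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A classic payload such as `' OR '1'='1` returns every row from the vulnerable version and nothing from the parameterized one.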
What Copilot misses
Copilot struggles significantly with vulnerabilities that require understanding application context. It did not flag BOLA/IDOR patterns where an endpoint accepted a resource ID parameter and returned the resource without verifying the requesting user had access to it. It missed business logic authorization flaws where a role check existed but was insufficient for the specific operation. It did not identify context-dependent SSRF where a URL parameter was validated against a blocklist but the blocklist was incomplete. And it failed to catch JWT implementation flaws beyond hardcoded secrets, missing issues like algorithm confusion, missing expiration validation, and key reuse across environments.
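The JWT flaws beyond hardcoded secrets come down to checks that the verifying code must make explicitly. The sketch below is a deliberately minimal, hand-rolled HS256-only verifier (for illustration only; use a maintained JWT library in production) showing the two checks Copilot missed in our testing: pinning the algorithm and enforcing expiration.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign_jwt(payload: dict, key: bytes) -> str:
    """Test helper: produce an HS256 token for the verifier below."""
    header_b64 = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload_b64 = _b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    return f"{header_b64}.{payload_b64}.{_b64url_encode(sig)}"

def verify_jwt(token: str, key: bytes) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    # Check 1: pin the algorithm. Trusting the token's own `alg` field
    # (accepting "none", or RS256 verified as HS256) is the classic
    # algorithm-confusion attack.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    # Check 2: enforce expiration. A correctly signed token may still
    # be stale; skipping this check makes stolen tokens live forever.
    if "exp" not in payload or payload["exp"] < time.time():
        raise ValueError("expired or missing exp")
    return payload
```

Neither check produces a distinctive "bad pattern" when omitted, which is why their absence slips past diff-level review.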
Pricing: Included with GitHub Copilot Enterprise ($39/user/month). Copilot Business ($19/user/month) includes limited review features.
Best for: Catching low-hanging fruit in pull requests. Effective as a first pass that reduces the burden on human reviewers by filtering out basic issues before they reach the review queue.
Amazon CodeGuru
Amazon CodeGuru takes a different approach. It is trained on Amazon's internal codebase, which gives it strong coverage of infrastructure-level issues and AWS-specific patterns, but makes it narrower in scope than general-purpose AI review tools.
What CodeGuru catches
CodeGuru is particularly effective at identifying resource leaks such as unclosed database connections, file handles, and HTTP clients. It catches concurrency issues including thread safety violations and synchronization problems, reflecting Amazon's internal emphasis on highly concurrent systems. It also flags some security anti-patterns specific to AWS services, like overly permissive IAM policies referenced in code and insecure S3 bucket configurations.
What CodeGuru misses
CodeGuru's security coverage outside of AWS-specific patterns is limited. In our testing, it missed most web application security patterns including XSS, CSRF, and insecure deserialization. It did not flag API authorization issues, even straightforward ones where endpoints lacked any permission checks. Its coverage of OWASP Top 10 items was inconsistent; it caught some injection patterns in Java but missed equivalent patterns in Python. The tool's strength is code quality and AWS-specific security, not general application security.
Pricing: Pay-per-line-of-code scanned. Approximately $0.50 per 100 lines for Reviewer, $0.002 per sampling hour for Profiler. Costs can add up quickly on large codebases.
Best for: Java and Python codebases deployed on AWS where resource management and AWS-specific security patterns are primary concerns. Not a substitute for application security testing.
Korbit AI
Korbit AI positions itself as a security-focused AI code review tool, which sets higher expectations than general-purpose alternatives. It integrates with GitHub and GitLab to provide automated review comments on pull requests with an emphasis on security findings.
What Korbit catches
Korbit performs well on OWASP-pattern detection, flagging common injection vectors, missing output encoding, and insecure cryptographic function usage. It catches basic input validation issues where user-controlled data flows into sensitive operations without sanitization. It is also effective at detecting hardcoded secrets and credentials, including patterns that other tools miss like base64-encoded keys and secrets in configuration objects.
What Korbit misses
Despite its security positioning, Korbit shares the same fundamental limitations as other AI review tools. It did not identify complex injection chains where the injection point and the execution point were in different files or different services. It missed business logic vulnerabilities where the security flaw was not in what the code did, but in what it failed to check. And it did not catch authorization patterns that spanned multiple files, such as middleware that checked roles while a specific route handler bypassed that middleware entirely.
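The middleware-bypass pattern is easy to reproduce. The toy router below is entirely hypothetical (not Korbit's test corpus), but it captures the shape: the role check lives in one place, and a second registration path quietly skips it. In a real codebase the two registrations would sit in different files, which is precisely where diff-scoped review loses track.

```python
def require_admin(handler):
    """Middleware: rejects non-admin callers before the handler runs."""
    def wrapped(user, *args):
        if user.get("role") != "admin":
            return ("403 Forbidden", None)
        return handler(user, *args)
    return wrapped

ROUTES = {}

def register_protected(path, handler):
    ROUTES[path] = require_admin(handler)   # normal path: check applied

def register_raw(path, handler):
    ROUTES[path] = handler                  # "temporary" helper: check bypassed

def delete_account(user, account_id):
    return ("200 OK", f"deleted {account_id}")

register_protected("/admin/delete", delete_account)
register_raw("/internal/delete", delete_account)  # same handler, no role check
```

Each registration looks fine in isolation; the vulnerability only exists in the relationship between them.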
Best for: Teams that want a security-oriented layer on top of their existing PR workflow. More security-aware than general-purpose tools, but should not be the only security review mechanism for critical applications.
CodeRabbit
CodeRabbit is a popular AI-powered PR review bot that provides comprehensive code review comments including summaries, suggestions, and issue detection. It uses large language models to generate contextual feedback on code changes.
What CodeRabbit catches
CodeRabbit is strong on code quality issues including complexity, readability, and maintainability concerns that indirectly affect security by making code harder to audit. It catches basic security patterns similar to what Copilot detects: obvious injection points, hardcoded credentials, and missing error handling. It also provides useful dependency analysis, flagging newly introduced packages that have known vulnerabilities or are unmaintained.
What CodeRabbit misses
CodeRabbit's security analysis is shallow compared to dedicated security tools. It operates on individual pull request diffs, which means it lacks the full codebase context needed to evaluate whether a change introduces a vulnerability in the context of the broader application. It missed deep security issues that required understanding the application's authentication flow, data model, or trust boundaries. A new endpoint that looked perfectly fine in isolation was actually accessible to unauthenticated users because of how the routing middleware was configured, and CodeRabbit had no way to know that from the diff alone.
Pricing: Free for open source. Pro plan starts at $15/user/month.
Best for: General code quality improvement across all PRs. Good developer experience and useful summaries. Not a security tool, and should not be evaluated as one.
SAST Tools with AI: Semgrep and Snyk Code
Semgrep and Snyk Code occupy a different category. They are static application security testing (SAST) tools that have added AI capabilities, rather than AI tools that attempt security analysis. This distinction matters because their foundation is security-first.
Semgrep with AI
Semgrep's core strength is its custom rule engine. You can write precise, pattern-based rules that match exactly the vulnerability patterns relevant to your codebase and your tech stack. The AI layer adds assisted triage, helping developers understand whether a finding is a true positive and providing remediation guidance in natural language.
Semgrep's CI/CD integration is mature, and its rule ecosystem covers a broad range of languages and frameworks. In our testing, Semgrep with well-configured rules caught the highest percentage of pattern-based vulnerabilities of any tool we evaluated. Its false positive rate was the lowest, largely because the rule language allows you to specify precise conditions rather than relying on probabilistic model output.
The limitation is the same as any SAST tool: it matches patterns, not behavior. It can tell you that a query is not parameterized, but it cannot tell you whether the authorization logic on that endpoint is correct for your application's permission model.
Snyk Code
Snyk Code provides cross-file data flow analysis powered by machine learning, which gives it an advantage over single-file pattern matchers. It can trace a user input from an API endpoint through several function calls and transformations to a database query or system command, flagging the chain even when no single file contains a complete vulnerability.
This flow analysis makes Snyk Code more effective at catching indirect injection vulnerabilities where the taint source and the sink are in different files. It also provides real-time scanning in the IDE, catching issues before they even make it to a pull request.
In our testing, Snyk Code caught several injection chains that other tools missed. It also correctly identified insecure deserialization in a Java application where the deserialization call was three function calls removed from the user input. However, like all tools in this comparison, it did not catch authorization logic errors or business logic flaws.
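The kind of chain flow analysis catches looks like this sketch (function names hypothetical; the command string is built but never executed, so the example is safe to run). The taint source and the sink sit several calls apart, where single-file pattern matchers lose the thread.

```python
def handle_request(params):
    # Source: user-controlled input enters the application here.
    return export_report(params["filename"])

def export_report(filename):
    # Pass-through: nothing in this function looks wrong on its own.
    return run_converter(filename)

def run_converter(filename):
    # Sink: the tainted value lands in a shell command string.
    # (Built, not executed, for this demo -- a real app might pass it
    # to a shell, making the chain a command injection.)
    return f"convert /tmp/reports/{filename} out.pdf"
```

A payload like `jan.csv; rm -rf /` survives the whole chain intact, which is exactly what source-to-sink tracing is designed to surface.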
Pricing: Semgrep offers a free tier for individual developers; Team plans are custom-priced. Snyk Code is included in Snyk plans starting at a free tier with limited scans, with Team plans from $25/user/month.
Best for: Teams serious about application security who want tools purpose-built for security analysis rather than general-purpose code review tools that happen to flag some security issues.
AI Code Review Tool Comparison
The following table summarizes how each tool performed across key security dimensions in our testing.
| Capability | GitHub Copilot | Amazon CodeGuru | Korbit AI | CodeRabbit | Semgrep (AI) | Snyk Code |
|---|---|---|---|---|---|---|
| OWASP Top 10 | Partial | Limited | Good | Basic | Strong | Strong |
| Business logic | None | None | Minimal | None | None | None |
| Authorization flaws | Minimal | None | Basic checks | None | Rule-dependent | Basic flow analysis |
| Dependency analysis | Via Dependabot | Limited | Basic | Good | Via Supply Chain | Strong (Snyk SCA) |
| False positive rate | Moderate | Low-moderate | Moderate | Low | Low | Moderate |
| Language support | Broad | Java, Python | Major languages | Broad | 30+ languages | 10+ languages |
| Pricing tier | $19-39/user/mo | Pay-per-scan | Per-seat | Free-$15/user/mo | Free-custom | Free-$25/user/mo |
The pattern is clear across every tool: pattern-based detection is strong, logic-based detection is absent. No tool in this comparison reliably identified business logic vulnerabilities or complex authorization flaws. The tools that came closest, Semgrep and Snyk Code, did so through precisely configured rules and data flow analysis, not through AI "understanding" of the code's intent.
What AI Code Review Still Cannot Do
The limitations of AI code review tools are not bugs that will be fixed in the next release. They are fundamental constraints of how these tools work, and understanding them is essential for building a security program that does not have blind spots.
Cannot understand business context
An AI tool can determine that an endpoint returns user data. It cannot determine whether User A should be allowed to see User B's data. That distinction, the difference between what IS authorized and what SHOULD be authorized, requires understanding the business rules of the application. No amount of code analysis can derive business intent. This is why business logic vulnerabilities consistently rank among the most impactful findings in our manual secure code reviews.
Cannot test runtime behavior
AI code review tools analyze source code, not running applications. They cannot detect vulnerabilities that only manifest at runtime: race conditions that depend on specific timing, memory corruption that depends on allocation patterns, configuration issues that depend on the deployment environment, or authentication bypasses that depend on how the web server handles edge cases in HTTP parsing.
Cannot follow complex data flows across microservices
Modern applications are distributed across multiple services, each with its own codebase. A user input enters through an API gateway, gets processed by Service A, queued to Service B, and eventually written to a database by Service C. An injection vulnerability in this chain is invisible to any tool that analyzes one repository at a time. Even Snyk Code's cross-file analysis stops at the service boundary.
Cannot identify race conditions in distributed systems
Race conditions like time-of-check to time-of-use (TOCTOU) vulnerabilities require reasoning about concurrent execution. Can two requests hit this endpoint simultaneously and both pass the balance check before either deducts the funds? AI tools do not model concurrency. They see sequential code and analyze it sequentially.
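The balance-check question above can be answered by running it. This is a deliberately simplified sketch (threads simulating concurrent requests, a `sleep` widening the race window so it reproduces reliably), not taken from any tool's test corpus:

```python
import threading
import time

balance = {"amount": 100}

def withdraw_unsafe(amount):
    # Time-of-check: both concurrent requests can pass this line...
    if balance["amount"] >= amount:
        time.sleep(0.01)  # widened window so the race reproduces reliably
        # ...time-of-use: both then deduct, driving the balance negative.
        balance["amount"] -= amount

threads = [threading.Thread(target=withdraw_unsafe, args=(100,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
# balance["amount"] is now negative: both withdrawals passed the check.

lock = threading.Lock()
safe_balance = {"amount": 100}

def withdraw_safe(amount):
    with lock:  # check and deduct now execute atomically per request
        if safe_balance["amount"] >= amount:
            time.sleep(0.01)
            safe_balance["amount"] -= amount

threads = [threading.Thread(target=withdraw_safe, args=(100,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
# safe_balance["amount"] is 0: the second request fails the check.
```

Read sequentially, the unsafe version is correct, which is exactly how an AI tool reads it. Only the interleaving makes it a vulnerability, and in a distributed system the "lock" has to be a database transaction or equivalent, which no diff-level analysis will verify for you.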
Cannot evaluate cryptographic implementations in context
AI tools can flag the use of MD5 or SHA1. They cannot determine that your AES-256-GCM implementation reuses nonces under specific conditions, that your key derivation function uses insufficient iterations for the threat model, or that your HMAC comparison is vulnerable to timing attacks. Cryptographic security depends on implementation details that require specialized knowledge to evaluate.
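The timing-attack case is a good illustration of why context matters: both versions below use the right primitive (HMAC-SHA256), and only the comparison differs. The key and message are hypothetical placeholders.

```python
import hashlib
import hmac

KEY = b"server-side-key"          # placeholder secret for the demo
MSG = b"user=alice;role=user"
expected_mac = hmac.new(KEY, MSG, hashlib.sha256).hexdigest()

def verify_naive(candidate):
    # `==` short-circuits at the first differing byte, so response
    # timing leaks how much of the MAC an attacker has guessed.
    return candidate == expected_mac

def verify_constant_time(candidate):
    # hmac.compare_digest runs in time independent of where the
    # inputs differ, closing the timing side channel.
    return hmac.compare_digest(candidate, expected_mac)
```

Both functions return identical results for any input; the flaw exists only in execution time, a dimension that source-pattern analysis does not model.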
Cannot assess whether authentication flows are architecturally sound
An authentication system may use all the right primitives (bcrypt, secure sessions, CSRF tokens) and still be architecturally flawed. Maybe the password reset flow does not invalidate existing sessions. Maybe the OAuth implementation does not validate the state parameter. Maybe the MFA enrollment process can be bypassed by directly calling an enrollment API endpoint. These are authentication bypass techniques that require architectural reasoning, not pattern matching.
When to Use AI Review vs. Manual Security Review
This is not an either-or decision. AI code review and manual security review serve different functions and belong at different stages of your development process. Here is a practical decision framework.
Use AI code review tools for every pull request
AI review tools should run on every PR, every day. Their cost is low, their speed is instant, and they catch the kind of surface-level issues that waste human reviewers' time. A well-configured Semgrep ruleset or a Snyk Code scan running in CI catches hardcoded secrets, basic injection patterns, and known-vulnerable dependencies before a human ever looks at the code. This is the security equivalent of automated linting: it does not replace human judgment, but it raises the floor.
Use manual security review for security-critical moments
- Before major releases: Any release that changes authentication, authorization, payment processing, or data handling should have a manual secure code review
- For security-critical code: Authentication systems, encryption implementations, access control layers, and anything that processes sensitive data
- For compliance requirements: SOC 2, PCI-DSS, HIPAA, and other frameworks increasingly expect evidence of security review that goes beyond automated scanning
- When AI tools flag something ambiguous: AI tools occasionally flag issues they cannot fully evaluate. A human reviewer can determine whether the flagged pattern is actually exploitable in context
- After significant architectural changes: New microservices, new API gateways, new authentication providers, or new data flows all introduce security surfaces that require human evaluation
Combine both for a layered approach
The strongest security posture comes from layering automated and manual review. AI tools filter out noise and catch common patterns on every commit. Human reviewers focus their limited time on the complex, context-dependent issues that AI cannot evaluate. This is the same principle behind DevSecOps pipeline design: automate what you can, and reserve human expertise for what you cannot automate.
AI code review tools are excellent at catching the easy stuff. But in our experience, the vulnerabilities that lead to actual breaches (authorization bypasses, business logic flaws, race conditions) require a human who understands what the application is supposed to do, not just what the code says. The hard truth is that the most dangerous vulnerabilities are the ones no tool is equipped to find automatically.
Building an Effective Code Review Security Strategy
Based on our experience reviewing code across hundreds of engagements, here is the approach that works.
Layer 1: Automated AI scanning in CI/CD. Run Semgrep or Snyk Code on every pull request. Configure custom rules for your tech stack. Block merges on high-severity findings. This catches 60-80% of pattern-based vulnerabilities with zero ongoing human effort.
Layer 2: AI-assisted PR review. Use Copilot review or CodeRabbit to provide contextual feedback to developers. This catches code quality issues and some security patterns that the SAST rules miss. It also educates developers by explaining why certain patterns are problematic.
Layer 3: Periodic manual security review. Quarterly or before major releases, have security engineers perform a manual review of security-critical code. Focus on authentication, authorization, data handling, and any areas where business logic determines security. This is where you catch the vulnerabilities that automated tools fundamentally cannot find.
Layer 4: Penetration testing. Annually or after major infrastructure changes, test the running application from an attacker's perspective. This validates that the security measures identified in code review actually work in the deployed environment and catches configuration, infrastructure, and runtime issues that code review does not cover. See our guide on choosing between code review and pentesting for more detail.
Each layer catches what the layer above misses. No single layer is sufficient on its own. The teams we see with the strongest security posture are the ones that invest in all four, with the right balance of automated and manual effort for their stage and risk profile.
Need a Security-Focused Code Review?
AI tools catch the patterns. Our security engineers catch the logic flaws, authorization bypasses, and architectural weaknesses that automated tools miss. Lorikeet Security's manual secure code review goes beyond what any AI tool can deliver.