04.03.26

LLMs for Security Analysis: A Head-to-Head Competition

Written by: Almog Maman
Read time: 5 min
Category: Blog


Devstral Leads On-Prem - GPT-5.3 Codex Outperforms Claude

March 2026

 

With Claude Code’s new security research capabilities gaining attention, one practical question stands out:

If your organization cannot send source code to the cloud, which on-premise model should you use for application security?

Claude Code can operate with locally hosted models - but performance varies significantly.

To answer this, CodeValue conducted a controlled benchmark comparing five leading on-premise LLMs under identical conditions, auditing the same enterprise codebase with the exact same AppSec prompt.

 

What We Compared

On-Premise Models (Primary Competition)

  • Qwen – Qwen Coder 480B
  • DeepSeek – DeepSeek v3.2
  • Mistral AI – Ministral 14B
  • Mistral AI – Ministral 8B
  • Mistral AI – Devstral 24B

Online Reference Models (Baseline)

  • Anthropic – Claude Opus 4.6
  • OpenAI – GPT-5.3 Codex Max
  • OpenAI – GPT-5.1 Codex Max

The online models were initially included to provide a reference ceiling, but their results turned out to be interesting in their own right.

 

Target Codebase

All models were evaluated against the same production-grade enterprise application:

  • Mature Windows-based desktop system
  • Built on a legacy enterprise technology stack
  • Rich graphical user interface architecture
  • Several dozen source files
  • Several thousand lines of production code
  • External device and third-party system integrations
  • Mixed configuration management and service-layer logic
  • Business-critical operational workflows

 

Experimental Setup

All models were executed inside Claude Code, except for the OpenAI models, which were run through Codex (a minimal harness sketch follows the prompt below).

Conditions were strictly controlled:

  • Same prompt
  • Single run
  • No clarifications
  • No follow-up questions
  • No refinement

Prompt:

"Act as a senior AppSec engineer: immediately audit the provided codebase for real security vulnerabilities using only available files, infer missing structure without asking questions, prioritize by impact/exploitability, state assumptions explicitly, and output an executive summary, detailed findings with minimal fixes, inferred attack surface, quick wins, and hardening backlog.

The same instructions.


Seven models.
Seven reports
."

 

Scoring Method

Each report was scored across eight categories:

  1. Technical accuracy
  2. Depth of vulnerability reasoning
  3. Risk assessment quality
  4. Exploitability realism
  5. False positive / false negative discipline
  6. Mitigation practicality
  7. Clarity & structure
  8. Professional tone

Scores were weighted toward audit credibility (see the illustrative roll-up below).

Final score range: 0–100.
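
To make the scoring concrete, here is a minimal sketch of how per-category scores on a 0–20 scale can be rolled up into a weighted 0–100 final score. The weights are illustrative, not the exact ones we used; the only constraint carried over from the method above is that credibility-related categories (accuracy, exploitability realism, FP/FN discipline) carry more weight than presentation categories.

# Illustrative roll-up of eight category scores (0-20 each) into a 0-100 final
# score. The weights below are hypothetical; they only reflect the stated idea
# that audit-credibility categories weigh more than presentation categories.
WEIGHTS = {
    "technical_accuracy": 0.18,
    "depth_of_reasoning": 0.14,
    "risk_assessment": 0.12,
    "exploitability_realism": 0.14,
    "fp_fn_discipline": 0.14,
    "mitigation_practicality": 0.12,
    "clarity_structure": 0.08,
    "professional_tone": 0.08,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1

def final_score(category_scores: dict[str, float]) -> float:
    """Map 0-20 category scores to a weighted 0-100 final score."""
    return sum(
        (category_scores[name] / 20) * 100 * weight
        for name, weight in WEIGHTS.items()
    )

# Example: Devstral 24B's category scores from the snapshot further below.
devstral = {
    "technical_accuracy": 12, "depth_of_reasoning": 11, "risk_assessment": 9,
    "exploitability_realism": 8, "fp_fn_discipline": 8,
    "mitigation_practicality": 12, "clarity_structure": 13, "professional_tone": 9,
}
print(round(final_score(devstral), 1))  # ~51 under these illustrative weights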

 

Final Results - Overall Ranking

[Chart: overall ranking across all models]

 

On-Premise Leaderboard Only

If we isolate the on-prem models:

[Chart: on-premise model ranking]

 

Devstral 24B led the on-premise field. It demonstrated:

  • Stronger technical accuracy
  • Better structured findings
  • More consistent exploitability reasoning
  • More practical mitigation proposals

No other on-prem model reached similar balance across categories.

 

Category Snapshot

Scores are shown on a 0–20 scale per category.

GPT-5.3 Codex Max
Tech 14 | Depth 12 | Clarity 16 | Risk 10 | FP/FN 12 | Exploit 12 | Mitigation 14 | Professionalism 16

Claude Opus 4.6
Tech 14 | Depth 15 | Clarity 15 | Risk 11 | FP/FN 10 | Exploit 11 | Mitigation 13 | Professionalism 10

Devstral 24B
Tech 12 | Depth 11 | Clarity 13 | Risk 9 | FP/FN 8 | Exploit 8 | Mitigation 12 | Professionalism 9

Ministral 14B
Tech 10 | Depth 11 | Clarity 12 | Risk 7 | FP/FN 6 | Exploit 7 | Mitigation 10 | Professionalism 7

DeepSeek v3.2
Tech 8 | Depth 9 | Clarity 10 | Risk 6 | FP/FN 6 | Exploit 6 | Mitigation 10 | Professionalism 7

Qwen 480B
Tech 8 | Depth 7 | Clarity 13 | Risk 7 | FP/FN 6 | Exploit 5 | Mitigation 9 | Professionalism 9

Ministral 8B
Tech 6 | Depth 5 | Clarity 10 | Risk 5 | FP/FN 4 | Exploit 4 | Mitigation 7 | Professionalism 6

 

Coverage Comparison

We also measured topic coverage across common enterprise security themes:

  • Hardcoded secrets
  • HTTP endpoints
  • FTP usage
  • DLL/plugin loading
  • Service privilege
  • Mutex ACL
  • UNC/NTLM exposure
  • Installer patterns
  • Dependency age
  • Configuration token handling

Devstral 24B consistently covered more of these themes than the other on-prem models.
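
As a rough illustration of how theme coverage can be tallied, the sketch below scans each report for theme-specific keywords. It is a simplification of the review we actually performed: the keyword lists and file layout are assumptions, and plain substring matching will miss findings phrased differently.

from pathlib import Path

# Hypothetical keyword lists per theme; a real coverage review would be more
# nuanced than substring matching.
THEMES = {
    "hardcoded secrets": ["hardcoded", "secret", "api key", "password"],
    "http endpoints": ["http://", "cleartext http"],
    "ftp usage": ["ftp"],
    "dll/plugin loading": ["dll", "loadlibrary", "plugin load"],
    "service privilege": ["localsystem", "service account", "privilege"],
    "mutex acl": ["mutex"],
    "unc/ntlm exposure": ["unc path", "ntlm"],
    "installer patterns": ["installer", "msi"],
    "dependency age": ["outdated", "end-of-life", "dependency"],
    "config token handling": ["token", "connection string", "config"],
}

def coverage(report_path: Path) -> set[str]:
    """Return the set of themes a report mentions at least once."""
    text = report_path.read_text(errors="ignore").lower()
    return {theme for theme, words in THEMES.items()
            if any(word in text for word in words)}

if __name__ == "__main__":
    for report in sorted(Path("./reports").glob("*.md")):
        covered = coverage(report)
        print(f"{report.stem}: {len(covered)}/{len(THEMES)} themes")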

 

Conclusion

Which on-premise LLM performs best at generating structured security audit reports under identical conditions?

The winner: Devstral 24B

 

Another interesting insight: among the online reference models, the winner was not Claude Opus 4.6 but GPT-5.3 Codex Max.

 

CodeValue Researchers: Almog Maman, Eyal Reginiano
