04.03.26

Devstral Leads On-Prem - GPT-5.3 Codex Outperforms Claude
March 2026
With Claude Code’s new security research capabilities gaining attention, one practical question stands out:
If your organization cannot send source code to the cloud, which on-premise model should you use for application security?
Claude Code can operate with locally hosted models - but performance varies significantly.
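The article does not describe how the local models were wired up. One common pattern (an assumption on our part, not stated in the article) is to run a gateway such as LiteLLM that translates Anthropic-style API requests to a local OpenAI-compatible server (vLLM, Ollama, etc.), then point Claude Code at that gateway via its environment variables:

```shell
# Hypothetical configuration (not from the article): point Claude Code at a
# locally hosted, Anthropic-API-compatible endpoint instead of the cloud API.
# The URL and token below are placeholders for your own local gateway.
export ANTHROPIC_BASE_URL="http://localhost:4000"   # local gateway / proxy
export ANTHROPIC_AUTH_TOKEN="local-dev-key"         # any value your proxy accepts
claude                                              # launch Claude Code as usual
```

With this in place, prompts never leave the machine as long as the gateway routes to an on-prem model.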
To answer this, CodeValue conducted a controlled benchmark comparing five leading on-premise LLMs under identical conditions, auditing the same enterprise codebase with the exact same AppSec prompt.
The online models were initially included only to provide a reference ceiling, but their results proved interesting as well.
All models were evaluated against the same production-grade enterprise application.
All models were executed inside Claude Code, except for the OpenAI models, which were executed with Codex.
Conditions were strictly controlled:
Prompt:
"Act as a senior AppSec engineer: immediately audit the provided codebase for real security vulnerabilities using only available files, infer missing structure without asking questions, prioritize by impact/exploitability, state assumptions explicitly, and output an executive summary, detailed findings with minimal fixes, inferred attack surface, quick wins, and hardening backlog."
The same instructions.
Seven models.
Seven reports.
Each report was scored across eight categories: Tech, Depth, Clarity, Risk, FP/FN, Exploit, Mitigation, and Professionalism.
Weighted toward audit credibility.
Final score range: 0–100.
If we isolate the on-prem models:
Devstral 24B demonstrated the strongest overall balance: no other on-prem model reached similar scores across all eight categories.
Scores are shown on a 0–20 scale per category.
GPT-5.3 Codex Max
Tech 14 | Depth 12 | Clarity 16 | Risk 10 | FP/FN 12 | Exploit 12 | Mitigation 14 | Professionalism 16
Claude Opus 4.6
Tech 14 | Depth 15 | Clarity 15 | Risk 11 | FP/FN 10 | Exploit 11 | Mitigation 13 | Professionalism 10
Devstral 24B
Tech 12 | Depth 11 | Clarity 13 | Risk 9 | FP/FN 8 | Exploit 8 | Mitigation 12 | Professionalism 9
Ministral 14B
Tech 10 | Depth 11 | Clarity 12 | Risk 7 | FP/FN 6 | Exploit 7 | Mitigation 10 | Professionalism 7
DeepSeek v3.2
Tech 8 | Depth 9 | Clarity 10 | Risk 6 | FP/FN 6 | Exploit 6 | Mitigation 10 | Professionalism 7
Qwen 480B
Tech 8 | Depth 7 | Clarity 13 | Risk 7 | FP/FN 6 | Exploit 5 | Mitigation 9 | Professionalism 9
Ministral 8B
Tech 6 | Depth 5 | Clarity 10 | Risk 5 | FP/FN 4 | Exploit 4 | Mitigation 7 | Professionalism 6
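The per-category scores above can be rolled up into the 0–100 final score. The article says the weighting favors audit credibility but does not publish the weights, so the sketch below assumes equal weights (eight categories, 20 points each, scaled to 100); even under that assumption the ranking matches the article's conclusions:

```python
# Per-category scores (0-20 each) transcribed from the table above.
# Order: Tech, Depth, Clarity, Risk, FP/FN, Exploit, Mitigation, Professionalism.
scores = {
    "GPT-5.3 Codex Max": [14, 12, 16, 10, 12, 12, 14, 16],
    "Claude Opus 4.6":   [14, 15, 15, 11, 10, 11, 13, 10],
    "Devstral 24B":      [12, 11, 13,  9,  8,  8, 12,  9],
    "Ministral 14B":     [10, 11, 12,  7,  6,  7, 10,  7],
    "DeepSeek v3.2":     [ 8,  9, 10,  6,  6,  6, 10,  7],
    "Qwen 480B":         [ 8,  7, 13,  7,  6,  5,  9,  9],
    "Ministral 8B":      [ 6,  5, 10,  5,  4,  4,  7,  6],
}

def final_score(cats, weights=None):
    """Scale eight 0-20 category scores to a 0-100 total.

    `weights` is a hypothetical per-category weighting (the article's actual
    weights are unpublished); the default is equal weighting.
    """
    weights = weights or [1] * len(cats)
    total = sum(c * w for c, w in zip(cats, weights))
    return 100 * total / (20 * sum(weights))

# Rank all seven models by the equally weighted final score.
ranking = sorted(scores, key=lambda m: final_score(scores[m]), reverse=True)
print(ranking)
```

Swapping in different weight vectors is a quick way to test how sensitive the ranking is to the (unknown) credibility weighting.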
We also measured topic coverage across common enterprise security themes.
Devstral 24B covered more categories consistently than any other on-prem model.
Which on-premise LLM performs best at generating structured security audit reports under identical conditions?
The winner: Devstral 24B
Another interesting insight is that the online winner was not Claude, but GPT-5.3 Codex Max.
CodeValue Researchers: Almog Maman, Eyal Reginiano