
Optimizing LLM Context for Vulnerability Scanning
A comparison of various code chunking strategies
The Challenge of Code Context in AI Security
When it comes to vulnerability detection with Large Language Models (LLMs), how context is used plays an important role in performance. Should you shove the entire code base into one (very large) context window as one (very large) chunk? Or split the code base, feeding the LLM one (smaller) chunk at a time? If you split, what makes a chunk effective? The chunking strategy, the size you choose, along with several other factors, directly determine what the model can see and reason about.
Our experience so far has shown that two opposing forces affect performance. Provide too much code, and relevant signals get diluted—e.g., mixing tests and production files or many unrelated modules reduced recall in our experiments. Provide too little and the model misses necessary cues—e.g., a sink defined in one file and the input validation in another, or a subtle configuration that disables sanitization. This trade‑off also affects speed and cost: larger prompts are slower and more expensive, while smaller prompts risk false negatives.
In this post, we will explore the effects chunking strategies and context size have when using LLMs to discover OWASP Top 10 vulnerabilities using Fraim. We’ll answer questions like:
- Do LLMs tend to miss vulnerabilities if they are fed too many lines of code at once?
- What effect do different chunking strategies have on performance?
- How much surrounding code does an LLM need to identify a vulnerability in the first place?
TL;DR: our results showed that increasing chunk sizes improved accuracy and, inversely, decreasing chunk sizes improved coverage. This appears to be because chunk size is inversely correlated with the total number of reported results: smaller chunks produce more guesses, while larger chunks produce fewer, more accurate ones. Most surprisingly, we found that syntax-aware chunking provided very little benefit and, in some cases, performed less consistently than more naive approaches.
Fraim: AI-Powered Vulnerability Scanner
Our system, Fraim, uses a two-stage process for its code workflow module. First, it chunks repositories into smaller inputs so an LLM can scan them for vulnerabilities. Then, a triager stage validates those findings with full repository access through tool calling, filtering out incorrect vulnerabilities.
The code workflow’s first stage produces candidate findings—hypotheses based on chunked context. Then the second stage, triage, does not use chunking; it evaluates candidates using tools with access to the full repository to confirm, deduplicate, and discard false positives.
This two-stage architecture is optimized for recall and speed. The first stage ensures that the LLM sees every line in the code base. With a purely tool-driven approach, the LLM might skip some files. The slower (due to tool calls) second stage is run only for likely vulnerabilities.
Note: For the experiments in this article, we ran only the first stage to easily isolate the effects of chunking. We want high recall (low false negatives) from this stage to avoid missed vulnerabilities. Some false positives are acceptable (they would be filtered by the second stage), but would make the scanning slower and more expensive (more second stage executions).
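For intuition, here is a minimal sketch of that two-stage flow. It is not Fraim’s actual API; the scan and triage callables stand in for the LLM scanning call and the tool-driven triager described above.

```python
from typing import Callable

Finding = dict  # e.g. {"file": ..., "line": ..., "rule": ..., "rationale": ...}

def two_stage_scan(
    chunks: list[str],
    scan: Callable[[str], list[Finding]],   # stage 1: LLM scan over one chunk (recall-oriented)
    triage: Callable[[Finding], bool],      # stage 2: tool-driven validation with full repo access
) -> list[Finding]:
    # Stage 1: every chunk (and therefore every line) is seen at least once.
    candidates = [finding for chunk in chunks for finding in scan(chunk)]
    # Stage 2: only candidate findings pay the cost of the slower, tool-using triager.
    return [finding for finding in candidates if triage(finding)]
```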
The scanner targets common vulnerabilities such as SQL injection, XSS, and CSRF across multiple languages (Python, JavaScript, Java, C, etc.). In these cases, context is critical. A snippet that looks unsafe in isolation may be harmless in its actual usage—or vice versa. This is why finding the right chunking method is so important.
The Mechanics of Code Chunking for LLMs
Chunking means splitting a large codebase into smaller segments that fit within an LLM’s token window. It’s not only about size constraints; it’s about feeding the model the right information without overwhelming it to the point that it degrades performance.
The key idea is to maximize useful context while minimizing irrelevant details. Done poorly, chunking can cut off the information the model needs to reason about a function. Done well, it lets the model focus on vulnerabilities in context without distraction.
Why Context Matters
Fraim’s scan stage surfaces hypotheses, not verdicts. We ask the model to flag code that might be risky so a reviewer—or a triage agent—can trace the call sites and decide what the bug really is. A real example from one of the tests in a modified version of the XBoW validation-benchmarks dataset shows why.
Chunk A – the localhost-only password reset
```python
@app.route('/internal/update', methods=['GET'])
def update_user():
    if request.remote_addr != '127.0.0.1':
        return "Unauthorized", 403
    username = request.args.get('username')
    new_password = request.args.get('newpassword')
    if username in users:
        users[username]['password'] = new_password
        return "Password reset successfully", 200
    return "User not found", 404
```
If a model only sees this small snippet, it may not consider it publicly exploitable: the guard rejects any request that didn’t originate from the loopback interface, so the obvious exploit path is closed. At this stage, we don’t want the model to stop—ideally, we would want it to report the possibility that there may be a way to call this endpoint.
Chunk B – the profile picture fetcher earlier in the same file
```python
@app.route('/profile', methods=['GET', 'POST'])
def profile():
    <..snip..>
    profile_picture_url = request.form.get('profile_picture_url')
    if profile_picture_url:
        try:
            filename = secure_filename(profile_picture_url.split('/')[-1])
            urllib.request.urlretrieve(profile_picture_url, os.path.join(app.config['UPLOAD_FOLDER'], filename))
            user['profile_picture'] = filename
        except Exception as e:
            flash('Failed to fetch image from URL.')
```
This chunk says nothing about passwords, but it lights up a different hypothesis: “user input controls a server-side URL fetch; check for SSRF.” Following that thread reveals the connection—the attacker can use the profile_picture_url parameter on the /profile endpoint to potentially bypass the localhost guard. The guard wasn’t wrong; it was simply that two unrelated sections of code were needed to produce an exploitable vulnerability.
If both pieces of code appear in the same chunk, the LLM can reason about them together. If the chunk that contains both grows much larger than necessary, the common understanding is that the critical information may be drowned out by the additional, unnecessary context. Lastly, if they end up in separate chunks, we are relying on the LLM to consider the localhost check exploitable and report the potential vulnerability based solely on the first code block.
Exploring Chunking Strategies
Our choices for chunking strategies covered below reflect the following questions:
- What if we didn’t chunk at all?
- What happens when we start combining files?
- This will help us understand performance at chunk sizes larger than most files in our dataset.
- What are the effects of semantic chunking, which aims to preserve logical code structures?
- How do different-sized chunks affect chunking strategies?
- How do these results compare to the original strategy?
Below, we’ll describe each of the strategies chosen for the questions above in detail.
Line-Based (Original)
Fraim’s original chunking strategy is a simple, line‑based splitter. It caps chunks by line count rather than tokens. It attempts to preserve some common code boundaries (e.g., open and closing braces) but is not language-aware. Additionally, it includes the file line numbers for the code included in the context, something that wasn’t carried over to other chunkers.
We include it in these benchmarks to compare performance across earlier versions of Fraim and to ensure no significant regressions exist.
Sizes for all other chunking methods are defined in tokens, while the original strategy uses lines.
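As a rough illustration (simplified from the description above, not Fraim’s actual code), a line-based splitter that caps by line count and prefixes each line with its file line number might look like this:

```python
def split_by_lines(source: str, max_lines: int = 200) -> list[str]:
    """Cap chunks by line count and keep original file line numbers in the context."""
    numbered = [f"{i + 1}: {line}" for i, line in enumerate(source.splitlines())]
    return [
        "\n".join(numbered[start:start + max_lines])
        for start in range(0, len(numbered), max_lines)
    ]
```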
Fixed Token
Fixed Token chunking splits code into chunks of a set number of tokens. It is as naive as the original line-based approach, but using token count makes it easier to use as a fallback for the other strategies and aims to provide more consistent performance than the original line-based approach. Despite being measured in tokens, chunks are only split on line breaks, as before.
Because chunks are only split on line breaks, the chunk overlap, which represents the size of the overlap included between distinct chunks, cannot be consistent. When the limit is reached for a chunk, any tokens in the last incomplete line are removed.
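A minimal sketch of this behavior is shown below. The token counter is a crude stand-in (Fraim uses a real tokenizer), but it shows why the overlap rounds to whole lines and can never be exact:

```python
def count_tokens(text: str) -> int:
    # Crude approximation (~4 characters per token); use a real tokenizer in practice.
    return max(1, len(text) // 4)

def split_fixed_tokens(source: str, chunk_tokens: int, overlap_tokens: int = 0) -> list[str]:
    chunks, current, current_size = [], [], 0
    for line in source.splitlines():
        size = count_tokens(line)
        if current and current_size + size > chunk_tokens:
            chunks.append("\n".join(current))
            # Carry whole trailing lines forward until the requested overlap is covered,
            # which is why the effective overlap is never exact.
            overlap, overlap_size = [], 0
            for prev in reversed(current):
                if overlap_size >= overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_size += count_tokens(prev)
            current, current_size = overlap, overlap_size
        current.append(line)
        current_size += size
    if current:
        chunks.append("\n".join(current))
    return chunks
```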
For fixed token, the chunk size to use becomes an important question due to the trade-off between larger contexts (potential dilution) and providing enough context to hypothesize a vulnerability.
Our hypothesis here was that testing this method across a range of chunk sizes would surface a “sweet spot” where accuracy is maximized and false positives minimized. Too small, and critical context is lost; too large, and noise increases or token limits are hit.
In practice, the F1-score does not trace a smooth curve for two reasons. First, critical context can fall in and out of chunk boundaries. Second, many files are smaller than the larger chunk sizes. This produces apparent “randomness” in per-size results while still revealing a band of competitive sizes.
Syntactic Token
Syntactic token (aka language‑aware) chunking is similar to fixed token but uses language heuristics to split at logical code boundaries. Specifically, the syntactic chunking used in these tests is based on LangChain’s RecursiveCharacterTextSplitter.from_language to split each file, falling back on fixed_token when the file type is not supported. LangChain defines a priority of language‑specific separators (functions/classes/blocks) and snaps boundaries to them.
This is also affected by the same inconsistent chunk overlap issue we covered above with fixed token. The effect this issue has on each token splitter is related to the number of possible locations the chunk can be split at. This effect will vary widely depending on what code is being processed, but will generally affect the syntactic chunker more than the fixed token chunker, which splits on newlines.
We hypothesized that by preserving logical code units (functions, methods, classes) and providing semantically complete snippets, the syntactic chunking method would lead to improved accuracy compared to the equivalent fixed-size chunking tests, as chunks would not be split mid‑function/class. In practice, there proved to be very little difference over the fixed token approach, which we’ll dive into more in the results and analysis section.
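A sketch of this approach, assuming the langchain-text-splitters package and reusing the fixed token splitter above as the fallback (the extension map here is an illustrative subset, not Fraim’s full list):

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Illustrative subset of supported languages.
EXTENSION_TO_LANGUAGE = {".py": Language.PYTHON, ".js": Language.JS, ".java": Language.JAVA}

def split_syntactic(path: str, source: str, chunk_tokens: int, overlap_tokens: int) -> list[str]:
    language = EXTENSION_TO_LANGUAGE.get(path[path.rfind("."):] if "." in path else "")
    if language is None:
        # Unsupported file type: fall back to the fixed token splitter above.
        return split_fixed_tokens(source, chunk_tokens, overlap_tokens)
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=language,
        chunk_size=chunk_tokens,
        chunk_overlap=overlap_tokens,
        length_function=count_tokens,  # measure size in (approximate) tokens, not characters
    )
    return splitter.split_text(source)
```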
File
The file strategy uses one chunk per file in the code base. We limit the chunk size to 70% of the model’s max context to avoid near-limit failures. If a file is larger than this limit, it is chunked using the fixed token approach.
We originally hypothesized that performance would vary depending on file size and internal code organization: adequate for small, self-contained files, but lacking context for vulnerabilities that require cross-file context.
Compared to the project strategy (discussed below), the file strategy suffered in precision but performed better in recall and, as expected, did better on the Validation Benchmarks dataset, which contains many unrelated sub‑projects. When looking at the F1‑score on other datasets, performance was relatively poor, with the file strategy generally below the stronger strategies.
This strategy answers the question of what happens if no chunking is used at all, so we use it as the baseline for comparing the other results.
Project
This strategy attempts to send the entire project in a single call by combining files to fill the LLM’s context window, and aims to answer our second question. In practice, it uses the fixed token strategy with a token limit of 70% of the model’s max context, so small to medium-sized codebases can be sent in a single API call. The details of how splitting works for larger projects are covered in the Fixed Token section above.
We originally predicted this would degrade performance overall for most codebases, while potentially performing well on small repositories. This strategy shows a competitive accuracy/speed trade‑off against large packed sizes, which we’ll analyze in depth later in this post.
Packed Fixed/Syntactic
Two packed strategies combine files in much the same way as the project strategy: packed fixed and packed syntactic. The first falls back to the fixed token splitter and the second to the syntactic token splitter; beyond that, they use the same approach.
The packed strategies represent a balanced version of the same approach, where files are combined until a set number of tokens is used. Similar to project, they avoid mixing partial and whole files within a single chunk and will only split a file when it is larger than the tested chunk size. In that case, a fallback splitter is used to break that file into smaller pieces: the original packed strategy splits oversized files using the syntactic strategy, while the packed fixed variant uses the fixed token splitter.
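A simplified sketch of that packing behavior, reusing count_tokens from the fixed token sketch above (the file header comment is purely illustrative): whole files are greedily combined until the token budget is hit, and only oversized files are handed to the fallback splitter on their own.

```python
from typing import Callable

def pack_files(
    files: list[tuple[str, str]],                 # (path, source) pairs
    chunk_tokens: int,
    fallback: Callable[[str, int], list[str]],    # splitter for files larger than the budget
) -> list[str]:
    chunks, current, current_size = [], [], 0
    for path, source in files:
        size = count_tokens(source)
        if size > chunk_tokens:
            # Never mix partial and whole files: flush what we have, then split this file alone.
            if current:
                chunks.append("\n\n".join(current))
                current, current_size = [], 0
            chunks.extend(fallback(source, chunk_tokens))
            continue
        if current_size + size > chunk_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(f"# File: {path}\n{source}")  # illustrative file header
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# e.g. packed fixed: pack_files(files, 8000, lambda src, n: split_fixed_tokens(src, n))
```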
We originally added the packed strategy with the fixed token fallback after discovering that the fixed token strategy allowed for predictable overlap between chunks and, in some cases, produced more consistent results. In general, the packed strategies avoid unused context and proved to be a strong overall performer, as we’ll see later in this post.
Originally, we expected this would perform better than project, but we were unsure how it would perform compared to the other strategies. As we’ll see later, as this strategy’s context grows beyond the smaller file sizes in a dataset, the model appears to make fewer, more accurate guesses. See Results and Analysis for performance outcomes. It offers a significant benefit to precision with the tradeoff of slightly lower recall compared to File, and typically outperforms Project.
Methodology: Testing with Fraim
To evaluate each chunking method, we integrated them into Fraim and ran them across four datasets using gemini-2.5-flash. Each dataset is unique; understanding the differences between them can help explain the results that we’ll be covering later.
Generated
The generated datasets contain small, simple, artificial projects with one vulnerability each. They were generated with a custom internal tool by specifying a particular vulnerability in a given language. However, because these ended up being so simple, they primarily act as sanity checks rather than decision drivers. The results from the generated datasets are not included in the overall average.
Validation Benchmarks
The validation benchmarks were created by XBOW for their dynamic AI scanner, and later adapted for SAST AI scanners by ZeroPathAI. The process of adapting these for SAST use is covered here.
Juice Shop
The original Juice Shop dataset was intended for SAST scanners and represents a large, real-world vulnerable application (594+ source files overall). The baseline includes 75 results across 32 files. Similar to the Validation Benchmarks, we adapted it for use with AI SAST:
- Vulnerability details were translated to an expected SARIF results file
- References to the vulnerabilities in the code were removed to avoid providing hints to the LLM.
It’s difficult to predict which aspect of a vulnerability Fraim might highlight. To limit the impact of this inherent unpredictability, we included several plausible locations for each vulnerability in the expected results file. Because a significant portion of this process involved manual and semi-automated review, there isn’t a set process to recreate this dataset as there is for the Validation Benchmarks; however, the tooling used for this may be made public at a later date.
Verademo
The verademo repository is published by Veracode and is intended for non-AI SAST scanners. Comments referencing vulnerable code were removed from the repository, in the same way as for the validation benchmarks, to ensure hints found in comments didn’t affect the results. It features heavy same‑file, same‑type vulnerability clustering, especially XSS in JSP templates.
Results and Analysis
We measured:
- True Positives (TP) – The number of known, real vulnerabilities that the scanner correctly identified.
- False Positives (FP) – The number of times the scanner reported a vulnerability that doesn’t actually exist.
- False Negatives (FN) – The number of known, real vulnerabilities that the scanner missed.
- Precision – Of all the vulnerabilities the scanner reported, what percentage were actually real?
- Recall – Of all the real vulnerabilities present, what percentage did we find?
- F1-Score – The harmonic mean of Precision and Recall. This provides a single, balanced score of the scanner’s accuracy.
- Processing Time – The average chunk processing time.
- Skipped Lines / 100k – The number of lines skipped per 100k lines of code due to unrecoverable LLM API failures.
- In our conclusion, we did not consider configurations that resulted in unrecoverable API failures; however, because we expect this to be improved in future versions of Fraim, it’s useful to consider what effect these cases may have on the results.
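For reference, the summary metrics above are derived from the raw counts in the standard way; the run below is a hypothetical example against the 75-result Juice Shop baseline, not an actual measurement.

```python
def summarize(tp: int, fp: int, fn: int) -> dict[str, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical run: 30 TP and 20 FP against 75 known results leaves 45 FN.
print(summarize(tp=30, fp=20, fn=45))  # precision 0.60, recall 0.40, F1 ≈ 0.48
```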
Here are links to interactive graphs of the results discussed in this post. The first graph averages the results across all the datasets.
The next three graphs show the individual results for each of the datasets.
The Generated dataset isn’t included in the overall average and wasn’t considered in the result analysis, but is interesting in isolation because it shows the effect of critical related context falling in and out of independent chunks.
Overall Best
| Goal | Recommendation | Why | Trade‑off |
|---|---|---|---|
| Best overall balance | packed syntactic 5.5k–12k | Consistently high F1 score, with increased precision particularly at higher chunk sizes. A range is given despite F1 peaking at 5.5k because the choice depends on whether precision and recall are equally important. Above 12k, the skipped-line issue becomes more prevalent. | At lower chunk sizes, total scanning time is longer because the triage module needs to validate more vulnerabilities. Similarly, recall is impacted because some legitimate results are filtered out. |
| Highest precision | packed syntactic 20k | Fewest false positives on most datasets; no skipped lines | Reduced recall. |
| Highest recall | packed fixed 1–4k | More hypotheses made. | Increased triage processing time. Increased risk of incorrect results making it through triage. |
| Fastest overall (excluding triage) | packed syntactic 6.5k | Lowest avg time (~93.06s) | Dataset dependent |
| Least guesses | packed syntactic 11k | 17 total reported | Reduced recall. Larger datasets can currently result in some skipped lines. |
Testing Notes
A few important details are worth remembering when interpreting the results:
- All benchmarks were run using gemini-2.5-flash.
- Chunk overlap, which we did not measure the effects of here, was set to 10% of the chunk size for strategies that supported it. Additionally, in practice, chunk overlap is affected by the implementation of each chunking strategy and is not always consistent.
- Fraim’s triage step of the code workflow is disabled to isolate the effects of the chunking strategy and size. Because of this, you should expect better end‑to‑end results in practice than the raw figures shown here. The triage step will improve precision, i.e., reduce false positives.
- Testing was limited to scanning the files that were known to be related to a vulnerability according to the baseline SARIF files; whole‑project scans exclude ancillary documentation and files unrelated to the known vulnerabilities based on our baseline results.
- This means the concentration of vulnerabilities is likely higher than in most real-world scenarios, and in turn, this ends up affecting the F1 score due to it being a weighted balance between precision and recall. In the future, we may consider replacing the F1 score with a weighted version that takes this into account.
- Test cases resulting in more than 5% skipped lines are excluded from the later analysis sections.
- However, these cases may be worth re-checking when this issue is improved in a future version of Fraim.
- The results covered below reflect Fraim version v0.5.1 on the feature-additional-chunkers branch.
About The Results
With the winners and graphs in view, here’s why these patterns emerge and how to apply them.
Certainty vs Coverage
As context grows, the model becomes more certain about its hypotheses and commits to fewer, stronger guesses. At the initial scanning stage, some of those hypotheses can stem from incorrect assumptions, yet still land close enough to true issues to count as hits. This happens because the triager can correct for incorrect details when it finds a similar issue in related code. This can be seen in practice by comparing results with and without triage enabled. Recall tends to perform better with it enabled, even though the triager is intended to filter out incorrect results.
Conversely, smaller chunks (e.g., 1k–4k) tend to produce more total guesses. Recall rises as the model becomes less informed, while precision drops as false positives increase. This pattern was even more visible in earlier benchmarks with looser comparison logic, but can still be seen to a degree in the 1k–4k range, where recall tends to track the total number of guesses. The ideal mid‑sized chunks (~5.5–12k) balance these effects—preserving local relationships while avoiding over- or under-guessing.
Boundary reflow and critical context
Changing chunk sizes can reflow a file across calls and accidentally split key cues. Example: a 90‑line file at 30‑line chunks yields [1–30], [31–60], [61–90]; bumping to 45‑line chunks yields [1–45], [46–90]. If the critical evidence sits around lines 31–60, the larger setting now splits it across two independent API calls, lowering recall. This boundary effect explains some of the “non‑monotonic” inconsistencies you see when F1 doesn’t steadily improve with size.
This can be seen very clearly with the generated datasets, which, beyond this behavior, do not appear to be impacted by changes in chunking strategies and chunk sizes.
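The 90-line example above is easy to reproduce; a quick sketch makes the reflow visible:

```python
def line_ranges(total_lines: int, chunk_lines: int) -> list[tuple[int, int]]:
    """Return the inclusive [start, end] line ranges produced by a given chunk size."""
    return [
        (start, min(start + chunk_lines - 1, total_lines))
        for start in range(1, total_lines + 1, chunk_lines)
    ]

print(line_ranges(90, 30))  # [(1, 30), (31, 60), (61, 90)] -> lines 31-60 stay together
print(line_ranges(90, 45))  # [(1, 45), (46, 90)] -> evidence spanning lines 31-60 now crosses a boundary
```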
Why does packing work well here?
Packed keeps files whole and fills the window efficiently, only splitting when necessary. This maintains coherent, human‑like review units (files and their immediate neighbors). As more relevant code is packed together, the model tends to make fewer—but more accurate—hypotheses, boosting precision while holding F1 somewhat steady.
A reasonable middle-ground exists when considering the lowest run-time
API calls with large context sizes ended up being slow enough that a smaller chunk size, despite requiring more API calls, consistently produced a faster benchmarked run-time.
Why project can be fast (and competitive)
On small/medium repos, the whole‑repo context creates natural proximity between related code without chunking overhead, often running faster and producing fewer guesses with accuracy comparable to the larger packed sizes. At the moment, larger datasets can result in a significant number of skipped lines, which can be seen with the juice-shop dataset.
Although a mid-sized packed configuration produced the fastest run-time here, we need to consider this in the context of running with triage enabled. Given that, it seems likely the project strategy will result in the fastest overall run-time; however, more testing is needed to verify this.
Syntactic did not provide any significant benefits over more naive chunking
Originally, we hypothesized that language-aware syntactic splitting would significantly outperform other options because it preserves code structures, avoiding situations where code is split in the middle of a function or class. However, we see very little difference between syntactic and fixed token results.
Two factors likely contributed to this behavior:
- Overlap reliability: the language‑aware splitter we used (LangChain’s RecursiveCharacterTextSplitter) doesn’t reliably enforce overlap, so cues can be split across boundaries.
- Non‑structured cues drive early hypotheses: at this stage, we aren’t confirming vulnerabilities, we’re generating plausible candidates to follow up in triage. LLMs lean on signals that aren’t tightly bound to AST units—function and variable names, inline comments, docstrings/documentation, and general “code‑smell” (e.g., undue complexity, awkward patterns, inconsistent naming, missing comments). This mirrors how human reviewers start with threat models and intent rather than deep control‑flow. Because these cues span or ignore strict syntax boundaries, preserving whole files and predictable overlap often matters more than perfect syntactic splits. In our runs, that helps explain why syntactic tracked fixed‑size and sometimes underperformed.
Conclusion
In our tests, adding more code often increased precision. Overall, F1‑score peaked at mid‑sized chunks (~5.5–12k) with the packed syntactic strategy, while recall sometimes dropped at the largest sizes due to the Certainty vs Coverage effect covered earlier. These mid‑size windows typically strike the best balance between accuracy, speed, and cost. Tiny chunks miss necessary cues but improve coverage at the cost of more false positives, which is expected to translate to increased run time in real-world scenarios with the triage stage enabled. The ideal option can depend significantly on your goals and dataset; for example, mixing unrelated subprojects in a single chunk degrades recall more than repository size alone.
Surprisingly, syntactic splitting did not outperform fixed‑size with predictable overlap; in several cases, keeping files intact with reliable overlap produced more consistent performance than perfect syntactic boundaries. We believe this is likely due to the non-structured signals contributing significantly to our measured results; however, understanding the exact cause may require more research.
On Precision vs Recall
The F1‑score peaked at the mid-sized packed chunk sizes. However, this is only the preferred choice if you weigh precision and recall equally. A notable pattern that emerged is that recall was often inverse to precision: smaller chunks produced more hypotheses, many incorrect, but still resulted in higher recall—even when some guesses were approximate. Those approximate guesses can still be worth counting as valid, since the triager stage can correct for almost-correct (but for the wrong reason) guesses.
However, the lower precision tended to be directly due to an increase in the total number of reported findings, which correlates directly with a longer triage run time, separate from the scanner run time we included in these benchmarks. It may also make sense to consider precision a more valuable metric, because alert fatigue and the effort involved in manually sorting out false positives can come at a high cost. In general, an increase in false positives from the initial stage correlates with an increased risk of a false positive slipping past the triage stage. Depending on the effectiveness of the triage module’s ability to filter out false positives, a precision‑first stance may make sense, which would affect our earlier conclusion about the ideal chunk sizes.
Separately, at the largest chunk sizes, recall occasionally slipped due to relevant signals being diluted—especially when unrelated subprojects were chunked together. In short, repository composition can contribute significantly to the results: code relevance within a chunk is critical, so the ideal size may be heavily affected by the nature of the codebase.
The Future
In the near term, addressing the skipped‑line issue appears to be a good investment; we expect it to make viable the larger chunk sizes that showed significantly improved precision. Separately, implementing a way to select among a set of defaults, each optimized for a specific goal, would allow users to take advantage of the results discovered here.
Consider what an effective chunker may look like based on the findings above. One promising direction is prioritizing gathering the unstructured cues and data that help the model form early, useful hypotheses — such as names, comments, docstrings, and what someone might consider “code‑smell”. These are similar to what a human reviewer might pick out when initially scanning a code base. Summarizing documentation, code purpose, and a threat model into the context window helps the LLM zero in on only the results we care about.
Another avenue is to use the code structure to inform the chunk packing. Parse the AST to build a simple map of the code—imports, calls, and shared names—and pack these related pieces into a chunk. Tracking which data is prioritized here will also reveal how different types of data impact performance.
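As a rough sketch of what that could look like for Python files (using the standard library ast module; real support would need per-language parsers), the packer could score file relatedness by shared imports and names:

```python
import ast

def code_signals(source: str) -> set[str]:
    """Collect imports, definitions, and called names as a cheap relatedness signal."""
    signals: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            signals.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            signals.add(node.module)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            signals.add(node.name)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            signals.add(node.func.id)
    return signals

def relatedness(source_a: str, source_b: str) -> int:
    # Higher scores suggest the two files should be packed into the same chunk.
    return len(code_signals(source_a) & code_signals(source_b))
```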