ghsa-4qjh-9fv9-r85r
Vulnerability from github
This issue arises from the prefix caching mechanism, which may expose the system to a timing side-channel attack.
Description
When a new prompt is processed, if the PageAttention mechanism finds a matching prefix chunk, the prefill process speeds up, which is reflected in the TTFT (Time to First Token). Our tests revealed that the timing differences caused by matching chunks are significant enough to be recognized and exploited.
For instance, if the victim has submitted a sensitive prompt or if a valuable system prompt has been cached, an attacker sharing the same backend could attempt to guess the victim's input. By measuring the TTFT based on prefix matches, the attacker could verify if their guess is correct, leading to potential leakage of private information.
Unlike token-by-token sharing mechanisms, vLLM’s chunk-based approach (PageAttention) processes tokens in larger units (chunks). In our tests, with chunk_size=2, the timing differences became noticeable enough to allow attackers to infer whether portions of their input match the victim's prompt at the chunk level.
Environment
- GPU: NVIDIA A100 (40G)
- CUDA: 11.8
- PyTorch: 2.3.1
- OS: Ubuntu 18.04
- vLLM: v0.5.1 Configuration: We launched vLLM using the default settings and adjusted chunk_size=2 to evaluate the TTFT.
Leakage
We conducted our tests using LLaMA2-70B-GPTQ on a single device. We analyzed the timing differences when prompts shared prefixes of 2 chunks, and plotted the corresponding ROC curves. Our results suggest that timing differences can be reliably used to distinguish prefix matches, demonstrating a potential side-channel vulnerability.
Results
In our experiment, we analyzed the response time differences between cache hits and misses in vLLM's PageAttention mechanism. Using ROC curve analysis to assess the distinguishability of these timing differences, we observed the following results: - With a 1-token prefix, the ROC curve yielded an AUC value of 0.571, indicating that even with a short prefix, an attacker can reasonably distinguish between cache hits and misses based on response times. - When the prefix length increases to 8 tokens, the AUC value rises significantly to 0.99, showing that the attacker can almost perfectly identify cache hits with a longer prefix.
Fixes
- https://github.com/vllm-project/vllm/pull/17045
{ "affected": [ { "package": { "ecosystem": "PyPI", "name": "vllm" }, "ranges": [ { "events": [ { "introduced": "0" }, { "fixed": "0.9.0" } ], "type": "ECOSYSTEM" } ] } ], "aliases": [ "CVE-2025-46570" ], "database_specific": { "cwe_ids": [ "CWE-208" ], "github_reviewed": true, "github_reviewed_at": "2025-05-28T18:02:24Z", "nvd_published_at": "2025-05-29T17:15:21Z", "severity": "LOW" }, "details": "This issue arises from the prefix caching mechanism, which may expose the system to a timing side-channel attack.\n\n## Description\nWhen a new prompt is processed, if the PageAttention mechanism finds a matching prefix chunk, the prefill process speeds up, which is reflected in the TTFT (Time to First Token). Our tests revealed that the timing differences caused by matching chunks are significant enough to be recognized and exploited.\n\nFor instance, if the victim has submitted a sensitive prompt or if a valuable system prompt has been cached, an attacker sharing the same backend could attempt to guess the victim\u0027s input. By measuring the TTFT based on prefix matches, the attacker could verify if their guess is correct, leading to potential leakage of private information.\n\nUnlike token-by-token sharing mechanisms, vLLM\u2019s chunk-based approach (PageAttention) processes tokens in larger units (chunks). In our tests, with chunk_size=2, the timing differences became noticeable enough to allow attackers to infer whether portions of their input match the victim\u0027s prompt at the chunk level.\n\n## Environment\n\n- GPU: NVIDIA A100 (40G)\n- CUDA: 11.8\n- PyTorch: 2.3.1\n- OS: Ubuntu 18.04\n- vLLM: v0.5.1\nConfiguration: We launched vLLM using the default settings and adjusted chunk_size=2 to evaluate the TTFT.\n\n## Leakage\nWe conducted our tests using LLaMA2-70B-GPTQ on a single device. We analyzed the timing differences when prompts shared prefixes of 2 chunks, and plotted the corresponding ROC curves. Our results suggest that timing differences can be reliably used to distinguish prefix matches, demonstrating a potential side-channel vulnerability.\n\u003cimg src=\"https://github.com/user-attachments/assets/db3491e9-02b7-424c-9b6d-56f553b39f2f\" alt=\"roc_curves_combined_block_2\" width=\"400\"/\u003e\n\n\n## Results\nIn our experiment, we analyzed the response time differences between cache hits and misses in vLLM\u0027s PageAttention mechanism. Using ROC curve analysis to assess the distinguishability of these timing differences, we observed the following results:\n- With a 1-token prefix, the ROC curve yielded an AUC value of 0.571, indicating that even with a short prefix, an attacker can reasonably distinguish between cache hits and misses based on response times.\n- When the prefix length increases to 8 tokens, the AUC value rises significantly to 0.99, showing that the attacker can almost perfectly identify cache hits with a longer prefix.\n\n## Fixes\n\n* https://github.com/vllm-project/vllm/pull/17045", "id": "GHSA-4qjh-9fv9-r85r", "modified": "2025-05-29T21:27:56Z", "published": "2025-05-28T18:02:24Z", "references": [ { "type": "WEB", "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-4qjh-9fv9-r85r" }, { "type": "ADVISORY", "url": "https://nvd.nist.gov/vuln/detail/CVE-2025-46570" }, { "type": "WEB", "url": "https://github.com/vllm-project/vllm/pull/17045" }, { "type": "WEB", "url": "https://github.com/vllm-project/vllm/commit/77073c77bc2006eb80ea6d5128f076f5e6c6f54f" }, { "type": "PACKAGE", "url": "https://github.com/vllm-project/vllm" } ], "schema_version": "1.4.0", "severity": [ { "score": "CVSS:3.1/AV:N/AC:H/PR:L/UI:R/S:U/C:L/I:N/A:N", "type": "CVSS_V3" } ], "summary": "Potential Timing Side-Channel Vulnerability in vLLM\u2019s Chunk-Based Prefix Caching" }
- Seen: The vulnerability was mentioned, discussed, or seen somewhere by the user.
- Confirmed: The vulnerability is confirmed from an analyst perspective.
- Exploited: This vulnerability was exploited and seen by the user reporting the sighting.
- Patched: This vulnerability was successfully patched by the user reporting the sighting.
- Not exploited: This vulnerability was not exploited or seen by the user reporting the sighting.
- Not confirmed: The user expresses doubt about the veracity of the vulnerability.
- Not patched: This vulnerability was not successfully patched by the user reporting the sighting.