fix(web): prevent Output page crash from undefined forEach payload fields #628
mahek2016 wants to merge 32 commits into AOSSIE-Org:main from
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
The PR swaps string-similarity imports to

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant RateLimiter as Rate Limiter
    participant Server as Flask Server
    participant Executor as Timeout Executor
    participant MediaWiki
    participant MCQGen as MCQ Generator
    participant BoolQGen as Bool Generator
    participant ShortQGen as Short Generator
    Client->>RateLimiter: POST /get_problems
    RateLimiter->>Server: allow / reject
    Server->>Server: validate JSON & inputs
    Server->>MediaWiki: fetch/process content (with timeout via Executor)
    MediaWiki-->>Server: content / timeout
    Server->>Executor: start parallel generators (MCQ/BoolQ/ShortQ)
    par Parallel
        Executor->>MCQGen: generate (timeout)
        Executor->>BoolQGen: generate (timeout)
        Executor->>ShortQGen: generate (timeout)
    end
    MCQGen-->>Executor: result / error
    BoolQGen-->>Executor: result / error
    ShortQGen-->>Executor: result / error
    Executor-->>Server: merged results + fallbacks
    Server-->>Client: JSON response (results or aggregated error)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/server.py (1)
Lines 609-617: ⚠️ Potential issue | 🟠 Major

Wrap each boolean question in a list before calling the predictor.

`AnswerPredictor.predict_boolean_answer()` iterates `payload["input_question"]`. Passing a bare string here makes it score one character at a time, and `if qa_response:` then turns every non-empty question into `"True"`.

Proposed fix

```diff
 for question in input_questions:
     qa_response = answer.predict_boolean_answer(
-        {"input_text": input_text, "input_question": question}
+        {"input_text": input_text, "input_question": [question]}
     )
-    if qa_response:
+    if qa_response and qa_response[0]:
         output.append("True")
     else:
         output.append("False")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 609 - 617, The loop is passing a bare string to answer.predict_boolean_answer which iterates payload["input_question"] character-by-character; change the call to pass the question as a single-item list (e.g., {"input_text": input_text, "input_question": [question]}) and then inspect the returned list element (qa_response[0]) rather than the truthiness of qa_response before appending to output (use output.append("True") or "False" based on qa_response[0], and guard for an empty qa_response).
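To see why the bare string misbehaves, here is a minimal sketch; the predictor below is a hypothetical stand-in for `AnswerPredictor.predict_boolean_answer`, not the real implementation:

```python
# Hypothetical stand-in: scores each entry of payload["input_question"].
def predict_boolean_answer(payload):
    return [question.endswith("?") for question in payload["input_question"]]

# Bare string: iteration walks characters, yielding one score per character.
scores = predict_boolean_answer(
    {"input_text": "text", "input_question": "Is AI useful?"}
)
print(len(scores))  # 13 scores, one per character

# Single-item list: one score for the whole question.
scores = predict_boolean_answer(
    {"input_text": "text", "input_question": ["Is AI useful?"]}
)
print(scores)  # [True]
```

The same character-by-character iteration is what makes the current route emit `"True"` for every non-empty question.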
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 94-102: The file sets MCQGen, BoolQGen, ShortQGen, answer, qg,
qa_model, and llm_generator to None which prevents real backends from being used
and causes 500s when routes dereference them; remove these unconditional None
assignments (or replace them with proper initializers) and ensure these symbols
are instantiated with their real classes (e.g., MCQGen = MCQGenerator(...),
BoolQGen = BoolQGenerator(...), ShortQGen = ShortQGenerator(...), qg =
QuestionGenerator(...), qa_model = QAModel(...), llm_generator =
LLMGenerator(...), answer = AnswerService(...)) during app startup or lazily
inside the route handlers for /get_mcq, /get_boolq, /get_shortq and the
corresponding _llm/_answer/_hard routes so that routes never see None before use
and any initialization errors are logged and handled gracefully.
- Around line 981-988: The call to qg.generate via execute_with_timeout uses
answer_style="true_false", but QuestionGenerator.generate only accepts "all",
"sentences", or "multiple_choice", so update the code that constructs qg (the
generator instance used here) to use a boolean-capable generator implementation
(one that supports "true_false") or, if you prefer not to add a new generator,
change the answer_style argument to one of the supported values ("all",
"sentences", or "multiple_choice"); locate the qg instantiation (the code that
creates the generator passed into qg.generate) and either replace it with a
TrueFalse-capable generator class or add handling to map/convert "true_false"
requests into a supported style before calling execute_with_timeout.
- Around line 234-248: The mock `/get_boolq` response is returning list items as
{"question","answer"} objects but the frontend (Output.jsx when questionType ===
"get_boolq") expects each item in the "output" array to be the question string;
change the `questions` variable to be a list of strings (e.g. ["AI is useful?",
...]) and keep the existing slicing (`questions = questions[:max_questions]`)
and the JSON response keys ("output", "status", "mock_mode", "warning") intact
so the UI receives an array of question strings rather than objects.
- Around line 490-519: The response currently flattens each generator to a bare
array; instead preserve the nested objects so the frontend can access
output_mcq.questions, output_boolq.Boolean_Questions and
output_shortq.questions. Update the assignments that use mcq_out, bool_out,
short_out so that when no error and isinstance(..., dict) you set
response["output_mcq"] = mcq_out or {} (and likewise response["output_boolq"] =
bool_out or {}, response["output_shortq"] = short_out or {}), leaving empty
dicts as the fallback; keep the existing all-failed error return but ensure its
output_* values are empty objects ({}), not arrays.
- Around line 43-58: The current execute_with_timeout uses ThreadPoolExecutor
and simply calls future.result(timeout=...), so on FuturesTimeoutError the
worker thread keeps running; replace this with a cancellable approach: either
switch to ProcessPoolExecutor (in execute_with_timeout) so you can
terminate/kill worker processes after timeout (call
process_executor.shutdown(...) and terminate child processes) or implement
cooperative cancellation by passing a threading.Event into the submitted
function (e.g., cancel_event) and have the target function periodically check
cancel_event.is_set(); update the submit call and function signatures
accordingly, handle FuturesTimeoutError by setting cancel_event.set() and then
shutting down the executor, and ensure executor.shutdown(wait=False) is still
called; reference execute_with_timeout, ThreadPoolExecutor, ProcessPoolExecutor,
future.result, FuturesTimeoutError, executor.shutdown, and the cooperative
cancel Event so reviewers can locate and modify the code.
- Around line 438-476: The generator wrapper functions run_mcq, run_boolq, and
run_shortq are calling MCQGen.generate_mcq / BoolQGen.generate_boolq /
ShortQGen.generate_shortq with keyword args, but those methods expect a single
payload dict; change each lambda passed to execute_with_timeout to call the
generator with a single dict payload (e.g., {'input_text': input_text,
'max_questions': max_questions_mcq} for run_mcq, analogous keys/values for
run_boolq and run_shortq) so the generators receive one argument as defined;
keep the same timeout and existing None/"not_available" checks and return
behavior.
In `@eduaid_web/src/pages/Output.jsx`:
- Around line 22-23: The current outside-click guard in the click handler (the
condition using dropdown and !event.target.closest('button')) is too broad and
prevents the dropdown from closing when any other button is clicked; narrow the
check to only ignore the specific PDF toggle instead. Modify the condition in
the Output.jsx click handler to look for a specific selector or ref for the PDF
toggle (e.g., event.target.closest('.pdf-toggle') or compare against a
pdfToggleRef/current element) instead of generic 'button', so only clicks on
that exact toggle prevent closing while all other outside clicks close the
dropdown.
- Around line 141-162: The current conditional skips iterating `output` when
`questionType === "get_mcq"`, dropping MCQs; update the logic in Output.jsx
around the `questionType` branches so that when `questionType === "get_mcq"` you
still iterate `output` (push into `combinedQaPairs`) and also include items from
`output_mcq.questions` (merge both sources), e.g., add an explicit `if
(questionType === "get_mcq")` branch that processes `output.forEach(...)` and
`output_mcq.questions.forEach(...)`, keeping the existing field mapping used for
other branches and retaining `combinedQaPairs` as the single merged array.
In `@requirements.txt`:
- Around line 31-38: Update the mammoth dependency in requirements.txt to at
least version 1.11.0 (e.g., change "mammoth" to "mammoth>=1.11.0") and rerun
dependency install so that backend code using mammoth.extract_raw_text (in
Generator/main.py) uses the patched library; after upgrading, run tests or a
quick smoke test of the .docx upload path to confirm external file access is
disabled and behavior remains correct.
---
Outside diff comments:
In `@backend/server.py`:
- Around line 609-617: The loop is passing a bare string to
answer.predict_boolean_answer which iterates payload["input_question"]
character-by-character; change the call to pass the question as a single-item
list (e.g., {"input_text": input_text, "input_question": [question]}) and then
inspect the returned list element (qa_response[0]) rather than the truthiness of
qa_response before appending to output (use output.append("True") or "False"
based on qa_response[0], and guard for an empty qa_response).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8725e66b-6869-4a93-8bbe-775ce182c4da
⛔ Files ignored due to path filters (1)
backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (5)
backend/Generator/main.py
backend/Generator/mcq.py
backend/server.py
eduaid_web/src/pages/Output.jsx
requirements.txt
```python
def execute_with_timeout(func, timeout, *args, **kwargs):
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)

    try:
        result = future.result(timeout=timeout)
        executor.shutdown(wait=False)  # ✅ FIX
        return result, None

    except FuturesTimeoutError:
        executor.shutdown(wait=False)  # ✅ FIX
        return None, "timeout"

    except Exception:
        executor.shutdown(wait=False)
        logging.exception("Error occurred during execution")
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "server.py" -path "*/backend/*" | head -5
```

Repository: AOSSIE-Org/EduAid
Length of output: 80

🏁 Script executed:

```shell
cat -n backend/server.py | sed -n '43,62p'
```

Repository: AOSSIE-Org/EduAid
Length of output: 855
🌐 Web query:
ThreadPoolExecutor future.result timeout does the function keep running Python concurrent.futures
💡 Result:
Yes, when calling future.result(timeout=...) on a ThreadPoolExecutor future in Python's concurrent.futures and the timeout expires, a TimeoutError is raised in the calling thread, but the submitted function continues running in its worker thread until completion. The timeout only limits how long the caller waits for the result; it does not cancel or terminate the underlying task. This behavior occurs because Python threads cannot be forcibly terminated from another thread (due to risks like leaving locks held). The future.cancel method only works for pending (not yet started) tasks; for running tasks, it returns False. Official documentation confirms: Future.result(timeout) waits up to the specified seconds and raises TimeoutError if not completed, but does not mention stopping the task. To stop long-running tasks, make them cooperative by checking a shared threading.Event flag periodically, or use ProcessPoolExecutor (though it also doesn't auto-terminate on timeout without extra work).
Citations:
- 1: https://www.pythontutorials.net/blog/how-to-use-concurrent-futures-with-timeouts/
- 2: https://stackoverflow.com/questions/78460097/why-cant-threadpoolexecutor-timeout-a-long-running-expression
- 3: https://docs.python.org/3/library/concurrent.futures.html
- 4: https://stackoverflow.com/questions/6509261/how-to-use-concurrent-futures-with-timeouts
This timeout only stops waiting; it does not stop the work.
After future.result(timeout=...) raises TimeoutError, the submitted function continues running in its worker thread, and executor.shutdown(wait=False) returns immediately without terminating the thread. On model endpoints this means timed-out jobs keep burning CPU/GPU after a 504 response; repeated timeouts can accumulate and snowball into resource exhaustion. To enforce a hard execution cap, the work must either check a cooperative cancellation flag (threading.Event) periodically, or use a killable mechanism like ProcessPoolExecutor (which also requires additional cancellation logic beyond shutdown).
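The cooperative-cancellation option the review suggests could look roughly like the sketch below. This is an illustration under assumptions, not the project's code: the worker functions are stand-ins, and the real generators would need their own periodic `cancel_event.is_set()` checks.

```python
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError


def execute_with_timeout(func, timeout, *args, **kwargs):
    # Pass a cancellation flag as the worker's first argument.
    cancel_event = threading.Event()
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, cancel_event, *args, **kwargs)
    try:
        return future.result(timeout=timeout), None
    except FuturesTimeoutError:
        cancel_event.set()  # ask the worker to stop at its next checkpoint
        return None, "timeout"
    except Exception:
        logging.exception("Error occurred during execution")
        return None, "error"
    finally:
        executor.shutdown(wait=False)


def generate_questions(cancel_event, chunks):
    # A cooperative worker checks the flag between units of work.
    results = []
    for chunk in chunks:
        if cancel_event.is_set():
            break  # bail out early instead of burning CPU after a 504
        results.append(chunk.upper())
    return results


def stuck_worker(cancel_event):
    # Simulates a long-running call that polls the flag periodically.
    while not cancel_event.is_set():
        time.sleep(0.05)
    return "cancelled"
```

With this shape, `execute_with_timeout(generate_questions, 5, ["a", "b"])` completes normally, while `stuck_worker` hits the timeout and then exits promptly once the flag is set, rather than running to completion in a leaked thread.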
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 43 - 58, The current execute_with_timeout
uses ThreadPoolExecutor and simply calls future.result(timeout=...), so on
FuturesTimeoutError the worker thread keeps running; replace this with a
cancellable approach: either switch to ProcessPoolExecutor (in
execute_with_timeout) so you can terminate/kill worker processes after timeout
(call process_executor.shutdown(...) and terminate child processes) or implement
cooperative cancellation by passing a threading.Event into the submitted
function (e.g., cancel_event) and have the target function periodically check
cancel_event.is_set(); update the submit call and function signatures
accordingly, handle FuturesTimeoutError by setting cancel_event.set() and then
shutting down the executor, and ensure executor.shutdown(wait=False) is still
called; reference execute_with_timeout, ThreadPoolExecutor, ProcessPoolExecutor,
future.result, FuturesTimeoutError, executor.shutdown, and the cooperative
cancel Event so reviewers can locate and modify the code.
```python
MCQGen = None
answer = None
BoolQGen = None
ShortQGen = None
qg = None
file_processor = main.FileProcessor()
mediawikiapi = MediaWikiAPI()
qa_model = pipeline("question-answering")
llm_generator = LLMQuestionGenerator()

qa_model = None
llm_generator = None
```
These None assignments disable the real backend paths.
Nothing in this module reinitializes MCQGen, BoolQGen, ShortQGen, answer, qg, qa_model, or llm_generator after setting them to None. That leaves `/get_mcq`, `/get_boolq`, and `/get_shortq` permanently in mock mode, while `/get_*_llm`, `/get_*_answer`, `/get_boolean_answer`, and the `*_hard` routes dereference None and return 500.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 94 - 102, The file sets MCQGen, BoolQGen,
ShortQGen, answer, qg, qa_model, and llm_generator to None which prevents real
backends from being used and causes 500s when routes dereference them; remove
these unconditional None assignments (or replace them with proper initializers)
and ensure these symbols are instantiated with their real classes (e.g., MCQGen
= MCQGenerator(...), BoolQGen = BoolQGenerator(...), ShortQGen =
ShortQGenerator(...), qg = QuestionGenerator(...), qa_model = QAModel(...),
llm_generator = LLMGenerator(...), answer = AnswerService(...)) during app
startup or lazily inside the route handlers for /get_mcq, /get_boolq,
/get_shortq and the corresponding _llm/_answer/_hard routes so that routes never
see None before use and any initialization errors are logged and handled
gracefully.
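The lazy-initialization option mentioned above could be sketched as follows. The helper and the stand-in class are hypothetical illustrations; the real factories would be the generator classes named in the review (e.g. `MCQGenerator`):

```python
import logging

_instances = {}


def get_or_init(name, factory):
    """Create the backend object on first use and cache it."""
    if name not in _instances:
        try:
            _instances[name] = factory()
        except Exception:
            # Initialization errors are logged, not raised as bare 500s;
            # callers still check for None before use.
            logging.exception("Failed to initialize %s", name)
            _instances[name] = None
    return _instances[name]


# Stand-in class for illustration only.
class FakeMCQGen:
    def generate_mcq(self, payload):
        return {"questions": ["sample"]}


mcq = get_or_init("mcq", FakeMCQGen)
assert mcq is get_or_init("mcq", FakeMCQGen)  # cached, built once
```

Route handlers would then call `get_or_init("mcq", MCQGenerator)` instead of reading a module-level global that may have been set to None.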
```python
# ✅ SAFE GENERATOR CALLS (FIXED WITH LAMBDA)
def run_mcq():
    if MCQGen is None:
        logging.warning("MCQGen not initialized")
        return None, "not_available"

    return execute_with_timeout(
        lambda: MCQGen.generate_mcq(
            input_text=input_text,
            max_questions=max_questions_mcq
        ),
        60
    )

def run_boolq():
    if BoolQGen is None:
        logging.warning("BoolQGen not initialized")
        return None, "not_available"

    return execute_with_timeout(
        lambda: BoolQGen.generate_boolq(
            input_text=input_text,
            max_questions=max_questions_boolq
        ),
        60
    )

def run_shortq():
    if ShortQGen is None:
        logging.warning("ShortQGen not initialized")
        return None, "not_available"

    return execute_with_timeout(
        lambda: ShortQGen.generate_shortq(
            input_text=input_text,
            max_questions=max_questions_shortq
        ),
        60
    )
```
Pass payload dicts into the generator methods.
`backend/Generator/main.py` defines `generate_mcq()`, `generate_boolq()`, and `generate_shortq()` as single-argument payload methods. Calling them with `input_text=` / `max_questions=` kwargs raises `TypeError`, so `/get_problems` can never reach the real generation path once those generators are re-enabled.
Proposed fix

```diff
 def run_mcq():
     if MCQGen is None:
         logging.warning("MCQGen not initialized")
         return None, "not_available"
     return execute_with_timeout(
-        lambda: MCQGen.generate_mcq(
-            input_text=input_text,
-            max_questions=max_questions_mcq
-        ),
-        60
+        MCQGen.generate_mcq,
+        60,
+        {
+            "input_text": input_text,
+            "max_questions": max_questions_mcq,
+        },
     )

 def run_boolq():
     if BoolQGen is None:
         logging.warning("BoolQGen not initialized")
         return None, "not_available"
     return execute_with_timeout(
-        lambda: BoolQGen.generate_boolq(
-            input_text=input_text,
-            max_questions=max_questions_boolq
-        ),
-        60
+        BoolQGen.generate_boolq,
+        60,
+        {
+            "input_text": input_text,
+            "max_questions": max_questions_boolq,
+        },
     )

 def run_shortq():
     if ShortQGen is None:
         logging.warning("ShortQGen not initialized")
         return None, "not_available"
     return execute_with_timeout(
-        lambda: ShortQGen.generate_shortq(
-            input_text=input_text,
-            max_questions=max_questions_shortq
-        ),
-        60
+        ShortQGen.generate_shortq,
+        60,
+        {
+            "input_text": input_text,
+            "max_questions": max_questions_shortq,
+        },
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 438 - 476, The generator wrapper functions
run_mcq, run_boolq, and run_shortq are calling MCQGen.generate_mcq /
BoolQGen.generate_boolq / ShortQGen.generate_shortq with keyword args, but those
methods expect a single payload dict; change each lambda passed to
execute_with_timeout to call the generator with a single dict payload (e.g.,
{'input_text': input_text, 'max_questions': max_questions_mcq} for run_mcq,
analogous keys/values for run_boolq and run_shortq) so the generators receive
one argument as defined; keep the same timeout and existing None/"not_available"
checks and return behavior.
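A minimal sketch of the payload-dict calling convention, using simplified stand-ins rather than the real classes:

```python
from concurrent.futures import ThreadPoolExecutor


def execute_with_timeout(func, timeout, *args, **kwargs):
    # Simplified: extra positional args are forwarded to the worker.
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(func, *args, **kwargs).result(timeout=timeout), None
    finally:
        executor.shutdown(wait=False)


def generate_mcq(payload):
    # Single-argument payload method, matching the shape in Generator/main.py.
    words = payload["input_text"].split()
    return {"questions": words[: payload["max_questions"]]}


# Pass the dict positionally instead of exploding it into kwargs:
out, err = execute_with_timeout(
    generate_mcq, 60, {"input_text": "alpha beta gamma", "max_questions": 2}
)
```

The key point is that `execute_with_timeout` forwards `*args` untouched, so the generator receives exactly one argument: the payload dict it was written to accept.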
```python
# ✅ FINAL SAFE RESPONSE
response = {
    "output_mcq": [],
    "output_boolq": [],
    "output_shortq": []
}

if not mcq_err and isinstance(mcq_out, dict):
    response["output_mcq"] = mcq_out.get("questions", []) or []

if not bool_err and isinstance(bool_out, dict):
    response["output_boolq"] = bool_out.get("Boolean_Questions", []) or []

if not short_err and isinstance(short_out, dict):
    response["output_shortq"] = short_out.get("questions", []) or []

# ✅ If ALL failed
if mcq_err and bool_err and short_err:
    all_errors = [mcq_err, bool_err, short_err]

    status = 504 if all(e == "timeout" for e in all_errors) else 500
    code = "timeout" if status == 504 else "all_failed"

    return jsonify({
        "error": "All generators failed",
        "code": code,
        "output_mcq": [],
        "output_boolq": [],
        "output_shortq": []
    }), status
```
Preserve the nested output_* objects in the mixed response.
The Output page reads `output_mcq.questions`, `output_boolq.Boolean_Questions` plus `Text`, and `output_shortq.questions`. Flattening each generator result to a bare array makes those lookups undefined, so mixed MCQ/Boolean/Short results disappear even though the page no longer crashes.
Proposed response-shape fix

```diff
-    response = {
-        "output_mcq": [],
-        "output_boolq": [],
-        "output_shortq": []
-    }
+    response = {
+        "output_mcq": {"questions": []},
+        "output_boolq": {"Boolean_Questions": [], "Text": input_text},
+        "output_shortq": {"questions": []},
+    }

     if not mcq_err and isinstance(mcq_out, dict):
-        response["output_mcq"] = mcq_out.get("questions", []) or []
+        response["output_mcq"]["questions"] = mcq_out.get("questions", []) or []

     if not bool_err and isinstance(bool_out, dict):
-        response["output_boolq"] = bool_out.get("Boolean_Questions", []) or []
+        response["output_boolq"]["Boolean_Questions"] = bool_out.get("Boolean_Questions", []) or []
+        response["output_boolq"]["Text"] = bool_out.get("Text", input_text)

     if not short_err and isinstance(short_out, dict):
-        response["output_shortq"] = short_out.get("questions", []) or []
+        response["output_shortq"]["questions"] = short_out.get("questions", []) or []
@@
         return jsonify({
             "error": "All generators failed",
             "code": code,
-            "output_mcq": [],
-            "output_boolq": [],
-            "output_shortq": []
+            "output_mcq": {"questions": []},
+            "output_boolq": {"Boolean_Questions": [], "Text": input_text},
+            "output_shortq": {"questions": []}
         }), status
```

Based on learnings, mixed question types are intentionally supported in eduaid_web/src/pages/Output.jsx, and output_mcq plus output are distinct data sources.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 490 - 519, The response currently flattens
each generator to a bare array; instead preserve the nested objects so the
frontend can access output_mcq.questions, output_boolq.Boolean_Questions and
output_shortq.questions. Update the assignments that use mcq_out, bool_out,
short_out so that when no error and isinstance(..., dict) you set
response["output_mcq"] = mcq_out or {} (and likewise response["output_boolq"] =
bool_out or {}, response["output_shortq"] = short_out or {}), leaving empty
dicts as the fallback; keep the existing all-failed error return but ensure its
output_* values are empty objects ({}), not arrays.
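The nested shape the review asks for can be factored into a small builder. This is a sketch assuming the keys named in the review (`questions`, `Boolean_Questions`, `Text`); the function name and signature are illustrative:

```python
def build_response(input_text, mcq_out, mcq_err, bool_out, bool_err,
                   short_out, short_err):
    # Start from empty-but-correctly-shaped objects so frontend lookups
    # like output_mcq.questions are never undefined.
    response = {
        "output_mcq": {"questions": []},
        "output_boolq": {"Boolean_Questions": [], "Text": input_text},
        "output_shortq": {"questions": []},
    }
    if not mcq_err and isinstance(mcq_out, dict):
        response["output_mcq"]["questions"] = mcq_out.get("questions", []) or []
    if not bool_err and isinstance(bool_out, dict):
        response["output_boolq"]["Boolean_Questions"] = (
            bool_out.get("Boolean_Questions", []) or []
        )
        response["output_boolq"]["Text"] = bool_out.get("Text", input_text)
    if not short_err and isinstance(short_out, dict):
        response["output_shortq"]["questions"] = short_out.get("questions", []) or []
    return response
```

Because every branch leaves a dict with the expected keys in place, a failed generator degrades to an empty list for that question type instead of changing the response's shape.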
```diff
 # ✅ Question generation with timeout
 generated, error = execute_with_timeout(
     qg.generate,
     60,
     article=input_text,
-    num_questions=input_questions,
+    num_questions=len(input_questions),
     answer_style="true_false"
 )
```
QuestionGenerator.generate() does not accept "true_false".
`backend/Generator/main.py` only allows `answer_style` values `"all"`, `"sentences"`, and `"multiple_choice"` here. This route will always fall into the error path even after `qg` is initialized, so it needs to call a boolean-capable generator instead of forcing an unsupported `answer_style`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 981 - 988, The call to qg.generate via
execute_with_timeout uses answer_style="true_false", but
QuestionGenerator.generate only accepts "all", "sentences", or
"multiple_choice", so update the code that constructs qg (the generator instance
used here) to use a boolean-capable generator implementation (one that supports
"true_false") or, if you prefer not to add a new generator, change the
answer_style argument to one of the supported values ("all", "sentences", or
"multiple_choice"); locate the qg instantiation (the code that creates the
generator passed into qg.generate) and either replace it with a
TrueFalse-capable generator class or add handling to map/convert "true_false"
requests into a supported style before calling execute_with_timeout.
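If the second option is taken, the mapping step could be as small as the sketch below; the supported values come from the review, while the helper name and fallback choice are assumptions:

```python
# answer_style values QuestionGenerator.generate accepts, per the review.
SUPPORTED_STYLES = {"all", "sentences", "multiple_choice"}


def normalize_answer_style(style, fallback="sentences"):
    """Map an unsupported style (e.g. "true_false") to a supported one."""
    return style if style in SUPPORTED_STYLES else fallback
```

The route would then call `qg.generate(..., answer_style=normalize_answer_style("true_false"))`, avoiding the guaranteed error path until a genuinely boolean-capable generator is wired in.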
```diff
 mammoth
 mediawikiapi
 PyMuPDF
 textblob
-llama-cpp-python
\ No newline at end of file
+Flask-Limiter==3.5.0
+yt-dlp
+llama-cpp-python
+strsimpy
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n requirements.txt | grep -A2 -B2 mammoth
```

Repository: AOSSIE-Org/EduAid
Length of output: 151

🏁 Script executed:

```shell
head -50 backend/Generator/main.py | tail -30
```

Repository: AOSSIE-Org/EduAid
Length of output: 1075

🏁 Script executed:

```shell
sed -n '360,375p' backend/Generator/main.py
```

Repository: AOSSIE-Org/EduAid
Length of output: 544
🌐 Web query:
mammoth library directory traversal vulnerability Python docx parser
💡 Result:
Yes, the Mammoth Python library (a .docx to HTML/Markdown converter) has a directory traversal vulnerability identified as CVE-2025-11849. It affects versions from 0.3.25 up to but not including 1.11.0. The vulnerability occurs due to lack of path or file type validation when processing .docx files containing images referenced via external links (r:link attribute instead of embedded r:embed). The library resolves these URIs to local file paths, reads the files, base64-encodes them, and embeds them in the output HTML as data URIs. Attackers can craft malicious .docx files with paths like ../../etc/passwd or /dev/random to read arbitrary files or cause resource exhaustion. Fix: Upgrade to Mammoth version 1.11.0 or higher, which disables external file access by default (via externalFileAccess: false). Official sources: PyPI (pypi.org/project/mammoth/), GitHub (github.com/mwilliamson/python-mammoth). Latest version as of 2026-03-13 is 1.12.0. No relation to any "mammoth library" in the context of directory services or unrelated projects.
Citations:
- 1: https://security.snyk.io/vuln/SNYK-JS-MAMMOTH-13554470
- 2: https://advisories.gitlab.com/pkg/npm/mammoth/CVE-2025-11849/
- 3: https://nvd.nist.gov/vuln/detail/CVE-2025-11849
- 4: https://www.sentinelone.com/vulnerability-database/cve-2025-11849/
- 5: https://security.snyk.io/package/pip/mammoth/1.10.0
- 6: https://pypi.org/project/mammoth/
- 7: https://github.com/mwilliamson/python-mammoth
- 8: https://security.snyk.io/vuln/SNYK-PYTHON-MAMMOTH-13561967
Upgrade mammoth to version 1.11.0 or later to patch CVE-2025-11849.
backend/Generator/main.py lines 365-368 use mammoth.extract_raw_text() on user-uploaded .docx files. Mammoth versions before 1.11.0 are vulnerable to directory traversal (CVE-2025-11849), allowing attackers to read arbitrary files or exhaust resources via crafted .docx files with external file references. Upgrade to mammoth>=1.11.0 immediately, which disables external file access by default.
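After bumping the pin, a version gate of this shape could serve as the suggested smoke check. The helper is a hypothetical sketch; the `1.11` threshold comes from the advisory above:

```python
def is_patched(version_string):
    """True if the given mammoth version carries the CVE-2025-11849 fix."""
    parts = version_string.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) >= (1, 11)
```

In practice this would be fed the installed version (e.g. via `importlib.metadata.version("mammoth")`) before exercising the .docx upload path.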
🧰 Tools
🪛 OSV Scanner (2.3.5)
[CRITICAL] 31-31: mammoth 1.9.1: Mammoth is vulnerable to Directory Traversal
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@requirements.txt` around lines 31 - 38, Update the mammoth dependency in
requirements.txt to at least version 1.11.0 (e.g., change "mammoth" to
"mammoth>=1.11.0") and rerun dependency install so that backend code using
mammoth.extract_raw_text (in Generator/main.py) uses the patched library; after
upgrading, run tests or a quick smoke test of the .docx upload path to confirm
external file access is disabled and behavior remains correct.
Addressed Issues:
Fixes #610
Additional Notes:
- Normalized all payload fields (`output`, `output_mcq.questions`, `output_boolq.Boolean_Questions`, `output_shortq.questions`) using a helper function to prevent runtime errors.
- Replaced unguarded `.forEach` calls with guarded iterations.
- Used optional chaining (`?.`) to safely access nested properties.

This improves the robustness of the Output page and prevents crashes due to incomplete or malformed payload data.
AI Usage Disclosure:
I have used the following AI models and tools:
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Refactor