refactor: split large main.py into modular components and update imports #604
Aditya30ag wants to merge 6 commits into AOSSIE-Org:main from
Conversation
📝 Walkthrough
Split and reorganized the QA codebase: removed many classes from main.py into the new backend/Generator modules.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Server
participant FileProc as FileProcessor / GoogleDocsService
participant Generators as QuestionGenerators
participant AnswerPred as AnswerPredictor
participant Models as Model Runtime
Client->>Server: Upload file / request generation
Server->>FileProc: extract_text_from_file / fetch_doc
FileProc-->>Server: plain_text
Server->>Generators: generate_mcq / generate_shortq / generate_paraphrase / generate_boolq (plain_text)
Generators->>Models: tokenize & generate
Models-->>Generators: generated_questions
Server->>AnswerPred: predict_answer / predict_boolean_answer (questions, context)
AnswerPred->>Models: generation / NLI inference
Models-->>AnswerPred: answers / entailment_scores
AnswerPred-->>Server: answers / booleans
Server-->>Client: structured QA payload
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
🧹 Nitpick comments (2)
backend/Generator/answer_predictor.py (1)
28-31: Reuse `greedy_decoding` to avoid duplicated generation logic.
The same generate+decode path is implemented twice. Consolidating on `greedy_decoding` will keep behavior consistent and reduce drift risk.
♻️ Proposed refactor
-            greedy_output = self.model.generate(input_ids=input_ids, attention_mask=attention_masks, max_length=256)
-            Question = self.tokenizer.decode(greedy_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
-            answers.append(Question.strip().capitalize())
+            answers.append(self.greedy_decoding(input_ids, attention_masks))

Also applies to: 45-49
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/answer_predictor.py` around lines 28 - 31, The generate+decode logic in greedy_decoding is duplicated elsewhere; replace the duplicated generate/decode blocks with calls to greedy_decoding(inp_ids, attn_mask) so the model.generate and tokenizer.decode path is centralized. Locate the other method that performs the same generate+decode (the block around lines 45-49 in the diff) and change it to use the return value from greedy_decoding, ensuring you pass the same input tensors (inp_ids, attn_mask) and use the returned string (already stripped and capitalized) instead of reimplementing generation/decoding.
backend/Generator/question_generators.py (1)
17-33: Consider a shared base for repeated model/seed/device setup.
The four classes duplicate nearly identical initialization and seed logic. A shared base/helper would reduce maintenance overhead.
Also applies to: 76-92, 128-140, 186-197
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/question_generators.py` around lines 17 - 33, Extract the duplicated initialization and seeding logic into a new shared base class (e.g., BaseQuestionGenerator) that defines __init__ and set_seed (including tokenizer = T5Tokenizer.from_pretrained(...), model = T5ForConditionalGeneration.from_pretrained(...), device setup and model.to(device), nlp = spacy.load(...), s2v.from_disk(...), fdist/FreqDist(brown.words()), normalized_levenshtein, and the call to set_seed(42)); then update each of the four concrete generator classes to inherit from this base and remove their local duplicated __init__ and set_seed bodies, calling super().__init__ from their constructors as needed so only class-specific setup remains.
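For illustration, here is a minimal sketch of the shared base this nitpick describes. The checkpoint name, spaCy model, and sense2vec path are left as constructor parameters because the concrete values are not visible in this diff; treat them as assumptions, not the project's actual configuration.

import random

import numpy as np
import spacy
import torch
from nltk import FreqDist
from nltk.corpus import brown
from sense2vec import Sense2Vec
from similarity.normalized_levenshtein import NormalizedLevenshtein
from transformers import T5ForConditionalGeneration, T5Tokenizer

class BaseQuestionGenerator:
    """Shared model/seed/device setup for the generator classes (sketch)."""

    def __init__(self, model_name, s2v_path, spacy_model="en_core_web_sm"):
        # model_name, s2v_path and spacy_model are hypothetical parameters;
        # the real classes may hard-code their own values.
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.nlp = spacy.load(spacy_model)
        self.s2v = Sense2Vec().from_disk(s2v_path)
        self.fdist = FreqDist(brown.words())
        self.normalized_levenshtein = NormalizedLevenshtein()
        self.set_seed(42)

    def set_seed(self, seed):
        # Seed every RNG the generators rely on so outputs are reproducible.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)

Each concrete generator would then subclass this, call super().__init__(...) from its constructor, and keep only class-specific setup.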
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/answer_predictor.py`:
- Around line 33-43: The predict_answer method currently iterates
payload.get("input_question") without validation which can raise TypeError;
update predict_answer to validate the payload first by checking payload is a
dict and that payload.get("input_question") is present and is an iterable/list
(and that payload.get("input_text") is a non-empty string) before the for loop,
returning or raising a controlled error response if validation fails; add clear
guards around the for ques in payload.get("input_question") loop and use the
validated local variables (inp, context, question) to proceed only when checks
pass.
In `@backend/Generator/question_generators.py`:
- Around line 34-43: The generate_mcq function assumes payload["input_text"] is
a non-empty string before calling tokenize_into_sentences and building
sentences/modified_text; add a guard that validates payload.get("input_text") is
a string and not empty (e.g., if not isinstance(inp_text, str) or not
inp_text.strip(): handle early by raising ValueError or returning a clear error
response), then proceed to call tokenize_into_sentences and compute sentences
and modified_text; apply the same validation pattern to the other blocks that
use tokenize_into_sentences (the other occurrences noted) so you never call
tokenize_into_sentences with malformed or None input.
- Line 217: Remove the raw prompt dump "print(form)" which exposes user-derived
content; instead either delete that statement or replace it with a safe,
non-sensitive log (e.g., a call to the module logger with a redacted or
anonymized summary). Locate the plain print(form) call in question_generators.py
and remove it; if you must keep an audit, use the project's logger and log only
metadata (e.g., form length or a masked token) rather than the full form
contents.
- Around line 57-60: The bare except is hiding errors when calling
generate_multiple_choice_questions; replace it with an explicit except Exception
as e and log the exception and context before returning final_output.
Specifically, in the block that calls
generate_multiple_choice_questions(keyword_sentence_mapping, self.device,
self.tokenizer, self.model, self.s2v, self.normalized_levenshtein), catch
Exception as e, emit a helpful log via the module/class logger (or
processLogger) including the exception e and relevant context like
keyword_sentence_mapping or self.model identifier, then return final_output.
In `@backend/Generator/utilities.py`:
- Around line 64-78: process_file currently trusts file.filename (allowing path
traversal) and may leave orphaned temp files if extraction raises; fix by
sanitizing/normalizing and constraining the destination path (use
os.path.basename or werkzeug.utils.secure_filename and resolve with
os.path.realpath to ensure the final path is inside upload_folder) and write to
a safely created temp path (or NamedTemporaryFile) instead of directly using the
raw filename; wrap the save + detection + calls to extract_text_from_pdf and
extract_text_from_docx in a try/finally so os.remove(file_path) always runs, and
propagate or handle extraction exceptions as before while still ensuring
cleanup, referencing process_file, upload_folder, extract_text_from_pdf and
extract_text_from_docx to locate the changes.
- Around line 52-57: The extract_text_from_pdf function opens a PDF with
fitz.open(file_path) but never closes it; change the implementation to open the
document with a context manager (e.g., with fitz.open(file_path) as doc:) so the
document is deterministically closed after extracting text, still iterating
pages to build and return the text string in the same function.
---
Nitpick comments:
In `@backend/Generator/answer_predictor.py`:
- Around line 28-31: The generate+decode logic in greedy_decoding is duplicated
elsewhere; replace the duplicated generate/decode blocks with calls to
greedy_decoding(inp_ids, attn_mask) so the model.generate and tokenizer.decode
path is centralized. Locate the other method that performs the same
generate+decode (the block around lines 45-49 in the diff) and change it to use
the return value from greedy_decoding, ensuring you pass the same input tensors
(inp_ids, attn_mask) and use the returned string (already stripped and
capitalized) instead of reimplementing generation/decoding.
In `@backend/Generator/question_generators.py`:
- Around line 17-33: Extract the duplicated initialization and seeding logic
into a new shared base class (e.g., BaseQuestionGenerator) that defines __init__
and set_seed (including tokenizer = T5Tokenizer.from_pretrained(...), model =
T5ForConditionalGeneration.from_pretrained(...), device setup and
model.to(device), nlp = spacy.load(...), s2v.from_disk(...),
fdist/FreqDist(brown.words()), normalized_levenshtein, and the call to
set_seed(42)); then update each of the four concrete generator classes to
inherit from this base and remove their local duplicated __init__ and set_seed
bodies, calling super().__init__ from their constructors as needed so only
class-specific setup remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 566a5c98-74f4-4a81-bc33-7c0f064b220a
📒 Files selected for processing (4)
backend/Generator/advanced_qa.py
backend/Generator/answer_predictor.py
backend/Generator/question_generators.py
backend/Generator/utilities.py
def predict_answer(self, payload):
    answers = []
    inp = {
        "input_text": payload.get("input_text"),
        "input_question" : payload.get("input_question")
    }
    for ques in payload.get("input_question"):

        context = inp["input_text"]
        question = ques
        input_text = "question: %s <s> context: %s </s>" % (question, context)
Validate request payload before iterating questions.
Line 39 iterates over payload.get("input_question") without validation; missing/invalid input causes runtime failures (TypeError) instead of a controlled response.
💡 Proposed fix
def predict_answer(self, payload):
- answers = []
- inp = {
- "input_text": payload.get("input_text"),
- "input_question" : payload.get("input_question")
- }
- for ques in payload.get("input_question"):
-
- context = inp["input_text"]
+ answers = []
+ context = payload.get("input_text")
+ questions = payload.get("input_question")
+
+ if not isinstance(context, str) or not context.strip():
+ raise ValueError("input_text must be a non-empty string")
+ if not isinstance(questions, list):
+ raise ValueError("input_question must be a list of strings")
+
+ for ques in questions:
question = ques
        input_text = "question: %s <s> context: %s </s>" % (question, context)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def predict_answer(self, payload):
    answers = []
    context = payload.get("input_text")
    questions = payload.get("input_question")

    if not isinstance(context, str) or not context.strip():
        raise ValueError("input_text must be a non-empty string")
    if not isinstance(questions, list):
        raise ValueError("input_question must be a list of strings")

    for ques in questions:
        question = ques
        input_text = "question: %s <s> context: %s </s>" % (question, context)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/answer_predictor.py` around lines 33 - 43, The
predict_answer method currently iterates payload.get("input_question") without
validation which can raise TypeError; update predict_answer to validate the
payload first by checking payload is a dict and that
payload.get("input_question") is present and is an iterable/list (and that
payload.get("input_text") is a non-empty string) before the for loop, returning
or raising a controlled error response if validation fails; add clear guards
around the for ques in payload.get("input_question") loop and use the validated
local variables (inp, context, question) to proceed only when checks pass.
def generate_mcq(self, payload):
    start_time = time.time()
    inp = {
        "input_text": payload.get("input_text"),
        "max_questions": payload.get("max_questions", 4)
    }

    text = inp['input_text']
    sentences = tokenize_into_sentences(text)
    modified_text = " ".join(sentences)
Guard input_text before tokenization/sentence splitting.
These paths assume payload["input_text"] is a valid non-empty string; malformed input leads to runtime exceptions.
✅ Proposed pattern
- text = inp['input_text']
+ text = inp['input_text']
+ if not isinstance(text, str) or not text.strip():
+        raise ValueError("input_text must be a non-empty string")

Also applies to: 93-101, 148-152, 211-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/question_generators.py` around lines 34 - 43, The
generate_mcq function assumes payload["input_text"] is a non-empty string before
calling tokenize_into_sentences and building sentences/modified_text; add a
guard that validates payload.get("input_text") is a string and not empty (e.g.,
if not isinstance(inp_text, str) or not inp_text.strip(): handle early by
raising ValueError or returning a clear error response), then proceed to call
tokenize_into_sentences and compute sentences and modified_text; apply the same
validation pattern to the other blocks that use tokenize_into_sentences (the
other occurrences noted) so you never call tokenize_into_sentences with
malformed or None input.
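Because the same guard would otherwise be copy-pasted into four methods, one option is a small module-level validator. This is a sketch; require_input_text is a hypothetical helper name, and the project may prefer returning an error response instead of raising.

def require_input_text(payload):
    # Return payload["input_text"] if it is a non-empty string, else raise early.
    text = payload.get("input_text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input_text must be a non-empty string")
    return text

Each generate_* method could then start with text = require_input_text(payload) before ever calling tokenize_into_sentences(text).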
try:
    generated_questions = generate_multiple_choice_questions(keyword_sentence_mapping, self.device, self.tokenizer, self.model, self.s2v, self.normalized_levenshtein)
except:
    return final_output
Replace bare except with explicit exception handling.
Line 59 swallows all exceptions and obscures root causes. Catch Exception explicitly and log context.
🛠️ Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
...
- try:
- generated_questions = generate_multiple_choice_questions(keyword_sentence_mapping, self.device, self.tokenizer, self.model, self.s2v, self.normalized_levenshtein)
- except:
- return final_output
+ try:
+ generated_questions = generate_multiple_choice_questions(
+ keyword_sentence_mapping,
+ self.device,
+ self.tokenizer,
+ self.model,
+ self.s2v,
+ self.normalized_levenshtein,
+ )
+ except Exception:
+ logger.exception("MCQ generation failed")
+        return final_output

🧰 Tools
🪛 Ruff (0.15.6)
[error] 59-59: Do not use bare except
(E722)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/question_generators.py` around lines 57 - 60, The bare
except is hiding errors when calling generate_multiple_choice_questions; replace
it with an explicit except Exception as e and log the exception and context
before returning final_output. Specifically, in the block that calls
generate_multiple_choice_questions(keyword_sentence_mapping, self.device,
self.tokenizer, self.model, self.s2v, self.normalized_levenshtein), catch
Exception as e, emit a helpful log via the module/class logger (or
processLogger) including the exception e and relevant context like
keyword_sentence_mapping or self.model identifier, then return final_output.
modified_text = " ".join(sentences)
answer = self.random_choice()
form = "truefalse: %s passage: %s </s>" % (modified_text, answer)
print(form)
Remove raw prompt logging to avoid leaking user text.
Line 217 prints full request-derived content to stdout. That can leak sensitive data in logs.
🔒 Proposed fix
- print(form)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
print(form)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/question_generators.py` at line 217, Remove the raw prompt
dump "print(form)" which exposes user-derived content; instead either delete
that statement or replace it with a safe, non-sensitive log (e.g., a call to the
module logger with a redacted or anonymized summary). Locate the plain
print(form) call in question_generators.py and remove it; if you must keep an
audit, use the project's logger and log only metadata (e.g., form length or a
masked token) rather than the full form contents.
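If some audit trail is still wanted, a metadata-only log line avoids echoing user text. A sketch, assuming a standard module-level logger rather than any logger the project actually defines:

import logging

logger = logging.getLogger(__name__)

# Log only the prompt length, never the user-derived contents of form.
logger.debug("Built truefalse prompt (%d chars)", len(form))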
def extract_text_from_pdf(self, file_path):
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
Close PDF resources deterministically after extraction.
fitz.open(...) is not closed explicitly. Use a context manager to avoid descriptor/memory leaks.
🧹 Proposed fix
def extract_text_from_pdf(self, file_path):
- doc = fitz.open(file_path)
text = ""
- for page in doc:
- text += page.get_text()
+ with fitz.open(file_path) as doc:
+ for page in doc:
+ text += page.get_text()
     return text

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/utilities.py` around lines 52 - 57, The
extract_text_from_pdf function opens a PDF with fitz.open(file_path) but never
closes it; change the implementation to open the document with a context manager
(e.g., with fitz.open(file_path) as doc:) so the document is deterministically
closed after extracting text, still iterating pages to build and return the text
string in the same function.
Need Tests !!!!
if not file_abs.startswith(upload_abs):
    raise ValueError("Invalid file path")

file.save(file_path)
Check failure: Code scanning / CodeQL: Uncontrolled data used in path expression (High)
try:
    if safe_filename.endswith('.txt'):
        with open(file_path, 'r', encoding='utf-8') as f:
Check failure: Code scanning / CodeQL: Uncontrolled data used in path expression (High)
        return ""
finally:
    # Clean up the temporary file
    if os.path.exists(file_path):
Check failure: Code scanning / CodeQL: Uncontrolled data used in path expression (High)
finally:
    # Clean up the temporary file
    if os.path.exists(file_path):
        os.remove(file_path)
Check failure: Code scanning / CodeQL: Uncontrolled data used in path expression (High)
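The snippets flagged above are fragments of process_file. Pulling the earlier review suggestions together, a hardened version of the method could look roughly like this; the werkzeug import, the UUID prefix, and the exact extension handling are assumptions rather than the project's actual code.

import os
import uuid

from werkzeug.utils import secure_filename

def process_file(self, file):
    # Build a safe name: basename + secure_filename strip traversal tricks,
    # and a UUID prefix avoids collisions between concurrent uploads.
    safe_filename = uuid.uuid4().hex + "_" + secure_filename(os.path.basename(file.filename))
    file_path = os.path.realpath(os.path.join(self.upload_folder, safe_filename))

    # Containment check with a trailing separator so e.g. /app/uploads_evil never matches.
    upload_abs = os.path.realpath(self.upload_folder)
    if not file_path.startswith(upload_abs + os.sep):
        raise ValueError("Invalid file path")

    file.save(file_path)
    try:
        if safe_filename.endswith('.txt'):
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
        elif safe_filename.endswith('.pdf'):
            return self.extract_text_from_pdf(file_path)
        elif safe_filename.endswith('.docx'):
            return self.extract_text_from_docx(file_path)
        else:
            raise ValueError(f"Unsupported file type: {safe_filename}")
    finally:
        # Always clean up the temporary file, even if extraction raises.
        if os.path.exists(file_path):
            os.remove(file_path)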
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/Generator/utilities.py (1)
105-108: ⚠️ Potential issue | 🟡 Minor
Path validation `startswith` check is missing trailing separator.
The check `file_abs.startswith(upload_abs)` can be bypassed if a sibling directory shares the prefix (e.g., `/app/uploads_evil/x` passes when `upload_abs = "/app/uploads"`). Append `os.sep` for correctness. While the UUID-based filename sanitization mitigates practical exploitation here, this is a defense-in-depth measure that should be correct.
🛡️ Proposed fix
  def extract_text_from_docx(self, file_path):
      # Validate file path is within upload folder
      upload_abs = os.path.abspath(self.upload_folder)
      file_abs = os.path.abspath(file_path)
-     if not file_abs.startswith(upload_abs):
+     if not file_abs.startswith(upload_abs + os.sep):
          raise ValueError("Invalid file path")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/utilities.py` around lines 105 - 108, The current path-check using file_abs.startswith(upload_abs) can be bypassed by sibling dirs; ensure upload_abs includes a trailing separator before the startswith check (e.g., normalize upload_abs to os.path.abspath(self.upload_folder) + os.sep or append os.sep when comparing) so the comparison becomes file_abs.startswith(upload_abs_with_sep); update the variables upload_abs and the conditional that raises ValueError("Invalid file path") accordingly to prevent prefix collisions.
🧹 Nitpick comments (2)
backend/Generator/utilities.py (2)
148-167: Use `isinstance` and simplify conditional expressions.
Minor style suggestions:

- Line 157: `type(answer) is list` → `isinstance(answer, list)` is more Pythonic and handles subclasses.
- Lines 150, 162, 167: Using `np.where()` for simple conditionals is unconventional and returns numpy types. Consider standard Python ternary expressions for clarity.

💡 Suggested changes
-        space = " " * int(np.where(i < 9, 3, 4))
+        space = " " * (3 if i < 9 else 4)
         print(f"{i + 1}) Q: {qa_list[i]['question']}")
         answer = qa_list[i]["answer"]
         # print a list of multiple choice answers
-        if type(answer) is list:
+        if isinstance(answer, list):
             if show_answers:
                 print(
                     f"{space}A: 1. {answer[0]['answer']} "
-                    f"{np.where(answer[0]['correct'], '(correct)', '')}"
+                    f"{'(correct)' if answer[0]['correct'] else ''}"
                 )
             for j in range(1, len(answer)):
                 print(
                     f"{space + ' '}{j + 1}. {answer[j]['answer']} "
-                    f"{np.where(answer[j]['correct'], '(correct)', '')}"
+                    f"{'(correct)' if answer[j]['correct'] else ''}"
                 )
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/utilities.py` around lines 148 - 167, The loop uses type(answer) is list and np.where for simple conditionals; change type(answer) is list to isinstance(answer, list) (ref: qa_list and answer variables and show_answers branch) and replace np.where usages with Python ternary expressions: compute space using " " * (3 if i < 9 else 4) and replace occurrences like np.where(answer[k]['correct'], '(correct)', '') with '(correct)' if answer[k]['correct'] else '' so standard Python types are returned.
135-136: Returning empty string for unsupported file types may silently mask errors.
When an unsupported file type is uploaded, returning `""` could lead to downstream issues with empty content being processed. Consider raising an exception or logging a warning.

💡 Proposed fix
     elif safe_filename.endswith('.docx'):
         return self.extract_text_from_docx(file_path)
     else:
-        return ""
+        raise ValueError(f"Unsupported file type: {safe_filename}")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/utilities.py` around lines 135 - 136, The else branch in utilities.py that currently returns an empty string for unsupported file types should not silently mask errors; replace that return "" with a raised exception (e.g., raise ValueError or define and raise an UnsupportedFileTypeError) including contextual info (file name and detected type) and/or emit a warning log via the module logger before raising; update the function that handles uploaded file parsing (the branch that ends with return "") to raise the error and ensure callers either catch it or propagate it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/utilities.py`:
- Around line 119-123: The current check using file_abs.startswith(upload_abs)
can be bypassed (same as in extract_text_from_docx); update the validation in
utilities.py to use a robust comparison such as os.path.commonpath([upload_abs,
file_abs]) != upload_abs (or ensure upload_abs ends with os.sep and compare
startswith upload_abs + os.sep) to guarantee file_path is contained within
self.upload_folder; adjust the block that computes upload_abs and file_abs and
raise ValueError("Invalid file path") when the containment check fails, matching
the fix applied in extract_text_from_docx.
---
Duplicate comments:
In `@backend/Generator/utilities.py`:
- Around line 105-108: The current path-check using
file_abs.startswith(upload_abs) can be bypassed by sibling dirs; ensure
upload_abs includes a trailing separator before the startswith check (e.g.,
normalize upload_abs to os.path.abspath(self.upload_folder) + os.sep or append
os.sep when comparing) so the comparison becomes
file_abs.startswith(upload_abs_with_sep); update the variables upload_abs and
the conditional that raises ValueError("Invalid file path") accordingly to
prevent prefix collisions.
---
Nitpick comments:
In `@backend/Generator/utilities.py`:
- Around line 148-167: The loop uses type(answer) is list and np.where for
simple conditionals; change type(answer) is list to isinstance(answer, list)
(ref: qa_list and answer variables and show_answers branch) and replace np.where
usages with Python ternary expressions: compute space using " " * (3 if i < 9
else 4) and replace occurrences like np.where(answer[k]['correct'], '(correct)',
'') with '(correct)' if answer[k]['correct'] else '' so standard Python types
are returned.
- Around line 135-136: The else branch in utilities.py that currently returns an
empty string for unsupported file types should not silently mask errors; replace
that return "" with a raised exception (e.g., raise ValueError or define and
raise an UnsupportedFileTypeError) including contextual info (file name and
detected type) and/or emit a warning log via the module logger before raising;
update the function that handles uploaded file parsing (the branch that ends
with return "") to raise the error and ensure callers either catch it or
propagate it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7b84365d-2e8e-415a-80c4-907b3d400933
📒 Files selected for processing (1)
backend/Generator/utilities.py
# Ensure the file path is within the upload folder
upload_abs = os.path.abspath(self.upload_folder)
file_abs = os.path.abspath(file_path)
if not file_abs.startswith(upload_abs):
    raise ValueError("Invalid file path")
Same startswith pattern issue as in extract_text_from_docx.
Apply the same fix here for consistency and correctness.
🛡️ Proposed fix
# Ensure the file path is within the upload folder
upload_abs = os.path.abspath(self.upload_folder)
file_abs = os.path.abspath(file_path)
- if not file_abs.startswith(upload_abs):
+ if not file_abs.startswith(upload_abs + os.sep):
      raise ValueError("Invalid file path")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/utilities.py` around lines 119 - 123, The current check
using file_abs.startswith(upload_abs) can be bypassed (same as in
extract_text_from_docx); update the validation in utilities.py to use a robust
comparison such as os.path.commonpath([upload_abs, file_abs]) != upload_abs (or
ensure upload_abs ends with os.sep and compare startswith upload_abs + os.sep)
to guarantee file_path is contained within self.upload_folder; adjust the block
that computes upload_abs and file_abs and raise ValueError("Invalid file path")
when the containment check fails, matching the fix applied in
extract_text_from_docx.
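As an alternative to prefix matching, the os.path.commonpath variant mentioned in the prompt sidesteps string-prefix pitfalls entirely. A minimal sketch with a hypothetical helper name:

import os

def is_within_folder(upload_folder, file_path):
    # True only when file_path resolves inside upload_folder.
    # Note: os.path.commonpath raises ValueError if the paths are on
    # different drives (Windows), which also aborts the operation.
    upload_abs = os.path.abspath(upload_folder)
    file_abs = os.path.abspath(file_path)
    return os.path.commonpath([upload_abs, file_abs]) == upload_abs

The containment check then becomes: if not is_within_folder(self.upload_folder, file_path): raise ValueError("Invalid file path").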
Addressed Issues:
Fixes #603
Screenshots/Recordings:
N/A (No UI changes)
Additional Notes:
- Split main.py (~800 lines) into smaller modules
- Updated server.py accordingly

AI Usage Disclosure:
I have used the following AI models and tools: ChatGPT (GPT-5.3)
Checklist
Files Changed
Summary by CodeRabbit
Refactor
New Features