
refactor: split large main.py into modular components and update imports#604

Open
Aditya30ag wants to merge 6 commits into AOSSIE-Org:main from Aditya30ag:refactor/split-main-modules

Conversation


@Aditya30ag Aditya30ag commented Mar 18, 2026

Addressed Issues:

Fixes #603

Screenshots/Recordings:

N/A (No UI changes)

Additional Notes:

  • Refactored large main.py (~800 lines) into smaller modules
  • Improved code organization and maintainability
  • No changes in logic or functionality
  • Updated imports in server.py accordingly

AI Usage Disclosure:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: ChatGPT (GPT-5.3)

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.


Summary by CodeRabbit

  • Refactor

    • Backend question-generation logic reorganized into smaller, focused modules; legacy monolithic implementation trimmed.
  • New Features

    • Added dedicated generators for multiple-choice, short-answer, paraphrase, and boolean question creation.
    • Added an answer-prediction component with boolean entailment checks.
    • Added document intake and extraction for PDFs, DOCX, and Google Docs links.

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough

Walkthrough

Split and reorganized the QA codebase: removed many classes from advanced_qa.py, and introduced focused modules question_generators.py, answer_predictor.py, and utilities.py; server.py imports/instantiations updated to use the new classes.

Changes

Cohort / File(s) / Summary:

  • Question Generator Classes (backend/Generator/question_generators.py): New module adding MCQGenerator, ShortQGenerator, ParaphraseGenerator, and BoolQGenerator. Each initializes tokenizer/model/device/seed and exposes a single generation method returning structured outputs; includes basic error handling and optional CUDA cleanup.
  • Answer Prediction Logic (backend/Generator/answer_predictor.py): New module adding AnswerPredictor, which loads a T5 generator and an NLI classifier, provides greedy_decoding, predict_answer, and predict_boolean_answer, handles device selection and seeding, and optionally clears the CUDA cache.
  • Utilities / File & Docs (backend/Generator/utilities.py): New module adding GoogleDocsService (doc ID extraction, Docs content retrieval) and FileProcessor (upload folder management, sanitized filenames, .txt/.pdf/.docx extraction, safe temp cleanup), plus a print_qa helper.
  • Refactored Core Module (backend/Generator/advanced_qa.py): Major trimming; removed the previously bundled generator/service classes and heavy logic, leaving minimal imports and a skeletal QuestionGenerator-related structure.
  • Server updates (backend/server.py): Updated imports to reference the new modules/classes directly and replaced the previous main-based instantiation with direct constructors; added a startup print.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Server
  participant FileProc as FileProcessor / GoogleDocsService
  participant Generators as QuestionGenerators
  participant AnswerPred as AnswerPredictor
  participant Models as Model Runtime
  Client->>Server: Upload file / request generation
  Server->>FileProc: extract_text_from_file / fetch_doc
  FileProc-->>Server: plain_text
  Server->>Generators: generate_mcq / generate_shortq / generate_paraphrase / generate_boolq (plain_text)
  Generators->>Models: tokenize & generate
  Models-->>Generators: generated_questions
  Server->>AnswerPred: predict_answer / predict_boolean_answer (questions, context)
  AnswerPred->>Models: generation / NLI inference
  Models-->>AnswerPred: answers / entailment_scores
  AnswerPred-->>Server: answers / booleans
  Server-->>Client: structured QA payload

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐇 I munched the monolith down to bits,

Carried modules in nimble fits,
Now questions hop and answers sing,
Files and docs in orderly ring,
A tidy burrow for code—what bliss! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 14.81%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title clearly describes the main refactoring effort, splitting a large file into modular components and updating imports, which aligns with the primary changeset.
  • Linked Issues check (✅ Passed): All objectives from issue #603 are met: main.py split into question_generators.py, answer_predictor.py, utilities.py, and advanced_qa.py; server.py imports updated; all logic preserved.
  • Out of Scope Changes check (✅ Passed): All changes directly support the refactoring objective; no unrelated modifications or feature additions are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
backend/Generator/answer_predictor.py (1)

28-31: Reuse greedy_decoding to avoid duplicated generation logic.

The same generate+decode path is implemented twice. Consolidating on greedy_decoding will keep behavior consistent and reduce drift risk.

♻️ Proposed refactor
-            greedy_output = self.model.generate(input_ids=input_ids, attention_mask=attention_masks, max_length=256)
-            Question = self.tokenizer.decode(greedy_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
-            answers.append(Question.strip().capitalize())
+            answers.append(self.greedy_decoding(input_ids, attention_masks))

Also applies to: 45-49

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/answer_predictor.py` around lines 28 - 31, The
generate+decode logic in greedy_decoding is duplicated elsewhere; replace the
duplicated generate/decode blocks with calls to greedy_decoding(inp_ids,
attn_mask) so the model.generate and tokenizer.decode path is centralized.
Locate the other method that performs the same generate+decode (the block around
lines 45-49 in the diff) and change it to use the return value from
greedy_decoding, ensuring you pass the same input tensors (inp_ids, attn_mask)
and use the returned string (already stripped and capitalized) instead of
reimplementing generation/decoding.
backend/Generator/question_generators.py (1)

17-33: Consider a shared base for repeated model/seed/device setup.

The four classes duplicate nearly identical initialization and seed logic. A shared base/helper would reduce maintenance overhead.

Also applies to: 76-92, 128-140, 186-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` around lines 17 - 33, Extract the
duplicated initialization and seeding logic into a new shared base class (e.g.,
BaseQuestionGenerator) that defines __init__ and set_seed (including tokenizer =
T5Tokenizer.from_pretrained(...), model =
T5ForConditionalGeneration.from_pretrained(...), device setup and
model.to(device), nlp = spacy.load(...), s2v.from_disk(...),
fdist/FreqDist(brown.words()), normalized_levenshtein, and the call to
set_seed(42)); then update each of the four concrete generator classes to
inherit from this base and remove their local duplicated __init__ and set_seed
bodies, calling super().__init__ from their constructors as needed so only
class-specific setup remains.
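The hoist-into-a-base-class suggestion above can be sketched in miniature. This is a hypothetical shape only: the class names BaseQuestionGenerator/MCQGenerator mirror the PR, but the model id "t5-small" is illustrative and the actual tokenizer/model/device loading is stubbed out in comments, since only the shared-seeding structure matters here.

```python
import random


class BaseQuestionGenerator:
    """Shared setup for the generator classes (illustrative sketch).

    The real classes load a T5 tokenizer/model and pick a device; those
    steps are stubbed out here so only the shared structure is shown.
    """

    def __init__(self, model_name: str, seed: int = 42) -> None:
        self.model_name = model_name
        self.set_seed(seed)
        # Real code would additionally do something like:
        #   self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        #   self.model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

    def set_seed(self, seed: int) -> None:
        # Real code would also seed numpy and torch (torch.manual_seed, ...).
        random.seed(seed)


class MCQGenerator(BaseQuestionGenerator):
    def __init__(self) -> None:
        # Only class-specific setup remains in each subclass.
        super().__init__("t5-small")
```

Each of the four generators would then call super().__init__ with its own model name, and the duplicated seed/device blocks disappear.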
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/answer_predictor.py`:
- Around line 33-43: The predict_answer method currently iterates
payload.get("input_question") without validation which can raise TypeError;
update predict_answer to validate the payload first by checking payload is a
dict and that payload.get("input_question") is present and is an iterable/list
(and that payload.get("input_text") is a non-empty string) before the for loop,
returning or raising a controlled error response if validation fails; add clear
guards around the for ques in payload.get("input_question") loop and use the
validated local variables (inp, context, question) to proceed only when checks
pass.

In `@backend/Generator/question_generators.py`:
- Around line 34-43: The generate_mcq function assumes payload["input_text"] is
a non-empty string before calling tokenize_into_sentences and building
sentences/modified_text; add a guard that validates payload.get("input_text") is
a string and not empty (e.g., if not isinstance(inp_text, str) or not
inp_text.strip(): handle early by raising ValueError or returning a clear error
response), then proceed to call tokenize_into_sentences and compute sentences
and modified_text; apply the same validation pattern to the other blocks that
use tokenize_into_sentences (the other occurrences noted) so you never call
tokenize_into_sentences with malformed or None input.
- Line 217: Remove the raw prompt dump "print(form)" which exposes user-derived
content; instead either delete that statement or replace it with a safe,
non-sensitive log (e.g., a call to the module logger with a redacted or
anonymized summary). Locate the plain print(form) call in question_generators.py
and remove it; if you must keep an audit, use the project's logger and log only
metadata (e.g., form length or a masked token) rather than the full form
contents.
- Around line 57-60: The bare except is hiding errors when calling
generate_multiple_choice_questions; replace it with an explicit except Exception
as e and log the exception and context before returning final_output.
Specifically, in the block that calls
generate_multiple_choice_questions(keyword_sentence_mapping, self.device,
self.tokenizer, self.model, self.s2v, self.normalized_levenshtein), catch
Exception as e, emit a helpful log via the module/class logger (or
processLogger) including the exception e and relevant context like
keyword_sentence_mapping or self.model identifier, then return final_output.

In `@backend/Generator/utilities.py`:
- Around line 64-78: process_file currently trusts file.filename (allowing path
traversal) and may leave orphaned temp files if extraction raises; fix by
sanitizing/normalizing and constraining the destination path (use
os.path.basename or werkzeug.utils.secure_filename and resolve with
os.path.realpath to ensure the final path is inside upload_folder) and write to
a safely created temp path (or NamedTemporaryFile) instead of directly using the
raw filename; wrap the save + detection + calls to extract_text_from_pdf and
extract_text_from_docx in a try/finally so os.remove(file_path) always runs, and
propagate or handle extraction exceptions as before while still ensuring
cleanup, referencing process_file, upload_folder, extract_text_from_pdf and
extract_text_from_docx to locate the changes.
- Around line 52-57: The extract_text_from_pdf function opens a PDF with
fitz.open(file_path) but never closes it; change the implementation to open the
document with a context manager (e.g., with fitz.open(file_path) as doc:) so the
document is deterministically closed after extracting text, still iterating
pages to build and return the text string in the same function.

---

Nitpick comments:
In `@backend/Generator/answer_predictor.py`:
- Around line 28-31: The generate+decode logic in greedy_decoding is duplicated
elsewhere; replace the duplicated generate/decode blocks with calls to
greedy_decoding(inp_ids, attn_mask) so the model.generate and tokenizer.decode
path is centralized. Locate the other method that performs the same
generate+decode (the block around lines 45-49 in the diff) and change it to use
the return value from greedy_decoding, ensuring you pass the same input tensors
(inp_ids, attn_mask) and use the returned string (already stripped and
capitalized) instead of reimplementing generation/decoding.

In `@backend/Generator/question_generators.py`:
- Around line 17-33: Extract the duplicated initialization and seeding logic
into a new shared base class (e.g., BaseQuestionGenerator) that defines __init__
and set_seed (including tokenizer = T5Tokenizer.from_pretrained(...), model =
T5ForConditionalGeneration.from_pretrained(...), device setup and
model.to(device), nlp = spacy.load(...), s2v.from_disk(...),
fdist/FreqDist(brown.words()), normalized_levenshtein, and the call to
set_seed(42)); then update each of the four concrete generator classes to
inherit from this base and remove their local duplicated __init__ and set_seed
bodies, calling super().__init__ from their constructors as needed so only
class-specific setup remains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 566a5c98-74f4-4a81-bc33-7c0f064b220a

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 2c52bf8.

📒 Files selected for processing (4)
  • backend/Generator/advanced_qa.py
  • backend/Generator/answer_predictor.py
  • backend/Generator/question_generators.py
  • backend/Generator/utilities.py

Comment on lines +33 to +43

    def predict_answer(self, payload):
        answers = []
        inp = {
                "input_text": payload.get("input_text"),
                "input_question" : payload.get("input_question")
            }
        for ques in payload.get("input_question"):

            context = inp["input_text"]
            question = ques
            input_text = "question: %s <s> context: %s </s>" % (question, context)

⚠️ Potential issue | 🟠 Major

Validate request payload before iterating questions.

Line 39 iterates over payload.get("input_question") without validation; missing/invalid input causes runtime failures (TypeError) instead of a controlled response.

💡 Proposed fix
 def predict_answer(self, payload):
-        answers = []
-        inp = {
-                "input_text": payload.get("input_text"),
-                "input_question" : payload.get("input_question")
-            }
-        for ques in payload.get("input_question"):
-                
-            context = inp["input_text"]
+        answers = []
+        context = payload.get("input_text")
+        questions = payload.get("input_question")
+
+        if not isinstance(context, str) or not context.strip():
+            raise ValueError("input_text must be a non-empty string")
+        if not isinstance(questions, list):
+            raise ValueError("input_question must be a list of strings")
+
+        for ques in questions:
             question = ques
             input_text = "question: %s <s> context: %s </s>" % (question, context)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    def predict_answer(self, payload):
        answers = []
        inp = {
                "input_text": payload.get("input_text"),
                "input_question" : payload.get("input_question")
            }
        for ques in payload.get("input_question"):
            context = inp["input_text"]
            question = ques
            input_text = "question: %s <s> context: %s </s>" % (question, context)

After:

    def predict_answer(self, payload):
        answers = []
        context = payload.get("input_text")
        questions = payload.get("input_question")

        if not isinstance(context, str) or not context.strip():
            raise ValueError("input_text must be a non-empty string")
        if not isinstance(questions, list):
            raise ValueError("input_question must be a list of strings")

        for ques in questions:
            question = ques
            input_text = "question: %s <s> context: %s </s>" % (question, context)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/answer_predictor.py` around lines 33 - 43, The
predict_answer method currently iterates payload.get("input_question") without
validation which can raise TypeError; update predict_answer to validate the
payload first by checking payload is a dict and that
payload.get("input_question") is present and is an iterable/list (and that
payload.get("input_text") is a non-empty string) before the for loop, returning
or raising a controlled error response if validation fails; add clear guards
around the for ques in payload.get("input_question") loop and use the validated
local variables (inp, context, question) to proceed only when checks pass.
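The validation step described above can live in a small standalone helper so every endpoint rejects malformed bodies the same way. validate_qa_payload is a hypothetical name used here for illustration; raising ValueError lets the server layer translate the failure into a 400 response instead of a TypeError stack trace.

```python
def validate_qa_payload(payload):
    """Validate the request body before answer prediction (sketch).

    Returns (context, questions) or raises ValueError with a clear message.
    """
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    context = payload.get("input_text")
    questions = payload.get("input_question")
    if not isinstance(context, str) or not context.strip():
        raise ValueError("input_text must be a non-empty string")
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("input_question must be a list of strings")
    return context, questions
```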

Comment on lines +34 to +43
def generate_mcq(self, payload):
start_time = time.time()
inp = {
"input_text": payload.get("input_text"),
"max_questions": payload.get("max_questions", 4)
}

text = inp['input_text']
sentences = tokenize_into_sentences(text)
modified_text = " ".join(sentences)

⚠️ Potential issue | 🟠 Major

Guard input_text before tokenization/sentence splitting.

These paths assume payload["input_text"] is a valid non-empty string; malformed input leads to runtime exceptions.

✅ Proposed pattern
-        text = inp['input_text']
+        text = inp['input_text']
+        if not isinstance(text, str) or not text.strip():
+            raise ValueError("input_text must be a non-empty string")

Also applies to: 93-101, 148-152, 211-214

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` around lines 34 - 43, The
generate_mcq function assumes payload["input_text"] is a non-empty string before
calling tokenize_into_sentences and building sentences/modified_text; add a
guard that validates payload.get("input_text") is a string and not empty (e.g.,
if not isinstance(inp_text, str) or not inp_text.strip(): handle early by
raising ValueError or returning a clear error response), then proceed to call
tokenize_into_sentences and compute sentences and modified_text; apply the same
validation pattern to the other blocks that use tokenize_into_sentences (the
other occurrences noted) so you never call tokenize_into_sentences with
malformed or None input.

Comment on lines +57 to +60
try:
generated_questions = generate_multiple_choice_questions(keyword_sentence_mapping, self.device, self.tokenizer, self.model, self.s2v, self.normalized_levenshtein)
except:
return final_output

⚠️ Potential issue | 🟠 Major

Replace bare except with explicit exception handling.

Line 59 swallows all exceptions and obscures root causes. Catch Exception explicitly and log context.

🛠️ Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
...
-            try:
-                generated_questions = generate_multiple_choice_questions(keyword_sentence_mapping, self.device, self.tokenizer, self.model, self.s2v, self.normalized_levenshtein)
-            except:
-                return final_output
+            try:
+                generated_questions = generate_multiple_choice_questions(
+                    keyword_sentence_mapping,
+                    self.device,
+                    self.tokenizer,
+                    self.model,
+                    self.s2v,
+                    self.normalized_levenshtein,
+                )
+            except Exception:
+                logger.exception("MCQ generation failed")
+                return final_output
🧰 Tools
🪛 Ruff (0.15.6)

[error] 59-59: Do not use bare except

(E722)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` around lines 57 - 60, The bare
except is hiding errors when calling generate_multiple_choice_questions; replace
it with an explicit except Exception as e and log the exception and context
before returning final_output. Specifically, in the block that calls
generate_multiple_choice_questions(keyword_sentence_mapping, self.device,
self.tokenizer, self.model, self.s2v, self.normalized_levenshtein), catch
Exception as e, emit a helpful log via the module/class logger (or
processLogger) including the exception e and relevant context like
keyword_sentence_mapping or self.model identifier, then return final_output.

        modified_text = " ".join(sentences)
        answer = self.random_choice()
        form = "truefalse: %s passage: %s </s>" % (modified_text, answer)
        print(form)

⚠️ Potential issue | 🟠 Major

Remove raw prompt logging to avoid leaking user text.

Line 217 prints full request-derived content to stdout. That can leak sensitive data in logs.

🔒 Proposed fix
-        print(form)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(form)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` at line 217, Remove the raw prompt
dump "print(form)" which exposes user-derived content; instead either delete
that statement or replace it with a safe, non-sensitive log (e.g., a call to the
module logger with a redacted or anonymized summary). Locate the plain
print(form) call in question_generators.py and remove it; if you must keep an
audit, use the project's logger and log only metadata (e.g., form length or a
masked token) rather than the full form contents.
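If an audit trail is still wanted after removing print(form), logging only metadata keeps the user text out of stdout. log_prompt_metadata is a hypothetical helper sketched for this purpose: a short content digest lets two log lines be correlated without revealing the prompt itself.

```python
import hashlib
import logging

logger = logging.getLogger("question_generators")


def log_prompt_metadata(form: str) -> str:
    """Log non-sensitive metadata about a prompt instead of its contents.

    Returns the short digest so callers/tests can correlate entries.
    """
    digest = hashlib.sha256(form.encode("utf-8")).hexdigest()[:12]
    # Length and digest only; the raw prompt never reaches the log stream.
    logger.debug("boolq prompt built: %d chars, sha256=%s", len(form), digest)
    return digest
```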

Comment on lines +52 to +57

    def extract_text_from_pdf(self, file_path):
        doc = fitz.open(file_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text

⚠️ Potential issue | 🟠 Major

Close PDF resources deterministically after extraction.

fitz.open(...) is not closed explicitly. Use a context manager to avoid descriptor/memory leaks.

🧹 Proposed fix
     def extract_text_from_pdf(self, file_path):
-        doc = fitz.open(file_path)
         text = ""
-        for page in doc:
-            text += page.get_text()
+        with fitz.open(file_path) as doc:
+            for page in doc:
+                text += page.get_text()
         return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 52 - 57, The
extract_text_from_pdf function opens a PDF with fitz.open(file_path) but never
closes it; change the implementation to open the document with a context manager
(e.g., with fitz.open(file_path) as doc:) so the document is deterministically
closed after extracting text, still iterating pages to build and return the text
string in the same function.
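The deterministic-close guarantee generalizes beyond fitz: for any handle that exposes close() (whether or not it also supports `with` directly), contextlib.closing gives the same behavior. FakeDoc below is a stand-in written for illustration; it mimics a document with iterable pages and a close() method so the pattern can be shown without PyMuPDF installed.

```python
from contextlib import closing


class FakeDoc:
    """Stand-in for a document handle: iterable pages plus a close() method."""

    def __init__(self, pages):
        self._pages = pages
        self.closed = False

    def __iter__(self):
        return iter(self._pages)

    def close(self):
        self.closed = True


def extract_text(doc_factory) -> str:
    # closing() guarantees close() runs even if iteration raises mid-way.
    text = ""
    with closing(doc_factory()) as doc:
        for page in doc:
            text += page
    return text
```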

        return text

    def extract_text_from_docx(self, file_path):
        with open(file_path, "rb") as docx_file:

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.
@Aditya30ag (Contributor, Author) commented:

Need Tests !!!!

        if not file_abs.startswith(upload_abs):
            raise ValueError("Invalid file path")

        file.save(file_path)

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

        try:
            if safe_filename.endswith('.txt'):
                with open(file_path, 'r', encoding='utf-8') as f:

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

            return ""
        finally:
            # Clean up the temporary file
            if os.path.exists(file_path):

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

        finally:
            # Clean up the temporary file
            if os.path.exists(file_path):
                os.remove(file_path)

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/Generator/utilities.py (1)

105-108: ⚠️ Potential issue | 🟡 Minor

Path validation startswith check is missing trailing separator.

The check file_abs.startswith(upload_abs) can be bypassed if a sibling directory shares the prefix (e.g., /app/uploads_evil/x passes when upload_abs = "/app/uploads"). Append os.sep for correctness.

While the UUID-based filename sanitization mitigates practical exploitation here, this is a defense-in-depth measure that should be correct.

🛡️ Proposed fix
     def extract_text_from_docx(self, file_path):
         # Validate file path is within upload folder
         upload_abs = os.path.abspath(self.upload_folder)
         file_abs = os.path.abspath(file_path)
-        if not file_abs.startswith(upload_abs):
+        if not file_abs.startswith(upload_abs + os.sep):
             raise ValueError("Invalid file path")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 105 - 108, The current
path-check using file_abs.startswith(upload_abs) can be bypassed by sibling
dirs; ensure upload_abs includes a trailing separator before the startswith
check (e.g., normalize upload_abs to os.path.abspath(self.upload_folder) +
os.sep or append os.sep when comparing) so the comparison becomes
file_abs.startswith(upload_abs_with_sep); update the variables upload_abs and
the conditional that raises ValueError("Invalid file path") accordingly to
prevent prefix collisions.
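A containment check built on os.path.commonpath avoids the sibling-prefix bypass entirely, without needing to remember the trailing separator. is_within is a hypothetical helper name used here to illustrate the pattern:

```python
import os


def is_within(base: str, candidate: str) -> bool:
    """True only if `candidate` resolves to a path inside `base`.

    os.path.commonpath is immune to the sibling-prefix bypass that a plain
    startswith check allows (e.g. /app/uploads vs /app/uploads_evil).
    """
    base_abs = os.path.abspath(base)
    cand_abs = os.path.abspath(candidate)
    return os.path.commonpath([base_abs, cand_abs]) == base_abs
```

The same helper also rejects `..` traversal, because abspath normalizes the candidate before the comparison.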
🧹 Nitpick comments (2)
backend/Generator/utilities.py (2)

148-167: Use isinstance and simplify conditional expressions.

Minor style suggestions:

  1. Line 157: replace `type(answer) is list` with `isinstance(answer, list)`; isinstance is more Pythonic and handles subclasses.
  2. Lines 150, 162, 167: Using np.where() for simple conditionals is unconventional and returns numpy types. Consider standard Python ternary expressions for clarity.
💡 Suggested changes
-        space = " " * int(np.where(i < 9, 3, 4))
+        space = " " * (3 if i < 9 else 4)
 
         print(f"{i + 1}) Q: {qa_list[i]['question']}")
 
         answer = qa_list[i]["answer"]
 
         # print a list of multiple choice answers
-        if type(answer) is list:
+        if isinstance(answer, list):
 
             if show_answers:
                 print(
                     f"{space}A: 1. {answer[0]['answer']} "
-                    f"{np.where(answer[0]['correct'], '(correct)', '')}"
+                    f"{'(correct)' if answer[0]['correct'] else ''}"
                 )
                 for j in range(1, len(answer)):
                     print(
                         f"{space + '   '}{j + 1}. {answer[j]['answer']} "
-                        f"{np.where(answer[j]['correct'], '(correct)', '')}"
+                        f"{'(correct)' if answer[j]['correct'] else ''}"
                     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 148 - 167, The loop uses
type(answer) is list and np.where for simple conditionals; change type(answer)
is list to isinstance(answer, list) (ref: qa_list and answer variables and
show_answers branch) and replace np.where usages with Python ternary
expressions: compute space using " " * (3 if i < 9 else 4) and replace
occurrences like np.where(answer[k]['correct'], '(correct)', '') with
'(correct)' if answer[k]['correct'] else '' so standard Python types are
returned.

135-136: Returning empty string for unsupported file types may silently mask errors.

When an unsupported file type is uploaded, returning "" could lead to downstream issues with empty content being processed. Consider raising an exception or logging a warning.

💡 Proposed fix
             elif safe_filename.endswith('.docx'):
                 return self.extract_text_from_docx(file_path)
             else:
-                return ""
+                raise ValueError(f"Unsupported file type: {safe_filename}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 135 - 136, The else branch in
utilities.py that currently returns an empty string for unsupported file types
should not silently mask errors; replace that return "" with a raised exception
(e.g., raise ValueError or define and raise an UnsupportedFileTypeError)
including contextual info (file name and detected type) and/or emit a warning
log via the module logger before raising; update the function that handles
uploaded file parsing (the branch that ends with return "") to raise the error
and ensure callers either catch it or propagate it.
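The fail-loudly dispatch suggested above can be reduced to a small sketch. dispatch_extraction and the SUPPORTED set are illustrative (the extension list mirrors the PR's FileProcessor, but this is not the project's actual function):

```python
import os

SUPPORTED = {".txt", ".pdf", ".docx"}


def dispatch_extraction(safe_filename: str) -> str:
    """Pick an extraction route by extension; fail loudly on anything else.

    Returning "" for unknown types would let empty text flow into question
    generation silently; raising surfaces the problem at the boundary.
    """
    ext = os.path.splitext(safe_filename)[1].lower()
    if ext not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {safe_filename}")
    return ext
```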
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/utilities.py`:
- Around line 119-123: The current check using file_abs.startswith(upload_abs)
can be bypassed (same as in extract_text_from_docx); update the validation in
utilities.py to use a robust comparison such as os.path.commonpath([upload_abs,
file_abs]) != upload_abs (or ensure upload_abs ends with os.sep and compare
startswith upload_abs + os.sep) to guarantee file_path is contained within
self.upload_folder; adjust the block that computes upload_abs and file_abs and
raise ValueError("Invalid file path") when the containment check fails, matching
the fix applied in extract_text_from_docx.

---

Duplicate comments:
In `@backend/Generator/utilities.py`:
- Around line 105-108: The current path-check using
file_abs.startswith(upload_abs) can be bypassed by sibling dirs; ensure
upload_abs includes a trailing separator before the startswith check (e.g.,
normalize upload_abs to os.path.abspath(self.upload_folder) + os.sep or append
os.sep when comparing) so the comparison becomes
file_abs.startswith(upload_abs_with_sep); update the variables upload_abs and
the conditional that raises ValueError("Invalid file path") accordingly to
prevent prefix collisions.

---

Nitpick comments:
In `@backend/Generator/utilities.py`:
- Around line 148-167: The loop uses type(answer) is list and np.where for
simple conditionals; change type(answer) is list to isinstance(answer, list)
(ref: qa_list and answer variables and show_answers branch) and replace np.where
usages with Python ternary expressions: compute space using " " * (3 if i < 9
else 4) and replace occurrences like np.where(answer[k]['correct'], '(correct)',
'') with '(correct)' if answer[k]['correct'] else '' so standard Python types
are returned.
- Around line 135-136: The else branch in utilities.py that currently returns an
empty string for unsupported file types should not silently mask errors; replace
that return "" with a raised exception (e.g., raise ValueError or define and
raise an UnsupportedFileTypeError) including contextual info (file name and
detected type) and/or emit a warning log via the module logger before raising;
update the function that handles uploaded file parsing (the branch that ends
with return "") to raise the error and ensure callers either catch it or
propagate it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b84365d-2e8e-415a-80c4-907b3d400933

📥 Commits

Reviewing files that changed from the base of the PR and between 0f1cb4c and de53385.

📒 Files selected for processing (1)
  • backend/Generator/utilities.py

Comment on lines +119 to +123
# Ensure the file path is within the upload folder
upload_abs = os.path.abspath(self.upload_folder)
file_abs = os.path.abspath(file_path)
if not file_abs.startswith(upload_abs):
raise ValueError("Invalid file path")

⚠️ Potential issue | 🟡 Minor

Same startswith pattern issue as in extract_text_from_docx.

Apply the same fix here for consistency and correctness.

🛡️ Proposed fix
         # Ensure the file path is within the upload folder
         upload_abs = os.path.abspath(self.upload_folder)
         file_abs = os.path.abspath(file_path)
-        if not file_abs.startswith(upload_abs):
+        if not file_abs.startswith(upload_abs + os.sep):
             raise ValueError("Invalid file path")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 119 - 123, The current check
using file_abs.startswith(upload_abs) can be bypassed (same as in
extract_text_from_docx); update the validation in utilities.py to use a robust
comparison such as os.path.commonpath([upload_abs, file_abs]) != upload_abs (or
ensure upload_abs ends with os.sep and compare startswith upload_abs + os.sep)
to guarantee file_path is contained within self.upload_folder; adjust the block
that computes upload_abs and file_abs and raise ValueError("Invalid file path")
when the containment check fails, matching the fix applied in
extract_text_from_docx.



Development

Successfully merging this pull request may close these issues.

[GOOD FIRST ISSUE]: main.py too large and hard to maintain

1 participant