
refactor: split large main.py into modular components and update imports#604

Open
Aditya30ag wants to merge 6 commits into AOSSIE-Org:main from Aditya30ag:refactor/split-main-modules

Conversation


@Aditya30ag Aditya30ag commented Mar 18, 2026

Addressed Issues:

Fixes #603

Screenshots/Recordings:

N/A (No UI changes)

Additional Notes:

  • Refactored large main.py (~800 lines) into smaller modules
  • Improved code organization and maintainability
  • No changes in logic or functionality
  • Updated imports in server.py accordingly

AI Usage Disclosure:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: ChatGPT (GPT-5.3)

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.


Summary by CodeRabbit

  • Refactor

    • Backend question-generation logic reorganized into smaller, focused modules; legacy monolithic implementation trimmed.
  • New Features

    • Added dedicated generators for multiple-choice, short-answer, paraphrase, and boolean question creation.
    • Added an answer-prediction component with boolean entailment checks.
    • Added document intake and extraction for PDFs, DOCX, and Google Docs links.

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough

Walkthrough

Split and reorganized the QA codebase: removed many classes from advanced_qa.py, and introduced focused modules question_generators.py, answer_predictor.py, and utilities.py; server.py imports/instantiations updated to use the new classes.

Changes

Cohort / File(s) / Summary:

  • Question Generator Classes (backend/Generator/question_generators.py): New module adding MCQGenerator, ShortQGenerator, ParaphraseGenerator, and BoolQGenerator. Each initializes tokenizer/model/device/seed and exposes a single generation method returning structured outputs; includes basic error handling and optional CUDA cleanup.
  • Answer Prediction Logic (backend/Generator/answer_predictor.py): New module adding AnswerPredictor, which loads a T5 generator and an NLI classifier, provides greedy_decoding, predict_answer, and predict_boolean_answer, handles device selection and seeding, and optionally clears the CUDA cache.
  • Utilities / File & Docs (backend/Generator/utilities.py): New module adding GoogleDocsService (doc ID extraction, Docs content retrieval) and FileProcessor (upload folder management, sanitized filenames, .txt/.pdf/.docx extraction, safe temp cleanup), plus a print_qa helper.
  • Refactored Core Module (backend/Generator/advanced_qa.py): Major trimming; removed the previously bundled generator/service classes and heavy logic, leaving minimal imports and a skeletal QuestionGenerator-related structure.
  • Server updates (backend/server.py): Updated imports to reference the new modules/classes directly and replaced the previous main-based instantiation with direct constructors; added a startup print.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Server
  participant FileProc as FileProcessor / GoogleDocsService
  participant Generators as QuestionGenerators
  participant AnswerPred as AnswerPredictor
  participant Models as Model Runtime
  Client->>Server: Upload file / request generation
  Server->>FileProc: extract_text_from_file / fetch_doc
  FileProc-->>Server: plain_text
  Server->>Generators: generate_mcq / generate_shortq / generate_paraphrase / generate_boolq (plain_text)
  Generators->>Models: tokenize & generate
  Models-->>Generators: generated_questions
  Server->>AnswerPred: predict_answer / predict_boolean_answer (questions, context)
  AnswerPred->>Models: generation / NLI inference
  Models-->>AnswerPred: answers / entailment_scores
  AnswerPred-->>Server: answers / booleans
  Server-->>Client: structured QA payload

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐇 I munched the monolith down to bits,

Carried modules in nimble fits,
Now questions hop and answers sing,
Files and docs in orderly ring,
A tidy burrow for code—what bliss! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 14.81%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title clearly describes the main refactoring effort, splitting a large file into modular components and updating imports, which aligns with the primary changeset.
  • Linked Issues check (✅ Passed): All objectives from issue #603 are met: main.py split into question_generators.py, answer_predictor.py, utilities.py, and advanced_qa.py; server.py imports updated; all logic preserved.
  • Out of Scope Changes check (✅ Passed): All changes directly support the refactoring objective; no unrelated modifications or feature additions are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
backend/Generator/answer_predictor.py (1)

28-31: Reuse greedy_decoding to avoid duplicated generation logic.

The same generate+decode path is implemented twice. Consolidating on greedy_decoding will keep behavior consistent and reduce drift risk.

♻️ Proposed refactor
-            greedy_output = self.model.generate(input_ids=input_ids, attention_mask=attention_masks, max_length=256)
-            Question = self.tokenizer.decode(greedy_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
-            answers.append(Question.strip().capitalize())
+            answers.append(self.greedy_decoding(input_ids, attention_masks))

Also applies to: 45-49

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/answer_predictor.py` around lines 28 - 31, The
generate+decode logic in greedy_decoding is duplicated elsewhere; replace the
duplicated generate/decode blocks with calls to greedy_decoding(inp_ids,
attn_mask) so the model.generate and tokenizer.decode path is centralized.
Locate the other method that performs the same generate+decode (the block around
lines 45-49 in the diff) and change it to use the return value from
greedy_decoding, ensuring you pass the same input tensors (inp_ids, attn_mask)
and use the returned string (already stripped and capitalized) instead of
reimplementing generation/decoding.
backend/Generator/question_generators.py (1)

17-33: Consider a shared base for repeated model/seed/device setup.

The four classes duplicate nearly identical initialization and seed logic. A shared base/helper would reduce maintenance overhead.

Also applies to: 76-92, 128-140, 186-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` around lines 17 - 33, Extract the
duplicated initialization and seeding logic into a new shared base class (e.g.,
BaseQuestionGenerator) that defines __init__ and set_seed (including tokenizer =
T5Tokenizer.from_pretrained(...), model =
T5ForConditionalGeneration.from_pretrained(...), device setup and
model.to(device), nlp = spacy.load(...), s2v.from_disk(...),
fdist/FreqDist(brown.words()), normalized_levenshtein, and the call to
set_seed(42)); then update each of the four concrete generator classes to
inherit from this base and remove their local duplicated __init__ and set_seed
bodies, calling super().__init__ from their constructors as needed so only
class-specific setup remains.
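The hoist-into-a-base-class suggestion above can be sketched in miniature. This is a hypothetical shape only: the class names BaseQuestionGenerator/MCQGenerator mirror the PR, but the model id "t5-small" is illustrative and the actual tokenizer/model/device loading is stubbed out in comments, since only the shared-seeding structure matters here.

```python
import random


class BaseQuestionGenerator:
    """Shared setup for the generator classes (illustrative sketch).

    The real classes load a T5 tokenizer/model and pick a device; those
    steps are stubbed out here so only the shared structure is shown.
    """

    def __init__(self, model_name: str, seed: int = 42) -> None:
        self.model_name = model_name
        self.set_seed(seed)
        # Real code would additionally do something like:
        #   self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        #   self.model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

    def set_seed(self, seed: int) -> None:
        # Real code would also seed numpy and torch (torch.manual_seed, ...).
        random.seed(seed)


class MCQGenerator(BaseQuestionGenerator):
    def __init__(self) -> None:
        # Only class-specific setup remains in each subclass.
        super().__init__("t5-small")
```

Each of the four generators would then call super().__init__ with its own model name, and the duplicated seed/device blocks disappear.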
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/answer_predictor.py`:
- Around line 33-43: The predict_answer method currently iterates
payload.get("input_question") without validation which can raise TypeError;
update predict_answer to validate the payload first by checking payload is a
dict and that payload.get("input_question") is present and is an iterable/list
(and that payload.get("input_text") is a non-empty string) before the for loop,
returning or raising a controlled error response if validation fails; add clear
guards around the for ques in payload.get("input_question") loop and use the
validated local variables (inp, context, question) to proceed only when checks
pass.

In `@backend/Generator/question_generators.py`:
- Around line 34-43: The generate_mcq function assumes payload["input_text"] is
a non-empty string before calling tokenize_into_sentences and building
sentences/modified_text; add a guard that validates payload.get("input_text") is
a string and not empty (e.g., if not isinstance(inp_text, str) or not
inp_text.strip(): handle early by raising ValueError or returning a clear error
response), then proceed to call tokenize_into_sentences and compute sentences
and modified_text; apply the same validation pattern to the other blocks that
use tokenize_into_sentences (the other occurrences noted) so you never call
tokenize_into_sentences with malformed or None input.
- Line 217: Remove the raw prompt dump "print(form)" which exposes user-derived
content; instead either delete that statement or replace it with a safe,
non-sensitive log (e.g., a call to the module logger with a redacted or
anonymized summary). Locate the plain print(form) call in question_generators.py
and remove it; if you must keep an audit, use the project's logger and log only
metadata (e.g., form length or a masked token) rather than the full form
contents.
- Around line 57-60: The bare except is hiding errors when calling
generate_multiple_choice_questions; replace it with an explicit except Exception
as e and log the exception and context before returning final_output.
Specifically, in the block that calls
generate_multiple_choice_questions(keyword_sentence_mapping, self.device,
self.tokenizer, self.model, self.s2v, self.normalized_levenshtein), catch
Exception as e, emit a helpful log via the module/class logger (or
processLogger) including the exception e and relevant context like
keyword_sentence_mapping or self.model identifier, then return final_output.

In `@backend/Generator/utilities.py`:
- Around line 64-78: process_file currently trusts file.filename (allowing path
traversal) and may leave orphaned temp files if extraction raises; fix by
sanitizing/normalizing and constraining the destination path (use
os.path.basename or werkzeug.utils.secure_filename and resolve with
os.path.realpath to ensure the final path is inside upload_folder) and write to
a safely created temp path (or NamedTemporaryFile) instead of directly using the
raw filename; wrap the save + detection + calls to extract_text_from_pdf and
extract_text_from_docx in a try/finally so os.remove(file_path) always runs, and
propagate or handle extraction exceptions as before while still ensuring
cleanup, referencing process_file, upload_folder, extract_text_from_pdf and
extract_text_from_docx to locate the changes.
- Around line 52-57: The extract_text_from_pdf function opens a PDF with
fitz.open(file_path) but never closes it; change the implementation to open the
document with a context manager (e.g., with fitz.open(file_path) as doc:) so the
document is deterministically closed after extracting text, still iterating
pages to build and return the text string in the same function.

---

Nitpick comments:
In `@backend/Generator/answer_predictor.py`:
- Around line 28-31: The generate+decode logic in greedy_decoding is duplicated
elsewhere; replace the duplicated generate/decode blocks with calls to
greedy_decoding(inp_ids, attn_mask) so the model.generate and tokenizer.decode
path is centralized. Locate the other method that performs the same
generate+decode (the block around lines 45-49 in the diff) and change it to use
the return value from greedy_decoding, ensuring you pass the same input tensors
(inp_ids, attn_mask) and use the returned string (already stripped and
capitalized) instead of reimplementing generation/decoding.

In `@backend/Generator/question_generators.py`:
- Around line 17-33: Extract the duplicated initialization and seeding logic
into a new shared base class (e.g., BaseQuestionGenerator) that defines __init__
and set_seed (including tokenizer = T5Tokenizer.from_pretrained(...), model =
T5ForConditionalGeneration.from_pretrained(...), device setup and
model.to(device), nlp = spacy.load(...), s2v.from_disk(...),
fdist/FreqDist(brown.words()), normalized_levenshtein, and the call to
set_seed(42)); then update each of the four concrete generator classes to
inherit from this base and remove their local duplicated __init__ and set_seed
bodies, calling super().__init__ from their constructors as needed so only
class-specific setup remains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 566a5c98-74f4-4a81-bc33-7c0f064b220a

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 2c52bf8.

📒 Files selected for processing (4)
  • backend/Generator/advanced_qa.py
  • backend/Generator/answer_predictor.py
  • backend/Generator/question_generators.py
  • backend/Generator/utilities.py

Comment on lines +33 to +43

    def predict_answer(self, payload):
        answers = []
        inp = {
                "input_text": payload.get("input_text"),
                "input_question" : payload.get("input_question")
            }
        for ques in payload.get("input_question"):

            context = inp["input_text"]
            question = ques
            input_text = "question: %s <s> context: %s </s>" % (question, context)

⚠️ Potential issue | 🟠 Major

Validate request payload before iterating questions.

Line 39 iterates over payload.get("input_question") without validation; missing/invalid input causes runtime failures (TypeError) instead of a controlled response.

💡 Proposed fix
 def predict_answer(self, payload):
-        answers = []
-        inp = {
-                "input_text": payload.get("input_text"),
-                "input_question" : payload.get("input_question")
-            }
-        for ques in payload.get("input_question"):
-                
-            context = inp["input_text"]
+        answers = []
+        context = payload.get("input_text")
+        questions = payload.get("input_question")
+
+        if not isinstance(context, str) or not context.strip():
+            raise ValueError("input_text must be a non-empty string")
+        if not isinstance(questions, list):
+            raise ValueError("input_question must be a list of strings")
+
+        for ques in questions:
             question = ques
             input_text = "question: %s <s> context: %s </s>" % (question, context)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    def predict_answer(self, payload):
        answers = []
        inp = {
                "input_text": payload.get("input_text"),
                "input_question" : payload.get("input_question")
            }
        for ques in payload.get("input_question"):
            context = inp["input_text"]
            question = ques
            input_text = "question: %s <s> context: %s </s>" % (question, context)

After:

    def predict_answer(self, payload):
        answers = []
        context = payload.get("input_text")
        questions = payload.get("input_question")

        if not isinstance(context, str) or not context.strip():
            raise ValueError("input_text must be a non-empty string")
        if not isinstance(questions, list):
            raise ValueError("input_question must be a list of strings")

        for ques in questions:
            question = ques
            input_text = "question: %s <s> context: %s </s>" % (question, context)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/answer_predictor.py` around lines 33 - 43, The
predict_answer method currently iterates payload.get("input_question") without
validation which can raise TypeError; update predict_answer to validate the
payload first by checking payload is a dict and that
payload.get("input_question") is present and is an iterable/list (and that
payload.get("input_text") is a non-empty string) before the for loop, returning
or raising a controlled error response if validation fails; add clear guards
around the for ques in payload.get("input_question") loop and use the validated
local variables (inp, context, question) to proceed only when checks pass.
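The validation step described above can live in a small standalone helper so every endpoint rejects malformed bodies the same way. validate_qa_payload is a hypothetical name used here for illustration; raising ValueError lets the server layer translate the failure into a 400 response instead of a TypeError stack trace.

```python
def validate_qa_payload(payload):
    """Validate the request body before answer prediction (sketch).

    Returns (context, questions) or raises ValueError with a clear message.
    """
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    context = payload.get("input_text")
    questions = payload.get("input_question")
    if not isinstance(context, str) or not context.strip():
        raise ValueError("input_text must be a non-empty string")
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("input_question must be a list of strings")
    return context, questions
```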

Comment on lines +34 to +43
def generate_mcq(self, payload):
start_time = time.time()
inp = {
"input_text": payload.get("input_text"),
"max_questions": payload.get("max_questions", 4)
}

text = inp['input_text']
sentences = tokenize_into_sentences(text)
modified_text = " ".join(sentences)

⚠️ Potential issue | 🟠 Major

Guard input_text before tokenization/sentence splitting.

These paths assume payload["input_text"] is a valid non-empty string; malformed input leads to runtime exceptions.

✅ Proposed pattern
-        text = inp['input_text']
+        text = inp['input_text']
+        if not isinstance(text, str) or not text.strip():
+            raise ValueError("input_text must be a non-empty string")

Also applies to: 93-101, 148-152, 211-214

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` around lines 34 - 43, The
generate_mcq function assumes payload["input_text"] is a non-empty string before
calling tokenize_into_sentences and building sentences/modified_text; add a
guard that validates payload.get("input_text") is a string and not empty (e.g.,
if not isinstance(inp_text, str) or not inp_text.strip(): handle early by
raising ValueError or returning a clear error response), then proceed to call
tokenize_into_sentences and compute sentences and modified_text; apply the same
validation pattern to the other blocks that use tokenize_into_sentences (the
other occurrences noted) so you never call tokenize_into_sentences with
malformed or None input.

Comment on lines +57 to +60
try:
generated_questions = generate_multiple_choice_questions(keyword_sentence_mapping, self.device, self.tokenizer, self.model, self.s2v, self.normalized_levenshtein)
except:
return final_output

⚠️ Potential issue | 🟠 Major

Replace bare except with explicit exception handling.

Line 59 swallows all exceptions and obscures root causes. Catch Exception explicitly and log context.

🛠️ Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
...
-            try:
-                generated_questions = generate_multiple_choice_questions(keyword_sentence_mapping, self.device, self.tokenizer, self.model, self.s2v, self.normalized_levenshtein)
-            except:
-                return final_output
+            try:
+                generated_questions = generate_multiple_choice_questions(
+                    keyword_sentence_mapping,
+                    self.device,
+                    self.tokenizer,
+                    self.model,
+                    self.s2v,
+                    self.normalized_levenshtein,
+                )
+            except Exception:
+                logger.exception("MCQ generation failed")
+                return final_output
🧰 Tools
🪛 Ruff (0.15.6)

[error] 59-59: Do not use bare except

(E722)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` around lines 57 - 60, The bare
except is hiding errors when calling generate_multiple_choice_questions; replace
it with an explicit except Exception as e and log the exception and context
before returning final_output. Specifically, in the block that calls
generate_multiple_choice_questions(keyword_sentence_mapping, self.device,
self.tokenizer, self.model, self.s2v, self.normalized_levenshtein), catch
Exception as e, emit a helpful log via the module/class logger (or
processLogger) including the exception e and relevant context like
keyword_sentence_mapping or self.model identifier, then return final_output.

        modified_text = " ".join(sentences)
        answer = self.random_choice()
        form = "truefalse: %s passage: %s </s>" % (modified_text, answer)
        print(form)

⚠️ Potential issue | 🟠 Major

Remove raw prompt logging to avoid leaking user text.

Line 217 prints full request-derived content to stdout. That can leak sensitive data in logs.

🔒 Proposed fix
-        print(form)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(form)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/question_generators.py` at line 217, Remove the raw prompt
dump "print(form)" which exposes user-derived content; instead either delete
that statement or replace it with a safe, non-sensitive log (e.g., a call to the
module logger with a redacted or anonymized summary). Locate the plain
print(form) call in question_generators.py and remove it; if you must keep an
audit, use the project's logger and log only metadata (e.g., form length or a
masked token) rather than the full form contents.
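If an audit trail is still wanted after removing print(form), logging only metadata keeps the user text out of stdout. log_prompt_metadata is a hypothetical helper sketched for this purpose: a short content digest lets two log lines be correlated without revealing the prompt itself.

```python
import hashlib
import logging

logger = logging.getLogger("question_generators")


def log_prompt_metadata(form: str) -> str:
    """Log non-sensitive metadata about a prompt instead of its contents.

    Returns the short digest so callers/tests can correlate entries.
    """
    digest = hashlib.sha256(form.encode("utf-8")).hexdigest()[:12]
    # Length and digest only; the raw prompt never reaches the log stream.
    logger.debug("boolq prompt built: %d chars, sha256=%s", len(form), digest)
    return digest
```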

Comment on lines +52 to +57

    def extract_text_from_pdf(self, file_path):
        doc = fitz.open(file_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text

⚠️ Potential issue | 🟠 Major

Close PDF resources deterministically after extraction.

fitz.open(...) is not closed explicitly. Use a context manager to avoid descriptor/memory leaks.

🧹 Proposed fix
     def extract_text_from_pdf(self, file_path):
-        doc = fitz.open(file_path)
         text = ""
-        for page in doc:
-            text += page.get_text()
+        with fitz.open(file_path) as doc:
+            for page in doc:
+                text += page.get_text()
         return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 52 - 57, The
extract_text_from_pdf function opens a PDF with fitz.open(file_path) but never
closes it; change the implementation to open the document with a context manager
(e.g., with fitz.open(file_path) as doc:) so the document is deterministically
closed after extracting text, still iterating pages to build and return the text
string in the same function.
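The deterministic-close guarantee generalizes beyond fitz: for any handle that exposes close() (whether or not it also supports `with` directly), contextlib.closing gives the same behavior. FakeDoc below is a stand-in written for illustration; it mimics a document with iterable pages and a close() method so the pattern can be shown without PyMuPDF installed.

```python
from contextlib import closing


class FakeDoc:
    """Stand-in for a document handle: iterable pages plus a close() method."""

    def __init__(self, pages):
        self._pages = pages
        self.closed = False

    def __iter__(self):
        return iter(self._pages)

    def close(self):
        self.closed = True


def extract_text(doc_factory) -> str:
    # closing() guarantees close() runs even if iteration raises mid-way.
    text = ""
    with closing(doc_factory()) as doc:
        for page in doc:
            text += page
    return text
```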

        return text

    def extract_text_from_docx(self, file_path):
        with open(file_path, "rb") as docx_file:

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.
@Aditya30ag (Contributor, Author) commented:

Need Tests !!!!

        if not file_abs.startswith(upload_abs):
            raise ValueError("Invalid file path")

        file.save(file_path)

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

        try:
            if safe_filename.endswith('.txt'):
                with open(file_path, 'r', encoding='utf-8') as f:

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

            return ""
        finally:
            # Clean up the temporary file
            if os.path.exists(file_path):

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

        finally:
            # Clean up the temporary file
            if os.path.exists(file_path):
                os.remove(file_path)

Check failure: Code scanning / CodeQL
Uncontrolled data used in path expression (High): This path depends on a user-provided value.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/Generator/utilities.py (1)

105-108: ⚠️ Potential issue | 🟡 Minor

Path validation startswith check is missing trailing separator.

The check file_abs.startswith(upload_abs) can be bypassed if a sibling directory shares the prefix (e.g., /app/uploads_evil/x passes when upload_abs = "/app/uploads"). Append os.sep for correctness.

While the UUID-based filename sanitization mitigates practical exploitation here, this is a defense-in-depth measure that should be correct.

🛡️ Proposed fix
     def extract_text_from_docx(self, file_path):
         # Validate file path is within upload folder
         upload_abs = os.path.abspath(self.upload_folder)
         file_abs = os.path.abspath(file_path)
-        if not file_abs.startswith(upload_abs):
+        if not file_abs.startswith(upload_abs + os.sep):
             raise ValueError("Invalid file path")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 105 - 108, The current
path-check using file_abs.startswith(upload_abs) can be bypassed by sibling
dirs; ensure upload_abs includes a trailing separator before the startswith
check (e.g., normalize upload_abs to os.path.abspath(self.upload_folder) +
os.sep or append os.sep when comparing) so the comparison becomes
file_abs.startswith(upload_abs_with_sep); update the variables upload_abs and
the conditional that raises ValueError("Invalid file path") accordingly to
prevent prefix collisions.
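A containment check built on os.path.commonpath avoids the sibling-prefix bypass entirely, without needing to remember the trailing separator. is_within is a hypothetical helper name used here to illustrate the pattern:

```python
import os


def is_within(base: str, candidate: str) -> bool:
    """True only if `candidate` resolves to a path inside `base`.

    os.path.commonpath is immune to the sibling-prefix bypass that a plain
    startswith check allows (e.g. /app/uploads vs /app/uploads_evil).
    """
    base_abs = os.path.abspath(base)
    cand_abs = os.path.abspath(candidate)
    return os.path.commonpath([base_abs, cand_abs]) == base_abs
```

The same helper also rejects `..` traversal, because abspath normalizes the candidate before the comparison.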
🧹 Nitpick comments (2)
backend/Generator/utilities.py (2)

148-167: Use isinstance and simplify conditional expressions.

Minor style suggestions:

  1. Line 157: replace `type(answer) is list` with `isinstance(answer, list)`; isinstance is more Pythonic and handles subclasses.
  2. Lines 150, 162, 167: Using np.where() for simple conditionals is unconventional and returns numpy types. Consider standard Python ternary expressions for clarity.
💡 Suggested changes
-        space = " " * int(np.where(i < 9, 3, 4))
+        space = " " * (3 if i < 9 else 4)
 
         print(f"{i + 1}) Q: {qa_list[i]['question']}")
 
         answer = qa_list[i]["answer"]
 
         # print a list of multiple choice answers
-        if type(answer) is list:
+        if isinstance(answer, list):
 
             if show_answers:
                 print(
                     f"{space}A: 1. {answer[0]['answer']} "
-                    f"{np.where(answer[0]['correct'], '(correct)', '')}"
+                    f"{'(correct)' if answer[0]['correct'] else ''}"
                 )
                 for j in range(1, len(answer)):
                     print(
                         f"{space + '   '}{j + 1}. {answer[j]['answer']} "
-                        f"{np.where(answer[j]['correct'], '(correct)', '')}"
+                        f"{'(correct)' if answer[j]['correct'] else ''}"
                     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 148 - 167, The loop uses
type(answer) is list and np.where for simple conditionals; change type(answer)
is list to isinstance(answer, list) (ref: qa_list and answer variables and
show_answers branch) and replace np.where usages with Python ternary
expressions: compute space using " " * (3 if i < 9 else 4) and replace
occurrences like np.where(answer[k]['correct'], '(correct)', '') with
'(correct)' if answer[k]['correct'] else '' so standard Python types are
returned.

135-136: Returning empty string for unsupported file types may silently mask errors.

When an unsupported file type is uploaded, returning "" could lead to downstream issues with empty content being processed. Consider raising an exception or logging a warning.

💡 Proposed fix
             elif safe_filename.endswith('.docx'):
                 return self.extract_text_from_docx(file_path)
             else:
-                return ""
+                raise ValueError(f"Unsupported file type: {safe_filename}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 135 - 136, The else branch in
utilities.py that currently returns an empty string for unsupported file types
should not silently mask errors; replace that return "" with a raised exception
(e.g., raise ValueError or define and raise an UnsupportedFileTypeError)
including contextual info (file name and detected type) and/or emit a warning
log via the module logger before raising; update the function that handles
uploaded file parsing (the branch that ends with return "") to raise the error
and ensure callers either catch it or propagate it.
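The fail-loudly dispatch suggested above can be reduced to a small sketch. dispatch_extraction and the SUPPORTED set are illustrative (the extension list mirrors the PR's FileProcessor, but this is not the project's actual function):

```python
import os

SUPPORTED = {".txt", ".pdf", ".docx"}


def dispatch_extraction(safe_filename: str) -> str:
    """Pick an extraction route by extension; fail loudly on anything else.

    Returning "" for unknown types would let empty text flow into question
    generation silently; raising surfaces the problem at the boundary.
    """
    ext = os.path.splitext(safe_filename)[1].lower()
    if ext not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {safe_filename}")
    return ext
```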
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/utilities.py`:
- Around line 119-123: The current check using file_abs.startswith(upload_abs)
can be bypassed (same as in extract_text_from_docx); update the validation in
utilities.py to use a robust comparison such as os.path.commonpath([upload_abs,
file_abs]) != upload_abs (or ensure upload_abs ends with os.sep and compare
startswith upload_abs + os.sep) to guarantee file_path is contained within
self.upload_folder; adjust the block that computes upload_abs and file_abs and
raise ValueError("Invalid file path") when the containment check fails, matching
the fix applied in extract_text_from_docx.

---

Duplicate comments:
In `@backend/Generator/utilities.py`:
- Around line 105-108: The current path-check using
file_abs.startswith(upload_abs) can be bypassed by sibling dirs; ensure
upload_abs includes a trailing separator before the startswith check (e.g.,
normalize upload_abs to os.path.abspath(self.upload_folder) + os.sep or append
os.sep when comparing) so the comparison becomes
file_abs.startswith(upload_abs_with_sep); update the variables upload_abs and
the conditional that raises ValueError("Invalid file path") accordingly to
prevent prefix collisions.

---

Nitpick comments:
In `@backend/Generator/utilities.py`:
- Around line 148-167: The loop uses type(answer) is list and np.where for
simple conditionals; change type(answer) is list to isinstance(answer, list)
(ref: qa_list and answer variables and show_answers branch) and replace np.where
usages with Python ternary expressions: compute space using " " * (3 if i < 9
else 4) and replace occurrences like np.where(answer[k]['correct'], '(correct)',
'') with '(correct)' if answer[k]['correct'] else '' so standard Python types
are returned.
- Around line 135-136: The else branch in utilities.py that currently returns an
empty string for unsupported file types should not silently mask errors; replace
that return "" with a raised exception (e.g., raise ValueError or define and
raise an UnsupportedFileTypeError) including contextual info (file name and
detected type) and/or emit a warning log via the module logger before raising;
update the function that handles uploaded file parsing (the branch that ends
with return "") to raise the error and ensure callers either catch it or
propagate it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b84365d-2e8e-415a-80c4-907b3d400933

📥 Commits

Reviewing files that changed from the base of the PR and between 0f1cb4c and de53385.

📒 Files selected for processing (1)
  • backend/Generator/utilities.py

Comment on lines +119 to +123
# Ensure the file path is within the upload folder
upload_abs = os.path.abspath(self.upload_folder)
file_abs = os.path.abspath(file_path)
if not file_abs.startswith(upload_abs):
raise ValueError("Invalid file path")

⚠️ Potential issue | 🟡 Minor

Same startswith pattern issue as in extract_text_from_docx.

Apply the same fix here for consistency and correctness.

🛡️ Proposed fix
         # Ensure the file path is within the upload folder
         upload_abs = os.path.abspath(self.upload_folder)
         file_abs = os.path.abspath(file_path)
-        if not file_abs.startswith(upload_abs):
+        if not file_abs.startswith(upload_abs + os.sep):
             raise ValueError("Invalid file path")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/utilities.py` around lines 119 - 123, The current check
using file_abs.startswith(upload_abs) can be bypassed (same as in
extract_text_from_docx); update the validation in utilities.py to use a robust
comparison such as os.path.commonpath([upload_abs, file_abs]) != upload_abs (or
ensure upload_abs ends with os.sep and compare startswith upload_abs + os.sep)
to guarantee file_path is contained within self.upload_folder; adjust the block
that computes upload_abs and file_abs and raise ValueError("Invalid file path")
when the containment check fails, matching the fix applied in
extract_text_from_docx.



Development

Successfully merging this pull request may close these issues.

[GOOD FIRST ISSUE]: main.py too large and hard to maintain

1 participant