Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(coordinator): assign static prover first and avoid reassigning failed task to same prover #1584

Open
wants to merge 25 commits into
base: develop
Choose a base branch
from

Conversation

yiweichi
Copy link
Member

@yiweichi yiweichi commented Dec 30, 2024

…ailing task to same prover

Purpose or design rationale of this PR

Describe your change. Make sure to answer these three questions: What does this PR do? Why does it do it? How does it do it?

  1. Tasks assignment static prover first and then dynamic prover (this is to save cost)
  2. Coordinator doesn’t dispatch the same failing job to the same prover

docs

PR title

Your PR title must follow conventional commits (as we are doing squash merge for each PR), so it must start with one of the following types:

  • build: Changes that affect the build system or external dependencies (example scopes: yarn, eslint, typescript)
  • ci: Changes to our CI configuration files and scripts (example scopes: vercel, github, cypress)
  • docs: Documentation-only changes
  • feat: A new feature
  • fix: A bug fix
  • perf: A code change that improves performance
  • refactor: A code change that doesn't fix a bug, or add a feature, or improves performance
  • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
  • test: Adding missing tests or correcting existing tests

Deployment tag versioning

Has tag in common/version.go been updated or have you added bump-version label to this PR?

  • No, this PR doesn't involve a new deployment, git tag, docker image tag
  • Yes

Breaking change label

Does this PR have the breaking-change label?

  • No, this PR is not a breaking change
  • Yes

Summary by CodeRabbit

  • Configuration

    • Added new configuration parameter external_prover_threshold to control external prover task assignment.
  • Task Management

    • Enhanced task assignment logic for external provers.
    • Implemented batch retrieval of assigned and unassigned tasks for batches, bundles, and chunks.
    • Added count retrieval for unassigned tasks to optimize prover task distribution.
  • Improvements

    • Improved error handling and logging for task assignment processes.
    • Added flexibility in task retrieval and assignment mechanisms.
    • Introduced new method to retrieve specific prover tasks based on parameters.
  • Version Update

    • Updated version tag to v4.4.86.

Copy link

coderabbitai bot commented Dec 30, 2024

Warning

Rate limit exceeded

@yiweichi has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 13 minutes and 30 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 5ad00e7 and 078ac88.

📒 Files selected for processing (3)
  • coordinator/internal/controller/api/auth.go (2 hunks)
  • coordinator/internal/logic/provertask/prover_task.go (2 hunks)
  • coordinator/test/mock_prover.go (1 hunks)

Walkthrough

The pull request introduces a new configuration parameter external_prover_threshold to manage task assignment for external provers. This change modifies the configuration, ORM methods, and task assignment logic across multiple files in the coordinator package. The modifications enhance the flexibility of task retrieval and assignment, particularly for external provers, by allowing batch retrieval of tasks and implementing a threshold-based assignment mechanism.

Changes

File Change Summary
coordinator/conf/config.json Added external_prover_threshold configuration parameter set to 32
coordinator/internal/config/config.go Added ExternalProverThreshold field to ProverManager struct
coordinator/internal/logic/provertask/prover_task.go Added ExternalProverNamePrefix global variable
coordinator/internal/logic/provertask/*_prover_task.go Updated task assignment logic for batch, bundle, and chunk provers to handle external provers
coordinator/internal/orm/batch.go Modified methods to support batch retrieval and added unassigned batch count method
coordinator/internal/orm/bundle.go Updated bundle retrieval methods and added unassigned bundle count method
coordinator/internal/orm/chunk.go Modified chunk retrieval methods and added unassigned chunk count method
coordinator/internal/orm/prover_task.go Added GetTaskOfProver method to retrieve specific prover tasks
common/version/version.go Updated version tag from v4.4.85 to v4.4.86
coordinator/internal/controller/api/auth.go Enhanced JWT claims in PayloadFunc to include ProverProviderType
coordinator/internal/logic/auth/login.go Added validation for ProverProviderType in login logic
coordinator/internal/types/auth.go Introduced ProverProviderTypeKey constant and ProverProviderType field in Message struct
coordinator/internal/types/prover.go Added ProverProviderType type and constants for internal and external providers
coordinator/internal/utils/prover_name.go Introduced IsExternalProverNameMatch function for name validation

Sequence Diagram

sequenceDiagram
    participant Prover
    participant Coordinator
    participant Database

    Prover->>Coordinator: Request Task
    Coordinator->>Database: Check Unassigned Tasks Count
    Database-->>Coordinator: Return Task Count
    alt Count Below Threshold
        Coordinator-->>Prover: No Task Assigned
    else Count Above Threshold
        Coordinator->>Database: Retrieve Assigned/Unassigned Tasks
        Database-->>Coordinator: Return Tasks
        Coordinator->>Prover: Assign Suitable Task
    end
Loading

Poem

🐰 Hopping through code with glee,
External provers, now set free!
Threshold magic, tasks galore,
Batch retrieval opens a new door
Coordination's rabbit dance! 🚀


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (13)
coordinator/internal/logic/provertask/chunk_prover_task.go (2)

65-75: External prover threshold check
This conditional correctly enforces a minimum number of unassigned tasks before assigning them to external provers, helping to optimize resource usage. Consider adding a debug log indicating when no tasks are assigned.

 if unassignedChunkCount < cp.cfg.ProverManager.ExternalProverThreshold {
+    log.Debug("Skipping external prover assignment",
+              "count", unassignedChunkCount,
+              "threshold", cp.cfg.ProverManager.ExternalProverThreshold)
     return nil, nil
 }

94-117: Avoid re-dispatching failing tasks & concurrency concern
The loop prevents reassigning the same failing task to the same prover. While this logic is correct, confirm that concurrent assignments are sufficiently protected. For high concurrency, consider row-level locks or other synchronization in the DB to avoid competing updates.

coordinator/internal/logic/provertask/bundle_prover_task.go (2)

67-77: External prover threshold check
Similar to the chunk logic, this section cleanly limits external prover usage by verifying if the unassigned bundle count is above the threshold. Consider logging debug info when skipping assignment.

 if unassignedBundleCount < bp.cfg.ProverManager.ExternalProverThreshold {
+    log.Debug("Skipping external prover assignment for bundles",
+              "count", unassignedBundleCount,
+              "threshold", bp.cfg.ProverManager.ExternalProverThreshold)
     return nil, nil
 }

96-119: Loop for assigned/unassigned tasks & fail-check
This loop ensures the same failing task won’t be re-dispatched to the same prover. Validate concurrent updates to avoid race conditions if multiple coordinator instances operate simultaneously.

coordinator/internal/logic/provertask/batch_prover_task.go (1)

96-119: Retry loop and preventing re-dispatch
The loop logic checks assigned tasks first, then unassigned tasks, while ensuring failed tasks are not re-dispatched to the same prover. Consider verifying concurrency safety.

coordinator/internal/orm/chunk.go (3)

76-86: GetUnassignedChunk implementation
Using raw SQL with fmt.Sprintf is generally safe here since the parameters are numeric, but parameter binding is often safer against injection. Consider GORM chain calls to reduce potential risk and improve maintainability.

- sql := fmt.Sprintf("SELECT * FROM chunk WHERE proving_status = %d AND total_attempts < %d ... LIMIT %d;", ...)
- err := db.Raw(sql).Scan(&chunks).Error
+ err := db.Model(&Chunk{}).
+    Where("proving_status = ?", int(types.ProvingTaskUnassigned)).
+    Where("total_attempts < ?", maxTotalAttempts).
+    Where("active_attempts < ?", maxActiveAttempts).
+    Where("end_block_number <= ?", height).
+    Where("deleted_at IS NULL").
+    Order("index").
+    Limit(int(limit)).
+    Find(&chunks).Error

88-101: GetUnassignedChunkCount mismatch in doc
The docstring references “based on the specified limit,” yet no limit parameter is used. Consider updating the docstring for accuracy.


104-115: GetAssignedChunks
Similar to GetUnassignedChunk, this also uses raw SQL. Consider switching to GORM chain calls for improved safety and maintainability.

coordinator/internal/orm/prover_task.go (1)

151-166: Clarify docstring vs. function name.

The doc comment says:

// GetTaskOfOtherProvers get the chunk/batch task of prover

But the function is named GetTaskOfProver. This mismatch may cause confusion. Please update the doc comment or rename the function to keep them in sync.

coordinator/internal/orm/bundle.go (1)

57-69: Fix verbiage in error message and doc comment.

  1. The error message references “Batch.GetUnassignedBundles” instead of “Bundle.GetUnassignedBundles”.
  2. The doc comment mentions “The returned batch sorts…” but the function works with bundles.

These references could create confusion. Rename them to be consistent with the “Bundle” function context.

coordinator/internal/orm/batch.go (3)

86-87: Consider using GORM query builder instead of raw SQL

While the current implementation using fmt.Sprintf is safe since all variables are numeric, using GORM's query builder would be more maintainable and safer by default. This also makes it easier to modify the query structure in the future.

-	sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;",
-		int(types.ProvingTaskUnassigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit)
-	err := db.Raw(sql).Scan(&batch).Error
+	err := db.Where("proving_status = ?", types.ProvingTaskUnassigned).
+		Where("total_attempts < ?", maxTotalAttempts).
+		Where("active_attempts < ?", maxActiveAttempts).
+		Where("chunk_proofs_status = ?", types.ChunkProofsStatusReady).
+		Order("index ASC").
+		Limit(int(limit)).
+		Find(&batch).Error

81-93: Consider adding an index on the batch.index column

The query uses ORDER BY batch.index, which could be slow for large datasets without an index. Consider adding an index if one doesn't exist.

CREATE INDEX idx_batch_index ON batch(index);

110-121: Reduce code duplication with GetUnassignedBatches

The GetAssignedBatches method is nearly identical to GetUnassignedBatches except for the proving_status condition. Consider extracting the common logic into a shared method.

+func (o *Batch) getBatchesByStatus(ctx context.Context, status types.ProvingStatus, maxActiveAttempts, maxTotalAttempts uint8, limit uint64) ([]*Batch, error) {
+	var batch []*Batch
+	return db.Where("proving_status = ?", status).
+		Where("total_attempts < ?", maxTotalAttempts).
+		Where("active_attempts < ?", maxActiveAttempts).
+		Where("chunk_proofs_status = ?", types.ChunkProofsStatusReady).
+		Order("index ASC").
+		Limit(int(limit)).
+		Find(&batch).Error
+}
+
 func (o *Batch) GetUnassignedBatches(ctx context.Context, maxActiveAttempts, maxTotalAttempts uint8, limit uint64) ([]*Batch, error) {
-	var batch []*Batch
-	db := o.db.WithContext(ctx)
-	sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;",
-		int(types.ProvingTaskUnassigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit)
-	err := db.Raw(sql).Scan(&batch).Error
-	if err != nil {
-		return nil, fmt.Errorf("Batch.GetUnassignedBatches error: %w", err)
-	}
-	return batch, nil
+	return o.getBatchesByStatus(ctx, types.ProvingTaskUnassigned, maxActiveAttempts, maxTotalAttempts, limit)
 }
 
 func (o *Batch) GetAssignedBatches(ctx context.Context, maxActiveAttempts, maxTotalAttempts uint8, limit uint64) ([]*Batch, error) {
-	var batch []*Batch
-	db := o.db.WithContext(ctx)
-	sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;",
-		int(types.ProvingTaskAssigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit)
-	err := db.Raw(sql).Scan(&batch).Error
-	if err != nil {
-		return nil, fmt.Errorf("Batch.GetAssignedBatches error: %w", err)
-	}
-	return batch, nil
+	return o.getBatchesByStatus(ctx, types.ProvingTaskAssigned, maxActiveAttempts, maxTotalAttempts, limit)
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f92029a and 791fcaf.

📒 Files selected for processing (11)
  • coordinator/conf/config.json (1 hunks)
  • coordinator/internal/config/config.go (1 hunks)
  • coordinator/internal/logic/provertask/batch_prover_task.go (2 hunks)
  • coordinator/internal/logic/provertask/bundle_prover_task.go (2 hunks)
  • coordinator/internal/logic/provertask/chunk_prover_task.go (2 hunks)
  • coordinator/internal/logic/provertask/prover_task.go (1 hunks)
  • coordinator/internal/orm/batch.go (1 hunks)
  • coordinator/internal/orm/bundle.go (1 hunks)
  • coordinator/internal/orm/chunk.go (1 hunks)
  • coordinator/internal/orm/prover_task.go (1 hunks)
  • rollup/config.json (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • coordinator/internal/logic/provertask/prover_task.go
🔇 Additional comments (15)
coordinator/internal/logic/provertask/chunk_prover_task.go (3)

7-7: Use of strings package
Importing the strings package is necessary for prefix checks in subsequent lines. This is valid and consistent with the new external prover logic.


81-82: Offset-based iteration
Introducing separate offsets for assigned and unassigned tasks enhances clarity and supports batch assignment logic. This is a fine approach.


89-90: Retrieving unassigned chunks
Fetching up to 50 unassigned tasks is aligned with the new batch approach. The error logging is helpful for troubleshooting.

coordinator/internal/logic/provertask/bundle_prover_task.go (3)

7-7: Import strings
The strings import is added for prefix checks. This is appropriate given the external prover feature.


83-84: Assigned/unassigned offset
Using separate offsets for assigned and unassigned bundles promotes clarity in batch retrieval and assignment.


90-91: Fetching unassigned bundles
Fetching multiple unassigned tasks at once scales better than a single-task approach. Good error handling is present.

coordinator/internal/logic/provertask/batch_prover_task.go (4)

7-7: Import strings
Necessary for prefix checks related to external prover logic.


67-77: External prover threshold
Checking unassignedBatchCount ensures external provers only get assigned when tasks exceed a threshold. This aligns with the PR goal of reducing cost.


83-84: Offset definitions
Setting up assignedOffset and unassignedOffset makes it simpler to process tasks in batches.


90-91: Retrieving unassigned batches
Properly retrieves batches in bulk, improving assignment throughput. The error logging is appropriately handled.

coordinator/internal/config/config.go (1)

19-20: New ExternalProverThreshold field
Adding this threshold field is consistent with the logic implemented in the prover tasks. This helps reduce cost by controlling when external provers are utilized.

coordinator/internal/orm/bundle.go (2)

71-83: Validate boundary conditions for counting unassigned bundles.

This function appears correct. However, validate that upstream logic handles the scenario when maxActiveAttempts and maxTotalAttempts are lower than the usage patterns, to avoid retrieving trivial results or none at all.


86-97: Ensure consistent naming in error handling.

Similar to the previous function, the error message references “Bundle.GetAssignedBundles,” which is correct. Just be sure the code references remain consistent across the repository for easier debugging.

coordinator/conf/config.json (1)

5-5: Confirm correctness of threshold.

Adding external_prover_threshold = 32 is good if that’s the intended cutoff. Double-check that calling logic interprets this value as intended.

rollup/config.json (1)

1-40: Watch out for potential exposure of environment-specific values.

The L1 endpoint and relayer configuration appear to contain environment-specific details. Confirm these settings are correct for the environment (test, staging, production).

rollup/config.json Outdated Show resolved Hide resolved
coordinator/internal/orm/batch.go Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Dec 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 53 lines in your changes missing coverage. Please review.

Project coverage is 51.79%. Comparing base (fa0927c) to head (dd3c9e1).

Files with missing lines Patch % Lines
coordinator/internal/orm/prover_task.go 0.00% 15 Missing ⚠️
coordinator/internal/orm/batch.go 0.00% 13 Missing ⚠️
coordinator/internal/orm/chunk.go 0.00% 13 Missing ⚠️
coordinator/internal/orm/bundle.go 0.00% 12 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1584      +/-   ##
===========================================
- Coverage    52.01%   51.79%   -0.23%     
===========================================
  Files          157      157              
  Lines        12476    12529      +53     
===========================================
  Hits          6490     6490              
- Misses        5424     5477      +53     
  Partials       562      562              
Flag Coverage Δ
coordinator 16.57% <0.00%> (-0.57%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
coordinator/internal/orm/batch.go (1)

81-93: Consider using GORM query builder instead of raw SQL.

The raw SQL query could be replaced with GORM's query builder for better maintainability and type safety. This would also help prevent potential SQL injection vulnerabilities.

-	sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;",
-		int(types.ProvingTaskUnassigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit)
-	err := db.Raw(sql).Scan(&batch).Error
+	err := db.Model(&Batch{}).
+		Where("proving_status = ?", int(types.ProvingTaskUnassigned)).
+		Where("total_attempts < ?", maxTotalAttempts).
+		Where("active_attempts < ?", maxActiveAttempts).
+		Where("chunk_proofs_status = ?", int(types.ChunkProofsStatusReady)).
+		Order("index ASC").
+		Limit(int(limit)).
+		Find(&batch).Error
coordinator/internal/logic/provertask/bundle_prover_task.go (2)

70-70: Fix incorrect error message text

The error message references "chunk" instead of "bundle" in the log statement.

-			log.Error("failed to get unassigned chunk proving tasks count", "height", getTaskParameter.ProverHeight, "err", err)
+			log.Error("failed to get unassigned bundle proving tasks count", "height", getTaskParameter.ProverHeight, "err", err)

84-84: Consider making the batch size configurable

The magic number '50' used for batch retrieval size could be moved to configuration for better maintainability and tuning.

Consider adding a configuration parameter for the batch size:

 type Config struct {
     ProverManager struct {
         ProversPerSession int
         SessionAttempts int
         ExternalProverThreshold int
+        TaskBatchSize int `json:"task_batch_size" default:"50"`
     }
 }

Then use it in the code:

-		tmpAssignedBundleTasks, getTaskError := bp.bundleOrm.GetAssignedBundles(ctx.Copy(), maxActiveAttempts, maxTotalAttempts, 50)
+		tmpAssignedBundleTasks, getTaskError := bp.bundleOrm.GetAssignedBundles(ctx.Copy(), maxActiveAttempts, maxTotalAttempts, bp.cfg.ProverManager.TaskBatchSize)

Also applies to: 91-91

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 791fcaf and a75075d.

📒 Files selected for processing (4)
  • coordinator/internal/logic/provertask/batch_prover_task.go (2 hunks)
  • coordinator/internal/logic/provertask/bundle_prover_task.go (2 hunks)
  • coordinator/internal/logic/provertask/chunk_prover_task.go (2 hunks)
  • coordinator/internal/orm/batch.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • coordinator/internal/logic/provertask/chunk_prover_task.go
  • coordinator/internal/logic/provertask/batch_prover_task.go
🔇 Additional comments (4)
coordinator/internal/orm/batch.go (1)

95-108: LGTM! Well-structured implementation.

The method follows best practices by using GORM's query builder and maintains consistency with GetUnassignedBatches conditions.

coordinator/internal/logic/provertask/bundle_prover_task.go (3)

7-7: LGTM: Import addition is appropriate

The addition of the strings package import is necessary for the new external prover name prefix checking functionality.


67-77: LGTM: External prover threshold check implementation

The implementation correctly enforces the threshold check for external provers, aligning with the PR's objective to prioritize static provers.


83-119: LGTM: Improved task assignment logic with failure prevention

The implementation efficiently handles task assignment using batch retrieval and properly prevents reassigning failed tasks to the same prover. The offset-based iteration is a clean approach to process both assigned and unassigned tasks.

Let's verify the prover task status check implementation:

✅ Verification successful

Task reassignment prevention logic is consistently implemented

The verification confirms that the task reassignment prevention logic is correctly implemented across the codebase:

  • The same prevention check types.ProverProveStatus(proverTask.ProvingStatus) != types.ProverProofInvalid is consistently used in all prover task handlers (chunk, batch, and bundle)
  • The status updates to ProverProofInvalid are properly handled in the proof receiver and task collection components
  • The implementation includes proper error handling and logging for invalid proof submissions

Let's verify the task status transitions to ensure the complete flow is correct:


Task status management is well-defined and properly enforced

The verification confirms that the task assignment logic is robust and correctly implemented:

  1. The status transitions are well-defined with clear enum values:

    • ProverProveStatusUndefined (0)
    • ProverAssigned (1)
    • ProverProofValid (2)
    • ProverProofInvalid (3)
  2. The prevention of reassigning failed tasks to the same prover is consistently implemented across all task types (chunk, batch, bundle) using the same status check.

  3. The status flow ensures that:

    • Tasks start in an unassigned state
    • Get marked as assigned when dispatched
    • Can transition to either valid or invalid based on proof verification
    • Invalid proofs from a prover prevent that same prover from getting reassigned the task
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that failed tasks are not being reassigned to the same prover

# Search for all prover task status checks in the codebase
echo "Searching for prover task status checks..."
rg -A 3 "types\.ProverProofInvalid" 

# Search for potential places where tasks might be reassigned
echo "Searching for task assignment patterns..."
ast-grep --pattern 'if $_ != types.ProverProofInvalid { $$$$ }'

Length of output: 4257


Script:

#!/bin/bash
# Check the task status enum definition and transitions
echo "Checking task status definitions..."
rg -A 10 "type ProverProveStatus"

# Check status update implementations
echo "Checking status update implementations..."
ast-grep --pattern 'UpdateProverTaskProvingStatus($$$)'

Length of output: 900

coordinator/internal/orm/batch.go Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
coordinator/internal/logic/provertask/bundle_prover_task.go (1)

96-119: Consider adding early return optimization

The implementation is solid but could be optimized by adding an early return when both offsets exceed their respective list lengths.

 for {
+    if assignedOffset >= len(tmpAssignedBundleTasks) && unassignedOffset >= len(tmpUnassignedBundleTask) {
+        log.Debug("no more tasks available", "height", getTaskParameter.ProverHeight)
+        return nil, nil
+    }
     tmpBundleTask = nil
     if assignedOffset < len(tmpAssignedBundleTasks) {
coordinator/internal/logic/provertask/batch_prover_task.go (3)

74-76: Add debug logging for external prover threshold check

When skipping task assignment due to threshold condition, adding debug logging would help with monitoring and debugging.

 		if unassignedBatchCount < bp.cfg.ProverManager.ExternalProverThreshold {
+			log.Debug("skipping external prover assignment: below threshold", 
+				"unassigned_count", unassignedBatchCount,
+				"threshold", bp.cfg.ProverManager.ExternalProverThreshold,
+				"prover", taskCtx.ProverName)
 			return nil, nil
 		}

83-84: Consider making pagination and retry parameters configurable

The code uses hardcoded values for pagination (50) and retry attempts (5). These should be configurable parameters.

Also, the comment explaining the query split could be more descriptive.

  1. Add configuration parameters:
type Config struct {
    ProverManager struct {
        // ... existing fields ...
+       BatchTaskPaginationSize int `json:"batch_task_pagination_size"`
+       BatchTaskMaxRetries     int `json:"batch_task_max_retries"`
    }
}
  1. Improve the comment:
-       // Why here need get again? In order to support a task can assign to multiple prover, need also assign `ProvingTaskAssigned`
-       // chunk to prover. But use `proving_status in (1, 2)` will not use the postgres index. So need split the sql.
+       // Split the queries for better performance:
+       // 1. We need to support assigning a task to multiple provers, including already assigned tasks
+       // 2. Using a single query with `proving_status in (1, 2)` would not utilize the PostgreSQL index
+       // 3. Therefore, we split into separate queries for assigned and unassigned tasks to leverage indexes

Also applies to: 90-91


111-119: Consider adding metrics for failed task reassignment prevention

The code prevents reassigning failed tasks to the same prover, but there's no monitoring for how often this occurs. Adding metrics would help track the effectiveness of this prevention mechanism.

 type BatchProverTask struct {
     // ... existing fields ...
+    batchTaskReassignmentPrevented *prometheus.CounterVec
 }

 func NewBatchProverTask(cfg *config.Config, chainCfg *params.ChainConfig, db *gorm.DB, reg prometheus.Registerer) *BatchProverTask {
     bp := &BatchProverTask{
         // ... existing initialization ...
+        batchTaskReassignmentPrevented: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+            Name: "coordinator_batch_task_reassignment_prevented_total",
+            Help: "Number of times a failed task was prevented from being reassigned to the same prover.",
+        }, []string{"prover_name"}),
     }
     return bp
 }

 // In the Assign method:
 if proverTask != nil && types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofInvalid {
+    bp.batchTaskReassignmentPrevented.WithLabelValues(taskCtx.ProverName).Inc()
     continue
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c5d88d and e4c0779.

📒 Files selected for processing (2)
  • coordinator/internal/logic/provertask/batch_prover_task.go (2 hunks)
  • coordinator/internal/logic/provertask/bundle_prover_task.go (2 hunks)
🔇 Additional comments (4)
coordinator/internal/logic/provertask/bundle_prover_task.go (4)

7-7: LGTM: Import addition is appropriate

The strings package import is required for the new prefix checking functionality.


67-77: LGTM: External prover threshold check is well implemented

The implementation correctly:

  1. Identifies external provers using the prefix check
  2. Verifies unassigned bundle count against threshold
  3. Prevents unnecessary external prover assignments when load is low

83-95: LGTM: Efficient batch retrieval implementation

The implementation:

  1. Uses pagination with offset tracking
  2. Properly handles database errors
  3. Includes clear documentation explaining the index usage optimization

Line range hint 151-160: Clarify status of commented hard fork compatibility check

There's a substantial block of commented code related to hard fork compatibility checking. Please either:

  1. Remove if no longer needed
  2. Uncomment if the check is still required
  3. Add a comment explaining why this is temporarily disabled

This will help maintain code clarity and prevent technical debt.

@yiweichi yiweichi added the bump-version Bump the version tag for deployment label Jan 6, 2025
@@ -63,29 +64,59 @@ func (bp *BatchProverTask) Assign(ctx *gin.Context, getTaskParameter *coordinato

maxActiveAttempts := bp.cfg.ProverManager.ProversPerSession
maxTotalAttempts := bp.cfg.ProverManager.SessionAttempts
if strings.HasPrefix(taskCtx.ProverName, ExternalProverNamePrefix) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this way, how about adding a parameter to the request ?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
coordinator/internal/logic/provertask/bundle_prover_task.go (1)

70-70: ⚠️ Potential issue

Fix error variable in logging statement

The error logging statement uses err instead of getCountError.

-log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", err)
+log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError)
🧹 Nitpick comments (4)
coordinator/internal/orm/chunk.go (1)

91-105: Consider adding composite index for performance optimization.

To optimize the frequent queries on chunk status and attempts, consider adding a composite index on (proving_status, total_attempts, active_attempts, end_block_number).

coordinator/internal/logic/provertask/bundle_prover_task.go (3)

73-76: Add logging for external prover threshold check

Consider adding debug logging when skipping external provers due to threshold condition. This will help with monitoring and debugging the assignment process.

 if unassignedBundleCount < bp.cfg.ProverManager.ExternalProverThreshold {
+    log.Debug("skipping external prover assignment: below threshold",
+        "count", unassignedBundleCount,
+        "threshold", bp.cfg.ProverManager.ExternalProverThreshold,
+        "prover", taskCtx.ProverName)
     return nil, nil
 }

104-113: Enhance error and debug logging

The error and debug messages could be more descriptive to aid in troubleshooting.

-      log.Error("failed to get prover task of prover", "proof_type", message.ProofTypeBatch.String(), "taskID", tmpBundleTask.Hash, "key", taskCtx.PublicKey, "Prover_version", taskCtx.ProverVersion, "error", getTaskError)
+      log.Error("failed to check prover's previous attempts", "proof_type", message.ProofTypeBatch.String(), "taskID", tmpBundleTask.Hash, "key", taskCtx.PublicKey, "prover_version", taskCtx.ProverVersion, "error", getTaskError)

-      log.Debug("get empty bundle, the prover already failed this task", "height", getTaskParameter.ProverHeight)
+      log.Debug("skipping task assignment: prover previously failed this task",
+          "height", getTaskParameter.ProverHeight,
+          "task_id", tmpBundleTask.Hash,
+          "prover", taskCtx.ProverName)

Line range hint 81-82: Consider extracting retry configuration

The retry mechanism uses magic numbers (5 retries, 100ms sleep). Consider extracting these values to configuration parameters for better maintainability.

+const (
+    maxRetryAttempts = 5
+    retryDelay       = 100 * time.Millisecond
+)

-for i := 0; i < 5; i++ {
+for i := 0; i < maxRetryAttempts; i++ {
     // ... existing code ...
-    time.Sleep(100 * time.Millisecond)
+    time.Sleep(retryDelay)

Also applies to: 123-124

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d1c303 and c2075c8.

📒 Files selected for processing (7)
  • common/version/version.go (1 hunks)
  • coordinator/internal/logic/provertask/batch_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/bundle_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/chunk_prover_task.go (3 hunks)
  • coordinator/internal/orm/batch.go (1 hunks)
  • coordinator/internal/orm/bundle.go (1 hunks)
  • coordinator/internal/orm/chunk.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • common/version/version.go
  • coordinator/internal/orm/bundle.go
  • coordinator/internal/orm/batch.go
  • coordinator/internal/logic/provertask/batch_prover_task.go
🔇 Additional comments (6)
coordinator/internal/logic/provertask/chunk_prover_task.go (3)

65-75: LGTM! External prover threshold check is well implemented.

The implementation correctly prioritizes static provers by checking the unassigned task count against a threshold before assigning tasks to external provers. Error handling and logging are properly implemented.


102-111: LGTM! Task reassignment prevention is well implemented.

The implementation successfully prevents reassigning failed tasks to the same prover by checking the prover's previous task status. Error handling and logging are properly implemented.


Line range hint 143-151: Verify if version compatibility check should be removed.

The commented out code block appears to implement important version compatibility checks. Please verify if this check should be removed, as it could lead to version incompatibility issues between provers and tasks.

coordinator/internal/orm/chunk.go (1)

91-105: LGTM! GetUnassignedChunkCount implementation is clean and efficient.

The method follows GORM best practices with proper error handling and maintains consistency with existing query conditions.

coordinator/internal/logic/provertask/bundle_prover_task.go (2)

7-7: LGTM!

The addition of the strings package import is appropriate for the new prefix checking functionality.


Line range hint 146-154: Verify if hard fork compatibility check should be uncommented

There's a commented-out block that checks hard fork compatibility. Please verify if this check should be enabled or removed.

Run this script to check if hard fork compatibility checks are implemented elsewhere:

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
coordinator/internal/logic/provertask/bundle_prover_task.go (3)

67-77: Add debug logging for early return case.

When returning early due to insufficient unassigned bundles for external provers, add debug logging to improve observability.

 		if unassignedBundleCount < bp.cfg.ProverManager.ExternalProverThreshold {
+			log.Debug("insufficient unassigned bundles for external prover", 
+				"count", unassignedBundleCount,
+				"threshold", bp.cfg.ProverManager.ExternalProverThreshold,
+				"prover", taskCtx.ProverName)
 			return nil, nil
 		}

80-80: Extract magic number as constant.

The retry count of 5 should be extracted as a named constant at package level for better maintainability.

+const maxAssignmentRetries = 5

-	for i := 0; i < 5; i++ {
+	for i := 0; i < maxAssignmentRetries; i++ {

104-114: Enhance error and debug logging messages.

The logging messages could be more descriptive to aid in debugging and monitoring.

-			log.Error("failed to get prover task of prover", "proof type", message.ProofTypeBatch.String(), "task ID", tmpBundleTask.Hash, "key", taskCtx.PublicKey, "prover version", taskCtx.ProverVersion, "error", getTaskError)
+			log.Error("failed to retrieve previous prover task attempt", 
+				"proof_type", message.ProofTypeBatch.String(), 
+				"task_id", tmpBundleTask.Hash, 
+				"prover_key", taskCtx.PublicKey, 
+				"prover_version", taskCtx.ProverVersion, 
+				"error", getTaskError)

-			log.Debug("get empty bundle, the prover already failed this task", "height", getTaskParameter.ProverHeight)
+			log.Debug("skipping task assignment - prover previously failed this task", 
+				"height", getTaskParameter.ProverHeight,
+				"task_id", tmpBundleTask.Hash,
+				"prover_key", taskCtx.PublicKey)
coordinator/internal/logic/provertask/chunk_prover_task.go (2)

67-70: Enhance error logging for better debugging.

The error message could be more descriptive by including additional context about the external prover check.

-			log.Error("failed to get unassigned chunk proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError)
+			log.Error("failed to get unassigned chunk proving tasks count for external prover threshold check", 
+				"height", getTaskParameter.ProverHeight, 
+				"prover_name", taskCtx.ProverName,
+				"threshold", cp.cfg.ProverManager.ExternalProverThreshold,
+				"err", getCountError)

104-107: Enhance error logging structure.

Consider structuring the error log fields consistently with other error logs in the codebase.

-			log.Error("failed to get prover task of prover", "proof type", message.ProofTypeChunk.String(), "task ID", tmpChunkTask.Hash, "key", taskCtx.PublicKey, "prover version", taskCtx.ProverVersion, "error", getTaskError)
+			log.Error("failed to get prover task of prover", 
+				"proof_type", message.ProofTypeChunk.String(),
+				"task_id", tmpChunkTask.Hash,
+				"public_key", taskCtx.PublicKey,
+				"prover_version", taskCtx.ProverVersion,
+				"err", getTaskError)
coordinator/internal/logic/provertask/batch_prover_task.go (1)

Line range hint 67-114: The task assignment strategy is well-designed but consider caching optimizations.

The implementation successfully achieves both PR objectives:

  1. Prioritizing static provers through threshold-based assignment
  2. Preventing repeated failures through task history checks

Consider adding a short-lived cache for recently failed tasks to reduce database load in high-concurrency scenarios. This could be particularly beneficial when multiple provers are requesting tasks simultaneously.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2075c8 and dd3c9e1.

📒 Files selected for processing (4)
  • coordinator/internal/logic/provertask/batch_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/bundle_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/chunk_prover_task.go (3 hunks)
  • coordinator/internal/orm/prover_task.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • coordinator/internal/orm/prover_task.go
🔇 Additional comments (8)
coordinator/internal/logic/provertask/bundle_prover_task.go (1)

7-7: LGTM!

The addition of the strings package import is appropriate for the new external prover name prefix checking functionality.

coordinator/internal/logic/provertask/chunk_prover_task.go (4)

7-7: LGTM!

The addition of the strings package import is necessary for the new external prover prefix checking functionality.


65-75: LGTM! Implementation aligns with PR objectives.

The code successfully implements the prioritization of static provers by checking the external prover threshold before assignment.


102-111: LGTM! Successfully prevents task reassignment to failed provers.

The implementation effectively prevents the same failing job from being reassigned to the same prover, which aligns with the PR objectives.


Line range hint 65-111: Add test coverage for new logic.

According to the PR comments, there's no test coverage for these changes. Please add tests to verify:

  1. External prover threshold logic
  2. Prevention of reassigning failed tasks

Would you like me to help create test cases for these scenarios?

coordinator/internal/logic/provertask/batch_prover_task.go (3)

7-7: LGTM!

The addition of the strings package import is appropriate for the new prefix checking functionality.


104-113: Consider adding metrics for failed task assignments.

The logic to avoid reassigning failed tasks is solid, but consider these enhancements:

  1. Add metrics to track failed task reassignment attempts
  2. Consider potential race conditions in high concurrency scenarios

Add metrics to track failed task assignments:

 if proverTask != nil && types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofInvalid {
+			bp.batchTaskGetTaskProver.With(prometheus.Labels{
+				coordinatorType.LabelProverName:      taskCtx.ProverName,
+				coordinatorType.LabelProverPublicKey: taskCtx.PublicKey,
+				coordinatorType.LabelProverVersion:   taskCtx.ProverVersion,
+				"status": "previously_failed",
+			}).Inc()
 			log.Debug("get empty batch, the prover already failed this task", "height", getTaskParameter.ProverHeight)
 			return nil, nil
 		}

Let's verify potential race conditions:

✅ Verification successful

Original metrics suggestion is valid, but race condition concerns are addressed.

The existing implementation already handles concurrent access safely through:

  • Optimistic concurrency control in UpdateBatchAttempts
  • Independent operations with proper context handling
  • Safe read-then-write pattern for task status checks

The metrics suggestion remains valuable for operational visibility.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for proper transaction handling and locking mechanisms

# Check for transaction boundaries
rg -B 5 -A 5 "GetTaskOfProver.*ctx"

# Check for concurrent access patterns
rg -B 5 -A 5 "UpdateBatchAttempts.*ctx"

Length of output: 7174


67-77: Consider enhancing error handling and threshold validation.

While the external prover threshold check aligns with the PR objectives, consider these improvements:

  1. Add context to the error log for better debugging
  2. Validate the threshold configuration at startup
 if getCountError != nil {
-			log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError)
+			log.Error("failed to get unassigned batch proving tasks count", 
+				"height", getTaskParameter.ProverHeight, 
+				"prover", taskCtx.ProverName,
+				"threshold", bp.cfg.ProverManager.ExternalProverThreshold,
+				"err", getCountError)
 			return nil, ErrCoordinatorInternalFailure
 		}

Let's verify the threshold configuration:

✅ Verification successful

External prover threshold implementation verified as correct and safe

The threshold implementation is properly configured with a reasonable default value (32) and is consistently used across all prover tasks. While explicit validation could be added as a good-to-have, the current implementation is safe through JSON type checking and implicit validation in comparison operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if ExternalProverThreshold is properly configured across the codebase

# Check for threshold validation in config initialization
rg -A 5 "ExternalProverThreshold.*=|ExternalProverThreshold.*:" 

# Check for any hardcoded threshold values that might conflict
rg -g '!*.md' -g '!*.txt' "ExternalProverThreshold"

Length of output: 1188


Script:

#!/bin/bash
# Check for config validation and initialization
ast-grep --pattern 'func $_($$$ *Config $$$) error {
  $$$
}'

# Check for any default value assignments
rg -A 3 "DefaultConfig|NewConfig|InitConfig" 

Length of output: 8802


Script:

#!/bin/bash
# Check config file content
fd -e json -e example config coordinator/conf/

# Check for any validation during coordinator startup
rg -B 3 -A 3 "ProverManager.*ExternalProverThreshold" coordinator/cmd/

Length of output: 150


Script:

#!/bin/bash
# Check config file content
cat coordinator/conf/config.json

# Check coordinator initialization/startup code
ast-grep --pattern 'func $_($$$ *Coordinator $$$) {
  $$$
}'

# Check for any validation in ProverManager initialization
rg -A 5 "type ProverManager struct" coordinator/

Length of output: 1677

colinlyguo
colinlyguo previously approved these changes Jan 9, 2025
@colinlyguo
Copy link
Member

github also has a nice feature that when applying suggestions, can use batch apply, so the commit history would be more concise.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (6)
coordinator/internal/logic/provertask/batch_prover_task.go (2)

69-71: Enhance error logging with additional context.

Consider adding the unassigned batch count to the error log to help with debugging threshold-related issues.

-			log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError)
+			log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "threshold", bp.cfg.ProverManager.ExternalProverThreshold, "err", getCountError)

105-105: Consider extracting magic number into a constant.

The value 2 in GetFailedProverTasksByHash appears to be a limit. Consider extracting it into a named constant for better maintainability.

+const maxFailedTasksToCheck = 2
-		proverTasks, getTaskError := bp.proverTaskOrm.GetFailedProverTasksByHash(ctx.Copy(), message.ProofTypeBatch, tmpBatchTask.Hash, 2)
+		proverTasks, getTaskError := bp.proverTaskOrm.GetFailedProverTasksByHash(ctx.Copy(), message.ProofTypeBatch, tmpBatchTask.Hash, maxFailedTasksToCheck)
coordinator/internal/utils/prover_name.go (1)

5-15: Enhance input validation and documentation

The function could benefit from more robust validation and clearer documentation:

  1. The comment mentions "cloud_prover_{provider-name}_index" format but the code only checks first 3 segments
  2. Magic number '3' should be defined as a constant

Consider this implementation:

+const minProverNameSegments = 3

 func IsExternalProverNameMatch(localName, remoteName string) bool {
+       // Validate input strings are not empty
+       if localName == "" || remoteName == "" {
+               return false
+       }
+
 	local := strings.Split(localName, "_")
 	remote := strings.Split(remoteName, "_")
 
-	if len(local) < 3 || len(remote) < 3 {
+	if len(local) < minProverNameSegments || len(remote) < minProverNameSegments {
 		return false
 	}
 
-	// note the name of cloud prover is in fact in the format of "cloud_prover_{provider-name}_index"
+	// Matches cloud prover names in the format: "cloud_prover_{provider-name}_index"
+	// Returns true if the first three segments (cloud, prover, provider-name) match
 	return local[0] == remote[0] && local[1] == remote[1] && local[2] == remote[2]
 }
coordinator/internal/types/prover.go (1)

54-54: Fix typo in error message

The error message has a duplicate word "type".

-		return fmt.Sprintf("prover provider type type: %d", r)
+		return fmt.Sprintf("illegal prover provider type: %d", r)
coordinator/internal/types/auth.go (1)

Line range hint 1-40: Consider backward compatibility strategy

The introduction of the provider type system is well-structured, but the implementation might benefit from a phased approach:

  1. The required binding on new fields could break existing clients
  2. The coexistence of ProverTypes and ProverProviderType needs a clear migration path
  3. Consider adding validation logic to ensure consistency between these fields during the transition period

Consider implementing a migration strategy:

  1. Make new fields optional initially
  2. Add a deprecation notice for ProverTypes
  3. Document the migration timeline for clients
  4. Add validation to ensure consistency between old and new fields during transition
coordinator/internal/logic/auth/login.go (1)

111-112: Consider enhancing the error message with the actual value.

The error logging is good, but the returned error message could be more descriptive for debugging purposes.

Consider this improvement:

-		return errors.New("invalid prover provider type.")
+		return fmt.Errorf("invalid prover provider type: %v", login.Message.ProverProviderType)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd3c9e1 and 5ad00e7.

📒 Files selected for processing (10)
  • coordinator/internal/controller/api/auth.go (1 hunks)
  • coordinator/internal/logic/auth/login.go (1 hunks)
  • coordinator/internal/logic/provertask/batch_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/bundle_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/chunk_prover_task.go (3 hunks)
  • coordinator/internal/logic/provertask/prover_task.go (3 hunks)
  • coordinator/internal/orm/prover_task.go (2 hunks)
  • coordinator/internal/types/auth.go (2 hunks)
  • coordinator/internal/types/prover.go (1 hunks)
  • coordinator/internal/utils/prover_name.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • coordinator/internal/logic/provertask/prover_task.go
  • coordinator/internal/logic/provertask/chunk_prover_task.go
  • coordinator/internal/orm/prover_task.go
  • coordinator/internal/logic/provertask/bundle_prover_task.go
🔇 Additional comments (7)
coordinator/internal/logic/provertask/batch_prover_task.go (4)

25-25: LGTM!

The import addition is necessary for the external prover name matching functionality.


67-77: LGTM! External prover threshold check implementation.

The implementation aligns with the PR objectives by prioritizing task assignment based on the unassigned task count threshold.


104-116: LGTM! Robust implementation of failed task reassignment prevention.

The implementation effectively prevents reassigning failed tasks to the same prover, with special handling for external provers.


Line range hint 67-116: Verify performance impact of the new assignment logic.

The changes to task assignment logic (external prover threshold and failed task checks) could impact system performance. Please ensure monitoring is in place.

Run the following script to check if relevant metrics are being tracked:

✅ Verification successful

Performance monitoring is already in place for task assignment changes

The codebase has comprehensive prometheus metrics and logging that cover:

  • Task assignment tracking via coordinator_batch_get_task_total and coordinator_get_task_count
  • Failure monitoring through various coordinator_validate_failure_* metrics
  • Task timeout tracking with coordinator_batch_prover_task_timeout_total
  • Detailed error logging for assignment failures
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for performance monitoring metrics

# Check for prometheus metric definitions
rg -A 2 'prometheus.CounterVec|prometheus.Counter|prometheus.Gauge' | grep -v '_test.go'

# Check for relevant logging patterns that could be used for monitoring
rg -A 2 'log\.(Info|Error|Debug).*batch.*prov(er|ing)'

Length of output: 65812

coordinator/internal/types/auth.go (1)

38-40: Clarify relationship between ProverProviderType and ProverTypes

The Message struct contains both ProverProviderType and ProverTypes fields, which seems redundant. Additionally:

  1. The binding:"required" tag on ProverProviderType might break backward compatibility
  2. The relationship between these fields needs documentation

Let's check for existing usages:

Consider:

  1. Document the relationship between these fields
  2. If ProverTypes is being deprecated, mark it as such
  3. Consider making ProverProviderType optional initially for backward compatibility
coordinator/internal/logic/auth/login.go (2)

110-113: LGTM! The provider type validation aligns with the PR objectives.

The validation ensures that only internal (static) and external (dynamic) provers can authenticate, which is essential for implementing the new prover assignment strategy.


109-114: Verify error handling in upstream code.

Since this introduces a new validation that could reject prover logins, we should ensure that the coordinator's retry mechanisms and monitoring systems properly handle these failures.

Let's check how these errors are handled:

✅ Verification successful

Error handling is sufficient for the new validation.

The coordinator has comprehensive error handling with:

  • Retry mechanisms (up to 5 attempts)
  • Prometheus metrics tracking failures
  • Timeout detection and recovery
  • Detailed error logging
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for error handling patterns around login failures
rg -A 5 "invalid prover provider type" --type go

Length of output: 368


Script:

#!/bin/bash
# Find the login function definition and its callers
ast-grep --pattern 'func $_ Login($$$) {$$$}'

# Look for retry mechanisms
rg -B 2 -A 5 'retry|backoff' --type go

# Search for monitoring/metrics related to login
rg -B 2 -A 5 'metrics|prometheus|monitor' --type go coordinator/internal/

Length of output: 41939

coordinator/internal/controller/api/auth.go Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bump-version Bump the version tag for deployment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants