-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(coordinator): assign static prover first and avoid reassigning failed task to same prover #1584
base: develop
Are you sure you want to change the base?
Conversation
…ailing task to same prover
Warning Rate limit exceeded@yiweichi has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 13 minutes and 30 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (3)
WalkthroughThe pull request introduces a new configuration parameter Changes
Sequence DiagramsequenceDiagram
participant Prover
participant Coordinator
participant Database
Prover->>Coordinator: Request Task
Coordinator->>Database: Check Unassigned Tasks Count
Database-->>Coordinator: Return Task Count
alt Count Below Threshold
Coordinator-->>Prover: No Task Assigned
else Count Above Threshold
Coordinator->>Database: Retrieve Assigned/Unassigned Tasks
Database-->>Coordinator: Return Tasks
Coordinator->>Prover: Assign Suitable Task
end
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (13)
coordinator/internal/logic/provertask/chunk_prover_task.go (2)
65-75
: External prover threshold check
This conditional correctly enforces a minimum number of unassigned tasks before assigning them to external provers, helping to optimize resource usage. Consider adding a debug log indicating when no tasks are assigned.if unassignedChunkCount < cp.cfg.ProverManager.ExternalProverThreshold { + log.Debug("Skipping external prover assignment", + "count", unassignedChunkCount, + "threshold", cp.cfg.ProverManager.ExternalProverThreshold) return nil, nil }
94-117
: Avoid re-dispatching failing tasks & concurrency concern
The loop prevents reassigning the same failing task to the same prover. While this logic is correct, confirm that concurrent assignments are sufficiently protected. For high concurrency, consider row-level locks or other synchronization in the DB to avoid competing updates.coordinator/internal/logic/provertask/bundle_prover_task.go (2)
67-77
: External prover threshold check
Similar to the chunk logic, this section cleanly limits external prover usage by verifying if the unassigned bundle count is above the threshold. Consider logging debug info when skipping assignment.if unassignedBundleCount < bp.cfg.ProverManager.ExternalProverThreshold { + log.Debug("Skipping external prover assignment for bundles", + "count", unassignedBundleCount, + "threshold", bp.cfg.ProverManager.ExternalProverThreshold) return nil, nil }
96-119
: Loop for assigned/unassigned tasks & fail-check
This loop ensures the same failing task won’t be re-dispatched to the same prover. Validate concurrent updates to avoid race conditions if multiple coordinator instances operate simultaneously.coordinator/internal/logic/provertask/batch_prover_task.go (1)
96-119
: Retry loop and preventing re-dispatch
The loop logic checks assigned tasks first, then unassigned tasks, while ensuring failed tasks are not re-dispatched to the same prover. Consider verifying concurrency safety.coordinator/internal/orm/chunk.go (3)
76-86
:GetUnassignedChunk
implementation
Using raw SQL withfmt.Sprintf
is generally safe here since the parameters are numeric, but parameter binding is often safer against injection. Consider GORM chain calls to reduce potential risk and improve maintainability.- sql := fmt.Sprintf("SELECT * FROM chunk WHERE proving_status = %d AND total_attempts < %d ... LIMIT %d;", ...) - err := db.Raw(sql).Scan(&chunks).Error + err := db.Model(&Chunk{}). + Where("proving_status = ?", int(types.ProvingTaskUnassigned)). + Where("total_attempts < ?", maxTotalAttempts). + Where("active_attempts < ?", maxActiveAttempts). + Where("end_block_number <= ?", height). + Where("deleted_at IS NULL"). + Order("index"). + Limit(int(limit)). + Find(&chunks).Error
88-101
:GetUnassignedChunkCount
mismatch in doc
The docstring references “based on the specified limit,” yet no limit parameter is used. Consider updating the docstring for accuracy.
104-115
:GetAssignedChunks
Similar toGetUnassignedChunk
, this also uses raw SQL. Consider switching to GORM chain calls for improved safety and maintainability.coordinator/internal/orm/prover_task.go (1)
151-166
: Clarify docstring vs. function name.The doc comment says:
// GetTaskOfOtherProvers get the chunk/batch task of prover
But the function is named
GetTaskOfProver
. This mismatch may cause confusion. Please update the doc comment or rename the function to keep them in sync.coordinator/internal/orm/bundle.go (1)
57-69
: Fix verbiage in error message and doc comment.
- The error message references “Batch.GetUnassignedBundles” instead of “Bundle.GetUnassignedBundles”.
- The doc comment mentions “The returned batch sorts…” but the function works with bundles.
These references could create confusion. Rename them to be consistent with the “Bundle” function context.
coordinator/internal/orm/batch.go (3)
86-87
: Consider using GORM query builder instead of raw SQLWhile the current implementation using
fmt.Sprintf
is safe since all variables are numeric, using GORM's query builder would be more maintainable and safer by default. This also makes it easier to modify the query structure in the future.- sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;", - int(types.ProvingTaskUnassigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit) - err := db.Raw(sql).Scan(&batch).Error + err := db.Where("proving_status = ?", types.ProvingTaskUnassigned). + Where("total_attempts < ?", maxTotalAttempts). + Where("active_attempts < ?", maxActiveAttempts). + Where("chunk_proofs_status = ?", types.ChunkProofsStatusReady). + Order("index ASC"). + Limit(int(limit)). + Find(&batch).Error
81-93
: Consider adding an index on the batch.index columnThe query uses
ORDER BY batch.index
, which could be slow for large datasets without an index. Consider adding an index if one doesn't exist.CREATE INDEX idx_batch_index ON batch(index);
110-121
: Reduce code duplication with GetUnassignedBatchesThe
GetAssignedBatches
method is nearly identical toGetUnassignedBatches
except for theproving_status
condition. Consider extracting the common logic into a shared method.+func (o *Batch) getBatchesByStatus(ctx context.Context, status types.ProvingStatus, maxActiveAttempts, maxTotalAttempts uint8, limit uint64) ([]*Batch, error) { + var batch []*Batch + return db.Where("proving_status = ?", status). + Where("total_attempts < ?", maxTotalAttempts). + Where("active_attempts < ?", maxActiveAttempts). + Where("chunk_proofs_status = ?", types.ChunkProofsStatusReady). + Order("index ASC"). + Limit(int(limit)). + Find(&batch).Error +} + func (o *Batch) GetUnassignedBatches(ctx context.Context, maxActiveAttempts, maxTotalAttempts uint8, limit uint64) ([]*Batch, error) { - var batch []*Batch - db := o.db.WithContext(ctx) - sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;", - int(types.ProvingTaskUnassigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit) - err := db.Raw(sql).Scan(&batch).Error - if err != nil { - return nil, fmt.Errorf("Batch.GetUnassignedBatches error: %w", err) - } - return batch, nil + return o.getBatchesByStatus(ctx, types.ProvingTaskUnassigned, maxActiveAttempts, maxTotalAttempts, limit) } func (o *Batch) GetAssignedBatches(ctx context.Context, maxActiveAttempts, maxTotalAttempts uint8, limit uint64) ([]*Batch, error) { - var batch []*Batch - db := o.db.WithContext(ctx) - sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;", - int(types.ProvingTaskAssigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit) - err := db.Raw(sql).Scan(&batch).Error - if err != nil { - return nil, fmt.Errorf("Batch.GetAssignedBatches error: %w", err) - } - return batch, nil + return o.getBatchesByStatus(ctx, types.ProvingTaskAssigned, maxActiveAttempts, maxTotalAttempts, limit) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
coordinator/conf/config.json
(1 hunks)coordinator/internal/config/config.go
(1 hunks)coordinator/internal/logic/provertask/batch_prover_task.go
(2 hunks)coordinator/internal/logic/provertask/bundle_prover_task.go
(2 hunks)coordinator/internal/logic/provertask/chunk_prover_task.go
(2 hunks)coordinator/internal/logic/provertask/prover_task.go
(1 hunks)coordinator/internal/orm/batch.go
(1 hunks)coordinator/internal/orm/bundle.go
(1 hunks)coordinator/internal/orm/chunk.go
(1 hunks)coordinator/internal/orm/prover_task.go
(1 hunks)rollup/config.json
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- coordinator/internal/logic/provertask/prover_task.go
🔇 Additional comments (15)
coordinator/internal/logic/provertask/chunk_prover_task.go (3)
7-7
: Use of strings
package
Importing the strings
package is necessary for prefix checks in subsequent lines. This is valid and consistent with the new external prover logic.
81-82
: Offset-based iteration
Introducing separate offsets for assigned and unassigned tasks enhances clarity and supports batch assignment logic. This is a fine approach.
89-90
: Retrieving unassigned chunks
Fetching up to 50 unassigned tasks is aligned with the new batch approach. The error logging is helpful for troubleshooting.
coordinator/internal/logic/provertask/bundle_prover_task.go (3)
7-7
: Import strings
The strings
import is added for prefix checks. This is appropriate given the external prover feature.
83-84
: Assigned/unassigned offset
Using separate offsets for assigned and unassigned bundles promotes clarity in batch retrieval and assignment.
90-91
: Fetching unassigned bundles
Fetching multiple unassigned tasks at once scales better than a single-task approach. Good error handling is present.
coordinator/internal/logic/provertask/batch_prover_task.go (4)
7-7
: Import strings
Necessary for prefix checks related to external prover logic.
67-77
: External prover threshold
Checking unassignedBatchCount
ensures external provers only get assigned when tasks exceed a threshold. This aligns with the PR goal of reducing cost.
83-84
: Offset definitions
Setting up assignedOffset
and unassignedOffset
makes it simpler to process tasks in batches.
90-91
: Retrieving unassigned batches
Properly retrieves batches in bulk, improving assignment throughput. The error logging is appropriately handled.
coordinator/internal/config/config.go (1)
19-20
: New ExternalProverThreshold
field
Adding this threshold field is consistent with the logic implemented in the prover tasks. This helps reduce cost by controlling when external provers are utilized.
coordinator/internal/orm/bundle.go (2)
71-83
: Validate boundary conditions for counting unassigned bundles.
This function appears correct. However, validate that upstream logic handles the scenario when maxActiveAttempts
and maxTotalAttempts
are lower than the usage patterns, to avoid retrieving trivial results or none at all.
86-97
: Ensure consistent naming in error handling.
Similar to the previous function, the error message references “Bundle.GetAssignedBundles,” which is correct. Just be sure the code references remain consistent across the repository for easier debugging.
coordinator/conf/config.json (1)
5-5
: Confirm correctness of threshold.
Adding external_prover_threshold
= 32 is good if that’s the intended cutoff. Double-check that calling logic interprets this value as intended.
rollup/config.json (1)
1-40
: Watch out for potential exposure of environment-specific values.
The L1 endpoint and relayer configuration appear to contain environment-specific details. Confirm these settings are correct for the environment (test, staging, production).
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1584 +/- ##
===========================================
- Coverage 52.01% 51.79% -0.23%
===========================================
Files 157 157
Lines 12476 12529 +53
===========================================
Hits 6490 6490
- Misses 5424 5477 +53
Partials 562 562
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
coordinator/internal/orm/batch.go (1)
81-93
: Consider using GORM query builder instead of raw SQL.The raw SQL query could be replaced with GORM's query builder for better maintainability and type safety. This would also help prevent potential SQL injection vulnerabilities.
- sql := fmt.Sprintf("SELECT * FROM batch WHERE proving_status = %d AND total_attempts < %d AND active_attempts < %d AND chunk_proofs_status = %d AND batch.deleted_at IS NULL ORDER BY batch.index LIMIT %d;", - int(types.ProvingTaskUnassigned), maxTotalAttempts, maxActiveAttempts, int(types.ChunkProofsStatusReady), limit) - err := db.Raw(sql).Scan(&batch).Error + err := db.Model(&Batch{}). + Where("proving_status = ?", int(types.ProvingTaskUnassigned)). + Where("total_attempts < ?", maxTotalAttempts). + Where("active_attempts < ?", maxActiveAttempts). + Where("chunk_proofs_status = ?", int(types.ChunkProofsStatusReady)). + Order("index ASC"). + Limit(int(limit)). + Find(&batch).Errorcoordinator/internal/logic/provertask/bundle_prover_task.go (2)
70-70
: Fix incorrect error message textThe error message references "chunk" instead of "bundle" in the log statement.
- log.Error("failed to get unassigned chunk proving tasks count", "height", getTaskParameter.ProverHeight, "err", err) + log.Error("failed to get unassigned bundle proving tasks count", "height", getTaskParameter.ProverHeight, "err", err)
84-84
: Consider making the batch size configurableThe magic number '50' used for batch retrieval size could be moved to configuration for better maintainability and tuning.
Consider adding a configuration parameter for the batch size:
type Config struct { ProverManager struct { ProversPerSession int SessionAttempts int ExternalProverThreshold int + TaskBatchSize int `json:"task_batch_size" default:"50"` } }
Then use it in the code:
- tmpAssignedBundleTasks, getTaskError := bp.bundleOrm.GetAssignedBundles(ctx.Copy(), maxActiveAttempts, maxTotalAttempts, 50) + tmpAssignedBundleTasks, getTaskError := bp.bundleOrm.GetAssignedBundles(ctx.Copy(), maxActiveAttempts, maxTotalAttempts, bp.cfg.ProverManager.TaskBatchSize)Also applies to: 91-91
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
coordinator/internal/logic/provertask/batch_prover_task.go
(2 hunks)coordinator/internal/logic/provertask/bundle_prover_task.go
(2 hunks)coordinator/internal/logic/provertask/chunk_prover_task.go
(2 hunks)coordinator/internal/orm/batch.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- coordinator/internal/logic/provertask/chunk_prover_task.go
- coordinator/internal/logic/provertask/batch_prover_task.go
🔇 Additional comments (4)
coordinator/internal/orm/batch.go (1)
95-108
: LGTM! Well-structured implementation.
The method follows best practices by using GORM's query builder and maintains consistency with GetUnassignedBatches conditions.
coordinator/internal/logic/provertask/bundle_prover_task.go (3)
7-7
: LGTM: Import addition is appropriate
The addition of the strings
package import is necessary for the new external prover name prefix checking functionality.
67-77
: LGTM: External prover threshold check implementation
The implementation correctly enforces the threshold check for external provers, aligning with the PR's objective to prioritize static provers.
83-119
: LGTM: Improved task assignment logic with failure prevention
The implementation efficiently handles task assignment using batch retrieval and properly prevents reassigning failed tasks to the same prover. The offset-based iteration is a clean approach to process both assigned and unassigned tasks.
Let's verify the prover task status check implementation:
✅ Verification successful
Task reassignment prevention logic is consistently implemented
The verification confirms that the task reassignment prevention logic is correctly implemented across the codebase:
- The same prevention check
types.ProverProveStatus(proverTask.ProvingStatus) != types.ProverProofInvalid
is consistently used in all prover task handlers (chunk, batch, and bundle) - The status updates to
ProverProofInvalid
are properly handled in the proof receiver and task collection components - The implementation includes proper error handling and logging for invalid proof submissions
Let's verify the task status transitions to ensure the complete flow is correct:
Task status management is well-defined and properly enforced
The verification confirms that the task assignment logic is robust and correctly implemented:
-
The status transitions are well-defined with clear enum values:
ProverProveStatusUndefined
(0)ProverAssigned
(1)ProverProofValid
(2)ProverProofInvalid
(3)
-
The prevention of reassigning failed tasks to the same prover is consistently implemented across all task types (chunk, batch, bundle) using the same status check.
-
The status flow ensures that:
- Tasks start in an unassigned state
- Get marked as assigned when dispatched
- Can transition to either valid or invalid based on proof verification
- Invalid proofs from a prover prevent that same prover from getting reassigned the task
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that failed tasks are not being reassigned to the same prover
# Search for all prover task status checks in the codebase
echo "Searching for prover task status checks..."
rg -A 3 "types\.ProverProofInvalid"
# Search for potential places where tasks might be reassigned
echo "Searching for task assignment patterns..."
ast-grep --pattern 'if $_ != types.ProverProofInvalid { $$$$ }'
Length of output: 4257
Script:
#!/bin/bash
# Check the task status enum definition and transitions
echo "Checking task status definitions..."
rg -A 10 "type ProverProveStatus"
# Check status update implementations
echo "Checking status update implementations..."
ast-grep --pattern 'UpdateProverTaskProvingStatus($$$)'
Length of output: 900
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (4)
coordinator/internal/logic/provertask/bundle_prover_task.go (1)
96-119
: Consider adding early return optimizationThe implementation is solid but could be optimized by adding an early return when both offsets exceed their respective list lengths.
for { + if assignedOffset >= len(tmpAssignedBundleTasks) && unassignedOffset >= len(tmpUnassignedBundleTask) { + log.Debug("no more tasks available", "height", getTaskParameter.ProverHeight) + return nil, nil + } tmpBundleTask = nil if assignedOffset < len(tmpAssignedBundleTasks) {coordinator/internal/logic/provertask/batch_prover_task.go (3)
74-76
: Add debug logging for external prover threshold checkWhen skipping task assignment due to threshold condition, adding debug logging would help with monitoring and debugging.
if unassignedBatchCount < bp.cfg.ProverManager.ExternalProverThreshold { + log.Debug("skipping external prover assignment: below threshold", + "unassigned_count", unassignedBatchCount, + "threshold", bp.cfg.ProverManager.ExternalProverThreshold, + "prover", taskCtx.ProverName) return nil, nil }
83-84
: Consider making pagination and retry parameters configurableThe code uses hardcoded values for pagination (50) and retry attempts (5). These should be configurable parameters.
Also, the comment explaining the query split could be more descriptive.
- Add configuration parameters:
type Config struct { ProverManager struct { // ... existing fields ... + BatchTaskPaginationSize int `json:"batch_task_pagination_size"` + BatchTaskMaxRetries int `json:"batch_task_max_retries"` } }
- Improve the comment:
- // Why here need get again? In order to support a task can assign to multiple prover, need also assign `ProvingTaskAssigned` - // chunk to prover. But use `proving_status in (1, 2)` will not use the postgres index. So need split the sql. + // Split the queries for better performance: + // 1. We need to support assigning a task to multiple provers, including already assigned tasks + // 2. Using a single query with `proving_status in (1, 2)` would not utilize the PostgreSQL index + // 3. Therefore, we split into separate queries for assigned and unassigned tasks to leverage indexesAlso applies to: 90-91
111-119
: Consider adding metrics for failed task reassignment preventionThe code prevents reassigning failed tasks to the same prover, but there's no monitoring for how often this occurs. Adding metrics would help track the effectiveness of this prevention mechanism.
type BatchProverTask struct { // ... existing fields ... + batchTaskReassignmentPrevented *prometheus.CounterVec } func NewBatchProverTask(cfg *config.Config, chainCfg *params.ChainConfig, db *gorm.DB, reg prometheus.Registerer) *BatchProverTask { bp := &BatchProverTask{ // ... existing initialization ... + batchTaskReassignmentPrevented: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{ + Name: "coordinator_batch_task_reassignment_prevented_total", + Help: "Number of times a failed task was prevented from being reassigned to the same prover.", + }, []string{"prover_name"}), } return bp } // In the Assign method: if proverTask != nil && types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofInvalid { + bp.batchTaskReassignmentPrevented.WithLabelValues(taskCtx.ProverName).Inc() continue }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
coordinator/internal/logic/provertask/batch_prover_task.go
(2 hunks)coordinator/internal/logic/provertask/bundle_prover_task.go
(2 hunks)
🔇 Additional comments (4)
coordinator/internal/logic/provertask/bundle_prover_task.go (4)
7-7
: LGTM: Import addition is appropriate
The strings package import is required for the new prefix checking functionality.
67-77
: LGTM: External prover threshold check is well implemented
The implementation correctly:
- Identifies external provers using the prefix check
- Verifies unassigned bundle count against threshold
- Prevents unnecessary external prover assignments when load is low
83-95
: LGTM: Efficient batch retrieval implementation
The implementation:
- Uses pagination with offset tracking
- Properly handles database errors
- Includes clear documentation explaining the index usage optimization
Line range hint 151-160
: Clarify status of commented hard fork compatibility check
There's a substantial block of commented code related to hard fork compatibility checking. Please either:
- Remove if no longer needed
- Uncomment if the check is still required
- Add a comment explaining why this is temporarily disabled
This will help maintain code clarity and prevent technical debt.
@@ -63,29 +64,59 @@ func (bp *BatchProverTask) Assign(ctx *gin.Context, getTaskParameter *coordinato | |||
|
|||
maxActiveAttempts := bp.cfg.ProverManager.ProversPerSession | |||
maxTotalAttempts := bp.cfg.ProverManager.SessionAttempts | |||
if strings.HasPrefix(taskCtx.ProverName, ExternalProverNamePrefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this way, how about adding a parameter to the request ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
coordinator/internal/logic/provertask/bundle_prover_task.go (1)
70-70
:⚠️ Potential issueFix error variable in logging statement
The error logging statement uses
err
instead ofgetCountError
.-log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", err) +log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError)
🧹 Nitpick comments (4)
coordinator/internal/orm/chunk.go (1)
91-105
: Consider adding composite index for performance optimization.To optimize the frequent queries on chunk status and attempts, consider adding a composite index on (proving_status, total_attempts, active_attempts, end_block_number).
coordinator/internal/logic/provertask/bundle_prover_task.go (3)
73-76
: Add logging for external prover threshold checkConsider adding debug logging when skipping external provers due to threshold condition. This will help with monitoring and debugging the assignment process.
if unassignedBundleCount < bp.cfg.ProverManager.ExternalProverThreshold { + log.Debug("skipping external prover assignment: below threshold", + "count", unassignedBundleCount, + "threshold", bp.cfg.ProverManager.ExternalProverThreshold, + "prover", taskCtx.ProverName) return nil, nil }
104-113
: Enhance error and debug loggingThe error and debug messages could be more descriptive to aid in troubleshooting.
- log.Error("failed to get prover task of prover", "proof_type", message.ProofTypeBatch.String(), "taskID", tmpBundleTask.Hash, "key", taskCtx.PublicKey, "Prover_version", taskCtx.ProverVersion, "error", getTaskError) + log.Error("failed to check prover's previous attempts", "proof_type", message.ProofTypeBatch.String(), "taskID", tmpBundleTask.Hash, "key", taskCtx.PublicKey, "prover_version", taskCtx.ProverVersion, "error", getTaskError) - log.Debug("get empty bundle, the prover already failed this task", "height", getTaskParameter.ProverHeight) + log.Debug("skipping task assignment: prover previously failed this task", + "height", getTaskParameter.ProverHeight, + "task_id", tmpBundleTask.Hash, + "prover", taskCtx.ProverName)
Line range hint
81-82
: Consider extracting retry configurationThe retry mechanism uses magic numbers (5 retries, 100ms sleep). Consider extracting these values to configuration parameters for better maintainability.
+const ( + maxRetryAttempts = 5 + retryDelay = 100 * time.Millisecond +) -for i := 0; i < 5; i++ { +for i := 0; i < maxRetryAttempts; i++ { // ... existing code ... - time.Sleep(100 * time.Millisecond) + time.Sleep(retryDelay)Also applies to: 123-124
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
common/version/version.go
(1 hunks)coordinator/internal/logic/provertask/batch_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/bundle_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/chunk_prover_task.go
(3 hunks)coordinator/internal/orm/batch.go
(1 hunks)coordinator/internal/orm/bundle.go
(1 hunks)coordinator/internal/orm/chunk.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- common/version/version.go
- coordinator/internal/orm/bundle.go
- coordinator/internal/orm/batch.go
- coordinator/internal/logic/provertask/batch_prover_task.go
🔇 Additional comments (6)
coordinator/internal/logic/provertask/chunk_prover_task.go (3)
65-75
: LGTM! External prover threshold check is well implemented.The implementation correctly prioritizes static provers by checking the unassigned task count against a threshold before assigning tasks to external provers. Error handling and logging are properly implemented.
102-111
: LGTM! Task reassignment prevention is well implemented.The implementation successfully prevents reassigning failed tasks to the same prover by checking the prover's previous task status. Error handling and logging are properly implemented.
Line range hint
143-151
: Verify if version compatibility check should be removed.The commented out code block appears to implement important version compatibility checks. Please verify if this check should be removed, as it could lead to version incompatibility issues between provers and tasks.
coordinator/internal/orm/chunk.go (1)
91-105
: LGTM! GetUnassignedChunkCount implementation is clean and efficient.The method follows GORM best practices with proper error handling and maintains consistency with existing query conditions.
coordinator/internal/logic/provertask/bundle_prover_task.go (2)
7-7
: LGTM!The addition of the
strings
package import is appropriate for the new prefix checking functionality.
Line range hint
146-154
: Verify if hard fork compatibility check should be uncommentedThere's a commented-out block that checks hard fork compatibility. Please verify if this check should be enabled or removed.
Run this script to check if hard fork compatibility checks are implemented elsewhere:
Co-authored-by: colin <[email protected]>
Co-authored-by: colin <[email protected]>
Co-authored-by: colin <[email protected]>
Co-authored-by: colin <[email protected]>
Co-authored-by: colin <[email protected]>
Co-authored-by: colin <[email protected]>
Co-authored-by: colin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
coordinator/internal/logic/provertask/bundle_prover_task.go (3)
67-77
: Add debug logging for early return case.When returning early due to insufficient unassigned bundles for external provers, add debug logging to improve observability.
if unassignedBundleCount < bp.cfg.ProverManager.ExternalProverThreshold { + log.Debug("insufficient unassigned bundles for external prover", + "count", unassignedBundleCount, + "threshold", bp.cfg.ProverManager.ExternalProverThreshold, + "prover", taskCtx.ProverName) return nil, nil }
80-80
: Extract magic number as constant.The retry count of 5 should be extracted as a named constant at package level for better maintainability.
+const maxAssignmentRetries = 5 - for i := 0; i < 5; i++ { + for i := 0; i < maxAssignmentRetries; i++ {
104-114
: Enhance error and debug logging messages.The logging messages could be more descriptive to aid in debugging and monitoring.
- log.Error("failed to get prover task of prover", "proof type", message.ProofTypeBatch.String(), "task ID", tmpBundleTask.Hash, "key", taskCtx.PublicKey, "prover version", taskCtx.ProverVersion, "error", getTaskError) + log.Error("failed to retrieve previous prover task attempt", + "proof_type", message.ProofTypeBatch.String(), + "task_id", tmpBundleTask.Hash, + "prover_key", taskCtx.PublicKey, + "prover_version", taskCtx.ProverVersion, + "error", getTaskError) - log.Debug("get empty bundle, the prover already failed this task", "height", getTaskParameter.ProverHeight) + log.Debug("skipping task assignment - prover previously failed this task", + "height", getTaskParameter.ProverHeight, + "task_id", tmpBundleTask.Hash, + "prover_key", taskCtx.PublicKey)coordinator/internal/logic/provertask/chunk_prover_task.go (2)
67-70
: Enhance error logging for better debugging.The error message could be more descriptive by including additional context about the external prover check.
- log.Error("failed to get unassigned chunk proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError) + log.Error("failed to get unassigned chunk proving tasks count for external prover threshold check", + "height", getTaskParameter.ProverHeight, + "prover_name", taskCtx.ProverName, + "threshold", cp.cfg.ProverManager.ExternalProverThreshold, + "err", getCountError)
104-107
: Enhance error logging structure.Consider structuring the error log fields consistently with other error logs in the codebase.
- log.Error("failed to get prover task of prover", "proof type", message.ProofTypeChunk.String(), "task ID", tmpChunkTask.Hash, "key", taskCtx.PublicKey, "prover version", taskCtx.ProverVersion, "error", getTaskError) + log.Error("failed to get prover task of prover", + "proof_type", message.ProofTypeChunk.String(), + "task_id", tmpChunkTask.Hash, + "public_key", taskCtx.PublicKey, + "prover_version", taskCtx.ProverVersion, + "err", getTaskError)coordinator/internal/logic/provertask/batch_prover_task.go (1)
Line range hint
67-114
: The task assignment strategy is well-designed but consider caching optimizations.The implementation successfully achieves both PR objectives:
- Prioritizing static provers through threshold-based assignment
- Preventing repeated failures through task history checks
Consider adding a short-lived cache for recently failed tasks to reduce database load in high-concurrency scenarios. This could be particularly beneficial when multiple provers are requesting tasks simultaneously.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
coordinator/internal/logic/provertask/batch_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/bundle_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/chunk_prover_task.go
(3 hunks)coordinator/internal/orm/prover_task.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- coordinator/internal/orm/prover_task.go
🔇 Additional comments (8)
coordinator/internal/logic/provertask/bundle_prover_task.go (1)
7-7
: LGTM!The addition of the
strings
package import is appropriate for the new external prover name prefix checking functionality.coordinator/internal/logic/provertask/chunk_prover_task.go (4)
7-7
: LGTM!The addition of the
strings
package import is necessary for the new external prover prefix checking functionality.
65-75
: LGTM! Implementation aligns with PR objectives.The code successfully implements the prioritization of static provers by checking the external prover threshold before assignment.
102-111
: LGTM! Successfully prevents task reassignment to failed provers.The implementation effectively prevents the same failing job from being reassigned to the same prover, which aligns with the PR objectives.
Line range hint
65-111
: Add test coverage for new logic.According to the PR comments, there's no test coverage for these changes. Please add tests to verify:
- External prover threshold logic
- Prevention of reassigning failed tasks
Would you like me to help create test cases for these scenarios?
coordinator/internal/logic/provertask/batch_prover_task.go (3)
7-7
: LGTM!The addition of the
strings
package import is appropriate for the new prefix checking functionality.
104-113
: Consider adding metrics for failed task assignments.The logic to avoid reassigning failed tasks is solid, but consider these enhancements:
- Add metrics to track failed task reassignment attempts
- Consider potential race conditions in high concurrency scenarios
Add metrics to track failed task assignments:
if proverTask != nil && types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofInvalid { + bp.batchTaskGetTaskProver.With(prometheus.Labels{ + coordinatorType.LabelProverName: taskCtx.ProverName, + coordinatorType.LabelProverPublicKey: taskCtx.PublicKey, + coordinatorType.LabelProverVersion: taskCtx.ProverVersion, + "status": "previously_failed", + }).Inc() log.Debug("get empty batch, the prover already failed this task", "height", getTaskParameter.ProverHeight) return nil, nil }Let's verify potential race conditions:
✅ Verification successful
Original metrics suggestion is valid, but race condition concerns are addressed.
The existing implementation already handles concurrent access safely through:
- Optimistic concurrency control in UpdateBatchAttempts
- Independent operations with proper context handling
- Safe read-then-write pattern for task status checks
The metrics suggestion remains valuable for operational visibility.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for proper transaction handling and locking mechanisms # Check for transaction boundaries rg -B 5 -A 5 "GetTaskOfProver.*ctx" # Check for concurrent access patterns rg -B 5 -A 5 "UpdateBatchAttempts.*ctx"Length of output: 7174
67-77
: Consider enhancing error handling and threshold validation.While the external prover threshold check aligns with the PR objectives, consider these improvements:
- Add context to the error log for better debugging
- Validate the threshold configuration at startup
if getCountError != nil { - log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError) + log.Error("failed to get unassigned batch proving tasks count", + "height", getTaskParameter.ProverHeight, + "prover", taskCtx.ProverName, + "threshold", bp.cfg.ProverManager.ExternalProverThreshold, + "err", getCountError) return nil, ErrCoordinatorInternalFailure }Let's verify the threshold configuration:
✅ Verification successful
External prover threshold implementation verified as correct and safe
The threshold implementation is properly configured with a reasonable default value (32) and is consistently used across all prover tasks. While explicit validation could be added as a good-to-have, the current implementation is safe through JSON type checking and implicit validation in comparison operations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check if ExternalProverThreshold is properly configured across the codebase # Check for threshold validation in config initialization rg -A 5 "ExternalProverThreshold.*=|ExternalProverThreshold.*:" # Check for any hardcoded threshold values that might conflict rg -g '!*.md' -g '!*.txt' "ExternalProverThreshold"Length of output: 1188
Script:
#!/bin/bash # Check for config validation and initialization ast-grep --pattern 'func $_($$$ *Config $$$) error { $$$ }' # Check for any default value assignments rg -A 3 "DefaultConfig|NewConfig|InitConfig"Length of output: 8802
Script:
#!/bin/bash # Check config file content fd -e json -e example config coordinator/conf/ # Check for any validation during coordinator startup rg -B 3 -A 3 "ProverManager.*ExternalProverThreshold" coordinator/cmd/Length of output: 150
Script:
#!/bin/bash # Check config file content cat coordinator/conf/config.json # Check coordinator initialization/startup code ast-grep --pattern 'func $_($$$ *Coordinator $$$) { $$$ }' # Check for any validation in ProverManager initialization rg -A 5 "type ProverManager struct" coordinator/Length of output: 1677
github also has a nice feature that when applying suggestions, can use batch apply, so the commit history would be more concise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (6)
coordinator/internal/logic/provertask/batch_prover_task.go (2)
69-71
: Enhance error logging with additional context.Consider adding the unassigned batch count to the error log to help with debugging threshold-related issues.
- log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "err", getCountError) + log.Error("failed to get unassigned batch proving tasks count", "height", getTaskParameter.ProverHeight, "threshold", bp.cfg.ProverManager.ExternalProverThreshold, "err", getCountError)
105-105
: Consider extracting magic number into a constant.The value
2
inGetFailedProverTasksByHash
appears to be a limit. Consider extracting it into a named constant for better maintainability.+const maxFailedTasksToCheck = 2 - proverTasks, getTaskError := bp.proverTaskOrm.GetFailedProverTasksByHash(ctx.Copy(), message.ProofTypeBatch, tmpBatchTask.Hash, 2) + proverTasks, getTaskError := bp.proverTaskOrm.GetFailedProverTasksByHash(ctx.Copy(), message.ProofTypeBatch, tmpBatchTask.Hash, maxFailedTasksToCheck)coordinator/internal/utils/prover_name.go (1)
5-15
: Enhance input validation and documentationThe function could benefit from more robust validation and clearer documentation:
- The comment mentions "cloud_prover_{provider-name}_index" format but the code only checks first 3 segments
- Magic number '3' should be defined as a constant
Consider this implementation:
+const minProverNameSegments = 3 func IsExternalProverNameMatch(localName, remoteName string) bool { + // Validate input strings are not empty + if localName == "" || remoteName == "" { + return false + } + local := strings.Split(localName, "_") remote := strings.Split(remoteName, "_") - if len(local) < 3 || len(remote) < 3 { + if len(local) < minProverNameSegments || len(remote) < minProverNameSegments { return false } - // note the name of cloud prover is in fact in the format of "cloud_prover_{provider-name}_index" + // Matches cloud prover names in the format: "cloud_prover_{provider-name}_index" + // Returns true if the first three segments (cloud, prover, provider-name) match return local[0] == remote[0] && local[1] == remote[1] && local[2] == remote[2] }coordinator/internal/types/prover.go (1)
54-54
: Fix typo in error messageThe error message has a duplicate word "type".
- return fmt.Sprintf("prover provider type type: %d", r) + return fmt.Sprintf("illegal prover provider type: %d", r)coordinator/internal/types/auth.go (1)
Line range hint
1-40
: Consider backward compatibility strategyThe introduction of the provider type system is well-structured, but the implementation might benefit from a phased approach:
- The required binding on new fields could break existing clients
- The coexistence of ProverTypes and ProverProviderType needs a clear migration path
- Consider adding validation logic to ensure consistency between these fields during the transition period
Consider implementing a migration strategy:
- Make new fields optional initially
- Add a deprecation notice for ProverTypes
- Document the migration timeline for clients
- Add validation to ensure consistency between old and new fields during transition
coordinator/internal/logic/auth/login.go (1)
111-112
: Consider enhancing the error message with the actual value.The error logging is good, but the returned error message could be more descriptive for debugging purposes.
Consider this improvement:
- return errors.New("invalid prover provider type.") + return fmt.Errorf("invalid prover provider type: %v", login.Message.ProverProviderType)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
coordinator/internal/controller/api/auth.go
(1 hunks)coordinator/internal/logic/auth/login.go
(1 hunks)coordinator/internal/logic/provertask/batch_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/bundle_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/chunk_prover_task.go
(3 hunks)coordinator/internal/logic/provertask/prover_task.go
(3 hunks)coordinator/internal/orm/prover_task.go
(2 hunks)coordinator/internal/types/auth.go
(2 hunks)coordinator/internal/types/prover.go
(1 hunks)coordinator/internal/utils/prover_name.go
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- coordinator/internal/logic/provertask/prover_task.go
- coordinator/internal/logic/provertask/chunk_prover_task.go
- coordinator/internal/orm/prover_task.go
- coordinator/internal/logic/provertask/bundle_prover_task.go
🔇 Additional comments (7)
coordinator/internal/logic/provertask/batch_prover_task.go (4)
25-25
: LGTM!The import addition is necessary for the external prover name matching functionality.
67-77
: LGTM! External prover threshold check implementation.The implementation aligns with the PR objectives by prioritizing task assignment based on the unassigned task count threshold.
104-116
: LGTM! Robust implementation of failed task reassignment prevention.The implementation effectively prevents reassigning failed tasks to the same prover, with special handling for external provers.
Line range hint
67-116
: Verify performance impact of the new assignment logic.The changes to task assignment logic (external prover threshold and failed task checks) could impact system performance. Please ensure monitoring is in place.
Run the following script to check if relevant metrics are being tracked:
✅ Verification successful
Performance monitoring is already in place for task assignment changes
The codebase has comprehensive prometheus metrics and logging that cover:
- Task assignment tracking via
coordinator_batch_get_task_total
andcoordinator_get_task_count
- Failure monitoring through various
coordinator_validate_failure_*
metrics- Task timeout tracking with
coordinator_batch_prover_task_timeout_total
- Detailed error logging for assignment failures
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for performance monitoring metrics # Check for prometheus metric definitions rg -A 2 'prometheus.CounterVec|prometheus.Counter|prometheus.Gauge' | grep -v '_test.go' # Check for relevant logging patterns that could be used for monitoring rg -A 2 'log\.(Info|Error|Debug).*batch.*prov(er|ing)'Length of output: 65812
coordinator/internal/types/auth.go (1)
38-40
: Clarify relationship between ProverProviderType and ProverTypesThe Message struct contains both ProverProviderType and ProverTypes fields, which seems redundant. Additionally:
- The binding:"required" tag on ProverProviderType might break backward compatibility
- The relationship between these fields needs documentation
Let's check for existing usages:
Consider:
- Document the relationship between these fields
- If ProverTypes is being deprecated, mark it as such
- Consider making ProverProviderType optional initially for backward compatibility
coordinator/internal/logic/auth/login.go (2)
110-113
: LGTM! The provider type validation aligns with the PR objectives.The validation ensures that only internal (static) and external (dynamic) provers can authenticate, which is essential for implementing the new prover assignment strategy.
109-114
: Verify error handling in upstream code.Since this introduces a new validation that could reject prover logins, we should ensure that the coordinator's retry mechanisms and monitoring systems properly handle these failures.
Let's check how these errors are handled:
✅ Verification successful
Error handling is sufficient for the new validation.
The coordinator has comprehensive error handling with:
- Retry mechanisms (up to 5 attempts)
- Prometheus metrics tracking failures
- Timeout detection and recovery
- Detailed error logging
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for error handling patterns around login failures rg -A 5 "invalid prover provider type" --type goLength of output: 368
Script:
#!/bin/bash # Find the login function definition and its callers ast-grep --pattern 'func $_ Login($$$) {$$$}' # Look for retry mechanisms rg -B 2 -A 5 'retry|backoff' --type go # Search for monitoring/metrics related to login rg -B 2 -A 5 'metrics|prometheus|monitor' --type go coordinator/internal/Length of output: 41939
…ailing task to same prover
Purpose or design rationale of this PR
Describe your change. Make sure to answer these three questions: What does this PR do? Why does it do it? How does it do it?
docs
PR title
Your PR title must follow conventional commits (as we are doing squash merge for each PR), so it must start with one of the following types:
Deployment tag versioning
Has
tag
incommon/version.go
been updated or have you addedbump-version
label to this PR?Breaking change label
Does this PR have the
breaking-change
label?Summary by CodeRabbit
Configuration
external_prover_threshold
to control external prover task assignment.Task Management
Improvements
Version Update
v4.4.86
.