google.location == us-east5 in Nextflow config leads to wrong machine type choices #5602
Comments
Is it reporting an error? Can you please include the |
When comparing the returned data for the us-east5 region to the GCP Compute product listing for the us-east1 region (e.g., us-east1 region data), it becomes evident that the us-east5 data contains invalid information. |
@jorgee it is worth having a look at this. |
In this case, the job should fail instead of being submitted. Is that correct? |
It should likely report a warning and fall back on the default mechanism, which is this: Line 107 in 6183fdf
|
Are we expecting a valid response from the API call? The us-east5 region is valid, so the data returned should reflect this.
|
It should, but for some reason that we are investigating, it's failing. In any case, Nextflow should be resilient to these error conditions. |
I have run a pipeline using us-east5 and I see that there is already a fallback. Looking at the .nextflow.log, I can see that the CloudInfo call returns error 400. It raises an exception that is caught, and the job is then scheduled and runs with an instance type selected by Google Batch in the selected region. The log file shows the following:
and in the Google Batch service dashboard I can see the selected instance_type for the job:
In this part of the code, Nextflow invokes the GoogleBatchMachineTypeSelector.bestMachineType method, which calls the CloudInfo service in GoogleBatchMachineTypeSelector.getAvailableMachineTypes; that call throws the exception when error 400 is returned. The exception is caught and null is returned, so the instance type is not included in the job's allocation policy. When no instance type is provided, Google Batch automatically decides which instance to use. From the Nextflow point of view, I would just add a warning message to notify the user of this situation. I think there is no need to modify the fallback mechanism. |
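The fallback described above, with the proposed warning added, can be sketched roughly as follows. This is a minimal Python illustration, not the actual Groovy code in the plugin: `best_machine_type`, `fetch_machine_types`, and `failing_lookup` are hypothetical names, and the "take the first candidate" step is a placeholder for whatever ranking the real selector applies.

```python
import logging
from typing import Callable, Optional, Sequence

log = logging.getLogger("machine_type_sketch")


def best_machine_type(
    region: str,
    fetch_machine_types: Callable[[str], Sequence[str]],
) -> Optional[str]:
    """Return a machine type for the region, or None to let Google Batch choose.

    `fetch_machine_types` is a hypothetical stand-in for the CloudInfo lookup.
    """
    try:
        candidates = fetch_machine_types(region)
    except Exception as err:  # e.g. an HTTP 400 returned by the lookup service
        # Warn instead of failing silently, then fall back to "no machine type".
        log.warning(
            "Cannot fetch machine types for region %s (%s); "
            "leaving the machine type unset so Google Batch selects one",
            region,
            err,
        )
        return None
    # Placeholder selection: simply take the first candidate, if any.
    return candidates[0] if candidates else None


def failing_lookup(region: str) -> Sequence[str]:
    """Hypothetical lookup that always fails, mimicking the error 400 case."""
    raise RuntimeError("HTTP 400")
```

Calling `best_machine_type("us-east5", failing_lookup)` logs the warning and returns `None`, which corresponds to submitting the job without a machine type in the allocation policy.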
If a user specifies a particular machineType to use, the fallback machine will not align with their request. Should the process fail in such cases to ensure accurate provisioning? |
If the user specifies a particular machineType, then there is no fallback. The machineType is set on the Batch job. |
You're correct. Currently, for any machineType with a pattern like
|
@jorgee can you provide more details about this? |
Here, it is checking whether the machineType is a particular type, a comma-separated list, or a family (with *). It only sets the machine type if the user provides a particular type. When there is a list, we could choose one of them (for instance, the first one), but when the user provides a family, we cannot give any hint to Google Batch because the API only supports setting a particular machine type in the allocation policy. So the instance selected by Google Batch might not match the family specified by the user. There is a Google Compute API call to get the types that we could use, but I think we would be re-implementing the CloudInfo functionality. Therefore, I think the best option in this case is to raise a failure, suggesting that the user specify a single instance type. |
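To make the three cases concrete, here is a small Python sketch of the behaviour proposed in this comment when no machine type metadata is available. It is an illustration under assumptions, not the plugin's code: `machine_type_hint` is a hypothetical helper, and picking the first concrete entry of a list is only one possible policy.

```python
from typing import Optional


def machine_type_hint(value: Optional[str]) -> Optional[str]:
    """Return a concrete machine type to put in the allocation policy.

    Returns None when the user gave no machineType at all, and raises
    ValueError when only family patterns (containing '*') were given,
    since a pattern cannot be passed to Google Batch as a machine type.
    """
    if not value:
        return None
    # Split a comma-separated list; a single type yields a one-element list.
    entries = [entry.strip() for entry in value.split(",") if entry.strip()]
    # Keep only concrete types, i.e. entries without a wildcard.
    concrete = [entry for entry in entries if "*" not in entry]
    if concrete:
        return concrete[0]
    raise ValueError(
        f"Cannot resolve machineType '{value}' without machine type metadata; "
        "please specify a single machine type"
    )
```

With this sketch, a single type is passed through unchanged, a list resolves to its first concrete entry, and a family pattern alone fails before the job is submitted.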
My point is that if a concrete nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy Lines 614 to 618 in cd41091
and here nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy Line 356 in cd41091
Is that correct? |
Yes, that is correct; in that case there is no issue. The issue arises if the user provides a family. The job will continue without any machine type in the instance policy, so the selected instance might not fit the user's directive and could produce a failure. We could warn and continue (accepting that the job could fail) or raise a failure without submitting the job. |
I would limit the change to reporting a warning when CloudInfo fails to fetch the required metadata. Regarding this comment, the problem looks related to the case where both a machine type and a GPU are specified. However, I was wondering whether the problem occurs only when CloudInfo is failing or in any case, because I don't see any condition related to GPU filtering in the logic below: Lines 93 to 128 in cd41091
|
cloudinfo.seqera.io seems to be missing information for us-east5
https://cloudinfo.seqera.io/api/v1/providers/google/services/compute/regions/us-east5/products
yields
and as a consequence the GoogleBatchMachineTypeSelector can't find any available machine types for us-east5 here:
https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchMachineTypeSelector.groovy#L144
which can lead to Nextflow submitting Batch jobs with incorrect machine types.