AUTO TUNING of jobs time limit
Problem: most jobs have a very long time limit, because users either do not set anything (and so get the default of 24h) or set a conservative limit that lets all jobs in the task succeed. But gWms uses the time limit both to kill jobs (so it is good to be conservative) and to schedule them. Using a long time for scheduling makes it impossible to fit jobs in the tail of existing pilots, leading to pilot churn and/or underutilization of partially used multicore pilots.
Solution (proposed by Brian) in two steps:
- Introduce two ClassAds attributes (see https://github.com/dmwm/CRABServer/pull/5463 for implementation):
  - EstimatedWallTimeMins: Used for matchmaking of jobs within HTCondor. This is initially set to the wall time requested by the user.
  - MaxWallTimeMins: If the job is idle (to be matched), evaluates to the value of EstimatedWallTimeMins. Otherwise, used by the condor_schedd for killing jobs that have gone over the runtime limit and set to the user-requested limit (in CRAB, this defaults to 20 hours).
- Introduce a mechanism (based on the existing work for WMAgent) to automatically tune EstimatedWallTimeMins based on the time it actually takes for jobs to run:
  - gwmsmon provides running time percentiles for a task.
  - a python script calculates the new EstimatedWallTimeMins as follows:
    - If less than 20 jobs have finished, or the gwmsmon query results in errors, do nothing!
    - If at least 20 jobs have finished, take the 95th percentile of the runtime for completed jobs; set estimated run time as min(95th percentile, user-provided runtime).
  - This python script will provide a new configuration for the JobRouter running on the CRAB3 schedd. The route will update the ClassAds for idle jobs.
    - JobRouter scales much better than a cronjob performing condor_qedit for CRAB3 jobs.
  - In order to preserve a single autocluster per task, all jobs in a CRAB3 task will get the same value of EstimatedWallTimeMins.
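The tuning rule in the second step can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the code in JobTimeTuner.py: the function name, the argument names and the choice to round the percentile up to a whole minute are assumptions.

```python
import math
from typing import Optional

MIN_FINISHED_JOBS = 20  # threshold quoted in the text


def tuned_estimate_mins(finished_jobs: int,
                        p95_runtime_mins: Optional[float],
                        user_walltime_mins: int) -> Optional[int]:
    """Return a new EstimatedWallTimeMins for a task, or None to leave it alone.

    p95_runtime_mins is None when the gwmsmon query failed (assumed convention).
    """
    # Too few finished jobs, or no usable percentile: do nothing.
    if finished_jobs < MIN_FINISHED_JOBS or p95_runtime_mins is None:
        return None
    # Round up (an assumption), and never exceed what the user asked for.
    return min(math.ceil(p95_runtime_mins), user_walltime_mins)
```

Since the result depends only on per-task quantities, every job of the task gets the same value, which is what keeps a single autocluster per task.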
As of April 19, 2017, Justas has done the work in gwmsmon [1]
Work to do is tracked in:
The links to the source:
- https://gitlab.cern.ch/CMSSI/SubmissionInfrastructureScripts/blob/crabdev/JobAutoTuner.py
- https://gitlab.cern.ch/CMSSI/SubmissionInfrastructureScripts/blob/crabdev/JobTimeTuner.py
QUESTIONS:
- how do we deal with jobs which run into time limit? Do we resubmit in post-job with limit *= 1.5 until we hit 48h? see https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/2617/1/1.html
- what happens to jobs killed by pilot reaching end of life before payload does?
REFERENCES:
- original mail thread: https://github.com/dmwm/CRABServer/files/938902/AutoTuningMails.pdf
- [1] gwmsmon API to use: http://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/<user>/<task> works from CERN LAN. Outside CERN need to use https and SSO
  - example: https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets which returns a json file
    {"hits": {"hits": [], "total": 263, "max_score": 0.0}, "_shards": {"successful": 31, "failed": 0, "total": 31}, "took": 40, "aggregations": {"2": {"values": {"5.0": 6.5436111111111126, "25.0": 11.444305555555555, "1.0": 3.5115222222222222, "95.0": 19.811305555555556, "75.0": 16.773194444444446, "99.0": 20.513038888888889, "50.0": 13.365277777777777}}}, "timed_out": false}
    in hopefully fixed forever format so that the "values" can be extracted and one would e.g. pick the 95.0 one (i.e. 19.8 hours)
- gwmsmon : https://github.com/dmwm/gwmsmon
- how this is done in production: https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/go_condor.py#L168 Note: There, Unified pre-computes the final number; go_condor.py is simply converting the Unified data to a JobRouter configuration. Here, we would be doing the calculation in the script based on the raw data from gwmsmon.
  - Unified prepares this configuration for go_condor.py to use: https://cmst2.web.cern.ch/cmst2/unified/equalizor.json
  - example of output JR rule from go_condor.py:
[
Name = "Set timing requirement to 105";
set_HasBeenRouted = false;
set_HasBeenTimingTuned = true;
GridResource = "condor localhost localhost";
Requirements = member(target.WMAgent_SubTaskName,
{
"/pdmvserv_task_TSG-PhaseIISpring17GS-00002__v1_T_170419_164853_2948/TSG-PhaseIISpring17GS-00002_0"
}) && ( target.HasBeenTimingTuned isnt true ) && ( target.MaxWallTimeMins <= 105 );
set_OriginalMaxWallTimeMins = 105;
TimeTasknames =
{
"/pdmvserv_task_TSG-PhaseIISpring17GS-00002__v1_T_170419_164853_2948/TSG-PhaseIISpring17GS-00002_0"
};
TargetUniverse = 5
]
- job router used in production is: https://gitlab.cern.ch:8443/ai/it-puppet-module-vocmshtcondor/blob/qa/code/templates/configs/90_prod_overflow.config and it uses the go_condor.py indicated above
- job router for CRAB schedd's is: https://gitlab.cern.ch:8443/ai/it-puppet-hostgroup-vocmsglidein/blob/master/code/templates/modules/condor/90_cmslpc_jobrouter.config or, in new schema, in: https://gitlab.cern.ch:8443/ai/it-puppet-module-vocmshtcondor/blob/qa/code/templates/configs/90_cmslpc_jobrouter.config which currently uses https://gitlab.cern.ch/CMSSI/SubmissionInfrastructureScripts/blob/master/CMSLPCRoute.py
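Extracting the 95th percentile from the gwmsmon reply quoted in reference [1] could look like the sketch below. It assumes the reply keeps the structure shown in the example (values in hours under aggregations / 2 / values); the function name, the SAMPLE_REPLY constant (trimmed from that example) and the convention of returning None on any problem are illustrative choices, not the behaviour of the actual tuner scripts.

```python
import json
import math
from typing import Optional

# Trimmed copy of the example reply in reference [1].
SAMPLE_REPLY = ('{"aggregations": {"2": {"values": {"95.0": 19.811305555555556, '
                '"50.0": 13.365277777777777}}}, "timed_out": false}')


def p95_runtime_mins(reply_text: str) -> Optional[int]:
    """Return the 95th percentile runtime in minutes, or None if unusable."""
    try:
        values = json.loads(reply_text)["aggregations"]["2"]["values"]
        hours = float(values["95.0"])
        # gwmsmon reports hours; ClassAds want minutes. Round up to be safe.
        return math.ceil(hours * 60)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing keys, or a non-numeric value: treat as a
        # failed query so that the caller leaves the task untouched.
        return None


print(p95_runtime_mins(SAMPLE_REPLY))  # 1189
```

Returning None on any error maps directly onto the "do nothing" branch of the tuning rule.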
This was tried in Feb 2018, but we had to roll it back since it was resulting in lots of early job kills and restarts, putting load on the schedd's and wasting resources.
Refined Solution
The problem is that we do not simply run on a vanilla condor pool, where that would have been fine. Our startd's are managed by glideinWms pilots, which have both a MATCH and a START expression.
Some documentation about this can be found in https://twiki.cern.ch/twiki/bin/view/CMS/GlideinWMSFrontendOpsGWMS, in particular: https://twiki.cern.ch/twiki/bin/view/CMS/GlideinWMSFrontendOpsGWMS#Writing_expressions_match_expr_s
Summarizing what's relevant for our use case:
- pilots are requested based on the MATCH expression
- jobs are matched to pilots with the START expression
- the START expression is also evaluated a second time when the job starts to run on the pilot (*), and if it becomes false, the job is kicked out (which happened to CRAB jobs)
- (*) From Diego Davila: "Using a more verbose debug setup at the startd, you can see that the START expression is evaluated twice, the second time JobStatus==2 in the jobClassAd, and jobStatus==1 in the first one"
There is a global match+start expression, then there are more for each frontend group, and those are ANDed. But currently the group match/start expressions do not involve time, so we only worry about: https://gitlab.cern.ch/CMSSI/cmsgwms-frontend-configurations/blob/cern/global/frontend.xml#L20
The relevant parts, translated loosely into English and leaving out the parts that only convert all classAds to the same unit (seconds), are:
match_expr : MaxWallTimeMins +10 min < ( GLIDEIN_Max_Walltime-GLIDEIN_Retire_Time_Spread )
start_expr : MaxWallTimeMins < GLIDEIN_ToDie-MyCurrentTime
Where:
| NAME | TYPICAL VALUE | MEANING |
|---|---|---|
| MaxWallTimeMins | few hours | what CRAB jobs request in their JDL |
| GLIDEIN_Max_Walltime | 2~3 days | max allowed time for the glidein, see http://glideinwms.fnal.gov/doc.prd/factory/custom_vars.html#lifetime |
| GLIDEIN_Retire_Time_Spread | 2 hours | a random spread to avoid glideins all ending simultaneously, see http://glideinwms.fnal.gov/doc.prd/factory/custom_vars.html#lifetime |
The value of those GLIDEIN_* classAds is dynamically set in the factories and can be queried via commands like this:
condor_status -pool vocms0805 -any -const 'MyType=="glidefactory" && regexp("CMSHTPC_T1_ES_PIC_ce07-multicore", Name)' -af GLIDEIN_Max_Walltime
where vocms0805 is the CERN_Factory
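The above means that when we want to start a job in a slot which may expire before the job does, we cannot let MaxWallTimeMins change between matching and running (as it did when it switched from the estimate to the user-requested limit), or the START expression becomes false at its second evaluation and the job will be immediately killed.
Hence we need to change the periodic_remove expression in the JDL so that it does not depend on MaxWallTimeMins.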
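The kick-out can be reproduced on paper with a small Python model of the two expressions. This is a sketch only: the function and argument names are made up, the GLIDEIN_* values are assumed to be in seconds (the unit conversions were dropped from the simplified expressions above), and the pilot and job numbers are invented for illustration.

```python
def match_expr(max_wall_time_mins: int,
               glidein_max_walltime_s: int,
               glidein_retire_time_spread_s: int) -> bool:
    # MaxWallTimeMins + 10 min < GLIDEIN_Max_Walltime - GLIDEIN_Retire_Time_Spread
    return ((max_wall_time_mins + 10) * 60
            < glidein_max_walltime_s - glidein_retire_time_spread_s)


def start_expr(max_wall_time_mins: int,
               glidein_to_die_s: int,
               my_current_time_s: int) -> bool:
    # MaxWallTimeMins < GLIDEIN_ToDie - MyCurrentTime
    return max_wall_time_mins * 60 < glidein_to_die_s - my_current_time_s


# Illustrative pilot with 6 hours left to live.
now_s = 0
to_die_s = 6 * 3600

# Original design: MaxWallTimeMins is the tuned estimate while the job is
# idle, then flips to the user-requested limit once the job is running.
estimate_mins, user_limit_mins = 300, 1200

first_eval = start_expr(estimate_mins, to_die_s, now_s)     # JobStatus == 1
second_eval = start_expr(user_limit_mins, to_die_s, now_s)  # JobStatus == 2

print(first_eval, second_eval)  # True False: matched, then kicked out
```

Keeping MaxWallTimeMins fixed at the estimate for both evaluations makes the second evaluation true as well, which is why the run-time limit has to move to a separate classAd.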
The easiest way seems to be to keep MaxWallTimeMins as the indicator of the pilot slot that we want the job to fit in, but use a different classAd for periodic_remove. This allows the job to run longer, either until the glidein dies (in which case the job is killed and automatically restarted) or until it hits the time limit defined for it in the new classAd. We can start by using the user-specified max time (or the default) for that limit, keeping the spirit of the initial proposal. So we will edit https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DagmanCreator.py
- define
  - MaxWallTimeMinsRun: the max time the job is allowed to run
  - MaxWallTimeMins: the time request for the matching (stick to the name gWMS wants)
- use MaxWallTimeMinsRun in place of MaxWallTimeMins in periodic_remove and +PeriodicRemoveReason
- in the JobRouter, time tuning will edit MaxWallTimeMins instead of (or in addition to) defining the new EstimatedWallTimeMins
The solution to the problem above is to have two classAds:
- MaxWallTimeMins used by gWms for matching and starting (name set in gWms FrontEnd config)
- MaxWallTimeMinsRun used by crab schedd's to set how long job can run before PeriodicRemove kicks in
Note that since the introduction of Automatic Splitting we have also two more similar classAds: MaxWallTimeMinsProbe and MaxWallTimeMinsTail which are used to set MaxWallTimeMins for probe and tail jobs. Time Tuning is not allowed (currently) on Automatic Splitting tasks, but generally speaking the bulk of jobs in those tasks may still benefit from being run in slots with less time to live than the (reasonably conservative) estimate obtained from the probe jobs.
This is now tracked as: https://github.com/dmwm/CRABServer/issues/5683 and was deployed in pre-prod on June 4th, 2018
Now need to get back to Questions above:
- how do we deal with jobs which run into time limit ?
- what happens to jobs killed by pilot reaching end of life before payload does ?
The answer to the second one is simple: HTCondor will restart those automatically (this was tested). We are left with the real one:
- STEP 0: try to minimize them by making the time estimate large enough that on average at least 99% of the jobs will fit. Beware that the first jobs to complete may be the ones which fail for random reasons: start very conservative in tuning. Refinements:
  - as done in Unified: do not time tune tasks running >24h, little to gain, and likely risky
- STEP 1: become more aggressive in time tuning (slowly) and watch things. Requires a good monitoring setup. But we currently expect that the above will be sufficient to cut idle slots below the attention threshold.
- STEP 2: use maxwalltime more aggressively to detect doomed jobs early (malfunctioning hardware, data reading problems) and restart them elsewhere, basically resetting MaxWallTimeMinsRun from the very conservative default (or the value set by users) to something sensible. This is very likely not needed, but if we find that we can't live with the associated inefficiency, here's a possible plan: look into a way to increase the match time for resubmissions. We need to verify which status sequence jobs go through, so that JobRouter leaves them alone if we change MaxWallTimeMins in the classAd according to JobStart. This requires some investigation, but it is possible that we can use MaxWallTimeMins=EstimatedWallTimeMins*(1+JobStarts) in the submit JDL and keep editing EstimatedWallTimeMins for Idle jobs only via the JobRouter. Also cap MaxWallTimeMins at 46h: MaxWallTimeMins=min(EstimatedWallTimeMins*(1+JobStarts),2760)
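As a worked example of the capped expression proposed in STEP 2, the sketch below evaluates it in Python for a job with a 10 hour estimate. The function name and the sample estimate are illustrative; in practice this would be a ClassAd expression in the submit JDL, not Python.

```python
MAX_WALLTIME_CAP_MINS = 2760  # 46 hours, the cap quoted in the text


def max_wall_time_mins(estimated_wall_time_mins: int, job_starts: int) -> int:
    """MaxWallTimeMins = min(EstimatedWallTimeMins * (1 + JobStarts), 2760)."""
    return min(estimated_wall_time_mins * (1 + job_starts),
               MAX_WALLTIME_CAP_MINS)


# A job estimated at 600 minutes asks for more time after each restart,
# until the 46 hour cap takes over.
for starts in range(5):
    print(starts, max_wall_time_mins(600, starts))
# 0 600
# 1 1200
# 2 1800
# 3 2400
# 4 2760
```

The request grows linearly with the number of starts, so a job that was killed for running over its estimate is matched to a longer slot on the next attempt.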