docs: LHCb Workflows and Commands #125
Conversation
aldbr left a comment
Thanks! The approach sounds good. A few comments though:
- I think it would be interesting to have, for each type of job:
  - left: the current XML workflow
  - right: the CWL + pre/post-process equivalent
- Could you add a very brief description of each module, please?
- AnalyseXMLSummary cannot work in Processing, I think, because it needs to set some file statuses. So in the new workflow, we would need to have it in a post-process step. We could potentially have a check within LbRunApp, as we said, to fail the workflow before it starts executing further steps.
- BookkeepingReport and WorkflowAccounting would need modifications (they will process multiple app outputs at once; this needs to appear).
force-pushed from c2ac24f to 83bf433
chore(docs): Rewrite lhcb-workflows explanations
aldbr left a comment
A few minor comments, but I think it's almost ready.
Also, we could add a note about the shared instances between the modules. Instead of sharing instances, here we could depend on files.
Example: commands would add JSON requests to a failover directory, which would be read by FailoverRequest (maybe this could be generalized and used directly in the JobWrapper instead of as a Command?). What do you think?
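A minimal sketch of that idea, with hypothetical names (`FAILOVER_DIR`, `record_failover_request`, `collect_failover_requests`) standing in for whatever convention we settle on:

```python
import json
from pathlib import Path

# Hypothetical convention: every command drops its pending requests here.
FAILOVER_DIR = Path("failover")


def record_failover_request(operation: str, lfns: list[str]) -> None:
    """Used by any command that could not complete an operation:
    instead of holding a shared instance, it writes a JSON request file."""
    FAILOVER_DIR.mkdir(exist_ok=True)
    index = len(list(FAILOVER_DIR.glob("*.json")))
    request = {"operation": operation, "lfns": lfns}
    (FAILOVER_DIR / f"{operation}_{index}.json").write_text(json.dumps(request))


def collect_failover_requests() -> list[dict]:
    """FailoverRequest (or the JobWrapper directly) reads every pending
    request file and forwards it to RequestManagement."""
    return [json.loads(p.read_text()) for p in sorted(FAILOVER_DIR.glob("*.json"))]
```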
Also, I wonder whether UserJobFinalization is useful (I have the impression it could be reproduced if we chain the other commands correctly). What do you think?
I guess it would be easier to answer all these questions once you start the implementation, wouldn't it?
docs/design/lhcb-workflows.md (outdated)

```
}

state Processing_New {
LbRunApp_New: LbRunApp
```
Here (and for each new workflow), the processing should include dirac-cwl workflow.cwl.
I don't understand this; do you mean it should be stated somewhere that LbRunApp executes dirac-cwl workflow.cwl?
Well, it would be the contrary: dirac-cwl workflow.cwl would execute LbRunApp in a CommandLineTool (or a workflow of multiple CommandLineTools with LbRunApp for MCReconstruction jobs).
Oh yeah, that's right 👌🏻
Wouldn't this be the case for the USER Job Types too? Those would also require executing dirac-cwl workflow.cwl, no?
For user jobs, we would just have dirac-cwl workflow.cwl (users can execute whatever they want inside their workflow; they have total control over it).
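So in terms of the job wrapper, the processing stage for both cases boils down to invoking the same command. A minimal sketch, assuming dirac-cwl is available on the PATH:

```python
import subprocess


def processing_stage(workflow_path: str = "workflow.cwl") -> None:
    # Production jobs: the CWL workflow wraps LbRunApp as a CommandLineTool
    # (or a chain of CommandLineTools for MCReconstruction).
    # User jobs: the workflow content is entirely user-defined.
    subprocess.run(["dirac-cwl", workflow_path], check=True)
```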
```
Processing_New: Processing
PostProcessing_New: PostProcessing

state PreProcessing_New {
```
I think we need (only for simulation) to run getEventsToProduce, whose result we would need to include in the prodconf.json file. What do you think?
I think for both simulation and reconstruction, we might need to resolve the DDDB/CondDB tags.
Unless you see another way to compute them in the CWL workflow in LbProdRun?
getEventsToProduce consists of multiple gConfig.getValue(...) calls that obtain:
Info from Config:
- CPUNormalizationFactor
Info from the CE:
- GridCE
- CEQueue
- maxCPUTime
- CPUTimeLeft
- CPUNormalizationFactor (if not known already)
And then performs some calculations.
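Roughly, the calculation has this shape (a simplified sketch of the logic, not the actual LHCbDIRAC code):

```python
def events_to_produce(cpu_e, cpu_time_left, cpu_normalization_factor, max_number_of_events=None):
    """Estimate how many events fit in the remaining (normalized) CPU time,
    where cpu_e is the CPU work needed to produce one event."""
    events = int(cpu_time_left * cpu_normalization_factor / cpu_e)
    if max_number_of_events:
        events = min(events, max_number_of_events)
    return events


# e.g. 86400 s left at a normalization factor of 10, 1000 CPU-units per event -> 864 events
print(events_to_produce(1000, 86400, 10, max_number_of_events=500))  # capped at 500
```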
getEventsToProduce is a utility function, so it can probably reside inside LbProdRun, as it is only used during the processing step, no other command uses this info (the number of events).
Also, couldn't this be processed before sending the workflow? The maxCPUTime, CPUTimeLeft and CPUNormalizationFactor values of each CE are known beforehand, no?
For the tags, currently they are obtained from the Parameters of the GaudiApplication Module, so they are defined in the Workflow before sending the job and only used inside GaudiApplication.
I think for both cases we should simply modify LbProdRun: as it's fixed information, it shouldn't need to contact DIRAC; everything is obtained from files that dirac-cwl should have access to.
What do you think?
Doesn't getEventsToProduce need CPUe, which comes from outside the node? (CPUe being the CPU work needed to compute 1 event.)
It's part of the function call: def getEventsToProduce(CPUe, CPUTime=None, ... )
And it's obtained from gaudiAppModule.CPUe at RunApplication, so in principle it should be in the workflow.
#### RunApplication.py 159-171
```python
if (
    not gaudiAppModule.stepInputData
    and gaudiAppModule.CPUe
    and gaudiAppModule.maxNumberOfEvents
    and gaudiAppModule.numberOfEvents <= 0
):
    # Here we set maxCPUTime to 24 hours, which seems reasonable
    prodInfo["input"]["n_of_events"] = getEventsToProduce(
        gaudiAppModule.CPUe, maxNumberOfEvents=gaudiAppModule.maxNumberOfEvents, jobMaxCPUTime=86400
    )
else:
    prodInfo["input"]["n_of_events"] = gaudiAppModule.numberOfEvents
```
Yes, you're right, this is indeed injected from the transformation into the XML job description.
But I wonder if it makes sense to try to get these grid-related details from within LbProdRun, which could also be executed locally. What do you think?
Yeah, when it is executed locally it doesn't make much sense.
We could create a pre-processing command that retrieves this info, stores it in some kind of "config file" that is read by LbProdRun, mimicking workflow_commons. If it's executed locally, we could prepare this file however we like.
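A minimal sketch of that pre-processing command, with hypothetical names (grid_context.json, gather_grid_context) and indicative configuration paths; gConfig.getValue is the regular DIRAC API:

```python
import json
from pathlib import Path

from DIRAC import gConfig

# Hypothetical file name, playing the role of workflow_commons for LbProdRun.
CONTEXT_FILE = Path("grid_context.json")


def gather_grid_context() -> None:
    """Pre-processing command: collect the grid-related values once,
    so that LbProdRun never has to contact DIRAC itself."""
    context = {
        "GridCE": gConfig.getValue("/LocalSite/GridCE", ""),
        "CEQueue": gConfig.getValue("/LocalSite/CEQueue", ""),
        "CPUNormalizationFactor": gConfig.getValue("/LocalSite/CPUNormalizationFactor", 0.0),
    }
    CONTEXT_FILE.write_text(json.dumps(context))


def read_grid_context() -> dict:
    """Read by LbProdRun; for local runs the file can be written by hand."""
    return json.loads(CONTEXT_FILE.read_text()) if CONTEXT_FILE.exists() else {}
```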
```
}

state PostProcessing_New {
AnalyseXmlSummary_New: AnalyseXmlSummary
```
It could be interesting to know whether any of these commands could run in parallel. Naively, it seems that UploadLogFile, WorkflowAccounting and BookkeepingReport could potentially run in parallel. Am I wrong?
That's right, I can put a fork in the state diagram for those 3 commands. This also happens during reconstruction, right? For that one, I think we should also include RemoveInputData in the fork.
We will need to take that into account during the implementation, though; right now the commands run one after another.
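A minimal sketch of what the fork could look like at implementation time, assuming each command is exposed as a plain callable (all names here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor


def upload_log_file(): ...        # placeholder for UploadLogFile
def workflow_accounting(): ...    # placeholder for WorkflowAccounting
def bookkeeping_report(): ...     # placeholder for BookkeepingReport


def post_processing():
    """Run the independent commands in parallel, then continue sequentially."""
    parallel = [upload_log_file, workflow_accounting, bookkeeping_report]
    with ThreadPoolExecutor(max_workers=len(parallel)) as pool:
        futures = [pool.submit(command) for command in parallel]
        for future in futures:
            future.result()  # re-raise any failure before the next steps run
```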
```
Report file usage to a DataFileUsage service.

### UploadOutputData
```
Also, it could be interesting to see if there is any generic piece we could extract from the LHCb modules to be ported to DIRAC (this can wait for the implementation, I think; then we can see with CTAO how to generalize, for instance).
docs/design/lhcb-workflows.md (outdated)

```
| WorkflowAccounting | N/A | N/A | RunNumber ProdID EventType SiteName ProcessingStep CpuTime NormCpuTime InputsStats OutputStats InputEvents OutputEvents EventTime NProcs JobGroup FinalState |
| AnalyseFileAccess | XMLSummary.xml pool_xml_catalog.xml | N/A | N/A |
| UserJobFinalization | UserOutputData | request.json | JobId UserOutputSE SiteName UserOutputPath ReplicateUserOutData UserOutputLFNPrep |
| AnalyzeXmlSummary | XMLSummary.xml | N/A | ProdId ApplicationName |
```
Suggested change:

```diff
- | AnalyzeXmlSummary | XMLSummary.xml | N/A | ProdId ApplicationName |
+ | AnalyseXmlSummary | XMLSummary.xml | N/A | ProdId ApplicationName |
```
chore: Update processing stage to show dirac-cwl command
See #112
Apparently, GitHub does not support ELK as a mermaid renderer...