Connect compute to my tasks: it’s too complicated
2023-11-28
We have plenty of options for connecting compute to our tasks/simulations, but they are all complicated and not easily compatible.
I want to run some batch compute for some tasks; it might be big, it might be small. There are so many choices:
- Kubernetes
- Slurm
- HTCondor
- Nomad
- AWS Batch
- Google’s equivalent
- Azure (I don’t even know, but I’d be surprised if they don’t have a similar thing)
Problems
- they require infrastructure skills, or an institution or group that has that support
- ad hoc solutions do not last
- the configurations are not transferable without significant work
- it’s unlikely that any single solution above will dominate, which reduces the likelihood of simpler solutions emerging for individuals and researchers
Proposed solution
Users do not think much about where the compute is performed. It just works.
The batch compute job specification is entirely encoded in the URL. For example:
https://<origin>/#?definition=<base64 encoded JSON>
The definition can start with simple docker container jobs:
job:
  type: docker
  image: <image-name>
  command: run-script.sh
  resources:
    cpu: 1000
    memory: 2gb
    ttl: 1h
The minimal example above is encoded in the URL. The origin can be changed dynamically, and represents the location that will actually execute the job definition.
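As a rough illustration of how a definition could travel in the URL, here is a minimal Python sketch. It makes several assumptions the post does not spell out: the JSON is UTF-8, standard base64 is used, the result is percent-encoded, and the hash parameters are separated with `&`. The function names, the `https://example.org` origin, the image name, and the nesting of the definition are all hypothetical; the real service may encode things differently.

```python
import base64
import json
from urllib.parse import quote, unquote


def encode_job_url(origin: str, definition: dict) -> str:
    """Pack a job definition into the hash portion of a URL (assumed format)."""
    raw = json.dumps(definition, separators=(",", ":")).encode("utf-8")
    b64 = base64.b64encode(raw).decode("ascii")
    # Percent-encode so '+', '/' and '=' survive inside the fragment.
    return f"{origin}/#?definition={quote(b64, safe='')}"


def decode_job_url(url: str) -> dict:
    """Recover the job definition from a URL built by encode_job_url."""
    fragment = url.split("#", 1)[1]  # e.g. "?definition=..."
    for param in fragment.lstrip("?").split("&"):
        key, _, value = param.partition("=")
        if key == "definition":
            return json.loads(base64.b64decode(unquote(value)))
    raise ValueError("no definition parameter in URL fragment")


# Hypothetical definition mirroring the minimal example above.
definition = {
    "job": {
        "type": "docker",
        "image": "alpine:3.19",  # placeholder image name
        "command": "run-script.sh",
        "resources": {"cpu": 1000, "memory": "2gb", "ttl": "1h"},
    }
}

url = encode_job_url("https://example.org", definition)
# The definition survives the round trip through the URL unchanged.
assert decode_job_url(url) == definition
```

Because everything lives after the `#`, the definition is not part of the path or query string of the request; it is read on the client side.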
Job inputs and outputs? Handled by the metapage library.
Advantages:
- You can click the link, then immediately test and edit the definition
- Agnostic of where the job is executed, as the origin can be changed without losing the definition
- The web is durable and everyone has a portal/environment (the browser) that will reliably view and execute the above definition
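The origin-swapping point can be sketched in a few lines of Python using only the standard library. The helper name `with_origin`, both origins, and the placeholder payload are hypothetical; the sketch only assumes that the definition lives entirely in the URL fragment, as described above.

```python
from urllib.parse import urlsplit, urlunsplit


def with_origin(job_url: str, new_origin: str) -> str:
    """Point job_url at new_origin, keeping path, query and fragment intact."""
    old = urlsplit(job_url)
    new = urlsplit(new_origin)
    return urlunsplit((new.scheme, new.netloc, old.path, old.query, old.fragment))


# Placeholder payload: percent-encoded base64 of the JSON {"job":{}}.
local = "http://localhost:8080/#?definition=eyJqb2IiOnt9fQ%3D%3D"
remote = with_origin(local, "https://compute.example.org")

# Only the origin changed; the encoded definition is untouched.
assert urlsplit(remote).fragment == urlsplit(local).fragment
print(remote)
```

The same link can therefore be tested against a local machine and then handed to a larger backend without re-encoding anything.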
Exactly where the Docker image runs on the backend is an implementation detail. To start out, users can simply connect their own personal computers, test and iterate, then copy the link to a domain with access to more resources: https://container.mtfm.io/#?tab=6