Why Isn't My Job Starting?
SDSU HPC uses Slurm to manage job scheduling and resource allocation. If your job isn't starting, or seems delayed compared to others, this guide can help you understand why, how Slurm prioritizes jobs, and how to troubleshoot your jobs.
Key Concepts in Slurm Scheduling
Priority
Priority is the core mechanism Slurm uses to decide the order in which queued jobs are started. Every queued job gets a priority score, and the scheduler (Slurm) starts the highest-priority jobs first when resources become available. Job priority is calculated from factors including the following:
- Fairshare – Users who have consumed fewer resources recently receive a higher priority.
- Job Age – Jobs gain priority the longer they wait in the queue.
- Dependencies / Holds – Jobs with active dependencies or holds, including interactive jobs, won't be scheduled regardless of priority.
Note: These factors only apply to jobs waiting in the queue. Once a job is running, it continues until completion (or until it is canceled, fails, or hits its walltime).
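To see how these factors combine for your own jobs, you can ask Slurm for the priority breakdown directly. A minimal sketch (the job ID 12345 is a placeholder):

```bash
# Show the per-factor priority breakdown (age, fairshare, etc.) for your jobs
sprio -u $USER

# Show state, priority, and the scheduler's reason a specific job is pending
squeue -j 12345 -o "%.10i %.9P %.8T %.10Q %.20r"
```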
Fairshare
Fairshare is a dynamic priority system for Slurm that helps ensure equitable access to cluster resources such as CPUs, memory, and GPUs. Instead of using a strict “first come, first served” model, Slurm continuously tracks each user’s recent resource usage and adjusts job priorities accordingly.
- Users (or accounts) that have consumed more than their fair share of resources will see their new jobs assigned a lower priority until others have had the opportunity to run.
- Users (or accounts) that have used less than their fair share will see their jobs receive a priority boost.
- The fairshare factor is time-decayed, meaning older usage gradually “fades” and stops impacting current scheduling.
Fairshare helps with the following:
- Prevents a small number of users from monopolizing cluster resources.
- Keeps resource distribution aligned with the allocations/policies set by the cluster administrators.
- Ensures both heavy and light users have fair opportunities to run jobs over time.
Innovator HPC Cluster Fairshare Example:
If the GPU partition is full, fairshare adjusts job priority based on past usage.
- User A has used 200 GPU-hours this week and submits a job for 8 GPUs × 24h.
- User B has only used 10 GPU-hours and submits 2 GPUs × 12h at the same time.
Even if User A submits first, their priority is lowered, while User B’s job is boosted to prevent heavy users from blocking lighter ones. Usage doesn’t count against you forever; it decays over time, so heavy usage gradually “ages out” of the calculation. As a result, User A’s old GPU-hours matter less each day, and their job priority automatically recovers, allowing them to compete more evenly again.
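You can check your own fairshare standing at any time; a FairShare value closer to 1.0 generally means a stronger priority boost. A minimal sketch:

```bash
# Show your fairshare factor and recent usage (see the FairShare column)
sshare -u $USER
```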
Backfill Scheduling
Backfill allows Slurm to run smaller or shorter jobs ahead of larger jobs, as long as doing so does not delay the larger job’s expected start time.
Example: Suppose a large job is waiting for 100 CPUs. Slurm knows it won’t have all 100 available until 4:00 PM. In the meantime, if your job only needs 2 CPUs for 10 minutes, Slurm may backfill it into the schedule.
This ensures that available resources are not left idle, improving overall system utilization and throughput.
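Backfill depends on the walltime you request: Slurm can only slot your job in if it knows the job will finish before the larger job's reserved start time. A minimal sketch of a small, backfill-friendly job script (the executable is a placeholder; quickq is the short-walltime queue described under Partition Rules below):

```bash
#!/bin/bash
#SBATCH --job-name=short-test
#SBATCH --ntasks=2             # small CPU request, easier to fit
#SBATCH --time=00:10:00        # short, realistic walltime enables backfill
#SBATCH --partition=quickq     # short-walltime partition

srun ./my_program              # placeholder executable
```

Requesting the maximum walltime "just in case" makes your job harder to backfill, so estimate your runtime as accurately as you can.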
Common Reasons Your Job Isn't Starting
Low Priority - Slurm uses priority to decide which jobs start first. Your job may be waiting because of:
- High recent usage (low fairshare)
- Newer submission time
- Large resource request
Resource Constraints - Jobs may wait longer if they request:
- Large amounts of resources
- Specialized resources (GPU, Bigmem)
- Long walltimes
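Before requesting a large or specialized allocation, it can help to check what is actually free. A minimal sketch (the partition name gpu is an example):

```bash
# Show all partitions, their time limits, and node states
sinfo

# Show per-node state, CPU count, and memory for one partition
sinfo -p gpu -N -o "%.12N %.6t %.10c %.10m"
```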
Partition Rules - Partitions/Queues have limits on:
- Walltime (14 days maximum; 12 hours on quickq)
Job Holds or Dependencies
- Jupyter Notebooks (only one notebook is allowed at a time)
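If you suspect a hold or dependency is blocking your job, you can inspect it, and release any hold you set yourself. A minimal sketch (12345 is a placeholder job ID):

```bash
# Inspect the job's dependency chain and pending reason
scontrol show job 12345 | grep -E "Dependency|Reason"

# Release a job you placed on hold yourself
scontrol release 12345
```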
Other Jobs Running First
- If the cluster is already full of running jobs, your job will not cancel another user's job in order to run.
- Jobs with higher priority will start before yours.
- Smaller jobs eligible for backfill may start before yours.
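As a quick first step for any waiting job, check the reason Slurm reports and its estimated start time. A minimal sketch (12345 is a placeholder job ID):

```bash
# List your pending jobs with the scheduler's reason
# (e.g. Priority, Resources, Dependency, JobHeldUser)
squeue -u $USER -t PENDING -o "%.10i %.9P %.20j %.20r"

# Ask Slurm for the expected start time of a specific job
squeue --start -j 12345
```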