Commit Graph

25 Commits

Author SHA1 Message Date
dailz
8955e513aa fix(upload): add 30s timeout to background chunk cleanup goroutine
The cleanup goroutine used context.Background() with no timeout, so if
MinIO accepted TCP connections but never responded, the goroutine would
block indefinitely. Now uses context.WithTimeout to prevent leaks.
2026-04-21 13:35:21 +08:00
dailz
f13377ca7d fix(task): return empty string for unknown/empty Slurm states instead of defaulting to running
mapSlurmStateToTaskStatus previously defaulted to 'running' for empty
state arrays and unrecognized states. This was too aggressive — treating
unknown as actively running could cause incorrect status updates when
Slurm returns unexpected or empty state data.

Now empty/unknown states return an empty string, and refreshTaskStatus
skips the update in that case.
2026-04-21 13:23:40 +08:00
dailz
b90942de77 fix(task): prevent duplicate Slurm job submission on backend restart
RecoverStuckTasks now skips tasks that already have a slurm_job_id,
and ProcessTask adds a guard before the submitting step to prevent
re-submission even if a task is incorrectly re-enqueued.

Also deprecates POST /api/v1/jobs/submit endpoint (replaced by POST /tasks)
and comments out related handlers and tests.
2026-04-21 10:57:38 +08:00
dailz
4fd331ebd8 feat(application): add Environment field and inject into Slurm job submission 2026-04-21 10:23:31 +08:00
dailz
d9ca9233b3 fix(service): correct CPU/memory mapping and add TRES/memory_used extraction
- Map CPUs to CpusPerTask (not MinimumCpus) for consistent SlurmDBD history

- Add Set:true to memory Uint64NoVal on submission

- Filter number=0 in mapUint64NoValToInt64 to avoid false zeros

- Extract peak memory from Steps.Tres.Requested.Max across all steps

- Add formatTresList, parseGresDetail, extractMemoryFromSteps helpers

- Update mapJobInfo and mapSlurmdbJob with new field mappings

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-20 17:10:19 +08:00
dailz
08ca4da691 fix(service): inject WORK_DIR and map file_ids before param validation
Previously ValidateParams ran before WORK_DIR injection and file_ids mapping,
causing required parameter missing errors for auto-handled params. Now the
execution order is: inject WORK_DIR, map file_ids to file params, validate, resolve.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-20 13:41:28 +08:00
dailz
e90904cedb test(service): add tests for task defaults and job status
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-20 10:38:49 +08:00
dailz
166ca3092c feat(service): add task defaults, job status, and cluster helpers
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-20 10:38:41 +08:00
dailz
6a7bde4801 test(service): add tests for WORK_DIR injection and file-type param resolution
- ValidateParams: file/directory type validation tests

- RenderScript: file-type not escaped, WORK_DIR injected without quotes

- ProcessTask: file_id→filename resolution, invalid ID, missing file scenarios

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-16 17:56:42 +08:00
dailz
07ae8ad6cd feat(service): auto-inject WORK_DIR and resolve file-type params in task script rendering
- ProcessTask injects $WORK_DIR only when script template uses it

- File/directory type params: resolves file_id to filename before rendering

- ValidateParams validates file/directory params as valid int64 file IDs

- RenderScript no longer shell-escapes file/directory type values

- Log rendered script before submitting to Slurm for debugging

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-16 17:56:28 +08:00
dailz
1d591efeba feat(model,service): add folder_path and user_id to FileResponse, add user_id filter to ListFiles
- FileResponse gains folder_path ("/" for root) and user_id fields

- folder_id no longer uses omitempty, root files return null

- ListFiles accepts optional userID parameter for filtering by owner

- New buildFileResponse helper populates folder_path from FolderStore

- New GetFileResponse method wraps GetFileMetadata + buildFileResponse

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-16 16:04:22 +08:00
dailz
36d842350c refactor(service): disable SubmitFromApplication fallback, fully replaced by POST /tasks
- Comment out SubmitFromApplication method and its fallback path

- Comment out 5 tests that tested the old direct-submission code

- Remove unused imports after commenting out the method

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-16 15:15:42 +08:00
dailz
ec64300ff2 feat(service): add TaskService, FileStagingService, and refactor ApplicationService for task submission 2026-04-15 21:31:02 +08:00
dailz
79870333cb fix(service): tolerate concurrent pending-to-uploading status race in UploadChunk
When multiple chunk uploads race on the pending→uploading transition, ignore ErrRecordNotFound from UpdateSessionStatus since another request already completed the update.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-15 10:27:12 +08:00
dailz
f0847d3978 feat(service): add upload, download, file, and folder services
Add UploadService (dedup, chunk lifecycle, ComposeObject), DownloadService (Range support), FileService (ref counting), FolderService (path validation).

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-15 09:23:09 +08:00
dailz
a65c8762af fix(service): add environment variables and fix work directory permissions for Slurm job submission
Slurm requires environment variables in job submission; without them it returns 'batch job cannot run without an environment'. Also chmod the entire directory path to 0777 to bypass umask, ensuring Slurm and compute node users can write.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-14 13:06:51 +08:00
dailz
32f5792b68 feat(service): pass work directory to Slurm job submission
Add WorkDir to SubmitJobRequest and pass it as CurrentWorkingDirectory to Slurm REST API. Fixes Slurm 500 error when working directory is not specified.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-13 17:12:28 +08:00
dailz
d3eb728c2f feat(service): add Application service with parameter validation and script rendering
Add ApplicationService with ValidateParams, RenderScript, SubmitFromApplication. Includes shell escaping, longest-first parameter replacement, and work directory generation. 15 tests.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-13 17:10:09 +08:00
dailz
2cb6fbecdd feat(service): add pagination to GetJobs endpoint
GetJobs now accepts page/page_size query parameters and returns JobListResponse instead of raw array. Uses in-memory pagination matching GetJobHistory pattern.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 15:14:56 +08:00
dailz
f4177dd287 feat(service): add GetJob fallback to SlurmDBD history and expand query params
GetJob now falls back to SlurmDBD history when active queue returns 404 or empty jobs. Expand JobHistoryQuery from 7 to 16 filter params (add SubmitTime, Cluster, Qos, Constraints, ExitCode, Node, Reservation, Groups, Wckey).

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 13:43:31 +08:00
dailz
30f0fbc34b fix(slurm): correct PartitionInfoMaximums CpusPerNode/CpusPerSocket types to Uint32NoVal
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 11:39:29 +08:00
dailz
34ba617cbf fix(test): update log assertions for debug logging and field expansion
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 11:13:13 +08:00
dailz
824d9e816f feat(service): map additional Slurm SDK fields and fix ExitCode/Default bugs
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 11:12:51 +08:00
dailz
270552ba9a feat(service): add debug logging for Slurm API calls with request/response body and latency
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 10:28:58 +08:00
dailz
4903f7d07f feat: 添加业务服务层和结构化日志
- JobService: 提交、查询、取消、历史记录,记录关键操作日志

- ClusterService: 节点、分区、诊断查询,记录错误日志

- NewSlurmClient: JWT 认证 HTTP 客户端工厂

- 所有构造函数接受 *zap.Logger 参数实现依赖注入

- 提交/取消成功记录 Info,API 错误记录 Error

- 完整 TDD 测试,使用 zaptest/observer 验证日志输出

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-10 08:39:46 +08:00