Responsibilities
Operate and manage Kubernetes or OpenShift clusters for multi-node orchestration.
Deploy and manage LLMs and other AI models for inference using Triton Inference Server or custom endpoints.
Automate CI/CD pipelines for model packaging, serving, retraining, and rollback using GitLab CI or ArgoCD.
Set up model and infrastructure monitoring systems (Prometheus, Grafana, NVIDIA DCGM).
Implement model drift detection, performance alerting, and inference logging.
Manage model checkpoints, reproducibility controls, and rollback strategies.
Track deployed model versions using MLflow or an equivalent registry tool.
Implement secure access controls for model endpoints and data artifacts.
Collaborate with AI/Data Engineers to integrate and deploy fine-tuned models and their datasets.
Ensure high availability, performance, and observability of all AI services in production.
Requirements
3 years of experience in DevOps, MLOps, or AI/ML infrastructure roles.
10 years of overall experience in solution operations.
Proven experience with Kubernetes or OpenShift in production environments; certification preferred.
Familiarity with deploying and scaling PyTorch or TensorFlow models for inference.
Experience with CI/CD automation tools (e.g., GitLab CI, ArgoCD) on OpenShift/Kubernetes.
Hands-on experience with model registry systems (e.g., MLflow, Kubeflow).
Experience with monitoring tools (e.g., Prometheus, Grafana) and GPU workload optimization.
Strong scripting skills (Python, Bash) and Linux system administration knowledge.
Key Skills
Kubernetes, OpenShift, MLOps, CI/CD (GitLab CI, ArgoCD), Triton Inference Server, MLflow, Prometheus, Grafana, Python, Bash, Linux Administration, GPU Workload Optimization
Employment Details
Employment type: Full time
Vacancy: 1