AI/ML Solution Architect
About Us:
Headquartered in Sunnyvale, with offices in Dallas & Hyderabad, Fission Labs is a leading software development company specializing in crafting flexible, agile, and scalable solutions that propel businesses forward. With a comprehensive range of services, including product development, cloud engineering, big data analytics, QA, DevOps consulting, and AI/ML solutions, we empower clients to achieve sustainable digital transformation that aligns seamlessly with their business goals.
Fission Labs Website: https://www.fissionlabs.com/
Key Responsibilities:
Architecture & Infrastructure
● Design, implement, and optimize end-to-end ML training workflows including infrastructure setup, orchestration, fine-tuning, deployment, and monitoring.
● Evaluate and integrate multi-cloud and single-cloud training options across AWS and other major platforms.
● Lead cluster configuration, orchestration design, environment customization, and scaling strategies.
● Compare and recommend hardware options (GPUs, TPUs, accelerators) based on performance, cost, and availability.
Performance & Optimization
● Conduct performance benchmarking, hardware comparisons, and cost-performance trade-off analysis.
● Implement real-time monitoring and control systems with metrics collection, observability, and custom performance tracking.
● Optimize cost models, budget predictability, and resource utilization.
Data & Training Pipelines
● Architect and validate data pipelines with storage, persistence, and throughput optimization.
● Oversee data quality validation, pre-processing, and long-term experiment tracking.
● Support framework flexibility for diverse training techniques (supervised, unsupervised, fine-tuning, reinforcement learning).
Integration & Deployment
● Ensure seamless deployment across multi-cloud environments with security, compliance, and regional availability considerations.
● Collaborate with DevOps and MLOps teams for automation, fault tolerance, job scheduling, and orchestration testing.
● Provide technical guidance on integration with existing enterprise systems.
Analysis & Recommendations
● Lead result analysis, insight generation, and actionable recommendations for training performance and user experience improvements.
● Present performance claims, benchmarking reports, and speculative decoding insights to stakeholders.
Requirements:
Technical Expertise
● 10+ years in architecture roles with at least 5 years in AI/ML infrastructure and large-scale training environments.
● Expert in AWS cloud services (EC2, S3, EKS, SageMaker, Batch, FSx, etc.) and familiar with Azure, GCP, and hybrid/multi-cloud setups.
● Strong knowledge of AI/ML training frameworks (PyTorch, TensorFlow, Hugging Face, DeepSpeed, Megatron, Ray, etc.).
● Proven experience with cluster orchestration tools (Kubernetes, Slurm, Ray, SageMaker, Kubeflow).
● Deep understanding of hardware architectures for AI workloads (NVIDIA, AMD, Intel Habana, TPU).
Performance & Cost Management
● Demonstrated expertise in performance benchmarking, reliability testing, and training speed optimization.
● Skilled in cost modeling, budget forecasting, and cost-performance balancing.
Monitoring & Observability
● Experience with real-time monitoring tools (Prometheus, Grafana, CloudWatch) and custom metric instrumentation.
● Familiarity with network performance testing, regional load testing, and multi-region deployment strategies.
Soft Skills
● Strong problem-solving skills with an analytical mindset.
● Excellent communication skills to present technical trade-offs and strategic recommendations to executives and engineering teams.
● Ability to lead cross-functional teams and drive innovation in AI infrastructure.
We Offer:
● Opportunity to work on technical challenges with global impact.
● Vast opportunities for self-development, including online university access and sponsored certifications.
● Sponsored Tech Talks & Hackathons to foster innovation and learning.
● Generous benefits package including health insurance, retirement benefits, flexible work hours, and more.
● Supportive work environment with forums to explore passions beyond work.
This role presents an exciting opportunity for a motivated individual to contribute to the development of cutting-edge solutions while advancing their career in a dynamic and collaborative environment.