Software Engineer – Resiliency
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
About the Role:
T-Mobile runs some of the most transaction-intensive systems in U.S. telecommunications: millions of payments, device activations, and customer interactions processed every day. When those systems fail, customers feel it immediately. Your job is to make sure they don’t fail.
We’re building the next generation of resiliency solutions, including automated failover, cross-datacenter orchestration, observability pipelines, and AI Ops. This is hands-on, high-ownership engineering work with real consequences at real scale.
We’re looking for engineers who think from first principles, take full ownership of outcomes, and bring the kind of technical depth that earns trust across application teams, DBAs, network engineers, and platform architects alike. If you thrive in ambiguity, move fast, and hold yourself to a high bar, this role was built for you.
A Few Things Worth Knowing:
This team operates in an async-first model with regular sync touchpoints across U.S. and India time zones. You’ll have real ownership of components and workstreams — not just task execution. The systems you work on are production-critical, and the team holds itself to high standards for reliability, documentation, and operational discipline.
If you’re looking for a role where you’ll be handed clean requirements and a clear path, this probably isn’t it. If you want to do work that matters, build things from scratch, and grow fast in a high-trust environment, we’d like to talk.
We pride ourselves on encouraging a culture of innovation, agile ways of working, and transparency in all we do. Join us in embodying the spirit of the Un-carrier and make a tangible impact!
What You’ll Do:
- Design and build applications to improve the resiliency of T-Mobile’s critical systems
- Work across multiple technologies and applications
- Collaborate with a talented team in a fast-paced environment, learning and helping others learn
- Proactively engage application owners and drive conversations to unblock delivery
- Design and implement observability solutions — build monitoring dashboards, alerting, and health-check mechanisms to provide real-time visibility into failover readiness and execution
- Recommend and establish best practices — evaluate current processes, identify gaps, and propose improvements for failover patterns, automation standards, and operational runbooks
- Document everything — create clear, comprehensive technical documentation, architecture diagrams, runbooks, and onboarding guides that enable team scalability and knowledge transfer
What You’ll Bring:
Must Have:
- 3+ years of hands-on software engineering experience across multiple technologies, languages, and system layers
- Strong first-principles understanding of distributed systems, fault tolerance, and failure modes — not just framework familiarity, but genuine depth
- GitLab CI/CD expertise
- Python/Bash scripting; strong YAML skills
- AWS and Kubernetes experience
- Familiarity with secret management (CyberArk, Vault)
- Accountability mindset — you own problems end-to-end, you don’t wait to be unblocked, and you escalate with context and a proposed path forward
- Strong documentation skills — ability to translate complex systems into clear, actionable guides
- Self-driven – You take ownership, find answers yourself, and don’t wait to be told what to do next
- First-principles thinker – When something breaks in an unfamiliar system, you reason from fundamentals. You don’t just apply patterns — you understand why the pattern exists
- Fast learner – You ramp quickly on new tools and ecosystems with minimal guidance
- Independent operator – You can engage app teams directly, extract what you need, and fill gaps through your own research
- Fast, iterative, and comfortable with ambiguity – You ship something workable quickly, learn from it, and improve. You don’t need the perfect spec to start
- Relationship builder – You build trust with stakeholders and drive conversations forward
- Communicator with standards – You write clearly, document proactively, and treat your teammates’ time as valuable
- Continuous improver – You don’t just execute — you identify what’s suboptimal and propose better ways of doing things, then follow through
- Knowledge sharer – You believe documentation is a first-class deliverable, not an afterthought
Nice-to-Have:
- Ansible and failover experience
- Telecom or large enterprise environment experience
- Experience with observability platforms (Splunk, Grafana, Prometheus, OpenTelemetry)
- Experience using AI coding tools (Claude, GitHub Copilot, ChatGPT) as a genuine productivity multiplier — not just having tried them, but having integrated them into your workflow