Siddharth Sabata

Machine Learning Engineer

I'm a master's student at Carnegie Mellon University developing ML solutions for understanding cancer.

Skills

Python, SQL, Go, R, Git, Slurm, Docker, Google Cloud Platform, PyTorch, Transformers, Accelerate, scikit-learn, pandas, NumPy, matplotlib, seaborn

Projects

CodeDeck

Docker, Next.js

Role: Developer

Organization: Personal Project

Timeline: May 2025

I built CodeDeck to make LeetCode practice smarter—like Anki cards, but for coding problems. It tracks my attempts, lets me log insights, and keeps everything versioned with Git. Built with Next.js, Prisma, and Tailwind, it's my personal interview prep companion.

GitHub

Mase-phi HPC

Slurm, NumPy, Gurobi

Role: Research Assistant

Organization: Schwartz Lab @ CMU

Timeline: September 2024 — Present

I took an early-stage prototype from my lab and turned it into a fully automated, modular pipeline for selecting genetic markers from multi-region cancer sequencing data. I set up a robust HPC workflow with Slurm, standardized and refactored the codebase, and made the system flexible for future needs. It was a great experience in transforming innovative research ideas into reliable, production-ready software that helps advance personalized cancer monitoring.

GitHub

Population-Specific Multiomics Graph Analysis of ACE Protein Expression

PyTorch Geometric

Role: ML Engineer

Organization: ML & AI Approaches to Multimodal Problems in Computational Biology Hackathon

Timeline: March 2025

I built a graph-based multiomics pipeline to pinpoint genetic variants that shape ACE protein expression in different populations. Using PyTorch Geometric, I constructed GPU-friendly graphs and mapped out regulatory relationships. The project won the "Most Innovative Project" award.

GitHub

Medical Reasoning with Distilled Models

Slurm, Transformers, Accelerate

Role: ML Engineer

Organization: Introduction to Deep Learning (CMU 11-785)

Timeline: January 2025 — May 2025

We worked through a range of technical challenges: fine-tuning an 8B-parameter LLM (DeepSeek-R1-Distill-Llama-8B), running large jobs on an HPC cluster with Slurm, and building an automated evaluation pipeline from scratch. The coolest part? Our model showed notable pass@k gains on medical benchmarks, suggesting it could find the right answer within several attempts even when it struggled to rank that answer first. The project was a lot of fun and gave me a front-row seat to how language models learn from both RL and fine-tuning on medical reasoning tasks.
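
For readers unfamiliar with pass@k: it is the probability that at least one of k sampled answers is correct. The snippet below is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The function name `pass_at_k` and the example numbers are illustrative assumptions, not taken from the project's evaluation pipeline.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct.

    Hypothetical helper for illustration; uses the standard
    combinatorial estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up numbers: 10 samples per question, 2 of them correct.
    for k in (1, 5):
        print(f"pass@{k} = {pass_at_k(10, 2, k):.3f}")
```

A model with a low pass@1 but a much higher pass@5 is producing the right answer somewhere among its samples without reliably putting it first, which is the pattern described above.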

GitHub Paper

Education

Carnegie Mellon University

Master of Science, Quantitative Biology and Bioinformatics

2025

University of California, Santa Barbara

Bachelor of Science, Statistics and Data Science

2024