Case Study · TalkingPoints
AI/ML Data Operations
Infrastructure to support, evaluate, and control AI features at scale.
The Context
As TalkingPoints added AI features—message classification, translation quality estimation, AI-powered intervention cards—I built the infrastructure to support and evaluate them.
Message Classification
The need was obvious: teams wanted to understand what kinds of messages were flowing through the platform. But there was no clear path to get there. I proposed the approach, designed scalable MVPs, mapped out the milestones, and stewarded the work to production over several months. It's now running and expanding.
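For flavor, here's a minimal sketch of what an LLM-based message classifier can look like, assuming a Snowflake Cortex COMPLETE call over a capped sample. The table, column, label, and model names are invented for illustration, not the production pipeline.

```python
# Illustrative sketch only: classify a bounded batch of messages into a fixed
# label set with Snowflake Cortex COMPLETE. Table, column, label, and model
# names are assumptions, not the production pipeline.
LABELS = ["attendance", "academics", "behavior", "logistics", "other"]

PROMPT_PREFIX = (
    "Classify the following school-family message into exactly one of these "
    "categories: " + ", ".join(LABELS) + ". Respond with only the category "
    "name. Message: "
)

def classify_batch(conn, limit: int = 1000):
    """Classify a capped sample of messages.

    `conn` is an open snowflake-connector-python connection.
    """
    sql = f"""
        SELECT
            message_id,
            SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', CONCAT(%s, message_text)) AS raw_label
        FROM messages_sample
        LIMIT {int(limit)}
    """
    cur = conn.cursor()
    cur.execute(sql, (PROMPT_PREFIX,))
    results = []
    for message_id, raw_label in cur.fetchall():
        label = (raw_label or "").strip().lower()
        # Anything the model returns outside the label set falls back to 'other'.
        results.append((message_id, label if label in LABELS else "other"))
    return results
```

Constraining the model to a fixed label set and normalizing stray outputs keeps downstream reporting clean even when the LLM gets creative.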
Machine Translation Quality Estimation
Built a Streamlit app so the translation team could evaluate machine translation (MT) output quality without waiting on engineering. They can run LLM-as-a-judge automated evaluation, edit the judge prompt, score on different scales (binary, three-point, five-point, Likert), evaluate against a subset of the data, and generate reports automatically.
All self-service. Non-technical product people evaluating their own AI tools without depending on the technical team.
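A minimal sketch of that self-service shape, assuming a Streamlit front end over a stubbed judge call. The column names, scales, and the run_llm_judge helper are illustrative stand-ins, not the actual app.

```python
# Minimal sketch of the self-service evaluation flow, not the production app.
# run_llm_judge() is a hypothetical stand-in for whatever LLM backend scores
# a (source, translation) pair against the chosen prompt and scale.
import pandas as pd
import streamlit as st

SCALES = {
    "binary": [0, 1],
    "three-point": [1, 2, 3],
    "five-point": [1, 2, 3, 4, 5],
    "Likert": [1, 2, 3, 4, 5, 6, 7],
}

def run_llm_judge(prompt: str, source: str, translation: str, scale: list[int]) -> int:
    """Placeholder: call an LLM judge and parse its score into the scale."""
    raise NotImplementedError("wire this to your LLM backend")

st.title("MT Quality Evaluation")

judge_prompt = st.text_area("Evaluation prompt", "Rate the translation quality...")
scale_name = st.selectbox("Rating scale", list(SCALES.keys()))
sample_size = st.slider("Rows to evaluate", 10, 500, 50)
uploaded = st.file_uploader("Translations CSV (source, translation columns)", type="csv")

if uploaded is not None and st.button("Run evaluation"):
    df = pd.read_csv(uploaded).head(sample_size)  # evaluate a bounded subset
    scale = SCALES[scale_name]
    df["score"] = [
        run_llm_judge(judge_prompt, row.source, row.translation, scale)
        for row in df.itertuples()
    ]
    st.dataframe(df)
    st.metric("Mean score", round(df["score"].mean(), 2))
    # One-click report the translation team can share without engineering help.
    st.download_button("Download report", df.to_csv(index=False), "mt_eval_report.csv")
```

The point of the design is that every knob a translator might reasonably want (prompt, scale, sample size) lives in the UI, not in code.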
Cost Controls and Safeguards
AI features have a cost per call. I added safeguards to the models that call Snowflake Cortex so they can't accidentally run over large datasets and blow through the budget. Built evaluation frameworks that make accuracy vs. cost tradeoffs visible, so stakeholders can decide which tradeoff they want instead of that decision happening invisibly in the engineering layer.
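One shape such a safeguard can take: count the input first, estimate the spend, and refuse to run past a ceiling. A sketch with placeholder thresholds and a made-up per-call cost, not the numbers used in production.

```python
# Illustrative guard: count rows before letting a Cortex-backed model run over
# them, and refuse past a budget ceiling. The threshold and the rough per-call
# cost below are placeholder numbers, not TalkingPoints figures.
MAX_ROWS = 50_000
EST_COST_PER_CALL_USD = 0.002  # placeholder estimate

def guarded_cortex_run(conn, source_table: str, run_fn):
    """Run `run_fn(conn, source_table)` only if the input size is within budget."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {source_table}")
    n_rows = cur.fetchone()[0]

    est_cost = n_rows * EST_COST_PER_CALL_USD
    if n_rows > MAX_ROWS:
        raise RuntimeError(
            f"{source_table} has {n_rows:,} rows (limit {MAX_ROWS:,}); "
            f"estimated cost ~${est_cost:,.2f}. Sample the data or get approval first."
        )
    return run_fn(conn, source_table)
```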
The Philosophy
Every AI feature has an accuracy level, a latency, and a cost per unit. Build dashboards that show all three. Empower non-technical teams to evaluate AI tools themselves. The goal isn't to be the bottleneck—it's to build systems that work without you.
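A toy example of putting those three numbers side by side, with invented figures, just to show the shape of the comparison stakeholders see.

```python
# Toy example: put accuracy, latency, and cost per unit side by side so the
# tradeoff is a product decision, not an invisible engineering default.
# All numbers below are invented for illustration.
import pandas as pd

candidates = pd.DataFrame(
    [
        {"config": "small model",   "accuracy": 0.86, "p50_latency_s": 0.4, "cost_per_1k_msgs_usd": 0.20},
        {"config": "large model",   "accuracy": 0.92, "p50_latency_s": 1.1, "cost_per_1k_msgs_usd": 1.50},
        {"config": "rules + small", "accuracy": 0.89, "p50_latency_s": 0.5, "cost_per_1k_msgs_usd": 0.35},
    ]
)

# A sorted view makes the tradeoff explicit for non-technical stakeholders.
print(candidates.sort_values("accuracy", ascending=False).to_string(index=False))
```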
Impact
- Message classification in production after months of stewarding
- Self-service MT evaluation for translation team
- Cost safeguards preventing accidental large dataset processing
- Visible tradeoffs for stakeholder decision-making