How Moveworks selects LLMs (and SLMs!)
An autonomous and intelligent employee support system needs to solve a variety of problems to deliver impact. The Moveworks AI Assistant brings together a collection of diverse large and small machine learning models - each carefully evaluated and selected for the specific problem it's meant to solve. Some examples of the models we use for these tasks are:
- Foundation Models - GPT-4o / GPT-4o Mini
  - Reasoning and action planning based on the user's utterance
  - Executing “plugins” to fulfill the user's query
  - Summarizing results for the user
- Task-specific discriminative models
  - FT-LangDetection - low-latency detection of the user's language
  - FT-Roberta - handoff classification and entity recognition
- Task-specific generative models
  - FT-Flan-T5 - toxicity judgement for both user input and bot output
  - FT-Flan-T5 - file search relevance judgement
  - FT-M2M100 - translation of resources (KBs, forms, etc.)
And that's just the tip of the iceberg.
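To make the division of labor concrete, here is a minimal sketch of how small task-specific models and a foundation model could be composed in a single request flow. The class, the function names, and the routing logic are illustrative stand-ins under our assumptions, not Moveworks' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AssistantPipeline:
    """Illustrative composition of task-specific models around a foundation model."""
    detect_language: Callable[[str], str]     # e.g. FT-LangDetection
    judge_toxicity: Callable[[str], bool]     # e.g. FT-Flan-T5 toxicity judgement
    classify_handoff: Callable[[str], bool]   # e.g. FT-Roberta handoff classification
    plan_and_execute: Callable[[str], str]    # e.g. GPT-4o reasoning + plugin execution
    summarize: Callable[[str], str]           # e.g. GPT-4o / GPT-4o Mini summarization

    def handle(self, utterance: str) -> str:
        # Small, low-latency models gate the request before any foundation model call.
        if self.judge_toxicity(utterance):
            return "Sorry, I can't help with that."
        if self.classify_handoff(utterance):
            return "Connecting you with a live agent."
        language = self.detect_language(utterance)
        result = self.plan_and_execute(utterance)
        summary = self.summarize(result)
        # A translation model (e.g. FT-M2M100) would localize the reply here.
        return f"[{language}] {summary}"


# Toy stand-ins so the sketch runs end to end.
pipeline = AssistantPipeline(
    detect_language=lambda text: "en",
    judge_toxicity=lambda text: False,
    classify_handoff=lambda text: "agent" in text.lower(),
    plan_and_execute=lambda text: "Password reset link sent to your inbox.",
    summarize=lambda text: text,
)
print(pipeline.handle("I forgot my password"))
```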
Foundation models vs. task-specific models
Foundation Models have revolutionized the field of AI over the last few years. They are
- The largest models in the world by training data and parameter count
- Generalized to perform any language task, from coding and reasoning to summarization
As a result, they are extremely capable, but they come with higher latency and cost, and they are not always the most reliable choice for narrow tasks that demand high precision.
Task-specific models, on the other hand, are:
- Smaller models that are cheaper, faster, and more controllable
- Trained to perform specific language tasks like entity recognition, coding, etc.
Task-specific models cover only a subset of what foundation models can do, but often with higher quality, lower latency, or lower cost. Therefore, you can see why you might want to pick the most appropriate model for the task at hand.
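One way to frame that choice is as a constrained optimization: among the models that clear a quality bar for a given task, pick the cheapest (then fastest) one. The sketch below captures that idea; the `ModelProfile` fields and all the numbers are hypothetical placeholders, not measured figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    quality: float      # task-specific eval score in [0, 1]
    latency_ms: float   # p95 latency
    cost_per_1k: float  # dollars per 1k requests


def pick_model(candidates: list[ModelProfile], min_quality: float) -> ModelProfile:
    """Pick the cheapest (then fastest) model that clears the quality bar for a task."""
    eligible = [m for m in candidates if m.quality >= min_quality]
    if not eligible:
        raise ValueError("No candidate meets the quality bar for this task.")
    return min(eligible, key=lambda m: (m.cost_per_1k, m.latency_ms))


# Hypothetical numbers: for a narrow task like language detection, a small
# fine-tuned model clears the bar at a fraction of the latency and cost.
candidates = [
    ModelProfile("gpt-4o", quality=0.99, latency_ms=900, cost_per_1k=5.00),
    ModelProfile("ft-langdetection", quality=0.98, latency_ms=15, cost_per_1k=0.02),
]
print(pick_model(candidates, min_quality=0.97).name)  # -> ft-langdetection
```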
Evaluating and selecting models
Moveworks relies on a robust and rigorous evaluation process for models to drive continuous innovation.
- Leverage comprehensive Evaluation Datasets
  - Curated to cover a broad range of AI Assistant use cases.
  - Ensures the model's effectiveness across expected scenarios and identifies areas for improvement.
- Diverse types of evaluations that test performance from multiple angles to make sure it is constantly improving
  - End-to-End Evaluation: Tests the overall experience from start to finish.
  - Component Evaluation: Focuses on specific parts like plugin filtering, selection, and argument filling (a minimal harness is sketched after this list).
  - Human Annotator Evaluation: Involves human annotators reviewing outputs and interactions, providing nuanced insights and greater confidence in evaluation results.
- Prompt Tuning for LLM optimization
  - Employed to address any degradations observed in evaluation and to keep improving the AI Assistant experience.
  - Our infrastructure allows for extensive prompt-tuning experiments to refine and enhance interaction quality.
    - In-bot Testing with New Prompts: Real-time feedback on adjusted prompts.
    - Large-Volume Evaluation with Comprehensive Datasets: Robust validation across a wide array of scenarios, affirming improvements and pinpointing further optimization opportunities.
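As a rough illustration of what a component evaluation combined with prompt-variant comparison could look like, here is a minimal harness. The dataset contents, the `select_plugin` signature, and the toy implementation are assumptions made for the sketch, not Moveworks' actual tooling.

```python
from statistics import mean
from typing import Callable

EvalExample = tuple[str, str]  # (user utterance, expected plugin)


def evaluate_plugin_selection(
    select_plugin: Callable[[str, str], str],  # (prompt_template, utterance) -> plugin name
    prompt_template: str,
    dataset: list[EvalExample],
) -> float:
    """Plugin-selection accuracy for one prompt variant over an evaluation dataset."""
    return mean(select_plugin(prompt_template, utterance) == expected
                for utterance, expected in dataset)


def pick_best_prompt(select_plugin: Callable[[str, str], str],
                     variants: list[str],
                     dataset: list[EvalExample]) -> str:
    """Score each prompt variant on the full dataset and keep the best one."""
    scores = {v: evaluate_plugin_selection(select_plugin, v, dataset) for v in variants}
    return max(scores, key=scores.get)


# Toy stand-in so the sketch runs: the prompt is ignored and selection is a keyword match.
def toy_select_plugin(prompt_template: str, utterance: str) -> str:
    return "reset_password" if "password" in utterance.lower() else "search_kb"


dataset = [
    ("I forgot my password", "reset_password"),
    ("How do I request a new laptop?", "search_kb"),
]
print(pick_best_prompt(toy_select_plugin, ["variant_a", "variant_b"], dataset))
```

A variant that wins here would then go through in-bot testing for real-time feedback before rollout, mirroring the two evaluation stages listed above.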