As a software engineer, I’ve been a huge fan of test-driven development (TDD), but how do we test something like LLMs? LLMs are a far cry from classical ML models: a stark difference is that traditional ML models are usually good at a single task, while LLMs can do a wide variety of them.

Thinking about it as an engineer, it used to scare me that each API call returns a different response. Not a very reliable system, is it?

That’s why I believe everyone building with LLMs needs to adopt the practice of using evals while building their LLM-powered applications.

For any software application, backward compatibility is a vital aspect to consider. It ensures that newer versions of the software still work with what was built against older ones, maintaining functionality and preventing breakage during upgrades. When it comes to LLMs, however, ensuring backward compatibility can be a substantial challenge.

LLM software, by its very nature, is continually evolving, with newer and more advanced models being released regularly. While one might assume these improvements follow a linear upward trajectory in performance and usability, the truth is that this may not hold for specific use cases. Upgrading from an earlier model to a newer one can cause discrepancies in results, since each model has its own characteristics and behaviours.

Ensuring backward compatibility, while challenging, is not an impossible task. Below are some strategies that can help in achieving this:

  • Prompt Standardization: Developing a standardized syntax or nomenclature for prompts can go a long way towards ensuring backward compatibility. This means a set of universally recognized and accepted guidelines for writing prompts that apply to existing models as well as newer ones.
  • Thorough Testing: A robust testing process can help mitigate compatibility issues. Older prompts should be rigorously tested with newer models. Any prompt that does not achieve the desired or expected result should be adjusted or rewritten.
  • Establish Clear Documentation: Detailed documentation of all prompts is crucial. When a developer understands the original intentions and structures behind a prompt, they will be better equipped to make necessary adjustments with newer models.
  • Creating Model-Specific Code Paths: For critical applications where backward compatibility must be maintained, developers could consider running two versions of the model (the older and the newer one) and switching between them based on the situation. The decision on which model to use could depend on the prompt or the quality of the responses (see the sketch after this list).
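
To make that last strategy concrete, here is a minimal sketch of a model-specific code path with a runtime fallback, assuming an OpenAI-style chat client. The model names and the quality check are placeholders, not a definitive implementation:

```python
# Minimal sketch of a model-specific code path with a runtime fallback,
# assuming an OpenAI-style chat client. Model names and the quality check
# are placeholders.
from openai import OpenAI

client = OpenAI()

LEGACY_MODEL = "gpt-3.5-turbo"  # model the existing prompts were tuned for
NEWER_MODEL = "gpt-4o-mini"     # model being migrated to

def complete(prompt: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def looks_acceptable(answer: str) -> bool:
    # Placeholder quality check; in practice this would be an eval metric.
    return bool(answer and answer.strip())

def complete_with_fallback(prompt: str) -> str:
    """Try the newer model first; fall back to the legacy model if the
    answer fails the quality check."""
    answer = complete(prompt, NEWER_MODEL)
    return answer if looks_acceptable(answer) else complete(prompt, LEGACY_MODEL)
```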

How do you actually use evals?

How you use evals largely depends on what you’re building.

The best starting point while building is to think evals first: how will users prompt the LLM, and what are the odds of the LLM returning the correct text?
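
As a rough illustration of thinking evals first, you can write the expected behaviour down as tests before the application exists, TDD-style. The cases and the `ask_llm` helper below are hypothetical placeholders:

```python
# An evals-first sketch: expected behaviour is written as tests before the
# application code exists. `ask_llm` is a hypothetical placeholder for the
# eventual model call.
import pytest

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model call")

EVAL_CASES = [
    # (user prompt, substring the answer must contain)
    ("What is the capital of France?", "paris"),
    ("Convert 2 km to metres.", "2000"),
]

@pytest.mark.parametrize("prompt,expected", EVAL_CASES)
def test_llm_returns_expected_answer(prompt, expected):
    answer = ask_llm(prompt)
    assert expected in answer.lower()
```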

There is also the issue of ‘prompt compatibility’. A prompt that worked perfectly with an older model may not necessarily evoke the intended response from a newer one, compelling developers to re-engineer their prompts, which can be quite a time-consuming task.
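One cheap way to catch prompt-compatibility regressions before an upgrade is to run the same prompt suite against both the current model and the candidate model and flag any prompts that stop passing. A minimal sketch, assuming an OpenAI-style client and illustrative model names:

```python
# Sketch of a prompt-compatibility check: run the same prompts against the
# model you ship today and the candidate upgrade, and flag regressions.
# Assumes an OpenAI-style client; model names are illustrative.
from openai import OpenAI

client = OpenAI()
CURRENT_MODEL, CANDIDATE_MODEL = "gpt-3.5-turbo", "gpt-4o-mini"

PROMPTS = [
    # (prompt, substring the answer is expected to contain)
    ("Summarise in one word: 'The meeting moved to Friday.'", "friday"),
]

def passes(model: str, prompt: str, expected: str) -> bool:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return expected in response.choices[0].message.content.lower()

for prompt, expected in PROMPTS:
    old_ok = passes(CURRENT_MODEL, prompt, expected)
    new_ok = passes(CANDIDATE_MODEL, prompt, expected)
    if old_ok and not new_ok:
        print(f"Regression on upgrade: {prompt!r}")
```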

IMO, starting from evals and working backwards into application behaviour is a great way to replicate TDD while building LLM applications! Some things to consider:

  • User behaviour
  • Type of application
  • Complexity: are you building agents or chains?
  • Tools provided to the base LLM
  • Prompt engineering: setting roles is a good hack and will get you better eval scores (see the sketch after this list)!
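
On that last point, a role is typically set with a system message. A minimal sketch, assuming an OpenAI-style chat API; the model name and role text are illustrative:

```python
# Sketch of role-setting via a system message, assuming an OpenAI-style
# chat API. Model name and role text are illustrative.
from openai import OpenAI

client = OpenAI()

def answer_as_support_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a concise customer-support agent. "
                    "Answer only from the provided documentation."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```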

Metrics that matter

In GenAI, user stickiness seems to be the biggest issue: whether you’re a devtool or a consumer app, retention is a pain. Evals can partly address this by ensuring completion quality.

Some key KPIs:

  • faithfulness - the factual consistency of the answer with the retrieved context, given the question.
  • context_precision - how relevant the retrieved context is to the question; conveys the quality of the retrieval pipeline.
  • answer_relevancy - how relevant the answer is to the question.
  • context_recall - the retriever’s ability to fetch all the information needed to answer the question.
  • harmfulness - keeping harmful outputs out of completions.
  • PII - making sure sensitive user data is not leaked by mistake.
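
These metric names line up with the ragas evaluation library, so a minimal sketch of scoring a single RAG output could look like the one below. Note that column names and defaults differ between ragas versions, and a judge LLM (e.g. an OpenAI API key) is assumed to be configured:

```python
# Sketch of scoring a RAG output with the ragas library. Column names and
# defaults vary between ragas versions; a judge LLM is assumed to be set up.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # dict-like result with one score per metric
```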

How does this affect your product KPIs?

Evals help drive your user NPS up, and they make logging and manually reviewing every interaction less important, since there’s an established metric for output quality.

Think of using evals as having a QA engineer in the form of an SDK!

