A Tester's Experience of Testing a Gen-AI Chatbot

Large Language Models (LLMs) are revolutionizing the way chatbots interact with users. Many enterprises are eager to capitalize on this AI gold rush. However, ensuring these AI-powered assistants deliver natural and accurate conversations requires a robust testing strategy. This article delves into an approach for testing a Generative AI (Gen-AI) chatbot designed for the financial services industry, exploring its challenges and limitations.

Deep Domain Dive: Understanding the Nuances

The foundation of any testing effort is a thorough understanding of the application's domain: the deeper your knowledge, the better equipped you are to identify edge cases, and that matters even more when testing Gen-AI chatbots. In this case, the bot catered to the financial sector, a complex landscape of intricate policies, diverse financial instruments, and demanding regulations. This necessitated a deep understanding of the data used by the bot's Retrieval-Augmented Generation (RAG) system, which included:

  • Financial Institution Policies: The nuances specific to each institution's offerings, investment plans, and legal frameworks were meticulously analyzed. This analysis allowed us to understand how different policies are interrelated and how the context for answers would evolve as more policies were incorporated into the vector dataset.

  • Data Intricacies: The composition of the data, encompassing policies, offerings, and legal aspects, was thoroughly examined to identify potential biases or gaps. This examination helped the team prepare better edge cases to cover the complexity of the data.

By understanding the industry and the data, we were able to prepare better edge-case scenarios that could uncover biases and gaps within the system.
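To make the idea of a vector dataset concrete, here is a minimal sketch of how policy documents might be chunked and embedded. The embedding model, chunk size, and file names are illustrative assumptions, not the team's actual pipeline.

```python
# Illustrative sketch: chunking policy documents and building a small vector dataset.
# The embedding model, chunk size, and file names are placeholders for demonstration only.
from pathlib import Path
from sentence_transformers import SentenceTransformer

def chunk_policy(text: str, chunk_size: int = 500) -> list[str]:
    """Split a policy document into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

model = SentenceTransformer("all-MiniLM-L6-v2")

policies = {
    "retirement_plan_policy": Path("retirement_plan_policy.txt").read_text(),
    "investment_offerings": Path("investment_offerings.txt").read_text(),
}

# Each entry keeps the source policy name so testers can trace which document a vector came from.
vector_dataset = []
for name, text in policies.items():
    for chunk in chunk_policy(text):
        vector_dataset.append({
            "policy": name,
            "text": chunk,
            "embedding": model.encode(chunk),
        })

print(f"Built {len(vector_dataset)} vectors from {len(policies)} policies")
```

Inspecting a dataset like this, policy by policy, is one way to spot the biases and gaps mentioned above before they surface in the bot's answers.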

Competitive Benchmarking: Learning from the Best

A crucial step involved analyzing existing financial chatbots. By dissecting competitor strengths and weaknesses, the testing team was able to:

  • Identify Best Practices: We learned from successful implementations and incorporated valuable insights into our testing strategy.

  • Uncover Potential Issues: Gaps or shortcomings in competitor chatbots highlighted areas that required extra focus during testing.

Crafting Granular Test Cases: A Multi-Layered Approach

Because the application under test was built on LLMs, neither 100% accuracy nor 100% test coverage can be guaranteed. However, building a robust test suite was paramount to ensuring the best possible quality. Here's how the QA team meticulously crafted test cases:

  • Deep Dives vs. Breadth: Two distinct sets of test cases were designed. The first set delved intricately into specific data points within a single data set. The second set focused on broader coverage across the entire data corpus. For instance, a deep dive might involve thoroughly testing how the bot handles complex retirement plan calculations for a particular financial institution's policies. Conversely, a broader test might assess the bot's overall comprehension of financial terminology across all policies from multiple institutions.

  • Simple & Clear Expected Answers: The team created clearly articulated expected responses for each test case, ensuring the bot conveyed key information in an easily understandable manner.

  • Mistral's Magic Touch: Since the chatbot conversed in natural language, the team employed a locally deployed Mistral model to rephrase the expected answers before testing. This ensured the expected answers aligned with the bot's natural-language style of responding (a sketch of this step follows the list).
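As an illustration of the test-case structure and the rephrasing step, here is a minimal sketch. It assumes Mistral is served through a local Ollama instance exposing its /api/generate endpoint; the endpoint, model name, prompt, and test cases are illustrative placeholders, not the team's actual setup.

```python
# Minimal sketch: test cases with simple expected answers, rephrased by a locally
# served Mistral model so they better match the bot's conversational style.
# Assumes Mistral runs behind a local Ollama instance (http://localhost:11434);
# endpoint, model name, and test cases are illustrative placeholders.
import requests

test_cases = [
    {
        "question": "What is the minimum contribution for the retirement plan?",
        "expected": "The minimum annual contribution for the retirement plan is 1,000 USD.",
        "scope": "deep-dive",   # probes one policy in detail
    },
    {
        "question": "What does 'vesting period' mean?",
        "expected": "The vesting period is the time you must wait before employer contributions fully belong to you.",
        "scope": "breadth",     # checks terminology across the whole corpus
    },
]

def rephrase(expected: str) -> str:
    """Ask the local Mistral model to restate an expected answer in natural language."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",
            "prompt": "Rephrase the following answer in clear, conversational language, "
                      f"keeping every fact unchanged:\n{expected}",
            "stream": False,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["response"].strip()

for case in test_cases:
    case["expected_rephrased"] = rephrase(case["expected"])
```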

Iterative Testing: From Manual to Automated Efficiency

The testing process employed a two-pronged approach:

  • Manual Scrutiny - The First Line of Defense: For the initial training dataset, manual testing was crucial. This allowed for close scrutiny of the bot's responses and identification of any immediate issues.

  • Automating Efficiency - Scaling Up: For subsequent iterations of training and fine-tuning, automation played a pivotal role. It streamlined the testing process by:

    1. Automating Input & Output: Repetitive tasks such as feeding questions to the bot and capturing its responses were automated through REST API integration (a simplified sketch follows this list).

    2. Fine-Tuning Text Comparison - The Tricky Bit: Comparing the bot's output with expected answers required a sophisticated approach. The team experimented with various LLM models and text-matching algorithms and ultimately found success with the GloVe-6B model and a cosine similarity method (see the second sketch below). For reference, you can check out a proof of concept (POC) here.
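The following is a simplified sketch of the input/output automation from step 1. The chatbot endpoint, payload shape, and response field are assumptions; a real integration would follow the bot's actual REST contract. It reuses the test-case structure from the earlier sketch.

```python
# Sketch of step 1: feeding questions to the chatbot over REST and capturing responses.
# The endpoint URL and JSON schema are hypothetical placeholders.
import csv
import requests

CHATBOT_URL = "https://chatbot.example.com/api/v1/chat"  # placeholder endpoint

def ask_bot(question: str) -> str:
    """Send a question to the chatbot's REST API and return its answer text."""
    resp = requests.post(CHATBOT_URL, json={"message": question}, timeout=30)
    resp.raise_for_status()
    return resp.json()["answer"]  # assumed response field

def run_suite(test_cases: list[dict], out_path: str = "results.csv") -> None:
    """Run every test case against the bot and record expected vs. actual answers."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["question", "expected", "actual"])
        writer.writeheader()
        for case in test_cases:
            writer.writerow({
                "question": case["question"],
                "expected": case.get("expected_rephrased", case["expected"]),
                "actual": ask_bot(case["question"]),
            })
```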
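For step 2, the sketch below shows one way to score an actual answer against an expected answer with GloVe vectors and cosine similarity: it loads 100-dimensional GloVe-6B vectors via gensim's downloader and averages word vectors per sentence. The model variant and the 0.8 threshold are assumptions, and the linked POC may differ in its details.

```python
# Sketch of step 2: comparing the bot's answer with the expected answer using
# GloVe-6B embeddings and cosine similarity. The gensim model name and the
# pass/fail threshold are assumptions for illustration.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-d vectors trained on the 6B-token corpus

def sentence_vector(text: str) -> np.ndarray:
    """Average the GloVe vectors of all in-vocabulary words in the text."""
    words = [w for w in text.lower().split() if w in glove]
    if not words:
        return np.zeros(glove.vector_size)
    return np.mean([glove[w] for w in words], axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def answers_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Treat the answers as matching when their sentence vectors are close enough."""
    return cosine_similarity(sentence_vector(expected), sentence_vector(actual)) >= threshold

print(answers_match(
    "The minimum annual contribution for the retirement plan is 1,000 USD.",
    "You need to contribute at least 1,000 USD per year to the retirement plan.",
))
```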

However, this process wasn't without its challenges. There were occasional hiccups, and the automation wasn't foolproof. Despite these hurdles, automation significantly accelerated the testing process.
Manual scrutiny and an exploratory testing approach ensured the system prepared effective vectors from the policy documents, helping it reach an accuracy of over 90%.

Conclusion: Building a Conversational Powerhouse

Although Gen-AI has its own challenges with bias and hallucination, a chatbot built on a RAG system can achieve significantly higher accuracy because retrieval narrows the context supplied to the model.
This testing approach played a vital role in ensuring the Gen-AI chatbot delivered a natural, informative, and accurate user experience. By combining domain expertise, competitive analysis, meticulous test case design, and a strategic blend of manual and automated testing, the team laid the groundwork for a successful chatbot implementation. This approach serves as a valuable blueprint for those venturing into the exciting realm of Gen-AI chatbots.
Next, we will discuss more advanced tools and techniques for testing a RAG system. Stay tuned and follow us here.