How We Ensure – and Track – Andesite’s AI Accuracy

By Jesse Bowman, Staff Data Scientist at Andesite

In the high stakes world of SecOps, the accuracy of AI-enabled products is key. At Andesite, we use an evaluation framework to constantly assess our AI accuracy, score the outputs, and track quality over time. By running evaluation tests regularly, and rechecking with every code update and release, we ensure product quality and strive to catch regressions before they impact customers.

Evaluations, or evals, as they are commonly referred to, are structured test protocols that assess the performance of AI systems. Many AI companies run evals. At Andesite, our approach is to emphasize the work on the data – we’re constantly, persistently, and iteratively working on the data to improve our eval. We choose that path to take evals further and continue to improve them because, as Anthropic has noted, the most effective assessment strategies “combine techniques to match the complexity of the systems they measure.” It doesn’t get more complex or more high stakes than AI applied to SecOps.

Our Approach to Assessing Accuracy

Our eval strategy has evolved over time to deliver the kind of quality, seriousness, and rigor that drives confidence in mission critical environments. We use industry standard evaluation frameworks and focus on three key measures: correctness, relevancy, and faithfulness. While correctness is our North Star, we look at the combined score across these three measures. 

We’ve built a simlab where we run attacks based on the latest threats. That means that we test our systems using synthetic security data running in a simulated company environment with its own users, network, automations, tooling, and endpoint detection and response (EDR). Our datasets are engineered to ensure the necessary robustness to effectively measure ground truth. 

Our goal is to ensure that our product is built and ready to handle the chaos and messiness that our customers experience in the real world, rather than measuring performance against generic industry benchmarks. We want a realistic  assessment of our product performance to ensure our users have the confidence they need to trust our product in their SOC.

Our AI accuracy tests are designed specifically for our product. Because Andesite is model agnostic, we make sure every assessment performs reliably and accurately across a variety of models, SecOps and cybersecurity knowledge bases. 

Historical results are archived, so any new release can be compared against its baseline. And we constantly challenge the models with increasingly tougher data to drive continuous improvement. In essence, our test catalog grows as our product evolves.

Three Key Measures of Accuracy

Every Andesite eval response is scored on correctness, relevancy, and faithfulness. Why these? They allow us to answer three essential questions:

  • Is the information correct? 
  • Does it answer the question asked?
  • Is it grounded in real data?


Each assessment is conducted across a variety of connectors and product scenarios, including different security platforms, log sources, document types, knowledge bases, and other inputs.

In assessing Correctness, we are looking at whether the AI gets the facts right— the right entity, right count, right time window. A wrong result during triage can mean a missed threat or wasted effort. For example, if an analyst asks how many failed logins occurred for a specific user and the expected answer is 17, if the AI says 17, it scores high in correctness. Alternatively, if the AI says 12 or confuses two similarly-named accounts, the correctness score would be low.

When we look at Relevancy, we are determining if the AI actually answered what was asked, or if it went off on a tangent. This is important because analysts are time-constrained. Irrelevant responses waste cycles and erode trust. If the AI comes back with something you didn’t ask about, the relevancy score is lower. Consider a situation where an analyst asks “What source IPs are associated with this alert?” A relevant response lists the IPs. An irrelevant one might explain what the alert category means without ever naming the source IPs.

By measuring Faithfulness, we determine whether every claim is grounded in retrieved data, or if the AI filled in the blanks. This is the measure of hallucinations and how we ensure that our AI is not acting as a black box. Everything must be substantiated because hallucinated security data, such as a fabricated indicator or a made-up timeline, is actively dangerous. As an example, when the AI pulls back five log events from a data source, a faithful response summarizes what those events contain. An unfaithful one might invent a sixth event or attribute actions to a user not in the data.

A More Robust Full Picture

When we look at our assessments, high scores across all three metrics mean the analyst can act on the response with confidence. While our primary focus is correctness , by tracking the composite scores over time, we can see whether the quality of our results is holding, improving, or slipping across releases and model or product changes. We actively use these measures to improve the product and address specific customer feedback and needs. 

When one of our customers had an issue with long running agent thought loops and delays to producing an answer, we responded with improvements to make the agent more directed and used the eval framework to validate that agent behaviors were tightened and, most importantly, correct. We also regularly use eval to measure the impact of performance trade-offs, for example time to completion vs. model reasoning. Eval allows us to identify specific impacts and the relative value of the improvement. And for customers who need to run in isolated environments using smaller open weight models, eval allows us to assess their model performance against flagship models and identify ways to bridge any gaps. In this way, we can up-level open weight model performance while maintaining the security of a closed system. 

In all the ways we use eval to evaluate the product, our goal is to build assurance in the quality of Andesite results based on data-driven metrics. By looking at these variables together, we can see trade-offs between them that can impact model accuracy. This is key because SOC teams don’t do linear work. Workflows must be flexible and we have to ensure that the humans at the helm using our product have the flexibility, control, and accuracy they need for a good user experience. Directly measuring  the accuracy of our product over time using a rigorous methodology ensures we can stand confidently behind the results our models deliver for clients. 

Our Commitment to Accuracy

When our customers use our product, their security teams are trusting AI with real incidents. Wrong, irrelevant, or hallucinated answers have real consequences. Systematic evaluation is an essential way we earn and maintain trust in the Andesite product. 

Because we treat AI quality as an engineering discipline, ensuring it is measured, reproducible, and  regression-tested, when customers ask “How do you know your AI is accurate?”, we have a data-backed answer. In a domain where reliability is non-negotiable, our product must deliver the highest possible accuracy.  By applying hard-won, hard-learned engineering discipline to our AI evals, we ensure our customers can be confident that their AI-enabled SOC is accurate and effective. 

About Jesse Bowman
Jesse Bowman is a data scientist at Andesite, where he focuses on evaluation of the AI capabilities of the product and improving its core intelligence. Before Andesite, Jesse spent over seven years at Microsoft where he helped ship products including Xbox Kinect, HoloLens, and Windows. As a data scientist at Microsoft, he delivered telemetry pipelines with ML enrichment to aid with user and software understanding. His past experience spans both Product and Engineering roles. Jesse holds a B.S. in Computer Engineering and a B.M. in Music Technology from Northwestern University.