What is model evaluation?
Evaluation lets you measure the performance of your model’s Machine Learning part.
As a result, evaluating a model is only possible for models expanded with custom training data.
Check out the Custom data and Machine Learning best practices page for evaluation best practices.
Evaluating a model
To evaluate a model, follow the steps below:
- Navigate to the NLU → NLU Models section.
- Select a model from the list of available models and click on it.
- Select the Evaluate tab.
- Select between the ML Results tab and the Full Results tab. The ML Results tab becomes active only after you train the model; the Full Results tab is always available.
- Upload the data you want to evaluate your model with. Supported file formats are TXT, CSV, and TSV (see the data preparation sketch after these steps).
- Click the Evaluate button.
- The evaluation starts and can take up to several minutes, depending on the size of the evaluation data.
- When the Full Results evaluation is finished, the high-level metric of the evaluation report is presented on the screen.
- When the ML Results evaluation is finished, the high-level metrics of the evaluation report are presented on the screen.
- To download the report, click the Download Results button. The report includes detailed statistics and is available as a ZIP archive containing three TSV files (see the report inspection sketch after these steps).
- To repeat the evaluation, click the Evaluate button again.
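If you want to inspect the downloaded report programmatically, the following is a minimal Python sketch. It assumes only what the steps above state: the report is a ZIP archive containing three TSV files. The archive name, the file names inside it, and their columns are not specified here, so the sketch simply loads every TSV it finds using the standard zipfile module and pandas.

```python
import zipfile

import pandas as pd


def load_report(zip_path: str) -> dict[str, pd.DataFrame]:
    """Load every TSV file found in the report archive into a DataFrame."""
    tables = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".tsv"):
                with archive.open(name) as handle:
                    tables[name] = pd.read_csv(handle, sep="\t")
    return tables


if __name__ == "__main__":
    # "evaluation_results.zip" is a placeholder name for the downloaded report.
    for name, table in load_report("evaluation_results.zip").items():
        print(name, table.shape)
```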
When evaluating a model, avoid using the data that you used to train it. Make sure that your evaluation set contains data that is unseen by the Machine Learning part of your model.
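As a rough illustration of keeping evaluation data separate from training data, here is a minimal Python sketch. It assumes a simple two-column TSV layout (utterance, then intent); the exact column layout expected by the upload is not documented here, so adjust the output to match your data.

```python
import csv
import random

# Toy labeled data; in practice this would be your full set of labeled utterances.
labeled = [
    ("I want to reset my password", "reset_password"),
    ("How do I change my password?", "reset_password"),
    ("Where can I find my invoice?", "billing"),
    ("Show me last month's bill", "billing"),
]

random.seed(42)
random.shuffle(labeled)

# Hold out roughly 20% of the data for evaluation; train only on the rest.
split = int(len(labeled) * 0.8)
train_set, eval_set = labeled[:split], labeled[split:]

# Write the held-out utterances to a TSV file for the Evaluate tab upload.
with open("evaluation_set.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(eval_set)
```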
Evaluation Metrics
The evaluation report presents metrics to assess model performance, categorized by the selected results tab.
ML Evaluation Metrics
The ML Results tab displays detailed statistics on the model's machine learning performance:
- Accuracy: This is a per-model metric that represents the percentage of correct predictions made by the model out of the total number of predictions. It is calculated by dividing the number of correctly classified samples by the total number of samples in the dataset.
  Accuracy = Number of correct predictions / Total number of samples
- Precision: This is a per-intent metric that indicates the model's ability to avoid labeling a negative sample as positive. High precision signifies that utterances from other classes are not incorrectly classified as belonging to this class.
  Precision = TruePositives / (TruePositives + FalsePositives)
- Recall: This is a per-intent metric that measures the model's capacity to identify all positive samples. It can be viewed as class-specific accuracy. High recall indicates that the model correctly predicts utterances that belong to the class.
  Recall = TruePositives / (TruePositives + FalseNegatives)
- F1 score: Offers a combined evaluation of both Precision and Recall to provide a more comprehensive view of the model’s performance. The F1 score is particularly useful when there is an uneven class distribution.
  F1 score = 2 x (Precision x Recall) / (Precision + Recall)
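To make the relationship between these metrics concrete, here is a small Python sketch that applies the formulas above to toy data. It is not the platform's internal implementation; the intent names and labels are made up for illustration.

```python
# True vs. predicted intents for five toy utterances.
y_true = ["billing", "billing", "reset_password", "billing", "reset_password"]
y_pred = ["billing", "reset_password", "reset_password", "billing", "reset_password"]

# Per-model Accuracy: correct predictions / total predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_intent_metrics(intent: str) -> tuple[float, float, float]:
    """Precision, Recall, and F1 score for a single intent."""
    tp = sum(t == intent and p == intent for t, p in zip(y_true, y_pred))
    fp = sum(t != intent and p == intent for t, p in zip(y_true, y_pred))
    fn = sum(t == intent and p != intent for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(f"Accuracy = {accuracy:.2f}")
for intent in sorted(set(y_true)):
    p, r, f1 = per_intent_metrics(intent)
    print(f"{intent}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```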
Full Evaluation Metrics
The Full Results tab provides a high-level overview of the evaluation:
- Evaluation Percentage: This metric represents the overall Accuracy of the model. It is calculated by dividing the number of utterances assigned to the correct intent by the total number of utterances in the dataset.
  Accuracy = Number of correctly classified utterances / Total number of utterances
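For example, if 450 of the 500 utterances in the evaluation set are assigned to the correct intent, the Evaluation Percentage is 450 / 500 = 90%.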