Here is a summary of the common classification evaluation metrics: what they mean and how to use them.
Accuracy
Meaning:
Correct identifications / all examples
pros:
- easy to explain
cons:
- misleading on imbalanced data: always predicting the majority class can still score high
- cannot express uncertainty about an individual prediction
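As a quick illustration, here is a minimal sketch using scikit-learn; the arrays y_true and y_pred are hypothetical hard labels.

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels and hard predictions.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# Fraction of correct predictions over all examples.
print(accuracy_score(y_true, y_pred))  # 4 correct out of 6 -> ~0.67
```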
F1 score
Meaning:
the harmonic mean of precision and recall
pros:
- can be used for multi-class/multi-label problems by choosing the averaging method (see the sketch after this section)
- micro: compute the metric globally by counting the total true positives, false negatives, and false positives
- macro: compute F1 for each label, then take their unweighted mean
- weighted: like macro, but the mean is weighted by each label's support
cons:
- ignores true negatives, so in the binary case it depends on which class is treated as positive
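A minimal sketch of the averaging options, using scikit-learn's f1_score on hypothetical multi-class labels:

```python
from sklearn.metrics import f1_score

# Hypothetical multi-class ground truth and predictions.
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Each `average` value corresponds to one of the methods listed above.
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="weighted"))
```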
Area Under the Receiver Operating Characteristic Curve (ROC AUC)
Meaning:
The ROC (Receiver Operating Characteristic) curve shows how well a model can distinguish between the two classes across decision thresholds. ROC AUC therefore measures how well the predicted probabilities of the positive class are separated from those of the negative class.
pros:
- independent of the response rate
cons:
- based on the ranking of the predicted probabilities, not on their actual values
- does not let you interpret the predictions as calibrated probabilities
- especially problematic when the data is imbalanced (highly skewed)
- an increase in AUC does not necessarily reflect a better classifier; it can simply be a side effect of having many negative examples
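A minimal sketch with scikit-learn's roc_auc_score; y_score holds hypothetical predicted probabilities for the positive class (only their ranking matters):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

# AUC depends only on how the scores rank positives above negatives,
# not on the probability values themselves.
print(roc_auc_score(y_true, y_score))  # 0.75
```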
Brier Score
Meaning:
how close the predicted probability is to the actual outcome, measured as the mean squared difference between them. The lower, the better.
pros:
- evaluates the predicted probabilities directly (it is a proper scoring rule), so it rewards well-calibrated models
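A minimal sketch with scikit-learn's brier_score_loss on hypothetical probabilities:

```python
from sklearn.metrics import brier_score_loss

# Hypothetical binary outcomes and predicted probabilities of the positive class.
y_true = [0, 1, 1, 0]
y_prob = [0.10, 0.90, 0.80, 0.30]

# Mean squared difference between predicted probability and actual outcome.
print(brier_score_loss(y_true, y_prob))  # (0.01 + 0.01 + 0.04 + 0.09) / 4 = 0.0375
```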
Log Loss
Meaning:
the negative average of the log of the predicted probability assigned to the true class of each instance. The lower the better, but there is no absolute reference value; what counts as a good score depends on the problem.
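A minimal sketch with scikit-learn's log_loss; y_prob holds hypothetical per-class probabilities:

```python
from sklearn.metrics import log_loss

# Hypothetical binary labels and predicted probabilities [P(class 0), P(class 1)].
y_true = [0, 0, 1, 1]
y_prob = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.01, 0.99]]

# Negative average log-probability assigned to the true class.
print(log_loss(y_true, y_prob))  # ~0.1738
```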