KL Divergence, Entropy, and Cross Entropy: Example Use Cases and Equations

KL Divergence in Pictures and Examples

"Kullback–Leibler divergence is the difference between the Cross Entropy H for PQ and the true Entropy H for P."

D_KL(P || Q) = H(P, Q) − H(P)

[1]

"And this is what we use as a loss function while training Neural Networks. When we have an image classification problem, the training data and corresponding correct labels represent P, the true distribution. The NN predictions are our estimations Q."

Reference for the above (including the image): https://towardsdatascience.com/entropy-cross-entropy-kl-divergence-binary-cross-entropy-cb8f72e72e65
The above URL is a pretty great read.
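To make the quoted relationship concrete, here is a minimal Python sketch of my own (not from the article above), assuming NumPy. The distributions p and q are made-up numbers: p plays the role of a slightly smoothed "true" label distribution over 3 classes, and q plays the role of a network's predicted probabilities.

import numpy as np

def entropy(p):
    # H(P) = -sum_x P(x) log P(x)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log Q(x)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
    return np.sum(p * np.log(p / q))

p = np.array([0.8, 0.1, 0.1])  # "true" distribution (made-up, smoothed so H(P) > 0)
q = np.array([0.6, 0.3, 0.1])  # predicted probabilities (made-up)

print("H(P)          =", entropy(p))
print("H(P, Q)       =", cross_entropy(p, q))
print("KL(P || Q)    =", kl_divergence(p, q))
print("H(P,Q) - H(P) =", cross_entropy(p, q) - entropy(p))  # same value as KL(P || Q)

The last two lines print the same number, which is the point: since H(P) does not depend on Q, minimizing the cross entropy with respect to Q is the same as minimizing the KL divergence.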

****
The images, equations, and quoted text below are from the Internet, especially from [1].

"

What's the KL Divergence?

The Kullback-Leibler divergence (hereafter written as KL divergence) is a measure of how a probability distribution differs from another probability distribution.

The KL divergence measures the distance from the approximate distribution Q to the true distribution P."

KL Divergence from Q to P

[1]

Note: KL divergence is not a distance metric; in particular, it is not symmetric.
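For reference, the standard discrete-case definition (presumably what the image from [1] shows) is:

\[
D_{KL}(P \,\|\, Q) \;=\; \mathbb{E}_{x \sim P}\!\left[ \log \frac{P(x)}{Q(x)} \right] \;=\; \sum_{x} P(x)\,\log \frac{P(x)}{Q(x)}.
\]

The expectation is taken under P, not Q, which is exactly why swapping the two arguments gives a different value.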

Can be written as:

[1]

The first term is the cross entropy between P and Q; the second term is the entropy of P.
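Spelling that decomposition out (this is the standard identity; the equation in [1] itself is an image):

\[
D_{KL}(P \,\|\, Q)
= \sum_x P(x)\,\log\frac{P(x)}{Q(x)}
= \underbrace{-\sum_x P(x)\,\log Q(x)}_{H(P,\,Q)\;\text{(cross entropy)}}
\;-\;
\underbrace{\Big(-\sum_x P(x)\,\log P(x)\Big)}_{H(P)\;\text{(entropy of }P\text{)}}
= H(P, Q) - H(P).
\]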

Forward and Reverse KL

Forward KL, D_KL(P || Q): mean-seeking behaviour. Wherever P(.) has high probability, Q(.) will also have to have high probability.

So Q tends to average over P: in the figure below (from [1]), P is the distribution with two peaks, and the fitted Q spreads out around the overall mean, covering both peaks.

[1]

Reverse KL, D_KL(Q || P): mode-seeking behaviour.
Wherever Q(.) has high probability, P(.) will also have to have high probability, so Q tends to concentrate on a single mode (one peak) of P.

[1]
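A small numerical sketch of this behaviour (my own, not from [1]), assuming NumPy: P is a bimodal mixture of two Gaussians and Q is a single Gaussian of fixed width whose mean is chosen by grid search. Minimizing the forward KL puts Q's mean between the two peaks (mean-seeking); minimizing the reverse KL puts it on one of the peaks (mode-seeking).

import numpy as np

# Discretize the real line so the distributions become probability vectors on a grid.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return pdf / (pdf.sum() * dx)  # normalize so it integrates to 1 on the grid

def kl(p, q):
    # D_KL(p || q) approximated on the grid; eps avoids log(0)
    eps = 1e-12
    return np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx

# P: bimodal "true" distribution with peaks at -4 and +4 (made-up example)
p = 0.5 * gaussian(x, -4.0, 1.0) + 0.5 * gaussian(x, 4.0, 1.0)

# Q: a single Gaussian with fixed width; only its mean is searched over.
sigma_q = 1.5
means = np.linspace(-8, 8, 161)

forward = [kl(p, gaussian(x, m, sigma_q)) for m in means]  # KL(P || Q)
reverse = [kl(gaussian(x, m, sigma_q), p) for m in means]  # KL(Q || P)

print("Forward KL minimized at mean ~", means[int(np.argmin(forward))])  # ~ 0, between the peaks
print("Reverse KL minimized at mean ~", means[int(np.argmin(reverse))])  # ~ -4 or +4, one of the peaks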

References:
[1] https://dibyaghosh.com/blog/probability/kldivergence.html
[2] https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8

*** ***

"What is KL divergence used for?
Very often in Probability and Statistics we'll replace observed data or a complex distributions with a simpler, approximating distribution. KL Divergence helps us to measure just how much information we lose when we choose an approximation.May 10, 2017

www.countbayesie.com › blog › kullback-leibler-divergence-explained

 

 

Kullback-Leibler Divergence Explained — Count Bayesie

"

***. ***. ***
Note: Older short-notes from this site are posted on Medium: https://medium.com/@SayedAhmedCanada

*** . *** *** . *** . *** . ***

Sayed Ahmed

BSc. Eng. in Comp. Sc. & Eng. (BUET)
MSc. in Comp. Sc. (U of Manitoba, Canada)
MSc. in Data Science and Analytics (Ryerson University, Canada)
Linkedin: https://ca.linkedin.com/in/sayedjustetc

Blog: http://Bangla.SaLearningSchool.com, http://SitesTree.com
Online and Offline Training: http://Training.SitesTree.com (Also, can be free and low cost sometimes)

Facebook Group/Form to discuss (Q & A): https://www.facebook.com/banglasalearningschool

Our free or paid training events: https://www.facebook.com/justetcsocial

Get access to courses on Big Data, Data Science, AI, Cloud, Linux, System Admin, Web Development, and related topics. Also, create your own course to sell to others: http://sitestree.com/training/