In the last installment of this amateur series of self-tutorials on DIY data mining metrics, we saw how a mathematical property known as absolute continuity hampered the Kullback-Leibler Divergence. It is still perhaps the most widely mentioned information distance measure based on Shannon’s Entropy, but a search has been under way since its development in 1951 to find alternatives that have less restrictive properties. In this article I’ll cover some of the most noteworthy alternatives, which are often used for mind-blowing applications like quantum physics, but which can easily be ported for ordinary day-to-day data mining uses in SQL Server databases; in fact, one of the main points I want to get across in this series is that it doesn’t take a rocket scientist to harness these Space Age information measures for more down-to-earth purposes. The key is to match the right metrics to the right use cases, a decision that is often strongly influenced by their mathematical properties.
As we saw in the last article, absolute continuity means that we can’t use the KL Divergence to measure the similarity between distributions in cases where one value has a probability of 0 and its counterpart does not. This is often the culprit when the KL Divergence and its relatives don’t intersect with other distance measures at the expected points, or stay within the expected bounds, which also constitute properties in their own right. Towards the end of that article, I also pointed out that divergences are technically differentiated from distances by the fact that they’re not required to have the usually desirable property of symmetry, whereas true metrics possess symmetry in addition to subadditivity. The motivation behind the most widely known metric I’ll cover today is to calculate a variant of the KL Divergence that possesses symmetry, which means we can measure it in either direction and get the same result. Imagine how cumbersome it would be to take a few tables of projected sales for various products and use the KL Divergence to measure the information gain based on the actual results, in order to discern which model fit best, only to discover that the results depended on which direction we made the comparison in. The Jensen-Shannon Divergence also “provides both the lower and upper bounds to the Bayes probability of error” and can be weighted in ways that are useful in Decision Theory.[1] It is also used in assessing random graphs and electron orbits, among other applications.[2]

The Jeffreys Divergence and Other Jensen-Shannon Kin

The primary motivation for the development of the KL’s contemporaneous cousin, the Jeffreys Divergence,[3] was to measure Bayesian a priori probabilities in a way that exhibited the property of invariance.[4] This means that it remains consistent when stretched across a dimension, just as multiplying by threes or adding twos does in ordinary math; obviously, it would be awkward to compare projections of things like customer satisfaction across data models if they were warped like carnival mirrors by ordinary math operations. The Jeffreys Divergence is a “special case” of the measure I discussed in Outlier Detection with SQL Server, part 8: Mahalanobis Distance, which is merely one example of the multifarious interrelationships such measures exhibit. Don’t let this put you off though: the more we can connect them together, the easier it is to construct a web of sorts out of them, to ensnare the actionable information hidden in our tables and cubes.
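To make the symmetry issue concrete, here is a minimal sketch against a pair of contrived three-value distributions; the table variable, column names and probabilities are invented purely for illustration, and the LOG function with a base argument requires SQL Server 2012 or later. The two KL results differ depending on which distribution comes first, while the Jensen-Shannon result does not:

```sql
-- Minimal sketch: two contrived discrete distributions over the same three values,
-- used to verify that the KL Divergence changes with the direction of comparison
-- while the Jensen-Shannon Divergence does not.
DECLARE @Probabilities TABLE (Value INT, Probability1 FLOAT, Probability2 FLOAT);
INSERT INTO @Probabilities (Value, Probability1, Probability2)
VALUES (1, 0.5, 0.25), (2, 0.3, 0.25), (3, 0.2, 0.5);

SELECT
    SUM(Probability1 * LOG(Probability1 / Probability2, 2)) AS KLDivergenceForward,   -- KL(P1 || P2)
    SUM(Probability2 * LOG(Probability2 / Probability1, 2)) AS KLDivergenceReversed,  -- KL(P2 || P1): a different figure
    0.5 * SUM(Probability1 * LOG(2 * Probability1 / (Probability1 + Probability2), 2))
    + 0.5 * SUM(Probability2 * LOG(2 * Probability2 / (Probability1 + Probability2), 2)) AS JensenShannonDivergence -- same in either direction
FROM @Probabilities;
```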
I’ll only mention a handful of the boundaries and intersections these measures must obey, such as the fact that the Jensen-Shannon Divergence is bounded by 1 when bits are used as the units (i.e. a logarithm base of 2). The K-Divergence is bounded by the Kullback-Leibler Divergence and a simpler measure known as the Variational Distance, which is just the sum of the absolute differences between the two probabilities. It exhibits a different slope, which fits within the concentric curves of relatives like the KL Divergence, as depicted in a handy graph in the original paper.[5] One of Jianhua Lin’s primary motivations was to devise a measure like the KL that didn’t require absolute continuity, thereby allowing us to make comparisons of information distance between models that previously weren’t valid.[6] The Topsøe Distance is just a symmetric form of the K-Divergence, whereas the Jensen Difference leverages the concavity of the Jensen-Shannon Divergence, meaning that on a graph, it would bow outwards towards more extreme values rather than inwards towards 0.

There is a whole smorgasbord of information metrics out there in the wild, exhibiting all kinds of useful properties that can be easily leveraged for productive purposes on SQL Server tables and cubes.[7] I’m introducing myself to the topic and merely passing along my misadventures to readers along the way, but it is plainly clear from sifting through the literature that there is a yawning gap between the research and the potential applications in real-world situations. Not only is this low-hanging fruit going unpicked, but I strongly suspect that set-based languages like T-SQL are the ideal tools to harvest it with. I cannot possibly code all of the available metrics or even adequately explain the finer points of validating and interpreting them, but I will consider this segment a success if I can at least convey the idea that they can and probably ought to be implemented in SQL Server on a DIY basis. Don’t let the arcane formulas and jargon fool you: it takes very little code and very few server resources to calculate many of these entropic measures. In fact, it is a pleasant surprise to discover that if you calculate one of them across a large table, it costs very little to calculate the rest of them in the same pass.

Looking Past the Mathematical Formulas

The Jeffreys Divergence is the simplest of them, in that we merely have to replace the multiplicand (the first term in a multiplication operation) with the difference between the two probabilities. In the K-Divergence, we instead replace the multiplier of the KL Divergence with the result of two times the first probability divided by the sum of the two probabilities. The Topsøe is similar, in that we merely take the same input to the K-Divergence and add it to the result we’d get if we swapped the positions of the two probabilities in the same K-Divergence formula. The Jensen-Shannon Divergence is almost identical to the Topsøe, except that we take the SUM over each of the two inputs separately before adding them together, then halve the result. The Jensen Difference is the most complex of the lot, in the sense that it requires precalculating Shannon’s Entropy for both probabilities and adding them together, then halving the result and subtracting half the sum of the two probabilities times the LOG of that same half-sum.
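To show how little code this takes in practice, here is a sketch of all five measures calculated in a single pass, translated directly from the plain-English descriptions above and cross-checked against the equations in Cha’s taxonomy cited below. The schema, table and column names are hypothetical placeholders of my own; the two probability columns are assumed to be precalculated, to each sum to 1 and to be nonzero wherever a logarithm is taken:

```sql
-- A sketch, not production code: five entropic measures in one pass over a pair of
-- precalculated probability columns. Calculations.ProbabilityTable, Probability1 and
-- Probability2 are hypothetical placeholders. Logarithm base 2 returns results in bits.
SELECT
    SUM((Probability1 - Probability2) * LOG(Probability1 / Probability2, 2)) AS JeffreysDivergence,
    SUM(Probability1 * LOG(2 * Probability1 / (Probability1 + Probability2), 2)) AS KDivergence,
    SUM(Probability1 * LOG(2 * Probability1 / (Probability1 + Probability2), 2)
      + Probability2 * LOG(2 * Probability2 / (Probability1 + Probability2), 2)) AS TopsoeDistance,
    0.5 * SUM(Probability1 * LOG(2 * Probability1 / (Probability1 + Probability2), 2)
      + Probability2 * LOG(2 * Probability2 / (Probability1 + Probability2), 2)) AS JensenShannonDivergence,
    SUM((Probability1 * LOG(Probability1, 2) + Probability2 * LOG(Probability2, 2)) / 2
      - ((Probability1 + Probability2) / 2) * LOG((Probability1 + Probability2) / 2, 2)) AS JensenDifference
FROM Calculations.ProbabilityTable;
```

Because all five are ordinary aggregates over the same rows, the optimizer should be able to satisfy them with a single scan, which is why calculating the extra measures costs next to nothing once the probabilities are in place.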
I always recommend checking my formulas over before putting them into a production environment, given that I’m an amateur learning as I go; I highly recommend Cha Sung-Hyuk’s excellent taxonomy of distance measures, which became my go-to source for the equations.[8]

As explained in previous articles, I rarely post equations, in order to avoid taxing either the brains of readers or my own any more than I have to, on the grounds that they’re certainly indispensable for helping mathematicians communicate with each other but often a hindrance to non-specialists. There is a crying need in these fields for middlemen to translate such formulas into code with as little mathematical jargon and as many user interface affordances as possible, so that end users can apply them to real-world problems.