Quantcast
Channel: CodeSection,代码区,SQL Server(mssql)数据库 技术分享 - CodeSec
Viewing all articles
Browse latest Browse all 3160

Implementing Fuzzy Sets in SQL Server, Part 6: Fuzzy Numbers and Linguistic Modi ...

$
0
0
By Steve Bolton

………… I’ve written several amateur tutorial series on this blog in order to more quickly absorb difficult data mining, statistical and machine learning topics, while hopefully helping other SQL Server users avoid some of my inevitable mistakes. Since I don’t know what I’m talking about, I’m occasionally surprised at how useful some of the material turns out to be particularly in the case of this week’s topic, fuzzy numbers, which I originally thought were a curiosity. In recent years I’ve made moderate progress in recapturing some of the advanced math skills I had as a kid, but some of the resources I’m consulting still wear me out with densely packed, arcane symbols; that was also true of George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications and several other sources I’ve used for this series on using T-SQL to implement fuzzy sets. Much of the literature on the subject is written by mathematicians who need thick equations and highly precise terminology to communicate with each other effectively, which can be taxing on non-specialists not accustomed to the jargon of the field.

…………

That communication gap may be responsible for the enormous lag between the refinement of the theory and its adoption; this comes despite all of the empirical evidence that these data modeling techniques are insanely useful in solving certain classes of real-world problems, in a wide variety of industries. Adoption has also lagged in the relational database and data mining markets, in spite of the fact that set-based languages like T-SQL and Multidimensional Expressions (MDX) are ideal for implementing fuzzy sets. I’m trying to put my drop in the bucket to remedy that, but my lack of adequate equation translation skills almost caused me to skip over fuzzy numbers, which are actually one of the simplest and most useful aspects of fuzzy set theory.

Modeling Everyday Speech

………… One rule of thumb I’ve learned along the way is that whenever confusion arises over fuzzy sets, it is best to return to basics and state the problem in terms of natural language. After all, the main overarching use case for fuzzy set theory is to model imprecision that can be expressed linguistically, particularly when it would be useful to use continuous scales to ordinal categories. If we were using a Behavior-Driven Development (BDD) methodology, we might want to flag any qualifiers we encounter in user stories like “about,” “around,” “near,” “approximately,” “most,” “few” and the like as candidates for fuzzy numbers. These require several numbers and calculations of various intensity in order to assign a range of values to a single number but since ordinary numbers require none of this extra computation, the obvious question is, why bother? It turns out that these particular instances of fuzzy sets are useful in modeling these types of natural language qualifiers, which express uncertainty about how close a value is to a definite target value.

………… “About half” is a clear and simple example from everyday speech. In fuzzy set parlance, this would be modeled as a “triangular number,” which is a fancy way of saying that we’d assign a perfect membership grade of 1 to values that were exactly equal to 0.5, but descending on either side of 0.5 in proportion to how far away the value was from that target. The term “triangular” is used because if we use a line chart to depict a membership function of this kind, it peaks at the target value and descends to 0 as the values decrease or increase away from it. Trapezoidal numbers are a mouthful, but really aren’t much more difficult; they have basically have the same shape as triangular numbers, except that the peak is flattened in order to express an interval of some kind.[1] University of Minnesota Prof. Glen Meeden’s natural language example of a trapezoidal number is the best I’ve yet run across in the literature: “What is the average yearly snowfall in the Twin Cities? You might answer somewhere between 20 and 50 inches.”[2] ………… Functions that only increase or decrease are useful in capturing related distinctions, like “a large number of” or “a small number of.”[3] Although it is wise to obey a few restrictions that lead to certain useful mathematical properties particularly the industry-standard requirement of a boundary between 0 and 1 it is not necessary for the membership functions to be symmetric. In fact, assigning a lopsided peak to a triangular number can be useful in modeling statements like “almost all,” which would be closer to the right edge of a line graph than a function that implements the term “most.”[4] They can even be bell-shaped.[5]

…………

At a higher level of sophistication, fuzzy numbers can be used to model terms like “very,” which fall under the rubric of linguistic hedges and fuzzy modifiers including statements like “very true,” which can be useful in fuzzy logic.[6] Klir and Yuan suggest applying powers and square roots to model such distinctions, since they don’t follow a linear scale. For example, they say that if we use a score of 0.8 to model the term “John is young,” squaring it would could model the phrase “very young” and using a square root could lead be used for “fairly young.” This is because the result of the first is 0.64, which strongly modifies the original term, while the second returns 0.89, which modifies it slightly.[7] Of course, the exact boundaries of all of these terms have to be set in light of domain knowledge of some kind. “Half” is a definite term but the modifier “about” can mean different things to different people, which may require aggregating the viewpoints of users in some way, perhaps using Decision Theory methods that integrate seamlessly with fuzzy sets. Neural nets likewise work well together with fuzzy sets because they are highly useful for encoding unknown functions, which means they can be put to use to derive such boundaries if greater precision is required. For many use cases, an informal guesstimate may suffice. Translating Fuzzy Numbers into T-SQL

………… My sample code below implements four types of fuzzy numbers, using quite simple and arbitrary criteria that is merely designed to illustrate the concepts. For the sake of consistency, I’m once again using the procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores to my derive membership grades and storing the results in a table variable (which has an extraneous column named GroupRank that can be safely ignored). In this case we’re calculating Z-Scores on the LactateDehydrogenase column of the Duchennes muscular dystrophy dataset I’ve been using as practice data for the last several tutorial series, which I downloaded ages ago from Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table in a dummy DataMiningProjects database. Some of the key differences from previous articles include the absence of the ReversedZScore column and associated @Rescaling variables. This is because we don’t need to perform rescaling of any kind, since we’re measuring nearness to a few Z-Score values in all four examples, not calculating a relative score on a range of 0 to 1. The MembershipScore column is missing for the same reason. In its place, we have four computed columns, two of which measure the closeness of each Z-Score in the dataset to a target value or range.

…………

To put it simply, the triangular number assigns a grade to the natural language statement, “around a Z-Score of 0.960526,” while the trapezoidal expresses the concept of “somewhere between 0.450526 and 1.360526.” The other two columns define increasing an

Viewing all articles
Browse latest Browse all 3160

Trending Articles