Spoofing Data Convincingly: Masking Continuous Variables.

In any real numeric data from a database, you will only rarely see anything like a normal distribution of the values. Sales data will rise and fall according to the time of year and the economic cycle. The date of input of a record will vary with the workload. If you plot any real data, you are unlikely to see that bell-curve or straight line. It will look more like the Welsh hills (or, technically, ‘multimodal’). Truly independent data will tend to conform to the bell-curve, whereas dependent data tends to be multimodal.

Imagine that you have a table giving invoice values. You will want your spoof data to follow the same ups and downs as the real data over time. You may be able to get the overall distribution the same as the real data, but the resulting data would be useless for seeing the effect of last year’s sales promotion. The invoice values will depend on your sales promotions if your marketing people have done their job properly.

By giving your data the same distribution as your production data, you don’t necessarily get the same strategy chosen by the query optimizer, but you dramatically increase the chances of getting it. SQL Server uses a cost-based optimizer to choose among its alternative plans for a query. It maintains distribution statistics for every column and index that is used for selecting rows. These aren’t actually histograms in the classic sense, but they perform a similar function and are used by the SQL Server engine to predict the number of rows that will be returned.
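To see what the optimizer has to work with, you can list the histogram steps behind the statistics on a table. This is only a sketch: it assumes SQL Server 2016 SP1 CU2 or later, where sys.dm_db_stats_histogram is available, and that you run it in your own copy of AdventureWorks. On older versions, DBCC SHOW_STATISTICS with the HISTOGRAM option gives similar information.

```sql
-- Sketch: list the histogram steps of every statistics object on Person.Person.
-- Assumes the current database is a copy of AdventureWorks.
SELECT s.name AS StatisticsName,
       h.step_number,
       h.range_high_key, --upper bound of this step
       h.range_rows,     --estimated rows inside the step, excluding the upper bound
       h.equal_rows      --estimated rows equal to the upper bound
  FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
  WHERE s.object_id = Object_Id('Person.Person')
  ORDER BY s.name, h.step_number;
```

If the spoofed copy produces steps with broadly similar row counts, the optimizer is more likely to make the same choices on it as on production.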

The easiest numeric data to spoof is an independent variable, so we can start with that.

We will collect the current values and use them to create a histogram. We then use this to select the bin (I prefer the alternative term ‘bucket’ as the word ‘bin’ is too overloaded!) from which we then generate the random number. This will result in a distribution that is a bit more jagged than it should be, but any sort of interpolation is a bit of a distraction at this stage. It is good enough, because pseudonymization can lack finesse, yet be fit for the intended purpose.
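The bucket selection is easier to follow with a tiny worked example. The three-bucket histogram below is made up for illustration, and @choice stands in for a random pick between 1 and the total count. The sketch uses a half-open comparison so that each pick matches exactly one bucket, which is a slight variation on the BETWEEN used in the functions in this article.

```sql
-- Hypothetical histogram: 2 values in 0-10, 5 values in 10-20, 3 values in 20-30.
DECLARE @choice INT = 7; --in practice, a random pick in 1..10

SELECT h.Bucket,
       --a random value somewhere inside the chosen bucket
       h.StartValue + Rand() * (h.EndValue - h.StartValue) AS RandomNumber
  FROM (VALUES (1,  0.0, 10.0, 2,  2),
               (2, 10.0, 20.0, 5,  7),
               (3, 20.0, 30.0, 3, 10)
       ) AS h (Bucket, StartValue, EndValue, [distribution], RunningTotal)
  WHERE @choice > h.RunningTotal - h.[distribution]
    AND @choice <= h.RunningTotal;
```

With @choice = 7, only bucket 2 qualifies, because its running total covers picks 3 to 7. Buckets holding more values cover more picks, so they are chosen proportionally more often.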

We aim to write a simple function that returns a random float that can be used to stock the ModifiedDate column of AdventureWorks2016.Person.Person (which we have, of course, copied for the purpose).

UPDATE AdventureWorksCopy.Person.Person SET modifiedDate=Convert(DATETIME,dbo.RandomFloat(@dh))

@dh is our distributionHistogram. The rest is mere detail!
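As a sketch of how @dh might be filled before running that UPDATE, the batch below collects the real dates as floats, builds the histogram, and then applies it. It assumes that the ContinuousVariable and DistributionHistogram types and the functions defined in this article already exist in the current database, and that AdventureWorksCopy is your own copy of the database.

```sql
-- Sketch only: relies on the types and functions defined in this article.
DECLARE @InVar ContinuousVariable; --the real values, as FLOATs
DECLARE @dh DistributionHistogram; --the histogram built from them

--collect the current values from the original table
INSERT INTO @InVar (number)
  SELECT Convert(FLOAT, ModifiedDate)
    FROM AdventureWorks2016.Person.Person;

--turn them into a distribution histogram
INSERT INTO @dh (Bucket, StartValue, EndValue, [distribution], RunningTotal)
  SELECT Bucket, StartValue, EndValue, [distribution], RunningTotal
    FROM dbo.CreateDistributionHistogram(@InVar);

--stock the copy with random dates drawn from that histogram
UPDATE AdventureWorksCopy.Person.Person
  SET ModifiedDate = Convert(DATETIME, dbo.RandomFloat(@dh));
```

The table variables only live for the duration of the batch, so all three steps need to run together.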

IF Object_Id('dbo.RandomFloatTable') IS NOT NULL
  DROP FUNCTION dbo.RandomFloatTable
IF Object_Id('dbo.RandomFloat') IS NOT NULL
  DROP FUNCTION dbo.RandomFloat
IF Object_Id('dbo.CreateDistributionHistogram') IS NOT NULL
  DROP FUNCTION dbo.CreateDistributionHistogram
GO
IF EXISTS (SELECT * FROM sys.types WHERE types.name LIKE 'DistributionHistogram')
  DROP TYPE DistributionHistogram
IF EXISTS (SELECT * FROM sys.types WHERE name LIKE 'ContinuousVariable')
  DROP TYPE ContinuousVariable
CREATE TYPE ContinuousVariable AS TABLE (number FLOAT NOT NULL)
GO
CREATE TYPE [dbo].[DistributionHistogram] AS TABLE
  (
  Bucket INT NOT NULL, --the number in the sequence 1..40
  StartValue FLOAT NOT NULL, --the starting value for membership of the bucket
  EndValue FLOAT NOT NULL, --the end value for membership of the bucket
  "distribution" INT NOT NULL, --the number of values in this bucket
  RunningTotal INT NOT NULL, --the number of values in or lower than this point
  PRIMARY KEY CLUSTERED ([runningtotal] ASC, [distribution] ASC)
    WITH (IGNORE_DUP_KEY = OFF)
  )
GO
IF Object_Id('dbo.RandomFloatTable') IS NOT NULL
  DROP FUNCTION dbo.RandomFloatTable
GO
CREATE FUNCTION dbo.RandomFloatTable
/**
Summary: >
  This returns a random float from a population
  described by the distribution histogram
Author: sa
Date: 18/07/2018
Database: PhilFactor
Examples:
   - Select * from dbo.RandomFloatTable(@dh)
   - Select * from MyTable cross apply dbo.RandomFloatTable(@dh)
Returns: >
  A table with a single result
**/
  (
  @dh DistributionHistogram READONLY
  )
RETURNS TABLE
--WITH ENCRYPTION|SCHEMABINDING, ..
AS
RETURN
  (
  SELECT [@dh].StartValue
         + ((SELECT randomnumber FROM SingleRandomNumber)
            * ([@dh].EndValue - [@dh].StartValue)) AS RandomNumber
    FROM @dh
      CROSS JOIN
        (
        SELECT randomnumber * (SELECT Max(RunningTotal) FROM @dh)
          FROM SingleRandomNumber
        ) AS f(choice)
    WHERE f.choice BETWEEN [@dh].RunningTotal - [@dh].distribution
                       AND [@dh].RunningTotal
  );
GO
IF Object_Id('dbo.RandomFloat') IS NOT NULL
  DROP FUNCTION dbo.RandomFloat
GO
CREATE FUNCTION dbo.RandomFloat
  (
  @dh DistributionHistogram READONLY --our distribution histogram
  )
RETURNS FLOAT
AS
  BEGIN
    DECLARE @where INT, @randomnumber FLOAT;
    /*a bucket on our distribution histogram where we create our number */
    SELECT @where = randomnumber * (SELECT Max([@dh].RunningTotal) FROM @dh)
      FROM SingleRandomNumber;
    SELECT @randomnumber = --we create a random number within the range of the bucket
      StartValue
      + ((SELECT randomnumber FROM SingleRandomNumber) * (EndValue - StartValue))
      FROM @dh --the distribution histogram
      WHERE @where BETWEEN RunningTotal - distribution AND RunningTotal;
    RETURN @randomnumber;
  END;
GO
IF Object_Id('dbo.CreateDistributionHistogram') IS NOT NULL
  DROP FUNCTION dbo.CreateDistributionHistogram
GO
CREATE FUNCTION dbo.CreateDistributionHistogram
/**
Summary: >
  This creates a Distribution histogram from
  a table that is just a list of FLOATs
Author: Phil Factor
Date: 18/07/2018
Database: PhilFactor
Examples:
   - Select * from dbo.CreateDistributionHistogram(@InVar)
Returns: >
  A table Bucket, StartValue, EndValue,"distribution",RunningTotal
**/
  (
  @InVar ContinuousVariable READONLY
  )
RETURNS @histogram TABLE -- create our distribution map
  ( --Bucket, StartValue, EndValue,"distribution",RunningTotal
  Bucket INT NOT NULL,
  StartValue FLOAT NOT NULL,
  EndValue FLOAT NOT NULL,
  "distribution" INT NOT NULL,
  RunningTotal INT NOT NULL
  )
AS
-- body of the function
  BEGIN
    DECLARE @Maximum FLOAT, --the maximum value in the set
      @Minimum FLOAT, --the minimum value in the set
      @increment FLOAT; --the span of values in each grouping
    --find the maximum and minimum values in the set
    SELECT @Maximum = Max(number) + StDev(number),
           @Minimum = Min(number) - StDev(number)
      FROM @InVar;
    --find the width of each grouping, allowing for random outliers
    SELECT @increment = (@Maximum - @Minimum) / 40.00;
    WITH buckets
    AS (SELECT OneToForty.value AS Bucket,
               @Minimum + (@increment * OneToForty.value) AS StartValue,
               @Minimum + (@increment * (OneToForty.value + 1)) AS EndValue
          FROM
            (
            VALUES --the range of each 'bucket'
              (1), (2), (3), (4), (5), (6), (7), (8), (9), (10),
              (11), (12), (13), (14), (15), (16), (17), (18), (19), (20),
              (21), (22), (23), (24), (25), (26), (27), (28), (29), (30),
              (31), (32), (33), (34), (35), (36), (37), (38), (39), (40)
            ) AS OneToForty (value)
       )
    INSERT INTO @histogram
      (Bucket, StartValue, EndValue, "distribution", RunningTotal)
      SELECT allbuckets.Bucket, allbuckets.StartValue, allbuckets.EndValue,
             Coalesce(Goodbuckets.coun, 1) AS "distribution",
             Sum(Coalesce(Goodbuckets.coun, 1))
               OVER (ORDER BY allbuckets.Bucket) AS RunningTotal
        FROM buckets AS allbuckets
          LEFT OUTER JOIN
            (
            SELECT buckets.Bucket, Count(*) AS coun
              FROM @InVar iv
                INNER JOIN buckets
                  ON iv.number BETWEEN buckets.StartValue AND buckets.EndValue
              GROUP BY buckets.Bucket
            ) AS Goodbuckets
            ON Goodbuckets.Bucket = allbuckets.Bucket
        ORDER BY allbuckets.Buck
