
Why Automate?

Because, as Jen Underwood (jenunderwood.com | LinkedIn | @idigdata) states in an upcoming podcast: “The future of data science is automation.”

If automation is the future, how do we decide what to automate? We look for the long pole. What’s the long pole in data science?



Data Integration is the Long Pole

According to Lars Nielsen in his book Unicorns Among Us:

“As intellectually stimulating as the work might be, there are also mundane aspects to the data scientist’s job. In fact, there is pure drudge work that can take from 50 to 80 percent of a data scientist’s time: This involves the collecting, cleaning and organizing of unruly, disparate, messy, unstructured data before it can be mined for the gold of actionable BI. Some call this aspect of the job ‘data wrangling.’” (emphasis mine)

The long pole in data science is “data wrangling” or “data munging,” also known as data integration.

“How Do We Automate Data Integration, Andy?”

I’m glad you asked! My answer is:

Eliminate redundancy.

Automate execution.

Practice DILM (Data Integration Lifecycle Management).

First, eliminate the cut-and-paste drudge work from data integration development.

What is the “cut-and-paste drudge work?” A significant amount of data integration is based on repeatable patterns. One can automate SSIS design patterns, such as truncate-and-load and incremental load, using Biml (Business Intelligence Markup Language). Hours, days, and weeks of work can be reduced to minutes by using Biml and Biml Frameworks, such as BimlFlex and the Biml Express Metadata Framework.
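To see why metadata-driven generation eliminates the redundancy, consider a minimal sketch. Biml itself is XML with C# “nuggets,” but the idea translates: describe tables once as metadata, and render the truncate-and-load pattern for each. The table names, schema names, and template below are hypothetical illustrations, not the Biml API.

```python
# Sketch: metadata-driven generation of a truncate-and-load pattern.
# Biml does this with XML plus C# nuggets; this Python analogue shows
# the same idea. All table/schema names here are hypothetical.

TABLES = ["dbo.Customer", "dbo.Orders", "dbo.Product"]

TEMPLATE = """\
-- Package: Load_{name}
TRUNCATE TABLE stage.{name};
INSERT INTO stage.{name} SELECT * FROM source.{name};"""

def generate_packages(tables):
    """Render one load script per table from the shared pattern."""
    return {t: TEMPLATE.format(name=t.split(".")[1]) for t in tables}

if __name__ == "__main__":
    for script in generate_packages(TABLES).values():
        print(script)
        print()
```

Adding a tenth (or three-hundredth) table is one more metadata row, not another hand-built package; that is where the hours-to-minutes savings comes from.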

Second, automate data integration execution.

Data integration execution has long been grouped into processing jobs or batches. A best practice in SSIS development is to build small, unit-of-work packages that perform specific load or transformation operations. Why? Smaller packages are easier to develop, test, manage, and maintain. Unit-of-work packages promote code re-use, as some packages can be used in multiple batched operations.

There’s no free lunch, and building unit-of-work packages presents a new issue: now one has many more SSIS packages to execute. An execution framework addresses this issue. Enterprise Data & Analytics offers a free execution framework, the SSIS Framework Community Edition.

The SSIS Framework Community Edition includes a Parent SSIS package that executes packages collected into batches called “SSIS Applications.” Metadata about SSIS Applications is stored in tables integrated (although isolated by a custom schema) into the SSISDB database. Data integration professionals can configure metadata for three (or three hundred) packages in a single batch, and this batch can be executed by starting the Parent SSIS package and overriding a single parameter.
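The parent-package pattern can be sketched in a few lines. This is not the SSIS Framework Community Edition’s actual code (the real framework stores its metadata in SSISDB tables and starts executions through the SSIS Catalog); the dictionary, function, and package names below are hypothetical stand-ins that show the shape of the idea: a named batch maps to an ordered list of child packages, and the parent looks up the batch and runs each child.

```python
# Sketch of the parent-package idea: an "SSIS Application" (a named batch)
# maps to an ordered list of child packages. The real framework keeps this
# metadata in SSISDB; these names are hypothetical.

APPLICATIONS = {
    "NightlyETL": ["StageCustomer.dtsx", "StageOrders.dtsx", "LoadWarehouse.dtsx"],
}

def execute_application(app_name, runner):
    """Parent 'package': look up the batch by name, run each child in order."""
    executed = []
    for package in APPLICATIONS[app_name]:
        runner(package)  # stands in for starting a catalog execution
        executed.append(package)
    return executed
```

Overriding “a single parameter” in the real framework corresponds here to passing a different `app_name`: one value selects which whole batch runs.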

Consider this quote from a CNN article, from a senior official with the Obama 2012 US re-election campaign:

[Image: quote from a senior official with the Obama 2012 US re-election campaign]

Third, data integration needs enterprise lifecycle management, like all other software development. (Data integration development is software development.)

Repeatable, collaborative, and communicable processes form the heart of enterprise DevOps. Repeatable releases and source control for SSIS are no longer optional. Enterprises need at least three lifecycle “tiers”: development, test, and production. Development and Production environments are usually not in question, so why the third tier? It’s not important what this tier is called, or even whether there are more tiers between Development and Production. This tier is important because it’s not Production and not Development.

All software works in Development. Software is built in Development, and the design-time defaults all point to Development resources. Enterprises do not want the first deployment of any software to be the Production deployment. Instead, a test deployment to a different environment (not Development and not Production) ensures all external parameters are properly configured and included in the deployment plan. A successful test deployment to an environment (lifecycle tier) that matches Production dramatically improves confidence that the Production deployment will succeed.

When deploying SSIS to the SSIS Catalog, though, you need to be sure your test deployment tier is closely aligned to the Production environment. That can be… challenging, but SSIS Catalog Compare detects and can script and deploy differences between SSIS Catalogs residing in different Data Integration Lifecycle tiers.
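Detecting drift between tiers reduces to a configuration diff. The sketch below is not SSIS Catalog Compare’s implementation; the parameter names and values are hypothetical. It shows the kind of comparison involved: some differences between tiers are expected (environment-specific values like connection strings), while parameters present in one tier but missing from the other usually indicate a deployment gap.

```python
# Sketch: compare parameter sets between two lifecycle tiers, in the
# spirit of what SSIS Catalog Compare automates. Names/values are
# hypothetical.

def diff_environments(test_env, prod_env):
    """Return parameters missing from, or differing between, two tiers."""
    diffs = {}
    for key in test_env.keys() | prod_env.keys():
        t, p = test_env.get(key), prod_env.get(key)
        if t != p:
            diffs[key] = (t, p)  # (test value, prod value); None = missing
    return diffs
```

A parameter reported as `(None, value)` exists only in Production, which is exactly the class of surprise a test-tier deployment is meant to surface before release day.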

Conclusion

Why is automation important? Automating data integration changes the dynamics for data science.

Jen’s right. Lars is right. Automation is the future of data science, and automating the long pole is the place to begin. For SSIS automation, the DILM Suite is a solution.

What can we accomplish by automating data science? We can change the world.

:{>

