
Why Automate?

Because, as Jen Underwood (jenunderwood.com | LinkedIn | @idigdata) states in an upcoming podcast: “The future of data science is automation.”

If automation is the future, how do we decide what to automate? We look for the long pole. What’s the long pole in data science?



Data Integration is the Long Pole

According to Lars Nielsen in his book Unicorns Among Us:

“As intellectually stimulating as the work might be, there are also mundane aspects to the data scientist’s job. In fact, there is pure drudge work that can take from 50 to 80 percent of a data scientist’s time: This involves the collecting, cleaning and organizing of unruly, disparate, messy, unstructured data before it can be mined for the gold of actionable BI. Some call this aspect of the job ‘data wrangling.’” (emphasis mine)

The long pole in data science is “data wrangling” or “data munging,” also known as data integration.

“How Do We Automate Data Integration, Andy?”

I’m glad you asked! My answer is:

Eliminate redundancy.

Automate execution.

Practice DILM (Data Integration Lifecycle Management).

First, eliminate the cut-and-paste drudge work from data integration development.

What is the “cut-and-paste drudge work?” A significant amount of data integration is based on repeatable patterns. One can automate SSIS design patterns, such as truncate-and-load and incremental load, using Biml (Business Intelligence Markup Language). Hours, days, and weeks of work can be reduced to minutes by using Biml and Biml Frameworks, such as BimlFlex and the Biml Express Metadata Framework.
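To see why metadata-driven generation eliminates the redundancy, consider a minimal sketch. Biml itself is XML with C# “nuggets,” but the idea translates: describe tables once as metadata, and render the truncate-and-load pattern for each. The table names, schema names, and template below are hypothetical illustrations, not the Biml API.

```python
# Sketch: metadata-driven generation of a truncate-and-load pattern.
# Biml does this with XML plus C# nuggets; this Python analogue shows
# the same idea. All table/schema names here are hypothetical.

TABLES = ["dbo.Customer", "dbo.Orders", "dbo.Product"]

TEMPLATE = """\
-- Package: Load_{name}
TRUNCATE TABLE stage.{name};
INSERT INTO stage.{name} SELECT * FROM source.{name};"""

def generate_packages(tables):
    """Render one load script per table from the shared pattern."""
    return {t: TEMPLATE.format(name=t.split(".")[1]) for t in tables}

if __name__ == "__main__":
    for script in generate_packages(TABLES).values():
        print(script)
        print()
```

Adding a tenth (or three-hundredth) table is one more metadata row, not another hand-built package; that is where the hours-to-minutes savings comes from.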

Second, automate data integration execution.

Data integration execution has long been grouped into processing jobs or batches. A best practice in SSIS development is to build small, unit-of-work packages that perform specific load or transformation operations. Why? Smaller packages are easier to develop, test, manage, and maintain. Unit-of-work packages promote code re-use, as some packages can be used in multiple batched operations.

There’s no free lunch, and building unit-of-work packages presents a new issue: now one has many more SSIS packages to execute. An execution framework addresses this issue. Enterprise Data & Analytics offers a free execution framework, the SSIS Framework Community Edition.

The SSIS Framework Community Edition includes a Parent SSIS package that executes packages collected into batches called “SSIS Applications.” Metadata about SSIS Applications is stored in tables integrated (although isolated by a custom schema) into the SSISDB database. Data integration professionals can configure metadata for three (or three hundred) packages in a single batch, and this batch can be executed by starting the Parent SSIS package and overriding a single parameter.
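The parent-package pattern can be sketched in a few lines. This is not the SSIS Framework Community Edition’s actual code (the real framework stores its metadata in SSISDB tables and starts executions through the SSIS Catalog); the dictionary, function, and package names below are hypothetical stand-ins that show the shape of the idea: a named batch maps to an ordered list of child packages, and the parent looks up the batch and runs each child.

```python
# Sketch of the parent-package idea: an "SSIS Application" (a named batch)
# maps to an ordered list of child packages. The real framework keeps this
# metadata in SSISDB; these names are hypothetical.

APPLICATIONS = {
    "NightlyETL": ["StageCustomer.dtsx", "StageOrders.dtsx", "LoadWarehouse.dtsx"],
}

def execute_application(app_name, runner):
    """Parent 'package': look up the batch by name, run each child in order."""
    executed = []
    for package in APPLICATIONS[app_name]:
        runner(package)  # stands in for starting a catalog execution
        executed.append(package)
    return executed
```

Overriding “a single parameter” in the real framework corresponds here to passing a different `app_name`: one value selects which whole batch runs.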

Consider this quote from a CNN article, from a senior official with the Obama 2012 US re-election campaign:

[Image: quote from a senior official with the Obama 2012 US re-election campaign]

Third, data integration needs enterprise lifecycle management, like all other software development. (Data integration development is software development.)

Repeatable, collaborative, and communicable processes form the heart of enterprise DevOps. Repeatable releases and source control for SSIS are no longer optional. Enterprises need at least three lifecycle “tiers”: development, test, and production. Development and Production environments are usually not in question, so why the third tier? It’s not important what this tier is called, or even whether there are more tiers between Development and Production. This tier is important because it’s not Production and not Development.

All software works in Development. Software is built in Development, and the design-time defaults all point to Development resources. Enterprises do not want the first deployment of any software to be the Production deployment. Instead, a test deployment to a different environment (not Development and not Production) ensures all external parameters are properly configured and included in the deployment plan. A successful test deployment to an environment (lifecycle tier) that matches Production dramatically improves confidence that the Production deployment will succeed.

When deploying SSIS to the SSIS Catalog, though, you need to be sure your test deployment tier is closely aligned to the Production environment. That can be… challenging, but SSIS Catalog Compare detects and can script and deploy differences between SSIS Catalogs residing in different Data Integration Lifecycle tiers.
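Detecting drift between tiers reduces to a configuration diff. The sketch below is not SSIS Catalog Compare’s implementation; the parameter names and values are hypothetical. It shows the kind of comparison involved: some differences between tiers are expected (environment-specific values like connection strings), while parameters present in one tier but missing from the other usually indicate a deployment gap.

```python
# Sketch: compare parameter sets between two lifecycle tiers, in the
# spirit of what SSIS Catalog Compare automates. Names/values are
# hypothetical.

def diff_environments(test_env, prod_env):
    """Return parameters missing from, or differing between, two tiers."""
    diffs = {}
    for key in test_env.keys() | prod_env.keys():
        t, p = test_env.get(key), prod_env.get(key)
        if t != p:
            diffs[key] = (t, p)  # (test value, prod value); None = missing
    return diffs
```

A parameter reported as `(None, value)` exists only in Production, which is exactly the class of surprise a test-tier deployment is meant to surface before release day.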

Conclusion

Why is automation important? Automating data integration changes the dynamics for data science.

Jen’s right. Lars is right. Automation is the future of data science, and automating the long pole is the place to begin. For SSIS automation, the DILM Suite is a solution.

What can we accomplish by automating data science? We can change the world.

:{>

