Monday, October 22, 2007

SQL Server 2005 Data Mining

Introduction

The Microsoft SQL Server 2005 Data Mining platform adds significant capabilities that support data mining in both traditional and new ways. In traditional terms, data mining can predict future results from input data, find relationships among data, or cluster data into groups that are similar but were previously unrecognized.

Microsoft data mining tools are different from traditional data mining applications in significant ways. First, they support the entire development lifecycle of data in the organization, which Microsoft refers to as Integrate, Analyze, and Report. This ability frees the data mining results from the hands of a select few analysts and opens those results up to the entire organization. Second, SQL Server 2005 Data Mining is a platform for developing intelligent applications, not a stand-alone application. You can build custom applications that are intelligent, because the data mining models are easily accessible to the outside world. Further, the model is extensible so that third parties can add custom algorithms to support particular mining needs. Finally, Microsoft data mining algorithms can be run in real time, allowing for the real-time validation of data against a set of mined data.

Creating Intelligent Applications

The concept behind creating intelligent applications is to take the benefits of data mining and apply them to the entire data entry, integration, analysis, and reporting process. Most data mining tools show predictions of future results and help determine relationships between different data elements. Most of these tools are run against the data and produce results that are then interpreted separately. Many are stand-alone applications that exist to forecast demand or identify relationships, and their functionality stops there.

Intelligent applications take the output of data mining and apply it as input to the entire process. One example of an application that makes use of a data mining model is a data entry form for accepting personal information. Users of the application can enter a large amount of data, such as birth date, gender, education level, income level, occupation, and so forth. Certain combinations of attributes don't make logical sense; for example, a seven-year-old working as a doctor and holding a high-school diploma suggests that someone is either filling in random data or struggling with the data entry form. Most applications try to handle such issues with complicated and deeply nested logic, but realistically it is nearly impossible to enumerate every valid and invalid combination of data.

To solve this problem, a business can use data mining to look at existing data and build rules for what looks valid. Each combination is scored with a level of confidence. The organization can then build the data entry application to use the data mining model for real-time data entry validation. The model scores the input against the universe of existing data and returns a level of confidence in the input. The application can then decide whether to accept the input based on a predetermined confidence threshold.
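The validation flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SQL Server 2005 implements it: the ToyValidator class, the age_band helper, the sample records, and the 0.05 threshold are all hypothetical, and a simple co-occurrence frequency stands in for a real mined model.

```python
from collections import Counter


def age_band(age):
    """Bucket an age into a coarse band (hypothetical bands for illustration)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


class ToyValidator:
    """Toy stand-in for a mined model: scores (age band, occupation) pairs
    by how often they co-occur in the existing data."""

    def __init__(self, records):
        self.pair_counts = Counter(
            (age_band(r["age"]), r["occupation"]) for r in records
        )
        self.band_counts = Counter(age_band(r["age"]) for r in records)

    def confidence(self, age, occupation):
        """Share of existing records in this age band that have this occupation."""
        band = age_band(age)
        total = self.band_counts[band]
        if total == 0:
            return 0.0  # no evidence at all for this age band
        return self.pair_counts[(band, occupation)] / total

    def accept(self, age, occupation, threshold=0.05):
        """Accept the input only if its confidence reaches the threshold."""
        return self.confidence(age, occupation) >= threshold


# Hypothetical existing data that the "model" is built from.
existing = [
    {"age": 7, "occupation": "student"},
    {"age": 9, "occupation": "student"},
    {"age": 34, "occupation": "doctor"},
    {"age": 41, "occupation": "teacher"},
    {"age": 52, "occupation": "doctor"},
    {"age": 29, "occupation": "engineer"},
]
model = ToyValidator(existing)

print(model.accept(7, "doctor"))   # seven-year-old doctor: rejected
print(model.accept(40, "doctor"))  # adult doctor: accepted
```

The point of the sketch is the shape of the interaction: the form passes the entered values to the model, gets a confidence back, and compares it with a threshold, instead of hard-coding every invalid combination.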

This example points out the advantage of using a data mining engine that can run in real time: applications can be written that take advantage of the power of data mining. Rather than data mining being the end result, it becomes a part of the overall process and plays a role at each phase of integration, analysis, and reporting.

While validating input uses data mining at the front end of the data integration process, data mining can be used in the analysis phase as well. Data mining provides the ability to group or cluster values, such as similar customers or documents based on keywords. These clusters can then be fed back into the data warehouse so that analysis can be performed using these groupings. Once the groupings are known and fed back into the analysis loop, analysts can use them to look at data in ways that were not possible before.
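As a rough sketch of feeding clusters back into the analysis loop, the Python below groups customers with a plain k-means and writes the resulting segment onto each row. It is an assumption-laden toy: the customer fields, the starting centroids, and the choice of k-means are illustrative only and do not represent the clustering algorithm that ships with SQL Server 2005.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids.
    Returns the cluster index assigned to each point."""
    centroids = list(centroids)  # work on a copy, leave the caller's list alone
    assignments = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignments = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignments


# Hypothetical customer rows: yearly spend and number of orders.
customers = [
    {"id": 1, "spend": 120, "orders": 2},
    {"id": 2, "spend": 150, "orders": 3},
    {"id": 3, "spend": 2100, "orders": 25},
    {"id": 4, "spend": 1900, "orders": 22},
    {"id": 5, "spend": 130, "orders": 2},
]
points = [(c["spend"], c["orders"]) for c in customers]

# Seed with one low-spend and one high-spend customer.
labels = kmeans(points, centroids=[points[0], points[2]])

# Feed the cluster back as a new attribute that later analysis can group by.
enriched = [{**c, "segment": label} for c, label in zip(customers, labels)]
for row in enriched:
    print(row["id"], row["segment"])
```

In a warehouse setting, the segment column would be stored alongside the customer dimension so that reports can slice by it like any other attribute.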

One of the primary goals of intelligent applications is to make the power of the data mining models available to anyone, not just analysts. In the past, data mining has been the domain of experts with backgrounds in statistics or operations research. The data mining tools were built to support such users, but not to easily integrate with other applications. Thus, the ability to use data mining information was greatly restricted outside of the data mining product itself. However, with a tool that spans the entire process and opens up its models and results to other applications, businesses have the power to create intelligent applications that use data mining models at any stage.

Another aspect of a platform that allows for the creation of intelligent applications is a centralized server to store the data mining models and results. These models tend to be highly proprietary and secret. Storing them on the server protects them from being distributed outside of the organization. An added benefit is that with a shared location for models, companies have a single version of each model, not multiple variants residing on each analyst's desktop. Having a single version of the truth is one of the goals of data warehousing, and this concept can be extended to data mining so that there is a single version of the model that has been created and tuned for the particular business.

 

Regards,

Sankata
PT. Ecomindo Saranacipta
Gedung YDAP Denta Medika, 4th floor
Jl. Raya Pasar Minggu No. 45
Jakarta Selatan, Indonesia

Office : +6221 7900909

E-mail : sankata.ec@ecomindo.com

Blog : http://sankatalee.blogspot.com
ym: sankatalee | gtalk: sankatalee

 
