CodeBuild: December 2010

This article introduces the aim of data mining and explains basic concepts and terms.

Data Mining (i.e. knowledge discovery from data): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

Data Warehouse : A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin] Data warehouses are used as a data source for data mining.

Potential Uses : Web information mining, spam filtering, medical data mining, weather data mining, market sales strategies etc.

Data Mining Related Operations

Preprocessing:

Handling Noisy Data : Handling missing, duplicate or erroneous data before data mining. Noisy data can be removed, or corrected by a specific approach (e.g. correlation analysis).
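As a minimal sketch of the "remove" option, the snippet below drops records that have a missing value or that exactly duplicate an earlier record. The function name `clean_records`, the sample records, and the use of `None` as the missing-value marker are assumptions made for illustration only.

```python
def clean_records(records):
    """Drop records with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        # Assumption: a missing value is represented as None.
        if any(value is None for value in record.values()):
            continue
        # A hashable fingerprint of the record, used to detect duplicates.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"age": 34, "city": "Ankara"},
    {"age": None, "city": "Izmir"},   # missing value
    {"age": 34, "city": "Ankara"},    # duplicate of the first record
    {"age": 27, "city": "Istanbul"},
]
cleaned = clean_records(raw)
```

Correcting erroneous values instead of removing them needs domain knowledge and is not covered by this sketch.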

Integration : Combining data from multiple sources.

Normalization : Scaling data to a specified range. For example, scaling 750 from the range [500, 1000] to the range [0, 1] gives 0.5.
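The example above corresponds to what is usually called min-max scaling. A small sketch follows; the function name `min_max_scale` and its parameter names are chosen here for illustration.

```python
def min_max_scale(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("the original range must have non-zero width")
    # Position of the value inside the original range, as a fraction.
    ratio = (value - old_min) / (old_max - old_min)
    return new_min + ratio * (new_max - new_min)


scaled = min_max_scale(750, 500, 1000)  # (750 - 500) / (1000 - 500) = 0.5
```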

Feature Selection : Selecting only useful features (i.e. attributes for record data) of data.
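One very simple selection criterion, shown here only as an assumed example, is to drop attributes that have the same value in every record, since they cannot help distinguish one record from another. Real feature selection methods are more elaborate; the function name and the data below are hypothetical.

```python
def drop_constant_features(records):
    """Keep only the attributes that take more than one distinct value."""
    feature_names = list(records[0].keys())
    useful = [
        name for name in feature_names
        if len({record[name] for record in records}) > 1
    ]
    return [{name: record[name] for name in useful} for record in records]


records = [
    {"age": 25, "country": "TR", "income": 30},
    {"age": 40, "country": "TR", "income": 55},
    {"age": 33, "country": "TR", "income": 42},
]
selected = drop_constant_features(records)  # "country" is constant and is dropped
```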

Data Mining:

Classification: Finding a model that predicts the value of a class attribute from the values of the other attributes. (An example class attribute: CustomerBuysProduct (bool))

Different methods can be used for classification:

Decision Trees: Builds a decision tree as the model and classifies new data by following the tree.
Rule-Based Classifying: Deduces rules from the data (if X = Y and Z < T then the result is W, etc.).
Bayes Classifying: Uses probabilities estimated from previous data to classify.
K-Nearest Neighbor Classifying: Uses the distances between new data and previous data to classify.
...
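To make one of the methods above concrete, here is a minimal k-nearest neighbor sketch: a new record gets the majority label of its k closest training records. The function name `knn_classify`, the choice of Euclidean distance, and the (age, income) sample data are assumptions for illustration; the labels play the role of the CustomerBuysProduct class attribute.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Return the majority label among the k training points nearest to query.

    training is a list of (features, label) pairs with numeric feature tuples.
    """
    # Sort training records by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, income) -> CustomerBuysProduct
training = [
    ((25, 30), False),
    ((30, 35), False),
    ((45, 80), True),
    ((50, 90), True),
    ((48, 85), True),
]
prediction = knn_classify(training, (47, 82), k=3)
```

In practice the attributes would normally be normalized first, so that one attribute with a large range does not dominate the distance.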

Clustering: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Different methods can be used for clustering:

K-means Clustering: Splits data into a number of clusters (k) that is chosen in advance.
Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree.
...
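A bare-bones k-means sketch is given below. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Using the first k points as the initial centroids and running a fixed number of iterations are simplifying assumptions made here; real implementations usually choose initial centroids more carefully and stop on convergence.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster numeric tuples into k groups; returns (centroids, clusters)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids, clusters


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```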

Association (Rule) Discovery: Producing dependency rules that predict the occurrence of a feature (i.e. attribute) of data based on the occurrences of other features.
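Candidate rules are commonly scored with two measures that this article does not name: support (how often the items occur together) and confidence (how often the consequent occurs when the antecedent does). The sketch below computes both for a rule "bread => milk" on a made-up list of shopping baskets; the function names and the data are illustrative assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= transaction for transaction in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
```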

Pattern Discovery: Deducing patterns as a result of classification, clustering, association discovery etc.

Postprocessing: Evaluating and selecting interesting patterns, interpreting and visualizing them as an information report.

Tuesday, December 14, 2010

A Theoretical Introduction to Data Mining