
Tuesday, December 14, 2010

A Theoretical Introduction to Data Mining




This article introduces the aim of data mining and explains basic concepts and terms.
Data Mining (i.e. knowledge discovery from data): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

Data Warehouse : A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin] Data warehouses are commonly used as the data source for data mining.

Potential Usages : Web information mining, spam filtering, medical data mining, weather data mining, market and sales strategy planning etc.

Data Mining Related Operations
Preprocessing:
Handling Noisy Data : Handling missing, duplicate or erroneous data before data mining. Noisy data can be removed, or corrected by a specific approach (e.g. correlation analysis).
Integration : Combining data from multiple sources.
Normalization : Scaling data to a specified range. For example, scaling 750 from the range [500, 1000] to the range [0, 1] gives (750 - 500) / (1000 - 500) = 0.5.
Feature Selection : Selecting only the useful features of the data (i.e. the attributes, for record data).
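
The normalization example above can be sketched in a few lines of Python. This is only a minimal illustration of linear (min-max) scaling; the function name min_max_scale and its parameters are made up for this example and do not come from any particular library.

```python
def min_max_scale(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        # A zero-width source range cannot be rescaled.
        raise ValueError("old_min and old_max must differ")
    # Position of the value inside the old range, as a fraction between 0 and 1.
    fraction = (value - old_min) / (old_max - old_min)
    # Map that fraction onto the new range.
    return new_min + fraction * (new_max - new_min)


scaled = min_max_scale(750, 500, 1000)
print(scaled)  # 0.5
```

The same function can target another range, for instance min_max_scale(750, 500, 1000, -1, 1), by passing different new_min and new_max values.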

Data Mining:
Classification: Finding a model that predicts the value of a class attribute from the values of the other attributes. (An example class attribute: CustomerBuysProduct (bool))
Different methods can be used for classification:
  • Decision Trees: Builds a decision tree as the model and classifies new data by following the tree from the root to a leaf.
  • Rule-Based Classification: Deduces rules from the data (e.g. if X = Y and Z < T, then the result is W).
  • Bayesian Classification: Uses prior probabilities, estimated from previously seen data, to classify.
  • K-Nearest Neighbor Classification: Uses the distances between a new record and previously seen records to classify.
  • ...
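
To make one of the methods above concrete, here is a minimal sketch of K-Nearest Neighbor classification in plain Python. It assumes Euclidean distance and a simple majority vote among the k closest records; the function name knn_classify, the two features (age, yearly purchases) and all sample records are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def knn_classify(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training records.

    training is a list of (features, label) pairs, where features is a tuple
    of numbers with the same length as new_point.
    """
    # Sort the known records by Euclidean distance to the new record.
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], new_point))
    # Keep only the labels of the k closest records.
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most frequent label among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical records: (age, yearly_purchases) -> CustomerBuysProduct
training = [
    ((25, 2), False),
    ((30, 3), False),
    ((28, 1), False),
    ((45, 12), True),
    ((50, 15), True),
    ((48, 10), True),
]

prediction = knn_classify(training, (47, 11), k=3)
print(prediction)  # True
```

In practice the features would usually be normalized first, since an attribute with a large numeric range would otherwise dominate the distance.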
Clustering: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Different methods can be used for clustering:
  • K-means Clustering: Splits the data into a number of clusters (k) that is fixed in advance, assigning each object to the cluster with the nearest center.
  • Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree.
  • ...
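
As a rough illustration of K-means, the sketch below alternates between assigning each point to its nearest center and moving each center to the mean of its points. It is a simplified version under stated assumptions: the initial centers are just the first k points (real implementations usually choose them more carefully), the number of iterations is fixed, and the function name k_means and the sample points are invented for this example.

```python
import math


def k_means(points, k, iterations=10):
    """Group points (tuples of numbers) into k clusters.

    Returns (centroids, clusters), where clusters[i] holds the points that
    were assigned to centroids[i] in the last assignment step.
    """
    # Assumption: naive initialisation with the first k points.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ended up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(clusters[0])
print(clusters[1])
```

Note that k has to be supplied by the caller, which is exactly the "number of clusters fixed in advance" property mentioned above.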
Association (Rule) Discovery: Producing dependency rules which predict the occurrence of a feature (i.e. attribute) of the data based on the occurrences of other features.
Pattern Discovery: Deducing patterns as a result of classification, clustering, association discovery etc.
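
Association rules are commonly scored with two standard measures that the list above does not name: support (the fraction of records containing all items of the rule) and confidence (how often the consequent appears in the records that contain the antecedent). The sketch below computes both for a single rule; the helper names and the shopping-basket data are hypothetical.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of all transactions that contain every item in items."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent_count = count_containing(transactions, antecedent)
    if antecedent_count == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / antecedent_count


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# Rule under test: {bread} -> {milk}
print(support(transactions, {"bread", "milk"}))       # 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.75
```

A rule discovery algorithm would search over many candidate rules and keep only those whose support and confidence exceed chosen thresholds; this sketch only evaluates one rule that is given up front.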

Postprocessing: Evaluating and selecting the interesting patterns, then interpreting and visualizing them in an informative report.