This article introduces the aim of data mining and explains basic concepts and terms.
Data Mining (also known as knowledge discovery from data): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from large amounts of data.
Data Warehouse : A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin] Data warehouses are a common source of data for data mining.
Potential Uses : Web information mining, spam filtering, medical data mining, weather data mining, sales and marketing strategy etc.
Data Mining Related Operations
Handling Noisy Data : Handling missing, duplicate or erroneous data before data mining. Noisy data can be removed, or corrected by a specific approach (e.g. correlation analysis).
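As a rough illustration of this step, the sketch below removes exact duplicate records and fills missing numeric values with the mean of the known values. The function name `clean_records`, the field names and the sample records are all made up for this example, and mean imputation is only one of several possible correction strategies.

```python
from statistics import mean

def clean_records(records, numeric_field):
    """Drop exact duplicates, then fill missing values of numeric_field with the mean."""
    seen = set()
    unique = []
    for record in records:
        # A record is a duplicate if all of its (field, value) pairs were seen before.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))

    known = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    if known:  # nothing to impute from if every value is missing
        fill_value = mean(known)
        for r in unique:
            if r[numeric_field] is None:
                r[numeric_field] = fill_value
    return unique

# Hypothetical sample data: one duplicate row and one missing age.
records = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 1, "age": 30},
    {"id": 3, "age": 40},
]
cleaned = clean_records(records, "age")
print(cleaned)
```

Here the duplicate of record 1 is dropped and the missing age of record 2 is replaced by the mean of 30 and 40.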
Integration : Combining data from multiple sources.
Normalization : Scaling data to a specified range. For example, scaling the value 750 from the range [500, 1000] to the range [0, 1] gives 0.5.
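The scaling in this example is usually called min-max normalization. A minimal sketch, assuming a linear mapping between the two ranges (the function name `min_max_scale` is chosen here for illustration):

```python
def min_max_scale(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    # Position of value inside the old range, as a fraction between 0 and 1.
    fraction = (value - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)

scaled = min_max_scale(750, 500, 1000)
print(scaled)  # 0.5
```

The value 750 lies halfway between 500 and 1000, so it maps to the midpoint of the target range.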
Feature Selection : Selecting only useful features (i.e. attributes for record data) of data.
Classification: Finding a model that predicts the value of a class attribute of the data from the values of the other attributes. (An example class attribute: CustomerBuysProduct (bool))
Different methods can be used for classification:
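- Decision Trees: Builds a decision tree from the existing data as the model and classifies new data by following the tree from the root to a leaf.
- Rule-Based Classification: Derives if-then rules from the data (e.g. if X = Y and Z < T, then the result is W).
- Bayes Classification: Uses prior probabilities estimated from existing data, combined through Bayes' theorem, to classify.
- K-Nearest Neighbor Classification: Classifies new data according to the classes of the k existing data points closest to it.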
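To make one of these methods concrete, here is a minimal k-nearest neighbor sketch that predicts the CustomerBuysProduct class attribute. It assumes Euclidean distance and a simple majority vote; the features (age, income in thousands) and the training records are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k=3):
    """training is a list of (features, label) pairs; returns the majority label
    among the k training points closest to query."""
    neighbours = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, income in thousands) -> CustomerBuysProduct
training = [
    ((25, 30), False),
    ((30, 35), False),
    ((45, 80), True),
    ((50, 90), True),
    ((48, 85), True),
]

print(knn_classify(training, (47, 82)))  # True
print(knn_classify(training, (27, 32)))  # False
```

In practice the features would normally be normalized first, since otherwise the attribute with the largest numeric range dominates the distance.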
Clustering: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Different methods can be used for clustering:
- K-means Clustering: Partitions data into a number of clusters (k) chosen in advance, assigning each object to the cluster whose center (mean) is nearest.
- Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree.
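The k-means idea can be sketched in a few lines. This version assumes Euclidean distance and, to stay deterministic, uses the first k points as the initial centers; real implementations usually choose the initial centers randomly or with a smarter heuristic. The sample points are made up.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Return (centroids, clusters) after at most `iterations` refinement rounds."""
    centroids = list(points[:k])  # assumption: first k points as initial centers
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

With these points the two resulting clusters are the three points near (1, 1) and the three points near (8, 8).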
Association (Rule) Discovery: Producing dependency rules which will predict occurrence of a feature (i.e. attribute) of data based on occurrences of other features.
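A dependency rule of this kind is commonly judged by its support and confidence, two standard measures that the text above does not name explicitly. The sketch below computes both for a single rule over a made-up set of shopping transactions; the function names and the data are illustrative only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Rule: {diapers} -> {beer}
print(support(transactions, {"diapers", "beer"}))       # 0.6
print(confidence(transactions, {"diapers"}, {"beer"}))  # 0.75
```

Three of the five transactions contain both items, and three of the four transactions with diapers also contain beer.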
Pattern Discovery: Deducing patterns as a result of classification, clustering, association discovery etc.
Postprocessing: Evaluating and selecting interesting patterns, interpreting and visualizing them as an information report.