Introduction and definition
Data mining
Data mining is a field of computer science concerned with extracting knowledge from large databases and using it for purposes such as planning a firm’s future activities or predicting the behavior of samples in a database. Despite what the name suggests, it does not mean digging out the data itself. It is an interdisciplinary field that uses both automatic and semi-automatic methods to identify unusual occurrences, known occurrences and predicted occurrences, and to use that information in statistics.
Contents:
- Introduction and definition
- Processes involved in data mining
- Applications and career opportunities in data mining
- Data mining tools
According to Zbingniew R. Struzik, replying to a question asked by Sudhakar Stigh at researchgate.net, data mining and statistics are two different things, although statistical methods are used in certain data mining approaches. Data mining is sometimes treated as synonymous with phrases such as data archaeology, knowledge extraction, and data fishing. In statistics, you already know something about the data, whereas data mining involves discovering the hidden knowledge in the data.
Some others believe that data mining
PROCESSES INVOLVED IN DATA MINING
Data mining involves the application of a series of processes which include the following:
1. Selection: This is the process in which data relevant to the analysis task are retrieved from the large database and reduced to a size that is manageable for processing and analysis. It also involves simplifying models to make them easier to interpret. This step is done automatically by algorithms built into the data mining tools discussed later.
2. Pre-processing: This is the process of preparing raw data for later processing. It involves sampling, denoising, and the removal of duplicate data and unreliable information. Sampling is done, for example, to measure a classifier’s performance or to obtain a better balance between class distributions.
3. Transformation
4. Data mining
5. Interpretation
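The first three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular data mining tool implements them: the records, the field names, and the helper functions `select`, `preprocess`, `sample` and `transform` are all hypothetical, and min-max scaling is just one common choice of transformation.

```python
import random

# Hypothetical raw records, invented for illustration only.
RAW_RECORDS = [
    {"id": 1, "region": "north", "age": 34, "spend": 120.0},
    {"id": 1, "region": "north", "age": 34, "spend": 120.0},   # exact duplicate
    {"id": 2, "region": "south", "age": 51, "spend": 80.0},
    {"id": 3, "region": "north", "age": None, "spend": 45.5},  # missing age
    {"id": 4, "region": "north", "age": 29, "spend": 300.0},
    {"id": 5, "region": "north", "age": 42, "spend": 150.0},
]


def select(records, region, fields):
    """Selection: keep only the rows and fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records if r["region"] == region]


def preprocess(records):
    """Pre-processing: drop exact duplicates and rows with missing values."""
    seen = set()
    cleaned = []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen or any(v is None for v in r.values()):
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned


def sample(records, k, seed=0):
    """Sampling: draw up to k records at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def transform(records, field):
    """Transformation (assumed here to be min-max scaling into [0, 1])."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]


selected = select(RAW_RECORDS, "north", ["id", "age", "spend"])
cleaned = preprocess(selected)
scaled = transform(cleaned, "spend")
print(len(selected), len(cleaned))
print(scaled)
```

Here selection narrows the data to one region and three fields, pre-processing removes the duplicate and the record with a missing age, and the transformation rescales `spend` so that later steps see values on a common scale.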
APPLICATIONS AND CAREER OPPORTUNITIES IN DATA MINING
Data mining has diverse applications in different fields of human activity. Such applications include:
1. In the telecommunications industry: data mining helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
2. In the health care sector: data mining is used to evaluate health trends, from which new products, diets or vaccines can be developed. It can also be used to find out the success rates and side effects of a medicine on the market. This involves sifting through large amounts of data from multiple patients and across multiple aspects.
3. In the sporting world: data mining can be used to evaluate players by examining their field statistics and training practices, in order to find the best ways to develop better players.
4. In advertising: a large consumer packaged goods company can apply data mining to improve its sales process to retailers and consumers. Data collected from consumer panels, shipments, etc. can be used to determine changes in demand and which advertising approaches work for the company, from which it can select the strategies that best meet its customers’ demands and target potential markets.
5. Fraud detection
6. By internet service providers and search engines: have you ever wondered how search engines like Google come up with the most relevant search results before you finish typing a search query? They use data mining to determine them.
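As a rough illustration of the search engine example, the sketch below ranks completions for a partially typed query by how often each query appeared in a log of past searches. This is only a toy under stated assumptions: the query log and the `suggest` function are hypothetical, and real search engines rely on far more elaborate methods than simple frequency counts.

```python
from collections import Counter

# Hypothetical query log, invented for illustration only.
QUERY_LOG = [
    "data mining tools",
    "data mining tools",
    "data mining definition",
    "data science jobs",
    "weather today",
    "data mining tools",
    "data mining definition",
]


def suggest(prefix, log, limit=3):
    """Return the most frequent past queries that start with the prefix."""
    prefix = prefix.lower()
    counts = Counter(q for q in log if q.startswith(prefix))
    return [query for query, _ in counts.most_common(limit)]


print(suggest("data m", QUERY_LOG))
```

The pattern being mined here is simply which queries were most common in the past; the same idea of learning from historical behavior underlies the other applications listed above.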
DATA MINING TOOLS
Over the years, many methods have been used to obtain knowledge from large databases. Such methods include Bayes’ theorem, regression analysis, rough set theory, the HACE theorem, etc. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
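To make the first of these methods concrete, here is a small worked example of Bayes’ theorem applied to a fraud detection question: given that a transaction was flagged, how likely is it to be fraudulent? The function name `posterior` and all of the probabilities are assumptions made up for illustration, not figures from any real system.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) by Bayes' theorem for a binary hypothesis H and evidence E.

    prior               -- P(H), e.g. share of transactions that are fraudulent
    likelihood          -- P(E | H), e.g. chance a fraudulent one is flagged
    false_positive_rate -- P(E | not H), chance a legitimate one is flagged
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical rates: 1% fraud, 90% of fraud flagged, 5% of legitimate flagged.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 4))
```

With these made-up rates, a flagged transaction is fraudulent only about 15% of the time, because legitimate transactions vastly outnumber fraudulent ones. This is the kind of result that can lie outside an expert’s expectations.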
1. Orange: Orange is a component-based data mining and machine learning software suite written in the Python language. It is an open source data visualization and analysis tool for novices and experts. Data mining can be done through visual programming or Python scripting. It has components for machine learning, and there are add-ons for bioinformatics and text mining. It is also packed with features for data analytics and offers different visualizations, from scatterplots, bar charts and trees to dendrograms, networks and heatmaps. Orange remembers the user’s choices, suggests the most frequently used combinations, and intelligently chooses which communication channels between widgets to use.
2. OpenNN: OpenNN is an open source class library written in C++ which implements neural networks. The library is intended for advanced users with strong C++ and machine learning skills. OpenNN provides an effective framework for the research and development of data mining and predictive analytics algorithms and applications.
3. Weka: Weka (Waikato Environment for Knowledge Analysis) is a suite of machine learning software written in the Java programming language. It is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query. It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing with Weka.
4. Rattle GUI: Rattle GUI is free and open source software that provides a graphical user interface (GUI) for data mining using the R statistical programming language. Rattle provides considerable data mining functionality by exposing the power of the R statistical software through a graphical user interface.
5. ADaMSoft: ADaMSoft is a free and open source data mining software developed in Java. It contains data management methods and can create ready-to-use reports. It can read data from several sources and write the results in different formats.
6. Apache Mahout: Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform.
7. RapidMiner: RapidMiner provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. RapidMiner is used for business, industrial applications, research, education, training, rapid prototyping, and application development and has more than 600 enterprise customers and more than 250,000 active users.
Other tools used for data mining include:
8. Databionic ESOM Tools
9. NLTK (Natural Language Toolkit)
10. SenticNet API
11. ELKI
12. UIMA
13. KNIME
14. Chemicalize.org
15. Vowpal Wabbit
16. GraphLab
17. GNU Octave