Saturday, 31 January 2015

Datamining: Reasoning (Part I)

Introduction

According to Wikipedia, datamining is: "Searching for statistical patterns in (large) datasets for purposes like scientific, commercial usage, etc". So datamining suits discovery-driven strategies (in contrast with business-driven discovery) that aim to uncover relations, correlations or perhaps categorizations. Datamining is about induction: generalizing from a certain number of observations. Induction is the opposite of deduction, which applies a general rule to particular observations.

Deductive reasoning

In contrast to inductive reasoning, deductive reasoning is concerned with conclusions that follow with certainty from their premises, whereas inductive reasoning refers to situations where conclusions follow only probabilistically from their premises. There are a couple of rules of inference, according to a book from my good old study days at university: "Cognitive Psychology and Its Implications" by John R. Anderson. I can recommend this book, by the way (if it's still sold).

Conditional statement (modus ponens)
A conditional statement is an assertion such as "if you read this blogpost, you will be wiser". The if part is called the antecedent and the then part is called the consequent. The rule of inference that goes with it is called modus ponens: given "if A, then B" and the proposition A, we can infer B. Suppose the following:

1. If you read this blogpost, you will be wiser.
2. You read this post.

From premises 1 and 2 we can infer 3 by modus ponens.

3. You are wiser!
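
The inference above can be sketched in a few lines of Python. This is only an illustration: the function name modus_ponens and the text labels for the propositions are made up for this example and do not come from Anderson's book.

```python
def modus_ponens(rule, facts):
    """Apply one rule (antecedent, consequent) to a set of known facts.

    If the antecedent is among the facts, the consequent is added.
    Otherwise the facts are returned unchanged.
    """
    antecedent, consequent = rule
    if antecedent in facts:
        return facts | {consequent}
    return set(facts)


# Premise 1: if you read this blogpost, you will be wiser.
rule = ("you read this blogpost", "you are wiser")
# Premise 2: you read this post.
facts = {"you read this blogpost"}

# The result now also contains the conclusion "you are wiser".
print(modus_ponens(rule, facts))
```

Without premise 2 (an empty set of facts) the function infers nothing, which is the point of the rule: the consequent only follows once the antecedent is given.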

Modus tollens
Another rule of inference is modus tollens. This rule states that if we are given the proposition "A implies B" and the fact that B is false, then we can infer that A is false. An example:

1. If you understand this blogpost, you understand reasoning with datamining.
2. You don't understand reasoning with datamining.

It follows from modus tollens that:

3. You didn't understand this blogpost.
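
One way to convince yourself that modus tollens is a valid rule is to check every combination of true and false for A and B. The sketch below does that with a small truth table. The helper names implies and is_valid are my own choice for this illustration, and "if A then B" is read as material implication, which is an assumption of the sketch.

```python
from itertools import product


def implies(a, b):
    """Material implication: 'if a then b' is false only when a is true and b is false."""
    return (not a) or b


def is_valid(premises, conclusion):
    """An argument form is valid if the conclusion holds in every case where all premises hold."""
    for a, b in product([True, False], repeat=2):
        if all(p(a, b) for p in premises) and not conclusion(a, b):
            return False  # found a counterexample
    return True


# Modus tollens: from "if A then B" and "not B", conclude "not A".
modus_tollens_valid = is_valid(
    premises=[lambda a, b: implies(a, b), lambda a, b: not b],
    conclusion=lambda a, b: not a,
)
print(modus_tollens_valid)  # True
```

The only case in which both premises hold is A false and B false, and there the conclusion "not A" holds as well, so no counterexample exists.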

Inductive reasoning

Induction is a way of reasoning in which a general rule is derived from a certain number of observations. We call this generalization. Inductive reasoning is the term used to describe the process by which one comes to conclusions that are probable rather than certain. This seems much more useful, because very little is certain and most things are, at the very best, very likely.

1. If you read this blogpost, you will be wiser.
2. You are wiser.

Then inductive reasoning suggests, as a probable rather than a certain conclusion:

3. You've read this blogpost! Great!

Bayes's theorem provides a way of assessing the plausibility of this reasoning. What is the probability that you are wiser because you read this blogpost? There is also a probability that you became wiser by reading something else, and that is important to know too.
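
To make this concrete, here is a small sketch of Bayes's theorem applied to the example: P(read | wiser) = P(wiser | read) * P(read) / P(wiser). All the probabilities in it are invented purely for illustration (they are not measurements of anything), and the variable names are my own.

```python
# Hypothetical numbers, chosen only to illustrate the calculation.
p_read = 0.10                  # prior: probability that someone read this blogpost
p_wiser_given_read = 0.80      # probability of being wiser after reading it
p_wiser_given_not_read = 0.20  # probability of being wiser for another reason

# Total probability of being wiser, whether or not the post was read.
p_wiser = (p_wiser_given_read * p_read
           + p_wiser_given_not_read * (1 - p_read))

# Bayes's theorem: probability that a wiser person actually read the post.
p_read_given_wiser = p_wiser_given_read * p_read / p_wiser

print(round(p_read_given_wiser, 3))  # 0.308
```

With these made-up numbers, the conclusion "you've read this blogpost" is true in only about 31% of the cases where someone is wiser, because most wiser people became wiser for another reason. The inductive conclusion is plausible, but far from certain.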

Conclusion

This blogpost is an introduction to reasoning in datamining. It contrasts deductive with inductive reasoning and discusses some issues with drawing inductive conclusions from data.