Linear Regression:
Here one dependent variable is described as a linear combination of one or more independent variables.
Assumptions:
There is one dependent variable and a number of independent variables
Process:
Excel will let you run linear regression given a bunch of columns of independent variables
The independent variables do not need to be normally distributed; the usual normality assumption applies to the residuals (the errors of the fit).
Output:
An equation whose coefficients show which independent variables are significant in explaining the variability of the dependent variable.
The result of the equation is a numeric value which can be used for forecasting or predicting
Sales = 0.8 * # of people entering the store
where sales is the dependent variable and
# of people entering the store is the independent variable
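A minimal sketch of how an equation like the one above can be fitted by ordinary least squares with a single independent variable. The visitor and sales figures are made up for illustration and chosen so that the slope comes out at 0.8; the name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: people entering the store and the resulting sales.
visitors = [100, 150, 200, 250, 300]
sales = [80, 120, 160, 200, 240]

intercept, slope = fit_line(visitors, sales)

# Use the fitted equation to forecast sales for a new visitor count.
predicted = intercept + slope * 400
print(f"Sales = {intercept:.2f} + {slope:.2f} * visitors")
print(f"Predicted sales for 400 visitors: {predicted:.1f}")
```

With several independent variables the same idea applies, but the coefficients are normally obtained from a spreadsheet or statistics package rather than by hand.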
Linear Discriminant Analysis
In order to discriminate between or separate out 2 or more groups of data and characterize the difference by a vector, we use Linear Discriminant Analysis.
Assumptions:
The independent variables need to be normally distributed within each group.
Process:
You can obtain the LDA equations using R given a set of data
Output:
The number of equations equals the number of groups/classes minus 1 (provided there are at least that many independent variables)
The result is used to find if a particular set of data belongs to group#1 or group#2 or group#3 or group#4 & so on
Eg. For 3 classes of flowers, there are 4 characteristics which distinguish one class of flower from another
The 4 characterizing features are sepal length, sepal width, petal length and petal width.
LD1 = 0.39 * sepal length + 2.067 * sepal width -2.27 * petal length -2.28 * petal width
LD2 = 0.56 * sepal length -2.45 * sepal width +0.41 * petal length -2.12 * petal width
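LD1 is the direction along which the group means are separated the most relative to the variance within the groups; LD2 is the direction, uncorrelated with LD1, that captures the most of the remaining separation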
If for a random flower we know the sepal length, sepal width, petal length and petal width, we can evaluate the two equations to find which class of flower it most likely belongs to
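A minimal sketch of evaluating the two discriminant equations for one flower. The coefficients are copied from the LD1 and LD2 equations above; the flower measurements and the name discriminant_score are hypothetical. The sketch stops at the scores, because turning them into a class requires the scores of each class mean, which these notes do not give.

```python
# Coefficients copied from the LD1 and LD2 equations in the notes.
LD1 = {"sepal_length": 0.39, "sepal_width": 2.067,
       "petal_length": -2.27, "petal_width": -2.28}
LD2 = {"sepal_length": 0.56, "sepal_width": -2.45,
       "petal_length": 0.41, "petal_width": -2.12}


def discriminant_score(coefficients, flower):
    """Weighted sum of the flower's features for one discriminant."""
    return sum(coefficients[name] * flower[name] for name in coefficients)


# Hypothetical measurements for a single flower (assumed to be in cm).
flower = {"sepal_length": 5.0, "sepal_width": 3.0,
          "petal_length": 1.5, "petal_width": 0.2}

ld1 = discriminant_score(LD1, flower)
ld2 = discriminant_score(LD2, flower)
print(f"LD1 = {ld1:.3f}, LD2 = {ld2:.3f}")
```

The pair (LD1, LD2) places the flower in a 2-dimensional space; the flower is then assigned to the class whose mean lies closest in that space.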
Disadvantage:
It needs the variance (covariance) within each group to be the same.
Logistic Regression
When the outcome is dichotomous (it has only two possible values), we use logistic regression instead of linear regression. Logistic regression also tells us which independent variables the dependent variable depends on.
Assumptions:
The inputs need not be normally distributed.
Output:
The output is an equation for the natural log of the odds (the logit) that a given state of the dependent variable occurs.
eg.
1 = female
0 = male
ln(P(1)/(1-P(1))) = 0.2 + 0.4*height
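A minimal sketch of turning the log-odds equation above into a probability by inverting the logit. The coefficients 0.2 and 0.4 come from the example; the height values and the name probability_of_one are hypothetical, and the notes do not say which unit height is measured in.

```python
import math


def probability_of_one(height):
    """P(1) implied by ln(P(1)/(1-P(1))) = 0.2 + 0.4*height."""
    log_odds = 0.2 + 0.4 * height
    # Inverse of the logit: p = 1 / (1 + exp(-log_odds)).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical height values, in whatever unit the model was fitted on.
for height in (-2.0, -0.5, 0.0, 2.0):
    p = probability_of_one(height)
    print(f"height={height:+.1f}  P(1)={p:.3f}")
```

A common convention is to predict state 1 when P(1) exceeds 0.5, which happens exactly when the log-odds are positive.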
Disadvantage:
It needs a large sample size.
ANOVA
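Tests whether the means of the groups differ enough for the groups to be treated as statistically different.
The outcome of ANOVA is also effectively a true or false answer to whether the data columns (groups) are statistically different, that is, whether they can be separated.
F value = MS(between) / MS(within), where MS(between) = SS(between)/(k - 1) and MS(within) = SS(within)/(N - k) for k groups and N observations in total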
Process:
You can run it in Excel
Output:
The output is an F value and the probability (p-value) of that F value. If p < 0.05, the null hypothesis (that all group means are equal) is rejected and the groups are considered statistically different
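A minimal sketch of the F value for a one-way ANOVA, computed as the between-group mean square divided by the within-group mean square (each sum of squares divided by its degrees of freedom). The three data columns and the name one_way_anova_f are hypothetical. The p-value is not computed here because it requires the F distribution, which a spreadsheet or statistics package would supply.

```python
def one_way_anova_f(groups):
    """F value for a one-way ANOVA over a list of groups (data columns)."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical data: three columns of measurements.
columns = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(columns)
print(f"F = {f_value:.2f} with {len(columns) - 1} and "
      f"{sum(len(c) for c in columns) - len(columns)} degrees of freedom")
```

The larger the F value, the more the group means differ relative to the scatter inside each group.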