Important Basic Concepts: Statistics for Big Data

Graphical Exploratory Data Analysis (EDA) Methods
EDA is about exploring the data to judge whether it is suitable for the experiment or study. Raw data rarely reveals its patterns or fitness on its own, while graphs and plots make them easy to see.

Common graphical methods include:
1. Scatter Plots
2. Histograms
3. Box Plots
4. Normal Probability plots
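As a rough illustration of what a histogram and a box plot summarize, the sketch below counts values per bin and computes the five numbers a box plot draws. The sample ages and the helper names are made up for this example, and it uses only the Python standard library rather than a plotting package.

```python
import statistics

# Hypothetical sample of ages, used only for illustration.
ages = [23, 25, 31, 35, 35, 38, 41, 44, 47, 52, 58, 90]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def bin_counts(values, bin_width):
    """Count values per bin and return {bin_start: count}, as a histogram does."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

summary = five_number_summary(ages)
histogram = bin_counts(ages, bin_width=10)

# Print a simple text histogram, one row per bin.
for start, count in histogram.items():
    print(f"{start:>3}+ {'#' * count}")
print("five-number summary:", summary)
```

The isolated value 90 shows up as a lone bar far from the rest, which is the kind of pattern that is hard to spot in the raw list.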

Quantitative Exploratory Data Analysis Techniques:

1. Interval Estimation (Ranges)
2. Hypothesis testing (Null Hypothesis, Alternate Hypothesis)

1. Interval Estimation (Ranges): Construct a range of values within which a population parameter is likely to fall. A confidence interval for the mean is one example: a range that is expected to contain the true mean at a stated confidence level (e.g. 95%).
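A minimal sketch of a confidence interval for the mean, assuming a normal approximation (for small samples a t quantile is usually preferred). The sample values and the function name are hypothetical.

```python
import math
import statistics

# Hypothetical sample of ages, used only for illustration.
sample = [48, 51, 55, 49, 60, 53, 47, 58, 52, 57]

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    # z is the standard normal quantile, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval is centred on the sample mean and widens when the data are more spread out or the sample is smaller.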

2. Hypothesis testing: Test a proposition about a population using sample data.

Example: Test the claim that the mean age of the Canadian population is 53.

It's a multi-step process. The steps are:

1. Null Hypothesis: The claim under test, assumed to be true unless the data give strong evidence against it (e.g. the mean age is 53)
2. Alternate Hypothesis: The hypothesis that will be accepted if the null hypothesis is rejected (e.g. the mean age is not 53)
3. Significance Level: The accepted probability of rejecting the null hypothesis when it is actually true (e.g. 0.05, which corresponds to a 95% confidence level)
4. Test Statistic: A numerical measure, computed from the sample, of how consistent the data are with the null hypothesis
5. Critical Value: The cutoff set by the significance level. If the test statistic is more extreme than the critical value, the null hypothesis is rejected
6. Decision: Reject or fail to reject the null hypothesis by comparing the test statistic with the critical value
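The steps above can be sketched as a two-sided one-sample z-test for the mean-age example. This is only an illustration: the sample of 40 ages is invented, the function name is hypothetical, and the z-test assumes the sample is large enough for a normal approximation.

```python
import math
import statistics

def one_sample_z_test(values, hypothesized_mean, alpha=0.05):
    """Two-sided z-test of H0: population mean == hypothesized_mean."""
    n = len(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    # Test statistic: distance of the sample mean from H0, in standard errors.
    test_statistic = (statistics.mean(values) - hypothesized_mean) / std_err
    # Critical value for the chosen significance level (about 1.96 for 0.05).
    critical_value = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    # Decision: reject H0 when the statistic is more extreme than the cutoff.
    reject_null = abs(test_statistic) > critical_value
    return test_statistic, critical_value, reject_null

# Hypothetical sample of 40 ages, made up for illustration only.
ages = [41, 38, 45, 52, 36, 40, 47, 39, 44, 50] * 4

stat, crit, reject = one_sample_z_test(ages, hypothesized_mean=53)
print(f"test statistic = {stat:.2f}, critical value = {crit:.2f}")
print("reject H0" if reject else "fail to reject H0")
```

With this made-up sample the mean is far below 53, so the test statistic falls well beyond the critical value and the null hypothesis is rejected.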

Some Basic Probability Distributions:

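Binomial Distribution: Describes the number of successes in a fixed number of independent trials, when each trial can have only one of two values (e.g. success or failure).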
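A short sketch of the binomial probability formula, using a fair coin as an assumed example (the function name is illustrative):

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 flips of a fair coin.
prob = binomial_pmf(3, n=10, p=0.5)
print(f"P(3 heads in 10 flips) = {prob:.4f}")
```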

Poisson Distribution: Describes the likelihood of a given number of events occurring during a fixed time interval (e.g. the number of customers visiting your shop in an hour).
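For the shop example, a sketch of the Poisson probability formula might look like this. The average rate of 4 customers per hour is an assumption made only for illustration.

```python
import math

def poisson_pmf(k, rate):
    """P(exactly k events in an interval) when events occur at average `rate`."""
    return rate**k * math.exp(-rate) / math.factorial(k)

# Assumed average of 4 customers per hour (hypothetical figure).
prob_six = poisson_pmf(6, rate=4)
print(f"P(6 customers in an hour) = {prob_six:.4f}")
```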

Normal Distribution: Symmetrical, bell-shaped data. The probability that a variable falls a given distance below the mean equals the probability that it falls the same distance above the mean.
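The symmetry can be checked numerically with the standard library. The mean of 53 and standard deviation of 10 below are assumed values chosen for the example.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 53, standard deviation 10.
ages = NormalDist(mu=53, sigma=10)
distance = 15

prob_below = ages.cdf(ages.mean - distance)      # P(X < 38)
prob_above = 1 - ages.cdf(ages.mean + distance)  # P(X > 68)

print(f"P(X < 38) = {prob_below:.4f}")
print(f"P(X > 68) = {prob_above:.4f}")
```

Both probabilities come out the same, which is what symmetry about the mean implies.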

t distribution: Similar to the normal distribution but with heavier tails, so extremely large or extremely small values are more likely than under the normal distribution. Useful when the sample size is small and the population variance (standard deviation) is unknown.
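To see the heavier tails, the sketch below evaluates the t density from its standard formula and compares it with the normal density at a point far from the mean. The function name and the chosen degrees of freedom are illustrative.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

x = 3.0  # a point three units away from the centre
normal_density = NormalDist().pdf(x)
print(f"normal density at {x}: {normal_density:.5f}")
for df in (2, 5, 30):
    print(f"t density at {x} (df={df}): {t_pdf(x, df):.5f}")
```

The t density at this point is larger than the normal density, and the gap shrinks as the degrees of freedom (roughly, the sample size) grow.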

Chi-Square Test: A test of whether a population follows a particular distribution, such as the normal distribution (a goodness-of-fit test).
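A minimal goodness-of-fit sketch. To keep it short it tests a fair six-sided die (a uniform distribution) rather than a normal distribution, and the observed counts are invented. The critical value 11.070 is the standard table value for 5 degrees of freedom at the 0.05 significance level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 120 rolls of a die; a fair die expects 20 per face.
observed = [25, 17, 15, 23, 24, 16]
expected = [20] * 6

stat = chi_square_statistic(observed, expected)

# Table value for df = 6 - 1 = 5 at significance level 0.05.
CRITICAL_VALUE_DF5_ALPHA05 = 11.070
reject_null = stat > CRITICAL_VALUE_DF5_ALPHA05

print(f"chi-square statistic = {stat:.2f}")
print("reject H0" if reject_null else "fail to reject H0")
```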

The F distribution: Used to test whether two datasets could come from the same population by comparing their variances. The test statistic is the ratio of the two sample variances.
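A sketch of the F statistic as a ratio of sample variances. The two samples are hypothetical, and the resulting value would still need to be compared with a critical value from an F table for the corresponding degrees of freedom (sample size minus one for each sample).

```python
import statistics

def f_statistic(sample_a, sample_b):
    """Ratio of the two sample variances, with the larger variance on top."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return max(var_a, var_b) / min(var_a, var_b)

# Hypothetical measurements from two datasets.
sample_a = [10.0, 12.0, 14.0, 16.0, 18.0]  # sample variance 10.0
sample_b = [13.0, 14.0, 14.0, 15.0, 14.0]  # sample variance 0.5

f_value = f_statistic(sample_a, sample_b)
print(f"F statistic = {f_value:.1f}")
```

A ratio close to 1 is consistent with equal variances, while a large ratio suggests the datasets differ in spread.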

Related Concepts:

What is a Z Score?
A measure of how far a value is from the mean, expressed in standard deviations: z = (value - mean) / standard deviation.
It helps to compare two values that come from two different normal distributions.

For a normal distribution, a z score can also be converted into the probability of observing a value at or below that score.
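As a small example, z scores let two values from different normal distributions be compared on one scale. The exam scores, means, and standard deviations below are made up for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std_dev

# Hypothetical exam results reported on two different scales.
math_z = z_score(85, mean=70, std_dev=10)
physics_z = z_score(60, mean=50, std_dev=5)

print(f"math z score = {math_z}, physics z score = {physics_z}")
```

Although 85 is the larger raw score, the physics score sits further above its own mean once the spread of each distribution is taken into account.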

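Chi-Square Test for Normal Distribution:
Null Hypothesis: The data follow a normal distribution. If the null hypothesis is not rejected, the data are consistent with a normal distribution. (In the related chi-square test of independence, the null hypothesis is that no relation exists between the categorical variables, i.e. they are independent.)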

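What is the p value in a Chi-Square test?
The p value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. It helps to judge the significance of the result: a small p value (typically below the significance level, e.g. 0.05) means strong evidence against the null hypothesis.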
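A sketch of a chi-square p value for the special case of 2 degrees of freedom, where the upper-tail probability has the simple closed form exp(-x / 2). For other degrees of freedom a statistics library would normally be used instead. The observed counts and function names are hypothetical.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-statistic / 2)

# Hypothetical counts over 3 categories, so df = 3 - 1 = 2.
observed = [45, 30, 15]
expected = [30, 30, 30]

stat = chi_square_statistic(observed, expected)
p_value = chi_square_p_value_df2(stat)

print(f"chi-square statistic = {stat:.1f}, p value = {p_value:.6f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Here the p value is far below 0.05, so data this far from the expected counts would be very unlikely if the null hypothesis were true.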

Reference: Anderson A., Semmelroth D., Statistics for Big Data

Sayed Ahmed

Linkedin: https://ca.linkedin.com/in/sayedjustetc

Blog: http://sitestree.com, http://bangla.salearningschool.com