Examples: Experiment Design

Experiment 1:

Forecast which nations will have the most suicides.

Data:

Output variables:

Method/Algorithm for this experiment

Experiment 2:

Determine how GDP and population size are associated with suicide rates.

Data:

Output variables:

Method/Algorithm for this experiment

Experiment 3:

Predict which age groups are most at risk of suicide.

Data:

Output variables:

Method/Algorithm for this experiment

Data Analytics, Machine Learning, Data Science

Tools and Tutorials for Data Manipulation

Join Data from Multiple Sources

Power BI

Python

SQL
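
The idea of joining data from multiple sources can be sketched with Python's built-in `sqlite3` module, which lets one short example cover both the Python and the SQL route. This is a minimal sketch, not a prescribed workflow: the `customers` and `orders` tables, their columns, and the rows are hypothetical stand-ins for two separate sources.

```python
import sqlite3


def join_sources():
    """Join two small tables on a shared key and return the combined rows."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # Hypothetical source 1: customer master data.
    cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
    # Hypothetical source 2: order transactions.
    cur.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    cur.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ana"), (2, "Ben"), (3, "Cy")],
    )
    cur.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0)],
    )

    # LEFT JOIN keeps customers that have no orders; their total is NULL (None).
    cur.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.customer_id
        """
    )
    rows = cur.fetchall()
    conn.close()
    return rows


rows = join_sources()
print(rows)  # [('Ana', 30.0), ('Ben', 40.0), ('Cy', None)]
```

A `LEFT JOIN` is used here so that unmatched records stay visible; an `INNER JOIN` would silently drop the customer without orders.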

•Databases and Data Warehouse

https://durhamcollege.desire2learn.com/d2l/le/content/467097/viewContent/6376898/View

•Data Modeling and SQL

https://durhamcollege.desire2learn.com/d2l/le/content/467097/viewContent/6376900/View

•Microsoft Power BI

https://durhamcollege.desire2learn.com/d2l/le/content/467097/viewContent/6377023/View

Tutorials and Examples

•MySQL Data Manipulation:

https://www.databasejournal.com/mysql/mysql-data-manipulation-and-query-statements/

https://www.w3schools.com/sql/

https://www.tutorialspoint.com/sql/index.htm

•Workbench: https://www.tutorialspoint.com/create-a-new-database-with-mysql-workbench

•SQL Server Data Manipulation

https://www.tutorialspoint.com/ms_sql_server/index.htm

•Management Studio:

https://www.tutorialspoint.com/ms_sql_server/ms_sql_server_management_studio.htm

•Power BI Data Manipulation

https://learn.microsoft.com/en-us/power-bi/connect-data/desktop-tutorial-importing-and-analyzing-data-from-a-web-page

Data Manipulation in Python

https://www.analyticsvidhya.com/blog/2021/06/data-manipulation-using-pandas-essential-functionalities-of-pandas-you-need-to-know/

Threats to Validity for Your Data Analytics Projects

•Internal

•External

•Construct

•Statistical Conclusion

Internal: An informative variable is missing. Bring in data from other sources.

External: A fixation variable makes the result look perfect. The model may not generalize.

Construct: Class imbalance distorts the outcome.

Statistical Conclusion: Depending on the statistical measure used, the conclusion can be incorrect.

•Data Mining: Association: Support, Confidence, and Lift

Internal Validity

Is your experiment (and Model) Internally Valid?

What is the threat that the experiment (model and outcome) is internally invalid?

Example: The reasons for inferring that the relationship between two variables is causal are incorrect.

Cause: Lack of informative variables

Solution: Bring data from other sources

External Validity

Is your experiment (and Model) Externally Valid?

What is the threat that the experiment (model and outcome) is externally invalid?

“Study results may not apply to other groups.”

Cause: Fixation Variable

Solution: Exclude the fixation variable from the study

Ref: https://en.wikipedia.org/wiki/External_validity

Construct Validity

Is your experiment (and Model) Valid by Construction?

What is the threat that the experiment (model and outcome) is invalid by construction?

Example: In classification, if the data is imbalanced, the estimated effect of the variables on the outcome can be invalid.

Cause: Construction/balance problem

Solution: Treat Data for Imbalance
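
One simple way to treat imbalance is random oversampling: duplicate minority-class rows until every class is as large as the biggest one. The sketch below is only one option among several (undersampling and class weights are others), and the function name, the toy rows, and the labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Group the rows by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: four "neg" rows and a single "pos" row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["neg", "neg", "neg", "neg", "pos"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # both classes now have 4 rows
```

Oversampling is normally applied to the training split only; duplicating rows before the train/test split would leak copies of the same row into the test set.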

Statistical Conclusion Validity

Is your conclusion (from the experiment and the model) statistically valid, even when it was reached through statistical analysis?

What is the threat that the conclusion (from the experiment and the model) is invalid?

Example: In association rule mining, you considered only whether items are associated. That alone does not give the full picture.

Solution: Include Support, Confidence, and Lift

Ref: https://www.analyticsvidhya.com/
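
Support, confidence, and lift for a rule A → B can be computed directly from transaction counts. The sketch below uses the standard definitions: support = P(A and B), confidence = P(B | A), lift = confidence / P(B). The basket data and the function name are hypothetical.

```python
def association_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)

    count_a = sum(1 for t in transactions if a <= set(t))        # baskets with A
    count_c = sum(1 for t in transactions if c <= set(t))        # baskets with B
    count_both = sum(1 for t in transactions if (a | c) <= set(t))  # baskets with A and B

    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    support = count_both / n
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence, lift = association_metrics(transactions, {"bread"}, {"milk"})
print(round(support, 4), round(confidence, 4), round(lift, 4))  # 0.6 0.75 0.9375
```

In this toy data the rule has high confidence (0.75) but a lift below 1, meaning bread buyers are slightly less likely than average to buy milk. Confidence alone would have suggested the opposite, which is why all three measures are reported together.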


McNemar’s Test

Chi-Square


Chi-square: “A chi-square test is used to help determine if observed results are in line with expected results, and to rule out that observations are due to chance.” Example: a coin flip [1].

References:
1. https://www.investopedia.com/terms/c/chi-square-statistic.asp
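
The coin-flip example can be worked through as a chi-square goodness-of-fit test, and McNemar's test is a chi-square test with one degree of freedom applied to the discordant pairs of two paired classifiers. The sketch below assumes hypothetical counts (60 heads in 100 flips; 15 and 5 discordant predictions) and uses the McNemar statistic without continuity correction.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def mcnemar_statistic(b, c):
    """McNemar statistic (no continuity correction).

    b: cases model A got right and model B got wrong
    c: cases model A got wrong and model B got right
    """
    return (b - c) ** 2 / (b + c)


# Coin flip: 100 flips, hypothetical result of 60 heads and 40 tails.
coin_stat = chi_square_statistic([60, 40], [50, 50])
coin_p = p_value_df1(coin_stat)
print(coin_stat, round(coin_p, 4))  # 4.0 0.0455

# McNemar: hypothetical discordant counts for two classifiers.
mc_stat = mcnemar_statistic(15, 5)
mc_p = p_value_df1(mc_stat)
print(mc_stat, round(mc_p, 4))
```

With two categories the coin-flip test has one degree of freedom, so the same `p_value_df1` helper serves both tests. A p-value below 0.05 would lead to rejecting the null hypothesis of a fair coin at the 5% level.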


Statistics for Data Analytics and Machine Learning Projects

Null Hypothesis

[2]

Paired t-test

Unpaired t-test

•Pearson Correlation

One-Way Analysis of Variance

Spearman Correlation

•Kendall’s Tau Coefficient

Wilcoxon Rank-Sum Test

Basic EDA

•McNemar’s test

•Friedman test

•Kruskal-Wallis Test

Two-Way Analysis of Variance

•K-Fold Cross Validation paired t-test

•Wilcoxon Signed Rank Test
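
As one concrete instance from the list above, the k-fold cross-validation paired t-test compares two models on their per-fold scores: take the per-fold differences and compute t = mean(d) / (sd(d) / sqrt(k)) with k - 1 degrees of freedom. The fold accuracies below are hypothetical, and the sketch stops at the statistic; the p-value would come from a t-table or a statistics library.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies of two models on the same 5 folds.
model_a = [0.80, 0.82, 0.78, 0.85, 0.81]
model_b = [0.76, 0.79, 0.77, 0.80, 0.78]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(round(t_stat, 2), dof)
```

The null hypothesis is that the mean difference between the two models is zero. Because the training sets of different folds overlap, the per-fold scores are not independent, so the result is best read as approximate.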


Make Sense of your Data: For Data Analytics Project

Hypothesis-based versus data-driven analysis

“Only those data analysts who are given time to explore and analyze data thoughtfully and thoroughly are consistently successful.”

Data Identification and Prioritization

Use augmented data in addition to the data pipeline

Analytics Sandbox


Characterizing the Data—Exploring a Single Variable

Data: Descriptive analysis options

Find: Distribution of quantitative variables

Reference: [1] Gregory S. Nelson. The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability. John Wiley & Sons, 2018.
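
Exploring a single quantitative variable usually starts with summary statistics and a look at the distribution. The sketch below uses only Python's `statistics` module; the helper names and the sample values are hypothetical, and the equal-width binning is one simple choice among many.

```python
import statistics


def describe(values):
    """Basic descriptive statistics for one quantitative variable."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
    }


def histogram_counts(values, bins=5):
    """Count values in equal-width bins; the maximum falls into the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample with one large value in the right tail.
values = [2, 3, 3, 4, 5, 5, 5, 6, 8, 12]

summary = describe(values)
counts = histogram_counts(values)
print(summary["mean"], summary["median"])  # 5.3 5.0
print(counts)                              # [3, 4, 1, 1, 1]
```

A mean above the median together with a thin right-hand tail in the bin counts points to a right-skewed distribution.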


Factors/Variables to Consider For Experimental Design for Data Analytics Projects

Design of experiments fishbone

REF: [1] Gregory S. Nelson. The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability. John Wiley & Sons, 2018. Chapter 6: Problem Framing


Data Analytics Project: Problem Framing and Project Lifecycle

REF: Internet and Gregory S. Nelson. The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability. John Wiley & Sons, 2018. Chapter 6: Problem Framing


Model Selection

• Optimizations/Machine Learning/Data Mining/Deep Learning/Reinforcement Learning/Graph Mining/NLP/Genetic Algorithms

• Regression

    • Linear

    • Non-Linear

• Classification

    • Logistic Regression

        • Sigmoid: Binary

        • Softmax: Multi-Class

    • Bayes Classifier

    • SVM

• Bayesian: Regression/Classification

• Clustering

    • K-NN

    • KNN+

    • K-means, Hierarchical, Density
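
The sigmoid and softmax entries in the list above are the functions that turn a model's raw scores into probabilities: sigmoid maps one score to a probability for the binary case, and softmax maps a vector of scores to probabilities that sum to 1 for the multi-class case. A minimal sketch in plain Python, with hypothetical input scores:

```python
import math


def sigmoid(z):
    """Map a real-valued score to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Binary case: a score of 0 sits exactly on the decision boundary.
print(sigmoid(0.0))  # 0.5

# Multi-class case: three hypothetical class scores.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
predicted_class = probs.index(max(probs))
```

For two classes, softmax over the scores `[z, 0]` gives the same probability as `sigmoid(z)`, which is why the binary case is usually handled with a single sigmoid output.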

•Machine Learning/Data Mining/Deep Learning/Reinforcement Learning/Graph Mining/NLP

•Time Series Analysis

•Decision (Regression, Classification) Trees

    •Univariate

    •Multivariate

•Random Forest

•Reinforcement Learning

    •Q-Learning

    •Monte Carlo

•Deep Learning (Know variations, find a fit)

    •MLP

    •LSTM

    •RNN

•Ensemble Methods

    •Multiple Learners Together
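
"Multiple learners together" can be illustrated with the simplest ensemble rule, hard majority voting: each learner predicts a label and the most common label wins. The three prediction lists below are hypothetical stand-ins for the outputs of three trained classifiers; averaging, bagging, and boosting are other ways to combine learners and are not shown.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-learner label lists (all the same length) by majority vote."""
    combined = []
    for votes in zip(*predictions):  # one tuple of votes per sample
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined


# Hypothetical predictions from three learners on four samples.
learner_a = ["spam", "ham", "spam", "ham"]
learner_b = ["spam", "spam", "spam", "ham"]
learner_c = ["ham", "ham", "spam", "spam"]

ensemble = majority_vote([learner_a, learner_b, learner_c])
print(ensemble)  # ['spam', 'ham', 'spam', 'ham']
```

An odd number of learners avoids ties in the binary case; with ties, `Counter.most_common` returns the label that was seen first.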

Ref: Internet, Demir Slides
