Part 1: Some Math/Stat Background that (true) Data Scientists will know/use: from the internet

Chebyshev's inequality

"In probability theory, Chebyshev's inequality (also called the Bienaymé–Chebyshev inequality) guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean.

Specifically, no more than 1/k² of the distribution's values can be more than k standard deviations away from the mean;

equivalently, at least 1 − 1/k² of the distribution's values are within k standard deviations of the mean.

In statistics, the inequality has great utility because it can be applied to any probability distribution in which the mean and variance are defined."

Ref: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality

Probabilistic statement

Let X (integrable) be a random variable with finite expected value μ and finite non-zero variance σ². Then for any real number k > 0,

P(|X − μ| ≥ kσ) ≤ 1/k².

Only the case k > 1 is useful. When k ≤ 1, the right-hand side 1/k² ≥ 1 and the inequality is trivial, as all probabilities are ≤ 1.

As an example, using k = √2 shows that the probability that values lie outside the interval (μ − √2 σ, μ + √2 σ) does not exceed 1/2.

Ref: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality
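
As a quick empirical check, here is a minimal Python sketch (assuming NumPy is installed; the exponential distribution, sample size, and k values are arbitrary choices for illustration) that compares the observed fraction of values at least k standard deviations from the mean against the 1/k² bound:

    import numpy as np

    rng = np.random.default_rng(0)
    # any distribution with a finite mean and variance works here
    x = rng.exponential(scale=1.0, size=100_000)
    mu, sigma = x.mean(), x.std()

    for k in [1.5, 2.0, 3.0]:
        # observed fraction of samples at least k standard deviations from the mean
        frac = np.mean(np.abs(x - mu) >= k * sigma)
        print(f"k={k}: observed {frac:.4f} <= Chebyshev bound {1 / k**2:.4f}")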

"Markov's inequality

"Markov's inequality (and other similar inequalities) relate probabilities to expectations, and provide (frequently loose but still useful) bounds for the cumulative distribution function of a random variable."

Statement

"If X is a nonnegative random variable and a > 0, then the probability that X is at least a is at most the expectation of X divided by a:[1]

P(X ≥ a) ≤ E(X)/a.

Let a = ã·E(X), where ã > 0; then we can rewrite the previous inequality as

P(X ≥ ã·E(X)) ≤ 1/ã.

"

Ref: https://en.wikipedia.org/wiki/Markov%27s_inequality
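
A similar empirical sketch for Markov's inequality (again assuming NumPy; the gamma distribution is just an arbitrary nonnegative example, and the thresholds are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # a nonnegative random variable
    ex = x.mean()  # estimate of E(X)

    for a in [2.0, 4.0, 8.0]:
        # observed P(X >= a) versus the Markov bound E(X)/a
        print(f"a={a}: P(X>=a) ~ {np.mean(x >= a):.4f} <= E(X)/a = {ex / a:.4f}")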

Check the Null Hypothesis concept as well as the Chi-Square Test here: http://bangla.salearningschool.com/recent-posts/important-basic-concepts-statistics-for-big-data/

Chi-Square Statistic:

"A chi square (χ2) statistic is a test that measures how expectations compare to actual observed data (or model results)."

https://www.investopedia.com/terms/c/chi-square-statistic.asp

"What does chi square test tell you?

The Chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called a "goodness of fit" statistic, because it measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent."

https://www.ling.upenn.edu/~clight/chisquared.htm

"In probability theory and statistics, the chi-square distribution (also chi-squared or χ2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. The chi-square distribution is a special case of the gamma distribution and is one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing and in construction of confidence intervals.[2][3][4][5] When it is being distinguished from the more general noncentral chi-square distribution, this distribution is sometimes called the central chi-square distribution.": https://en.wikipedia.org/wiki/Chi-squared_distribution

"A chi-squared test, also written as χ2 test, is any statistical hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Without other qualification, 'chi-squared test' often is used as short for Pearson's chi-squared test. The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.": https://en.wikipedia.org/wiki/Chi-squared_test

"

Statistical Significance Tests for Comparing Machine Learning Algorithms

Learn

"

  • Statistical hypothesis tests can aid in comparing machine learning models and choosing a final model.
  • The naive application of statistical hypothesis tests can lead to misleading results.
  • Correct use of statistical tests is challenging, and there is some consensus for using the McNemar’s test or 5×2 cross-validation with a modified paired Student t-test.

"

https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/
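
As one illustration of the tests mentioned above, McNemar's test for comparing two classifiers can be sketched by hand. The disagreement counts below are invented for illustration, and the continuity-corrected statistic is compared against a chi-square distribution with 1 degree of freedom (assuming SciPy is available); see the linked article for the full recommendations:

    from scipy.stats import chi2

    # invented 2x2 disagreement counts for two models evaluated on the same test set:
    # b = cases model A got right and model B got wrong,
    # c = cases model A got wrong and model B got right
    b, c = 25, 40

    # McNemar's statistic with continuity correction
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(statistic, df=1)
    print(f"McNemar statistic = {statistic:.3f}, p-value = {p_value:.3f}")
    # a small p-value suggests the two models' error rates differ significantly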

Probability Axioms (I am not convinced that the following is the best way to say it)

"

  • Axiom 1: The probability of an event is a real number greater than or equal to 0.
  • Axiom 2: The probability that at least one of all the possible outcomes of a process (such as rolling a die) will occur is 1.
  • Axiom 3: If two events A and B are mutually exclusive, then the probability of either A or B occurring is the probability of A occurring plus the probability of B occurring.

"

https://plus.maths.org/content/maths-minute-axioms-probability

1. Probability is non-negative

2. P(S) = 1, where S is the sample space

3. Probability is additive

If A and B are two mutually exclusive (disjoint) events,

P(A ∪ B) = P(A) + P(B)

P(A ∩ B) = P(∅) = 0 [nothing in common]

P(A) = 1 − P(A′), where A′ is the complement of A

P(∅) = 0, where ∅ is the empty event
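
A tiny simulation sketch (assuming NumPy) that checks these properties on a fair-die example:

    import numpy as np

    rng = np.random.default_rng(2)
    rolls = rng.integers(1, 7, size=100_000)  # simulated fair die rolls

    p_a = np.mean(rolls <= 2)                      # A = {1, 2}
    p_b = np.mean(rolls >= 5)                      # B = {5, 6}, disjoint from A
    p_a_or_b = np.mean((rolls <= 2) | (rolls >= 5))

    print(p_a + p_b, p_a_or_b)                     # additivity: both values agree (about 2/3)
    print(np.mean((rolls >= 1) & (rolls <= 6)))    # P(S) = 1: every roll is in the sample space
    print(1 - np.mean(rolls > 2), p_a)             # complement rule: P(A) = 1 - P(A')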

What does probability density function mean?

"Probability density function (PDF) is a statistical expression that defines a probability distribution for a continuous random variable as opposed to a discrete random variable. When the PDF is graphically portrayed, the area under the curve will indicate the interval in which the variable will fall" https://www.investopedia.com/terms/p/pdf.asp

"A probability density function is most commonly associated with absolutely continuous univariate distributions. A random variable X has density f_X, where f_X is a non-negative Lebesgue-integrable function, if:
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx.

Hence, if F_X is the cumulative distribution function of X, then:

F_X(x) = ∫_(−∞)^x f_X(u) du,

and, if f_X is continuous at x,

f_X(x) = d/dx F_X(x).

Intuitively, one can think of f_X(x) dx as being the probability of X falling within the infinitesimal interval [x, x + dx]."
https://en.wikipedia.org/wiki/Probability_density_function
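
The relationship between the density, the c.d.f., and an interval probability can be checked numerically. A sketch assuming SciPy, using the standard normal distribution as the example:

    from scipy.stats import norm
    from scipy.integrate import quad

    a, b = -1.0, 2.0

    # P(a <= X <= b) obtained by integrating the density f_X over [a, b]
    area, _ = quad(norm.pdf, a, b)

    # the same probability from the c.d.f.: F_X(b) - F_X(a)
    from_cdf = norm.cdf(b) - norm.cdf(a)

    print(area, from_cdf)  # the two values agree (about 0.8186)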

Probability mass function

(Figure in the source: the graph of a probability mass function. All the values of this function must be non-negative and sum up to 1.)

"In probability and statistics, a probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.[1] Sometimes it is also known as the discrete density function. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.

A probability mass function differs from a probability density function (PDF) in that the latter is associated with continuous rather than discrete random variables. A PDF must be integrated over an interval to yield a probability.[2]

The value of the random variable having the largest probability mass is called the mode." https://en.wikipedia.org/wiki/Probability_mass_function
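
A short sketch (assuming SciPy and NumPy) using a Binomial(10, 0.3) PMF: the probabilities are non-negative, sum to 1, and the mode is the value with the largest probability mass:

    import numpy as np
    from scipy.stats import binom

    n, p = 10, 0.3
    values = np.arange(n + 1)
    pmf = binom.pmf(values, n, p)   # P(X = k) for k = 0, ..., 10

    print(pmf.sum())                # the masses sum to 1 (up to floating-point error)
    print(values[np.argmax(pmf)])   # the mode: the value with the largest mass (3 here)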

Mixed Random Variables

Here, we will discuss mixed random variables. These are random variables that are neither discrete nor continuous, but are a mixture of both. In particular, a mixed random variable has a continuous part and a discrete part.

https://www.probabilitycourse.com/chapter4/4_3_1_mixed.php. Also check the examples there.
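
A minimal sketch of a mixed random variable (assuming NumPy): X equals 0 with probability 0.3 (a discrete atom) and is otherwise drawn from an exponential distribution (a continuous part). The probabilities and parameters are arbitrary:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000

    # discrete part: with probability 0.3 the variable is exactly 0
    is_atom = rng.random(n) < 0.3
    # continuous part: otherwise draw from an Exponential distribution with mean 2
    x = np.where(is_atom, 0.0, rng.exponential(scale=2.0, size=n))

    print(np.mean(x == 0))  # about 0.3, the probability mass at the atom
    print(x.mean())         # about 0.7 * 2 = 1.4, the overall expected value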

Expected values of a random variable
The expected value of a discrete random variable is the probability-weighted average of all its possible values. In other words, each possible value the random variable can assume is multiplied by its probability of occurring, and the resulting products are summed to produce the expected value.
https://en.wikipedia.org/wiki/Expected_value
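
For example, the expected value of a fair six-sided die is the probability-weighted sum of its faces; a minimal sketch:

    # each face has probability 1/6, so E(X) = sum of value * probability
    faces = [1, 2, 3, 4, 5, 6]
    expected_value = sum(v * (1 / 6) for v in faces)
    print(expected_value)  # 3.5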

The “moments” of a random variable

The “moments” of a random variable (or of its distribution) are expected values of powers or related functions of the random variable. The rth moment of X is E(X^r). In particular, the first moment is the mean, μ_X = E(X). The mean is a measure of the “center” or “location” of a distribution.

http://homepages.gac.edu/~holte/courses/mcs341/fall10/documents/sect3-3a.pdf
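
A short sketch (assuming NumPy) that estimates the first few raw moments E(X^r) from a sample; the normal distribution and its parameters are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(loc=2.0, scale=1.0, size=100_000)  # example sample

    for r in [1, 2, 3]:
        # the r-th raw moment E(X^r), estimated by the sample average of x**r
        print(f"E(X^{r}) ~ {np.mean(x ** r):.3f}")
    # the first moment is the sample mean (about 2.0 here)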

Joint distributions

"Joint distributions Notes: Below X and Y are assumed to be continuous random variables. This case is, by far, the most important case. Analogous formulas, with sums replacing integrals and p.m.f.’s instead of p.d.f.’s, hold for the case when X and Y are discrete r.v.’s. Appropriate analogs also hold for mixed cases (e.g., X discrete, Y continuous), and for the more general case of n random variables X1, . . . , Xn.

• Joint cumulative distribution function (joint c.d.f.): F(x, y) = P(X ≤ x, Y ≤ y)"

https://faculty.math.illinois.edu/~hildebr/461/jointdistributions.pdf
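
The joint c.d.f. can be estimated from data by counting; a sketch assuming NumPy, with two independent standard normal variables as the example:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100_000
    x = rng.normal(size=n)
    y = rng.normal(size=n)

    # empirical joint c.d.f.: F(x0, y0) = P(X <= x0, Y <= y0)
    x0, y0 = 0.5, 1.0
    f_emp = np.mean((x <= x0) & (y <= y0))

    # because X and Y are independent here, this factors as P(X <= x0) * P(Y <= y0)
    print(f_emp, np.mean(x <= x0) * np.mean(y <= y0))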

The above excerpts are mostly from the Internet and are quoted as is.