This series of posts, which I’ll try to publish once in a while, will address some statistics fundamentals. It starts with variable types, methodologies, sampling, hypothesis testing, and distributions, and then gradually explores more advanced statistical concepts. It’s a way for me to revise what I’ve studied, for beginners to get an overview before diving deep into concepts, and for others in the field who’ve lost touch and would like a refresher.
Topics to be covered under Fundamentals of Statistics. These will be followed by more advanced concepts.
||Central Limit Theorem||
Statistics, as defined by Wikipedia, is the study of the collection, analysis, interpretation, presentation, and organization of data. Data at the broadest level can be divided into two types – discrete or continuous.
|Discrete||Continuous|
|Can only take particular values. There is no grey area between the values. E.g., 5 boys||Can take any value in a continuous range. The value can be infinitesimally small. E.g., weight can be measured to 75.6463 kg|
|Can also be categorical. E.g., Red, Blue||Only numerical|
|Values are often put in ‘bins’ to identify groups at the cost of data loss. E.g., 13-19 teenager, 19-30 young adult and so on|
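To make the idea of binning concrete, here is a minimal sketch that turns a continuous age into a categorical group. The bin edges and the "child" and "adult" labels are assumptions made for illustration (the ranges above overlap at 19, so this sketch treats 20 as the first young-adult age), and `age_group` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
# Each edge is the first age that falls into the next group.
EDGES = [13, 20, 31]
LABELS = ["child", "teenager", "young adult", "adult"]

def age_group(age):
    """Map a continuous age to a categorical label."""
    return LABELS[bisect_right(EDGES, age)]

ages = [12.5, 13, 19.9, 20, 30, 45.2]
groups = [age_group(a) for a in ages]
print(groups)
```

Once the ages are binned, 13 and 19.9 become indistinguishable; that is the data loss the table refers to.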
Statistics is based on three main types of methodologies:
|Descriptive Statistics||Inferential Statistics||Predictive Statistics|
|Summarizes data||Draws conclusions||Estimates continuous or categorical data using past data|
There are three major ways to describe data:
- Measures of Central Tendency
- Measures of Spread (Dispersion)
- Measures of Shape (Distribution)
Measures of Central Tendency – These estimate a single value that identifies the central position of a set of data. They are also known as measures of central location. The mean, or average, is the most common measure of central tendency; the others are the median and the mode.
Some differences between the 3 measures of central tendency –
|Mean||Median||Mode|
|Represents centre of gravity of data set||Represents middle of data set (half below and half above)||Represents most common value|
|Sensitive to extreme values||Not sensitive to extreme values||Data set may have no mode, one mode, or multiple modes|
|Most useful when data is normally distributed||Most useful when data set is skewed or has extreme values in one direction||Most useful in finding the most commonly occurring value|
Knowing when to use the mean, the median, or the mode is a fundamental skill in statistical analysis. For example, take a continuous distribution with a positive skew, that is, a long right tail. If the mean is calculated using all of the values, the extreme values in the tail pull it upwards, so the answer can be very different from the mode and the median.
Let’s say we have 10 people in a class. Their scores in an exam out of 100 are:
Scores 1: 70, 75, 65, 61, 78, 82, 76, 63, 45, 90
In the next exam, if the first student fails and gets a 15 out of 100, let’s see how the statistics change:
Scores 2: 15, 75, 65, 61, 78, 82, 76, 63, 45, 90
In the first set the mean is 70.5 and the median is 72.5; in the second set the mean falls to 65 while the median only falls to 70. The difference between the mean and the median has increased from 2 to 5. As outliers increase, or as the distribution gets more skewed with values deviating significantly from the mean, the mean becomes more and more susceptible to those values and no longer provides a good estimate of the central tendency.
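The two sets of scores can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics` module; the variable names are my own.

```python
from statistics import mean, median

scores_1 = [70, 75, 65, 61, 78, 82, 76, 63, 45, 90]
scores_2 = [15, 75, 65, 61, 78, 82, 76, 63, 45, 90]  # first student drops to 15

# Gap between median and mean for each exam
gap_1 = median(scores_1) - mean(scores_1)
gap_2 = median(scores_2) - mean(scores_2)

for name, scores in (("Scores 1", scores_1), ("Scores 2", scores_2)):
    print(name, "mean:", mean(scores), "median:", median(scores))
print("gap 1:", gap_1, "gap 2:", gap_2)
```

A single low score moves the mean by 5.5 points but the median by only 2.5, which is why the median is usually preferred when outliers are present.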
- Statistics – a set of concepts, rules, and procedures that help us to:
  - organize numerical information in the form of tables, graphs, and charts;
  - understand statistical techniques underlying decisions that affect our lives and well-being; and
  - make informed decisions.
- Data – facts, observations, and information that come from investigations.
  - Measurement data (sometimes called quantitative data) – the result of using some instrument to measure something (e.g., test score, weight).
  - Categorical data (also referred to as frequency or qualitative data) – things are grouped according to some common property or properties, and the number of members of each group is recorded (e.g., males/females, vehicle type).
- Variable – property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, English, psychology, etc.
- Discrete Variable – a variable with a limited number of values, e.g., gender (male/female) or college class (freshman/sophomore/junior/senior).
- Continuous Variable – a variable that can take on many different values, in theory, any value between the lowest and highest points on the measurement scale.
- Independent Variable – a variable that is manipulated, measured, or selected by the researcher as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the outcome or effect.
- Dependent Variable – a variable that is not under the experimenter’s control — the data. It is the variable that is observed and measured in response to the independent variable.
- Qualitative Variable – a variable based on categorical data.
- Quantitative Variable – a variable based on quantitative data.
- Measures of Center – Plotting data in a frequency distribution shows the general shape of the distribution and gives a general sense of how the numbers are bunched. Several statistics can be used to represent the “center” of the distribution. These statistics are commonly referred to as measures of central tendency.
- Mode – The mode of a distribution is simply defined as the most frequent or common score in the distribution. The mode is the point or value of X that corresponds to the highest point on the distribution. If the highest frequency is shared by more than one value, the distribution is said to be multimodal. It is not uncommon to see distributions that are bimodal reflecting peaks in scoring at two different points in the distribution.
- Median – The median is the score that divides the distribution into halves; half of the scores are above the median and half are below it when the data are arranged in numerical order. The median is also referred to as the score at the 50th percentile in the distribution. The median location of N numbers can be found by the formula (N + 1) / 2. When N is an odd number, the formula yields an integer that gives the position of the median in the numerically ordered distribution. For example, in the distribution of numbers (3 1 5 4 9 9 8) the median location is (7 + 1) / 2 = 4. When applied to the ordered distribution (1 3 4 5 8 9 9), the value 5 is the median: three scores are above 5 and three are below 5. If there were only 6 values (1 3 4 5 8 9), the median location would be (6 + 1) / 2 = 3.5. In this case the median is half-way between the 3rd and 4th scores (4 and 5), or 4.5.
- Mean – The mean is the most common measure of central tendency and the one that can be mathematically manipulated. It is defined as the average of a distribution and is equal to ΣX / N. Simply, the mean is computed by summing all the scores in the distribution (ΣX) and dividing that sum by the total number of scores (N). The mean is the balance point in a distribution such that if you subtract each value in the distribution from the mean and sum all of these deviation scores, the result will be zero.
Summary source: http://bobhall.tamu.edu/FiniteMath/Module8/Introduction.html
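The definitions of mode, median, and mean in the summary above can be tried out on its own example numbers. This is a minimal sketch, not part of the summary source; `median_location` is a hypothetical helper name, and `Fraction` is used only so that the deviations from the mean sum to exactly zero without floating-point rounding.

```python
from fractions import Fraction
from statistics import mean, median, multimode

def median_location(n):
    """1-based position of the median among n ordered values: (N + 1) / 2."""
    return (n + 1) / 2

odd = sorted([3, 1, 5, 4, 9, 9, 8])   # ordered: 1 3 4 5 8 9 9
even = sorted([1, 3, 4, 5, 8, 9])     # ordered: 1 3 4 5 8 9

# Median: location 4 gives the 4th value; location 3.5 falls between the 3rd and 4th.
odd_location, odd_median = median_location(len(odd)), median(odd)
even_location, even_median = median_location(len(even)), median(even)

# Mode: multimode returns every value that shares the highest frequency.
single_mode = multimode(odd)
two_modes = multimode([1, 1, 2, 3, 3])   # an illustrative bimodal set

# Mean: deviations from the mean always sum to zero.
exact = [Fraction(x) for x in odd]
deviation_sum = sum(mean(exact) - x for x in exact)

print(odd_location, odd_median)
print(even_location, even_median)
print(single_mode, two_modes)
print(deviation_sum)
```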