Unit 2
1) What is statistics? How is it used in Data Science?
1. What is Statistics?
Statistics is the branch of mathematics that deals with collecting, analyzing, interpreting, and presenting numerical data. It helps in understanding patterns, relationships, and trends in data.
Key Components of Statistics:
- Descriptive Statistics → Summarizes data using measures like mean, median, variance, etc.
- Inferential Statistics → Makes predictions or inferences about a population using sample data.
2. How is Statistics Used in Data Science?
Statistics is the backbone of Data Science, helping in data analysis, modeling, and decision-making.
A. Data Understanding & Exploration
- Descriptive statistics (mean, median, mode) summarize datasets.
- Data visualization (histograms, box plots) reveals patterns and outliers.
B. Data Cleaning & Preprocessing
- Identifies missing values, outliers, and inconsistent data.
- Normal distribution & standardization help scale data for machine learning.
C. Hypothesis Testing & Inferential Analysis
- Uses t-tests, chi-square tests, and ANOVA to validate assumptions.
- Helps determine whether observed patterns are statistically significant.
D. Machine Learning & Predictive Modeling
- Regression Analysis (e.g., Linear Regression) predicts numerical outcomes.
- Probability & Bayes’ Theorem (e.g., Naïve Bayes Classifier) are used in classification tasks.
E. Performance Evaluation of Models
- Error metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE) for regression models.
- Confusion matrix, Precision, Recall, and F1-score for classification models.
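📌 Python sketch: the snippet below shows how these metrics can be computed by hand with NumPy. The arrays y_true, y_pred, actual, and predicted are illustrative assumptions, not output of any particular model.
import numpy as np
# Illustrative regression results (assumed values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])
mse = np.mean((y_true - y_pred) ** 2)  # Mean Squared Error
rmse = np.sqrt(mse)                    # Root Mean Squared Error
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")
# Illustrative binary classification labels (assumed values)
actual = np.array([1, 0, 1, 1, 0, 1])
predicted = np.array([1, 0, 0, 1, 1, 1])
tp = np.sum((predicted == 1) & (actual == 1))  # true positives
fp = np.sum((predicted == 1) & (actual == 0))  # false positives
fn = np.sum((predicted == 0) & (actual == 1))  # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")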
Conclusion
Statistics is crucial in Data Science for understanding data, making predictions, validating models, and ensuring insights are reliable. Mastering statistical concepts helps data scientists build accurate and efficient machine learning models.
2) Differentiate between Descriptive and Inferential Statistics with examples.
Descriptive vs. Inferential Statistics
Feature | Descriptive Statistics | Inferential Statistics |
---|---|---|
Definition | Summarizes and describes data. | Makes predictions and generalizations about a population using a sample. |
Purpose | Organizes, visualizes, and simplifies raw data. | Draws conclusions and makes inferences about a larger group. |
Techniques | Measures of central tendency (mean, median, mode), dispersion (variance, standard deviation), frequency distributions, data visualization (histograms, box plots). | Hypothesis testing (t-test, chi-square test), confidence intervals, regression analysis, probability distributions. |
Data Used | Works with the entire dataset (population or sample). | Uses a sample to infer insights about the entire population. |
Example | - Finding the average height of students in a class. - Calculating the percentage of students scoring above 80 in an exam. | - Predicting the average height of all students in a city based on a sample. - Determining if a new drug is effective by testing on a small group and generalizing results to the entire population. |
Example for Better Understanding
💡 Descriptive Statistics Example:
A company surveys 100 employees and finds that the average salary is $50,000. This is a direct summary of the dataset.
💡 Inferential Statistics Example:
The company wants to predict the average salary of all employees in the industry using this sample of 100 employees. Inferential statistics help estimate this.
Conclusion
- Descriptive Statistics helps understand data by summarizing and organizing it.
- Inferential Statistics helps make predictions and generalizations from sample data to a larger population.
Both are essential in Data Science for analyzing trends and making data-driven decisions.
3) Explain the concepts of Population and Sample in statistics. Why is sampling important?
1. Population vs. Sample in Statistics
Concept | Population | Sample |
---|---|---|
Definition | The entire group of individuals or data points under study. | A subset of the population used for analysis. |
Size | Larger, often too vast to study completely. | Smaller, selected from the population. |
Representation | Includes all possible observations. | Represents a portion of the population. |
Example | All students in a country. | A random group of 500 students selected for a survey. |
Analysis Method | Uses parameters (e.g., population mean μ, population standard deviation σ). | Uses statistics (e.g., sample mean x̄, sample standard deviation s). |
2. Why is Sampling Important?
Studying an entire population is often impractical due to time, cost, and effort. Sampling helps by:
✅ Reducing Cost & Time → Collecting data from a sample is faster and more affordable.
✅ Better Feasibility → Allows analysis when population size is too large.
✅ Statistical Inference → Enables generalization of findings to the entire population.
✅ Higher Accuracy → Proper sampling techniques ensure reliable results with minimal bias.
3. Example of Sampling in Data Science
A company wants to analyze customer satisfaction for millions of customers. Instead of surveying all, they collect responses from a random sample of 1,000 customers and infer results for the entire customer base.
📌 Key Sampling Methods:
- Random Sampling → Every individual has an equal chance of selection.
- Stratified Sampling → Population is divided into groups, and samples are taken proportionally.
- Systematic Sampling → Every k-th individual is chosen.
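📌 Python sketch: as a rough illustration of these three methods, the NumPy code below draws each kind of sample from a hypothetical population; the population values and the group labels used for stratification are assumptions for demonstration only.
import numpy as np
rng = np.random.default_rng(42)
population = np.arange(1, 1001)  # hypothetical population of 1000 IDs
# Random sampling: every individual has an equal chance of selection
random_sample = rng.choice(population, size=50, replace=False)
# Systematic sampling: every k-th individual (k = N / sample size)
k = len(population) // 50
systematic_sample = population[::k][:50]
# Stratified sampling: sample proportionally from each assumed group
groups = rng.choice(["A", "B"], size=1000, p=[0.7, 0.3])
stratified = [rng.choice(population[groups == g],
                         size=int(50 * np.mean(groups == g)),
                         replace=False)
              for g in ["A", "B"]]
stratified_sample = np.concatenate(stratified)
print(len(random_sample), len(systematic_sample), len(stratified_sample))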
Conclusion
- Population is the complete dataset, while sample is a smaller subset used for study.
- Sampling is crucial for making predictions about a population efficiently, without analyzing every individual.
- In Data Science, sampling techniques ensure accurate and scalable models.
4) What are the different types of variables in statistics? Give examples.
Types of Variables in Statistics
In statistics, variables are characteristics or properties that can take different values. They are classified into Qualitative (Categorical) and Quantitative (Numerical) variables.
1. Qualitative (Categorical) Variables
These represent categories or labels without numerical meaning.
Type | Description | Example |
---|---|---|
Nominal | Categories with no inherent order. | - Eye color (Brown, Blue, Green) - Blood type (A, B, AB, O) |
Ordinal | Categories with a meaningful order, but no fixed difference. | - Education level (Primary, Secondary, College) - Customer satisfaction (Low, Medium, High) |
2. Quantitative (Numerical) Variables
These represent measurable quantities and have numerical values.
Type | Description | Example |
---|---|---|
Discrete | Whole numbers, countable values. | - Number of students in a class (30, 35, 40) - Number of cars in a parking lot (10, 20, 50) |
Continuous | Can take any value within a range, including decimals. | - Height of a person (5.7 feet, 6.2 feet) - Temperature (23.5°C, 30.8°C) |
Key Differences
- Categorical variables classify data into groups (e.g., gender: Male/Female).
- Numerical variables involve measurable quantities (e.g., weight: 70 kg).
📌 Example:
A survey collects the following data:
- Name: John (Nominal)
- Age: 25 years (Discrete)
- Height: 5.8 feet (Continuous)
- Education Level: Bachelor’s Degree (Ordinal)
Conclusion
Understanding variable types helps in choosing the right statistical analysis methods and building accurate models in Data Science.
5) Define Measures of Central Tendency. Why is it important in data science? Explain Mean, Median, and Mode with examples.
1. What are Measures of Central Tendency?
Measures of Central Tendency are statistical metrics that describe the center or typical value of a dataset. The three main measures are:
- Mean (Average)
- Median (Middle Value)
- Mode (Most Frequent Value)
2. Importance in Data Science
✅ Summarizes Data: Helps in understanding data distribution.
✅ Comparison & Decision-Making: Assists in comparing datasets.
✅ Modeling & Predictions: Used in statistical and machine learning models.
✅ Outlier Detection: Identifies skewness and anomalies in data.
3. Explanation of Mean, Median, and Mode with Examples
A. Mean (Arithmetic Average)
- The sum of all values divided by the number of values.
- Formula: Mean (x̄) = (Σxᵢ) / n
- Example:
Dataset: [10, 20, 30, 40, 50] → Mean = (10 + 20 + 30 + 40 + 50) / 5 = 30
- Use Case: Used when data is normally distributed (e.g., average income, test scores).
B. Median (Middle Value)
- The middle value when data is arranged in ascending order.
- Steps to Calculate:
- If odd number of elements: Median = Middle value.
- If even number of elements: Median = Average of two middle values.
- Example:
Dataset: [5, 10, 15, 20, 25] (odd count) → Median = 15
Dataset: [5, 10, 15, 20] (even count) → Median = (10 + 15) / 2 = 12.5
- Use Case: Preferred when data has outliers (e.g., house prices, salaries).
C. Mode (Most Frequent Value)
- The most repeated value in a dataset.
- Example:
Dataset: [2, 3, 3, 4, 5, 5, 5, 6] → Mode = 5 (since it appears most frequently).
- Use Case: Useful in categorical data analysis (e.g., most preferred product, most common disease).
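📌 Python sketch: a quick way to check all three measures is Python's built-in statistics module; the dataset below reuses the mode example above.
import statistics
data = [2, 3, 3, 4, 5, 5, 5, 6]
print("Mean:", statistics.mean(data))      # 4.125
print("Median:", statistics.median(data))  # 4.5
print("Mode:", statistics.mode(data))      # 5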
4. Comparison & When to Use
Measure | Best Used When | Affected by Outliers? |
---|---|---|
Mean | Data is normally distributed | ✅ Yes |
Median | Data has skewness or extreme values | ❌ No |
Mode | Categorical or repeated values are important | ❌ No |
5. Conclusion
Mean, Median, and Mode are essential statistical tools in Data Science. They help summarize datasets, identify trends, and guide decision-making, making them crucial in business analytics, research, and machine learning.
6) What are Measures of Variability? Discuss Range, Variance, and Standard Deviation.
1. What are Measures of Variability?
Measures of Variability describe how spread out or dispersed data points are in a dataset. They indicate how much the values differ from the central tendency (mean, median, mode).
2. Importance in Data Science
✅ Understanding Data Distribution – Helps in analyzing data spread.
✅ Detecting Outliers – Identifies extreme values affecting predictions.
✅ Improving Model Performance – Variability helps in feature selection and normalization.
✅ Risk Assessment – Higher variability indicates greater uncertainty in data.
3. Key Measures of Variability
A. Range
- The simplest measure of dispersion.
- Formula: Range = Maximum value − Minimum value
- Example:
Dataset: [10, 20, 30, 40, 50] → Range = 50 − 10 = 40
- Use Case: Quick overview of spread, but sensitive to outliers.
B. Variance (σ² or s²)
- Measures how far each data point is from the mean.
- Formula for Population Variance (σ²): σ² = Σ(xᵢ − μ)² / N
- Formula for Sample Variance (s²): s² = Σ(xᵢ − x̄)² / (n − 1)
- Example:
Dataset: [5, 10, 15] (Mean = 10) → Squared deviations = 25, 0, 25 → σ² = 50 / 3 ≈ 16.67 (sample variance s² = 50 / 2 = 25)
- Use Case: Helps understand variability, but squared units make interpretation difficult.
C. Standard Deviation (σ or s)
- Square root of variance, giving spread in the same unit as data.
- Formula: σ = √σ² (or s = √s²)
- Example (Using Variance from Above): σ = √16.67 ≈ 4.08
- Use Case:
- A low standard deviation means data points are close to the mean.
- A high standard deviation means data is widely spread.
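📌 Python sketch: these three measures can be computed with NumPy as below; the dataset reuses the variance example, and ddof selects the population (0) versus sample (1) formula.
import numpy as np
data = np.array([5, 10, 15])
print("Range:", data.max() - data.min())         # 10
print("Population variance:", data.var(ddof=0))  # ≈ 16.67
print("Sample variance:", data.var(ddof=1))      # 25.0
print("Population std dev:", data.std(ddof=0))   # ≈ 4.08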
4. Comparison Table
Measure | Description | Sensitivity to Outliers |
---|---|---|
Range | Difference between max & min | ✅ High |
Variance | Average squared deviations from mean | ✅ High |
Standard Deviation | Square root of variance (same unit as data) | ✅ High |
5. Conclusion
Measures of Variability provide insight into data spread, essential for statistical analysis, risk assessment, and machine learning models. Standard deviation is the most commonly used metric due to its interpretability in real-world scenarios.
7) Define Coefficient of Variance (CV) and explain its significance.
1. What is the Coefficient of Variation (CV)?
The Coefficient of Variation (CV) is a relative measure of dispersion that compares the standard deviation to the mean. It helps assess the consistency and variability of data across different datasets, even if they have different units or scales.
Formula:
CV = (σ / μ) × 100%
Where:
- σ = Standard Deviation
- μ = Mean
Expressed as a Percentage (%) to make comparisons easier.
2. Significance of Coefficient of Variation
✅ Comparison Across Different Datasets: CV allows comparison of variability between datasets with different units or scales.
✅ Risk Assessment: In finance, a lower CV indicates a more stable investment, while a higher CV suggests higher risk.
✅ Quality Control: In manufacturing, a low CV means consistent product quality, while a high CV signals variability in production.
✅ Machine Learning & Data Science: Helps normalize features and understand the reliability of data.
3. Example of CV Calculation
Scenario: Comparing test score consistency in two classes.
Class | Mean Score (μ) | Standard Deviation (σ) | CV (%) |
---|---|---|---|
A | 80 | 5 | (5 / 80) × 100 = 6.25% |
B | 75 | 10 | (10 / 75) × 100 ≈ 13.33% |
💡 Interpretation:
- Class A has a lower CV (6.25%), meaning scores are more consistent.
- Class B has a higher CV (13.33%), indicating more variability in scores.
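📌 Python sketch: the helper below computes CV from raw data; the two score lists are hypothetical samples constructed to match the means and standard deviations in the table above.
import numpy as np
def coefficient_of_variation(data):
    # CV = (standard deviation / mean) * 100
    data = np.asarray(data, dtype=float)
    return data.std(ddof=0) / data.mean() * 100
class_a = [72.5, 77.5, 80, 82.5, 87.5]  # mean 80, σ = 5 (assumed sample)
class_b = [60, 70, 75, 80, 90]          # mean 75, σ = 10 (assumed sample)
print(f"CV A: {coefficient_of_variation(class_a):.2f}%")  # 6.25%
print(f"CV B: {coefficient_of_variation(class_b):.2f}%")  # 13.33%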
4. When to Use CV?
- Best for comparing variability when datasets have different units or magnitudes.
- Not suitable when the mean is close to zero, as it leads to high CV values that are misleading.
5. Conclusion
The Coefficient of Variation (CV) is a powerful statistical tool for comparing relative variability in different datasets. It is widely used in finance, economics, manufacturing, and machine learning to assess consistency and stability.
8) What is Skewness? How does it indicate the shape of a distribution?
1. What is Skewness?
Skewness is a statistical measure that describes the asymmetry of a dataset’s distribution around its mean. It indicates whether the data is symmetrically distributed or skewed to one side.
- Positive Skewness (Right-Skewed) → Tail extends toward higher values.
- Negative Skewness (Left-Skewed) → Tail extends toward lower values.
- Zero Skewness (Symmetric Distribution) → Data is evenly spread around the mean.
2. How Skewness Indicates the Shape of a Distribution
A. Symmetric Distribution (Skewness ≈ 0)
- Mean ≈ Median ≈ Mode
- Shape: Bell-shaped (Normal Distribution).
- Example: Heights of people in a population.
B. Positively Skewed (Right-Skewed) (Skewness > 0)
- Mean > Median > Mode
- Shape: Long right tail, more low values.
- Example: Income distribution (few high earners pull the mean up).
C. Negatively Skewed (Left-Skewed) (Skewness < 0)
- Mean < Median < Mode
- Shape: Long left tail, more high values.
- Example: Exam scores (most students score high, few score very low).
3. Skewness Formula
Skewness = [Σ(xᵢ − x̄)³ / n] / σ³
Where:
- xᵢ = Individual data points
- x̄ = Mean
- σ = Standard Deviation
- n = Number of observations
📌 Easier Calculation in Python:
import scipy.stats as stats
data = [2, 3, 3, 4, 5, 5, 5, 6]  # example dataset
skewness = stats.skew(data)
print(skewness)
4. Why is Skewness Important in Data Science?
✅ Influences Model Selection – Many statistical tests assume normality.
✅ Affects Mean & Median Relationship – Helps understand central tendency.
✅ Improves Decision-Making – Essential in finance, risk analysis, and data preprocessing.
✅ Detects Data Anomalies – Helps identify extreme values and distribution shape.
5. Conclusion
Skewness helps in understanding data distribution, detecting biases, and making better statistical and machine learning decisions. Right-skewed distributions have long right tails, left-skewed have long left tails, and zero-skewness indicates symmetry.
9) What is Kurtosis? How does it describe the characteristics of a probability distribution?
1. What is Kurtosis?
Kurtosis is a statistical measure that describes the “tailedness” or peakiness of a probability distribution. It indicates how heavily the tails of a distribution differ from a normal distribution.
2. Types of Kurtosis & Their Characteristics
Kurtosis Type | Value | Characteristics | Example |
---|---|---|---|
Mesokurtic (Normal Distribution) | ≈ 3 | - Moderate tails and peak. - Similar to a normal distribution. | Standard normal distribution (e.g., IQ scores). |
Leptokurtic (Heavy-Tailed Distribution) | > 3 | - Higher peak, fatter tails. - More extreme values (outliers). | Stock market returns (high volatility). |
Platykurtic (Light-Tailed Distribution) | < 3 | - Lower peak, thinner tails. - Fewer extreme values. | Uniform distribution (e.g., dice rolls). |
📌 Formula for Kurtosis:
Kurtosis = [Σ(xᵢ − x̄)⁴ / n] / σ⁴
Where:
- xᵢ = Data values
- x̄ = Mean
- σ = Standard deviation
- n = Number of observations
📌 Python Code to Calculate Kurtosis:
import scipy.stats as stats
data = [2, 3, 3, 4, 5, 5, 5, 6]  # example dataset
kurtosis_value = stats.kurtosis(data)  # excess kurtosis (normal = 0)
print(kurtosis_value)
3. How Kurtosis Describes a Probability Distribution
- Tails → Higher kurtosis means more extreme values (outliers).
- Peak → Leptokurtic distributions have sharper peaks, while platykurtic distributions are flatter.
- Risk Analysis → Used in finance to measure market crashes and rare events.
4. Importance of Kurtosis in Data Science
✅ Outlier Detection → High kurtosis indicates the presence of extreme values.
✅ Risk & Financial Modeling → Used in stock market analysis to predict crashes.
✅ Quality Control & Reliability → Detects unusual variations in manufacturing.
✅ Improving Machine Learning Models → Helps in preprocessing data to handle outliers.
5. Conclusion
Kurtosis helps in understanding distribution shape, outlier impact, and risk assessment in various fields like finance, quality control, and machine learning. Leptokurtic distributions have extreme values, while platykurtic distributions are more uniform.
10) What is the Normal Distribution, and why is it important in statistics?
1. What is Normal Distribution?
The Normal Distribution, also called the Gaussian Distribution, is a symmetrical, bell-shaped probability distribution where most data points cluster around the mean.
Key Characteristics:
✅ Symmetrical → Mean, median, and mode are equal.
✅ Bell-Shaped Curve → Majority of values lie close to the center.
✅ Defined by Mean (μ) & Standard Deviation (σ) →
- 68% of data falls within 1σ of the mean.
- 95% falls within 2σ.
- 99.7% falls within 3σ (Empirical Rule).
📌 Mathematical Formula:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Where:
- μ = Mean
- σ = Standard deviation
2. Importance of Normal Distribution in Statistics
✅ Basis for Statistical Inference
- Many hypothesis tests (e.g., t-test, Z-test) assume normality.
- Used in confidence intervals & probability predictions.
✅ Central Limit Theorem (CLT)
- States that sample means follow a normal distribution even if the population is not normally distributed, given a large enough sample size.
✅ Machine Learning & Data Science
- Many models (e.g., Linear Regression) assume normality for optimal performance.
- Helps in data preprocessing (e.g., standardization).
✅ Real-Life Applications
- Finance: Stock market returns often follow a normal distribution.
- Medicine: Human height, blood pressure, and IQ scores follow a normal distribution.
- Quality Control: Manufacturing defects follow a normal pattern.
3. Example: Normal Distribution in Python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Generate data
data = np.random.normal(loc=50, scale=10, size=1000)
# Plot histogram
plt.hist(data, bins=30, density=True, alpha=0.6, color='b')
# Plot normal curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, 50, 10)
plt.plot(x, p, 'k', linewidth=2)
plt.title("Normal Distribution Curve")
plt.show()
4. Conclusion
The Normal Distribution is fundamental in statistics and data science. Its symmetry, predictability, and real-world applicability make it essential for statistical modeling, hypothesis testing, and decision-making across various fields.
11) Explain Hypothesis Testing and its importance in statistics.
1. What is Hypothesis Testing?
Hypothesis Testing is a statistical method used to make decisions or inferences about a population based on sample data. It helps determine whether an observed effect is real or due to random chance.
✅ Compares Two Hypotheses:
- Null Hypothesis (H₀) → Assumes no effect or no difference.
- Alternative Hypothesis (H₁) → Suggests a significant effect or difference.
2. Steps in Hypothesis Testing
1️⃣ State the Hypotheses
- H₀ (Null Hypothesis): There is no difference/effect.
- H₁ (Alternative Hypothesis): There is a significant difference/effect.
2️⃣ Set Significance Level (α)
- Commonly used values: 0.05 (5%) or 0.01 (1%).
- It represents the probability of rejecting H₀ when it is true (Type I Error).
3️⃣ Choose a Test Statistic
- Z-Test (for large samples, known variance).
- T-Test (for small samples, unknown variance).
- Chi-Square Test (for categorical data).
4️⃣ Calculate the Test Statistic & P-value
- The p-value measures the probability of observing the result if H₀ is true.
5️⃣ Compare P-value with α
- If p-value ≤ α → Reject H₀ (Significant result).
- If p-value > α → Fail to reject H₀ (No significant result).
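📌 Python sketch: the steps above map directly onto a few lines of SciPy; the sample values below are assumed for illustration.
import scipy.stats as stats
# Assumed sample data; H0: population mean = 10, H1: mean != 10
sample = [9.8, 10.2, 9.5, 10.1, 9.7, 9.9, 10.0, 9.6]
alpha = 0.05
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value <= alpha:
    print("Reject H0: significant difference from 10.")
else:
    print("Fail to reject H0: no significant difference.")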
3. Importance of Hypothesis Testing in Statistics
✅ Decision-Making → Used in research, business, and medicine to make data-driven decisions.
✅ Validates Assumptions → Helps check if a claim about a population is statistically valid.
✅ Risk Management → Reduces uncertainty in financial markets, A/B testing, and clinical trials.
✅ Scientific Research → Used to test theories in psychology, biology, and economics.
4. Example of Hypothesis Testing
💡 Scenario: A company claims their new drug increases recovery rates by 10%. We test this by comparing the recovery rates of 100 patients.
- H₀: The drug has no effect.
- H₁: The drug improves recovery rates.
- Conduct a t-test and obtain a p-value = 0.03.
- Since p-value < 0.05, we reject H₀ and conclude the drug is effective.
5. Conclusion
Hypothesis Testing is a powerful statistical tool used to validate assumptions, guide decision-making, and reduce uncertainty in various fields like healthcare, finance, and machine learning.
12) What is the Central Limit Theorem (CLT)? Why is it important in inferential statistics?
1. What is the Central Limit Theorem (CLT)?
The Central Limit Theorem (CLT) states that, regardless of the population’s original distribution, the sampling distribution of the sample mean will approach a normal distribution as the sample size increases (n ≥ 30).
✅ Key Points of CLT:
- The mean of the sample means approximates the population mean.
- The standard deviation of the sample means is called the Standard Error (SE): SE = σ / √n
- Larger sample sizes (n) result in a tighter, more normal distribution.
2. Why is CLT Important in Inferential Statistics?
✅ Allows Statistical Inference
- Enables us to make predictions about a population using a sample.
✅ Supports Hypothesis Testing
- Justifies using t-tests, Z-tests, and confidence intervals, even if data is not normally distributed.
✅ Used in Machine Learning
- Ensures model assumptions hold, especially in regression and classification tasks.
✅ Real-World Applications
- Polling & Surveys → Predict election results from a sample.
- Quality Control → Test product quality using small samples.
3. Example of CLT in Action
📌 Scenario: A factory produces metal rods with an unknown length distribution.
- Population Mean (μ) = 50 cm
- Population Standard Deviation (σ) = 10 cm
- If we take samples of size n (say n = 25, for illustration) multiple times:
- The sample means will form a normal distribution centered at 50 cm.
- The standard error will be: SE = σ / √n = 10 / √25 = 2 cm.
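📌 Python sketch: a short simulation makes the theorem concrete: even for a skewed population, the means of repeated samples pile up in a bell shape. The exponential population below is an assumption chosen for its skew.
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)  # heavily skewed population
sample_means = [rng.choice(population, size=30).mean() for _ in range(5000)]
print("Population mean:", population.mean().round(2))
print("Mean of sample means:", np.mean(sample_means).round(2))  # ≈ population mean
print("Std of sample means:", np.std(sample_means).round(2))    # ≈ sigma / sqrt(30)
print("Theoretical SE:", (population.std() / np.sqrt(30)).round(2))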
4. Conclusion
The Central Limit Theorem (CLT) is a fundamental concept in inferential statistics that enables us to estimate population parameters, perform hypothesis tests, and make predictions based on sample data. It is widely used in research, business, and machine learning.
13) What is a Confidence Interval? How is it calculated?
1. What is a Confidence Interval (CI)?
A Confidence Interval (CI) is a range of values that estimates a population parameter (like the mean) with a certain level of confidence. It provides an interval estimate instead of a single point estimate, accounting for variability in data.
✅ Key Points:
- Expressed as: CI = Point Estimate ± Margin of Error
- A 95% confidence interval means that 95 out of 100 times, the true population parameter will fall within this range.
- Wider CI → More uncertainty; Narrower CI → Higher precision.
2. How is a Confidence Interval Calculated?
Formula for CI (when population standard deviation σ is known, large sample n ≥ 30):
CI = x̄ ± z · (σ / √n)
Where:
- x̄ = Sample Mean
- z = Z-score for confidence level (e.g., 1.96 for 95%)
- σ = Population Standard Deviation
- n = Sample Size
Formula for CI (when population standard deviation is unknown, small sample n < 30):
CI = x̄ ± t · (s / √n)
Where:
- t = t-score from t-distribution (depends on degrees of freedom df = n − 1)
- s = Sample Standard Deviation
3. Example Calculation (95% CI, Large Sample, Known σ)
📌 Scenario: A sample of n = 100 students has an average height of 170 cm with a standard deviation of 15 cm.
Find the 95% confidence interval for the population mean height.
🔹 Z-score for 95% CI → z = 1.96
🔹 CI Calculation: CI = 170 ± 1.96 × (15 / √100) = 170 ± 1.96 × 1.5 = 170 ± 2.94 → (167.06, 172.94)
🔹 Interpretation: We are 95% confident that the true population mean height lies between 167.06 cm and 172.94 cm.
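📌 Python sketch: the same interval can be reproduced in a few lines; the numbers mirror the scenario above.
import numpy as np
import scipy.stats as stats
n, x_bar, sigma = 100, 170, 15
z = stats.norm.ppf(0.975)  # 1.96 for a 95% CI
margin = z * sigma / np.sqrt(n)
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
# 95% CI: (167.06, 172.94)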
4. Importance of Confidence Intervals in Statistics
✅ Estimates Population Parameters → Provides a range instead of a single point estimate.
✅ Helps in Decision-Making → Used in medicine, business, and research to assess reliability.
✅ Supports Hypothesis Testing → If the CI excludes a certain value, it can indicate statistical significance.
5. Conclusion
A Confidence Interval (CI) gives a range of values where the population parameter is likely to fall, considering sampling variability. It is crucial for statistical inference, risk assessment, and hypothesis testing in Data Science.
14) What is a t-test? Explain its applications with examples.
1. What is a t-test?
A t-test is a statistical test used to compare the means of one or two groups to determine if they are significantly different from each other. It is used when:
✅ The sample size is small (n < 30).
✅ The population standard deviation (σ) is unknown.
✅ The data follows a normal distribution (or is approximately normal).
2. Types of t-tests & Their Applications
Type | Purpose | Example Application |
---|---|---|
1. One-Sample t-test | Compares the mean of a single sample to a known population mean. | A company claims the average battery life of a phone is 10 hours. A sample of 15 phones is tested to see if the claim is true. |
2. Two-Sample t-test (Independent t-test) | Compares the means of two independent groups. | Testing if male and female students have different average test scores. |
3. Paired t-test (Dependent t-test) | Compares means from the same group before and after a change. | Measuring the effect of a new diet plan by comparing weights before and after following the diet. |
3. Formula for t-test
t = (x̄ − μ) / (s / √n)
Where:
- x̄ = Sample mean
- μ = Hypothesized population mean
- s = Sample standard deviation
- n = Sample size
4. Example: Independent t-test in Python
📌 Scenario: A researcher wants to check if two different teaching methods lead to different student performance.
import scipy.stats as stats
# Sample data: Test scores of two groups
group1 = [85, 90, 78, 92, 88]
group2 = [80, 83, 77, 85, 82]
# Perform independent t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
# Decision
if p_value < 0.05:
    print("Significant difference in teaching methods.")
else:
    print("No significant difference in teaching methods.")
5. Importance of t-tests in Data Science
✅ A/B Testing → Comparing website versions to measure effectiveness.
✅ Medical Research → Checking drug effectiveness before & after treatment.
✅ Quality Control → Ensuring product consistency across different batches.
✅ Business Analytics → Evaluating customer satisfaction between different stores.
6. Conclusion
A t-test helps determine if mean differences between groups are statistically significant. It is widely used in research, business, and machine learning for hypothesis testing and decision-making.
15) Differentiate between Type I and Type II errors in hypothesis testing.
Type I vs. Type II Errors in Hypothesis Testing
Error Type | Definition | Meaning | Example | Probability |
---|---|---|---|---|
Type I Error (False Positive) | Rejecting a true null hypothesis (H₀). | Detecting an effect that does not exist. | A medical test wrongly diagnosing a healthy person as sick. | Significance level (α), usually 5% (0.05). |
Type II Error (False Negative) | Failing to reject a false null hypothesis. | Missing an effect that actually exists. | A medical test failing to detect a disease in a sick person. | Beta (β), related to statistical power (1 − β). |
1. Explanation with an Example
💡 Scenario: Testing a new drug’s effectiveness.
- H₀: The drug has no effect.
- H₁: The drug is effective.
✅ Type I Error (False Positive):
- We reject H₀ (claim the drug works) when it actually doesn’t.
- Consequence: Approving an ineffective drug → Waste of resources, health risks.
✅ Type II Error (False Negative):
- We fail to reject H₀ (conclude no effect) when the drug actually works.
- Consequence: A useful drug is not approved → Missed opportunity for treatment.
2. How to Reduce These Errors?
🔹 Lowering Type I Error (α) → Use a stricter significance level (e.g., 0.01 instead of 0.05).
🔹 Reducing Type II Error (β) → Increase sample size or statistical power.
3. Conclusion
- Type I Error (False Positive): Detects a false effect (too cautious).
- Type II Error (False Negative): Misses a real effect (too lenient).
- Trade-off: Lowering one increases the other, so balance is needed in hypothesis testing.
16) The number of points scored by two teams in a hockey match is given below. With the help of Coefficient of Variation, determine which team is more consistent.
No. of Points Scored | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
No. of Matches (Team A) | 20 | 5 | 4 | 10 | 1 | 2 |
No. of Matches (Team B) | 7 | 15 | 10 | 3 | 2 | 5 |
Step 1: Calculate the Mean
For each team, the mean is given by:
x̄ = Σ(f · x) / Σf
Team A:
x̄_A = (0·20 + 1·5 + 2·4 + 3·10 + 4·1 + 5·2) / 42 = 57 / 42 ≈ 1.357
Team B:
x̄_B = (0·7 + 1·15 + 2·10 + 3·3 + 4·2 + 5·5) / 42 = 77 / 42 ≈ 1.833
Step 2: Calculate the Standard Deviation
Standard deviation measures how spread out the points are around the mean. It is calculated as:
σ = √[ Σ f(x − x̄)² / Σf ]
Team A Calculations:
Points (x) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Frequency (f) | 20 | 5 | 4 | 10 | 1 | 2 |
Sum of weighted squared deviations: Σ f(x − x̄)² ≈ 99.64
Variance for Team A:
σ²_A ≈ 99.64 / 42 ≈ 2.372 → σ_A ≈ 1.540
Team B Calculations:
Points (x) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Frequency (f) | 7 | 15 | 10 | 3 | 2 | 5 |
Sum of weighted squared deviations: Σ f(x − x̄)² ≈ 97.83 (rounded)
Variance for Team B:
σ²_B ≈ 97.83 / 42 ≈ 2.329 → σ_B ≈ 1.526
Step 3: Calculate the Coefficient of Variation (CV)
The CV is given by:
CV = (σ / x̄) × 100%
Team A:
CV_A = (1.540 / 1.357) × 100 ≈ 113.5%
Team B:
CV_B = (1.526 / 1.833) × 100 ≈ 83.2%
Step 4: Interpretation
- A lower CV indicates less relative variability and hence more consistency.
- Team B’s CV (≈83.2%) is lower than Team A’s CV (≈113.5%), indicating that Team B is more consistent in scoring.
Final Answer:
Team B is more consistent than Team A based on the Coefficient of Variation.
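📌 Python sketch: as a quick check, the whole computation can be done with NumPy using frequency-weighted averages; the numbers reproduce the hand calculation above.
import numpy as np
points = np.array([0, 1, 2, 3, 4, 5])
freq_a = np.array([20, 5, 4, 10, 1, 2])
freq_b = np.array([7, 15, 10, 3, 2, 5])
def cv(points, freq):
    # frequency-weighted mean and variance, then CV in percent
    mean = np.average(points, weights=freq)
    var = np.average((points - mean) ** 2, weights=freq)
    return np.sqrt(var) / mean * 100
print(f"Team A CV: {cv(points, freq_a):.1f}%")  # ≈ 113.5%
print(f"Team B CV: {cv(points, freq_b):.1f}%")  # ≈ 83.2%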
17) Coefficients of Variation and Standard Deviation of two series X and Y are 55.43% and 48.86%, and 25.5 and 24.43, respectively. Find the means of series X and Y.
The coefficient of variation (CV) is defined as:
CV = (σ / x̄) × 100
where:
- σ is the standard deviation, and
- x̄ is the mean.
We can rearrange the formula to solve for the mean:
x̄ = (σ / CV) × 100
For Series X:
- Given: CV_X = 55.43%, σ_X = 25.5
- x̄_X = (25.5 / 55.43) × 100 ≈ 46.0
For Series Y:
- Given: CV_Y = 48.86%, σ_Y = 24.43
- x̄_Y = (24.43 / 48.86) × 100 = 50.0
Final Answer:
- Mean of Series X ≈ 46
- Mean of Series Y = 50
18) The standard deviation and mean of the data are 8.5 and 14.5 respectively. Find the coefficient of variation.
The coefficient of variation (CV) is calculated using the formula:
CV = (σ / x̄) × 100
where:
- σ is the standard deviation,
- x̄ is the mean.
Given:
- σ = 8.5,
- x̄ = 14.5,
Plugging the values into the formula:
CV = (8.5 / 14.5) × 100 ≈ 58.62%
Thus, the coefficient of variation is approximately 58.62%.
19) If the mean and coefficient of deviation of data are 13 and 38 respectively, then locate the value of expected variation?
The coefficient of variation (CV) is given by the formula:
CV = (σ / x̄) × 100
Where:
- σ is the standard deviation (expected variation), and
- x̄ is the mean.
We are given: x̄ = 13 and CV = 38.
To find the standard deviation σ, rearrange the formula:
σ = (CV × x̄) / 100
Calculating:
σ = (38 × 13) / 100 = 494 / 100 = 4.94
Thus, the expected variation (standard deviation) is approximately 4.94.
20) The mean and standard deviation of marks received by 40 students of a class in three subjects Mathematics, English and Economics are given below. Which of the three subjects indicates the most elevated deviation and which indicates the most subordinate variation in marks?
Subject | Mean | Standard Deviation |
---|---|---|
Maths | 65 | 10 |
English | 60 | 12 |
Economics | 57 | 14 |
To compare the variability in marks across the three subjects, we use the Coefficient of Variation (CV), which is calculated as:
CV = (Standard Deviation / Mean) × 100
Let’s compute the CV for each subject:
- Maths: CV = (10 / 65) × 100 ≈ 15.38%
- English: CV = (12 / 60) × 100 = 20%
- Economics: CV = (14 / 57) × 100 ≈ 24.56%
Interpretation:
- The highest CV is in Economics (≈24.56%), which means Economics has the most elevated deviation in marks.
- The lowest CV is in Maths (≈15.38%), indicating that Maths has the most subordinate (least) variation in marks.
Final Answer:
- Economics indicates the most elevated deviation.
- Maths indicates the most subordinate variation in marks.
21) In a small business firm, two typists are employed: Typist A and Typist B. Typist A types out, on an average, 30 pages per day with a standard deviation of 6. Typist B, on an average, types out 45 pages with a standard deviation of 10. Which typist shows greater consistency in his output?
We determine consistency by comparing the Coefficient of Variation (CV) for each typist. The CV is calculated as:
CV = (σ / x̄) × 100
For Typist A: CV_A = (6 / 30) × 100 = 20%
For Typist B: CV_B = (10 / 45) × 100 ≈ 22.22%
Since a lower CV indicates more consistency in performance, Typist A (with a CV of 20%) shows greater consistency than Typist B (with a CV of approximately 22.22%).
22) The male population’s weight data follows a normal distribution. It has a mean of 70 kg and a standard deviation of 15 kg. What would the mean and standard deviation of a sample of 50 guys be if a researcher looked at their records?
The Central Limit Theorem tells us that the sample mean will have the same mean as the population, and its standard deviation (called the standard error) will be the population standard deviation divided by the square root of the sample size.
Given:
- Population mean, μ = 70 kg
- Population standard deviation, σ = 15 kg
- Sample size, n = 50
Mean of the sample: μ_x̄ = μ = 70 kg
Standard deviation of the sample (Standard Error): SE = σ / √n = 15 / √50 ≈ 2.12 kg
Thus, for a sample of 50 guys, the mean weight would be 70 kg and the standard deviation of the sample would be approximately 2.12 kg.
23) A distribution has a mean of 69 and a standard deviation of 420. Find the mean and standard deviation if a sample of 80 is drawn from the distribution.
For a sample drawn from a distribution, the sample mean remains the same as the population mean, and the standard deviation of the sample mean (also called the standard error) is calculated by dividing the population standard deviation by the square root of the sample size.
Given:
- Population mean, μ = 69
- Population standard deviation, σ = 420
- Sample size, n = 80
Mean of the sample: μ_x̄ = μ = 69
Standard deviation of the sample (Standard Error): SE = σ / √n = 420 / √80
Calculating √80 ≈ 8.944:
Thus, SE ≈ 420 / 8.944 ≈ 46.96 ≈ 47
Final Answer:
- Mean of the sample: 69
- Standard deviation of the sample: approximately 47
24) A boy collects some rupees in a week as follows (25, 28, 26, 30, 40, 50, 40). Find the skewness and kurtosis of the given data with the help of the skewness formula.
Let’s first list the data: 25, 28, 26, 30, 40, 50, 40 (n = 7)
We’ll use the following steps:
- Compute the Mean (x̄)
- Compute the deviations and then the standard deviation (σ)
- Compute the third and fourth central moments
- Calculate skewness and kurtosis using the “moment” formulas
Note: There are several formulas for sample skewness and kurtosis (with bias corrections). Here, we use the “population-moment” approach as an illustration.
Step 1. Mean
x̄ = (25 + 28 + 26 + 30 + 40 + 50 + 40) / 7 = 239 / 7 ≈ 34.14
It is often useful to work exactly in fractions, so we keep the mean as 239/7.
Step 2. Deviations and Standard Deviation
Express each deviation in fractional form (using denominator 7):
- For 25: 25 − 239/7 = −64/7 ≈ −9.14
- For 28: 28 − 239/7 = −43/7 ≈ −6.14
- For 26: 26 − 239/7 = −57/7 ≈ −8.14
- For 30: 30 − 239/7 = −29/7 ≈ −4.14
- For 40: 40 − 239/7 = 41/7 ≈ 5.86
- For 50: 50 − 239/7 = 111/7 ≈ 15.86
- For the other 40: Again, 41/7 ≈ 5.86
Now, the squared deviations are:
(−64/7)² = 4096/49, (−43/7)² = 1849/49, (−57/7)² = 3249/49, (−29/7)² = 841/49, (41/7)² = 1681/49, (111/7)² = 12321/49, (41/7)² = 1681/49
Sum of squared deviations: 25718/49 ≈ 524.86
The population (or “moment-based”) variance is then:
σ² = 524.86 / 7 ≈ 74.98
So, the standard deviation is:
σ ≈ √74.98 ≈ 8.66
(Note: Using the sample formula with n − 1 would yield a slightly larger value, but here we proceed with the moment method for simplicity.)
Step 3. Third and Fourth Central Moments
Third Central Moment (for Skewness)
We need Σ(xᵢ − x̄)³.
Using our deviations, the cubes are approximately:
- For 25: (−64/7)³ ≈ −764.27
- For 28: (−43/7)³ ≈ −231.80
- For 26: (−57/7)³ ≈ −539.92
- For 30: (−29/7)³ ≈ −71.11
- For 40: (41/7)³ ≈ 200.94
- For 50: (111/7)³ ≈ 3987.26
- For the other 40: again ≈ 200.94
Let’s add the negatives: −764.27 − 231.80 − 539.92 − 71.11 ≈ −1607.10
Now, the positives: 200.94 + 3987.26 + 200.94 ≈ 4389.14
Thus, sum of cubes ≈ 4389.14 − 1607.10 ≈ 2782.04
Now, the average third moment is:
m₃ = 2782.04 / 7 ≈ 397.43
Skewness Calculation
Using the formula for skewness based on moments:
Skewness = m₃ / σ³
We have σ ≈ 8.66, so σ³ ≈ 649.3. Thus:
Skewness ≈ 397.43 / 649.3 ≈ 0.61
This indicates a moderate positive skew.
Fourth Central Moment (for Kurtosis)
We need Σ(xᵢ − x̄)⁴.
We calculate the fourth powers (approximately, using our decimal deviations):
- For 25: (−9.14)⁴ ≈ 6987.6
- For 28: (−6.14)⁴ ≈ 1423.9
- For 26: (−8.14)⁴ ≈ 4396.5
- For 30: (−4.14)⁴ ≈ 294.6
- For 40: (5.86)⁴ ≈ 1176.9
- For 50: (15.86)⁴ ≈ 63227.1
- For the other 40: again ≈ 1176.9
Sum of fourth powers (approximate): ≈ 78683
Average fourth moment:
m₄ = 78683 / 7 ≈ 11240.5
Now, the denominator for kurtosis is σ⁴. With σ² ≈ 74.98,
σ⁴ = (σ²)² ≈ 5622
Then, the kurtosis (using the “raw” moment definition) is:
Kurtosis = m₄ / σ⁴ ≈ 11240.5 / 5622 ≈ 2.0
In many texts, a normal distribution has a kurtosis of 3. Thus the excess kurtosis here is about 2.0 − 3 = −1.0, indicating a platykurtic (flatter) distribution relative to the normal curve.
Final Answers
- Skewness: Approximately 0.61 (indicating a moderate positive skew).
- Kurtosis: Approximately 2.0 (or an excess kurtosis of about −1.0), meaning the distribution is flatter than a normal distribution.
Summary:
For the data (25, 28, 26, 30, 40, 50, 40), the skewness is about 0.61 and the kurtosis is about 2.0.
Keep in mind that slight differences may arise depending on the specific formula (and bias correction) used for sample skewness and kurtosis.
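📌 Python sketch: these hand results can be checked against SciPy, whose default (biased) moment formulas match the population-moment approach used above.
import scipy.stats as stats
data = [25, 28, 26, 30, 40, 50, 40]
print("Skewness:", round(stats.skew(data), 3))             # ≈ 0.612
print("Excess kurtosis:", round(stats.kurtosis(data), 3))  # ≈ -1.0 (raw ≈ 2.0)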
25) Calculate Population Skewness, Population Kurtosis from the following grouped data and explain the type of kurtosis and skewness of the data.
Class Interval | Frequency (f) | Midpoint (x) |
---|---|---|
10 - 20 | 2 | 15 |
20 - 30 | 3 | 25 |
30 - 40 | 5 | 35 |
The total frequency is:
N = 2 + 3 + 5 = 10
We’ll calculate the population moments (using the formulas for the entire population) and then obtain the population skewness and kurtosis.
Step 1. Calculate the Mean
The mean is given by:
x̄ = Σ(f · x) / N = (2·15 + 3·25 + 5·35) / 10 = (30 + 75 + 175) / 10 = 280 / 10 = 28
Step 2. Calculate the Variance and Standard Deviation
The variance (σ²) is:
σ² = Σ f(x − x̄)² / N
First, compute (x − x̄) and (x − x̄)² for each class:
- For x = 15: (15 − 28) = −13 and (−13)² = 169
Contribution: 2 × 169 = 338
- For x = 25: (25 − 28) = −3 and (−3)² = 9
Contribution: 3 × 9 = 27
- For x = 35: (35 − 28) = 7 and 7² = 49
Contribution: 5 × 49 = 245
Now, sum the contributions: 338 + 27 + 245 = 610
So, σ² = 610 / 10 = 61
The standard deviation is: σ = √61 ≈ 7.81
Step 3. Calculate the Third Central Moment and Skewness
The third central moment is:
m₃ = Σ f(x − x̄)³ / N
Compute (x − x̄)³ for each midpoint:
- For x = 15: (−13)³ = −2197, so Contribution: 2 × (−2197) = −4394
- For x = 25: (−3)³ = −27, so Contribution: 3 × (−27) = −81
- For x = 35: 7³ = 343, so Contribution: 5 × 343 = 1715
Sum the contributions: −4394 − 81 + 1715 = −2760
Then, the third central moment is: m₃ = −2760 / 10 = −276
Population skewness is given by:
Skewness = m₃ / σ³
We already have σ ≈ 7.81. Note that: σ³ = 61^(3/2) ≈ 476.4
Thus: Skewness ≈ −276 / 476.4 ≈ −0.58
Interpretation:
A skewness of about −0.58 indicates a moderate negative skew (the left tail is longer or more pronounced than the right).
Step 4. Calculate the Fourth Central Moment and Kurtosis
The fourth central moment is:
m₄ = Σ f(x − x̄)⁴ / N
Compute (x − x̄)⁴ for each midpoint:
- For x = 15: (−13)⁴ = 28561, so Contribution: 2 × 28561 = 57122
- For x = 25: (−3)⁴ = 81, so Contribution: 3 × 81 = 243
- For x = 35: 7⁴ = 2401, so Contribution: 5 × 2401 = 12005
Sum the contributions: 57122 + 243 + 12005 = 69370
Then, m₄ = 69370 / 10 = 6937
Population kurtosis is given by:
Kurtosis = m₄ / σ⁴
We have: σ⁴ = 61² = 3721
Thus, Kurtosis ≈ 6937 / 3721 ≈ 1.86
Note on Interpretation:
- A normal distribution has a kurtosis of 3 (using the “raw” kurtosis definition).
- Here, the calculated kurtosis is about 1.86, which is less than 3.
- The excess kurtosis (kurtosis minus 3) is approximately −1.14.
This indicates a platykurtic distribution, one that is flatter than the normal distribution with thinner tails.
Final Summary and Interpretation
- Mean: 28
- Standard Deviation: ≈ 7.81
- Population Skewness: ≈ −0.58
This indicates a moderate negative skew (the distribution is slightly tilted to the left).
- Population Kurtosis: ≈ 1.86 (or an excess kurtosis of about −1.14)
This indicates that the distribution is platykurtic, meaning it is flatter with lighter tails than a normal distribution.
Conclusion:
The given grouped data shows a distribution with a moderate negative skew and a platykurtic (flatter than normal) shape.
26) A nutritionist claims that the average sugar content in a brand of cereal is less than 10 grams per serving. A random sample of 30 cereal boxes shows an average sugar content of 9.5 grams with a standard deviation of 1.2 grams. At a 5% significance level (α = 0.05), test whether the nutritionist’s claim is supported.
Step 1: State the Hypotheses
- Null Hypothesis (H₀): The mean sugar content μ ≥ 10 grams.
- Alternative Hypothesis (H₁): The mean sugar content μ < 10 grams.
This is a one-tailed (left-tailed) test since the claim is that the average is less than 10 grams.
Step 2: Compute the Test Statistic
Given:
- Sample size, n = 30
- Sample mean, x̄ = 9.5 grams
- Sample standard deviation, s = 1.2 grams
- Significance level, α = 0.05
Since the population standard deviation is unknown and n is moderate, we use the t-test. The test statistic is calculated by:
t = (x̄ − μ₀) / (s / √n)
where μ₀ = 10 grams (hypothesized mean).
Plugging in the values:
t = (9.5 − 10) / (1.2 / √30) = −0.5 / 0.219 ≈ −2.28
Step 3: Determine the Critical t-value
For a one-tailed test at α = 0.05 with df = 29, the critical t-value is approximately: −1.699
Step 4: Decision
Since the calculated t-value (−2.28) is less than −1.699 (i.e., it falls in the rejection region), we reject the null hypothesis.
Step 5: Conclusion
At the 5% significance level, the sample provides sufficient evidence to support the nutritionist’s claim that the average sugar content in the cereal is less than 10 grams per serving.
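📌 Python sketch: the p-value for this test can be recovered from the summary statistics with SciPy, confirming the decision above.
import numpy as np
import scipy.stats as stats
n, x_bar, s, mu0 = 30, 9.5, 1.2, 10
t = (x_bar - mu0) / (s / np.sqrt(n))
p_value = stats.t.cdf(t, df=n - 1)  # left-tailed p-value
print(f"t = {t:.3f}, p = {p_value:.4f}")  # t ≈ -2.28, p ≈ 0.015 < 0.05 → reject H0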
27) A manufacturer claims that the average lifespan of its LED bulbs is at least 25,000 hours. A consumer protection agency tests 40 randomly selected bulbs and finds an average lifespan of 24,500 hours with a standard deviation of 1,200 hours. At a 5% significance level (α = 0.05), test whether the agency’s data contradicts the manufacturer’s claim.
Step 1: Formulate the Hypotheses
- Null Hypothesis (H₀): μ ≥ 25,000 hours (the average lifespan is at least 25,000 hours).
- Alternative Hypothesis (H₁): μ < 25,000 hours (the average lifespan is less than 25,000 hours).
This is a one-tailed (left-tailed) test.
Step 2: Compute the Test Statistic
Given:
- Sample size, n = 40
- Sample mean, x̄ = 24,500 hours
- Sample standard deviation, s = 1,200 hours
The t-statistic is computed as:
t = (x̄ − μ₀) / (s / √n)
Where μ₀ = 25,000 hours.
First, calculate the standard error (SE):
SE = s / √n = 1200 / √40 ≈ 1200 / 6.325 ≈ 189.74
Then, calculate the t-statistic:
t = (24500 − 25000) / 189.74 = −500 / 189.74 ≈ −2.64
Step 3: Determine the Critical Value
For a one-tailed test at α = 0.05 with df = 39, the critical t-value is approximately: −1.685
Step 4: Decision
Since the calculated t-value (−2.64) is less than the critical t-value (−1.685), it falls in the rejection region.
Step 5: Conclusion
At the 5% significance level, we reject the null hypothesis. The consumer protection agency’s data provides sufficient evidence to conclude that the average lifespan of the LED bulbs is less than 25,000 hours. Therefore, the agency’s findings contradict the manufacturer’s claim.
28) A soft drink company claims that the average sugar content in its cola is 39 grams per can. A health organization collects a random sample of 50 cans and finds the average sugar content is 40 grams, with a standard deviation of 2 grams. At a 1% significance level (α = 0.01), test if the actual sugar content is different from 39 grams.
Step 1: State the Hypotheses
- Null Hypothesis (H₀): The average sugar content is 39 grams per can (μ = 39).
- Alternative Hypothesis (H₁): The average sugar content is different from 39 grams (μ ≠ 39).
This is a two-tailed test.
Step 2: Compute the Test Statistic
Given:
- Sample size, n = 50
- Sample mean, x̄ = 40 grams
- Sample standard deviation, s = 2 grams
The test statistic (t) is calculated as:
t = (x̄ − μ₀) / (s / √n)
Calculate the standard error:
SE = s / √n = 2 / √50 ≈ 2 / 7.071 ≈ 0.283
Now, compute t:
t = (40 − 39) / 0.283 ≈ 3.54
Step 3: Determine the Critical t-value
For a two-tailed test at α = 0.01 with df = 49, the critical t-value is approximately ±2.68 (using standard t-distribution tables).
Step 4: Decision
Since the calculated t-value (3.54) exceeds the critical value (2.68) in absolute value, we reject the null hypothesis.
Step 5: Conclusion
At the 1% significance level, there is sufficient evidence to conclude that the actual sugar content in the cola is different from 39 grams per can. Given that the sample mean is 40 grams, it appears that the sugar content is higher than claimed.
29) A company manufacturing automobiles finds that tyre-life is normally distributed with a mean of 40,000 km and standard deviation of 3000 km. It is believed that a change in the production process will result in a better product and the company has developed a new tyre. A sample of 100 new tyres has been selected. The company has found that the mean life of these new tyres is 40,900 Km. Can it be concluded that the new tyre is significantly better than the old one, using the significance level of 0.01.
Hint: we are interested in testing whether or not there has been an increase in the mean life of tyres, i.e., whether the mean life of the new tyre has increased beyond 40,000 km.
Step 1: Define the Hypotheses
- Null Hypothesis (H₀): The new tyre has the same mean life as the old one, μ = 40,000 km.
- Alternative Hypothesis (H₁): The new tyre has a higher mean life, μ > 40,000 km.
This is a one-tailed (right-tailed) test.
Step 2: Calculate the Test Statistic
Given:
- Population mean (old tyre), μ = 40,000 km
- Population standard deviation, σ = 3,000 km
- Sample size, n = 100
- Sample mean (new tyre), x̄ = 40,900 km
Since the tyre-life is normally distributed and the standard deviation is known, we use the z-test:
z = (x̄ − μ) / (σ / √n)
Calculate the standard error:
SE = σ / √n = 3000 / √100 = 3000 / 10 = 300
Then, the z-value is:
z = (40900 − 40000) / 300 = 900 / 300 = 3
Step 3: Determine the Critical Value
At a significance level of α = 0.01 for a one-tailed test, the critical z-value is approximately 2.33.
Step 4: Make the Decision
Since the calculated z-value (3) is greater than the critical value (2.33), we reject the null hypothesis.
Step 5: Conclusion
At the 0.01 significance level, there is sufficient evidence to conclude that the mean life of the new tyres is significantly greater than 40,000 km. Therefore, the new tyre is significantly better than the old one.
30) Following are the runs scored by two batsmen in 5 cricket matches, Who is more consistent in scoring runs?
Score 1 | Score 2 | Score 3 | Score 4 | Score 5 | |
---|---|---|---|---|---|
Batsman A | 38 | 47 | 34 | 18 | 33 |
Batsman B | 37 | 35 | 41 | 27 | 35 |
To assess consistency, we calculate the mean and the standard deviation, and then use the Coefficient of Variation (CV):
CV = (s / x̄) × 100
Batsman A
Runs: 38, 47, 34, 18, 33
- Mean (x̄): (38 + 47 + 34 + 18 + 33) / 5 = 170 / 5 = 34
- Deviations and Squared Deviations:
Score | Deviation (x − x̄) | Squared Deviation |
---|---|---|
38 | 38 − 34 = 4 | 16 |
47 | 47 − 34 = 13 | 169 |
34 | 34 − 34 = 0 | 0 |
18 | 18 − 34 = −16 | 256 |
33 | 33 − 34 = −1 | 1 |
Sum of Squared Deviations: 16 + 169 + 0 + 256 + 1 = 442
- Sample Variance and Standard Deviation:
Using n − 1 = 4 (for a sample of 5 matches):
s² = 442 / 4 = 110.5 → s ≈ 10.51
- Coefficient of Variation (CV):
CV_A = (10.51 / 34) × 100 ≈ 30.9%
Batsman B
Runs: 37, 35, 41, 27, 35
- Mean (x̄): (37 + 35 + 41 + 27 + 35) / 5 = 175 / 5 = 35
- Deviations and Squared Deviations:
Score | Deviation (x − x̄) | Squared Deviation |
---|---|---|
37 | 37 − 35 = 2 | 4 |
35 | 35 − 35 = 0 | 0 |
41 | 41 − 35 = 6 | 36 |
27 | 27 − 35 = −8 | 64 |
35 | 35 − 35 = 0 | 0 |
Sum of Squared Deviations: 4 + 0 + 36 + 64 + 0 = 104
- Sample Variance and Standard Deviation:
s² = 104 / 4 = 26 → s ≈ 5.10
- Coefficient of Variation (CV):
CV_B = (5.10 / 35) × 100 ≈ 14.6%
Conclusion
- Batsman A has a CV of approximately 30.9%.
- Batsman B has a CV of approximately 14.6%.
A lower coefficient of variation indicates greater consistency. Therefore, Batsman B is more consistent in scoring runs than Batsman A.
31) Find the skewness for the given Data ( 2,4,6,6) :
Skewness = 3(Mean – Median)/S.D.
Step 1: Calculate the Mean, Median, and Standard Deviation
Given Data: 2, 4, 6, 6
- Mean (x̄): (2 + 4 + 6 + 6) / 4 = 18 / 4 = 4.5
- Median:
For the ordered data (2, 4, 6, 6), the median is the average of the two middle values: (4 + 6) / 2 = 5
- Standard Deviation (Population Standard Deviation):
Calculate each deviation from the mean and then square them:
(2 − 4.5)² = 6.25, (4 − 4.5)² = 0.25, (6 − 4.5)² = 2.25, (6 − 4.5)² = 2.25
Sum of squared deviations: 6.25 + 0.25 + 2.25 + 2.25 = 11
Population variance: σ² = 11 / 4 = 2.75
Standard deviation: σ = √2.75 ≈ 1.66
Step 2: Calculate Skewness
Using the formula:
Skewness = 3(Mean − Median) / S.D.
Plug in the values:
Skewness = 3(4.5 − 5) / 1.66 = −1.5 / 1.66 ≈ −0.90
Final Answer:
The skewness for the given data is approximately -0.90, which indicates a slight negative skew (the distribution is skewed to the left).
32) For the given observations {23, 24, 56, 55, 28, 38, 48}, calculate:
- Skewness
- Kurtosis
- Determine the type of kurtosis
We’ll compute the population skewness and kurtosis for the data {23, 24, 56, 55, 28, 38, 48} using the following steps:
Step 1. Compute the Mean
x̄ = (23 + 24 + 56 + 55 + 28 + 38 + 48) / 7 = 272 / 7 ≈ 38.86
Step 2. Compute the Median
First, sort the data: 23, 24, 28, 38, 48, 55, 56
Since there are 7 values, the median is the 4th value: Median = 38
Step 3. Compute the Standard Deviation (Population)
For each observation x, compute the deviation (x − x̄) and its square:
x | x − x̄ | (x − x̄)² |
---|---|---|
23 | −15.86 | 251.45 |
24 | −14.86 | 220.73 |
56 | 17.14 | 293.88 |
55 | 16.14 | 260.59 |
28 | −10.86 | 117.88 |
38 | −0.86 | 0.73 |
48 | 9.14 | 83.59 |
Sum of squared deviations: ≈ 1228.86
Since we are treating this as the entire population, the variance is
σ² = 1228.86 / 7 ≈ 175.55
and the population standard deviation is
σ = √175.55 ≈ 13.25
Step 4. Compute Population Skewness
A quick (Pearson’s) measure of skewness is given by
Skewness = 3(Mean − Median) / σ
Using our values:
Skewness = 3(38.86 − 38) / 13.25 ≈ 2.57 / 13.25 ≈ 0.194
This indicates a slight positive skew (a small positive value).
Step 5. Compute Population Kurtosis
Population kurtosis (using the raw moment definition) is given by
Kurtosis = [Σ(x − x̄)⁴ / n] / σ⁴
Calculate the Fourth Powers
For each observation, compute (x − x̄)⁴:
x | x − x̄ | (x − x̄)⁴ (approx.) |
---|---|---|
23 | −15.86 | 63,227 |
24 | −14.86 | 48,722 |
56 | 17.14 | 86,366 |
55 | 16.14 | 67,907 |
28 | −10.86 | 13,896 |
38 | −0.86 | 0.54 |
48 | 9.14 | 6,987 |
Now, sum these values: Σ(x − x̄)⁴ ≈ 287,106
Average fourth moment: m₄ = 287,106 / 7 ≈ 41,015
Next, calculate σ⁴: Since σ² ≈ 175.55, then σ⁴ = (σ²)² ≈ 30,818
Thus, the kurtosis is
Kurtosis ≈ 41,015 / 30,818 ≈ 1.33
For many definitions, a normal distribution has a kurtosis of 3. When we subtract 3, we get an excess kurtosis of ≈ −1.67.
Step 6. Interpretation
- Skewness: ~0.194
The small positive value indicates a slightly right-skewed distribution. - Kurtosis: ~1.33 (or an excess kurtosis of –1.67)
Since the kurtosis is less than 3, the distribution is platykurtic (flatter than the normal distribution with thinner tails).
Final Answers
- Population Skewness: Approximately 0.19 (indicating a slight positive skew).
- Population Kurtosis: Approximately 1.33 (excess kurtosis ≈ –1.67), which means the data is platykurtic (flatter than a normal distribution).
33) Given the weights of five persons: 120, 140, 150, 160, and 180, find the following:
- Mean
- Median
- Mode
- Standard deviation
- Variance
- Interquartile range
- Box plot
- Whisker plot
Below are the steps and results for the given weights:
Data: 120, 140, 150, 160, 180
1. Mean
Mean = (120 + 140 + 150 + 160 + 180) / 5 = 750 / 5 = 150
2. Median
When the data are arranged in order (which they already are), the middle (3rd) value is: Median = 150
3. Mode
Each weight appears only once, so there is no mode (or no unique mode).
4. Variance and Standard Deviation
Step 4a. Compute the squared deviations from the mean:
- For 120: (120 − 150)² = 900
- For 140: (140 − 150)² = 100
- For 150: (150 − 150)² = 0
- For 160: (160 − 150)² = 100
- For 180: (180 − 150)² = 900
Step 4b. Sum of squared deviations: 900 + 100 + 0 + 100 + 900 = 2000
Assuming these 5 values represent the entire population:
Variance σ² = 2000 / 5 = 400 → Standard Deviation σ = √400 = 20
If you were to use the sample formula (n − 1), then: s² = 2000 / 4 = 500 → s ≈ 22.36
For this exercise, we assume the given values form the entire population.
5. Interquartile Range (IQR)
Step 5a. Determine Q1 and Q3:
For 5 data points, the median (Q2) is 150.
- The lower half (excluding the median) is: 120, 140 → Q1 = (120 + 140) / 2 = 130
- The upper half is: 160, 180 → Q3 = (160 + 180) / 2 = 170
Step 5b. Compute IQR: IQR = Q3 − Q1 = 170 − 130 = 40
6. Box Plot / Whisker Plot
A box (or whisker) plot for these data is constructed using:
- Minimum: 120
- Q1: 130
- Median: 150
- Q3: 170
- Maximum: 180
On the plot:
- The box spans from Q1 (130) to Q3 (170).
- A line is drawn at the median (150) inside the box.
- The “whiskers” extend from the box out to the minimum (120) and maximum (180).
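📌 Python sketch: a minimal Matplotlib box plot of the five weights. Note that Matplotlib's default quartile rule (linear interpolation) may place the box edges at 140 and 160, slightly different from the median-of-halves values (130 and 170) used above.
import matplotlib.pyplot as plt
weights = [120, 140, 150, 160, 180]
plt.boxplot(weights, vert=False)  # box from Q1 to Q3, whiskers to min/max
plt.title("Box Plot of Weights")
plt.xlabel("Weight")
plt.show()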
Summary of Results
- Mean: 150
- Median: 150
- Mode: None (no repeated value)
- Variance: 400
- Standard Deviation: 20
- Interquartile Range (IQR): 40
- Box Plot / Whisker Plot:
- Minimum = 120
- Q1 = 130
- Median = 150
- Q3 = 170
- Maximum = 180
These calculations provide a complete picture of the central tendency and dispersion for the given data.
34) A random sample of n = 500 observations from a binomial population produced x = 240 successes.
- Find a point estimate for p and place a 95% confidence interval.
- Find a 90% confidence interval for p.
Step 1: Point Estimate for p
The point estimate for the probability of success, p, is given by:
p̂ = x / n = 240 / 500 = 0.48
Step 2: Standard Error Calculation
The standard error (SE) for a proportion is:
SE = √[ p̂(1 − p̂) / n ]
Calculating:
SE = √[ (0.48 × 0.52) / 500 ] = √0.0004992 ≈ 0.0223
Step 3: 95% Confidence Interval for p
For a 95% confidence interval, the critical z-value is approximately z = 1.96.
The margin of error (ME) is:
ME = 1.96 × 0.0223 ≈ 0.0438
Thus, the 95% confidence interval is:
0.48 ± 0.0438 → (0.436, 0.524)
Step 4: 90% Confidence Interval for p
For a 90% confidence interval, the critical z-value is approximately z = 1.645.
The margin of error is:
ME = 1.645 × 0.0223 ≈ 0.0368
Thus, the 90% confidence interval is:
0.48 ± 0.0368 → (0.443, 0.517)
Summary of Results
- Point Estimate for p: 0.48
- 95% Confidence Interval: Approximately (0.436, 0.524)
- 90% Confidence Interval: Approximately (0.443, 0.517)
These intervals indicate that we are 95% confident that the true proportion of successes lies between about 43.6% and 52.4%, and 90% confident it lies between about 44.3% and 51.7%.
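📌 Python sketch: a short SciPy snippet reproduces both intervals from the counts alone.
import numpy as np
import scipy.stats as stats
x, n = 240, 500
p_hat = x / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
for conf in (0.95, 0.90):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    print(f"{conf:.0%} CI: ({p_hat - z*se:.3f}, {p_hat + z*se:.3f})")
# 95% CI: (0.436, 0.524)
# 90% CI: (0.443, 0.517)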
35) Given the observations {6, 8, 10, 12, 14, 16, 18, 20, 22, 24}, calculate the following:
- Mean
- Median
- Standard deviation
- Variance
- Skewness
- Kurtosis
- Lower quartile
- Upper quartile
- Middle quartile
- Interquartile range
- Range
Let’s start by listing the data and then calculate each measure step‐by‐step.
Data:
6, 8, 10, 12, 14, 16, 18, 20, 22, 24
1. Mean
Mean = (6 + 8 + 10 + 12 + 14 + 16 + 18 + 20 + 22 + 24) / 10 = 150 / 10 = 15
2. Median
Since there are 10 observations (an even number), the median is the average of the 5th and 6th values.
The ordered data: 6, 8, 10, 12, 14, 16, 18, 20, 22, 24
Median = (14 + 16) / 2 = 15
3. Variance and Standard Deviation
Step 3a. Compute the deviations from the mean and square them:
x | x − x̄ | (x − x̄)² |
---|---|---|
6 | 6 – 15 = –9 | 81 |
8 | –7 | 49 |
10 | –5 | 25 |
12 | –3 | 9 |
14 | –1 | 1 |
16 | 1 | 1 |
18 | 3 | 9 |
20 | 5 | 25 |
22 | 7 | 49 |
24 | 9 | 81 |
Sum of squared deviations: 81 + 49 + 25 + 9 + 1 + 1 + 9 + 25 + 49 + 81 = 330.
For the population variance (assuming the data represent the entire population):
σ² = 330 / 10 = 33
Standard deviation: σ = √33 ≈ 5.74
4. Skewness
Since the data are perfectly symmetric around the mean (mean = median = 15), the skewness is 0.
Using Pearson’s formula for skewness: Skewness = 3(Mean − Median) / σ = 3(15 − 15) / 5.74 = 0
5. Kurtosis
The population kurtosis (using the fourth moment) is computed as:
Kurtosis = m₄ / σ⁴
where m₄ = Σ(x − x̄)⁴ / n
Step 5a. Compute (x − x̄)⁴ for each observation:
x | x − x̄ | (x − x̄)⁴ |
---|---|---|
6 | −9 | 6561 |
8 | −7 | 2401 |
10 | −5 | 625 |
12 | −3 | 81 |
14 | −1 | 1 |
16 | 1 | 1 |
18 | 3 | 81 |
20 | 5 | 625 |
22 | 7 | 2401 |
24 | 9 | 6561 |
Sum of fourth powers: 2 × (6561 + 2401 + 625 + 81 + 1) = 19,338.
Now, m₄ = 19,338 / 10 = 1933.8
Next, calculate σ⁴. We already found σ² = 33, so σ⁴ = 33² = 1089.
Thus, Kurtosis = 1933.8 / 1089 ≈ 1.78
A normal distribution has a kurtosis of 3 (using the “raw” kurtosis measure). The excess kurtosis is given by:
Excess kurtosis = 1.78 − 3 ≈ −1.22
This indicates the distribution is platykurtic (flatter than a normal distribution with lighter tails).
6. Quartiles and Interquartile Range (IQR)
Ordered Data: 6, 8, 10, 12, 14, 16, 18, 20, 22, 24
- Lower Quartile (Q1): Median of the lower half (first 5 numbers: 6, 8, 10, 12, 14)
Q1 = 10 (the 3rd value).
- Upper Quartile (Q3): Median of the upper half (last 5 numbers: 16, 18, 20, 22, 24)
Q3 = 20 (the 3rd value).
- Middle Quartile (Median, Q2): As computed above, Q2 = 15.
- Interquartile Range (IQR): IQR = Q3 − Q1 = 20 − 10 = 10
7. Range
Range = Maximum − Minimum = 24 − 6 = 18
8. Box Plot / Whisker Plot Description
A box plot is drawn using these five summary statistics:
- Minimum: 6
- Q1 (Lower Quartile): 10
- Median: 15
- Q3 (Upper Quartile): 20
- Maximum: 24
In the box plot:
- The box extends from Q1 to Q3 (10 to 20).
- A line inside the box marks the median (15).
- Whiskers extend from the box to the minimum (6) and maximum (24).
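📌 Python sketch: these summary statistics can be cross-checked with NumPy and SciPy (note that np.percentile's default interpolation rule can give slightly different quartiles than the median-of-halves method used above).
import numpy as np
import scipy.stats as stats
data = np.array([6, 8, 10, 12, 14, 16, 18, 20, 22, 24])
print("Mean:", data.mean())                      # 15.0
print("Median:", np.median(data))                # 15.0
print("Population variance:", data.var(ddof=0))  # 33.0
print("Std dev:", round(data.std(ddof=0), 2))    # ≈ 5.74
print("Skewness:", stats.skew(data))             # 0.0
print("Excess kurtosis:", round(stats.kurtosis(data), 2))  # ≈ -1.22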
Final Answers
- Mean: 15
- Median: 15
- Standard Deviation: Approximately 5.74
- Variance: 33
- Skewness: 0 (symmetric distribution)
- Kurtosis: Approximately 1.78 (excess kurtosis ≈ –1.22, indicating a platykurtic distribution)
- Lower Quartile (Q1): 10
- Upper Quartile (Q3): 20
- Middle Quartile (Median, Q2): 15
- Interquartile Range (IQR): 10
- Range: 18
These statistics provide a complete descriptive summary of the given data.
36) The mean of the following frequency distribution was found to be 1.46:
No. of Accidents | No. of Days (Frequency) |
---|---|
0 | 46 |
1 | ? |
2 | ? |
3 | 25 |
4 | 10 |
5 | ? |
Total | 200 days |
Calculate the missing frequencies.
Let the missing frequencies be:
- x for 1 accident,
- y for 2 accidents,
- z for 5 accidents.
We are given:
- For 0 accidents: 46 days
- For 3 accidents: 25 days
- For 4 accidents: 10 days
- Total days: 200
Thus, the sum of frequencies is
46 + x + y + 25 + 10 + z = 200
That is,
x + y + z = 119 … (1)
Also, we are told that the mean number of accidents is 1.46. Since “accidents” is our x-value, the mean is given by
Mean = (Σ accidents × frequency) / 200 = 1.46
So the total number of accidents in 200 days is
200 × 1.46 = 292
Now, the total number of accidents computed from the frequency distribution is
0(46) + 1(x) + 2(y) + 3(25) + 4(10) + 5(z)
Calculate the contribution from the known parts:
3 × 25 = 75 and 4 × 10 = 40
So the known contribution is 75 + 40 = 115.
Thus, the unknowns must satisfy
x + 2y + 5z = 292 − 115 = 177 … (2)
Now we have two equations:
(1) x + y + z = 119
(2) x + 2y + 5z = 177
Subtract (1) from (2):
y + 4z = 58 … (3)
Now, from (3) we can express
y = 58 − 4z
Substitute this into (1):
x = 119 − y − z = 119 − (58 − 4z) − z = 61 + 3z
Thus the missing frequencies are given by:
x = 61 + 3z, y = 58 − 4z
Since frequencies must be nonnegative integers, z can be any integer such that 58 − 4z ≥ 0. This gives 0 ≤ z ≤ 14.
There is not a unique solution unless we have extra information (for example, if the data suggest that days with 5 accidents are very rare, one might take z = 0). A natural (and common) choice is to take z = 0 (i.e., no day had 5 accidents), so that the distribution is concentrated in the lower accident numbers.
If we set z = 0:
- Then x = 61.
- And y = 58.
Check the totals:
- Total frequency: 46 + 61 + 58 + 25 + 10 + 0 = 200. ✓
- Total accidents: 0 + 61 + 116 + 75 + 40 + 0 = 292. ✓
- Mean: 292 / 200 = 1.46. ✓
Everything checks.
Final Answer
A very acceptable solution is:
- Frequency for 1 accident: 61 days
- Frequency for 2 accidents: 58 days
- Frequency for 5 accidents: 0 days
This gives the correct total frequency (200 days) and the overall mean of 1.46 accidents per day.
37) Calculate Sample mean, sample variance, sample skewness and sample kurtosis from the following grouped data:
Class Interval | Frequency |
---|---|
2-4 | 3 |
4-6 | 4 |
6-8 | 2 |
8-10 | 1 |
Let’s denote the mid‐points (x) for each class interval and use the frequencies (f) to calculate the sample statistics. The data are:
Class Interval | f | Midpoint (x) |
---|---|---|
2–4 | 3 | 3 |
4–6 | 4 | 5 |
6–8 | 2 | 7 |
8–10 | 1 | 9 |
The total number of observations is:
n = 3 + 4 + 2 + 1 = 10
We’ll now compute each required statistic step-by-step.
1. Sample Mean
The sample mean is given by:
x̄ = Σ(f · x) / n
Calculate the sum of f · x:
Σ(f · x) = 3(3) + 4(5) + 2(7) + 1(9) = 9 + 20 + 14 + 9 = 52
Thus,
x̄ = 52 / 10 = 5.2
2. Sample Variance and Standard Deviation
The sample variance is computed by:
s² = Σ f(x − x̄)² / (n − 1)
First, compute the deviations for each midpoint:
x | f | x − x̄ | (x − x̄)² | Contribution f(x − x̄)² |
---|---|---|---|---|
3 | 3 | −2.2 | 4.84 | 14.52 |
5 | 4 | −0.2 | 0.04 | 0.16 |
7 | 2 | 1.8 | 3.24 | 6.48 |
9 | 1 | 3.8 | 14.44 | 14.44 |
Now, sum the contributions:
Σ f(x − x̄)² = 14.52 + 0.16 + 6.48 + 14.44 = 35.6
Then the sample variance is:
s² = 35.6 / 9 ≈ 3.96
And the sample standard deviation is:
s ≈ √3.96 ≈ 1.99
3. Sample Skewness
A common formula for sample skewness (using the adjusted moment estimator) is:
Skewness = [n / ((n − 1)(n − 2))] × Σ f(x − x̄)³ / s³
Let’s compute (x − x̄)³ for each midpoint:
x | f | (x − x̄)³ | Contribution f(x − x̄)³ |
---|---|---|---|
3 | 3 | −10.648 | −31.944 |
5 | 4 | −0.008 | −0.032 |
7 | 2 | 5.832 | 11.664 |
9 | 1 | 54.872 | 54.872 |
Sum of the contributions:
Σ f(x − x̄)³ = −31.944 − 0.032 + 11.664 + 54.872 = 34.56
Note: n = 10 and (n − 1)(n − 2) = 72, so the multiplier is: 10 / 72 ≈ 0.1389
We already have s³ ≈ 1.99³ ≈ 7.87, so:
Σ f(x − x̄)³ / s³ ≈ 34.56 / 7.87 ≈ 4.39
Thus:
Skewness ≈ 0.1389 × 4.39 ≈ 0.61
So, the sample skewness is approximately 0.61, indicating a moderate positive skew.
4. Sample Kurtosis
A commonly used formula for sample excess kurtosis (adjusted) is:
Excess Kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × Σ f(x − x̄)⁴ / s⁴ − 3(n − 1)² / ((n − 2)(n − 3))
Here, n = 10, so:
- n(n + 1) = 110,
- (n − 1)(n − 2)(n − 3) = 9 × 8 × 7 = 504,
- The first multiplier is 110 / 504 ≈ 0.2183.
Now compute (x − x̄)⁴ for each midpoint:
x | f | (x − x̄)⁴ | Contribution f(x − x̄)⁴ |
---|---|---|---|
3 | 3 | 23.4256 | 70.277 |
5 | 4 | 0.0016 | 0.006 |
7 | 2 | 10.4976 | 20.995 |
9 | 1 | 208.5136 | 208.514 |
Sum of contributions:
Σ f(x − x̄)⁴ ≈ 299.79
Next, we need s⁴. We have s² ≈ 3.956, so:
s⁴ ≈ 3.956² ≈ 15.65
Now, the standardized fourth moment is:
Σ f(x − x̄)⁴ / s⁴ ≈ 299.79 / 15.65 ≈ 19.16
Using the adjusted formula:
First term: 0.2183 × 19.16 ≈ 4.18
Now, compute the second term:
3(n − 1)² / ((n − 2)(n − 3)) = 3 × 81 / 56 ≈ 4.34
Then,
Excess Kurtosis ≈ 4.18 − 4.34 ≈ −0.16
This is the excess kurtosis. A normal distribution has excess kurtosis 0 (or a raw kurtosis of 3).
Thus, the sample kurtosis (excess) is approximately −0.16, meaning the distribution is slightly platykurtic (flatter than a normal distribution).
5. Final Answers
- Sample Mean: 5.2
- Sample Variance: ≈ 3.96
- Sample Standard Deviation: ≈ 1.99
- Sample Skewness: ≈ 0.61 (indicating a moderate positive skew)
- Sample Kurtosis (Excess): ≈ −0.16 (or a raw kurtosis of about 2.84, indicating a slightly platykurtic distribution)
These values summarize the central tendency, spread, and shape characteristics of the grouped data.
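📌 Python sketch: expanding the grouped data into raw midpoint values lets SciPy verify these adjusted estimates (bias=False applies the same sample corrections used above).
import scipy.stats as stats
# Expand grouped data: each midpoint repeated by its frequency
data = [3]*3 + [5]*4 + [7]*2 + [9]*1
print("Skewness:", round(stats.skew(data, bias=False), 2))             # ≈ 0.61
print("Excess kurtosis:", round(stats.kurtosis(data, bias=False), 2))  # ≈ -0.16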