This project aims to find a convenient way to estimate a person’s body fat percentage. We used a data set that contains body measurements. A linear regression model and a ridge regression model were fitted, and the linear regression model turned out to be the better fit for this project. Our finding is that body fat percentage can be predicted from abdomen circumference, wrist circumference, weight, biceps circumference, and height.
According to “Obesity Facts in America” from Healthline, the obesity rate in America is around 17% for children aged 2 to 19 and over 36.5% for adults. People who are obese are at high risk for many diseases, including diabetes, heart disease, and stroke. Because of this, knowing your body fat percentage is helpful for evaluating your overall health. One way to find out your body fat percentage is to go to a clinic or a hospital, but this is not convenient. The goal of this project is to find an easier way to estimate body fat percentage.
There are a few questions of interest for this project:
Does a person who is heavy and tall tend to have a higher body fat percentage?
How well can measuring the size of the body (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and wrist) predict the body fat percentage?
What age range has the highest body fat percentage?
The Bodyfat data set contains 16 variables. The 15 summarized below are: Density determined from underwater weighing, Percent body fat from Siri’s (1956) equation, Age (years), Weight (lbs), Height (inches), Neck circumference (cm), Chest circumference (cm), Abdomen 2 circumference (cm), Hip circumference (cm), Thigh circumference (cm), Knee circumference (cm), Ankle circumference (cm), Biceps (extended) circumference (cm), Forearm circumference (cm), and Wrist circumference (cm).
The quantitative variables are bodyfat.p, age, weight, height, neck, chest, abdomen2, hip, thigh, knee, ankle, biceps, forearm, and wrist. There are no qualitative variables.
Siri’s equation is percentage of body fat = (495 / body density) - 450. Because bodyfat.p is computed directly from Density, the Density variable will not be used as a predictor.
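As a quick check of the formula, the sketch below applies Siri’s equation to the smallest and largest densities reported in the summary. The function name `siri_bodyfat` is hypothetical and is not part of the original analysis.

```r
# Siri's (1956) equation: percent body fat from body density.
# `siri_bodyfat` is an illustrative name, not a function from the original analysis.
siri_bodyfat <- function(density) {
  495 / density - 450
}

# Density values taken from the Min. and Max. rows of the summary.
low_density  <- siri_bodyfat(0.995)
high_density <- siri_bodyfat(1.109)

round(low_density, 2)   # 47.49, close to the maximum bodyfat.p of 47.50
round(high_density, 2)  # -3.65, a negative value that is not a valid percentage
```

The lowest density maps to the highest body fat percentage, which is why Density carries the same information as the response.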
## density bodyfat.p age weight
## Min. :0.995 Min. : 0.00 Min. :22.00 Min. :118.5
## 1st Qu.:1.041 1st Qu.:12.47 1st Qu.:35.75 1st Qu.:159.0
## Median :1.055 Median :19.20 Median :43.00 Median :176.5
## Mean :1.056 Mean :19.15 Mean :44.88 Mean :178.9
## 3rd Qu.:1.070 3rd Qu.:25.30 3rd Qu.:54.00 3rd Qu.:197.0
## Max. :1.109 Max. :47.50 Max. :81.00 Max. :363.1
## height neck chest abdomen2
## Min. :29.50 Min. :31.10 Min. : 79.30 Min. : 69.40
## 1st Qu.:68.25 1st Qu.:36.40 1st Qu.: 94.35 1st Qu.: 84.58
## Median :70.00 Median :38.00 Median : 99.65 Median : 90.95
## Mean :70.15 Mean :37.99 Mean :100.82 Mean : 92.56
## 3rd Qu.:72.25 3rd Qu.:39.42 3rd Qu.:105.38 3rd Qu.: 99.33
## Max. :77.75 Max. :51.20 Max. :136.20 Max. :148.10
## hip thigh knee ankle biceps
## Min. : 85.0 Min. :47.20 Min. :33.00 Min. :19.1 Min. :24.80
## 1st Qu.: 95.5 1st Qu.:56.00 1st Qu.:36.98 1st Qu.:22.0 1st Qu.:30.20
## Median : 99.3 Median :59.00 Median :38.50 Median :22.8 Median :32.05
## Mean : 99.9 Mean :59.41 Mean :38.59 Mean :23.1 Mean :32.27
## 3rd Qu.:103.5 3rd Qu.:62.35 3rd Qu.:39.92 3rd Qu.:24.0 3rd Qu.:34.33
## Max. :147.7 Max. :87.30 Max. :49.10 Max. :33.9 Max. :45.00
## forearm wrist
## Min. :21.00 Min. :15.80
## 1st Qu.:27.30 1st Qu.:17.60
## Median :28.70 Median :18.30
## Mean :28.66 Mean :18.23
## 3rd Qu.:30.00 3rd Qu.:18.80
## Max. :34.90 Max. :21.40
Missing value
Based on the summary, we can see that the data set does not contain any missing values.
A person with 0% body fat
We see an unusual value: a person with 0% body fat. According to Dr. Sutterer from Men’s Health, it is not possible for a human body to have 0% body fat, and a person with 0% body fat cannot function normally. Having too little body fat can cause nutritional deficiencies, electrolyte imbalances, and organ malfunction. According to Garber from ABC News, men need a minimum of 3 percent body fat and women need at least 12 percent for proper bodily function. Below those levels, serious health problems may arise, possibly leading to death due to organ failure.
Based on the histogram below, the data point with 0% body fat is not an outlier, so we will not remove it.
Histogram of the response variable bodyfat.p
The histogram of bodyfat.p is right skewed, so a Box-Cox transformation may be required.
Box-Cox procedure
## [1] 0.989899
Lambda is 0.989899, which is close to 1. From the plot above, we can see that 1 is inside the 95% confidence interval; therefore, a transformation of the response variable is not necessary.
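The sketch below shows one way such a lambda and its approximate 95% confidence interval can be obtained with `boxcox()` from the MASS package. The data are simulated stand-ins, and the grid of lambda values is an assumption; the original call is not shown in the report.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated stand-in data; the real analysis uses the Bodyfat data set.
n <- 200
dat <- data.frame(x = runif(n, 70, 150))
dat$y <- 5 + 0.3 * dat$x + rnorm(n, sd = 3)  # strictly positive, as Box-Cox requires

# y = TRUE keeps the response in the fit so boxcox() can use it directly.
fit <- lm(y ~ x, data = dat, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot drawn here).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[in_ci])

lambda_hat
ci
```

If 1 falls inside `ci`, the untransformed response is compatible with the data, which is the reasoning used in this section.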
Histograms of the predictor variables
Based on the histograms, we don’t see any obvious outliers.
Distribution for each predictor variable:
Normal distribution: age, neck, wrist and forearm
Right skewed: weight, chest, thigh, knee, ankle, biceps, abdomen2 and hip
Left skewed: height
Training set and validation set
The whole data set is separated into two groups. The training set contains 70% of the data, and the validation set contains the remaining 30%. The side-by-side box plots show that these two data sets are similar to each other.
Scatter plot matrix and correlation matrix
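A 70/30 split of this kind can be sketched as follows. The seed and the stand-in data frame are assumptions for illustration; the row count of 252 is inferred from the degrees of freedom of the final model (251 cases after removing one).

```r
set.seed(2023)  # hypothetical seed; the original seed is not given

# Stand-in data frame with 252 rows (row count inferred, values simulated).
dat <- data.frame(id = 1:252, bodyfat.p = runif(252, 0, 45))

# Draw 70% of the row indices at random for training; the rest are for validation.
train_idx  <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train      <- dat[train_idx, ]
validation <- dat[-train_idx, ]

nrow(train)       # 176
nrow(validation)  # 76
```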
Based on the scatter plot matrix, there is some linear relation between response variable bodyfat.p and weight, neck, abdomen2, hip, thigh, knee, biceps, forearm and wrist.
There is no obvious linear relation between bodyfat.p and age, height and ankle.
Some X variables are highly correlated with each other. The correlation between weight and neck is 0.83567571, weight and chest is 0.90083203, weight and abdomen2 is 0.90014657, weight and hip is 0.95276955, weight and thigh is 0.88541527, weight and knee is 0.86589619, weight and biceps is 0.84100793, chest and abdomen2 is 0.9219589, chest and hip is 0.8368452, abdomen2 and hip is 0.8819369, and hip and thigh is 0.9079775.
Multicollinearity
We check the VIFs (variance inflation factors) to see which predictor variables are highly correlated with the others.
## age weight height neck chest abdomen2 hip thigh
## 2.480287 39.728794 1.579624 4.405679 10.055931 12.578180 17.971653 9.064177
## knee ankle biceps forearm wrist
## 4.746972 2.327092 4.899184 2.388159 3.434389
Some of the VIFs are larger than 10, indicating multicollinearity among the variables weight, chest, abdomen2, and hip.
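For reference, a VIF can be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below does this in base R on simulated predictors; `vif_manual` is a hypothetical helper name, and the report itself most likely used a packaged function.

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all remaining predictors. `vif_manual` is an illustrative helper.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
# Simulated predictors: x2 is built to be strongly correlated with x1.
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.2)
x3 <- rnorm(100)

vifs <- vif_manual(data.frame(x1, x2, x3))
round(vifs, 2)
```

Here `x1` and `x2` receive large VIFs while `x3` stays near 1, mirroring how weight, chest, abdomen2, and hip stand out in the output above.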
Assumptions checking
All required model assumptions have to be checked before the model is used. The checks are as follows:
Outliers and Leverage
Outlying Y observations
## 81 207 82 225 231 224
## 2.749330 2.465451 2.459357 2.403352 2.400372 2.367505
## [1] 3.516987
For α = 0.1, the Bonferroni critical value is 3.516987. None of the absolute studentized deleted residuals is larger than 3.516987, so there are no outlying Y observations.
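The Bonferroni check for outlying Y observations can be sketched in base R as below. The data are simulated, so the numbers differ from the report; only the procedure (studentized deleted residuals compared with a t quantile at α / (2n)) is illustrated.

```r
set.seed(1)
# Simulated stand-in data with one predictor.
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

alpha <- 0.10
p <- length(coef(fit))  # number of coefficients, including the intercept

# Studentized deleted residuals and the Bonferroni critical value.
t_del  <- rstudent(fit)
t_crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

# Largest absolute residuals, then the cases flagged as outlying (possibly none).
head(sort(abs(t_del), decreasing = TRUE))
flagged <- which(abs(t_del) > t_crit)
flagged
```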
Leverage and Outlying in X
## 42 31 39 175 41 206 106 5
## 0.7965237 0.5275080 0.4389803 0.3749227 0.3085681 0.2752319 0.2034773 0.1617982
Eight leverage values are larger than 2p/n = 0.1587302. In decreasing order of leverage, they are cases 42, 31, 39, 175, 41, 206, 106, and 5. These cases are all outlying in X.
Influential cases (Cook’s distance)
Based on the calculation of Cook’s distance, 9 values are larger than 4/(n-p) = 0.02463054, but only case 39, with a Cook’s distance of 0.24746926, stands out as much more influential than the other cases. Case 39 needs further investigation.
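The leverage and Cook’s distance screens described here can be sketched with base R’s `hatvalues()` and `cooks.distance()`. The data below are simulated; the cutoffs 2p/n and 4/(n-p) are the ones quoted in the text.

```r
set.seed(1)
# Simulated stand-in data with two predictors.
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p <- length(coef(fit))  # number of coefficients, including the intercept

# Leverage: diagonal of the hat matrix, flagged when above 2p/n.
h <- hatvalues(fit)
high_leverage <- sort(h[h > 2 * p / n], decreasing = TRUE)

# Cook's distance: flagged when above 4/(n - p).
d <- cooks.distance(fit)
influential <- sort(d[d > 4 / (n - p)], decreasing = TRUE)

high_leverage
influential
```

A case that is far larger than the rest of `influential`, as case 39 is in the report, is the kind that merits refitting the model without it.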
Investigate case 39
Method:
We build a first-order model without case 39 (model name: fit.39) and then compare its coefficients with those of the model fitted to all cases (model name: model).
## (Intercept) age weight height neck chest
## [1,] -6.8529897 0.03886669 -0.05538916 -0.1061088 -0.5833429 0.05747088
## [2,] -0.2608794 0.05076848 -0.01761815 -0.1368490 -0.4875556 -0.01241289
## abdomen2 hip thigh knee ankle biceps forearm
## [1,] 0.8822237 -0.2706743 0.1465729 0.08447148 -0.3135759 0.3850058 0.4199717
## [2,] 0.8691145 -0.2203312 0.1097472 -0.01199683 -0.2748800 0.4092526 0.2447795
## wrist
## [1,] -1.443656
## [2,] -1.600127
The coefficients changed after case 39 was removed. In the output above, row [1,] is the model with all cases and row [2,] is the model without case 39. The most obvious change is in the intercept, which is -6.8529897 with all cases and -0.2608794 without case 39. The coefficients of the other variables also changed. We therefore decided to remove case 39.
Normality
The Box-Cox procedure above showed that the response variable does not need to be transformed, which is consistent with an approximately normal error distribution.
Independence
A Durbin-Watson Test can be used to check if the residuals are independent.
The null hypothesis: There is no correlation among the residuals.
The alternative hypothesis: The residuals are autocorrelated.
## lag Autocorrelation D-W Statistic p-value
## 1 -0.06599532 2.111929 0.442
## Alternative hypothesis: rho != 0
From the output we can see that the test statistic is 2.111929 and the corresponding p-value is 0.442. Since this p-value is larger than 0.05, we fail to reject the null hypothesis and conclude that there is no evidence of correlation among the residuals. The residuals can be treated as independent.
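To make the test statistic concrete, the sketch below computes the Durbin-Watson statistic by hand from the residuals of a simulated fit. It does not reproduce the p-value, which the report obtained from a packaged test; only the statistic itself is shown.

```r
set.seed(1)
# Simulated stand-in data with independent errors by construction.
n <- 150
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

e <- residuals(fit)

# Durbin-Watson statistic: squared successive differences over squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 residual autocorrelation; dw is roughly 2 * (1 - r1).
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

dw
2 * (1 - r1)
```

Values of the statistic near 2 correspond to little or no lag-1 autocorrelation, which matches the reported value of 2.111929.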
Equal variances
Equal variances can be tested by using the Non-Constant Variance (NCV) score test.
The null hypothesis: The population variances are equal
The alternative hypothesis: The population variances are not equal
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2579675, Df = 1, p = 0.61152
Based on the Non-Constant Variance (NCV) test result, we fail to reject \(H_0\) and conclude that there is no evidence of unequal variances, because the p-value is 0.61152, which is larger than 0.05.
The plots use the data without point 39.
Linearity: Residual vs fitted value plot does not show any pattern. Assumption of linearity is met.
Constant variance: The expected value of the residuals is approximately 0, and the variance is approximately constant. Assumption of constant variance is met.
Normality: QQ plot looks like a straight line. Assumption of normality is met.
Model fitting
We are going to fit two models, and then choose the better model as the final model:
Linear regression: \(Y = X \beta + \epsilon\) and \(\hat \beta = (X^{\top}X)^{-1}X^{\top}Y\)
Ridge regression: \(\hat Y_{\lambda}= X \hat \beta_{\lambda}\) and \(\hat \beta_{\lambda}=(X^{\top}X+\lambda I)^{-1}X^{\top}Y\)
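The two closed-form estimators can be evaluated directly with matrix algebra, as in the sketch below. The data are simulated and the penalty `lambda = 5` is a hypothetical value chosen only for illustration; the report selects \(\lambda\) by GCV instead.

```r
set.seed(1)
# Simulated stand-in data: intercept column plus two predictors.
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
beta <- c(2, 1, -1)
Y <- drop(X %*% beta) + rnorm(n)

# OLS: solve (X'X) b = X'Y.
beta_ols <- solve(crossprod(X), crossprod(X, Y))

# Ridge with a hypothetical penalty, written exactly as in the formula.
# In practice the intercept is usually left unpenalized and the predictors
# are standardized first; this sketch skips both steps.
lambda <- 5
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))

cbind(ols = drop(beta_ols), ridge = drop(beta_ridge))
```

The ridge coefficients are shrunk toward zero relative to OLS, and the two coincide when \(\lambda = 0\).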
Linear model
Stepwise Regression
Since there is high multicollinearity among the predictors, a forward stepwise procedure will be used for model selection.
First order model with all predictors
The first “best” model: fit a first-order model with all predictors as the full model, and then use a forward stepwise procedure to find the “best” model based on AIC.
The final model (model1) is: bodyfat.p ~ abdomen2 + wrist + weight + biceps + height.
First-order and second-order effects for all predictors
The second “best” model: fit a model with first-order and second-order effects for all predictors as the full model, and then use a forward stepwise procedure to find the “best” model based on AIC.
The final model (model2) is: bodyfat.p ~ abdomen2 + wrist + weight + biceps + height + age + neck + weight:biceps + abdomen2:neck.
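An AIC-based stepwise search of this kind can be sketched with base R’s `step()`. The data are simulated, and starting from the intercept-only model with `direction = "both"` is an assumption about how the forward stepwise procedure was run; the report does not show the call.

```r
set.seed(1)
# Simulated stand-in data: x1 and x2 matter, x3 is pure noise.
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

null_fit <- lm(y ~ 1, data = dat)
full_fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Start from the intercept-only model; at each step add (or drop) the term
# that lowers AIC the most. trace = 0 suppresses the step-by-step log.
sel <- step(null_fit,
            scope = list(lower = formula(null_fit), upper = formula(full_fit)),
            direction = "both", trace = 0)

formula(sel)
selected <- attr(terms(sel), "term.labels")
selected
```

For a search over second-order effects, the `upper` formula would also include interaction and squared terms, for example `y ~ (x1 + x2 + x3)^2`.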
Performance evaluation: compare the RMSPE (root mean squared prediction error) values of model1 and model2
We looked at the RMSPE for both the training set and the validation set, and compared the two models to see which one does a better job.
## [1] 4.223362 4.402156
## [1] 4.096879 4.344482
For model1, the root mean squared prediction error (RMSPE) is 4.223362 on the training data and 4.402156 on the validation data. For model2, it is 4.096879 on the training data and 4.344482 on the validation data. Since the RMSPE values are close to each other, it is hard to tell which model is better.
Moreover, for model1, the RMSPE values for the training set and the validation set are similar, so there is no sign of overfitting in this model. The same holds for model2.
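The RMSPE comparison can be sketched as below. The helper name `rmspe` and the simulated data are assumptions for illustration; the point is that the same function is applied to the training set and the validation set of a model fitted on the training set only.

```r
# Root mean squared prediction error of a fitted model on a data set.
# `rmspe` is an illustrative helper name.
rmspe <- function(fit, newdata, response) {
  pred <- predict(fit, newdata = newdata)
  sqrt(mean((newdata[[response]] - pred)^2))
}

set.seed(1)
# Simulated stand-in data.
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 4)

idx   <- sample(n, size = floor(0.7 * n))
train <- dat[idx, ]
valid <- dat[-idx, ]

fit <- lm(y ~ x, data = train)
res <- c(train = rmspe(fit, train, "y"), validation = rmspe(fit, valid, "y"))
res
```

A validation RMSPE much larger than the training RMSPE would point to overfitting; similar values, as in the report, do not.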
The “Best” Model final decision for linear regression
##
## Call:
## lm(formula = bodyfat.p ~ abdomen2 + wrist + weight + biceps +
## height, data = train.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5907 -2.9934 -0.6306 3.1370 10.3983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.91040 10.64676 -2.152 0.03283 *
## abdomen2 0.92117 0.07387 12.470 < 2e-16 ***
## wrist -1.64610 0.53487 -3.078 0.00244 **
## weight -0.10542 0.03598 -2.930 0.00386 **
## biceps 0.44621 0.19727 2.262 0.02497 *
## height -0.13160 0.09363 -1.406 0.16169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.298 on 169 degrees of freedom
## Multiple R-squared: 0.7378, Adjusted R-squared: 0.73
## F-statistic: 95.09 on 5 and 169 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = bodyfat.p ~ abdomen2 + wrist + weight + biceps +
## height + age + neck + weight:biceps + abdomen2:neck, data = train.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5787 -2.8382 -0.3092 2.6513 10.8889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.762992 45.442832 0.017 0.98662
## abdomen2 -0.129136 0.685203 -0.188 0.85074
## wrist -1.832476 0.603063 -3.039 0.00276 **
## weight 0.368951 0.179163 2.059 0.04104 *
## biceps 2.815559 0.944736 2.980 0.00332 **
## height -0.134748 0.092311 -1.460 0.14627
## age 0.057218 0.034694 1.649 0.10101
## neck -2.632958 1.629160 -1.616 0.10797
## weight:biceps -0.012841 0.005199 -2.470 0.01454 *
## abdomen2:neck 0.024999 0.017746 1.409 0.16080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.219 on 165 degrees of freedom
## Multiple R-squared: 0.7532, Adjusted R-squared: 0.7398
## F-statistic: 55.96 on 9 and 165 DF, p-value: < 2.2e-16
For model1, based on the p-values, height is the only predictor that is not significant. For model2, the intercept, abdomen2, height, age, neck, and the abdomen2:neck interaction are not significant. In addition, the R-squared and adjusted R-squared values of the two models are similar (0.7378 and 0.73 for model1; 0.7532 and 0.7398 for model2). Because model1 is simpler and performs about as well, we decided that model1 is the better model.
Ridge regression
We fit a ridge regression, choose \(\lambda\) by GCV (generalized cross-validation), and compare the performance of the linear regression model and the ridge regression model.
If a model fits well, the points on the plot should line up along the red straight line (slope = 1). The OLS model clearly does a better job than the ridge model, whose points are not close to the red line at all. Therefore, the linear regression is the better model.
## [1] 4.223362 4.402156
## [1] 40.39622 42.00466
The RMSPE values for the ridge regression model are 40.39622 and 42.00466. The RMSPE values for the OLS model are 4.223362 and 4.402156. Comparing the RMSPE values, the ridge model has a much larger error.
Based on the performance of the OLS and ridge regression models, we decided that the linear regression model is the best model.
Fitting the “Best” model with the whole data set except point 39
##
## Call:
## lm(formula = bodyfat.p ~ abdomen2 + wrist + weight + biceps +
## height, data = bodyfat.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0406 -3.0377 -0.3776 3.2839 9.2021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.62233 9.22625 -2.560 0.01106 *
## abdomen2 0.93624 0.06124 15.287 < 2e-16 ***
## wrist -1.43766 0.43334 -3.318 0.00105 **
## weight -0.10084 0.03050 -3.306 0.00109 **
## biceps 0.26078 0.15284 1.706 0.08924 .
## height -0.11380 0.08788 -1.295 0.19656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.276 on 245 degrees of freedom
## Multiple R-squared: 0.7413, Adjusted R-squared: 0.7361
## F-statistic: 140.4 on 5 and 245 DF, p-value: < 2.2e-16
The final linear regression model is \(\widehat{\text{bodyfat.p}} = -23.62233 + 0.93624 \times \text{abdomen2} - 1.43766 \times \text{wrist} - 0.10084 \times \text{weight} + 0.26078 \times \text{biceps} - 0.11380 \times \text{height}\)
Use the linear regression model to predict the body fat percentage for the outlier point 39
Since data point 39 was considered an outlier, it was removed from the data set. We wanted to see what body fat percentage the final model predicts from point 39’s measurements.
## fit lwr upr
## 1 51.16071 41.75175 60.56968
The actual body fat percentage for point 39 is 35.2%. When we use the final linear regression model to predict it, the point estimate is 51.16% and the 95% prediction interval is 41.75% to 60.57%. The actual value is well below the lower limit of 41.75%, so the model overpredicts for this case.
The R-squared value for the final model is 74.13%, which means the model explains 74.13% of the variation in body fat percentage. The coefficients of abdomen2, wrist, and weight are significant at the 0.05 level. The coefficient of biceps is significant only at the 0.10 level (p = 0.089), and the coefficient of height is not significant (p = 0.197).
Based on the final model, the goal of finding an easier way to estimate body fat percentage was achieved. By measuring the abdomen (cm), wrist (cm), weight (lbs), biceps (cm), and height (inches), a person’s body fat percentage can be estimated with the equation: \(\widehat{\text{bodyfat.p}} = -23.62233 + 0.93624 \times \text{abdomen2} - 1.43766 \times \text{wrist} - 0.10084 \times \text{weight} + 0.26078 \times \text{biceps} - 0.11380 \times \text{height}\)
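As a worked example, the sketch below wraps the final equation in a small function and evaluates it at the median measurements from the summary table. The function name `predict_bodyfat` is hypothetical; the coefficients are copied from the final model output.

```r
# Coefficients copied from the final fitted model reported above.
# `predict_bodyfat` is an illustrative helper name.
predict_bodyfat <- function(abdomen2, wrist, weight, biceps, height) {
  -23.62233 + 0.93624 * abdomen2 - 1.43766 * wrist -
    0.10084 * weight + 0.26078 * biceps - 0.11380 * height
}

# Median measurements from the summary table (cm, cm, lbs, cm, inches).
est <- predict_bodyfat(abdomen2 = 90.95, wrist = 18.30, weight = 176.5,
                       biceps = 32.05, height = 70.00)
round(est, 2)  # 17.81
```

The estimate of about 17.8% is a point prediction only; the residual standard error of 4.276 indicates that individual predictions can be off by several percentage points.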
Answer the questions of interest:
Does a person who is heavy and tall tend to have a higher body fat percentage? No. In the final model, weight has a negative coefficient: holding the other variables constant, each additional pound of weight is associated with an average decrease of 0.10084 percentage points in body fat percentage. Height also has a negative coefficient: holding the other variables constant, each additional inch of height is associated with an average decrease of 0.11380 percentage points, although this coefficient is not statistically significant.
How well can measuring the size of the body (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and wrist) predict the body fat percentage? In the final model, abdomen and wrist are significant predictors of body fat percentage, and biceps is marginally significant (p = 0.089). Neck, chest, hip, thigh, knee, ankle, and forearm were not selected by the stepwise procedure, so they do not add significant predictive value once the selected variables are in the model. Holding the other variables constant, on average, each additional centimeter of abdomen circumference is associated with an increase of 0.93624 percentage points in body fat percentage, each additional centimeter of biceps circumference with an increase of 0.26078 percentage points, and each additional centimeter of wrist circumference with a decrease of 1.43766 percentage points.
What age range has the highest body fat percentage? Based on the final model, age does not tell us anything about body fat percentage. The variable age was eliminated during the forward stepwise model selection procedure.
Tanita Europe B.V. Understanding your measurements - Body Fat Percentage. (n.d.). Retrieved March 25, 2023, from https://tanita.eu/understanding-your-measurements/body-fat-percentage
Healthline. (2022, January 24). Obesity Facts: Causes, Risks, and More. Retrieved March 25, 2023, from https://www.healthline.com/health/obesity-facts#1.-More-than-one-third-of-adults-in-the-United-States-are-obese
Mackenzie, B. (n.d.). Siri Equation for Body Density. Topend Sports. Retrieved March 25, 2023, from https://www.topendsports.com/testing/siri-equation.htm
Men’s Health. (2020, July 22). Ronnie Coleman Says He Now Has No Feeling From the Waist Down. Retrieved March 25, 2023, from https://www.menshealth.com/health/a33247811/ronnie-coleman-body-fat/
ABC News. (2015, April 8). Legendary Bodybuilder Who Died of Bodybuilding Diet Lives On in Graphic Photos. Retrieved March 25, 2023, from https://abcnews.go.com/Health/legendary-bodybuilder-died-body-fat-lives/story?id=29899438#
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] car_3.1-1 carData_3.0-5 MASS_7.3-58.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.20
## [5] highr_0.10 cachem_1.0.6 rlang_1.0.6 cli_3.6.0
## [9] rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2 vctrs_0.5.2
## [13] rmarkdown_2.20 tools_4.2.2 abind_1.4-5 xfun_0.37
## [17] yaml_2.3.7 fastmap_1.1.0 compiler_4.2.2 htmltools_0.5.4
## [21] knitr_1.42 sass_0.4.5