1 Abstract

This project looks for a convenient way to estimate a person’s body fat percentage. We used a data set of body measurements and fitted a linear regression model and a ridge regression model. The linear regression model fit the data better. Our finding is that body fat percentage can be predicted from abdomen circumference, wrist circumference, weight, biceps circumference and height.

2 Introduction

According to “Obesity Facts in America” from Healthline, the obesity rate in America is around 17% for children aged 2 to 19 and over 36.5% for adults. People who are obese are at high risk for many diseases, including diabetes, heart disease and stroke. Knowing your body fat percentage is therefore helpful for evaluating your overall health. One way to find it is to go to a clinic or a hospital, but this is not convenient. The goal of this project is to find an easier way to estimate body fat percentage.

There are a few questions of interest for this project:

  1. Does a person who is heavy and tall tend to have a higher body fat percentage?

  2. How well can measuring the size of the body (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and wrist) predict the body fat percentage?

  3. What age range has the highest body fat percentage?

3 Background

The Bodyfat data set contains 16 variables. The 15 summarized in this report are: Density determined from underwater weighing, Percent body fat from Siri’s (1956) equation, Age (years), Weight (lbs), Height (inches), Neck circumference (cm), Chest circumference (cm), Abdomen 2 circumference (cm), Hip circumference (cm), Thigh circumference (cm), Knee circumference (cm), Ankle circumference (cm), Biceps (extended) circumference (cm), Forearm circumference (cm), and Wrist circumference (cm).

All of these variables are quantitative: density, bodyfat.p, age, weight, height, neck, chest, abdomen2, hip, thigh, knee, ankle, biceps, forearm and wrist. There are no qualitative variables.

Siri’s equation is percentage of body fat = (495 / body density) - 450. Because bodyfat.p is computed directly from density, the density variable will not be used as one of the predictors.
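As a quick check of Siri’s equation, the sketch below evaluates it at the minimum, mean and maximum densities reported in the descriptive summary. The function name `siri_body_fat` is hypothetical and not part of the original analysis.

```r
# Siri's equation: percent body fat = 495 / density - 450
siri_body_fat <- function(density) {
  495 / density - 450
}

# Minimum, mean and maximum density from the summary table
densities <- c(min = 0.995, mean = 1.056, max = 1.109)
fat <- siri_body_fat(densities)
print(round(fat, 2))
# about 47.49 at the minimum density, 18.75 at the mean, -3.65 at the maximum
```

The value at the lowest density is close to the reported maximum body fat of 47.50. At the highest density the equation gives a negative number, which may be related to the 0% reading in the data, although the source does not say how that value arose.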

4 Descriptive analysis

##     density        bodyfat.p          age            weight     
##  Min.   :0.995   Min.   : 0.00   Min.   :22.00   Min.   :118.5  
##  1st Qu.:1.041   1st Qu.:12.47   1st Qu.:35.75   1st Qu.:159.0  
##  Median :1.055   Median :19.20   Median :43.00   Median :176.5  
##  Mean   :1.056   Mean   :19.15   Mean   :44.88   Mean   :178.9  
##  3rd Qu.:1.070   3rd Qu.:25.30   3rd Qu.:54.00   3rd Qu.:197.0  
##  Max.   :1.109   Max.   :47.50   Max.   :81.00   Max.   :363.1  
##      height           neck           chest           abdomen2     
##  Min.   :29.50   Min.   :31.10   Min.   : 79.30   Min.   : 69.40  
##  1st Qu.:68.25   1st Qu.:36.40   1st Qu.: 94.35   1st Qu.: 84.58  
##  Median :70.00   Median :38.00   Median : 99.65   Median : 90.95  
##  Mean   :70.15   Mean   :37.99   Mean   :100.82   Mean   : 92.56  
##  3rd Qu.:72.25   3rd Qu.:39.42   3rd Qu.:105.38   3rd Qu.: 99.33  
##  Max.   :77.75   Max.   :51.20   Max.   :136.20   Max.   :148.10  
##       hip            thigh            knee           ankle          biceps     
##  Min.   : 85.0   Min.   :47.20   Min.   :33.00   Min.   :19.1   Min.   :24.80  
##  1st Qu.: 95.5   1st Qu.:56.00   1st Qu.:36.98   1st Qu.:22.0   1st Qu.:30.20  
##  Median : 99.3   Median :59.00   Median :38.50   Median :22.8   Median :32.05  
##  Mean   : 99.9   Mean   :59.41   Mean   :38.59   Mean   :23.1   Mean   :32.27  
##  3rd Qu.:103.5   3rd Qu.:62.35   3rd Qu.:39.92   3rd Qu.:24.0   3rd Qu.:34.33  
##  Max.   :147.7   Max.   :87.30   Max.   :49.10   Max.   :33.9   Max.   :45.00  
##     forearm          wrist      
##  Min.   :21.00   Min.   :15.80  
##  1st Qu.:27.30   1st Qu.:17.60  
##  Median :28.70   Median :18.30  
##  Mean   :28.66   Mean   :18.23  
##  3rd Qu.:30.00   3rd Qu.:18.80  
##  Max.   :34.90   Max.   :21.40

Missing value

Based on the summary, we can see that the data set does not contain any missing values.

A person with 0% body fat

We see an unusual value: one person has 0% body fat. According to Dr. Sutterer from Men’s Health, it is not possible for a human body to have 0% body fat, and a person with 0% body fat cannot function normally. Having too little body fat can cause nutritional deficiencies, electrolyte imbalances, and organ malfunction. According to Garber from ABC News, men need a minimum of 3 percent body fat and women need at least 12 percent for proper bodily function. Below those levels, serious health problems may arise, possibly leading to death due to organ failure.

Based on the histogram below, the data point with 0% body fat is not an outlier, so we will not remove it.

Histogram of the response variable bodyfat.p

The histogram of the variable bodyfat.p is right skewed, so a Box-Cox transformation may be required.

Box-Cox procedure

## [1] 0.989899

Lambda is 0.989899, which is close to 1. From the plot above, we can see that 1 is inside the 95% confidence interval, so a transformation of the response variable is not necessary.
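A minimal sketch of how such a lambda and its 95% confidence interval can be obtained with `MASS::boxcox` is shown here. The data are simulated and the variable names are assumptions, not the report’s own code. Note that `boxcox` requires a strictly positive response, so a value of exactly 0 would need a small shift first; the source does not state how this was handled.

```r
library(MASS)

# Simulated, strictly positive response (illustrative data only)
set.seed(1)
d <- data.frame(x = runif(100, 70, 120))
d$y <- 5 + 0.3 * d$x + rnorm(100, sd = 2)

# Keep the response in the fitted object so boxcox() can use it directly
fit <- lm(y ~ x, data = d, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

print(lambda_hat)
print(ci)
```

If 1 lies inside `ci`, the untransformed response is compatible with the data, which is the reasoning used above.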

Histograms of the predictor variables

Based on the histograms, we do not see any obvious outliers.

Distribution for each predictor variable:

Approximately normal: age, neck, wrist and forearm

Right skewed: weight, chest, thigh, knee, ankle, biceps, abdomen2 and hip

Left skewed: height

Training set and validation set

The data set is split into two groups: a training set that contains 70% of the data and a validation set that contains the remaining 30%. The side-by-side box plots show that the two sets are similar to each other.
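A 70/30 split of this kind can be sketched as follows. The seed and the index names are assumptions; n = 252 is inferred from the degrees of freedom in the regression output later in the report (251 cases remain after one case is removed).

```r
# Reproducible random 70/30 split of row indices (seed is arbitrary)
set.seed(2023)
n <- 252
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
valid_idx <- setdiff(seq_len(n), train_idx)

# With a data frame named bodyfat (hypothetical), the two sets would be:
# train <- bodyfat[train_idx, ]
# valid <- bodyfat[valid_idx, ]
print(c(train = length(train_idx), valid = length(valid_idx)))
```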

Scatter plot matrix and correlation matrix

Based on the scatter plot matrix, there is some linear relationship between the response variable bodyfat.p and weight, neck, abdomen2, hip, thigh, knee, biceps, forearm and wrist.

There is no obvious linear relationship between bodyfat.p and age, height or ankle.

Some X variables are highly correlated with each other. The correlation between weight and neck is 0.836, weight and chest is 0.901, weight and abdomen2 is 0.900, weight and hip is 0.953, weight and thigh is 0.885, weight and knee is 0.866, weight and biceps is 0.841, chest and abdomen2 is 0.922, chest and hip is 0.837, abdomen2 and hip is 0.882, and hip and thigh is 0.908.

Multicollinearity

We check the VIFs (variance inflation factors) to see which predictor variables are highly correlated with the others.

##       age    weight    height      neck     chest  abdomen2       hip     thigh 
##  2.480287 39.728794  1.579624  4.405679 10.055931 12.578180 17.971653  9.064177 
##      knee     ankle    biceps   forearm     wrist 
##  4.746972  2.327092  4.899184  2.388159  3.434389

Some of the VIFs are larger than 10, indicating multicollinearity among the variables weight, chest, abdomen2 and hip.
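For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it two equivalent ways in base R on simulated data; the variable names only mimic the report and the numbers are not the report’s.

```r
# Simulated predictors: hip is built to be strongly related to weight
set.seed(7)
n <- 200
weight <- rnorm(n, 180, 30)
hip <- 60 + 0.22 * weight + rnorm(n, sd = 2)
age <- rnorm(n, 45, 12)
X <- data.frame(weight, hip, age)

# Definition: regress each predictor on the others, then 1 / (1 - R^2)
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_inverse <- diag(solve(cor(X)))

print(round(vif_manual, 2))
print(round(vif_inverse, 2))
```

In this toy example weight and hip get large VIFs while age stays near 1, the same pattern that flags weight, chest, abdomen2 and hip above.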

5 Inferential analysis

Assumptions checking

All required assumptions have to be met before the model is used. The checks are as follows:

Outliers and Leverage

Outlying Y observations

##       81      207       82      225      231      224 
## 2.749330 2.465451 2.459357 2.403352 2.400372 2.367505
## [1] 3.516987

For α = 0.1, the Bonferroni critical value is 3.516987. None of the absolute studentized deleted residuals is larger than 3.516987, so there are no outlying Y observations.
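The Bonferroni check compares each studentized deleted residual with the t quantile at 1 - α/(2n) on n - p - 1 degrees of freedom. A minimal base R sketch on simulated data, with one planted outlier for illustration, looks like this:

```r
# Simulated regression with one deliberately shifted response value
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[10] <- y[10] + 8   # planted outlier in Y

fit <- lm(y ~ x)
p <- length(coef(fit))   # number of regression parameters

# Studentized deleted residuals and the Bonferroni critical value
t_del <- rstudent(fit)
alpha <- 0.10
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

# Cases whose absolute residual exceeds the critical value
outliers <- which(abs(t_del) > crit)
print(crit)
print(outliers)
```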

Leverage and Outlying in X

##        42        31        39       175        41       206       106         5 
## 0.7965237 0.5275080 0.4389803 0.3749227 0.3085681 0.2752319 0.2034773 0.1617982

Eight leverage values are larger than 0.1587302, which is 2p/n. Their case numbers (from largest to smallest) are 42, 31, 39, 175, 41, 206, 106 and 5. These cases are all outlying in X.

Influential cases (Cook’s distance)

Based on Cook’s distance, 9 values are larger than 0.02463054, which is 4/(n-p). Only case 39, with a Cook’s distance of 0.24746926, stands out as much more influential than the other cases. Case 39 needs further investigation.
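The two rules of thumb used above (leverage above 2p/n, Cook’s distance above 4/(n-p)) can be sketched in base R as follows. The data are simulated, with case 1 planted as a high-leverage, influential point.

```r
# Simulated data with one case far from the others in X and off the trend in Y
set.seed(5)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)
x1[1] <- 8
y[1] <- -10

fit <- lm(y ~ x1 + x2)
p <- length(coef(fit))

# Leverage: diagonal of the hat matrix, flagged above 2p/n
h <- hatvalues(fit)
high_leverage <- which(h > 2 * p / n)

# Influence: Cook's distance, flagged above 4/(n - p)
d <- cooks.distance(fit)
influential <- which(d > 4 / (n - p))

print(high_leverage)
print(sort(d[influential], decreasing = TRUE))
```

Sorting the flagged Cook’s distances makes it easy to see whether one case dominates, as case 39 does in the report.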

Investigate case 39

Method:

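We build a first-order model without case 39 (model name: fit.39) and compare its coefficients with those of the model fitted to all cases (model name: model). In the output below, row [1,] is the model with all cases and row [2,] is the model without case 39.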

##      (Intercept)        age      weight     height       neck       chest
## [1,]  -6.8529897 0.03886669 -0.05538916 -0.1061088 -0.5833429  0.05747088
## [2,]  -0.2608794 0.05076848 -0.01761815 -0.1368490 -0.4875556 -0.01241289
##       abdomen2        hip     thigh        knee      ankle    biceps   forearm
## [1,] 0.8822237 -0.2706743 0.1465729  0.08447148 -0.3135759 0.3850058 0.4199717
## [2,] 0.8691145 -0.2203312 0.1097472 -0.01199683 -0.2748800 0.4092526 0.2447795
##          wrist
## [1,] -1.443656
## [2,] -1.600127

The coefficients change when case 39 is removed. The most obvious change is in the intercept, which is -6.8529897 for the model with all cases and -0.2608794 for the model without case 39. The coefficients of the other variables also change. Based on this investigation, case 39 is removed.

Normality

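The Box-Cox procedure above showed that the response variable does not need to be transformed, and the QQ plot described at the end of this section is close to a straight line. We therefore treat the normality assumption as met.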

Independence

A Durbin-Watson Test can be used to check if the residuals are independent.

The null hypothesis: There is no correlation among the residuals.

The alternative hypothesis: The residuals are autocorrelated.

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.06599532      2.111929   0.442
##  Alternative hypothesis: rho != 0

From the output, the test statistic is 2.111929 and the corresponding p-value is 0.442. Since this p-value is larger than 0.05, we fail to reject the null hypothesis. There is no evidence of correlation among the residuals, so we treat them as independent.
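The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared successive differences of the residuals divided by their sum of squares, and it is roughly 2(1 - r), where r is the lag-1 autocorrelation. The base R sketch below uses simulated independent errors and does not reproduce the p-value that `car::durbinWatsonTest` reports.

```r
# Simulated regression with independent errors
set.seed(13)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

print(c(dw = dw, approx = 2 * (1 - r1)))
```

Applied to the report’s output, 2(1 - (-0.066)) is about 2.13, close to the reported statistic of 2.111929; a value near 2 indicates little autocorrelation.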

Equal variances

Equal variances can be tested with the Non-Constant Error Variance (NCV) test.

The null hypothesis: The error variance is constant.

The alternative hypothesis: The error variance changes with the fitted values.

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.2579675, Df = 1, p = 0.61152

Based on the Non-Constant Error Variance (NCV) test result, we fail to reject \(H_0\) because the p-value is 0.61152, which is larger than 0.05. There is no evidence against constant error variance.
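To make the test less of a black box, here is a hand-rolled sketch of a score test for non-constant variance against the fitted values (the Breusch-Pagan form). It is written from the textbook definition under the assumption that it matches what `car::ncvTest` computes; the function name `ncv_score_test` is hypothetical and the data are simulated.

```r
# Score test: regress scaled squared residuals on the fitted values,
# then compare half the regression sum of squares with chi-square(1)
ncv_score_test <- function(fit) {
  e2 <- residuals(fit)^2
  u <- e2 / mean(e2)
  aux <- lm(u ~ fitted(fit))
  reg_ss <- sum((fitted(aux) - mean(u))^2)
  chisq <- reg_ss / 2
  c(chisq = chisq, df = 1, p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

set.seed(17)
n <- 300
x <- runif(n, 1, 10)
y_const <- 1 + 2 * x + rnorm(n, sd = 3)   # constant error variance
y_fan <- 1 + 2 * x + rnorm(n, sd = x)     # spread grows with x

res_const <- ncv_score_test(lm(y_const ~ x))
res_fan <- ncv_score_test(lm(y_fan ~ x))
print(res_const)
print(res_fan)
```

A small p-value, as in the fan-shaped example, signals non-constant variance; a large one, as in the report, gives no evidence against the assumption.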

The diagnostic plots use the data without case 39.

Linearity: The residuals vs. fitted values plot does not show any pattern. The assumption of linearity is met.

Constant variance: The residuals are centered at approximately 0, and their spread is approximately constant. The assumption of constant variance is met.

Normality: The points on the QQ plot fall close to a straight line. The assumption of normality is met.

6 Sensitivity analysis

Model fitting

We are going to fit two models, and then choose the better model as the final model:

  1. Linear regression: \(Y = X \beta + \epsilon\) and \(\hat \beta = (X^{\top}X)^{-1}X^{\top}Y\)

  2. Ridge regression: \(\hat Y_{\lambda}= X \hat \beta_{\lambda}\) and \(\hat \beta_{\lambda}=(X^{\top}X+\lambda I)^{-1}X^{\top}Y\)
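The two closed-form estimators above can be checked numerically. The sketch below implements them literally in base R on simulated data. As written, the ridge formula also penalizes the intercept column; packaged implementations usually center and scale the predictors instead, so this is an illustration of the formula rather than of any particular package.

```r
# Simulated design matrix with an intercept column
set.seed(3)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
beta_true <- c(1, 2, -1)
y <- drop(X %*% beta_true + rnorm(n))

# OLS: (X'X)^{-1} X'y
beta_ols <- solve(crossprod(X), crossprod(X, y))

# Ridge: (X'X + lambda I)^{-1} X'y (literal formula, intercept penalized too)
ridge_beta <- function(lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
beta_ridge10 <- ridge_beta(10)

print(cbind(ols = drop(beta_ols), ridge10 = drop(beta_ridge10)))
```

With lambda = 0 the ridge estimate equals the OLS estimate, and larger lambda shrinks the coefficient vector toward zero.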

Linear model

Stepwise Regression

Because there is high multicollinearity among the predictors, a forward stepwise procedure is used for model selection.

First order model with all predictors

The first “best” model: fit a first-order model with all predictors as the full model, and then use the forward stepwise procedure to find the “best” model based on AIC.

The final model (model1) is: bodyfat.p ~ abdomen2 + wrist + weight + biceps + height.

First-order and second-order effects for all predictors

The second “best” model: fit a model with first-order and second-order effects for all predictors as the full model, and then use the forward stepwise procedure to find the “best” model based on AIC.

The final model (model2) is: bodyfat.p ~ abdomen2 + wrist + weight + biceps + height + age + neck + weight:biceps + abdomen2:neck.
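An AIC-based forward stepwise search of this kind can be sketched with base R’s `step()`. The example uses simulated data and assumes the search starts from the intercept-only model with `direction = "both"`; the report does not show its exact call, so treat these settings as assumptions.

```r
# Simulated data: only x1 and x2 truly affect y
set.seed(9)
n <- 150
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

# Smallest and largest models that bound the search
null_fit <- lm(y ~ 1, data = d)
full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = d)
# For a second-order search the upper scope could be, e.g., y ~ (x1 + x2 + x3 + x4)^2

# Forward stepwise selection by AIC (terms may be added or dropped at each step)
selected <- step(null_fit,
                 scope = list(lower = ~ 1, upper = formula(full_fit)),
                 direction = "both", trace = 0)

print(formula(selected))
```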

Performance evaluation: compare the RMSPE (root mean squared prediction error) values of model1 and model2

We computed the RMSPE on both the training set and the validation set and compared the values to see which model does a better job.

## [1] 4.223362 4.402156
## [1] 4.096879 4.344482

The RMSPE of model1 is 4.223362 on the training data and 4.402156 on the validation data; the RMSPE of model2 is 4.096879 on the training data and 4.344482 on the validation data. Since these values are close to each other, the RMSPE alone does not tell us which model is better.

Moreover, for each model the RMSPE on the training set and on the validation set are similar, so neither model shows signs of overfitting.
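For clarity, the RMSPE is the square root of the mean squared difference between observed and predicted values. A minimal sketch follows, with a hypothetical helper `rmspe` and simulated data standing in for the training and validation sets.

```r
# Root mean squared prediction error
rmspe <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Simulated data and a 70/30 split
set.seed(99)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
idx <- sample(100, 70)

fit <- lm(y ~ x, data = d[idx, ])
rmspe_train <- rmspe(d$y[idx], predict(fit, newdata = d[idx, ]))
rmspe_valid <- rmspe(d$y[-idx], predict(fit, newdata = d[-idx, ]))

print(c(train = rmspe_train, valid = rmspe_valid))
```

A validation RMSPE much larger than the training RMSPE would point to overfitting; similar values, as reported above, do not.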

Final decision on the “best” linear regression model

## 
## Call:
## lm(formula = bodyfat.p ~ abdomen2 + wrist + weight + biceps + 
##     height, data = train.new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5907 -2.9934 -0.6306  3.1370 10.3983 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22.91040   10.64676  -2.152  0.03283 *  
## abdomen2      0.92117    0.07387  12.470  < 2e-16 ***
## wrist        -1.64610    0.53487  -3.078  0.00244 ** 
## weight       -0.10542    0.03598  -2.930  0.00386 ** 
## biceps        0.44621    0.19727   2.262  0.02497 *  
## height       -0.13160    0.09363  -1.406  0.16169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.298 on 169 degrees of freedom
## Multiple R-squared:  0.7378, Adjusted R-squared:   0.73 
## F-statistic: 95.09 on 5 and 169 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = bodyfat.p ~ abdomen2 + wrist + weight + biceps + 
##     height + age + neck + weight:biceps + abdomen2:neck, data = train.new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5787 -2.8382 -0.3092  2.6513 10.8889 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.762992  45.442832   0.017  0.98662   
## abdomen2      -0.129136   0.685203  -0.188  0.85074   
## wrist         -1.832476   0.603063  -3.039  0.00276 **
## weight         0.368951   0.179163   2.059  0.04104 * 
## biceps         2.815559   0.944736   2.980  0.00332 **
## height        -0.134748   0.092311  -1.460  0.14627   
## age            0.057218   0.034694   1.649  0.10101   
## neck          -2.632958   1.629160  -1.616  0.10797   
## weight:biceps -0.012841   0.005199  -2.470  0.01454 * 
## abdomen2:neck  0.024999   0.017746   1.409  0.16080   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.219 on 165 degrees of freedom
## Multiple R-squared:  0.7532, Adjusted R-squared:  0.7398 
## F-statistic: 55.96 on 9 and 165 DF,  p-value: < 2.2e-16

For model1, based on the p-value of each X variable, only height is not significant. For model2, the intercept, abdomen2, height, age, neck and the abdomen2:neck interaction are not significant. In addition, the R-squared and adjusted R-squared values of the two models are similar (0.7378 and 0.73 for model1; 0.7532 and 0.7398 for model2). Because model1 is simpler and performs about as well, we decided that model1 is the better model.

Ridge regression

We fit a ridge regression, choose \(\lambda\) by GCV (generalized cross-validation), and compare the performance of the linear regression model and the ridge regression model.

If a model fits well, the points on the plot should line up along the red straight line (slope = 1). The OLS model clearly does a better job than the ridge model, whose points are not close to the red line at all. Therefore, linear regression is the better model.

## [1] 4.223362 4.402156
## [1] 40.39622 42.00466

The RMSPE values of the ridge regression model are 40.39622 (training) and 42.00466 (validation). The RMSPE values of the OLS model are 4.223362 and 4.402156. Comparing the RMSPE values, the ridge model has a much larger error.

Based on the performance of the OLS and ridge regression models, we decided that the linear regression model is the best model.

Fitting the “best” model to the whole data set except case 39
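For readers who want to reproduce a comparison like this, the sketch below fits a ridge regression with `MASS::lm.ridge`, picks \(\lambda\) by GCV, and computes a validation RMSPE next to OLS. The data are simulated and the lambda grid is an assumption. One known pitfall with `lm.ridge` is that the `coef` component stored in the fitted object is on the standardized scale, while `coef()` returns original-scale coefficients with the intercept first; whether that played any role in the gap reported above cannot be determined from the source.

```r
library(MASS)

# Simulated data with two highly correlated predictors
set.seed(42)
n <- 120
x1 <- rnorm(n, 90, 10)
x2 <- x1 + rnorm(n, 0, 2)
y <- -40 + 0.9 * x1 - 0.3 * x2 + rnorm(n, sd = 4)
dat <- data.frame(y, x1, x2)
train <- dat[1:84, ]
valid <- dat[85:120, ]

# Ridge fits over a grid of lambda values; keep the one with the smallest GCV
lambdas <- seq(0, 10, by = 0.1)
ridge_fit <- lm.ridge(y ~ x1 + x2, data = train, lambda = lambdas)
best <- which.min(ridge_fit$GCV)

# coef() gives original-scale coefficients, intercept first
beta_ridge <- coef(ridge_fit)[best, ]
pred_ridge <- drop(cbind(1, as.matrix(valid[, c("x1", "x2")])) %*% beta_ridge)

# OLS on the same split for comparison
ols_fit <- lm(y ~ x1 + x2, data = train)
pred_ols <- predict(ols_fit, newdata = valid)

rmspe_ridge <- sqrt(mean((valid$y - pred_ridge)^2))
rmspe_ols <- sqrt(mean((valid$y - pred_ols)^2))
print(c(lambda = lambdas[best], ridge = rmspe_ridge, ols = rmspe_ols))
```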

## 
## Call:
## lm(formula = bodyfat.p ~ abdomen2 + wrist + weight + biceps + 
##     height, data = bodyfat.new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0406  -3.0377  -0.3776   3.2839   9.2021 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.62233    9.22625  -2.560  0.01106 *  
## abdomen2      0.93624    0.06124  15.287  < 2e-16 ***
## wrist        -1.43766    0.43334  -3.318  0.00105 ** 
## weight       -0.10084    0.03050  -3.306  0.00109 ** 
## biceps        0.26078    0.15284   1.706  0.08924 .  
## height       -0.11380    0.08788  -1.295  0.19656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.276 on 245 degrees of freedom
## Multiple R-squared:  0.7413, Adjusted R-squared:  0.7361 
## F-statistic: 140.4 on 5 and 245 DF,  p-value: < 2.2e-16

The final linear regression model is \(\widehat{\text{bodyfat.p}} = -23.62233 + 0.93624 \cdot \text{abdomen2} - 1.43766 \cdot \text{wrist} - 0.10084 \cdot \text{weight} + 0.26078 \cdot \text{biceps} - 0.11380 \cdot \text{height}\)

Use the linear regression model to predict the body fat percentage for case 39

Case 39 was treated as an outlier and removed from the data set. We wanted to see what body fat percentage the model predicts from case 39’s measurements.

##        fit      lwr      upr
## 1 51.16071 41.75175 60.56968

The actual body fat percentage for case 39 is 35.2%. However, the final linear regression model predicts 51.16% for case 39, with a 95% prediction interval of 41.75% to 60.57%. The actual value is well below the lower limit of 41.75%.
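The `fit`, `lwr` and `upr` columns in the output above are what `predict()` returns when an interval is requested. A minimal sketch on simulated data shows the call and how a prediction interval differs from a confidence interval; the variable names and the new case are hypothetical.

```r
# Simulated single-predictor model (illustrative values only)
set.seed(21)
n <- 80
d <- data.frame(abdomen = rnorm(n, 92, 10))
d$fat <- -35 + 0.6 * d$abdomen + rnorm(n, sd = 4)
fit <- lm(fat ~ abdomen, data = d)

# A hypothetical new case far from the centre of the data
new_case <- data.frame(abdomen = 120)

# Prediction interval: where a single new observation is expected to fall
pi95 <- predict(fit, newdata = new_case, interval = "prediction", level = 0.95)

# Confidence interval: uncertainty about the mean response only
ci95 <- predict(fit, newdata = new_case, interval = "confidence", level = 0.95)

print(pi95)
print(ci95)
```

The prediction interval is always wider because it adds the residual variance of a single observation to the uncertainty in the fitted mean.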

7 Discussion

The R-squared value of the final model is 0.7413, so the model explains 74.13% of the variance in body fat percentage. The coefficients of abdomen2, wrist and weight are significant at the 0.05 level. The coefficient of biceps (p = 0.089) is significant only at the 0.10 level, and the coefficient of height (p = 0.197) is not significant.

Based on the final model, the goal of finding an easier way to estimate body fat percentage was achieved. By measuring the abdomen (cm), wrist (cm), weight (lbs), biceps (cm) and height (inches), a person’s body fat percentage can be estimated with the equation: \(\widehat{\text{bodyfat.p}} = -23.62233 + 0.93624 \cdot \text{abdomen2} - 1.43766 \cdot \text{wrist} - 0.10084 \cdot \text{weight} + 0.26078 \cdot \text{biceps} - 0.11380 \cdot \text{height}\)
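As a usage sketch, the final equation can be wrapped in a small function. The coefficients are the ones reported above; the function name `predict_body_fat` and the example measurements are hypothetical, chosen near the sample medians.

```r
# Final model from the report, with units as in the data set:
# abdomen, wrist and biceps in cm, weight in lbs, height in inches
predict_body_fat <- function(abdomen, wrist, weight, biceps, height) {
  -23.62233 + 0.93624 * abdomen - 1.43766 * wrist -
    0.10084 * weight + 0.26078 * biceps - 0.11380 * height
}

# Hypothetical person with measurements near the sample medians
estimate <- predict_body_fat(abdomen = 90, wrist = 18, weight = 180,
                             biceps = 32, height = 70)
print(round(estimate, 2))
# about 16.99 percent body fat
```

Estimates for measurements far outside the range of the data, such as those of case 39, should be treated with caution.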

Answer the questions of interest:

  1. Does a person who is heavy and tall tend to have a higher body fat percentage? Not according to the final model. The coefficient of weight is negative: holding the other variables constant, each one-pound increase in weight is associated with an average decrease of 0.10084 percentage points in body fat. The coefficient of height is also negative: holding the other variables constant, each one-inch increase in height is associated with an average decrease of 0.11380 percentage points in body fat, although this coefficient is not statistically significant.

  2. How well can measuring the size of the body (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and wrist) predict the body fat percentage? In the final model, abdomen and wrist predict body fat percentage significantly, and biceps is significant only at the 0.10 level. Neck, chest, hip, thigh, knee, ankle and forearm were not selected by the stepwise procedure and do not appear in the final model. Holding the other variables constant, on average, each one-centimeter increase in abdomen circumference is associated with an increase of 0.93624 percentage points in body fat; each one-centimeter increase in biceps circumference with an increase of 0.26078 percentage points; and each one-centimeter increase in wrist circumference with a decrease of 1.43766 percentage points.

  3. What age range has the highest body fat percentage? The final model cannot answer this question. The variable age was eliminated during the forward stepwise model selection procedure, so the model says nothing about how body fat percentage varies with age.

Acknowledgement

References

Tanita Europe B.V. Understanding your measurements - Body Fat Percentage. (n.d.). Retrieved March 25, 2023, from https://tanita.eu/understanding-your-measurements/body-fat-percentage

Healthline. (2022, January 24). Obesity Facts: Causes, Risks, and More. Retrieved March 25, 2023, from https://www.healthline.com/health/obesity-facts#1.-More-than-one-third-of-adults-in-the-United-States-are-obese

Mackenzie, B. (n.d.). Siri Equation for Body Density. Topend Sports. Retrieved March 25, 2023, from https://www.topendsports.com/testing/siri-equation.htm

Men’s Health. (2020, July 22). Ronnie Coleman Says He Now Has No Feeling From the Waist Down. Retrieved March 25, 2023, from https://www.menshealth.com/health/a33247811/ronnie-coleman-body-fat/

ABC News. (2015, April 8). Legendary Bodybuilder Who Died of Bodybuilding Diet Lives On in Graphic Photos. Retrieved March 25, 2023, from https://abcnews.go.com/Health/legendary-bodybuilder-died-body-fat-lives/story?id=29899438#

Session info

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] car_3.1-1     carData_3.0-5 MASS_7.3-58.1
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.31   R6_2.5.1        jsonlite_1.8.4  evaluate_0.20  
##  [5] highr_0.10      cachem_1.0.6    rlang_1.0.6     cli_3.6.0      
##  [9] rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2     vctrs_0.5.2    
## [13] rmarkdown_2.20  tools_4.2.2     abind_1.4-5     xfun_0.37      
## [17] yaml_2.3.7      fastmap_1.1.0   compiler_4.2.2  htmltools_0.5.4
## [21] knitr_1.42      sass_0.4.5