# Statistics - Five Factors Affect the Accuracy of the Model

Essay by Amy Meng • January 24, 2019


(a)

[pic 1][pic 2][pic 3][pic 4][pic 5]

We found that five factors affect the accuracy of the model: insurance, chcond, age, sex and illness. In the age plot we identify four unusual points (416, 525, 711 and 739), all of which lie horizontally away from the main cluster. The reason is that these points stand for people with special conditions. In detail, point 525 represents older adults with a poor health score who fall ill frequently. Point 416 represents younger adults who care about their physical condition and visit a health centre for check-ups. Point 711 represents elderly people who attend health consultations to obtain prescription drugs or advice for mild age-related diseases. Point 739 represents older people with a serious chronic illness. These observations can therefore be regarded as unusual points, and the distribution of the points does not fit a linear model.

Six points are identified in the illness plot: 235, 525, 663, 739, 766 and 909. They lie in the upper part of the plot, well away from the main cluster. Point 235 may represent fit people who suddenly suffer an activity-limiting condition such as a fracture. Point 525 stands for middle-aged people who fall ill easily. Point 663 represents older adults with a chronic illness such as acute heart disease or asthma. Point 739 represents older people who suffer from an activity-limiting illness and cannot look after themselves independently. Notably, point 766 indicates people with an acute condition, such as an acute allergy, who need a long hospital stay before returning to normal life. Finally, point 909 represents people in good physical condition who catch a mild illness such as a cold and only need advice from a non-doctor. Therefore, the outliers do not form a linear trend.

For chcond, insurance and sex, we identify points 739, 766 and 235 (chcond), points 739 and 265 (insurance) and points 739 and 265 (sex) respectively. These points all lie far from the main clusters. Point 739 represents people with private health insurance, which means they can receive discounted or even free medical services. Overall, this linear model needs further modification.

(b)

[pic 6][pic 7]

In the residuals vs fitted plot, the residuals take positive values where the fitted values are small. The observations are not randomly scattered around the red line (the less curved red line), and the unusual observations are not evenly distributed around 0, which suggests a violation of the assumption of independent errors. The plot also shows that the variance is not constant. In addition, there are outlying points (711, 739 and 831). In the following parts, we make some adjustments to obtain a better fit.
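The residuals vs fitted check described above can be sketched numerically. The following is a minimal illustration only, assuming a single predictor and made-up data (the essay's model has several predictors and its data are not reproduced here); the function names are hypothetical. It fits a least-squares line, computes residuals, and compares the residual variance in the lower and upper halves of the fitted values as a crude indicator of non-constant variance.

```python
from statistics import mean, pvariance

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals_and_fitted(x, y):
    """Return (fitted values, residuals) for the simple regression of y on x."""
    a, b = fit_simple_ols(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid

def spread_by_half(fitted, resid):
    """Residual variance for the lower and upper half of the fitted values."""
    pairs = sorted(zip(fitted, resid))
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return pvariance(low), pvariance(high)

# Hypothetical data whose scatter grows with x (not the essay's data).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.9, 4.7, 8.6, 6.1]

fitted, resid = residuals_and_fitted(x, y)
low_var, high_var = spread_by_half(fitted, resid)
print(f"residual variance: low half {low_var:.3f}, high half {high_var:.3f}")
```

A large gap between the two variances points in the same direction as a funnel shape in the plot; it does not replace looking at the plot itself.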

(c)

[pic 8][pic 9]

This residuals vs fitted plot reveals that the first transformation (log(y+1)) does not correct the problems. The points are still not evenly distributed around 0, although the largest residual is smaller. In addition, the lowess smoother trends downward and finally falls below 0. This hint of curvature reflects a violation of the assumptions of independence and constant variance. Observation 711, discussed in part (b), remains unusual. However, R² in this model is 0.2541, higher than in the original model.

[pic 10][pic 11]

[pic 12][pic 13]

For the second and third transformations (y^0.5 and y^0.25), the plots do not fix any of the previously apparent problems. Both lowess smoothers show some curvature, which would be a strong violation of the underlying assumptions. In addition, there are some obvious patterns and a total of five unusual points (2494, 2185, 680, 711 and 933) in these two plots, which may result from non-constant variance. R² for these two transformations is 0.2536 and 0.2289 respectively, neither higher than that of the first transformation.
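The comparison of R² across the transformations log(y+1), y^0.5 and y^0.25 can be sketched as follows. This is an illustrative sketch only: it assumes a single predictor (so that R² equals the squared correlation) and uses hypothetical count data, not the essay's data set. The name `r_squared` is a helper introduced here.

```python
import math

def r_squared(x, y):
    """R-squared of the simple least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Candidate response transformations discussed in part (c).
transforms = {
    "y": lambda v: v,
    "log(y+1)": math.log1p,
    "y^0.5": math.sqrt,
    "y^0.25": lambda v: v ** 0.25,
}

# Hypothetical predictor and non-negative count response.
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [0, 1, 0, 2, 1, 3, 2, 6, 4, 9]

results = {name: r_squared(x, [f(v) for v in y]) for name, f in transforms.items()}
for name, value in results.items():
    print(f"{name:10s} R^2 = {value:.4f}")
```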

(d)

[pic 14][pic 15][pic 16]

We use the Box-Cox method in MASS to obtain a suitable λ (-3.16), the value with the highest log-likelihood. We therefore replace y with y^(-3.16) as the fourth transformation. In the residuals vs fitted plot, two distinct lines can be identified and the lowess smoother has an obvious curvature. The outliers (2494, 2185 and 2628) hint at non-constant variance, which violates the underlying assumptions. Moreover, R² in this model is the lowest of all, 0.2024. Thus, the Box-Cox method suggests a poor transformation.
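The idea of choosing λ by the highest log-likelihood can be illustrated with a small grid search over the Box-Cox profile log-likelihood. This is a sketch under stated assumptions, not the MASS implementation: it assumes one predictor, a strictly positive response, and hypothetical data, and all function names are introduced here for illustration.

```python
import math

def boxcox_transform(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam, or log(y) when lam is 0."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def rss_simple_ols(x, z):
    """Residual sum of squares from the simple regression of z on x."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    szz = sum((zi - z_bar) ** 2 for zi in z)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    return szz - sxz ** 2 / sxx

def profile_loglik(x, y, lam):
    """Profile log-likelihood of lam, up to an additive constant."""
    n = len(y)
    rss = rss_simple_ols(x, boxcox_transform(y, lam))
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(rss / n) + jacobian

def best_lambda(x, y, grid):
    """Grid value of lam with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(x, y, lam))

# Hypothetical strictly positive response (not the essay's data).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.6, 2.9, 3.6, 6.1, 7.9, 13.5, 19.0]

grid = [i / 4 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
lam_hat = best_lambda(x, y, grid)
print("lambda with highest log-likelihood on the grid:", lam_hat)
```

The term `(lam - 1) * sum(log(y))` is the Jacobian of the transformation; without it, log-likelihoods for different λ would not be on a comparable scale.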

In conclusion, comparing the five models in parts (b), (c) and (d), we consider replacing y with log(y+1) the most suitable choice for the multiple linear regression, as its plot complies best with the underlying assumptions.

(e)

[pic 17]

[pic 18]

[pic 19]

In the added variable plot for income, the smoother (red) is quite flat, which indicates a linear relationship between y and income. Although observation 711 is flagged in this plot, we do not consider it an extreme point, since the regression line (green) is quite close to the smoother. The slope of the regression line in the plot is close to 0 (-0.012) and the corresponding F test is not significant (F(1, 5188) = 0.4491, p = 0.5028 > 0.05); thus, we conclude that adding income as an additional predictor would bring little improvement to the model.

[pic 20][pic 21]

In the added variable plot for age, linearity between y and age can be identified, as the lowess smoother (red) is flat and close to 0. Furthermore, we see no need to treat the unusual points (711 and 1473) separately, because the regression line (green) and the smoother are quite similar. Meanwhile, the regression coefficient is very small (0.048) and the corresponding sequential F test is not significant (F(1, 5188) = 1.445, p = 0.2294 > 0.05), which means that adding the variable age would bring no recognisable improvement.
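For reference, the F tests reported for income and age have the general form of a partial (sequential) F test for adding one predictor. The symbols below are introduced here and are not from the essay: RSS_0 is the residual sum of squares without the candidate variable, RSS_1 the residual sum of squares with it, and 5188 is the residual degrees of freedom of the larger model, as reported in the text.

```latex
F \;=\; \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/1}{\mathrm{RSS}_1/5188}
\;\sim\; F_{1,\,5188} \quad \text{under } H_0:\ \beta_{\text{new}} = 0
```

A value of F near 1 or below, as for income (0.4491) and age (1.445), means the added variable reduces the residual sum of squares by no more than would be expected by chance.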

(f)

[pic 22]

At a family-wise level of α = 0.05, the confidence intervals for “medlevy to freepor”, “medlevy to levyplus” and “medlevy to freerepa” contain 0; thus, there are no significant differences between medlevy and the other insurance contracts. Likewise, since 0 is included in the confidence interval for “levyplus to freerepa”, no significant difference is identified between these two contracts. However, because the confidence intervals for “levyplus to freepor” and “freepor to freerepa” do not contain 0, these differences are statistically significant. This indicates that freepor and freerepa have different effects on the response, as do levyplus and freepor.
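The decision rule used above (a pairwise difference is significant when its simultaneous confidence interval excludes 0) can be written out as a short sketch. The interval bounds below are hypothetical placeholders chosen only to mirror the qualitative pattern described in the text; the essay does not report the actual limits.

```python
def differs_significantly(lower, upper):
    """A difference is significant when its interval excludes 0."""
    return lower > 0 or upper < 0

# Hypothetical simultaneous 95% intervals (NOT the essay's values).
intervals = {
    "medlevy to freepor": (-0.05, 0.21),
    "medlevy to levyplus": (-0.12, 0.04),
    "medlevy to freerepa": (-0.20, 0.09),
    "levyplus to freerepa": (-0.18, 0.02),
    "levyplus to freepor": (0.03, 0.27),
    "freepor to freerepa": (-0.35, -0.04),
}

significant = sorted(
    pair for pair, (lo, hi) in intervals.items() if differs_significantly(lo, hi)
)
print("significant pairwise differences:", significant)
```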

...

...