Intro to Data Analytics & Visualization | Understanding simple regression

In simple regression, we analyze the relationship between a predictor (the attribute we think to be the cause) and the criterion (the attribute we think is the consequence). There are two very important parameters (among others) that result from a regression analysis:

  • The intercept: This is the average value of the criterion when the predictor is 0, that is, when the effect of the predictor is partialed out
  • The slope coefficient: This indicates by how many units, on average, the criterion changes (with reference to the intercept) when the predictor increases by one unit

Regression seeks the parameter values that best explain the relationship, but such a model seldom reflects the relationship entirely: measurement error and attributes that are not included in the analysis also affect the data. The residuals express the deviation of the observed data points from the model. Each residual is the vertical distance from a point to the regression line. Let's examine this with an example from the iris dataset. We have already seen that the dataset contains data about iris flowers. For the purpose of this example, we will consider the petal length as the criterion and the petal width as the predictor.

We will now create a scatterplot, with the petal width on the $x$ axis and the petal length on the $y$ axis, in order to display the data points on these dimensions. We will then compute the regression model and use it to add the regression line to the plot. This should look familiar, as we have already done this in Chapter 2, Visualizing and Manipulating Data Using R, and Chapter 3, Data Visualization with Lattice, when discussing plots in R. This redundancy is not accidental: plotting data and their relationship is one of the most important aspects of analyzing data:
    # Scatterplot of the criterion (petal length) against the predictor (petal width)
    plot(iris$Petal.Length ~ iris$Petal.Width,
         main = "Relationship between petal length and petal width",
         xlab = "Petal width", ylab = "Petal length")
    # Fit the simple regression and draw its line on the plot
    iris.lm = lm(iris$Petal.Length ~ iris$Petal.Width)
    abline(iris.lm)
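
To make the notion of a residual concrete, here is a minimal sketch that recomputes the residuals by hand as observed value minus fitted value. The names `fit`, `fitted_length`, and `manual_resid` are illustrative and not part of the original example, and the `data = iris` form of `lm()` is used only for brevity.

```r
# Same model as in the text, written with the data argument (an equivalent form).
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Fitted values: the height of the regression line at each observed petal width.
fitted_length <- fitted(fit)

# A residual is the observed value minus the fitted value,
# i.e. the signed vertical distance from the point to the line.
manual_resid <- iris$Petal.Length - fitted_length

# Compare with the residuals R reports for the model.
all.equal(unname(manual_resid), unname(residuals(fit)))

# With an intercept in the model, the residuals sum to (numerically) zero.
sum(manual_resid)
```

Positive residuals belong to points above the line and negative residuals to points below it.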

Computing the intercept and slope coefficient

In simple regression, data can be modeled as the intercept, plus the slope multiplied by the value of the predictor, plus the residual. We are now going to explain how to compute these.
The slope coefficient can be computed in several ways. One is to multiply the correlation coefficient by the standard deviation of the criterion divided by the standard deviation of the predictor. Another is to compute a numerator and a denominator from raw sums. The numerator is the number of observations multiplied by the sum of the observation-wise products of the criterion and the predictor, minus the sum of the values of the predictor multiplied by the sum of the values of the criterion. The denominator is the number of observations multiplied by the sum of the squared values of the predictor, minus the square of the sum of the values of the predictor. The slope is the numerator divided by the denominator. A third way is to rely on matrix computations, which we will not examine here.
The intercept can simply be computed as the mean of the criterion minus the slope coefficient multiplied by the mean of the predictor.
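
The verbal descriptions above can be written compactly. The notation is an assumption introduced here for illustration: $y_i$ is the criterion, $x_i$ the predictor, $n$ the number of observations, $r_{xy}$ the correlation coefficient, $s_x$ and $s_y$ the standard deviations, $a$ the intercept, $b$ the slope, and $e_i$ the residual.

$$
y_i = a + b\,x_i + e_i, \qquad
b = r_{xy}\,\frac{s_y}{s_x}
  = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad
a = \bar{y} - b\,\bar{x}
$$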
Let's take the same example as before to compute the slope coefficient (using the two computations we have seen) and the intercept.

To compute the slope coefficient using the first way presented, we start by computing the correlation coefficient of the petal length and petal width, and the standard deviation of the predictor and criterion. We then perform the described computation:
    # Slope = correlation * (sd of criterion / sd of predictor)
    Slopecoef = cor(iris$Petal.Length, iris$Petal.Width) *
      (sd(iris$Petal.Length) / sd(iris$Petal.Width))
    Slopecoef

The outputted value is $2.22994$. Let's write a function that implements the second way of computing the slope that we have seen. The criterion will be called y and the predictor x:

    coeffs = function(y, x) {
      # numerator: n * sum(xy) - sum(y) * sum(x)
      ((length(y) * sum(y * x)) -
        (sum(y) * sum(x))) /
        # denominator: n * sum(x^2) - (sum(x))^2
        (length(y) * sum(x^2) - sum(x)^2)
    }
    coeffs(iris$Petal.Length, iris$Petal.Width)
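
As a sanity check, the two slope computations and the intercept formula can be compared with the coefficients that R's built-in `lm()` reports. This is a minimal sketch; the names `slope_cor`, `slope_sums`, `intercept`, and `ref` are illustrative and do not come from the original text.

```r
y <- iris$Petal.Length  # criterion
x <- iris$Petal.Width   # predictor
n <- length(y)

# First way: correlation times the ratio of standard deviations.
slope_cor <- cor(y, x) * (sd(y) / sd(x))

# Second way: the raw-sums formula.
slope_sums <- (n * sum(y * x) - sum(y) * sum(x)) /
  (n * sum(x^2) - sum(x)^2)

# Intercept: mean of the criterion minus slope times mean of the predictor.
intercept <- mean(y) - slope_cor * mean(x)

# Reference values from lm(): intercept first, then slope.
ref <- unname(coef(lm(y ~ x)))

c(slope_cor = slope_cor, slope_sums = slope_sums, intercept = intercept)
ref
```

Both slope values should agree with each other and with the second element of `ref` up to floating-point rounding.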

Computing the significance of the coefficient

As we have seen in the first section of the chapter, determining the significance of the estimates is essential for interpretation; even a big coefficient cannot be interpreted if it is not significantly different from 0. Here, you will learn a little more about the computation of the significance for simple regression:

  1. The first thing we need to do is compute the standard error of the slope coefficient (a value that assesses its precision).
  2. We start by taking the square root of the sum of the squared residuals (SSR, named SSE in the following code) divided by the degrees of freedom (DF, that is, the number of observations minus two).
  3. We then divide this value (called $S$ in the following code) by the square root of the sum of the squared mean-subtracted values of $x$. The result is the standard error of the slope.
  4. After we obtain the standard error, we can compute a t-score by dividing the slope coefficient by the standard error.
  5. The score is then compared to 0 on a t-distribution with DF degrees of freedom.

There is also a significance test for the intercept. To compute its standard error:

  1. Compute 1 divided by the number of observations, plus the squared mean of the predictor divided by the sum of the squared mean-subtracted values of the predictor.
  2. Take the square root of this value and multiply it by the value $S$ that we saw previously.
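
In symbols, the steps above read as follows. The notation is an assumption introduced here for illustration: $e_i$ are the residuals, $n$ is the number of observations, $\bar{x}$ is the mean of the predictor, $a$ and $b$ are the intercept and slope, and $SE_a$ and $SE_b$ are their standard errors.

$$
S = \sqrt{\frac{\sum_i e_i^2}{n-2}}, \qquad
SE_b = \frac{S}{\sqrt{\sum_i (x_i-\bar{x})^2}}, \qquad
SE_a = S\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i-\bar{x})^2}}
$$

$$
t_b = \frac{b}{SE_b}, \qquad t_a = \frac{a}{SE_a}
$$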

After we obtain the standard error for the intercept, its t-score can be computed as seen previously. The following code implements this and returns the standard error, t score, and significance for both the slope coefficient and the intercept of a simple linear regression:
    # resids(y, x, model) is not defined in this excerpt; it is assumed to
    # return the residuals of the model. model holds the intercept first
    # and the slope second.
    Significance = function(y, x, model) {
      SSE = sum(resids(y, x, model)^2)   # sum of the squared residuals
      DF = length(y) - 2                 # degrees of freedom
      S = sqrt(SSE / DF)
      SEslope = S / sqrt(sum((x - mean(x))^2))
      tslope = model[2] / SEslope
      sigslope = 2 * (1 - pt(abs(tslope), DF))
      SEintercept = S * sqrt(1 / length(y) +
        mean(x)^2 / sum((x - mean(x))^2))
      tintercept = model[1] / SEintercept
      sigintercept = 2 * (1 - pt(abs(tintercept), DF))
      # The excerpt ends here; returning the six values follows the description above.
      c(SEslope, tslope, sigslope, SEintercept, tintercept, sigintercept)
    }
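
One way to check these computations is to compare them with the coefficient table that `summary()` produces for an `lm()` fit. The sketch below is self-contained and takes the residuals from `lm()` rather than from the `resids` helper, which is not shown in this excerpt. The names `Sxx`, `est`, `se`, `tval`, `pval`, and `manual` are illustrative.

```r
y <- iris$Petal.Length  # criterion
x <- iris$Petal.Width   # predictor
fit <- lm(y ~ x)

n   <- length(y)
DF  <- n - 2                               # degrees of freedom
S   <- sqrt(sum(residuals(fit)^2) / DF)    # residual standard error
Sxx <- sum((x - mean(x))^2)                # sum of squared mean-subtracted x

SEslope     <- S / sqrt(Sxx)
SEintercept <- S * sqrt(1 / n + mean(x)^2 / Sxx)

est  <- unname(coef(fit))                  # intercept first, then slope
se   <- c(SEintercept, SEslope)
tval <- est / se

# Two-sided p-value. lower.tail = FALSE computes the upper tail directly,
# which avoids the rounding to 0 that 1 - pt(...) suffers for very large t.
pval <- 2 * pt(abs(tval), DF, lower.tail = FALSE)

manual <- cbind(estimate = est, se = se, t = tval, p = pval)
manual
summary(fit)$coefficients
```

The two printed tables should contain the same numbers. The choice of `lower.tail = FALSE` over `1 - pt(...)` only matters when the p-value is extremely small, as it is for the slope in this dataset.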
