Weighted Least Squares Computations using Stata
Ronald A Thisted
1 November 1998
Reconciling Stata's and Weisberg's versions of WLS
As noted in class, Stata and Weisberg differ slightly in the way that
the anova table is constructed. These differences make it essential
that you understand both what Stata is doing and what it is that
Weisberg is doing. [See, for instance, Weisberg pp 82-87, and
Stata Reference Manual [R] regress pp 130-132.]
This document is intended to clarify the issues, and to describe a new
Stata command that you can use (wls) to calculate weighted
least-squares estimates for problems such as the ``Strong interaction''
physics data described in Weisberg's example 4.1 (p. 83). We shall
use this data set to show how to obtain the WLS results tabulated on
page 87 in Table 4.2. You should have your copy of Weisberg open as
you read this document.
. infile p x y esd using alr085
(10 observations read)
As Weisberg formulates the problem, the weights wi are
proportional to the inverses of the individual item
variances --- not the standard deviations. Note, too, that
in the physics data, Weisberg takes the variance (sigma2) to be
equal to one (top of page 86).
Thus, we must first generate a Stata variable containing the weights,
which we calculate from the column of SD's provided in Table 4.1.
. generate wt = 1/esd^2
The regress command with aweights
Our first approach is to use the regress command in Stata,
specifying weights. This almost works. Note that we are specifying
what Stata refers to as analytic weights. These are the
(only) appropriate kinds of weights to use for this problem.
[Another type of weight, the frequency weight, is
useful when many individuals in the sample have identical values for Y
and all of the X's. In this case, the data set can be made much
smaller by simply listing each unique set of (Y,X) values a single
time, and then also recording the number of individuals in the sample
who had that set of values in a new Stata variable. Let's call the
new variable freq. In this case, regression can be
performed by using frequency weights in this (condensed) data set.
The regression command would look something like:
regress y x [fweight=freq]
Note that these frequency weights must, by definition, be positive
integers.]
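The equivalence between a frequency-weighted fit on the condensed data and an ordinary fit on the fully expanded data can be sketched numerically. Here is a Python illustration (not Stata) using made-up data; the variable names freq, y, and x mirror the discussion above but the numbers are purely illustrative:

```python
import numpy as np

# Made-up condensed data: unique (y, x) rows with their counts.
y = np.array([1.0, 2.0, 4.0])
x = np.array([0.0, 1.0, 2.0])
freq = np.array([3, 2, 5])          # positive integers, like fweights

# Expanded data: each row repeated freq[i] times.
y_full = np.repeat(y, freq)
x_full = np.repeat(x, freq)
X_full = np.column_stack([np.ones_like(x_full), x_full])
beta_full, *_ = np.linalg.lstsq(X_full, y_full, rcond=None)

# Frequency-weighted normal equations on the condensed data.
X = np.column_stack([np.ones_like(x), x])
W = np.diag(freq.astype(float))
beta_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

The two coefficient vectors agree exactly (up to rounding), which is why the condensed data set loses no information for regression purposes.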
. regress y x [aweight = wt]
(sum of wgt is 2.3230e-01)
Source | SS df MS Number of obs = 10
---------+------------------------------ F( 1, 8) = 124.63
Model | 14721.8925 1 14721.8925 Prob > F = 0.0000
Residual | 945.00803 8 118.126004 R-squared = 0.9397
---------+------------------------------ Adj R-squared = 0.9321
Total | 15666.9006 9 1740.76673 Root MSE = 10.869
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x | 530.8354 47.55003 11.164 0.000 421.1849 640.486
_cons | 148.4732 8.078651 18.378 0.000 129.8438 167.1026
------------------------------------------------------------------------------
When we compare this output to Table 4.2 in ALR, we see that the
parameter estimates for b0
and b1 and their standard errors
all agree with the values reported above.
Where the differences occur is in the analysis of variance table.
Although the F statistic is the same in ALR as in Stata, the sums of
squares differ. Note particularly that the residual mean square is
quite different. While Weisberg reports an estimated error variance
of 2.744, the calculation above reports 118.126.
The source of the difference is described in the Stata manual.
Briefly put, Stata is estimating sigma^2/W,
where W denotes the
average value of the weights. Stata reports the
sum of the weights, so the estimated value for sigma^2
can be obtained by the calculation
(118.126) x [(2.3230e-01) / 10] = 2.744
While this calculation is cumbersome, it can certainly be done. It
emphasizes, however, that
you must always take extra effort to confirm
that your statistical software is doing what you believe it ought to
be doing.
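The back-calculation is easy to script. A quick check in Python, using the numbers from the regress output above:

```python
# Recover Weisberg's estimate of sigma^2 from Stata's aweight output:
# multiply the residual MS by the average weight (sum of weights / n).
resid_ms = 118.126004   # Residual MS from the regress output above
sum_wt = 2.3230e-01     # "sum of wgt" reported by Stata
n = 10                  # number of observations

sigma2_hat = resid_ms * sum_wt / n
print(round(sigma2_hat, 3))   # 2.744, matching Weisberg's Table 4.2
```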
Applying weights directly and using regression through the origin
Another approach is described on pages 82 and 83. This approach can
be taken in Stata directly by calculating ``adjusted'' Y and X
variables, which Weisberg calls Z and M. The adjusted values are
obtained from the original values by multiplying by the
square root of the weight. Note that the constant term is
also adjusted here, so doing the ordinary least-squares fit for Z and
M requires a regression through the origin.
. gen w = sqrt(wt)
. gen yw = y*w
. gen xw = x*w
. regress yw xw w, noconstant
Source | SS df MS Number of obs = 10
---------+------------------------------ F( 2, 8) = 2303.24
Model | 12640.58 2 6320.28999 Prob > F = 0.0000
Residual | 21.9526539 8 2.74408174 R-squared = 0.9983
---------+------------------------------ Adj R-squared = 0.9978
Total | 12662.5326 10 1266.25326 Root MSE = 1.6565
------------------------------------------------------------------------------
yw | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
xw | 530.8354 47.55003 11.164 0.000 421.1849 640.486
w | 148.4732 8.078651 18.378 0.000 129.8438 167.1026
------------------------------------------------------------------------------
At first glance, this output looks better in terms of reproducing
Table 4.2. Not only are the regression coefficients and their
standard errors correct, the residual variance estimate is now correct
as well. Unfortunately, the F statistic is now enormous, and somehow
R2 has also changed.
The new problems are a result of having to use noconstant in
the regress command. In this case, Stata cannot tell that
the column we have called w above represents the (adjusted)
constant term, so it does not know that it should be omitted from both
the Total and Model sums of squares.
Of course, you could always construct the analysis of variance table
``by hand'' by running a series of regressions through the origin --
first with just w, and then the full model above. This would
provide all of the necessary ingredients to construct the Anova table
using a hand calculator to do the necessary subtractions and divisions.
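The whole transform-and-difference recipe can also be sketched outside Stata. The Python version below uses made-up data (the numbers are not Weisberg's; esd stands in for the known per-observation SDs) and cross-checks the transformed, through-the-origin fit against the weighted normal equations:

```python
import numpy as np

# Hypothetical data standing in for the physics example.
x   = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y   = np.array([200.0, 250.0, 290.0, 340.0, 400.0])
esd = np.array([5.0, 6.0, 4.0, 7.0, 5.0])

w = 1.0 / esd                     # sqrt of the weights wt = 1/esd^2
Z = w * y                         # adjusted response
M = np.column_stack([w, w * x])   # adjusted constant and adjusted x

def fit_origin(X, z):
    """Coefficients and residual SS for a regression through the origin."""
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta, np.sum((z - X @ beta) ** 2)

beta_z, rss_full = fit_origin(M, Z)               # full model
_, rss_null = fit_origin(w.reshape(-1, 1), Z)     # "constant"-only model

# Anova by hand: model SS is the drop in RSS; error df is n - 2.
ss_model = rss_null - rss_full
F = ss_model / (rss_full / (len(y) - 2))

# Cross-check: the same coefficients from the weighted normal equations.
X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / esd**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

The coefficients from the two routes coincide, and the differenced sums of squares reproduce the correctly normalized weighted anova table.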
A third solution
Since there is no direct method in Stata to obtain WLS estimates for
the error variance --- at least in the situation appropriate to the
physics data in Weisberg --- I have written a little Stata program,
wls.ado, which will perform the necessary calculations. Here
is how to use it.
First, download a copy of wls.ado from the course website.
You can find it in the ``Software'' section.
Put a copy of this file into the same directory that you typically
run Stata from. [If you are a Stata expert, you can put this file in
your personal ado directory (not the official Stata ado
directory). If this makes no sense to you, then you can ignore it.]
To confirm that there is a copy of wls.ado in your working
directory, ask Stata to check for you using the dir command
as shown below. If the file is present, it will produce one or more
lines of output that describe the file. In the example below, it
shows that the file was last modified on 11/1/98 and is 3207
characters long.
If the file is absent, Stata will report ``file not found.'' In this
case, give the pwd command (print working directory) to
determine the directory in which Stata is currently operating.
. dir wls.ado
11/01/98 15:32 3207 wls.ado
Once you have determined that this .ado file is present, you can use
the wls command. This is not a standard Stata command; it
is one that I wrote for use by the class.
Its syntax is identical to that of the regress command,
except that it must specify a weighted regression, and the
weights must be analytic weights.
You can see that this command produces output that coincides with
Table 4.2 in Weisberg.
. wls y x [aweight = wt]
WLS anova table uses renormalized weights (weight var is wt)
Source | SS df MS Number of obs = 10
---------+------------------------------ F( 1, 8) = 124.63
Model | 341.99136 1 341.99136 Prob > F = 0.0000
Residual | 21.952652 8 2.7440815 R-squared = 0.9397
---------+------------------------------ Adj R-squared = 0.9321
Total | 363.94401 9 40.438224 Root MSE = 1.657
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x | 530.8354 47.55003 11.164 0.000 421.1849 640.486
_cons | 148.4732 8.078651 18.378 0.000 129.8438 167.1026
------------------------------------------------------------------------------
Variance-weighted least squares: Another variation
In a sense, none of the calculations done above are really appropriate
for the physics data. Why?
The weighted least squares calculation is based on the assumption that
the variance of the observations is unknown, but that the relative
variances are known. Thus, only a single unknown parameter having to
do with variance needs to be estimated.
In the case of the physics data, however, the values in the last
column of Table 4.1 are essentially known values. When the
study was done at p=150 GeV/c, the standard deviation of the
distribution from which the Y's were generated was very nearly 5
y-units. It is for this reason that Weisberg asserts that we may as
well take sigma=1 without loss of generality at the top of page 86.
So in the physics data, there are no unknown variances. We
are still interested in the fit of the model (4.14), however, and we
are also interested in the estimated coefficients.
Stata has one more WLS command (this time, it is a real Stata command)
to deal with this situation. The command is named vwls, for
variance-weighted least squares. This command assumes that the
standard deviations (NOT variances) corresponding to each observation
are recorded in a column, in our case, esd. Here is the
result when applied to the physics data.
. vwls y x, sd(esd)
Variance-weighted least-squares regression Number of obs = 10
Goodness-of-fit chi2(8) = 21.95 Model chi2(1) = 341.99
Prob > chi2 = 0.0050 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x | 530.8354 28.70466 18.493 0.000 474.5753 587.0955
_cons | 148.4732 4.87686 30.444 0.000 138.9148 158.0317
------------------------------------------------------------------------------
A few things have changed from the wls output above. Most
important, the standard errors associated with the regression
coefficients have changed from those reported by all three WLS
variants we examined earlier. In effect, this is because the
individual variances are known here, and there is no common scale
factor that needs to be estimated from the observed (residual)
variation in the Y's.
Notice that there are two overall chi-squared tests here. The ``Model
chi-squared'' corresponds to the F tests that we performed in the
earlier regressions. The difference --- and the reason that the
chi-squared statistic is larger than the F value --- is that here we
do not need to estimate the denominator of the F statistic. Indeed,
this test statistic is exactly equal to the model mean
square in the WLS model fitted using the wls command.
The ``Goodness-of-fit'' chi-squared test that vwls produces
is a test for the adequacy of the linear model (4.14). Note that it
is equal to the error sum of squares in the wls output. The
large value of this statistic (and the small p-value corresponding to
it) strongly suggests that the linear model doesn't fit the data
well. Weisberg describes this test for lack of fit in section 4.2.
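Both chi-squared statistics, and the known-variance standard errors, come straight from the weighted normal equations. A Python sketch with hypothetical data (not Weisberg's numbers) shows the relationships; esd holds the KNOWN per-observation standard deviations:

```python
import numpy as np

# Hypothetical data; the true SD of each y is taken as known.
x   = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y   = np.array([200.0, 250.0, 290.0, 340.0, 400.0])
esd = np.array([5.0, 6.0, 4.0, 7.0, 5.0])

X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / esd**2)

beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # same as WLS estimates

# vwls-style standard errors: variances are known, so no scale estimate.
se_known = np.sqrt(np.diag(np.linalg.inv(X.T @ W @ X)))

resid = y - X @ beta
rss_w = resid @ W @ resid      # weighted residual SS

# Goodness-of-fit chi2 (df = n - 2): the weighted error sum of squares.
chi2_gof = rss_w

# Model chi2: the drop in weighted RSS when x joins an intercept-only
# weighted fit.  With 1 model df, it equals the model mean square.
b0 = (W.diagonal() * y).sum() / W.diagonal().sum()
rss_null = ((y - b0) ** 2 * W.diagonal()).sum()
chi2_model = rss_null - rss_w

# wls-style SEs would instead scale se_known by sqrt(rss_w / (n - 2)).
se_wls = se_known * np.sqrt(rss_w / (len(y) - 2))
```

The last line makes the earlier comparison concrete: the wls and vwls coefficient estimates agree, and their standard errors differ only by the residual scale factor that vwls does not need to estimate.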
This can be confirmed visually by plotting the data (as in Weisberg's
Figure 4.1).