Weighted Least Squares

Weighted Least Squares Computations using Stata

Ronald A Thisted

1 November 1998

Reconciling Stata's and Weisberg's versions of WLS

As noted in class, Stata and Weisberg differ slightly in the way that the anova table is constructed. These differences make it essential that you understand both what Stata is doing and what it is that Weisberg is doing. [See, for instance, Weisberg pp 82-87, and Stata Reference Manual [R] regress pp 130-132.] This document is intended to clarify the issues, and to describe a new Stata command that you can use (wls) to calculate weighted least-squares estimates for problems such as the ``Strong interaction'' physics data described in Weisberg's example 4.1 (p. 83). We shall use this data set to show how to obtain the WLS results tabulated on page 87 in Table 4.2. You should have your copy of Weisberg open as you read this document.

. infile p x y esd using alr085
(10 observations read)

As Weisberg formulates the problem, the weights w_i are proportional to the inverses of the individual item variances --- not the standard deviations. Note, too, that in the physics data, Weisberg takes the variance (sigma²) to be equal to one (top of page 86). Thus, we must first generate a Stata variable containing the weights, which we calculate from the column of SD's provided in Table 4.1.

. generate wt = 1/esd^2

The regress command with aweights

Our first approach is to use the regress command in Stata, specifying weights. This almost works. Note that we are specifying what Stata refers to as analytic weights. These are the (only) appropriate kinds of weights to use for this problem. [Another type of weights, called frequency weights, are useful when many individuals in the sample have identical values for Y and all of the X's. In this case, the data set can be made much smaller by simply listing each unique set of (Y,X) values a single time, and then also recording the number of individuals in the sample who had that set of values in a new Stata variable. Let's call the new variable freq. In this case, regression can be performed by using frequency weights in this (condensed) data set. The regression command would look something like:

	regress y x [fweight=freq]

Note that these frequency weights must, by definition, be positive integers.]

. regress y x [aweight = wt]
(sum of wgt is   2.3230e-01)

  Source |       SS       df       MS                  Number of obs =      10
---------+------------------------------               F(  1,     8) =  124.63
   Model |  14721.8925     1  14721.8925               Prob > F      =  0.0000
Residual |   945.00803     8  118.126004               R-squared     =  0.9397
---------+------------------------------               Adj R-squared =  0.9321
   Total |  15666.9006     9  1740.76673               Root MSE      =  10.869

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   530.8354   47.55003     11.164   0.000       421.1849     640.486
   _cons |   148.4732   8.078651     18.378   0.000       129.8438    167.1026
------------------------------------------------------------------------------

When we compare this output to Table 4.2 in ALR, we see that the parameter estimates for b₀ and b₁ and their standard errors all agree with the values reported above. Where the differences occur is in the analysis of variance table. Although the F statistic is the same in ALR as in Stata, the sums of squares differ. Note particularly that the residual mean square is quite different. While Weisberg reports an estimated error variance of 2.744, the calculation above reports 118.126. The source of the difference is described in the Stata manual. Briefly put, Stata is estimating \sigma^{2}/W, where W denotes the average value of the weights. Stata reports the sum of the weights, so that the estimated value for \sigma^{2} can be obtained by the calculation

   (118.12) x [(2.3230e-01) / 10] = 2.744

While this is cumbersome, it can certainly be done. This emphasizes, however, that you must always take extra efforts to confirm that your statistical software is doing what you believe it ought to be doing.

Applying weights directly and using regression through the origin

Another approach is described on pages 82 and 83. This approach can be taken in Stata directly by calculating ``adjusted'' Y and X variables, which Weisberg calls Z and M. The adjusted values are obtained from the original values by multiplying by the square root of the weight. Note that the constant term is also adjusted here, so doing the ordinary least-squares fit for Z and M requires a regression through the origin.

. gen  w = sqrt(wt)

. gen yw = y*w

. gen xw = x*w

. regress yw xw w, noconstant

  Source |       SS       df       MS                  Number of obs =      10
---------+------------------------------               F(  2,     8) = 2303.24
   Model |    12640.58     2  6320.28999               Prob > F      =  0.0000
Residual |  21.9526539     8  2.74408174               R-squared     =  0.9983
---------+------------------------------               Adj R-squared =  0.9978
   Total |  12662.5326    10  1266.25326               Root MSE      =  1.6565

------------------------------------------------------------------------------
      yw |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      xw |   530.8354   47.55003     11.164   0.000       421.1849     640.486
       w |   148.4732   8.078651     18.378   0.000       129.8438    167.1026
------------------------------------------------------------------------------

At first glance, this output looks better in terms of reproducing Table 4.2. Not only are the regression coefficients and their standard errors correct, the residual variance estimate is now correct as well. Unfortunately, the F statistic is now enormous, and somehow R² has also changed. The new problems are a result of having to use noconstant in the regress command. In this case, Stata cannot tell that the column we have called w above represents the (adjusted) constant term, so it does not know that it should be omitted from both the Total and Model sums of squares. Of course, you could always construct the analysis of variance table ``by hand'' by running a series of regressions through the origin -- first with just w, and then the full model above. This would provide all of the necessary ingredients to construct the Anova table using a hand calculator to do the necessary subtractions and divisions.

A third solution

Since there is no direct method in Stata to obtain WLS estimates for the error variance --- at least in the situation appropriate to the physics data in Weisberg --- I have written a little Stata program, wls.ado, which will perform the necessary calculations. Here is how to use it. First, download a cpoy of wls.ado from the course website. You can find it in the ``Software'' section. Put a copy of this file into the same directory that you typically run Stata from. [If you are a Stata expert, you can put this file in your personal ado directory (not the official Stata ado directory). If this makes no sense to you, then you can ignore it.] To confirm that there is a copy of wls.ado in your working directory, ask Stata to check for you using the dir command as shown below. If the file is present, it will produce one or more lines of output that describe the file. In the example below, it shows that the file was last modified on 11/1/98 and is 3207 characters long. If the files is absent, Stata will report ``file not found.'' In this case, give the pwd command (print working directory) to determine the directory in which Stata is currently operating.

. dir wls.ado
:wls.ado
wls.ado
11/01/98 15:32     3207 wls.ado

Once you have determined that this .ado file is present, you can use the wls command. This is not a standard Stata command; it is one that I wrote for use by the class. It's syntax is identical to that of the regress command, except that it must specify a weighted regression, and the weights must be analytical weights. You can see that this command produces output that coincides with Table 4.2 in Weisberg.

. wls y x [aweight = wt]

WLS anova table uses renormalized weights (weight var is wt)

  Source |       SS       df       MS                  Number of obs =      10
---------+------------------------------               F(  1,     8) =  124.63
   Model |   341.99136     1   341.99136               Prob > F      =  0.0000
Residual |   21.952652     8   2.7440815               R-squared     =  0.9397
---------+------------------------------               Adj R-squared =  0.9321
   Total |   363.94401     9   40.438224               Root MSE      =   1.657

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   530.8354   47.55003     11.164   0.000       421.1849     640.486
   _cons |   148.4732   8.078651     18.378   0.000       129.8438    167.1026
------------------------------------------------------------------------------

Variance-weighted least squares: Another variation

In a sense, none of the calculations done above are really appropriate for the physics data. Why? The weighted least squares calculation is based on the assumption that the variance of the observations is unknown, but that the relative variances are known. Thus, only a single unknown parameter having to do with variance needs to be estimated. In the case of the physics data, however, the values in the last column of Table 4.1 are essentially known values. When the study was done at p=150 GeV/c, the standard deviation of the distribution from which the Y's were generated was very nearly 5 y-units. It is for this reason that Weisberg asserts that we may as well take sigma=1 without loss of generality at the top of page 86. So in the physics data, there are no unknown variances. We are still interested in the fit of the model (4.14), however, and we are also interested in the estimated coefficients. Stata has one more WLS command (this time, it is a real Stata command) to deal with this situation. The command is named vwls, for variance-weighted least squares. This command assumes that the standard deviations (NOT variances) corresponding to each observation are recorded in a column, in our case, esd. Here is the result when applied to the physics data.

. vwls y x, sd(esd)

Variance-weighted least-squares regression           Number of obs   =      10
Goodness-of-fit chi2(8)    =   21.95                 Model chi2(1)   =  341.99
Prob > chi2                =  0.0050                 Prob > chi2     =  0.0000
------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   530.8354   28.70466     18.493   0.000       474.5753    587.0955
   _cons |   148.4732    4.87686     30.444   0.000       138.9148    158.0317
------------------------------------------------------------------------------

A few things have changed from the wls output above. Most important, the standard errors associated with the regression coefficients have changed from those reported by all three WLS variants we examined earlier. In effect, this is because the individual variances are known here, and there is no common scale factor that needs to be estimated from the observed (residual) variation in the Y's. Notice that there are two overall chi-squared tests here. The ``Model chi-squared'' corresponds to the F tests that we performed in the earlier regressions. The difference --- and the reason that the chi-squared statistic is larger than the F value --- is that here we do not need to estimate the denominator of the F statistic. Indeed, this test statistic is exactly equal to the model mean square in the WLS model fitted using the wls command. The ``Goodness-of-fit'' chi-squared test that vwls produces is a test for the adequacy of the linear model (4.14). Note that it is equal to the error sum of squares in the wls output. The large value of this statistic (and the small p-value corresponding to it) strongly suggest that the linear model doesn't fit the data well. Weisberg describes this test for lack of fit in section 4.2. This can be confirmed visually by plotting the data (as in Weisberg's Figure 4.1).