Modeling the dependence of a discontinuous “age indicator” on age

We write a single “age indicator” as C, for “character.”  The character can take any of 1,..,I values, each represented as ci.  For example, in the Suchey-Brooks pubic phase system the character states are c1, c2, c3, c4, c5, and c6.  Our first task, then, is to find P(C(A)=ci|A=a), which is the probability that an individual would be in the ith state of the character conditional on their age being exactly A=a years.  We use a probit regression to model the dependence of the “indicator” on age.  In the statistical literature the probit (or the comparable logit) when applied to ordinal dependent data, is generally referred to as an “ordinal probit model” (Johnson and Albert 1999b), an “ordered probit model” (Powers and Xie 2000), or a “cumulative probit model” (Long 1997).  This model can be written as:

 

 

                                                     

 

where  is the standard normal integral,  is a set of intercepts with  equal to negative infinity and  equal to positive infinity, and  is a slope.  The method of maximum likelihood can be used on a reference sample in order to estimate the parameters  and .  In addition to proprietary software, there are a number of freely available programs that can be used to fit the ordinal probit model (tda, which is available from ftp://ftp.stat.ruhr-uni-bochum.de/pub/tda/,nkotp from ftp://k7moa.gsia.cmu.edu/nkotp.zip and see http://k7moa.gsia.cmu.edu/nkotp.htm for documentation, and  MIXOR from http://www.uic.edu/~hedeker/mix.html).  Following Boldsen and co-worker's terminology, the ordinal probit model can also be written in terms of the mean “age to transitions” and the standard deviation of the transition distributions, so that equation becomes:

 

                                             

where is the normal probability density function at variate value a with mean m and standard deviation s.

 

The representation of the ordinal probit in equation makes clear one particular problem, which is that the integration across age starts at negative infinity.  There are two ways to circumvent this problem.  The first, and simplest way, is to measure age in a logarithmic scale, or equivalently to let the “age to transition” distributions be log-normal rather than normal.  The second way is to allow the standard deviations for transitions to vary by phase, so that equation becomes:

 

                                         

 

In our experience, the standard deviations for transitions between early phases are generally so small relative to the means, that the integrals from negative infinity to zero (across age) are negligible.  Equation does, however, produce a different problem, which is that some of the calculated probabilities for being in a particular phase at a particular age can be negative.  To circumvent this problem, we set any calculated probabilities to zero and renormalize (i.e., divide the probability of being in a particular phase at a particular age by the sum of the probabilities of being in each phase at that age). 

 

            Both models that we have discussed (the log-normal model and the separate standard deviation model) are supported in CatReg (U.S. EPA 2000), a collection S+ routines available from http://www.epa.gov/ncea/catreg.htm and http://www.stat.uiuc.edu/~simpson/papers.html.  In CatReg the separate standard deviation model is referred to as an “unrestricted cumulative” probit, while in Hedeker et al.’s (1999) terminology this is referred to as a “thresholds of change model.”  In general, we have not been able to fit the separate standard deviation model in CatReg, or in MIXOR, because the occurrence of negative probabilities early in the optimization of the total log-likelihood causes severe numerical problems.  We have written our own software which uses a tensor method (Chow et al. 1994) for maximizing the log-likelihood, and which replaces negative probabilities with positive values near zero.  In our experience this method has almost always converged properly, and yields answers identical to the “unconstrained cumulative” probit in CatReg when that program has converged properly.

 

Documentation for “NPHASES2”

 

            NPHASES2 is written in FORTRAN 77 and was compiled using the GNU compiler g77 (in Windows ’98).  When you obtain the executable it also comes with the source code (“nphases2.for”) in case you want to modify and re-compile, and with a test data set called “stewart.dat” (these are the ages and pubic phases in the Todd system for 358 individuals from the McKern and Stewart Korean War Dead study).  To run nphases2.exe click on its icon or use a DOS window to run it.  You will be prompted for a file name, and it is simplest if the file is in the same directory as the executable.  On that note, I strongly recommend that you do not run the program “from the desktop,” as most people who do that have little chance of finding the path to their data files.  The stuff below is what you will see on your screen, where the blue font represents your answers.  Output is appended to a file called “junk.out.”

 

  Name of input file:

stewart.dat

 I read  358 records.  If that is wrong, bail!

PAUSE  statement executed

To resume execution type go.  Other input will terminate the job.

go

 This system has 10 phases

  Model to fit?

   ("C" or "c" will fit cumulative probit,

    "L" or "l" will fit log age cumulative probit,

    any other single character will fit

    multiple s.d. model):

c

  Bottom truncation age?:

-1000

  Top truncation age?:

1000

  nopar =  10

 INITIAL FUNCTION VALUE F= 0.7783672510073E+03

 INITIAL GRADIENT G= 0.1067040628979E+03-0.7033229053722E+02-0.2452956146482E+02-0.2560136423968E+01-0.5417676633931E+01-0.7138809903946E+00 0.2453813637782E+02-0.2343716163692E+02-0.7530402087538E+01 0.3092735961612E+02

 -----------    ITERATION    1 ----------------

 X= 0.1952383859257E+02 0.2027291494099E+02 0.2146744689456E+02 0.2400697040114E+02 0.2668122749510E+02 0.2896725086346E+02 0.3028639393876E+02 0.3151697174511E+02 0.3471555599626E+02 0.4528791742785E+01

 F(X)= 0.7516959811154E+03

 G(X)= 0.9671368859902E+02-0.6395290800352E+02-0.2220154557146E+02-0.2262817546794E+01-0.4858710084908E+01-0.6406097763174E+00 0.2229295836350E+02-0.2117806466688E+02-0.6751974075181E+01 0.2646326869703E+02

 -----------    ITERATION    2 ----------------

 X= 0.1829815952978E+02 0.1969586953910E+02 0.2161373582133E+02 0.2457896447418E+02 0.2767576452350E+02 0.2964025813414E+02 0.3072021865315E+02 0.3283829437895E+02 0.3750775641667E+02 0.2693270850434E+01

 F(X)= 0.6990714429314E+03

 G(X)= 0.3386741265048E+02-0.3631685556265E+02-0.7546418979795E+01 0.2871656402900E+01 0.1462232795863E+01 0.3609745638392E+00 0.1306661289861E+02-0.3832301237806E+01 0.6144807827946E+00-0.9688545789301E+02

 

.

.

.

 -----------    ITERATION    7 ----------------

 X= 0.1561109877024E+02 0.2013905856279E+02 0.2256982849924E+02 0.2513979706418E+02 0.2784830905376E+02 0.2957148915937E+02 0.3054926553731E+02 0.3432316275048E+02 0.3982809918431E+02 0.4500524337221E+01

 F(X)= 0.6081384792419E+03

 G(X)=-0.7599598613541E-05-0.2606296839793E-04 0.1784020950171E-04 0.3203291321853E-04-0.7229352203519E-05-0.4486042227881E-04 0.1005002827707E-03 0.1179400417934E-03-0.2346896592650E-04-0.2859763366388E-03

 OPTSTP    RELATIVE GRADIENT CLOSE TO ZERO.

 OPTSTP    CURRENT ITERATE IS PROBABLY SOLUTION.

 FINAL X= 0.1561108853982E+02 0.2013905471715E+02 0.2256982376472E+02 0.2513979048805E+02 0.2784830141980E+02 0.2957147731599E+02 0.3054924678061E+02 0.3432313361715E+02 0.3982811697564E+02 0.4500534490873E+01

 GRADIENT G=-0.4605820359870E-06 0.3570270325310E-06-0.4778628547525E-06-0.1430041142758E-06 0.0000000000000E+00-0.2431460175903E-06 0.1176819021968E-06-0.1047425771762E-06-0.9026521324173E-07-0.1597629561161E-05

 FUNCTION VALUE F(X)= 0.6081384792382E+03 AT ITERATION    8

  End parms =   15.6110885  20.1390547  22.5698238  25.1397905  27.8483014

  29.5714773  30.5492468  34.3231336  39.828117  4.50053449  0.

  1.30303408E+270  1.07266771E+270  1.07195056E+270  1.30301843E+270

  1.30304261E+270  1.05658906E+270  8.48798324E-314  3.33928504E-289

  4.52090354E-308

 0

  0.531710688

  0.369493084

  0.36217331

  0.390503705

  0.431154948

  0.462691614

  0.485789855

  0.615553357

  1.04282831

  0.270274789

 

input/output

 

            The input format is free-field, with a column of ages and a column of phases, like so (this taken from the “stewart.dat”) file:

 

18.5681    1            

21.03765   3            

20.82409   4            

20.26831   3            

21.27858   4            

.

.

.

31.89596   8            

39.77276   9            

36.50103   9            

28.41068   8            

34.26694   6            

30.4011    5            

 

Note that there should not be any “tabs.”  The only allowed characters are numerals and spaces, and a decimal for the age.  The last line of data should be the last line (don’t put any blank lines below).

 

            The output in “junk.out” looks like the following:

 

 Data read from file stewart.dat with  358 records

 This system has 10 phases

 Model fit with option "c"

 Following model was fit with truncation at ages

   -1000.0 and     1000.0

       15.611089        0.531711

       20.139055        0.369493

       22.569824        0.362173

       25.139790        0.390504

       27.848301        0.431155

       29.571477        0.462692

       30.549247        0.485790

       34.323134        0.615553

       39.828117        1.042828

        4.500534        0.270275

 

     -608.138479

 

 Data read from file stewart.dat with  358 records

 This system has 10 phases

 Model fit with option "l"

 Following model was fit with truncation at ages

   -1000.0 and     1000.0

        2.842642        0.018351

        3.008901        0.013086

        3.100221        0.013039

        3.198777        0.014059

        3.301085        0.015242

        3.364243        0.016143

        3.399392        0.016842

        3.531747        0.020985

        3.716118        0.034767

        0.155982        0.008928

 

     -589.737731

 

 Data read from file stewart.dat with  358 records

 This system has 10 phases

 Model fit with option "s"

 Following model was fit with truncation at ages

   -1000.0 and     1000.0

       17.747852        0.422595

       20.419633        0.256458

       22.107561        0.246049

       24.086645        0.314911

       27.087691        0.440840

       28.962254        0.516070

       30.603027        0.705234

       36.380301        1.328248

       47.220229        4.591539

        2.249520        0.452320

        3.023636        0.337138

        2.602468        0.246692

        2.919999        0.265688

        4.169001        0.366412

        4.660805        0.456106

        6.159124        0.784415

        7.318319        1.163636

        9.236322        2.713909

 

     -578.792138

 

      What you see above are three runs, one with “c” (for cumulative probit), one with “l” (for cumulative probit on log age)      and one with the separate standard deviation model.  The left column contains estimates and the right contains standard errors of those estimates.  The last (negative) number is the log-likelihood of the model.  In the “c” model the average age for moving from a Todd I to a Todd II is 15.61 years, for Todd II to Todd III is 20.14,…, and for Todd IX to Todd X is 39.83, and all of these transitions have a standard deviation of 4.5 years.  In the “l” model the interpretation is the same, but on a natural log scale.  In the final model the first transition has an average of 17.75 years with a standard deviation of 2.25 years, while the last transition has an average of 47.22 years with a standard deviation of 9.24 years.  Finally, what are the truncations about?  They are an old feature I originally put in the program that I suggest you don’t use (so use –1000 and 1000 as the truncations).  If you are using the log age model you must use a positive number for the lower truncation point, because the log of a negative number is undefined.  I often use something like 0.0001, which on a natural log scale is about –9.2.

 

How to get “NPHASES2” up and running

 

1)      Download the file from http://konig.la.utk.edu/nphases.exe

 

2)      Put the file in a directory and “run it” (either double click or “run” in a DOS Window).

 

3)      The file is a self-extracting file, so you will get the executable, plus other “stuff.”

 

One “Bug”

 

            The program is written so that the amount of user input is quite minimal.  Consequently, the program picks reasonable starting values, so that you don’t have to.  In an earlier version some of these starting values were far enough away from reality that the optimization did not converge properly.  If you get results that look ridiculous (e.g., parameters with enormous standard errors) or mean age-to-transitions of, oh say, 600 years, let me know.  I’m still trying to fix this part.  E-mail to: lylek@utk.edu.

 

References

Chow T, Eskow E and Schnabel R (1994) Algorithm 739: A software package for unconstrained optimization using tensor methods.  ACM Transactions on Mathematical Software, 20, 518-530.

 

Hedeker D, Mermelstein RJ and Weeks KA (1999) The thresholds of change model: an approach to analyzing stages of change data.  Annals of Behavioral Medicine, 21, 61-70.

 

Johnson VE and Albert JH (1999) Ordinal Data Modeling.  New York, NY: Springer-Verlag.

 

Long JS (1997) Regression Models for Categorical and Limited Dependent Variables.  Thousand Oaks, CA: Sage Publications.

 

Powers DA and Xie Y (2000) Statistical Methods for Categorical Data Analysis.  New York, NY: Academic Press.