Education 231C

Applied Categorical & Nonnormal Data Analysis

Intoduction to Discrete-Time Survival Analysis


As indicated in the previous unit, discrete-time survival analysis treats time, not as a continuous variable, but as being divided into discrete chunks or units. We will be able to analyze discrete time data using logistic or cloglog regression with indicator variables for each of the time periods. We will illustrate discrete-time survival analysis using the cancer.dta dataset.

Cancer Example

After reading in the dataset, we will describe the variables and list several variables for patient 5, 10 and 20.


use http://www.gseis.ucla.edu/courses/data/cancer
 
describe

Contains data from cancer.dta
  obs:            48                          Patient Survival in Drug Trial
 vars:             7                          2 Jan 1904 13:58
 size:         1,248 (99.1% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  
studytime       int    %8.0g                  Months to death or end of exp.
died            int    %8.0g                  1 if patient died
drug            float  %9.0g                  
age             int    %8.0g                  Patient's age at start of exp.
distime         float  %9.0g                  
censor          float  %9.0g                  
-------------------------------------------------------------------------------
 
tab distime 

    distime |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         11       22.92       22.92
          2 |         13       27.08       50.00
          3 |          6       12.50       62.50
          4 |          8       16.67       79.17
          5 |          4        8.33       87.50
          6 |          6       12.50      100.00
------------+-----------------------------------
      Total |         48      100.00
 
univar age
                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     age      48    55.88     5.66    47.00    50.50    56.00    60.00    67.00
-------------------------------------------------------------------------------
 
list distime drug age died censor if id==5

       distime       drug       age      died     censor
  5.         1          0        56         1          0
   
list distime drug age died censor if id==10

       distime       drug       age      died     censor
 10.         2          0        58         0          1
 
list distime drug age died censor if id==20

       distime       drug       age      died     censor
 20.         4          0        52         1          0

Patient 5 (56 years old, did not receive a drug treatment) was observed for one time period, died. So, the observation for that patient was not censored. Patient 10 (58, no drug) was observed for two time periods did not die, i.e., observation was censored. Finally, patient 20 (52, no drug) was observed for four time periods, died (not censored).

In this dataset there is one observation for each patient. In order to do discrete-time survival analysis we need to have as many observations as there are time periods for each patient. For patients that die we need a response variable that is zero until the last time period when it is coded one. For patients that don't die the response variable will be zero for every observation.

A collection of Stata commands written by Alexis Dinno (Harvard School of Public Health) will help with the analysis. Here is how we can obtain the commands.

net from http://www.doyenne.com/stata

net get dthaz

The command that we are interested in is prsnperd which creates the type of dataset that we want. prsnperd wants a variable that indicates whether the observation is censored or not which in our dataset is the variable censor. prsnperd creates the following variables: _period which is the time period, _Y which is the response variable and _d1 through _d6 which are the dummy coded time periods. Here is what it looks like.

prsnperd id distime censor
 
list id _period _Y if id==5

            id    _period         _Y
  5.         5          1          1
 
list id _period _Y if id==10

            id    _period         _Y
 11.        10          1          0
 12.        10          2          0
 
list id _period _Y if id==20

            id    _period         _Y
 35.        20          1          0
 36.        20          2          0
 37.        20          3          0
 38.        20          4          1

Now we can actually do the discrete-time survival analysis using the logit command. The logit command estimates a proportional odds discrete-time survival model. We will run logit with and without the cluster and nocons options. The nocons options is used so that the dummy variables for all of the time periods can be included.

logit _Y _d1-_d6, cluster(id) nocons 

Logit estimates                                   Number of obs   =        143
                                                  Wald chi2(6)    =      39.22
Log pseudo-likelihood =  -74.24676                Prob > chi2     =     0.0000

                               (standard errors adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _d1 |  -1.335001    .359167    -3.72   0.000    -2.038955   -.6310468
         _d2 |   -1.13498   .3872325    -2.93   0.003    -1.893942   -.3760181
         _d3 |  -1.609438   .5534596    -2.91   0.004    -2.694199   -.5246771
         _d4 |  -.9555114   .5318035    -1.80   0.072    -1.997827    .0868043
         _d5 |  -1.386294   .7989229    -1.74   0.083    -2.952155    .1795658
         _d6 |  -1.609438   1.106919    -1.45   0.146     -3.77896    .5600838
------------------------------------------------------------------------------
  
logit _Y _d1-_d6, nocons 

Logit estimates                                   Number of obs   =        143
                                                  LR chi2(6)      =          .
Log likelihood =  -74.24676                       Prob > chi2     =          .

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _d1 |  -1.335001   .3554076    -3.76   0.000    -2.031587   -.6384149
         _d2 |   -1.13498   .3831778    -2.96   0.003    -1.885995   -.3839652
         _d3 |  -1.609438   .5476933    -2.94   0.003    -2.682897   -.5359788
         _d4 |  -.9555114   .5262348    -1.82   0.069    -1.986913    .0758898
         _d5 |  -1.386294   .7905632    -1.75   0.080     -2.93577    .1631811
         _d6 |  -1.609438   1.095387    -1.47   0.142    -3.756356    .5374803
------------------------------------------------------------------------------

If we use the predict command we obtain the predicted probabilities for each of the time intervals. These probabilities are, in fact, the estimated hazard probabilities for each time interval.

predict p1

tablist _period p1

  +---------------------------+
  | _period         p1   Freq |
  |---------------------------|
  |       1   .2083333     48 |
  |       2   .2432432     37 |
  |       3   .1666667     24 |
  |       4   .2777778     18 |
  |       5         .2     10 |
  |       6   .1666667      6 |
  +---------------------------+

Now we add the covariate drug to the model. Drug is a binary indicator of whether the patient received chemotherapy or not.

logit _Y drug _d1-_d6, nocons 
 
Logit estimates                                   Number of obs   =        143
                                                  LR chi2(7)      =          .
Log likelihood = -61.192357                       Prob > chi2     =          .

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |  -2.553352   .5554143    -4.60   0.000    -3.641944    -1.46476
         _d1 |   -.305909   .4172929    -0.73   0.464    -1.123788    .5119701
         _d2 |    .246559   .5014684     0.49   0.623    -.7363011    1.229419
         _d3 |   .2249641   .7090498     0.32   0.751    -1.164748    1.614676
         _d4 |   1.259138   .7556111     1.67   0.096    -.2218328    2.740108
         _d5 |   1.167058   .9661703     1.21   0.227    -.7266011    3.060717
         _d6 |   .9439143   1.228204     0.77   0.442    -1.463321     3.35115
------------------------------------------------------------------------------
To get additional information we run the model with a constant.
logit _Y drug _d2-_d6
 
Logit estimates                                   Number of obs   =        143
                                                  LR chi2(6)      =      27.14
                                                  Prob > chi2     =     0.0001
Log likelihood = -61.192357                       Pseudo R2       =     0.1815

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |  -2.553352   .5554142    -4.60   0.000    -3.641944    -1.46476
         _d2 |    .552468    .606679     0.91   0.362     -.636601    1.741537
         _d3 |   .5308731   .7680236     0.69   0.489    -.9744256    2.036172
         _d4 |   1.565047   .7879091     1.99   0.047     .0207732     3.10932
         _d5 |   1.472967   .9836119     1.50   0.134     -.454877    3.400811
         _d6 |   1.249823   1.241971     1.01   0.314    -1.184395    3.684042
       _cons |   -.305909   .4172929    -0.73   0.464    -1.123788      .51197
------------------------------------------------------------------------------
 
fitstat

Measures of Fit for logit of _Y

Log-Lik Intercept Only:      -74.761     Log-Lik Full Model:          -61.192
D(136):                      122.385     LR(6):                        27.138
                                         Prob > LR:                     0.000
McFadden's R2:                 0.181     McFadden's Adj R2:             0.088
Maximum Likelihood R2:         0.173     Cragg & Uhler's R2:            0.267
McKelvey and Zavoina's R2:     0.271     Efron's R2:                    0.208
Variance of y*:                4.512     Variance of error:             3.290
Count R2:                      0.811     Adj Count R2:                  0.129
AIC:                           0.954     AIC*n:                       136.385
BIC:                        -552.562     BIC':                          2.639
Next we add in the second covariate, age, to the model.
logit _Y drug age _d1-_d6, nocons

Logit estimates                                   Number of obs   =        143
                                                  LR chi2(8)      =          .
Log likelihood =  -55.65503                       Prob > chi2     =          .

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |  -3.024052   .6347086    -4.76   0.000    -4.268058   -1.780046
         age |   .1607128    .051414     3.13   0.002     .0599433    .2614823
         _d1 |  -9.309867   2.922645    -3.19   0.001    -15.03815   -3.581589
         _d2 |  -8.335442   2.780394    -3.00   0.003    -13.78491   -2.885969
         _d3 |  -8.326742   2.823744    -2.95   0.003    -13.86118   -2.792306
         _d4 |  -7.071942   2.734906    -2.59   0.010    -12.43226   -1.711624
         _d5 |   -7.19799   2.811519    -2.56   0.010    -12.70847   -1.687513
         _d6 |  -7.622593   2.988678    -2.55   0.011    -13.48029   -1.764892
------------------------------------------------------------------------------
 
logit, or


Logit estimates                                   Number of obs   =        143
                                                  LR chi2(8)      =          .
Log likelihood =  -55.65503                       Prob > chi2     =          .

------------------------------------------------------------------------------
          _Y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .0486039   .0308493    -4.76   0.000      .014009    .1686304
         age |   1.174348   .0603779     3.13   0.002     1.061776    1.298854
         _d1 |   .0000905   .0002646    -3.19   0.001     2.94e-07    .0278315
         _d2 |   .0002399   .0006669    -3.00   0.003     1.03e-06    .0558007
         _d3 |    .000242   .0006832    -2.95   0.003     9.55e-07    .0612797
         _d4 |   .0008486   .0023208    -2.59   0.010     3.99e-06    .1805723
         _d5 |   .0007481   .0021033    -2.56   0.010     3.03e-06     .184979
         _d6 |   .0004893   .0014623    -2.55   0.011     1.40e-06    .1712052
------------------------------------------------------------------------------

Both drug and age are significant with the older patients more likely to die and those on drug therapy less likely. It is useful to look at the hazard function (and survival function) to ascertain the effects over time. The dthaz command (from Dinno) will produce a table with hazard and survival values for each time period. We will specify the function for drug=1 (drug therapy) and age=56 (the median age).

dthaz drug age, specify(1 56)


Discrete-Time Estimation of Conditional Hazard and Survival Probabilities
------------------------------------------------------------------------------
Time Parameterization: Fully Discrete

Additional predictors specified as:
drug = 1
age = 56


-----------------------------------------
   Period      p(Hazard)   p(Survival)
   (T_j)       ^H(T_j)     ^S(T_j)
-----------------------------------------
     0            --        1
     1          0.0344      0.9656
     2          0.0863      0.8822
     3          0.0870      0.8055
     4          0.2505      0.6037
     5          0.2276      0.4663
     6          0.1616      0.3910
-----------------------------------------
Logit Link (assumes proportional odds)

Notice that the hazard maxes out at time period four and then declines. Now if we do the same for drug=0, we see a completely different pattern with much greater hazards.

dthaz drug age, specify(0 56)
 
Discrete-Time Estimation of Conditional Hazard and Survival Probabilities
------------------------------------------------------------------------------
Time Parameterization: Fully Discrete

Additional predictors specified as:
drug = 0
age = 56


-----------------------------------------
   Period      p(Hazard)   p(Survival)
   (T_j)       ^H(T_j)     ^S(T_j)
-----------------------------------------
     0            --        1
     1          0.4231      0.5769
     2          0.6603      0.1960
     3          0.6622      0.0662
     4          0.8730      0.0084
     5          0.8584      0.0012
     6          0.7986      0.0002
-----------------------------------------

Let's check the fit of this model. To do this we will rerun the model with a constant and then run the fitstat command. Notice that we have to drop one of the time dummies if we include the constant. Models without the constant but with all of the dummies and models with the constant dropping one dummy are equivalent they are merely different parameterizations of the same model.

logit _Y drug age _d1- _d5

Logit estimates                                   Number of obs   =        143
                                                  LR chi2(7)      =      38.21
                                                  Prob > chi2     =     0.0000
Log likelihood =  -55.65503                       Pseudo R2       =     0.2556

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |  -3.024052   .6346992    -4.76   0.000     -4.26804   -1.780064
         age |   .1607128   .0514134     3.13   0.002     .0599444    .2614813
         _d1 |  -1.687274   1.312815    -1.29   0.199    -4.260344    .8857956
         _d2 |  -.7128491   1.273976    -0.56   0.576    -3.209797    1.784099
         _d3 |  -.7041496   1.317687    -0.53   0.593    -3.286768    1.878469
         _d4 |   .5506505   1.296978     0.42   0.671    -1.991381    3.092682
         _d5 |   .4246033   1.409744     0.30   0.763    -2.338444    3.187651
       _cons |  -7.622593   2.988652    -2.55   0.011    -13.48024   -1.764942
------------------------------------------------------------------------------
 
fitstat

Measures of Fit for logit of _Y

Log-Lik Intercept Only:      -74.761     Log-Lik Full Model:          -55.655
D(135):                      111.310     LR(7):                        38.213
                                         Prob > LR:                     0.000
McFadden's R2:                 0.256     McFadden's Adj R2:             0.149
Maximum Likelihood R2:         0.234     Cragg & Uhler's R2:            0.362
McKelvey and Zavoina's R2:     0.410     Efron's R2:                    0.271
Variance of y*:                5.579     Variance of error:             3.290
Count R2:                      0.818     Adj Count R2:                  0.161
AIC:                           0.890     AIC*n:                       127.310
BIC:                        -558.674     BIC':                         -3.473

We can then compare the fit with a model that treats time as a linear variable.

logit _Y drug age _period

Logit estimates                                   Number of obs   =        143
                                                  LR chi2(3)      =      35.96
                                                  Prob > chi2     =     0.0000
Log likelihood = -56.782549                       Pseudo R2       =     0.2405

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |  -2.903144   .6025565    -4.82   0.000    -4.084133   -1.722154
         age |   .1480103   .0495297     2.99   0.003     .0509339    .2450867
     _period |   .4508214   .1868341     2.41   0.016     .0846332    .8170096
       _cons |  -8.849313    2.82147    -3.14   0.002    -14.37929   -3.319333
------------------------------------------------------------------------------
 
fitstat

Measures of Fit for logit of _Y

Log-Lik Intercept Only:      -74.761     Log-Lik Full Model:          -56.783
D(139):                      113.565     LR(3):                        35.958
                                         Prob > LR:                     0.000
McFadden's R2:                 0.240     McFadden's Adj R2:             0.187
Maximum Likelihood R2:         0.222     Cragg & Uhler's R2:            0.343
McKelvey and Zavoina's R2:     0.378     Efron's R2:                    0.262
Variance of y*:                5.290     Variance of error:             3.290
Count R2:                      0.811     Adj Count R2:                  0.129
AIC:                           0.850     AIC*n:                       121.565
BIC:                        -576.270     BIC':                        -21.069

Based on the deviance (111.31 vs 113.565) and BIC (-558.674 vs -576.27) it appears that the model using indicator variables for time fits slightly better, although it uses four more degrees of freedom then the model with linear time.

Using a model that includes both _period and t-2 dummy coded time variables indicates that the dummy coded time does not account for significantly more variability than using linear time alone.

logit _Y drug age _d1-_d4 _period 

Logit estimates                                   Number of obs   =        143
                                                  LR chi2(7)      =      38.21
                                                  Prob > chi2     =     0.0000
Log likelihood =  -55.65503                       Pseudo R2       =     0.2556

------------------------------------------------------------------------------
          _Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |  -3.024052   .6346992    -4.76   0.000     -4.26804   -1.780064
         age |   .1607128   .0514134     3.13   0.002     .0599444    .2614813
         _d1 |  -3.810291   6.209824    -0.61   0.539    -15.98132     8.36074
         _d2 |  -2.411262   4.800772    -0.50   0.615     -11.8206    6.998079
         _d3 |  -1.977959   3.441179    -0.57   0.565    -8.722546    4.766627
         _d4 |  -.2985561   2.099693    -0.14   0.887    -4.413879    3.816767
     _period |  -.4246033   1.409744    -0.30   0.763    -3.187651    2.338444
       _cons |  -5.074973   7.898617    -0.64   0.521    -20.55598    10.40603
------------------------------------------------------------------------------
 
test _d1 _d2 _d3 _d4

 ( 1)  _d1 = 0
 ( 2)  _d2 = 0
 ( 3)  _d3 = 0
 ( 4)  _d4 = 0
 
           chi2(  4) =    2.15
         Prob > chi2 =    0.7087
A Proportional Hazards Model

A discrete-time proportional hazards model can be estimated using the cloglog command.

There is also a program called pgmhaz (findit pgmhaz) that esitmates two different discrete time proportional hazards models, one of which incorporates a gamma mixture distribution to summarize unobserved individual heterogeneity (or "frailty").


Categorical Data Analysis Course

Phil Ender