Univariable Survival Analysis

  1. Individualized Risk Prediction: The code calculates individualized risk predictions using survival analysis, specifically the Cox proportional hazards model.

  2. Semi-Parametric Approach: The Cox model is considered a semi-parametric model. This is because it makes no assumptions about the form of the baseline hazard function (hence “non-parametric” for the baseline hazard) but does assume a parametric form for the effect of predictors. In mathematical terms, the hazard function is given by:

    \(h(t | X) = h_0(t) \exp(\beta X)\)

    where \(h_0(t)\) is the baseline hazard function, \(\beta\) is the vector of coefficients, and \(X\) is the vector of covariates.

  3. Subgroups Considered: The survival analysis is performed for specific subgroups, identified by the variable ridreth3. These include Mexican, Hispanic, White, Black, Asian, and Other.

  4. Matrices Explanation:

    • b: Contains the coefficients \(\beta\) of the model.

    • V: Contains the variance-covariance matrix, used for calculating standard errors.

    • SV: A scenario vector, used to calculate the risk score for a specific scenario.

    • risk_score: The risk score calculated as \(\text{SV} \times \beta\).

    • var_prediction: Variance of the risk score prediction.

    • se_prediction: Standard error of the risk score prediction.

  5. Graphs:

    • The first graph illustrates the survival rates for different subgroups.

    • The second graph depicts survival rates for a specific scenario defined in the matrix SV. For the mathematical details, look here.


Am I not understood?—Have I not been understood?—“Certainly not, sir?”—Well, let us go back to the beginning!

  1. Preparing the Environment and Loading the Data

The first part of the code defines global macros and loads the necessary data for analysis.

global data https://github.com/muzaale/ikesa/raw/main/nhanes.dta
global subgroup ridreth3
global subgroupvar: var lab ridreth3
cls
use $data, clear
di "obs: `c(N)', vars: `c(k)'"
gen years = permth_exm / 12
stset years, fail(mortstat)
  • Line 1-3: Definition of global variables for data URL, subgroup identifier (ridreth3), and subgroup variable label.

  • Line 4: Clears the screen (cls).

  • Line 5: Loads the dataset (use $data, clear).

  • Line 6: Displays the number of observations and variables in the dataset.

  • Line 7: Generates a variable for years by dividing a given variable by 12.

  • Line 8: Sets the data as survival-time data with years as the time variable and mortstat as the failure indicator.

  1. Plotting the Survival Graph by Subgroup

This section visualizes the survival rates for different subgroups by plotting the Kaplan-Meier survival estimates.

#delimit ;
sts graph if inlist($subgroup,1,2,3,4,6,7),
    by($subgroup)
    fail
    ti("Mortality in NHANES III",pos(11))
    subti("by self report: ${subgroupvar}",pos(11))
    yti("%",orientation(horizontal))
    xti("Years")
    per(100)
    ylab(0(5)20,
        format(%3.0f)
        angle(360)
    )
    legend(on
        lab(1 "Mexican")
        lab(2 "Hispanic")
        lab(3 "White")
        lab(4 "Black")
        lab(5 "Asian")
        lab(6 "Other")
        ring(0)
        pos(11)
        col(1)
        order(3 4 1 2 5)
    )
    note("Source: RDC/NCHS/CDC/DHHS")  
;
#delimit cr
cd "~/dropbox/1f.ἡἔρις,κ/1.ontology/alpha"
graph export nhanes.png, replace

nonparametric mortality risk:

#

Here, a mortality graph (i.e., 1 - survival-time graph; sts graph, failure) is drawn for specific subgroups and the legend ordered so that the highest-risk group (3 - white) appears first, the second highest-risk (4 - black) appears second, etc. Various formatting options are applied to enhance readability, and the graph is exported as an image.

  1. Fitting the Cox Proportional Hazards Model

This part fits the Cox proportional hazards model to the data for the selected subgroups (which one?) and extracts important matrices such as coefficients and the variance-covariance matrix.

stcox i.$subgroup if inlist(${subgroup}, 1, 2, 3, 4, 6, 7), basesurv(s0)
matrix define m = r(table)
matrix b = e(b)
matrix V = e(V)
matrix SV = (0, 0, 0, 0, 1, 0)
matrix risk_score = SV * b'
di exp(risk_score[1,1])
matrix var_prediction = SV * V * SV'
matrix se_prediction = sqrt(var_prediction[1,1])
  • stcox: Fitting the Cox model.

  • matrix define: Defining matrices for coefficients (b), variance-covariance (V), and others.

  • di exp(risk_score[1,1]): Displays the exponential of the risk score for interpretation.

  1. Calculating and Plotting Specific Scenario

This section calculates the risk for a specific scenario defined in the matrix SV, and plots the survival rates.


semiparametric mortality risk:


gen f0 = (1 - s0) * 100
gen f1 = f0 * exp(risk_score[1,1])
drop if _t > 10
line f1 _t, sort connect(step step) ylab(0(5)20) xlab(0(2)10)
graph export nhanes_5.png, replace
  • gen: Generates new variables for the survival function.

  • s0: Nonparametric (like Kaplan-Meier) baseline survival function (the base-case scenario in which all covariates are 0). Ideally this scenario should be of the typical of “average” person in the population and is achieved by centering the covariates at their means (binary variables take the value greater than 0.5)

  • drop if _t > 10: Excludes observations where time exceeds 10 since the race variable was only collected in the last 12 years.

  • line: Plots the survival rates. The sort option sorts the data by time, and the connect option connects the data points with a step function (Kaplan-Meier function).

  • graph export: Exports the plotted graph seen above as an image.

Final Remarks

The entire code process takes a semi-parametric approach to calculate and visualize individualized risk predictions using survival analysis. Subgroup analysis is also performed based on the variable ridreth3, considering different ethnicities. Two different graphs illustrate the survival rates both generally and for a specific scenario.

Images depicting nonparametric and semiparametric mortality risk are also included in the documentation.

By following the steps detailed above, the code achieves the goal of analyzing 10-year risk with potential adaptability to assess 30-year risk in the future. For more practice in a multivariable survival analysis setting, see here.