1. Replicated Table 1 from our healthy nondonor study:

    • N = 70,000

    • 60% of the population is healthy

    • 30% is obese

  2. Included only variables from the following datasets (merged on SEQN; a sketch appears in the code below):

    • Demo

    • Diet

    • Exam

    • Lab

    • Questionnaire

    • Restricted Data

  3. How this has been done and next steps

install.packages("tableone")
library(tableone)
install.packages("tidyverse")
library(tidyverse)
install.packages("NHANES")
library(NHANES)
install.packages("dplyr")
library(dplyr)
install.packages("nhanesA")
library(nhanesA)

# Creating Dataset With All Years: nhanes() takes one table name at a time,
# so fetch each demographics cycle and bind the rows
demo_cycles <- c("DEMO", "DEMO_B", "DEMO_C", "DEMO_D", "DEMO_E",
                 "DEMO_F", "DEMO_G", "DEMO_H", "DEMO_I", "DEMO_J", "DEMO_P")
demo <- bind_rows(lapply(demo_cycles, nhanes))

# Categorical Data
demo %>% count(RIAGENDR)
catVars <- c("DMDEDUC2", "RIAGENDR", "RIDRETH1")

# Restrict to adults, then build Table 1
demo <- subset(demo, RIDAGEYR > 17)
tab <- CreateTableOne(data = demo, factorVars = catVars)

# Categorical and Continuous Outputs
print(tab$CatTable, showAllLevels=TRUE)
print(tab$ContTable, nonnormal=TRUE)
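
The summary below relies on a merged frame, all_years_data, that combines variables from the listed components. Here is a minimal sketch of how to build it for a single cycle, joining one representative table per component on the respondent key SEQN (table names follow the DEMO_P convention used on this page and may differ by cycle or nhanesA version; the restricted data are not available through nhanesA):

# One representative table per component, all keyed by SEQN
components <- list(
  demo  = nhanes("DEMO_P"),   # demographics
  diet  = nhanes("DR1TOT_P"), # dietary interview, day 1 totals
  exam  = nhanes("BMX_P"),    # body measures
  lab   = nhanes("GHB_P"),    # glycohemoglobin (HbA1c)
  quest = nhanes("DIQ_P")     # diabetes questionnaire
)

# Keep every demographics respondent; repeat per cycle and bind_rows()
# the results to cover all years
all_years_data <- reduce(components, left_join, by = "SEQN")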
# Medians and percentages across the merged data
summary_table <- all_years_data %>% summarize(
  MedianAge = median(Age, na.rm = TRUE),
  MedianSBP = median(BPSysAve, na.rm = TRUE),
  MedianDBP = median(BPDiaAve, na.rm = TRUE),
  MedianCreatinine = median(LBDSCRSI, na.rm = TRUE),
  MedianBMI = median(BMI, na.rm = TRUE),
  MedianHbA1c = median(HbA1c, na.rm = TRUE),
  MedianuACR = median(uACR, na.rm = TRUE),
  MedianPulse = median(Pulse, na.rm = TRUE), # was mislabeled MedianGlucose
  FemalePercentage = mean(Sex == "Female", na.rm = TRUE),
  HypertensionPercentage = mean(Hypertension == "Yes", na.rm = TRUE),
  SmokePercentage = mean(SmokingStatus == "Current", na.rm = TRUE)
)
print(summary_table)

# table() does not collapse to a single value inside summarize(), so
# tabulate the categorical distributions separately
all_years_data %>% count(RaceEthnicity) %>% mutate(Percentage = n / sum(n))
all_years_data %>% count(Education) %>% mutate(Percentage = n / sum(n))

# Cohort Percentage One-by-One
demo_p <- subset(nhanes("DEMO_P"), RIDAGEYR > 17) # avoid shadowing nhanesA::nhanes()
divisor <- nrow(demo_p)
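
With the divisor in hand, each cohort percentage is a simple count over the number of adults. A small example (whether the raw code 2 or the label "Female" applies depends on the nhanesA translation settings):

# Share of women among adults in this cycle
female_pct <- sum(demo_p$RIAGENDR == 2, na.rm = TRUE) / divisor
female_pct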

  4. Before the next steps:

We are working on an extensive data analysis project in R, and collaboration and openness are key principles in our work. Here’s a roadmap for the next steps in the analysis, guided by the principles of the Fena philosophy:

  • Data Integrity and Cleaning:

    • Check for missing data, outliers, and inconsistencies in the data (a sketch follows this list).

    • Utilize exploratory data analysis to identify underlying patterns and potential biases.

    • Validate the data by cross-referencing with known values or an external data source if available.

  • Integration with Other Relevant Datasets:

    • Include additional variables and datasets if they align with the objectives of your research.

    • Perform the necessary merging and data transformations to align various datasets for analysis.

    For this project in particular, we’ll need to integrate the data from the living donor study with the NHANES data.

  • Analysis Targeting Specific Objectives:

    • Conduct detailed statistical analysis to understand the distributions of, and relationships between, the healthy and obese populations.

    • Utilize multivariate analysis, modeling (e.g., regression models), and machine learning techniques if necessary to predict or uncover deeper insights.

  • Visualization:

    • Create compelling visualizations to represent findings.

    • Utilize graphs, charts, and maps as needed to convey complex information in an accessible way (a small example follows this list).

  • Community Collaboration and Peer Review:

    • Engage with other collaborators and the wider community for input and feedback. So far we’ve engaged with:

      • Abi Muzaale

      • Andrew Aarking

    • Embrace peer review internally and externally to validate findings.

  • Dissemination and Knowledge Sharing:

    • Document the process and findings in a comprehensive report or scientific paper (Fena is the environment that will facilitate this).

    • Share findings and insights through appropriate channels, including publication, conferences, and the Fena platform.

    • Develop web apps or tools to enable others to access and utilize the findings.

  • Ethics and Compliance:

    • Ensure all data handling, analysis, and sharing comply with ethical guidelines, legal regulations, and best practices in the field of medical research.

    • Thus far we’ve only utilized publicly available data (nondonors), but we’ll need to be mindful of this as we move forward.

    • Next we’ll use simulated data from the living donor study to test our models (scroll down for some code!).

    • But ultimately we’ll need to rope in the living donor data itself (after addressing the ethical and legal considerations).

  • Ongoing Learning and Development:

    • Encourage and provide opportunities for team members to continue learning new tools and methodologies.

    • Reflect on what went well and what could be improved for future research endeavors.
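
As a first pass at the data-integrity step above, here is a minimal missingness and range check, assuming the all_years_data frame built earlier (the BMI thresholds are illustrative, not clinical cutoffs):

# Percent missing per variable, worst first
all_years_data %>%
  summarize(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  arrange(desc(pct_missing))

# Flag implausible body mass index values for manual review
all_years_data %>%
  filter(BMXBMI < 10 | BMXBMI > 80) %>%
  select(SEQN, BMXBMI)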
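
And for the visualization step, a small ggplot2 sketch of the BMI distribution with the conventional obesity cutoff marked (again assuming all_years_data):

# BMI distribution among adults, obesity cutoff at 30 kg/m^2
ggplot(all_years_data, aes(x = BMXBMI)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) +
  geom_vline(xintercept = 30, linetype = "dashed") +
  labs(x = "Body mass index (kg/m^2)", y = "Respondents")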

By continuing to foster the sense of collective engagement and shared responsibility that Fena represents, we’ll be well-positioned to navigate the complexities of this research. The focus on collaboration, transparency, and open access to knowledge is aligned with modern scientific values, and the use of tools like R, Python, and AI will enhance the quality and reach of our work. Let’s keep steering in the same direction, and we’ll undoubtedly contribute valuable insights to our living donor project.

See below for Python code to simulate data from the living donor study:

import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal

# Constants
N = 1000

# Means and covariance matrix for continuous variables (age, SBP, SCr, BMI, HbA1c)
mean_cont = [40, 124, 1, 27, 6]
cov_matrix = [
    [25, 5, 0.01, 2, 0.1],
    [5, 121, 0.02, 4, 0.2],
    [0.01, 0.02, 0.0004, 0.01, 0.001],
    [2, 4, 0.01, 25, 0.2],
    [0.1, 0.2, 0.001, 0.2, 0.64]
]
cont_vars = multivariate_normal.rvs(mean=mean_cont, cov=cov_matrix, size=N)

# Simulating categorical variables (Race, Education) and binary variables (Diabetes, Hypertension, Smoke, Male)
race = np.random.choice([0, 1, 2, 3, 4], N, p=[0.37, 0.23, 0.23, 0.13, 0.04])
education = np.random.choice([0, 1, 2, 3], N, p=[0.16, 0.42, 0.22, 0.20])
diabetes = np.random.choice([0, 1], N, p=[0.88, 0.12])
hypertension = np.random.choice([0, 1], N, p=[0.69, 0.31])
smoke = np.random.choice([0, 1], N, p=[0.43, 0.57])
male = np.random.choice([0, 1], N, p=[0.5, 0.5]) # Assuming a 50-50 split

# Crude additive risk score standing in for a hazard function; SCr and BMI
# are passed where uACR and eGFR would go, since those were not simulated
def hazard_function(x):
    age, race, male, diabetes, hypertension, scr, bmi, sbp, smoke = x
    hr = (0.5*age + [1, 3.2, 4, 0.7, 1.1][race] + 1.2*male + 5.2*diabetes
          + 1.0*hypertension + 4.0*scr + 2.7*bmi + 2.3*sbp + 1.8*smoke)
    return hr

# Simulating time to event (kidney failure): exponential draws whose mean
# shrinks as the risk score grows; status = 1 if failure occurs within 30 years
time_to_failure = np.zeros(N)
status = np.zeros(N)
for i in range(N):
    x = (cont_vars[i, 0], race[i], male[i], diabetes[i], hypertension[i], cont_vars[i, 2], cont_vars[i, 3], cont_vars[i, 1], smoke[i])
    hr = hazard_function(x)
    time_to_failure[i] = np.random.exponential(30/hr)
    status[i] = time_to_failure[i] < 30

# Combine all variables into DataFrame
data = np.column_stack([cont_vars, diabetes, hypertension, smoke, race, education, male, time_to_failure, status])
columns = ['age', 'SBP', 'SCr', 'BMI', 'HbA1c', 'Diabetes', 'Hypertension', 'Smoke', 'Race', 'Education', 'Male', 'Time_to_Kidney_Failure', 'Status']
df = pd.DataFrame(data, columns=columns)
df['Race'] = df['Race'].astype(int).map({0: 'White', 1: 'Black', 2: 'Hispanic', 3: 'Asian', 4: 'Other'})
df['Education'] = df['Education'].astype(int).map({0: 'K-8', 1: 'High School', 2: 'Some college', 3: 'College'})

# Save to CSV
csv_file = 'simulated_data.csv'
df.to_csv(csv_file, index=False)
print(f"Saved dataset to {csv_file}")

# Print summaries
print(df['Time_to_Kidney_Failure'].describe())
print(df['Status'].value_counts())
Saved dataset to simulated_data.csv
count    1000.000000
mean        0.080576
std         0.078595
min         0.000067
25%         0.022787
50%         0.055797
75%         0.117878
max         0.619337
Name: Time_to_Kidney_Failure, dtype: float64
Status
1.0    1000
Name: count, dtype: int64
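
To connect the simulation back to the modeling step in the roadmap, the saved CSV can be read into R and fit with a standard survival model. A minimal sketch using the survival package (our tool choice here, not prescribed by the study); note that with the current additive risk score every simulated subject fails within 30 years, so Status is all ones:

library(survival)

sim <- read.csv("simulated_data.csv")

# Cox proportional-hazards model on the simulated living donor cohort
fit <- coxph(
  Surv(Time_to_Kidney_Failure, Status) ~ age + SBP + SCr + BMI + HbA1c +
    Diabetes + Hypertension + Smoke + Male + Race + Education,
  data = sim
)
summary(fit)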