Study design and participants

Using a prospective cohort design that followed individuals for up to 180 days after their first positive test, we used data from the Norwegian Emergency Preparedness Register, Beredt C19 (BC19). BC19 is a national database of linked register data, established to provide authorities with rapid knowledge for handling the pandemic. Sources included in the current study were the Norwegian Population Register (demographic characteristics), the Norwegian Tax Authorities and National Education Database (socioeconomic variables), the Norwegian Surveillance System of Communicable Diseases (results from all polymerase chain reaction (PCR) testing), the Norwegian Immunization Registry (data on all vaccinations against COVID-19), the Norway Control and Payment of Health Reimbursement Registry (primary health care visits before and during the pandemic) and the Norwegian Patient Registry (specialist health care visits before and during the pandemic). These data sources were linked using a deidentified version of the personal identification number received upon birth or immigration.

Our study population included all Norwegian residents aged 30 to 70 years (i.e., working-age individuals) on January 1st 2020 who had their first positive SARS-CoV-2 PCR test, as registered in the Norwegian Surveillance System of Communicable Diseases, between July 1st 2020 and January 23rd 2022. By including individuals from the date of their first positive test, we ensured that they had no pre-existing post-COVID complaints resulting from previous COVID-19 illness.

We excluded individuals with one or more positive tests in the period 31 to 180 days after the first positive test. In this way, we could exclude new-onset symptoms that were due to a new SARS-CoV-2 infection and not related to the first SARS-CoV-2 infection (i.e., all positive tests occurring within the first 30 days after the first positive test were regarded as resulting from the same infection period [29]). We also excluded individuals who were hospitalized due to COVID-19, as these experienced considerably more bodily stress from the infection. We required complete follow-up data, i.e., all individuals were followed for 180 days after testing positive.
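The inclusion and exclusion rules above can be sketched as a single eligibility check. This is a minimal illustration in Python, not the authors' code (their analyses were run in R). The dates, age limits and day ranges come from the text; the function name, argument names and data layout are hypothetical, and the complete follow-up requirement is not modelled.

```python
from datetime import date

# Inclusion window for the first positive PCR test, as stated in the text.
WINDOW_START = date(2020, 7, 1)
WINDOW_END = date(2022, 1, 23)


def is_eligible(age_on_2020_01_01, positive_test_dates, hospitalized_for_covid):
    """Hypothetical helper: True if a person meets the stated criteria."""
    if not positive_test_dates:
        return False
    if not 30 <= age_on_2020_01_01 <= 70:
        return False
    first = min(positive_test_dates)
    if not WINDOW_START <= first <= WINDOW_END:
        return False
    if hospitalized_for_covid:
        return False
    # Tests up to 30 days after the first are treated as the same infection;
    # a positive test 31 to 180 days after the first counts as a reinfection.
    for test_date in positive_test_dates:
        if 31 <= (test_date - first).days <= 180:
            return False
    return True
```

For example, a 45-year-old with positive tests on 1 and 20 March 2021 would be kept (the second test falls within 30 days), whereas a second positive test on 1 May 2021 would lead to exclusion.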

Outcome: post-COVID condition

The main outcome of interest was having the post-COVID condition (yes/no) as recorded by a general practitioner (GP) in primary or emergency care using the International Classification of Primary Care (ICPC-2). From May 4th 2020, primary care physicians were instructed to use the diagnosis code R992 for patients with COVID-19 disease. Persistent complaints after COVID-19 were coded with R992 together with at least one code for a persistent symptom, for example fatigue or pain [30]. For example, if a patient reported struggling with fatigue after the infection, this was coded with R992 together with A04 (weakness/tiredness). Correspondingly, if the complaint was shortness of breath, it was coded with R992 and R02. Primary care physicians could use this coding for persistent complaints at any time during the pandemic. However, national health authorities issued an official recommendation to do so only from April 1st 2021. The recommendation stated that persistent COVID-19 complaints should be coded by primary care physicians based on a patient history of persistent complaints and an earlier, confirmed COVID-19 infection. In our study, we defined the physician-reported post-COVID condition as one or more of several long-term symptoms after a SARS-CoV-2 infection, as described in Table 2 [31], recorded 90–180 days after the first positive test. As such, our definition is in accordance with the World Health Organization’s definition of post-COVID conditions (COVID-like complaints present 3 months after infection) [8]. The assumption behind our main outcome, “post-COVID condition”, was that the risks of the diverse symptoms together make up the risk of the post-COVID condition, and that these symptoms share common predictors.
However, the predictors may differ by symptom. To examine the sensitivity of our results, we therefore also assessed two secondary outcome measures, chosen based on findings in previous register-based research [27,32] and the number of observations for each outcome in our sample: (1) respiratory complaints (including cough and shortness of breath) and (2) fatigue (Table 2). As a robustness check, because individuals with anxiety and/or depression might be more prone than others to seek medical care for physical health concerns [33], we also examined how the results were affected when individuals whose post-COVID symptoms were anxiety and depression were recoded as non-post-COVID cases.
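The outcome definition can be illustrated with a short Python sketch. It assumes, for illustration only, that R992 and the symptom code are recorded at the same GP contact, and it includes only the two symptom codes named in the text (A04 and R02); the full set is given in Table 2. The function name and data layout are hypothetical.

```python
from datetime import date

COVID_CODE = "R992"
# Only the two symptom codes mentioned in the text; Table 2 lists the full set.
SYMPTOM_CODES = {"A04", "R02"}


def has_post_covid_condition(first_positive, visits):
    """Hypothetical helper.

    visits: iterable of (visit_date, icpc2_codes) pairs for one person.
    Returns True if any contact 90-180 days after the first positive test
    carries R992 together with at least one listed symptom code.
    """
    for visit_date, codes in visits:
        days_since_test = (visit_date - first_positive).days
        codes = set(codes)
        if (90 <= days_since_test <= 180
                and COVID_CODE in codes
                and codes & SYMPTOM_CODES):
            return True
    return False
```

Under these assumptions, a contact coded R992 and A04 on day 104 after the first positive test counts as a post-COVID case, while the same codes on day 31 do not.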

Table 2 Diagnostic codes of conditions/complaints used in combination with “R992” (confirmed COVID) to operationalize the post-COVID condition (ICPC-2)

Medical reporting to the national registries is mandated by law in Norway, which reduces potential bias due to missing data in our study. Norwegian health register data have been demonstrated to have high validity and reliability in a small comparative study of medical journal notes and medical records [34], i.e., they are well suited for studying patterns of health care use and complaints leading to health care use. Still, we relied on a diagnostic coding practice that was introduced during the pandemic and was therefore relatively new to primary care physicians. Indeed, use of the codes described above was limited at the beginning of the pandemic (when both the post-COVID condition and the coding practices were new), before slowly rising and peaking in March 2022 (Supplementary Fig. 10).


We included predictors based on demographic and socioeconomic characteristics, previous healthcare use, virus variant, and vaccination against COVID-19 (Table 3), all as identified by the registries described above. For “health care utilization prior to infection” (Table 3), we relied on the pre-pandemic period 2017–2019, because access to care was periodically restricted during the COVID-19 pandemic and the data-generating process therefore differed between phases of the pandemic. Virus variant was assigned according to which virus type was dominant among infected individuals at the time: the Wuhan virus (March 1st 2020–February 16th 2021), the Alpha virus (February 17th 2021–June 30th 2021), the Delta virus (July 1st 2021–December 23rd 2021), and the Omicron virus (December 24th 2021–January 23rd 2022).
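The date-based variant assignment amounts to a simple lookup. The sketch below is illustrative Python; the period boundaries are those given in the text, while the function and variable names are hypothetical.

```python
from datetime import date

# Dominance periods as listed in the text (both end points inclusive).
VARIANT_PERIODS = [
    ("Wuhan", date(2020, 3, 1), date(2021, 2, 16)),
    ("Alpha", date(2021, 2, 17), date(2021, 6, 30)),
    ("Delta", date(2021, 7, 1), date(2021, 12, 23)),
    ("Omicron", date(2021, 12, 24), date(2022, 1, 23)),
]


def dominant_variant(test_date):
    """Return the variant dominant on test_date, or None outside all periods."""
    for name, start, end in VARIANT_PERIODS:
        if start <= test_date <= end:
            return name
    return None
```

A first positive test on February 17th 2021 would thus be attributed to Alpha, and one on December 24th 2021 to Omicron.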

Table 3 Overview of predictors used in the analysis

Statistical analyses

The statistical analysis consisted of two parts. In the first part we explored the incidence and risk factors for doctor-diagnosed post-COVID condition. In the second part we built prediction models using machine learning algorithms.

We estimated the incidence of post-COVID condition for each stratum of the included covariates (Table 3). We then estimated odds ratios (ORs) for the post-COVID condition in bivariate and multivariate logistic regression models. While the bivariate models only contained the outcome and the exposure of interest (each factor separately), the multivariate models used two different sets of explanatory variables, depending on which factor was under study: (i) When analyzing healthcare utilization, we controlled for all the demographic and socioeconomic factors and vaccination status. (ii) When studying the risk related to demographic and socioeconomic characteristics and vaccination status, we ran separate models for each characteristic while adjusting for healthcare utilization prior to infection (2017–2019). To illustrate, the adjusted model for a specific age group shows the risk adjusted for health care utilization history. Note that since virus type and vaccination status at the time of infection were strongly correlated, the multivariate models analyzing virus types included both vaccination status and sociodemographic factors as controls. We repeated our analyses for the secondary outcome measures (post-COVID respiratory complaints and post-COVID fatigue) and when recoding individuals with anxiety and depression post-COVID symptoms as non-post-COVID cases. We also repeated the main analyses in several sensitivity analyses related to the study sample: (1) an analysis of risk factors when including hospitalized individuals, (2) an analysis of individuals with reinfection within 180 days, (3) an analysis of individuals hospitalized and/or reinfected within 180 days, and (4) an analysis of individuals who were infected after December 2020 (as opposed to the first period, when the virus and its short- and long-term consequences were unknown).
For a more standardized interpretation of predictor-specific incidence and odds ratios, we used “everyone else” as the reference group in all analyses. Thus, all predictors were added to the model as binary 0/1 variables, where 1 represented having the characteristic of interest (for example, age group (50,60] taking value 1), and 0 represented everyone else, not having the characteristic of interest (in the example, all other age groups, i.e., age groups [30,40], (40,50] and (60,70] taking value 0). Likewise, for the predictor Female, coded as 1, everyone else, who were typically categorized as Male, were coded as 0. As such, the odds ratio for females will be the inverse of the odds ratio for males and vice versa.
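The “everyone else” coding and the resulting inverse relationship between the female and male odds ratios can be checked with a small Python sketch. The counts below are made up purely for illustration and are not study data; the helper names are hypothetical. The crude odds ratio from a 2×2 table equals the exponentiated coefficient of a logistic regression with a single binary predictor, so it corresponds to the bivariate models only.

```python
def one_vs_rest(values, level):
    """Code `level` as 1 and everyone else as 0."""
    return [1 if v == level else 0 for v in values]


def odds_ratio(exposure, outcome):
    """Crude odds ratio from paired 0/1 exposure and outcome lists."""
    a = sum(1 for e, y in zip(exposure, outcome) if e == 1 and y == 1)
    b = sum(1 for e, y in zip(exposure, outcome) if e == 1 and y == 0)
    c = sum(1 for e, y in zip(exposure, outcome) if e == 0 and y == 1)
    d = sum(1 for e, y in zip(exposure, outcome) if e == 0 and y == 0)
    return (a * d) / (b * c)


# Toy data (NOT from the study): 6 women with 3 cases, 6 men with 1 case.
sex = ["F"] * 6 + ["M"] * 6
case = [1, 1, 1, 0, 0, 0] + [1, 0, 0, 0, 0, 0]

or_female = odds_ratio(one_vs_rest(sex, "F"), case)
or_male = odds_ratio(one_vs_rest(sex, "M"), case)
```

With only two categories, the product of `or_female` and `or_male` is 1. With more than two categories (for example, age groups), each one-vs-rest odds ratio compares a group against the pooled remainder, not against a single baseline group.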

The aim of the machine learning models was to predict post-COVID cases. We built prediction models with two different machine learning algorithms, one transparent (Least Absolute Shrinkage and Selection Operator, or LASSO) and one more flexible and opaque (Random Forest). To limit overfitting, both models were tuned with cross-validation.

The Least Absolute Shrinkage and Selection Operator (LASSO) is one of several penalized regression methods available for prediction [35]. Due to its sparsity and performance, the LASSO has become widely used when aiming for a predictive model that is interpretable yet performs well out of sample. What separates the LASSO from other penalized regression models is the functional form of the penalty term: the LASSO uses the sum of the absolute values of the coefficients (L1 penalty), while other methods use the sum of squared coefficients (L2, ridge regression), or a combination of both (elastic net regression). The result is that the LASSO tends to suggest sparse models, keeping only a small set of strong predictors.
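The difference between the penalty terms can be written out in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the elastic net form follows the parameterization used by glmnet (mixing parameter alpha, with alpha = 1 giving the LASSO), and the soft-thresholding function shows, for the textbook special case of orthonormal predictors, why an L1 penalty sets small coefficients exactly to zero. All function names are hypothetical.

```python
def l1_penalty(coefs, lam):
    """LASSO penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(b) for b in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lam times the sum of squared coefficients."""
    return lam * sum(b * b for b in coefs)


def elastic_net_penalty(coefs, lam, alpha):
    """glmnet-style mix: lam * (alpha * L1 + (1 - alpha) / 2 * L2)."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)


def soft_threshold(z, lam):
    """LASSO solution for one coefficient under orthonormal predictors:
    shrink z towards zero by lam and set it to exactly zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

For instance, `soft_threshold(0.3, 0.5)` is exactly zero, whereas a ridge penalty would shrink the same coefficient proportionally without removing it, which is why the LASSO yields sparse models.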

The random forest averages the predictions from multiple classification and regression trees (CARTs) [36]. Hence, it is an “ensemble learner”. The random forest has gained popularity due to its high level of performance, robustness to various data challenges (missing observations, rescaling of predictors etc.) and limited set of tunable hyperparameters. What is particular to the random forest is that each CART is grown on a bootstrap sample of the data and considers only a random subset of the available predictors at each split. This random selection of predictors has been shown to boost predictive performance by limiting inefficient dependency across individual CARTs.
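The ensemble idea can be illustrated with a deliberately tiny Python sketch: each "tree" is a single split (a depth-one stump) fit on a bootstrap sample using a random subset of binary predictors, and the forest prediction is the average over trees. This is a toy illustration under those simplifying assumptions, not the ranger implementation used in the study; all names and the example data are hypothetical.

```python
import random


def fit_stump(rows, labels, features):
    """Best one-split tree over the given binary features (squared error)."""
    best = None
    for f in features:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # feature is constant in this sample, cannot split
        p0 = sum(left) / len(left)
        p1 = sum(right) / len(right)
        sse = sum((y - p0) ** 2 for y in left) + sum((y - p1) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, f, p0, p1)
    if best is None:  # no usable split: predict the sample mean
        mean = sum(labels) / len(labels)
        return (None, mean, mean)
    return best[1:]


def fit_forest(rows, labels, n_trees=50, mtry=2, seed=1):
    """Fit n_trees stumps, each on a bootstrap sample and mtry random features."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(p), mtry)          # random predictor subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, x):
    """Average the predicted probabilities of all stumps."""
    preds = [p1 if f is not None and x[f] == 1 else p0 for f, p0, p1 in forest]
    return sum(preds) / len(preds)


# Toy data (made up): the first predictor determines the outcome, the rest are noise.
rows = [(1, 0, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1),
        (0, 0, 1), (0, 1, 0), (0, 0, 0), (0, 1, 1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
forest = fit_forest(rows, labels)
```

Because every tree here has a single split, drawing predictors per tree and per split coincide; in a full random forest with deeper trees, a fresh random subset is drawn at every split.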

Both the LASSO and random forest models were estimated on the full set of covariates, i.e., sociodemographic data, health care utilization data, vaccination status and virus type. The outcome was a binary indicator of the post-COVID condition. The models were tuned using 10-fold cross-validation with the same folds across model types, and performance was assessed on the same hold-out sample (20%). Using bootstrapping, we also estimated confidence intervals for their performance (area under the curve, AUC).
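As an illustration of the performance assessment, the following Python sketch computes the area under the ROC curve from its rank interpretation (the probability that a randomly chosen case receives a higher score than a randomly chosen non-case) and a percentile bootstrap confidence interval on a hold-out sample. The percentile method, the number of resamples and all names are assumptions made for this sketch; the text does not specify these details.

```python
import random


def auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the AUC (an assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both cases and non-cases
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper
```

Here `labels` would be the observed 0/1 outcomes in the hold-out sample and `scores` the predicted probabilities from a fitted model.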

We extended our machine learning models in two directions. First, to explore the potential for improving the AUC score by adding complexity, we also estimated models where we split each ICPC chapter into symptom codes (00–29) and diagnosis codes (30–79). Moreover, instead of a binary marker for primary healthcare use, we counted the number of visits for each symptom and diagnosis. Second, to explore the potential for simplifying the model in order to make it more clinically relevant, we estimated models that included only a small set of strong predictors.
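The first extension can be sketched as a small feature-engineering step in Python. The code ranges are taken from the text as stated (00–29 for symptoms, 30–79 for diagnoses); the handling of codes outside those ranges, the feature naming scheme and the function names are assumptions made for illustration.

```python
from collections import Counter


def icpc_feature(code):
    """Map an ICPC-2 code such as 'A04' to a '<chapter>_<kind>' feature name."""
    chapter = code[0]
    number = int(code[1:3])
    if 0 <= number <= 29:
        kind = "symptom"
    elif 30 <= number <= 79:
        kind = "diagnosis"
    else:
        kind = "other"  # assumption: codes outside the stated ranges
    return f"{chapter}_{kind}"


def visit_counts(codes):
    """Count visits per chapter and kind instead of using a 0/1 marker."""
    return Counter(icpc_feature(c) for c in codes)
```

For a person with pre-pandemic visits coded A04, A04, R02 and R74, this gives counts of 2 for `A_symptom`, 1 for `R_symptom` and 1 for `R_diagnosis`.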

All prediction analyses were done within the tidymodels machine learning framework. The confidence intervals for the area under the curve were estimated with the pROC package, using the DeLong method [37]. All analyses were run in R (v.4.0.2), using the packages tidyverse (v.1.3.2), broom (v.1.0.2), tidymodels (v.0.1.4), ranger (v.0.13.1), and glmnet (v.4.1–3). The data from the different registers were linked in R using the RODBC package (v.1.3-19).

Inclusion and ethics

The Ethics Committee of South-East Norway confirmed on June 4, 2020 that external ethical board review was not required (#153204). The data sources (the emergency preparedness register for COVID-19) were established and handled in accordance with the Health Preparedness Act §2-4 (11), enabling a quick and responsive way for the Norwegian government to access knowledge of how to handle the pandemic. Hence, the data and analysis were regarded by the ethical committee to respond to research aims not falling under the Law of Health Research §§ 2 and 4a. Informed consent from participants was not required, since the study was based on routinely collected administrative register data. Data from the different registers were linked by certified researchers using an encrypted personal ID variable. Unencrypted ID numbers were not available to the researchers. All methods were carried out in accordance with relevant guidelines and regulations. To protect participant privacy and the security of personal data, all data were handled under strict confidentiality and access control as described in the Norwegian Institute of Public Health’s internal documentation.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
