Study design

The data presented herein are derived from a prospective, multicenter, nonrandomized, open-label clinical trial (ClinicalTrials.gov identifier: NCT05547035) designed specifically for this purpose. Patients diagnosed with MDD were enrolled in the study by their general practitioner or psychiatrist.

The study was designed and conducted in accordance with Good Clinical Practice as defined by the Agence Nationale de Sécurité du Médicament et des Produits de Santé (ANSM; ID: 2017-A00595-48) and the Declaration of Helsinki. An independent research ethics committee, CPP Sud-Est 1 (ID: 2017-34), approved the protocol and informed consent documents. All patients provided written informed consent prior to participation.

Inclusion/exclusion criteria

The enrolled patients fulfilled the following inclusion criteria: male or female aged 18 to 65 years; treated for MDD according to the DSM-5 definition; Montgomery–Åsberg Depression Rating Scale (MADRS) score ≥ 20; French speaker; able to read and write in French; able to understand and follow all study procedures; written informed consent provided. Patients were excluded for any of the following reasons: inability to wear a portable monitor for the study duration (6 months); severe medical condition (e.g., neurological, rheumatological) at the investigator’s discretion; treatment-resistant depression; chronic depression; dysthymia; depression with mood-incongruent psychotic features; schizophrenia; depression with catatonic features; substance use disorder in the last 6 months; practice of extreme sports during the study; pre-existing skin infection at the wearable monitor site; pregnancy or lactation; participation in another drug or medical device study; inability to give informed consent.

Study procedures

During the enrollment visit, patients received a portable passive monitoring device (described in section Wearable device and physiological features) that they were asked to wear continuously (i.e., 24 h per day, 7 days a week) for 6 months, except for battery charging and during activities that may represent a risk to the integrity of the device (e.g., showering or participation in contact sports). Charging time of the device was up to 2 h, and patients were instructed to charge it during moments they would not wear it. To minimize data noise, the device was to be worn on the nondominant wrist, a common practice in studies using wearable devices (e.g., actigraphy)46. This criterion ensures that reliable features can be derived from raw physiological measurements47.

The study period was 6 months and comprised seven periods (baseline and months 1, 2, 3, 4, 5 and 6). At each monthly follow-up visit, physicians assessed the patients’ mood status, which included administration of the MADRS. The clinician-administered MADRS48 is a widely used and accepted instrument for assessing depression and evaluating treatment efficacy in patients diagnosed with MDD. The Structured Interview Guide for the MADRS (SIGMA) provides structured questions that are to be asked exactly as written to ensure that administration of the MADRS is standardized. The interrater reliability of the MADRS administered with the SIGMA, according to the intraclass correlation coefficient, has been reported as excellent (r = 0.93)49. Appropriate for both clinical and research settings, the MADRS can be used to stratify the severity of depressive symptoms and to evaluate trends in the severity of a patient’s depressive episode and response to treatment. In this study, we stratified the MADRS scores as follows50: no depression (score: 0–6), mild depression (7–19), moderate depression (20–34), and severe depression (≥ 35).


The primary endpoint was to compare the change in physiological variables (e.g., motor activity, cardiorespiratory activity, and sleep parameters) with the change in the clinical variable (the MADRS total score) over a period of six months.

The secondary endpoint was to train an algorithm to identify markers of mood disorders using six months of physiological and clinical data.

Wearable device and physiological features

The wearable device takes the form of a wristband (see Fig. 1 in the Supplementary Materials) and was custom manufactured by Éolane (Angers, France), an ISO 13485-compliant company that adheres to medical standards, including IEC 60601 and EN 62304, for medical device software. The use of a custom device was necessary to obtain the raw measurements from all sensors and to enable the addition of new physiological feature acquisition systems if needed. The custom wearable device contained the following standard sensors: a photoplethysmograph (PPG) with a 50 Hz sampling rate (for deriving cardiorespiratory features), a 3-axis accelerometer with a 25 Hz sampling rate (for deriving multiple actigraphy features) and an electrodermal activity (EDA) sensor with a 4 Hz sampling rate.

Figure 1

Label detrending procedure. This diagram shows how data and labels are handled and partitioned for the machine learning algorithm. The actual MADRS score is obtained only during clinical visits but can be safely extended to 5 days on either side of a visit. The physiological data are available every day. The residual between the output of the optimistic model and the known MADRS score is used as a label for the machine learning model. The data and labels are then partitioned into training and test sets to fit and evaluate the multi-layer perceptron.

To extract physiological features from the sensors' raw measurements, we proceeded as follows. Cardiorespiratory features (such as heart rate, breathing rate, and heart rate variability)18,19,20,21,22,23, actigraphy features (e.g., L5, M10)46,47,51 and sleep-based features (e.g., sleep stages such as REM/NREM/WASO)52 were extracted from the PPG data, the 3-axis accelerometer data, and both sensors' data, respectively, using a combination of standard algorithms from the literature53,54,55. These features were grouped into physical activity (12 features), heart rate (25 features), heart rate variability (39 features), breathing rate (12 features) and sleep (13 features) and were smoothed with a mean filter to remove potential outliers. Missing values were imputed using interpolation, and each feature was independently normalized between 0 and 1 to account for patient heterogeneity. Due to proprietary concerns, the full list of these physiological features cannot yet be disclosed.
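The preprocessing chain described above (mean-filter smoothing, interpolation of missing days, per-feature min-max normalization) can be sketched as follows; the window size and the linear interpolation method are assumptions, as the exact settings are not disclosed:

```python
import numpy as np

def preprocess_feature(x, window=3):
    """Smooth a daily feature series, impute gaps, and scale to [0, 1].

    Illustrative sketch only: the smoothing window and the use of
    linear interpolation are hypothetical choices, not the published
    configuration.
    """
    x = np.asarray(x, dtype=float)
    # Impute missing values (NaN) by linear interpolation over days.
    days = np.arange(len(x))
    mask = ~np.isnan(x)
    x = np.interp(days, days[mask], x[mask])
    # Mean filter to damp potential outliers.
    kernel = np.ones(window) / window
    x = np.convolve(x, kernel, mode="same")
    # Min-max normalization, performed independently for each feature.
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```

Each of the 101 features would be passed through such a chain separately, so that a patient's day is ultimately described by a vector of values in [0, 1].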

Machine learning algorithm

A detailed description of the ML algorithm is available in the Supplementary Materials. The optimization procedure is divided into two parts: training SiBaMoD, which depends on the hyperparameters λ and ν (the recovery rate and the signature size, respectively), and an optimization scheme for selecting appropriate values of λ and ν for a given patient. SiBaMoD itself is composed of several parts: label extension and detrending procedures, a feature selection step, and a deep learning Multi-Layer Perceptron (MLP) model, described below.

Label extension addresses the sparsity of the MADRS scores (collected once per month) relative to the abundance of physiological values (collected daily). To this end, the clinical labels were extended over a window of ± 5 days around each follow-up visit. This procedure is summarized in Fig. 1. The label extension methodology was justified by considering the test–retest reliability of the MADRS over the course of several days as described in the literature56 and was confirmed by experiments conducted with our dataset.
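The label-extension step can be sketched as follows; how overlapping windows are resolved is an assumption (here, the later visit wins):

```python
import numpy as np

def extend_labels(n_days, visit_days, visit_scores, half_window=5):
    """Spread each monthly MADRS score over ±5 days around its visit.

    Daily labels are NaN except within the window around a clinical
    visit, where that visit's MADRS score is copied. Overlap handling
    (later visit overwrites) is an illustrative assumption.
    """
    labels = np.full(n_days, np.nan)
    for day, score in zip(visit_days, visit_scores):
        lo = max(0, day - half_window)
        hi = min(n_days, day + half_window + 1)
        labels[lo:hi] = score
    return labels
```

This turns one label per month into eleven labeled days per visit, which is what makes training on daily physiological features feasible.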

The label detrending procedure addressed the issue of having a nonstationary label over the course of the clinical trial for a given patient (since most patients tend to recover due to treatment). The detrending procedure consisted of replacing the MADRS scores with their discrepancy from an optimistic model that estimates the change in the MADRS score based solely on the most recent clinical visit. This model has no trainable parameters but depends solely on the constant λ. On a given day, this model predicts a MADRS score by applying an amelioration rate to the previous clinical evaluation of the patient by the physician. The difference between the actual MADRS score and the MADRS score predicted by this optimistic model is called the residual MADRS score. This choice of optimistic model is supported by known models of affective disorders in the literature57.
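The detrending logic can be sketched as below. The exact functional form of the optimistic model is not stated in the text, so the exponential recovery at rate λ used here is an illustrative assumption; only the residual construction (actual minus optimistic prediction) follows the description directly:

```python
import math

def optimistic_madrs(last_score, days_since_visit, lam):
    """Parameter-free optimistic model (sketch).

    Assumes an exponential recovery at rate lam starting from the
    most recent clinical evaluation; the true functional form used
    in the study may differ.
    """
    return last_score * math.exp(-lam * days_since_visit)

def residual_madrs(actual_score, last_score, days_since_visit, lam):
    """Residual MADRS score: the detrended label used for training."""
    return actual_score - optimistic_madrs(last_score, days_since_visit, lam)
```

A patient who recovers exactly as the optimistic model predicts yields residuals of zero, so the learning task becomes modeling deviations from the expected recovery trajectory.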

The feature selection block relies on a statistical computation on the train/validation sets. This component of the algorithm selects the ν features most correlated with the MADRS on the train and validation sets. To be as general as possible and to detect non-linear correlations, this component selects the physiological features that are least independent of the MADRS according to the Hilbert–Schmidt Independence Criterion (HSIC)58, which is better suited to non-monotonic and non-linear signals than the Spearman or linear correlation coefficients. This set of ν features is called the individual depression biosignature and can be used to efficiently predict the disease’s progression with respect to the clinical scale used. Thus, for each day of recorded physiological data, we can extract a subvector of dimension ν by selecting only the features appearing in the biosignature.
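An HSIC-based selection of this kind can be sketched with the standard biased estimator, HSIC = tr(KHLH)/(n−1)², where K and L are kernel Gram matrices and H is the centering matrix. The Gaussian kernel and its fixed bandwidth are hypothetical choices, as the paper does not specify them:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of an RBF kernel on a 1-D sample."""
    d = x[:, None] - x[None, :]
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased HSIC estimator: trace(K H L H) / (n - 1)^2.

    Higher values indicate stronger (possibly non-linear) dependence.
    The kernel bandwidth is a fixed illustrative choice.
    """
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K = gaussian_gram(np.asarray(x, float), sigma)
    L = gaussian_gram(np.asarray(y, float), sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def select_biosignature(features, madrs, nu):
    """Return the indices of the nu features most dependent on the MADRS."""
    scores = [hsic(features[:, j], madrs) for j in range(features.shape[1])]
    return sorted(np.argsort(scores)[-nu:])
```

In practice the selection would run on the training/validation days only, and the resulting ν indices define the patient's individual depression biosignature.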

The last component of SiBaMoD is a multi-layer perceptron (MLP), which takes as input the ν features of the individual biosignature selected by the feature selection component and outputs an estimate of the residual MADRS. Specifically, the MLP consists of an input layer of dimension ν followed by 3 hidden layers of respective dimensions 8ν, 4ν, and 2ν, and a single scalar output. After early experiments, the parameters chosen for MLP training were a batch size of 16, training for 500 epochs, and an early stopping callback with a patience of 5 epochs monitoring improvements in validation loss. To smooth out random fluctuations due to weight initialization, and to avoid inaccurate predictions caused by potential local minima in the parameter space of the model, this process is repeated 11 times and the final prediction is set to be the median of the 11 predictions.
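The architecture and the median-of-repeats scheme can be sketched as follows. scikit-learn's MLPRegressor stands in for the original deep learning implementation (the framework used is not specified in the text); the layer sizes, batch size, epoch budget, patience, and 11 repeats follow the description above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_predict_residual(X_train, y_train, X_test, nu, n_repeats=11):
    """Sketch of the SiBaMoD regressor: hidden layers of sizes
    8*nu, 4*nu and 2*nu predicting the residual MADRS, repeated
    over several seeds and aggregated by the median.
    """
    preds = []
    for seed in range(n_repeats):
        mlp = MLPRegressor(
            hidden_layer_sizes=(8 * nu, 4 * nu, 2 * nu),
            batch_size=16,
            max_iter=500,              # one iteration = one epoch
            early_stopping=True,       # monitors held-out validation loss
            n_iter_no_change=5,        # patience of 5 epochs
            random_state=seed,         # varies the weight initialization
        )
        mlp.fit(X_train, y_train)
        preds.append(mlp.predict(X_test))
    # Median over repeats damps initialization noise and local minima.
    return np.median(preds, axis=0)
```

The median is preferred over the mean here because a single run stuck in a poor local minimum would shift a mean noticeably but barely affects the median of 11 runs.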

SiBaMoD is trained to minimize the mean squared error (MSE) loss using stochastic gradient descent on the extended clinical labels (± 5 days around the follow-up visits until month 3, resulting in 30 days) with the corresponding physiological features, and is evaluated on the remaining extended clinical labels and physiological features. Once trained, SiBaMoD can be used to predict the MADRS score daily, including on unlabeled days. This predicted MADRS score is then further reduced into 2 classes, healthy (MADRS score < 20) and ill (MADRS score ≥ 20), to enable the use of a binary accuracy metric to optimize the hyperparameters of the model. Details of the SiBaMoD algorithm are presented in Fig. 2a.

Figure 2

Overview of the full algorithm’s pipeline presented in this work. (a) The SiBaMoD pipeline with parameters (λ, ν) for a single given patient. The physiological features are used to train a multi-layer perceptron model, along with the training labels, which were detrended using the optimistic model with recovery rate λ. The prediction output of the model is then combined with the observed MADRS score to determine the predicted MADRS score. The ν features used by the model that are best correlated with the MADRS score form the patient-specific signature. (b) In our cohort, the SiBaMoD pipeline is repeated using every patient as a test patient in a leave-one-patient-out (LOPO) procedure, estimating the hyperparameters (λ, ν) for all patients except the test patient to determine the optimal values for these parameters.

To determine the best λ and ν values for all patients, we optimize the binary accuracy by performing a grid search over these parameters, using a leave-one-patient-out (LOPO) scheme on SiBaMoD. Specifically, the SiBaMoD pipeline is repeated such that each patient in the cohort is set as the test (left-out) patient, and the optimal hyperparameters (λ, ν) of SiBaMoD are estimated on all other patients in the cohort. This LOPO scheme prevents overfitting; in other words, it ensures that the performance of the entire pipeline generalizes to new, unseen patients, even though SiBaMoD itself is patient-specific. The hyperparameter optimization scheme is presented in Fig. 2b.
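The LOPO grid search can be sketched as follows; `evaluate(patient, lam, nu)` is a hypothetical callback that trains SiBaMoD for one patient at one grid point and returns its binary accuracy:

```python
from itertools import product

import numpy as np

def lopo_grid_search(patients, lambdas, nus, evaluate):
    """Leave-one-patient-out selection of (lambda, nu).

    For each held-out patient, the grid point maximizing the mean
    binary accuracy over all *other* patients is chosen, so the
    held-out patient never influences their own hyperparameters.
    `evaluate` is an assumed interface, not the published one.
    """
    best = {}
    for test_patient in patients:
        others = [p for p in patients if p != test_patient]
        scores = {
            (lam, nu): np.mean([evaluate(p, lam, nu) for p in others])
            for lam, nu in product(lambdas, nus)
        }
        best[test_patient] = max(scores, key=scores.get)
    return best
```

Because the held-out patient is excluded from the averaging, the accuracy reported for that patient reflects hyperparameters chosen without any access to their data.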

Standard statistical analyses, such as analysis of variance, are not applicable to ML-based analyses, which do not compare the distribution of a single factor across different populations. In this work, to validate each model, we rely on the following metrics. Firstly, we compute the 2-class and 4-class accuracies. Following the literature50, the 2 classes are obtained by merging the classes “recovered” (MADRS 0–6) with “mild depression” (MADRS 7–19), and by merging the classes “moderate depression” (MADRS 20–34) with “severe depression” (MADRS 35–60). True positive and true negative rates are reported for the binary classification task. Secondly, the mean absolute error (MAE) in the MADRS, together with a confidence interval (α = 0.05), is computed considering each patient as a separate sample of our true distribution. Finally, a visual inspection of the predicted curves and the MADRS labels given by the clinician (Fig. 3) is performed.
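The class mappings and the patient-level MAE can be sketched directly from the thresholds above; the normal-approximation confidence interval is an assumption, as the CI construction is not stated in the text:

```python
import numpy as np

# Severity bins as defined in the text (4-class stratification).
SEVERITY_BINS = [(0, 6), (7, 19), (20, 34), (35, 60)]

def madrs_class(score):
    """Map a MADRS score to one of the 4 severity classes (0-3)."""
    for k, (lo, hi) in enumerate(SEVERITY_BINS):
        if lo <= score <= hi:
            return k
    raise ValueError("MADRS score out of range")

def two_class(score):
    """Binary label: 0 = healthy (MADRS < 20), 1 = ill (MADRS >= 20)."""
    return int(score >= 20)

def mae_with_ci(errors_per_patient, alpha=0.05):
    """MAE with a normal-approximation 95% CI, treating each patient's
    mean absolute error as one sample (illustrative CI choice)."""
    e = np.asarray(errors_per_patient, float)
    mean = e.mean()
    half = 1.96 * e.std(ddof=1) / np.sqrt(len(e))
    return mean, (mean - half, mean + half)
```

Note that the binary split at MADRS 20 coincides with the boundary between the two merged class pairs, so the 2-class accuracy is a coarsening of the 4-class one.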

Figure 3

Comparison of the model's predictions with the ground-truth clinical labels. Predicted MADRS evolution (in red) and actual MADRS scores measured by the physician during the monthly clinic visits (black dots) for a sample of 6 selected patients. The lines around each visit represent the clinical labels extended to 5 days on either side of the visit. The background color intensity indicates the 4 classes of depression symptom severity.

It should be noted that the simplest metric, namely the mean absolute (or mean squared) error in MADRS prediction, is not ideal in terms of reliability and usability, since the noise in the labels themselves is large relative to the signal we are detecting (the concordance between different physicians ranges from r = 0.89 to r = 0.9740). However, the 2-class and 4-class classifications show higher agreement between physicians since they are broader categories.

Considering the novelty of the database under study, we cannot directly compare our metrics to baselines from the literature; we therefore evaluate the performance of our model by comparison with 2 baseline models. The first baseline is the constant prediction model, which always predicts the majority class when classifying disease severity and the mean MADRS score in regression analyses; the second is the optimistic model, which sets the residual MADRS score to 0.
