Please read the following study:
Bonomi, L., Jiang, X., & Ohno-Machado, L. (2020). Protecting patient privacy in survival analyses. Journal of the American Medical Informatics Association, 27(3), 366–375. https://doi.org/10.1093/jamia/ocz195
Discuss your response to this survival analysis study. Do you have the same concerns as the researchers regarding the patient privacy issues when presenting actuarial/survival analysis tables? Do you have other suggestions regarding protecting patient privacy within a study?
Be sure to support your statements with logic and argument; use at least two peer-reviewed articles and cite them to support your statements.
Research and Applications
Protecting patient privacy in survival analyses
Luca Bonomi1, Xiaoqian Jiang2, and Lucila Ohno-Machado1,3
1Department of Biomedical Informatics, UC San Diego Health, University of California, San Diego, La Jolla, California, USA,
2School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA, and 3Division of
Health Services Research and Development, VA San Diego Healthcare System, La Jolla, California, USA
Corresponding Author: Luca Bonomi, PhD, UCSD Health Department of Biomedical Informatics, University of California
San Diego, 9500 Gilman Dr., La Jolla, California 92093, USA; firstname.lastname@example.org
Received 15 July 2019; Revised 9 September 2019; Editorial Decision 6 October 2019; Accepted 18 October 2019
Objective: Survival analysis is the cornerstone of many healthcare applications in which the “survival” probability (eg, time free from a certain disease, time to death) of a group of patients is computed to guide clinical decisions. It is widely used in biomedical research and healthcare applications. However, frequent sharing of exact survival curves may reveal information about the individual patients, as an adversary may infer the presence of a person of interest as a participant of a study or of a particular group. Therefore, it is imperative to develop methods to protect patient privacy in survival analysis.
Materials and Methods: We develop a framework based on the formal model of differential privacy, which provides provable privacy protection against a knowledgeable adversary. We show the performance of privacy-protecting solutions for the widely used Kaplan-Meier nonparametric survival model.
Results: We empirically evaluated the usefulness of our privacy-protecting framework and the reduced privacy risk for a popular epidemiology dataset and a synthetic dataset. Results show that our methods significantly reduce the privacy risk when compared with their nonprivate counterparts, while retaining the utility of the survival curves.
Discussion: The proposed framework demonstrates the feasibility of conducting privacy-protecting survival analyses. We discuss future research directions to further enhance the usefulness of our proposed solutions in biomedical research applications.
Conclusion: The results suggest that our proposed privacy-protection methods provide strong privacy protections while preserving the usefulness of survival analyses.
Key words: data privacy, survival analysis, data sharing, Kaplan-Meier, actuarial
Survival analysis aims at computing the “survival” probability (ie, how long it takes for an event to happen) for a group of observations that contain information about individuals, including time to event. In medical research, the primary interest of survival analysis is in the computation and comparison of survival probabilities across patient groups (eg, standard of care vs. intervention), in which survival may refer, for example, to the time free from the onset of a certain disease, time free from recurrence, and time to death. Survival analysis provides important insights, among other things, on the effectiveness of treatments, identification of risk, biomarker utility, and hypothesis testing [1–10]. Survival curves aggregate information from groups of interest and are easy to generate, interpret, compare, and publish online. Although aggregate data can be protected by different approaches, such as rounding [11,12], binning [13], and perturbation [14], survival analysis models have special characteristics that warrant the development of customized methods. Before describing our proposed solutions, we briefly review how survival curves are derived and what their vulnerabilities are from a privacy perspective.
© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association.
All rights reserved. For permissions, please email: email@example.com
Journal of the American Medical Informatics Association, 27(3), 2020, 366–375
Advance Access Publication Date: 21 November 2019
Survival analysis methods and privacy
Methods for survival analysis can be divided into 3 main categories: parametric, semiparametric, and nonparametric models. Parametric models rely on known probability distributions (eg, the Weibull distribution) to learn a statistical model. These models are less frequently used than semi- or nonparametric methods, as their parametric assumptions rarely hold in practice. Even though the released curves exhibit a natural “smoothing,” studies have shown that the parameters of the model may reveal sensitive information [15]. Semiparametric methods are extremely popular for multivariate analyses and can be used to identify important risk factors for the event of interest. As an example, the Cox proportional hazards model [16] only assumes a proportional relationship between the baseline hazard and the hazard attributed to a specific group (ie, it does not assume that survival follows a known distribution, as is the case with parametric models). Nonparametric models are frequently used to describe the survival probability over time, without requiring assumptions on the underlying data distribution. Among those models, the Kaplan-Meier (KM) product-limit estimators are frequent in the biomedical literature. As an example, a search for PubMed articles using the term Kaplan-Meier retrieves more than 8000 articles each year, from 2013 to 2018. A search for actuarial returns about 500 articles per year. In this article, we focus on the KM estimator and present results for the actuarial model in the Supplementary Appendix. The KM method generates a survival curve in which each event can be seen by a corresponding drop in the probability of survival. For example, Foldvary et al [4] used the KM method to analyze seizure outcomes for patients who underwent temporal lobectomy for epilepsy. In contrast, in the actuarial method [17,18], the survival probability is computed over prespecified periods of time (eg, 1 week, 1 month). For example, Balsam et al [19] used actuarial curves to describe the long-term survival for valve surgery in an elderly population.
It is surprising that relatively little attention has been given so far to the protection of individual privacy in survival analysis. Survival analyses generate aggregated results that are unlikely to directly reveal identifying information (eg, name, SSN) [20]. However, a knowledgeable adversary, who observes survival analysis results over time, may be able to determine whether a targeted individual participated in the study and even if the individual belongs to a particular subgroup in the study, thus learning sensitive phenotypes. Several previous privacy studies have shown that sharing aggregated results may lead to this privacy risk [15,21,22]. For example, small values of counts
(eg, <11) may reveal identifiable information about patients and their demographics [11,23]. As survival analyses rely on statistical primitives (eg, counts of time to events), they share similar privacy risks. In fact, each patient is responsible for a drop or step in the survival curve. Therefore, the released curves may reveal, in combination with personal or public knowledge, sensitive information about a single patient. For example, an adversary who (1) has knowledge of the time to events of individuals in various groups at a certain time (eg, previously released survival curves for different groups) and (2) knows that a person of interest joined the study may infer the presence of such an individual in a specific group (eg, patients in the hepatitis B subgroup) as the released curves are updated. Specifically, an adversary can construct a survival curve based on their auxiliary knowledge and can infer whether the person of interest is in the group by comparing such a curve with the one from a group, as illustrated with the curves s1’ and s2’ in Figure 1 (left panel). The differences between the exact curves and those obtained by the adversary disclose the participation of the person of interest in a group (ie, the patient with time to event at time unit 61 contributed to the curve s2’, thus the individual of interest was in group 2). This scenario is realistic for dashboards of “aggregate” results, where tools for data exploration (eg, web interfaces and application programming interfaces) may enable users to obtain frequent fine-grained releases. It is certainly not limited to survival analysis: it applies also to counts, histograms, proportions (when accompanied by information on the total number of participants), and other seemingly harmless “aggregate” data.
It is imperative to develop privacy solutions to protect the individual presence in the released survival curves.
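The difference attack described above can be made concrete with a small sketch. It assumes the adversary already knows the time-to-event data of every other participant in each group (the auxiliary knowledge) and the event time of the person of interest. The group data, the integer time grid, and the function names `kaplan_meier` and `infer_group` are illustrative assumptions, not the authors' code or the data behind Figure 1.

```python
import math

def kaplan_meier(times, censored, horizon):
    """Exact Kaplan-Meier curve s(1), ..., s(horizon) for integer event times."""
    at_risk = len(times)
    s, curve = 1.0, []
    for t in range(1, horizon + 1):
        u = sum(1 for ti, c in zip(times, censored) if ti == t and not c)
        c_t = sum(1 for ti, c in zip(times, censored) if ti == t and c)
        if at_risk > 0:
            s *= 1.0 - u / at_risk      # each uncensored event produces a drop
        curve.append(s)
        at_risk -= u + c_t              # events and censorings leave the risk set
    return curve

def same_curve(a, b):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))

def infer_group(released, auxiliary, target_time, horizon):
    """Return the groups whose released curve is explained only by adding the target."""
    hits = []
    for name, (times, censored) in auxiliary.items():
        without = kaplan_meier(times, censored, horizon)
        with_target = kaplan_meier(times + [target_time], censored + [False], horizon)
        if same_curve(released[name], with_target) and not same_curve(released[name], without):
            hits.append(name)
    return hits

# Hypothetical auxiliary knowledge: (event times, censoring flags) per group.
HORIZON = 80
TARGET_TIME = 61
auxiliary = {
    "group 1": ([12, 30, 30, 45, 70], [False, False, True, False, False]),
    "group 2": ([8, 25, 52, 66], [False, False, False, True]),
}
# Hypothetical exact release: the person of interest was added to group 2.
released = {
    "group 1": kaplan_meier(*auxiliary["group 1"], HORIZON),
    "group 2": kaplan_meier(auxiliary["group 2"][0] + [TARGET_TIME],
                            auxiliary["group 2"][1] + [False], HORIZON),
}
print(infer_group(released, auxiliary, TARGET_TIME, HORIZON))
```

Because the released curves are exact, a single matching candidate curve is enough to place the individual in a group; this is the leakage that noise injection is meant to mask.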
In this work, we consider the formal and provable notion of differential privacy [24], in which the released statistics are perturbed with carefully calibrated random noise. Specifically, differential privacy ensures that the output statistics are “roughly the same” regardless of the presence or absence of any individual, thus providing plausible deniability. In fact, the differences between differentially private survival curves s1’-dp and s2’-dp and those obtained with the adversarial knowledge in Figure 1 (right panel) do not reveal information about the presence of any individual in either group, as opposed to the original curves (left panel).
Objective
Current research in survival analysis includes the development of accurate prediction models, under the assumption that sharing aggregate survival data does not compromise privacy. For example, deep neural networks have been recently used to learn the relationship between a patient’s covariates and the survival distribution predictions [25–28]. Another example by Lu et al [29] describes a decentralized method for learning a distributed Cox proportional hazards model without sharing individual patient-level data. Those solutions disclose exact results that may enable privacy attacks by untrusted users [15,22,30].
Several approaches have been proposed for privacy-protecting survival analyses [20,31–33]. However, they do not provide provable privacy guarantees. O’Keefe et al [20] discussed privacy techniques based on data suppression (eg, removal of censored events), smoothing, and data perturbation. Yu et al [32] proposed a method based on affine projections for the Cox model. Similarly, Fung et al [33] developed a privacy solution using random linear kernel approaches. Despite promising results, these solutions do not provide provable privacy protection and may be vulnerable in the presence of an adversary who has auxiliary information (eg, knowledge of the time-to-event data [hospitalization, death, etc.]
and from previous publication of survival curves).
We developed a privacy framework, based on the notion of differential privacy, that provides formal and provable privacy protection against a knowledgeable adversary who aims at determining the presence of an individual of interest in a particular group. Intuitively, our framework transforms the data before the release, similarly to previous methods based on generalization (eg, smoothing) and truncation (eg, censoring aggregate counts below a threshold) [20,23]. In our case, privacy is protected with the injection of calibrated noise. We show how this framework can be used to release differentially private survival analyses for the KM estimator (see the Supplementary Appendix for the actuarial method). Furthermore, we define an empirical privacy risk that measures how well an informed adversary may reconstruct the temporal information of time to event of an individual who participated in the study. Our evaluations show that an adversary can reconstruct the time to event with a small error from the observed nonprivate survival curves, thus indicating high privacy risk (eg, potential reidentification by linking the exact time intervals with external data). Our proposed methods significantly reduce privacy risks while retaining the usefulness of the survival curves. We must emphasize that an ideal privacy protection mechanism should not rely on specific assumptions about what background knowledge the adversary has, as violations of those assumptions may invalidate the privacy protection.
Thanks to differential privacy, our methods do not require such assumptions and thus provide protection regardless of how much information the adversary has.
MATERIALS AND METHODS
Nonparametric survival models
Nonparametric survival models estimate the survival probability of a group of individuals by analyzing the temporal distribution of the recorded events during the study. Typically, each individual has a single temporal event, which may represent the development of a symptom, disease, or death. Some of these events may be only partially known (eg, subject drops out of the study, no follow-up) [17,34] and therefore are denoted as censored events. We assume a study of N individuals over a period of T time units (eg, days, months). Furthermore, u_i denotes the number of uncensored patients (known recorded event [eg, death]) at time t_i, c_i denotes the number of censored patients at time t_i, and r_i represents the number of patients remaining before t_i (excluding any individual censored previously). Table 1 summarizes the nonparametric models considered in this article. Additional details are reported in the Supplementary Appendix.
Differential privacy
Differential privacy [24] enables the release of statistical information about a group of participants while providing strong and provable privacy protection. Specifically, differential privacy ensures that the probability distribution on the released statistics should be “roughly the same” regardless of the presence or absence of any individual, thus providing plausible deniability.
Differential privacy has been successfully applied in a variety of settings [14,35], such as data publication (eg, 1-time data release) [36–40], iterative query answering [41–43], continual data release (eg, results are published over time) [44–50], and in combination with various machine learning models [30,51–53]. Among those works, we are inspired by the differentially private model proposed for continual data release [46–49], as survival analyses estimate the survival function at time t using the time to events up to t.
In our setting, we consider an event stream S = (e_1, e_2, …, e_T), where each event e_i = (c_i, u_i, t_i) records the number of censored events c_i and uncensored events u_i at time t_i, and the events are in chronological order (ie, t_i < t_{i+1}). For example, consider a study over a period T = 10 units of time (eg, months) comprising a total of N = 6 individuals with time to events of 2, 4, 4, 5*, 6, 8*, where a time marked with * corresponds to when censoring happened (ie, a participant was lost to follow-up). Under our notation, we have an event stream S = (0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5), (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10), where (0, 0, 3) indicates that no events were observed at time 3.
We assume a trusted data curator who wishes to release an estimate of the survival probability s(t) at each time stamp 1 ≤ t ≤ T using the information in the past stream of events up to time t, namely the prefix stream S_t = (e_1, e_2, …, e_t).
Neighboring streams of time to events
Two streams of time to events S_t and S'_t are neighboring streams if there exists at most 1 t_i ∈ [1, …, t] such that |c_i − c'_i| + |u_i − u'_i| ≤ 1 (ie, they differ at most by 1 event).
Using this notion, we present the definition of differential privacy considered in our work as follows.
Differential privacy
Let M(·) be a randomized algorithm that takes in input a stream S, and let 𝒪 be the set of all possible outputs of M(·).
Then, we say that M(·) satisfies ε-differential privacy if, for all O ∈ 𝒪, all neighboring streams S_t and S'_t, and all t, it holds that:
\[ \Pr[M(S_t) = O] \le e^{\varepsilon} \cdot \Pr[M(S'_t) = O] \]
Intuitively, the notion of differential privacy ensures that neighboring streams should be indistinguishable by an adversary who observes the output of the mechanism M(·) at any time.
Figure 1. Survival curves obtained using the Kaplan-Meier method. (Left panel) An adversary observes 2 exact curves s1’ (group 1) (eg, consisting of patients without hepatitis B) and s2’ (group 2) (patients with hepatitis B) and compares them with the curves constructed with knowledge of s1 and s2 (eg, previously released curves). The adversary knows that the person of interest had an event at time 61 and thus can learn from the change in s2’ that this individual contributed to group 2. This is an example of a difference attack. (Right panel) When the curves are generated using differential privacy (s1’-dp and s2’-dp), their difference does not reveal individual time-to-event information. The data on this plot were obtained from a publicly available repository (http://lib.stat.cmu.edu/datasets/veteran). Here, we only report on the first 80 time units (days) to highlight the difference between the survival curves.
In differentially private models, ε denotes the privacy parameter (also known as privacy budget). Lower values indicate higher indistinguishability, thus providing stronger privacy.
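The role of ε in the definition can be checked numerically on a single count. The sketch below assumes the Laplace mechanism with sensitivity 1, a standard way of achieving ε-differential privacy for counts; the paper defers its own noise calibration to the Supplementary Appendix, so the mechanism, the example count of 10, and the function names here are illustrative assumptions.

```python
import math

def laplace_pdf(x, center, scale):
    """Density of a Laplace distribution centered at `center`."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)

def worst_case_ratio(count, epsilon, outputs):
    """Largest density ratio between the neighboring counts `count` and `count + 1`."""
    scale = 1.0 / epsilon  # sensitivity 1: one individual changes a count by at most 1
    worst = 0.0
    for o in outputs:
        p = laplace_pdf(o, count, scale)
        q = laplace_pdf(o, count + 1, scale)
        worst = max(worst, p / q, q / p)
    return worst

# Candidate released values on a grid around the hypothetical true count of 10.
outputs = [x / 10.0 for x in range(0, 201)]
for eps in (0.1, 1.0, 2.0):
    print(eps, worst_case_ratio(10, eps, outputs), math.exp(eps))
```

For every ε tried, the worst-case ratio coincides with e^ε, which is the bound in the definition: smaller ε forces the two neighboring output distributions closer together.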
Determining the right value for ε is a challenging problem, as specific values depend on the application (ie, risk tolerance) [54]. Typically, ε assumes values in the range [1/1000, 10]. As an example, consider ε = 1: the probability of a stream S_t being mapped to a particular output is then no greater than e ≈ 2.72 times the probability of any of its neighboring streams being mapped to the same output. Perfect privacy can be achieved with ε = 0 (ie, neighboring streams are equally likely to produce the same output); however, it leads to no utility in the released curve, as the mechanism has to completely ignore each individual record in input. The guarantee of indistinguishability between neighboring streams protects the presence of the individual in the released statistics because, in survival analysis, an individual can contribute at most once to the stream.
Typically, differential privacy is achieved via output perturbation, in which the released statistics are perturbed with calibrated random noise to hide the presence of individuals (details are reported in the Supplementary Appendix). Intuitively, the noise perturbation “generalizes” the aggregated time to events, similarly to traditional ad hoc techniques in which the released aggregated counts are obtained by binning and thresholding (eg, reporting counts as “less than 10”).
Our framework for privacy-protecting survival analyses
Publishing survival values s(t) may pose significant privacy challenges, as the event for an individual at time t' will affect the survival curve at time t' (ie, step) as well as subsequent values. Therefore, an adversary who observes these changes may gain knowledge about the individual associated with such an event. To mitigate these risks, traditional differential privacy methods perturb each released survival value s(t). However, these methods may lead to overly perturbed results when the study spans over a long period of time.
To this end, we propose a framework that compresses the stream of events into partitions in which the survival probabilities can be accurately computed over time using an input perturbation strategy.
Overall, our framework (Figure 2) comprises 3 main steps: (1) data partitioning, (2) survival curve computation, and (3) postprocessing. In the data partitioning step, the time to events are grouped into partitions generated in a differentially private manner. The idea is to compress the stream, so that the privacy cost for computing the survival curve can be reduced while retaining the distribution of the events. In the survival computation step, we estimate the number of censored and uncensored events over time using a binary tree decomposition. This step reduces the perturbation noise in the estimation of the events, which are then used to compute the survival probability. Specifically, we use an input perturbation approach in which privacy is achieved by perturbing the counts of the events rather than the output of the survival function, thus improving the utility compared with standard output perturbation techniques. Because the noise perturbation may disrupt the shape of the survival curve, we perform a postprocessing step, in which we enforce consistency in the released curve (ie, monotonically decreasing survival probabilities). For brevity, in the following we describe the instantiation of our framework for the KM method. The private solution for the actuarial method follows the same steps, except for the fact that partitioning is performed over fixed intervals (see Supplementary Appendix).
Data partitioning
Our partitioning strategy takes in input the stream of events S_t and produces a stream of partitions as output, where multiple events are grouped. We compress the stream into partitions of variable length with the goal of retaining the distribution of the events.
Our method processes 1 event at a time and keeps an active partition, which is sealed when more than H time to events are observed. Intuitively, this approach produces a coarser representation of the stream, where each event is grouped with at least H − 1 others, by varying the interval of time to publish survival for a group of events. In this process, we perturb the count of the events in the stream and the threshold H with calibrated noise. As a result, the events and the size of partitions are protected, thus providing an additional level of protection compared with other privacy methods that rely on binning (eg, rounding to the nearest 10). The privacy budget ε_1 dedicated to this step is equally divided among the threshold and event count perturbation. As any neighboring streams may differ at most by 1 segment, these perturbations ensure that the partitions returned by the algorithm satisfy ε_1-differential privacy [55].
Survival curve computation
In this step, we determine the survival probability at time t using an input perturbation strategy. The idea is to estimate the number of uncensored and censored events in the partitions in a differentially private manner and then use those values to compute the survival curve, up to t. One could estimate these events by perturbing the counts over the partitions processed so far. However, this simple process leads to high perturbation noise, as the magnitude of the noise grows linearly with the number of partitions. To this end, we use a binary tree counting approach with privacy parameter ε_2, where leaves represent the original partitions and internal nodes denote partitions obtained by merging the partitions of their children. In Figure 2, for example, the internal node associated with the count C14 comprises the events over the partitions P1, P2, P3, and P4.
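The binary tree counting idea can be sketched as follows. This is a simplified offline version under stated assumptions: Laplace noise, a privacy budget split evenly across tree levels, and all partitions known up front. The cited binary mechanism [46,47] works on a stream, and the authors' exact calibration is in the Supplementary Appendix. The function names and the leaf counts are hypothetical.

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def build_tree(leaf_counts):
    """levels[l][k] = number of events in leaves [k * 2**l, (k + 1) * 2**l)."""
    size = 1
    while size < len(leaf_counts):
        size *= 2
    level = list(leaf_counts) + [0] * (size - len(leaf_counts))  # pad to a power of 2
    levels = [level]
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def add_noise(levels, epsilon, rng):
    # One event changes exactly one node per level, so the L1 sensitivity of
    # the whole tree equals the number of levels.
    scale = len(levels) / epsilon
    return [[c + laplace(scale, rng) for c in level] for level in levels]

def dyadic_cover(t):
    """Nodes (level, index) whose disjoint union is the first t leaves."""
    nodes, start = [], 0
    for level in reversed(range(t.bit_length())):
        width = 1 << level
        if t & width:
            nodes.append((level, start // width))
            start += width
    return nodes

def prefix_count(levels, t):
    """Count of events in the first t partitions, summed over O(log t) nodes."""
    return sum(levels[l][k] for l, k in dyadic_cover(t))

leaves = [1, 0, 2, 1, 0, 1, 0]           # hypothetical events per partition
tree = build_tree(leaves)
noisy = add_noise(tree, epsilon=1.0, rng=random.Random(0))
print(dyadic_cover(4))                    # a single node covers P1..P4
print(prefix_count(tree, 7), prefix_count(noisy, 7))
```

Any prefix count is assembled from at most one node per level, so the number of noise terms in an estimate grows with the logarithm of the number of partitions rather than linearly.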
This binary mechanism is very effective in reducing the overall impact of perturbation noise [46,47]. With this mechanism, the differentially private number of uncensored û(i) and censored ĉ(i) events in the stream can be estimated with a perturbation noise that grows only logarithmically with the number of partitions in the stream.
Table 1. Nonparametric models for survival analysis considered
Actuarial model (see Supplementary Appendix):
• time to events are grouped into intervals of fixed length (l)
• survival computed on the set of intervals {I_1, I_2, …, I_T} of length l
• the censored patients are assumed to withdraw from the study at random during the interval
Survival function at each interval I_i:
\[ s_i = \prod_{j=1}^{i} \left( 1 - \frac{u_j}{r_j - c_j/2} \right) \]
Kaplan-Meier model:
• survival function computed on each time unit
Survival function at time t:
\[ s(t) = \prod_{t_i \le t} \left( 1 - \frac{u_i}{r_i} \right) \]
To compute the privacy-protecting survival curve for the KM method, denoted ŝ_KM(i), we rewrite the KM survival curve formulation as follows:
\[ \hat{s}_{KM}(i) = \hat{s}_{KM}(i-1) \times \frac{N - \hat{u}(i) - \hat{c}(i-1)}{N - \hat{u}(i-1) - \hat{c}(i-1)} \]
where û(i) and ĉ(i) represent the total number of uncensored and censored events up to the time of partition i, respectively.
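The recursive formulation above can be checked against the worked example from the Differential privacy section (N = 6, time to events 2, 4, 4, 5*, 6, 8*). The sketch below feeds exact cumulative counts into the recursion; in the private setting û(i) and ĉ(i) would be the noisy estimates instead. The function name and the choice to start from ŝ_KM(0) = 1 with û(0) = ĉ(0) = 0 are assumptions made for illustration.

```python
from itertools import accumulate

def km_from_cumulative(n, cum_u, cum_c):
    """KM curve from cumulative uncensored (cum_u) and censored (cum_c) counts."""
    s, curve = 1.0, []
    prev_u = prev_c = 0.0
    for u_i, c_i in zip(cum_u, cum_c):
        at_risk = n - prev_u - prev_c           # N - u(i-1) - c(i-1)
        if at_risk > 0:
            s *= (n - u_i - prev_c) / at_risk   # N - u(i) - c(i-1) in the numerator
        curve.append(s)
        prev_u, prev_c = u_i, c_i
    return curve

# Event stream from the text, as (c_i, u_i, t_i) tuples.
stream = [(0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5),
          (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10)]
cum_c = list(accumulate(c for c, _, _ in stream))
cum_u = list(accumulate(u for _, u, _ in stream))
curve = km_from_cumulative(6, cum_u, cum_c)
print([round(s, 4) for s in curve])
```

With exact counts the recursion reproduces the standard product-limit curve. With noisy counts a factor can fall outside [0, 1], which is why a consistency step is needed afterwards.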
At the end of this step, we obtain a step function representing the survival probability of the patients over time that remains constant within each partition.
Data postprocessing
A survival curve satisfies the following properties: (1) it assumes values in the range [0, 1] and (2) it monotonically decreases with time (ie, ŝ(t) ≥ ŝ(t + 1) for 1 ≤ t < T). While our solution ensures that the released curve satisfies differential privacy, the noise perturbation may violate properties 1 and 2. To this end, we propose a postprocessing step, in which we compute the survival curve ŝ*(t) that satisfies these properties and best resembles ŝ(t). Similarly to previous work [56,57], we solve this optimization problem with isotonic regression methods (details in the Supplementary Appendix). An illustrative example of our postprocessing step is reported in Figure 3.
Overall, our approach achieves ε-differential privacy (with ε_1 = ε_2 = ε/2), as differential privacy guarantees in phase 1 and phase 2 compose sequentially [35]. Furthermore, our framework is highly scalable, as all the steps can be performed efficiently.
Figure 2. Overview of our proposed framework to release differentially private survival curves in 3 main steps. First, in data partitioning, the stream of time to events in input is partitioned into segments while satisfying ε_1-differential privacy. Second, in survival curve computation, the aggregated time to events over the stream of partitions are computed to satisfy ε_2-differential privacy using the binary tree mechanism. These noisy counts are used in the estimate of the survival probability over time using an input perturbation mechanism. Third, in postprocessing, the values of the estimated curve are bounded in the interval [0, 1] and monotonicity is enforced.
Evaluation metrics
We conducted empirical evaluations of our proposed privacy-protecting framework to assess the usefulness of the released survival curves and the reduction in privacy risk when the privacy-protecting model is compared with a nonprivate counterpart.
Utility metric
To assess the usefulness of the released survival curves, we compared the differentially private (here named “private” for brevity) curve with the exact survival curve in terms of mean absolute error (MAE). The MAE measures the similarity between 2 curves by averaging the absolute differences, formally:
\[ \mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| s(t) - \hat{s}(t) \right| \]
where s(t) and ŝ(t) denote the nonprivate and the private survival curves, respectively. As the survival curves are based on probabilities, the MAE assumes values in the range [0, 1]. Smaller values of MAE indicate stronger similarity between the 2 curves, hence higher utility. In addition, we compare the curves using the Kolmogorov-Smirnov (KS) test to estimate the statistical difference between the differentially private and nonprivate curves [58].
Privacy metrics
Differential privacy ensures that the adversary’s probability of inferring information about an individual from the survival curve is at most a factor e^ε (approximately 1 + ε for small ε) larger than the probability of inferring that information if such an individual were not included in the study. Thus, the privacy parameter ε provides us with a theoretical bound on the privacy risk of disclosing the participation of each individual. In addition, we consider an empirical privacy risk (defined in the Supplementary Appendix), named inference error, which measures the error in reconstructing the time to event (in time units) by a knowledgeable adversary who observes the released curve. Higher values of inference error indicate lower privacy risk.
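The postprocessing step and the MAE utility metric described earlier in this section can be sketched together. The sketch assumes a least-squares isotonic fit computed with the pool-adjacent-violators algorithm, followed by clipping to [0, 1]; the authors' exact optimization is given in the Supplementary Appendix and may differ. The function names and the noisy input curve are hypothetical.

```python
def fit_non_increasing(values):
    """Least-squares non-increasing fit via pool-adjacent-violators."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge while the newest block's mean exceeds the previous block's mean.
        while len(blocks) > 1 and (blocks[-1][0] / blocks[-1][1]
                                   > blocks[-2][0] / blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def postprocess(noisy_curve):
    """Enforce monotonicity first, then bound the values in [0, 1]."""
    return [min(1.0, max(0.0, s)) for s in fit_non_increasing(noisy_curve)]

def mean_absolute_error(exact, private):
    return sum(abs(a - b) for a, b in zip(exact, private)) / len(exact)

noisy = [1.05, 0.8, 0.85, 0.5, -0.1]      # hypothetical noisy survival values
clean = postprocess(noisy)
print(clean)
print(mean_absolute_error([1.0, 0.85, 0.8, 0.5, 0.1], clean))
```

Clipping after the isotonic fit keeps the curve non-increasing, since clipping is itself a monotone transformation. Both operations act only on the already private output, so they do not consume additional privacy budget.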
For our differentially private solutions, we conduct our evaluations of the inference error by varying the privacy parameter ε.
RESULTS
We used the real-world Surveillance Epidemiology and End Results (SEER) [59] dataset to evaluate the effectiveness of our proposed approaches. Specifically, we generated a stream of time to events by sampling N ∈ {1000, 10000, 100000} patients from 707157 breast cancer patients with first diagnosis from 1973 to 2015, using the time unit of a month. Our results are reported with a confidence level of 95% over 100 runs of our algorithms for the KM survival analyses. Evaluations for the actuarial method, additional experiments on the SEER dataset, and on a synthetically generated dataset are reported in the Supplementary Appendix. In the figures, we denote the standard KM method and our proposed differentially private version by KM and DP-KM, respectively.
KM survival curve
Figure 4 reports the inference error for the nonprivate and private KM approaches with different data sizes, which quantifies the ability of an adversary in inferring the exact time to event of an individual of interest from the released curves. The adversary’s inference error for the standard KM approach grows with the size of the dataset, ie, from ±2 time units (N = 1000) to ±80 time units (N = 100000), but is still very low. Intuitively, the contribution of an individual in the survival probability decreases as N increases (as an individual is hidden in a larger crowd).
With our DP-KM method, the inference error for the adversary is significantly higher across all the sizes of the data (ie, “better” privacy). Consider N = 1000, for example: an adversary can reconstruct the time to event of a targeted individual up to ±2 time units from the KM survival curve. In contrast, with our DP-KM solution, the error is at least ±250 time units (Figure 4A).
In other words, with our DP-KM solution, an adversary cannot infer the time to event for an individual of interest with precision, as the confidence interval for the inferred time to event spans over 250 time units (as opposed to 2 time units with the nonprivate method). Overall, our solution provides consistently stronger privacy protection across all the dataset sizes when compared with the standard KM method. Furthermore, the inference error in our differentially private method is robust against variations of the privacy parameter (ε).
Figure 5 reports the MAE for the differentially private curves. Both the privacy parameter and size of the dataset impact the utility. For larger values of the privacy parameter ε (weaker privacy), the magnitude of the perturbation noise decreases, thus leading to more accurate results. Similarly, for larger datasets, the impact of the perturbation noise is smaller, thus leading to higher utility (ie, lower MAE). For example, with ε ≥ 1, our method achieves MAE ≤ 0.1 for N = 10000, and MAE ≤ 0.03 for N = 100000. In conclusion, our DP-KM solution produces survival curves that retain the usefulness of the nonprivate curves while providing strong privacy protection.
Survival analysis
We performed nonparametric survival analysis using data from the SEER database. We considered patients diagnosed with breast cancer after 2005, from which we randomly selected groups of 2500 patients representing different races: white, black, and other. We obtained the survival curves using the nonprivate KM approach (ie, KM) and its privacy-protecting counterpart (ie, DP-KM). In Figure 6, we observe that our DP-KM solution generates survival curves that closely resemble those obtained with the nonprivate method. We compared the private curves with their nonprivate counterparts using the KS test, and the results are reported in Table 2 for the KM method and in Table 3 for the DP-KM method.
We adopted the KS test rather than the log-rank test, as the former can be performed on the survival curves that are outputs of our differentially private methods. Overall, the differentially private curves obtained with our methods are not statistically different from the exact curves (P > .05) and the differences between groups continue to be statistically significant.
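The two utility comparisons used in the evaluation, MAE and the KS statistic, both operate on released curves rather than patient-level data. A minimal Python sketch follows, assuming the two curves are sampled on the same time grid. The function names and curve values are hypothetical, and the sketch computes only the KS statistic (the largest pointwise gap), not the P values, which in the paper come from a KS procedure adapted to censored data.

```python
def mean_absolute_error(curve_a, curve_b):
    """Average pointwise gap between two curves on a shared time grid."""
    if len(curve_a) != len(curve_b) or not curve_a:
        raise ValueError("curves must be nonempty and of equal length")
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)


def ks_statistic(curve_a, curve_b):
    """Largest pointwise gap between two curves on a shared time grid."""
    if len(curve_a) != len(curve_b) or not curve_a:
        raise ValueError("curves must be nonempty and of equal length")
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))


# Hypothetical survival probabilities at five common time points.
km = [1.00, 0.95, 0.90, 0.80, 0.70]
dp_km = [1.00, 0.93, 0.91, 0.78, 0.70]

mae = mean_absolute_error(km, dp_km)
ks = ks_statistic(km, dp_km)
```

Because both quantities need only the curve values, they can be applied to the output of a differentially private method, which is the reason given above for preferring the KS test over the log-rank test.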
Figure 3. Illustrative example of the postprocessing step in our proposed
framework. The differentially private noisy curve (s) may not be monotonic
due to the injection of random noise. Such a curve is postprocessed to
generate a monotonically decreasing curve (s*) that approximates s.

We presented a differentially private framework that can be used to
release survival curves while protecting patient privacy. We demonstrated
that our method significantly reduces the risk of a privacy
breach when compared with its nonprivate counterpart, while
retaining the utility of the survival curves. We discuss several future
Distributed survival analysis
Current research initiatives often rely on collaborative efforts, such
as the clinical data research network pSCANNER60 and equivalent
multicenter consortia. While our proposed methods are designed for
a centralized setting (ie, trusted aggregator), they could be adapted
to the distributed setting. Inspired by previous work,61 we can
consider a protocol in which each institution perturbs the local stream
of time to events, while a central unit (not necessarily trusted)
aggregates and partitions the received streams.
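One way to picture such a protocol is sketched below in Python, under strong simplifying assumptions: each site holds per-interval event counts, adds Laplace noise locally with scale 1/ε, and sends only the noisy counts to an aggregator that sums them. The site names, counts, and function names are hypothetical, the partitioning step is omitted, and the sketch makes no claim about the formal guarantee of the adapted protocol.

```python
import random


def laplace(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_local_counts(event_counts, epsilon, rng):
    """Run at each institution: add noise before anything leaves the site."""
    scale = 1.0 / epsilon
    return [count + laplace(scale, rng) for count in event_counts]


def aggregate(noisy_streams):
    """Run at the central unit: sum noisy counts interval by interval."""
    return [sum(values) for values in zip(*noisy_streams)]


rng = random.Random(42)
site_counts = {  # hypothetical events per interval at three sites
    "site_a": [12, 9, 7],
    "site_b": [20, 15, 11],
    "site_c": [5, 4, 2],
}
noisy_streams = [
    perturb_local_counts(counts, epsilon=1.0, rng=rng)
    for counts in site_counts.values()
]
pooled = aggregate(noisy_streams)
```

Since the aggregator only ever sees perturbed counts, it does not need to be trusted with exact local data. The cost is that the noise from every site accumulates in the pooled counts, so utility is lower than with a single trusted aggregator at the same ε.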
Achieving high utility under differential privacy is very challenging
in applications that require continual data releases. Recent works
have proposed extensions of the differential privacy model, in which
privacy is relaxed over time.48,49 Extending our privacy solutions to
satisfy those privacy relaxations would help improve the utility of
the released survival curves.
Solutions for other survival models
In this work, we presented a preliminary study on privacy-protecting
survival analyses based on the KM method (the actuarial method is shown
in the Supplementary Appendix). However, there are many other types of
survival models, including those based on Cox proportional hazards,16
accelerated failure time,62 recurrent time-to-event data,63 and competing
risk64 methods. Building on our results, we plan to develop new privacy
methods for enabling other popular privacy-protecting survival analyses
in the future.

Figure 4. Inference error for the Kaplan-Meier (KM) survival curves for N = 1000, 10000, and 100000 sampled patients obtained with the nonprivate (KM) and private (differentially private KM [DP-KM]) methods. Inference error for the KM method and the differentially private solution (DP-KM) vs the privacy parameter (ε), with (A) N = 1000, (B) N = 10000, and (C) N = 100000.

Figure 5. Mean absolute error (MAE) of the differentially private Kaplan-Meier (DP-KM) curve vs the privacy parameter (ε), for N = 1000, 10000, and 100000.
Publication of survival curves is frequent in the biomedical literature
and is becoming more frequent on websites. In this work, we studied
the privacy risk in conducting survival analyses and proposed a
differentially private framework for the KM product limit estimator.
The differentially private curves generated by our framework prevent
an adversary from inferring the time to event for a particular target
individual without a significant error (eg, 250 time units) while
retaining the usefulness of the original nonprivate curves.
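For readers less familiar with the KM product limit estimator that the framework privatizes, a minimal nonprivate Python sketch is given here. The function name and the sample data are hypothetical; the estimator itself is the standard one, multiplying the running survival probability by (1 - d/n) at each distinct event time, where d is the number of events and n the number still at risk.

```python
def kaplan_meier(times, events):
    """Nonprivate Kaplan-Meier estimate.

    times:  follow-up time for each patient
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored patients leave the risk set
    return curve


# Hypothetical follow-up times in months; 0 marks a censored patient.
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 1, 0, 0]
km_curve = kaplan_meier(times, events)
```

Each step of the exact curve depends on the counts d and n at one time point, which is why a single patient can shift the published curve in a detectable way when N is small.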
This work was supported by the National Heart, Lung, and Blood Institute
grant R01HL136835, and National Institute of General Medical Sciences
grant R01GM118609, and National Human Genome Research Institute
LB developed the methods, contributed the majority of the writing,
and conducted the experiments. XJ provided helpful comments on
both methods and presentation. LO-M provided the motivation for
this work, detailed edits, and critical suggestions.
Supplementary material is available at Journal of the American
Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
1. Ohno-Machado L. Modeling medical prognosis: survival analysis techniques. J Biomed Inform 2001; 34 (6): 428–39.
2. Cortese G, Scheike TH, Martinussen T. Flexible survival regression modelling. Stat Methods Med Res 2010; 19 (1): 5–28.
3. Schwartzbaum JA, Hulka BS, Fowler JW, Kaufman DG, Hoberman D. The influence of exogenous estrogen use on survival after diagnosis of endometrial cancer. Am J Epidemiol 1987; 126 (5): 851–60.
4. Foldvary N, Nashold B, Mascha E. Seizure outcome after temporal lobectomy for temporal lobe epilepsy: a Kaplan-Meier survival analysis. Neurology 2000; 54 (3): 630.
5. Galon J, Costes A, Sanchez-Cabo F, et al. Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science 2006; 313 (5795): 1960–4.
6. Le Voyer TE, Sigurdson ER, Hanlon AL, et al. Colon cancer survival is associated with increasing number of lymph nodes analyzed: a secondary survey of intergroup trial INT-0089. J Clin Oncol 2003; 21 (15): 2912–9.
7. Lee ET, Go OT. Survival analysis in public health research. Annu Rev Public Health 1997; 18 (1): 105–34.
Figure 6. Survival curves for breast cancer patients in the Surveillance Epidemiology and End Results dataset for different groups. We sampled 2500 patients for each group (ie, black, white, and others) who have been diagnosed since 2005. The curves obtained with the (A) nonprivate KM method and (B) differentially private curve (DP-KM).

Table 2. Kolmogorov-Smirnov test results for the Kaplan-Meier method
           White                 Black                  Other
White      0.0 (1.0)             0.37 (1.38 × 10^-8)    0.21 (4.32 × 10^-6)
Black      –                     0.0 (1.0)              0.48 (2.39 × 10^-14)
Other      –                     –                      0.0 (1.0)
Values are the Kolmogorov-Smirnov statistic (P value).

Table 3. Kolmogorov-Smirnov test results for the DP-KM method
           DPWhite               DPBlack                DPOther
DPWhite    0.0 (1.0)             0.36 (2.95 × 10^-8)a   0.23 (1.08 × 10^-4)a
DPBlack    –                     0.0 (1.0)              0.45 (1.15 × 10^-12)a
DPOther    –                     –                      0.0 (1.0)
White      0.10 (.52)a           0.34 (1.29 × 10^-9)    0.21 (4.34 × 10^-3)
Black      0.38 (5.28 × 10^-9)   0.14 (.16)a            0.48 (6.45 × 10^-14)
Other      0.28 (5.47 × 10^-5)   0.49 (8.70 × 10^-15)   0.13 (.21)a
Values are the Kolmogorov-Smirnov statistic (P value). The test results
obtained on the curve produced by the differentially private Kaplan-Meier
DP: differentially private.
a: Differentially private curves are not statistically different from the original
ones (P > .05), and they preserve the separation between groups (P < .05).
8. Wagner M, Redaelli C, Lietz M, Seiler CA, Friess H, Büchler MW. Curative resection is the single most important factor determining outcome in patients with pancreatic adenocarcinoma. Br J Surg 2004; 91 (5): 586–94.
9. Strober M, Freeman R, Morrell W. The long-term course of severe anorexia nervosa in adolescents: survival analysis of recovery, relapse, and outcome predictors over 10–15 years in a prospective study. Int J Eat Disord 1997; 22 (4): 339–60.
10. Erbes R, Schaberg T, Loddenkemper R. Lung function tests in patients with idiopathic pulmonary fibrosis: are they helpful for predicting outcome? Chest 1997; 111 (1): 51–7.
11. Murphy SN, Chueh HC. A security architecture for query tools used to access large biomedical databases. Proc AMIA Symp 2002; 2002: 552–6.
12. Bacharach M. Matrix rounding problems. Manage Sci 1966; 12 (9): 732–42.
13. Lin Z, Hewett M, Altman RB. Using binning to maintain confidentiality of medical data. Proc AMIA Symp 2002; 2002: 454–8.
14. Dwork C. Differential privacy: a survey of results. In: Agrawal M, Du D, Duan Z, and Li A, eds. Theory and Applications of Models of Computation (Lecture Notes on Computation Series, volume 4978). New York, NY: Springer; 2008: 1–19.
15. Fredrikson M, Jha S, Ristenpart T. Model inversion attacks that exploit confidence information and basic countermeasures. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. New York, NY: ACM; 2015: 1322–33.
16. Cox DR. Regression models and life-tables. J R Stat Soc Ser B 1972; 34 (2): 187–220.
17. Cutler SJ, Ederer F. Maximum utilization of the life table method in analyzing survival. J Chronic Dis 1958; 8 (6): 699–712.
18. Berkson J, Gage RP. Calculation of survival rates for cancer. Proc Staff Meet Mayo Clinic 1950; 25 (11): 270–86.
19. Balsam LB, Grossi EA, Greenhouse DG, et al. Reoperative valve surgery in the elderly: predictors of risk and long-term survival. Ann Thorac Surg 2010; 90 (4): 1195–201.
20. O’Keefe CM, Sparks RS, McAullay D, Loong B. Confidentialising survival analysis output in a remote data access system. J Priv Confid 2012; 4 (1): 127–54.
21. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008; 4 (8): e1000167.
22. Shokri R, Stronati M, Song C, Shmatikov V. Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy. Piscataway, NJ: IEEE; 2017: 3–18.
23. Klann JG, Joss M, Shirali R, et al. The Ad-Hoc uncertainty principle of patient privacy. AMIA Summits Transl Sci Proc 2018; 2017: 132–8.
24. Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, eds. TCC 2006: Theory of Cryptography Conference. New York, NY: Springer; 2006: 265–84.
25. Faraggi D, Simon R. A neural network model for survival data. Stat Med 1995; 14 (1): 73–82.
26. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. Deep survival: a deep Cox proportional hazards network. stat 2016; 1050: 2.
27. Luck M, Sylvain T, Cardinal H, Lodi A, Bengio Y. Deep learning for patient-specific kidney graft survival analysis. arXiv 2017 May 29 [E-pub ahead of print].
28. Lee C, Zame WR, Yoon J, van der Schaar M. Deephit: a deep learning approach to survival analysis with competing risks. In: Thirty-Second AAAI Conference on Artificial Intelligence; 2018.
29. Lu C-L, Wang S, Ji Z, et al. WebDISCO: a Web service for DIStributed COx model learning without patient-level data sharing. J Am Med Informatics Assoc 2015; 22 (6): 1212–9.
30. Chaudhuri K, Monteleoni C. Privacy-preserving logistic regression. In: Koller D, Schuurmans D, eds. Advances in Neural Processing Systems 21 (NIPS 2008). San Diego, CA: Neural Information Processing Systems Foundation; 2008.
31. Chen T, Zhong S. Privacy-preserving models for comparing survival curves using the logrank test. Comput Methods Programs Biomed 2011; 104 (2): 249–53.
32. Yu S, Fung G, Rosales R, et al. Privacy-preserving Cox regression for survival analysis. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM; 2008: 1034–42.
33. Fung G, Yu S, Dehing-Oberije C, et al. Privacy-preserving predictive models for lung cancer survival analysis. Pract Priv-Preserving Data Min 2008; 40.
34. Pagano M, Gauvreau K. Principles of Biostatistics. New York, NY: Chapman and Hall/CRC; 2018.
35. Dwork C, Roth A. The algorithmic foundations of differential privacy. FnT Theor Comput Sci 2013; 9 (3–4): 211–407.
36. Xiao X, Wang G, Gehrke J. Differential privacy via wavelet transforms. IEEE Trans Knowl Data Eng 2011; 23 (8): 1200–14. doi: 10.1109/TKDE.2010.247.
37. Bonomi L, Xiong L. A two phase algorithm for mining sequential patterns with differential privacy. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. New York, NY: ACM; 2013: 269–78.
38. Li N, Qardaji W, Su D, Cao J. Privbasis: frequent itemset mining with differential privacy. Proc VLDB Endow 2012; 5 (11): 1340–51.
39. Bhaskar R, Laxman S, Smith A, Thakurta A. Discovering frequent patterns in sensitive data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining-KDD ’10. New York, NY: ACM Press; 2010: 503–12.
40. Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). New York, NY: ACM; 2007: 273–82.
41. Cormode G, Procopiuc C, Srivastava D, Shen E, Yu T. Differentially private spatial decompositions. In: 2012 IEEE 28th International Conference on Data Engineering. Piscataway, NJ: IEEE; 2012: 20–31.
42. Li C, Hay M, Rastogi V, Miklau G, McGregor A. Optimizing linear counting queries under differential privacy. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). New York, NY: ACM; 2010: 123–34.
43. Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. Proc VLDB Endow 2012; 5 (6): 514–25.
44. Fan L, Bonomi L, Xiong L, Sunderam VS. Monitoring web browsing behavior with differential privacy. In: Chung C-W, Broder AZ, Shim K, Suel T, eds. 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic Of Korea, April 7-11, 2014. New York, NY: ACM; 2014: 177–88.
45. Fan L, Xiong L. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE Trans Knowl Data Eng 2014; 26 (9): 2094–106.
46. Dwork C, Naor M, Pitassi T, Rothblum GN. Differential privacy under continual observation. In: Proceedings of the Forty-Second ACM Symposium on Theory of Computing. New York, NY: ACM; 2010: 715–24.
47. Chan T-H, Shi E, Song D. Private and continual release of statistics. ACM Trans Inf Syst Secur 2011; 14 (3): 1.
48. Kellaris G, Papadopoulos S, Xiao X, Papadias D. Differentially private event sequences over infinite streams. Proc VLDB Endow 2014; 7 (12): 1155–66.
49. Bolot J, Fawaz N, Muthukrishnan S, Nikolov A, Taft N. Private decayed predicate sums on streams. In: Proceedings of the 16th International Conference on Database Theory. New York, NY: ACM; 2013: 284–95.
50. Bonomi L, Xiong L. On differentially private longest increasing subsequence computation in data stream. Trans Data Priv 2016; 9 (1): 73–100.
51. Chaudhuri K, Monteleoni C, Sarwate A. Differentially private empirical risk minimization. J Mach Learn Res 2011; 12: 1069–109.
52. Ji Z, Jiang X, Wang S, Xiong L, Ohno-Machado L. Differentially private distributed logistic regression using private and public data. BMC Med Genomics 2014; 7 (Suppl 1): S14.
53. Abadi M, Chu A, Goodfellow I, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York, NY: ACM; 2016: 308–18.
54. Nissim K, Steinke T, Wood A, et al. Differential privacy: A primer for a non-technical audience. In: 10th Annual Privacy Law Scholars Conference; June 1–2, 2017; Berkeley, California.
55. Dwork C, Naor M, Reingold O, Rothblum GN. Pure differential privacy for rectangle queries via private partitions. In: International Conference on the Theory and Application of Cryptology and Information Security. New York, NY: Springer; 2015: 735–51.
56. Hay M, Rastogi V, Miklau G, Suciu D. Boosting the accuracy of differentially private histograms through consistency. Proc VLDB Endow 2010; 3 (1–2): 1021–32.
57. Barlow RE, Brunk HD. The isotonic regression problem and its dual. J Am Stat Assoc 1972; 67 (337): 140–7.
58. Fleming TR, O’Fallon JR, O’Brien PC, Harrington DP. Modified Kolmogorov-Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 1980; 36 (4): 607–25.
59. Noone AM, Howlader N, Krapcho M. SEER Cancer Statistics Review, 1975-2015. Bethesda, MD: National Cancer Institute.
60. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered scalable national network for effectiveness research. J Am Med Inform Assoc 2014; 21 (4): 621–6. doi: 10.1136/amiajnl-2014-002751.
61. Chan T-H, Li M, Shi E, Xu W. Differentially private continual monitoring of heavy hitters from distributed streams. In: International Symposium on Privacy Enhancing Technologies Symposium. New York, NY: Springer; 2012: 140–59.
62. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med 1992; 11 (14–15): 1871–9.
63. Amorim L, Cai J. Modelling recurrent events: a tutorial for analysis in epidemiology. Int J Epidemiol 2015; 44 (1): 324–33.
64. Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol 2009; 170 (2): 244–56.