
SCCM Pod-442 Continuous Prediction of Mortality in the PICU: A Recurrent Neural Network Model in a S


As a proof of concept, a recurrent neural network (RNN) model was developed using electronic medical record (EMR) data capable of continuously assessing a child’s risk of mortality throughout an ICU stay as a proxy measure of illness severity. Host Margaret M. Parker, MD, MCCM, is joined by Melissa D. Aczon, PhD, to discuss how the RNN model can process hundreds of input variables contained in a patient’s EMR and integrate them dynamically as measurements become available. The RNN’s high discrimination suggests its potential to provide an accurate, continuous, and real-time assessment of a child in the PICU. (Aczon M, et al. Ped Crit Care Med. 2021;22:519-529) Dr. Aczon is a principal data scientist at Children’s Hospital Los Angeles.


Category: PCCM Podcast


Margaret Parker, MD, MCCM (Host): Hello and welcome to the Society of Critical Care Medicine’s iCritical Care podcast. I’m your host, Dr. Margaret Parker. Today I will be speaking with Melissa D. Aczon, PhD, on the article “Continuous Prediction of Mortality in the PICU: A Recurrent Neural Network Model in a Single-Center Dataset,” published in the June 2021 issue of Pediatric Critical Care Medicine. To access the full article, visit Dr. Aczon is a principal data scientist with the virtual PICU team at Children’s Hospital Los Angeles (CHLA) in California. Welcome, Dr. Aczon. Before we start, do you have any disclosures to report?

Dr. Aczon: No, I don’t. Thank you so much for this opportunity to speak with you.

Dr. Parker: We’re very grateful to have you here today. This work is about using deep learning methods to provide a continuous assessment of a critically ill child. What motivated you to develop the proof-of-concept model?

Dr. Aczon: Critical care requires constant monitoring and evaluations of a patient’s condition throughout the ICU stay. Is it improving? Is it deteriorating? There has been long-time interest in developing tools that can automatically and continuously make these evaluations using data that are available to clinicians at the bedside. So, there already are existing severity-of-illness scores for ICU settings. On the adult side we have Simplified Acute Physiology Score (SAPS) and Acute Physiology and Chronic Health Evaluation (APACHE), and on the pediatric side we have Pediatric Logistic Organ Dysfunction (PELOD). Also in pediatrics, we have Pediatric Index of Mortality (PIM) and Pediatric Risk of Mortality (PRISM), which were developed as population risk of mortality models for benchmarking purposes in quality improvement. Over time, we came to think of risk of mortality as related to—or a proxy for—severity of illness.

Even the scores that were developed to assess severity of illness are validated in terms of how well they predict mortality. There are some common themes of these scoring systems. First, they are static. What I mean by that is they’re not continuous. They look at data from a fixed time window, such as the past 12 hours or the past 24 hours. And they generate a single score, whether it’s an ICU admission or 12 hours later or a day later. Second, they typically use a limited number of variables. In the past several years, studies have tried to repurpose some of these static scores so that they generate a score in some regular manner, whether it’s every hour, every few hours, or daily.

I think this again underscores the desire for a system that continuously and automatically updates. We wanted to build on that body of work, especially the more recent studies that focused on the continuous aspect. While a single score at the 12th or 24th hour of ICU stay provides important information, we wanted a score that can reflect the changing nature of a patient’s condition in the ICU. This is what happens; a child improves or deteriorates over the course of the ICU stay. Can we leverage the increasing availability of electronic medical records (EMRs) and deep learning methods to achieve this purpose? This is what we set out to do.

Dr. Parker: Can you give us some background on deep learning and recurrent neural networks (RNNs) in particular? What made you choose RNNs for your work?

Dr. Aczon: I describe deep learning algorithms and neural networks in particular as the stacked combinations of many, many logistic regressions. Let’s look at what a logistic regression actually does. For example, say that I have measurements at noon for heart rate, systolic and diastolic blood pressures, respiratory rate, temperature, and so forth. For a logistic regression, we use the data we have to determine a set of coefficients that multiply these variables. Maybe I’ll have five times the heart rate, negative 1.5 times systolic, and so forth. We multiply each of these variables with those coefficients and add the resulting terms, then apply what is called the logistic function. I think you see a lot of clinical models that use this kind of modeling, a logistic function applied to a sum of these terms.
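
The logistic regression described above can be sketched in a few lines of Python. This is a minimal illustration, not the model from the paper: the input values, the intercept, and most of the coefficients are made up, and only the 5 and the negative 1.5 echo the spoken example.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(inputs, coefficients, intercept=0.0):
    """Multiply each input by its coefficient, add the terms, apply the logistic function."""
    z = intercept + sum(c * x for c, x in zip(coefficients, inputs))
    return logistic(z)

# Hypothetical, already-normalized noon measurements (not real patient data):
# heart rate, systolic BP, diastolic BP, respiratory rate, temperature
noon = [0.8, -0.4, -0.2, 0.5, 0.1]
# Illustrative coefficients; in practice they are derived from training data.
coefs = [5.0, -1.5, 0.3, 0.7, 0.2]

risk = logistic_regression(noon, coefs, intercept=-4.0)
print(round(risk, 3))  # a value between 0 and 1
```

The output is a single number between 0 and 1, which is why this form is common in clinical risk models.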

Instead of stopping there, we do that again, so we have a second logistic regression with its own set of coefficients, and I can have a third logistic regression, a fourth, and so on. Let’s say I have 20 of these logistic regressions, and each one I’m applying to the inputs that I have. Now I have created 20 new variables. Now I can derive another set of logistic regressions on those new variables that I just created. This is what we call layering. I have my input variables; we can call them the first-layer variables. When I apply 20 logistic regressions on those variables, I have created a second layer of those 20 new variables. I can create another layer of logistic regressions on top of those. I keep building these. And I think that’s what the word “deep” in “deep learning” really refers to. I have a stack of these combinations of logistic regressions. The way I combine these is what the phrase “the architecture of the network” refers to.

I can keep doing that. This process allows deep learning algorithms to combine my input variables in many more different ways than a single logistic regression. Of course, this process gives rise to many more coefficients or weights than the starting number of input variables. This allows the resulting model to capture more complex interactions among inputs than a much simpler model. Then one could ask, Well, I started with 20 input variables, but now I have these hundreds of coefficients. Is that going to overfit? This is a question I hear a lot. Over the past decade, there have been many advances in machine learning to help these models manage hundreds, sometimes thousands, of inputs with many more coefficients to maintain the generalizability of the model that I have built applied to a new set of data.
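
The layering idea can be sketched as follows. This is an illustration under stated assumptions: the layer sizes follow the "20 logistic regressions" example, five made-up inputs stand in for the measurements, and the weights are random and untrained. It shows only how stacking multiplies the number of coefficients, not how a real network is trained or regularized.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Apply one logistic regression per row of `weights` to the same inputs."""
    return [logistic(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def random_layer(n_out, n_in, rng):
    """Untrained placeholder weights; a real model would learn these from data."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
inputs = [0.8, -0.4, -0.2, 0.5, 0.1]   # first-layer variables (made up)
w1, b1 = random_layer(20, 5, rng)      # 20 logistic regressions on the inputs
w2, b2 = random_layer(20, 20, rng)     # 20 more on the 20 new variables
w3, b3 = random_layer(1, 20, rng)      # one final output

hidden1 = layer(inputs, w1, b1)
hidden2 = layer(hidden1, w2, b2)
output = layer(hidden2, w3, b3)[0]

n_coefficients = sum(len(w) * len(w[0]) + len(b)
                     for w, b in [(w1, b1), (w2, b2), (w3, b3)])
print(len(hidden1), len(hidden2), n_coefficients)  # 20 20 561
```

Five inputs already give rise to 561 coefficients in this small stack, which is the growth in weights that prompts the overfitting question.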

Let’s go to RNNs. They are a type of deep learning algorithm that is specifically designed to process sequential data—time series measurements. We think about the measurements that come into the ICU; you see these monitors and you have this time series. They are quite well suited for those types of measurements. RNNs have a built-in mechanism that allows them to retain information from previous time stamps and integrate that information with new measurements when updating predictions. For example, I have my measurements of 12 noon, but instead of just using that information in isolation, when I make my prediction, after I received that information, there’s actually a connection within that model to information that I gathered from earlier in the day, from 11:30, from 10:00, from 8:00, and so forth.

And that’s the term “recurrent” in front of “neural network.” That’s what it refers to, that mechanism. In theory, the RNN can learn temporal trends in the data as it acquires measurements. Again, I think that makes them really well suited to manage these streaming sets of data that come into the unit.
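
The recurrent mechanism can be sketched with a plain (Elman-style) recurrent cell. This is an assumption chosen for brevity: the conversation does not specify the cell type of the published model, and the weights and inputs below are invented. The point is only that the prediction after the latest measurement depends on what came before it.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One update: the new state mixes the new measurement with the previous state."""
    h_new = []
    for i in range(len(h_prev)):
        z = bias[i]
        z += sum(w * xi for w, xi in zip(w_in[i], x))        # new measurement
        z += sum(w * hj for w, hj in zip(w_rec[i], h_prev))  # retained information
        h_new.append(math.tanh(z))
    return h_new

def readout(h, w_out, b_out):
    """Turn the current state into a score between 0 and 1."""
    z = b_out + sum(w * hi for w, hi in zip(w_out, h))
    return 1.0 / (1.0 + math.exp(-z))

def run(sequence, params):
    h = [0.0] * len(params["bias"])
    predictions = []
    for x in sequence:  # one prediction per incoming measurement
        h = rnn_step(x, h, params["w_in"], params["w_rec"], params["bias"])
        predictions.append(readout(h, params["w_out"], params["b_out"]))
    return predictions

# Made-up, untrained weights for a 2-input, 2-state cell.
params = {
    "w_in": [[0.5, -0.3], [0.2, 0.8]],
    "w_rec": [[0.9, 0.0], [0.1, 0.7]],
    "bias": [0.0, 0.0],
    "w_out": [1.0, -1.0],
    "b_out": 0.0,
}
steady = [[0.1, 0.1], [0.1, 0.1], [0.2, 0.1]]
changing = [[0.9, 0.1], [0.9, 0.1], [0.2, 0.1]]  # same final measurement, different history

steady_preds = run(steady, params)
changing_preds = run(changing, params)
print(steady_preds[-1] != changing_preds[-1])  # True
```

Both sequences end with the same measurement, yet the final scores differ because the state carries the earlier history forward.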

Dr. Parker: What is the key difference between this model and other models of severity of illness, such as the PIM2, PRISM3, PELOD, etc.?

Dr. Aczon: They can manage many more variables. So in our case, the RNN model, the proof-of-concept model, uses more than 400 inputs describing vital signs, laboratory results, medications, and interventions. Also, that dynamic integration of time series data I mentioned earlier is a result of that built-in mechanism. PIM, PIM2, PRISM3, and PRISM4 are static and not continuously updating. They make a single prediction. For PRISM3 there’s a 12th- and 24th-hour version. PELOD can update every 24 hours. But each 24-hour period in PELOD is treated independently from the previous 24-hour period.

As I was starting to explain earlier, the RNN model updates whenever a new measurement becomes available. But it does not operate on a fixed-length time window. It has flexibility and, again, there is that dynamic nature to it. Instead of looking in isolation at my measurements at the 12th hour or at 2:00 p.m., I have the ability to retain information. So it is a very dynamic processing of the measurements as they come in, and that continuous dynamic aspect results in a trajectory of scores for an individual child. And it’s changing in time. A child with a higher risk of mortality than another child is more likely to die, i.e., you think he is more likely to die, you have a higher risk of mortality. Even within a single child’s stay in the ICU, that risk of mortality can go up and down. If I see 0.2 increasing suddenly to 0.9, we think that reflects serious deterioration of condition.

Dr. Parker: Can you describe the data and how you used it to develop this model?

Dr. Aczon: We have de-identified EMRs of children admitted to the CHLA PICU from about 2010 to 2019. Before the de-identification process, the EMR data were linked with data that were previously collected for Virtual Pediatric Systems (VPS). Our dataset has more than 12,000 PICU admissions or episodes. The EMR data for each episode included charted measurements of physiologic observations, such as heart rate and temperature; lab results, such as creatinine levels; and the therapies administered, such as how much vasopressor they received and whether they were on ECMO. The VPS data included disposition information: on this particular admission, did the child survive or not survive? We also have demographic information and diagnoses, but diagnoses were not used as inputs. We used them only to help us evaluate our results. The important thing is that we have more than 12,000 episodes, which corresponds to more than 9,000 children. When we divided those episodes into a training set, a validation set, and a holdout test set, we made sure that episodes of a single child were in only one of those sets.

Let me explain what those sets are. It’s very important that, when you develop your model, the dataset you’re using to derive that model is different from the dataset in which you’re actually evaluating its performance. If you don’t do that, you will have what I call leakage and you might get very optimistic performance. You might be fooling yourself into thinking you’re doing really well when you’re really not because if you’re using the same data to measure your performance that you used to develop the model, that will lead to very optimistic results. Again, we have those 12,000+ episodes that we’ve partitioned. So the training set is where we derive the coefficients of the model, as when I talked about those coefficients in front of those logistic regressions.

So we do that and then there’s this validation set. There are also parameters when you are training these deep learning methods—parameters involved in how you train the model. You need to derive those parameters as well. So you go back and forth between the training set and the validation set to optimize those model weights and those other parameters involved with the training. And when I finally have a model that I’m happy with, then we go to that holdout test set to measure the performance of our model.
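
The partitioning described above, where all episodes of one child land in exactly one set, can be sketched like this. The split fractions, the field names `patient_id` and `episode_id`, and the toy data are assumptions for illustration; the conversation does not state the proportions that were used.

```python
import random
from collections import defaultdict

def split_by_patient(episodes, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole patients, not individual episodes, to train/validation/test."""
    patients = sorted({ep["patient_id"] for ep in episodes})
    random.Random(seed).shuffle(patients)
    n_train = int(fractions[0] * len(patients))
    n_valid = int(fractions[1] * len(patients))
    assignment = {}
    for i, pid in enumerate(patients):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_valid:
            assignment[pid] = "validation"
        else:
            assignment[pid] = "test"
    sets = defaultdict(list)
    for ep in episodes:
        sets[assignment[ep["patient_id"]]].append(ep)
    return dict(sets)

# Toy data: 25 episodes from 10 patients, so some patients have repeat admissions.
episodes = [{"episode_id": i, "patient_id": i % 10} for i in range(25)]
sets = split_by_patient(episodes)

patients_in = {name: {ep["patient_id"] for ep in eps} for name, eps in sets.items()}
print({name: len(p) for name, p in sorted(patients_in.items())})
# {'test': 2, 'train': 6, 'validation': 2}
```

Splitting on the patient rather than the episode is what prevents a child's repeat admission in the test set from resembling an admission the model already saw in training.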

Dr. Parker: What challenges did you encounter while developing the model?

Dr. Aczon: I’m sure everyone has heard that data is really, really messy, messy, messy. We invested a significant amount of time trying to deal with that from curating the data and cleaning the data. For example, I was new to clinical and healthcare when I first joined CHLA. I always thought, okay, heart rate. But when we look at the data that we have, we find we’ve got all these sources for heart rate. So, one step is we combine, for example, measurements from different sensors of systolic blood pressure into a single variable that we call systolic blood pressure. That’s the aggregation of these different measurements, and also curation. Talking to the clinicians, when I say messy, we’ll get a measurement and we’ll think, does that look reasonable? There was that back and forth between the data scientists and the clinicians, which I really appreciated.

As a data scientist, if I look at them, they’re just numbers. So I needed context, which is the other great thing about being embedded within the hospital. We could go into the ICU, go on rounds, and understand how those data are collected. When we saw 500 for a heart rate, we knew that was erroneous. So that collaboration between the data scientists and the clinicians and cleaning that data took a lot of time. All of this preprocessing also meant ensuring that the measurements we’re using as input to the model at a particular time are actually available at that time.
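
Two of the curation steps mentioned above, merging several sources into one variable and discarding implausible readings, can be sketched as follows. The source names and the plausibility limits are hypothetical; in practice both would be set together with clinicians.

```python
# Hypothetical source names mapped to one curated variable.
SOURCE_TO_VARIABLE = {
    "hr_monitor": "heart_rate",
    "hr_pulse_oximeter": "heart_rate",
    "hr_manual": "heart_rate",
}
# Illustrative plausibility limits in beats per minute (an assumption).
PLAUSIBLE_RANGE = {"heart_rate": (20, 300)}

def curate(raw_rows):
    """Merge source-specific rows into one variable and drop implausible values."""
    curated = []
    for row in raw_rows:
        variable = SOURCE_TO_VARIABLE.get(row["source"])
        if variable is None:
            continue  # source is not mapped to any curated variable
        low, high = PLAUSIBLE_RANGE[variable]
        if low <= row["value"] <= high:
            curated.append({"time": row["time"], "variable": variable,
                            "value": row["value"]})
    return sorted(curated, key=lambda r: r["time"])

raw = [
    {"time": 2.0, "source": "hr_pulse_oximeter", "value": 128},
    {"time": 1.0, "source": "hr_monitor", "value": 132},
    {"time": 3.0, "source": "hr_monitor", "value": 500},  # erroneous reading
]
clean = curate(raw)
print([r["value"] for r in clean])  # [132, 128]
```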

We are looking at retrospective data. When I have a retrospective dataset and I am trying to make a prediction at a particular time, I want to make sure that I’m not inadvertently using information from the future to make that prediction. Again, that’s leakage. For example, we have this lab result recorded at a particular time, but then we realized that lab result did not come in until much later. It was understanding a lot of these nuances about the data that may be obvious to those who have been in the clinical setting for a long time but not necessarily obvious to us data scientists. We’re still improving on that process. I would say that is really the hardest challenge, as well as formulating the problems.
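
The lab-result example can be made concrete with a small filter. It assumes each measurement carries two hypothetical timestamps, the time it was charted against and the time the result actually became available, both in hours since admission. Filtering on the second one keeps future information out of a retrospective prediction.

```python
def available_at(measurements, prediction_time):
    """Keep only measurements whose result had actually arrived by prediction_time."""
    return [m for m in measurements if m["available_time"] <= prediction_time]

# Toy data: the creatinine sample is charted at hour 6 but the result arrives at hour 9.
measurements = [
    {"name": "heart_rate", "charted_time": 7.5, "available_time": 7.5, "value": 140},
    {"name": "creatinine", "charted_time": 6.0, "available_time": 9.0, "value": 0.6},
]

visible = available_at(measurements, prediction_time=8.0)
print([m["name"] for m in visible])  # ['heart_rate']
```

Filtering on `charted_time` instead would let the hour-8 prediction use a result that nobody at the bedside had yet seen.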

Even now, when we are working with clinicians on other problems, trying to understand the data, do we have the data that can answer the question you are asking? This is something I’ve really come to appreciate and I find very fascinating. You ask a clinical problem, in our case, choosing the risk of mortality, that was what I would call one of the easier targets, because it is easier to define, we have this variable called disposition—survive, not survive—but that’s not always the case for these other clinical problems. It is really understanding the data to make sure that we’re asking the right questions. I think that’s the real challenge.

Applying the machine learning, we were very lucky that the community has developed a lot of the tools and I find it really fascinating when we have collaborations with our clinical team. We always say to them, the machine learning aspect is the easier side, the hard part for us is understanding what’s really going on; are we applying them correctly? It’s the most challenging but it’s also the most fascinating and satisfying for me. Data, understanding it, cleaning it, making sure we’re communicating properly with the clinicians to understand that data.

Dr. Parker: How did you evaluate the model?

Dr. Aczon: We’re trying to build this severity-of-illness assessment, but we developed it as a risk-of-mortality assessment, again with the notion that if I have a higher risk of mortality, then I’m sicker. When we evaluated the predictions, we evaluated in terms of how well they could discriminate between survivors and nonsurvivors. This is a binary classification problem. One of the most common measures for problems like that is the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC, or simply AUC). You could also look at the precision-recall curve and the area under the precision-recall curve (AUC-PR). Those were the primary metrics. When you have a ROC curve or a precision-recall curve, you can look at particular points on them and ask, what was my sensitivity, what was my specificity, my positive predictive value (PPV), my negative predictive value (NPV)? We look at a lot of those things. Stepping back a little bit, as I mentioned before, the RNN is a continuous or dynamic model, so it is making multiple predictions over the course of a single child’s ICU stay. How do you evaluate that? We chose some particular time points.
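
For readers unfamiliar with these metrics, the sketch below computes AUROC directly from its pairwise definition (the fraction of nonsurvivor/survivor pairs in which the nonsurvivor received the higher score), plus sensitivity and specificity at one threshold. The labels and scores are invented toy values, and the 0.5 threshold is an arbitrary choice for illustration.

```python
def auroc(labels, scores):
    """Probability that a random positive case scores higher than a random negative case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 1 = nonsurvivor, 0 = survivor.
labels = [0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.30, 0.35, 0.60, 0.90]

print(auroc(labels, scores))                         # 0.875
print(sensitivity_specificity(labels, scores, 0.5))  # (0.5, 0.75)
```

Each choice of threshold gives one point on the ROC curve; the AUROC summarizes discrimination across all thresholds.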

Let’s look at our predictions. Let’s say we have measurements on admission. Let’s look at how well the RNN did with just that one piece of information that was available when the child was admitted. We’ll look at the metrics that I mentioned earlier, the ROC curve, the AUC, the sensitivity, specificity, at a particular time point, but we looked at multiple time points as well, at admission, we looked specifically at the 12th hour and 24th hour. We’ve concentrated a lot on the 12th hour because we wanted to use PIM, PRISM, and PELOD for reference points. Assessing that risk of mortality, how well could we discriminate between survivors and nonsurvivors? Something like PRISM, the 12th-hour version; we could use the PRISM AUC computed on our test set cohort.
We could use that to judge how well the RNN did in terms of separating the survivors and the nonsurvivors if we look at something like PIM or PRISM. We did that and we also analyzed how well it discriminated within subpopulations in our test set cohort. That test set cohort had more than 2200 episodes. We looked at different age groups, because we wanted to be able to say, whether you’re looking at the entire PICU population, at least as represented in our test set, how well can it discriminate within this particular age group? This is where the diagnosis information came in.
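
Because the model produces a trajectory of scores, comparing it with a 12th-hour static score means picking one prediction per episode. A minimal sketch, assuming each trajectory is a list of (hours since admission, score) pairs and using invented values:

```python
def prediction_at(trajectory, hour):
    """Most recent prediction made at or before `hour`; None if there is none."""
    eligible = [(t, p) for t, p in trajectory if t <= hour]
    if not eligible:
        return None
    return max(eligible, key=lambda tp: tp[0])[1]

# Toy trajectories for two hypothetical episodes.
trajectories = {
    "A": [(0.0, 0.20), (3.5, 0.25), (11.0, 0.90), (14.0, 0.95)],
    "B": [(0.0, 0.10), (13.0, 0.05)],
}

at_admission = {ep: prediction_at(tr, 0.0) for ep, tr in trajectories.items()}
at_hour_12 = {ep: prediction_at(tr, 12.0) for ep, tr in trajectories.items()}
print(at_admission)  # {'A': 0.2, 'B': 0.1}
print(at_hour_12)    # {'A': 0.9, 'B': 0.1}
```

This sketch does not handle episodes that end before the chosen hour; how such stays were treated in the study is not covered in the conversation.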

How well were we able to discriminate between survivors and nonsurvivors among those with neurologic conditions, those with respiratory conditions, and so forth? We found some interesting results. We had our RNN model, and we used PIM, PRISM, and PELOD as reference points to compare with ours. We looked at the 12th-hour predictions of all models, partitioned by diagnosis. Interestingly, we found that all models did really, really well on the neurologic patients and all models performed worse in the respiratory group.

We were talking to the clinicians and realized a couple of interesting things. One, we learned in that respiratory group, it’s a big catchall. There are all kinds of kids. And the other thing is there’s this concept of lead time, but you’re trying to make a prediction for the end of stay because that disposition—survive or not survive—it’s some time in the future. Some kids have short stays, they’re there for a day or so, or even shorter. Some kids are there for weeks. For those long-stay kids, it turns out that it’s harder to predict them, which is to be expected. That lead time is how much further into the future? For the shorter-stay kids, it’s much easier to predict their outcomes. You would expect that my 12th-hour predictions are probably really, really good for kids who are in the ICU for just two days versus those kids who are in the ICU for a month or two months. What we found is that the models did the worst among the respiratory kids. It turns out that the respiratory kids had the longest ICU stays. There are a couple of factors. The much longer lead time into the future that you’re predicting, you’re much farther away from it. Those were some interesting things, these analyses, looking at these different groups to try to understand what was going on, part of the evaluations.

Dr. Parker: Really interesting. What are the limitations of this kind of model?

Dr. Aczon: First is the single-center nature of the data that we used to develop the model and assess it. It’s only from the CHLA PICU. In an earlier study we looked at a model we developed within the PICU. CHLA also has a cardiothoracic ICU (CTICU). We applied that model, developed only on PICU admissions, to CTICU episodes. We could see the drop in performance versus the performance in the PICU. Because of the single-center nature of the data, we have always emphasized that this was a proof of concept. Unlike PIM and PRISM, which were developed on data from multiple institutions, this was developed on just a single institution. We did not expect this model to deploy to other institutions. We used hundreds of variables to test the ability of the RNN to handle those hundreds of inputs.

Many of these variables may not be available at other institutions, or even if they are available, they are not standardized across institutions. All the steps that we did, how we curated and aggregated the variables, the preprocessing steps that we performed on the input variables, reflect the practices of our PICU, where the data were collected. I would like to note that the principles and framework that we described here using the RNN (not its exact inputs but, for example, how we preprocess those inputs, how we partition the data, making sure that we have separate training, validation, and test sets, and how we analyze those results) are generalizable and can be used by other institutions.

Dr. Parker: What are the next steps? How might this model or others similar to it potentially be used in a clinical setting?

Dr. Aczon: One of the things that we’re working on right now is a silent validation of the model. This model was developed on retrospective data. We would like to do a silent validation at the bedside, which requires a lot. It’s not really the machine learning, but many other aspects, such as engineering to make sure that the data pipelines are in place. In terms of using it for a clinical setting, we need to think a lot more about the information that we need to display, for example, how the RNN is reaching its decisions and its predictions and how do we display that information to clinicians in a way that would help them? It’s really not a machine learning problem or a math problem anymore. It’s a much wider collaborative effort involving many clinical disciplines, how it would fit into the clinical flow, and a lot of the engineering aspects, and working with IT, all those things are needed before these types of models can be used in a meaningful way in the unit. I think that’s exciting. At the same time, the data scientists in our virtual PICU team continue to collaborate closely with clinical partners in the ICU to identify problems that are relevant to them.

I’m learning so much more in having those collaborations, which is something I have really come to appreciate. Just when I think I’m beginning to understand a little bit, I keep learning, and it’s such a joy. I don’t know if that’s even the right word, it’s just an amazing opportunity to be in this setting where I’m learning so much from the clinicians, how I look at the data and identifying the problems that are important to them and helping to formulate. How can I use the mathematics? How can I use that to help look at this problem, and maybe improve care for the next child who comes through the ICU. Those collaborations are such a big part of what we’re working toward.

Dr. Parker: That is really great stuff that you’re working on. Do you have any further comments you’d like to make?

Dr. Aczon: No, I just want to say thank you again for this opportunity to address the audience of this podcast. I had a great time talking to you.

Dr. Parker: I’m really glad you could talk with me and I really enjoyed it as well. Thank you very much. We have been talking today with Dr. Melissa Aczon from the Children’s Hospital of Los Angeles in California, about her paper, “Continuous Prediction of Mortality in the PICU: A Recurrent Neural Network Model in a Single-Center Dataset,” published in the June 2021 issue of Pediatric Critical Care Medicine. This concludes another edition of the iCritical Care podcast. For the iCritical Care podcast, I’m Dr. Margaret Parker. Thank you.

Margaret M. Parker, MD, MCCM, is professor emeritus of pediatrics at Stony Brook University in New York, and is the director of the pediatric intensive care unit (ICU) at Stony Brook University Medical Center. She is a former president of the Society of Critical Care Medicine and currently serves as associate editor of Critical Care Medicine and senior associate editor of Pediatric Critical Care Medicine (PCCM).

In her role as associate editor, Dr. Parker conducts interviews with authors of PCCM articles and other pediatric critical care experts. Dr. Parker received her bachelor of science and medical degrees from Brown University. She trained in internal medicine at Roger Williams General Hospital in Providence, Rhode Island, USA, and in critical care at the National Institutes of Health (NIH) in Bethesda, Maryland, USA. She spent 11 years in the Critical Care Medicine Department at the NIH where she was head of the Critical Care Section. In 1991, she accepted a position in the pediatric ICU at Stony Brook University and became the director of the unit, where she served for 27 years.

The iCritical Care podcast is copyrighted material of the Society of Critical Care Medicine.

All rights are reserved. Statements of fact and opinion expressed in this podcast are those of authors and participants, and do not imply an opinion or endorsement on the part of the Society of Critical Care Medicine, its officers, volunteers, or members, or that of the podcast commercial supporter.

Some episodes of the iCritical Care Podcast include a transcript of the episode’s audio. Although the transcription is largely accurate, in some cases it is incomplete or inaccurate due to inaudible passages or transcription errors and should not be treated as an authoritative record.