My guest for this week's Perspectives show is George Hripcsak, professor of biomedical informatics at Columbia and one of six researchers recently funded by Microsoft Research through its Computational Challenges of Genome Wide Association Studies (GWAS) program.
JU: For starters, what is a genome wide association study?
GH: A genome wide association study involves scanning markers across the human genome to find genetic variations associated with certain diseases.
JU: Specifically what's being looked for is single-nucleotide markers, right?
GH: Yes. Now our role in this project is the phenotype. We're trying to address the phenotypic computational challenge. Often it's simple. Someone has diabetes or doesn't. Or two people have it, but one gets complications and the other doesn't.
JU: So by phenotype you mean the expression of diabetes, in this case?
GH: Yes. Often you start with a disease, and some number of patients, often very small, but up to several thousand, plus a control group of patients without the disease.
You study the entire genotype, and you look for which sites on the genome are associated with that disease. Then you look into that site. Now the fact that you may find a certain genetic mutation at that site -- that's not necessarily the cause of the difference between the two sets of patients. The cause may be something near that marker on the genome. So you might sequence that area, looking for other information about what genes are in the area, and so on.
JU: So the computational challenge is one of correlation.
GH: The first step is correlation. But...I'm at the zeroth step. There are other people working on the part I'm describing here. First they'll come up with associations, which is a computational challenge in its own way, because there are a vast number of markers -- a hundred thousand, someday a million -- that you're looking at, to see if they're associated with this trait, diabetes or not diabetes.
Then they need to figure out what proteins are coded at the marker, or near the marker. In order to get to that point you need the phenotype too. As long as it's something simple, like patients with or without diabetes, it may seem like that's the easy part of the experiment.
But as time goes on, and genotyping gets easier and cheaper -- and as we learn to handle patient privacy, that's the other thing that limits the study, we have to be careful about how we collect and store these data -- the hard part is going to be collecting the phenotype.
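[Editor's note: The marker-by-marker association step GH describes in this answer can be illustrated with a toy example. The sketch below is only an illustration, not the project's actual statistical model: it assumes a plain Pearson chi-square test on a 2x2 table per marker and a Bonferroni correction for the number of markers tested. The function names, marker names, and counts are all hypothetical.]

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a, b = cases with / without the variant
    c, d = controls with / without the variant
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty: no evidence either way
    return n * (a * d - b * c) ** 2 / denom

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def associated_markers(counts, alpha=0.05):
    """Return {marker: p} for markers that pass a Bonferroni-corrected threshold.

    counts maps a marker name to (cases_with, cases_without,
    controls_with, controls_without). All names here are illustrative.
    """
    if not counts:
        return {}
    threshold = alpha / len(counts)  # stricter cutoff as more markers are tested
    hits = {}
    for marker, (a, b, c, d) in counts.items():
        p = p_value_1df(chi2_2x2(a, b, c, d))
        if p < threshold:
            hits[marker] = p
    return hits

# Made-up counts: m1 differs sharply between cases and controls, m2 does not.
toy_counts = {"m1": (80, 20, 20, 80), "m2": (50, 50, 50, 50)}
print(associated_markers(toy_counts))
```

[With a hundred thousand or a million markers, the corrected threshold becomes very small, which is one reason the association step is a computational and statistical challenge in its own right.]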
JU: When you say "collecting the phenotype" -- that's clinical observation and description?
GH: Exactly. Imagine that every patient who comes into the hospital is given the option to participate in a trial where, in a secure fashion, their genotype is done, and then their information can be used to discover new things about disease. Some number of patients would agree to that, and then all you have to do is take their blood samples, check the DNA, and genotype it. It's relatively straightforward if you have the money to do it.
Then you have to find out what the phenotype of the patient is. But what questions should you ask? We don't know what diseases we might be studying, or what we might discover. We want to know the whole medical course of this patient: When they've been well, and when they've been sick.
What we have for these patients are their electronic health records. And in the future, with Microsoft HealthVault for example, we have the personal health records. And so the question is, with the patient's permission, can we use those data to come up with a reliable phenotype?
JU: OK. Now I see how this ties into your career history. You've done a lot of work in the area of mining clinical records, using a variety of techniques.
GH: Exactly. So in addition to working on the statistical models to do the genome association, we thought it'd be worth investing in the phenotype part of the problem. We've been working on it for 20 years, and it's harder than most people think.
JU: I wouldn't think it'd be easy, but tell us: What are the challenges unique to mining health records and clinical data?
GH: Two of our collaborators are Rich Smiley and Pamela Flood, both faculty members at Columbia, in anesthesiology. They're studying a specific protein, the beta-2 adrenergic receptor, and they're sampling about 2000 patients. They're just studying two SNPs, it's not a genome wide association, but they need to collect the phenotype on those 2000 patients. It's prohibitive to have a research nurse accurately gather all the information they need to do their study -- and what they're studying is labor, the length of time you spend in pre-term labor and how much pain you experience, and how that's associated with variations on these two sites. It's an enormous amount of work, and it might not be reliable.
But we do have an electronic health record. For each of these patients, the nurse has painstakingly documented his or her observations of the patient. Plus we have monitors, and lab tests, all fed into the electronic health record.
JU: But much of what's there is anecdotal, textual, and narrative, right?
GH: Well, it's a mixture of structured and narrative. So ideally we should be able to generate the phenotype rather than have a person do it. We're trying to do that by computational analysis of the health record.
Of course the health care record is intended for patient care, not for research. Anytime you take an information source intended for one purpose and try to use it for another you have to be careful. It takes a lot of processing and interpretation.
Whether it's structured or narrative data, people use different words to encode things. In a cardiovascular study, is it "chest pain"? Is it "angina"? Is it "coronary artery disease"?
The terminology varies, and often the terms are ambiguous. Someone says a person has diabetes, they probably mean diabetes mellitus, a problem with glucose. But there's a diabetes insipidus which is a completely different disease. All they have in common is that you urinate a lot.
JU: So on the one hand you can try to provide a more structured data collection environment that's aware of these distinctions. And on the other hand you can do a lot of text mining, correlation, and natural language processing.
GH: Yes. But remember, the people who are collecting the data are not interested in your research study. As a nation, we're working on improving our terminology, whether you get there by natural language processing or by having the doctor fill out a template. Either way we want to end up with computable knowledge.
But when the purpose is clinical care, not research, we're always going to wind up with these problems.
Furthermore, there's a reason why we speak in narrative style, and not in templates. It's an efficient means of communication. It may be true that it's best for health care providers -- and for all other human beings -- to speak in narrative language, and to have our systems, as they improve, turn that narrative into something structured.
JU: Using natural language processing to extract structure from narrative is something you've been doing for a long time. What can you say about the progress of the state of that art?
GH: We did a study in 1995 where we had 200 chest x-ray reports, and we had 12 physicians review them. Six were radiologists, the ones who generate the reports, and six were internists, who generally use them. They weren't looking at the images, they were just looking at the reports dictated by the radiologist who did the initial reading.
We wanted to see if we could use natural language processing to say yes or no to a set of questions, like: Is this a report indicative of bacterial pneumonia, or of cancer, or of chronic obstructive pulmonary disease? We had six conditions we were looking for, and we compared the reliability of each doctor's reading to the other eleven, and we compared the computer system's interpretation to all twelve. We found that the computer system was about as accurate as the 12 experts.
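[Editor's note: The comparison GH describes, each doctor against the other eleven and the computer against all twelve, can be sketched in a few lines. The transcript does not say which reliability measure the study used, so the sketch below assumes the simplest one, the fraction of reports on which two readers give the same yes/no answer. The reader names and labels are invented.]

```python
def agreement(x, y):
    """Fraction of reports on which two readers gave the same yes/no answer."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def mean_agreement_with_others(labels, name):
    """Average agreement between one reader and every other reader."""
    others = [v for k, v in labels.items() if k != name]
    return sum(agreement(labels[name], o) for o in others) / len(others)

def system_agreement(labels, system):
    """Average agreement between the system's answers and every human reader."""
    return sum(agreement(system, v) for v in labels.values()) / len(labels)

# Invented answers to one question (1 = yes, 0 = no) for four reports.
readers = {
    "dr_a": [1, 0, 1, 1],
    "dr_b": [1, 0, 0, 1],
    "dr_c": [1, 0, 1, 1],
}
system = [1, 0, 1, 0]

for name in readers:
    print(name, mean_agreement_with_others(readers, name))
print("system", system_agreement(readers, system))
```

[If the system's average agreement falls within the range of the human readers' scores, it is performing about as well as they agree with each other on that question.]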
JU: And what is that system? There are general NLP frameworks, and also domain-specific ones...
GH: This one is a medical system called MedLEE, which Carol Friedman started building back in 1990 or 91. It went into production use in our hospital in 1995, and we've been using it ever since.
JU: And you've been training it as you use it?
GH: Well, improving it. It's not a data-driven system. So as it makes mistakes, we fix it, but it's not a machine learning system.
JU: It's a language understanding system.
GH: Yes. It uses a semantic grammar, it divides all words into classes, so rather than getting into the details of syntax, like noun phrase and verb, it says, this thing is a body part, this is a disease, this is a procedure, this is a medication, and then it has a grammar that has sequences of these classes. It also has some syntactic parsing to figure out negation and things like that, so it's a blended approach. It was used initially for radiology reports, but now it's used for all of medicine.
It was as accurate as humans at answering simple questions. When you get to complex interpretations, it doesn't do as well, but you're still in a situation where a human can't be expected to read a million chest x-ray reports, or discharge summaries. If we can do things that there just isn't the money for people to do, even if the accuracy is a bit lower, that's still useful.
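[Editor's note: To make the semantic-grammar idea concrete, here is a toy sketch. It is not MedLEE's grammar or code; the lexicon, the class names, and the single pattern are invented for illustration. It shows the two ingredients GH mentions: words are mapped to semantic classes rather than parts of speech, and a small grammar matches sequences of those classes, with a simple rule for negation.]

```python
# Hypothetical lexicon: each known word belongs to one semantic class.
LEXICON = {
    "pneumonia": "disease", "effusion": "disease",
    "lung": "bodypart", "lobe": "bodypart",
    "left": "region", "right": "region", "lower": "region", "upper": "region",
    "no": "negation", "without": "negation",
    "in": "connector", "of": "connector",
}

def tag(sentence):
    """Lower-case, split on whitespace, and keep only words in the lexicon."""
    tokens = sentence.lower().replace(".", "").split()
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

def parse(sentence):
    """Match the class pattern: [negation] disease [connector] region* bodypart."""
    tagged = tag(sentence)
    findings = []
    negated = False
    i = 0
    while i < len(tagged):
        word, cls = tagged[i]
        if cls == "negation":
            negated = True          # applies to the next disease term
            i += 1
            continue
        if cls != "disease":
            if cls != "connector":  # connectors keep a pending negation alive
                negated = False
            i += 1
            continue
        finding = {"disease": word, "negated": negated, "location": None}
        negated = False
        i += 1
        # Optional location: [connector] region* bodypart
        j = i
        if j < len(tagged) and tagged[j][1] == "connector":
            j += 1
        location = []
        while j < len(tagged) and tagged[j][1] == "region":
            location.append(tagged[j][0])
            j += 1
        if j < len(tagged) and tagged[j][1] == "bodypart":
            location.append(tagged[j][0])
            finding["location"] = " ".join(location)
            i = j + 1               # consume the location words
        findings.append(finding)
    return findings

print(parse("No pneumonia in the left lower lobe."))
print(parse("Effusion of right lung"))
```

[A real system needs a far larger lexicon, many patterns, and more careful handling of negation scope; the point is only that the output is structured and computable, for example a disease, whether it was negated, and where it is.]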
JU: Given that context, how will this apply to the funded project?
GH: So, I've outlined some challenges. Things are narrative, terminology varies. Another is that data are sometimes wrong. Mistakes can be made in recording information on the chart, and often those are mistakes that the doctor would notice and immediately discount. Or it may be a subtle mistake that isn't important to a human interpreting the case, but could matter for a research trial where you're trying to automatically understand what's in the chart.
Often, there's also missing data. The patient may go for care elsewhere. Or a data value may not have been recorded. Or a test may not have been done. So you don't really have a complete record. If you're doing a clinical trial, you have a lot of money to pay a lot of people to spend a lot of time tracking patients, following up with them, measuring everything that needs to be measured. But if you're just using the combination of electronic health record and personal health record, you have to rely on whatever was collected for that purpose.
JU: It's going to be sparse data, for the foreseeable future.
GH: Exactly. So our challenge is to generate a reliable phenotype from that electronic health record and personal health record. Or, if it's not reliable, to know that there's not enough information in those records to make a determination.
JU: So part of the challenge is to infer what's missing. How can you do that?
GH: Let's say you're trying to study complications of diabetes, and you want to do a genome wide association study on people who've had severe diabetes from the point of view of treating it with insulin, but have had no complications, versus people who have had complications, to see if there's a genetic difference. If you can discover why some people don't have complications, can you develop a drug that mimics that in the other people?
To do that, we want to come up with a phenotype of people who have diabetes severe enough to be treated with insulin, but who don't have complications. And we want to use the electronic health record to identify them. What are the challenges?
Well, what if someone comes here for their diabetes care, because there's an expert in this medical center, but when they have complications, they go to the nearest hospital? My electronic health record doesn't have the data about their complications.
JU: Of course this is the promise, and the holy grail, of federated health records.
GH: Right. But this is just one example of many problems that can come up. When you're trying to identify someone who hasn't had complications, you don't know if you're missing the data, or if they're truly without complications.
How can you figure it out? Well, you can use information theoretic methods to figure out, look, I have enough information such that if this person had complications, I'd know it. If this person has a history and a physical by an internist, or three different internists over the course of 10 years, and none of them ever mentioned a complication of diabetes, then odds are this patient doesn't have a complication.
JU: So you're interpreting the negative space?
GH: Exactly. Whereas another patient, who has diabetes, and disappears for 5 years, and then comes in and has a complete blood count but not a glucose, and then has some minor dermatological procedure, and then disappears for 5 years, and is here now -- I have no reason to think that person doesn't have diabetes complications. All I know is that he or she came in to have a mole removed. I have no information about diabetic complications, for example an ophthalmologic complication.
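[Editor's note: The contrast between the two patients described above can be made concrete with a small calculation. GH mentions information theoretic methods without giving details, so the sketch below substitutes a simple Bayesian update as a stand-in. It assumes each encounter has some chance of mentioning a complication if one exists, that encounters are independent, and that a complication is never recorded when absent. The prior, the per-encounter probabilities, and the cutoff are all invented.]

```python
def posterior_complication(prior, mention_probs):
    """P(complication | no encounter mentioned one).

    prior         -- assumed probability of a complication before looking
    mention_probs -- for each encounter, the assumed probability that it
                     would have recorded the complication if it existed
    """
    miss = 1.0
    for s in mention_probs:
        miss *= (1.0 - s)   # chance that every encounter missed it
    return prior * miss / (prior * miss + (1.0 - prior))

def classify(prior, mention_probs, cutoff=0.05):
    """Label a record 'no complications' only when the silence is informative."""
    p = posterior_complication(prior, mention_probs)
    return "no complications" if p < cutoff else "insufficient data"

# Patient 1: three full history-and-physical exams by internists.
print(classify(0.3, [0.8, 0.8, 0.8]))
# Patient 2: a blood count and a mole removal, neither likely to reveal much.
print(classify(0.3, [0.02, 0.05]))
```

[Under these made-up numbers the first record supports a "no complications" phenotype, while for the second the silence in the chart barely moves the estimate, so the honest answer is that the record does not contain enough information.]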
JU: Electronic health records are moving into the mainstream. You mentioned Microsoft HealthVault, and Google Health was just announced. Most people have yet to encounter these things in their routine interaction with the health care system. I presume that in five years, many will have.
I think a lot of people have the notion that the information that's being collected will be of value, not only clinically but also to research. Your point is: No, not necessarily. So my question is, if you were the czar of electronic health information, what would you like to see happen in order to merge those two goals?
GH: I'd start with a caution. There's a knee-jerk reaction to say that we need to have doctors document more accurately, and more completely. But the problem is that you end up with a big structured template.
What I envision is an intelligent record that produces a summary for clinicians that they can read, correct, and then write their note which is a succinct summary of their thinking.
Now that doesn't answer your question, which was: How does that then get used for research? But I think that to the degree we make documentation efficient in serving health care, it'll also be more accurate for the sake of research.
One thing that can go wrong, for example, is that if you're filling out a record for the sake of billing, you'll have an incentive to use diagnosis codes that optimize billing. Does that then reflect clinical accuracy? And would that then be useful for research?
The important thing is to be grounded in the clinical truth. Put health care first, and then use new computational methods to extract accurate information.
JU: So clinical truth is what the doctor said, in the doctor's own language. Of course there's a lot of shared convention around the terminology.
GH: They learn in medical school, and throughout their professional lives, what to document. Things aren't always called the same, but the nation is working on health care standards in various ways, both for transferring information between systems and for coming up with common vocabularies.
JU: So although many of us would assume that those vocabulary terms need to be fields in a template, you're saying that's not the first and best strategy. You'd like to see that language just used naturally, as doctors speak their narratives, and then we'll harvest what we need out of that.
Do you think natural language processing will get us there?
GH: It's not perfect. We achieved expert-level performance on a simple task. We have less than expert performance -- but not bad performance -- on the more complex task.
JU: How has the system improved since its introduction in the 1990s?
GH: What we've done is expand our breadth. Back then we were doing mainly radiology reports, and now we cover most of medicine. I don't know that the accuracy got better, though.
Modern natural language processing systems often depend on machine learning, and don't have deep linguistic knowledge.
JU: Well, there are both breeds.
GH: In medicine we're seeing more emphasis on statistics than on linguistics, but we believe the right answer is a combination of the two. In our case we've tried some statistical systems too, but our semantic system seems to outperform them.
If you have a specific question, and that's the only one you need to answer, a statistical system is probably the more efficient way to go. What we do is parse the entire report, and spit out everything we can figure out from it.
In the 1995 study our goal was to answer six questions, but the system actually parsed the whole report and reported everything it found, and then among those findings we identified which were indicators of pneumonia.
There are various techniques that you can use that do pretty well on a single question, but that don't do well if you give them an entire history and physical, and say, tell me everything there is to know about the patient. That's what MedLEE is good at.
Systems should make it easy for people to express what they need to express -- in this case, the clinical truth. If it turns out that a super-efficient template model works best, then that's great. It's an empirical study. People will experiment over time, and see what works.
JU: You've also mentioned the compromise approach: summarize, then present for approval or correction.
GH: Yes, but clinicians don't want to stop and correct. So we need to work on presenting a structured format that's useful enough to them to justify that effort.
JU: It's a perennial and vexing problem. In some ways, maybe, one of the grand computational challenges. At the interface between the data collector and the human being, the person is always going to regard the collector as an impediment.
So, when does your project start?
GH: We've already started. For that Rich Smiley and Pamela Flood study in pre-term labor, we're already taking data out of the electronic health record for them to do their associations.
It's nice to have a concrete problem to work on. Over the summer, what we're working on is a generic framework. So, how does the next person and the next person do this? And then we'll be working on putting together a pipeline of tools. You'll still need a person there to process the data, but it won't involve reading every chart.
JU: Well this sounds hopeful. Thanks!
GH: Thank you, Jon.