How blockchain can help make healthcare data more useful


Patient health data is the single biggest bottleneck for advancing healthcare

Healthcare is going through a tremendous phase of innovation. Healthcare data is finally coming online due to the government mandate to adopt Electronic Medical Records (EMR). AI and machine learning are advancing rapidly. Soon, precision medicine will finally become reality. Doctors will be able to better diagnose my condition based on data from millions of research papers that no single human could possibly read in a lifetime. They will prescribe me the best treatment available not because it’s “what generally works”, but because millions of other patients who are similar to me in age, sex, and genetic makeup saw a positive impact from the treatment.

While I have no doubt that this is the future, one thing that is often not talked about is that these advancements all require a fundamental ingredient: holistic patient health data at scale. In my opinion, it’s not the lack of algorithms, but the lack of access to useful health data that is the single biggest bottleneck for advancing healthcare treatment. Researchers often find that a simple algorithm with lots of data holds more predictive power than a complicated algorithm with little data. As far as I know, this type of data simply doesn’t exist today for research, because sharing healthcare data is very hard. There are a number of reasons for this:


  1. HIPAA generally allows a healthcare provider to share PHI without patient authorization only for purposes like direct treatment. Research requires large amounts of data, and getting authorization from each patient individually is unrealistic and costly.
  2. HIPAA does allow providers to share data without patient authorization, but only after it’s de-identified. For the uninitiated, there are 2 ways to de-identify data: Safe Harbor and Expert Determination.
    • Safe Harbor is simple. It just requires that the provider remove all 18 types of patient identifiers from the data. They include things like name, zip, SSN, account number, phone, email, etc. The problem with this method is that the utility of the data for research goes down dramatically when those fields are removed.
    • Expert Determination is harder, but more useful. It requires a statistician to analyze the data and determine a set of rules that will render the risk of re-identification very small. In theory, this means the researcher gets data with real utility and the healthcare provider doesn’t have to worry about PHI disclosure. However, as with anything in life, there are trade-offs. The rules for de-identifying data using Expert Determination are very subjective and vary by use case. For example, the risk of re-identification is dramatically higher if the data is released to the public than if it stays in the hands of a few researchers. Domain knowledge is also a big factor. If the researcher also happens to be a doctor in the hospital that is disclosing the de-identified data, then the probability that they’ll be able to re-identify a patient is far greater. These factors make de-identification more of an art than a science at the moment, and very hard to automate.
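
To make the Safe Harbor idea concrete, here is a minimal Python sketch. The field names are hypothetical and cover only a handful of the 18 identifier categories, so treat it as an illustration of the approach rather than a compliant implementation.

```python
from typing import Any

# Assumption: hypothetical field names standing in for a few of the 18
# Safe Harbor identifier categories. A real schema needs a complete mapping.
IDENTIFIER_FIELDS = {"name", "zip", "ssn", "account_number", "phone", "email"}


def safe_harbor_strip(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record without the listed identifier fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}


# Made-up example record, not real patient data.
patient = {
    "name": "Jane Doe",
    "zip": "02139",
    "ssn": "000-00-0000",
    "diagnosis": "type 2 diabetes",
    "hba1c": 7.2,
}
print(safe_harbor_strip(patient))
```

Notice that the zip code disappears along with the name, which is exactly the loss of research utility described above: any analysis that depends on geography is no longer possible with the stripped record.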


On the technology side, the biggest challenge is interoperability. A patient’s information often lives in different systems across different organizations that aren’t affiliated with each other. The systems are silos and don’t talk to each other. As a result, setting up infrastructure to facilitate information exchange securely is very costly.


Data has this unique property that it can be incredibly valuable to one person, but utterly worthless to another. In research, the investigator often doesn’t know how valuable the data is until after the analysis, so it’s very hard to put an economic value on health data. Given the limited IT resources available and the cost of setting up this infrastructure, there are often other projects that offer a bigger and more predictable ROI.

Ultimately, getting more healthcare providers to share more data can be represented in a single formula:

Expected gain = Expected benefit – Expected cost > 0

The higher the expected gain, the more likely a provider is to share data. In the short/medium term, I believe we can significantly drive down the cost of sharing data by using technology to automate most of the de-identification process. Once that infrastructure is in place, it would be much easier to build an exchange that connects people who want access to the data with people who can provide it. More people on the exchange should get more providers to share their data. This should create a virtuous cycle, and hopefully dramatically increase the frequency and scale of data sharing in healthcare.

Proposed solution and high level design

The underlying strategy for reducing transaction cost is simple: automate as much as possible, and minimize the number of parties required in the process.

In the ideal world, a researcher should be able to just formulate a list of desired data for research and send the request to the system. The system would then directly connect him to the appropriate data sources, where he can ask for permission to access that data. Once the data providers approve, the system would then automatically query the database, link the associated data records together and de-identify them before sending the results to the researcher.
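
The linking step can be sketched in a few lines of Python. This is only one possible approach, and it rests on big assumptions that I’m making for illustration: both sources share a common patient identifier (here a hypothetical `patient_id` field), and a secret key (`LINK_KEY`) has been agreed on for the request so that records are matched on a keyed hash rather than on the raw identifier.

```python
import hashlib
import hmac

# Assumption: a secret agreed on for a single request (hypothetical value).
LINK_KEY = b"per-request-secret"


def link_token(patient_id: str) -> str:
    """Keyed hash of an identifier, so matching never uses the raw value."""
    return hmac.new(LINK_KEY, patient_id.encode(), hashlib.sha256).hexdigest()


def link_records(source_a: list[dict], source_b: list[dict]) -> list[dict]:
    """Join two sources on the token and drop the raw identifier."""
    by_token = {link_token(r["patient_id"]): r for r in source_b}
    linked = []
    for rec in source_a:
        match = by_token.get(link_token(rec["patient_id"]))
        if match is not None:
            merged = {**rec, **match}
            del merged["patient_id"]
            linked.append(merged)
    return linked


# Made-up records from two hypothetical sources.
hospital = [{"patient_id": "A17", "diagnosis": "asthma"}]
pharmacy = [
    {"patient_id": "A17", "drug": "albuterol"},
    {"patient_id": "B42", "drug": "statin"},
]
print(link_records(hospital, pharmacy))
```

In practice, unaffiliated organizations rarely share a clean common identifier, so the matching logic would be much messier than this.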

To minimize transaction cost, the system should be set up in a way that no trusted 3rd party is needed to facilitate the data sharing agreement.

To ensure security and increase adoption, no PHI should ever leave the healthcare provider’s firewall. There is the argument that a HIPAA-compliant cloud storage provider is better equipped to set up security and deal with potential malicious attacks. However, I believe making that a requirement of the system would dramatically reduce people’s willingness to participate.

To ensure that the system is robust and grows sustainably over the long term, I also believe that it should be decentralized so that no single entity truly owns the network. This way, there is no single point of failure. Improvements are proposed and discussed, and changes are made through a pre-determined voting process defined by the community.

So to summarize, all we have to do is build a decentralized network that automatically de-identifies and links patient health data across different institutions and data systems. Piece of cake, right?


If this idea had been proposed 3 years ago, I would have said that it’s impossible. In 2016, I don’t think it’s that crazy of an idea. The key innovation that I think will enable this is the advent of blockchain and the concept of smart contracts popularized by Ethereum. I won’t go into too much detail on how blockchain and Ethereum work in this post, but basically, a smart contract is a pre-written program that is stored on a distributed peer-to-peer network and can be executed by other computers. Those programs can also read and write data from/to the network. To get a deeper intro, here is a good link.

Conceptually, think of each data sharing request as a legal contract, with a pre-defined set of rules. Under the data sharing agreement, you most likely have to stipulate the following:

  1. Who are the data provider(s)?
  2. Who are the researcher(s) requesting the data?
  3. What data fields are the researcher(s) requesting from each data provider?
  4. Where are those data stored? (can be encrypted so that only the program knows the actual credential at runtime)
  5. What rules need to be applied to the aggregate dataset to ensure output is de-identified?
  6. Where is the de-identified data going to be stored? (can be encrypted so that only the program knows the actual credential at runtime)
  7. Who has permission to access the de-identified data?
  8. Is this request a one-time request, or can it be executed repeatedly on-demand?

Now imagine all that information is converted into a program that can be predictably executed. When the program executes, it first checks whether the user has permission to access the data. If yes, it accesses the relevant databases, runs the queries, applies the de-identification rules, and lastly pushes the results to the pre-defined storage location on file. All of this can be executed autonomously and securely.
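
Here is a rough Python sketch of that execution flow. It is not smart contract code and not a real design. The class name `DataSharingAgreement`, its fields, and the stand-in query and de-identification functions are all hypothetical, chosen only to mirror the list of stipulations above.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]


@dataclass
class DataSharingAgreement:
    """Hypothetical stand-in for the agreement described in the text."""
    providers: dict[str, Callable[[list[str]], list[Record]]]  # name -> query function
    requested_fields: dict[str, list[str]]                     # name -> requested fields
    deidentify: Callable[[list[Record]], list[Record]]         # rules for the output
    authorized_users: set[str]
    repeatable: bool = False
    executed: bool = False

    def execute(self, user: str, output_store: list[Record]) -> None:
        # 1. Permission check comes first.
        if user not in self.authorized_users:
            raise PermissionError(f"{user} is not authorized")
        # 2. One-time agreements cannot be re-run.
        if self.executed and not self.repeatable:
            raise RuntimeError("one-time agreement already executed")
        # 3. Query each provider for its requested fields.
        rows: list[Record] = []
        for name, query in self.providers.items():
            rows.extend(query(self.requested_fields[name]))
        # 4. De-identify, then push to the pre-defined storage location.
        output_store.extend(self.deidentify(rows))
        self.executed = True


def clinic_query(fields: list[str]) -> list[Record]:
    # Made-up in-memory table standing in for a provider database.
    table = [{"name": "Jane Doe", "age": 54, "hba1c": 7.2}]
    return [{f: row[f] for f in fields} for row in table]


agreement = DataSharingAgreement(
    providers={"clinic": clinic_query},
    requested_fields={"clinic": ["name", "age", "hba1c"]},
    deidentify=lambda rows: [{k: v for k, v in r.items() if k != "name"} for r in rows],
    authorized_users={"researcher_1"},
)
results: list[Record] = []
agreement.execute("researcher_1", results)
print(results)
```

The point of the sketch is the ordering: permission check, query, de-identification, and only then delivery, so raw records never reach the researcher.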

Figure-1 System Components

One of the big benefits of using blockchain is that the smart contract is stored on the blockchain and replicated on every node. This means that no trusted 3rd party is needed to facilitate the transaction. Once it’s set up, it will live forever unless it’s specifically programmed to expire after a certain event/date. The smart contract can be easily peer-reviewed by others. People can also set up programmatic rules so that the program cannot be executed unless everyone approves its logic. Any changes to the program, as well as its execution logs, can also be saved in the blockchain. This way, there’s always a full audit history. This should bring transparency and accountability to the network.
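
To show why an append-only history is useful for auditing, here is a small plain-Python sketch of a hash-chained log plus an "everyone must approve" check. It only illustrates the idea and is not how Ethereum actually stores data. The names `AuditLog` and `can_execute` are mine.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    @staticmethod
    def _digest(prev: str, event: str) -> str:
        return hashlib.sha256(json.dumps([prev, event]).encode()).hexdigest()

    def append(self, event: str) -> None:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append(
            {"prev": prev, "event": event, "hash": self._digest(prev, event)}
        )

    def verify(self) -> bool:
        """Recompute every hash; any edited entry breaks the chain."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != self._digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True


def can_execute(required: set[str], approvals: set[str]) -> bool:
    """The program may run only once every required party has approved."""
    return required <= approvals


log = AuditLog()
log.append("contract created")
log.append("approved by hospital_a")
print(log.verify())                    # True
log.entries[0]["event"] = "tampered"   # simulate someone rewriting history
print(log.verify())                    # False
```

On a real blockchain the chaining is enforced by the network rather than by a single Python object, which is what removes the need to trust whoever holds the log.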

Lastly, by taking advantage of an existing blockchain network, we are automatically building on a highly scalable and decentralized platform. There is no need to recruit miners to help run the blockchain network, and there is no reason to rebuild all the complexities associated with creating the base layer blockchain infrastructure.

Known Challenges

While this all sounds magical from a conceptual level, the devil is always in the details. I don’t yet have a full picture of what the implementation will look like, but I do know that there are several key problems that need to be solved:

  1. Computation on the Ethereum network is extremely expensive, as every full node will run the same contract with the same input. De-identification of data is computationally intensive. There needs to be a way to run these computations off the blockchain. The smart contract should simply be the orchestrator of the system.
  2. Storage on the Ethereum network is also expensive. The output data, if stored on the blockchain, is replicated across every node. The more practical way is to simply have the transaction saved on the blockchain, with a hash reference to an off-site storage location where the actual data is stored. However, with that approach, a 3rd party would presumably still be needed to ensure that the data is stored securely.
  3. Prepping the data for sharing behind the healthcare provider’s firewall is also going to be challenging. Healthcare data is dirty, and often not structured in the most ideal way. Furthermore, different organizations represent the same data in different ways. There has to be a process to extract the necessary data from the source tables, clean/normalize it based on a shared standard, and save it into a separate datastore. This is an extremely costly and manual endeavor, and requires ongoing effort to maintain.
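
The hash reference idea from the second point can be sketched as follows. The dictionary and list below are simple stand-ins I made up for off-site storage and the on-chain transaction record, and the location string is hypothetical.

```python
import hashlib

# Assumption: in-memory stand-ins for the real systems.
off_chain_store: dict[str, bytes] = {}   # off-site storage holding the actual data
ledger: list[dict] = []                  # on-chain record: location + hash only


def publish(dataset: bytes, location: str) -> str:
    """Store the data off-chain and record only its SHA-256 hash on the ledger."""
    digest = hashlib.sha256(dataset).hexdigest()
    off_chain_store[location] = dataset
    ledger.append({"location": location, "sha256": digest})
    return digest


def fetch_verified(entry: dict) -> bytes:
    """Fetch the data and check it still matches the hash on the ledger."""
    data = off_chain_store[entry["location"]]
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError("stored data does not match the recorded hash")
    return data


deidentified = b"age,hba1c\n54,7.2\n"          # made-up de-identified output
publish(deidentified, "storage://example/out.csv")  # hypothetical location
print(fetch_verified(ledger[0]) == deidentified)
```

The hash lets anyone detect tampering with the stored file, but it does nothing to keep the file confidential or available, which is why a 3rd party may still be needed for the storage itself.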

What’s next

I’m sure there are many more issues that I don’t even know that I don’t know. I’m not even sure if healthcare providers would be willing to incorporate a nascent technology like blockchain. However, from everything I have read so far, it feels like the idea itself is fundamentally sound. This is an area that I’m extremely excited to explore more. If this sounds like an interesting problem to you, I would love to chat and exchange thoughts. Just shoot me a note on Twitter at @yilunzh.

