Data is crucial for health and the life sciences. The foundation of a new Health IT infrastructure consists of highly interoperable platforms that handle the various aspects of health-related data processing in a secure and confidential manner, with patient consent and data privacy. The platforms that will make up the Health IT infrastructure need to be based on interoperable standards that allow ease of adoption by stakeholders in the ecosystem. Data collected by all healthcare agents must be handled with the highest regard for the privacy of the parties concerned.
The urgent need for solutions to the various challenges of Health IT is nowhere clearer than in the issues relating to the handling of citizen data in the recent COVID-19 pandemic. Several proposals were put forward based on the idea of contact tracing using mobile devices belonging to individuals in communities. The basic idea is that by collecting location data (e.g. GPS, Bluetooth) from the mobile devices of healthy individuals and comparing their proximity over time to diagnosed patients, individuals can obtain a rough measure of the probability that they have been exposed to the virus. Such individuals could then be motivated to obtain a laboratory test to confirm any suspicions.
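The proximity-matching idea can be sketched as a toy computation. This is purely illustrative: the coordinates, thresholds and function names are hypothetical, and real contact-tracing proposals typically exchange ephemeral Bluetooth tokens rather than raw coordinates.

```python
from math import hypot

# Hypothetical location pings: (timestamp_minutes, x_meters, y_meters).
PROXIMITY_METERS = 2.0
TIME_WINDOW_MIN = 15

def possibly_exposed(healthy_pings, patient_pings):
    """Return True if any ping of a healthy individual falls within
    PROXIMITY_METERS and TIME_WINDOW_MIN of a diagnosed patient's ping."""
    for (t1, x1, y1) in healthy_pings:
        for (t2, x2, y2) in patient_pings:
            if abs(t1 - t2) <= TIME_WINDOW_MIN and hypot(x1 - x2, y1 - y2) <= PROXIMITY_METERS:
                return True
    return False

alice = [(100, 0.0, 0.0), (200, 50.0, 50.0)]
patient = [(105, 1.0, 1.0)]
print(possibly_exposed(alice, patient))  # True: within 15 min and ~1.4 m
```

The privacy question raised below is precisely about where this matching runs and who sees the raw pings.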
Like many other data-intensive projects in health and the life sciences generally, several questions arise regarding the handling of citizen data, including the location data of mobile devices. Thus, in the case of the various contact-tracing proposals, one of many outstanding issues pertains to the “how” and “where” of the data matching and whether such data processing activities may affect the privacy of citizens in various communities. A further concern would be the social reaction to, and implications of, such revelatory information (e.g. leading to diagnosed patients being outcast). Today consumer confidence in institutions is in decline. Reports regarding data loss, theft and hacking exacerbate this situation. Thus, any notion of establishing a nation-wide contact-tracing program, albeit championed by leading tech companies such as Apple and Google, may be met with some degree of skepticism on the part of the public. One valid concern here is that a program created during an emergency (such as the current COVID-19 pandemic) may continue to be used long after the emergency has ended, thereby leading to potential surveillance abuses. Another concern would be the “ownership” of this kind of mobility data, which could validly be viewed as a new class of digital asset belonging to the individual.
In the current chapter we extend our previous work by exploring the open algorithms paradigm coupled with confidential computing strategies, in the context of addressing the need to preserve the privacy of individuals whose data are used in computations:
Open algorithms for minimizing data movement & data loss: Today society, governments and institutions need data in order to operate. Rather than exporting, copying or moving citizen data as is commonly done today, leading to the proliferation of copies of data files and therefore to a broader attack surface for theft, we explore the open algorithms paradigm that shifts computations to the location of the data repository.
Federated consent and authorizations across institutions: Consent issuance, propagation and retraction remain among the more difficult problems in health due to the complex nature of the relationships involved. Proposed approaches based on federated data systems, such as that outlined by the WEF, must address the question of consent and privacy in order to obtain buy-in from communities at large.
In line with the open algorithms paradigm, we explore the notion of “consent to execute” an algorithm (over data), but without consent to read or copy the data. We believe this approach may be suitable for certain types of health data (e.g. aggregate computations over cohort data). We discuss the current approaches based on the popular authorization-token model.
Protecting data in collaborative computations: In some cases, different data are held by entities who are unable to disclose their data but who wish to collaborate using their respective data. Thus, methods are needed that allow data to be obfuscated or encrypted in such a manner that parties can share their encrypted data for collaborative computations, yielding insights that would not otherwise be possible using disparate data.
Precision medicine is an innovative approach that provides a holistic understanding of a given patient’s health, disease and condition, and provides a means to choose the treatments that would be most effective for that individual. Translating initial successes to a larger scale will require a coordinated and sustained national effort. On this front, the US Precision Medicine Initiative (PMI) was announced by President Obama on 20 January 2015. The PMI sought to move away from the “one-size-fits-all” approach to health care delivery and to instead tailor treatment and prevention strategies to people’s unique characteristics, including environment, lifestyle, and genes. The PMI sought to build a cohort of at least one million participants between 2016 and 2020 for the purpose of creating a resource for researchers working to understand the many factors that influence health and disease. The PMI Cohort Program (PMI-CP) aims to collect and share samples and data from these participants, including data from electronic health records (EHR) and from participants’ devices. The overall goal is to begin building a roadmap for precision medicine in the U.S. by collecting good quality data and samples from a large cohort of participants over several years. The PMI-CP was later renamed the PMI All of Us Research Program (PMI-AURP).
To give a better picture of the magnitude of the PMI proposal: the project seeks to collect very detailed information from a cohort of 1 million volunteers, which would make it the largest longitudinal study in the history of the United States. The volunteer participants must be willing to contribute data freely, generously, regularly, and longitudinally, including (i) agreeing to ongoing accessibility of their electronic health records, (ii) participating in and sharing the results of additional clinical and behavioral assessments, (iii) contributing DNA samples and other biologic specimens, and (iv) participating in mobile health (mHealth) data-gathering activities to collect geospatial and environmental data. These data will be made available for research to academic and commercial researchers and to citizen scientists. The long-term goal is for these findings, once tested and confirmed, to be integrated to improve care and to drive more research.
In order to begin addressing some of the challenges related to starting PMI-CP participant recruitment, a working group was established that issued its report in September 2015. The working group was tasked with addressing the various questions relating to launching the PMI recruitment process. These include issues such as participant recruitment, how to set up the biobank for participant samples, how to create the databases for holding the participant data, the method of obtaining consent, and several IT-infrastructure-related questions. The working group identified a number of roles for entities involved in the project. First, the entities taking on the role of the Biobank will build the PMI-CP biobank and support the collection, analysis, storage and physical distribution of biospecimens. Secondly, the Data and Research Support Center entities will acquire, organize and provide secure access to the PMI-CP data-sets, and provide research support and analysis tools. Third, the Participant Technologies Center entities will enroll patients through direct enrollment and develop, test and maintain the PMI-CP mobile applications, which are one of the key means to enroll volunteers, obtain their consent and communicate with participants. Finally, the Healthcare Provider Organizations (HPO) will engage their patients and enroll participants into the PMI-CP program through regional medical centers and community-based federally qualified HPOs.
The concern regarding data privacy was already called out in the original White House PMI announcement.
Following this announcement, the White House released a trust framework for PMI to ensure that PMI data is appropriately secured and protected. This framework includes principles for both privacy and data security. In February 2016, the White House announced that the Office of the National Coordinator for Health Information Technology (ONC), in collaboration with the National Institute of Standards and Technology (NIST) and the Office for Civil Rights (OCR), would develop a precision medicine-specific guide, following the NIST Cybersecurity Framework. To ensure that data privacy was addressed in the PMI effort, the White House convened an interagency working group in March 2015 with the charge of developing a set of privacy principles for PMI. This group was co-led by the White House Office of Science and Technology Policy, the Department of Health and Human Services Office for Civil Rights, and the National Institutes of Health. As an output, the group produced a Proposed Privacy and Trust Principles for PMI document.
It is crucial to note here that the implementation of the PMI-CP (PMI-AURP) entails significant changes to the relationship between patients and their health care, which are evident in the new demands for patient information from researchers and citizen scientists. Convincing individuals to re-conceptualize the purpose of their health information in this way requires building and maintaining public trust, something that will be difficult given the recent history of data theft in other industry sectors. Several ethical, legal and social issues also come into play.
In order for new Health IT infrastructures to implement provable mechanisms for preserving data privacy, a proper privacy-centric view of data is needed. This privacy-centric view should focus on sharing insights by design (versus exporting data), on the quality and provenance of data, on the privacy protection of data during computations and on the protection of data-at-rest while in storage. Figure 1 illustrates our proposed privacy-centric data architecture.
The architecture in Figure 1 is layered in the classical sense of the Internet’s layered architecture, whereby functions within a lower layer hide complexity details from the layer above. The boundary of each layer is defined through standardized interfaces, which can be implemented in differing manners (e.g. RESTful APIs, RPC, etc.). An upper-layer function accesses services at the next layer down by using these standardized interfaces.
This layered approach provides the advantage of decoupling technologies and leads to highly modular implementations. This, in turn, promotes the development of new solutions for technical problems at a given layer independent of other layers. The use of standardized layer-interfaces ensures that as new technologies are introduced at a given layer, other layers are not impacted (or are only minimally impacted).
Applications layer: Data use at this layer is driven by the specific area of application. For example, an EHR application will be different (and use different types of data) from a clinical-trials application.
Open Algorithms (OPAL) layer: In this layer, the notion of open algorithms (OPAL) comes into play in the sense that it provides a logical boundary separating the client (seeking insights) from the OPAL server and the data providers in the back-end (see Figure 2 in Section 4).
Federation of Encrypted Data (FDE) layer: The purpose of this layer is to federate data that are in encrypted form (e.g. shares, shards, etc.). Here federation means the creation by data providers of a trust network or consortium with the goal of making encrypted data (or shards) available for collaborative privacy-preserving computing among the members of the federation.
Various technical architectures and designs, as well as business models and agreements, need to underlie the federation. Multiple federations may exist. We believe this approach may seed the creation of a future market for encrypted data shards, something that can be enabled by blockchain technology.
Privacy-Preserving Data Computations layer: This layer deals with the complexity of privacy-preserving computing and collaborative confidential computing.
The goal here is that for a given open algorithms scenario and a given application, the “front-end” entities (i.e. users) are shielded from the complexity of “back-end” implementation details (e.g. via the OPAL service) of the chosen privacy-preserving scheme (e.g. homomorphic encryption, MIT Enigma, secure enclaves, etc.).
Decentralized and Distributed Files/Shards layer: This layer deals, among other things, with the problem of the protection of data at rest. Notably this includes the storage and accessibility of “raw” data files, shards/shares and other data-objects in connection with the specific privacy-preserving schemes being employed (e.g. at the layer above the current layer).
Thus, for example, if MIT Enigma is employed then shares of data are dispersed throughout the peer-to-peer blockchain network of nodes. Other schemes may require a separate shares/shards management function to be used, in which case solutions such as IPFS/Filecoin may be used at this layer.
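The decoupling property of the layering can be sketched in a few lines of code. This is a minimal illustration, not part of any real implementation; all class and method names are hypothetical. The point is that the computation layer depends only on the storage layer’s interface, so the storage implementation can be swapped without changing anything above it.

```python
class EncryptedStore:                    # files/shards layer, variant 1
    def fetch(self, key):
        return {"d1": 41, "d2": 43}[key]

class ShardedStore:                      # files/shards layer, variant 2
    def __init__(self):
        # toy 2-way split of each value; a real system would use secret sharing
        self._shards = {"d1": (40, 1), "d2": (20, 23)}
    def fetch(self, key):
        a, b = self._shards[key]
        return a + b                     # reconstruct before returning

class ComputationLayer:                  # privacy-preserving computation layer
    def __init__(self, store):
        self._store = store              # depends only on .fetch(key)
    def mean(self, keys):
        return sum(self._store.fetch(k) for k in keys) / len(keys)

# The upper layer is unaffected by which storage implementation is used.
print(ComputationLayer(EncryptedStore()).mean(["d1", "d2"]))  # 42.0
print(ComputationLayer(ShardedStore()).mean(["d1", "d2"]))    # 42.0
```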
The concept of Open Algorithms (OPAL) evolved from several research projects over the past decade within the Human Dynamics Group at the MIT Media Lab. The general interaction flow among the entities is summarized in Figure 2, with a more detailed discussion available in the literature. The querier (individual or organization) who is seeking insights (e.g. about a data-subject) uses the Client to select one or more algorithms and their intended data (Step 1). The Client delivers the algorithms (or algorithm identifiers) to the OPAL Service (Step 2), which delivers these to the corresponding data providers; each data provider executes the algorithm locally over its data and returns a response. Once these responses have been received by the OPAL Service, it collates the responses, performs additional filtering for PII-leakage prevention and then delivers the safe response to the Client (Step 3).
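The interaction flow can be mocked in a short sketch. All names here (`DataProvider`, `opal_service`, the `avg_age` algorithm identifier) are hypothetical illustrations, not the actual OPAL API; the sketch only shows that algorithm identifiers travel to the data, and only aggregate results travel back.

```python
VETTED = {"avg_age"}  # algorithm identifiers vetted by domain experts

class DataProvider:
    def __init__(self, ages):
        self._ages = ages  # raw data never leaves this provider
    def execute(self, algorithm_id):
        if algorithm_id not in VETTED:
            raise PermissionError("algorithm not vetted")
        return sum(self._ages) / len(self._ages)

def opal_service(algorithm_id, providers):
    # Step 2: deliver the algorithm identifier to each data provider,
    # which executes it locally; only the results travel back.
    responses = [p.execute(algorithm_id) for p in providers]
    # Step 3: collate and return the safe (aggregate) response to the client.
    return sum(responses) / len(responses)

providers = [DataProvider([40, 50, 60]), DataProvider([30, 40, 50])]
print(opal_service("avg_age", providers))  # 45.0
```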
There are a number of fundamental principles underlying the open algorithms paradigm:
Move the algorithm to the data: Instead of pulling data from various repositories into a centralized location for processing, it is the algorithm that should be sent to the data repositories for processing there. The goal here is to share insights instead of sharing raw data.
Data must never leave its repository: Data must never be exported from (or copied from) its repository. This is consistent with the previous principle and enforces that principle. Exceptions to this rule are when the user requests a download of their data, and when there is a legally valid court order to obtain a copy of the data.
Vetted algorithms: Algorithms should be studied, reviewed and vetted by domain experts. The goal here is to provide all entities in the ecosystem with a better understanding and assessment of the quality of algorithms from the perspective of bias, unfairness and other possible unintended/unforeseen side effects.
Default to safe answers: The OPAL Service must place privacy as its main goal. As such, the responses from an OPAL Service to the Client must default to aggregate answers.
If subject-specific algorithms and responses are needed, then explicit consent must be obtained from the affected data subject(s), consistent with and following the GDPR.
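The “default to safe answers” principle can be sketched as a simple policy check. This is an illustrative toy: the `MIN_COHORT` threshold and function name are hypothetical policy parameters, and a production service would apply far more sophisticated disclosure controls.

```python
MIN_COHORT = 5  # hypothetical policy: refuse answers over tiny cohorts

def safe_aggregate(values):
    """Return the mean only when the cohort is large enough to avoid
    re-identifying individuals; otherwise withhold the answer."""
    if len(values) < MIN_COHORT:
        return None  # safe default: no subject-specific leakage
    return sum(values) / len(values)

print(safe_aggregate([61, 58, 64, 70, 67]))  # 64.0
print(safe_aggregate([61, 58]))              # None: cohort too small
```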
There are a number of corollary principles from the above principles that enhance the protection of data and therefore enhance privacy:
Data should always be in an encrypted state: Data must remain encrypted during computation and in storage. The notion here is that in order to protect data repositories from attacks and theft (e.g. theft by an insider), data should never be decrypted. This means that data providers who hold subject data should employ (i) data-at-rest protection solutions for data stores, and (ii) privacy-preserving computation schemes when using the data for algorithm executions.
There are a number of emerging technologies – such as homomorphic encryption and secure multi-party computation – that may provide the future foundations to address this principle. We discuss some of these approaches below.
Decentralized Data Architectures: Data providers should adopt decentralized and distributed data architectures for infrastructure security and resiliency.
Cryptographic techniques such as secret sharing can be applied to data, yielding multiple encrypted “shards” of the data. These shards can in turn be distributed physically across a network of repositories belonging to the same data provider. This approach increases the resiliency of the data provider’s infrastructure because an attacker would need to compromise a minimal number of repositories (N out of M nodes) in the data provider’s network before obtaining access to the data item. This increases the effort required of an attacker and makes the task of attacking considerably harder.
The open algorithm principles also apply to individual personal data stores (PDS), independent of whether the PDS is operated by the individual or by a third-party service provider (e.g. a hosted model). The basic idea is that in order to include individual citizens in the open algorithms ecosystem, they must have sufficient interest, empowerment and incentive to participate. The ecosystem must therefore respect personal data stores as legitimate OPAL data repository end-points. New models for computations across highly distributed personal data repositories need to be developed following the open algorithms principles.
Today the open algorithms principles are being used in research projects at national scale in Senegal and Colombia, by the DataPop Alliance, Imperial College, the authors here at MIT, and the French telecom company Orange. These deployments are supported by the French AFD, Orange, the governments of Colombia and Senegal, and the telcos Sonatel and Telefonica.
One of the challenges in the broad area of Health IT pertains to the management of consent by an individual (e.g. patient) for a particular action to be performed on the individual’s data. The problem can be complex because multiple entities and flows may be involved. For example, the authorization or consent flows may occur with the patient’s proxy, such as the health provider legal entity where the patient’s preferred doctor (primary care physician) is employed. The desired patient data (e.g. imaging files) may be held by a different entity (e.g. a hospital in a different city where the patient last resided), and so on. In more health-specific language, consent means “... the record of a healthcare consumer’s policy choices, which permits or denies identified recipient(s) or recipient role(s) to perform one or more actions within a given policy context, for specific purposes and periods of time” (HL7 FHIR).
In this section, we discuss some of the issues related to federated authorization and consent, and describe a more user-centric (patient-centric) approach to consent management.
The issue of controlling access to multi-user resources has been an important theme since the mid-1960s, with the rise of time-share mainframe computers. Generally, the term access control applies not only to physical access (to the computer systems) but also to system resources (e.g. memory, disk, files, etc.). Notable among the early efforts in the early 1970s was the Multics system. In the context of government and military applications, there was the further issue of access based on a person’s rank or security clearance. Here, the concept of mandatory and discretionary access control in multi-level systems came to the forefront in the form of the Bell and LaPadula Model (BLM).
In the BLM model, access control is defined in terms of subjects possessing different security levels, seeking access to objects (i.e. system resources). Thus, for example, in the BLM model a subject (e.g. user) is permitted to read an object (e.g. file) only if the subject’s security level (e.g. “Top Secret”) is at least as high as the security level of the object (e.g. “Secret”). The notion of roles or capacities was added to this model, leading to the Role-Based Access Control (RBAC) model. Here, as a further refinement of the BLM model, a subject (user) may have multiple roles or capacities within a given organization. Thus, when the subject is seeking access to an object, he or she must indicate the role within which the request is being made. The formal model for RBAC was subsequently defined by NIST in 1992.
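The two models can be sketched side by side. The levels, roles and permissions below are hypothetical examples chosen for illustration; real BLM deployments also enforce a “no write down” rule and RBAC adds sessions, role hierarchies and constraints.

```python
# Bell-LaPadula "no read up": a subject may read an object only if its
# security level is at least as high as the object's.
LEVELS = {"Unclassified": 0, "Secret": 1, "Top Secret": 2}

def blm_can_read(subject_level, object_level):
    return LEVELS[subject_level] >= LEVELS[object_level]

# RBAC: permissions attach to roles, not directly to users; the subject
# states the role under which the request is made.
ROLE_PERMISSIONS = {
    "physician": {"read_record", "write_record"},
    "billing_clerk": {"read_record"},
}

def rbac_allowed(role, action):
    return action in ROLE_PERMISSIONS.get(role, set())

print(blm_can_read("Top Secret", "Secret"))           # True
print(blm_can_read("Secret", "Top Secret"))           # False: no read up
print(rbac_allowed("billing_clerk", "write_record"))  # False
```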
Access control to resources is also a major concern for enterprises and corporations. This need became acute with the widespread adoption of Local Area Network (LAN) technology by enterprise organizations in the 1990s. The same RBAC model applies also to corporate resources attached to the corporate LAN. This problem was often referred to as Authentication, Authorization and Audit (AAA) in the 1990s. Part of the AAA model developed during the 1990s was an abstraction of functions pertaining to deciding access rules, from functions pertaining to enforcing them. Entities which decided on access-rules were denoted as Policy Decision Points (PDP), while entities that enforced these access-rules were denoted as Policy Enforcement Points (PEP).
The policy-based access control model is foundational to many systems deployed within enterprises today. Many solutions, such as Microsoft’s Active Directory (AD), are built on the same model of policy-based access control. In the case of AD, a fairly sophisticated cross-domain architecture was developed which allows an enterprise to logically arrange itself into dozens to hundreds of interior domains (e.g. each department as a different AD group). Permissions and entitlements for subjects (employees) in AD are expressed in a comprehensive Privilege Attribute Certificate (PAC) data structure. Interestingly, the main authentication mechanism within Microsoft AD and many similar products is the MIT Kerberos authentication system (RFC 1510).
In order for authorization architectures in the consumer health space to be able to scale up, an authorization federation among the providers is needed. To place authorization federation in the proper context, we use the classic policy-based resource access control model as our starting point. This is applied to a collection of domains, each representing a distinct data controller (holding personal data of various individuals).
For the current discussion assume there are two domains – Domain 1 and Domain 2 (Figure 3) – and both hold resources associated with an individual, which we refer to as the data subject (or simply subject) following the GDPR definition. The subject as the resource-owner has data located at both Domain 1 and Domain 2. A third party, denoted as the requesting party, seeks access to the subject’s data located in Domain 1 (e.g. to execute an algorithm on the data in Domain 1).
There are at least three (3) goals for a scalable federated authorization model:
Cross-domain policy propagation and enforcement: A subject (resource owner) must be able to set access policies in one domain (at AS2 in Step 1 in Figure 3), and have the policies automatically propagated (Step 2) to all domains (e.g. AS1) in the federation that contain the subject’s resources and have those policies enforced locally by each relevant domain.
Thus, in Figure 3, if the subject sets access policies at AS2 in Domain 2 then enforcement (Step 4) must also occur at RS1 in Domain 1 where the subject’s resources reside.
Decentralization of enforcement: Once an access policy is decided at one policy-decision point (PDP) in one domain, enforcement within all domains in the federation that contain the subject’s data/resources must occur automatically without the subject’s further involvement. Each policy-enforcement point (PEP) – such as RS1 and RS2 – in each relevant domain must operate independently of other PEPs in the same domain or other domains.
Legal trust framework for authorization federation: A legal trust framework must be agreed upon by all domain-owners in the federation, one which defines, among others, the agreed behavior of PDPs and PEPs in propagating access policies and enforcing them.
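The first two goals, set a policy once at one PDP and have it enforced independently by every PEP in the federation, can be sketched as follows. The class and method names are hypothetical, and a real federation would propagate signed policies over authenticated channels under the legal trust framework described above.

```python
class PolicyEnforcementPoint:            # e.g. RS1, RS2 in Figure 3
    def __init__(self):
        self._policies = {}
    def receive_policy(self, subject, allowed_parties):
        # Step 2: policy propagated from the PDP to this domain
        self._policies[subject] = set(allowed_parties)
    def enforce(self, subject, requesting_party):
        # Step 4: local enforcement, independent of any other PEP
        return requesting_party in self._policies.get(subject, set())

class PolicyDecisionPoint:               # e.g. AS2 in Figure 3
    def __init__(self, federation_peps):
        self._peps = federation_peps
    def set_policy(self, subject, allowed_parties):
        # Step 1: the subject sets the policy once, at one PDP
        for pep in self._peps:
            pep.receive_policy(subject, allowed_parties)

rs1, rs2 = PolicyEnforcementPoint(), PolicyEnforcementPoint()
as2 = PolicyDecisionPoint([rs1, rs2])
as2.set_policy("alice", ["research-lab"])
print(rs1.enforce("alice", "research-lab"))  # True: enforced in the other domain too
print(rs2.enforce("alice", "advertiser"))    # False
```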
Recent advances have been made in federated authorization in Health IT using models, flows and constructs that are used elsewhere on the Internet, such as in social media platforms. One popular authorization approach is the OAuth2.0 framework, which is today used in most (if not all) social media platforms. By reusing authorization flows that are already familiar to end-users (e.g. from mobile apps), users (i.e. patients) can adopt the same app behavior flows for health-data-related authorizations.
An important extension to the basic OAuth2.0 framework is the User-Managed Access (UMA) set of profiles for consent over resources (e.g. files, data, service endpoints, etc.). The goal of UMA is to provide individual-centric control over “resources” (e.g. personal data, algorithms, assertions) that may be distributed across multiple locations, each of which employs a resource server. The basic idea of UMA is that the data-subject, as the resource owner (RO), sets access policies at one authorization provider entity (the AS in Figure 3), and the access policies are propagated automatically to all resource servers (the RS in Figure 3) that hold resources (i.e. data) belonging to the data-subject, to be enforced by each of the resource servers independently. When a requesting party (RqP) seeks access to a given resource protected by a resource server, the requesting party must first obtain an authorization token from the authorization provider (AS) and deliver it to the resource server with its access request. The resource server (as the policy-enforcement point) can then evaluate the token that was issued by the AS. Health-specific profiles of OAuth2.0 and UMA have begun to be developed for the Health IT sector.
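The token leg of the UMA flow, AS issues a token, RS verifies it before releasing the resource, can be sketched as follows. This is a stand-in, not the UMA wire format: real deployments use signed OAuth2.0/UMA tokens (e.g. JWTs) issued after policy evaluation, whereas here a shared-key HMAC and all names are hypothetical.

```python
import hmac, hashlib

AS_KEY = b"shared-secret-between-AS-and-RS"   # hypothetical trust setup

def as_issue_token(requesting_party, resource_id):
    """AS issues a token binding the requesting party to a resource."""
    claim = f"{requesting_party}:{resource_id}".encode()
    tag = hmac.new(AS_KEY, claim, hashlib.sha256).hexdigest()
    return claim, tag

def rs_verify(claim, tag, resource_id):
    """RS (the policy-enforcement point) checks the token's integrity
    and that it actually covers the requested resource."""
    expected = hmac.new(AS_KEY, claim, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag) and claim.decode().endswith(resource_id)

claim, tag = as_issue_token("research-app", "ehr-record-42")
print(rs_verify(claim, tag, "ehr-record-42"))       # True
print(rs_verify(claim, "forged", "ehr-record-42"))  # False
```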
In the context of the PMI initiative discussed previously in Section 2, the ONC has developed and promoted a number of standard authorization flows and APIs in support of use-cases such as the PMI. One such advancement is the Sync for Science (S4S) API, which is based on the OAuth2.0 authorization framework and the Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) standard (Draft Standard for Trial Use 2). S4S was created for app developers following the SMART App Authorization Guide (Substitutable Medical Apps, Reusable Technology). This provides electronic access control mechanisms based on rules set to enforce a healthcare provider’s organizational security policy. The S4S API uses OAuth2.0 flows to allow a designated third-party app to have electronic, read-only access to all or a portion of the health information about an individual, made available through a healthcare provider’s EHR patient portal, via the individual’s existing authentication credentials (e.g., username and password). The S4S API was developed with the intention that it can be used for other third-party apps, including those for medical research.
In this section we provide a brief overview of the various privacy-preserving computing paradigms that may be applicable for the use-cases of data in Health IT. These paradigms and their respective implementations may be relevant for the different Health IT infrastructures that deal with different types of data. Not every paradigm may be suitable for a given data type. For example, digital imaging data (e.g. X-ray files) may not be an appropriate match for models geared towards aggregate computations, although imaging files could be analyzed or compared for differences. We focus on collaborative computation efforts that presume plaintext data located at separate data providers.
One of the fundamental concepts in early cryptography research was secret sharing, pioneered by Adi Shamir. In this landmark paper Shamir asked how a secret piece of data could be “encrypted” into multiple parts in such a manner that only a subset of the parts would be needed to reconstruct the original secret. This notion subsequently became known as threshold secret sharing. Thus, in a given threshold secret sharing scheme the secret data is encrypted into M shares in such a way that a minimum threshold of N shares is required to reconstruct the secret.
The key feature here is that any combination of at least N unique shares suffices for reconstruction. This allows the shares to be distributed across different physical locations, making it more difficult for attackers to compromise the system because an attacker would need to compromise at least N separate computer systems. As we will see, this feature will be core to the MIT Enigma design discussed below.
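A toy version of Shamir’s (N, M) threshold scheme makes the mechanics concrete: the secret is the constant term of a random polynomial of degree N-1 over a prime field, each share is one point on the polynomial, and any N shares reconstruct the secret by Lagrange interpolation at zero. The prime and parameters below are illustrative only; real systems use large primes and audited libraries.

```python
import random

P = 2_147_483_647  # a Mersenne prime, large enough for this toy example

def split(secret, n_threshold, m_shares):
    """Split `secret` into m_shares points; any n_threshold reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(n_threshold - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, m_shares + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # modular inverse
    return secret

shares = split(secret=1234, n_threshold=3, m_shares=5)
print(reconstruct(shares[:3]))   # 1234, from the first 3 of the 5 shares
print(reconstruct(shares[2:]))   # 1234, from a different subset of 3
```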
The area of multi-party computation (MPC) focuses on cryptographic schemes that provide a way for a group of entities to perform “collaborative computation” or joint computations among themselves, where some of the participants are assumed to be competitors and therefore may be dishonest in their computations (hence the usual honest-majority assumption). As such, the goal of MPC schemes generally is to provide a way for these entities to “encrypt” their data such that some limited computations are still possible using the encrypted data. Within a group of participating entities (e.g. health data providers), each entity has to “prepare” its data by encrypting it using the MPC cryptographic parameters agreed upon by the group. The entities then exchange these encrypted data or “shares” among themselves. Thus, none of the entities is expected to reveal its plaintext data to the others. The goal is for all the participants to obtain the same output result in the face of a possible dishonest minority of entities in the group. Over the years several MPC schemes have been proposed (e.g. Garbled Circuits, Fairplay, SPDZ, ShareMind, and others).
In the context of Health IT, a number of MPC schemes can be used for a more collaborative style of computing where the identities of the entities are known and all entities are assumed to be honest. Thus, here MPC is used for privacy preservation instead of for the more competitive use-cases. For example, a group of hospitals located in a municipality or province, each possessing private health data of citizens, could jointly compute aggregate statistics regarding the population of the municipality or province. For instance, together they could compute the average age of patients with some illness (e.g. cancer) without ever disclosing the plaintext data about these patients. The MPC process “encrypts” the data into shares, which are then exchanged among the hospitals.
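The hospital example can be sketched with additive secret sharing, one of the simplest MPC building blocks. All values and names are hypothetical, and this is a sketch of the idea rather than a hardened protocol: each hospital splits its private total into random shares that individually reveal nothing, the hospitals exchange shares, and only the aggregate is ever reconstructed.

```python
import random

P = 2_147_483_647  # prime modulus for the shares

def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Each hospital's private total age of its cancer patients (never shared)
private_values = {"A": 5400, "B": 6100, "C": 4880}   # hypothetical
n = len(private_values)

# Each hospital sends one share of its value to every hospital
received = {h: [] for h in private_values}
for h, v in private_values.items():
    for dest, share in zip(received, make_shares(v, n)):
        received[dest].append(share)

# Each hospital publishes only the sum of the shares it received
partial_sums = [sum(shares) % P for shares in received.values()]
total = sum(partial_sums) % P
print(total)        # 16380: the joint total, with no plaintext disclosed
print(total / 300)  # 54.6: average over a hypothetical 300 patients
```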
Figure 4 provides a high-level illustration of the use of MPC schemes as the back-end to open algorithms. As in the previous Figure 2, the querier (requesting party) employs the client to interact with the OPAL server. The OPAL server provides the computation challenge based on the algorithm to the three Data Providers A, B and C (Step 2). After the three Data Providers complete their joint MPC calculations (Step 3), they each return the answer they calculated to the OPAL server over a secure channel.
The MIT Enigma paradigm explores new MPC and secret-sharing configurations by using nodes on a blockchain system. The basic idea is to employ the computational horsepower of the decentralized nodes (e.g. mining nodes) to perform the joint MPC computations, where the computation task itself is dispersed across multiple nodes. Thus, the blockchain nodes serve as decentralized storage of the encrypted “shares” (shards) of the original data, and the nodes also perform the computations. In this model, a data provider (data owner) may not even need to keep a local copy of its data. Knowing the location/identity of the blockchain nodes that hold its shards, the data provider can at any time fetch these shards from the relevant nodes and reconstruct the data for internal use. The nodes can be within a permissionless (public) blockchain network or a private permissioned blockchain network. One possible business model for the permissionless case is for the blockchain nodes to charge fees for the storage and processing of these shards.
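The disperse-then-fetch storage model can be sketched as follows. For brevity this toy uses XOR-based N-of-N splitting, whereas Enigma itself uses linear secret sharing with an N-of-M threshold; the node model and all names are hypothetical.

```python
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_xor(data, n_nodes):
    """Split data into n_nodes shares; ALL shares are needed to rebuild."""
    shares = [secrets.token_bytes(len(data)) for _ in range(n_nodes - 1)]
    last = data
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def reconstruct_xor(shares):
    out = shares[0]
    for s in shares[1:]:
        out = xor_bytes(out, s)
    return out

# "Nodes" on the network are modeled as a dict from node-id to stored share.
record = b"patient-record-001"
nodes = dict(enumerate(split_xor(record, 5)))

# The provider keeps no local copy; later it fetches the shares and rebuilds.
fetched = [nodes[i] for i in range(5)]
print(reconstruct_xor(fetched) == record)  # True
```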
There are several features of Enigma that make it a ground-breaking proposition in the context of data privacy and decentralized computation:
Data is sharded using a multi-party scheme: Following the classical secret-sharing and MPC models, a given data-unit D is “split” using a linear secret-sharing scheme, resulting in, say, M shares. Only N shares are needed to reconstruct the original data-unit D (where N ≤ M).
Shards are distributed across several nodes: Rather than locating all M shares in one centralized database, the shares are dispersed across multiple nodes on a blockchain P2P network. This dispersal strengthens the security of the overall scheme because an attacker would need to compromise at least N distinct nodes (assuming the attacker can locate the correct nodes).
This is shown in Figure 5, where entity A, for example, disperses its shares to the blockchain nodes denoted as 4(a). Entity B disperses its shares to nodes 4(b), and entity C to nodes 4(c).
Decentralized computing by groups of nodes: When a joint computation needs to be performed over the data-units (i.e. shards), the nodes on the blockchain holding these shards perform the computation in a decentralized manner.
Using the simple example in Figure 5, we assume entities A, B and C (e.g. hospitals) own data-units D1, D2 and D3 respectively (e.g. cancer patient data). If the entities wish to perform a joint computation over the data (e.g. compute the average of D1, D2 and D3), then each entity must notify its respective nodes to begin engaging with the other nodes in solving the MPC computation. Thus, in Figure 5, the nodes at 4(a), 4(b) and 4(c) exchange the relevant shards that they collectively require to perform the MPC computation.
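The sharding behaviour described above, where any N of the M dispersed shares suffice to reconstruct a data-unit, can be sketched with Shamir's classical secret-sharing scheme (Enigma's actual construction may differ in detail; this is a minimal illustration):

```python
import random

PRIME = 2**61 - 1  # prime field for polynomial arithmetic

def make_shares(secret, n_needed, m_total):
    """Shamir sharing: a random degree-(n_needed-1) polynomial with the
    secret as its constant term, evaluated at m_total distinct points."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(n_needed - 1)]
    def poly(x):
        acc = 0
        for c in reversed(coeffs):          # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, poly(x)) for x in range(1, m_total + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over any n_needed shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# Split a data-unit into M = 5 shards, any N = 3 of which reconstruct it.
shards = make_shares(secret=42, n_needed=3, m_total=5)
assert reconstruct(shards[:3]) == 42   # first three shards suffice
assert reconstruct(shards[2:5]) == 42  # so do any other three
```

An attacker who compromises fewer than N nodes learns nothing about the data-unit, which is the security property the dispersal in Figure 5 relies upon.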
It is worth noting that in Figure 5 the “front end” remains the OPAL configuration, in that the querier at the client may not even be aware that the data have been sharded and dispersed across nodes, and that the computations are performed by those nodes.
An alternative approach for data privacy in computation is to employ a special hardware “black box” that prevents unauthorized access to the data while it is being processed outside the data owner’s environment. The hardware essentially provides security assistance to the applications processing the data. This hardware assistance consists of processor extensions that establish a secure enclave on the platform, providing a protected execution environment. For example, the protected execution environment could be an area in the computer’s memory that is shielded from interference by other processes on the same computer or from attacks by an external entity. To prevent unauthorized processes from accessing the protected memory, the access policies are also enforced in hardware. This permits a degree of “self-protection” on the part of the computer, which retains its own integrity and protects the confidentiality of the data in protected memory. Examples of secure-enclave hardware include the Software Guard Extensions (SGX) from Intel Corp., TrustZone from ARM Ltd. and MIT Sanctum. Figure 6 provides an illustration of the use of secure enclaves in the context of open algorithms.
Figure 6 provides an overview of the steps involved in using secure enclaves. In essence, the secure enclave provides both sides with a guarantee of privacy for the algorithms and the data while they are outside their owners’ environments. This approach is attractive for cases where the algorithm owner or author does not wish the details of the algorithm (e.g. expressed in an analytics language such as the R language) to be accessible to the Data Provider (see the previous Figure 2). Correspondingly, a Data Provider may be prohibited from exporting data from its repository and making it available to external entities. In this case, the Data Provider must encrypt the data for the target secure enclave in such a way that the data is decipherable only by the secure enclave within its protected memory. Once in protected memory, the enclave can apply the deciphered algorithm to the deciphered data.
In Figure 6, when the querier on the client seeks to have an algorithm executed on the data set held by the Data Provider, it sends a request to the OPAL server as before (Steps 1 and 2). In response, the OPAL Server loads the algorithm into the secure enclave (e.g. in the cloud) in Step 3. If confidentiality is needed for the algorithm (i.e. the algorithm contains proprietary information), the OPAL Server has the option to encrypt the algorithm prior to delivering it to the secure enclave. After the Data Provider encrypts the relevant data for the target secure enclave, it delivers the encrypted data to the enclave (Step 4). The enclave deciphers the data within its protected memory and executes the algorithm (from Step 3) on the data (Step 5). The secure enclave outputs the result to the OPAL Server in Step 6, which provides the results to the client.
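The Data Provider's sealing step can be sketched as authenticated encryption under a key that, in a real deployment, would be established with the enclave via remote attestation. The toy SHA-256-based stream cipher below is a stand-in for a real primitive such as AES-GCM and is for illustration only, not production use:

```python
import hashlib, hmac, os

def _keystream(key, nonce, length):
    """Toy SHA-256 counter-mode keystream (stand-in for AES-GCM)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def seal_for_enclave(enclave_key, plaintext):
    """Encrypt-then-MAC so only the holder of enclave_key can recover the data."""
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(enclave_key, nonce, len(plaintext))))
    tag = hmac.new(enclave_key, nonce + ct, hashlib.sha256).digest()
    return nonce, ct, tag

def open_in_enclave(enclave_key, nonce, ct, tag):
    """Inside the enclave: verify integrity, then decrypt into protected memory."""
    assert hmac.compare_digest(tag, hmac.new(enclave_key, nonce + ct, hashlib.sha256).digest())
    return bytes(a ^ b for a, b in zip(ct, _keystream(enclave_key, nonce, len(ct))))

key = os.urandom(32)  # in practice: negotiated via the enclave's remote attestation
sealed = seal_for_enclave(key, b"patient dataset")
assert open_in_enclave(key, *sealed) == b"patient dataset"
```

The essential property is that the data is only ever in plaintext inside the enclave's protected memory; on the wire and in the Data Provider's outbound channel it remains ciphertext.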
OPAL and confidential computing have the potential to provide valuable, federated insights on hospital data, as well as new revenue streams for hospitals sitting on large datasets. Accounting for the aforementioned limitations in data formatting, these methods of privacy-preserving computation could provide insights into optimized hospital management and patient treatment, and even general discoveries about a disease, without violating the privacy of the patient.
As a proof of concept, Secure AI Labs (SAIL) worked with a pharmaceutical company to reproduce multi-omic association results of microbiomes across four different hospital populations in a privacy-preserving framework based on OPAL in secure enclaves. The study was led by a principal investigator at a pharmaceutical company in Boston who provided a genomic data set based on stool samples coming from four separate hospitals. Each patient in the data set is associated with genomic data and a diagnosis phenotype: UC (ulcerative colitis), CD (Crohn’s disease), or non-IBD (non-inflammatory bowel disease). SAIL normalized the analysis across the four hospitals and verified that the distributions were similar enough for comparison by running a principal component analysis (PCA). Once comparability was confirmed, a Wilcoxon signed-rank test was run to determine which microbes are most significant for UC, CD, and IBD.
Across the four hospitals, a differentially private, federated PCA was run on the genomic dataset. After appropriate data normalization, variance-covariance matrices were computed for each dataset within four sub-enclaves (hardware-based secure enclaves), one per hospital. Prior to exporting the matrices from the hospital environments, differential-privacy noise was added to further ensure security. This noise is a matrix sampled from a multivariate normal distribution with its standard deviation scaled relative to the size of the data (Figure 7).
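A minimal sketch of this per-hospital step, assuming Gaussian noise whose standard deviation scales inversely with the dataset size (the study's exact noise calibration is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_covariance(X, noise_scale):
    """Per-hospital step: variance-covariance matrix plus Gaussian noise
    whose standard deviation is scaled relative to the data size."""
    n = X.shape[0]
    cov = np.cov(X, rowvar=False)
    noise = rng.normal(0.0, noise_scale / n, size=cov.shape)
    # Symmetrize the noise so the exported matrix remains symmetric.
    return cov + (noise + noise.T) / 2

X = rng.normal(size=(500, 4))  # hypothetical normalized hospital dataset
C = noisy_covariance(X, noise_scale=1.0)
assert C.shape == (4, 4)
assert np.allclose(C, C.T)
```

Only this noisy matrix leaves the hospital's sub-enclave, never the patient-level rows of `X`.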
The hospitals’ variance-covariance matrices were gathered in a central enclave, where they were summed and primed for PCA on the combined variance-covariance matrix. Finally, dimensionality-reduced principal components were gathered for each diagnosis.
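The central aggregation step can be sketched as combining the (noisy) matrices and eigendecomposing the result; the four hospital matrices below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical variance-covariance matrices received from four hospitals.
hospital_covs = [np.cov(rng.normal(size=(300, 5)), rowvar=False) for _ in range(4)]

# Central enclave: average the matrices and eigendecompose for PCA.
combined = sum(hospital_covs) / len(hospital_covs)
eigvals, eigvecs = np.linalg.eigh(combined)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
components = eigvecs[:, order[:3]]         # top three principal components

assert components.shape == (5, 3)
```

Projecting each hospital's (locally held) data onto `components` then yields the dimensionality-reduced coordinates per diagnosis.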
After plotting the three principal components in both the federated and unfederated settings, we note a high similarity in the diagnosis principal-component distributions regardless of the machine-learning setting (federated vs. unfederated), which means that using federated learning or OPAL architectures preserves the accuracy of the PCA results while also preserving privacy. Both results imply the strong conclusion that there is neither technical nor biological bias. With confirmation from the PCA results that there was no bias between the four hospital datasets, we know the datasets are comparable for further analysis (see Figure 8).
Researchers at the pharmaceutical company wanted to further understand which microbes were most important to which specific disease. SAIL implemented a Wilcoxon signed-rank test in both the traditional and SAIL environments. Not only were the ranking results comparable (see Figure 8), but the speed penalties were also within reasonable tolerance (10 minutes instead of 3 minutes).
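A sketch of such a microbe-level test, using a NumPy normal approximation to the Wilcoxon rank-sum statistic (the paired signed-rank variant applies instead when samples are matched; the abundance data below are simulated, not from the study):

```python
import numpy as np

rng = np.random.default_rng(2)

def rank_sum_test(x, y):
    """Normal approximation to the Wilcoxon rank-sum statistic
    (continuous data assumed; no tie correction)."""
    combined = np.concatenate([x, y])
    ranks = combined.argsort().argsort() + 1   # ranks starting at 1
    w = ranks[: len(x)].sum()                  # rank sum of the first sample
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    std = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / std                    # z-score; large |z| => significant

# Hypothetical relative abundances of one microbe: UC patients vs. non-IBD controls.
uc = rng.lognormal(0.8, 1.0, size=50)
control = rng.lognormal(0.0, 1.0, size=50)
z = rank_sum_test(uc, control)
```

Ranking microbes by |z| across groups reproduces the kind of "most significant microbe" ordering compared in Figure 8.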
Ultimately, the conclusion of this proof of concept is that the SAIL platform is well suited for distributed data assessment (PCA) and analysis (rank sums for microbes of interest). The extension stretches beyond microbiome analysis into behavioral, multi-omics, chemical, eQTL, and other clinical interests in the pharmaceutical space where timely access to more data is essential to furthering research. Data-accessibility challenges in silo searching, internal review boards (IRBs), anonymization, and transportation can take anywhere between 20 and 30% of a project’s timeline because of restrictions in privacy, security, and compliance. The vision of the SAIL platform is to streamline this process with a platform that makes privacy, security, and compliance seamless at the software level, finally closing the gap between written contracts and legal agreements and the actual execution of code on the platform.
Genomics-based personalized medicine began more than ten years ago. Genetic big data has shown promise in conducting breast cancer studies, building the cancer genome atlas (TCGA), and improving screening and diagnosis. Many recent studies have shown promising results from applying advanced machine learning and artificial intelligence (AI) technologies to genotypic and phenotypic big data. Using large amounts of federated genetic and health data to train AI models, and using these models to predict diseases, drug responses, and personality traits, will allow for great advancements benefiting human health.
At the same time, DNA sequencing has become cheaper, better, and faster in recent years. As more people get their DNA sequenced for disease diagnoses or plain curiosity (e.g. finding ancestry), many governments, non-profit organizations, and commercial companies have built up their genetic databases in recent years, and covet as well as protect the value of this data.
With a growing number of genetic testing companies, hospitals, and pharmaceutical firms and an increasing interest in the contribution of genetic data by individuals, there is a great need for a marketplace for genetic and health data to align the interests of stakeholders big and small.
However, the availability of high-quality genetic and phenotypic data remains the bottleneck of personalized health care. Existing databases are silos that cannot scale. Collecting useful data from different individuals, hospitals, biobanks, or other organizations is difficult, not only because of the complexity of genetic and medical data, but also because of the highly sensitive private information in the data. Regulations in both the U.S. and the EU make it harder for these organizations to meet the big-data requirements of fast-developing AI training applications.
As a proof of concept, SAIL worked with bioinformaticians at a premier research institute to test differential privacy on a computational biology use case: linking single nucleotide polymorphisms (SNPs) to genes driving Alzheimer’s disease. Open data sharing is one of the biggest bottlenecks to discovery in healthcare, and in a diagnosis with over 148 failed drugs there is a large opportunity to learn through shared data. To address this, SAIL designed a system that protects both data and algorithms by using differentially private (DP) machine learning.
A simple linear regression was used to predict gene expression from single nucleotide polymorphism (SNP) samples. SAIL therefore implemented a differentially private simple linear regression and compared the results to the non-private results, using TensorFlow.
The first part of the project consists of simulating the SNP data and gene expression to understand how a simple linear regression with differential privacy behaves. To do so, a random matrix with entries sampled from {0, 1, 2} was produced and correlated with gene expressions (the mathematics behind this is elaborated in the code, via the DataLoader function).
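A minimal sketch of such a simulation (the actual DataLoader function is not reproduced here; the weights and noise level below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

n_samples, n_snps = 200, 10

# Genotypes coded as 0, 1 or 2 copies of the minor allele (hypothetical data).
snps = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)

# Gene expression simulated as a linear combination of the SNPs plus noise,
# giving a known ground truth for the regression to recover.
true_weights = rng.normal(size=n_snps)
expression = snps @ true_weights + rng.normal(scale=0.1, size=n_samples)

assert expression.shape == (n_samples,)
```

Because the true weights are known, the DP and non-DP regressions can later be judged against the same ground truth.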
The differential-privacy optimizer was set with a gradient-clipping parameter and a noise-multiplier parameter. The former is used to avoid the gradient divergence that could occur due to the noise added at each iteration, and the latter controls the amount of Gaussian noise added to the gradient.
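The mechanism behind these two parameters can be sketched in a minimal NumPy implementation of DP-SGD for linear regression (the study used TensorFlow; the parameter values here mirror the text, while the clip norm and noise multiplier are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def dp_sgd_linear_regression(X, y, clip_norm=1.0, noise_multiplier=1.1,
                             lr=0.025, n_iters=700):
    """Minimal DP-SGD sketch: clip per-example gradients, add Gaussian noise."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residuals = X @ w - y
        grads = residuals[:, None] * X                       # per-example gradients
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip_norm)   # clip to clip_norm
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
        w -= lr * (grads.sum(axis=0) + noise) / n            # noisy averaged step
    return w

X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=300)
w = dp_sgd_linear_regression(X, y)
```

Clipping bounds each example's influence on the update (the sensitivity), which is what makes the added Gaussian noise meaningful as a privacy mechanism.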
The amount of privacy budget spent after each experiment remains constant, regardless of the size of the data set. This is because it depends only on the noise multiplier and the number of iterations, which are constant in our experiment.
To ensure model correctness, multiple linear regressions were fit using sklearn, and the R² values of the sklearn and TensorFlow results were compared.
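That correctness check can be sketched by computing R² against a closed-form least-squares reference fit (sklearn's LinearRegression with fit_intercept=False would return the same coefficients; the data below are simulated):

```python
import numpy as np

rng = np.random.default_rng(5)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

X = rng.normal(size=(200, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.1, size=200)

# Closed-form least squares as the reference fit.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = r_squared(y, X @ w)
assert r2 > 0.9  # low-noise simulation, so the fit should be near-perfect
```

Agreement of this R² with the value produced by the iterative (TensorFlow-style) fit is the correctness signal the text describes.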
First, the relevant parameters were set: the number of iterations and the learning rate that lead to a robust linear regression. In this case, 700 iterations and a learning rate of 0.025 were used. Next, the differentially private linear regression was set up with the same parameters and checked for robustness.
Figure 10 shows the results obtained in terms of R² with and without differential privacy. The epsilon column represents the privacy budget reached.
The rows with the highest non-DP R² (above 0.98) are highlighted in green. The model infused with differential privacy reacts very well and largely preserves the level of accuracy the non-DP model reached. For example, in line 17 an R² of 0.99 corresponds to a DP R² of 0.94; the latter model is still acceptable. It is also interesting to point out that there are no outliers when the non-DP metrics are compared with the DP ones, regardless of the level of accuracy reached. Even in lines 1 to 12, a model that is inaccurate without DP behaves the same way with DP (e.g. line 10). In conclusion, the level of accuracy is retained between a non-DP model and its DP counterpart.
Another exciting extension of the OPAL architecture is to molecular libraries. As pharmaceutical companies continue to expand and develop new drug pipelines, their research and development arms generate and characterize numerous molecules. However, the vast majority of these molecules will never be used by the company. At the same time, pharmaceutical companies have great difficulty capitalizing on these libraries because they cannot prove the value of their molecules without revealing the molecules and the raw data itself. Using an Open Trial Chain architecture for collaborative learning in blinded OPAL sandboxes across molecular libraries would allow for greater innovation in the drug space.
The goal of traditional randomized clinical trials (RCTs) is to prove safety, dosing, and efficacy in “most people”. However, this underspecified benchmark of “most people” has proven increasingly problematic with recent drugs that have narrower therapeutic windows. In turn, many minority groups are completely overlooked even in phase 3 trials, because the signal from these groups can be indistinguishable from statistical noise.
Integrating OPAL into the approval process would federate disparate clinical trial results for greater signal boost from these minority groups, returning a more specific and thorough analysis of efficacy. This would not only prevent serious adverse event disparity in the market but also increase targeted efficacy. For example, OPAL queries across similar arms of multiple clinical trials could not only detect ethnic-specific signals but also boost genomic-specific signals on the horizon of personalized medicine.
The movement of data for analysis introduces many privacy risks under regulations such as HIPAA and GDPR. However, OPAL fundamentally shifts this paradigm by offering an architecture that instead moves the analysis to the data. This not only addresses the regulations at hand but also pre-empts broader regulatory shifts toward privacy in a digital world of increasingly granular and sensitive data.
In this chapter we have reviewed a number of strategies and possible solutions for addressing the challenges around data privacy in Health IT. As society becomes increasingly data-driven, health initiatives and programs such as the Precision Medicine Initiative (PMI) will require access to greater amounts of data at a more detailed level.
We believe paradigms such as open algorithms provide a promising framework of thinking through some of the privacy challenges. The open algorithms paradigm is core to our proposed privacy-centric data architecture, which focuses on sharing insights by design. This is in contrast to the current prevalent model of simply exporting or copying data across institutions. Also important to open algorithms is the notion of consent for the execution of an algorithm. This is also in contrast to the current prevalent interpretation of “consent” to mean permission to copy data.
If initiatives such as contact-tracing applications for reducing the spread of COVID-19 are to gain traction in communities, then issues such as privacy need to be addressed correctly.