
10. Towards an Ecosystem of Trusted Data and AI

Published on Oct 07, 2020

As the economy and society move from a world where interactions are physical and based on paper documents, toward a world that is primarily governed by digital data and AI, our existing methods of managing security, transparency, and accountability are proving inadequate. Large-scale fraud, data breaches, and concerns about uses of AI are common. If we can create an ecosystem of Trusted Data and Trusted AI that provides safe, secure and human-centric services for everyone, then huge societal benefits can be unlocked, including better health, greater financial inclusion, and a population that is more engaged with and better supported by its government [1].

In order to avoid having our critical systems suffer increasing rates of damage and compromise, bias and unfairness, and ultimately failure, we need to move decisively toward pervasive data minimization, and general auditing of data use and of AI computation. Current firewall, event sharing, and attack detection approaches are simply not feasible as long-run solutions for cybersecurity, nor is current ad hoc evaluation of AI’s unintended effects sufficient. We need to adopt an inherently more robust approach.

Dramatically better technology for an inherently safe, equitable data and AI ecosystem has already been built and is being deployed at scattered locations around the world, as described in this chapter. For instance, the EU data protection authorities are supporting a simplified, easy-to-deploy version called OPAL (which stands for OPen ALgorithms [2]) for pilot testing within certain countries (see http://opalproject.org). The concept of OPAL is that instead of copying or sharing data, algorithms are sent to existing databases, executed behind existing firewalls, and only the encrypted results are shared. This minimizes opportunities to attack databases or divert data for unapproved use, and OPAL may be combined with differential privacy, homomorphic encryption, or secure multiparty computation in order to provably ensure that data remain safe [2].

Perhaps just as importantly, having an OPAL-style system means that the use of data, and the performance of the algorithms, can be continuously logged and audited. Consequently, performance of AI algorithms concerning fairness, privacy, and security can be monitored according to agreed-upon standards. Moreover, if new questions about use and performance arise, the data to answer those questions is immediately available in the logs of the OPAL system.
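To make this concrete, the following is a minimal sketch (in Python, with invented names and fields, not part of any actual OPAL release) of how an OPAL-style deployment might keep a tamper-evident log of algorithm executions; hash-chaining each entry to its predecessor makes after-the-fact edits detectable during an audit.

```python
# Minimal sketch of a hash-chained audit log for algorithm executions.
# All names and fields are illustrative, not part of any OPAL release.
import hashlib
import json
import time

audit_log = []  # append-only in a real deployment

def log_execution(algorithm_id: str, requester: str) -> str:
    """Record one algorithm execution, chained to the previous entry."""
    prev_hash = audit_log[-1][1] if audit_log else ""
    entry = {
        "algorithm": algorithm_id,
        "requester": requester,
        "timestamp": time.time(),
        "prev_hash": prev_hash,  # chaining makes tampering detectable
    }
    entry_hash = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    audit_log.append((entry, entry_hash))
    return entry_hash

log_execution("avg-income-v2", "ministry-of-health")
```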

However, technical solutions such as OPAL are inadequate without human-centric governance. There must be user-centric data ownership and management, as well as the development of secure and privacy-preserving machine learning algorithms, the deployment of transparent and accountable algorithms, and the introduction of machine learning fairness principles and methodologies to overcome biases and discriminatory effects. Humans must be placed at the center of the discussion, as humans are ultimately both the actors and the subjects of the decisions made via algorithmic means. If we are able to ensure that these requirements are met, we should be able to realize the positive potential of AI-driven decision-making while minimizing the risks and possible negative unintended consequences on individuals and on society as a whole.

This book has argued that the best way to achieve human-centric governance is through the notion of a data cooperative, which is a voluntary collaborative agreement by individuals that their personal data may be used to derive insights for the benefit of their community. These insights can be in the form of simple maps or statistics, or generated by state-of-the-art AI methods. Importantly, a data cooperative does not require that individuals give up ownership of their data, only that their data may be used for specific, agreed-upon uses.

There are several key aspects of a data cooperative. First of all, a data cooperative member has legal ownership of her/his data: this data can be collected into her/his Personal Data Store (PDS) [3], and s/he can add and remove data from the PDS as well as suspend access to the data repository. Members have the option to maintain a single Personal Data Store or multiple Personal Data Stores, either at the cooperative or on private data servers.

However, if the data store is hosted at the cooperative, then data protection (e.g. data encryption) and curation are performed by the cooperative itself for the benefit of its members. Moreover, the data cooperative has a legal fiduciary obligation to its members [4]: this means that the cooperative organization is owned and controlled by the members. Finally, the ultimate goal of the data cooperative is to benefit and empower its members. As highlighted in previous chapters of this book, credit and labor unions can provide an inspiration for data cooperatives as collective institutions able to represent the data rights of individuals.

Such personal data platform cooperatives are a means for avoiding asymmetries and inequalities in the data economy and for realizing the concept of a property-owning democracy, introduced by the political and moral philosopher John Rawls [5]. In particular, Loi et al. [6] argue that a society characterized by multiple personal data platform cooperatives is more likely to realize Rawls's principle of fair equality of opportunity, under which individuals have equal access to the resources (data, in this case) needed to develop their talents.

The Data Cooperative Ecosystem

The data cooperative ecosystem is summarized in Figure 1. The main entities are (i) the data cooperative as a legal entity, (ii) the individuals who make up the membership and elect the leadership of the cooperative, and (iii) the external entities who interact with the data cooperative, referred to as queriers. The cooperative as an organization may choose to operate its own IT infrastructure or to outsource these IT functions to an external operator or IT services provider. In the case of outsourcing, the service level agreement (SLA) and contracts must prohibit the operator from accessing or copying the members' data. Furthermore, the prohibition must extend to all other third-party entities from which the outsourcing operator purchases or subcontracts parts of its own services.

Figure 1: Overview of the Data Cooperative Ecosystem

A good analogy can be gleaned from credit unions throughout the United States. Many small credit unions band together to share IT costs by outsourcing IT services to a common provider, known in the industry as a Credit Union Service Organization (CUSO). Thus, a credit union in Vermont may band together with one in Texas and another in California to contract a CUSO to provide basic IT services. This includes a common computing platform on the cloud, shared storage on the cloud, shared applications, and so on. The credit union may not have any equipment on-premises, other than the PCs used to connect to the platform operated by the CUSO. Here, despite the three credit unions using a common platform, the CUSO may tailor the appearance of the user interface differently for each credit union in order to provide some degree of differentiation to its members. However, the CUSO in turn may be subcontracting functions or applications from a third party. For example, the CUSO may be running its platform using virtualization technology on Amazon Web Services (AWS). It may purchase storage from yet a different entity. This approach of subcontracting functions or services from other service providers is currently very common.

In the context of a data cooperative that chooses to outsource IT services, the service contract with the IT services provider must include prohibitions on third-party cloud providers accessing data belonging to the cooperative's members.

Preserving Data Privacy of Members

We propose to use the MIT Open Algorithms (OPAL) approach to ensure the privacy of the members' data held within the personal data stores [2]. In essence, the OPAL paradigm requires that data never be moved or copied out of its data store, and that the algorithms instead be transmitted to the data stores for execution.

The following are the key concepts and principles underlying the open algorithms paradigm (a minimal code sketch follows the list):

  • Move the algorithm to the data: Instead of “pulling” data into a centralized location for processing, it is the algorithm that must be transmitted to the data repository endpoints and be processed there.

  • Data must never leave its repository: Data must never be exported or copied from its repository. Additional local data-loss protection could be applied, such as encryption (e.g. homomorphic encryption) to prevent backdoor theft of the data.

  • Vetted algorithms: Algorithms must be vetted to be “safe” from bias, discrimination, privacy violations and other unintended consequences.

  • Provide only safe answers: When returning results from executing one or more algorithms, return aggregate answers only as the default granularity of the response. Any algorithm that is intended to yield answers that are specific to a data subject (individual) must only be executed after obtaining the subject’s affirmative and fully informed consent [6].
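The sketch below illustrates these principles under stated assumptions: a hypothetical DataStore class holds raw records behind the firewall, queriers may only select pre-vetted algorithms by identifier, and only aggregate answers over a minimum group size are returned. None of the names belong to an actual OPAL implementation.

```python
# Illustrative sketch of the open algorithms principles; all names
# (VETTED_ALGORITHMS, DataStore, MIN_GROUP_SIZE) are hypothetical.
import statistics

# Queriers select vetted algorithms by identifier; they never submit code.
VETTED_ALGORITHMS = {
    "mean_income": lambda recs: statistics.mean(r["income"] for r in recs),
    "count_adults": lambda recs: sum(1 for r in recs if r["age"] >= 18),
}

MIN_GROUP_SIZE = 10  # crude threshold for returning only "safe answers"

class DataStore:
    """Holds raw records behind the firewall; raw data never leaves."""

    def __init__(self, records):
        self._records = records  # never exported or copied out

    def run(self, algorithm_id):
        if algorithm_id not in VETTED_ALGORITHMS:
            raise PermissionError("algorithm has not been vetted")
        if len(self._records) < MIN_GROUP_SIZE:
            raise ValueError("group too small for a safe aggregate answer")
        # Only the aggregate crosses the boundary; a real deployment would
        # also log the query and could add differential-privacy noise.
        return VETTED_ALGORITHMS[algorithm_id](self._records)

store = DataStore([{"income": 30_000 + 500 * i, "age": 20 + i} for i in range(12)])
print(store.run("mean_income"))
```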

To ensure that data analysis does not reveal personal data, the AI community has over the years developed a range of privacy-preserving machine learning methods. These methods are inspired by research efforts in cryptography and have the goal of protecting the privacy of the input data and/or of the models used in the learning task. Examples of this approach are (i) differential privacy [7], (ii) federated learning [8], and (iii) encrypted computation [9].

Differential privacy is a methodology for publicly sharing information about a given dataset by providing a description of the patterns related to the groups represented in the dataset, while at the same time keeping private the information about the individuals. For example, government agencies use differential privacy algorithms to publish statistical aggregates while ensuring that individual survey responses are kept confidential. To achieve this, differential privacy adds a calibrated amount of statistical noise, thus obscuring the contributions of specific individuals in the dataset.
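As a hedged illustration, the snippet below sketches the Laplace mechanism, a standard construction from the differential privacy literature [7]; the function names are invented. A counting query changes by at most one when any single individual is added or removed (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.

```python
# Sketch of the Laplace mechanism for a counting query; names are illustrative.
import numpy as np

def dp_count(records, predicate, epsilon=1.0):
    """Noisy count satisfying epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count by
    at most 1), so Laplace noise with scale 1/epsilon suffices [7].
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Publish how many members earn above a threshold without revealing
# whether any particular member belongs to that group.
members = [{"income": 40_000 + 1_000 * i} for i in range(100)]
print(dp_count(members, lambda r: r["income"] > 75_000, epsilon=0.5))
```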

Federated learning is a machine learning approach where different entities or organizations collaboratively train a model, while keeping the training data decentralized on local nodes. Hence, the raw data samples of each entity are stored locally and never exchanged, while the parameters of the learning algorithm are exchanged in order to generate a global model. It is worth noting that federated learning does not provide a full guarantee of the privacy of sensitive data (e.g., personal data), as some characteristics of the raw data could be memorized during the training of the algorithm and thus extracted. For this reason, differential privacy can complement federated learning by keeping private the contribution of single organizations/nodes in the federated setting [10].
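Here is a minimal federated-averaging sketch, assuming a linear least-squares model held as a NumPy weight vector (all names are illustrative, not any particular framework's API): each node fits the model on its own data, and only parameter vectors, never raw samples, travel to the coordinator.

```python
# Sketch of one round of federated averaging for a linear model.
import numpy as np

def local_update(weights, X, y, lr=0.01, steps=20):
    """Train locally; the raw samples (X, y) never leave this node."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

def federated_round(global_weights, node_datasets):
    """Only parameters are exchanged; the server averages them."""
    updates = [local_update(global_weights, X, y) for X, y in node_datasets]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
nodes = []
for _ in range(3):  # three organizations, each with private data
    X = rng.normal(size=(50, 2))
    nodes.append((X, X @ true_w + 0.01 * rng.normal(size=50)))
print(federated_round(np.zeros(2), nodes))
```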

Finally, encrypted computation aims at protecting the learning model itself by allowing models to be trained and evaluated on encrypted data. Thus, the organization training the model is not able to see or leak the data in its non-encrypted form. Examples of methods for encrypted computation are (i) homomorphic encryption, (ii) functional encryption, (iii) secure multi-party computation, and (iv) influence matching.
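For intuition, here is a toy additive secret-sharing sketch, the basic building block of secure multi-party computation; the values and names are invented. Each member's value is split into random shares, parties sum the shares they hold, and only the aggregate is ever reconstructed.

```python
# Toy additive secret sharing over a prime field; illustrative only.
import random

PRIME = 2**61 - 1  # all arithmetic is modulo a large prime

def share(value, n_parties=3):
    """Split a value into n random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three members share their salaries; each party sums the shares it
# holds, so no party ever sees an individual's raw value.
salaries = [50_000, 62_000, 48_000]
per_member = [share(s) for s in salaries]
sums_held = [sum(col) % PRIME for col in zip(*per_member)]
print(reconstruct(sums_held))  # 160000, the aggregate only
```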

Consent for Algorithm Execution

One of the contributions of the EU General Data Protection Regulation (GDPR) [11] is the formal recognition, at the regulatory level, of the need for informed consent to be obtained from subjects. More specifically, the GDPR calls for the ability of the entity processing the data to

...demonstrate that the data subject has consented to processing of his or her personal data (Article 7).

Related to this, a given

...data subject shall have the right to withdraw his or her consent at any time (Article 7).

In terms of minimizing the practice of unnecessarily copying data, the GDPR calls out in clear terms the need for data access to be

...limited to what is necessary in relation to the purposes for which they are processed (data minimisation) (Article 5).

Figure 2: Consent Management using User Managed Access (UMA)

In the context of the GDPR, we believe that the MIT Open Algorithms approach substantially addresses the various issues raised by the GDPR by virtue of data never being moved or copied from its repository.

Furthermore, because OPAL requires algorithms to be selected and transmitted to the data endpoints for execution, the matter of consent in OPAL becomes one of requesting permission from the subject for the execution of one or more vetted algorithms on the subject's data. The data cooperative, as a member organization, has the task of explaining in lay terms the meaning and purpose of each algorithm, and of conveying to the members the benefits of executing the algorithm on the member's data.
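A minimal sketch of what per-algorithm consent management could look like under this model follows; the data structures are hypothetical. Consent is recorded per member and per vetted algorithm, can be withdrawn at any time in line with GDPR Article 7, and is checked before every execution.

```python
# Illustrative per-algorithm consent registry; structure is hypothetical.
consents = {}  # (member_id, algorithm_id) -> bool

def grant(member_id, algorithm_id):
    consents[(member_id, algorithm_id)] = True

def withdraw(member_id, algorithm_id):
    # GDPR Article 7: the subject may withdraw consent at any time.
    consents[(member_id, algorithm_id)] = False

def may_execute(member_id, algorithm_id):
    # Default deny: no recorded consent means no execution.
    return consents.get((member_id, algorithm_id), False)

grant("member-1234", "avg-income-v2")
assert may_execute("member-1234", "avg-income-v2")
withdraw("member-1234", "avg-income-v2")
assert not may_execute("member-1234", "avg-income-v2")
```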

In terms of the consent management system implemented by a data cooperative, there are additional requirements that pertain to indirect access by service providers and operators that may be hosting data belonging to members of the cooperative. More specifically, when an entity employs a third-party operated service (e.g. a client or application running in the cloud) and that service handles data, algorithms, and computation results related to the cooperative's activities, then we believe authorization must be expressly obtained by that third party as well.

In the context of possible implementations of authorization and consent management, the access authorization framework used by most hosted application and service providers today is the OAuth2.0 authorization framework [12]. The OAuth2.0 model is relatively simple in that it recognizes three (3) basic entities in the authorization work-flow. The first entity is the resource owner, which in our case translates to the cooperative acting on behalf of its members. The second entity is the authorization service, which could map to either the cooperative or an outsourced provider. The third entity is the requesting party using a client (application), which maps roughly to our querier (a person or organization seeking insights). In the case that the data cooperative is performing internal analytics for its own purposes, the querier is the cooperative itself.
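As a concrete, hedged illustration, the snippet below sketches an OAuth2.0 client-credentials token request per RFC 6749 [12], mapped onto the roles above: the querier's client application asks the cooperative's authorization server for an access token. The URL, credentials, and scope string are placeholders, not a real service.

```python
# Sketch of an OAuth2.0 client-credentials token request (RFC 6749).
# Endpoint, credentials, and scope are placeholders, not a real service.
import requests

TOKEN_URL = "https://coop.example.org/oauth/token"  # hypothetical endpoint

response = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "scope": "run:vetted-algorithm-42",  # permission for one vetted algorithm
    },
    auth=("querier-client-id", "querier-client-secret"),
)
response.raise_for_status()
access_token = response.json()["access_token"]  # presented with each query
```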

While the OAuth2.0 model has gained traction in industry over the past decade (e.g. in mobile apps), its simplistic three-party view does not reflect the popularity today of hosted applications and services. In reality, the three parties in OAuth2.0 (namely the client, the authorization server, and the resource) could each be operated by separate legal entities. For example, the client application could be running in the cloud, and thus any information or data passing through the client application becomes accessible to the cloud provider.

An early awareness of the inherent limitations of OAuth2.0 has led to additional efforts directed at expanding the 3-party configuration into a 5- or 6-party arrangement (Figure 3), while retaining the same OAuth2.0 token and messaging formats. This work has been conducted in the Kantara Initiative standards organization since 2009, under the umbrella of User Managed Access (UMA) [13,14]. As implied by its name, UMA seeks to provide ordinary users, as resource (data) owners, with the ability to manage access policy in a consistent manner across the user's resources, which may be physically distributed throughout different repositories on the Internet. The UMA entities follow closely and extend the entities defined in the OAuth2.0 framework. More importantly, the UMA model introduces new functions and tokens that allow it to address complex scenarios that explicitly identify hosted service providers and cloud operators as entities that must abide by the same consent terms of service:

  • Recognition of service operators as 3rd-party legal entities: The UMA architecture explicitly calls out entities which provide services to the basic OAuth2.0 entities. The goal is to extend the legal obligations to these entities as well, which is crucial for implementing informed consent in the sense of the GDPR.

    Thus, for example, in the UMA work-flow in Figure 3, the client is recognized as consisting of two separate entities: the querier (e.g. a person) that operates the hosted client-application, and the Service Provider A that makes the client-application available on its infrastructure. When the querier is authenticated by the authorization server and is issued an access-token, Service Provider A must also be separately authenticated and be issued its own unique access token.

    This means that Service Provider A, which operates the client-application, must accept the terms of service and data usage agreement presented by the authorization server, in the same manner that the querier (person or organization) must accept them.

  • Multi-round handshake as a progressive legal binding mechanism: Another important contribution of the UMA architecture is the recognition that a given endpoint (e.g. an API at the authorization server) provides the opportunity to successively engage the caller in agreeing to terms of service and a data usage agreement (referred to as binding obligations in UMA).

Figure 3: UMA entities as an extension of the OAuth2.0 model

More specifically, UMA uses the multi-round protocol run between the client and the authorization server to progressively bind the client in a lock-step manner. When the client (client-operator) chooses to proceed with the handshake by sending the next message in the protocol to the endpoint of the authorization server, the client has implicitly agreed to the terms of service at that endpoint. This is akin to the client agreeing step-by-step to additional clauses in a contract each time the client proceeds with the next stage of the handshake.
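The toy sketch below models this progressive binding only loosely; the endpoint names and clauses are invented, not taken from the UMA specification. Each step of the handshake attaches a clause, and proceeding to the next endpoint implies agreement to the clauses encountered so far.

```python
# Loose model of UMA-style progressive binding; steps and clauses invented.
HANDSHAKE_STEPS = [
    ("request_permission_ticket", "clause 1: identify your legal entity"),
    ("present_claims", "clause 2: accept the data-usage agreement"),
    ("request_access_token", "clause 3: accept audit logging of queries"),
]

def run_handshake(client_accepts):
    """client_accepts(endpoint, clause) -> bool is the client's decision."""
    accepted = []
    for endpoint, clause in HANDSHAKE_STEPS:
        if not client_accepts(endpoint, clause):
            break  # stopping early binds the client only to prior clauses
        accepted.append(clause)  # proceeding implies agreement to this clause
    return accepted

print(run_handshake(lambda endpoint, clause: True))  # agrees at every step
```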

Identity-related Algorithmic Assertions

A potential role for a data cooperative is to make available the summary results of analytic computations to external entities regarding a member (subject), upon request by the member. Here the work-flow must be initiated by the member, who is using his or her data (in their personal data store) as the basis for generating the assertions about them, based on executing one or more of the cooperative-vetted algorithms. In this case, the cooperative behaves as an Attribute Provider or Assertions Provider for its members [15], by issuing a signed assertion in a standard format (e.g. SAML2.0 [16] or Claims [17,18]). This is particularly useful when the member is seeking to obtain goods and services from an external Service Provider (SP).

As an example, a particular member (individual) could be seeking a loan (e.g. a car loan) from a financial institution. The financial institution requires proof of income and expenditures regarding the member over a period of time (e.g. the last 5 years) as part of its risk assessment process. It needs an authoritative and truthful source of information regarding the member's financial behavior over the last 5 years. This role today in the United States is fulfilled by the so-called credit scoring or credit reporting companies, such as Equifax, TransUnion, and Experian.

However, in this case the member could turn to his or her cooperative and request that it run various algorithms (including algorithms private to the cooperative) on the various data sets regarding the member located in the member's personal data store. At the end of these computations the cooperative could issue an authoritative and truthful assertion, which it signs using its private key. The digital signature signifies that the cooperative stands behind its assertions regarding the given member. Then the cooperative or the member could transmit the signed assertion to the financial institution. Note that this cycle of executing algorithms, followed by assertion creation and transmittal to the financial institution, can be repeated as many times as needed, until the financial institution is satisfied.

Figure 4: Overview of Obtaining Assertions from the Data Cooperative

There are a number of important aspects regarding this approach of relying on the data cooperative (a minimal signing sketch follows the list):

  • Member driven: The algorithmic computation on the member’s data and the assertion issuance must be invoked or initiated by the member. The data cooperative must not perform this computation and issue assertions (about a member) without express directive from the member.

  • Short-lived assertions: The assertion's validity period should be limited to the duration of time specified by the service provider. This reduces the window of opportunity for the service provider to hoard the assertions obtained from the cooperative or re-sell them to a third party.

  • Limited to a specific purpose: The assertion should carry an additional legal clause indicating that the assertion is to be used for a given purpose only (e.g. the member's application for a specific loan type and loan amount).

  • Signature of cooperative: The data cooperative, as the issuer of the assertions or claims, must digitally sign the assertions. This conveys the consent of the member (for the issuance of the assertion) and the authority of the cooperative as the entity that executes algorithms over the member's data.

  • Portability of assertions: The assertion data structure should be independent (stand-alone), portable and not tied to any specific infrastructure.

  • Incorporates Terms of Use: The assertion container (e.g. SAML2.0 or Claims) issued by the cooperative must carry unambiguous legal statements regarding the terms of use of the information contained in the assertion. The container itself may even carry a copyright notice from the cooperative to discourage service providers from propagating the signed assertions to third parties.
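The sketch below shows how a cooperative might issue such a short-lived, purpose-bound, signed assertion, using Ed25519 signatures from the Python `cryptography` package. The field names are illustrative and do not follow the actual SAML2.0 or Verifiable Credentials schemas.

```python
# Sketch of issuing and verifying a signed assertion; fields are illustrative.
import json
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

coop_key = Ed25519PrivateKey.generate()  # the cooperative's signing key

now = int(time.time())
assertion = {
    "issuer": "coop.example.org",
    "subject": "member-1234",
    "claim": {"avg_monthly_income_usd": 4200},  # aggregate result only
    "purpose": "car-loan-application-5678",     # limited to one purpose
    "issued_at": now,
    "expires_at": now + 3600,                   # short-lived: one hour
    "terms_of_use": "single use by the named service provider only",
}
payload = json.dumps(assertion, sort_keys=True).encode()
signature = coop_key.sign(payload)  # the cooperative stands behind the claim

# The member (or the cooperative) forwards (payload, signature); the
# financial institution verifies with the cooperative's public key.
coop_key.public_key().verify(signature, payload)  # raises if tampered with
```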

Once the assertion has been issued by the cooperative, there are numerous ways to make it available to external third parties, depending on the privacy requirements of the concerned entities. In the case above, a member (subject) may wish for the assertion to be available only to the specific service provider (e.g. the loan provider) because the event pertains to a private transaction. In the case that the service provider needs to maintain copies of assertions from the cooperative for legal reasons (e.g. taxation purposes), the service provider could return a signed digital receipt [19] agreeing to the terms of use of the assertions.

In other cases, a member may wish for some types of assertions containing static personal attributes (e.g. age or year of birth) to be readily available without these privacy limitations. For example, the member might use such attribute-based assertions to purchase merchandise tied to age limits (e.g. alcohol). In this case, the signed assertion can be readable from a well-known endpoint at the cooperative, from the member's personal website, or from the member's mobile device. Hence the importance of the portability of the assertion structure.

Conclusions

Today we are in a situation where individual assets (people's personal data) are being exploited by AI algorithms without sufficient value being returned to the individual. This is analogous to the situation in the late 1800s and early 1900s that led to the creation of collective institutions such as credit unions and labor unions, and so the time seems ripe for the creation of collective institutions to represent the data rights of individuals.

We have argued that data cooperatives with fiduciary obligations to members provide a promising direction for the empowerment of individuals through the collective use of their own personal data. Not only can a data cooperative give members access to expert, community-based advice on how to manage, curate, and protect access to their personal data, it can also run internal analytics that benefit the collective membership. Such collective insights provide a powerful tool for negotiating better services and discounts for members.

[1] World Economic Forum, “Personal Data: The Emergence of a New Asset Class,” 2011, http://www.weforum.org/reports/personal-data-emergence-new-asset-class.

[2] T. Hardjono, D. Shrier, and A. Pentland. Trusted Data. MIT Press, 2019.

[3] Y.-A. de Montjoye, E. Shmueli, S. Wang, and A. Pentland. openPDS: Protecting the privacy of metadata through SafeAnswers. PLoS ONE, 9(7):e98790, 2014.

[4] J.M. Balkin. Information fiduciaries and the first amendment. UC Davis Law Review, 49(4):1183–1234, 2016.

[5] J. Rawls. Justice as fairness: A restatement. Harvard University Press, 2001.

[6] M. Loi, P.-O. Dehaye, and E. Hafen. Towards Rawlsian ‘property-owning democracy’ through personal data platform cooperatives. Critical Review of International Social and Political Philosophy, pages 1–19, 2020.

[7] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[8] A. Dubey and A. Pentland. Private and byzantine-proof federated decision making. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS). ACM, 2020.

[9] N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In Proceedings of 2016 International Conference on Machine Learning (ICML), pages 201–210, 2016.

[10] M. Brundage et al. Toward trustworthy AI development: Mechanisms for supporting verifiable claims. arXiv:2004.07213, 2020.

[11] European Commission, “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation),” Official Journal of the European Union, vol. L119, pp. 1–88, 2016.

[12] D. Hardt, “The OAuth 2.0 Authorization Framework,” RFC 6749, October 2012. Available: http://tools.ietf.org/rfc/rfc6749.txt

[13] T. Hardjono (ed.), “User-Managed Access (UMA) Profile of OAuth 2.0, Specification Version 1.0,” Kantara Initiative, Kantara Published Specification, 2015.

[14] E. Maler, M. Machulak, and J. Richer, “User-Managed Access (UMA) 2.0 Grant for OAuth 2.0 Authorization,” Kantara Initiative, 2018.

[15] American Bar Association, “An Overview of Identity Management: Submission for UNCITRAL Commission 45th Session,” ABA Identity Management Legal Task Force, May 2012, available at http://meetings.abanet.org/webupload/commupload/CL320041/relatedresources/ABA-Submission-to-UNCITRAL.pdf.

[16] OASIS, “Assertions and Protocols for the OASIS Security Assertion Markup Language (SAML) V2.0,” March 2005, available at http://docs.oasis-open.org/security/saml/v2.0/saml-core-2.0-os.pdf.

[17] M. Sporny, D. Longley, and D. Chadwick, “Verifiable Credentials Data Model 1.0,” W3C, W3C Candidate Recommendation, March 2019, available at https://www.w3.org/TR/verifiable-claims-data-model.

[18] D. Reed and M. Sporny, “Decentralized Identifiers (DIDs) v0.11,” W3C, Draft Community Group Report 09 July 2018, July 2018, https://w3c-ccg.github.io/did-spec/.

[19] M. Lizar and D. Turner, “Consent Receipt Specification Version 1.0,” March 2017, https://kantarainitiative.org/confluence/display/infosharing/Home.
