Privacy, Machine Learning, and Monopolies

This blog post was adapted from my term paper for Computer Science Ethics. Private machine learning is a fast-paced field, and my views have likely evolved since the writing of this essay. I focused on problems within the private machine learning space, setting aside discussions of potentially effective solutions, whether human-centered computing approaches or specific policy interventions.

Facebook’s targeted advertising revealed to a user’s parents that he was gay (Skeba & Baumer, 2020). The NYPD curated an image database of minors to compare with surveillance footage from crime scenes (Levy & Schneier, 2020). Abusive parents have used an app to detect whether their child is discussing sexual matters online (Skeba & Baumer, 2020). Each of these scenarios depicts a harmful privacy violation engendered by the rapid adoption of machine learning in virtually every industry. Effectively engaging with these scenarios entails moving beyond the isolated incidents and understanding the widespread lack of privacy in machine learning. This phenomenon can be explained by interrogating the power dynamics that underlie the evolution and development of machine learning: the privacy of a system, or lack thereof, is a rearrangement of power and is, therefore, a political decision (Rogaway, 2015). Machine learning has been largely driven by a few key companies considered to be tech monopolies, often referred to as Big Tech (Jurowetzki et al., 2021). The fundamental lack of privacy in machine learning is an expression of the monopolistic nature of these tech conglomerates and their power over the general public, particularly marginalized communities. Current efforts to further machine learning privacy frequently obscure the broader conditions that maintain injustice, empowering these companies rather than the public.

Privacy is broadly defined as the capacity of an individual to choose what information is shared about them and how it is used. This definition has evolved over time, and multiple perspectives and notions are subsumed by it. For instance, Nissenbaum contends that privacy must be analyzed within specific contexts and can be measured by conformity to cultural norms governing appropriate flows of information, a notion known as contextual integrity (Nissenbaum, 2004). In contrast, Skeba and Baumer (2020) view privacy as a function of Floridi’s measure of informational friction: the amount of work required for an agent to alter or access the information of another agent. Despite these differences of opinion, there is a consensus that anonymity alone cannot fulfill privacy, a fallacy that underlies many failed or deliberately weak attempts to preserve it (Desfontaines, 2020).

Privacy protects the public, but it is particularly important for marginalized and otherwise vulnerable communities. The right to privacy is often framed as a safeguard against more tangible threats such as blackmail, abuse, and imprisonment (Levy & Schneier, 2020), or as an integral precondition for the rights to free speech and assembly (Skeba & Baumer, 2020), rights recognized by and given paramount importance in the United States Constitution and similar legal codes. With the rise of targeted advertising, privacy also protects what is known as the right to the future tense: mass manipulation through advertisements aimed at distinct groups can influence the actions of an individual, violating that right (Srinivasan, 2018). These harms are disproportionately felt by vulnerable and marginalized groups such as people of color and victims of domestic abuse (Skeba & Baumer, 2020; Levy & Schneier, 2020). Furthermore, privacy is predicated on consent. Marginalized individuals are the most likely to be forcibly subjected to a technology, to be uninformed about the implications of its use, or to otherwise exist in situations where they cannot freely and authentically provide consent (Madden et al., 2017). Networked privacy problems intensify these effects: individuals in marginalized groups are linked in networks by association, so the actions of one individual carry larger consequences for others in the same group, especially given the utility of aggregated data in machine learning systems, a phenomenon elaborated upon below (Madden et al., 2017).

Privacy keeps queer people safe in places where their identities are outlawed or persecuted (Skeba & Baumer, 2020), prevents domestic abusers from tracking their victims (Levy & Schneier, 2020), and protects different classes of people from stigma, surveillance, ostracism, abuse, incarceration, and other forms of harm (Liang et al., 2020). Facial recognition best exemplifies how a lack of privacy in machine learning specifically harms marginalized communities. Initial critiques of facial recognition focused on discriminatory outcomes: these systems misclassified people of color at much higher rates, resulting in several false arrests (Stevens & Keyes, 2021). Inclusive representation was lauded as a panacea, but it only subjects marginalized groups to increased surveillance, since it encourages collecting more data from them. More recent critiques demonstrate how facial recognition reinforces racist overpolicing and commercial exploitation, and how the data collection and algorithmic development underlying it are rooted in the exploitation and dehumanization of people of color (Stevens & Keyes, 2021). Improving the accuracy of these systems on marginalized groups, namely people of color, continues to harm these individuals by sustaining broader injustices.

Before examining the lack of privacy in machine learning, it is pertinent to specify who uses these algorithms. Because machine learning is a pervasive technology embedded into applications like facial recognition (as opposed to being directly purchased), consumers are not necessarily aware of its inclusion, and those subject to the software are members of the public who may not have consented to its use (Knowles & Richards, 2021). In the case of facial recognition, the onus of privacy invasions falls primarily on the consumers: law enforcement and relevant commercial entities. Considering who bears responsibility for privacy and whether those subject to the technology have consented to its use is an analysis of the broader system in which the technology exists.

Machine learning technologies are infrastructural assemblages consisting of data, algorithms, and the broader system; analysis should be performed on each constituent part (Stevens & Keyes, 2021). The field of machine learning exhibits a fundamental lack of privacy in every component. Birhane and Prabhu investigated problematic practices in ImageNet, the open-source vision dataset pivotal to the growth and success of machine learning. Among other issues, they discovered verifiably pornographic images that can be used to re-identify and blackmail women, highlighted downstream effects in other models and datasets stemming from privacy violations within ImageNet, and identified open datasets built on false conceptions of informed consent and anonymization. More significantly, they posit that the release of ImageNet contributed to a culture of surveillance and widespread data collection that does not account for privacy and consent (Birhane & Prabhu, 2021).

Privacy is also intrinsically lacking within the algorithms themselves. Attacks on the privacy of machine learning algorithms include model extraction, model inversion, and membership inference. Model extraction enables an attacker to create a copycat of a model and, therefore, access the inferences of the system (Jagielski et al., 2020). Consider the consequences of attackers obtaining a copycat of the NYPD’s facial recognition system; it would be analogous to a leak of confidential legal data. An attacker can launch a model inversion attack to reconstruct data the system was trained upon, or membership inference to identify whether a data point was in the training data of a model (Albert et al., 2020). To illustrate, these attacks would allow an attacker to recover images that the NYPD’s facial recognition tool has seen or to determine whether an individual was in the system’s training data, potentially revealing information about that individual’s relationship with law enforcement (i.e., whether they were previously arrested, incarcerated, or suspected) and disproportionately harming marginalized communities, most saliently people of color (Albert et al., 2020). Efforts to improve the privacy of these algorithms using techniques such as homomorphic encryption and differential privacy often require refashioning the algorithms entirely (Liu et al., 2021). Robust, well-known, scalable defenses to these attacks do not exist, and there is limited tooling to test models for these vulnerabilities, demonstrating the inadequacy of the current state of privacy for machine learning algorithms (Gupta & Galinkin, 2020; Hussain, 2020). Moreover, the right to be forgotten is an integral privacy stipulation in multiple regulations, but ensuring machine learning algorithms can forget specific data is still an open problem (Cao & Yang, 2015).
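To make membership inference concrete, the sketch below shows the simplest form of the attack: guessing that a point was a training member whenever the model is unusually confident about its true label. This is a minimal, illustrative example using a toy scikit-learn classifier as a stand-in target; the dataset, model, and threshold are all assumptions for demonstration, not the systems or attacks from the papers cited above.

```python
# Minimal sketch of a confidence-threshold membership inference attack.
# Everything here (toy data, overfit target model, threshold) is hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# An intentionally overfit "target" model: members tend to receive
# higher-confidence predictions than non-members.
target = RandomForestClassifier(n_estimators=50, random_state=0)
target.fit(X_train, y_train)

def confidence(model, X, y):
    """Probability the model assigns to each point's true label."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

# Attack: guess "member" whenever confidence on the true label exceeds a threshold.
threshold = 0.9
flagged_members = confidence(target, X_train, y_train) > threshold   # true members
flagged_nonmembers = confidence(target, X_out, y_out) > threshold    # non-members

tpr = flagged_members.mean()      # fraction of members correctly flagged
fpr = flagged_nonmembers.mean()   # fraction of non-members wrongly flagged
print(f"membership inference: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

A gap between the two rates is what leaks membership; the more a model memorizes its training data, the wider that gap tends to be.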

One final question remains in analyzing the failure of machine learning privacy: how does a lack of privacy in machine learning express itself in relation to the broader system? As previously stated, the public consists of individuals who did not consent to the use of machine learning but are still subject to it, violating privacy by virtue of violating the principle of consent. Machine learning inherently complicates privacy by inferring information from data that may seem innocuous, whether it is sexuality from Facebook likes or illness from the sound of a cough (Imran et al., 2020), rendering definitions of explicitly private personal information obsolete. Machine learning’s intrinsic affordance of scale bolsters this erosion of privacy: humans cannot match its speed and volume of data processing (Stevens & Keyes, 2021). Most consequentially, machine learning actualizes surveillance and tracking technologies that aid human rights violations such as mass incarceration and systematic genocide (Albert et al., 2020).

An analysis of the systemic failures of machine learning privacy is incomplete without mention of surveillance capitalism, conceptualized as the proliferation of violations of the right to the future tense, a consequence of the erosion of privacy. Under surveillance capitalism, machine learning turns Big Tech into an instrument of private (and eventually state) surveillance. The more data these companies collect, the better they are at predicting behavior, which, in turn, allows them to seize more power in the world, cementing their status as monopolies (Zuboff, 2019). Doctorow disagrees with this assessment, emphasizing that belief in the effectiveness of these machine learning algorithms is fallacious. He maintains that monopolies establish surveillance, but that this technology did not form the monopolies; rather, monopoly was the precondition for surveillance (Doctorow, 2021).

Machine learning’s irresponsible approach to privacy is frequently viewed as a technical problem, ignoring the circumstances of its creation, use, and deployment. This deficiency should not be construed independently of machine learning’s evolution inside technological monopolies. Machine learning lacks privacy because technological monopolies prioritize profits over the well-being of their users. In fact, the very structure of machine learning reflects the nature of these monopolies (Dotan & Milli, 2020). By privileging compute-rich and data-rich environments, the technology promotes the centralization of power, the core principle behind monopolies. Compute requirements limit access to those with the financial means to obtain GPUs and similar hardware and deepen dependency on companies and organizations able to produce and obtain compute at scale. Similarly, data-rich environments privilege large companies and organizations able to collect sufficient amounts of data, simultaneously encouraging further degradation of individual privacy (Dotan & Milli, 2020). Crucial to machine learning is optimization, a technique that arguably reduces complex questions of society, politics, and governance into economic problems, typically ignoring concerns from the general public and marginalized groups, the primary criticism of a monopoly (Kulynych et al., 2020). The infringement of privacy is treated as an externality of optimization, much as monopolistic corporations treat privacy violations as an expense to be weighed against legal and social repercussions (Swire, 1997).

The connection between privacy deterioration and technological monopolies is substantiated by Swire’s (1997) holistic framework on the forces affecting the commercial protection of personal information, combined with historical context on the relationship of the monopolistic companies in question to privacy. Swire defines a market failure as an instance where a company provides less privacy than the consumer desires. According to Swire (1997), this is a product of information asymmetry and bargaining costs. A company will know more about the extent of its data collection and processing than its users, and an individual user does not have the authority and power to effectively negotiate with the company or hold it accountable. Companies are therefore incentivized to profit from extracting more information, since lawsuits and leaks are less probable and cheaper (Swire, 1997). Illustrating this observation, Facebook was fined five billion dollars by the Federal Trade Commission for privacy violations. Although this was the largest fine ever levied against a company by the FTC, it was a trivial cost to Facebook (Federal Trade Commission, 2019). Srinivasan (2018) examines Facebook’s evolution into one of the largest monopolies in history and the epitome of surveillance capitalism. She explains that Facebook’s erosion of privacy was made possible by its lack of competitors. Whenever it faced competition, Facebook incorporated privacy into its marketing and listened to user concerns, shedding these commitments as each competitor was eliminated. As its status as a monopoly calcified, its privacy failures became more egregious, implying that monopolistic behavior drives a degradation in privacy (Srinivasan, 2018). By virtue of their positions as monopolies, Facebook, Google, and other Big Tech companies engaged in broad-scale commercial surveillance (Birhane & Prabhu, 2021), collecting data critical to the evolution and growth of machine learning and producing the current, insufficient state of private machine learning.

Efforts to improve the privacy of machine learning are manifold. This raises the question: are current efforts beneficial to the greater public and to the communities most vulnerable to threats posed by the erosion of privacy? Unfortunately, these endeavors have numerous flaws that result in them further empowering large technology companies. First, they are frequently centered at the data and algorithm levels, as the systems perspective can incriminate the technology companies that drive machine learning. For instance, given the lack of privacy in open-source datasets (Birhane & Prabhu, 2021), there is a natural inclination to rely on private datasets with fewer controversies around consent and re-identification. This serves to strengthen companies with the means to create and maintain useful, private datasets. A systems perspective might inquire into the ethics of establishing a machine learning system in the first place, a line of inquiry a monopoly might hope to quell.

This section on privacy is relatively out of date. It is pessimistic about differential privacy and technical approaches to machine learning privacy, favoring policy solutions. However, I believe that approaches such as differential privacy are vital, just not a panacea.

Second, private machine learning research often ignores specific classes of threats to vulnerable groups in favor of mathematical or purely technical measures of privacy (Rogaway, 2015). These measures, which include differential privacy, federated learning, and encryption schemes, mitigate privacy leakage in a limited number of situations. Differential privacy limits the re-identifiability of an individual within a dataset or machine learning model, but it decreases model accuracy and is less effective for outliers or minorities in the data (Bagdasaryan et al., 2019). A large technology company like Facebook or Google stands to benefit from the proliferation and normalization of differential privacy: its trade-offs can be offset with the vast quantities of data these companies collect, entrenching their monopolies. The other techniques suffer from similar limitations. Many feasible privacy guarantees rely on schemes and threat models that assume a trustworthy machine learning provider, yet another assumption privileging these companies (Rogaway, 2015). These measures drift from traditional, robust privacy lenses such as contextual integrity and informational friction, making “privacy” more compatible with optimization and easier to fold into economic problems. This does not significantly or directly benefit marginalized groups. A differentially private facial recognition system may prevent re-identification, but it is still used for surveillance and control, and an individual may still have their data leaked from the system due to their status as an outlier. The information asymmetry obscuring the individual’s information flows remains, as do the bargaining costs that prevent the individual from holding the company accountable. These definitions of privacy fail to address the sociotechnical and behavioral dimensions of privacy across different social contexts.
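For readers unfamiliar with the mechanics, the sketch below illustrates the Laplace mechanism, the basic building block of differential privacy: noise calibrated to a query’s sensitivity bounds how much any single person’s record can shift a released answer. The epsilon values and toy dataset are illustrative assumptions, not parameters from any deployed system.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# The dataset and epsilon values are hypothetical, chosen only to show the trade-off.
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, predicate, epsilon):
    """Release a differentially private count of records satisfying `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 67, 52, 38, 90]      # toy dataset
is_senior = lambda age: age >= 65            # "how many people are seniors?"

for epsilon in (0.1, 1.0, 10.0):
    released = laplace_count(ages, is_senior, epsilon)
    print(f"epsilon={epsilon}: released count = {released:.1f} (true count = 2)")

# Smaller epsilon means stronger privacy but noisier answers. The same absolute
# noise is relatively larger for rare categories, which is one intuition for why
# accuracy losses fall hardest on outliers and minority groups in the data.
```

The mechanism protects against re-identification from released statistics or models, but, as argued above, it says nothing about whether the system should exist or whom it surveils.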

Members of the public are subject to pervasive, embedded machine learning. They do not always consent to this usage and do not have the means to audit the privacy of these systems. Hence, they must trust “AI-as-an-institution,” the structural assurances society provides about the privacy of this software. The monopolistic nature of tech companies has ensured that distrust in the privacy of machine learning has not been a barrier to adoption (Knowles & Richards, 2021). Nonetheless, both industry self-regulation and government regulation have attempted to improve machine learning privacy. Swire (1997) contended that industry self-regulation in the presence of monopolies would still result in market failure: monopolies establish the norms that self-regulation efforts abide by. Indeed, the ethics codes set by these tech companies focus on consumer rights, a framing that often ignores the rights of the general public and the vulnerable groups who are subject to the application but are not its consumers (Washington & Kuo, 2020).

These monopolies also stand to gain from government regulation. Swire (1997) states that powerful tech companies can lobby for regulations that fortify monopolies by requiring access to privacy infrastructure or by imposing requirements only feasible for large companies. Laws are often over-broad or under-broad; when they are over-broad, privacy is loosely defined, and the companies with the resources to litigate and establish norms benefit (Swire, 1997). The most prominent privacy regulation is the European Union’s General Data Protection Regulation (GDPR). GDPR has two major flaws. First, it places the responsibility for detecting and contesting privacy violations onto individuals, maintaining the information asymmetry and bargaining costs that empower Big Tech. This “notice-and-consent framework” ignores situations where individuals cannot authentically provide consent (Skeba & Baumer, 2020). Second, GDPR specifies certain classes of protected information, a protection rendered futile by machine learning’s ability to infer private information from supposedly banal data sources (Skeba & Baumer, 2020). Technical privacy metrics, self-regulation initiatives, and government regulations fail to interrogate the power imbalance between technology companies and the public. In doing so, they help these companies consolidate power while providing little benefit to the individuals most at risk.

Privacy was not built into machine learning from the outset, an expression of the nature of the large tech conglomerates leading the field, companies that prioritize profit over privacy and over the resulting harm to vulnerable groups. Technical and structural measures to improve machine learning privacy risk strengthening these monopolies: they do not address the power imbalance between companies and the greater public, particularly vulnerable groups, that is core to the erosion of privacy. Effective endeavors to advance machine learning privacy must empower the individuals subject to this technology.

References

Albert, K., Penney, J., Schneier, B., & Kumar, R. (2020). Politics of Adversarial Machine Learning. In Towards Trustworthy ML: Rethinking Security and Privacy for ML Workshop, Eighth International Conference on Learning Representations (ICLR). https://dx.doi.org/10.2139/ssrn.3547322

Bagdasaryan, E., Poursaeed, O., & Shmatikov, V. (2019). Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems, 32, 15479-15488. Retrieved April 23, 2021, from https://proceedings.neurips.cc/paper/2019/hash/fc0de4e0396fff257ea362983c2dda5a-Abstract.html

Birhane, A., & Prabhu, V. (2021). Large Image Datasets: A Pyrrhic Win for Computer Vision? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 1537-1547). Retrieved April 23, 2021, from https://bit.ly/3kaIsSd

Desfontaines, D. (2020). Lowering the cost of anonymization (Doctoral dissertation, ETH Zurich). Retrieved April 23, 2021, from https://desfontain.es/thesis/

Doctorow, C. (2021). How to destroy surveillance capitalism. Retrieved April 23, 2021, from https://onezero.medium.com/how-to-destroy-surveillance-capitalism-8135e6744d5

Dotan, R., & Milli, S. (2020). Value-Laden Disciplinary Shifts in Machine Learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (p. 294). https://doi.org/10.1145/3351095.3373157

Federal Trade Commission. (2019, July 24). FTC Imposes $5 Billion Penalty and Sweeping New Privacy Restrictions on Facebook [Press Release]. Retrieved April 23, 2021, from https://www.ftc.gov/news-events/press-releases/2019/07/ftc-imposes-5-billion-penalty-sweeping-new-privacy-restrictions

Gupta, A., & Galinkin, E. (2020). Green Lighting ML: Confidentiality, Integrity, and Availability of Machine Learning Systems in Deployment. International Conference on Machine Learning Workshop on Challenges in Deploying and Monitoring Machine Learning Systems. Retrieved April 23, 2021, from https://arxiv.org/pdf/2007.04693

Hussain, S. (2020, October 8). PrivacyRaven Has Left the Nest. Retrieved April 23, 2021, from https://blog.trailofbits.com/2020/10/08/privacyraven-has-left-the-nest/

Imran, A., Posokhova, I., Qureshi, H. N., Masood, U., Riaz, M. S., Ali, K., John, C. N., Hussain, M. I., & Nabeel, M. (2020). AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app. Informatics in Medicine Unlocked, 20, 100378. https://doi.org/10.1016/j.imu.2020.100378

Jagielski, M., Carlini, N., Berthelot, D., Kurakin, A., & Papernot, N. (2020). High accuracy and high fidelity extraction of neural networks. In 29th USENIX Security Symposium (USENIX Security 20) (pp. 1345-1362). Retrieved April 23, 2021, from https://arxiv.org/abs/1909.01838

Jurowetzki, R., Hain, D., Mateos-Garcia, J., & Stathoulopoulos, K. (2021). The Privatization of AI Research(-ers): Causes and Potential Consequences – From university-industry interaction to public research brain-drain? Retrieved April 23, 2021, from https://arxiv.org/abs/2102.01648

Knowles, B., & Richards, J. (2021). The Sanction of Authority: Promoting Public Trust in AI. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 262–271). https://doi.org/10.1145/3442188.3445890

Kulynych, B., Overdorf, R., Troncoso, C., & Gürses, S. (2020). POTs: Protective Optimization Technologies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 177–188). https://doi.org/10.1145/3351095.3372853

Levy, K., & Schneier, B. (2020). Privacy threats in intimate relationships. Journal of Cybersecurity, 6(1). https://doi.org/10.1093/cybsec/tyaa006

Liang, C., Hutson, J. A., & Keyes, O. (2020). Surveillance, stigma & sociotechnical design for HIV. First Monday, 25(10). https://doi.org/10.5210/fm.v25i10.10274

Liu, B., Ding, M., Shaham, S., Rahayu, W., Farokhi, F., & Lin, Z. (2021). When Machine Learning Meets Privacy: A Survey and Outlook. ACM Computing Surveys (CSUR), 54(2), 1-36. https://doi.org/10.1145/3436755

Madden, M., Gilman, M., Levy, K., & Marwick, A. (2017). Privacy, poverty, and big data: A matrix of vulnerabilities for poor Americans. Washington Law Review, 95, 53. Retrieved April 23, 2021, from https://ssrn.com/abstract=2930247

Nissenbaum, H. (2004). Privacy as contextual integrity. Washington Law Review, 79, 119. Retrieved April 23, 2021, from https://digitalcommons.law.uw.edu/cgi/viewcontent.cgi?article=4450&context=wlr

Rogaway, P. (2015). The Moral Character of Cryptographic Work [Invited Talk]. Asiacrypt, Auckland, New Zealand. Retrieved April 23, 2021, from https://web.cs.ucdavis.edu/~rogaway/papers/moral.html

Skeba, P., & Baumer, E. (2020). Informational Friction as a Lens for Studying Algorithmic Aspects of Privacy. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2). https://doi.org/10.1145/3415172

Washington, A., & Kuo, R. (2020). Whose Side Are Ethics Codes on? Power, Responsibility and the Social Good. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 230–240). https://doi.org/10.1145/3351095.3372844

Cao, Y., & Yang, J. (2015). Towards Making Systems Forget with Machine Unlearning. In 2015 IEEE Symposium on Security and Privacy (pp. 463-480). https://doi.org/10.1109/SP.2015.35

Zuboff, S. (2019). Surveillance capitalism and the challenge of collective action. New Labor Forum, 28(1), 10-29. https://doi.org/10.1177/1095796018819461