Welcome. Welcome to the Keystone Strategy Transformative Ideas lecture series. We're lucky today to be hosted with C-SPAN2. A couple of housekeeping notes: when we get to Q&A there are microphones, so raise your hand and a microphone will come to you so that the questions are captured. Today we are extremely lucky to have Michael Kearns and Aaron Roth, both from the University of Pennsylvania, here to talk about their book The Ethical Algorithm. I think a day does not go by, in the news or otherwise or in our own work, when the subject of algorithmic fairness or privacy is not front-page news. Today we have the two leading lights in that area, and they're going to help us understand what the state of the art is now and what it will be going forward. With that, I think we will welcome Professor Michael Kearns first to the stage, is that right? Great. Michael and Aaron, welcome to the stage.

Okay, good morning. Thanks to everyone for coming. My name is Michael Kearns, and with my close friend and colleague I have coauthored a general-audience book called The Ethical Algorithm: The Science of Socially Aware Algorithm Design. What we want to do for roughly half an hour is take you at a high level through some of the major themes of the book, and then we will open it up, as Jeff said, to Q&A.

I think many, many people, and certainly this audience, are well aware that in the past decade or so machine learning has gone from a relatively obscure corner of AI to mainstream news. I would characterize the first half of this decade as the glory period, when all the news reports were positive and we were hearing about all these amazing advances in areas like deep learning, with applications in speech recognition, image processing, image categorization and many other areas. So we all enjoyed the great benefits of this technology and the advances that were made, but the last few years or so have been more of a buzzkill. There have been many, many articles written, and now even some popular books, on essentially the collateral damage that can be caused by algorithmic decision-making, especially decision-making powered by AI and machine learning. Here are a few of those books. Weapons of Math Destruction was a big bestseller from a couple of years ago that did a good job of making very real, visceral and personal the ways in which algorithmic decision-making can result in discriminatory predictions, like gender discrimination, racial discrimination or the like. Data and Goliath is a book about the fact that we've essentially become a commercial surveillance state, and the breaches of privacy, trust and security that accompany that. Aaron and I have read these books and we like these books very much, and many others like them. But one of the things we found lacking in these books, which was much of the motivation for writing our own, was that when you get to the solution section of these books, i.e., what should we do about these problems, the solutions suggested are what we call traditional ones. They say we need better laws, we need better regulations, we need watchdogs, we need to keep an eye on this stuff, and we agree with all of that.
But as computer scientists and machine learning researchers working directly in the field, we also know there has been a movement in the past 5 to 10 years to design algorithms that are better in the first place. So rather than, after the fact, waiting for some predictive model to exhibit racial discrimination in criminal sentencing, you can think about making the algorithm better in the first place, and there is now a fairly large scientific community in the machine learning research area and many adjacent areas that is trying to do exactly that. So think of our book as a popular-science book: we're trying to explain to the reader how you would go about trying to encode and embed social norms that we care about directly into algorithms themselves.

A couple of preparatory remarks. We got a review on an early draft of the book that basically said, "I think your title is a conundrum or possibly even an oxymoron. What do you mean, an ethical algorithm? How can an algorithm be any more ethical than a hammer?" This reviewer pointed out that an algorithm, like a hammer, is a human-designed artifact for particular purposes. And while it's possible to make unethical use of a hammer, for instance I might decide to hit you on the hand, nobody would make the mistake of ascribing any unethical behavior or immoral activity to the hammer itself. If I hit you on the hand with the hammer you would blame me, and you and I would both know that real harm had come to you because of my hitting you on the hand with a hammer. So this reviewer said, "I don't see why these same arguments don't apply to algorithms."

We thought about this for a while and decided we disagree. We think algorithms are different. Even though they are indeed just tools, human artifacts for particular purposes, we think they're different for a couple of reasons. One of them is that it's difficult to predict outcomes and difficult to ascribe blame, and part of the reason is that algorithmic decision-making, when powered by AI and machine learning, is a pipeline, so let me quickly review what that pipeline is. You usually start off with some complicated data, complicated in the sense that it's high dimensional, has many variables, and might have many rows. Think of a medical database, for instance, of individual citizens' medical records. We may not understand this data in any detail and may not understand where it came from in the first place; it may have been gathered from many disparate sources. The usual pipeline or methodology of machine learning is to take that data and turn it into some sort of optimization problem. We have an objective landscape over a space of models, and we want to find the model that does well on the data in front of us, and usually that objective is primarily, or often exclusively, concerned with predictive accuracy or some notion of utility or profit. There's nothing more natural in the world, if you're a machine learning researcher or practitioner, than to take a data set and say, let's find the neural network that on this data makes the fewest mistakes in deciding who to give a loan. So you do that, and what results is some perhaps very complicated, high-dimensional model. This is a classic piece of clip art from the internet for deep learning: a neural network with many layers between the input and the output and lots of transformations of the data and variables.
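As a minimal sketch of the pipeline just described, the snippet below takes a (synthetic, made-up) "loan" data set, turns it into an optimization problem whose only objective is predictive accuracy, and walks downhill on that objective to find a model. Everything here, the feature names, the fabricated data, the simple linear model, is illustrative rather than the authors' actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loan data: each row is an applicant (e.g. income, debt, years employed),
# each label records whether they repaid (1) or defaulted (0).
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.5, 1.0])
y = (X @ true_w + rng.normal(scale=0.5, size=1000) > 0).astype(float)

def logistic_loss(w, X, y):
    """The objective landscape: average prediction error (log loss) over the data."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Gradient descent: search the space of models for one that minimizes the objective.
w = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - y) / len(y)
    w -= 0.1 * grad

accuracy = np.mean(((X @ w) > 0) == y)
print(f"final loss: {logistic_loss(w, X, y):.3f}, training accuracy: {accuracy:.3f}")
```

Notice that nothing in this objective mentions fairness, privacy, or any other social norm; accuracy on the data in front of us is the only thing being optimized, which is exactly the point being made next.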
The point is, a couple of things about this pipeline. It's very diffuse: if something goes wrong, it might not be easy to pin down the blame. Was it the data? Was it the objective function? Was it the optimization procedure, or was it the neural network itself? And even worse, if this algorithm, this predictive model that we use at the end, causes harm to somebody, if you are falsely denied a loan, for instance, because the neural network said you should be denied the loan, when this is happening at scale behind the scenes we may not even be aware that I've hit you on the head with a hammer. And also, because we give algorithms so much autonomy: to hit you on the head with a hammer I have to pick the thing up and hit you, but these days algorithms are running autonomously without any human intervention, so we may not even realize the harm being caused unless we know to explicitly look for it.

So our book is about how to make things better, not through regulation and laws and the like, but by actually revisiting this pipeline and modifying it in ways that give us various social norms that we care about, like privacy, fairness, accountability, etc. One of the interesting and important things about this endeavor is that even though many, many scholarly communities have thought about these social norms before us (philosophers, for instance, have been thinking about fairness since time immemorial, and lots of people have thought about things like privacy), they've never had to think about these things in such a precise way that you could actually write them into a computer program or into an algorithm. And sometimes just the act of forcing yourself to be that precise can reveal flaws in your intuitions about these concepts that you weren't going to discover any other way, and we will give concrete examples of that during our presentation.

So the whirlwind, high-level tour of the book is a series of chapters about different social norms, some of which I've written down here: what the science looks like of actually going in and giving a precise, mathematical definition to these things, then encoding that mathematical definition in an algorithm, and, importantly, what the consequences of doing that are, in particular the trade-offs. In general, if I want an algorithm that's more fair or more private, that might come at the cost of less accuracy, for example, and we will talk about this. You'll notice that I've written these different social norms in increasing shades of gray here, and what that roughly represents is our subjective view of how mature the science is. In particular, we think that when it comes to privacy, this is the field that is, in relative terms, the most mature, in that there is what we think is the right definition of data privacy and quite a bit known about how to embed that definition in powerful algorithms, including machine learning algorithms. Fairness, which is a little bit lighter, is a more recent, more nascent field, but it is off to a good start.
And things like accountability, interpretability or morality are in grayer shades because in these cases we feel there aren't even good technical definitions yet, so it's hard to get started on encoding these things in algorithms. And I promise you there's a bottom bullet here which says "the singularity," but it's entirely in white so you can't even see it. So what we're going to do with the rest of our time is talk about privacy and fairness, which cover roughly the first half of the book, and then we will spend a few words telling you about the game-theoretic twist that the book takes midway through. So I'm going to turn it over to Aaron for a bit now.

So as Michael mentioned, privacy is by far the most well-developed of the themes we talk about in the book, so I want to spend a few minutes giving you a brief history of the study of data privacy, which is about 20 years old now, and in the process try to go through a case study of how we might think precisely about definitions. It used to be, 20 or 25 years ago, that when people talked about releasing data sets in a way that was privacy preserving, what they had in mind was some attempt at anonymization. I would have some data set of individual people's records, and my data set might have people's names, and if I wanted to release it I would try to anonymize the records by removing the names and, maybe if I was careful, unique identifiers like Social Security numbers, but I would keep things like age or zip code, features about people that weren't enough to uniquely identify anyone.

So in 1997 the state of Massachusetts decided to release a data set that would be useful for medical researchers. It was a good thing: medical data sets are hard to get your hands on because of privacy concerns, and Massachusetts had an enormous data set of medical records, records corresponding to every state employee in Massachusetts, and they released it in a way that was anonymized. There were no names, no Social Security numbers, but there were ages, there were zip codes, and there were genders. It turns out that although age is not enough to uniquely identify you, zip code is not enough, and gender is not enough, in combination they can be, and there was a graduate student named Latanya Sweeney, now a professor at Harvard, who figured this out. In particular, she figured out that you could cross-reference the supposedly anonymized data set with voter registration records, which also had demographic information like zip code, date of birth and gender, but together with names. So she cross-referenced this medical data set with the voter registration records and was able, with this triple of identifiers, to identify the record, the medical record, of Bill Weld, who was the governor of Massachusetts at the time, and make a point.

Okay. So this was a big deal in the study of data privacy, and for a long time people tried to fix this problem by basically applying little band-aids, trying to most directly fix whatever the most recent attack was. So for example, people thought, all right, if it turns out that combinations of zip code and gender and age can uniquely identify someone in a record, why don't we try coarsening that information? So instead of reporting age exactly, maybe we will report it only up to an interval of 10 years, and maybe we will only report zip code up to three digits, and we will do this so that we can make sure that any combination of attributes in the table we release doesn't correspond to just one person.
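Here is a minimal sketch of that coarsening idea: report age only as a 10-year bracket and zip code only by its first three digits, then check that every released combination of these generalized attributes matches at least k records. The records and helper functions are hypothetical, purely for illustration; this is not the actual Massachusetts data or any real release mechanism.

```python
from collections import Counter

# Toy, made-up hospital records.
records = [
    {"age": 56, "zip": "19104", "gender": "F", "diagnosis": "colitis"},
    {"age": 52, "zip": "19103", "gender": "F", "diagnosis": "flu"},
    {"age": 34, "zip": "19104", "gender": "M", "diagnosis": "asthma"},
    {"age": 38, "zip": "19146", "gender": "M", "diagnosis": "diabetes"},
]

def generalize(rec):
    """Coarsen the quasi-identifiers: 10-year age bracket, 3-digit zip prefix."""
    lo = (rec["age"] // 10) * 10
    return (f"{lo}-{lo + 9}", rec["zip"][:3], rec["gender"])

def is_k_anonymous(records, k):
    """True if every combination of generalized attributes matches at least k records."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

# What would actually be published: coarsened attributes plus the sensitive field.
released = [{"quasi_id": generalize(r), "diagnosis": r["diagnosis"]} for r in records]
print(is_k_anonymous(records, k=2))  # True: after coarsening, every combination matches >= 2 records
```

As the talk goes on to explain, this guarantee only sounds comforting: the moment a second data set coarsened the same way is released and can be cross-referenced, the "matches at least two records" promise can collapse back to a unique identification.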
So for example, if I know that my 56-year-old neighbor, who is a woman, attended some hospital, maybe the Hospital of the University of Pennsylvania, and they've released an anonymized data set in this way, then they've got the guarantee that I cannot connect the attributes I know about my neighbor to just one record; I can only connect them to two records. So for a little while people tried doing this. And if you think about it, if you look at this data set you might already begin to realize this isn't getting quite at what we mean by privacy, because even though, if I know that my 56-year-old female neighbor attended the Hospital of the University of Pennsylvania, I can't figure out exactly what her diagnosis is, since she corresponds to two records, I can still narrow it down to one of two diagnoses, for example that she may have had colitis, which might already be something she didn't want me to know. But the problem goes much deeper than that. Suppose that I know that she's been a patient not just at that one hospital but at two hospitals, and the other hospital has also released records anonymized in the same way, in fact maybe a little better, because now my 56-year-old neighbor matches not just two but three of its records. But if both of these data sets have been released, I can just cross-reference them, and there's a unique record, only one record, that could correspond to my neighbor, and all of a sudden I've got her diagnosis.

So the overall problem here is the same as it was when we just tried removing names: maybe attempts at privacy like this would work if the data set I was releasing were the only thing out there, but that's never the case, and the problem is that small amounts of idiosyncratic information are enough to identify you, in ways that I can uncover if I cross-reference the data set that's been released with all the other stuff that's out there. So people tried patching this up as well, but for a long time the history of data privacy was a cat-and-mouse game where researchers would try heuristic things, patching up whatever vulnerability led to the most recent attack, and attackers would try new, clever things, and this was a losing game for privacy researchers. Part of the problem is that we were trying to do things we hoped were private without ever really defining what we meant by privacy. So this was an approach that was too weak. Let me now, in an attempt to think about what privacy might mean, talk about an approach that is too strong, and then we will find the right answer in between.

So you might say, okay, let's think about what privacy should mean. Maybe, if I'm going to use data sets to conduct, for example, medical studies, what I want is that nobody should be able to learn anything about you as a particular individual that they couldn't have learned about you had the study not been conducted. That would be a strong notion of privacy if we could promise it. To make it more concrete, let's think about what has come to be known as the British Doctors Study, a study carried out by Doll and Hill in the 1950s, which was the first piece of evidence that smoking and lung cancer had a strong association. It's called the British Doctors Study because every doctor in the UK was invited to participate, and two thirds of them did, so two thirds of the doctors agreed to have their medical records included. And very quickly it became apparent that there was a strong association between smoking and lung cancer. So imagine you are one of the doctors who participated in the study.
Say you're a smoker, and this is the 1950s, so you definitely made no attempt to hide the fact that you're a smoker; you'd probably be smoking during this presentation, so everyone knows that you're a smoker. But when the study is published, all of a sudden everyone knows something else about you that they didn't know before, and in particular they know that you are at an increased risk for lung cancer, because all of a sudden we have learned a new fact about the world, that smoking and lung cancer are correlated. If you were in the US, this might have caused you concrete harm at the time, in the sense that your health insurance rates might have gone up, so this could have caused you concrete, quantifiable harm. So if we were going to say that what privacy means is that nothing new should be learned about you as the result of conducting a study, we would have to call the British Doctors Study a violation of your privacy. But there are a couple of things that are wrong about this. First of all, observe that the story could have played out in exactly the same way even if you were one of the doctors who decided not to have your data included. The supposed violation of your privacy in this case, the fact that I learned you are at higher risk of lung cancer, wasn't something that I learned from your data in particular; I already knew you were a smoker before the study was carried out. The violation of privacy would have to be attributed to the facts about the world th