High-quality and relevant data can be a powerful force for good, but flawed data only perpetuates inequalities under the guise of fairness.
At its best, data science can impact global societies in incredible ways. It can work to enhance ocean health, identify and deliver food surpluses to feed the hungry, and use cellphone data to standardize public transportation routes in developing areas like Nairobi.
Data scientists in both the public and private sectors must understand the fundamental opportunities of using data in new applications, address potential ethical and bias risks, and weigh the need for data regulation.
Before algorithms can be used appropriately, it’s necessary to access good data sources and evaluate the quality of all available data. According to Vinod Bakthavachalam, a senior data scientist at Coursera, critical questions to ask before using a data set in any application include: Is there measurement error? Do I understand how the data was captured? Are there weird outliers or other abnormal numbers?
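Bakthavachalam's checklist can be expressed as a simple screening pass over a column of data. The sketch below is a minimal illustration, not his actual process: the function name, field values, and threshold are assumptions, and it uses a robust modified z-score (based on the median absolute deviation) to flag abnormal numbers alongside missing entries.

```python
import statistics

def screen_column(values, threshold=3.5):
    """Flag basic data-quality issues in one numeric column:
    missing entries, plus possible measurement errors detected
    with a robust modified z-score (median absolute deviation)."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    if mad == 0:
        return {"missing": missing, "outliers": []}
    outliers = [v for v in present
                if 0.6745 * abs(v - med) / mad > threshold]
    return {"missing": missing, "outliers": outliers}

# Hypothetical shoe-spend data with one missing value and one
# implausibly large entry (a possible measurement error).
spend = [92, 105, 88, 110, None, 97, 5000, 101]
report = screen_column(spend)
# report -> {'missing': 1, 'outliers': [5000]}
```

A median-based score is used here rather than the ordinary z-score because a single extreme value inflates the mean and standard deviation enough to hide itself in small samples.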
“Even if the data on its own is good, there’s always a chance it may be unusable if it’s not right for a specific purpose,” he says.
For example, you may have high-quality data on a consumer’s willingness to spend over $100 on shoes, but perhaps that data was collected during the holiday season when shoppers traditionally spend more and is thus inapplicable to predicting year-round shopping trends. In other words, it may be the best data in the world, but whether it’s the most relevant data is an entirely different matter.
Data scientists must also understand that although algorithms can make a positive difference in society, there is a risk that some algorithms instead further entrench cultural prejudice and bias.
Machine learning algorithms are among the most common data algorithms in everyday life. They are frequently used to recommend products to consumers on e-commerce sites, and they are increasingly applied to decisions such as hiring and lending. Used correctly, these algorithms can remove racial or gender bias by focusing on the intrinsic characteristics that predict success, bypassing the human tendency to prefer people who are similar to themselves.
Used incorrectly, however, these models merely lend a respectable veneer to otherwise unethical processes. An algorithm that sees bias in its training data will draw biased conclusions when fed new data, because machine learning algorithms do not make optimal decisions; they make the decisions the humans who "trained" them would have made. For example, if a company historically hired only white men and used that data to train its hiring algorithm, the algorithm would perpetuate that hiring practice. Biased data, in short, leads to biased results.
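The hiring example can be made concrete with a deliberately crude toy, not a real machine learning model: a "model" that simply memorizes which profiles were hired in the past reproduces the historical pattern exactly. The data and decision rule below are invented for illustration.

```python
# Toy illustration of biased training data: every past hire in the
# (fabricated) history shares the same profile.
past_hires = [("white", "male")] * 50

def trained_decision(candidate, history=past_hires):
    """Approve only profiles seen among past hires -- the decision
    the humans who produced the training data would have made."""
    return candidate in set(history)

trained_decision(("white", "male"))    # -> True
trained_decision(("black", "female"))  # -> False
```

Real models generalize rather than memorize, but the failure mode is the same: the training data defines what "success" looks like, so a skewed history yields a skewed decision rule.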
To avoid such biases, Coursera deliberately chose to ignore gender when training its machine learning algorithms to recommend classes to potential students.
“In the U.S., women are less likely to enroll in STEM classes, so if we used gender, it wouldn’t recommend certain courses to women,” Bakthavachalam says. “We want to encourage women to enroll in STEM classes and avoid any biases in the algorithms.”
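Coursera has not published its pipeline, but the underlying idea of withholding a protected attribute from a model can be sketched in a few lines; the field names below are invented for illustration.

```python
# Illustrative only: made-up field names, not Coursera's actual system.
PROTECTED = {"gender"}  # attributes deliberately excluded from training

def to_features(record, protected=PROTECTED):
    """Build the feature dict used for model training, dropping any
    protected attributes so the model cannot condition on them."""
    return {k: v for k, v in record.items() if k not in protected}

learner = {"gender": "F", "prior_courses": 4, "math_score": 88}
features = to_features(learner)
# features -> {'prior_courses': 4, 'math_score': 88}
```

Dropping a field does not, on its own, remove correlated proxies for that field, which is one reason the training data itself still has to be examined.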
Coursera’s experience underscores that although there is no silver bullet for avoiding algorithmic bias, it’s not an intractable problem, either. In fact, it’s more a matter of awareness than a difficult engineering problem to solve, and it begins with the knowledge that artificial intelligence is by no means perfect. According to Bakthavachalam, data scientists must avoid treating machine learning algorithms as black boxes because “if you don’t know what’s going on under the hood, it’s hard to imagine and diagnose issues.”
Data scientists must also remain vigilant when initially examining training data, a process that requires a diverse team and, in some cases, external reviewers. The biggest risk, Bakthavachalam believes, is that data scientists recognize the potential for data misuse but fail to do the work necessary to correct the underlying problems.
“Everyone has different value systems, and being open and upfront about the algorithm can lead collectively to the right decision,” says Bakthavachalam.
On the positive side, data science makes it easier to eliminate bias by quantifying it and highlighting trends that might otherwise go unnoticed. This lets data scientists strip out bias by analyzing only legitimately relevant information, enabling companies to serve previously underserved populations, especially in financial services.
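Quantifying bias often starts with something as simple as comparing approval rates across groups. The sketch below computes per-group selection rates and their gap (the demographic-parity difference) over a fabricated set of loan decisions; the group labels and data are assumptions for illustration.

```python
def selection_rates(outcomes):
    """Approval rate per group, for surfacing disparities that
    might otherwise go unnoticed (hypothetical loan decisions)."""
    totals, approved = {}, {}
    for group, ok in outcomes:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

# Fabricated decisions: (group label, approved?)
decisions = [("A", True), ("A", True), ("A", False),
             ("B", True), ("B", False), ("B", False), ("B", False)]
rates = selection_rates(decisions)
gap = max(rates.values()) - min(rates.values())
# rates -> {'A': 0.666..., 'B': 0.25}; gap is the parity difference
```

A large gap does not prove wrongdoing by itself, but it tells the team exactly where to look, which is the point Bakthavachalam makes about making bias visible.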
One example is MyBucks, a fintech company powered by a machine learning-enabled credit-scoring engine that serves the underbanked in 11 African nations. By aggregating large amounts of data, MyBucks has greater insight into which individuals are likely to default, allowing it to move beyond a reliance on more simplistic predictors like credit score.
In Kenya, for instance, data is pulled solely from an individual’s phone, and loans are paid directly into mobile wallets within minutes.
This service is especially important in nations where schools require full tuition payment upfront, historically a significant barrier to pursuing an education in some poorer countries.
Above all, data scientists must avoid getting lost in the techniques and methods of their trade. They must ask who will be affected by their work and how they can ensure that by doing “good” for one group, they don’t inadvertently harm another.
It’s through transparency about how data is collected, how it’s defined, and its limitations that analysts working together can get the most impactful results. Machines can learn, but it’s the human insights and supervision that enable organizations to balance power and fairness.