Anti-discrimination in Data Mining
Written by Sara Hajian, Josep Domingo-Ferrer

Along with privacy, discrimination is a very important issue when considering the legal and ethical aspects of data mining. Most people do not want to be discriminated against because of their gender, religion, nationality, age and so on, especially when those attributes are used to make decisions about them, such as whether to give them a job, a loan or insurance. Discovering such potential biases and eliminating them from the training data without harming its decision-making utility is therefore highly desirable. For this reason, anti-discrimination techniques, including discrimination discovery and prevention, have been introduced in data mining. In this paper, we tackle the problem of discrimination discovery and prevention in data mining, and we describe the distinguishing features of each approach and how it deals with discrimination.

 

Introduction

Unfairly treating people on the basis of their belonging to a specific group, defined for example by race, ideology or gender, is known as discrimination. In law, economics and social sciences, discrimination has been studied over the last decades, and anti-discrimination laws have been adopted by many democratic governments. Some examples are the US Employment Non-Discrimination Act (United States Congress 1994), the UK Sex Discrimination Act (Parliament of the United Kingdom 1975) and the UK Race Relations Act (Parliament of the United Kingdom 1976). Several decision-making tasks lend themselves to discrimination, e.g. loan granting, education, health insurance and staff selection. In many scenarios, decision-making tasks are supported by information systems. Given a set of information items on a potential customer, an automated system decides whether the customer is to be recommended for credit or for a certain type of life insurance. Automating such decisions reduces the workload of the staff of banks and insurance companies, among other organizations. The use of information systems based on data mining technology for decision making has attracted the attention of many researchers in the field of computer science. As a consequence, automated data collection and a plethora of data mining techniques, such as association/classification rule mining, have been developed and are now widely used for making automated decisions.

 



Anti-discrimination and data mining

At first sight, automating decisions may give a sense of fairness: classification rules (decision rules) are not guided by personal preferences. However, on closer inspection, one realizes that classification rules are actually learned by the system from training data. If the training data are inherently biased for or against a particular community (for example, foreigners), the learned model may show discriminatory behavior. For example, in a certain loan granting organization, foreign people might systematically have been denied access to loans throughout the years. If this biased historical dataset is used as training data to learn classification rules for an automated loan granting system, the learned rules will also show biased behavior toward foreign people. In other words, the system may infer that just being foreign is a legitimate reason for loan denial. A more detailed analysis of this fact is provided in [1, 2].
Figure 1 illustrates the process of discriminatory and non-discriminatory decision rule extraction. If the original biased dataset DB is used for data analysis without any anti-discrimination process (i.e. discrimination discovery and prevention), the discriminatory rules extracted could lead to automated unfair decisions. On the contrary, DB can go through an anti-discrimination process so that the learned rules are free of discrimination, given a list of discriminatory attributes (e.g. gender, race, age). As a result, fair and legitimate automated decisions are enabled.



Fig. 1. The process of extracting biased and unbiased decision rules

Data mining must therefore be prevented from becoming a source of discrimination itself, which happens when data mining tasks generate discriminatory models from biased datasets as part of automated decision making. In [3], it is demonstrated that data mining can be both a source of discrimination and a means for discovering discrimination.
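
As a rough illustration of how a biased history turns into a biased rule, the sketch below computes the confidence of the rule "Foreign worker = Yes -> Deny" on a small, entirely hypothetical loan history. The records, the helper rule_confidence and the attribute values are assumptions made for illustration only; they do not come from any of the cited works.

```python
from collections import Counter

# Hypothetical toy loan history: (foreign_worker, decision) pairs.
# Foreign applicants were denied far more often than local ones.
history = (
    [("yes", "deny")] * 8 + [("yes", "grant")] * 2
    + [("no", "deny")] * 3 + [("no", "grant")] * 7
)


def rule_confidence(records, premise, outcome):
    """Confidence of the rule 'foreign_worker = premise -> decision = outcome'."""
    matching = [decision for foreign, decision in records if foreign == premise]
    if not matching:
        return 0.0
    return Counter(matching)[outcome] / len(matching)


conf_foreign = rule_confidence(history, "yes", "deny")
conf_local = rule_confidence(history, "no", "deny")

print(f"conf(foreign=yes -> deny) = {conf_foreign:.2f}")
print(f"conf(foreign=no  -> deny) = {conf_local:.2f}")
```

A rule learner that ranks rules by confidence would pick up the first rule simply because the historical decisions were skewed, which is the effect described above.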

 

Direct and Indirect Discrimination

Discrimination can be either direct or indirect (also called systematic, see [4]). Direct discriminatory rules are biased rules that are directly inferred from discriminatory items (e.g. Foreign worker = Yes). Indirect discriminatory rules (redlining rules) are biased rules that are indirectly inferred from non-discriminatory items (e.g. Zip = 10451) because of their correlation with discriminatory ones. Indirect discrimination can happen because some background knowledge (rules) is available, for example, indicating that a certain zipcode corresponds to a deteriorating area or an area with a mostly black population. The background knowledge might be accessible from publicly available data (e.g. census data) or might be obtained from the original dataset itself, because it contains non-discriminatory attributes that are highly correlated with the sensitive ones.
A basic way to handle discrimination might seem to be removing the discriminatory attributes from the dataset (for direct discrimination prevention) and removing the non-discriminatory attributes that are highly correlated with the sensitive ones (for indirect discrimination prevention). In practice, however, this is not advisable: much useful information would be lost, and the quality/utility of the resulting training datasets and data mining models would decrease substantially.
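
The following minimal sketch shows why dropping the sensitive attribute alone does not remove indirect discrimination. It uses hypothetical records in which the zipcode 10451 is strongly correlated with a sensitive attribute; the data, the attribute names and the confidence helper are illustrative assumptions, not part of any cited method.

```python
# Hypothetical toy records given as (zip_code, minority, decision, count).
groups = [
    ("10451", "yes", "deny", 8),
    ("10451", "yes", "grant", 1),
    ("10451", "no", "grant", 1),
    ("10023", "no", "grant", 8),
    ("10023", "no", "deny", 1),
    ("10023", "yes", "grant", 1),
]
rows = [
    {"zip": z, "minority": m, "decision": d}
    for z, m, d, n in groups
    for _ in range(n)
]


def confidence(records, premise, conclusion):
    """Fraction of records matching `premise` that also match `conclusion`."""
    covered = [r for r in records if all(r[k] == v for k, v in premise.items())]
    if not covered:
        return 0.0
    hits = [r for r in covered if all(r[k] == v for k, v in conclusion.items())]
    return len(hits) / len(covered)


# Background knowledge: the zipcode is a strong proxy for the sensitive item.
background = confidence(rows, {"zip": "10451"}, {"minority": "yes"})

# Remove the sensitive attribute, as the naive approach would do.
stripped = [{k: v for k, v in r.items() if k != "minority"} for r in rows]

# The redlining rule 'zip = 10451 -> deny' is untouched by the removal.
redlining_before = confidence(rows, {"zip": "10451"}, {"decision": "deny"})
redlining_after = confidence(stripped, {"zip": "10451"}, {"decision": "deny"})

print(background, redlining_before, redlining_after)
```

Removing the zipcode as well would eliminate the redlining rule, but it would also discard whatever legitimate predictive value the attribute carries, which is the loss of utility mentioned above.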

 


 

Solutions for Anti-discrimination

Despite the wide deployment of information systems based on data mining technology in decision making, the issue of anti-discrimination in data mining did not receive much attention until 2008 [4]. Since then, some proposals have addressed the discovery and measurement of discrimination, while others deal with its prevention. The discovery of discriminatory decisions was first proposed by Pedreschi et al. [4, 5]. The approach is based on mining classification rules (the inductive part) and reasoning on them (the deductive part) on the basis of quantitative measures of discrimination that formalize legal definitions of discrimination. For instance, the U.S. Uniform Guidelines on Employee Selection Procedures (1978) state that: “a selection rate for any race, sex, or ethnic group which is less than four-fifths of the rate for the group with the highest rate will generally be regarded as evidence of adverse impact”.
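
The four-fifths criterion quoted above is easy to express as a quantitative check. The sketch below is one possible reading of it, assuming that selection counts per group are available; the function names, the group labels and the numbers are hypothetical.

```python
def selection_rates(outcomes):
    """Map each group to its selection rate; outcomes is group -> (selected, applicants)."""
    return {group: selected / applicants
            for group, (selected, applicants) in outcomes.items()}


def adverse_impact(outcomes, threshold=0.8):
    """Flag groups whose selection rate is below `threshold` times the highest rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        # Nobody was selected in any group, so no ratio can be formed.
        return {group: False for group in rates}
    return {group: rate / best < threshold for group, rate in rates.items()}


# Hypothetical hiring data: (selected, applicants) per group.
hiring = {"group_a": (48, 80), "group_b": (12, 40)}

flags = adverse_impact(hiring)
print(selection_rates(hiring))
print(flags)
```

Here group_b is selected at half the rate of group_a, well below the four-fifths threshold, so it is flagged; group_a, which has the highest rate, is never flagged.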
Beyond discrimination discovery, preventing knowledge-based decision support systems from making discriminatory decisions (discrimination prevention) is a more challenging issue. The challenge increases if we want to prevent not only direct discrimination but also indirect discrimination, or both at the same time. In this section, we review a collection of independent works on discrimination prevention. To classify the existing approaches, we consider two orthogonal dimensions. As a first dimension, we consider whether the approach deals with direct discrimination, indirect discrimination, or both at the same time. In this way, we separate the discrimination prevention approaches into three groups [6]: direct discrimination prevention methods, indirect discrimination prevention methods, and direct and indirect discrimination prevention methods. The second dimension relates to the phase of the data mining process in which discrimination prevention is performed. Based on this second dimension, discrimination prevention methods fall into three groups [3]: pre-processing, in-processing and post-processing approaches. We next describe these groups:

  • Pre-processing. Methods in this group transform the source data in such a way that the discriminatory biases contained in the original data are removed, so that no unfair decision rule can be mined from the transformed data; any of the standard data mining algorithms can then be applied. The pre-processing approaches of data transformation and hierarchy-based generalization can be adapted from the privacy preservation literature. Along this line, Kamiran et al. [1, 2] perform a controlled distortion of the training data from which a classifier is learned, making minimally intrusive modifications that lead to an unbiased dataset. Hajian et al. [6, 7, 8] draw inspiration from the data transformation methods for knowledge (rule) hiding [11] in privacy-preserving data mining and devise new data transformation methods (i.e. direct and indirect rule protection, rule generalization) that convert direct and/or indirect discriminatory decision rules into legitimate (non-discriminatory) classification rules, while providing a reasonable trade-off between discrimination removal and information loss.
  • In-processing. Methods in this group change the data mining algorithms in such a way that the resulting models do not contain unfair decision rules (Calders and Verwer [9], Kamiran et al. [10]). For example, an alternative approach to cleaning the discrimination from the original dataset is proposed in [10], whereby the non-discrimination constraint is embedded into a decision tree learner by changing its splitting criterion and pruning strategy through a novel leaf re-labeling approach. However, in-processing discrimination prevention methods must rely on new special-purpose data mining algorithms; standard data mining algorithms cannot be used as they are, because they must be adapted to satisfy the non-discrimination requirement.
  • Post-processing. These methods modify the resulting data mining models, instead of cleaning the original dataset or changing the data mining algorithms. For example, in [5], a confidence-altering approach is proposed for classification rules inferred by the rule-based classifier CPAR (classification based on predictive association rules).
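
To make the pre-processing idea more concrete, the following toy sketch relabels as few training records as possible until the positive decision rate of a protected group matches that of the favored group. It is only loosely in the spirit of the controlled distortion described for [1, 2] and should not be read as the algorithm of those papers: the function names, the group labels and the "score" field (standing in for the output of some ranker) are all assumptions.

```python
def positive_rate(rows, group):
    """Share of records in `group` that carry the positive label (1)."""
    members = [r for r in rows if r["group"] == group]
    return sum(r["label"] for r in members) / len(members)


def relabel(rows, protected, favored):
    """Swap labels pairwise (one promotion, one demotion) to close the rate gap."""
    rows = [dict(r) for r in rows]  # work on a copy, leave the input untouched
    n_prot = sum(r["group"] == protected for r in rows)
    n_fav = sum(r["group"] == favored for r in rows)
    step = 1 / n_prot + 1 / n_fav  # how much one swap reduces the gap

    # Promote protected negatives with the highest score first,
    # demote favored positives with the lowest score first.
    promote = sorted((r for r in rows if r["group"] == protected and r["label"] == 0),
                     key=lambda r: -r["score"])
    demote = sorted((r for r in rows if r["group"] == favored and r["label"] == 1),
                    key=lambda r: r["score"])

    for up, down in zip(promote, demote):
        gap = positive_rate(rows, favored) - positive_rate(rows, protected)
        if gap <= step / 2:  # another swap would not bring the gap closer to zero
            break
        up["label"], down["label"] = 1, 0
    return rows


# Hypothetical training data: 2 of 10 positives in group "f", 6 of 10 in group "m".
data = (
    [{"group": "f", "label": int(i < 2), "score": i / 10} for i in range(10)]
    + [{"group": "m", "label": int(i < 6), "score": i / 10} for i in range(10)]
)

fair = relabel(data, protected="f", favored="m")
print(positive_rate(data, "f"), positive_rate(data, "m"))
print(positive_rate(fair, "f"), positive_rate(fair, "m"))
```

Because labels are swapped in pairs, the overall number of positive decisions is preserved; any standard learner can then be trained on the transformed data, which is what distinguishes pre-processing from the in-processing and post-processing groups.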

 

 

Conclusion

In sociology, discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category. It involves denying to members of one group opportunities that are available to other groups. Like privacy, discrimination could have a negative social impact on the acceptance and dissemination of data mining technology. Discrimination prevention in data mining is a new body of research focusing on this issue. One of the research questions here is whether we can adapt the pre-processing approaches of data transformation and hierarchy-based generalization from the privacy preservation literature and use them for discrimination prevention. Many other challenges regarding discrimination prevention remain to be considered in future research. For example, the perception of discrimination, just like the perception of privacy, strongly depends on the legal and cultural conventions of a society. If substantially different discrimination definitions and/or measures were to be found, new data transformation methods would need to be designed.


References

[1] F. Kamiran and T. Calders, “Classification without discrimination”, Proc. of the 2nd IEEE Intl. Conf. on Computer, Control and Communication (IC4 2009). IEEE, 2009.


[2] F. Kamiran and T. Calders, “Classification with no discrimination by preferential sampling”, Proc. of the 19th Machine Learning Conf. of Belgium and The Netherlands, 2010.


[3] S. Ruggieri, D. Pedreschi and F. Turini, “Data mining for discrimination discovery”, ACM Trans. on Knowledge Discovery from Data, 4(2) Article 9, ACM, 2010.


[4] D. Pedreschi, S. Ruggieri and F. Turini, “Discrimination-aware data mining”, Proc. of the 14th ACM Intl. Conf. on Knowledge Discovery and Data Mining (KDD 2008), pp. 560-568. ACM, 2008.


[5] D. Pedreschi, S. Ruggieri and F. Turini, “Measuring discrimination in socially-sensitive decision records”, Proc. of the 9th SIAM Data Mining Conf. (SDM 2009), pp. 581-592. SIAM, 2009.


[6] S. Hajian and J. Domingo-Ferrer, “A methodology for direct and indirect discrimination prevention in data mining”, Manuscript, 2012.


[7] S. Hajian, J. Domingo-Ferrer and A. Martínez-Ballesté, “Discrimination prevention in data mining for intrusion and crime detection”, Proc. of the IEEE Symp. on Computational Intelligence in Cyber Security (CICS 2011), pp. 47-54. IEEE, 2011.


[8] S. Hajian, J. Domingo-Ferrer and A. Martínez-Ballesté, “Rule protection for indirect discrimination prevention in data mining”, Modeling Decisions for Artificial Intelligence-MDAI 2011, Lecture Notes in Computer Science 6820, pp. 211-222, 2011.


[9] T. Calders and S. Verwer, “Three naive Bayes approaches for discrimination-free classification”, Data Mining and Knowledge Discovery, 21(2):277-292, 2010.


[10] F. Kamiran, T. Calders and M. Pechenizkiy, “Discrimination aware decision tree learning”, Proc. of the IEEE Intl. Conf. on Data Mining (ICDM 2010), pp. 869-874. IEEE, 2010.


[11] V. Verykios and A. Gkoulalas-Divanis, “A survey of association rule hiding methods for privacy”, in Privacy-Preserving Data Mining: Models and Algorithms (eds. C. C. Aggarwal and P. S. Yu). Springer, 2008.

 

 
