Count Predicted Variable
John K. Kruschke, in Doing Bayesian Data Analysis (Second Edition), 2015
24.1.1 Data structure
Table 24.1 shows counts of hair color and eye color from a classroom poll of students at the University of Delaware (Snee, 1974). These are the same data as in Table 4.1 (p. 90), but with the original counts instead of proportions, and with levels re-ordered alphabetically. Respondents self-reported their hair color and eye color, using the labels shown in Table 24.1. The cells of the table indicate the frequency with which each combination occurred in the sample. Each respondent fell in one and only one cell of the table. The data to be predicted are the cell counts. The predictors are the nominal variables. This structure is analogous to two-way analysis of variance (ANOVA), which also had two nominal predictors, but had several metric values in each cell instead of a single count.
Table 24.1. Counts of combinations of hair color and eye color
| Eye color | Black | Blond | Brown/Brunette | Red | Marginal (eye color) |
|---|---|---|---|---|---|
| Blue | 20 | 94 | 84 | 17 | 215 |
| Brown | 68 | 7 | 119 | 26 | 220 |
| Green | 5 | 16 | 29 | 14 | 64 |
| Hazel | 15 | 10 | 54 | 14 | 93 |
| Marginal (hair color) | 108 | 127 | 286 | 71 | 592 |
Data adapted from Snee (1974).
For data like these, we can ask a number of questions. We could wonder about one predictor at a time, and ask questions such as, "How much more frequent are brown-eyed people than hazel-eyed people?" or "How much more frequent are black-haired people than blond-haired people?" Those questions are analogous to main effects in ANOVA. But usually we display the joint counts specifically because we're interested in the relationship between the variables. We would like to know if the distribution of counts across one predictor is contingent upon the level of the other predictor. For example, does the distribution of hair colors depend on eye color, and, specifically, is the proportion of blond-haired people the same for brown-eyed people as for blue-eyed people? These questions are analogous to interaction contrasts in ANOVA.
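The interaction question can be explored numerically. The following is a minimal Python sketch (not from the book) that stores the Table 24.1 counts and compares the conditional proportion of blond hair between blue-eyed and brown-eyed respondents:

```python
# Counts from Table 24.1; rows are eye colors, columns are hair colors
# ("brown" here abbreviates the Brown/Brunette hair column).
counts = {
    "blue":  {"black": 20, "blond": 94, "brown": 84,  "red": 17},
    "brown": {"black": 68, "blond": 7,  "brown": 119, "red": 26},
    "green": {"black": 5,  "blond": 16, "brown": 29,  "red": 14},
    "hazel": {"black": 15, "blond": 10, "brown": 54,  "red": 14},
}

def p_hair_given_eye(hair, eye):
    """Conditional proportion of one hair color within an eye-color row."""
    row = counts[eye]
    return row[hair] / sum(row.values())

# Interaction-style question: is the proportion of blond hair the same
# for blue-eyed people as for brown-eyed people?
print(round(p_hair_given_eye("blond", "blue"), 3))   # 0.437
print(round(p_hair_given_eye("blond", "brown"), 3))  # 0.032
```

The two conditional proportions differ by more than a factor of ten, which is exactly the kind of contingency an interaction contrast would probe.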
URL: https://www.sciencedirect.com/science/article/pii/B9780124058880000246
The R Programming Language
John K. Kruschke, in Doing Bayesian Data Analysis (Second Edition), 2015
3.5.1 The read.csv and read.table functions
Consider a small sample of people, for whom we record gender and hair color, and whom we ask to report a random number between 1 and 10. We also record each person's first name and which of two groups they are assigned to. It is typical to save this sort of data as a computer file in a format called comma separated values, or CSV format. In CSV format, there is a column for each type of measurement and a row for each person (or item) measured. The first row specifies the column names, and the information in each column is separated by commas. Here is an example:
Hair,Gender,Number,Name,Group
black,M,2,Alex,1
brown,F,4,Betty,1
blond,F,3,Carla,1
black,F,7,Diane,2
black,M,1,Edward,2
red,M,7,Frank,2
brown,F,10,Gabrielle,2
CSV files are especially useful because they are generic text files that virtually any computer system can read. A disadvantage of CSV files is that they can take up more storage space than compact binary formats, but these days storage is cheap and plentiful.
CSV files are easily loaded into R's memory using the read.csv function. Suppose that the data above are saved in a file called HGN.csv. Then the data can be loaded into a variable I've named HGNdf as follows:
> HGNdf = read.csv( "HGN.csv" )
The resulting variable, HGNdf, is a data frame in R. Thus, the columns of HGNdf are vectors or factors, named according to the words in the first row of the CSV file, and all of length equal to the number of data rows in the CSV file.
It is important to note that columns with any character (non-numeric) entries are turned into factors (recall that Section 3.4.2 described factors). For example, the Hair column is a factor:
> HGNdf$Hair
[1] black brown blond black black red brown
Levels: black blond brown red
> as.numeric(HGNdf$Hair)
[1] 1 3 2 1 1 4 3
The levels of the factor are alphabetical by default. We might want to reorder the levels to be more meaningful. For example, in this case we might want to reorder the hair colors from lightest to darkest. We can do that after loading the data, like this:
> HGNdf$Hair = factor( HGNdf$Hair , levels=c("red","blond","brown", "black"))
>
> HGNdf$Hair
[1] black brown blond black black red brown
Levels: red blond brown black
>
> as.numeric(HGNdf$Hair)
[1] 4 3 2 4 4 1 3
There might be times when we do not want a column with character entries to be treated as a factor. For example, the Name column is treated by read.csv as a factor because it has character entries:
> HGNdf$Name
[1] Alex Betty Carla Diane Edward Frank Gabrielle
Levels: Alex Betty Carla Diane Edward Frank Gabrielle
Because the names are never repeated, there are as many factor levels as entries in the column, and there might be little use in structuring it as a factor. To convert a factor to an ordinary vector, use the function as.vector, like this:
> HGNdf$Name = as.vector( HGNdf$Name )
> HGNdf$Name
[1] "Alex" "Betty" "Carla" "Diane" "Edward" "Frank" "Gabrielle"
There might be times when a column of integers is read by read.csv as a numeric vector, when you intended it to be treated as indexical levels for grouping the data. In other words, you would like the column to be treated as a factor, not as a numeric vector. This happened in the present example with the Group column:
> HGNdf$Group
[1] 1 1 1 2 2 2 2
Notice above that no levels are associated with the Group column. It is easy to convert the column to a factor:
> HGNdf$Group = factor( HGNdf$Group )
> HGNdf$Group
[1] 1 1 1 2 2 2 2
Levels: 1 2
The read.csv function is a special case of the more general read.table function. You can learn more about it by typing ?"read.table" at R's command prompt. For example, the stringsAsFactors argument turns off the default conversion of character columns into factors. (In R 4.0 and later, that conversion is off by default, so character columns are left as character vectors unless you explicitly request factors.)
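The essence of a factor, a table of levels plus an integer code per entry, can be mimicked outside R. Here is a minimal pure-Python sketch (the HGN data are embedded as a string rather than read from HGN.csv, so the example is self-contained); it reproduces the codes shown in the R transcript above:

```python
import csv
import io

# The HGN.csv contents from the text, embedded as a string.
hgn_csv = """Hair,Gender,Number,Name,Group
black,M,2,Alex,1
brown,F,4,Betty,1
blond,F,3,Carla,1
black,F,7,Diane,2
black,M,1,Edward,2
red,M,7,Frank,2
brown,F,10,Gabrielle,2"""

rows = list(csv.DictReader(io.StringIO(hgn_csv)))
hair = [row["Hair"] for row in rows]

# A factor is just a table of levels plus an integer code per entry.
levels = sorted(set(hair))                    # alphabetical, like R's default
codes = [levels.index(h) + 1 for h in hair]   # R's codes are 1-based
print(levels)  # ['black', 'blond', 'brown', 'red']
print(codes)   # [1, 3, 2, 1, 1, 4, 3]

# Reordering the levels (lightest to darkest) changes the codes, not the data.
levels = ["red", "blond", "brown", "black"]
codes = [levels.index(h) + 1 for h in hair]
print(codes)   # [4, 3, 2, 4, 4, 1, 3]
```

Note that reordering the levels leaves the underlying character data untouched; only the integer encoding changes, just as in the R output above.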
URL: https://www.sciencedirect.com/science/article/pii/B9780124058880000039
What is This Stuff Called Probability?
John K. Kruschke, in Doing Bayesian Data Analysis (Second Edition), 2015
4.4.1 Conditional probability
We often want to know the probability of one outcome, given that we know another outcome is true. For example, suppose I sample a person at random from the population referred to in Table 4.1. Suppose I tell you that this person has blue eyes. Conditional on that information, what is the probability that the person has blond hair (or any other particular hair color)? It is intuitively clear how to compute the answer: We see from the blue-eye row of Table 4.1 that the total (i.e., marginal) proportion of blue-eyed people is 0.36, and that 0.16 of the population has blue eyes and blond hair. Therefore, of the 0.36 with blue eyes, the fraction 0.16/0.36 has blond hair. In other words, of the blue-eyed people, 45% have blond hair. We also note that of the blue-eyed people, 0.03/0.36 = 8% have black hair. Table 4.2 shows this calculation for each of the hair colors.
Table 4.2. Example of conditional probability
| Eye color | Black | Brunette | Red | Blond | Marginal (eye color) |
|---|---|---|---|---|---|
| Blue | 0.03/0.36 = 0.08 | 0.14/0.36 = 0.39 | 0.03/0.36 = 0.08 | 0.16/0.36 = 0.45 | 0.36/0.36 = 1.0 |
Of the blue-eyed people in Table 4.1, what proportion has hair color h? Each cell shows p(h|blue) = p(blue, h)/p(blue), rounded to two decimal places.
The probabilities of the hair colors represent the credibilities of each possible hair color. For this group of people, the general probability of having blond hair is 0.21, as can be seen from the marginal distribution of Table 4.1. But when we learn that a person from this group has blue eyes, then the credibility of that person having blond hair increases to 0.45, as can be seen from Table 4.2. This reallocation of credibility across the possible hair colors is Bayesian inference! But we are getting ahead of ourselves; the next chapter will explain the basic mathematics of Bayesian inference in detail.
The intuitive computations for conditional probability can be denoted by simple formal expressions. We denote the conditional probability of hair color given eye color as p(h|e), which is spoken "the probability of h given e." The intuitive calculations above are then written p(h|e) = p(e, h)/p(e). This equation is taken as the definition of conditional probability. Recall that the marginal probability is merely the sum of the cell probabilities, and therefore the definition can be written p(h|e) = p(e, h)/p(e) = p(e, h)/Σ_h p(e, h). That equation can be confusing because the h in the numerator is a specific value of hair color, but the h in the denominator is a variable that takes on all possible values of hair color. To disambiguate the two meanings of h, the equation can be written p(h|e) = p(e, h)/p(e) = p(e, h)/Σ_h* p(e, h*), where h* ranges over the possible values of hair color.
The definition of conditional probability can be written using more general variable names, with r referring to an arbitrary row attribute and c referring to an arbitrary column attribute. Then, for attributes with discrete values, conditional probability is defined as
$$p(c|r) = \frac{p(r,c)}{p(r)} = \frac{p(r,c)}{\sum_{c^*} p(r,c^*)} \qquad (4.9)$$
When the column attribute is continuous, the sum becomes an integral:
$$p(c|r) = \frac{p(r,c)}{p(r)} = \frac{p(r,c)}{\int p(r,c^*)\,\mathrm{d}c^*} \qquad (4.10)$$
Of course, we can conditionalize on the other variable, instead. That is, we can consider p(r|c) instead of p(c|r). It is important to recognize that, in general, p(r|c) is not equal to p(c|r). For example, the probability that the ground is wet, given that it's raining, is different than the probability that it's raining, given that the ground is wet. The next chapter provides an extended discussion of the relationship between p(r|c) and p(c|r).
It is also important to recognize that there is no temporal order in conditional probabilities. When we say "the probability of x given y" we do not mean that y has already happened and x has yet to happen. All we mean is that we are restricting our calculations of probability to a particular subset of possible outcomes. A better gloss of p(x|y) is, "among all joint outcomes with value y, this proportion of them also has value x." So, for example, we can talk about the conditional probability that it rained the previous night given that there are clouds the next morning. This is simply referring to the proportion of all cloudy mornings that had rain the night before.
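The definition p(h|e) = p(e, h)/Σ_h* p(e, h*) can be checked numerically. The following Python sketch (not from the book) applies it to the joint proportions of Table 4.1; because the table entries are rounded to two decimals, the blond result comes out 0.44 rather than the 0.45 reported in the text:

```python
# Joint proportions p(eye, hair) adapted from Table 4.1 (rounded to 2 decimals).
joint = {
    "brown": {"black": 0.11, "brunette": 0.20, "red": 0.04, "blond": 0.01},
    "blue":  {"black": 0.03, "brunette": 0.14, "red": 0.03, "blond": 0.16},
    "hazel": {"black": 0.03, "brunette": 0.09, "red": 0.02, "blond": 0.02},
    "green": {"black": 0.01, "brunette": 0.05, "red": 0.02, "blond": 0.03},
}

def p_hair_given_eye(h, e):
    """p(h|e) = p(e, h) / sum over h* of p(e, h*)."""
    p_e = sum(joint[e].values())  # marginal p(e), summed over hair colors
    return joint[e][h] / p_e

# Conditional distribution of hair color among blue-eyed people.
for h in joint["blue"]:
    print(h, round(p_hair_given_eye(h, "blue"), 2))
```

By construction, the four conditional probabilities sum to 1, as in the marginal column of Table 4.2.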
URL: https://www.sciencedirect.com/science/article/pii/B9780124058880000040
Bayes' Rule
John K. Kruschke, in Doing Bayesian Data Analysis (Second Edition), 2015
5.1.2 Bayes' rule intuited from a two-way discrete table
Consider Table 5.1, which shows the joint probabilities of a row attribute and a column attribute, along with their marginal probabilities. In each cell, the joint probability p(r, c) is re-expressed by the equivalent form p(r | c) p(c) from the definition of conditional probability in Equation 5.3. The marginal probability p(r) is re-expressed by the equivalent form Σ c* p(r | c*) p(c*), as was done in Equations 4.9 and 5.6. Notice that the numerator of Bayes' rule is the joint probability, p(r, c), and the denominator of Bayes' rule is the marginal probability, p(r). Looking at Table 5.1, you can see that Bayes' rule gets us from the lower marginal distribution, p(c), to the conditional distribution p(c | r) when focusing on row value r. In summary, the key idea is that conditionalizing on a known row value is like restricting attention to only the row for which that known value is true, and then normalizing the probabilities in that row by dividing by the row's total probability. This act of spatial attention, when expressed in algebra, yields Bayes' rule.
Table 5.1. A table for making Bayes' rule not merely special but spatial
| Row \ Column | … | c | … | Marginal |
|---|---|---|---|---|
| ⋮ | | ⋮ | | ⋮ |
| r | … | p(r, c) = p(r \| c) p(c) | … | p(r) = Σ_c* p(r \| c*) p(c*) |
| ⋮ | | ⋮ | | ⋮ |
| Marginal | … | p(c) | … | |
When conditionalizing on row value r, the conditional probability p(c | r) is simply the cell probability, p(r, c), divided by the marginal probability, p(r). When algebraically re-expressed as shown in the table, this is Bayes' rule. Spatially, Bayes' rule gets us from the lower marginal distribution, p(c), to the conditional distribution p(c | r) when focusing on row value r.
A concrete example of going from marginal to conditional probabilities was provided in the previous chapter, regarding eye color and hair color. Let's revisit it now. Table 5.2 shows the joint and marginal probabilities of various combinations of eye color and hair color. Without knowing anything about a person's eye color, all we believe about hair colors is expressed by the marginal probabilities of the hair colors, at the bottom of Table 5.2. However, if we are told that a randomly selected person's eyes are blue, then we know that this person comes from the "blue" row of the table, and we can focus our attention on that row. We compute the conditional probabilities of the hair colors, given the eye color, as shown in Table 5.3. Notice that we have gone from the "prior" (marginal) beliefs about hair color before knowing eye color, to the "posterior" (conditional) beliefs about hair color given the observed eye color. For example, without knowing the eye color, the probability of blond hair in this population is 0.21. But with knowing that the eyes are blue, the probability of blond hair is 0.45.
Table 5.2. Proportions of combinations of hair color and eye color
| Eye color | Black | Brunette | Red | Blond | Marginal (eye color) |
|---|---|---|---|---|---|
| Brown | 0.11 | 0.20 | 0.04 | 0.01 | 0.37 |
| Blue | 0.03 | 0.14 | 0.03 | 0.16 | 0.36 |
| Hazel | 0.03 | 0.09 | 0.02 | 0.02 | 0.16 |
| Green | 0.01 | 0.05 | 0.02 | 0.03 | 0.11 |
| Marginal (hair color) | 0.18 | 0.48 | 0.12 | 0.21 | 1.0 |
Some rows or columns may not sum exactly to their displayed marginals because of rounding error from the original data. Data adapted from Snee (1974). This is Table 4.1, duplicated here for convenience.
Table 5.3. Example of conditional probability
| Eye color | Black | Brunette | Red | Blond | Marginal (eye color) |
|---|---|---|---|---|---|
| Blue | 0.03/0.36 = 0.08 | 0.14/0.36 = 0.39 | 0.03/0.36 = 0.08 | 0.16/0.36 = 0.45 | 0.36/0.36 = 1.0 |
Of the blue-eyed people in Table 5.2, what proportion has hair color h? Each cell shows p(h|blue) = p(blue, h)/p(blue), rounded to two decimal places. This is Table 4.2, duplicated here for convenience.
The example involving eye color and hair color illustrates conditional reallocation of credibility across column values (hair colors) when given information about a row value (eye color). But the example uses joint probabilities p(r, c) that are directly provided as numerical values, whereas Bayes' rule instead involves joint probabilities expressed as p(r|c) p(c), as shown in Equation 5.6 and Table 5.1. The next example provides a concrete situation in which it is natural to express joint probabilities as p(r|c) p(c).
Consider trying to diagnose a rare disease. Suppose that in the general population, the probability of having the disease is only one in a thousand. We denote the true presence or absence of the disease as the value of a parameter, θ, that can have the value θ = present if the person has the disease, or θ = absent if the person does not.
Suppose that there is a test for the disease that has a 99% hit rate, which means that if a person has the disease, then the test result is positive 99% of the time. We denote a positive test result as T = + and a negative test result as T = −. The observed test result is the datum that we will use to modify our belief about the value of the underlying disease parameter. The hit rate is expressed formally as p(T = + | θ = present) = 0.99. The test also has a false alarm rate of 5%, meaning that the test result is positive 5% of the time for people who do not have the disease: p(T = + | θ = absent) = 0.05.
Suppose we sample a person at random from the population, administer the test, and it comes up positive. What is the posterior probability that the person has the disease? Mathematically expressed, we are asking: what is p(θ = present | T = +)?
Table 5.4 shows how to conceptualize disease diagnosis as a case of Bayes' rule. The base rate of the disease is shown in the lower marginal of the table. Because the background probability of having the disease is p(θ = present) = 0.001, the probability of not having the disease is the complement, p(θ = absent) = 1 − 0.001 = 0.999. Without any information about test results, this lower marginal probability is our prior belief about a person having the disease.
Table 5.4. Joint and marginal probabilities of test results and disease states

| | θ = present | θ = absent | Marginal |
|---|---|---|---|
| T = + | 0.99 × 0.001 | 0.05 × 0.999 | 0.99 × 0.001 + 0.05 × 0.999 |
| T = − | 0.01 × 0.001 | 0.95 × 0.999 | 0.01 × 0.001 + 0.95 × 0.999 |
| Marginal | 0.001 | 0.999 | 1.0 |

For this example, the base rate of the disease is 0.001, as shown in the lower marginal. The test has a hit rate of 0.99 and a false alarm rate of 0.05, as shown in the row for T = +. For an actual test result, we restrict attention to the corresponding row of the table and compute the conditional probabilities of the disease states via Bayes' rule.
Table 5.4 shows the joint probabilities of test results and disease states in terms of the hit rate, false alarm rate, and base rate. For example, the joint probability of the test being positive and the disease being present is shown in the upper-left cell as p(T = +, θ = present) = p(T = + | θ = present) p(θ = present) = 0.99 × 0.001.
Suppose we select a person at random and administer the diagnostic test, and the test result is positive. To determine the probability of having the disease, we should restrict attention to the row marked T = + and compute the conditional probabilities p(θ | T = +) via Bayes' rule. In particular, we find that

$$p(\theta = \text{present} \mid T = +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \approx 0.019$$
Yes, that's correct: even with a positive result from a test with a 99% hit rate, the posterior probability of having the disease is only 1.9%. This low probability is a consequence of the low prior probability of the disease and the non-negligible false alarm rate of the test. A caveat regarding interpretation of the results: remember that here we have assumed that the person was selected at random from the population, and there were no other symptoms that motivated getting the test. If there were other symptoms that indicated the disease, those data would also have to be taken into account.
To summarize the example of disease diagnosis: We started with the prior credibility of the two disease states (present or absent) as indicated by the lower marginal of Table 5.4. We used a diagnostic test that has a known hit rate and false alarm rate, which are the conditional probabilities of a positive test result for each disease state. When an observed test result occurred, we restricted attention to the corresponding row of Table 5.4 and computed the conditional probabilities of the disease states in that row via Bayes' rule. These conditional probabilities are the posterior probabilities of the disease states. The conditional probabilities are the re-allocated credibilities of the disease states, given the data.
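The whole computation condenses to a few lines. The following Python sketch (not from the book) applies Bayes' rule to the base rate, hit rate, and false alarm rate of the example:

```python
# Bayes' rule for the disease-diagnosis example: base rate 0.001,
# hit rate 0.99, false alarm rate 0.05 (values from the text).
prior_present = 0.001                 # p(disease present)
prior_absent = 1 - prior_present      # p(disease absent)
hit_rate = 0.99                       # p(T = + | present)
false_alarm_rate = 0.05               # p(T = + | absent)

# Marginal probability of a positive test result, p(T = +).
p_positive = hit_rate * prior_present + false_alarm_rate * prior_absent

# Posterior probability of the disease given a positive test.
posterior = hit_rate * prior_present / p_positive
print(round(posterior, 3))  # 0.019
```

The denominator is exactly the T = + row total of Table 5.4; dividing by it is the "restrict attention to the row, then normalize" operation described in the text.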
URL: https://www.sciencedirect.com/science/article/pii/B9780124058880000052
Handbook of Statistics
D.A. Reid, ..., A. Ross, in Handbook of Statistics, 2013
1 Introduction
Biometrics provides an automated method to identify people based on their physical or behavioral characteristics. Classical examples of biometric traits include fingerprints, irises, and faces which have been evaluated and demonstrated to be useful in several different applications ranging from laptop access to border control systems. Although these traits have been successfully incorporated in operational systems, there are several challenges that are yet to be addressed. For example, the utility of these traits significantly decreases when the input data is degraded or when the distance between the sensor and the subject increases. Thus, the use of alternate human attributes may be necessary to establish an individual's identity.
Soft biometric traits are physical or behavioral features which can be described by humans. Height, weight, hair color, and ethnicity are common examples of soft traits: they are not unique to the individual but can be aggregated to provide discriminative biometric signatures. Although these types of biometric traits have only been recently considered in biometrics, they have tremendous potential for human identification by enhancing the recognition performance of primary biometric traits.
Identification from a distance has become important due to the ever-increasing surveillance infrastructure that is being deployed in society. Primary biometric traits capable of identifying humans from a distance, viz., face and gait, are negatively impacted by the limited frame rate and low image resolution of most CCTV cameras. Figure 1 shows an example of a typical CCTV video frame. This frame shows a suspect in the murder of a Hamas commander in Dubai in 2010. The murder involved an 11-strong hit squad using fake European passports to enter the country. All the members of the hit squad used disguises including wigs and fake beards during the operation. From Fig. 1 it can be observed that although the image is at low resolution and the subjects' face and ocular features are occluded, a number of soft biometric features such as hair color, skin color, and body geometry can be deduced. Soft biometric traits can be extracted from very low quality data such as those generated by surveillance cameras. They also require limited cooperation from the subject and can be non-intrusively obtained, making them ideal in surveillance applications. One of the main advantages of soft biometric traits is their relationship with conventional human descriptions (Samangooei et al., 2008); humans naturally use soft biometric traits to identify and describe each other. On the other hand, translating conventional primary biometric features into human descriptive forms may not always be possible. This is the semantic gap that exists between how machines and people recognize humans. Soft biometrics bridge this gap, allowing conversion between human descriptions and biometrics. Very often, in eyewitness reports, a physical description of a suspect may be available (e.g., "The perpetrator was a short male with brown hair"). An appropriate automation scheme can convert this description into a soft biometric feature set. 
Thus, by using soft biometrics, surveillance footage archives can be automatically searched based on a human description.
Fig. 1. Surveillance frame displaying a few challenges in establishing the identity of individuals in surveillance videos.
Biometric traits should exhibit limited variations across multiple observations of a subject and large variations across multiple subjects. The extent of these variations defines the discriminative ability of the trait and, hence, its identification potential. Soft biometrics, by definition, exhibit low variance across subjects and as such rely on statistical analysis to identify suitable combinations of traits and their application potential. This chapter examines the current state of the art in the emerging field of soft biometrics. Section 2 introduces the performance metrics used in biometrics. Section 3 discusses how soft traits can be used to improve the performance of classical biometric systems based on primary biometric traits. Using soft biometric traits to identify humans is reviewed in Section 4. Identifying gender from facial images is explored in Section 5. Finally, Section 6 explores some of the possible applications of soft biometrics.
URL: https://www.sciencedirect.com/science/article/pii/B9780444538598000138
Analysis of Variance
Sheldon M. Ross, in Introductory Statistics (Fourth Edition), 2017
Review Problems
1. A corporation has three apparently identical manufacturing plants. Wanting to see if these plants are equally effective, management randomly chose 30 days. On 10 of these days it determined the daily output at plant 1; on another 10 days, the daily output at plant 2; and on the final 10 days, the daily output at plant 3. The following summary data give the sample means and sample variances of the daily numbers of items produced at the three plants over those days.

    | Plant | Sample mean | Sample variance |
    |---|---|---|
    | i = 1 | 325 | 450 |
    | i = 2 | 413 | 520 |
    | i = 3 | 366 | 444 |

    Test the hypothesis that the mean number of items produced daily is the same for all three plants. Use the 5 percent level of significance.
2. Sixty nonreading preschool students were randomly divided into four groups of 15 each. Each group was given a different type of course in learning how to read. Afterward, the students were tested, with the following results (sample mean and sample variance of the scores in each group).

    | Group | Sample mean | Sample variance |
    |---|---|---|
    | 1 | 65 | 224 |
    | 2 | 62 | 241 |
    | 3 | 68 | 233 |
    | 4 | 61 | 245 |

    Test the null hypothesis that the reading courses are equally effective. Use the 5 percent level of significance.
3. Preliminary studies indicate a possible connection between one's natural hair color and threshold for pain. A sample of 12 women was classified as having light, medium, or dark hair. Each was then given a pain sensitivity test, with the following scores resulting.

    | Light | Medium | Dark |
    |---|---|---|
    | 63 | 60 | 45 |
    | 72 | 48 | 33 |
    | 52 | 44 | 57 |
    | 60 | 53 | 40 |

    Are the given data sufficient to establish that hair color affects the results of a pain sensitivity test? Use the 5 percent level of significance.
4. Three different washing machines were employed to test four different detergents. The following data give a coded score of the effectiveness of each washing.

    | Detergent | Machine 1 | Machine 2 | Machine 3 |
    |---|---|---|---|
    | 1 | 53 | 50 | 59 |
    | 2 | 54 | 54 | 60 |
    | 3 | 56 | 58 | 62 |
    | 4 | 50 | 45 | 57 |

    - (a) Estimate the improvement in mean value with detergent 1 over detergent (i) 2, (ii) 3, and (iii) 4.
    - (b) Estimate the improvement in mean value when machine 3 is used as opposed to machine (i) 1 and (ii) 2.
    - (c) Test the hypothesis that the detergent used does not affect the score.
    - (d) Test the hypothesis that the machine used does not affect the score.

    In both (c) and (d), use the 5 percent level of significance.
5. Suppose in Prob. 4 that the 12 applications of the detergents were all on different, randomly chosen machines. Test the hypothesis, at the 5 percent significance level, that the detergents are equally effective.
6. In Example 11.3, test the hypothesis that the mean test score depends only on the test taken and not on which student is taking the test.
7. A manufacturer of women's beauty products is considering four new variations of a hair dye. An important consideration for a hair dye is its lasting power, defined as the number of days until treated hair becomes indistinguishable from untreated hair. To learn about the lasting power of its new variations, the company hired three long-haired women. Each woman's hair was divided into four sections, and each section was treated with one of the dyes. The following data concerning the lasting power resulted.

    | Woman | Dye 1 | Dye 2 | Dye 3 | Dye 4 |
    |---|---|---|---|---|
    | 1 | 15 | 20 | 27 | 21 |
    | 2 | 30 | 33 | 25 | 27 |
    | 3 | 37 | 44 | 41 | 46 |

    - (a) Test, at the 5 percent level of significance, the hypothesis that the four variations have the same mean lasting power.
    - (b) Estimate the mean lasting power obtained when woman 2 uses dye 2.
    - (c) Test, at the 5 percent level of significance, the hypothesis that the mean lasting power does not depend on which woman is being tested.
8. Use the following data to test the hypotheses of (a) no row effect and (b) no column effect.

        17 23 35 39  5
        42 28 19 40 14
        36 23 31 44 13
        27 40 25 50 17

9. Problem 9 of Sec. 11.2 implicitly assumes that the number of deaths is not affected by the year under consideration. However, consider a two-factor ANOVA model for this problem.

    - (a) Test the hypothesis that there is no effect due to the year.
    - (b) Test the hypothesis that there is no seasonal effect.
10. The following data relate to the ages at death of a certain species of rats that were fed one of three types of diet. The rats chosen were of a type having a short life span, and they were randomly divided into three groups. The data are the sample means and sample variances of the ages at death (measured in months) of the three groups. Each group is of size 8.

    | | Very low-calorie | Moderate-calorie | High-calorie |
    |---|---|---|---|
    | Sample mean | 22.4 | 16.8 | 13.7 |
    | Sample variance | 24.0 | 23.2 | 17.1 |

    Test the hypothesis, at the 5 percent level of significance, that the mean lifetime of a rat is not affected by its diet. What about at the 1 percent level?
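Several of the problems above ask for a one-way ANOVA computed from group sample means and variances. As a minimal sketch (not from the book), the F statistic for equal group sizes can be computed as follows, shown on the Problem 10 diet data:

```python
# One-way ANOVA F statistic from summary statistics, for k groups of
# common size n. Shown on the Problem 10 diet data.
def anova_f(means, variances, n):
    """F = MS(between) / MS(within); df are (k - 1, k * (n - 1))."""
    k = len(means)
    grand_mean = sum(means) / k
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ms_between = ss_between / (k - 1)
    ms_within = sum(variances) / k    # pooled within-group variance
    return ms_between / ms_within

f = anova_f(means=[22.4, 16.8, 13.7], variances=[24.0, 23.2, 17.1], n=8)
print(round(f, 2))  # 7.26
```

The resulting F ≈ 7.26 would then be compared with the tabulated F value for 2 and 21 degrees of freedom at the chosen significance level.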
URL: https://www.sciencedirect.com/science/article/pii/B9780128043172000114
Discovery of Abstract Knowledge from Non-Atomic Attribute Values in Fuzzy Relational Databases
Rafal A. Angryk, Frederick E. Petry, in Modern Information Processing, 2006
3.2 Character of imprecision reflected in fuzzy records
Before introducing our approach to AOI from imprecise data, let us briefly analyze the nature of the uncertainty representation allowed in the fuzzy database model. There are two actual representations of imprecision in the fuzzy database schema. First, as already mentioned, is the occurrence of multiple attribute values. Obviously, the more descriptors we use to characterize a particular record in the database, the more imprecise is its depiction. Uncertainty about the description is also implicitly reflected in the similarity of the values characterizing a particular entity. For example, when we describe someone's hair as {black, dark brown, red, auburn}, we have more doubt about the person's hair colour than when we characterize it as {blond, dark blond, light brown, brown}, since the second description would be immediately interpreted as "blondish". Each case uses the same number of attribute values, but the higher similarity among the values in the second set means that the second description carries more information.
The imprecision of the original information is actually reflected both in the number of inserted descriptors for a particular attribute and in the similarity of these values. In Table 4 we summarize observations concerning their relationship. The domain called Quantity of attribute values is a discrete set of integer numbers (> 0, since the fuzzy model does not allow empty attributes); the Similarity of attribute values is characterized in fuzzy databases with a continuous set of real numbers in a range [0,1] – the values of α.
Table 4. Character of information stored in the Fuzzy Databases.
| Quantity \ Similarity of attr. values | LOW | HIGH |
|---|---|---|
| SMALL | Imprecise | Precise |
| LARGE | Imprecise (error suspected) | Precise (confirmed) |
The simplified characterization of data imprecision presented in Table 4 can be enhanced with a brief analysis of the boundary values. The measure of imprecision can be thought of as ranging between 0 (i.e., no uncertainty about the results) and infinity (maximum imprecision). The common opinion that even flawed information is better than no information at all leads us to say that imprecision reaches its maximum limit when no data are inserted at all. Since the fuzzy database model does not allow empty attributes, we will not consider this further. The minimum imprecision (0-level) is achieved by a single attribute value. If there are no other descriptors or auxiliary information, we must assume the inserted value is a perfect characterization of the particular entity's feature. The same minimum can also be accomplished with multiple values if they all have identical meaning (synonyms). Although multiple, identical descriptors additionally confirm an initially inserted value, they cannot further reduce imprecision, since it is already at its minimum. Therefore, descriptors that are so similar they are considered identical can be reduced to a single descriptor. Obviously, some attribute values initially considered different may be treated as identical at a higher abstraction level. We can therefore conclude that the practically achievable minimum of imprecision depends on the abstraction level of the employed descriptors, and can reach its original 0-level only at the lowest level of abstraction (for α = 1.0 in our fuzzy database model).
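The Table 4 characterization can be turned into a toy classifier. In the sketch below (not from the chapter), the pairwise similarity values and the threshold are hypothetical illustrations, chosen only to reproduce the hair-colour example above:

```python
from itertools import combinations

# Hypothetical pairwise similarities on [0, 1] between hair descriptors;
# these numbers are illustrative only, not part of the fuzzy model.
sim = {frozenset(pair): s for pair, s in [
    (("black", "dark brown"), 0.7), (("black", "red"), 0.1),
    (("black", "auburn"), 0.2), (("dark brown", "red"), 0.3),
    (("dark brown", "auburn"), 0.5), (("red", "auburn"), 0.8),
    (("blond", "dark blond"), 0.9), (("blond", "light brown"), 0.7),
    (("blond", "brown"), 0.6), (("dark blond", "light brown"), 0.8),
    (("dark blond", "brown"), 0.7), (("light brown", "brown"), 0.9),
]}

def characterize(values, threshold=0.55):
    """Table 4 reading: high mutual similarity -> precise description."""
    if len(values) == 1:
        return "precise"  # a single descriptor: minimum imprecision
    lowest = min(sim[frozenset(p)] for p in combinations(values, 2))
    return "precise" if lowest >= threshold else "imprecise"

print(characterize({"black", "dark brown", "red", "auburn"}))         # imprecise
print(characterize({"blond", "dark blond", "light brown", "brown"}))  # precise
```

The first set is judged imprecise because it mixes dissimilar descriptors, while the second ("blondish") set is judged precise despite having the same number of values, exactly the contrast described in the text.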
URL: https://www.sciencedirect.com/science/article/pii/B9780444520753500157
Statistical analysis of multivariate data
Milan Meloun, Jiří Militký, in Statistical Data Analysis, 2011
4.1.1 Description of Data
Multivariate data matrix – The data with which we are primarily concerned consist of a series of variables (sometimes called characteristics or properties) ξ1, …, ξm in m columns (i.e. measurements or observations: power of engine, weight of car, petrol consumption, length of car, etc.) made on a number of objects or cases in n rows (patients, individuals, molecules or other entities of interest). A typical non-structured multivariate data matrix, X, is the n × m array
X = (xij), i = 1, …, n, j = 1, …, m,
where the typical element, xij, is the value of the jth variable ξj for the ith object. If there are several distinct groups of individuals, one of the columns (usually the first) might be a categorical variable with values 1, 2, etc. to distinguish these groups. The number of objects under investigation is n and the number of variables taken on each of these n objects is m. In the case of structured data, the matrix X is divided into a submatrix Y (n × p) containing dependent variables (goal variables) and a submatrix Z (n × (m − p)) of explanatory variables. From a statistical point of view, ξ1, …, ξm are random variables [28].
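The structured-data layout just described can be sketched as follows. This is a minimal numpy illustration with made-up dimensions and values, not an example from the chapter.

```python
# Sketch: an n x m data matrix X split into a submatrix Y of p dependent
# (goal) variables and a submatrix Z of the remaining m - p explanatory
# variables, as in the structured-data case above.
import numpy as np

n, m, p = 4, 5, 2                                 # 4 objects, 5 variables, 2 goal variables
X = np.arange(n * m, dtype=float).reshape(n, m)   # typical element x_ij

Y = X[:, :p]    # dependent (goal) variables, shape (n, p)
Z = X[:, p:]    # explanatory variables, shape (n, m - p)

print(Y.shape, Z.shape)   # (4, 2) (4, 3)
```

Here the goal variables are assumed to occupy the first p columns; in practice any column selection would do, as long as the two submatrices partition the columns of X.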
Classification of variables – Broadly speaking, there are three main measurement scales of data encountered in practice: nominal, ordinal and cardinal. The most common way of distinguishing between these types of variables is the following [25]:
- (a) Nominal scale variables are unordered categorical variables. Each observation belongs to one of a set of mutually exclusive and collectively exhaustive categories. These categories have no natural or necessary order relative to each other. For these variables, the operators of equality (=) and inequality (≠) are defined. Nominal variables are often binary (1 – presence, 0 – absence). Examples include treatment allocation, the sex of the respondent, hair colour, presence or absence of depression, and so on. In computerized data analysis, numbers are often used as symbols.
- (b) Ordinal scale variables are ranked data where there is an ordering of categories. For these variables the operators less than (<) and greater than (>) are defined as well. Each observation belongs to one of a set of mutually exclusive and collectively exhaustive categories, but these categories are naturally ordered (usually the higher category is better: the first category is the worst and the last the best). Categories are ordered, but the differences between them are not quantified. Categories are characterized by symbols with a natural ordering, such as letters (e.g. A, B, C, …) or numbers. Examples include social class and self-perception of health (each coded from 1 to 5, say), hand feeling (not acceptable = 1, bad = 2, medium = 3, good = 4, fully acceptable = 5) and educational level (no schooling = 1, primary = 2, secondary = 3 or tertiary = 4 education).
- (c) Cardinal scale variables are data containing quantitative information for which a distance (or norm) is defined. The basic mathematical operations such as addition, subtraction, multiplication and division can be used. From a statistical point of view, cardinal scale data are discrete (categorized) or continuous. In fact, all measured data are discrete, and the number of categories depends on the rounding. The cardinal scale is divided into two subcategories:
Interval scale variables – where there are equal differences between successive points on the scale, but the position of zero is arbitrary. The classic example is the measurement of temperature using the Celsius or Fahrenheit scales. In some cases a variable such as a measure of depression, anxiety or intelligence, for example, might be treated as if it were interval-scaled when this, in fact, might be difficult to justify.
Ratio scale variables are interval variables with a natural point representing the origin of measurement, i.e. a natural zero point. It also represents the highest level of measurement, where one can investigate the relative magnitude of scores as well as the differences between them. The position of zero is fixed. The classic example is the absolute measure of temperature (in kelvin, for example) but other common ones include age (or any other time from a fixed event), weight and length.
A higher scale type naturally includes the properties of the lower scales and can be transformed (with loss of information) into a lower scale.
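This downward transformation can be illustrated with a short sketch, not from the chapter: a ratio-scale variable (temperature in kelvin, with made-up values and made-up category boundaries) is mapped to an ordinal scale. The ordering survives, but the distances between values are lost.

```python
# Sketch: ratio scale -> ordinal scale, with loss of information.
# The cut points 273.15 K and 300 K are arbitrary choices for illustration.
def to_ordinal(kelvin):
    if kelvin < 273.15:
        return 1   # "cold"
    elif kelvin < 300.0:
        return 2   # "mild"
    else:
        return 3   # "hot"

temps = [250.0, 280.0, 285.0, 310.0]
print([to_ordinal(t) for t in temps])   # [1, 2, 2, 3]
```

After the transformation, 280.0 K and 285.0 K become indistinguishable (both category 2), and the 5 K gap between them can no longer be recovered; only the ordering of the categories remains meaningful.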
Equality, difference and similarity – The concepts of equality, difference and similarity seem trivial, but each data set requires careful consideration of the meanings of these three words, because the result of data analysis depends greatly on them [26]. The concept of similarity depends on the aim of data analysis. MDA searches for relationships among objects, among variables, and between objects and variables. Objects can be equal, similar, dissimilar, or proportional mixtures. Variables can be equal, similar, dissimilar, proportional, or linear combinations. This is the domain of cluster analysis, the search for clusters of similar objects (clustering of objects) or of similar variables (clustering of variables). When cluster analysis tells us that some clusters (categories) of objects are present, we use classification analysis to assign objects to one of these categories.
Dissimilarity between entities forms the starting point of various techniques, so it is worth gathering the basic ideas together in one section. Since dissimilarity is so closely linked to the idea of distance, one natural way of measuring it is by a familiar metric such as Euclidean distance. If s is the similarity between two entities (usually in the range 0 ≤ s ≤ 1), then the dissimilarity d is its direct opposite and may be obtained by any monotonically decreasing transformation of s; the most common such transformation is d = 1 − s. All dissimilarity measure formulae naturally assume that a set of data is available on which to apply them.
Dissimilarity dkl between objects k and l: the numerical value xkj is observed for the jth variable on the kth object in the sample.
- (a) Euclidean distance (metric): dkl = √( Σj (xkj − xlj)² ).
- (b) Hamming or Manhattan distance (metric): dkl = Σj |xkj − xlj|.
- (c) General Minkowski distance (metric): dkl = ( Σj |xkj − xlj|^p )^(1/p), where for p = 1 it is the Hamming distance and for p = 2 the Euclidean metric. The consequence of increasing p is increasingly to exaggerate the more dissimilar units relative to the similar ones.
- (d) Mahalanobis distance (metric) can be thought of as an appropriate statistical distance for use in sample space, analogous to the Euclidean distance but accounting for the different variances of, and covariances between, the variables, expressed by the covariance matrix C: dkl = √( (xk − xl)ᵀ C⁻¹ (xk − xl) ).
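The four metrics above can be sketched in a few lines. This is a minimal illustration with made-up vectors; the covariance matrix C is an assumed diagonal example, not taken from the text.

```python
# Sketch: Euclidean, Manhattan/Hamming, Minkowski and Mahalanobis distances
# between two objects x_k and x_l (made-up numbers).
import numpy as np

xk = np.array([1.0, 2.0, 3.0])
xl = np.array([2.0, 0.0, 3.0])

euclid = np.sqrt(np.sum((xk - xl) ** 2))                 # Minkowski with p = 2
manhattan = np.sum(np.abs(xk - xl))                      # Minkowski with p = 1
p = 3
minkowski = np.sum(np.abs(xk - xl) ** p) ** (1.0 / p)    # general Minkowski

C = np.diag([1.0, 4.0, 1.0])                             # assumed covariance matrix
diff = xk - xl
mahalanobis = np.sqrt(diff @ np.linalg.inv(C) @ diff)    # statistical distance

print(euclid, manhattan, minkowski, mahalanobis)
```

Note how the second variable's larger variance (4.0 in C) shrinks its contribution to the Mahalanobis distance relative to the Euclidean one, which is exactly the adjustment the metric is designed to make.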
Measures of similarity: xij can take just two values, which can be arbitrarily coded 0 and 1, say. To compute a similarity between two entities, therefore, the relevant data can be reduced to the 2 × 2 table:
| | Entity 2: 0 | Entity 2: 1 |
|---|---|---|
| Entity 1: 0 | a | c |
| Entity 1: 1 | b | d |
This table shows that, out of all the possible pairwise comparisons between the two entities, a show 0–0 in both positions, d show 1–1 in both positions, b show the disagreement 1–0, while c show the reverse disagreement 0–1.
Similarities between two objects: In this case we compare the p variable values for the two objects, so a + b + c + d = p. The four most common measures of similarity are as follows:
- (a) Sokal–Michener coefficient of association: S = (a + d) / (a + b + c + d),
- (b) Russel–Rao coefficient of association: S = d / (a + b + c + d),
- (c) Hamann coefficient of association: S = ((a + d) − (b + c)) / (a + b + c + d),
- (d) Correlation coefficient: S = (ad − bc) / √( (a + b)(c + d)(a + c)(b + d) ).
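The counts a, b, c, d and the four coefficients above can be computed as follows. This is a minimal sketch with made-up binary vectors, using the labels as defined in the text (a = 0–0 matches, d = 1–1 matches, b = 1–0 and c = 0–1 disagreements).

```python
# Sketch: 2x2 counts and the four similarity coefficients for two binary objects.
def binary_counts(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # 0-0 matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # 1-0 disagreements
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # 0-1 disagreements
    d = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # 1-1 matches
    return a, b, c, d

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
a, b, c, d = binary_counts(x, y)
p = a + b + c + d                      # total number of variables compared

sokal_michener = (a + d) / p
russel_rao = d / p
hamann = ((a + d) - (b + c)) / p
phi = (a * d - b * c) / ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5

print(a, b, c, d)                      # 2 1 1 2
print(sokal_michener, russel_rao, hamann, phi)
```

Note the different treatment of 0–0 matches: the Sokal–Michener coefficient counts them as agreement, while the Russel–Rao coefficient credits only the 1–1 matches.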
More details about practical applications of these measures are given in Sections 4.9 and 4.10.
URL: https://www.sciencedirect.com/science/article/pii/B9780857091093500042
Functional data analysis in evolutionary biology
Nancy E. Heckman , in Recent Advances and Trends in Nonparametric Statistics, 2003
1 INTRODUCTION
Animal breeders and evolutionary biologists study how physical traits change from generation to generation. Which dairy cattle should we breed to maximize milk production in the next generation? Caterpillars grow at a rate that is temperature dependent. Does the pattern of this dependence have a genetic component? Is the pattern related to survival?
A physical trait is called a phenotype. It can be qualitative such as hair colour, quantitative such as height, a vector such as (height,weight), or a function such as growth as a function of temperature. Understanding how phenotypes evolve is a basic goal of quantitative genetics.
Two components of evolution of phenotypes are a selection mechanism and genetic variability. If there is no selection, the distribution of the phenotype remains the same from generation to generation. If there is selection, the selection mechanism can be chosen, as in animal breeding, or it can be unknown and to be estimated, as in evolutionary biology. However, the presence of selection doesn't guarantee evolution of the phenotype. If there is no genetic variability, the phenotype cannot evolve save by mutation. Thus, knowledge about the genetic variability of the phenotype is important in determining how phenotypes evolve. Here, evolution refers to adaptive evolution, that is, to evolution governed by selection and genetic variability as opposed to, e.g., mutation.
Figure 1 shows two sets of curves. Each curve shows the average distance (kilometers per week) run on a wheel as a function of age. The average is over mice in a particular breeding line. The curves are constructed from data first discussed in [1] and analyzed in [2]. The top four curves come from four lines of mice bred for fast wheel-running at age eight weeks. The bottom four curves come from a control group of four lines of mice. The researchers are interested in studying the over-all effect of the selective breeding. From the plots, it seems that the biggest effect is an over-all increase in wheel-running at all ages. Therefore, any sensible statistical analysis will probably conclude that wheel-running activity at age eight weeks is genetically correlated with wheel-running activity at all other ages. Can we make further statements about the distribution of the genetic component of wheel-running?
Figure 1. Mean wheel-running (km/week) as a function of age for four selected lines (solid lines) and four control lines (dashed lines) of mice.
Figure 2 shows the short-term growth rates of two types of caterpillars as a function of temperature. These curves are called thermal performance curves. They were calculated from data collected by Joel Kingsolver [3] under laboratory conditions. Caterpillars were then placed in a natural setting, with temperature varying, and their survival noted. The researchers would like to relate an individual's chance of survival in the natural setting to the individual's thermal performance curve. In addition, the researchers would like to predict the thermal performance curves of future generations under current naturally occurring temperature patterns and under other patterns, such as those caused by global warming.
Figure 2. Mean (±1 s.e.) short-term relative growth rates as a function of temperature for Manduca sexta (diamonds) and Pieris rapae (squares) caterpillars.
Figures 1 and 2 are reproduced with kind permission of Kluwer Academic Publishers, from Variation, selection and evolution of function-valued traits, by Kingsolver, Gomulkiewicz and Carter, Genetica 112–113 (2001) 87–104.

Methodology for the analysis of the evolution of qualitative, real-valued and vector-valued traits is fairly well established. Their analysis is discussed in many quantitative genetics books (see, for instance, [4]). In a sequence of papers ([5], [6]), Lande and Arnold showed the importance of considering vector-valued traits: due to genetic correlation, how one trait evolves can affect how another trait evolves, so much information is lost by studying traits marginally rather than jointly. The study of function-valued traits takes this logic one step further. The continuity of the trait as a function of, e.g., age provides valuable information that shouldn't be ignored. Methodological research for function-valued traits is still in its infancy. The purpose of the current paper is to make accessible to statisticians the relevant parts of quantitative genetics for vector-valued traits, and to point a path for work to be done on function-valued traits.
Section 2 contains notation, standard terminology and some important results in quantitative genetics, namely the Breeder's Equation, the Robertson–Price Identity, an equation of Lande's ([5] and [7]) relating fitness to the selection gradient, and the calculation of a relationship coefficient in a simple setting. Estimation procedures for vector-valued traits are discussed in Section 3. Section 4 contains a summary of some of the existing work on function-valued traits.
URL: https://www.sciencedirect.com/science/article/pii/B9780444513786500041
Organizational Behavior
Dail Fields , Mihai C. Bocarnea , in Encyclopedia of Social Measurement, 2005
Introduction
Organizational behavior research draws on multiple disciplines, including psychology, sociology, and anthropology, but predominantly examines workers within actual organizational settings, rather than in experimental or quasi-experimental settings. Organizational behavior researchers are primarily concerned with measuring the presence of employee motivation, job alienation, organizational commitment, or similar work-related variables in order to understand how these attributes explain employee work behaviors and how they are affected by other variables, such as working conditions, company policies, human resource programs, or pay plans.
Individuals regularly behave in various ways at work, and these behaviors are described in various ways: someone putting forth a great deal of effort might be described as "motivated"; a person who approaches his or her job in a resigned fashion, putting forth only the minimum effort required, might be described as "alienated"; an employee who stays late to help a customer might be described as "committed." In each case, there are alternative possible explanations for the observed behaviors. Plausible alternatives could be that the employee putting forth lots of effort has knowledge about a possible layoff, that the person putting forth the minimum required may be ill or preoccupied with family problems, and that the person staying late may be hoping to secure a job with the customer.

Many of the variables of interest in organizational behavior reflect employee perceptions. In fact, many researchers consider the perceptions of organizational members to be the reality within work settings. Perceptions and other similar phenomena on which people may vary are not directly observable as attributes (such as hair color, height, and size), but are nonobservable aspects, or latent variables. Nonobservable variables are latent in that they are assumed to be underlying causes for observable behaviors and actions. Examples of individual latent variables include intelligence, creativity, job satisfaction, and, from the examples previously provided, motivation, job alienation, and organizational commitment.

Not all latent variables studied in organizational behavior are individual in nature. Some latent variables apply to groups and organizations. These include variables reflecting aggregate characteristics of the group, such as intragroup communication, group cohesion, and group goal orientation. Furthermore, organizations may be globally described as centralized, flexible, or actively learning, also representing latent variables.
Although some theories in organizational behavior (OB) include manifest variables that are directly observable, such as age, gender, and race of workers or the age and location of a work facility, many OB theories are concerned with relationships among latent variables. For example, job satisfaction is a variable that has been studied extensively over the history of organizational behavior. In particular, researchers (and practitioners) are frequently concerned with which variables are related to higher levels of job satisfaction. Generally, an employee's perceptions of the nature of his or her work (interesting, boring, repetitive, difficult, demanding, skill intensive, autonomous) greatly influence job satisfaction. These perceptions are all latent variables. Some researchers have measured these aspects of a worker's job using estimations by trained observers. The agreement between the observer estimations and the worker self-assessments of perceptions is often low, and the observations have a weaker relationship with job satisfaction. Thus, measurement in organizational behavior often deals with how to obtain the specific beliefs, attitudes, and behaviors of individuals that appropriately represent, or operationalize, a latent variable. An example of one way to measure a latent variable, intelligence, is to sum the scores of an individual on tests of verbal, numerical, and spatial abilities. The assumption is that the latent variable, intelligence, underlies a person's ability in these three areas. Thus, if a person has a high score across these three areas, that person is inferred to be intelligent.
A central issue in measurement in organizational behavior is which specific perceptions should be assembled to form adequate measures of latent variables such as job satisfaction and organizational commitment. Because satisfied and committed employees are valued, managers are very interested in what policies, jobs, or pay plans will help promote such states in workers. Researchers, in turn, want to be sure that the things they use to represent satisfaction and commitment are in fact good indicators of the unseen variables. Because latent variables such as satisfaction may be based on different aspects of a job or an organization, an indicator of job satisfaction must include as many of these aspects as necessary. The result is that indicators of job satisfaction and other latent variables important in organizational behavior are based on multiple items representing statements or questions addressing measurable aspects of the concept being measured.
URL: https://www.sciencedirect.com/science/article/pii/B0123693985005284
Source: https://www.sciencedirect.com/topics/mathematics/hair-color