Here is how to add a column containing each observation's rank within some category. For example, if the observations include a country-of-birth variable, we want to be able to say that a given observation belongs to the place with the highest proportion, the place with the second highest proportion, and so on.
In summary: first we tabulate the counts by the variable with xtabs(), then we assign each row in the tabulation a rank with rank(), and then we distribute these ranks back into the original data.frame using merge(). The code below follows these three steps.
x = as.data.frame(factor(c('a','a','b','b','c','a','c','d')))
colnames(x) = 'x'
x$weight = c(1,1,2,1,1,3,1,1)
counts = as.data.frame(xtabs(weight~x, x))
counts$rank = rank(counts$Freq, ties.method='random')
x = merge(x, counts, by.x='x', by.y='x')
So now you can select only the observations in the most populous category by picking the highest number assigned to “rank”. Note that rank() ranks in ascending order, so rank 1 goes to the least populous category rather than the most populous; ranking the negated counts, rank(-counts$Freq), reverses this.
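As a minimal sketch of the reversed ranking, using the same toy data: negating Freq makes rank 1 the most populous category, which can then be selected directly. The names rank_desc, ranked and top are made up for illustration, and ties.method='min' is my choice here rather than anything from the script above.

```r
# Same toy data as in the script above
x <- data.frame(x = factor(c('a','a','b','b','c','a','c','d')),
                weight = c(1,1,2,1,1,3,1,1))
counts <- as.data.frame(xtabs(weight ~ x, x))

# rank() is ascending, so negate Freq to give rank 1 to the largest total.
# 'rank_desc' is an illustrative column name.
counts$rank_desc <- rank(-counts$Freq, ties.method = 'min')

# Attach the ranks to every observation
ranked <- merge(x, counts, by = 'x')

# Keep only the observations in the most populous category
top <- ranked[ranked$rank_desc == 1, ]
```

With ties.method='min', categories with equal totals share the same rank, so more than one category can come back as "top"; 'random' breaks such ties arbitrarily instead.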
Note — Jim pointed out some extraneous weirdness in my script which I have since edited. The above *now* creates a nice x with rankings… (Wondering what other code mistakes I have left in my nascent blog…. Perhaps now is a good time to start practicing aggressive agile testing approaches….)
The result of the script above is:
  x weight rank
1 a      1    4
2 a      1    4
3 b      2    4
4 b      1    3
5 c      1    3
6 a      3    2
7 c      1    2
8 d      1    1
I would have thought that 'a' would have a rank of 4 for all three of its occurrences. If you just take the result of the merge, you get:
> merge(x, counts, by.x='x', by.y='x')
  x weight Freq rank
1 a      1    5    4
2 a      1    5    4
3 a      3    5    4
4 b      2    3    3
5 b      1    3    3
6 c      1    2    2
7 c      1    2    2
8 d      1    1    1
This seems to have the correct ranks assigned to the rows.
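The mismatch in the first table is what you would expect if ranks from the merged result were pasted onto x in its original row order: merge() sorts its output by the key column by default, so the two no longer line up row for row. That is my guess at the cause, not something confirmed above. If the original row order matters, one alternative is to look up each row's rank with match() instead of merging, sketched here on the same toy data:

```r
x <- data.frame(x = factor(c('a','a','b','b','c','a','c','d')),
                weight = c(1,1,2,1,1,3,1,1))
counts <- as.data.frame(xtabs(weight ~ x, x))
counts$rank <- rank(counts$Freq, ties.method = 'random')

# For each row of x, find the matching category in counts and copy its rank.
# The rows of x stay in their original order.
x$rank <- counts$rank[match(x$x, counts$x)]
```

Every 'a' row now gets the same rank, and the weight column still sits next to the observation it came from.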