|


Probability of Duplicate Pairs
Date: 05/13/2003 at 07:23:54
From: Neil
Subject: Probability of generating duplicate addresses
I am trying to find the probability of getting at least 20 duplicate
addresses when I draw a sample of 30,000 at random from UK households
(estimate 21,000,000) where there is replacement every time a
selection is made.
I have managed to work out the probablity of getting at least one
duplicate by subtracting the probability of getting no duplicates from
1. This is extremely high - coming in very close to 1. However, I
cannot see how to calculate 2+, 3+ duplicates etc.
I have used the basic formula of 1-{[(N-n)/N]*[(N-n-1)/(N-1)]..etc}
where N is the number in the universe, n is the number in the sample,
and have repeated this n times to reflect the number of selections.
Date: 05/13/2003 at 11:07:21
From: Doctor Mitteldorf
Subject: Re: Probability of generating duplicate addresses
Neil -
A quick and practical answer to your question is this: There are
roughly (1/2) 30,000^2 pairs in your sample. If the pairs were
independent, the probability for each one to be a match would be
1/21,000,000. So the mean number of expected matches would be
(1/2) 30,000^2 / 21,000,000 = 21
The standard error in this number is roughly its square root, so you'd
expect 21 +/- 5 duplicates in your 30,000 draws. Because the numbers
are so high, this is quite an accurate estimate.
You might verify the estimate by a numerical experiment: write a
program to draw 30,000 random numbers between 1 and 21,000,000, then
sort the numbers and count the duplicates. Repeat 100 times, and
you'll have a good idea whether the above estimate is valid.
Here are some thoughts on the formal (exact) solution to your problem,
which I formulate thus: Suppose you draw n samples from a universe of
N objects with replacement. What is the probability of exactly r
duplicate pairs?
It is easier to approach this problem by counting permutations
separately, even though in the final analysis order doesn't matter.
The number of possible samples of n is then just N^n. The subset in
which these are all different numbers N!/(N-n)!, so the probability
for r=0 duplicates is
N!
----------
(N-n)! N^n
Suppose there are r duplicate pairs. Then the remaining (n-2r) samples
can be selected (and ordered) from the remaining (N-r) universe
objects in (N-r)!/((N-r)-(n-2r))! = (N-r)!/(N-n+r)! different ways.
Where are the duplicate pairs located in the sample of n? This is the
number of ways of choosing (2r) from a list of n: C(n,2r).
There are C(N,r) ways of choosing which r objects will be the
duplicates.
Finally, how many different ways may the r duplicates be arranged to
make a list of length 2r? The answer is (2r)!/2^r. This is because
there would be (2r)! permutations of the list if all the members were
distinct, but each swapping of a pair means we've overcounted by a
factor of 2.
Putting all this together, the probability of exactly n distinct pairs
(with no triples or quadruples) is
(N-r)! n! N! (2r)!
---------------------------------------- =
(N-n+r)! (n-2r)! (2r)! (N-r)! r! 2^r N^n
N! n!
------------------------------
(N-n+r)! (n-2r)! r! 2^r N^n
Practicing what I preach, I've tested this formula in 1,000,000
numerical trials for N=30 and n=15, r=0,1,2,3, and 4, and find that it
correctly predicts the number of pairs.
- Doctor Mitteldorf, The Math Forum
http://mathforum.org/dr.math/
Date: 05/16/2003 at 08:25:02 From: Neil Subject: Thank you (Probability of generating duplicate addresses) In the end I was able to work this out through trial and error - but it is lovely to have the formula to hand now and the explanation behind it. Given that I had found 20 duplications in the sample that I had drawn I wanted to check whether this was in line with what might be expected. I now have a full answer. - Thank you again. Neil |
Search the Dr. Math Library: |
[Privacy Policy] [Terms of Use]


Ask Dr. MathTM
© 1994-2011 The Math Forum
http://mathforum.org/dr.math/