Sampling with replacement.
Suppose we have a population D of m objects and wish to sample a population D* of size n from D. Is there any mathematical justification for expecting 35% of the sample D* to be repeats of units already sampled?
Suppose we have a population `D` consisting of `m` units.
If we want to sample (with replacement) a new population `D*` from `D` that consists of `n` units, each of the `m` units in `D` is sampled with equal probability, so we would expect `n/m` copies of any particular unit in `D*`. Because the units are replaced once sampled (i.e. they can be sampled again and again), repeats are possible. Even so, each unit is equally likely to be repeated, so we would still expect the same quantity `n/m` of each unit in the sampled population `D*`.
The distribution of the number `X` of copies of a particular unit from `D` found in `D*` can be seen to be Binomial: the probability of there being `x` copies of the unit in `D*` is given by
`P(X=x) = (n!)/(x!(n-x)!)p^x(1-p)^(n-x)`
where `p = 1/m` is the probability that any single draw picks that unit, so the expected count is `np = n/m`. This distribution is the same for each and every unit in `D`.
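As a rough check on the binomial claim, here is a minimal Python sketch that draws `n` units with replacement many times and compares how often one particular unit appears `x` times against the formula. It takes the per-draw probability of picking a given unit to be `1/m`, which is what gives the expected count `n/m`. The sizes `m = 20`, `n = 10`, the number of trials and the function names are illustrative assumptions, not values from the question.

```python
import math
import random
from collections import Counter

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def simulate_counts(m, n, trials, seed=0):
    """Sample n units with replacement from m units, `trials` times,
    and return the relative frequency of seeing unit 0 exactly x times."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(trials):
        sample = [rng.randrange(m) for _ in range(n)]
        tally[sample.count(0)] += 1
    return {x: c / trials for x, c in sorted(tally.items())}

m, n = 20, 10  # illustrative sizes (assumed, not from the question)
empirical = simulate_counts(m, n, trials=20000)

# Columns: x, simulated frequency, binomial probability with p = 1/m
for x, freq in empirical.items():
    print(x, round(freq, 3), round(binomial_pmf(x, n, 1 / m), 3))
```

The two columns should agree closely, and the mean of the binomial distribution, `n * (1/m)`, matches the expected count `n/m` described above.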
There is a method in field biology called 'capture-recapture', where sampled animals (birds or turtles, for example) are marked or tagged when captured. If the first sample was 35% of the size of the total population (rather large in this context!) and the animals were put back, then on the next capture we would expect 35% of the animals to be already marked (a repeat from the first sample). If the percentage of marked animals is higher, this indicates that those particular individuals return to that place or type of place more than others of their species. Perhaps this was what the author was referring to.
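The expectation in the marking scenario can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: every animal is equally likely to be caught on each occasion, the population is closed, and the population size (1000), second sample size (100) and function name are hypothetical choices made for the example.

```python
import random

def marked_fraction(rng, population, first_share, second_size):
    """Mark a first sample, release it, then return the fraction of a
    second (uniformly random) sample that is already marked."""
    animals = range(population)
    first = set(rng.sample(animals, round(first_share * population)))
    second = rng.sample(animals, second_size)
    return sum(a in first for a in second) / second_size

rng = random.Random(1)
# Repeat the two-stage capture many times and average the marked share.
fractions = [marked_fraction(rng, 1000, 0.35, 100) for _ in range(500)]
mean_fraction = sum(fractions) / len(fractions)
print(round(mean_fraction, 3))
```

Under these assumptions the average marked share should sit close to 0.35. A consistently higher share in real data would point to unequal catchability, as described above.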
We expect each unit in D to appear in D* the same number of times, namely n/m, because the units are sampled with equal probability. Sampling with replacement does allow for repeats, but this applies equally to all units in D, so we would still expect n/m copies of each unit. The number of copies of each unit appearing in D* follows a binomial distribution, which is the same for all of the m units in D.
The population D* will not then (in general) contain 35% repeated units, and there should be no such pattern unless a sample is marked before being replaced (see the comment about 'capture-recapture' methods above).
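To see how the share of repeats behaves, here is a hedged Python sketch under the same equal-probability model. A given unit is missing from D* with probability `(1 - 1/m)^n`, so the expected number of distinct units in D* is `m * (1 - (1 - 1/m)^n)`, and every draw beyond those is a repeat. The function names and the `(m, n)` pairs are assumptions chosen for illustration.

```python
import random

def expected_repeat_share(m, n):
    """Expected fraction of the n draws that repeat an earlier draw."""
    expected_distinct = m * (1 - (1 - 1 / m) ** n)
    return 1 - expected_distinct / n

def simulated_repeat_share(m, n, trials=2000, seed=0):
    """Average fraction of repeated draws over many simulated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.randrange(m) for _ in range(n)]
        total += (n - len(set(sample))) / n
    return total / trials

# Columns: m, n, expected repeat share, simulated repeat share
results = []
for m, n in [(100, 10), (100, 50), (100, 100), (100, 200)]:
    row = (m, n, expected_repeat_share(m, n), simulated_repeat_share(m, n))
    results.append(row)
    print(m, n, round(row[2], 3), round(row[3], 3))
```

The share changes with `n` and `m` rather than sitting at a fixed value. In the special case `n = m` with large `m` it approaches `1/e`, roughly 0.37, so a figure in the region of 35% would only be expected for particular sample sizes.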
thank you :0