What is the difference between a population and a sample?
What is the difference between a population and a sample? What common variables and statistics are used for each one, and how do those relate to each other?
Tags: standard-deviation, variance, sample, population
edited Aug 7 '10 at 17:55
mbq 19.6k
asked Jul 20 '10 at 11:07
Baltimark 924
Obligatory reading: Krieger, N. (2012). Who and what is a “population”? Historical debates, current controversies, and implications for understanding “population health” and rectifying health inequities. The Milbank Quarterly, 90(4):634–681. – Alexis Mar 14 at 0:01
5 Answers
The population is the set of entities under study. For example, suppose we are interested in the mean height of men: the population is "men". This is a hypothetical population because it includes all men who have lived, are alive and will live in the future. I like this example because it drives home the point that we, as analysts, choose the population that we wish to study. Typically it is impossible to survey/measure the entire population because not all members are observable (e.g. men who will exist in the future). Even when it is possible to enumerate the entire population, it is often costly and time-consuming to do so. In the example above we have a population, "men", and a parameter of interest, their mean height.
Instead, we can take a subset of this population, called a sample, and use it to draw inferences about the population under study, given some conditions. Thus we could measure the mean height of the men in a sample, which we call a statistic, and use this to draw inferences about the parameter of interest in the population. It is an inference because there will be some uncertainty and inaccuracy involved in drawing conclusions about the population based upon a sample. This should be obvious: we have fewer members in our sample than in our population, so we have lost some information.
There are many ways to select a sample, and the study of this is called sampling theory. A commonly used method is Simple Random Sampling (SRS). In SRS each member of the population has an equal probability of being included in the sample, hence the term "random". There are many other sampling methods, e.g. stratified sampling, cluster sampling, etc., which all have their advantages and disadvantages.
It is important to remember that the sample we draw from the population is only one of a large number of potential samples. If ten researchers were all studying the same population, each drawing their own sample, they might obtain different answers.
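To make the "ten researchers" point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the population is an invented list of 100,000 heights (mean 175 cm, SD 7 cm), the sample size of 50 is arbitrary, and `draw_sample_mean` is a hypothetical helper, not something from the answer.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: 100,000 heights in cm (invented values).
population = [random.gauss(175.0, 7.0) for _ in range(100_000)]

# The parameter of interest: the population mean height.
population_mean = statistics.fmean(population)


def draw_sample_mean(pop, n):
    """Draw a simple random sample of size n (without replacement)
    and return its mean, which is a statistic."""
    sample = random.sample(pop, n)
    return statistics.fmean(sample)


# Ten researchers each draw their own sample of 50 men.
researcher_means = [draw_sample_mean(population, 50) for _ in range(10)]

print(f"population mean (parameter): {population_mean:.2f}")
for i, m in enumerate(researcher_means, start=1):
    print(f"researcher {i}: sample mean (statistic) = {m:.2f}")
```

Each researcher's statistic should land near the parameter, but no two are expected to agree exactly.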
Returning to our earlier example, each of the ten researchers may come up with a different mean height of men, i.e. the statistic in question (mean height) varies from sample to sample. It has a distribution, called a sampling distribution. We can use this distribution to understand the uncertainty in our estimate of the population parameter. For large samples, the sampling distribution of the sample mean is approximately normal, with a standard deviation equal to the population standard deviation divided by the square root of the sample size; in practice this is estimated by the sample standard deviation divided by the square root of the sample size. Because this could easily be confused with the standard deviation of the sample, it is more common to call the standard deviation of the sampling distribution the standard error.
answered Jul 21 '10 at 14:00
Graham Cookson 4,962
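The standard error described in this answer can be checked by simulation. The sketch below is only an illustration under assumed values: an invented population of 50,000 heights, a sample size of 100, and 2,000 repeated samples. It compares the spread of the simulated sample means with the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in cm (invented values).
population = [random.gauss(175.0, 7.0) for _ in range(50_000)]
sigma = statistics.pstdev(population)  # population standard deviation

n = 100            # size of each sample
n_samples = 2_000  # number of repeated samples

# Empirical sampling distribution of the sample mean.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(n_samples)
]

# Standard deviation of the sampling distribution, i.e. the standard error.
empirical_se = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(n)

# In practice only one sample is available, so sigma is replaced
# by the sample standard deviation.
one_sample = random.sample(population, n)
estimated_se = statistics.stdev(one_sample) / math.sqrt(n)

print(f"empirical SE   : {empirical_se:.3f}")
print(f"sigma / sqrt(n): {theoretical_se:.3f}")
print(f"s / sqrt(n)    : {estimated_se:.3f}")
```

The three numbers should be close to each other. Dividing by `n` instead of `math.sqrt(n)` would give a value ten times too small here.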
6 Isn't it a little pointless to use "all men ever" as a population? I mean, there's not even a consensus as to how old Homo sapiens is, or whether Homo neanderthalensis was a separate species, let alone whether males of the stone-tool-using Homo habilis count as "men". Presumably the same problems will face us in the future, too. – naught101 Nov 29 '12 at 7:43
In the last paragraph, I think there is a minor sleight of hand, and it should read... "equal to the sample standard deviation divided by the [square root] of the sample size" in reference to the standard error. – Antoni Parellada Jun 16 '17 at 13:27
The population is the whole set of values, or individuals, you are interested in. The sample is a subset of the population, and is the set of values you actually use in your estimation. So, for example, if you want to know the average height of the residents of China, that is your population, i.e., the population of China. The thing is, this is a very large number of people, and you wouldn't be able to get data for everyone there. So you draw a sample, that is, you get some observations, namely the heights of some of the people in China (a subset of the population, the sample), and do your inference based on that.
answered Jul 20 '10 at 11:21
Vivi 666
Good answer. I think you should go further into what you mean by "do your inference based on that". That's kind of the second part of my question. – Baltimark Jul 20 '10 at 11:36
mmm... I didn't really understand what you meant by what common variables and statistics... Oh, do you mean like you use z distribution if you have the population variance and the t-distribution if you only have the sample variance and the sample size is small? Something along those lines? – Vivi Jul 20 '10 at 12:03
What I was getting at was that the mean and standard deviation are parameters associated with the population, but they're estimated by the sample mean, $\bar{x} = \frac{1}{N}\sum_i x_i$, and the sample standard deviation, $s = \sqrt{\frac{1}{N-1}\sum_i (x_i - \bar{x})^2}$. – Baltimark Jul 20 '10 at 13:42
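The two estimators mentioned in the comment above can be written out directly. This is a small sketch: the function names `sample_mean` and `sample_std` and the five height values are made up for illustration, and the result is cross-checked against `statistics.stdev` from the Python standard library, which also uses the N-1 divisor.

```python
import math
import statistics


def sample_mean(xs):
    """Sample mean: (1/N) * sum of x_i."""
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation: square root of
    (1/(N-1)) * sum of (x_i - x_bar)^2."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sample_mean(xs)
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


# Made-up sample of five heights in cm.
heights = [170.0, 182.0, 176.0, 168.0, 179.0]

print(sample_mean(heights))           # 175.0
print(round(sample_std(heights), 3))  # 5.916
print(math.isclose(sample_std(heights), statistics.stdev(heights)))  # True
```

Note the N-1 divisor and the square root: without the root, the second expression is the sample variance rather than the standard deviation.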
The population is everything in the group of study. For example, if you are studying the price of Apple's shares, it is the historical, current, and even all future stock prices. Or, if you run an egg factory, it is all the eggs made by the factory. You don't always have to sample and do statistical tests. If your population is your immediate living family, you don't need to sample, as the population is small. Sampling is popular for a variety of reasons:
* it is cheaper than a census (measuring the whole population)
* you don't have access to future data, so you must sample the past
* you have to destroy some items by testing them, and don't want to destroy them all (say, eggs)
answered Jul 20 '10 at 17:41
Neil McGuigan 4,966
When we think of the term “population,” we usually think of people in our town, region, state or country and their respective characteristics such as gender, age, marital status, ethnic membership, religion and so forth. In statistics the term “population” takes on a slightly different meaning. The “population” in statistics includes all members of a defined group that we are studying or collecting information on for data-driven decisions. A part of the population is called a sample. It is a portion of the population, a slice of it, a part of it and all its characteristics. A sample is a scientifically drawn group that actually possesses the same characteristics as the population, if it is drawn randomly. (This may be hard for you to believe, but it is true!) Randomly drawn samples must have two characteristics:
* Every person has an equal opportunity to be selected for your sample; and,
* Selection of one person is independent of the selection of another person.
What is great about random samples is that you can generalize to the population that you are interested in. So if you sample 500 households in your community, you can generalize to the 50,000 households that live there. If you match some of the demographic characteristics of the 500 with the 50,000, you will see that they are surprisingly similar.
answered Jan 16 '14 at 8:25
roseleneramas 21
1 This is basically correct, if properly interpreted. I worry that some readers might be misled into thinking that simple random samples with replacement (which is the kind of random sample you describe; there are other kinds) correctly reproduce all characteristics of the population. In fact, they rarely do. The point of random sampling is that the (inevitable) differences between the sample's characteristics and the population's characteristics can be attributed to the random selection process. – whuber © Jan 16 '14 at 17:40
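A quick simulation can illustrate both the answer's 500-out-of-50,000 example and the caveat in the comment. The sketch below rests on invented assumptions: the characteristic (whether a household rents) and its 30% population share are made up, and `random.choices` is used so that draws are independent with equal probability, i.e. sampling with replacement.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical community: 50,000 households, 30% of which rent
# (True = renting household). These values are invented.
population = [True] * 15_000 + [False] * 35_000
population_share = sum(population) / len(population)

# One random sample of 500 households, drawn independently
# with equal probability (i.e. with replacement).
sample = random.choices(population, k=500)
sample_share = sum(sample) / len(sample)

# Repeat the survey 1,000 times to see how much the sample share varies.
shares = [sum(random.choices(population, k=500)) / 500 for _ in range(1_000)]

print(f"population share of renters: {population_share:.3f}")
print(f"share in one sample of 500 : {sample_share:.3f}")
print(f"range over 1,000 samples   : {min(shares):.3f} to {max(shares):.3f}")
```

The sample share is typically close to the population share but not identical to it, and the gap changes from sample to sample. That difference is the part attributable to the random selection process.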
A population includes all of the elements from a set of data. A sample consists of one or more observations from the population. BOA, A. (2012, 17)
answered Oct 8 '15 at 0:56
user91513 1
1 When all the elements of a "set of data" are considered a population, that dataset is called a census of the population. Extremely few datasets are censuses. – whuber © Oct 8 '15 at 4:46