Galaxy Hunter
Teacher Page: Science Background

Math Background:

1. What is population size?
The population size includes all the individuals in the identified group to be studied. This may be the number of people in a city, or the number of people who buy new cars. Often you may not know the exact population size, which is not a problem. The mathematics of probability proves that the size of the population is irrelevant, unless the size of the sample exceeds a few percent of the total population you are examining. This means that a sample of 500 people is equally useful in examining the opinions of a state of 15,000,000 as it would be in examining those of a city of 100,000. For this reason, The Survey System ignores the population size when it is "large" or unknown. A large population is referred to as infinite, while a small population is considered finite. Population size is only likely to be a factor when you work with a relatively small, known, finite group of people (e.g., the members of an association).
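The claim that population size barely matters can be checked with a short calculation. The sketch below is only an illustration: it assumes a 95% confidence level (z = 1.96), a worst-case proportion of 0.5, and the standard finite population correction, none of which are stated in the text above. The function name is hypothetical.

```python
import math

def margin_of_error_finite(sample_size, population_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumptions (not from the text): z = 1.96 for a 95% confidence level and
    a worst-case proportion of 0.5. The finite population correction shrinks
    the error only when the sample is a noticeable share of the population.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction

# Same sample of 500 people, three different population sizes.
for population in (15_000_000, 100_000, 1_000):
    moe = margin_of_error_finite(500, population)
    print(f"population {population:>10,}: margin of error = {moe:.2%}")
```

Under these assumptions the state and the city give nearly the same margin of error, while the population of 1,000 (where the sample is half the population) gives a clearly smaller one.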

2. What is sample size?
The sample size is the number of individuals included in a study and represents only a subset of the population. This subset is selected in a way that gives every member of the population an equal chance of being chosen. The larger your sample, the more sure you can be that it truly reflects the population.

3. Why is sample size important?
It is essential to use the correct sample size to accurately represent the population. Choosing a sample size that is too small may not give an accurate representation of the population distribution. Too large a sample size is wasteful and sometimes impossible to complete. For example, you want to change something in a school with a population of 500 students, and decide to survey the school but ask only ten people. Is this truly representative of the school community? No! Ten people are not enough to accurately represent the school. Suppose you tried to ask every person in the school. Sometimes this is not easily accomplished and can be unnecessary. In this case, a sample of 23 should be enough to represent the population. Reasonable sample size is dependent on population size and how much sampling error is tolerated.

4. What is simple random sampling?
Simple random sampling is the basic sampling technique in which a group of subjects, i.e., a sample, is selected for study from a larger group, i.e., a population. Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection; i.e. each member of the population is equally likely to be chosen at any stage in the sampling process.

5. How do you organize a simple random sampling?
A simple random sample is formed by assigning each member of the population a number and then indiscriminately selecting from these numbers. One way to make the selection random is to use a random number table or let a computer generate a series of random numbers. Each member of the population is assigned a unique number, or perhaps a number is already assigned to each member, such as a social security number or telephone number. The members of the population chosen for the sample will be those whose numbers are identical to the ones extracted from the random number table (or computer), in succession, until the desired sample size is reached. For example, suppose a committee is to be formed whose members are randomly selected from a group of 25 people. To obtain a simple random sample, each person is assigned a number, the numbers are placed in a hat, mixed, and then blindly drawn to form the committee.
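The hat-drawing procedure can be sketched on a computer as follows. This is a minimal illustration, not a prescribed method: the committee size of 5 and the fixed seed are assumptions, since the text gives only the group size of 25.

```python
import random

# Each of the 25 people is assigned a unique number, 1 through 25.
group = list(range(1, 26))
committee_size = 5  # assumed; the text does not give a committee size

# A fixed seed makes the draw repeatable for a classroom demonstration.
rng = random.Random(42)

# sample() draws without replacement, like pulling slips out of a hat.
committee = rng.sample(group, committee_size)
print(sorted(committee))
```

Because the draw is made without replacement, no person can be picked twice, and every group of five has the same chance of being selected.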

6. What are the strengths and weaknesses of using simple random sampling?
The simple random sample requires less knowledge about the population than other techniques, but it does have two major drawbacks. One is the fact that, if the population is large, a great deal of time must be spent listing and numbering the members. The other is the fact that a simple random sample will not adequately represent many population attributes (characteristics) unless the sample is relatively large. That is, if you are interested in choosing a sample to be representative of a population on the basis of the distribution in the population of gender, age, and economic status, a simple random sample will need to be very large to ensure all these distributions are equivalent to (or representative of) the population.

7. What are some other types of sampling?

Systematic sampling - Similar to simple random sampling, but instead of selecting random numbers from tables, you move through a list (sample frame) picking every nth name. For example, pick every 10th name from an alphabetical list of students enrolled in a school.

Random Route Sampling - Used in market research surveys, mainly for sampling households, shops, garages and other premises in urban areas. A starting address is randomly selected and, taking alternate left- and right-hand turns at road junctions, every nth address is selected.

Stratified Sampling - All people in sampling frame are divided into "strata" (groups or categories). Within each stratum, a simple random sample or systematic sample is selected. For example, a politician wishes to poll his/her constituents regarding taxation. The constituents are broken into income brackets and then each bracket is polled.

Cluster or Area Random Sampling - In cluster sampling, the population is divided into clusters (usually along geographic boundaries), the clusters are randomly sampled and all units within the sampled cluster are measured. For example: a survey of town governments that will require going to the towns personally could be done by using county boundaries as the clusters and randomly selecting five counties. All the town governments in these selected counties would then be measured.

Multi-stage cluster sampling - As the name implies, this involves drawing several different samples. The first stage would be a cluster sample as described above, but then another sample is taken from within the selected clusters. For example: a face-to-face survey of the residents of a state could be done by first selecting a sample of counties and then drawing another sample, such as a systematic sample, of the residents of those selected counties. Thus the cost of interviewing is minimized.

There are many other methods of sampling that are more advanced. Check the references listed below, or visit the sampling method websites listed in the Grab Bag.
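Three of the methods described in this section (systematic, stratified, and cluster sampling) can be sketched in a few lines. All names and data below are made up for illustration, and the random starting point in the systematic sample is a common convention rather than something the text specifies.

```python
import random

rng = random.Random(0)  # fixed seed so the draws can be repeated

def systematic_sample(frame, step):
    """Pick every `step`-th member of the list, starting from a random offset."""
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(strata, per_stratum):
    """Draw a simple random sample of `per_stratum` members within each stratum."""
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}

def cluster_sample(clusters, n_clusters):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Made-up data for illustration only.
students = [f"student_{i:03d}" for i in range(1, 101)]
income_brackets = {
    "low": [f"low_{i}" for i in range(40)],
    "middle": [f"middle_{i}" for i in range(50)],
    "high": [f"high_{i}" for i in range(10)],
}
counties = {f"county_{c}": [f"town_{c}{t}" for t in range(1, 4)] for c in "ABCDEFGH"}

every_tenth = systematic_sample(students, 10)       # every 10th student
by_bracket = stratified_sample(income_brackets, 5)  # 5 people per income bracket
selected_counties = cluster_sample(counties, 5)     # all towns in 5 random counties

print(every_tenth)
print(by_bracket)
print(selected_counties)
```

A multi-stage design would simply apply one of the first two functions to the units inside each selected cluster instead of keeping all of them.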

8. What is sampling error?
Every survey contains some form of error. Even a complete census of all known members of a population is subject to random error or potential measurement error. There are two major forms of sampling error that might be encountered in a survey: random error and systematic error.

9. What is random sampling error?
Random error occurs when a particular sample is not representative of the population of interest due to random variation. It can be expressed as the difference between the sample results and the true results. Even if all aspects of the sample are executed properly, the results are still subject to a certain amount of error because of random, chance variation.

10. What is a systematic error?
A systematic error occurs when something is wrong with the technique being used or when an instrument is not calibrated correctly. This results in an error throughout the sample.

11. How does a statistic differ from a parameter?
A statistic is a generalization concerning an entire sample, such as the mean, mode or median. A parameter is a generalization for an entire population, such as the mean, mode or median. In order to get a parameter, the entire population is involved whereas a statistic is derived from a sample of that population.

12. So how do we get from our sample statistic to an estimate of the population parameter?
There are an infinite number of samples that can be taken from a large population. One sample from a population might yield a slightly different statistic than another sample taken from the same population but the statistics should be similar to each other. If more and more samples of the same size were taken from the population, the sampling distribution of the statistic would resemble a bell curve or normal distribution. The average of the sampling distribution is essentially equivalent to the parameter. The standard deviation of the sampling distribution, called sampling error, tells us something about how different samples would be distributed which, in turn, tells how far the statistic is from the parameter. A low sampling error means that we have relatively less variability or range in the sampling distribution and are therefore closer to the parameter.
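The idea of a sampling distribution can be explored with a small simulation. The sketch below uses a made-up population of 10,000 uniformly distributed values, a sample size of 50, and 2,000 repeated samples; all of these are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the simulation can be repeated

# Made-up population; its mean plays the role of the parameter.
population = [rng.uniform(0, 100) for _ in range(10_000)]
parameter = statistics.mean(population)

# Take many samples of the same size and record each sample mean (a statistic).
sample_size = 50
sample_means = [
    statistics.mean(rng.sample(population, sample_size)) for _ in range(2_000)
]

print(f"population mean (parameter):    {parameter:.2f}")
print(f"average of the sample means:    {statistics.mean(sample_means):.2f}")
print(f"std. dev. of the sample means:  {statistics.stdev(sample_means):.2f}")
```

The average of the sample means should land very close to the population mean, and their standard deviation is the sampling error described above. A histogram of `sample_means` would show the bell shape.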

13. How is the sampling error or standard error determined?
Calculation of sampling error (also called standard error) is based on the standard deviation of the sample: the greater the sample standard deviation, the greater the sampling error. The sampling error is also related to the sample size. The greater your sample size, the smaller the sampling error. This error cannot be avoided, only reduced by increasing the sample size. It is possible to estimate the range of random error at a particular level of confidence. Suppose we surveyed 500 people and found that 65% of them said that vanilla is their favorite ice cream. For a sample of 500, sampling error is about 4 percent. This means that we can expect our sample results to be within 4 percentage points of the actual figure for the population -- in other words, the actual figure could be as high as 69% or as low as 61%. As sample size increases, sampling error decreases. Sampling error is about 10% for a sample of 100 and about 3% for a sample of 1000.
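The percentages quoted above can be reproduced with the usual formula for the sampling error of a proportion. The text does not say which confidence level it uses; the sketch below assumes 95% (z = 1.96), and it assumes a worst-case proportion of 0.5 for the samples of 100 and 1000, where no survey result is given.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Margin of error for a sample proportion; z = 1.96 assumes 95% confidence."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Ice cream survey: 65% of 500 people prefer vanilla.
print(f"n = 500:  {margin_of_error(0.65, 500):.1%}")

# Worst-case proportion of 0.5 (an assumption) for the other sample sizes.
for n in (100, 1000):
    print(f"n = {n}: {margin_of_error(0.5, n):.1%}")
```

Because the sample size sits under a square root, quadrupling the sample only halves the sampling error.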

14. What is a normal distribution and how does standard error relate to this distribution?
A normal distribution is a bell curve that extends to infinity in both directions. The high point represents the mean. If the area under the curve is defined to be 1 and you multiply that by 100, then there is a 100% chance that any value you name will be somewhere in the distribution. Because half the area of the curve is below the mean and half is above the mean, there is a 50% chance that a randomly chosen value will be above the mean and the same chance that it will be below it. The area under the normal curve over a given range is equivalent to the probability of randomly drawing a value in that range. The area is greatest in the middle where the "hump" is and thins out toward the tails. When the area of the standard normal curve is divided into sections by standard error above and below the mean, the area in each section is a known quantity. The areas above and below the mean can be added together to get the probability of obtaining a value within (plus or minus) a given number of standard errors. There is about a 68% chance of a value falling within one standard error of the mean, a 95% chance within two standard errors, and a better than 99% chance that it will be within three. Suppose a normal distribution has a mean of 3.75 (highest point on graph below) and a standard deviation of .25; then about 68% of the values will fall between 3.5 and 4.0 as shown below. (taken from http://trochim.human.cornell.edu/kb/sampstat.htm)
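The areas under the normal curve can be computed directly rather than looked up. The sketch below uses the error function for the general case and Python's `statistics.NormalDist` for the example with a mean of 3.75 and a standard deviation of 0.25; the helper name `probability_within` is made up for this illustration.

```python
import math
from statistics import NormalDist

def probability_within(k):
    """Chance that a normally distributed value lies within k standard
    deviations (or standard errors) of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {probability_within(k):.1%}")

# The example from the text: mean 3.75, standard deviation 0.25.
distribution = NormalDist(mu=3.75, sigma=0.25)
share = distribution.cdf(4.0) - distribution.cdf(3.5)
print(f"share of values between 3.5 and 4.0: {share:.1%}")
```

The range from 3.5 to 4.0 is exactly one standard deviation on either side of the mean, so the last line matches the first line of the loop.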

15. What is meant by a level of confidence or confidence level?
Confidence levels are used when two sets of data are being compared. The confidence level is the complement of the significance level, which is the likelihood of obtaining a particular result by chance rather than due to a truly significant difference in the two sets of data. The smaller the significance level, the more stringent the test, and the less likely it is that an apparent difference is due to chance alone. Common significance levels are 0.05 (1 in 20 chance), 0.01 (1 in 100 chance) and 0.001 (1 in 1000 chance), which correspond to confidence levels of 95%, 99% and 99.9%.

16. What is bias?
Bias is a systematic error in sample statistics that can occur from the use of poor sampling methods. Sample design results may be biased for a number of reasons: frame error, population specification error, or selection error. The sampling frame is the list of population elements or members from which the sample is selected. Frame error results when the sampling frame does not represent a true cross-section of the target population. For example, suppose you survey your neighborhood and only talk to the people on the street. Any data collected in this manner is heavily biased because not everybody in the neighborhood had a chance to respond - what about the people who were inside at the time of the survey? Any conclusions drawn about your neighborhood using this method of sampling will not be representative of the population, the entire neighborhood. Selection error involves a systematic bias in the manner in which respondents are selected for participation in the survey. Even if the sampling frame is defined properly to include the appropriate population members, selection error can still occur. Incomplete or improper procedures for selecting participants will lead to selection error. If a sample list was sorted by zip code and interviewers selected survey participants by contacting names in order from the beginning of the list, selection error would occur because those members of the population appearing at the end of the list (larger zip codes) would never be contacted.

17. How can bias affect the accuracy of a sample?
When bias occurs, the results are skewed from the normal distribution. A negatively skewed curve has a thicker tail on the side below the mean while a positively skewed distribution has a larger tail on the side above the mean. In either case, the accuracy of the results will be compromised. Note: a skewed sample does not necessarily mean it is biased.

18. Is a computer always unbiased? Do computers always produce random samples?
The answer is no. A computer's random number generator could be programmed in such a manner as to yield a biased sample. However, for the purposes of this lesson, computers are considered unbiased.

19. What is a Box and Whisker Plot (or Boxplot) graph?
A box and whisker plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values at the ends of the line), the lower and upper quartiles (edges of the box), and the median (line through the box). (NOTE: The lines extending from the box may be adjusted to represent a certain fraction of the data: they could be set at 5% and 95% or they could represent the minimum and maximum values.) A box plot, as it is often called, is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set. Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

[Figure: box and whisker plot marking the lower quartile (25%), the median (50%), and the upper quartile (75%)]
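The five values that a box and whisker plot displays can be computed as sketched below. The data set is made up, and two conventions are assumptions: `statistics.quantiles` uses its default "exclusive" method (textbooks differ on how quartiles are computed), and outliers are flagged with the common 1.5 x IQR rule, which the text above does not prescribe.

```python
import statistics

# Made-up data set with one unusually large value.
data = [12, 15, 17, 19, 22, 23, 25, 28, 31, 35, 60]

# quantiles(..., n=4) returns the three cut points that split the data into quarters.
lower_quartile, median, upper_quartile = statistics.quantiles(data, n=4)

print(f"minimum:        {min(data)}")
print(f"lower quartile: {lower_quartile}")
print(f"median:         {median}")
print(f"upper quartile: {upper_quartile}")
print(f"maximum:        {max(data)}")

# Common convention (an assumption here): points more than 1.5 box lengths
# beyond the box edges are treated as outliers.
box_length = upper_quartile - lower_quartile
outliers = [
    value for value in data
    if value < lower_quartile - 1.5 * box_length
    or value > upper_quartile + 1.5 * box_length
]
print(f"possible outliers: {outliers}")
```

The box would be drawn from the lower to the upper quartile with a line at the median, and the whiskers would extend toward the minimum and maximum.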

20. What is a frequency table?
A frequency table is a way of summarizing a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category. A frequency table is used to summarize categorical, nominal, and ordinal data. It may also be used to summarize continuous data once the data set has been divided up into sensible groups.

Example: Suppose that in thirty shots at a target, a marksman makes the following scores:

5 2 2 3 4
4 3 2 0 3
0 3 2 1 5
1 3 1 5 5
2 4 0 0 4
5 4 4 5 5

The frequencies of the different scores can be summarized as:

Score   Frequency   Frequency (%)
0       4           13%
1       3           10%
2       5           17%
3       5           17%
4       6           20%
5       7           23%
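The same frequency table can be built from the thirty scores with a short program. This is a minimal sketch using Python's `collections.Counter`; the column widths are arbitrary.

```python
from collections import Counter

# The marksman's thirty scores, in the order listed above.
scores = [
    5, 2, 2, 3, 4,
    4, 3, 2, 0, 3,
    0, 3, 2, 1, 5,
    1, 3, 1, 5, 5,
    2, 4, 0, 0, 4,
    5, 4, 4, 5, 5,
]

frequencies = Counter(scores)  # maps each score to how often it occurs
total = len(scores)

print("Score  Frequency  Frequency (%)")
for score in sorted(frequencies):
    count = frequencies[score]
    print(f"{score:>5}  {count:>9}  {count / total:>12.0%}")
```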

21. Why do astronomers use statistics?
Astronomers use statistics because they can't manipulate the universe in a laboratory the way a chemist can manipulate a compound or a biologist can manipulate a specimen. Since it is impossible to perturb some part of the population in order to see the effect, astronomers rely on standard sampling design and estimation methods in order to draw conclusions about the universe. Also, processes in the universe take place over very long time scales, so noticeable changes are rare and tend to be studied in detail. As an example, consider stellar evolution. No one has ever observed a star go through its life cycle since the shortest cycles are about 10 million years long, but astronomers can observe many stars at different stages in their life cycles and make predictions.

22. How do astronomers use sampling statistic techniques in their research?
Astronomers use two different sampling designs depending on the population being studied. If the population is finite in size, such as a cluster of stars or the HDFs, simple random sampling is chosen. If the population is very large and considered infinite, then more complex designs are used, depending on the characteristics of the population and the property being studied. Active galactic nuclei and halo stars are two populations that are considered infinite.

Science Background:

1. What is a galaxy?
A galaxy is an enormous collection of a few million to trillions of stars, gas, and dust held together by gravity. They can be several thousand to hundreds of thousands of light-years across. Galaxies can be placed into three main classes:

· Elliptical Galaxy - A galaxy having an oval or nearly spherical shape. Some are more elongated than others. Resembling a bulge and halo, it is composed mostly of old stars and contains very little gas and dust. The smallest elliptical galaxies (called "dwarf ellipticals") are probably the most common type of galaxy in the nearby universe.

· Spiral Galaxy - A galaxy made up of a disk with spiral (pinwheel-shaped) arms, a bulge near its center, and a halo. The sizes of the disk and bulge vary. The galaxy is composed of a mixture of old and young stars as well as gas and dust. The spiral arms are sites of active star formation. The majority of large galaxies in the nearby universe are spirals.

· Irregular Galaxy - A galaxy whose shape is neither elliptical nor spiral. It contains both young and old stars and is often rich in gas and dust. These galaxies often have active regions of star formation. Sometimes the irregular shape of these galaxies results from interactions or collisions between galaxies. Observations such as the Hubble Deep Fields show that irregular galaxies were more common in the distant (early) universe.

2. Why do we study galaxies?
By studying other galaxies, astronomers learn more about the Milky Way Galaxy, the galaxy that contains our solar system. Answers to such questions as: "Do all galaxies have the same shape?," "Are all galaxies the same size?," "Do they all have the same number of stars?," and "How and when did galaxies form?" help astronomers learn about the history of the universe. Galaxies are visible to vast distances, and trace the structure of the visible universe with their collections of billions of stars, gas, and dust.

3. Why do we study distant galaxies, if they are faint and hard to observe?
When we study astronomical objects, we are actually looking back in time. Light from the Sun takes 8 minutes to reach Earth. The light we see today from the next nearest star was emitted about four years ago. Light from the nearest galaxy that is like our own, Andromeda, takes over 2 million years to reach us. That is, we see Andromeda as it appeared more than 2 million years ago! Observations of distant galaxies show us what the universe looked like at an earlier time in the history of the universe. By studying the properties of galaxies at different epochs, we can map the evolution of the universe!
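The 8-minute figure for sunlight follows from dividing the Sun-Earth distance by the speed of light. The sketch below uses the average Sun-Earth distance (one astronomical unit); the constant names are chosen for this illustration.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458   # speed of light in a vacuum
SUN_EARTH_DISTANCE_KM = 149_597_870.7   # one astronomical unit (average distance)

travel_seconds = SUN_EARTH_DISTANCE_KM / SPEED_OF_LIGHT_KM_PER_S
travel_minutes = travel_seconds / 60

print(f"Sunlight takes about {travel_minutes:.1f} minutes to reach Earth.")
```

For stars and galaxies the same idea is built into the unit: a distance of one light-year means the light has been traveling for one year.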

4. When scientists study these distant galaxies what do they look at?
They observe many properties of each galaxy including size, shape, brightness, color, amount of star formation and distance from us. This information helps astronomers to determine how these structures may have formed and evolved.

5. What is a "deep" field?
In astronomical terms, a deep field is a long exposure observation taken to view very faint objects. Light from these objects is collected over a long period of time, so the detectors have a chance to gather as much light as possible. Objects can be very far away and appear faint to us due to the vast distances over which the light must travel. However, objects can lie close to us and be faint because they don't give off much light. So "deep" doesn't necessarily mean far. However, in the case of the Hubble Deep Fields, deep does mean far away since the images were taken in locally empty areas.

6. What are the Hubble Deep Fields?
The Hubble Deep Field project was inspired by some of the first deep images to return from the telescope after the 1993 Hubble Space Telescope servicing mission. These images showed that the early universe contained galaxies in a bewildering variety of shapes and sizes. Some had the familiar elliptical and spiral shapes seen among normal galaxies, but there were many peculiar shapes as well. Such images of the early universe are likely to be one of the enduring legacies of the Hubble Space Telescope. Few astronomers had expected to see this activity presented in such amazing detail.

Impressed by the results of earlier observations such as the Hubble Medium Deep Survey, a special advisory committee convened by Robert Williams, then Director of the Space Telescope Science Institute (STScI), recommended that he use a significant fraction of his annual director's discretionary time to take the deepest optical picture of the universe, by aiming Hubble for 150 consecutive orbits on a single piece of sky. The observation was made by pointing the telescope at one spot in the northern sky for 10 days in December of 1995 as a service to the entire astronomical community. Images from the Hubble Deep Field project were made available to astronomers around the world shortly after completion of the observation.

A few thousand never-before-seen galaxies are visible in this "deepest-ever" view of the universe, called the "Hubble Deep Field" (later named the HDF-North). Besides the classical elliptical and spiral galaxies, the variety of other galaxy shapes and colors provides important clues to understanding the evolution of the universe. Some of the galaxies may have formed less than one billion years after the Big Bang.

Hubble took a second deep look in the southern hemisphere in October of 1998, the HDF-South, to see if a similar result would be obtained. Each of the Hubble Deep Fields shows thousands of galaxies in an area of the sky that is as small as President Roosevelt's eye on a dime held at arm's length.

7. How were the two Hubble Deep Field sites chosen?
Each of the Hubble Deep Fields represents a "carefully selected random spot on the sky." To allow the Hubble Space Telescope to peer deeply into the sky, astronomers selected a special region of Hubble's orbit where Hubble can view the sky without being blocked by the Earth or experiencing interference from the Sun or Moon. The field also had to be far away from the plane of our own galaxy, to avoid being cluttered with objects in our galaxy. Finally, the field needed to have nearby guide stars, used to keep Hubble pointed at the field. These criteria led to the selection of a spot of sky near the handle of the Big Dipper in the northern hemisphere and a spot of the sky in the constellation Tucana in the southern hemisphere.

8. If there are thousands of galaxies visible in the Hubble Deep Fields, why does the lesson use just over 1000 as the populations of each HDF?
Because of the way astronomers' instruments work, they can be reasonably sure that they have detected all galaxies with a certain range of brightnesses in the Hubble Deep Fields. Astronomers may be able to identify fainter objects, but they cannot be sure that they have detected all of the fainter objects that exist. When studying populations of objects, astronomers need to make sure that the sample they choose is representative. The very faintest objects do not form a representative sample, since astronomers do not know if they have detected all of the faintest objects. Therefore, astronomers limit their sample to objects in a certain brightness range. The sample is then said to be "statistically complete" to that brightness level. For the HDF-N, the statistically complete sample consists of 1067 galaxies.

9. What is the importance of the HDF?
The HDF contains the faintest galaxies we've ever been able to see over a large range of distances. Since seemingly "empty" spots were chosen, most of the galaxies in the Deep Fields lie billions of light-years away. The images show that the early universe contained galaxies in a bewildering variety of shapes and sizes. Some had the familiar elliptical and spiral shapes seen among galaxies today, but there were many peculiar shapes as well. Few astronomers had expected to see this activity presented in such amazing detail. Besides the classical elliptical and spiral galaxies, the variety of other galaxy shapes and colors are important clues to understanding the evolution of the universe. Some of the galaxies may have formed less than one billion years after the Big Bang. The HDFs are important because they can help answer such questions as: How many galaxies are there in the universe? How did large-scale structure evolve in the universe? How were galaxies assembled? Is the universe open or closed? What is the age of the Universe? The HDF has been compared to a geological core sample of the earth, done on the sky. For more information, see key findings of the Hubble Deep Field: (http://oposite.stsci.edu/pubinfo/pr/97/hdf-key-findings.html)

· How many galaxies are there in the universe?
The Hubble Deep Field will be used to count galaxies ten times as faint as the deepest existing ground-based optical observations and nearly twice as faint as the deepest existing Hubble images.

· How did large-scale structure evolve in the universe?
The Hubble Deep Field will be used to perform a statistical study of the distribution of galaxies on the sky. This is an essential test of models for the structure of the universe and galaxy formation theories. The Hubble Deep Field will push such studies to fainter limits.

· How were galaxies assembled?
Detailed studies of the ages and chemical compositions of stars in our own galaxy suggest that it has led a relatively quiet existence, forming stars at a rate of a few suns a year for the last 10 billion years. Other spiral galaxies seem to have similar histories. If this is typical evolution for spiral galaxies, then predictions can be made for what they should have looked like at half their present age -- their size, color and abundance. This information, combined with actual distances derived from ground-based spectroscopic observations, will provide a new test for theories of spiral galaxies.

The other major class of galaxies seen in the nearby universe is the elliptical, football-shaped aggregates of stars that appear to be very old and stopped forming stars long ago. There is currently debate about when such galaxies formed and whether they formed through collisions of other types of galaxies or through collapse of a pristine cloud of primordial gas in the very early universe. The Hubble Deep Field, along with other deep Hubble images, provides a snapshot through time, which can be used to search for distant elliptical galaxies, or primeval galaxies that might later evolve into elliptical galaxies.

· Is the universe open or closed?
The distribution of galaxies in the Hubble Deep Field images may yield clues to the curvature of space. The Hubble Deep Field results will be compared to models that predict how the universe should look if it is open or closed.

If space is negatively curved, as first described by Einstein in his general theory of relativity, then the universe would be described as open. In an open universe, the universe would continue to expand forever because it lacks sufficient mass to establish the gravitational pull necessary to collapse back on itself. On the other hand, if space is described as positively curved then the universe folds back on itself. This is space described as unbounded but finite. Such a situation is called a closed universe. In this scenario the universe eventually stops expanding and then ultimately contracts back to a point.

10. Why was a second Deep Field taken?
The HDF-N covers a very small fraction of the sky. It would take 27 million fields and well over 500,000 years to use Hubble to survey the entire sky to the depth of the HDF. So, astronomers must rely on a thin "looking-through-a-soda-straw" view across the cosmos to infer the history of star and galaxy formation. Taking a second Deep Field helps astronomers to confirm that the HDF-N is representative, and that it is not unusual in some way. The two HDFs are, in fact, consistent with the common assumption of astronomers that the universe should look largely the same in any direction we look.
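The survey-time estimate can be checked with simple arithmetic. The sketch below assumes that every field would need the same 10 days of observing as the HDF-N and that the fields would be observed back to back with no gaps; both are simplifications made for this illustration.

```python
fields_needed = 27_000_000   # number of fields needed to cover the sky (from the text)
days_per_field = 10          # assumed: the same observing time as the HDF-N
days_per_year = 365.25

total_years = fields_needed * days_per_field / days_per_year
print(f"Surveying the whole sky would take about {total_years:,.0f} years.")
```

Under these assumptions the total comes out well above 500,000 years, consistent with the figure quoted above.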

11. It seems hard to see the shape of some of the galaxies in the HDFs. How do astronomers classify them?
They use the colors of the galaxies. Different types of galaxies tend to be different colors. For example, elliptical galaxies have reddish colors because they are mostly composed of old red stars. Astronomers study the colors of nearby elliptical, spiral, and irregular galaxies and compare these colors to those of the galaxies in the Hubble Deep Fields. Comparing the colors allows them to classify the galaxies.

12. What is the most common type of galaxy in the nearby universe?
When one counts both large and small galaxies, dwarf ellipticals (small ellipticals) are probably the most common type of galaxy in the nearby universe. Since these galaxies are small and faint, the exact number of these galaxies is not well known. The majority of large, bright galaxies in the nearby universe are spirals. Large bright elliptical galaxies are relatively rare.

Words from the Scientist:

Our Milky Way galaxy is just one of the many billions of galaxies that inhabit the universe. These galaxies come in different shapes, sizes, and colors. You will notice this variety in morphology as you capture galaxies in "Galaxy Hunter."

Studying the appearance of galaxies in the Hubble Deep Fields can help us understand how galaxies formed and how their properties have changed over time. For example, the presence of many irregularly shaped, blue objects in the Hubble Deep Fields may indicate that collisions between galaxies and episodes of rapid star formation, or "starbursts", were more common in the past. This possibility is very exciting to me, since I study starbursts in nearby galaxies.

Understanding results from the Hubble Deep Fields ultimately rests in mathematics, however. Our studies are based upon samples of galaxies drawn from the Hubble Deep Fields, and thus rely heavily on our understanding of sampling techniques and statistics. Come join us in "Galaxy Hunter" to see how important mathematics can be!

-Denise Smith

References:

See the Grab Bag page for a complete list of Web sites, books, and other related materials that can be used as references about statistics and the Hubble Deep Fields.