Get Real – Variation in data
Students use the PPDAC cycle to undertake statistical and probability investigations. This unit of work explicitly looks at making valid and reliable measurements and considers the different sources of variation that are present in data.
About this resource
The purpose of this unit is to support students to develop knowledge of:
- sources of variation in statistical investigations
- designing probability experiments that use real data to create probability distributions for numerical variables.
Get Real – Variation in data, is it real or induced?
Purpose
Students use their knowledge of the statistical enquiry cycle (PPDAC) when they pose a question for investigation comparing their class data to data for New Zealand students.
Students learn that theoretical model probabilities and experimental estimates of probabilities are approximations of the true probabilities as they make sense of outcomes or conclusions in light of a given situation and context.
As they plan an investigation pathway and follow it, students respect the fact that people have rights and obligations in relation to their own data and that of others.
Progression
Students are building on the following understandings, knowledge and practices that they have learned previously in phase 3. Knowing how to:
- determine the variables needed to answer investigative questions, and plan how to collect data for each variable
- collect data from a group (when all of the group can be surveyed), or source and use data collected by others
- analyse data and communicate findings in context
- pose investigative questions for chance-based situations, including those with not equally likely outcomes
- plan, conduct, and systematically record data from probability experiments
Learning in this unit contributes to students’ understanding of the following connections with the Mathematics and Statistics Learning Matrix for Level 6:
- fluently and flexibly investigate situations
- communicate mathematical and statistical ideas and insights using appropriate mathematical and statistical language, symbols, and representations
- explore situations involving variation using a variety of visualisations and measures
Required Resources
- CensusAtSchool data to download and upload into CODAP
- Session 1 dataset if needed
- Arnold, P. (2022). Statistical investigations | Te Tūhuratanga Tauanga, NZCER Press. (all secondary schools have been given a copy)
- Wild, C. & Pfannkuch, M. (1999). Statistical Thinking in Empirical Enquiry. International Statistical Review, 67(3), 223–265. doi: 10.1111/j.1751-5823.1999.tb00442.x (3 Variation, Randomness and Statistical Models, pp. 235-237)
See Materials that come with this resource to download
- Variation in data 1 (pptx.)
- Variation in data 2 (pptx.)
- Copymaster 1: 3-2-1 Grid
- Copymaster 2: Making measurements activity
- Copymaster 3: Sets of axes for box plots
- Copymaster 4: Interrogating data and posing investigative questions
Sessions
In this session, students explore real variation and apply the statistical enquiry cycle (PPDAC) to a summary situation using measurement data, sourcing data from the latest CensusAtSchool database and collecting some data from themselves.
Introduction
Over the next three sessions we are going to explore different sources of variation in data. Show the diagram below (slide 3 of PowerPoint 1) if required.
“Variation is an observable reality. It is present everywhere and in everything. Variability affects all aspects of life and everything we observe. No two manufactured items are identical, no two organisms are identical or react in identical ways. In fact, individual organisms are actually systems in constant flux. The aforementioned refers only to real variation inherent in the system. Fig. 6 [above] depicts how, when we collect data from a system, this real variation is supplemented by variation added in various ways by the data collection process” (Wild & Pfannkuch, 1999, p. 235).
Variation in data is made up of real variation and induced variation. Real variation is the characteristic of a system, and induced variation is added by data collection processes. The direction of the arrows is important: measurers and devices are elements of variation due to measurement, which is an element of induced variation; collection and processing processes are elements of variation due to accident, which is also an element of induced variation. We will cover these two types of induced variation in session two. A third element of induced variation is sampling, which is the focus of session three.
1. Explain to students that they will look at different sources of variation that occur in statistics over the next sessions.
2. Check in with students about what the term variation means to them.
3. Today’s focus is real variation.
- When the same variable is measured for different individuals, there will be differences in the measurements, simply due to the fact that individuals are different. This can be thought of as individual-to-individual variation and is often described as natural or real variation.
- Repeated measures on the same individual may vary because of changes in the variable being measured. For example, an individual’s blood pressure is not exactly the same throughout the day. This can be thought of as occasion-to-occasion variation and is also an example of real variation.
4. Explore these two ideas: (3i) using measurement data from the latest CensusAtSchool database, and (3ii) by collecting data from students during the session. At the time of writing the latest database was 2023; some of the variables suggested may change in future CensusAtSchool surveys. Students will undertake the following activities:
- To demonstrate (3ii) occasion-to-occasion variation, ask students to record their heart rate three times during the session. This activity could be situated in the context of fitness training programmes.
- Students should count the number of beats for 10 seconds and then multiply the result by six to get their heart rate in beats per minute. This can be done collectively as a class. Collect at the start (now), during the session and about five minutes from the end. Students record their own heart rates.
- To explore (3i) individual-to-individual variation data, the latest CensusAtSchool database will be used. In 2023 the section on measurements (Q10-13) and the section on games (Q17-20) both provide measurement data that suit this purpose.
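The heart rate conversion described above can be sketched in a few lines of Python. The pulse counts below are made up for illustration and are not data from the unit; the names `counts_10s` and `beats_per_minute` are likewise hypothetical.

```python
# Hypothetical 10-second pulse counts for one student at three points in the session
counts_10s = {"start": 12, "middle": 13, "end": 11}


def beats_per_minute(count_10s):
    """Convert a 10-second pulse count to beats per minute (6 x 10 s = 60 s)."""
    return count_10s * 6


heart_rates = {when: beats_per_minute(c) for when, c in counts_10s.items()}

# Occasion-to-occasion variation: the spread of one person's repeated readings
spread = max(heart_rates.values()) - min(heart_rates.values())

print(heart_rates)                                 # {'start': 72, 'middle': 78, 'end': 66}
print("Range across occasions:", spread, "bpm")    # 12 bpm
```

Because each count is multiplied by six, recorded heart rates can only move in steps of 6 beats per minute, so a small difference between readings may partly reflect how precisely the pulse was counted.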
Data/Plan – sourcing and interrogating the data
5. Start with the questionnaire. Direct students to look at survey questions 10-13 and 17-20.
6. Ask the students to describe what is measured in the two different sections (measurements and games). Notice the following:
- Do students identify that, in the measurements section (Q10-13), each question measures some part of a person’s body, e.g., height, and right foot length?
- Do students identify that, in the games section (Q17-20) each question measures what a person does, e.g., reaction time, standing jump?
7. Collectively as a class identify all of the variables that are available across these eight survey questions. (What is a variable?)
8. Discuss, as a class, reasons why this type of information might be collected.
Notice the following:
- Do students identify that body measurements might be of interest to clothing manufacturers? For example, a footwear manufacturer could use the distribution of foot sizes to decide how many of each size shoe to produce.
- Do students identify that the types of information in the games section might be of interest to sports coaches to help them track athletes' training?
9. Discuss with students the importance of data sovereignty. Emphasise the importance of whakapapa (genealogy) and how personal information is tied to individual and collective identity. Discuss the idea of being kaitiaki (guardians) of personal information and the responsibility that comes with it.
10. In pairs, students select two variables they are interested in exploring (one from measurements and one from games) and interrogate them. Each pair of students uses a set of three slides from PowerPoint 1. The slides within the slide deck are coloured in groups of three. Alternatively, use Copymaster 4.
Problem – developing investigative questions
11. Explore the two variables further. Using the second slide in their set of three, students (1) capture ideas they have about one of the variables, (2) decide what population to explore e.g., year 9-11, (3) pose a summary investigative question about their selected variable for the population they have selected, and (4) describe what they think the data will show, e.g., the shape of the distribution, possible high value, low value, middle value.
Possible summary investigative questions
- What are the right foot lengths of year 10 students in New Zealand?
- What are the reaction times for New Zealand year 12 students?
12. Using the third slide in their set of three, repeat for the second variable.
13. Take the second heart rate reading.
Data/Analysis: get, display & describe
14. GET: Students download a sample of 1000 from CensusAtSchool for the population group selected. They include in the dataset the two variables they have chosen. The sample is imported into statistical software, e.g., CODAP. Click to view a video on importing data from CensusAtSchool into CODAP
15. DISPLAY: Students make a data visualisation for each of the variables that shows the individual values (e.g., a dot plot).
16. DESCRIBE: Students describe what the data shows. This would include the shape of the distribution, high and low values, and the middle of the distribution. Take time to notice the individual-to-individual variation. While some of the 1000 individuals selected have the same measurement (e.g., height), there is variation in the measurements (e.g., heights). Features that describe variability include the range, interquartile range, intervals (e.g., describing the tail for a skewed distribution), clustering, and position of the majority. Features that describe signal and noise, e.g., the centre, include the median, mean, middle 50%, peak(s) and modal groups. Check students' descriptions include the context (variable, values, units and the population).
17. Students share their CODAP (or other statistical software) document with the teacher.
18. Take the third heart rate reading. Take time to notice the occasion-to-occasion variation of the repeated measure of heart rate from an individual.
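As a rough companion to the DESCRIBE step, the sketch below computes some of the measures of centre and variability named there using Python's standard `statistics` module. The foot lengths are invented for illustration, and the quartile method used here ("inclusive") is an assumption; CODAP or iNZight lite may calculate quartiles slightly differently.

```python
import statistics

# Hypothetical right foot lengths (cm) for a small group; not CensusAtSchool data
foot_lengths_cm = [22.0, 23.5, 24.0, 24.0, 24.5, 25.0, 25.0, 25.5, 26.0, 27.0, 28.5]

low, high = min(foot_lengths_cm), max(foot_lengths_cm)
median = statistics.median(foot_lengths_cm)
mean = statistics.mean(foot_lengths_cm)

# Quartiles mark the edges of the middle 50% of the data
q1, _, q3 = statistics.quantiles(foot_lengths_cm, n=4, method="inclusive")
iqr = q3 - q1

print(f"Low {low} cm, high {high} cm, range {high - low} cm")
print(f"Median {median} cm, mean {mean} cm")
print(f"Middle 50% from {q1} cm to {q3} cm (IQR {iqr} cm)")
```

The numbers alone do not replace a description in context; students still need to state the variable, units and population when they write about what they see.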
Reflection
19. Each student completes a 3-2-1 Grid to reflect on the activity plus define individual-to-individual variation and occasion-to-occasion variation. These are handed in to the teacher.
Notes for teachers – session one
Concepts that are being developed or used in this session include:
sources of variation - individual-to-individual variation and occasion-to-occasion variation
sourcing secondary data (Arnold, 2022, pp. 79, 158-162)
interrogating the data to find out about what, why, when, where, “who”, how - the original investigator’s plan
defining variables and populations
using survey questions and the variable list to define the variables of interest
deciding which population groups to explore, understanding that they are New Zealand year 4-13 students
posing summary investigative questions
use the criteria for what makes a good investigative question to interrogate (Arnold, 2022, pp. 36-37)
thinking about what the data might show before sourcing and analysing
summary situations are chosen to allow the teacher to check out prior knowledge around:
creating data visualisations using statistical software e.g., see CODAP and iNZight lite
describing data visualisations (Arnold, 2022, pp. 298-317)
including the context in descriptions (Arnold, 2022, p. 299, contextual knowledge & p. 384)
using a large sample size (1000) gives many different measurements (individual-to-individual variation) and distributional shape of the variable is obvious
downloading data and importing into statistical software
Provided Dataset if needed
Use this dataset in CODAP if you do not want students to source and upload the data into CODAP themselves. Note that the dataset contains only New Zealand Year 9 and 10 students; all the variables for Q10-13 and Q17-20 (2023 questionnaire) are included. The CODAP document is set up with two graphs and two text boxes ready to use. When students click on the link they open their own copy to use.
Here is a CODAP document with the variables graphed for teacher reference.
Introduction
1. In this session we are going to look at induced variation due to measurement and accident (slide 2 of PowerPoint 2).
- Measurement variation - devices (slide 3) refers to differences or errors introduced by measurement instruments or tools used in data collection. It can arise from inaccuracies, calibration issues, or limitations of the devices. Regular calibration and validation of devices, as well as ensuring consistency in measurement procedures, can help reduce measurement variation.
- Measurement variation - measurers (slide 4) refers to differences or errors introduced by the people doing the measuring. It can arise from inconsistent measurement procedures across different measurers, and from non-sampling errors such as participants providing incorrect information or giving socially desirable answers rather than what they really think.
- Accident variation - collection (slide 5) includes human errors such as transcriptional errors and estimation errors, and individuals providing incorrect information.
- Accident variation - processing (slide 6) includes errors to do with processing the data, including skipping or duplicating entries and placing data in the wrong column in the spreadsheet.
Making Measurements
2. In the making measurements activity (Copymaster 2) students are working in groups of seven. Each member of the group is given a different instruction card. Card A is the person being measured, call for a volunteer within the group for this card. The rest of the cards are handed out face-down. Let students know that they are not to show or share their instructions with anyone else until everyone has made a measurement. In this activity, students are making measurements of the same variable on the same individual.
3. Students make the wrist circumference measures as per the card they are given and tell the person who was measured to record it for them. Ideally, they should not be able to observe how students before them made the measurements.
4. Once everyone has made their measurements and they have been recorded, students can share their instructions. Identify and write down which instructions are similar, and which instructions are different.
- Some students used string and a ruler, some used a tape measure.
- Some measured to the nearest millimetre, some to the nearest centimetre.
- Some were told which wrist to measure, some were not.
5. Card A student shares the information they gathered about which wrist was measured and where on the wrist they measured.
6. Discuss as a group which type of induced variation each of the ideas from step 4 is. Give ideas about how they can improve the instructions and measuring tools to reduce variation in what is measured. Once they have created their own set of instructions, get everyone to measure according to the instructions. They keep their measurement secret until everyone has measured.
7. Compare their measurements now. Are they the same? If not, how much variation is there? Is it less or more than before?
8. Look at the instructions for measuring the left wrist from the 2023 CensusAtSchool survey. See the measurement station cards page 36 of the Secondary Teachers Guide. Compare this with the instructions they created as a group.
9. Why do the “bumpy” bones matter?
Notice the following: Can they describe why it is important to be consistent about where the wrist circumference is measured from measurer to measurer?
10. Would it matter if we measured the right wrist? Why?
Notice the following: Do they recognise that the right wrist may be a different size?
11. Why do we measure in centimetres to one decimal place?
Notice the following: Can they explain that since wrist circumferences are mostly around 14-18cm, measuring to the nearest cm wouldn’t show the real variation of wrist circumferences? A look at the data on CensusAtSchool (18 June 2023, 2023 database) shows that there appears to be quite a bit of rounding to the nearest cm regardless.
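The point about measuring to one decimal place can be illustrated with a small sketch. The wrist circumferences below are invented values in the 14-18 cm range, not CensusAtSchool data, and `round_half_up` is a hypothetical helper written for this example.

```python
import math

# Hypothetical wrist circumferences (cm), recorded to one decimal place
wrists_cm = [14.2, 14.8, 15.1, 15.3, 15.4, 15.9, 16.2, 16.4, 16.6, 17.3]


def round_half_up(x):
    """Round to the nearest whole centimetre, with halves rounding up."""
    return math.floor(x + 0.5)


nearest_cm = [round_half_up(w) for w in wrists_cm]

# Count how many different values survive at each level of precision
distinct_1dp = len(set(wrists_cm))
distinct_cm = len(set(nearest_cm))

print("To 1 d.p.:", sorted(set(wrists_cm)))
print("To nearest cm:", sorted(set(nearest_cm)))
print(distinct_1dp, "distinct values become", distinct_cm)   # 10 distinct values become 4
```

In this made-up example, ten different people collapse onto four whole-centimetre values, which is one way to show students how coarse measurement can hide real individual-to-individual variation.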
Validity, Reliability & Bias
12. Introduce Jessica Utts’ definitions (slide 7 of PowerPoint 2) of validity, reliability, and bias. Discuss these definitions in the context of measuring the left wrist circumference (slides 8-17). How do validity and reliability help with reducing induced variation? Notice the following:
- Do students understand that validity depends on measurers measuring the same thing, e.g., left wrist?
- Do students understand that reliability requires accuracy in the measurements that are taken, i.e. that they are taken consistently from one measurer to the other?
- Do students understand that, while both focus on reducing the variation due to measurers, reliability also supports reducing variation due to devices?
13. Explore the instructions for standing on your left leg. This question has been in the survey since 2013, and from 2019 instructions have been given about how to do this. Students read the 2019 instructions (page 9), the 2021 instructions (page 35), and the 2023 instructions (page 39), and describe the changes that have been made and why they think the changes have been made (in terms of reducing induced variation).
Reflection
14. Exit card: On an A6 size piece of paper students note one thing that surprised them about induced variation and hand it to the teacher.
Notes for teachers – session two
Concepts that are being developed or used in this session include:
- sources of variation - induced variation due to measurement and accident
- repeated measurements of the same person and the same variable, by different people
For more on measurement errors (Arnold, 2022, pp. 119-120)
For more on validity, reliability and bias (Arnold, 2022, pp. 117-118)
In this session, students explore induced variation from sampling, developing the concept of sampling variability using samples of size 30 (ANALYSIS).
Introduction
Sampling variation occurs when different samples from the same population produce different results. This variation arises because it is usually impractical or impossible to collect data from the entire population of interest. Random sampling techniques and increasing sample sizes can help reduce sampling variation.
1. In this session we are going to look at how samples from the same population are similar and how they are different. Technology and existing data in the CensusAtSchool database allow us to take multiple samples very quickly from the population to see what happens. This database has been created for the purpose of supporting teaching and learning in many ways, including taking multiple samples. In reality, though, when we survey a sample from a population, we only have one sample.
Problem – posing an investigative question to explore
2. For the purpose of this activity, we are going to explore one of the measurement variables we have looked at previously. As a class, decide which variable to look at by choosing from height, left foot length, right foot length, left wrist circumference, or left thumb circumference. Decide also the population of interest, e.g., year 10 New Zealand students. Pose a summary investigative question.
- What are the right foot lengths of year 10 students in New Zealand?
Plan – What to do
3. Working in pairs, students take five different samples from the latest CensusAtSchool database to answer the investigative question. (At the time of writing the latest database was 2023. Some of the variables suggested may change in future CensusAtSchool surveys.) As a class, decide on the sample size; the recommendation is to choose around 30-50.
Data – Generate samples
4. Select the database, select the subpopulation (year groups), select the variable(s) (choose only the one required for the investigative question), and enter the sample size, e.g., 30.
Analyse – get information to draw a box plot
5. Click generate sample, and then analyse sample.
6. This creates a dot and box plot in iNZight lite. Click on the summary tab to see the five-number summary (Min, LQ (25%), Median, UQ (75%), Max).
7. Students plot the five-number summary on a set of axes from Copymaster 3. Mark the median in blue, draw the middle 50% (the box) in red, and draw the whiskers in black.
8. To get a new sample, click the back screen arrow; reselect the I agree statement, scroll to the bottom of the page, enter the same sample size used previously, click generate sample, then analyse sample.
9. Record the information on the sheet on a new graph (the sheet contains five axes). Continue to generate five box plots.
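For teachers who want to preview what repeated sampling looks like before using the CensusAtSchool sampler, the following is a minimal simulation sketch. It does not use CensusAtSchool data: the population is a made-up stand-in generated from a normal model (an assumption, real foot lengths need not follow this shape), and `five_number_summary` is a hypothetical helper.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same samples each run

# Hypothetical stand-in population: 5000 right foot lengths (cm)
population = [round(random.gauss(24.5, 2.0), 1) for _ in range(5000)]


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


# Take five random samples of size 30 from the same population
summaries = [five_number_summary(random.sample(population, 30)) for _ in range(5)]

for i, (lo, q1, med, q3, hi) in enumerate(summaries, start=1):
    print(f"Sample {i}: min={lo:.1f} LQ={q1:.1f} median={med:.1f} UQ={q3:.1f} max={hi:.1f}")

medians = [s[2] for s in summaries]
print(f"Sample medians range from {min(medians):.2f} to {max(medians):.2f}")
```

Changing the sample size from 30 to a larger value and rerunning gives a quick way to explore how the spread of the sample medians responds.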
Analyse – what do you notice?
10. Students look at their five box plots and write “I notice” statements about what they notice that is similar and what they notice that is different between the samples of the same size from the same population.
Noticings could include: that the medians are all different, but within a small range; the middle 50% varies in size and position, but they are all within the range (given)
11. Students combine their five box plots with another pair and look across the ten box plots. What do they notice?
12. Encourage students to annotate their box plots to get an interval for the median and for the middle 50%.
13. Provide the population values (if possible, as it depends on the variable and population combination). For example, to find the population values for right foot length and year 10, use the whole database and select year categorical and right foot length. Click on analyse, change the first variable to right foot length and then select year categorical for the second variable. The resultant graph is too small to read off the desired values.
The information needed is in the summary tab:
The relevant summary statistics for the population can be read in the table. In this instance, the median for year 10 is 25 cm and the middle 50% is from 23 cm to 26 cm.
This information was downloaded 20 June 2023 and may change over time as more data is added to the database. What you see might not be the same as the summary statistics presented here.
Reflection – about sampling variability
14. Students, in pairs, write a summary statement on an exit card about what sampling variability is. This is handed in to the teacher at the end of the session.
Notes for teachers – session three
Concepts that are being developed or used in this session include:
sources of variation - induced variation due to sampling
For more on sampling variability (Arnold, 2022, pp. 349-352).
For animations to support sampling variability see Animations of Sampling Variation.
In this session, students design and explore probability distributions for real data about themselves. They pose a chance-based investigative question (PROBLEM), PLAN to collect data (experimental estimates of probabilities) and then collect and record the DATA by undertaking the probability experiment.
Introduction
1. We will undertake a probability investigation to create a probability distribution for the number of throws to successfully get a ball in a bin.
2. In groups, students discuss how many throws it might take to get a ball in the bin. What might the distribution of the number of throws look like (for them the individual, say if they did this 20 times)?
Show students the potential set-up for the experiment, a line on the floor and the bin placed a distance away.
Ask questions such as what do you think the largest number of throws would be, the smallest number of throws? How many throws would be typical?
What shape do you think the distribution would be (linking to statistics distributional shape e.g., symmetrical and unimodal, uniform, left or right-skewed)? How might it be a chance-based situation? (For it to be chance-based, assume all participants are novices at throwing a ball in a bin, and that they will not improve markedly over say 30 throws. Alternatively, you could assume that participants are sufficiently expert and will not improve. The chance of success is not certain (probability of 1) or impossible (0). That is, sometimes you succeed and sometimes you fail.)
3. Ask what could be changed so that they take more throws, or fewer throws to get a ball in the bin?
Ideas could include moving the bin further away, or closer; having a smaller bin, or larger bin; non-preferred throwing hand, preferred throwing hand; standing up, sitting down; underarm, overarm
Problem – Pose an investigative question for a chance-based situation.
4. Get students to discuss in their groups ideas about an investigative question to explore this chance-based situation.
The investigative question will have something to do with probability. What probabilities could students explore?
5. Collate ideas from the class about what they might be exploring and use these to confirm the investigative question:
What is the probability that I get a ball in the bin on my first throw?
Note to teacher
In other words, we are trying to find the probability of success for throwing a ball in the bin.
Ask students what they think this probability might be: half, ¾, ¼? Do we know this already? We know the probability of rolling a six on a die is ⅙, but do we know the probability of getting a ball in the bin? No, so we need to undertake a probability experiment to find this out.
To answer this probability investigative question we first need to answer this investigative question: How many throws does it take to get a ball into the bin?
A probability experiment to answer the investigative question.
6. Ask students to discuss in their group what they need to do to collect data to answer the investigative question - capture ideas from the groups and use these and strategic questioning to develop a plan to undertake a probability experiment.
What are the possible outcomes of this probability experiment (the sample space)?
The variable, x, the number of throws needed, including the successful one, to get a tennis ball in the bin has the sample space {1, 2, 3, 4,...}.
What assumptions are we making?
Assumptions: In this experiment, we assume the probability of success for every throw is the same, and that any throw is independent of previous throws.
What would a trial consist of?
A trial consists of the thrower throwing (tennis) balls at the bin, until they get one in.
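To see what a trial looks like under the stated assumptions, here is a small simulation sketch. The probability of success used (`p_assumed = 0.4`) is purely hypothetical, since the real value is what the experiment sets out to estimate, and `run_trial` is an illustrative name rather than part of the unit.

```python
import random


def run_trial(p_success, rng):
    """Count throws, including the successful one, until the ball goes in.

    Each throw succeeds with the same probability and ignores earlier throws,
    matching the assumptions stated for the experiment.
    """
    throws = 1
    while rng.random() >= p_success:  # a miss: throw again
        throws += 1
    return throws


rng = random.Random(42)   # seeded so the sketch is repeatable
p_assumed = 0.4           # hypothetical probability of success on any one throw

# Simulate 20 trials for one thrower
outcomes = [run_trial(p_assumed, rng) for _ in range(20)]
print("Number of throws in each trial:", outcomes)
print("Smallest:", min(outcomes), "Largest:", max(outcomes))
```

Every outcome is a whole number of at least 1, which matches the sample space {1, 2, 3, 4,...}. Comparing simulated outcomes with students' real data later is one possible way to discuss whether the assumptions look reasonable.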
Defining the constraints for the trial
Each thrower, for example,
- is throwing from the same fixed length (to be agreed by the class) away from the bin
- will use the same types of ball (e.g., tennis ball, ping pong ball)
- has the same type of bin (e.g., box, yoghurt container)
- is sitting down on a classroom chair
- throws the ball using an overarm throw from their preferred hand
- has no practice.
How many trials per person?
Students can work together in a group of three to collect their data. Each student has turns at being the thrower, recorder and cleaner-upper. They should do 10 trials each and then if time do another 10 trials each (total of 20), and if super efficient they could do another 10 each (total 30). Allow the rest of the session to collect the data.
How will the results be recorded?
The actual outcome of each trial as a value of the variable, x, is recorded. Record the trial number and the outcome. A group could use a spreadsheet or similar, and also record the thrower's name or initials, e.g.,
| Name | Trial Number | Number of throws |
| Freya | 1 | 1 |
| Freya | 2 | 2 |
| Freya | 3 | 7 |
|  |  |  |
Data – Undertake the probability experiment and record the data
7. Students work in groups of three to undertake the probability experiment. Each member of the group takes turns to be the thrower, recorder and cleaner-upper. Each group needs about 5-10 balls, a bin, a line marked at an agreed fixed length from the bin, a chair and recording materials e.g., a spreadsheet.
Notes for teachers – session 4
Concepts that are being developed or used in this session include:
- setting up a probability experiment using real data requires the conditions to be the same for each trial
- making assumptions underpinning the experiment very clear
- creating experimental probability distributions
- finding the experimental probability of success
In this session, students design and explore probability distributions for real data about themselves. They ANALYSE the data, answer the chance-based investigative question and communicate their findings (CONCLUSION). This session continues work carried out in session four.
Analysis – moving towards a probability distribution
Frequency distribution for the number of throws
1. Students initially make two data visualisations of their own data.
- Make a dot plot to show the distribution of the number of throws to get a ball in the bin.
- Make a time-series graph. Put the trial number on the horizontal axis and the number of throws to get a ball in the bin on the vertical axis.
2. Describe the distribution for the number of throws to get a ball in the bin (dot plot).
3. Find the mean and median number of throws.
4. Look at the time-series graph. Does it show improvement over time, or is it stationary, with no improvement?
This is checking our assumption that the probability of success for the throws remains constant and does not change over time, i.e. there is no practice effect.
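Alongside the time-series graph, a crude numerical check for a practice effect is to compare the average number of throws in the first half of the trials with the second half. This is only a sketch of one possible check, not a method prescribed by the unit, and the data are invented to show what improvement might look like.

```python
import statistics

# Hypothetical number of throws per trial, listed in the order the trials were done
throws = [5, 3, 6, 2, 4, 1, 3, 2, 1, 2, 1, 1]

half = len(throws) // 2
first_half_mean = statistics.mean(throws[:half])
second_half_mean = statistics.mean(throws[half:])

print(f"Mean throws, first half of trials:  {first_half_mean:.2f}")   # 3.50
print(f"Mean throws, second half of trials: {second_half_mean:.2f}")  # 1.67

if second_half_mean < first_half_mean:
    print("Fewer throws were needed later on; a practice effect is possible.")
else:
    print("No sign of fewer throws later on.")
```

With so few trials, a difference like this can also arise by chance, so it is better treated as a prompt for discussion than as proof that the thrower improved.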
Experimental probability distribution for the number of throws
5. Show students how to create a probability distribution of their probability experiment data (a probability distribution has relative frequencies expressed as percentages or decimals) e.g., in CODAP - “treat the variable as categorical” and in the configuration menu select fuse dots into bars, and then select percent for the scale. The actual percentages can be added from the measures menu as shown in this data visualisation.
6. Describe the experimental probability distribution e.g.,
The experimental probability distribution for the number of throws I took to get a ball in the bin is right-skewed. The relative frequency is high for one throw and drops down reasonably consistently until the relative frequency is zero after seven throws. The probability of getting a ball in the bin in one throw is 40%, and 73% of the time it took three or fewer throws.
7. Find the experimental estimate of the probability of success in a single throw.
Method 1. To do this we divide the number of successes by the total number of throws for those trials.
Method 2. Use the probability for x = 1
P(Success) = P(x =1)
8. Compare the probability of success, p, obtained by method 1 and method 2. What do you notice?
9. Why would method 1 be better than method 2?
Notice the following: Can students explain that method 1 considers all of the data?
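The steps above (building the experimental probability distribution and comparing the two estimates of the probability of success) can be sketched as follows. The data are hypothetical, made up for 15 trials, and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical number of throws needed in each of 15 trials
throws_per_trial = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7]
n_trials = len(throws_per_trial)

# Experimental probability distribution: relative frequency of each outcome
counts = Counter(throws_per_trial)
distribution = {x: counts[x] / n_trials for x in sorted(counts)}
for x, rel_freq in distribution.items():
    print(f"P(x = {x}) = {rel_freq:.2f}")

# Method 1: successes divided by total throws.
# Each trial ends with exactly one success, so successes = number of trials.
p_method1 = n_trials / sum(throws_per_trial)

# Method 2: the relative frequency of succeeding on the first throw
p_method2 = distribution.get(1, 0.0)

print(f"Method 1 estimate: {p_method1:.3f}")   # 0.375
print(f"Method 2 estimate: {p_method2:.3f}")   # 0.400
```

Method 1 draws on all 40 throws in this example, while method 2 uses only whether each of the 15 trials ended on the first throw, which is one way to frame the discussion in step 9.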
Conclusion
10. Compare individual findings within their group. What are the similarities and differences in their findings?
11. Compare probability distributions e.g., shape, and probability for success (methods 1 and 2).
12. Answer the investigative question using evidence from your findings.
13. How does what you found compare with what you anticipated at the beginning of the previous session?
Throughout the unit, use the ‘notice the following’ prompts as a means for formative assessment. Observe the mathematical ideas and practices demonstrated by students, and respond to these using questioning, prompts, examples, modelling, and further, more targeted teaching.