Homework assignment #3 - Introduction to Python - 2
Welcome to homework assignment #3 of the “Python for Psychologists” course, winter term 2024. We’ll continue working with Python in this Jupyter notebook. Again, if you encounter any problems along the way, please don’t hesitate to get in touch (via e-mail or the Discord server) or check back in the Intro to Python I or the Intro to Python II chapters on the course website.
Some general points:
Throughout the assignment you will be asked to either provide answers to questions in written form or write code directly. For the former, you can just write your answer in the empty markdown cells below the questions. For the latter, you can write your code in the empty code cells below the questions and run it so that the output (if there is any) is included in the notebook. The number and type of cells you need to answer a given question will be indicated, but if you want to use more cells, please don’t hesitate to add a few. If you see something like “Type Markdown …” or “Please provide your solution here”, you can just click (or double-click) on it to edit that markdown cell. If any of this sounds confusing or unclear, you can again consult the Introduction to Jupyter page.
Please don’t forget to save your notebook regularly (maybe every few minutes), either via Ctrl + S (Strg + S on German keyboards) on Windows or Cmd + S on macOS. Once you have finished the assignment, please save the entire notebook via File -> Save as... and provide the following name: FirstNameSurname_pfp_2022_homework_assignment_3, where FirstName should be changed to your first name and Surname to your surname.
The aim of this homework assignment is to go through the things we talked about in the session again and deepen your understanding of them. This will entail the following parts:
1. Operators & Comparisons
2. Strings
3. Lists
4. Dictionaries
You should be able to answer all of the following questions given the content included in the respective sections on the website, as well as the things covered during the session. If you feel like some of the things asked here were actually not (sufficiently) covered during the session, please don’t hesitate to let me know ASAP and we’ll figure that out. Otherwise you are of course welcome to train your “google muscles”. Just keep in mind that code directly copied from the internet should include sufficient credits (i.e. include links to the respective website as a comment in code/markdown cells). And remember that there is always more than one correct answer in programming.
However, after this way too long intro, I wish you a great time and hope this will feel less like an assignment and more like a fun add-on to the session. Again: learning new things takes time and programming is by no means an exception to that. Thus, don’t worry if you have to look back and forth between the assignment and the materials: we all start somewhere and you putting in the effort is already fantastic. You got this!
Homework assignment #3
1. Operators & comparisons
During the session we talked about arithmetic operations and comparisons being among the most basic uses of Python. Explain what they refer to and which types of them are available in Python.
Please provide your solution here
arithmetic operations:
Please provide your solution here
comparisons:
Let’s work with some “real” data: the variables group_control and group_target, which contain lists of numbers. Let’s imagine that it was an experiment with two groups in which the percentage of correct answers was measured:
the values of group_control: [0.56377612, 0.88131505, 0.11624772, 0.07403529, 0.75973353, 0.00409758, 0.93428345, 0.19309668, 0.3293604 , 0.67125433, 0.29852433, 0.84537071, 0.61858061, 0.69103466, 0.922849 , 0.4790626 , 0.84704065, 0.0797407 , 0.9544613 , 0.90069604]
the values of group_target: [0.7486552 , 0.51256649, 0.35233699, 0.45989524, 0.13884197, 0.44548063, 0.49262154, 0.6530387 , 0.1199654 , 0.64623522, 0.37008847, 0.35532377, 0.96873621, 0.90653078, 0.6161739 , 0.26623139, 0.28488811, 0.30282947, 0.12302529, 0.25670394]
Could you please "define" these variables - group_control and group_target with the respective values?
Hint: you can copy them
# Please write your solution here
# Please write your solution here
Let’s compute some stats on these variables, for example the minimum, maximum and mean values. We will need to remember how to import a new library for this, because we will not do it manually. You will need to use the numpy library and find the relevant functions in it. Can you compute the minimum, maximum and mean values for each group and assign them to the variables group_control_min, group_control_max, etc.?
Hint: if you imported a library successfully, you can write its name plus a dot and press TAB. You will then see a list of all available functions (they are highlighted in blue and marked with the letter f).
Hint: functions in numpy and other libraries can be used like numpy.function_name(argument_name).
Hint: function names are usually similar to the action they perform, so try to start typing them and then press TAB.
Hint: if you are lost here, please write in the Discord group.
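As a rough illustration of the hints above, here is a minimal sketch with made-up numbers. The name example_scores and its values are hypothetical and not part of the assignment data; np is just the commonly used alias for numpy.

```python
import numpy as np

# Hypothetical toy data, not the assignment's groups
example_scores = [0.2, 0.5, 0.8]

# numpy functions follow the pattern np.function_name(argument)
example_min = np.min(example_scores)
example_max = np.max(example_scores)
example_mean = np.mean(example_scores)

print(example_min, example_max, example_mean)
```

The same pattern should carry over to your own variables once they are defined.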
# Please write your solution here
And now, could you compare them? More precisely, could you do the following:
evaluate for each value if they are equal
evaluate if the minimum value of group_control is greater than that of group_target
evaluate if the maximum value of group_control is less than or equal to that of group_target
evaluate if the mean values are not equal
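To give an idea of how such comparisons can look, here is a small sketch with two made-up lists (first and second are hypothetical names). It uses zip together with a list comprehension for the value-by-value comparison, which is only one of several possible ways and may go slightly beyond what was covered in the session.

```python
# Hypothetical toy lists, not the assignment's groups
first = [1, 2, 3]
second = [1, 5, 3]

# Comparing two lists with == checks the lists as a whole
whole_list_equal = first == second

# zip pairs the values up so that each pair can be compared on its own
pairwise_equal = [a == b for a, b in zip(first, second)]

# Comparisons of single values return True or False
min_is_greater = min(first) > min(second)
max_is_less_or_equal = max(first) <= max(second)
sums_not_equal = sum(first) != sum(second)

print(whole_list_equal, pairwise_equal)
print(min_is_greater, max_is_less_or_equal, sums_not_equal)
```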
# Please write your solution here
# Please write your solution here
# Please write your solution here
# Please write your solution here
Awesome, thanks so much! That result regarding the mean values is interesting.
Maybe we should run a test to evaluate if they’re actually different in the statistical sense. Thus, let’s compute an independent t-test between the values of our variables, using scipy's stats module, specifically the ttest_ind function. Let’s assign the outcomes to the variables t_stat_group_difference and p_value_group_difference.
Hint: you will need to import ttest_ind from scipy.stats first
Hint: if the output of a function contains multiple values, you can access them by index or assign them to variables directly. For example, scipy.stats.ttest_ind returns a tuple-like result in which the first element (index 0) is the t statistic and the second element (index 1) is the p value. Alternatively, you can assign the outcomes of the function to two variables in one step, for example: t_stat, p = ttest_ind(data1, data2).
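Both ways of getting at the two values can be sketched as follows. The samples sample_a and sample_b are made up for illustration, and the sketch assumes a reasonably recent scipy version in which the result of ttest_ind can be indexed and unpacked like a tuple.

```python
from scipy.stats import ttest_ind

# Hypothetical toy samples, not the assignment's groups
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 3.0, 4.0, 5.0, 6.0]

# Option 1: keep the whole result and access the values by index
result = ttest_ind(sample_a, sample_b)
t_from_index = result[0]
p_from_index = result[1]

# Option 2: unpack both values in a single assignment
t_unpacked, p_unpacked = ttest_ind(sample_a, sample_b)

print(t_from_index, p_from_index)
print(t_unpacked, p_unpacked)
```

Both options give the same numbers, so which one you use is mostly a matter of taste.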
# Please write your solution here
Before we dive into other things, a quick important question: does Python actually follow the arithmetic rules that guide how we have to define equations, e.g. the order in which +, -, * and / are evaluated?
Please write your solution here
Optional Bonus
This part is optional, but the parts below it (numbers 2, 3 and 4) are not.
(The remaining tasks within this first part are optional and thus don’t need to be completed if you don’t want to, this won’t affect the final assessment of your homework assignment. However, we obviously strongly suggest that you give it a try as it will deepen your understanding of operators and comparisons as well as their important role regarding complex analyses. Here, this will entail the “manual” computation/derivation of the t-test you computed above via scipy.)
Now that we’ve assigned our test statistic and p-value to a separate variable each, we could move on with our lives, but let’s compute the independent t-test again, this time “manually”.
Thinking back to the intro to stats classes, we know that we need a few values to do that:
means, standard deviations, standard errors, critical value, etc.
Thus, we’re going to compute them and run the test step-by-step. We already have the means, so we can go directly to the standard deviations. We can compute them using the std function from numpy. All we need to do is provide our variables and the delta degrees of freedom (the ddof argument, here 1). We are going to store the outputs in the variables group_control_std and group_target_std.
(Hint: you can adapt the following syntax std1, std2 = np.std(data1, ddof=1), np.std(data2, ddof=1).)
# Please write your solution here
Next up, standard errors. We can compute them via dividing the standard deviation by the square root of the number of observations, i.e. the values in our variables.
As we already have the standard deviation, we’re only missing our number of observations.
To get these we’ll use the built-in function len(). We’re going to save the output to two new variables, group_control_n_obs and group_target_n_obs.
(Hint: you can adapt the following syntax len1, len2 = len(data1), len(data2).)
# Please write your solution here
Now we can use numpy’s sqrt function to compute the square root of our number of observations. As usual, we’re going to create two new respective variables, group_control_sqrt and group_target_sqrt.
(Hint: you can adapt the following syntax sqrt1, sqrt2 = sqrt(len1), sqrt(len2).)
# Please write your solution here
Nice, but we’re still missing our standard errors.
Let’s compute the standard errors and assign them to two variables called group_control_se and group_target_se by dividing the standard deviation by the square root of the number of observations.
(Hint: adapt the syntax for both your variables standard_error1 = std1/sqrt1.)
# Please write your solution here
The last thing we need to calculate for our t statistic is the standard error of the difference between samples (are the memories of the stats intro classes coming back?).
The resulting value should be stored in a variable called group_difference_sed and can be obtained via adapting the following equation sed = sqrt(standard_error1**2.0 + standard_error2**2.0).
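The chain of steps described so far (standard deviation, number of observations, standard error, standard error of the difference) can be sketched with made-up data as follows. All names and values here are hypothetical and only meant to show how the pieces fit together.

```python
import numpy as np

# Hypothetical toy samples, not the assignment's groups
sample_a = [2.0, 4.0, 6.0, 8.0]
sample_b = [1.0, 3.0, 5.0, 7.0]

# Sample standard deviations (ddof=1)
std_a, std_b = np.std(sample_a, ddof=1), np.std(sample_b, ddof=1)

# Number of observations per sample
n_a, n_b = len(sample_a), len(sample_b)

# Standard error: standard deviation divided by the square root of n
se_a, se_b = std_a / np.sqrt(n_a), std_b / np.sqrt(n_b)

# Standard error of the difference between the two samples
sed = np.sqrt(se_a**2.0 + se_b**2.0)

print(se_a, se_b, sed)
```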
# Please write your solution here
Bonus bonus question: what’s ** again?
Please provide your solution here
It’s t statistic time.
We get this value by dividing the difference between the means by the standard error of the difference between samples, i.e. t_stat = mean1 - mean2 / sed. Assign the value to a variable called t_stat
(Hint: check the proposed equation precisely. Remember the order of mathematical operations)
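If the hint feels a bit abstract, this tiny sketch with plain numbers (chosen arbitrarily) shows how parentheses change the result of an expression that mixes subtraction and division.

```python
# Multiplication and division are evaluated before addition and subtraction
without_parentheses = 10 - 4 / 2    # 4 / 2 is computed first
with_parentheses = (10 - 4) / 2     # parentheses force the subtraction first

print(without_parentheses)  # 8.0
print(with_parentheses)     # 3.0
```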
# Please write your solution here
Fantastic, amazing!
However, being the “thorough” “researchers” we were “trained” to be (ignoring the assumptions needed for a parametric test…) we know that the t statistic is only one half of the story, we also need a p value.
Adding this is rather straightforward: we need the degrees of freedom and the cumulative distribution function of the t-distribution.
Let’s start with the first one. This value is obtained by computing the sum of the number of observations (len) in both samples minus two. We’ll assign it to the df variable.
(Hint: you can adapt the following syntax df = len(data1) + len(data2) - 2.)
# Please write your solution here
We can now integrate that into a more complex function within which we calculate the p value of the test.
For this we need to get the cumulative distribution function of the t-distribution.
We can obtain this via the cdf function of the t distribution available in scipy’s stats module (stats.t.cdf).
Now we can finally adapt the following equation to get the p value: p = (1 - t.cdf(abs(t_stat), df)) * 2. Could you please do that in the cell below?
# Please write your solution here
Having computed the t_stat and p value, we need to situate them with respect to a given critical value and significance level.
Starting with the critical value, we can compute it by applying the percent point function (available for scipy’s t distribution as stats.t.ppf) to our degrees of freedom and the assumed alpha.
Regarding the latter, we’ll go “classic” and assume 0.05 and therefore 95% confidence (I know: shocker…). Thus, we can adapt the following function: critical_value = stats.t.ppf(1.0-0.05, degrees_of_freedom). Could you please do that in the cell below?
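To see how the two scipy functions behave, here is a minimal sketch that plugs in made-up numbers (example_t and example_df are hypothetical and unrelated to the assignment data). It follows the two equations given above as they are.

```python
from scipy import stats

# Hypothetical values chosen only for illustration
example_t = 2.0
example_df = 10
alpha = 0.05

# p value based on the cumulative distribution function (cdf)
example_p = (1 - stats.t.cdf(abs(example_t), example_df)) * 2

# Critical value based on the percent point function (ppf),
# which is the inverse of the cdf
example_critical = stats.t.ppf(1.0 - alpha, example_df)

print(example_p, example_critical)
```

Feeding the critical value back into stats.t.cdf should return 1.0 - alpha again, which is a handy sanity check.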
# Please write your solution here
Oh my, all these equations…but that should be all.
Now, we only need to compare our t statistic and p value to the critical value and alpha respectively to make some statements about the outcomes of our test.
Could you evaluate if the t statistic is greater than or equal to the critical value and if the p value is less than the alpha?
# Please write your solution here
# Please write your solution here
Whoa, we just computed a t statistic step-by-step in python. You might think: “WTF? I came here to learn python and not redo stats…” and you’re definitely right!
However, this example and the respective tasks aimed to provide you with more insights on what can be achieved via basic operations and comparison in python.
Furthermore, we want to emphasize again that these basic operations are the underlying workhorses of basically any more complex function.
Additionally, the steps should showcase some of the different functions supported by the numpy and scipy modules as a proxy for all the things that are possible (remember: these are just two of several hundred or thousand modules out there).
Lastly, this was also meant to be a practice regarding adapting code snippets you find in the documentation of the modules or the internet to your use case and data at hand. The good news is that this was the hardest part of this homework assignment, the rest should be way easier to follow/do.
So you’ve learned a whole lot, just by following along!
We know all of these things are a lot and for most of you completely new. However, you’ll definitely learn them by just keeping at it. You’re putting in a real effort, thanks for that!
2. Strings
Welcome back, don’t forget the rest of the assignment is mandatory!
After talking about different operations and comparisons, we talked about different data types and structures, starting with strings.
However, didn’t we talk about strings before as one of the fundamental data types? Is there something more going on with them? Could you please provide some pointers on why strings are somewhat different than, say, integers or floats in the cell below?
Please write your solution here
Interesting…how about we investigate this further by defining a new variable called stats_statement, that contains the following string as value: "The difference between the control and target group was not significant."?
(Hint: you can just copy-paste it.)
# Please write your solution here
With that in place: is there a way to get some basic information, like the type and length of the string? Name them in the cell below.
# Please write your solution here
# Please write your solution here
So far so good.
However, we know that non-significant results don’t sell well out there and that the industry demands us to only include significant results. Is there any way we could replace the not significant portion of our string with significant?
(Hint: Remember that python tries to stay as close to the natural use of language)
# Please write your solution here
That brings up an important point: are strings actually replaced "in-place" or do they require a new variable assignment? In case of the latter: how would you do that?
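As a small illustration of the question above, the following sketch uses a made-up sentence (not the assignment’s string) to show what the replace method returns and what happens to the original variable. It also includes the in operator as one possible way to check whether a string contains another.

```python
# Hypothetical toy sentence
sentence = "The sky is not blue."

# replace returns a new string; the original stays untouched
changed = sentence.replace("not blue", "blue")

print(sentence)  # The sky is not blue.
print(changed)   # The sky is blue.

# To keep the change under the old name, assign the result back
sentence = sentence.replace("not blue", "blue")

# The in operator checks whether one string is contained in another
contains_blue = "blue" in sentence
print(contains_blue)  # True
```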
Please write your solution here
# Please write your solution here
Can we actually make sure that the string contains the sub-string (or word/letter combination) significant, i.e. is there a way to compare two strings? Or maybe to find out if one string is contained in another?
# Please write your solution here
Great! We now want to make sure we put the right group comparison in our string. Is there a way to “extract” certain characters/sub-strings based on their index? If so, what would be the name of this technique and is there anything we should watch out for?
Hint: remember about how index works in python
Please write your solution here
Thanks for the reminder. Could you maybe get the sub-string “control and target” from the string?
(Remember Start, Stop, Step?)
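As a reminder of how start, stop and step work, here is a short sketch on a made-up string. Note that counting starts at 0 and that the stop index itself is not included.

```python
# Hypothetical toy string
text = "Python for Psychologists"

first_word = text[0:6]        # characters at index 0 up to (not including) 6
middle_word = text[7:10]      # characters at index 7, 8 and 9
last_word = text[11:]         # from index 11 to the end
every_second = text[0:6:2]    # index 0, 2 and 4

print(first_word)    # Python
print(middle_word)   # for
print(last_word)     # Psychologists
print(every_second)  # Pto
```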
# Please write your solution here
I’ve been informed that people who want to be taken somewhat serious about their work usually report more info than just a p-value. Crazy, right?
So, to seem somewhat capable let’s add some more information to our statement.
Please define a new variable called t_stat_statement with the value "The t statistic was: [insert t statistic here].", replacing [insert t statistic here] with the t statistic you computed above (in the very first task).
Hint: here you just need to manually insert a computed value from the first task, no formatting!
# Please write your solution here
Now concatenate the two strings within one print function in such a way that they are separated by a space.
Hint: you can do it in multiple ways, either within the print function or before calling the print function.
# Please write your solution here
In Python we usually like to automate as much as possible to remove human sources of error in our analyses and be as reproducible as possible.
Usually, the more we write or put in manually, the higher the chance of errors. This holds true, for example, for including computed outputs in print functions like we did above. Do you know a way to format the t_stat_statement string so that we can “automatically” insert the computed t statistic and don’t have to write it out ourselves?
(Hint: There are more ways than one to actually insert additional information into a string)
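Here are three of those ways, sketched with a made-up number (example_value is a hypothetical variable). Which one you prefer is largely a matter of taste, although f-strings are a common choice in current Python code.

```python
# Hypothetical value, not a result from the assignment
example_value = 3.14159

# f-string: the variable goes directly into the curly braces,
# optionally with a format such as .2f for two decimal places
with_fstring = f"The value was: {example_value:.2f}."

# str.format: the curly braces are filled by the arguments of format
with_format = "The value was: {}.".format(example_value)

# Concatenation: numbers need to be converted to strings first
with_concatenation = "The value was: " + str(example_value) + "."

print(with_fstring)        # The value was: 3.14.
print(with_format)         # The value was: 3.14159.
print(with_concatenation)  # The value was: 3.14159.
```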
# Please write your solution here
3. Lists
The next data type on our list are lists. Please explain what lists are and what makes them somewhat special (e.g. compared to other data types or structures such as strings or tuples).
Please write your solution here
As you might remember, it was said that the values of our group_control and group_target variables were provided as lists. Please briefly verify that they really have this type!
group_control= [0.56377612, 0.88131505, 0.11624772, 0.07403529, 0.75973353,
0.00409758, 0.93428345, 0.19309668, 0.3293604 , 0.67125433,
0.29852433, 0.84537071, 0.61858061, 0.69103466, 0.922849 ,
0.4790626 , 0.84704065, 0.0797407 , 0.9544613 , 0.90069604]
group_target= [0.7486552 , 0.51256649, 0.35233699, 0.45989524, 0.13884197,
0.44548063, 0.49262154, 0.6530387 , 0.1199654 , 0.64623522,
0.37008847, 0.35532377, 0.96873621, 0.90653078, 0.6161739 ,
0.26623139, 0.28488811, 0.30282947, 0.12302529, 0.25670394]
# Please write your solution here
One of the first things we usually want to know about lists is their length.
Could you briefly get the length of each list and indicate them via print functions like the following: "The control group had [insert number of responses here] responses.".
(Hint: try to use string formatting to insert the needed value)
# Please write your solution here
Going back to our lists and having a closer look at them, we recognize that there are actually some errors we need to fix.
You’ve learned that the data type list comes with some built-in functions that can make certain operations easier. Briefly name at least two and describe what they do.
Please write your solution here
Say initially, we need to change the order of elements in our lists.
Specifically, our groups should be sorted in ascending order. How would you do that? Assign the new lists to variables called group_control_asc and group_target_asc.
Attention: if we used the list.sort() function, we would change our list in-place. To avoid that, we use Python’s built-in function sorted(), so the syntax would be sorted_group = sorted(group).
(Hint: To get some input check out the Python “Howtos” in the official docs)
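The difference between sorted() and list.sort() can be sketched with a made-up list (values is a hypothetical name). The reverse argument shown at the end is part of both sorted() and list.sort().

```python
# Hypothetical toy list
values = [0.7, 0.1, 0.4]

# sorted() returns a new list and leaves the original as it is
values_ascending = sorted(values)
print(values)            # [0.7, 0.1, 0.4]
print(values_ascending)  # [0.1, 0.4, 0.7]

# reverse=True flips the sorting order
values_descending = sorted(values, reverse=True)
print(values_descending)  # [0.7, 0.4, 0.1]

# list.sort() changes the list itself and returns None
values_copy = list(values)
values_copy.sort()
print(values_copy)  # [0.1, 0.4, 0.7]
```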
# Please write your solution here
Now could you quickly do the same, but sort our group values in a descending order?
# Please write your solution here
If you’d want to add a value that we initially forgot at a specific index to our new lists, how would you do that?
For example, you want to add the value 0.782374610 at index 3 of group_control and the value 0.12039482 at index 8 of group_target, and do the same for our newly sorted lists group_control_asc and group_target_asc, respectively. Provide the answer below.
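Adding an element at a specific index can be sketched with a made-up list of letters (letters is a hypothetical name). The sketch also shows what happens to the elements that come after the new one.

```python
# Hypothetical toy list
letters = ["a", "b", "d", "e"]

# Before inserting, "d" sits at index 2
index_of_d_before = letters.index("d")

# insert(index, value) puts the value at the given index
letters.insert(2, "c")

# Everything from that index onwards moves one position to the right
index_of_d_after = letters.index("d")

print(letters)                              # ['a', 'b', 'c', 'd', 'e']
print(index_of_d_before, index_of_d_after)  # 2 3
```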
# Please write your solution here
This brings up two important questions. First, how do lists work regarding indices: the same way as strings or not?
Please write your solution here
Second, what happens to the indices of the other elements when we insert a new value?
Please write your solution here
Oh my, one of the “classics” happened: someone forgot to add the responses of the last two participants to the lists. Could you please append the values 0.27384265 and 0.56938264 to group_control_asc and the values 0.28385745 and 0.91282374 to group_target_asc, and do the same for the group_control and group_target variables, respectively?
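Adding values to the end of a list can be done one at a time or several at once. The following sketch uses a made-up list (responses is a hypothetical name) to show both options.

```python
# Hypothetical toy list
responses = [0.1, 0.2]

# append adds exactly one element to the end of the list
responses.append(0.3)
print(responses)  # [0.1, 0.2, 0.3]

# extend adds every element of another list to the end
responses.extend([0.4, 0.5])
print(responses)  # [0.1, 0.2, 0.3, 0.4, 0.5]
```

Note that calling append with a list as argument would add that list as a single nested element, which is usually not what you want here.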
# Please write your solution here
The errors won’t stop…we just noticed that participant 10 of the control group and 5 of the target group didn’t complete the experiment. Please remove their responses from the respective lists.
Hint: this refers to the 10th and the 5th participant, not to the index.
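The difference between a participant’s position and the index can be sketched with a made-up list of participant labels (participants is a hypothetical name). Using pop is only one option here; the del statement would work as well.

```python
# Hypothetical toy list
participants = ["p1", "p2", "p3", "p4", "p5"]

# The 3rd participant sits at index 2, because counting starts at 0
position = 3
removed = participants.pop(position - 1)

print(removed)       # p3
print(participants)  # ['p1', 'p2', 'p4', 'p5']
```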
# Please write your solution here
Wait, which variables did you use in the last task: group_xxx_asc or group_xxx? Which one would be the correct one to use, following this logic?
Please write your solution here
One last thing: say we want to make it impossible for our lists to be tampered with, i.e. that elements in our data structure could be changed. Which of the data types you’ve met could you use? And what is this property of the data type called again?
Please write your solution here
4. Dictionaries
The last data type we had a look at were dictionaries. Could you name some of the prominent differences to strings and lists?
Please provide your solution here
Damn, dictionaries seem pretty handy, and you might have guessed already that they will be useful for our future analyses whenever we need something in which we can store and access all of the previously obtained information, i.e. our t-test results, in an explicit manner.
However, given we changed our underlying lists quite a bit, we need to re-compute the t statistic and p value again. Don’t worry, you can definitely copy-paste the previously used scipy.stats’ ttest_ind function for that and not our lengthy manual implementation. Please do that for both our sorted, as well as our original lists.
Remember, if you want to use the ttest_ind function again, you first have to import it from scipy.stats
group_control= [0.56377612, 0.88131505, 0.11624772, 0.07403529, 0.75973353,
0.00409758, 0.93428345, 0.19309668, 0.3293604 , 0.67125433,
0.29852433, 0.84537071, 0.61858061, 0.69103466, 0.922849 ,
0.4790626 , 0.84704065, 0.0797407 , 0.9544613 , 0.90069604]
group_target= [0.7486552 , 0.51256649, 0.35233699, 0.45989524, 0.13884197,
0.44548063, 0.49262154, 0.6530387 , 0.1199654 , 0.64623522,
0.37008847, 0.35532377, 0.96873621, 0.90653078, 0.6161739 ,
0.26623139, 0.28488811, 0.30282947, 0.12302529, 0.25670394]
# Please provide your solution here
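Awesome, thanks! Could you now please create a dictionary called t_test_info (written with underscores, because a hyphen as in t-test_info is not allowed in Python variable names) with the following key-value pairs: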
t_statistic : the computed t statistic,
p_value : the computed p value,
n_responses_control : length of the list group_control,
n_responses_target : length of the list group_target,
responses_control : the list group_control,
responses_target : the list group_target,
(Hint: the value descriptions indicate the values you should put at the respective key.)
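As a reminder of the general pattern, here is a minimal sketch of a dictionary with made-up content (study_info and its keys are hypothetical). It also shows how keys, values and single entries can be looked at afterwards.

```python
# Hypothetical toy dictionary
study_info = {
    "study_name": "toy example",
    "n_participants": 3,
    "scores": [0.2, 0.5, 0.8],
}

# keys() and values() give an overview of the content
all_keys = list(study_info.keys())
all_values = list(study_info.values())

# Single values are accessed via their key, not via a position
n_participants = study_info["n_participants"]

print(all_keys)
print(all_values)
print(n_participants)  # 3
```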
# Please provide your solution here
The more complex your data types get the more you should do quality control. Is there an easy way to get all the keys and values of your newly created dictionary?
# Please provide your solution here
Assuming you wanted to access the computed p value, could you use the same kind of indexing as before for that? If not how would you get it?
# Please provide your solution here
Great! We decided to add a new key to our dictionary that holds the mean values of the control and target lists. Let’s call it mean_values and store a list with the means of the control and target lists in it.
Hint: you need to add a list of the mean values for both groups.
Hint: you need to use numpy.mean to compute the means.
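Adding a new key that holds a list can be sketched as follows, again with a made-up dictionary (study_info, scores_a and scores_b are hypothetical names). The last two lines show how to get a single element back out of the stored list.

```python
import numpy as np

# Hypothetical toy dictionary
study_info = {
    "scores_a": [1.0, 2.0, 3.0],
    "scores_b": [2.0, 4.0, 6.0],
}

# Assigning to a key that does not exist yet creates it
study_info["mean_scores"] = [
    np.mean(study_info["scores_a"]),
    np.mean(study_info["scores_b"]),
]

# First the key, then the index within the stored list
first_mean = study_info["mean_scores"][0]
second_mean = study_info["mean_scores"][1]

print(first_mean, second_mean)
```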
# Please provide your solution here
That’s a nice addition to our dictionary. For the last task, let’s access the mean values from our dictionary t_test_info (written with underscores, since t-test_info is not a valid Python variable name). Can you compare whether they are equal to each other and store this information under a new key called mean_comparison?
# Please provide your solution here
5. Outro/Q&A
Awesome, you did it, great job! Take a few seconds to appreciate your effort and have a little party!
Please remember to save the notebook as described in the beginning and send it to me via e-mail. If you have questions, encountered problems and/or errors, please use the cell to outline things further. The same holds true for any type of feedback!
Thanks for going through this assignment, we hope you found it useful and informative. Feedback on your work can also be provided and is just one e-mail or Discord message away!
You can write your feedback here as well