Data analyzes II - data visualization and analyses#

Edited by Yury Markov (he/him) Peer Herholz (he/him)
Habilitation candidate - Fiebach Lab, Neurocognitive Psychology at Goethe-University Frankfurt
Research affiliate - NeuroDataScience lab at MNI/McGill
Member - BIDS, ReproNim, Brainhack, Neuromod, OHBM SEA-SIG, UNIQUE

logo logo   @peerherholz

Before we get started …#


Objectives 📍#

  • learn basic and efficient usage of python for data analyzes & visualization

    • working with data:

      • reading, working, writing

      • preprocessing, filtering, wrangling

    • visualizing data:

      • basic plots

      • advanced & fancy stuff

    • analyzing data:

      • descriptive stats

      • inferential stats

Recap - Why do data science in Python?#

  • all the general benefits of the Python language (open source, fast, etc.)

    • Specifically, it’s a widely used/very flexible, high-level, general-purpose, dynamic programming language

  • the Python ecosystem contains tens of thousands of packages, several are very widely used in data science applications:

    • Jupyter: interactive notebooks

    • Numpy: numerical computing in Python

    • pandas: data structures for Python

    • pingouin: statistics in Python

    • statsmodels: statistics in Python

    • seaborn: data visualization in Python

    • plotly: interactive data visualization in Python

    • Scipy: scientific Python tools

    • Matplotlib: plotting in Python

    • scikit-learn: machine learning in Python

  • even more: Python has very good (often best-in-class) external packages for almost everything

  • Particularly important for data science, which draws on a very broad toolkit

  • Package management is easy (conda, pip)

  • Examples for further important/central python data science packages :

    • Web development: flask, Django

    • Database ORMs: SQLAlchemy, Django ORM (w/ adapters for all major DBs)

    • Scraping/parsing text/markup: beautifulsoup, scrapy

    • Natural language processing (NLP): nltk, gensim, textblob

    • Numerical computation and data analysis: numpy, scipy, pandas, xarray

    • Machine learning: scikit-learn, Tensorflow, keras

    • Image processing: pillow, scikit-image, OpenCV

    • Plotting: matplotlib, seaborn, altair, ggplot, Bokeh

    • GUI development: pyQT, wxPython

    • Testing: py.test

Widely-used#

  • Python is the fastest-growing major programming language

  • Top 3 overall (with JavaScript, Java)

What we will do in this section of the course is a short introduction to Python for data analyses including basic data operations like file reading and wrangling, as well as statistics and data visualization. The goal is to showcase crucial tools/resources and their underlying working principles to allow further more in-depth exploration and direct application.

It is divided into the following chapters:

Here’s what we will focus on in the second block:

Recap - What kind of data do I have and where is it#

The first crucial step is to get a brief idea of the kind of data we have, where it is, etc. to outline the subsequent parts of the workflow (python modules to use, analyses to conduct, etc.). At this point it’s important to note that Python and its modules work tremendously well for basically all kinds of data out there, no matter if behavior, neuroimaging, etc. . To keep things rather simple, we will keep using the behavioral dataset from last session that contains ratings and demographic information from a group of university students (ah, the classics…).

We already accomplished and worked with the dataset quite a bit during the last session, including:

  • reading data

  • extract data of interest

  • convert to different more intelligible structures and forms

At the end, we had two dataframes, both containing the data of all participants but in different formats. Does anyone remember what formats datasets commonly have and how they differ?

True that, we initially had a dataframe in wide-format but then created a reshaped version in long-format. For this session, we will continue to explore aspects of data visualization and analyzes via this dataframe. Thus, let’s move to our data storage and analyses directory and load it accordingly using pandas!

from os import chdir
chdir('C:/Users/ika_m/OneDrive - Johann Wolfgang Goethe Universität/Documents/test_data')
from os import listdir
listdir('.')
['AG_01_02_stroop_2025-01-23_13h12.02.354.csv',
 'BF_31_01_stroop_2025-01-23_12h25.25.690.csv',
 'BF_31_02_stroop_2025-01-23_13h12.43.710.csv',
 'data_experiment',
 'FB_09_02_stroop_2025-01-23_13h12.14.223.csv',
 'FB_09_03_stroop_2025-01-23_13h30.49.889.csv',
 'FK_24_01_stroop_2025-01-23_12h36.23.128.csv',
 'JB_22_01_stroop_2025-01-23_12h29.58.313.csv',
 'JB_22_02_stroop_2025-01-23_13h13.20.521.csv',
 'JB_22_03_stroop_2025-01-23_13h32.52.317.csv',
 'XY_16_03_stroop_2025-01-23_13h31.18.706.csv']
listdir('data_experiment/preprocessing')
['AG_01_02clean.csv', 'cleaned_df_group.csv']
import pandas as pd
df_long = pd.read_csv('data_experiment/preprocessing/cleaned_df_group.csv')

df_long.head()
participant session date expName psychopyVersion OS frameRate resp.keys resp.corr resp.rt ... trials.thisRepN trials.thisTrialN trials.thisN trials.thisIndex trials.ran text letterColor corrAns congruent trial_number
0 AG_01_02 1 2025-01-23_13h12.02.354 stroop 2023.1.3 MacIntel 20.408163 left 1.0 0.635 ... 0.0 0.0 0.0 5.0 1.0 blue red left 0.0 1
1 AG_01_02 1 2025-01-23_13h12.02.354 stroop 2023.1.3 MacIntel 20.408163 down 1.0 0.342 ... 0.0 1.0 1.0 2.0 1.0 green green down 1.0 2
2 AG_01_02 1 2025-01-23_13h12.02.354 stroop 2023.1.3 MacIntel 20.408163 left 0.0 0.641 ... 0.0 2.0 2.0 1.0 1.0 red green down 0.0 3
3 AG_01_02 1 2025-01-23_13h12.02.354 stroop 2023.1.3 MacIntel 20.408163 right 1.0 1.076 ... 0.0 3.0 3.0 3.0 1.0 green blue right 0.0 4
4 AG_01_02 1 2025-01-23_13h12.02.354 stroop 2023.1.3 MacIntel 20.408163 right 1.0 0.680 ... 0.0 4.0 4.0 4.0 1.0 blue blue right 1.0 5

5 rows × 21 columns

Recap - What is the goal of the data analyzes#

There obviously many different routes we could pursue when it comes to analyzing data. Ideally, we would know that before starting (pre-registration much?) but we all know how these things go… For the dataset we aimed at the following, with steps in () indicating operations we already conducted:

  • (read in single participant data)

  • (explore single participant data)

  • (extract needed data from single participant data)

  • (convert extracted data to more intelligible form)

    • (repeat for all participant data)

    • (combine all participant data in one file)

  • (explore data from all participants)

    • (general overview)

    • basic plots

  • analyze data from all participant

    • descriptive stats

    • inferential stats

Nice, that’s a lot. The next step on our list would be data explorations by means of data visualization which will also lead to data analyzes.

Recap - how will the respective steps be implemented#

After creating some sort of outline/workflow, we though about the respective steps in more detail and set overarching principles. Regarding the former, we also gathered a list of potentially useful python modules to use. Given the pointers above, this entailed the following:

Regarding the second, we went back to standards and principles concerning computational work:

  • use a dedicated computing environment

  • provide all steps and analyzes in a reproducible form

  • nothing will be done manually, everything will be coded

  • provide as much documentation as possible

Important: these aspects should be followed no matter what you’re working on!

So, after “getting ready” and conducted the first set of processing steps, it’s time to continue via basic data visualization.

Basic data visualization#

Given that we already explored our data a bit more, including the basic descriptive statistics and data types, we will go one step further and continue this process via basic data visualization to get a different kind of overview that can potentially indicate important aspects concerning data analyses. As mentioned above, we will do so via the following steps, addressing different aspects of data visualization. Throughout each, we will get to know respective python modules and functions.

Underlying principles#

When talking about visualization one might want to differentiate data exploration and analyses but one can actually drastically influence the other. Here, we are going to check both, that is facilitating data understanding in many ways and creating high quality results figures.

Unsurprisingly, python is nothing but fantastic when it comes to data visualization:

  • python provides a wide array of options

  • Low-level and high-level plotting APIs

  • static images vs. HTML output vs. interactive plots

  • domain-general and domain-specific packages

  • optimal visualization environment as it’s both efficient and flexible

    • produce off-the-shelf high-quality plots very quickly

    • with more effort, gives you full control over the plot

While python has a large amount of amazing modules targetting data visualization, we are going to utilize the three most common and general ones, as they provide the basis for everything else going further:

The first two produce static images and the last one HTML outputs and allow much more interactive plots. We will talk about each one as we go along.

matplotlib#

  • the most widely-used python plotting library

  • initially modeled on MATLAB’s plotting system

  • designed to provide complete control over a plot

matplotlib and all other high-level APIs that build upon it operate on underlying principles and respective parts:

In the most basic sense matplotlib graphs your data on Figures (e.g., windows, Jupyter widgets, etc.), each of which can contain one or more Axes, an area where points can be specified in terms of x-y coordinates (or theta-r in a polar plot, x-y-z in a 3D plot, etc.).

  • figures

    • the entire graphic

    • keep track of everything therein (axes, titles, legends, etc.)

  • axes

    • usually contains two or three axis objects

    • includes title, x-label, y-label

  • axis

    • ticks and tick labels to provide scale for data

  • artist

    • everything visible on the figure: text, lines, patches, etc.

    • drawn to the canvas

A bit too “theoretical”, eh? Let’s dive in and create some plots!

But before we start, two important points to remember: when plotting in jupyter notebooks, make sure to run the %matplotlib inline magic before your first graphic which results in the graphics being embedded in the jupyter notebook and not in the digital void. (NB: this is true for most but not all plotting modules/functions.)

%matplotlib inline

When using matplotlib you can choose between explicitly creating Figures and axes or use the plt interface to automatically create and manage them, as well as adding graphics. Quite often you might want to use the latter.

import matplotlib.pyplot as plt

standard plots#

Obviously, matplotlib comes with support for all the “standard plots” out there: barplots, scatterplots, histograms, boxplots, errorbars, etc. . For a great overview on what’s possible, make sure to check the gallery of the matplotlib documentation. For now, we are going to start simply…how about some univariate data visualization, e.g. a scatterplot?

import numpy as np
# We don't have age in original dataset, so we will generate a random age for each unique participant
np.random.seed(42)  # For reproducibility
participant_ages = (
    df_long[['participant']]
    .drop_duplicates()
    .assign(age=np.random.randint(18, 66, size=df_long['participant'].nunique()))
)

# Map ages back to df_long
df_long = df_long.merge(participant_ages, on='participant', how='left')

For example, we are interested in the distribution of age in our dataset. Using matplotlib, we need to create a figure and draw something inside. As our data is in long-format we have to initially extract a list containing the age of each participant only once, using drop_duplicates method

plt.figure(figsize=(10, 5))
plt.hist(df_long.drop_duplicates(subset='participant')['age'])
plt.show()
../_images/9ee05e537927cf24c1a2d28b125df8f659a5c0d19f8d5fb1ce7528eeb8cad276.png

While the information we wanted is there, the plot itself looks kinda cold and misses a few pieces to make it intelligible, e.g. axes labels and a title. This can easily be added via matplotlib’s plt interface.

plt.figure(figsize=(10, 5))
plt.hist(df_long.drop_duplicates(subset='participant')['age'])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);
plt.show()
../_images/583ca9fe16b9940862d1dbf7fd0b3bb803d9cca45f7cd7ea1a90c49900ba1d48.png

We could also add a grid to make it easier to situate the given values:

plt.figure(figsize=(10, 5))
plt.hist(df_long.drop_duplicates(subset='participant')['age'])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15)
plt.grid(True)
plt.show()
../_images/f5db36dc43a9bb80dc10f33dd57116a7f10d45d48869abf6c4a4e5e2ebb6d7a0.png

Nice, now we could also have a look at how meaningful values could interact using scatter plot. We want to check how the frame rate of the participant could affect the mean value of the reaction time of each participant. But for that, we need to learn groupby

📊 Introduction to groupby() in Pandas#

The .groupby() method in Pandas is used for grouping data based on a column or multiple columns and then applying aggregations (e.g., mean, sum, count).

Why Use groupby()?#

  • To summarize data by categories (e.g., average reaction time per participant).

  • To explore patterns and trends in grouped data.

  • To perform multi-level aggregations efficiently.

Common Aggregations with groupby():#

  • .mean() – Calculate the average value

  • .sum() – Get the total value

  • .count() – Count the number of rows

  • .median() – Find the median value

  • .std() – Calculate the standard deviation

GroupBy Workflow:#

  1. Split: Group the data by unique values in a column.

  2. Apply: Perform an aggregation or transformation on each group.

  3. Combine: Return the results in a single DataFrame or Series.


🐼 Examples of groupby() in Pandas:#

# 1️⃣ Mean reaction time by participant:
mean_rt = df_long.groupby('participant')['resp.rt'].mean()

# 2️⃣ Multiple aggregations for reaction time:
mean_std_rt = df_long.groupby('participant').agg({'resp.rt': ['mean', 'std']})

print(mean_rt)
print(mean_std_rt)
participant
AG_01_02    0.814500
BF_31_01    0.691667
BF_31_02    0.714167
FB_09_02    0.584733
FB_09_03    0.594533
FK_24_01    1.603800
JB_22_01    0.541033
JB_22_02    0.569667
JB_22_03    0.545400
XY_16_03    0.671200
Name: resp.rt, dtype: float64
              resp.rt          
                 mean       std
participant                    
AG_01_02     0.814500  0.227043
BF_31_01     0.691667  0.231659
BF_31_02     0.714167  0.202041
FB_09_02     0.584733  0.212847
FB_09_03     0.594533  0.198665
FK_24_01     1.603800  1.505703
JB_22_01     0.541033  0.140593
JB_22_02     0.569667  0.166234
JB_22_03     0.545400  0.168221
XY_16_03     0.671200  0.166866
# 3️⃣ Try to group by multiple columns (participant and congruent):
frameRate_list = df_long.drop_duplicates(subset='participant')['frameRate']
rt_list = df_long.groupby('participant')['resp.rt'].mean()

plt.figure(figsize=(10, 5))
plt.scatter(frameRate_list, rt_list)
plt.xlabel('Frame Rate', fontsize=12)
plt.ylabel('Reaction time', fontsize=12)
plt.title('Comparing frame rate and reaction time', fontsize=15)
plt.show()
../_images/a004b107c9d1b61a3b92c93db8ad72803f710f10a62288115781bc5959b20a1e.png

Sometimes, we might want to have different subplots within one main plot. Using matplotlib’s subplots function makes this straightforward via two options: creating a subplot and adding the respective graphics or creating multiple subplots and adding the respective graphics via the axes. Let’s check the first option:

plt.subplot(1, 2, 1)
plt.hist(frameRate_list)
plt.xlabel('Frame Rate', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of frame rate', fontsize=15)
plt.grid(True)

plt.subplots_adjust(right=4.85)

plt.subplot(1, 2, 2)
plt.scatter(frameRate_list, rt_list)
plt.xlabel('Frame Rate', fontsize=12)
plt.ylabel('Reaction time', fontsize=12)
plt.title('Comparing frame rate and reaction time', fontsize=15)
plt.show()
../_images/8e5f439b9316a9535695da380008416e163075aa34b1e8f97b7136955cf5dd49.png

Hm, kinda ok but we would need to adapt the size and spacing. This is actually easier using the second option, subplots(), which is also recommended by the matplotlib community:

fig, axs = plt.subplots(1, 2, figsize=(20, 5))

axs[0].hist(frameRate_list)
axs[0].set_xlabel('Frame Rate', fontsize=12)
axs[0].set_ylabel('Count', fontsize=12)
axs[0].set_title('Distribution of Frame Rate', fontsize=15)
axs[0].grid(True)

axs[1].scatter(frameRate_list, rt_list)
axs[1].set_xlabel('Frame Rate', fontsize=12)
axs[1].set_ylabel('Reaction time', fontsize=12)
axs[1].set_title('Comparing Frame Rate and Reaction time', fontsize=15)
plt.show()
../_images/dc2fa0048d771fcfa05d8dca6283317022bd60127a9b873234e18f5ae25c1085.png

As matplotlib provides access to all parts of a figure, we could furthermore adapt various aspects, e.g. the color and size of the drawn markers.

fig, axs = plt.subplots(1, 2, figsize=(20, 5))

axs[0].hist(frameRate_list)
axs[0].set_xlabel('Frame Rate', fontsize=12)
axs[0].set_ylabel('Count', fontsize=12)
axs[0].set_title('Distribution of Frame Rate', fontsize=15)
axs[0].grid(True)

axs[1].scatter(frameRate_list, rt_list, c='black', s=80)
axs[1].set_xlabel('Frame Rate', fontsize=12)
axs[1].set_ylabel('Reaction time', fontsize=12)
axs[1].set_title('Comparing Frame Rate and Reaction time', fontsize=15)
axs[1].grid(True)
plt.show()
../_images/d82fb1bbd5aaaeabde0089a8b9cf1a61f397a795aeb048c99dcf5002bdd9df33.png
#Now its your turn! Try to plot hist plot for reaction time
#Now its your turn! Try to plot scatter plot for reaction time vs congruency
#Now its your turn! Try to boxplot reaction time on y-axis and condition on x-axis
# Group data by 'congruent' and create a list of 'resp.rt' values
# Collect reaction times based on congruency
congruent_0 = df_long[df_long['congruent'] == 0]['resp.rt'].values
congruent_1 = df_long[df_long['congruent'] == 1]['resp.rt'].values

# Combine into a list for boxplot
congruent_lists = [congruent_0, congruent_1]
# Create the boxplot
plt.figure(figsize=(8, 5))

plt.show()

This provides just a glimpse but matplotlib is infinitely customizable, thus as in most modern plotting environments, you can do virtually anything. The problem is: you just have to be willing to spend enough time on it. Lucky for us and everyone else there are many modules/libraries that provide a high-level interface to matplotlib. However, before we check one of them out we should quickly summarize pros and cons of matplotlib.

Pros#

  • provides low-level control over virtually every element of a plot

  • completely object-oriented API; plot components can be easily modified

  • close integration with numpy

  • extremely active community

  • tons of functionality (figure compositing, layering, annotation, coordinate transformations, color mapping, etc.)

Cons#

  • steep learning curve

  • API is extremely unpredictable–redundancy and inconsistency are common

  • some simple things are hard; some complex things are easy

  • lacks systematicity/organizing syntax–every plot is its own little world

  • simple plots often require a lot of code

  • default styles are not optimal

High-level interfaces to matplotlib#

  • matplotlib is very powerful and very robust, but the API is hit-and-miss

  • many high-level interfaces to matplotlib have been written

  • abstract away many of the annoying details

  • best of both worlds: easy generation of plots, but retain matplotlib’s power

  • many domain-specific visualization tools are built on matplotlib (e.g., nilearn and mne in neuroimaging)

Going further with advanced plots#

This also marks the transition to more “advanced plots” as the respective libraries allow you to create fantastic and complex plots with ease!

Seaborn#

Seaborn abstracts away many of the complexities to deal with such minutiae and provides a high-level API for creating aesthetic plots.

  • arguably the premier matplotlib interface for high-level plots

  • generates beautiful plots in very little code

  • beautiful styles and color palettes

  • wide range of supported plots

  • modest support for structured plotting (via grids)

  • exceptional documentation

  • generally, the best place to start when exploring and visualizing data

  • (can be quite slow (e.g., with permutation))

For example, recreating the plots from above is as easy as:

import seaborn as sns

sns.histplot(df_long.drop_duplicates(subset='participant')['age'])
plt.xlabel('Age')
plt.title('Distribution of age')
plt.show()
../_images/89119c80b25da9796dcae613ad8444420b6009e29b37bf0a09ecf99fa3845004.png
sns.scatterplot(x=frameRate_list.to_numpy(), y=rt_list.to_numpy())
plt.xlabel('Frame rate')
plt.title('Comparing frame rate and reaction time')
plt.show()
../_images/b52ad8b8b2ac452f26b204ac82819a8e27cafd5d0a2058d70ae1cf0e1748fb7f.png

You might wonder: “well, that doesn’t look so different from the plots we created before and it’s also not way faster/easier”.

True that, but so far this based on our data and the things we wanted to plot. Seaborn actually integrates fantastically with pandas dataframes and allows to achieve amazing things rather easily.

Let’s go through some examples!

sns.pairplot(df_long[['resp.rt', 'resp.corr','congruent']], hue='congruent')
plt.show()
../_images/6f86cb3b5ef03430c20b18f2a8a4d0feb463d1d25faf0339cac5ad6103b36776.png
sns.pairplot(df_long[['resp.rt', 'resp.corr','text']], hue='text')
plt.show()
../_images/5a625a2d149cb1d07ecd1575b62048683dd899be8606bd22273a09fe1ecace53.png

And can now make use of the fantastic pandas - seaborn friendship. For example, let’s go back to the scatterplot of frameRate and resp.rt we did before. How could we improve this plot? Maybe adding the distribution of each variable to it? That’s easily done via jointplot():

sns.jointplot(x='frameRate', y='resp.rt', data=df_long)
plt.show()
../_images/423f81643cef0b6c42b8b48a91645437329edb4afbd062be8b2d82bd3e379ee9.png

Wouldn’t it be cool if we could also briefly explore if there might be some statistically relevant effects going on here? Say no more, as we can add a regression to the plot via setting the kind argument to reg:

sns.jointplot(x='frameRate', y='resp.rt', data=df_long, kind='reg')
plt.show()
../_images/130609fae45561524d2bf37dbb4f39f9bf941bff9aefa7cf8f50759b6d9455e6.png

Chances are we see another manifestation of early onset grumpiness here and thus might want to spend a closer look on the ratings for each movie. One possibility to do so, could be a boxplot():

plt.figure(figsize=(3,4))
sns.boxplot(x='congruent', y='resp.rt', data=df_long, hue='congruent', palette="vlag")
plt.show()
../_images/ec644fc47dc25317024200bd47800339d79b1122a6756c83e2e96fb71295ebb5.png
#now its your turn! make boxplot for each text vs resp.rt

However, we know that boxplots have their fair share of problems…given that they show summary statistics, clusters and multimodalities are hidden..

That’s actually one important aspect everyone should remember concerning data visualization, no matter if for exploration or analyses: show as much data and information as possible! With seaborn we can easily address this via adding individual data points to our plot via stripplot():

plt.figure(figsize=(6,8))
sns.boxplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.stripplot(x='text', y='resp.rt', data=df_long, color='black')
plt.show()
../_images/72e0e85a1e1450bf672495af1d808225450e63ae93ef2f3fd616d63c1cf463f0.png

Ah yes, that’s better! Seeing the individual data points, we might want to check the respective distributions. Using seaborn’s violinplot(), this is done in no time.

plt.figure(figsize=(5,10))
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.despine()
plt.show()
../_images/d70ec11f98538c80a1f0e18bf73ea4d8493016b94997d2e0d496185a7f9c75da.png

As you might have seen, we also adapted the style of our plot a bit via sns.despine() which removed the y axis spine. This actually outlines another important point: seaborn is fantastic when it comes to customizing plots with little effort (i.e. getting rid of many lines of matplotlib code). This includes “themes, contexts, colormaps among other things.

Starting with themes, which set a variety of aesthetic related factors, including background color, grids, etc., here are some very different examples showcasing the whitegrid style

sns.set_style("whitegrid")
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.despine(left=True)
plt.show()
../_images/a302348499ef0f3aa236d1ecdd476fca25b9e0d005fa1f673f27ca3e5813d508.png

and the dark style:

sns.set_style("dark")
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.despine(left=True)
plt.show()
../_images/3bf28c2322a76129bb822c5f6a811627de03ecfc5f09f05fe23b8e8772b12f16.png

While this is already super cool, seaborn goes one step further and even let’s you define the context for which your figure is intended and adapts it accordingly. For example, there’s a big difference if you want to include your figure in a poster:

sns.set_style('whitegrid')
sns.set_context("poster")
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.despine(left=True)
plt.show()
../_images/fa6df5cc525bcf788f4b6d77517ebf4ba1bbc073f914d7df6ec4471899e5da6a.png

or a talk:

sns.set_style("whitegrid")
sns.set_context("talk")
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.despine(left=True)
plt.show()
../_images/cb57d9526816972ad77e2a180c96687d03c72589fccd5420da56162c348d4282.png

or a paper:

sns.set_context("paper")
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text', palette="vlag")
sns.despine(left=True)
plt.show()
../_images/5d501a6f390011dd8dc6c61069c16fef4bc8077a644ae741a0b92f8dc2f88674.png

No matter the figure and context, another very crucial aspect everyone should always look out for is the colormap or colorpalette! Some of the most common ones are actually suboptimal in multiple regards. This entails a misrepresentation of data:

It gets even worse: they don’t work for people with color vision deficiencies!

That’s obviously not ok and we/you need to address this! With seaborn, some of these important aspects are easily addressed. For example, via setting the colorpalette to colorblind:

sns.set_context("notebook")
sns.set_palette('colorblind')
sns.violinplot(x='text', y='resp.rt', data=df_long, hue='text')
sns.despine(left=True)
plt.show()
../_images/bd1738af6ad2f86c05dde6ef08f078710a5c87f106f6a48dd32a5167006f525f.png

or using one of the suitable color palettes that also address the data representation problem, i.e. perceptually uniform color palettes:

sns.color_palette("flare", as_cmap=True)
flare
flare colormap
under
bad
over
sns.color_palette("crest", as_cmap=True)
crest
crest colormap
under
bad
over
sns.cubehelix_palette(as_cmap=True)
seaborn_cubehelix
seaborn_cubehelix colormap
under
bad
over
sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True)
seaborn_cubehelix
seaborn_cubehelix colormap
under
bad
over

Let’s see a few of those in action, for example within a heatmap that displays the correlation between ratings of movies. For this, we need to reshape our data back to wide-format which is straightforward using pandaspivot function.

# Pivot df_long to wide format
df_wide = df_long.pivot_table(
    index='participant', 
    columns=['congruent', 'text', 'letterColor'], 
    values='resp.rt',
    aggfunc='mean'  # Use mean if multiple values exist per condition
)
df_wide
congruent 0.0 1.0
text blue green red blue green red
letterColor red blue green blue green red
participant
AG_01_02 1.0116 1.0154 0.8174 0.7612 0.5682 0.7132
BF_31_01 0.7160 0.7048 0.6294 0.7492 0.6210 0.7296
BF_31_02 0.7150 0.8726 0.7404 0.6372 0.6338 0.6860
FB_09_02 0.5958 0.4744 0.5816 0.6554 0.5906 0.6106
FB_09_03 0.5774 0.6770 0.5604 0.6946 0.4944 0.5634
FK_24_01 1.7914 1.0746 2.6864 1.3936 1.3048 1.3720
JB_22_01 0.4814 0.6286 0.6164 0.4746 0.5226 0.5226
JB_22_02 0.5252 0.6798 0.5054 0.5222 0.5902 0.5952
JB_22_03 0.5910 0.6402 0.5164 0.5020 0.4728 0.5500
XY_16_03 0.6842 0.7708 0.5328 0.6834 0.8144 0.5416
# Reset column names for clarity
df_wide.columns = [f'rt_congruent{c}_text{t}_color{col}' for c, t, col in df_wide.columns]
df_wide
rt_congruent0.0_textblue_colorred rt_congruent0.0_textgreen_colorblue rt_congruent0.0_textred_colorgreen rt_congruent1.0_textblue_colorblue rt_congruent1.0_textgreen_colorgreen rt_congruent1.0_textred_colorred
participant
AG_01_02 1.0116 1.0154 0.8174 0.7612 0.5682 0.7132
BF_31_01 0.7160 0.7048 0.6294 0.7492 0.6210 0.7296
BF_31_02 0.7150 0.8726 0.7404 0.6372 0.6338 0.6860
FB_09_02 0.5958 0.4744 0.5816 0.6554 0.5906 0.6106
FB_09_03 0.5774 0.6770 0.5604 0.6946 0.4944 0.5634
FK_24_01 1.7914 1.0746 2.6864 1.3936 1.3048 1.3720
JB_22_01 0.4814 0.6286 0.6164 0.4746 0.5226 0.5226
JB_22_02 0.5252 0.6798 0.5054 0.5222 0.5902 0.5952
JB_22_03 0.5910 0.6402 0.5164 0.5020 0.4728 0.5500
XY_16_03 0.6842 0.7708 0.5328 0.6834 0.8144 0.5416
#wow thats cool! can you try to do similar to resp.corr? but use date as index, and aggregate by congruent and corrAns  

Then we can use another built-in function of pandas dataframes: .corr(), which computes a correlation between all columns:

plt.figure(figsize=(10,7))
sns.heatmap(df_wide.corr(), xticklabels=True, cmap='rocket')
plt.show()
<Figure size 1000x700 with 0 Axes>
../_images/71afe05dd62a8a69d9034bb707fe9f78151c6277eb758ee7d00324dcfaabb9ff.png
df_wide.corr()
rt_congruent0.0_textblue_colorred rt_congruent0.0_textgreen_colorblue rt_congruent0.0_textred_colorgreen rt_congruent1.0_textblue_colorblue rt_congruent1.0_textgreen_colorgreen rt_congruent1.0_textred_colorred
rt_congruent0.0_textblue_colorred 1.000000 0.812087 0.959666 0.961608 0.887717 0.965283
rt_congruent0.0_textgreen_colorblue 0.812087 1.000000 0.692348 0.703376 0.638695 0.708970
rt_congruent0.0_textred_colorgreen 0.959666 0.692348 1.000000 0.940903 0.911372 0.976929
rt_congruent1.0_textblue_colorblue 0.961608 0.703376 0.940903 1.000000 0.907014 0.957430
rt_congruent1.0_textgreen_colorgreen 0.887717 0.638695 0.911372 0.907014 1.000000 0.893670
rt_congruent1.0_textred_colorred 0.965283 0.708970 0.976929 0.957430 0.893670 1.000000

Nice, how does the crest palette look?

plt.figure(figsize=(10,7))
sns.heatmap(df_wide.corr(), xticklabels=False, cmap='crest')
plt.show()
../_images/ef139834d2d868d1f7f62b98737ca2c018923b8923efe9ea448f2fbae44e2740.png

While we’re at it, we also change heatmap to clustermap!

plt.figure(figsize=(10,7))
sns.clustermap(df_wide.corr(), xticklabels=True, cmap='crest', center=0)
plt.show()
<Figure size 1000x700 with 0 Axes>
../_images/4400eff230fd603a12d0992707684ad4a6d1fbe6d6fa2058168b9e4bf71289a1.png

However, to be on the safe side, please also check your graphics for the mentioned points, e.g. via tools like Color Oracle that let you simulate color vision deficiencies!

Make use of amazing resources like the python graph gallery, the data to viz project and the colormap decision tree in Crameri et al. 2020, that in combination allow you to find and use the best graphic and colormap for your data!

And NEVER USE JET!

There appear to be some interesting effects we should investigate further which leads us to the grand finale: statistical analyses using python!

But before it one final test. We need to finally make meaningful plots!

  • First, create a new column which will have the words ‘congruent’ and ‘incongruent’ in corresponding rows

  • Use the plot of your choice to find whether there are any differences in RT and correct answers between congruent and incongruent trials

#work here

Now try to find if there is any relation between RT and trial number! Find a suitable plot for it

#work here

Statistics in python#

We’ve reached the final step of our analyses workflow: statistics, i.e. putting things to test via evaluating hypotheses and/or assumptions! As with the previous aspects, python has a lot of libraries and modules for this purpose, some general-domain and some rather tailored to specific data and analyzes (e.g. fMRI, EEG, drift diffusion models, etc.).

Here, we will stick with the general-domain ones as they are already support a huge amount of approaches and are commonly used by the community: pingouin and statsmodels. As before, we will go through several steps and explore respective python modules and functions.

Descriptive analyses#

Regarding the first aspect, descriptive analyses, we can safely state that we already know how to do this. For example using pandas or numpy and their respective functions:

print(df_long['resp.rt'].mean(), df_long['resp.rt'].std()) 
0.73307 0.5862532456143087
import numpy as np
print(np.mean(df_long['resp.rt']), np.std(df_long['resp.rt']))
0.73307 0.5852753412710977
for idx, df in df_long[['resp.rt', 'resp.corr','congruent']].groupby('congruent'):
    print('Summary statistics for: %s' %idx)
    print(df.describe())
Summary statistics for: 0.0
          resp.rt   resp.corr  congruent
count  150.000000  150.000000      150.0
mean     0.780460    0.893333        0.0
std      0.735391    0.309723        0.0
min      0.320000    0.000000        0.0
25%      0.514250    1.000000        0.0
50%      0.629000    1.000000        0.0
75%      0.838250    1.000000        0.0
max      8.553000    1.000000        0.0
Summary statistics for: 1.0
          resp.rt   resp.corr  congruent
count  150.000000  150.000000      150.0
mean     0.685680    0.920000        1.0
std      0.379962    0.272202        0.0
min      0.299000    0.000000        1.0
25%      0.509000    1.000000        1.0
50%      0.598500    1.000000        1.0
75%      0.750000    1.000000        1.0
max      3.564000    1.000000        1.0

Thus, we can skip these aspects here and directly move on! Regarding inferential analyses, we will start with and focus on pingouin!

Unfortunately not those…

but make sure to check https://www.penguinsinternational.org/ to see what you can do to help our amazing friends: the Penguins!

Inferential analyses#

Of course, exploring and visualizing data is informative and can hint at certain effects but we nevertheless have to run tests and obtain respective statistics to evaluate if there’s “something meaningful”. This is where inferential analyses come into play!

Pingouin#

Even though comparably new, Pingouin quickly became a staple regarding statistical analyses as it integrates amazingly well with the existing python data science stack, e.g. pandas and numpy. As mentioned in the docs: “Pingouin is designed for users who want simple yet exhaustive stats functions.”, which summarizes it perfectly as it supports the majority of commonly used statistical functions and additionally provides adjacent functionality, e.g. related to test diagnostics and evaluation:

  • ANOVAs: N-ways, repeated measures, mixed, ancova

  • pairwise post-hocs tests (parametric and non-parametric) and pairwise correlations

  • robust, partial, distance and repeated measures correlations

  • linear/logistic regression and mediation analysis

  • Bayes Factors

  • multivariate tests

  • reliability and consistency

  • effect sizes and power analysis

  • parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient

  • circular statistics

  • chi-squared tests

  • plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation

When in doubt, make sure to check if pingouin has your back!

One of the first things we explored via the previous analyses steps was if the ratings vary as a function of age. While the plots suggested at some potential effects, we can simply compute their correlation using pingouin’s corr function:

import pingouin as pg
pg.corr(frameRate_list, rt_list)
n r CI95% p-val BF10 power
pearson 10 -0.41001 [-0.83, 0.3] 0.239279 0.717 0.225217

Not only do we get the correlation value and p value but also the number of observations, the 95% CI, Bayes Factor and the power. Pretty convenient, eh?

Using heatmap and clustermaps we also evaluate the correlation pattern of different conditions. In order to put some more numbers on that, we can make use of the mentioned pingouin - pandas friendship and compute multiple pairwise correlations based on the columns of a dataframe:

df_cor = pg.pairwise_corr(df_wide, method='pearson')
df_cor.head()
X Y method alternative n r CI95% p-unc BF10 power
0 rt_congruent0.0_textblue_colorred rt_congruent0.0_textgreen_colorblue pearson two-sided 10 0.812087 [0.37, 0.95] 0.004319 13.317 0.874392
1 rt_congruent0.0_textblue_colorred rt_congruent0.0_textred_colorgreen pearson two-sided 10 0.959666 [0.83, 0.99] 0.000011 1284.587 0.999531
2 rt_congruent0.0_textblue_colorred rt_congruent1.0_textblue_colorblue pearson two-sided 10 0.961608 [0.84, 0.99] 0.000009 1490.035 0.999631
3 rt_congruent0.0_textblue_colorred rt_congruent1.0_textgreen_colorgreen pearson two-sided 10 0.887717 [0.59, 0.97] 0.000606 60.022 0.970876
4 rt_congruent0.0_textblue_colorred rt_congruent1.0_textred_colorred pearson two-sided 10 0.965283 [0.86, 0.99] 0.000006 2016.543 0.999777

We get a dataframe in return, which we can then use for further analyses. For example, as we did run quite a few tests, we might want to correct our p values for multiple comparisons. Obviously, pingouin also makes this super easy via it’s multicomp function which returns an list with True/False booleans and corrected p values:

df_cor['p_corr'] = pg.multicomp(df_cor['p-unc'].to_numpy(), method='bonf')[1]
df_cor.head()
X Y method alternative n r CI95% p-unc BF10 power p_corr
0 rt_congruent0.0_textblue_colorred rt_congruent0.0_textgreen_colorblue pearson two-sided 10 0.812087 [0.37, 0.95] 0.004319 13.317 0.874392 0.064781
1 rt_congruent0.0_textblue_colorred rt_congruent0.0_textred_colorgreen pearson two-sided 10 0.959666 [0.83, 0.99] 0.000011 1284.587 0.999531 0.000165
2 rt_congruent0.0_textblue_colorred rt_congruent1.0_textblue_colorblue pearson two-sided 10 0.961608 [0.84, 0.99] 0.000009 1490.035 0.999631 0.000136
3 rt_congruent0.0_textblue_colorred rt_congruent1.0_textgreen_colorgreen pearson two-sided 10 0.887717 [0.59, 0.97] 0.000606 60.022 0.970876 0.009090
4 rt_congruent0.0_textblue_colorred rt_congruent1.0_textred_colorred pearson two-sided 10 0.965283 [0.86, 0.99] 0.000006 2016.543 0.999777 0.000091

Now please compute something more “meaningful” like correlation between frame rate, reaction time and trial number

  • adjust p value using holm method

  • add column which shows True if p value is significant and False if not

# work here

We can reformulate our analyses to an regression problem and thus run a multiple regression via pingouin’s linear_regression function:

movie_reg = pg.linear_regression(df_long['resp.rt'], df_long['trial_number'])
movie_reg
names coef se T pval r2 adj_r2 CI[2.5%] CI[97.5%]
0 Intercept 16.499060 0.800194 20.618833 2.589562e-59 0.008492 0.005165 14.924313 18.073806
1 resp.rt -1.362844 0.853039 -1.597634 1.111843e-01 0.008492 0.005165 -3.041587 0.315900

Which would again provide a very exhaustive and intelligible output that can easily be utilized further.

One of the outputs of the regression reminds us of “a very low hanging fruit” we haven’t checked yet: T-Tests! Lets finally test the usual Stroop hypothesis!

That’s right: testing statistical premises/test assumptions! In other words: we need to evaluate if the assumptions/requirements for a parametric test, like the T-Test are fulfilled or if we need to apply a non-parametric test.

At first, we are going to test the distribution of our data. Often folks assume that their data follows a gaussian distribution, which allows for parametric tests to be run. Nevertheless it is essential to first test the distribution of your data to decide if the assumption of normally distributed data holds, if this is not the case we would have to switch to non-parametric tests.

With pingouin this is again only one brief function away. Here specifically, the normality() function that implements the Shapiro Wilk normality test:

pg.normality(df_long['resp.rt'])
W pval normal
resp.rt 0.418217 5.602158e-30 False
#now check the same for correct reports!

Not, so good, but for learning purposes lets continue with T-test.

By now you might know that pingouin obviously has a dedicated function for this and of course you’re right: ttest:

pg.ttest(df_long[df_long['congruent']==1]['resp.rt'], df_long[df_long['congruent']==0]['resp.rt'], paired=True)
T dof alternative p-val CI95% cohen-d BF10 power
T-test -1.629319 149 two-sided 0.105358 [-0.21, 0.02] 0.161932 0.331 0.50423

Since we are working with paired t-test, there is a dedicated function which makes everything easier!

pg.pairwise_tests(
    data=df_long,
    dv='resp.rt',
    between=None,
    within='congruent',
    subject='participant'
)
Contrast A B Paired Parametric T dof alternative p-unc BF10 hedges
0 congruent 0.0 1.0 True True 1.774183 9.0 two-sided 0.10978 0.987 0.275808

But we have different results? Why? Which one is correct?

We can actually make another version correct as well

# Align data by participant (only participants with both conditions)
df_paired = df_long.pivot_table(
    index='participant',
    columns='congruent',
    values='resp.rt'
).dropna()

# Perform the paired t-test
pg.ttest(
    x=df_paired[1], 
    y=df_paired[0], 
    paired=True
)
T dof alternative p-val CI95% cohen-d BF10 power
T-test -1.774183 9 two-sided 0.10978 [-0.22, 0.03] 0.287976 0.987 0.129216

If we however wanted to run a non-parametric test, here a wilcoxon test, we can do so without problems:

pg.wilcoxon(x=df_paired[1], y=df_paired[0])
W-val alternative p-val RBC CLES
Wilcoxon 10.0 two-sided 0.083984 -0.636364 0.43
# now check whats going on with correct reports!

While we could now go through each comparison of interest, we might want to approach the analyses from a different angle. For example, starting with an anova comparing ratings between text:

pg.rm_anova(data=df_long, dv='resp.rt', within='text', subject='participant', detailed=True)
Source SS DF MS F p-unc p-GG-corr ng2 eps sphericity W-spher p-spher
0 text 0.010958 2 0.005479 0.241367 0.788055 0.664249 0.003477 0.569127 False 0.242924 0.003482
1 Error 0.408595 18 0.022700 NaN NaN NaN NaN NaN NaN NaN NaN

Usually the next step would be to run post-hoc tests contrasting each pair of categories. Once more pingouin comes to the rescue and implements this in just one function within which we can additionally define if we want to run a parametric test, if p values should be adjusted and what kind of effect size we would like to have:

pg.pairwise_tests(data=df_long, dv='resp.rt', within='text', subject='participant',
                             parametric=True, padjust='bonf', effsize='cohen')
Contrast A B Paired Parametric T dof alternative p-unc p-corr p-adjust BF10 cohen
0 text blue green True True 0.621562 9.0 two-sided 0.549648 1.0 bonf 0.364 0.114903
1 text blue red True True -0.295968 9.0 two-sided 0.773970 1.0 bonf 0.321 -0.039105
2 text green red True True -0.499229 9.0 two-sided 0.629593 1.0 bonf 0.343 -0.131297

Isn’t pingouin just fantastic? Especially considering that the many things we explored only cover a small percentage of available functionality as we didn’t even look at e.g. mediation analyses, contingency analyses, multivariate tests, reliability, etc. . However, you should have got an idea how powerful and versatile pingouin is for the majority of analyses you commonly run.

NB 1: A small note before we continue: a lot of these “standard analyses” and more are also implemented in the scipy.stats module which is might be interesting if there’s something you want to run that is implemented pingouin.

NB 2: we focused on pingouin because of it’s large amount of functions, great integration with pandas and high-level API.

Even though it’s a virtual class I can literally hear folks yell: “What about formulas? I need to define formulas!”.

Don’t you worry a thing: while this is (yet) not possible in pingouin, there’s of course a different python module that can do these things. Open your hearts for statsmodels!

Statsmodels#

statsmodels (sorry no animal pun, don’t know why) is yet another great python module for statistical analyses. In contrast to pingouin it focuses on rather complex and more specialized analyses:

  • regression

  • discrete choice models

  • Generalized Estimating Equations

  • meta-analyses

  • time-series, forecasting

  • state space models

  • etc.

In order to provide a brief glimpse of how statsmodels can be used to define and apply statistical models via formulas, we will check two examples and while doing so ignoring the reasonableness. At first we will import the formula.api:

import statsmodels.formula.api as smf

Which can then be used to define and run statistical models. Independent of the specific model the outline is roughly identical: a model is defined via a string e.g. referencing the columns/variables of a dataframe and then fitted. Subsequently, all aspects of the model can be accessed. For example, a simple linear regression via an ordinary least squares would look like the following:

df_long['rt']=df_long['resp.rt']
md = smf.ols("rt ~ trial_number", df_long).fit()

Using the .summary() function we can then get a holistic overview of the model and the outcomes:

print(md.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     rt   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     2.552
Date:                Thu, 13 Feb 2025   Prob (F-statistic):              0.111
Time:                        14:57:12   Log-Likelihood:                -263.70
No. Observations:                 300   AIC:                             531.4
Df Residuals:                     298   BIC:                             538.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.8297      0.069     11.982      0.000       0.693       0.966
trial_number    -0.0062      0.004     -1.598      0.111      -0.014       0.001
==============================================================================
Omnibus:                      488.908   Durbin-Watson:                   1.486
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           136970.679
Skew:                           8.591   Prob(JB):                         0.00
Kurtosis:                     106.259   Cond. No.                         36.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

All displayed information is furthermore directly accessible:

md.bic
538.8083391527322
md.fvalue
2.552433103431772
md.resid
0     -0.188426
1     -0.475194
2     -0.169963
3      0.271268
4     -0.118500
         ...   
295    0.047360
296   -0.173408
297   -0.141177
298   -0.100946
299    0.057286
Length: 300, dtype: float64
#try to extract p value!

If we also want to define and run a linear mixed effects model, we only need to change the model type to mixedlm and define our model accordingly:

md = smf.mixedlm("rt ~ congruent", df_long, groups=df_long["participant"])
mdf = md.fit()
print(mdf.summary())
         Mixed Linear Model Regression Results
=======================================================
Model:            MixedLM Dependent Variable: rt       
No. Observations: 300     Method:             REML     
No. Groups:       10      Scale:              0.2587   
Min. group size:  30      Log-Likelihood:     -237.4849
Max. group size:  30      Converged:          Yes      
Mean group size:  30.0                                 
-------------------------------------------------------
             Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-------------------------------------------------------
Intercept     0.780    0.105  7.448 0.000  0.575  0.986
congruent    -0.095    0.059 -1.614 0.107 -0.210  0.020
Group Var     0.093    0.095                           
=======================================================

Overall, there’s only one thing left to say regarding statistical analyses in python:

Outro/Q&A - visualization & statistics#

What we went through in this session was intended as a super small showcase of visualizing and analyzing data via python, specifically using a small subset of available modules which we only just started to explore and have way more functionality.

Sure: this was a very specific use case and data but the steps and underlying principles are transferable to the majority of comparable problems/tasks you might encounter. Especially because the modules we checked are commonly used and are usually the “go-to” option, at least as a first step. As always: make sure to check the fantastic docs of the python modules you’re using, as well as all the fantastic tutorials out there.

Outro/Q&A - working with data in python#

As mentioned in the beginning, there are so many options of working with data and setting up correspondingly workflows that vary as a function of data at hand and planned analyses. Finding the “only right option” might also not be possible in some cases.

The idea of this section of the course was to outline an example workflow based on some “real-world data” to showcase you some of the capabilities and things you can do using python. Specifically, once more making the case that python has module/library for basically everything you want to do, which are additionally rather straightforward and easy to use and come with comprehensive documentation, as well as many tutorials. This holds true for both domain-general and highly specialized use cases. From data exploration over visualization to analyses you will find a great set of resources that integrates amazingly well with one another and allow you to achieve awesome things in an intelligible and reproducible manner!

The core Python “data science” stack#

  • The Python ecosystem contains tens of thousands of packages

  • Several are very widely used in data science applications:

  • We’ll cover the first three very briefly here

    • Other tutorials will go into greater detail on most of the others

The core “Python for psychology” stack#

  • The Python ecosystem contains tens of thousands of packages

  • Several are very widely used in psychology research:

  • Execept scikit-learn, nilearn and mne, we’ll cover all very briefly in this course

    • there are many free tutorials online that will go into greater detail and also cover the other packages