Example usage of the cardsort package

To use cardsort in a project:

import cardsort
import logging
import pandas as pd
import numpy as np

print(cardsort.__version__)
logging.basicConfig(level=logging.INFO) # activate display of logging texts (optional)
0.2.36

Load data

  • Input data: csv file (columns: card_id, card_label, category_id, category_label, user_id)

  • Format: as produced by kardsort.com with the “Casolysis Data (.csv) - Recommended” export option

  • The data used in this example can be accessed via the docs folder on GitHub

path = "example-data.csv"
df = pd.read_csv(path) # a set of 10 cards that has been categorized by 5 users
print(df)
    card_id card_label  category_id     category_label  user_id
0         1        Dog            1               pets        1
1         2      Tiger            1               pets        1
2         3        Cat            1               pets        1
3         4      Apple            2              lunch        1
4         5   Sandwich            2              lunch        1
5         6     Banana            3          long food        1
6         7    Hot Dog            3          long food        1
7         8  Croissant            4        Moon-shaped        1
8         9   Mooncake            4        Moon-shaped        1
9        10       Moon            4        Moon-shaped        1
10       10       Moon            5   Celestial bodies        2
11        9   Mooncake            6          Junk food        2
12        8  Croissant            6          Junk food        2
13        5   Sandwich            6          Junk food        2
14        7    Hot Dog            6          Junk food        2
15        6     Banana            7     Healthy snacks        2
16        4      Apple            7     Healthy snacks        2
17        3        Cat            8            animals        2
18        2      Tiger            8            animals        2
19        1        Dog            8            animals        2
20       10       Moon            9         Satellites        3
21        5   Sandwich           10             Snacks        3
22        8  Croissant           10             Snacks        3
23        6     Banana           10             Snacks        3
24        9   Mooncake           10             Snacks        3
25        4      Apple           10             Snacks        3
26        1        Dog           11               Dogs        3
27        7    Hot Dog           11               Dogs        3
28        2      Tiger           12            Felines        3
29        3        Cat           12            Felines        3
30        9   Mooncake           13             Snacks        4
31        5   Sandwich           13             Snacks        4
32        7    Hot Dog           13             Snacks        4
33        8  Croissant           13             Snacks        4
34        4      Apple           14             Fruits        4
35        6     Banana           14             Fruits        4
36       10       Moon           15             Nature        4
37        2      Tiger           15             Nature        4
38        1        Dog           16               Pets        4
39        3        Cat           16               Pets        4
40       10       Moon           17  Astronomical Body        5
41        6     Banana           18               Food        5
42        8  Croissant           18               Food        5
43        4      Apple           18               Food        5
44        5   Sandwich           18               Food        5
45        7    Hot Dog           18               Food        5
46        9   Mooncake           18               Food        5
47        1        Dog           19            Animals        5
48        2      Tiger           19            Animals        5
49        3        Cat           19            Animals        5
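Before running an analysis, it can help to confirm that the loaded DataFrame has the five columns listed above. The following is a minimal sketch only: check_cardsort_columns is a hypothetical helper written for this example and is not part of the cardsort package, and the sample rows are a hand-copied subset of the example data.

```python
import pandas as pd

# Column names taken from the "Load data" section above.
REQUIRED_COLUMNS = ["card_id", "card_label", "category_id", "category_label", "user_id"]


def check_cardsort_columns(df):
    """Return the required columns that are missing from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Small stand-in for the example data (three rows of user 1).
sample = pd.DataFrame(
    {
        "card_id": [1, 2, 4],
        "card_label": ["Dog", "Tiger", "Apple"],
        "category_id": [1, 1, 2],
        "category_label": ["pets", "pets", "lunch"],
        "user_id": [1, 1, 1],
    }
)

missing = check_cardsort_columns(sample)
incomplete = check_cardsort_columns(sample.drop(columns=["user_id"]))
print(missing)     # []
print(incomplete)  # ['user_id']
```

An empty list means that all expected columns are present.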

Create dendrogram

A dendrogram is a quick and easy way to get an overview of your card sorting results.

cardsort.create_dendrogram(df)
INFO:cardsort.analysis:Computing distance matrix for user 1
INFO:cardsort.analysis:Computing distance matrix for user 2
INFO:cardsort.analysis:Computing distance matrix for user 3
INFO:cardsort.analysis:Computing distance matrix for user 4
INFO:cardsort.analysis:Computing distance matrix for user 5
[Figure: dendrogram of the card sorting results]

Get cluster labels

Find out which category labels users gave to different clusters.

cards = ['Banana', 'Apple'] # the cards in the cluster of interest

# Default: Returns a DataFrame with the user_id of every user who clustered the cards in 'cards' together, 
# including the user-generated cluster_label and a list of all cards in the respective cluster.
cardsort.get_cluster_labels(df, cards)
INFO:cardsort.analysis:User 1 did not cluster cards together.
INFO:cardsort.analysis:User 2 labeled card(s): Healthy snacks
INFO:cardsort.analysis:User 3 labeled card(s): Snacks
INFO:cardsort.analysis:User 4 labeled card(s): Fruits
INFO:cardsort.analysis:User 5 labeled card(s): Food
   user_id   cluster_label                                              cards
0        2  Healthy snacks                                    [Banana, Apple]
1        3          Snacks     [Sandwich, Croissant, Banana, Mooncake, Apple]
2        4          Fruits                                    [Apple, Banana]
3        5            Food  [Banana, Croissant, Apple, Sandwich, Hot Dog, ...

Interpretation: In this case, users 2 and 4 made clusters containing exactly the two cards of interest (‘Banana’ and ‘Apple’, as specified in the input variable ‘cards’). User 2 labelled this cluster ‘Healthy snacks’, and user 4 labelled it ‘Fruits’. Users 3 and 5 also clustered these cards together, but included other cards in the same cluster, which they labelled ‘Snacks’ and ‘Food’ respectively. User 1 does not appear in the output because they did not cluster the two cards together.
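To make the logic behind this output easier to follow, here is a rough pandas sketch of the same idea: for each user, find the category that contains all cards of interest. It is not the implementation used by cardsort. The reduced data set, the variable names and the extra exact_match column are assumptions made for illustration, and the sketch assumes that category labels are unique per user.

```python
import pandas as pd

# Reduced stand-in for the example data: three users, three cards.
df_small = pd.DataFrame(
    {
        "card_label": ["Apple", "Banana", "Sandwich"] * 3,
        "category_label": [
            "lunch", "long food", "lunch",                    # user 1
            "Healthy snacks", "Healthy snacks", "Junk food",  # user 2
            "Snacks", "Snacks", "Snacks",                     # user 3
        ],
        "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    }
)

cards = ["Banana", "Apple"]  # the cards in the cluster of interest

rows = []
for (user_id, label), group in df_small.groupby(["user_id", "category_label"]):
    members = group["card_label"].tolist()
    # Keep the category only if it contains every card of interest.
    if set(cards) <= set(members):
        rows.append(
            {
                "user_id": user_id,
                "cluster_label": label,
                "cards": members,
                # True if the category holds the cards of interest and nothing else.
                "exact_match": set(members) == set(cards),
            }
        )

result = pd.DataFrame(rows)
print(result)
```

In this reduced data set, user 1 is dropped because ‘Apple’ and ‘Banana’ sit in different categories, user 2 is an exact match, and user 3 grouped the two cards together with ‘Sandwich’.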

Adapting the dendrogram

You can easily adapt the dendrogram to your needs by specifying parameters.

The function create_dendrogram accepts the following parameters:

  • df : A DataFrame with your data

  • distance_matrix : A pre-calculated condensed distance matrix (see “Advanced usage”)

  • count : The scale in which to present the results (‘absolute’: absolute count of users, ‘fraction’: fraction of users)

  • linkage : Linkage method used to compute the distance between two clusters (‘average’, ‘complete’, or ‘single’). See the scipy.cluster.hierarchy documentation for more information

  • color_threshold : Threshold above which clusters are no longer colored (either an absolute value, i.e. a number of users, or a fraction from 0 to 1)

# adaptation of the default dendrogram using parameters
cardsort.create_dendrogram(df, count='absolute', linkage='complete', color_threshold=2)
Computing distance matrix for user 1
Computing distance matrix for user 2
Computing distance matrix for user 3
Computing distance matrix for user 4
Computing distance matrix for user 5
[Figure: dendrogram with absolute counts, complete linkage and a color threshold of 2]
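The effect of the linkage parameter can be illustrated with scipy alone. The sketch below uses a made-up condensed distance matrix for four items (not cardsort output) and compares the height of the final merge for the three methods named above.

```python
import numpy as np
from scipy.cluster import hierarchy

# Made-up condensed distances for four items A, B, C, D.
# Pair order: AB, AC, AD, BC, BD, CD
toy_dist = np.array([1.0, 4.0, 5.0, 2.0, 3.0, 1.0])

final_heights = {}
for method in ("single", "average", "complete"):
    Z = hierarchy.linkage(toy_dist, method=method)
    # Column 2 of the linkage matrix holds the height of each merge;
    # the last row is the merge that joins everything into one tree.
    final_heights[method] = float(Z[-1, 2])

print(final_heights)
```

With these values, A and B merge first, as do C and D. The two resulting clusters are then joined at 2.0 (single: closest pair), 3.5 (average: mean of all cross-cluster pairs) or 5.0 (complete: farthest pair), which is why the same data can yield dendrograms of different heights.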

Advanced usage

Precalculating a condensed distance matrix

The create_dendrogram function automatically calculates a condensed distance matrix based on the pairwise similarity of all cards. This matrix serves as the input for the hierarchical cluster analysis that create_dendrogram performs.

However, there might be cases in which you want to use a separately created condensed distance matrix.

dist = cardsort.get_distance_matrix(df) # this function returns a `condensed` distance matrix
Computing distance matrix for user 1
Computing distance matrix for user 2
Computing distance matrix for user 3
Computing distance matrix for user 4
Computing distance matrix for user 5
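To illustrate what “condensed” means, the sketch below builds a small distance matrix by hand and converts it to square form with scipy. The distance definition used here (the number of users who placed two cards in different categories) is an assumption made for illustration; the exact calculation inside get_distance_matrix may differ. The toy data and variable names are likewise hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Stand-in data: 2 users, 3 cards (column names as in the example CSV).
toy = pd.DataFrame(
    {
        "card_id": [1, 2, 3, 1, 2, 3],
        "card_label": ["Dog", "Cat", "Apple"] * 2,
        "category_id": [1, 1, 2, 3, 4, 4],
        "user_id": [1, 1, 1, 2, 2, 2],
    }
)

card_ids = sorted(toy["card_id"].unique())
pairs = list(combinations(card_ids, 2))  # (1, 2), (1, 3), (2, 3)

# Assumed distance: number of users who put the two cards in different categories.
condensed = np.zeros(len(pairs))
for _, user_rows in toy.groupby("user_id"):
    category_of = user_rows.set_index("card_id")["category_id"]
    for i, (a, b) in enumerate(pairs):
        if category_of[a] != category_of[b]:
            condensed[i] += 1

# A condensed matrix stores each pair once: n * (n - 1) / 2 values for n cards.
square = squareform(condensed)  # symmetric n x n matrix with a zero diagonal
print(condensed)
print(square)
```

The condensed form is the flat vector that scipy’s hierarchy.linkage expects, while the square form is easier to read by eye.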

Use cases of a separately created condensed distance matrix

Saving time

If you have a large dataset, you might want to use a pre-calculated distance matrix. This prevents create_dendrogram from recalculating the distance matrix every time you run it.

# dist.dump("distance-matrix.dat") # save your distance matrix for later reuse
dist = np.load("distance-matrix.dat", allow_pickle=True) # load pre-calculated distance matrix
cardsort.create_dendrogram(df, distance_matrix=dist)
[Figure: dendrogram, identical to the default dendrogram above]
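As an alternative to dump and allow_pickle=True, a NumPy array can also be stored in the .npy format, which avoids pickle entirely. The sketch below assumes that the distance matrix is a plain numeric NumPy array; the values and the file name are placeholders.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for the array returned by get_distance_matrix (values are made up).
dist_example = np.array([1.0, 2.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "distance-matrix.npy"
    np.save(path, dist_example)  # binary .npy format, no pickle involved
    restored = np.load(path)     # allow_pickle is False by default

print(np.array_equal(dist_example, restored))
```

Loading without pickle is the safer choice when the file might come from another person or machine.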

Creating a custom dendrogram

You can use the pre-calculated distance matrix as input for scipy’s hierarchy.linkage and hierarchy.dendrogram functions to create a fully customized dendrogram, as in the example below.

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

# linkage accepts the condensed output of get_distance_matrix as its 'y' argument
Z = hierarchy.linkage(y=dist, method='average')
plt.figure(layout="constrained")
# card labels sorted by card_id, taken from user 1
labels = df.loc[df['user_id'] == 1].sort_values('card_id')['card_label'].to_list()
# horizontal dendrogram; clusters that merge below a distance of 3 share a color
dn = hierarchy.dendrogram(Z, labels=labels, orientation='left', color_threshold=3)
plt.show()
[Figure: custom dendrogram with left orientation and a color threshold of 3]
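If you want to inspect the result without drawing it, scipy can also return the dendrogram layout as a dictionary and cut the tree into flat clusters. The sketch below uses made-up distances and labels rather than the cardsort example data, so the numbers are purely illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy

# Made-up condensed distances for four cards (pair order: 01, 02, 03, 12, 13, 23).
toy_dist = np.array([1.0, 4.0, 5.0, 2.0, 3.0, 1.0])
toy_labels = ["Dog", "Cat", "Apple", "Banana"]

Z = hierarchy.linkage(toy_dist, method="average")

# no_plot=True returns the layout dictionary without drawing anything.
layout = hierarchy.dendrogram(Z, labels=toy_labels, no_plot=True)
print(layout["ivl"])  # leaf labels in plotting order

# Cut the tree at a distance of 2 to get one flat cluster id per card.
flat = hierarchy.fcluster(Z, t=2, criterion="distance")
print(dict(zip(toy_labels, flat.tolist())))
```

Here ‘Dog’ and ‘Cat’ end up in one flat cluster and ‘Apple’ and ‘Banana’ in another, because each pair merges at a distance of 1 while the two pairs only join at 3.5.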