Example usage of the cardsort package
To use cardsort in a project:
import cardsort
import logging
import pandas as pd
import numpy as np
print(cardsort.__version__)
logging.basicConfig(level=logging.INFO) # show cardsort's log messages (optional)
0.2.36
Load data
Input data: a CSV file with the columns card_id, card_label, category_id, category_label, user_id
This is the format created by kardsort.com with the “Casolysis Data (.csv) - Recommended” export
The data used in this example can be accessed via the docs folder on GitHub
path = "example-data.csv"
df = pd.read_csv(path) # a set of 10 cards that has been categorized by 5 users
print(df)
card_id card_label category_id category_label user_id
0 1 Dog 1 pets 1
1 2 Tiger 1 pets 1
2 3 Cat 1 pets 1
3 4 Apple 2 lunch 1
4 5 Sandwich 2 lunch 1
5 6 Banana 3 long food 1
6 7 Hot Dog 3 long food 1
7 8 Croissant 4 Moon-shaped 1
8 9 Mooncake 4 Moon-shaped 1
9 10 Moon 4 Moon-shaped 1
10 10 Moon 5 Celestial bodies 2
11 9 Mooncake 6 Junk food 2
12 8 Croissant 6 Junk food 2
13 5 Sandwich 6 Junk food 2
14 7 Hot Dog 6 Junk food 2
15 6 Banana 7 Healthy snacks 2
16 4 Apple 7 Healthy snacks 2
17 3 Cat 8 animals 2
18 2 Tiger 8 animals 2
19 1 Dog 8 animals 2
20 10 Moon 9 Satellites 3
21 5 Sandwich 10 Snacks 3
22 8 Croissant 10 Snacks 3
23 6 Banana 10 Snacks 3
24 9 Mooncake 10 Snacks 3
25 4 Apple 10 Snacks 3
26 1 Dog 11 Dogs 3
27 7 Hot Dog 11 Dogs 3
28 2 Tiger 12 Felines 3
29 3 Cat 12 Felines 3
30 9 Mooncake 13 Snacks 4
31 5 Sandwich 13 Snacks 4
32 7 Hot Dog 13 Snacks 4
33 8 Croissant 13 Snacks 4
34 4 Apple 14 Fruits 4
35 6 Banana 14 Fruits 4
36 10 Moon 15 Nature 4
37 2 Tiger 15 Nature 4
38 1 Dog 16 Pets 4
39 3 Cat 16 Pets 4
40 10 Moon 17 Astronomical Body 5
41 6 Banana 18 Food 5
42 8 Croissant 18 Food 5
43 4 Apple 18 Food 5
44 5 Sandwich 18 Food 5
45 7 Hot Dog 18 Food 5
46 9 Mooncake 18 Food 5
47 1 Dog 19 Animals 5
48 2 Tiger 19 Animals 5
49 3 Cat 19 Animals 5
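Before running an analysis, it can help to confirm that a file follows the column layout described above. The following is a minimal sketch using only pandas. The helper check_cardsort_input is hypothetical and not part of cardsort, and the checks assume that every user placed every card in exactly one category, as in the example data.

```python
import pandas as pd

# Column names as listed above for the kardsort.com export.
EXPECTED_COLUMNS = ["card_id", "card_label", "category_id", "category_label", "user_id"]


def check_cardsort_input(df):
    """Return a list of problems found (hypothetical helper, not part of cardsort)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # Assumption: each user places each card in exactly one category.
    n_duplicates = int(df.duplicated(subset=["user_id", "card_id"]).sum())
    if n_duplicates:
        problems.append(f"{n_duplicates} duplicate (user_id, card_id) rows")
    # Assumption: every user sorted the same number of cards.
    cards_per_user = df.groupby("user_id")["card_id"].nunique()
    if cards_per_user.nunique() > 1:
        problems.append("users sorted different numbers of cards")
    return problems


# Tiny stand-in for the example data: two users, two cards.
sample = pd.DataFrame(
    {
        "card_id": [1, 2, 1, 2],
        "card_label": ["Dog", "Tiger", "Dog", "Tiger"],
        "category_id": [1, 1, 8, 8],
        "category_label": ["pets", "pets", "animals", "animals"],
        "user_id": [1, 1, 2, 2],
    }
)
print(check_cardsort_input(sample))  # an empty list means no problems were found
```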
Create dendrogram
A quick and easy way to get an overview of your card sorting results.
cardsort.create_dendrogram(df)
INFO:cardsort.analysis:Computing distance matrix for user 1
INFO:cardsort.analysis:Computing distance matrix for user 2
INFO:cardsort.analysis:Computing distance matrix for user 3
INFO:cardsort.analysis:Computing distance matrix for user 4
INFO:cardsort.analysis:Computing distance matrix for user 5
Get cluster labels
Find out which category labels users gave to different clusters.
cards = ['Banana', 'Apple'] # the cards in the cluster of interest
# Default: Returns a DataFrame with the user_id of every user who clustered the cards in 'cards' together,
# including the user-generated cluster_label and a list of all cards in the respective cluster.
cardsort.get_cluster_labels(df, cards)
INFO:cardsort.analysis:User 1 did not cluster cards together.
INFO:cardsort.analysis:User 2 labeled card(s): Healthy snacks
INFO:cardsort.analysis:User 3 labeled card(s): Snacks
INFO:cardsort.analysis:User 4 labeled card(s): Fruits
INFO:cardsort.analysis:User 5 labeled card(s): Food
| | user_id | cluster_label | cards |
|---|---|---|---|
| 0 | 2 | Healthy snacks | [Banana, Apple] |
| 1 | 3 | Snacks | [Sandwich, Croissant, Banana, Mooncake, Apple] |
| 2 | 4 | Fruits | [Apple, Banana] |
| 3 | 5 | Food | [Banana, Croissant, Apple, Sandwich, Hot Dog, ... |
Interpretation: In this case, the users with IDs 2 and 4 made clusters containing exactly the two cards of interest (‘Banana’ and ‘Apple’, as specified in the input variable ‘cards’). User 2 labelled this cluster ‘Healthy snacks’, and user 4 labelled it ‘Fruits’. Users 3 and 5 also clustered these cards together, but their clusters included additional cards, and they labelled them ‘Snacks’ and ‘Food’ respectively. User 1 does not appear in the output because they did not cluster the two cards together.
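The distinction between an exact match and a larger cluster that merely contains the cards can be illustrated with plain pandas. This is a sketch of the idea on a reduced stand-in for the example data (users 1 to 3, four cards), not the implementation of get_cluster_labels. The exact_match column is an addition made for illustration.

```python
import pandas as pd

# Reduced stand-in for the example data: users 1, 2 and 3 only.
df_small = pd.DataFrame(
    {
        "card_label": ["Apple", "Sandwich", "Banana", "Hot Dog",
                       "Banana", "Apple",
                       "Sandwich", "Banana", "Apple"],
        "category_label": ["lunch", "lunch", "long food", "long food",
                           "Healthy snacks", "Healthy snacks",
                           "Snacks", "Snacks", "Snacks"],
        "user_id": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    }
)
cards = {"Banana", "Apple"}  # the cards in the cluster of interest

rows = []
for (user_id, label), group in df_small.groupby(["user_id", "category_label"]):
    members = set(group["card_label"])
    if cards <= members:  # this cluster contains all cards of interest
        rows.append(
            {
                "user_id": user_id,
                "cluster_label": label,
                "exact_match": members == cards,  # True if there are no extra cards
            }
        )

result = pd.DataFrame(rows)
print(result)
```

In this reduced data, user 2 produces an exact match, user 3 a larger cluster, and user 1 no row at all, which mirrors the interpretation above.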
Adapting the dendrogram
You can easily adapt the dendrogram to your needs by specifying parameters.
The function create_dendrogram accepts the following parameters:
df : A DataFrame with your data
distance_matrix : A pre-calculated condensed distance matrix (see “Advanced usage”)
count : The scale in which you want to present the results (‘absolute’: absolute count of users, ‘fraction’: fraction of users)
linkage : Linkage method used to compute the distance between two clusters (‘average’, ‘complete’, or ‘single’). See the scipy.cluster.hierarchy documentation for more information
color_threshold : Threshold above which clusters are no longer colored (either an absolute value, i.e. a number of users, or a fraction between 0 and 1)
# adapting the default dendrogram with parameters
cardsort.create_dendrogram(df, count='absolute', linkage='complete', color_threshold=2)
Computing distance matrix for user 1
Computing distance matrix for user 2
Computing distance matrix for user 3
Computing distance matrix for user 4
Computing distance matrix for user 5
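To get a feeling for what the linkage parameter changes, the three methods can be compared directly with scipy on a toy example. The sketch below uses made-up distances between three cards and does not call cardsort. It only shows how the height of the final merge depends on the method.

```python
import numpy as np
from scipy.cluster import hierarchy

# Made-up condensed distances for three cards A, B, C:
# d(A,B) = 1, d(A,C) = 4, d(B,C) = 2
dist_small = np.array([1.0, 4.0, 2.0])

final_merge_height = {}
for method in ("single", "average", "complete"):
    Z = hierarchy.linkage(dist_small, method=method)
    # Column 2 of the last row holds the height at which the last two clusters merge.
    final_merge_height[method] = float(Z[-1, 2])

print(final_merge_height)
# {'single': 2.0, 'average': 3.0, 'complete': 4.0}
```

A and B are merged first in all three cases. The cluster {A, B} is then joined with C at the smallest (single), mean (average), or largest (complete) of the distances 4 and 2.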
Advanced usage
Precalculating a condensed distance matrix
The create_dendrogram function automatically calculates a condensed distance matrix based on the pairwise similarity of all cards (this serves as the input of the hierarchical cluster analysis function used in the create_dendrogram function).
However, there might be cases in which you want to use a separately created condensed distance matrix.
dist = cardsort.get_distance_matrix(df) # this function returns a `condensed` distance matrix
Computing distance matrix for user 1
Computing distance matrix for user 2
Computing distance matrix for user 3
Computing distance matrix for user 4
Computing distance matrix for user 5
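To make the term "condensed distance matrix" concrete, here is a sketch of one plausible way to derive such a matrix from card sorting data: for every pair of cards, count the users who did not put the two cards into the same category. This is an assumption made for illustration. The actual computation in get_distance_matrix may differ.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Made-up data: two users sorting three cards.
df_small = pd.DataFrame(
    {
        "card_id": [1, 2, 3, 1, 2, 3],
        "category_id": [1, 1, 2, 3, 4, 4],
        "user_id": [1, 1, 1, 2, 2, 2],
    }
)

card_ids = sorted(df_small["card_id"].unique())
n_pairs = len(card_ids) * (len(card_ids) - 1) // 2
# Condensed form: one entry per unordered pair, in the order (1,2), (1,3), (2,3).
condensed = np.zeros(n_pairs)

for _, user_rows in df_small.groupby("user_id"):
    category_of = user_rows.set_index("card_id")["category_id"]
    for k, (a, b) in enumerate(combinations(card_ids, 2)):
        # Assumed rule: distance grows by 1 for each user who separated the two cards.
        if category_of.loc[a] != category_of.loc[b]:
            condensed[k] += 1

print(condensed)              # [1. 2. 1.]
print(squareform(condensed))  # the same distances as a symmetric 3x3 matrix
```

The condensed form stores only the upper triangle of the symmetric matrix, which is the input format that scipy's hierarchy.linkage expects.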
Use cases of a separately created condensed distance matrix
Saving time
If you have a large dataset, you might want to use a pre-calculated distance matrix. This prevents the create_dendrogram function from recalculating the distance matrix every time you run it.
# dist.dump("distance-matrix.dat") # save your distance matrix for later reuse
dist = np.load("distance-matrix.dat", allow_pickle=True) # load pre-calculated distance matrix
cardsort.create_dendrogram(df, distance_matrix=dist)
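As an alternative to dump and allow_pickle=True, a numeric array can also be stored in NumPy's own .npy format, which avoids unpickling when loading. The sketch below uses a small stand-in array and a temporary directory. It assumes that the distance matrix is a plain numeric NumPy array.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for the array returned by get_distance_matrix (assumed to be numeric).
dist_example = np.array([1.0, 2.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "distance-matrix.npy"
    np.save(path, dist_example)  # writes the .npy format, no pickling involved
    restored = np.load(path)     # allow_pickle is not needed for numeric arrays

print(np.array_equal(dist_example, restored))  # True
```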
Creating a custom dendrogram
You can use the pre-calculated distance matrix as input for scipy’s hierarchy.dendrogram function to create a fully customized dendrogram, as in the example below.
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
Z = hierarchy.linkage(y=dist, method='average') # this method accepts the output of get_distance_matrix as 'y' parameter
plt.figure(layout="constrained")
labels = df.loc[df['user_id'] == 1].sort_values('card_id')['card_label'].squeeze().to_list()
dn = hierarchy.dendrogram(Z, labels=labels, orientation='left', color_threshold=3)
plt.show()
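The labels line above takes the card labels from user 1, which works when user 1 sorted every card. A variant that does not depend on a single user is sketched below. Like the original line, it assumes that the distance matrix is ordered by ascending card_id. The data is a made-up stand-in in which user 1 did not sort card 3.

```python
import pandas as pd

# Made-up data: user 1 sorted only cards 1 and 2, user 2 sorted all three.
df_small = pd.DataFrame(
    {
        "card_id": [3, 1, 2, 1, 2],
        "card_label": ["Cat", "Dog", "Tiger", "Dog", "Tiger"],
        "user_id": [2, 2, 2, 1, 1],
    }
)

# One label per card, ordered by card_id, regardless of which user supplied the row.
labels = (
    df_small.drop_duplicates("card_id")
    .sort_values("card_id")["card_label"]
    .to_list()
)
print(labels)  # ['Dog', 'Tiger', 'Cat']
```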