Example usage of the cardsort package

To use cardsort in a project:

import cardsort
import logging
import pandas as pd
import numpy as np

print(cardsort.__version__)
logging.basicConfig(level=logging.INFO) # show log messages (optional)

0.2.36

Load data

Input data: a CSV file with the columns card_id, card_label, category_id, category_label, and user_id
As created by the “Casolysis Data (.csv) - Recommended” export on kardsort.com
The data used in this example can be accessed via the docs folder on GitHub
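Before running an analysis, it can help to confirm that a file has the expected columns. The following is a minimal sketch using only pandas; the helper `missing_columns` and the two-row sample are hypothetical and not part of cardsort.

```python
import io

import pandas as pd

# Column names taken from the input description above.
REQUIRED_COLUMNS = ["card_id", "card_label", "category_id", "category_label", "user_id"]


def missing_columns(df):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# A two-row stand-in for a real export, so the sketch runs without the example file.
sample_csv = io.StringIO(
    "card_id,card_label,category_id,category_label,user_id\n"
    "1,Dog,1,pets,1\n"
    "2,Tiger,1,pets,1\n"
)
sample = pd.read_csv(sample_csv)

missing = missing_columns(sample)
if missing:
    raise ValueError(f"Input file is missing columns: {missing}")
```

With a real export, `sample` would be replaced by the DataFrame read from your own file.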

path = "example-data.csv"
df = pd.read_csv(path) # a set of 10 cards that has been categorized by 5 users
print(df)

    card_id card_label  category_id     category_label  user_id
       1        Dog            1               pets        1
       2      Tiger            1               pets        1
       3        Cat            1               pets        1
       4      Apple            2              lunch        1
       5   Sandwich            2              lunch        1
       6     Banana            3          long food        1
       7    Hot Dog            3          long food        1
       8  Croissant            4        Moon-shaped        1
       9   Mooncake            4        Moon-shaped        1
      10       Moon            4        Moon-shaped        1
     10       Moon            5   Celestial bodies        2
      9   Mooncake            6          Junk food        2
      8  Croissant            6          Junk food        2
      5   Sandwich            6          Junk food        2
      7    Hot Dog            6          Junk food        2
      6     Banana            7     Healthy snacks        2
      4      Apple            7     Healthy snacks        2
      3        Cat            8            animals        2
      2      Tiger            8            animals        2
      1        Dog            8            animals        2
     10       Moon            9         Satellites        3
      5   Sandwich           10             Snacks        3
      8  Croissant           10             Snacks        3
      6     Banana           10             Snacks        3
      9   Mooncake           10             Snacks        3
      4      Apple           10             Snacks        3
      1        Dog           11               Dogs        3
      7    Hot Dog           11               Dogs        3
      2      Tiger           12            Felines        3
      3        Cat           12            Felines        3
      9   Mooncake           13             Snacks        4
      5   Sandwich           13             Snacks        4
      7    Hot Dog           13             Snacks        4
      8  Croissant           13             Snacks        4
      4      Apple           14             Fruits        4
      6     Banana           14             Fruits        4
     10       Moon           15             Nature        4
      2      Tiger           15             Nature        4
      1        Dog           16               Pets        4
      3        Cat           16               Pets        4
     10       Moon           17  Astronomical Body        5
      6     Banana           18               Food        5
      8  Croissant           18               Food        5
      4      Apple           18               Food        5
      5   Sandwich           18               Food        5
      7    Hot Dog           18               Food        5
      9   Mooncake           18               Food        5
      1        Dog           19            Animals        5
      2      Tiger           19            Animals        5
      3        Cat           19            Animals        5

Create dendrogram

A quick and easy way to get an overview of your card sorting results.

cardsort.create_dendrogram(df)

INFO:cardsort.analysis:Computing distance matrix for user 1
INFO:cardsort.analysis:Computing distance matrix for user 2
INFO:cardsort.analysis:Computing distance matrix for user 3
INFO:cardsort.analysis:Computing distance matrix for user 4
INFO:cardsort.analysis:Computing distance matrix for user 5

[Figure: dendrogram of the example card sorting data]

Get cluster labels

Find out which category labels users gave to different clusters.

cards = ['Banana', 'Apple'] # the cards in the cluster of interest

# Default: Returns a DataFrame with the user_id of every user who clustered the cards in 'cards' together, 
# including the user-generated cluster_label and a list of all cards in the respective cluster.
cardsort.get_cluster_labels(df, cards)

INFO:cardsort.analysis:User 1 did not cluster cards together.
INFO:cardsort.analysis:User 2 labeled card(s): Healthy snacks
INFO:cardsort.analysis:User 3 labeled card(s): Snacks
INFO:cardsort.analysis:User 4 labeled card(s): Fruits
INFO:cardsort.analysis:User 5 labeled card(s): Food

   user_id   cluster_label  cards
0        2  Healthy snacks  [Banana, Apple]
1        3          Snacks  [Sandwich, Croissant, Banana, Mooncake, Apple]
2        4          Fruits  [Apple, Banana]
3        5            Food  [Banana, Croissant, Apple, Sandwich, Hot Dog, ...

Interpretation: In this case, users 2 and 4 made clusters containing exactly the two cards of interest (‘Banana’ and ‘Apple’, as specified in the input variable ‘cards’). User 2 labelled this cluster ‘Healthy snacks’, and user 4 labelled it ‘Fruits’. Users 3 and 5 also clustered these cards together, but their clusters contain additional cards and are labelled ‘Snacks’ and ‘Food’, respectively. User 1 does not appear in the output because they placed the two cards in different categories.
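To make the logic behind this output more concrete, here is a minimal pandas sketch of how users who grouped a set of cards together could be identified. It is an illustration only, not the implementation used by get_cluster_labels; the helper `users_who_grouped` and the toy data are hypothetical, and the sketch assumes category labels are unique per user.

```python
import pandas as pd

# Toy data: (card_label, category_label, user_id)
rows = [
    ("Apple", "lunch", 1), ("Banana", "long food", 1),
    ("Apple", "Healthy snacks", 2), ("Banana", "Healthy snacks", 2),
    ("Apple", "Snacks", 3), ("Banana", "Snacks", 3), ("Sandwich", "Snacks", 3),
]
toy = pd.DataFrame(rows, columns=["card_label", "category_label", "user_id"])


def users_who_grouped(df, cards):
    """Map user_id -> category_label for users who put all `cards` in one category."""
    selected = df[df["card_label"].isin(cards)]
    result = {}
    for user_id, group in selected.groupby("user_id"):
        labels = group["category_label"].unique()
        # The user must have sorted every requested card, all into a single category.
        if set(group["card_label"]) == set(cards) and len(labels) == 1:
            result[int(user_id)] = labels[0]
    return result


grouped = users_who_grouped(toy, ["Banana", "Apple"])
print(grouped)
```

In the toy data, user 1 is excluded because ‘Apple’ and ‘Banana’ sit in different categories, which mirrors the behaviour described in the interpretation above.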

Adapting the dendrogram

You can easily adapt the dendrogram to your needs by specifying parameters.

The function create_dendrogram accepts the following parameters:

df : A DataFrame with your data
distance_matrix : A pre-calculated condensed distance matrix (see “Advanced usage”)
count : The scale on which to present the results (‘absolute’: absolute count of users, ‘fraction’: fraction of users)
linkage : Linkage method used to compute the distance between two clusters (‘average’, ‘complete’, or ‘single’). See the scipy.cluster.hierarchy documentation for more information
color_threshold : Threshold above which the coloring of clusters ends (either an absolute value, i.e. a number of users, or a fraction from 0 to 1)

# adapt the default dendrogram by passing parameters
cardsort.create_dendrogram(df, count='absolute', linkage='complete', color_threshold=2)

Computing distance matrix for user 1
Computing distance matrix for user 2
Computing distance matrix for user 3
Computing distance matrix for user 4
Computing distance matrix for user 5

[Figure: dendrogram with count='absolute', linkage='complete', and color_threshold=2]
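The color_threshold=2 used above is on the absolute scale. Assuming the fraction scale is simply the absolute count divided by the number of users (an assumption based on the parameter descriptions, not verified against the package source), an equivalent fractional threshold could be derived as in this small sketch. The toy DataFrame only mimics the user_id column of the example data.

```python
import pandas as pd

# Stand-in for the user_id column of a dataset with 5 users.
toy = pd.DataFrame({"user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]})

n_users = toy["user_id"].nunique()   # number of distinct participants
absolute_threshold = 2               # threshold expressed as a number of users
# Assumption: fraction = absolute count / number of users.
fraction_threshold = absolute_threshold / n_users
print(fraction_threshold)
```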

Advanced usage

Precalculating a condensed distance matrix

The create_dendrogram function automatically calculates a condensed distance matrix based on the pairwise similarity of all cards (this serves as the input of the hierarchical cluster analysis function used in the create_dendrogram function).

However, there might be cases in which you want to use a separately created condensed distance matrix.

dist = cardsort.get_distance_matrix(df) # this function returns a `condensed` distance matrix

Computing distance matrix for user 1
Computing distance matrix for user 2
Computing distance matrix for user 3
Computing distance matrix for user 4
Computing distance matrix for user 5
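A condensed distance matrix stores only one value per card pair, in the flat order expected by scipy. The sketch below illustrates that format on toy data. The distance used here (the number of users who put two cards in different categories) is an assumption chosen for illustration and may differ from what get_distance_matrix computes; `toy_condensed_distances` is a hypothetical helper. scipy's squareform converts the condensed vector into a full square matrix, which is often easier to inspect.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Toy data: 3 cards sorted by 2 users.
toy = pd.DataFrame(
    {
        "card_id": [1, 2, 3, 1, 2, 3],
        "category_id": [1, 1, 2, 3, 4, 4],
        "user_id": [1, 1, 1, 2, 2, 2],
    }
)


def toy_condensed_distances(df):
    """For each card pair, count the users who put the two cards in different categories."""
    card_ids = sorted(df["card_id"].unique())
    distances = []
    # combinations() yields pairs in the same order as scipy's condensed format.
    for a, b in combinations(card_ids, 2):
        separated = 0
        for _, user_rows in df.groupby("user_id"):
            category_of = user_rows.set_index("card_id")["category_id"]
            if category_of.loc[a] != category_of.loc[b]:
                separated += 1
        distances.append(separated)
    return np.asarray(distances, dtype=float)


condensed = toy_condensed_distances(toy)  # one entry per pair: (1,2), (1,3), (2,3)
square = squareform(condensed)            # symmetric 3x3 matrix with a zero diagonal
print(condensed)
print(square)
```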

Use cases of a separately created condensed distance matrix

Saving time

If you have a large dataset, you might want to use a pre-calculated distance matrix. This prevents the create_dendrogram function from recalculating the distance matrix every time you run it.

# dist.dump("distance-matrix.dat") # save your distance matrix for later reuse
dist = np.load("distance-matrix.dat", allow_pickle=True) # load pre-calculated distance matrix
cardsort.create_dendrogram(df, distance_matrix=dist)
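As an alternative sketch, assuming the distance matrix is a plain numeric NumPy array, np.save and np.load store it in NumPy's .npy format, which does not require allow_pickle when loading. The array values below are made up and only stand in for the output of get_distance_matrix.

```python
import os
import tempfile

import numpy as np

# Stand-in for the array returned by get_distance_matrix (values are made up).
dist_example = np.array([1.0, 2.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "distance-matrix.npy")
    np.save(path, dist_example)  # write the array in .npy format
    restored = np.load(path)     # plain numeric arrays load without allow_pickle

print(restored)
```

The temporary directory is only used to keep the sketch self-contained; in practice you would save to a path of your choice.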

Creating a custom dendrogram

You can use the pre-calculated distance matrix as input for scipy’s hierarchy.linkage and hierarchy.dendrogram functions to create a fully customized dendrogram, as in the example below.

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

Z = hierarchy.linkage(y=dist, method='average') # this method accepts the output of get_distance_matrix as 'y' parameter
plt.figure(layout="constrained")
labels = df.loc[df['user_id'] == 1].sort_values('card_id')['card_label'].squeeze().to_list()
dn = hierarchy.dendrogram(Z, labels=labels, orientation='left', color_threshold=3)
plt.show()

[Figure: custom left-oriented dendrogram created with scipy]
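The same linkage matrix can also be cut into flat clusters instead of being plotted. The following sketch uses scipy's hierarchy.fcluster on a made-up condensed distance matrix for four cards; fcluster is a scipy function, not part of cardsort, and the distance values and the threshold t=2 are chosen only for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy

labels = ["Dog", "Cat", "Apple", "Banana"]
# Condensed order: Dog-Cat, Dog-Apple, Dog-Banana, Cat-Apple, Cat-Banana, Apple-Banana
toy_dist = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 1.0])

Z = hierarchy.linkage(toy_dist, method="average")
# Cut the tree at distance 2: cards that merge below this height share a cluster id.
flat = hierarchy.fcluster(Z, t=2, criterion="distance")

clusters = {}
for label, cluster_id in zip(labels, flat):
    clusters.setdefault(int(cluster_id), []).append(label)
print(clusters)
```

With real data, `toy_dist` would be replaced by the output of get_distance_matrix and `labels` by the card labels in card_id order, as in the example above.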