Tips To Improve Cluster Accuracy

Choose the optimal number of clusters

Many factors are involved in identifying the optimal number of clusters. Algorithms such as KMeans require the user to specify the number of clusters in advance. Data scientists often use techniques such as the Elbow method or metrics such as the Silhouette Coefficient to choose that number. Even with these methods, however, it is very difficult to identify the exact number of categories.

On the other hand, some clustering algorithms claim to find the optimal number automatically, but how well this works depends heavily on the data.

In practice, we recommend the following approach, which we have found useful:

  1. understand the data as much as possible to get an idea of the topics
  2. try different numbers of clusters and quickly check the results under the Explorer dashboard
  3. increase or decrease the number of clusters based on step 2
  4. employ the merge functionality of the Explorer dashboard to combine clusters that are conceptually close to one another

Note: Step 2 can include different clustering algorithms (e.g. Auto-cluster and KMeans) as well as different numbers of clusters within one algorithm (e.g. KMeans).
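
Outside the dashboard, steps 2 and 3 can also be approximated in code. The following is a minimal sketch, assuming scikit-learn is available and using synthetic data from make_blobs in place of real embeddings. The helper name score_cluster_counts and the candidate range are illustrative choices, not part of any product API. It fits KMeans for several cluster counts and reports the inertia (the quantity plotted in the Elbow method) alongside the Silhouette Coefficient.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def score_cluster_counts(X, candidate_ks, random_state=0):
    """Fit KMeans once per candidate k; return {k: (inertia, silhouette)}."""
    results = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # Inertia always tends to fall as k grows, so look for the "elbow".
        # The silhouette score lies in [-1, 1]; higher is better.
        results[k] = (model.inertia_, silhouette_score(X, labels))
    return results


# Synthetic stand-in for real data: four well-separated groups (an assumption).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

results = score_cluster_counts(X, range(2, 7))
for k, (inertia, silhouette) in results.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={silhouette:.3f}")

# Candidate with the highest silhouette score; treat it as a starting point.
best_k = max(results, key=lambda k: results[k][1])
print("Highest silhouette at k =", best_k)
```

On real text data the scores are rarely this clear-cut, so the suggested value is best treated as a starting point for the manual review and merging described above.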

Breaking long pieces of text

  • When designing surveys, make questions as specific as possible so that respondents do not cover several topics in a single response.
  • When dealing with long pieces of text, break them into their constituent sentences to better identify the topics and themes they contain.
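
As a rough illustration of the second point, the sketch below splits a response into sentences with a simple regular expression before clustering. The function name split_sentences is hypothetical, and the rule is deliberately naive: it splits after ".", "!" or "?" followed by whitespace, so abbreviations such as "e.g." will be split incorrectly. A dedicated sentence tokenizer is likely a better choice for production data.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, e.g. when the input is empty or only whitespace.
    return [part for part in parts if part]


response = "The delivery was late. The food was great! Would I order again?"
sentences = split_sentences(response)
for sentence in sentences:
    print(sentence)
```

Each resulting sentence can then be clustered as its own item, so a response that mentions both delivery and food quality can contribute to two different clusters.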