Required Knowledge: Vectors, Encoding, Classification problems
Audience: Data scientists, Vector enthusiasts, Statisticians, Machine learning engineers
Reading time: 5 minutes
Classification is the task of assigning a label to an input. Machine Learning classification requires a dataset full of labeled data. A Machine Learning model is first trained (i.e. the model observes and analyses the data), and then it will be able to inference the class of an unlabeled item (i.e. classify them into one of the seen categories/classes).
Binary classification models are simple and only know two classes (e.g. positive/negative, cat/dog, True/False). For more complex problems, we need stronger models that are able to inference multiple data classes (e.g. cat/dog/rabbit/horse, agree/disagree/neutral). Multilabel classification models are to use in such scenarios.
Typically, we may want to use a neural network to solve these kinds of problems. However, embeddings offer an alternative, less computationally intensive, solution.
Vectors reframe traditional classification into a vector search problem
One solution to substitute neural networks in classification tasks is using a technique called k-nearest neighbour (kNN) (i.e. finding the K nearest neighbouring data points). This approach requires data to be encoded (i.e vectorized). All data points are encoded/vectorized and placed along with each other to form the embedding space. This also applies to new query/search items.
To perform classification under the KNN approach, we look at the closest k vectors to our new/query item (k is decided based on a set of pilot experiments); distance is calculated in the vector space, following vector algebra. Classification is done according to the classes of these k nearest neighbours.
Here are a few key advantages and possible disadvantages of using vectors over neural networks for classifications.
Advantages and Disadvantages of Vector Similarity Approach
|Resolves the cold-start issue|
In traditional approaches, classification neural networks would have to be re-trained in order to adapt to new categories. Adding new classes to existing models would often require re-training costs, compute costs and significant time into ensuring a new model has been tested to work .
|Fine-tuning might be necessary for specific data|
Under specific data, the encoders might need to be finetuned which requires labeled data.
|Reduced computational costs|
Using excellent out-of-the-box vectors/similarity search that resolves this issue means you can then reduce the cost of initial data science experiments and ensure data to value quickly
Updated 10 months ago