How to prepare data for Relevance AI

Before uploading a dataset, run through this checklist to make sure your data meets our recommendations and requirements.

  • The general format for uploading data to Relevance AI is CSV.
  • The maximum number of rows in a dataset is 300,000. Please contact us if your dataset is larger.
  • Make sure to read further down on this page if you are processing images or audio (i.e. media data).

File format: CSV

Your dataset should be saved in valid CSV format before being uploaded to Relevance AI's platform.

CSV files store data in a table-like format, similar to what you see in an Excel sheet. Make sure all columns have a unique name and that the values in each column follow the same data type and format (see the "Values" section below).

No | Name | Company | Age
---|------|---------|----
1  | Jim  | ABC     | 32
2  | Jack | XYZ     | 24
3  | Dave | LMN     | 39

Headers: Names of fields/columns

  • Column names/headers are included as the first row of the file
  • Column names/headers should be on one line (i.e. multi-line headers are not accepted)
  • Names should be short but descriptive (recommended)
  • No duplicate column names (i.e. each column name must be unique; otherwise we automatically add numbers to the headers)
  • Names can only contain letters, numbers, dashes or underscores (any other character will be replaced by our upload engine)
    Note that
    • white spaces will be replaced with - by our upload engine
    • . will be replaced with - by our upload engine
    • if your dataset includes vectors, make sure the vector field name ends in _vector_
      Vector fields:
      Vector fields store your data in another representation: a list of numbers (a vector, to be precise). For instance, if your dataset includes a field/column named "description" that holds the description of each item as text, vectorizing each description value produces a corresponding vector. These vectors can be saved in the original dataset under a vector field (an example is provided under Data FAQs).
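
The header rules above can be applied locally before uploading. The following is a minimal sketch, not the upload engine's actual logic: the function name sanitize_headers is hypothetical, replacing every other disallowed character with - is an assumption (the text above only states this for white spaces and .), and the numbering scheme for duplicates is also assumed.

```python
import re

def sanitize_headers(headers):
    """Return header names limited to letters, numbers, dashes and underscores."""
    cleaned = []
    used = set()
    for name in headers:
        # White spaces and "." become "-" as described above. Mapping all other
        # disallowed characters to "-" as well is an assumption of this sketch.
        # Leading/trailing white space is stripped first (also an assumption).
        base = re.sub(r"[^A-Za-z0-9_-]", "-", name.strip())
        # Make duplicates unique by appending a counter (assumed scheme).
        candidate = base
        counter = 1
        while candidate in used:
            candidate = f"{base}_{counter}"
            counter += 1
        used.add(candidate)
        cleaned.append(candidate)
    return cleaned

print(sanitize_headers(["Customer Name", "order.id", "Age", "Age"]))
```

A vector field name such as description_vector_ already satisfies the rules and passes through unchanged.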

Values: Values under headers

  • Include only one data type and format in each column. For instance:
    • All digit values - Values under fields such as age, phone number or scores, which are composed of digits only, must be all digits in every cell. Entries like None, NA, and white spaces will break the upload.
    • DATES - Format all date fields as "YYYY-MM-DD" (recommended)
    • None / No values - When there is no value, simply leave the cell empty in your CSV file. An empty cell is a cell with nothing typed in it: not a white space, not 0, not N/A, not None, not null. A common case is when people do not respond to a question.
    • Note 1: keep the format the same within a column
      Example: CURRENCY - Values in digits only, without the currency sign (e.g. 119.50), as opposed to mixed formats such as price = [$119.50, 200 dollars]
    • Note 2: when a column mixes all-digit values with values that contain both digits and characters
      Example: POSTCODES - If your data contains a postcode field with both numeric (e.g. 90210) and string format (e.g. SW1A 1AA) values in one field (e.g. postcodes across countries), ensure that the first postcode value (under the column header 'postcode') is a string format postcode (e.g. SW1A 1AA) and not an all-digit one.
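
To catch inconsistent cells before uploading, the rules above can be checked with a short script. This is a sketch under stated assumptions: the function names and the sample column names (age, joined) are hypothetical, empty cells are treated as valid in every column, and rows are assumed not to contain multi-line cells (so row numbers match line numbers in the file).

```python
import csv
import io
import re
from datetime import datetime

def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

def find_bad_cells(csv_text, digit_columns=(), date_columns=()):
    """Return (row_number, column, value) for each cell that breaks a rule."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in digit_columns:
            value = row[column]
            # Empty cells are allowed; anything else must be digits only.
            if value != "" and not re.fullmatch(r"[0-9]+", value):
                problems.append((row_number, column, value))
        for column in date_columns:
            value = row[column]
            if value != "" and not is_iso_date(value):
                problems.append((row_number, column, value))
    return problems

sample = "id,age,joined\n1,32,2023-01-15\n2,NA,15/01/2023\n3,,2023-02-30\n"
for problem in find_bad_cells(sample, ["age"], ["joined"]):
    print(problem)
```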

Categorical measures

  • If your data has coded values (e.g. Is Member = 1/0), we recommend changing them to natural language that business users can understand, e.g. Is Member / Is Not Member, Yes / No, or True / False.
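
As a small illustration of the recommendation above, coded 1/0 cells can be mapped to readable labels before upload. The helper name decode_flag and its default labels are hypothetical; empty cells are kept empty.

```python
def decode_flag(value, labels=("No", "Yes")):
    """Map a coded "0"/"1" cell to a readable label; keep empty cells empty."""
    if value == "":
        return ""
    mapping = {"0": labels[0], "1": labels[1]}
    if value not in mapping:
        raise ValueError(f"unexpected coded value: {value!r}")
    return mapping[value]

is_member = ["1", "0", "", "1"]
print([decode_flag(v, ("Is Not Member", "Is Member")) for v in is_member])
```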

Numeric measures

  • For numeric scores like NPS, we recommend including both columns: numeric scores (e.g. 0-10 scores) and the coded value/label as an additional field (e.g. detractor, passive, promoter).
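
The additional label column can be derived from the numeric score. The sketch below uses the standard NPS grouping (0-6 detractor, 7-8 passive, 9-10 promoter); the function name nps_label is hypothetical.

```python
def nps_label(score):
    """Return the standard NPS category for a 0-10 score."""
    if not 0 <= score <= 10:
        raise ValueError(f"NPS score out of range: {score}")
    if score <= 6:
        return "detractor"
    if score <= 8:
        return "passive"
    return "promoter"

scores = [3, 7, 10]
print([(s, nps_label(s)) for s in scores])
```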

No values

When there is no value, simply leave the cell empty in your CSV file. An empty cell is a cell with nothing typed in it: not a white space, not 0, not N/A, not None, not null. A common case is when people do not respond to a question.
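
If your data already contains placeholder text for missing values, it can be blanked out before upload. This is a minimal sketch: the placeholder list and the function name are assumptions and should be adapted to your data. Cells containing 0 are deliberately left alone, since 0 can be a genuine value.

```python
# Assumed list of placeholder strings; extend it to match your own data.
PLACEHOLDERS = {"na", "n/a", "none", "null", "nan"}

def blank_out_placeholders(rows):
    """Replace placeholder strings and whitespace-only cells with empty cells."""
    cleaned = []
    for row in rows:
        new_row = []
        for cell in row:
            stripped = cell.strip()
            if stripped == "" or stripped.lower() in PLACEHOLDERS:
                new_row.append("")  # a truly empty cell
            else:
                new_row.append(cell)
        cleaned.append(new_row)
    return cleaned

print(blank_out_placeholders([["1", "N/A", " ", "None", "0"]]))
```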

A unique identifier field (Highly recommended)

We recommend including an id field in your CSV. The name of the field can be any string value (e.g. id, original order, customer id, number, etc.) and values should be unique per row (e.g. sequential numbers starting from 1).

Note 1: This unique identifier is very useful when you Export the analysis results, as it is your way of mapping the exported data back to your original CSV.

Note 2: There is a unique identifier per entry (_id) in datasets on Relevance AI's platform. The _id field can preexist in a CSV (i.e. be included in the upload CSV by the dataset owner). Otherwise, the platform automatically adds the field with unique values.
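
If your CSV has no identifier yet, one way to add it is to prepend a column of sequential numbers starting from 1. The sketch below uses only Python's standard csv module; the function name add_id_column and the default column name id are illustrative choices.

```python
import csv
import io

def add_id_column(csv_text, id_name="id"):
    """Prepend a sequential id column (1, 2, 3, ...) unless one already exists."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    if id_name in header:
        return csv_text  # nothing to do
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_name] + header)
    for number, row in enumerate(body, start=1):
        writer.writerow([number] + row)
    return out.getvalue()

print(add_id_column("Name,Age\nJim,32\nJack,24\n"))
```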

Cleaning text data (optional)

When working with text data, it is recommended (but not required) to apply certain preprocessing steps, which can potentially improve the analysis results. Common text preprocessing steps are:

  • Stop words removal: removing frequent but unimportant words (e.g. the, there).
  • Lemmatization: replacing words with their common root (e.g. changes or changing become change).
  • Lowercasing: converting all characters to their lowercase form.
  • Breaking into shorter pieces of text: when automatically analyzing text, processing smaller pieces of text (e.g. a sentence vs. a paragraph) often produces more precise results.
  • Noise removal: this step is completely data specific. Popular cleaning methods are HTML, URL or hashtag removal.
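
Several of the steps above can be sketched with Python's standard library alone. The stop word list here is a tiny illustrative sample, the sentence splitter is a simple punctuation-based heuristic, and the function names are hypothetical. Lemmatization is left out because it normally requires a dedicated NLP library.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "there", "a", "an", "is", "are", "of", "to", "and"}

def split_sentences(text):
    """Break text into sentences at ., ! or ? followed by white space."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def clean_text(text):
    """Remove HTML tags, URLs and hashtags, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"#\w+", " ", text)           # hashtags
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

review = "The <b>Service</b> is great! See https://example.com #happy"
print([clean_text(sentence) for sentence in split_sentences(review)])
```

Splitting into sentences first and cleaning each piece afterwards keeps the sentence-ending punctuation available for the splitter.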

File formats for media data (images, audio)

Media data such as images or audio files must be accessible via a URL in order to work with them on Relevance AI's platform. If your media files are available online, simply include their corresponding URLs in your dataset.

Note 1: you can include other fields in your CSV file as shown in the second example below.

Note 2: try opening your URLs in an incognito tab to make sure they are available to our platform.

How To Get Started: Audio Use Case

  • Save your audio file(s) in one of the common audio formats - mp3 is recommended
  • Your audio file must be accessible via a http... link. Use your preferred hosting method, include the URL(s) in your CSV and upload your CSV file to Relevance AI, or simply run [Connect Media], which takes care of this step.
 _id |              Audio-URL
-----|--------------------------------------
  1  |  https://my-repo/my-audio-file1.mp3
  2  |  https://my-repo/my-audio-file2.mp3
  3  |  https://my-repo/my-audio-file3.mp3

 _id |              URL              | project
-----|-------------------------------|---------
  1  |  https://my-repo/my-file1.mp3 |   X1
  2  |  https://my-repo/my-file2.mp3 |   X2
  3  |  https://my-repo/my-file3.mp3 |   X1
  4  |  https://my-repo/my-file4.mp3 |   X3
  5  |  https://my-repo/my-file5.mp3 |   X2
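
A CSV in the layout shown above can be generated from a list of hosted URLs. This is a minimal sketch: the function name build_media_csv is hypothetical, and the check is purely syntactic (it confirms that each entry looks like an http or https URL, not that the file is actually reachable).

```python
import csv
import io
from urllib.parse import urlparse

def build_media_csv(urls, url_field="Audio-URL"):
    """Return CSV text with an _id column and one URL per row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["_id", url_field])
    for number, url in enumerate(urls, start=1):
        parts = urlparse(url)
        # Syntactic check only; it does not test whether the URL is reachable.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        writer.writerow([number, url])
    return out.getvalue()

print(build_media_csv([
    "https://my-repo/my-audio-file1.mp3",
    "https://my-repo/my-audio-file2.mp3",
]))
```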

If your media files are not yet available online, you first need to host them on the internet. This is possible through the Upload your media files workflow on Relevance AI.

Audio files

  • Save your audio file in a common format such as mp3
  • Make sure the moderator is the first person heard in the audio, so that speaker A is always the moderator/interviewer. This helps filter the data more reliably
  • Make sure people do not speak over each other when recording the audio

Image files

  • Save your image file in a common format such as jpg
     _id |           Image-URL            | project
    -----|--------------------------------|---------
      1  |  https://my-repo/my-image1.jpg |   X1
      2  |  https://my-repo/my-image2.jpg |   X2
      3  |  https://my-repo/my-image3.jpg |   X1
      4  |  https://my-repo/my-image4.jpg |   X3
      5  |  https://my-repo/my-image5.jpg |   X2
    

Useful links:

See the guide on How to update an existing dataset, which covers all of the items below:

  • adding new items (i.e. rows) to an existing dataset
  • adding new fields (columns) to an existing dataset
  • modifying existing values in a dataset