Data Formats

Prepare your input data

Upload one or more files containing the documents to be annotated to the data folder.

Raw data files can be in csv, tsv, json, or jsonl format.

Each document needs, at minimum, a unique identifier and the body of the document.
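As a rough illustration of this minimum requirement, the sketch below writes a small jsonl data file in which every record has a unique identifier and a body. The field names "id" and "text" follow the config example later on this page; the documents, the helper write_jsonl, and the output file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy documents. "id" and "text" are the field names used in
# the config example on this page; any other names would work as long as
# the config points at them.
documents = [
    {"id": "doc-1", "text": "The first document to annotate."},
    {"id": "doc-2", "text": "The second document to annotate."},
]


def write_jsonl(records, path):
    """Write one JSON object per line after checking that ids are unique."""
    ids = [record["id"] for record in records]
    if len(ids) != len(set(ids)):
        raise ValueError("document ids must be unique")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Written to a temporary directory here; in a real project the file would
# go in the data folder.
out_path = Path(tempfile.mkdtemp()) / "toy-example.jsonl"
write_jsonl(documents, out_path)
```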

You can find example data files here. We currently support the following document formats:

  • Text: body is the document plaintext (example)

  • Image, Video, or GIF: body is the filepath (example)

  • Dialogue or a list of text: body is a list of comma-separated documents, which Potato automatically displays horizontally (example)

  • Pairs of text displayed in separate boxes: body is a dictionary of documents (example)

  • Best-Worst Scaling: body is a comma-separated list of documents to order (example)

  • Custom Arguments: body is one of the above + extra fields for whatever custom arguments you want to enter (example; here kwargs and other_kwargs are the custom endpoints for a Likert scale)

  • Annotating Document A in the context of Document B: body is document A + an extra context field with the body of document B (example)
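The body shapes in the list above might look like the following when written as JSON records. This is only a sketch: the field names "id" and "text" come from the config example on this page, while the sample values, the dictionary keys "A" and "B", the field name "context", and the helper body_shape are assumptions made for illustration. Check the linked example files for the exact layout each format expects.

```python
import json

# Hypothetical records, one per body shape described in the list above.
records = [
    {"id": "t1", "text": "A plain-text document."},
    {"id": "m1", "text": "images/example.png"},  # hypothetical file path
    {"id": "d1", "text": ["Hi, how are you?", "Fine, thanks."]},
    {"id": "p1", "text": {"A": "First text.", "B": "Second text."}},
    # "context" is an assumed name for the extra context field.
    {"id": "c1", "text": "Document A.", "context": "Document B."},
]


def body_shape(record, text_key="text"):
    """Return a coarse label for the Python type of the document body."""
    body = record[text_key]
    if isinstance(body, str):
        return "string"
    if isinstance(body, list):
        return "list"
    if isinstance(body, dict):
        return "dictionary"
    raise TypeError(f"unsupported body type: {type(body).__name__}")


shapes = {record["id"]: body_shape(record) for record in records}

# One JSON object per line, as in a jsonl data file.
jsonl_text = "\n".join(json.dumps(record) for record in records)
```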

You can also use html tags to control how your text is displayed. In the match finding example project, html tags are used to create two separate boxes for the finding pairs.

Update input data formats in the YAML config file

Pass the input data paths and field names in the YAML config file as follows:

# Pass in a comma-separated list of data files containing documents to be annotated in this task
"data_files": [
   "data/toy-example1.json",
   "data/toy-example2.json"
],

# Specify the field names containing the document unique identifier (id) and document body (text)
"item_properties": {
    "id_key": "id",
    "text_key": "text"
},
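Before starting a task, it can help to confirm that every record actually carries the fields named in item_properties. The sketch below is one way to do that in plain Python; check_records is a hypothetical helper, not part of Potato, and it operates on records that have already been loaded into dictionaries.

```python
def check_records(records, item_properties):
    """Return a list of problems found in records (empty if none).

    Checks that each record has the id and text fields named in
    item_properties and that no id is repeated.
    """
    id_key = item_properties["id_key"]
    text_key = item_properties["text_key"]
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        if id_key not in record:
            problems.append(f"record {index}: missing '{id_key}'")
            continue
        if record[id_key] in seen_ids:
            problems.append(f"record {index}: duplicate id {record[id_key]!r}")
        seen_ids.add(record[id_key])
        if text_key not in record:
            problems.append(f"record {index}: missing '{text_key}'")
    return problems


# Same field names as in the config snippet above.
item_properties = {"id_key": "id", "text_key": "text"}

# Hypothetical records with two deliberate mistakes.
sample_records = [
    {"id": "a", "text": "first"},
    {"id": "a", "text": "second"},  # repeated id
    {"id": "b"},                    # no body
]

problems = check_records(sample_records, item_properties)
```

Here problems ends up with two entries, one for the repeated id and one for the missing body.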

Update output data preferences in the YAML config file

The output file will include each labeled document’s id and annotations; the header will consist of the question and answer labels specified in the schema. Specify a subdirectory of the annotation_output directory in which the files for each annotator should be placed. Supported output formats are csv, tsv, json, and jsonl.

# Potato will write the annotation file for all annotations to this
# directory, as well as per-annotator output files and state information
# necessary to restart annotation.
"output_annotation_dir": "annotation_output/folder_name/",

# The output format for the all-annotator data. Allowed formats are:
# * jsonl
# * json (same output as jsonl)
# * csv
# * tsv
#
"output_annotation_format": "json",