medacy.data.dataset module

A medaCy Dataset facilitates the management of data for both model training and model prediction.

A Dataset object provides a wrapper for a Unix file directory containing training/prediction data. If, at training time, a Dataset is fed into a pipeline that requires auxiliary files (MetaMap, for instance), the Dataset will automatically create those files in the most efficient way possible.

Training

When a directory contains raw text files alongside their annotation files, an instantiated Dataset detects and facilitates access to those files.

Assuming your directory looks like this (where .ann files are in BRAT format):

home/medacy/data
├── file_one.ann
├── file_one.txt
├── file_two.ann
└── file_two.txt

A common data workflow might look as follows.

Running:

>>> from medacy.data import Dataset
>>> from medacy.pipeline_components.feature_overlayers.metamap.metamap import MetaMap

>>> dataset = Dataset('/home/datasets/some_dataset')
>>> for data_file in dataset:
...    (data_file.file_name, data_file.raw_path, data_file.ann_path)
(file_one, file_one.txt, file_one.ann)
(file_two, file_two.txt, file_two.ann)
>>> dataset
['file_one', 'file_two']
>>> dataset.is_metamapped()
False
>>> metamap = MetaMap('/home/path/to/metamap/binary')
>>> with metamap:
...     metamap.metamap_dataset(dataset)
>>> dataset.is_metamapped()
True

MedaCy does not alter the data you load in any way - it only reads from it.

Prediction

When a directory contains only raw text files, an instantiated Dataset object interprets this as a directory of files that need to be predicted. This means that the internal Datafile that aggregates metadata for a given prediction file does not have its annotation_file_path field set.
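
A minimal sketch of this case, assuming a hypothetical directory containing only file_one.txt and file_two.txt:

>>> from medacy.data import Dataset
>>> prediction_dataset = Dataset('/home/datasets/files_to_predict')
>>> prediction_dataset
['file_one', 'file_two']
>>> for data_file in prediction_dataset:
...     data_file.file_name
'file_one'
'file_two'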

When a directory contains only .ann files, an instantiated Dataset object interprets this as a directory of files that are predictions. Useful methods for analysis include medacy.data.dataset.Dataset.compute_confusion_matrix(), medacy.data.dataset.Dataset.compute_ambiguity(), and medacy.data.dataset.Dataset.compute_counts(), as sketched below.
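
A rough sketch of such an analysis using the methods documented below; the gold and prediction directory paths are hypothetical:

>>> from medacy.data import Dataset
>>> gold_dataset = Dataset('/home/datasets/some_dataset')
>>> predicted_dataset = Dataset('/home/datasets/some_dataset_predictions')
>>> entities, matrix = gold_dataset.compute_confusion_matrix(predicted_dataset)
>>> ambiguity = gold_dataset.compute_ambiguity(predicted_dataset)
>>> counts = gold_dataset.compute_counts()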

External Datasets

In the real world, datasets (regardless of domain) are evolving entities, so it is essential to version them. A medaCy compatible dataset can be created to facilitate this versioning. A medaCy compatible dataset lives as a Python package that can be hooked into medaCy or used for any other purpose - it is simply a loose wrapper for this Dataset object. Instructions for creating such a dataset can be found here.

class medacy.data.dataset.Dataset(data_directory, data_limit=None)[source]

Bases: object

A facilitation class for data management.

_create_data_files()[source]
compute_ambiguity(dataset)[source]

Finds occurrences of spans from ‘dataset’ that intersect with a span from this annotation but do not have this span’s label. If ‘dataset’ comprises a model’s predictions, this method provides a strong indicator of a model’s inability to disambiguate between entities. For a full analysis, compute a confusion matrix.

Parameters:dataset – a Dataset object containing a predicted version of this dataset.
Returns:a dictionary containing the ambiguity computations on each gold, predicted file pair
compute_confusion_matrix(other, leniency=0)[source]

Generates a confusion matrix where this Dataset serves as the gold standard annotations and ‘other’ serves as the predicted annotations. A typical workflow would involve creating a Dataset object with the prediction directory output by a model and then passing it into this method.

Parameters:
  • other – a Dataset object containing a predicted version of this dataset.
  • leniency – a floating point value in [0,1] defining how much predicted character spans may differ from gold spans and still be counted as matching. A value of zero considers only exact character matches, while a positive value considers entities that differ by up to ceil(leniency * len(span)/2) characters on either side.
Returns:

a two-element tuple containing a label array (of entity names) and a matrix where rows are gold labels and columns are predicted labels. matrix[i][j] indicates how many times entities[i] in this dataset was predicted as entities[j] in ‘other’.
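
A brief sketch of reading this result; the Dataset objects continue the gold/prediction example above, and the entity names and counts shown are hypothetical:

>>> entities, matrix = gold_dataset.compute_confusion_matrix(predicted_dataset, leniency=0.1)
>>> entities
['Dose', 'Drug']
>>> matrix[1][0]  # times a gold 'Drug' span was predicted as 'Dose'
3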

compute_counts()[source]

Computes entity counts over all documents in this dataset.

Returns:a Counter of entity counts
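
A short usage sketch over a Dataset whose directory contains annotation files (the entity names and counts shown are hypothetical):

>>> dataset.compute_counts()
Counter({'Drug': 120, 'Dose': 45})
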
generate_annotations()[source]

Generates Annotation objects for all the files in this Dataset
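
A hedged one-line sketch, assuming the method returns an iterable of annotation objects, one per underlying file:

>>> annotations = list(dataset.generate_annotations())  # one annotation object per data file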

get_labels(as_list=False)[source]

Get all of the entities/labels used in the dataset.

Parameters:as_list – bool for whether to return the results as a list; defaults to False
Returns:A set of strings. Each string is a label used.
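
A brief sketch of both return forms (no output shown; the values depend on the dataset’s annotation files):

>>> labels = dataset.get_labels()                   # a set of label strings
>>> label_list = dataset.get_labels(as_list=True)   # the same labels as a list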

is_metamapped()[source]

Verifies if all files in the Dataset are metamapped.

Returns:True if all data files are metamapped, False otherwise.
medacy.data.dataset.main()[source]

CLI for retrieving dataset information