medacy.data.dataset module¶
A medaCy Dataset facilitates the management of data for both model training and model prediction.
A Dataset object provides a wrapper for a Unix file directory containing training/prediction data. If, at training time, a Dataset is fed into a pipeline that requires auxiliary files (MetaMap, for instance), the Dataset will automatically create those files in the most efficient way possible.
Training¶
When a directory contains both raw text files alongside annotation files, an instantiated Dataset detects and facilitates access to those files.
Assuming your directory looks like this (where .ann files are in BRAT format):
home/medacy/data
├── file_one.ann
├── file_one.txt
├── file_two.ann
└── file_two.txt
A common data workflow might look as follows.
Running:
>>> from medacy.data import Dataset
>>> from medacy.pipeline_components.feature_overlayers.metamap.metamap import MetaMap
>>> dataset = Dataset('/home/datasets/some_dataset')
>>> for data_file in dataset:
...     (data_file.file_name, data_file.raw_path, data_file.ann_path)
(file_one, file_one.txt, file_one.ann)
(file_two, file_two.txt, file_two.ann)
>>> dataset
['file_one', 'file_two']
>>> dataset.is_metamapped()
False
>>> metamap = MetaMap('/home/path/to/metamap/binary')
>>> with metamap:
...     metamap.metamap_dataset(dataset)
>>> dataset.is_metamapped()
True
MedaCy does not alter the data you load in any way - it only reads from it.
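The detection described above amounts to pairing same-stem files. As a minimal sketch (illustrative only, not medaCy's implementation; `pair_files` is a hypothetical helper), a Dataset-style wrapper could index a directory like this:

```python
from pathlib import Path

def pair_files(data_directory):
    """Pair each .txt file with a same-stem .ann file, if one exists.

    Sketch of how a Dataset-style wrapper could index a directory;
    a file with no matching .ann is treated as prediction-only data.
    """
    directory = Path(data_directory)
    pairs = {}
    for txt in sorted(directory.glob("*.txt")):
        ann = txt.with_suffix(".ann")
        # Keep the annotation path only when the file actually exists
        pairs[txt.stem] = (txt, ann if ann.exists() else None)
    return pairs
```

Run against the example directory above, this would map `file_one` and `file_two` each to a (.txt, .ann) pair.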
Prediction¶
When a directory contains only raw text files, an instantiated Dataset object interprets this as a directory of files that need to be predicted. This means that the internal Datafile that aggregates metadata for a given prediction file does not have its annotation_file_path field set.
When a directory contains only .ann files, an instantiated Dataset object interprets this as a directory of files that are predictions. Useful methods for analysis include medacy.data.dataset.Dataset.compute_confusion_matrix(), medacy.data.dataset.Dataset.compute_ambiguity(), and medacy.data.dataset.Dataset.compute_counts().
External Datasets¶
In the real world, datasets (regardless of domain) are evolving entities, so it is essential to version them. A medaCy-compatible dataset can be created to facilitate this versioning. A medaCy-compatible dataset lives as a Python package that can be hooked into medaCy or used for any other purpose - it is simply a loose wrapper around this Dataset object. Instructions for creating such a dataset can be found here.
class medacy.data.dataset.Dataset(data_directory, data_limit=None)[source]¶
Bases: object

A facilitation class for data management.
compute_ambiguity(dataset)[source]¶
Finds occurrences of spans from 'dataset' that intersect with a span from this annotation but do not have this span's label. If 'dataset' comprises a model's predictions, this method provides a strong indicator of a model's inability to disambiguate between entities. For a full analysis, compute a confusion matrix.

Parameters: dataset – a Dataset object containing a predicted version of this dataset.
Returns: a dictionary containing the ambiguity computations on each gold, predicted file pair
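The core of the ambiguity computation is a check for overlapping character spans with differing labels. A minimal self-contained sketch (the `ambiguous_pairs` helper and tuple representation are assumptions for illustration, not medaCy's implementation):

```python
def ambiguous_pairs(gold_spans, predicted_spans):
    """Return (gold, predicted) span pairs that overlap in character
    offsets but carry different labels.

    Each span is a (start, end, label) tuple. Illustrative sketch of
    the ambiguity computation, not medaCy's implementation.
    """
    pairs = []
    for g_start, g_end, g_label in gold_spans:
        for p_start, p_end, p_label in predicted_spans:
            # Two half-open intervals intersect iff each starts
            # before the other ends
            overlaps = g_start < p_end and p_start < g_end
            if overlaps and g_label != p_label:
                pairs.append(((g_start, g_end, g_label),
                              (p_start, p_end, p_label)))
    return pairs
```

A model that predicts the right span but the wrong label would surface here, whereas a confusion matrix additionally quantifies how often each label is mistaken for each other label.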
compute_confusion_matrix(other, leniency=0)[source]¶
Generates a confusion matrix where this Dataset serves as the gold-standard annotations and 'other' serves as the predicted annotations. A typical workflow would involve creating a Dataset object with the prediction directory outputted by a model and then passing it into this method.

Parameters:
- other – a Dataset object containing a predicted version of this dataset.
- leniency – a floating-point value in [0, 1] defining the leniency of the character spans to count as different. A value of zero considers only exact character matches, while a positive value considers entities that differ by up to ceil(leniency * len(span)/2) characters on either side.

Returns: a two-element tuple containing a label array (of entity names) and a matrix where rows are gold labels and columns are predicted labels. matrix[i][j] indicates the number of times entities[i] in this dataset was predicted as entities[j] in 'other'.
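The leniency formula above can be made concrete with a small sketch. The `lenient_match` helper is hypothetical (it is not part of medaCy's API); it only demonstrates the documented boundary tolerance:

```python
import math

def lenient_match(gold_span, pred_span, leniency=0.0):
    """Check whether two (start, end) character spans match under the
    leniency rule: each boundary may differ by up to
    ceil(leniency * span_length / 2) characters.

    Sketch of the documented formula, not medaCy's implementation.
    """
    g_start, g_end = gold_span
    p_start, p_end = pred_span
    # Tolerance window derived from the gold span's length
    window = math.ceil(leniency * (g_end - g_start) / 2)
    return abs(g_start - p_start) <= window and abs(g_end - p_end) <= window
```

With leniency=0 only exact spans match; with leniency=0.2 a 10-character gold span tolerates a 1-character shift on either boundary.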
compute_counts()[source]¶
Computes entity counts over all documents in this dataset.

Returns: a Counter of entity counts
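Since the dataset's annotations are BRAT-format .ann files, such a count can be sketched over raw .ann contents. This `count_entities` helper is illustrative only (not medaCy's implementation); it relies on the standard BRAT convention that entity lines start with "T" and are tab-delimited:

```python
from collections import Counter

def count_entities(ann_texts):
    """Count entity labels across a collection of BRAT .ann file contents.

    Each entity ('T') line in BRAT standoff format looks like:
        T1<TAB>Drug 0 9<TAB>Metformin
    Illustrative sketch, not medaCy's implementation.
    """
    counts = Counter()
    for text in ann_texts:
        for line in text.splitlines():
            if line.startswith("T"):
                # Second tab-delimited field is "Label start end"
                label = line.split("\t")[1].split()[0]
                counts[label] += 1
    return counts
```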