Converting Aperio ImageScope XML Annotations to Slideflow's CSV Format
Working with digital pathology often involves handling annotations from various software platforms. Aperio ImageScope is a popular tool for viewing and annotating digital pathology images, and Slideflow is a Python framework for deep learning in digital pathology. ImageScope saves annotations in XML format, whereas Slideflow reads them from CSV files.
In this post I will share a Python function that converts Aperio ImageScope XML annotations into a format that Slideflow can easily work with.
Here is an example of a typical ImageScope XML file structure that contains annotations for regions of interest (ROIs) on a digital slide.
    <Annotations>
      <Annotation Id="0" Name="Annotation 1">
        <Attributes>
          <Attribute Name="RegionType" Value="Tissue"/>
        </Attributes>
        <Regions>
          <Region Id="0" Type="1" NegativeROA="0">
            <Vertices>
              <Vertex X="100.0" Y="150.0"/>
              <Vertex X="200.0" Y="250.0"/>
              <Vertex X="300.0" Y="350.0"/>
              <Vertex X="100.0" Y="150.0"/>
            </Vertices>
          </Region>
          <Region Id="1" Type="1" NegativeROA="0">
            <Vertices>
              <Vertex X="400.0" Y="450.0"/>
              <Vertex X="500.0" Y="550.0"/>
              <Vertex X="600.0" Y="650.0"/>
            </Vertices>
          </Region>
        </Regions>
      </Annotation>
      <Annotation Id="1" Name="Annotation 2">
        ...
      </Annotation>
    </Annotations>
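This is how the same data would be structured in a Slideflow-compatible CSV format, with one row per vertex (the rows for Annotation 2 correspond to the part elided with "..." in the XML above):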
    roi_name,x_base,y_base,label
    Annotation_1_0,100.0,150.0,Tissue
    Annotation_1_0,200.0,250.0,Tissue
    Annotation_1_0,300.0,350.0,Tissue
    Annotation_1_0,100.0,150.0,Tissue
    Annotation_1_1,400.0,450.0,Tissue
    Annotation_1_1,500.0,550.0,Tissue
    Annotation_1_1,600.0,650.0,Tissue
    Annotation_2_0,700.0,750.0,Tumor
    Annotation_2_0,800.0,850.0,Tumor
    Annotation_2_0,900.0,950.0,Tumor
The following Python function reads the XML file at the given path and returns a pandas DataFrame, which you can save with DataFrame.to_csv(). The aperio2sf function parses the XML file and extracts each vertex's X and Y coordinates, which are stored in the x_base and y_base columns. The label column holds the annotation's label, and roi_name holds a name for each region, suffixed with the region's index within its annotation.
    import os
    import shutil

    import pandas as pd
    import xml.dom.minidom as minidom


    def aperio2sf(annt_path):
        # read the aperio xml file
        doc = minidom.parse(annt_path)
        annotations = doc.getElementsByTagName("Annotation")

        # initialize relevant columns for slideflow format
        # (same column order as the sample CSV above)
        data = {'roi_name': [], 'x_base': [], 'y_base': [], 'label': []}

        # extract the corresponding information parsing the xml
        for annotation in annotations:
            # annotation name, e.g. "Annotation 1" -> "Annotation_1"
            name = annotation.getAttribute("Name").replace(" ", "_")
            # label is the Value of the annotation's first Attribute, e.g. "Tissue"
            # (raises IndexError if the annotation has no Attribute element)
            label = annotation.getElementsByTagName("Attribute")[0].getAttribute("Value")

            regions = annotation.getElementsByTagName("Region")
            for i, region in enumerate(regions):
                # one output row per vertex; i is the region index within the annotation
                for vertex in region.getElementsByTagName("Vertex"):
                    data['roi_name'].append(name + "_" + str(i))
                    data['x_base'].append(float(vertex.getAttribute("X")))
                    data['y_base'].append(float(vertex.getAttribute("Y")))
                    data['label'].append(label)

        return pd.DataFrame(data)
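If you would like to check the XML-to-row mapping without installing pandas, the same idea can be sketched with the standard library alone. This is not the function above but a hypothetical variant (the names xml_to_rows and rows_to_csv are mine), and it assumes, as in the sample file, that the label sits in the Value of the annotation's first Attribute and that the ROI name comes from the annotation's Name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened copy of the sample XML from this post (Annotation 1 only).
SAMPLE_XML = """<Annotations>
  <Annotation Id="0" Name="Annotation 1">
    <Attributes><Attribute Name="RegionType" Value="Tissue"/></Attributes>
    <Regions>
      <Region Id="0" Type="1" NegativeROA="0">
        <Vertices>
          <Vertex X="100.0" Y="150.0"/>
          <Vertex X="200.0" Y="250.0"/>
          <Vertex X="300.0" Y="350.0"/>
          <Vertex X="100.0" Y="150.0"/>
        </Vertices>
      </Region>
      <Region Id="1" Type="1" NegativeROA="0">
        <Vertices>
          <Vertex X="400.0" Y="450.0"/>
          <Vertex X="500.0" Y="550.0"/>
          <Vertex X="600.0" Y="650.0"/>
        </Vertices>
      </Region>
    </Regions>
  </Annotation>
</Annotations>"""


def xml_to_rows(xml_text):
    """Return one (roi_name, x_base, y_base, label) tuple per vertex."""
    root = ET.fromstring(xml_text)
    rows = []
    for annotation in root.iter("Annotation"):
        # Assumption: ROI names are built from the annotation's Name attribute.
        name = annotation.get("Name", "").replace(" ", "_")
        # Assumption: the label is the Value of the first Attribute element.
        attribute = annotation.find("Attributes/Attribute")
        label = attribute.get("Value", "") if attribute is not None else ""
        for i, region in enumerate(annotation.iter("Region")):
            for vertex in region.iter("Vertex"):
                rows.append((f"{name}_{i}",
                             float(vertex.get("X")),
                             float(vertex.get("Y")),
                             label))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with the header used in this post."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["roi_name", "x_base", "y_base", "label"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

Running it on the sample prints the Annotation_1_0 and Annotation_1_1 rows from the CSV listing above, which makes it a quick way to see how each vertex becomes one row.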
A Slideflow project's default config expects the annotations in the slides/rois folder. The following function converts all Aperio XML files in the annotations folder to CSV format and saves them in the slides/rois folder, ready for Slideflow to process. Note that it deletes any existing slides/rois folder before writing the new files.
    def convertall(data_dir, sf_dest="rois"):
        '''
        convert all aperio xmls into slideflow formats and save into slides/rois folder
        '''
        src_dir = os.path.join(data_dir, "annotations")
        dest_dir = os.path.join(data_dir, "slides", sf_dest)

        # start from an empty destination folder (this deletes existing ROI files)
        if os.path.exists(dest_dir):
            shutil.rmtree(dest_dir)
        os.makedirs(dest_dir, exist_ok=True)

        for fname in os.listdir(src_dir):
            stem, ext = os.path.splitext(fname)
            # skip anything that is not an xml file
            if ext.lower() != ".xml":
                continue
            print(fname)
            df = aperio2sf(os.path.join(src_dir, fname))
            # build the output path directly; str.replace on the full path would
            # also rewrite "annotations" or "xml" appearing elsewhere in the path
            out_path = os.path.join(dest_dir, stem + ".csv")
            df.to_csv(out_path, sep=',', index=False)
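After a batch conversion it can be worth sanity-checking a few of the generated files. The sketch below is a hypothetical helper of my own (summarize_rois and too_small are not part of Slideflow) that uses only the standard library: it checks that the four columns are present, that the coordinates are numeric, and counts the vertices per ROI so that regions with fewer than three points, which cannot form a polygon, stand out. The sample text is made up for the demonstration.

```python
import csv
import io

REQUIRED_COLUMNS = ("roi_name", "x_base", "y_base", "label")


def summarize_rois(csv_text):
    """Return {roi_name: {"label": str, "vertices": int}} for one ROI CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    summary = {}
    for row in reader:
        # float() raises ValueError if a coordinate is not numeric
        float(row["x_base"])
        float(row["y_base"])
        entry = summary.setdefault(row["roi_name"],
                                   {"label": row["label"], "vertices": 0})
        entry["vertices"] += 1
    return summary


def too_small(summary, min_vertices=3):
    """Names of ROIs with too few vertices to form a polygon."""
    return [name for name, info in summary.items()
            if info["vertices"] < min_vertices]


# Made-up sample: the second ROI is deliberately truncated to two vertices.
SAMPLE_CSV = """roi_name,x_base,y_base,label
Annotation_1_0,100.0,150.0,Tissue
Annotation_1_0,200.0,250.0,Tissue
Annotation_1_0,300.0,350.0,Tissue
Annotation_2_0,700.0,750.0,Tumor
Annotation_2_0,800.0,850.0,Tumor
"""

summary = summarize_rois(SAMPLE_CSV)
print(summary)
print(too_small(summary))
```

To check a real file, read it first and pass its text in, for example summarize_rois(open(path).read()).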
Since the CSV content follows a generic tabular format, it can also be used to generate annotation files for other software platforms, such as QuPath.
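As one possible route to QuPath, the rows can be regrouped into GeoJSON polygons, a format QuPath can import. The following is a minimal sketch under assumptions rather than a tested exporter: csv_to_geojson is a hypothetical name, and the property keys (objectType, name, classification) reflect my understanding of what QuPath writes in its own exports, so compare them against a file exported from your QuPath version before relying on them.

```python
import csv
import io
import json


def csv_to_geojson(csv_text):
    """Group CSV rows by roi_name and emit one GeoJSON Polygon per ROI."""
    rings = {}   # roi_name -> list of [x, y] points, in file order
    labels = {}  # roi_name -> label
    for row in csv.DictReader(io.StringIO(csv_text)):
        point = [float(row["x_base"]), float(row["y_base"])]
        rings.setdefault(row["roi_name"], []).append(point)
        labels[row["roi_name"]] = row["label"]

    features = []
    for roi_name, ring in rings.items():
        # GeoJSON polygon rings must end on their starting point
        if ring[0] != ring[-1]:
            ring = ring + [ring[0]]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            # Assumption: these keys mirror QuPath's exported GeoJSON
            "properties": {
                "objectType": "annotation",
                "name": roi_name,
                "classification": {"name": labels[roi_name]},
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Two ROIs taken from the sample CSV in this post.
SAMPLE_ROIS = """roi_name,x_base,y_base,label
Annotation_1_0,100.0,150.0,Tissue
Annotation_1_0,200.0,250.0,Tissue
Annotation_1_0,300.0,350.0,Tissue
Annotation_1_0,100.0,150.0,Tissue
Annotation_1_1,400.0,450.0,Tissue
Annotation_1_1,500.0,550.0,Tissue
Annotation_1_1,600.0,650.0,Tissue
"""

collection = csv_to_geojson(SAMPLE_ROIS)
print(json.dumps(collection, indent=2))
```

The first ROI is already closed in the CSV and is left as is, while the second gets its first point appended so that the ring is closed.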