Results of a Novel and Comprehensive Training Program for Standardization of 2-Dimensional Ultrasound Interpretation as a Precursor to a Multicenter Trial

Research Article

Corresponding author: Dr. Elizabeth R. Mueller, Departments of Urology and Obstetrics and Gynecology, Loyola University Chicago Stritch School of Medicine, 2160 S. First Avenue, Building 103, Room 1004, Maywood, IL 60153, Tel: 708 216-2170; Fax: 708 216-2171; Email:

We report the process employed to set minimum standards for experts from 8 clinical sites in obtaining and interpreting standardized endoanal ultrasound (EUS) images for the “Behavioral Therapy versus Usual Care in Primiparous Women with Anal Sphincter Tears and Fecal Incontinence” (BOOST) Trial.

Methods: The BOOST EUS protocol developers and a consulting expert radiologist conducted a one-day centralized training event. Didactic sessions included a discussion of normal anal canal anatomy, common EUS variations in women, levels of the anal canal that needed to be assessed, number of images to acquire, and labeling of the images. Representative images from the mid anal canal (MAC) were obtained from each site prior to the training event and were used to provide critique during the training session and to determine outliers. A post-training assessment of image interpretations by participating investigators was compared with ‘standard’ interpretations by the expert radiologist. A reader was considered “qualified” if s/he was at least 75% concordant with the expert radiologist’s assessment for categorizing a defect in the internal anal sphincter (IAS) and at least 50% concordant with the expert for interpreting defects in the external anal sphincter (EAS).


Results: Eight study centers and 15 investigators participated in the training event. The investigators were from the following specialties: gynecology (86.6%), gastroenterology (6.7%) and urology (6.7%). Based on the post-training assessment, agreement among the readers (not the expert) for presence of sphincter defect was ICC = 0.54 (95% CI: 0.41, 0.69) for IAS and 0.50 (95% CI: 0.37, 0.66) for EAS. Moderate agreement was obtained with kappa 0.58 (95% CI: 0.55, 0.61) for IAS and 0.55 (95% CI: 0.52, 0.58) for EAS.

Conclusion: This study demonstrates that a single-day, centralized training course focused on interpreting endoanal ultrasound images with an expert radiologist can result in moderate agreement among pelvic floor physicians and the expert radiologist.

Keywords: 2-D endoanal ultrasound; anal sphincter tear; anal sphincter defect; fecal incontinence


Assuring the accuracy of outcome data from diagnostic tests is a universal challenge in multicenter clinical trials. The challenge can increase in complexity when diagnostic tests are performed by different clinical specialties. That said, participation of qualified multi-disciplinary investigators might improve the validity and generalizability of the study conclusions. The design of a process to standardize test performance and accuracy of interpretation is just as crucial to the success of a research study as is recruitment of subjects, since the validity of the test findings depends upon training investigators to consistently perform and interpret the test in a standardized manner.

Two-dimensional endoanal ultrasound (EUS) was introduced in the 1990s as an adjunct to manometry to assess the structure of the anal sphincter. Study interpretation can be challenging, especially in women, because the anterior part of the EAS is shorter than in men and because the longitudinal muscle is sonographically indistinguishable from the EAS in 60% of women [1,2]. Previous studies describing the inter- and intra-observer reliability in the interpretation of pelvic floor images obtained by ultrasound and magnetic resonance imaging (MRI) techniques have required that the “readers” undergo “training sessions” interpreting images from non-study subjects. These studies have demonstrated that discrepancies in interpretation improve with training and experience. However, the number of readers in each of the studies was small, ranging from 2 to 6.

We report the process employed to design a curriculum and to set minimum standards for experts from 3 disciplines at 8 clinical sites in obtaining and interpreting standardized EUS for the NIH-sponsored “Behavioral Therapy versus Usual Care in Primiparous Women with Anal Sphincter Tears and Fecal Incontinence” (BOOST) Trial. The BOOST Trial was originally designed as a multicenter, randomized trial of behavioral therapy for fecal incontinence (FI) in primiparous women sustaining an obstetrical anal sphincter injury. At 6 weeks postpartum, participants who had FI were randomized to behavioral therapy or usual care. After initiation of the trial, the rates of FI were lower than predicted and it was not feasible to complete the trial in a reasonable timeframe [3].


Centralized Training

Prior to initiation of the study, a one-day centralized training of all examiners took place at the University of North Carolina at Chapel Hill (UNC) on April 4, 2010. The one-day session was conducted by the investigators who developed the ultrasound protocol (EM, MC) and by the consulting radiologist (JF).

Endoanal Ultrasound Equipment and Settings

Ultrasound scanners (BK Medical, Denmark) were used for image acquisition. The 2-D probes used included the 1850 360° axial endoscopic probe or the 2050 3-D probe used in 2-D mode. By consensus, the frequency of the transducer was set at 10 MHz and the focal range of the transducer (depth) was set at 2.8 cm. Individual examiners were expected to adjust image brightness and contrast during each examination to optimize image quality.

Preparation Prior to Training

A representative image from the mid anal canal (MAC) was obtained from each site as a Joint Photographic Experts Group (JPEG) file that identified: 1) site number, 2) megahertz (MHz) settings, 3) age and parity status of the subject, if known, and 4) symptomatology. All images were stripped of other identifying data, including site name, subject name and medical record number. These images were used to compare the existing best-practice images from each site and to determine outliers. The protocol committee reviewed each of the images in detail. Following review of the MAC images, other images from each site were requested for inclusion in the testing/training program at UNC. Images requested included: 1) one normal examination at high, mid, and low anal canal levels; 2) at least 1 and up to 4 abnormal examinations that included at least 1 abnormal finding at the MAC level; and 3) one examination thought to be un-interpretable (over- or underexposed) or with sphincter defect undetermined (scar, fragmentation).

Training Curriculum

Participants

Investigators who attended the training course consisted of urogynecologists, a urologist and a gastroenterologist. All investigators had experience in performing endoanal ultrasound evaluations and interpreting sonographic findings of the anal sphincters. For the study protocol, it was expected that examiners would be masked to the exact extent of anal sphincter disruption recorded at the time of delivery and to the current fecal continence status of the subjects. The images used for the training event were the investigators’ recent images from non-study patients who were female and required an EUS for clinical care.


The training session included a PowerPoint presentation that reviewed the major aims of the BOOST study. In addition, investigators were taught the specific physical examination assessment required prior to the endoanal ultrasound exam. The sonography data collection form was also reviewed. Definitions and examples of “normal” sonographic anal sphincter findings at each anal canal level were reviewed along with internal and external anal sphincter defects (by anal canal level and by clock-face position). Criteria that made a study “un-interpretable” and image characteristics associated with an “undetermined” sphincter defect were defined. Didactic sessions also included a discussion of normal anal canal anatomy, common variations in women, levels of the anal canal that needed to be assessed, number of images to acquire, and labeling of the images. Settings required for acquisition of images (10 MHz transducer, 2.8 cm focal range) and the method of image storage and delivery to the Data Coordinating Center were reviewed. A discussion regarding the information presented completed the didactic portion of the training. A copy of the training agenda is included in the appendix.

Testing

A written test was administered twice after the didactic session. The first, “practice set #1”, tested comprehension; the expert radiologist provided feedback, and this test was considered part of the training. The second, “test set #1”, was for qualification. Raters were asked to identify six features for each of 21 different images: image exposure (normal; overexposed; underexposed), study interpretable (yes; no), EAS defect (yes; no), IAS defect (yes; no), defect location by anal canal level (HAC; MAC; LAC) and defect location by clock-face position. The 21 images were selected by the consultant/participating radiologist (JF) and were used to assess the individual rater’s qualifications, as well as to certify the training session (see statistical considerations below). The “standard” interpretations were made by the radiologist and used as the correct answers in grading results from the participants.

The same images were used for both the practice/learning test (practice set #1) and the qualification test (test set #1), but were presented in a different order. Practice set #1 was administered after the content was presented and was used as a practice self-assessment test; the radiologist (JF) reviewed the correct answers in a group setting. This set was not collected or analyzed. Test set #1 was not used for learning but was used to assess and qualify the participants. For both the practice and test sets, each question and image was displayed on the screen using PowerPoint. The images were presented in random order with respect to normal and abnormal examples, as well as anal canal location. Each participant entered the results on a study assessment form. Test set #1 study forms were collected by the study statistician (CS), who analyzed each participant’s results relative to the “gold standard” provided by the radiologist (JF).

In order to ensure uniformity of the endoanal examination reporting, each physician participating in this portion of the study agreed to report static image findings in a standardized way on the study assessment forms. A second test set (test set #2) was created by the radiologist (JF) based on a different set of images; this test set was used for participants who did not pass the first test or who were unable to attend the in-person training session.

Sonographic Definitions of Anal Canal Levels, Anal Sphincters, and Anal Sphincter Defects

The sonographic anal canal, anal sphincters, and sphincter defects were defined as previously described by Corton et al. [4] and summarized below.

Anal Canal Levels

The high anal canal (HAC) was defined as the region from the lowest level of the puborectalis muscle “slings” to the level where the external anal sphincter (EAS) formed a complete ring anteriorly (see Figure 1).

The puborectalis slings were defined as the right and left portions of the puborectalis muscle that extended anteriorly toward the inner surface of the pubic bones. This level was called HAC-1 (Figure 1). Still in the HAC, but 1-5 mm distal to the lowest level of the puborectalis, the hyperechoic external anal sphincter may or may not have formed a complete ring anteriorly around the anal canal. Thus, absence of continuity of the EAS anteriorly at this level was not considered a defect. This level was called HAC-2 (Figure 1). The mid anal canal (MAC) was defined as the region where the EAS muscle formed a complete ring anteriorly around the hypoechoic internal anal sphincter (IAS) (Figure 2). It extended inferiorly to the most distal end of the IAS. The low anal canal (LAC) was defined as the region below the end of the IAS muscle where only the EAS was identified (Figure 3).

Figure 1. Cross-sectional images at the high anal canal (HAC) level. Image on the left (A) illustrates HAC-1, the lowest level at which the puborectalis muscle slings (asterisks) were identified. Image on the right (B) is representative of the HAC-2 level, the region just below the puborectalis muscle slings. Note that the external anal sphincter (c) does not form a complete ring around the anal canal in this patient. Arrowheads = anterior ends of EAS; a = anal submucosa. In the HAC, the internal anal sphincter (b) was evaluated at the HAC-2 level.

Figure 2. Cross-sectional image at the mid anal canal level in a patient with normal sphincters. a = anal submucosa; b = internal anal sphincter; c = external anal sphincter.

Anal Sphincter

The IAS was defined as the concentric hypoechoic band surrounding the anal submucosa (Figures 1 and 2). The EAS was defined as the concentric band of mixed or hyperechogenicity lateral to the IAS (Figures 1, 2 and 3). This band incorporated the longitudinal muscle, which is often difficult to distinguish from the surrounding EAS, as both muscles exhibit similar echogenicity.

Figure 3. Cross-sectional image at the low anal canal level illustrating complete continuity of the external anal sphincter (c). Note that the internal anal sphincter is no longer visualized at this level.

Anal Sphincter Defects

An IAS defect was defined as a complete loss of continuity (at least 5 degrees) of the concentric hypoechoic band that represents the internal anal sphincter muscle (Figure 4). An EAS defect was defined as a complete loss of continuity (at least 5 degrees) of the concentric mixed echogenic or hyperechogenic band that represents the external anal sphincter (Figure 4).

Figure 4. Representative image at the mid anal canal level in a patient with internal (b) and external anal sphincter (c) defects. Arrows indicate lateral borders of internal anal sphincter defect; arrowheads indicate lateral borders of external anal sphincter defect.
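As an illustration of the 5-degree threshold, the following minimal sketch converts clock-face positions to angles and checks whether a gap in a sphincter ring is wide enough to count as a defect. It assumes that 12 o'clock corresponds to 0 degrees with angles increasing clockwise, so that one clock hour spans 30 degrees; this convention and all function names are hypothetical and are not taken from the BOOST protocol.

```python
# Minimum angular extent of a complete loss of continuity (from the text).
MIN_DEFECT_DEGREES = 5.0


def clock_to_degrees(hour):
    """Convert a clock-face position to degrees (assumed: 12 o'clock = 0)."""
    return (hour % 12) * 30.0


def arc_extent(start_deg, end_deg):
    """Clockwise extent of the gap from start to end, in degrees.

    The modulo handles gaps that cross 12 o'clock (e.g. 11 to 1 o'clock).
    A gap covering the full circle would come out as 0 and is not handled.
    """
    return (end_deg - start_deg) % 360.0


def is_defect(start_deg, end_deg):
    """True if the gap meets the 5-degree minimum used in the definitions."""
    return arc_extent(start_deg, end_deg) >= MIN_DEFECT_DEGREES
```

For example, a gap running clockwise from 11 o'clock to 1 o'clock spans 60 degrees under this convention and would be classified as a defect, whereas a 4-degree gap would not.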


The number of ultrasound images to be reviewed was set to assess the individual rater’s qualifications, as well as to certify the participant during the training session. With 21 images and 15 readers per ultrasound image, there was at least 80% power to detect that an intraclass correlation (ICC) of 0.70, reflecting the alternative hypothesis, was significantly different from 0.50, the ICC of the null hypothesis, with a one-sided Type I error of 0.05 [5]. The BOOST Ultrasound Committee defined the qualification criteria as part of the training curriculum; a reader was considered qualified if s/he was at least 75% concordant with the expert radiologist’s assessment (the gold standard) for categorizing a defect in the internal anal sphincter (IAS) and at least 50% concordant with the expert for interpreting defects in the external anal sphincter (EAS).
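The qualification rule can be sketched as follows. This is a minimal illustration, assuming that "concordance" means the proportion of images on which the reader's answer matches the expert's; the text does not spell out the exact calculation, and the function and constant names are hypothetical.

```python
# Thresholds stated in the qualification criteria.
IAS_THRESHOLD = 0.75
EAS_THRESHOLD = 0.50


def concordance(reader, expert):
    """Proportion of images where the reader's answer equals the expert's.

    Assumption: simple percent agreement over all images in the test set.
    """
    if not expert or len(reader) != len(expert):
        raise ValueError("reader and expert must rate the same non-empty set")
    matches = sum(r == e for r, e in zip(reader, expert))
    return matches / len(expert)


def is_qualified(reader_ias, expert_ias, reader_eas, expert_eas):
    """Apply both thresholds; a reader must meet each one to qualify."""
    return (
        concordance(reader_ias, expert_ias) >= IAS_THRESHOLD
        and concordance(reader_eas, expert_eas) >= EAS_THRESHOLD
    )
```

With 21 images, this rule would require at least 16 matching IAS answers (16/21 is about 76%) and at least 11 matching EAS answers (11/21 is about 52%).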

Data on study interpretability (yes/no), anal canal level (distal, mid, proximal), and presence or absence of an EAS and/or IAS defect were summarized. ICCs among all readers, excluding the expert radiologist, and corresponding 95% confidence intervals (95% CI) were calculated separately to assess inter-reader concordance for IAS and EAS [6]. However, the value of the ICC depends on the coding of the responses (calculations used 1=no, 2=yes and 0=not applicable). Given that there is no natural ordering for yes, no and not applicable, kappa estimates of agreement among multiple raters for nominal data [x] were also calculated as confirmatory measures because they treat the EUS responses as discrete values (presence or absence of sphincter defect, not applicable if poor image quality). We used the method described by Fleiss (1976) to calculate kappa estimates of agreement among multiple raters for nominal or ordinal scale data [7]. This method calculates the correct standard error and addresses the problem of having more than two categories to rate. The overall kappa is a weighted average of the category-specific responses. Values of <0.40 were considered poor to slight agreement, 0.41 – 0.60 fair to moderate, 0.61 – 0.80 good, and 0.81 – 1.00 very good agreement [8].
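To illustrate why the ICC depends on how nominal responses are coded, the sketch below computes a one-way random-effects ICC from ANOVA mean squares for the same set of answers under two codings. The one-way form is an assumption chosen for simplicity and may not be the ICC variant used in the analysis; the toy responses and the alternative coding are hypothetical.

```python
def icc_oneway(table):
    """One-way random-effects ICC for a table of numeric codes.

    table: one row per image, one column per reader.
    Not handled: tables with no variation at all (zero denominator).
    """
    n = len(table)        # number of images
    k = len(table[0])     # number of readers per image
    grand_mean = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    # Between-image and within-image mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(table, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def encode(responses, coding):
    """Replace each response label with its numeric code."""
    return [[coding[label] for label in row] for row in responses]


# Hypothetical responses: 3 images, 2 readers each.
responses = [["yes", "yes"], ["no", "na"], ["no", "no"]]

paper_coding = {"no": 1, "yes": 2, "na": 0}   # coding stated in the text
other_coding = {"no": 0, "yes": 1, "na": 2}   # arbitrary alternative coding

icc_paper = icc_oneway(encode(responses, paper_coding))
icc_other = icc_oneway(encode(responses, other_coding))
```

The readers' answers are identical in both runs, yet the two ICC values differ because the numeric distance between "no" and "not applicable" changes with the coding. A statistic that treats the categories as unordered labels does not have this sensitivity.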


Fourteen of the 15 identified readers attended the one-day training session. Site experts represented the following specialties: gynecology (87%), urology (7%) and gastroenterology (7%). Calculation of agreement or concordance among the readers was based on all 21 images. Thirteen of the 14 readers met the pre-defined passing level for IAS and EAS at the training session (test set #1), while one reader passed by completing a second test (test set #2). Protocol leaders used test set #1 to train the one reader who did not attend the training session; this reader then successfully completed only test set #2 for certification. Because this reader did not complete test set #1 for qualification, this participant’s results are not incorporated in the analysis of concordance.

Concordance among the readers (for presence of sphincter defect) was ICC = 0.54 (95% CI: 0.41, 0.69) for IAS and 0.50 (95% CI: 0.37, 0.66) for EAS. Agreement using kappa was 0.58 (95% CI: 0.55, 0.61) for IAS and 0.55 (95% CI: 0.52, 0.58) for EAS.
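As a sketch of how a multi-rater kappa of this kind is obtained, the code below computes the point estimate of Fleiss' kappa for nominal ratings and maps it to the agreement bands used in this paper. It does not reproduce the standard error or confidence interval calculation, and assigning a value of exactly 0.40 to the lowest band is an assumption, since the text gives "<0.40" and "0.41 – 0.60". Function names are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa point estimate.

    ratings: one list per image, holding one category label per reader.
    categories: all labels that readers may use.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = []
    for row in ratings:
        tally = Counter(row)
        row_counts = [tally[c] for c in categories]
        if len(row) != n_raters or sum(row_counts) != n_raters:
            raise ValueError("unequal rater counts or unknown label")
        counts.append(row_counts)
    # Overall proportion of assignments falling in each category.
    total = n_items * n_raters
    p = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    # Observed agreement per image, then averaged over images.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    p_expected = sum(x * x for x in p)
    if p_expected == 1.0:
        return float("nan")  # only one category was ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_label(kappa):
    """Map kappa to the bands listed in the statistical methods."""
    if kappa <= 0.40:
        return "poor to slight"
    if kappa <= 0.60:
        return "fair to moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"
```

Under these bands, the reported values of 0.58 (IAS) and 0.55 (EAS) both fall in the "fair to moderate" range.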

Table 1 summarizes the degree of concordance for IAS, EAS, interpretability and anal canal level for all readers with that of the expert radiologist (JF) for their original qualification test. The biostatistician also reviewed reader responses (in a blinded fashion) relative to the expert’s for discrepant IAS, EAS, interpretability and anal canal level to determine whether specific images were most problematic for readers. Most participants had difficulty interpreting defects in the high anal canal and low anal canal, and understanding when an image was not interpretable. This most often occurred when judging whether the image was at the correct location to interpret the high anal canal.


This study demonstrated that a single-day, centralized training course for investigators experienced in performing endoanal ultrasound, focused on interpreting endoanal ultrasound images, can result in moderate agreement between pelvic floor physicians and an expert radiologist. It should be no surprise that training raters to interpret images using the same standards as an ‘expert’ will produce raters more like the ‘expert’. However, our study suggests that, in the setting of preparing for a research trial, if study planners incorporate important contextual elements and outcomes related to the clinical trial as the frame of reference in the training program, one can also achieve moderate agreement among study investigators [9]. The use of the classification system of sphincteric injury allowed for a systematic interpretation pattern of ultrasound images and likely resulted in better agreement among readers. Another important finding is that this level of agreement can be achieved with a sizable and diverse multi-specialty group of practitioners. This is an important characteristic for ensuring that the findings of studies that rely on interpretability of imaging or standardization of procedures are generalizable to practitioners at trial completion.

Prior studies have investigated the intraobserver and interobserver agreement in endoanal sonography. Gold et al. studied 51 consecutive patients, including 43 women, who were referred for possible sphincter abnormalities [10]. Images were reviewed by two experienced sonographers, each unaware of the other’s findings. Although both observers agreed in 27 patients with intact sphincters, this study was limited by its use of only two sonographers, and the majority of images were from patients with normal or intact sphincters (35 out of 51). Fowler et al. developed and validated a pictorial chart to document defects from endoanal ultrasound examination by having two independent assessors review 296 endoanal ultrasound scans in patients recruited for a longitudinal cohort assessing occult anal sphincter injury after vaginal delivery [11]. There was strong agreement between reviewers (kappa 0.99) in categorizing normal vs. abnormal in 60 out of 296 scans, but when these images were compared to an “expert” reader the agreement was highly variable among different levels of experience in endoanal ultrasonography.

A strength of this study is the use of multiple readers from clinically diverse backgrounds. Most studies investigating interobserver agreement in interpreting endoanal ultrasound have had a limited number of readers. In previous studies that have investigated multiple readers, the readers' experience with endoanal ultrasound was highly variable. This study used 15 readers who specialize in the evaluation of pelvic floor disorders.

The study had several limitations. The logistics of the course required that content be taught in a group; all readers were assessed immediately following training at the end of the course, when recall was likely best. We do not know whether readers would retain this level of agreement if the assessment were performed months or even years after the training session. In the setting of a large research trial, it would be important to reassess the individuals reading images at intervals during the study to determine whether a refresher course is needed to maintain agreement for the entire duration of the study. Additionally, readers did not perform the endoanal ultrasound exams; they only read the images as part of the training. Therefore, this study does not provide insight into the actual psychomotor performance of the endoanal ultrasound technique on patients and can only comment on agreement in image interpretability. Finally, the study did not assess intraobserver agreement on the images. Future studies should investigate reliability of the performance of endoanal ultrasound in combination with image interpretability, as this is what is commonly done in practice. Reliability may improve if the person interpreting the images actually performed the procedure. Unfortunately, the BOOST trial was discontinued due to issues with study enrollment, and the longer-term impacts of the training session could not be assessed. Additionally, studies should determine whether these findings correlate with clinically important outcomes such as fecal incontinence.

In conclusion, multi-center certification of competence of 15 experienced individuals on interpreting EUS images was performed in a carefully planned single-day event. Agreement for diagnosis of sphincter defects using endoanal ultrasound images is moderate. This strategy allowed the achievement of an acceptable level of EUS image characterization compared to a gold standard radiology expert.

Supported by grants from The Eunice Kennedy Shriver National Institute of Child Health and Human Development (Duke: 2-U10-HD04267-12, Loyola: U10-HD054136, UAB: 2-U10-HD041261-11, Utah: U10-HD041250, Cleveland Clinic: 2-U10-HD054215-06, UCSD: 2-U10-HD054214-06, Pittsburgh: 1-U10-HD069006-01, UTSW: 2-U10-HD054241-06, Univ. of Michigan: U01-HD41249, RTI: 1-U01-HD069010-01) and the NIH Office of Research on Women’s Health.

Clinical Trials Registry: NCT01166399

Précis: Centralized training for 2-dimensional endoanal ultrasound with experienced investigators resulted in moderate agreement of anal sphincter diagnoses with an expert radiologist.


1. Sultan AH, Kamm MA, Hudson CN, Nicholls JR, Bartram CI. Endosonography of the anal sphincters: normal anatomy and comparison with manometry. Clin Radiol. 1994, 49(6): 368-374.

2. Abdool Z, Sultan AH, Thakar R. Ultrasound imaging of the anal sphincter complex: a review. Br J Radiol. 2012, 85: 865-875.

3. Richter HE, Nager CW, Burgio KL. Incidence and Predictors of Anal Incontinence After Obstetric Anal Sphincter Injury in Primiparous Women. Female Pelvic Med Reconstr Surg. 2015, 21(4): 182-189.

4. Corton MM, McIntire DD, Twickler DM, Atnip S, Schaffer JI et al. Endoanal ultrasound for detection of sphincter defects following childbirth. Int Urogynecol J. 2013, 24(4): 627-635.

5. Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat Med. 1998, 17(1): 101-110.

6. Cappelleri JC, Ting N. A modified large-sample approach to approximate interval estimation for a particular intraclass correlation coefficient. Stat Med. 2003, 22(11): 1861-1877.

7. Fleiss JL, Nee JC, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull. 1979, 86(5): 974-977.

8. Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics. 1977, 33(1): 159-174.

9. Woehr DJ, Huffcutt AI. Rater training for performance appraisal: A quantitative review. J Occup Organ Psychol. 1994, 67(3): 189-205.

10. Gold DM, Halligan S, Kmiot WA, Bartram CI. Intraobserver and interobserver agreement in anal endosonography. Br J Surg. 1999, 86(3): 371-375.

11. Fowler GE, Adams EJ, Bolderson J, Hosker G, Lowe D et al. Liverpool Ultrasound Pictorial Chart: the development of a new method of documenting anal sphincter injury diagnosed by endoanal ultrasound. BJOG. 2008, 115(6): 767-772.
