The Internal Revenue Service Could Improve Its Process to
More Reliably Measure the Accuracy of Its Toll-Free Tax Law Assistance
February 2002
Reference Number: 2002-40-051
This report has cleared the Treasury Inspector General for Tax Administration disclosure review process and information determined to be restricted from public release has been redacted from this document.
February 15, 2002
MEMORANDUM FOR COMMISSIONER, WAGE AND INVESTMENT DIVISION
FROM: Pamela J. Gardiner /s/ Pamela J. Gardiner
Deputy Inspector General for Audit
SUBJECT: Final Audit Report – The Internal Revenue Service Could Improve Its Process to More Reliably Measure the Accuracy of Its Toll-Free Tax Law Assistance (Audit # 200140039)
This report presents the results of our review of the Internal Revenue Service’s (IRS) efforts to measure the accuracy of its toll-free tax law assistance. The overall objective of this review was to determine if the IRS reliably measured the accuracy of responses that millions of taxpayers experienced when they called the IRS for tax law assistance.
In summary, we found that the IRS made a good attempt to obtain a representative measure of its tax law assistance within the limitations of its current sampling strategy. The IRS implemented a sampling strategy that required quality reviewers to select, monitor, and assess the quality of responses to calls in “real” time during live calls. During the period that it monitored the calls, the results provided useful indicators of the quality of the responses it provided. However, weaknesses in the current sampling strategy had the potential to bias the accuracy of the results. For example:
· The IRS’ sampling strategy used a cluster sampling design, an acceptable statistically valid random sampling method. However, the sampling plan developed from the design may not have produced a valid random sample. The IRS’ sampling plan attempted to ensure a certain number of calls were reviewed from all call sites at various times of the day; however, the plan did not include all hours and days of operation, used a sampling formula that likely resulted in a wider precision than reported, was designed in a mutually exclusive manner (i.e., if one event occurs, the other(s) cannot), and did not include r-mail calls.
· The sampling plan created by the IRS was not implemented as designed. To ensure the reliability of the results, once a random sampling plan has been designed, it is critical that it be followed to achieve the expected outcome. Our review found several problems in this area, including latitude on the part of the reviewers to select calls to monitor. By deviating from the sampling plan and disconnecting from calls, reviewers had the ability to influence which calls were included in the sample, which may distort the true randomness of the sample. In addition, the required number of calls was not reviewed at critical times during the 2001 Filing Season.
We were unable to assess the impact of these weaknesses on the 73.78 percent accuracy rate the IRS reported for the 2001 Filing Season, because doing so would require re-creating the sample, which is not possible. (The IRS’ true accuracy rate for toll-free tax law assistance could be higher than, lower than, or equal to the 73.78 percent reported.)
Also, managerial reviews were not conducted as required at critical periods during the 2001 Filing Season. In addition, they were not conducted in an independent setting. Improving this process could reduce some of the weaknesses identified in the current sampling strategy, such as deviating from the sampling plan and disconnecting from calls.
Management’s Response: IRS management agreed with the findings we presented in our report and implemented corrective actions where possible. They have already implemented corrective actions for recommendations 1 and 2 of our report. However, they were not able to devise corrective actions for recommendations 3 and 4.
Specifically, IRS management developed a sampling plan for the 2002 Filing Season that covers all of the days and hours Customer Service Representatives will provide tax law service. They also hired additional Centralized Quality Review Site (CQRS) staff and authorized overtime to assure the IRS executes the sample plan as designed. The new sampling plan offers the highest level of coverage by using the most random sampling its telephone system allows.
In addition, the 2002 version of the plan does not allow reviewers to select call sites and applications. Instead, reviewers are assigned to specific sites and applications. IRS management plans to monitor the samples to ensure reviewers are not deviating from their hourly assignments and that projected sample sizes are attained. The IRS has scheduled staff resources, including overtime, to ensure it consistently meets sample sizes.
However, the IRS was not able to develop corrective actions for recommendations 3 or 4 and one aspect of recommendation 2. Specifically, the IRS could not implement recommendation number 3 due to telephone system limitations. The telephone system used by CQRS does not allow simultaneous remote call monitoring. The IRS did strengthen controls to ensure that each CQRS manager meets minimum monitoring standards.
Management’s complete response to the draft report is included as Appendix VI.
Copies of this report are also being sent to the IRS managers who are affected by the report recommendations. Please contact me at (202) 622-6510 if you have questions, or contact Michael R. Phillips, Acting Assistant Inspector General for Audit (Wage and Investment Income Programs), at (202) 927-7085.
Weaknesses in the Sampling Strategy May Bias the Accuracy of the Results
Appendix I – Detailed Objective, Scope, and Methodology
Appendix II – Major Contributors to This Report
Appendix III – Report Distribution List
Appendix IV – Tables and Charts
Appendix V – Common Sampling Techniques
Appendix VI – Management’s Response to the Draft Report
Millions of taxpayers rely on the Internal Revenue Service (IRS) to provide accurate information when they call with questions about the tax law. Incorrect information, if provided by the IRS, could result in filing errors and increased taxpayer burden. During the 2001 Filing Season, the IRS received approximately 4.7 million toll-free telephone calls from taxpayers seeking tax law assistance. The assistance was provided by Customer Service Representatives (CSR) located in IRS call sites across the country. To assess the quality of this assistance, the IRS reviewed a sample of live taxpayer calls. For the 2001 Filing Season, it reported a national accuracy rate of 73.78 percent for its toll-free tax law assistance that covered the 4.7 million calls.
Taxpayers calling the IRS for tax law assistance initially reached an automated, menu-driven telephone system and chose to either listen to recorded information or speak to a CSR. When taxpayers chose to speak to a CSR, they navigated through the system and selected the particular tax topic (application) that related to their question. For example, one application covered questions on filing status and dependents and another covered questions on pensions and social security benefits. The telephone system was designed to then route the call to the next available CSR trained in the application selected.
Incoming calls were routed to a CSR in any 1 of 15 call sites nationwide that handled tax law calls. Appendix IV (Figures 1 and 2) provides the locations of the tax law call sites and their hours of operation during the 2001 Filing Season. The volume of calls handled by the IRS varied by application and call site. Appendix IV (Figures 3 and 4) provides the volumes of calls handled by application and call site during the 2001 Filing Season.
An overview of the IRS’ centralized quality review process for tax law assistance
The IRS established the Centralized Quality Review Site (CQRS) to determine the accuracy of responses provided to taxpayers who call the IRS’ toll-free customer service telephone system with questions about the tax law. It would be impossible for the IRS to review every call for accuracy; therefore, the CQRS reviewers monitored a sample of taxpayer calls and determined the accuracy of the responses.
A sampling plan was developed by the IRS to select actual taxpayer calls to monitor. Using this sampling plan, reviewers selected and monitored the calls to determine whether the CSRs followed all procedures and provided the taxpayer with an accurate response to his or her question. The reviewers then annotated the results of their monitoring on a case review form. Information on the case review forms was entered into a computerized database from which reports were generated providing accuracy rates and other quality-related information.
To ensure that reviewers possessed the skills required to monitor and assess the accuracy of calls, the IRS selected individuals who already had years of tax law experience. Reviewers were not expected to be experts on all tax law topics. They were assigned to monitor only two to four tax law applications. The IRS ensured the reviewers maintained the necessary skills by providing refresher training.
Sampling techniques
As stated earlier, it was not feasible for the IRS to monitor all calls received to gauge the accuracy of responses. However, to analyze the quality of the responses, the IRS needed to collect data that were representative of all tax law calls received. Sampling offered the practical solution.
Sampling is the selection and study of a part of a whole (the population) for the purpose of drawing conclusions about the whole. Sampling may be likened to taste testing, where the tester tastes a small part of the item and thereby draws conclusions about the whole regarding its quality or other characteristics.
In designing a sampling plan, there are a variety of possible sample selection techniques that can be used, each having various levels of reliability. Some commonly used techniques are random, judgmental, and convenience sampling. Appendix V provides descriptions of these sampling techniques.
The IRS chose the statistical method of random selection using cluster sampling. Cluster sampling randomly selects groups or clusters of units to sample from. One benefit of cluster sampling is the reduced cost of the sample selection and data collection.
One major benefit of statistical random sampling in general is that it permits objective measurement of the reliability of the results. Established mathematical formulas are used to determine the proper sample size necessary to measure the reliability of the results. This sampling method is used when conclusions about the entire population are to be presented.
With all statistical random sampling methods, every item in the population is given an equal chance of being included in the sample. For example, each call answered by the IRS within the cluster selected must have an equal chance of being selected for the sample, or the method is not truly random. Unless each item is selected by a purely random technique, there is no way of measuring later how accurately the sample reflects the characteristics of the population from which it was selected.
Sampling reliability
Since a sample only contains partial data, the results have some limitations because one can never be certain the entire population is represented. The limitations are identified through a process called “measurement of reliability.” When designing a sampling plan, management must decide the desired reliability of the results and degree of confidence in using those results.
Assessing the reliability of the results involves determining how closely the sample represents the population (precision margin) and the confidence with which the results can be used to assess the population (confidence level). Precision is the amount of error that management will tolerate due to sampling and is expressed as a plus or minus figure (such as +/-5 percent). The smaller the precision margin, the less error due to sampling exists.
The confidence level refers to the degree of assurance that the results obtained from the sample, after applying the precision margin, represent the true population. For illustration, the IRS designed its sample with a confidence level of 90 percent, a precision margin of +/-5 percent, and a historical accuracy rate of 71 percent. That means that the IRS would expect that the average accuracy rate for the calls sampled would fall within a range from 66 percent to 76 percent, 90 percent of the time.
How the IRS used the results from its sample
The IRS designed the sample to provide an estimate of the accuracy of its responses to tax law questions on both a nationwide and a call site level. In addition, the IRS used the results of the sample to identify needed improvements in training and research materials for CSRs and to propose changes to tax forms, instructions, and publications. The results of the sample could also enable managers to better operate their call sites and set performance goals to improve the IRS’ telephone customer service program overall.
We conducted this audit in the Wage and Investment Division Headquarters Office in Atlanta, Georgia, and Wage and Investment Division offices in Philadelphia, Pennsylvania; New Carrollton, Maryland; and New York, New York. Audit work was also conducted in the Office of the Director, Research, Analysis and Statistics of Income in Washington, D.C.
The audit was conducted between April and August 2001 and in accordance with Government Auditing Standards. Detailed information on our audit objective, scope, and methodology is presented in Appendix I. Major contributors to the report are listed in Appendix II.
The IRS made a good attempt to obtain a representative measure of its tax law assistance within the limitations of its current sampling strategy. The IRS implemented a sampling strategy that required quality reviewers to select, monitor, and assess the quality of responses to calls in “real” time during live calls. During the 2001 Filing Season, it reviewed 11,481 calls. The results provided the IRS with useful indicators of the quality of the responses it provided. However, plan design and implementation weaknesses in the current sampling strategy had the potential to bias the accuracy of the results. Also, managerial reviews were not conducted as required at critical periods during the 2001 Filing Season and were not performed in an independent setting.
According to a Professor of Decision Sciences who reviewed the sampling plan for the 2001 Filing Season, “At best, the sampling plan attempts to obtain ‘representative coverage’ of the quality of responses to tax law calls across the 15 tax sites every 2 weeks.” The sampling plan design attempted to ensure a certain number of calls were reviewed from all call sites at various times of the day. However, the plan did not include all hours or days of operation, used a sampling formula that likely resulted in a wider precision than reported, was designed in a mutually exclusive manner (i.e., if one event occurs, the other(s) cannot), and did not include r-mail calls. These weaknesses could bias the overall accuracy rate and precision for the sample, affecting the sample’s reliability. We were unable to estimate the impact of the weaknesses because doing so would require re-creating the sample, which is not possible. (The IRS’ true accuracy rate for toll-free tax law assistance could be higher than, lower than, or equal to the 73.78 percent reported.)
Weaknesses existed in the design of the sampling plan
Our review identified several weaknesses in the design of the sampling plan. Specifically, the plan did not cover all hours and days of operation, used the simple random sampling formula for cluster sampling, and was designed in a “mutually exclusive” manner. Also, calls from taxpayers seeking tax law assistance that could not be answered during that initial call (r-mail) were not included. These design elements kept the IRS’ sample from being truly random.
The plan did not cover all hours and days of operation. The IRS’ toll-free tax law service was available 24 hours a day, 7 days a week during the 2001 Filing Season. However, the IRS selected calls for review only between the hours of 7:00 a.m. and 11:00 p.m. Eastern Standard Time, Monday through Saturday. Therefore, no calls received after 10:00 p.m. Central Standard Time, 9:00 p.m. Mountain Standard Time, or 8:00 p.m. Pacific Standard Time were monitored. Also, calls received on Sundays were not monitored.
Even during the 7:00 a.m. to 11:00 p.m. monitored time frame, the sampling plan did not cover all hours of operations for each call site. For example, one site that was open for 13 hours during the 7:00 a.m. to 11:00 p.m. time frame was only scheduled to be monitored up to 8 of the 13 hours that it was open.
The IRS estimated that 93 percent of the tax law calls came in during the hours reviewers were monitoring calls. Although the volume of calls outside of the time periods monitored was low in relation to the total received, the ultimate effect on the sampling plan was that these calls did not have a chance of being selected.
The sampling plan did not include call site assignments for Saturdays. The sampling plan assigned reviewers to specific call sites. Assignments were made at varying times of the day on Monday through Friday, according to hours of operation of the calls sites and tours of duty of the reviewers. The sampling plan did not include Saturdays because sites that were available to answer tax law calls on Saturdays varied from week to week. During the week, reviewers were notified of the call sites that were operational each Saturday, but there was no systematic schedule for monitoring these call sites on Saturdays.
Using the simple random sampling formula for cluster sampling likely resulted in a wider precision. Under the cluster sampling method, the IRS grouped tax law calls within each call site into 1-hour sampling units or clusters and then selected calls within the hour to monitor for quality. The IRS used cluster sampling to select the calls to monitor; however, it used simple random sampling to determine the sample size (ideal number of calls required to be monitored) and to estimate the results. The sample size was determined based on the statistical simple random sampling formula. The effect of applying the simple random sample formula to cluster sampling likely resulted in a wider precision.
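The effect described above can be illustrated with a minimal sketch. It is not the IRS’ computation: it assumes the common design-effect approximation for cluster samples, deff = 1 + (m - 1) x rho, and the cluster size (m) and intra-cluster correlation (rho) shown are hypothetical values chosen only for illustration. The z-value of 1.645 for 90 percent confidence is also an assumption, since the report does not state the value used.

```python
import math

Z_90 = 1.645  # assumption: two-sided z-value for 90 percent confidence


def srs_margin(p, n, z=Z_90):
    """Precision margin under the simple random sampling formula."""
    return z * math.sqrt(p * (1 - p) / n)


def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def cluster_margin(p, n, cluster_size, icc, z=Z_90):
    """Precision margin after inflating the SRS variance by the design effect."""
    return srs_margin(p, n, z) * math.sqrt(design_effect(cluster_size, icc))


p, n = 0.71, 223     # historical accuracy rate and per-site sample size from the report
m, rho = 4, 0.05     # hypothetical: calls monitored per hourly cluster, intra-cluster correlation

print(f"SRS margin:     +/-{100 * srs_margin(p, n):.2f} percent")
print(f"Cluster margin: +/-{100 * cluster_margin(p, n, m, rho):.2f} percent")
```

Under these assumptions, any positive correlation among calls within the same hourly cluster makes the cluster margin larger than the margin the simple random sampling formula reports, which is the direction of the weakness noted in this section.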
The plan was designed in a mutually exclusive manner. Only one reviewer was scheduled to monitor a particular call site at a specific time. If the sampling plan was followed, other incoming calls that occurred at that call site during the same time period a reviewer was already monitoring a call could not be selected for review. As a result, all calls coming into that call site did not have an equal chance of selection for the sample.
Calls from taxpayers that were forwarded to the r-mail system were not included in the sample. There were occasions when a call received by a CSR was not answered initially (for example, for complex tax law subjects such as Individual Retirement Accounts, Capital Gains, and Sale of Residence). For these calls (r-mail), the CSR was to ask the taxpayer to leave a name and telephone number or electronic mail address so that the taxpayer could be provided a response within a few days. During the 2001 Filing Season, there were approximately 1 million taxpayer calls forwarded to r-mail.
As noted in a recent Treasury Inspector General for Tax Administration report, even though these taxpayers sought assistance through the IRS’ toll-free tax law telephone system, the quality assessment when the IRS later responded to these taxpayers was separately computed and did not factor into the overall estimate of accuracy for toll-free tax law telephone quality. Because the response is not immediate, the IRS does not believe this should be part of the toll-free tax law assistance measure for ‘live’ calls. After discussions during this review, IRS management changed the definition of the measure to include only ‘live’ toll-free tax law assistance.
Weaknesses existed in the implementation of the sampling plan
Once a sampling plan has been designed, it is critical that it be followed to ensure that it achieves the expected outcome. However, reviewers had latitude in selecting call sites and applications to monitor, had the ability to disconnect from a call at any time during monitoring with no systematic record that the call was selected, and did not monitor the required numbers of calls at critical times during the 2001 Filing Season.
Reviewers had latitude in selecting the site and application to monitor. Although the plan directed reviewers to a specific call site at a specific time, they had latitude in selecting specific calls to be monitored at the assigned site. For example, even though a reviewer was assigned two to four applications to monitor, the reviewer could impose his or her own preference to review one particular application over another. Also, if a call was not available to monitor at a site for his or her assigned applications, he or she was allowed to deviate to another call site.
By deviating from the sampling plan, reviewers could influence which calls were included in the sample and distort the true randomness of the sample. This conclusion is shared by the Professor of Decision Sciences who wrote, “The ‘experimental design’ or ‘controlled experiment’ approach adopted by the IRS attempts to limit the discretion of the reviewers, who are charged with the decision regarding which incoming call to monitor. Given a choice of calls to monitor, each reviewer can infuse his/her own preference bias into which call to select. This discretion can bias the overall quality estimates and precision estimates for the sample.”
The sampling plan was designed for reviewers to monitor calls at specific call sites at specific hours of the day. According to CQRS management, deviations from the assigned applications were allowed without prior managerial approval. However, deviations from the assigned call site were only to be to approved sites and then documented by the reviewers. Documentation requirements included annotating the time, the call site unable to be monitored, the substituted call site, and any applicable remarks.
We selected a statistically valid sample of 75 tax law calls monitored by reviewers during the 2001 Filing Season and reviewed the IRS’ documentation pertaining to each call. Our sample showed that reviewers deviated from the assigned site in 9 of the 75 calls (12 percent). When we apply this 12 percent to the 11,481 calls monitored by the CQRS, we estimate that 1,377 call site deviations may have occurred during the 2001 Filing Season. Reviewers did not document the reason for the deviation in six of the nine instances in our sample. The degree of latitude, even to approved sites, could affect the randomness of the sample.
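The projection above is simple arithmetic, sketched below using only figures stated in this report (75 calls sampled, 9 deviations, 11,481 calls monitored). The variable names are illustrative, and truncating to a whole number is an assumption made to match the reported figure.

```python
# Figures stated in the report.
sample_size = 75
deviations_in_sample = 9
calls_monitored = 11_481

# Share of sampled calls where the reviewer left the assigned site.
deviation_rate = deviations_in_sample / sample_size

# Project the sample rate onto all monitored calls. Integer arithmetic
# truncates the fractional call (assumption: the report rounded down).
estimated_deviations = deviations_in_sample * calls_monitored // sample_size

print(f"Deviation rate: {deviation_rate:.0%}")
print(f"Estimated deviations: {estimated_deviations:,}")
```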
Reviewers had the ability to disconnect from a call at any time. Reviewers could disconnect from calls at any time for various reasons. For example, the reviewers were instructed to disconnect from a call if it was transferred outside the tax law area. However, no controls were in place to prevent reviewers from disconnecting from a call for reasons such as complexity of the issue, topic preference, or length of the call.
Each day, reviewers were to manually document the number of disconnected calls on information sheets maintained with their case review forms. However, CQRS management did not monitor the number of disconnected calls. We could not assess the frequency with which reviewers disconnected from calls because there was no way to systemically track the reviewer activity within the toll-free telephone system. This degree of latitude could also affect the randomness of the sample.
Reviewers did not monitor the required number of calls at critical times during the 2001 Filing Season. Once the ideal sample size was determined, the IRS evaluated its ability to meet the sampling plan based on resource assumptions. For example, one assumption was that the reviewers would be available an average of only 6.5 hours out of an 8.5-hour work day to monitor calls.
For the 2001 Filing Season, the IRS calculated the ideal sample size at 223 calls per site and 3,345 calls nationwide (223 calls per site times 15 sites). The IRS resource assumptions for the 2001 Filing Season indicated that sufficient staffing was available to meet the sample size of 223 calls per call site per month for all 15 call sites (i.e., 3,345 calls nationwide per month).
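As a hedged illustration, the standard simple random sampling formula for a proportion, n = z^2 x p x (1 - p) / e^2, reproduces the 223-call figure when it is given the design parameters stated earlier in this report (90 percent confidence, +/-5 percent precision, 71 percent historical accuracy rate). The report does not show the IRS’ actual computation, so the formula, the z-value of 1.645, and the rounding up are assumptions.

```python
import math

# Design parameters stated in the report.
P_HISTORICAL = 0.71   # historical accuracy rate
PRECISION = 0.05      # +/-5 percent precision margin
N_SITES = 15          # tax law call sites

Z_90 = 1.645          # assumption: two-sided z-value for 90 percent confidence


def srs_sample_size(p, e, z):
    """Sample size for estimating a proportion under simple random sampling.

    No finite population correction is applied (assumption).
    """
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


per_site = srs_sample_size(P_HISTORICAL, PRECISION, Z_90)
nationwide = per_site * N_SITES

print(per_site, nationwide)
```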
Although the design of the sampling plan necessitated that at least 223 calls per month be monitored at each call site, the IRS did not meet the sampling plan at the call site level in three of four months of the 2001 Filing Season. Appendix IV (Figure 5) provides information for the planned and actual number of monitored calls per call site per month during the 2001 Filing Season.
While we could not substantiate the primary cause for not meeting the sampling plan, the IRS cited several reasons, such as low call volumes at certain times, problems with the software used to locate call traffic, and unavailability of reviewers during parts of the 2001 Filing Season.
The results by call site and on a nationwide basis provided the IRS with the opportunity to gauge the overall performance of each call site and implement changes targeted at improving performance, such as identifying training needs in specific areas. The assessment at the call site level was particularly important because not all call sites performed at the same level during the 2001 Filing Season. For example, during the month when the most calls were handled, one call site had a tax law accuracy rate of 80.28 percent (+/-4.48 percent), while another call site had a 60.27 percent tax law accuracy rate (+/-5.38 percent).
Although there was not a material effect on the nationwide accuracy rate when the monthly sample size was not met, meeting the sample size at the call site level was important because of the degree of risk associated with only sampling parts of the entire population. When sample sizes fall below the required number, the degree of sampling error (precision) of the results will be wider or broader.
As stated in the background section of this report, IRS management must decide the maximum amount of error due to sampling that they will tolerate. IRS management selected a 5 percent sampling error. A wide precision in individual call site results may not provide meaningful information for them to make sound decisions about call site improvements. For example, during January 2001, one site had an accuracy rate of approximately 65.22 percent, +/-7.31 percent. This means that the true accuracy rate would lie somewhere between 57.91 percent and 72.53 percent – information that may not be meaningful to management. Appendix IV (Figure 6) provides call site accuracy and precision rates for January 2001.
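The relationship between sample size and precision can be sketched as follows. This is an illustration under the simple random sampling formula with an assumed z-value of 1.645 for 90 percent confidence; the sample sizes of 150 and 100 are hypothetical and are not taken from Appendix IV.

```python
import math

Z_90 = 1.645  # assumption: two-sided z-value for 90 percent confidence


def margin(p, n, z=Z_90):
    """Precision margin for an accuracy rate p estimated from n sampled calls."""
    return z * math.sqrt(p * (1 - p) / n)


def interval(rate_pct, margin_pct):
    """Lower and upper bounds, in percent, implied by a rate and its margin."""
    return (round(rate_pct - margin_pct, 2), round(rate_pct + margin_pct, 2))


# Range implied by the January 2001 site example in the report.
print(interval(65.22, 7.31))

# Margin at the planned 223 calls versus two hypothetical shortfalls.
for n in (223, 150, 100):
    print(n, f"+/-{100 * margin(0.71, n):.2f} percent")
```

The margin grows as the number of monitored calls falls, which is why a site that misses its monthly sample size produces a less useful accuracy estimate.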
Managerial reviews were not conducted at critical periods during the 2001 Filing Season
Managers in the CQRS had minimum monthly review requirements. We determined that these reviews of the work performed by the quality reviewers were not conducted as required in January, February, and March 2001. The manager of the quality reviewers was temporarily reassigned and a replacement was not designated.
Even though managerial reviews were not conducted as required, our analysis of IRS documentation of calls did not identify any material problems.
· In our review of the IRS’ documentation for the 75 monitored calls previously mentioned, we determined that the reviewers’ assessments of the accuracy of the responses to taxpayers were correct for 69 of 71 calls. For the remaining four calls, we could not determine the accuracy based on the limited documentation available on the call. However, it should be noted that our analysis of the 75 calls was dependent on the transcribed notes of the reviewers, as we could not re-create the actual taxpayer call.
· In our review of a statistically valid sample of 68 of 1,020 edited case review forms on the tax law calls monitored, we concluded that the reasons were appropriate for the changes.
One aspect of the managerial reviews consisted of the joint monitoring of live calls conducted side-by-side with the quality reviewers. This setting, not being independent, may have minimized the effectiveness of these managerial reviews in reducing some of the weaknesses identified in the current sampling strategy, such as deviating from the sampling plan and disconnecting from calls. Improving the managerial review process to conduct all required reviews in an independent setting could help to ensure the sampling plan is implemented as designed.
Errors in data on call volumes did not have a material impact
To ensure each call site was given the proper amount of weight in the overall results, the respective call volumes were used to “level the playing field.” These call volumes originated from an automated telephone routing system and were then manually calculated and transcribed into the quality review database. Our review of the manual process that summarizes tax law call volumes at the call site level identified errors in the compilation of the data, but the errors were not material enough to affect the overall accuracy rate. The errors ranged from minimal in some call sites to overstating more than 25,000 calls in two call sites and understating more than 47,000 calls in one other. Although these errors did not change the overall accuracy rate when taken into account, there is the risk that errors, depending on their size, could have an effect. Since our discussion of these errors, IRS management has automated this process. We did not test the automated process to determine its reliability.
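A minimal sketch of volume weighting follows, assuming the overall rate is a call-volume-weighted average of the site rates (the report does not give the IRS’ exact weighting formula). The two accuracy rates are the site examples cited earlier in this report; the call volumes and the size of the misstatement are hypothetical.

```python
def weighted_accuracy(sites):
    """Call-volume-weighted average of site accuracy rates.

    `sites` is a list of (accuracy_percent, call_volume) pairs.
    """
    total_volume = sum(volume for _, volume in sites)
    return sum(rate * volume for rate, volume in sites) / total_volume


# Accuracy rates from the report's two-site example; volumes are hypothetical.
sites = [(80.28, 300_000), (60.27, 100_000)]

# Same sites with the second site's volume overstated by 25,000 calls (hypothetical).
misstated = [(80.28, 300_000), (60.27, 125_000)]

print(f"{weighted_accuracy(sites):.2f}")
print(f"{weighted_accuracy(misstated):.2f}")
```

In this two-site illustration, overstating the volume of the lower-performing site pulls the combined rate down; the effect depends on how far a site’s rate is from the overall rate and on the size of the volume error.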
In conclusion, statistical sampling provides a means to make conclusions about a population when only a sample of that population is reviewed. The IRS made a good attempt within the limitations of its current sampling strategy to obtain a representative measure of the quality of toll-free tax law assistance. Weaknesses resulting from the current sampling strategy, in terms of design and implementation, may bias the sample results. However, the exact impact cannot be quantified without re-creating the sample.
To improve the IRS’ measure of accuracy under the current sampling strategy, the Commissioner, Wage and Investment Division, should:
1. Design the sampling plan to include all tax law calls in the population and randomly select from all hours of all call site operations.
Management’s Response: IRS management stated, “The sampling plan we developed for the 2002 Filing Season covers all of the days and hours CSRs will provide tax law service. We have hired additional CQRS staff and authorized overtime to assure we execute the sample plan as designed. The sampling plan we developed offers the highest level of coverage by using the most random sampling our telephone system allows.”
2. Ensure the sampling plan is implemented as designed, the latitude of reviewers to select and disconnect calls (in terms of both application and call site) is limited, and the required number of calls is reviewed at critical times during the filing season.
Management’s Response: IRS management stated, “The 2002 version of the plan does not allow reviewers to select call sites and applications. Instead, reviewers are assigned to specific sites and applications. Now we check daily information sheets prepared by the reviewers to ensure they follow the sample plan. If we must deviate from the plan because of site operating conditions or closures, we follow a predetermined process to maintain random selection. In addition, we will select a separate sample of calls and compare them to the sample plan to ensure reviewers are not deviating from their hourly assignments. No telephone systems exist that can prevent the user from disconnecting from a call. The telephone equipment the CQRS staff uses cannot track disconnected calls.
We are using a status report to compare the projected sample per site and the actual samples taken. If the projection is not met, we will determine the reason(s) and take corrective action. We have scheduled staff resources, including overtime, to ensure we consistently meet sample sizes.”
3. Conduct the required managerial reviews of the work performed by quality reviewers and perform the on-line, joint-monitoring aspect of these reviews remotely so that it can be done in an independent setting. These reviews should help ensure that the sampling plan is executed as designed.
Management’s Response: IRS management stated, “We cannot implement this recommendation due to telephone system limitations. The telephone system used by CQRS does not allow simultaneous remote call monitoring. CQRS managers conduct post reviews of their reviewers’ work by analyzing their data collection instruments and call notes. This review supplements the side-by-side joint monitoring that is conducted by each manager. We have strengthened controls to ensure that each CQRS manager meets minimum monitoring standards.”
To ultimately address the design limitations of the current sampling strategy, the Commissioner, Wage and Investment Division, should:
4. Develop a system to measure the accuracy of tax law assistance that ensures that all tax law calls are included in the population, the selection of calls is truly random, and the sampling plan is implemented as designed. One way to accomplish this would be to institute an automated call recording system that would also provide a true random method of selecting calls from the entire population of calls to the IRS’ toll-free tax law assistance telephone system.
Appendix I
Detailed Objective, Scope, and Methodology
The overall objective of this review was to determine if the Internal Revenue Service (IRS) reliably measured the accuracy of responses that millions of taxpayers experienced when they called the IRS for tax law assistance.
We reviewed the process the IRS uses to measure the accuracy of its tax law assistance and determined the statistical reliability of each component of the process. This included reviewing actual case files from quality reviews at the Centralized Quality Review Site (CQRS), reviewing the reliability of the sampling methodology used to measure the accuracy, and reviewing the reliability of the individual components that comprised the measure of accuracy.
To accomplish our objective, we:
I. Reviewed the current sampling plan to determine whether, if executed properly, it would result in a statistically valid estimate. (Design of the Sample)
A. Interviewed CQRS and Statistics of Income (SOI) staff to determine the purpose of the sample, how it was developed, and the attributes of the population used.
B. Interviewed SOI staff and reviewed industry practices to determine the basis for the confidence and precision levels set by the IRS.
C. Researched industry practices and consulted with a contracted statistician to determine whether the appropriate and necessary elements or attributes of the population were addressed in the sampling plan.
D. Analyzed quality review database information for the 2000 and 2001 Filing Seasons and compared the results to the 2001 Filing Season sampling plan for indications of bias.
II. Evaluated whether the sampling plan was executed by the CQRS as designed by the SOI.
A. Reviewed quality review database and call volume data to determine if the CQRS met the sampling plan.
B. Interviewed CQRS staff to determine reasons why the sampling plan was not met.
C. Interviewed CQRS staff to determine reasons for deviations from the sampling plan. Also, reviewed a statistically valid random sample of review documentation for the 75 calls from II.G. to determine if the reviewers had deviated from the sampling plan. The sample was selected from a universe of 11,459 calls monitored during the 2001 Filing Season, at a confidence level of 95 percent, an error rate of 5 percent, and a precision margin of 5 percent.
D. Analyzed quality review database information to determine if reviewers were monitoring sites and applications outside of the sampling plan.
E. Interviewed CQRS staff and reviewed training material to determine if reviewers had required skill sets.
F. Interviewed CQRS management and reviewed performance review documentation to determine if there was proper oversight.
G. Reviewed a statistically valid random sample of review documentation for 75 calls to determine the overall accuracy of the determination, if calls could be reconstructed, and if calls were accurately transcribed. The sample was selected from a universe of 11,459 calls monitored during the 2001 Filing Season, at a confidence level of 95 percent, an error rate of 5 percent, and a precision margin of 5 percent.
H. Reviewed a statistically valid sample of 68 edited cases for review to determine reasons for the edits. The sample was selected from a universe of 1,020 edited records during the 2001 Filing Season at a confidence level of 95 percent, an error rate of 5 percent, and a precision margin of 5 percent.
III. Evaluated whether results of the sampling plan were compiled correctly.
A. Interviewed operations program analysts at the IRS National Headquarters and SOI management staff to determine how the sampling plan results were compiled and how the mathematical statistical formulas were used in the compilation.
B. Determined if the information used in the calculation of the quality rates and precision margins by the quality database was accurate by using original source data to test the individual components of the quality calculations.
C. Verified that the calculations used by the quality database to report the individual call site and national rollup quality rates and precision margins were accurate, based on actual data previously entered into the database, by recalculating the individual and national results.
D. Determined by discussion with SOI management staff, IRS operations program analysts, and CQRS management staff what validation the IRS performed on the quality rate reporting methodology in order to ensure the accuracy of the data and the statistical validity of the results calculated by the quality database.
E. Determined how certain attributes of the toll-free tax law call population, such as calls received/monitored by call site and calls received/monitored by application, were represented in the overall final sample selected for purposes of reporting the IRS national toll-free tax law quality rate during the 2001 Filing Season.
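The sample sizes cited in steps II.G and II.H can be roughly cross-checked with a standard attribute-sampling formula. The sketch below is illustrative only: the report does not state which formula was used, so the normal-approximation formula with a finite population correction, and the function name `attribute_sample_size`, are assumptions. With the stated parameters (95 percent confidence, 5 percent expected error rate, 5 percent precision), it gives roughly 73 calls for the universe of 11,459 and roughly 68 records for the universe of 1,020, which is close to the 75 and 68 items reviewed.

```python
from statistics import NormalDist


def attribute_sample_size(universe, confidence=0.95, error_rate=0.05, precision=0.05):
    """Approximate sample size for estimating a proportion.

    Assumed formula (not taken from the report): normal approximation
    with a finite population correction for a universe of known size.
    """
    # Two-sided z-value for the requested confidence level (about 1.96 at 95 percent).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an unlimited population.
    n0 = z ** 2 * error_rate * (1 - error_rate) / precision ** 2
    # Shrink the sample size to reflect the finite universe.
    return n0 / (1 + (n0 - 1) / universe)


if __name__ == "__main__":
    for universe in (11_459, 1_020):
        print(universe, round(attribute_sample_size(universe), 1))
```

Rounding conventions differ among sampling tools, so a small gap between the computed values and the sample sizes actually drawn is not by itself a sign of error.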
Appendix II
Major Contributors to This Report
Michael R. Phillips, Acting Assistant Inspector General for Audit (Wage and Investment Income Programs)
Susan Boehmer, Director
Stan Rinehart, Director
Patricia Lee, Audit Manager
Anthony Anneski, Senior Auditor
Deborah Carter, Senior Auditor
Gregory Dix, Senior Auditor
Kathleen Hughes, Senior Auditor
Doris Hynes, Senior Auditor
Sharla Robinson, Senior Auditor
Jerry Douglas, Auditor
Andrea McDuffie, Auditor
Geraldine Vaughn, Auditor
Appendix III
Commissioner N:C
Commissioner, Small Business/Self-Employed Division S
Director, Customer Account Services W:CAS
Director, Strategy and Finance W:S
Director, Research, Analysis, and Statistics of Income N:ADC:R
Chief, Customer Liaison S:COM
Chief Counsel CC
National Taxpayer Advocate TA
Director, Legislative Affairs CL:LA
Director, Office of Program Evaluation and Risk Analysis N:ADC:R:O
Office of Management Controls N:CFO:F:M
Audit Liaisons:
Commissioner, Wage and Investment Division W
Director, Customer Account Services W:CAS
Director, Research, Analysis, and Statistics of Income N:ADC:R
Appendix IV
| Days of Operation | Call Site | Hours | Days / Additional Hours |
|---|---|---|---|
| Sunday – Saturday | Atlanta | 24 Hours | |
| | Denver | 9:00A – 7:00P | |
| | Nashville | 7:00A – 8:00P | Mon – Fri |
| | | 7:00A – 3:00P | Sat, Sun |
| Monday – Saturday | Jacksonville | 6:30A – 8:30P | Mon – Fri |
| | | 10:30A – 8:30P | Sat |
| | Pittsburgh | 6:00A – 4:30P | Mon – Fri |
| | | 8:00A – 4:30P | Sat |
| | St. Louis | 7:30A – 5:30P | Mon – Fri |
| | | 6:00A – 3:00P | Sat |
| Monday – Friday | Baltimore | 6:30A – 4:30P | |
| | Buffalo | 6:30A – 10:00P | |
| | Cleveland | 7:00A – 4:30P | |
| | Dallas | 7:00A – 11:30P | Sun 11:30A – 11:30P |
| | Indianapolis | 8:00A – 8:00P | |
| | Oakland | 10:00A – 10:00P | |
| | Portland | 9:30A – 10:00P | |
| | Richmond | 6:30A – 7:00P | |
| | Seattle | 9:30A – 10:00P | |
Source: Internal Revenue Service Call Sites
Figure 3 was removed due to its size. To see the figure, please go to the Adobe PDF version of the report on the TIGTA Public Web Page.
The volume of toll-free tax law calls the IRS handled varied by tax law application. The volume of calls handled for the Filing and Dependent tax law application was the highest for each month from January through April.
Figure 4
Figure 4 was removed due to its size. To see the figure, please go to the Adobe PDF version of the report on the TIGTA Public Web Page.
The volume of toll-free tax law calls handled by each IRS call site varied.
Figure 5
Internal Revenue Service Tax Law Calls Monitored Compared to Planned for Call Sites (2001 Filing Season)
| Call Site | January Monitored | January Over (Under) Planned | February Monitored | February Over (Under) Planned | March Monitored | March Over (Under) Planned | April Monitored | April Over (Under) Planned |
|---|---|---|---|---|---|---|---|---|
| 1 | 165 | (58) | 224 | 1 | 313 | 90 | 162 | 50 |
| 2 | 162 | (61) | 211 | (12) | 268 | 45 | 109 | (3) |
| 3 | 150 | (73) | 199 | (24) | 280 | 57 | 137 | 25 |
| 4 | 154 | (69) | 200 | (23) | 273 | 50 | 106 | (6) |
| 5 | 140 | (83) | 169 | (54) | 263 | 40 | 151 | 39 |
| 6 | 154 | (69) | 166 | (57) | 242 | 19 | 127 | 15 |
| 7 | 151 | (72) | 231 | 8 | 274 | 51 | 134 | 22 |
| 8 | 179 | (44) | 213 | (10) | 305 | 82 | 140 | 28 |
| 9 | 151 | (72) | 214 | (9) | 255 | 32 | 131 | 19 |
| 10 | 115 | (108) | 185 | (38) | 265 | 42 | 114 | 2 |
| 11 | 199 | (24) | 232 | 9 | 332 | 109 | 156 | 44 |
| 12 | 136 | (87) | 147 | (76) | 235 | 12 | 110 | (2) |
| 13 | 145 | (78) | 193 | (30) | 277 | 54 | 153 | 41 |
| 14 | 103 | (120) | 185 | (38) | 262 | 39 | 132 | 20 |
| 15 | 182 | (41) | 262 | 39 | 339 | 116 | 119 | 7 |
| Totals | 2,286 | (1,059) | 3,031 | (314) | 4,183 | 838 | 1,981 | 301 |

Source: Internal Revenue Service (IRS) Quality Review Database
In January, February, and April, the IRS did not meet the sampling plan at one or more of its call sites.
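The over (under) figures in Figure 5 are simply the calls monitored minus the calls planned, so the planned count can be recovered by subtraction. The short sketch below uses the first pair of columns in Figure 5 (assumed here to be January) for call sites 1 through 15. The variable and function names are illustrative only.

```python
# (monitored, over_or_under_planned) for call sites 1-15, taken from the
# first pair of columns in Figure 5 (assumed to be January 2001).
first_period = [
    (165, -58), (162, -61), (150, -73), (154, -69), (140, -83),
    (154, -69), (151, -72), (179, -44), (151, -72), (115, -108),
    (199, -24), (136, -87), (145, -78), (103, -120), (182, -41),
]


def planned_calls(monitored, over_under):
    """Recover the planned count: monitored = planned + over (under)."""
    return monitored - over_under


# Every site works out to the same planned count for the period.
planned_per_site = {planned_calls(m, d) for m, d in first_period}
# Number of sites that monitored fewer calls than planned.
sites_short = sum(1 for _, d in first_period if d < 0)
# Net shortfall across all sites (matches the Totals row).
total_over_under = sum(d for _, d in first_period)

if __name__ == "__main__":
    print(planned_per_site, sites_short, total_over_under)
```

Under this reading, every call site had the same planned count for the period and every site fell short of it.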
Figure 6
Accuracy Rates and Precision Margins for Internal Revenue Service Call Sites (January 2001)
| Call Site | Accuracy Rate | Precision Margin |
|---|---|---|
| 1 | 62.42% | +/- 6.20% |
| 2 | 72.22% | +/- 5.79% |
| 3 | 64.00% | +/- 6.45% |
| 4 | 68.83% | +/- 6.14% |
| 5 | 62.14% | +/- 6.74% |
| 6 | 71.43% | +/- 5.99% |
| 7 | 75.50% | +/- 5.76% |
| 8 | 81.01% | +/- 4.82% |
| 9 | 72.19% | +/- 6.00% |
| 10 | 65.22% | +/- 7.31% |
| 11 | 73.87% | +/- 5.12% |
| 12 | 74.26% | +/- 6.17% |
| 13 | 72.41% | +/- 6.11% |
| 14 | 71.84% | +/- 7.29% |
| 15 | 64.29% | +/- 5.84% |

Source: Internal Revenue Service (IRS) Quality Review Database
During January 2001, when the sampling plan was not met for the IRS call sites (refer to Figure 5), the precision margin (sampling error) for many of the call sites exceeded the 5 percent sampling error desired by the IRS (based on the 5 percent sampling error the IRS used in its simple random sample formula to determine sample size).
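The link between a smaller sample and a wider precision margin can be illustrated with the usual normal-approximation formula for a proportion. This is a sketch under stated assumptions, not the IRS quality database's documented calculation: the 90 percent confidence level (z of about 1.645) is not given in this appendix and was chosen because it closely reproduces the margins in Figure 6 when paired with the first column of monitored counts in Figure 5. The function name `precision_margin` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def precision_margin(accuracy_rate, sample_size, confidence=0.90):
    """Half-width of a normal-approximation interval for a proportion.

    The 90 percent default confidence level is an assumption, not a
    value stated in the report.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(accuracy_rate * (1 - accuracy_rate) / sample_size)


if __name__ == "__main__":
    # (call site, accuracy rate from Figure 6, calls monitored from Figure 5)
    for site, rate, calls in [(1, 0.6242, 165), (8, 0.8101, 179), (10, 0.6522, 115)]:
        margin = precision_margin(rate, calls)
        print(f"Site {site}: +/- {margin * 100:.2f}%")
```

Because the margin shrinks only with the square root of the sample size, a site that falls well short of its planned number of monitored calls can easily exceed a 5 percent margin.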
Appendix V
· Random sampling relies entirely on chance. By this method, every item in the population is given an equal chance of being included in the sample. For example, each call answered by the Internal Revenue Service (IRS) must have an equal chance of selection into its sample or the method is not truly random. Unless each item is selected by a purely random technique, there is no way of measuring later how accurately the sample reflects the characteristics of the population from which it was selected. One major benefit of statistical random sampling is that it permits objective measurement of the reliability of the results. Established mathematical formulas are used to determine the proper sample size necessary to measure the reliability of the results. This sampling method is used when conclusions about the entire population are to be presented. Random sampling methods include simple random, stratified and cluster sampling.
· Judgmental sampling (also known as non-random) does not rely on the principle of chance to select the sample. In this type of sample, the sampler’s best judgment (possibly based on past experience) is used in selecting those items for the sample that are believed to give a representative picture of the universe. Although a judgmental sample may give a good indication of the population, this type of sample does not lend itself to analysis by standard statistical methods such as assessing the reliability of the results. Judgmental samples are also difficult to defend against challenges regarding their validity and reliability. Therefore, the results cannot be used to present conclusions about the whole.
· Convenience sampling (also known as spot-check sampling) is neither a judgmental nor a statistical random (probability) sample. It differs from statistical random sampling in that the items usually included in the sample are “grab” items. This type of sample rests on the illusion that no rule is the best rule for obtaining a representative sample. There is neither a control to assure a known chance of selection nor a system of considered judgment.
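As a minimal illustration of the random sampling definition above, the sketch below draws a simple random sample in which every call in the population has the same chance of selection. The call identifiers, the seed, and the function name `simple_random_sample` are hypothetical. The population size of 11,459 and the sample size of 75 are borrowed from Appendix I only to make the example concrete.

```python
import random


def simple_random_sample(call_ids, sample_size, seed=None):
    """Select calls without replacement, each with an equal chance."""
    rng = random.Random(seed)  # seeded only so the draw can be repeated
    return rng.sample(list(call_ids), sample_size)


# Hypothetical call identifiers numbered 1 through 11,459.
calls = range(1, 11_460)
sample = simple_random_sample(calls, 75, seed=2001)

if __name__ == "__main__":
    print(sorted(sample)[:10])
```

A judgmental or convenience sample, by contrast, has no comparable step that fixes each call's chance of selection in advance, which is why its reliability cannot be measured afterward.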
Appendix VI
Management’s Response to the Draft Report
The response was removed due to its size. To see the complete response, please go to the Adobe PDF version of the report on the TIGTA Public Web Page.