3.1. What Is Transaction Log Analysis?
Transaction log analysis (TLA) was developed about 25 years ago to evaluate system performance. Over the course of a decade, it evolved into a method for unobtrusively studying interactions between online information systems and the people who use them. Today, it is also used to study the use of Web sites. Researchers who conduct TLA rely on transaction monitoring software, whereby the system or Web server automatically records designated interactions for later analysis. Transaction monitoring records the type, if not the content, of selected user actions and system responses. For example, a user submits a query in the OPAC. Both the fact that a query was submitted and the content of that query could be recorded. In response, the system conducts a search and returns a list of results. Both the fact that results were returned and the number of results could be recorded. Transaction monitoring software often captures the date and time of these transactions, and the Internet Protocol (IP) address of the user. The information recorded is stored in an electronic file called a “transaction log.” The contents of transaction logs are usually formatted in fields to facilitate quantitative analysis. Researchers analyze transaction logs to understand how people use online information systems or Web sites with the intention of improving their design and functionality to meet user needs and expectations. The analysis can be conducted manually or automatically, using software or a script to mine data in the logs and generate a report.
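To make the idea of a log "formatted in fields" concrete, the following is a minimal sketch of reading one record. The tab-delimited layout, the field order, and the names `Transaction` and `parse_line` are assumptions made for illustration; real systems each define their own log formats.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transaction:
    timestamp: datetime   # date and time of the transaction
    ip: str               # IP address of the client machine
    action: str           # type of user action, e.g. "SEARCH"
    query: str            # content of the query (empty if not logged)
    result_count: int     # number of results the system returned


def parse_line(line: str) -> Transaction:
    """Parse one tab-delimited record of the hypothetical log format."""
    stamp, ip, action, query, count = line.rstrip("\n").split("\t")
    return Transaction(datetime.fromisoformat(stamp), ip, action, query, int(count))


# A made-up record: a search that returned three results.
sample = "2001-03-14T10:22:05\t192.0.2.17\tSEARCH\tmoby dick\t3"
record = parse_line(sample)
print(record.action, record.query, record.result_count)
```

Once records are parsed into named fields like these, counting, filtering, and grouping become straightforward.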
TLA is an effective method to study such activities as the frequency and sequence of feature use; system response times; hit rates; error rates; user actions to recover from errors; the number of simultaneous users; and session lengths. In the library world, if queries are logged, it can reveal why searches fail to retrieve results and suggest areas for collection development. If the IP addresses of users are logged, it can reveal whether the user is inside or outside of the library. The information extracted from transaction logs can be used to assess patterns of use and trends over time, predict and prepare for times of peak demand, project future system needs and capacities, and develop services or interfaces that support user actions. For TLA to be effective, transaction monitoring software must record meaningful transactions, and data mining must be driven by carefully articulated definitions and purposes.
TLA is an unobtrusive way to study user behavior, an efficient way to gather longitudinal usage data, and an effective way to detect discrepancies between what users say they do (for example in a focus group study) and what they actually do when they use an online system or Web site. Transaction log analysis is also a good way to test hypotheses; for example, to determine whether the placement or configuration of public computers (for example, at stand-up or sit-down stations) in the library affects user behavior.
The primary disadvantages of TLA are that extracting data can be time-consuming and the data can be difficult to interpret. Though systems and servers have been logging transactions for decades, they still do not incorporate software to analyze the logs. If analysis is to be conducted routinely over time, programmers must develop software or scripts to mine the data in transaction logs. If additional information is to be mined, someone must do it manually or the programmer must add this capability to the routine. Often, extracting the data requires discussion and definitions. For example, in stateless, unauthenticated systems such as the Web environment, what constitutes a user session with a Web-based collection or a virtual visit to the library Web site?
Even after the data have been mined, interpreting the patterns or trends discovered in the logs can be problematic. For example, are a large number of queries necessarily better than a small number of queries? What if users are getting better at searching and able to retrieve in a single query what it might have taken them several queries to find a few years ago? Are all searches that retrieve zero results failed searches? What if it was a known-item search and the user just wanted to know whether the library has the book? What constitutes a failed search? Zero results? Too many results? How many is too many? Meaning is contextual, but with TLA, there is no way to connect data in transaction logs with the users’ needs, thoughts, goals, or emotions at the time of the transaction. Interpreting the data requires not only careful definitions of what is being measured but additional research to provide contextual information about the users.
A further disadvantage is that transaction logs can quickly grow to an enormous size. The data must be routinely moved from the server where they are captured to the server where they are analyzed. Keeping log files over time, in case a decision is made to mine additional data from the files, results in massive storage requirements or offline storage that can impede data mining.
3.2. Why Do Libraries Conduct Transaction Log Analysis?
Most of the DLF respondents reported conducting TLA or using TLA data provided by vendors to study use of the library Web site, the OPAC and integrated library system (ILS), licensed electronic resources, and, in some cases, local digital collections and the proxy server. They have used TLA data from local servers to
- Identify user communities
- Identify patterns of use
- Project future needs for services and collections
- Assess user satisfaction
- Inform digital collection development decisions
- Inform the redesign and development of the library Web site
- Assess whether redesign of the library Web site or digital collection has had any impact on use
- Assess whether providing additional content on the library Web site or digital collection has any impact on use
- Target marketing or instruction efforts
- Assess whether marketing or instruction has any impact on use
- Drive examinations of Web page maintenance requirements
- Inform capacity planning and decisions about platform
- Plan system maintenance
- Allocate human and financial resources
Vendor-supplied TLA data from licensed electronic resources have been used to
- Help secure funding for additional e-resources from university administrators
- Inform decisions about what subscriptions or licenses to renew or cancel
- Inform decisions about which interface(s) to keep
- Determine how many ports or simultaneous users to license
- Assess whether instruction has any impact on use of an e-resource
- Determine cost per use of licensed e-resources
3.3. How Do Libraries Conduct Transaction Log Analysis?
3.3.1. Web Sites and Local Digital Collections
Practices vary significantly across institutions, from no analysis to extensive analysis. DLF libraries track use of their Web sites and Web-accessible digital collections using a variety of homegrown, shareware, or commercial software. The server software determines what information is logged and therefore what data are available for mining. The logging occurs automatically, but decisions concerning what data are extracted appear to be guided by library managers, administrators, or committees. As different questions are asked, different data are extracted to answer them. For example, as libraries adopt new measures for digital library use, Web server logs are being mined for data on virtual visits to the library. In some libraries a great deal of discussion is involved in defining such things as a “virtual visit.” In other libraries, programmers are instructed to make their best guesstimate, explain what it is and why they chose it, and use it consistently in mining the logs. As with user studies, the more people involved in making these decisions, the longer it can take. The longer it takes, the longer the library operates without answers to its questions.
Many libraries do not use Web usage data because they do not know how to apply them or do not have the resources to apply them. Some libraries, however, are making creative use of transaction logs from the library Web site and local digital collections to identify user communities, determine patterns of use, inform decisions, assess user satisfaction, and measure the impact of marketing, instruction, interface redesign, and collection development. They do this by mining, interpreting, and applying the following data over time:
- Number of page hits
- Number and type of files downloaded
- Referral URLs (that is, how users get to a Web page)
- Web browser used
- Query logs from “Search this site” features
- Query logs from digital collection (image) databases
- Date and time of the transactions
- IP address or Internet domain of the user
- User IDs (in cases where authentication is required)
In addition, several libraries have begun to count “click throughs” from the library Web site to remote e-resources using a “count use mechanism.” This mechanism captures and records user clicks on links to remote online resources by retrieving and logging retrieval of an intermediate Web page. The intermediate page is retrieved and replaced with the remote resource page so quickly that users do not notice the intermediate page. Writing a script to capture click throughs from the library Web site to remote resources is apparently simple, but the mechanism requires that the links (URLs) to all remote resources on the library Web site be changed to the URL of the intermediate page, which contains the actual URL of the remote resource. Libraries considering implementing a count use mechanism must weigh the cost of these massive revisions against the benefits.
The count use mechanism provides a consistent, comparable count of access to remote e-resources from the library Web site, and it is the only way to track use of licensed resources for which the vendor provides no usage statistics. The data, however, provide an incomplete and inaccurate picture of use of remote resources because users can bookmark resources rather than click through the library Web site to get to them, and because the mechanism counts all attempts to get to remote resources, some of which fail because the server is down or the user does not have access privileges to the resource.
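The count use mechanism described above can be sketched as a small request handler. This is an illustrative sketch only: the `/go` path, the `target` parameter, the in-memory `click_log`, and the function name `handle_click` are all hypothetical, and a production script would write to a log file and check the target against a list of known resources before redirecting.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

click_log = []  # stand-in for a log file (assumption for this sketch)


def handle_click(request_url: str, client_ip: str):
    """Log a click-through, then return (HTTP status, redirect location)."""
    params = parse_qs(urlparse(request_url).query)
    targets = params.get("target")
    if not targets:
        return 400, None  # no remote URL supplied with the request
    target = targets[0]
    # The click is recorded before the redirect, so failed attempts
    # to reach the remote resource are counted too.
    click_log.append((datetime.now(timezone.utc).isoformat(), client_ip, target))
    return 302, target


status, location = handle_click(
    "https://library.example.edu/go?target=https%3A%2F%2Fvendor.example.com%2Fdb",
    "192.0.2.17",
)
print(status, location)
```

Because the log entry is written before the user reaches the remote site, the sketch reproduces the limitation noted above: it counts attempts, not successful accesses.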
3.3.2. OPAC and Integrated Library Systems
Both OPAC and ILS log transactions, but different systems log different information; therefore, each enables analysis of different user activities. For example, some systems simply count different types of transactions. Others log additional information, such as the text of queries, the date and time of the transaction, the IP address and interface of the client machine, and a session ID, which can be used to reconstruct entire user sessions. Systems can provide an on-off feature to allow periodic monitoring and reduce the size of log files, which can grow at a staggering rate if many transactions and details are captured.
Integrated library systems provide a straightforward way for libraries to generate summary reports of such things as the number of catalog searches, the number of items circulated, the number of items used in-house, and the number of new catalog records added within a given period. Use of different interfaces, request features (for example, renewals, holds, recalls, or requests for purchases) and the ability to view borrowing records might also be tracked. This information is extracted from system transaction logs using routine reporting mechanisms provided by the vendor, or special custom report scripts developed either in-house or prepared as work for hire by the vendor for a fee. Customized reports are produced for funding agencies or in response to requests for data relevant to specific problems or pages (for example, subject pages or pathfinders). Often Web and ILS usage data are exported to other tools for further analysis, manipulation, or use; for example, circulation data and the number of queries are exported to spreadsheet software to generate trend lines. In rare cases, Web forms and functionality are provided for staff to generate ad hoc reports.
3.4. Who Uses the Results of Transaction Log Analysis? How Are They Used?
3.4.1. Web Sites and Local Digital Collections
Staff members generate monthly usage reports and distribute or make them available to all staff or to the custodians of the Web pages or digital collection. Overall Web site usage (page hits) or the 10 most heavily used pages might be included in a library’s annual report. However, though usage reports are routinely generated, often the data languish without being used.
At institutions where the data are used, many different people use the data for many different purposes. Interface designers, system managers, collection developers, subject specialists, library administrators, and department heads all reap meaning and devise next steps from examining and interpreting the data. Page hits and referral URLs are used to construct usage patterns over time, understand user needs, and inform interface redesign. For example, frequently used Web pages are placed one to two clicks from the home page; infrequently used links on the home page are moved one to two clicks down in the Web site. Data on heavily used Web pages prompt consideration of whether to expand the information on these pages. Similarly, data on heavily used digital collections prompt consideration of expanding the collection. Subject specialists use the data to understand how people use their subject pages and pathfinders and revise their pages based on this understanding. Page hit counts also drive examination of page maintenance requirements with the understanding that low-use pages and collections should be low maintenance; high-use pages should be well maintained, complete, and up to date. Such assessments facilitate appropriate allocation of resources. Data on low-use or no-use pages can be used to target publicity campaigns. Cross-correlations of marketing efforts and usage statistics are performed to determine whether marketing had any measurable effects on use. Similarly, correlating interface redesign or expansion of content with usage statistics can determine whether redesign or additional content had any effect on use. Data on use of “new” items on the Web site are used to determine whether designating a resource as “new” had any measurable effects on use. Tracking usage patterns over time enables high-level assessments of user satisfaction. For example, are targeted user communities increasingly using the library Web site or digital collection? Do referral URLs indicate that more Web sites are linking to the library Web site or collection?
Query logs are also mined, interpreted, and applied. Frequent queries in “Search this site” logs identify resources to be moved higher in the Web site. Unsuccessful queries target needed changes in Web site vocabulary or content. Query logs from image databases are used to adjust the metadata and vocabulary of digital collections to match the vocabulary and level of specificity of users and to help decide whether the content and organization of digital collections are appropriate to user needs.
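A minimal sketch of this kind of query-log mining follows. It assumes each log record has already been reduced to a pair of query text and result count; the function name `summarize_queries`, the normalization (trimming and lowercasing), and the sample data are illustrative choices, not a description of any library's actual practice.

```python
from collections import Counter


def summarize_queries(records, top=3):
    """records: iterable of (query_text, result_count) pairs.

    Returns the most frequent queries and the most frequent
    queries that retrieved zero results.
    """
    normalized = [(query.strip().lower(), hits) for query, hits in records]
    frequent = Counter(q for q, _ in normalized).most_common(top)
    zero_hits = Counter(q for q, hits in normalized if hits == 0).most_common(top)
    return frequent, zero_hits


# Made-up "Search this site" records.
records = [
    ("Hours", 4), ("hours", 4), ("ILL", 0),
    ("interlibrary loan", 7), ("ill", 0), ("hours ", 4),
]
frequent, zero_hits = summarize_queries(records)
print(frequent)
print(zero_hits)
```

In this invented sample, the repeated zero-result query "ill" is the kind of signal that might prompt adding the abbreviation to the Web site vocabulary.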
TLA also informs system maintenance and strategic planning. Time and date stamps enable the monitoring of usage patterns in the context of the academic year. Libraries have analyzed low-use times of day and day of week to determine good times to take Web servers down for maintenance. Page hits and data on the number and type of files downloaded month-to-month are used to plan load and capacity, to characterize consumption of system resources, to prepare for peak periods of demand, and to make decisions about platform and the appropriate allocation of resources.
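Finding low-use hours from time and date stamps can be sketched as follows. The function name `quietest_hours` and the sample data are assumptions for illustration; a real analysis would likely also separate days of the week and parts of the academic year.

```python
from collections import Counter
from datetime import datetime


def quietest_hours(timestamps, n=3):
    """Return the n hours of the day (0-23) with the fewest logged hits."""
    counts = Counter({hour: 0 for hour in range(24)})  # include hours with no hits
    counts.update(t.hour for t in timestamps)
    # Sort by hit count, breaking ties by the earlier hour.
    return sorted(counts, key=lambda hour: (counts[hour], hour))[:n]


# Made-up data: five hits in every hour except 3:00 and 4:00, which get one each.
stamps = [
    datetime(2001, 3, 14, hour)
    for hour in range(24)
    for _ in range(1 if hour in (3, 4) else 5)
]
print(quietest_hours(stamps, n=2))
```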
Although the use of dynamic IP addresses makes identification of user communities impossible, libraries use static IP addresses and Internet domain information (for example, .edu, .com, .org, .net) in transaction logs to identify broad user communities. Libraries are defining and observing the behavior of different communities. Some libraries track communities of users inside or outside the library. Some track on-campus, off-campus, or international user communities; others track communities in campus dormitories, libraries, offices, computer clusters, or outside the university. In rare cases, static IP addresses and locations are used to affiliate users with a particular school, department, or research center, recognizing that certain IP address locations, such as libraries, dormitories, and public computing clusters, reveal no academic affiliation of the users. Where users are required to authenticate (for example, at the proxy server), the authentication data are mapped to the library patron database to identify communities by school and user status (such as humanities undergraduate). If school and user status are known, some libraries conduct factor analysis to identify clusters of use by user communities.
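Mapping static IP addresses to broad user communities might look like the sketch below. The address ranges are reserved documentation addresses and the community names are invented; each campus would substitute its own network assignments.

```python
import ipaddress

# Hypothetical address ranges; list the most specific ranges first.
COMMUNITIES = [
    ("library", ipaddress.ip_network("192.0.2.0/26")),
    ("dormitory", ipaddress.ip_network("192.0.2.64/26")),
    ("campus-other", ipaddress.ip_network("192.0.2.0/24")),
]


def community_for(ip: str) -> str:
    """Return the first community whose range contains the address."""
    address = ipaddress.ip_address(ip)
    for name, network in COMMUNITIES:
        if address in network:
            return name
    return "off-campus"


for ip in ("192.0.2.17", "192.0.2.70", "192.0.2.200", "198.51.100.5"):
    print(ip, community_for(ip))
```

As the text cautions, a label such as "library" or "dormitory" identifies a location only; it says nothing about the academic affiliation of the person at the machine.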
Having identified user communities in the transaction logs, libraries then track patterns of use by different communities and the distribution of use across communities. For example, IP addresses and time and date stamps of click-through transactions are used to identify user communities and their patterns of using the library Web site to access remote e-resources. IP addresses and time and date stamps of Web site usage are used to track patterns of use inside and outside the libraries. The patterns are then used to project future needs for services and collections. For example, what percentage of use is outside the library? Is remote use increasing over time or across user groups? What percentage of remote use occurs in dormitories (undergraduate students)? What services and collections are necessary to meet the needs of remote users? Patterns of use per user community and resource are used to target publicity about digital collections or Web pages.
3.4.2. OPAC and Integrated Library Systems
OPAC and ILS usage data are used primarily to track trends and provide data for national surveys, for example, circulation per year or items cataloged per year. At some institutions, these data are used to inform decisions. OPAC usage statistics are used to determine usage patterns, customize the OPAC interface, and allocate resources. Seldom-used indexes are removed from the simple search screen and buried lower in the OPAC interface hierarchy. More resources are put into developing the Web interface than the character-based (telnet) interface because usage data show that the former is more heavily used. Libraries shopping for a new ILS frequently use the data to determine the relative importance of different features and required functionality for the new system.
In addition to mining data in transaction logs, some libraries extract other information from the ILS and export it to other tools. For example, e-journal data are exported from the ILS to a Digital Asset Management System (DAMS) to generate Web page listings of e-journals. The journal call numbers are used to map the e-journals to subject areas, and the Web pages are generated using Perl scripts and persistent URLs that resolve to the URLs of the remote e-journal sites. One site participating in the DLF survey routinely exports information from the ILS to a homegrown desktop reporting tool that enables staff to generate ad hoc reports.
3.4.3. Remote Electronic Resources
Library administrators use vendor-provided data on searches, sessions, or full-text use of remote e-resources to lobby for additional funding from university administrators. Data on selected, high-use e-resources might be included in annual reports. Collection developers use the data to determine cost per use of various products and to inform decisions about what subscriptions, licenses, or interfaces to keep or drop. Turn-away data are used to determine how many ports or simultaneous users to license, which could account for why so few vendors provide this information. Reference librarians use the data to determine whether product instruction has any impact on product use. Plans to promote particular products or to conduct research are developed on the basis of data identifying low-use products. Usage data indicate whether promoting a product has any impact on product use. Libraries that require authentication to use licensed resources, capture the authentication data, and map it to the patron database have conducted factor analysis to cluster the use of different products by different user communities. Libraries that compile all of their e-resource usage statistics have correlated digital input and output data to determine, for example, that 22 percent of the total number of licensed e-resources accounts for 70 percent of the total e-resource use.
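Two of the calculations mentioned above, cost per use and the concentration of use among a small share of resources, can be sketched as follows. The function names and all figures are invented for illustration and do not reproduce any respondent's data.

```python
def cost_per_use(annual_cost: float, uses: int) -> float:
    """Annual cost divided by recorded uses (infinite if never used)."""
    return annual_cost / uses if uses else float("inf")


def share_of_resources_for(use_counts, target_share=0.70):
    """Smallest fraction of resources, taken from most to least used,
    whose combined use reaches target_share of total use."""
    ranked = sorted(use_counts, reverse=True)
    total = sum(ranked)
    running = 0
    for n, uses in enumerate(ranked, start=1):
        running += uses
        if running >= target_share * total:
            return n / len(ranked)
    return 1.0


# Made-up annual use counts for ten licensed e-resources (total 1,000 uses).
uses = [600, 200, 60, 40, 30, 25, 20, 15, 5, 5]
print(cost_per_use(12000, 600))
print(share_of_resources_for(uses))
```

In this invented sample, two of the ten resources (20 percent) account for at least 70 percent of total use.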
3.5. What Are the Issues, Problems, and Challenges with Transaction Log Analysis?
3.5.1. Getting the Right (Comparable) Data and Definitions
3.5.1.1. Web Sites and Local Digital Collections
DLF respondents expressed concern that the most readily available usage statistics might not be the most valuable ones. Page hit rates, for example, might be relevant on the open Web, where sites want to document traffic for their advertisers, but on the library Web site, what do high or low hit rates really mean? Because Web site usage changes so much over time, comparing current and past usage statistics presents another challenge.
Despite the level of creative analysis and application of Web usage data at some institutions, even these libraries are not happy with the software they use to analyze Web logs. The logs are large and analysis is cumbersome, sometimes exceeding the capacity of the software. Libraries are simultaneously looking for alternative software and trying to figure out what data are useful to track, how to gather and analyze the data efficiently, and how to present the data appropriately to inform decisions. Ideally, to facilitate comparisons, libraries want the same data on Web page use, the use of local databases or digital collections, and the use of commercially licensed databases and collections.
Libraries also want digital library usage statistics to be comparable with traditional usage statistics. For example, they want to count virtual visits to the library and combine this information with gate counts to get a complete picture of library use. Tracking virtual visits is difficult because in most cases, library Web site and local digital collection use are not authenticated. Authentication automatically associates transactions with a user session, clearly defining a “visit.” In an unauthenticated environment where transactions are associated with IP addresses and public computers are used by many different people, perhaps in rapid succession, defining a visit is not easy.
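One common working definition, offered here only as an assumption and not as a standard, treats a visit as a run of hits from a single IP address with no gap longer than a fixed period of inactivity. A minimal sketch, with a hypothetical 30-minute timeout:

```python
from datetime import datetime, timedelta


def count_visits(hits, timeout=timedelta(minutes=30)):
    """Count visits in a list of (ip, timestamp) hits.

    A new visit starts when an IP address is seen for the first time
    or after more than `timeout` of inactivity. Both the IP-based
    grouping and the 30-minute default are assumptions.
    """
    last_seen = {}
    visits = 0
    for ip, when in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[ip] = when
    return visits


hits = [
    ("192.0.2.17", datetime(2001, 3, 14, 10, 0)),
    ("192.0.2.17", datetime(2001, 3, 14, 10, 10)),
    ("192.0.2.17", datetime(2001, 3, 14, 11, 30)),
    ("192.0.2.44", datetime(2001, 3, 14, 10, 5)),
]
print(count_visits(hits))
```

The sketch also shows why the counts are imprecise: two people using the same public computer within the timeout are merged into one visit, and one person who pauses for longer than the timeout is counted twice.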
While the bulk of the discussion centers on what constitutes a visit and how to count the number of visits, one library participating in the DLF survey wants to gather the following data, though it is unclear why this level of specificity was desirable or how the data would be used:
- Number and percentage of Web site visits at time of day and day of week
- Number and percentage of visits that look at one Web page, 2-4 Web pages, 5-10 Web pages, or more than 10 pages
- Number and percentage of visits that last less than 1 minute, 2-4 minutes, 5-10 minutes, or more than 10 minutes per page, service, or collection
However a visit is defined, in an unauthenticated environment the data will be dirty. Libraries are probably prepared to settle for “good-enough” data, but a standard definition would facilitate comparisons across institutions.
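Given some working definition of a visit, the page-count breakdown that this library wants could be computed along the following lines. The bucket boundaries follow the list above; the function names and sample data are hypothetical.

```python
from collections import Counter


def page_bucket(pages: int) -> str:
    """Assign a visit to a bucket by the number of pages viewed."""
    if pages <= 1:
        return "1 page"
    if pages <= 4:
        return "2-4 pages"
    if pages <= 10:
        return "5-10 pages"
    return "more than 10 pages"


def bucket_percentages(pages_per_visit):
    """Return {bucket: (number of visits, percentage of visits)}."""
    counts = Counter(page_bucket(pages) for pages in pages_per_visit)
    total = len(pages_per_visit)
    return {bucket: (n, round(100 * n / total, 1)) for bucket, n in counts.items()}


# Made-up page counts for eight visits.
report = bucket_percentages([1, 1, 3, 7, 12, 2, 1, 5])
print(report)
```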
Similarly, libraries would like to be able to count e-reserves, e-book, and e-journal use and combine this information with traditional reserves, book, and journal usage statistics to get a complete picture of library use. Again, tracking use of e-resources in a way that is comparable to traditional measures is problematic. Even when e-resources are managed locally, the counts are not comparable, because page hits, not title hits, are logged. Additional work is required to generate hits by title.
In the absence of standards or guidelines, libraries are charting their own course. For example, one site participating in the DLF survey is devising statistics to track use of Web-accessible, low-resolution images, and requests for high-resolution images that are not available on the Web. They are grappling with how to incorporate into their purview metadata from other digital collections available on campus so that they can quantify use of their own content and other campus content. No explanation was offered for how these data would be used.
3.5.1.2. OPAC and Integrated Library Systems
ILS vendors often provide minimal transaction logging because of the high use of the system by staff and end users and the rapid rate with which log files grow to enormous size. When the server is filled with log files, the system ceases to function properly. Many libraries are not satisfied with the data available for mining in their ILS or the routine reporting mechanisms provided by the vendor. Some libraries have developed custom reports in response to requests from library administrators or department heads. These reports are difficult to produce, often requiring expensive Application Program Interface (API) training from the vendor. Many sites want reports that they cannot produce because they do not have the resources or because the system does not log the information they need. For example, if a library wants to assess market penetration of library books, its ILS might not be able to generate a report of the number of unique users who have checked out books within a specified period of time. If administrators want to determine which books to move to off-site storage, their ILS might not be able to generate a report of which books circulated fewer than five times within a specified period of time.
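If a system can export raw checkout records, the two reports used as examples above are simple to compute outside the ILS. The sketch below assumes a hypothetical export of (patron ID, item ID, checkout date) tuples and a list of item IDs; whether a given ILS can produce such an export is exactly the open question.

```python
from collections import Counter
from datetime import date


def circulation_reports(checkouts, catalog, start, end, threshold=5):
    """Return (unique borrowers, low-use items) for the period start..end.

    checkouts: (patron_id, item_id, checkout_date) tuples (assumed export).
    Low-use items circulated fewer than `threshold` times in the period,
    including items in the catalog that were never borrowed.
    """
    in_period = [(patron, item) for patron, item, day in checkouts if start <= day <= end]
    unique_borrowers = len({patron for patron, _ in in_period})
    per_item = Counter(item for _, item in in_period)
    low_use = sorted(item for item in catalog if per_item[item] < threshold)
    return unique_borrowers, low_use


catalog = ["b1", "b2", "b3"]
checkouts = [("p1", "b1", date(2001, 1, d)) for d in range(1, 7)]  # b1 borrowed 6 times
checkouts += [("p2", "b2", date(2001, 2, 1)), ("p3", "b2", date(2002, 1, 1))]
borrowers, low_use = circulation_reports(
    checkouts, catalog, date(2001, 1, 1), date(2001, 12, 31)
)
print(borrowers, low_use)
```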
3.5.1.3. Remote Electronic Resources
Getting the right data from commercial vendors is a well-known problem. Data about use of commercial resources are important to libraries, because use is a measure of service provided and because the high cost of e-resources warrants scrutiny. The data might also be needed to justify subscription expenditures to university administrators. DLF respondents had the usual complaints about vendor-supplied usage statistics:
- The incomparability of the data
- The multiple formats, delivery methods, and schedules for providing the data (for example, e-mail; paper; remote access at the vendor’s Web site; monthly, quarterly, annual, or irregular reporting)
- The lack of useful data (for example, no data on use of specific e-resource titles)
- The lack of intelligible or comprehensible data
- The level of specificity of usage data by IP address
- The failure of some vendors to provide usage data at all
While acknowledging that some vendors are collaborating with libraries and making progress in providing useful statistics, libraries continue to struggle to understand what vendors are actually counting and the time periods covered in their reports. Many libraries distrust vendor-supplied data and rue the inability to corroborate these data. One DLF respondent told a story of a vendor calling to report a large number of turn-aways. The vendor encouraged the library to increase the number of licensed simultaneous users. Instead, the library examined the data, noticed the small number of sessions during that two-day period, concluded that the problem was technical, and did not change its license, which was the right course of action. The number of turn-aways was insignificant thereafter. Another story concerned vendor-supplied data about average session lengths. The vendor reported average session lengths of 25 to 26 minutes, but the vendor does not distinguish time-outs from log-outs. Libraries know that many users neglect to log out and that session length is skewed by users who walk away, leaving the system to time out minutes later.
In the absence of standard definitions and standardized procedures for capturing data about human-computer interactions, libraries cannot compare the results of transaction log analyses across institutions or even across databases and collections within their institutions. Efforts continue to persuade vendors to log standard transactions, extract the data using standard definitions, and provide that information to libraries in standard formats. Meanwhile, libraries remain at the mercy of vendors. Getting meaningful, manageable vendor statistics remains a high priority. Many librarians responsible for licensing e-resources are instructed to discuss usage statistics (see note 6) with vendors before licensing their products. Some librarians are lobbying not to sign contracts if the vendor does not provide good statistics. Nevertheless, vendors know that useful statistics are not yet required to make the sale.
3.5.2. Analyzing and Interpreting the Data
DLF respondents understand that usage statistics are an important measure of library service and, to some degree, an indication of user satisfaction. Usage data must be interpreted cautiously, however, for two reasons. First, usability and user awareness affect the use of library collections and services. Low use can occur because the product’s user interface is difficult to use, because users are unaware that the product is available, or because the product does not meet the users’ information needs. Second, usage statistics do not reveal the users’ experience or perception of the utility or value of a collection or service. For example, though a database or Web page is seldom used, it could be very valuable to those who use it. The bottom line is that usage statistics provide necessary but insufficient data to make strategic decisions. Additional information, gathered from user studies, is required to provide a context in which to interpret usage data.
Many DLF respondents observed that reports generated by TLA are not analyzed and applied. Perhaps this is because the library lacks the resources or skills to do the work. It may also be because the data lack context and interpretation is difficult. Several respondents requested guidance in how to analyze and interpret usage data and diagnose problems, particularly with use of the library Web site.
3.5.3. Managing, Presenting, and Using the Data
DLF libraries reported needing assistance with how to train their staff to use the results of the data analysis. The problem appears to be exacerbated in decentralized library systems and related to the difficulty of compiling and manipulating the sheer bulk of data generated by TLA. Monthly reports of Web site use, digital collection use, and remote e-resource use provide an overwhelming volume of information. Libraries expressed concern that they were not taking full advantage of the information they collect because they do not have the resources to compile it. Vendor statistics are a well-known case in point.
Because of the problems with vendor statistics, management and analysis of the data are cumbersome, tedious, and time-consuming. If the data are compiled in any way, typically only searches, sessions, and full-text use are included for analysis. Some DLF libraries gather and compile statistics from all vendors. Some compile usage statistics only on full-text journals and selected large databases. Some compare data only within products provided by a single vendor, not across products provided by different vendors. Others use data from different vendors to make comparisons that they know are less than perfect, or they try to normalize the data from different vendors to enable cross-product comparisons. For example, one site uses the number of sessions reported by a vendor to predict the number of searches of that vendor’s product based on the ratio of searches to sessions from comparable e-resources. Libraries that compile vendor statistics for staff or consortium perusal provide access to the data using either a spreadsheet or an IP-address-restricted Web page. One site described the painstaking process of producing this Web page: entering data from different vendor reports (e-mail messages, printed reports, and downloaded statistics) into a spreadsheet, then using the spreadsheet to generate graphs and an HTML table for the Web. The time and cost of this activity must be weighed against the benefits of such compilations.
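The normalization example above, predicting searches from sessions, amounts to simple ratio arithmetic. The sketch below pools the comparable resources before taking the ratio, which is one reasonable choice among several; the function name and figures are invented.

```python
def estimate_searches(sessions: int, comparable) -> int:
    """Estimate searches for a product that reports only sessions.

    comparable: (searches, sessions) pairs from similar e-resources.
    The pooled searches-per-session ratio is applied to `sessions`.
    """
    total_searches = sum(searches for searches, _ in comparable)
    total_sessions = sum(count for _, count in comparable)
    ratio = total_searches / total_sessions
    return round(sessions * ratio)


# Made-up figures: two comparable products average 3 searches per session.
comparable = [(3000, 1000), (4500, 1500)]
print(estimate_searches(800, comparable))
```

Such an estimate is only as good as the assumption that the products are used in similar ways, which is why the comparisons are described above as less than perfect.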
Even if e-resource usage data are compiled, libraries struggle with how to organize and present the information to an audience for consideration in decision making and strategic planning. For example, how should monthly usage reports of 800 e-journals be organized? The quality of the presentation can affect the decisions made based on the data. Training is required to make meaningful, persuasive graphical presentations. Libraries need guidance in how to manage, present, and apply usage data effectively.
6. To guide these discussions, libraries are using the International Coalition of Library Consortia (ICOLC) Guidelines for Statistical Measures of Usage of Web-Based Indexed, Abstracted, and Full-Text Resources. Available at: http://www.library.yale.edu/consortia/Webstats.html.