xlrd - Documentation

What is xlrd?

xlrd is a pure Python library for reading data from Microsoft Excel (®) spreadsheet files (.xls). It supports both older .xls files (using the BIFF 8-12 formats) and does not support newer .xlsx, .xlsm, or .xltx files (which use the OOXML format). xlrd is designed to be reliable and efficient, providing a straightforward interface for accessing spreadsheet data. It’s particularly useful for tasks such as data extraction, analysis, and migration from older Excel files. It focuses solely on reading data; it does not offer any capabilities for writing or modifying Excel files.

Installation and Setup

xlrd is readily available through the Python Package Index (PyPI). The easiest way to install it is using pip:

pip install xlrd

This command will download and install xlrd and its dependencies. No additional configuration is typically required. If you encounter issues, ensure that you have a working Python installation and that pip is correctly configured. You may need administrator or root privileges for the installation to succeed.

Basic Usage Example

This example demonstrates how to open a workbook, access a specific sheet, and read cell values:

import xlrd

# Open the workbook
workbook = xlrd.open_workbook('example.xls')

# Get the first sheet (index 0)
sheet = workbook.sheet_by_index(0)  # or sheet = workbook.sheet_by_name('Sheet1')

# Get the number of rows and columns
nrows = sheet.nrows
ncols = sheet.ncols

# Print the values of the first few cells
for row_idx in range(min(nrows, 5)):  # Limit to 5 rows for brevity
    for col_idx in range(min(ncols, 5)):  # Limit to 5 columns for brevity
        cell_value = sheet.cell_value(row_idx, col_idx)
        print(f"Cell ({row_idx}, {col_idx}): {cell_value}")

workbook.release_resources() #Good practice to release resources.
del workbook # Good practice to release resources.

Remember to replace 'example.xls' with the actual path to your Excel file. This code opens the workbook, selects the first sheet, and iterates through a small portion of the data to print the cell values. More sophisticated data access methods are available, as described in the full xlrd documentation.

Supported Excel Versions

xlrd supports Microsoft Excel versions that use the Binary File Format (BIFF) versions 8, 7, 5, 4, 3, 2, and 1. It does not support the newer .xlsx, .xlsm, or .xltx file formats based on the Open XML Spreadsheet format. If you need to work with these newer file types, consider using libraries like openpyxl.

Opening and Reading Workbooks

Opening an Excel File

The primary function for interacting with xlrd is xlrd.open_workbook(). This function takes the file path to your Excel file as its main argument. Several optional arguments allow for customization:

import xlrd

# Basic usage:
workbook = xlrd.open_workbook('my_excel_file.xls')

# Specifying encoding (if necessary):
workbook = xlrd.open_workbook('my_excel_file.xls', encoding_override="latin-1")

# Opening a file-like object:
with open('my_excel_file.xls', 'rb') as f:
    workbook = xlrd.open_workbook(file_contents=f.read())

#  Using on_demand to load sheets lazily:
workbook = xlrd.open_workbook('my_excel_file.xls', on_demand=True)

The encoding_override argument is crucial if your Excel file uses a character encoding other than the default (usually UTF-8). The file_contents argument lets you read from an in-memory buffer instead of a file path. on_demand=True is useful for very large files where you might not need to load all sheets immediately; sheets will be loaded only when accessed. Refer to the full xlrd documentation for a complete list of all possible arguments. Note that failure to specify the correct encoding may lead to garbled characters in your output.

Accessing Workbook Properties

Once a workbook is open, you can access various properties:

import xlrd

workbook = xlrd.open_workbook('my_excel_file.xls')

# Get the number of sheets:
num_sheets = workbook.nsheets

# Get a sheet by index (0-based):
sheet1 = workbook.sheet_by_index(0)

# Get a sheet by name:
sheet2 = workbook.sheet_by_name('Sheet2') #raises error if sheet is not found.

# Get a list of sheet names:
sheet_names = workbook.sheet_names()

# Get workbook information (e.g., author, creation date)

# This requires checking if they are available in your workbook version.
try:
    info = workbook.info()
    print(f"Author: {info.author}")
    print(f"Creation Date: {info.create_ascii}")

except AttributeError:
    print ("This workbook version does not contain author or create_ascii information.")

workbook.release_resources()
del workbook

Error handling is critical, as not all workbooks will contain all properties.

Handling Errors

Various errors can occur when working with xlrd. The most common include:

xlrd.XLRDError: A general error indicating a problem reading the Excel file.
xlrd.SheetNotFoundError: Raised when attempting to access a sheet that doesn’t exist.
xlrd.XLRDError: A general error encompassing various problems during file parsing. More specific error messages are included to help in debugging.

It’s essential to use try-except blocks to handle these exceptions gracefully:

import xlrd

try:
    workbook = xlrd.open_workbook('my_excel_file.xls')
    sheet = workbook.sheet_by_name('Sheet1')
    # ... process the sheet ...
except xlrd.XLRDError as e:
    print(f"Error reading Excel file: {e}")
except xlrd.SheetNotFoundError:
    print("Sheet 'Sheet1' not found in the workbook.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    if 'workbook' in locals() and workbook:
        workbook.release_resources()
        del workbook

Always include proper error handling to prevent your program from crashing due to unexpected file formats or missing sheets.

Closing Workbooks

While Python’s garbage collection will eventually reclaim resources, it’s good practice to explicitly close workbooks when finished, especially when dealing with large files:

import xlrd

workbook = xlrd.open_workbook('my_excel_file.xls')
# ... process the workbook ...
workbook.release_resources()
del workbook

Calling workbook.release_resources() releases the file handles and other resources held by the workbook object, improving efficiency and reducing the risk of resource leaks. del workbook helps ensure the object is removed from memory. This is a key best practice.

Working with Worksheets

Accessing Worksheets by Name and Index

After opening a workbook using xlrd.open_workbook(), you access individual worksheets using either their index (starting from 0) or their name. Both methods are demonstrated below:

import xlrd

workbook = xlrd.open_workbook('my_excel_file.xls')

# Accessing by index:
sheet_by_index = workbook.sheet_by_index(0)  # Gets the first sheet

# Accessing by name:
try:
    sheet_by_name = workbook.sheet_by_name('Sheet1') # Gets the sheet named 'Sheet1'
except xlrd.XLRDError as e:
    print(f"Error: {e}") #Handles the case where the sheet does not exist.


#It is good practice to check if sheets exist before accessing.
sheet_names = workbook.sheet_names()
if "Sheet1" in sheet_names:
    sheet = workbook.sheet_by_name("Sheet1")
    #Do something with the sheet.
else:
    print("Sheet named 'Sheet1' not found.")


workbook.release_resources()
del workbook

Remember that sheet_by_name() will raise a xlrd.XLRDError if the specified sheet name doesn’t exist. Using sheet_names() to check for existence is a safer approach.

Retrieving Worksheet Dimensions

Once you have a worksheet object, you can determine its dimensions using the nrows and ncols attributes:

import xlrd

workbook = xlrd.open_workbook('my_excel_file.xls')
sheet = workbook.sheet_by_index(0)

num_rows = sheet.nrows
num_cols = sheet.ncols

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

workbook.release_resources()
del workbook

These attributes provide the total number of rows and columns in the worksheet, including any empty rows or columns at the end.

Iterating through Rows and Cells

The most common way to process data in a worksheet is to iterate through its rows and cells. Here’s how you can do it:

import xlrd

workbook = xlrd.open_workbook('my_excel_file.xls')
sheet = workbook.sheet_by_index(0)

for row_index in range(sheet.nrows):
    row = sheet.row(row_index) # Get a list of cell objects in current row.
    for cell_index, cell_obj in enumerate(row):
        cell_value = cell_obj.value
        cell_type = cell_obj.ctype #  Get cell type (xlrd.XL_CELL_NUMBER, xlrd.XL_CELL_TEXT, etc.)

        print(f"Row {row_index + 1}, Column {cell_index + 1}: Value = {cell_value}, Type = {cell_type}")

workbook.release_resources()
del workbook

This code iterates through each row and then through each cell within that row. cell_obj.value provides the cell’s value, and cell_obj.ctype provides its type (e.g., number, text, date, boolean, error, blank). Understanding cell types is crucial for handling data correctly, as you might need to perform type conversions for certain operations. Remember that row and column indices are zero-based. Alternative methods for accessing individual cells directly using sheet.cell_value(rowx, colx) are also available for more efficient random access.

Reading Cell Values

Accessing Cell Values

The core of using xlrd is retrieving the values from individual cells. There are several ways to achieve this:

Method 1: Using sheet.cell_value(rowx, colx): This is the most direct method. rowx and colx are the row and column indices (both zero-based).

import xlrd

workbook = xlrd.open_workbook('example.xls')
sheet = workbook.sheet_by_index(0)

cell_value = sheet.cell_value(0, 0)  # Value of cell A1
print(f"Cell A1: {cell_value}")

cell_value = sheet.cell_value(2, 1)  # Value of cell B3
print(f"Cell B3: {cell_value}")

workbook.release_resources()
del workbook

Method 2: Iterating through rows: This is useful for processing entire rows or sheets.

import xlrd

workbook = xlrd.open_workbook('example.xls')
sheet = workbook.sheet_by_index(0)

for row_index in range(sheet.nrows):
    for col_index in range(sheet.ncols):
        cell_value = sheet.cell_value(row_index, col_index)
        # ... process cell_value ...
workbook.release_resources()
del workbook

Method 3: Accessing cells via sheet.row(rowx): This returns a list of Cell objects for a given row.

import xlrd

workbook = xlrd.open_workbook('example.xls')
sheet = workbook.sheet_by_index(0)

row = sheet.row(0) #Get the first row.
for cell in row:
    print(cell.value)

workbook.release_resources()
del workbook

Data Types and Handling

xlrd represents cell values using several data types. The cell.ctype attribute indicates the cell type:

xlrd.XL_CELL_EMPTY: Empty cell. The value will be an empty string.
xlrd.XL_CELL_TEXT: Text string. The value will be a string.
xlrd.XL_CELL_NUMBER: Numeric value. The value will be a float.
xlrd.XL_CELL_DATE: Date value. The value will be a float representing the number of days since a base date. You’ll need to convert this to a Python datetime object (see “Working with Formulas and Dates”).
xlrd.XL_CELL_BOOLEAN: Boolean value. The value will be 1.0 for True and 0.0 for False.
xlrd.XL_CELL_ERROR: Error value. The value will be an integer representing the error code.
xlrd.XL_CELL_BLANK: Blank cell. The value will be an empty string.

You should always check cell.ctype before using the cell.value to avoid type errors.

Dealing with Different Cell Formats

Excel allows for various cell formats (number formats, date formats, etc.). xlrd provides access to these formats through the xf_index attribute of the Cell object and the workbook’s xf_list. However, cell.value gives you the value as Python understands it, so direct formatting is generally unnecessary unless you need to format the output string differently than Python’s default behavior. To get a cell’s formatting information, use sheet.cell_xf_index(rowx, colx) to get the index into workbook.xf_list, then examine the relevant XF object.

Handling Errors in Cell Access

Accessing cells outside the bounds of the sheet or attempting to access invalid cells will raise IndexError. Always check the sheet’s dimensions (sheet.nrows, sheet.ncols) before accessing cells to avoid these errors.

import xlrd

try:
    workbook = xlrd.open_workbook('example.xls')
    sheet = workbook.sheet_by_index(0)
    value = sheet.cell_value(sheet.nrows, 0) #Accessing a row beyond the limit.
except IndexError:
    print("IndexError: Tried to access a cell outside the sheet bounds.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    if 'workbook' in locals() and workbook:
        workbook.release_resources()
        del workbook

Working with Formulas and Dates

xlrd does not evaluate formulas; it returns the calculated result of formulas as they appear in the Excel file, or an error indicator if there was an error in evaluating the formula in the source file. For date values (xlrd.XL_CELL_DATE), the cell.value is a floating-point number representing the number of days since the Excel epoch (often December 30, 1899, or January 1, 1900, depending on the Excel version and settings). You need to convert this using the xldate_as_tuple() function from xlrd:

import xlrd
import datetime

workbook = xlrd.open_workbook('example.xls')
sheet = workbook.sheet_by_index(0)
cell = sheet.cell(0, 0)

if cell.ctype == xlrd.XL_CELL_DATE:
    date_tuple = xlrd.xldate_as_tuple(cell.value, workbook.datemode)
    date_value = datetime.datetime(*date_tuple)
    print(f"Date: {date_value}")

workbook.release_resources()
del workbook

Remember to handle cases where the cell does not contain a date value to avoid errors. workbook.datemode indicates whether the workbook is using the 1900 or 1904 date system.

Advanced Usage

Working with Styles

xlrd provides access to cell styles, allowing you to retrieve formatting information beyond the basic cell type. Each cell has an xf_index attribute pointing to an entry in the workbook’s xf_list. This list contains XF objects, which represent extended formats.

import xlrd

workbook = xlrd.open_workbook('example.xls')
sheet = workbook.sheet_by_index(0)

cell = sheet.cell(0, 0)
xf_index = cell.xf_index
xf = workbook.xf_list[xf_index]

# Access style properties (these may vary depending on Excel version and style):
# font index, alignment, border, pattern, protection, etc.
print(f"Font index: {xf.font_index}")
# ...access other properties of xf...

workbook.release_resources()
del workbook

You will need to refer to the Excel file format specification to fully understand the meaning of each attribute within the XF object. The information available varies slightly depending on the Excel version used to create the spreadsheet.

Accessing Cell Formatting Information

While cell.value provides the cell’s content, understanding formatting is crucial for accurate data interpretation. You can access formatting details through the XF object (as shown in the previous section) and also by using the sheet.cell_xf_index(rowx, colx) method which gives you the index of the Extended Format record associated with the cell at that location. This index then lets you look up the formatting details in workbook.xf_list. The formatting information includes font properties, number formats, alignment, borders, and more. Decoding number formats usually requires additional external resources (not provided directly by xlrd) or careful study of how Excel stores formatting information.

Using Named Ranges

xlrd supports accessing named ranges defined within an Excel workbook. Use workbook.name_map which is a dictionary mapping name strings to a tuple of sheet index and cell range.

import xlrd

workbook = xlrd.open_workbook('example.xls')
named_ranges = workbook.name_map

for name, (sheet_index, (row_start, col_start, row_end, col_end)) in named_ranges.items():
    print(f"Named range: {name}, Sheet index: {sheet_index}, Range: ({row_start}, {col_start}, {row_end}, {col_end})")

sheet = workbook.sheet_by_index(sheet_index)
# Access cells within the named range using row_start, col_start, row_end, col_end


workbook.release_resources()
del workbook

Remember that named ranges are not always present, so you should handle the case where workbook.name_map might be empty.

Handling Hyperlinks

xlrd can extract hyperlink information from cells that contain them. Each cell object has a hyperlink attribute. If a cell contains a hyperlink, this attribute will be a tuple of (URL, label, target sheet, target cell). If there’s no hyperlink, this attribute will be None.

import xlrd

workbook = xlrd.open_workbook('example.xls')
sheet = workbook.sheet_by_index(0)

for row_index in range(sheet.nrows):
    for col_index in range(sheet.ncols):
        cell = sheet.cell(row_index, col_index)
        if cell.hyperlink:
            url, label, target_sheet, target_cell = cell.hyperlink
            print(f"Hyperlink found at ({row_index}, {col_index}): URL = {url}, Label = {label}")

workbook.release_resources()
del workbook

Working with Charts (if supported)

xlrd’s support for charts is extremely limited or nonexistent in many versions. Older versions might offer minimal access to chart information, such as the existence of a chart within a sheet, but the details are not typically well documented or consistently available across different spreadsheet file versions. For robust chart processing you would need a different library. Attempting to access chart data might not produce any meaningful results, and the specific attributes available and their reliability would need careful testing with different Excel files. Directly accessing and parsing the underlying BIFF structures may be required, but this is complex and not recommended unless you have an exceptional need and are deeply familiar with the BIFF format.

Best Practices and Performance

Efficiently Reading Large Files

When working with large Excel files, optimizing your code for performance is crucial. Here are some key strategies:

on_demand=True: When opening the workbook, use the on_demand=True argument in xlrd.open_workbook(). This prevents xlrd from loading all sheets into memory at once. Sheets are loaded only when accessed. This is particularly beneficial if you only need data from a subset of the sheets.
Iterate, Don’t Load Everything: Avoid loading the entire sheet into memory at once. Instead, iterate through rows using a for loop as shown in previous examples. This significantly reduces memory usage.
Process only necessary data: Only access and process the columns and rows you actually need. Avoid unnecessary iteration. Use sheet.cell_value(rowx, colx) to access specific cells efficiently instead of iterating through whole rows when possible.
Use Generators (for very large files): If you are dealing with exceptionally large files, consider using generators to process data in chunks. This will further reduce memory consumption. A generator can yield one row (or a small block of rows) at a time.

Memory Management

workbook.release_resources(): Always call workbook.release_resources() and del workbook when you’re finished with a workbook. This explicitly releases file handles and other resources, preventing potential memory leaks, especially in long-running applications or when processing many files.
Avoid unnecessary variable creation: Minimize the creation of unnecessary variables, particularly large ones. If a variable is no longer needed, explicitly delete it using del. Python’s garbage collector will eventually reclaim memory, but manual deletion can help improve performance.
Use with statement (for file-like objects): If you are reading from a file-like object, use a with statement to ensure proper resource closure even in the event of errors:

import xlrd

with open('my_excel_file.xls', 'rb') as f:
    workbook = xlrd.open_workbook(file_contents=f.read())
    # ... process the workbook ...

#The file will be automatically closed and resources released even if exceptions are raised.

Error Handling Strategies

Robust error handling is essential when working with xlrd, especially for production systems. Here’s how to enhance error handling:

Specific Exception Handling: Catch specific exceptions (xlrd.XLRDError, xlrd.SheetNotFoundError, IndexError, etc.) rather than using a general except Exception block. This allows you to handle different error conditions appropriately, providing more informative error messages and potentially recovering from certain errors.
Context Managers: Use try...except...finally blocks to ensure that resources (especially the workbook) are always released, even if errors occur. The finally block should contain cleanup code.
Logging: Log errors, warnings, and other relevant information to help track down problems in deployed applications. A logging system provides a centralized place to record and analyze issues.
Input Validation: Before processing, validate the input Excel file. Check if the file exists, can be opened, and has the correct format. Handle the cases where files are missing, corrupted, or of an unsupported type.

By following these best practices and error handling strategies, you can make your xlrd applications more robust, efficient, and reliable when dealing with a wide range of Excel files, including large or complex ones.

Troubleshooting and Common Issues

Error Messages and Solutions

xlrd provides informative error messages, but understanding them requires familiarity with the library and potential issues in Excel files. Here are some common error messages and solutions:

xlrd.XLRDError: Unsupported format, or corrupt file: This indicates a problem with the Excel file itself. The file might be corrupt, use an unsupported version of the BIFF format (xlrd does not support .xlsx files), or be malformed. Try opening the file with Excel to verify its integrity. Repairing a corrupt Excel file may be possible using Excel’s built-in repair tools.
xlrd.SheetNotFoundError: Sheet name not found: You’re attempting to access a sheet using its name via workbook.sheet_by_name(), but that sheet name does not exist in the workbook. Double check the sheet name for typos and ensure it exists using workbook.sheet_names().
IndexError: list index out of range: You’re attempting to access a cell using an index that is beyond the bounds of the sheet (sheet.nrows or sheet.ncols). Carefully review your row and column indices, ensuring they are within the valid range.
UnicodeDecodeError or other encoding errors: These indicate a problem with character encoding. The file may not be using UTF-8 encoding. Try specifying the correct encoding using the encoding_override argument in xlrd.open_workbook(). Experiment with different encodings (e.g., “latin-1”, “cp1252”) until you find one that works.
TypeError: This usually happens when you try to perform operations on cell values without checking their type (cell.ctype). Always check the type of a cell value before attempting operations on it (e.g., mathematical calculations, string manipulation). Ensure that you handle the different cell.ctype values appropriately.

Debugging Techniques

Print Statements: The simplest debugging technique is to use print() statements at strategic points in your code to check variable values and the flow of execution. Print the contents of relevant variables, including cell values and types, to track potential errors.
Logging: Integrate logging into your application to systematically record events, errors, and debugging information. Logging provides a more structured approach to debugging, especially for complex applications.
Python Debugger (pdb): Use the Python debugger (pdb) to step through your code line by line, inspect variables, and understand the execution flow.
IDE Debugger: If you are using an Integrated Development Environment (IDE) like PyCharm, VS Code, or Thonny, leverage their built-in debugging tools for more advanced debugging features, such as breakpoints, watch variables, and step-through execution.
Inspecting the Workbook Structure: For complex issues, you might need to examine the structure of the Excel file itself. Tools exist that allow examining the file’s contents at a low level (e.g., examining BIFF structures), but this is a more advanced technique.

Compatibility Issues

Excel Version: xlrd primarily supports older .xls files (BIFF 8-12). It does not support newer .xlsx files. If you need to handle .xlsx, use a different library like openpyxl.
Python Version: Ensure that your Python version is compatible with the version of xlrd you’ve installed. Check the xlrd documentation for supported Python versions.
Operating System: xlrd is generally platform-independent but may encounter minor issues with unusual file system configurations or extremely large files on certain operating systems.

File Corruption Handling

Error Detection: Use try...except blocks to catch xlrd.XLRDError exceptions that may indicate file corruption. Implement logging to record details about the corrupted file.
Data Validation: After reading data, perform validation checks to detect inconsistencies or corrupted data. For example, check for unexpected data types or values in specific cells.
Partial Reading: If a file is partially corrupted, try reading only the portion that is not damaged. You might be able to extract useful data even from a file with minor corruption.
File Repair: Before processing, consider using Excel’s built-in repair functionality to fix minor corruptions. However, if the corruption is significant, this might not be possible.
Fallback Mechanism: Have a fallback mechanism if a file is severely corrupt. This might involve skipping the file, using a default value, or notifying the user about the issue.

Remember to thoroughly test your code with various Excel files (including ones that might be slightly corrupted) to ensure robustness. Implementing these troubleshooting strategies will make your applications more robust and resilient to errors encountered while reading Excel files.

Appendix: xlrd API Reference

This section provides a concise overview of key classes and methods in the xlrd API. For complete and up-to-date information, refer to the official xlrd documentation.

`open_workbook()` method

The open_workbook() function is the entry point for working with xlrd. It opens an Excel (.xls) file and returns a Book object.

Signature:

xlrd.open_workbook(filename=None, logfile=None, verbosity=0, use_mmap=True, file_contents=None, encoding_override=None, formatting_info=False, ragged_rows=False, on_demand=False, ignore_workbook_corruption=False)

Parameters:

filename (str, optional): Path to the Excel file. Required unless file_contents is used.
logfile (file-like object, optional): File to which log messages will be written.
verbosity (int, optional): Controls the level of logging.
use_mmap (bool, optional): Use memory mapping (generally faster for large files).
file_contents (bytes-like object, optional): Allows opening from in-memory data instead of a file.
encoding_override (str, optional): Specifies character encoding if different from the file’s default.
formatting_info (bool, optional): Whether to load formatting information (can increase memory usage).
ragged_rows (bool, optional): Allow for rows with varying numbers of cells.
on_demand (bool, optional): Load sheets only when accessed (improves performance for large files).
ignore_workbook_corruption (bool, optional): Whether to attempt to continue if encountering minor file corruptions.

Returns:

A Book object representing the opened workbook.

Raises:

xlrd.XLRDError if the file cannot be opened or parsed. Other exceptions may be raised depending on the specific problem encountered.

`Book` Class

The Book class represents an Excel workbook. It provides methods to access sheets, workbook properties, and other information.

Key Attributes and Methods:

nsheets: The number of worksheets in the workbook.
sheet_by_index(sheetx): Returns a Sheet object for a given sheet index.
sheet_by_name(sheet_name): Returns a Sheet object for a given sheet name. Raises xlrd.XLRDError if not found.
sheet_names(): Returns a list of sheet names.
xf_list: A list of XF objects describing cell extended formats.
datemode: An integer indicating whether the workbook uses the 1900 or 1904 date system.
name_map: A dictionary mapping named range names to their cell ranges.
release_resources(): Releases resources held by the workbook.

`Sheet` Class

The Sheet class represents a worksheet within a workbook.