xlrd is a pure Python library for reading data from Microsoft Excel (®) spreadsheet files (.xls). It supports both older .xls files (using the BIFF 8-12 formats) and does not support newer .xlsx, .xlsm, or .xltx files (which use the OOXML format). xlrd is designed to be reliable and efficient, providing a straightforward interface for accessing spreadsheet data. It’s particularly useful for tasks such as data extraction, analysis, and migration from older Excel files. It focuses solely on reading data; it does not offer any capabilities for writing or modifying Excel files.
xlrd is readily available through the Python Package Index (PyPI). The easiest way to install it is using pip:
pip install xlrd
This command will download and install xlrd and its dependencies. No additional configuration is typically required. If you encounter issues, ensure that you have a working Python installation and that pip is correctly configured. You may need administrator or root privileges for the installation to succeed.
This example demonstrates how to open a workbook, access a specific sheet, and read cell values:
import xlrd
# Open the workbook
= xlrd.open_workbook('example.xls')
workbook
# Get the first sheet (index 0)
= workbook.sheet_by_index(0) # or sheet = workbook.sheet_by_name('Sheet1')
sheet
# Get the number of rows and columns
= sheet.nrows
nrows = sheet.ncols
ncols
# Print the values of the first few cells
for row_idx in range(min(nrows, 5)): # Limit to 5 rows for brevity
for col_idx in range(min(ncols, 5)): # Limit to 5 columns for brevity
= sheet.cell_value(row_idx, col_idx)
cell_value print(f"Cell ({row_idx}, {col_idx}): {cell_value}")
#Good practice to release resources.
workbook.release_resources() del workbook # Good practice to release resources.
Remember to replace 'example.xls'
with the actual path to your Excel file. This code opens the workbook, selects the first sheet, and iterates through a small portion of the data to print the cell values. More sophisticated data access methods are available, as described in the full xlrd documentation.
xlrd supports Microsoft Excel versions that use the Binary File Format (BIFF) versions 8, 7, 5, 4, 3, 2, and 1. It does not support the newer .xlsx, .xlsm, or .xltx file formats based on the Open XML Spreadsheet format. If you need to work with these newer file types, consider using libraries like openpyxl
.
The primary function for interacting with xlrd is xlrd.open_workbook()
. This function takes the file path to your Excel file as its main argument. Several optional arguments allow for customization:
import xlrd
# Basic usage:
= xlrd.open_workbook('my_excel_file.xls')
workbook
# Specifying encoding (if necessary):
= xlrd.open_workbook('my_excel_file.xls', encoding_override="latin-1")
workbook
# Opening a file-like object:
with open('my_excel_file.xls', 'rb') as f:
= xlrd.open_workbook(file_contents=f.read())
workbook
# Using on_demand to load sheets lazily:
= xlrd.open_workbook('my_excel_file.xls', on_demand=True) workbook
The encoding_override
argument is crucial if your Excel file uses a character encoding other than the default (usually UTF-8). The file_contents
argument lets you read from an in-memory buffer instead of a file path. on_demand=True
is useful for very large files where you might not need to load all sheets immediately; sheets will be loaded only when accessed. Refer to the full xlrd documentation for a complete list of all possible arguments. Note that failure to specify the correct encoding may lead to garbled characters in your output.
Once a workbook is open, you can access various properties:
import xlrd
= xlrd.open_workbook('my_excel_file.xls')
workbook
# Get the number of sheets:
= workbook.nsheets
num_sheets
# Get a sheet by index (0-based):
= workbook.sheet_by_index(0)
sheet1
# Get a sheet by name:
= workbook.sheet_by_name('Sheet2') #raises error if sheet is not found.
sheet2
# Get a list of sheet names:
= workbook.sheet_names()
sheet_names
# Get workbook information (e.g., author, creation date)
# This requires checking if they are available in your workbook version.
try:
= workbook.info()
info print(f"Author: {info.author}")
print(f"Creation Date: {info.create_ascii}")
except AttributeError:
print ("This workbook version does not contain author or create_ascii information.")
workbook.release_resources()del workbook
Error handling is critical, as not all workbooks will contain all properties.
Various errors can occur when working with xlrd. The most common include:
xlrd.XLRDError
: A general error indicating a problem reading the Excel file.xlrd.SheetNotFoundError
: Raised when attempting to access a sheet that doesn’t exist.xlrd.XLRDError
: A general error encompassing various problems during file parsing. More specific error messages are included to help in debugging.It’s essential to use try-except
blocks to handle these exceptions gracefully:
import xlrd
try:
= xlrd.open_workbook('my_excel_file.xls')
workbook = workbook.sheet_by_name('Sheet1')
sheet # ... process the sheet ...
except xlrd.XLRDError as e:
print(f"Error reading Excel file: {e}")
except xlrd.SheetNotFoundError:
print("Sheet 'Sheet1' not found in the workbook.")
except Exception as e:
print(f"An unexpected error occurred: {e}")
finally:
if 'workbook' in locals() and workbook:
workbook.release_resources()del workbook
Always include proper error handling to prevent your program from crashing due to unexpected file formats or missing sheets.
While Python’s garbage collection will eventually reclaim resources, it’s good practice to explicitly close workbooks when finished, especially when dealing with large files:
import xlrd
= xlrd.open_workbook('my_excel_file.xls')
workbook # ... process the workbook ...
workbook.release_resources()del workbook
Calling workbook.release_resources()
releases the file handles and other resources held by the workbook object, improving efficiency and reducing the risk of resource leaks. del workbook
helps ensure the object is removed from memory. This is a key best practice.
After opening a workbook using xlrd.open_workbook()
, you access individual worksheets using either their index (starting from 0) or their name. Both methods are demonstrated below:
import xlrd
= xlrd.open_workbook('my_excel_file.xls')
workbook
# Accessing by index:
= workbook.sheet_by_index(0) # Gets the first sheet
sheet_by_index
# Accessing by name:
try:
= workbook.sheet_by_name('Sheet1') # Gets the sheet named 'Sheet1'
sheet_by_name except xlrd.XLRDError as e:
print(f"Error: {e}") #Handles the case where the sheet does not exist.
#It is good practice to check if sheets exist before accessing.
= workbook.sheet_names()
sheet_names if "Sheet1" in sheet_names:
= workbook.sheet_by_name("Sheet1")
sheet #Do something with the sheet.
else:
print("Sheet named 'Sheet1' not found.")
workbook.release_resources()del workbook
Remember that sheet_by_name()
will raise a xlrd.XLRDError
if the specified sheet name doesn’t exist. Using sheet_names()
to check for existence is a safer approach.
Once you have a worksheet object, you can determine its dimensions using the nrows
and ncols
attributes:
import xlrd
= xlrd.open_workbook('my_excel_file.xls')
workbook = workbook.sheet_by_index(0)
sheet
= sheet.nrows
num_rows = sheet.ncols
num_cols
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")
workbook.release_resources()del workbook
These attributes provide the total number of rows and columns in the worksheet, including any empty rows or columns at the end.
The most common way to process data in a worksheet is to iterate through its rows and cells. Here’s how you can do it:
import xlrd
= xlrd.open_workbook('my_excel_file.xls')
workbook = workbook.sheet_by_index(0)
sheet
for row_index in range(sheet.nrows):
= sheet.row(row_index) # Get a list of cell objects in current row.
row for cell_index, cell_obj in enumerate(row):
= cell_obj.value
cell_value = cell_obj.ctype # Get cell type (xlrd.XL_CELL_NUMBER, xlrd.XL_CELL_TEXT, etc.)
cell_type
print(f"Row {row_index + 1}, Column {cell_index + 1}: Value = {cell_value}, Type = {cell_type}")
workbook.release_resources()del workbook
This code iterates through each row and then through each cell within that row. cell_obj.value
provides the cell’s value, and cell_obj.ctype
provides its type (e.g., number, text, date, boolean, error, blank). Understanding cell types is crucial for handling data correctly, as you might need to perform type conversions for certain operations. Remember that row and column indices are zero-based. Alternative methods for accessing individual cells directly using sheet.cell_value(rowx, colx)
are also available for more efficient random access.
The core of using xlrd is retrieving the values from individual cells. There are several ways to achieve this:
Method 1: Using sheet.cell_value(rowx, colx)
: This is the most direct method. rowx
and colx
are the row and column indices (both zero-based).
import xlrd
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet
= sheet.cell_value(0, 0) # Value of cell A1
cell_value print(f"Cell A1: {cell_value}")
= sheet.cell_value(2, 1) # Value of cell B3
cell_value print(f"Cell B3: {cell_value}")
workbook.release_resources()del workbook
Method 2: Iterating through rows: This is useful for processing entire rows or sheets.
import xlrd
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet
for row_index in range(sheet.nrows):
for col_index in range(sheet.ncols):
= sheet.cell_value(row_index, col_index)
cell_value # ... process cell_value ...
workbook.release_resources()del workbook
Method 3: Accessing cells via sheet.row(rowx)
: This returns a list of Cell
objects for a given row.
import xlrd
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet
= sheet.row(0) #Get the first row.
row for cell in row:
print(cell.value)
workbook.release_resources()del workbook
xlrd represents cell values using several data types. The cell.ctype
attribute indicates the cell type:
xlrd.XL_CELL_EMPTY
: Empty cell. The value will be an empty string.xlrd.XL_CELL_TEXT
: Text string. The value will be a string.xlrd.XL_CELL_NUMBER
: Numeric value. The value will be a float.xlrd.XL_CELL_DATE
: Date value. The value will be a float representing the number of days since a base date. You’ll need to convert this to a Python datetime
object (see “Working with Formulas and Dates”).xlrd.XL_CELL_BOOLEAN
: Boolean value. The value will be 1.0 for True and 0.0 for False.xlrd.XL_CELL_ERROR
: Error value. The value will be an integer representing the error code.xlrd.XL_CELL_BLANK
: Blank cell. The value will be an empty string.You should always check cell.ctype
before using the cell.value
to avoid type errors.
Excel allows for various cell formats (number formats, date formats, etc.). xlrd provides access to these formats through the xf_index
attribute of the Cell
object and the workbook’s xf_list
. However, cell.value
gives you the value as Python understands it, so direct formatting is generally unnecessary unless you need to format the output string differently than Python’s default behavior. To get a cell’s formatting information, use sheet.cell_xf_index(rowx, colx)
to get the index into workbook.xf_list
, then examine the relevant XF
object.
Accessing cells outside the bounds of the sheet or attempting to access invalid cells will raise IndexError
. Always check the sheet’s dimensions (sheet.nrows
, sheet.ncols
) before accessing cells to avoid these errors.
import xlrd
try:
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet = sheet.cell_value(sheet.nrows, 0) #Accessing a row beyond the limit.
value except IndexError:
print("IndexError: Tried to access a cell outside the sheet bounds.")
except Exception as e:
print(f"An unexpected error occurred: {e}")
finally:
if 'workbook' in locals() and workbook:
workbook.release_resources()del workbook
xlrd does not evaluate formulas; it returns the calculated result of formulas as they appear in the Excel file, or an error indicator if there was an error in evaluating the formula in the source file. For date values (xlrd.XL_CELL_DATE
), the cell.value
is a floating-point number representing the number of days since the Excel epoch (often December 30, 1899, or January 1, 1900, depending on the Excel version and settings). You need to convert this using the xldate_as_tuple()
function from xlrd
:
import xlrd
import datetime
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet = sheet.cell(0, 0)
cell
if cell.ctype == xlrd.XL_CELL_DATE:
= xlrd.xldate_as_tuple(cell.value, workbook.datemode)
date_tuple = datetime.datetime(*date_tuple)
date_value print(f"Date: {date_value}")
workbook.release_resources()del workbook
Remember to handle cases where the cell does not contain a date value to avoid errors. workbook.datemode
indicates whether the workbook is using the 1900 or 1904 date system.
xlrd provides access to cell styles, allowing you to retrieve formatting information beyond the basic cell type. Each cell has an xf_index
attribute pointing to an entry in the workbook’s xf_list
. This list contains XF
objects, which represent extended formats.
import xlrd
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet
= sheet.cell(0, 0)
cell = cell.xf_index
xf_index = workbook.xf_list[xf_index]
xf
# Access style properties (these may vary depending on Excel version and style):
# font index, alignment, border, pattern, protection, etc.
print(f"Font index: {xf.font_index}")
# ...access other properties of xf...
workbook.release_resources()del workbook
You will need to refer to the Excel file format specification to fully understand the meaning of each attribute within the XF object. The information available varies slightly depending on the Excel version used to create the spreadsheet.
While cell.value
provides the cell’s content, understanding formatting is crucial for accurate data interpretation. You can access formatting details through the XF
object (as shown in the previous section) and also by using the sheet.cell_xf_index(rowx, colx)
method which gives you the index of the Extended Format record associated with the cell at that location. This index then lets you look up the formatting details in workbook.xf_list
. The formatting information includes font properties, number formats, alignment, borders, and more. Decoding number formats usually requires additional external resources (not provided directly by xlrd) or careful study of how Excel stores formatting information.
xlrd supports accessing named ranges defined within an Excel workbook. Use workbook.name_map
which is a dictionary mapping name strings to a tuple of sheet index and cell range.
import xlrd
= xlrd.open_workbook('example.xls')
workbook = workbook.name_map
named_ranges
for name, (sheet_index, (row_start, col_start, row_end, col_end)) in named_ranges.items():
print(f"Named range: {name}, Sheet index: {sheet_index}, Range: ({row_start}, {col_start}, {row_end}, {col_end})")
= workbook.sheet_by_index(sheet_index)
sheet # Access cells within the named range using row_start, col_start, row_end, col_end
workbook.release_resources()del workbook
Remember that named ranges are not always present, so you should handle the case where workbook.name_map
might be empty.
xlrd can extract hyperlink information from cells that contain them. Each cell object has a hyperlink
attribute. If a cell contains a hyperlink, this attribute will be a tuple of (URL, label, target sheet, target cell). If there’s no hyperlink, this attribute will be None
.
import xlrd
= xlrd.open_workbook('example.xls')
workbook = workbook.sheet_by_index(0)
sheet
for row_index in range(sheet.nrows):
for col_index in range(sheet.ncols):
= sheet.cell(row_index, col_index)
cell if cell.hyperlink:
= cell.hyperlink
url, label, target_sheet, target_cell print(f"Hyperlink found at ({row_index}, {col_index}): URL = {url}, Label = {label}")
workbook.release_resources()del workbook
xlrd’s support for charts is extremely limited or nonexistent in many versions. Older versions might offer minimal access to chart information, such as the existence of a chart within a sheet, but the details are not typically well documented or consistently available across different spreadsheet file versions. For robust chart processing you would need a different library. Attempting to access chart data might not produce any meaningful results, and the specific attributes available and their reliability would need careful testing with different Excel files. Directly accessing and parsing the underlying BIFF structures may be required, but this is complex and not recommended unless you have an exceptional need and are deeply familiar with the BIFF format.
When working with large Excel files, optimizing your code for performance is crucial. Here are some key strategies:
on_demand=True
: When opening the workbook, use the on_demand=True
argument in xlrd.open_workbook()
. This prevents xlrd from loading all sheets into memory at once. Sheets are loaded only when accessed. This is particularly beneficial if you only need data from a subset of the sheets.
Iterate, Don’t Load Everything: Avoid loading the entire sheet into memory at once. Instead, iterate through rows using a for
loop as shown in previous examples. This significantly reduces memory usage.
Process only necessary data: Only access and process the columns and rows you actually need. Avoid unnecessary iteration. Use sheet.cell_value(rowx, colx)
to access specific cells efficiently instead of iterating through whole rows when possible.
Use Generators (for very large files): If you are dealing with exceptionally large files, consider using generators to process data in chunks. This will further reduce memory consumption. A generator can yield one row (or a small block of rows) at a time.
workbook.release_resources()
: Always call workbook.release_resources()
and del workbook
when you’re finished with a workbook. This explicitly releases file handles and other resources, preventing potential memory leaks, especially in long-running applications or when processing many files.
Avoid unnecessary variable creation: Minimize the creation of unnecessary variables, particularly large ones. If a variable is no longer needed, explicitly delete it using del
. Python’s garbage collector will eventually reclaim memory, but manual deletion can help improve performance.
Use with statement (for file-like objects): If you are reading from a file-like object, use a with
statement to ensure proper resource closure even in the event of errors:
import xlrd
with open('my_excel_file.xls', 'rb') as f:
= xlrd.open_workbook(file_contents=f.read())
workbook # ... process the workbook ...
#The file will be automatically closed and resources released even if exceptions are raised.
Robust error handling is essential when working with xlrd, especially for production systems. Here’s how to enhance error handling:
Specific Exception Handling: Catch specific exceptions (xlrd.XLRDError
, xlrd.SheetNotFoundError
, IndexError
, etc.) rather than using a general except Exception
block. This allows you to handle different error conditions appropriately, providing more informative error messages and potentially recovering from certain errors.
Context Managers: Use try...except...finally
blocks to ensure that resources (especially the workbook) are always released, even if errors occur. The finally
block should contain cleanup code.
Logging: Log errors, warnings, and other relevant information to help track down problems in deployed applications. A logging system provides a centralized place to record and analyze issues.
Input Validation: Before processing, validate the input Excel file. Check if the file exists, can be opened, and has the correct format. Handle the cases where files are missing, corrupted, or of an unsupported type.
By following these best practices and error handling strategies, you can make your xlrd applications more robust, efficient, and reliable when dealing with a wide range of Excel files, including large or complex ones.
xlrd provides informative error messages, but understanding them requires familiarity with the library and potential issues in Excel files. Here are some common error messages and solutions:
xlrd.XLRDError: Unsupported format, or corrupt file
: This indicates a problem with the Excel file itself. The file might be corrupt, use an unsupported version of the BIFF format (xlrd does not support .xlsx files), or be malformed. Try opening the file with Excel to verify its integrity. Repairing a corrupt Excel file may be possible using Excel’s built-in repair tools.
xlrd.SheetNotFoundError: Sheet name not found
: You’re attempting to access a sheet using its name via workbook.sheet_by_name()
, but that sheet name does not exist in the workbook. Double check the sheet name for typos and ensure it exists using workbook.sheet_names()
.
IndexError: list index out of range
: You’re attempting to access a cell using an index that is beyond the bounds of the sheet (sheet.nrows
or sheet.ncols
). Carefully review your row and column indices, ensuring they are within the valid range.
UnicodeDecodeError
or other encoding errors: These indicate a problem with character encoding. The file may not be using UTF-8 encoding. Try specifying the correct encoding using the encoding_override
argument in xlrd.open_workbook()
. Experiment with different encodings (e.g., “latin-1”, “cp1252”) until you find one that works.
TypeError
: This usually happens when you try to perform operations on cell values without checking their type (cell.ctype
). Always check the type of a cell value before attempting operations on it (e.g., mathematical calculations, string manipulation). Ensure that you handle the different cell.ctype
values appropriately.
Print Statements: The simplest debugging technique is to use print()
statements at strategic points in your code to check variable values and the flow of execution. Print the contents of relevant variables, including cell values and types, to track potential errors.
Logging: Integrate logging into your application to systematically record events, errors, and debugging information. Logging provides a more structured approach to debugging, especially for complex applications.
Python Debugger (pdb
): Use the Python debugger (pdb
) to step through your code line by line, inspect variables, and understand the execution flow.
IDE Debugger: If you are using an Integrated Development Environment (IDE) like PyCharm, VS Code, or Thonny, leverage their built-in debugging tools for more advanced debugging features, such as breakpoints, watch variables, and step-through execution.
Inspecting the Workbook Structure: For complex issues, you might need to examine the structure of the Excel file itself. Tools exist that allow examining the file’s contents at a low level (e.g., examining BIFF structures), but this is a more advanced technique.
Excel Version: xlrd primarily supports older .xls files (BIFF 8-12). It does not support newer .xlsx files. If you need to handle .xlsx, use a different library like openpyxl
.
Python Version: Ensure that your Python version is compatible with the version of xlrd you’ve installed. Check the xlrd documentation for supported Python versions.
Operating System: xlrd is generally platform-independent but may encounter minor issues with unusual file system configurations or extremely large files on certain operating systems.
Error Detection: Use try...except
blocks to catch xlrd.XLRDError
exceptions that may indicate file corruption. Implement logging to record details about the corrupted file.
Data Validation: After reading data, perform validation checks to detect inconsistencies or corrupted data. For example, check for unexpected data types or values in specific cells.
Partial Reading: If a file is partially corrupted, try reading only the portion that is not damaged. You might be able to extract useful data even from a file with minor corruption.
File Repair: Before processing, consider using Excel’s built-in repair functionality to fix minor corruptions. However, if the corruption is significant, this might not be possible.
Fallback Mechanism: Have a fallback mechanism if a file is severely corrupt. This might involve skipping the file, using a default value, or notifying the user about the issue.
Remember to thoroughly test your code with various Excel files (including ones that might be slightly corrupted) to ensure robustness. Implementing these troubleshooting strategies will make your applications more robust and resilient to errors encountered while reading Excel files.
This section provides a concise overview of key classes and methods in the xlrd API. For complete and up-to-date information, refer to the official xlrd documentation.
open_workbook()
methodThe open_workbook()
function is the entry point for working with xlrd. It opens an Excel (.xls) file and returns a Book
object.
Signature:
=None, logfile=None, verbosity=0, use_mmap=True, file_contents=None, encoding_override=None, formatting_info=False, ragged_rows=False, on_demand=False, ignore_workbook_corruption=False) xlrd.open_workbook(filename
Parameters:
filename
(str, optional): Path to the Excel file. Required unless file_contents
is used.logfile
(file-like object, optional): File to which log messages will be written.verbosity
(int, optional): Controls the level of logging.use_mmap
(bool, optional): Use memory mapping (generally faster for large files).file_contents
(bytes-like object, optional): Allows opening from in-memory data instead of a file.encoding_override
(str, optional): Specifies character encoding if different from the file’s default.formatting_info
(bool, optional): Whether to load formatting information (can increase memory usage).ragged_rows
(bool, optional): Allow for rows with varying numbers of cells.on_demand
(bool, optional): Load sheets only when accessed (improves performance for large files).ignore_workbook_corruption
(bool, optional): Whether to attempt to continue if encountering minor file corruptions.Returns:
A Book
object representing the opened workbook.
Raises:
xlrd.XLRDError
if the file cannot be opened or parsed. Other exceptions may be raised depending on the specific problem encountered.
Book
ClassThe Book
class represents an Excel workbook. It provides methods to access sheets, workbook properties, and other information.
Key Attributes and Methods:
nsheets
: The number of worksheets in the workbook.sheet_by_index(sheetx)
: Returns a Sheet
object for a given sheet index.sheet_by_name(sheet_name)
: Returns a Sheet
object for a given sheet name. Raises xlrd.XLRDError
if not found.sheet_names()
: Returns a list of sheet names.xf_list
: A list of XF
objects describing cell extended formats.datemode
: An integer indicating whether the workbook uses the 1900 or 1904 date system.name_map
: A dictionary mapping named range names to their cell ranges.release_resources()
: Releases resources held by the workbook.Sheet
ClassThe Sheet
class represents a worksheet within a workbook.
Key Attributes and Methods:
nrows
: The number of rows in the sheet.ncols
: The number of columns in the sheet.name
: The name of the sheet.row(rowx)
: Returns a list of Cell
objects for a given row.cell(rowx, colx)
: Returns a Cell
object for a given cell.cell_value(rowx, colx)
: Returns the value of a cell.cell_type(rowx, colx)
: Returns the type of a cell (xlrd.XL_CELL_*
).cell_xf_index(rowx, colx)
: Returns the index into workbook.xf_list
for the cell’s extended format.Cell
ClassThe Cell
class represents a single cell within a worksheet.
Key Attributes:
value
: The value of the cell (string, number, date, boolean, error, or empty).ctype
: The type of the cell (xlrd.XL_CELL_*
).xf_index
: The index into the workbook’s xf_list
for the cell’s extended format.hyperlink
: A tuple containing hyperlink information (URL, label, target sheet, target cell) or None if no hyperlink is present.This API reference provides a starting point. The full xlrd documentation should be consulted for a complete and detailed description of all classes, methods, and attributes. Remember that the specifics of certain attributes (like those within XF
objects) may depend on the version of Excel that created the file.