
Read Data and Save in Dictionary in Python

How to extract specific portions of a text file using Python

Updated: 06/30/2020 by Computer Hope


Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.

Make sure you're using Python 3

In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.

For Microsoft Windows, Python 3 can be downloaded from the official Python website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked, as shown in the image below.

Installing Python 3.7.2 for Windows.

On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:

sudo apt-get update && sudo apt-get install python3

For macOS, the Python 3 installer can be downloaded from python.org, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:

brew install python3

Running Python

On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.

Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().

Running Python with a file name will interpret that Python program. For instance:

python3 program.py

...runs the program contained in the file program.py.

Okay, how can we use Python to extract text from a text file?

Reading data from a text file

First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Note

In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.

A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.

myfile = open("lorem.txt", "rt") # open lorem.txt for reading text
contents = myfile.read()         # read the entire file to string
myfile.close()                   # close the file
print(contents)                  # print string contents

Here, myfile is the name we give to our file object.

The "rt" parameter in the open() function means "we're opening this file to read text data."

The hash mark ("#") means that the rest of that line is a comment, and it's ignored by the Python interpreter.

If you save this program in a file called read.py, you can run it with the following command.

python3 read.py

The command above outputs the contents of lorem.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Using "with open"

It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.

When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.

Using with open...as, we can rewrite our program to look like this:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string

Note

Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.

Example

Save the program as read.py and execute it:

python3 read.py

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
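To see the automatic closing for yourself, here is a minimal sketch. It writes its own one-line sample file to a temporary folder so it runs without lorem.txt (the names sample_path, open_inside, and closed_after are made up for this illustration), then checks the file object's closed attribute inside and after the with block.

```python
import os
import tempfile

# Create a small sample file so this sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "lorem.txt")
with open(sample_path, "wt") as f:
    f.write("Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n")

with open(sample_path, "rt") as myfile:
    contents = myfile.read()
    open_inside = myfile.closed   # False: the file is still open here

closed_after = myfile.closed      # True: the with block closed it for us
print(open_inside, closed_after)
```

The file is closed even if the code inside the block raises an error, which is the main reason to prefer this form over calling close() by hand.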

Reading text files line-by-line

In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.

In almost every case, it's a better idea to read a text file one line at a time.

In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
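The "next result" idea can be made concrete with the built-in next() function, which is what a for loop calls behind the scenes. This is only a sketch: it creates a two-line sample file in a temporary folder (sample_path and the file contents are assumptions for illustration) and pulls lines from it one at a time.

```python
import os
import tempfile

# Write a two-line sample file so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wt") as f:
    f.write("first line\nsecond line\n")

with open(sample_path, "rt") as myfile:
    first = next(myfile)           # each call returns the next line
    second = next(myfile)
    leftover = next(myfile, None)  # None: the iterator is exhausted

print(repr(first), repr(second), leftover)
```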

Example

For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a line break of its own at the end of whatever you've asked it to print.
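One way to make both newlines visible is to look at the raw strings with repr(). The sketch below uses a hard-coded string in place of a line read from the file, and captures what print() writes in an in-memory buffer (io.StringIO) so we can inspect it.

```python
import io

myline = "Quisque at dignissim lacus.\n"   # a line as it comes from the file

buffer = io.StringIO()
print(myline, file=buffer)                 # print into the buffer, not the screen
printed = buffer.getvalue()

print(repr(myline))    # one \n: the newline stored in the file
print(repr(printed))   # two \n: the file's newline plus the one print() adds
```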

Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.

Storing text data in a variable

In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.

Example

mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.

The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:

Output:

['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']

Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.

Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero. In other words, the nth element of a list has the numeric index n-1.

Note

If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.

Example

We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:

print(mylines[0])

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Example

Or the third line, by specifying index number 2:

print(mylines[2])

Output:

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the highest valid index is 3:

Example

print(mylines[4])

Output:

Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
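If a program can't be sure how many lines it has, it can guard against this error. The sketch below is one possible approach, not the only one: get_line is a hypothetical helper written for this illustration, and the list is stand-in data rather than the contents of lorem.txt.

```python
mylines = ["first", "second", "third", "fourth"]   # stand-in data

last_index = len(mylines) - 1    # the nth element has index n-1


def get_line(lines, index):
    """Return lines[index], or None if the index is out of range."""
    try:
        return lines[index]
    except IndexError:
        return None


print(get_line(mylines, last_index))   # the last element
print(get_line(mylines, 4))            # None instead of a traceback
```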

Example

A list is iterable, so to print every element of the list, we can iterate over it with for...in:

mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open lorem.txt for reading text.
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
    for element in mylines:               # For each element in the list,
        print(element)                    # print it.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.

We can change this default behavior by specifying an end parameter in our print() call:

print(element, end='')

By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.

Example

Our revised program looks like this:

mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open file lorem.txt
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
    for element in mylines:               # For each element in the list,
        print(element, end='')            # print it without extra newlines.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.

How to strip newlines

To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.

Tip

This process is sometimes also called "trimming."

Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.

If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string listing the characters to strip. Every trailing character that appears in chars is removed, in any order. For example, "123abc".rstrip("bc") returns 123a.
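A few variations show how the chars argument behaves. This is a small sketch using made-up strings; the variable names are only for illustration.

```python
line = "Quisque at dignissim lacus.  \n"

only_newline = line.rstrip("\n")    # removes the newline, keeps the spaces
all_whitespace = line.rstrip()      # no argument: removes all trailing whitespace
example = "123abc".rstrip("bc")     # removes trailing b's and c's
char_set = "abcbcb".rstrip("bc")    # chars is a set of characters, not a suffix

print(repr(only_newline))
print(repr(all_whitespace))
print(example, char_set)
```

Passing '\n' explicitly, as the examples below do, leaves any other trailing whitespace in the line untouched.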

Tip

When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted: enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.

The statement string.rstrip('\n') will strip any newline characters from the right side of string. The following version of our program strips the newlines when each line is read from the text file:

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.
for element in mylines:                     # For each element in the list,
    print(element)                          # print it.

The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.

Now, let's search the lines in the list for a specific substring.

Searching text for a substring

Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.

The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.

Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.

In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string without finding one, it returns -1 to indicate nothing was found.

Example

print(mylines[0].find("e"))

Output:

3

The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)

The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For instance, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (index 10 up to, but not including, index 20). If stop is not specified, find() starts at index start, and stops at the end of the string.

Example

For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.

print(mylines[0].find("e", 4))

Output:

24

In other words, starting at the fifth character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").

Example

To start searching the third line at index 10, and stop at index 30:

print(mylines[2].find("e", 10, 30))

Output:

28

(The first "e" in "Maecenas").

If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:

print(mylines[0].find("e", 25, 30))

Output:

-1

There were no "e" occurrences between indices 25 and 30.
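The way the start and stop indices bound the search can be checked on a short string. This is a minimal sketch with a made-up string, where every position is easy to count by hand.

```python
text = "abc-abc-abc"

first = text.find("abc")           # search the whole string
from_one = text.find("abc", 1)     # skip index 0, so the next match is found
within = text.find("abc", 1, 7)    # searches indices 1 through 6
bounded = text.find("abc", 1, 6)   # the match at 4 no longer fits before index 6

print(first, from_one, within, bounded)
```

The stop index is exclusive, just as in slicing: text.find("abc", 1, 6) searches the same characters as the slice text[1:6], but reports positions relative to the full string.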

Finding all occurrences of a substring

But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.

In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string: the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.

# Build array of lines from file, strip newlines

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

# Locate and print all occurrences of letter "e"

substr = "e"                      # substring to search for.
for line in mylines:              # string to be searched
    index = 0                     # current index: character being compared
    prev = 0                      # previous index: last character compared
    while index < len(line):      # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to first occurrence of "e"
        if index == -1:           # If nothing was found,
            break                 # exit the while loop.
        print(" " * (index - prev) + "e", end='')  # print spaces from previous
                                                   # match, then the substring.
        prev = index + len(substr)    # remember this position for next loop.
        index += len(substr)      # increment the index by the length of substr.
                                  # (Repeat until index > line length)
    print('\n' + line)            # Print the original string under the e's

Output:

   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
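The same loop can be packaged as a function that returns the positions instead of printing them. The sketch below is one way to do it; find_all is a hypothetical helper name, not a built-in string method, and the sample line is hard-coded so the code runs without lorem.txt.

```python
def find_all(text, substr):
    """Return the start index of every non-overlapping occurrence of substr."""
    positions = []
    if not substr:                 # an empty substring would loop forever
        return positions
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        # Continue the search just past the match we found.
        index = text.find(substr, index + len(substr))
    return positions


line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
e_positions = find_all(line, "e")
print(e_positions)   # [3, 24, 32, 35, 51]
```

These indices line up with the markers printed above the first line of the output.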

Incorporating regular expressions

For complex searches, use regular expressions.

The Python regular expressions module is called re. To use it in your program, import the module before you use it:

import re

The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.

For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?

Character sequence    Meaning
\b                    A word boundary. It matches the empty string at the beginning or end of a word, that is, between a word character and a non-word character (or the start or end of the string). "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, and the underscore ("_").
d                     Lowercase letter d.
\w*                   \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters.
r                     Lowercase letter r.
\b                    Word boundary.

So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.

To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.

pattern = re.compile(r"\bd\w*r\b")

Note

The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
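To see what "in other ways" means here, compare the two spellings directly. In an ordinary string literal, \b is the backspace control character; in a raw string it stays as a backslash followed by the letter b, which is what the re module needs. The sample sentence is made up for this sketch.

```python
import re

plain = "\b"     # one character: backspace
raw = r"\b"      # two characters: a backslash and the letter b
print(len(plain), len(raw))

sentence = "Good morning, doctor."
with_raw = re.search(r"\bdoctor\b", sentence)     # word boundaries: match found
without_raw = re.search("\bdoctor\b", sentence)   # looks for backspaces: no match
print(with_raw is not None, without_raw is not None)
```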

Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".

import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")    # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:  # Search for the pattern. If found,
    print("Found it.")

Output:

Found it.

To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:

import re
text = "Hello, Doctor."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")

Output:

Found it.
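The match object returned by search() also says what was matched and where. The sketch below uses an invented sentence and the match object's group(), start(), and end() methods, plus the pattern object's findall() method to collect every match at once.

```python
import re

text = "The doctor gave a dour look at the door, Dr. Reed."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)

match = pat.search(text)        # first match only
first_word = match.group()      # the text that matched
first_span = (match.start(), match.end())

words = pat.findall(text)       # every non-overlapping match, as strings
print(first_word, first_span)
print(words)   # ['doctor', 'dour', 'door', 'Dr']
```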

Putting it all together

So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.

Print all lines containing substring

The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.

Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() call, we construct an output string by joining several strings with the + operator.

errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)

Input (stored in logfile.txt):

This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!

Output:

Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
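The same result can be reached with a little less bookkeeping. The variation below is a sketch, not a replacement for the program above: it uses the built-in enumerate() to count lines, the in operator instead of find(), and an f-string instead of + to build the output. It writes the five sample lines to a temporary file first (log_path is a name made up for this illustration) so it runs on its own.

```python
import os
import tempfile

# Recreate the sample log so the sketch is self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "logfile.txt")
with open(log_path, "wt") as f:
    f.write("This is line 1\nThis is line 2\nLine 3 has an error!\n"
            "This is line 4\nLine 5 also has an error!\n")

errors = []
with open(log_path, "rt") as myfile:
    for linenum, line in enumerate(myfile, start=1):   # count lines from 1
        if "error" in line.lower():                    # case-insensitive test
            errors.append(f"Line {linenum}: " + line.rstrip('\n'))

for err in errors:
    print(err)
```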

Extract all lines containing substring, using regex

The program below is similar to the above program, but uses the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.

import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])

Output:

Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 Error: usb 1-1: can't set config, exiting.
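Because each element of errors is a two-item tuple, the for loop can also unpack it into two names instead of indexing with err[0] and err[1]. A minimal sketch, using hard-coded stand-in tuples rather than a real log file:

```python
# Stand-in data in the same (linenum, line) shape the program builds.
errors = [(3, "Line 3 has an error!"), (5, "Line 5 also has an error!")]

report = []
for linenum, text in errors:          # unpack each tuple into two names
    report.append("Line " + str(linenum) + ": " + text)

for entry in report:
    print(entry)
```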

Extract all lines containing a telephone number

The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:

  • 123-456-7890
  • (123) 456-7890
  • 123 456 7890
  • 123.456.7890
  • +91 (123) 456-7890

import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line ", str(err[0]), ": " + err[1])

Output:

Line  3 : My phone number is 731.215.8881.
Line  7 : You can reach Mr. Walters at (212) 558-3131.
Line  12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line  14 : She can also be contacted at (888) 312.8403, extension 12.

Search a dictionary for words

The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.

import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')

Output:

Hope
heliotrope
hope
hornpipe
horoscope
hype


Source: https://www.computerhope.com/issues/ch001721.htm
