|
Text files and Excel |
||
This document describes text files, and reading them into an Excel worksheet, and writing a worksheet to a text file.
Background
Text files are simply that. They contain only text characters, usually in tabular form (a table). If a text file contained only the data "hello there", the file would have 11 bytes of data, and maybe one or two end-of-file characters. That's it. The bytes of the file would be the values shown in the top row of the following:
| 72 | 101 | 108 | 108 | 111 | 32 | 116 | 104 | 101 | 114 | 101 |
| H | e | l | l | o | space | t | h | e | r | e |
Text files are often called ASCII files, because, at least in the case of the PC, ASCII is the code used to represent text characters. ASCII is simply a numeric representation for alphanumeric characters (the values above). It was defined by the government years ago, and was in common use long before the PC came about in 1981. Since computer bytes can contain only the values 0-255, we use numbers in that range to represent characters. (Actually, ASCII only defines characters for the first half of that range, 0-127). The character "A" is represented by the value 65 (01000001 in binary, or 101 in octal, or 65 in decimal, or 41 in hex, but all the same value). This is one of the possible 256 values that a byte can store. Note that if we wanted to represent the numerical value 65, and wanted to do so in text, we would need two bytes, character 6 (ASCII code 54) and character 5 (code 53), because text files use decimal characters to convey numerical values. So we would have the ASCII values for those two characters in two bytes, 54 and 53. For a list of the ASCII code values, see http://www.asciitable.com/.
There are an additional 128 characters (128-255) defined in what's called "Extended ASCII." These characters are, by and large, not on PC keyboards. Some are graphic-like characters used in the DOS days, when displays and printers were not graphics-capable, to make boxes around areas of the screen, and the like. The characters for extended ASCII codes in Microsoft Office programs are not all the same as the standard Extended ASCII codes as shown at www.asciitable.com/.
When you press the character A on a keyboard, it produces the ASCII code (65) for an A. Well, not directly, but eventually that's what is produced. For a character not on the keyboard, you can hold ALT while typing the code value for the character. You must use only the numeric keypad for this. Try ALT with 248. You'll get the degree symbol (°). You could use this method to get the character A ( ALT 6 5 ) but it'd be more trouble than it's worth, since there's already a key for that character.
There are other character codes in use today. IBM has long used EBCDIC (Extended Binary-Coded Decimal Interchange Code) instead of ASCII. The idea is the same, just different numerical values for characters. EBCDIC defines numerical codes for all 256 possible values of a byte, where ASCII defines only 128 (unless its Extended ASCII). For the EBCDIC codes, see http://www.legacyj.com/cobol/ebcdic.html.
Text files can be considered the "universal language of data," because they are only pure data, and contain no proprietary additional information. They contain no formatting information (font, color, margins, etc.), or layout information (margins, sections, etc.) as you might find in a word processing file. When you save a file in an application (like Word or Excel), you usually save it in its native format (.doc or .xls) which those respective programs understand. There is more than just the text of the document in such a file. Other applications may or may not be able to open such a file. But most applications can save a document as a text file, and can open a text file. Of course, you don't get formatting, layout, and other application-specific stuff. So text files can be read in (opened or imported) and read out (saved or exported) by many applications, giving them a common language to exchange text data.
Lines
Text files are usually divided into lines, as signified by an end-of-line sequence at the end of the line. This is usually a carriage-return character (13) followed by a line-feed character (10). This is often called CRLF. You don't see these characters, but they cause a display or printer to go to the beginning of the next physical line before continuing with the data. Consider the following two-line document:
Hello
there
Such a text file's numerical values in an ASCII file would look like this (top row)
| 72 | 101 | 108 | 108 | 111 | 13 | 10 | 116 | 104 | 101 | 114 | 101 | 13 | 10 |
| H | e | l | l | o | <CR> | <LF> | T | h | e | r | e | <CR> | <LF> |
Note that the display or printing of a text file may be broken into lines, even where there are no end-of-line sequences in the file. Most programs insert a "soft return" (carriage return) when they physically run out of space at the right side of the screen or page. This causes them to "wrap" to the next line on the screen or page. The Windows Notepad program has a word wrap option, which when turned off, causes lines in the file to continue to the right beyond the edge of the window. So you see the file the way it really is, in terms of lines. You can use the scroll bar to view these lines. This makes NotePad a useful tool for working with text files, though there are other text editors available with more features. Word processing programs, though they will readily open a text file, generally use word wrap at all times.
Tables
A table is information organized in rows and columns. Usually, each row is one record. If the table is an employee table, each record is about one employee. The columns divide the records into fields (likely First Name, Last Name, SSN, etc.) providing information about that employee. Records are added to the table and deleted from the table as needed. The record layout defines the fields (first might be FirstName, etc.), so the width of the table in fields remains constant, though the "height" changes are records are added or deleted. Tables are often used in spreadsheet programs and database programs to store data with such organization.
Delimited text file
A delimited text file is a table in text file form. The fields are separated by a field-delimiter character, usually a comma, but sometimes a tab character (code value 9). Each field can have pretty much any count of characters, as the program reading the file knows when a field has ended and the next field will start when it sees the field delimiter character. The end of each record (row) is signified by an end-of-line sequence, usually CR LF, and is called the Record Delimiter.. A delimited text file has one record per line.
So what happens when the data contains a field delimiter? If you had "Hello, there" in one field, you wouldn't want the program reading the file to interpret this as two fields because of the comma. When a field contains the field delimiter, it's wrapped in characters called Text Qualifiers (or bracketing characters), usually quotation marks. The program reading the file knows when it's inside Text Qualifiers, and considers a comma user text, not a field delimiter. Some programs cannot handle text qualifiers, and with such programs, no field delimiters should be present in the data.
Well, now what happens if the data happens to contain the Text Qualifier character? Seems bloody endless, doesn't it? When the data has a Text Qualifier character, the character is doubled up. Two consecutive quotation marks would appear. Also, the field is wrapped in Text Qualifier characters, whether or not there is a field delimiter in the data. So there will be text qualifiers where there's a field delimiter in the user data (comma) and/or where there's text qualifier characters in the user data. Consider the following data two fields
| Hello, "big boy" | 5 |
This would appear in a delimited text file as:
"Hello, ""big boy""",5
The program reading the text file knows to remove one of the quote marks, and consider the other as part of the data, not as a text qualifier.
This use of two consecutive quotations mark is one good reason not to use the term "double-quote" for a quotation mark, and "single-quote" for an apostrophe, as seem to be popular in computer circles, because it's ambiguous. Is it one or two of them? Which one? Also, it's pretty much only computer nerds that use such terms.
The field delimiter, usually a comma in English countries, may be a different character, such as the semicolon (;), used in many European countries. The Windows Regional settings language specification specifies the field-delimiter character. Go to the Windows Control Panel, start Regional Settings. There you can select the language, and with the Customize button (Windows XP), you'll see the resulting field-delimiter character, and can change it if you need to.
Optionally, a header row (the first line of the file) can contain the field names, separated by the field delimiter. The second line is the first record, etc. The program reading the file generally knows whether to expect a header. With a header, the fields need not be in a prescribed order, and the chance of errors associated with field order is eliminated.
Fixed-width text files
Fixed-field, or fixed-width text files have a prescribed count of characters for each field, and do not use field delimiters. No commas. It must be specified in advance how many characters comprise each field, and whether the field is alphabetic or numeric. This is called the "record layout.". The program reading the file counts characters, and determines where a field ends and another begins using that count. Each field must have the prescribed count of characters. If there are fewer data characters than allocated for a field, it's generally left-padded with zeroes if numeric, or right-padded with spaces if alphabetic. This preserves the character count for each field. Consider the following record layout:
| Field Name | LastName | Amount |
| Field size (character count) | 10 | 5 |
| Type | Alpha | Numeric |
The following record:
| Johnson | 5 |
would appear in a fixed-field file thus:
Johnson```00005
The ` characters here represent space characters.
Record types
A fixed file can have records of different record layouts, each a different Record Type. I suppose a delimited text file could also have record types, though I've not run across one. Records of the various types can be intermixed in the file. A field, often a single character, and often the first character of the record, is used to identify the record type. Here's an example of some possible record layouts.
Record type 1
| Field Name | RecordType | CustNum | CustName |
| Field size (character count) | 1 | 5 | 15 |
| Type | Numeric | Numeric | Alpha |
Record type 2
| Field Name | Record Type | PmtDate | CustNum | Amount |
| Field size (character count) | 1 | 6 | 5 | 10 |
| Type | Numeric | Alpha | Numeric | Numeric |
A fixed-field file with two record types might look like this:
100001Acme Sales````` ( Customer
00001, Acme Sales )
2052104000010000001000 ( a payment of $10 on 5/21/04 )
2062704000010000002000 ( a payment of $20 on 6/27/04 )
100002Bob's Market``` ( Customer 00002, Bob's Market )
2052204000020000001500 ( a payment of $15 on 5/22/04 )
2062804000020000002500 ( a payment of $25 on 6/28/04 )
In this typical file with mixed record types, record type one is a "header record" and is from a Customer Master file. The first byte is defined as the record type field, and contains a 1. Record type 2 is called a "Detail record." It's first field is defined as 2. The detail records for each master record follow its header record, and the CustNum field might be omitted. In database parlance, this might be the result of a query where a primary, or "one" table (record type 1) is joined in a query with a related, or "many," table (record type 2).
Data types and formatting
A delimited or fixed-width text file contains no information as to whether a field is to be intended as a number or as text data type (see "Data types and Excel" at this site). It contains only the characters, and that's all. When writing a sheet as a text file (below), characters are written exactly as seen on the sheet. No data type or formatting information is included. The program that will read the file will have to know how to interpret the data.
Similarly, when reading a text file into Excel, there is nothing to indicate how the data is to be interpreted. The program reading the file needs to know these things. When you open a text file (below), unless you use the Text Import Wizard, Excel guesses at types and formatting. Here are some examples of how data is interpreted when the Text Import Wizard is not used.
| Data in text file | Raw data put in cell | Formatted as | Displayed |
| 1/1 | 38353 (1) | Date d-mmm | 1-Jan (1) |
| 1/1/03 | 37622 | Date m-d-yyyy | 1/1/2003 |
| 1-1 | 38353 (1) | Date d-mmm | 1 Jan (1) |
| 1-1-03 | 37622 | Date m-d-yyyy | 1/1/2003 |
| 1/44 | 16072 (Jan 1, 1944) | Date mmm-yy | |
| 1/44/05 | 1/44/05 (text) | General | 1/44/05 (text) |
| 44/1 | 44/1 (text) | General | 44/1 (text) |
| 44/1/05 | 44/1/05 (text) | General | 44/1/05 (text) |
| 1.95 | 1.95 | General | 1.95 |
| 01.95 | 1.95 | General | 1.95 |
| $1.95 | 1.95 | Currency $#,##0.00_)...... | $1.95 |
(1) The current year will be used, since the year was not included in the date..
You can see that a stock number 1-1-03 will be interpreted as a date. Or a number like 1/1/03, or 1/44, whether it was intended to be a date or not. Excel has a simple philosophy: "If it looks like a date, it's a date."
The Text Import Wizard.
When you use File - Open on a file named with a csv extension, the Text Import Wizard is not invoked, and the assumptions above will prevail. When the file name is *.txt, the Text Import Wizard is used, allowing you to specify which fields (columns) are text, dates, etc. Changing a file extension from csv to txt is all that's necessary to invoke the Text Import Wizard.
WRITING A TEXT FILE FROM AN EXCEL SHEET
Saving sheet as a text file in Excel
A sheet can be saved as a text file in Excel. With the sheet active, select File - Save as. Set the File Type box to *.csv for a comma-delimited text file, or "Text - Tab-delimited(*.txt)" for a tab-delimited file. There is also a "Space-Delimited (*.prn) type.
Delimited text: (comma-delimited or tab-delimited)
The columns are treated as fields, and separated by commas (if you've selected *.csv) or tab characters (if you've selected *.txt). Quotations marks are added if a comma is present in the cell and a quotation mark appearing in a cell is written as two consecutive quotation marks, and Text Qualifier quotation marks are put around the field (see "Delimited Text Files" above). Use File - Save as, and change the File type box to CSV (comma delimted) or Text (tab-delimited). You may get a message stating that formatting and other stuff will be lost. This is a reminder that a text file contains text only.
Saving to a tab-delimited text file will result in text qualifiers (") around fields containing a tab character, as should be, Note that it will also result in text qualifiers around fields containing a comma. This is unnecessary (since the field delimiter isn't the comma), but generally does not cause problems, unless the program that will read the file cannot handle text qualifiers..
The data may not look the same when you reopen the text file in Excel. This is because Excel will do its characteristic automatic formatting on certain data, much like it does when data is manually keyed. For example, 01/02 will be converted to a date, 01/02/04. Leading zeroes are dropped in numbers (use Format - Cells - Number - Custom, with a code like 00000 to regain them). If you want to see exactly what was written to a text file, use NotePad to open it; it will show you exactly what the file looks like.
Space-delimited (*.prn)
This one's an outright misnomer, since there's no delimiting at all. It simply attempts go get a similar layout to that of the worksheet as it would appear on a printer or text-only display by padding the file with spaces to put the columns where they belong. There might also be spaces in the cells, thus no way to determine where fields begin and end. Spaces everywhere -- no real delimiting of fields.
The prn file type can be used to write a fixed-width text file. Use a non-proportional font in the worksheet, like Courier. All the characters in such a font have the same width, so all the characters line up vertically. Now set the column widths until you get the proper character count for each field. This can be done visually by dragging the column widths in the sheet header (if you've already used a non-proportional font), or with Format - Column - Width. For the latter, simply specify the character count. Now save the file as type prn. All fields will be padded with spaces for the designated character count. If you need left-zero padding instead for your numeric fields, format the column for the total character count. E.g.: Format - Cells - Number - Custom: 00000 (for 5 characters). Also set the column width for the 5 characters.
Options
Excel doesn't provide many options in writing a text file, other than to select comma-delimited, tab-delimited, or prn. If you need no text qualifiers, text qualifiers around all fields, writing only a part of a sheet, or any of several options, try the Text Write Program at this site. Home
Fixed-width
There isn't a direct way to write a fixed-width text file from a sheet in Excel. See "Space delimited," above for a way to accomplish this.
Here's another way, if you're willing to go to a little more trouble. It's not thoroughly developed and tested. Download Fixed-Length Fields.xls. Copy your cells to sheet Data. Copy the formulas in sheet Mirror down and to the right (use the Fill Handle) as far as you need, but no farther (or you'll get extra records with zeroes if there are numeric fields). Go to sheet FldLen to set the length of each field and the text (alpha) or numeric designation. Text columns will be written right-padded with spaces to the designated character count, and numeric fields left-padded with zeroes. Now the cells in sheet Mirror will have exactly the count of characters specified.
Now select sheet Mirror and write it with the Text Write Program, in this
site. Use the following settings:
Record Delimiter Code
13,10
Bracketing: Character
(nothing)
Field Delimiter: Character
(nothing)
Note that empty numeric fields will be filled with zeroes, same as if the field contained zero.
READING A TEXT FILE INTO EXCEL
There are two ways to read a text file in Excel, opening, and importing. Importing is available in Excel 2000 and up. In opening, you use File - Open, setting the file type to that of the file to be read (*.txt, *.csv). The file is read, and the data is placed into a sheet in a new workbook, starting in cell A1. In importing, you use Data - Get External Data. The file is read, and the data placed into a sheet in a workbook that's already open.
Be sure to look at "Formatting issues and data integrity," later in this page.
Excel 97 - up
Opening a text file
Excel can open a delimited text file, putting fields into columns. Use File - Open. The text file read directly into a sheet in a new workbook. Change the file type box to match the extension of the text file to be read. If it's *.csv (comma-separated variables), the file is presumed to be comma-delimited. If it's *.txt, the Import Wizard starts, allowing you to specify:
1) Delimited or fixed width
2) The field delimiter character, if delimited, or the count of characters of
each field if fixed-width.
3) The formatting (date, text, etc.) of the columns for each field, and any
fields to skip.
If the file extension is *.csv, and you wish to invoke the Text Import Wizard, simply change the file extension to *.txt, then open it in Excel.
Excel 2000 - up
Importing a text file with a query.
Starting with Excel 2000, you can import a text file into an existing sheet. Use Data - Import External Data - Import data. In the Select Data Source dialog box, change the file type box to Text Files. You'll go through the import wizard to make your selections, and it will then read the file into your sheet. It's set up as a query; you can right-click anywhere in the data, choose Refresh, and it will re-read the file. There are options you can change (Edit query) at any time. A macro to re-read the file need only run the Refresh method.
Formatting issues and data integrity
If you're familiar with the way Excel automatically applies formatting when keying data, you'll be happy (not!) to know it does the same thing when reading a text file. This can result in changed data, which cannot be readily recovered. For example, the data 1/1 will be converted to a date, Jan 1 of the current year, where the cell will contain the underlying date-time-serial value, and will be formatted for a date. Excel's automatic formatting can only be blocked by formatting the column for the field of concern for text as the file is read. When opening (not importing) a csv file, you have no means to pre-format a column, so you can't prevent this. In this case change the text file's extension to .txt. This invokes the Import Wizard when you do the File - Open, which allows you to pre-format any such column to text. When opening or importing any text file ( *.txt, *.csv, etc.) file, the Import Wizard is invoked, allowing you to pre-format the column of any field that will have such data for text.
Zeroes and text files
Since text files contain only the characters of the data and no formatting or data type indications, read
this seems to be as far as I've gotten. 9/6/04, 11:00 PM EST.
Comments about this document? Please feel free to write me. My email address is on the Home page.