Tuesday, May 31, 2011

ALL THE FILES, PLEASE

A number of ways to read multiple files into one data set.


First, just list the file names as DATALINES in the DATA step.
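DATA one;
   LENGTH fil2read $ 40;
   /* Read the next file name from the DATALINES below */
   INPUT fil2read $;
   /* Open the file named in fil2read; done becomes 1 at its last record */
   INFILE dummy FILEVAR=fil2read END=done;
   DO WHILE (NOT done);
      INPUT lastn $ firstn $ hiredate : mmddyy8. salary;
      OUTPUT;
   END;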
DATALINES;
D:\Infile\emplist.dat
D:\Infile\emplist1.dat
D:\Infile\emplist2.dat
D:\Infile\emplist3.dat
D:\Infile\emplist4.dat
RUN;



Be careful to set up the DO loop so that the DATA step never reads past the end-of-file marker on any file. The END= option on the INFILE statement that carries FILEVAR= sets up a temporary variable (done) which registers 0 (not the last record) or 1 (the last record) for each raw data line read in from each file. This is necessary because the DATA step stops as soon as an INPUT statement tries to read past any end-of-file marker.
By testing done at the top of the loop (DO WHILE), and leaving the DO loop after the last line of each file, we ensure that we never hit end-of-file on any of the files read in. This remains true even for empty files.
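
One practical consequence of FILEVAR= is that the variable it names (fil2read here) is not written to the output data set. The following is a minimal sketch, not part of the original example, of one way to keep track of which file each row came from. The variable name source is hypothetical, and only two of the files are listed to keep it short.

```sas
DATA one;
   LENGTH fil2read source $ 40;   /* source is a hypothetical variable name */
   INPUT fil2read $;              /* next file name from DATALINES */
   INFILE dummy FILEVAR=fil2read END=done;
   source = fil2read;             /* copy the name, since fil2read itself is not kept */
   DO WHILE (NOT done);           /* tested at the top, so an empty file is skipped */
      INPUT lastn $ firstn $ hiredate : mmddyy8. salary;
      OUTPUT;
   END;
   PUT 'Finished reading ' source=;   /* one log line per file */
DATALINES;
D:\Infile\emplist.dat
D:\Infile\emplist1.dat
;
RUN;
```

Each output row then carries the path of the file it was read from, which can help when tracing a bad record back to its source.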


A SAS data set can also be used to store the names of the files; the names are then read with a SET statement.

DATA one;
   /* Data set two must contain the character variable fil2read */
   SET two;
   INFILE dummy FILEVAR=fil2read END=done;
   DO WHILE (NOT done);
      INPUT lastn $ firstn $ hiredate : mmddyy8. salary;
      OUTPUT;
   END;
RUN;
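
The step above assumes that a data set named two already exists and holds the file names in a variable called fil2read. A minimal sketch of how such a data set might be built, reusing the paths from the first example, could look like this:

```sas
/* Build the list of files to read; one path per observation */
DATA two;
   LENGTH fil2read $ 40;
   INPUT fil2read $;
DATALINES;
D:\Infile\emplist.dat
D:\Infile\emplist1.dat
D:\Infile\emplist2.dat
D:\Infile\emplist3.dat
D:\Infile\emplist4.dat
;
RUN;
```

Keeping the list in its own data set means it can be maintained, filtered or sorted separately from the step that reads the raw files.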



Finally, it's possible to read in file names dynamically, using a FILENAME statement with the PIPE option. This is useful when all of the files are in the same directory. With the PIPE keyword, the FILENAME statement takes an operating system command in quotes and treats the output of that command as input. Unfortunately, this is not available on mainframe operating systems.


FILENAME indata PIPE "dir D:\Infile\*.dat /b";
DATA test;
   LENGTH fil2read $ 40;
   /* Read one bare file name per line from the DIR output */
   INFILE indata MISSOVER;
   INPUT fil2read $;
   /* DIR /b returns no path, so prepend it with the || operator */
   fil2read = "d:\infile\" || fil2read;
   INFILE dummy FILEVAR=fil2read END=done;
   DO WHILE (NOT done);
      INPUT lastn $ firstn $ hiredate : mmddyy8. salary;
      OUTPUT;
   END;
RUN;



The information returned through the FILENAME statement is a list of all files in D:\Infile with a .DAT type. The wildcard can match all files or (as above) only specific files. The DATA step reads this list through one INFILE statement and then uses each name to read the corresponding file by applying it to the FILEVAR= option on a second INFILE statement.

One limitation is that the Windows DIR command with /b returns only the names of the files, without the path. So the fil2read variable needs to be prefixed with the path in an assignment statement:
fil2read = "d:\infile\" || fil2read;
In UNIX, a similar FILENAME statement would read:

FILENAME indata PIPE "ls -1 /Infile/*.dat";

Because the shell expands the wildcard into full paths, the UNIX ls command returns each file name together with its path, so no path needs to be prepended.
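
Put together, a UNIX version of the whole step might look like the sketch below. It assumes the files live in /Infile as in the FILENAME statement above, that ls is called with no Windows-style switches, and that each path fits within 40 characters and contains no blanks.

```sas
FILENAME indata PIPE "ls -1 /Infile/*.dat";
DATA test;
   LENGTH fil2read $ 40;
   /* Each line from ls already holds the path and the file name */
   INFILE indata MISSOVER;
   INPUT fil2read $;
   /* No path assignment is needed before FILEVAR= uses the value */
   INFILE dummy FILEVAR=fil2read END=done;
   DO WHILE (NOT done);
      INPUT lastn $ firstn $ hiredate : mmddyy8. salary;
      OUTPUT;
   END;
RUN;
```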

MISSOVER, TRUNCOVER and PAD

MISSOVER was originally created to be used in conjunction with PAD, and the combination works well in most situations. However, padding every record can be a CPU-intensive process when reading an extremely large file.

STOPOVER is a good tool for checking code and raw data when dealing with large, potentially messy files, since it forces the DATA step to stop the first time it finds a short line.

TRUNCOVER was developed later than the MISSOVER and PAD options, and deals admirably not only with short lines but also with short values. TRUNCOVER is also more efficient, since it doesn't require the extra "padding".
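
The difference between short lines and short values is easiest to see on a small file. The sketch below is an illustration only: the fileref shorty and the data set names miss and trunc are hypothetical, and the test records are made up. It writes four records of varying length to a temporary file and reads them back with formatted input under each option. As the options are documented, MISSOVER is expected to set testnum to missing on every record shorter than the five columns requested, while TRUNCOVER is expected to keep whatever value is present.

```sas
FILENAME shorty TEMP;          /* hypothetical fileref for a scratch file */

/* Write four records of 2, 3, 4 and 5 characters */
DATA _NULL_;
   FILE shorty;
   PUT '22';
   PUT '333';
   PUT '4444';
   PUT '55555';
RUN;

/* MISSOVER: a value shorter than the 5. informat width is treated as missing */
DATA miss;
   INFILE shorty MISSOVER;
   INPUT testnum 5.;
RUN;

/* TRUNCOVER: the short value is read as it stands, with no padding */
DATA trunc;
   INFILE shorty TRUNCOVER;
   INPUT testnum 5.;
RUN;

PROC PRINT DATA=miss;
RUN;
PROC PRINT DATA=trunc;
RUN;
```

An external file is used instead of DATALINES because in-stream records are padded to a fixed length, which would hide the short-record behavior.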