Hack #1: Reading an Excel File With Cells in Date Format

In reading an excel file using Ab Initio, most of the time we design it like this:

Read Excel File > Redefine Format > Reformat > Output File

We read each row as one record with the Read Excel File component. Then, using a Redefine Format, we slice the record and assign them in fields. Lastly, using a Reformat, we convert each field to its expected datatype.

The above design is simple, given that we're extracting strings and decimals only. Dates formatted as "YYYYMMDD" will be processed without issues. But when dates take the format of "MM/DD/YYYY" (i.e. "1/1/2010"), things get more complicated.

Supposed we have the following table in the excel file:


EMP_NODEPT_NOFROM_DATETO_DATE
100001101/1/201011/2/2014
100002101/1/201011/2/2014
100003101/1/201011/2/2014
100004202/5/201012/4/2014
100005202/5/201012/4/2014


Using the above design, when we get to the Reformat component, we need to use string_split() to parse the month, the day, and the year. Now, we can manipulate the data to make it look like "MM/DD/YYYY".

Transform function will be similar as below:

out :: reformat(in) =
begin
  let string("|")[3] from_date_vector = string_split(in.FROM_DATE,"/");
  let string("|") from_date_month = from_date_vector[0];
  let string("|") from_date_day = from_date_vector[1];
  let string("|") from_date_year = from_date_vector[2];
  let string("|")[3] to_date_vector = string_split(in.TO_DATE,"/");
  let string("|") to_date_month = to_date_vector[0];
  let string("|") to_date_day = to_date_vector[1];
  let string("|") to_date_year = to_date_vector[2];

  out.* :: in.*;
  out.FROM_DATE :: string_lpad(from_date_month, 2, "0") + "/" + string_lpad(from_date_day, 2, "0") + "/" + from_date_year;
  out.TO_DATE :: string_lpad(to_date_month, 2, "0") + "/" + string_lpad(to_date_day, 2, "0") + "/" + to_date_year;
end;

The problem is solved. Unfortunately, there is another problem. What if, for any kind of reason, the date changes its format to "MM.DD.YYYY"? There's no other way but to manually edit the code.

Now, here's a better solution.

Create an excel file using the given table above. Name it as emp_dept_info.xlsx.

Create a new graph then connect the components in the following order:

Read Excel File > Reformat > Output File

Supply the xls parameter with the excel file in the Read Excel File component. Configure other parameters as needed.

In the Ports tab, click Generate. You should get something like this:

record
  string("\t") sheet;
  string("\t") EMP_NO;
  string("\t") DEPT_NO;
  string("\t") FROM_DATE;
  string("\t") TO_DATE;
  string("\0") rest_of_line;
end

Optional:
You can also use the command-line to generate this record format, simply use:

excel-gendml.ksh -single emp_dept_info.xlsx -sheet-name "Sheet1" -sheet-header-row 1 -record-delimiter "\0"
or
excel-gendml.ksh -single emp_dept_info.xlsx -sheet-index 1 -sheet-header-row 1 -record-delimiter "\0"

Caution: If you pull the date as is, Ab Initio will give you a decimal representation of the date (as shown below). And this is not what we want.


EMP_NODEPT_NOFROM_DATETO_DATEnewline
100001104017941945\n
100002104017941945\n
100003104017941945\n
100004204021441977\n
100005204021441977\n


So the next step is to edit the embedded layout and change the format of those dates to decimal. Like this:

record
  string("\t") sheet;
  string("\t") EMP_NO;
  string("\t") DEPT_NO;
  decimal("\t") FROM_DATE;
  decimal("\t") TO_DATE;
  string("\0") rest_of_line;
end

Use the following transform function in the Reformat component:

include "~$AB_COMPONENTS/Interchange/Excel/excel_to_dml_date.xfr";
out :: reformat(in) =
begin
  out.* :: in.*;
  out.FROM_DATE :: excel_to_dml_date(in.FROM_DATE);
  out.TO_DATE :: excel_to_dml_date(in.TO_DATE);
end;

Finally, use the following record format for your Output File.

record
  decimal("|") EMP_NO;
  decimal("|") DEPT_NO;
  date("MM/DD/YYYY")("|") FROM_DATE;
  date("MM/DD/YYYY")("|") TO_DATE;
  string(1) newline = "\n";
end

Save and run the graph.

Resulting records must be similar to this:


EMP_NODEPT_NOFROM_DATETO_DATEnewline
1000011001/01/201011/02/2014\n
1000021001/01/201011/02/2014\n
1000031001/01/201011/02/2014\n
1000042002/05/201012/04/2014\n
1000052002/05/201012/04/2014\n

Comments are appreciated.

3 comments: