MergeEm Application Synopsis for Win95/NT4.0

MergeEm extracts data from multiple files, substitutes that data
into a template you define, and writes the result as a web page
file.  The data is extracted from files you periodically pull onto
your machine via the web.  Once the data has been extracted and
substituted into the template, the page is ready for display, on
demand, with the latest available information.  As an example, it
could be used to create a web page that shows the current
temperature, wind direction, humidity, barometric pressure, DJIA,
current selling price/volume of specific stocks, etc.  The dynamic
data extracted can be a word, phrase, sentence, paragraph, or the
entire contents of a file.

How is this accomplished?  You define a RULE File that tells
MergeEm what files (called scan files) you want data from and
what data to extract from these scan files.  Each word, phrase,
sentence, paragraph, or entire file you want to extract must be
assigned to a unique variable name that you choose. 
You also define a TEMPLATE File that contains your
mark ups that will become the web page.  In the TEMPLATE
File you designate where the extracted data will
be substituted via the variable names you defined in the RULE
File.  MergeEm reads the RULE File to obtain the
template file name and path, output file name and path,
error log name and path, the files from which you want to extract
data, and the variable names with data extraction rules.  With that
information, MergeEm opens each scan file you specify, uses the rules
to extract the data you requested, and assigns the data to your
variable name.  After all files have been scanned for data, MergeEm
reads the template file into memory, substitutes the extracted data
into the matching variable names defined in the template, and writes
the results to the output file.  The output file is then ready for
display to your web page customers.

MergeEm supports three types of data extraction rules which are:
o	character pattern matching;
o	define row column extraction sets; and
o	grab the whole file.

General/Limitations:
MergeEm can be invoked with the rule file in the command line or you
can invoke the application and select the rule file via a dialog.  If
you invoke it with a rule file specified in the command line, the
application processes the rule file and exits when complete.  This
allows you to periodically invoke it via a scheduler without
cluttering your desktop.  If you invoke it without a rule file
specified on the command line, you are responsible for closing the
application when you are finished with it.

You build/edit the rule file using the text editor of your choice. 
The application automatically supports the UNIX and DOS end-of-line
formats.  Beginning and ending pattern matching strings as well as
variable names are limited to 260 characters each.  There is no known
size constraint for the extracted text.

--------------------------------------------------------------------------
Rule File

As discussed in the synopsis, the rule file is where you define the
pertinent information about the web page template, output file,
error log, scan files, and variables.  The rule file must contain a
command section (&command:) and as many scan file sections
(&ScanFile1:, &ScanFile2:, ...&ScanFileN:) as you have files from
which you want to extract data.

Each rule file section starts with one of the above mentioned section
identifiers and terminates on the first blank line encountered.

Command Section
The command section of the rule file must be the first defined
section and must begin with &command:.

In the command section you define the name and path for the
following:
1.	TemplateFile Name;
2.	TemplateFile Path;
3.	OutputFile Name;
4.	OutputFile Path;
5.	ErrorLogFile Name; and
6.	ErrorLogFile Path.
Each definition is discussed below.

TemplateFile Name
The name of the template file containing the html mark ups with
the embedded variable names that you defined for the extracted
data from each scan file (e.g., mytemplate.txt).  This file
will be read into memory, the extracted data substituted for each of
the variables, and the results written to the output file.
Note:	You may use any name and extension that makes sense to you.

TemplateFile Path
The full path to the template file (e.g., c:\mywebpage\template).

OutputFile Name
The name of the output file where you want the results written (e.g.,
mypage.html).  This file's extension should be html or htm since you
will be referencing it on one or more of your pages.

OutputFile Path
The full path to the output file (e.g., c:\mywebpage).

ErrorLogFile Name
The name of the error log file where you want the in-process errors
to be written (e.g., error.log).

ErrorLogFile Path
The full path to the error log file (e.g., c:\mywebpage\errorlog).

The following is an example command section.
&command:
TemplateFile Name = MyTemplate.tmp
TemplateFile Path = c:\mywebpage\template
OutputFile Name = Weather.html
OutputFile Path = c:\mywebpage
ErrorLogFile Name = MergeEmError.log
ErrorLogFile Path = c:\mywebpage\errorlog

The command section is terminated by a blank line after the
ErrorLogFile Path definition.
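
If you want to check a command section outside MergeEm, the following
Python sketch reads its "key = value" lines and joins a Path with its
Name.  It is only an illustration and is not MergeEm's own code.  The
function name parse_command_section is made up for this example, and
the sketch assumes one key = value pair per line.

```python
import ntpath  # Windows-style path handling, usable on any platform


def parse_command_section(text):
    """Return the key/value pairs of a &command: section as a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            break  # a blank line ends the section
        if line.lower() == "&command:":
            continue  # section identifier, carries no value
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


example = """&command:
TemplateFile Name = MyTemplate.tmp
TemplateFile Path = c:\\mywebpage\\template
OutputFile Name = Weather.html
OutputFile Path = c:\\mywebpage
ErrorLogFile Name = MergeEmError.log
ErrorLogFile Path = c:\\mywebpage\\errorlog
"""

settings = parse_command_section(example)
template = ntpath.join(settings["TemplateFile Path"],
                       settings["TemplateFile Name"])
print(template)
```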

Scan File Sections
Each scan file section must begin with the &ScanFileN identifier
where the N is a number beginning with 1 and each succeeding section
incremented by one.  The first two lines following the identifier
must define the file name to be scanned and the path to that file. 
Each scan file section begins as follows:

&ScanFile1:
ScanFile Name = weather.html
ScanFile Path = c:\MyWebPage\Scanfiles

Following the ScanFile Path definition, you define the variables
and the extraction rules for the data you want to
extract from that specific file.  After you have defined the
variables to be extracted for that file, add a blank line before the
next scan file definition.  You may define as many scan file sections
as you have files to scan.  You may define as many variables
as you need to extract from each of the files.


--------------------------------------------------------------------------
ScanFile Variables

In each &ScanFileN: section of the RULE File you define
the file name and path to the file you are going to extract data
from.  Following the path definition, you may define as many unique
variables as you have pieces of data to extract from that file.  This
section describes how to define the variables that extract the
desired information for insertion into your hypertext template file.

MergeEm supports three types of data extraction rules which are:
o	character pattern matching;
o	define row column extraction sets; and
o	grab the whole file.

Character Pattern Matching
MergeEm supports two types of variable formats for pattern matching
which are:

{variable}={start pattern expression}????{end pattern expression}{::Instance of}
{variable}={start pattern expression}??XXX??{::Instance of}

The ???? rule is interpreted as, extract all text between (but not
including) the start and end pattern expressions (e.g., <title> and
</title>).  The ??XXX?? rule is interpreted as, copy the first XXX
characters following the start expression where, XXX represents the
number of characters (between 1 and 999) to copy.  The double ?? is
required before and after the number of characters to copy (e.g.,
<title>??8??, <img src = "??12??).

The following is an example of a variable definition using character
pattern matching:

title=<title>????</title>

The example definition directs MergeEm to:
o	pattern match for <title> and </title> respectively in the
current scan file;
o	extract the information between the <title> and </title>
pattern expressions; and
o	assign the extracted text to variable title for later
substitution for variable %%title%% in the template.
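
The two pattern matching rules can be pictured with a short Python
sketch.  This is a rough illustration of the behavior described
above, not MergeEm's actual code; the function names extract_between
and extract_count and the sample page text are made up for the
example, and only the first occurrence of the start pattern is used.

```python
def extract_between(text, start, end):
    """???? rule: the text between start and end, excluding both."""
    begin = text.find(start)
    if begin == -1:
        return None  # start pattern not found
    begin += len(start)
    finish = text.find(end, begin)
    if finish == -1:
        return None  # end pattern not found after the start pattern
    return text[begin:finish]


def extract_count(text, start, count):
    """??XXX?? rule: the first `count` characters after start."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    return text[begin:begin + count]


page = "<html><head><title>River Conditions</title></head></html>"
print(extract_between(page, "<title>", "</title>"))  # title=<title>????</title>
print(extract_count(page, "<title>", 8))             # <title>??8??
```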

Variable Names
A variable may be any name of your choice (title in the example
above) followed by a required equal sign and the variable definition
(<title>????</title>).  Each variable name must be unique.

Start and End Pattern Expressions
The start and end patterns must be unique character patterns.  The
application pattern matches for these expressions.  If you choose
ambiguous start and end character patterns, you may get unexpected
results.  If the start and end patterns are not unique, you must
define an instance number following the end expression, beginning
with a :: .  Here is an example where you need to find the 2nd
instance of the pattern "the aid" and extract information up to the
word "country":

2ndThe = the aid????country::2.
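
A minimal Python sketch of the instance number follows.  It assumes
that the number counts occurrences of the start pattern and that the
end pattern is then searched for after the chosen occurrence; MergeEm
may count differently.  The names find_nth and extract_instance and
the sample sentence are made up for the example.

```python
def find_nth(text, pattern, n):
    """Return the index of the n-th occurrence of pattern (1-based), or -1."""
    index = -1
    for _ in range(n):
        index = text.find(pattern, index + 1)
        if index == -1:
            return -1  # fewer than n occurrences exist
    return index


def extract_instance(text, start, end, instance=1):
    """Extract the text between the chosen start occurrence and the next end."""
    begin = find_nth(text, start, instance)
    if begin == -1:
        return None
    begin += len(start)
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish]


text = ("They came to the aid of the town. "
        "Now come to the aid of their country.")
# Equivalent of: 2ndThe = the aid????country::2
print(repr(extract_instance(text, "the aid", "country", 2)))
```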

Start Expression Wild Cards
MergeEm supports two start expression wild cards.  The two wild cards
are $$$ and #.

The $$$ wild card means don't know and/or care how many or what
characters may follow.  You must specify a single ending character
following the $$$ wild card.  That character is used to denote the
end of the don't know and/or care pattern.  Choose your ending
character with care. Note that if you use the $$$ wild card, multiple
characters following the $$$ must exist in the pattern to constitute
a match (e.g., <h2>...</h2> would not be a match in the example below
for variable header1).

You may embed, within the start and end expressions, the # wild card.
 The # wild card means unknown digit.  You may embed multiple # wild
cards signifying a multi-digit number (e.g., ### for a 3 digit
number).  Note if you specify one or more # wild cards in the start
expression, you must specify the same number of # wild cards in the
end expression.  The following example illustrates usage of the
expression wild cards.

header1=<h# $$$>????</h#>::3

This ambiguous definition example directs MergeEm to pattern match
for the characters <h and a single unknown digit followed by any
number of additional characters whose pattern ends with the >
character.  When the start pattern is found, the search continues for
the end pattern (</h) followed by a single digit and >.  In the example, the
first two instances of the match will be skipped.  On the 3rd match,
the intervening text is extracted and assigned to the variable
header1.  It skips the first two matches because of the instance
number definition (::3) following the end expression.  If no instance
number is defined following the end expression, the 1st instance is
assumed.  If fewer than 3 instances exist in a scan file, the result
will be no match.  Use the # sign to signify don't know or care what
the actual number is.  If you know or care what the number is, then
you should define the variable explicitly (e.g., header1=<h2
$$$>????</h2>). 
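
One way to picture the wild cards is to translate them into a
regular expression, as in the Python sketch below.  This is an
assumption about the behavior, not MergeEm's implementation: # is
taken to be one digit, $$$ is taken to be one or more characters up
to the ending character that follows it, and the instance number is
taken to count complete start/end matches.  The function names and
the sample text are made up for the example.

```python
import re


def wildcard_to_regex(expression):
    """Translate the # and $$$ wild cards into a regular expression."""
    parts = []
    i = 0
    while i < len(expression):
        if expression.startswith("$$$", i) and i + 3 < len(expression):
            stop = expression[i + 3]  # the required ending character
            parts.append("[^" + re.escape(stop) + "]+" + re.escape(stop))
            i += 4
        elif expression[i] == "#":
            parts.append(r"\d")  # one unknown digit
            i += 1
        else:
            parts.append(re.escape(expression[i]))
            i += 1
    return "".join(parts)


def extract_wild(text, start, end, instance=1):
    """Return the text between the instance-th start/end match, or None."""
    pattern = re.compile(
        wildcard_to_regex(start) + "(.*?)" + wildcard_to_regex(end),
        re.DOTALL)
    matches = pattern.findall(text)
    if len(matches) < instance:
        return None
    return matches[instance - 1]


text = "<h2>Plain</h2><h1 a>One</h1><h2 b>Two</h2><h3 c>Three</h3>"
# Equivalent of: header1=<h# $$$>????</h#>::3
print(extract_wild(text, "<h# $$$>", "</h#>", 3))
```

In the sample text, <h2>Plain</h2> is not counted because nothing
follows the digit before the > character.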

Start Expression Across Lines
MergeEm supports defining start expressions that transcend lines in
an effort to help improve your ability to define a unique pattern. 
When a character pattern starts on one line and ends on another, you
join the two patterns with a double caret (^^).  The following is a
simple example.  Assume the following text exists as a portion of a 
scan file:

Now is the time for all good men
to come to the aid of their country.

Further assume that you want to extract "to the aid of their" but
the pattern "to come" exists elsewhere in the file.  The pattern
"good men" followed by "to come", however, is unique, especially
since it starts on one line and ends on another.  You could define
your variable with a start expression that transcends lines as
follows:

theaid = good men^^to come????country.

A new line is automatically substituted for the ^^ if MergeEm detects
that the scan file is of type UNIX.  A carriage return and line feed
is automatically substituted for the ^^ if MergeEm detects that the
scan file is of type DOS.
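
The ^^ substitution can be sketched in Python as follows.  The
sketch assumes a scan file counts as DOS when it contains a carriage
return and line feed pair; how MergeEm actually detects the file type
is not documented here.  The function name expand_line_join is made
up for the example.

```python
def expand_line_join(pattern, scan_text):
    """Replace ^^ with the line ending the scan text appears to use."""
    eol = "\r\n" if "\r\n" in scan_text else "\n"
    return pattern.replace("^^", eol)


unix_text = ("Now is the time for all good men\n"
             "to come to the aid of their country.\n")
dos_text = unix_text.replace("\n", "\r\n")

for scan_text in (unix_text, dos_text):
    # Equivalent of: theaid = good men^^to come????country.
    start = expand_line_join("good men^^to come", scan_text)
    begin = scan_text.find(start) + len(start)  # start is known to exist here
    finish = scan_text.find("country.", begin)
    print(repr(scan_text[begin:finish]))
```

When reading a real scan file in Python, open it with newline="" so
that the original line endings are preserved.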


Row Column Extraction Sets
MergeEm supports extracting text between two sets of row-column
definitions.

{variable}={rXXcXX}????{rXXcXX}

Bracketing the ???? are sets of row (started with r followed by the
row number) column (started with c followed by the column number)
definitions.  MergeEm will extract the characters starting at the
start row-column up to the end row-column.  The following example
illustrates a row-column extraction definition:

flow=r119c106????r119c111

In this example, the 5 characters in columns 106 through 110 of row
119 are extracted and assigned to variable flow.
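
The row-column rule for a single row can be sketched in Python as
shown below.  Following the example above, rows and columns are
counted from 1 and the end column itself is not included.  Sets that
span more than one row are left out because their exact behavior is
not described here.  The function name extract_row_col and the sample
data are made up for the example.

```python
import re


def extract_row_col(text, definition):
    """Extract text for a definition such as r119c106????r119c111."""
    match = re.fullmatch(r"r(\d+)c(\d+)\?\?\?\?r(\d+)c(\d+)", definition)
    if match is None:
        raise ValueError("not a row-column definition: " + definition)
    row1, col1, row2, col2 = (int(group) for group in match.groups())
    if row1 != row2:
        raise ValueError("this sketch only handles a single row")
    lines = text.splitlines()
    if row1 < 1 or row1 > len(lines):
        return None  # the scan text has no such row
    # Convert 1-based columns to 0-based slice positions.
    return lines[row1 - 1][col1 - 1:col2 - 1]


# Build 119 rows; the last one carries "12345" in columns 106-110.
rows = [""] * 118 + [" " * 105 + "12345" + " cfs"]
text = "\n".join(rows)
print(extract_row_col(text, "r119c106????r119c111"))
```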

Grab the Whole File
MergeEm supports extracting the contents of an entire file and
assignment of its contents to a variable.  The following is the
format for the grab all data extraction rule:

{variable}= ?all?

Comment Characters
A double ampersand (&&) may be used in the rule file as a comment
indicator.  All characters following a && to the end-of-line are
ignored.  You may comment out a rule file line by beginning it with
the &&.  Lines beginning with && are not treated as blank lines, so
they do not end a rule section.
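
The interaction between comments and blank lines can be sketched in
Python as follows.  This is an illustration of the rules described
above rather than MergeEm's own parser; the function names
strip_comment and split_sections and the sample rule text are made up
for the example.

```python
def strip_comment(line):
    """Drop everything from && to the end of the line."""
    return line.split("&&", 1)[0].rstrip()


def split_sections(rule_text):
    """Group rule file lines into sections that end at a blank line."""
    sections, current = [], []
    for raw in rule_text.splitlines():
        if not raw.strip():
            # A truly blank line closes the current section.
            if current:
                sections.append(current)
                current = []
            continue
        line = strip_comment(raw)
        if line:  # a comment-only line is skipped, not treated as blank
            current.append(line)
    if current:
        sections.append(current)
    return sections


rules = """&command:
TemplateFile Name = template.html
&& OutputFile Name = old.html
OutputFile Name = mergeem.html  && current output file

&ScanFile1:
ScanFile Name = river.html
"""

for section in split_sections(rules):
    print(section)
```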


--------------------------------------------------------------------------
Example Rule File

The following is an example of a rule file.

&command:
TemplateFile Name = template.html
TemplateFile Path = e:\Iola Merge Project\MergeEm
OutputFile Name = mergeem.html
OutputFile Path = e:\Iola Merge Project\MergeEm
ErrorLogFile Name = Error.log
ErrorLogFile Path = e:\Iola Merge Project\MergeEm

&ScanFile1:
ScanFile Name = river.html
ScanFile Path = e:\temp
flow=r122c123????r122c126
stage=r122c128????r122c132
riverdate=r122c134????r122c139
rivertime=r122c140????r122c145

&ScanFile2:
ScanFile Name = weather.html
ScanFile Path = e:\temp\
noaatime= <B>--------------</B><BR>^^<B>????</B>
forecast=<B>--------------</B><BR>^^<P>????<HR align="left"
forecast2=extended forecast...</a>...</B><BR>????<P><HR align="left"


See the corresponding Example Template File for
an example of how the above variables might be used.

--------------------------------------------------------------------------
Example Template File

The template file is where you define the html markups that will be
your web page.  Within your markup you embed the variables that you
defined in the &ScanFileN sections.  Each variable name in the
template must be bracketed by a double percent (%%).  As an example,
if you extracted data from a scan file and assigned that data to
variable datafurnished, in the template you would tell MergeEm where
to put that data by embedding %%datafurnished%% in your markup
possibly as follows:

<tr><td>Data Furnished By</td><td><b>%%datafurnished%%</b></td></tr>
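
The substitution step can be pictured with the following Python
sketch.  It is an illustration only: the function name fill_template
and the sample value are made up, and leaving an unknown variable
name untouched is an assumption, since MergeEm's handling of
undefined names is not described here.

```python
import re


def fill_template(template, values):
    """Replace each %%name%% with its value from the dict."""
    def lookup(match):
        name = match.group(1)
        # Assumption: names without a value are left in place unchanged.
        return values.get(name, match.group(0))
    return re.sub(r"%%(.+?)%%", lookup, template)


row = ("<tr><td>Data Furnished By</td>"
       "<td><b>%%datafurnished%%</b></td></tr>")
print(fill_template(row, {"datafurnished": "U.S. Geological Survey"}))
```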

Date and Time Variable Usage in a Template File
Date and Time are supported as automatic variables within the
template file.  They are supported without the need to define a
corresponding variable.  The two variable definitions are:

%%SYSDATE%% having the format MM/DD/YY (U.S.) or Mon. DD, YYYY
(European) and
%%SYSTIME%% having the format HH:MM AM/PM (U.S.) or HH:MM  (24 hour
European).

The computer's current date and time are substituted for these
variables when they are encountered within the template file.  You
choose the style (U.S. or European) in which they are displayed on
the Preferences Page.
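
The four formats correspond roughly to the strftime patterns in the
Python sketch below.  The mapping is an approximation made for
illustration (for example, whether MergeEm pads the day or hour with
a leading zero is not stated), and the names FORMATS and
system_variables and the sample date are made up.

```python
from datetime import datetime

# Approximate strftime equivalents of the formats listed above.
FORMATS = {
    "US": {"SYSDATE": "%m/%d/%y", "SYSTIME": "%I:%M %p"},
    "European": {"SYSDATE": "%b. %d, %Y", "SYSTIME": "%H:%M"},
}


def system_variables(now, style):
    """Return the SYSDATE and SYSTIME strings for the chosen style."""
    return {name: now.strftime(fmt) for name, fmt in FORMATS[style].items()}


moment = datetime(1997, 3, 9, 14, 5)  # an arbitrary sample date and time
print(system_variables(moment, "US"))
print(system_variables(moment, "European"))
```

Month abbreviations and the AM/PM marker produced by strftime depend
on the locale in effect.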

The following is an example template file definition that corresponds
to the Example Rule File.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 2.0">
<title>MergeEm Demonstration Page</title>
</head>
<body bgcolor="#FFFFFF">
<table border="0">
    <tr>
        <td><img src="images/MergeEm.gif" align="left" hspace="0"
        width="258" height="58"></td>
    </tr>
    <tr>
        <td><img src="images/text.gif" align="left" hspace="0"
        width="341" height="30"></td>
    </tr>
</table>
<hr size="1" noshade color="#000080">
<font face="arial" size="2">
<p align="center">[ <a href="mergeem.html">back</a> | templatefile ]
</p>
<div align="center"><center>
<table border="0" cellpadding="3" cellspacing="4">
    <tr>
        <td valign="top" width="50%"><font size="2" face="Arial">If
        you need to create dynamic web pages that are created
        from data located in different source files, then you've
        found one utility that you're going to love! MergeEm can
        scan multiple text files, extract only the text that you
        want, save that data to a variable and then reinsert it
        exactly where you want it on your web page. </font><p><font
        size="2" face="Arial">This page is a demonstration of
        what MergeEm can do, automatically, 24 hours a day, 7
        days a week. If you need to merge text files for the web, 
        then this is one utility you can't do without. MergeEm
        only works with files on your local drive; to automate
        the process of downloading and uploading the files, take
        a look at </font><a href="http://www.unisyn.com"><font
        size="2" face="Arial">Automate</font></a><font size="2"
        face="Arial">.</font></p>
        </td>
        <td valign="top" width="50%"><font size="2" face="Arial">
        <em>Note: scanned information in </em>
        </font><font color="#FF0000" size="2" face="Arial">
        <em>red</em></font><p>
        <font size="2" face="Arial">
        <em>MergeEm output file created: </em>
        </font><font color="#FF0000" size="2" face="Arial">
        <em>%%SYSTIME%%, %%SYSDATE%%</em></font></p>
        <p><font face="Arial"><strong>Weather Forecast for Iola, 
             Kansas<br> </strong></font><font size="2" face="Arial">
             from </font>
        <a href="http://www.nnic.noaa.gov/cgi-bin/netcast.do-it?state=iola%2C+kansas&amp;city=on&amp;area=Local+Forecast&amp;html=yes&amp;match=strong">
             <font size="2" face="Arial">NOAA Network Information
             Center</font></a></p>
        <table border="0" cellpadding="5" cellspacing="5">
            <tr>
                <td><font color="#FF0000" size="2" face="Arial">
                    <strong>%%noaatime%%</strong></font>
                    <p><font color="#FF0000" size="2" face="Arial">
                    <strong>%%forecast%%
%%forecast2%%</strong></font>
                    </p>
                </td>
            </tr>
        </table>
        <p><font face="Arial"><strong>Neosho River Conditions at
            Iola, Kansas<br></strong></font>
        <font size="2" face="Arial">from</font>
        <font face="Arial"></font>
        <a href="http://www-ks.cr.usgs.gov/Kansas/rt/">
        <font size="2" face="Arial">U.S. Geological
            Survey WRD</font></a></p>
        <div align="left"><table border="0" cellpadding="3"
        cellspacing="5">
            <tr>
                <td><font size="2" face="Arial"><strong>Water
                    Flow:</strong></font></td>
                <td><font color="#FF0000">%%flow%%</font></td>
            </tr>
            <tr>
                <td><font size="2" face="Arial"><strong>River
                    Stage:</strong></font></td>
                <td><font color="#FF0000">%%stage%%</font></td>
            </tr>
            <tr>
                <td><font size="2" face="Arial"><strong>
                    Date:</strong></font></td>
                <td><font color="#FF0000">%%riverdate%%</font></td>
            </tr>
            <tr>
                <td><font size="2" face="Arial"><strong>
                    Time:</strong></font></td>
                <td><font color="#FF0000">%%rivertime%%</font></td>
            </tr>
        </table>
        </div></td>
    </tr>
    <tr>
        <td colspan="2"><p align="center"><font size="2"
face="Arial">
        <strong>Files used for this example</strong></font></p>
        <p align="center"><font size="2" face="Arial">Raw input
            file downloaded from </font>
        <a href="http://www-ks.cr.usgs.gov/Kansas/rt/">
            <font size="2" face="Arial">
            U.S. Geological Survey WRD</font></a>
        <font size="2" face="Arial"><br>
            Raw input file downloaded from </font>
        <a href="http://www.nnic.noaa.gov/cgi-bin/netcast.do-it?state=iola%2C+kansas&amp;city=on&amp;area=Local+Forecast&amp;html=yes&amp;match=strong">
            <font size="2" face="Arial">NOAA Network 
            Information Center</font></a>
        <font size="2" face="Arial"> <br></font>
        <a href="template.html"><font size="2" face="Arial">
            Template</font></a>
        <font size="2" face="Arial"> file for this page<br>
        </font>
        <a href="rule.html"><font size="2" face="Arial">
            Rule</font></a>
        <font size="2" face="Arial"> file used for this
example</font>
        </p>
        </td>
    </tr>
</table>
</center></div>
</body>
</html>

Note1:	The above example can be viewed in action at the following
URL:
    http://www.bey.com/mergeem.html
Note2:	The rule and template files were written by John Heard,
owner of Beyond Engineering.  John develops and writes web sites for
hire.  You can visit his home page at the following URL:
    http://www.bey.com/

Dick Floersch
Email:	floersch@sound.net
Site:	http://www.sound.net/~floersch/
