Rick Refactors %DTC, Step by Step

toad · **Posted:** Sat Jan 29, 2011 9:35 pm

Step five is a break from the approach we followed in the first four steps. Instead of doing a pass over the entire routine, I'm doing a single subroutine.

The reason for the change is simple. We have identified all the calls among the subroutines, but we don't yet understand them; we can't until we know what the called subroutines do. That means we need to start fully analyzing the called subroutines. The best way to proceed is to begin with the self-contained subroutines (those that make no calls out), then to move on to those that only call the subroutines we've analyzed, and so on, in waves.

So, starting with the self-contained subroutines, the easiest ones to analyze are those that are documented, i.e., the public self-contained subroutines, since we know more about them than we do about the private self-contained subroutines. There are three of them - COMMA, H-TOH, and S. We now understand them about as well as we can short of digging in, analyzing, annotating, and generally refactoring them, so that's what we need to do next.

For step five, we're completely refactoring H-TOH. I chose it first because it's the most interesting of the three. It isn't as trivial as S, and it contains what proves to be a disposable label we can remove.

Having sorted out the top-level flow of control, the main two things we need to focus on are (a) the variable scoping and (b) the meaning of the algorithm.

Variable scoping is critically important because most of the complexity in MUMPS software comes from its unusual scoping rules, which usually create a very complicated flow of information that can be nigh impenetrable for anyone but the author to fully understand. There are simple tactics you can follow to tighten up your variable scoping so that variables only exist for precisely as long as they need to, which helps your reader understand when they can be ignored and when they need to be attended to. Annotation about variable scoping should focus on the real context of the information being passed around (not just format but also whether it comes from or goes to specific fields in the database, in an HL7 message, etc.). Variable names should be chosen to help the reader remember that context

The meaning of the algorithms is the single most important reason to annotate your code. No code is self-documenting no matter how clear it is, because the most important thing to document is not what the code is doing but WHY, and no third-generation code is capable of explaining why. Third-generation languages are algorithmic in nature - that is, they specify what you are doing step by step - and without comments that's all they can describe. For example, I may not need you to explain that you're adding 1 to X - any competent MUMPS reader can see that - but I surely need you to explain that an imprecise date needs to count at least one day into the current month or year (which is why you're adding 1 if the date is imprecise).

So with those as the goals of the refactoring, here are the changes I made in refactoring H-TOH.

1) Reinsert blank lines in TOH to regroup commands. If you remember from step four, change two, when we break up long lines into multiple short lines we insert blank comment lines to separate the groups of short lines, so we can preserve their logical grouping. Well, in TOH I forgot to do some of that, so I here reinserted the missing blank lines. What's interesting about this step is that the error occurred despite how methodically we've been refactoring. Get used to it. Refactoring passes an astonishing amount of information through your brain (which is why it's such a good way to master a routine), which means you will have plenty of opportunities to violate the seven-plus-or-minus-two rule (how many things at a time you can keep track of). The goal is not to avoid making mistakes - that's impossible, you will make mistakes - but because you've refactored the routine you'll also be the best in the world at finding and fixing your mistakes quickly in the code you've mastered. This first change in step five is an example of why we refactor.

2) Figure out whether we can get rid of TOH, the seemingly private and uncalled label within H^%DTC. So far we've found no calls, so we do a more intensive search for calls. Eventually we'll have the Master Calling Index developed to make this a trivial operation, but for now it's slow and imperfect. I searched all routines and globals for "H^%DTC" and found no matches. I then searched all routines for TOH and found 144 matches, but upon review none of them turned out to be relevant. I did not go on to search all globals for TOH because there would be thousands upon thousands of false matches for so simple a string. At this point I'm going on guesswork based on my understanding of the kinds of calls people make into routines out of globals in VISTA that TOH has the wrong structure to be useful for most of them. There's a small leap of faith required here to conclude, as I do, that TOH is simply never called by anyone. This raises the question of why it is here, about which we can speculate (used to be called? was intended to be called but never was? documentation?) but can't know the answer since we aren't the original authors.

3) Eliminate label TOH & associated comments. All we leave behind is a blank comment line suggesting a logical break between what came before the label and what comes after. Because most labels are callable in MUMPS, it is important not to introduce labels (or to leave them around) unless they serve a necessary purpose. When in doubt, prune.

4) Next we need to understand the input and output variables better. We know what the documentation claims about H^%DTC's parameter, but in truth the documentation about H^%DTC is mostly a later approximation of the truth, an attempt to capture in writing some deductions about the verbal contracts George Timson and Michael Distaso made with the other VISTA Hardhats. Until we examine how this public call is actually used, we can't be sure the documentation is right. Therefore, I searched all routines and globals for H^%DTC and found 153 calls from within routines (and none from within globals, oddly). I read them all and took notes on how they handled the inputs and outputs.

5) Annotate the subroutine based on what we learned above. Here are the key points. First, some subroutines rely on X after the call, in direct contradiction of the documentation, which claims X is not returned; not only is it returned, it is used. Therefore, we annotate our description of X with this information and change "input" to "throughput." Second, frequently %H, %T, %Y, and X are not cleaned up after calls, though they are at times. This suggests the documentation should be clearer about the caller's responsibility to do so. Third, no one seems to be relying on H^%DTC to kill %, %D, or %M for them, so we could safely replace the kills at the end of the subroutine with new commands. This is a significant change in H^%DTC's handling of these variables, and in theory it could cause problems in which a caller needed H^%DTC to do the cleanup, but in practice it won't, so it's better to replace the heavy footprint of killing our temporary variables in pre-1990-MUMPS fashion with the much lighter footprint of newing them instead.

6) Having performed the annotation about the current situation, we now change the variable scoping of %, %D, and %M to new them instead of kill them. Where the kill used to be, at the end of the subroutine, we insert a comment about what we've done and why, because should our analysis prove faulty and bugs result in other packages as a side-effect, it would be quite difficult for a third party to understand the problem without the explanation, which it only takes us a few characters of space and a few seconds of typing to provide. The author should always be willing to invest in improving troubleshooting for the reader in cases like this. The desired effect of step six is to lighten H^%DTC's footprint upon the symbol table to make it easier to build complex software with it.

7) Having dealt with input, output, and throughput issues, we can turn to temporary internal variables, where we have a much freer hand. As we annotate the algorithm and so come to understand what these internal variables mean, we rename them to describe them more clearly (% > IMPRECIS, %D > DAY, and %M > MONTH, DILEAP > LEAPDAYS, and Y > YEAR), and we introduce argumentless do commands and blocks to limit the scope of these variables as tightly as possible. We also use the principle of locality here to group the lines that most closely relate to each variable to assist in tightening their scope.

The importance of this change in improving the maintainability of code cannot be overstated, so let's take a moment to explore the issues involved.

First, the question of variable names has no simple answer. In general, short variable names are fine whenever there are only a few ideas being represented and they're obvious. When the number of pieces of information an algorithm needs to refer to grows beyond say three, longer variable names are essential because you need to shift names from the abstract to the concrete. We don't remember abstractions nearly as well as we remember concrete things. Also, concrete names can contribute to the readability of a complicated algorithm in a way names like % just can't. There are times when I'll break this guideline - my new File Manager tools for example introduce lots and lots of short variable names for reasons I'll explain when I publish that project - but most of the time it's a good guideline.

Second, many MUMPS programmers would think it's strange to rearrange lines - let alone to introduce argumentless do commands and blocks - just to tighten up the scope of variables, but that's because they spend too much time reading code they already understand and not nearly enough struggling through other people's complicated logic. Especially when variables are reused for more than one piece of information - which even this subroutine does - the shifting meaning of variables can be the very hardest thing to understand, the easiest to get wrong and screw up.

I can't count the number of times I was led down a false path in troubleshooting because I didn't realize that a variable now contained a different piece of information, or that it couldn't possibly be defined now. Over the decades I've figured out that some programmers create complicated messes in the symbol table because they too have lost track of their variables and are just desperately hoping it will all work when they release it to the field.

With the advent of the 1990 standard, there's no longer any excuse for that kind of confusion in the symbol table. We can use not just the set and kill commands, but also the new and argumentless do commands to ensure that variables you only need for a couple lines exist only for those two lines, and that those you need for the entire subroutine survive for the entire subroutine. Making this distinction clear to your reader will help them more than almost anything else in prioritizing the data, in knowing which data is the main focus of your algorithm and which is just short-lived helper data used to calculate something more important.

8) The next step is to find where the same variable name is used to store different things at different times. To reduce the confusion for the reader, it's better to split up such variables until each name maps to just one piece of information. In H^%DTC we split the %Y output variable (day of the week) from two previous %Y values; # years counting from Fileman's year 0 becomes YEARFM and # years counting from $HOROLOG's year 0 becomes YEARH. Choosing names in such cases is quite the art, since you need to clearly distinguish them from each other and from other related variables (like the true year in the LEAPDAYS block, which we renamed from Y to YEAR).

Also, upon reflection, to improve the clarity of the final set to the %H output variable we split that %H from the earlier set to %H that counted the number of days in the current year (which we name DAYS). All this variable renaming and splitting lets us replace

Code:

S %H=(%H+(%Y*365)+DILEAP+%)

with

Code:

S %H=YEARH*365+LEAPDAYS+DAYS+IMPRECIS

which is a much more readable line of code.

9) Add structural and explanatory comments to divide H into logical steps.

10) Add examples to the parameter comments.

11) Replace the postconditional set in the LEAPDAYS block with if and set commands. This is a matter of taste. Postconditions used to be absolutely essential to write horizontal MUMPS. Now that MUMPS is more vertical, each postcondition in older code needs to be reevaluated to decide whether it is clearer for readers as a postcondition or separated into separate commands. In this case, I wanted to unpack the algorithm to emphasize the condition.

12) Remove extraneous parentheses from set LEAPDAYS, IMPRECIS, and set %H. Experienced programmers new to MUMPS find its operator-precedence rules confusing because they're different, but distorting how you right MUMPS to make it look more like another programming language isn't a viable strategy for dealing with that, for two reasons. First, it's hard enough to write good MUMPS code when you aren't twisting up its algorithms. Second, programmers simply need to learn the operator-precedence rules; they're simple and ubiquitous and deserve the time it takes to get used to them. The best thing you can do for people trying to make the transition from other languages to MUMPS is to write good, clear MUMPS code so they can get used to what it looks like.

Therefore, I replaced

Code:

S DILEAP=(Y\4)-(Y\100)+(Y\400)-446

with

Code:

S LEAPDAYS=YEAR\4-(YEAR\100)+(YEAR\400)-446

and I replaced

Code:

S %=('%M!'%D)

with

Code:

S IMPRECIS='MONTH!'DAY

and I replaced

Code:

S %H=(%H+(%Y*365)+DILEAP+%)

with

Code:

S %H=YEARH*365+LEAPDAYS+DAYS+IMPRECIS

In all these cases, the removal of the parentheses (and in the case of %H the rearrangement of the expression to make it work) eliminated noise that contributed nothing to the algorithm but a distraction. If the resulting code looks strange to you, it's because you still aren't comfortable with MUMPS's rules for operator precedence. The best way to fix that problem isn't to sprinkle unnecessary parentheses throughout your code, it's to spend more time getting used to MUMPS's strange grammar.

13) The last change was to take this opportunity to fix a long-standing bug, the Y2700 bug. I coordinated the project to convert all the VISTA infrastructure packages to deal with Y2K, and at the time we made the necessary changes I discovered another bug that won't kick in until the year 2700, when Fileman dates will roll over from seen digits to eight. When that happens, all the code throughout VISTA that picks apart Fileman dates into years, months, and days will break because they are hard-coded to assume seven digits.

Instead of being hard-coded to specific character positions, as they are now, these algorithms need to float from the end of the date back toward the front, so that however long or short the year is it will still work. This fix is also needed to allow early Fileman dates, those between 1701 and 1800, when the year is specified by fewer than three digits. In the code that follows, the variables DATE and LENGTH were introduced to make the length calculations easier, and the $Extracts were changed from absolute positions to calculations based on LENGTH.

This last change is an example of the kind of algorithmic improvement that almost always results as a side effect during refactoring, because the better you understand your algorithms, the more clearly you can see its strengths and weaknesses. That is the point, of course, to master the code until your understanding approaches that of the code's original author.

Here are the results of step five for H^%DTC.

Code:

H   ; converts a Fileman date/time to a $H format date/time
   ;;public;procedure;NOT clean;silent?;SAC-compliant?;recursive?
   ; original comment = called from DIG, DIP4
   ;
   ; throughput:
   ;   X = date/time in Fileman format (YYYMMDD.HHMMSS, where YYY can be
   ;       any number of Ys that together equal the # years since 1700,
   ;       e.g., 3110129.161225 for 29 January 2011 @ 16:12:25); the time
   ;       part of X is optional. X is documented as not returned, but
   ;       never cleaned up within H or TOH, and for good reason - there
   ;       are VISTA routines that rely on X after a call to H^%DTC. The
   ;       documentation is wrong and should be corrected.
   ; output:
   ;   %H = date in $H format (# days since 31 December 1840, e.g., 62120).
   ;        If X is imprecise, then the first of the month or year is
   ;        returned.
   ;   %T = time in $H format (# seconds since midnight, e.g., 58345). If
   ;        X has no time then %T = zero.
   ;   %Y = day-of-week as # from 0 to 6 (0 = Sunday and 6 = Saturday). If
   ;        X is imprecise, then %Y = -1.
   ;   None of these outputs is killed before being set, so they will
   ;   pass through any previous structures & values except the top node.
   ;
   ;   Cleanup of these input & output variables is the responsibility of
   ;   the caller, though a survey of usage of H^%DTC shows many callers
   ;   are not handling the cleanup.
   ;
   ; H.1  handle conversion problems
   ;
   ; The earliest date $HOROLOG can represent is 1 January 1841. Fileman
   ; format can gracefully represent dates back to 1 January 1800. Any
   ; Fileman dates too early to convert to $HOROLOG need to be rejected.
   ;
   I X<1410000 D  Q
   . S %H=0 ; no days since 31 December 1840
   . S %T=0 ; no seconds since midnight
   . S %Y=-1 ; no day of week (early dates are treated as imprecise)
   ;
   ; H.2  break Fileman date into components
   ;
   ; Here we fixed the Y2700 bug, in which Fileman dates roll over to 4
   ; digits for the year beginning in 2700. $Extract algorithms like this
   ; have always been hard-coded assuming a three-digit year. This is the
   ; first place we've converted it to handle any number of digit for the
   ; year. This change also fixes the early-FM-date bug (between 1700 and
   ; 1800, when the # of Y digits is 1 or 2), though that desn't matter
   ; in this call, since $HOROLOG can't handle dates that early anyway.
   ;
   N YEARFM,MONTH,DAY D
   . N DATE S DATE=X\1
   . N LENGTH S LENGTH=$L(DATE)
   . ;
   . S YEARFM=$E(DATE,1,LENGTH-4)
   . S MONTH=$E(DATE,LENGTH-3,LENGTH-2)
   . S DAY=$E(DATE,LENGTH-1,LENGTH)
   ;
   ; H.3  convert time part of X, if any, into seconds since midnight
   ;
   D
   . N TIME S TIME=X#1
   . ;
   . S %T=$E(TIME_0,1,2)*60+$E(TIME_"000",3,4)*60+$E(TIME_"00000",5,6)
   ;
   ; H.4  calculate how many leap days to add
   ;
   N LEAPDAYS D
   . N YEAR S YEAR=YEARFM+1700
   . I MONTH<3 S YEAR=YEAR-1 ; before March 1st, treat as prior year
   . S LEAPDAYS=YEAR\4-(YEAR\100)+(YEAR\400)-446
   ; 4, 100, and 400 refer to standard leap-year algorithm; 446 refers to
   ; the number of leap days prior to 1 January 1841; we only count leap
   ; days after that date
   ;
   ; H.5  convert date part of X into days since 31 December 1840
   ;
   N IMPRECIS S IMPRECIS='MONTH!'DAY ; if date's imprecise, add one day
   ; to move into the current year or month
   D
   . N YEARH S YEARH=YEARFM-141 ; convert years from FM count to $H count
   . N DAYS ; add up # days in current year (not counting leap day if any)
   . S DAYS=$P("^31^59^90^120^151^181^212^243^273^304^334","^",MONTH)+DAY
   . S %H=YEARH*365+LEAPDAYS+DAYS+IMPRECIS
   ;
   ; H.6  calculate day of week
   ;
   S %Y=$S(IMPRECIS:-1,1:%H+4#7)
   ;
   ; %, %D, and %M used to be killed at this point. Now they are newed
   ; above under their new names of IMPRECIS, DAY, and MONTH. This change
   ; could cause variable scoping problems for callers who relied upon
   ; H^%DTC to kill them, but a search of VISTA turned up no such cases,
   ; so the change was made to lighten H^%DTC's footprint.
   ;
   QUIT  ; end of H^%DTC

H^%DTC is now fully refactored according to the standards I'm currently using. A different programmer refactoring according to different standards would produce probably quite different results. For comparison, here is the original subroutine.

Code:

H   ;called from DIG, DIP4
   I X<1410000 S (%H,%T)=0,%Y=-1 Q
   S %Y=$E(X,1,3),%M=$E(X,4,5),%D=$E(X,6,7)
   S %T=$E(X_0,9,10)*60+$E(X_"000",11,12)*60+$E(X_"00000",13,14)
TOH   N DILEAP D
   . N Y S Y=%Y+1700 S:%M<3 Y=Y-1
   . S DILEAP=(Y\4)-(Y\100)+(Y\400)-446 Q
   S %H=$P("^31^59^90^120^151^181^212^243^273^304^334","^",%M)+%D
   S %=('%M!'%D),%Y=%Y-141
   S %H=(%H+(%Y*365)+DILEAP+%),%Y=$S(%:-1,1:%H+4#7)
   K %M,%D,% Q

I'll reassert what I wrote at the beginning of this project: the new code is absolutely not intrinsically better than the original code. The original is dense, compact, and powerfully expressed, and except for the Y2700 bug fix the original does every last thing the new one does.

The only reason to convert the code through refactoring is so that new programmers can be as productive with the new code as the original author is with the original code - in this example mainly so I can be as productive with it, since I'm the one doing the refactoring. As dramatic as the change to the code appears to be, the most dramatic change here isn't to the code but to my brain, to retool me to become better able to support H^%DTC.

Rick Refactors %DTC, Step by Step

Who is online