One of the perennial issues that we all encounter from time to time is that of properly formatting words that have been entered into a database. I am sure that, just as I have, you will at least once have run into the situation where you have to produce a personalized letter, report, mailing label or some such printed output that uses names from the database and have ended up with something that looks like this:

Mr jeffrey wilson
123 letsbe avenue
seldom
Wilts

 

This, of course, is the result of using lookup keys for the title and county and direct user input for everything else. By and large, VFP is pretty good about string handling, and it even provides a PROPER() function which will easily handle this example, turning it into:

Mr Jeffrey Wilson
123 Letsbe Avenue
Seldom
Wilts

However, things are not so easy when we have to deal with a specially formatted word, or name, like "O'Reilly" (comes out as "O'reilly") or "MacDonald" (ends up as "Macdonald"). There have been, over the years, various attempts to handle these, but none are totally satisfactory, although some are very close. The situation gets worse when you move away from simply dealing with names and addresses and have to apply formatting to longer strings of text, like paragraph headings in a legal document.
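The limitation with names is easy to reproduce from the Command Window. This is only a quick illustration of the behavior described above, using the sample values from this article:

```foxpro
*** PROPER() handles simple names well...
? PROPER( "mr jeffrey wilson" )   && Mr Jeffrey Wilson

*** ...but knows nothing about internal capitals
? PROPER( "o'reilly" )            && O'reilly
? PROPER( "MACDONALD" )           && Macdonald
```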

Here's one I had to deal with recently:

pre-existing conditions that aren't specifically covered by section 12a (sub-section 2)

Applying the PROPER() function resulted in:

Pre-existing Conditions That Aren't Specifically Covered By Section 12a (sub-section 2)

Whereas what I really needed was:

Pre-Existing Conditions that aren't Specifically Covered by Section 12a (Sub-Section 2)

So I sat down to figure out exactly how best to tackle the problem. I quickly realized that there are three basic scenarios which could apply to any given 'word'. First, it could simply follow the standard rule and have the first letter capitalized. Second, it could be a word that is not simply capitalized but has special rules, like Scottish names. Third, it could be an exception to the first two and either should not be capitalized at all (words like "that" and "aren't") or require special, non-rule-based formatting (names like "duMaurier" and "de Torres", or specialized words like "FoxPro" and "SQL").

It seemed that the simplest way to handle these non-rule-based exceptions was to create a table and define them, so that's what I did. The table, named changecase.dbf, has the structure shown in Table 1 and is shown in Figure 1:
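For readers who want to experiment before opening the real table, here is a rough sketch of how such an exception table could be built and queried. The field names (cLookup, cFormat) and their widths are assumptions made for illustration and are not the actual structure of changecase.dbf:

```foxpro
*** Hypothetical structure: field names and widths are assumptions
CREATE TABLE changecase FREE ( cLookup C(30), cFormat C(30) )
INDEX ON UPPER( cLookup ) TAG cLookup

*** A few of the exceptions mentioned in the text
INSERT INTO changecase ( cLookup, cFormat ) VALUES ( "that", "that" )
INSERT INTO changecase ( cLookup, cFormat ) VALUES ( "foxpro", "FoxPro" )
INSERT INTO changecase ( cLookup, cFormat ) VALUES ( "sql", "SQL" )

*** Look up a word; fall back to PROPER() when it is not an exception.
*** Padding the key to the full field width forces an exact match.
lcWord = "sql"
IF SEEK( PADR( UPPER( lcWord ), 30 ), "changecase", "cLookup" )
  lcWord = ALLTRIM( changecase.cFormat )
ELSE
  lcWord = PROPER( lcWord )
ENDIF
? lcWord    && SQL
```

Indexing on the upper-cased lookup value keeps the search independent of how the user typed the word.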

So much for the exceptions; how about those Scottish names? Well, several years ago, Sue Cunningham sent me a routine that she used for managing Scottish name formatting and I adopted that for my own use. The code is very slick and, rather than trying to define all possible names, uses a series of tests to determine if the word is a recognized Scots name. Here is my version of Sue's code:

********************************************************************
*** [P] CheckScots(): Check for standard Scottish names
********************************************************************
PROTECTED FUNCTION CheckScots( tcInWord )
  LOCAL lcOutWord, lnLen, llIsScots, lcTest
  *** Note: space marker has already been removed in calling routine
  *** so the length here is the true length of the name
  lcOutWord = ALLTRIM( tcInWord )
  *** Check for the shortened form first
  *** (compare against upper-case literals, since the test value is upper-cased)
  IF UPPER( SUBSTR( lcOutWord, 1, 2 )) == 'MC'
    RETURN 'Mc' + PROPER( SUBSTR( lcOutWord, 3 ))
  ELSE
    *** Process the word through the parser
    lnLen = LEN( lcOutWord )
  ENDIF

  *** Need to test the names in descending order of length to eliminate
  *** the need to be too explicit
  IF ! llIsScots AND lnLen >= 7
     lcTest = UPPER( LEFT( lcOutWord, 7 ) )
     llIsScots = INLIST( lcTest, 'MACADAM', 'MACCAFF','MACCARL','MACCLOS', ;
                 'MACCONN','MACCRAC','MACCULL','MACHENR', ;
                 'MACLANE','MACLEAN','MACLEOD','MACLAUG')
  ENDIF
  IF ! llIsScots AND lnLen >= 6
     lcTest = UPPER( LEFT( lcOutWord, 6 ) )
     llIsScots = INLIST( lcTest, 'MACART','MACAFF','MACINT', ;
                 'MACIVE','MACKAY','MACKEN','MACLAR','MACRAE','MACWIL')
  ENDIF
  IF ! llIsScots AND lnLen >= 5
     lcTest = UPPER( LEFT( lcOutWord, 5 ) )
     llIsScots = INLIST( lcTest,'MACKA')
  ENDIF
  IF ! llIsScots AND lnLen >= 4
     lcTest = UPPER( LEFT( lcOutWord, 4 ) )
     llIsScots = INLIST( lcTest, 'MACB','MACC','MACD','MACF','MACG', ;
                 'MACM','MACN','MACP','MACT','MACV')
  ENDIF

  *** If this is a Scottish name, format it correctly
  IF llIsScots
    lcOutWord = 'Mac' + PROPER( SUBSTR( lcOutWord, 4 ))
  ENDIF
  RETURN lcOutWord
ENDFUNC

That takes care of the names and the exceptions, which leaves only the question of how to parse the input string and handle the simple capitalization. The way I do this is to replace all spaces in the input string with a non-alphanumeric character (I use CHR(96)). This marks the position of any original spaces in the string. The next step is to parse the entire string one character at a time and add a space immediately after any character that is neither a letter nor a number.

Now, you will be thinking, why on earth would he remove spaces and then add them back? The answer is to catch any embedded characters like apostrophes or hyphens. After running this input string

pre-existing conditions that aren't specifically covered by section 12a (sub-section 2)

through my spacing routine, it now looks like this:

pre- existing` conditions` that` aren' t` specifically` covered` by` section` 12a` ( sub- section` 2)
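The spacing step can be sketched in a few lines. This is a minimal illustration of the description above rather than the code from the class itself, and MarkWords is a hypothetical name chosen for the example:

```foxpro
? MarkWords( "pre-existing conditions that aren't covered (sub-section 2)" )

*** Hypothetical helper: marks original spaces, then splits on punctuation
FUNCTION MarkWords( tcInput )
  LOCAL lcMarked, lcOut, lnPos, lcChar
  *** Mark the position of each original space with CHR(96)
  lcMarked = STRTRAN( tcInput, " ", CHR(96) )
  lcOut = ""
  FOR lnPos = 1 TO LEN( lcMarked )
    lcChar = SUBSTR( lcMarked, lnPos, 1 )
    lcOut = lcOut + lcChar
    *** Add a space after anything that is neither a letter nor a digit
    IF NOT ISALPHA( lcChar ) AND NOT ISDIGIT( lcChar )
      lcOut = lcOut + " "
    ENDIF
  ENDFOR
  *** Drop the space added after a trailing punctuation mark
  RETURN RTRIM( lcOut )
ENDFUNC
```

Because CHR(96) is itself non-alphanumeric, every original space ends up as a marker followed by a real space, which is what lets the word functions see each fragment separately.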

The result, as you can see, is to separate out, into "words", the partial words that were previously hidden. Now the process is straightforward: I simply use the native GETWORDCOUNT() and GETWORDNUM() functions to step through the string one "word" at a time. First I check to see if the 'word' exists in the formatting table; if so, I apply that formatting, otherwise I simply apply the native PROPER() function. Finally, before restoring the word, I check to see if it is a Scottish name. After this process the string now looks like this:

Pre- Existing` Conditions` that` aren' T` Specifically` Covered` by` Section` 12a` ( Sub- Section` 2)
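Stepping through the marked string one "word" at a time might look something like the loop below. It is a simplified sketch that applies only PROPER(); the exception-table lookup and the Scottish name check are left out, and the variable names are illustrative:

```foxpro
*** lcMarked stands in for the output of the spacing step
lcMarked = "old` macdonald` had` a` farm"
lcResult = ""
FOR lnWord = 1 TO GETWORDCOUNT( lcMarked )
  lcWord = GETWORDNUM( lcMarked, lnWord )
  *** The formatting table lookup and name check would go here;
  *** this sketch falls straight through to PROPER()
  lcWord = PROPER( lcWord )
  lcResult = lcResult + IIF( EMPTY( lcResult ), "", " " ) + lcWord
ENDFOR
? lcResult
```

Both functions treat a space as a word delimiter by default, so the CHR(96) markers travel along with the word they follow.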

Now I can restore the original spacing by removing spaces and then replacing all occurrences of CHR(96) with a space. The result is:

Pre-Existing Conditions that aren'T Specifically Covered by Section 12a (Sub-Section 2)
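Restoring the spacing needs nothing more than two nested replacements. A short sketch, using a fragment of the test string:

```foxpro
lcText = "Pre- Existing` Conditions"
*** Remove the spaces that were added by the spacing step...
lcText = STRTRAN( lcText, " ", "" )
*** ...then turn each marker back into the space it stood for
lcText = STRTRAN( lcText, CHR(96), " " )
? lcText    && Pre-Existing Conditions
```

The order matters: removing the added spaces first ensures that only the original ones come back.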

The final check is to correct any characters that were processed as if they were single-character words but which are now terminal letters (the "t" in "aren't", which comes out capitalized, is an example here). This is done by looking for an apostrophe in position 3 or more of each word in the string. If there is one, then everything after the apostrophe is forced to lower case. The final result of my test string is, therefore:

Pre-Existing Conditions that aren't Specifically Covered by Section 12a (Sub-Section 2)
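The terminal-letter check can be sketched as follows. FixApostrophes is a hypothetical name, and this simplified version rebuilds the string with single spaces between words:

```foxpro
? FixApostrophes( "Isn'T that O'Reilly" )    && Isn't that O'Reilly

*** Hypothetical helper: lower-cases whatever follows an apostrophe
*** found at position 3 or later in a word
FUNCTION FixApostrophes( tcText )
  LOCAL lcOut, lnWord, lcWord, lnPos
  lcOut = ""
  FOR lnWord = 1 TO GETWORDCOUNT( tcText )
    lcWord = GETWORDNUM( tcText, lnWord )
    lnPos = AT( "'", lcWord )
    IF lnPos >= 3
      lcWord = LEFT( lcWord, lnPos ) + LOWER( SUBSTR( lcWord, lnPos + 1 ) )
    ENDIF
    lcOut = lcOut + IIF( EMPTY( lcOut ), "", " " ) + lcWord
  ENDFOR
  RETURN lcOut
ENDFUNC
```

Testing for position 3 or later is what leaves names like "O'Reilly" alone, since their apostrophe sits in position 2.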

Which is exactly what I needed. The code is packaged up as a class, based on the Session class (so that its table won't interfere with anything else in the environment), and is written so as to use GETWORDCOUNT() and GETWORDNUM() under VFP version 7.0 or higher, or the equivalent functions from FoxTools for version 6.0 or earlier. The calling syntax and interface are very simple:

oFormat = NEWOBJECT( 'xChgCase', 'changecase.prg' )
oFormat.FormatText( [old macdonald had a farm, isn't that cute?])
Result = Old MacDonald Had a Farm, isn't that Cute?

Note that one consequence of my processing is that I need to include partial words like "aren" and "isn" in the formatting table to prevent them from being capitalized inappropriately. Of course, I don't need to differentiate between "it'll" and "it'd" because my terminal capital handling forces everything after an apostrophe in the third position or later to lower case anyway.

The code, and my formatting table, are included in the zip file attached to this column. As always, please feel free to modify and improve, and please share your improvements.

One Response to Properly Formatting Text Strings

  • Mike Lewis says:

    Andy,

    I’m sorry, but Sue Cunningham’s CheckScots() function doesn’t produce entirely accurate results. This is no reflection on Sue’s code; it’s just the way things work.

    I remember discussing this issue with you and Sue at some length back in the old CompuServe forum. I ended up grabbing a copy of my local Edinburgh phone directory and doing some analysis to prove my point.

    For every 100 Scots who style their name MacDonald, there are another hundred who prefer Macdonald. There are no official rules for this. It’s just a matter of personal (or family) preference.

    On the other hand, Sue’s routine does a good job with McDonalds (which are never written as Mcdonald).

    In fact, the function probably delivers good results more often than not, and by all means use it for the sort of application you described. I’m only saying that it’s best not to place too much reliance on it.

    Mike

    You are, of course, quite correct, and your input is appreciated. However, it does as good a job as anything I have seen elsewhere and it’s certainly better than the native PROPER() function. In fact, over the years it’s proven good enough for all but the most pernickety (read “persnickety” for this side of the pond) of true Scots.
    Best Regards to you both
    Andy

     
