Ah, one of the most artistic aspects of programming: How do you partition your code into manageable blocks? Ripe material for, um, discussion. I haven't read much that I agree with on this topic, so here is an attempt to describe my own instinctive habits.
I often write rather long functions (say, 200 lines). I support these programs and don't find them difficult to maintain, and other people I work with find it easy to understand and modify them as well. But I don't create 200-line functions without constraints. I've developed techniques for making these programs manageable:
The code is broken up into "paragraphs", blocks of code separated by blank lines. The blank lines serve the same purpose they do in this posting: they make the paragraphs easy to identify visually.
Each paragraph is usually relatively small, typically 10-20 lines, but there's no specific limit or goal.
Each paragraph starts with an abstract (comments describing what the block is supposed to do). In a few cases, I've found it helpful to document in these comments the variables that are created in the paragraph (duplicating comments in the code where the variables are defined).
All branching is either strictly within the paragraph or to the beginning of some other paragraph. (Or to the top of a loop, which may not be exactly at the beginning of a paragraph.)
These code paragraphs have some of the characteristics of small functions: They're manageably short. They implement a specific, stated task. (By looking only at the abstract and not the code you can suppress the detail of how the task is accomplished.) They are structurally self-contained with respect to branching.
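As a rough illustration of the paragraph style described above, here is a minimal sketch in Python rather than APL. The function name, the "name,score" input format, and the variables are all hypothetical and not taken from any program mentioned in this posting. Each paragraph is set off by a blank line and opens with a short abstract, and the first one also documents the variables it creates.

```python
def summarize_scores(lines):
    """Return (count, mean, top_name) for lines of the form "name,score".

    lines    -- iterable of text lines; blank lines are ignored
    Returns  -- number of records, mean score, name with the highest score
    """
    # Parse the input into two parallel lists.
    #   names  -- list of str, one per record
    #   scores -- list of float, in the same order as names
    names = []
    scores = []
    for line in lines:
        if not line.strip():
            continue
        name, score = line.split(",")
        names.append(name.strip())
        scores.append(float(score))

    # Handle empty input here so the later paragraphs can assume data.
    if not scores:
        return 0, 0.0, None

    # Compute the summary statistics.
    count = len(scores)
    mean = sum(scores) / count

    # Find the top scorer (the first one wins on ties).
    top_name = names[scores.index(max(scores))]

    return count, mean, top_name
```

Reading only the comment at the head of each paragraph gives the outline of the function (parse, handle empty input, compute, find the top scorer) without the detail of how each step is done.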
Here's what I like about long functions written this way:
When you print or view the program, you see the code in the order it's executed. You don't have to jump around from one program to another to follow what's being done. Consider how annoying it would be to read a book that was filled with hundreds of "see page nn" references.
You don't have to repeatedly document variables. Virtually every function I write has, in the introductory comments, a complete description of all the inputs and outputs to the function. If a block of code is written in-line instead of being a subroutine, I don't need to document the variables that are referenced in the code. To see what variable X contains, you just look above the block and find the line that defines X. That line will have a comment saying what X represents.
This is not to say that I don't use subroutines. I do, but I tend to create them only for specific reasons:
If code has to be used in more than one place or is called more than once, I always create a subroutine. If the code is of general utility and might someday be used in other places, I create a subroutine.
I'm more inclined to create a subroutine if the code does some easily described, complete job (as opposed to being part of a larger task and useless by itself).
I'm more likely to create a subroutine if the interface with the outside world is small; that is, if it has only a few arguments and results. I occasionally have to write what I call a "macro" subroutine, a function that has numerous inputs and outputs (so many that I usually dispense with formal arguments and leave them all global). Such a function usually does not perform a complete task, but a subroutine is needed to avoid code duplication or some such thing. I don't like doing this, and in the line [1] comment I name the main function that the "macro" is a part of. I wish I could efficiently define these as local functions within the main function, without having to put quotes around everything and use #FX.
I use subroutines to prevent a function from being really huge. Even I become uneasy when a function gets to about 400 lines or so. My programs often grow incrementally as they are enhanced and modified, and when a function's size becomes sufficiently annoying, I farm pieces of it out into subroutines.
I create subroutines to separate the user interaction, including argument parsing and input checking, from the computational guts of a program. This permits the guts to be used noninteractively (under program control) if necessary.
I sometimes create subroutines for distinct phases of processing, especially when the phases are so complicated that they can't be done in a single paragraph. For example, in my ]FNTREE command, which displays a function calling tree, the main function parses the command line, gets the name lists, checks for errors, and then does:
[46] N C H{<-}FNTCALLS T @ figure out who calls whom
[47] M{<-}FNTMAINS N C A @ find the top-level functions
[48] D{<-}FNTDOTS M N C @ build the indented listing
[49] {delta_}RESULT{<-}FNTFORMAT D N H @ format it for display
Each of these four subroutines contains a loop. This may be an indicator of what I use in judging whether the task is "complicated enough" for a subroutine. Doing this also prevents loops from becoming too deeply nested in a single function.
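Two of the reasons above (separating user interaction from the computational guts, and giving each complicated phase its own subroutine) can be sketched together. This is a hypothetical Python example, not a rendering of ]FNTREE: the function names, the "caller: callees" input syntax, and the two-phase split are assumptions made for illustration only.

```python
def find_mains(calls):
    """Phase 1: return, sorted, the functions that no other function calls."""
    called = set()
    for callees in calls.values():
        called.update(callees)
    return sorted(name for name in calls if name not in called)


def build_listing(calls, mains):
    """Phase 2: return indented lines, walking down from each main function."""
    lines = []
    stack = [(name, ()) for name in reversed(mains)]
    while stack:
        name, path = stack.pop()
        lines.append("  " * len(path) + name)
        if name in path:
            # Recursive call: list it, but don't descend into it again.
            continue
        for callee in reversed(calls.get(name, [])):
            stack.append((callee, path + (name,)))
    return lines


def call_tree(calls):
    """Computational guts: usable under program control, no parsing."""
    mains = find_mains(calls)
    return build_listing(calls, mains)


def call_tree_command(text):
    """User-facing wrapper: parse "a: b c; b: d" text, check it, format it."""
    calls = {}
    for clause in text.split(";"):
        if not clause.strip():
            continue
        if ":" not in clause:
            raise ValueError(f"expected 'caller: callees' but got {clause!r}")
        caller, callees = clause.split(":", 1)
        calls[caller.strip()] = callees.split()
    return "\n".join(call_tree(calls))
```

Another program can call call_tree directly with a dictionary it already has, while call_tree_command handles the text a person would type. Each phase contains its own loop, so no single function ends up with deeply nested loops.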
I believe the complexity of a subproblem is the main criterion I use in deciding whether to write a paragraph or a subroutine. For example, my calendar-printing program is 170 lines in 18 paragraphs. I felt no urge to break it into subroutines because the problem wasn't hard enough. On the other hand, ]FNTREE isn't much longer than this, but I wouldn't think of writing it monolithically. It's too hard a problem to deal with mentally all at once. Breaking it into subroutines reduced the complexity to something I could manage.
I've noticed that some people seem to think that writing small functions will make their applications easier to understand and/or maintain. I don't find this to be true. In fact, I find excessive subroutinization to be an impediment to understanding new code. Here's what I don't like about having lots of tiny functions:
If you haven't memorized what the subroutines do, they make it hard to figure out what's going on. You constantly have to make side trips to learn what this or that function does, and sometimes you have to figure out how the variable names in the subroutine match up with those in the calling routine. If there's no compelling reason for a subroutine, an in-line paragraph may be easier to understand. It's really annoying to make the trip to a subroutine only to find out that it does something trivial and is called in only one place. Gimme a break!
The inputs and outputs of small subroutines are often not completely documented. (Doing so with many tiny subroutines would result in numerous comments, most of which would duplicate comments in the location where the variables were first defined.) The lack of interface documentation within the subroutine makes it harder to understand the code in the subroutine. You have to refer to some other program to find out what the arguments represent.
When something is cast as a subroutine, you have to be careful about changing it. Why is the code in a subroutine? Is it called in more than one place? If so, how will the change affect other users of the subroutine? It's like the difference between a global and a strictly-local variable. Before you change the global, you have to find out who else uses it. With a strictly-local variable, all you have to do is examine the program at hand.
My worst nightmare would be an application consisting of a thousand or so direct definition functions. Direct definition is great for including code in articles, where the prose takes the place of comments and the total amount of code is small, but a real-life application of significant complexity written as uncommented direct definition functions would be horrifying to deal with.
I want to emphasize that I'm not equating "big" with "good" here. A huge, monolithic function with deeply nested or unstructured branching can be just as incomprehensible as a swarm of tiny undocumented subroutines. But a large function doesn't have to be indecipherable, and if your only technique for dividing code into blocks is to write separate functions, you're ignoring some very practical ways of writing clear, easily-maintained programs.
Jim