DataStage Performance Tuning

DataStage performance tuning can be done at three levels:

  1. Job level
  2. Sequence level
  3. SQL level

Job Level:

  • Use a suitable configuration file (number of nodes chosen for the hardware configuration and data volume).
  • Pick the proper partitioning algorithm for each stage and avoid re-partitioning the data (use Same partitioning where the data is already partitioned correctly).
  • Sort data before stages such as Aggregator, Join, and Merge.
  • Remove unwanted columns and filter rows as early as possible, ideally at the source.
  • Use the database stage's SQL to sort, filter, and join tables.
  • Choose between the Join, Merge, and Lookup stages based on data volume.
  • Minimize the use of the Transformer stage.
  • Use the buffer parameters if required: APT_BUFFER_MAXIMUM_MEMORY (default 3 MB; increase up to 30 MB), APT_BUFFER_DISK_WRITE_INCREMENT, and APT_BUFFER_FREE_RUN.
  • Don't use Runtime Column Propagation if it is not required.
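The buffer variables take their sizes in bytes, which is easy to get wrong when thinking in megabytes. The following is a minimal Python sketch that converts megabyte figures into the values to enter for these variables. The helper name buffer_settings is hypothetical, and the 1 MB and 0.5 defaults used for APT_BUFFER_DISK_WRITE_INCREMENT and APT_BUFFER_FREE_RUN are assumptions that should be checked against the documentation for your DataStage version.

```python
MB = 1024 * 1024  # the buffer sizes are expressed in bytes


def buffer_settings(max_memory_mb=3, disk_write_increment_mb=1, free_run=0.5):
    """Return the buffer variables as a name -> string value mapping.

    Hypothetical helper for illustration only. The 3 MB default comes from
    the tuning notes above; the other two defaults are assumptions.
    """
    if max_memory_mb <= 0 or disk_write_increment_mb <= 0:
        raise ValueError("buffer sizes must be positive")
    if free_run <= 0:
        raise ValueError("free_run must be positive")
    return {
        "APT_BUFFER_MAXIMUM_MEMORY": str(int(max_memory_mb * MB)),
        "APT_BUFFER_DISK_WRITE_INCREMENT": str(int(disk_write_increment_mb * MB)),
        "APT_BUFFER_FREE_RUN": str(free_run),
    }


if __name__ == "__main__":
    # Raise the per-buffer memory limit from 3 MB to 30 MB.
    for name, value in buffer_settings(max_memory_mb=30).items():
        print(f"{name}={value}")
```

The printed values can then be set as environment variables at the project or job level. Raising the memory limit is usually worth trying only on jobs that are actually spilling buffers to disk.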

Sequence Level:

  • If jobs do not depend on each other, run them in parallel: place their Job Activity stages in the sequence side by side, without trigger conditions linking them.
  • Use the Terminator and Exception Handler activities to terminate the sequence cleanly.

Configuration File

Why?

One of the great strengths of InfoSphere DataStage is that, when designing parallel jobs, you don’t have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities.

If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don’t necessarily have to change your job design.

InfoSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.

Unless you specify otherwise, the parallel engine uses a default configuration file that is set up when DataStage is installed.

Opening the default configuration file

To open the default configuration file, select Tools > Configurations.

Example configuration file

The following example shows a default configuration file from a four-processor SMP computer system.

{
node "node1"
	{
	fastname "R101"
	pools ""
	resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
	resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
node "node2"
	{
	fastname "R101"
	pools ""
	resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
	resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
	}

}

The default configuration file is created when InfoSphere DataStage is installed. Although the system has four processors, the configuration file specifies two processing nodes. Specify fewer processing nodes than there are physical processors to ensure that your computer has processing resources available for other tasks while it runs InfoSphere DataStage jobs.

This file contains the following fields:

node
The name of the processing node that this entry defines.
fastname
The name of the node as it is referred to on the fastest network in the system. For an SMP system, all processors share a single connection to the network, so the fastname is the same for every node.
pools
Specifies that nodes belong to a particular pool of processing nodes. A pool of nodes typically has access to the same resource, for example, access to a high-speed network link or to a mainframe computer. The pools string is empty for both nodes, specifying that both nodes belong to the default pool.
resource disk
Specifies the name of the directory where the processing node will write data set files. When you create a data set or file set, you specify what the controlling file is called and where it is stored, but the controlling file points to other files that store the data. These files are written to the directory that is specified by the resource disk field.
resource scratchdisk
Specifies the name of a directory where intermediate, temporary data is stored.

Configuration files can be more complex and sophisticated than the example file and can be used to tune your system to get the best possible performance from the parallel jobs that you design.
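To experiment with different degrees of parallelism, it can help to generate configuration files rather than edit them by hand. The following is a minimal Python sketch that renders a file in the same layout as the example above for any number of nodes on a single SMP host. The function name render_config and its parameters are hypothetical and are not part of DataStage.

```python
def render_config(node_count, fastname, disk_dir, scratch_dir):
    """Render a configuration file with node_count nodes on one host.

    Hypothetical helper: every node shares the same fastname, data set
    directory, and scratch directory, and belongs to the default pool.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines += [
            f'node "node{i}"',
            "\t{",
            f'\tfastname "{fastname}"',
            '\tpools ""',
            f'\tresource disk "{disk_dir}" {{pools ""}}',
            f'\tresource scratchdisk "{scratch_dir}" {{pools ""}}',
            "\t}",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduce the two-node layout of the default file shown above.
    print(render_config(
        2,
        "R101",
        "C:/IBM/InformationServer/Server/Datasets",
        "C:/IBM/InformationServer/Server/Scratch",
    ))
```

A job is pointed at a particular configuration file through the APT_CONFIG_FILE environment variable, so the same job can be run against two-node and four-node files to compare run times without changing its design.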
