INTELLIGENCE TESTS






The Stanford Binet: Fourth Edition

First to employ the concept of IQ. Also introduced the use of the "alternate item" - something to be used in place of a regular question under certain conditions.
 

First edition - 1916 - ratio IQ = (mental age / chronological age) x 100
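
The ratio IQ calculation can be sketched in a few lines of Python. This is only an illustration: the function name ratio_iq and the example ages are made up for the sketch and are not taken from the test manual.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = (mental age / chronological age) x 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    # Multiply first so whole-number ages give an exact result.
    return mental_age * 100 / chronological_age


# Hypothetical example: a 10-year-old performing at a 12-year-old level.
print(ratio_iq(12, 10))
```

A child whose mental age equals his or her chronological age gets a ratio IQ of 100 under this formula.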
 

Second edition = Terman Merrill Revision - 1937 - 2 alternate forms - increased the range both down & up - added scoring examples - enlarged range of standardization sample

Third Edition - 1960 & 1972 - single form (L-M) - best of both forms - switch from ratio to deviation IQ (a standard score)
 

Fourth Edition - 1986 - big changes in all areas - tasks are no longer grouped by age (age scale) - now a point scale - 15 subtests in 4 areas:

Verbal Reasoning, Abstract / Visual Reasoning, Quantitative Reasoning, Short-Term Memory

Now uses "test composite" as the term for deviation IQ

Now has an explicit theoretical model of intelligence, based on Horn & Cattell - 4 factor model - uses "g" (general mental ability) at the top level; the second level includes crystallized abilities, fluid-analytic abilities and short-term memory (area scores on test)
 

Standardization sample = 5,000, ages 2 - 24, stratified using census data

Good KR-20 reliability coefficients; acceptable criterion validity coefficients when used with "normal" subjects
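
For reference, KR-20 (Kuder-Richardson formula 20) is an internal consistency coefficient for items scored 0 or 1. A minimal sketch follows. The function name kr20 and the tiny data set are invented for illustration, and the sketch assumes the population variance of total scores (some texts use the sample variance instead).

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 0 or 1.

    responses: one list per test taker, one 0/1 entry per item.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    if n_people < 2 or n_items < 2:
        raise ValueError("need at least 2 test takers and 2 items")

    # Sum of p*q over items, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in responses) / n_people
        sum_pq += p * (1 - p)

    # Population variance of the total scores (an assumption of this sketch).
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Made-up data: 4 test takers, 3 items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(scores), 2))
```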

Good for use with the developmentally disabled (DD) - but not good at discriminating among the gifted

Administration - uses principle of adaptive testing - start where you think they are - then go down or up based on response

basal level = pass all 4 items at 2 consecutive levels

ceiling = fail 3 out of 4 or 4 out of 4 items at 2 consecutive levels
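
The basal and ceiling rules can be sketched as a small function. This is a simplified illustration, not the manual's procedure: it assumes each level has exactly two items (so two consecutive levels give four items), it scans a completed record from the easiest level upward, and the name find_basal_and_ceiling is made up for the sketch.

```python
def find_basal_and_ceiling(levels):
    """levels: list of (item_a, item_b) scores, 1 = pass, 0 = fail,
    ordered from the easiest level to the hardest.

    Returns (basal, ceiling); each is a pair of level indexes or None.
    """
    basal = None
    ceiling = None
    for i in range(len(levels) - 1):
        # The four items at two consecutive levels.
        four_items = list(levels[i]) + list(levels[i + 1])
        passes = sum(four_items)
        fails = len(four_items) - passes
        if basal is None:
            # Basal: all four items passed.
            if passes == 4:
                basal = (i, i + 1)
        elif fails >= 3:
            # Ceiling: three or four of the four items failed.
            ceiling = (i, i + 1)
            break
    return basal, ceiling


# Hypothetical record for one subtest, easiest level first.
record = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
print(find_basal_and_ceiling(record))
```

In an actual administration the examiner stops testing once the ceiling is reached; the sketch only shows how the two rules are applied to the item scores.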

Scoring is all 0 or 1 - includes space for behavioral observations

Evaluation - reliable & valid measure of overall general ability; good point is adaptive testing approach

standardization sample - much better but still imbalanced

No inter-scorer reliabilities given

Studies have not supported construct validity very well
 

The Family of Wechsler Scales

all individually administered - cover preschool to adulthood - all similar in structure
designed to assess the individual's "overall capacity to understand and cope with the world around him."

Relative ease of administration
Psychometrically sound
Have good construct validity
high inter-rater reliability
norms are deviation IQs by age group
 

The Wechsler Adult Intelligence Scale-Revised - ages 16 and up
This was the original Wechsler scale, which started out in 1939 with the publication of the Wechsler-Bellevue I (WB-I). Wechsler was at Bellevue Hospital when he developed it and was trying to estimate the intelligence level of the multilingual, multinational and multicultural clients referred there.

He devised a point scale instead of an age scale. Items were classified by sub-test according to what type of tasks they required.

Original version had some major problems:

the standardization sample was restricted,
some sub-tests lacked inter-item reliability,
some of the sub-tests were made up of items that were too easy,
the scoring criteria were too ambiguous.

Therefore, 16 years later, a revised form came out, called the Wechsler Adult Intelligence Scale (WAIS). The WAIS followed the same general format as the Wechsler-Bellevue, but many of the questions were revised and new questions were added.
 

It improved the administration and scoring, it was carefully standardized and quickly became the benchmark in intelligence testing.

The new version, the WAIS-R, made relatively few changes to the WAIS. Some of the things that were changed were the directions for scoring, the record form, and the order in which the tests are given.

In the old form, all verbal sub-tests were given first and then the performance sub-tests; now they're alternated.

It used a well-stratified random sample for its standardization and excluded from that sample (a) people who could not speak or understand English, (b) people institutionalized with mental retardation or brain damage, (c) severely emotionally or behaviorally disturbed individuals, and (d) people physically disabled in a way that would restrict their performance. In addition, no more than one member of any family was tested.

The new form has good internal consistency reliability, with high split-half reliability coefficients for those sub-tests for which split-half is appropriate. These reliability estimates are similar across the age range. Test-retest reliability over two weeks is .94 for the Verbal scale, .89 for the Performance scale and .95 for the Full Scale.
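
As a rough illustration of how such coefficients are computed: a test-retest coefficient is the Pearson correlation between scores from two administrations, and a split-half coefficient is the correlation between two halves of a test, stepped up with the Spearman-Brown formula. The sketch below assumes an odd/even item split. The function names and data are made up and are not taken from the WAIS-R manual.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one list of item scores per test taker."""
    # Odd-numbered items (1st, 3rd, ...) versus even-numbered items.
    odd_totals = [sum(person[0::2]) for person in item_scores]
    even_totals = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Made-up data: 4 test takers, 4 items each.
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(split_half_reliability(data))
```

For test-retest reliability, pearson_r would simply be called with the first-administration scores and the second-administration scores of the same people.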

Different factor analyses have been done on this scale, and they have yielded anywhere from one to three factors. A criterion validity study found that the sub-tests with lower reliabilities also ended up with lower criterion validity coefficients.

The sub-test scaled scores have a mean of 10 and a standard deviation of 3. On any of the Wechsler tests, a full-scale IQ of 100 is average. The IQs used are deviation IQs, which means each person's score is compared with scores earned by individuals in his or her own age group.
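
The idea behind these standard scores can be sketched as a linear transformation of a z-score computed within the person's age group. This is only an illustration: the published tests use norm tables rather than a formula, the age-group mean and standard deviation in the example are made up, and the IQ standard deviation of 15 is the conventional Wechsler value, which these notes do not state.

```python
def standard_score(raw, group_mean, group_sd, new_mean, new_sd):
    """Rescale a raw score to a scale with the given mean and SD."""
    z = (raw - group_mean) / group_sd  # distance from the age-group mean
    return new_mean + new_sd * z


def scaled_score(raw, group_mean, group_sd):
    """Sub-test scaled score: mean 10, standard deviation 3."""
    return standard_score(raw, group_mean, group_sd, 10, 3)


def deviation_iq(raw, group_mean, group_sd):
    """Deviation IQ: mean 100; SD of 15 is assumed here."""
    return standard_score(raw, group_mean, group_sd, 100, 15)


# Hypothetical age group with mean raw score 24 and SD 6.
print(scaled_score(30, 24, 6))
print(deviation_iq(30, 24, 6))
```

Because the comparison group is the person's own age group, the same raw score can convert to different standard scores at different ages.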

The Wechsler Intelligence Scale for Children - 3rd ed. (WISC-III)

This test was originally published in 1949 and was considered to be a downward extension of the Wechsler-Bellevue. Again, it was well standardized and stable, and it correlated with other existing tests of intelligence.

It did have its flaws:

(1) its standardization sample only contained white children, (2) some of the test items were viewed as perpetuating gender and culture stereotypes, and (3) parts of the manual were unclear both for administration and scoring.

Based on this, it was revised in 1974: nonwhite children were included in the sample, pictures were made more culturally "balanced", the language was modernized and "child-ized", and changes were made in administration and scoring.

The WISC-III was published in 1991. It has changed over a quarter of the items from its predecessor and it has also updated the normative data.

It now contains a new sub-test called Symbol Search, designed to measure cognitive processing speed. Some of the sub-tests were modernized and others were enlarged. The process by which it was revised shows good test development procedures. In-house staff did a review based on feedback from users and kept in touch with experts in the field throughout the revision process. New items were pilot tested before they were included in the test, and the whole test was pilot tested on 500 children before it was released.

Each item was individually analyzed to look at performance as a function of gender, ethnicity and age. An attempt was made to make it more user friendly for both examiners and examinees. The standardization sample consisted of 200 children in each of eleven age groups (2,200 in all), divided equally by gender. The other major variables were matched against the 1988 US Census data. Additional testing was done with Black and Hispanic children to ensure the accuracy of the item-bias statistics. Testing was also done to see whether it truly overlapped where it should with the WPPSI and the WAIS.

The results show that it has sound internal consistency reliability, test-retest reliability and inter-rater reliability. There is evidence for construct validity as well as concurrent validity.
 
 
 

The Wechsler Preschool & Primary Scale of Intelligence - Revised (WPPSI-R)

People had asked Wechsler for something that could test children who were younger than the age range for the WISC and instead of just adding to the WISC and extending it downward, he decided to create another scale just for children of that age.

It was originally published in 1967, and it went down to age four. The WPPSI-R was published in 1989 and covers ages three years through seven years and three months.

There were considerable revisions in the WPPSI-R: tests were added, tests were renamed, the age range was extended downward, and about 50% of the original items were kept. The standardization sample was 1700 children: 100 boys and 100 girls in each of eight age groups separated by only six months, plus one group of 50 boys and 50 girls in the interval from 7 years to 7 years 3 months. Census data were also used to stratify the sample on other variables.
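
The sample size quoted above can be checked with a little arithmetic. The sketch below is just that check; the function name sample_size and the tuple layout are invented for the illustration.

```python
def sample_size(groups):
    """groups: list of (number_of_age_groups, boys_per_group, girls_per_group)."""
    return sum(n * (boys + girls) for n, boys, girls in groups)


# Eight six-month age groups of 100 boys and 100 girls,
# plus one oldest group of 50 boys and 50 girls.
preschool_sample = [(8, 100, 100), (1, 50, 50)]
print(sample_size(preschool_sample))
```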

Again, it has been shown to have both reliability and validity.

The only problems that may occur with the WAIS or any of the other Wechsler scales are in scoring (which is subjective on some items, although the manuals give lots of examples of the various scoring levels) and in interpreting sub-tests.

The scaled scores at different age ranges are not uniform, and the scaled-score system does not equate from sub-test to sub-test. This sometimes makes analysis and interpretation difficult.
 

OTHER TESTS OF INTELLIGENCE

1. Slosson Intelligence Test-Revised (SIT-R). Some people call it the short intelligence test. It was designed to be a quick, easily administered, but valid measure of intelligence. It was originally considered to be an abbreviated version of the Stanford-Binet. The 1991 revision contains items similar to those found on the Wechsler scales. It is really only meant for screening and is primarily a verbal test. The areas included are vocabulary, general information, similarities and differences, comprehension, auditory memory and quantitative ability.

Although the developers tried to match the US population with respect to education and social characteristics, whites and better-educated people were overrepresented. The test is said to extend downward into infancy and upward to age 27. Above age seven, it becomes increasingly weighted with verbal items, although there are still some items that involve perceptual-motor or motor skills. It takes only 15-20 minutes to give and can be scored very quickly with no subjectivity. It has reliability coefficients in the .90s and validity coefficients in the .80s to low .90s. It can be administered by someone who is not a trained examiner and can therefore be used for screening by people such as those in personnel. It does, however, give you very limited information.

2. Figure drawings as measures of intelligence. Figure drawings are used both to measure intelligence and in the measurement of personality. People are asked to draw a person, a house or a tree. There are different scoring systems available for handling these, the most famous of which is the Goodenough-Harris Scoring System. It's meant as a screening device, so that if someone scores low on it, you do more extensive testing. When using this system, inter-rater reliabilities are in the .80s to .90s and test-retest reliabilities range from .52 to .87 over a two-week period. It is very controversial professionally. There is much argument in the literature about the validity of using figure drawings to measure intelligence, since they are affected by so many other things, and even the test itself is touted as a way to measure personality.

GROUP INTELLIGENCE TESTS

Group tests have a number of advantages:
(1) they can be administered to large numbers at the same time and are, therefore, more efficient;
(2) most of them can be reliably machine or computer-scored;
(3) they're more economical, because they use a one-page computer answer sheet and reusable booklets;
(4) a larger and more representative sample of test takers can be used for norming; and
(5) the test administrator need not be highly trained.

They are primarily used for screening.

They also have disadvantages:
(1) they assume that all the people taking the test understand what is expected of them, and that all are motivated to perform on the test;
(2) you cannot observe anything about the way people tend to problem-solve;
(3) you are unable to observe anxiety, frustration, or any other factor which might hinder performance but not be a part of actual intelligence;
(4) since they are designed for masses of people, the person who's different usually has a harder time and would score lower on this than his "true" score;
(5) in a mass test, everyone starts on the same item and frequently ends on the same item as well. Therefore, a student who starts to fail early is faced with a much longer run of failures than on an individual test;
(6) almost all of the group intelligence tests require the test taker to read;
(7) another skill required of the test taker is the ability to mark an answer sheet with a pencil, which can be a problem when you're dealing with younger children or those with eye-hand coordination or concentration problems;
(8) even though the standardization groups may be large, they may not be representative, because they tend to be standardized on school districts rather than on individuals. This may destroy representativeness because, even if the district is representative,

(a) the district had to volunteer to be part of the standardization and
(b) parental permission had to be obtained for the students to take part in the testing;
(9) since they're so easy to give and administer, they often get misused by schools for tracking that is irrelevant to the actual abilities of a given student.
 
 
 

Group Testing in Schools
 

Group tests are often used in schools, and even though legislation has limited this, they are still used quite often to get information about instruction-related activities. For example, one of the main things they can do is tell you when further testing is necessary. There are group tests that go as young as Kindergarten. Some of the group tests used in schools are:

(1) California Test of Mental Maturity;
(2) Kuhlmann-Anderson Intelligence Tests;
(3) Henmon-Nelson Tests of Mental Ability;
(4) the Cognitive Abilities Test; and
(5) the Otis-Lennon School Ability Test.

The Otis-Lennon is one of the favorites among these group tests. It's designed to be used from Kindergarten through Grade 13. Its primary function is to assess a test taker's ability to cope with school learning tasks. It is in its sixth edition. There are both verbal and non-verbal items at every level, so you can get a verbal and a non-verbal score. The resulting total score is a School Ability Index (SAI), which is a normalized standard score with a mean of 100 and a standard deviation of 16, so that it looks very similar to an IQ.
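
A normalized standard score is generally obtained by converting a percentile rank to the matching point on the normal curve and then rescaling. The sketch below illustrates that general idea with a mean of 100 and a standard deviation of 16. It is an assumption about the general technique, not a description of the publisher's actual norming procedure, and the function name is made up.

```python
from statistics import NormalDist


def normalized_standard_score(percentile, mean=100, sd=16):
    """Convert a percentile rank (as a proportion between 0 and 1)
    to a normalized standard score with the given mean and SD."""
    if not 0 < percentile < 1:
        raise ValueError("percentile must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(percentile)  # normal-curve equivalent
    return mean + sd * z


# Hypothetical test taker at the 84th percentile of the norm group.
print(round(normalized_standard_score(0.84)))
```

A test taker exactly at the median (percentile 0.5) gets a score of 100 on this scale.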

Group Intelligence Tests in the Military

The original group intelligence tests, used in WWI, were the Army Alpha, which was primarily verbal in nature, and the Army Beta, which was non-verbal. Group tests are still administered to prospective recruits for screening purposes and as an aid in assigning soldiers to training programs and jobs. The fact that mean IQs have gone down since the end of the draft has caused the military to change some of its training manuals and procedures, putting them in simpler language to be sure that the people reading them understand. Group tests are also used to screen candidates for Officer Candidate School (OCS), as with the Officer Qualifying Test used by the Navy, and the Airman Qualifying Exam is given to all Air Force volunteers.

The Armed Services Vocational Aptitude Battery (ASVAB) is administered to prospective new recruits in all the armed services. It consists of 334 multiple-choice items in ten different sub-tests. A sub-set of these is used as the Armed Forces Qualification Test, which is considered to be a measure of general ability and is used for selection of recruits. The different armed forces then use different cutoff points in accepting or rejecting people for the different services.

The ten sub-scores are then used to decide what different aptitudes people have and thus to direct them into one form of training or another. Studies support its construct, content and criterion-related validity for guiding training and selection.
 

MEASURES OF SPECIFIC INTELLECTUAL ABILITIES

These are abilities that are not necessarily picked up by general intelligence tests:

(1) Measures of Creativity. A criticism leveled against most intelligence tests is that they concentrate pretty much on convergent thinking, which means bringing deduction to bear and emphasizing one solution to a problem. Divergent thinking, on the other hand, involves reasoning that moves in many different directions and comes up with many solutions to the same problem. It requires flexibility of thought, originality and imagination. Tests used for this purpose include the Remote Associates Test, where the person is given three words and the task is to find a fourth word that's associated with the other three in some way, and the Torrance Tests of Creative Thinking, which have picture-based and sound-based materials; with the sound-based materials, the task is to respond with whatever thoughts each sound conjures up. One of the problems with these tests is that they have not had much validity research, especially construct validity evidence that holds up.

(2) There are other tests, such as those of art judgment, ability to hear and differentiate music and sounds, etc.
 
 
