The Problem With Values

25 Nov

The readings for the past few weeks have emphasized to me the importance of critically examining every change in assessment (and also the non-changes that perhaps should be occurring). As Bob Broad might say, we need to be attuned to what we really value – and, I would add, we need to evaluate assessments at every level against those values. Do we value multi-modal digital composing? Do we value the re-thinking and revising process? Do we value and understand student writing in ways that cannot be numerically calculated? Then we should have ways of assessing that reflect those values.

This, of course, is easier said than done. Several of the readings illustrated for me the importance of clearly defining our terms, and asking others to do the same. When the designers of essay scoring machines claim that the machines can “read” for content, we need to ask exactly what they mean by “content.” Yes, their formulas are proprietary, but I think that the educators who will implement and perhaps even be judged by these machines have the right to know whether reading for content means counting vocabulary words, or some other process far removed from what they believe reading for content actually means.
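To make the worry concrete, here is a deliberately reductive sketch in Python of what “reading for content” could amount to if it really were just vocabulary counting. The word list and scoring scheme are my own invented stand-ins for illustration, not any vendor’s actual (and undisclosed) formula.

```python
# A hypothetical, deliberately naive "content" scorer: it only counts how many
# words from a predetermined topic vocabulary appear in the essay. This is an
# invented illustration of the concern, not any real scoring engine's method.

TOPIC_VOCABULARY = {"photosynthesis", "chlorophyll", "sunlight", "glucose", "energy"}

def naive_content_score(essay: str) -> float:
    """Return the fraction of topic-vocabulary words that appear in the essay."""
    words = {word.strip(".,;:!?").lower() for word in essay.split()}
    return len(TOPIC_VOCABULARY & words) / len(TOPIC_VOCABULARY)

# Two very different pieces of writing earn the same "content" score:
essay_a = "Photosynthesis converts sunlight into chemical energy, storing glucose the plant later uses."
essay_b = "Energy glucose sunlight photosynthesis."  # a keyword list, not an argument
print(naive_content_score(essay_a))  # 0.8
print(naive_content_score(essay_b))  # 0.8
```

If counting “content” looks anything like this under the hood, the scorer literally cannot distinguish an explanation from a keyword dump, which is exactly why educators deserve to know what such systems actually measure.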

Hillocks, in his article “How State Assessments Lead to Vacuous Thinking and Writing,” argues, as his title suggests, that state assessments claim to value and assess critical thinking, but in reality they encourage (and are satisfied with) students who simply tack on more sentences or offer a slew of personal opinions as reasons or evidence. While I don’t necessarily agree with everything that Hillocks writes, I think he is doing something important in his article. He is testing the explicit claims about what an assessment encourages or values against how it actually responds to and assesses student writing, which reveals implicit values that can then affect what students are taught.

So, once we have discovered that an assessment doesn’t actually value the things that we want it to, what can we do? Hillocks addresses the problem that he sees by suggesting that assessments be accompanied by source material that students can draw from in supporting their arguments, thereby encouraging teachers to help students learn how to use evidence in their thinking and writing. Condon, in his article “Looking Beyond Judging and Ranking: Writing Assessment as a Generative Practice,” argues that administrators and politicians will always want to go with the cheapest assessment possible (one that at least has the appearance of being accurate and appropriate). In order to create and implement richer assessments that better reflect what we value in student writing, Condon argues that we need to make the assessment deliver more value than simply placing or ranking students. He thinks this can be done by coupling research with assessment.

The problem with this, of course, is that smaller schools have little if any research budget for their faculty (particularly in the humanities). In spite of this shortcoming, both Condon and Hillocks are at least attempting to address the problems that they see in assessment with practical solutions that might work for some situations. What we need to see more of is scholarship like this – critical analysis of the problems in an assessment, followed by practical, feasible solutions that will at least work for that particular situation, if not others.

The Power of Assessment

2 Nov

As I look back on the last three weeks’ worth of reading on “evaluating (digital) writing and programs,” I am struck by a realization: while the articles/books are ostensibly about promoting or investigating particular forms/theories/principles of assessment, there is another thread entwining them all. These writings, either explicitly or implicitly, are about power: who should create and manage assessments?

To take this idea in a Foucauldian direction: if discourse is power and assessment shapes discourse, then arguments about assessment are also intrinsically arguments about power. It seems as though some of the controversies surrounding assessment have more to do with the distribution of power between institutions, teachers, students, and other interested third parties than they do with the validity of an assessment, perhaps because some people predetermine the validity of an assessment based on who it gives the most power to.

Although I think it is wrong to automatically write off or embrace particular assessments because of the parties they tend to give more power to, I don’t think a consideration of the movement of power that occurs within an assessment is out of place. Movements of power do occur in assessment – the question is whether we are going to recognize them, investigate them, and perhaps try to justify them.

For instance, Bob Broad, in his introduction to Organic Writing Assessment: Dynamic Criteria Mapping in Action, argues for the importance of faculty-developed assessments, rather than assessments imported by administrators. Among other things, this is a move to direct power from the administration to faculty. This isn’t necessarily a bad thing – after all, faculty typically have years of training and experience in evaluating writing and are, arguably, the best people to create an assessment for their particular context. Nonetheless, this power shift is present in Broad’s argument, and ought to be understood and examined.

Daniel Royer and Roger Gilles, in “Directed Self-Placement: An Attitude of Orientation,” go even further than Broad in moving power down the traditional hierarchy: they argue that students themselves should decide which writing course to take. However, although power apparently moves from the administration to students, Directed Self-Placement also in some ways empowers the administration. Now the administration does not need to expend nearly as much money or time in placing students – it has outsourced the labor. Again, this is not necessarily a criticism, but something to be aware of and consider further.

Finally, in Russel Durst et al.’s article “Portfolio Negotiations: Acts in Speech,” the apparently innocent act of negotiating portfolio assessments among teachers can be seen as removing power from the individual teacher and transferring it to the collective group of teachers at that institution. As the authors note, while the negotiating process can lead to faculty development and “quality control,” it can also have the negative side effect of silencing minority viewpoints.

These are just a few examples; my power analysis could go on and on.

I am not an extremist; I don’t think that any particular stakeholder in assessment should hold all or most of the power, but there are situations where shifting power toward one of them might be desirable. For instance, the administration needs some power within assessments in order to ensure that the institution remains accredited and so it can market the institution effectively. Faculty (both individually and collectively) need power in order to experiment with new assessment strategies and revise current practices. And students, we are realizing more and more, need to be empowered so that they can have a sense of ownership over their own writing and be capable of assessing and reflecting on their own writing practices and abilities.

How to balance the relations of power between these stakeholders is a problem that is, and most likely will continue to be, a subject for debate. However, what I believe should be at the heart of these negotiations (or grabs) for power is an unswerving commitment to student learning. That is, after all, supposed to be the primary mission of educational institutions. If this can be sincerely kept at the forefront of discussions concerning assessment, rather than our political views on how power should be distributed, then perhaps we will begin to better see when and how power should move.

Who Assesses the Assessors?

12 Oct

In creating a writing assessment, it is important to recognize the situated nature of all of the people involved. The creators and administrators of the assessment have certain perspectives and biases, as do the raters/graders, those who use the assessment to make decisions (admissions offices, principals, deans, department chairs, teachers), and those who are being assessed. If this situated, contextual nature is not attended to (and sometimes even if it is), assessments are likely to have different levels of validity for different groups, because they perpetuate social biases.

However, while investigating any potential biases is necessary and admirable, it can also be extremely difficult to do well. Those who assess the assessment also begin with a set of hypotheses and assumptions, which can affect the kinds of questions they ask, the information they pay attention to, and the methods that they use in compiling and sorting through their data. Additionally, biases are complex, resulting from a variety of factors and manifesting themselves in multiple ways.

For instance, Haswell & Haswell in “Gender Bias and Critique of Student Writing” found that raters tended to score essays more harshly when they knew that the writer was of the same gender as the rater (405 Fig. 23-2), but this wasn’t always true. Both male and female student raters (as opposed to teacher raters) rated the female-authored essay more highly when they knew it was written by a woman than when they didn’t know the author’s gender. In the case of the male-authored essay, however, they stayed true to the pattern: female students rated the essay more highly (because the writer was of the opposite gender), while male students rated it lower. Additionally, in “Linguistic Discrimination in Writing Assessment: How Raters React to African American ‘Errors,’ ESL Errors, and Standard English Errors on a State-Mandated Writing Exam,” Johnson and VanBrackle showed that raters don’t simply penalize every departure from Standard English equally; rather, raters viewed ESL errors more leniently, on the assumption that those writers are still learning the language, whereas African American students should “know better” because they are native English speakers.

On top of all of this complexity, there is also the issue that, when trying to determine the nature of test bias for or against particular groups of people, the researcher is assuming that the problem resides in the assessment, and not in other social factors that might produce differentiated assessment results even in the fairest of assessments. Making this kind of determination is difficult, and it will perhaps become clearer as we continue to investigate potential biases in assessment design.

The other difficulty that I have with the question of bias in assessments is that, if you can make a solid case for discrimination (as I believe Johnson and VanBrackle have done), then the problem becomes: now what? How do we get rid of, or at least reduce, these biases, especially in the case of high-stakes assessment? Johnson and VanBrackle tentatively argue that “it is possible that a better understanding of AAE [African American English] on the part of raters would change the results of this study” (47). The basis for this argument is their suspicion that ESL errors would probably have been viewed more harshly in the past, but because information about the challenges ESL students face in achieving “perfect” Standard American English has been disseminated widely, raters now perceive these types of errors differently. If AAE could become widely recognized as following its own sociolinguistic logic, then perhaps its “errors” would not be rated as harshly.

While Johnson and VanBrackle are commendable in their attempt to offer a solution to this problem, I feel that academics in general tend to focus too much on pointing out problems, and not enough on recommending, and then putting into practice, reasonable solutions. I realize that the first step in changing anything is recognizing and making a strong case for the existence of the problem, but we also need to make sure that we don’t get stuck at that first step. Although no solution will be perfect or free from criticism, if the problems are worth recognizing, they are also worth solving.

Frames Change the Picture: Theories, Models, Technologies

28 Sep

For the last two weeks I have been reading about different theories and models of assessment. From my reading, a few things have seemed especially noteworthy to me. One of those things is the concept of “framing” that Linda Adler-Kassner and Peggy O’Neill explore in their book Reframing Writing Assessment to Improve Teaching and Learning (2010). In their book, “frames” are the narratives or ideologies surrounding an issue that determine what kinds of questions we ask and how we go about finding answers. For example, if we view the purpose of school as primarily job preparation, then we aren’t going to ask whether there are other things schools should be doing; we’re going to ask, “How well is the school preparing students for their future jobs?”, “Are there things schools are doing that are superfluous or harmful to job preparation?”, and so on.

Some of the most obvious frames currently at work in writing assessment are the constant call for more accountability and the acceptance of standardized tests as a valid basis for making high-stakes decisions (and their converses – that writing instruction needs no accountability at all and that all standardized tests are invalid). But, as my professor reminded my class, some of the most dangerous frames are the ones that are so accepted that we can’t even see them. In setting out to create a writing assessment there are many difficult questions that need to be asked, but a good place to start is: “What are my assumptions? What is the framework that makes this writing assessment necessary and/or justified?”

One of the models of writing assessment that I found most interesting is in Elizabeth Wardle and Kevin Roozen’s article, “Addressing the complexity of writing development: Toward an ecological model of assessment” (2012). For Wardle and Roozen, an ecological assessment would look at a wide array of literate activities that students perform in order to see how all of their experiences with writing impact their ability “to accomplish academic tasks” (107). This form of assessment is not defined in terms of traditional goals for assessments: placement, acceptance/rejection, scoring/grading/ranking. Instead, the authors argue that an ecological assessment can help administrators consider “the multiple sites where writing takes place” and can then subsequently strengthen those sites; it can also accomplish tasks like bringing “stakeholders closer together” in their “assumptions about what writing is and, subsequently, what writing assessment should be” (107).

This form of assessment is so interesting to me because it looks beyond some of the decisions that are made based on writing assessment (as important as those decisions are), and uses writing assessment as a means to help us learn about writing and about how we can best support people in becoming more proficient, confident writers. William Condon, in his article “Looking beyond judging and ranking: Writing assessment as a generative practice” (2009), argues for the importance of expanding our writing assessments so that they can be useful for more than their immediate role in helping us make decisions about students. He takes as his premise that “the lowest form of assessment that provides the appearance of thoroughness and the greatest economy will prevail” (142). Because assessments that are more valid and reflect a richer theory of writing also tend to be more expensive, he claims that we need to be able to use an assessment for a wide variety of purposes in order to justify its higher cost (142-143).

This consideration of the material costs and benefits of writing assessments brings me to the final concept: writing assessment as and with technology, as Michael Neal puts it in his book Writing Assessment and the Revolution in Digital Texts and Technologies. By looking at writing assessment as a technology, we can see that like any other technology it is designed with certain purposes in mind, and it carries with it underlying values and assumptions. By examining writing assessment as a technology, some of the “frames” about writing and writing assessment become more apparent. In considering writing assessment with technology (i.e. technologies that can become part of writing assessment), we have to carefully examine how incorporating these technologies will change writing assessments and therefore writing.

For more information about writing assessment as technology, see Asao Inoue’s prezi: http://prezi.com/7xx8pkzwn8xu/writing-assessment-technologies/

Validity and Reliability: Why Definitions Matter

13 Sep

This week’s readings were thematically organized by the concepts of validity and reliability. These two terms are potentially the most controversial and the most pivotal words in writing assessment; defining and then applying these concepts could arguably be what writing assessment is all about. The most widely circulated and recognized definitions of these terms (particularly among psychometricians) run something along the lines of:

Validity: A test is valid if it measures what it purports to measure

Reliability: A test is reliable if it consistently measures whatever it measures; reliability is also (to explain its relationship to validity) a necessary but insufficient condition of validity

These definitions seem fairly reasonable (and are oh so nice and neat). However, just because they are the most widely used definitions does not necessarily mean that they are the best or most meaningful definitions.

First of all, these definitions obscure the multi-faceted nature of determining what constitutes reliability and validity. For instance, as Roger Cherry and Paul Meyer point out in their article “Reliability Issues in Holistic Assessment,” inconsistencies in test results (and therefore, presumably, unreliability) stem from at least three main sources: students, the tests themselves, and test administration/scoring (31). Additionally, Cherry and Meyer identify at least eight different statistical methods for calculating inter-rater reliability alone (38). With this amount of diversity, simply saying a test is “reliable” (meaning that it is consistent) obscures more than it illuminates; test developers need to be more forthcoming about their methods for determining reliability and their justification for using those methods rather than others.
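To see how much the choice of statistic matters, here is a minimal sketch in Python using invented holistic scores (not Cherry and Meyer’s data or their specific methods). It compares two of the most common ways of reporting inter-rater reliability, simple percent agreement and Cohen’s kappa, on the same set of ratings.

```python
# A minimal sketch with invented holistic scores, comparing two common
# inter-rater reliability statistics on the same ratings. Not Cherry and
# Meyer's data or calculations; purely an illustration.
from collections import Counter

rater_1 = [4, 4, 4, 4, 4, 4, 4, 4, 4, 3]  # hypothetical scores for ten essays
rater_2 = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

def percent_agreement(a, b):
    """Fraction of essays on which the two raters gave identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[s] * counts_b[s] for s in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

print(percent_agreement(rater_1, rater_2))  # 0.9 -- looks highly "reliable"
print(cohens_kappa(rater_1, rater_2))       # 0.0 -- no better than chance
```

With scores this skewed, the raters “agree” ninety percent of the time, yet kappa reports agreement no better than chance; a bare claim that a test is “reliable” tells us very little until the method behind the number is disclosed.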

This idea of “justification” leads into the other key term, validity. Validity is even more complicated than reliability. To explore this complexity, I would first like to turn to the Oxford English Dictionary’s second and fourth definitions for this term:  “2. The quality of being well-founded on fact, or established on sound principles, and thoroughly applicable to the case or circumstances; soundness and strength (of argument, proof, authority, etc.)… 4. Value or worth; efficacy.”

In the context of writing assessment, the common definition of validity falls short in many ways. Basically, a test is valid if it doesn’t misrepresent what it is measuring; i.e., if it doesn’t lie, it’s valid. This definition seems to me far too limited and trite. I believe that the definition of validity in writing assessment should follow more closely the general use of the term, particularly in being “established on sound principles” and “thoroughly applicable to the case or circumstances.”

In being “established on sound principles,” a writing assessment would be influenced by current theories of writing, and would more closely reflect what is actually being taught in the writing classroom. In being “thoroughly applicable to the case or circumstances,” the assessment would be both accurate and appropriate. An assessment cannot be applicable if it is not fairly accurate; if we can’t trust the accuracy of the assessment results, then any decisions made based on the test (applications of the test) would be arbitrary and unfair. Also, in order to be applicable, it has to be appropriate for the particular case or circumstance. Is the SAT an appropriate basis for deciding who should attend particular colleges/universities? For allocating scholarships? Is the COMPASS assessment an appropriate means for determining who needs “remedial” writing courses and who does not? Determining the appropriateness of a writing assessment for a particular use should be the overarching concern for assessment designers. Assessments are never designed for their own sake; they are designed so that they can be used for a particular purpose or purposes.

Justification has to do with argumentation, with providing sound reasons to persuade people that a particular course of action or a particular theory is warranted. By defining two important concepts in a facile manner, and thereby avoiding a more complex process of justification, those who design writing assessments and those who use their results to make decisions are attempting to sidestep responsibility for some of the wide-ranging effects of assessment, an evasion that can (and does) lead to disastrous consequences.

In a few weeks I will be designing and completing my own validity analysis of an assessment; I can already see that it is going to be a complicated and frustrating process. This post is an attempt to begin that process, and to establish some primary concerns.

Histories of Writing Assessment

7 Sep

In the past two weeks I have read selections on the history of Writing Assessment drawn from six different books, each of them attempting to explain how the field has come to be what it is now, and sometimes even project where it is going (or should go) in the future. Drawing from and distilling many of these histories it would be easy to characterize the history of Writing Assessment as a history strewn with conflicts between reliability v. validity, local v. national, direct v. indirect, machine v. human, psychometricians v. compositionists, scientists v. humanists, etc. While these divisions are very real and the conflicts between them have shaped the field, it is possible to tell the story in other ways.

Most of the six histories that I read hinted at or explicitly blurred some of these apparently clear-cut divisions. You can’t have validity without reliability, and reliability is meaningless (and possibly dangerous) without validity. Local assessments are always done in the context of national practices and standards, and national assessments cannot be administered without some level of local cooperation. Machines are designed by humans (and therefore presumably carry traces of human biases and values), and human assessors use machines/technology to assist them in their assessment. Direct assessments give one kind of information, and indirect assessments another; while the two correlate, they test different processes and abilities. That does not mean one is inherently more or less valid than the other, but in different contexts and for different purposes, one might be preferable to the other, or both together (or neither) might be preferable still.

Between psychometricians and compositionists, or between scientists and humanists, the divide seems harder to bridge. Each discipline has its own methodologies, values, language, and goals, making it difficult to foster real inter-disciplinary communication and collaboration. However, rather than trying to agree on everything, perhaps more energy should be put into creating goals that both “sides” view as desirable and worth achieving, while recognizing that not every writing assessment is going to have the same goals, and that different assessments can be useful for different purposes.

However, telling the history of Writing Assessment in terms of binaries, even when seeking to complicate them, ignores the larger context of writing assessments. In my freshman year of college I completed a short group research project on standardized testing. Although my research was fairly simplistic, it did have the virtue of acknowledging the wider context in which these arguments play out. Writing Assessment isn’t just about two camps at war with one another; it is also about political ideologies, economic competitiveness, (un)ethical approaches to education, and the individual people who are affected by the results of these assessments. As I dig deeper into the more specialized areas of writing assessment, I will strive to keep this larger context in mind, as well as the history of Writing Assessment as a discipline (or sub-discipline).