Studying the Effect of AI Code Generators on Supporting Novice Learners in Introductory Programming

In this episode I unpack Kazemitabaar et al.’s (2023) publication titled “Studying the effect of AI code generators on supporting novice learners in introductory programming,” which found that students who had access to AI code generators while learning how to code outperformed students who did not have access, even when engaging in manual coding exercises.

Quote: “Our results show that learners who had access to the AI code generator (the Codex group) were able to successfully generate code and showed evidence of understanding the generated code during the training phase. They performed significantly better on code-authoring tasks (1.15x increased progress, 0.59x less errors, 1.8x higher correctness) and performance on the following manual code-modification tasks, in which both groups performed similarly. Furthermore, during the evaluation phase, on the immediate post-test, learners from the Codex group were able to perform similar to the baseline group despite not having access to the AI code generator. In the retention test, which was conducted one week later, learners from the Codex group performed slightly better on coding tasks and multiple-choice questions, although these results did not reach statistical significance. Finally, our analysis indicates that learners with more prior programming competency may benefit from AI code generators.” End quote. That's from page 2 of the publication titled “Studying the effect of AI code generators on supporting novice learners in introductory programming.” This paper is written by Majeed Kazemitabaar, Justin Chow, Carl Ka To Ma, Barbara J. Ericson, David Weintrop, and Tovi Grossman; apologies if I mispronounced any names.

Here's the abstract for this paper. Quote: “AI code generators like OpenAI Codex have the potential to assist novice programmers by generating code from natural language descriptions; however, over-reliance might negatively impact learning and retention. To explore the implications that AI code generators have on introductory programming, we conducted a controlled experiment with 69 novices (ages 10 through 17). Learners worked on 45 Python code-authoring tasks, for which half of the learners had access to Codex, each followed by a code-modification task. Our results show that using Codex significantly increased code-authoring performance (1.15x increased completion rate and 1.8x higher scores) while not decreasing performance on manual code-modification tasks. Additionally, learners with access to Codex during the training phase performed slightly better on the evaluation post-test conducted one week later, although this difference did not reach statistical significance. Of interest, learners with higher Scratch pre-test scores performed significantly better on retention post-tests if they had prior access to Codex.” End quote.

To summarize the study in a single sentence, I'd say that this study found that students who had access to AI code generators while learning how to code outperformed students who did not have access, even when engaging in manual coding exercises. And that right there is a fascinating little finding that we're going to unpack in today's episode of the CSK8 podcast.

Now, if you don't know who I am, my name is Jared O'Leary. I've worked with everyone from kindergarten through doctoral students in a variety of contexts, from music education to computer science education classes. You can find my CV on my website, jaredoleary.com. While you're there, you'll also find over 180 other podcast episodes, as well as some interviews with some awesome guests and solo episodes like this, where I unpack scholarship in relation to computer science education.

Now, this week's episode kind of builds off of last week's episode, which was talking about plagiarism and ChatGPT. This study is actually going to look at, well, what happens if we use code generation when learning how to code? There's an idea called distributed cognition where, as an example, you can take a tool and it will allow you to focus on higher-level processes rather than the mundane things. Like a calculator: instead of focusing on multiplying and carrying the one, etc., you can instead focus on using the calculator to give you the result that you're looking for, and then apply that result to, say, your construction project or whatever. So: thinking about the bigger picture rather than focusing on the mundane task of actually doing the mathematics.

Generative AI can do the same thing. Instead of focusing on writing out your for loop, you can instead write out a simple prompt that fills in your for loop, which, I don't know, maybe is 20 lines of code long for an entire function, and you can just copy and paste that into your application and write it out in, I don't know, let's say half the amount of time. That is a potential benefit when you are using generative AI to help you write code, but there can be some drawbacks with that. What happens if you don't have access to that tool? Let's say the server is shut down for ChatGPT for a day, but you have a deadline that you have to meet when authoring that code, and you can't do that because you might not know how to do it; you are 100% reliant on AI writing it for you. Or what if the AI is wrong and it creates an error and you can't fix that error? So while on one hand it provides some affordances, using it also creates some constraints, and so we need to kind of consider this and explore it.
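To make that for-loop example concrete, here's a small hypothetical sketch of the kind of exchange described above. The prompt and the function are my own illustration, not from the study; the idea is that the learner writes the natural-language description, and the AI code generator fills in the loop mechanics:

```python
# Hypothetical prompt a learner might give an AI code generator:
# "Write a function that counts how many numbers in a list are even."

def count_evens(numbers):
    """Count the even numbers in a list."""
    count = 0
    for n in numbers:  # the for loop the learner no longer writes by hand
        if n % 2 == 0:
            count += 1
    return count

print(count_evens([1, 2, 3, 4, 5, 6]))  # prints 3
```

The distributed-cognition point is that the learner reasons about what the function should do (the prompt), while the tool handles the mundane loop syntax.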

So this study is actually going to do that by exploring five different research questions. This is from page two of the PDF. Quote: “RQ1: Are novices able to utilize AI code generators to solve introductory programming tasks? RQ2: How do learners' task-performance measures (e.g., correctness score, completion time, and error rate) differ with and without AI code generators? RQ3: How does learners' ability to manually modify code differ with and without the use of AI code generators? RQ4: What are the effects on learning performance and retention from using an AI code generator versus not? RQ5: What is the relationship between existing programming competency and the effectiveness of using AI code generators in introductory programming?” End quote.

Alright, so the next section is on the related work. They have subsections in here on natural language programming, on AI coding assistants, introductory programming, etc. So if you're interested in learning more about some scholarship that explores those areas, then make sure you check out that section. The following section talks about the AI-assisted learning environment, so if you want to learn more about how to create your own version of this, and what they did when considering it, like the implementation, data instrumentation, programming task design, quality of AI-generated code, etc., check out that particular section.

Let's talk about, okay, well, what was this particular study? This is under the user study section. There are three phases to the study. The first phase was like a two-hour-long introduction to how to code, specifically with Scratch. The second phase was a training phase; this is where the learners were in two different groups: one group was just learning how to code without AI code generators, and the other was actually using that to help them code, going through 45 programming tasks and completing 40 different multiple-choice questions. And then the third phase, which is the evaluation phase, was kind of like a post-test. In this phase nobody was allowed to use the AI code generators or the Python documentation or receive any feedback on the coding task assignments, and they basically had to do a post-test to see whether or not they still remembered the information or the things that they learned during the training.

Now, in the study there were 69 learners: 21 of them were female, 48 were male, and they were all in the age range of 10 to 17, and they had a range of different backgrounds, demographics, etc., which you can check out on PDF page six in the bottom right corner. If you're interested, you can also check out the data collection discussion and data analysis, but I'm going to skip that because I don't think that'd be interesting for this particular podcast. So let's get nerdy and get into the results, starting on page seven.

In the training phase, overall, the completion data showed that the people who were able to use Codex (which is the AI code generator) were able to finish more of the tasks than the baseline group. The Codex group had a mean of 90.9% completion, whereas the baseline group that did not have the code generator only completed 79%. Their correctness score was also much higher: the group that had the AI was able to get 80.1 percent for the mean, as compared to 44.4 percent for the baseline group. And they spent less time doing it: the mean for this was 210 seconds for the AI group, whereas it was 361 seconds for the baseline group. So that is a huge difference in all these different categories. And the Codex group actually used less documentation than the other group: they used it 22.1 percent of the time, as compared to the baseline group, which used it 54.3 percent of the time. They had fewer errors overall; most of the errors were syntax errors, and in general these errors were fewer than the baseline group's, but they had roughly the same amount of semantic errors across both groups: 0.01 and 0.03.

Now, at the top of the page there's a figure with statistics of tasks in which the AI code generator was used, broken down by topic. The topics were basics, data types, conditionals, loops, and arrays, and the figure shows what percentage of the time students actually used the AI for the different ones. For the basic stuff they used it 48% of the time, for data types 61% of the time, for conditionals 75%, for loops 84%, and for arrays 85 percent. So as it got into more complex topics, it seems like the students were using the AI code generator more than they were at the beginning. This figure also breaks down the number of multiple usages for each one of those, as compared to a single use, as well as the number of tasks that were 100% completed by the AI generator, and then the percentage of the time that they just copied the output directly from the code generator into the answer. And you'll notice, as it got more complex, they did this more often.

But a question that I have is: were they running out of time, or was it because they just found it was easier than actually typing it out, etc.? Was it causation, in that as it got more complex these students relied upon the AI more, or is it more of just correlation, in that the complexity was kind of unrelated to it, and really it was just the time at which something was introduced? So, like, as they got later on into this particular study, they're running out of time, and they're like, “Ah, let's just get through the arrays and the loops, let's just copy and paste this, it'll save me some time.” I'm not sure about that. But the authors do note that even though these percentages are interesting, the patterns were not consistent: some students used the AI code generator a lot, while others who had access to it barely used it at all. So there's this large continuum, this large spread in usage; it wasn't consistent among the different students or participants, which is a great point that the authors made.

Now, one of the things that I personally would be afraid of with students learning how to code using generative AI is: well, maybe they're not actually going to learn the concepts; they're just going to rely on this tool doing it for them. Which is what I've often heard some people say with mathematics concepts: “Well, you need to write it out by hand, otherwise you're going to rely on your calculator and you're not going to understand it.” Well, in this particular study they found, quote, “although learners in the Codex group used the documentation less and relied heavily on AI-generated code for the authoring tasks, they still did just as well, and in some cases better, in manual code-modification tasks,” end quote, from page 10. That is such a key finding right there. We really need more studies to follow up on this and figure out whether this is something that is found in other areas, with other platforms, etc., because if so, that is a fascinating result, and honestly one that I wouldn't have predicted. What it's basically saying is: hey, we could use these code generators that allow students to learn something faster, and they're actually scoring better than if they hadn't used them.

Right now I'm looking at these findings and going, what's the downside here? Why would anyone not do this? Especially when we look at the evaluation phase, which is like the post-test, and you find that both groups performed similarly on all three different tasks that were measured: the authoring tasks, the modifying tasks, and the multiple-choice tasks. And in some of those tests, the students who used Codex, the AI code generator, actually performed better than the students who did not. That is really interesting. Now, before we sing our praises to AI code generators: there were more errors from the students who used Codex when they were doing the manual coding. They had a mean of 1.58 errors, compared to a mean of 0.99 when there was no starter code provided. So they're making slightly more errors, but not enough to make me go, “Yeah, we shouldn't do this.” This one did have some statistical significance. However, they also had slightly more errors when modifying tasks, but that did not have statistical significance. So, again, more studies would help to figure out why this is happening and whether or not this is generalizable outside of this population.

Now, the authors also had some qualitative feedback: they asked students what their perspectives were around this, which is great. Looking only at how well students performed is something that a lot of researchers end up doing, and that kind of leaves out a larger piece of the puzzle when it comes to educational psychology. Like, yeah, a student may have performed better on this, but did they walk away from the experience going, “Wow, I really hate that subject area”? Because if so, that's something we need to know and figure out why. So it's great that they were not only looking at what was learned, and the retention of that learning, but also what students actually thought while doing this. More researchers should do this kind of research, where it's looking at this broader, more holistic approach to learners and learning. So kudos to the authors for this.

Here's a quote from page 11 that's interesting. Quote: “Both groups felt they learned about Python programming and its concepts during the training phase. However, on stress and discouragement, learners in the Codex group felt slightly less stress (U = 390.5 and a p-value of 0.056). Some learners from the Codex group explicitly attributed their reduced stress to using the AI code generator. For example, participant 26 reported, quote, ‘Using the code generator helped me save time and reduce pressure,’ end quote.” And they've got more quotations from the different participants and whatnot that kind of reinforce this. So, in general, not only did the students perform better, but they actually liked it better. Some small percentage of students said, “Hey, I'd actually prefer to do this on my own rather than use a tool that does it for me.” I might be that kind of person: I kind of like the struggle, because I feel like I'll learn more through the process. But at least with these participants, that might be more of an outlier than the norm.

Alright, so the discussion section is really neatly organized around the different research questions. The first research question is around whether or not novices can use AI code generators; the answer is yes. The second research question, which is how do learners' task performances differ with and without AI code generators: well, they performed better with the code generators. The third question is how does learners' ability to manually modify code differ with and without AI code generators: the students who used the code generators were able to perform just as well, if not better than, the students who did not, when they were doing the post-test that did not allow anyone to use the code generators. That's an interesting finding.

The fourth research question was: what are the effects on learning performance and retention from using AI code generators versus not? They found that this doesn't impede the learning results, and in fact it might lead to better learning results than not using it. And the final research question, RQ5, is: what is the relationship between existing programming competency and the effectiveness of using AI code generators in introductory programming? What they found is that if you came in with a higher pre-test score, you are likely going to benefit even more from using the code generators than if you had a lower score. So if you have prior experience with learning how to code, and then you go in and add an AI code generator on top of this, it might make it so you can excel, learn faster at a better rate, and get through this faster than if you did not have that prior experience and were using the AI code generator. So this may be an accelerant for those kinds of students.

Now, the authors mention the use-modify-create framework, which, if you're unfamiliar with it, check out episode 26 (that was well over 150 episodes ago), which is titled “Computational thinking for youth in practice.” So if we think of use-modify-create with modding video games, so making a video game do something different: the “use” would be just playing the video game; the “modify” would be actually modding the game, where you are changing the code that exists; and “creating” would be like, “Hey, I have this blank IDE page, I'm going to create a brand-new game from scratch.” So there's a continuum of playing, versus modifying (changing a little bit), versus creating something from nothing, and everything in between.

The authors talk about how AI code generators can be a form of this. It could be like a crutch: students are able to use an AI code generator, eventually become able to modify the code that it creates, and then eventually get to a point where they can create code without it. But a question that I have is: if it ends up saving you time, and you get the same if not better understanding, why would you even get to the modify and create stages, when you could just go with the use stage and learn how to create code in collaboration with the AI, rather than modifying what's given (because the code might not work very well) or creating it from scratch on your own? Like, I have a drum kit behind me. I could learn how to whittle my own drumsticks, or I could, you know, just buy some from the store and then focus on making music with them. It's the same thing with programming: instead of focusing on whittling away by writing out lines of code, maybe instead I want to be able to very quickly create this thing, and then use that program to do something, like to play a game or whatever.

So here's an interesting quote from page 13. Quote: “The benefit of AI code generators for novice learners could be explained by the effective employment of the use-modify-create pedagogical strategy often used in introductory programming. Although learners in the baseline group used the documentation more frequently, they had to start each task by creating a new program from scratch before getting to the modify portion of the activity, and thus encountered more errors. However, learners from the Codex group had the advantage of using code that was generated specifically for an intended behavior that they wrote. This meant that the AI coding assistant turned the create task into a use-modify task, as they were provided with functional pieces of code in response to the written description, which they could then modify. Therefore, they were able to spend more time to trace, test, and learn about the structure of the generated code before moving on to modifying it in the next task.” End quote. That's a great point. So if you're able to spend more time creating and thinking, rather than actually writing out lines of code, is that a win? Especially if it doesn't lead to learning loss, but instead potential learning gain, compared to not doing that approach.

So on page 14 the authors talk about, okay, well, what are some of the potential implications for design? They talk about supporting complete beginners, where AI assistance could be used to help them out. They talk about controlling for over-utilization of these tools, so if you are over-reliant upon an AI code generator, they provide some suggestions in there. And they also talk about how you can use this for creating some writing prompts with the classes that you work with. So if students are unfamiliar with how to prompt an AI to create something for them, you could generate some prompts for students to work with, or create a list of prompts that students can use to start their line of thinking about the things they want to create in collaboration with generative AI. If you want to learn more about those ideas, because I just kind of teased them, make sure you check out page 14.
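The use-modify-create progression described in that quote can be pictured with a small hypothetical example (my own illustration, not the paper's): the first function stands in for code an AI generator might return in response to a written description, which the learner uses and traces; the second shows the kind of small edit a follow-up modification task might ask for:

```python
# "Use": code as an AI generator might return it for the prompt
# "return the squares of the numbers 1 through limit".
def squares(limit):
    result = []
    for n in range(1, limit + 1):
        result.append(n ** 2)
    return result

# "Modify": the learner edits the generated loop for a follow-up task,
# e.g. "now only keep the squares of the even numbers".
def even_squares(limit):
    result = []
    for n in range(1, limit + 1):
        if n % 2 == 0:  # the one-line change the learner makes
            result.append(n ** 2)
    return result

print(squares(5))       # [1, 4, 9, 16, 25]
print(even_squares(5))  # [4, 16]
```

The point the authors make is that starting from working generated code, as in the first function, lets the learner spend their time tracing and modifying rather than fighting syntax errors on a blank page.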

Now, at the end of these unpacking scholarship episodes I like to share some lingering questions and thoughts. One of them is: how do the findings for this study compare with studies on students using Stack Overflow? As I mentioned last week, Stack Overflow is like a repository of questions that are related to programming: people will say, “Hey, I can't figure out how to get my program to do blah blah blah,” and people respond with code, or with a description of how to solve that problem. Professional programmers often say that they frequently go to places like Stack Overflow, and they will often copy and paste and then modify slightly to make it so that the code works in their particular program. That's something that I've done on different projects that I've worked on, etc. But I imagine that there are some studies out there on whether or not people learn something when they're using Stack Overflow, so it'd be really interesting to compare those studies, specifically on copying and pasting from Stack Overflow, with studies like this one that are actually looking at generative AI and the kind of collaboration that could go on in a classroom.

Last week's episode shared some more questions and lingering thoughts that I had about using generative AI and plagiarism when it came to coding. So if you want to hear more about that, make sure you check out last week's episode, which was episode 187, and again, it was titled “Will ChatGPT get you caught? Rethinking of plagiarism detection.” If you enjoyed this particular discussion on generative AI, there are many more podcast episodes, including some April Fools ones if you're interested in those. And if you check out the show notes at jaredoleary.com, there are multiple episodes that are specifically related to AI, like episode 13, “AI4ALL, curriculum development, and gender discourse” with Sarah Judd; episode 142, “Teaching AI in elementary school” with Charlotte Dungan; episode 173, “Empathetic listening in computer science” with Josh Sheldon; episode 176, “The end of programming”; and last week's episode that I already mentioned.

If you enjoyed this episode, consider sharing it with somebody else, leaving a review, or just simply pressing the like button or putting a comment on the YouTube video that you may be listening to this on. Stay tuned next week for another episode. Until then, I hope you're all staying safe and are having a wonderful week.

Article

Kazemitabaar, M., Chow, J., Ka To Ma, C., Ericson, B. J., Weintrop, D., & Grossman, T. (2023). Studying the effect of AI code generators on supporting novice learners in introductory programming. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems - CHI '23, 1-23.


Abstract

“AI code generators like OpenAI Codex have the potential to assist novice programmers by generating code from natural language descriptions, however, over-reliance might negatively impact learning and retention. To explore the implications that AI code generators have on introductory programming, we conducted a controlled experiment with 69 novices (ages 10-17). Learners worked on 45 Python code-authoring tasks, for which half of the learners had access to Codex, each followed by a code-modification task. Our results show that using Codex significantly increased code-authoring performance (1.15x increased completion rate and 1.8x higher scores) while not decreasing performance on manual code-modification tasks. Additionally, learners with access to Codex during the training phase performed slightly better on the evaluation post-tests conducted one week later, although this difference did not reach statistical significance. Of interest, learners with higher Scratch pre-test scores performed significantly better on retention post-tests, if they had prior access to Codex.”


Author Keywords

Large Language Models, Generative Models, AI Coding Assistants, AI-Assisted Pair-Programming, OpenAI Codex, Introductory Programming, K-12 Computer Science Education, GPT-3


My One Sentence Summary

This study found that students who had access to AI code generators while learning how to code outperformed students who did not have access, even when engaging in manual coding exercises.


Some Of My Lingering Questions/Thoughts

  • How do the findings for this study compare with studies on students using Stack Overflow?


Resources/Links Relevant to This Episode

  • Other podcast episodes that were mentioned or are relevant to this episode

    • AI4ALL, Curriculum Development, and Gender Discourse with Sarah Judd

      • In this interview with Sarah Judd, we discuss what Sarah learned both in the classroom and as a CS curriculum writer, the curriculum Sarah continues to develop for AI4ALL, advice and philosophies that can guide facilitating a class and designing curriculum, some of our concerns with discourse on gender in CS, my recommended approach to sustainable professional development, and much more.

    • Empathetic Listening in Computer Science with Josh Sheldon

      • In this interview with Josh Sheldon, we discuss computational action, designing exploratory professional development experiences, learning how to listen to and empathize with students, applying SEL with teachers, the future of teaching and learning, the problems with external influences on CS education, and so much more.

    • Teaching AI in Elementary School with Charlotte Dungan

      • In this interview with Charlotte Dungan, we discuss Charlotte’s holistic approach to education, remotely teaching CS to rural communities, why Charlotte believes teaching is harder than working in industry, teaching AI in elementary school, the influence of money on research and practice, the future of work, and much more.

    • The End of Programming

      • In this episode I unpack Welsh’s (2023) publication titled “The end of programming,” which asks when generative AI will replace the need for knowing how to program.

    • Will ChatGPT Get You Caught? Rethinking of Plagiarism Detection

      • In this episode I unpack Khalil & Er’s (2023) publication titled “Will ChatGPT get you caught? Rethinking of plagiarism detection,” which explores how likely it is for plagiarism software to detect whether an essay was written by generative AI.

    • More episodes related to AI

    • All other episodes


