IBM compiles dataset to teach software how software is made: 14m code samples, half of which actually work


Big Blue hopes to create the ImageNet of training resources for AI-powered programming tools
Think IBM has assembled a massive silo of source code for teaching machine-learning programs about programming.
Dubbed Project CodeNet, the set contains, we're told, 14 million code samples totaling 500 million lines in more than 55 programming languages, from Java, C, and Go to COBOL, Pascal, and FORTRAN. Truth be told, more than three-quarters of it all is in C++ and Python.
This source code wasn't taken from production or in-development applications; it was collected from entries submitted to two programming contests organized in Japan: Aizu and AtCoder. In these contests, competitors are challenged to write the code needed to turn a given set of inputs into a set of desired outputs. About half of the samples work as expected, and the rest are labeled as wrong solutions, non-building, or buggy.

