New system cleans messy data tables automatically

According to surveys conducted by Anaconda and Figure Eight, data cleaning can take a quarter of a data scientist's time. Automating the task is challenging because different datasets require different types of cleaning, and common-sense judgment calls about objects in the world are often needed (e.g., which of several cities called “Beverly Hills” someone lives in). PClean provides generic common-sense models for these kinds of judgment calls that can be customized to specific databases and types of errors.
PClean uses a knowledge-based approach to automate the data cleaning process: Users encode background knowledge about the database and about what sorts of issues might appear. Take, for instance, the problem of cleaning state names in a database of apartment listings. What if someone said they lived in Beverly Hills but left the state column empty? Though there is a well-known Beverly Hills in California, there are also places with that name in Florida, Missouri, and Texas … and there’s a neighborhood of Baltimore known as Beverly Hills. How can you know which one the person lives in?
This is where PClean’s expressive scripting language comes in. Users can give PClean background knowledge about the domain and about how data might be corrupted. PClean combines this knowledge via common-sense probabilistic reasoning to come up with the answer. For example, given additional knowledge about typical rents, PClean infers that the correct Beverly Hills is the one in California, because the rent the respondent pays is high.
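
The rent-based reasoning in the Beverly Hills example can be illustrated with a few lines of Bayes' rule. The sketch below is written in plain Python, not in PClean's scripting language, and it is not PClean's actual model: the state list, the prior weights, the rent figures, and the names `PRIOR`, `RENT_MODEL`, and `infer_state` are all hypothetical values chosen only to show how a prior over candidate states and a model of typical rents combine into a posterior.

```python
import math

# Hypothetical prior: how plausible each state is for a listing that
# says "Beverly Hills" with an empty state column. Made-up weights.
PRIOR = {"CA": 0.6, "FL": 0.1, "MO": 0.1, "TX": 0.1, "MD": 0.1}

# Hypothetical typical monthly rent per state as (mean, std) in dollars.
# These numbers are illustrative assumptions, not real statistics.
RENT_MODEL = {
    "CA": (5000.0, 1500.0),
    "FL": (1500.0, 500.0),
    "MO": (900.0, 300.0),
    "TX": (1100.0, 350.0),
    "MD": (1300.0, 400.0),
}


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def infer_state(rent):
    """Posterior over states given the observed rent (Bayes' rule)."""
    unnormalized = {
        state: PRIOR[state] * normal_pdf(rent, *RENT_MODEL[state])
        for state in PRIOR
    }
    total = sum(unnormalized.values())
    return {state: w / total for state, w in unnormalized.items()}


posterior = infer_state(4800.0)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Under these assumed numbers, a high observed rent makes California by far the most probable state, while a low rent would shift the posterior toward the other candidates. The same pattern, a prior over clean values plus a model of what is observed, is the kind of knowledge the article describes users encoding.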
