Academic webpage classification. A data mining and machine learning task

lfpelison/ufsc-machinelearning-DAS410058

Webpage classification

UFSC - DAS410058 - Jomi Fred Hubner

A data mining and machine learning task (http://jomi.das.ufsc.br/ia/2017/tp-dm.pdf). The task is based on HTML data, available locally in ./webkb/ or online at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

Each page has to be classified into one of 7 target classes (number of pages in parentheses):

  • student (1641)
  • faculty (1124)
  • staff (137)
  • department (182)
  • course (930)
  • project (504)
  • other (3764)
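
The class counts above are heavily imbalanced, which matters when judging a classifier's accuracy. As a rough illustration (not taken from the notebook), the sketch below computes each class's share of the data and the accuracy of a trivial classifier that always predicts the most frequent class. The variable names are hypothetical.

```python
# Class counts copied from the list above.
class_counts = {
    "student": 1641,
    "faculty": 1124,
    "staff": 137,
    "department": 182,
    "course": 930,
    "project": 504,
    "other": 3764,
}

total = sum(class_counts.values())

# Fraction of all pages that belongs to each class.
shares = {label: count / total for label, count in class_counts.items()}

# A classifier that always answers with the largest class is right
# exactly as often as that class occurs.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = shares[majority_label]

for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print("{:<12s}{:6.1%}".format(label, share))
```

Any model trained on this data should be compared against `majority_baseline` rather than against zero.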

The data is also divided by university:

  • Cornell (867)
  • Texas (827)
  • Washington (1205)
  • Wisconsin (1263)
  • Misc, pages from other universities (4120)

--

To solve this problem, we wrote a Jupyter notebook, "final_project.ipynb", in Python 2.7. A rendered version is available at ./final_project.html.

PS.1 To edit the notebook, please use Python 2.7 and install Jupyter: pip install jupyter

PS.2 To collect all the data into a single CSV file (./corpus.csv), we wrote a Python script (./script.py), which is also included in the Jupyter notebook.
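
For readers who want a feel for what such a script does, here is a minimal sketch. It is not the contents of ./script.py. It assumes the WebKB layout of one directory per class, each containing one subdirectory per university, and the function name `build_corpus` and the CSV column names are hypothetical. It is written for Python 3 for convenience, whereas the notebook itself targets Python 2.7.

```python
import csv
import os


def build_corpus(root, out_csv):
    """Walk root/<class>/<university>/<page> and write one CSV row per page.

    Returns the number of pages written. Layout and column names are
    assumptions made for this sketch.
    """
    n_rows = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["label", "university", "filename", "html"])
        for label in sorted(os.listdir(root)):
            label_dir = os.path.join(root, label)
            if not os.path.isdir(label_dir):
                continue
            for university in sorted(os.listdir(label_dir)):
                uni_dir = os.path.join(label_dir, university)
                if not os.path.isdir(uni_dir):
                    continue
                for name in sorted(os.listdir(uni_dir)):
                    path = os.path.join(uni_dir, name)
                    if not os.path.isfile(path):
                        continue
                    # latin-1 maps every byte to a character, so decoding
                    # never fails on pages with unknown encodings.
                    with open(path, encoding="latin-1") as page:
                        html = page.read()
                    writer.writerow([label, university, name, html])
                    n_rows += 1
    return n_rows
```

Under these assumptions, calling `build_corpus("./webkb", "./corpus.csv")` would produce one row per page, with the class label taken from the first directory level and the university from the second.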

This project was developed for a course at Universidade Federal de Santa Catarina: http://jomi.das.ufsc.br/ia/


Authors: Luis Felipe Pelison, Alex Amadeu Cani and Iago Oliveira