Until now, multilabel datasets have been provided in different file formats for different pieces of software. mldr was created with compatibility in mind and could read two widely known formats: datasets from the Mulan and MEKA repositories in ARFF format.
With the creation of the Ultimate Multilabel Dataset Repository (RUMDR) and a new R package, mldr.datasets, a large set of multilabel datasets is now available in a common format, with the possibility of converting them into many others.
Note: mldr.datasets does not depend on mldr, but it’s useful to have both of them installed to access all functionality.
After installing and loading the package, some pre-loaded datasets will be available directly in the environment.
These are accessible via their names and the usual members of "mldr" objects ($datasets…). Additionally, a toBibtex() method is provided for fast access to the citation information for each dataset.
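As a minimal sketch of that access pattern, assuming mldr.datasets is installed and that emotions is among its pre-loaded datasets, the following inspects one dataset and prints its citation. The members used here ($measures, $labels) are the usual members of "mldr" objects; which datasets are bundled may vary between package versions.

```r
library(mldr.datasets)

# emotions is assumed to be one of the datasets shipped with the package
class(emotions)                # pre-loaded datasets are "mldr" objects

# Usual members of an "mldr" object
emotions$measures$num.labels   # number of labels in the dataset
head(emotions$labels)          # per-label statistics

# Citation information for the dataset
toBibtex(emotions)
```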
Larger datasets are available to download from the repository; to see what is available, you can consult the complete list of datasets or call mldrs().
The stratified.kfolds() function and its random counterpart partition multilabel datasets following a stratified strategy and a random one, respectively.
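A short sketch of stratified partitioning follows. It assumes that stratified.kfolds() accepts the number of folds as k and that each fold exposes train and test members holding "mldr" objects, which is how we recall the package documentation; check ?stratified.kfolds for the exact interface of your installed version.

```r
library(mldr.datasets)

# Split the pre-loaded emotions dataset into 5 stratified folds
# (k = 5 is an illustrative choice)
folds <- stratified.kfolds(emotions, k = 5)

# Each fold is assumed to hold a training and a test partition
train1 <- folds[[1]]$train
test1  <- folds[[1]]$test

# Compare partition sizes for the first fold
c(train = train1$measures$num.instances,
  test  = test1$measures$num.instances)
```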
The write.mldr() function is able to export "mldr.folds" objects into several file formats: Mulan, MEKA, KEEL, LibSVM and CSV. This way, regular, partitioned and preprocessed datasets can be saved for later use in any well-known multilabel classification tool.
We’ve updated mldr to integrate functionality from mldr.datasets when it’s installed. Calling mldr() with just a dataset name now triggers a search among the datasets in the repository. If the dataset isn’t found, the function will attempt to read it locally (this behavior can also be forced).
Other changes in this update include exposing the read.arff() function, which reads ARFF files and differentiates input and output features without calculating any related measure, as suggested in issue 26; several fixes related to dataset reading and the calculation of measures; and slight changes in some calculations. For detailed information, visit our changelog or the commit history.
We have released a new version of mldr, 0.2.51, already live on CRAN. It fixes a recently found bug and adds functionality to the plotting function.
Multiple plots in one call
The plot function now accepts a vector of plot types as its type parameter. This generates multiple plots, pausing between them if needed so that each one is displayed separately. The following is an example of this functionality:
plot(emotions, type = c("LB", "LSB", "LH", "LSH"))
Until now, colors in plots were fixed and couldn’t be changed by the user. This update adds two parameters, one of which is color.function. The first can be used on all plot types except the label concurrence plot, and must be a vector of colors. The second, color.function, is only used on the label concurrence plot and accepts a coloring function, such as heat.colors or the ones provided by the colorspace package:
layout(matrix(c(1, 2, 3), 1, 3))
plot(emotions, color.function = rainbow)
plot(emotions, color.function = colorspace::rainbow_hcl)
plot(emotions, color.function = colorspace::heat_hcl)
A bug was found when loading sparse datasets with a certain formatting. It has been fixed in this update and shouldn’t be a problem anymore.
A small update has been released on CRAN, including a GUI redesign and improved dataset loading from the GUI.
GUI design changes
Some changes have been made to the web interface of mldr. The new design saves space, highlights content better and is less distracting while remaining visually appealing.
Loading MEKA datasets from GUI
Users are now able to load MEKA-style datasets (in ARFF format without an XML file for labels) from the web interface as well. To do this, just upload the ARFF file and click Load dataset without selecting any XML file.
A new version of mldr was just released and is now live on CRAN. This update adds new functions to assess multilabel classifier performance, allows creating mld objects out of ARFF files with different structures (see example below), and fixes several bugs.
Classification performance measures
mldr now includes the mldr_evaluate function, which analyzes the performance of classifier predictions via several well-known measures (Accuracy, Precision, Recall, F-measure and Hamming Loss, among others). Using it is simple: just call it with the test dataset and the predictions generated by the classifier. The function returns a list with all 20 measures, identified by their names.
res <- mldr_evaluate(emotions, predictions)
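To make concrete what two of these measures compute, here is a base R sketch on toy data. It is independent of mldr’s own implementation; the matrices and variable names (truth, preds, hamming_loss, accuracy) are hypothetical, and the definitions used are the common example-based ones, which may differ in detail from what mldr_evaluate reports.

```r
# Toy ground truth and predictions: 4 instances (rows) x 3 labels (columns)
truth <- matrix(c(1, 0, 1,
                  0, 1, 0,
                  1, 1, 0,
                  0, 0, 1), nrow = 4, byrow = TRUE)
preds <- matrix(c(1, 0, 0,
                  0, 1, 0,
                  1, 0, 0,
                  0, 1, 1), nrow = 4, byrow = TRUE)

# Hamming loss: fraction of instance-label cells predicted wrongly
hamming_loss <- mean(truth != preds)

# Example-based accuracy: per instance, |intersection| / |union| of the
# true and predicted label sets, averaged over instances
inter <- rowSums(truth == 1 & preds == 1)
uni   <- rowSums(truth == 1 | preds == 1)
accuracy <- mean(ifelse(uni == 0, 1, inter / uni))

hamming_loss  # 3 wrong cells out of 12 -> 0.25
accuracy      # (0.5 + 1 + 0.5 + 0.5) / 4 -> 0.625
```

With mldr itself, the predictions argument would play the role of preds here: a matrix with one row per instance and one column per label.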
New parameters to identify labels
Labels in ARFF files can be structured in several ways, so the mldr constructor now accepts three new parameters that ease reading a multilabel dataset, among them label_indices and label_amount. The first one, label_indices, lets the user specify exactly the indices the labels take in the dataset; the second one identifies labels by their names; and the last one, label_amount, takes the number of labels to be read from the last attributes in the ARFF file.
corel5k <- mldr("corel5k", label_amount = 374)
emotions <- mldr("emotions", label_indices = c(73, 74, 75, 76, 77, 78))
New vignette: Working with Multilabel Datasets in R
A vignette has been added to mldr as well. The document explains how the package works and provides examples to ease learning. It can be loaded with the command vignette("mldr") or downloaded from CRAN.