Bias mitigation benchmark that includes two datasets - Open Research Data - MOST Wiedzy


Description

ISIC-2020 is the largest skin lesion dataset divided into two classes: benign and malignant. It contains 33,126 dermoscopic images from over 2,000 patients. The diagnoses were confirmed by histopathology, expert agreement, or longitudinal follow-up. The dataset was gathered by the International Skin Imaging Collaboration (ISIC) from several medical facilities and was used in the SIIM-ISIC Melanoma Classification Challenge. In the images, the lesion is usually centred and clearly visible. Artifacts in this dataset that may introduce bias into a model include hair, frames, rulers, pen marks, and gel drops. Past research showed that frames are correlated with the malignant class and ruler marks with the benign class \cite{mikolajczyk_biasing_2022}.

melanoma external malignant 256 (kaggle.com)

The Gender Classification Dataset consists of cropped images of male and female faces. The data were collected from various Internet sources, with most images extracted from the IMDB dataset. It contains 58,658 images, split roughly evenly between the female and male subsets. The authors of this article found that glasses are a possible source of bias, as actors wore them more often than actresses.

Gender Classification Dataset (kaggle.com)

Additionally, the presented dataset includes masks that can be used for targeted data augmentation, following the method presented in the paper https://doi.org/10.48550/arXiv.2308.11386.

The research on bias reported in this publication was supported by the Polish National Science Centre (Grant Preludium No: UMO-2019/35/N/ST6/04052). The authors wish to express their thanks for the support.

Research data file

TDA-datasets.zip
1.3 GB, S3 ETag 697d36b88a6cfc2bd0e59dbb2170790a-3, downloads: 41
The file hash is calculated with the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where each file part is 512 MB in size

Example script for computing it:
https://github.com/antespi/s3md5
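
The hash formula above can also be sketched in a few lines of Python. This is a minimal illustration and not the repository's own tool. It assumes that "512 MB" means 512 * 1024 * 1024 bytes and that the per-part MD5 values are concatenated as raw bytes before the final MD5 is taken; the function name s3_etag and its parameters are hypothetical.

```python
import hashlib

# Assumption: "512 MB" is read as 512 * 1024 * 1024 bytes.
PART_SIZE = 512 * 1024 * 1024


def s3_etag(path, part_size=PART_SIZE, block_size=1024 * 1024):
    """Return hexmd5(md5(part1) + md5(part2) + ...)-{parts_count} for a file."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            part = hashlib.md5()
            remaining = part_size
            # Read one part in small blocks to keep memory usage low.
            while remaining > 0:
                block = f.read(min(block_size, remaining))
                if not block:
                    break
                part.update(block)
                remaining -= len(block)
            if remaining == part_size:
                break  # nothing was read, so the file is exhausted
            part_digests.append(part.digest())
    # MD5 over the concatenated raw part digests, then the part count suffix.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Calling s3_etag("TDA-datasets.zip") on the downloaded file and comparing the result with the S3 ETag listed above is one way to check the download, provided the assumptions about part size and digest concatenation hold.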

File details

License:
Creative Commons: 0 1.0
CC 0
Public Domain Dedication
File embargo:
2024-04-01

Details

Year of publication:
2024
Approval date:
2024-08-06
Creation date:
2022
Research data language:
English
Disciplines:
  • automation, electronics, electrical engineering and space technologies (field of engineering and technology)
  • information and communication technology (field of engineering and technology)
DOI:
10.34808/d7pe-r837
Verified by:
Politechnika Gdańska
