src.dataset_generator.DatasetGenerator

class documentation

class DatasetGenerator:

Constructor: DatasetGenerator(root)

Generate dataset for training.

Static Method	`final_filter_translated_data`	Filter the translated data based on checktype.
Static Method	`remove_quotes`	Remove quotes in text.
Method	`__init__`	Initialize DatasetGenerator.
Method	`filter_valid_data`	Validate the data of hospitals. Keep the valid data and remove the invalid data.
Method	`generate_filepath`	Generate file path for each image.
Method	`generate_merged_csv`	Generate merged csv files for each hospital.
Method	`make_align_dataset`	Convert the translated info to an aligned dataset for training the MiniGPT-4 model.
Method	`match_paired_data`	Pair check data with image data.
Method	`merge_all_hospital`	Merge all data with checktype into a single json file.
Method	`merge_excels`	Merge excel files into one dataframe.
Method	`refine_caption`	Refine the caption.
Method	`reorganize_data_structure`	Reorganize the data structure.
Instance Variable	`EXCEL_MAX_ROWS`	upper limit of excel rows
Instance Variable	`MIN_CAPTION_LENGTH`	minimum length of caption
Instance Variable	`NON_VALID_PRINT_ID`	non valid print id
Instance Variable	`root`	dataset root path.
Instance Variable	`save_dir`	root path of processed dataset.
Method	`_defint_constants`	Define constant variables.
Method	`_filter_metadata`	Check if the metadata is valid.
Method	`_get_imgs_info_list`	Given a matched dataframe, get a list of images info for each patient.
Method	`_get_valid_images`	Get valid images for a patient.
Method	`_refine_metadata_keys`	Delete unnecessary keys and rename keys to English.

@staticmethod
def final_filter_translated_data(checktype: str, data: dict) -> dict:

Filter the translated data based on checktype.

Parameters
checktype:`str`	checktype of the data. One of "Laryngoscope", "Rhinoscope", "Otoscope".
data:`dict`	translated data.
Returns
`dict`	filtered data.

@staticmethod
def remove_quotes(text: str) -> str:

Remove quotes in text.

Both full-width and half-width quotes are removed.

Parameters
text:`str`	text with quotes
Returns
`str`	text without quotes
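
The quote stripping described above could be sketched as follows. This is only an illustration: the function name `remove_quotes_sketch` and the `QUOTES` character set are assumptions, and the real method may remove a different set of characters.

```python
# Hypothetical quote set: ASCII quotes, curly quotes, and fullwidth quote forms.
# The characters actually removed by remove_quotes are not listed in this page.
QUOTES = "\"'" + "\u201c\u201d\u2018\u2019" + "\uff02\uff07"


def remove_quotes_sketch(text: str) -> str:
    """Drop every character in QUOTES from text, leaving everything else untouched."""
    # str.translate deletes any character whose code point maps to None.
    return text.translate({ord(ch): None for ch in QUOTES})
```

Using `str.translate` keeps the removal to a single pass over the string, whichever quote characters end up in the set.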

def __init__(self, root: Path):

Initialize DatasetGenerator.

Parameters
root:`Path`	dataset root path.

def filter_valid_data(self):

Validate the data of hospitals. Keep the valid data and remove the invalid data.

Check whether the check date in the metadata matches the date in the folder's name.

def generate_filepath(self):

Generate file path for each image.

def generate_merged_csv(self):

Generate merged csv files for each hospital.

def make_align_dataset(self):

Convert the translated info to an aligned dataset for training the MiniGPT-4 model.

See: https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_2_STAGE.md
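
As a rough illustration of the target layout, the sketch below writes captions into a `filter_cap.json` file with an `annotations` list next to an `image` folder, which is the structure the linked README is understood to describe. The helper name `write_align_annotations` and the `image_id` / `caption` keys on the input samples are assumptions, not taken from this class.

```python
import json
from pathlib import Path


def write_align_annotations(samples: list[dict], out_dir: Path) -> Path:
    """Write captions to out_dir/filter_cap.json in an annotations-list layout.

    Each sample is assumed to carry "image_id" and "caption" keys (hypothetical).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Images are expected to live next to the JSON file as image/<image_id>.jpg.
    (out_dir / "image").mkdir(exist_ok=True)
    annotations = [
        {"image_id": str(s["image_id"]), "caption": s["caption"]} for s in samples
    ]
    out_path = out_dir / "filter_cap.json"
    with out_path.open("w", encoding="utf-8") as f:
        json.dump({"annotations": annotations}, f, ensure_ascii=False, indent=2)
    return out_path
```

Copying the image files themselves is left out of the sketch; only the annotation file and the empty image folder are created.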

def match_paired_data(self, hospital: str):

Pair check data with image data.

The output is saved to `self.save_dir / f"{hospital}-图像检查表.csv"` (图像检查表: "image check table").

Parameters
hospital:`str`	hospital name.

def merge_all_hospital(self):

Merge all data with checktype into a single json file.

def merge_excels(self, sheets: list[str], colnames: list[str], save_path: Path):

Merge excel files into one dataframe.

Parameters
sheets:`List[str]`	sheet names
colnames:`List[str]`	column names
save_path:`Path`	save path of the merged csv file

def refine_caption(self, caption: str) -> tuple[bool, str]:

Refine the caption.

Parameters
caption:`str`	caption of a sample
Returns
`tuple[bool, str]`	whether the caption is valid, and the refined caption
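
The `(is_valid, refined_caption)` return shape can be illustrated with a minimal sketch. The cleaning step (whitespace normalisation) and the threshold value are assumptions made for the example; the real refinement rules and the real value of `MIN_CAPTION_LENGTH` are not documented here.

```python
MIN_CAPTION_LENGTH = 10  # hypothetical value; the real constant is not shown in this page


def refine_caption_sketch(caption: str) -> tuple[bool, str]:
    """Return (is_valid, refined_caption) for a single caption."""
    # Assumed refinement: collapse runs of whitespace and trim both ends.
    refined = " ".join(caption.split())
    # Assumed validity rule: the refined caption must reach the minimum length.
    is_valid = len(refined) >= MIN_CAPTION_LENGTH
    return is_valid, refined
```

Returning the flag together with the cleaned text lets a caller drop invalid samples without refining the caption twice.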

def reorganize_data_structure(self):

Reorganize the data structure.

Save images using the "{hospital}/{check_date}-{pid}/{sequence}.jpg" path template.
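
A minimal sketch of how a destination path could be built from that template with `pathlib`. The helper name `build_image_path` and its parameters are hypothetical; only the template itself comes from the documentation above.

```python
from pathlib import Path


def build_image_path(
    save_dir: Path, hospital: str, check_date: str, pid: str, sequence: int
) -> Path:
    """Build <save_dir>/<hospital>/<check_date>-<pid>/<sequence>.jpg."""
    return save_dir / hospital / f"{check_date}-{pid}" / f"{sequence}.jpg"
```

For example, `build_image_path(Path("processed"), "hospital_a", "20230101", "P001", 3)` yields the path `processed/hospital_a/20230101-P001/3.jpg`.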

EXCEL_MAX_ROWS: int

upper limit of excel rows

MIN_CAPTION_LENGTH: int

minimum length of caption

NON_VALID_PRINT_ID: int

non valid print id

root: Path

dataset root path.

save_dir: Path

root path of processed dataset.

def _defint_constants(self):

Define constant variables.

def _filter_metadata(self, data: dict) -> dict:

Check if the metadata is valid.

Conditions:

Check type is valid.
Caption is meaningful.

The check type field in the raw data appears to be filled in arbitrarily, so the check type is matched from the content of the caption instead.

Parameters
data:`dict`	a patient's metadata
Returns
`dict`	filtered data
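
One way to match a check type from caption content is simple keyword counting, sketched below. The keyword lists and the function name `infer_checktype` are purely hypothetical; the rules the real method applies are not documented here.

```python
from typing import Optional

# Hypothetical keyword lists, one per documented check type.
CHECKTYPE_KEYWORDS = {
    "Laryngoscope": ("vocal cord", "epiglottis", "larynx"),
    "Rhinoscope": ("nasal", "turbinate", "septum"),
    "Otoscope": ("ear canal", "tympanic", "eardrum"),
}


def infer_checktype(caption: str) -> Optional[str]:
    """Return the check type with the most keyword hits, or None if nothing matches."""
    text = caption.lower()
    hits = {
        checktype: sum(keyword in text for keyword in keywords)
        for checktype, keywords in CHECKTYPE_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None
```

A caption that matches no keyword returns `None`, which a caller could treat as "check type is not valid" under the first condition listed above.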

def _get_imgs_info_list(self, matched_df: DataFrame, filepath_dict: dict[str, list]) -> list[dict]:

Given a matched dataframe, get a list of images info for each patient.

Parameters
matched_df:`DataFrame`	a dataframe with a matched patient.
filepath_dict:`Dict[str, list]`	dict of all file paths of a hospital. Format: {filename: [filepaths]}
Returns
`List[dict]`	a list of matched images info.
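
To make the `{filename: [filepaths]}` format concrete, the sketch below groups a list of paths by file name. The helper `build_filepath_dict` is hypothetical and is not a method of this class; it only shows the shape `filepath_dict` is documented to have.

```python
from collections import defaultdict
from pathlib import Path


def build_filepath_dict(paths: list[Path]) -> dict[str, list[str]]:
    """Group file paths by bare file name, so one name can map to several locations."""
    filepath_dict = defaultdict(list)
    for path in paths:
        filepath_dict[path.name].append(path.as_posix())
    return dict(filepath_dict)
```

Because the same file name may occur in several folders, each key maps to a list rather than a single path.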

def _get_valid_images(self, images: list[dict], meta: dict) -> list[dict]:

Get valid images for a patient.

Check whether the check date in the metadata matches the date in the folder's name.

Parameters
images:`List[dict]`	list of images
meta:`dict`	meta information of the patient
Returns
`List[dict]`	list of valid images
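
The date check could look roughly like the sketch below. It assumes each image dict has a `"path"` key, that `meta` has a `"check_date"` key, and that the parent folder is named `<check_date>-<pid>` with a date that contains no hyphen; all three are assumptions for illustration only.

```python
from pathlib import Path


def get_valid_images_sketch(images: list[dict], meta: dict) -> list[dict]:
    """Keep only images whose folder date equals the check date in the metadata."""
    valid = []
    for image in images:
        # Assumed folder naming: "<check_date>-<pid>", e.g. "20230101-P001".
        folder_name = Path(image["path"]).parent.name
        folder_date = folder_name.split("-", 1)[0]
        if folder_date == meta["check_date"]:
            valid.append(image)
    return valid
```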

def _refine_metadata_keys(self, data: dict) -> dict:

Delete unnecessary keys and rename keys to English.

Parameters
data:`dict`	meta information of a patient
Returns
`dict`	refined meta information
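
Dropping and renaming keys can be done with a single mapping, as in the sketch below. The source keys and their English names in `KEY_MAP` are hypothetical examples; the actual keys handled by the method are not listed in this page.

```python
# Hypothetical mapping from original (Chinese) keys to English keys.
KEY_MAP = {
    "检查日期": "check_date",  # examination date
    "检查类型": "checktype",   # examination type
    "诊断描述": "caption",     # diagnostic description
}


def refine_metadata_keys_sketch(data: dict) -> dict:
    """Keep only the keys listed in KEY_MAP and rename them to English."""
    return {KEY_MAP[key]: value for key, value in data.items() if key in KEY_MAP}
```

Any key absent from the mapping is treated as unnecessary and dropped, so the mapping doubles as an allow-list.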