class documentation

class DatasetGenerator:

Constructor: DatasetGenerator(root)

Generate dataset for training.

Static Method final_filter_translated_data Filter the translated data based on checktype.
Static Method remove_quotes Remove quotes in text.
Method __init__ Initialize DatasetGenerator.
Method filter_valid_data Validate the data of hospitals. Keep the valid data and remove the invalid data.
Method generate_filepath Generate file path for each image.
Method generate_merged_csv Generate merged csv files for each hospital.
Method make_align_dataset Convert the translated info to an aligned dataset for training the MiniGPT-4 model.
Method match_paired_data Pair check data with image data.
Method merge_all_hospital Merge all data with checktype into a single json file.
Method merge_excels Merge excel files into one dataframe.
Method refine_caption Refine the caption.
Method reorganize_data_structure Reorganize the data structure.
Instance Variable EXCEL_MAX_ROWS upper limit of excel rows
Instance Variable MIN_CAPTION_LENGTH minimum length of caption
Instance Variable NON_VALID_PRINT_ID non valid print id
Instance Variable root dataset root path.
Instance Variable save_dir root path of processed dataset.
Method _defint_constants Define constant variables.
Method _filter_metadata Check if the metadata is valid.
Method _get_imgs_info_list Given a matched dataframe, get a list of images info for each patient.
Method _get_valid_images Get valid images for a patient.
Method _refine_metadata_keys Delete unnecessary keys and rename keys to English.

@staticmethod
def final_filter_translated_data(checktype: str, data: dict) -> dict:

Filter the translated data based on checktype.

Parameters
checktype (str): checktype of the data. One of "Laryngoscope", "Rhinoscope", "Otoscope".
data (dict): translated data.
Returns
dict: filtered data.

@staticmethod
def remove_quotes(text: str) -> str:

Remove quotes in text.

Both full-width and half-width quotes are removed.

Parameters
text (str): text with quotes
Returns
str: text without quotes

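
A minimal sketch of what such a helper might look like. The exact set of characters removed by remove_quotes is not listed here, so the QUOTE_CHARS set below is an assumption, and the name strip_quotes_sketch is hypothetical.

```python
# Assumed character set (not taken from the original source):
# ASCII double/single quotes, curly quotes, and full-width quotes.
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019\uff02\uff07"

# Translation table that maps every quote character to "deleted".
_QUOTE_TABLE = str.maketrans("", "", QUOTE_CHARS)


def strip_quotes_sketch(text: str) -> str:
    """Return text with every character in QUOTE_CHARS removed."""
    return text.translate(_QUOTE_TABLE)
```

Using a translation table keeps the removal to a single pass over the string, whatever the number of quote characters in the set.
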
def __init__(self, root: Path):

Initialize DatasetGenerator.

Parameters
root (Path): dataset root path.

def filter_valid_data(self):

Validate the data of hospitals. Keep the valid data and remove the invalid data.

Check if the check date in the metadata is the same as the date in the folder's name.

def generate_filepath(self):

Generate file path for each image.

def generate_merged_csv(self):

Generate merged csv files for each hospital.

def make_align_dataset(self):

Convert the translated info to an aligned dataset for training the MiniGPT-4 model.

See: https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_2_STAGE.md
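
A minimal sketch of writing the annotation file, assuming the layout that the linked README describes for the stage-2 alignment data: an image folder next to a filter_cap.json file whose annotations list holds image_id and caption entries. The function name write_align_annotations and the shape of samples are hypothetical, not part of this class.

```python
import json
from pathlib import Path


def write_align_annotations(samples: list[dict], out_dir: Path) -> Path:
    """Write a filter_cap.json file under out_dir and return its path.

    Each sample is assumed to be a dict with "image_id" and "caption" keys.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed layout: images live in an "image" folder beside the JSON file.
    (out_dir / "image").mkdir(exist_ok=True)

    annotations = [
        {"image_id": str(sample["image_id"]), "caption": sample["caption"]}
        for sample in samples
    ]
    out_path = out_dir / "filter_cap.json"
    out_path.write_text(
        json.dumps({"annotations": annotations}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out_path
```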

def match_paired_data(self, hospital: str):

Pair check data with image data.

The output is saved to self.save_dir / f"{hospital}-图像检查表.csv" (图像检查表 means "image examination table").

Parameters
hospital (str): hospital name.

def merge_all_hospital(self):

Merge all data with checktype into a single json file.

def merge_excels(self, sheets: list[str], colnames: list[str], save_path: Path):

Merge Excel files into one dataframe.

Parameters
sheets (list[str]): sheet names
colnames (list[str]): column names
save_path (Path): save path of the merged csv file

def refine_caption(self, caption: str) -> tuple[bool, str]:

Refine the caption.

Parameters
caption (str): caption of a sample
Returns
tuple[bool, str]: whether the caption is valid, and the refined caption

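
The (valid, refined) return shape can be illustrated with a small sketch. The refinement rule (collapsing whitespace), the length check, and the threshold value are all assumptions; this documentation does not state what refine_caption does internally or what MIN_CAPTION_LENGTH is set to.

```python
# Hypothetical threshold; the real MIN_CAPTION_LENGTH value is not documented here.
MIN_CAPTION_LENGTH = 10


def refine_caption_sketch(caption: str, min_length: int = MIN_CAPTION_LENGTH) -> tuple[bool, str]:
    """Collapse runs of whitespace, then flag captions shorter than min_length."""
    refined = " ".join(caption.split())
    return len(refined) >= min_length, refined
```

Returning the flag together with the cleaned text lets the caller drop invalid samples without refining the caption a second time.
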
def reorganize_data_structure(self):

Reorganize the data structure.

Save images using "{hospital}/{check_date}-{pid}/{sequence}.jpg" template.
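
The template above can be expressed with pathlib as in the following sketch. The helper name target_image_path and its parameters are hypothetical; only the path template itself comes from this documentation.

```python
from pathlib import Path


def target_image_path(save_dir: Path, hospital: str, check_date: str, pid: str, sequence: int) -> Path:
    """Build "{hospital}/{check_date}-{pid}/{sequence}.jpg" under save_dir."""
    return save_dir / hospital / f"{check_date}-{pid}" / f"{sequence}.jpg"
```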

EXCEL_MAX_ROWS: int

upper limit of excel rows

MIN_CAPTION_LENGTH: int

minimum length of caption

NON_VALID_PRINT_ID: int

non valid print id

root: Path

dataset root path.

save_dir: Path

root path of processed dataset.

def _defint_constants(self):

Define constant variables.

def _filter_metadata(self, data: dict) -> dict:

Check if the metadata is valid.

Conditions:
  • Check type is valid.
  • Caption is meaningful.

The check type field in the source data appears unreliable (it seems to be filled in arbitrarily), so the check type is inferred from the content of the caption instead.

Parameters
data (dict): a patient's metadata
Returns
dict: filtered data

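
One way to infer a check type from caption content is simple keyword matching, sketched below. The keyword lists are invented for illustration (the real captions and matching rules are not documented here), and infer_checktype is a hypothetical name; only the three check type names come from this page.

```python
from typing import Optional

# Hypothetical keywords per check type; the actual rules may differ.
CHECKTYPE_KEYWORDS = {
    "Laryngoscope": ("vocal", "glottis", "epiglottis"),
    "Rhinoscope": ("nasal", "turbinate", "septum"),
    "Otoscope": ("tympanic", "ear canal"),
}


def infer_checktype(caption: str) -> Optional[str]:
    """Return the first check type whose keyword occurs in the caption, else None."""
    text = caption.lower()
    for checktype, keywords in CHECKTYPE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return checktype
    return None
```

A caption that matches no keyword yields None, which a caller could treat as failing the "check type is valid" condition.
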
def _get_imgs_info_list(self, matched_df: DataFrame, filepath_dict: dict[str, list]) -> list[dict]:

Given a matched dataframe, get a list of images info for each patient.

Parameters
matched_df (DataFrame): a dataframe with a matched patient.
filepath_dict (dict[str, list]): dict for all file paths of a hospital. Format: {filename: [filepaths]}
Returns
list[dict]: a list of matched images info.

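
The {filename: [filepaths]} format of filepath_dict can be illustrated with a short sketch. The helper build_filepath_dict is hypothetical and is not claimed to be how this class builds the mapping; it only shows the shape, where one filename may map to several paths.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_filepath_dict(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by their final component (the filename)."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        grouped[PurePosixPath(path).name].append(path)
    return dict(grouped)
```
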
def _get_valid_images(self, images: list[dict], meta: dict) -> list[dict]:

Get valid images for a patient.

Check if the check date in the metadata is the same as the date in the folder's name.

Parameters
images (list[dict]): list of images
meta (dict): meta information of the patient
Returns
list[dict]: list of valid images

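
A minimal sketch of the date check, under several assumptions that this documentation does not confirm: each image dict has a "filepath" key, the metadata has a "check_date" key, and the image's parent folder name starts with the date followed by a hyphen. The function name is hypothetical as well.

```python
from pathlib import PurePosixPath


def keep_images_matching_date(images: list[dict], meta: dict) -> list[dict]:
    """Keep images whose parent folder name starts with the metadata check date."""
    check_date = meta["check_date"]  # assumed key
    valid = []
    for image in images:
        folder_name = PurePosixPath(image["filepath"]).parent.name  # assumed key
        # Assumed folder naming: "<date>-<anything>".
        if folder_name.split("-")[0] == check_date:
            valid.append(image)
    return valid
```
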
def _refine_metadata_keys(self, data: dict) -> dict:

Delete unnecessary keys and rename keys to English.

Parameters
data (dict): meta information of a patient
Returns
dict: refined meta information
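
Dropping and renaming keys can be done in one pass with a mapping, as in this sketch. The Chinese source keys and their English names below are invented for illustration; the actual keys handled by _refine_metadata_keys are not listed in this documentation.

```python
# Hypothetical mapping from source keys to English keys.
KEY_MAP = {
    "检查日期": "check_date",
    "检查类型": "check_type",
    "检查所见": "caption",
}


def refine_keys_sketch(data: dict) -> dict:
    """Keep only keys present in KEY_MAP and rename them to English."""
    return {KEY_MAP[key]: value for key, value in data.items() if key in KEY_MAP}
```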