CSV Harvesting Technical Information
CSV Format
CSV files should adhere to the RFC 4180 standard.
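As a minimal sketch of producing a file in this format, Python's standard `csv` module can write the conventions RFC 4180 describes: CRLF line endings, double quotes around fields that contain commas or quotes, and embedded quotes doubled. The function name `to_rfc4180` and the sample row are illustrative assumptions, not part of Mint.

```python
import csv
import io

def to_rfc4180(header, rows):
    """Serialise a header and rows as CSV text using RFC 4180 conventions."""
    buffer = io.StringIO(newline="")  # newline="" keeps the CRLF endings intact
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Illustrative data only: a field containing both a comma and double quotes.
text = to_rfc4180(
    ["id", "Description"],
    [["00007429", 'Studies "loud" noises, echoes']],
)
```

When writing to a real file instead of a string, open it with `newline=""` so that Python does not translate the line endings.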
CSV Filename
The harvester uses the CSV filename to determine the type of record. For example, to harvest party records for people, the filename must be Parties_People.csv. Any other filename will cause the harvester to fail, as the record type will not be recognised. The filename required for each record type is given below in the record details section.
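The filename rule can be checked before a file is handed to the harvester. The sketch below is not Mint's own code: the `RECORD_TYPES` labels and the `record_type_for` helper are hypothetical, exact case-sensitive matching is an assumption, and the filenames are those listed in the record details section.

```python
from pathlib import Path

# Filenames taken from the record details section; the labels are illustrative.
RECORD_TYPES = {
    "Parties_People.csv": "Parties (people)",
    "Parties_Groups.csv": "Parties (groups)",
    "Activities.csv": "Activities",
}

def record_type_for(path):
    """Return the record type for a CSV path, or raise if it is unrecognised."""
    name = Path(path).name
    try:
        return RECORD_TYPES[name]  # assumption: the match is exact and case-sensitive
    except KeyError:
        raise ValueError(f"Unrecognised harvest filename: {name}") from None
```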
Adding and updating records
The Mint harvester uses an identifier column to determine whether a record already exists in storage and requires updating, or is new and requires adding. Because this column is used as an identifier, each record's value must be unique. The field can be a string of any format, though numeric identifiers are often used.
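The add-or-update decision can be pictured with a short sketch. It is not Mint's implementation: `plan_harvest` and `stored_ids` are hypothetical names, and the identifier column is assumed to be the `id` field listed in the CSV schemas below. Identifiers are read as strings so that leading zeros survive.

```python
import csv
import io

def plan_harvest(csv_text, stored_ids):
    """Split CSV rows into additions and updates, rejecting duplicate ids."""
    to_add, to_update, seen = [], [], set()
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        record_id = row["id"]  # kept as a string, so "00007429" stays intact
        if record_id in seen:
            raise ValueError(f"Duplicate id in file: {record_id}")
        seen.add(record_id)
        # An id already in storage means an update; otherwise the record is new.
        (to_update if record_id in stored_ids else to_add).append(row)
    return to_add, to_update
```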
Deleting a record
You cannot delete a record in Mint via the CSV harvest. This rule is implemented for the following reasons:
- In some situations, the data feed from the central system (source of truth) is not reliable and records go "missing". For example, the HR system or research management system may not publish information about staff who have left.
- Mint often creates additional metadata as part of the curation model (see Curating Linked Data) and this would be lost if you deleted the record. For example, the person may have an NLA identifier and links to a ReDBox record.
- There's a hidden benefit: your CSV file need only contain records that require updating or inserting.
As Mint is in charge of publishing records to services (such as the National Library of Australia and Research Data Australia), it isn't appropriate for these records to just disappear. Once they're in Mint they need to be treated with curatorial gloves. It is, however, possible to delete a record via the Mint user interface by logging in as an administrator. This is strongly discouraged, and you must make sure that you aren't breaking any relationships with other objects (in Mint or ReDBox) or feeds to other services.
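Because a harvest never deletes, a feed can be reduced to the rows that are new or have changed since the last export. The following is a sketch of that idea only: `changed_rows` is a hypothetical helper, and both arguments are assumed to be dictionaries mapping an id to that record's field values.

```python
def changed_rows(previous, current):
    """Return rows from `current` that are new or differ from `previous`.

    Both arguments map an id to a dict of field values. Ids that are
    missing from `current` are ignored, because a harvest never deletes.
    """
    return [
        row
        for record_id, row in current.items()
        if previous.get(record_id) != row
    ]
```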
Record types and CSV schemas
Parties
Parties data is used to record researcher information.
Information:
- Filename - Parties_People.csv
- Demonstration - http://demo.redboxresearchdata.com.au/mint/Parties_People/search
- Sample data file - Parties_People.csv
- See also - http://ands.org.au/guides/cpguide/cpgparty-bestpractice.html
CSV Fields
| Mandatory | Field name | Description | Example |
|---|---|---|---|
| * | id | The institution's ID - likely to be a staff, student or researcher ID | 00007429 |
| * | Given_Name | The first given name | Michael |
| | Other_Names | Any intermediary (middle) names | Alfred |
| | Family_Name | Often referred to as surname | Jones |
| | Pref_Name | The preferred name, usually a single name | Mick |
| | Honorific | A name prefix | Mrs, Dr, Prof |
| | | The person's email address | mick@staff.edu.au |
| | Job_Title | The primary job title for the person | Facility Director |
| | GroupID_1 | Links to Party (Group) records. This needs to match an ID in the Parties-Group CSV | |
| | GroupID_2 | | |
| | GroupID_3 | | |
| | ANZSRC_FOR_1 | The 2, 4 or 6 digit Field of Research code as designated by the Australian Bureau of Statistics under ANZSRC (2008) | 04, 0101, 070201 |
| | ANZSRC_FOR_2 | | |
| | ANZSRC_FOR_3 | | |
| | NLA_Party_Identifier | A National Library of Australia party identifier | http://nla.gov.au/nla.party-461793 |
| | ResearcherID | A Thomson Reuters ID created using the http://www.researchid.com service | http://www.researcherid.com/rid/F-3500-2011 |
| | Personal_Homepage | The researcher's personal website | http://www.me.example.com/ |
| | Staff_Profile_Homepage | The researcher's webpage within the institutional web presence | http://staffprofiles.edu.au/mjones |
| | Description | A brief bio or overview of the researcher | Mike has been investigating the effects of loud noises. |
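A few of the rules in this table can be checked before harvesting. The sketch below is a hypothetical pre-flight check, not part of Mint: it treats `id` and `Given_Name` as mandatory (the rows marked `*`) and requires any supplied ANZSRC_FOR code to be exactly 2, 4 or 6 digits. The name `validate_person` is an assumption.

```python
import re

MANDATORY = ("id", "Given_Name")  # the rows marked * in the table
FOR_FIELDS = ("ANZSRC_FOR_1", "ANZSRC_FOR_2", "ANZSRC_FOR_3")
FOR_CODE = re.compile(r"\d{2}|\d{4}|\d{6}")

def validate_person(row):
    """Return a list of problems found in one Parties_People row (a dict)."""
    problems = [
        f"{name} is mandatory"
        for name in MANDATORY
        if not row.get(name, "").strip()
    ]
    for name in FOR_FIELDS:
        code = row.get(name, "").strip()
        # Empty codes are allowed; supplied codes must be 2, 4 or 6 digits.
        if code and not FOR_CODE.fullmatch(code):
            problems.append(f"{name} must be 2, 4 or 6 digits: {code!r}")
    return problems
```

Rows would typically come from `csv.DictReader`, which yields every value as a string and so preserves leading zeros in codes such as `04`.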
Groups
Groups are any organisational units that relate to research. These could be formal groups such as faculties, or informal ones such as an arts collective. Typically, this data is provided in a hierarchical model. For example, the University is named as the top-level group, with faculties and institutes below it and then schools below them.
Information:
- Filename - Parties_Groups.csv
- Demonstration - http://demo.redboxresearchdata.com.au/mint/Parties_Groups/search
- Sample data file - Parties_Groups.csv
- See also - http://ands.org.au/guides/cpguide/cpgparty-bestpractice.html
CSV Fields
| Mandatory | Field name | Description | Note | Examples |
|---|---|---|---|---|
| * | id | An identifier for the group | This is linked to via the GroupID_1 field in the Party data or the Parent_Group_ID field in the Groups data | FoS |
| * | Name | The name of the group | | Faculty of Science |
| | | A contact email address for the group | | fos@uni.edu.au |
| | Phone | A phone number for contacting the group | | 07 3456 7654 |
| | Parent_Group_ID | Link to the parent group | | UoE |
| | Homepage | The group's website | | http://www.fos.uni.edu.au/ |
| | Description | A brief description of the group | | |
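Because groups form a hierarchy through Parent_Group_ID, it can be useful to confirm that every parent link resolves and that no chain loops back on itself. This is a sketch under assumptions: `check_group_hierarchy` is a hypothetical helper, a top-level group is assumed to have an empty Parent_Group_ID, and this page does not say how Mint itself handles an unknown parent.

```python
def check_group_hierarchy(groups):
    """Return problems with Parent_Group_ID links in Parties_Groups rows."""
    parents = {g["id"]: g.get("Parent_Group_ID", "").strip() for g in groups}
    problems = []
    for group_id, parent in parents.items():
        if parent and parent not in parents:
            problems.append(f"{group_id}: unknown parent {parent}")
        # Walk up the tree; revisiting an id means the links form a loop.
        visited, current = set(), group_id
        while current:
            if current in visited:
                problems.append(f"{group_id}: cyclic parent chain")
                break
            visited.add(current)
            current = parents.get(current, "")
    return problems
```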
Activities
Activities data is used to record research activities that are not funded via the ARC or the NHMRC.
Information:
- Filename - Activities.csv
- Demonstration - http://demo.redboxresearchdata.com.au/mint/Activities/search
- Sample data file - Activities.csv
- See also - http://ands.org.au/guides/cpguide/cpgparty-bestpractice.html
CSV Fields
| Mandatory | Field name | Description |
|---|---|---|
| * | id | An identifier for the activity |
| | Title | The title of the activity |
| | Name | The name of the activity |
| | Type | The type of activity as described by the ANDS Content Providers Guide |
| | Existence_Start | The year in which the activity started |
| | Existence_End | The year in which the activity completed |
| | Description | A description of the activity |
| | Primary_Investigator_ID | The ID of the PI (links to the Parties-People data) |
| | Investigators | A semi-colon (;) separated list of IDs for other researchers related to the activity. These IDs must match records in the Parties-People data |
| | Website | A website for the activity |
| | ANZSRC_FOR_1 | The 2, 4 or 6 digit Field of Research code as designated by the Australian Bureau of Statistics under ANZSRC (2008) |
| | ANZSRC_FOR_2 | |
| | ANZSRC_FOR_3 | |
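The Investigators field holds a semi-colon separated list of IDs that must match Parties-People records, so a cross-check against the people file can catch broken links early. The helpers below are hypothetical and not part of Mint. Trimming whitespace around each ID and ignoring empty entries are assumptions about how lenient the format is.

```python
def split_investigators(value):
    """Split the semi-colon separated Investigators field into a list of ids."""
    return [part.strip() for part in value.split(";") if part.strip()]

def unknown_investigators(activity, people_ids):
    """Return investigator ids in an Activities row with no matching person."""
    ids = split_investigators(activity.get("Investigators", ""))
    primary = activity.get("Primary_Investigator_ID", "").strip()
    if primary:
        ids.insert(0, primary)  # check the PI first, then the other researchers
    return [person_id for person_id in ids if person_id not in people_ids]
```

Here `people_ids` would be the set of `id` values read from Parties_People.csv.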
Pre-loaded data
In addition to the data sets above, which contain institution-specific data, Mint stores other vocabularies. These are either:
- Standard vocabularies used by ReDBox to enhance its forms (e.g. language codes)
- National standardised data that is relevant for all Australian institutions (e.g. ARC and NHMRC grant data)
The complete list of data pre-loaded is as follows:
- ANZSRC FOR and SEO codes
- Language codes
- Geo-spatial information (used by the map widget)
- MARC Country codes
- ARC and NHMRC Grant data