DataSet class#

class tally.dataset.DataSet(api_key=None, host='tally.datasmoothie.com', ssl=True, use_futures=False)#

A class that wraps a dataset and has all the information needed to send to the API in order to perform the various tasks.

Parameters

  • api_key (string) – Your Tally API key

  • host (string) – The server to connect to (defaults to tally.datasmoothie.com)

  • ssl (boolean) – Whether to use SSL when connecting to the server

  • use_futures (boolean) – Whether to send requests asynchronously using futures

add_credentials(api_key=None, host='tally.datasmoothie.com', ssl=True)#

Add your API key and what server it is authorized to connect to. Useful for on-prem installations and development.

band(**kwargs)#

Group numeric data with band definitions treated as group text labels.

Parameters
  • name (string) – The column variable name keyed in _meta[‘columns’] that will be banded into summarized categories.

  • bands (array) – The categorical bands to be used. Bands can be single numeric values or ranges.

  • new_name ((string, default None)) – The created variable will be named ‘<name>_banded’, unless a desired name is provided explicitly here.

  • label ((string, default None)) – The created variable’s text label will be identical to that of the originating variable, unless a desired label is provided explicitly here.

  • text_key ((string, default None)) – Text key for text-based label information. Uses the DataSet.text_key information if not provided.
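The banding logic can be sketched in plain Python. This is an illustration of how single values and ranges map onto categorical codes, not the SDK implementation; the helper name and the tuple-based range format are assumptions for this sketch:

```python
# Illustrative sketch of banding: single values and (low, high) ranges
# are turned into 1-based categorical codes. Not the Tally implementation.
def assign_band(value, bands):
    """Return the 1-based code of the first band that contains value."""
    for code, band in enumerate(bands, start=1):
        if isinstance(band, tuple):          # range band, inclusive
            low, high = band
            if low <= value <= high:
                return code
        elif value == band:                  # single-value band
            return code
    return None                              # value falls outside all bands

ages = [18, 25, 34, 50, 67]
bands = [(18, 24), (25, 34), (35, 54), (55, 99)]
banded = [assign_band(a, bands) for a in ages]
# banded -> [1, 2, 2, 3, 4]
```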

compare(**kwargs)#

Compares types, codes, values, question labels of two datasets.

Parameters
  • dataset (object) – (quantipy.DataSet instance). Test if all variables in the provided dataset are also in self and compare their metadata definitions.

  • variables (str, array of str) – Check only these variables

  • strict ((bool, default False)) – If True, lower/upper case and spaces are taken into account.

  • text_key (str, array of str) – Text key for text-based label information. Uses the DataSet.text_key information if not provided.

convert_data_to_csv_json(**kwargs)#

Converts data, either sent or from an external source, to Quantipy CSV and JSON.

The data to convert can be from parquet, SPSS, or UNICOM Intelligence (fka Dimensions) or a pure CSV exported from Excel.

convert_data_to_sav(**kwargs)#

Converts data, either sent or from an external source, to an SPSS sav file.

The data to convert can be from Quantipy, or UNICOM Intelligence (fka Dimensions) or a pure CSV exported from Excel.

The sav files created do not support Quantipy’s delimited set.

copy(**kwargs)#

Copy meta and case data of the variable definition given per name.

Parameters
  • name (string) – The column variable name.

  • suffix (string (default "rec")) – The new variable name will be constructed by suffixing the original name with _suffix, e.g. age_rec

  • copy_data (boolean (default true)) – The new variable assumes the data of the original variable.

  • slicer (dict) – If the data is copied it is possible to filter the data with a complex logic.

  • copy_only (int or list of int, default None) – If provided, the copied version of the variable will only contain (data and) meta for the specified codes.

  • copy_not (int or list of int, default None) – If provided, the copied version of the variable will contain (data and) meta for all codes except the indicated ones.

  • new_name (string) – If provided, the returned object will contain this new name instead of name_suffix

derive(**kwargs)#

Create meta and recode case data by specifying derived category logics.

Derived variables have their answer codes derived from other variables. Derived variables can either be multi-choice (called delimited set) or single-choice.

A derived variable can be created from one variable, for example when a Likert scale question has NETs added to it, or it can be created from multiple variables. When a derived variable is created from multiple variables, the user has to define how these variables combine into the new one, i.e. whether the new variable is an intersection or a union (the logical expressions and or or).

The conditional map is a list/array of entries with either three or four elements, of the following structure:

#### 3 elements, type of logic not specified: [code, label, logic dictionary], e.g.:

[1, "People who live in urban and sub-urban settings", {'locality': [1, 2]}]

#### 4 elements, type of logic specified: [code, label, type of logic, logic dictionary], e.g.:

[1, "Men in urban and suburban locations", 'intersection', {'gender': [1], 'locality': [1, 2]}]

[2, "Women in urban and suburban locations", 'intersection', {'gender': [2], 'locality': [1, 2]}]

The logic types are ‘union’ and ‘intersection’. If no logic type is specified, ‘union’ is used. union is equivalent to the logical expression or, and intersection is equivalent to and.

Parameters
  • name (string) – The column variable name.

  • label (string) – The text label for the created variable.

  • qtype (string) – The structural type of the data the meta describes (int, float, single or delimited set).

  • cond_maps (array) – List of logic dictionaries that define how each answer and code is derived.

  • cond_map (array, deprecated) – List of “tuples”, see documentation above.
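The union/intersection semantics described above can be sketched in plain Python. This illustrates the logic only; the helper name is hypothetical and this is not the SDK's evaluator:

```python
# Sketch of 'union' (or) vs 'intersection' (and) over {variable: [codes]}
# logic dictionaries, evaluated against one respondent's answers.
def matches(logic, row, how='union'):
    hits = [row.get(var) in codes for var, codes in logic.items()]
    return all(hits) if how == 'intersection' else any(hits)

respondent = {'gender': 1, 'locality': 2}

# [1, "Men in urban and suburban locations", 'intersection', {...}]
assert matches({'gender': [1], 'locality': [1, 2]}, respondent,
               how='intersection')          # man AND urban/suburban
assert not matches({'gender': [2], 'locality': [1, 2]}, respondent,
                   how='intersection')      # not a woman, so no match
```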

extend_values(**kwargs)#

Add an answer/value and code to the list of answer/values/codes already in the meta data for the variable.

Attempting to add already existing value codes or providing already present value texts will raise an invalid_arguments error.

feature_select(**kwargs)#

Shows the variables that score highest with a given ML feature-selection algorithm.

filter(**kwargs)#

Filter the DataSet using a logical expression.

Parameters
  • alias (string) – Name of the filter

  • condition (object) – An object that defines the filter logic

find(**kwargs)#

Find variables by searching their names for substrings.

Parameters
  • str_tags (string or list of strings) – The string tags to look for in the variable names. If not provided, the module’s default global list of substrings from VAR_SUFFIXES will be used.

  • suffixed (boolean (default false)) – If True, only variable names that end with a given string sequence will qualify.

get_variable_text(**kwargs)#

Return the variable's text label information.

hmerge(**kwargs)#

Merge Quantipy datasets together by adding columns. This function merges two Quantipy datasets together, updating variables that exist in the left dataset and appending others. New variables will be appended in the order indicated by the ‘data file’ set if found, otherwise they will be appended in alphanumeric order. This merge happens horizontally (column-wise).

Parameters
  • dataset (object) – (quantipy.DataSet instance). Test if all variables in the provided dataset are also in self and compare their metadata definitions.

  • on (str, default=None) – The column to use to identify unique rows in both datasets.

  • left_on (str, default=None) – The column to use to identify unique rows in the left dataset.

  • right_on (str, default=None) – The column to use to identify unique rows in the right dataset.

  • row_id_name (str, default=None) – The named column will be filled with the ids indicated for each dataset, as per left_id/right_id/row_ids. If meta for the named column doesn’t already exist a new column definition will be added and assigned a reductive-appropriate type.

  • left_id (str, int, float, default=None) – Where the row_id_name column is not already populated for the dataset_left, this value will be populated.

  • right_id (str, int, float, default=None) – Where the row_id_name column is not already populated for the dataset_right, this value will be populated.

  • row_ids (array of (str, int, float), default=None) – When a list of datasets has been merged, this list provides the row ids that will be populated in the row_id_name column for each of those datasets, respectively.

  • overwrite_text (bool, default=False) – If True, text_keys in the left meta that also exist in right meta will be overwritten instead of ignored.

  • from_set (str, default=None) – Use a set defined in the right meta to control which columns are merged from the right dataset.

  • uniquify_key (str, default=None) – An int-like column name found in all the passed DataSet objects that will be protected from having duplicates. The original version of the column will be kept under its name prefixed with ‘original’.

  • reset_index (bool, default=True) – If True pandas.DataFrame.reindex() will be applied to the merged dataframe.
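The column-wise merge semantics can be illustrated with plain dictionaries of columns: variables present in both datasets are updated from the right dataset, and new variables are appended. A minimal sketch of the idea, not the SDK implementation:

```python
# Sketch of hmerge semantics with columns stored as plain lists:
# shared variables are updated from the right dataset, new ones appended.
left  = {'id': [1, 2, 3], 'age': [30, 41, 52]}
right = {'id': [1, 2, 3], 'age': [31, 41, 52], 'region': [5, 5, 9]}

merged = dict(left)
for name, column in right.items():
    merged[name] = column        # update if shared, append if new

# merged -> {'id': [1, 2, 3], 'age': [31, 41, 52], 'region': [5, 5, 9]}
```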

joined_crosstab(**kwargs)#

Does crosstab tabulation using the provided parameters, allowing multiple data sources to be sent with the request so that multiple crosstabs can run in one result. Returns a JSON-encoded dataframe as a result.

meta(**kwargs)#

Shows the meta data for a variable.

Parameters
  • variable (string) – Name of the variable to show meta data for

  • variables (array) – Name of multiple variables to show meta data for

recode(**kwargs)#

Create a new or copied series from data, recoded using a mapper.

This function takes a mapper of {key: logic} entries and injects the key into the target column where its paired logic is True. The logic may be arbitrarily complex and may refer to any other variable or variables in data. Where a pre-existing column has been used to start the recode, the injected values can replace or be appended to any data found there to begin with. The recoded data will always comply with the column type indicated for the target column according to the meta.

#### Mapping example:

    recode_mapper = {
        1: {"$union": [
                {"$intersection": [{"locality": [3]}, {"gender": [1]}]},
                {"$intersection": [{"locality": [4]}, {"gender": [1]}]}
            ]},
        2: {"$intersection": [{"locality": [2]}, {"gender": [2]}]},
        3: {"$union": [{"locality": [1]}, {"gender": [1]}]},
        4: {"locality": [4]},
        5: {"locality": [5]},
        6: {"locality": [6]}
    }

Logical functions are strings preceded by the symbol $, and logic can be nested at arbitrary depth.

Parameters
  • target (string) – The variable name of the target of the recode.

  • mapper (dict) – A mapper of {key: logic} entries.

  • default (string) – The column name to default to in cases where unattended lists are given in your logic, where an auto-transformation of {key: list} to {key: {default: list}} is provided. Note that lists in logical statements are themselves a form of shorthand and this will ultimately be interpreted as: {key: {default: has_any(list)}}.

  • append (boolean) – Should the new recoded data be appended to values already found in the series? If False, data from series (where found) will overwrite whatever was found for that item instead.

  • intersect (dict) – If a logical statement is given here then it will be used as an implied intersection of all logical conditions given in the mapper. For example, we could limit our mapper to males.

  • initialize (str (default: None)) – Name of variable to use to populate the variable before the recode

  • fillna (int) – If provided, can be used to fill empty/nan values.
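The {key: logic} injection described above can be sketched locally. The evaluator below handles the "$union"/"$intersection" logical functions and plain {variable: codes} membership tests; it is an illustration of the mapping semantics, not the Tally implementation:

```python
# Sketch of recode: inject each mapper key where its logic evaluates True.
# "$union" / "$intersection" are logical functions; plain {var: codes}
# entries test code membership. Not the Tally implementation.
def evaluate(logic, row):
    if '$union' in logic:
        return any(evaluate(part, row) for part in logic['$union'])
    if '$intersection' in logic:
        return all(evaluate(part, row) for part in logic['$intersection'])
    return all(row.get(var) in codes for var, codes in logic.items())

mapper = {
    1: {'$intersection': [{'locality': [3, 4]}, {'gender': [1]}]},
    2: {'$intersection': [{'locality': [1, 2]}, {'gender': [2]}]},
}

rows = [{'locality': 3, 'gender': 1}, {'locality': 2, 'gender': 2}]
recoded = [next((k for k, logic in mapper.items() if evaluate(logic, row)),
                None) for row in rows]
# recoded -> [1, 2]
```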

remove_values(**kwargs)#

Erase value codes safely from both meta and case data components.

set_value_texts(**kwargs)#

Rename or add value texts in the ‘values’ object.

This method works for array masks and column meta data.

set_variable_text(**kwargs)#

Change the variable text for a named variable.

sum(**kwargs)#

Adds all values in each column and returns the sum for each column.

Parameters
  • new_variable_name (string) – Name for the new variable that will contain the summarized amounts.

  • variables (array) – The variables to sum. Only float or int types.

to_array(**kwargs)#

Create a new grid (array) variable from two or more single-choice variables with the same labels.

to_delimited_set(**kwargs)#

Create a new delimited set (multiple choice) variable from two or more single-choice variables.
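The idea of combining single-choice variables into a multiple-choice variable can be sketched as follows. Quantipy stores delimited sets as ';'-separated code strings; the exact storage format and the variable names used here are assumptions for illustration:

```python
# Sketch of building a delimited set (multiple choice) from single-choice
# variables: each source variable contributes one code when selected.
singles = {
    'likes_tea':    [1, 0, 1],   # 1 = selected, 0 = not selected
    'likes_coffee': [1, 1, 0],
}
codes = {name: code for code, name in enumerate(singles, start=1)}

delimited = []
for i in range(3):
    selected = [str(codes[name]) for name in singles if singles[name][i]]
    delimited.append(';'.join(selected) + ';' if selected else '')
# delimited -> ['1;2;', '2;', '1;']
```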

use_confirmit(source_projectid, source_idp_url, source_client_id, source_client_secret, source_public_url)#

Load remote Forsta/Confirmit data into the dataset as the data to send with all requests.

Parameters
  • source_projectid (string) – Project id of the survey

  • source_idp_url (string) – IdP URL of the survey

  • source_client_id (string) – Your client id

  • source_client_secret (string) – Client secret (don’t commit this to a repository)

  • source_public_url (string) – Public URL of the source

use_csv(csv_file)#

Load CSV file into the dataset as the file to send with all requests.

Parameters

csv_file (string) – Path to the CSV file we want to use as our data.

use_nebu(nebu_url)#

Load remote Nebu/Enghouse file into the dataset as the file to send with all requests.

Parameters

nebu_url (string) – Path to the Nebu data file we want to use as our data.

use_parquet(pq_data_filename, pq_meta_filename=None)#

Load parquet file into memory as the file to send with all requests.

Parameters
  • pq_data_filename (string : BytesIO) – Path to the parquet file we want to use as our data OR a bytes array

  • pq_meta_filename (string : BytesIO) – Path to the meta file we want to use as our data OR a bytes array

use_quantipy(meta_json, data_csv)#

Load Quantipy meta and data files to this dataset.

Parameters
  • meta_json (string) – Path to the json file we want to use as our meta data.

  • data_csv (string) – Path to the csv file we want to use as our data file.

use_spss(file_path)#

Load SPSS file into memory as the file to send with all requests.

Parameters

file_path (string : BytesIO) – Path to the sav file we want to use as our data OR a bytes array

use_unicom(mdd_filename, ddf_filename)#

Load UNICOM Intelligence (mdd/ddf) files into memory as the files to send with all requests.

Note: If the mdd_filename and ddf_filename are passed as BytesIO objects, they will be sent directly to Tally and named “file.mdd” and “file.ddf” respectively.

Parameters
  • mdd_filename (string : BytesIO) – Path to the mdd file we want to use as our metadata OR a bytes array

  • ddf_filename (string : BytesIO) – Path to the ddf file we want to use as our data OR a bytes array

values(**kwargs)#

Get a list of value texts and codes for a categorical variable, as a dictionary. The method will return the codes in the data and the labels that apply to those codes in the chosen language (or the default language if no language is chosen).

Parameters
  • name (string) – Name of variable to fetch values/codes for.

  • text_key (string (default None)) – The language key that should be used when taking labels from the meta data.

  • include_variable_texts (boolean (default false)) – Include labels for the variable name in the results.

variables(**kwargs)#

Shows a list of variables in the dataset.

vmerge(**kwargs)#

Merge Quantipy datasets together by appending rows. This function merges two Quantipy datasets together, updating variables that exist in the left dataset and appending others. New variables will be appended in the order indicated by the ‘data file’ set if found, otherwise they will be appended in alphanumeric order. This merge happens vertically (row-wise).

Parameters
  • dataset (object) – (quantipy.DataSet instance). Test if all variables in the provided dataset are also in self and compare their metadata definitions.

  • on (str, default=None) – The column to use to identify unique rows in both datasets.

  • left_on (str, default=None) – The column to use to identify unique rows in the left dataset.

  • right_on (str, default=None) – The column to use to identify unique rows in the right dataset.

  • row_id_name (str, default=None) – The named column will be filled with the ids indicated for each dataset, as per left_id/right_id/row_ids. If meta for the named column doesn’t already exist a new column definition will be added and assigned a reductive-appropriate type.

  • left_id (str, int, float, default=None) – Where the row_id_name column is not already populated for the dataset_left, this value will be populated.

  • right_id (str, int, float, default=None) – Where the row_id_name column is not already populated for the dataset_right, this value will be populated.

  • row_ids (array of (str, int, float), default=None) – When a list of datasets has been merged, this list provides the row ids that will be populated in the row_id_name column for each of those datasets, respectively.

  • overwrite_text (bool, default=False) – If True, text_keys in the left meta that also exist in right meta will be overwritten instead of ignored.

  • from_set (str, default=None) – Use a set defined in the right meta to control which columns are merged from the right dataset.

  • uniquify_key (str, default=None) – An int-like column name found in all the passed DataSet objects that will be protected from having duplicates. The original version of the column will be kept under its name prefixed with ‘original’.

  • reset_index (bool, default=True) – If True pandas.DataFrame.reindex() will be applied to the merged dataframe.
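The row-wise merge semantics can be illustrated with plain dictionaries of columns: rows from the right dataset are appended under matching variables, and variables missing on either side are padded. A minimal sketch of the idea, not the SDK implementation:

```python
# Sketch of vmerge semantics: rows from the right dataset are appended,
# and columns missing on either side are filled with None.
left  = {'age': [30, 41]}
right = {'age': [25], 'region': [9]}

names = list(left) + [n for n in right if n not in left]
merged = {n: left.get(n, [None] * 2) + right.get(n, [None] * 1)
          for n in names}
# merged -> {'age': [30, 41, 25], 'region': [None, None, 9]}
```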

write_quantipy(file_path_json, file_path_csv)#

Write the case and meta data as Quantipy compatible json and csv files.

Parameters
  • file_path_json (string) – Path to the json meta data file to create.

  • file_path_csv (string) – Path to the csv data file to create.

write_spss(file_path, data_params, **kwargs)#

Writes the dataset to an SPSS (sav) file.

Parameters

file_path (string) – Path to the sav file to write.