Arguments and Usage

This page is generated from the source code (config.py) and provides an overview of lbsntransform command line arguments.

Usage


usage: lbsntransform [-h] [--version] [-o ORIGIN] [--dry-run] [-l]
                     [--file_type FILE_TYPE] [--input_path_url INPUT_PATH_URL]
                     [--is_stacked_json] [--is_line_separated_json]
                     [--dbpassword_hllworker DBPASSWORD_HLLWORKER]
                     [--dbuser_hllworker DBUSER_HLLWORKER]
                     [--dbserveraddress_hllworker DBSERVERADDRESS_HLLWORKER]
                     [--dbname_hllworker DBNAME_HLLWORKER]
                     [-p DBPASSWORD_OUTPUT] [-u DBUSER_OUTPUT]
                     [-a DBSERVERADDRESS_OUTPUT] [-n DBNAME_OUTPUT]
                     [--dbformat_output DBFORMAT_OUTPUT]
                     [--dbpassword_input DBPASSWORD_INPUT]
                     [--dbuser_input DBUSER_INPUT]
                     [--dbserveraddress_input DBSERVERADDRESS_INPUT]
                     [--dbname_input DBNAME_INPUT]
                     [--dbformat_input DBFORMAT_INPUT] [-t TRANSFERLIMIT]
                     [--transfer_count TRANSFER_COUNT]
                     [--commit_volume COMMIT_VOLUME]
                     [--records_tofetch RECORDS_TOFETCH]
                     [--disable_transfer_reactions]
                     [--disable_reaction_post_referencing]
                     [--ignore_non_geotagged]
                     [--startwith_db_rownumber STARTWITH_DB_ROWNUMBER]
                     [--endwith_db_rownumber ENDWITH_DB_ROWNUMBER]
                     [--debug_mode DEBUG_MODE]
                     [--geocode_locations GEOCODE_LOCATIONS]
                     [--ignore_input_source_list IGNORE_INPUT_SOURCE_LIST]
                     [--mappings_path MAPPINGS_PATH]
                     [--input_lbsn_type INPUT_LBSN_TYPE]
                     [--map_full_relations] [--csv_output]
                     [--csv_allow_linebreaks] [--csv_delimiter CSV_DELIMITER]
                     [--use_csv_dictreader] [--recursive_load]
                     [--skip_until_file SKIP_UNTIL_FILE]
                     [--skip_until_record SKIP_UNTIL_RECORD] [--zip_records]
                     [--min_geoaccuracy MIN_GEOACCURACY]
                     [--include_lbsn_objects INCLUDE_LBSN_OBJECTS]
                     [--include_lbsn_bases INCLUDE_LBSN_BASES]
                     [--override_lbsn_query_schema OVERRIDE_LBSN_QUERY_SCHEMA]
                     [--hmac_key HMAC_KEY]

Arguments

Quick reference table

The quick reference table contains truncated short summaries of descriptions. Jump to individual arguments in the navigation submenu on the left side.

Short Long Default Description
-h --help show this help message
--version show program's version
-o --origin 0 Input source type (id)
--dry-run Perform a trial run
-l --file_input This flag enables file
--file_type json Specify filetype
--input_path_url 01_Input Path to input folder.
--is_stacked_json Input is stacked json.
--is_line_separated_json Json is line separated
--dbpassword_hllworker None Password for hllworker
--dbuser_hllworker postgres Username for hllworker
--dbserveraddress_hllworker None IP for hllworker db
--dbname_hllworker None DB name for hllworker
-p --dbpassword_output None Password for out-db
-u --dbuser_output postgres Username for out-db.
-a --dbserveraddress_output None IP for output db,
-n --dbname_output None DB name for output db
--dbformat_output lbsn Format of the out-db.
--dbpassword_input None Password for input-db
--dbuser_input postgres Username for input-db.
--dbserveraddress_input None IP for input-db,
--dbname_input None DB name for input-db,
--dbformat_input json Format of the input-db
-t --transferlimit None Abort after x records.
--transfer_count 50000 Transfer batch limit x
--commit_volume None After x commit_volume,
--records_tofetch 10000 Fetch x records /batch
--disable_transfer_reactions Disable reactions.
--disable_reaction_post_referencing Disable reactions-refs
--ignore_non_geotagged Ignore non-geotagged.
--startwith_db_rownumber None Start with db row x.
--endwith_db_rownumber None End with db row x.
--debug_mode None Enable debug mode.
--geocode_locations None Path to loc-geocodes.
--ignore_input_source_list None Path to input ignore.
--mappings_path None Path mappings folder.
--input_lbsn_type None Input sub-type
--map_full_relations Map full relations.
--csv_output Store to local CSV.
--csv_allow_linebreaks Disable linebreak-rem.
--csv_delimiter , CSV delimiter.
--use_csv_dictreader Use csv.DictReader.
--recursive_load Recursive local sub di
--skip_until_file None Skip until file x.
--skip_until_record None Skip until record x.
--zip_records Zip records parallel.
--min_geoaccuracy None Min geoaccuracy to use
--include_lbsn_objects None lbsn objects to proces
--include_lbsn_bases None lbsn bases to update
--override_lbsn_query_schema None Override schema and ta
--hmac_key None Override db hmac key

-h, --help

show this help message and exit

--version

show program's version number and exit

-o, --origin

(Default: 0)

Input source type (id).

  • Defaults to 0: LBSN

Other possible values:

  • 1 - Instagram
  • 2 - Flickr
  • 21 - Flickr YFCC100M
  • 3 - Twitter

--dry-run

Perform a trial run
with no changes made to database/output

-l, --file_input

This flag enables file input

(instead of reading data from a database).

  • To specify which files to process, see parameter --input_path_url.
  • To specify file types, e.g. whether to process data from json or csv, or from URLs,
    see --file_type

--file_type

(Default: json)

Specify filetype

(json, csv etc.)

  • only applies if --file_input is used.

--input_path_url

(Default: 01_Input)

Path to input folder.

  • If not provided, subfolder ./01_Input/ will be used.
  • You can also provide a web URL, starting with http(s).
  • URLs will be accessed using requests.get(url, stream=True).
  • To separate multiple urls, use semicolon (;). In this case, see also --zip_records.

--is_stacked_json

Input is stacked json.

  • The typical form of json is [{json1},{json2}]
  • If --is_stacked_json is set, it will process stacked jsons in the form of {json1}{json2} (no comma)
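The stacked form can be parsed with Python's standard json module; this is only a sketch of the format, not lbsntransform's actual reader:

```python
import json

def read_stacked_json(text):
    """Yield objects from concatenated JSON ({json1}{json2}, no commas)."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # raw_decode() returns the object plus the index where it ended,
        # which lets us resume parsing at the next stacked object.
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end
        while idx < len(text) and text[idx].isspace():
            idx += 1

records = list(read_stacked_json('{"a": 1}{"a": 2}'))
```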

--is_line_separated_json

Json is line separated

  • The typical form is [{json1},{json2}]
  • If --is_line_separated_json is set, it will process stacked jsons in the form of {json1} {json2} (with linebreak)
  • Unix style linebreaks (LF) are expected across platforms
  • Windows users: use (e.g.) Notepad++ to convert from Windows style linebreaks (CRLF)
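Line-separated JSON is simply one object per line; a minimal Python sketch (not lbsntransform's actual reader):

```python
import json

text = '{"a": 1}\n{"a": 2}\n'
# One json.loads() call per non-empty line.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```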

--dbpassword_hllworker

(Default: None)

Password for hllworker db

  • If reading data into hlldb, all HLL Worker parameters must be supplied by default.
  • You can substitute hlldb parameters here.
  • In this case, lbsntransform will use hlldb to convert and union hll sets and to store output results.
  • Currently, this re-use of hlldb requires supplying the same set of parameters twice.
  • For separation of concerns, it is recommended to use a separate HLL Worker database

--dbuser_hllworker

(Default: postgres)

Username for hllworker db.

--dbserveraddress_hllworker

(Default: None)

IP for hllworker db

  • e.g. 111.11.11.11
  • Optionally add the port to use, e.g. 111.11.11.11:5432.
  • 5432 is the default port

--dbname_hllworker

(Default: None)

DB name for hllworker db

  • e.g. hllworkerdb

-p, --dbpassword_output

(Default: None)

Password for out-db

(postgres raw/hll db)

-u, --dbuser_output

(Default: postgres)

Username for out-db.

-a, --dbserveraddress_output

(Default: None)

IP for output db,

  • e.g. 111.11.11.11
  • Optionally add port to use, e.g. 111.11.11.11:5432.
  • 5432 is the default port

-n, --dbname_output

(Default: None)

DB name for output db

  • e.g. rawdb or hlldb

--dbformat_output

(Default: lbsn)

Format of the out-db.

  • Either hll or lbsn.
  • This setting affects how data is stored, either in anonymized and aggregate form (hll), or in the lbsn raw structure (lbsn).

--dbpassword_input

(Default: None)

Password for input-db

--dbuser_input

(Default: postgres)

Username for input-db.

--dbserveraddress_input

(Default: None)

IP for input-db,

  • e.g. 111.11.11.11
  • Optionally add port to use, e.g. 111.11.11.11:5432.
  • 5432 is the default port

--dbname_input

(Default: None)

DB name for input-db,

  • e.g. rawdb

--dbformat_input

(Default: json)

Format of the input-db.

  • Either lbsn or json
  • If lbsn is used, the native lbsn raw input mapping (0) will be used
  • If json is used, a custom mapping must be provided for mapping the database JSON records to the lbsn structure. See input mappings

-t, --transferlimit

(Default: None)

Abort after x records.

  • This can be used to limit the number of records that will be processed.
  • e.g. --transferlimit 10000 will process the first 10000 input records
  • Defaults to None (= process all)
  • Note that one input record can map to many output records. This number applies to the number of input records, not the output count.

--transfer_count

(Default: 50000)

Transfer batch limit x.

  • Defines after how many parsed records the results will be transferred to the DB.
  • Defaults to 50000
  • If you have a slow server but a fast machine, larger values can improve speed, because the duplicate check happens in Python rather than in Postgres (coalesce);
  • however, larger values require more local memory. If you have a fast server but a slow machine, try whether a smaller batch size (e.g. --transfer_count 5000) improves speed.

Note

Use --transferlimit to limit the total number of records transferred. --transfer_count instead defines the batch count that is used to transfer data incrementally.

--commit_volume

(Default: None)

After every x records (commit_volume), changes (transactions) will be written to the output database (a Postgres COMMIT).

Note that updated entries in the output database are only written from the WAL buffer after a commit.

  • Default for rawdb: 10000
  • Default for hlldb: 100000

Warning

If you have concurrent writes to the DB (e.g. multiple lbsntransform processes) and if you see transaction deadlocks, reduce the commit_volume.

--records_tofetch

(Default: 10000)

Fetch x records per batch.

  • If retrieving data from a db (lbsn), limit the number of records to fetch at once.
  • Defaults to 10000

--disable_transfer_reactions

Disable reactions.

  • If set, processing of lbsn reactions will be skipped;
  • only original posts are transferred.
  • This is useful to reduce processing time and data footprint for some service data, e.g. for Twitter, where a large number of reactions contains little original content.

--disable_reaction_post_referencing

Disable reactions-refs.

Enable this option in args to prevent empty posts being stored due to Foreign-Key-Exists Requirement.
Possible parameters:

  • 0 = Save Original Tweets of Retweets as posts;
  • 1 = do not store Original Tweets of Retweets;
  • 2 = !Not implemented: Store Original Tweets of Retweets as post_reactions

--ignore_non_geotagged

Ignore non-geotagged.

If set, posts that are not geotagged are ignored during processing.

--startwith_db_rownumber

(Default: None)

Start with db row x.

If transferring from a database (input), this flag can be used to resume processing, e.g. if a transfer has been aborted.

  • Provide a number (row-id) to start processing from live db.
  • If input db type is lbsn, this is the primary key, without the origin_id, (e.g. the post_guid, place_guid etc.).
  • This flag will only work if processing a single lbsn object (e.g. --include_lbsn_objects "post").

Example:

--startwith_db_rownumber "123456789"
will lead to the first batch-query from the DB looking like this:

SELECT * FROM topical."post"
WHERE post_guid > '123456789'
ORDER BY post_guid ASC
LIMIT 10000;

--endwith_db_rownumber

(Default: None)

End with db row x.

Provide a number (row-id) to end processing from live db

--debug_mode

(Default: None)

Enable debug mode.

--geocode_locations

(Default: None)

Path to loc-geocodes.

  • Provide the path to a CSV file with location geocodes.
  • The CSV header must be: lat, lng, name.
  • This can be used in mappings to assign coordinates (lat, lng) to locations that are only provided as text.
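The expected file layout can be sketched with Python's csv module; the place names and coordinates below are hypothetical, and the real flag expects a file path rather than an in-memory string:

```python
import csv
import io

# Hypothetical geocode file contents with the required header lat, lng, name.
csv_text = "lat,lng,name\n51.03,13.73,Dresden\n48.14,11.58,Munich\n"

# Build a name -> (lat, lng) lookup, as a mapping might use it.
geocodes = {
    row["name"]: (float(row["lat"]), float(row["lng"]))
    for row in csv.DictReader(io.StringIO(csv_text))
}
```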

--ignore_input_source_list

(Default: None)

Path to input ignore.

Provide a path to a list of input_source types that will be ignored (e.g. to ignore certain bots etc.)

--mappings_path

(Default: None)

Path mappings folder.

Provide a path to a custom folder that contains one or more input mapping modules (*.py).

--input_lbsn_type

(Default: None)

Input sub-type

  • e.g. post, profile, friendslist, followerslist etc.
  • This can be used to select an appropriate mapping procedure in a single mapping module.

--map_full_relations

Map full relations.

Set to true to map full relations, e.g. many-to-many relationships, such as user_follows, user_friend, or user_mentions etc. are mapped in a separate table. Defaults to False.

--csv_output

Store to local CSV.

If set, all submitted values will be stored to a local CSV instead. Note that this type of output is currently not available.

--csv_allow_linebreaks

Disable linebreak-rem.

If set, in-text linebreaks (\r or \n) in output CSVs will not be removed

--csv_delimiter

(Default: ,)

CSV delimiter.

  • Provide the CSV delimiter to be used.
  • Default is comma (,).
  • Note: to pass a tab character, use shell variable substitution ($'\t')
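The substitution itself happens in the shell, not in lbsntransform; a short Bash illustration (assuming a Bash-compatible shell):

```shell
# $'\t' is Bash ANSI-C quoting: the shell expands it to a literal tab
# before the value ever reaches the program.
delim=$'\t'
# Passing --csv_delimiter "$delim" therefore hands the tool one real tab.
printf '%s' "$delim" | wc -c    # counts a single byte
```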

--use_csv_dictreader

Use csv.DictReader.

By default, CSVs will be read line by line,
using the standard csv.reader().

This will enable csv.DictReader(),
which allows to access CSV fields by name in mappings.

A CSV with a header is required for this setting to work.

Note that csv.DictReader() may be slower than the default csv.reader().
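The difference between the two readers can be sketched as follows; the column names are hypothetical, not a documented lbsntransform schema:

```python
import csv
import io

csv_text = "post_guid,post_body\n42,hello\n"

# Default behaviour: csv.reader() yields lists, fields accessed by position.
rows = list(csv.reader(io.StringIO(csv_text)))
body_by_index = rows[1][1]

# With --use_csv_dictreader: csv.DictReader() uses the header row,
# so fields can be accessed by name in mappings.
dict_rows = list(csv.DictReader(io.StringIO(csv_text)))
body_by_name = dict_rows[0]["post_body"]
```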

--recursive_load

Recursive local sub dirs.

If set, process input directories recursively (default depth: 2)

--skip_until_file

(Default: None)

Skip until file x.

If local input, skip all files until file with name x appears (default: start immediately)

--skip_until_record

(Default: None)

Skip until record x.

If local input, skip all records until record x (default: start with first)

--zip_records

Zip records parallel.

  • Use this flag to zip records of multiple input files
  • e.g. List1[A,B,C], List2[1,2,3] will be combined (zipped) on read to List[A1,B2,C3]
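The zipping behaviour can be pictured with Python's built-in zip(); this is only an illustration of the record order, not lbsntransform's actual file reader:

```python
# Two hypothetical input files, already read into lists of records.
list1 = ["A", "B", "C"]
list2 = [1, 2, 3]

# zip() pairs records positionally: A with 1, B with 2, C with 3 ...
pairs = list(zip(list1, list2))

# ... interleaving both sources into one combined stream.
combined = [record for pair in pairs for record in pair]
```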

--min_geoaccuracy

(Default: None)

Min geoaccuracy to use

Set to latlng, place, or city to limit processing of records based on minimum geoaccuracy (default: no limit)

--include_lbsn_objects

(Default: None)

lbsn objects to process

If processing from lbsn db (rawdb), provide a comma separated list of lbsn objects to include.
May contain:

  • origin
  • country
  • city
  • place
  • user_groups
  • user
  • post
  • post_reaction

Notes:

  • Excluded objects will not be queried, but empty objects may be created due to referenced foreign key relationships.
  • Defaults to origin,post

--include_lbsn_bases

(Default: None)

lbsn bases to update

If the target output type is hll, provide a comma separated list of lbsn bases to include/update/store to.

Currently supported:

  • hashtag
  • emoji
  • term
  • _hashtag_latlng
  • _term_latlng
  • _emoji_latlng
  • _month_hashtag
  • _month_hashtag_latlng
  • _month_latlng
  • monthofyear
  • month
  • dayofmonth
  • dayofweek
  • hourofday
  • year
  • date
  • timestamp
  • country
  • region
  • city
  • place
  • latlng
  • community

Bases not included will be skipped. Per default, no bases will be considered.

Example:

--include_lbsn_bases hashtag,place,date,community

This will update entries in the Postgres hlldb tables social.community, topical.hashtag, spatial.place, and temporal.date: non-existing entries will be created and existing ones will be updated (a hll union).

See the structure definition in SQL here for a full list of hlldb table structures.

This argument may only be supplied once.

--override_lbsn_query_schema

(Default: None)

Override schema and table name

This can be used to redirect lbsn queries on the given object from input db to a specific schema/table such as a materialized view.

This can be useful (e.g.) to limit processing of input data to a specific query.

Format is lbsn_type,schema.table.

Example:

--override_lbsn_query_schema post,mviews.mypostquery

Argument can be used multiple times.

--hmac_key

(Default: None)

Override db hmac key

The hmac key that is used for cryptographic hashing during creation of HLL sets. This overrides the key that is set in the HLL Worker database.

Remember to re-use the same hmac key for any consecutive update of HLL sets.

The crypt.salt variable can also be set (temporarily or permanently) in the hll worker database itself.
Example:

ALTER DATABASE hllworkerdb SET crypt.salt = 'CRYPTSALT';
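Why key re-use matters can be illustrated with Python's hmac module; this is only an analogy, since the actual hashing is performed inside the HLL Worker database via crypt.salt:

```python
import hashlib
import hmac

def pseudonymize(record_id: str, key: str) -> str:
    # HMAC-SHA256 digest of a record id; stable for a fixed key.
    return hmac.new(key.encode(), record_id.encode(), hashlib.sha256).hexdigest()

# The same key always yields the same pseudonym, so repeated updates
# count each record only once ...
same = pseudonymize("user:123", "CRYPTSALT") == pseudonymize("user:123", "CRYPTSALT")

# ... while a changed key yields a different pseudonym, so unioning
# HLL sets built with different keys would double-count records.
different = pseudonymize("user:123", "CRYPTSALT") != pseudonymize("user:123", "other-key")
```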

Further information is available in the YFCC HLL tutorial.