# Arguments and Usage
This page is generated from the source code (`config.py`) and provides an overview of the lbsntransform command line arguments.
## Usage

```text
usage: lbsntransform [-h] [--version] [-o ORIGIN] [--dry-run] [-l]
                     [--file_type FILE_TYPE] [--input_path_url INPUT_PATH_URL]
                     [--is_stacked_json] [--is_line_separated_json]
                     [--dbpassword_hllworker DBPASSWORD_HLLWORKER]
                     [--dbuser_hllworker DBUSER_HLLWORKER]
                     [--dbserveraddress_hllworker DBSERVERADDRESS_HLLWORKER]
                     [--dbname_hllworker DBNAME_HLLWORKER]
                     [-p DBPASSWORD_OUTPUT] [-u DBUSER_OUTPUT]
                     [-a DBSERVERADDRESS_OUTPUT] [-n DBNAME_OUTPUT]
                     [--dbformat_output DBFORMAT_OUTPUT]
                     [--dbpassword_input DBPASSWORD_INPUT]
                     [--dbuser_input DBUSER_INPUT]
                     [--dbserveraddress_input DBSERVERADDRESS_INPUT]
                     [--dbname_input DBNAME_INPUT]
                     [--dbformat_input DBFORMAT_INPUT] [-t TRANSFERLIMIT]
                     [--transfer_count TRANSFER_COUNT]
                     [--commit_volume COMMIT_VOLUME]
                     [--records_tofetch RECORDS_TOFETCH]
                     [--disable_transfer_reactions]
                     [--disable_reaction_post_referencing]
                     [--ignore_non_geotagged]
                     [--startwith_db_rownumber STARTWITH_DB_ROWNUMBER]
                     [--endwith_db_rownumber ENDWITH_DB_ROWNUMBER]
                     [--debug_mode DEBUG_MODE]
                     [--geocode_locations GEOCODE_LOCATIONS]
                     [--ignore_input_source_list IGNORE_INPUT_SOURCE_LIST]
                     [--mappings_path MAPPINGS_PATH]
                     [--input_lbsn_type INPUT_LBSN_TYPE]
                     [--map_full_relations] [--csv_output]
                     [--csv_allow_linebreaks] [--csv_delimiter CSV_DELIMITER]
                     [--use_csv_dictreader] [--recursive_load]
                     [--skip_until_file SKIP_UNTIL_FILE]
                     [--skip_until_record SKIP_UNTIL_RECORD] [--zip_records]
                     [--min_geoaccuracy MIN_GEOACCURACY]
                     [--include_lbsn_objects INCLUDE_LBSN_OBJECTS]
                     [--include_lbsn_bases INCLUDE_LBSN_BASES]
                     [--override_lbsn_query_schema OVERRIDE_LBSN_QUERY_SCHEMA]
                     [--hmac_key HMAC_KEY]
```
## Arguments
### Quick reference table
The quick reference table contains shortened summaries of the full descriptions. Jump to individual arguments via the navigation submenu on the left side.
| Short | Long | Default | Description |
|---|---|---|---|
| `-h` | `--help` | | show this help message |
| | `--version` | | show program's version |
| `-o` | `--origin` | `0` | Input source type (id) |
| | `--dry-run` | | Perform a trial run |
| `-l` | `--file_input` | | This flag enables file input |
| | `--file_type` | `json` | Specify filetype |
| | `--input_path_url` | `01_Input` | Path to input folder |
| | `--is_stacked_json` | | Input is stacked json |
| | `--is_line_separated_json` | | Json is line separated |
| | `--dbpassword_hllworker` | `None` | Password for hllworker db |
| | `--dbuser_hllworker` | `postgres` | Username for hllworker db |
| | `--dbserveraddress_hllworker` | `None` | IP for hllworker db |
| | `--dbname_hllworker` | `None` | DB name for hllworker db |
| `-p` | `--dbpassword_output` | `None` | Password for output db |
| `-u` | `--dbuser_output` | `postgres` | Username for output db |
| `-a` | `--dbserveraddress_output` | `None` | IP for output db |
| `-n` | `--dbname_output` | `None` | DB name for output db |
| | `--dbformat_output` | `lbsn` | Format of the output db |
| | `--dbpassword_input` | `None` | Password for input db |
| | `--dbuser_input` | `postgres` | Username for input db |
| | `--dbserveraddress_input` | `None` | IP for input db |
| | `--dbname_input` | `None` | DB name for input db |
| | `--dbformat_input` | `json` | Format of the input db |
| `-t` | `--transferlimit` | `None` | Abort after x records |
| | `--transfer_count` | `50000` | Transfer batch limit x |
| | `--commit_volume` | `None` | Commit after x records |
| | `--records_tofetch` | `10000` | Fetch x records per batch |
| | `--disable_transfer_reactions` | | Disable reactions |
| | `--disable_reaction_post_referencing` | | Disable reaction references |
| | `--ignore_non_geotagged` | | Ignore non-geotagged |
| | `--startwith_db_rownumber` | `None` | Start with db row x |
| | `--endwith_db_rownumber` | `None` | End with db row x |
| | `--debug_mode` | `None` | Enable debug mode |
| | `--geocode_locations` | `None` | Path to location geocodes |
| | `--ignore_input_source_list` | `None` | Path to input ignore list |
| | `--mappings_path` | `None` | Path to mappings folder |
| | `--input_lbsn_type` | `None` | Input sub-type |
| | `--map_full_relations` | | Map full relations |
| | `--csv_output` | | Store to local CSV |
| | `--csv_allow_linebreaks` | | Disable linebreak removal |
| | `--csv_delimiter` | `,` | CSV delimiter |
| | `--use_csv_dictreader` | | Use csv.DictReader |
| | `--recursive_load` | | Recurse local sub dirs |
| | `--skip_until_file` | `None` | Skip until file x |
| | `--skip_until_record` | `None` | Skip until record x |
| | `--zip_records` | | Zip records in parallel |
| | `--min_geoaccuracy` | `None` | Min geoaccuracy to use |
| | `--include_lbsn_objects` | `None` | lbsn objects to process |
| | `--include_lbsn_bases` | `None` | lbsn bases to update |
| | `--override_lbsn_query_schema` | `None` | Override schema and table |
| | `--hmac_key` | `None` | Override db hmac key |
### `-h`, `--help`

show this help message and exit
### `--version`

show program's version number and exit
### `-o`, `--origin`

(Default: `0`)

Input source type (id). Possible values:

- `0`: LBSN (default)
- `1`: Instagram
- `2`: Flickr
- `21`: Flickr YFCC100M
- `3`: Twitter
### `--dry-run`

Perform a trial run with no changes made to the database/output.
### `-l`, `--file_input`

This flag enables file input (instead of reading data from a database).

- To specify which files to process, see parameter `--input_path_url`.
- To specify file types, e.g. whether to process data from `json` or `csv`, or from URLs, see `--file_type`.
### `--file_type`

(Default: `json`)

Specify filetype (`json`, `csv` etc.). Only applies if `--file_input` is used.
### `--input_path_url`

(Default: `01_Input`)

Path to input folder.

- If not provided, the subfolder `./01_Input/` will be used.
- You can also provide a web URL, starting with `http(s)`. URLs will be accessed using `requests.get(url, stream=True)`.
- To separate multiple URLs, use a semicolon (`;`). In this case, see also `--zip_records`.
### `--is_stacked_json`

Input is stacked json.

- The typical form of json is `[{json1},{json2}]`.
- If `--is_stacked_json` is set, stacked jsons in the form of `{json1}{json2}` (no comma) will be processed.
### `--is_line_separated_json`

Json is line separated.

- The typical form is `[{json1},{json2}]`.
- If `--is_line_separated_json` is set, jsons in the form of `{json1}` `{json2}` (one object per line) will be processed.
- Unix style linebreaks (LF) are expected across platforms.
- Windows users can use (e.g.) Notepad++ to convert from Windows style linebreaks (CRLF).
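The three JSON layouts described above can be distinguished with standard-library tools. The following is a minimal sketch; `read_json_records` is a hypothetical helper for illustration, not part of lbsntransform:

```python
import io
import json

def read_json_records(text: str, stacked: bool = False,
                      line_separated: bool = False):
    """Yield individual records from the three JSON layouts."""
    if stacked:
        # {json1}{json2} -- no commas; decode one object at a time
        decoder = json.JSONDecoder()
        text = text.strip()
        pos = 0
        while pos < len(text):
            obj, end = decoder.raw_decode(text, pos)
            yield obj
            # skip any whitespace between objects
            while end < len(text) and text[end].isspace():
                end += 1
            pos = end
    elif line_separated:
        # {json1}\n{json2} -- one object per line (LF linebreaks)
        for line in io.StringIO(text):
            line = line.strip()
            if line:
                yield json.loads(line)
    else:
        # typical layout: a single JSON array [{json1},{json2}]
        yield from json.loads(text)
```

Usage: `list(read_json_records('{"a": 1}{"a": 2}', stacked=True))` yields the two objects separately, which a plain `json.loads()` on the whole string could not parse.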
### `--dbpassword_hllworker`

(Default: `None`)

Password for the hllworker db.

- If reading data into `hlldb`, all HLL Worker parameters must be supplied by default.
- You can substitute hlldb parameters here. In this case, lbsntransform will use hlldb to convert and union hll sets and to store output results.
- Currently, this re-use of hlldb requires supplying the same set of parameters twice.
- For separation of concerns, it is recommended to use a separate HLL Worker database.
### `--dbuser_hllworker`

(Default: `postgres`)

Username for the hllworker db.
### `--dbserveraddress_hllworker`

(Default: `None`)

IP for the hllworker db.

- e.g. `111.11.11.11`
- Optionally add the port to use, e.g. `111.11.11.11:5432`. `5432` is the default port.
### `--dbname_hllworker`

(Default: `None`)

DB name for the hllworker db, e.g. `hllworkerdb`.
### `-p`, `--dbpassword_output`

(Default: `None`)

Password for the output db (postgres raw/hll db).
### `-u`, `--dbuser_output`

(Default: `postgres`)

Username for the output db.
### `-a`, `--dbserveraddress_output`

(Default: `None`)

IP for the output db.

- e.g. `111.11.11.11`
- Optionally add the port to use, e.g. `111.11.11.11:5432`. `5432` is the default port.
### `-n`, `--dbname_output`

(Default: `None`)

DB name for the output db, e.g. `rawdb` or `hlldb`.
### `--dbformat_output`

(Default: `lbsn`)

Format of the output db.

- Either `hll` or `lbsn`.
- This setting affects how data is stored: either in anonymized and aggregate form (`hll`), or in the lbsn raw structure (`lbsn`).
### `--dbpassword_input`

(Default: `None`)

Password for the input db.
### `--dbuser_input`

(Default: `postgres`)

Username for the input db.
### `--dbserveraddress_input`

(Default: `None`)

IP for the input db.

- e.g. `111.11.11.11`
- Optionally add the port to use, e.g. `111.11.11.11:5432`. `5432` is the default port.
### `--dbname_input`

(Default: `None`)

DB name for the input db, e.g. `rawdb`.
### `--dbformat_input`

(Default: `json`)

Format of the input db.

- Either `lbsn` or `json`.
- If `lbsn` is used, the native lbsn raw input mapping (`0`) will be used.
- If `json` is used, a custom mapping must be provided for mapping database jsons to the lbsn structure. See input mappings.
### `-t`, `--transferlimit`

(Default: `None`)

Abort after x records.

- This can be used to limit the number of records that will be processed.
- e.g. `--transferlimit 10000` will process the first 10000 input records.
- Defaults to `None` (= process all).
- Note that one input record can map to many output records. This number applies to the number of input records, not the output count.
### `--transfer_count`

(Default: `50000`)

Transfer batch limit x.

- Defines after how many parsed records the results will be transferred to the DB. Defaults to 50000.
- If you have a slow server but a fast machine, larger values improve speed, because the duplicate check happens in Python and not in Postgres coalesce.
- However, larger values require more local memory. If you have a fast server but a slow machine, try whether a smaller batch size (e.g. `--transfer_count 5000`) improves speed.

**Note:** Use `--transferlimit` to limit the total number of records transferred. `--transfer_count` instead defines the batch size that is used to transfer data incrementally.
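The batching behaviour controlled by `--transfer_count` can be illustrated with a small generator. This is a sketch of the principle, not lbsntransform's actual implementation:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(records: Iterable, transfer_count: int) -> Iterator[List]:
    """Group parsed records into batches of at most `transfer_count`
    before handing each batch to the database transfer step."""
    it = iter(records)
    while True:
        batch = list(islice(it, transfer_count))
        if not batch:
            return
        yield batch
```

With `transfer_count=3`, seven records would be transferred in three batches of sizes 3, 3, and 1; the last, smaller batch is flushed at the end.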
### `--commit_volume`

(Default: `None`)

After x commit_volume, changes (transactions) will be written to the output database (a Postgres COMMIT). Note that updated entries in the output database are only written from the WAL buffer after a commit.

- Default for rawdb: 10000
- Default for hlldb: 100000

**Warning:** If you have concurrent writes to the DB (e.g. multiple lbsntransform processes) and you see transaction deadlocks, reduce the commit_volume.
### `--records_tofetch`

(Default: `10000`)

Fetch x records per batch.

- If retrieving data from a db (`lbsn`), limit the number of records to fetch at once. Defaults to 10000.
### `--disable_transfer_reactions`

Disable reactions.

- If set, processing of lbsn reactions will be skipped; only original posts are transferred.
- This is useful to reduce processing and data footprint for some service data, e.g. for Twitter, with a large number of reactions containing little original content.
### `--disable_reaction_post_referencing`

Disable reaction references.

Enable this option to prevent empty posts being stored due to the Foreign-Key-Exists requirement.

Possible parameters:

- `0` = Save original tweets of retweets as `posts`
- `1` = Do not store original tweets of retweets
- `2` = Not implemented: store original tweets of retweets as `post_reactions`
### `--ignore_non_geotagged`

Ignore non-geotagged.

If set, posts that are not geotagged are ignored during processing.
### `--startwith_db_rownumber`

(Default: `None`)

Start with db row x.

If transferring from a database (input), this flag can be used to resume processing, e.g. if a transfer has been aborted.

- Provide a number (row-id) to start processing from the live db.
- If the input db type is `lbsn`, this is the primary key without the `origin_id` (e.g. the `post_guid`, `place_guid` etc.).
- This flag will only work if processing a single lbsn object (e.g. `--include_lbsn_objects "post"`).

Example: `--startwith_db_rownumber "123456789"` will lead to the first batch query from the DB looking like this:

```sql
SELECT * FROM topical."post"
WHERE post_guid > '123456789'
ORDER BY post_guid ASC
LIMIT 10000;
```
### `--endwith_db_rownumber`

(Default: `None`)

End with db row x.

Provide a number (row-id) to end processing from the live db.
### `--debug_mode`

(Default: `None`)

Enable debug mode.
### `--geocode_locations`

(Default: `None`)

Path to location geocodes.

- Provide a path to a CSV file with location geocodes.
- The CSV header must be: `lat, lng, name`.
- This can be used in mappings to assign coordinates (lat, lng) to locations provided as text.
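Reading such a geocode file with Python's `csv` module could look as follows. The file contents and the `load_geocodes` helper are hypothetical; only the required header `lat, lng, name` is taken from the description above:

```python
import csv
import io

# Hypothetical geocode file with the required header: lat, lng, name
GEOCODES_CSV = """lat, lng, name
51.05, 13.74, Dresden
48.14, 11.58, Munich
"""

def load_geocodes(fobj):
    """Map location names to (lat, lng) coordinate tuples."""
    # skipinitialspace strips the blanks after each comma
    reader = csv.DictReader(fobj, skipinitialspace=True)
    return {row["name"]: (float(row["lat"]), float(row["lng"]))
            for row in reader}

geocodes = load_geocodes(io.StringIO(GEOCODES_CSV))
# geocodes["Dresden"] == (51.05, 13.74)
```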
### `--ignore_input_source_list`

(Default: `None`)

Path to input ignore list.

Provide a path to a list of input_source types that will be ignored (e.g. to ignore certain bots etc.).
### `--mappings_path`

(Default: `None`)

Path to mappings folder.

Provide a path to a custom folder that contains one or more input mapping modules (`*.py`).

- Have a look at the two sample mappings in the resources folder.
- See how to define custom input mappings in the docs.
### `--input_lbsn_type`

(Default: `None`)

Input sub-type.

- e.g. `post`, `profile`, `friendslist`, `followerslist` etc.
- This can be used to select an appropriate mapping procedure within a single mapping module.
### `--map_full_relations`

Map full relations.

Set to true to map full relations: many-to-many relationships such as `user_follows`, `user_friend`, or `user_mentions` are mapped in a separate table. Defaults to `False`.
### `--csv_output`

Store to local CSV.

If set, all submitted values will be stored to local CSV files instead. Currently, this type of output is not available.
### `--csv_allow_linebreaks`

Disable linebreak removal.

If set, in-text linebreaks (`\r` or `\n`) will not be removed in output CSVs.
### `--csv_delimiter`

(Default: `,`)

CSV delimiter.

- Provide the CSV delimiter to be used. Default is comma (`,`).
- Note: to pass a tab, use shell quoting (`$'\t'`).
### `--use_csv_dictreader`

Use csv.DictReader.

By default, CSVs are read line by line using the standard `csv.reader()`. This flag enables `csv.DictReader()`, which allows accessing CSV fields by name in mappings. A CSV with a header is required for this setting to work. Note that `csv.DictReader()` may be slower than the default `csv.reader()`.
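The difference between the two readers can be shown with a tiny example. The column names `post_guid` and `latitude` are hypothetical, chosen only for illustration:

```python
import csv
import io

CSV_DATA = "post_guid,latitude\nabc123,51.05\n"

# Default behaviour: csv.reader() yields plain lists;
# fields must be accessed by positional index
rows = list(csv.reader(io.StringIO(CSV_DATA)))
header, first = rows[0], rows[1]
# first[0] == "abc123"

# With --use_csv_dictreader: csv.DictReader() consumes the header
# row and yields dicts, so fields can be accessed by name
dict_rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
# dict_rows[0]["post_guid"] == "abc123"
```

Accessing fields by name makes custom mappings robust against column reordering, at the cost of a slightly slower read.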
### `--recursive_load`

Recurse local sub directories.

If set, input directories are processed recursively (default depth: `2`).
### `--skip_until_file`

(Default: `None`)

Skip until file x.

For local input, skip all files until the file with name `x` appears (default: start immediately).
### `--skip_until_record`

(Default: `None`)

Skip until record x.

For local input, skip all records until record `x` (default: start with the first).
### `--zip_records`

Zip records in parallel.

- Use this flag to zip records of multiple input files.
- e.g. `List1[A,B,C]` and `List2[1,2,3]` will be combined (zipped) on read to `List[A1,B2,C3]`.
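The combination behaves like Python's built-in `zip()`. A minimal illustration, with two hypothetical input files represented as lists:

```python
# Two hypothetical input files, read as parallel record streams
file_a = ["A", "B", "C"]
file_b = ["1", "2", "3"]

# --zip_records combines them record by record, like zip():
combined = [a + b for a, b in zip(file_a, file_b)]
# combined == ["A1", "B2", "C3"]
```

This is useful when (for example) post records and their coordinates arrive in two parallel files, or when multiple URLs (separated by `;` in `--input_path_url`) deliver matching record streams.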
### `--min_geoaccuracy`

(Default: `None`)

Minimum geoaccuracy to use.

Set to `latlng`, `place`, or `city` to limit processing of records based on minimum geoaccuracy (default: no limit).
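The filter can be thought of as a precision ranking over the three levels. The ranking below (`latlng` most precise, then `place`, then `city`) is an assumption for illustration; `meets_min_geoaccuracy` is a hypothetical helper, not lbsntransform's actual code:

```python
# Assumed precision ranking, most to least precise
GEOACCURACY_RANK = {"latlng": 3, "place": 2, "city": 1}

def meets_min_geoaccuracy(record_accuracy: str,
                          min_geoaccuracy: str) -> bool:
    """Keep a record only if it is at least as precise as the minimum.
    Unknown/missing accuracy ranks lowest and is filtered out."""
    return (GEOACCURACY_RANK.get(record_accuracy, 0)
            >= GEOACCURACY_RANK[min_geoaccuracy])
```

Under this ranking, `--min_geoaccuracy place` would keep `latlng` and `place` records and drop `city`-level ones.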
### `--include_lbsn_objects`

(Default: `None`)

lbsn objects to process.

If processing from an lbsn db (`rawdb`), provide a comma-separated list of lbsn objects to include.

May contain:

- `origin`
- `country`
- `city`
- `place`
- `user_groups`
- `user`
- `post`
- `post_reaction`

Notes:

- Excluded objects will not be queried, but empty objects may be created due to referenced foreign key relationships.
- Defaults to `origin,post`.
### `--include_lbsn_bases`

(Default: `None`)

lbsn bases to update.

If the target output type is `hll`, provide a comma-separated list of lbsn bases to include/update/store to.

Currently supported:

- `hashtag`
- `emoji`
- `term`
- `_hashtag_latlng`
- `_term_latlng`
- `_emoji_latlng`
- `_month_hashtag`
- `_month_hashtag_latlng`
- `_month_latlng`
- `monthofyear`
- `month`
- `dayofmonth`
- `dayofweek`
- `hourofday`
- `year`
- `date`
- `timestamp`
- `country`
- `region`
- `city`
- `place`
- `latlng`
- `community`

Bases not included will be skipped. Per default, no bases will be considered.

Example: `--include_lbsn_bases hashtag,place,date,community`

This will update entries in the Postgres hlldb tables `social.community`, `topical.hashtag`, `spatial.place`, and `temporal.date`; non-existing entries will be created and existing ones will be updated (a `hll_union`).

See the structure definition in SQL for a full list of hlldb table structures.

This argument may only be supplied once.
### `--override_lbsn_query_schema`

(Default: `None`)

Override schema and table name.

This can be used to redirect lbsn queries on the given object from the input db to a specific schema/table, such as a materialized view. This can be useful (e.g.) to limit processing of input data to a specific query.

The format is `lbsn_type,schema.table`.

Example: `--override_lbsn_query_schema post,mviews.mypostquery`

This argument can be used multiple times.
### `--hmac_key`

(Default: `None`)

Override db hmac key.

The hmac key is used for cryptographic hashing during the creation of HLL sets. Use this parameter to override what is set in the hllworker database. Remember to re-use the same hmac key for any consecutive update of HLL sets.

The `crypt.salt` variable can also be set (temporarily or permanently) in the hll worker database itself:

```sql
ALTER DATABASE hllworkerdb SET crypt.salt = 'CRYPTSALT';
```

Further information is available in the YFCC HLL tutorial.
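Why the same key must be reused can be illustrated with Python's `hmac` module. This only demonstrates the principle of keyed hashing; the actual hashing in lbsntransform happens inside the HLL Worker database, and `hashed_id` is a hypothetical helper:

```python
import hashlib
import hmac

def hashed_id(value: str, hmac_key: str) -> str:
    """HMAC-hash an identifier with a secret key (illustration only)."""
    return hmac.new(hmac_key.encode(), value.encode(),
                    hashlib.sha256).hexdigest()

# The same key always yields the same hash, so a later run can add
# the same user to an existing HLL set without double counting ...
a = hashed_id("user123", "CRYPTSALT")
b = hashed_id("user123", "CRYPTSALT")

# ... while a different key produces an incompatible hash, which would
# make the same user look like a new, distinct element
c = hashed_id("user123", "OTHERSALT")
```

Because HLL sets store only these hashes, unioning sets created with different keys silently inflates the cardinality estimates.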