Pymarc

Release v4.1.0

Pymarc is a Python 3 library for working with bibliographic data encoded in MARC21.

Starting with version 4.0.0 it requires Python 3.6 and up. It provides an API for reading, writing and modifying MARC records. It was mostly designed to be an emergency eject seat, for getting your data assets out of MARC and into some kind of saner representation. However, over the years it has been used to create and modify MARC records, since despite repeated calls for it to die as a format, MARC seems to be living quite happily as a zombie.

Below are some common examples of how you might want to use pymarc. If you run across an example that you think should be here please send a pull request.

Reading

Most often you will have some MARC data and will want to extract data from it. Here’s an example of reading a batch of records and printing out the title. If you are curious, this example uses the batch file test/marc.dat from the pymarc repository:

from pymarc import MARCReader

with open('test/marc.dat', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        print(record.title())

The pragmatic programmer : from journeyman to master /
Programming Python /
Learning Python /
Python cookbook /
Python programming for the absolute beginner /
Web programming : techniques for integrating Python, Linux, Apache, and MySQL /
Python programming on Win32 /
Python programming : an introduction to computer science /
Python Web programming /
Core python programming /
Python and Tkinter programming /
Game programming with Python, Lua, and Ruby /
Python programming patterns /
Python programming with the Java class libraries : a tutorial for building Web
and Enterprise applications /
Learn to program using Python : a tutorial for hobbyists, self-starters, and all
who want to learn the art of computer programming /
Programming with Python /
BSD Sockets programming from a multi-language perspective /
Design patterns : elements of reusable object-oriented software /
Introduction to algorithms /
ANSI Common Lisp /

Sometimes MARC data contains errors of some kind. In this case the reader returns None instead of a record object, and two reader properties, current_exception and current_chunk, let you take corrective action and either continue or stop reading:

from pymarc import MARCReader
from pymarc import exceptions as exc

with open('test/marc.dat', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        if record:
            # consume the record:
            print(record.title())
        elif isinstance(reader.current_exception, exc.FatalReaderEror):
            # data file format error (the class name is spelled
            # FatalReaderEror in this release, see Exceptions below);
            # reader will raise StopIteration
            print(reader.current_exception)
            print(reader.current_chunk)
        else:
            # fix the record data, skip or stop reading:
            print(reader.current_exception)
            print(reader.current_chunk)
            # break/continue/raise

A fatal reader error (listed as FatalReaderEror in the Exceptions section) occurs when the reader can’t determine a record’s boundaries in the data stream. To avoid misinterpreting data, it stops. For other errors (wrong encoding, etc.) the reader continues to the next record.
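To see why a boundary problem stops the reader while a bad field does not, here is a minimal plain-Python sketch of splitting a stream of MARC21 transmission records. It is not pymarc's implementation. It assumes only two facts about the format: leader positions 00-04 hold the record length as five ASCII digits, and every record ends with the record terminator byte 0x1D. The name split_records is made up for this example.

```python
END_OF_RECORD = b"\x1d"  # MARC21 record terminator


def split_records(data: bytes):
    """Yield raw record chunks, using the 5-digit length at leader positions 00-04."""
    pos = 0
    while pos < len(data):
        length_bytes = data[pos:pos + 5]
        if len(length_bytes) < 5 or not length_bytes.isdigit():
            # Without a length there is no way to know where the next record starts.
            raise ValueError(f"cannot read record length at offset {pos}")
        length = int(length_bytes)
        chunk = data[pos:pos + length]
        if len(chunk) < length:
            raise ValueError(f"truncated record at offset {pos}")
        if not chunk.endswith(END_OF_RECORD):
            raise ValueError(f"no end-of-record marker at offset {pos}")
        yield chunk
        pos += length


# Two tiny fake "records": 5-digit length, some payload, terminator.
first = b"00010abcd\x1d"
second = b"00008xy\x1d"
for chunk in split_records(first + second):
    print(len(chunk), chunk)
```

Once the length or terminator is wrong, every later offset is suspect, which is why this class of error ends iteration. An error inside a correctly delimited chunk leaves the next offset intact, so reading can go on.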

A pymarc.Record object has a few handy methods like title for getting at bits of a bibliographic record, others include: author, isbn, subjects, location, notes, physicaldescription, publisher, pubyear. But really, to work with MARC data you need to understand the numeric field tags and subfield codes that are used to designate various bits of information. There is a lot more hiding in a MARC record than these methods provide access to. For example the title method extracts the information from the 245 field, subfields a and b. You can access 245a like so:

print(record['245']['a'])

Some fields like subjects can repeat. In cases like that you will want to use get_fields to get all of them as pymarc.Field objects, which you can then interact with further:

for f in record.get_fields('650'):
    print(f)

If you are new to MARC fields, Understanding MARC (http://www.loc.gov/marc/umb/) is a pretty good primer, and the MARC 21 Formats (http://www.loc.gov/marc/marcdocz.html) page at the Library of Congress is a good reference once you understand the basics.

Writing

Here’s an example of creating a record and writing it out to a file.

from pymarc import Record, Field

record = Record()
record.add_field(
    Field(
        tag = '245',
        indicators = ['0','1'],
        subfields = [
            'a', 'The pragmatic programmer : ',
            'b', 'from journeyman to master /',
            'c', 'Andrew Hunt, David Thomas.'
        ]))
with open('file.dat', 'wb') as out:
    out.write(record.as_marc())

Updating

Updating works the same way. You read a record in, modify it, and then write it out again:

from pymarc import MARCReader

with open('test/marc.dat', 'rb') as fh:
    reader = MARCReader(fh)
    record = next(reader)
    record['245']['a'] = 'The Zombie Programmer'
with open('file.dat', 'wb') as out:
    out.write(record.as_marc())

JSON and XML

If you find yourself using MARC data a fair bit, and distributing it, you may make other developers a bit happier by using the JSON or XML serializations. pymarc has support for both. The main benefit here is that the UTF-8 character encoding is used, rather than the frustratingly archaic MARC-8 encoding. Also they will be able to use JSON and XML tools to get at the data they want instead of some crazy MARC processing library like, ahem, pymarc.

API Docs

Reader

Pymarc Reader.

class pymarc.reader.JSONReader(marc_target: Union[bytes, str], encoding: str = 'utf-8', stream: bool = False)[source]

Bases: pymarc.reader.Reader

JSON Reader.

class pymarc.reader.MARCReader(marc_target: Union[BinaryIO, bytes], to_unicode: bool = True, force_utf8: bool = False, hide_utf8_warnings: bool = False, utf8_handling: str = 'strict', file_encoding: str = 'iso8859-1', permissive: bool = False)[source]

Bases: pymarc.reader.Reader

An iterator class for reading a file of MARC21 records.

Simple usage:

from pymarc import MARCReader

## pass in a file object
reader = MARCReader(open('file.dat', 'rb'))
for record in reader:
    ...

## pass in marc in transmission format
reader = MARCReader(rawmarc)
for record in reader:
    ...

If you would like to have your Record object contain unicode strings use the to_unicode parameter:

reader = MARCReader(open('file.dat', 'rb'), to_unicode=True)

This will decode from MARC-8 or UTF-8 depending on the value in the MARC leader at position 9.
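As a rough illustration of that check (a sketch, not pymarc's code): in MARC21, leader position 09 is 'a' for UCS/Unicode records and blank for MARC-8. The helper name declared_encoding is made up for this example.

```python
def declared_encoding(leader: str) -> str:
    """Return the character coding scheme that a MARC21 leader declares."""
    # Leader position 09: 'a' means UCS/Unicode (UTF-8); a blank means MARC-8.
    return "utf-8" if leader[9] == "a" else "marc-8"


print(declared_encoding("00475cas a2200169 i 4500"))  # utf-8
print(declared_encoding("00475cas  2200169 i 4500"))  # marc-8
```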

If you find yourself in the unfortunate position of having data that is utf-8 encoded without the leader set appropriately you can use the force_utf8 parameter:

reader = MARCReader(open('file.dat', 'rb'), to_unicode=True,
    force_utf8=True)

If you find yourself in the unfortunate position of having data that is mostly utf-8 encoded but with a few non-utf-8 characters, you can also use the utf8_handling parameter, which takes the same values (‘strict’, ‘replace’, and ‘ignore’) as the Python Unicode codecs (see http://docs.python.org/library/codecs.html for more info).
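The three handler names behave the same way they do in plain Python decoding. The sketch below uses only bytes.decode from the standard library, with a made-up byte string, to show what each setting does to one invalid byte. It does not call pymarc.

```python
raw = b"caf\xc3\xa9 \xff"  # UTF-8 for "café " followed by one invalid byte

try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as err:
    # 'strict' refuses the whole string as soon as it meets the bad byte.
    print("strict failed:", err.reason)

# 'replace' swaps the bad byte for U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

print(len(replaced), len(ignored))
```

'replace' keeps a visible marker where data was lost, which is usually easier to audit later than 'ignore'.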

Strictly speaking, MARC-21 permits only MARC-8 or UTF-8. If you have a file in some other encoding and you know what it is, you can try passing that encoding in the file_encoding parameter.

MARCReader parses data permissively and gives the user full control over what to do when a bad record is encountered. Whenever an error is found, the reader returns None instead of a regular record object. The exception and the corresponding data are available through the reader.current_exception and reader.current_chunk properties:

reader = MARCReader(open('file.dat', 'rb'))
for record in reader:
    if record is None:
        print(
            "Current chunk: ",
            reader.current_chunk,
            " was ignored because the following exception raised: ",
            reader.current_exception
        )
    else:
        # do something with record
        ...

close() → None[source]

Close the handle.

current_chunk

Current chunk.

current_exception

Current exception.

class pymarc.reader.Reader[source]

Bases: object

A base class for all iterating readers in the pymarc package.

pymarc.reader.map_records(f: Callable, *files) → None[source]

Applies a given function to each record in a batch.

You can pass in multiple batches.

from pymarc.reader import map_records

def print_title(r):
    print(r['245'])

map_records(print_title, open('marc.dat', 'rb'))

Record

Pymarc Record.

class pymarc.record.Record(data='', to_unicode=True, force_utf8=False, hide_utf8_warnings=False, utf8_handling='strict', leader=' ', file_encoding='iso8859-1')[source]

Bases: object

A class for representing a MARC record.

Each Record object is made up of multiple Field objects. You’ll probably want to look at the docs for Field to see how to fully use a Record object.

Basic usage:

from pymarc import Record, Field

field = Field(
    tag = '245',
    indicators = ['0','1'],
    subfields = [
        'a', 'The pragmatic programmer : ',
        'b', 'from journeyman to master /',
        'c', 'Andrew Hunt, David Thomas.',
    ])

record = Record()
record.add_field(field)

Or creating a record from a chunk of MARC in transmission format:

record = Record(data=chunk)

Or getting a record as serialized MARC21:

raw = record.as_marc()

You’ll normally want to use a MARCReader object to iterate through MARC records in a file.

add_field(*fields)[source]

Add pymarc.Field objects to a Record object.

Optionally you can pass in multiple fields.

add_grouped_field(*fields)[source]

Add pymarc.Field objects to a Record object and sort them “grouped”.

That is, it attempts to maintain a loose numeric order per the MARC standard for “Organization of the record” (http://www.loc.gov/marc/96principl.html). Optionally you can pass in multiple fields.

add_ordered_field(*fields)[source]

Add pymarc.Field objects to a Record object and sort them “ordered”.

That is, it attempts to maintain a strict numeric order. Optionally you can pass in multiple fields.
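The difference between the two orderings is easier to see on a bare list of tags. The following is a plain-Python sketch, not pymarc's code, and it rests on an assumption: that “grouped” compares only the first digit of each tag (the 1xx, 2xx, ... block), while “ordered” compares the whole tag number. The function insert_tag is hypothetical.

```python
def insert_tag(tags, new_tag, grouped=False):
    """Return a new list with new_tag placed before the first tag that sorts after it."""
    # Assumption: "grouped" looks at the first digit only, "ordered" at the full number.
    key = (lambda tag: int(tag[0])) if grouped else int
    for i, existing in enumerate(tags):
        if key(existing) > key(new_tag):
            return tags[:i] + [new_tag] + tags[i:]
    return tags + [new_tag]


tags = ["100", "245", "260", "650"]
print(insert_tag(tags, "250"))                # ['100', '245', '250', '260', '650']
print(insert_tag(tags, "250", grouped=True))  # ['100', '245', '260', '250', '650']
```

Under that assumption, a grouped insert lands at the end of its block rather than in exact numeric position, which preserves whatever order a cataloger gave the existing fields within the block.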

addedentries()[source]

Returns Added entries fields.

Note: Fields 790-799 are considered “local” added entry fields but occur with some frequency in OCLC and RLIN records.

as_dict()[source]

Turn a MARC record into a dictionary, which is used for as_json.

as_json(**kwargs)[source]

Serialize a record as JSON.

See: http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
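For orientation, here is a hand-built dictionary in the shape the linked proposal describes: a leader string plus a list of single-key field objects, where control fields map to a string and data fields map to indicators and a list of single-key subfield objects. This is a sketch using only the standard json module. The exact output of as_dict() may differ in detail, and the sample values are taken from examples elsewhere in this document.

```python
import json

record_dict = {
    "leader": "00475cas a2200169 i 4500",
    "fields": [
        # Control field: tag maps straight to its data.
        {"001": "fol05731351"},
        # Data field: tag maps to indicators plus an ordered list of subfields.
        {
            "245": {
                "ind1": "0",
                "ind2": "1",
                "subfields": [
                    {"a": "The pragmatic programmer : "},
                    {"b": "from journeyman to master /"},
                ],
            }
        },
    ],
}

text = json.dumps(record_dict)  # what you would ship to other developers
restored = json.loads(text)     # what they get back with any JSON library

title_field = restored["fields"][1]["245"]
title = "".join(value for sf in title_field["subfields"] for value in sf.values())
print(title)
```

Subfields are a list of one-key objects rather than a single object because subfield codes can repeat and their order matters.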

as_marc()[source]

Returns the record serialized as MARC21.

as_marc21()

Returns the record serialized as MARC21.

author()[source]

Returns the author from field 100, 110 or 111.

decode_marc(marc, to_unicode=True, force_utf8=False, hide_utf8_warnings=False, utf8_handling='strict', encoding='iso8859-1')[source]

Populate the object based on the marc record in transmission format.

The Record constructor actually uses decode_marc() behind the scenes when you pass in a chunk of MARC data to it.

get_fields(*args)[source]

Return a list of all the fields in a record with tags matching args.

title = record.get_fields('245')

If no fields with the specified tag are found then an empty list is returned. If you are interested in more than one tag you can pass it as multiple arguments.

subjects = record.get_fields('600', '610', '650')

If no tag is passed in to get_fields() a list of all the fields will be returned.

isbn()[source]

Returns the first ISBN in the record or None if one is not present.

The returned ISBN will be all numeric, except for an x/X which may occur in the checksum position. Dashes and extraneous information will be automatically removed. If you need this information you’ll want to look directly at the 020 field, e.g. record['020']['a']
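To make “dashes and extraneous information removed” concrete, here is a small standard-library sketch of that kind of clean-up. It is an illustration under assumptions, not the code behind isbn(): the regular expression and the name normalize_isbn are invented for this example.

```python
import re
from typing import Optional

# A run that starts with a digit, may contain digits and dashes,
# and ends with a digit or an x/X check character.
ISBN_PATTERN = re.compile(r"[0-9][0-9\-]*[0-9xX]")


def normalize_isbn(raw: str) -> Optional[str]:
    """Return the first ISBN-like run in raw with dashes stripped, or None."""
    match = ISBN_PATTERN.search(raw)
    if match is None:
        return None
    return match.group(0).replace("-", "")


print(normalize_isbn("0-13-110362-8 (pbk.)"))  # 0131103628
print(normalize_isbn("080442957X"))            # 080442957X
print(normalize_isbn("no number here"))        # None
```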

issn()[source]

Returns the ISSN (022 $a) in the record, or None if one is not present.

issn_title()[source]

Returns the key title of the record (222 $a and $b).

location()[source]

Returns location field (852).

notes()[source]

Return notes fields (all 5xx fields).

physicaldescription()[source]

Return physical description fields (300).

publisher()[source]

Return publisher from 260 or 264.

Note: 264 field with second indicator ‘1’ indicates publisher.

pubyear()[source]

Returns publication year from 260 or 264.

remove_field(*fields)[source]

Remove one or more pymarc.Field objects from a Record object.

remove_fields(*tags)[source]

Remove all the fields with the tags passed to the function.

# remove all the fields marked with tags '200' or '899'.
record.remove_fields('200', '899')

series()[source]

Returns series fields.

Note: 490 supersedes the 440 series statement which was both series statement and added entry. 8XX fields are added entries.

subjects()[source]

Returns subjects fields.

Note: Fields 690-699 are considered “local” subject fields but occur with some frequency in OCLC and RLIN records.

sudoc()[source]

Returns a SuDoc classification number.

Returns a Superintendent of Documents (SuDoc) classification number held in the 086 MARC tag. Classification number will be made up of a variety of dashes, dots, slashes, and colons. More information can be found at the following URL: https://www.fdlp.gov/file-repository/gpo-cataloging/1172-gpo-classification-manual

title()[source]

Returns the title of the record (245 $a and $b).

uniformtitle()[source]

Returns the uniform title from field 130 or 240.

pymarc.record.map_marc8_record(record)[source]

Map MARC-8 record.

pymarc.record.normalize_subfield_code(subfield)[source]

Normalize subfield code.

Writer

Pymarc Writer.

class pymarc.writer.JSONWriter(file_handle: IO)[source]

Bases: pymarc.writer.Writer

A class for writing records as an array of MARC-in-JSON objects.

IMPORTANT: You must close a JSONWriter, otherwise you will not get valid JSON.

Simple usage:

from io import StringIO
from pymarc import JSONWriter

# writing to a file
writer = JSONWriter(open('file.json', 'wt'))
writer.write(record)
writer.close()  # Important!

# writing to a string
string = StringIO()
writer = JSONWriter(string)
writer.write(record)
writer.close(close_fh=False)  # Important!
print(string.getvalue())

close(close_fh: bool = True) → None[source]

Closes the writer.

If close_fh is True (the default), close will also close the underlying file handle that was passed in to the constructor.

write(record: pymarc.record.Record) → None[source]

Writes a record.

class pymarc.writer.MARCWriter(file_handle: IO)[source]

Bases: pymarc.writer.Writer

A class for writing MARC21 records in transmission format.

Simple usage:

from io import BytesIO
from pymarc import MARCWriter

# writing to a file
writer = MARCWriter(open('file.dat', 'wb'))
writer.write(record)
writer.close()

# writing to memory
memory = BytesIO()
writer = MARCWriter(memory)
writer.write(record)
writer.close(close_fh=False)

write(record: pymarc.record.Record) → None[source]

Writes a record.

class pymarc.writer.TextWriter(file_handle: IO)[source]

Bases: pymarc.writer.Writer

A class for writing records in prettified text MARCMaker format.

A blank line separates each record.

Simple usage:

from io import StringIO
from pymarc import TextWriter

# writing to a file
writer = TextWriter(open('file.txt','wt'))
writer.write(record)
writer.close()

# writing to a string
string = StringIO()
writer = TextWriter(string)
writer.write(record)
writer.close(close_fh=False)
print(string.getvalue())

write(record: pymarc.record.Record) → None[source]

Writes a record.

class pymarc.writer.Writer(file_handle: IO)[source]

Bases: object

Base Writer object.

close(close_fh: bool = True) → None[source]

Closes the writer.

If close_fh is True (the default), close will also close the underlying file handle that was passed in to the constructor.

write(record: pymarc.record.Record) → None[source]

Write.

class pymarc.writer.XMLWriter(file_handle: IO)[source]

Bases: pymarc.writer.Writer

A class for writing records as a MARCXML collection.

IMPORTANT: You must close an XMLWriter, otherwise you will not get a valid XML document.

Simple usage:

from io import BytesIO
from pymarc import XMLWriter

# writing to a file
writer = XMLWriter(open('file.xml','wb'))
writer.write(record)
writer.close()  # Important!

# writing to memory
memory = BytesIO()
writer = XMLWriter(memory)
writer.write(record)
writer.close(close_fh=False)  # Important!

close(close_fh: bool = True) → None[source]

Closes the writer.

If close_fh is True (the default), close will also close the underlying file handle that was passed in to the constructor.

write(record: pymarc.record.Record) → None[source]

Writes a record.

Field

The pymarc.field file.

class pymarc.field.Field(tag: str, indicators: Optional[List[str]] = None, subfields: Optional[List[str]] = None, data: str = '')[source]

Bases: object

To create a Field, pass in the field tag, indicators and subfields for the tag.

field = Field(
    tag = '245',
    indicators = ['0','1'],
    subfields = [
        'a', 'The pragmatic programmer : ',
        'b', 'from journeyman to master /',
        'c', 'Andrew Hunt, David Thomas.',
    ])

If you want to create a control field, don’t pass in the indicators and use a data parameter rather than a subfields parameter:

field = Field(tag='001', data='fol05731351')

add_subfield(code: str, value: str, pos=None) → None[source]

Adds a subfield code/value to the end of a field or at a position (pos).

field.add_subfield('u', 'http://www.loc.gov')
field.add_subfield('u', 'http://www.loc.gov', 0)

If pos is not supplied or out of range, the subfield will be added at the end.

as_marc(encoding: str) → bytes[source]

Used during conversion of a field to raw marc.

as_marc21(encoding: str) → bytes

Used during conversion of a field to raw marc.

delete_subfield(code: str) → Optional[str][source]

Deletes the first subfield with the specified ‘code’ and returns its value.

value = field.delete_subfield('a')

If no subfield is found with the specified code None is returned.

format_field() → str[source]

Returns the field as a string without tag, indicators, and subfield indicators.

Like Field.value(), but prettier (adds spaces, formats subject headings).

get_subfields(*codes) → List[str][source]

Get subfields matching codes.

get_subfields() accepts one or more subfield codes and returns a list of subfield values. The order of the subfield values in the list will be the order that they appear in the field.

print(field.get_subfields('a'))
print(field.get_subfields('a', 'b', 'z'))

indicator1

Indicator 1.

indicator2

Indicator 2.

is_control_field() → bool[source]

Returns True if the field is considered a control field, False otherwise.

Control fields lack indicators and subfields.

is_subject_field() → bool[source]

Returns True if the field is considered a subject field, False otherwise.

Used by format_field().

subfields_as_dict() → DefaultDict[str, list][source]

Returns the subfields as a dictionary.

The dictionary is a mapping of subfield codes and values. Since subfield codes can repeat the values are a list.
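A plain-Python sketch of that mapping, assuming subfields are held as the flat list of alternating codes and values used in the Field examples above. This is an illustration rather than the library's implementation, and the function name subfields_to_dict is made up.

```python
from collections import defaultdict


def subfields_to_dict(subfields):
    """Group a flat [code, value, code, value, ...] list into {code: [values]}."""
    result = defaultdict(list)
    # Even positions are codes, odd positions are the matching values.
    for code, value in zip(subfields[0::2], subfields[1::2]):
        result[code].append(value)
    return result


subject = ["a", "Python (Computer program language)", "v", "Handbooks", "v", "Manuals"]
grouped = subfields_to_dict(subject)
print(grouped["v"])  # ['Handbooks', 'Manuals']
print(grouped["a"])  # ['Python (Computer program language)']
```

Note that grouping by code loses the relative order of subfields with different codes, so keep the original list when order matters.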

value() → str[source]

Returns the field as a string without tag, indicators, and subfield indicators.

class pymarc.field.RawField(tag: str, indicators: Optional[List[str]] = None, subfields: Optional[List[str]] = None, data: str = '')[source]

Bases: pymarc.field.Field

MARC field that keeps data in raw, undecoded byte strings.

Should only be used when input records are wrongly encoded.

as_marc(encoding: Optional[str] = None)[source]

Used during conversion of a field to raw marc.

pymarc.field.map_marc8_field(f: pymarc.field.Field) → pymarc.field.Field[source]

Map MARC8 field.

Exceptions

Exceptions for pymarc.

exception pymarc.exceptions.BadLeaderValue[source]

Bases: pymarc.exceptions.PymarcException

Error when setting a leader value.

exception pymarc.exceptions.BadSubfieldCodeWarning[source]

Bases: Warning

Warning about a non-ASCII subfield code.

exception pymarc.exceptions.BaseAddressInvalid[source]

Bases: pymarc.exceptions.PymarcException

Base address exceeds size of record.

exception pymarc.exceptions.BaseAddressNotFound[source]

Bases: pymarc.exceptions.PymarcException

Unable to locate base address of record.

exception pymarc.exceptions.EndOfRecordNotFound[source]

Bases: pymarc.exceptions.FatalReaderEror

Unable to locate end of record marker.

exception pymarc.exceptions.FatalReaderEror[source]

Bases: pymarc.exceptions.PymarcException

Error preventing further reading.

exception pymarc.exceptions.FieldNotFound[source]

Bases: pymarc.exceptions.PymarcException

Record does not contain the specified field.

exception pymarc.exceptions.NoActiveFile[source]

Bases: pymarc.exceptions.PymarcException

There is no active file to write to in call to write.

exception pymarc.exceptions.NoFieldsFound[source]

Bases: pymarc.exceptions.PymarcException

Unable to locate fields in record data.

exception pymarc.exceptions.PymarcException[source]

Bases: Exception

Base pymarc Exception.

exception pymarc.exceptions.RecordDirectoryInvalid[source]

Bases: pymarc.exceptions.PymarcException

Invalid directory.

exception pymarc.exceptions.RecordLeaderInvalid[source]

Bases: pymarc.exceptions.PymarcException

Unable to extract record leader.

exception pymarc.exceptions.RecordLengthInvalid[source]

Bases: pymarc.exceptions.FatalReaderEror

Invalid record length.

exception pymarc.exceptions.TruncatedRecord[source]

Bases: pymarc.exceptions.FatalReaderEror

Truncated record data.

exception pymarc.exceptions.WriteNeedsRecord[source]

Bases: pymarc.exceptions.PymarcException

Write requires a pymarc.Record object as an argument.

MarcXML

From XML to MARC21 and back again.

class pymarc.marcxml.XmlHandler(strict=False, normalize_form=None)[source]

Bases: xml.sax.handler.ContentHandler

XML Handler.

You can subclass XmlHandler and add your own process_record method that’ll be passed a pymarc.Record as it becomes available. This could be useful if you want to stream the records elsewhere (for example to an RDBMS) without having to store them all in memory.

characters(chars)[source]

Append chars to _text.

endElementNS(name, qname)[source]

End element NS.

process_record(record)[source]

Append record to records.

startElementNS(name, qname, attrs)[source]

Start element NS.

pymarc.marcxml.map_xml(function, *files)[source]

Map a function onto the file.

For each record that is parsed, the function is called with the extracted record:

from pymarc.marcxml import map_xml

def do_it(r):
    print(r)

map_xml(do_it, 'marc.xml')

pymarc.marcxml.parse_xml(xml_file, handler)[source]

Parse a file with a given subclass of xml.sax.handler.ContentHandler.

pymarc.marcxml.parse_xml_to_array(xml_file, strict=False, normalize_form=None)[source]

Parse an XML file and return the records as an array.

Instead of passing in a file path you can also pass in an open file handle, or a file-like object such as StringIO. If you would like the parser to explicitly check the namespaces for the MARCSlim namespace use the strict=True option. Valid values for normalize_form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. See unicodedata.normalize for more info on these.
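If the four form names are unfamiliar, this standard-library sketch shows what they do to a single accented letter and to a ligature. It calls unicodedata.normalize directly and does not involve the XML parser.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point
combining = "e\u0301"    # e followed by a combining acute accent

# NFC composes, NFD decomposes; both render as the same visible character.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)
print(len(nfc), len(nfd))  # 1 2

# The K ("compatibility") forms also fold look-alike characters such as ligatures.
ligature = "\ufb01"      # the single-character "fi" ligature
nfkc = unicodedata.normalize("NFKC", ligature)
print(nfkc == "fi", unicodedata.normalize("NFC", ligature) == ligature)  # True True
```

Picking one form for a whole collection matters mostly when you compare or index strings, since the same visible title can otherwise fail an equality test.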

pymarc.marcxml.record_to_xml(record, quiet=False, namespace=False)[source]

From MARC to XML.

pymarc.marcxml.record_to_xml_node(record, quiet=False, namespace=False)[source]

Converts a record object to a chunk of XML.

If you would like to include the marcxml namespace in the root tag set namespace to True.

Constants

Constants for pymarc.

MARC-8

Handle MARC-8 files.

see http://www.loc.gov/marc/specifications/speccharmarc8.html

class pymarc.marc8.MARC8ToUnicode(G0: int = 66, G1: int = 69, quiet: bool = False)[source]

Bases: object

Converts MARC-8 to Unicode.

Note that currently, unicode strings aren’t normalized, and some codecs (e.g. iso8859-1) will fail on such strings. When I can require python 2.3, this will go away.

Warning: MARC-8 EACC (East Asian characters) makes some distinctions which aren’t captured in Unicode. The LC tables give the option of mapping such characters either to a Unicode private use area, or a substitute character which (usually) gives the sense. I’ve picked the second, so this means that the MARC data should be treated as primary and the Unicode data used for display purposes only. (If you know either of fonts designed for use with LC’s private-use Unicode assignments, or of attempts to standardize Unicode characters to allow round-trips from EACC, or if you need the private-use Unicode character translations, please inform me, asl2@pobox.com.)

ansel = 69

basic_latin = 66

translate(marc8_string)[source]

Translate.

pymarc.marc8.marc8_to_unicode(marc8, hide_utf8_warnings: bool = False) → str[source]

Pass in a string, and get back a Unicode object.

print(marc8_to_unicode(record.title()))

MARC-8 mapping

MARC-8 mapping.

Leader

The pymarc.leader file.

class pymarc.leader.Leader(leader: str)[source]

Bases: object

Mutable leader.

A class to manipulate a Record’s leader.

You can use the properties (record_status, bibliographic_level, etc.) or their slices/index equivalent (leader[5], leader[7], etc.) to read and write values.

See LoC’s documentation for more information about those fields.

from pymarc.leader import Leader

leader = Leader("00475cas a2200169 i 4500")
leader[0:5]  # returns "00475"
leader.record_status  # returns "c"
leader.record_status = "a"  # sets the status to "a"
leader[5]  # returns the status "a"
leader[5] = "b"  # sets the status to "b"
str(leader)  # "00475bas a2200169 i 4500"

Usually the leader is accessed through the leader property of a record.

from pymarc import MARCReader
with open('test/marc.dat', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        print(record.leader)

When creating/updating a Record please note that record_length and base_address will only be generated in the MARC21 output of record.as_marc().

base_address

Base address of data (12-16).

bibliographic_level

Bibliographic level (07).

cataloging_form

Descriptive cataloging form (18).

coding_scheme

Character coding scheme (09).

encoding_level

Encoding level (17).

implementation_defined_length

Length of the implementation-defined portion (22).

indicator_count

Indicator count (10).

length_of_field_length

Length of the length-of-field portion (20).

multipart_ressource

Multipart resource record level (19).

record_length

Record length (00-04).

record_status

Record status (05).

starting_character_position_length

Length of the starting-character-position portion (21).

subfield_code_count

Subfield code count (11).

type_of_control

Type of control (08).

type_of_record

Type of record (06).
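The positions listed above can be read straight off a 24-character leader string. The sketch below does that with plain slicing, as a way to check your understanding of the layout. It is not the Leader class itself: the table and the function describe_leader are written for this example, with keys borrowed from the property names above.

```python
# Leader positions as listed above, expressed as Python slices (end is exclusive).
LEADER_POSITIONS = {
    "record_length": slice(0, 5),
    "record_status": slice(5, 6),
    "type_of_record": slice(6, 7),
    "bibliographic_level": slice(7, 8),
    "type_of_control": slice(8, 9),
    "coding_scheme": slice(9, 10),
    "indicator_count": slice(10, 11),
    "subfield_code_count": slice(11, 12),
    "base_address": slice(12, 17),
    "encoding_level": slice(17, 18),
    "cataloging_form": slice(18, 19),
    "multipart_ressource": slice(19, 20),
    "length_of_field_length": slice(20, 21),
    "starting_character_position_length": slice(21, 22),
    "implementation_defined_length": slice(22, 23),
}


def describe_leader(leader: str) -> dict:
    """Split a 24-character MARC21 leader into its named parts."""
    if len(leader) != 24:
        raise ValueError("a MARC21 leader is exactly 24 characters long")
    return {name: leader[pos] for name, pos in LEADER_POSITIONS.items()}


parts = describe_leader("00475cas a2200169 i 4500")
print(parts["record_length"], parts["record_status"], parts["base_address"])  # 00475 c 00169
```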
